Recently, I’ve been working on a project that scrapes Reddit looking for links to products on Amazon. Basically the idea being that there’s valuable info in what people are linking to and talking about online, and a starting point would be looking for links to Amazon products on Reddit. And the result of that work turned into Product Mentions.
To build this, and I can talk more about this later, I have two parts. First being a basic Rails app that displays the products and where they’re talked about, and the second being a Python app that does the scraping, and also displays the scraping logs for me using Flask. I thought of just combining the two functionalities at first, but decided It was easier in both regards to separate the two functionalities. The scraper populates the database, and the Rails app displays what’s in there. I hosted the Rails app on Heroku, and after some poking around, decided to also run the Python scraper on Heroku as well (for now at least!)
Also, if at this point, you’re thinking to yourself, “why the hell is he using an overpriced, web app hosting service like Heroku when there are so many other options available?” you’re probably half right, but in terms of ease of getting started, Heroku was by far the easiest PaaS to get this churning. Heroku is nice, and this set up is really simple, especially compared to some of the other PaaS options out there that require more configuration. You can definitely look for different options if you’re doing a more full web crawl, but this’ll work for a lot of purposes.
So what I’m going to describe here today, is how I went about running the scrapers on Heroku as background jobs, using clock and worker processes. I’ll also talk a little about what’s going on so it makes a little more sense than those copy paste tutorials I see a lot (though that type of tutorial from Heroku’s docs is what I used here, so I can’t trash them too badly!).
First file you’re going to need here is a worker file, which will perform the function that it sees coming off a queue. For ease, I’ll name this
worker.py file. This will connect to Redis, and just wait for a job to be put on the queue, and then run whatever it sees. First, we need
rq the library that deals with Redis in the background (all of this is assuming you’re in a virtualenv
$ pip install rq $ pip freeze > requirements.txt
This is the only external library you’re going to need for a functioning
worker.py file, as specified by the nice Heroku doc. This imports the required objects from rq, connects to Redis using either an environment variable (that would be set in a production / Heroku environment), creates a worker, and then calls work. So in the end, running
python worker.py will just sit there waiting to get jobs to run, in this case, scraping Reddit. We also have ‘high’ ‘default’ and ‘low’ job types, so the queue will know which ones to run first, but we aren’t going to need that here.
import os import redis from rq import Worker, Queue, Connection listen = ['high', 'default', 'low'] redis_url = os.getenv('REDISTOGO_URL', 'redis://localhost:6379') conn = redis.from_url(redis_url) if __name__ == '__main__': with Connection(conn): worker = Worker(map(Queue, listen)) worker.work()
Now that we have the worker set up, here’s the
clock.py file that I’m using to do the scraping. Here, it imports the
conn variable from the
worker.py file, uses that to make sure we’re connected to the same Redis queue. We also import the functions that use the scrapers from
run.py, and in this file, create functions that will enqueue the respective functions. Then we use
apscheduler to schedule when we want to call these functions, and then start the scheduler. If we run
python clock.py, we scheduler will run in perpetuity (hopefully), and then will call the correct code on the intervals we defined.