I was given the task of building a scalable web scraper to harvest connections between domains of a specific industry in order to generate a network model of that industry’s online presence. The scraper will need to run for a period of time (say, a week), and the resulting harvest will be the raw data from which the model is generated. This harvesting process will need to be repeatable to create ‘snapshots’ of the network for future longitudinal analysis. The implementation will use scrapy with a MongoDB back end on a Linux platform, to be run (ultimately) in AWS.
This post (and subsequent ones) will provide some code samples and documentation of issues faced during the process of getting the scraper operational. The intent is to help bridge the gap between the initial scrapy tutorials and real-world code. The examples assume you have scrapy installed and running, and have at least worked through the basic tutorials. I did find many of the tutorials on the wiki very helpful and worked through several of them (multiple times). Get a basic crawl spider up and running and then pick up here.
In this example, I hope to demonstrate the following scrapy features:
-Adding a spider parameter and using it from the command line
-Getting current crawl depth and the referring url
-Setting crawl depth limits
The code samples below are from a scrapy project named farm1. The item definition contains the following fields:
session_id: a unique session id for each scrapy run or harvest
depth: the depth of the current page with respect to the start url
current_url: the url of the current page being processed
referring_url: the url of the site which was linked to the current page
The crawl spider:
The spider uses the SgmlLinkExtractor and follows every link (a later post will cover filtering which links to follow).
Adding a spider parameter and using it from the command line
The spider’s constructor takes a session_id parameter with a default assignment. The session id is assigned to the spider and persisted with each item processed, and will be used to identify items from different harvest sessions. Defining the parameter in the constructor allows setting it from the command line:
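For example (the session id value here is just illustrative):

```shell
scrapy crawl farm1 -a session_id=337
```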
Getting current crawl depth and the referring url
The parse callback is where the current depth of the crawl is retrieved, via response.meta. The ‘depth’ keyword is one of several predefined keys in the meta dictionary of the request and response objects. The depth value will be used in the analysis phase.
The referring url is retrieved from the request’s ‘Referer’ header. The goal of this project was harvesting connections between domains, not just data from individual pages. For each item processed, the connection between two links, referring_url –> current_url, is stored. Implied in each connection is the directionality of the link, which will help build a directed graph for analysis. Referer population is enabled by default in the default middleware settings (and note that ‘Referer’ is the correct spelling for the header).
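Down the line, those stored pairs are exactly the edge list of a directed graph. A rough sketch of the analysis-phase idea, using networkx and made-up urls (neither is part of the scraper itself):

```python
import networkx as nx

# each harvested item contributes one directed edge: referring_url -> current_url
harvested = [
    ('http://a.example/', 'http://b.example/'),
    ('http://a.example/', 'http://c.example/'),
    ('http://b.example/', 'http://c.example/'),
]

G = nx.DiGraph()
G.add_edges_from(harvested)

# in-degree hints at which domains attract the most inbound links
print(G.in_degree('http://c.example/'))  # -> 2
```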
Setting crawl depth limits
Rather than using ctrl-c to kill the crawling spider, you can set a depth limit beyond which the spider will not go. I found this helpful during testing. DEPTH_LIMIT is a predefined setting that can be assigned in your settings file. This was the only additional setting used in this example other than the defaults created with the project.
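In the project’s settings file that looks like this (the value 5 is just an example):

```python
# farm1/settings.py
DEPTH_LIMIT = 5  # stop following links more than 5 hops from the start url
```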
The depth limit can also be set on the command line (as can all predefined settings):
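For example, using scrapy’s -s flag (again with an illustrative value):

```shell
scrapy crawl farm1 -s DEPTH_LIMIT=5
```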
The command-line assignment takes priority over the settings file. Note that the depth limit will have little effect if you are running a broad crawl. Ultimately, when this scraper is released into the wild, it will probably be set to run a broad crawl, but it will still need to troll deep to really pull out much of a domain’s public connections. As to how this mix of broad and deep crawls will be handled, I am not sure, but it will be documented here once it is figured out.