Introduction
Web crawling is a common technique for efficiently collecting information from across the web. As an introduction to web crawling, this project uses Scrapy, a free and open-source web crawling framework written in Python [1]. Originally designed for web scraping, it can also be used to extract data through APIs or as a general-purpose web crawler. Even though Scrapy provides a comprehensive infrastructure for web crawling, real applications bring additional challenges, e.g., dynamic JavaScript content or your IP being blocked.
The project consists of three parts, each extending the previous one. The end goal is a Scrapy project that can crawl tens of thousands of apps from the Xiaomi AppStore, or any other app store with which you are familiar.
Project Description
First stage
Create a Scrapy project to crawl the content of the Xiaomi AppStore homepage, or the homepage of any other app store. A minimal sketch of such a spider is shown below.
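The following is a minimal sketch of a first-stage spider. The homepage URL and the CSS selectors (app list items, app name, app link) are assumptions for illustration; inspect the real page markup and adjust them accordingly.

    # spiders/appstore.py -- a minimal first-stage spider (selectors are placeholders)
    import scrapy


    class AppStoreSpider(scrapy.Spider):
        name = 'appstore'
        start_urls = ['http://app.mi.com/']  # Xiaomi AppStore homepage (assumed URL)

        def parse(self, response):
            # Placeholder selectors: replace with the ones that match the actual page.
            for app in response.css('ul.applist li'):
                yield {
                    'name': app.css('h5 a::text').extract_first(),
                    'link': response.urljoin(app.css('h5 a::attr(href)').extract_first()),
                }

Run it with "scrapy crawl appstore -o apps.json" to check that items are being extracted before moving on to the second stage.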
Second stage
Save the crawled content in MongoDB [2]. Install the Python MongoDB driver and modify pipelines.py to insert the crawled data into MongoDB; a sketch of such a pipeline follows.
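A sketch of a MongoDB item pipeline for pipelines.py, assuming the pymongo driver is installed and MongoDB is running locally on the default port; the database and collection names are placeholders.

    # pipelines.py -- insert each crawled item into MongoDB (names are placeholders)
    import pymongo


    class MongoPipeline(object):

        def open_spider(self, spider):
            self.client = pymongo.MongoClient('mongodb://localhost:27017')
            self.db = self.client['appstore']   # placeholder database name
            self.collection = self.db['apps']   # placeholder collection name

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            # Store the item as a plain document.
            self.collection.insert_one(dict(item))
            return item

Remember to enable the pipeline in settings.py, e.g. ITEM_PIPELINES = {'myproject.pipelines.MongoPipeline': 300} (the project name here is a placeholder).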
Third stage
Crawl more content by following next-page links. So far you have likely only crawled the content of the homepage. If the next-page link is generated by JavaScript, use Splash [3] and ScrapyJS [4] to render the page and turn the dynamic parts into static content, as sketched below.
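A sketch of a spider that asks Splash to render pages before parsing, using the scrapy-splash package linked in [3]. It assumes Splash is running at its default address, that the middleware settings described in [3] have been added to settings.py, and that the next-page selector below is a placeholder to be replaced with the real one.

    # spiders/appstore_js.py -- render pages with Splash and follow next-page links
    import scrapy
    from scrapy_splash import SplashRequest


    class AppStoreJsSpider(scrapy.Spider):
        name = 'appstore_js'

        def start_requests(self):
            # Ask Splash to render the homepage and wait for JavaScript to finish.
            yield SplashRequest('http://app.mi.com/', self.parse, args={'wait': 2})

        def parse(self, response):
            # Placeholder selectors, as in the first stage.
            for app in response.css('ul.applist li'):
                yield {
                    'name': app.css('h5 a::text').extract_first(),
                }
            # Follow the JavaScript-generated next-page link through Splash as well.
            next_page = response.css('a.next::attr(href)').extract_first()
            if next_page:
                yield SplashRequest(response.urljoin(next_page), self.parse,
                                    args={'wait': 2})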
Setup Requirements
- Python 2.7
- Scrapy 1.0+
- Splash
- ScrapyJS
- MongoDB
References
[1] Scrapy: http://scrapy.org
[2] MongoDB: https://www.mongodb.org/
[3] scrapy-splash (Splash integration for Scrapy): https://github.com/scrapinghub/scrapy-splash
[4] Handling JavaScript in Scrapy with Splash: https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash/