Introduction
Web crawling is a common technique for efficiently collecting information from across the web. As an introduction to web crawling, this project uses Scrapy, a free and open-source web crawling framework written in Python [1]. Originally designed for web scraping, it can also be used to extract data through APIs or as a general-purpose web crawler. Even though Scrapy provides a comprehensive infrastructure for web crawling, real applications bring additional challenges, e.g., dynamic JavaScript content or your IP being blocked.
The project consists of three parts, each extending the previous one. The end goal is a Scrapy project that can crawl tens of thousands of apps from the Xiaomi AppStore, or any other app store with which you are familiar.
Project Description
First stage
Create a Scrapy project to crawl the content of the Xiaomi AppStore homepage, or the homepage of any other app store. A minimal sketch of such a spider is shown below.
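The following is a minimal sketch of a first-stage spider. The homepage URL and the CSS selectors (app list items, app name, app link) are assumptions for illustration; inspect the real page markup and adjust them accordingly.

    # spiders/appstore.py -- a minimal first-stage spider (selectors are placeholders)
    import scrapy


    class AppStoreSpider(scrapy.Spider):
        name = 'appstore'
        start_urls = ['http://app.mi.com/']  # Xiaomi AppStore homepage (assumed URL)

        def parse(self, response):
            # Placeholder selectors: replace with the ones that match the actual page.
            for app in response.css('ul.applist li'):
                yield {
                    'name': app.css('h5 a::text').extract_first(),
                    'link': response.urljoin(app.css('h5 a::attr(href)').extract_first()),
                }

Run it with "scrapy crawl appstore -o apps.json" to check that items are being extracted before moving on to the second stage.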
Second stage
Save the crawled content in MongoDB [2]. Install the Python MongoDB driver and modify pipelines.py to insert the crawled data into MongoDB; a sketch of such a pipeline follows.
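A sketch of a MongoDB item pipeline for pipelines.py, assuming the pymongo driver is installed and MongoDB is running locally on the default port; the database and collection names are placeholders.

    # pipelines.py -- insert each crawled item into MongoDB (names are placeholders)
    import pymongo


    class MongoPipeline(object):

        def open_spider(self, spider):
            self.client = pymongo.MongoClient('mongodb://localhost:27017')
            self.db = self.client['appstore']   # placeholder database name
            self.collection = self.db['apps']   # placeholder collection name

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            # Store the item as a plain document.
            self.collection.insert_one(dict(item))
            return item

Remember to enable the pipeline in settings.py, e.g. ITEM_PIPELINES = {'myproject.pipelines.MongoPipeline': 300} (the project name here is a placeholder).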
Third stage
Crawl more content by following next-page links. So far you have likely only crawled the content of the homepage. If the next-page link is generated by JavaScript, use Splash [3] and ScrapyJS [4] to render the page and turn the dynamic parts into static content, as sketched below.
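A sketch of a spider that asks Splash to render pages before parsing, using the scrapy-splash package linked in [3]. It assumes Splash is running at its default address, that the middleware settings described in [3] have been added to settings.py, and that the next-page selector below is a placeholder to be replaced with the real one.

    # spiders/appstore_js.py -- render pages with Splash and follow next-page links
    import scrapy
    from scrapy_splash import SplashRequest


    class AppStoreJsSpider(scrapy.Spider):
        name = 'appstore_js'

        def start_requests(self):
            # Ask Splash to render the homepage and wait for JavaScript to finish.
            yield SplashRequest('http://app.mi.com/', self.parse, args={'wait': 2})

        def parse(self, response):
            # Placeholder selectors, as in the first stage.
            for app in response.css('ul.applist li'):
                yield {
                    'name': app.css('h5 a::text').extract_first(),
                }
            # Follow the JavaScript-generated next-page link through Splash as well.
            next_page = response.css('a.next::attr(href)').extract_first()
            if next_page:
                yield SplashRequest(response.urljoin(next_page), self.parse,
                                    args={'wait': 2})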
Setup Requirements
- Python 2.7
- Scrapy 1.0+
- Splash
- ScrapyJS
- MongoDB
References
[1] Scrapy: http://scrapy.org
[2] MongoDB: https://www.mongodb.org/
[3] scrapy-splash (Splash integration for Scrapy): https://github.com/scrapinghub/scrapy-splash
[4] Handling JavaScript in Scrapy with Splash: https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash/