Web Crawler

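This project builds a web crawler with Python's Scrapy framework.
Most of the content is based on the following crawler videos from 太阁:
太阁 Micro-Project 2: AppStore Crawler (Part 1)
太阁 Micro-Project 3: AppStore Crawler (Part 2)
太阁 Micro-Project 4: AppStore Crawler (Part 3)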

Outline

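How to crawl a single page
How to crawl more pages
How to store the crawled data
Common crawler problem 1: how to avoid being blocked
Common crawler problem 2: how to render JavaScript
How to display the crawled data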

Scrapy at a Glance

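The smallest unit of execution is the Spider, which sends crawl requests to the Internet.
When the corresponding response arrives, a parser processes it and converts the web-format data into Python data structures.
Pipelines then post-process the data now held in Python and finally load it into a database or a file.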
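
The flow described above can be sketched in plain Python, without Scrapy itself. Everything in this sketch is a hypothetical stand-in: `fetch` plays the downloader, `parse` plays the spider callback, and the two pipeline stages play the item pipelines. Scrapy's real engine is asynchronous and considerably richer.

```python
# A toy model of the data flow: request -> response -> parse -> pipelines -> storage.
# All names here are hypothetical illustrations, not Scrapy APIs.

FAKE_WEB = {
    "http://example.com/apps": "WeChat|chat app\nMaps|navigation app",
}


def fetch(url):
    """Stand-in for the downloader: return the raw body for a URL."""
    return FAKE_WEB[url]


def parse(body):
    """Stand-in for a spider callback: turn raw text into Python dicts."""
    for line in body.splitlines():
        title, intro = line.split("|")
        yield {"title": title, "intro": intro}


def clean_pipeline(item):
    """First pipeline stage: normalise the title field."""
    item["title"] = item["title"].strip().lower()
    return item


def make_store_pipeline(storage):
    """Second pipeline stage: load the item into a 'database' (a list here)."""
    def store(item):
        storage.append(item)
        return item
    return store


def crawl(url, pipelines):
    """Fetch one page, parse it, and push every item through the stages in order."""
    for item in parse(fetch(url)):
        for stage in pipelines:
            item = stage(item)


database = []
crawl("http://example.com/apps", [clean_pipeline, make_store_pipeline(database)])
print(database)
```

Each stage receives the item returned by the previous one, which is the same contract a Scrapy pipeline's process_item() follows.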

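Step by Step

1. How to crawl a single page

This walkthrough uses the Huawei app store as the example. Every piece of data on a page sits inside a corresponding tag. Once we know which tags hold the data we want, we can define patterns, with the help of parsers, to extract it.

Install Scrapy: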

$ pip install scrapy

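Create a Scrapy project:

$ scrapy startproject appstore

Enter the newly created folder and create the spider source file under appstore/spiders:

$ touch huawei_spider.py

Running the tree command shows the layout that Scrapy generated for the appstore project. Together with the spider file just created, there are four main source files: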

`items.py` -- defines the schema of the scraped data

For example, to scrape the four fields title, url, appid and intro, the schema in items.py is defined as follows:

import scrapy

class AppstoreItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    url = scrapy.Field()
    appid = scrapy.Field()
    intro = scrapy.Field()

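`pipelines.py` -- defines the processing applied to the data after it is scraped

Here is the code of one pipeline. Its job is to write the scraped data to the local file appstore.dat in the format appid, title, intro:

class AppstorePipeline(object):
    def __init__(self):
        # text mode with an explicit encoding, so str values can be written directly
        self.file = open('appstore.dat', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # one tab-separated line per item: appid, title, intro
        val = '{0}\t{1}\t{2}\n'.format(item['appid'], item['title'], item['intro'])
        self.file.write(val)
        return item

    def close_spider(self, spider):
        self.file.close()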

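`settings.py` -- the Scrapy configuration

Many things can be set in this configuration file. Here are a few commonly used settings.

If several pipelines run together, you need to define their priorities. Items pass through the pipelines from the lowest number to the highest, so a smaller number means a higher priority:
```
ITEM_PIPELINES = {
    'appstore.pipelines.AppstorePipeline': 300,
}
```
Set the delay, in seconds, between consecutive requests so that the crawler is less likely to be identified as a bot:
```
DOWNLOAD_DELAY = 5
```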
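
To make the priority rule concrete, here is a small sketch that only sorts a mapping by its numbers; it does not involve Scrapy. The three pipeline paths are made up for illustration and do not exist in this project.

```python
# Hypothetical ITEM_PIPELINES mapping; the class paths are invented for this example.
ITEM_PIPELINES = {
    "appstore.pipelines.StorePipeline": 800,
    "appstore.pipelines.CleanPipeline": 100,
    "appstore.pipelines.DedupPipeline": 300,
}

# Sort by the numeric value: the lowest number runs first.
run_order = [
    path for path, priority in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])
]

for position, path in enumerate(run_order, start=1):
    print(position, path)
```

The order in which the entries are written in the dictionary does not matter; only the numbers decide which stage sees an item first.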

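`spiders/huawei_spider.py` -- the main source file

It uses the three files above to run the crawler.

This file defines the crawler's starting points, start_urls (using the Huawei app store as the example):

import scrapy

class HuaweiSpider(scrapy.Spider):
    name = "appstore"

    allowed_domains = ["huawei.com"]

    start_urls = [
        "http://appstore.huawei.com/more/all"
    ]

The code in the red box of the screenshot extracts the text of the title.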
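
As a stand-in for the idea of pulling the title text out of its tag, here is a minimal sketch that uses only the standard library. It is not the code from the screenshot: the HTML snippet is hypothetical, and the `TitleExtractor` class is an illustration of "find the tag, take its text".

```python
from html.parser import HTMLParser

# Hypothetical markup, loosely modelled on an app listing; the real page differs.
SAMPLE_HTML = """
<div class="game-info">
  <h4 class="title"><a href="/app/C1">First App</a></h4>
</div>
<div class="game-info">
  <h4 class="title"><a href="/app/C2">Second App</a></h4>
</div>
"""


class TitleExtractor(HTMLParser):
    """Collect the text inside every <h4 class="title"> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "h4" and ("class", "title") in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "h4":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title and data.strip():
            self.titles.append(data.strip())


extractor = TitleExtractor()
extractor.feed(SAMPLE_HTML)
print(extractor.titles)
```

In a Scrapy spider the same pattern would normally be written as an XPath on the response, for instance something like `//h4[@class="title"]/a/text()`, assuming the page uses that markup.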

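Once the four Python source files above are written, Scrapy can be given a trial run. The argument to scrapy crawl is the spider's name attribute:

$ cd appstore
$ scrapy crawl appstore

Check the result:

$ cat appstore.dat

2. How to crawl more pages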

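Suppose we want not only the apps on the current page but also the apps recommended alongside each of them. First, add the new fields to the schema in items.py:

import scrapy

class AppstoreItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    url = scrapy.Field()
    appid = scrapy.Field()
    intro = scrapy.Field()
    image_url = scrapy.Field()   # new field, set in parse_detail_page()
    recommends = scrapy.Field()  # new field, the list of recommended apps

Next, modify spiders/huawei_spider.py so that it can crawl the recommended apps. Two functions together implement the Crawler Unit concept: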

`parse()` -- scrapes the apps on the current page (yellow box)

Count how many apps on the current page need to be crawled, in preparation for the loop that follows.
While looping over the apps, send a request to the detail page of each app, where its recommended apps are listed.

After the loop, send a request to the next page and continue crawling.

def parse(self, response):
  """
  response.body is the result of a render.html call; it contains HTML processed by a browser.
  Here we parse that HTML.
  :param response:
  :return: requests to the detail pages, plus a request to the next page if it exists
  """
  # count apps on current page
  divs = response.xpath('//div[@class="list-game-app dotline-btn nofloat"]')
  current_url = response.url
  print("num of app in current page: ", len(divs))
  print("current url: ", current_url)

  # request the detail page of each app on the current page
  count = 0
  for div in divs:
      if count >= 2:  # follow at most 2 apps per page
          break
      item = AppstoreItem()
      info = div.xpath('.//div[@class="game-info  whole"]')
      detail_url = info.xpath('./h4[@class="title"]/a/@href').extract_first()
      item["url"] = detail_url
      req = scrapy.Request(detail_url, callback=self.parse_detail_page)
      req.meta["item"] = item  # hand the partially filled item to the callback
      count += 1
      yield req

  # go to next page
  page_ctrl = response.xpath('//div[@class="page-ctrl ctrl-app"]')
  isNextPageThere = page_ctrl.xpath('.//em[@class="arrow-grey-rt"]').extract()

  if isNextPageThere:
      # span[not(@*)] selects the <span> that has no attributes at all
      current_page_index = int(page_ctrl.xpath('./span[not(@*)]/text()').extract_first())
      if current_page_index >= 5:  # only crawl the first 5 pages
          print("let's stop here for now")
          return
      next_page_index = str(current_page_index + 1)

      next_page_url = self.start_urls[0] + "/" + next_page_index

      print("next_page_index: ", next_page_index, "next_page_url: ", next_page_url)
      request = scrapy.Request(next_page_url, callback=self.parse, meta={  # render the next page
          'splash': {
              'endpoint': 'render.html',
              'args': {'wait': 0.5}
          },
      })
      yield request
  else:
      print("this is the end!")

`parse_detail_page()` -- scrapes the recommended apps of each app on the current page (blue box)

Scrape the detailed information on the detail page, including the recommended apps.
When done, return to the parent page.

Data is passed through the request's meta attribute: parse() sets req.meta["item"] = item, parse_detail_page() reads it back from response.meta["item"], fills in the remaining fields, and yields the finished item.

def parse_detail_page(self, response):
  """
  GET details for each app (needs `import re` at the top of the spider file)
  :param response:
  :return: item
  """
  item = response.meta["item"]

  # details about current app
  item["image_url"] = response.xpath('//ul[@class="app-info-ul nofloat"]//img[@class="app-ico"]/@lazyload').extract_first()
  item["title"] = response.xpath('//ul[@class="app-info-ul nofloat"]//span[@class="title"]/text()').extract_first()
  item["appid"] = re.match(r'http://.*/(.*)', item["url"]).group(1)
  item["intro"] = response.xpath('//div[@class="content"]/div[@id="app_strdesc"]/text()').extract_first()

  # recommended apps
  divs = response.xpath('//div[@class="unit nofloat corner"]/div[@class="unit-main nofloat"]/div[@class="app-sweatch  nofloat"]')
  recommends = []
  for div in divs:
      rank = div.xpath('./div[@class="open nofloat"]/em/text()').extract_first()
      name = div.xpath('./div[@class="open nofloat"]/div[@class="open-info"]/p[@class="name"]/a/@title').extract_first()
      url = div.xpath('./div[@class="open nofloat"]/div[@class="open-info"]/p[@class="name"]/a/@href').extract_first()
      rec_appid = re.match(r'http://.*/(.*)', url).group(1)
      recommends.append({'name': name, 'rank': rank, 'appid': rec_appid})

  item["recommends"] = recommends
  yield item
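
The appid in parse_detail_page() is taken from the last path segment of the detail URL. The sketch below shows how that regular expression behaves on a made-up URL (the real detail URLs may look different) and where it stops working. The helper `appid_from_url` is a hypothetical alternative, not part of the original spider.

```python
import re
from urllib.parse import urlparse

# Hypothetical detail-page URL; the real URLs on the site may look different.
detail_url = "http://appstore.huawei.com/app/C10132067"

# Same pattern as in parse_detail_page(): the first .* is greedy, so the
# captured group is whatever follows the last "/".
match = re.match(r'http://.*/(.*)', detail_url)
appid = match.group(1) if match else None
print(appid)

# The pattern only matches http:// URLs; for anything else re.match returns None,
# and calling .group(1) on None would raise AttributeError.
https_match = re.match(r'http://.*/(.*)', "https://appstore.huawei.com/app/C10132067")
print(https_match)


def appid_from_url(url):
    """Scheme-independent alternative: take the last path segment."""
    return urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]


print(appid_from_url("https://appstore.huawei.com/app/C10132067"))
```

If the site ever serves https or relative links, checking the match for None before calling .group(1), or using a path-based helper like the one above, avoids a crash in the callback.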

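Crawling more pages really means walking the whole site the way you would traverse a tree.

Scrapy deduplicates request URLs by default, so there is no need to worry about crawling the same data twice.
The start URLs act as the root of the tree.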
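
The traversal idea can be sketched over an in-memory link graph. This is a simplified illustration: the `SITE` dictionary is hypothetical, and a plain set of URL strings stands in for the duplicate filter, whereas Scrapy's own filter works on request fingerprints.

```python
from collections import deque

# Hypothetical link graph: each URL maps to the URLs found on that page.
SITE = {
    "/more/all": ["/app/1", "/app/2", "/more/all/2"],
    "/more/all/2": ["/app/2", "/app/3"],
    "/app/1": ["/app/2"],
    "/app/2": ["/app/1", "/app/3"],
    "/app/3": [],
}


def crawl(start_url):
    """Breadth-first traversal that never requests the same URL twice."""
    seen = {start_url}          # plays the role of the duplicate filter
    queue = deque([start_url])  # pending requests
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)     # "download and parse" the page
        for link in SITE.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


order = crawl("/more/all")
print(order)
```

Because pages link to each other, a real site is closer to a graph than a tree; without the seen set, the loop between /app/1 and /app/2 would never terminate.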

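There are two ways to obtain the address of the next page:

Increment the page ID to construct the next page's address
Find the next page's address in the current page, then request it

Either way, you still have to check whether the next page exists.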
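
A minimal sketch of the first method, assuming the paging scheme used in parse() where page N lives at the start URL followed by /N. The function name and the `has_next_arrow` flag are hypothetical; the flag stands for "the page shows a next-page control", which is the existence check both methods need.

```python
# Assumed paging scheme, mirroring the spider: page N lives at <start_url>/N.
START_URL = "http://appstore.huawei.com/more/all"
MAX_PAGES = 5  # same cap as in parse()


def next_page_url(current_page_index, has_next_arrow):
    """Return the URL of the next page, or None when crawling should stop."""
    if not has_next_arrow or current_page_index >= MAX_PAGES:
        return None
    return "{}/{}".format(START_URL, current_page_index + 1)


print(next_page_url(1, True))   # there is a next page and the cap is not reached
print(next_page_url(5, True))   # cap reached
print(next_page_url(3, False))  # no next-page control on the page
```

Keeping the stop conditions in one small function makes it easy to change the cap or the URL scheme without touching the parsing code.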

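3. How to store the crawled data

Why use MongoDB to store the crawled data?

MongoDB has rows but no columns. Each row (a document) is a dictionary whose key-value pairs can vary in number, in type, and in nesting depth. This schemaless model is why it is classed as NoSQL.

Advantages of MongoDB

Databases and collections (tables) do not need to be created explicitly; they are created automatically on first use, which makes MongoDB simple and convenient.

Disadvantages of MongoDB:

It is easy to create extra databases or collections by accident: if you mistype a database or collection name, MongoDB raises no error and simply creates a new one. That is the price of the convenience.
The data structure has no enforced constraints. It is extremely flexible, but careless writes easily pollute the data.
A beginner can easily dirty the database with a wrong command!

MongoDB suits data that is not critical and has no strict structural requirements, which is exactly what a crawler's raw output looks like. That makes it a good fit for storing the crawled data here.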
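
The trade-off can be illustrated with plain Python dictionaries standing in for MongoDB documents. No database is contacted in this sketch, and the records are invented; it only shows what "no enforced structure" means in practice.

```python
import json

# Two hypothetical crawl results with different shapes. A schemaless document
# store would accept both in the same collection.
documents = [
    {"appid": "C1", "title": "First App", "intro": "short text"},
    {
        "appid": "C2",
        "title": "Second App",
        "recommends": [{"name": "First App", "rank": "1", "appid": "C1"}],
    },
]

# Nothing enforces a common set of keys, so readers must check before use.
for doc in documents:
    print(doc["appid"], sorted(doc.keys()), len(doc.get("recommends", [])))

# A typo in a key is silently stored as a brand-new field.
documents.append({"appid": "C3", "titel": "Typo App"})
missing_title = [doc["appid"] for doc in documents if "title" not in doc]
print(missing_title)

# Nested structures serialise naturally, which is convenient for crawl output.
print(json.dumps(documents[1], ensure_ascii=False))
```

The same flexibility that lets the second record carry a nested list also lets the third record slip in with a misspelled key, so a quick validation pass over crawled data is usually worth having.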

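Deploying and testing MongoDB

mongo is the client command
mongod is the server command

Install MongoDB with Homebrew:

$ brew install mongodb

By default MongoDB stores its data under /data/db, so this folder has to be created and handed over to your user:

$ sudo mkdir -p /data/db
$ sudo chown xxx /data/db
# replace xxx with your current user name; if unsure, run $ whoami first

Alternatively, keep the data under your home directory, which avoids changing permissions; MongoDB just has to be told the path:

$ mkdir -p ~/data/db
$ mongod --dbpath ~/data/db  # best turned into an alias: $ alias mongod="mongod --dbpath ~/data/db"

Add mongodb/bin to $PATH:

$ touch ~/.bash_profile
$ vim ~/.bash_profile

Add the following lines, then restart the terminal:

export MONGO_PATH=/usr/local/Cellar/mongodb/3.2.1
export PATH=$MONGO_PATH/bin:$PATH

Start MongoDB:

$ mongod

You can now run queries against the data stored in MongoDB.

Note: the window running mongod has to stay open, so start the MongoDB client (mongo) in another terminal window.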

Using MongoDB in Scrapy

The connection is made through the Python package pymongo. Add a MongoDB pipeline to pipelines.py that loads the data in each item into MongoDB:

import pymongo

class AppstoreMongodbPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        """
        return an instance of this pipeline
        crawler.settings --> settings.py
        get mongo_uri & mongo_database from settings.py
        :param crawler:
        :return: pipeline instance
        """
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]
        # clean the collection once, when a new crawl starts
        self.db["AppstoreItem"].delete_many({})

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        """
        process data here before loading to mongodb
        :param item:
        :param spider:
        :return: item
        """
        collection_name = item.__class__.__name__  # use itemName as the collectionName
        self.db[collection_name].insert_one(dict(item))
        return item

Add the related configuration to settings.py:

ITEM_PIPELINES = {
   'appstore.pipelines.AppstoreWritePipeline': 1,
   'appstore.pipelines.AppstoreImagesPipeline': 2,
   'appstore.pipelines.AppstoreMongodbPipeline': 3,
}
# mongo db settings
MONGO_URI = "127.0.0.1:27017"
MONGO_DATABASE = "appstore"

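4. Common crawler problem 1: how to avoid being blocked

The problem

Scenario 1: Every request we send carries user-agent information to the website's server. In normal browsing the user agent is Chrome, Mozilla, Safari, and so on. With Scrapy the user agent identifies you as a spider, and the web server may block such user agents.
Scenario 2: When we send requests to a server too frequently, the server may block our IP.

The solution
Disguise the crawler's identity: randomly pick a user agent from a prepared pool to act as the current identity, and keep switching to a new one (the middleware below picks one for every request).

Add the middleware that implements this to settings.py:

DOWNLOADER_MIDDLEWARES = {
    'appstore.random_useragent.RandomUserAgentMiddleware': 400,
}

In the same directory as settings.py, add a new Python source file, random_useragent.py, and build the user-agent pool from a few working user agents collected from the web: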

import logging
import random

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class RandomUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent='Scrapy'):
        super(RandomUserAgentMiddleware, self).__init__(user_agent)

    def process_request(self, request, spider):
        # pick a random identity for every outgoing request
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)
            spider.log(
                    u'User-Agent: {} {}'.format(request.headers.get('User-Agent'), request),
                    level=logging.DEBUG
                )

    # The default user_agent_list below holds Chrome user agents on macOS.
    # More user agent strings can be found at
    # http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1664.3 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.103 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1664.3 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.13 (KHTML, like Gecko) Chrome/24.0.1290.1 Safari/537.13",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.45 Safari/535.19",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.45 Safari/535.19",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.11 Safari/535.19",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.66 Safari/535.11",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36",
    ]

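5. Common crawler problem 2: how to render JavaScript

Rendering and manipulating pages with JavaScript is very common, so handling a page that relies heavily on JavaScript is a frequent problem in Scrapy crawler development.

scrapy-splash uses Splash to integrate JavaScript rendering with Scrapy, so that Scrapy can scrape dynamic pages.

Splash is a JavaScript rendering service: a lightweight browser with an HTTP API, written in Python on top of Twisted and Qt. So the first step is to install a Splash instance.

The official site recommends installing Splash as a Docker container; see the Splash installation documentation for the detailed steps.

In settings.py, add the configuration that registers Splash as a middleware:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashMiddleware': 725,
}
SPLASH_URL = 'http://192.168.99.100:8050'  # 'DOCKER_HOST_IP:CONTAINER_PORT'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'  # Splash-aware duplicate filter
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'  # if you use the HTTP cache, a Splash-aware cache storage backend is needed as well

In spiders/huawei_spider.py, use Splash to render the JavaScript on the start page: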

class HuaweiSpider(scrapy.Spider):
    name = "appstore"

    allowed_domains = ["huawei.com"]

    start_urls = [
        "http://appstore.huawei.com/more/all"
    ]

    # render since the start url
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })

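Splash itself runs as a separate server (here inside Docker); scrapy-splash is the middleware that connects Scrapy to that service.
A request we issue is first routed by the middleware to Splash, which fetches the page from the app store, runs its JavaScript, and produces the rendered HTML page. Only then does the response come back to Scrapy.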
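
To see roughly what such a rendering request looks like, the sketch below builds a URL for Splash's render.html endpoint with the standard library, using the same `wait` argument as the spider's meta dictionary. It only constructs the URL and sends nothing. The helper name `splash_render_url` is hypothetical, and how scrapy-splash actually encodes its requests internally is not shown here.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumed address of the Splash instance, the same value as SPLASH_URL in settings.py.
SPLASH_URL = "http://192.168.99.100:8050"


def splash_render_url(target_url, wait=0.5):
    """Build a render.html URL asking Splash to load and render target_url."""
    query = urlencode({"url": target_url, "wait": wait})
    return "{}/render.html?{}".format(SPLASH_URL, query)


rendered = splash_render_url("http://appstore.huawei.com/more/all")
print(rendered)

# Round-trip check: the target URL and wait time survive the encoding.
params = parse_qs(urlparse(rendered).query)
print(params["url"][0], params["wait"][0])
```

Opening a URL of this shape in a browser is a quick way to check that the Splash container is reachable and returns rendered HTML before wiring it into the spider.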

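6. How to display the crawled data

Flask is a Python microframework, a very small one. A few lines of code in a single Python file are enough to serve a web page. It is well suited to cases where you need a page to show results and do not care about polish or complex logic.

In the same directory as settings.py, create templates/appstore_index.html: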

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>App Stats from Huawei appstore</title>
</head>
<!-- use Jinja2 tags ({% %}) and output expressions ({{ }}) here to help render -->
<body>
    {% for app in apps %}
    <img src="{{ url_for('static', filename=app.image_paths[0]) }}">
    <h3>{{ app.title }}</h3>
    <div>{{ app.intro }}</div>
    <div>{{ app.url }}</div>
    <ul>
        {% for rec in app.recommends %}
        <li>{{ rec.rank }} : {{ rec.name }}</li>
        {% endfor %}
    </ul>
    <hr />
    {% endfor %}
</body>
</html>

Still in the same directory as settings.py, create the source file server.py:

# coding=utf-8
__author__ = 'jing'

from flask import Flask, render_template
import pymongo
from settings import MONGO_URI, MONGO_DATABASE

app = Flask(__name__, static_folder = "images")  # instantiate flask

@app.route("/")
def hello():
    client = pymongo.MongoClient(MONGO_URI)
    db = client[MONGO_DATABASE]
    apps = list(db["AppstoreItem"].find())  # read every document before the connection is closed
    client.close()
    return render_template("appstore_index.html", apps=apps)  # render anything we have in each app

if __name__ == "__main__":
    app.run(debug=True) # some error won't show up until you enable debugging feature

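7. Applications of crawlers

Crawling is the starting point of many projects, and the crawled data can be put to many uses.

Example: pi, a crawler project from the first 太阁 cohort, used a crawler to build a food search engine.
More interesting crawler applications are described in two Zhihu articles.

The next chapter of this GitBook covers one of these applications: Information Retrieval.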
