Scrapy Crawler: Downloading Images


1. Create the Project

Go to the folder where the project should live, open Windows PowerShell, and create a Scrapy project (here named spider):

scrapy startproject spider
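The command generates a project skeleton from Scrapy's default template; with the name spider, both the outer folder and the inner Python package are called spider:

```
spider/
├── scrapy.cfg
└── spider/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py
```

The files edited in the following steps (items.py, settings.py, pipelines.py, and the spider module under spiders/) all live inside this skeleton.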

2. Create the Spider

Enter the newly created project folder and generate a spider (here named dogcat; the domain is a placeholder):

scrapy genspider dogcat www.xxx.com

3. Define the Item Fields

Open items.py:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class SpiderItem(scrapy.Item):
    # define the fields for your item here like:
    
    # the image URL extracted by the spider
    src = scrapy.Field()
    

4. Write the Spider Logic

Open the spider file (dogcat.py) and import the item class defined above:

from spider.items import SpiderItem

Write XPath expressions matching the target page and yield the populated item objects:

class DogcatSpider(scrapy.Spider):
    name = 'dogcat'
    # allowed_domains = ['sc.chinaz.com']
    start_urls = ['http://sc.chinaz.com/tupian/']

    # spider parsing logic
    def parse(self, response):

        # all divs under the div whose id is "container"
        div_list = response.xpath('//div[@id="container"]/div')
        for div in div_list:
            # get the src2 value of the img under div/a (a lazy-load pseudo-attribute)
            src = div.xpath('./div/a/img/@src2').extract_first()

            # instantiate the item
            item = SpiderItem()
            item['src'] = src

            # submit the item to the pipeline
            yield item
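The two XPath steps above (collect the divs under #container, then read each image's lazy-load pseudo-attribute) can be sketched with the standard library's ElementTree on a toy snippet; the markup and URLs below are made up to mimic the page structure, while the real page is fetched by Scrapy and queried with response.xpath:

```python
import xml.etree.ElementTree as ET

# toy markup imitating the structure the spider targets:
# #container > div > div > a > img[src2]
html = (
    '<html><body><div id="container">'
    '<div><div><a href="#"><img src2="http://example.com/cat.jpg"/></a></div></div>'
    '<div><div><a href="#"><img src2="http://example.com/dog.jpg"/></a></div></div>'
    '</div></body></html>'
)

root = ET.fromstring(html)

# all divs under the div whose id is "container"
div_list = root.findall(".//div[@id='container']/div")

# read the src2 pseudo-attribute of the img under div/a in each entry
srcs = [d.find('./div/a/img').get('src2') for d in div_list]
print(srcs)
```

The site stores the real image URL in src2 and only copies it into src when the image scrolls into view (lazy loading), which is why the spider reads src2 rather than src.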

5. Configure Settings

In the browser, press F12, open the Network tab, and pick any request. In its Headers panel, scroll to the bottom and copy the User-Agent value. Paste it into the USER_AGENT entry in settings.py (uncomment the line first). To keep the crawl output readable, add the following setting on the line above it:

LOG_LEVEL = "ERROR"
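Together, the additions look like the excerpt below; the User-Agent string is only an example, so paste the one copied from your own browser:

```python
# settings.py (excerpt) -- the User-Agent value below is just an example;
# replace it with the one copied from your browser's Network tab
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/91.0.4472.124 Safari/537.36')

# only log errors, so the scraped output is easy to read
LOG_LEVEL = 'ERROR'
```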

Also set ROBOTSTXT_OBEY to False:

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

6. Override the Pipeline Class

Because we are scraping image data, Scrapy provides a dedicated ImagesPipeline class for image downloads (it depends on the Pillow library, so run pip install Pillow first); we only need to subclass it and override a few methods. Comment out everything in pipelines.py and rewrite it as follows:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
# from itemadapter import ItemAdapter
#
#
# class SpiderPipeline:
#     def process_item(self, item, spider):
#         return item

import scrapy
from scrapy.pipelines.images import ImagesPipeline

# three methods to override
class imgsPipeline(ImagesPipeline):

    # request each image by its URL
    def get_media_requests(self, item, info):
        # issue the request for the image
        yield scrapy.Request(item['src'])

    # specify where the downloaded image is stored
    def file_path(self, request, response=None, info=None, *, item=None):
        # derive the file name from the URL
        imgName = request.url.split('/')[-1]
        return imgName

    def item_completed(self, results, item, info):
        # hand the item on to the next pipeline class, if any
        return item
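The naming rule in file_path() simply keeps the last path segment of the image URL; a minimal stand-alone sketch (the URL is made up for illustration — in the pipeline it comes from request.url):

```python
# hypothetical image URL, for illustration only
url = 'http://example.com/files/pic/apic27858.jpg'

# same logic as file_path(): keep everything after the last slash
img_name = url.split('/')[-1]
print(img_name)  # -> apic27858.jpg
```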

To choose where the images are stored, add the following line at the end of settings.py:

# directory for downloaded images (created automatically if it does not exist)
IMAGES_STORE = './imgs'

Also enable the pipeline switch in settings.py (around line 68), replacing the default class name with the custom one:

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'spider.pipelines.imgsPipeline': 300,
}

7. Run the Crawl

Run the following command to start downloading the images:

scrapy crawl dogcat

8. References

https://www.bilibili.com/video/BV1ha4y1H7sx?p=69&t=1


Author: Peyton
Copyright: Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit Peyton when reposting.