1. Create the project
Go into the folder where you want the project, open Windows PowerShell, and create the crawler project (here named spider):
scrapy startproject spider
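This generates a project skeleton roughly like the following (the standard Scrapy layout):
spider/
    scrapy.cfg
    spider/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py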
2. Create the spider
Go into the newly created project folder and create the spider (here named dogcat; www.xxx.com is a placeholder domain):
scrapy genspider dogcat www.xxx.com
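The genspider command creates dogcat.py under spider/spiders/ with a skeleton roughly like this (the default Scrapy template):
import scrapy


class DogcatSpider(scrapy.Spider):
    name = 'dogcat'
    allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/']

    def parse(self, response):
        pass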
3. Encapsulate the item fields
Open the items.py file:
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class SpiderItem(scrapy.Item):
    # define the fields for your item here like:
    
    # define the item's fields
    src = scrapy.Field()
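A side note: scrapy.Field() is just a placeholder, and the resulting item behaves like a dict, so (purely for illustration) you can write:
item = SpiderItem(src='http://example.com/a.jpg')
print(item['src'])  # http://example.com/a.jpg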
4. Define the crawling logic
Open the spider file (dogcat.py) and import the item class we just defined:
from spider.items import SpiderItem
Write XPath expressions matching the target page and yield the item objects:
class DogcatSpider(scrapy.Spider):
    name = 'dogcat'
    # allowed_domains = ['sc.chinaz.com']
    start_urls = ['http://sc.chinaz.com/tupian/']
    # crawling logic
    def parse(self, response):
        # all the divs under the div whose id is "container"
        div_list = response.xpath('//div[@id="container"]/div')
        for div in div_list:
            # get the src2 attribute of the img under div > a (a lazy-loading pseudo-attribute)
            src = div.xpath('./div/a/img/@src2').extract_first()
            # instantiate the item
            item = SpiderItem()
            item['src'] = src
            # submit the item to the pipeline
            yield item
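As a side note, on recent Scrapy versions .get() is the preferred spelling of .extract_first(); both return None when the XPath matches nothing, so the extraction line above could equally be written as:
src = div.xpath('./div/a/img/@src2').get()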
5. Configure settings
Open the target page in a browser and press F12, select the Network tab, and pick any request. Open its Headers panel, scroll to the bottom, and copy the User-Agent value. Paste this value into the USER_AGENT entry in settings.py (uncommenting that line).
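The pasted value will look roughly like the following (an illustrative string only; use the one copied from your own browser):
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'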
To make the crawl output clearer, we can also add the following setting on the line above it:
LOG_LEVEL = "ERROR"
At the same time, change ROBOTSTXT_OBEY to False:
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
6. Override the pipeline class
Since we are scraping image data, Scrapy provides a dedicated image-downloading class, ImagesPipeline; we only need to subclass it and override a few methods. Specifically, comment out the entire contents of pipelines.py and rewrite it as follows:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
# from itemadapter import ItemAdapter
#
#
# class SpiderPipeline:
#     def process_item(self, item, spider):
#         return item
import scrapy
from scrapy.pipelines.images import ImagesPipeline
# three methods need to be overridden
class imgsPipeline(ImagesPipeline):
    # request each image by its URL
    def get_media_requests(self, item, info):
        # issue the request
        yield scrapy.Request(item['src'])
    # specify the storage path for each image
    def file_path(self, request, response=None, info=None, *, item=None):
        # use the last segment of the URL as the file name
        imgName = request.url.split('/')[-1]
        return imgName
    def item_completed(self, results, item, info):
        # pass the item on to the next pipeline class to run
        return item
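If you also want to discard items whose image failed to download, item_completed receives a results list of (success, info) two-tuples that can be inspected. A minimal sketch, not part of the original code (it additionally needs from scrapy.exceptions import DropItem at the top of pipelines.py):
    def item_completed(self, results, item, info):
        # drop the item if no image was downloaded successfully
        if not any(ok for ok, x in results):
            raise DropItem('image download failed: %s' % item['src'])
        return item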
The image storage path only requires one setting appended to the last line of settings.py:
# directory for downloaded images (created automatically if it does not exist)
IMAGES_STORE = './imgs'
Also flip on the pipeline switch in settings.py (around line 68, commented out by default), replacing the default class name with our custom pipeline class:
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'spider.pipelines.imgsPipeline': 300,
}
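Note that ImagesPipeline relies on the Pillow library for image handling, so install it first if it is missing:
pip install Pillow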
7. Crawl the images
Run the following command from the project directory to start the crawl:
scrapy crawl dogcat 
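If everything is configured correctly, the downloaded images should appear directly under ./imgs, named after the last path segment of each URL, since file_path was overridden above (by default ImagesPipeline would store them under a full/ subdirectory using SHA1-hash file names).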