1. Create the project
Navigate to the folder that will hold the project, open Windows PowerShell there, and create the crawler project:
scrapy startproject spider (here "spider" is the project name)
2. Create the spider
Enter the newly created project folder and generate a spider:
scrapy genspider dogcat www.xxx.com (here "dogcat" is the spider name)
3. Define the item fields
Open items.py and declare a field for each piece of data to be scraped:
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class SpiderItem(scrapy.Item):
    # define the fields for your item here, for example:
    src = scrapy.Field()  # URL of each image
4. Write the crawling logic
Open the spider file (dogcat.py) and import the item class defined above (the generated file already contains import scrapy at the top):
from spider.items import SpiderItem
Then write XPath expressions that match the structure of the target page, and yield an item object for each result:
class DogcatSpider(scrapy.Spider):
    name = 'dogcat'
    # allowed_domains = ['sc.chinaz.com']
    start_urls = ['http://sc.chinaz.com/tupian/']

    # crawling logic
    def parse(self, response):
        # every div under the div whose id is "container"
        div_list = response.xpath('//div[@id="container"]/div')
        for div in div_list:
            # get the src2 attribute of div > a > img
            # (src2 is a pseudo attribute the site uses for lazy loading)
            src = div.xpath('./div/a/img/@src2').extract_first()
            # instantiate an item and fill it in
            item = SpiderItem()
            item['src'] = src
            # hand the item over to the pipeline
            yield item
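The two XPath steps in parse can be tried out on a toy snippet before running the spider. Below is a minimal sketch using Python's standard-library ElementTree in place of Scrapy's selector; the HTML fragment and image URLs are made up for illustration, but the fragment mimics the container/div/a/img layout the spider expects:

```python
import xml.etree.ElementTree as ET

# Made-up fragment mimicking the page layout: a container div,
# one child div per image, each wrapping div > a > img.
html = (
    '<html><body><div id="container">'
    '<div><div><a href="#"><img src2="http://example.com/cat.jpg"/></a></div></div>'
    '<div><div><a href="#"><img src2="http://example.com/dog.jpg"/></a></div></div>'
    '</div></body></html>'
)

root = ET.fromstring(html)
# equivalent of //div[@id="container"]/div
div_list = root.findall('.//div[@id="container"]/div')
# equivalent of ./div/a/img/@src2 applied to each div
srcs = [div.find('./div/a/img').get('src2') for div in div_list]
print(srcs)  # ['http://example.com/cat.jpg', 'http://example.com/dog.jpg']
```

In the real spider, response.xpath plays the role of findall here, and extract_first() pulls out the attribute value.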
5. Edit the settings
Open the target page in a browser, press F12, switch to the Network tab, and click any request. In its Headers panel, scroll to the bottom and copy the User-Agent value. Paste this value into USER_AGENT in settings.py (uncomment the line first). To keep the crawl output readable, add the following line above it:
LOG_LEVEL = "ERROR"
Also change ROBOTSTXT_OBEY to False:
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
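Put together, the relevant part of settings.py looks roughly like this. The User-Agent string below is only an example; paste the one copied from your own browser:

```python
# settings.py (excerpt)
# Example User-Agent -- replace with the value copied from the browser's F12 panel
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36')
LOG_LEVEL = "ERROR"   # only log errors, so the crawl output stays readable
ROBOTSTXT_OBEY = False
```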
6. Override the images pipeline
Since we are scraping image data, Scrapy ships a dedicated ImagesPipeline class for image downloads; we only need to subclass it and override a few methods. Comment out the generated contents of pipelines.py and rewrite the file as follows:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
# from itemadapter import ItemAdapter
#
#
# class SpiderPipeline:
#     def process_item(self, item, spider):
#         return item
import scrapy
from scrapy.pipelines.images import ImagesPipeline


# three methods need to be overridden
class imgsPipeline(ImagesPipeline):
    # issue a download request for each image URL
    def get_media_requests(self, item, info):
        yield scrapy.Request(item['src'])

    # decide the path the image is stored under
    def file_path(self, request, response=None, info=None, *, item=None):
        # use the last segment of the URL as the file name
        imgName = request.url.split('/')[-1]
        return imgName

    def item_completed(self, results, item, info):
        # pass the item on to the next pipeline in the chain
        return item
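The naming logic in file_path simply keeps the last path segment of the image URL. A standalone sketch of that expression, with a hypothetical URL for illustration:

```python
# Hypothetical image URL, for illustration only
url = 'https://example.com/pics/2024/kitten.jpg'

# same expression as in file_path above: keep everything after the last '/'
img_name = url.split('/')[-1]
print(img_name)  # kitten.jpg
```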
To set where the images are stored, just add the following lines at the end of settings.py:
# directory for downloaded images (created automatically if it does not exist)
IMAGES_STORE = './imgs'
Also enable the item pipeline switch in settings.py (around line 68) and replace the default pipeline name with our custom class:
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'spider.pipelines.imgsPipeline': 300,
}
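The number 300 is the pipeline's priority: when several pipelines are registered, Scrapy runs them in ascending order of this value (conventionally 0-1000). A small sketch of that ordering rule, where the second pipeline name is hypothetical:

```python
# Hypothetical ITEM_PIPELINES with two entries; the lower value runs first
item_pipelines = {
    'spider.pipelines.imgsPipeline': 300,
    'spider.pipelines.CleanupPipeline': 100,  # hypothetical extra pipeline
}

# order in which Scrapy would invoke them
order = sorted(item_pipelines, key=item_pipelines.get)
print(order)  # ['spider.pipelines.CleanupPipeline', 'spider.pipelines.imgsPipeline']
```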
7. Run the crawl
Run the following command from the project directory to start downloading the images:
scrapy crawl dogcat