Overriding ImagesPipeline in Scrapy

纪梦鱼 | Published 2022/10/28 | Updated 2025/12/17

This article was last updated on 2025-12-17; its content may be out of date.

Scraping multimedia resources with Scrapy

Scrapy provides a dedicated pipeline class for this: ImagesPipeline.

The coding workflow:

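1. In the spider file, extract the image/video links. (ImagesPipeline itself only processes images; for videos and other files, Scrapy's FilesPipeline is the counterpart.)
2. Wrap each extracted link in an item object and submit it to the pipeline.
3. In the pipelines file, define a custom pipeline class that subclasses ImagesPipeline and override three methods: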

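  def get_media_requests(self, item, info): receives the item submitted by the spider and yields a request for each image URL; Scrapy then downloads the image's binary data
  def file_path(self, request, response=None, info=None, *, item=None): returns the name (a path relative to IMAGES_STORE) under which the image is saved
  def item_completed(self, results, item, info): returns the item so that it is passed on to the next pipeline class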
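
4. In the settings file, enable the pipeline and set IMAGES_STORE = 'girlsLib' to specify the folder where images are stored.

```python
import scrapy
from itemadapter import ItemAdapter

from scrapy.pipelines.images import ImagesPipeline


# A custom pipeline class must inherit from ImagesPipeline
class mediaPileline(ImagesPipeline):
    # Override three parent-class methods to request the image binary data
    # and persist it to disk.

    # Issue a request for the image URL so that Scrapy downloads the image.
    # item: the item object received from the spider
    def get_media_requests(self, item, info):
        img_src = item['src']
        yield scrapy.Request(img_src)

    # Specify the image name (only the storage name needs to be returned)
    def file_path(self, request, response=None, info=None, *, item=None):
        img_name = request.url.split('/')[-1]
        # Scrapy may call file_path before the download finishes, so this
        # message shows the chosen name rather than a confirmed save.
        print(img_name, 'downloaded and saved successfully!')
        # '.jpg' is appended even if the URL segment already has an extension.
        return f'{img_name}.jpg'

    # This method can be omitted if there is no next pipeline class
    def item_completed(self, results, item, info):
        return item  # pass the received item on to the next pipeline class
```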
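
The file_path override above takes the last URL segment verbatim and appends .jpg, so a URL ending in photo.png?size=large would be stored as photo.png?size=large.jpg. As a hedged sketch (not the author's method), the helper below derives a cleaner name using only the standard library. The function name image_name_from_url, its default argument, and the example URLs are hypothetical; inside file_path it could be called as image_name_from_url(request.url).

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def image_name_from_url(url: str, default: str = "image") -> str:
    """Return '<stem>.jpg' derived from the last path segment of url."""
    # urlparse separates the path from the query string and fragment,
    # so '?size=large' never ends up in the file name.
    path = urlparse(url).path
    # stem drops the final extension: 'photo.png' -> 'photo'.
    stem = PurePosixPath(path).stem
    # Fall back to a placeholder when the URL has no usable last segment.
    return f"{stem or default}.jpg"


print(image_name_from_url("https://example.com/pics/photo.png?size=large"))
```

Keeping a .jpg suffix matches ImagesPipeline's default behavior of converting downloaded images to JPEG.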

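The results argument of item_completed is, per the Scrapy documentation, a list of (success, info) two-tuples, one for each request yielded by get_media_requests; for a successful download, info is a dict with keys such as url and path. A minimal sketch of collecting the stored paths follows. The helper name downloaded_paths and the sample data are hypothetical and only imitate the shape of the real value.

```python
def downloaded_paths(results):
    """Collect the stored path of every successfully downloaded image."""
    # Each entry is (success, info); info is a dict only when success is True.
    return [info["path"] for success, info in results if success]


# Hypothetical sample shaped like the value Scrapy passes to item_completed.
sample_results = [
    (True, {"url": "https://example.com/a.jpg", "path": "a.jpg"}),
    (False, Exception("download failed")),  # stand-in for a Twisted Failure
]
print(downloaded_paths(sample_results))
```

Inside item_completed, one option is to raise scrapy.exceptions.DropItem when this list is empty, so items without any image do not reach later pipelines.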
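
Step 4 can be sketched as the following settings.py fragment. ITEM_PIPELINES and IMAGES_STORE are Scrapy's own setting names; the module path myproject.pipelines is a hypothetical project name, and 300 is an arbitrary priority (pipelines with lower numbers run first).

```python
# settings.py (fragment)

# Enable the custom pipeline; 'myproject' is a hypothetical project name.
ITEM_PIPELINES = {
    "myproject.pipelines.mediaPileline": 300,
}

# Folder where the downloaded images are stored.
IMAGES_STORE = "girlsLib"
```

ImagesPipeline also requires the Pillow library to be installed.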