Scraping Dynamic Web Pages with Scrapy
Dynamic web pages build or modify their HTML with JavaScript after loading, so a plain Scrapy request only sees the bare page shell.
They can be crawled by rendering them first with the Splash rendering engine.
1. Installing Splash
Install it with Docker.
Note: if you are on Windows Home edition the installation steps are different; see the guide Scrapy-Splash的安装(windows篇) for that case.
Since I am on the Education edition, installing Docker Desktop directly is enough:
Docker Desktop for Mac and Windows | Docker
(The tutorials online are a bit scattered; I found this one quite good.)
After installing Docker Desktop and restarting, you may be prompted to update WSL 2; just update it from the official page:
在 Windows 10 上安装 WSL | Microsoft Docs
Once that is done, run the following in PowerShell:
docker run -d -p 8050:8050 scrapinghub/splash
Once the container is up, a welcome page should appear at http://localhost:8050 (pulling the image can take quite a while the first time).
This blog post is also worth a look:
Splash 简介与安装 - 孔雀东南飞 - 博客园 (cnblogs.com)
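To confirm that Splash is actually serving, here is a quick sanity check from Python, assuming the default port mapping from the docker run command above:

import requests

# The Splash container serves an info page at its root URL;
# a 200 here means the rendering service is up.
resp = requests.get("http://localhost:8050/")
print(resp.status_code)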
Next, install scrapy-splash, the Python library for driving Splash from Scrapy:
pip install scrapy-splash
The Splash documentation:
Splash - A javascript rendering service — Splash 3.5 documentation
Service endpoints
render.html — renders a JavaScript page and returns the resulting HTML; the url parameter is the address to render.
execute — runs a user-defined rendering script (Lua) that can execute JavaScript inside the page; call it with a POST request, passing the script in the lua_source parameter.
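As a quick illustration of both endpoints, here is a minimal sketch using the requests library (the target URL and the one-second waits are just illustrative values):

import requests

SPLASH = "http://localhost:8050"

# render.html: a GET request with the target page in the url parameter;
# Splash returns the HTML after the page's JavaScript has run.
html = requests.get(
    f"{SPLASH}/render.html",
    params={"url": "http://quotes.toscrape.com/js/", "wait": 1},
).text

# execute: a POST request carrying a custom Lua script in lua_source;
# extra JSON fields (like url here) are exposed to the script via args.
lua = """
function main(splash, args)
    splash:go(args.url)
    splash:wait(1)
    return splash:html()
end
"""
html = requests.post(
    f"{SPLASH}/execute",
    json={"lua_source": lua, "url": "http://quotes.toscrape.com/js/"},
).text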
2. Hands-on example
After installing scrapy-splash, configure the project's settings.py file:
# settings.py (middlewares are referenced by dotted-string path, so no imports are needed)
BOT_NAME = 'splash_example'

SPIDER_MODULES = ['splash_example.spiders']
NEWSPIDER_MODULE = 'splash_example.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36 Edg/92.0.902.55'

ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 0.5

SPIDER_MIDDLEWARES = {
    # 'splash_example.middlewares.SplashExampleSpiderMiddleware': 543,
    # Avoid re-sending duplicate Splash arguments with every request
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DOWNLOADER_MIDDLEWARES = {
    # 'splash_example.middlewares.SplashExampleDownloaderMiddleware': 543,
    # Handle cookies through Splash and route requests to the Splash server
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Deduplicate requests with their Splash arguments taken into account
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# Address of the Splash server started in Docker
SPLASH_URL = "http://localhost:8050"
Then simply issue requests with SplashRequest(); its main parameters are url, args, and cache_args:
import scrapy
import scrapy_splash


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/js/']

    def start_requests(self):
        # Route the start URLs through Splash so the JS-built page is rendered
        for url in self.start_urls:
            yield scrapy_splash.SplashRequest(url)

    def parse(self, response):
        for sel in response.xpath("//div[@class='quote']"):
            quote = sel.xpath("./span[1]/text()").extract_first()
            author = sel.xpath("string(./span[2])").extract_first()
            yield {
                'quote': quote,
                'author': author,
            }
        # Follow the "Next" link, again through Splash
        href = response.xpath("//li[@class='next']/a/@href").extract_first()
        if href:
            url = response.urljoin(href)
            yield scrapy_splash.SplashRequest(url)
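If a page needs more time to finish running its JavaScript before Scrapy sees it, SplashRequest accepts an args dict that is forwarded to the Splash endpoint. A minimal hypothetical variant of the spider above (the 0.5-second wait is an illustrative value to tune per site):

import scrapy
import scrapy_splash


class SlowQuotesSpider(scrapy.Spider):
    # Hypothetical variant for pages whose JavaScript needs extra
    # time to run before the HTML snapshot is taken.
    name = 'quotes_slow'
    start_urls = ['http://quotes.toscrape.com/js/']

    def start_requests(self):
        for url in self.start_urls:
            # args is forwarded to the Splash endpoint; 'wait' is in seconds
            yield scrapy_splash.SplashRequest(url, args={'wait': 0.5})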
And of course, the Splash container from step 1 must be running in Docker while the spider crawls.