Scraping Dynamic Web Pages with Scrapy
Dynamic web pages build or modify their HTML with JavaScript after loading, so a plain Scrapy request only sees the bare page shell.
They can be crawled by rendering them first with the Splash rendering engine.
1. Installing Splash
Install it with Docker.
Note: if you are on Windows Home edition the installation steps are different; see the guide Scrapy-Splash的安装(windows篇) for that case.
Since I am on the Education edition, installing Docker Desktop directly is enough:
Docker Desktop for Mac and Windows | Docker
(The tutorials online are a bit scattered; I found this one quite good.)
After installing Docker Desktop and restarting, you may be prompted to update WSL 2; just update it from the official page:
在 Windows 10 上安装 WSL | Microsoft Docs
Once that is done, run the following in PowerShell:
docker run -d -p 8050:8050 scrapinghub/splash
Once the container is up, a welcome page should appear at http://localhost:8050 (pulling the image can take quite a while the first time).
This blog post is also worth a look:
Splash 简介与安装 - 孔雀东南飞 - 博客园 (cnblogs.com)
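To confirm that Splash is actually serving, here is a quick sanity check from Python, assuming the default port mapping from the docker run command above:

import requests

# The Splash container serves an info page at its root URL;
# a 200 here means the rendering service is up.
resp = requests.get("http://localhost:8050/")
print(resp.status_code)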
Next, install scrapy-splash, the Python library for driving Splash from Scrapy:
pip install scrapy-splash
The Splash documentation:
Splash - A javascript rendering service — Splash 3.5 documentation
Service endpoints
render.html — renders a JavaScript page and returns the resulting HTML; the url parameter is the address to render.
execute — runs a user-defined rendering script (Lua) that can execute JavaScript inside the page; call it with a POST request, passing the script in the lua_source parameter.
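As a quick illustration of both endpoints, here is a minimal sketch using the requests library (the target URL and the one-second waits are just illustrative values):

import requests

SPLASH = "http://localhost:8050"

# render.html: a GET request with the target page in the url parameter;
# Splash returns the HTML after the page's JavaScript has run.
html = requests.get(
    f"{SPLASH}/render.html",
    params={"url": "http://quotes.toscrape.com/js/", "wait": 1},
).text

# execute: a POST request carrying a custom Lua script in lua_source;
# extra JSON fields (like url here) are exposed to the script via args.
lua = """
function main(splash, args)
    splash:go(args.url)
    splash:wait(1)
    return splash:html()
end
"""
html = requests.post(
    f"{SPLASH}/execute",
    json={"lua_source": lua, "url": "http://quotes.toscrape.com/js/"},
).text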
2. Hands-on example
After installing scrapy-splash, configure the project's settings.py file:
# settings.py (middlewares are referenced by dotted-string path, so no imports are needed)
BOT_NAME = 'splash_example'

SPIDER_MODULES = ['splash_example.spiders']
NEWSPIDER_MODULE = 'splash_example.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36 Edg/92.0.902.55'

ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 0.5

SPIDER_MIDDLEWARES = {
    # 'splash_example.middlewares.SplashExampleSpiderMiddleware': 543,
    # Avoid re-sending duplicate Splash arguments with every request
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DOWNLOADER_MIDDLEWARES = {
    # 'splash_example.middlewares.SplashExampleDownloaderMiddleware': 543,
    # Handle cookies through Splash and route requests to the Splash server
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Deduplicate requests with their Splash arguments taken into account
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# Address of the Splash server started in Docker
SPLASH_URL = "http://localhost:8050"
Then simply issue requests with SplashRequest(); its main parameters are url, args, and cache_args:
import scrapy
import scrapy_splash


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/js/']

    def start_requests(self):
        # Route the start URLs through Splash so the JS-built page is rendered
        for url in self.start_urls:
            yield scrapy_splash.SplashRequest(url)

    def parse(self, response):
        for sel in response.xpath("//div[@class='quote']"):
            quote = sel.xpath("./span[1]/text()").extract_first()
            author = sel.xpath("string(./span[2])").extract_first()
            yield {
                'quote': quote,
                'author': author,
            }
        # Follow the "Next" link, again through Splash
        href = response.xpath("//li[@class='next']/a/@href").extract_first()
        if href:
            url = response.urljoin(href)
            yield scrapy_splash.SplashRequest(url)
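If a page needs more time to finish running its JavaScript before Scrapy sees it, SplashRequest accepts an args dict that is forwarded to the Splash endpoint. A minimal hypothetical variant of the spider above (the 0.5-second wait is an illustrative value to tune per site):

import scrapy
import scrapy_splash


class SlowQuotesSpider(scrapy.Spider):
    # Hypothetical variant for pages whose JavaScript needs extra
    # time to run before the HTML snapshot is taken.
    name = 'quotes_slow'
    start_urls = ['http://quotes.toscrape.com/js/']

    def start_requests(self):
        for url in self.start_urls:
            # args is forwarded to the Splash endpoint; 'wait' is in seconds
            yield scrapy_splash.SplashRequest(url, args={'wait': 0.5})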
And of course, the Splash container from step 1 must be running in Docker while the spider crawls.