A Detailed Walkthrough of Scraping 素材港
"Those of old who accomplished great things had not only extraordinary talent, but also unwavering perseverance." I've always been short on good words and phrases, so today let's scrape a material site (素材港), save everything to files, and browse them whenever there's time — hopefully it will enrich my humble knowledge base.
- First, take a look at the page structure. Site address: http://www.sucaixiang.com/
The page is put together quite nicely — tidy in some places, less so in others. Hopefully scraping this site will expand my vocabulary, haha.
As you can see, the homepage is itself the list page, so that's what we'll scrape. First, capture the requests and take a look.
The data is right there in the HTML. Next, check what the detail page data looks like.
The detail page data is also in the HTML, and the link address appears to be a category ID. Go back to the list page source and check.
Sure enough — it's the href attribute of the a tags in the list. Now everything is clear.
2. Request the list page and parse out each category
```python
import requests
from lxml import etree

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    # 'Cookie': 'PHPSESSID=scbu6lb3p48tbebr6ipp86e9i2',
    'Pragma': 'no-cache',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36',
}
response = requests.get('http://www.sucaixiang.com/', headers=headers, verify=False)
res = etree.HTML(response.text)
# Each category is an <a> tag inside the lexicon list
data_list = res.xpath("//div[@class='lexicon-list am-avg-sm-2 am-avg-md-4 am-avg-lg-4']/div[@class='lexicon-items']/div/a")
for data in data_list:
    # The title may be split across several text nodes, so join them all
    title = ''.join(data.xpath(".//text()"))
    url = data.xpath("./@href")[0]
    print(title, url)
```
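The href parsed here may be a relative path rather than a full URL. Before requesting it, it is safer to resolve it against the site root with the standard library's `urllib.parse.urljoin` (the sample href below is hypothetical, just for illustration):

```python
from urllib.parse import urljoin

base = 'http://www.sucaixiang.com/'
href = '/lexicon/123.html'  # hypothetical href value from the list page
full_url = urljoin(base, href)
print(full_url)  # http://www.sucaixiang.com/lexicon/123.html
```

`urljoin` also passes absolute URLs through unchanged, so it is safe to apply to every href regardless of its form.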
OK. Next, request the detail page link, fetch the detail data, and parse out each material sentence.
```python
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    # 'Cookie': 'PHPSESSID=scbu6lb3p48tbebr6ipp86e9i2',
    'Pragma': 'no-cache',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36',
}
# `url` comes from the list-page loop above
response = requests.get(url, headers=headers, verify=False)
res = etree.HTML(response.text)
# Each material sentence sits in a <span> inside the content list
data_list = res.xpath("//div[@class='lexicon-content']/ul/li/span")
for data in data_list:
    content = data.xpath("./text()")[0]
    print(content)
```
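One caveat: `data.xpath("./text()")[0]` raises an IndexError if a span has no direct text node. A small guard helper (a hypothetical name, not part of the original post) keeps the loop alive; the logic is sketched on plain lists so it runs without fetching the page:

```python
def first_text(nodes):
    """Return the first non-empty stripped string from an xpath text() result,
    or None if there is none — mirrors guarding data.xpath('./text()')[0]."""
    for t in nodes:
        s = t.strip()
        if s:
            return s
    return None

print(first_text(['  人生如逆旅,我亦是行人  ']))  # 人生如逆旅,我亦是行人
print(first_text([]))                             # None
```

In the scraping loop you would call `first_text(data.xpath("./text()"))` and skip the item when it returns None.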
Finally, write the materials into a file named after the category.
```python
data_list = res.xpath("//div[@class='lexicon-content']/ul/li/span")
with open(f'{title}.txt', 'a', encoding='utf-8') as f:
    for data in data_list:
        content = data.xpath("./text()")[0]
        print(content)
        f.write(content)
        f.write('\n')
```
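Since the category title becomes the filename, a title containing characters like `/` or `?` would break `open()`. A minimal sketch of a sanitizing helper (hypothetical name, not from the original script):

```python
import re

def safe_filename(title: str) -> str:
    """Replace characters that are illegal in Windows/Unix filenames
    with underscores; fall back to 'untitled' for blank titles.
    Hypothetical helper, assumed here for robustness."""
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip() or 'untitled'

print(safe_filename('好词/好句?'))  # 好词_好句_
```

You would then open `f'{safe_filename(title)}.txt'` instead of `f'{title}.txt'`.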
Perfect. Now I can study these words and phrases whenever I have a spare moment.
Reposted from: https://juejin.cn/post/7226503380242579514