Scraping Game News with Python's Scrapy: Project Setup and a Runnable Example
Preface
Since Black Myth: Wukong came out I've put in more than 70 hours, which pretty much fulfilled my childhood dream of being the Great Sage! Because I've been buried in the game, I haven't followed any of the recent Black Myth news at all, so I figured I'd fetch the latest related news, and that happens to be a perfect job for Python's scrapy library. In short: use scrapy to crawl articles from a news site.
Installation
The scrapy library can be installed with pip:
pip install scrapy
Creating the project
- Create a project named game_news in the current directory:
scrapy startproject game_news
- Inside the project, create a spider named youmin for the domain gamersky.com:
scrapy genspider youmin gamersky.com
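For reference, scrapy genspider drops a skeleton spider into game_news/spiders/youmin.py, roughly like the following (the exact template varies a little between Scrapy versions):

import scrapy


class YouminSpider(scrapy.Spider):
    # Skeleton roughly as generated by `scrapy genspider`; the exact template
    # (e.g. the scheme used in start_urls) depends on the Scrapy version.
    name = "youmin"
    allowed_domains = ["gamersky.com"]
    start_urls = ["https://gamersky.com"]

    def parse(self, response):
        pass

The sections below replace this skeleton with the actual crawling logic.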
The code
settings.py
BOT_NAME = "game_news"
SPIDER_MODULES = ["game_news.spiders"]
NEWSPIDER_MODULE = "game_news.spiders"
SPIDER_MAX_PAGES = 10
USER_AGENT = "game news"
ROBOTSTXT_OBEY = False
JSON_FILE_NAME = "news.json"
ITEM_PIPELINES = {
"game_news.pipelines.JsonPipeline": 300,
}
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 1
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
pipelines.py
import json


class JsonPipeline:
    """Write each scraped item to a JSON Lines file."""

    def __init__(self, file_name):
        self.file_name = file_name

    @classmethod
    def from_crawler(cls, crawler):
        # Pull the output file name from the custom JSON_FILE_NAME setting
        return cls(
            file_name=crawler.settings.get("JSON_FILE_NAME", "news.json")
        )

    def open_spider(self, spider):
        self.file = open(self.file_name, "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # One JSON object per line; ensure_ascii=False keeps Chinese text readable
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
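As an aside, for a plain JSON Lines dump like this, Scrapy's built-in feed exports can do the same job without a custom pipeline. A minimal sketch, assuming Scrapy 2.1 or newer:

# settings.py — built-in alternative to the custom JsonPipeline (assumes Scrapy >= 2.1)
FEEDS = {
    "news.json": {
        "format": "jsonlines",   # one JSON object per line
        "encoding": "utf-8",
    },
}

The custom pipeline above does the same thing while keeping the output logic explicit and easy to tweak.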
items.py
import scrapy


class GameNewsItem(scrapy.Item):
    title = scrapy.Field()
    description = scrapy.Field()
    time = scrapy.Field()
    image_url = scrapy.Field()
    article_url = scrapy.Field()
    origin = scrapy.Field()
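GameNewsItem behaves like a dict with a fixed set of allowed keys, which is why the pipeline can simply call dict(item). A quick illustration (the field values here are made up):

from game_news.items import GameNewsItem

item = GameNewsItem()
item["title"] = "some headline"   # made-up example value
item["origin"] = "gamersky"

print(dict(item))                 # {'title': 'some headline', 'origin': 'gamersky'}
# item["author"] = "..."          # would raise KeyError: the field is not declared on the Item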
spiders/youmin.py
import json

import scrapy

from ..items import GameNewsItem


class YouminSpider(scrapy.Spider):
    name = "youmin"
    allowed_domains = ["gamersky.com"]

    def start_requests(self):
        # GamerSky's news list is served by a JSONP-style Ajax endpoint, paged via "page"
        url = 'https://db2.gamersky.com/LabelJsonpAjax.aspx?jsondata={"type":"updatenodelabel", "nodeId":"11007", "page":1}'
        yield scrapy.Request(
            url=url,
            callback=self.parse,
            meta={"page": 1}
        )

    def parse(self, response):
        # Strip the JSONP-style wrapper (leading "(" and trailing ");") to get plain JSON
        json_data = json.loads(response.text[1:-2])
        # The "body" field is an HTML fragment; parse it with a Selector
        html_content = json_data["body"]
        new_selector = scrapy.selector.Selector(text=html_content)
        for item in new_selector.xpath('//li'):
            game_item = GameNewsItem()
            game_item["title"] = item.xpath('.//a[@class="tt"]/text()').get()
            game_item["article_url"] = item.xpath('.//a[@class="tt"]/@href').get()
            game_item["description"] = item.xpath('.//div[@class="txt"]/text()').get()
            game_item["time"] = item.xpath('.//div[@class="time"]/text()').get()
            game_item["image_url"] = item.xpath('.//img/@src').get()
            game_item["origin"] = "gamersky"
            yield game_item

        # Follow the next page until SPIDER_MAX_PAGES is reached
        page = response.meta["page"] + 1
        url = f'https://db2.gamersky.com/LabelJsonpAjax.aspx?jsondata={{"type":"updatenodelabel", "nodeId":"11007", "page":{page} }}'
        if page < self.settings.get("SPIDER_MAX_PAGES", 50):
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                meta={"page": page}
            )
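Incidentally, the spider does not have to be started through the scrapy CLI used in the next section; it can also be launched from a plain Python script. A minimal sketch (the run.py name and its placement in the project root, next to scrapy.cfg, are assumptions):

# run.py — hypothetical helper script in the project root (next to scrapy.cfg)
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # loads settings.py
process.crawl("youmin")                           # spider name, i.e. YouminSpider.name
process.start()                                   # blocks until the crawl finishes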
Results
Run:
scrapy crawl youmin
After the crawl finishes, a data file named news.json is generated in the project root; it contains the news items from GamerSky.
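Because JsonPipeline writes one JSON object per line (JSON Lines), the file can be loaded back with a few lines of standard-library code:

import json

# news.json is produced by JsonPipeline: one JSON object per line
with open("news.json", encoding="utf-8") as f:
    news = [json.loads(line) for line in f if line.strip()]

for entry in news[:5]:
    print(entry["time"], entry["title"])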
The project's git repository is linked below, if you're interested.
Reposted from: https://juejin.cn/post/7408742801851924489