Scraping Game News with Python's Scrapy: Project Setup and a Runnable Example
Preface
Since Black Myth: Wukong came out I've put in more than 70 hours, which pretty much fulfilled my childhood dream of being the Great Sage! Because I've been buried in the game, I haven't followed any of the recent Black Myth news at all, so I figured I'd fetch the latest related news, and that happens to be a perfect job for Python's scrapy library. In short: use scrapy to crawl articles from a news site.
Installation
The scrapy library can be installed with pip:
pip install scrapy
Creating the project
- Create a project named game_news in the current directory:
scrapy startproject game_news
- Inside the project, create a spider named youmin for the domain gamersky.com:
scrapy genspider youmin gamersky.com
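For reference, scrapy genspider drops a skeleton spider into game_news/spiders/youmin.py, roughly like the following (the exact template varies a little between Scrapy versions):

import scrapy


class YouminSpider(scrapy.Spider):
    # Skeleton roughly as generated by `scrapy genspider`; the exact template
    # (e.g. the scheme used in start_urls) depends on the Scrapy version.
    name = "youmin"
    allowed_domains = ["gamersky.com"]
    start_urls = ["https://gamersky.com"]

    def parse(self, response):
        pass

The sections below replace this skeleton with the actual crawling logic.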
The code
settings.py
BOT_NAME = "game_news"
SPIDER_MODULES = ["game_news.spiders"]
NEWSPIDER_MODULE = "game_news.spiders"
SPIDER_MAX_PAGES = 10
USER_AGENT = "game news"
ROBOTSTXT_OBEY = False
JSON_FILE_NAME = "news.json"
ITEM_PIPELINES = {
"game_news.pipelines.JsonPipeline": 300,
}
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 1
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
pipelines.py
import json


class JsonPipeline:
    """Write each scraped item to a JSON Lines file."""

    def __init__(self, file_name):
        self.file_name = file_name

    @classmethod
    def from_crawler(cls, crawler):
        # Pull the output file name from the custom JSON_FILE_NAME setting
        return cls(
            file_name=crawler.settings.get("JSON_FILE_NAME", "news.json")
        )

    def open_spider(self, spider):
        self.file = open(self.file_name, "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # One JSON object per line; ensure_ascii=False keeps Chinese text readable
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
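As an aside, for a plain JSON Lines dump like this, Scrapy's built-in feed exports can do the same job without a custom pipeline. A minimal sketch, assuming Scrapy 2.1 or newer:

# settings.py — built-in alternative to the custom JsonPipeline (assumes Scrapy >= 2.1)
FEEDS = {
    "news.json": {
        "format": "jsonlines",   # one JSON object per line
        "encoding": "utf-8",
    },
}

The custom pipeline above does the same thing while keeping the output logic explicit and easy to tweak.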
items.py
import scrapy


class GameNewsItem(scrapy.Item):
    title = scrapy.Field()
    description = scrapy.Field()
    time = scrapy.Field()
    image_url = scrapy.Field()
    article_url = scrapy.Field()
    origin = scrapy.Field()
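GameNewsItem behaves like a dict with a fixed set of allowed keys, which is why the pipeline can simply call dict(item). A quick illustration (the field values here are made up):

from game_news.items import GameNewsItem

item = GameNewsItem()
item["title"] = "some headline"   # made-up example value
item["origin"] = "gamersky"

print(dict(item))                 # {'title': 'some headline', 'origin': 'gamersky'}
# item["author"] = "..."          # would raise KeyError: the field is not declared on the Item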
spiders/youmin.py
import json

import scrapy

from ..items import GameNewsItem


class YouminSpider(scrapy.Spider):
    name = "youmin"
    allowed_domains = ["gamersky.com"]

    def start_requests(self):
        # GamerSky's news list is served by a JSONP-style Ajax endpoint, paged via "page"
        url = 'https://db2.gamersky.com/LabelJsonpAjax.aspx?jsondata={"type":"updatenodelabel", "nodeId":"11007", "page":1}'
        yield scrapy.Request(
            url=url,
            callback=self.parse,
            meta={"page": 1}
        )

    def parse(self, response):
        # Strip the JSONP-style wrapper (leading "(" and trailing ");") to get plain JSON
        json_data = json.loads(response.text[1:-2])
        # The "body" field is an HTML fragment; parse it with a Selector
        html_content = json_data["body"]
        new_selector = scrapy.selector.Selector(text=html_content)
        for item in new_selector.xpath('//li'):
            game_item = GameNewsItem()
            game_item["title"] = item.xpath('.//a[@class="tt"]/text()').get()
            game_item["article_url"] = item.xpath('.//a[@class="tt"]/@href').get()
            game_item["description"] = item.xpath('.//div[@class="txt"]/text()').get()
            game_item["time"] = item.xpath('.//div[@class="time"]/text()').get()
            game_item["image_url"] = item.xpath('.//img/@src').get()
            game_item["origin"] = "gamersky"
            yield game_item

        # Follow the next page until SPIDER_MAX_PAGES is reached
        page = response.meta["page"] + 1
        url = f'https://db2.gamersky.com/LabelJsonpAjax.aspx?jsondata={{"type":"updatenodelabel", "nodeId":"11007", "page":{page} }}'
        if page < self.settings.get("SPIDER_MAX_PAGES", 50):
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                meta={"page": page}
            )
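Incidentally, the spider does not have to be started through the scrapy CLI used in the next section; it can also be launched from a plain Python script. A minimal sketch (the run.py name and its placement in the project root, next to scrapy.cfg, are assumptions):

# run.py — hypothetical helper script in the project root (next to scrapy.cfg)
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # loads settings.py
process.crawl("youmin")                           # spider name, i.e. YouminSpider.name
process.start()                                   # blocks until the crawl finishes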
Results
Run:
scrapy crawl youmin
After the crawl finishes, a data file named news.json is generated in the project root; it contains the news items from GamerSky.
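Because JsonPipeline writes one JSON object per line (JSON Lines), the file can be loaded back with a few lines of standard-library code:

import json

# news.json is produced by JsonPipeline: one JSON object per line
with open("news.json", encoding="utf-8") as f:
    news = [json.loads(line) for line in f if line.strip()]

for entry in news[:5]:
    print(entry["time"], entry["title"])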
The project's git repository is linked below, if you're interested.
Reposted from: https://juejin.cn/post/7408742801851924489