
Scraping game news with Python's Scrapy: project setup walkthrough and a runnable example

By 站长 · 20 reads

Preface

Since Black Myth: Wukong came out I've put in over 70 hours, which finally fulfilled my childhood Monkey King dream! Because I'd been buried in the game, I hadn't followed any of the recent Black Myth news at all, so I figured I'd fetch the latest related articles — and it was a perfect excuse to use Python's Scrapy library. In short: use Scrapy to crawl news items from a gaming news site!


Installation

Scrapy installs with pip:

pip install scrapy

Creating the project

  1. Create a project named game_news in the current directory:
scrapy startproject game_news
  2. Inside the project, generate a spider named youmin for the domain gamersky.com:
scrapy genspider youmin gamersky.com

The code

  1. settings.py
BOT_NAME = "game_news"

SPIDER_MODULES = ["game_news.spiders"]
NEWSPIDER_MODULE = "game_news.spiders"

SPIDER_MAX_PAGES = 10

USER_AGENT = "game news"

ROBOTSTXT_OBEY = False

JSON_FILE_NAME = "news.json"


ITEM_PIPELINES = {
   "game_news.pipelines.JsonPipeline": 300,
}

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 1

REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
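Note that SPIDER_MAX_PAGES and JSON_FILE_NAME are not built-in Scrapy settings — they are custom keys that our own components read back later via settings.get() with a fallback default. That lookup behaves like a plain dict .get(); a minimal sketch with a stdlib stand-in (real code goes through crawler.settings):

```python
# Stdlib stand-in for crawler.settings; Scrapy components use the same
# key/default semantics via crawler.settings.get().
settings = {"SPIDER_MAX_PAGES": 10, "JSON_FILE_NAME": "news.json"}

max_pages = settings.get("SPIDER_MAX_PAGES", 50)     # value from settings.py wins
file_name = settings.get("MISSING_KEY", "fallback")  # key absent, default wins
```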
  2. pipelines.py
import json


class JsonPipeline:
    def __init__(self, file_name):
        self.file_name = file_name

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            file_name=crawler.settings.get("JSON_FILE_NAME", "news.json")
        )

    def open_spider(self, spider):
        self.file = open(self.file_name, "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
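Because process_item appends one JSON object per line, news.json ends up in JSON Lines format rather than a single JSON array, so it is read back line by line. A small sketch with made-up records shaped like the pipeline's output:

```python
import json

# Hypothetical lines, shaped like what JsonPipeline writes (one object per line)
lines = [
    '{"title": "示例资讯", "origin": "gamersky"}',
    '{"title": "another item", "origin": "gamersky"}',
]

# Parse each line independently — this is the JSON Lines reading pattern
items = [json.loads(line) for line in lines]
print(items[0]["title"])  # 示例资讯
```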
  3. items.py
import scrapy


class GameNewsItem(scrapy.Item):
    title = scrapy.Field()
    description = scrapy.Field()
    time = scrapy.Field()
    image_url = scrapy.Field()
    article_url = scrapy.Field()
    origin = scrapy.Field()
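A scrapy.Item behaves like a dict restricted to its declared fields, and the pipeline above serializes it with dict(item). A hypothetical exported record (all values invented for illustration) would have this shape:

```python
# Hypothetical record with the same six fields as GameNewsItem
# (titles and URLs are made up, not real scraped data)
record = {
    "title": "《黑神话:悟空》更新公告",
    "description": "示例摘要",
    "time": "08-25",
    "image_url": "https://img.example.com/cover.jpg",
    "article_url": "https://www.gamersky.com/news/example.shtml",
    "origin": "gamersky",
}

# The field set matches the Item declaration exactly
assert set(record) == {"title", "description", "time",
                       "image_url", "article_url", "origin"}
```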

  4. spiders/youmin.py

import scrapy
import json
from ..items import GameNewsItem


class YouminSpider(scrapy.Spider):
    name = "youmin"
    allowed_domains = ["gamersky.com"]

    def start_requests(self):
        url = 'https://db2.gamersky.com/LabelJsonpAjax.aspx?jsondata={"type":"updatenodelabel", "nodeId":"11007", "page":1}'
        yield scrapy.Request(
            url=url,
            callback=self.parse,
            meta={"page": 1}
        )

    def parse(self, response):
        # Strip the JSONP wrapper: leading "(" and trailing ");"
        json_data = json.loads(response.text[1:-2])
        html_content = json_data["body"]
        new_selector = scrapy.selector.Selector(text=html_content)

        for item in new_selector.xpath('//li'):
            game_item = GameNewsItem()
            game_item["title"] = item.xpath('.//a[@class="tt"]/text()').get()
            game_item["article_url"] = item.xpath('.//a[@class="tt"]/@href').get()
            game_item["description"] = item.xpath('.//div[@class="txt"]/text()').get()
            game_item["time"] = item.xpath('.//div[@class="time"]/text()').get()
            game_item["image_url"] = item.xpath('.//img/@src').get()
            game_item["origin"] = "gamersky"
            yield game_item

        page = response.meta["page"] + 1
        url = f'https://db2.gamersky.com/LabelJsonpAjax.aspx?jsondata={{"type":"updatenodelabel", "nodeId":"11007", "page":{page} }}'
        if page < self.settings.get("SPIDER_MAX_PAGES", 50):
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                meta={"page": page}
            )
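The slice response.text[1:-2] assumes the endpoint returns a JSONP-style body of the form ({...}); — one leading parenthesis and a trailing ");". If that assumption holds, the unwrapping works like this (synthetic payload, not a real response; "body" matches the field parse() reads, while "status" and "totalPages" are invented for the example):

```python
import json

# Synthetic JSONP-style body mimicking the assumed shape: "(" + JSON + ");"
raw = '({"status": 1, "body": "<li>示例</li>", "totalPages": 3});'

# raw[1:-2] drops the leading "(" and the trailing ");", leaving bare JSON
data = json.loads(raw[1:-2])
print(data["body"])  # <li>示例</li>
```

If the site ever changes the wrapper (e.g. adds a callback name), the fixed slice would break; splitting on the first "(" and last ")" would be a more defensive variant.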

Results

Running

scrapy crawl youmin

After the run finishes, a file named news.json appears in the project root, containing the Gamersky (游民星空) news items. The project's Git repository is linked below for anyone interested:

瞎老弟's GitHub

Reposted from: https://juejin.cn/post/7408742801851924489