One more Python crawler framework, Ruia: another way to get the job done.

Site admin

March 25, 2023 11:21 · 128 views

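Writing crawlers for readers will make up the last 10 posts of this series. If there is a site you would like scraped, tell me in the comments. After the "100 Crawler Examples" course wraps up, I will keep publishing more fun crawler posts and Python video lessons.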

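Before we start: Ruia

This post covers Ruia, an asynchronous crawler framework built on asyncio and aiohttp. I will skip the long introduction and go straight to the GitHub repository: github.com/howie6879/r…

The GitHub page lists several features. We will try them out in a moment and see whether the developer is overselling: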

Easy: Declarative programming
Fast: Powered by asyncio
Extensible: Middlewares and plugins
Powerful: JavaScript support

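Installation is simple: pip install -U ruia

If you are wondering what -U means: it is short for --upgrade, which upgrades the package to the latest version if it is already installed.

Open the documentation and you will find the following passage: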

Ruia is An asynchronous web scraping micro-framework, powered by asyncio and aiohttp, aims at making crawling url as convenient as possible. Write less, run faster is Ruia's philosophy.

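Time to show off my English again. That is a lot of words, but the core message is: write less, run faster.

For everything else, see the documentation at docs.python-ruia.org/index.html, which is fairly detailed.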
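To make the "Fast: Powered by asyncio" claim a little more concrete, here is a minimal standard-library sketch of why an asyncio-based crawler can overlap its waiting time. It does not use Ruia at all: fake_fetch is a hypothetical stand-in that sleeps instead of making a real HTTP request, and the names and delay are assumptions made for illustration only.

```python
import asyncio
import time


async def fake_fetch(page, delay=0.1):
    """Hypothetical stand-in for one HTTP request: wait, then return a label."""
    await asyncio.sleep(delay)  # simulates network latency without blocking the loop
    return f"page-{page}"


async def crawl(pages):
    """Start every fake request at once and wait for all of them together."""
    return await asyncio.gather(*(fake_fetch(p) for p in pages))


def timed_crawl(pages):
    """Run the crawl and report how long it took in seconds."""
    start = time.perf_counter()
    results = asyncio.run(crawl(pages))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = timed_crawl(range(5))
    # Five 0.1 s waits overlap, so the total stays well below 0.5 s.
    print(results, f"{elapsed:.2f}s")
```

Because the five simulated requests wait concurrently, the total time stays close to one delay rather than five. That is the general idea behind running a crawler on an event loop, not a measurement of Ruia itself.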

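Coding time

Ruia splits writing a crawler into three steps, and we will follow them one by one. Today's target is the bcy.net (半次元) site. Its ranking page is https://bcy.net/illust/toppost100, but did you think I was going to scrape those images? No: I clicked the "Writing" ranking tab (写作榜) instead, so the address changes to https://bcy.net/novel/toppost100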

Define item

Define a data item. Note that the official docs use css_select, that is, CSS selectors, so we will do the same.

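from ruia import Item, TextField, AttrField


class BCYItem(Item):
    # Each div.rank-index-item is one repeated block (one ranked entry)
    target_item = TextField(css_select='div.rank-index-item')
    # many=True collects every matching title inside the block as a list
    titles = TextField(css_select='h2.rank-item-title', many=True)

    # Fields left over from the official Hacker News example, kept for reference:
    # title = TextField(css_select='a.storylink')
    # url = AttrField(css_select='a.storylink', attr='href')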

Pay special attention to target_item here: it cannot be omitted. The official manual explains this field as follows:

target_item is a built-in Ruia field, indicates that the HTML element matched by its selectors contains one item. In this example, we are crawling a catalogue of Hacker News, and there are many news items in one page. target_item tells Ruia to focus on these HTML elements when extracting field.

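My English comes in handy once more, so here is my reading of it:

target_item is a built-in Ruia field. The HTML element matched by its selector contains exactly one item. For example, when we crawl the listing page of a news site, one page holds many news entries, and target_item tells Ruia to focus on those HTML elements when extracting fields.

If that still sounds convoluted, put it another way: target_item first picks out the repeated regions of the page, and the individual fields are then extracted from inside each region. On bcy.net, if you open the page source you will see a series of div.rank-index-item blocks. target_item marks those div elements first, and every later field extraction works relative to them. Oddly enough, Ruia does have Chinese documentation, but I could not find the passage I just translated in it.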
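The "repeated region first, fields second" idea can be sketched with the standard library alone. The snippet below is not how Ruia is implemented. It only mimics the two-stage lookup using xml.etree.ElementTree on a small, hand-written, well-formed sample. SAMPLE_HTML and extract_titles are hypothetical names, and the markup is a simplified assumption based on the two selectors used in this post, not the real bcy.net page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup modeled on the selectors used in this post.
SAMPLE_HTML = """
<div class="rank-list">
  <div class="rank-index-item"><h2 class="rank-item-title">First story</h2></div>
  <div class="rank-index-item"><h2 class="rank-item-title">Second story</h2></div>
</div>
"""


def extract_titles(markup):
    """Stage 1: find each repeated block. Stage 2: read fields inside that block."""
    root = ET.fromstring(markup.strip())
    titles = []
    # Stage 1: the role target_item plays, one match per repeated region
    for block in root.findall(".//div[@class='rank-index-item']"):
        # Stage 2: the field lookup is relative to the current block only
        heading = block.find("./h2[@class='rank-item-title']")
        if heading is not None and heading.text:
            titles.append(heading.text.strip())
    return titles


if __name__ == "__main__":
    print(extract_titles(SAMPLE_HTML))
```

Scoping the second lookup to each block is what keeps a title paired with the entry it belongs to once more fields (link, author, and so on) are added.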

Test item

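Testing is straightforward. Note that the code below differs slightly from the code above: I changed part of the BCYItem class (the many=True field titles became a single title field), so compare the two yourself.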

from ruia import Item, TextField, AttrField
import asyncio

class BCYItem(Item):
    target_item = TextField(css_select='div.rank-index-item')
    title = TextField(css_select='h2.rank-item-title')

async def test_item():
    url = 'https://bcy.net/novel/toppost100'
    async for item in BCYItem.get_items(url=url):
        print(item.title)

if __name__ == '__main__':
    # asyncio.run requires Python 3.7 or later
    asyncio.run(test_item())
    # For Python 3.6:
    # loop = asyncio.get_event_loop()
    # loop.run_until_complete(test_item())

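Write spider and Run

While working out the details I found that this site loads more content as you scroll down instead of using pagination. I will not analyze it in depth, since it is simple enough: you just need to find the API.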

bcy.net/apiv3/rank/…

To keep things simple, we will only scrape one page.

from ruia import Item, TextField, Spider
import aiofiles


class BCYItem(Item):
    target_item = TextField(css_select='div.rank-index-item')
    title = TextField(css_select='h2.rank-item-title')


class BCYSpider(Spider):
    concurrency = 2
    # start_urls = [f'https://youweb.com/news?p={index}' for index in range(3)]
    start_urls = ['https://bcy.net/novel/toppost100']

    async def parse(self, response):
        # Version-dependent: newer Ruia docs use `await response.text()`
        # here instead of `response.html`; check your installed version.
        async for item in BCYItem.get_items(html=response.html):
            yield item

    async def process_item(self, item: BCYItem):
        """Ruia built-in method, called for every item yielded by parse."""
        # Append one title per line to a local text file
        async with aiofiles.open('./bcy.txt', 'a', encoding='utf-8') as f:
            await f.write(str(item.title) + '\n')


if __name__ == '__main__':
    BCYSpider.start()

After running it, the data is scraped successfully. Note that the code above needs one extra package, aiofiles, installed first: pip install aiofiles
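If you later want more than one page, the commented-out start_urls line in the spider hints at the usual approach: generate one URL per page. Below is a small sketch of that idea. The base URL and the page parameter name p are placeholders, not the real bcy.net API (whose address is not reproduced here), and build_start_urls is a hypothetical helper.

```python
from urllib.parse import urlencode


def build_start_urls(base_url, pages, page_param="p"):
    """Return one URL per page number, counting from 1.

    base_url and page_param are assumptions for illustration; replace them
    with the endpoint and parameter name you find in the browser's network tab.
    """
    return [
        f"{base_url}?{urlencode({page_param: page})}"
        for page in range(1, pages + 1)
    ]


if __name__ == "__main__":
    for url in build_start_urls("https://example.com/api/rank", 3):
        print(url)
```

The resulting list could be assigned to start_urls in a Spider subclass. Whether the response is HTML or JSON then determines how parse should read it.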

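Closing notes

The Chinese documentation is available at https://github.com/howie6879/ruia/tree/master/docs/cn. Having used the framework end to end, I find it quite capable, and I ran into very few exceptions while writing the code, so Ruia is now one more tool to reach for. This post does not cover Middleware or Plugin; if you have time, they are worth studying.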


Reposted from: https://juejin.cn/post/7134303786641653796