An asyncio-Based Asynchronous Web Scraping Framework for Python



Published: 2019-06-23. Source site: 脚本宝典


aspider
A web scraping micro-framework based on asyncio.
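aspider is a lightweight asynchronous scraping framework built on asyncio. It aims to make single-page scrapers quicker and easier to write, and it uses asynchronous I/O so that a scraper spends less time waiting on network requests.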
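To make the point about I/O wait time concrete, here is a minimal sketch that uses only the standard library. It does not use aspider at all: `fake_fetch`, `fetch_all`, and `timed_run` are hypothetical helpers, and `asyncio.sleep` stands in for a real network request. The idea is that three simulated requests of 0.2 seconds each finish in roughly 0.2 seconds in total, because their waits overlap.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.2):
    """Hypothetical stand-in for a network request; waits without blocking."""
    await asyncio.sleep(delay)
    return f"<html>{url}</html>"


async def fetch_all(urls):
    # gather() schedules every coroutine at once, so the waits overlap,
    # and it returns the results in the same order as the inputs.
    return await asyncio.gather(*(fake_fetch(u) for u in urls))


def timed_run(urls):
    """Run fetch_all() and return (pages, elapsed_seconds)."""
    start = time.perf_counter()
    pages = asyncio.run(fetch_all(urls))
    return pages, time.perf_counter() - start


if __name__ == "__main__":
    pages, elapsed = timed_run([f"https://example.com/{i}" for i in range(3)])
    print(f"{len(pages)} pages in {elapsed:.2f}s")
```

Fetching the same three URLs one after another would take about 0.6 seconds, which is the gap an asynchronous scraper tries to close.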
Introduction

pip install aspider
Item
For a single page, implementing an Item as defined by the framework is enough to extract the target data:

import asyncio

from aspider import Request

request = Request("https://news.ycombinator.com/")
response = asyncio.get_event_loop().run_until_complete(request.fetch())

# Output
# [2018-07-25 11:23:42,620]-Request-INFO  <GET: https://news.ycombinator.com/>
# <Response url[text]: https://news.ycombinator.com/ status:200 metadata:{}>
Spider
When there are many target pages and the crawl needs to go deeper, Spider comes in handy:

import aiofiles

from aspider import AttrField, TextField, Item, Spider


class HackerNewsItem(Item):
    target_item = TextField(css_select='tr.athing')
    title = TextField(css_select='a.storylink')
    url = AttrField(css_select='a.storylink', attr='href')

    async def clean_title(self, value):
        return value


class HackerNewsSpider(Spider):
    start_urls = ['https://news.ycombinator.com/', 'https://news.ycombinator.com/news?p=2']

    async def parse(self, res):
        # Parse every story row on the page into a HackerNewsItem.
        items = await HackerNewsItem.get_items(html=res.body)
        for item in items:
            async with aiofiles.open('./hacker_news.txt', 'a') as f:
                # One title per line.
                await f.write(item.title + '\n')


if __name__ == '__main__':
    HackerNewsSpider.start()
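As a rough mental model of the pattern above, the sketch below shows a toy Spider-style class built only on the standard library. It is not aspider's implementation: `MiniSpider`, `TitleSpider`, the simulated `fetch`, and the `parse(url, body)` signature are all hypothetical and chosen for illustration. It only shows the general shape: a list of `start_urls`, a `parse` hook that subclasses override, and a `start()` class method that drives everything concurrently.

```python
import asyncio


class MiniSpider:
    """Toy stand-in for a Spider-style class; not aspider's implementation."""

    start_urls = []

    def __init__(self):
        self.results = []

    async def fetch(self, url):
        # Hypothetical fetch: a real spider would issue an HTTP request here.
        await asyncio.sleep(0.01)
        return f"<title>{url}</title>"

    async def parse(self, url, body):
        # Subclasses override this hook to extract data from each response.
        raise NotImplementedError

    async def _handle(self, url):
        body = await self.fetch(url)
        await self.parse(url, body)

    async def run(self):
        # Every start URL is fetched and parsed concurrently.
        await asyncio.gather(*(self._handle(u) for u in self.start_urls))

    @classmethod
    def start(cls):
        spider = cls()
        asyncio.run(spider.run())
        return spider


class TitleSpider(MiniSpider):
    start_urls = ["https://example.com/a", "https://example.com/b"]

    async def parse(self, url, body):
        # Strip the surrounding <title> tags from the simulated body.
        title = body[len("<title>"):-len("</title>")]
        self.results.append(title)


if __name__ == "__main__":
    print(sorted(TitleSpider.start().results))
```

Because the pages are processed concurrently, the order in which results arrive is not guaranteed, so the usage line sorts them before printing.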
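JavaScript loading support
The Request class can also return the content of pages that only render after their JavaScript has loaded. The example below fetches such a page:
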
import asyncio
from aspider import Request

request = Request("https://www.jianshu.com/", load_js=True)
response = asyncio.get_event_loop().run_until_complete(request.fetch())
print(response.body)
If this looks useful, give it a try. The project is on GitHub: aspider