Saving Crawler Data with OOP - Python


Saving scraped data with an object-oriented design.

1. CSV

Code (the script fetches each page of the Douban Top 250 with requests, parses it with parsel's XPath selectors, and offers two CSV writers: csv.DictWriter and csv.writer):

  1 """
  2     豆瓣top250四种保存方式
  3 """
  4 import csv
  5 import random
  6 import time
  7 import parsel
  8 import requests
  9 
 10 class doubanSpider():
 11     # url = 'https://movie.douban.COM/top250'
 12     headers = {
 13         'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Applewebkit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
 14         'Cookie': 'cookie'
 15     }
 16     def __inIT__(self, url, headers=headers):
 17         self.url = url
 18         self.headers = headers
 19 
 20     def getHtml(self):
 21         response = requests.get(url=self.url, headers=self.headers)
 22         response.encoding = response.apparent_encoding
 23         response.encoding = 'utf-8'
 24         return response.text
 25 
 26     def parseHtmlByxpath(self):
 27         movieListDatas = []
 28         movieDictDatas = []
 29         selector = parsel.Selector(self.gethtml())
 30         results = selector.xpath('//div/ol/li')
 31         for item in results:
 32             title = item.xpath('.//div[@class="hd"]/a/span[1]/text()').get()
 33             movieinfo = item.xpath('.//div[@class="bd"]/p/text()').getall()
 34             director = movieInfo[0].split('   ')[0].strip()
 35             try:
 36                 actors = movieInfo[0].split('   ')[1].strip()
 37             except:
 38                 actors = '请从详情页获取!'
 39             releaseYear = movieInfo[1].split('xa0/xa0')[0].strip()
 40             country = movieInfo[1].split('xa0/xa0')[1].strip()
 41             movieTyPE = movieInfo[1].split('xa0/xa0')[2].strip()
 42             movieStar = item.xpath('.//div[@class="star"]/span[@class="rating_num"]/text()').get()
 43             reviewCount = item.xpath('.//div[@class="star"]/span[last()]/text()').get()
 44             try:
 45                 oneWordDes = item.xpath('.//p[@class="quote"]/span/text()').get()
 46             except:
 47                 oneWordDes = None
 48             movieListDatas.append([title, director, actors, releaseYear, country, movieType, movieStar, reviewCount, oneWordDes])
 49             dit = {
 50                 '电影名称':title,
 51                 '导演':director,
 52                 '演员':actors,
 53                 '年份':releaseYear,
 54                 '国家':country,
 55                 '类型':movieType,
 56                 '评分':movieStar,
 57                 '评论总数':reviewCount,
 58                 '一句话描述':oneWordDes,
 59             }
 60             PRint(dit)
 61             movieDictDatas.append(dit)
 62             # print(title, director, actors, releaseYear, country, movieType, movieStar, reviewCount, oneWordDes, sep=' | ')
 63             # print(movieDictDatas)
 64             # print(movieListDatas)
 65 
 66         return movieListDatas
 67     def saveToCsv(self):
 68         f = open('20211229豆瓣top250.csv', mode='a', encoding='utf-8-sig', newline='')
 69         csvWriter = csv.DictWriter(f, fieldnames=[
 70             '影名称',
 71             '导演',
 72             '演员',
 73             '年份',
 74             '国家',
 75             '类型',
 76             '评分',
 77             '评论总数',
 78             '一句话描述',
 79         ])
 80         csvWriter.writeheader() # 写入头
 81         datas = self.parseHtmlByXpath()
 82         for data in datas:
 83             csvWriter.writerow(data)
 84         f.close()
 85 
 86     def saveTocsv2(self):
 87         f = open('20211229豆瓣250.csv', mode='a', encoding='utf-8', newline='')
 88         lis = ['电影名称',
 89             '导演',
 90             '演员',
 91             '年份',
 92             '国家',
 93             '类型',
 94             '评分',
 95             '评论总数',
 96             '一句话描述',]
 97         csvWriter = csv.writer(f)
 98         csvWriter.writerow(lis)
 99         datas = self.parseHtmlByXpath()
100         for data in datas:
101             csvWriter.writerow(data)
102         f.close()
103 
104     def run(self):
105         self.saveTocsv2()
106 
107 if __name__ == "__main__":
108     for start in range(0, 250+1, 25):
109         print(f'************************正在爬取{int(start/25 + 1)}页内容************************')
110         time.sleep(random.uniform(2,5))
111         url = f'https://movie.douban.com/top250?start={start}&filter='
112         app = douBanSpider(url=url)
113         app.run()
114         break
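To sanity-check the output, it helps to read the file back in. Below is a minimal sketch using csv.DictReader, assuming the default run() path (saveToCsv2, which writes '20211229豆瓣250.csv'); if you ran saveToCsv instead, open '20211229豆瓣top250.csv' with encoding='utf-8-sig':

import csv

# Read back the CSV produced by saveToCsv2 and show the first data row.
# DictReader treats the header row written above as the field names.
with open('20211229豆瓣250.csv', encoding='utf-8', newline='') as f:
    reader = csv.DictReader(f)
    first = next(reader, None)
    print(first)  # e.g. {'电影名称': '肖申克的救赎', '导演': ..., ...}

Reading with DictReader also confirms the header guard worked: if the header row had been appended once per page, the extra copies would show up here as data rows.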

 
