Data Collection and Fusion Technology: Fourth Practice


Published: 2022-06-30. Source site: 脚本宝典


Practice 4

Assignment ①:

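Requirements: become proficient with serialized output of Item and Pipeline data in Scrapy; crawl book data from the dangdang website using the Scrapy + XPath + MySQL storage approach.
Candidate site: http://search.dangdang.com/?key=python&act=input
Keyword: python
Output: the book records as stored in MySQL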

Implementation:

First, declare the item fields in items.py:

id = scrapy.Field()             # sequence number
title = scrapy.Field()          # book title
author = scrapy.Field()         # author
publisher = scrapy.Field()      # publisher
date = scrapy.Field()           # publication date
price = scrapy.Field()          # price
detail = scrapy.Field()         # description

Set the request headers in settings.py:

DEFAULT_REQUEST_HEADERS = {     # default request headers
    'accept': 'image/webp,*/*;q=0.8',
    'accept-language': 'zh-CN,zh;q=0.8',
    'referer': 'http://www.dangdang.com/',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36',
    'Cookie': '__permanent_id=20211020212531236321815860596663187; From=460-5-biaoti; order_follow_source=P-460-5-bi%7C%231%7C%23www.baidu.com%252Fother.php%253Fsc.af0000K6In1eyr_51xyLOLZful_DzomDVItDpz55payqb-vSCRkj2LUneDJ4qSze5IFz4rwEu%7C%230-%7C-; __ddc_15d_f=1636164705%7C!%7C_utm_brand_id%3D11106; __ddc_15d=1636167749%7C!%7C_utm_brand_id%3D11106; ddscreen=2; __visit_id=20211110081708199452166768718980880; __out_refer=; __rpm=%7Cmix_317715...1636504124506; search_passback=91da83bf7b229c4f3c128b61fc010000912dc80037128b61; __trace_id=20211110082847231355609193528387616; dest_area=country_id%3D9000%26province_id%3D111%26city_id%3D0%26district_id%3D0%26town_id%3D0; pos_9_end=1636504128928; pos_0_end=1636504129054; ad_ids=7332328%7C%231; pos_0_start=1636504129380',
}

In MySpider.py, use XPath to extract the book information:

dammit = UnicodeDammit(response.body, ["utf-8", "gbk"])
data = dammit.unicode_markup
selector = scrapy.Selector(text=data)
lis = selector.xpath("//li[@ddt-pit][starts-with(@class,'line')]")        # nodes that hold the book information
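The decoding step above hands UnicodeDammit two candidate encodings. As a rough illustration of the idea only (UnicodeDammit itself does considerably more, and `decode_with_fallback` is a hypothetical helper, not part of any library), trying the candidates in order could look like this:

```python
def decode_with_fallback(raw: bytes, encodings=("utf-8", "gbk")) -> str:
    """Decode raw bytes with the first candidate encoding that succeeds."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue  # this candidate does not fit, try the next one
    # No candidate fits: decode anyway and mark the undecodable bytes.
    return raw.decode(encodings[0], errors="replace")


# A GBK-encoded body is not valid UTF-8, so the second candidate is used.
gbk_body = "当当网 Python 图书".encode("gbk")
print(decode_with_fallback(gbk_body))
```

Listing utf-8 first matters: GBK will happily decode many UTF-8 byte sequences into the wrong characters, whereas UTF-8 rejects most GBK text outright.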

Inspect the page to locate the fields we need, then extract them in code:

title = li.xpath("./a[position()=1]/@title").extract_first()            # the title attribute holds the book name
author = li.xpath("./p[@class='search_book_author']/span[position()=1]/a/@title").extract_first()           # the author is in the first span under the p node
date = li.xpath("./p[@class='search_book_author']/span[position()=last()-1]/text()").extract_first()        # the date is in the second-to-last span under the p node
publisher = li.xpath("./p[@class='search_book_author']/span[position()=last()]/a/@title").extract_first()   # the publisher is in the last span under the p node
price = li.xpath("./p[@class='price']/span[@class='search_now_price']/text()").extract_first()          # price
detail = li.xpath("./p[@class='detail']/text()").extract_first()        # description
# detail is sometimes missing, in which case the result is None
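The positional predicates above (first span, second-to-last span, last span) can be tried offline on a small snippet. The sketch below uses the standard library's ElementTree, which supports only a subset of XPath, instead of Scrapy's selector, and the markup is made up to resemble one search-result `li`; it is not real dangdang HTML.

```python
import xml.etree.ElementTree as ET

# Made-up markup shaped like one search result; all values are placeholders.
SAMPLE_LI = """
<li>
  <a title="Sample Python Book" href="#">cover</a>
  <p class="search_book_author">
    <span><a title="A. Author">A. Author</a></span>
    <span> /2021-11-01</span>
    <span> /<a title="Example Press">Example Press</a></span>
  </p>
</li>
"""


def parse_book(li):
    """Pick fields by position, mirroring the span[1] / last()-1 / last() logic."""
    spans = "./p[@class='search_book_author']/span"
    title = li.find("./a[1]").get("title")
    author = li.find(spans + "[1]/a").get("title")
    date = li.find(spans + "[last()-1]").text
    publisher = li.find(spans + "[last()]/a").get("title")
    return {
        "title": title.strip(),
        "author": author.strip(),
        "date": date.strip()[1:],        # drop the leading "/"
        "publisher": publisher.strip(),
    }


book = parse_book(ET.fromstring(SAMPLE_LI))
print(book)
```

Counting from the end with last() keeps the date and publisher lookups stable even when a book lists several authors and the number of leading spans changes.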

Copy the data into the item:

item = Demo1Item()          # copy the data into the item
item["id"] = MySpider.count
item["title"] = title.strip() if title else ""
item["author"] = author.strip() if author else ""
item["date"] = date.strip()[1:] if date else ""     # drop the leading "/" from date
item["publisher"] = publisher.strip() if publisher else ""
item["price"] = price.strip() if price else ""
item["detail"] = detail.strip() if detail else ""
yield item

Pagination:

link = selector.xpath("//div[@class='paging']/ul[@name='Fy']/li[@class='next']/a/@href").extract_first()
if link and MySpider.page < 5:           # page limit: the last digit of the student ID is 5
    url = response.urljoin(link)
    MySpider.page += 1  # update the page counter
    yield scrapy.Request(url=url, callback=self.parse)      # call parse again to crawl the next page
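The pagination logic (follow the "next" link, stop at a page limit, stop when there is no link) can be exercised without Scrapy. The following is only a sketch: `crawl_pages`, `FAKE_SITE`, and the example.com URLs are hypothetical stand-ins, and `urljoin` from the standard library plays the role of `response.urljoin`.

```python
from urllib.parse import urljoin


def crawl_pages(fetch, start_url, max_pages):
    """Yield each visited URL, following 'next' links for at most max_pages pages."""
    url, fetched = start_url, 0
    while url and fetched < max_pages:
        page = fetch(url)
        fetched += 1
        yield url
        link = page.get("next")                      # None on the last page
        url = urljoin(url, link) if link else None   # resolve relative links


# Hypothetical three-page site kept in memory instead of real HTTP requests.
FAKE_SITE = {
    "http://example.com/list?page=1": {"next": "/list?page=2"},
    "http://example.com/list?page=2": {"next": "/list?page=3"},
    "http://example.com/list?page=3": {"next": None},
}

visited = list(crawl_pages(FAKE_SITE.__getitem__, "http://example.com/list?page=1", max_pages=2))
print(visited)
```

Checking for a missing link as well as the page limit means the crawl also ends cleanly when the site has fewer pages than the limit.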

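Next, create the table book to hold the book data. Note that id is an int column and bDate is a varchar column rather than a date column (using date raises an error).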
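The original table definition is not shown, so the following is only a guess at what it might look like. The column names come from the insert statement used later in the pipeline; the column sizes are assumptions. To keep the sketch runnable without a MySQL server it executes the statement against an in-memory SQLite database, and the same DDL should be acceptable to MySQL as well.

```python
import sqlite3

# Column names follow the pipeline's INSERT; the sizes are assumptions.
CREATE_BOOK_SQL = """
CREATE TABLE IF NOT EXISTS book (
    id         INT PRIMARY KEY,
    bTitle     VARCHAR(512),
    bAuthor    VARCHAR(256),
    bPublisher VARCHAR(256),
    bDate      VARCHAR(32),
    bPrice     VARCHAR(32),
    bDetail    TEXT
)
"""


def create_book_table(cursor):
    """Create the book table through any DB-API cursor."""
    cursor.execute(CREATE_BOOK_SQL)


con = sqlite3.connect(":memory:")      # stand-in for the MySQL connection
create_book_table(con.cursor())
# The spider stores "" when no date is found; a text column accepts that.
con.execute("INSERT INTO book (id, bDate) VALUES (?, ?)", (1, ""))
columns = [row[1] for row in con.execute("PRAGMA table_info(book)")]
print(columns)
```

One likely reason a date column fails is that the spider falls back to an empty string when a book has no date, which a strict date column would reject; keeping bDate as text sidesteps that.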

In pipelines.py, connect to the database, insert each item into the table, and print the book information along the way:

def open_spider(self, spider):
    print("opened")
    try:
        self.con = pymysql.connect(host="127.0.0.1", port=3306, user="root",
                                   passwd="qwe1346790", db="DataCollection", charset="utf8")
        self.cursor = self.con.cursor(pymysql.cursors.DictCursor)
        self.cursor.execute("delete from book")
        self.opened = True
        self.count = 0
    except Exception as err:
        print(err)
        self.opened = False

def close_spider(self, spider):
    if self.opened:
        self.con.commit()
        self.con.close()
        self.opened = False
    print("closed")
    print("Crawled", self.count, "book records in total")

def process_item(self, item, spider):
    try:
        print(item["id"])
        print(item["title"])  # print the book title
        print(item["author"])  # print the author
        print(item["publisher"])  # print the publisher
        print(item["date"])  # print the publication date
        print(item["price"])  # print the price
        print(item["detail"])  # print the description
        print()
        if self.opened:
            self.cursor.execute("insert into book (id,bTitle,bAuthor,bPublisher,bDate,bPrice,bDetail) values(%s,%s,%s,%s,%s,%s,%s)"
                 , (item["id"], item["title"], item["author"], item["publisher"], item["date"], item["price"], item["detail"]))
            self.count += 1
    except Exception as err:
        print(err)
    return item

Run run.py to get the results:

from scrapy import cmdline
cmdline.execute("scrapy crawl mySpider -s LOG_ENABLED=False".split())

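Results:

Assignment ②:

Requirements: become proficient with serialized output of Item and Pipeline data in Scrapy; crawl foreign-exchange data using the Scrapy + XPath + MySQL storage approach.
Candidate site: China Merchants Bank: http://fx.cmbchina.com/hq/
Output: MySQL storage and output format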

Id	Currency	TSP	CSP	TBP	CBP	Time
1	港币	86.60	86.60	86.26	85.65	15:36:30
2......

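Implementation:

As in Assignment 1, declare the item fields in items.py and set the request headers in settings.py; the details are not repeated here.

In MySpider.py, use XPath to extract the currency information: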

dammit = UnicodeDammit(response.body, ["utf-8", "gbk"])
data = dammit.unicode_markup
selector = scrapy.Selector(text=data)
trs = selector.xpath("//div[@id='realRateInfo']/table[@class='data']//tr[position()>1]")        # nodes that hold the currency information
for tr in trs:
    Currency = tr.xpath("./td[@class='fontbold']/text()").extract_first()               # currency name
    TSP = tr.xpath("./td[@class='numberright'][position()=1]/text()").extract_first()
    CSP = tr.xpath("./td[@class='numberright'][position()=2]/text()").extract_first()
    TBP = tr.xpath("./td[@class='numberright'][position()=3]/text()").extract_first()
    CBP = tr.xpath("./td[@class='numberright'][position()=4]/text()").extract_first()
    Time = tr.xpath("./td[@align='center'][position()=3]/text()").extract_first()
    item = Demo2Item()          # copy the data into the item
    item["Id"] = MySpider.count     # count is incremented below and serves as the sequence number
    item["Currency"] = Currency.strip()
    item["TSP"] = TSP.strip()
    item["CSP"] = CSP.strip()
    item["TBP"] = TBP.strip()
    item["CBP"] = CBP.strip()
    item["Time"] = Time.strip()
    MySpider.count += 1
    yield item

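Create the table TradingCurrency and, as in Assignment 1, connect to the database in pipelines.py, insert each item into the table, and print the currency information along the way: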

print(item["Id"])
print(item["Currency"])  # print the currency name
print(item["TSP"])  # print the TSP value
print(item["CSP"])  # print the CSP value
print(item["TBP"])  # print the TBP value
print(item["CBP"])  # print the CBP value
print(item["Time"])  # print the time
print()
if self.opened:
    self.cursor.execute(
        "insert into TradingCurrency (Id,Currency,TSP,CSP,TBP,CBP,Time) values(%s,%s,%s,%s,%s,%s,%s)"
        , (item["Id"], item["Currency"], item["TSP"], item["CSP"], item["TBP"], item["CBP"], item["Time"]))
    self.count += 1
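For readers without a MySQL server at hand, the open/process/close flow of the pipeline can be tried with SQLite. This is a sketch under stated assumptions, not the author's pipeline: `CurrencyPipelineSketch` is a hypothetical class, SQLite replaces MySQL, the column types are guesses, and the placeholder style is `?` where pymysql uses `%s`.

```python
import sqlite3


class CurrencyPipelineSketch:
    """Same open/process/close shape as a Scrapy pipeline, backed by SQLite."""

    COLUMNS = ("Id", "Currency", "TSP", "CSP", "TBP", "CBP", "Time")

    def open_spider(self, spider=None):
        self.con = sqlite3.connect(":memory:")
        self.con.execute(
            "CREATE TABLE TradingCurrency (Id INT PRIMARY KEY, Currency TEXT, "
            "TSP TEXT, CSP TEXT, TBP TEXT, CBP TEXT, Time TEXT)"
        )
        self.count = 0

    def process_item(self, item, spider=None):
        placeholders = ",".join("?" * len(self.COLUMNS))   # pymysql would use %s
        sql = (f"INSERT INTO TradingCurrency ({','.join(self.COLUMNS)}) "
               f"VALUES ({placeholders})")
        try:
            # Values are passed separately so the driver handles quoting.
            self.con.execute(sql, tuple(item[c] for c in self.COLUMNS))
            self.count += 1
        except sqlite3.Error as err:
            print("insert failed:", err)
        return item

    def close_spider(self, spider=None):
        self.con.commit()
        self.con.close()


row = {"Id": 1, "Currency": "港币", "TSP": "86.60", "CSP": "86.60",
       "TBP": "86.26", "CBP": "85.65", "Time": "15:36:30"}
pipeline = CurrencyPipelineSketch()
pipeline.open_spider()
pipeline.process_item(row)
pipeline.process_item(row)      # same Id again: rejected, reported, not counted
rows = pipeline.con.execute("SELECT Id, Currency, TSP FROM TradingCurrency").fetchall()
pipeline.close_spider()
print(rows, pipeline.count)
```

Catching the error per item, as the original pipeline also does, lets one bad row be reported without aborting the rest of the crawl.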

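Results:
Reflections:

Assignments 1 and 2 both use the Scrapy + XPath + MySQL storage approach. Assignment 2 raised one small problem: the currency data lives in tr nodes, but the first tr node holds the table header rather than currency data.
```
selector.xpath("//div[@id='realRateInfo']/table[@class='data']//tr[position()>1]")        # nodes that hold the currency information
```
Using position()>1 filters out that first tr node.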
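The header-skipping idea can be checked on a tiny table. The sketch below uses the standard library's ElementTree, which implements only a subset of XPath and does not accept `position()>1`, so it slices the row list instead; the effect is the same. The markup is a simplified, made-up version of the rate table that reuses the sample row from the output format above.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the rate table; the real page has more columns.
SAMPLE = """
<div id="realRateInfo">
  <table class="data">
    <tr><td>Currency</td><td>TSP</td><td>CSP</td><td>TBP</td><td>CBP</td><td>Time</td></tr>
    <tr>
      <td class="fontbold">港币</td>
      <td class="numberright">86.60</td>
      <td class="numberright">86.60</td>
      <td class="numberright">86.26</td>
      <td class="numberright">85.65</td>
      <td align="center">15:36:30</td>
    </tr>
  </table>
</div>
"""

root = ET.fromstring(SAMPLE)
all_rows = root.findall("./table[@class='data']//tr")
data_rows = all_rows[1:]          # same effect as tr[position()>1]: drop the header

records = []
for tr in data_rows:
    name = tr.find("./td[@class='fontbold']").text.strip()
    numbers = [td.text.strip() for td in tr.findall("./td[@class='numberright']")]
    records.append((name, numbers))

print(records)
```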

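Assignment ③:

Requirements: become proficient with Selenium for locating HTML elements, scraping Ajax-loaded page data, and waiting for HTML elements; crawl stock data for the three boards "沪深A股" (Shanghai and Shenzhen A-shares), "上证A股" (Shanghai A-shares), and "深证A股" (Shenzhen A-shares) using the Selenium + MySQL storage approach.
Candidate site: East Money: http://quote.eastmoney.com/center/gridlist.html#hs_a_board

Output: MySQL storage and output format as below. Column headers should be named in English, for example id for the sequence number and bStockNo for the stock code; students design the headers themselves:

No.	Stock code	Stock name	Latest price	Change %	Change	Volume	Turnover	Amplitude	High	Low	Open	Prev. close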
1	688093	N世华	28.47	62.22%	10.92	26.13万	7.6亿	22.34	32.0	28.08	30.2	17.55
2......

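Implementation:

Looking at the page, we need to crawl three stock boards: "沪深A股", "上证A股", and "深证A股". To demonstrate pagination, each board is limited to two pages.

Define the request headers and a variable page that limits how many pages are crawled per board: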

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36"}
page = 1  # page counter for the current board; two pages are crawled per board

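Create the browser driver and run it headless:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
# run the browser without a visible window
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
self.driver = webdriver.Chrome(options=chrome_options)     # chrome_options= is the deprecated spelling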

Extract the data for the first stock board:

lis = self.driver.find_elements_by_xpath("//div[@id='tab']//li")
sTitle = lis[self.moudle-1].find_element_by_xpath(".//a").text
trs = self.driver.find_elements_by_xpath("//table[@id='table_wrapper-table']/tbody/tr")     # rows that hold the stock data
for tr in trs:
    sId = tr.find_element_by_xpath("./td[position()=1]").text
    sCode = tr.find_element_by_xpath("./td[position()=2]").text
    sName = tr.find_element_by_xpath("./td[position()=3]").text
    sPrice = tr.find_element_by_xpath("./td[position()=5]").text
    sApplies = tr.find_element_by_xpath("./td[position()=6]").text
    sForehead = tr.find_element_by_xpath("./td[position()=7]").text
    sVolume = tr.find_element_by_xpath("./td[position()=8]").text
    sTurnover = tr.find_element_by_xpath("./td[position()=9]").text
    sAmplitude = tr.find_element_by_xpath("./td[position()=10]").text
    sHighest = tr.find_element_by_xpath("./td[position()=11]").text
    sLowest = tr.find_element_by_xpath("./td[position()=12]").text
    sToday = tr.find_element_by_xpath("./td[position()=13]").text
    sYesterday = tr.find_element_by_xpath("./td[position()=14]").text
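The thirteen near-identical lookups above can also be expressed as one table that maps a td position to a field name. This is just an alternative sketch: `COLUMN_MAP` and `row_to_record` are hypothetical names, the cell list would in practice come from the td elements of one row, and the placeholder in position 4 reflects only that the original code skips that column.

```python
# td position (1-based, as in the XPath above) -> field name used in the spider
COLUMN_MAP = {
    1: "sId", 2: "sCode", 3: "sName", 5: "sPrice", 6: "sApplies",
    7: "sForehead", 8: "sVolume", 9: "sTurnover", 10: "sAmplitude",
    11: "sHighest", 12: "sLowest", 13: "sToday", 14: "sYesterday",
}


def row_to_record(cells):
    """Turn a list of cell texts (index 0 is the first td) into a dict."""
    return {name: cells[pos - 1].strip() for pos, name in COLUMN_MAP.items()}


# Sample row from the output format above; position 4 is a skipped column.
sample_cells = ["1", "688093", "N世华", "(skipped)", "28.47", "62.22%", "10.92",
                "26.13万", "7.6亿", "22.34", "32.0", "28.08", "30.2", "17.55"]
record = row_to_record(sample_cells)
print(record)
```

Keeping the mapping in one place makes it easier to adjust if the site adds or reorders columns.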

Pagination:

nextPage = self.driver.find_element_by_xpath("//div[@class='dataTables_wrapper']//a[@class='next paginate_button']")
time.sleep(2)
nextPage.click()
self.processSpider()
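The fixed two-second sleep works, but it waits the full time even when the page is ready sooner and may still be too short on a slow connection. A common alternative is to poll for a condition with a timeout, which is what Selenium's own `WebDriverWait` does. The helper below is a library-free sketch of that idea; `wait_until` and `fake_find` are hypothetical names, and the simulated lookup stands in for a real element search.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(interval)


# Simulated lookup: the "element" only shows up on the third attempt.
attempts = {"n": 0}


def fake_find():
    attempts["n"] += 1
    return "next-button" if attempts["n"] >= 3 else None


found = wait_until(fake_find, timeout=2.0, interval=0.01)
print(found, "after", attempts["n"], "attempts")
```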

Switching boards:

lis = self.driver.find_elements_by_xpath("//div[@id='tab']//li")
nextMoudle = lis[self.moudle-1].find_element_by_xpath("./a")        # the next board
time.sleep(2)
self.driver.execute_script("arguments[0].click();", nextMoudle)     # click via JavaScript
# nextMoudle.click()    # fails: the native click is intercepted
self.processSpider()

Main routine:

    def executeSpider(self, url):
        starttime = datetime.datetime.now()
        print("Spider starting......")
        self.startUp(url)
        print("Spider processing......")
        self.processSpider()
        print("Spider closing......")
        self.closeUp()
        for t in self.threads:
            t.join()
        print("Spider completed......")
        endtime = datetime.datetime.now()
        elapsed = (endtime - starttime).seconds
        print("total ", elapsed, " seconds elapsed")

Url = "http://quote.eastmoney.com/center/gridlist.html#hs_a_board"
spider = MySpider()
spider.executeSpider(Url)

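Results:
Reflections:
1. Assignment 3 mainly exercises Selenium: locating HTML elements, scraping Ajax-loaded page data, and waiting for HTML elements. Combining element lookup with XPath is very convenient. When crawling, take care to sleep briefly after entering a new page so that elements are not looked up before the page has loaded. Pagination only requires finding the button and clicking it.
2. Problem encountered and its solution:
  
  When switching boards, the click method raised an error:
  
  element click intercepted: Element ... is not clickable at point (278, 13). Other element would receive the click: ...
  
  After looking this up online, the following statement solved it:
```
self.driver.execute_script("arguments[0].click();", nextMoudle)
```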