
Data Collection: Assignment 4

Published: 2022-06-30. Source site: 脚本宝典


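Assignment 1

Requirements: become proficient with serializing and outputting data through Item and Pipeline in Scrapy; crawl book data from the Dangdang site using the Scrapy + XPath + MySQL storage approach.
Candidate site: http://search.dangdang.com/?key=python&act=input
Keyword: chosen freely by the student.
Output: records stored in MySQL.

[Figure: expected MySQL output]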

Results

[Figure: screenshot of the results]

Code

Bookscrapy.py

Define the search keyword, build the URL, and issue the request:

    key = 'python'
    source_url = 'http://search.dangdang.com/'

    def start_requests(self):
        url = self.source_url + "?key=" + self.key
        yield scrapy.Request(url=url, callback=self.parse)
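
The string concatenation above works for an ASCII keyword such as python. For keywords with spaces or non-ASCII characters, the query string should be percent-encoded. A minimal sketch using only the standard library follows; build_search_url is a hypothetical helper, and UTF-8 encoding is an assumption (the site may expect a different encoding, which urlencode accepts through its encoding parameter).

```python
from urllib.parse import urlencode


def build_search_url(base_url: str, key: str) -> str:
    """Append the keyword as a percent-encoded `key` query parameter."""
    # urlencode escapes spaces and non-ASCII characters (UTF-8 by default).
    return base_url + "?" + urlencode({"key": key})


url_ascii = build_search_url("http://search.dangdang.com/", "python")
url_cjk = build_search_url("http://search.dangdang.com/", "数据采集")
```

In the spider, the result would take the place of `self.source_url + "?key=" + self.key`.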

The parse function parses the returned HTML:

            dammit = UnicodeDammit(response.body, ["utf-8", "gbk"])
            data = dammit.unicode_markup
            selector = scrapy.Selector(text=data)

            # each book is an li element with a ddt-pit attribute and a class starting with "line"
            lis = selector.xpath("//li[@ddt-pit][starts-with(@class,'line')]")
            for li in lis:
                title = li.xpath("./a[position()=1]/@title").extract_first()
                price = li.xpath("./p[@class='price']/span[@class='search_now_price']/text()").extract_first()
                author = li.xpath("./p[@class='search_book_author']/span[position()=1]/a/@title").extract_first()
                date = li.xpath("./p[@class='search_book_author']/span[position()=last()-1]/text()").extract_first()
                publisher = li.xpath("./p[@class='search_book_author']/span[position()=last()]/a/@title").extract_first()
                detail = li.xpath("./p[@class='detail']/text()").extract_first()

Create the BookItem object:

                item = BookItem()
                item["title"] = title.strip() if title else ""
                item["author"] = author.strip() if author else ""
                item["date"] = date.strip()[1:] if date else ""  # the date needs special handling: drop the leading "/"
                item["publisher"] = publisher.strip() if publisher else ""
                item["price"] = price.strip() if price else ""
                item["detail"] = detail.strip() if detail else ""
                yield item
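
The repeated "strip if present, else empty string" pattern above could be folded into one helper. This is only a sketch: clean and its drop_prefix parameter are hypothetical names, not part of the original spider. Unlike `date.strip()[1:]`, it removes the leading "/" only when that character is actually there.

```python
from typing import Optional


def clean(value: Optional[str], drop_prefix: str = "") -> str:
    """Strip whitespace, map None to "", and drop one leading `drop_prefix`."""
    if value is None:
        return ""
    text = value.strip()
    if drop_prefix and text.startswith(drop_prefix):
        text = text[len(drop_prefix):]
    return text.strip()
```

With it, the date line would read `item["date"] = clean(date, "/")` and the other fields `clean(title)`, `clean(author)`, and so on.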

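Pagination:

            link = selector.xpath("//div[@class='paging']/ul[@name='Fy']/li[@class='next']/a/@href").extract_first()
            # find the pagination block and read the link to the next page
            if link:
                if self.page == 3:  # limit the number of pages crawled
                    return
                self.page += 1  # assumes a class attribute `page = 1` on the spider
                url = response.urljoin(link)  # response.urljoin turns the link into an absolute URL
                yield scrapy.Request(url=url, callback=self.parse)  # parse runs again on the next page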
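
The stop condition and the URL resolution can be checked in isolation, without running a crawl. The sketch below uses urljoin from the standard library in the role that response.urljoin plays in the spider; next_page_url, MAX_PAGES, and the page_index query parameter in the usage note are illustrative assumptions.

```python
from typing import Optional
from urllib.parse import urljoin

MAX_PAGES = 3  # same limit as the page check in the spider


def next_page_url(current_url: str, link: Optional[str], page: int) -> Optional[str]:
    """Return the absolute URL of the next page, or None when crawling should stop."""
    if not link or page >= MAX_PAGES:
        return None
    # Resolve a relative href against the URL of the page it was found on.
    return urljoin(current_url, link)
```

For example, `next_page_url("http://search.dangdang.com/?key=python", "/?key=python&page_index=2", 1)` resolves the relative link against the host, while a missing link or a page count of 3 yields None.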

item.py

Define the fields to scrape:

    import scrapy

    class BookItem(scrapy.Item):
        # one Field per scraped attribute
        title = scrapy.Field()
        author = scrapy.Field()
        date = scrapy.Field()
        publisher = scrapy.Field()
        detail = scrapy.Field()
        price = scrapy.Field()

bookpipeline.py

Saves the data to MySQL. First try to connect to the database and create the table:

        try:
            # connect to the database
            self.con = pymysql.connect(host="localhost", port=3306, user="root", passwd="ly213213", db="scrapy", charset="utf8")
            self.cursor = self.con.cursor(pymysql.cursors.DictCursor)
            creatable = '''
                 create table if not exists books(
                        bTitle varchar(512) primary key,
                        bAuthor varchar(256),
                        bPublisher varchar(256),
                        bDate varchar(32),
                        bPrice varchar(16),
                        bDetail text)
                 '''
            self.cursor.execute(creatable)
            self.cursor.execute("delete from books")  # start each run with an empty table
            self.opened = True
            self.count = 0
        except Exception as err:
            print(err)

Insert the data and close the connection:

        try:
            print(item['title'] + "\t" + item['author'] + "\t" + item['publisher'] + "\t" + item['date'] + "\t" +
                  item['price'] + "\t" + item['detail'])  # console output
            if self.opened:
                # placeholders are left unquoted: pymysql quotes and escapes each argument itself
                sql = '''INSERT INTO books(bTitle,bAuthor,bPublisher,bDate,bPrice,bDetail) VALUES(%s,%s,%s,%s,%s,%s)'''
                arg = (item['title'], item['author'], item['publisher'], item['date'], item['price'], item['detail'])
                self.cursor.execute(sql, arg)
                self.count += 1
                if self.count == 117:  # cap the number of records stored
                    self.con.commit()
                    self.con.close()
                    self.opened = False
                    print("closed")
                    print("Total books crawled:", self.count)
        except Exception as err:
            print(err)
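
Two details of the insert are easy to get wrong: placeholders in a parameterized query are written without surrounding quotes, and a title that repeats violates the primary key. The sketch below illustrates both with the standard library's sqlite3 as a stand-in for MySQL, so it runs without a database server. Note the swap: sqlite3 writes placeholders as `?`, whereas pymysql uses `%s`. The reduced three-column table and the function names are assumptions made for illustration.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory stand-in for the `books` table."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "create table if not exists books("
        "bTitle text primary key, bAuthor text, bPrice text)"
    )
    return con


def save_book(con: sqlite3.Connection, title: str, author: str, price: str) -> bool:
    """Insert one row; return False if the title (primary key) already exists."""
    try:
        # Placeholders are unquoted; the driver handles quoting and escaping.
        con.execute(
            "insert into books(bTitle, bAuthor, bPrice) values (?, ?, ?)",
            (title, author, price),
        )
        con.commit()
        return True
    except sqlite3.IntegrityError:
        return False


con = open_store()
first = save_book(con, 'Python "Basics"', "A. Author", "59.00")
duplicate = save_book(con, 'Python "Basics"', "A. Author", "59.00")
stored = con.execute("select bTitle from books").fetchall()
```

The title containing double quotes is stored unchanged, which is the point of letting the driver do the escaping.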

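The setting.py and run.py files are routine and are not shown here.

Takeaways

1. This task reproduces code from the textbook. It deepened my understanding of the Scrapy framework and made me more fluent at writing the data storage code in a pipeline.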

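Assignment 2

Requirements: become proficient with serializing and outputting data through Item and Pipeline in Scrapy; crawl foreign exchange data using the Scrapy + XPath + MySQL storage approach.
Candidate site: China Merchants Bank, http://fx.cmbchina.com/hq/
Output: stored in MySQL, in the following format:

Id	Currency	TSP	CSP	TBP	CBP	Time
1	港币	86.60	86.60	86.26	85.65	15:36:30
2......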

Results

[Figure: screenshot of the results]

Code

bookspider.py

Define the URL and issue the request:

    url = "http://fx.cmbchina.com/hq/"

    def start_requests(self):
        yield scrapy.Request(url=self.url, callback=self.parse)

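The parse function parses the returned HTML. Inspecting the page shows that all the data to scrape sits under a tbody tag, and each currency corresponds to one tr tag.

However, when I located the tr tags with the XPath copied from the browser's developer tools, the query returned nothing. I could not find the cause and lost a lot of time. After asking a classmate, I learned that the tbody tag may be inserted automatically by the browser, which explains why the copied path matched nothing. The XPath therefore has to be written by hand, skipping the tbody tag.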

            trs = selector.xpath("//div[@id='realRateinfo']/table//tr")
            # `//tr` skips the tbody level; a path that includes tbody matches nothing
            for i in range(1, len(trs)):  # start at 1 to skip the first (header) row
                currency = trs[i].xpath("./td[1]/text()").extract_first()
                tsp = trs[i].xpath("./td[4]/text()").extract_first()
                csp = trs[i].xpath("./td[5]/text()").extract_first()
                tbp = trs[i].xpath("./td[6]/text()").extract_first()
                cbp = trs[i].xpath("./td[7]/text()").extract_first()
                time = trs[i].xpath("./td[8]/text()").extract_first()

Create the BankItem object:

                item = BankItem()
                item["Id"] = i
                item["Currency"] = currency.strip()
                item["TSP"] = tsp.strip()
                item["CSP"] = csp.strip()
                item["TBP"] = tbp.strip()
                item["CBP"] = cbp.strip()
                item["Time"] = time.strip()
                yield item
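
The tbody pitfall described above can be reproduced without the live site. The sketch below uses the standard library's xml.etree.ElementTree in place of the Scrapy selector, with a small hand-written fragment standing in for the server's response; the markup and its values are assumptions chosen only to show the effect.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup as a server might send it: rows sit directly under the table.
RAW_HTML = """
<div id="realRateinfo">
  <table>
    <tr><th>Currency</th><th>TSP</th></tr>
    <tr><td>HKD</td><td>86.60</td></tr>
  </table>
</div>
"""

root = ET.fromstring("<html>" + RAW_HTML + "</html>")

# Path as copied from a browser: it includes the tbody that the browser inserted.
with_tbody = root.findall(".//div[@id='realRateinfo']/table/tbody/tr")
# Hand-written path: `//` matches rows at any depth, with or without tbody.
without_tbody = root.findall(".//div[@id='realRateinfo']/table//tr")

# Skip the header row, as the spider does with range(1, len(trs)).
currencies = [tr.findtext("td") for tr in without_tbody[1:]]
```

The first query finds no rows because the raw markup has no tbody element, while the second finds both rows.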

This page does not seem to need pagination, which saves some work.

item.py

Define the fields to scrape:

class BankItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    Id = scrapy.Field()
    Currency = scrapy.Field()
    TSP = scrapy.Field()
    CSP = scrapy.Field()
    TBP = scrapy.Field()
    CBP = scrapy.Field()
    Time = scrapy.Field()
    pass

bankpipeline.py

Saves the data to MySQL:

        try:
            connect = pymysql.connect(host="localhost", user="root", password="ly213213", database="scrapy", charset='utf8')  # connect to the database
            cur = connect.cursor()  # create a cursor
            creatable = '''
                 create table if not exists bank(
                        Id int(5) not null,
                        Currency char(30) not null,
                        TSP char(20) not null,
                        CSP char(20) not null,
                        TBP char(20) not null,
                        CBP char(20) not null,
                        Time char(40) not null
                        )
               '''
            cur.execute(creatable)
            # placeholders are left unquoted: pymysql quotes and escapes each argument itself
            sql = '''INSERT INTO bank(Id, Currency,TSP,CSP,TBP,CBP,Time) VALUES(%s,%s,%s,%s,%s,%s,%s)'''
            arg = (item['Id'], item['Currency'], item['TSP'], item['CSP'], item['TBP'], item['CBP'], item['Time'])
            cur.execute(sql, arg)
            connect.commit()  # commit the insert
        except Exception as err:
            print(err)

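Takeaways

1. Blindly copying an XPath from the browser can backfire. It pays to learn more about how web pages are built to avoid this kind of pitfall.
2. I became more fluent at writing code within the Scrapy framework.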

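Assignment 3

Requirements: become proficient with Selenium for locating HTML elements, crawling Ajax-loaded page data, and waiting for HTML elements; crawl stock data for the three boards 沪深A股 (Shanghai and Shenzhen A-shares), 上证A股 (Shanghai A-shares), and 深证A股 (Shenzhen A-shares) using the Selenium + MySQL storage approach.
Candidate site: Eastmoney, http://quote.eastmoney.com/center/gridlist.html#hs_a_board
Output: stored in MySQL in the format below. Column headers should be named in English, for example id for the serial number and bStockNo for the stock code; students design the headers themselves.

No.	Stock code	Stock name	Latest price	Change %	Change	Volume	Turnover	Amplitude	High	Low	Open	Prev. close
1	688093	N世华	28.47	62.22%	10.92	26.13万	7.6亿	22.34	32.0	28.08	30.2	17.55
2......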

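Results

[Figure: screenshot of the results]

Code

Inspection shows that the Eastmoney page is not a static HTML document but a dynamic page, so Selenium is needed to drive a browser that executes the JavaScript before the data can be scraped.

Page crawling

Create the Chrome browser: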

        chrome_options = Options()  # browser options
        chrome_options.add_argument('--headless')  # run without a visible window
        chrome_options.add_argument('--disable-gpu')  # disable GPU acceleration
        self.driver = webdriver.Chrome(options=chrome_options)  # start the browser
        self.wait = WebDriverWait(self.driver, 10)  # explicit waits time out after 10 seconds
        self.driver.get(url)  # load the page

Scrape the page data. Inspecting the page shows that all stock rows sit under a tbody tag, with one tr tag per stock, so locating elements with XPath should be fairly quick.

            # find_element(By.XPATH, ...) replaces find_element_by_xpath, which Selenium 4.3 removed
            trs = self.driver.find_elements(
                By.XPATH, "//div[@class='listview full']/table[@id='table_wrapper-table']/tbody/tr")
            for tr in trs:
                Id = tr.find_element(By.XPATH, ".//td[position()=1]").text  # serial number
                Code = tr.find_element(By.XPATH, ".//td[position()=2]/a").text  # stock code
                Name = tr.find_element(By.XPATH, ".//td[position()=3]/a").text  # stock name
                Newprice = tr.find_element(By.XPATH, ".//td[position()=5]/span").text  # latest price
                UpdownPercent = tr.find_element(By.XPATH, ".//td[position()=6]/span").text  # change in percent
                Updownqouta = tr.find_element(By.XPATH, ".//td[position()=7]/span").text  # change amount
                Turnover = tr.find_element(By.XPATH, ".//td[position()=8]").text  # volume
                BusinessVal = tr.find_element(By.XPATH, ".//td[position()=9]").text  # turnover
                Amplitude = tr.find_element(By.XPATH, ".//td[position()=10]").text  # amplitude
                Max = tr.find_element(By.XPATH, ".//td[position()=11]/span").text  # high
                Min = tr.find_element(By.XPATH, ".//td[position()=12]/span").text  # low
                Today = tr.find_element(By.XPATH, ".//td[position()=13]/span").text  # open
                Yesterday = tr.find_element(By.XPATH, ".//td[position()=14]").text  # previous close

Pagination: locate the element for the next-page button and use Selenium to simulate a click on it:

        if self.driver.find_elements(By.XPATH, "//*[@id='main-table_paginate']//a[@class='next paginate_button disabled']"):
            return False  # the disabled "next" button is present, so this is the last page: stop
        else:  # otherwise move on to the next page
            button_next = self.wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="main-table_paginate"]/a[2]')))
            # locate the "next page" button
            time.sleep(5)
            button_next.click()  # click to go to the next page
            # webdriver.Chrome().refresh()  # is a page refresh needed here?
            self.processSpider()
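
The code above moves to the next page by calling self.processSpider() recursively, so every page adds a stack frame, and a board with many pages could approach Python's default recursion limit of 1000. One alternative is a plain loop. The sketch below is not the author's method: crawl_pages, its two callbacks, and the FakePager test double are hypothetical, and the browser is replaced by an in-memory object so the control flow can be run as-is.

```python
from typing import Callable, List


def crawl_pages(scrape_page: Callable[[], List[str]],
                go_next: Callable[[], bool],
                max_pages: int = 500) -> List[str]:
    """Scrape the current page, then advance until go_next() reports the last page."""
    rows: List[str] = []
    for _ in range(max_pages):
        rows.extend(scrape_page())
        if not go_next():  # False means the "next" button is disabled
            break
    return rows


class FakePager:
    """Stand-in for the browser: three pages of fake rows."""

    def __init__(self) -> None:
        self.pages = [["r1", "r2"], ["r3"], ["r4"]]
        self.index = 0

    def scrape(self) -> List[str]:
        return self.pages[self.index]

    def next(self) -> bool:
        if self.index + 1 >= len(self.pages):
            return False
        self.index += 1
        return True


pager = FakePager()
all_rows = crawl_pages(pager.scrape, pager.next)
```

In the real crawler, scrape_page would wrap the row extraction and go_next the disabled-button check plus the click.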

Crawling the three boards: the three boards differ only in the URL fragment, so the crawl switches between them by changing the URL:

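url1 = "http://quote.eastmoney.com/center/gridlist.html#hs_a_board"  # 沪深A股 (Shanghai and Shenzhen A-shares)
url2 = "http://quote.eastmoney.com/center/gridlist.html#sh_a_board"  # 上证A股 (Shanghai A-shares)
url3 = "http://quote.eastmoney.com/center/gridlist.html#sz_a_board"  # 深证A股 (Shenzhen A-shares)
option = input("Enter 1, 2 or 3 to choose the board to crawl: ")
# input() returns a string, so compare against "1" and "2" rather than integers
if option == "1":
    url = url1
elif option == "2":
    url = url2
else:
    url = url3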
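
Because input() returns text, the choice can also be looked up in a dictionary keyed by strings, which makes it easy to reject invalid input instead of falling through to the third board. A minimal sketch; BOARD_URLS and pick_url are hypothetical names.

```python
BOARD_URLS = {
    "1": "http://quote.eastmoney.com/center/gridlist.html#hs_a_board",  # 沪深A股
    "2": "http://quote.eastmoney.com/center/gridlist.html#sh_a_board",  # 上证A股
    "3": "http://quote.eastmoney.com/center/gridlist.html#sz_a_board",  # 深证A股
}


def pick_url(option: str) -> str:
    """Map the text typed by the user to a board URL; reject anything else."""
    try:
        return BOARD_URLS[option.strip()]
    except KeyError:
        raise ValueError(f"expected 1, 2 or 3, got {option!r}") from None
```

Usage would be `url = pick_url(input(...))`, with the ValueError caught if the script should ask again.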

Database

Try to connect to the database and create the table that stores the data:

            self.con = pymysql.connect(host="localhost", port=3306, user="root",passwd = "ly213213", db = "scrapy", charset = "utf8")
            self.cursor = self.con.cursor(pymysql.cursors.DictCursor)
            sql= ''' 
                create table if not exists stocks(
                    Id varchar(32), Code varchar(32) ,Name varchar(32),Newprice varchar(32),
                    UpdownPercent varchar(32),Updownqouta varchar(32),Turnover varchar(32),
                    BusinessVal varchar(32),Amplitude varchar(32),Max varchar(32),
                    Min varchar(32),Today varchar(32),Yesterday varchar(32))
                '''
            self.cursor.execute(sql)
            self.cursor.execute("delete from stocks")
            self.opened = True

Insert the data:

                if self.opened:  # store the row in the database
                    # placeholders are left unquoted: pymysql quotes and escapes each argument itself
                    sql = '''insert into stocks (Id, Code, Name, Newprice, UpdownPercent, Updownqouta, Turnover, BusinessVal, Amplitude, Max, Min, Today, Yesterday) values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)'''
                    arg = (str(Id), str(Code), str(Name), str(Newprice), str(UpdownPercent), str(Updownqouta), str(Turnover), str(BusinessVal), str(Amplitude), str(Max), str(Min), str(Today), str(Yesterday))
                    self.cursor.execute(sql, arg)  # insert the row
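
Every column here is stored as text, so values such as 26.13万 or 62.22% cannot be sorted or summed in SQL. If numeric columns were wanted, the strings would need converting first. The sketch below shows one possible conversion with Decimal; parse_amount and UNITS are hypothetical, the original code does no such conversion, and placeholder values such as "-" are not handled.

```python
from decimal import Decimal

# 万 = ten thousand, 亿 = one hundred million
UNITS = {"万": Decimal(10_000), "亿": Decimal(100_000_000)}


def parse_amount(text: str) -> Decimal:
    """Convert strings such as "26.13万", "7.6亿" or "62.22%" to a Decimal."""
    text = text.strip()
    if text.endswith("%"):
        return Decimal(text[:-1]) / 100  # percentage as a fraction
    if text and text[-1] in UNITS:
        return Decimal(text[:-1]) * UNITS[text[-1]]
    return Decimal(text)
```

Decimal is used rather than float so that a value like 26.13万 converts to exactly 261300.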

Close the connection:

            self.con.commit()
            self.con.close()
            self.driver.close()

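Takeaways

1. This was my first attempt at Selenium. It deepened my understanding of the framework and showed me what crawling a page with it involves.
2. Selenium drives a real browser, which is a big advantage when crawling pages that must be rendered first.

Code: https://gitee.com/linyu17/crawl_project/tree/master/%E7%AC%AC%E5%9B%9B%E6%AC%A1%E4%BD%9C%E4%B8%9A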
