Learning Python Web Crawlers (Part 1)



Published: 2019-07-02. Published on: 脚本宝典


Fetching web page content
The website is the API
The Requests library
Automatically fetches HTML pages and submits the related network requests

Requests: HTTP for Humans™ — Requests 2.21.0 documentation
Learn to read the official documentation.

import requests

r = requests.get(url)
# get() builds a Request object that asks the server for the resource
# and returns a Response object

Beautiful Soup
Parses web pages.
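As a rough illustration of what "parsing a web page" means, the sketch below feeds a small hand-written HTML string to Beautiful Soup and pulls out the title and the link targets. It assumes the `beautifulsoup4` package is installed; the HTML content and the variable names are made up for this example.

```python
from bs4 import BeautifulSoup

# A small hand-written page, so the example runs without any network access.
HTML = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <a href="/a.html">First</a>
    <a href="/b.html">Second</a>
  </body>
</html>
"""

# "html.parser" is the parser bundled with Python's standard library.
soup = BeautifulSoup(HTML, "html.parser")

# The text of the <title> tag and the href attribute of every <a> tag.
title = soup.title.string
links = [a.get("href") for a in soup.find_all("a")]
print(title, links)
```

In a real crawler the string passed to `BeautifulSoup` would typically be the `r.text` of a Requests response.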
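Web crawlers: even thieves have a code of honor
The robots protocol sets out the conventions that web crawlers are expected to follow.
A general-purpose crawler code framework
A network connection is not guaranteed to succeed, so handling exceptions is very important.
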
# Raises requests.HTTPError if the status code signals an error (4xx or 5xx)
r.raise_for_status()

# General-purpose crawler framework
import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()    # raise HTTPError on a 4xx/5xx status
        r.encoding = r.apparent_encoding    # use the encoding guessed from the content
        return r.text
    except requests.RequestException:
        return "Error"

if __name__ == "__main__":
    url = "http://www.baidu.com"
    print(getHTMLText(url))

The HTTP protocol

Note the difference between GET and POST.
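One way to see the difference without sending anything over the network is to let Requests prepare both kinds of request and inspect the result. This is only a sketch: it assumes the `requests` package is installed, and the URL and the `wd` field are illustrative placeholders rather than values from these notes.

```python
import requests

# Hypothetical endpoint and form field, used only for illustration.
URL = "http://www.example.com/search"
FIELDS = {"wd": "python"}

# GET: the data is encoded into the URL's query string; there is no body.
get_req = requests.Request("GET", URL, params=FIELDS).prepare()

# POST: the URL stays unchanged and the data travels in the request body.
post_req = requests.Request("POST", URL, data=FIELDS).prepare()

print(get_req.method, get_req.url, get_req.body)
print(post_req.method, post_req.url, post_req.body)
```

Because GET parameters end up in the URL, they are visible in logs and browser history, which is one reason form submissions usually go through POST.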

Python data type: the dictionary (dict)
A series of key-value pairs written with {} and :
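A short sketch of the dict operations a crawler tends to need, using HTTP header fields as the sample data. The header names and values here are only examples.

```python
# A dict maps keys to values; here it holds HTTP header fields.
headers = {"User-Agent": "Mozilla/5.0"}

# Add or overwrite an entry by key.
headers["Accept-Language"] = "zh-CN"

# Look up a value; get() returns a default instead of raising KeyError.
agent = headers["User-Agent"]
referer = headers.get("Referer", "not set")

# Iterate over all key-value pairs.
for key, value in headers.items():
    print(key, ":", value)
```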
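The robots protocol
A website uses it to tell crawlers which pages may be fetched and which may not.
The rules live in the robots.txt file in the site's root directory.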
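Python's standard library can read these rules. The sketch below parses a made-up robots.txt from a string, so it needs no network access; the crawler name `MyCrawler` and the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, parsed from a string so no network access is needed.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# can_fetch(useragent, url) reports whether that crawler may fetch the URL.
allowed_public = parser.can_fetch("MyCrawler", "http://www.example.com/index.html")
allowed_private = parser.can_fetch("MyCrawler", "http://www.example.com/private/a.html")
print(allowed_public, allowed_private)
```

Against a live site, `set_url()` followed by `read()` would download the real file in place of the `parse()` call.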
Modifying the crawler's headers

import requests

url = "https://www.amazon.cn/dp/B078FFX8B6"
# Send a browser-like User-Agent instead of the Requests default
kv = {'User-agent' : 'Mozilla/5.0'}
r = requests.get(url, headers = kv)
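The reason for changing the header is that Requests identifies itself by default, and some sites may refuse such requests. The sketch below shows that default identity and checks, without sending anything, that the custom value would replace it. It assumes the `requests` package is installed; the variable names are chosen for this example.

```python
import requests

# Without custom headers, Requests announces itself as "python-requests/<version>".
default_agent = requests.utils.default_user_agent()

# Prepare (but do not send) the request to inspect the headers it would carry.
kv = {"User-agent": "Mozilla/5.0"}
prepared = requests.Request(
    "GET", "https://www.amazon.cn/dp/B078FFX8B6", headers=kv
).prepare()

# Header names are case-insensitive, so "User-Agent" finds the "User-agent" entry.
sent_agent = prepared.headers["User-Agent"]
print(default_agent, "->", sent_agent)
```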
Fetching and saving images from the web

import requests

path = "/Users/apple/Pictures/a.jpg"
url = "http://img0.dili360.com/ga/M01/48/E0/wKgBzFmyTcaACuVKACZ-qAthuNY888.tub.jpg@!rw9"
r = requests.get(url)

# r.content holds the raw bytes of the response; the with block closes the file
with open(path, "wb") as f:
    f.write(r.content)
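The save path above is hard-coded. One possible refinement is to derive the file name from the URL, create the target directory when it is missing, and skip files that already exist. The helper names `local_path_for` and `save_bytes` are hypothetical, and stand-in bytes replace `r.content` so that the sketch runs without a network connection.

```python
import os
import tempfile

def local_path_for(url, root):
    """Build a local file path from the last segment of the URL."""
    return os.path.join(root, url.split("/")[-1])

def save_bytes(content, path):
    """Write binary content to path, creating the directory if needed.

    Returns False without writing when the file already exists.
    """
    os.makedirs(os.path.dirname(path), exist_ok=True)
    if os.path.exists(path):
        return False
    with open(path, "wb") as f:
        f.write(content)
    return True

# Stand-in bytes; in the real script this would be r.content.
root = os.path.join(tempfile.mkdtemp(), "pics")
path = local_path_for("http://www.example.com/img/a.jpg", root)
first = save_bytes(b"fake image data", path)
second = save_bytes(b"fake image data", path)
print(path, first, second)
```

The second call returns False because the file written by the first call is already there, which avoids downloading and overwriting the same image twice.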