Latest as of October 2021: scraping court judgment data with selenium (this article is for technical exchange only)

Published: 2022-07-01. Site: 脚本宝典

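As a Java rookie, I wanted to learn a bit about Python crawlers. I had heard that the court judgments site (文书网) has tough anti-scraping measures, so I went to try it. Well, wow. It does not play fair. For a start, it lives up to the reputation of government websites for slow responses: 70 billion visits... plus bots hitting it around the clock, so even normal browsing is painfully laggy.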

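When in doubt, ask Baidu. Of the recent approaches posted online, the most common is still to reverse-engineer three POST parameters: pageid, ciphertext, and __RequestVerificationToken.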


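I tried that too. Nobody explains how to handle the cookie parameters; everyone just says to log in and hard-code them. It never worked for me: I kept getting "无权限访问接口" (no permission to access the interface).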

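So I moved on and tried Web Scraper. Wow again: the site times out badly, sometimes returning nothing for a full minute, and the scraper kept breaking. The biggest problem was that it could only fetch a single page, which is not much use, so I dropped it.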

Now for the main topic!!! Pay attention, this is where I change approach.

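selenium: simulate a user visiting the site and extract data with XPath. So far this has been working quite smoothly. The site has a 600-result limit: a query shows at most 600 results, and to get past that you have to narrow the query with advanced search conditions.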
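One way to stay under the 600-result cap is to keep halving a date range until each piece is small enough. The sketch below is only an illustration of that idea: `count_results` is a hypothetical callable (not part of selenium or of the site) that you would have to implement yourself, for example by reading the result counter after a search.

```python
from datetime import date, timedelta

RESULT_CAP = 600  # per-query limit described above


def split_until_under_cap(start, end, count_results, cap=RESULT_CAP):
    """Split [start, end] into date ranges that each report at most `cap` results.

    count_results(start, end) is a hypothetical callable returning the number
    of results for that inclusive date range.
    """
    # A single day cannot be split further, so return it even if it is over the cap.
    if start == end or count_results(start, end) <= cap:
        return [(start, end)]
    mid = start + (end - start) // 2
    left = split_until_under_cap(start, mid, count_results, cap)
    right = split_until_under_cap(mid + timedelta(days=1), end, count_results, cap)
    return left + right


if __name__ == "__main__":
    # Fake counter for demonstration only: pretend every day has 30 results.
    def fake_count(s, e):
        return ((e - s).days + 1) * 30

    for s, e in split_until_under_cap(date(2020, 1, 1), date(2020, 1, 31), fake_count):
        print(s, e)
```

Each extra split costs one more slow search, so splitting by month first and only halving the months that overflow keeps the number of queries low.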

The approach!!! Pay attention

1. See the court map on the home page?


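Get the full list of courts (What? You don't know how? Neither do I...). Some government site should list the court names; I am only offering the idea here, because I ran the advanced search against one specific court and then narrowed it down to a single month (which keeps each query within 600). (What? What if it goes over 600? Come on, which court uploads more than 600 judgments in a month? The clerks would drop dead. Well, that is what I assume, anyway, chin-in-hand emoji.)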

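2. Then let the program drive the browser: open the URL automatically and log in (after a successful login the site sometimes asks for a CAPTCHA; just type it in by hand).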

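Before that, I did a rough manual check: there is no data from before 2013 (What? Some courts have it? Thumbs up, feel free to go back a few more years). (What? Why do you need to know this?) Because the advanced search has to be filled in with a full-month range in the judgment date field (裁判日期); for the court name, enter whichever court you want (for a more specific search, fill in the other fields yourself).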


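After a successful login you land on the home page, and from there it is just a loop: open advanced search, fill in the fields, click search (wait a few dozen seconds; it varies, a minute at most), select all documents, click batch download, click next page (wait again, up to a minute), select all, batch download, next page... until the last page is downloaded!!! Then open advanced search again for the next month, fill in the fields, click search, and repeat the same cycle of select all, batch download, next page until that month's last page is done... (thirsty)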
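The per-query part of that loop can be sketched independently of the browser. This is a minimal sketch, not the script's actual structure: the four callables are hypothetical hooks that you would wire to your own selenium clicks and checks, and `max_pages` is an added safety guard against looping forever if the "next page" check misbehaves.

```python
def download_all_pages(select_all, download, has_next_page, go_next, max_pages=100):
    """Select and download every result page; return how many pages were handled.

    All four arguments are hypothetical callables supplied by the caller:
    select_all() ticks the "select all" box, download() clicks batch download,
    has_next_page() reports whether a further page exists, go_next() opens it.
    """
    pages = 0
    while pages < max_pages:
        select_all()
        download()
        pages += 1
        if not has_next_page():
            break
        go_next()
    return pages


if __name__ == "__main__":
    # Stand-in for the browser: three pages of results, actions are just printed.
    state = {"page": 1}
    handled = download_all_pages(
        select_all=lambda: print("select all on page", state["page"]),
        download=lambda: print("batch download page", state["page"]),
        has_next_page=lambda: state["page"] < 3,
        go_next=lambda: state.update(page=state["page"] + 1),
    )
    print("pages handled:", handled)
```

Keeping the loop separate from the clicks makes it possible to test the paging logic without waiting a minute per page on the real site.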

3. The code

Chrome browser and its driver

from selenium import webdriver
import time

bro = webdriver.Chrome(executable_path='chromedriver.exe')

# Open the site
bro.get('https://wenshu.court.gov.cn/')

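Maximize the window. Why refresh as well? Sigh, the page does not load completely! There are more refreshes further down; try it yourself and you will see.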

# Maximize the window
bro.maximize_window()
time.sleep(2)
bro.refresh()
# Locate the login button
login_tag = bro.find_element_by_xpath('//*[@id="loginLi"]/a')
# Click it
time.sleep(2)
login_tag.click()
time.sleep(2)
bro.refresh()
# Switch to the login iframe
bro.switch_to.frame("contentIframe")
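
Because the page is slow and often half-loaded, fixed sleeps either waste time or turn out too short. A common alternative is to poll until a condition holds. The helper below is a plain-Python sketch of that idea; `wait_until` is a name made up for this example. Selenium ships its own version of this pattern as `WebDriverWait` in `selenium.webdriver.support.ui`, which is probably the better choice in a real script.

```python
import time


def wait_until(condition, timeout=60.0, poll=1.0):
    """Call condition() until it returns something truthy, then return that value.

    Raises TimeoutError if nothing truthy is returned within `timeout` seconds.
    Exceptions from condition() are treated as "not ready yet"; with selenium
    you would more likely catch NoSuchElementException specifically.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
        except Exception:
            result = None
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(poll)


if __name__ == "__main__":
    # Demonstration without a browser: the condition becomes true on the third call.
    calls = {"n": 0}

    def ready():
        calls["n"] += 1
        return calls["n"] >= 3

    wait_until(ready, timeout=5.0, poll=0.1)
    print("ready after", calls["n"], "checks")
```

With a driver in hand, the condition could be a lambda that looks up the login link and returns the element, so the script continues as soon as the page is usable instead of always sleeping the worst-case time.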

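... I'll stop here. See the full code below!!!

4. Attention, please: the full code is at the link below, hehe, only 5 C coins (C币), go grab it!!!

What! You can't see a link? Sigh, the company installed some security software that blocks file uploads!! I just lost a fortune!!!

So here it is!!!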
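from selenium import webdriver
import time

# This uses the Selenium 3 API (executable_path, find_element_by_xpath),
# which later Selenium 4 releases removed.
bro = webdriver.Chrome(executable_path='chromedriver.exe')

# Open the site
bro.get('https://wenshu.court.gov.cn/')

# Maximize the window, then refresh because the page often loads incompletely
bro.maximize_window()
time.sleep(2)
bro.refresh()

# Locate the login button
login_tag = bro.find_element_by_xpath('//*[@id="loginLi"]/a')

# Click it
time.sleep(2)
login_tag.click()
time.sleep(2)
bro.refresh()

# Switch to the login iframe
bro.switch_to.frame("contentIframe")

# Locate the phone number field, the password field, and the login button
username_path = bro.find_element_by_xpath('//*[@class="phone-number-input"]')
password_path = bro.find_element_by_xpath('//*[@class="password"]')
login_in = bro.find_element_by_xpath('//*[@id="root"]/div/form/div/div[3]/span')
time.sleep(1)
username_path.send_keys("")  # your phone number goes here
time.sleep(1)
password_path.send_keys("")  # your password goes here

# Assumption: the original locates login_in but never clicks it and never
# leaves the iframe. Submit the form and return to the top-level document
# so the later XPath lookups are not scoped to the login iframe.
login_in.click()
bro.switch_to.default_content()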
import calendar

# Monthly date ranges from 2014-03 through 2021-10. The original hard-coded
# these lists with every month ending on day 31 (for example "2014-04-31"),
# which is not a valid date; the real month end is computed here instead.
# The original also listed, commented out, yearly ranges for 2008 and
# 2010-2013 and two ranges for early 2014.
start_time = []
end_time = []
for year in range(2014, 2022):
    for month in range(1, 13):
        if (year, month) < (2014, 3) or (year, month) > (2021, 10):
            continue
        last_day = calendar.monthrange(year, month)[1]
        start_time.append(f"{year}-{month:02d}-01")
        end_time.append(f"{year}-{month:02d}-{last_day:02d}")
for index, item in enumerate(start_time):
    print(index, item)
    time.sleep(10)
    # Open the advanced search panel
    gaojisousuo = bro.find_element_by_xpath('//*[@class="advenced-search"]')
    gaojisousuo.click()
    # Court name
    fayuanVal = bro.find_element_by_xpath('//*[@id="s2"]')
    fayuanVal.clear()
    fayuanVal.send_keys("晋州市人民法院")
    # Judgment date range: start and end of the month
    startTime = bro.find_element_by_xpath('//*[@id="cprqStart"]')
    startTime.clear()
    startTime.send_keys(item)
    endTime = bro.find_element_by_xpath('//*[@id="cprqEnd"]')
    endTime.clear()
    endTime.send_keys(end_time[index])
    sousuo = bro.find_element_by_xpath('//*[@id="searchBtn"]')
    time.sleep(5)
    sousuo.click()
    # The site can take up to a minute to return results
    time.sleep(60)

    # First check whether the query returned any results
    page_num_all = bro.find_element_by_xpath('//*[@id="_view_1545184311000"]/div[1]/div[2]/span')
    if page_num_all.text != '0':
        has_next = True
        page_num = 1
        while has_next:
            # Locate "select all" and "batch download", then click both
            all_select = bro.find_element_by_xpath('//*[@id="AllSelect"]')
            all_select.click()
            time.sleep(5)
            all_download = bro.find_element_by_xpath('//*[@id="_view_1545184311000"]/div[2]/div[4]/a[3]')
            all_download.click()
            time.sleep(5)

            # The last link in the pager is "next page"; it is disabled on the last page
            next_click = bro.find_element_by_xpath('//*[@id="_view_1545184311000"]/div[last()]/a[last()]')
            class_name = next_click.get_attribute('class')
            if class_name == 'disabled pageButton':
                has_next = False
            else:
                next_click.click()
                page_num += 1
                print(page_num)
                time.sleep(70)

The comments are a bit sparse; I wrote this for fun! The approach is the same as described above.
