An unconventional 91 download script, plus an introduction to requests-html, the new scraping library by Kenneth Reitz

wenguonideshou · 2018-3-1 11:41:01

Last edited by wenguonideshou on 2018-4-12 14:24

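As everyone knows, Kenneth Reitz, a heavyweight of the Python community, is the author of the famous requests library.
He recently released another big one: requests-html, a Python HTML parsing library (original tagline: Pythonic HTML Parsing for Humans™).
In just a few days it has collected 5000+ stars.
He is still updating it around the clock: the latest update was 3 hours ago, and the current version is 0.8.0.
GitHub: https://github.com/kennethreitz/requests-html
Docs: http://html.python-requests.org/
Note: Python 3.6+ only.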

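Brief introduction:
The library offers the following features:

Full JavaScript support (via pyppeteer)!
CSS selectors (via PyQuery)
XPath selectors
A mocked user-agent, so requests look like they come from a real browser
Automatic following of redirects
Connection pooling and cookie persistence (via requests.Session)
The same functions and parameters as requests
Thorough documentation and guides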

Installation:

$ pip install requests-html

Make a GET request to 'python.org':
>>> from requests_html import HTMLSession
>>> session = HTMLSession()
>>> r = session.get('https://python.org/')

All links on the current page, as written in the HTML (so they may be relative):
>>> r.html.links
{'//docs.python.org/3/tutorial/', '/about/apps/', 'https://github.com/python/pythondotorg/issues', '/accounts/login/', '/dev/peps/', '/about/legal/', '//docs.python.org/3/tutorial/introduction.html#lists', '/download/alternatives', 'http://feedproxy.google.com/~r/PythonInsider/~3/kihd2DW98YY/python-370a4-is-available-for-testing.html', '/download/other/', '/downloads/windows/', 'https://mail.python.org/mailman/listinfo/python-dev', '/doc/av', 'https://devguide.python.org/', '/about/success/#engineering', 'https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event', 'https://www.openstack.org', '/about/gettingstarted/', 'http://feedproxy.google.com/~r/PythonInsider/~3/AMoBel8b8Mc/python-3.html', '/success-stories/industrial-light-magic-runs-python/', 'http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator', '/', 'http://pyfound.blogspot.com/', '/events/python-events/past/', '/downloads/release/python-2714/', 'https://wiki.python.org/moin/PythonBooks', 'http://plus.google.com/+Python', 'https://wiki.python.org/moin/',
# the set is long; the middle part is omitted
'https://docs.python.org/faq/', '//docs.python.org/3/tutorial/controlflow.html#defining-functions'}

All links on the current page, as absolute URLs:
>>> r.html.absolute_links
{'https://github.com/python/pythondotorg/issues', 'https://docs.python.org/3/tutorial/', 'https://www.python.org/about/success/',
# the set is long; the middle part is omitted
'https://www.python.org/doc/essays/', 'http://www.djangoproject.com/', 'https://www.python.org/success-stories/industrial-light-magic-runs-python/', 'http://feedproxy.google.com/~r/PythonInsider/~3/x_c9D0S-4C4/python-370b1-is-now-available-for.html', 'http://wiki.python.org/moin/TkInter', 'https://www.python.org/jobs/', 'https://www.python.org/events/python-events/604/'}
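
The difference between `links` and `absolute_links` comes down to resolving each link against the page's URL. Below is a minimal standard-library sketch of that idea, not requests-html's actual implementation; the helper name `to_absolute` and the base URL are assumptions chosen for illustration.

```python
from urllib.parse import urljoin


def to_absolute(base_url, links):
    """Resolve every link against base_url, the way a browser would."""
    return {urljoin(base_url, link) for link in links}


# Assumed base URL; the three sample links are taken from the output above.
base = 'https://www.python.org/'
raw = {
    '/about/apps/',                   # path-relative
    '//docs.python.org/3/tutorial/',  # scheme-relative
    'https://wiki.python.org/moin/',  # already absolute
}
absolute = to_absolute(base, raw)
```

Path-relative links pick up the scheme and host of the base URL, scheme-relative links pick up only the scheme, and absolute links pass through unchanged.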

Select an element with a CSS selector:
>>> about = r.html.find('#about', first=True)

Get the element's text content:
>>> print(about.text)
About
Applications
Quotes
Getting Started
Help
Python Brochure

Get the element's attributes:
>>> about.attrs
{'id': 'about', 'class': ('tier-1', 'element-1'), 'aria-haspopup': 'true'}

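View the element's HTML source:
>>> about.html
# The original output was mangled by the forum software (tags were turned into [url=] BBCode). It is the raw markup of the #about element, whose links are labelled About, Applications, Quotes, Getting Started, Help and Python Brochure.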

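Select elements inside an element:
>>> about.find('a')
[<Element 'a' ...>, <Element 'a' ...>, <Element 'a' ...>, <Element 'a' ...>, <Element 'a' ...>, <Element 'a' ...>]
# (the attributes inside each repr were stripped by the forum software)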

Links inside the element, as absolute URLs:
>>> about.absolute_links
{'http://brochure.getpython.info/', 'https://www.python.org/about/gettingstarted/', 'https://www.python.org/about/', 'https://www.python.org/about/quotes/', 'https://www.python.org/about/help/', 'https://www.python.org/about/apps/'}

Search for text in the current page's HTML source:
>>> r.html.search('Python is a {} language')[0]
'programming'
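
To make the `{}` template idea concrete, here is a simplified standard-library sketch that turns each anonymous `{}` into a regex capture group. It is only an illustration under assumptions: the function name `search_template` and the sample text are made up, and it handles neither named fields like `{months}` nor a `{}` at the very end of the template.

```python
import re


def search_template(template, text):
    """Return the captured fields of the first match as a tuple, or None.

    Each '{}' in the template captures a non-greedy run of characters;
    everything else in the template must match literally.
    """
    literal_parts = [re.escape(part) for part in template.split('{}')]
    pattern = '(.+?)'.join(literal_parts)
    match = re.search(pattern, text)
    return match.groups() if match else None


# Sample text invented for this sketch.
text = 'Python is a programming language that lets you work quickly.'
result = search_template('Python is a {} language', text)
missing = search_template('Ruby is a {} language', text)
```

Indexing the result with `[0]` gives the first captured field, which mirrors the call shown above.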

A more complex CSS selector (copied from Chrome DevTools):
>>> r = session.get('https://github.com/')
>>> sel = 'body > div.application-main > div.jumbotron.jumbotron-codelines > div > div > div.col-md-7.text-center.text-md-left > p'
>>> print(r.html.find(sel, first=True).text)
GitHub is a development platform inspired by the way you work. From open source to business, you can host and review code, manage projects, and build software alongside millions of other developers.

XPath is supported too:
>>> r.html.xpath('a')
[]

Get text that is rendered by JavaScript:
>>> r = session.get('http://python-requests.org')
>>> r.html.render()
>>> r.html.search('Python 2 will retire in only {months} months!')['months']
'25'
# Note: the first time you call render(), Chromium is downloaded automatically into your home directory (the Users directory on Windows). This happens only once.

Parse a string directly, without making a web request through Requests:
>>> from requests_html import HTML
>>> doc = """<a href='https://httpbin.org'>"""
>>> html = HTML(html=doc)
>>> html.links
{'https://httpbin.org'}
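
For readers who want to see roughly what collecting links from an HTML string involves without installing anything, here is a hedged sketch using only the standard library's `html.parser`. It is not how requests-html works internally (that goes through PyQuery and lxml); the class name `LinkCollector` is an assumption for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag.
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.add(href)


doc = "<a href='https://httpbin.org'>"
collector = LinkCollector()
collector.feed(doc)
links = collector.links
```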

Enough talk, let's try it for real!
Version for crawling from an overseas VPS/server:

91.zip
(1.11 KB, downloads: 243)

Uploaded 2018-3-1 16:34

Version for crawling from your own machine with Shadowsocks (the "little airplane" client) running locally:

91_ss.zip
(1.15 KB, downloads: 216)

Uploaded 2018-3-1 16:34

Part of the code is taken from eqblog's crawler; many thanks for that.

Doesn't Python seem fun and powerful?

985464672 · 2018-3-8 13:06:09

C:\Users\mmmcz\Desktop\91>python 91.py
Traceback (most recent call last):
  File "91.py", line 85, in <module>
    main(4284)
  File "91.py", line 78, in main
    for url in page_url.html.absolute_links:
  File "D:\anaconda3\lib\site-packages\requests_html.py", line 329, in absolute_links
    return set(gen())
  File "D:\anaconda3\lib\site-packages\requests_html.py", line 326, in gen
    for link in self.links:
  File "D:\anaconda3\lib\site-packages\requests_html.py", line 300, in links
    return set(gen())
  File "D:\anaconda3\lib\site-packages\requests_html.py", line 291, in gen
    for link in self.find('a'):
  File "D:\anaconda3\lib\site-packages\requests_html.py", line 227, in find
    for found in self.pq(selector)
  File "D:\anaconda3\lib\site-packages\requests_html.py", line 124, in pq
    self._pq = PyQuery(self.html)
  File "D:\anaconda3\lib\site-packages\requests_html.py", line 90, in html
    return self.raw_html.decode(self.encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 10884: invalid continuation byte

What kind of error is this? It's the first time I've run into it.
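
A possible explanation, offered as an assumption since the thread does not confirm it: the traceback shows the page bytes being decoded as UTF-8, and a page that is actually served in another encoding (GBK is common on Chinese sites) or that contains stray bytes would fail in exactly this way. The sketch below reproduces the failure on sample GBK bytes and shows one way to decode with a fallback; the function name `decode_with_fallback` and the list of encodings are hypothetical choices.

```python
def decode_with_fallback(raw, encodings=('utf-8', 'gbk')):
    """Try each encoding in turn and return (text, encoding_used).

    If none of them decodes cleanly, fall back to UTF-8 with undecodable
    bytes replaced, so the caller always gets a string back.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return raw.decode('utf-8', errors='replace'), 'utf-8 (lossy)'


# Sample bytes: Chinese text encoded as GBK, which is not valid UTF-8.
raw = '视频列表'.encode('gbk')

try:
    raw.decode('utf-8')
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True  # same class of error as in the traceback

text, used = decode_with_fallback(raw)
```

In the script itself, one option along these lines would be to decode `r.content` yourself and hand the resulting string to `HTML(html=...)`, as in the string-parsing example from the first post.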

O2sun · 2018-3-1 11:43:11

Last edited by O2sun on 2018-3-1 11:49

Open source makes life better.

Sooner or later I'm going to end up a worn-out MJJ.

=============================

Tested it by hand and it works well. Thanks to both authors for sharing.

京东 · 2018-3-1 11:43:20

Impressive, full support for the expert.

ajun59420 · 2018-3-1 11:44:08

I can't follow it, but here's my support.

rolfzh · 2018-3-1 11:45:52

Impressive, full support for the expert.

coverme · 2018-3-1 11:46:52

Showing my support, thanks for sharing.

openos · 2018-3-1 11:48:14

Bookmarked, though the videos on 91 are too blurry.

hbjzpm · 2018-3-1 11:50:25

Notice: the author has been banned or deleted, so the content is hidden automatically.

醉里耍大刀 · 2018-3-1 11:56:02

Looks like I need to learn Python.
