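This post was last edited by wenguonideshou on 2018-4-12 14:24
As everyone knows, Kenneth Reitz, a heavyweight of the Python community, is the author of the famous requests library.
He recently released requests-html, an HTML parsing library for Python (original tagline: Pythonic HTML Parsing for Humans™).
Within a few days it had already collected 5000+ stars.
He is still updating it around the clock: at the time of writing, the latest commit was 3 hours old and the latest version was 0.8.0.
GitHub repository: https://github.com/kennethreitz/requests-html
Documentation: http://html.python-requests.org/
Note: only Python 3.6+ is supported.
Overview:
The library offers the following features:
Full JavaScript support (based on pyppeteer)!
CSS selectors (based on PyQuery)
XPath selectors
Spoofed user-agent
Automatic following of URL redirects
Connection pooling and cookie persistence (based on requests.Session)
The same functions and parameters as requests
Thorough documentation and guides
Installation: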
$ pip install requests-html
Send a GET request to 'python.org':
>>> from requests_html import HTMLSession
>>> session = HTMLSession()
>>> r = session.get('https://python.org/')
All hyperlinks on the current page, as written in the HTML (relative URLs are not resolved):
>>> r.html.links
{'//docs.python.org/3/tutorial/', '/about/apps/', 'https://github.com/python/pythondotorg/issues', '/accounts/login/', '/dev/peps/', '/about/legal/', '//docs.python.org/3/tutorial/introduction.html#lists', '/download/alternatives', 'http://feedproxy.google.com/~r/PythonInsider/~3/kihd2DW98YY/python-370a4-is-available-for-testing.html', '/download/other/', '/downloads/windows/', 'https://mail.python.org/mailman/listinfo/python-dev', '/doc/av', 'https://devguide.python.org/', '/about/success/#engineering', 'https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event', 'https://www.openstack.org', '/about/gettingstarted/', 'http://feedproxy.google.com/~r/PythonInsider/~3/AMoBel8b8Mc/python-3.html', '/success-stories/industrial-light-magic-runs-python/', 'http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator', '/', 'http://pyfound.blogspot.com/', '/events/python-events/past/', '/downloads/release/python-2714/', 'https://wiki.python.org/moin/PythonBooks', 'http://plus.google.com/+Python', 'https://wiki.python.org/moin/',
# the set is too long; middle part omitted
'https://docs.python.org/faq/', '//docs.python.org/3/tutorial/controlflow.html#defining-functions'}
All hyperlinks on the current page, resolved to absolute URLs:
>>> r.html.absolute_links
{'https://github.com/python/pythondotorg/issues', 'https://docs.python.org/3/tutorial/', 'https://www.python.org/about/success/',
# the set is too long; middle part omitted
'https://www.python.org/doc/essays/', 'http://www.djangoproject.com/', 'https://www.python.org/success-stories/industrial-light-magic-runs-python/', 'http://feedproxy.google.com/~r/PythonInsider/~3/x_c9D0S-4C4/python-370b1-is-now-available-for.html', 'http://wiki.python.org/moin/TkInter', 'https://www.python.org/jobs/', 'https://www.python.org/events/python-events/604/'}
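The difference between `links` and `absolute_links` comes down to resolving each href against the page URL. Below is a minimal standard-library sketch of that idea, using a few hrefs from the output above. The helper name `to_absolute` is hypothetical, the base URL assumes the request ended up at https://www.python.org/ after redirects, and this is not necessarily how requests-html implements it internally.

```python
from urllib.parse import urljoin


def to_absolute(base_url, hrefs):
    """Resolve every href against base_url and return the results as a set."""
    return {urljoin(base_url, href) for href in hrefs}


# Assumed final page URL after redirects (hypothetical for this sketch).
base = 'https://www.python.org/'

# Three hrefs taken from the r.html.links output: path-relative,
# scheme-relative, and already absolute.
raw_links = {
    '/about/apps/',
    '//docs.python.org/3/tutorial/',
    'https://devguide.python.org/',
}

absolute = to_absolute(base, raw_links)
for url in sorted(absolute):
    print(url)
```

Path-relative hrefs pick up the scheme and host of the page, scheme-relative hrefs (starting with `//`) pick up only the scheme, and absolute URLs pass through unchanged.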
Select an element with a CSS selector:
>>> about = r.html.find('#about', first=True)
Get the element's text content:
>>> print(about.text)
About
Applications
Quotes
Getting Started
Help
Python Brochure
Get the element's attributes:
>>> about.attrs
{'id': 'about', 'class': ('tier-1', 'element-1'), 'aria-haspopup': 'true'}
View the element's HTML source:
>>> about.html
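'<li aria-haspopup="true" class="tier-1 element-1 " id="about">\n<a class="" href="/about/" title="">About</a>\n<ul aria-hidden="true" class="subnav menu" role="menu">\n<li class="tier-2 element-1" role="treeitem"><a href="/about/apps/" title="">Applications</a></li>\n<li class="tier-2 element-2" role="treeitem"><a href="/about/quotes/" title="">Quotes</a></li>\n<li class="tier-2 element-3" role="treeitem"><a href="/about/gettingstarted/" title="">Getting Started</a></li>\n<li class="tier-2 element-4" role="treeitem"><a href="/about/help/" title="">Help</a></li>\n<li class="tier-2 element-5" role="treeitem"><a href="http://brochure.getpython.info/" title="">Python Brochure</a></li>\n</ul>\n</li>'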
Select elements within an element:
>>> about.find('a')
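[<Element 'a' href='/about/' title='' class=''>, <Element 'a' href='/about/apps/' title=''>, <Element 'a' href='/about/quotes/' title=''>, <Element 'a' href='/about/gettingstarted/' title=''>, <Element 'a' href='/about/help/' title=''>, <Element 'a' href='http://brochure.getpython.info/' title=''>]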
Hyperlinks inside the element (absolute URLs):
>>> about.absolute_links
{'http://brochure.getpython.info/', 'https://www.python.org/about/gettingstarted/', 'https://www.python.org/about/', 'https://www.python.org/about/quotes/', 'https://www.python.org/about/help/', 'https://www.python.org/about/apps/'}
Search for text within the current page's HTML source:
>>> r.html.search('Python is a {} language')[0]
programming
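The `{}` in the search template acts as a placeholder that captures whatever text sits in that position. To make the idea concrete, here is a rough standard-library sketch built on `re`. The function name `search_template` and the sample text are made up for illustration, it handles only anonymous `{}` fields, and it is a stand-in for the concept rather than the library's actual implementation.

```python
import re


def search_template(template, text):
    """Return the captured fields of the first match of template in text, or None.

    Each '{}' in the template becomes a lazy capture group; everything
    else is matched literally.
    """
    literal_parts = [re.escape(part) for part in template.split('{}')]
    pattern = '(.+?)'.join(literal_parts)
    match = re.search(pattern, text)
    return match.groups() if match else None


# Hypothetical page text, standing in for the HTML of python.org.
page_text = 'Python is a programming language that lets you work quickly.'

result = search_template('Python is a {} language', page_text)
print(result[0])
```

When the template is not found, the sketch returns `None`, so a real script should check the result before indexing into it.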
A more complex CSS selector (copied from Chrome DevTools):
>>> r = session.get('https://github.com/')
>>> sel = 'body > div.application-main > div.jumbotron.jumbotron-codelines > div > div > div.col-md-7.text-center.text-md-left > p'
>>> print(r.html.find(sel, first=True).text)
GitHub is a development platform inspired by the way you work. From open source to business, you can host and review code, manage projects, and build software alongside millions of other developers.
XPath is supported as well:
>>> r.html.xpath('a')
[]
Get text that is rendered by JavaScript:
>>> r = session.get('http://python-requests.org')
>>> r.html.render()
>>> r.html.search('Python 2 will retire in only {months} months!')['months']
'25'
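# Note: the first time render() is called, Chromium is downloaded automatically into your home directory (the Users directory on Windows); this happens only once
Parse a string directly, without using Requests to make a web request: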
>>> from requests_html import HTML
>>> doc = """<a href='https://httpbin.org'>"""
>>> html = HTML(html=doc)
>>> html.links
{'https://httpbin.org'}
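To show what "parsing a string without any HTTP request" amounts to, here is a standard-library sketch that collects hrefs from an HTML string. The class name `LinkCollector` is hypothetical, and the sample string is an assumption chosen to match the output shown above. requests-html itself builds on PyQuery (per the feature list) rather than on `html.parser`, so treat this only as an illustration of the idea.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.add(value)


# Assumed sample document containing a single link.
doc = """<a href='https://httpbin.org'>"""

collector = LinkCollector()
collector.feed(doc)
print(collector.links)
```

No session or network access is involved: the parser only ever sees the string it is given.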
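Enough talk; let's see it in action!
Scraping from an overseas VPS/server:

91.zip
(1.11 KB, uploaded 2018-3-1 16:34)
Scraping from your local machine with Shadowsocks running locally:

91_ss.zip
(1.15 KB, uploaded 2018-3-1 16:34)
Part of the code is borrowed from eqblog's crawler code; many thanks for that.

Doesn't Python feel fun and powerful?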