012: An introduction to pyquery, with hands-on scrapes of Qiushibaike and the Maoyan leaderboard
It has been a while since my last update. Lately I have been writing small crawler scripts with pyquery, and I think it is well worth recommending, so this post walks through pyquery's usage along with a few hands-on examples. The content is mostly code.
PyQuery is a powerful and flexible HTML parsing library. If you have any front-end experience you will have used jQuery; PyQuery is Python's faithful port of it. The syntax is almost identical to jQuery's, so there are no strange new method names to memorize.
PyQuery basics:
First, let's take a look.
1. Initializing from a string
from pyquery import PyQuery as pq
html = '''<div>
<ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul></div>'''
doc = pq(html)
print(doc)
print(type(doc))
print(doc('li'))
Output:
<div>
<ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
- - - - - - - - - - - - - -- - -- -
<class 'pyquery.pyquery.PyQuery'>
- - - - - - - - - - - - - -- - -- -
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
2. Opening an HTML file (mind the path)
from pyquery import PyQuery as pq
doc = pq(filename='index.html')
print(doc)
print(doc('head'))
Output:
<title>Title</title>
</head>
<body>
<div>
<ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul></div>
</body>
</html>
<head>
<meta charset="UTF-8"/>
<title>Title</title>
</head>
3. Loading a page from a URL
from pyquery import PyQuery as pq
import requests
# pq can also fetch a URL directly:
# doc1 = pq(url='https://www.baidu.com')
# print(doc1)
content = requests.get(url='https://www.baidu.com').content.decode('utf-8')
doc = pq(content)
print(doc('head'))
4. Finding elements with CSS selectors
from pyquery import PyQuery as pq
html = '''<div>
<ul id = 'haha'>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul></div>'''
doc = pq(html)
print(doc)
print("- - - - - - - - - - - - - - - - - -- - - - - - - - -- - - -")
# the span inside the a inside class item-0, under id haha (levels are separated by spaces)
print(doc('#haha .item-0 a span'))
Output:
<div>
<ul id="haha">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul></div>
- - - - - - - - - - - - - - - - - -- - - - - - - - -- - - -
<span class="bold">third item</span>
5. From an element you have already matched, you can search its children or parents directly, without starting over from the document root.
from pyquery import PyQuery as pq
html = '''<div class='content'>
<ul id = 'haha'>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul></div>'''
doc = pq(html)
item = doc('div ul')
print(item)
print("- ----------------------------------------")
# children() returns only the direct children of the ul, i.e. the li elements. The selector below keeps the children that have a class attribute; filtering on href would match nothing here, because children() covers direct children only, not all descendants.
print(item.children('[class]'))
Output:
<ul id="haha">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
- ----------------------------------------
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
6. Getting attribute values
from pyquery import PyQuery as pq
html = '''<div class='content'>
<ul id = 'haha'>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul></div>'''
doc = pq(html)
item = doc(".item-0.active a")
print(type(item))
print(item)
# two ways to read an attribute value
print(item.attr.href)
print(item.attr('href'))
Output:
Note that class="item-0 active" is a single class attribute with two values. In a pyquery selector a space means a descendant relationship, so "item-0 active a" would look for an active element under item-0 and then an a under that. The space must therefore be replaced with a dot: .item-0.active a.
<class 'pyquery.pyquery.PyQuery'>
<a href="link3.html"><span class="bold">third item</span></a>
link3.html
link3.html
7. Getting an element's text
from pyquery import PyQuery as pq
html = '''<div class='content'>
<ul id = 'haha'>
<li class="item-0">first item</li>
<li class="item-1">
<a href="link2.html">second item</a></li>
<li class="item-0 active">
<a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active">
<a href="link4.html">fourth item</a></li>
<li class="item-0">
<a href="link5.html">fifth item</a></li>
</ul></div>'''
doc = pq(html)
a = doc("a").text()
print(a)
# The result is interesting: text() collects the text of every matched element and joins it into one space-separated string
second item third item fourth item fifth item
Going further:
8. DOM manipulation
1. Adding and removing class attributes
from pyquery import PyQuery as pq
html = '''<div class='content'>
<ul id = 'haha'>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul></div>'''
doc = pq(html)
li = doc('.item-0.active')
print(li)
# remove the class 'active'
print(li.removeClass('active'))
# add the class 'haha'
print(li.addClass('haha'))
Output:
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 haha"><a href="link3.html"><span class="bold">third item</span></a></li>
Pretty handy, right?
2. attr and css
Adding an attribute and an inline style:
from pyquery import PyQuery as pq
html = '''<div class='content'>
<ul id = 'haha'>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul></div>'''
doc = pq(html)
li = doc('.item-0.active')
print(li)
print(li.attr('id','id_test'))
print(li.css('font-size','20px'))
Output:
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active" id="id_test"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active" id="id_test" style="font-size: 20px"><a href="link3.html"><span class="bold">third item</span></a></li>
3. Removing an element. While scraping, the content you extract often includes tags you do not want; you can strip them with a pattern like the one below.
from pyquery import PyQuery as pq
html = '''<div class='content'>
<ul id = 'haha'>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul></div>'''
doc = pq(html)
data = doc('.content')
print(data.text())
# remove all a tags
data.find('a').remove()
# print again
print(data.text())
Output:
first item second item third item fourth item fifth item
first item
Overview of commonly used methods:
Hands-on examples:
Scraping Tencent job listings:
import os
import random

import requests
from pyquery import PyQuery as pq

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
    'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36",
]
headers = {
    "User-Agent": random.choice(USER_AGENTS)
}

def req(kw, page):
    if not os.path.exists('content'):
        os.makedirs('content')
    for i in range(page):
        start = i * 10  # the site pages in steps of 10
        base_url = "https://hr.tencent.com/position.php?keywords={}&start={}"
        url = base_url.format(kw, start)
        response = requests.get(url=url, headers=headers).content.decode('utf-8')
        shuju(response, kw)

def shuju(response, kw):
    doc = pq(response)
    items = doc('table')
    tr = items('tr').text().split(" ")
    tr = tr[1:-2]  # drop the header row and the pager rows
    for data in tr:
        data = data.split("\n")
        print("Downloading - - - - - - - - -")
        try:
            cont = ("Position: " + data[0] + " Category: " + data[1] +
                    " Openings: " + data[2] + " Location: " + data[3] +
                    " Posted: " + data[4] + '\n')
            with open("content/%s.txt" % kw, 'a+', encoding='utf-8') as fp:
                fp.write(cont)
        except IndexError:
            pass

if __name__ == '__main__':
    kw = input("Enter a job keyword: ")
    page = int(input("Enter the number of pages: "))
    req(kw, page)
Scraping Qiushibaike:
from pyquery import PyQuery as pq
import requests, os

base_url = "https://www.qiushibaike.com/hot/page/1/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
}
response = requests.request('get', base_url, headers=headers).content.decode('utf-8')
doc = pq(response)
items = doc('#content-left')
name = items('h2').text().split(' ')
content = items('.content span').text().split(' ')
dianzan = items('.stats-vote i').text().split(' ')
pinglun = items('.stats-comments i').text().split(' ')
for i in range(25):
    data = ("Author: " + name[i] + " Content: " + content[i] + '\n' +
            " Votes: " + dianzan[i] + " Comments: " + pinglun[i] + '\n')
    print(data)
    with open("qiushi.txt", "a+", encoding='utf-8') as fp:
        fp.write(data)
# download the thumbnail images
img_all = items('.thumb a img')
img_list = []
for img in img_all:
    img_list.append(img.attrib['src'])
for j in img_list:
    img_url = "https:" + j
    print(img_url)
    img_data = requests.request('get', img_url, headers=headers).content
    img_name = img_url[-10:]
    kw = "qiushi"
    if not os.path.exists("./" + kw):
        os.mkdir("./" + kw)
    with open("./%s/%s" % (kw, img_name), "ab") as f:
        f.write(img_data)
Scraping the Maoyan leaderboard:
from pyquery import PyQuery as pq
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
}
url = "https://maoyan.com/board"
response = requests.request('get', url, headers=headers)
content_all = response.content.decode('utf-8')
doc = pq(content_all)
items = doc('div dd')
paiming = items('.board-index').text().split(" ")
name = items('.name').text().split(" ")
zhuyan = items('.star').text().split(" ")
daytime = items('.releasetime').text().split(" ")
pingfen = items('.score').text().split(" ")
img_url = items('.board-img')
print(name)
url_list = []
for img in img_url:
    url_list.append(img.attrib['data-src'])
for i in range(10):
    data = ("Rank: " + paiming[i] + " Movie: " + name[i] + " " + zhuyan[i] + " " +
            daytime[i] + " Score: " + pingfen[i] + " Image URL: " + url_list[i])
    with open('maoyan.txt', 'a+', encoding='utf-8') as fp:
        print(data)
        fp.write(data + '\n')
Source: blog.csdn.net, author: 考古学家lx. Copyright belongs to the original author; please contact the author before reposting.
Original link: blog.csdn.net/weixin_43582101/article/details/88047828