Python Web Scraping: Common XPath Usage Examples
Abstract: Common XPath and CSS selector extractions with Scrapy's Selector, demonstrated on a small HTML sample: element text, attribute values, substring matching with contains(), regular expressions via .re(), relative paths, and the difference between text(), //text(), and string().
# -*- coding: utf-8 -*-
html = """
<html>
<head>
<base href='http://example.com/' />
<title>Example website</title>
</head>
<body>
<div id='images'>
 <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
 <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
 <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
 <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
 <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
</div>
</body>
</html>
"""
from scrapy.selector import Selector
sel = Selector(text=html)
print("================title===============")
title_by_xpath = sel.xpath("//title//text()").extract_first()
print(title_by_xpath)
title_by_css = sel.css("title::text").extract_first()
print(title_by_css)
print("================href===============")
hrefs = sel.xpath("//a/@href").extract()
print(hrefs)
hrefs_by_css = sel.css("a::attr(href)").extract()
print(hrefs_by_css)
print("================img===============")
imgs = sel.xpath("//a[contains(@href, 'image')]/@href").extract()
print(imgs)
imgs_by_css = sel.css("a[href*=image]::attr(href)").extract()
print(imgs_by_css)
print("================src===============")
src = sel.xpath("//a[contains(@href, 'image')]/img/@src").extract()
print(src)
src_by_css = sel.css("a[href*=image] img::attr(src)").extract()
print(src_by_css)
print("================ re ===============")
text_by_re = sel.css("a[href*=image]::text").re(r"Name:\s*(.*)")
print(text_by_re)
print("================ xpath ===============")
div = sel.xpath("//div")  # select the <div> node (returns a SelectorList)
print(div)
a = div.xpath(".//a").extract()  # './/a' is relative to the current node: all <a> descendants
print(a)
print("================ text ===============")
text = '<a href="#">Click here to go to the <strong>Next Page</strong></a>'
sel1 = Selector(text=text)
# text nodes directly under <a> (excludes the text inside <strong>)
a = sel1.xpath("//a/text()").extract()
print(a)
# all text under <a>, including the <strong> child
a = sel1.xpath("//a//text()").extract()
print(a)
# string(//a) concatenates all text inside the first <a> into a single string
a = sel1.xpath("string(//a)").extract()
print(a)
# string(.) does the same, evaluated from the document root
a = sel1.xpath("string(.)").extract()
print(a)
# shorthand helper (recommended) to avoid repeating .extract()
xp = lambda x: sel.xpath(x).extract()
all_a = xp("//a/text()")
print(all_a)
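The extract() / extract_first() calls above are the legacy names; newer Scrapy/parsel releases also provide the equivalent getall() / get() methods. A minimal sketch of the same idea with the newer names, assuming a reasonably recent Scrapy installation (the doc string here is a shortened stand-in for the html sample above):

from scrapy.selector import Selector

doc = "<html><head><title>Example website</title></head><body><a href='image1.html'>Name: My image 1</a></body></html>"
sel2 = Selector(text=doc)
print(sel2.css("title::text").get())      # first match or None, same as extract_first()
print(sel2.xpath("//a/@href").getall())   # all matches as a list, same as extract()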
Source: pengshiyu.blog.csdn.net. Author: 彭世瑜. Copyright belongs to the original author; please contact the author before reprinting.
Original article: pengshiyu.blog.csdn.net/article/details/80364436