Parsing HTML and XML documents with the BeautifulSoup library

Posted by yd_221104950 on 2020/12/03 23:15:39


BeautifulSoup
Installation:

~/Desktop$ sudo pip install beautifulsoup4

Test:

from bs4 import BeautifulSoup

if __name__ == "__main__":
    # The first argument is the HTML document text; the second names the parser to use
    soup = BeautifulSoup('<p>data</p>', 'html.parser')
    print(soup.prettify())

Output:

<p>
 data
</p>

This shows that the installation succeeded.

The Beautiful Soup library, also known as bs4, is a library for parsing, traversing, and maintaining a "tag tree".
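The test above only parses a document. As a small sketch of what "maintaining" the tag tree can look like, the snippet below edits a parsed document in place. The markup and the variable names (`new_tag`, `result`) are made up for illustration and are not from the original article.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only to illustrate tree edits
soup = BeautifulSoup('<p class="old">data<span>extra</span></p>', "html.parser")

# Change an attribute value through dictionary-style access
soup.p["class"] = "new"

# Remove the <span> subtree from the tree
soup.span.decompose()

# Create a new tag and append it as the last child of <p>
new_tag = soup.new_tag("b")
new_tag.string = "bold"
soup.p.append(new_tag)

result = str(soup)
print(result)
```

The tree is modified in place, so `soup` reflects every edit as soon as it is made.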

Parsers supported by Beautiful Soup:

Parser	Usage	Requirement
Python's built-in HTML parser	BeautifulSoup(mk, 'html.parser')	pip install beautifulsoup4
lxml's HTML parser	BeautifulSoup(mk, 'lxml')	pip install lxml
lxml's XML parser	BeautifulSoup(mk, 'xml')	pip install lxml
html5lib parser	BeautifulSoup(mk, 'html5lib')	pip install html5lib
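Because lxml and html5lib are optional installs, a script may want to fall back to the built-in parser when they are missing. The following is a minimal sketch of one way to do that, relying on the `FeatureNotFound` exception that bs4 raises when a requested parser is unavailable. The helper name `make_soup` and its `preferred` parameter are hypothetical, not part of bs4.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib")):
    """Hypothetical helper: try optional parsers first, then fall back."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # The parser library is not installed; try the next one
            continue
    # html.parser ships with Python, so it is always available
    return BeautifulSoup(markup, "html.parser"), "html.parser"


soup, used = make_soup("<p>data</p>")
print(used, soup.p.string)
```

Different parsers can build slightly different trees from the same markup (lxml and html5lib wrap a bare fragment in html and body tags, for example), so a fallback like this suits cases where only the inner content matters.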

Basic elements of the BeautifulSoup class

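Element	Description
Tag	A tag, the most basic unit of information; its start and end are marked with <> and </>.
Name	The tag's name; the name of <p>…</p> is 'p'. Format: <tag>.name
Attributes	The tag's attributes, organized as a dictionary. Format: <tag>.attrs
NavigableString	The non-attribute string inside a tag, i.e. the string in <p>…</p>. Format: <tag>.string
Comment	The comment part of a string inside a tag; Comment is a special kind of NavigableString.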
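The element types in the table can be checked offline on a small inline document. This is a sketch only: the markup is invented for illustration, and the point is that `.string` returns a `Comment` for a tag whose only content is a comment, so the type has to be checked to tell the two apart.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical markup: one tag holding a comment, one holding plain text
html = '<b><!--This is a comment--></b><p class="title">This is not a comment</p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p
print(type(tag).__name__, tag.name, tag.attrs)

# .string gives the text inside the tag as a NavigableString
text = tag.string
print(type(text).__name__, text)

# For <b>, .string is a Comment, which is also a NavigableString
comment = soup.b.string
print(type(comment).__name__, comment)
print(isinstance(comment, NavigableString))
```

Note that `class` is a multi-valued attribute in HTML, so its entry in `attrs` is a list rather than a single string.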

Example:

import requests
from bs4 import BeautifulSoup


def handle_url(url):
    try:
        r = requests.get(url)
        r.raise_for_status()
        # requests assumes ISO-8859-1 for text responses that declare no charset,
        # so switch to the encoding detected from the content
        if r.encoding == 'ISO-8859-1':
            r.encoding = r.apparent_encoding
        demo = r.text
        soup = BeautifulSoup(demo, 'html.parser')
        # The page has many <a> tags, but soup.a returns only the first one
        print(soup.a)
        # <class 'bs4.element.Tag'>
        print(type(soup.a))
        # Tag name: a
        print(soup.a.name)
        # <class 'str'>
        print(type(soup.a.name))
        # Dictionary of the tag's attributes (key-value pairs)
        print(soup.a.attrs)
        # <class 'dict'>
        print(type(soup.a.attrs))
        # Value of the href attribute of the <a> tag
        print(soup.a.attrs['href'])
        # <class 'str'>
        print(type(soup.a.attrs['href']))
        # Content of the tag
        print(soup.a.string)
        # Parent element of the <a> tag
        print(soup.a.parent)
    except Exception as exc:
        print("fail fail fail", exc)


if __name__ == "__main__":
    url = "http://www.baidu.com"
    handle_url(url)

Traversing HTML elements with Beautiful Soup

HTML has a tree structure, so there are three kinds of traversal: downward, upward, and sideways.
Downward traversal:

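Attribute	Description
.contents	List of child nodes; all direct children of <tag> are stored in a list
.children	Iterator over the child nodes, used to loop over the direct children
.descendants	Iterator over the descendant nodes, covering all children, grandchildren, and so on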

import requests
from bs4 import BeautifulSoup


def handle_url(url):
    try:
        r = requests.get(url)
        r.raise_for_status()
        if r.encoding == 'ISO-8859-1':
            r.encoding = r.apparent_encoding
        demo = r.text
        soup = BeautifulSoup(demo, 'html.parser')
        print(soup.head)
        # Child nodes of the <head> tag
        print(soup.head.contents)
        # It is a list
        print(type(soup.head.contents))
        # Number of child nodes of <head> (5 when the article was written)
        print(len(soup.head.contents))
        # Take the fifth child node of <head>
        print(soup.head.contents[4])
        # Loop over the child nodes with .children
        for child in soup.head.children:
            print(child)
        # Loop over all descendant nodes with .descendants
        for child in soup.head.descendants:
            print(child)
    except Exception as exc:
        print("fail fail fail", exc)


if __name__ == "__main__":
    url = "http://www.baidu.com"
    handle_url(url)
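The output of the example above depends on whatever the live page currently serves. As a reproducible sketch of the same three attributes, the snippet below runs them on a small inline document. The markup and variable names are assumptions chosen for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup with two levels of nesting
html = "<div><p>one <b>two</b></p><p>three</p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# .contents is a list holding only the direct children
direct = div.contents
print(len(direct))

# .children iterates over the same direct children
child_names = [child.name for child in div.children]
print(child_names)

# .descendants walks the whole subtree in document order,
# yielding strings as well as tags
all_nodes = list(div.descendants)
tag_names = [node.name for node in all_nodes if isinstance(node, Tag)]
print(len(all_nodes), tag_names)
```

Text inside a tag counts as a node of its own, which is why `.descendants` yields more items than there are tags.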

Upward traversal:

Attribute	Description
.parent	The node's parent tag
.parents	Iterator over the node's ancestors, used to loop over the ancestor nodes

import requests
from bs4 import BeautifulSoup


def handle_url(url):
    try:
        r = requests.get(url)
        r.raise_for_status()
        if r.encoding == 'ISO-8859-1':
            r.encoding = r.apparent_encoding
        demo = r.text
        soup = BeautifulSoup(demo, 'html.parser')
        # The parent of the <html> tag is the BeautifulSoup object,
        # so this prints the whole document
        print(soup.html.parent)
        # soup itself is a special kind of tag node; its parent is None
        print(soup.parent)
        # Parent of the <title> tag
        print(soup.title.parent)
        # Loop over the ancestors of the <title> tag
        for parent in soup.title.parents:
            if parent is None:
                print(parent)
            else:
                print(parent.name)
    except Exception as exc:
        print("fail fail fail", exc)


if __name__ == "__main__":
    url = "http://www.baidu.com"
    handle_url(url)
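To make the ancestor chain concrete without a network request, here is a minimal sketch on an inline document. The markup is made up for illustration. The last ancestor yielded is the BeautifulSoup object itself, whose name is the special value `[document]`.

```python
from bs4 import BeautifulSoup

# Hypothetical minimal page
html = "<html><head><title>Demo</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

# .parent is the enclosing tag
parent_name = soup.title.parent.name
print(parent_name)

# .parents walks from the nearest ancestor up to the document root
ancestor_names = [ancestor.name for ancestor in soup.title.parents]
print(ancestor_names)

# The <html> tag hangs off the BeautifulSoup object, which has no parent
print(soup.html.parent is soup, soup.parent)
```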

Sideways (sibling) traversal: this only takes place among nodes under the same parent.

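Attribute	Description
.next_sibling	Returns the next sibling node in HTML text order
.previous_sibling	Returns the previous sibling node in HTML text order
.next_siblings	Iterator that returns all following sibling nodes in HTML text order
.previous_siblings	Iterator that returns all preceding sibling nodes, nearest first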

import requests
from bs4 import BeautifulSoup


def handle_url(url):
    try:
        r = requests.get(url)
        r.raise_for_status()
        if r.encoding == 'ISO-8859-1':
            r.encoding = r.apparent_encoding
        demo = r.text
        soup = BeautifulSoup(demo, 'html.parser')
        # Previous sibling of the <title> tag
        print(soup.title.previous_sibling)
        # Next sibling of the <link> tag
        print(soup.link.next_sibling)
        # Loop over all following siblings of the <meta> tag
        for sibling in soup.meta.next_siblings:
            print(sibling)
        # Loop over all preceding siblings of the <title> tag
        for sibling in soup.title.previous_siblings:
            print(sibling)
    except Exception as exc:
        print("fail fail fail", exc)


if __name__ == "__main__":
    url = "http://www.baidu.com"
    handle_url(url)
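One detail worth seeing on a fixed input is that a sibling is not always a tag: text between two tags is a node of its own. The sketch below uses invented markup to show this, and filters by type when only tags are wanted.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: text sits between the <a> and <b> tags
html = "<p><a>one</a> and <b>two</b><i>three</i></p>"
soup = BeautifulSoup(html, "html.parser")

# The next sibling of <a> is the text node, not <b>
after_a = soup.a.next_sibling
print(repr(after_a))

# Keep only the tags among the following siblings of <a>
following_tags = [s.name for s in soup.a.next_siblings if isinstance(s, Tag)]
print(following_tags)

# .previous_siblings yields the nearest sibling first
preceding_tags = [s.name for s in soup.i.previous_siblings if isinstance(s, Tag)]
print(preceding_tags)

# The last child has no next sibling
print(soup.i.next_sibling)
```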

Source: blog.csdn.net, author: WongKyunban. Copyright belongs to the original author; to repost, please contact the author.

Original link: blog.csdn.net/weixin_40763897/article/details/96497520
