Cleaning HTML tags with Python bs4

Use bs4 in Python to strip unwanted text and code out of HTML.
Import the bs4 module:

from bs4 import BeautifulSoup
...
...
soup = BeautifulSoup(a, 'lxml')  # a is the HTML string to clean; the 'lxml' parser needs the lxml package installed

Remove the style and script elements from the HTML:

# soup(...) is shorthand for soup.find_all(...); a list matches any of the given tag names
for s in soup(["style", "script"]):
    s.extract()
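Here is a small self-contained sketch of the same step, so it can be run without the elided setup. The sample markup is made up for illustration, and I use "html.parser" instead of 'lxml' only to avoid the extra dependency. It also uses decompose() rather than extract(): extract() returns the removed tag in case you still need it, while decompose() simply destroys it.

```python
from bs4 import BeautifulSoup

# Hypothetical sample input, made up for this illustration.
sample_html = (
    "<html><head><style>p{color:red}</style>"
    "<script>alert(1)</script></head>"
    "<body><p>Hello</p></body></html>"
)

soup = BeautifulSoup(sample_html, "html.parser")

# Passing a list matches any of the listed tag names.
for tag in soup(["style", "script"]):
    tag.decompose()  # destroy the tag together with its contents

cleaned = str(soup)
print(cleaned)
```

If you want to keep the removed tags around (for logging, say), swap decompose() back to extract() and collect the return values.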

Remove the element whose id is "media" and the elements whose class is "nav":

# select() takes a CSS selector; a comma-separated group matches either selector
for s in soup.select("#media, .nav"):
    s.extract()
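A runnable sketch of the selector-based removal, under the assumption that the page looks roughly like the made-up sample below. The id and class names come from the step above; everything else is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample input, made up for this illustration.
sample_html = (
    '<div id="media">video player</div>'
    '<ul class="nav"><li>Home</li></ul>'
    "<p>Body text</p>"
)

soup = BeautifulSoup(sample_html, "html.parser")

# One selector group covers both the id and the class.
for tag in soup.select("#media, .nav"):
    tag.extract()

cleaned = str(soup)
print(cleaned)
```

Any CSS selector that select() understands can go in the group, so more unwanted blocks can be added to the same string.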

Strip every attribute except src from the img tags:

for i in soup.find_all('img'):
    # Iterate over a copy of the attribute names, since we delete while looping
    for attr in list(i.attrs):
        if attr != 'src':
            del i[attr]
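The same idea can be wrapped in a small helper that keeps a whitelist of attributes. This is only a sketch: keep_only_attrs is a name I made up, not part of bs4, and the sample markup is invented.

```python
from bs4 import BeautifulSoup


def keep_only_attrs(soup, tag_name, allowed):
    """Hypothetical helper: drop every attribute not listed in `allowed`."""
    for tag in soup.find_all(tag_name):
        # Rebuild the attribute dict with only the allowed keys.
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in allowed}


# Hypothetical sample input, made up for this illustration.
sample_html = '<p><img src="a.png" width="100" class="big" alt="pic"></p>'

soup = BeautifulSoup(sample_html, "html.parser")
keep_only_attrs(soup, "img", {"src"})

print(soup.img.attrs)
```

Passing {"src", "alt"} instead would keep the alt text as well, which is usually worth doing if the HTML is going to be displayed again.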

Removing a parent node, method 1:

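In this example, we first use find() to locate the <div> tag whose class is "child", then get its parent node <div class="parent">, and finally call unwrap() on it. unwrap() removes only the parent tag itself; its children stay in the document.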

# Suppose we have the following HTML
html = """
<div class="parent">
    <div class="child">child</div>
    <p class="ez-toc-title">paragraph</p>
    <ul><li>list item</li><li>list item</li></ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
child_tag = soup.find("div", class_="child")
parent = child_tag.parent
parent.unwrap()  # remove the parent tag but keep its children
print(soup)
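To make it clearer what "removing the parent" means here, this sketch contrasts unwrap() with decompose() on the same made-up markup: the first keeps the children, the second throws them away along with the parent.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, made up for this illustration.
sample_html = (
    '<div class="parent"><div class="child">child</div>'
    "<p>paragraph</p></div>"
)

# unwrap(): the parent tag goes away, its children stay in place.
soup_unwrap = BeautifulSoup(sample_html, "html.parser")
soup_unwrap.find("div", class_="child").parent.unwrap()
unwrapped = str(soup_unwrap)

# decompose(): the parent is destroyed together with everything inside it.
soup_decompose = BeautifulSoup(sample_html, "html.parser")
soup_decompose.find("div", class_="child").parent.decompose()
decomposed = str(soup_decompose)

print(unwrapped)
print(repr(decomposed))
```

So unwrap() is the one to reach for when only the wrapper is unwanted and the content inside should survive.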

Removing a parent node, method 2 (personally I prefer this over the first method):

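In this example, we first select the parent tag to remove, then iterate over its children and collect each one as a string in a list, and finally join the list into a single string. Note that this leaves soup itself unchanged and produces a new string instead.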

# Suppose we have the following HTML
html = """
<div class="parent">
    <div class="child">child</div>
    <p class="ez-toc-title">paragraph</p>
    <ul><li>list item</li><li>list item</li></ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
content = soup.select('.parent')[0]
# str(): serialize each child node of content to a string
children_tag = []
for i in content.children:
    children_tag.append(str(i))
content = ''.join(children_tag)
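As a possible shortcut, bs4's Tag.decode_contents() returns the inner HTML of a tag as one string. The sketch below compares it with the join approach on made-up markup. For plain markup like this the two should agree; as far as I can tell they can differ when the parent directly contains comments or text with characters such as & or <, because str() on a text node returns the raw text without escaping.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, made up for this illustration.
sample_html = """
<div class="parent">
    <div class="child">child</div>
    <p>paragraph</p>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
parent = soup.select(".parent")[0]

# The join approach: serialize each direct child and concatenate.
joined = "".join(str(child) for child in parent.children)

# decode_contents(): serialize everything inside the tag in one call.
inner = parent.decode_contents()

print(inner)
```

Either way the result is a plain string, so parse it again with BeautifulSoup if more tag-level cleanup is needed afterwards.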

Removing comments:

from bs4 import BeautifulSoup, Comment
...
# string= is the current name of the older text= argument
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for comment in comments:
    comment.extract()
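A self-contained version of the comment removal, with made-up sample markup standing in for the elided setup:

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample input, made up for this illustration.
sample_html = "<div><!-- tracking note --><p>Visible</p><!-- another --></div>"

soup = BeautifulSoup(sample_html, "html.parser")

# Comment is a string subclass, so a string filter can pick comments out.
for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
    comment.extract()

cleaned = str(soup)
print(cleaned)
```

Ordinary text nodes are not instances of Comment, so the visible text is left alone.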

For other kinds of data processing, see the official documentation.
Some special cases may call for regular expressions or Python's str.replace() method.
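As one example of such a special case, here is a sketch of string-level cleanup applied after bs4 has done its work: replacing non-breaking spaces and collapsing the runs of blank lines that removed tags tend to leave behind. Which substitutions you actually need depends on the page; these two are just assumptions for illustration.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample input: a non-breaking space plus leftover blank lines.
sample_html = "<p>Hello\u00a0world</p>\n\n\n<p>Second</p>"

soup = BeautifulSoup(sample_html, "html.parser")
text = str(soup)

# str.replace() handles fixed substrings.
text = text.replace("\u00a0", " ")

# A regular expression handles patterns, here two or more newlines in a row.
text = re.sub(r"\n{2,}", "\n", text)

print(text)
```

I would keep this kind of string surgery as a last step, since anything that depends on tag structure is safer to do through bs4 itself.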

Reference: bs4 official documentation
