A Detailed Guide to BeautifulSoup, the Python Web-Scraping Tool

酒酿小小丸子 · Published 2023-07-12 10:52:37

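I. Module Overview

Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the parsed document, which can save a good deal of work time.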

II. Basic Usage

1. Install Beautiful Soup

pip install beautifulsoup4

2. Import the module

from bs4 import BeautifulSoup

3. Choose a parser and parse the content

soup = BeautifulSoup(markup, parser)    # markup: the content to parse; parser: the parser name

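Common parsers: html.parser, lxml, xml, html5lib

Some parsers have to be installed separately, for example: pip3 install lxml

Beautiful Soup supports the HTML parser in Python's standard library by default, and it also supports several third-party parsers: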

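Parser	Typical usage	Advantages	Disadvantages
Python standard library	`BeautifulSoup(markup, "html.parser")`	Built into Python; decent speed; lenient with malformed documents	Less lenient in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	`BeautifulSoup(markup, "lxml")`	Very fast; lenient with malformed documents	Requires an external C library
lxml XML parser	`BeautifulSoup(markup, "lxml-xml")` or `BeautifulSoup(markup, "xml")`	Very fast; the only XML parser currently supported	Requires an external C library
html5lib	`BeautifulSoup(markup, "html5lib")`	Most lenient; parses pages the way a web browser does; produces valid HTML5	Very slow; requires an external Python package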
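Because lxml and html5lib are optional installs, a script may want to fall back to the built-in parser when the preferred one is missing. The following is a minimal sketch and not part of the original article: `make_soup` is a hypothetical helper name, and the sketch assumes Beautiful Soup raises `FeatureNotFound` when the requested parser is not available.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser.

    make_soup is a hypothetical helper used only for illustration.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; use the standard-library one.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello</p>")
print(soup.p.string)
```

Naming the parser explicitly also keeps results consistent across machines, since different parsers can build different trees from the same malformed markup.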

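A few simple ways to navigate the parsed document:

# A Tag object corresponds to an HTML element in the document
soup.title                    # the whole title tag: <title>The Dormouse's story</title>
soup.title.name               # the name of the title tag: title
soup.title.parent.name        # the name of the title tag's parent: head
soup.p                        # the first p tag: <p class="title"><b>The Dormouse's story</b></p>
soup.p['class']               # the class attribute of the first p tag: ['title'] (a list, because class is multi-valued)
soup.p.get('class')           # same as above, but returns None if the attribute is missing
soup.a                        # the first a tag
soup.find_all('a')            # a list of all a tags
soup.find(id="link3")         # the first tag whose id attribute is link3
soup.a['class'] = "newClass"  # attributes and content can be modified
del soup.a['class']           # attributes can also be deleted
soup.find('a').get('id')      # the id attribute of the first a tag
soup.title.string             # the text inside the title tag: The Dormouse's story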
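The accessors above can be tried end to end with a small document. The sketch below assumes a shortened version of the "three sisters" sample that the comments above appear to refer to; the exact markup is an assumption, not taken from the article.

```python
from bs4 import BeautifulSoup

# Assumed sample document (shortened), used only for illustration.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

title_text = soup.title.string              # The Dormouse's story
parent_name = soup.title.parent.name        # head
first_p_class = soup.p["class"]             # ['title']
link_ids = [a.get("id") for a in soup.find_all("a")]  # ['link1', 'link2', 'link3']
link3_text = soup.find(id="link3").string   # Tillie

# Attributes behave like dictionary entries: assign to change, del to remove.
soup.a["class"] = "newClass"
changed_class = soup.a["class"]             # newClass
del soup.a["class"]
has_class = "class" in soup.a.attrs         # False

print(title_text, parent_name, first_p_class, link_ids, link3_text)
```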

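III. Practical Usage

1. Find tags that have specific attributes

Method 1: match a single attribute
soup.find_all('div',id="even")            # all div tags with id="even"
soup.find_all('div',attrs={'id':"even"})    # same result as above

Method 2: match several attributes
soup.find_all('div',id="even",class_="square")            # all div tags with id="even" and class="square"
soup.find_all('div',attrs={"id":"even","class":"square"})    # same result as above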
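To see that the keyword form and the `attrs` dictionary form select the same tags, here is a small sketch. The HTML snippet is invented for illustration (it repeats an id only to mirror the calls above), and the keyword is spelled `class_` because `class` is a reserved word in Python.

```python
from bs4 import BeautifulSoup

# Invented sample markup for illustration.
html = (
    '<div id="even" class="square">a</div>'
    '<div id="even" class="round">b</div>'
    '<div id="odd" class="square">c</div>'
)
soup = BeautifulSoup(html, "html.parser")

# One attribute: keyword argument versus attrs dictionary.
by_keyword = soup.find_all("div", id="even")
by_attrs = soup.find_all("div", attrs={"id": "even"})

# Two attributes: every filter must match.
both_keyword = soup.find_all("div", id="even", class_="square")
both_attrs = soup.find_all("div", attrs={"id": "even", "class": "square"})

print([d.get_text() for d in by_keyword])    # ['a', 'b']
print([d.get_text() for d in both_keyword])  # ['a']
```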

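2. Get a tag's attribute values

Method 1: subscript access
for link in soup.find_all('a'):
    print(link['href'])        # raises KeyError if href is missing; link.get('href') returns None instead

Method 2: through the attrs dictionary
for link in soup.find_all('a'):
    print(link.attrs['href'])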
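Real pages often contain `a` tags without an `href`, in which case subscript access raises `KeyError`. The sketch below, using an invented snippet, shows two ways to handle that: `get()` returns `None` for a missing attribute, and passing `href=True` to `find_all` keeps only tags that have the attribute.

```python
from bs4 import BeautifulSoup

# Invented sample markup: the second link has no href attribute.
html = '<a href="/one">One</a><a name="anchor">No href</a><a href="/two">Two</a>'
soup = BeautifulSoup(html, "html.parser")

# get() never raises; a missing attribute comes back as None.
hrefs_with_gaps = [a.get("href") for a in soup.find_all("a")]

# href=True filters to tags that actually carry the attribute,
# so subscript access is safe inside the comprehension.
hrefs_only = [a["href"] for a in soup.find_all("a", href=True)]

print(hrefs_with_gaps)  # ['/one', None, '/two']
print(hrefs_only)       # ['/one', '/two']
```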

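3. Get the text inside a tag

divs = soup.find_all('div')        # all div tags
for div in divs:                   # loop over each div
    a = div.find_all('a')[0]       # the first a tag inside the div (raises IndexError if there is none)
    print(a.string)                # the a tag's text (None if the tag has more than one child)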
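Two details of the loop above are easy to trip over: a `div` may contain no `a` tag at all, and `.string` is `None` when a tag has more than one child. The following sketch, on an invented snippet, uses `find()` (which returns `None` instead of raising) and compares `.string` with `get_text()`.

```python
from bs4 import BeautifulSoup

# Invented sample markup: plain link, link with nested markup, no link.
html = (
    '<div><a href="/x">plain</a></div>'
    '<div><a href="/y">with <b>bold</b></a></div>'
    '<div>no link here</div>'
)
soup = BeautifulSoup(html, "html.parser")

results = []
for div in soup.find_all("div"):
    a = div.find("a")          # None when the div has no a tag
    if a is None:
        continue
    # .string is only set when the tag has a single child;
    # get_text() always concatenates every descendant string.
    results.append((a.string, a.get_text()))

print(results)  # [('plain', 'plain'), (None, 'with bold')]
```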

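If a result does not print as expected (for example, because it is a generator such as the one returned by stripped_strings), convert it to a list with list().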

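4. stripped_strings

stripped_strings yields the strings inside a tag with leading and trailing whitespace (such as \n line breaks) removed, and skips strings that consist only of whitespace.

divs = soup.find_all('div')
for div in divs:
    infos = list(div.stripped_strings)        # strip spaces, line breaks, and similar whitespace
    print(infos)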
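To make the effect visible, the sketch below compares `.strings` with `.stripped_strings` on an invented snippet that contains indentation and line breaks.

```python
from bs4 import BeautifulSoup

# Invented sample markup with whitespace between and inside tags.
html = "<div>\n  <h2> Title </h2>\n  <p>First line</p>\n  <p>  Second line\n</p>\n</div>"
soup = BeautifulSoup(html, "html.parser")

# .strings keeps every text node, including whitespace-only ones.
raw = list(soup.div.strings)

# .stripped_strings strips each string and drops the empty ones.
clean = list(soup.div.stripped_strings)

print(len(raw))  # 7
print(clean)     # ['Title', 'First line', 'Second line']
```

Both attributes are generators, which is why the result is wrapped in `list()` before printing.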

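IV. Output

1. Pretty-printing with prettify()

The prettify() method formats the Beautiful Soup parse tree and returns it as a Unicode string, with each XML/HTML tag and each string on its own line.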

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html5lib")    # html5lib (pip install html5lib) adds the html/head/body wrapper shown below
soup.prettify()
# '<html>\n <head>\n </head>\n <body>\n  <a href="http://example.com/">\n...'
print(soup.prettify())
# <html>
#  <head>
#  </head>
#  <body>
#   <a href="http://example.com/">
#    I linked to
#    <i>
#     example.com
#    </i>
#   </a>
#  </body>
# </html>
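Because prettify() inserts line breaks and indentation, it is meant for reading the tree, not for producing markup to save. A minimal sketch of the difference, assuming the built-in `html.parser` (which does not add an html/body wrapper to a fragment):

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")

# str() serializes the tree without adding any whitespace.
compact = str(soup)

# prettify() puts every tag and every string on its own line.
pretty = soup.prettify()
pretty_lines = pretty.splitlines()

print(compact)
print(pretty)
```

Here `compact` is identical to the input string, while `pretty` spreads the same tree over several indented lines.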

2. get_text()

If you only want the text inside a tag, call get_text(). It collects all the text in the tag, including the text of its descendants, and returns the result as a single Unicode string:

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")
soup.get_text()        # '\nI linked to example.com\n'
soup.i.get_text()      # 'example.com'
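get_text() also accepts a separator string and a `strip` flag, which help when the joined text runs words together or carries stray line breaks. A short sketch on the same kind of markup:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

# Join the individual strings with a separator.
joined = soup.get_text("|")

# strip=True trims each string and drops the ones that become empty.
joined_stripped = soup.get_text("|", strip=True)

print(repr(joined))           # '\nI linked to |example.com|\n'
print(repr(joined_stripped))  # 'I linked to|example.com'
```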
