Python Web Crawler (5): Web Page Parsers

待我称王封你为后i 2021-09-27 04:36

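A web page parser is a tool that extracts the valuable data from a web page.  
![Web page parser][70]  
Python offers four kinds of web page parser:  
1 Regular expressions: fuzzy-match parsing  
2 html.parser: structured parsing  
3 Beautiful Soup: structured parsing  
4 lxml: structured parsing  
Of these, Beautiful Soup is the most capable: it can use either html.parser or lxml as its underlying parser.  
Structured parsing loads the whole page into a DOM (Document Object Model) tree and then navigates that tree by tag, attribute and text.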

![DOM tree][70 1]
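To make the difference between fuzzy matching and structured parsing concrete, here is a minimal sketch that uses only the standard library. The sample HTML and the `LinkCollector` class are made up for this illustration and are not part of the original tutorial.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page, used only for this illustration.
html = '<div><a href="/a.html">A</a> <a class="x" href="/b.html">B</a></div>'

# 1. Regular expression: fuzzy matching on the raw text.
regex_links = re.findall(r'href="([^"]+)"', html)


# 2. html.parser: structured parsing driven by tag events.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


collector = LinkCollector()
collector.feed(html)
parser_links = collector.links

print(regex_links)
print(parser_links)
```

On this small input both approaches find the same links. The regular expression only looks at the text, so it would also match an `href="..."` that appears inside a comment or a script, while the parser only reports attributes of real `<a>` tags.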

Installing Beautiful Soup:  
pip install beautifulsoup4

![Figure][70 2]

Beautiful Soup syntax:  
![Beautiful Soup syntax][70 3]

The find\_all method returns every node that meets the search criteria.  
The find method returns only the first node that meets them.
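A minimal sketch of the difference, assuming a made-up list snippet. It also shows what each method returns when nothing matches, which is easy to overlook.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for this illustration.
html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = BeautifulSoup(html, "html.parser")

all_items = soup.find_all("li")       # list-like result holding every <li>
first_item = soup.find("li")          # only the first <li>
missing = soup.find("table")          # no match: find returns None
no_matches = soup.find_all("table")   # no match: find_all returns an empty result

print(len(all_items))
print(first_item.get_text())
print(missing, len(no_matches))
```

Because find can return None, check the result before reading attributes from it.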

About nodes:  
![Node overview][70 4]

1. Creating a BeautifulSoup object  
![Creating a BeautifulSoup object][70 5]
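As a sketch of this step, the helper below builds the object with lxml when it is installed and falls back to the built-in html.parser otherwise. The name `make_soup` and the sample markup are hypothetical. The fallback policy is one possible choice, not something the original tutorial prescribes.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical markup, used only for this illustration.
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"


def make_soup(markup):
    """Prefer lxml when available, otherwise use the built-in html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup(html)
print(soup.title.get_text())
print(soup.p.get_text())
```

The from\_encoding argument only matters when the markup is passed as bytes. For a str such as the one above it can be left out.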

2. Searching for nodes  
![Searching for nodes][70 6]

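One powerful feature of Beautiful Soup is that a compiled regular expression can be passed in place of a literal value, so a tag name or attribute is matched by pattern.  
The argument is spelled class\_ with a trailing underscore because class is a reserved keyword in Python.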
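Both points can be sketched as follows, assuming a small made-up page. The `attrs` dictionary form is shown as an alternative spelling of the class filter.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
html = """
<p class="intro">Intro</p>
<a class="nav" href="/news/2021.html">News</a>
<a class="nav" href="/about.html">About</a>
<a href="/news/2020.html">Archive</a>
"""
soup = BeautifulSoup(html, "html.parser")

# A compiled pattern is matched against each href value.
news_links = soup.find_all("a", href=re.compile(r"^/news/"))

# class_ avoids the clash with Python's class keyword.
nav_links = soup.find_all("a", class_="nav")

# Equivalent filter written with an attrs dictionary.
nav_links_alt = soup.find_all("a", attrs={"class": "nav"})

print([a["href"] for a in news_links])
print([a.get_text() for a in nav_links])
```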

3. Accessing node information

![Accessing node information][70 7]
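A minimal sketch of reading a node's name, attributes and text, assuming a made-up link element:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
html = '<a href="/docs/index.html" class="ref external" id="doc">Read the docs</a>'
soup = BeautifulSoup(html, "html.parser")
node = soup.find("a")

print(node.name)          # tag name
print(node["href"])       # attribute by key; raises KeyError if it is absent
print(node.get("title"))  # .get returns None for a missing attribute
print(node["class"])      # class is multi-valued, so it comes back as a list
print(node.get_text())    # text content of the node
```

Use `node.get(...)` rather than `node[...]` when an attribute may be missing from some of the matched tags.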

Worked example:

    import re

    from bs4 import BeautifulSoup

    # Sample document to parse.
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

    # html_doc is a str, so from_encoding (which only applies to bytes) is omitted.
    soup = BeautifulSoup(html_doc, 'html.parser')

    print('All links')
    links = soup.find_all('a')
    for link in links:
        print(link.name, link['href'], link.get_text())

    print('The Lacie link')
    link_nodes = soup.find_all('a', href='http://example.com/lacie')
    for link in link_nodes:
        print(link.name, link['href'], link.get_text())

    print('Regular expression match')
    link_nodes = soup.find_all('a', href=re.compile(r'ill'))
    for link in link_nodes:
        print(link.name, link['href'], link.get_text())

    print('The title paragraph')
    p_nodes = soup.find_all('p', class_='title')
    for node in p_nodes:
        print(node.name, node.get_text())

Learned from: 慕课网 (imooc).

[70]: /images/20210923/45e1e2229a8e4a3f910f441d4d891c44.png
[70 1]: /images/20210923/23ed69c6ffa243309d5b46ee80ac3083.png
[70 2]: /images/20210923/95dadbb786bf4c26aae658a8e160d649.png
[70 3]: /images/20210923/4bebe9efb6ac4f83870463ec21554926.png
[70 4]: /images/20210923/183609c1c0b2424795ae901dbd3cc8b3.png
[70 5]: /images/20210923/aad7ed1ea7ad43439c60247f21e3b479.png
[70 6]: /images/20210923/b0df0a9c0ee943ecac38869f4d55e7c6.png
[70 7]: /images/20210923/a1631df252584da1a479d0ff12c69778.png