An Introduction to goose, an Article Content Extraction Library

电玩女神 2022-05-06 06:52

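Anyone who has written a web crawler knows the two main headaches of scraping data:

1.  The site's **anti-scraping** mechanisms. You have to disguise yourself as "a real person" as best you can to get past the server's anti-bot checks.
2.  The site's **content extraction**. Every site needs its own handling, and whenever a site is redesigned your code has to be updated along with it.

There is no shortcut for the first point; you build experience by seeing enough of the tricks. For the second, today we introduce a small tool that may save you a good deal of work in certain scenarios.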

## **Goose** ##

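**Goose** is an **article content extractor**. It can extract the **article body** from any news or article-style web page, along with the **title, tags, summary, images, and videos**, and it **supports Chinese** pages. It was originally written in Java by [http://Gravity.com][http_Gravity.com]; python-goose is a rewrite in Python.

With this library you can get the body text straight from the pages you have crawled, with no need to pick through the markup piece by piece with bs4 or regular expressions.

Project pages:  
(py2) [https://github.com/grangier/python-goose][https_github.com_grangier_python-goose]  
(py3) [https://github.com/goose3/goose3][https_github.com_goose3_goose3]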

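## **Installation** ##

The **python-goose** project that most online tutorials refer to currently supports only up to Python 2.7. It can be installed with pip:

    pip install goose-extractor

Or install from source, following the instructions on the project page:

    mkvirtualenv --no-site-packages goose
    git clone https://github.com/grangier/python-goose.git
    cd python-goose
    pip install -r requirements.txt
    python setup.py install

I also found a Python 3 version, **goose3**:

    pip install goose3

In some simple tests of my own, I found no major difference between the results of the two versions.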

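## **Quick Start** ##

The examples here use goose3. For python-goose, just change goose3 to goose in the imports; the interfaces are the same. As a demonstration, the target is an article I published earlier, [How to Scrape Videos of Girls on Douyin with Python][Python].

    from goose3 import Goose
    from goose3.text import StopWordsChinese
    # Initialize, enabling Chinese word segmentation
    g = Goose({'stopwords_class': StopWordsChinese})
    # Article URL
    url = 'http://zhuanlan.zhihu.com/p/46396868'
    # Fetch and extract the article
    article = g.extract(url=url)
    # Title
    print('Title:', article.title)
    # Body text
    print(article.cleaned_text)

Output: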

![v2-8eca35a24febc806fd023b09f13759a9_b.jpg][]

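Besides the title **title** and the body text **cleaned\_text**, you can also get some extra information, for example:

*  **meta\_description**: summary
*  **meta\_keywords**: keywords
*  **tags**: tags
*  **top\_image**: main image
*  **infos**: a dict containing all of the information
*  **raw\_html**: the raw HTML text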
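As a quick illustration of these fields, here is a minimal sketch that gathers them into a plain dict. `summarize_article` is a helper name made up for this example, not part of goose3; it only reads the attributes listed above, so it works on any object that exposes them. It also assumes that the image object in `top_image` keeps its URL in `.src`. The `demo` function assumes goose3 is installed and the URL is reachable.

```python
from types import SimpleNamespace


def summarize_article(article, preview_chars=80):
    """Collect commonly used fields of an extracted article into a dict.

    Hypothetical helper for this example; it reads only the attributes
    listed above.
    """
    top_image = getattr(article, 'top_image', None)
    return {
        'title': article.title,
        'description': article.meta_description,
        'keywords': article.meta_keywords,
        'tags': list(article.tags or []),
        # Assumption: the image object (if any) exposes its URL as .src
        'top_image': getattr(top_image, 'src', None),
        'preview': article.cleaned_text[:preview_chars],
    }


def demo(url):
    """Extract a live page. Assumes goose3 is installed and the URL is reachable."""
    from goose3 import Goose
    from goose3.text import StopWordsChinese

    g = Goose({'stopwords_class': StopWordsChinese})
    return summarize_article(g.extract(url=url))


# Offline check with a stand-in object, so no network is needed
fake = SimpleNamespace(title='Demo', meta_description='', meta_keywords='',
                       tags=[], top_image=None, cleaned_text='Body text')
print(summarize_article(fake))
```

To run it against a real page, call `demo('http://zhuanlan.zhihu.com/p/46396868')`.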

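If a site restricts automated crawlers, you can also set a **user-agent** as needed:

    g = Goose({'browser_user_agent': 'Version/5.1.2 Safari/534.52.7'})

goose3 uses the **requests** library as its request module, so with goose3 you can also configure **headers, proxies**, and similar properties in much the same way.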
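Here is a minimal sketch of such a configuration. The option names `http_headers` and `http_proxies` are my understanding of goose3's configuration attributes rather than something stated in this article, and `build_config` is a hypothetical helper, so check both against the goose3 version you have installed. The proxy address is a placeholder.

```python
def build_config(user_agent, proxy=None, extra_headers=None):
    """Build a settings dict to pass to Goose(...).

    Hypothetical helper. The keys 'http_headers' and 'http_proxies' are
    assumed to match goose3's configuration attributes; verify them
    against your installed version.
    """
    config = {'browser_user_agent': user_agent}
    if extra_headers:
        config['http_headers'] = dict(extra_headers)
    if proxy:
        # requests-style mapping: URL scheme -> proxy address
        config['http_proxies'] = {'http': proxy, 'https': proxy}
    return config


config = build_config(
    'Version/5.1.2 Safari/534.52.7',
    proxy='http://127.0.0.1:8080',  # placeholder address
    extra_headers={'Accept-Language': 'zh-CN,zh;q=0.9'},
)
print(config)

# With goose3 installed, the dict is passed in the same way as before:
#     from goose3 import Goose
#     g = Goose(config)
```

Building the dict in one place makes it easy to print and inspect the settings before creating the extractor, which helps if a mistyped option name appears to have no effect.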

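The `StopWordsChinese` class used in the example above is a Chinese word segmenter. It can improve recognition accuracy on Chinese articles to some extent, at the cost of extra time.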

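## **Other Notes** ##

1.  Convenient as Goose is, it cannot guarantee accurate extraction on every site, so it is best **suited to large-scale article collection** such as trend tracking or public-opinion analysis, where an occasional miss is tolerable. It only offers a probabilistic guarantee that most sites will be extracted reasonably accurately. In my experiments, English sites fare better than Chinese ones, mainstream sites better than niche ones, and text extraction better than image extraction.

2.  The project's **requirements.txt** shows that goose uses **Pillow, lxml, cssselect, jieba, beautifulsoup, and nltk**, and goose3 additionally uses **requests**. Many of our earlier articles and projects have touched on these:

    [This Man Makes Your Crawler Development 8x More Efficient][8]  
    [Programming Class: jieba, a Handy Tool for Chinese Word Segmentation][jieba-]

3.  If you use the python2-based goose, you may run into **encoding** problems (especially on Windows). We have covered this before; reply with the keyword **编码** ("encoding") in the official account chat to find it.

4.  Besides goose, there are other body-text extraction libraries worth trying, such as **python-boilerpipe** and **python-readability**.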

## **Example** ##

Finally, let's use goose3 to write a short script that automatically scrapes news articles from **ifanr (爱范儿), Leiphone (雷锋网), and DoNews**:

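    import os

    from bs4 import BeautifulSoup
    from goose3 import Goose
    from goose3.text import StopWordsChinese

    g = Goose({'stopwords_class': StopWordsChinese})
    urls = [
        'https://www.ifanr.com/',
        'https://www.leiphone.com/',
        'http://www.donews.com/'
    ]
    # Make sure the output directory exists before writing to it
    out_dir = 'homework/goose'
    os.makedirs(out_dir, exist_ok=True)

    # Collect candidate article links from each home page
    url_articles = []
    for url in urls:
        page = g.extract(url=url)
        soup = BeautifulSoup(page.raw_html, 'lxml')
        for a in soup.find_all('a'):
            link = a.get('href')
            # Crude heuristic: absolute links containing a digit are treated as article pages
            if link and link.startswith('http') and any(c.isdigit() for c in link) and link not in url_articles:
                url_articles.append(link)
                print(link)

    # Extract each candidate and save the ones with enough body text
    for url in url_articles:
        try:
            article = g.extract(url=url)
            content = article.cleaned_text
            if len(content) > 200:
                title = article.title
                print(title)
                # A '/' in the title would be read as a path separator
                filename = title.replace('/', '_') + '.txt'
                with open(os.path.join(out_dir, filename), 'w', encoding='utf-8') as f:
                    f.write(content)
        except Exception as e:
            # Skip pages that fail to download or parse, but report them
            print('skipped', url, e)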

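This program does the following:

1.  Fetches each site's home page
2.  Pulls out the links on the page whose address contains a digit (article pages almost always do; this test is used to keep the demo simple)
3.  Fetches those links and extracts the body text. If the result is longer than 200 characters, it is saved to a file

Result: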

![v2-052de20a0ae0e7781f5a7eceb99bfb82_b.jpg][]
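The link test in step 2 can also be pulled out into a small standalone function, which makes it easier to tighten later. The sketch below is a variation on the script, not the same rule: it resolves relative links against the page URL and looks for digits only in the URL path, so a digit in the domain name does not count. The function names and the `example.com` addresses are made up for illustration.

```python
from urllib.parse import urljoin, urlparse


def looks_like_article(link):
    """Crude rule: an http(s) link whose path contains a digit."""
    parsed = urlparse(link)
    if parsed.scheme not in ('http', 'https'):
        return False
    return any(c.isdigit() for c in parsed.path)


def collect_article_links(base_url, hrefs):
    """Resolve hrefs against base_url and keep unique article-like links, in order."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:
            continue
        link = urljoin(base_url, href)
        if looks_like_article(link) and link not in seen:
            seen.add(link)
            result.append(link)
    return result


hrefs = ['/p/12345', 'https://example.com/about', None,
         'javascript:void(0)', 'https://example.com/p/12345']
print(collect_article_links('https://example.com/', hrefs))
```

In the script above, `hrefs` would come from `a.get('href')` for each `<a>` tag that BeautifulSoup finds.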

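From here you can keep improving the program: have it continually look for new addresses and scrape the articles, then run follow-up processing on the collected text such as word-frequency counts or word clouds, much like our earlier analysis [Data Analysis: What Does Zhao Lei Sing About When He Sings Folk Songs?][Link 1]. With further polish, I am sure you can build something even more interesting.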
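As a starting point for the word-frequency step, here is a minimal sketch using only the standard library. `word_frequencies` is a hypothetical helper, and its default tokenizer simply splits on runs of word characters, which is adequate for English but not for Chinese.

```python
import re
from collections import Counter


def word_frequencies(texts, tokenize=None, min_len=2, top_n=10):
    """Count the most common tokens across a collection of article texts."""
    if tokenize is None:
        def tokenize(text):
            # Fallback tokenizer: runs of word characters, lowercased
            return re.findall(r'\w+', text.lower())
    counter = Counter()
    for text in texts:
        counter.update(tok for tok in tokenize(text) if len(tok) >= min_len)
    return counter.most_common(top_n)


texts = ['Goose extracts article text.', 'The article text is cleaned by goose.']
print(word_frequencies(texts, top_n=3))
```

For Chinese articles, and assuming jieba is installed, you could pass `tokenize=jieba.lcut` so that the text is segmented into words before counting. The saved `.txt` files from the script above can be read back in as `texts`.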

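The code has been uploaded. To get the link, reply with the keyword **goose** in the WeChat official account (**Crossin的编程教室**).

You are welcome to search for and follow **Crossin的编程教室** on WeChat.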

![154816tsh373rusgwbxs7w.png][]

[http_Gravity.com]: https://link.zhihu.com/?target=http://Gravity.com
[https_github.com_grangier_python-goose]: https://link.zhihu.com/?target=https://github.com/grangier/python-goose
[https_github.com_goose3_goose3]: https://link.zhihu.com/?target=https://github.com/goose3/goose3
[Python]: http://zhuanlan.zhihu.com/p/46396868
[v2-8eca35a24febc806fd023b09f13759a9_b.jpg]: http://crossin-forum.b0.upaiyun.com/upload-by-code/20181013/v2-8eca35a24febc806fd023b09f13759a9_b.jpg
[8]: http://zhuanlan.zhihu.com/p/38466193
[jieba-]: https://link.zhihu.com/?target=http://mp.weixin.qq.com/s?__biz=MjM5MDEyMDk4Mw==&mid=2650166445&idx=2&sn=4af384e6ad4ca33d76a3e4f93cba736b&chksm=be4b59d5893cd0c3b2454a644e4d2b43d2c5ff95a463b260201c01206a5b58bcc662dda48f0d&scene=21#wechat_redirect
[v2-052de20a0ae0e7781f5a7eceb99bfb82_b.jpg]: http://crossin-forum.b0.upaiyun.com/upload-by-code/20181013/v2-052de20a0ae0e7781f5a7eceb99bfb82_b.jpg
[Link 1]: http://zhuanlan.zhihu.com/p/25109074
[154816tsh373rusgwbxs7w.png]: http://crossin-forum.b0.upaiyun.com/forum/201806/28/154816tsh373rusgwbxs7w.png