Counting Word Frequencies in Chinese or English Text with Python (Beginner-Friendly)

比眉伴天荒 2023-02-27 05:45

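Which words appear in an article? Which of them appear most often?  
Is the text Chinese or English?  
English test text: Hamlet: [https://python123.io/resources/pye/hamlet.txt][https_python123.io_resources_pye_hamlet.txt]  
Chinese test text: Romance of the Three Kingdoms: [https://python123.io/resources/pye/threekingdoms.txt][https_python123.io_resources_pye_threekingdoms.txt]  
1. Counting words in Hamlet with Python  
2. Counting words in Romance of the Three Kingdoms with Python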

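**1. First, processing English text:**  
English text has to be denoised and normalized before counting: besides words it contains punctuation and other symbols, and the same word can appear in both upper and lower case.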

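    def getText():
        # Read the whole file; hamlet.txt is expected in the working directory
        with open("hamlet.txt", "r") as f:
            txt = f.read()
        txt = txt.lower()       # normalize: convert all uppercase letters to lowercase
        for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':  # "\" is the escape character, so "\\" is a literal backslash
            txt = txt.replace(ch, " ")       # replace each punctuation mark with a space
        return txt

    hamletTxt = getText()
    words = hamletTxt.split()       # split on whitespace, giving a list of words
    counts = {}                     # dictionary mapping each word to its count
    for word in words:
        counts[word] = counts.get(word, 0) + 1   # increment the count for this word
    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)  # sort by count, descending
    for i in range(10):             # print the ten most frequent words
        word, count = items[i]
        print("{0:<10}{1:>5}".format(word, count))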

Output:  
![Screenshot: program output][watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQ1NjI0OTg5_size_16_color_FFFFFF_t_70]
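
The same lowercase, strip punctuation, count pipeline can also be sketched with the standard library's `re` and `collections.Counter`. This is an alternative under stated assumptions, not the method used above: the function name `top_words` and the sample sentence are made up for illustration, and the regular expression keeps apostrophes inside words (so "don't" stays one word), which differs slightly from replacing punctuation with spaces.

```python
import re
from collections import Counter

def top_words(text, n=10):
    """Return the n most common words in text as (word, count) pairs."""
    # Lowercase first, then extract runs of letters; an apostrophe is kept
    # only when it sits between letters (assumption: English text).
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(words).most_common(n)

# Hypothetical sample text, used instead of hamlet.txt so the sketch runs anywhere
sample = "To be, or not to be: that is the question."
for word, count in top_words(sample, 3):
    print("{0:<10}{1:>5}".format(word, count))
```

To run it on the real file, the contents of hamlet.txt can be read into a string and passed to `top_words` in place of `sample`.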

**2. Processing Chinese text:**  
Instructions for installing the jieba library are given at the end of this article.

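    import jieba

    # threekingdoms.txt is expected in the working directory
    with open("threekingdoms.txt", "r", encoding="utf-8") as f:
        txt = f.read()
    words = jieba.lcut(txt)         # segment the text into a list of words
    counts = {}
    for word in words:
        if len(word) == 1:          # skip single-character tokens such as punctuation
            continue
        counts[word] = counts.get(word, 0) + 1
    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)  # sort by count, descending
    for i in range(15):             # print the 15 most frequent words
        word, count = items[i]
        print("{0:<10}{1:>5}".format(word, count))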

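Output:  
![Screenshot: program output][watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQ1NjI0OTg5_size_16_color_FFFFFF_t_70 1]  
To count how often each character (person) appears, we need to remove words that are not names and merge different names for the same person: for example, the tokens 玄德 and 玄德曰 both refer to 刘备 (Liu Bei).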

Improved code:

    import jieba

    # Frequent words that are not character names
    excludes = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}
    with open("threekingdoms.txt", "r", encoding="utf-8") as f:
        txt = f.read()
    words = jieba.lcut(txt)
    counts = {}
    for word in words:
        if len(word) == 1:
            continue
        # Merge different names for the same person into one key
        elif word == "诸葛亮" or word == "孔明曰":
            rword = "孔明"
        elif word == "关公" or word == "云长":
            rword = "关羽"
        elif word == "玄德" or word == "玄德曰":
            rword = "刘备"
        elif word == "孟德" or word == "丞相":
            rword = "曹操"
        else:
            rword = word
        counts[rword] = counts.get(rword, 0) + 1
    for word in excludes:
        counts.pop(word, None)      # pop avoids a KeyError if the word never occurred
    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)
    for i in range(10):
        word, count = items[i]
        print("{0:<10}{1:>5}".format(word, count))
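
As the list of aliases grows, the `elif` chain gets long. One possible restructuring, shown here only as a sketch, is to keep the aliases in a dictionary and look each token up. The alias pairs and excluded words are the ones from the code above; the names `ALIASES`, `EXCLUDES`, and `count_names`, and the hand-written token list, are assumptions made for illustration. The function takes a list of tokens, so it does not need jieba to run.

```python
from collections import Counter

# Alias table: each variant maps to the name it should be counted under
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}
# Frequent words that are not character names
EXCLUDES = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}

def count_names(tokens, aliases=ALIASES, excludes=EXCLUDES):
    """Count tokens of two or more characters, merging aliases and skipping excluded words."""
    counts = Counter()
    for token in tokens:
        if len(token) == 1:
            continue                      # punctuation and single-character words
        name = aliases.get(token, token)  # fall back to the token itself
        if name in excludes:
            continue
        counts[name] += 1
    return counts

# Hand-written tokens standing in for the output of a segmenter (hypothetical sample)
tokens = ["玄德曰", "：", "孔明", "何在", "？", "却说", "诸葛亮", "见", "玄德"]
result = count_names(tokens)
for word, count in result.most_common(3):
    print("{0:<10}{1:>5}".format(word, count))
```

With jieba installed, `count_names(jieba.lcut(txt))` would apply the same logic to the full novel. Skipping excluded words during counting, rather than deleting them afterwards, means a word that never occurs cannot cause an error.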

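Installing the jieba library:  
Press Win+R  
Type cmd to open the command prompt  
Type pip install jieba and wait for the installation to finish  
![Screenshot: installing jieba from the command line][watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQ1NjI0OTg5_size_16_color_FFFFFF_t_70 2]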
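
After installing, it can be handy to confirm that Python actually sees the package before running the scripts. The following is a minimal sketch using the standard library's `importlib.util.find_spec`; the helper name `is_installed` is made up for illustration.

```python
import importlib.util

def is_installed(package_name):
    """Return True if a top-level package with this name can be found by the import system."""
    return importlib.util.find_spec(package_name) is not None

# Prints True once "pip install jieba" has succeeded for this Python interpreter
print("jieba installed:", is_installed("jieba"))
```

If this prints False even though pip reported success, pip may have installed the package for a different Python interpreter than the one running the script.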

[https_python123.io_resources_pye_hamlet.txt]: https://python123.io/resources/pye/hamlet.txt
[https_python123.io_resources_pye_threekingdoms.txt]: https://python123.io/resources/pye/threekingdoms.txt
[watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQ1NjI0OTg5_size_16_color_FFFFFF_t_70]: /images/20230209/7f1fc5e1a348454381ff42c65ae7ed07.png
[watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQ1NjI0OTg5_size_16_color_FFFFFF_t_70 1]: /images/20230209/4c09b402eead43f89606444143275d8e.png
[watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQ1NjI0OTg5_size_16_color_FFFFFF_t_70 2]: /images/20230209/0fc86ec4ffc740c0b6c9496ad2289aa4.png