Counting English Word Occurrences in a Plain-Text File with Python: A Summary of Methods [Tested and Working]

Published: 2020-12-16 21:17:28 | Category: Python | Source: compiled from the web

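This article walks through several ways to count how many times each English word occurs in a plain-text file using Python. They are shared here for reference; the details follow.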

Version 1: inefficient

#!python3
# -*- coding:utf-8 -*-
path = 'test.txt'
with open(path, encoding='utf-8', newline='') as f:
  word = []
  words_dict = {}
  # scan the file one character at a time
  for letter in f.read():
    if letter.isalnum():
      word.append(letter)
    elif letter.isspace():  # whitespace: space, \t, \n
      if word:
        word = ''.join(word).lower()  # convert to lowercase
        if word not in words_dict:
          words_dict[word] = 1
        else:
          words_dict[word] += 1
        word = []
# handle the last word (the file may not end with whitespace)
if word:
  word = ''.join(word).lower()  # convert to lowercase
  if word not in words_dict:
    words_dict[word] = 1
  else:
    words_dict[word] += 1
  word = []
for k, v in words_dict.items():
  print(k, v)

Output:

we 4
are 1
busy 1
all 1
day 1
like 1
swarms 1
of 6
flies 1
without 1
souls 1
noisy 1
restless 1
unable 1
to 1
hear 1
the 7
voices 1
soul 1
as 1
time 1
goes 1
by 1
childhood 1
away 2
grew 1
up 1
years 1
a 1
lot 1
memories 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence 1
regardless 1
shackles 1
mind 1
indulge 1
in 1
world 1
buckish 1
focus 1
on 1
beneficial 1
principle 1
lost 1
themselves 1
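
Version 1 ends a word only when it meets whitespace, so punctuation that touches the next word directly (for example `day,like`) would glue the two words together. As a minimal sketch, not the article's own method, the same character scan can be wrapped in a function that treats any non-alphanumeric character as a separator. The name `count_words_by_char` and the sample string are hypothetical, chosen only for illustration.

```python
from typing import Dict, Iterable


def count_words_by_char(chars: Iterable[str]) -> Dict[str, int]:
    """Count lowercase words by scanning characters one at a time.

    Assumption for this sketch: every non-alphanumeric character
    (whitespace or punctuation) ends the current word.
    """
    counts: Dict[str, int] = {}
    current = []
    for ch in chars:
        if ch.isalnum():
            current.append(ch.lower())
        elif current:
            # a separator was reached: store the finished word
            word = ''.join(current)
            counts[word] = counts.get(word, 0) + 1
            current = []
    if current:
        # flush the last word when the input does not end with a separator
        word = ''.join(current)
        counts[word] = counts.get(word, 0) + 1
    return counts


# hypothetical sample; f.read() from an open file could be passed the same way
for word, n in count_words_by_char("We are busy all day,like swarms; we grew up").items():
    print(word, n)
```

One trade-off of this choice is that apostrophes also split words, so a contraction such as `don't` would be counted as `don` and `t`.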

Version 2:

Drawback: the whole file is read into memory at once, so performance suffers on large files

#!python3
# -*- coding:utf-8 -*-
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
  data = f.read()
  word_reg = re.compile(r'\w+')
  #word_reg = re.compile(r'\w+\b')
  word_list = word_reg.findall(data)
  word_list = [word.lower() for word in word_list]  # convert to lowercase
  word_set = set(word_list)  # avoid counting the same word repeatedly
  # words_dict = {}
  # for word in word_set:
  #   words_dict[word] = word_list.count(word)
  # concise form (note: each count() call rescans the whole list)
  words_dict = {word: word_list.count(word) for word in word_set}
  for k, v in words_dict.items():
    print(k, v)

Output:

on 1
also 1
souls 1
focus 1
soul 1
time 1
noisy 1
grew 1
lot 1
childish 1
like 1
voices 1
indulge 1
swarms 1
buckish 1
restless 1
we 4
hear 1
childhood 1
as 1
world 1
themselves 1
are 1
bottom 1
memories 1
the 7
of 6
flies 1
without 1
have 2
day 1
busy 1
to 1
eroded 1
regardless 1
unable 1
innocence 1
up 1
a 1
in 1
mind 1
goes 1
by 1
lost 1
principle 1
once 1
away 2
years 1
beneficial 1
all 1
shackles 1
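
Version 2 pays twice: `f.read()` loads the entire file, and `word_list.count(word)` rescans the full word list once per distinct word. The sketch below shows one possible way around both, assuming the same `\w+` pattern: it reads line by line and updates a dictionary in a single pass. The names `WORD_RE`, `count_words`, and `sample` are hypothetical and introduced only for this example.

```python
import re
from typing import Dict, Iterable

WORD_RE = re.compile(r'\w+')  # same pattern as version 2


def count_words(lines: Iterable[str]) -> Dict[str, int]:
    """Count lowercase words in one pass, holding only one line at a time."""
    counts: Dict[str, int] = {}
    for line in lines:
        for match in WORD_RE.finditer(line):
            word = match.group().lower()
            counts[word] = counts.get(word, 0) + 1
    return counts


# in-memory stand-in for a file; an open file object iterates over lines the same way
sample = [
    "We are busy all day, like swarms of flies\n",
    "we grew up, years away a lot of memories\n",
]
for word, n in count_words(sample).items():
    print(word, n)
```

With a real file, the call would look like `count_words(f)` inside a `with open(path, encoding='utf-8') as f:` block.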

Version 3:

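#!python3
# -*- coding:utf-8 -*-
import re
path = 'test.txt'
with open(path, encoding='utf-8') as f:
  word_list = []
  word_reg = re.compile(r'\w+')
  for line in f:  # read line by line instead of loading the whole file
    #line_words = word_reg.findall(line)
    # simpler than the regex above, but punctuation stays attached and case is kept
    line_words = line.split()
    word_list.extend(line_words)
  word_set = set(word_list)  # avoid counting the same word repeatedly
  words_dict = {word: word_list.count(word) for word in word_set}
  for k, v in words_dict.items():
    print(k, v)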

Output:

childhood 1
innocence, 1
are 1
of 6
also 1
lost 1
We 1
regardless 1
noisy, 1
by, 1
on 1
themselves. 1
grew 1
lot 1
bottom 1
buckish, 1
time 1
childish 1
voices 1
once 1
restless, 1
shackles 1
world 1
eroded 1
As 1
all 1
day, 1
swarms 1
we 3
soul. 1
memories, 1
in 1
without 1
like 1
beneficial 1
up, 1
unable 1
away 1
flies 1
goes 1
a 1
have 2
away, 1
mind, 1
focus 1
principle, 1
hear 1
to 1
the 7
years 1
busy 1
souls, 1
indulge 1
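
Because `split()` only breaks on whitespace, the tokens in this output keep their trailing punctuation (for example `soul.`), and `We` and `we` are counted as different words. One possible cleanup, sketched here under the assumption that stripping ASCII punctuation from both ends of each token is enough, is shown below. The helper name `normalized_tokens` and the sample lines are hypothetical.

```python
import string
from typing import Iterable, List


def normalized_tokens(lines: Iterable[str]) -> List[str]:
    """Split on whitespace, then strip edge punctuation and lowercase each token."""
    tokens = []
    for line in lines:
        for raw in line.split():
            word = raw.strip(string.punctuation).lower()
            if word:  # skip tokens that were punctuation only
                tokens.append(word)
    return tokens


# in-memory stand-in for the lines of a file
print(normalized_tokens([
    "We are busy all day, like swarms.\n",
    "As time goes by, we grew up.\n",
]))
```

`str.strip` touches only the ends of a token, so punctuation in the middle of a token (a comma with no following space, say) would still leave two words joined; the regex from version 2 handles that case.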

Version 4: counting with Counter

#!python3
# -*- coding:utf-8 -*-
import collections
path = 'test.txt'
with open(path, encoding='utf-8') as f:
  word_list = []
  for line in f:  # read line by line
    line_words = line.split()  # whitespace split; punctuation and case are kept
    word_list.extend(line_words)
  words_dict = dict(collections.Counter(word_list))  # count with Counter
  for k, v in words_dict.items():
    print(k, v)

Output:

We 1
are 1
busy 1
all 1
day, 1
like 1
swarms 1
of 6
flies 1
without 1
souls, 1
noisy, 1
restless, 1
unable 1
to 1
hear 1
the 7
voices 1
soul. 1
As 1
time 1
goes 1
by, 1
childhood 1
away, 1
we 3
grew 1
up, 1
years 1
away 1
a 1
lot 1
memories, 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence, 1
regardless 1
shackles 1
mind, 1
indulge 1
in 1
world 1
buckish, 1
focus 1
on 1
beneficial 1
principle, 1
lost 1
themselves. 1
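
`collections.Counter` can also be updated incrementally, which avoids building the intermediate `word_list`, and its `most_common()` method returns the entries sorted by count. The following is a rough sketch of that variant, combined with the `\w+` pattern and lowercasing from version 2; `WORD_RE`, `count_with_counter`, and `sample` are hypothetical names used only here.

```python
import collections
import re
from typing import Iterable

WORD_RE = re.compile(r'\w+')


def count_with_counter(lines: Iterable[str]) -> collections.Counter:
    """Feed each line's lowercase words into a Counter, one line at a time."""
    counter = collections.Counter()
    for line in lines:
        counter.update(word.lower() for word in WORD_RE.findall(line))
    return counter


# in-memory stand-in for a file; an open file object can be passed instead
sample = [
    "the voices of the soul\n",
    "the bottom of the childish innocence\n",
]
# print the two most frequent words with their counts
for word, n in count_with_counter(sample).most_common(2):
    print(word, n)
```

Returning the `Counter` itself rather than converting it to a `dict` keeps `most_common()` available to the caller.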

Note: the test file test.txt used here contains the following text:

We are busy all day, like swarms of flies without souls, noisy, restless, unable to hear the voices of the soul. As time goes by, childhood away, we grew up, years away a lot of memories, once have also eroded the bottom of the childish innocence, we regardless of the shackles of mind, indulge in the world buckish, focus on the beneficial principle, we have lost themselves.

PS: two related counting tools are recommended here for reference:

Online word count tool:
http://tools.aspzz.cn/code/zishutongji

Online character counting and editing tool:
http://tools.aspzz.cn/code/char_tongji

I hope this article is helpful for your Python programming.

(Editor: 李大同)
