
Python: A Summary of Methods for Counting English Word Occurrences in a Plain-Text File [Tested]

Published: 2020-12-16 21:17:28 | Category: Python | Source: compiled from the web

This article describes several ways to count how often each English word occurs in a plain-text file using Python. The details are as follows:

Version 1: inefficient (scans the file one character at a time)

# -*- coding:utf-8 -*-
#!python3
path = 'test.txt'
with open(path,encoding='utf-8',newline='') as f:
  word = []
  words_dict= {}
  for letter in f.read():
    if letter.isalnum():
      word.append(letter)
    elif letter.isspace(): # whitespace: space, \t, \n
      if word:
        word = ''.join(word).lower() # convert to lowercase
        if word not in words_dict:
          words_dict[word] = 1
        else:
          words_dict[word] += 1
        word = []
    # punctuation is neither alphanumeric nor whitespace, so it is skipped
# handle the last word (the file may not end with whitespace)
if word:
  word = ''.join(word).lower() # convert to lowercase
  if word not in words_dict:
    words_dict[word] = 1
  else:
    words_dict[word] += 1
  word = []
for k,v in words_dict.items():
  print(k,v)

Output:

we 4
are 1
busy 1
all 1
day 1
like 1
swarms 1
of 6
flies 1
without 1
souls 1
noisy 1
restless 1
unable 1
to 1
hear 1
the 7
voices 1
soul 1
as 1
time 1
goes 1
by 1
childhood 1
away 2
grew 1
up 1
years 1
a 1
lot 1
memories 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence 1
regardless 1
shackles 1
mind 1
indulge 1
in 1
world 1
buckish 1
focus 1
on 1
beneficial 1
principle 1
lost 1
themselves 1
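
Version 1 repeats the same counting code for the final word. The sketch below shows one possible way to avoid that duplication by moving the character scan into a generator. The names `iter_words` and `count_words` are hypothetical and not part of the original code. Unlike version 1, this sketch assumes that any non-alphanumeric character, not only whitespace, ends the current word.

```python
def iter_words(chars):
    """Yield lowercase words from an iterable of characters."""
    buf = []
    for ch in chars:
        if ch.isalnum():
            buf.append(ch)
        elif buf:
            # Assumption: any non-alphanumeric character ends the current word
            yield ''.join(buf).lower()
            buf = []
    if buf:
        # Flush the last word when the text does not end with a separator
        yield ''.join(buf).lower()


def count_words(chars):
    """Count words with a plain dict; dict.get supplies 0 for unseen words."""
    counts = {}
    for word in iter_words(chars):
        counts[word] = counts.get(word, 0) + 1
    return counts


sample = "We are busy all day, like swarms of flies. we ARE"
for k, v in count_words(sample).items():
    print(k, v)
```

Because a string is itself an iterable of characters, the result of `f.read()` could be passed to `count_words` directly.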

Version 2:

Drawback: the whole file is read into memory at once, so it performs poorly on large files.

# -*- coding:utf-8 -*-
#!python3
import re
path = 'test.txt'
with open(path,'r',encoding='utf-8') as f:
  data = f.read()
  word_reg = re.compile(r'\w+')
  #word_reg = re.compile(r'\w+\b')
  word_list = word_reg.findall(data)
  word_list = [word.lower() for word in word_list] # convert to lowercase
  word_set = set(word_list) # avoid counting the same word more than once
  # words_dict = {}
  # for word in word_set:
  #   words_dict[word] = word_list.count(word)
  # concise form
  words_dict = {word: word_list.count(word) for word in word_set}
  for k,v in words_dict.items():
    print(k,v)

Output:

on 1
also 1
souls 1
focus 1
soul 1
time 1
noisy 1
grew 1
lot 1
childish 1
like 1
voices 1
indulge 1
swarms 1
buckish 1
restless 1
we 4
hear 1
childhood 1
as 1
world 1
themselves 1
are 1
bottom 1
memories 1
the 7
of 6
flies 1
without 1
have 2
day 1
busy 1
to 1
eroded 1
regardless 1
unable 1
innocence 1
up 1
a 1
in 1
mind 1
goes 1
by 1
lost 1
principle 1
once 1
away 2
years 1
beneficial 1
all 1
shackles 1
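
Version 2 loads the whole file with `f.read()` and then calls `word_list.count(word)` once per distinct word, so the list is rescanned many times. A minimal sketch of one alternative is to feed the regex matches of each line into a `collections.Counter`. The function name `count_words_stream` is hypothetical, and an `io.StringIO` object stands in for the opened file so that the example runs on its own.

```python
import io
import re
from collections import Counter

WORD_RE = re.compile(r'\w+')  # \w also matches digits and underscores


def count_words_stream(lines):
    """Count lowercase words from any iterable of lines, one line at a time."""
    counts = Counter()
    for line in lines:
        # Counter.update adds one to the count of every item it is given
        counts.update(word.lower() for word in WORD_RE.findall(line))
    return counts


# io.StringIO stands in for an opened text file in this sketch
fake_file = io.StringIO("We are busy all day,\nwe are noisy.\n")
counts = count_words_stream(fake_file)
print(counts['we'], counts['are'], counts['day'])
```

Iterating over a real file object yields lines in the same way, so only one line needs to be held in memory at a time.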

Version 3:

# -*- coding:utf-8 -*-
#!python3
import re
path = 'test.txt'
with open(path,encoding='utf-8') as f:
  word_list = []
  word_reg = re.compile(r'\w+')
  for line in f:
    #line_words = word_reg.findall(line)
    # simpler than the regex above, but keeps punctuation and case
    line_words = line.split()
    word_list.extend(line_words)
  word_set = set(word_list) # avoid counting the same word more than once
  words_dict = {word: word_list.count(word) for word in word_set}
  for k,v in words_dict.items():
    print(k,v)

Output:

childhood 1
innocence, 1
are 1
of 6
also 1
lost 1
We 1
regardless 1
noisy, 1
by, 1
on 1
themselves. 1
grew 1
lot 1
bottom 1
buckish, 1
time 1
childish 1
voices 1
once 1
restless, 1
shackles 1
world 1
eroded 1
As 1
all 1
day, 1
swarms 1
we 3
soul. 1
memories, 1
in 1
without 1
like 1
beneficial 1
up, 1
unable 1
away 1
flies 1
goes 1
a 1
have 2
away, 1
mind, 1
focus 1
principle, 1
hear 1
to 1
the 7
years 1
busy 1
souls, 1
indulge 1
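
Because `line.split()` splits only on whitespace, the output above treats `We` and `we`, and `away` and `away,`, as different words. One possible fix, sketched here with a hypothetical helper named `normalize`, is to strip ASCII punctuation from both ends of each token and lowercase it before counting.

```python
import string


def normalize(token):
    """Strip leading/trailing ASCII punctuation and convert to lowercase."""
    return token.strip(string.punctuation).lower()


line = "As time goes by, childhood away, we grew up, years away"
# Drop tokens that become empty after stripping (pure punctuation)
words = [w for w in (normalize(t) for t in line.split()) if w]
print(words.count('away'))
```

Note that `str.strip` only removes characters at the ends of a token, so an apostrophe or hyphen inside a word is kept.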

Version 4: counting with Counter

# -*- coding:utf-8 -*-
#!python3
import collections
import re
path = 'test.txt'
with open(path,encoding='utf-8') as f:
  word_list = []
  word_reg = re.compile(r'\w+') # not used below; split() is used instead
  for line in f:
    line_words = line.split()
    word_list.extend(line_words)
  words_dict = dict(collections.Counter(word_list)) # count with Counter
  for k,v in words_dict.items():
    print(k,v)

Output:

We 1
are 1
busy 1
all 1
day, 1
like 1
swarms 1
of 6
flies 1
without 1
souls, 1
noisy, 1
restless, 1
unable 1
to 1
hear 1
the 7
voices 1
soul. 1
As 1
time 1
goes 1
by, 1
childhood 1
away, 1
we 3
grew 1
up, 1
years 1
away 1
a 1
lot 1
memories, 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence, 1
regardless 1
shackles 1
mind, 1
indulge 1
in 1
world 1
buckish, 1
focus 1
on 1
beneficial 1
principle, 1
lost 1
themselves. 1
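
`collections.Counter` is a subclass of `dict`, so converting it with `dict()` is optional, and it also provides `most_common()` for listing words by frequency. The short sketch below illustrates this on a made-up sentence rather than on test.txt.

```python
from collections import Counter

# Illustrative text, not the contents of test.txt
text = "the voices of the soul the bottom of the childish innocence"
counts = Counter(text.split())

# most_common(n) returns the n most frequent (word, count) pairs
for word, n in counts.most_common(2):
    print(word, n)
```

Calling `most_common()` with no argument returns every word, ordered from most to least frequent.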

Note: the test file test.txt used here contains:

We are busy all day, like swarms of flies without souls, noisy, restless, unable to hear the voices of the soul. As time goes by, childhood away, we grew up, years away a lot of memories, once have also eroded the bottom of the childish innocence, we regardless of the shackles of mind, indulge in the world buckish, focus on the beneficial principle, we have lost themselves.

PS: two related online counting tools, for reference:

Online word-count tool:
http://tools.aspzz.cn/code/zishutongji

Online character count and editing tool:
http://tools.aspzz.cn/code/char_tongji


I hope this article helps with your Python programming.


(Editor: 李大同)
