
NLP Algorithms, Part One: LDA in Theory and Practice (Email Data Processing)


To properly understand LDA, work through the following five steps (a short numerical sketch of the first few building blocks follows this list):

One function: the gamma function
Four distributions: the binomial, multinomial, beta, and Dirichlet distributions
One concept and one idea: conjugate priors and the Bayesian framework
Two models: pLSA and LDA
One sampling method: Gibbs sampling
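
As a minimal illustration of the first two items (added here as a sketch, not part of the original tutorial; the alpha values are arbitrary):

from scipy.special import gamma
import numpy as np

# The gamma function generalizes the factorial: gamma(n) == (n-1)!
print(gamma(5))  # 24.0, i.e. 4!

# A Dirichlet draw is a random probability vector (non-negative, sums to 1);
# in LDA it serves as the prior over a document's topic mixture.
alpha = [0.1, 0.1, 0.1]  # small alpha -> sparse, peaked mixtures
theta = np.random.dirichlet(alpha)
print(theta, theta.sum())  # three weights summing to 1.0

# The beta distribution is the two-dimensional special case; conjugacy means
# observing data only updates the parameters: Beta(a, b) plus k heads in n
# coin flips gives Beta(a + k, b + n - k), which is why these priors are
# so convenient in the Bayesian framework.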


Example data format: the HillaryEmails.csv file; only the Id and ExtractedBodyText columns are used below (the original screenshot of the data is not reproduced).

Example: read in a large number of emails and extract the useful information.

import numpy as np
import pandas as pd
import re

df = pd.read_csv("input/HillaryEmails.csv")
# The raw data contains many NaN values; drop those rows outright.
df = df[['Id','ExtractedBodyText']].dropna()

# Preprocessing
def clean_email_text(text):
    text = text.replace('\n'," ")  # newlines are not needed
    text = re.sub(r"-", " ", text)  # split hyphenated words (e.g. july-edu ==> july edu)
    text = re.sub(r"\d+/\d+/\d+", "", text)  # dates carry no meaning for a topic model
    text = re.sub(r"[0-2]?[0-9]:[0-6][0-9]", "", text)  # times, meaningless
    text = re.sub(r"[\w]+@[.\w]+", "", text)  # email addresses, meaningless
    text = re.sub(r"[a-zA-Z]*[:\/]*[A-Za-z0-9\-_]+\.+[A-Za-z0-9\.\/%&=\?\-_]+", "", text)  # URLs, meaningless
    pure_text = ''
    # In case other special characters (digits, etc.) remain, loop over the text and filter them out
    for letter in text:
        # keep only letters and spaces
        if letter.isalpha() or letter == ' ':
            pure_text += letter
    # Drop the single characters left stranded after removing special characters,
    # so that only meaningful words remain.
    text = ' '.join(word for word in pure_text.split() if len(word) > 1)
    return text
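
A quick sanity check of the cleaner on a made-up string (the email text below is invented for illustration; the expected output is shown as a comment):

sample = "Meeting at 10:30 on 05/12/2010, see http://example.com or mail h@state.gov"
print(clean_email_text(sample))  # -> Meeting at on see or mail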
docs = df['ExtractedBodyText']  # keep only the email bodies
docs = docs.apply(lambda s: clean_email_text(s))
print(docs.head(1).values)  # peek at the first cleaned email
doclist = docs.values
from gensim import corpora,models,similarities
import gensim
stoplist = ['very','ourselves','am','doesn','through','me','against','up','just','her','ours','couldn','because','is','isn','it','only','in','such','too','mustn','under','their','if','to','my','himself','after','why','while','can','each','itself','his','all','once','herself','more','our','they','hasn','on','ma','them','its','where','did','ll','you','didn','nor','as','now','before','those','yours','from','who','was','m','been','will','into','same','how','some','of','out','with','s','being','t','mightn','she','again','be','by','shan','have','yourselves','needn','and','are','o','these','further','most','yourself','having','aren','here','he','were','but','this','myself','own','we','so','i','does','both','when','between','d','had','the','y','has','down','off','than','haven','whom','wouldn','should','ve','over','themselves','few','then','hadn','what','until','won','no','about','any','that','for','shouldn','don','do','there','doing','an','or','ain','hers','wasn','weren','above','a','at','your','theirs','below','other','not','re','him','during','which']
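
This hand-pasted list appears to match the standard English stopword list shipped with NLTK; if nltk is installed, an equivalent and less error-prone way to obtain it is (a sketch, assuming the stopwords corpus can be downloaded):

import nltk
nltk.download('stopwords')  # one-time corpus download
from nltk.corpus import stopwords
stoplist = stopwords.words('english')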

# lower-case each document, tokenize on whitespace, and drop stopwords
texts = [[word for word in doc.lower().split() if word not in stoplist] for doc in doclist]
texts[0]  # inspect the first tokenized document

# build the word <-> id mapping and convert every document to bag-of-words
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
corpus[13]  # a sparse list of (token_id, count) pairs
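
Each corpus entry such as corpus[13] is just ids and counts; to read it back as words, map the ids through the dictionary (a small inspection snippet added for illustration):

for token_id, count in corpus[13]:
    print(dictionary[token_id], count)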
# train a 20-topic LDA model on the bag-of-words corpus
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20)
lda.print_topic(10, topn=5)  # the 5 highest-weight words in topic 10
lda.print_topics(num_topics=20, num_words=5)  # top 5 words of each of the 20 topics



Output: (the original post showed the printed topic lists here as a screenshot, which is not reproduced)
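
Once trained, the model can also score an unseen email by pushing it through the same preprocessing pipeline; a minimal sketch (the sample text below is invented for illustration):

new_email = "Please schedule a call with the ambassador about the briefing"
tokens = [w for w in clean_email_text(new_email).lower().split() if w not in stoplist]
bow = dictionary.doc2bow(tokens)
print(lda.get_document_topics(bow))  # [(topic_id, probability), ...] for the new email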
