NLP Algorithms, Part 1: LDA in Theory and Practice (Email Data Processing)
Published: 2020-12-14 03:13:25 | Category: Big Data | Source: compiled from the web
Summary: To understand LDA properly, work through five steps. One function: the gamma function. Four distributions: the binomial, multinomial, beta, and Dirichlet distributions. One concept and one idea: conjugate priors and the Bayesian framework. Two models: pLSA and LDA. One sampling method: Gibbs sampling. Worked example: read a large collection of emails and pick out the useful information.
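Before diving in, it may help to see how the first steps in that list connect. The identities below are standard results (not taken from this article): the gamma function normalizes the beta density, and conjugacy is what makes the Bayesian update a simple count shift.

```latex
% Gamma function: the continuous extension of the factorial
\Gamma(x) = \int_0^\infty t^{x-1} e^{-t}\,dt, \qquad \Gamma(n) = (n-1)! \quad \text{for integer } n \ge 1

% Beta density, with gamma functions in the normalizing constant
\mathrm{Beta}(p \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, p^{\alpha-1} (1-p)^{\beta-1}

% Conjugacy: a Beta prior combined with a binomial likelihood
% (k successes in n trials) yields a Beta posterior
p \mid k \;\sim\; \mathrm{Beta}(\alpha + k,\; \beta + n - k)
```

The Dirichlet plays the same conjugate role for the multinomial distribution, which is why LDA places Dirichlet priors on its document-topic and topic-word distributions.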
To understand LDA properly, work through the following five steps:

- One function: the gamma function
- Four distributions: the binomial, multinomial, beta, and Dirichlet distributions
- One concept and one idea: conjugate priors and the Bayesian framework
- Two models: pLSA and LDA
- One sampling method: Gibbs sampling

Example: read a large collection of emails and pick out the useful information. The code below loads and cleans the data, then fits an LDA model:

import numpy as np
import pandas as pd
import re

df = pd.read_csv("input/HillaryEmails.csv")
# The raw email data contains many NaN values; drop those rows.
df = df[['Id', 'ExtractedBodyText']].dropna()

# Preprocessing
def clean_email_text(text):
    text = text.replace('\n', " ")  # newlines are not needed
    text = re.sub(r"-", " ", text)  # split hyphen-joined words (e.g. july-edu ==> july edu)
    text = re.sub(r"\d+/\d+/\d+", "", text)  # dates carry no meaning for topic modeling
    text = re.sub(r"[0-2]?[0-9]:[0-6][0-9]", "", text)  # times, meaningless
    text = re.sub(r"[\w]+@[\.\w]+", "", text)  # email addresses, meaningless
    text = re.sub(r"/[a-zA-Z]*[:\/\/]*[A-Za-z0-9\-_]+\.+[A-Za-z0-9\.\/%&=\?\-_]+/i", "", text)  # URLs, meaningless
    pure_text = ''
    # In case other special characters (digits, etc.) remain, loop through and filter them out
    for letter in text:
        # keep only letters and spaces
        if letter.isalpha() or letter == ' ':
            pure_text += letter
    # Then drop the single-character fragments left behind after removing
    # special characters, so only meaningful words remain.
    text = ' '.join(word for word in pure_text.split() if len(word) > 1)
    return text

docs = df['ExtractedBodyText']
docs = docs.apply(lambda s: clean_email_text(s))
print(docs.head(1).values)
doclist = docs.values

from gensim import corpora, models, similarities
import gensim

stoplist = ['very','ourselves','am','doesn','through','me','against','up','just','her','ours','couldn','because','is','isn','it','only','in','such','too','mustn','under','their','if','to','my','himself','after','why','while','can','each','itself','his','all','once','herself','more','our','they','hasn','on','ma','them','its','where','did','ll','you','didn','nor','as','now','before','those','yours','from','who','was','m','been','will','into','same','how','some','of','out','with','s','being','t','mightn','she','again','be','by','shan','have','yourselves','needn','and','are','o','these','further','most','yourself','having','aren','here','he','were','but','this','myself','own','we','so','i','does','both','when','between','d','had','the','y','has','down','off','than','haven','whom','wouldn','should','ve','over','themselves','few','then','hadn','what','until','won','no','about','any','that','for','shouldn','don','do','there','doing','an','or','ain','hers','wasn','weren','above','a','at','your','theirs','below','other','not','re','him','during','which']

texts = [[word for word in doc.lower().split() if word not in stoplist] for doc in doclist]
texts[0]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
corpus[13]
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20)
lda.print_topic(10, topn=5)
lda.print_topics(num_topics=20, num_words=5)

Output:

(Editor: Li Datong)
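The article names Gibbs sampling as one of its five steps, but the gensim call above hides the sampler entirely. As a minimal sketch of what collapsed Gibbs sampling for LDA actually does (the function name, toy corpus, and hyperparameter values here are illustrative assumptions, not from the article):

```python
import numpy as np

def lda_gibbs(docs, n_topics, n_vocab, alpha=0.1, beta=0.01, n_iter=100, seed=0):
    # docs: list of documents, each a list of integer word ids in [0, n_vocab)
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), n_topics))  # per-document topic counts
    n_kw = np.zeros((n_topics, n_vocab))    # per-topic word counts
    n_k = np.zeros(n_topics)                # total tokens assigned to each topic
    z = []                                  # topic assignment of every token
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = int(rng.integers(n_topics))  # random initial assignment
            zd.append(k)
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
        z.append(zd)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # remove the current token from the counts
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # collapsed full conditional: p(k) ∝ (n_dk + α) · (n_kw + β) / (n_k + Vβ)
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + n_vocab * beta)
                k = int(rng.choice(n_topics, p=p / p.sum()))
                z[d][i] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    # posterior mean estimates of the document-topic and topic-word distributions
    theta = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + n_topics * alpha)
    phi = (n_kw + beta) / (n_k[:, None] + n_vocab * beta)
    return theta, phi

# Toy corpus: word ids 0-2 dominate the first three docs, 3-5 the last three,
# so a two-topic model should separate them.
docs = [[0, 1, 2, 0, 1], [0, 0, 2, 1], [1, 2, 0, 2],
        [3, 4, 5, 3, 4], [3, 5, 5, 4], [4, 3, 5, 4]]
theta, phi = lda_gibbs(docs, n_topics=2, n_vocab=6)
print(np.round(theta, 2))
```

This is the same model gensim fits, just in a form where the count updates and the full conditional are visible; gensim's LdaModel uses variational inference rather than Gibbs sampling, which is why the sampler never appears in the article's code.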