NLP Algorithms, Part 1: LDA in Theory and Practice (Email Data Processing)
Published: 2020-12-14 03:13:25 | Category: Big Data | Source: compiled from the web
Summary: To understand LDA properly, work through five steps. One function: the gamma function. Four distributions: the binomial, multinomial, beta, and Dirichlet distributions. One concept and one idea: conjugate priors and the Bayesian framework. Two models: pLSA and LDA. One sampling method: Gibbs sampling. Worked example: read a large collection of emails and pick out the useful information.
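Before diving in, it may help to see how the first steps in that list connect. The identities below are standard results (not taken from this article): the gamma function normalizes the beta density, and conjugacy is what makes the Bayesian update a simple count shift.

```latex
% Gamma function: the continuous extension of the factorial
\Gamma(x) = \int_0^\infty t^{x-1} e^{-t}\,dt, \qquad \Gamma(n) = (n-1)! \quad \text{for integer } n \ge 1

% Beta density, with gamma functions in the normalizing constant
\mathrm{Beta}(p \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, p^{\alpha-1} (1-p)^{\beta-1}

% Conjugacy: a Beta prior combined with a binomial likelihood
% (k successes in n trials) yields a Beta posterior
p \mid k \;\sim\; \mathrm{Beta}(\alpha + k,\; \beta + n - k)
```

The Dirichlet plays the same conjugate role for the multinomial distribution, which is why LDA places Dirichlet priors on its document-topic and topic-word distributions.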
To understand LDA properly, work through the following five steps:

- One function: the gamma function
- Four distributions: the binomial, multinomial, beta, and Dirichlet distributions
- One concept and one idea: conjugate priors and the Bayesian framework
- Two models: pLSA and LDA
- One sampling method: Gibbs sampling

Example: read a large collection of emails and pick out the useful information. The code below loads and cleans the data, then fits an LDA model:

import numpy as np
import pandas as pd
import re

df = pd.read_csv("input/HillaryEmails.csv")
# The raw email data contains many NaN values; drop those rows.
df = df[['Id', 'ExtractedBodyText']].dropna()

# Preprocessing
def clean_email_text(text):
    text = text.replace('\n', " ")  # newlines are not needed
    text = re.sub(r"-", " ", text)  # split hyphen-joined words (e.g. july-edu ==> july edu)
    text = re.sub(r"\d+/\d+/\d+", "", text)  # dates carry no meaning for topic modeling
    text = re.sub(r"[0-2]?[0-9]:[0-6][0-9]", "", text)  # times, meaningless
    text = re.sub(r"[\w]+@[\.\w]+", "", text)  # email addresses, meaningless
    text = re.sub(r"/[a-zA-Z]*[:\/\/]*[A-Za-z0-9\-_]+\.+[A-Za-z0-9\.\/%&=\?\-_]+/i", "", text)  # URLs, meaningless
    pure_text = ''
    # In case other special characters (digits, etc.) remain, loop through and filter them out
    for letter in text:
        # keep only letters and spaces
        if letter.isalpha() or letter == ' ':
            pure_text += letter
    # Then drop the single-character fragments left behind after removing
    # special characters, so only meaningful words remain.
    text = ' '.join(word for word in pure_text.split() if len(word) > 1)
    return text

docs = df['ExtractedBodyText']
docs = docs.apply(lambda s: clean_email_text(s))
print(docs.head(1).values)
doclist = docs.values

from gensim import corpora, models, similarities
import gensim

stoplist = ['very','ourselves','am','doesn','through','me','against','up','just','her','ours','couldn','because','is','isn','it','only','in','such','too','mustn','under','their','if','to','my','himself','after','why','while','can','each','itself','his','all','once','herself','more','our','they','hasn','on','ma','them','its','where','did','ll','you','didn','nor','as','now','before','those','yours','from','who','was','m','been','will','into','same','how','some','of','out','with','s','being','t','mightn','she','again','be','by','shan','have','yourselves','needn','and','are','o','these','further','most','yourself','having','aren','here','he','were','but','this','myself','own','we','so','i','does','both','when','between','d','had','the','y','has','down','off','than','haven','whom','wouldn','should','ve','over','themselves','few','then','hadn','what','until','won','no','about','any','that','for','shouldn','don','do','there','doing','an','or','ain','hers','wasn','weren','above','a','at','your','theirs','below','other','not','re','him','during','which']

texts = [[word for word in doc.lower().split() if word not in stoplist] for doc in doclist]
texts[0]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
corpus[13]
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20)
lda.print_topic(10, topn=5)
lda.print_topics(num_topics=20, num_words=5)

Output:

(Editor: Li Datong)
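The article names Gibbs sampling as one of its five steps, but the gensim call above hides the sampler entirely. As a minimal sketch of what collapsed Gibbs sampling for LDA actually does (the function name, toy corpus, and hyperparameter values here are illustrative assumptions, not from the article):

```python
import numpy as np

def lda_gibbs(docs, n_topics, n_vocab, alpha=0.1, beta=0.01, n_iter=100, seed=0):
    # docs: list of documents, each a list of integer word ids in [0, n_vocab)
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), n_topics))  # per-document topic counts
    n_kw = np.zeros((n_topics, n_vocab))    # per-topic word counts
    n_k = np.zeros(n_topics)                # total tokens assigned to each topic
    z = []                                  # topic assignment of every token
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = int(rng.integers(n_topics))  # random initial assignment
            zd.append(k)
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
        z.append(zd)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # remove the current token from the counts
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # collapsed full conditional: p(k) ∝ (n_dk + α) · (n_kw + β) / (n_k + Vβ)
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + n_vocab * beta)
                k = int(rng.choice(n_topics, p=p / p.sum()))
                z[d][i] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    # posterior mean estimates of the document-topic and topic-word distributions
    theta = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + n_topics * alpha)
    phi = (n_kw + beta) / (n_k[:, None] + n_vocab * beta)
    return theta, phi

# Toy corpus: word ids 0-2 dominate the first three docs, 3-5 the last three,
# so a two-topic model should separate them.
docs = [[0, 1, 2, 0, 1], [0, 0, 2, 1], [1, 2, 0, 2],
        [3, 4, 5, 3, 4], [3, 5, 5, 4], [4, 3, 5, 4]]
theta, phi = lda_gibbs(docs, n_topics=2, n_vocab=6)
print(np.round(theta, 2))
```

This is the same model gensim fits, just in a form where the count updates and the full conditional are visible; gensim's LdaModel uses variational inference rather than Gibbs sampling, which is why the sampler never appears in the article's code.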