scikit-learn 0.4: using Pipeline to unify vectorizer => transformer => classifier
Published: 2020-12-13 22:17:48 | Category: Encyclopedia | Source: compiled from the web
http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
1. Using Pipeline to unify vectorizer => transformer => classifier

```python
from sklearn.pipeline import Pipeline

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
                     ])
text_clf = text_clf.fit(rawData.data, rawData.target)
predicted = text_clf.predict(docs_new)
```

Note: `predict` is given the raw, unprocessed documents here, not `X_new_tfidf`. The pipeline runs its own `CountVectorizer` and `TfidfTransformer` steps before the classifier, so it expects raw text; passing the already-transformed matrix raises the error shown below.

```python
np.mean(predicted == y_new_target)
Out[51]: 0.5

predicted = text_clf.predict(X_new_tfidf)
Traceback (most recent call last):
  File "<ipython-input-52-20002e79f960>", line 1, in <module>
    predicted = text_clf.predict(X_new_tfidf)
  File "D:\Anaconda\lib\site-packages\sklearn\pipeline.py", line 149, in predict
    Xt = transform.transform(Xt)
  File "D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 867, in transform
    _, X = self._count_vocab(raw_documents, fixed_vocab=True)
  File "D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 748, in _count_vocab
    for feature in analyze(doc):
  File "D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 234, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 200, in <lambda>
    return lambda x: strip_accents(x.lower())
  File "D:\Anaconda\lib\site-packages\scipy\sparse\base.py", line 499, in __getattr__
    raise AttributeError(attr + " not found")
AttributeError: lower not found
```

The traceback makes the cause clear: the pipeline's first step tries to lowercase each "document", but a scipy sparse matrix has no `.lower()` method.

2. Tuning parameters with grid search

```python
from sklearn.grid_search import GridSearchCV

parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3),
              }
# n_jobs=-1 tells grid search to detect how many cores the machine has
# and use all of them to run the search in parallel.
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
gs_clf = gs_clf.fit(rawData.data, rawData.target)

gs_clf.predict(['i love this book'])  # returns an array of labels
rawData.target_names[gs_clf.predict(['i love this book'])[0]]
'positive folder'
```

Printing the best-performing parameters:

```python
>>> best_parameters, score, _ = max(gs_clf.grid_scores_, key=lambda x: x[1])
>>> for param_name in sorted(parameters.keys()):
...     print("%s: %r" % (param_name, best_parameters[param_name]))
...
clf__alpha: 0.001
tfidf__use_idf: True
vect__ngram_range: (1, 1)
>>> score
1.000
```
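The snippets above depend on the tutorial's `rawData` corpus and on the old `sklearn.grid_search` module, which has since been removed. Below is a self-contained sketch of the same workflow on a tiny invented corpus (the documents, labels, and `cv=2` setting are assumptions for illustration); in current scikit-learn, `GridSearchCV` lives in `sklearn.model_selection`, and the best configuration is read from `best_params_` / `best_score_` rather than the removed `grid_scores_` attribute.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV  # replaces sklearn.grid_search

# Toy corpus invented for this sketch; repeated so each CV fold
# contains examples of both classes.
docs = [
    "i love this book", "what a great read", "truly wonderful story",
    "this book is terrible", "awful boring plot", "i hate this story",
] * 2
labels = [1, 1, 1, 0, 0, 0] * 2  # 1 = positive, 0 = negative

# Same three-step pipeline as in section 1.
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

# Same grid as in section 2: parameters of a pipeline step are
# addressed as '<step name>__<parameter name>'.
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf__alpha': (1e-2, 1e-3),
}

gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1, cv=2)
gs_clf.fit(docs, labels)

# predict still takes raw documents, exactly as in section 1.
print(gs_clf.predict(["a wonderful book"]))
for name in sorted(parameters):
    print("%s: %r" % (name, gs_clf.best_params_[name]))
print(gs_clf.best_score_)
```

The fitted `GridSearchCV` object is itself a classifier refit on the whole training set with the winning parameters, so it can be used for prediction directly.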