How to correctly compute cross-validated scores in scikit-learn?
I am working on a classification task. However, my results differ slightly depending on which of the following approaches I use:
#First Approach
kf = KFold(n=len(y), n_folds=10, shuffle=True, random_state=False)
pipe = make_pipeline(SVC())
for train_index, test_index in kf:
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
print('Precision', np.mean(cross_val_score(pipe, X_train, y_train, scoring='precision')))

#Second Approach
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('Precision:', precision_score(y_test, y_pred, average='binary'))

#Third approach
pipe = make_pipeline(SVC())
print('Precision', np.mean(cross_val_score(pipe, X, y, cv=kf, scoring='precision')))

#Fourth approach
pipe = make_pipeline(SVC())
print('Precision', np.mean(cross_val_score(pipe, X_train, y_train, cv=kf, scoring='precision')))

Output:

Precision: 0.780422106837
Precision: 0.782051282051
Precision: 0.801544091998

The fourth approach raises (traceback abridged):

/usr/local/lib/python3.5/site-packages/sklearn/cross_validation.py in cross_val_score(estimator, X, y, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch)
...
/usr/local/lib/python3.5/site-packages/sklearn/utils/__init__.py in safe_indexing(X, indices)
    161             indices.dtype.kind == 'i'):
    162         # This is often substantially faster than X[indices]
--> 163         return X.take(indices, axis=0)
    164     else:
    165         return X[indices]

IndexError: index 900 is out of bounds for size 900

So, my question is: which of the above approaches is correct for computing cross-validated metrics? I believe my scores are contaminated, because I am confused about when exactly to perform the cross-validation. So, any ideas on how to get the cross-validation scores right?
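For reference, the IndexError in the fourth approach apparently comes from reusing kf: the folds were generated for all len(y) samples, but cross_val_score then uses them to index into the smaller X_train. A minimal sketch, with made-up data and the newer sklearn.model_selection API (where precomputed index pairs can be passed through cv), that reproduces the same failure:

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

X = np.random.rand(1000, 5)   # hypothetical data, shaped like in the question
y = np.random.randint(0, 2, 1000)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
folds = list(kf.split(X))     # (train, test) index pairs for all 1000 samples
train_index, test_index = folds[0]
X_train, y_train = X[train_index], y[train_index]   # only 900 samples remain

# Reusing the full-data folds on the 900-sample subset indexes out of bounds:
cross_val_score(SVC(), X_train, y_train, cv=folds, scoring='precision')
# -> IndexError: index ... is out of bounds for axis 0 with size 900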
UPDATE

Is this the right way to do the evaluation in the training step?

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=False)
clf = make_pipeline(SVC())
# However, for clf, you can use whatever estimator you like
kf = StratifiedKFold(y=y_train, random_state=False)
scores = cross_val_score(clf, X_train, y_train, cv=kf, scoring='precision')
print('Mean score : ', np.mean(scores))
print('Score variance : ', np.var(scores))
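As a side note, the sklearn.cross_validation module used above was deprecated in scikit-learn 0.18 and later removed; a minimal sketch of the same update with the sklearn.model_selection API (the data here is made up for illustration):

import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X = np.random.rand(1000, 5)   # made-up data for illustration
y = np.random.randint(0, 2, 1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = make_pipeline(SVC())

# The new splitter no longer takes y at construction time;
# cross_val_score hands y_train to kf.split() internally.
kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X_train, y_train, cv=kf, scoring='precision')
print('Mean score : ', np.mean(scores))
print('Score variance : ', np.var(scores))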
Solution

For any classification task, it is always good to use a StratifiedKFold cross-validation split. A stratified KFold keeps, in each fold, roughly the same proportion of samples from each class as in the full dataset.
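A quick way to see that property, using made-up imbalanced labels (the numbers are arbitrary):

import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 900 + [1] * 100)   # 90% class 0, 10% class 1
X = np.zeros((1000, 1))               # features are irrelevant for the split

skf = StratifiedKFold(n_splits=10)
for train_index, test_index in skf.split(X, y):
    # every test fold keeps the 9:1 class ratio of the full data
    print(np.bincount(y[test_index]))  # -> [90 10] in each fold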
Then it depends on the type of your classification problem. It is always nice to look at both the precision and the recall scores. In the case of a skewed binary classification, people tend to use the ROC AUC score:

from sklearn import metrics
metrics.roc_auc_score(ytest, ypred)

Let's look at your solutions:

import numpy as np
from sklearn.cross_validation import cross_val_score
from sklearn.metrics import precision_score
from sklearn.cross_validation import KFold
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

np.random.seed(1337)
X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, 1000)

kf = KFold(n=len(y), n_folds=10, shuffle=True, random_state=42)
pipe = make_pipeline(SVC(random_state=42))

#First Approach
for train_index, test_index in kf:
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
print('Precision', np.mean(cross_val_score(pipe, X_train, y_train, scoring='precision')))
# Here you are evaluating the precision score on X_train.

#Second Approach
clf = SVC(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('Precision:', precision_score(y_test, y_pred, average='binary'))
# Here you are evaluating the precision score on X_test.

#Third approach
pipe = make_pipeline(SVC())
print('Precision', np.mean(cross_val_score(pipe, X, y, cv=kf, scoring='precision')))
# Here you are splitting the data again and evaluating the mean on each fold.

Hence the results are different.
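One caveat on the roc_auc_score snippet above: the AUC is a ranking metric, so it is better fed with continuous scores than with hard predict() labels. A small sketch, assuming an SVC and reusing the X_train/X_test variables from the code above:

from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

clf = SVC(random_state=42)
clf.fit(X_train, y_train)

# decision_function gives continuous scores; hard labels would coarsen the ROC curve
y_scores = clf.decision_function(X_test)
print('ROC AUC:', roc_auc_score(y_test, y_scores))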