How do I correctly compute cross-validation scores in scikit-learn?
Published: 2020-12-20 11:53:17 Category: Python Source: compiled from the web
I am working on a classification task. However, I am getting slightly different results:
#First Approach
kf = KFold(n=len(y), n_folds=10, shuffle=True, random_state=False)
pipe = make_pipeline(SVC())
for train_index, test_index in kf:
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
print('Precision', np.mean(cross_val_score(pipe, X_train, y_train, scoring='precision')))
#Second Approach
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('Precision:', precision_score(y_test, y_pred, average='binary'))
#Third approach
pipe = make_pipeline(SVC())
print('Precision', np.mean(cross_val_score(pipe, X, y, cv=kf, scoring='precision')))
#Fourth approach
pipe = make_pipeline(SVC())
print('Precision', np.mean(cross_val_score(pipe, X_train, y_train, cv=kf, scoring='precision')))
Out:
Precision: 0.780422106837
Precision: 0.782051282051
Precision: 0.801544091998
/usr/local/lib/python3.5/site-packages/sklearn/cross_validation.py in cross_val_score(estimator,X,y,scoring,cv,n_jobs,verbose,fit_params,pre_dispatch)
1431 train,test,verbose,None,
1432 fit_params)
-> 1433 for train,test in cv)
1434 return np.array(scores)[:,0]
1435
/usr/local/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in __call__(self,iterable)
798 # was dispatched. In particular this covers the edge
799 # case of Parallel used with an exhausted iterator.
--> 800 while self.dispatch_one_batch(iterator):
801 self._iterating = True
802 else:
/usr/local/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self,iterator)
656 return False
657 else:
--> 658 self._dispatch(tasks)
659 return True
660
/usr/local/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self,batch)
564
565 if self._pool is None:
--> 566 job = ImmediateComputeBatch(batch)
567 self._jobs.append(job)
568 self.n_dispatched_batches += 1
/usr/local/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in __init__(self,batch)
178 # Don't delay the application,to avoid keeping the input
179 # arguments in memory
--> 180 self.results = batch()
181
182 def get(self):
/usr/local/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
70
71 def __call__(self):
---> 72 return [func(*args,**kwargs) for func,args,kwargs in self.items]
73
74 def __len__(self):
/usr/local/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
70
71 def __call__(self):
---> 72 return [func(*args,**kwargs) for func,args,kwargs in self.items]
73
74 def __len__(self):
/usr/local/lib/python3.5/site-packages/sklearn/cross_validation.py in _fit_and_score(estimator,X,y,scorer,train,test,verbose,parameters,fit_params,return_train_score,return_parameters,error_score)
1522 start_time = time.time()
1523
-> 1524 X_train,y_train = _safe_split(estimator,X,y,train)
1525 X_test,y_test = _safe_split(estimator,X,y,test,train)
1526
/usr/local/lib/python3.5/site-packages/sklearn/cross_validation.py in _safe_split(estimator,X,y,indices,train_indices)
1589 X_subset = X[np.ix_(indices,train_indices)]
1590 else:
-> 1591 X_subset = safe_indexing(X,indices)
1592
1593 if y is not None:
/usr/local/lib/python3.5/site-packages/sklearn/utils/__init__.py in safe_indexing(X,indices)
161 indices.dtype.kind == 'i'):
162 # This is often substantially faster than X[indices]
--> 163 return X.take(indices,axis=0)
164 else:
165 return X[indices]
IndexError: index 900 is out of bounds for size 900
So, my question is: which of the above approaches is correct for computing cross-validated metrics? I believe my scores are contaminated, because I am confused about when to perform cross-validation. Any ideas on how to compute cross-validated scores correctly?
UPDATE
Evaluating in the training step?
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=False)
clf = make_pipeline(SVC())
# However, for clf, you can use whatever estimator you like
kf = StratifiedKFold(y=y_train, random_state=False)
scores = cross_val_score(clf, X_train, y_train, cv=kf, scoring='precision')
print('Mean score : ', np.mean(scores))
print('Score variance : ', np.var(scores))
Solution
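For any classification task, it is generally a good idea to split the data with StratifiedKFold. In stratified K-fold, each fold keeps approximately the same class proportions as the full dataset.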
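As a minimal sketch of what stratification changes, the snippet below compares the share of positive labels in each test fold under KFold and StratifiedKFold. It assumes the newer sklearn.model_selection API (scikit-learn 0.18 or later) rather than the legacy sklearn.cross_validation module used elsewhere in this thread. The toy labels and the helper positive_fraction_per_fold are hypothetical, made up for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold


def positive_fraction_per_fold(splitter, X, y):
    """Return the fraction of positive labels in each test fold."""
    return [float(np.mean(y[test_idx])) for _, test_idx in splitter.split(X, y)]


# Hypothetical imbalanced toy data: 90 negatives followed by 10 positives.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y), dtype=float).reshape(-1, 1)

# Without shuffling, KFold cuts the data into contiguous blocks,
# so the positives all end up in the final fold.
plain = positive_fraction_per_fold(KFold(n_splits=5), X, y)

# StratifiedKFold spreads each class evenly across the folds.
stratified = positive_fraction_per_fold(StratifiedKFold(n_splits=5), X, y)

print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
```

With plain KFold, some folds may contain no positive samples at all, which makes precision undefined on those folds. This is one reason stratified splits are usually preferred for classification.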
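Beyond that, it depends on the type of classification problem. It is always worth looking at both precision and recall. For skewed (imbalanced) binary classification, people tend to use the ROC AUC score:
from sklearn import metrics
metrics.roc_auc_score(ytest, ypred)
Let's look at your solution:
import numpy as np
# Note: sklearn.cross_validation is the legacy module; it was removed in
# scikit-learn 0.20 in favor of sklearn.model_selection.
from sklearn.cross_validation import cross_val_score
from sklearn.metrics import precision_score
from sklearn.cross_validation import KFold
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
np.random.seed(1337)
X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, 1000)
#First Approach
kf = KFold(n=len(y), n_folds=10, shuffle=True, random_state=42)
pipe = make_pipeline(SVC(random_state=42))
for train_index, test_index in kf:
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
print('Precision', np.mean(cross_val_score(pipe, X_train, y_train, scoring='precision')))
# Here you are evaluating the precision score on X_train
# (the training part of the last fold only, since the loop overwrites it).
#Second Approach
clf = SVC(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('Precision:', precision_score(y_test, y_pred, average='binary'))
# Here you are evaluating the precision score on X_test (a single held-out fold).
#Third approach
pipe = make_pipeline(SVC())
print('Precision', np.mean(cross_val_score(pipe, X, y, cv=kf, scoring='precision')))
# Here you are splitting the data again and averaging the score over every fold.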
Hence, the results are different. (Editor: 李大同)
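One common way to keep the scores uncontaminated is to hold out a test set first, cross-validate on the training portion only, and score the held-out set once at the end. The sketch below follows that pattern under some assumptions that are not in the original thread: it uses the newer sklearn.model_selection API (cross_validate is available from scikit-learn 0.19), a 10% test split, random_state=42, and synthetic random data shaped like the answer's example. Because the labels are random, scores near 0.5 are expected.

```python
import numpy as np
from sklearn.metrics import precision_score
from sklearn.model_selection import (StratifiedKFold, cross_validate,
                                     train_test_split)
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic data (assumption): 1000 samples, 5 features, random binary labels.
rng = np.random.RandomState(1337)
X = rng.rand(1000, 5)
y = rng.randint(0, 2, 1000)

# 1. Hold out a test set that cross-validation never sees.
#    test_size=0.1 is an illustrative choice, not taken from the question.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=42)

# 2. Cross-validate on the training portion only. The splitter generates
#    indices from the data passed to cross_validate, so they always fit
#    X_train. Reusing folds built for the full dataset on a smaller
#    X_train is what produces an out-of-bounds IndexError.
pipe = make_pipeline(SVC())
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
cv_results = cross_validate(pipe, X_train, y_train, cv=cv,
                            scoring=["precision", "recall", "roc_auc"])
for name in ("precision", "recall", "roc_auc"):
    scores = cv_results["test_" + name]
    print("CV %s: %.3f +/- %.3f" % (name, scores.mean(), scores.std()))

# 3. Refit on all training data and score the held-out set exactly once.
pipe.fit(X_train, y_train)
test_precision = precision_score(y_test, pipe.predict(X_test))
print("Held-out precision: %.3f" % test_precision)
```

The cross-validated scores guide model selection, while the single held-out score estimates performance on unseen data. Mixing the two roles is what makes the numbers from the approaches above hard to compare.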

