Implementing Cross-Validation in sklearn

Posted by 分裂的硬盘悔 on 2021-10-25 19:22:24
sklearn is a very comprehensive and easy-to-use third-party library for machine learning in Python; everyone who has used it speaks well of it. This post records the various cross-validation utilities in sklearn, mainly following the official documentation "Cross-validation: evaluating estimator performance". If your English is up to it, I recommend reading the official docs directly; they cover these topics in great detail.
First, import the required libraries and the dataset:
In [1]: import numpy as np
In [2]: from sklearn.model_selection import train_test_split
In [3]: from sklearn.datasets import load_iris
In [4]: from sklearn import svm
In [5]: iris = load_iris()
In [6]: iris.data.shape, iris.target.shape
Out[6]: ((150, 4), (150,))
1. train_test_split

Quickly shuffles the dataset and splits it into a training set and a test set.
This amounts to shuffling the data and then splitting it according to the given test_size.
In [7]: X_train, X_test, y_train, y_test = train_test_split(
   ...:     iris.data, iris.target, test_size=.4, random_state=0)
   ...: # split the data 6:4 into training and test sets
In [8]: X_train.shape, y_train.shape
Out[8]: ((90, 4), (90,))
In [9]: X_test.shape, y_test.shape
Out[9]: ((60, 4), (60,))
In [10]: iris.data[:5]
Out[10]:
array([[ 5.1, 3.5, 1.4, 0.2],
       [ 4.9, 3. , 1.4, 0.2],
       [ 4.7, 3.2, 1.3, 0.2],
       [ 4.6, 3.1, 1.5, 0.2],
       [ 5. , 3.6, 1.4, 0.2]])
In [11]: X_train[:5]
Out[11]:
array([[ 6. , 3.4, 4.5, 1.6],
       [ 4.8, 3.1, 1.6, 0.2],
       [ 5.8, 2.7, 5.1, 1.9],
       [ 5.6, 2.7, 4.2, 1.3],
       [ 5.6, 2.9, 3.6, 1.3]])
In [12]: clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
In [13]: clf.score(X_test, y_test)
Out[13]: 0.96666666666666667
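As a quick sanity check on the shuffling behavior described above, here is a minimal sketch using the real shuffle parameter of train_test_split: with shuffle=False the rows are split in order, so with test_size=.4 the training set is simply the first 90 rows of the data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# shuffle=False disables the random shuffle, so with test_size=0.4 the
# first 90 rows become the training set and the last 60 the test set.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, shuffle=False)

print(np.array_equal(X_train, iris.data[:90]))   # the split is just a slice
```

The default shuffle=True is what makes the split above a random sample rather than a slice.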
2. cross_val_score

Runs cross-validation on the dataset a specified number of times and scores each round.

By default each round is scored with the estimator's own score method (accuracy for classifiers); you can choose other classification or regression metrics by passing the scoring parameter to cross_val_score (e.g. scoring='f1_macro'), and the underlying metric functions live in sklearn.metrics. When cv is given as an int, KFold or StratifiedKFold is used by default to split the data; both are introduced below.
In [15]: from sklearn.model_selection import cross_val_score
In [16]: clf = svm.SVC(kernel='linear', C=1)
In [17]: scores = cross_val_score(clf, iris.data, iris.target, cv=5)
In [18]: scores
Out[18]: array([ 0.96666667, 1.    , 0.96666667, 0.96666667, 1.    ])
In [19]: scores.mean()
Out[19]: 0.98000000000000009
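The scoring parameter mentioned above can be sketched like this ('f1_macro' is one of sklearn's built-in metric names; everything else is the same iris setup as before):

```python
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

iris = load_iris()
clf = svm.SVC(kernel='linear', C=1)

# Same 5-fold cross-validation, but each fold is scored with
# macro-averaged F1 instead of the classifier's default accuracy.
scores = cross_val_score(clf, iris.data, iris.target, cv=5, scoring='f1_macro')
print(scores)          # one F1 score per fold
print(scores.mean())
```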
Besides the default cross-validation behavior, you can specify the splitting strategy yourself, e.g. the number of splits and the train/test ratio:
In [20]: from sklearn.model_selection import ShuffleSplit
In [21]: n_samples = iris.data.shape[0]
In [22]: cv = ShuffleSplit(n_splits=3, test_size=.3, random_state=0)
In [23]: cross_val_score(clf, iris.data, iris.target, cv=cv)
Out[23]: array([ 0.97777778, 0.97777778, 1.    ])
cross_val_score can likewise be used with a pipeline, chaining preprocessing and fitting into a single estimator:
In [24]: from sklearn import preprocessing
In [25]: from sklearn.pipeline import make_pipeline
In [26]: clf = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))
In [27]: cross_val_score(clf, iris.data, iris.target, cv=cv)
Out[27]: array([ 0.97777778, 0.93333333, 0.95555556])
3. cross_val_predict

cross_val_predict is very similar to cross_val_score, but instead of returning scores it returns the estimator's predictions (class labels, or values for regression). This matters a lot for later model improvement: by comparing the predictions against the actual targets you can pinpoint exactly where the model goes wrong, which is invaluable for parameter tuning and debugging.
In [28]: from sklearn.model_selection import cross_val_predict
In [29]: from sklearn import metrics
In [30]: predicted = cross_val_predict(clf, iris.data, iris.target, cv=10)
In [31]: predicted
Out[31]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
In [32]: metrics.accuracy_score(iris.target, predicted)
Out[32]: 0.96666666666666667
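One concrete way to "pinpoint where predictions go wrong", as described above, is a confusion matrix over the cross-validated predictions; this is a sketch using a plain linear SVC (not the pipeline from the previous section):

```python
import numpy as np
from sklearn import metrics, svm
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict

iris = load_iris()
clf = svm.SVC(kernel='linear', C=1)
predicted = cross_val_predict(clf, iris.data, iris.target, cv=10)

# Rows are true classes, columns are predicted classes; the off-diagonal
# entries show exactly which classes are confused with which.
cm = metrics.confusion_matrix(iris.target, predicted)
print(cm)

# Indices of the misclassified samples, for case-by-case inspection.
wrong = np.where(predicted != iris.target)[0]
print(wrong)
```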
4. KFold

K-fold cross-validation, the standard scheme for splitting a dataset into K parts. "K-fold" means the dataset is split K times, so that every sample appears in the training set and in the test set; within any single split there is no overlap between the two. This is equivalent to sampling without replacement.
In [33]: from sklearn.model_selection import KFold
In [34]: X = ['a','b','c','d']
In [35]: kf = KFold(n_splits=2)
In [36]: for train, test in kf.split(X):
   ...:     print(train, test)
   ...:     print(np.array(X)[train], np.array(X)[test])
   ...:
[2 3] [0 1]
['c' 'd'] ['a' 'b']
[0 1] [2 3]
['a' 'b'] ['c' 'd']
5. LeaveOneOut

LeaveOneOut is really just a special case of KFold; it is defined separately because it is used so often, but it can be implemented entirely with KFold.
In [37]: from sklearn.model_selection import LeaveOneOut
In [38]: X = [1,2,3,4]
In [39]: loo = LeaveOneOut()
In [41]: for train, test in loo.split(X):
   ...:     print(train, test)
   ...:
[1 2 3] [0]
[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]
# implementing LeaveOneOut with KFold
In [42]: kf = KFold(n_splits=len(X))
In [43]: for train, test in kf.split(X):
   ...:     print(train, test)
   ...:
[1 2 3] [0]
[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]
6. LeavePOut

Similar in spirit to LeaveOneOut, but it leaves out every possible subset of p samples as the test set; for p > 1 the test sets overlap, so it is somewhat awkward to reproduce with KFold.
In [44]: from sklearn.model_selection import LeavePOut
In [45]: X = np.ones(4)
In [46]: lpo = LeavePOut(p=2)
In [47]: for train, test in lpo.split(X):
   ...:     print(train, test)
   ...:
[2 3] [0 1]
[1 3] [0 2]
[1 2] [0 3]
[0 3] [1 2]
[0 2] [1 3]
[0 1] [2 3]
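Since LeavePOut enumerates every size-p subset as a test set, the number of splits is the binomial coefficient C(n, p); a quick check using the splitter's get_n_splits method (math.comb needs Python 3.8+):

```python
from math import comb
import numpy as np
from sklearn.model_selection import LeavePOut

X = np.ones(4)
lpo = LeavePOut(p=2)

# Every 2-element subset of the 4 samples serves as a test set once,
# giving C(4, 2) = 6 splits, matching the transcript above.
print(lpo.get_n_splits(X))   # 6
assert lpo.get_n_splits(X) == comb(4, 2)
```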
7. ShuffleSplit

At first glance ShuffleSplit looks like LeavePOut, but the two are quite different. With LeavePOut, after all the splits the union of the test sets is exactly the full dataset, i.e. sampling without replacement. ShuffleSplit instead draws a fresh random split each time, so the same sample may appear in several test sets; only after enough splits can you expect every sample to have shown up in a test set.
In [48]: from sklearn.model_selection import ShuffleSplit
In [49]: X = np.arange(5)
In [50]: ss = ShuffleSplit(n_splits=3, test_size=.25, random_state=0)
In [51]: for train_index, test_index in ss.split(X):
   ...:     print(train_index, test_index)
   ...:
[1 3 4] [2 0]
[1 4 3] [0 2]
[4 0 2] [1 3]
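To see the contrast with LeavePOut concretely, count how often each sample lands in a test set across many ShuffleSplit draws; because each split is sampled independently, the counts are uneven and the same index recurs (the split count and test_size here are arbitrary choices for the sketch):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(5)
ss = ShuffleSplit(n_splits=10, test_size=0.4, random_state=0)

# Each of the 10 splits draws a fresh random 2-sample test set,
# so the same index can appear in many test sets.
test_counts = np.zeros(5, dtype=int)
for _, test in ss.split(X):
    test_counts[test] += 1

print(test_counts)   # 10 splits x 2 test samples = 20 slots over 5 indices
```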
8. StratifiedKFold

This one is more interesting: the folds are stratified by the class labels, so each test fold preserves the overall class proportions of the dataset (still sampling without replacement within each split).
In [52]: from sklearn.model_selection import StratifiedKFold
In [53]: X = np.ones(10)
In [54]: y = [0,0,0,0,1,1,1,1,1,1]
In [55]: skf = StratifiedKFold(n_splits=3)
In [56]: for train, test in skf.split(X,y):
   ...:     print(train, test)
   ...:
[2 3 6 7 8 9] [0 1 4 5]
[0 1 3 4 5 8 9] [2 6 7]
[0 1 2 4 5 6 7] [3 8 9]
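The stratification can be checked directly by counting labels per test fold with np.bincount; this sketch reuses the toy data above but with 2 folds so the arithmetic comes out exact:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.ones(10)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])   # 4 of class 0, 6 of class 1

skf = StratifiedKFold(n_splits=2)
for train, test in skf.split(X, y):
    # Each 5-sample test fold keeps the original 4:6 ratio: 2 zeros, 3 ones.
    print(np.bincount(y[test]))
```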
9. GroupKFold

Similar to StratifiedKFold, except the split follows a given grouping: samples are first bundled into groups, whole groups are then distributed across folds, and the order inside each group stays fixed, so the same group never appears in both the training set and the test set.
In [57]: from sklearn.model_selection import GroupKFold
In [58]: X = [.1, .2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]
In [59]: y = ['a','b','b','b','c','c','c','d','d','d']
In [60]: groups = [1,1,1,2,2,2,3,3,3,3]
In [61]: gkf = GroupKFold(n_splits=3)
In [62]: for train, test in gkf.split(X,y,groups=groups):
   ...:     print(train, test)
   ...:
[0 1 2 3 4 5] [6 7 8 9]
[0 1 2 6 7 8 9] [3 4 5]
[3 4 5 6 7 8 9] [0 1 2]
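The guarantee worth verifying is that no group ever straddles the train/test boundary; a sketch over the same toy data:

```python
from sklearn.model_selection import GroupKFold

X = [.1, .2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]
y = ['a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd']
groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]

gkf = GroupKFold(n_splits=3)
for train, test in gkf.split(X, y, groups=groups):
    train_groups = {groups[i] for i in train}
    test_groups = {groups[i] for i in test}
    # A whole group is always on one side of the split, never both.
    assert train_groups.isdisjoint(test_groups)
    print(sorted(test_groups))
```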
10. LeaveOneGroupOut

A further simplification of GroupKFold: each split holds out exactly one of the given groups as the test set.
In [63]: from sklearn.model_selection import LeaveOneGroupOut
In [64]: X = [1, 5, 10, 50, 60, 70, 80]
In [65]: y = [0, 1, 1, 2, 2, 2, 2]
In [66]: groups = [1, 1, 2, 2, 3, 3, 3]
In [67]: logo = LeaveOneGroupOut()
In [68]: for train, test in logo.split(X, y, groups=groups):
   ...:     print(train, test)
   ...:
[2 3 4 5 6] [0 1]
[0 1 4 5 6] [2 3]
[0 1 2 3] [4 5 6]
11. LeavePGroupsOut

Not much to say here: same idea as the previous one, except it holds out p groups at a time instead of a single group.
from sklearn.model_selection import LeavePGroupsOut
X = np.arange(6)
y = [1, 1, 1, 2, 2, 2]
groups = [1, 1, 2, 2, 3, 3]
lpgo = LeavePGroupsOut(n_groups=2)
for train, test in lpgo.split(X, y, groups=groups):
    print(train, test)

[4 5] [0 1 2 3]
[2 3] [0 1 4 5]
[0 1] [2 3 4 5]
12. GroupShuffleSplit

The group-wise analogue of ShuffleSplit: each split randomly draws whole groups for the test set, so the same group may appear in the test set of several splits.
In [75]: from sklearn.model_selection import GroupShuffleSplit
In [76]: X = [.1, .2, 2.2, 2.4, 2.3, 4.55, 5.8, .001]
In [77]: y = ['a', 'b','b', 'b', 'c','c', 'c', 'a']
In [78]: groups = [1,1,2,2,3,3,4,4]
In [79]: gss = GroupShuffleSplit(n_splits=4, test_size=.5, random_state=0)
In [80]: for train, test in gss.split(X, y, groups=groups):
   ...:     print(train, test)
   ...:
[0 1 2 3] [4 5 6 7]
[2 3 6 7] [0 1 4 5]
[2 3 4 5] [0 1 6 7]
[4 5 6 7] [0 1 2 3]
13. TimeSeriesSplit

Designed for time series, to keep future data out of training: the data is cut front to back, with the training set always preceding the test set (calling it a "cut" is a bit loose, since the successive training windows are nested and growing).
In [81]: from sklearn.model_selection import TimeSeriesSplit
In [82]: X = np.array([[1,2],[3,4],[1,2],[3,4],[1,2],[3,4]])
In [83]: tscv = TimeSeriesSplit(n_splits=3)
In [84]: for train, test in tscv.split(X):
   ...:     print(train, test)
   ...:
[0 1 2] [3]
[0 1 2 3] [4]
[0 1 2 3 4] [5]
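The property that makes TimeSeriesSplit safe for time series can be checked directly: every training index precedes every test index, so no "future" sample ever leaks into training.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
tscv = TimeSeriesSplit(n_splits=3)
for train, test in tscv.split(X):
    # The training window always ends before the test window begins.
    assert train.max() < test.min()
    print(train, test)
```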
I keep a repo that records some Python tips, books, and learning links; feel free to star it. GitHub address:
