How to compute FactorAnalysis scores with Python (scikit-learn)?

I need to perform exploratory factor analysis in Python and, assuming there is only one underlying factor, compute a score for every observation. sklearn.decomposition.FactorAnalysis() seems to be the way to go, but unfortunately the documentation and the example (the only one I could find) are not clear enough for me to work out how to get the job done.
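(For orientation only, this is the API shape in question: FactorAnalysis(n_components=1) fitted to the data, with transform()/fit_transform() returning one value per observation for the single latent factor. The toy array below is random and merely illustrates the call signature; it is not the data from this post.)

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.RandomState(0)
X = rng.normal(size=(41, 29))        # random stand-in data, for illustration only

fa = FactorAnalysis(n_components=1)  # one latent factor
scores = fa.fit_transform(X)         # shape (41, 1): one score per observation
print(scores[:5])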

I have the following test file with 29 observations of 29 variables (test.csv):

49.6,34917,24325.4,305,101350,98678,254.8,276.9,47.5,1,3,5.6,3.59,11.9,97.5,97.6,8,10,100,96.93,610.1,1718.22,6.7,28,5
275.8,14667,11114.4,775,75002,74677,30,109,9.1,6.5,3.01,8.2,1558,2063.17,5.5,64,5
2.3,9372.5,8035.4,4.6,8111,8200,8.01,130,1.2,5,3.33,6.09,97.9,67.3,342.3,99.96,18.3,53,1457.27,4.8,4
7.10,13198.0,13266.4,1.1,708,695,6.1,80,0.4,4,3.1,97.8,45,82.7,99.68,4.5,13.8,3
1.97,2466.7,2900.6,19.7,5358,5335,10.1,23,0.5,2,3.14,97.3,97.2,9,74.5,98.2,99.64,79.8,54,1367.89,6.4,12,4
2.40,2999.4,2218.2,0.80,2045,2100,8.9,1.5,2.82,8.6,97.4,47.2,323.8,99.996,13.6,24,1249.67,2.7,3
0.59,4120.8,5314.5,0.54,14680,13688,14.9,117,2.94,3.4,97.7,11.8,872.6,9.3,52,1251.67,14,2
0.72,2067.7,2364,367,298,7.2,60,2.5,2.97,10.5,74.7,186.8,99.13,57,1800.45,2
1.14,2751.9,3066.8,3.5,1429,1498,7.7,1.6,2.86,76.7,240.1,99.93,1259.97,15,3
1.29,4802.6,5026.1,7859,7789,1.9,98,34,297.5,99.95,1306.44,8.5,4
0.40,639.0,660.3,1.3,25,0.1,94.2,4.3,50,1565.44,19.2,4
0.26,430.7,608.1,33,7,6,76.5,98.31,1490.08,4
4.99,2141.2,2357.6,3.60,339,320,8.1,0.2,5.9,58.1,206.3,99.58,13.2,95,1122.92,14.2,2
0.36,1453.7,1362.2,3.50,796,785,3.7,98.1,91.4,214.6,99.74,7.5,1751.98,11.5,1657.5,2421.1,2.8,722,690,11,37.4,404.2,99.98,10.9,35,1772.33,10.2,3
1.14,5635.2,5649.6,2681,2530,5.4,20,0.3,50.1,384.7,99.02,11.6,27,1306.08,16,2
0.6,1055.9,1487.9,69,65,63,137.9,5.1,48,1595.06,4
0.08,795.3,1174.7,1.40,85,76,2.2,39.3,149.3,98.27,1903.9,2
0.90,2514.0,2644.4,2.6,1173,1104,43,0.8,58.7,170.5,80.29,1292.72,2
0.27,870.4,949.7,1.8,252,240,31,64.5,6.6,29,1483.18,3
0.41,1295.1,2052.3,2.60,2248,2135,6.0,71.1,261.3,91.86,21,1221.71,9.4,4
1.10,3544.2,4268.9,2.1,735,730,1.7,317.2,99.62,9.8,46,1271.63,3
0.22,899.3,888.2,1.80,220,218,3.6,22.5,70.79,10.6,32,1508.02,4
0.24,1712.8,1735.5,1.30,41,3.28,16.6,720.2,1324.46,2
0.2,558.4,631.9,60.7,99.38,1535.08,2
0.21,599.9,1029,70,85.7,48.6,221.2,40,1381.44,25.6,2
0.10,131.3,190.6,2.9,58.9,189.4,6.9,42,1525.58,17.4,3
0.44,3881.4,5067.3,0.9,2732,2500,11.2,2.67,14.5,1326.2,99.06,1120.54,10.3,2
0.18,1024.8,1651.3,1.01,358,345,15.9,790.2,1531.04,3
0.46,682.9,784.2,103,166.3,44,1373.6,13.5,2
0.12,370.4,420.0,1.10,2.57,51.6,120,99.85,1297.94,3
0.03,552.4,555.1,49,33.6,594.5,3.2,1184.34,3
0.21,1256.5,2434.8,1265,1138,6.3,20.1,881,99.1,3.9,1265.93,7.8,3
0.09,320.6,745.7,37,49.2,376.4,39,1285.11,3
0.08,452.7,570.9,18,4.7,0.6,2.45,97.1,19.9,1103.8,22,1562.61,21.9,3
0.13,967.9,947.2,74,4.0,1.4,30.1,503.1,99.999,55,1269.33,2
0.07,495.0,570.3,3.62,13,29.8,430.5,99.7,4.9,1461.79,14.6,2
0.17,681.9,537.4,113,98.3,74.3,1290.16,3
0.05,639.7,898.2,0.40,3.0,1221.1,1372,4
0.65,2067.8,2084.2,2.50,414,398,7.3,0.7,2.16,60.1,146.3,10.4,1059.68,7.4,804.4,1416.4,3.30,579,602,4.2,2492.3,95.4,1345.76,2

Using code I wrote based on the official example and this post, I get strange results. The code:

from sklearn import decomposition, preprocessing
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation in older releases
import csv
import numpy as np

data = np.genfromtxt('test.csv', delimiter=',')

def compute_scores(X):
    n_components = np.arange(0, len(X), 1)
    X = preprocessing.scale(X)  # data normalisation attempt
    pca = decomposition.PCA()
    fa = decomposition.FactorAnalysis(n_components=1)

    pca_scores, fa_scores = [], []
    for n in n_components:
        # n_components of both models is overwritten on every iteration
        pca.n_components = n
        fa.n_components = n
        #pca_scores.append(np.mean(cross_val_score(pca, X)))  # if I attempt to compute pca_scores I get an error
        fa_scores.append(np.mean(cross_val_score(fa, X)))

    print(pca_scores, fa_scores)

compute_scores(data)

The code's output:

[],[-947738125363.77405,-947738145459.86035,-947738159924.70471,-947738174662.89746,-947738206142.62854,-947738179314.44739,-947738220921.50684,-947738223447.3678,-947738277298.33545,-947738383772.58606,-947738415104.84912,-947738406361.44482,-947738394379.30359,-947738456528.69275,-947738501001.14319,-947738991338.98291,-947739381280.06506,-947739389033.33557,-947739434992.48047,-947739549511.2655,-947739355699.70959,-947739879828.51514,-947739898216.39099,-947739905804.71033,-947739902618.47791,-947738564594.54639,-948816122907.87366,-947744046601.55029,-947738624937.61292,-947738625325.73486,-947738626111.14441,-947738624973.92188,-947738625200.06946,-947738625568.65027,-947738625528.69666,-947738625359.41992,-947738624906.67529,-947738625652.12439,-947739509002.01868,-947738625426.81946,-947738625380.45837]

This result is far from what I expected. Here is R code for the same task on the same data; its output looks right (the results are close to the output of an IBM program that can perform FA):

data <- read.csv("test.csv", header = F)
col_names <- names(data)
drops <- c()

# collect zero-variance columns, which factanal() cannot handle
for (name in col_names){
  st_dev <- sd(data[, name], na.rm = T)
  if (st_dev == 0){
    drops <- c(drops, name)
  }
}

da_nal <- data[, !(names(data) %in% drops)]
factanal(na.omit(da_nal), factors = 1, scores = 'regression')$scores

The output of this code is:

Factor1
1   4.89102190
2   3.65004187
3   0.14628700
4  -0.20255897
5  -0.01565570
6  -0.16438863
7   0.40835986
8  -0.25823984
9  -0.20813064
10  0.09390067
11 -0.28891296
12 -0.28882753
13 -0.26624358
14 -0.25202275
15 -0.25181326
16 -0.15653679
17 -0.28702281
18 -0.28865654
19 -0.23251509
20 -0.28066125
21 -0.18714387
22 -0.24969113
23 -0.28302552
24 -0.28712610
25 -0.29196529
26 -0.28659988
27 -0.29502523
28 -0.15802910
29 -0.27440118
30 -0.29083667
31 -0.29548220
32 -0.29461059
33 -0.23594859
34 -0.29654336
35 -0.29759659
36 -0.29085001
37 -0.29539071
38 -0.29234303
39 -0.29702103
40 -0.27595130
41 -0.27184361

So I would like to get similar results in Python (I know I will not get exactly the same numbers), but I do not know how.
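For comparison, a rough Python counterpart of the R preprocessing above (drop zero-variance columns, then drop rows with missing values, then fit a single-factor model) could look like the sketch below. Standardising with preprocessing.scale() and using transform() as the analogue of factanal(..., scores = 'regression') are my assumptions, not part of the original post, and the numbers will not match factanal() exactly:

import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import scale

data = np.genfromtxt('test.csv', delimiter=',')     # missing fields become NaN

keep = np.nanstd(data, axis=0) > 0                  # mirror the R loop: drop zero-variance columns
data = data[:, keep]
data = data[~np.isnan(data).any(axis=1)]            # mirror na.omit(): drop incomplete rows

X = scale(data)
fa = FactorAnalysis(n_components=1)
factor_scores = fa.fit_transform(X)                 # one latent-factor score per observation
print(factor_scores.ravel())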

Solution

It seems I have figured out how to get the scores.

from sklearn import decomposition, preprocessing
import numpy as np

data = np.genfromtxt('rangir_test.csv', delimiter=',')
data = data[~np.isnan(data).any(axis=1)]   # drop rows with missing values
data_normal = preprocessing.scale(data)    # standardise the columns
fa = decomposition.FactorAnalysis(n_components=1)
fa.fit(data_normal)
for score in fa.score_samples(data_normal):
    print(score)

Unfortunately, the output (see below) is very different from the output of factanal(). Any advice on decomposition.FactorAnalysis() would be appreciated. A likely explanation and a sketch of an alternative are given after the output.

The scikit-learn score output:

-69.8587183816
-116.353511148
-24.1529840248
-36.5366398005
-7.87165586175
-24.9012815104
-23.9148486368
-10.047780535
-4.03376369723
-7.07428842783
-7.44222705099
-6.25705487929
-13.2313513762
-13.3253819521
-9.23993173528
-7.141616656
-5.57915693405
-6.82400483045
-15.0906961724
-3.37447211233
-5.41032267015
-5.75224753811
-19.7230390792
-6.75268922909
-4.04911793705
-10.6062761691
-3.17417070498
-9.95916350005
-3.25893428094
-3.88566777358
-3.30908856716
-3.58141292341
-3.90778368669
-4.01462493538
-11.6683969455
-5.30068548445
-24.3400870389
-7.66035331181
-13.8321672858
-8.93461397086
-17.4068326999
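A likely explanation for the gap: FactorAnalysis.score_samples() returns the log-likelihood of each sample under the fitted model, which is why the values above look nothing like factor scores, whereas factanal(..., scores = 'regression') returns the estimated value of the latent factor for each observation. The closer scikit-learn analogue of the latter is FactorAnalysis.transform(). A hedged sketch that reuses the preprocessing from the answer above (it still will not reproduce factanal()'s numbers exactly, since the estimation details differ):

from sklearn import decomposition, preprocessing
import numpy as np

data = np.genfromtxt('rangir_test.csv', delimiter=',')
data = data[~np.isnan(data).any(axis=1)]
data_normal = preprocessing.scale(data)

fa = decomposition.FactorAnalysis(n_components=1)
fa.fit(data_normal)

log_likelihoods = fa.score_samples(data_normal)   # what the answer above printed
factor_scores = fa.transform(data_normal)         # per-observation latent-factor estimates
for s in factor_scores.ravel():
    print(s)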
