
[bigdata-099] A credit scorecard in R with the German credit dataset

Published: 2020-12-14 03:08:40 | Category: Big Data | Source: compiled from the web
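1. R project website

https://www.r-project.org/

2. Install R

apt-get install r-base

3. Run R

Type R on the command line to enter the interactive console, then enter 1+2. If it prints 3, the installation works.

4. Download RStudio

https://www.rstudio.com/products/rstudio/

Pick either of these two packages:

https://download1.rstudio.org/rstudio-1.0.143-amd64.deb
https://download1.rstudio.org/rstudio-1.0.143-amd64-debian.tar.gz

After installing, run rstudio to launch the IDE and start editing.

5. A scorecard model in R

https://mp.weixin.qq.com/s?__biz=MzI4NjU3NTc2Ng==&mid=100000086&idx=1&sn=1ffbb9970c4aef7cda55e9167bff3b7e&chksm=6bdb9caf5cac15b9a1bd831d5ffc98855c7a222a3ae54be5d03483432eee17b47785961e3395&mpshare=1&scene=1&srcid=0124JAvifX5A9CYNeX4NzF2h&pass_ticket=6A8%2BS2uaF5GhO8xXYfQQZhl3tjAHsmCyP2LsJsw2QWlKOgu%2BCC9qW2GoAh0BYmjP#rd

Note that the code in that article is illustrative only and cannot be run as-is.

Source code: https://github.com/frankhlchi/R-scorecard

5.1 Install the caret package

sudo R
install.packages("caret")

The script in 5.4 also loads smbinning and ggplot2, which can be installed the same way. Whenever a new package is needed, I recommend starting R as root, running install.packages there, and then switching back to a normal account for development.

5.2 Download the test data

wget https://raw.githubusercontent.com/frankhlchi/R-scorecard/master/german_credit.csv

5.3 Start R

R

5.4 Run the following commands

---------------------------
library(caret)
library(smbinning)
library(ggplot2)

# Load the data (adjust the path to wherever the CSV was downloaded)
german_credit <- read.csv("~/german_credit.csv")

# Draw a 75% training sample; passing the label as a factor makes caret
# sample within each class so the 0/1 proportions are preserved
train <- createDataPartition(y = factor(german_credit$Creditability), p = 0.75, list = FALSE)
train2 <- german_credit[train, ]
test2 <- german_credit[-train, ]

# Explore data distribution
ggplot(german_credit, aes(x = Duration, y = ..count..)) +
  geom_histogram(fill = "blue", colour = "grey60", size = 0.2, alpha = 0.2, binwidth = 5)
ggplot(german_credit, aes(x = CreditAmount, y = ..count..)) +
  geom_histogram(fill = "blue", colour = "grey60", size = 0.2, alpha = 0.2, binwidth = 1000)
# No binwidth is given for Age, so ggplot2 falls back to its default of 30 bins
ggplot(german_credit, aes(x = Age, y = ..count..)) +
  geom_histogram(fill = "blue", colour = "grey60", size = 0.2, alpha = 0.2)
ggplot(german_credit, aes(x = Creditability)) +
  geom_bar(stat = "count")

# Optimal binning
Durationresult <- smbinning(df = train2, y = "Creditability", x = "Duration", p = 0.05)
CreditAmountresult <- smbinning(df = train2, y = "Creditability", x = "CreditAmount", p = 0.05)
Ageresult <- smbinning(df = train2, y = "Creditability", x = "Age", p = 0.05)

# Plot the weight of evidence (WoE) of each bin
smbinning.plot(CreditAmountresult, option = "WoE", sub = "CreditAmount")
smbinning.plot(Durationresult, option = "WoE", sub = "Duration")
smbinning.plot(Ageresult, option = "WoE", sub = "Age")
---------------------------

6. Exploratory modeling

6.1 Data

The German credit dataset, available at https://raw.githubusercontent.com/frankhlchi/R-scorecard/master/german_credit.csv

Data format: the first line of the CSV is the header with 21 column names, and every following line is one applicant record with one numeric value per column.

---------------------
Creditability,AccountBalance,Duration,PaymentStatusofPreviousCredit,Purpose,CreditAmount,ValueSavings,Lengthofcurrentemployment,Instalmentpercent,Sex&Marital Status,Guarantors,DurationinCurrentaddress,Mostvaluableavailableasset,Age,ConcurrentCredits,Typeofapartment,NoofCreditatthisBank,Occupation,Noofdependents,Telephone,ForeignWorker
---------------------

6.2 Import the data

Use the read.csv function.

---------------------
german_credit <- read.csv("./german_credit.csv")
---------------------

6.3 Draw a random sample

http://topepo.github.io/caret/data-splitting.html

createDataPartition(y = ..., p = 0.75, list = FALSE) randomly draws the row indices of 75% of the dataset to use as the training set. The distribution of Creditability in the sample must match that of the full dataset. Creditability takes the value 0 or 1, indicates repayment ability, and is the class label, so the split has to be drawn according to the class proportions; otherwise the training and test sets could end up with different label distributions. caret samples within each class when y is a factor, which is why the label is wrapped in factor() here.

The sampled indices are then used to build the training and test sets.

-----------------------
train <- createDataPartition(y = factor(german_credit$Creditability), p = 0.75, list = FALSE)
train2 <- german_credit[train, ]
test2 <- german_credit[-train, ]
-----------------------

6.4 Plotting with ggplot2

http://ggplot2.org/

This plots the distribution of Duration. The y value ..count.. is not a column of the data but a special computed variable: ..count.., ..density.. are returned by a stat transformation of the original data set. Recent ggplot2 releases prefer writing this as after_stat(count).

----------------------
ggplot(german_credit, aes(x = Duration, y = ..count..)) +
  geom_histogram(fill = "blue", colour = "grey60", size = 0.2, alpha = 0.2, binwidth = 5)
----------------------

6.5 Variable binning

There are three ways to cut a continuous variable into segments: equal-width binning, equal-depth (equal-frequency) binning, and optimal binning. smbinning performs optimal binning by recursive partitioning, with p giving the minimum share of records per bin.

-----------------------
Durationresult <- smbinning(df = train2, y = "Creditability", x = "Duration", p = 0.05)
-----------------------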
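
To make the difference between equal-width and equal-depth binning concrete, and to show what the WoE values plotted by smbinning.plot measure, here is a minimal base-R sketch. It does not use smbinning and does not reproduce its optimal cut points. The toy data, the variable names duration and good, and the helper woe_table are all hypothetical, and the sign convention WoE = ln(share of goods / share of bads) is an assumption; some texts use the opposite sign.

```r
# Hypothetical toy data: loan duration in months and a 0/1 flag (1 = good).
set.seed(42)
n <- 1000
duration <- sample(4:72, n, replace = TRUE)
# Assumption for illustration only: longer loans are less likely to be good.
good <- rbinom(n, size = 1, prob = 0.9 - 0.005 * duration)

# Equal-width binning: 4 intervals of the same length over the value range.
equal_width <- cut(duration, breaks = 4, include.lowest = TRUE)

# Equal-depth binning: cut at the quartiles so each bin holds about n/4 rows.
q <- quantile(duration, probs = seq(0, 1, by = 0.25))
equal_depth <- cut(duration, breaks = unique(q), include.lowest = TRUE)

# Hypothetical helper: weight of evidence per bin plus the information value.
woe_table <- function(bin, good) {
  goods <- as.vector(tapply(good, bin, sum))      # good cases per bin
  bads <- as.vector(tapply(1 - good, bin, sum))   # bad cases per bin
  pct_good <- goods / sum(goods)                  # share of all goods in the bin
  pct_bad <- bads / sum(bads)                     # share of all bads in the bin
  woe <- log(pct_good / pct_bad)
  iv <- sum((pct_good - pct_bad) * woe)
  list(table = data.frame(bin = levels(bin), goods = goods, bads = bads, woe = woe),
       iv = iv)
}

width_result <- woe_table(equal_width, good)
depth_result <- woe_table(equal_depth, good)

print(table(equal_width))   # bin sizes differ
print(table(equal_depth))   # bin sizes are roughly equal
print(depth_result$table)
print(depth_result$iv)
```

Equal-width bins are easy to read but can leave some bins nearly empty on skewed variables such as CreditAmount, while equal-depth bins keep the counts balanced at the cost of uneven interval lengths. Optimal binning instead chooses the cut points from the relationship between the variable and the label.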

(Editor: 李大同)
