Scraping Data Tables from Websites with R
For many quantitative analyses, we often consider factors such as climate or economic data. Such data are usually displayed as tables on websites and can be collected from there. However, copy-pasting into Excel is an extremely inefficient approach. Our task is to build a semi-automated web data scraping program, so that whenever a modeling refresh is needed, the raw data can be prepared. Here we take two examples. Several packages need to be loaded beforehand.
require(XML,quietly = T,warn.conflicts = F)
require(quantmod,warn.conflicts = F)
require(RMySQL,warn.conflicts = F)
require(dplyr,warn.conflicts = F)
Climate Data Collection - XML::readHTMLTable
In the XML package, there is a function that enables R to read tables directly from a website (when the page really contains tables).
readHTMLTable(doc,header = NA,colClasses = NULL,skip.rows = integer(),trim = TRUE,elFun = xmlValue,as.data.frame = TRUE,which = integer(),...)
webtable <- readHTMLTable("http://en.tutiempo.net/climate/06-2015/ws-545110.html",header = T,which = 1,stringsAsFactors = F)
What if you want to download data for different countries and different periods? By changing the latter two parts of the URL, you can collect as much information as you want.
GrabClimate <- function(year,month,country){
month <- paste(ifelse(nchar(month)==1,"0",""),month,sep = "") # zero-pad single-digit months
# link_tail: the collection of main APAC countries
link_tail <- c(`AUSTRALIA` = "/ws-947680.html",# SYDNEY
`CHINA` = "/ws-545110.html",# BEIJING
`INDIA` = "/ws-421820.html",# NEW DELHI
`INDONESIA` = "/ws-967490.html",# JAKARTA
`JAPAN` = "/ws-476620.html",# TOKYO
`KOREA` = "/ws-471080.html",# SEOUL
`MALAYSIA` = "/ws-486470.html",# KUALA LUMPUR
`NEW ZEALAND` = "/ws-934390.html",# WELLINGTON
`PHILIPPINES` = "/ws-984250.html",# MANILA
`THAILAND` = "/ws-484540.html",# BANGKOK
`TAIWAN` = "/ws-589680.html",# TAIPEI
`VIETNAM` = "/ws-488200.html") # HA NOI
# the address need to be adjusted according to the country to be modeled
link <- paste("http://en.tutiempo.net/climate/",month,"-",year,link_tail[toupper(country)],sep = "")
webtable <- readHTMLTable(link,which = 1,stringsAsFactors = F,na.strings = c("-"))
# Add 3 columns (Country, Year, Month) and remove the bottom 2 rows
webtable <- data.frame(Country = toupper(country),Year = as.numeric(year),Month = as.numeric(month),webtable[1:(nrow(webtable)-2),])
# Remove unneeded columns and rename the table
webtable <- select(webtable,-(VV:VG))
names(webtable) <- c("Country","Year","Month","Day","Avg_Temp","Max_Temp","Min_Temp","Atm_Pressure","Avg_Humidity","Rainfall","Ind_Rain_Drizzle","Ind_Snow","Ind_Storm","Ind_Fog")
rownames(webtable) <- NULL
# Convert Day to numeric and the weather indicators ("o" = occurred) to 0/1; "-" was already read as NA
webtable <- mutate(webtable,
                   Day = as.numeric(Day),
                   Ind_Rain_Drizzle = ifelse(Ind_Rain_Drizzle == "o",1,0),
                   Ind_Snow = ifelse(Ind_Snow == "o",1,0),
                   Ind_Storm = ifelse(Ind_Storm == "o",1,0),
                   Ind_Fog = ifelse(Ind_Fog == "o",1,0))
webtable
}
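As a sketch, the function above can be driven by nested loops to build the full panel. The date window (2013-01 through 2015-07) mirrors the one used for the stock data below, but the exact range and loop structure are assumptions for illustration:

```r
# Collect monthly climate tables for all listed countries (requires network access).
countries <- c("AUSTRALIA","CHINA","INDIA","INDONESIA","JAPAN","KOREA",
               "MALAYSIA","NEW ZEALAND","PHILIPPINES","THAILAND","TAIWAN","VIETNAM")
climate <- data.frame()
for (ctry in countries){
  for (yr in 2013:2015){
    for (mth in 1:12){
      if (yr == 2015 && mth > 7) next  # stop at July 2015
      climate <- rbind(climate,GrabClimate(yr,mth,ctry))
    }
  }
}
```

Stacking with rbind inside a loop is simple but slow for large panels; collecting the pieces in a list and binding once at the end is a common alternative.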
Using this function in nested loops, you can get the full data set (suppose we assign the result to a data frame named climate).
Stock Index Data Collection - quantmod::getSymbols
# the list of major APAC stock index codes
sym_stock <- c(`AUSTRALIA` = "^AORD",# All Ordinaries
`CHINA` = "^SSEC",# Shanghai Composite
`INDIA` = "^BSESN",# Bombay BSE SENSEX / BSE 30
`INDONESIA` = "^JKSE",# Jakarta Composite
`JAPAN` = "^N225",# Tokyo Share Nikkei 225
`KOREA` = "^KS11",# Seoul Composite
`MALAYSIA` = "^KLSE",# FTSE Bursa Malaysia KLCI
`NEW ZEALAND` = "^NZ50",# New Zealand NZX 50 Index
`PHILIPPINES` = "^PSI",# Manila
`THAILAND` = "SET",# BANGKOK
`TAIWAN` = "^TWII",# Taiwan Weighted
`VIETNAM` = "VNM") # Vietnam Index
stock_ind <- data.frame(matrix(numeric(),ncol = 8,dimnames = list(NULL,c("Country","Date","Open","High","Low","Close","Volume","AdjClose"))))
for (i in 1:12){
# auto.assign = F: return the data directly instead of assigning it under the index code's name
temp <- as.data.frame(suppressWarnings(getSymbols(sym_stock[i],auto.assign = F,from = "2013-01-01",to = "2015-07-31")))
names(temp) <- c("Open","High","Low","Close","Volume","AdjClose")
stock_ind <- rbind(stock_ind,data.frame(Country = names(sym_stock)[i],Date = rownames(temp),temp))
}
Load Data to MySQL
# create a database "example"
con <- dbConnect(MySQL(),host = "xx.xxx.xxxx.xxx",port = 3306,user = "Admin",password = "Power_Overwhelming")
dbSendQuery(con,"CREATE SCHEMA `example`")
# connection to MySQL db "example" using dplyr method
con1 <- src_mysql(dbname = "example",password = "Power_Overwhelming")
# connection to MySQL db "example" using RMySQL method
con2 <- dbConnect(MySQL(),dbname = "example",password = "Power_Overwhelming")
To create new tables, we can use
copy_to(con1,climate,"climate",temporary = F)
copy_to(con1,stock_ind,"stock",temporary = F)
If the table already exists:
db_insert_into(con2,table = "climate",value = climate)
db_insert_into(con2,table = "stock",value = stock_ind)
Add an index to optimize query speed:
dbSendQuery(con2,"ALTER TABLE climate ADD KEY climate(Country,Year,Month,Day,Avg_Temp,Max_Temp,Min_Temp,Rainfall,Ind_Rain_Drizzle,Ind_Snow,Ind_Storm,Ind_Fog)")
dbSendQuery(con2,"ALTER TABLE stock ADD KEY stock(Open,High,Low,Close,Volume,AdjClose)")
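With the tables loaded, the data can be queried back into R whenever a model refresh is needed. A minimal sketch using the RMySQL connection above (the WHERE clause values are illustrative assumptions):

```r
# Pull one country's climate records back into a data frame for modeling
# (requires the "example" database populated above; filter values are illustrative).
climate_cn <- dbGetQuery(con2,
  "SELECT * FROM climate WHERE Country = 'CHINA' AND Year = 2015")
head(climate_cn)
```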