Public Data Sets for Big Data Analysis Hosted by Amazon (Massive in Scale)
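One difficulty in big data analysis is that massive data sets are hard to store locally and take an extremely long time to download. For example, at a download speed of 3 MB/s (currently the average broadband speed in China), 1 TB of data takes about 4 days to download. Some data sets run to tens of terabytes, so the download alone would take months. To address this, Amazon's AWS cloud platform hosts many commonly used large-scale data sets as Public Data Sets (https://aws.amazon.com/datasets), which can be used on Amazon AWS EC2 without downloading anything. Taking Linux as an example, the procedure is: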
Steps 1-3 are all carried out in the AWS graphical console and are very intuitive. Steps 4-5 are just two commands, after which you can immediately start working with terabytes of data; Common Crawl, for example, is 50 TB.

Keep an eye on the price. The cheapest EBS rate is currently 5 cents per GB per month, so the Common Crawl data would cost 2,500 US dollars a month, plus read/write charges.

The more than fifty data sets currently online are:

1000 Genomes Project: Initiated in 2008, an international public-private consortium that aims to build the most detailed map of human genetic variation available. See http://en.wikipedia.org/wiki/1000_Genomes_Project
1980 US Census: Data from the 1980 US Census.
1990 US Census: Data from the 1990 US Census.
2000 US Census: Data from the 2000 US Census.
2003-2006 US Economic Data: US economic data for the years 2003 to 2006.
2008 TIGER/Line Shapefiles: Census 2000 and current United States shapefiles, with detailed administrative boundaries.
3D Version of the PubChem Library: A 3D version of the PubChem library of bioactivity data for small organic molecules.
AnthroKids – Anthropometric Data of Children: Anthropometric data on children from two studies in 1975 and 1977.
Apache Software Foundation Public Mail Archives: A collection of all publicly available Apache Software Foundation mail archives as of July 11, 2011.
Business and Industry Summary Data: US business and industry summary data.
C57BL/6J by C3H/HeJ Mouse Cross (Sage Bionetworks): Mouse cross data (C57BL/6J by C3H/HeJ) from the Jake Lusis lab at UCLA.
Common Crawl Corpus: A corpus of web crawl data composed of over 5 billion web pages. This data set is freely available on Amazon S3 and is released under the Common Crawl Terms of Use.
Daily Global Weather Measurements, 1929-2009 (NCDC, GSOD): Eighty years of daily weather measurements (temperature, wind speed, humidity, pressure, etc.) from 9000+ weather stations around the world.
DBpedia 3.5.1: The DBpedia structured knowledge base. DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web.
Denisova Genome: The high-coverage genome sequence of a Denisovan individual, sequenced to ~30x coverage on the Illumina platform. Together with their sister group the Neandertals, Denisovans are the most closely related extinct relatives of currently living humans.
Enron Email Data: Enron email data publicly released as part of FERC’s Western Energy Markets investigation, converted to industry standard formats by EDRM. The data set consists of 1,227,255 emails with 493,384 attachments covering 151 custodians. The email is provided in Microsoft PST, IETF MIME, and EDRM XML formats.
Ensembl – FASTA Database Files: Ensembl sequence databases of transcript and translation models for eukaryotic genomes.
Ensembl Annotated Human Genome Data (FASTA Release 73): Gene sequences for human and over 50 other species. The Ensembl project produces these genome databases and makes the information freely available.
Ensembl Annotated Human Genome Data (MySQL Release 73): The MySQL version of the same data. The Ensembl project produces genome databases for human as well as over 50 other species.
Federal Contracts from the Federal Procurement Data Center (USASpending.gov): A data dump of all US federal contracts from the Federal Procurement Data Center, found at USASpending.gov.
Federal Reserve Economic Data – Fred: A Federal Reserve database of 20,059 U.S. economic time series.
Freebase Data Dump: The Freebase knowledge graph. Freebase is an open database of the world’s information, covering millions of topics in hundreds of categories.
Freebase Quad Dump: The Freebase knowledge graph in quad format; a data dump of all the current facts and assertions in Freebase.
Freebase Simple Topic Dump: A simplified data dump of the basic identifying facts about every topic in Freebase.
GenBank: The GenBank sequence database, an annotated collection of all publicly available DNA sequences, including more than 85.7B bases and 82.8M sequence records.
Google Books Ngrams: A data set containing the Google Books n-gram corpora. This data set is freely available on Amazon S3 in a Hadoop-friendly file format and is licensed under a Creative Commons Attribution 3.0 Unported License. The original data set is available from http://books.google.com/ngrams/.
Human Liver Cohort (Sage Bionetworks): Human Liver Cohort data characterizing gene expression in liver samples.
Human Microbiome Project: The Human Microbiome Project data set.
Illumina – Jay Flatley (CEO of Illumina) Human Genome Data Set: The human genome data set of Jay Flatley, CEO of Illumina.
Influenza Virus (including updated Swine Flu sequences): NCBI Influenza Resource Center data.
Japan Census Data: Multiple data sets, including (1) Population Census of Japan (1995, 2000, 2005, 2010), (2) Establishment and Enterprise Census of Japan (1999, 2001, 2004, 2006), and (3) Economic Census of Japan (2009).
Labor Statistics Databases: Various labor statistics from the US Department of Labor.
M-Lab dataset: Network Diagnostic Tool (NDT): Internet performance (for example, connection speed) diagnostic data. NDT test results created through Measurement Lab (M-Lab) between February 2009 and September 2009.
M-Lab dataset: Network Path and Application Diagnosis tool (NPAD): Internet routing and packet-header test data. NPAD test results created through Measurement Lab (M-Lab) between February 2009 and September 2009.
Marvel Universe Social Graph: An example of a social collaboration network based on the characters in the Marvel Universe, that is, the fictional world of the Marvel comic books.
Material Safety Data Sheets: 230,000 Material Safety Data Sheets.
Million Song Dataset: The Million Songs Collection is a collection of 28 data sets containing audio features and metadata for a million contemporary popular music tracks.
Million Song Sample Dataset: A 10,000-song subset of audio features and metadata from the Million Songs Collection.
Model Organism Encyclopedia of DNA Elements (modENCODE): A collection of data from the modENCODE project (http://www.modencode.org).
NASA NEX: NASA Earth satellite imagery and climate change data. Three NASA NEX data sets are now available, including climate projections and satellite images of Earth.
OpenStreetMap Rendering Database: Open-source global map data; a PostGIS 8.3 data cluster of all OpenStreetMap data for the planet.
Petroleum Public Data Set (working Title): Public-domain data for the oil and gas industry, assembled from the contributions of participating agencies in the United States, Canada and around the world. This data gives industry stakeholders an opportunity to focus their efforts on analysis and interpretation, without concern for the trivial and time-consuming tasks of locating, downloading, reformatting and integrating the data before value-added work can be performed.
PubChem Library: A data set of information on the biological activities of small organic molecules.
Sloan Digital Sky Survey DR6 Subset: The Sloan Digital Sky Survey is the most ambitious astronomical survey ever undertaken.
The Cannabis Sativa Genome: Whole genome shotgun sequencing of the Cannabis sativa cultivar “Chemdawg”.
The WestburyLab USENET corpus: An anonymized compilation of postings from 47,860 English-language newsgroups from 2005-2010.
Transportation Databases: Various transportation statistics from the US Department of Transportation, covering aviation, maritime, highway, rail, pipeline, bicycle and other modes.
Twilio/Wigle.net Street Vector Data Set: The Twilio/Wigle.net database of mapped US street names and address ranges.
Unigene: NCBI's transcriptome database. UniGene: An Organized View of the Transcriptome.
University of Florida Sparse Matrix Collection: A large, widely available, and actively growing set of sparse matrices that arise in real applications.
Wikipedia Extraction (WEX): A processed dump of the English-language Wikipedia, structured data enhanced with Freebase.
Wikipedia Page Traffic Statistic V3: A 150 GB sample of the data used to power trendingtopics.org. It includes a full 3 months of hourly page traffic statistics from Wikipedia (1/1/2011-3/31/2011).
Wikipedia Page Traffic Statistics: Contains 7 months of hourly pageview statistics from 2009 for all articles in Wikipedia.
Wikipedia Traffic Statistics V2: Contains 16 months of hourly pageview statistics from 2009-2010 for all articles in Wikipedia.
Wikipedia XML Data: A 2009 snapshot; a complete copy of all Wikimedia wikis, in the form of wikitext source and metadata embedded in XML.
YRI Trio Dataset: Complete genome sequence data for three Yoruba individuals from Ibadan, Nigeria.

(Editor: 李大同)
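
The download-time and storage-cost figures quoted in the introduction can be checked with a short calculation. The following is only a minimal sketch: the function names are made up for illustration, decimal units are assumed (1 TB = 1,000 GB = 1,000,000 MB), and the 3 MB/s speed and 5 cents per GB per month price are simply the numbers given in the article, not current AWS pricing.

```python
SECONDS_PER_DAY = 86_400


def download_days(size_tb: float, speed_mb_per_s: float) -> float:
    """Days needed to download size_tb terabytes at speed_mb_per_s MB/s."""
    size_mb = size_tb * 1_000_000  # assumption: decimal units, 1 TB = 10^6 MB
    seconds = size_mb / speed_mb_per_s
    return seconds / SECONDS_PER_DAY


def monthly_storage_cost_usd(size_tb: float, price_per_gb_month: float) -> float:
    """Monthly storage cost in US dollars for size_tb terabytes."""
    size_gb = size_tb * 1_000  # assumption: decimal units, 1 TB = 1000 GB
    return size_gb * price_per_gb_month


if __name__ == "__main__":
    # Figures taken from the article: 3 MB/s, 1 TB and 50 TB, 0.05 USD/GB/month.
    print(f"1 TB at 3 MB/s:  {download_days(1, 3):.1f} days")
    print(f"50 TB at 3 MB/s: {download_days(50, 3):.1f} days")
    print(f"50 TB on EBS:    {monthly_storage_cost_usd(50, 0.05):.0f} USD/month")
```

Under these assumptions, 1 TB takes a little under 4 days, 50 TB takes roughly half a year, and storing 50 TB costs 2,500 US dollars a month, which matches the article's estimates.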