Processing Spatio-Temporal Traffic Big Data with R
Link: https://github.com/toddwschneider/nyc-taxi-data

Unified New York City Taxi and Uber data

Code in support of this post: Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance

This repo provides scripts to download, process, and analyze data for over 1.1 billion taxi and Uber trips originating in New York City. The data is stored in a PostgreSQL database, and PostGIS is used for spatial calculations, in particular mapping latitude/longitude coordinates to census tracts.

The yellow and green taxi data comes from the NYC Taxi & Limousine Commission, and Uber data comes via FiveThirtyEight, who obtained it via a FOIL request.

Instructions

Your mileage may vary, but on my MacBook Air, this process took about 3 days to complete. The unindexed database takes up 267 GB on disk. Adding indexes for improved query performance increases total disk usage to 375 GB.

1. Install PostgreSQL and PostGIS
Both are available via Homebrew on Mac OS X

2. Download raw taxi data
./download_raw_data.sh

3. Initialize database and set up schema
./initialize_database.sh

4. Import taxi data into database and map to census tracts
./import_trip_data.sh

5. Optional: download and import Uber data from FiveThirtyEight's GitHub repository
./download_raw_uber_data.sh

6. Analysis
Additional Postgres and R scripts for analysis are in the analysis/ folder, or you can do your own!

Schema
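The import step described above maps each trip's latitude/longitude to a census tract, which the repo does inside PostGIS. As a rough illustration of the idea behind that mapping, and not the repo's actual implementation, the sketch below runs a plain ray-casting point-in-polygon test in Python. The tract IDs, the rectangular tract shapes, and the pickup coordinates are all made up for the example. Real census tracts are irregular polygons loaded from shapefiles, and a spatial database would use a spatial index instead of scanning every polygon.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray-casting test: cast a ray east from the point and count edge crossings."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can be crossed.
        if (y1 > y) != (y2 > y):
            # Longitude at which this edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # an odd number of crossings means "inside"
    return inside


def assign_tract(point: Point, tracts: Dict[str, List[Point]]) -> Optional[str]:
    """Return the ID of the first tract containing the point, or None if no tract does."""
    for tract_id, polygon in tracts.items():
        if point_in_polygon(point, polygon):
            return tract_id
    return None


# Hypothetical rectangular "tracts" (assumption: invented for illustration only).
TRACTS: Dict[str, List[Point]] = {
    "tract_A": [(-74.00, 40.70), (-73.98, 40.70), (-73.98, 40.72), (-74.00, 40.72)],
    "tract_B": [(-73.98, 40.70), (-73.96, 40.70), (-73.96, 40.72), (-73.98, 40.72)],
}

# Hypothetical pickup coordinates; the last one falls outside both tracts.
pickups: List[Point] = [(-73.99, 40.71), (-73.97, 40.71), (-73.90, 40.80)]
assigned = [assign_tract(p, TRACTS) for p in pickups]
print(assigned)
```

A linear scan like this costs one polygon test per tract per trip, which would be far too slow for a dataset of this size; it is only meant to show what "mapping coordinates to census tracts" computes.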
Other data sources

These are bundled with the repository, so no need to download separately, but:
Data issues encountered
Why not use BigQuery or Redshift?

Google BigQuery and Amazon Redshift would probably provide significant performance improvements over PostgreSQL. A lot of the data is already available on BigQuery, but in scattered tables, and each trip has only latitude and longitude coordinates, not census tracts and neighborhoods. PostGIS seemed like the easiest way to map coordinates to census tracts. Once the mapping is complete, it might make sense to load the data back into BigQuery or Redshift to make the analysis faster. Note that BigQuery and Redshift cost some amount of money, while PostgreSQL and PostGIS are free.

Questions/issues/contact

todd@toddwschneider.com, or open a GitHub issue

(Editor: 李大同)