Jim Gray Summary Home Page
Microsoft eScience Group

As you may be aware, Jim Gray has gone missing. We (his colleagues in Microsoft Research) have heard from many of his collaborators about projects and collaborations that he had underway with them, and who are unsure how to proceed. If you find yourself in this situation, please email grayproj@microsoft.com and we will follow up with you to find the best way forward.

Jim Gray is a researcher and manager of Microsoft Research's eScience Group. His primary research interests are in databases and transaction processing systems -- with a particular focus on using computers to make scientists more productive. He and his group are working in the areas of astronomy, geography, hydrology, oceanography, biology, and health care. He continues a long-standing interest in building supercomputers with commodity components, thereby reducing the cost of storage, processing, and networking by factors of 10x to 1000x over low-volume solutions. This includes work on building fast networks, building huge web servers with CyberBricks, and building very inexpensive, very high-performance storage servers.

Jim is also working with the astronomy community to build the world-wide telescope and has been active in building online databases like http://terraService.Net and http://skyserver.sdss.org. When the entire world's astronomy data is on the Internet and is accessible as a single distributed database, the Internet will be the world's best telescope. This is part of the larger agenda of getting all information online and easily accessible (digital libraries, digital government, online science, ...). More generally, he is working with the science community (oceanography, hydrology, environmental monitoring, ...) to build the world-wide digital library that integrates all the world's scientific literature and data in one easily accessible collection.
He is active in the research community, is an ACM, NAE, NAS, and AAAS Fellow, and received the ACM Turing Award for his work on transaction processing. He also edits a series of books on data management.

What's New?

- "Performance of a Sun X4500 under Windows, NTFS, and SQL Server 2005" (pdf). Sun loaned this storage/compute brick to JHU for some of the eScience internet services we are building there. This preliminary performance report shows it to be a balanced system (4 CPUs, 16 GB RAM, 48 disks, 24 TB, all in 4U using 800 W). Here is the spreadsheet with the numbers for the graphs, and here is a zip of the test tools and scripts.

- A radical view of flash disks: document and talk.

- "SkyServer Traffic Report -- The First Five Years" is a study of the traffic on Skyserver.sdss.org, an eScience website. Done jointly with Vik Singh, Alex Szalay, Ani Thakar, Jordan Raddick, Bill Boroski, Svetlana Lebedeva, and Brian Yanny, it analyzes the traffic to see how people and programs use the site, the data, and the batch job system.

- "Cross-Matching Multiple Spatial Observations and Dealing with Missing Data", with Alex Szalay, Tamás Budavári, Robert Lupton, Maria Nieto-Santisteban, and Ani Thakar, explains how to spatially correlate observations of the same area (of the sky, or earth, or ...).

- "Life Under Your Feet: An End-to-End Soil Ecology Sensor Network, Database, Web Server, and Analysis Service". With Katalin Szlavecz, Andreas Terzis, Razvan Musaloiu-E., Joshua Cogan, Sam Small, Stuart Ozer, Randal Burns, and Alex Szalay of JHU, we built an end-to-end soil monitoring system deployed at a Baltimore urban forest. Sensor moisture and temperature reports are stored and calibrated in a database. The measurement database is published through Web Services interfaces. In addition, analysis tools let scientists analyze current and historical data and help manage the sensor network.

- "GPUTeraSort: High Performance Graphics Coprocessor Sorting for Large Database Management", with Naga K.
Govindaraju, Ritesh Kumar, and Dinesh Manocha of UNC, who built a sorter that uses the Graphics Processing Unit (GPU) to sort very fast. I helped with the IO and with writing this report so that I could read it :). GPUs have 10x the memory bandwidth and processing power of the CPU, and the gap is widening, so we have to learn how to use them. This is my first experience in this new world -- it's a vector coprocessor, it's a SIMD machine, it's really different -- and so a lot of fun. You get to rethink all your assumptions.

- "Empirical Measurements of Disk Failure Rates and Error Rates", with Catharine van Ingen, describes moving two petabytes using inexpensive computers and reports the errors we observed -- SATA uncorrectable read errors happen, but they are not the main problem.

- Three papers on doing a modern Finite Element Analysis system using off-the-shelf database and visualization tools for data management and data analysis.

- "Petascale Computational Systems: Balanced CyberInfrastructure in a Data-Centric World" (pdf) is a letter from Gordon Bell, Alex Szalay, and me to the NSF Cyberinfrastructure Directorate. It argues that computational science is changing to be data intensive. NSF should support balanced systems -- not just CPU farms, but also petascale IO and networking -- and should allocate resources to support a balanced Tier-1 through Tier-3 national cyber-infrastructure.

- Alex Szalay and Gyorgy Fekete of JHU, Bonnie Freiberg of SQL Server, and I worked to get a spatial indexing sample into SQL Server 2005. This is a public domain implementation of the HTM algorithms documented in "Indexing the Sphere with the Hierarchical Triangular Mesh" (pdf, MSR-TR-2005-123). The library itself, with examples, is described in "Using Table Valued Functions in SQL Server 2005 To Implement a Spatial Data Library" (pdf, MSR-TR-2005-122).

- 20 years ago today the Datamation article A Measure of Transaction Processing appeared.
Charles Levine and I thought it was time to benchmark a PC -- my 2-year-old TabletPC, to be exact. We ran TPC-B (DebitCredit without the message handling) and got about 8k tps (!). The 4-page report and 6-page script that builds the database and runs the benchmark in half an hour is "Thousands of DebitCredit Transactions-Per-Second: Easy and Inexpensive". Abstract: A $2k computer can execute about 8k transactions per second. This is 80x more than one of the largest US banks' 1970s traffic -- it approximates the total US 1970s financial transaction volume. Very modest modern computers can easily solve yesterday's problems. A second paper with a broader perspective, "A Measure of Transaction Processing 20 Years Later", appeared as MSR-TR-2005-57 and in the June 2005 IEEE Data Engineering Bulletin.

eScience: My involvement with the astronomers continues to be fun. We have built a batch system for long-running queries: "Batch is back: CasJobs, serving multi-TB data on the Web", William O'Mullane, Nolan Li, Maria A. Nieto-Santisteban, Alexander S. Szalay, February 2005. I have also been musing about where scientific data management is going.

Storage Architecture: Peter Kukol and others have been working on moving bulk data: the goal is to move 1 gigabyte per second from CERN (Geneva, Switzerland) to Pasadena, California, so that the physicists in California can see the data as it comes out of the Large Hadron Collider (LHC), which will come online in 2008. Many other science disciplines need this as well. This paper shows how to do local IO fast: "Sequential File Programming Patterns and Performance with .NET", Peter Kukol, Jim Gray, describes and measures programming patterns for sequential file access in the .NET Framework. The default behavior provides excellent performance on a single disk -- 50 MBps both reading and writing. Using large request sizes and file pre-allocation has quantifiable benefits.
.NET unbuffered IO delivers 800 MBps on a 16-disk array, but buffered IO delivers about 12% of that performance. Consequently, high-performance file and database utilities are still forced to use unbuffered IO for maximum sequential performance. The report is accompanied by downloadable source code that demonstrates the concepts and the code that was used to obtain these measurements.

With Caltech (Harvey Newman et al.), CERN, AMD, Newisys, and the Windows networking group, we have been working to move data from CERN to Caltech (11,000 km) at 1 GBps (one gigabyte per second). We have not succeeded yet. Our progress is reported in Gigabyte Bandwidth Enables Global Co-Laboratories (4.2 MB MS Word; pdf of slides + transcript, 2.4 MB), a presentation Harvey Newman and I gave at the Windows Hardware Engineering Conference, Seattle, WA, 3 May 2004. Peter Kukol's "Sequential Disk IO Tests for GBps Land Speed Record" tells how we move the data the first and last meter at about 2 GBps.

TerraServer: Our investigation of CyberBricks continues with various whitepapers about our experiences. "TerraServer Bricks -- A High Availability Cluster Alternative", Tom Barclay, Wyman Chong, Jim Gray, describes the migration of the TerraServer to a brick hardware design and our experience operating it over the last year. It makes an interesting contrast to "TerraServer Cluster and SAN Experience", Tom Barclay, which describes our experience operating the TerraServer SAN cluster in a "classic" enterprise configuration for three years. TerraService.NET: An Introduction to Web Services tells how Tom Barclay converted the TerraServer to a web service, and how the USDA uses that web service. "A Quick Look at SATA Disk Performance", Tom Barclay, Wyman Chong, Jim Gray, investigates the use of storage bricks: low-cost, commodity components for multi-terabyte SQL Server databases. One issue has been the shortcomings of Parallel ATA (PATA) disks. Serial ATA (SATA) drives address many of these problems.
This article evaluates SATA drive performance and reliability. Each disk delivers about 50 MBps sequential, and about 75 read IOps and 130 write IOps on random IO. It is the sequel to "TeraScale SneakerNet: Using Inexpensive Disks for Backup, Archiving, and Data Exchange", which describes the storage bricks we use for data interchange, archiving, and backup/restore, and gives price, performance, and some rationale.

Deep Thought :) : An extended abstract of my keynote talk at ACM SIGMOD 2004, Paris, France, "The Revolution in Database Architecture", enumerates the enormous changes happening to database system architecture.

Grid Computing: Distributed Computing Economics discusses the economic tradeoffs of doing Grid-scale distributed computing (WAN rather than LAN clusters). It argues that computations must be nearly stateless and have more than 10 hours of CPU time per GB of network traffic before outsourcing the computation makes economic sense. This is part of the more general discussion of Grid computing. My views are presented in the memo Microsoft and Grid Computing, a PowerPoint presentation Web Services, Large Databases, and what Microsoft is doing in the Grid Computing space, and a Microsoft and Grid Computing Interview for Grid-Middleware Spectra.
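The break-even rule of thumb -- more than 10 CPU hours per GB of traffic before outsourcing pays -- follows from simple unit-cost arithmetic. A minimal sketch, with the caveat that the dollar figures below are illustrative assumptions chosen to reproduce the 10-hours-per-GB ratio, not prices taken from the paper:

```python
# Break-even sketch for the "should we outsource this computation?"
# question in Distributed Computing Economics. The unit costs are
# illustrative assumptions; only the reasoning pattern matters.

COST_PER_GB_WAN = 1.00    # dollars to ship 1 GB over the WAN (assumed)
COST_PER_CPU_HOUR = 0.10  # dollars for one hour of remote CPU (assumed)


def break_even_cpu_hours_per_gb():
    """CPU hours of useful work a job must do per GB moved to break even."""
    return COST_PER_GB_WAN / COST_PER_CPU_HOUR


def outsourcing_saves_money(cpu_hours, gb_moved):
    """Outsourcing pays only when the compute bought remotely is worth
    more than the network cost of shipping the data to it."""
    network_cost = gb_moved * COST_PER_GB_WAN
    compute_value = cpu_hours * COST_PER_CPU_HOUR
    return compute_value > network_cost
```

With these assumed prices the break-even point is 10 CPU hours per GB moved, matching the order-of-magnitude rule the paper argues for: data-hungry, compute-light jobs should stay near their data.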