Drew Robb investigates how off-the-shelf storage hardware, SQL Server and terabytes of celestial data are helping to pave the way for a boundless new era of desktop astronomy.
The usual image of an astronomer shows someone peering through the eyepiece of a large telescope observing the heavens. In reality, astronomers are more likely to be peering at a computer screen, running queries and simulations, or studying the digitized output of telescopes on the other side of the world.
On March 15, the Sloan Digital Sky Survey (SDSS) released its latest data set, Data Release 2, to researchers and the general public. The data set contains over six terabytes of images and the properties of more than 88 million celestial objects. It is available on the web at www.sdss.org/DR2 or in a more public-friendly format at the SkyServer site. Visitors can pan and zoom around the universe using a sort of celestial version of Mapquest and click on an object to find out the properties of a star, galaxy or quasar.
While this can be fun, from a scientific viewpoint the most important feature is the ability to query the data set for objects that meet the requirements of a research project. Meeting this demand required a fundamental change in the way astronomy information is normally stored and managed.
"The volume of data we were projecting was so large that the traditional methods that scientists were using wouldn't cut it any more," says Johns Hopkins University associate research scientist Ani Thakar.
The SDSS is a project of the Astrophysical Research Consortium, a group of more than 200 astronomers at 13 institutions around the world. Its multi-year goal is to map one-quarter of the sky and determine the brightness and position of several hundred million objects in it. It gathers information using a 2.5-meter telescope at the Apache Point Observatory in New Mexico. The telescope contains one of the largest imaging cameras in the world. While a typical large telescope contains a single CCD chip, the SDSS camera contains an array of thirty 4-megapixel chips.
Every two weeks, SDSS FedExes the raw imaging data to the U.S. Department of Energy's Fermi National Accelerator Laboratory in Batavia, Illinois, for processing. There it is analyzed, calibrated, put into ASCII CSV (comma-separated values) format and shipped to SDSS to be added to its database. Using a database is a change from the usual way astronomical data is managed.
"The whole idea of putting them into databases is first of all to ensure the integrity of the data, be able to back out changes and things like that," Thakar explains. "The other big thing is to provide fast access to the data."
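The workflow described above, in which CSV files are loaded into a relational database that protects data integrity and answers queries quickly, can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python's built-in SQLite module rather than the SQL Server installation SDSS runs, and the table name, column names and sample values are all invented for the example.

```python
import csv
import io
import sqlite3

# Hypothetical CSV extract. Column names and values are invented for
# illustration and are not taken from the real SDSS schema.
SAMPLE_CSV = """obj_id,ra_deg,dec_deg,obj_type,r_mag
1,180.01,0.52,galaxy,17.3
2,180.07,0.49,star,15.1
3,181.20,-0.10,quasar,19.0
4,182.45,1.02,galaxy,18.6
"""


def load_catalog(conn, csv_text):
    """Create the table if needed and load the CSV rows in one transaction."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS objects ("
        "obj_id INTEGER PRIMARY KEY, ra_deg REAL, dec_deg REAL, "
        "obj_type TEXT, r_mag REAL)"
    )
    rows = [
        (int(r["obj_id"]), float(r["ra_deg"]), float(r["dec_deg"]),
         r["obj_type"], float(r["r_mag"]))
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    # The connection context manager commits on success and rolls back
    # if any statement fails, so a bad batch never leaves partial data.
    with conn:
        conn.executemany("INSERT INTO objects VALUES (?, ?, ?, ?, ?)", rows)
        # An index lets lookups by type and brightness avoid a full scan.
        conn.execute(
            "CREATE INDEX IF NOT EXISTS idx_type_mag "
            "ON objects (obj_type, r_mag)"
        )
    return len(rows)


def find_objects(conn, obj_type, max_mag):
    """Return objects of one type brighter than max_mag (smaller is brighter)."""
    cur = conn.execute(
        "SELECT obj_id, ra_deg, dec_deg, r_mag FROM objects "
        "WHERE obj_type = ? AND r_mag < ? ORDER BY r_mag",
        (obj_type, max_mag),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
loaded = load_catalog(conn, SAMPLE_CSV)
bright_galaxies = find_objects(conn, "galaxy", 18.0)
print(loaded, bright_galaxies)
```

Loading the same file a second time would violate the primary key, and the whole batch would be rolled back. That is a small-scale version of being able to "back out changes."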
Moving past FITS
Normally the data from a telescope is recorded in FITS (Flexible Image Transport System) files, a binary transport mechanism that is used extensively for astronomy data. While this is adequate for small batches of information, FITS is too cumbersome for rapid access to the hundreds of millions of records that will eventually reside in the SDSS data store.
"In order for you to search for objects that were of interest for your research would take hours, maybe days," Thakar continues.
SDSS started out using an object-oriented database (OODB), but that didn't meet the performance requirements, so it decided to switch to a relational database.
Jim Gray, a "distinguished engineer" in Microsoft's Scalable Servers Research Group and manager of the company's Bay Area Research Center in San Francisco, California, helped SDSS set up on Microsoft's SQL Server 2000. The database resides on a series of off-the-shelf RAID 0/5 arrays with a total cost of under $10,000. The SQL database came online with the Early Data Release in June 2001. Initially the SQL Server was just for public access, while scientists would continue to use the OODB.
But that didn't last for long.
"In the first six months, the SQL database stole the show," says Thakar. "It was so much faster and easier to use that many of the scientists started using it too."
As a result, everything was moved over to SQL Server.
The SkyServer site offers visitors several options for getting data, depending on their level of expertise. There are form-based queries that anyone can use. Hard-core users can run SQL queries, or submit a batch file and come back later to view the results. Users can download their results in text, CSV or XML formats. Visitors can also use a graphical interface to locate an area of the sky, zoom in and click on a particular object to find out its properties.
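To make the download options concrete, here is a rough sketch of how one set of query results could be written out as both CSV and XML. It illustrates the general idea only and is not SkyServer's actual code or output format. The column names, sample rows and XML element names are all assumptions made up for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical query result: column names and rows are invented.
COLUMNS = ["obj_id", "ra_deg", "dec_deg", "r_mag"]
ROWS = [(1, 180.01, 0.52, 17.3), (4, 182.45, 1.02, 18.6)]


def to_csv(columns, rows):
    """Serialize the result set as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buf.getvalue()


def to_xml(columns, rows):
    """Serialize the result set as XML, one <object> element per row."""
    root = ET.Element("results")
    for row in rows:
        elem = ET.SubElement(root, "object")
        for name, value in zip(columns, row):
            ET.SubElement(elem, name).text = str(value)
    return ET.tostring(root, encoding="unicode")


csv_text = to_csv(COLUMNS, ROWS)
xml_text = to_xml(COLUMNS, ROWS)
print(csv_text)
print(xml_text)
```

The same rows feed both writers, which is why a site can offer several download formats from a single query.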
So far, over 200 papers have been published based on data from the SDSS. And there are many more to come as its use speeds up the research process.
"Being able to pose questions in a few hours and get answers in a few minutes changes the way one views the data: you can experiment interactively," Microsoft's Jim Gray and Johns Hopkins University astronomy professor Alex Szalay wrote in their paper "The World-Wide Telescope, an Archetype for Online Science." "When queries take three days and hundreds of lines of code, one asks many fewer questions and so gets fewer answers."