23 December 2011

Does Not Compute

Title: How Will Astronomy Archives Survive the Data Tsunami?
Authors: G.B. Berriman & S.L. Groom

Astronomy abounds with data. Currently archived data ranges from the products of large telescope projects to the personal collections astronomers have accumulated over the ages during long nights at ground-based telescopes. Storing and accessing this data has so far been handled fairly successfully, but the quantity of data is growing rapidly (~0.5 PB of new data per year) and is set to explode in the near future (60 PB of total archived data by 2020; at the current rate we'd gain only a few more PB by then, so the growth rate itself must accelerate dramatically). With this rapid increase in data comes the need for more storage space as well as more bandwidth to support larger and more numerous database queries and downloads.
Figure 1: Growth of data downloads and data queries at IRSA from 2005 to 2011

The primary focus of this paper is how we are to address data archival issues from both the server-side and client-side perspectives. LSST, ALMA, and the SKA are projected to generate petabytes upon petabytes of data, and it is quite possible that our current infrastructure is insufficient to support the needs of those observatories. For instance, even if a database is large enough to store and archive the data, searching it for a particular image or set of images would, without smart indexing, require parsing through all of the data. We might also consider the bandwidth strain when many people download data remotely: for LSST, with a projected image file size of 10 GB, shipping whole images to every user quickly becomes intractable, and data rates would suffer greatly.
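To make that indexing worry concrete, here is a little sketch in Python (the toy catalog, coarse 1-degree binning, and function names are all invented for illustration; the paper does not prescribe any particular scheme). A brute-force search touches every row of the archive for every query, while even a crude spatial index reads only the cells that overlap the query box:

    import math
    import random
    from collections import defaultdict

    # Toy catalog of (ra, dec) positions in degrees. A real archive would
    # hold billions of rows; we fake a small uniform sample.
    random.seed(42)
    catalog = [(random.uniform(0.0, 360.0), random.uniform(-90.0, 90.0))
               for _ in range(100_000)]

    def scan_search(cat, ra0, dec0, r):
        # Brute force: every query parses the entire archive.
        return [(ra, dec) for ra, dec in cat
                if abs(ra - ra0) < r and abs(dec - dec0) < r]

    # Build a coarse spatial index once: bin sources into 1-degree cells.
    # (For simplicity this ignores RA wraparound and spherical geometry.)
    index = defaultdict(list)
    for ra, dec in catalog:
        index[(math.floor(ra), math.floor(dec))].append((ra, dec))

    def indexed_search(idx, ra0, dec0, r):
        # Inspect only the handful of cells overlapping the query box.
        hits = []
        for i in range(math.floor(ra0 - r), math.floor(ra0 + r) + 1):
            for j in range(math.floor(dec0 - r), math.floor(dec0 + r) + 1):
                hits.extend(p for p in idx.get((i, j), ())
                            if abs(p[0] - ra0) < r and abs(p[1] - dec0) < r)
        return hits

    # Same answer either way, but the indexed version touches a tiny
    # fraction of the data -- the difference between a full-archive scan
    # and a targeted read at petabyte scale.
    assert sorted(scan_search(catalog, 180.0, 0.0, 0.5)) == \
           sorted(indexed_search(index, 180.0, 0.0, 0.5))

Real archives use proper spherical indexing schemes such as HEALPix or HTM rather than this naive box binning, but the payoff is the same: query cost scales with the size of the answer rather than the size of the archive.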

Berriman and Groom study potential issues that are already cropping up in smaller archival data sets and attempt to provide paths forward. These paths include developing innovative ways to memory-map the stored data in order to lessen the load on the server and allow more rapid data discovery, providing server-side reduction and analysis procedures, utilizing cloud computing to outsource the data storage problem, and implementing GPU computing. The technologies to make all of these paths immediately beneficial are not yet in place, meaning the astronomical community should be looking to promote and partner with cyberinfrastructure initiatives and also to educate its own members so they may contribute more effectively to overall computational efficiency.
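To give a flavor of what memory-mapping buys a server, here is a minimal sketch using NumPy (the file name, image dimensions, and cutout request are all invented stand-ins; the authors don't commit to a specific implementation). np.memmap exposes a file on disk as an array, so only the pages a request actually touches ever get read:

    import numpy as np

    # Stand-in for a single huge archived exposure already on disk.
    # (We write a modest 64 MB file so the sketch actually runs; an
    # LSST-scale image would be ~10 GB.)
    shape = (4096, 4096)
    np.arange(shape[0] * shape[1], dtype=np.float32).reshape(shape) \
        .tofile("exposure.dat")  # hypothetical archive file

    # Memory-map the file: nothing is read until a slice is touched, so
    # the server never loads the full image to answer a small request.
    mosaic = np.memmap("exposure.dat", dtype=np.float32, mode="r",
                       shape=shape)

    # Serve a 100x100 pixel cutout around a requested position; only the
    # rows containing these pixels are paged in from disk.
    y, x = 2048, 2048
    cutout = np.array(mosaic[y - 50:y + 50, x - 50:x + 50])  # copy the view

    print(cutout.shape, float(cutout.mean()))

The same logic drives the server-side reduction idea: ship the user the 100x100 cutout they asked for, not the whole multi-gigabyte exposure.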

This last suggestion from the authors is what stirred up the most conversation. In particular, the authors recommend that all graduate students in astronomy be required to take a long list of computational courses (e.g., software engineering). A quick tally of their learning requirements shows that a typical graduate student would need to take an additional 3-6 courses, or roughly an extra year of coursework. While adding an extra year of graduate school doesn't seem very attractive, it was suggested that summer workshops would be an extremely helpful alternative. A 1-2 week program could provide an intensive introduction to many of the highlighted skills astronomers might soon be expected to have (parallel programming, scripting, software development, database technology). One comment even threw out the idea that Dartmouth hold such a school - quite possible, so keep an eye out!

What do you think about the future of computing in astronomy? Do we need to up the computational coursework for students or just hire computer scientists? Are there any tools or technologies you believe might be beneficial for astronomers to implement?