There's a very interesting preprint of a paper from Google Research due to be presented at OSDI '06 in November. It's Bigtable: A Distributed Storage System for Structured Data by Chang et al. (PDF here).
Bigtable is a database that runs on top of the Google File System. I first heard of it in the May '06 interview of Google's Jeff Dean by O'Reilly's Radar. Like the databases supporting other major Internet applications, it's not a straight relational database. This paper gives the details for those of you interested in such things.
It also includes the following gee-whiz items:
- The Google web crawl creates two tables totaling 850 terabytes of data.
- Google Analytics apparently keeps all clicks, forever. It's currently using two tables totaling 220 terabytes.
Hmm... if they really keep everything forever, this could end up larger than their web crawl.
- The maps for Google Earth occupy a mere 70 terabytes.