Large Scalability - Papers and implementations

In recent years the Googles and Amazons of this world have released papers on how to scale computing and processing to terabytes of data. These publications have led to various open source projects that build on that knowledge. However, mapping these projects to the original papers, and to the tasks each project solves, is not always easy.

With no guarantee of completeness, this list provides a short mapping from open source project to publication.

There are further overviews available online as well as a set of slides from the NOSQL debrief.

Publication / topic	Open source projects	Purpose and further reading
Map Reduce	Hadoop Core Map Reduce	Distributed programming on rails, 5 Hadoop questions, 10 Map Reduce Tips
GFS	HDFS (Hadoop File System)	Distributed file system for unstructured data
Bigtable	HBase, Hypertable	Distributed storage for structured data, When to use HBase.
Chubby	Zookeeper	Distributed lock- and naming service
Sawzall	PIG, Cascading, JAQL, Hive	Higher level language for writing map reduce jobs
Protocol Buffers	Protocol Buffers, Thrift, Avro, more traditional: Hessian, Java serialization	Data serialization, early benchmarks
Some NoSQL storage solutions	CouchDB, MongoDB	CouchDB: document database
Dynamo	Dynomite, Voldemort, Cassandra	Distributed key-value stores
Index	Lucene	Search index
Index distribution	katta, Solr, nutch	Distributed Lucene indexes
Crawling	nutch, Heritrix, droids, Grub, Aperture	Crawling linked pages
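
To make the Map Reduce row more concrete, here is a minimal single-process sketch of the programming model in Python. It is not Hadoop code and uses no Hadoop API. The function names (map_phase, shuffle, reduce_phase, word_count) and the sample sentences are assumptions made up for illustration. A real framework would run the map and reduce steps on many machines and do the grouping step itself.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key; a framework does this between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence, all in local memory."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical sample input, one string per "document".
counts = word_count([
    "Hadoop stores data in HDFS",
    "HBase stores data in HDFS too",
])
```

The user only writes the map and reduce functions. Splitting the input, grouping by key and recovering from failures are left to the framework, which is probably what the "on rails" in the table refers to.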