Large Scalability - Papers and implementations
In recent years, the Googles and Amazons of this world have released papers on how to scale computing and processing to
terabytes of data. These publications have led to the implementation of various open source projects that build on
that knowledge. However, mapping these open source projects to the original papers, and identifying the tasks each
project solves, is not always easy.
With no guarantee of completeness, this list provides a short mapping from open source project to publication.
There are further overviews available online, as well as a set of slides from the NOSQL debrief.
Publication or concept | Open source projects | Purpose and further reading |
Map Reduce | Hadoop Core Map Reduce | Distributed programming on rails, 5 Hadoop questions, 10 Map Reduce Tips |
GFS | HDFS (Hadoop File System) | Distributed file system for unstructured data |
Bigtable | HBase, Hypertable | Distributed storage for structured data, When to use HBase |
Chubby | ZooKeeper | Distributed lock and naming service |
Sawzall | Pig, Cascading, JAQL, Hive | Higher-level language for writing map reduce jobs |
Protocol Buffers | Protocol Buffers, Thrift, Avro, more traditional: Hessian, Java serialization | Data serialization, early benchmarks |
Some NoSQL storage solutions | CouchDB, MongoDB | CouchDB: document database |
Dynamo | Dynomite, Voldemort, Cassandra | Distributed key-value stores |
Index | Lucene | Search index |
Index distribution | Katta, Solr, Nutch | Distributed Lucene indexes |
Crawling | Nutch, Heritrix, Droids, Grub, Aperture | Crawling linked pages |
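
As a rough illustration of the programming model behind the Map Reduce row, here is a minimal sketch in plain Python. It does not use Hadoop's API. The function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical and chosen only for this example. Everything runs in a single process, whereas a real framework would distribute each phase across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on a list of strings (illustrative only)."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["to be or not to be", "to scale or not"]))
```

The map and reduce functions never share state. Under that assumption a framework can run many copies of each in parallel and only has to coordinate the grouping step in between.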