Archive

Posts Tagged ‘hbase’

Cloudera in Berlin

November 14th, 2011 at 8:24pm

Cloudera is hosting another round of trainings in Berlin in November this year. In addition to the trainings on Apache Hadoop this time around there will also be trainings on Apache HBase.

Register online via:

Get Together ,

FOSDEM - HBase at Facebook Messaging

February 17th, 2011 at 8:17pm

Nicolas Spiegelberg gave an awesome introduction not only to the architecture that powers Facebook messaging but also to the design decisions behind their use of Apache HBase as a storage backend. Disclaimer: HBase is being used for message storage, for attachements with Haystack a different backend is used.

The reasons to go for HBase include its strong consistency model, support for auto failover, load balancing of shards, support for compression, atomic read-modify-write support and the inherent Map/Reduce support.

When going from MySQL to HBase some technological problems had to be solved: Coming from MySQL basically all data was normalised - so in an ideal world, migration would have involved one large join to port all the data over to HBase. As this is not feasable in a production environment instead what was done was to load all data into an intermediary HBase table, join the data via Map/Reduce and import all into the target HBase instance. The whole setup was run in a dark launch - being fed with parallel life traffic for performance optimisation and measurement.

The goald was zero data loss in HBase - which meant using the Apache Hadoop append branch of HDFS. The re-designed the HBase master in the process to avoid having a single point of failure, backup masters are handled by zookeeper. Lots of bug fixes went back from Facebooks engineers to the HBase code base. In addition for stability reason rolling restarts were added for upgrades, performance improvements, consistency checks.

The Apache HBase community received lots of love from Facebook for their willingness to work together with the Facebook team on better stability and performance. Work on improvements was shared between teams in an amazing open and inclusive model to development.

One additional hint: FOSDEM videos of all talks including this one have been put online in the meantime.

General , , , ,

CFP - Berlin Buzzwords 2011 - search, score, scale

January 26th, 2011 at 8:00am

This is to announce the Berlin Buzzwords 2011. The second edition of the successful conference on scalable and open search, data processing and data storage in Germany,
taking place in Berlin.


Call for Presentations Berlin Buzzwords

http://berlinbuzzwords.de

Berlin Buzzwords 2011 - Search, Store, Scale

6/7 June 2011

The event will comprise presentations on scalable data processing. We invite you to submit talks on the topics:

  • IR / Search - Lucene, Solr, katta or comparable solutions
  • NoSQL - like CouchDB, MongoDB, Jackrabbit, HBase and others
  • Hadoop - Hadoop itself, MapReduce, Cascading or Pig and relatives

Closely related topics not explicitly listed above are welcome. We are looking for presentations on the implementation of the systems themselves, real world applications and case studies.

Important Dates (all dates in GMT +2)

  • Submission deadline: March 1st 2011, 23:59 MEZ
  • Notification of accepted speakers: March 22th, 2011, MEZ.
  • Publication of final schedule: April 5th, 2011.
  • Conference: June 6/7. 2011

High quality, technical submissions are called for, ranging from principles to practice. We are looking for real world use cases, background on the architecture of specific projects and a deep dive into architectures built on top of e.g. Hadoop clusters.

Proposals should be submitted at http://berlinbuzzwords.de/content/cfp-0 no later than March 1st, 2011. Acceptance notifications will be sent out soon after the submission deadline. Please include your name, bio and email, the title of the talk, a brief abstract in English language. Please indicate whether you want to give a lightning (10min), short (20min) or long (40min) presentation and indicate the level of experience with the topic your audience should have (e.g. whether your talk will be suitable for newbies or is targeted for experienced users.) If you’d like to pitch your brand new product in your talk, please let us know as well - there will be extra space for presenting new ideas, awesome products and great new projects.

The presentation format is short. We will be enforcing the schedule rigorously.

If you are interested in sponsoring the event (e.g. we would be happy to provide videos after the event, free drinks for attendees as well as an after-show party), please contact us.

Follow @berlinbuzzwords on Twitter for updates. News on the conference will be published on our website at http://berlinbuzzwords.de.

Program Chairs: Isabel Drost, Jan Lehnardt, and Simon Willnauer.

Schedule and further updates on the event will be published on http://berlinbuzzwords.de Please re-distribute this CfP to people who might be interested.

Contact us at:

newthinking communications GmbH
Schönhauser Allee 6/7
10119 Berlin, Germany
Julia Gemählich
Isabel Drost
+49(0)30-9210 596

Berlin Buzzwords , , , , ,

Devoxx – Day 2 HBase

December 9th, 2010 at 9:25pm

Devoxx featured several interesting case studies of how HBase and Hadoop can be used to scale data analysis back ends as well as data serving front ends.

Twitter

Dmitry Ryaboy from Twitter explained how to scale high load and large data systems using Cassandra. Looking at the sheer amount of tweets generated each day it becomes obvious that with a system like MySQL alone this site cannot be run.

Twitter has released several of their internal tools under a free software license for others to re-use – some of them being rather straight forward, others more involved. At Twitter each Tweet is annotated by a user_id, a time stamp (ok if skewed by a few minutes) as well as a unique tweet_id. In order to come up with a solution for generating the latter one they built a library called snowflake. Though rather simple algorithm even works in a cross data-centre set-up: The first bits are composed of the current time stamp, the following bits encode the data-centre, after that there is room for a counter. The tweet_ids are globally ordered by time and distinct across data-centres without the need for global synchronisation.

With gizzard Twitter released a rather general sharding implementation that is used internally to run distributed versions of Lucene, MySQL as well as Redis (to be introduced for caching tweet timelines due to its explicit support for lists as data structures for values that are not available in memcached).

FlockDB for large scale social graph storage and analysis. Rainbird for time series analysis, though with OpenTSDB there is something comparable available for HBase. Haplocheirus for message vector caching (currently based on memcached, soon to be migrated to Redis for its richer data structures). The queries available through the front-end are rather limited thus making it easy to provide pre-computed, optimised version in the back-end. As with the caching problem a tradeoff between hit rate on the pool of pre-computed items vs. storage cost can be made based on the observed query distribution.

In the back-end of Twitter various statistical and data mining analysis are run on top of Hadoop HBase To compute potentially interesting followers for users, to extract potentially interesting products etc.
The final take-home message here: Go from requirements to final solution. In the space of storage systems there is not such thing as a silver bullet. Instead you have to carefully evaluate features and properties of each solutions as your data and load increase.

Facebook

When implementing Facebook Messaging (a new feature that was announced this week) Facebook decided to go for HBase instead of Cassandra. The requirements of the feature included massive scale, long-tail write access to the database (which more or less ruled out MySQL and comparable solutions) and a need for strict ordering of messages (which ruled out any eventually consistent system. The decision was made to use HBase.

A team of 15 developers (including operations and frontend) was working on the system for one year before it was finally released. The feature supports for integration of facebook messaging, IM, SMS and mail into one single system making it possible to group all messages by conversation no matter which device was used to send the message originally. That way each user’s inbox turns into a social inbox.

Adobe

Cosmin Lehene presented four use cases of Hadoop at Adobe. The first one dealt with creating and evaluating profiles of the Adobe Media Player. Users would be associated with a vector giving more information on what types of genre the meda they consumed belonged to. These vectors would then be used to generate recommendations for additional content to view in order to increase consumption rate. Adobe built a clustering system that would interface Mahout’s canopy- and k-means implementations with their HBase backend for user grouping. Thanks Cosmin for including that information in your presentation!

A second use case focussed on finding out more on the usage of flash on the internet. Using Google to search for flash content was no good as only the first 2000 results could be viewed thus resulting in a highly skewed sample. Instead they used a mixture of nutch and HBase for storage to retrieve the content. Analysis was done with respect to various features of flash movies, such as frame rates. The analysis revealed a large gap between the perceived typical usage and the actual usage of flash on the internet.

The third use case involves analysis of images and usage patterns on the Photoshop-in-a-browser edition of Photoshop.com. The forth use case dealt with scaling the infrastructure that powers businesscatalyst – a turn-key online business platform solution including analysis, campaigning and more. When purchased by Adobe the system was very successful business-wise. However the infrastructure was by no means able to put up with the load it had to accommodate. Changing to a back-end based on HBase led to better performance, faster report generation.

General, Hacking, Mahout , , , , , ,

Devoxx – Day two – Hadoop and HBase

December 8th, 2010 at 9:24pm

In his session on the current state of Hadoop Tom went into a little more detail not only on the features released in the latest release or on the roadmap for upcoming releases (including Kerberos based security, append support, warm standby namenode and others).
He also gave a very interesting view on the current Hadoop ecosystem. More and more projects are currently being created that either extend Hadoop or are built on top of Hadoop. Several of these are being run as projects at the Apache Software Foundation, however some are available outside of Apache only. Using graphviz he created a graph of projects depending on or extending Hadoop and from that provided a rough classification of these projects.

As to be expected HDFS and Map/Reduce are part of the very basis of this ecosystem. Right next to them sits zookeeper, a distributed coordination and looking service.

Storage systems extending the capabilities of HDFS include HBase that adds random read/write as well as realtime access to the otherwise batch-oriented distributed file-system. With PIG and Hive and Cascading three projects are making it easier to formulate complex queries for Hadoop. Among the three, PIG is mainly focussed on expressing data filtering and processing, with SQL support being added over time as well. Hive came from the need for SQL formulation on Hadoop clusters. Cascading goes a slightly different way, providing a Java API for easier query formulation. The new kid on the block sort of is Plume, a project initiated by Ted Dunning that has the goal of coming up with a Map/Reduce abstraction layer inspired by Google’s Flume Java publication.

There are several projects for data import into HDFS. Sqoop can be used for interfacing with RDMBS. Chukwa and Flume deals with feeding log data into the filesystem. For general co-ordination and workflow orchestration there is the release of Oozie, originally developed at Yahoo! as well as support for workflow definition in Cascading.

When storing data in Hadoop it is a common requirement to find a compact, structured representation of the data to store. Though human readable, xml files are not very compact. However when using any binary format, schema evolution commonly is a problem: Adding, renaming or deleting fields in most cases causes the need to upgrade all code interacting with the data as well as re-formatting already stored data. With Thrift, Avro and Protocol Buffers there are three options available for storing data in a compact, structured binary format. All three projects come with support for schema evolution by providing users no only to deal with missing data but also by providing a means to map old to new fields and vice versa.

General , ,

Devoxx – University – Cassandra, HBase

December 6th, 2010 at 9:20pm

During the morning session FIXME Ellison gave an introduction to the distributed NoSQL database Cassandra. Being generally based on the Dynamo paper from Amazon the key-value store distributes key/value pairs according to a consistent hashing schema. Nodes can be added dynamically making the system well suited for elastic scaling. In contrast to Dynamo, Cassandra can be tuned for the required consistency level. The system is tuned for storing moderately sized key/value pairs. Large blobs of data should not be stored into it. A recent addition to Cassandra has been the integration with Hadoop’s Map/Reduce implementation for data processing. In addition Cassandra comes with a Thrift interface as well as higher level interfaces such as Hector.

In the afternoon Michael Stack and Jonathan Grey gave an overview of HBase covering basic installation, going into more detail concerning the APIs. Most interesting to me was the HBase architecture and optimisation details. The systems is inspired by Google’s BigTable paper. It uses Apache HDFS as storage back-end inheriting the failure resilience of HDFS. The system uses Zookeeper for co-ordination and meta-data storage. However fear not, Zookeeper comes packaged with HBase, so in case you have not yet setup your own Zookeeper installation, HBase will do that for you on installation.

HBase is split into a master server for holding meta-data and region servers for storing the actual data. When storing data HBase optimises storage for data locality. This is done in two ways: On write the first copy usually goes to the local disk of the client, so even when storage exceeds the size of one block and remote copies get replicated to differing nodes, at least the first copy gets stored on one machine only. During a compaction phase that is scheduled regularly (usually every 24h, jitter time can be added to lower the load on the cluster) data is re-shuffled for optimal layout. When running Map/Reduce jobs against HBase this data locality can be easily exploited. HBase comes with its own input and output formats for Hadoop jobs.

HBase comes not only with Map/Reduce integration, it also publishes a Thrift interface, a REST interface and can be queried from an HBase shell.

General , , ,

Devoxx University – Productive programmer, HBase

December 4th, 2010 at 9:17pm

The first day at Devoxx featured several tutorials – most interesting to me was the pragramatic programmer. The speaker also is the author of the equally named book at O’Reilly. The book was the result of the observation that developers today are more and more IDE bound, no longer able to use the command line effectively. The result are developers that are unnecessarily slow when creating software. The goal was to bring usage patterns of productive software development to juniors how grew up in a GUI only environment. However half-way through the book, it became apparent that a book on command line wizardry only is barely interesting at all. So the focus was shifted and now includes more general productivity patterns.
The goal was to accelerate development – mostly by avoiding time consuming usage patterns (minimise mouse usage) and automation of repetitive tasks (computers are good at doing dull, repetitive tasks – that’s what they are made for.
Second goal was increasing focus. Two main ingredients to that are switching off anything that disturbs the development flow: No more pop-ups, not more mail notifications, no more flashing side windows. If you have ever had the effect of thinking “So late already?” when your colleagues were going out for lunch – then you know what is meant by being in the flow. It takes up to 20min to get into this mode – but just the fraction of a second to be thrown out. With developers being significantly more productive in this state it makes sense to reduce the risk of being thrown out.
Third goal was about canonicality, fourth one on automation.
During the morning I hopped on and off the Hadoop talk as well – the tutorial was great to get into the system, Tom White went into detail also explaining several of the most common advanced patterns. Of course not that much new stuff if you sort-of know the system already :)

General, Hacking , , ,

Apache Con – Hadoop, HBase, Httpd

November 24th, 2010 at 11:19pm

The first Apache Con day featured several presentations on NoSQL databases (track sponsored by Day software), a Hadoop track as well as presentations on Httpd and an Open source business track.

Since its inception Hadoop always was intended to be run in trusted environments firewalled from hostile users or even attackers. As such it never really supported any security features. This is about the change with the new Hadoop release including better Kerberos based security.

When creating files in Hadoop a long awaited feature was append support. Basically up to now writing to Hadoop was a one-of job: Open a file, write your data, close it and be done. Re-opening and appending data was not possible. This situation is especially bad for HBase as its design relies on being able to append data to an existing file. There have been efforts for adding append support to HDFS earlier as well as an integration of such patches by third party vendors. However only with a current Hadoop version Append is officially supported by HDFS.

A very interesting use case-wise of the Hadoop stack was presented by $name-here from $name. They are using a Hadoop cluster to provide a repository of code released under free software licenses. The business case is to enable corporations to check their source code against existing code and spot license infringements. This does not only include linking to free software under incompatible licenses but also developers copying pieces of free code, e.g. copying entire classes or functions into internal projects that originally were available only under copyleft licenses.

The speaker went into some detail explaining the initial problems they had run into: Obviously it’s no good idea to mix and match Hadoop and HBase versions freely. Instead it is best practice to use only versions claimed to be compatible by the developers. Another common mistake is to leave parameters of both projects at their defaults. The default parameters are supposed to be fool-proof. However they are optimised to work well for Hadoop newbies who want to try out the system on a single node cluster and in a distributed setting obviously need more attention. Other anti-patterns include storing only tiny little files in the cluster thus quickly running out of memory on the namenode (that stores all file information including block mappings in main memory for faster access).

In the NoSQL track Jonathan Grey from Facebook gave a very interesting overview on the current state of HBase. Turns out that Facebook would announce only a few days after that their internal use of HBase for the newly launched feature of Facebook messaging.

HBase has adopted a release cycle including development/ production releases to get their systems into interested users’ hands more quickly: Users willing to try out new experimental features can use the development releases of HBase. Those who are not should go for the stable releases.

After focussing on improving performance in the past months the project is currently focussing on stability: Data loss is to be avoided by all means. Zookeeper is to be integrated more tightly for storing configuration information thus enabling live reconfiguration (at least to some extend). In addition also HBase is targeting to integrate stored procedures like behaviour: As explained in Googles Percolator paper $LINK batch oriented processing get’s you only so far. If data that gets added constantly it makes sense to give up on some of the throughput batch-based systems provide and instead optimise for shorter processing cycles by implementing event triggered processing.

On recommendation of one of neofonie’s sys-ops I visited some of the httpd talks: First Rich Bowen gave an overview of unusual tasks one can solve with httpd. The list included things like automatically re-writing http response content to match your application. There is even a spell checker for request URLs: Given marketing has given your flyer to the press with a typo in the url, chances are that the spellchecking module can fix these automatically for each request: Common mistakes covered are switched letters, numbers replaced by letters etc. The performance cost has to be paid only in case no hit could be found – so instead of returning a 404 right away the server first tries to find the document by taking into account common mis-spellings.

Apache Con , , , ,

Apache Mahout @ Devoxx Tools in Action Track

November 1st, 2010 at 9:32am

This year’s Devoxx will feature several presentations coming from the Apache Hadoop ecosystem including Tom White on the basics of Hadoop: HDFS, MapReduce, Hive and Pig as well as Michael Stack on HBase.

In addition there will be a brief Tools in Action presentation on Monday evening featuring Apache Mahout.

Please let me know if you are going to Devoxx - would be great to meet some more Apache people there, maybe have dinner at one of the conference days.

General, Mahout, Software Foundation , , , , ,

Apache Hadoop in Debian Squeeze

July 17th, 2010 at 12:04pm

After using Mandrake for quite a while (still blaming my boyfriend Thilo for infecting not only my computer but also myself first with that system, then with the more general idea of Free Software - but that’s another story.) after finishing my master’s thesis I started using GNU Debian Linux (back then in the version code-named Woody). Since I always had a GNU Debian on my private box as my main operating system - even installed it on my MacBook following the steps in the Debian Wiki.

As I am also an Apache Mahout committer, closely related to the Apache Hadoop project, I always found it kind of sad that there were no Hadoop packages in the official Debian repositories. I tried multiple times to find some time to get into Debian packaging myself, I learned what “debian/rules” is all about and discovered some of the intricacies of packaging Java based software. However I have to admit that I never was able to find enough time to really finish that task.

A few weeks before this year’s FOSDEM I learned on the Apache Hadoop as well as on the Debian Java lists that a guy called Thomas Koch was working on solving bug 535861 - ITP to package Hadoop. We met at FOSDEM where I tried to raise some attention in the audience for Thomas’ plans (back then he was in need for help with a few last missing pieces). In addition I invited him for Berlin Buzzwords to get in touch with other Hadoop developers and users for further input.

I am really happy that by now Hadoop has made it into the official Debian package repositories - as soon as Debian Squeezeapt-get install [Hadoop component you need]: Debian package search.

Squeeze

If you want to speed up the process of Squeeze being released as stable version: Help fixing the remaining bugs in that distribution. There are various Debian Bug Squashing Parties being organised around the world. Next one in Berlin is on next Monday, the one for Munich is running this weekend. Just got the information that Fefe posted in his blog a link to the Mozilla bug bounty:

The packages are based on the upstream Apache Hadoop distribution, being comparably new they are intended for development machines at the moment. If you are using Debian and want to work with Hadoop - this is a great opportunity to help making the packages more stable by simply using them and reporting your experiences back to the Debian community.

In addition Debian now also provides packages for Zookeeper as well as HBase - though the HBase version is not yet production ready as the HDFS-append patch is still missing.

To follow the general state and progress of these packages feel free to follow the packages pages for Hadoop, HBase, Zookeeper respectively.

Thomas currently plans to work more closely with upstream e.g. to tidy up the chaos in the start-up scripts and other minor glitches. So watch out for further improvements.

In addition I just saw another interesting ITP in the Debian bugtracker: Wishlist: katta. I am sure there are quite a few others as well.

Hadoop , , ,