Notebook - OSS office at Adobe

2017-05-12 06:46
tl;dr: This post summarises what I learnt at the Adobe Open Source Summit last week about which aspects to think of when running an open source office. It's mainly a mental note for myself - hopefully others will find it useful as well.

Longer version:

This is another post in a series of articles on "random stuff I learnt in Basel last week". When I was invited to Adobe's open source summit I had no idea what to expect from it. In retrospect I'm glad I accepted the invitation - it was a great mix of talks, giving valuable insight into how others are adopting open source processes, what motivates them to open source code, and why they help out upstream.

The first in a series of talks gave an overview of Adobe's open source office. Currently the expertise collected there includes one human with a software development background, one human with a marketing background and one human with a program management background - showing the breadth of topics faced when interacting with open source projects.

The self-set goal of these people is to make contributing to open source as easy as possible. That includes making people internally comfortable with the way many open source projects work. It means making internal review processes (e.g. legal, branding) for contributions as easy as possible. It means supporting teams with checking contributions for license conformance and for sensitive (e.g. personal) information that shouldn't be shared publicly. It also means supporting teams that want to run their own open source projects.

One idea that I found very intriguing was to run their team through a public github repository, in an eat-your-own-dogfood kind of way.

One of the artifacts maintained in that repository is a launch checklist that people can follow, including reminders for creating a blog post, publishing on the website and sharing information on social media. Most likely those are aspects that are often forgotten even in established projects when it comes to releasing and promoting a new software version.

Another aspect I found interesting was the creation of an OSS advisory board - a wider group of people interested in open development and open source, tasked with figuring out how to best approach open development and how to move the organisation away from a private-by-default strategy towards an open-by-default way of collaborating across team boundaries.

Many of these topics fit well with what was shared earlier this year in the Linux Foundation webcast on how to go about creating an open source office in organisations.

Devoxx – Day 2: HBase

2010-12-09 21:25
Devoxx featured several interesting case studies of how HBase and Hadoop can be used to scale data analysis back ends as well as data serving front ends.

Twitter

Dmitry Ryaboy from Twitter explained how to scale systems with high load and large amounts of data using Cassandra. Looking at the sheer number of tweets generated each day, it becomes obvious that a site like this cannot be run on MySQL alone.

Twitter has released several of their internal tools under a free software license for others to re-use - some of them rather straightforward, others more involved. At Twitter each tweet is annotated with a user_id, a time stamp (ok if skewed by a few minutes) as well as a unique tweet_id. To generate the latter they built a library called snowflake. Though rather simple, the algorithm even works in a cross data-centre set-up: the first bits are composed of the current time stamp, the following bits encode the data-centre, and after that there is room for a counter. The resulting tweet_ids are globally ordered by time and distinct across data-centres without the need for global synchronisation.
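A minimal sketch of such an id generator in Java could look as follows. The bit widths, the custom epoch and the class name are assumptions for illustration, not snowflake's actual parameters:

    // Sketch of a snowflake-style id generator. Bit widths and the
    // custom epoch below are illustrative assumptions.
    public class SnowflakeSketch {
        private static final long EPOCH = 1288834974657L; // arbitrary custom epoch
        private static final int DATACENTER_BITS = 10;
        private static final int SEQUENCE_BITS = 12;
        private static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1;

        private final long datacenterId;
        private long lastTimestamp = -1L;
        private long sequence = 0L;

        public SnowflakeSketch(long datacenterId) {
            this.datacenterId = datacenterId;
        }

        public synchronized long nextId() {
            long timestamp = System.currentTimeMillis();
            if (timestamp == lastTimestamp) {
                // Same millisecond: bump the counter; if it overflows,
                // busy-wait for the next millisecond.
                sequence = (sequence + 1) & MAX_SEQUENCE;
                if (sequence == 0) {
                    while (timestamp <= lastTimestamp) {
                        timestamp = System.currentTimeMillis();
                    }
                }
            } else {
                sequence = 0L;
            }
            lastTimestamp = timestamp;
            // Timestamp in the high bits keeps ids roughly time-ordered;
            // the data-centre id keeps them distinct across sites.
            return ((timestamp - EPOCH) << (DATACENTER_BITS + SEQUENCE_BITS))
                    | (datacenterId << SEQUENCE_BITS)
                    | sequence;
        }
    }

Because no co-ordination is needed beyond a clock and a per-site id, each data-centre can hand out ids independently.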

With gizzard Twitter released a rather general sharding implementation that is used internally to run distributed versions of Lucene, MySQL as well as Redis (the latter to be introduced for caching tweet timelines due to its explicit support for lists as value data structures, which memcached lacks).
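Conceptually, such a sharding layer boils down to mapping a key to the partition responsible for it. A minimal sketch of that routing step follows - an assumption for illustration, not gizzard's actual API, which additionally handles replication and shard migration:

    import java.util.List;

    // Routes keys (e.g. user_ids) to shards via simple hashing. Real
    // systems like gizzard use a forwarding table instead, so shards
    // can be split or moved without rehashing every key.
    public class ShardRouter<T> {
        private final List<T> shards; // e.g. connections to MySQL instances

        public ShardRouter(List<T> shards) {
            this.shards = shards;
        }

        public T shardFor(long key) {
            int hash = (int) ((key ^ (key >>> 32)) & 0x7fffffff);
            return shards.get(hash % shards.size());
        }
    }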

Further tools include FlockDB for large scale social graph storage and analysis, Rainbird for time series analysis (though with OpenTSDB something comparable is available for HBase), and Haplocheirus for message vector caching (currently based on memcached, soon to be migrated to Redis for its richer data structures). The queries available through the front-end are rather limited, which makes it easy to provide pre-computed, optimised versions in the back-end. As with any caching problem, the tradeoff between hit rate on the pool of pre-computed items and storage cost can be made based on the observed query distribution.

In the back-end of Twitter various statistical and data mining analyses are run on top of Hadoop and HBase to compute potentially interesting followers for users, to extract potentially interesting products, etc.
The final take-home message here: go from requirements to final solution. In the space of storage systems there is no such thing as a silver bullet. Instead you have to carefully evaluate the features and properties of each solution as your data and load increase.

Facebook

When implementing Facebook Messaging (a new feature that was announced this week) Facebook decided to go for HBase instead of Cassandra. The requirements of the feature included massive scale, long-tail write access to the database (which more or less ruled out MySQL and comparable solutions) and a need for strict ordering of messages (which ruled out any eventually consistent system).

A team of 15 developers (including operations and frontend) worked on the system for one year before it was finally released. The feature supports integration of Facebook messaging, IM, SMS and mail into one single system, making it possible to group all messages by conversation no matter which device was used to send a message originally. That way each user's inbox turns into a social inbox.

Adobe

Cosmin Lehene presented four use cases of Hadoop at Adobe. The first one dealt with creating and evaluating profiles of Adobe Media Player users. Users were associated with a vector describing the genres of the media they consumed. These vectors were then used to generate recommendations for additional content to view, in order to increase consumption rate. Adobe built a clustering system that interfaces Mahout's canopy and k-means implementations with their HBase backend for user grouping. Thanks Cosmin for including that information in your presentation!
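To make the grouping step concrete, here is a minimal in-memory k-means sketch over such genre vectors. It is purely illustrative - the actual system ran Mahout's distributed canopy and k-means implementations against HBase rather than anything like this:

    import java.util.Random;

    // Illustrative in-memory k-means over per-user genre vectors.
    public class GenreClustering {
        // Squared Euclidean distance is enough for nearest-centroid comparisons.
        static double distance(double[] a, double[] b) {
            double sum = 0;
            for (int i = 0; i < a.length; i++) {
                double d = a[i] - b[i];
                sum += d * d;
            }
            return sum;
        }

        static int nearest(double[] user, double[][] centroids) {
            int best = 0;
            for (int c = 1; c < centroids.length; c++) {
                if (distance(user, centroids[c]) < distance(user, centroids[best])) {
                    best = c;
                }
            }
            return best;
        }

        // Assign every user to a cluster, then recompute centroids; repeat.
        static int[] cluster(double[][] users, int k, int iterations) {
            Random rnd = new Random(42);
            double[][] centroids = new double[k][];
            for (int c = 0; c < k; c++) {
                centroids[c] = users[rnd.nextInt(users.length)].clone();
            }
            int[] assignment = new int[users.length];
            for (int it = 0; it < iterations; it++) {
                for (int u = 0; u < users.length; u++) {
                    assignment[u] = nearest(users[u], centroids);
                }
                double[][] sums = new double[k][users[0].length];
                int[] counts = new int[k];
                for (int u = 0; u < users.length; u++) {
                    counts[assignment[u]]++;
                    for (int i = 0; i < users[u].length; i++) {
                        sums[assignment[u]][i] += users[u][i];
                    }
                }
                for (int c = 0; c < k; c++) {
                    if (counts[c] > 0) { // keep old centroid for empty clusters
                        for (int i = 0; i < sums[c].length; i++) {
                            centroids[c][i] = sums[c][i] / counts[c];
                        }
                    }
                }
            }
            return assignment;
        }
    }

Users ending up in the same cluster consume similar genres, so content popular within a cluster is a natural recommendation candidate for its other members.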

A second use case focussed on finding out more about the usage of flash on the internet. Using Google to search for flash content was no good, as only the first 2000 results could be viewed, resulting in a highly skewed sample. Instead they used a mixture of Nutch for crawling and HBase for storage to retrieve the content. Analysis was done with respect to various features of flash movies, such as frame rates. The analysis revealed a large gap between the perceived typical usage and the actual usage of flash on the internet.

The third use case involves analysis of images and usage patterns on the Photoshop-in-a-browser edition of Photoshop.com. The fourth use case dealt with scaling the infrastructure that powers businesscatalyst - a turn-key online business platform solution including analysis, campaigning and more. When purchased by Adobe the system was very successful business-wise. However, the infrastructure was by no means able to put up with the load it had to accommodate. Changing to a back-end based on HBase led to better performance and faster report generation.