ApacheCon EU - part 09

2012-11-18 20:54
In the Solr track ElasticSearch and SolrCloud went into competition. The comparison itself was slightly apples-and-oranges as the speaker compared the current ElasticSearch version, based on Lucene 3.x, with SolrCloud based on Lucene 4.0. Still it turned out that both solutions are more or less comparable - so the choice again depends on your application. However I did like the conclusion: the speaker did not pick a clear winner in terms of projects. He did name another clear winner though: the user community will benefit from there being two projects, as this perceived competition has already sped up development considerably.

The day finished with Hoss' Stump the Chump session: the audience was asked to submit questions before the session, and a jury was then asked to pick the winning question - the one that stumped Hoss the most.

Some interesting bits from those questions: one guy had the problem of having to provide somewhat diverse results, e.g. in terms of manufacturers in his online shop. There are a few tricks to deal with this problem: a) clean your data - don't have items that use keyword spamming side by side with regular entries. Assuming this is done you could b) use grouping to collapse items from the same manufacturer and let the user drill deeper from there. Also c) a secondary sort value can help - one hint: Solr ships with a random sort field type out of the box for such cases.
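
A minimal SolrJ sketch of the last two tricks - grouping on a manufacturer field plus a random secondary sort. The core name, the field names and the random_* dynamic field are assumptions that would have to match your own schema:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DiverseResults {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/shop");

        SolrQuery q = new SolrQuery("laptop");
        // b) collapse items of the same manufacturer into one group each
        q.set("group", true);
        q.set("group.field", "manufacturer");
        q.set("group.limit", 3);                      // at most three items per manufacturer
        // c) random secondary sort to break ties between equally scored items
        q.set("sort", "score desc, random_42 desc");  // random_* backed by solr.RandomSortField

        QueryResponse rsp = solr.query(q);
        System.out.println(rsp.getGroupResponse().getValues());
    }
}
```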

For me the last day started with hossman's session on boosting and scoring tricks with Solr - including a cute reference for explaining TF-IDF ranking to people (see also a message tweeted earlier for an explanation of what a picture taken during my wedding has to do with ranking documents):


(embedded Twitpic photo)


Though TF-IDF is standard IR scoring, it probably is not enough for your application. There's a lot of domain knowledge that you can encode in your ranking (a sketch of how to wire such factors in follows the list below):

  • novelty factors - number of ratings / standard deviation of ratings - ranking controversial items on top, which might be more interesting than just the stuff that everyone loves
  • scarcity - people like buying what is nearly sold out
  • profit margin
  • factor external signals into your score manually - e.g. popularity by association, like categories that are more popular than others, or items that are more popular depending on the time of day or year
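
A sketch of how such domain factors could be added on top of the text score with an edismax boost function - the field names (num_ratings, rating_stddev, margin) are made up and would need to exist in your index:

```java
import org.apache.solr.client.solrj.SolrQuery;

public class DomainBoosts {
    public static SolrQuery build(String userQuery) {
        SolrQuery q = new SolrQuery(userQuery);
        q.set("defType", "edismax");
        // additive boosts: many/controversial ratings and a healthy margin push items up
        q.set("bf", "log(sum(num_ratings,1)) rating_stddev scale(margin,0,1)");
        return q;
    }
}
```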


There are a few sledgehammers people usually think of that can turn against you really badly: say you rank by novelty only - that is, you sort by date. The counter example given was the case of the AOL-Time Warner merger - being a big story, newspapers would post essays on it, do evaluations etc. However articles only remotely related to it would also mention the case. So by the end of the week, when searching for it you would find all those little, only remotely relevant articles and have to dig through all of them to find the really big and important essay.

There are cases where it seems like recency is all that matters: filter for only the most recent items and retry only in case of no results. The counter example is the case where products are released just on a yearly basis but you set the filter to, say, a month. This way up until May 31 your users will run into the retry branch and get a whole lot of results. However when a new product comes out on June 1st, from that day onward the old results won't be reachable anymore - leading to a very weird experience for those of your users who saw yesterday's results.
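
The pattern in question, sketched with SolrJ to make the cliff explicit - the release_date field is an assumption:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class RecencyWithRetry {
    // Beware: the day a new product is released the first query suddenly matches
    // and everything older silently disappears from the result list.
    public static QueryResponse search(SolrServer solr, String userQuery) throws Exception {
        SolrQuery recent = new SolrQuery(userQuery);
        recent.addFilterQuery("release_date:[NOW-1MONTH TO NOW]");
        QueryResponse rsp = solr.query(recent);
        if (rsp.getResults().getNumFound() == 0) {
            rsp = solr.query(new SolrQuery(userQuery));  // retry without the recency filter
        }
        return rsp;
    }
}
```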

There were times when keyword stuffing was used to emulate higher scores - don't do that anymore; Solr supports sophisticated per-field and per-document boosting that makes such hacks superfluous.

Instead use edismax for weighting fields. Some hints on that one: configure omitNorms in order to avoid having keyword stuffing influence your ranking. Configure omitTermFreqAndPositions if the term frequency in a document does not really tell you much, e.g. in the case of small documents.
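
A minimal edismax sketch - field names and weights are made up, and the schema attributes are only referenced in the comments:

```java
import org.apache.solr.client.solrj.SolrQuery;

public class EdismaxWeighting {
    public static SolrQuery build(String userQuery) {
        // In schema.xml the short fields could carry omitNorms="true" (no length norms /
        // index-time boosts) and omitTermFreqAndPositions="true" (tf ignored) where
        // term frequency and document length carry little signal.
        SolrQuery q = new SolrQuery(userQuery);
        q.set("defType", "edismax");
        q.set("qf", "title^10 abstract^3 body^1");  // matches in the title count much more
        q.set("mm", "2<75%");                       // optional: require most terms to match
        return q;
    }
}
```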

With current versions of Solr you can use your own custom scoring per field. In addition a few implementations ship with options for tweaking - like for instance the SweetSpotSimilarity where you can tell the scorer that up to a certain length no length penalisation should happen.
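
A minimal sketch of what that looks like on the Lucene side (the misc module's SweetSpotSimilarity, signature roughly as of Lucene 4.x); in Solr you would normally wire this up per field via a similarity factory in the schema rather than in code:

```java
import org.apache.lucene.misc.SweetSpotSimilarity;

public class NoPenaltyForShortDocs {
    public static SweetSpotSimilarity create() {
        SweetSpotSimilarity sim = new SweetSpotSimilarity();
        // documents between 1 and 500 terms get no length penalisation at all,
        // longer ones are penalised gradually (steepness 0.5); overlaps are discounted
        sim.setLengthNormFactors(1, 500, 0.5f, true);
        return sim;
    }
}
```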

Create your own boost functions that in addition to TF-IDF rely on rating values, click rates, prices or category influences. There's even an external file field option that allows you to load your scoring value per document or category from an external file - one that can be updated much more frequently than you would want to re-index all documents in your Solr. For those suffering from the "For business reasons this document must come first no matter what the algorithm says" syndrome, there's a query elevation component for very fine-grained tuning of rankings per query. Keep in mind though that this can easily turn into a maintenance nightmare. However it can be handy when quickly fixing a highly valuable business use case: with that component it is possible to explicitly exclude documents from matching and to precisely set where individual documents rank.
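
A sketch of the external-file-field idea: multiply the text score by an externally maintained popularity value. The field name and the assumption that it is declared as solr.ExternalFileField (keyed by document id, with values kept in an external file in the data directory) are purely illustrative:

```java
import org.apache.solr.client.solrj.SolrQuery;

public class ExternalPopularityBoost {
    public static SolrQuery build(String userQuery) {
        SolrQuery q = new SolrQuery(userQuery);
        q.set("defType", "edismax");
        q.set("boost", "sum(1,popularity)");  // multiplicative boost on top of the text score
        return q;
    }
}
```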

When it comes to user analytics and personalisation many people think of highly sophisticated algorithms that need lots of data to be trained. Yes, Mahout can help you with personalisation and recommendation - but there is some low-hanging fruit to grab first:

  • Use the history of registered users or those you can identify through cookies - track the keywords they are looking for, the sort and filter functions commonly used.
  • Bucket people by explicit or implicit demographics.
  • Even just grouping people by the OS and browser they use can help to identify preferences.


All of this information is relatively cheap to come by and can be used in many creative ways:

  • Provide default sort and filter functions for returning users.
  • Filter on the current query but when scoring take the older query into account.
  • Based on the category facet used before, boost that category in the next search, assuming that the two queries are related - see the sketch below.
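
A small sketch of that last point: the category the user facetted on previously is boosted - not filtered on - in the follow-up query. The field name is an assumption:

```java
import org.apache.solr.client.solrj.SolrQuery;

public class CategoryAffinityBoost {
    public static SolrQuery build(String userQuery, String previousCategory) {
        SolrQuery q = new SolrQuery(userQuery);
        q.set("defType", "edismax");
        q.set("bq", "category:\"" + previousCategory + "\"^5");  // boost, do not restrict
        q.addFacetField("category");                             // keep offering the facet
        return q;
    }
}
```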


Essentially the goal is to identify three factors for your users: What is their preference, what is the differentiator and what is your confidence in your estimation.

Another option could be to use the "SweetSpotPlateau" idea: if someone clicked on a price range facet, on the next related query do not hide other prices but boost those that fall into the previously chosen range.
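
The same bq trick as above, just with a range query on a (hypothetical) price field:

```java
import org.apache.solr.client.solrj.SolrQuery;

public class PriceRangeBoost {
    public static SolrQuery build(String userQuery, int lower, int upper) {
        SolrQuery q = new SolrQuery(userQuery);
        q.set("defType", "edismax");
        q.set("bq", "price:[" + lower + " TO " + upper + "]^3");  // prefer, don't filter
        return q;
    }
}
```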

One side effect to keep in mind: your cache hit rate will go down once you start tailoring your results to individual users or user groups.

Biggest news for Lucene and Solr was the release of Lucene 4 - find more details online in an article published recently.

ApacheConEU - part 08

2012-11-17 20:53
Jan Lehnardt's talk covered the history of CouchDB - including lessons learnt along the way. The first issue he went into: shipping 1.0 is hard! They spent a lot of effort and time on having a stable database that won't lose your data - only to have a poor patch slip into 1.0 that resulted in data loss. The flurry of action happening afterwards was truly amazing - people working rolling shifts all over the planet to not only fix the issue but also provide recovery tooling for those affected by the bug. The lessons learnt from that are as obvious as they are often neglected: both test coverage and code review are crucial for any software project.

The second topic Jan went into was the distraction and tension that come from having a company built around your favourite open source project. When going down this road keep in mind that the whole VC setup usually is very time consuming - the world starts revolving around the need to either gather more VC funding or make up a successful business case to support your company. All of this results in less time spent coding, and friction around the fact that the corporate interests may not always be what is best for your open source project. In CouchDB the result was the blow-up of the project founder, who eventually left the project. This hit CouchDB particularly badly as the project essentially was built around the idea of the one brilliant coder and relied on his information channels for marketing. The lesson learnt was that having communications centralised that way can easily turn against you - don't trust your benevolent dictator.

Usually it is quite ok for users to move on - in particular if the project no longer fits their needs. However having multiple key people leave at the same time can be detrimental, in particular if they are the vocal ones. In terms of lessons learnt: embrace the fact that your software will fail some people. Use the resulting knowledge about your application's boundaries - or fix what failed them.

In terms of general advice: the world moved on after each of these crises. What does help is to ship what users need instead of running after the next big hype. Also, good ideas will stick - using JSON as a format and JavaScript for query formulation made it into many other applications, with the former also making it into the next SQL standard, to be released in 2015. The goal should be to build stuff that is easy (and fun) to use.

In the meantime CouchDB has grown up. Not only does it have another release and a new web site - it has turned into a project that is no longer pushed forward by a single person but moves on its own. The secret behind that development is to acknowledge that having just a few people in leading positions will burn them out - make sure to enable others and that your strong leaders get to lead. Oh, and like any Apache project, CouchDB is happy about every new contributor joining the project.

When it comes to communication, the Apache incubation process made sure to burn the "everything happens on the mailing list" mantra into their minds. Still, IRC was a valuable channel for non-decision stuff like user support and community building. IRC is fun - in particular when you can train IRC bots on earlier communication to automatically answer incoming user questions.

Another option CouchDB used to fix the community issues was to meet with people face-to-face - for three days in Boston, later in Dublin, later in Vienna. In addition they added a roadmap for the next two to three years including points like:

  • faster releases - they switched to time based instead of feature based releases except for security patches
  • they are the first to use git@apache to make branching and merging easier
  • they are github lovers with pull requests ending up on their dev list
  • they set up an Erlang beginners' question list in order to be able to recruit new contributors in a world lacking Erlang developers. A very specific result of that was that people are much more comfortable asking even simple questions - and on a more practical note, one question about the bird's eye view of CouchDB resulted in Jan spending an hour and a half drawing up that particular picture: spending an hour on docs to reach really new people is time well spent.


In terms of PMC chair lessons learnt: The goal should be to get the right people to care about the right thing. Having people finish stuff helps - and is infectious.

In the end, as an open source project your biggest asset is your community. Motivating more people to join is key. If for your target audience JIRA is one step too much, talk to infra to figure out how to make things better (and help them with the solution).

What is fascinating about CouchDB is the whole ecosystem around the project. CouchDB is not just a database project hosted at Apache. It comes with a really well working replication API. There are implementations in JavaScript running in browsers, there's BigCouch (Dynamo in Erlang on top of CouchDB), there is an iOS app, there is PouchDB (the couch for your pocket), TouchDB (iOS and Android implementations on top of SQLite). The fun part to watch is that the idea is bigger than the project at Apache. The bigger the ecosystem the better for the community - there's no need to fold everything into the original project.

And of course also CouchDB is hiring.

ApacheConEU - part 07

2012-11-16 20:51
Julien Nioche shared some details on the Nutch crawler. Being the mother of all Hadoop projects (as in Hadoop was born out of developments inside of Nutch), the project had become rather quiet, though it has seen a steady stream of development in the recent past. Julien himself uses Nutch for gathering crawled data for several customer projects - feeding this data into an NLP pipeline based on Behemoth, which glues Mahout, UIMA and GATE together.

The basic crawling steps - building the web graph, computing a link-based ranking and indexing - are still the same as when I last looked at the project; just that for indexing the project now uses Solr instead of its own Lucene-based solution.

The main advantage of Nutch is its pluggability: the protocol parser, html filter, url filter, url normaliser can all be swapped out for your own implementations.
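
As an illustration of one of those plug-in points, a minimal sketch of a custom URL filter (interface roughly as in Nutch 1.x/2.x: return the URL to keep it, null to drop it); it would be packaged as a plugin and activated via the plugin.includes property, details depending on your version:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

public class NoCalendarUrlFilter implements URLFilter {

    private Configuration conf;

    @Override
    public String filter(String url) {
        // drop obvious crawler traps such as endless calendar pages (illustrative rule)
        return url.contains("/calendar/") ? null : url;
    }

    @Override
    public void setConf(Configuration conf) { this.conf = conf; }

    @Override
    public Configuration getConf() { return conf; }
}
```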

In their 2.0 version they moved away from their own plain HDFS storage to a table schema - mapped to the actual database through Gora, an abstraction layer that connects to e.g. Cassandra or HBase. The schema itself is based on Avro but can be adapted to your needs. The advantages are obvious: though still distributed, this approach is much easier and simpler in terms of the logic kept in Nutch itself. Also it is easier for third parties to connect to the data - all you need is the schema as well as Gora. The current disadvantage lies in its configuration overhead and instability compared to the old solution. Most likely at least the latter will go away as version 2.0 stabilises.

In terms of future work the project focuses on stabilisation and on synchronising the features of versions 1.x and 2.x (link ranking is only available in 1.x while ElasticSearch support is only available in 2.x). In terms of functionality the goal is to move to SolrCloud, support sitemaps (as implemented by crawler commons), and more (pluggable?) indexers.

The goal is to delegate implementations - it was already done for Tika and Solr. Most likely it will also happen for the fetcher, protocol handling, robots.txt handling, url normalisation and filtering, graph processing code and others.


The next talk in the Solr/Lucene track dealt with scaling Solr to big data. The goal of the speaker was to index 100 million documents - with the number of documents expected to grow in the future. Being under heavy time pressure and having a bash wizard on the project, they started out building lots of their glue code and components as bash scripts: there were scripts for starting/stopping services, remote indexing, performance monitoring, content extraction, ingestion and deployment. Short term this was a very good idea - it allowed for fast iterations and learning. In the long run they slowly replaced their custom software with standard components (Tika for content extraction, Puppet for deployment etc.).

They quickly learnt to value property files in order to easily reconfigure the system even in production (relying heavily on bash, XML was of course not an option). One problem this came in handy with was adjusting the sharding configuration - going from simple random sharding, to old vs. new, to monthly shards, they could optimise the configuration for their load. What worked well for them was to separate JVM startup from Solr core startup - they would start with empty Solrs and only point them to the data directories after verifying that the JVMs had booted up correctly.

In terms of performance they learnt to go wide quickly - instead of spending hours on optimising their one huge box they ended up distributing their Solrs across multiple separate machines.

In terms of ingestion pipelines: theirs is currently based on an indexing directory convention, moving the index directory as soon as all data is ingested. The advantage here is the atomicity of mv; the disadvantage is having to move around lots of data. Their goal is to go for HDFS for indexing soon and ZooKeeper for configuration handling.

In terms of testing: in contrast to having a dev-test-production environment, their goal is to have an elastic compute cloud that can be adjusted accordingly. Though EC2 itself is very cost intensive and poses additional problems with firewalling and moving data, cloud computing could still be a solution - in particular given projects like CloudStack or other open cloud stacks. The goal would be to do cycle scavenging in their own datacenter: do heavy computations when there is not a lot of load on the system and turn those analyses off in case of incoming traffic.

When it comes to testing and monitoring changes they made good experiences with JConsole (connecting it to several Solrs at once through a simple IP discovery script) and with SolrMeter as a performance debugging tool.

Some implementation details: they use Solr as a kind of NoSQL cache as well (for thousands of queries/s), push the schema definition to Solr from the app, and have common fields plus the option for developers to add custom fields in the app. Their experience is to not do expensive stuff in Solr but to move that outside - this applies in particular to content extraction. For storage they use an Avro-based binary file format (mainly in order to save storage space and to have a versioned schema plus fast compression and decompression). They are starting to use Tika as their pipeline and for automatic content detection, scaling up with Behemoth. They learnt to upgrade indexes without re-indexing using the Lucene upgrade tooling. In addition they use Grimreaper to kill servers if anything goes wrong and restart them later.
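
For the index upgrade bit, a minimal sketch of the Lucene upgrade tooling (API roughly as of Lucene 4.x; it can also be invoked from the command line via its main() method):

```java
import java.io.File;
import org.apache.lucene.index.IndexUpgrader;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class UpgradeIndex {
    public static void main(String[] args) throws Exception {
        FSDirectory dir = FSDirectory.open(new File(args[0]));  // path to an existing index
        new IndexUpgrader(dir, Version.LUCENE_40).upgrade();    // rewrites old-format segments in place
    }
}
```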

ApacheConEU - part 06

2012-11-15 20:48
For the next session I joined the Tomcat crowd in Mark Thomas's talk to learn more about Tomcat reverse proxy configurations. One rather common setup is to have Tomcat connected to an httpd instance. One common issue encountered with this setup, in particular when running httpd with the event MPM, is thread exhaustion on Tomcat's side. Fixes include always having more active Tomcat threads than there can be httpd threads at any one time, and disabling persistent connections. Keep in mind that Tomcat performance does not degrade gracefully here - in case of thread exhaustion it just goes downhill very quickly. One famous example was an issue with the ASF JIRA: after weeks of debugging bad performance - after blaming the hardware, the OS, the JVM, the Java implementation in general - finally the number of threads was increased, resulting in a smoothly running system...

Another common configuration problem is to rename the deployed web application war - for instance in order to keep the version number in the war name itself - and change the path on httpd's side. This is bad for at least four reasons:

  • redirects will fail - you can configure ProxyPassReverse which will fix some issues but does not affect all http headers
  • cookie paths break - you can configure ProxyPassReverseCookiePath here
  • links that are generated in the web app will fail - you can use mod_sed / mod_substitute / mod_proxy_html to fix that - however those configurations tend to become messy and error prone
  • custom headers usually also break


If the only reason for doing such a thing is to keep the version number in the file name, it might be an option to use "foo##bar.1.2.3" as the filename - Tomcat will ignore anything after the hash signs when deriving the context path.

When proxying traffic make sure to inform Tomcat about https termination in order to correctly handle secure cookies and sessions. This is done automatically with mod_jk and mod_proxy_ajp; mod_proxy_http needs some more manual work. When dealing with virtual hosting make sure to use ProxyPreserveHost in order to be able to switch hosts on Tomcat's side.


ApacheConEU - part 05

2012-11-14 20:47
The afternoon featured several talks on HBase - both its implementation and schema optimisation. One major issue in schema design is the choice of row key. The simplest recommendation is to make sure that keys are designed such that when reading data the load will be evenly distributed across all nodes, to prevent region-server hot-spotting. General advice here is to hash keys or to reverse URLs.
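
A minimal sketch of those two key designs (hash prefix vs. reversed URL) - purely illustrative helper code, not from the talk:

```java
import java.security.MessageDigest;
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeys {

    // hash prefix spreads sequential inserts across regions; the readable url is kept
    public static byte[] hashedKey(String url) throws Exception {
        byte[] hash = MessageDigest.getInstance("MD5").digest(Bytes.toBytes(url));
        return Bytes.add(Bytes.head(hash, 8), Bytes.toBytes(url));
    }

    // "www.example.com/foo" -> "com.example.www/foo" keeps pages of one site adjacent
    public static byte[] reversedDomainKey(String url) {
        String[] parts = url.split("/", 2);
        String[] host = parts[0].split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = host.length - 1; i >= 0; i--) {
            sb.append(host[i]);
            if (i > 0) sb.append('.');
        }
        if (parts.length > 1) sb.append('/').append(parts[1]);
        return Bytes.toBytes(sb.toString());
    }
}
```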

When it comes to running your own HBase cluster make sure you know what is going on in the cluster at any point in time:

  • HBase comes with tools for checking and fixing tables,
  • tools for inspecting hfiles that HBase stores data in,
  • commands for inspecting the binary write ahead log,
  • web interfaces for master and region servers,
  • offline meta data repair tooling.


When it comes to system monitoring make sure to track cluster behaviour over time e.g. by using Ganglia or OpenTSDB and configure your alerts accordingly.

One tip for high traffic sites - it might make sense to disable automatic splitting to avoid splits during peaks and rather postpone them to low traffic times. One rather new project presented to monitor region sizes was Hannibal.

At the end of his talk the speaker went into some more detail on problems encountered when rolling out HBase and lessons learnt:

  • the topic itself was new so both engineering and ops were learning.
  • at scale nothing that was tested on small scale works as advertised.
  • hardware issues will occur, tuning the configuration to your workload is crucial.
  • they used early versions - inevitably leading to stability issues.
  • it's crucial to know that something bad is building up before all systems start catching fire - monitoring and alerting the right thing is important. With Hadoop there are multiple levels to monitor: the hardware, os, jvm, Hadoop itself, HBase on top. It's important to correlate the metrics.
  • key and schema design are crucial.
  • monitoring and knowledgable operations are important.
  • there are no emergency actions - in case of an emergency it just will take time to recover: Even if there is a backup, even just transferring the data back can take hours and days.
  • HBase (and Hadoop) is DevOps technology.
  • there is a huge learning curve to get to a state that makes using these systems easy.



In his talk on HBase schema design Lars George started out with an introduction to the HBase architecture. On the logical level it's best to think of HBase as a large distributed hash table - all data except for names is stored as byte arrays (with helpers provided to transform that data back into well known data types). The tables themselves are stored in a sparse format, which means that null values essentially come for free.

On a more technical level the project uses ZooKeeper for master election, split tracking and state tracking. The architecture itself is based on log-structured merge trees. All data initially ends up in the write-ahead log - with data always being appended to the end, this log can be written and read very efficiently without any seek penalty. The data is then inserted into the respective region server's memory (the memstore, typically 64 MB in size) and synced to disk at regular intervals. In HBase all files are immutable - modifications are done only by writing new data and merging it in later. Deletes also happen by marking data as deleted instead of really deleting it. On a minor compaction the most recent few files are merged; on a major compaction all files are merged and deletes are actually applied. Handling major compactions yourself is possible as well.
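
For the last point, a sketch of triggering a major compaction from client code (API roughly as of HBase 0.92/0.94; the table name is illustrative). A common pattern is to disable the periodic major compaction via hbase.hregion.majorcompaction=0 in hbase-site.xml and issue requests like this during low-traffic hours:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class NightlyCompaction {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        admin.majorCompact("webtable");  // asynchronous request to compact all regions of the table
        admin.close();
    }
}
```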

In terms of performance, lookup by key is the best you can do - lookup by value results in a full table scan. If a row is updated only infrequently you can give HBase a hint as to where to find it by providing a rough timestamp range to look in. There are also options to use Bloom filters for better read performance. Another option is to move more data into the row key itself if that is what you will be searching for very often. Make sure to de-normalise your data as HBase does not do any sophisticated joins; there will be duplication in your data as everything should be pre-joined to get decent performance. Have intelligent keys that match your read/write patterns. Also make sure to keep your keys reasonably short as they are repeated for each cell - so moving all the data into the key isn't going to gain you anything.
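
A sketch of the timestamp hint (HBase 0.9x client API; table, key and time window are made up). Narrowing the time range lets HBase skip store files whose time ranges cannot contain the value; Bloom filters would additionally be enabled per column family in the table descriptor:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class TimestampHintedGet {
    public static Result fetch(String rowKey, long approxUpdateTime) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "webtable");
        try {
            Get get = new Get(Bytes.toBytes(rowKey));
            long day = 24L * 60 * 60 * 1000;
            get.setTimeRange(approxUpdateTime - day, approxUpdateTime + day);  // only look around the hinted time
            return table.get(get);
        } finally {
            table.close();
        }
    }
}
```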

Speaking of read write patterns - as a general rule of thumb: to optimise for better write performance tune the memstore size. For better read performance tune the block cache size. One final hint: Anything below 8 servers really is just a test setup as it doesn't get you any real advantages.
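
A sketch of where those two knobs live - normally set in hbase-site.xml rather than in code; property names as of HBase 0.9x, the values are purely illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ReadWriteTuning {
    public static Configuration tuned() {
        Configuration conf = HBaseConfiguration.create();
        // write-heavy workload: larger memstores that get flushed less often
        conf.setLong("hbase.hregion.memstore.flush.size", 128L * 1024 * 1024);
        // read-heavy workload: give a larger share of the heap to the block cache instead
        conf.setFloat("hfile.block.cache.size", 0.4f);
        return conf;
    }
}
```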

ApacheConEU - part 04

2012-11-13 20:46
The second talk I went to was the one on dev@hadoop.a.o insights given by Steve Loughran. According to Steve, Hadoop has turned into what he calls an operating system for the data center - similar to Linux in that its development is not driven by a vendor but by its users: even though Hortonworks, Cloudera and MapR each have full time people working on Hadoop (and related projects), this work usually is driven by customer requirements, which ultimately means that someone is running a Hadoop cluster that he has trouble with and wants to have fixed. In that sense the community at large has benefited a lot from the financial crisis that Y! slipped into: most of the Hadoop knowledge that is now spread across companies like LinkedIn, Facebook and others comes from engineers leaving Y! and joining one of those companies. With that the development cycle of Hadoop has changed as well: while it was mostly driven by Y!'s schedule in the beginning - churning out new releases nearly on a monthly basis with a dip around Christmas time - it became more irregular later, only to pick up a more regular schedule again just recently.



Image taken a few talks earlier in the same session.


In terms of version numbers: 1.x is to be considered stable and production ready, with fixes and low risk patches applied only. For 2.x the picture is a bit different - currently that is in alpha stage; features and fixes go there first, new code is developed there first. However Hadoop is not just http://hadoop.apache.org - the ecosystem is much larger, including projects like Mahout, Hama, HBase and Accumulo built on top of Hadoop, Hive, Pig, Sqoop and Flume for integration and analysis, as well as Oozie for coordination and ZooKeeper for distributed configuration. There's even more in incubation (or just recently graduated): Kafka for logging, Whirr for cloud deployment, Giraph for graph processing, Ambari for cluster management, S4 for distributed stream processing, HCatalog and Templeton for schema management, Chukwa for logging purposes. All of the latter ones love helping hands. If you want to help out and want to "play Apache", those are the places to go to.

With such a large ecosystem one of the major pain points when rolling out your own Hadoop cluster is integrating all the components you need: all projects release on separate schedules, usually documenting which version of Hadoop they were built against. However finding a working combination is not always trivial. The first place to go that springs to mind are the standard Linux distributions - however given their release and support cycles (e.g. Debian's guarantees for stable releases) the pace inside of Hadoop and the speed with which old versions are declared "no longer supported" still is too fast. So what alternatives are there? You can go with Cloudera, who ship Apache Hadoop extended with their patches and additional proprietary software. You can opt for Hortonworks, who ship the ASF Hadoop only. Or you can opt for Apache itself and either tame the zoo yourself or rely on Bigtop, which aims to integrate the latest versions.

Even though there are people working full time on the project there's still way more work to do than hands to help out. Some issues do fall through the cracks, in particular if you are the only person affected. Ultimately this may mean that if a bug only affects you, you will have to be the one to fix it. Before going out and hacking the source itself it may make sense to search for the stack trace you are looking at or the error message in front of you - look in JIRA and in the mailing list archives to see if there's someone else who has solved the problem already - and in the ideal case has even provided a patch that just didn't make it into the distribution so far.

Also contributing to Hadoop is a great way to get to work on CS hard problems: There's distributed computing involved (clearly), there's consensus implementations like Paxos, there's work to optimise Hadoop for specific CPU architectures, there's scheduling and data placement problems to solve, machine learning and graph algorithm implementations and more. If you are into these topics - come and join, those are highly needed skills. People on the list tend to be friendly - they are "just" overloaded.

Of course there are barriers to entry - just like the Linux kernel Hadoop has become business critical for many of its users. Steve's recommendation for dealing with this circumstance was to not compete with existing work that is being done - instead either concentrate on areas that aren't covered yet or even better yet collaborate with others. A single lonely developer just cannot reasonably compete.

In terms of commit process Hadoop runs a review-then-commit protocol. While that makes things seemingly more secure it also means that the development process is a lot slower, that things get neglected, and that there is a lot of frustration on the committers' side as well when a patch has to be updated over and over again. In addition, tests running for a long time don't make contributing substantially easier. With lots of valuable and critical data being stored in HDFS, making radical changes there also isn't something that happens easily. The best way to really get started is to get trusted and build a track record: the maintenance cost for abandoned patches that are large, hard to grasp and no longer supported by the original author is just too high.

Get a track record by getting known on dev@ and at meetups. Show your knowledge by helping out others, help with patches that are not your own, don't rewrite the core initially, help with building plug-in points (like for shuffling implementations), help with testing across the whole configuration space and testing in unusual settings. Delegate testing at scale to those who have the huge clusters - there will always be issues revealed in their setup that do not cause any problems on a single machine or in a tiny cluster. Also make sure to download the package in the beta phase and test that it works in your problem space. Another way to get involved is by writing better documentation - either online or as a book. Share your experience.

One major challenge is work on major refactorings - there is the option to branch out and switch to a commit-then-review model, but that only postpones the work to merge time. For independent work it's even more complicated to get the changes in. Also, integrating postgraduate work that can be very valuable isn't particularly simple - especially if there already is a lack of helping hands: who's going to put in the work to mentor those students?

Some ideas to improve the situation would be to help with spreading the knowledge - through local university partnerships, Google hangouts, local dev workshops, and by using git and Gerrit for better distributed development and merging.