Archive

Archive for the ‘Apache Con’ Category

ApacheCon - Keynotes

November 25th, 2010 at 11:20pm

The first keynote was given by Dana Blankenhorn – a journalist and blogger regularly publishing tech articles with a clear focus on open source projects. Focussed on the evolution of open source projects with a special focus on Apache.

Coming from a research background the keynote given by Daniel Crichton from NASA was very interesting to me: According to the speaker scientists are facing challenges that are all to known to large and distributed corporations. Most areas in science is currently becoming more and more dependent on data intensive experiments. Examples include but are not limited to

  • The field of biology where huge numbers of experiments are needed to decipher the internal workings of proteins, or to be able to understand the fundamental concepts underlying data encoded in DNA.
  • In physics hadron collider experiments huge amounts of data are generated with each experiment. With facilities for running such experiments are expensive to build and the amount of data generated is for too large to be analysed by just one team groups of scientists are suddenly facing the issue of exchanging data with remote research groups. They suddenly run into the requirement of integrating their system to those of other groups. All of a sudden data formats and interfaces have to somehow be standardised.
  • Running space missions used to be limited to just a very small number of research institutions in a very tiny number of countries. However this is about to change as more countries are gaining the knowledge and facilities to run space missions. Again this leads to the need to be able to collaborate towards one common goal.

Not only are software systems so far distinct and incompatible. Even data formats used usually are incompatible. The result are scientists spending most of their time re-formatting, converting and importing datasets before being able to get any real work done. At the moment research groups are not used to working collaboratively in distributed teams. Usually experiments are run on specially crafted, one-of software that cannot be easily re-used, that does not adhere to any standards and that is being re-written over and over again by every research group. Re-using existing libraries is oftentimes a huge cultural shift as researchers seemingly are afraid of external dependencies, afraid of giving up control over part of their system.

One step into the right direction was taken by NASA earlier this year: They released their decision making support system OODT under a free software license (namely the Apache Software License) and put the project under incubation at Apache. The project currently is about to graduate to its own top level Apache project. This step is especially remarkable as successfully going through the incubator also means to have established a healthy community that is not only diverse but also open to accepting incoming patches and changes to the software. This means to not only give up control over your external dependencies but also having the project run in a meritocratic, community driven model. For the contributing organisation, this boils down to no longer having total control over the future roadmap of the project. In return this usually leads to higher community participation, and higher adoption in the wild.

Apache Con , ,

Apache Con – Hadoop, HBase, Httpd

November 24th, 2010 at 11:19pm

The first Apache Con day featured several presentations on NoSQL databases (track sponsored by Day software), a Hadoop track as well as presentations on Httpd and an Open source business track.

Since its inception Hadoop always was intended to be run in trusted environments firewalled from hostile users or even attackers. As such it never really supported any security features. This is about the change with the new Hadoop release including better Kerberos based security.

When creating files in Hadoop a long awaited feature was append support. Basically up to now writing to Hadoop was a one-of job: Open a file, write your data, close it and be done. Re-opening and appending data was not possible. This situation is especially bad for HBase as its design relies on being able to append data to an existing file. There have been efforts for adding append support to HDFS earlier as well as an integration of such patches by third party vendors. However only with a current Hadoop version Append is officially supported by HDFS.

A very interesting use case-wise of the Hadoop stack was presented by $name-here from $name. They are using a Hadoop cluster to provide a repository of code released under free software licenses. The business case is to enable corporations to check their source code against existing code and spot license infringements. This does not only include linking to free software under incompatible licenses but also developers copying pieces of free code, e.g. copying entire classes or functions into internal projects that originally were available only under copyleft licenses.

The speaker went into some detail explaining the initial problems they had run into: Obviously it’s no good idea to mix and match Hadoop and HBase versions freely. Instead it is best practice to use only versions claimed to be compatible by the developers. Another common mistake is to leave parameters of both projects at their defaults. The default parameters are supposed to be fool-proof. However they are optimised to work well for Hadoop newbies who want to try out the system on a single node cluster and in a distributed setting obviously need more attention. Other anti-patterns include storing only tiny little files in the cluster thus quickly running out of memory on the namenode (that stores all file information including block mappings in main memory for faster access).

In the NoSQL track Jonathan Grey from Facebook gave a very interesting overview on the current state of HBase. Turns out that Facebook would announce only a few days after that their internal use of HBase for the newly launched feature of Facebook messaging.

HBase has adopted a release cycle including development/ production releases to get their systems into interested users’ hands more quickly: Users willing to try out new experimental features can use the development releases of HBase. Those who are not should go for the stable releases.

After focussing on improving performance in the past months the project is currently focussing on stability: Data loss is to be avoided by all means. Zookeeper is to be integrated more tightly for storing configuration information thus enabling live reconfiguration (at least to some extend). In addition also HBase is targeting to integrate stored procedures like behaviour: As explained in Googles Percolator paper $LINK batch oriented processing get’s you only so far. If data that gets added constantly it makes sense to give up on some of the throughput batch-based systems provide and instead optimise for shorter processing cycles by implementing event triggered processing.

On recommendation of one of neofonie’s sys-ops I visited some of the httpd talks: First Rich Bowen gave an overview of unusual tasks one can solve with httpd. The list included things like automatically re-writing http response content to match your application. There is even a spell checker for request URLs: Given marketing has given your flyer to the press with a typo in the url, chances are that the spellchecking module can fix these automatically for each request: Common mistakes covered are switched letters, numbers replaced by letters etc. The performance cost has to be paid only in case no hit could be found – so instead of returning a 404 right away the server first tries to find the document by taking into account common mis-spellings.

Apache Con , , , ,

Apache Con – Hackathon days

November 23rd, 2010 at 11:17pm

This year on Halloween I left for a trip to Atlanta/GA. Apache Con US was supposed to take place there featuring two presentations on Apache Mahout – one by Grant Ingersoll explaining how to use Mahout to provide better search features in Solr, one by myself with a general introduction to what features Mahout provides, giving a bit more detailed information on how to use Mahout for classificaiton.

I spent most of Monday in Sally Khudairi’s media training. In the morning session she explained the Ins and Outs of successfully marketing your open source project: One of the most important questions is to be able to provide a dense but still accessible explanation of what your project is all about and how it differentiates from other projects potentially in the same space. As a first exercise attendees would meet in pairs interviewing each other about their respective project. When summarising the information I had gotten, Sally quickly pointed out additional pieces of valuable information I had totally forgotten to ask about:

  • First of all the full name of the interviewee, including the sur-name.
  • Second the background of the person with respect to the project. It seemed all to natural that someone you meet at Apache Con in a Media Training almost certainly is either founder or core-committer to the project. Still it is interesting to know more on how long he has been contributing, whether he maybe even co-founded the project.

After that first exercise we would go into detail on various publication formats. When releasing project information the first format that comes to mind are press releases. For software projects at the ASF these are created in a semi-standardised format containing

  • Background on the foundation itself.
  • Some general background on the project.
  • A few paragraphs on the news to be published on the project in an easily digestible format.
  • Contact information for more details.

Some of these parts can be re-used across different publications and occasions. It does make sense to keep these building blocks as a set of boilerplates ready to use when needed.

After lunch Michael Coté from redmonk visited us. Michael has a development background, currently he works as business analyst for redmonk. It is fairly simple to explain technical projects to fellow developers. To get some experience in explaining our project also to non-technical people Sally invited Michael to interview us. By the end of the interview Michael asked each whether they had any question for him. As understanding what machine learning can do for your average Joe programmer is not all to trivial I simply asked him for strategies for better explaining or show-casing our project. One option that came to his mind was to come up with one – or a few – example show cases where Mahout is applied to freely available datasets. Currently most data analysis systems are rather simple or based only on a very limited set of data. Showing on a few selected use cases what can be done with Mahout should be a good way to get quite some media attention for the project.

During the remaining time of the afternoon I started working a short explanation of Mahout and our latest release. The text was reviewed by the Mahout community. The text was published by Sally on the blog of the Apache Software foundation. I also used it as a basis for an article on heise open that got published that same day.

The second day was reserved for a mixture of attending the Barcamp session and hacking away at the Hackathon. Ross had talked me into giving an overview of various Hadoop use cases as that was requested by one of the attendees. However it turned out the guy wasn’t really interested in specific use cases: The discussion quickly turned into the more fundamental question of how far the ASF should go in promoting its projects. Should there be a budget for case studies? Should there even be some marketing department. Well, clearly that is out of scope for the foundation. And in addition would run contrary to it being a neutral ground for vendors to collaborate towards common goals while still separately making money providing consulting services, selling case studies etc.

During the Hackathon I was turned into a Mentor for Stanbol, a new project entering incubation just now. In addition I spent some time to finally catch up with the Mahout mailing list.

Apache Con ,

Travelling

November 22nd, 2010 at 11:17pm

Currently on my way back from a series of conferences in the past three weeks in the IC from Schiphol. After three weeks of conferences, lots of new input and lots of interesting projects I learned about it is finally time to head back and put the stuff I have learned to good use.


View Travelling in a larger map

As seems normal with open source conferences I got far more input on interesting projects than I can expect to ever get applied in on a daily basis. Still it is always inspiring to meet with other developers in the same field – or even quite different fields and learn more on what projects they are working on, how they solve various problems.

A huge Thank You goes to the DICODE EU research project for sponsoring the Apache Con and Devoxx trips, another Thanks to Sapo.pt for inviting me to Lisbon and covering travel expenses. A special thank you to the assistant at neofonie who made travel arrangements for Atlanta and Antwerp: It all worked without problems even up to me having a power outlet in the train that is finally taking me back.

Apache Con, General, Software Foundation , , , ,

Apache Mahout at Apache Con NA

October 15th, 2010 at 8:39pm

The upcoming Apache Con NA to take place in Atlanta will feature several tracks relevant to users of Apache Mahout, Lucene and Hadoop: There will be a full track on Hadoop as well as one on NoSQL on Wednesday featuring talks on the framework itself, Pig and Hive as well as presentations from users on special use cases and on their way of getting the system to production.

The track on Mahout, Lucene and friends starts on Thursday afternoon, followed by another series of Lucene presentations on Friday.

Also don’t miss the track on the community and business tracks for a glimpse behind the scenes. Of course there will also be tracks on well-known Apache Tomcat, httpd, OSGi and many, many more.

Looking forward to meeting you in Atlanta!

Apache Con , , , , ,

Hadoop at Heise c’t

January 31st, 2010 at 1:37pm

<surreptitious_advertising>
Interesting for those readers speaking German: Heise published an introductory article on Hadoop in its latest issue. Have fun reading.
<surreptitious_advertising/>

Thanks to Simon for proof-reading and providing valuable input. Thanks to Thilo Fromm for the hadoop graphics (unfortunately none of them got published in its original form), the catchy title, proof-reading the text over and over again and for keeping me sane during several past and coming months.

If you want to know more on Apache Hadoop, come watch my FOSDEM Hadoop talk next weekend. If you want to join discussions on Apache Hadoop and Lucene, stay tuned for a conference in Berlin on these topics.

Apache Con, General, Get Together, Hadoop , , ,

ApacheCon Oakland Roundup

November 19th, 2009 at 8:15pm

Two weeks ago ApacheCon US 2009 ended in Oakland California. Shane published a set of links to articles that contain information on what happened at Apache Con. Some of them are officially published by the Apache PRC project, others are write-ups of individuals on which talks they attended and which topics they considered particularly interesting.

Apache Con , ,

Apache Con US Wrap Up

November 16th, 2009 at 10:10pm

some weeks ago I attended ApacheConUS09 in Oakland/ California. In the mean time, videos of one of the sessions have been published online:

You can find a wrap up of the most prominent topics at the conference at heise (unfortunately Germany-only).

By far the largest topics at the conference:

  • Lucene - there was a meetup with over 100 attendees as well as two main tracks with Lucene focussed talks. New features of Lucene 2.9.* were in the center of interest: The new range search capabilities, segment search that improves caching, a new token stream api that makes annotating terms more flexible as well as a lot of performance improvements. Shortly after the conference, Lucene 2.9.1 as well as Solr 1.4 was released so end-users switching to the new version now benefit from better performance and several new features.

  • Hadoop - large scale data processing currently is one of the biggest topics. Be it logfile analysis, business intelligence or ad-hoc analysis of user data. Hadoop was covered by a user meetup as well as one track on the first conference day. The track started with an introduction by Owen O’Malley and Doug Cutting. It continued with talks on HBase, Hive, Pig and other projects from the Hadoop ecosystem.

But also projects like Apache Tomcat and Apache HTTPD were well covered within one to two sessions each.

Currently a hot topic within the foundation is the challenge of bringing the community together face-to-face. Apache projects have become so numerous that covering them all within 3+2 days of conference and trainings seems no longer feasable. One way to mitigate these problems might be to motivate people to do more local meetups potentially supported by ConCom as has already happened in the Lucene- and Hadoop-communities. A related topic is the task of community building and community growth within the ASF. Google Summer of Code has been a great way to integrate new people. However the model does not scale that well for the foundation. With ComDev a new project was founded with the goal to work on community development issues, talking to research, getting students into open source early on. The project is largely supported by Ross Gardler, who already has experience with teaching and promoting open source and free software in the research context being part of the open source watch project in the UK.

Apache Con US 09 brought together a large community of Apache software developers and users from all over the world who gathered in California, not only for the talks but also for face-to-face communication, coding together and exchanging ideas.

Update: Slides of my Mahout talk are now online.

Apache Con , ,

Lucene Meetup Oakland

November 4th, 2009 at 6:05am

Though pretty late in the evening the room is packed with some 100 people. Most of them solr or pure lucene java users. There are quite a few Lucene committers at the meetup from all over the world. Several even have heard about Mahout - some even used it :)

Some introductiory questions to index sizes and query volumn: 1 Mio documents seem pretty standard for Lucene deployments - several people run 10 Mio neither. Some people even use indexes with up to billions of documents in Lucene - but at low query volumn. Usually people run projects with about 10 queries per second, but up to 500.

Eric’s presentation gives a nice introduction to what is going on with Lucene/Solr in terms of user interfaces. He starts with an overview of the problems that libraries face when building search engines - especially the facetting side of life. Especially interesting seem Solaritas - a velocity response writer that makes it easy to render search responses not in xml but in simple templated output. He of course also included an overview of the features of LucidFind, the Lucid hosted search engine for all Lucene projects and sub-projects. Take Home message: The interface is the application, as are the urls. Facets are not just for lists.

Second talk is given by Uwe giving an overview of the implementation of numeric searches and range queries and numeric range filters in Lucene.

Third presenter is Stefan on katta - a project on top of Lucene that adds index splits, load balancing, index replication, failover, distributed TFIDF. The mission of katta is to build a distributed Lucene for large indexes under high query load. The project heavily relies on zookeeper for coordination. It uses Hadoop IPC for search communication.

Lighting talks include talks by

  • Zoie: A realtime search extension for Lucene, developed inside of LinkedIn and now open sourced at google code.
  • Jukka proposed a new project: A Lucene-based content mangement system.
  • Next presenter highlighted the problem of document-to-document search. The problem here is that queries are not just one or two terms but more like 40 terms.
  • Next talk shared some statistics: more than 2s at average leads to 40% abandonance rate for sites. The presenter is very interested in the Lucene Ajax project. Before using solr the speaker set up projects with solutions like Endeca or Mercato. Solr to him is an alternative that supports facetting.
  • Andzrej gives an overview of index pruning in 5min - giving details on which approaches are currently being discussed in research as well as in the Lucene jira for index pruning.
  • Next talk was on Lucy - a lucene port to C.
  • Last talk gave an overview of the findings on analysing the Lucene community.
  • One other lightning talk by a guy using and deploying Typo3 pages. Typo3 does come with an integrated search engine. The presenter’s group built an extension to Typo3 that integrates the CMS with Solr search.
  • The final last talk is done by Grant Ingersoll on Mahout. Thanks for that!

Big Thanks to Lucid for sponsoring the meetup.

Apache Con ,

Hadoop Get Together Berlin @ Apache Con US Barcamp

November 3rd, 2009 at 9:05pm

This is my first real day at ApacheCon US 2009. I arrived yesterday afternoon, was kept awake by three Lucene committers until midnight: “Otherwise you will have a very bad jetlag”… Admittedly it did work out: I slept like a baby until about 08:00a.m. the next morning and am not that tired today.

Today Hackthon, Trainings and barcamp Apache happen in parallel. Ross Gardler tricked me into doing a presentation on my experiences on doing local user meetups. I put the slides online.

The general consent was, that it is actually not that hard to do such a meetup - at least if you are have someone locally to help organizing or do it in a town you know very well. There are ways to get support from the ASF for doing such meetups - people help you get speakers, talk to potential sponsors or find a location. In my experience if doing the event in your hometown, finding a location is not that hard: Either you are lucky having someone like newthinking store around. Or you can contact you local university or even your employer to find some conference room that you can use for free.

Getting the first two to three meetups up and running - especially finding speakers - is hard. However you should be able to benefit from being part of an Apache project already and probably know your community and know who would be willing to speak at one of those meetups. Once the meetup is well established, it should be possible to find sponsors to pay for video taping, free beer and pizza.

Keep in mind that having a fixed schedule ready in advance helps to attract people - it’s always good to know why one should travel to the meetup by train or plane. Don’t forget to plan for time for socializing after the event - having some beer and maybe food together makes it easy for people to connect after the meetup.

Apache Con, Get Together , ,