ApacheCon – last day
Day three of ApacheCon started with interesting talks on Tomcat 7, including an introduction to the new features of that release. Those include better memory leak prevention and detection capabilities; implementing them has led to the discovery of various leaks that appear under more or less obscure circumstances in well-known open source libraries and in the JVM itself. Better management and reporting facilities are also part of the new release.
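Not all of these leaks are exotic; one very common pattern is a web application that starts a background thread on deployment but never stops it, which keeps the webapp's class loader from being garbage collected across reloads. The listener below is a minimal, hypothetical sketch of that pattern (not code from the talk); Tomcat 7's leak detection reports such left-over threads when the application is stopped or reloaded.

```java
import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;

// Hypothetical example of a leaking webapp component.
public class PollingListener implements ServletContextListener {

    private Thread worker;

    @Override
    public void contextInitialized(ServletContextEvent sce) {
        worker = new Thread(new Runnable() {
            @Override
            public void run() {
                while (true) {                    // never checks for interruption
                    try {
                        Thread.sleep(60000);      // pretend to poll some resource
                    } catch (InterruptedException e) {
                        // swallowed, so the thread keeps running after undeploy
                    }
                }
            }
        }, "polling-worker");
        worker.start();
    }

    @Override
    public void contextDestroyed(ServletContextEvent sce) {
        // The thread is never stopped here, so it keeps a reference to the
        // webapp's class loader alive – exactly the kind of leak Tomcat 7 flags.
    }
}
```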
As I spent the start of the third day over at the Tomcat track, I unfortunately missed the Tika and Nutch presentations by Chris Mattmann; I am happy that at least the slides were published online: $LINK. The development of Nutch was especially interesting for me, as that was the first Apache project I got involved with back in 2004. Nutch started out as a project with the goal of providing an open source alternative for internet-scale search engines. Based on Lucene as its indexing core, it also provides crawling, content extraction and link analysis.
With the focus on building an internet-scale search engine, the need for a distributed processing environment quickly became apparent. Initial implementations of a Nutch distributed file system and a MapReduce engine led to the creation of the Apache Hadoop project.
In recent years it has been comparatively quiet around Nutch. Besides Hadoop, content extraction was also factored out of the project, into Apache Tika. At the moment development is gaining momentum again. Future development is supposed to focus on building an efficient crawling engine: the project wants to leverage Apache HBase as the storage backend, Tika for content extraction, and Solr as the indexing backend.
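To get a feel for how the Tika and Solr pieces of that picture fit together, here is a minimal, hypothetical sketch (not actual Nutch code): it extracts plain text from a fetched page with the Tika facade and pushes it into Solr via SolrJ. The URL, Solr core and field names are assumptions.

```java
import java.io.InputStream;
import java.net.URL;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;

public class FetchParseIndex {
    public static void main(String[] args) throws Exception {
        String url = "http://example.com/";   // assumed crawl target

        Tika tika = new Tika();
        Metadata metadata = new Metadata();
        String text;
        try (InputStream in = new URL(url).openStream()) {
            // Tika detects the content type and extracts plain text plus metadata
            text = tika.parseToString(in, metadata);
        }

        // Hypothetical Solr core ("crawl") and field names
        try (SolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/crawl").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", url);
            doc.addField("title", metadata.get(TikaCoreProperties.TITLE));
            doc.addField("content", text);
            solr.add(doc);
            solr.commit();
        }
    }
}
```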
I loved the presentation by Geoffrey Young on how Ticketmaster replaced their old MySQL-based search system with Solr for better performance and more features. Indexing documents representing music CDs presents some special challenges when it comes to domain modeling: there are bands with names like “!!!”. In addition, users are very likely to misspell certain artist names. In contrast to large search providers like Google, such businesses usually have neither the human resources nor enough log data to provide comparable coverage, e.g. when implementing spell-checking. A very promising and agile approach taken instead was to parse log files for the most common failing queries and learn from them which features users need: there were many queries including geo information, coming from users looking for an event at one specific location. As a result, geo information was added to the index, leading to happier users.
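The log mining itself does not need to be fancy. A minimal, hypothetical sketch (the log format, field names and file path are made up, not from the talk) that counts the most frequent zero-hit queries could look like this:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Assumed log format, one line per search, e.g.:
//   2010-11-05 12:34:56 q="madonna kln" hits=0
public class FailingQueryReport {

    private static final Pattern LINE = Pattern.compile("q=\"([^\"]*)\" hits=(\\d+)");

    public static void main(String[] args) throws IOException {
        Map<String, Integer> zeroHitCounts = new HashMap<>();

        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            Matcher m = LINE.matcher(line);
            if (m.find() && Integer.parseInt(m.group(2)) == 0) {
                // count how often each query returned no results
                zeroHitCounts.merge(m.group(1), 1, Integer::sum);
            }
        }

        // print the most frequent failing queries first
        zeroHitCounts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()))
                .limit(20)
                .forEach(e -> System.out.println(e.getValue() + "\t" + e.getKey()));
    }
}
```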