Moving from Fast to Solr

2009-11-19 20:34
Sesat has published a nice in-depth report on why to move from Fast to Solr. The article also includes a description of the steps taken to move over as well as several statistics:

http://sesat.no/moving-from-fast-to-solr-review.html

On a related topic, the following article details where Apple is using Lucene/Solr to power its search. Spoiler: Look at Spotlight, their desktop search, as well as at the iTunes search with about 800 QPS.

Update: As the site above could not be reached for quite some time, you may want to look at the Google Cache version instead.

Open Source Expo 09

2009-11-16 22:17
I spent last Sunday and the following Monday at Open Source Expo Karlsruhe - co-located with web-tech and php-conference organized by the Software-and-Support Verlag. Together with Simon Willnauer I ran the Lucene/Mahout booth at the expo.

So far the conference is still very small (about 400 visitors) compared to free software community events. However, the focus was more on professional users. Accordingly, several projects showed that free software can be used successfully for various business use cases. Visitors were invited to ask Sun about their free software strategy. Questions concerning OpenJDK or MySQL were not uncommon. Large distributors like SuSE or Mandriva were present as well, as were smaller companies, e.g. those providing support for Apache OFBiz.

The Apache Lucene project was invited as an exhibitor as well. Together with PRC and ConCom we organized an Apache banner. Lucid Imagination sponsored several Lucene T-shirts to be distributed at the conference. At the very last minute, information (abstract, links to projects and mailing lists, and current users) was put together on flyers.

We arrived on Saturday, late in the evening. Together with a friend of mine we went for some Indian food at a really good restaurant close to the hotel. Big thanks to her for being our tourist guide - hope to see you back in Waldheim in December ;)



Sunday was pretty quiet - only a few guests arrived at the weekend. I was invited by David Zuelke to give a brief introduction to Mahout during his MapReduce Hadoop tutorial workshop. Thanks, David. Though lunch had already been served, people stayed to hear my presentation on large scale machine learning with Mahout. I was contacted by one of the students of Katharina Morik who was pretty interested in the project. Back at her research group people are working on RapidMiner - a tool for easy machine learning. It comes with a graphical user interface that makes it simple to explore various algorithm configurations and data workflow setups. It would be interesting to see how this tool helps people to understand machine learning. It would also be very interesting to learn what form of contribution to Mahout might be appropriate for research groups: maybe not code-wise, but more in terms of discussions and background knowledge.

Monday was a bit busier, with more people attending the conferences. Simon got a slot to present Lucene at the Open Stage track and show off the new features of Lucene 2.9. Those already using Lucene could be tricked into telling their Lucene success story at the beginning of the talk. At the booth we had a wide variety of people: from students trying to find a crawling and indexing system for their information retrieval course homework to professionals with various questions on the Apache Lucene project. The experience of people at the conference varied widely. That proved to be a pretty good reality check. Being part of the Lucene and the ASF community, one might be tempted to think that not knowing about Lucene is almost impossible. Well, it seems to be less impossible than I, at least, expected.

One last success: As the picture shows, YaCy is now powered by Lucene as well - at least in terms of T-shirts ;)

Lucene Meetup Oakland

2009-11-04 06:05
Though it is pretty late in the evening, the room is packed with some 100 people. Most of them are Solr or pure Lucene Java users. There are quite a few Lucene committers from all over the world at the meetup. Several have even heard about Mahout - some have even used it :)

Some introductory questions on index sizes and query volume: 1 million documents seems pretty standard for Lucene deployments - several people run 10 million as well. Some people even use indexes with up to billions of documents in Lucene - but at low query volume. Usually people run projects with about 10 queries per second, but some go up to 500.

Eric's presentation gives a nice introduction to what is going on with Lucene/Solr in terms of user interfaces. He starts with an overview of the problems that libraries face when building search engines - especially the faceting side of life. Especially interesting is Solritas - a Velocity response writer that makes it easy to render search responses not in XML but in simple templated output. He of course also included an overview of the features of LucidFind, the Lucid hosted search engine for all Lucene projects and sub-projects. Take-home message: The interface is the application, as are the URLs. Facets are not just for lists.

The second talk is given by Uwe, who gives an overview of the implementation of numeric searches, range queries and numeric range filters in Lucene.

The third presenter is Stefan on Katta - a project on top of Lucene that adds index splits, load balancing, index replication, failover and distributed TF-IDF. The mission of Katta is to build a distributed Lucene for large indexes under high query load. The project relies heavily on ZooKeeper for coordination. It uses Hadoop IPC for search communication.
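To make the idea of index splits combined with distributed TF-IDF a bit more tangible, here is a minimal sketch in Python. It is not Katta's implementation or API: the Shard class, the distributed_search function and the smoothed IDF formula are all assumptions made up for illustration. The point it tries to show is that every split has to score with the same, globally computed document frequencies, otherwise scores from different splits are not comparable when they get merged.

```python
import heapq
import math
from collections import Counter


class Shard:
    """One index split holding a subset of the documents (hypothetical, not Katta's API)."""

    def __init__(self, docs):
        # docs: mapping of document id -> list of terms
        self.docs = {doc_id: Counter(terms) for doc_id, terms in docs.items()}

    def stats(self, query_terms):
        """Return the local document count and local document frequencies."""
        df = Counter()
        for counts in self.docs.values():
            for term in query_terms:
                if term in counts:
                    df[term] += 1
        return len(self.docs), df

    def search(self, query_terms, idf, k):
        """Score local documents with the globally agreed IDF values."""
        scored = []
        for doc_id, counts in self.docs.items():
            score = sum(counts[t] * idf[t] for t in query_terms)
            if score > 0:
                scored.append((score, doc_id))
        return heapq.nlargest(k, scored)


def distributed_search(shards, query_terms, k=3):
    # Phase 1: gather statistics so that every shard uses the same IDF.
    total_docs = 0
    global_df = Counter()
    for shard in shards:
        n, df = shard.stats(query_terms)
        total_docs += n
        global_df.update(df)
    # Smoothed IDF; the exact formula is an assumption for this sketch.
    idf = {t: math.log((1 + total_docs) / (1 + global_df[t])) + 1.0
           for t in query_terms}

    # Phase 2: each shard returns its local top k; merge into a global top k.
    candidates = []
    for shard in shards:
        candidates.extend(shard.search(query_terms, idf, k))
    return [doc_id for _, doc_id in heapq.nlargest(k, candidates)]
```

In a real deployment the two phases would be remote calls to the nodes serving each split, and replication and failover would decide which node answers for a given split. The sketch leaves all of that out.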

Lightning talks include:

  • Zoie: A realtime search extension for Lucene, developed inside of LinkedIn and now open sourced at Google Code.
  • Jukka proposed a new project: A Lucene-based content management system.
  • Next presenter highlighted the problem of document-to-document search. The problem here is that queries are not just one or two terms but more like 40 terms.
  • Next talk shared some statistics: an average response time of more than 2s leads to a 40% abandonment rate for sites. The presenter is very interested in the Lucene Ajax project. Before using Solr the speaker set up projects with solutions like Endeca or Mercato. Solr to him is an alternative that supports faceting.
  • Andrzej gives an overview of index pruning in 5min - giving details on which approaches to index pruning are currently being discussed in research as well as in the Lucene JIRA.
  • Next talk was on Lucy - a Lucene port to C.
  • Another talk gave an overview of the findings from analysing the Lucene community.
  • One other lightning talk was given by a guy using and deploying Typo3 pages. Typo3 does come with an integrated search engine. The presenter's group built an extension to Typo3 that integrates the CMS with Solr search.
  • The final talk is given by Grant Ingersoll on Mahout. Thanks for that!


Big thanks to Lucid for sponsoring the meetup.

Open Source Expo

2009-10-29 07:38
Title: Open Source Expo
Location: Karlsruhe
Description: There will be a booth at Open Source Expo introducing interested visitors to the Apache projects Lucene and Mahout. Of course we are also happy to answer any questions on the ASF in general.
Start Date: 2009-11-15
End Date: 2009-11-16

Lucene 2.9 White Paper

2009-10-28 21:51
Lucid recently published a white paper that explains the changes and improvements in the new 2.9 release. It is interesting for all who are thinking about upgrading to the new Lucene version or who generally want to know what is going on at Lucene.

Videos are up

2009-10-22 07:31
As of yesterday the videos of the last Apache Hadoop Get Together Berlin are available online.

Thanks to the speakers for providing insight into their projects and thanks to Cloudera for sponsoring the videos.

The next meetup will be announced soon - three talks have already been proposed. In addition, StudiVZ offered to sponsor video taping of the next Get Together. Looking forward to seeing you in Berlin in December.

Lucene 2.9 @ Heise

2009-10-06 18:13
After last week's Hadoop Get Together heise published an in-depth article on the changes and improvements that come with the latest Lucene 2.9 release.



Thanks to Simon Willnauer for helping me write this article and patiently explaining several new features. Thanks also to Uwe Schindler for kindly proof-reading the article before it was sent out to Heise.

Upcoming: Apache Hadoop Get Together Berlin

2009-09-23 19:00
This is a friendly reminder that the next Apache Hadoop Get Together takes place next week on Tuesday, 29th of September* at newthinking store (Tucholskystr. 48, Berlin).


  • Thorsten Schuett, Solving Puzzles with MapReduce.
  • Thilo Götz, Text analytics on jaql.
  • Uwe Schindler, Lucene 2.9 Developments.

Big thanks go to newthinking store for providing the venue for free and to Cloudera for sponsoring videos of the talks. Links to the videos will be posted on the upcoming page linked above, as well as on the Cloudera Blog, soon after the event. Yet another thanks goes to O'Reilly for providing three "Hadoop: The Definitive Guide" books to be raffled at the event.

The 7th Get Together is scheduled for December 16th. If you would like to submit a talk or sponsor the event, please contact me.


Hope to see you in Berlin next week.


* The event is scheduled right before the UIMA workshop in Potsdam, which may be of interest to you if you are a UIMA user.

September 2009 Hadoop Get Together Berlin

2009-08-17 09:11
The newthinking store Berlin is hosting the Hadoop Get Together user group meeting. It features talks on Hadoop, Lucene, Solr, UIMA, Katta, Mahout and various other projects that deal with making large amounts of data accessible and processable. The event brings together leaders from the developer and user communities. The speakers present projects that build on top of Hadoop as well as case studies of applications being built and deployed on Hadoop. After the talks there is plenty of time for discussion, some beer and food.

There is also a related Xing Group on the topic of building scalable information retrieval systems. Feel free to join and meet other developers dealing with the topic of building scalable solutions.


Agenda:

Please see upcoming page for updates.


  • Thilo Götz: JAQL
  • Uwe Schindler: Lucene 2.9
  • nugg.ad: Ad Recommendation with Hadoop
  • T. Schuett: Solving puzzles with Hadoop.


If you would like to give a presentation yourself: There are additional slots of 20 minutes each available. A projector is provided. Just bring your slides. To include your topic on this web site as well as in the upcoming.org entry, please send your proposal to Isabel.

After the talks there will be time for an open discussion. We will go to a nearby restaurant after the event, so there will be plenty of time for talking, discussing and new ideas.

Location

The Apache Hadoop Get Together takes place at the newthinking store Berlin:



newthinking store GmbH

Tucholskystr. 48

10117 Berlin




Accommodation

  • Homeli - not exactly in walking distance, but only a few S-Bahn stations away. Very nice Bed and Breakfast hotel. (The offer is only valid if you stay for at least three nights.)

  • Circus Berlin is a combination of hostel and hotel close by.

  • Zimmer in Berlin is yet another Bed and Breakfast hotel.

  • House boat near Friedrichshain



Announcements

If you would like to be notified of news please subscribe to our mailing list. The meetings are usually also announced on the project mailing lists as well as on the newthinking store website.


Contact

In case you have any trouble reaching the location or finding accommodation feel free to contact the organiser Isabel.

Past events

Lucene slides online

2009-06-30 10:04
The slides of the Lucene talk at the last Apache Hadoop Get Together Berlin are available online: Lucene Slides. Especially interesting to me are the last few slides which detail both index size and machine setup:

The installation is running on two standard PCs with 2 dual-core processors (usual speed, bought in January 2008 for about 4000 Euro). They have 32 GB RAM, of which 24 GB are used as a ramdisk for the index. Without the ramdisk, initial queries, especially those accessing fields, are slower but still acceptable. The index contains about 19 million documents, that is 80 GB of indexed text + billions of annotated tags.