On Taming Text

2013-01-01 20:21
This time of the year I would usually post pictures of my bicycle standing in the snow somewhere in Tierpark. This year, however, I was tricked into using public transport instead: a) After my husband found a new job, we now share some of the route to work - and he isn't crazy about going by bike when it's snowing. b) I got myself a Nexus 7 earlier this month, which made it unnecessary to carry paper books with me when using public transport. c) Early in December Grant Ingersoll asked me for feedback on the by now nearly finished "Taming Text" (currently available as MEAP at Manning). So I even had a really interesting book to read on my way home.

Up to mid-December "Taming Text" was one of those books that always were very high on my to-read list: At least from the TOC it looked like the book to read if ever you wanted to write a search application. So I was really curious which topics it would cover and how deep explanations would go when I got the offer to read and review the book.

tl;dr



Short version: If you are building search applications - that is, anything that makes a search box available on a web site, be it an online store or a news article archive - this is the book to read. It covers all the gory details of how to implement features we have come to take for granted when using search: type ahead, spelling correction, faceting, automatic tagging and more. The book explains the value of these features from the user's side, shows how to implement them with proven technologies like Apache Lucene, OpenNLP and Mahout, and describes how those projects work internally to provide you with the functionality you need.

Longer summary



Search can be as easy as providing one box in some corner of your web site that users can type into to find relevant pages. However, when thinking about the topic just a little longer, several handy features come to mind that users have come to expect:

  • Type ahead to avoid superfluous typing - it also comes in handy to avoid spelling errors and to know exactly which query actually will return a decent number of documents.
  • Spelling correction is pretty much standard - and avoids user frustration with hard to spell query terms.
  • Faceting is a great way to discover and explore more content, in particular when there are a few structured attributes attached to your items (prices to books, colors to cars etc).
  • Named Entity Recognition is well known among publishers who use automatic tagging services to support their staff.
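To make the first item a bit more concrete, here is a minimal sketch of type ahead as a prefix lookup in a sorted vocabulary. It is not taken from the book and it is not how Lucene implements suggestions; the function name suggest and the vocabulary TERMS are made up for illustration.

```python
from bisect import bisect_left


def suggest(prefix, sorted_terms, limit=5):
    """Return up to `limit` terms from `sorted_terms` that start with `prefix`.

    Assumes `sorted_terms` is a sorted list of lowercase strings.
    """
    prefix = prefix.lower()
    # Binary search for the first term that is >= the prefix.
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        # All terms sharing the prefix are adjacent in sorted order.
        if not term.startswith(prefix):
            break
        matches.append(term)
        if len(matches) == limit:
            break
    return matches


# Hypothetical vocabulary of previously seen query terms.
TERMS = sorted(["lucene", "lucid", "mahout", "manning", "map", "opennlp"])

print(suggest("ma", TERMS))
```

A real system would also rank suggestions, for instance by how often a query was issued, instead of returning them alphabetically.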


The authors of Taming Text decided to structure the book around the task of building an automatic Question Answering system. Throughout the book they present technologies that need to be orchestrated to build such an application but are each valuable in their own right.

In contrast to Search Patterns (which is focused mainly on the product manager perspective and contains much less technical detail) Taming Text is the book to read for any engineer working on search applications. In contrast to books like Programming Collective Intelligence, Taming Text takes you one level further by not only showing the tools to use but also explaining their inner workings so that you can adapt them exactly to your use case. To me, Taming Text is the ideal complementary book to Mahout in Action (for the machine learning part) and Lucene in Action (for the search part).

Back in 1998 it was estimated that 80% of all information is unstructured data. In order to make sense of that wealth of data we need technologies that can deal with unstructured data. Search is one of the most basic but also most powerful ways to analyse texts. With a good mixture of theoretical background and hands-on examples Taming Text guides you through the process of building a successful search application, no matter if you are dealing with a vast product database that you want to make more accessible to your users, with an ever growing news archive or with several blog posts and Twitter messages that you want to extract data from.

Book: Search Patterns

2012-07-28 20:41
I got the book months ago during FOSDEM - the O'Reilly book table always is a pretty dangerous place as a meeting point for me: Search Patterns - Design for Discovery is one of those small, deceptively beautiful books that manage to explain effective search engine design by focusing on the end user's needs while still going into some detail concerning the basics of search engine backends.

We use search engines on a daily basis not only for finding content on the web but also for navigating shopping sites, discovering news content and even finding articles on blogs and open source project pages. Many discovery tasks can easily be expressed as a search problem and as a result be tackled with by now standard off-the-shelf software like Apache Lucene - or even the commercial counterparts from the enterprise search market. Still, search is oftentimes perceived as being simply a small box that users type queries (typically one or two terms) into and that as a result shows a list of some ten links.

After setting the stage for search in the first chapter the book goes into some more detail in "The anatomy of search". In a very approachable way it explains all the components, from user constraints and the graphical interface to the basics of retrieval and evaluating search performance in terms of precision and recall. The third chapter shows some behavioural patterns that make discovery easier for users - from incrementally constructing the answer and progressively disclosing more and more detail up to being predictable.
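For readers who have not met the two evaluation measures before: precision is the share of returned documents that are relevant, recall is the share of relevant documents that were returned. A small sketch of the computation follows; the function name and the document ids are invented for illustration and do not come from the book.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for one query.

    retrieved: ids of the documents the search engine returned
    relevant:  ids of the documents a human judged relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Guard against empty sets to avoid dividing by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four results returned, three documents relevant,
# two of the relevant ones were found.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four returned documents are relevant (precision 0.5), and two of the three relevant documents were found (recall of about 0.67).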

Finally the design patterns as identified by the authors are introduced. Pretty obvious to those working in the field but well explained to those not intimately familiar with the topic:


  • Though perceived as a mere convenience to type less by users, autocomplete can actually help guide the user's search in case of ambiguities and can help avoid imprecise results.
  • Expected as it might be by users, presenting the best result first actually goes a long way when building credibility for a search engine. Having more precise queries, e.g. as a result of autocomplete, helps here. So does having strong ranking criteria to build up a compelling ranking function that is used by default (even though others might be offered as an alternative for users to explore more and different results).
  • Federated search has both advantages (integrating otherwise isolated silos of knowledge) and disadvantages (its speed being dominated by the slowest connected search engine).
  • Faceted navigation is pretty much standard for any major search engine - giving the user the option to start with a broad query that returns an overwhelming amount of results but guiding the user when refining the query is one major way of driving searches.
  • Offering personalisation tends to be one beloved feature though it is particularly hard to implement and needs a good deal of user data to work well. Usually there are features that require less work to get done that are more promising to start with.
  • Pagination is just as much a standard expected by users - though its implementation can differ: Though we are used to clicking the next button, this actually may not make much sense and may just interrupt the user's flow. Much more appealing - but sometimes also confusing - can be interfaces that simply extend the result page when scrolling to its end.
  • Structured results provide a way to give the user more than just an outlink - triggered by specific searches it may be possible to directly answer the user's question instead of linking to content that answers it.
  • Actionable results are a way for the user to get active - either by voting on results, bookmarking them or sharing them with others.
  • Unified discovery is about accepting that search always plays a role in a bigger context and has to play well with the discovery mode the user is in: When searching for "apple" while browsing the category "electronics" it's rather unlikely that I am looking for the fruit. Similarly search should take context into account and support me seamlessly when switching from discovery to directed search and back to discovery mode.
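To illustrate the faceted navigation pattern from the list above, here is a minimal sketch that counts facet values for a result list and narrows the list once the user picks a value. The function names and the catalog entries (in particular the topic and format values) are assumptions made up for this example; they are not from the book.

```python
from collections import Counter


def facet_counts(items, field):
    """Count how many items carry each value of the structured attribute `field`."""
    return Counter(item[field] for item in items if field in item)


def refine(items, field, value):
    """Keep only the items whose `field` equals the facet value the user selected."""
    return [item for item in items if item.get(field) == value]


# Hypothetical result list for a broad query.
CATALOG = [
    {"title": "Lucene in Action", "topic": "search", "format": "paperback"},
    {"title": "Mahout in Action", "topic": "machine learning", "format": "paperback"},
    {"title": "Search Patterns", "topic": "search", "format": "ebook"},
]

print(facet_counts(CATALOG, "topic"))
print([item["title"] for item in refine(CATALOG, "topic", "search")])
```

The counts are what a search page would show next to each facet value; selecting one of them re-runs the same computation on the narrowed list.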


The book concludes by going into some detail on example search engines and presenting some features that are not yet commonplace but might change the world by employing search in new and creative ways.

Easy to read, well written, several nice examples to make the technical points simpler to understand. Definitely a good read for domain experts planning to build a search engine, designers trying to understand the basics of building effective search engines and engineers struggling for words to explain why a seemingly little box can cause a whole lot of pain when done wrong but a whole lot of joy when done right.

Apprenticeship patterns (O'Reilly)

2010-09-23 08:17
A few days ago I finished reading the book "Apprenticeship Patterns - Guidance for the Aspiring Software Craftsman" by Dave Hoover and Adewale Oshineye. The book is addressed to readers who have the goal of becoming great software developers.

One naive question one could ask is why there is a need for such a book at all. Students are trained in computer science at university, then enter some IT department and simply learn from their peers. So how is software development any different from other professions? Turns out there are a few problems with that approach: At university students usually don't get the slightest idea of what professional software development looks like. After four years of study they still have a long way to go before writing great software. When entering your average IT shop these juniors usually are put on some sort of customer project with tight deadlines. However, learning implies making mistakes; it implies having time to try different routes to find the best one. Lucky are those very few who join a team that has a way of integrating and training junior developers. Last but not least, at least in Germany, tech career paths are still rare: As soon as developers excel they are offered a promotion - which usually leads straight into management before they have even had a chance to become masters in their profession.

So what can people do who love writing software and want to become masters in their profession? The book provides various patterns, grouped by task:

  • Emptying the cup deals with setting up an attitude that enables learning: To be able to learn new skills the trainee first has to face his ignorance and realise that what he knows already is just a tiny little fraction of what differentiates the master from the junior.
  • In the second chapter "Walking the long road" the book deals with the problem of deciding whether to stick with software development or to go into management. Both paths provide their own hurdles and rewards - in the end the developer himself has to decide which one to take. Deciding for a technical career, however, might involve identifying new kinds of rewards: Instead of being promoted to senior super duper manager, this may involve benefits like getting a 20% project, setting up a company internal user group, or getting support for presenting one's projects at conferences. The chapter also deals with the motivational side of software development: Let's face it, professional development usually is way different from what we'd do if we had unlimited time. It may involve deadlines that cannot be met, it may involve customers that are hard to communicate with. One might even have to deal with unmotivated colleagues who have lower quality standards and no intention to learn more than what is needed to accomplish the task at hand. So there is the problem of staying motivated even if times get rough. Getting in touch with other developers - external and internal - can be a great help here: Attending user groups (or organising one), being part of an open source project, meeting regularly with other developers in one's general geographical area all may help to remember the fun things about developing software.
  • The third group of patterns has been put under the headline "Accurate self-assessment" - as people get better and better it gets ever harder to remember that there are techniques out there one does not yet know. Being the best in a team means that there is no more room to learn in that environment. It's time to find another group to get in touch with others again: To be the worst in a team means there is a lot of room for learning, and finding mentors helps with getting more information on which areas to explore next. Especially helpful is working on a common project with others - doing pair programming can help even with picking up just minor optimisations in their work environment.
  • The fourth chapter "Perpetual learning" deals with finding opportunities to learn new technologies - for example in a toy project that in contrast to professional work is allowed to break and can be used to try and test new techniques and learn new languages. Other sources for learning are the source code itself, tech publications in magazines, books (both new and classic), blogs and mailing lists. Reflecting on what you learned helps remember it later - one option to reflect may involve writing up little summaries of what you read and keeping them in a place where you can easily retrieve them (for me this blog has turned into such a resource - yeah, I guess writing this book summary is part of the exercise, it even was a proposal in the book itself). Another one of the best resources for reflection and continued learning is to share knowledge - though you may feel there are others out there way better than you are, you are the one who just went through all the initial loops that no master remembers anymore. You can explain concepts in easy to understand words. Sharing and teaching means quickly finding gaps in your own knowledge and fixing them as you go forward. Last but not least it is important to create feedback loops: It does not help to learn after three years of coding that what you did does not match a customer's expectations. As an apprentice you need faster feedback: On a technical level this may involve automated tests, code analysis and continuous integration. On a personal level it involves finding people to review your code. It means discussing your ideas with peers.
  • The last chapter on "Constructing your curriculum" finally deals with the task of finding a way to remain up to date, e.g. by following renowned developers' blogs, but also by studying the classic literature - there are various books in computer science and software development that were written back in the 60s and 70s but are still highly relevant.


The book does not give you a recipe to turn from junior to master in the shortest possible time. However, it successfully identifies situations many a software developer has encountered in his professional life that made him question his current path. It provides ideas on what to do to improve one's skills even if the current IT industry may not be best equipped with tools for training people.

My conclusion from the book was that the most important thing is getting in touch with other developers, exchanging ideas and working on common projects. Open source gets several mentions in the book, and for me, too, it has turned out to be a great source for getting feedback, help and input from the best developers I've met so far.

In addition meeting people who are working on similar projects face-to-face provides a lot of important feedback as well as new ideas to try out. Talking with someone over a cup of coffee for two hours sometimes can be more productive than discussing for days over e-mail. Hacking on a common project, maybe even in the same location, usually is the most productive way not only to solve problems but also to pick up new skills.

Books I found particularly helpful

2009-03-12 18:44
During the last few years I have read quite a few books that one could easily file under the category "Hacking books". Some of them were particularly interesting to me and have influenced the way I write code. The following list certainly is not complete at all - but it is a nice starting point.


  • Effective C++ - I have comparably little experience with C++ but this book really helped me understand some of the particularities.
  • Effective Java - even though I have been developing in Java for a few years, reading and revisiting Effective Java helps with understanding and dealing with some of the quirks of the JVM.
  • Mythical Man Month - although it is classical literature for people dealing with software projects, very well known and easy to understand, it is scary to see that the exact same mistakes are still common in today's software projects.
  • Concurrent programming in Java - quick start on concurrent programming patterns - primarily focussed on Java. Fortunately no collection of recipes but thorough background information.
  • Working effectively with legacy code - I really like to have a look into this book from time to time. Shows great ways of untangling bad code, refactoring it and making it testable.
  • XP books by Kent Beck - if you ever had any questions on what XP programming is and how you should implement it: These are the books to read. Don't trust what people call XP in practice as long as they are not willing to refine and improve their "agile processes". Keep on working on what stops you from delivering great code.
  • Why programs fail - a guide to systematic debugging - If you ever had to debug complex programs - and I bet you have - this is the book that explains how to do this systematically, and how to even have fun along the way.
  • Zen and the art of motorcycle maintenance - Not particularly about software development, but the techniques described apply stunningly well to it.
  • Release It! - just about to read that one. But already the first few pages are not only valuable and interesting but also entertaining.
  • Implementation Patterns - forgot that yesterday.
  • Presentation Zen - another one I forgot. Really helped me to make better presentations.


There are still quite a few good books on my list. If you have any recommendations - please leave them in the comments.

There are a few other book lists online in various blogs. Two examples are the ones below:
http://www.codinghorror.com/blog/archives/000020.html
http://www.joelonsoftware.com/navLinks/fog0000000262.html