Archive

Posts Tagged ‘Hadoop’

February 2012 Apache Hadoop Get Together Berlin

January 31st, 2012 at 8:34pm

The upcoming Apache Hadoop Get-Together is scheduled for 22. February, 6 p.m. - taking place at Axel Springer, Axel-Springer-Str. 65, 10888 Berlin. Thanks to Springer for sponsoring the location!

Note: It is important to indicate attendance. Due to security restrictions at the venue only registered visitors will be permitted. Get your ticket here: https://www.xing.com/events/hadoop-22-02-859807

Talks scheduled thus far:

Markus Andrezak : “Queue Management in Product Development with Kanban - enabling flow and fast feedback along the value chain” - It’s a truism today that fast feedback from your market is a key advantage. This talk is about how you can deliver smallest product increments or MVPs (minimal viable products) quickly to your market to get fastest possible feedback on cause and effect of your product changes. To achieve that, it helps to provide a continuous deployment infrastructure as well as all you need for A/B testing and other feedback instruments. To make the most of these achievements, Kanban helps to limit work in progress, thus manage queues and speed up lead times (time from order to delivery or concept to cash). This helps us speed through the OODA Loop, i.e. Eric Ries’ (The Lean Startup) Model -> Build -> Code -> Measure -> Data -> Validate -> Model. The more we can go through the loop, the more we have a chance to fine tune and validate our model of the business and finally make the right decisions.

Markus is one of Germany’s leading Kanban practitioners - writing and presenting talks about it in numerous publications and conferences. He will provide a brief view into how he is achieving fast feedback in diverse contexts.
Currently he is Head of mobile commerce at mobile.de.

Martin Scholl : “On Firehoses and Storms: Event Thinking, Event Processing” - The SQL doctrine is still in full effect and still fundamentally affects the way software is designed, the state it is stored in as well as the system architecture. With the NoSQL movement people have started to realize that the manner in which data is stored affects the full stack — and that reduction of impedance mismatch is a good thing(TM). “Thinking in events” follows this tradition of questioning what is state-of-the-art. Modeling a system not in mutable entities (as with data stores) but as a stream of immutable events that incrementally modify state, yields results that will exceed your expectations. This talk will be about event thinking, event software modeling and how Twitter’s Storm can help you process events at large.

Martin Scholl is interested in data management systems. He is also a Founder of infinipool GmbH.

Fabian Hüske : “Large-Scale Data Analysis Beyond Map/Reduce” - Stratosphere is a joint project by TU Berlin, HU Berlin, and HPI Potsdam and researches “Information Management on the Cloud”. In the course of the project, a massively parallel data processing system is built. The current version of the system consists of the parallel PACT programming model, a database inspired optimizer, and the parallel dataflow processing engine, Nephele. Stratosphere has been released as open source. This talk will focus on the PACT programming model, which is a generalization of Map/Reduce, and show how PACT eases the specification of complex data analysis tasks. At the end of the talk, an overview of Stratosphere’s upcoming release will be given.

Fabian has been a research associate at the Database Systems and Information Management (DIMA) group at the Technische Universität Berlin since June 2008. He is working in the Stratosphere research project, focusing on parallel programming models, parallel data processing, and query optimization. Fabian started his studies at the University of Cooperative Education, Stuttgart, in cooperation with IBM Germany in 2003. During that course, he visited the IBM Almaden Research Center in San Jose, USA, twice and finished in 2006. Fabian undertook his studies at Universität Ulm and earned a master’s degree in 2008. His research interests include distributed information management, query processing, and query optimization.

A big Thank You goes to Axel Springer for providing the venue at no cost for our event and for paying for videos to be taped of the presentations. A huge thanks also to David Obermann for organising the event.

Looking forward to seeing you in Berlin.

Get Together , , ,

Video up: Douwe Osinga

December 9th, 2011 at 10:01pm

Video: Max Jacob on Pig for NLP

December 9th, 2011 at 9:26pm

Slides online

December 9th, 2011 at 6:55am

Slides of this week’s Apache Hadoop Get Together Berlin are online at:

Overall a great event, well organised - looking forward to seeing you at the next Get Together. If you want to get in touch with our participants, learn about new events or simply chat between meetups join our Apache Hadoop Get Together Linked.In Group.

Get Together , , , ,

Apache Hadoop Get Together Berlin December 2011

December 8th, 2011 at 1:50am

First of all a huge Thank You to David Obermann for organising today’s Apache Hadoop Get Together Berlin: After a successful Berlin Buzzwords and a rather long pause following that finally a Christmas meetup took place today at Smarthouse, kindly sponsored by Axel Springer and organised by David Obermann from idealo. About 40 guests from Neofonie, Nokia, Amen, StudiVZ, Gameduell, TU Berlin, nurago, Soundcloud, nugg.ad and many others made it to the event.

In the first presentation Douwe Osinga from triposo went into some details on what Triposo is all about, how development for it differs in terms of scope and focus at larger corporations and what patterns they use for getting the data crawled, cleaned and served to users.

The goal of Triposo is to be able to build travel guides in a fully automated way. In contrast to simply creating a catalog of places to go to the goal is to have an application that is competitive to Lonely Planet books: Have tours, detailed background information, recommend places to visit based on wheather and seasonal signals, allow users to create travel books.

Joining Triposo from Google, Douwe gave a rather interesting perspective on what makes a startup interesting for innovative ideas. There are four interesting aspects of application development that according to his talk matter for Google projects: First is embracing failure. Not only can single hard disks fail, but servers might be switched off automatically for maintenance, even entire datacenters going offline must not affect your application. Second is a strong focus on speed: Developers working with dynamic languages like Python that allow for rapid prototyping at the expense of slower runtime are generally frowned upon. Third building block is the focus on search that is ingrained in every piece of architecture and thinking. Fourth and last is a strong mentality to build your own which may lead to great software but leaves software developers in an isolated island of proprietary software that can limit but at least shapes your way of thinking.

He gave Youtube as an example: Though built on top of MySQL, implemented in Python and certainly not failure proof in every aspect they succeeded by concentrating on users’ needs, time to market and iteratively improving their software with a frequent (as in one week) develop-release-deploy cycle. When entering new markets and providing innovative applications it often is crucial to be able to move quickly at the expense of speed and stability. It certainly is important to consider different architectures and chose the one that is appropriate to solve the problem at hand. Same reasoning applies for Apache Hadoop as well: Do not try to solve problems with it that it is not made to solve. Instead first think what is the right tool for your job.

Triposo itself is built on top of 12 data sources. Most are freely available, integrated to build a usable and valuable travel guide application for iOS and Android. The features available in Triposo can be phrased in terms of a search and information retrieval problem setting and as such lend itself well for integrated sources. With offers from Amazon, Google itself, Dropbox and the like it has become easy to deploy applications in an elastic way and scale with your user base and demand for more country coverage. For them it proved advantages to go for an implementation based on dynamic languages for pure development speed.

When it comes to QA they take a semi-manual approach: There are scripts checking recall (Brandenburger Tor must be found for the Berlin guide) as well as precision (there must be only one Brandenburger Tor in Berlin). Those rules need to be manually tuned.

When integrating different sources you quickly run into a duplicate discovery problem. Their approach is pretty pragmatic: Merge anything that you are confident enough to say it is a duplicate. Kill everything that is likely a duplicate but you are not confident enough to merge. The general guideline is to rather miss a place than have it twice.

For the wikipedia source so far they are only parsing the English version. There are plans to also support other languages - in particular for parsing to increase data quality as e.g. for some places geo coordinates may be available in the German article but not in the English one.

Though not going into too many technical details the talk gave some nice insights as to the strengths and weaknesses of different company sizes and mindsets when it comes to innovation as well as stabilization. Certainly a startup to watch, glad to hear that though incorporated in the US most developers actually live in Berlin now.

The second talk was given by Max Jakob from Neofonie GmbH (working on EU funded research project Dicode) gave an overview of their pipeline for named entity extraction and disambiguation based on a language model extracted from the raw German wikipedia dump. They used Pig to scale a pipeline down from about a week to 1.5 hours with not much development overhead: Quite some logic could be re-used from the open source project pignlproc initiated by Olivier Grisel. This project already features a Wikipedia loader, a UDF for extracting information from Wikipedia documents and additional scripts for training and building corpora.

Based on that they defined the ML probability of a surface form being a named entity. The script itself is not very magical: The whole process can be expressed as a few steps involving grouping and couting tuples. The effect in terms of runtime vs. development time however is impressive.

Checkout their DICODE github project for further details on the code itself.

After the meetup about 20 attendees followed David to a bar nearby. It is always great to get a chance to talk to the speakers, exchange experiences with others and learn more on what people are actually working on with Hadoop after the event.

Slides of all talks are going to be posted soon, videos go online as soon as they are post processed so stay tuned for further information.

Looking forward to seeing you again for the next meetup. If you could not make it this time, there is a very easy way to not have that happen again next time: First speaker to submit a talk proposal to David sets the date and time of the meetup (taking into account any constraints with venue and video taping of course).

Get Together , ,

December Apache Hadoop Get Together Berlin

November 24th, 2011 at 8:14pm

First of all please note that meetup organisation is being transitioned over to our xing meetup group. So in order to be notified of future meetings, make sure to join that group. Please make also sure to register for the December event as in contrast to past meetups this time space will be limited, so make sure to grab a ticket. If you cannot make it, please let the organiser know so he can issue additional tickets.

For those of you currently following this blog only for announcements:

When: December 7th 2011, 7 p.m.

Where: Smarthouse GmbH, Erich-Weinert-Str. 145, 10409 Berlin

Speaker: Martin Scholl
Title: On Firehoses and Storms: Event Thinking, Event Processing

Speaker: Douwe Osinga
Title: Overview of the Data Processing Pipeline at Triposo

Looking forward to seeing you at the next Apache Hadoop Get Together Berlin in December.

Get Together , ,

Cloudera in Berlin

November 14th, 2011 at 8:24pm

Cloudera is hosting another round of trainings in Berlin in November this year. In addition to the trainings on Apache Hadoop this time around there will also be trainings on Apache HBase.

Register online via:

Get Together ,

Apache Hadoop Get Together - Hand over

November 2nd, 2011 at 4:20pm

Apache Hadoop receives lots of attention from large US corporations who are using the project to scale their data processing pipelines:

“Facebook uses Hadoop and Hive extensively to process large data sets. [...]” (Ashish Thusoo, Engineering Manager at Facebook), “Hadoop is a key ingredient in allowing LinkedIn to build many of our most computationally difficult features [...]” (Jay Kreps, Principal Engineer, LinkedIn), “Hadoop enables [Twitter] to store, process, and derive insights from our data in ways that wouldn’t otherwise be possible. [...]” (Kevin Weil, Analytics Lead, Twitter). Found on Yahoo developer blog.

However the system’s use is not limited to large corporations only: With 101tec, Zanox, nugg.ad, nurago also local German players are using the project to enable new applications. Add components like Lucene, Redis, CouchDB, HBase and UIMA to the mix and you end up with a set of majour open source components that allow developers to rapidly develop systems that until a few years ago were possible only either in Google-like companies or in research.

The Berlin Apache Hadoop Get Together started in 2008 allowed to learn more on how the average local company leveraged this software. It is a platform to get in touch informally, exchange knowledge and best practices across corporate boundaries.

After three years of organising that event it is time to hand it over to new caring hands: David Obermann from Idealo kindly volunteered to take over organisation. He is a long-term attendee of the event and will continue it in the roughly the same spirit as before: Technical talks on success stories by users, new features by developers - not solely restricted to Hadoop only but also taking into account related projects.

A huge Thank You for taking up the work of co-ordinating, finding a venue and a sponsor for the videos goes to David! If any of you attending the event think that you have an interesting story to share, would like to support the event financially or just help out please get in touch with David.

Looking forward to the next Apache Hadoop Get Together Berlin. Watch this space for updates on when and where it will take place.

Get Together , , , , ,

Video is up - Simon Willnauer on Lucene 4 Performance improvements

February 22nd, 2011 at 9:21pm

CFP - Berlin Buzzwords 2011 - search, score, scale

January 26th, 2011 at 8:00am

This is to announce the Berlin Buzzwords 2011. The second edition of the successful conference on scalable and open search, data processing and data storage in Germany,
taking place in Berlin.


Call for Presentations Berlin Buzzwords

http://berlinbuzzwords.de

Berlin Buzzwords 2011 - Search, Store, Scale

6/7 June 2011

The event will comprise presentations on scalable data processing. We invite you to submit talks on the topics:

  • IR / Search - Lucene, Solr, katta or comparable solutions
  • NoSQL - like CouchDB, MongoDB, Jackrabbit, HBase and others
  • Hadoop - Hadoop itself, MapReduce, Cascading or Pig and relatives

Closely related topics not explicitly listed above are welcome. We are looking for presentations on the implementation of the systems themselves, real world applications and case studies.

Important Dates (all dates in GMT +2)

  • Submission deadline: March 1st 2011, 23:59 MEZ
  • Notification of accepted speakers: March 22th, 2011, MEZ.
  • Publication of final schedule: April 5th, 2011.
  • Conference: June 6/7. 2011

High quality, technical submissions are called for, ranging from principles to practice. We are looking for real world use cases, background on the architecture of specific projects and a deep dive into architectures built on top of e.g. Hadoop clusters.

Proposals should be submitted at http://berlinbuzzwords.de/content/cfp-0 no later than March 1st, 2011. Acceptance notifications will be sent out soon after the submission deadline. Please include your name, bio and email, the title of the talk, a brief abstract in English language. Please indicate whether you want to give a lightning (10min), short (20min) or long (40min) presentation and indicate the level of experience with the topic your audience should have (e.g. whether your talk will be suitable for newbies or is targeted for experienced users.) If you’d like to pitch your brand new product in your talk, please let us know as well - there will be extra space for presenting new ideas, awesome products and great new projects.

The presentation format is short. We will be enforcing the schedule rigorously.

If you are interested in sponsoring the event (e.g. we would be happy to provide videos after the event, free drinks for attendees as well as an after-show party), please contact us.

Follow @berlinbuzzwords on Twitter for updates. News on the conference will be published on our website at http://berlinbuzzwords.de.

Program Chairs: Isabel Drost, Jan Lehnardt, and Simon Willnauer.

Schedule and further updates on the event will be published on http://berlinbuzzwords.de Please re-distribute this CfP to people who might be interested.

Contact us at:

newthinking communications GmbH
Schönhauser Allee 6/7
10119 Berlin, Germany
Julia Gemählich
Isabel Drost
+49(0)30-9210 596

Berlin Buzzwords , , , , ,