Being in San Francisco

2011-11-06 04:47
I spent the last two weeks together with Thilo in San Francisco and neighboring areas. I had asked beforehand for recommendations on where to go and what to do, and had purchased a "Rough Guide to California" as well as a "Lonely Planet guide for San Francisco". In addition I shared my arrival and departure times with a few people I know here. As was to be expected, our schedule quickly grew until it exploded, so we ended up doing greedy optimisation, squeezing things in and taking them out again as we went along. The result of all that: an amazing two weeks that went by way too fast, and the conclusion that we do need to return and bring way more time next time around.

A huge thanks for the warm welcome, for the shared local knowledge on where to go, the invitations for lunch and dinner, as well as the fun hours at the Castro during Halloween. Special thanks to Datameer, who when asked for accommodation recommendations kindly offered to host us - it makes such a big difference to get the chance to stay in a local neighborhood and avoid hotels altogether. All in all it's amazing to fly across an ocean, cross multiple time zones and arrive in a city that almost feels like home.

Following is a brief overview of what our final schedule turned out to be - happy to share the pictures we took privately. After getting our car on Sunday, we went over to Berkeley to see the impressive campus - and got to see part of the Berkeley Occupy movement.

Day one was reserved for sports

After biking the bridge we went down to Sausalito. We got some delicious food at fish - a restaurant serving all sorts of healthy and tasty fish dishes. After that we went out in a canoe from Sea Trek Kayaking, taking a closer look at the houseboat community and making a brief tour over to the Sausalito ferry port. Back at the shore we cycled into Sausalito and took the ferry to San Francisco.

Day 2 was booked for Highway #1 to Santa Cruz

We headed down scenic Highway #1, past the cliffs at Half Moon Bay. We went out for a hike in Butano State Park to hug some redwood trees and take pictures of a fairy-tale-like forest, then headed over to Pigeon Point Lighthouse - even if you are not into staying at hostels you should stop there: the place offers a great view of the ocean and of the lighthouse itself. Finally we went down to beautiful Santa Cruz.

Day 3 Muir Woods

Not much to be said here: it's always impressive to walk underneath huge redwood trees and to go hiking in the surrounding mountains for a better view.

Weekend for Yosemite

We took the route via Highway 140 - the drive itself was already interesting as it took us through quite a different countryside than what we had seen until then. We got through Mariposa and finally over to Midpines. We stayed at the friendly Bug Rustic Mountain resort - classified as a hostel, it features a spa with sauna and whirlpool. We were lucky enough to arrive on the Friday before Halloween and got to see a screening of the "Rocky Horror Picture Show" - including a printed transcript with the lines to be shouted at the screen, for those who do not know these already.

On Saturday we went hiking in amazing Yosemite park. Only then did I realize how close together different landscapes can be: it took us only half a day by car to get from coast and ocean over to high mountains. We chose to hike to the Upper Yosemite Fall - returning after seven miles of rather steep trails on a sunny and fortunately quite clear day, we were absolutely tired.

Day 7 Halloween

We spent Halloween morning over in St. Helena and went to the Francis Ford Coppola Winery. Hard to believe, but several tasty types of Californian wine do exist - at least that's what Thilo confirmed while in St. Helena.

The evening was booked for dressing up and going to the Castro. Though presumably calmer than in past years, the area is still a must-go for Halloween if you are into watching people walking around in impressive costumes.

Day 8 for Alcatraz

Not much to be added here - don't miss it if you visit San Francisco.

Day 9 for Chinatown

After several busy days we only went to Chinatown that day and tried to recover for the rest of the afternoon - and went out for Día de los Muertos in the evening.

Day 10 for watching whales

I'm happy I took seasickness precautions on that one. It was a bit of a rough ride on the catamaran, but in the end we got to see whales close to the Farallon Islands. The naturalist did a great job, not only explaining the life of the whales but also providing some background on the islands.

Day 11 for Point Reyes and North Beach

In the morning we followed the recommendation to go see North Beach - if you like Friedrichshain/Kreuzberg in Berlin or Dresden Neustadt, do not miss North Beach in San Francisco.

The afternoon was reserved for driving over to the Point Reyes Lighthouse. Though we did not spot any whales in the water, the landscape was amazing by itself.

Day 12 - final walk through San Francisco

We took some time to walk along Hayes Street and through Golden Gate Park up to the Cliff House, and took a Muni bus back to the city - the way back was just too long to walk.

Thanks again to Stefan, St.Ack, Jon, JD, Ted, Ellen, Doug, Anne, Johan, Doris, Jens, Lance, Felix, Markus and everyone else who helped make this trip as awesome as it really was. Sorry to everyone who I did not manage to meet or get in touch with - hopefully we can fix that next time I'm here, or next time you are in Berlin.

O'Reilly Strata - day one afternoon lectures

2011-02-13 22:18

Big data at startups - Infochimps

As a startup, to get good people there is no other option than to grow your own: offer the chance to gain a lot of experience in return for a not-so-great wage. Start out with really great hires:

  • People who have the "get shit done gene": they discover new projects, are proud to contribute to team efforts, and are confident making changes to a code base they probably have not seen beforehand. To find these people, ask open-ended questions in interviews.
  • People who are passionate learners, who use the tools out there, use open code and are willing to be proven wrong.
  • People who are generally fun to work with.

Put these people on small, non-mission-critical initial projects - make them fail on parallel tasks (and tell them they will fail) to teach them to ask for help. What is really hard for new hires: learning to deal with git, ssh keys and command line tools, knowing what to do and when to ask for help, and knowing what to do when something breaks.

Infochimps uses Kanban for organisation: each developer has one task he has chosen at any given point in time. He is responsible for getting that task done - which may well involve getting help from others. Being responsible for a complete feature is one big performance boost once the feature truly goes online. Code review is used for teachable moments - and for cases where something really goes wrong.

Development itself is organised to optimise for developers' joy - which usually means to take Java out of the loop.

Machine learning at Orbitz

They use Hadoop mostly for log analysis. Here too they encountered the problem of fields or whole entries missing from the original log format. To be able to dynamically add new attributes and deal with growing data volumes they moved from a data warehouse solution to Apache Hadoop. Hadoop is used for data preparation before training, for training recommender models, and for cross-validation setups. Hive has been added for ad-hoc queries, usually issued by business users.

Data scaling patterns at LinkedIn

When scaling to growing data volumes, LinkedIn developers gathered a few patterns that help make dealing with data easier:

  • When building applications, constantly monitor your invariants: it is frustrating to run an hour-long job just to find out at the very end that you made a mistake during data import.
  • Have a QA cluster, have versioning on your releases to allow for easy rollback should anything go bad. Unit tests go without saying.
  • Profile your jobs to avoid bottlenecks: Do not read from the distributed cache in a combiner - do not reuse code that was intended for a different component without thorough review.
  • Dealing with real-world data means dealing with irregular, dirty data: when generating pairs of users for connection recommendations, Obama caused problems as he is friends with seemingly every American.

The biggest bottleneck, however, is IO during shuffling, as every mapper talks to every reducer. As a rule of thumb, do most of the work on the map side and minimise the data sent to the reducers. This also applies to many of the machine learning M/R formulations. One idea for reducing shuffling load is to pre-filter on the map side with Bloom filters.
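The Bloom-filter idea can be illustrated with a toy sketch (not LinkedIn's actual implementation): build a small filter over the keys the reduce side cares about, ship it to every mapper, and drop non-matching records before they ever hit the shuffle. A Bloom filter may let through a few false positives, but it never drops a wanted key:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: may yield false positives, never false negatives."""
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

# Build the filter from the (small) set of keys the reducers need,
# then filter on the map side so unwanted records are never shuffled.
wanted = BloomFilter()
for key in ["user42", "user99"]:
    wanted.add(key)

records = [("user42", 1), ("user7", 1), ("user99", 1)]
shuffled = [(k, v) for k, v in records if k in wanted]
```

The trade-off is classic: a few spurious records may still be shuffled, but the filter itself is tiny enough to distribute to every mapper, and all wanted keys are guaranteed to survive.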

To serve at scale:

  • Run stuff multiple times.
  • Iterate quickly to get fast feedback.
  • Do AB testing to measure performance.
  • Push out quickly for feedback.
  • Try out what you would like to see.


O'Reilly Strata - Day two - keynotes

2011-02-12 20:17

Day two of Strata started with a very inspiring insight from the host himself that extended the vision discussed earlier in the tutorials: it's not at all about the tools - the current data analytics value lies in the data itself and in the conclusions and actions drawn from analysing it.

The first keynote was presented by - for them there are four dimensions to data analytics:

  • Timeliness: There must be realtime access, or at least streaming access to incoming data.
  • Storage must provide the means to efficiently store, access, query and operate on data.
  • Education as there is no clear path to becoming a data scientist today.
  • Imagination to come up with new interesting ways to look at existing data.

For the service storing shortened urls there really are three views on their data: the very personal, intrinsic preferences expressed in your participation in the network; the neighborhood view taking into account your friends and acquaintances; and finally the global view that allows for drawing conclusions on a very large scale - a way to find out what's happening worldwide just by looking at log data.

Thomson Reuters

In contrast to the all-digital players, Thomson Reuters comes from a very different background - though acting on a global scale and distributing news worldwide, lots of manual intervention is still called for to come up with high-quality, clean, curated data. In addition their clients need very low latency to be able to act on incoming news on the stock market.

For traditional media providers it is very important to bring news together with context and users: knowing who users are and where they live may result in delivering better service with more focused information. However, he sees a huge gap between what is possible with today's Web 2.0 applications and what is still common practice in large corporate environments: social networking sites tend to gather data implicitly, without clearly telling users what is collected and for which purpose. In corporate environments, though, it was (and still is) common practice to come up with general compliance rules that target protecting data privacy and insulating corporate networks from public ones.

Focusing on cautious and explicit data mining might help these environments benefit from cost savings and targeted information publishing as well.

Mythology of big data

Each technology carries in itself the seeds of its own destruction - the same is true for Hadoop and friends: the code is about to turn into a commodity itself. As a result the real value lies in the data it processes and in the knowledge of how to combine existing tools to solve 80% of your data analytics problems.

The myth really is the lonely hacker sitting in front of his laptop, solving the world's data analysis problems. Instead, analytics is all about communication and about learning from those who stored and generated the data. Only they can tell you more about the business cases as well as the context of the data. Only domain knowledge can help solve real problems.

Over the past decade data evolved from being the product, to being a by-product, to being an asset. Nowadays it is turning into a substrate for developing better applications. There is no need for huge data sets to turn data into a basis for better applications. In the end it boils down to using data to revamp your organisation's decisions: from horse-trading, gut-check based decisions to scientific, data-backed, informed decisions.

Amazon - Werner Vogels

For Amazon, big data means that storing, collecting, analyzing and processing the data are hard to do. Being able to do so currently is a competitive advantage. In contrast to BI, where questions drove the way data was stored and collected, today infrastructure is cheap enough to creatively come up with new analytics questions based on available data.

  • Collecting data ranges from a streaming model to daily imports and even batch imports - never underestimate the bandwidth of FedEx. There even is a FedEx import at Amazon.
  • Never underestimate the need for increased storage capacity. Storage on AWS can be increased dynamically.
  • When organizing data keep data quality and manual cleansing in mind - there is a Mechanical Turk offering for that at AWS.
  • For analysis MapReduce currently is the obvious choice - AWS offers Elastic MapReduce for that.
  • The trend goes more and more towards sharing analysis results via public APIs to enable customers downstream to reuse the data and provide added value on top of it.

Microsoft Azure data market place

Microsoft used their keynote to announce the Azure Data Marketplace - a place to make data available for easy use and trading. To work with data today you have to find it and license it from its original owner - which incurs the overhead of negotiating licensing terms.

Microsoft's goal is to provide a one-stop shop for data with a unified and discoverable interface. They work with providers to ensure cleanup and curation; in turn, providers get a marketplace for trading their data. It will be possible to visualize data before purchase, to avoid buying what you do not know. There is a subscription model that allows for constant updates and has licensing issues cleared. There are consistent APIs to the data that can be incorporated by solution partners to provide better integration and analysis support.

At the very end the Heritage Health Prize was announced - a $3 million data mining competition open for participation starting next April.

O'Reilly Strata - Tutorial data analytics

2011-02-11 20:17

Acting based on data

It comes as no surprise to hear that in the data analytics world, too, engineers are unwilling to share the details of how their analysis works with higher management - with, on the other side, not much interest in learning how analytics really works. This culture leads to a black-art, witchcraft attitude towards data analytics that hinders most innovation.

When starting to establish data analytics in your business there are a few steps to consider: first of all, no matter how beautiful the visualizations look in the tool you have just chosen to work with and are considering buying - keep in mind that shiny pebbles won't solve your problems. Instead focus on what kind of information you really want to extract and choose the tool that does that job best. Keep in mind that data never comes as clean as analysts would love it to be.

  • Ask yourself how complete your data really is (Are all fields you are looking at filled for all relevant records?).
  • Are those fields filled with accurate information? (Ever asked yourself why everyone using your registration form seems to be working for a startup with 1-100 engineers instead of one of the many other options down the list?)
  • For how long will that data remain accurate?
  • For how long will it be relevant for your business case?
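The first two questions above can be answered with a quick per-field fill-rate measurement over a sample of records. A minimal sketch with hypothetical field names and placeholder values:

```python
def field_completeness(records, fields):
    """Return, per field, the fraction of records where it is actually
    filled (not missing, empty, or an obvious placeholder value)."""
    placeholders = {"", "n/a", "unknown"}
    rates = {}
    for f in fields:
        filled = sum(
            1 for r in records
            if str(r.get(f, "")).strip().lower() not in placeholders
        )
        rates[f] = filled / len(records)
    return rates

# Hypothetical registration-form records
records = [
    {"name": "Ada", "company_size": "1-100", "country": ""},
    {"name": "Bob", "company_size": "1-100", "country": "unknown"},
    {"name": "Cruz", "company_size": "", "country": "DE"},
]
rates = field_completeness(records, ["name", "company_size", "country"])
# rates["name"] is 1.0; "country" is mostly placeholders at 1/3
```

Note a fill rate only catches missing data, not inaccurate data - if everyone picks the first dropdown option, the field looks perfectly complete; spotting that takes a look at the value distribution as well.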

Even the cleanest data set can get you only so far: You need to be able to link your data back to actual transactions to be able to segment your customers and add value from data analytics.

When introducing data analytics, check whether people are actually willing to share their data. Check whether management is willing to act on potential results - that may be as easy as spending lots of money on data cleansing, or it may involve changing workflows to be able to provide better source data. As a result of data analytics there may be even more severe changes ahead of you: are people willing to change the product based purely on data? Are they willing to adjust the marketing budget? ... job descriptions? ... the development budget? How fast is the turnaround for these changes? When changes are made yearly there is no value in having realtime analytics.

In the end it boils down to applying the OODA loop: only if you can observe, orient, decide and act faster than your competitor do you have a real business advantage.

Data analytics ethics

Today Apache Hadoop provides the means to give data analytics super powers to everyone: it combines the use of commodity hardware with scaling to high data volumes. "With great power there must also come great responsibility", according to Stan Lee. In the realm of data science that involves problems that might be ethically at least questionable, though technologically trivial:

  • Helping others adjust their weapons to increase death rates.
  • Making others turn into a monopoly.
  • Predicting the likelihood of cheap food making you so sick that you are able and willing to go to court against the provider as a result.

On the other hand it can solve cases that make sense both for the provider and the customer: predicting when visitors to a casino are about to become unhappy and willing to leave - before they even know it themselves - may give the casino employees a brief time window for counter-actions (e.g. offering you a free meal).

In the end it boils down to avoiding screwing up other people's lives; deciding which action does the least harm while achieving the most benefit; which treats people at least proportionally if not equally; what serves the community as a whole - or, more simply: what leads me to being the person I always wanted to be.

Teddy in San Francisco

2011-02-10 20:13

Before attending O'Reilly Strata there were a few days left to adjust to the different time zone, meet up with friends and generally spend some days in the Greater San Francisco area. As was to be expected, those were way too few days. The weekend was a bit rainy, but still packed: visiting Chinatown right after the plane had landed and spending some time at ... Finally I was taken out to Buck's - the restaurant generally known among software engineers as the place where VC deals are made.

Sunday was reserved for visiting some redwood trees - it's so great to drive just a few minutes out of the city and arrive in an area that looks like the set of a fairy tale movie. With all the mist, and with the sun coming out here and there, the area looked even more bewitched.

On Monday the sun finally arrived in the bay - as a result a ferry trip to Sausalito seemed like the optimal thing to do. Unfortunately there was not enough time to rent a bike and do the "ride the bridge" tour - or to get a kayak and go out into the bay. Maybe next time though.

After returning home, Teddy showed me some pieces of chocolate someone in the US made him addicted to - now it's not just the tasty Swiss one but also the Berkeley one I have to find a shop for in Berlin ;)