OCTOBER 23 – 25, 2012

THANK YOU FOR JOINING US!

Strata Conference + Hadoop World 2012 went off without a hitch! The sold-out conference attracted attendees from a wide range of industries and 38 countries.

Strata Conference explored the change brought to technology and business by big data, data science, and pervasive computing. Having joined forces this year with Hadoop World (in its 4th year), the conference is at the heart of the big data industry.

Strata Conference + Hadoop World brought together decision makers using the power of big data to drive business strategy and practitioners who collect, analyze, and manipulate the data — particularly in the worlds of finance, media, and government. The merger of Strata and Hadoop World was the largest gathering of the Apache Hadoop community, with emphasis on hands-on and business sessions on the Hadoop ecosystem.

WHAT HAPPENED

Cloudera Keynotes

Cloudera Presentations

Cloudera Tutorials

Meet the Authors

Meetups

Awards


CLOUDERA KEYNOTE PRESENTATIONS

KEYNOTE: MIKE OLSON

Mike Olson

Watch video

Michael Olson – Cloudera CEO

Big Questions
Michael Olson is the CEO of Cloudera, the company delivering an enterprise-ready data management platform based on Apache Hadoop. He was formerly CEO of Sleepycat Software, makers of Berkeley DB, the open source embedded database engine. Mike spent two years at Oracle Corporation as Vice President for Embedded Technologies after Oracle’s acquisition of Sleepycat in 2006. Prior to joining Sleepycat, Mike held technical and business roles at database vendors Britton Lee, Illustra Information Technologies and Informix Software. Mike has Bachelor’s and Master’s degrees in Computer Science from the University of California at Berkeley. »

KEYNOTE: DOUG CUTTING

Doug Cutting

Watch video
View slides

Doug Cutting – Cloudera Architect, Hadoop Co-founder & Apache Software Foundation Chairman

Beyond Batch
Hadoop started as an offline, batch-processing system. It made it practical to store and process much larger datasets than before. Subsequently, more interactive, online systems emerged, integrating with Hadoop. First among these was HBase, the key/value store. Now scalable interactive query engines are beginning to join the Hadoop ecosystem. Realtime is gradually becoming a viable peer to batch in big data. »


CLOUDERA PRESENTATIONS

GIVEN ENOUGH MONKEYS – SOME THOUGHTS ON RANDOMNESS

View slides

Jesse Anderson – Cloudera Developer and Instructor

Can a million monkeys on a million typewriters eventually recreate Shakespeare? Great minds since Aristotle have pondered this theorem. In 2011, Jesse Anderson randomly recreated Shakespeare using Hadoop. Here’s why you should care. »

LARGE SCALE ETL WITH HADOOP

Watch video

Eric Sammer – Cloudera Sr. Solutions Architect

Hadoop is commonly used for processing large swaths of data in batch. While many of the necessary building blocks for data processing exist within the Hadoop ecosystem – HDFS, MapReduce, HBase, Hive, Pig, Oozie, and so on – it can be a challenge to assemble and operationalize them as a production ETL platform. This presentation covers one approach to data ingest, organization, format selection, process orchestration, and external system integration, based on collective experience acquired across many production Hadoop deployments. »

HDFS – WHAT IS NEW AND FUTURE

Watch video
View slides

Sanjay Radia – Hortonworks Founder
Todd Lipcon – Cloudera Software Engineer

Hadoop 1.0 is a significant milestone: it is the most stable and robust Hadoop release, tested in production against a variety of applications. It offers improved performance, support for HBase, disk-fail-in-place, WebHDFS, etc. over previous releases. The next major release, Hadoop 2.0, offers several significant HDFS improvements including a new append pipeline, federation, wire compatibility, NameNode HA, further performance improvements, etc. We describe how to take advantage of the new features and their benefits. We also discuss some of the misconceptions and myths about HDFS. »

HIGH AVAILABILITY FOR THE HDFS NAMENODE: PHASE 2

Watch video
View slides

Aaron Myers – Cloudera Software Engineer
Todd Lipcon – Cloudera Software Engineer

This session will discuss the design and implementation of features for a highly available NameNode, as well as give an overview of how to deploy these new features. »

APACHE HBASE FEATURES FOR THE ENTERPRISE

Watch video
View slides

Jonathan Hsieh – Cloudera Software Engineer

Apache HBase is a distributed data store that is in production today at many enterprises and sites serving large volumes of near-real-time random accesses. As Apache HBase matures, the community has augmented the system with new features that many enterprises consider to be hard requirements. We will discuss how the upcoming HBase 0.96 release addresses many of these requirements by introducing new features that will help the administrator monitor and control access to the system, and new mechanisms to minimize downtime due to expected and unexpected outages. »

DATA SCIENCE ON HADOOP: HOW CLOUDERA IMPALA UNLOCKS NEW PRODUCTIVITY AND INSIGHTS

Watch video
View slides

Justin Erickson – Cloudera Sr. Products Manager

This talk will cover which tools and techniques work well, and which don’t, for data scientists working on Hadoop today; how to leverage the lessons learned by the experts to increase your productivity; and what to expect for the future of data science on Hadoop. We will draw on insights from the top data scientists working on big data systems at Cloudera as well as experiences from running big data systems at Facebook, Google, and Yahoo. »

DESIGNING SCALABLE NETWORK ARCHITECTURES FOR FAST MOVING BIG DATA

Watch video

Kenneth Duda – Arista Networks Founder, CTO and SVP, Software Engineering
Amr Awadallah – Cloudera CTO

The growth of big data and the continuing dramatic decline in the cost of storage and computer processing are having a profound impact on the way business is being conducted across sectors. An updated network infrastructure is essential to keeping the data flowing smoothly throughout your organization and enabling timely, precise analytics that enhance business decision making. Arista and Cloudera have partnered to create networking architectures that accelerate big data productivity by increasing performance, simplifying network scale-out, and tying into Hadoop’s topology-aware storage architecture. »

KNITTING BOAR

Watch video
View slides

Josh Patterson – Cloudera Sr. Solutions Architect
Michael Katzenellenbogen – Cloudera Solutions Architect

In this session, we will introduce “Knitting Boar”, an open-source Java library for performing distributed online learning on a Hadoop cluster under YARN. We will give an overview of how Knitting Boar works and examine the lessons learned from YARN application construction. »

TAMING THE ELEPHANT – LEARN HOW MONSANTO MANAGES THEIR HADOOP CLUSTER TO ENABLE GENOME/SEQUENCE PROCESSING

Watch video
View slides

Bala Venkatrao – Cloudera Director, Products
Erich Hochmuth – Monsanto High Performance Analytics Team Lead
Aparna Ramani – Cloudera Director, Engineering
Mark Seidenstricker – Monsanto Infrastructure Architect

Managing Hadoop clusters to meet business needs can be challenging. Learn how Monsanto has effectively tamed the elephant using Cloudera Manager. »


CLOUDERA TUTORIALS

USING HBASE

Amandeep Khurana – Cloudera Solutions Architect
Matteo Bertozzi – Cloudera HBase Consultant

AN INTRODUCTION TO HADOOP

Mark Fei – Cloudera Instructor

This tutorial provides an introduction to Apache Hadoop and what it’s being used for. This will include:

  • The rationale for Hadoop
  • Understanding the Hadoop Distributed File System (HDFS) and MapReduce
  • Common Hadoop use cases including recommendation engines, ETL, time-series analysis and more
  • How Hadoop integrates with other systems like relational databases and data warehouses
  • Overview of the other components in a typical Hadoop “stack” such as these Apache projects: Hive, Pig, HBase, Sqoop, Flume and Oozie »

TESTING HADOOP APPLICATIONS

Tom Wheeler – Cloudera Curriculum Developer

Software testing is hard enough, but it becomes especially challenging when you’re doing large-scale, distributed data processing. This tutorial will present a mix of lecture and instructor-led demonstrations to explain how you can verify that your code performs exactly as you intended. This session will focus on unit testing, integration testing, performance testing, and diagnostics. »

BUILDING A LARGE-SCALE DATA COLLECTION SYSTEM USING FLUME NG

Hari Shreedharan – Cloudera Software Engineer
Will McQueen – Cloudera Software Engineer
Arvind Prabhakar – Cloudera Software Engineer
Prasad Mujumdar – Cloudera Software Engineer
Mike Percy – Cloudera Software Engineer

Hadoop HDFS is typically adopted in situations where traditional storage and database systems are either reaching their limits or have already surpassed them. This usually implies that there are one or more large streams of events that need to be collected, such as log data streams. Flume NG was designed from the ground up to tackle this problem in a straightforward, scalable, reliable way, and empirical results support the success of its approach. »


MEET THE AUTHORS, FREE BOOKS

We wrote the book… well actually, Cloudera experts in Apache Hadoop and Apache HBase have written the definitive guides. Stop by the Cloudera booth and you’ll have a chance to meet the authors and get a free book. We are giving away 100 of each book on both Wednesday, October 24, and Thursday, October 25, including Hadoop Operations and HBase in Action, which will be hot off the press.

Wednesday, Oct 24 @ 10:20AM
Thursday, Oct 25 @ 8:10AM

Tom White

Hadoop: The Definitive Guide, 3rd Edition

Wednesday, Oct 24 @ 3:10PM
Thursday, Oct 25 @ 10:20AM

Lars George

HBase: The Definitive Guide

Wednesday, Oct 24 @ 3:10PM
Thursday, Oct 25 @ 10:20AM

Amandeep Khurana & Nick Dimiduk

HBase in Action

Wednesday, Oct 24 @ 5:40PM
Thursday, Oct 25 @ 8:10AM

Eric Sammer

Hadoop Operations


MEETUPS

New York Hadoop User Group

Tuesday, Oct 23

Hosts: Eli Collins and Aaron Myers
Where: Foursquare offices »

Hive User Group Meetup NYC

Tuesday, Oct 23

Host: Carl Steinbach
Where: Hilton Hotel »

Sqoop User Meetup

Tuesday, Oct 23

Host: Kathleen Ting
Where: PulsePoint offices »

Flume User Meetup

Thursday, Oct 25

Host: Kathleen Ting
Where: Hilton Hotel »

HBase Meetup

Thursday, Oct 25

Host: Otis Gospodnetić
Where: AppNexus offices »

Cloudera Manager Users Meetup

Thursday, Oct 25

Host: Philip Zeyliger
Where: Hilton Hotel »

ZooKeeper Users Meetup

Thursday, Oct 25

Host: Camille Fournier
Where: Hilton Hotel »


AWARDS

NEW: STRATA DATA INNOVATION AWARDS

We’re honoring innovative work in big data and data science, and need your nominations for individuals or organizations who deserve recognition for their work in data. Learn more