Hadoop World 2010 · Agenda

Event Agenda
Monday, October 11 – Welcome Reception
Time: 6:00pm – 9:00pm
Location: Ava Lounge
210 West 55th Street NYC 10019
Penthouse and Rooftop of DREAM Hotel
Tuesday, October 12
| | Grand Ballroom | Beekman Parlor | Sutton North | Sutton Center | Sutton South |
|---|---|---|---|---|---|
| 8:00am – 6:00pm | Registration | ||||
| 8:00am – 9:00am | Breakfast | ||||
| 9:00am – 10:30am | Mike Olson, CEO, Cloudera; Tim O’Reilly, Founder, CEO, O’Reilly Media | ||||
| 10:30am – 11:00am | Break | ||||
| 11:00am – 11:30am |
|
Hadoop Analytics: More Methods, Less Madness
Shevek Mankin, Chief Technical Officer, Karmasphere
One of the biggest stumbling blocks to leveraging Big Data, or even cloud computing in general, is the amount of expertise it takes to get even simple tasks done. In this session, we’ll discuss proven methods to most quickly and effectively extract intelligence using Hadoop. This includes prioritizing when to use a lower level language like Java versus when to use Hive and SQL, Pig, Cascading, etc. We will discuss real-world use cases, including illustrating what other enterprises do to leverage Hadoop without loads of additional training or time testing. This session will help analysts and developers alike understand the capabilities and compromises of alternative approaches.
|
Hadoop Image Processing for Disaster Relief
Andrew Levine, Software Developer, TexelTek Inc.
The Open Cloud Consortium’s Matsu Project is developing an open source system to process large amounts of image data and detect significant changes to provide assistance for disaster relief efforts. Processing of the source imagery is focused on making high-resolution images highly available for disaster relief workers in a timely fashion. This effort is also doing temporal comparison between geospatially identical areas to reveal change over time. For example, it is possible to highlight fallen buildings and bridges or the progress of floods. The framework should work well for other types of image processing like anomaly detection and pattern identification.
|
Search Analytics with Flume and HBase
Otis Gospodnetic, Founder, Sematext International
In this talk we will show how we use Flume to transport search and clickstream data to HBase with the ultimate goal of producing Search Analytics reports using that data. We’ll show how data flow through the system from the moment a query or click event is captured in the search application UI, until it lands in HBase via Flume’s HBase sink. We’ll also share information about what this system looked like in the pre-Flume days. Finally we’ll demonstrate various reports the system ultimately produces and insight we derive from them.
|
Advanced Analytics for the US Army Intelligence Cloud
Tim Estes, CEO, Digital Reasoning
The US Army’s mission has evolved to deal with understanding entity-level relationships from massive amounts of structured and unstructured data. To tackle this problem and support a new generation of entity-centric analytics, the US Army has adopted Hadoop and other cloud scale analytic technologies to support mission-critical intelligence. At the heart of these analytic efforts is a new system for understanding and integrating structured and unstructured data called Synthesys. This talk will discuss the type and scale of analysis that the new Army Cloud is doing using Synthesys and how Hadoop/CDH3 is a critical component of that infrastructure.
|
| 11:35am – 12:05pm |
Hadoop at eBay
Anil Madan, Director Engineering, Analytical Platform Development, eBay
This talk will illustrate how eBay is leveraging its data assets to deliver advanced insights and analytics. Learn how eBay is sourcing huge volumes of data into the cluster and running clickstream and transactional data analysis for user behavior, search quality and research use cases.
|
RDBMS and Hadoop: A Powerful Coexistence
Stephen Hillian, VP Analytics, Greenplum, now part of EMC Corp.
Today, ALL the data in an organization is important. What does it take to manage massive volumes of structured and unstructured data and meet the demand for timely business insight? Innovations like MPP, MapReduce, Hadoop and in-database analytics are redefining what is possible. Learn how Adknowledge applied these tools to analyze massive amounts of data from email and digital advertising campaigns to deliver actionable business insight – faster than ever before.
|
Making Hadoop Security Work in Your IT Environment
Todd Lipcon, Cloudera
Aaron Myers, Cloudera
The Apache Hadoop project has seen several recent advances in its security model, including the addition of authentication. This session will discuss the current state of Hadoop security and how compatible it is with different aspects of typical enterprise IT environments. Attendees will learn about the details of the security integration in Hadoop, and we will also discuss the integration of security throughout the various projects included in Cloudera’s Distribution for Hadoop (CDH).
|
Top 10 Lessons Learned from Deploying Hadoop and HBase
Rod Cope, CTO & Founder, OpenLogic Inc.
Hadoop, HBase, and friends are built from the ground up to support Big Data, but that doesn’t make them easy. Just like with any other relatively new and complex technologies, there are some rough edges and growing pains to manage. I’ve learned some hard lessons while deploying HBase tables containing billions of rows and dozens of terabytes on OpenLogic’s Hadoop infrastructure. Come to this session to learn about some of the “gotchas” you might run into when deploying Hadoop and HBase in your own production environment and how to avoid them.
|
The Explorys Network
Doug Meil, Director of Engineering, Explorys
Formed in partnership with Cleveland Clinic, Explorys addresses the national imperative to leverage electronic health records (EHR), across a network of healthcare providers and life sciences organizations, for the improvement of care and drug safety. With three major healthcare providers already committed, and more expected this year, the Explorys network will go online in the summer of 2010. The Explorys healthcare cloud-computing platform leverages Hadoop, HBase, and MapReduce, to search and analyze patient populations, treatment protocols, and clinical outcomes. Already spanning a billion anonymized clinical records, Explorys provides uniquely powerful and HIPAA compliant solutions for accelerating life saving discovery.
|
| 12:10pm – 12:40pm |
Hadoop – Best Practices and Real Experience Going from 5 to 500 Nodes
Phil Day, HP
The simple hardware requirements and pre-packaged distributions mean that getting a small Hadoop prototype system together is easily achievable in many organizations. However, transitioning from this to a full Proof of Concept or operational cluster presents many technical and organizational challenges. In this talk we will discuss some of the issues we have encountered whilst working with customers who want to move beyond the prototype, and how we have helped to overcome them. In particular we will cover the steps from hardware selection through to build, deployment and configuration, and service management considerations.
|
Migrating to CDH and Streaming Data Warehouse Loading
Christopher Gillett, Chief Software Architect, Visible Measures
We recently migrated from a legacy version of Apache Hadoop to a modern implementation using CDH. In parallel we moved from MySQL to Vertica. This talk focuses on the migration techniques used, including gradual grid decommissioning and buildup, data compression, load balancing, etc. The presentation will also discuss how we re-factored our data warehouse loading process to move to a streaming approach from a more traditional bulk-load model. Finally, the talk will present performance numbers comparing CDH to legacy implementations of Hadoop.
|
AOL’s Data Layer
Ian Holsman, CTO Relegence, AOL
An overview of how we use Hadoop and other open source technologies to provide reporting and basic clustering services to AOL’s websites.
|
Using Hadoop for Indexing for Biometric Data, High Resolution Images, Voice/Audio Clips, and Video Clips
Lalit Kapoor, Associate, Booz Allen Hamilton
As the types and volume of multimedia content and complex numeric data increase across the Internet, searching this data becomes inaccurate or prohibitively expensive. To help address this problem, we created Fuzzy Table. Fuzzy Table is a distributed, low latency, fuzzy matching database built over Hadoop that enables fast fuzzy searching of content that cannot be easily indexed or ordered, such as biometric data, high resolution images, voice/audio clips, and video clips. In this presentation we will discuss scaling an application using Fuzzy Table over Amazon’s EC2 service. We will present experiences, lessons learned, and performance metrics of building large scale systems over Hadoop.
|
Large Scale Web Analytics utilizing AsterData and Hadoop
Will Duckworth, comScore
In this session you will see how one company has leveraged AsterData and Cloudera’s Distribution for Hadoop (CDH) to build an environment that supports processing over 500 billion rows of web log data in a syndicated production environment. The session will focus on how comScore applies its taxonomy of the web to help categorize the observed URLs and the methods used to leverage multiple large scale analytical systems. comScore’s taxonomy currently classifies over 88% of all web pages observed on the internet.
|
| 12:40pm – 1:45pm | Lunch | ||||
| 1:45pm – 2:15pm |
Hadoop and Hive at Orbitz
Jonathan Seidman, Lead Software Engineer, Orbitz Worldwide
Ramesh Venkataramaiah, Orbitz Worldwide
Orbitz Worldwide’s portfolio of global consumer travel brands processes millions of searches and transactions every day. Storing and processing the ever-growing volumes of data generated by this activity becomes increasingly difficult through traditional systems such as relational databases. This presentation details how Orbitz is using new tools such as Hadoop and Hive to meet these challenges. We’ll discuss how Hadoop and Hive are being leveraged to provide data and analysis that allows us to optimize the products shown to consumers and drive statistical analysis of macro trends.
|
SIFTing Clouds
Paul Burkhardt, SRA International Inc.
Computer vision algorithms are ideal candidates for distributed computing given the compute-intensive nature of the algorithms and the increasing extent of image resolution and volume. We will describe our MapReduce implementations of the Scale-Invariant Feature Transform (SIFT) algorithm, a well-known computer vision algorithm used for object-recognition. Our SIFT MapReduce application enables fast object identification in distributed image datasets. We will present our results and a new approach for internet image search.
|
HBase in Production at Facebook
Jonathan Gray, Software Engineer, Open Source Advocate, Facebook
A talk on how Facebook is using HBase in production to power both online and offline applications. Beginning with why we chose HBase, the talk will cover the specifics of our use cases and how HBase fits in. Details will be shared about HBase usage for realtime serving applications as well as to augment existing Hadoop-based data warehousing.
|
Business Analyst Tools & Applications for Hadoop
Amr Awadallah, CTO, Cloudera
It is a widely held misconception that Hadoop is limited to programmers or other people familiar with command line interfaces. While historically true, there has been an explosion of different analyst tools that have been announced for Hadoop. This session will cover the different categories of Hadoop analyst tools, their capabilities, current maturity and applicable use cases.
|
Better Ad, Offer and Content Targeting using Membase with Hadoop
James Phillips, Co-founder, Membase
Manu Mukerji, Architect, ShareThis
Pero Subasic, Chief Architect, AOL
Real-time ad, offer and content targeting decisions must happen quickly. AOL Advertising and ShareThis describe how Membase and Hadoop combine in their environments to accelerate and improve targeting. Creating user profiles with Hadoop, then serving them from Membase, reduces profile read and write access to under a millisecond, leaving the bulk of the processing time budget for improved targeting and customization.
|
| 2:20pm – 2:50pm |
The Hadoop Ecosystem at Twitter
Kevin Weil, Analytics Lead, Twitter
Hadoop is rapidly becoming must-have infrastructure for companies of all kinds. But as word of mouth grows, so do questions around how one actually uses Hadoop to solve business problems. There are a number of excellent applications on top of Hadoop like Pig, HBase, and Hive; how do those fit in? How does one get data into Hadoop, and then back out afterward? In this talk I’ll discuss specifically how Twitter uses these tools to solve critical business and engineering problems.
|
SHARD: Storing and Querying Large-Scale SemWeb Data
Kurt Rohloff, Scientist, BBN Technologies
Current Semantic Web data processing technologies are sufficient for generally small datasets, but current methodologies create horrible query processing bottlenecks in Semantic Web triple-stores. This contradicts the fundamentally Web-scale Semantic Web vision, and resulting triple-store performance is probably one of the reasons there hasn’t been a broader uptake in SemWeb technologies. In this talk I will review SHARD, a proof-of-concept triple-store built on Hadoop. SHARD responds to SPARQL queries, stores triple data in HDFS and provides basic OWL reasoning capabilities. SHARD compares favorably in query performance to recent industrial triple-stores, but is much more scalable and robust.
|
ZooKeeper in Online systems, Feed processing and Cluster Management!
Mahadev Konar, Software Engineer, Yahoo!
ZooKeeper has been in production for over 3 years now. Its performance and reliability have allowed it to be a critical component in distributed systems. Its design has proven to be flexible enough that it can be applied to a variety of needs of distributed applications. It has simplified the lives of services engineering teams and is easily applied to your project. In this talk we will review some examples of applications that use ZooKeeper to show the breadth of solutions enabled by ZooKeeper. We will review 1) an online ads system where ZooKeeper is used for fault tolerance and service discovery, 2) a feed processing platform that uses ZooKeeper for fault tolerance, name service, service discovery and load balancing information, and 3) a crawling service, wherein ZooKeeper is used for cluster management, storing sharding information, name service and fault tolerance.
|
Scale In – Collecting Distributed Data via Flume and Querying through Hive.
Anurag Phadke, Senior Metrics Engineer, Mozilla Corporation
Socorro (Crash Reporting System), Tinderbox, BuildBots (build system) are some of the few distributed systems used at Mozilla. These systems are critical for stable product releases and each build/deployment “run” emits tons of useful log information. With Flume, all of this information is collected at a single location, and Hive allows us to analyze the data in a fine-grained fashion. The presentation includes: Technical overview on Flume + Hive integration, our current architecture, optimizations, tradeoffs and then results pertaining to:
|
Exchanging Data with the Elephant: Connecting Hadoop and an RDBMS Using SQOOP
Guy Harrison, Director, R&D Melbourne, Quest
As Hadoop penetrates the enterprise, it will increasingly be called upon to integrate with more traditional enterprise datastores, and with Oracle in particular. To this end, Cloudera have provided the open source SQOOP utility to import or export data between any SQL database and Hadoop. Quest have partnered with Cloudera to provide OraOop – an enhanced utility that provides performance and functionality enhancements for those who wish to inter-operate Oracle and Hadoop. This presentation will discuss the architecture of SQOOP and how its extensibility architecture allows third party providers like Quest to provide optimized drivers for specific SQL databases. We’ll then discuss technical challenges in moving data between Oracle and Hadoop. Finally, we’ll consider how Hadoop changes the landscape for enterprise data management and speculate on how the data centre of the future might leverage the best features of Oracle and Hadoop.
|
| 2:55pm – 3:25pm |
Millionfold Mashups
Philip Kromer, President, infochimps
At infochimps, we’re assembling a data repository containing thousands of public and commercial datasets, many at terabyte scale. Modern machine learning algorithms can provide insight into data by drawing only on its generic structure, even more so when that data is organically embedded in a sea of linked datasets. I’ll talk about the tools and algorithms we use to manage massive scale and massive numerosity data collections, and our bag of tricks for exploring the deep structure and new frontiers where these datasets meet.
|
Optimizing Hadoop Workloads
Nurcan Coskun, Intel Software and Services Group
Deploying a highly efficient Hadoop cluster requires careful attention not only to hardware but also to a multitude of configuration options in Hadoop, HDFS, and the software stack. Intel has devoted resources to Hadoop analysis and testing, both internally and also with fellow travelers, to develop ways to improve efficiency and performance of Hadoop clusters. This workshop will provide a brief introduction to Intel’s analysis and some considerations for optimizing Hadoop for faster analysis and better efficiency. Intel’s whitepaper on Hadoop Optimization will also be available for more in-depth discussion.
|
Cloudera Roadmap Review
Charles Zedlewski, Sr. Director Product Management, Cloudera
In this session we will discuss recent updates in the past months to Cloudera’s Distribution for Hadoop (CDH) and to Cloudera Enterprise. In addition we will present the roadmap for the next 12 months, giving you valuable insight into development plans.
|
Multi-Channel Behavioral Analytics
Stefan Groschupf, Chief Technology Officer, Datameer
Understanding customer behavior at a granular level has the potential to increase sales across the entire customer lifecycle by more precisely targeting an audience with the right messages, advertisements, product offers, and promotional campaigns at the right moment of opportunity. Moreover, the results of behavioral analytics can be used to produce more desirable products and services and to deliver a better user experience. However, the increasing complexity of customer touch points makes it difficult for businesses to obtain a complete picture of the customer. The presentation will focus on a use case of how a Fortune 500 company can leverage Hadoop to tackle the challenges of multi-channel behavioral analytics including the large number of data sources, structured and unstructured data as well as big data. For example, the demonstration will show the power of marrying clickstream data with customer demographic data from a CRM system and purchase history from an order management system to determine the promotional campaign most likely to succeed. Further, this session will explain how to bring in social media conversations so that companies can better identify how customers are influencing each other’s buying decisions. |
|
| 3:25pm – 4:00pm | Break | ||||
| 4:00pm – 4:30pm |
Intelligent Text Information Processing System
Vaijanath Rao, Technical Lead, AOL
Given the large amount of online content available, extracting information from it poses a great challenge. While the first challenge is to process the huge volume of text, the second important challenge is extracting useful and important information out of it. In this talk, we describe our work on extracting keywords, events (location, date and time) etc. The keywords include important and significant words or phrases that describe the content, which can be used for topic detection and modeling, summarization etc. Our goal is to be able to use them for contextual advertising by identifying relevant ads using the keywords. We pass them through a filtering module which identifies the mood of the content, and we restrict the ads to only positive moods.
|
Sentiment Analysis Powered by Hadoop
Linden Hillenbrand, Product Manager – Hadoop Technologies, General Electric
At GE, our Digital Media and Hadoop teams built an interactive application for our Marketing & Communications functions. One of the application’s capabilities is providing automated sentiment analysis, which provides our Marketing & Communications teams the ability to assess external perception of GE (positive, neutral, or negative) through our various campaigns. Hadoop powers the sentiment analysis aspect of the application. This is a highly intensive text mining use case for Hadoop, but through it, we greatly reduce our processing time for sentiment analysis and enable our business leaders to complete their analysis quickly and accurately.
|
Apache Hadoop in the Enterprise
Arun Murthy, Principal Engineer, Yahoo!
Yahoo! has been a major contributor and one of the largest enterprise customers of Apache Hadoop for nearly 5 years now. In the early years the users were primarily researchers and ad-hoc applications. As the adoption of Hadoop has ticked up, the footprint of the operation has grown significantly – over 40,000 machines at Yahoo!. Similarly, the users & applications have grown much more demanding of software as tens of millions of dollars are riding on it – features such as security, multi-tenancy, auditing, metering, predictability, resilience to rogue applications etc. which were once *nice-to-have* are now absolutely critical for Hadoop. This talk covers the strides taken by Hadoop in the last 12 months at Yahoo! to address the needs of the enterprise, including the multi man-years of effort on strong security for Hadoop (both the file-system and Map-Reduce), support for multiple organizations to use Hadoop clusters in a multi-tenant, resilient manner and operability enhancements to help run very large clusters in a cost-effective manner with minimal human intervention. This talk also presents a brief survey of some of the business-critical applications which are enabled by these enhancements.
|
Using R and Hadoop to Analyze VoIP Network Data for QoS
Saptarshi Guha, Dept. of Statistics, Purdue University
RHIPE is an R package that integrates the R environment for statistics and data analysis with the Hadoop distributed computing framework. With RHIPE, the user can store and compute with large and complex data sets using R functions and programming idioms. In this talk, I will demonstrate the use of RHIPE to analyze 190GB of VoIP network data for QoS. The jitter between two consecutive packets is the deviation of the real inter-arrival time from the theoretical one. We show that jitter follows desired properties and is negligible, which supports the assumption that the measured traffic is close to the offered traffic.
|
Flume: Distributed Reliable Streaming Log Collection for Hadoop
Jonathan Hsieh, Software Engineer, Cloudera
We describe the key architectural features of, and initial experiences with, Flume, a distributed, reliable, streaming log collection system. Flume is a core component of CDH3, designed for ingesting large quantities of data with four goals in mind: Reliability, Scalability, Extensibility, and Manageability. A sampling of its core features includes a horizontally scalable architecture, fault-tolerant end-to-end delivery guarantees, support for low-latency continuous event processing, support for bucketing data into HDFS, a simple extension interface for arbitrary input and output sources, centralized management for dynamic configuration changes, and a HUE-based GUI for ingest monitoring and reporting.
|
| 4:35pm – 5:05pm |
Hadoop – Lessons Learned from Deploying Enterprise Clusters
Shinichi Yamada, EVP & CTO, NTT Data Corporation
NTT DATA has over 3 years of experience helping enterprise customers design, deploy and run Hadoop clusters in the range of 20 to over 1000 nodes. In this presentation, we briefly introduce Hadoop business cases in Japan and how NTT DATA addresses the needs of enterprise users. In addition, as lessons learned from working with large enterprise clusters, we also discuss typical reframing in design and operational economies, which have made Hadoop’s deployment successful for users. To provide a use case example we have invited a customer to present alongside us to explain how they have adopted Hadoop into their private cloud infrastructure.
|
A Fireside Chat: Using Hadoop to Tackle Big Data at comScore
Martin Hall, co-founder and CEO, Karmasphere
Will Duckworth, VP Software Engineering, comScore
This session will present a commercial use case of Hadoop in a classic ‘Fireside Chat’ format. Martin Hall, co-founder and CEO of Karmasphere, will talk informally with Will Duckworth, Vice President of Software Engineering at comScore, sharing insights into comScore’s experiences in working with Hadoop to process significantly larger amounts of data from a new initiative. Recently comScore was faced with the challenge of dealing with data from a new initiative that required the systems to support a daily increase in excess of 800% compared to a year ago. After a survey of potential solution options, Duckworth and his team settled on using Hadoop as part of a larger solution. This Fireside Chat will delve into how they selected Hadoop, the trials and tribulations that they have experienced during the learning process and what plans they have for the future. Any developer or analyst considering Hadoop for their own commercial application will find this session illuminating.
|
Mixing real-time needs and batch processing: How StumbleUpon built an advertising platform using HBase and Hadoop
Jean-Daniel Cryans, Database Engineer/HBase Committer, StumbleUpon
StumbleUpon serves millions of recommendations to users each and every day, and includes a small portion of sponsored stumbles in these recommendations. Providing accurate metrics to the sponsors participating in the system combines the needs of a batch-processing system with the requirements of a real-time feedback loop to present comprehensive and up-to-the-minute data. HBase is the mass data storage foundation of this advertising platform, with Hadoop and Cascading used to support numeric analysis and other batch jobs in a flexible and extensible fashion.
|
MapReduce and Parallel Database Systems: Complementary or Competitive Technology?
Daniel Abadi, Assistant Professor, Yale University
The MapReduce vs. parallel database system debate has finally been extinguished (for the most part), with the vast majority of people recognizing that each type of system has its own strengths and weaknesses, and ideal application areas. However, there is a new emerging debate: some people believe that MapReduce and parallel database systems are entirely complementary technologies and will coexist in the enterprise over the long term. Other people, while acknowledging that each has its own strengths and weaknesses, feel that these differences are superficial and that these systems are on a collision course, with one eventually becoming dominant in the enterprise. In this talk, the speaker will argue both sides of this debate against himself.
|
|
| 5:10pm – 5:40pm |
Managing Derivatives Data with Hadoop
Joshua Bennett, Technology Architect, CME
In 2002 CME Group, the world’s leading and most diverse derivatives marketplace, experienced an exponential growth in volume which has continued over subsequent years. In this presentation we will explore how technologies like Hadoop are leveraged to help cope with the hundreds of millions of daily customer transactions.
|
Putting Analytics in Big Data Analysis
Jake Cornelius, Dir. of Product Management, Pentaho
The intersection of the increasing data tsunami and economics has produced new ways to structure and store incredibly large volumes of data with Apache Hadoop. For most companies, however, Hadoop is not a complete, single solution for analytics but part of a hybrid data pyramid with a tier of raw data stored inexpensively in Hadoop, a secondary tier of key data aggregated out of Hadoop and placed in traditional datamarts, and a third tier of data required for speed-of-thought response times residing in memory. As part of this data pyramid, Hadoop, together with front and back end applications and tools that assist in data loading, transformations and analytics, can dramatically lower big data analytics costs without any compromise in business performance. In this interactive session, we will discuss and present the Pentaho for Hadoop solution, the latest offering from Pentaho that integrates Pentaho Data Integration (also known as Kettle) with Hadoop and Hive to bring ETL, data warehousing and BI applications to the tasks of analyzing Big Data. This session will explore how Pentaho for Hadoop works to provide key data integration and transformation functionality to Hadoop data, how it can manage and control transformations and Hadoop jobs from the Pentaho management console and how Hadoop data can be integrated with data from other sources to drive compelling reporting and analytics for today’s massive volumes of data. The session will include a demonstration of the Pentaho for Hadoop solution.
|
Techniques to use Hadoop with Scientific Data
Jerome Rolia, Automated Infrastructure Lab Researcher, HP Labs
Platforms such as Hadoop are not designed specifically for science users, making it difficult to express certain analysis functions in a way that results in efficient execution. In particular, many scientific analytics require the extraction of features from data represented as either a multidimensional array or points in a multidimensional space (e.g., clustering particles that represent a snapshot of a simulation of the universe or extracting hurricanes from a satellite picture). These applications pose an especially interesting challenge in that they exhibit significant computational skew, where different partitions take vastly different amounts of time to run even if their input datasets have the same size. This talk gives examples of such algorithms and manual techniques for overcoming computational skew, and describes joint work with the University of Washington on the SkewReduce platform that automatically partitions data to avoid computational skew.
|
“Productionizing” Hadoop: Lessons Learned
Eric Sammer, Solution Architect, Cloudera
Many Hadoop deployments start small, solving a single business problem, but then begin to grow as the organization finds more valuable use cases. Moving a Hadoop deployment from the proof of concept phase into a full production system presents certain challenges for IT operations teams looking to manage the growing Hadoop deployment and maintain internal SLAs with their customers. In this session Eric will review some of the key considerations the Cloudera Solutions Architect team has learned when working with customers to “productionize” a Hadoop deployment.
|
|
| 5:40pm – 6:00pm | Closing Remarks, Mike Olson, CEO, Cloudera | ||||
| 6:00pm – 7:30pm | Networking Reception | ||||
Questions? Just Ask
If you have any questions about the event, don’t hesitate to email hadoopworld@cloudera.com.