About

The Cloudera Technology Day was held in Washington, DC on March 16, 2016 at the Arena Stage at the Mead Center for American Theater. This technical program was designed to provide current and continuing professional education for industry and government professionals in the mid-Atlantic region who were using, or aspired to use, Apache Hadoop-based modern data infrastructure.

The agenda featured an opening session by Doug Cutting, Cloudera chief architect and creator of several leading open source projects, including Apache Hadoop, Apache Avro, and Apache Lucene. In addition, the intensive one-day program featured expert briefings on:
  • Fast Analytics on Fast Data
    Including the latest on Apache Kudu (incubating), the new columnar data store for the Hadoop ecosystem
  • Risk Management for Data 
    Attendees learned how to orchestrate modern security architecture for Hadoop under the Cloudera Security Maturity Model
  • Intuitive Real-Time Analytics
    Speakers explored the integrated capabilities of Apache Solr and Hadoop for enabling search-based analytics
  • Advanced Analytics with Apache Spark
    Attendees heard an overview of the real-world applications of Spark for machine-learning use cases

Following a networking lunch, participants attended in-depth tutorials on secure architectures for Hadoop clusters and best practices for running the Hadoop stack in production.

Agenda

7:00 – 8:30 AM

Registration and Networking Breakfast

8:30 – 9:00 AM

Keynote – From MapReduce to Spark: An Ecosystem Evolves

Hadoop was the first software to permit affordable use of petabytes. In the decade since Hadoop was introduced, many other projects have been created around the Hadoop Distributed File System (HDFS) storage layer and its MapReduce processing engine, forming a rich software ecosystem. In this keynote, Doug Cutting will explain how Apache Spark provides a second-generation processing engine that greatly improves on MapReduce, and why this transition provides an example of an evolutionary pattern in the data ecosystem that gives it long-term strength.

Doug Cutting, Chief Architect, Cloudera

9:00 – 9:40 AM

Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data

The Hadoop ecosystem has recently improved its real-time access capabilities, narrowing the gap with relational database technologies. However, gaps remain in the storage layer that complicate the transition to Hadoop-based architectures. In this session, the presenter will describe these gaps and discuss the tradeoffs between real-time transactional access and fast analytic performance from the perspective of storage engine internals. The session will also cover Kudu (currently in beta), the new addition to the open source Hadoop ecosystem with out-of-the-box integration with Apache Spark and Apache Impala (incubating), which achieves fast scans and fast random access from a single API.

Todd Lipcon, Software Engineer, Cloudera / Kudu Founder

9:40 – 10:20 AM

Risk Management for Data: Secured and Governed

Protecting enterprise data is an increasingly complex challenge given the diversity and sophistication of threat actors and their cyber-tactics. In this session, participants will hear a comprehensive introduction to Hadoop Security, including the “three A’s” for secure operating environments: Authentication, Authorization, and Audit. In addition, the presenter will cover strategies to orchestrate data security, encryption, and compliance, and will explain the Cloudera Security Maturity Model for Hadoop. Attendees will leave with a greater understanding of how effective INFOSEC relies on an enterprise big data governance and risk management approach.

Eddie Garcia, Chief Security Architect, Cloudera

10:20 – 10:35 AM

Networking Break

10:35 – 11:15 AM

Intuitive Real-Time Analytics with Search

Text-based search has recently become a critical part of the Hadoop stack and has emerged as one of the highest-performing solutions for big data analytics. In this session, attendees will learn about the new analytics capabilities in Apache Solr that integrate full-text search, faceted search, statistics, and grouping to provide a powerful engine for enabling next-generation big data analytics applications.

Eva Andreasson, Director Product Management, Cloudera

11:15 – 11:55 AM

Introduction to Machine Learning on Apache Spark MLlib

Spark MLlib is a library for performing machine learning and associated tasks on massive datasets. With MLlib, fitting a machine-learning model to a billion observations can take only a few lines of code, and leverage hundreds of machines. This talk will demonstrate how to use Spark MLlib to fit an ML model that can predict which customers of a telecommunications company are likely to stop using their service. It will cover the use of Spark's DataFrames API for fast data manipulation, as well as ML Pipelines for making the model development and refinement process easier.

Juliet Hougland, Senior Data Scientist, Cloudera

11:55 – 1:00 PM

Lunch 

1:00 – 4:00 PM

Track A: A Practitioner’s Guide to Securing Your Hadoop Cluster

Why do many Hadoop clusters lack basic security controls? In part because some security features are relatively new, and Hadoop security can be complex and daunting. Participants in this tutorial will be led through the process of securing a Hadoop cluster. The instructors will begin with a Hadoop cluster with no security and incrementally add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and data governance considerations. The following topics will be covered for each of these security features:

● Introduction: what the security feature is, what protection it provides, and best practices and recommendations
● Planning: how to enable the feature in a phased manner with the fewest growing pains and least risk
● Relevance: why it’s important (demonstrated by live attacks against a cluster without the target security feature)
● Implementation: an overview of how the implementation is performed, where the moving parts are, and potential pitfalls

During this tutorial, participants will be provided a maximally secure cluster to learn from and attack. Attendees should bring a laptop to the session with Internet access and the ability to run an SSH client.

Michael Yoder, Software Engineer, Cloudera
Ben Spivey, Solutions Architect, Cloudera
Sravya Tirukkovalur, Software Engineer, Cloudera
Mubashir Kazia, Solutions Architect, Cloudera
 

1:00 – 4:00 PM

Track B: Apache Hadoop Operations for Production Systems

Hadoop is emerging as the standard for big data processing and analytics; however, as usage of Hadoop clusters grows, so do the demands of managing and monitoring these systems. In this tutorial, attendees will be given an overview of the phases necessary for successfully managing Hadoop clusters, with an emphasis on production systems, from installation and configuration management to service monitoring, troubleshooting, and support integration. Participants will receive a review of tooling capabilities and learn which have been most helpful to users, as well as hear lessons learned and best practices from users who depend on Hadoop as a business-critical system. The topics to be covered include:

● Installation (hardware considerations, OS prerequisites, sanity testing, security considerations)
● Configuration (mechanics, key configurations, resource management)
● Troubleshooting (managing, troubleshooting, and debugging Hadoop clusters and applications)
● Enterprise considerations (scaling, logs, failure testing)

Sean Kane, Senior Solutions Architect, Cloudera
Jake Miller, Customer Operations Engineer, Cloudera
Greg Phillips, Solutions Architect, Cloudera

Speakers

  • Ben Spivey

    Principal Solutions Architect, Cloudera

    Ben Spivey is a principal solutions architect at Cloudera who provides consulting services for large financial-services customers. Ben specializes in Hadoop security and operations. He is the coauthor of Hadoop Security from O’Reilly Media (2015).

  • Eddie Garcia

    Chief Security Architect, Office of the CTO, Cloudera

Eddie Garcia is chief security architect at Cloudera, a leader in enterprise analytic data management. Eddie helps Cloudera enterprise customers reduce security and compliance risks associated with sensitive data sets stored and accessed in Apache Hadoop environments. Working in the office of the CTO, Eddie also provides security thought leadership and vision to the Cloudera product roadmap. Formerly VP of InfoSec and Engineering at Gazzang prior to its acquisition by Cloudera, Eddie architected and implemented secure and compliant big data infrastructures for customers in the financial services, healthcare, and public sector industries to meet PCI, HIPAA, FERPA, FISMA, and EU data security requirements. He was also the chief architect of the Gazzang zNcrypt product and holds two patents for data security.

  • Eva Andreasson

    Director of Product Management, Cloudera

Eva Andreasson has been working with JVMs, SOA, cloud, and infrastructure software for 15+ years. She holds two patents on JVM garbage collection heuristics and algorithms. She also pioneered deterministic GC, which was productized as JRockit Real Time at BEA Systems (later acquired by Oracle). After two years as product manager for Zing at Azul Systems, she joined Cloudera in 2012 to help drive the future of distributed data processing through Cloudera's Distribution of Hadoop. Since then, she has worked with Hue, ZooKeeper, Oozie, and other components. In 2013 she initiated and launched Cloudera Search. More recently she drove the partner showcase and easy-to-get-started trial experience of Cloudera Live.

  • Doug Cutting

    Chief Architect, Cloudera

    Doug Cutting is the founder of numerous successful open source projects, including Lucene, Nutch, Avro, and Hadoop. Doug joined Cloudera in 2009 from Yahoo!, where he was a key member of the team that built and deployed a production Hadoop storage and analysis cluster for mission-critical business analytics. Doug holds a Bachelor’s degree from Stanford University and is the former Chairman of the Board of the Apache Software Foundation.

  • Greg Phillips

    Solutions Consultant, Cloudera Government Solutions

    Greg Phillips helps public sector customers optimize their computing resources by implementing current data management and analytics capabilities. His focus is on developing ETL pipelines to meet customer requirements, and enable extraction, transformation, and delivery of intuitive methods for analysts to interact effectively with live data. Prior to his tenure with Cloudera, Greg spent seven years working with the U.S. Government, where in his final assignment, he served as Data Science Analytics Technical Team Lead. He supported enterprise programs for cloud processing architecture, implementation of a Cloudera Hadoop system, and in-depth training for new users to gain immediate value from the available datasets and dashboards. Greg has a Bachelor of Science in Computer Science from the University of Maryland, holds Cloudera Administrator and Cloudera Developer certifications, and is proficient with a number of commonly-used programming languages and data management and analytics applications.

  • Jake Miller

    Customer Operations Engineer, Cloudera

Jake Miller is a Customer Operations Engineer working in the public sector. Jake helps public sector customers identify and solve issues that arise during cluster operations. He has a solid background in Linux systems administration and enjoys solving technical problems. Prior to working with Cloudera, Jake spent 14 years in the public sector as a systems integrator solving challenging technical problems. Jake holds a Master of Science degree in Cyber Security from NYU-Poly.

  • Juliet Hougland

    Data Scientist, Cloudera

    Juliet Hougland is a recent addition to Cloudera’s data science team. She has spent the last 3 years working on a variety of Big Data applications from e-commerce recommendations to predictive analytics for oil and gas pipelines. She holds an MS in Applied Mathematics from University of Colorado, Boulder and graduated Phi Beta Kappa from Reed College with a BA in Math-Physics.

  • Michael Yoder

    Software Engineer, Cloudera

    Mike Yoder is a software engineer at Cloudera who has worked on a variety of Hadoop security features and internal security initiatives. Most recently, he implemented log redaction and the encryption of sensitive configuration values in Cloudera Manager. Prior to Cloudera, he was a security architect at Vormetric.

  • Mubashir Kazia

    Solutions Architect, Cloudera

Mubashir Kazia is a solutions architect at Cloudera focusing on security. Mubashir started the initiative to integrate Cloudera Manager with Active Directory for kerberizing the cluster and provided sample code. He has also contributed patches to Apache Hive that fixed security-related issues.

  • Sean Kane

    Senior Solutions Architect, Cloudera

Sean is an experienced solutions architect with an extensive background in software engineering and development. During his three-year tenure with Cloudera, he has assisted many customers with system architecture development, installation, configuration, performance tuning, and development. Over the past twelve years, Sean has built a broad mix of enterprise information integration experience. At Spry, he led the software development team and developed solutions using Hadoop and semantic web technologies. At Oracle, Sean worked on a team that supported pre-sales with architecture, POCs, and reusable technical product demonstrations. Before that, at BEA (prior to its acquisition by Oracle), he developed solutions for customers using SOA middleware products and open source software. He also worked at Preferred Systems Solutions and MetaMatrix, where he developed reusable components for the federated query and metadata management product, developed and delivered product training courses, and provided general information technology support. Sean holds a Bachelor of Science in Information Sciences and Technology from the Pennsylvania State University. He is certified in service-oriented architecture (SOA) and has taken numerous courses covering Oracle and BEA software.

  • Sravya Tirukkovalur

    Software Engineer, Cloudera

Sravya Tirukkovalur is a software engineer at Cloudera focusing on Hadoop security, specifically authorization. Sravya is one of the core contributors to Apache Sentry, and a committer and PPMC member of the project, helping drive its Apache community. She has spoken about Hadoop security at various meetups and conferences.

  • Todd Lipcon

    Engineer, Cloudera

    Todd Lipcon is an engineer at Cloudera, where he primarily contributes to open source distributed systems in the Apache Hadoop ecosystem. He is a committer and a Project Management Committee member on the Apache Hadoop, HBase, and Thrift projects. Prior to Cloudera, Todd worked on web infrastructure at several startups and researched novel machine learning methods for collaborative filtering. Todd received his bachelor’s degree with honors from Brown University.