Following a networking lunch, participants attended in-depth tutorials on secure architectures for Hadoop clusters and best practices for running the Hadoop stack in production.
7:00 – 8:30 AM
Registration and Networking Breakfast
8:30 – 9:00 AM
Keynote – From MapReduce to Spark: An Ecosystem Evolves
Hadoop was the first software to make working with petabytes of data affordable. In the decade since Hadoop was introduced, many other projects have been created around the Hadoop Distributed File System (HDFS) storage layer and its MapReduce processing engine, forming a rich software ecosystem. In this keynote, Doug Cutting will explain how Apache Spark provides a second-generation processing engine that greatly improves on MapReduce, and why this transition exemplifies an evolutionary pattern that gives the data ecosystem long-term strength.
Doug Cutting, Chief Architect, Cloudera
9:00 – 9:40 AM
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data
The Hadoop ecosystem has improved its real-time access capabilities recently, narrowing the gap with relational database technologies. However, gaps remain in the storage layer that complicate the transition to Hadoop-based architectures. In this session, the presenter will describe these gaps and discuss the tradeoffs between real-time transactional access and fast analytic performance from the perspective of storage engine internals. The session will also cover Kudu (currently in beta), a new addition to the open source Hadoop ecosystem with out-of-the-box integration with Apache Spark and Apache Impala (incubating), which achieves both fast scans and fast random access from a single API.
Todd Lipcon, Software Engineer, Cloudera / Kudu Founder
9:40 – 10:20 AM
Risk Management for Data: Secured and Governed
Protecting enterprise data is an increasingly complex challenge given the diversity and sophistication of threat actors and their cyber-tactics. In this session, participants will hear a comprehensive introduction to Hadoop Security, including the “three A’s” for secure operating environments: Authentication, Authorization, and Audit. In addition, the presenter will cover strategies to orchestrate data security, encryption, and compliance, and will explain the Cloudera Security Maturity Model for Hadoop. Attendees will leave with a greater understanding of how effective INFOSEC relies on an enterprise big data governance and risk management approach.
Eddie Garcia, Chief Security Architect, Cloudera
10:20 – 10:35 AM
Break
10:35 – 11:15 AM
Intuitive Real-Time Analytics with Search
Text-based search has recently become a critical part of the Hadoop stack and has emerged as one of the highest-performing approaches to big data analytics. In this session, attendees will learn about the new analytics capabilities in Apache Solr that integrate full-text search, faceted search, statistics, and grouping to provide a powerful engine for next-generation big data analytics applications.
Eva Andreasson, Director of Product Management, Cloudera
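The faceting and statistics capabilities described above are exposed through ordinary Solr request parameters. As a minimal sketch (the collection name `logs` and the fields `host` and `response_ms` are hypothetical; `q`, `facet`, `facet.field`, `stats`, and `stats.field` are standard Solr parameters), the following builds such a query string with only the Python standard library:

```python
from urllib.parse import urlencode

# Combine full-text search, faceting, and statistics in one request.
params = [
    ("q", "message:timeout"),        # full-text / field query
    ("facet", "true"),               # enable faceted search
    ("facet.field", "host"),         # hit counts per host value
    ("stats", "true"),               # enable the stats component
    ("stats.field", "response_ms"),  # min/max/mean of a numeric field
    ("rows", "0"),                   # aggregates only, no documents
]
query = "/solr/logs/select?" + urlencode(params)
print(query)
```

Sending this single request to a Solr server returns both per-host facet counts and summary statistics, which is what makes combined search-plus-analytics applications practical.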
11:15 – 11:55 AM
Introduction to Machine Learning on Apache Spark MLlib
Spark MLlib is a library for performing machine learning and associated tasks on massive datasets. With MLlib, fitting a machine-learning model to a billion observations can take only a few lines of code and leverage hundreds of machines. This talk will demonstrate how to use Spark MLlib to fit an ML model that predicts which customers of a telecommunications company are likely to stop using their service. It will cover the use of Spark's DataFrames API for fast data manipulation, as well as ML Pipelines for making the model development and refinement process easier.
Juliet Hougland, Senior Data Scientist, Cloudera
11:55 AM – 1:00 PM
Networking Lunch
1:00 – 4:00 PM, Track A:
A Practitioner’s Guide to Securing Your Hadoop Cluster
Why do many Hadoop clusters lack basic security controls? In part because some security features are relatively new, and because Hadoop security can be complex and daunting. Participants in this tutorial will be led through the process of securing a Hadoop cluster. The instructors will begin with a cluster with no security and incrementally add features covering authentication, authorization, encryption of data at rest, encryption of data in transit, and data governance. The following topics will be covered for each of the security features above:
● Introduction: what the security feature is, what protection it provides, and best practices and recommendations
Michael Yoder, Software Engineer, Cloudera
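As a small illustration of the kind of verification the authentication step involves, the snippet below parses a Hadoop `core-site.xml` to confirm that Kerberos authentication and service-level authorization are turned on. The XML fragment is inlined as sample data; on a real cluster one would typically read the file from the Hadoop configuration directory instead. The property names `hadoop.security.authentication` and `hadoop.security.authorization` are standard Hadoop configuration keys.

```python
import xml.etree.ElementTree as ET

# Sample core-site.xml fragment, inlined for illustration only.
core_site = """\
<configuration>
  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
  </property>
  <property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
  </property>
</configuration>
"""

# Collect <name>/<value> pairs into a dict for easy lookup.
props = {p.findtext("name"): p.findtext("value")
         for p in ET.fromstring(core_site).findall("property")}

kerberos_enabled = props.get("hadoop.security.authentication") == "kerberos"
authz_enabled = props.get("hadoop.security.authorization") == "true"
print("kerberos:", kerberos_enabled, "authorization:", authz_enabled)
```

A cluster with no security starts with `hadoop.security.authentication` set to `simple`; flipping it to `kerberos` is the first of the incremental steps the tutorial walks through.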
1:00 – 4:00 PM, Track B:
Apache Hadoop Operations for Production Systems
Hadoop is emerging as the standard for big data processing and analytics; however, as usage of Hadoop clusters grows, so do the demands of managing and monitoring these systems. In this tutorial, attendees will get an overview of the phases necessary for successfully managing Hadoop clusters, with an emphasis on production systems: from installation and configuration management to service monitoring, troubleshooting, and support integration. Participants will review tooling capabilities and learn which have been most helpful to users, as well as hear lessons learned and best practices from users who depend on Hadoop as a business-critical system. The topics to be covered include:
● Installation (hardware considerations, OS prerequisites, sanity testing, security considerations)
Sean Kane, Solutions Architect, Cloudera
Ben Spivey, Principal Solutions Architect, Cloudera
Ben Spivey is a principal solutions architect at Cloudera who provides consulting services for large financial-services customers. Ben specializes in Hadoop security and operations. He is the coauthor of Hadoop Security from O’Reilly Media (2015).
Eddie Garcia, Chief Security Architect, Office of the CTO, Cloudera
Eddie Garcia is chief security architect at Cloudera, a leader in enterprise analytic data management. Eddie helps Cloudera enterprise customers reduce security and compliance risks associated with sensitive data sets stored and accessed in Apache Hadoop environments. Working in the office of the CTO, Eddie also provides security thought leadership and vision for the Cloudera product roadmap. Before Gazzang's acquisition by Cloudera, Eddie was VP of InfoSec and Engineering there, where he architected and implemented secure and compliant big data infrastructures for customers in the financial services, healthcare, and public sector industries to meet PCI, HIPAA, FERPA, FISMA, and EU data security requirements. He was also the chief architect of the Gazzang zNcrypt product and is the author of two patents for data security.
Eva Andreasson, Director of Product Management, Cloudera
Eva Andreasson has been working with JVMs, SOA, cloud, and infrastructure software for 15+ years. She holds two patents on JVM garbage collection heuristics and algorithms. She also pioneered deterministic GC, which was productized as JRockit Real Time at BEA Systems (before the Oracle acquisition). After two years as product manager for Zing at Azul Systems, she joined Cloudera in 2012 to help drive the future of distributed data processing through Cloudera's Distribution of Hadoop. Since then, she has worked with Hue, ZooKeeper, Oozie, and other components. In 2013 she initiated and launched Cloudera Search. More recently she drove the partner showcase and the easy-to-get-started trial experience of Cloudera Live.
Doug Cutting, Chief Architect, Cloudera
Doug Cutting is the founder of numerous successful open source projects, including Lucene, Nutch, Avro, and Hadoop. Doug joined Cloudera in 2009 from Yahoo!, where he was a key member of the team that built and deployed a production Hadoop storage and analysis cluster for mission-critical business analytics. Doug holds a Bachelor’s degree from Stanford University and is the former Chairman of the Board of the Apache Software Foundation.
Greg Phillips, Solutions Consultant, Cloudera Government Solutions
Greg Phillips helps public sector customers optimize their computing resources by implementing current data management and analytics capabilities. His focus is on developing ETL pipelines to meet customer requirements, and enable extraction, transformation, and delivery of intuitive methods for analysts to interact effectively with live data. Prior to his tenure with Cloudera, Greg spent seven years working with the U.S. Government, where in his final assignment, he served as Data Science Analytics Technical Team Lead. He supported enterprise programs for cloud processing architecture, implementation of a Cloudera Hadoop system, and in-depth training for new users to gain immediate value from the available datasets and dashboards. Greg has a Bachelor of Science in Computer Science from the University of Maryland, holds Cloudera Administrator and Cloudera Developer certifications, and is proficient with a number of commonly-used programming languages and data management and analytics applications.
Jake Miller, Customer Operations Engineer, Cloudera
Jake Miller is a customer operations engineer working in the public sector. Jake helps public sector customers identify and solve issues that arise during cluster operations. He has a solid background in Linux systems administration and enjoys solving technical problems. Prior to joining Cloudera, Jake spent 14 years working in the public sector as a systems integrator solving challenging technical problems. Jake holds a Master of Science degree in Cyber Security from NYU-Poly.
Juliet Hougland, Data Scientist, Cloudera
Juliet Hougland is a recent addition to Cloudera’s data science team. She has spent the last 3 years working on a variety of Big Data applications from e-commerce recommendations to predictive analytics for oil and gas pipelines. She holds an MS in Applied Mathematics from University of Colorado, Boulder and graduated Phi Beta Kappa from Reed College with a BA in Math-Physics.
Mike Yoder, Software Engineer, Cloudera
Mike Yoder is a software engineer at Cloudera who has worked on a variety of Hadoop security features and internal security initiatives. Most recently, he implemented log redaction and the encryption of sensitive configuration values in Cloudera Manager. Prior to Cloudera, he was a security architect at Vormetric.
Mubashir Kazia, Solutions Architect, Cloudera
Mubashir Kazia is a solutions architect at Cloudera focusing on security. Mubashir started the initiative of integrating Cloudera Manager with Active Directory for kerberizing the cluster and provided sample code. Mubashir has also contributed patches to Apache Hive that fixed security-related issues.
Sean Kane, Senior Solutions Architect, Cloudera
Sean is an experienced solutions architect with an extensive background in software engineering and development. During his three-year tenure with Cloudera, he has assisted many customers with system architecture, installation, configuration, performance tuning, and development. Over the past twelve years, Sean has built broad experience in enterprise information integration. At Spry, he led the software development team and developed solutions using Hadoop and semantic web technologies. At Oracle, Sean worked on a team that supported pre-sales with architecture, POCs, and reusable technical product demonstrations. Before BEA was acquired by Oracle, he developed solutions there for customers using SOA middleware products and open source software. He also worked at Preferred Systems Solutions and MetaMatrix, where he developed reusable components for the federated query and metadata management product, developed and delivered product training courses, and provided general information technology support. Sean holds a Bachelor of Science in Information Sciences and Technology from the Pennsylvania State University. He is certified in service-oriented architecture (SOA) and has taken numerous courses covering Oracle and BEA software.
Sravya Tirukkovalur, Software Engineer, Cloudera
Sravya Tirukkovalur is a software engineer at Cloudera focusing on Hadoop security, specifically authorization. Sravya is one of the core contributors to Apache Sentry, where she is a committer and PPMC member helping drive the project's Apache community. Sravya has spoken about Hadoop security at various meetups and conferences.
Todd Lipcon, Software Engineer, Cloudera
Todd Lipcon is an engineer at Cloudera, where he primarily contributes to open source distributed systems in the Apache Hadoop ecosystem. He is a committer and a Project Management Committee member on the Apache Hadoop, HBase, and Thrift projects. Prior to Cloudera, Todd worked on web infrastructure at several startups and researched novel machine learning methods for collaborative filtering. Todd received his bachelor's degree with honors from Brown University.