Unlocking data science in the enterprise
Data science is a practice grounded in statistics but increasingly embraced by software and systems engineering. Data scientists strive to take their work beyond simple research, but bridging the gap between the language of data science and the language of distributed systems is proving increasingly difficult. Factor in a fast-evolving ecosystem of tools and libraries, with new releases arriving weekly, and you have a recipe for distraction.
Enter Cloudera Data Science Workbench, an enterprise data science platform that accelerates analytics projects from exploration to production. It is a collaborative, scalable, and highly extensible platform for data exploration, analysis, modeling, and visualization. Its powerful features will finally get data scientists, analysts, and business teams speaking the same language.
Part 1: Introducing the Cloudera Data Science Workbench
Today, leading organizations struggle to make their data scientists productive with Hadoop clusters. Data scientists find it difficult to use their existing open source languages (e.g. Python, R) and libraries with Hadoop, especially when the clusters are secured with Kerberos. At the same time, IT doesn't want to give special access to these users, who require very diverse and specific environment configurations to run their experiments. As a result, most data science teams work away from the Hadoop cluster, often on their laptops or in other data silos. The negative business impacts are a lack of insight and agility for the most advanced users, and the security, governance, and cost issues that arise from data silos.
Cloudera Data Science Workbench is a new tool, currently under development, that will give data scientists collaborative, customizable, self-service access to secure Hadoop environments via Python, R, and Scala. It can be installed on any existing cluster, whether on-premises or in the cloud.
Matt Brandwein, Director of Product Management at Cloudera, and Tristan Zajonc, Senior Engineering Manager at Cloudera, discuss:
- The emergence of open source tools for data science
- Common gaps in the ecosystem
- A first look at a new tool from Cloudera
Part 2: A Visual Dive into Machine Learning and Deep Learning
Machine learning and deep learning offer an opportunity to understand data beyond simple numbers and text. Data science practitioners want to adopt new machine learning and deep learning libraries quickly, but few enterprise analytics systems support these new tools. Cloudera Data Science Workbench gives data scientists ready access to Hadoop data, lets them leverage the newest machine learning and deep learning frameworks, and helps them deliver value much more quickly, all in a secure environment.
Join Sean Anderson, Senior Manager of Data Science Marketing at Cloudera, and Vartika Singh, Solutions Architect for Data Science at Cloudera, as they discuss:
- An introduction to machine learning and deep learning
- Common practices and tools
- A first look at a new tool from Cloudera
Part 3: Models in Production: A Look From Beginning to End
"I've built a model -- now what?"
Developing a predictive model is only one part of a larger journey. Data scientists have to access and transform data and engineer features before exploratory modeling can even begin. And a model doesn't do anything until it's applied to data, productionized, and deployed.
Apache Hadoop can support all stages of the data science lifecycle, but how this is done is still more art than science, because it requires coordinating different teams and technologies. This webinar will demonstrate a simple reference architecture for connecting the output of exploratory data science in Cloudera Data Science Workbench with production deployment on Hadoop. This includes data engineering with Spark, modeling with Spark MLlib, and production build and deployment via Git, Maven, and Spark Streaming.
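As a rough illustration of the exploratory half of this pipeline, the sketch below uses R with sparklyr to do data engineering against Spark and fit a model with Spark MLlib. The connection master, the table name (`web_events`), and all column names are hypothetical; a real deployment would use the cluster's own configuration and data.

```r
library(sparklyr)
library(dplyr)

# Connect to the cluster's Spark installation (settings are site-specific)
sc <- spark_connect(master = "yarn-client")

# Data engineering with Spark: read a hypothetical Hive table and derive a feature
events <- tbl(sc, "web_events") %>%
  filter(!is.na(session_length)) %>%
  mutate(is_mobile = as.integer(device_type == "mobile"))

# Modeling with Spark MLlib, via sparklyr's ml_* wrappers
model <- events %>%
  ml_logistic_regression(converted ~ session_length + is_mobile)

summary(model)
```

The production half (build with Maven, versioning with Git, scoring with Spark Streaming) lives outside R and is covered in the webinar itself.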
Part 4: Cloudera Data Science Workbench: sparklyr, implyr, and More: dplyr Interfaces to Large-scale Data
One of the most popular packages for R, dplyr, makes it easy to query large data sets in scalable processing engines like Apache Spark and Apache Impala.
When working with different data sources, dplyr can behave differently and present a few challenges. In this webinar, Ian Cook, R contributor and Data Scientist at Cloudera, will discuss sparklyr (from RStudio) and implyr (from Cloudera). He'll show you how to write dplyr code that works across these different interfaces and answer questions such as:
- Do I need to know SQL to use dplyr?
- When is a “tbl” not a “tibble”?
- Why is 1 not always equal to 1?
- When should you collect(), collapse(), and compute()?
- How can you use dplyr to combine data stored in different systems?
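To illustrate the core idea, here is a minimal sketch of one dplyr pipeline that can run unchanged against a local data frame, a Spark DataFrame via sparklyr, or an Impala table via implyr. The local example uses the `nycflights13` data set; the sparklyr and implyr connections are commented out and entirely hypothetical (connection arguments and table names depend on your environment).

```r
library(dplyr)

# One pipeline, written once with dplyr verbs
summarise_flights <- function(flights_tbl) {
  flights_tbl %>%
    group_by(carrier) %>%
    summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
    arrange(desc(mean_delay))
}

# Local data frame: executes immediately in R
summarise_flights(nycflights13::flights)

# Spark via sparklyr: dplyr translates the same pipeline to Spark SQL
# (hypothetical connection)
# sc <- sparklyr::spark_connect(master = "yarn-client")
# summarise_flights(dplyr::tbl(sc, "flights")) %>% collect()

# Impala via implyr: the same code again, translated to Impala SQL
# (hypothetical connection)
# impala <- implyr::src_impala(odbc::odbc(), dsn = "Impala DSN")
# summarise_flights(dplyr::tbl(impala, "flights")) %>% collect()
```

Against the remote backends the pipeline is lazy: nothing runs until `collect()` pulls the results into R, which is exactly the `collect()`/`collapse()`/`compute()` distinction the webinar digs into.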