

What do you need to know?

Common Skills (all exams)

  • Extract relevant features from a large dataset that may contain bad records, partial records, errors, or other forms of “noise”
  • Extract features from data stored in a wide range of possible formats, including JSON, XML, raw text logs, industry-specific encodings, and graph link data
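As a rough illustration of the first two skills, here is a minimal Python sketch of feature extraction from noisy JSON log lines. The records, field names, and cleaning rules are hypothetical, not taken from any actual exam:

```python
import json

# Hypothetical raw log lines: some records are valid JSON, some are not.
raw_lines = [
    '{"user": "a", "clicks": 3}',
    'not-json garbage',
    '{"user": "b"}',                  # partial record: missing "clicks"
    '{"user": "c", "clicks": "7"}',   # clicks stored as a string
]

def extract_features(lines):
    """Keep only records with a usable user id and integer click count."""
    features = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip unparseable "noise"
        user = record.get("user")
        clicks = record.get("clicks")
        if user is None or clicks is None:
            continue  # skip partial records
        try:
            features.append((user, int(clicks)))
        except (TypeError, ValueError):
            continue  # skip records whose fields cannot be coerced
    return features

print(extract_features(raw_lines))  # [('a', 3), ('c', 7)]
```

On the exam-scale datasets described above, the same skip-bad-records pattern would typically run inside a distributed job rather than a plain loop.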

Descriptive and Inferential Statistics on Big Data (DS700)

  • Use statistical tests to determine confidence for a hypothesis
  • Calculate common summary statistics, such as mean, variance, and counts
  • Fit a distribution to a dataset and use that distribution to predict event likelihoods
  • Perform complex statistical calculations on a large dataset
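A toy sketch of the DS700 skills above, using only the Python standard library: compute summary statistics, fit a normal distribution by the method of moments, and use it to predict an event likelihood. The latency sample and the 105 ms threshold are invented for illustration:

```python
import math
import statistics

# Hypothetical latency sample (ms), standing in for a large dataset.
sample = [98.0, 102.0, 101.0, 97.0, 103.0, 99.0, 100.0, 104.0]

# Common summary statistics.
mean = statistics.mean(sample)
var = statistics.variance(sample)   # sample (n-1) variance
sd = math.sqrt(var)

def normal_sf(x, mu, sigma):
    """Survival function P(X > x) of a fitted normal distribution."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2)))

# Predicted likelihood of the event "latency exceeds 105 ms".
p_over_105 = normal_sf(105.0, mean, sd)
print(round(mean, 2), round(var, 2), round(p_over_105, 3))
```

At exam scale, the same moment estimates would be computed with distributed aggregations (e.g. in Spark or Impala) rather than in-memory lists.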

Advanced Analytical Techniques on Big Data (DS701)

  • Build a model that contains relevant features from a large dataset
  • Define relevant data groupings, including number, size, and characteristics
  • Assign data records from a large dataset into a defined set of data groupings
  • Evaluate goodness of fit for a given set of data groupings and a dataset
  • Apply advanced analytical techniques, such as network graph analysis or outlier detection
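The grouping skills above can be sketched with a tiny k-means implementation: assign records to a defined number of groupings, then score goodness of fit with within-cluster sum of squares. The 1-D data and choice of k=2 are illustrative assumptions only:

```python
import random

# Hypothetical 1-D feature values forming two obvious groupings.
data = [1.0, 1.2, 0.8, 1.1, 9.0, 9.2, 8.8, 9.1]

def kmeans_1d(points, k, iters=20, seed=0):
    """Tiny k-means for 1-D data: assign records to k groupings."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: (p - centers[c]) ** 2)
            groups[i].append(p)
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers, groups

def wcss(centers, groups):
    """Within-cluster sum of squares: a goodness-of-fit measure."""
    return sum((p - c) ** 2 for c, g in zip(centers, groups) for p in g)

centers, groups = kmeans_1d(data, k=2)
print(sorted(round(c, 3) for c in centers), round(wcss(centers, groups), 3))
```

Evaluating wcss for several values of k (an "elbow" plot) is one common way to decide the number of groupings mentioned in the skill list.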

Machine Learning at Scale (DS702)

  • Build a model that contains relevant features from a large dataset
  • Predict labels for an unlabeled dataset using a labeled dataset for reference
  • Select a classification algorithm that is appropriate for the given dataset
  • Tune algorithm metaparameters to maximize algorithm performance
  • Use validation techniques to evaluate how well a given algorithm performs on the given dataset
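A compact sketch of the DS702 workflow: predict labels with a simple classifier (k-nearest neighbors here, chosen only for brevity), tune its metaparameter k, and use holdout validation to score it. The dataset is a made-up, cleanly separable example:

```python
import random

# Hypothetical labeled 1-D dataset: label 0 for small values, 1 for large.
labeled = ([(x / 10.0, 0) for x in range(20)] +
           [(x / 10.0, 1) for x in range(40, 60)])

def knn_predict(train, x, k):
    """Predict a label by majority vote among the k nearest neighbors."""
    nearest = sorted(train, key=lambda pt: abs(pt[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0

def accuracy(train, valid, k):
    hits = sum(knn_predict(train, x, k) == y for x, y in valid)
    return hits / len(valid)

# Holdout validation: shuffle, split, then tune the metaparameter k.
rng = random.Random(0)
data = labeled[:]
rng.shuffle(data)
train, valid = data[:30], data[30:]
best_k = max([1, 3, 5, 7], key=lambda k: accuracy(train, valid, k))
print(best_k, accuracy(train, valid, best_k))
```

On an exam-scale problem, the same select/tune/validate loop would typically use a library such as scikit-learn or SparkML with cross-validation rather than a single holdout split.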

What should you expect?

Required Exams

The three exams may be taken in any order, but all three must be passed within 365 days of each other. Candidates who fail an exam must wait thirty calendar days, beginning the day after the failed attempt, before retaking the same exam. Candidates must pay for each exam attempt.

Each passed exam is verifiable in your exam transcript and history.

Who is this for?

Candidates for the CCP Data Scientist exams should have in-depth experience as practicing data scientists and a high level of mastery of the skills listed above. There are no other prerequisites.

What is the best way to prepare?

The Solution Kit is your best resource to get hands-on experience with a real-world data science challenge in a self-paced, learner-centric environment. It includes a live data set, a step-by-step tutorial, and a detailed explanation of the processes required to arrive at the correct outcomes.


Q. What technologies/languages do I need to know?
A. You'll be provided with a cluster pre-loaded with Hadoop technologies, plus standard tools like Python and R. Among these, it's your choice what to use to solve the problem.

Q. How difficult are the problems?
A. Think of a scaled-down Kaggle problem that’s intended to be solved in hours, not days of effort. If you can solve a Kaggle problem in a weekend, you’re in good shape. You may also take a look at a sample past exam and the solution in our free solution kit.

Q. What should I study to prepare?
A. Coursera's intro "machine learning" course is a good level of preparation, but here are several more links of interest.

Exam Delivery and Cluster Information

All CCP: Data Scientist exams are remote-proctored and available anywhere, anytime. See the FAQ for more information and system requirements.

Exams are hands-on, practical exams using data science tools on Cloudera technologies. Each user is given their own 7-node, high-performance CDH5 (currently 5.3.2) cluster pre-loaded with Spark, Impala, Crunch, Hive, Pig, Sqoop, Kafka, Flume, Kite, Hue, Oozie, DataFu, and many others (see the full list). In addition, the cluster comes with Python (2.6 and 3.4), Perl 5.10, Elephant Bird, Cascading 2.6, Brickhouse, Hive Swarm, Scala 2.11, Scalding, IDEA, Sublime, Eclipse, NetBeans, scikit-learn, Octave, NumPy, SciPy, Anaconda, R, plyr, dplyr, impaladb, SparkML, Vowpal Wabbit, clouderML, Oryx, impyla, CoreNLP, the Stanford Parser, the Stanford Log-linear Part-Of-Speech Tagger, the Stanford Named Entity Recognizer (NER), the Stanford Word Segmenter, OpenNLP, H2O, Java-ML, RapidMiner, Caffe, Weka, NLTK, matplotlib, ggplot, d3py, SparklingPandas, randomForest, ggplot2 (R), and Sparkling Water.

Currently, the cluster is open to the internet and there are no restrictions on tools you can install or websites or resources you may use.