How to Prepare
Q. What technologies/languages do I need to know? A. You'll be provided with a cluster running Hadoop technologies, plus standard tools like Python and R. Among these technologies, it's your choice what to use to solve the problem.
Q. How difficult are the problems? A. Think of a scaled-down Kaggle problem that’s intended to be solved in hours, not days of effort. If you can solve a Kaggle problem in a weekend, you’re in good shape. You may also take a look at a sample past exam and the solution in our free solution kit.
Q. What should I study to prepare?
A. Coursera's introductory "Machine Learning" course is a good level of preparation. Here are several more links of interest:
- General Data Science
- Machine Learning
- Linear Algebra
Exam Delivery and Cluster Information
All CCP: Data Scientist exams are remote-proctored and available anywhere, anytime. See the FAQ for more information and system requirements.
Exams are hands-on and practical, using data science tools on Cloudera technologies. Each user is given their own 7-node, high-performance CDH5 (currently 5.3.2) cluster pre-loaded with Spark, Impala, Crunch, Hive, Pig, Sqoop, Kafka, Flume, Kite, Hue, Oozie, DataFu, and many others (see a full list). The cluster also comes with Python (2.6 and 3.4), Perl 5.10, Elephant Bird, Cascading 2.6, Brickhouse, Hive Swarm, Scala 2.11, Scalding, IDEA, Sublime, Eclipse, NetBeans, scikit-learn, Octave, NumPy, SciPy, Anaconda, R, plyr, dplyrimpaladb, SparkML, Vowpal Wabbit, clouderML, Oryx, impyla, CoreNLP, The Stanford Parser: A statistical parser, Stanford Log-linear Part-Of-Speech Tagger, Stanford Named Entity Recognizer (NER), Stanford Word Segmenter, OpenNLP, H2O, java-ml, RapidMiner, Caffe, Weka, NLTK, matplotlib, ggplot, d3py, SparklingPandas, randomForest, R: ggplot2, Sparkling Water.
Currently, the cluster is open to the internet, and there are no restrictions on the tools you can install or the websites and resources you may use.