Cloudera Primary User Personas

This topic describes the primary user personas that Cloudera has defined. Each persona is a character based on real people and represents a user type. Together, the personas help define the goals and activities of typical users of Cloudera products. Defining personas for software products is a moving target because user types change over time. This collection is the result of a 2018 study that collected data from about fifteen leaders in Cloudera product management and engineering. These primary personas are being validated with customers to ensure their accuracy and will be updated as needed.

Infrastructure

The personas in this group use either Cloudera Manager or Altus to manage CDH clusters on-premises or in the cloud.

Jim — Senior Hadoop Administrator



Skills and Background

  • Very strong knowledge of HDFS and Linux administration
  • Understanding of:
    • Distributed/grid computing
    • VMs and their capabilities
    • Racks, disk topologies, and RAID
    • Hadoop architecture
  • Proficiency in Java

Tools:

Cloudera

  • Cloudera Manager/CDH
  • Navigator
  • BDR
  • Workload XM

Third-party Tools: Configuration management and log monitoring tools, for example, Puppet, Chef, Splunk, Ganglia, or Grafana

Goals:

  • Achieve consistent high availability and performance on Hadoop clusters
  • Administer users, including creating new users and updating access control rights on demand

Typical Tasks:

  • Monitor cluster performance to ensure a high percentage of uptime
  • Back up and replicate appropriate files to ensure disaster recovery
  • Schedule and perform cluster upgrades
  • Enable security services and configurations, and check their status
  • Analyze query performance with Workload XM to ensure optimum cluster performance
  • Provision new clusters

Jen — Junior Hadoop Administrator



Skills and Background

  • Basic knowledge of HDFS
  • Limited knowledge of Linux (shell scripting mostly)
  • General understanding of:
    • Distributed/grid computing
    • VMs and their capabilities
    • Racks, disk topologies, and RAID
    • Hadoop architecture

Tools:

Cloudera

  • Cloudera Manager/CDH
  • Navigator
  • Workload XM

Third-party Tools: Configuration management and log monitoring tools, for example, Puppet, Chef, Splunk, Ganglia, or Grafana

Goals:

  • Maintain high availability and performance of Hadoop clusters

Typical Tasks:

  • Perform basic procedures to ensure clusters are up and running
  • Perform maintenance workflows

Sarah — Cloud Administrator



Skills and Background

  • Understands public cloud primitives, for example, Virtual Private Cloud
  • Understands security access policies, for example, Identity and Access Management
  • Proficiency in Java

Tools:

Cloudera

  • Altus

Third-party Tools: Amazon Web Services, Microsoft Azure

Goals:

  • Maintain correct access to cloud resources
  • Maintain correct allocation of cloud resources, such as account limits

Typical Tasks:

  • Create the Altus environment for the organization

Data Ingest, ETL, and Metadata Management

The personas in this group typically use Navigator, Workload XM, HUE, Hive, Impala, and Spark.

Terence — Enterprise Data Architect or Modeler



Skills and Background

  • Experience with:
    • ETL process
    • Data munging
    • Wide variety of data wrangling tools

Tools:

Cloudera

  • Navigator
  • Workload XM
  • HUE
  • Hive
  • Impala
  • Spark

Third-party Tools: ETL and other data wrangling tools

Goals:

  • Maintain an organized and optimized enterprise data architecture to support business needs
  • Ensure that data models support improved data management and consumption
  • Maintain efficient schema design

Typical Tasks:

  • Organize data at the macro level: set architectural principles, create data models, create key entity diagrams, and create a data inventory to support business processes and architecture
  • Organize data at the micro level: create data models for specific applications
  • Map organization use cases to execution engines (Impala, Spark, Hive)
  • Provide logical data models for the most important data sets, consuming applications, and data quality rules
  • Provide data entity descriptions
  • Ingest new data into the system: use ingest tools, and monitor ingestion rate, data formatting, and partitioning strategies

Kara — Data Steward and Data Curator



Skills and Background

  • Experience with:
    • ETL process
    • Data wrangling tools

Tools:

Cloudera

  • Navigator
  • HUE data catalog

Third-party Tools: ETL and other data wrangling tools

Goals:

  • Maintain metadata (technical and custom)
  • Maintain data policies to support business processes
  • Maintain data lifecycle at Hadoop scale
  • Maintain data access permissions

Typical Tasks:

  • Manage technical metadata
  • Classify data at Hadoop scale
  • Create and manage custom and business metadata using policies or third-party tools that integrate with Navigator

Analytics and Machine Learning

The personas in this group typically use Cloudera Data Science Workbench (CDSW), HUE, HDFS, and HBase.

Song — Data Scientist



Skills and Background

  • Statistics
  • Related scripting tools, for example, R
  • Machine learning models
  • SQL
  • Basic programming

Tools:

Cloudera

  • CDSW
  • HUE to build and test queries before adding to CDSW
  • HDFS
  • HBase

Third-party Tools: R, SAS, SPSS, Tableau, Qlik, and others. Scripting and programming languages such as Scala, Python, and some Java

Goals:

  • Solve business problems by applying advanced analytics and machine learning in an ad hoc manner

Typical Tasks:

  • Access, explore, and prepare data by joining and cleaning it
  • Define data features and variables to solve business problems (feature engineering)
  • Select and adapt machine learning models or write algorithms to answer business questions
  • Tune data model features and hyperparameters while running experiments
  • Publish the optimized model as an API that BI Analysts or Data Owners can use in their reporting
  • Publish data model results to answer business questions for consumption by Data Owners and BI Analysts
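
As a rough illustration of the first four tasks above, the following Python sketch joins and cleans two small tables, derives a feature, and tunes one hyperparameter. Everything in it is hypothetical and is not taken from any Cloudera product: the sample records, the field names (customer_id, visits, spend, churned), the spend_per_visit feature, and the one-rule threshold "model" are assumptions chosen only to keep the example self-contained. It also scores the model on the same rows it tunes on, which a real experiment would replace with held-out data.

```python
# Hypothetical sample tables keyed by customer_id.
customers = [
    {"customer_id": 1, "region": "east"},
    {"customer_id": 2, "region": "west"},
    {"customer_id": 3, "region": "east"},
    {"customer_id": 4, "region": "west"},
    {"customer_id": 5, "region": "east"},
]
activity = [
    {"customer_id": 1, "visits": 10, "spend": 200.0, "churned": 0},
    {"customer_id": 2, "visits": 2, "spend": 30.0, "churned": 1},
    {"customer_id": 3, "visits": None, "spend": 50.0, "churned": 1},  # missing value
    {"customer_id": 4, "visits": 8, "spend": 40.0, "churned": 1},
    {"customer_id": 5, "visits": 12, "spend": 300.0, "churned": 0},
]


def join_and_clean(customers, activity):
    """Inner-join the tables on customer_id and drop rows with missing values."""
    by_id = {c["customer_id"]: c for c in customers}
    rows = []
    for record in activity:
        customer = by_id.get(record["customer_id"])
        if customer is None:
            continue  # no matching customer: excluded by the inner join
        row = {**customer, **record}
        if any(value is None for value in row.values()):
            continue  # cleaning step: skip incomplete rows
        rows.append(row)
    return rows


def add_features(rows):
    """Feature engineering: derive spend per visit (assumes visits > 0)."""
    return [{**r, "spend_per_visit": r["spend"] / r["visits"]} for r in rows]


def accuracy(rows, threshold):
    """Score a one-rule model: predict churn when spend_per_visit < threshold."""
    correct = sum(
        (r["spend_per_visit"] < threshold) == bool(r["churned"]) for r in rows
    )
    return correct / len(rows)


def tune(rows, thresholds):
    """Grid search: return the candidate threshold with the highest accuracy."""
    return max(thresholds, key=lambda t: accuracy(rows, t))


rows = add_features(join_and_clean(customers, activity))
best = tune(rows, [10, 18, 30])
print(best, accuracy(rows, best))
```

In practice Song would more likely run these steps in CDSW with R or Python libraries against data in HDFS or HBase; the plain-Python version only shows the shape of the workflow.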

Jason — Machine Learning Engineer



Skills and Background

  • Machine learning and big data skills
  • Software engineering

Tools:

Cloudera

  • Spark
  • HUE to build and test queries before adding to application
  • CDSW

Third-party Tools: Java

Goals:

  • Build and maintain production machine learning applications

Typical Tasks:

  • Set up big data machine learning projects at companies such as Facebook

Cory — Data Engineer



Skills and Background

  • Software engineering
  • SQL mastery
  • ETL design and big data skills
  • Machine learning skills

Tools:

Cloudera

  • CDSW
  • Spark/MapReduce
  • Hive
  • Oozie
  • Altus Data Engineering
  • HUE
  • Workload XM

Third-party Tools: IDE, Java, Python, Scala

Goals:

  • Create data pipelines (about 40% of working time)
  • Maintain data pipelines (about 60% of working time)

Typical Tasks:

  • Create data workflow paths
  • Create code repository check-ins
  • Create XML workflows for production system launches

Sophie — Application Developer



Skills and Background

  • Deep knowledge of software engineering to build real-time applications

Tools:

Cloudera

  • HBase

Third-party Tools: Various software development tools

Goals:

  • Ensure that developed applications run and successfully send workloads to the cluster, for example, by connecting a front end to HBase on the cluster

Typical Tasks:

  • Develop application features. Sophie does not write the SQL workload; instead, she writes the application that sends workloads to the cluster.
  • Test applications to ensure they run successfully

Abe — SQL Expert/SQL Developer



Skills and Background

  • Deep knowledge of SQL dialects and schemas

Tools:

Cloudera

  • HUE
  • Cloudera Manager to monitor Hive queries
  • Hive via command line or HUE
  • Impala via HUE, another BI tool, or the command line
  • Navigator via HUE
  • Sentry via HUE
  • Workload XM via HUE

Third-party Tools: SQL Studio, TOAD

Goals:

  • Create workloads that perform well and that return the desired results

Typical Tasks:

  • Create query workloads that applications send to the cluster
  • Ensure optimal performance of query workloads by monitoring the query model and partitioning strategies
  • Prepare and test queries before they are added to applications

Kiran — SQL Analyst/SQL User



Skills and Background

  • Has high-level grasp of SQL concepts, but prefers to drag and drop query elements
  • Good at data visualization, but prefers pre-populated tables and queries

Tools:

Cloudera

  • HUE
  • Cloudera Manager to monitor queries
  • Oozie to schedule workloads
  • Impala (rather than Hive)

Third-party Tools: Reporting and business intelligence tools such as Cognos and Crystal Reports

Goals:

  • Answer business questions and solve business problems based on data

Typical Tasks:

  • Create query workloads that applications send to the cluster
  • Ensure optimal performance of queries (query model, partitioning strategies)

Christine — BI Analyst



Skills and Background

  • Ability to:
    • View reports and drill down into results of interest
    • Tag, save, and share reports and results

Tools:

Cloudera

  • HUE
  • Navigator via HUE

Third-party Tools: SQL query tools, Tableau, Qlik, Excel

Goals:

  • Apply data preparation and analytic skills to solve recurring business problems, for example, creating a weekly sales report
  • Provide reports for the Business/Data Owner

Typical Tasks:

  • Access, explore, and prepare data by joining and cleaning it
  • Create reports that satisfy requests from business stakeholders and help solve business problems