Cloudera Glossary

This is a reference list of terms related to Cloudera products and services. Additional information is available from a number of resources.

access control list (ACL)

A list of permissions associated with an object in a computer file system. An ACL specifies which users or processes are allowed to access an object, and what operations can be performed.
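
As a minimal sketch of the idea, the following Python models an ACL as a nested mapping from object to principal to permitted operations. The path, the principal names, and the `is_allowed` helper are hypothetical and are not part of any Cloudera or HDFS API.

```python
# Hypothetical ACL: maps a file path to the operations each principal may perform.
acl = {
    "/data/sales.csv": {
        "alice": {"read", "write"},
        "etl_service": {"read"},
    }
}

def is_allowed(path, principal, operation):
    """Return True if the ACL for `path` grants `operation` to `principal`."""
    return operation in acl.get(path, {}).get(principal, set())

print(is_allowed("/data/sales.csv", "alice", "write"))
print(is_allowed("/data/sales.csv", "etl_service", "write"))
```

An object or principal that has no entry is denied by default in this sketch.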

Accumulo

A sorted, distributed key-value store based on the Google BigTable design. Apache Accumulo is a NoSQL DBMS that operates over HDFS, and supports efficient storage and retrieval of structured data, including queries for ranges. Accumulo tables can be used as input and output for MapReduce jobs. Accumulo includes automatic load-balancing and partitioning, data compression, and fine-grained security labels.

action

In Spark, a function that returns a value to the driver after running a computation on an RDD.

Apache

See Apache Software Foundation.

Apache Incubator

Apache Software Foundation gateway for open-source projects that aim to become Apache projects. Incubating projects are open source and may or may not become Apache projects.

Apache Software Foundation (ASF)

A non-profit corporation that supports various open-source software products, including Apache Hadoop and related projects on which Cloudera products are based. Apache projects are developed by teams of collaborators and protected by an ASF license that provides legal protection to volunteers who work on Apache products and protects the Apache brand name.

Apache projects are characterized by a collaborative, consensus-based development process and an open and pragmatic software license. Each project is managed by a self-selected team of technical experts who are active contributors to the project.

Cloudera employees are major contributors to many Apache projects.

application JAR

A JAR containing a Spark application. In some cases you can use an "uber" JAR containing your application with its dependencies. The JAR should never include Hadoop or Spark libraries, however, because these are added at run time.

authentication

The function of confirming the identity of a person or software program.

authorization

The function of specifying access rights to resources.

Avro

A serialization system for storing and transmitting data over a network. Apache Avro supports rich data structures, a compact binary encoding, and a container file for sequences of Avro data (often referred to as Avro data files). Avro is language-independent and several language bindings are available, including Java, C, C++, Python, and Ruby. All components in CDH that produce or consume files support Avro data files.

Avro provides functionality similar to systems such as Thrift and Protocol Buffers.

Beeswax

A Hue application that enables you to perform queries on Hive. You can create Hive tables, load data, and run and manage Hive queries.

big data

Data sets in which the input/output velocity, variety of data structure, and volume exceed the capabilities of systems which were designed for smaller data sets to capture, manage, and process the data within a tolerable elapsed time. Big data sizes are expanding, currently ranging from terabytes to many petabytes in a single data set.

BigTable

A compressed, high-performance, column-oriented database built on Google File System (GFS). The BigTable design was the inspiration for HBase and Accumulo, but the implementation, unlike other Google projects such as Protocol Buffers, is proprietary.

Bigtop

An Apache project to develop the packaging and interoperability testing of the Apache Hadoop ecosystem projects.

BDR

Backup and Disaster Recovery. CDH includes several features for backing up data that can also support disaster recovery. See snapshots and replication.

custom metadata

In Cloudera Navigator, metadata that is added to extracted entities. You can add and modify custom metadata before or after entities are extracted.

CDH

Cloudera distribution containing core Hadoop (HDFS, MapReduce, YARN) and the following related projects: Avro, Flume, Fuse-DFS, HBase, Hive, Hue, Impala, Mahout, Oozie, Pig, Cloudera Search, Sentry, Spark, Sqoop, Whirr, ZooKeeper, DataFu, and Kite.

CDH is free, 100% open source, and licensed under the Apache 2.0 license. CDH is supported on many Linux distributions.

Cloudera Enterprise

Essentials Edition offers an enterprise-ready distribution of CDH together with Cloudera Manager and other advanced management tools and technical support for core Apache Hadoop.
Data Science and Engineering Edition offers an enterprise-ready distribution of CDH together with Cloudera Manager and other advanced management tools and technical support for programmatic data preparation and predictive modeling.
Operational Database Edition offers an enterprise-ready distribution of CDH together with Cloudera Manager and other advanced management tools and technical support for online applications with real-time serving needs.
Data Warehouse Edition offers an enterprise-ready distribution of CDH together with Cloudera Manager and other advanced management tools and technical support for BI and SQL analytics.
Enterprise Data Hub Edition offers an enterprise-ready distribution of CDH together with Cloudera Manager and other advanced management tools and technical support for complete use of the platform.

Cloudera Express

A free download that contains CDH and Cloudera Manager, which offers robust cluster management capabilities like automated deployment, centralized administration, monitoring, and diagnostic tools. Cloudera Express enables data-driven enterprises to evaluate CDH and Cloudera Manager.

Cloudera Manager

An end-to-end management application for CDH, Impala, and Cloudera Search. Cloudera Manager enables administrators to easily and effectively provision, monitor, and manage Hadoop clusters and CDH installations. Cloudera Manager is available in two versions: Cloudera Express and Cloudera Enterprise.

Cloudera Navigator

A fully integrated data management and security tool for the Hadoop platform. Cloudera Navigator provides three categories of functionality:

Auditing data access and verifying access privileges. Cloudera Navigator allows administrators to configure, collect, and view audit events, and generate reports that list the HDFS access permissions granted to groups. Cloudera Navigator tracks access permissions and actual accesses to all entities in HDFS, Hive, HBase, Hue, Impala, Sentry, and Solr.
Searching metadata and visualizing lineage. Metadata management features allow DBAs, data modelers, business analysts, and data scientists to search for, amend the properties of, and tag data entities. Cloudera Navigator supports tracking the lineage of HDFS files, datasets, and directories, Hive tables and columns, MapReduce and YARN jobs, Hive queries, Impala queries, Pig scripts, Oozie workflows, Spark jobs, and Sqoop jobs.
Securing data and simplifying storage and management of encryption keys. Data encryption and key management provide protection against potential threats by malicious actors on the network or in the datacenter. It is also a requirement for meeting key compliance initiatives and ensuring the integrity of enterprise data.

Cloudera Search

A fully integrated search tool for the Apache Hadoop platform that integrates Apache Solr, including Apache Lucene, Apache SolrCloud, and Apache Tika, with CDH. Cloudera Search makes searching more scalable, easy to use, and optimized for both near-real-time and batch-oriented indexing.

cluster

A set of computers or racks of computers that contains an HDFS filesystem and runs MapReduce and other processes on that data. A pseudo-distributed cluster is a CDH installation run on a single machine and useful for demonstrations and individual study.
In Cloudera Manager, a logical entity that contains a set of hosts, a single version of CDH installed on the hosts, and the service and role instances running on the hosts. A host can belong to only one cluster. Cloudera Manager can manage multiple CDH clusters; however, each cluster can be associated with only a single Cloudera Manager Server or Cloudera Manager HA pair.

cluster manager

An external service for acquiring resources on the cluster: Spark Standalone or YARN.

commit

An operation in Cloudera Search that makes documents searchable.

hard - A commit that starts the autowarm process, closes old searchers, and opens new ones. It may also trigger replication.
soft - Functionality with NRT and SolrCloud that makes documents searchable without requiring hard commits.

compression

A mechanism to reduce the size of a file so that it takes up less disk space for storage and consumes less network bandwidth when transferred. Common compression tools used with Apache Hadoop include gzip, bzip2, Snappy, and LZO.
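
The size reduction can be observed with the gzip and bzip2 codecs that ship in the Python standard library (Snappy and LZO require third-party packages and are not shown). This is a small sketch on made-up log data; actual ratios depend heavily on the input.

```python
import bz2
import gzip

# Repetitive sample data compresses well; real-world ratios vary with the input.
original = b"2024-01-01 INFO request served\n" * 1000

gzip_bytes = gzip.compress(original)
bz2_bytes = bz2.compress(original)

print(len(original), len(gzip_bytes), len(bz2_bytes))

# Both codecs are lossless: decompressing restores the original bytes.
assert gzip.decompress(gzip_bytes) == original
assert bz2.decompress(bz2_bytes) == original
```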

container

A resource bucket and process space for a task. A container's resources consist of vcores and memory.

connector

Usually refers to software for connecting external systems with Apache Hadoop. Some connectors work with Apache Sqoop to enable efficient data transfer between an external system and Hadoop. Other connectors translate ODBC driver calls from business intelligence systems into HiveQL queries.

The JDBC drivers supported by Cloudera Manager are also referred to as connectors.

Crunch

An Apache Java library that can be used to write, test, and run MapReduce pipelines.

custom metadata

In Cloudera Navigator, descriptions, key-value pairs, and tags that can be added to entities such as HDFS files, Hive tables, and YARN operations. You can add and modify custom metadata before and after entities are extracted.

Data Encryption Key (DEK)

The encryption/decryption key assigned to a file in an encryption zone. Each file has its own DEK, and these DEKs are never stored persistently unless they are encrypted with the encryption zone's key.

data science

A discipline that builds on techniques and theories from many fields, including mathematics, statistics, and computer science, with the goal of extracting meaning from data and creating data products.

DataFu

A collection of Pig user-defined functions (UDFs) for statistical analysis.

DataNode

See Hadoop Distributed File System (HDFS).

datastore

A repository of a set of integrated information objects. Datastores include repositories such as databases and files.

dataset

A collection of records, similar to a relational database table. Records are similar to table rows, but the columns can contain not only strings or numbers, but also nested data structures such as lists, maps, and other records.

DDL

A category of SQL statements that affect database state rather than table data. Includes all the CREATE, ALTER, and DROP statements.

deployment

A configuration of Cloudera Manager and all the clusters it manages.

distributed system

A system composed of multiple autonomous computers that communicate through a computer network.

DML

A category of SQL statements that change table data, such as INSERT and LOAD DATA.
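
The difference between DML and DDL statements can be sketched with Python's built-in sqlite3 module. SQLite is used here only as a convenient stand-in: it is not Hive or Impala, it has no LOAD DATA statement, and the `events` table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: defines the table; no rows exist yet.
conn.execute("CREATE TABLE events (id INTEGER, name TEXT)")

# DML: changes the data stored in the table.
conn.execute("INSERT INTO events VALUES (1, 'login')")
conn.execute("INSERT INTO events VALUES (2, 'logout')")

row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(row_count)

# DDL again: removes the table definition along with its data.
conn.execute("DROP TABLE events")
```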

driver

In Apache Spark, a process that represents an application session. The driver is responsible for converting the application to a directed graph of individual steps to execute on the cluster. There is one driver per application.

dynamic resource pool

In Cloudera Manager, a named configuration of resources and a policy for scheduling the resources among YARN applications or Impala queries running in the pool.

embedded Solr

Provides the ability to execute Solr commands without having a separate servlet container. Use of embedded Solr is generally discouraged, particularly if used because HTTP is assumed to be too slow. However, in Cloudera Search, particularly if a MapReduce process is adopted, embedded Solr is advisable.

Encrypted Data Encryption Key (EDEK)

An encrypted DEK, which is stored persistently as part of the file's metadata on the NameNode.

encryption

The encoding of information so that only authorized users are permitted to read it.

encryption zone

A directory in HDFS in which every file and subdirectory is encrypted. The files in this directory are transparently encrypted on write and transparently decrypted on read. Each encryption zone is associated with a key that is specified when the zone is created.

encryption zone key

Key used to encrypt EDEKs. When a new file is created in an encryption zone, the NameNode sends a request to the KMS to generate a new EDEK encrypted with the encryption zone key. When reading a file from an encryption zone, the NameNode provides the client with the file's EDEK and the encryption zone key version used to encrypt the EDEK. The client then sends a request to the KMS to decrypt the EDEK. If successful, the client uses the DEK to decrypt the file contents.
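
The write and read paths described above follow an envelope-encryption pattern, which the sketch below imitates in plain Python. Everything here is an assumption made for illustration: `toy_cipher` is a deliberately simple XOR scheme that is not secure, and the variables only stand in for the KMS, the NameNode metadata, and the client. It is not the HDFS implementation.

```python
import hashlib
import secrets

def toy_cipher(key: bytes, data: bytes) -> bytes:
    """XOR data with a SHA-256-derived keystream. Illustrative only, NOT secure."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

# Held by the (hypothetical) KMS only; never sent to the client.
zone_key = secrets.token_bytes(32)

# Write path: a fresh DEK is generated and wrapped with the zone key.
dek = secrets.token_bytes(32)
edek = toy_cipher(zone_key, dek)             # stored in the file's metadata
ciphertext = toy_cipher(dek, b"secret row")  # stored as the file contents

# Read path: the KMS unwraps the EDEK, and the client decrypts with the DEK.
recovered_dek = toy_cipher(zone_key, edek)
plaintext = toy_cipher(recovered_dek, ciphertext)
print(plaintext)
```

Because XOR is its own inverse, the same function both wraps and unwraps in this sketch; a real deployment would use an authenticated cipher.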

Enterprise Data Hub

An enterprise data hub (EDH), built on Apache Hadoop, provides a single central system for the storage and management of all data in the enterprise. An EDH runs the full range of workloads that enterprises require, including batch processing, interactive SQL, enterprise search, and advanced analytics, together with the integrations to existing systems, robust security, data management, and data protection.

executor

A process that serves a Spark application. An executor runs multiple tasks over its lifetime, and multiple tasks concurrently. A host may have several Spark executors and there are many hosts running Spark executors for each application.

expression

A construct that allows certain policy properties to be specified programmatically using Java, instead of string literals.

extract, load, transform (ELT)

A variation of Extract, Transform, Load (ETL). The process of transferring data from a source to an end target (a database or data warehouse), and then transforming the data as required.

extract, transform, load (ETL)

A process that involves extracting data from sources, transforming the data to fit operational needs, and loading the data into the end target, typically a database or data warehouse.
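
The three ETL stages can be sketched with the Python standard library. The CSV content, the `payments` table, and the cleanup rules are made up for illustration; an in-memory string and an in-memory SQLite database stand in for the real source and target.

```python
import csv
import io
import sqlite3

# Extract: read rows from a source (an in-memory CSV stands in for a real file).
source = io.StringIO("name,amount\nalice, 10 \nbob, 32 \n")
rows = list(csv.DictReader(source))

# Transform: trim whitespace and convert amounts to integers.
cleaned = [(r["name"].strip(), int(r["amount"])) for r in rows]

# Load: write the transformed rows into the target database.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE payments (name TEXT, amount INTEGER)")
target.executemany("INSERT INTO payments VALUES (?, ?)", cleaned)

total = target.execute("SELECT SUM(amount) FROM payments").fetchone()[0]
print(total)
```

In an ELT variant, the raw rows would be loaded first and the cleanup would run inside the target system.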

facet

In Cloudera Manager and Cloudera Navigator, an explicit dimension of an entity that enables it to be accessed and filtered in multiple ways. Facets correspond to entity properties.

faceting

Arrangement of query results into categories, usually with counts for each category. You can use these categories to explore and further restrict search results to find the information you need.
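
A rough illustration of faceting in plain Python, using a hypothetical result set and a hypothetical `type` field. A search engine computes these counts itself; this sketch only shows the shape of the output.

```python
from collections import Counter

# Hypothetical search results, each tagged with a file type.
results = [
    {"title": "Q1 report", "type": "pdf"},
    {"title": "Q2 report", "type": "pdf"},
    {"title": "Sales data", "type": "csv"},
]

# Facet on "type": group the results into categories and count each one.
facet_counts = Counter(doc["type"] for doc in results)
print(facet_counts.most_common())

# Selecting a facet value restricts the result set further.
pdf_only = [doc for doc in results if doc["type"] == "pdf"]
```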

fault-tolerant design

A design that enables a system to continue operation, possibly at a reduced level instead of failing completely, when some part of the system fails.

field-level

Level at which encryption and data masking can be applied. When protection is applied at this level, it is generally applied only to specific sensitive fields, such as credit card numbers, social security numbers, or names, not to all data.

Flume

A distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of text or streaming data from many different sources to a centralized datastore. Apache Flume is robust and fault tolerant and uses a simple, extensible data model that allows for online analytic application.

filesystem-level

Level at which encryption can be applied to protect some or all files in a volume.

filter query (fq)

A clause that limits returned results in Cloudera Search. For example, “fq=sex:male” limits results to males. Filter queries are cached and reused.
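
As a sketch of how such a clause is sent, the following builds a Solr select URL that combines a main query (`q`) with a filter query (`fq`). The host, port, and collection name are hypothetical; `q` and `fq` are standard Solr request parameters.

```python
from urllib.parse import urlencode

# Hypothetical host and collection name; q and fq are standard Solr parameters.
base_url = "http://search.example.com:8983/solr/people/select"

params = [
    ("q", "name:smith"),  # main query, which determines relevance scoring
    ("fq", "sex:male"),   # filter query, which only restricts the result set
]
query_string = urlencode(params)
request_url = f"{base_url}?{query_string}"
print(request_url)
```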

Fuse-DFS

A service that allows HDFS to be mounted on Linux and accessed using standard filesystem tools.

gateway

A type of role that typically provides client access to specific cluster services. For example, HDFS, Hive, Kafka, MapReduce, Solr, and Spark each have gateway roles to provide access for their clients to their respective services. Gateway roles do not always have "gateway" in their names, nor are they exclusively for client access. For example, Hue Kerberos Ticket Renewer is a gateway role that proxies tickets from Kerberos.

The node supporting one or more gateway roles is sometimes referred to as the gateway node or edge node, with the notion of "edge" common in network or cloud environments. In terms of the Cloudera cluster, the gateway nodes in the cluster receive the appropriate client configuration files when Deploy Client Configuration is selected from the Actions menu in Cloudera Manager Admin Console.

Giraph

A large-scale, fault-tolerant, graph-processing framework that runs on Apache Hadoop. Apache Giraph features include master computation, sharded aggregators, edge-oriented input, and out-of-core computation.

HA

See high availability.

Hadoop

A free, open source software framework that supports data-intensive distributed applications. The core components of Apache Hadoop are the HDFS and the MapReduce and YARN processing frameworks. The term is also used for an ecosystem of projects related to Hadoop, under the umbrella of infrastructure for distributed computing and large-scale data processing.

Hadoop Distributed File System (HDFS)

A user space filesystem designed for storing very large files with streaming data access patterns, running on clusters of industry-standard machines. HDFS defines three components:

NameNode - Maintains the namespace tree for HDFS and a mapping of file blocks to DataNodes where the data is stored. A simple HDFS cluster can have only one primary NameNode, supported by a secondary NameNode that periodically merges the NameNode edits log file, which contains a list of HDFS metadata modifications, into the filesystem image. This reduces the amount of disk space consumed by the log file on the NameNode, which also reduces the restart time for the primary NameNode. A high availability cluster contains two NameNodes: active and standby.
DataNode - Stores data in a Hadoop cluster and is the name of the daemon that manages the data. File data is replicated on multiple DataNodes for reliability and so that localized computation can be executed near the data.
JournalNode - Maintains a directory to log the modifications to the namespace metadata when using the Quorum-based Storage mechanism for providing high availability. During failover, the NameNode standby ensures that it has applied all of the edits from the JournalNodes before promoting itself to the active state.

Hadoop User Group (HUG)

A club focused on the use of Hadoop technology.

Hadoop World

An industry conference for Hadoop users, contributors, administrators, and application developers.

HBase

A scalable, distributed, column-oriented datastore. Apache HBase provides real-time read/write random access to very large datasets hosted on HDFS.

HBaseCon

An industry conference for Apache HBase users, contributors, administrators, and application developers.

HDFS

See Hadoop Distributed File System.

heterogeneous storage

A storage framework that supports multiple storage types (ARCHIVE, DISK, SSD, RAM_DISK) determined by a storage policy.

high availability (HA)

A system and implementation design to keep a service available at all times in case of failure, without regard to its performance.

Hive

An Apache data warehouse system for Hadoop that facilitates summarization and the analysis of large datasets stored in HDFS using an SQL-like language called HiveQL.

HiveServer

A server process that supports clients that connect to Hive over a Thrift connection. The name also refers to a Thrift protocol used by both Impala and Hive.

HiveServer2

A server process that supports clients that connect to Hive over a network connection. These clients can be native command-line editors or applications and tools that use an ODBC or JDBC driver. The name also refers to a Thrift protocol used by both Impala and Hive.

HiveQL

The name of the SQL dialect used by the Hive component. It uses a syntax that is similar to standard SQL to execute MapReduce jobs on HDFS. HiveQL does not support all SQL functionality. Transactions and materialized views are not supported, and support for indexes and subqueries is limited. It supports features that are not part of standard SQL, such as multitable inserts and CREATE TABLE AS SELECT.

Internally, a compiler translates a HiveQL statement into a directed acyclic graph of MapReduce jobs, which are submitted to Hadoop for execution. Beeswax, which is included in Hue, provides a graphical front end for HiveQL queries.

host

In Cloudera Manager, a physical or virtual machine that runs role instances. A host can belong to only one cluster.

host template

A set of role groups in Cloudera Manager. When a template is applied to a host, a role instance from each role group is created and assigned to that host.

Hue

A platform for building custom GUI applications for CDH services and a tool containing the following built-in applications: an application for submitting jobs, Apache Pig, Apache HBase, and Sqoop 2 shells, Pig Editor, the Beeswax Hive UI, Impala query editor, Solr Search application, Hive metastore manager, Oozie application editor, scheduler, and submitter, Apache HBase Browser, Sqoop 2 application, HDFS file manager, and MapReduce and YARN job browser.

Impala

Official name: Apache Impala. A service that enables real-time querying of data stored in HDFS or HBase. It supports the same metadata and ODBC and JDBC drivers as Apache Hive and a query language based on the Hive Standard Query Language (HiveQL). To avoid latency, Impala circumvents MapReduce to directly access data through a specialized distributed query engine that is similar to those found in commercial parallel RDBMS.

Incubator

See Apache Incubator.

index

A data structure that improves the speed of data retrieval on a database table at the cost of slower writes and increased storage space. Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access of ordered records.
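
The trade-off can be sketched in plain Python with a hypothetical three-row table. The index here is a simple hash map from column value to row positions, which supports fast lookups but not ordered access; database indexes are commonly tree-based so that they can serve both.

```python
from collections import defaultdict

# A small "table" of rows; a real table would hold far more.
rows = [
    {"id": 1, "city": "Austin"},
    {"id": 2, "city": "Boston"},
    {"id": 3, "city": "Austin"},
]

# Build an index on the "city" column: column value -> positions of matching rows.
# Building and maintaining it costs extra memory and extra work on every write.
city_index = defaultdict(list)
for position, row in enumerate(rows):
    city_index[row["city"]].append(position)

# A lookup now jumps straight to the matching rows instead of scanning the table.
austin_ids = [rows[pos]["id"] for pos in city_index["Austin"]]
print(austin_ids)
```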

JDBC driver

A client-side adapter that implements the JDBC Java programming language API for accessing relational database management systems.

job

A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action.

JobTracker

See MapReduce v1 (MRv1).

JournalNode

See Hadoop Distributed File System.

Kafka

A distributed publish-subscribe messaging system that provides high throughput for publishing and subscribing, as well as replication to prevent data loss. Apache Kafka is frequently used for log collection and stream processing and often (but not exclusively) used in tandem with Hadoop, Apache Storm, and Spark Streaming.

Kerberos

An authentication protocol developed at MIT in the 1980s and in wide use since. Kerberos Version 5 was first specified in 1993 (RFC 1510) and revised by the IETF in 2005 (RFC 4120). Cloudera recommends using Kerberos to secure clusters, by integrating either MIT Kerberos or Microsoft Active Directory (which uses Kerberos). With Kerberos enabled, user authentication is required. Once users authenticate, other components of the cluster can also be leveraged (for example, Sentry's role-based access privileges) to provide appropriate, secure access to the cluster. See Cloudera Security for more information.

key material

The portion of the key used during encryption and decryption.

Key Management Server (KMS)

Hadoop service that interfaces with a backing key store on behalf of HDFS daemons and clients. Both the backing key store and the KMS implement the Hadoop KeyProvider client API.

Kite

A collection of libraries, tools, examples, and documentation engineered to simplify the most common tasks when working with CDH. Just like CDH, Kite is 100% free, open source, and licensed under the Apache License v2, so you can use the code any way you choose in your existing commercial code base or open source project.

Kudu

A columnar storage manager for the Hadoop platform. Like other Hadoop ecosystem applications, Apache Kudu runs on commodity hardware, is horizontally scalable, and supports highly available operation.

Fast processing, integration with Hadoop ecosystem components, high availability, and other benefits make it ideal for a variety of applications: reporting applications where new data must be immediately available for end users; time-series applications that must support queries across large amounts of historic data while simultaneously returning granular queries about an individual entity; and applications that use predictive models to make real-time decisions.

latency

A measure of time delay experienced in a system.

lineage diagram

In Cloudera Navigator, a directed graph that depicts an entity and its relationship with other entities.

Linux

A Unix-like computer operating system assembled under the model of free, open-source software development and distribution. Linux is a leading operating system on servers, mainframe computers, supercomputers, and embedded systems such as mobile phones, tablets, network routers, televisions, and video game consoles. The major distributions of enterprise Linux are CentOS, Debian, RHEL, SLES, and Ubuntu.

LZO

A free, open source compression library. LZO compression provides a good balance between data size and speed of compression. The LZO compression algorithm is the most efficient of the codecs, using very little CPU. Its compression ratios are not as good as others, but its compression is still significant compared to the uncompressed file sizes. Unlike some other formats, LZO compressed files are splittable, enabling MapReduce to process splits in parallel.

LZO is published under the GNU General Public License and so is not included in CDH but can be used with CDH components; the Cloudera public Git repository hosts the hadoop-lzo package that provides a version of LZO that can be used with CDH.

machine learning

A field of computer science that explores the construction and study of algorithms that can learn from and make predictions on data. Categories of machine learning problems include:

Recommendation mining - Identifies things users will like based on past preferences; for example, online shopping recommendations.
Clustering - Groups similar items; for example, documents that address similar topics.
Classification - Learns what members of existing categories have in common and then uses that information to categorize new items.
Frequent item-set mining - Examines a set of item-groups (such as items in a query session or shopping cart content) and identifies items that usually appear together.

Libraries implementing machine learning algorithms include:

Mahout

A machine learning library for Hadoop that is scalable to large datasets, thereby simplifying the task of building intelligent applications. Apache Mahout also provides Java libraries for common maths operations and primitive Java collections.

managed metadata

In Cloudera Navigator, key-value pairs that can be added to entities such as HDFS files, Hive tables, and YARN operations. Managed metadata key-value pairs are similar to custom metadata key-value pairs, but can also define the keys within a namespace and enforce conformance to value constraints (for example, require the value to be a date). You can add and modify managed metadata after entities are extracted.

MapReduce

A distributed processing framework for processing and generating large data sets and an implementation that runs on large clusters of industry-standard machines.

The processing model defines two types of functions: a map function that processes a key-value pair to generate a set of intermediate key-value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

A MapReduce job partitions the input data set into independent chunks that are processed by the map functions in a parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce functions. Typically both the input and the output of the job are stored in a distributed filesystem.

The implementation provides an API for configuring and submitting jobs and job scheduling and management services; a library of search, sort, index, inverted index, and word co-occurrence algorithms; and the runtime. The runtime system partitions the input data, schedules the program's execution across a set of machines, handles machine failures, and manages the required inter-machine communication.
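
The processing model can be imitated on a single machine in plain Python. The word-count sketch below is not the Hadoop API: `map_fn` and `reduce_fn` are hypothetical names, and the grouping step stands in for the shuffle and sort that the framework performs between the two phases.

```python
from collections import defaultdict

def map_fn(_key, line):
    """Map: emit an intermediate (word, 1) pair for every word in the line."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce: merge all intermediate values that share the same key."""
    return word, sum(counts)

lines = ["the quick fox", "the lazy dog", "The end"]

# Map phase: each input record is processed independently.
intermediate = [pair for i, line in enumerate(lines) for pair in map_fn(i, line)]

# Shuffle and sort: group intermediate values by key, as the framework would.
groups = defaultdict(list)
for word, count in intermediate:
    groups[word].append(count)

# Reduce phase: one call per distinct key, in sorted key order.
word_counts = dict(reduce_fn(w, groups[w]) for w in sorted(groups))
print(word_counts)
```

Because each map call depends only on its own input record, a real runtime can spread the map phase across many machines.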

MapReduce v1 (MRv1)

The runtime framework on which MapReduce jobs execute. It defines two daemons:

JobTracker - Coordinates running MapReduce jobs and provides resource management and job lifecycle management. In YARN, those functions are performed by two separate components.
TaskTracker - Runs the tasks that the MapReduce jobs have been split into.

MapReduce v2 (MRv2)

See YARN.

Maven

A software project-management tool. Based on the concept of a project object model, Apache Maven can manage a project's build, reporting, and documentation. CDH artifacts are available in the Cloudera Maven repository.

metadata

Data that describes other data. Metadata summarizes basic information, for example author, date created, and file size, that can facilitate finding and working with particular instances of data.

Multi Cloudera Manager Dashboard

A mode of Cloudera Manager that consolidates the display of monitoring information from CDH clusters managed by multiple Cloudera Manager instances.

NameNode

See Hadoop Distributed File System.

Navigator Key Trustee

A virtual safe-deposit box for managing encryption keys, certificates, and passwords. It provides software-based key and certificate management that supports a variety of robust, configurable, and easy-to-implement policies governing access to the secure artifacts.

near real-time (NRT)

In Cloudera Search, the ability to search documents very soon after they are added to Solr. With SolrCloud, this is largely automatic and measured in seconds.

Navigator

See Cloudera Navigator.

network-level

Level at which encryption and decryption are applied before and after data is sent across a network. In Hadoop, this includes data sent from client user interfaces as well as service-to-service communication like remote procedure calls (RPCs). This protection is available on virtually all transmissions within the Hadoop ecosystem using industry-standard protocols such as TLS/SSL.

ODBC driver

A client-side adapter that implements a standard C programming language API for accessing relational database management systems.

OL

Oracle Linux.

Oozie

A workflow and coordination service for Hadoop that orchestrates data ingest, store, transform, and analysis actions. Apache Oozie supports several types of Hadoop jobs, including MapReduce, Streaming, Pipes, Pig, Hive, and Sqoop.

Oryx

Provides a simple, real-time, large-scale machine learning and predictive analytics infrastructure. Using Apache Hadoop, Oryx can continuously build models from a data stream. It also serves queries of those models in real time through an HTTP REST API, and can update models based on new streaming data.

parcel

A binary distribution format that contains compiled code and meta-information such as a package description, version, and dependencies.

Parquet

An open source, column-oriented binary file format for Hadoop that supports very efficient compression and encoding schemes. Parquet allows compression schemes to be specified on a per-column level, and allows adding more encodings as they are invented and implemented. Encoding and compression are separated, allowing Parquet consumers to implement operators that work directly on encoded data without paying a decompression and decoding penalty, when possible.

partition

A subset of the elements in an RDD. Partitions define the unit of parallelism; Spark processes elements within a partition in sequence and multiple partitions in parallel. When Spark reads a file from HDFS, it creates one partition per input split, which typically corresponds to one HDFS block (although partition boundaries fall on line boundaries rather than on exact block boundaries). The exception is a compressed text file: because compressed text files are not splittable, Spark creates a single partition for the entire file.
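
The "sequential within a partition, parallel across partitions" behavior can be mimicked in plain Python with a thread pool. This is only an analogy and does not use the Spark API; `process_partition` and the round-robin split are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def process_partition(partition):
    """Process the elements of one partition in sequence."""
    return [x * x for x in partition]

elements = list(range(10))
num_partitions = 3

# Split the elements into partitions (round-robin here, chosen for brevity).
partitions = [elements[i::num_partitions] for i in range(num_partitions)]

# Different partitions are processed concurrently; map() keeps partition order.
with ThreadPoolExecutor(max_workers=num_partitions) as pool:
    results = list(pool.map(process_partition, partitions))

print(results)
```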

peer

A Cloudera Manager instance that manages clusters and is used as the source of data to be replicated. See replication.

petabyte

10¹⁵ bytes. 1,000 terabytes or 1,000,000 gigabytes.

Pig

A data flow language and parallel execution framework built on top of MapReduce. Internally, a compiler translates Apache Pig statements into a directed acyclic graph of MapReduce jobs, which are submitted to Hadoop for execution.

policy

In Cloudera Navigator, a set of actions performed when a class of entities is extracted.

Quorum-based storage

A mechanism for enabling a standby NameNode to keep its state synchronized with the active NameNode, in which both nodes communicate with a group of daemons called JournalNodes.
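
A toy model of the quorum idea, assuming that an edit counts as durable once a majority of JournalNodes have acknowledged it. The data structures and names below are hypothetical and are not part of any HDFS API.

```python
def quorum_size(num_journal_nodes):
    """Smallest number of nodes that forms a majority."""
    return num_journal_nodes // 2 + 1


def write_edit(edit, journal_nodes, reachable):
    """Append an edit to every reachable node's log.

    journal_nodes maps a node name to its list of edits; reachable is the
    set of node names that can currently be contacted. Returns True when a
    majority of all JournalNodes acknowledged the write.
    """
    acks = 0
    for name, log in journal_nodes.items():
        if name in reachable:
            log.append(edit)
            acks += 1
    return acks >= quorum_size(len(journal_nodes))
```

With three JournalNodes, this model tolerates the loss of one node; with five, the loss of two.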

rack

In Cloudera Manager, a physical entity that contains a set of physical hosts typically served by the same switch.

RegionServer

In HBase, applications store data in labeled tables, which are partitioned horizontally into regions. A RegionServer is responsible for managing one or more regions.

relational database management system (RDBMS)

A database management system based on the relational model, in which all data is represented in terms of tuples, grouped into relations. Most implementations of the relational model use the SQL data definition and query language.

replica

In SolrCloud, a complete copy of a shard. Each replica is identical, so only one replica has to be queried (per shard) for searches.

replication

The ability to copy HDFS directories and files, the Hive metastore and data, and HBase tables to another cluster.

resilient distributed dataset (RDD)

In Spark, a fault-tolerant collection of elements that can be operated on in parallel.

RHEL

Red Hat Enterprise Linux.

role

In Cloudera Manager, a category of functionality within a service. For example, the HDFS service has the following roles: NameNode, SecondaryNameNode, DataNode, and Balancer. Sometimes referred to as a role type. See also user role.

role group

In Cloudera Manager, a set of configuration properties for a set of role instances.

role instance

In Cloudera Manager, an instance of a role running on a host. It typically maps to a Unix process. For example: "NameNode-h1" and "DataNode-h1".

scheduler

Component of a computing framework, such as YARN, MapReduce, or Spark, that is responsible for determining which jobs run, where and when they run, and which resources are allocated to them.

schema

Defines the field names and data types for a dataset. Kite relies on an Apache Avro schema definition for all datasets, standardizes data definition by using Avro schemas for both Parquet and Avro, and supports the standard Avro object models generic and specific.
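
As an illustration, an Avro schema is a JSON document that lists field names and types. The sketch below builds one as a Python dictionary; the "User" record and its fields are invented for the example and do not come from any Cloudera dataset.

```python
import json

# Hypothetical record schema written in Avro's JSON schema notation.
user_schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "username", "type": "string"},
        # A union with "null" first makes the field optional.
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}


def field_types(schema):
    """Return a mapping of field name to declared type."""
    return {field["name"]: field["type"] for field in schema["fields"]}


# The JSON text form is what would be stored in a schema file.
schema_json = json.dumps(user_schema, indent=2)
```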

scheme

In a dataset, defines its storage type and location. You can create datasets in Hive, HDFS, HBase, or as local files. You define dataset schemes using scheme-specific URI patterns.

Sentry

Enterprise-grade big-data security that delivers fine-grained authorization to data stored in Apache Hadoop. An independent security module that integrates with open-source SQL query engines Hive and Impala, Apache Sentry delivers advanced authorization controls to enable multi-user applications and cross-functional processes for enterprise data sets.

serialization

The process of converting a data structure or object state into a format that can be stored (for example, in a file or memory buffer) or transmitted across a network connection. Deserialization is the reverse process: reconstructing the original data structure or object state from the serialized form, later, in the same or another computer environment. See Avro and Thrift.
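
A minimal round-trip sketch, using JSON from the Python standard library as a stand-in format. Avro and Thrift define their own encodings; the function names here are hypothetical.

```python
import json


def serialize(record):
    """Convert an in-memory dict into bytes that can be stored or sent."""
    return json.dumps(record, sort_keys=True).encode("utf-8")


def deserialize(payload):
    """Rebuild the original dict from its serialized bytes."""
    return json.loads(payload.decode("utf-8"))
```

The bytes returned by serialize can be written to a file or sent over a socket, and deserialize restores an equal object on the receiving side.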

service

A Linux command that runs a System V init script in /etc/init.d/ in as predictable an environment as possible, removing most environment variables and setting the current working directory to /.
A category of managed functionality in Cloudera Manager, which may be distributed or not, running in a cluster. Sometimes referred to as a service type. For example: MapReduce, HDFS, YARN, Spark, and Accumulo. In traditional environments, multiple services run on one host; in distributed systems, a service runs on many hosts.

service instance

In Cloudera Manager, an instance of a service running on a cluster. For example: "HDFS-1" and "yarn". A service instance spans many role instances.

sharding

In Cloudera Search, splitting a single logical index up into some number of sub-indexes, each of which can be hosted on a separate machine. Solr (and especially SolrCloud) handles querying each shard and assembling the response into a single, coherent list.
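
The split-then-merge pattern can be sketched as follows. This is a toy model with a hypothetical hash-based routing rule and a substring match in place of a real index; it does not reflect how Solr routes or scores documents.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Pick a shard with a deterministic hash of the document ID."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def build_shards(docs, num_shards):
    """Distribute documents (ID to text) across the sub-indexes."""
    shards = [dict() for _ in range(num_shards)]
    for doc_id, text in docs.items():
        shards[shard_for(doc_id, num_shards)][doc_id] = text
    return shards


def search(shards, term):
    """Query every shard and merge the hits into one sorted list."""
    hits = []
    for shard in shards:
        hits.extend(
            doc_id for doc_id, text in shard.items() if term in text.lower()
        )
    return sorted(hits)
```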

SLES

SUSE Linux Enterprise Server.

Snappy

A compression library. Snappy aims for very high speeds and reasonable compression instead of maximum compression or compatibility with other compression libraries. Snappy is provided in the Hadoop package along with the other native libraries (such as native gzip compression).

snapshots

Point-in-time backups of HDFS directories or files, or HBase tables.

SolrCloud

ZooKeeper-enabled, fault-tolerant, distributed Solr.

SolrJ

A Java API for interacting with a Solr instance.

Spark

Apache Spark is a general framework for distributed computing that offers high performance for both batch and interactive processing. It exposes APIs for Java, Python, and Scala and consists of Spark core and several related projects:

Spark SQL - Module for working with structured data. Allows you to seamlessly mix SQL queries with Spark programs.
Spark Streaming - API that allows you to build scalable fault-tolerant streaming applications.
MLlib - API that implements common machine learning algorithms.
GraphX - API for graphs and graph-parallel computation.

Cloudera supports Spark core, Spark SQL (including DataFrames), Spark Streaming, and MLlib. Cloudera does not currently offer commercial support for GraphX or SparkR.

SQL

A declarative programming language designed for managing data in relational database management systems. It includes features for creating schema objects such as databases and tables, and for querying and modifying data. CDH includes SQL support through Impala for high-performance interactive queries, and Hive for long-running batch-oriented jobs.
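
A short sketch of the two roles named above, defining a schema object and querying data. It uses the in-memory SQLite engine from the Python standard library purely as a stand-in; Impala and Hive have their own connection methods and SQL dialects, and the table and column names are hypothetical.

```python
import sqlite3


def top_customers(rows, min_total):
    """Load (customer, amount) rows and return totals at or above min_total."""
    conn = sqlite3.connect(":memory:")
    try:
        # Data definition: create a table.
        conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
        # Data modification: insert the rows.
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        # Query: declare the result wanted, not how to compute it.
        cursor = conn.execute(
            "SELECT customer, SUM(amount) AS total FROM orders "
            "GROUP BY customer HAVING SUM(amount) >= ? ORDER BY total DESC",
            (min_total,),
        )
        return cursor.fetchall()
    finally:
        conn.close()
```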

Sqoop

A tool for efficiently transferring bulk data between Hadoop and external structured datastores, such as relational databases. Apache Sqoop imports the contents of tables into HDFS, Hive, and HBase and generates Java classes that enable users to interpret the table's schema. Sqoop can also extract data from Hadoop storage and export records from HDFS to external structured datastores such as relational databases and enterprise data warehouses.

There are two versions: Sqoop and Sqoop 2. Sqoop requires client-side installation and configuration. Sqoop 2 is a web-based service with a client command-line interface. In Sqoop 2, connectors and database drivers are configured on the server.

stage

In Spark, a collection of tasks that all execute the same code, each on a different partition. Each stage contains a sequence of transformations that can be completed without shuffling the data.

static service pool

In Cloudera Manager, a static partitioning of total cluster resources—CPU, memory, and I/O weight—across a set of services.

suppression

In Cloudera Manager, the ability to suppress the display of health test results, configuration warnings, and parameter validation warnings.

task

A unit of work on a partition of an RDD.

TaskTracker

See MapReduce v1 (MRv1).

technical metadata

In Cloudera Navigator, metadata defined when entities are extracted from a CDH deployment. You cannot modify technical metadata.

terabyte

10¹² bytes. 1,000 gigabytes.

Thrift

An interface definition language, runtime library, and code-generation engine to build services that can be invoked from many languages. Apache Thrift can be used for serialization and RPC, but within Hadoop is mainly used for RPC.

transformation

In Spark, a function that creates a new RDD from an existing RDD. Spark uses "lazy evaluation": transformations do not execute on the cluster until an action is invoked. Examples of actions are collect, which pulls data to the client, and saveAsTextFile, which writes data to a filesystem like HDFS.
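
Lazy evaluation can be illustrated with a small plain-Python analogy. The LazyDataset class below is hypothetical and is not the Spark API: map and filter only record the requested step, and nothing runs until collect is called.

```python
class LazyDataset:
    """Records transformations and applies them only when collect() runs."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: returns a new dataset, does no work yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: returns a new dataset, does no work yet.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: run every recorded step and return the results.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)
```

Deferring the work lets the framework see the whole chain of transformations before it decides how to execute them.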

TrusteeKeyProvider

KeyTrustee-specific implementation of the Hadoop KeyProvider API, allowing the Hadoop KMS to use the Navigator KeyTrustee server as a key store and enabling key generation on behalf of clients.

UEK

Unbreakable Enterprise Kernel.

user role

Determines the Cloudera Manager or Cloudera Navigator features visible to the user and the actions the user can perform.

virtual core (vcore)

A CPU with a logical separation between areas of a processor. Virtual cores divide the processing resources of a physical core and work independently of one another.

Whirr

A set of libraries for running applications on cloud services. Apache Whirr can be used to run CDH clusters on services such as Amazon Elastic Compute Cloud (Amazon EC2). A working cluster starts immediately when the appropriate command is issued; you do not need to install the CDH packages in the cloud or do any configuration first. This is ideal for running temporary Hadoop clusters as proof-of-concept or training exercises. The cluster and all its data can be destroyed with a single command when it is no longer needed.

YARN (Yet Another Resource Negotiator)

A general architecture for running distributed applications. YARN specifies the following components:

ResourceManager - A master daemon that authorizes submitted jobs to run, assigns an ApplicationMaster to them, and enforces resource limits.
ApplicationMaster - A supervisory task that requests the resources needed for executor tasks. An ApplicationMaster runs on a different NodeManager for each application. The ApplicationMaster requests containers, which are sized by the resources a task requires to run.
NodeManager - A worker daemon that launches and monitors the ApplicationMaster and task containers.
JobHistory Server - Keeps track of completed applications.

The ApplicationMaster negotiates with the ResourceManager for cluster resources—described in terms of a number of containers, each with a certain memory limit—and then runs application-specific processes in those containers. The containers are overseen by NodeManagers running on cluster nodes, which ensure that the application does not use more resources than it has been allocated.
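
The negotiation for containers can be pictured with a toy allocator. This is a deliberately simplified sketch: it assumes a greedy first-fit placement by node name and memory as the only resource, whereas real YARN schedulers apply queues, fairness or capacity policies, and locality. All names are hypothetical.

```python
def allocate_containers(node_free_mb, num_containers, container_mb):
    """Grant containers of a fixed memory size on nodes with enough room.

    node_free_mb maps a node name to its free memory in MB. Returns the
    node chosen for each granted container; the list is shorter than
    num_containers when the cluster runs out of memory.
    """
    free = dict(node_free_mb)  # copy so the caller's view is unchanged
    placements = []
    for _ in range(num_containers):
        node = next(
            (name for name in sorted(free) if free[name] >= container_mb),
            None,
        )
        if node is None:
            break  # no node can host another container of this size
        free[node] -= container_mb
        placements.append(node)
    return placements
```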

MapReduce v2 (MRv2) is implemented as a YARN application.

ZooKeeper

A centralized service for maintaining configuration information, naming, and providing distributed synchronization and group services. In a CDH cluster, ZooKeeper coordinates the activities of high-availability services, including HDFS, Oozie, Hive, Solr, YARN, HBase, and Hue.

Resources

Documentation

Cloudera product documentation
Cloudera CDH 4 web mirror
Cloudera CDH 5 web mirror
Apache Hadoop documentation – The official documentation for Apache Hadoop technologies.

Books

Hadoop Ecosystem books – Popular books across the ecosystem authored or co-authored by Cloudera employees.
Programming Pig
Programming Hive

