StreamSets Data Collector is open-source, in-memory big data ingest infrastructure that lets you develop and operate highly adaptable ingest pipelines for CDH with minimal coding.

 

  • A graphical IDE lets you design, test and debug ingest flows without requiring schema specification.

  • Built-in transformations help you sanitize, sample and route your data as needed.

  • Intelligent monitoring gives you runtime visibility into data flow performance, including stage-specific early warnings about anomalies and outliers.

  • Deep integration with the Hadoop ecosystem, including connectors for HDFS, HBase, Kafka and Solr.

  • Flexible deployment of pipelines to edge servers or to the Enterprise Data Hub as a Spark Streaming application or MapReduce job.

  • Seamless management of infrastructure via Cloudera Manager and parcels.

 

StreamSets Data Collector can be installed as a Cloudera Manager parcel using a Custom Service Descriptor (CSD) file, from an RPM bundle, or from a tarball. Source code for the CSD is available on GitHub. A Docker image is available on Docker Hub.

 

 

Download Options and Files

  • Custom Service Descriptor for Cloudera Manager: Download CSD File, Instructions

  • Cloudera Manager (Parcels): Files to Download, Instructions

  • RPM: Download, Instructions

  • Tarball: Download, Instructions

  • GitHub: Source, Build Instructions

  • Docker: Docker image

 

Install the Data Collector on a machine that meets the following minimum requirements:

Operating Systems: Use one of the following operating systems and versions:

  • Mac OS X

  • CentOS 6 or 7

  • Red Hat Enterprise Linux 6 or 7

  • Ubuntu 14.04

Java: Oracle Java 7 or 8

Browser: Use the latest version of one of the following browsers:

  • Chrome

  • Firefox

  • Safari