StreamSets Data Collector
StreamSets Data Collector can be installed as a Cloudera Manager parcel via a Custom Service Descriptor (CSD) file, via an RPM bundle, or as a tarball. Source code for the CSD is available on GitHub. A Docker image is available on Docker Hub.
- A graphical IDE lets you design, test and debug ingest flows without requiring schema specification.
- Built-in transformations help you sanitize, sample and route your data as needed.
- Intelligent monitoring gives you runtime visibility into data flow performance, including stage-specific early warnings about anomalies and outliers.
- Deep integration with the Hadoop ecosystem, including connectors for HDFS, HBase, Kafka and Solr.
- Flexible deployment of pipelines to edge servers or to the Enterprise Data Hub as a Spark Streaming application or MapReduce job.
- Seamless management of infrastructure via Cloudera Manager and parcels.
StreamSets software delivers performance management for data flows that feed the next generation of big data applications. Its mission is to bring operational excellence to the management of data in motion, so that data arrives on time and with quality, accelerating analysis and decision making. StreamSets Data Collector is in use at hundreds of companies where it brings unprecedented visibility into and control over data as it moves between an expanding variety of sources and destinations.
Founded in 2014 by Girish Pancha, former chief product officer of Informatica, and Arvind Prabhakar, an early employee and engineering leader at Cloudera, StreamSets is headquartered in San Francisco and is backed by Accel Partners, Battery Ventures and New Enterprise Associates (NEA).