Cloudera Navigator Data Management Overview
Data Management Challenges
As Hadoop clusters have become ubiquitous in organizations large and small, the ability to store and analyze all types of data at any scale brings management challenges, for reasons such as these:
- Data volumes are extremely large and continue to grow at increasing velocity, and they comprise many different data types.
- Data ingested by the cluster at one point in time is transformed by many different processes, users, or applications. The result is that the provenance of any data entity can be difficult at best to trace.
- Multi-user and multi-tenant access to the data is the norm. Each user group may need different levels of access to the data, at varying degrees of granularity.
These characteristics make it difficult to answer questions such as the following:
- Where did the data originate? Has it been altered, and if so, by whom or by what process?
- Are downstream processes using that data, and if so, how?
- Have unauthorized people been trying to get to the data? Can we verify all access and attempted access to our data?
- Are we prepared for an audit? For example, can we prove that the values shown in the organization's general ledger (G/L) have not been mishandled? Will the bank examiners approve our data governance operations? In general, can the data's validity be trusted, and better yet, proven?
- Are data retention mandates being met? Are we complying with all industry and government regulations for preservation and deletion of data? Can we automate the life cycle of our data?
- Where is the most important data? Is there a way to get an at-a-glance view of overall cluster activity, over various time periods, by data type?
Enabling organizations to quickly answer questions such as these and many more is one of the chief benefits of Cloudera Navigator.
Cloudera Navigator Data Management Capabilities
As an organization's clusters consume more and more data, its business users want self-service access to that data. For example, business users might want to find data relevant for a current market analysis by searching using the nomenclature common to their field of expertise rather than needing to know all the file types that might contain such data.
At the same time, the organization's security team wants to know about all attempts to access any and all data. They want to be able to quickly identify confidential data and to track any data entity to its source (provenance). The compliance group wants to be audit-ready at all times. And everyone, organization-wide, wants to completely trust the integrity of the data they are using.
These are the kinds of data management challenges that Cloudera Navigator data management was designed to meet head on. Data stewards, administrators, business analysts, data scientists, and developers can obtain better value from the vast amounts of data stored in Hadoop clusters through the growing set of features provided by Cloudera Navigator data management.
The following capabilities are all available through the Cloudera Navigator console.
The Cloudera Navigator analytics system leverages the metadata system, policies, and auditing features to provide a starting place for most users, from data stewards to administrators. Get at-a-glance overviews of all cluster data across various dimensions, and use a variety of interactive tools to easily filter and drill down into specific data objects. For example, Data Stewardship Analytics has a Dashboard and Data Explorer that provide comprehensive yet intuitive tools for interacting with all data in the system. The HDFS Analytics page displays histograms of technical metadata for HDFS files (file size, block size, and so on). Use the mouse to brush over any histogram and display the lower-level details, along with selectors and other filters for digging further below the surface of the data objects.
In addition to meeting the needs of data stewards, Cloudera Navigator also meets governance needs by providing secure real-time auditing. The Navigator Audit Server creates a complete and immutable record of cluster activity (in its own database) that is easy to retrieve when needed. Compliance groups can configure, collect, and view audit events that show who accessed data, when, and how.
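The kind of question an audit trail answers (who accessed data, when, and how) can be illustrated with a minimal Python sketch. The AuditEvent record, its field names, and the denied_attempts helper are assumptions made up for this illustration; they are not the Navigator Audit Server schema or API.

```python
from dataclasses import dataclass
from datetime import datetime


# Hypothetical audit event record. The field names are illustrative only
# and do not reflect the actual Navigator Audit Server schema.
@dataclass(frozen=True)  # frozen: a recorded event cannot be modified
class AuditEvent:
    timestamp: datetime  # when the access happened
    username: str        # who attempted the access
    operation: str       # how the data was accessed, e.g. "open" or "delete"
    resource: str        # what was accessed, e.g. an HDFS path
    allowed: bool        # False when the attempt was denied


def denied_attempts(events, resource_prefix):
    """Return denied events for resources under a path prefix, oldest first."""
    hits = [
        e for e in events
        if not e.allowed and e.resource.startswith(resource_prefix)
    ]
    return sorted(hits, key=lambda e: e.timestamp)


# Sample data invented for the example.
events = [
    AuditEvent(datetime(2024, 1, 5, 9, 30), "alice", "open", "/finance/gl/2023.csv", True),
    AuditEvent(datetime(2024, 1, 5, 9, 45), "mallory", "open", "/finance/gl/2023.csv", False),
    AuditEvent(datetime(2024, 1, 4, 22, 10), "mallory", "delete", "/finance/gl/2022.csv", False),
    AuditEvent(datetime(2024, 1, 5, 10, 0), "bob", "open", "/marketing/leads.csv", False),
]

for event in denied_attempts(events, "/finance/"):
    print(event.timestamp.isoformat(), event.username, event.operation, event.resource)
```

With this sample data, only the two denied attempts under /finance/ are reported, with the earlier delete attempt listed first. In a real deployment the events would come from the audit database rather than an in-memory list.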
The analytics, lineage, and search capabilities of Cloudera Navigator rely on metadata: both the technical metadata inherent in the data object itself (date, file type, and so on) and the metadata you define (managed metadata) to characterize data objects so that data stewards, data scientists, and other business users can find them easily.
For example, Navigator Metadata Server captures information about table, file, and database activity, including file and table creation and modification trends, all of which is displayed in visually clean renderings on the Data Stewardship dashboard.
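The distinction between technical metadata and user-defined metadata can be sketched in a few lines of Python. Everything here (the Entity class, the tag names, and the find_by_tag helper) is hypothetical and only illustrates the idea of finding data by business vocabulary instead of by file type; it is not how Navigator Metadata Server stores or queries metadata.

```python
from dataclasses import dataclass, field


# Hypothetical catalog entry, invented for this illustration.
@dataclass
class Entity:
    name: str
    technical: dict                         # inherent in the object: file type, size, ...
    tags: set = field(default_factory=set)  # defined by users to characterize the data


def find_by_tag(entities, tag):
    """Return names of entities carrying the given tag, ignoring case."""
    wanted = tag.lower()
    return [e.name for e in entities if wanted in {t.lower() for t in e.tags}]


# Sample data invented for the example.
catalog = [
    Entity("sales_2023.parquet", {"file_type": "parquet", "size_bytes": 1048576},
           {"Market-Analysis", "Sales"}),
    Entity("clickstream.avro", {"file_type": "avro", "size_bytes": 52428800},
           {"web"}),
    Entity("customers", {"file_type": "hive_table"},
           {"PII", "market-analysis"}),
]

print(find_by_tag(catalog, "market-analysis"))
```

The search returns a Parquet file and a Hive table together, because the match is on the user-defined tag and not on the technical file type.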
Cloudera Navigator policies let you automate actions based on data access or on a schedule, to add metadata, create alerts, move data, or purge data.
By default, the Cloudera Navigator console opens to the Search menu. Use this sophisticated (yet simple) filtering scheme to find all types of entities that meet your criteria, then drill down to explore a specific entity's lineage. High-level diagrams can be filtered further using the interactive tools provided, letting you follow the data upstream, downstream, or along other relationships.
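Following data upstream or downstream amounts to walking a graph of "derived from" relationships. The sketch below shows one way to do that with a breadth-first traversal in Python. The edge list, the entity names, and the trace function are assumptions for illustration only and do not represent Navigator's internal lineage model.

```python
from collections import deque

# Hypothetical lineage edges: (source, target) means the target was derived
# from the source. The entity names are invented for this example.
EDGES = [
    ("raw_orders.csv", "orders_clean"),
    ("orders_clean", "orders_by_region"),
    ("customers", "orders_by_region"),
    ("orders_by_region", "quarterly_report"),
]


def trace(entity, edges, direction="downstream"):
    """Breadth-first walk of the lineage graph, returning entities in visit order."""
    if direction not in ("upstream", "downstream"):
        raise ValueError("direction must be 'upstream' or 'downstream'")

    # Build an adjacency map that points in the requested direction.
    neighbors = {}
    for source, target in edges:
        key, value = (source, target) if direction == "downstream" else (target, source)
        neighbors.setdefault(key, []).append(value)

    seen = {entity}
    order = []
    queue = deque([entity])
    while queue:
        current = queue.popleft()
        for nxt in neighbors.get(current, []):
            if nxt not in seen:  # guard against revisiting shared ancestors
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


print(trace("orders_by_region", EDGES, "upstream"))
print(trace("orders_clean", EDGES, "downstream"))
```

The upstream call answers "where did this data originate?", and the downstream call answers "which processes or outputs depend on it?". Breadth-first order is a design choice here so that the nearest relatives are listed first.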