This post was also co-authored by two Cisco employees: Karthik Krishna and Silesh Bijjahalli.
Today’s enterprise data analytics teams are constantly looking to get the best out of their platforms. Storage plays one of the most important roles in a data platform strategy: it provides the foundation on which all compute engines and applications are built. Businesses are also looking to move to a scale-out storage model that provides dense storage along with reliability, scalability, and performance. Cloudera and Cisco have jointly tested dense storage nodes to make this a reality.
Cloudera has partnered with Cisco to help build the Cisco Validated Design (CVD) for Apache Ozone. This CVD is built using Cloudera Data Platform Private Cloud Base 7.1.5 on Cisco UCS S3260 M5 Rack Server with Apache Ozone as the distributed file system for CDP.
Apache Ozone is one of the major innovations introduced in CDP. It provides the next-generation storage architecture for big data applications, in which data blocks are organized into storage containers to achieve larger scale and to handle small objects. This is a major architectural enhancement in how Apache Ozone manages data at scale in a data lake.
Apache Ozone brings the best of both HDFS and Object Store:
A data generator tool was written to create fake data for Ozone. It works by writing synthetic file system entries directly into the RocksDB stores of Ozone’s OM, SCM, and DataNodes, and then writing fake data block files on the DataNodes. This is significantly faster than writing real data through an application or another client. By running this tool in parallel on all storage nodes in the cluster, we can fill all of the 400 TB nodes in less than a day.
With this tool, we were able to generate large amounts of data and certify Ozone on dense storage hardware. We made several enhancements to the product to improve scale and performance so that it can handle the high storage density per node.
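As a rough sanity check on the fill rate quoted above, the following sketch estimates the minimum sustained write rate each node would need in order to fill 400 TB within 24 hours. It assumes decimal terabytes and treats "less than a day" as an upper bound of exactly 24 hours; the function name is purely illustrative and is not part of the data generator tool.

```python
def min_fill_rate_gb_per_s(capacity_tb: float, hours: float) -> float:
    """Minimum sustained write rate (GB/s) needed to fill capacity_tb
    terabytes within the given number of hours."""
    capacity_bytes = capacity_tb * 1e12   # assumption: decimal terabytes
    seconds = hours * 3600
    return capacity_bytes / seconds / 1e9  # decimal gigabytes per second


rate = min_fill_rate_gb_per_s(400, 24)
print(f"{rate:.2f} GB/s per node")  # prints "4.63 GB/s per node"
```

Because every node is filled locally and in parallel, this per-node figure is a lower bound on what the tool achieves and does not depend on the number of nodes in the cluster.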
We benchmarked Impala TPC-DS performance on this test setup. The query templates and sample queries used are compliant with the standards set out by the TPC-DS benchmark specification and include only minor query modifications (MQMs) as set out by section 4.2.3 of the specification. All of these scripts can be found in impala-tpcds-kit. Impala local caching was turned on while running this benchmark. The results indicate that 70% of the queries performed as well as or better than the same queries running with HDFS as the file system.
Loss of one or more dense nodes triggers significant re-replication traffic. For data durability and availability, it is important that the file system recovers quickly from hardware failures. Ozone includes optimizations to recover efficiently from the loss of dense nodes, including the use of its multi-RAFT feature to distribute data more evenly and keep replication from being bottlenecked on a small number of nodes.
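To illustrate why spreading a node's data across more peers shortens recovery, here is a deliberately simplified model. It is not Ozone's actual replication algorithm: it assumes 3-way replication, that each pipeline a lost node belonged to contributes two distinct surviving peers as copy sources, and that each source can send at a fixed rate. The 10 Gbps per-node figure and the pipeline counts are hypothetical inputs chosen only for illustration.

```python
def recovery_hours(lost_tb: float, pipelines_per_node: int,
                   per_node_gbps: float, replication: int = 3) -> float:
    """Toy estimate of the hours needed to re-replicate a lost node's data.

    Assumes every pipeline of the lost node has (replication - 1) distinct
    surviving peers, all of which send in parallel at per_node_gbps.
    """
    source_nodes = pipelines_per_node * (replication - 1)
    aggregate_bytes_per_s = source_nodes * per_node_gbps * 1e9 / 8
    return lost_tb * 1e12 / aggregate_bytes_per_s / 3600


# Hypothetical inputs: a 400 TB node and 10 Gbps of usable bandwidth per source.
single = recovery_hours(400, pipelines_per_node=1, per_node_gbps=10)
multi = recovery_hours(400, pipelines_per_node=5, per_node_gbps=10)
print(f"1 pipeline: {single:.1f} h, 5 pipelines: {multi:.1f} h")
```

In this model, recovery time falls in proportion to the number of source nodes, which is the intuition behind letting each DataNode participate in several pipelines. Real recovery times also depend on target capacity, throttling, and competing workload traffic, none of which are modeled here.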
Cloudera will publish separate blog posts with results of performance benchmarks.
Cisco Data Intelligence Platform (CDIP) is a private cloud architecture designed for the next-generation hybrid cloud data lake. It brings together big data, AI/compute farm, and storage tiers to work as a single entity, while allowing each tier to scale independently to address IT issues in the modern data center. This architecture allows for:
This architecture is the beginning of the convergence of three of the largest open-source initiatives (Hadoop, Kubernetes, and AI/ML), largely driven by the software framework and technology introduced by Cloudera Data Platform Private Cloud Base and Cloudera Data Platform Private Cloud Experiences to crunch big data.
Cisco UCS C240 M5 Rack Servers deliver highly dense, cost-optimized, on-premises storage with broad infrastructure flexibility for object storage, Hadoop, and big data analytics solutions.
This CVD offers customers the ability to consolidate their data lake further, with larger storage per data node. Apache Ozone brings the following cost savings and benefits due to storage consolidation:
CDIP with Cloudera Data Platform Private Cloud Experiences enables customers to scale storage and compute resources independently while maintaining data locality similar to the prior generation of HDFS. It offers an exabyte-scale architecture with low total cost of ownership (TCO), future-proofed with the latest generation of technologies provided by Cloudera.
In addition, CDIP offers single-pane-of-glass management for the entire infrastructure with Cisco Intersight.
You can find the Cisco Validated Design document published here.