The Iceberg Wave: How an Open Format Became an Enterprise Standard

Navita Sood

Cloudera Innovations Propelling Iceberg Adoption

Apache Iceberg is now the de facto open standard for managing large-scale structured, semi-structured, and evolving data. It was originally developed in 2017 at Netflix to address the challenges of delivering reliable, petabyte (PB)-scale analytics on Apache Hive and Spark, and has since grown into a robust, open-table format suited to run multiple workloads concurrently. 

Iceberg unifies your data and exposes it through SQL table semantics, so that data is easy to access. As it continues to evolve with richer SQL capabilities and simplified data operations, Iceberg is increasingly favored by users of varying technical expertise: not just data engineers but also data consumers (data scientists, analysts, and application developers) seeking fast, reliable access to any data.

With Iceberg, organizations gain true separation of compute and storage, enabling unparalleled flexibility. If you're looking for multifunction analytics, AI readiness, and vendor freedom, no other table format comes close.

A Vibrant and Growing Community

In less than 10 years, Iceberg has evolved from emerging tech to enterprise standard. Iceberg’s momentum can be credited to its architectural strengths as well as the vibrant, open community behind it. 

Importantly, the Iceberg community is led by its users rather than by any single vendor. This user-driven governance model helps ensure the project evolves in ways that serve broad, real-world needs, which is a major reason it has gained so much traction.

Key Takeaways from the Iceberg Summit

Iceberg’s mainstream adoption was evident at the 2025 Iceberg Summit in San Francisco. The event brought together startups, Fortune 500 companies, and the three major cloud providers (AWS, Microsoft, and Google). Attendees joined from across the globe, both in person and virtually, eager to learn, contribute, and grow the ecosystem.

A few themes in particular dominated conversations at the summit: interoperability and Iceberg's growing prominence (its expanding ecosystem and capabilities, including automation).

Interoperability

From Netflix to Apple to Bloomberg, many organizations shared how Iceberg enables them to manage a single source of truth that powers multiple workloads—eliminating redundant data copies and reducing data movement across systems. They discussed the various types of workloads that rely on Iceberg’s trusted data layer to deliver segmentation, personalization, churn/relapse predictions, recommendations, optimized customer experience, and more.

Exploding Ecosystem

Another highlight was the emergence of new open-source tools such as Comet, Polaris, and Lance in the Iceberg ecosystem, designed to enhance performance and support multi-modal analytics and AI.

Updates Coming in Iceberg V3 and V4

There was a lot of excitement around the capabilities coming in Iceberg V3 and V4. V3 will significantly bolster data governance, performance optimization, and support for more complex data types like Variant and Geospatial. By leveraging the principles of columnar format, Variant enables advanced querying capabilities, such as filtering and aggregations, on semi-structured data without requiring extensive transformations. Support for Geospatial will allow organizations to manage location-based data, unlocking new use cases. The new adaptive metadata layout proposed in V4 promises to improve performance for small files.
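As a rough illustration of the idea behind querying semi-structured data in a columnar way, the Python sketch below pulls one nested field out of each record into its own column, then filters and aggregates on those columns. It is a toy model only: the function name `extract_column`, the sample records, and the dotted-path syntax are assumptions made for this example and are not part of the Iceberg specification or any engine's API.

```python
from typing import Any, Dict, List


def extract_column(records: List[Dict[str, Any]], path: str) -> List[Any]:
    """Pull one dotted path (e.g. "user.country") out of every record.

    Records that lack the path contribute None, so the column stays
    aligned with the input rows.
    """
    column = []
    for record in records:
        value: Any = record
        for key in path.split("."):
            if isinstance(value, dict) and key in value:
                value = value[key]
            else:
                value = None
                break
        column.append(value)
    return column


# Hypothetical semi-structured events with differing shapes.
events = [
    {"user": {"country": "DE"}, "amount": 10},
    {"user": {"country": "US"}, "amount": 25},
    {"user": {"country": "DE"}, "amount": 5, "coupon": "SPRING"},
    {"user": {}, "amount": 7},
]

country = extract_column(events, "user.country")
amount = extract_column(events, "amount")

# Filter and aggregate on the extracted columns, not on the raw records.
total_de = sum(a for c, a in zip(country, amount) if c == "DE")
```

Once a field lives in its own column, filters and aggregations only need to touch that column, which is the general benefit the columnar approach aims for.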

Automated Data Management

Another hot topic was automating routine maintenance (partitioning, sorting, compaction) through policy-driven, DevOps-style interfaces to reduce manual toil. As organizations bring more data into Iceberg tables, manual maintenance becomes a major bottleneck because it requires hiring experts for these tasks.
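To make the idea of policy-driven maintenance concrete, here is a minimal Python sketch of a compaction planner: a declarative policy states what counts as a small file and how large a rewritten file should be, and a planner groups small files accordingly. Everything here is hypothetical, including the `CompactionPolicy` fields and the first-fit-decreasing grouping; it does not describe how Iceberg or any vendor tool actually plans compaction.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CompactionPolicy:
    """Hypothetical policy knobs; not an Iceberg or vendor setting."""
    target_file_bytes: int = 512 * 1024 * 1024  # desired size of each rewritten file
    small_file_bytes: int = 64 * 1024 * 1024    # files below this are candidates
    min_files_per_group: int = 2                # skip groups that would merge nothing


def plan_compaction(file_sizes: Dict[str, int], policy: CompactionPolicy) -> List[List[str]]:
    """Group small files into bins of at most target_file_bytes.

    Uses first-fit decreasing: largest candidates are placed first, each
    into the first group that still has room.
    """
    candidates = sorted(
        ((name, size) for name, size in file_sizes.items()
         if size < policy.small_file_bytes),
        key=lambda item: (-item[1], item[0]),
    )
    groups: List[List[str]] = []
    group_bytes: List[int] = []
    for name, size in candidates:
        for i, used in enumerate(group_bytes):
            if used + size <= policy.target_file_bytes:
                groups[i].append(name)
                group_bytes[i] += size
                break
        else:
            # No existing group has room, so start a new one.
            groups.append([name])
            group_bytes.append(size)
    # A group with a single file would be rewritten for no benefit.
    return [g for g in groups if len(g) >= policy.min_files_per_group]
```

The point of the design is that operators edit the policy values, not the planning logic, which is what lets maintenance run without an expert tuning each table by hand.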

As more and more engines access the data in these Iceberg tables, governance, security, and lineage become high priorities. Visibility into data flows and data transformations becomes critical to trusting the data. This led to discussions around the need for catalog federation and governance to improve visibility across Iceberg tables.

Iceberg Adoption at Cloudera

Cloudera introduced native integration of Apache Iceberg in its public cloud Lakehouse platform in 2021, followed by on-premises support in 2022. Today, a majority of our customers are either running or testing new workloads on Iceberg; in total, our customers manage PBs of data on Iceberg.

“Iceberg is a growth vector for Cloudera. We’re seeing a surge in customers migrating Hive workloads to Iceberg to modernize and future-proof their data platforms.” - Venkat Rajaji, SVP of Product Management, Cloudera

Once a company starts its Iceberg journey, the benefits compound, resulting in growing volumes of data on Iceberg tables, expansion of workloads, and emergence of new use cases. Faster performance is often the first motivator, followed by interoperability and workload flexibility for agility. Moving to Iceberg reduces storage, ETL, and operational costs by up to 75%. Capabilities like time travel, snapshots, write-audit-publish, and hidden partitioning further improve efficiency, making it the right choice to deploy new use cases.
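For readers unfamiliar with snapshots and time travel, the following Python sketch models the concept in memory: every commit produces a new immutable snapshot listing the data files visible at that point, and a time-travel read simply picks the latest snapshot at or before the requested time. The class names, the integer clock, and the file names are assumptions for illustration; this is a toy model of the idea, not Iceberg's metadata format or API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at: int            # logical commit time (assumed integer clock)
    data_files: Tuple[str, ...]  # files visible in this snapshot


@dataclass
class ToyTable:
    snapshots: List[Snapshot] = field(default_factory=list)

    def commit(self, added: List[str], removed: List[str], committed_at: int) -> Snapshot:
        """Create a new snapshot; earlier snapshots are never modified.

        Assumes commits arrive in increasing committed_at order.
        """
        current = self.snapshots[-1].data_files if self.snapshots else ()
        files = tuple(f for f in current if f not in removed) + tuple(added)
        snap = Snapshot(len(self.snapshots) + 1, committed_at, files)
        self.snapshots.append(snap)
        return snap

    def files_as_of(self, committed_at: int) -> Tuple[str, ...]:
        """Time travel: files of the latest snapshot at or before the given time."""
        visible = [s for s in self.snapshots if s.committed_at <= committed_at]
        if not visible:
            raise ValueError("no snapshot exists at or before that time")
        return visible[-1].data_files


# Hypothetical commit history.
table = ToyTable()
table.commit(added=["f1.parquet"], removed=[], committed_at=100)
table.commit(added=["f2.parquet"], removed=[], committed_at=200)
table.commit(added=["f3.parquet"], removed=["f1.parquet"], committed_at=300)

before_rewrite = table.files_as_of(250)  # state between the second and third commits
latest = table.files_as_of(300)          # state after the third commit
```

Because old snapshots are kept rather than overwritten, an audit or a model-training job can read the table exactly as it stood at an earlier point without keeping a separate copy of the data.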

Some of the most popular use cases for Iceberg at Cloudera are:

  • Data sharing between different vendor systems owned by trusted parties, such as business units within an organization or trusted partners and suppliers.
  • Data engineering for massive-scale data preparation and best price performance.
  • Near real-time analytics and decisioning by ingesting streaming data into the lakehouse.
  • Regulatory compliance reporting and continuous risk mitigation, leveraging Iceberg’s time travel features and Cloudera’s governance, lineage, and auditing capabilities.
  • Optimizing analytics cloud spend by unlocking the data in Iceberg and leveraging Cloudera’s robust ingestion and data processing capabilities.
  • Accelerating data prep for AI by leveraging Spark and NiFi for faster data processing.
  • Efficient model training across multiple data versions with reduced compute and storage usage.
  • Multi-tiered feature stores that combine Iceberg and HBase for low-latency AI.
  • Running hybrid workloads using compute in public cloud on sensitive data stored on premises.

Listen to how Illumina and LY Corporation are using Apache Iceberg to overcome their data and analytics challenges at scale.

Cloudera Innovations to Address Common Challenges 

While Lakehouse and Iceberg offer significant benefits, including converging all your data and accelerating analytics, our customers have shared a few challenges related to adopting Iceberg. First, their data lives in multiple clouds, on premises, and in edge systems, and moving all of it to the cloud to leverage Iceberg is almost impossible, so they need the same Iceberg support on premises and in the cloud. Second, they need integration with multiple vendor engines so they can share data across systems with confidence, lineage, and traceability. Third, as data grows, manually and continuously tuning Iceberg tables for performance becomes very expensive, requiring both experts and compute resources. Lastly, while Iceberg increases the usage of data, the freedom to bring in any tool introduces risk. Customers need effective governance and security tools to control access, along with metadata management that provides auditability, lineage, and visibility so users can better understand the data and put it to use.

We’re always innovating to solve customer challenges and have made several platform enhancements to address these common pain points, including:

  • Iceberg everywhere with the hybrid lakehouse: Delivers native support for Iceberg on premises and in multiple public clouds with the ability to port applications and code to use Impala, Spark, NiFi, Flink, and Hive on the same data with the same experience. This allows customers to modernize their data center with cloud-native capabilities. Iceberg on Ozone delivers S3-compatible object stores on premises. Cloudera enables organizations to unify their data in cloud and on premises under a single governance and security model—with fine-grained access controls, versioned metadata, and a shared catalog.
  • Real-time application building: Build real-time CDC pipelines and seamlessly ingest and unify batch and streaming data with our Data in Motion offering for streaming pipelines (NiFi+Kafka+Flink-on-Iceberg).
  • Full interoperability with REST catalog integration: Drive interoperability with external engines and open ecosystems under a single security and governance model.
  • Lower TCO and faster performance with the Cloudera Lakehouse Optimizer: Built-in AI auto-tunes compaction, snapshot expiry, and layout—no manual tuning required.
  • Complete understanding of all data sources and destinations: Octopai by Cloudera unlocks intelligent metadata automation and full-lifecycle lineage for all data flows, even those outside of Cloudera, giving better visibility into data.
  • HA/DR and low latency across applications: Iceberg table replication provides resilience and flexibility for HA data architectures.
  • Risk-free and fast adoption with smart migration tools: Our “Hive Tables to Apache Iceberg” blueprint simplifies onboarding.

“As we envision a future where Apache Iceberg is the foundation and linchpin, empowering cross-platform data and AI, we relentlessly enhance Iceberg's capabilities to unlock unprecedented agility and intelligence for every enterprise.” - Bill Zhang, VP of Product Strategies at Cloudera

Road Ahead

We believe that Iceberg will continue to dominate as the enterprise standard for open-table formats. The new innovations in automated optimizations, multi-modal support, metadata management, and Python integration will only further drive adoption. Other open-table formats will likely take a more specialized approach suited to run specific workloads or in specific environments to complement Iceberg. 

Cloudera’s goal is to help customers build an open data lakehouse powered by Iceberg with lower complexity, greater flexibility, and higher impact. We’re focused on delivering enterprise-grade security and governance, additional optimizations, tiered storage mechanisms, and a “catalog of catalogs” to enhance interoperability and collaboration. You can get started today with the Cloudera Lakehouse 5-day trial or by reading our how-to guides.
