
How Leading Data Teams Build AI-Ready Pipelines with Apache Iceberg and Spark

By Pamela Pan, Ying Chen, and Akshat Mathur

Lessons from two global enterprises modernizing data engineering for scalable AI

From predictive analytics to generative AI, every business is looking to turn data into value. But for many teams, the real challenge lies beneath the surface—in the data engineering work required to make that data usable, trusted, and scalable. Across complex environments, engineers are still stitching together pipelines using legacy table formats, duplicating logic across tools, and retrofitting governance after the fact. These inefficiencies create drag at every stage, delaying outcomes and limiting the impact of even the most advanced AI and analytics initiatives.

For enterprises looking to streamline and future-proof their data engineering stack, Apache Iceberg as the open table format and Apache Spark as the open compute engine have proven to be a powerful combination. Together, they offer an open, scalable, and standardized foundation for processing and managing petabyte (PB)-scale data—without sacrificing governance, flexibility, or performance.
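
To make the combination concrete, below is a minimal PySpark sketch of Spark creating, loading, and querying an Iceberg table. It is illustrative only: the catalog name (demo), warehouse path, table, and schema are assumptions rather than details from either customer story, and on Cloudera the Iceberg catalog typically comes preconfigured by the platform.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-quickstart")
    # Pull in the Iceberg runtime and register its SQL extensions; the exact
    # artifact version must match your Spark/Scala build.
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # A local, path-based ("hadoop") catalog for illustration; on Cloudera the
    # catalog is typically set up by the platform instead.
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Create a partitioned Iceberg table and append a couple of rows.
spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.telecom")
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.telecom.usage_events (
        event_id   BIGINT,
        subscriber STRING,
        event_ts   TIMESTAMP,
        bytes_used BIGINT
    ) USING iceberg
    PARTITIONED BY (days(event_ts))
""")
spark.sql("""
    INSERT INTO demo.telecom.usage_events VALUES
        (1, 'sub-001', TIMESTAMP '2024-01-15 10:00:00', 1048576),
        (2, 'sub-002', TIMESTAMP '2024-01-15 11:30:00', 524288)
""")

# Partition pruning on event_ts keeps this scan narrow even at large scale.
spark.sql("""
    SELECT subscriber, sum(bytes_used) AS total_bytes
    FROM demo.telecom.usage_events
    WHERE event_ts >= TIMESTAMP '2024-01-15 00:00:00'
    GROUP BY subscriber
""").show()
```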

In this blog, we will take a closer look at how two global organizations transformed their data pipelines using Spark and Iceberg with the Cloudera data and AI platform. We’ll explore how they reduced query times by 80%, standardized workflows across teams, and accelerated their path from raw data to AI-ready insights.

How Vodafone Idea Slashed Query Times by 80%

Vodafone Idea is one of the three major telecommunications companies in India, serving 220 million customers. The company was struggling with scale issues: its Hive-based data lake had ballooned to more than 17 PB, and performance bottlenecks were putting critical business operations at risk. Some reporting queries took more than 70 hours to complete, delaying compliance, analytics, and regulatory reporting.

Rather than simply upgrading infrastructure, Vodafone Idea chose to re-architect its data platform. Collaborating with Cloudera, the company adopted Iceberg for faster queries through optimized metadata and schema evolution, and rebuilt its processing workflows on Spark to take advantage of distributed compute for efficient, large-scale data processing.
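
The blog does not cover the exact migration mechanics, but Iceberg ships Spark procedures for precisely this kind of Hive-to-Iceberg move. A hedged sketch, assuming a Hive-metastore-backed session catalog ("spark_catalog") configured as an Iceberg SparkSessionCatalog and hypothetical table names; this is illustrative, not Vodafone Idea's actual migration:

```python
# Hedged sketch: converting an existing Hive table to Iceberg with Iceberg's
# bundled Spark procedures. Table names are hypothetical, and the session
# catalog is assumed to be configured as an Iceberg SparkSessionCatalog.

# 1) Non-destructive trial: create an Iceberg table that references the Hive
#    table's existing data files, leaving the source table untouched.
spark.sql("""
    CALL spark_catalog.system.snapshot(
        source_table => 'telecom.usage_events_hive',
        table        => 'telecom.usage_events_iceberg_test'
    )
""")

# 2) Once validated, convert the Hive table to Iceberg in place; the original
#    table is retained as a backup by default.
spark.sql("CALL spark_catalog.system.migrate('telecom.usage_events_hive')")

# The converted table now gets Iceberg's metadata pruning, ACID commits, and
# schema evolution without any data being copied.
spark.sql("SELECT count(*) FROM telecom.usage_events_hive").show()
```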

For regulatory reporting, they paired Iceberg with Apache Impala as the interactive query engine to support fast, reliable access to PB-scale datasets. While Impala handled the reporting queries, Iceberg played a critical role behind the scenes—its support for ACID transactions (atomicity, consistency, isolation, and durability—properties that ensure database transactions are processed reliably and consistently), flexible schema evolution capabilities, and rich metadata kept reporting workflows consistent, even as data changed.
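
A hedged sketch of those capabilities in Spark SQL, reusing the hypothetical usage_events table from the earlier snippet (the Impala-side reporting queries are not shown):

```python
# Schema evolution is metadata-only: no data files are rewritten, and existing
# reporting queries keep working against the old columns.
spark.sql("ALTER TABLE demo.telecom.usage_events ADD COLUMN roaming_flag BOOLEAN")

# Every commit (insert, merge, schema change) lands as an atomic snapshot,
# which is what keeps concurrent reporting reads consistent as data changes.
spark.sql("""
    SELECT committed_at, snapshot_id, operation
    FROM demo.telecom.usage_events.snapshots
    ORDER BY committed_at
""").show(truncate=False)

# File-level statistics in Iceberg metadata drive aggressive pruning at scale.
spark.sql("""
    SELECT file_path, record_count, file_size_in_bytes
    FROM demo.telecom.usage_events.files
""").show(truncate=False)
```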

Through integration with Cloudera Shared Data Experience (SDX), the team also gained fine-grained governance with role-based and attribute-based access control, making sure that the right people had access to the right data. This foundation enabled the business to deliver timely and auditable reports while meeting growing regulatory demands. 

Transforming Telecom with Data-Driven Efficiency

By partnering with Cloudera, Vodafone Idea preserved flexibility, strengthened governance, and accelerated insight delivery at scale—without having to rebuild its entire data stack. Using Spark for ingestion, Iceberg for unified table management, and Impala for reporting, the company modernized its foundation while reusing existing logic and workflows.

Together, this architecture delivered measurable results:

  • Reduced query times by 80%.
  • Decreased pipeline failures via Spark’s resilience at scale and Iceberg’s robust table management capabilities.
  • Improved regulatory reporting, making it faster and more reliable.


How a Pharmaceutical Company Consolidated to Scale: One Tech Stack, 10,000 Jobs

A global pharmaceutical company managing PB-scale clinical research data faced a familiar but growing problem: too many tools in play, which led to data reliability issues and difficulty meeting compliance standards, all while pressure mounted to support faster AI and analytics. The data engineering teams needed to run more than 10,000 daily ETL jobs, but lacked a standardized way to build, govern, or validate pipelines across teams.

With Cloudera on AWS, the company set a clear direction forward. The team standardized all data pipelines using Spark on Cloudera Data Engineering, unifying and scaling processing across batch, streaming, and machine learning workloads. At the same time, they adopted Iceberg as the default open table format to ensure consistent schema evolution, built-in version control, and enterprise-grade governance across teams and environments.
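
To illustrate what a standardized job can look like, here is a hedged sketch of an incremental upsert into a governed Iceberg table using the MERGE INTO support that the Iceberg Spark extensions provide. The table name, bucket path, and schema are hypothetical, not the company's actual pipelines:

```python
# Hedged sketch of a standardized incremental ETL job. Names, paths, and the
# schema are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("clinical-results-upsert").getOrCreate()

# Read the day's raw extract landed upstream in object storage.
updates = spark.read.parquet(
    "s3a://example-bucket/landing/clinical_results/2024-01-15/"
)
updates.createOrReplaceTempView("clinical_results_updates")

# Upsert into the governed Iceberg table. The merge commits atomically, so
# downstream consumers never see a half-applied batch.
spark.sql("""
    MERGE INTO demo.clinical.results AS t
    USING clinical_results_updates AS s
    ON t.result_id = s.result_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```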

By adopting Spark and Iceberg on Cloudera, the company laid a clean, scalable DataOps foundation that standardized data pipelining, enabled secure data sharing across teams and tools, and paved the way for faster and more advanced AI and analytics. This foundation now supports everything from regulatory audit workflows to AI models that accelerate clinical trial discovery and drug development, ensuring the company can seamlessly integrate any new technology or engine in the future.

Transforming Pharma with a Unified Data Platform

Standardizing on Cloudera’s platform gave the global pharmaceutical company a new level of operational consistency:

  • Governance without disruption: Iceberg’s write-audit-publish pattern allows upstream teams to validate data before releasing it to production—without breaking downstream workflows.
  • Time traveling for traceability: Regulatory teams can access historical data snapshots instantly, enabling clean rollback and audit support (both this and the write-audit-publish pattern are sketched after this list).
  • Shared pipeline logic: With Spark as the unified engine, teams—ranging from data engineers to data scientists—can collaborate easily and reuse core transformations across jobs and environments, reducing duplication and simplifying maintenance.
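
A hedged sketch of both patterns using Iceberg's staged-commit support on Spark, reusing the hypothetical clinical results table from above. The audit tag, the checks, and the table names are assumptions, not the company's actual implementation:

```python
# Hedged sketch of write-audit-publish (WAP) plus time travel with Iceberg.

# --- Write: stage the day's commit without publishing it ---
spark.sql("""
    ALTER TABLE demo.clinical.results
    SET TBLPROPERTIES ('write.wap.enabled' = 'true')
""")
spark.conf.set("spark.wap.id", "audit-2024-01-15")  # tags this session's writes

spark.sql("""
    INSERT INTO demo.clinical.results
    SELECT * FROM clinical_results_updates
""")  # lands as a staged snapshot, not yet the table's current state

# --- Audit: find and validate the staged snapshot before publishing ---
staged_id = spark.sql("""
    SELECT snapshot_id
    FROM demo.clinical.results.snapshots
    WHERE summary['wap.id'] = 'audit-2024-01-15'
""").first()["snapshot_id"]
# (run whatever row-count, null, or referential checks the team requires here)

# --- Publish: make the audited snapshot the table's current state ---
spark.sql(f"CALL demo.system.cherrypick_snapshot('clinical.results', {staged_id})")

# --- Time travel (Spark 3.3+ syntax): query the table as of a past point ---
spark.sql("""
    SELECT count(*) AS rows_at_year_start
    FROM demo.clinical.results TIMESTAMP AS OF '2024-01-01 00:00:00'
""").show()
```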


Building a Modern Foundation for Data Engineering and AI

These two stories share a common thread: both organizations faced fragmentation, scale pressure, and growing complexity in their data workflows. By standardizing on Apache Spark and Apache Iceberg with Cloudera, they rebuilt their pipelines around open, scalable, and trusted components—enabling better governance, faster performance, and cleaner data flows for AI and analytics.

With Cloudera Data Engineering, enterprises get an end-to-end solution that runs across hybrid and multi-cloud environments. It brings together Spark, Iceberg, and integrated orchestration with Airflow (sketched after the list below) to empower teams to:

  • Build pipelines once and run them anywhere, in the data center or across clouds
  • Maintain trust and governance at scale in the open data lakehouse
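
As a minimal orchestration sketch, the DAG below chains two hypothetical Spark jobs with stock Airflow operators. It is intentionally generic rather than CDE-specific (Cloudera Data Engineering's embedded Airflow also provides a dedicated operator for triggering CDE jobs), and the job names, scripts, and schedule are assumptions:

```python
# Hedged sketch: a minimal Airflow DAG chaining two hypothetical Spark jobs.
# Requires Airflow 2.4+ for the `schedule` argument.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="iceberg_daily_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Ingest the day's raw files and upsert them into the Iceberg table.
    ingest = BashOperator(
        task_id="ingest_to_iceberg",
        bash_command="spark-submit /apps/etl/clinical_results_upsert.py",
    )

    # Rebuild the downstream reporting table once ingestion has committed.
    report = BashOperator(
        task_id="build_regulatory_report",
        bash_command="spark-submit /apps/etl/build_regulatory_report.py",
    )

    ingest >> report
```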

Watch this interactive demo to see how Spark and Iceberg power trusted, scalable pipelines on Cloudera. Try it yourself with the Cloudera Data Engineering 5-day trial and start building AI-ready data workflows today.
