Enterprises today face a steep challenge: they want to leverage advanced AI models to stay competitive, but need to keep the high costs of cloud-based large language models (LLMs) under control and stay compliant with data privacy regulations.
So how can businesses explore cutting-edge AI without overextending budgets or exposing sensitive private data? At Cloudera, we’ve developed a solution that turns this challenge into an opportunity—using synthetic data generated from private data and knowledge distillation to build cost-efficient, accurate, and compliant AI systems.
In this article, we discuss how Cloudera’s Synthetic Data Generation Studio, part of Cloudera AI Studios, allows organizations to capitalize on AI innovation even when real-world data is scarce or sensitive.
Use case: Drawing from an internal use case, we’ll show how we significantly improved the performance and overall throughput for Cloudera’s customer support ticket pipeline through knowledge distillation using synthetic data generated from private data, while maintaining data privacy and regulatory compliance.
Key takeaways:
Data privacy as a competitive advantage: Synthetic data enables innovation without regulatory risk.
Cost-effective performance: Smaller, fine-tuned models outperform larger, resource-heavy alternatives.
Applicable to multiple use cases: The same approach can power use cases from fraud detection to personalized customer service.
Cloudera’s customer support team leverages AI models to analyze and summarize customer support tickets in real time. The system takes customer and Cloudera support agent comments as input, analyzes each comment, and extracts a set of analytics, such as sentiment and a summary. These analytics are essential for improving the customer experience at Cloudera.
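To make the pipeline's output concrete, here is a minimal sketch of what the per-comment analytics record might look like and how its schema could be validated. The field names and the sentiment labels are illustrative assumptions; the article only mentions sentiment and summarization as examples of the extracted analytics.

```python
from dataclasses import dataclass

# Hypothetical shape of the analytics extracted per support-ticket comment.
# Field names are assumptions; the article names sentiment and summarization.
@dataclass
class TicketAnalytics:
    ticket_id: str
    sentiment: str   # assumed label set: "positive" | "neutral" | "negative"
    summary: str

def parse_llm_output(ticket_id: str, raw: dict) -> TicketAnalytics:
    """Validate the model's JSON-style output against the expected schema."""
    sentiment = raw.get("sentiment", "neutral")
    if sentiment not in {"positive", "neutral", "negative"}:
        raise ValueError(f"unexpected sentiment: {sentiment}")
    return TicketAnalytics(ticket_id, sentiment, raw.get("summary", ""))
```

Validating model output against a fixed schema like this is also what the "adherence to the expected output" metric discussed later measures.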
Due to the sensitive nature of the customer data being processed in this pipeline, only models running in local environments can be used and no customer data can be shared with any external sources.
Initially, to analyze the comments, the team relied on a large local LLM (Goliath 120B), which met basic performance requirements but lagged in speed and generation quality: on average, each request took 12-15 seconds to process, while new requests arrived every 30 seconds. Adherence to the expected output format was 77.5%, and generation accuracy was lower than that of proprietary models, a bottleneck for scalability and LLM performance.
The challenges of using a large local LLM (Goliath-120B) were clear: slower response times, increased costs, lower generation accuracy than state-of-the-art, cloud-based models, and compliance risks.
Large organizations face similar trade-offs—balancing AI accuracy and speed against the risks of data exposure.
Cloudera’s breakthrough lies in a privacy-first approach to knowledge distillation.
Instead of training models on raw customer data, which had regulatory and exposure risks, we generated synthetic datasets using Cloudera Synthetic Data Studio. This new low-code tool in Cloudera AI mimicked real-world interactions—technical questions, troubleshooting scenarios, and more—without ever exposing private information.
Generating synthetic customer support interactions not only avoided regulatory and exposure risks, it also enabled the team to send the synthetic data to state-of-the-art, cloud-based LLMs, which extract insights such as customer sentiment far more accurately than large local LLMs. That made these frontier models an ideal teacher from which to distill knowledge.
Cloudera’s synthetic data solution eliminated the compliance and privacy risks and produced higher-quality training data than the existing large local LLMs could generate. This unlocked the option to distill knowledge from state-of-the-art models into small LLMs that solve the same problem as Goliath-120B at lower cost and higher accuracy.
Data generation: Using the Synthetic Data Studio data generation workflow, we crafted a prompt instructing Claude Sonnet to generate customer questions and answers. The prompt instructs the LLM to create customer support questions and answers, impose the tone, and detail the structure. In addition, we provide a list of topics that appear in real-world data (such as customer support for Cloudera AI or Cloudera Data Warehouse) and use seed topics to ensure both diverse and real-world customer support ticket generation.
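The seed-topic technique described above can be sketched as follows. This is not the Synthetic Data Studio API; the topic list, template wording, and function names are illustrative assumptions about how a topic-conditioned generation prompt is assembled before being sent to the teacher model (Claude Sonnet in the article).

```python
import random

# Illustrative seed topics, modeled on the examples the article mentions.
SEED_TOPICS = [
    "customer support for Cloudera AI",
    "customer support for Cloudera Data Warehouse",
    "troubleshooting a failed data pipeline",
]

# Hypothetical prompt template: imposes the tone and details the structure,
# as the article describes, and conditions generation on a seed topic.
PROMPT_TEMPLATE = """You are generating synthetic customer support data.
Topic: {topic}
Write one realistic customer question and a helpful agent answer.
Use a professional, courteous tone.
Return JSON with keys "question" and "answer"."""

def build_generation_prompt(topic=None, rng=None):
    """Pick a seed topic (for diversity across calls) and fill the template."""
    rng = rng or random.Random()
    topic = topic or rng.choice(SEED_TOPICS)
    return PROMPT_TEMPLATE.format(topic=topic)
```

Rotating through seed topics is what keeps the generated tickets both diverse and grounded in themes that actually occur in real-world support data.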
Fine-tuning: Using only the filtered data, the team split the data into train and development sets and distilled knowledge from the Claude Sonnet model into a Meta Llama 3.1-8B-Instruct model. The team ran multiple experiments to select the fine-tuning parameters that maximized the performance of the distilled LLM.
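The split step might look like the sketch below. The article does not state the split ratio or tooling, so the 90/10 fraction and the fixed seed are assumptions; a deterministic seed simply keeps the development set stable across the multiple fine-tuning experiments.

```python
import random

def train_dev_split(samples, dev_fraction=0.1, seed=42):
    """Deterministically shuffle filtered samples and carve off a dev set.

    dev_fraction and seed are illustrative defaults, not values from the
    article. Returns (train, dev) with no overlap between the two.
    """
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_dev = max(1, int(len(items) * dev_fraction))
    return items[n_dev:], items[:n_dev]
```

Holding the development set fixed is what makes the hyperparameter experiments comparable: every fine-tuning run is scored against the same held-out samples.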
Evaluation: Using the Synthetic Data Studio evaluation workflow, the team crafted a prompt to instruct an LLM-as-a-judge on how to evaluate the quality of the generated data and filtered out low-quality samples.
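A minimal sketch of the filtering logic follows. The judge itself is an LLM call that the article does not detail, so `judge_fn` is a stand-in, and the 1-5 scale and threshold are assumptions; only samples rated at or above the threshold survive into fine-tuning.

```python
def filter_by_judge(samples, judge_fn, min_score=4, scale=5):
    """Keep only samples the LLM-as-a-judge rates at or above min_score.

    judge_fn(sample) -> int score on a 1..scale scale; it stands in for
    a real LLM-as-a-judge call. min_score and scale are assumed values.
    """
    kept = []
    for s in samples:
        score = judge_fn(s)
        if not 1 <= score <= scale:
            raise ValueError(f"judge score out of range: {score}")
        if score >= min_score:
            kept.append(s)
    return kept
```

Filtering before fine-tuning matters because the student model can only be as good as its training data: low-quality synthetic samples would otherwise be distilled straight into the smaller LLM.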
Using both human and automated LLM-as-a-judge evaluations, the team scored real-world customer support ticketing questions and answers. Cloudera’s team focused on answers on which the deployed and distilled LLMs differed and reported the win rate of each LLM. In addition, they measured speed improvements in terms of average running time, adherence to the expected output, and the cost to deploy the model.
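The win-rate metric reported below reduces to simple counting over pairwise judgments. This sketch assumes each judgment names a winner outright, consistent with the article's focus on answers where the two models' outputs differed.

```python
def win_rates(judgments):
    """Compute each model's win rate from pairwise judgments.

    judgments: iterable of "a" or "b", naming the winner of each
    comparison (a = distilled model, b = deployed model, say).
    Returns (win_rate_a, win_rate_b).
    """
    js = list(judgments)
    if not js:
        raise ValueError("no judgments provided")
    a_wins = js.count("a")
    b_wins = len(js) - a_wins
    return a_wins / len(js), b_wins / len(js)
```

Running the same tally over both the Phi-4 judgments and the human judgments is what yields the pair of win-rate figures (70%/30% and 63%/37%) reported in the results.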
Improved speed: Processing time dropped 95%.
Better output structure: Output adherence rose from 77.5% to 99.5%.
Higher LLM accuracy: When comparing the smaller distilled LLM (Llama 3.1 8B) against the deployed Goliath LLM (Goliath 120B), win rate was 70% vs. 30% when using Phi-4 as a judge and 63% vs. 37% when using human evaluators to compare the two models.
Improved cost and efficiency: The smaller distilled LLM reduced compute and memory needs while increasing real-time scalability and maintaining data privacy, and throughput improved 11x.
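As a rough sanity check on how the latency and throughput figures relate, the arithmetic below assumes a 13.5-second midpoint of the original 12-15 second range (my assumption, not a figure from the article). Note that a 95% latency reduction is a 20x per-request speedup, while throughput improved 11x; the two need not match, since throughput also depends on serving concurrency and overhead.

```python
# Numbers from the article, plus one assumption (the 13.5 s midpoint).
old_latency = 13.5                   # assumed midpoint of 12-15 s per request
reduction = 0.95                     # reported 95% drop in processing time

new_latency = old_latency * (1 - reduction)   # roughly 0.68 s per request
latency_speedup = old_latency / new_latency   # 20x faster per request
throughput_gain = 11                          # reported separately
```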
The results are clear: enterprises can achieve AI excellence without compromising data privacy. By synthesizing training data and distilling knowledge, businesses avoid trade-offs between innovation and compliance.
By developing a knowledge distillation approach, Cloudera achieved a 95% reduction in processing time, increased output structure adherence to 99.5%, and deployed a distilled Llama 3.1 8B model that beat the prior Goliath 120B model with a 70% win rate as judged by Phi-4 and a 63% win rate in human evaluations.
This method eliminated compliance risks by avoiding direct use of sensitive data and also unlocked 11x greater throughput, showing that smaller, fine-tuned models can surpass larger, resource-intensive alternatives in both speed and precision.
Try our AMP to explore how to use private synthetic data to distill knowledge from a large model to a smaller model for a customer support use case.