Impact

Ensure compliance with legal requirements that include data updates through the ability to modify or delete row-level data without the need to rewrite an entire partition.

Improve data recovery, even in the case of accidental deletion, through the snapshot capability.

By utilizing Cloudera's Open Data Lakehouse, KakaoPay has significantly boosted query performance, reducing data processing time by approximately 30%.

Solutions

  • Cloudera on premises
  • Cloudera Professional Services
  • Streaming data powered by Apache NiFi
  • SQL support for operational databases powered by Apache Phoenix

Data Architecture

Open Data Lakehouse powered by Apache Iceberg

Industry

Financial Services

Country

South Korea

KakaoPay is pursuing several strategies to effectively utilize data, including improving data quality, strengthening data analytics capabilities, driving data-driven decision-making, and strengthening data security.

KakaoPay provides mobile payment and financial services via KakaoTalk. Launched in September 2014 as Korea’s first simple payment service, KakaoPay has since expanded to include remittance, overseas payment, loan comparison, and wealth management services like securities, funds, and insurance, making financial services more accessible.

To support these goals, KakaoPay's strategy focuses on building a data platform that integrates various data sources, analyzes them seamlessly, and uses data at scale to give end users a better financial experience and sustain growth.

KakaoPay’s data platform collects and processes real-time and batch data, provides data to customers, visualizes it with business intelligence (BI) tools, operates the core platform, and establishes data governance to ensure a stable analysis environment.

KakaoPay Enhances Data Management with Cloudera: Improved Analysis, Real-Time Processing, and Seamless Querying

KakaoPay had been running a legacy Cloudera version. It upgraded and migrated to the latest release to modernize its platform and take advantage of newer capabilities.

KakaoPay’s deployment consists of three clusters that work together to deliver seamless management and analytics capabilities.

The first is an analysis cluster, which is the main cluster used for analyzing data. It consists of a Cloudera Base cluster and a Cloudera Data Services cluster, both deployed on-premises.

The analysis cluster runs the following components:

  • Apache HDFS and Apache Kudu are used for data storage.
  • Apache Ranger is used for managing data access and auditing.
  • Apache Oozie manages and schedules workflows.
  • Apache Impala, Hive, and Spark are used for data processing.
  • Apache Iceberg’s open table format is used for row-level data updates and deletes as well as snapshots.

The second is a real-time data serving cluster. It is a Phoenix cluster with a disaster recovery setup that spans multiple Internet Data Centers (IDCs). KakaoPay created an HBase connection manager system that detects a problem cluster and switches the active role to a healthy cluster. A NiFi cluster handles real-time data collection.
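The failover behavior described above can be illustrated with a minimal sketch. This is not KakaoPay's connection manager, whose design the source does not describe. The class name `ClusterConnectionManager`, the cluster names `idc-a` and `idc-b`, and the health-check callback are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ClusterConnectionManager:
    """Toy failover manager: routes to the first healthy cluster in priority order."""

    clusters: List[str]                  # hypothetical cluster names, highest priority first
    health_check: Callable[[str], bool]  # returns True when the named cluster is healthy
    active: Optional[str] = None

    def __post_init__(self) -> None:
        # Start on the highest-priority cluster unless one was given explicitly.
        if self.active is None:
            self.active = self.clusters[0]

    def refresh(self) -> str:
        """Keep the active cluster if it is healthy, otherwise fail over."""
        if self.health_check(self.active):
            return self.active
        for name in self.clusters:
            if name != self.active and self.health_check(name):
                self.active = name
                return name
        raise RuntimeError("no healthy cluster available")


# Simulated health status for two data centers (assumed values).
status = {"idc-a": True, "idc-b": True}
manager = ClusterConnectionManager(["idc-a", "idc-b"], lambda name: status[name])

status["idc-a"] = False   # the active cluster develops a problem
manager.refresh()         # the manager switches the active role to idc-b
```

A real system would also need to decide when to fail back and how to avoid flapping between clusters. The sketch leaves both out.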

The last is a heterogeneous query cluster. It is a Trino cluster built on Kubernetes. Trino can query across multiple sources without first collecting the data into the platform. KakaoPay uses this cluster to check in advance whether data is suitable for regular collection.

Row-level changes consumed too many resources and time

As a fintech company, KakaoPay provides financial services and has to comply with legal requirements. One of these requirements is to periodically delete the data of unsubscribers. However, most of the data stored in the Hadoop Distributed File System (HDFS) could not be deleted in place, and even with Impala on top of HDFS for query analytics, it was difficult for KakaoPay to modify or delete row-level data. When data had to be updated, KakaoPay had to rewrite the entire partition, which consumed too many resources and too much time.

KakaoPay considered using Kudu as a solution to this problem, but Kudu is designed for processing near-real-time data, making it unsuitable for simply deleting the records of users who have unsubscribed. Additionally, Kudu tables take a long time to load when they hold too much data.

Another challenge was recovering data that users had accidentally deleted. Previously, KakaoPay had to go through the deleted data directory and recover data that had not yet expired. If the Time to Live (TTL) had expired, recovery was impossible, and the data had to be re-ingested through ETL.

Cloudera enabled row-level data modification and deletion

KakaoPay adopted Apache Iceberg with Cloudera. After the implementation, the team found that Impala could query, delete, and update data in Apache Iceberg tables.

This enabled KakaoPay to modify and delete data on a row-level basis using Apache Iceberg. Additionally, the snapshot feature provided by Apache Iceberg simplifies data recovery if a user accidentally deletes data by allowing the viewing and rollback of past snapshots.
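The following toy model sketches why snapshots make accidental deletes recoverable. It is not Iceberg's or Impala's API: the `SnapshotTable` class and the sample rows are hypothetical, and this model copies whole row lists in memory, whereas Iceberg tracks snapshots through table metadata.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class SnapshotTable:
    """Toy table in which every write commits a new snapshot of the rows."""

    def __init__(self, rows: List[Row]) -> None:
        self._snapshots: List[List[Row]] = [list(rows)]  # snapshot 0 is the initial load
        self._current = 0

    def rows(self) -> List[Row]:
        return list(self._snapshots[self._current])

    def snapshot_id(self) -> int:
        return self._current

    def delete_where(self, predicate: Callable[[Row], bool]) -> int:
        """Row-level delete: commit a new snapshot without the matching rows."""
        kept = [row for row in self.rows() if not predicate(row)]
        self._snapshots.append(kept)
        self._current = len(self._snapshots) - 1
        return self._current

    def rollback(self, snapshot_id: int) -> None:
        """Point the table back at an earlier snapshot."""
        if not 0 <= snapshot_id < len(self._snapshots):
            raise ValueError(f"unknown snapshot: {snapshot_id}")
        self._current = snapshot_id


table = SnapshotTable([
    {"user_id": 1, "status": "active"},
    {"user_id": 2, "status": "unsubscribed"},
    {"user_id": 3, "status": "active"},
])

# Periodic compliance delete: remove only the unsubscribed rows.
after_cleanup = table.delete_where(lambda row: row["status"] == "unsubscribed")

# A user accidentally deletes everything, then the table is rolled back.
table.delete_where(lambda row: True)
table.rollback(after_cleanup)
```

After the rollback, the table again holds the rows for users 1 and 3, so the compliance delete is preserved while the accidental delete is undone.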

“Since the Apache Iceberg architecture searches metadata and filters before reading data, the amount of data that needs to be read to process a query has been significantly reduced,” said Steven Yoon, Senior Data Engineer at KakaoPay. “This ultimately led to improved query performance, and we received feedback from users that query performance improved by about 30%.”
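The metadata filtering Yoon describes can be sketched as follows. This is a simplified illustration, not Iceberg's planner: the `DataFile` class, the file paths, and the per-file minimum and maximum day values are hypothetical stand-ins for the column statistics kept in table metadata.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DataFile:
    path: str                     # hypothetical file name
    min_day: int                  # smallest "day" value stored in the file
    max_day: int                  # largest "day" value stored in the file
    rows: List[Tuple[int, int]]   # (day, amount) pairs


def plan_scan(files: List[DataFile], lo: int, hi: int) -> List[DataFile]:
    """Use only metadata to keep files whose day range overlaps [lo, hi]."""
    return [f for f in files if f.max_day >= lo and f.min_day <= hi]


def query_total(files: List[DataFile], lo: int, hi: int) -> Tuple[int, int]:
    """Sum amounts for days in [lo, hi]; also report how many files were read."""
    selected = plan_scan(files, lo, hi)
    total = sum(amount for f in selected for day, amount in f.rows if lo <= day <= hi)
    return total, len(selected)


files = [
    DataFile("part-0001.parquet", 1, 10, [(1, 100), (10, 50)]),
    DataFile("part-0002.parquet", 11, 20, [(11, 70), (20, 30)]),
    DataFile("part-0003.parquet", 21, 30, [(21, 10), (30, 5)]),
]

# Only part-0002 overlaps days 12 to 20, so one of the three files is read.
total, files_read = query_total(files, 12, 20)
```

The fewer files whose value ranges overlap the query predicate, the less data the engine has to read, which is the effect the quote attributes the performance gain to.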

Yoon added, “Previously, both computing resources and storage resources were used on a single server, so if computing resources were insufficient, both had to be added even if storage resources were sufficient. However, in the Cloudera platform environment, if more computing resources are needed, they can be added independently. This resulted in efficient resource management and reduced hardware costs.”

Future Data Strategy with Cloudera

“Using open source technology can be difficult. However, Cloudera provides pre-verified packaging to address the difficulties encountered when utilizing open source,” Yoon said. “Many data experts at Cloudera also analyzed the issues and provided solutions, related documents, and test results, which were very helpful.”

KakaoPay is now considering building a hybrid environment in which data can be loaded to the cloud and analyzed as needed. KakaoPay noted that the biggest competitive advantage and differentiation from other services is that, with Cloudera, it can use the cloud environment alongside its current setup. KakaoPay is also reviewing whether the LLM provided by Cloudera can meet its needs.
