Top 5 Cloudera Engineering Blogs of 2017
1. Working with UDFs in Apache Spark
2. Offset Management For Apache Kafka With Apache Spark Streaming
3. Performance comparison of different file formats and storage engines in the Apache Hadoop ecosystem
4. Up and running with Apache Spark on Apache Kudu
5. Apache Impala Leads Traditional Analytic Database
In Other News
Apache Impala is now a Top-Level Apache Project
Five years ago, Cloudera shared with the world our plan to transfer the lessons from decades of relational database research to the Apache Hadoop platform via a new SQL engine — Apache Impala — the first and fastest open source MPP SQL engine for Hadoop.
Faster Performance for Selective Queries
Impala’s new data elimination technique works by applying predicates against Parquet column statistics and dictionaries. It complements the existing partitioning mechanism and further improves the performance of selective queries.
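To make the idea concrete, here is a minimal sketch of predicate-based data elimination for an equality predicate. It is not Impala's implementation: the `ColumnChunkMeta` type, its fields, and the helper names are hypothetical, and the metadata is hard-coded rather than read from real Parquet files. It assumes each row group records a min/max range for the column and, optionally, the set of values in its dictionary.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class ColumnChunkMeta:
    """Hypothetical per-row-group metadata for a single integer column."""
    min_value: int
    max_value: int
    # None means the chunk is not dictionary-encoded, so no dictionary check is possible.
    dictionary: Optional[FrozenSet[int]] = None


def may_contain_equal(meta: ColumnChunkMeta, value: int) -> bool:
    """Return False only when the metadata proves `column = value` cannot match."""
    # Statistics check: a value outside [min, max] cannot be in this row group.
    if value < meta.min_value or value > meta.max_value:
        return False
    # Dictionary check: if every value is dictionary-encoded and the
    # dictionary lacks the value, the row group cannot contain it.
    if meta.dictionary is not None and value not in meta.dictionary:
        return False
    return True


def row_groups_to_scan(metas: List[ColumnChunkMeta], value: int) -> List[int]:
    """Indexes of the row groups that still have to be read."""
    return [i for i, meta in enumerate(metas) if may_contain_equal(meta, value)]


# Illustrative metadata for three row groups (made-up values).
row_groups = [
    ColumnChunkMeta(1, 100, frozenset({1, 50, 100})),  # in range, but 95 not in dictionary
    ColumnChunkMeta(90, 200),                          # in range, no dictionary: must scan
    ColumnChunkMeta(300, 400, frozenset({300, 400})),  # out of range
]

selected = row_groups_to_scan(row_groups, 95)
print(selected)
```

In this sketch only the second row group is scanned for the value 95: the first is ruled out by its dictionary and the third by its min/max range. Partition pruning would already have narrowed the set of files before this per-row-group check runs.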
Hadoop Delegation Tokens Explained
Apache Hadoop’s security was designed and implemented around 2009 and has been stabilizing ever since. Documentation in this area is sparse, however, which makes problems hard to understand and debug when they arise. Delegation tokens are an authentication method that is widely used across the Hadoop ecosystem. This blog post introduces the concept of Hadoop Delegation Tokens in the context of Hadoop Distributed File System (HDFS) and Hadoop Key Management Server (KMS), and provides some basic code and troubleshooting examples.
Large-Scale Health Data Analytics with OHDSI
Data analytics is increasingly being brought to bear on the treatment of human disease, but as more and more health data is stored in computer databases, one significant challenge is how to perform analyses across these disparate databases. In this post I take a look at the Observational Health Data Sciences and Informatics (or OHDSI, pronounced “Odyssey”) program, which was formed to address this challenge and today accounts for 1.26 billion patient records collectively stored across 64 databases in 17 countries.
Automatic TLS Configuration with Cloudera Director 2.6
Cloudera Director 2.6 and Cloudera Manager 5.13 offer a simple way to have TLS configured for Cloudera Manager and CDH clusters. In this blog post, Bill Havanki describes how to use the new feature and offers technical details behind how the automatic configuration happens.
Upcoming Training
Cloudera Administrator Training, 1/16-1/19, Virtual - Guaranteed to Run
Webinars
January 16th - Building a Better Recommendation System
January 24th - Get Started with Cloudera’s Cyber Solution