Top 5 Cloudera Engineering Blogs of 2017
1. Working with UDFs in Apache Spark
2. Offset Management For Apache Kafka With Apache Spark Streaming
3. Performance comparison of different file formats and storage engines in the Apache Hadoop ecosystem
4. Up and running with Apache Spark on Apache Kudu
5. Apache Impala Leads Traditional Analytic Database
Five years ago, Cloudera shared with the world our plan to transfer the lessons from decades of relational database research to the Apache Hadoop platform via a new SQL engine — Apache Impala — the first and fastest open source MPP SQL engine for Hadoop.
Impala’s new data elimination technique works by applying predicates against Parquet column statistics and dictionaries. It complements the existing partitioning mechanism and further improves the performance of selective queries.
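To make the idea concrete, here is a minimal Python sketch of statistics-based and dictionary-based pruning for an equality predicate. It is not Impala's implementation. The `RowGroup` class, its field names, and the sample values are hypothetical stand-ins for the per-column metadata a Parquet file carries.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class RowGroup:
    """Hypothetical stand-in for one row group's metadata for a single column."""
    name: str
    min_value: int
    max_value: int
    # None models a column chunk that is not dictionary-encoded.
    dictionary: Optional[Set[int]] = None


def may_contain(rg: RowGroup, value: int) -> bool:
    """Return False only when the metadata proves `value` is absent."""
    # Min/max statistics: a value outside the range cannot be present.
    if value < rg.min_value or value > rg.max_value:
        return False
    # Dictionary check: if every value is dictionary-encoded and the
    # dictionary lacks `value`, no row in the group can match.
    if rg.dictionary is not None and value not in rg.dictionary:
        return False
    return True


def row_groups_to_scan(groups: List[RowGroup], value: int) -> List[str]:
    """Names of the row groups that still have to be read for `col = value`."""
    return [rg.name for rg in groups if may_contain(rg, value)]


groups = [
    RowGroup("rg0", 1, 100, {1, 50, 100}),
    RowGroup("rg1", 90, 200, {90, 150, 200}),
    RowGroup("rg2", 300, 400),  # no dictionary, only min/max available
]

print(row_groups_to_scan(groups, 150))  # only rg1 survives both checks
```

Both checks are conservative: a row group is skipped only when its metadata rules the value out, so pruning never changes query results, only the amount of data read.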
Apache Hadoop’s security was designed and implemented around 2009 and has been stabilizing ever since. However, because this area is poorly documented, problems are hard to understand or debug when they arise. Delegation tokens are widely used in the Hadoop ecosystem as an authentication method. This blog post introduces the concept of Hadoop Delegation Tokens in the context of Hadoop Distributed File System (HDFS) and Hadoop Key Management Server (KMS), and provides some basic code and troubleshooting examples.
Cloudera Director 2.6 and Cloudera Manager 5.13 offer a simple way to have TLS configured for Cloudera Manager and CDH clusters. In this blog post, Bill Havanki describes how to use the new feature and offers technical details behind how the automatic configuration happens.
Data analytics is increasingly being brought to bear to treat human disease, but as more and more health data is stored in computer databases, one significant challenge is how to perform analyses across these disparate databases. In this post I take a look at the Observational Health Data Sciences and Informatics (or OHDSI, pronounced “Odyssey”) program that was formed to address this challenge, and which today accounts for 1.26 billion patient records collectively stored across 64 databases in 17 countries.