Developer Center
Cloudera Blog · Distribution Posts
Apache Bigtop 0.3.0 (incubating) has been released

Apache Bigtop 0.3.0 (incubating) is now available. This is the first fully integrated, community-driven, 100% Apache Big Data management distribution based on Apache Hadoop 1.0. In addition to a major change in the Hadoop version, all of the Hadoop ecosystem components have been upgraded to the latest stable versions and thoroughly tested:

Introducing CDH4

I’m pleased to inform our users and customers that Cloudera has released its 4th version of Cloudera’s Distribution Including Apache Hadoop (CDH) into beta today. This release combines the input from our enterprise customers, partners and users with the hard work of Cloudera engineering and the larger Apache open source community to create what we believe is a compelling advance for this widely adopted platform.

There are a great many improvements and new capabilities in CDH4 compared to CDH3. Here is a high level list of what’s available for you to test in this first beta release:

Apache Flume – Architecture of Flume NG

This blog was originally posted on the Apache Blog: https://blogs.apache.org/flume/entry/flume_ng_architecture

Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. Flume is currently undergoing incubation at The Apache Software Foundation. More information on this project can be found at http://incubator.apache.org/flume. Flume NG is work related to new major revision of Flume and is the subject of this post.

Prior to entering the incubator, Flume saw incremental releases leading up to version 0.9.4. As Flume became adopted it became clear that certain design choices would need to be reworked in order to address problems reported in the field. The work necessary to make this change began a few months ago under the JIRA issue FLUME-728. This work currently resides on a separate branch by the name flume-728, and is informally referred to as Flume NG. At the time of writing this post Flume NG had gone through two internal milestones – NG Alpha 1, and NG Alpha 2 and a formal incubator release of Flume NG is in the works.

Hadoop World 2011: A Glimpse into Development

The Development track at Hadoop World is a technical deep dive dedicated to discussion about Apache Hadoop and application development for Apache Hadoop. You will hear committers, contributors and expert users from various Hadoop projects discuss the finer points of building applications with Hadoop and the related ecosystem. The sessions will touch on foundational topics such as HDFS, HBase, Pig, Hive, Flume and other related technologies. In addition, speakers will address key development areas including tools, performance, bringing the stack together and testing the stack. Sessions in this track are for developers of all levels who want to learn more about upcoming features and enhancements, new tools, advanced techniques and best practices.

Preview of Development Track Sessions

Building Web Analytics Processing on Hadoop at CBS Interactive
Michael Sun, CBS Interactive

Evolution of Hadoop Ecosystem: AOL Advertising Experience

Pero works on research and development in new technologies for online advertising at Aol Advertising R&D in Palo Alto. Over the past 4 years he has been the Chief Architect of R&D distributed ecosystem comprising more than thousand nodes in multiple data centers. He also led large-scale contextual analysis, segmentation and machine learning efforts at AOL, Yahoo and Cadence Design Systems and published patents and research papers in these areas.

A critical premise for success of online advertising networks is to successfully collect, organize, analyze and use large volumes of data for decision making. Given the nature of their online orientation and dynamics, it is critical that these processes be automated to the largest extent possible.

Specifically, the success of advertising technology and its impact on revenue are directly proportional to its capability to use large amounts of data in order to compute proper impression value given the unique circumstances of ad serving events such as the characteristics of the impression, the ad, and the user as well as the content and context. As a general rule, more data results in more accurate predictions.

Cloudera Service and Configuration Manager Express Edition Screencast (Director’s Cut)

Philip Zeyliger is a software engineer at Cloudera and started the SCM
project.

Two weeks ago, at Hadoop Summit, we released our Service and Configuration Manager (SCM) Express. It’s a dramatically simpler and faster way to get started with Cloudera’s Distribution including Apache Hadoop (CDH). In a previous blog post, we talked in some detail about SCM Express and what it can do for you.

The screencast included in this post demonstrates the simplicity of a CDH installation using SCM Express. The “Directors” conversing in the background are engineers Philip Langdale and Philip Zeyliger and VP of Products, Charles Zedlewski.

Video

SCM Express: Now Anyone Can Experience the Power of Apache Hadoop

Phil Langdale is a software engineer at Cloudera and the technical lead for Cloudera’s SCM Express product.

What is SCM Express?


The Only Full Lifecycle Management for Apache Hadoop: Introducing Cloudera Enterprise 3.5 and SCM Express

Drew O’Brien is a product marketing manager at Cloudera

We’re excited to share the news about the immediate availability of Cloudera Enterprise 3.5 and SCM Express, which we announced this week in tandem with our presence at Hadoop Summit. These products represent a major advance in Cloudera’s mission to drive massive enterprise adoption of 100% open source Apache Hadoop. We now make it easier and more convenient than ever before for companies to run and manage Apache Hadoop clusters throughout their entire operational lifecycle.

Cloudera Enterprise 3.5 is a substantial update to our subscription service that delivers production support and management software for Apache Hadoop and the entire Apache Hadoop ecosystem. With new features like automated service and configuration tools, activity monitoring, and one-click security, we’ve streamlined the extremely complex processes and eliminated the uncertainties associated with ongoing management and maintenance of Hadoop clusters in production. Cloudera Enterprise 3.5 codifies and makes available best practices Cloudera has learned over many years of helping enterprise customers build and manage Apache Hadoop-based systems.

If 80% of data is unstructured, is it the exception or a new rule?

Ed Albanese leads business development for Cloudera. He is responsible for identifying new markets, revenue opportunities and strategic alliances for the company.

This week’s announcement about the availability of the Cloudera Connector for IBM Netezza is the achievement of a major milestone, but not necessarily the one you might expect. It’s not just the delivery of a useful software component; it’s also the introduction of a new generation of data management architectures.  For literally decades, data management architecture consisted of RDBMS, a BI tool and an ETL engine. Those three components assembled together gave you a bonafide data management environment. That architecture has been relevant for long enough to withstand the onslaught of data driven by the introduction of ERP, the rise and fall of client/server and several versions of web architecture. But the machines are unrelenting. They keep generating data. And there’s not just more of it, there is more you can—and often need—to do with it.

The times they are a-changin’, and unstructured data is taking over

Companies of all sizes and in nearly every vertical are increasingly tasked with decoding the information being generated by the machines they rely on most. New data sources are creating new data types, including web data, clickstreams, location data, point of sale, social data, building sensors, vehicle and aircraft data, satellite images, medical images, log files, network data and weather data… just to name a few. These data sources were but a glimmer in the eyes of the forefathers of the RDBMS and were most certainly not accounted for in its design. And yet, the percentage of data that fit into this newer bucket is growing at astounding rates. While at Netezza’s Enzee event this week, I listened to Steve Mills, IBM Senior Vice President and Group Executive for the Software Group, cite that more than 80% of the world’s data is unstructured.

So what to do with all of this data?

Reflections from Enzee Universe 2011

Bala Venkatrao is the director of product management at Cloudera.

I had the pleasure of attending Enzee Universe 2011 User Conference this week (June 20-22) in Boston. The conference was very well organized and was attended by well over 1000+ attendees, many of whom lead the Data Warehouse/Data Management functions for their companies.  This was Netezza’s largest conference so far in seven years. Netezza is known for enterprise data warehousing, and in fact, they pioneered the concept of the data warehouse appliance. Netezza is a success story: since its founding in 2000, Netezza has seen a steady growth in customers and revenues and last year (2010), IBM acquired Netezza for a whopping $1.7B.

Cloudera announced a partnership with Netezza last year and since then the two companies have been working closely to build a high-speed bi-directional connector between Netezza DW appliances and Apache Hadoop. We launched the general availability of this connector at this week at Enzee Universe 2011. You can download the connector here:  https://ccp.cloudera.com/display/SUPPORT/Downloads. Thank you to the teams at Netezza and Cloudera for making this happen. It’s been a great collaboration!