The Development track at Hadoop World is a technical deep dive dedicated to discussion about Apache Hadoop and application development for Apache Hadoop. You will hear committers, contributors and expert users from various Hadoop projects discuss the finer points of building applications with Hadoop and the related ecosystem. The sessions will touch on foundational topics such as HDFS, HBase, Pig, Hive, Flume and other related technologies. In addition, speakers will address key development areas including tools, performance, bringing the stack together and testing the stack. Sessions in this track are for developers of all levels who want to learn more about upcoming features and enhancements, new tools, advanced techniques and best practices.
Preview of Development Track Sessions
Building Web Analytics Processing on Hadoop at CBS Interactive
Michael Sun, CBS Interactive
This blog was originally posted on the Apache Blog: https://blogs.apache.org/sqoop/entry/apache_sqoop_overview
Using Hadoop for analytics and data processing requires loading data into clusters and processing it in conjunction with other data that often resides in production databases across the enterprise. Loading bulk data into Hadoop from production systems or accessing it from map reduce applications running on large clusters can be a challenging task. Users must consider details like ensuring consistency of data, the consumption of production system resources, data preparation for provisioning downstream pipeline. Transferring data using scripts is inefficient and time consuming. Directly accessing data residing on external systems from within the map reduce applications complicates applications and exposes the production system to the risk of excessive load originating from cluster nodes.
This is where Apache Sqoop fits in. Apache Sqoop is currently undergoing incubation at Apache Software Foundation. More information on this project can be found at http://incubator.apache.org/sqoop.
Continuing with our practice from Cloudera’s Distribution Including Apache Hadoop v2 (CDH2), our goal is to provide regular (quarterly), predictable updates to the generally available release of our open source distribution. For CDH3 the first such update is available today, approximately 3 months from when CDH3 went GA.
For those of you who are recent Cloudera users, here is a refresh on our update policy:
- by Tim Robertson, Lars Francke and Oliver Meyn
- June 21, 2011
- 2 comments
This post was contributed by The Global Biodiversity Information Facility development team.
The Global Biodiversity Information Facility is an international organization, whose mission is to promote and enable free and open access to biodiversity data worldwide. Part of this includes operating a search, discovery and access system, known as the Data Portal; a sophisticated index to the content shared through GBIF. This content includes both complex taxonomies and occurrence data such as the recording of specimen collection events or species observations. While the taxonomic content requires careful data modeling and has its own challenges, it is the growing volume of occurrence data that attracts us to the Hadoop stack.
- Overview
- Downloads
- Learn Hadoop
- Get Support
-
Blog
- Avro (11)
- Careers (10)
- CDH (29)
- Cloudera Manager (10)
- Cloudera's Service And Configuration Manager (6)
- Community (86)
- Connector (6)
- Data Collection (13)
- Distribution (34)
- Flume (6)
- General (237)
- Guest (35)
- Hadoop (146)
- HBase (40)
- HDFS (26)
- Hive (22)
- MapReduce (37)
- Oozie (4)
- Pig (15)
- Sqoop (9)
- Testing (5)
- Training (18)
- Use Case (11)
- Whirr (1)
- ZooKeeper (10)
- Archives by Month

Hadoop was created by 