Developer Center
Cloudera Blog · Sqoop Posts
Sqoop Graduation Meetup

This blog was originally posted on the Apache Blog:
https://blogs.apache.org/sqoop/entry/sqoop_graduation_meetup

Cloudera hosted the Apache Sqoop Meetup last week at Cloudera HQ in Palo Alto. About 20 of the Meetup attendees had not used Sqoop before, but were interested enough to participate in the Meetup on April 4th. We believe this healthy interest in Sqoop will contribute to its wide adoption. 

Not only was this Sqoop’s second Meetup but also a celebration for Sqoop’s graduation from the Incubator, cementing its status as a Top-Level Project in Apache Software Foundation. Sqoop’s come a long way since its beginnings three years ago as a contrib module for Apache Hadoop submitted by Aaron Kimball. As a result, it was fitting that Aaron gave the first talk of the night by discussing its history: “Sqoop: The Early Days.” From Aaron, we learned that Sqoop’s original name was “SQLImport” and that it was conceived out of his frustration from the inability to easily query both unstructured and structured data at the same time.

Apache Sqoop: Highlights of Sqoop 2

This blog was originally posted on the Apache Blog: https://blogs.apache.org/sqoop/entry/apache_sqoop_highlights_of_sqoop

Apache Sqoop (incubating) was created to efficiently transfer bulk data between Hadoop and external structured datastores, such as RDBMS and data warehouses, because databases are not easily accessible by Hadoop. Sqoop is currently undergoing incubation at The Apache Software Foundation. More information on this project can be found at http://incubator.apache.org/sqoop.

The popularity of Sqoop in enterprise systems confirms that Sqoop does bulk transfer admirably. That said, to enhance its functionality, Sqoop needs to fulfill data integration use-cases as well as become easier to manage and operate.

What is Sqoop?

Cloudera Connector for Teradata 1.0.0

Apache Sqoop (incubating) provides an efficient approach for transferring big data between Hadoop related systems (such as HDFS, Hive, and HBase) and structured data stores (such as relational databases, data warehouses, and NoSQL systems). The extensible architecture used by Sqoop allows support for a data store to be added as a so-called connector. By default, Sqoop comes with connectors for a variety of databases such as MySQL, PostgreSQL, Oracle, SQL Server, and DB2. In addition, there are also third-party connectors available separately from various vendors for several other data stores, such Couchbase, VoltDB, and Netezza. This post will take a brief look at the newly introduced Cloudera Connector for Teradata 1.0.0.

Features

A key feature of the connector is that it uses temporary tables to provide atomicity on data transfer. This feature ensures that either all or none of the data are transferred during import and export operations. Moreover, the connector opens JDBC connection against Teradata for fetching and inserting data, and it automatically injects appropriate parameter underneath to use the FastExport/FastLoad feature of Teradata for fast performance.

Installation

The first thing you will need is to install Sqoop. CDH3 documentation serves as a good reference on how to do this. You also need the Teradata JDBC JAR files (terajdbc4.jar and tdgssconfig.jar), and they can be put under the lib directory of your Sqoop installation (so Sqoop can pick them up at run time). One last thing is to enable Sqoop to process the Teradata JDBC URL syntax with the specialized Teradata manager factory. To do this, you can add the following inside a sqoop-site.xml file within the configuration directory of your Sqoop installation:

What’s New in Apache Sqoop 1.4.0-incubating

This blog was originally posted on the Apache Blog.

Apache Sqoop recently celebrates its first incubator release, version 1.4.0-incubating.  There are several new features and improvements added in this release.  This post will cover some of those interesting changes.  Sqoop is currently undergoing incubation at The Apache Software Foundation.  More information on this project can be found at http://incubator.apache.org/sqoop.

Customized Type Mapping (SQOOP-342)

Sqoop is equipped with a default mapping from most SQL types to appropriate Java or Hive counterparts during import.  Even though, this one-mapping-fits-all approach might not be ideal in all scenarios considering a wide variety of data stores available today, not to mention there are certain vendor-specific SQL types that may not be covered by the default mapping.

Inaugural Sqoop Meetup

This blog was originally posted on the Apache Blog:
 https://blogs.apache.org/sqoop/entry/inaugural_sqoop_meetup

Over 30 people attended the inaugural Sqoop Meetup on the eve of Hadoop World in NYC. Faces were put to names, troubleshooting tips were swapped, and stories were topped – with the table-to-end-all-tables weighing in at 28 billion rows.

I started off the scheduled talks by discussing “Habits of Effective Sqoop Users.” One tip to make your next debugging session more effective was to provide more information up front on the mailing list such as versions used and running with the –verbose flag enabled. Also, I pointed out workarounds to common MySQL and Oracle errors.

Hadoop World 2011: A Glimpse into Development

The Development track at Hadoop World is a technical deep dive dedicated to discussion about Apache Hadoop and application development for Apache Hadoop. You will hear committers, contributors and expert users from various Hadoop projects discuss the finer points of building applications with Hadoop and the related ecosystem. The sessions will touch on foundational topics such as HDFS, HBase, Pig, Hive, Flume and other related technologies. In addition, speakers will address key development areas including tools, performance, bringing the stack together and testing the stack. Sessions in this track are for developers of all levels who want to learn more about upcoming features and enhancements, new tools, advanced techniques and best practices.

Preview of Development Track Sessions

Building Web Analytics Processing on Hadoop at CBS Interactive
Michael Sun, CBS Interactive

Apache Sqoop – Overview

This blog was originally posted on the Apache Blog: https://blogs.apache.org/sqoop/entry/apache_sqoop_overview

Using Hadoop for analytics and data processing requires loading data into clusters and processing it in conjunction with other data that often resides in production databases across the enterprise. Loading bulk data into Hadoop from production systems or accessing it from map reduce applications running on large clusters can be a challenging task. Users must consider details like ensuring consistency of data, the consumption of production system resources, data preparation for provisioning downstream pipeline. Transferring data using scripts is inefficient and time consuming. Directly accessing data residing on external systems from within the map reduce applications complicates applications and exposes the production system to the risk of excessive load originating from cluster nodes.

This is where Apache Sqoop fits in. Apache Sqoop is currently undergoing incubation at Apache Software Foundation. More information on this project can be found at http://incubator.apache.org/sqoop.

CDH3 Update 1 Released

Continuing with our practice from Cloudera’s Distribution Including Apache Hadoop v2 (CDH2), our goal is to provide regular (quarterly), predictable updates to the generally available release of our open source distribution.  For CDH3 the first such update is available today, approximately 3 months from when CDH3 went GA.

For those of you who are recent Cloudera users, here is a refresh on our update policy:

Biodiversity Indexing: Migration from MySQL to Hadoop


This post was contributed by The Global Biodiversity Information Facility development team.

The Global Biodiversity Information Facility is an international organization, whose mission is to promote and enable free and open access to biodiversity data worldwide. Part of this includes operating a search, discovery and access system, known as the Data Portal; a sophisticated index to the content shared through GBIF. This content includes both complex taxonomies and occurrence data such as the recording of specimen collection events or species observations. While the taxonomic content requires careful data modeling and has its own challenges, it is the growing volume of occurrence data that attracts us to the Hadoop stack.