This was my summer internship project at Cloudera, and I’m very thankful for the level of support and mentorship I’ve received from the HBase community. I started off in June with a very limited knowledge of both HBase and distributed systems in general, and by September, managed to get this patch committed to HBase trunk. I couldn’t have done this without a phenomenal amount of help from Cloudera and the greater HBase community.
Background
The amount of memory available on a commodity server has increased drastically in tune with Moore’s law. Today, its very feasible to have up to 96 gigabytes of RAM on a mid-end, commodity server. This extra memory is good for databases such as HBase which rely on in memory caching to boost read performance.
However, despite the availability of high memory servers, the garbage collection algorithms available on production quality JDK’s have not caught up. Attempting to use large amounts of heap will result in the occasional stop-the-world pause that is long enough to cause stalled requests and timeouts, thus noticeably disrupting latency sensitive user applications.
Garbage Collection
Part 1 of this post covered how to convert and store email messages for archival purposes using Apache Hadoop, and outlined how to perform a rudimentary search through those archives. But, let’s face it: for search to be of any real value, you need robust features and a fast response time. To accomplish this we use Solr/Lucene-type indexing capabilities on top of HDFS and MapReduce.
Before getting into indexing within Hadoop, let us review the features of Lucene and Solr:
Apache Lucene and Apache Solr
Apache Lucene is a mature, high performance, full-featured Java API used for indexing and searching that has been around since the late nineties — it supports field-specific indexing and searching, sorting, highlighting, and wildcard searches, to name only a few. Everything in Lucene boils down to creating a document using artifacts such as email messages, HTML, PDF, XML, Word, Excel, etc, the contents of which will end up being parsed and added to Lucene documents as name/value pairs. There are a number of libraries available for extracting actual content, depending on what the artifact is. When extracting content from .msg email files, for instance, TIKA and POI are some useful libraries.
Apache HBase 0.90.5 is now available. This release of the scalable distributed data store inspired by Google’s BigTable is a fix release that covers 81 issues, including 5 considered blockers, and 11 considered critical. The release addresses several robustness and resource leakage issues, fixes rare data-loss scenarios having to do with splits and replication, and improves the atomicity of bulk loads. This version includes some new supporting features including improvements to hbck and an offline meta-rebuild disaster recovery mechanism.
The 0.90.5 release is backward compatible with 0.90.4. Many of the fixes in this release will be included as part of CDH3u3.
Acknowledgments
A special thanks to everyone who contributed to the release (reporting issues, fixing bugs, reviewing changes, writing documentation, etc).
Apache Whirr release 0.7.0 is now available. It includes changes covering over 50 issues, four of which were considered blockers. Whirr is a tool for quickly starting and managing clusters running on cloud services like Amazon EC2. This is the first Whirr release as a top level Apache project (previously releases were under the auspices of the Incubator). In addition to improving overall stability some of the highlights are described below:
Support for Apache Mahout as a deployable component is new in 0.7.0. Mahout is a scalable machine learning library implemented on top of Apache Hadoop.
Aparna Ramani is the Director of Engineering for Cloudera Enterprise.
Cloudera Manager 3.7, a major new version of Cloudera’s Management applications for Apache Hadoop, is now available. Cloudera Manager Free Edition is a free download, and the Enterprise edition of Cloudera Manager is available as part of the Cloudera Enterprise subscription.
Cloudera Manager 3.7 includes several new features and enhancements:
This blog was originally posted on the Apache Blog: https://blogs.apache.org/flume/entry/flume_ng_architecture
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. Flume is currently undergoing incubation at The Apache Software Foundation. More information on this project can be found at http://incubator.apache.org/flume. Flume NG is work related to new major revision of Flume and is the subject of this post.
Prior to entering the incubator, Flume saw incremental releases leading up to version 0.9.4. As Flume became adopted it became clear that certain design choices would need to be reworked in order to address problems reported in the field. The work necessary to make this change began a few months ago under the JIRA issue FLUME-728. This work currently resides on a separate branch by the name flume-728, and is informally referred to as Flume NG. At the time of writing this post Flume NG had gone through two internal milestones – NG Alpha 1, and NG Alpha 2 and a formal incubator release of Flume NG is in the works.
This guide is intended to be an introduction to Crunch.
Introduction
Crunch is used for processing data. Crunch builds on top of Apache Hadoop to provide a simpler interface for Java programmers to process data. In Crunch you create pipelines, not unlike Unix pipelines, such as the command below:
grep "ERROR" log | sort | uniq -c
San Francisco, Salesforce.com HQ - Recently there was an Apache HBase Pow-wow where project contributors gathered to discuss the directions of future releases of HBase in person. This group included a quorum of the core committers from Facebook, StumbleUpon, Salesforce, eBay, and Cloudera as well as many contributors and users from other companies. This was an open discussion, and in compliance with Apache Software Foundation policies, the agenda and detailed minutes were shared with the community at large so that everyone can chime in before any final decisions are made.
We summarize some of the high-level discussion topics:
- Overview
- Downloads
- Learn Hadoop
- Get Support
-
Blog
- Avro (11)
- Careers (10)
- CDH (29)
- Cloudera Manager (10)
- Cloudera's Service And Configuration Manager (6)
- Community (86)
- Connector (6)
- Data Collection (13)
- Distribution (34)
- Flume (6)
- General (237)
- Guest (35)
- Hadoop (146)
- HBase (40)
- HDFS (26)
- Hive (22)
- MapReduce (37)
- Oozie (4)
- Pig (15)
- Sqoop (9)
- Testing (5)
- Training (18)
- Use Case (11)
- Whirr (1)
- ZooKeeper (10)
- Archives by Month
Hadoop was created by 