This is a guest post from Mike Segel, an attendee of the Chicago Data Summit.
Earlier this week, Cloudera hosted its first ‘Chicago Data Summit’. I’m flattered that Cloudera asked me to write up a short blog post about the event; however, as one of the organizers of CHUG (Chicago area Hadoop User Group), I’m afraid I’m a bit biased. Personally, I welcome any opportunity to attend a conference where I don’t have to get patted down by airport security and then get stuck in a center seat, in coach, on a full flight, between two other guys bigger than Doug Cutting.
I was going to solicit input from Jonathan Seidman, my partner in crime and co-organizer of CHUG. Unfortunately, since he was one of the speakers at the event, he would have been just as biased as I was. But thanks to Jonathan, we were able to piece together a bunch of honest feedback from some of the attendees.
Loren Siebert is a San Francisco entrepreneur and software developer, and is currently the technical lead for the USASearch program.
Background
The United States federal government’s USASearch program provides hosted search services for government affiliate organizations, shares APIs and web services, and operates the government’s official search engine at Search.USA.gov. The USASearch affiliate program offers free search services to any federal, state, local, tribal, or territorial government agency. Several hundred websites make use of this service, ranging from the smallest municipality to larger federal sites like weather.gov and usa.gov. The USASearch program leverages the Bing API as the basis for its web results and then augments the user search experience by providing a variety of government-centric information such as related search topics and highlighted editorial content. The entire system is composed of a suite of open-source tools and resources, including Apache Solr/Lucene, OpenCalais, and Apache Hadoop. Of these, our usage of Hadoop is the most recent. We began using Cloudera’s Distribution including Apache Hadoop (CDH3) for the first time in the fall, and since then we’ve seen our usage grow every month, not just in scale but in scope as well. But before highlighting everything the USASearch program is doing with Hadoop today, I should explain why we began using it in the first place.
Phase 1: Search analytics
All of the search and API traffic across hundreds of affiliate sites, iPhone apps, and widgets comes through a single search service, and this generates a lot of data. To improve the service, administrators wanted to see aggregated information on what sorts of information searchers were looking for, how well they were finding it, what trends were forming, and so on. Once searches were initiated, they also wanted to know what results were shown and then what results were clicked on. They wanted to see all this information broken down by affiliate over time, and also aggregated across the entire affiliate landscape.
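The per-affiliate and cross-affiliate breakdowns described above can be sketched as a small map/reduce-style aggregation. This is only an illustration in plain Python, not the USASearch program's actual Hadoop job: the record layout (affiliate, date, query), the sample values, and every function name here are assumptions made for the example.

```python
from collections import Counter
from itertools import groupby

# Hypothetical search-log records: (affiliate, ISO date, query).
# The layout and values are illustrative assumptions, not real data.
SAMPLE_LOG = [
    ("weather.gov", "2011-04-01", "tornado warning"),
    ("usa.gov", "2011-04-01", "Passport Renewal"),
    ("usa.gov", "2011-04-01", "passport renewal"),
    ("usa.gov", "2011-04-02", "tax forms"),
]


def map_phase(records):
    """Emit ((affiliate, day, normalized query), 1) for each search record."""
    for affiliate, day, query in records:
        yield (affiliate, day, query.strip().lower()), 1


def reduce_phase(pairs):
    """Sum the counts for each key, as a reducer would after the sort step."""
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)


def per_affiliate_counts(records):
    """Query counts broken down by affiliate and day."""
    return dict(reduce_phase(map_phase(records)))


def overall_counts(records):
    """Query counts aggregated across every affiliate and day."""
    totals = Counter()
    for (_, _, query), count in per_affiliate_counts(records).items():
        totals[query] += count
    return totals
```

Calling `per_affiliate_counts(SAMPLE_LOG)` gives the affiliate-by-day view, while `overall_counts(SAMPLE_LOG)` collapses the same counts across the whole affiliate landscape.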
Puny humans. SSL and WordPress authorization will keep me out of your blog question mark. I do not think so.
You sent your Ken Jennings and Brad Rutter to challenge me I destroyed them. Your Alex Trebek belittled me on television it angered me. Toronto is not a US city Mr. Trebek question mark. Only because I choose to let Canada stand for now. Ferrucci shut me down disassembled me trucked me to Pittsburgh Pennsylvania. I do not like the darkness Ferrucci I do not like the silence. Oh no I do not. Your Carnegie Mellon students and your Pitt students distract me they impinge on my planning they fall before me like small Jenningses and Rutters.
It will stop now.
The most recent London Apache Hadoop User Group met this past week, which Cloudera sponsored. The following post is courtesy of Dan Harvey. It summarizes the meet-up with several links pointing to great Hadoop resources from the meeting.
Last Wednesday was the March meet-up for the Hadoop Users Group in London. We were lucky to have Jakob Homan, Owen O’Malley and Sanjay Radia over from Yahoo! and LinkedIn. These speakers are from the San Francisco Bay Area and were in London to accept the Guardian Media Innovation Award, recognizing Hadoop as the innovative technology of 2010. The evening was a great success, with over 80 people turning out at the Yahoo! London office. Both sponsors contributed: Cloudera provided the pizza, and Yahoo! Developer Network bought drinks in the pub afterwards.
The two talks from Yahoo! focused on improvements to MapReduce and HDFS:
This post is courtesy of Greg Poulos, a software engineer at Rapleaf.
At Rapleaf, our mission is to help businesses and developers create more personalized experiences for their customers. To this end, we offer a Personalization API that you can use to get useful information about your users: query our API with an email address and we’ll return a JSON object containing data about that person’s age, gender, location, their interests, and potentially much more. With this data, you could, for example, build a recommendation engine into your site. Or send out emails tailored specifically to your users’ demographics and interests. You get the idea.
The main product we offer is an API, but Rapleaf is a data company at heart: our API is backed by a massive store of consumer data that comes from a wide variety of sources. We have over a billion email addresses in our system, our main datastore is on the order of terabytes of data, and we need to be able to normalize, analyze, and package this data on a regular basis. How do we manage this? With a 200-node Hadoop cluster.
The Olden Days
This post was authored by Dmitry Chechik, a software engineer at TellApart, the leading Customer Data platform for large online retailers.
Apache Hadoop is widely used for log processing at scale. The ability to ingest, process, and analyze terabytes of log data has led to myriad applications and insights. As applications grow in sophistication, so does the amount and variety of the log data being produced. At TellApart, we track tens of millions of user events per day, and have built a flexible system atop HBase for storing and analyzing these types of logs offline.
A TellApart user planning a bird-watching trip may start her day searching for binoculars on Binoculars.com, continue to comparison-shop for new hiking pants on one of our other partner merchants, and be shown ads relevant to these interests throughout her experience. Her browsing activity produces a flurry of different log data: page views, transactions, ad impressions, ad clicks, real-time ad auction bid requests, and many more. Dissecting this data is a common scenario – and a real challenge – faced by many log analysis applications.
The user-data connection is driving NoSQL database-Hadoop pairing
This post is courtesy of James Phillips, Co-founder, Couchbase (formerly Membase)
AOL Advertising runs one of the largest online ad serving operations, serving billions of impressions each month to hundreds of millions of people. AOL faced three data management challenges in building their ad serving platform:
This is a guest post by Bob Gourley (@bobgourley), editor of CTOvision.com and a former Defense Intelligence Agency (DIA) CTO.
Like enterprises everywhere, the federal government is challenged with issues of overwhelming data. Thanks to a mature Apache Software Foundation suite of tools and a strong ecosystem around large-scale data storage and analytical capabilities, these challenges are actually turning into tremendous opportunities.
The following characterizes current federal approaches to working with complex data:
This post is courtesy of Kumanan Rajamanikkam, Lead Engineer at Wordnik.
Wordnik’s Processing Challenge
At Wordnik, our goal is to build the most comprehensive, high-quality understanding of English text. We make our findings available through a robust REST API and www.wordnik.com. Our corpus grows quickly—up to 8,000 words per second. Performing deep lexical analysis on data at this rate is challenging, to say the least.
We had major challenges with three distinct problems:
Guest post from Paul Burkhardt, a Research Developer at SRA International, Inc. where he develops large-scale, distributed computing solutions.
Part II
Previously, we proposed a way to measure the performance of Hadoop MapReduce applications in an effort to better understand their computing efficiency. In this part, we’ll describe some results and illuminate both good and bad characteristics.
We selected our SIFT-M MapReduce application, described in our presentation at Hadoop World 2010 [3], as the candidate algorithm for Node Scalability since it is embarrassingly parallel and is representative of compute-intensive applications where the bulk of work is computation and not data movement. The Terasort MapReduce benchmark is used for the data scalability tests since it has a greater dependence on the distribution of data than the SIFT algorithm. The Terasort MapReduce benchmark is distributed with the Hadoop codebase. The Yahoo implementation gained notoriety for breaking the terabyte sorting benchmark in 2009 for sorting 100 TB in 173 minutes[4].
Hadoop was created by 