Written by Patrick Wendell, an amazing summer intern with Cloudera and an Avro Committer.
In my summer internship project at Cloudera, I added RPC tracing as a first-order feature of Apache Avro. Avro is a platform for data storage and exchange that caters to data-intensive, dynamic applications. My project focused on Avro’s RPC functionality.
Excitement is building as Hadoop World nears and we are sitting down with some of our presenters to ask them a few questions regarding their presentations and how they are using Hadoop within their organizations. Here we speak with Philip Kromer, President of Infochimps, who answers questions regarding his presentation, how Hadoop is used in his business, and what he aims to get out of Hadoop World. Philip’s presentation at Hadoop World is about the development of a data marketplace and commoditization, and their chimpanzee-style approach to data processing. Attend Hadoop World October 12th in New York to hear more from Philip and to talk with him.
What can attendees expect to learn about Hadoop from your presentation at Hadoop World?
We’re now able to quantify aspects of human behavior never before accessible. Twitter, the News stream, the Smart Grid, are exquisite lab instruments for measuring ‘Conversation’, ‘Interest’, ‘Activity’. What’s more, with enough data, machine-learning algorithms and big data tools let us expose insight using only the *structure*, not the content, of the data. The massive quantity and connectivity required demand industrial-strength tools such as Hadoop.
We do *all* our data processing in high level tools (chiefly Pig and Wukong) — “black boxes with flexible glue”. We use ‘programmer fun’ + ‘programmer time’ as our primary development metrics. Together, writing simple loosely coupled scripts lets us run the fast experiment-driven design cycles that a lean startup demands. It has also let us grow our own talent and recruit outside CS (physicists, in particular, dream in map reduce). I think this approach should have strong appeal to small- and medium-sized businesses, or anyone looking for low barrier-to-adoption of Hadoop.
Do you have Hadoop in production use today?
That’s right, sign up for any of the training courses surrounding Hadoop World 2010, and receive a complimentary pass to the conference! There are seven different courses on offer, so whether you are new to Hadoop or looking to deepen your skills, you’ll find something to fit your needs.
If you are a manager trying to decide whether Hadoop is an appropriate technology for your organization, Hadoop Essentials for Managers will answer your questions. We will show you when using Hadoop is appropriate, what Hadoop is being used for in a range of industries, how Hadoop fits into your existing environment and what you need to know in order to deploy it within your organization.
Why not turn your Hadoop World trip into a multi-day Hadoop learning extravaganza by attending one of our two-day sessions? Both the developer and administrator training courses culminate in an exam which, when passed, confers Cloudera Certified Hadoop Developer or Administrator status.
Hadoop is increasingly being adopted by many Fortune 500 enterprises. Some of the speakers featured at Hadoop World this year include leading companies who have been able to create new value for their business using Hadoop. The presentations at Hadoop World are focused on how Hadoop is solving business problems for these enterprises. Below are three examples of leading enterprises that will present how Hadoop has impacted their businesses.
Linden Hillenbrand, Product Manager at GE, will be talking about how Hadoop has improved GE’s Marketing & Communications functions. One capability GE has implemented is assessing the external perception of GE (positive, neutral, or negative) through various marketing campaigns.
Apache Hadoop 0.21.0 was released on August 23, 2010. The last major release was 0.20.0 in April last year, so it’s not surprising that there are so many changes in this release, given the amount of activity in the Hadoop development community. In fact, there were over 1300 issues fixed in JIRA (Common, HDFS, MapReduce), the issue tracker used for Apache Hadoop development. Bear in mind that the 0.21.0 release, like all dot zero releases, isn’t suitable for production use.
With such a large delta from the last release, it is difficult to grasp the important new features and changes. This post is intended to give a high-level view of some of the more significant features introduced in the 0.21.0 release. Of course, it can’t hope to cover everything, so please consult the release notes (Common, HDFS, MapReduce) and the change logs (Common, HDFS, MapReduce) for the full details. Also, please let us know in the comments about any features, improvements, or bug fixes that you are excited about.
You can download Hadoop 0.21.0 from an Apache Mirror. Thanks to everyone who contributed to this release!
Fraud has multiple meanings and the term can be easily abused. The definition of fraud has undergone multiple changes throughout the years and is as elusive as fraud itself. The modern legal definition of fraud usually contains a few elements that have to be proven in court, and it depends on the state or country. For example, in California, the elements of fraud, which give rise to the fraud cause of action in the California Courts, are: (a) misrepresentation (false representation, concealment, or nondisclosure); (b) knowledge of falsity (or scienter); (c) intent to defraud, i.e., to induce reliance; (d) justifiable reliance; and (e) resulting damage. A more general definition may contain up to 9 elements.
From the statistical or technical perspective, fraud is a rare event that results in a significant financial impact to the organization.
Both definitions emphasize that the event is rare (assuming that most of the population is law-abiding citizens) and intentional (there is no “accidental” fraud), and both imply significant damage to the defrauded party (otherwise why bother). Fraud detection is difficult from a statistical point of view for exactly these reasons: (a) the events are rare, which makes it difficult to build a predictive model, and (b) fraud assumes a real human being behind it and incorporates elements of game theory, since the fraudster is often an insider who knows how to game the system.
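To make point (a) concrete, here is a minimal sketch of why rare events frustrate naive modeling. The data is hypothetical (10 fraudulent transactions out of 10,000), and the function name is ours, chosen only for illustration. A baseline that never flags anything looks almost perfect on accuracy while catching no fraud at all.

```python
def evaluate_always_legitimate(labels):
    """Score a baseline that predicts 'not fraud' for every transaction.

    labels: list of booleans, True meaning the transaction is fraudulent.
    Returns (accuracy, recall) for that baseline.
    """
    total = len(labels)
    fraud_count = sum(labels)
    # The baseline is right on every legitimate transaction...
    accuracy = (total - fraud_count) / total
    # ...and, since it never flags anything, it catches zero fraud cases.
    caught = 0
    recall = caught / fraud_count if fraud_count else 0.0
    return accuracy, recall


# Hypothetical data set: 10 fraud cases among 10,000 transactions.
labels = [True] * 10 + [False] * 9990
accuracy, recall = evaluate_always_legitimate(labels)
print(f"accuracy={accuracy:.3f} recall={recall:.1f}")
```

With a fraud rate this low, accuracy alone says very little, which is one reason measures such as recall on the fraud class tend to matter more when evaluating a detection model.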
Cloudera’s Hadoop Training and Certification for System Administrators has made it across the Atlantic to London for the first time! This two-day course covers planning, deploying, maintaining, monitoring, and troubleshooting your Hadoop cluster. We’ll talk about HDFS, MapReduce, Hive, Pig, HBase, Flume and more, from the System Administrator’s point of view. Take the certification exam at the end of your training and go home with a valuable validation of your Hadoop knowledge.
Enter the code “london_10pct” when registering and receive a 10% discount!
This post was contributed by Jonathan Seidman from Orbitz. Jonathan is a Lead Engineer on the Intelligent Marketplace/Machine Learning team at Orbitz Worldwide. You can hear more from Jonathan at Hadoop World October 12th in NYC.
Orbitz Worldwide (NYSE:OWW) is composed of a global portfolio of online consumer travel brands including Orbitz, Cheaptickets, The Away Network, ebookers and HotelClub. Additionally, the company operates business-to-business services: Orbitz Worldwide Distribution provides third parties such as Amtrak, Delta, LAN, KLM, Air France and a number of other leading airlines with hotel booking capabilities, and Orbitz for Business provides corporate travel services to a number of Fortune 100 clients. The Orbitz Worldwide sites process millions of searches and transactions every day, which not surprisingly results in hundreds of gigabytes of log data per day. Not all of that data necessarily has value, but much of it does. Unfortunately, storing and processing all of that data in our existing data warehouse infrastructure is impractical because of expense and space considerations.
Hadoop was selected to provide a solution to the problem of long-term storage and processing of these large quantities of unstructured and semi-structured data. We deployed our first Hadoop clusters in late 2009 running Cloudera’s Distribution for Hadoop (CDH), and in early 2010 deployed Hive to provide structure and SQL-like access to Hadoop data. In the short period of time since our initial deployment we’ve seen Hadoop rapidly adopted as a component in a wide range of applications across the organization due to its power, ease of use, and suitability for solving big data problems.
Our vision for Hadoop World is a conference where both newcomers and experienced Hadoop users can learn and be part of the growing Hadoop community.
We are also offering training sessions for newcomers and experienced Hadoop users alike. Whether you are looking for an Introduction to Hadoop, Hadoop Certification, or you want to learn more about related Hadoop projects, we have the training you are looking for.
Hadoop and HBase are gaining popularity due to their flexibility and the tremendous work that has been done to simplify their installation and use. This post provides guidance in sizing your first Hadoop/HBase cluster. First, there are significant differences in Hadoop and HBase usage. Hadoop MapReduce is primarily an analytic tool to run analytic and data extraction queries over all of your data, or at least a significant portion of them (data is a plural of datum). HBase is much better for real-time read/write/modify access to tabular data. Both applications are designed for high concurrency and large data sizes. For a general discussion of Hadoop/HBase architecture and differences, please refer to Cloudera, Inc. [https://wiki.cloudera.com/display/DOC/Hadoop+Installation+Documentation+for+Cloudera+Enterprise, http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3-b2-hbase], or Lars George’s blog [http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html]. We expect a new edition of Tom White’s Hadoop book [http://www.hadoopbook.com] and a new HBase book in the near future as well.
Hadoop was created by