Example: Sentiment Analysis Using MapReduce Custom Counters

MapReduce jobs report results using a wide array of built-in counters. You can add your own counters and analyze the results in your applications. One use for custom counters is sentiment analysis.

To perform a basic sentiment analysis, you count up the positive words and negative words in a data set. Divide the difference by the sum to calculate an overall sentiment score.

sentiment = (positive - negative) / (positive + negative)
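As a worked illustration of the formula above, the following minimal sketch computes the score from two counts. The class name SentimentScore and the choice to treat an input with no matched words as neutral (0.0) are assumptions made for this illustration; they are not part of the sample application.

```java
// Hypothetical helper, not part of sentimentAnalysis.tar.gz.
final class SentimentScore {

    // Returns (positive - negative) / (positive + negative), a value in [-1, 1].
    static double sentiment(long positive, long negative) {
        long total = positive + negative;
        if (total == 0) {
            // Assumption: no matched words at all is reported as neutral.
            return 0.0;
        }
        return (double) (positive - negative) / total;
    }

    public static void main(String[] args) {
        // Three positive words and one negative word: (3 - 1) / (3 + 1)
        System.out.println(sentiment(3, 1)); // prints 0.5
    }
}
```

A score near 1 means almost every matched word was positive, a score near -1 means almost every matched word was negative, and a score near 0 means the two lists matched about equally often.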

See Creating Your First Sentiment Analysis Application.

Modularizing the Application

In this example, Map and Reduce are compiled as separate classes to keep the code samples short and easier to read. MrManager handles the setup, defines the Job, and provides the main method.

All of the source is provided in sentimentAnalysis.tar.gz, which contains:
  • makefile
  • Map.java
  • MrManager.java
  • Reduce.java
  • neg-words.txt
  • pos-words.txt
  • stop-words.txt
  • /shakespeare
    • comedies
    • histories
    • poems
    • tragedies

Application Highlights

In this example, most of the work is done in the Map class. It begins with an enumeration used by the custom counters to store the number of positive and negative words from the input source.

enum Gauge{POSITIVE, NEGATIVE}
The setup method processes the command line arguments and the cached files. If the -skip option is set (the mrmanager.skip.patterns property is true), it calls parseSkipFile. It then calls the parsePositive and parseNegative methods to populate the hash sets used to identify words in their respective lists.
URI[] localPaths = context.getCacheFiles();
int uriCount = 0;
if (config.getBoolean("mrmanager.skip.patterns", false))
{
  parseSkipFile(localPaths[uriCount++]);
}
parsePositive(localPaths[uriCount++]);
parseNegative(localPaths[uriCount]);
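The index arithmetic in setup implies that the cached files arrive in a fixed order: the optional skip file first, then the positive list, then the negative list. The following standalone sketch mirrors that logic so the ordering can be checked without a cluster. The class CacheFileRoles, its assign method, and the length check are hypothetical additions for illustration only.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the ordering that setup relies on.
final class CacheFileRoles {

    // Assigns a role to each cached file using the same index logic as setup.
    static Map<String, URI> assign(URI[] cached, boolean skipEnabled) {
        int expected = skipEnabled ? 3 : 2;
        if (cached == null || cached.length < expected) {
            // Assumption: fail fast instead of letting an index error surface later.
            throw new IllegalArgumentException("Expected " + expected + " cached files");
        }
        Map<String, URI> roles = new LinkedHashMap<>();
        int index = 0;
        if (skipEnabled) {
            roles.put("skip", cached[index++]);
        }
        roles.put("positive", cached[index++]);
        roles.put("negative", cached[index]);
        return roles;
    }

    public static void main(String[] args) {
        URI[] cached = {
            URI.create("stop-words.txt"),
            URI.create("pos-words.txt"),
            URI.create("neg-words.txt")
        };
        System.out.println(assign(cached, true).get("positive")); // prints pos-words.txt
    }
}
```

If the files were added to the cache in a different order, the positive and negative lists would be swapped silently, which would flip the sign of the sentiment score.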

The parsePositive method cycles through the list of positive terms and creates an entry for each word. The parseNegative method does the same with the negative terms.

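private void parsePositive(URI goodWordsUri) {
  // Open the cached file by its base name.
  String fileName = new File(goodWordsUri.getPath()).getName();
  // try-with-resources closes the reader even if reading fails.
  try (BufferedReader reader = new BufferedReader(new FileReader(fileName))) {
    String goodWord;
    while ((goodWord = reader.readLine()) != null) {
      goodWords.add(goodWord);
    }
  } catch (IOException ioe) {
    // Report the file that failed, not the contents of the word set.
    System.err.println("Caught exception parsing cached file '"
      + goodWordsUri + "' : " + StringUtils.stringifyException(ioe));
  }
}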
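Because parsePositive and parseNegative do the same work on different files, the logic could be shared. The sketch below is one possible shape for such a helper, written against the standard library only so that it can be tried outside Hadoop. The class WordListLoader and its methods are hypothetical, and trimming lines and skipping blank ones are assumptions that go beyond what the sample code does.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical shared loader for the positive and negative word lists.
final class WordListLoader {

    // Reads one word per line; trims whitespace and ignores blank lines (assumption).
    static Set<String> read(Reader source) {
        return new BufferedReader(source).lines()
            .map(String::trim)
            .filter(word -> !word.isEmpty())
            .collect(Collectors.toCollection(HashSet::new));
    }

    // Loads a word list from a local file and closes the file when done.
    static Set<String> load(Path file) {
        try (Reader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            return read(reader);
        } catch (IOException ioe) {
            throw new UncheckedIOException("Could not read word list " + file, ioe);
        }
    }

    public static void main(String[] args) {
        Set<String> words = read(new StringReader("good\n\n happy \n"));
        System.out.println(words.contains("happy")); // prints true
    }
}
```

With a helper like this, each parse method would reduce to a single call that fills the corresponding set.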
The map method still performs the word count from earlier examples. In addition, it now increments one of two custom counters whenever a word appears in the positive or negative list.
for (String word : WORD_BOUNDARY.split(line))
{
  if (word.isEmpty() || patternsToSkip.contains(word)) {
    continue;
  }
  // Count instances of each (non-skipped) word.
  currentWord = new Text(word);
  context.write(currentWord, one);

  // Filter and count "good" words.
  if (goodWords.contains(word)) {
    context.getCounter(Gauge.POSITIVE).increment(1);
  }

  // Filter and count "bad" words.
  if (badWords.contains(word)) {
    context.getCounter(Gauge.NEGATIVE).increment(1);
  }
}
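To see how the counter logic in this loop behaves on a single line of text, the following sketch reproduces it without Hadoop, using an EnumMap in place of the job counters. The class GaugeTally is hypothetical, and the WORD_BOUNDARY pattern shown here is an assumption; the actual pattern is defined in Map.java.

```java
import java.util.Arrays;
import java.util.EnumMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical local stand-in for the counter logic in the map method.
final class GaugeTally {

    enum Gauge { POSITIVE, NEGATIVE }

    // Assumption: split on word boundaries and surrounding whitespace.
    private static final Pattern WORD_BOUNDARY = Pattern.compile("\\s*\\b\\s*");

    // Counts positive and negative words in one line, ignoring skipped words.
    static Map<Gauge, Long> tally(String line, Set<String> skip,
                                  Set<String> good, Set<String> bad) {
        Map<Gauge, Long> counters = new EnumMap<>(Gauge.class);
        counters.put(Gauge.POSITIVE, 0L);
        counters.put(Gauge.NEGATIVE, 0L);
        for (String word : WORD_BOUNDARY.split(line)) {
            if (word.isEmpty() || skip.contains(word)) {
                continue;
            }
            if (good.contains(word)) {
                counters.merge(Gauge.POSITIVE, 1L, Long::sum);
            }
            if (bad.contains(word)) {
                counters.merge(Gauge.NEGATIVE, 1L, Long::sum);
            }
        }
        return counters;
    }

    public static void main(String[] args) {
        Set<String> skip = new HashSet<>(Arrays.asList("a", "and"));
        Set<String> good = new HashSet<>(Arrays.asList("good"));
        Set<String> bad = new HashSet<>(Arrays.asList("bad"));
        System.out.println(tally("a good day and a bad night but good company",
            skip, good, bad)); // prints {POSITIVE=2, NEGATIVE=1}
    }
}
```

Note that the comparison is case-sensitive in this sketch, so "Good" would not match an entry of "good".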

The Reduce method assembles the results and returns them to MrManager.

Instead of returning the job's exit status immediately, MrManager stores it in a variable. This gives you a chance to read the counters and write to the console (or perform any other processing in Java) before the program ends.
int result = job.waitForCompletion(true) ? 0 : 1;

/*
 *  Work with the results before returning control to the main method.
 * 
 */ 

// Get the counters from the Map class.
Counters counters = job.getCounters();
float good = counters.findCounter("org.myorg.Map$Gauge", "POSITIVE").getValue();
float bad = counters.findCounter("org.myorg.Map$Gauge", "NEGATIVE").getValue();

// Calculate the basic sentiment score by dividing the difference of good and
// bad words by their sum.
float sentiment = ((good - bad) / (good + bad));

// Calculate the positivity score by dividing good results by the sum of
// good and bad results. Multiply by 100 and round off to get a percentage.
// Results 50% and above are more positive, overall.
float positivity = (good / (good + bad))*100;
int positivityScore = Math.round(positivity);

// Display the results in the console.
System.out.println("\n\n\n**********\n\n\n");
System.out.println("Sentiment score = (" + good + " - " + bad + ") / (" + good +
  " + " + bad + ")");
System.out.println("Sentiment score = " + sentiment);
System.out.println("\n\n");
System.out.println("Positivity score = " + good + "/(" + good + "+" + bad + ")");
System.out.println("Positivity score = " + positivityScore + "%");
System.out.println("\n\n\n********** \n\n\n\n");

/*
 *
 * Return and finish.
 *
 */

return result;

Running the Sentiment Analysis Application

These are the steps for building and running the Sentiment Analysis example.

  1. Download and expand sentimentAnalysis.tar.gz.

    The makefile provides a number of convenient utility commands. If you are using a parcel installation, delete or comment out the compile_ commands for packages, then un-comment the commands for parcel-based compilation.

  2. Open a terminal window and go to the expanded /sentimentAnalysis directory.
  3. Enter make run. This sets off a chain of events.
    1. Cleans up by deleting results of previous invocations of the make run command.
    2. Creates HDFS directories for input and output.
    3. Copies the works of Shakespeare to the input directory.
    4. Creates and uploads three short poems about Hadoop.
    5. Copies stop-words.txt, pos-words.txt, and neg-words.txt to HDFS.
    6. Compiles and jars Map, Reduce, and MrManager.
    7. Runs the application.

MrManager appends the results of your custom counters to the end of the list of standard counters. The sentiment and positivity scores appear after the standard output.

A positivity score of 50% or higher indicates that the words from the input tend to be mostly positive. In the case of Shakespeare, the score falls just one percentage point short of that threshold. However, this is a rudimentary example of sentiment analysis. If the content includes a phrase such as "not unhappy," it counts as two negative words, even though the overall intent of the phrase is positive. More robust sentiment analysis algorithms exist, but they are beyond the scope of this example.


org.myorg.Map$Gauge
  NEGATIVE=42163
  POSITIVE=41184
**********



Sentiment score = (41184.0 - 42163.0) / (41184.0 + 42163.0)
Sentiment score = -0.011746074



Positivity score = 41184.0/(41184.0+42163.0)
Positivity score = 49%



********** 
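The figures in the sample output above can be re-derived from the two counter values. The sketch below does so and also shows that the positivity percentage is a rescaling of the sentiment score, since good / (good + bad) equals (sentiment + 1) / 2. The class PositivityCheck and its method names are hypothetical, and reporting 50% when no words match is an assumption.

```java
// Hypothetical check of the reported scores, not part of MrManager.
final class PositivityCheck {

    // Percentage of matched words that were positive, rounded to a whole number.
    static int positivityPercent(long good, long bad) {
        long total = good + bad;
        if (total == 0) {
            // Assumption: no matched words is reported as an even split.
            return 50;
        }
        return (int) Math.round(100.0 * good / total);
    }

    // Same percentage, derived from a sentiment score in [-1, 1].
    static int positivityFromSentiment(double sentiment) {
        return (int) Math.round(50.0 * (sentiment + 1.0));
    }

    public static void main(String[] args) {
        long good = 41184;
        long bad = 42163;
        double sentiment = (double) (good - bad) / (good + bad);
        System.out.println(positivityPercent(good, bad));       // prints 49
        System.out.println(positivityFromSentiment(sentiment)); // prints 49
    }
}
```

Because the two scores carry the same information, a sentiment score of 0 always corresponds to a positivity score of 50%.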

See the makefile for command line options and variations for compiling and running the application.