Cloudera Tutorials

As of January 31, 2021, this tutorial references legacy products that no longer represent Cloudera’s current product offerings.


Security breaches happen. And when they do, your server logs may be your best line of defense. Hadoop takes server-log analysis to the next level by speeding and improving security forensics and providing a low cost platform to show compliance.




Server Log Data

Server logs are computer-generated log files that capture network and server operations data. They are useful for managing network operations, especially for security and regulatory compliance.

NASA Server Logs Dataset

The dataset we will use in this lab is NASA-HTTP, which contains HTTP requests made to the NASA Kennedy Space Center WWW server in Florida. The logs are ASCII files with one line per request, with the following fields:

  • host making the request. A hostname when possible, otherwise the Internet address if the name could not be looked up.
  • timestamp in the format "DAY MON DD HH:MM:SS YYYY", where DAY is the day of the week, MON is the name of the month, DD is the day of the month, HH:MM:SS is the time of day using a 24-hour clock, and YYYY is the year. The timezone is -0400.
  • request given in quotes.
  • HTTP reply code.
  • bytes in the reply.
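As a concrete illustration, the fields above can be extracted from a single log line with a regular expression. This is a minimal sketch assuming the distributed NASA-HTTP files follow the Apache Common Log Format (host, identity, user, bracketed timestamp, quoted request, status code, bytes); the sample line is representative of that format, not quoted from this tutorial.

```python
import re

# Regex for the access-log fields described above: host, bracketed
# timestamp, quoted request, HTTP reply code, and bytes in the reply.
LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)$'
)

def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    m = LOG_PATTERN.match(line)
    if not m:
        return None
    host, timestamp, request, status, size = m.groups()
    return {
        "host": host,
        "timestamp": timestamp,
        "request": request,
        "status": int(status),
        # '-' means no body was returned
        "bytes": 0 if size == "-" else int(size),
    }

sample = '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'
print(parse_log_line(sample))
```

Lines that do not match the pattern (e.g. corrupted entries) come back as `None`, so they can be counted or skipped during cleaning.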

What is Log Analysis?

Server log analysis is the evaluation of records generated by computers, networks, or other IT systems. Organizations leverage log analysis to mitigate a variety of risks and to meet compliance standards.

How Does Log Analysis Work?

Logs are made up of messages in chronological order, stored on disk as files or in an application such as a log collector. Data analysts are responsible for ensuring that the logs contain a range of messages and are interpreted according to context. Normalization is often performed to resolve minor inconsistencies between logs: one log file may contain WARN where another contains CRITICAL, even though in context they mean the same thing. Once the log data is collected and cleaned, it can be analyzed to detect patterns and anomalies such as network intrusions.
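As a small sketch of the normalization step described above, the mapping below folds differing severity labels (such as WARN versus CRITICAL) onto one common scale. The label names and groupings are illustrative assumptions, not part of any particular logging standard.

```python
# Hypothetical mapping from source-specific severity labels to a
# common scale, so logs from different systems can be compared.
LEVEL_MAP = {
    "WARN": "WARNING",
    "WARNING": "WARNING",
    "CRITICAL": "ERROR",
    "ERR": "ERROR",
    "ERROR": "ERROR",
    "INFO": "INFO",
}

def normalize_level(level):
    """Map a raw severity label to the common scale, case-insensitively."""
    return LEVEL_MAP.get(level.upper(), "UNKNOWN")

print(normalize_level("warn"))      # WARNING
print(normalize_level("CRITICAL"))  # ERROR
```

Unknown labels fall through to "UNKNOWN" rather than being silently dropped, so gaps in the mapping surface during analysis.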

Use Cases For Log Analysis

IT organizations use server log analysis to answer questions about:

Compliance – Large organizations are bound by regulations such as HIPAA and Sarbanes-Oxley. How can IT administrators prepare for system audits?

Security – If we suspect a security breach, how can we use server log data to identify and repair the vulnerability?

Troubleshooting – Debug system, computer, and network issues.

User Behavior – Understand more about your users.

Computer Forensics – Investigate systems in the event of a breach or other incident.

In this demo, we will focus on a network security use case. Specifically, we will look at how Apache Hadoop can help the administrator of a large enterprise network diagnose and respond to a distributed denial-of-service attack.

Best Practices For Log Analysis

1. Pattern Detection and Recognition

Detect anomalies by filtering messages against known patterns in the data.
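One simple form of pattern detection for the denial-of-service scenario in this series is counting requests per host and flagging hosts whose volume is far above the average. The threshold factor and host addresses below are illustrative assumptions.

```python
from collections import Counter

def flag_heavy_hitters(hosts, factor=3.0):
    """Return hosts whose request count exceeds factor x the mean count.

    `hosts` is one entry per request (e.g. the host field of each log line).
    """
    counts = Counter(hosts)
    mean = sum(counts.values()) / len(counts)
    return sorted(h for h, c in counts.items() if c > factor * mean)

# Hypothetical traffic: one host makes 50 requests, three make 1 each.
hosts = ["10.0.0.1"] * 50 + ["10.0.0.2", "10.0.0.3", "10.0.0.4"]
print(flag_heavy_hitters(hosts))  # ['10.0.0.1']
```

A fixed multiple of the mean is a crude baseline; real deployments would compare against historical traffic for the same time of day.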

2. Normalization

Establish a common format across log elements, such as converting every timestamp to the same format.
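For example, Common Log Format timestamps can be normalized to ISO 8601 in UTC. This sketch assumes timestamps shaped like `01/Jul/1995:00:00:01 -0400`, as in the NASA-HTTP files.

```python
from datetime import datetime, timezone

def to_iso_utc(ts):
    """Convert a Common Log Format timestamp to an ISO 8601 UTC string."""
    dt = datetime.strptime(ts, "%d/%b/%Y:%H:%M:%S %z")
    return dt.astimezone(timezone.utc).isoformat()

print(to_iso_utc("01/Jul/1995:00:00:01 -0400"))
# 1995-07-01T04:00:01+00:00
```

Converting everything to UTC before analysis avoids off-by-hours errors when correlating logs from servers in different time zones.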

3. Tagging and Classification

Tag log elements with keywords and categorize them into a set of classes, so you can filter on those classes and adjust how your data is displayed.
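A minimal sketch of tagging: the keyword-to-tag rules below are hypothetical, but they show how free-form messages can be classified into a small set of classes that you can then filter on.

```python
# Hypothetical keyword -> tag rules for classifying raw log messages.
RULES = {
    "login failed": "auth",
    "denied": "auth",
    "timeout": "network",
    "disk": "storage",
}

def tag_message(message):
    """Return the sorted tags matching a message, or ['untagged']."""
    text = message.lower()
    tags = {tag for kw, tag in RULES.items() if kw in text}
    return sorted(tags) or ["untagged"]

print(tag_message("Connection timeout while contacting server"))  # ['network']
print(tag_message("Login failed: access denied"))                 # ['auth']
```

Keeping an explicit "untagged" class makes it easy to review messages the rules do not yet cover and grow the rule set over time.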

4. Correlation Analysis

Correlation analysis helps discover connections between data that are not visible in any single log. In the event of a cyber attack, correlation analysis can assemble the messages relevant to that attack by putting together logs generated by servers, firewalls, network devices, and other sources. The data gathered from correlation analysis can also drive alerts when certain patterns occur in the logs.
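The idea can be sketched as a time-window join across sources. The firewall and server events below are hypothetical; an alert is raised when a firewall entry and a server 5xx error involve the same host within a short window.

```python
from datetime import datetime, timedelta

def correlate(firewall_events, server_events, window_seconds=60):
    """Return (host, firewall_time) alerts where a firewall event and a
    server 5xx error share a host within the time window.

    firewall_events: list of (datetime, host)
    server_events:   list of (datetime, host, http_status)
    """
    window = timedelta(seconds=window_seconds)
    alerts = []
    for fw_time, fw_host in firewall_events:
        for srv_time, srv_host, status in server_events:
            if (fw_host == srv_host and status >= 500
                    and abs(srv_time - fw_time) <= window):
                alerts.append((fw_host, fw_time))
    return alerts

t = datetime(1995, 7, 1, 0, 0, 0)
fw = [(t, "10.0.0.9")]
srv = [(t + timedelta(seconds=30), "10.0.0.9", 503)]
print(correlate(fw, srv))
```

The nested loop is fine for a sketch; at scale this join would be done on sorted or bucketed timestamps, which is exactly the kind of work Hadoop distributes well.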

5. Artificial Ignorance

Artificial ignorance is a machine learning process that ignores routine log messages that are not useful, so that unusual messages can be detected and flagged for investigation as potential anomalies.
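A toy version of artificial ignorance: learn which messages are routine from a baseline period, then flag anything not seen in the baseline. The baseline messages and the minimum-count threshold are illustrative assumptions.

```python
from collections import Counter

def build_baseline(messages, min_count=2):
    """Messages seen at least min_count times in the baseline are 'routine'."""
    counts = Counter(messages)
    return {m for m, c in counts.items() if c >= min_count}

def flag_unusual(messages, baseline):
    """Ignore routine messages; return the rest for investigation."""
    return [m for m in messages if m not in baseline]

# Hypothetical baseline period dominated by routine messages.
baseline_logs = ["heartbeat ok"] * 10 + ["rotating log file"] * 3
baseline = build_baseline(baseline_logs)

today = ["heartbeat ok", "kernel panic", "heartbeat ok"]
print(flag_unusual(today, baseline))  # ['kernel panic']
```

Real systems match on message templates (with variable parts such as IDs stripped) rather than exact strings, but the ignore-the-routine principle is the same.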


Congratulations! You are now familiar with server log data, what log analysis is and how it works, some use cases for server log analysis, and some best practices to apply while analyzing your server logs. Let's start building the server log analysis application by first setting up the development environment in the next tutorial.
