Building a Server Log Analysis Application
Overview
NOTICE
As of January 31, 2021, this tutorial references legacy products that no longer represent Cloudera’s current product offerings.
Please visit recommended tutorials:
- How to Create a CDP Private Cloud Base Development Cluster
- All Cloudera Data Platform (CDP) related tutorials
Introduction
Security breaches happen. And when they do, your server logs may be your best line of defense. Hadoop takes server-log analysis to the next level by speeding up and improving security forensics and by providing a low-cost platform to demonstrate compliance.
Prerequisites
- Read the overview of the tutorial series
Outline
- Server Log Data
- NASA Server Logs Dataset
- What is Log Analysis?
- How Does Log Analysis Work?
- Use Cases For Log Analysis
- Best Practices For Log Analysis
- Summary
- Further Reading
Server Log Data
Server logs are computer-generated log files that capture network and server operations data. They are useful for managing network operations, especially for security and regulatory compliance.
NASA Server Logs Dataset
The dataset we are going to use in this lab is NASA-HTTP, which contains HTTP requests to the NASA Kennedy Space Center WWW server in Florida. The logs are an ASCII file with one line per request, with the following fields:
- host making the request. A hostname when possible, otherwise the Internet address if the name could not be looked up.
- timestamp in the format "DAY MON DD HH:MM:SS YYYY", where DAY is the day of the week, MON is the name of the month, DD is the day of the month, HH:MM:SS is the time of day using a 24-hour clock, and YYYY is the year. The timezone is -0400.
- request given in quotes.
- HTTP reply code.
- bytes in the reply.
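The fields above can be parsed line by line. The sketch below is a minimal Python parser assuming the Common Log Format layout these logs use (host, identity fields, bracketed timestamp, quoted request, reply code, bytes); the sample line is illustrative, not copied from the dataset.

```python
import re

# Regex for one request line: host, two identity fields, bracketed
# timestamp, quoted request, HTTP reply code, and bytes in the reply.
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) '
    r'(?P<bytes>\d+|-)$'
)

def parse_line(line):
    """Parse one log line into a dict of fields, or None if malformed."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # A "-" in the bytes column means no payload was returned.
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    return fields

# Illustrative line in the dataset's format (hypothetical values).
sample = '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'
print(parse_line(sample))
```

Lines that do not match the pattern return `None`, so malformed records can be counted or skipped rather than crashing the job.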
What is Log Analysis?
Server log analysis is the evaluation of records generated by computers, networks or other IT systems. Organizations leverage log analysis to mitigate a variety of risks and to meet compliance standards.
How Does Log Analysis Work?
Logs are made up of messages in chronological order, stored on disk as files or by an application such as a log collector. Data analysts are responsible for ensuring that the logs contain the expected range of messages and are interpreted according to their context. Normalization is often performed to resolve minor inconsistencies between logs; for example, one log file may contain WARN while another contains CRITICAL, yet in context they may express the same idea. After the log data is collected and cleaned, it can be analyzed to detect patterns and anomalies such as network intrusions.
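The normalization step described above can be sketched as a simple lookup that maps source-specific severity labels onto one common vocabulary. The mapping below is hypothetical; the label sets are illustrative, not taken from any particular log format.

```python
# Hypothetical mapping from source-specific severity labels to a
# common vocabulary, so logs from different systems can be compared.
SEVERITY_MAP = {
    "WARN": "WARNING", "WARNING": "WARNING",
    "CRIT": "CRITICAL", "CRITICAL": "CRITICAL", "FATAL": "CRITICAL",
    "ERR": "ERROR", "ERROR": "ERROR",
    "INFO": "INFO", "DEBUG": "DEBUG",
}

def normalize_severity(label):
    """Map a raw severity label to the common vocabulary."""
    return SEVERITY_MAP.get(label.strip().upper(), "UNKNOWN")

print(normalize_severity("warn"))   # labels are case-insensitive
print(normalize_severity("FATAL"))
```

Unrecognized labels fall through to "UNKNOWN" rather than raising, so one odd log source does not break the pipeline.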
Use Cases For Log Analysis
IT organizations use server log analysis to answer questions about:
Compliance – Large organizations are bound by regulations such as HIPAA and Sarbanes-Oxley. How can IT administrators prepare for system audits? In this demo, we will focus on a network security use case. Specifically, we will look at how Apache Hadoop can help the administrator of a large enterprise network diagnose and respond to a distributed denial-of-service attack.
Security – For example, if we suspect a security breach, how can we use server log data to identify and repair the vulnerability?
Troubleshooting – Debugging system, computer and network issues
User Behavior – Understanding more about your users
Computer Forensics – Conducted in the event of an investigation
Best Practices For Log Analysis
1. Pattern Detection and Recognition
Detect anomalies by filtering messages against known patterns in the data
2. Normalization
Establish a common format between log elements, such as setting a timestamp to the same format
3. Tagging and Classification
Tag log elements with keywords and categorize them into a set of classes so you can filter and adjust how your data is displayed
4. Correlation Analysis
Helps discover connections between data that are not visible in a single log. In the event of a cyber attack, correlation analysis can find messages relevant to the attack by putting together logs generated by servers, firewalls, network devices and other sources. The data gathered from correlation analysis can also help generate alerts when certain patterns appear in the logs.
5. Artificial Ignorance
A machine learning process that ignores routine log messages that are not useful, allowing unusual messages to be detected and flagged for investigation as potential anomalies.
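As a concrete illustration of artificial ignorance, the sketch below uses a naive frequency heuristic (an assumption of this example, not a method prescribed by the tutorial): messages whose normalized form repeats are treated as routine noise, while rare ones are flagged for investigation.

```python
import re
from collections import Counter

def flag_unusual(messages, threshold=2):
    """Naive artificial-ignorance sketch: messages whose normalized
    form appears at least `threshold` times are ignored as routine;
    rarer messages are returned for investigation."""
    # Replace digit runs so lines differing only in IDs or
    # timestamps group together under one normalized form.
    def normalize(msg):
        return re.sub(r"\d+", "#", msg)

    counts = Counter(normalize(m) for m in messages)
    return [m for m in messages if counts[normalize(m)] < threshold]

logs = [
    "heartbeat ok seq=101",
    "heartbeat ok seq=102",
    "heartbeat ok seq=103",
    "disk failure on /dev/sda1",  # rare message, should be flagged
]
print(flag_unusual(logs))
```

Real systems typically learn the "routine" set over a training window instead of a single batch, but the ignore-the-common, surface-the-rare principle is the same.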
Summary
Congratulations! You are now familiar with server log data, log analysis, how it works, some of its use cases, and some suggested practices to apply while analyzing your server logs. Let's start building the server log analysis application by first setting up the development environment in the next tutorial.