Acquiring advanced data analytics capabilities built on top of your server infrastructure can help your organization understand the risks associated with operating its technology infrastructure.
Using analytics to understand your IT risks
Modern IT operations place enormous demands on monitoring and optimization management systems. Regulatory requirements demand rigorous protection of data stored in systems or in transit, as well as proper access control to systems and networks. Developing a real-time monitoring environment as networks and operations grow can be severely impeded by infrastructure monitoring applications that offer only out-of-the-box capabilities. Multi-factor authentication and ubiquitous security-oriented devices and sensors generate massive amounts of log data that is sensitive from a risk analysis angle, and its sheer volume can rapidly overwhelm even the most capable applications. Factoring in traditional security areas such as access servers, anti-virus, mail servers, message queues, networking devices, and now various security devices and sensors, traffic analyzers, access logs, etc. adds complexity and exposes the need for cost-prohibitive customizations.
The disparate and distributed nature of operational data makes effective monitoring of modern IT infrastructure seemingly impossible, given the demand for real-time monitoring, event correlation, historical analysis, pattern recognition, and integration of large volumes of semi-structured data.
What Web 2.0 companies are using
Hadoop/MapReduce brings very high-performance data analysis capabilities to very large data sets via a low-cost and highly scalable infrastructure of off-the-shelf servers. This paradigm presents significant benefits in an environment in which voluminous amounts of application-generated semi-structured data need to be analyzed in order to understand operational and technical risks for large Web 2.0 datacenters. The fundamental scalability limitations of relational database management systems have made querying large data sets for pattern discovery very inefficient. The Hadoop Distributed File System, for its part, has been designed specifically to take advantage of scalability:
Data storage scalability is achieved first by adding more storage capacity in the form of low-cost servers equipped with disks. In turn, data processing capability also increases with the addition of those same storage nodes.
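As a rough illustration of how storage and processing capacity grow together, the following sketch estimates cluster capacity from the node count. The function name, its parameters, and the hardware figures are hypothetical. The replication factor of 3 is the usual HDFS default, but it is configurable and should be treated as an assumption here.

```python
def cluster_capacity(nodes, disk_tb_per_node, cores_per_node, replication=3):
    """Estimate raw storage, usable storage and total cores for a cluster.

    All parameters are illustrative assumptions, not measured values.
    Usable storage is raw storage divided by the replication factor,
    since each block is stored `replication` times.
    """
    raw_tb = nodes * disk_tb_per_node
    return {
        "raw_tb": raw_tb,
        "usable_tb": raw_tb / replication,
        "cores": nodes * cores_per_node,
    }


# Doubling the number of nodes doubles both storage and processing capacity.
small = cluster_capacity(nodes=12, disk_tb_per_node=3, cores_per_node=8)
large = cluster_capacity(nodes=24, disk_tb_per_node=3, cores_per_node=8)
```

This simple model ignores overheads such as operating system space and temporary job data, so real usable capacity would be somewhat lower.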
Log processing with Hadoop and MapReduce
Modern server operating systems generate a large amount and a wide variety of logs. These logs report the status of systems, processes and activities, and their analysis can help discover IT-related operational problems. Such problems, however, can stem from a seemingly unrelated misconfiguration hidden somewhere down the technology chain. In larger datacenters, fault isolation can become a very difficult and persistent problem. Not only can locally generated logs be a nightmare to interpret, but correlating logs from various interdependent, connected systems also becomes a critical need.
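To make the idea of cross-system log correlation concrete, here is a minimal sketch that groups events from different hosts when they occur close together in time. The event tuple layout, the 30-second window, the host names and the messages are all assumptions made for illustration. A real deployment would run this kind of logic as a distributed job over far larger inputs.

```python
from datetime import datetime, timedelta


def correlate(events, window_seconds=30):
    """Group events that follow each other within `window_seconds`.

    events: iterable of (iso_timestamp, host, message) tuples (assumed layout).
    Returns only the groups that involve more than one host, since those
    are the candidates for cross-system fault correlation.
    """
    ordered = sorted(events, key=lambda e: datetime.fromisoformat(e[0]))
    window = timedelta(seconds=window_seconds)
    groups, current = [], []
    for event in ordered:
        ts = datetime.fromisoformat(event[0])
        # Start a new group when the gap to the previous event is too large.
        if current and ts - datetime.fromisoformat(current[-1][0]) > window:
            groups.append(current)
            current = []
        current.append(event)
    if current:
        groups.append(current)
    return [g for g in groups if len({host for _, host, _ in g}) > 1]


# Hypothetical events from four hosts.
SAMPLE_EVENTS = [
    ("2011-03-01T10:00:12", "web01", "upstream timeout"),
    ("2011-03-01T10:00:00", "fw01", "config reloaded"),
    ("2011-03-01T11:30:00", "mail01", "queue flushed"),
    ("2011-03-01T10:00:20", "db01", "connection refused"),
]

correlated = correlate(SAMPLE_EVENTS)
```

With the sample data, the firewall, web and database events fall into one group, while the isolated mail server event is left out.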
Workflow for log message processing:
Logs are aggregated and compressed, then stored in accordance with policy and with fault tolerance applied.
MapReduce jobs are applied to the stored logs.
Log messages are parsed for relevant event information in order to produce multiple indexes:
- system configuration and performance, viruses, connections, user access, spam filtering, load generated, user habits, mail transfer agents, geographical information, network performance
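The parsing and indexing steps of the workflow above can be sketched in plain Python as a map step, a shuffle step and a reduce step. This is a single-process illustration of the MapReduce pattern, not Hadoop code. The log line format, the index names and the sample lines are hypothetical.

```python
import re
from collections import defaultdict

# Assumed syslog-like format: "<timestamp> <host> <service>: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\S+) (?P<host>\S+) (?P<service>[\w\-/]+): (?P<msg>.*)$"
)


def map_line(line):
    """Map step: emit ((index_name, key), 1) pairs for one log line."""
    match = LINE_RE.match(line)
    if match is None:
        return  # skip lines that do not fit the assumed format
    yield ("events_by_host", match.group("host")), 1
    yield ("events_by_service", match.group("service")), 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one key."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce, returning one dict per index."""
    pairs = (pair for line in lines for pair in map_line(line))
    indexes = defaultdict(dict)
    for key, values in shuffle(pairs).items():
        (index_name, item), total = reduce_counts(key, values)
        indexes[index_name][item] = total
    return dict(indexes)


# Hypothetical log lines; the last one is deliberately malformed.
SAMPLE_LINES = [
    "2011-03-01T10:00:00 mail01 postfix/smtpd: connect from unknown",
    "2011-03-01T10:00:02 mail01 spamd: identified spam",
    "2011-03-01T10:00:05 web01 sshd: Failed password for root",
    "not a log line",
]

indexes = run_job(SAMPLE_LINES)
```

In a real cluster the map and reduce functions would run in parallel on many nodes, and the shuffle would be handled by the framework. The structure of the computation stays the same.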
Building an open source stack for large-scale data analysis
Hadoop Distributed File System
Hive, a data warehousing and structured querying environment
HBase, a scalable, highly distributed database for data storage
Pig Latin, a scripting language offering a high level of abstraction for generating MapReduce programs