Simplifying your Business Intelligence architecture.
HDFS (Hadoop Distributed File System) and MapReduce were built to simplify the otherwise very complex processing of massive, semi-structured data sets extracted from the web by crawlers. This model offers operational benefits and cost savings not found with a traditional data warehouse, which requires dedicated, expensive technology: dimensional modeling, ETL, data federation…
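To make the processing model concrete, here is a minimal single-machine sketch of the map, shuffle and reduce steps in plain Python. It is not Hadoop code: the record layout, the `PAGES` sample and the function names are assumptions made for illustration, and a real job would run the same steps in parallel across the cluster.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical crawler output: one semi-structured record per fetched page.
PAGES = [
    {"url": "http://example.com/a", "status": 200},
    {"url": "http://example.com/b", "status": 404},
    {"url": "http://example.org/", "status": 200},
    {"url": "not a url"},  # incomplete record, as crawled data often is
]

def map_page(record):
    """Map step: emit (host, 1) for every page fetched with status 200."""
    host = urlparse(record.get("url", "")).netloc
    if host and record.get("status") == 200:
        yield host, 1

def shuffle(pairs):
    """Shuffle step: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one key."""
    return key, sum(values)

def run_job(records):
    """Chain the three steps and return {host: successful page count}."""
    pairs = (pair for record in records for pair in map_page(record))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())
```

Calling `run_job(PAGES)` counts the successfully fetched pages per host and silently skips the malformed record, which is the kind of tolerance semi-structured input calls for.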
Building on reliability and real-time
A primary goal of a data warehouse environment is to provide management with the latest business-critical information, the kind likely to provide insight for making business decisions. The need for integrated decision-support systems is dictated by the rapidly changing pace of business operating conditions. Pressure from competitors and stakeholders for better results, amid a cascade of sometimes conflicting or incomplete information, can spell catastrophe. An effective data warehouse infrastructure must incorporate data that is timely and complete while adhering to real-time constraints.
Data warehousing & complexity.
Federating data consists of retrieving data from various operational systems, data stores and data marts and consolidating it for use in the data warehouse prior to business intelligence operations. Various heterogeneous systems are often deployed enterprise-wide to conduct business on behalf of organizational units with distinct lines of business. Adding to the complexity are the disparate technologies under which entities and even individual systems operate. The last hurdle to unifying business data from disparate systems is the differing data formats employed, which necessitate some form of data treatment prior to storage in the data warehouse. This brief analysis suggests a high level of technological complexity and cost before any return on investment is realized.
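The format problem described above can be illustrated with a small Python sketch that consolidates two differently formatted exports into one common schema. The source systems, field names and sample values are all hypothetical; the point is only that each source needs its own adapter before the data can be stored together.

```python
import csv
import io
import json

# Hypothetical exports from two operational systems with different formats.
CRM_CSV = "customer_id,amount_usd\n17,120.50\n42,80\n"
SHOP_JSON = '[{"cust": "42", "total_cents": 1999}, {"cust": "7", "total_cents": 500}]'

def from_crm(text):
    """Read the CSV export and map it to the common schema."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {
            "customer_id": int(row["customer_id"]),
            "amount": float(row["amount_usd"]),
            "source": "crm",
        }

def from_shop(text):
    """Read the JSON export; amounts arrive in cents and become dollars."""
    for item in json.loads(text):
        yield {
            "customer_id": int(item["cust"]),
            "amount": item["total_cents"] / 100,
            "source": "shop",
        }

def federate(crm_text=CRM_CSV, shop_text=SHOP_JSON):
    """Consolidate both sources into one list, ready for loading."""
    return list(from_crm(crm_text)) + list(from_shop(shop_text))
```

Every additional source system adds another adapter of this kind, which is where much of the integration cost comes from.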
A simplified fully Open Source stack.
Dimensional modeling, extraction, transformation and loading (ETL), data federation and real-time updates are the technology requirements that have made data warehousing a very costly proposition.
The use of an open source stack for data warehousing was pioneered by large, data-centered web companies relying on Hadoop MapReduce and its underlying distributed file system to process very large sets of semi-structured data in reasonable time. Because the stack is distributed across a server infrastructure and interacts natively with various file systems, including S3, it offers great deployment flexibility.
HBase is a distributed, column-oriented database initially bundled with Hadoop. Its scalability and high write performance make it the ideal real-time update component for our Hadoop-powered data warehouse environment. HDFS, used for batch processing (read operations) of large data sets, is well complemented by HBase for rapid storage (write operations) of voluminous amounts of data.
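The division of labor between fast single-row writes and batch reads can be pictured with a toy in-memory model. This is deliberately not the HBase client API: the `ToyColumnStore` class, its method names and the `sales:amount` column are assumptions for illustration, and the model ignores distribution, persistence and cell versioning.

```python
from collections import defaultdict

class ToyColumnStore:
    """In-memory stand-in for a column-oriented table (not the HBase API)."""

    def __init__(self):
        # row key -> {"family:qualifier" -> value}
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Write path: update one cell of one row as soon as an event arrives."""
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Point read: return a copy of the cells stored for one row."""
        return dict(self._rows.get(row_key, {}))

    def scan(self):
        """Batch path: iterate over every row in row-key order."""
        for key in sorted(self._rows):
            yield key, dict(self._rows[key])

def total_sales(store):
    """Batch-style aggregation over the whole table."""
    return sum(cells.get("sales:amount", 0) for _, cells in store.scan())
```

In this picture, operational events call `put` continuously, while reporting jobs such as `total_sales` sweep the full table on their own schedule.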
Hive provides a SQL-like data processing language that can be installed on top of HDFS. It gives analysts and developers a familiar querying environment for interacting with the data warehouse.
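To give a feel for what a familiar querying environment means in practice, the sketch below runs an ordinary aggregate query from Python. Note the substitution: it uses the standard library's `sqlite3` as a local stand-in, not Hive itself, and the `sales` table and its columns are hypothetical. The query text is plain SQL of the kind Hive's language is modeled on.

```python
import sqlite3

# Plain SQL aggregate; the table and column names are illustrative only.
QUERY = """
SELECT region, SUM(amount) AS revenue
FROM sales
GROUP BY region
ORDER BY revenue DESC
"""

def revenue_by_region(rows):
    """Load (region, amount) rows into a scratch table and run the query."""
    conn = sqlite3.connect(":memory:")  # stand-in engine, not Hive
    try:
        conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()
```

An analyst who can write this query needs no knowledge of the underlying map and reduce steps, which is the appeal of putting a SQL-like layer on top of the stack.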
This simplified approach can help your organization understand its data and operations in a very cost-effective way that leverages existing server architecture and involves only a modest learning curve. Cloud computing offers an option with no upfront investment for implementing this solution.