Big Data – Processing Large Volumes of Data
In the last blog, "Reporting Strategy for OLTP Systems", Vishnu discussed a reporting strategy for OLTP systems using an RDBMS and ETL. Some of us wondered whether, given the rapid growth of data, an RDBMS-and-ETL approach would remain viable. In this context, the world is moving towards big data for both structured and unstructured data. As data grows, other technologies such as columnar databases are emerging to provide faster, more effective information and keep pace with the market. Let's discuss some of these technologies in this blog.
Why Big Data
Big Data with MapReduce: data today is varied, it's growing, it's moving fast, and it's very much in need of smart management. Data, cloud and engagement are energizing organizations across multiple industries and present an enormous opportunity to make organizations more agile, more efficient and more competitive. Millions and billions of data records are captured by different organizations every day, and in order to capture that opportunity, organizations require a modern information management architecture. Managing growing data and building analytics capabilities are now requirements in every industry. Yet because data is growing exponentially, many organizations struggle to decide which technologies fit their requirements.
Here is a basic comparison of the two technologies.
In order to understand the capabilities of analytics and information management, we need to understand a few things:
Data Management & Warehouse:
Data management and warehousing have achieved industry-leading database performance across multiple workloads while lowering administration, storage, development and server costs. They deliver extreme speed with capabilities optimized for analytics workloads, such as deep analytics, and benefit from workload-optimized systems that can be up and running in hours. This started quite a long time back, and today most large telcos, banks and retailers use it.
Data in Motion Computing (Stream Computing)
Stream computing delivers real-time analytic processing on constantly changing data in motion. It enables descriptive and predictive analytics to support real-time decisions and forecasting, leveraging the power of the business. Stream computing allows everyone to capture and analyze all data, all the time, just in time. Data is all around us, from social media feeds to call data records to videos, but it can be difficult to leverage. Sometimes there is simply too much data to collect and store before analyzing it. Sometimes it is an issue of timing: by the time you store data, analyze it, and respond, it is too late.
- Stream computing changes where, when and how much data you can analyze. Store less, analyze more, and make better decisions, faster.
- Efficiently deliver real-time analytic processing on constantly changing data in motion, enabling descriptive and predictive analytics that support real-time decisions. Capture and analyze all data, all the time, just in time.
- Data processed at run time is generally more effective. I remember the first time we started analyzing CIP (call in progress) data with an online probe to detect fraud and minimize losses in the telecom industry. In a way, that is when the industry started thinking about "data in motion".
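To make the "data in motion" idea concrete, here is a minimal sketch (my own illustration, not any vendor's product) of analyzing a stream as it arrives: a sliding window keeps only recent values, so anomalies such as an unusually long call can be flagged immediately instead of after the data is stored.

```python
from collections import deque

def sliding_window_average(stream, window_size=3):
    """Yield the rolling average of the last `window_size` values
    from a (potentially endless) stream, one result per input value."""
    window = deque(maxlen=window_size)
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)

# Hypothetical call durations in seconds; flag spikes above the recent average.
durations = [60, 62, 61, 300, 63, 59]
for duration, avg in zip(durations, sliding_window_average(durations)):
    if duration > 2 * avg:
        print(f"possible anomaly: {duration}s vs rolling avg {avg:.1f}s")
```

The point is that only the small window is stored, which is exactly the "store less, analyze more" trade-off described above.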
Content Management
Content management enables a comprehensive content lifecycle and document management with cost-effective control of existing and new types of content, with scale, security and stability. The better you know your customers, the more securely and faster the business grows. The traditional way of storage is no longer effective for managing the comprehensive content management lifecycle. Today, a combination of technologies such as Hadoop, stream computing and a DMS works together effectively for document handling and real-time analysis. In banks, telcos and other financial industries, where documents are assets, security cannot be handled by a DMS in a silo; it has to be integrated with the other running systems.
Let's discuss one well-known, established framework in detail.
Hadoop is a fairly new and effective open-source software framework (started in 2005, with adoption picking up around 2010) that brings the power of Apache Hadoop to the enterprise with application accelerators, analytics, visualization, development tools, performance and security features. The Apache Hadoop software library is a framework that allows distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.
The Hadoop framework transparently provides applications with both reliability and data motion. Hadoop implements a computational paradigm named MapReduce, in which an application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. Along with this, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and the distributed file system are designed so that the framework automatically handles node failures.
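The map/reduce split described above can be sketched with the classic word-count example. This is a toy single-process illustration of the paradigm, not Hadoop's actual Java API: each document fragment goes through a map step, the outputs are grouped by key (the "shuffle"), and a reduce step aggregates each group.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one input fragment.
    return [(word, 1) for word in document.split()]

def shuffle(mapped):
    # Shuffle: group all emitted values by key across all map outputs.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped):
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the list of values for each key.
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["big data big cluster", "data in motion"]
counts = reduce_phase(shuffle([map_phase(d) for d in documents]))
print(counts)  # {'big': 2, 'data': 2, 'cluster': 1, 'in': 1, 'motion': 1}
```

In real Hadoop, each `map_phase` call would run on the node holding that fragment, and the shuffle would move data across the network, which is why the framework can re-execute any single fragment on another node after a failure.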
Hadoop’s Distributed File System
Hadoop's Distributed File System is designed to reliably store very large files across machines in a large cluster. It is inspired by the Google File System (GFS). HDFS stores every file as a sequence of data blocks; all blocks in a file except the last are the same size. Blocks belonging to a file are replicated for fault tolerance, and the block size and replication factor are configurable per file. IBM uses a similar concept in designing the fail-safe storage of its XIV system. Files in HDFS are "write once" and have strictly one writer at any time. The goals of HDFS are to store large datasets, to be fail-safe against hardware failures, and to emphasize streaming data access (data in motion).
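The block-and-replica model above can be illustrated with a small sketch. The 128 MB default block size is HDFS's, but the round-robin replica placement here is a deliberate simplification (real HDFS placement is rack-aware):

```python
def split_into_blocks(file_size, block_size=128 * 1024**2):
    """Split a file into equal-size blocks; only the last may be smaller."""
    full, remainder = divmod(file_size, block_size)
    return [block_size] * full + ([remainder] if remainder else [])

def place_replicas(num_blocks, nodes, replication=3):
    """Naive round-robin placement of each block's replicas on distinct
    nodes. Illustration only: real HDFS placement is rack-aware."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(300 * 1024**2)   # a 300 MB file
print([b // 1024**2 for b in blocks])       # [128, 128, 44]
print(place_replicas(len(blocks), ["n1", "n2", "n3", "n4"]))
```

With three replicas per block, any single node can fail and every block of the file remains readable from the surviving copies.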
The Hadoop MapReduce framework couples a cluster of machines and executes user-defined MapReduce jobs across the nodes in the cluster. A MapReduce computation has two phases, a map phase and a reduce phase; the input to the computation is a data set of key-value pairs. Tasks in each phase are executed in a fault-tolerant manner: if nodes fail in the middle of a computation, the tasks assigned to them are redistributed among the remaining nodes. Having many map and reduce tasks enables good load balancing and allows failed tasks to be re-run with little runtime overhead. MapReduce's goals are to process large data sets with high throughput while remaining fail-safe against hardware failures, a goal it shares with HDFS.
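The redistribution of tasks from failed nodes can be sketched as a toy scheduler. This is my own illustration of the fault-tolerance idea, not Hadoop's JobTracker logic: when a node "fails" mid-task, the task is simply re-run on another node.

```python
def run_tasks(tasks, nodes, worker, max_retries=2):
    """Run each task on some node; if the node 'fails', reassign the task
    to another node -- a toy model of MapReduce's fault tolerance."""
    results = {}
    for i, task in enumerate(tasks):
        for attempt in range(max_retries + 1):
            node = nodes[(i + attempt) % len(nodes)]
            try:
                results[task] = worker(node, task)
                break
            except RuntimeError:
                continue  # node failed mid-task: re-run elsewhere
        else:
            raise RuntimeError(f"task {task!r} failed on every attempt")
    return results

# A worker where one (hypothetical) node is down; its tasks migrate.
def flaky_worker(node, task):
    if node == "n2":
        raise RuntimeError("node n2 is down")
    return f"{task} done on {node}"

print(run_tasks(["map-0", "map-1", "map-2"], ["n1", "n2", "n3"], flaky_worker))
```

Because each task is a small, independent fragment, losing a node costs only the re-execution of that node's tasks, not the whole job, which is the load-balancing benefit described above.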
Nowadays, many data-analysis firms send data in smaller chunks to user nodes to use their compute power, and pay those users well for it.
Below is the architecture of a Hadoop system in common telco/finance/retail industry scenarios.
Looking at the above architecture, it is a blend of both a traditional RDBMS and a Hadoop system. When both technologies work hand in hand, we get a faster, online, reliable, scalable and dependable system that can provide information for the business in no time, allowing us to serve the business and the end consumer better.