In the last blog we discussed Hadoop with HDFS and MapReduce. One of my teammates asked me to also discuss IBM GPFS in the Big Data context. Let's understand the difference between HDFS and GPFS and how each can be used in different scenarios. The Hadoop Distributed File System (HDFS) is considered a core component of Hadoop, but it's not an essential one. IBM has been talking up the benefits of hooking Hadoop up to the General Parallel File System (GPFS), and has done the work of integrating GPFS with Hadoop.

What is GPFS

IBM developed GPFS (General Parallel File System) in 1998 as a SAN file system for use in HPC applications and IBM's biggest supercomputers, such as Blue Gene, ASCI Purple, Watson, Sequoia, and Mira. In 2009, IBM hooked GPFS to Hadoop. Today IBM runs GPFS on InfoSphere BigInsights, where it scales into the petabyte range and offers more advanced data management capabilities than HDFS.

GPFS is basically a storage file system, developed as a SAN file system. Being a storage system, it cannot be attached directly to the nodes that make up a Hadoop cluster. This is where IBM FPO (File Placement Optimizer) comes into the picture and bridges the gap. FPO essentially emulates a key feature of HDFS: moving the workload to the data. In other words, it moves the job to the data instead of moving the data to the job. As I mentioned in my previous blog, processing is faster when the job runs near the data. GPFS is POSIX compliant, which enables any other application running on top of the Hadoop cluster to access data stored in the file system in a straightforward manner. With HDFS, only Hadoop applications can access the data, and they must go through the Java-based HDFS API.
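To make the POSIX point concrete, here is a minimal sketch in Python. It assumes a GPFS file system is mounted on the node; the mount point, file name, and helper functions are hypothetical, and a temporary directory stands in for the mount so the sketch runs anywhere. The idea is only that a POSIX-compliant file system can be used with ordinary file calls, including appends after the initial load.

```python
import os
import tempfile

# Hypothetical: on a real cluster this would be a GPFS mount point
# (for example "/gpfs/bigdata"). A temporary directory stands in for it.
mount_point = tempfile.mkdtemp()
path = os.path.join(mount_point, "events.csv")


def append_record(file_path, record):
    """Append one line using ordinary POSIX file I/O (no Hadoop API)."""
    with open(file_path, "a", encoding="utf-8") as f:
        f.write(record + "\n")


def read_records(file_path):
    """Read all lines back with the same standard calls."""
    with open(file_path, "r", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


append_record(path, "user-1,login")
append_record(path, "user-1,logout")   # updating after the initial write
records = read_records(path)
print(records)
```

With HDFS, the same data would instead be reached through the Hadoop API or the hadoop command-line tools.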


Source: IBM

So the major difference is framework versus file system. As a general-purpose file system, GPFS gives users the flexibility to access stored data from both Hadoop and non-Hadoop systems, which frees them to create more flexible workflows (Big Data, ETL, or online). For example, one can create a series of ETL processes with multiple execution steps, using local data or Java processes to manipulate the data. ETL can also be plugged into MapReduce to execute steps of the workflow.

Feature comparison: GPFS vs. HDFS

Hierarchical storage management
GPFS: Allows efficient usage of disk drives with different performance characteristics
HDFS: (none listed)

High performance support for MapReduce applications
GPFS: Stripes data across disks by using metablocks, which allows a MapReduce split to be spread over local disks
HDFS: Places a MapReduce split on one local disk

High performance support for traditional applications
GPFS:
  • Manages metadata by using the local node when possible rather than reading metadata into memory unnecessarily
  • Caches data on the client side to increase throughput of random reads
  • Supports concurrent reads and writes by multiple programs
  • Provides sequential access that enables fast sorts, improving performance for query languages such as Pig and Jaql
HDFS: (none listed)

High availability
GPFS: Has no single point of failure because the architecture supports the following attributes:
  • Distributed metadata
  • Replication of both metadata and data
  • Node quorums
  • Automatic distributed node failure recovery and reassignment
HDFS: Has a single point of failure on the NameNode, which requires it to run in a separate high availability environment

POSIX compliance
GPFS: Is fully POSIX compliant, which provides the following benefits:
  • Support for a wide range of traditional applications
  • Support for UNIX utilities, which enables file copying by using FTP or SCP
  • Updating and deleting data
  • No limitations or performance issues when using a Lucene text index
HDFS: Is not POSIX compliant, which creates the following limitations:
  • Limited support of traditional applications
  • No support of UNIX utilities, which requires using the hadoop dfs get command or the put command to copy data
  • After the initial load, data is read-only
  • Lucene text indexes must be built on the local file system or NFS because updating, inserting, or deleting entries is not supported

Ease of deployment
GPFS: Supports a single cluster for analytics and databases
HDFS: Requires separate clusters for analytics and databases

Other enterprise level file system features
GPFS:
  • Workload isolation
  • Security
HDFS: (none listed)

Data replication
GPFS: Provides cluster-to-cluster replication over a wide area network
HDFS: (none listed)

Source: IBM


As per IBM, "By storing your data in GPFS-FPO you are freed from the architectural restrictions of HDFS". The shared-nothing architecture used by GPFS-FPO provides greater elasticity than HDFS by allowing each node to operate independently, reducing the impact of a failure event across multiple nodes.

So far we have seen the benefits of storage-based Big Data processing, but HDFS has advantages as well. Here are some of them.

Low cost solution

HDFS uses commodity storage, where both low-end and high-end storage work, and shares the cost of the network and computers it runs on with the MapReduce/compute layers of the Hadoop stack. HDFS is open source software, so an organization can, if it chooses, use it with zero licensing and support costs.

This cost advantage lets organizations store and process orders of magnitude more data per unit of cost than traditional SAN or NAS systems, which is the price point of many of these other systems. In big data deployments, the cost of storage often determines the viability of the system. Nowadays, for large-scale computing, storage cost per unit is a popular metric, and many storage vendors sell it as their USP. But no single storage approach can satisfy every requirement.

Extremely efficient MapReduce workloads

HDFS can deliver data into the compute infrastructure at a huge data rate, which is often a requirement of big data workloads. HDFS can easily exceed 2 gigabits per second per node into the MapReduce layer, on a very low cost shared network. Hadoop can go much faster on higher speed networks, but 10GigE, InfiniBand (IB), SAN and other high-end technologies significantly increase the cost of a deployed cluster.

These technologies are optional for HDFS. 2+ gigabits per second per node may not sound very high, but it means that today's large Hadoop clusters, with their multi-node architecture, can easily read and write more than a terabyte of data per second continuously to the MapReduce layer.
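The arithmetic behind that claim can be checked with a short sketch. The only figure taken from the text is 2 gigabits per second per node; the function names and the use of decimal units (1 TB = 1000 GB) are my own assumptions.

```python
import math

GBIT_PER_NODE = 2      # gigabits per second per node, the figure quoted above
BITS_PER_BYTE = 8


def aggregate_tb_per_s(nodes, gbit_per_node=GBIT_PER_NODE):
    """Aggregate cluster throughput in terabytes per second (1 TB = 1000 GB)."""
    gbytes_per_node = gbit_per_node / BITS_PER_BYTE   # 2 Gbit/s = 0.25 GB/s
    return nodes * gbytes_per_node / 1000


def nodes_for(tb_per_s, gbit_per_node=GBIT_PER_NODE):
    """Smallest node count that reaches the given aggregate throughput."""
    gbytes_per_node = gbit_per_node / BITS_PER_BYTE
    return math.ceil(tb_per_s * 1000 / gbytes_per_node)


needed = nodes_for(1)                  # nodes required for 1 TB/s
print(needed, aggregate_tb_per_s(needed))
```

Under these assumptions, a terabyte per second corresponds to a cluster of about 4,000 nodes each sustaining 2 gigabits per second.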

Solid data reliability

In a large distributed data processing system like Hadoop, the laws of probability are not on our side. Things may break every day, often in new and creative ways that we cannot predict in advance or guard against. Devices will fail, and data will be lost or subtly mutated. The design of HDFS is focused on taming this beast.

It was designed from the ground up to correctly store and deliver data while under constant assault from the gremlins that huge scale-out unleashes in your data center. And it does this in software, again at low cost. Smart design is the easy part; the difficult part is hardening a system in real use cases. The only way you can prove a system is reliable is to run it for years against a variety of production applications at full scale. Hadoop has been proven in thousands of different use cases and cluster sizes, from startups to Internet giants and other organizations.

In a nutshell, GPFS and HDFS each have their own merits and demerits, and each has areas where it applies best. For large systems where cost does not matter much and the organisation is focused on a storage-based system, GPFS is the better option. For those working on commodity hardware with distributed processing across their own and other dependent systems, HDFS is the choice.

In the next blog I will discuss other Big Data technologies.

In the last blog, "REPORTING STRATEGY FOR OLTP SYSTEMS", Vishnu discussed a strategy for OLTP systems using RDBMS and ETL. Some of us questioned whether RDBMS and ETL can keep up as data grows rapidly. In this context, the world is moving towards Big Data for both structured and unstructured data. As data grows, other technologies such as columnar databases are emerging to provide faster and more effective information and keep pace with the market. Let's discuss some of these technologies in this blog.

Why Big Data

Big Data with MapReduce: it's varied; it's growing; it's moving fast, and it's very much in need of smart management. Data, cloud and engagement are energizing organizations across multiple industries and present an enormous opportunity to make organizations more agile, more efficient and more competitive. Every day, millions and billions of data records are captured by different organizations. To capture that opportunity, organizations require a modern information management architecture. Managing growing data and providing analytics capabilities are now requirements in every industry. Since data is growing exponentially, organizations struggle to decide which technologies fit their requirements.

Here is a basic comparison of two technologies.

In order to understand the capabilities of analytics and information management, we need to understand a few things:

Data Management & Warehouse:

Data management and warehousing has achieved industry-leading database performance across multiple workloads while lowering administration, storage, development and server costs. Organizations can realize extreme speed with capabilities optimized for analytics workloads such as deep analytics, and benefit from workload-optimized systems that can be up and running in hours. This trend started quite a while back, and now most large telcos, banks and retailers use it.

Data in Motion Computing (Stream Computing)

Stream computing delivers real-time analytic processing on constantly changing data in motion. It enables descriptive and predictive analytics to support real-time decisions and forecasting to leverage the power of business. Stream computing allows everyone to capture and analyze all data, all the time, just in time. Data is all around us, from social media feeds to call data records to videos, but it can be difficult to leverage. Sometimes there is simply too much data to collect and store before analyzing it. Sometimes it's an issue of timing: by the time you store data, analyze it, and respond, it's too late.

  • Stream computing changes where, when and how much data you can analyze. Store less, analyze more, and make better decisions, faster with stream computing.
  • Efficiently deliver real-time analytic processing on constantly changing data in motion and enable descriptive and predictive analytics to support real-time decisions. Capture and analyze all data, all the time, just in time.
  • Data processed at run time is usually more effective. I remember the days when we first started CIP (call in progress) data analysis, using online probes to catch fraud and minimize losses in the telecom industry. In a way, that is when the industry started thinking about "data in motion".
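The "store less, analyze more" idea can be sketched in a few lines of Python. This is only an illustration, not how any particular stream computing product works: the function name, window size, threshold, and sample call durations are all hypothetical. The point is that each event is analyzed as it arrives and only a small fixed-size window is kept in memory.

```python
from collections import deque


def rolling_alerts(stream, window=3, threshold=100.0):
    """Yield (value, mean) whenever the mean of the last `window`
    values exceeds `threshold`. Only `window` values are ever stored."""
    recent = deque(maxlen=window)      # older values are dropped automatically
    for value in stream:
        recent.append(value)
        if len(recent) == window:
            mean = sum(recent) / window
            if mean > threshold:
                yield value, mean


# Hypothetical call durations (seconds) arriving one at a time.
call_durations = [10, 20, 30, 400, 500, 20, 10]
alerts = list(rolling_alerts(call_durations, window=3, threshold=100.0))
for value, mean in alerts:
    print(f"alert at value={value}, rolling mean={mean:.1f}")
```

Because the function is a generator, it works equally well on an unbounded source such as a socket or message queue; the list here just keeps the example self-contained.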

Content Management

Content management enables comprehensive content lifecycle and document management, with cost-effective control of existing and new types of content and with scale, security and stability. The better you know your customers, the more securely and quickly the business grows. The traditional way of storing content is no longer effective for infrastructure that must manage the full content management lifecycle. For handling documents and using them in real-time analysis, a combination of technologies such as Hadoop, stream computing and a DMS works effectively together. In banks, telcos and other financial industries, where documents are assets, security cannot be handled by a DMS in a silo; the DMS has to be integrated with the other running systems.

Let's discuss a well-known, established framework in detail.

Hadoop System

Hadoop is a fairly new and effective open source software framework (started in 2005, with adoption picking up around 2010). Enterprise distributions bring the power of Apache Hadoop to the enterprise with application accelerators, analytics, visualization, development tools, performance and security features. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

The Hadoop framework transparently provides applications with both reliability and data motion. Hadoop implements a computational paradigm named MapReduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. Along with this, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and the distributed file system are designed so that the framework automatically handles node failures.

Hadoop’s Distributed File System

Hadoop's Distributed File System is designed to reliably store very large files across machines in a large cluster. It is inspired by the Google File System (GFS). Hadoop DFS stores every file as a sequence of data blocks; all blocks in a file except the last block are the same size. Blocks belonging to a file are replicated for fault tolerance. The block size and replication factor are configurable per file. IBM uses a similar concept in the design of the failsafe storage of XIV. Files in HDFS are "write once" and have strictly one writer at any time. The goals of HDFS are to store large datasets, to tolerate hardware failure, and to emphasize streaming data access (data in motion).
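The block layout described above can be sketched as follows. This is a simplified illustration, not the HDFS implementation: the function names, node names, and the round-robin placement policy are hypothetical (the real placement policy is more sophisticated). It only shows how a file becomes equal-sized blocks with a shorter last block, and how each block gets several replicas on distinct nodes.

```python
def split_into_blocks(file_size, block_size):
    """Return the block sizes for a file: all equal except possibly the last."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes.
    Round-robin is a stand-in policy chosen for illustration only."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


# Hypothetical example: a 300 MB file, 128 MB blocks, 3 replicas, 4 nodes.
blocks = split_into_blocks(file_size=300, block_size=128)
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"], replication=3)
print(blocks)
print(placement)
```

If any single node in this example fails, every block still has at least two surviving replicas, which is the fault tolerance the replication factor buys.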


The Hadoop MapReduce framework couples a cluster of machines and executes user-defined MapReduce jobs across the nodes in the cluster. A MapReduce computation has two phases, a map phase and a reduce phase. The input to the computation is a data set of key-value pairs. Tasks in each phase are executed in a fault-tolerant manner: if nodes fail in the middle of a computation, the tasks assigned to them are redistributed among the remaining nodes. Having many map and reduce tasks enables good load balancing and allows failed tasks to be re-run with small runtime overhead. The goals of MapReduce are to process large data sets, to tolerate hardware failure, and to deliver high throughput; fault tolerance is a goal it shares with HDFS.

Nowadays, many data analytics firms send data to user nodes to use their compute power in smaller chunks, and pay those users a good amount of money.

Below is the architecture of a Hadoop system in common telco, finance and retail industry scenarios.
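The two phases can be sketched in plain Python with the classic word count. This is a single-process illustration of the programming model, not the Hadoop API: all function names are hypothetical, and the grouping step between the phases (done by the framework in a real cluster) is written out explicitly.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input (key, value) pair."""
    for key, value in records:
        yield from mapper(key, value)


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(_line_no, line):
    # Emit (word, 1) for every word in the line.
    for word in line.lower().split():
        yield word, 1


def sum_reducer(_word, counts):
    return sum(counts)


# Input is a data set of key-value pairs: (line number, line text).
records = [(0, "big data big cluster"), (1, "data in motion")]
counts = reduce_phase(shuffle(map_phase(records, word_mapper)), sum_reducer)
print(counts)
```

Because each mapper call depends only on its own record and each reducer call only on its own key, any individual call can be re-run on another node after a failure, which is what makes the model fault tolerant.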

[Figure: Hadoop Big Data, processing large data]

If we look at the above architecture, it is a blend of the traditional RDBMS and the Hadoop system. When both technologies work hand in hand, we get a faster, online, reliable, scalable and dependable system that can provide information to the business in no time and serve both the business and the end consumer better.