Hadoop with HDFS, MapReduce, GPFS

In the last Blog we discussed more about the Hadoop with HDFS and MapReduce. One of my teammate wants me to discuss about the IBM GPSF also in case of BigData Scenario. Let’s understand the difference in HDFS, GPFS and how it can be used in different scenarios. The Hadoop Distributed File System (HDFS) is considered a core component of Hadoop, but it?s not an essential one. IBM has been talking up the benefits of hooking Hadoop up to the General Parallel File System (GPFS). IBM has done the work of integrating GPFS with Hadoop.

What is GPFS

IBM developed GPFS(General Parallel File System) in 1998 as a SAN file system for use in HPC applications and IBM?s biggest supercomputers, such as Blue Gene, ASCI Purple, Watson, Sequoia, and MIRA. In 2009, IBM hooked GPFS to Hadoop, and today IBM is running GPFS, which scales into the petabyte range and has more advanced data management capabilities than HDFS, on InfoSphere BigInsights.

GPFS is basically storage file system developed as a SAN file system. Being an storage system one can not attach it directly with Hadoop system that makes a cluster. In order to do this IBM FPO(File placement optimization) comes in picture and bridge the gap. FPO is essentially emulation of key component of HDFS which is moving the workload from the application to data. Basically it move the job to Data instead of moving the Data to job. In my previous blog I have mentioned that if we move Job(processing) near the data it would be faster. GPFS is POSIX compliant, which enables any other applications running on top of the Hadoop cluster to access data stored in the file system in a straightforward manner. With HDFS, only Hadoop applications can access the data, and they must go through the Java-based HDFS API.

IBM_GPFS_2

Source: (IBM)

So major difference is framework verses file system (GPFS) gives the flexibility to users to access storage data from Hadoop and Non Hadoop system which free users to create more flexible workflow (Big Data or ETL or online). In that case one can create the series of ETL process with multiple execution steps, local data or java processes to manipulate the data. Also ETL can be plugged with MapReduce to execute the process for workflow.

Features GPFS HDFS
Hierarchical storage management Allows sufficient usage of disk drives with different performance characteristics
High performance support for MapReduce applications Stripes data across disks by using metablocks, which allows a MapReduce split to be spread over local disks Places a MapReduce split on one local disk
High performance support for traditional applications
  • Manages metadata by using the local node when possible rather than reading metadata into memory unnecessarily
  • Caches data on the client side to increase throughput of random reads
  • Supports concurrent reads and writes by multiple programs
  • Provides sequential access that enables fast sorts, improving performance for query languages such as Pig and Jaql
High availability Has no single point of failure because the architecture supports the following attributes:

  • Distributed metadata
  • Replication of both metadata and data
  • Node quorums
  • Automatic distributed node failure recovery and reassignment
Has a single point of failure on the NameNode, which requires it to run in a separate high availability environment
POSIX compliance Is fully POSIX compliant, which provides the following benefits:

  • Support for a wide range of traditional applications
  • Support for UNIX utilities, that enable file copying by using FTP or SCP
  • Updating and deleting data
  • No limitations or performance issues when using a Lucene text index
Is not POSIX compliant, which creates the following limitations:

  • Limited support of traditional applications
  • No support of UNIX utilities, which requires using the hadoop dfs get command or the put command to copy data
  • After the initial load, data is read-only
  • Lucene text indexes must be built on the local file system or NFS because updating, inserting, or deleting entries is not supported
Ease of deployment Supports a single cluster for analytics and databases Requires separate clusters for analytics and databases
Other enterprise level file system features
  • Workload isolation
  • Security
Data replication Provides cluster-to-cluster replication over a wide area network

Source: IBM

IBM_GPFS

As per IBM ?By storing your data in GPFS-FPO you are freed from the architectural restrictions of HDFS?. Shared nothing architecture used by the GPFS-FPO provide greater elasticity than HDFS by allowing each node to operate independently, reducing than impact of failure event across the multiple node.

While all the above benefits we have seen over the storage based Big Data processing but there are advantages of HDFS as well here are some of them.

Low cost solution

HDFS uses commodity storage where low end and high end both storage works and shares the cost of the network & computers it runs on with the MapReduce / compute layers of the Hadoop stack. HDFS is open source software, so that if an organization chooses, it can be used with zero licensing and support costs.

This cost advantage lets organizations store and process orders of magnitude more data per unit than traditional SAN or NAS systems, which is the price point of many of these other systems. In big data deployments, the cost of storage often determines the viability of the system. Now a days for the large computing storage cost per unit is very popular and many Storage vendors are selling this as USP. But all the requirement can not be factored by only one way of Storage.

Extremely efficient MapReduce workloads

HDFS can deliver data into the compute infrastructure at a huge data rate, which is often a requirement of big data workloads. HDFS can easily exceed 2 gigabits per second per node into the map-reduce layer, on a very low cost shared network. Hadoop can go much faster on higher speed networks, but 10gigE, IB, SAN and other high-end technologies increses significantly the cost of a deployed cluster.

These technologies are optional for HDFS. 2+ gigabits per second per node may not sound very high, but this means that today?s large Hadoop clusters can easily read/write more than a terabyte of data per second continuously to the MapReduce layer using with multimode architecture.

Solid data reliability?

In large data processing distributed system like Hadoop, the laws of probability are not in our control. Things may break every day, often in new and creative ways which we can not predict earlier and take the precautionary measures. Devices will fail and data will be lost or subtly mutated. The design of HDFS is focused on taming this beast.

It was designed from the ground up to correctly store and deliver data while under constant assault from the gremlins that huge scale out unleashes in your data center. And it does this in software, again at low cost. Smart design is the easy part; the difficult part is hardening a system in real use cases. The only way you can prove a system is reliable is to run it for years against a variety of production applications at full scale. Hadoop has been proven in thousands of different use cases and cluster sizes, from startups to Internet giants and other organization.

In nutshell GPFS and HDFS both have it’s own merit and demerits and application of the technology in respective areas. For the large system where cost does not matter much and organisations are focusing on storage based system GPFS is better option and for all those who are working on commodity hardware and distributed processing on self and other dependent systems HDFS is a choice.

In the next Blog I will discuss other Big Data technologies.

4 thoughts on “Hadoop with HDFS, MapReduce, GPFS

  1. Hi Daya

    Nice article.
    However, when considering GPFS over HDFS following variables should be taken into account
    1. Cost of commodity hardware in HDFS vs server grade SAN storage by GPFS. When data runs into petabytes, the cost differential will be significant. One advantage that GPFS gives is that it also supports non-hadoop processes like ETL. This can be leveraged by using GPFS on reduced dataset by hadoop.
    2. Open source vs. proprietary
    3. Support for newer versions of Hadoop.

    Also with new version of Mapreduce2.0 (YARN). The HDFS can also be made HA and NodeManager being single point of failure can be done away with

    1. Yes I agree with your points. In the new version of MapReduce SPOF is tackled and this has to adopt the market which will eventually take 6 months to go.

  2. Hi Daya,
    This is a well written article. However, it would be even better iof you had other sources other than IBM. Having IBM as your only source when comparing an IBM product with open source seems like a conflict of interest to me. Thanks for sharing you take though.

Leave a Reply

Your email address will not be published. Required fields are marked *