What is Scalability

Scalability is a system’s ability to process more workload with a proportional increase in system resource usage (mainly hardware resources such as CPU, memory, and threads). In other words, in a scalable system, if the user doubles the workload, the system uses roughly twice the system resources: the vital parameters such as CPU, memory, and disk usage change proportionately. This sounds obvious, but because of conflicts within the system, resource usage can grow faster than the workload, so doubling the workload may consume more than double the resources.

Examples of bad scalability due to resource conflicts include the following:

  • Applications requiring significant concurrency management as user populations or transaction volumes increase
  • Increased locking activity and I/O in the system (see the contention sketch after this list)
  • Increased data-consistency work under load
  • Increased operating system load
  • Transactions requiring more data access as data volumes increase, so that I/O becomes the bottleneck
  • Poor SQL and index design, resulting in a higher number of logical I/Os for the same number of rows returned
  • Reduced availability, because database objects take longer to maintain
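
As a rough illustration of how locking and serialization eat into scalability, here is a minimal Java sketch (the class and the iteration counts are invented for the example) that contrasts a globally synchronized counter, where every thread queues on a single lock, with a LongAdder that spreads updates across threads:

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.LongAdder;

public class LockContentionDemo {
    // A single shared lock: every request serializes here, so adding
    // threads (or CPUs) stops adding throughput beyond a point.
    private static long lockedCounter = 0;
    private static synchronized void lockedIncrement() { lockedCounter++; }

    // LongAdder spreads updates across internal cells, reducing contention.
    private static final LongAdder scalableCounter = new LongAdder();

    private static long run(Runnable work, int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> { for (int j = 0; j < 1_000_000; j++) work.run(); });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return (System.nanoTime() - start) / 1_000_000; // elapsed milliseconds
    }

    public static void main(String[] args) throws InterruptedException {
        int threads = Runtime.getRuntime().availableProcessors();
        System.out.println("synchronized: " + run(LockContentionDemo::lockedIncrement, threads) + " ms");
        System.out.println("LongAdder   : " + run(scalableCounter::increment, threads) + " ms");
    }
}
```

On a multi-core machine the synchronized version usually takes noticeably longer even though both do the same logical work; the extra time is the coordination overhead of the kind listed above.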

An application is said to be non-scalable if it exhausts a system resource to the point where no more throughput is possible when its workload is increased. Such applications hit a fixed throughput ceiling and deliver poor response times, and the result is a bad user experience.

Examples of resource exhaustion include the following:

  • Hardware exhaustion
  • Table scans in high-volume transactions causing inevitable disk I/O shortages
  • Excessive network requests, resulting in network and scheduling bottlenecks
  • Memory allocation causing paging and swapping
  • Excessive process and thread allocation causing operating system thrashing and a CPU bottleneck

This means that application designers must create a design that uses the same amount of resources per unit of work, regardless of user populations and data volumes, and does not push system resources beyond their limits.

Scalability of System

Nowadays multi-device applications are in use, where communication is carried out over IP, USSD, and SMS, and these have more complex performance and availability requirements. Some applications are designed only for back-office use, while others are designed for end users in segments such as telecom, hospitality, and financial services.

End-user applications require more data to be available online. In these segments, not only the availability of data but also real-time service is very important.

The following are important for current-era applications:

  • Availability of the system 24 hours a day, 365 days a year, meaning no downtime
  • An unpredictable and non-specific number of users at any point in time
  • Difficulty in sizing, as capacity requirements change frequently
  • Availability for any type of requirement at all times
  • Stateless middleware
  • Multi-tier and multi-layer architecture
  • Multi-device requirements
  • Rapid development requirements, as the market is moving fast
  • Little or no time for testing
  • Agile methodology of development

Current application workloads grow multifold, and systems grow exponentially year on year. Look at Amazon, Flipkart, and Netflix: all have grown rapidly in their areas and increased their usage exponentially through e-commerce and telemarketing. With such dynamism, every system has adopted new technology and moved to a scalable architecture in its own way. One thing is clear: no scalable architecture is 100% future-proof. Any architecture adopted today can become obsolete in two years due to growth of the system, changes in user behavior, updates in technology, and shifting patterns of user adoption, and then we search for a new scalable architecture.

[Figure: exponential growth]

Factors Preventing Scalability

While designing a system, architects and designers should target the perfect scalability model, but in reality that rarely happens. The most desirable model is the linear scalability model, in which system throughput is directly proportional to system resources: double the resources (for example CPU and memory) and you double the capacity. In practice this is not the case, because real-world applications are tightly coupled to particular technologies and predefined architectures.
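
As a back-of-the-envelope way to reason about the linear model described above, the following sketch (with purely illustrative numbers) computes a scaling efficiency: the measured throughput gain divided by the resource multiple, where 1.0 means perfectly linear scaling and anything lower indicates the kinds of conflicts discussed below:

```java
public class ScalabilityCheck {
    /**
     * Efficiency of scaling: 1.0 = perfectly linear (double the resources,
     * double the throughput); values below 1.0 indicate contention,
     * serialization or other conflicts eating into the added capacity.
     */
    static double efficiency(double baseThroughput, double baseResources,
                             double newThroughput, double newResources) {
        double throughputGain = newThroughput / baseThroughput;
        double resourceMultiple = newResources / baseResources;
        return throughputGain / resourceMultiple;
    }

    public static void main(String[] args) {
        // Illustrative numbers only: doubling CPUs from 8 to 16 raised
        // throughput from 1000 TPS to 1600 TPS, not to 2000 TPS.
        System.out.printf("Scaling efficiency: %.2f%n",
                efficiency(1000, 8, 1600, 16)); // prints 0.80
    }
}
```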

The following areas prevent scalability:

  • Poor design of the system; this is one of the biggest factors affecting scalability.
    • Poor application and database design that causes extra processing.
    • Poor transaction design that causes locking and serialization problems.
    • Poor interconnects, a completely synchronous design, or too many hops in the system.
    • Non-EDA (event-driven architecture) designs in which multiple systems are tightly interdependent.
    • Outdated software and plug-ins.
    • Older design patterns that do not fit the current era.
  • Implementation of the system; sometimes the system is properly designed but the implementation is not correct.
    • The system moves into production with bad I/O strategies.
    • Additional modules and queries are applied in production that were not planned in the original design.
    • The application and hardware resources are not configured to match the production environment.
    • Inefficient memory usage or memory leaks put a high load on virtual/OS memory and exhaust the CPU.
  • Incorrect sizing of hardware
    • Poor capacity planning.
    • Components that are not in balance with each other.
    • Outdated hardware and technologies.
  • Software component limitations
    • Subsystem and third-party software in use may not be linearly scalable.
    • Scalability of databases and operating systems varies; databases are not linearly scalable, and RISC-based operating systems such as Solaris and AIX can give more scalability than Linux.
  • Limitations of hardware and its peripherals
    • Each server has its own capacity and should be sized properly.
    • Most multiprocessor systems are close to linear scalability up to a finite number of CPUs, but beyond a point adding CPUs increases performance less than proportionately. Sometimes adding CPUs can even decrease system performance.
    • For higher scalability, consider engineered systems such as EXADATA, EXALOGIC, PURESYSTEM, PURESCALE, and PUREAPPS.
    • The system should have compatible components; for example, the network should be adequate.
    • Storage and disk I/O should be adequate, possibly using SSDs.
    • The backup system should have adequate capacity.

Right Way of Architecture

There are two parts to system architecture:

  • Hardware and Software components
  • Configuring the right system architecture for your requirement.

Hardware and Software Components

Hardware components – In today's world, solution architects, along with software architects and designers, are responsible for sizing and capacity planning of the hardware at each layer of a multi-layer architecture. It is the architects' responsibility to design a balanced system. Designers are the bridge to building a rightly designed system, so a designer should be sensitive about this, because he or she can be the strongest or weakest link in the system.

In a balanced design, all components reach their design limits at roughly the same time; no single component becomes the bottleneck long before the others.

Below are the main components of a system:

  • Processor Units (CPU)
  • Memory
  • I/O subsystems
  • Storage
  • Network

Processor Units (CPU)

The right selection of CPU is very important; it is a costly and central part of the entire ecosystem. There can be one or more CPUs, and they can vary in processing power from the simple CPU in a mobile device to those in high-powered engineered systems. Right-sizing this component matters because the CPU provides the system's power, but cost is a factor at the same time. Even where the CPU itself does not cost much, other software components are licensed per CPU or core, for example Oracle and DB2 databases and middleware such as WebSphere and WebLogic. Designing the right system is a trade-off between cost and performance. The following should be kept in mind while selecting the CPU:

  • For transactional systems, use multithreaded systems and size worker pools to the available cores (see the sketch after this list).
  • For databases, select higher processing capacity; the higher the CPU clock speed, the higher the database performance.
  • For data processing with NoSQL, commodity hardware is usually sufficient.
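
For the multithreaded transactional case in the first bullet, one hedged starting point is to size worker pools from the number of available processors instead of hard-coding a value, so the same code scales with the hardware it is given; the minimal sketch below uses the standard Java executor API:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class TransactionWorkerPool {
    public static void main(String[] args) {
        // Size the pool from the hardware actually provisioned, so the same
        // code exploits a 4-core VM and a 64-core server without changes.
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService transactionPool = Executors.newFixedThreadPool(cores);

        for (int i = 0; i < 10; i++) {
            final int txnId = i;
            transactionPool.submit(() ->
                    System.out.println("processing transaction " + txnId
                            + " on " + Thread.currentThread().getName()));
        }
        transactionPool.shutdown();
    }
}
```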

Memory

Nowadays applications require a lot of memory to cache data, so appropriate memory is the need of the hour. Even handheld devices ship with 3-4 GB of memory. Enterprise systems require more, so we need to provision a good amount of memory, where caching of data, memory grids, and in-memory data make the system more scalable.
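
To make the caching point concrete, here is a minimal in-memory LRU cache sketch built on java.util.LinkedHashMap. A production system would more likely use a memory grid or a caching library, so treat this only as an illustration of keeping hot data in memory to avoid repeated I/O:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Minimal LRU cache: keeps hot data in memory to avoid repeated I/O. */
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public LruCache(int maxEntries) {
        super(16, 0.75f, true);           // access-order gives LRU behaviour
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;       // evict the least recently used entry
    }

    public static void main(String[] args) {
        LruCache<String, String> cache = new LruCache<>(2);
        cache.put("user:1", "Alice");
        cache.put("user:2", "Bob");
        cache.get("user:1");              // touch user:1 so it stays hot
        cache.put("user:3", "Carol");     // evicts user:2, the coldest entry
        System.out.println(cache.keySet()); // [user:1, user:3]
    }
}
```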

I/O Subsystems and Storage

The I/O of any system can make or break it: with a good I/O subsystem the system performs well, but a small glitch in I/O drags down the performance of the entire ecosystem. I/O subsystems can vary from a single small disk to a disk array. There are many smart storage systems and disk arrays on the market, so while designing a system we should consider the following:

  • For OLTP systems, consider storage with high IOPS and high bandwidth.
  • Use solid-state drives for high-I/O systems.
  • Use commodity storage for document servers.
  • Use high-capacity storage for backup and archive.
  • Tiered (progressive) storage is also one of the best options in varied situations.

Network

In solution design, the network is one factor about which most of us are ignorant. In the early days we did not have much choice of network; the one-gigabit network ruled for quite a long time. But in the recent past, the bandwidth of the Internet and of internal LANs and WANs has grown significantly. We should therefore plan so that the network is not a bottleneck and at the same time remains scalable; 10 Gigabit Ethernet and InfiniBand are good choices for high scalability. Most engineered systems, such as EXADATA and EXALOGIC, are designed with high-bandwidth networks.

Software and Third party Components

Similar to hardware components, software components and third-party software are equally responsible for the scalability of a system. By dividing software components into functional and non-functional components, it is possible to better comprehend the application design and architecture. Some functions of the system are performed by existing software bought to accelerate application implementation or to avoid re-developing common components.

The difference between software components and hardware components is that while a hardware component performs only one task, a piece of software can perform the roles of various software components. For example, a disk drive only stores and retrieves data, but a client program can manage the user interface and also perform business logic.

The following are the major software components involved in a system:

  • Management of the user interface – desktop clients, web, mobile apps, USSD, SMS, IVR, etc.
  • Management of business logic – the backend holds the application logic and has a specific request/response format
  • Management of user requests and resource allocation
  • Management of data and transactions

User Interface

This component is visible to the application user. All interaction with users is performed via this layer; as mentioned earlier, web, desktop client applications, mobile apps, IVR, USSD, SMS, and other machine apps are part of it. It includes the following functions:

  • Rendering the screen and information to the user
  • Collecting user data and transferring it to the business logic layer
  • Displaying processed information to the user
  • Validating entered data (see the sketch after this list)
  • Navigating the user to the next requirement, page, or intent
  • Notifications and alerts to users, such as email alerts or application notifications
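
As a small illustration of the data-validation function above, the following sketch (the form fields and rules are invented for the example) checks user input in the interface layer before anything is handed to the business logic layer:

```java
import java.util.ArrayList;
import java.util.List;

public class SignupFormValidator {
    /** Returns a list of validation errors; an empty list means the data
     *  may be handed over to the business logic layer. */
    static List<String> validate(String email, String mobile) {
        List<String> errors = new ArrayList<>();
        if (email == null || !email.matches("^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$")) {
            errors.add("Invalid email address");
        }
        if (mobile == null || !mobile.matches("^\\d{10}$")) {
            errors.add("Mobile number must be 10 digits");
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println(validate("alice@example.com", "9876543210")); // []
        System.out.println(validate("not-an-email", "12345"));           // two errors
    }
}
```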

While choosing this layer, scalability should be kept in mind, because if this layer is not scalable the entire system is useless. The following should be factored in:

  • The right tool, such as a native mobile app, framework, or related technology
  • The number of concurrent users to be supported
  • The peak times of the system: time of day, day of week, festive seasons, etc.

Business logic layer

This component contains the core business rules that are central to the application's function. Any problem in this part of the system, or any error made in this component, can be very costly to repair. A mixture of declarative and procedural approaches implements this component. An example of a declarative activity is defining the input and output of the system. An example of procedure-based logic is implementing a loyalty or accounting strategy.

Common functions of this component are as follows.

  • Defining the input and output of the system
  • Defining required variables and parameters
  • Defining the rules and constraints of the system
  • Validating business logic
  • Coding procedural logic to implement business rules

To make this component scalable, the following are required:

  • The right tools for business logic
  • The right protocols and APIs for interaction with the user interface and third parties
  • Appropriate use of synchronous and asynchronous calls (see the sketch after this list)
  • A proper event-driven architecture (EDA)
  • Notifications to other systems and users, such as SMS or Android notifications
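
For the synchronous/asynchronous and notification points above, the sketch below shows one hedged way to keep the business logic layer responsive: complete the core transaction synchronously and publish notifications asynchronously, so slow channels such as SMS or push do not block the caller. The class and method names are placeholders, and a full event-driven design would publish an event to a broker instead:

```java
import java.util.concurrent.CompletableFuture;

public class OrderService {
    /** Core business logic runs synchronously; the caller gets its answer
     *  as soon as the order is accepted. */
    public String placeOrder(String orderId) {
        System.out.println("order " + orderId + " accepted");

        // Fire-and-forget notifications: an event-driven system would publish
        // an OrderPlaced event to a broker; these async calls stand in for that.
        CompletableFuture.runAsync(() -> sendSms(orderId));
        CompletableFuture.runAsync(() -> sendPushNotification(orderId));

        return "ACCEPTED:" + orderId;
    }

    private void sendSms(String orderId) {
        System.out.println("SMS sent for order " + orderId);
    }

    private void sendPushNotification(String orderId) {
        System.out.println("push notification sent for order " + orderId);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(new OrderService().placeOrder("A-1001"));
        Thread.sleep(500); // give the async notifications time to run in this demo
    }
}
```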

User Request and Resource allocation

This component is implemented in all pieces of software. However, there are some requests and resources that can be influenced by the application design and some that cannot. In a multiuser application, the application and database server or the operating system handles most resource allocation by user requests.

However, in a large application where the number of users and their usage pattern is unknown or growing rapidly, the system architect must be proactive to ensure that no single software component becomes overloaded and unstable.

In order to make this system scalable, we should take care of the following:

  • Understand the load on each component, such as the user interface, business logic, database, and its underlying hardware components.
  • Choose the right tools and technology.
  • Select proper protocols for communication between internal components and with external systems.
  • Separate transaction, reporting, auditing, logging, analytics, and notification workloads (see the sketch after this list).
  • OLAP and OLTP should go down different paths; they should not co-exist on the same resources.
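
One hedged way to honour the last two bullets is to route transactional and reporting work to different data sources at the application layer. The sketch below uses invented JDBC URLs purely to show the shape of the idea:

```java
public class QueryRouter {
    public enum WorkloadType { TRANSACTION, REPORTING, ANALYTICS, AUDIT }

    private final String oltpUrl;       // primary tuned for short transactions
    private final String reportingUrl;  // replica/warehouse for heavy reads

    public QueryRouter(String oltpUrl, String reportingUrl) {
        this.oltpUrl = oltpUrl;
        this.reportingUrl = reportingUrl;
    }

    /** Reporting, analytics and audit traffic never touches the OLTP database,
     *  so long-running scans cannot starve interactive transactions. */
    public String jdbcUrlFor(WorkloadType type) {
        switch (type) {
            case REPORTING:
            case ANALYTICS:
            case AUDIT:
                return reportingUrl;
            default:
                return oltpUrl;
        }
    }

    public static void main(String[] args) {
        QueryRouter router = new QueryRouter(
                "jdbc:postgresql://oltp-primary/app",      // hypothetical URLs
                "jdbc:postgresql://report-replica/app");
        System.out.println(router.jdbcUrlFor(WorkloadType.TRANSACTION));
        System.out.println(router.jdbcUrlFor(WorkloadType.REPORTING));
    }
}
```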

Data and Transaction Layer

This layer mainly takes responsibility for the database and transactions. It is the bridge between the user's requirements and the data held in storage, so proper database design is very important here.

The following are important to make the system scalable:

  • Proper data modeling – OLAP, OLTP, and document servers must all be designed based on the requirement
  • Table and index design – storage and I/O should be managed through proper normalisation of the schema and the right indexes
  • Using views
  • SQL execution efficiency – inefficient SQL drags down system performance significantly, so users and designers should be very careful (see the JDBC sketch after this list)

For details, see:

SQL Mistakes which Drag System Performance Java, JDBC, Hibernate – Part 1

SQL Mistakes which Drag System Performance Java, JDBC, Hibernate – Part 2

  • Proper storage of data
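
To make the SQL efficiency point concrete, here is a small JDBC sketch in the spirit of the articles linked above (the table and column names are invented). Binding parameters with PreparedStatement lets the database reuse the parsed statement and cached execution plan instead of treating every literal value as a brand-new query:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class CustomerDao {
    /** Uses a bind variable so the statement text stays constant and the
     *  database can reuse the cached execution plan across calls. */
    public String findCustomerName(Connection conn, long customerId) throws SQLException {
        String sql = "SELECT name FROM customers WHERE customer_id = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, customerId);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString("name") : null;
            }
        }
    }
    // Anti-pattern to avoid: concatenating the value into the SQL text, e.g.
    //   "SELECT name FROM customers WHERE customer_id = " + customerId
    // produces a different statement per value, which defeats statement/plan
    // caching and, with user-supplied strings, opens the door to SQL injection.
}
```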

In the next article (Part II), I will explain "Configuring the right architecture for your requirement".

In the last blog we discussed Hadoop with HDFS and MapReduce. One of my teammates wanted me to also discuss IBM GPFS in the context of a Big Data scenario. Let's understand the differences between HDFS and GPFS and how each can be used in different scenarios. The Hadoop Distributed File System (HDFS) is considered a core component of Hadoop, but it is not an essential one. IBM has been talking up the benefits of hooking Hadoop up to the General Parallel File System (GPFS), and IBM has done the work of integrating GPFS with Hadoop.

What is GPFS

IBM developed GPFS (General Parallel File System) in 1998 as a SAN file system for use in HPC applications and IBM's biggest supercomputers, such as Blue Gene, ASCI Purple, Watson, Sequoia, and MIRA. In 2009, IBM hooked GPFS to Hadoop, and today IBM runs GPFS, which scales into the petabyte range and has more advanced data management capabilities than HDFS, under InfoSphere BigInsights.

GPFS is basically a storage file system developed as a SAN file system. Being a storage-side system, it cannot be attached directly to a Hadoop cluster. To bridge that gap, IBM introduced FPO (File Placement Optimization), which essentially emulates the key idea of HDFS: moving the workload from the application to the data. In other words, it moves the job to the data instead of moving the data to the job; in my previous blog I mentioned that moving the job (processing) near the data is faster. GPFS is POSIX compliant, which enables any other applications running on top of the Hadoop cluster to access data stored in the file system in a straightforward manner. With HDFS, only Hadoop applications can access the data, and they must go through the Java-based HDFS API.
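
To illustrate the last point, that HDFS data is reached through the Java-based HDFS API rather than ordinary POSIX file calls, here is a minimal read example using the standard org.apache.hadoop.fs classes; the cluster address and file path are made up, and the Hadoop client libraries must be on the classpath:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // hypothetical cluster

        // Unlike a POSIX file system, the data is only reachable through
        // the Hadoop FileSystem API, not through ordinary file I/O.
        try (FileSystem fs = FileSystem.get(conf);
             FSDataInputStream in = fs.open(new Path("/data/sample.txt"));
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```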

[Figure: IBM GPFS. Source: IBM]

So the major difference is framework versus file system: GPFS gives users the flexibility to access stored data from both Hadoop and non-Hadoop systems, which frees them to create more flexible workflows (Big Data, ETL, or online). One can, for example, create a series of ETL processes with multiple execution steps, local data, or Java processes to manipulate the data, and ETL can also be plugged into MapReduce to execute steps of the workflow.

Feature comparison: GPFS vs. HDFS

Hierarchical storage management
  • GPFS: Allows sufficient usage of disk drives with different performance characteristics
  • HDFS: (not listed)

High performance support for MapReduce applications
  • GPFS: Stripes data across disks by using metablocks, which allows a MapReduce split to be spread over local disks
  • HDFS: Places a MapReduce split on one local disk

High performance support for traditional applications
  • GPFS:
    • Manages metadata by using the local node when possible rather than reading metadata into memory unnecessarily
    • Caches data on the client side to increase throughput of random reads
    • Supports concurrent reads and writes by multiple programs
    • Provides sequential access that enables fast sorts, improving performance for query languages such as Pig and Jaql
  • HDFS: (not listed)

High availability
  • GPFS: Has no single point of failure because the architecture supports distributed metadata, replication of both metadata and data, node quorums, and automatic distributed node failure recovery and reassignment
  • HDFS: Has a single point of failure on the NameNode, which requires it to run in a separate high availability environment

POSIX compliance
  • GPFS: Is fully POSIX compliant, which provides the following benefits:
    • Support for a wide range of traditional applications
    • Support for UNIX utilities that enable file copying by using FTP or SCP
    • Updating and deleting data
    • No limitations or performance issues when using a Lucene text index
  • HDFS: Is not POSIX compliant, which creates the following limitations:
    • Limited support of traditional applications
    • No support of UNIX utilities, which requires using the hadoop dfs get command or the put command to copy data
    • After the initial load, data is read-only
    • Lucene text indexes must be built on the local file system or NFS because updating, inserting, or deleting entries is not supported

Ease of deployment
  • GPFS: Supports a single cluster for analytics and databases
  • HDFS: Requires separate clusters for analytics and databases

Other enterprise-level file system features
  • GPFS: Workload isolation, security
  • HDFS: (not listed)

Data replication
  • GPFS: Provides cluster-to-cluster replication over a wide area network
  • HDFS: (not listed)

Source: IBM

[Figure: IBM GPFS]

As per IBM, "by storing your data in GPFS-FPO you are freed from the architectural restrictions of HDFS". The shared-nothing architecture used by GPFS-FPO provides greater elasticity than HDFS by allowing each node to operate independently, reducing the impact of a failure event across multiple nodes.

We have seen the benefits of storage-based Big Data processing above, but there are advantages of HDFS as well; here are some of them.

Low cost solution

HDFS uses commodity storage, where both low-end and high-end storage works, and it shares the cost of the network and computers it runs on with the MapReduce/compute layers of the Hadoop stack. HDFS is open source software, so if an organization chooses, it can be used with zero licensing and support costs.

This cost advantage lets organizations store and process orders of magnitude more data per unit of money than traditional SAN or NAS systems, whose price points put them out of reach for many big data deployments. In big data deployments, the cost of storage often determines the viability of the system. Nowadays, for large-scale computing, storage cost per unit is a popular metric and many storage vendors sell it as their USP, but not every requirement can be met by only one kind of storage.

Extremely efficient MapReduce workloads

HDFS can deliver data into the compute infrastructure at a huge data rate, which is often a requirement of big data workloads. HDFS can easily exceed 2 gigabits per second per node into the MapReduce layer, on a very low-cost shared network. Hadoop can go much faster on higher-speed networks, but 10GigE, InfiniBand, SAN, and other high-end technologies significantly increase the cost of a deployed cluster.

These technologies are optional for HDFS. Two or more gigabits per second per node may not sound very high, but it means that today's large Hadoop clusters can easily read/write more than a terabyte of data per second continuously to the MapReduce layer using a multi-node architecture.

Solid data reliability

In a large distributed data processing system like Hadoop, the laws of probability are not in our control. Things may break every day, often in new and creative ways that we cannot predict in advance or take precautionary measures against. Devices will fail and data will be lost or subtly mutated. The design of HDFS is focused on taming this beast.

It was designed from the ground up to correctly store and deliver data while under constant assault from the gremlins that huge scale-out unleashes in your data center. And it does this in software, again at low cost. Smart design is the easy part; the difficult part is hardening a system in real use cases. The only way to prove a system is reliable is to run it for years against a variety of production applications at full scale. Hadoop has been proven in thousands of different use cases and cluster sizes, from startups to Internet giants and other organizations.

In a nutshell, GPFS and HDFS each have their own merits and demerits and their own areas of application. For large systems where cost matters less and the organisation is focused on a storage-based system, GPFS is the better option; for those working with commodity hardware and distributed processing across their own and other dependent systems, HDFS is the choice.

In the next Blog I will discuss other Big Data technologies.