Batch processing is a highly data incentive operation and most of architects recommend to execute it near data using database in built capability like PL/SQL in oracle. This is true but real challenge came when Architects has to conform organization standard to follow Java and ORM layer like Hibernate to have their product database agnostic.

In this blog I will talk about similar challenge where use case is to deal with more than 20 million records in database and have to do processing on each record and then save the output into database. Below is typical requirement in batch processing.

  • Pickup data records from database
  • Apply some business logic on each record
  • Save the manipulated data into database.

The obvious solution is to use database technologies like Pl/SQL or ETL tools for better performance. But using a Java application and use hibernate ORM layer and all this with a performance target of 10M record processing in 30 minutes, task seems challenging.

The challenges with Java+ Hibernate based processing are:

  • Limited Heap Size and out of memory error during processing
  • Network overhead in pulling data from database to application.
  • Hibernate overhead as compared to native JDBC calls.

Let?s quickly jump to solution part and talk something positive about Java.

Java performance can be scaled using its multithreading capability and that is hint to solve the puzzle. The problem can be solved if input data can be partitioned into independent data sets and each set can be processed parallel. The following picture illustrates the partitioning approach:

SQL Data1

There can be many partitioning approaches: The following lists some of the possible partitioning approaches. Selecting a partitioning approach has to be done on a case-by-case basis.

  1. Fixed and Even Break-Up of Record Set
  • This involves breaking the input record set into an even number of portions (e.g. 10, where each portion will have exactly 1/10th of the entire record set). Each portion is then processed by one instance of the batch/extract application.
  • In order to use this approach, preprocessing will be required to split the recordset up. The result of this split will be a lower and upper bound placement number which can be used as input to the batch/extract application in order to restrict its processing to its portion alone.
  • Preprocessing could be a large overhead as it has to calculate and determine the bounds of each portion of the record set.
  1. Breakup by a Key Column
  • This involves breaking up the input record set by a key column such as a location code, and assigning data from each key to a batch instance.
  1. Addition of a Processing Indicator
  • This involves the addition of a new column to the input table, which acts as an indicator. As a preprocessing step, all indicators would be marked to non-processed. During the record fetch stage of the batch application, records are read on the condition that that record is marked non-processed, and once they are read (with lock), they are marked processing. When that record is completed, the indicator is updated to either complete or error. Many instances of a batch application can be started without a change, as the additional column ensures that a record is only processed once.
  • With this option, I/O on the table increases dynamically. In the case of an updating batch application, this impact is reduced, as a write will have to occur anyway.

Still problem remain same as large data sets cannot be pulled into Java memory. For this Hibernate provides scroll() to take advantage of server-side cursors for queries that return many rows of data.

Session session = sessionFactory.openSession();Transaction tx = session.beginTransaction();
ScrollableResults customers = session.getNamedQuery("GetCustomers?)    .setCacheMode(CacheMode.IGNORE)
int count=0;
while ( customers.next() ) {
 if ( ++count % 20 == 0 ) 
 //flush a batch of updates and release memory:        session.flush();        

In this code snippet, Inside while loop, create batches for say 500 records and spawn a new thread from thread pool dedicated to batch program. The size of batch and no of concurrent threads processing batches shall be tuned based on memory allocated to java instance.

In my next blog, I will talk about Batch processing Strategies and Best Practices.