Interpretation of jStat (Heap memory) to suspect java memory leak

In this blog I am going to discuss the memory leak in Java and understanding that how we can observe that there are memory leak and then analyse the memory leak.

Heap

Let’s understand the heap and non heap memory.

The Java heap is the amount of memory allocated to applications running in the JVM. Objects in heap memory can be shared between threads. The practical limit for Java heap size is typically limited to under 4-8GB even with Oracle’s current-generation G1 Collector due to the impact of garbage collection pauses.

This has three main part:

  • Eden Space: The pool from which memory is initially allocated for most objects.
  • Survivor Space: The pool containing objects that have survived the garbage collection of the Eden space.
  • Tenured Generation: The pool containing objects that have existed for some time in the survivor space.

Non Heap memory is amount of memory allocated to application for the loading the classes, other meta data which

  • Permanent Generation: The pool containing all the reflective data of the virtual machine itself, such as class and method objects. With Java VMs that use class data sharing, this generation is divided into read-only and read-write areas.
  • Code Cache: The HotSpot Java VM also includes a code cache, containing memory that is used for compilation and storage of native code.

For the basic monitoring of heap we can use the jStat utility. The jstat utility uses the built-in instrumentation in the Java HotSpot VM to provide information about performance and resource consumptionof running applications. The tool can be used when diagnosing performance issues, and in particular issues related to heap sizing and garbage collection. The jstat utility does not require the VM to be started with any special options. The built-in instrumentation in the Java HotSpot VM is enabled by default. This utility is included in the JDK download for all operating system platforms supported by Oracle.

jstat [ generalOption | outputOptions vmid [interval[s|ms] [count]] ]

Below are options for the jStat tool but in this blog I will discuss only the GC option.

 

Option Displays…
Class Statistics on the behavior of the class loader.
compiler Statistics of the behavior of the HotSpot Just-in-Time compiler.
Gc Statistics of the behavior of the garbage collected heap.
gccapacity Statistics of the capacities of the generations and their corresponding spaces.
gccause Summary of garbage collection statistics (same as -gcutil), with the cause of the last and current (if applicable) garbage collection events.
gcnew Statistics of the behavior of the new generation.
gcnewcapacity Statistics of the sizes of the new generations and its corresponding spaces.
Gcold Statistics of the behavior of the old and permanent generations.
gcoldcapacity Statistics of the sizes of the old generation.
gcpermcapacity Statistics of the sizes of the permanent generation.
gcutil Summary of garbage collection statistics.
printcompilation HotSpot compilation method statistics.

 

Let’s discuss the behaviour of Garbage collection to understand memory leak. Once jstat is used with the option below are details printed.

Garbage-collected heap statistics
Column Description
S0C Current survivor space 0 capacity (KB).
S1C Current survivor space 1 capacity (KB).
S0U Survivor space 0 utilization (KB).
S1U Survivor space 1 utilization (KB).
EC Current eden space capacity (KB).
EU Eden space utilization (KB).
OC Current old space capacity (KB).
OU Old space utilization (KB).
PC Current permanent space capacity (KB).
PU Permanent space utilization (KB).
YGC Number of young generation GC Events.
YGCT Young generation garbage collection time.
FGC Number of full GC events.
FGCT Full garbage collection time.
GCT Total garbage collection time.

Let’s see how we can interpret it. Here is sample of one of the application at various points and specially when a GC occurs.

Example1

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
Timestamp        S0C    S1C    S0U    S1U      EC       EU        OC         OU       PC     PU    YGC     YGCT    FGC    FGCT     GCT 
46583.0 152832.0 153600.0 74470.6  0.0   741376.0 366911.0 2097152.0  1986626.3  72768.0 65442.1    432   32.979  19     23.297   56.276
46593.0 152832.0 153600.0 74470.6  0.0   741376.0 568393.9 2097152.0  1986626.3  72768.0 65447.5    432   32.979  19     23.297   56.276
46603.0 150144.0 153600.0  0.0    0.0   741376.0  9453.8  2097152.0  1006314.0  72128.0 64552.8    433   33.054  20     24.699   57.753    
47340.3 149696.0 158400.0 139822.9  0.0   731776.0 552192.7 2097152.0  1911806.9  72128.0 65312.9    446   34.058  20     24.699   58.758
47350.4 149696.0 158400.0 139822.9  0.0   731776.0 700174.6 2097152.0  1911806.9  72128.0 65330.9    446   34.058  20     24.699   58.758
47360.3 159680.0 124864.0  0.0    0.0   731776.0 103510.6 2097152.0  1047206.0  70976.0 64632.1    447   34.143  21     25.999   60.141
47370.4 159680.0 124864.0  0.0    0.0   731776.0 192150.3 2097152.0  1047206.0  70976.0 64633.6    447   34.143  21     25.999   60.141
58039.5 215488.0 265728.0 215475.3  0.0   517120.0 501874.6 2097152.0  1955789.1  66304.0 65702.4    696   53.018  35     46.035   99.053
58040.5 215488.0 265728.0 215475.3 146995.9 517120.0 517120.0 2097152.0  2014621.6  66304.0 65703.1    697   53.018  35     46.035   99.053
58041.5 280896.0 202880.0  0.0   202841.1 501760.0   0.0    2097152.0  2026636.8  66304.0 65703.1    697   53.110  36     46.035   99.145
58042.5 280896.0 202880.0  0.0    0.0   501760.0 26749.5  2097152.0  1227003.7  66112.0 64806.5    697   53.110  36     47.578  100.688
67032.7 296896.0 237952.0  0.0   237914.2 492224.0 492072.2 2097152.0  1998610.5  66176.0 65760.1   1043   77.918  52     71.807  149.725
67033.7 296896.0 313536.0 224865.4  0.0   421504.0   0.0    2097152.0  2036326.4  66176.0 65760.1   1044   78.017  53     71.807  149.824
67034.7 296896.0 313536.0  0.0    0.0   421504.0 36119.0  2097152.0  1296443.8  66176.0 64804.6   1044   78.017  53     73.261  151.278

 

Let’s see the one row and understand it..

Timestamp        S0C    S1C    S0U    S1U      EC       EU        OC         OU       PC     PU    YGC     YGCT    FGC    FGCT     GCT 
46583.0 152832.0 153600.0 74470.6  0.0   741376.0 366911.0 2097152.0  1986626.3  72768.0 65442.1    432   32.979  19     23.297   56.276

 

In this row we can see there are Eden space utilization (EU) and Old Space utilization.

This process is running since 45583 second (13 hours) and total 19 full GC, 432 young GC, total time taken by the GC is 56 second(23 Sec by full GC and 33 Sec by young GC). Which means approx. 35-40 minutes for full GC and 2 minutes for young GC. Which seems to be healthy initially because if full GC come every 30-50 mins it is good for the system.

Below is sample for the frequent GC.

Example2

Timestamp        S0C    S1C    S0U    S1U      EC       EU        OC         OU       PC     PU    YGC     YGCT    FGC    FGCT     GCT   
408558.1 349504.0 349504.0  0.0    0.0   349568.0 341903.6 2097152.0  2097142.3  65536.0 64873.5   9488  780.312 1652  4592.517 5372.829
408558.1 349504.0 349504.0  0.0    0.0   349568.0 341903.6 2097152.0  2097142.3  65536.0 64873.5   9488  780.312 1652  4592.517 5372.829

 

If we look on this Example2 it has approx. 1652 GC in occurrence 113 hours which depict approximately GC occurs in every 4 minutes. This is not healthy for the system. Also it has approx 5372 second took on the Garbage Collection means approx 1.5 hours system was in hung state which is approx 12% of time.

Here is another jStat sample for the same.

Example 3

Timestamp        S0C    S1C    S0U    S1U      EC       EU        OC         OU       PC     PU    YGC     YGCT    FGC    FGCT     GCT   
333695.7 349504.0 349504.0  0.0    0.0   349568.0 349568.0 2097152.0  2097152.0  65536.0 64908.4   6720  535.716 2043  5042.068 5577.784
333695.8 349504.0 349504.0  0.0    0.0   349568.0 349568.0 2097152.0  2097152.0  65536.0 64908.4   6720  535.716 2043  5042.068 5577.784

If we look on this Example 3 it has approx. 2043 GC occurrence in 93 hours which depict approximately GC occurs in every 2.7 minutes. Also it has approx 5577 second took on the Garbage Collection means approx 1.55 hours system was in hung state which is approx 16% of time. This is even worse than previous Jstat example.

1
2
3
4
5
Timestamp        S0C    S1C    S0U    S1U      EC       EU        OC         OU       PC     PU    YGC     YGCT    FGC    FGCT     GCT 
46603.0 150144.0 153600.0  0.0    0.0   741376.0  9453.8  2097152.0  1006314.0  72128.0 64552.8    433   33.054  20     24.699   57.753
47360.3 159680.0 124864.0  0.0    0.0   731776.0 103510.6 2097152.0  1047206.0  70976.0 64632.1    447   34.143  21     25.999   60.141
58042.5 280896.0 202880.0  0.0    0.0   501760.0 26749.5  2097152.0  1227003.7  66112.0 64806.5    697   53.110  36     47.578  100.688
67034.7 296896.0 313536.0  0.0    0.0   421504.0 36119.0  2097152.0  1296443.8  66176.0 64804.6   1044   78.017  53     73.261  151.278

Now let’s go back again to earlier example (extract defined in Example 4) for analysis to understand that how we can determine the memory leak. After each GC OU is increasing and initially at 46603 second it was 1006314.0 KB approx. at 67034 second it increased to 1296443.8 KB means in 5.6 hours system has consumed approx. 283MB of space as a retained heap. This is clear memory leak in the system and after some time system will trow out of memory exception. With this speed system will crash after some time. As the retained heap increases
over period of time the difference between two GC time decreases due to reducing heap space.

Now let’s do the analysis of same binary. For this take the heap dump using the following command.

1
jmap -dump:file=<file_name> <pid>

Once taken the heap dump process it using the memory analyser. For this there are many tools, I like the eclipse memory analyser tool (MAT). Here is link for the MAT
Memory Analyzer (MAT)

After opening the memory leak suspect there are two main memory leak suspect is visible.

Leak Suspect 1

One instance of "org.hibernate.impl.SessionFactoryImpl" loaded by "org.apache.catalina.loader.WebappClassLoader @ 0x740db1d00" occupies 1,011,397,080 (40.25%) bytes. The memory is accumulated in one instance of "java.util.HashMap$Entry[]" loaded by "<system class loader>".

Keywords
org.hibernate.impl.SessionFactoryImpl
org.apache.catalina.loader.WebappClassLoader @ 0x740db1d00
java.util.HashMap$Entry[]

 

This is basically because of the improper query written the system. Generally If we expand this we found the it is hibernate statistics are problem here.

heap1

In the above it is visible that approx. 1 GB of heap is retained by the hibernate statistics. Further looking on the table and key level here is snap.

heap2

In the above figure one can easily figure it out the query statement is not prepared statement and it is memory consuming. Every statement with different value will become the key for the hibernate query caching and stored in the memory in retained heap. Solution for the same is to create a prepared statement like below.

“FROM com.abc.dto.xxxxxxDTO thd where thd.xxxid=:transactionID”

heap3

For high transactional system this is really killer because the entries in the retained heap is approximately 1.8 million.

Leak Suspect 2

The classloader/component "org.apache.catalina.loader.WebappClassLoader @ 0x740db1d00" occupies 457,168,920 (18.19%) bytes. The memory is accumulated in one instance of "java.util.concurrent.ConcurrentHashMap$Segment[]" loaded by "<system class loader>".

Keywords
org.apache.catalina.loader.WebappClassLoader @ 0x740db1d00
java.util.concurrent.ConcurrentHashMap$Segment[]

 

This leak suspect is for the concurrent hashmap is keep on increasing and not getting cleared. This is consuming approximately 450 MB of space in old generation and this is not collected by Garbage collector because this is defined as retained map.

heap4
Further exploring the same we can see the session map is retaining the heap. To handle this developer need to clear the map and clear the release the memory after the use otherwise slowly it will throw the out of memory error.

heap5

In the next blog I will discuss some more memory leaks and it’s solution.

 

Negative and Exception Scenarios: Think Right

Think right scenarios
Thing right scenarios

Many organizations are aiming the most optimal product quality while development and QA processes. One side developer develop the product according the specifications provided by the business teams. Other side tester and QA teams test the product quality by various quality assurance processes, manual and automated testing and attempt to find the maximum defects during the QA process and testing. During this process QA team trying to ensure that client and/or end user consuming the product should not see the abnormality with respect to product functioning and it’s feature.

While designing the software product, BA, developer and tester’s goal is design or craft the right architecture, right test cases scenario, to make sure that particular application and product platform the way it is supposed to. At one hand it is absolute important to verify that the product perform it’s function as per the functional need, on the other hand it is equally or more important to verify that application or product gracefully handle the abnormal situations. It is obvious that most of the engineering and QA teams are working on the stated functionality by business teams and develop the software and test cases.

Handling of abnormal scenarios depends on the developers and tester’s creativity and knowledge/experience of eco system. But relying on developer and tester creativity cannot ensure the best quality product. For this we need to do more. In this paper I will discuss the right and wrong scenarios.

What are Positive and Negative Scenarios?

Positive scenario: Generally while defining the business scenario we define the happy situation.

Case1: Taking very basic two examples of the positive scenarios. Let’s take the case of user entry form of kid’s school admission and user input is “Date of Birth” of kids and parents “Date of Birth”. In the happy scenario developer provides a date control and tester create test case input field containing the date of birth for the kid and parent both.

Case2: As a complex example let’s consider that consumer withdraw money from the bank ATM.

# Business Scenario Steps Expected Result
1 Withdraw Money from the ABC Bank ATM –        Insert/Swipe the card in ATM.

–        Authorize and select Account

–        Enter Amount to withdraw.

–        Confirm the transaction

–        Collect cash

–        Collect Receipt

–        Money will be dispense from the ATM

–        Account will be debited.

–        User receives SMS and mail notification.

 

Developer will create the application based on the functionality provided in the above use case. These scenarios are “Happy Scenarios” and generally works fine because developer and tester both are focusing on the same.

Negative Scenarios: Now in all the above cases negative scenarios makes an impact for the quality of application/product. We dill discuss all the cases for the negative scenarios and also the owner of the scenarios.

Case1:

  • In case of “Date of Birth” Date should be always less than the current date else someone can put the future date.
  • Parent date should always be greater than the kid’s birthday; otherwise data entry operator can enter the parent date of birth in kid’s date of birth field.
  • Most important that for the school kids age should be greater than ‘X’ month. It cannot be yesterday in the kid birthday field.

Case2:

  • If customer inserts the card machine is not accepting the other bank’s card.
  • If customer is of same bank he enter wrong pin.
  • If customer enters the amount more than the limit of ATM money combination to dispense.
  • If customer is about the complete the transaction at the same time machine dispense system is malfunctioning and his money is debited in his book of accounts.

For thinking negative scenarios there are multiple owners and their creativity.

  • “Date of Birth” should be less than current date scenario should be covered by the designer, developer and QA team,
  • Parent’s date should be always less than kid date of birth also be covered by designer, developer and QA team
  • While for third case business team must define Age factor, this cannot be left on the developer and tester creativity.

In the above scenarios if everyone start right user cannot make mistake, so if BA defines possible negative business cases in requirement, developer define the technical negative scenarios and develop and tester defines the usability negative test cases and test in order to provide the softer to have expected output of product.

Factors to consider the negative scenarios:

Following are factors, which should be considered to define the negative scenarios.

  • Usability and handling mistakes
  • Business driven negative scenarios
  • Human Errors
  • Technology malfunctions
  • Environmental Glitches
  • Development Time Scenarios

Let’s discuss this in details.

Usability and Handling Mistakes

In the above example if data entry operator interchange the parent and kid date of birth then it is usability and handling mistake and for this developer and tester and BA should make it fool proof. In order to handle this there should be a check that parent DOB should always be greater than kid DOB.

Also DOB should always be less then current date. It is very similar to credit card expiry date field where month and year is always greater or equal to current date.

Business Driven Negative Scenarios

Business driven negative scenarios are those scenarios where BA must define the proper handling of business failure cases and negative scenarios. Like kids can start school after 36 month or amount in the ATM machine cannot exceed to a limit are basic examples.

Human Errors

Human errors are very common and it needs to be taken care in the system. For example user leaves certain column blank, or enters the wrong value in relevant field. Such cases require basic checks and negative scenarios of testing and in general making eliminates 90% of errors of the system.

Technology Malfunctions

None of the technology is perfect, every technology malfunction some or other time. ATM machine consist of many important technologies like machine, modem, Internet, and electricity, in this any part can malfunction. So all the failure scenarios must be defined in the negative scenarios.

Environmental Glitches

Many times system expected output changes because of the environment where system is operated. For example poor network can result in the non-performance of application while graphics loading; heavy pages may not be transmitted accurately but still it is part of allocation and system is not working as expected. So while developing and testing these things should be taken care.

Development Time Scenarios – Technical Exception

These are called technical exception handlings. Technical exceptions happen when a technical component of a business process acts in an unexpected way. When using Java based systems, this often results in a literal code Exception being thrown by the system. Technical components used in a process can fail in a way that cannot be described using process. In this case, it’s important to handle these exceptions in expected ways.

Generally developer creates a try catch and finally block and for all the exceptions they are using single exception, which is not correct. Every different exception scenario must be handled independently.

Let’s discuss some scenarios, which should be taken care by different roles.

Negative Scenarios and Exceptions Thinking

Business Exception Handling.

A business scenario is essentially a complete description of a business problem, both in business and in architectural terms, which enables individual requirements to be viewed in relation to one another in the context of the overall problem. Without such a complete description to serve as context:

  • There is a danger of the architecture being based on an incomplete set of requirements that do not add up to a whole problem description, and that can therefore misguide architecture work. So clearly create define all the alternative paths.
  • The business value of solving the problem is unclear
  • The relevance of potential solutions is unclear

Also, because the technique requires the involvement of business line management and other stakeholders at an early stage in the architecture project, it also plays an important role in gaining the buy-in of these key personnel to the overall project and its end-product – the IT architecture. Thinking of the business exception scenario can only be done by the business stakeholders, though developer and tester can add value but it is not their prime area to work on.

Creating a business scenario involves the following

1 – Identifying, documenting and ranking the problem driving the scenario

2 – Identify the business and technical environment of the scenario and documenting it in scenario models

3 – Identifying and documenting desired objectives (the results of handling the problems successfully)

4 – Identifying the human actors (participants) and their place in the business model

5 – Identifying computer actors (computing elements) and their place in the technology model

6 – Identifying and documenting roles, responsibilities and measures of success per actor; documenting the required scripts per actor, and the results of handling the situation

7 – Checking for “fitness for purpose” and refining only if necessary.

Technical Exception Handling

There are multiple areas in technical design and development which need to be taken care while handling negative scenarios, and these consists of the coding, information store, network, deployment etc. Here are some cases of Java, DB (oracle) ways to handle exception and negative scenarios.

  • Handle the common use negative scenarios
  • Use appropriate validation at each input and output level
  • Use Checked Exception for Recoverable error and Unchecked Exception for programming error.
  • Close or release resource in finally block
  • Including cause of Exception in stack-trace
  • Always provide meaning full message on Exception
  • Avoid overusing Checked Exception
  • Converting Checked Exception into Runtime Exception
  • Remember Exceptions are costly in terms of performance
  • Avoid empty catch blocks
  • Use Standard Exceptions
  • Document Exception thrown by any method
  • Have proper database exceptions handles at each level.

Below are oracle database standard exception and messages to deal the negative scenario.

Exception Name ORA Error SQLCODE Raised When …
ACCESS_INTO_NULL 06530 -6530 A program attempts to assign values to the attributes of an uninitialized object
CASE_NOT_FOUND 06592 -6592 None of the choices in the WHEN clauses of a CASE statement is selected, and there is no ELSE clause.
COLLECTION_IS_NULL 06531 -6531 A program attempts to apply collection methods other than EXISTS to an uninitialized nested table or v-array, or the program attempts to assign values to the elements of an uninitialized nested table or v-array.
CURSOR_ALREADY_OPEN 06511 -6511 A program attempts to open an already open cursor. A cursor must be closed before it can be reopened. A cursor FOR loop automatically opens the cursor to which it refers, so your program cannot open that cursor inside the loop.
DUP_VAL_ON_INDEX 00001 -1 A program attempts to store duplicate values in a column that is constrained by a unique index.
INVALID_CURSOR 01001 -1001 A program attempts a cursor operation that is not allowed, such as closing an unopened cursor.
INVALID_NUMBER 01722 -1722 n a SQL statement, the conversion of a character string into a number fails because the string does not represent a valid number. (In procedural statements,VALUE_ERROR is raised.) This exception is also raised when theLIMIT-clause expression in a bulkFETCH statement does not evaluate to a positive number.
LOGIN_DENIED 01017 -1017 A program attempts to log on to the database with an invalid username or password.
NO_DATA_FOUND 01403 +100 A SELECT INTO statement returns no rows, or your program references a deleted element in a nested table or an uninitialized element in an index-by table.

Because this exception is used internally by some SQL functions to signal completion, you must not rely on this exception being propagated if you raise it within a function that is invoked as part of a query.

NOT_LOGGED_ON 01012 -1012 A program issues a database call without being connected to the database.
PROGRAM_ERROR 06501 -6501 PL/SQL has an internal problem.
ROWTYPE_MISMATCH 06504 -6504 The host cursor variable and PL/SQL cursor variable involved in an assignment have incompatible return types. When an open host cursor variable is passed to a stored subprogram, the return types of the actual and formal parameters must be compatible.
SELF_IS_NULL 30625 -30625 A program attempts to invoke a MEMBER method, but the instance of the object type was not initialized. The built-in parameter SELF points to the object, and is always the first parameter passed to a MEMBER method.
STORAGE_ERROR 06500 -6500 PL/SQL ran out of memory or memory was corrupted.
SUBSCRIPT_BEYOND_COUNT 06533 -6533 A program references a nested table or v-array element using an index number larger than the number of elements in the collection.
SUBSCRIPT_OUTSIDE_LIMIT 06532 -6532 A program references a nested table or v-array element using an index number (-1 for example) that is outside the legal range.
SYS_INVALID_ROWID 01410 -1410 The conversion of a character string into a universal rowid fails because the character string does not represent a valid rowid.
TIMEOUT_ON_RESOURCE 00051 -51 A time out occurs while the database is waiting for a resource.
TOO_MANY_ROWS 01422 -1422 A SELECT INTO statement returns more than one row.
VALUE_ERROR 06502 -6502 An arithmetic, conversion, truncation, or size-constraint error occurs. For example, when your program selects a column value into a character variable, if the value is longer than the declared length of the variable, PL/SQL stops the assignment and raises VALUE_ERROR. In procedural statements, VALUE_ERROR is raised if the conversion of a character string into a number fails. (In SQL statements INVALID_NUMBER is raised.)
ZERO_DIVIDE 01476 -1476 A program attempts to divide a number by zero.

At the end I would like to emphasis  on the need of negative scenarios and exception handling and every team must define the negative scenarios in their capacity and understanding, it should start from the business team then designer, developer and at the end by testers. It is good if we involve the customer as well in this Journey.

Daya Shanker
Date: 5 Sept 2015