Low-level speed optimizations of Java code
Recently we faced a unique challenge: implementing a custom calculation/solution search algorithm for which we did not find any suitable existing algorithm or library.
While we managed to implement an algorithm whose performance was on the edge of acceptability, we wanted to use this opportunity to push a little further and apply low-level optimizations. Our test case included 2^25 (about 33 million) iterations, for which the initial implementation took 60 seconds of runtime on the development machine.
The first thing we noticed while running the test case was that the CPU was not utilized at 100%, but a single core was. The cause was obvious: the implementation was running in a single thread, so it could not use the multiple cores available in the machine.
The code was refactored to use the @Async method annotation and CompletableFuture invocations, so that the execution could be split into multiple independent batches. This change required additional synchronized code sections, since the search needed to maintain a shared, sorted list of the best solutions found so far.
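As a rough sketch of that refactoring (class, method, and type names below are illustrative, not the actual project code; it assumes Spring's @Async support), one service method processes an independent batch of permutation indices and publishes its candidates into a synchronized, sorted list of best solutions:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Service;

@Service
public class SolutionSearchService {

    // Placeholder for the project's actual solution type.
    public record Solution(long cost) {}

    // Best solutions found so far, shared by all batches.
    private final List<Solution> bestSolutions = new ArrayList<>();

    // Each batch covers an independent range of permutation indices and runs on its own thread.
    @Async
    public CompletableFuture<Void> searchBatch(long fromIndex, long toIndex) {
        for (long permutationIndex = fromIndex; permutationIndex < toIndex; permutationIndex++) {
            Solution candidate = evaluatePermutation(permutationIndex);
            if (candidate != null) {
                recordSolution(candidate);
            }
        }
        return CompletableFuture.completedFuture(null);
    }

    // The only code touching shared state: keep the list of best solutions sorted and bounded.
    private synchronized void recordSolution(Solution candidate) {
        bestSolutions.add(candidate);
        bestSolutions.sort(Comparator.comparingLong(Solution::cost));
        if (bestSolutions.size() > 10) {
            bestSolutions.remove(bestSolutions.size() - 1);
        }
    }

    private Solution evaluatePermutation(long permutationIndex) {
        // ... the actual search logic for a single permutation ...
        return null;
    }
}

The caller then splits the 2^25 permutation indices into ranges, invokes searchBatch for each range, and waits on CompletableFuture.allOf(...). Note that @Async only takes effect when the method is invoked through the Spring proxy and asynchronous execution is enabled with @EnableAsync.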
That indeed helped, reducing the runtime almost fivefold to 13 seconds using ten running threads on a six-core processor. Although the processor supports hyper-threading, in this case the performance seemed to scale with the number of physical cores rather than the number of hyper-threads.
So the goal was to push further. To get better insight into the running code, we looked at profilers and selected JProfiler (available at https://www.ej-technologies.com/products/jprofiler/overview.html). We refactored the code so that most of the inner-loop logic was split into individual method calls, so that we could better count the number of invocations and the time spent in each method.
At first the recommended CPU sampling method was used during profiling, but it did not provide enough insight into the innermost loop. After switching to instrumentation (which slowed down test case execution by roughly a factor of 5), the problematic statements were quickly identified.
The easy improvement was related to a logging statement:
log.debug("permutation {} vendors {}", permutationIndex, permutationVendorSet);
It was refactored by wrapping it inside an isDebugEnabled check:
if (log.isDebugEnabled()) {
    log.debug("permutation {} vendors {}", permutationIndex, permutationVendorSet);
}
Logging can have a significant performance impact even with disabled appenders, especially in tight loops: even though the message is never formatted, the arguments are still evaluated (and primitives boxed) on every call before the log level is checked. This single change brought a 20% improvement.
Almost 50% of the runtime was spent in various Collections API methods. HashSets were used to look up objects and ArrayLists to add and iterate over objects. But since the search space was rather small (up to 32 objects), the code was rewritten to use plain old Java arrays with integer-based indexing. The (Hash)Set was also gone, replaced by a direct, iterative search over the array. These two changes more than doubled the performance, reducing the test case runtime to 28 seconds (for a single thread). The cost was reduced code readability.
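As a simplified illustration of that kind of rewrite (the 32-element limit comes from the description above, the names are made up), a HashSet-based membership check can become a plain int array with a linear scan:

// Before (sketch): boxing int indices into Integer objects and hashing them on every lookup.
//   Set<Integer> usedVendors = new HashSet<>();
//   boolean used = usedVendors.contains(vendorIndex);

// After (sketch): a plain int array with a linear scan; for at most 32 elements this is
// cheap and avoids both hashing and boxing.
final class UsedVendors {
    private final int[] indices = new int[32];
    private int count = 0;

    void markUsed(int vendorIndex) {
        indices[count++] = vendorIndex;
    }

    boolean isUsed(int vendorIndex) {
        for (int i = 0; i < count; i++) {
            if (indices[i] == vendorIndex) {
                return true;
            }
        }
        return false;
    }
}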
Getting back to the profiler, we noticed that significant time was again spent calling various getter methods. So the code was rewritten to read those values once into local variables that could be reused. This helped, reducing the running time to 22 seconds for a single thread.
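A simplified before/after of that change (the getter names, types, and the totalFor method are illustrative assumptions, not the project's actual code; the pattern is simply to hoist values that do not change inside the loop):

// Before (sketch): the same getters are invoked on every iteration of the inner loop.
//   for (int i = 0; i < count; i++) {
//       total += lineItem.getQuantity() * lineItem.getPrice();
//   }

// After (sketch): the values are read once into local variables and reused inside the loop.
static long totalFor(SolutionLineItem lineItem, int count) {
    final long quantity = lineItem.getQuantity();
    final long price = lineItem.getPrice();
    long total = 0;
    for (int i = 0; i < count; i++) {
        total += quantity * price;
    }
    return total;
}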
But the profiler revealed another, surprising characteristic of the code: almost a third of the remaining time was spent calling a simple constructor:
SolutionLineItem solutionLineItem = new SolutionLineItem(quantity, price);
This is a simple data class whose constructor was generated by the Lombok @AllArgsConstructor annotation. Changing it to an empty constructor plus setters was even worse, and in most cases the new object was thrown away at the end of the permutation processing anyway.
So the solution was to construct in advance all the SolutionLineItem objects we might need, since there was only a limited number of quantity/price combinations. The inner-loop code was rewritten to:
SolutionLineItem solutionLineItem = priceQuantityArray[itemPriceIndex][itemQuantityIndex];
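The lookup table itself can be built once before the search starts. In this sketch the prices and quantities arrays are our own assumption about how the combinations are enumerated:

// Pre-build one SolutionLineItem per distinct price/quantity combination, outside the hot loop.
SolutionLineItem[][] priceQuantityArray = new SolutionLineItem[prices.length][quantities.length];
for (int priceIndex = 0; priceIndex < prices.length; priceIndex++) {
    for (int quantityIndex = 0; quantityIndex < quantities.length; quantityIndex++) {
        priceQuantityArray[priceIndex][quantityIndex] =
                new SolutionLineItem(quantities[quantityIndex], prices[priceIndex]);
    }
}

This only works because the pre-built objects are treated as read-only during the search; any code that mutated a SolutionLineItem would now be changing a shared instance.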
Collectively, all optimizations reduced the test case runtime from 60 seconds to 13 seconds for the single-threaded implementation (roughly a 5x improvement) and to 2.9 seconds for the multithreaded implementation (roughly a 20x improvement). This was achieved despite the fact that the initial code looked like an optimal implementation at first glance. The total improvement was the result of many small changes and of analyzing how each step affected the performance.
To summarize, extensive use of data/value objects and method invocations (constructors, getters, setters, etc.) does come at a significant performance cost. Arrays and primitive data types in local variables can provide much better performance than objects and method invocations, at the expense of code readability.