Allocation elimination: when the Java keyword “new” doesn’t create an object

Which of these two methods for calculating the sum of the first N integers do you think runs faster? The answer may surprise you.

Method 1:

public Long sumPrimitive() {
    long total = 0;
    for (int i = 0; i < LOOP_SIZE; i++) {
        total += i;
    }
    return total;
}

Method 2:

public Long sumMutableWrapper() {
    long total = 0;
    for (int i = 0; i < LOOP_SIZE; i++) {
        MutableWrapper mutableWrapper = new MutableWrapper(total);
        mutableWrapper.add(i);
        total = mutableWrapper.value;
    }
    return total;
}

private static class MutableWrapper {
    private long value;

    public MutableWrapper(long value) {
        this.value = value;
    }

    public void add(long other) {
        value += other;
    }
}

NOTE: The research and views presented below are mine and do not necessarily reflect those of my employer.

It appears that method 2 must be slower because it does more work: it creates an object on each loop iteration. In fact, the two methods are equally fast on a HotSpot JVM. The simple analysis neglects the many layers of abstraction between Java source code and the actual execution of the program. If a new object really were created on each iteration, method 2 would indeed be slower; however, the JVM's just-in-time compiler is able to eliminate the object creation entirely, as we'll see below. Keep in mind that the cost of creating short-lived objects is very small, and the effect is only noticeable here because it happens 10,000,000 times. The takeaway of this article is that object creation is not to be feared but embraced: using new does not automatically make your code slower.

For more information on how these tests were run, see my article on object creation costs.

How the just-in-time compiler eliminates object allocation

Here’s the code for method 2, with some comments:

public Long sumMutableWrapper() {
    long total = 0;
    for (int i = 0; i < LOOP_SIZE; i++) {
        // Create mutableWrapper and 
        // set mutableWrapper.value to total
        MutableWrapper mutableWrapper = new MutableWrapper(total);
        // Set mutableWrapper.value to mutableWrapper.value + i 
        mutableWrapper.add(i);
        // Set total to mutableWrapper.value
        total = mutableWrapper.value;
        // The mutableWrapper reference never escaped 
        // method scope, so it cannot be referenced
        // elsewhere by any other objects 
    }
    return total;
}

No reference to mutableWrapper escapes the scope of the method, which allows the just-in-time compiler to safely eliminate the allocation (a transformation known as scalar replacement, guided by escape analysis). The compiler need only ensure the method returns the correct value; it is otherwise free to alter the method however it likes without affecting the rest of the program. The compiler translates the actions described in the comments above into code like this:

// Create mutableWrapper and set mutableWrapper.value to total
long value = total;
// Set mutableWrapper.value to mutableWrapper.value + i 
value += i;
// Set total to mutableWrapper.value
total = value;

Which reduces to:

total += i;
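As a sanity check that this transformation preserves semantics, here is the scalar replacement applied by hand and compared against the wrapper version (a standalone sketch; the class name and the LOOP_SIZE constant are mine, matching the article's 10,000,000):

```java
public class ScalarReplacementCheck {
    static final int LOOP_SIZE = 10_000_000;

    private static class MutableWrapper {
        private long value;
        MutableWrapper(long value) { this.value = value; }
        void add(long other) { value += other; }
    }

    // Original form: allocates a wrapper on each iteration
    // (until the JIT eliminates the allocation).
    static long sumMutableWrapper() {
        long total = 0;
        for (int i = 0; i < LOOP_SIZE; i++) {
            MutableWrapper mutableWrapper = new MutableWrapper(total);
            mutableWrapper.add(i);
            total = mutableWrapper.value;
        }
        return total;
    }

    // Scalar replacement applied by hand: the wrapper's single field
    // becomes a local variable and the allocation disappears.
    static long sumScalarReplaced() {
        long total = 0;
        for (int i = 0; i < LOOP_SIZE; i++) {
            long value = total;  // "new MutableWrapper(total)"
            value += i;          // "mutableWrapper.add(i)"
            total = value;       // "total = mutableWrapper.value"
        }
        return total;
    }

    public static void main(String[] args) {
        long a = sumMutableWrapper();
        long b = sumScalarReplaced();
        if (a != b) throw new AssertionError(a + " != " + b);
        System.out.println(a);
    }
}
```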

There’s a great article from Brian Goetz that goes into more depth on this subject: Urban performance legends, revisited.

Benchmarks

Below are JMH benchmarks for several scenarios that show how the compiler behaves. Each result is the average time to complete the loop with LOOP_SIZE set to 10,000,000.

  • Summing a primitive (method 1 above): 3.6 ms avg time.
    public Long sumPrimitive() {
        long total = 0;
        for (int i = 0; i < LOOP_SIZE; i++) {
            total += i;
        }
        return total;
    }
    
  • Summing a boxed primitive: 39.7 ms avg time.
    A boxed Long is not optimized down to a primitive long: each total += i unboxes the value, adds, and boxes the result into a new Long object on every iteration.

    public Long sumBoxed() {
        Long total = 0L;
        for (int i = 0; i < LOOP_SIZE; i++) {
            total += i;
        }
        return total;
    }
    
  • Summing a primitive wrapped in a mutable wrapper: 3.7 ms avg time.
    This is method 2, see explanation above.

    public Long sumMutableWrapper() {
        long total = 0;
        for (int i = 0; i < LOOP_SIZE; i++) {
            MutableWrapper mutableWrapper = new MutableWrapper(total);
            mutableWrapper.add(i);
            total = mutableWrapper.value;
        }
        return total;
    }
    
  • Same as above, but with allocation elimination disabled: 35.8 ms avg time.
    With the compiler optimization disabled (-XX:-EliminateAllocations), the result is similar to using boxed Longs.

    @Fork(jvmArgs = "-XX:-EliminateAllocations")
    public Long sumMutableWrapperOptOff() {
        long total = 0;
        for (int i = 0; i < LOOP_SIZE; i++) {
            MutableWrapper mutableWrapper = new MutableWrapper(total);
            mutableWrapper.add(i);
            total = mutableWrapper.value;
        }
        return total;
    }
    
  • Mutable wrapper references escape: 40.2 ms avg time.
    When the object reference escapes method scope, allocation cannot be eliminated.

    private MutableWrapper escapee;
    public Long sumEscapedMutableWrapper() {
        long total = 0;
        for (int i = 0; i < LOOP_SIZE; i++) {
            escapee = new MutableWrapper(total);
            escapee.add(i);
            total = escapee.value;
        }
        return total;
    }
    
  • For fun, a solution using streams: 22.8 ms avg time (5.4 ms median).
    The chance of a bug creeping into this code is much smaller than in the above examples.

    public Long sumStream() {
        return LongStream.range(0, LOOP_SIZE).sum();
    }
    

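The slowdown of the boxed version comes from what total += i compiles to when total is a Long: an unbox, a primitive add, and a fresh box on every iteration. A hand-desugared sketch (class name and constant are mine):

```java
public class BoxedSumDesugared {
    static final int LOOP_SIZE = 10_000_000;

    // What `Long total = 0L; total += i;` effectively desugars to:
    // unbox, add, box. Long.valueOf returns a cached object only for
    // small values (-128..127), so nearly every iteration here
    // allocates a new Long.
    static Long sumBoxedDesugared() {
        Long total = Long.valueOf(0L);
        for (int i = 0; i < LOOP_SIZE; i++) {
            total = Long.valueOf(total.longValue() + i);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumBoxedDesugared());
    }
}
```

These allocations cannot be eliminated because the boxed values flow out of the method via the return value, which is why sumBoxed lands near the escape and optimization-off cases in the benchmarks above.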
The results of the benchmarks are plotted below (also: Jupyter notebook with the analysis). The methods in which object allocation can be avoided are much faster than the others (remember that object creation is cheap; the effect only matters because the loop runs so many iterations). The stream solution had a median time of ~5 ms but a maximum time of ~34 ms; more on that later.

Oddly, the time taken by the sumStream method varied, apparently due to different compilation choices made by the JVM (changes in compilation were observed using the -XX:+PrintCompilation JVM flag). This is a reminder that benchmarks are only general guidelines; performance can vary in surprising ways depending on load and environment.

Different compilations result in ~7x variation in performance within the same benchmark run.

Benchmark code is available on GitHub.

Of course, many performance issues can be solved by a better algorithm rather than extensive research into object allocation and compiler optimization. In this case, all of the methods could be replaced by return (long) LOOP_SIZE * (LOOP_SIZE - 1) / 2; (the loops sum 0 through LOOP_SIZE - 1, and the cast avoids int overflow in the multiplication), which would be about seven orders of magnitude faster (or more, given that the result is a compile-time constant).

JVM Details:

$ java -version
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)