Make sure you benchmark correctly

Recently, I read a few posts about Java loop benchmarks. I am not entirely sure the outcome and the advice given are correct, so here is my version. I don’t want to blame any of the other tests for being incorrect - judge for yourself.

Spoiler alert: The type of loop might or might not make a difference. Overall, it is usually not the loop itself that makes the difference but all the surrounding code, which creates the conditions under which some loops run faster despite doing the same work. And yes, parallelStream() is mostly a bad idea.

This article will also give you a brief glimpse into benchmarking and teach you that most benchmarks are wrong - or rather, the benchmark is right but the conclusions drawn from it are wrong. Bring some time; the article got longer than expected. It is not going to teach you how to benchmark with JMH, how the JVM works, or how all of that is executed by the CPU. It might just overwhelm you with facts. But hey, it is not supposed to be the last article on the subject.

Thesis

Not all loops are the same, because some can be optimized by the JVM and some cannot. Some loops are even identical under the hood, despite a different syntax.

Note
All of these benchmarks have been executed on cloud machines with proper cooling (Digital Ocean, Dedicated CPU, Compute-Optimized, 8-core), no throttling, plenty of memory, and Linux as the OS, and the runs were repeated and varied several times.
Warning
Don’t use your average laptop for benchmarking. Modern CPUs have thermal management and flexible frequencies: the more cores you use, the lower the clock speed, and the longer you test, the more the CPU throttles.

Benchmark Code

Here is our first version of the benchmark code. It uses the Java Microbenchmark Harness (JMH 1.36).

Each loop variant sums up the lengths of all strings in the list or array and passes the result to a blackhole to prevent the JVM from optimizing the work away (the JVM is very smart)[1]. These tests used JDK 11.0.18 Temurin.

We use not only a list but also an array to demonstrate the differences between a plain counting for-loop and an enhanced for-loop.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 3, time = 3, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 3, time = 3, timeUnit = TimeUnit.SECONDS)
@Fork(1)
public class ForEachSimple
{
    List<String> list;
    String[] array = null;

    @Param({"1", "11", "101", "1001"})
    int length;

    @Setup
    public void setup()
    {
        array = new String[length];
        list = new ArrayList<>();

        for (int i = 0; i < length; i++)
        {
            array[i] = String.valueOf(i);
            list.add(String.valueOf(i));
        }
    }

    @Benchmark
    public void arrayFor(Blackhole bh)
    {
        int result = 0;
        for (int i = 0; i < array.length; i++)
        {
            result += array[i].length();
        }

        bh.consume(result);
    }

    @Benchmark
    public void arrayEnhanced(Blackhole bh)
    {
        int result = 0;
        for (String s : array)
        {
            result += s.length();
        }

        bh.consume(result);
    }

    @Benchmark
    public void classicFor(Blackhole bh)
    {
        int result = 0;
        for (int i = 0; i < list.size(); i++)
        {
            result += list.get(i).length();
        }

        bh.consume(result);
    }

    @Benchmark
    public void enhancedFor(Blackhole bh)
    {
        int result = 0;
        for (String s : list)
        {
            result += s.length();
        }

        bh.consume(result);
    }

    @Benchmark
    public void whileCounting(Blackhole bh)
    {
        int result = 0;
        int i = 0;

        while (i < list.size())
        {
            result += list.get(i).length();
            i++;
        }

        bh.consume(result);
    }

    @Benchmark
    public void whileIterator(Blackhole bh)
    {
        int result = 0;
        var iterator = list.iterator();

        while (iterator.hasNext())
        {
            result += iterator.next().length();
        }

        bh.consume(result);
    }

    @Benchmark
    public void forEach(Blackhole bh)
    {
        var w = new Wrapper();
        list.forEach(s -> w.sum += s.length());

        bh.consume(w.sum);
    }

    @Benchmark
    public void lambdaStream(Blackhole bh)
    {
        int result = list.stream()
            .mapToInt(s -> s.length())
            .sum();

        bh.consume(result);
    }

    @Benchmark
    public void parallelLambdaStream(Blackhole bh)
    {
        int result = list.parallelStream()
            .mapToInt(s -> s.length())
            .sum();

        bh.consume(result);
    }

    class Wrapper
    {
        int sum = 0;
    }
}

For-Loops on Arrays

Arrays don’t have an Iterator interface, hence we expect the same performance from a classic for-loop and an enhanced for-loop.

Let’s look at the results of the benchmark run with fully warmed-up code. That means the JIT had a chance to make its best compile choices.

Benchmark                           (length)  Mode  Cnt      Score       Error  Units
ForEachSimple.arrayFor                     1  avgt    3      3.539 ±     0.102  ns/op
ForEachSimple.arrayFor                    11  avgt    3     11.753 ±     0.968  ns/op
ForEachSimple.arrayFor                   101  avgt    3     73.158 ±     4.641  ns/op
ForEachSimple.arrayFor                  1001  avgt    3   1182.275 ±    33.812  ns/op

ForEachSimple.arrayEnhanced                1  avgt    3      3.665 ±     0.259  ns/op
ForEachSimple.arrayEnhanced               11  avgt    3     11.994 ±     0.312  ns/op
ForEachSimple.arrayEnhanced              101  avgt    3     73.317 ±     4.613  ns/op
ForEachSimple.arrayEnhanced             1001  avgt    3   1138.964 ±    22.596  ns/op

Ok, the results tell us there is no difference at all, just a little noise. But why is it the same when the code is so different and there is even an extra variable i?

The explanation is simple: if you decompile the Java bytecode back into Java[2], you get identical code. The enhanced for-loop is just syntactic sugar and is compiled into a classic for-loop by javac.

Decompiled Java-Bytecode
@Benchmark
public void arrayFor(Blackhole bh) {
    int result = 0;
    for (int i = 0; i < this.array.length; ++i) {
        result += this.array[i].length();
    }
    bh.consume(result);
}

@Benchmark
public void arrayEnhanced(Blackhole bh) {
    int result = 0;
    String[] arrstring = this.array;
    int n = this.array.length;
    for (int i = 0; i < n; ++i) {
        String s = arrstring[i];
        result += s.length();
    }
    bh.consume(result);
}

For-Loops on Lists

Let’s check out the List version of the loops.

Benchmark                           (length)  Mode  Cnt      Score       Error  Units
ForEachSimple.classicFor                   1  avgt    3      4.563 ±     0.324  ns/op
ForEachSimple.classicFor                  11  avgt    3     17.026 ±     0.810  ns/op
ForEachSimple.classicFor                 101  avgt    3     89.444 ±    10.692  ns/op
ForEachSimple.classicFor                1001  avgt    3   1300.496 ±    20.275  ns/op

ForEachSimple.enhancedFor                  1  avgt    3      5.624 ±     0.345  ns/op
ForEachSimple.enhancedFor                 11  avgt    3     15.098 ±     0.270  ns/op
ForEachSimple.enhancedFor                101  avgt    3     92.262 ±     0.691  ns/op
ForEachSimple.enhancedFor               1001  avgt    3   1406.659 ±   119.568  ns/op

Here we see a larger difference that seems beyond noise, at least for the 1k loop. If we try to decompile this, we will see no difference from our original source (not shown here). So, we have to look directly at the bytecode.

Bytecode for (String s : list)
 0: iconst_0
 1: istore_1
 2: aload_0
 3: getfield        #4   // Field list:Ljava/util/List;
 6: invokeinterface #9, 1// InterfaceMethod java/util/List.iterator:()Ljava/util/Iterator;
11: astore_2
12: aload_2
13: invokeinterface #10, 1// InterfaceMethod java/util/Iterator.hasNext:()Z
18: ifeq            41
21: aload_2
22: invokeinterface #11, 1// InterfaceMethod java/util/Iterator.next:()Ljava/lang/Object;
27: checkcast       #7   // class java/lang/String
30: astore_3
31: iload_1
32: aload_3
33: invokevirtual   #8   // Method java/lang/String.length:()I
36: iadd
37: istore_1
38: goto            12
41: iload_1
42: ireturn

This clearly shows that our enhanced for-loop goes through the Iterator interface, even though we never declared that access, while the plain for-loop uses the get() method instead.
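
In plain Java, javac effectively desugars our enhanced for-loop into the following (a sketch that matches the bytecode above):

int result = 0;
Iterator<String> iterator = list.iterator();
while (iterator.hasNext())
{
    String s = iterator.next(); // this is where the checkcast to String happens
    result += s.length();
}

But is all of that free? Iterators are just objects, after all.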

Memory Profile Comparison

Let’s compare the classic for-loop and the enhanced for-loop on a list memory-wise to confirm our suspicion that we get almost the same performance but might have to pay with memory for it. Just start the benchmark with -prof gc. The result below has been condensed to the most important lines.

Benchmark                                     (length)  Mode  Cnt     Score     Error   Units
ForEachSimple.classicFor:·gc.alloc.rate.norm         1  avgt    3    ≈ 10⁻⁶              B/op
ForEachSimple.classicFor:·gc.alloc.rate.norm        11  avgt    3    ≈ 10⁻⁶              B/op
ForEachSimple.classicFor:·gc.alloc.rate.norm       101  avgt    3    ≈ 10⁻⁵              B/op
ForEachSimple.classicFor:·gc.alloc.rate.norm      1001  avgt    3    ≈ 10⁻⁴              B/op

ForEachSimple.enhancedFor:·gc.alloc.rate.norm        1  avgt    3    ≈ 10⁻⁶              B/op
ForEachSimple.enhancedFor:·gc.alloc.rate.norm       11  avgt    3    ≈ 10⁻⁵              B/op
ForEachSimple.enhancedFor:·gc.alloc.rate.norm      101  avgt    3    ≈ 10⁻⁵              B/op
ForEachSimple.enhancedFor:·gc.alloc.rate.norm     1001  avgt    3    ≈ 10⁻⁴              B/op

That is strange. We access an Iterator and don’t have to pay a dime? That cannot be right, can it? Well, it is right, because you just witnessed one of the modern JVM/GC marvels - Escape Analysis.

When the JIT compiler determines that a created object does not leave the method scope, it can allocate it on the stack instead of on the heap. This means we don’t have to garbage collect it later. Instead, when returning from the method call, the removal of the method stack frame frees the memory automatically and with no effort.
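
To make that tangible, here is a minimal illustration (my own sketch with a made-up Point class, not code from the benchmark):

static final class Point
{
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }
}

// Non-escaping: the Point never leaves the method. The JIT can scalar-replace
// it, keep x and y in registers, and skip the heap allocation entirely.
static int nonEscaping(int x, int y)
{
    Point p = new Point(x, y);
    return p.x + p.y;
}

// Escaping: the Point is handed to an unknown caller, so it must live on the heap.
static Point escaping(int x, int y)
{
    return new Point(x, y);
}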

You can turn off that wonder-drug with -XX:-DoEscapeAnalysis and you will get the result below in return. And sure enough, we do require memory for the enhanced for-loop - 32 bytes, to be precise. But thanks to Escape Analysis (when not turned off, of course), we don’t allocate that on the heap and hence don’t have to garbage collect it. Also, the stack is cheap memory because it is close to the CPU (caches!). Splendid!

Benchmark                                     (length)  Mode  Cnt     Score     Error   Units
ForEachSimple.classicFor:·gc.alloc.rate.norm         1  avgt    3    ≈ 10⁻⁶              B/op
ForEachSimple.classicFor:·gc.alloc.rate.norm        11  avgt    3    ≈ 10⁻⁶              B/op
ForEachSimple.classicFor:·gc.alloc.rate.norm       101  avgt    3    ≈ 10⁻⁵              B/op
ForEachSimple.classicFor:·gc.alloc.rate.norm      1001  avgt    3    ≈ 10⁻⁴              B/op

ForEachSimple.enhancedFor:·gc.alloc.rate.norm        1  avgt    3    32.000 ±   0.001    B/op
ForEachSimple.enhancedFor:·gc.alloc.rate.norm       11  avgt    3    32.000 ±   0.001    B/op
ForEachSimple.enhancedFor:·gc.alloc.rate.norm      101  avgt    3    32.000 ±   0.001    B/op
ForEachSimple.enhancedFor:·gc.alloc.rate.norm     1001  avgt    3    32.000 ±   0.001    B/op
Important
When you benchmark, ensure that you have a theory first. Verify it with the benchmark!

While-Loops

Because we saw that an enhanced for-loop basically goes the Iterator way, let’s use the Iterator explicitly and compare. Additionally, we also use a counting while-loop.

While-Loops vs. Their For-Counterparts
Benchmark                           (length)  Mode  Cnt      Score       Error  Units

## we count with for
ForEachSimple.classicFor                   1  avgt    3      4.563 ±     0.324  ns/op
ForEachSimple.classicFor                  11  avgt    3     17.026 ±     0.810  ns/op
ForEachSimple.classicFor                 101  avgt    3     89.444 ±    10.692  ns/op
ForEachSimple.classicFor                1001  avgt    3   1300.496 ±    20.275  ns/op

## we count with a while
ForEachSimple.whileCounting                1  avgt    3      4.582 ±     0.175  ns/op
ForEachSimple.whileCounting               11  avgt    3     17.005 ±     0.562  ns/op
ForEachSimple.whileCounting              101  avgt    3     89.141 ±     9.996  ns/op
ForEachSimple.whileCounting             1001  avgt    3   1292.825 ±    60.140  ns/op

## automatic mapping to an Iterator
ForEachSimple.enhancedFor                  1  avgt    3      5.624 ±     0.345  ns/op
ForEachSimple.enhancedFor                 11  avgt    3     15.098 ±     0.270  ns/op
ForEachSimple.enhancedFor                101  avgt    3     92.262 ±     0.691  ns/op
ForEachSimple.enhancedFor               1001  avgt    3   1406.659 ±   119.568  ns/op

## manual iterator usage
ForEachSimple.whileIterator                1  avgt    3      5.765 ±     0.375  ns/op
ForEachSimple.whileIterator               11  avgt    3     15.417 ±     0.800  ns/op
ForEachSimple.whileIterator              101  avgt    3     92.195 ±     9.009  ns/op
ForEachSimple.whileIterator             1001  avgt    3   1407.841 ±    35.302  ns/op

No surprise here: the while-loop with a counter matches the classic for-loop, and the Iterator-based while-loop is exactly what we got for our enhanced for-loop.

Caution
A for-loop and a while-loop are not always running the same code. There is a chance that a for-loop can be optimized further.

Streams

Loops are nowadays considered boring and not sexy, so people use streams all the time. So, let’s do streams instead. We can either go the full stream route or just the forEach route, which is not a stream but accepts a function too and is effectively a loop. Our classic for-loop serves as the comparison, of course.

Benchmark                           (length)  Mode  Cnt      Score       Error  Units
ForEachSimple.classicFor                   1  avgt    3      8.953 ±     0.525  ns/op
ForEachSimple.forEach                      1  avgt    3     10.358 ±     0.144  ns/op
ForEachSimple.lambdaStream                 1  avgt    3     45.239 ±     1.071  ns/op
ForEachSimple.parallelLambdaStream         1  avgt    3     70.632 ±     6.422  ns/op

ForEachSimple.classicFor                  11  avgt    3     21.126 ±     0.056  ns/op
ForEachSimple.forEach                     11  avgt    3     24.696 ±     1.727  ns/op
ForEachSimple.lambdaStream                11  avgt    3     54.382 ±     3.151  ns/op
ForEachSimple.parallelLambdaStream        11  avgt    3  32823.435 ±  7157.456  ns/op

ForEachSimple.classicFor                 101  avgt    3    106.671 ±     0.221  ns/op
ForEachSimple.forEach                    101  avgt    3    147.870 ±     2.064  ns/op
ForEachSimple.lambdaStream               101  avgt    3    180.519 ±     1.747  ns/op
ForEachSimple.parallelLambdaStream       101  avgt    3  35915.980 ±  5925.201  ns/op

ForEachSimple.classicFor                1001  avgt    3   1401.623 ±    49.657  ns/op
ForEachSimple.forEach                   1001  avgt    3   1414.010 ±     2.530  ns/op
ForEachSimple.lambdaStream              1001  avgt    3   1597.659 ±    79.037  ns/op
ForEachSimple.parallelLambdaStream      1001  avgt    3  35937.046 ± 13351.198  ns/op

As we can see, the more streamy it is, the slower it gets. A forEach is surprisingly fast. A non-parallel stream seems to have some overhead to get going (a larger difference for small loops). The difference to the leanest loop (the for-loop) becomes negligible when we have more iterations, aka more data to process.

One thing is clear: parallelStream() is a waste as soon as the stream has more than one element. That is not surprising, because giving a thread something to do, getting the result back, and repeating this is a start-and-stop operation. That is not efficient. In our case, the per-element task is light: we just fetch the length of a string, which is little more than the read of an int.

An Expensive Loop

Our loop so far was pretty boring and quick. We have not really burned any CPU, hence the loop overhead was likely large compared to the loop body. We also saw that parallelStream() does not give us any benefit for light operations. So let’s make things a little heavier.

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 2, time = 2, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 2, time = 2, timeUnit = TimeUnit.SECONDS)
@Fork(1)
public class ForEachExpensive
{
    final List<BigInteger> list = new ArrayList<>();
    BigInteger[] array = null;


    @Param({"1", "11", "101", "1001", "100001"})
    int length;

    @Setup
    public void setup()
    {
        array = new BigInteger[length];

        for (int i = 0; i < length; i++)
        {
            array[i] = BigInteger.valueOf(i + 1);
            list.add(BigInteger.valueOf(i + 1));
        }
    }

    @Benchmark
    public void arrayFor(Blackhole bh)
    {
        BigInteger total = BigInteger.ZERO;
        for (int i = 0; i < array.length; i++)
        {
            var b = array[i];
            total = total.add(b.pow(2).divide(b));
        }

        bh.consume(total);
    }

    @Benchmark
    public void arrayEnhanced(Blackhole bh)
    {
        BigInteger total = BigInteger.ZERO;
        for (var b : array)
        {
            total = total.add(b.pow(2).divide(b));
        }

        bh.consume(total);
    }

    @Benchmark
    public void classicFor(Blackhole bh)
    {
        BigInteger total = BigInteger.ZERO;
        for (int i = 0; i < list.size(); i++)
        {
            var b = list.get(i);
            total = total.add(b.pow(2).divide(b));
        }

        bh.consume(total);
    }


    @Benchmark
    public void enhancedFor(Blackhole bh)
    {
        BigInteger total = BigInteger.ZERO;
        for (var b : list)
        {
            total = total.add(b.pow(2).divide(b));
        }

        bh.consume(total);
    }


    @Benchmark
    public void lambdaStream(Blackhole bh)
    {
        var total = list.stream()
            .map(b -> b.pow(2).divide(b))
            .reduce((a, b) -> a.add(b));

        bh.consume(total.orElse(BigInteger.ZERO));
    }

    @Benchmark
    public void parallelLambdaStream(Blackhole bh)
    {
        var total = list.parallelStream()
                        .map(b -> b.pow(2).divide(b))
                        .reduce((a, b) -> a.add(b));

        bh.consume(total.orElse(BigInteger.ZERO));
    }
}

Our example keeps the CPU busy by running some calculations with BigInteger and, because BigInteger is immutable, it causes some memory churn too.

Loops Compared when Loop-Content is Heavier
Benchmark                              (length)  Mode  Cnt        Score   Error  Units
ForEachExpensive.arrayFor                     1  avgt    2       23.160          ns/op
ForEachExpensive.arrayEnhanced                1  avgt    2       24.681          ns/op
ForEachExpensive.classicFor                   1  avgt    2       25.704          ns/op
ForEachExpensive.enhancedFor                  1  avgt    2       26.505          ns/op
ForEachExpensive.lambdaStream                 1  avgt    2       66.207          ns/op
ForEachExpensive.parallelLambdaStream         1  avgt    2       84.996          ns/op

ForEachExpensive.arrayFor                    11  avgt    2      680.240          ns/op
ForEachExpensive.arrayEnhanced               11  avgt    2      687.945          ns/op
ForEachExpensive.classicFor                  11  avgt    2      701.725          ns/op
ForEachExpensive.enhancedFor                 11  avgt    2      692.163          ns/op
ForEachExpensive.lambdaStream                11  avgt    2      787.201          ns/op
ForEachExpensive.parallelLambdaStream        11  avgt    2    34220.859          ns/op

ForEachExpensive.arrayFor                   101  avgt    2     6540.811          ns/op
ForEachExpensive.arrayEnhanced              101  avgt    2     6302.211          ns/op
ForEachExpensive.classicFor                 101  avgt    2     6689.042          ns/op
ForEachExpensive.enhancedFor                101  avgt    2     6657.927          ns/op
ForEachExpensive.lambdaStream               101  avgt    2     7134.313          ns/op
ForEachExpensive.parallelLambdaStream       101  avgt    2    39687.555          ns/op

ForEachExpensive.arrayFor                  1001  avgt    2    64549.934          ns/op
ForEachExpensive.arrayEnhanced             1001  avgt    2    66166.213          ns/op
ForEachExpensive.classicFor                1001  avgt    2    66788.819          ns/op
ForEachExpensive.enhancedFor               1001  avgt    2    66384.256          ns/op
ForEachExpensive.lambdaStream              1001  avgt    2    69821.081          ns/op
ForEachExpensive.parallelLambdaStream      1001  avgt    2    59831.436          ns/op

ForEachExpensive.arrayFor                100001  avgt    2  8813801.499          ns/op
ForEachExpensive.arrayEnhanced           100001  avgt    2  8591560.078          ns/op
ForEachExpensive.classicFor              100001  avgt    2  8702603.394          ns/op
ForEachExpensive.enhancedFor             100001  avgt    2  8670508.307          ns/op
ForEachExpensive.lambdaStream            100001  avgt    2  8763410.470          ns/op
ForEachExpensive.parallelLambdaStream    100001  avgt    2  2199651.895          ns/op

Our expensive payload makes even a single iteration almost five times heavier. So, let’s see what we got.

Even for a loop of one iteration, a stream is significantly slower. As soon as we have about 10 elements, a regular stream is not that much slower anymore. When the size grows further (100 elements and up), a standard stream does not add much overhead.

Warning
When I say 100 elements and up, this is just for this example! This is not a golden rule!

Only when we have more than 1000 elements does a parallel stream finally boost performance - and of course, when we have a lot of elements, it shines.

Tip
The use of parallelStream() is mostly wrong in standard code, unless you really know what you are doing and expect a lot of data. If your stream operations are more expensive, smaller datasets might benefit too. But keep in mind that you are taking CPU away from someone else. If you run an application server, I bet that is not cool. It can be good for data-crunching operations or batch jobs, if you have the CPU power at hand!
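
If you still reach for parallelStream(), one defensive pattern is to make the choice depend on the data size. Here is a hedged sketch - the threshold is an invented number and must be measured for your own workload:

import java.math.BigInteger;
import java.util.List;

public class ConditionalParallel
{
    // Invented break-even point! Benchmark your own payload to find a real one.
    private static final int PARALLEL_THRESHOLD = 10_000;

    static BigInteger total(List<BigInteger> list)
    {
        var stream = list.size() >= PARALLEL_THRESHOLD
            ? list.parallelStream()
            : list.stream();

        // same payload as in the benchmark above
        return stream.map(b -> b.pow(2).divide(b))
                     .reduce(BigInteger.ZERO, BigInteger::add);
    }
}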

Preliminary Conclusion

So far so good. It seems that a classic for-loop and an enhanced for-loop are not really different, besides a little more memory use due to the iterator, which is luckily optimized away by Escape Analysis. A forEach might be cool as well, and even a regular stream operation is not that much overhead, as long as we have some calculations to do and the data stream is long enough.

parallelStream() is a different beast and is likely applied incorrectly most of the time. More threads must be faster - this logic does not hold. More threads, more overhead. The calculations have to be long-running and dependency-free before multi-threading in streams can deliver.

Classic-For vs. Enhanced-For

Can we safely use our enhanced for-loop all the time? Well, we already learned that the enhanced for-loop on a list is not entirely free, while for an array we get the same code, because it is just syntactic sugar.

Well, not so fast, my friend. A benchmark is a benchmark: it certainly measures what we wrote, but we only measure what we think we understand. I am getting ahead of myself here.

Just to show you how difficult benchmarking is, here is an example of our previous code with slight modifications.

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 3, time = 2, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 3, time = 2, timeUnit = TimeUnit.SECONDS)
@Fork(1)
public class ForEachUnroll
{
    List<String> list;

    @Param({"100"})
    int length = 100;

    @Setup
    public void setup()
    {
        list = new ArrayList<>();

        for (int i = 0; i < length; i++)
        {
            list.add(new String("42"));
        }
    }

    @Benchmark
    public int classicFor()
    {
        int result = 0;
        for (int i = 0; i < list.size(); i++)
        {
            var s = list.get(i);
            result += s.length() * 2;
        }

        return result;
    }

    @Benchmark
    public int classicForAsWhile()
    {
        int result = 0;
        int i = 0;
        while (i < list.size())
        {
            var s = list.get(i);
            result += s.length() * 2;

            i++;
        }

        return result;
    }

    @Benchmark
    public int enhancedFor()
    {
        int result = 0;
        for (String s : list)
        {
            result += s.length() * 2;
        }

        return result;
    }

    @Benchmark
    public int enhancedForAsWhile()
    {
        int result = 0;
        var i = list.iterator();
        while (i.hasNext())
        {
            var s = i.next();
            result += s.length() * 2;
        }

        return result;
    }
}

Let’s run that (in this case my T14s is sufficient to show you the challenges), and we get this:

Benchmark                         (length)  Mode  Cnt   Score    Error  Units
ForEachUnroll.classicFor               100  avgt    3  79.412 ±  6.102  ns/op
ForEachUnroll.classicForAsWhile        100  avgt    3  80.223 ± 13.582  ns/op
ForEachUnroll.enhancedFor              100  avgt    3  81.203 ± 13.272  ns/op
ForEachUnroll.enhancedForAsWhile       100  avgt    3  81.936 ±  5.763  ns/op

That is not any different, you might say - and yes, it is not different yet. Wait till the end!

No JIT

Let’s turn off the JIT, aka the C1 and C2 compilers of HotSpot, and run everything in interpreted mode only (-Xint).
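
How you pass the flag is up to you. One way - an assumption about the setup, you can equally pass -jvmArgsAppend -Xint on the JMH command line - is a small runner:

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class InterpreterOnlyRun
{
    public static void main(String[] args) throws RunnerException
    {
        Options opt = new OptionsBuilder()
            .include(ForEachUnroll.class.getSimpleName())
            .jvmArgsAppend("-Xint") // forked JVM runs interpreter-only
            .build();

        new Runner(opt).run();
    }
}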

Benchmark                         (length)  Mode  Cnt      Score       Error  Units
ForEachUnroll.classicFor               100  avgt    3  43439.493 ±   574.240  ns/op
ForEachUnroll.classicForAsWhile        100  avgt    3  43279.290 ±  4737.294  ns/op
ForEachUnroll.enhancedFor              100  avgt    3  36760.306 ±  6235.583  ns/op
ForEachUnroll.enhancedForAsWhile       100  avgt    3  39371.962 ± 12936.379  ns/op

What we see here is that in pure interpreter mode, a classic for-loop is more expensive. It might be slower because we got a little more bytecode (57 vs. 53 bytes), but that is speculation.

Some More Work

We learned that work changes the picture, so let’s give the loops some work to do. This time, we opt for long and expensive code, aka string replacement. Long code here really means a lot of code that is fetched, compiled, and executed.

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 3, time = 2, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 3, time = 2, timeUnit = TimeUnit.SECONDS)
@Fork(1)
public class ForEachUnroll
{
    List<String> list;

    @Param({"100"})
    int length = 100;

    @Setup
    public void setup()
    {
        list = new ArrayList<>();

        for (int i = 0; i < length; i++)
        {
            // new String makes sure we really have a new memory object,
            // not just a reference to the same String
            list.add(new String("42"));
        }
    }

    @Benchmark
    public int classicFor()
    {
        int result = 0;
        for (int i = 0; i < list.size(); i++)
        {
            var s = list.get(i);
            s = s.replace("[0-9]", "aa");
            result += s.length() * 2;
        }

        return result;
    }

    @Benchmark
    public int classicForAsWhile()
    {
        int result = 0;
        int i = 0;
        while (i < list.size())
        {
            var s = list.get(i);
            s = s.replace("[0-9]", "aa");
            result += s.length() * 2;

            i++;
        }

        return result;
    }

    @Benchmark
    public int enhancedFor()
    {
        int result = 0;
        for (String s : list)
        {
            s = s.replace("[0-9]", "aa");
            result += s.length() * 2;
        }

        return result;
    }

    @Benchmark
    public int enhancedForAsWhile()
    {
        int result = 0;
        var i = list.iterator();
        while (i.hasNext())
        {
            var s = i.next();
            s = s.replace("[0-9]", "aa");
            result += s.length() * 2;
        }

        return result;
    }
}

The line s = s.replace("[0-9]", "aa") will burn CPU and, above all, pull a lot more code into the loop body. One caveat, which does not change the point of the test: String.replace treats its arguments as literal text, not as a regular expression, so "[0-9]" never actually matches anything in "42". The snippet below (my side note, not part of the benchmark) shows the difference.
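
"42".replace("[0-9]", "aa");    // -> "42"    (searches for the literal text "[0-9]")
"42".replaceAll("[0-9]", "aa"); // -> "aaaa"  (regex: every digit gets replaced)

Ok, let’s check the results.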

No JIT

Benchmark                         (length)  Mode  Cnt        Score        Error  Units
ForEachUnroll.classicFor               100  avgt    3  164,581.015 ± 15,760.350  ns/op
ForEachUnroll.classicForAsWhile        100  avgt    3  160,875.988 ± 51,765.914  ns/op
ForEachUnroll.enhancedFor              100  avgt    3  154,810.944 ± 21,699.395  ns/op
ForEachUnroll.enhancedForAsWhile       100  avgt    3  154,565.734 ± 18,431.951  ns/op

Ok, enhanced for-loops are still a little faster, but the runtime is huge because nothing gets compiled to native code at all. That is what the first seconds of your application look like.

JIT

Benchmark                         (length)  Mode  Cnt    Score     Error  Units
ForEachUnroll.classicFor               100  avgt    3  113.667 ±  21.731  ns/op
ForEachUnroll.classicForAsWhile        100  avgt    3  116.929 ±  36.295  ns/op
ForEachUnroll.enhancedFor              100  avgt    3  183.416 ± 127.044  ns/op
ForEachUnroll.enhancedForAsWhile       100  avgt    3  194.511 ±  98.850  ns/op

# The result without String::replace for comparison
Benchmark                         (length)  Mode  Cnt    Score     Error  Units
ForEachUnroll.classicFor               100  avgt    3   79.412 ±   6.102  ns/op
ForEachUnroll.classicForAsWhile        100  avgt    3   80.223 ±  13.582  ns/op
ForEachUnroll.enhancedFor              100  avgt    3   81.203 ±  13.272  ns/op
ForEachUnroll.enhancedForAsWhile       100  avgt    3   81.936 ±   5.763  ns/op

Whoops, things changed a lot. Our classic for-loops are significantly faster now. What? Why?

C1 Level 1

Without talking about the details of why and what, just accept the fact that we can prevent the JVM and its JIT compiler from putting too much effort into the compile process. Here, we basically accept only the first and simplest level of compilation: -XX:TieredStopAtLevel=1
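
If you want to tie this experiment to the benchmark instead of the global command line, JMH’s @Fork annotation can carry the flag. A sketch (class name invented, benchmarks unchanged):

import org.openjdk.jmh.annotations.Fork;

// Applies only to this class: stop tiered compilation at level 1 (C1, no C2).
@Fork(value = 1, jvmArgsAppend = {"-XX:TieredStopAtLevel=1"})
public class ForEachUnrollC1
{
    // ... same benchmark methods as ForEachUnroll above ...
}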

Benchmark                         (length)  Mode  Cnt     Score      Error  Units
ForEachUnroll.classicFor               100  avgt    3  2344.140 ±  518.545  ns/op
ForEachUnroll.classicForAsWhile        100  avgt    3  2375.217 ±  998.936  ns/op
ForEachUnroll.enhancedFor              100  avgt    3  2339.279 ± 1298.007  ns/op
ForEachUnroll.enhancedForAsWhile       100  avgt    3  2306.845 ± 1566.545  ns/op

There is almost no runtime difference between the two concepts anymore. Surprise! The C2 compiler seems to be the magic wand here.

C2 and No Inlining

Once again, without explaining it in detail, we test the speed without inlining: -XX:-Inline.

Benchmark                         (length)  Mode  Cnt     Score      Error  Units
ForEachUnroll.classicFor               100  avgt    3  2387.720 ±  269.697  ns/op
ForEachUnroll.classicForAsWhile        100  avgt    3  2332.320 ± 1589.507  ns/op
ForEachUnroll.enhancedFor              100  avgt    3  2259.663 ± 1311.101  ns/op
ForEachUnroll.enhancedForAsWhile       100  avgt    3  2355.178 ±  392.075  ns/op

That result almost resembles our C1 test where we limited the compiler’s smartness. And yes, inlining code, i.e. avoiding jumps to methods and back, creates a huge runtime difference. The CPU is lazy and does not want to jump; a jump is an expensive operation.

What we discovered is that, most likely, the enhanced for-loop and its while-loop sibling have too much code inside the loop body, so the C2 compiler does not inline as much as it could (there are limits).

Let’s ask someone with proper knowledge - JITWatch[3], a tool for understanding the behavior of the Java HotSpot Just-In-Time (JIT) compiler during the execution of a program.
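
To get such a view, the benchmark JVM first has to write a compilation log that JITWatch can open. The flag set is the one JITWatch documents; wiring it through the OptionsBuilder runner from the -Xint example is my assumption:

// Makes the forked JVM write a hotspot_pid*.log file that JITWatch can open.
Options opt = new OptionsBuilder()
    .include(ForEachUnroll.class.getSimpleName())
    .jvmArgsAppend("-XX:+UnlockDiagnosticVMOptions",
                   "-XX:+TraceClassLoading",
                   "-XX:+LogCompilation")
    .build();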

Inlining Chain for classicFor
Figure 1. classicFor Inlining
Inlining Chain for enhancedFor
Figure 2. enhancedFor Inlining

And here we go: we see that two method calls (but only one distinct method) have been inlined into our classic for-loop, while no method was inlined for the enhanced for-loop example. This is not the fault of the loop concept but rather of the methods required to pull off an enhanced for-loop. You can also see that we require six method calls for it, while the fast classic for-loop needs just five. The reason for not inlining a method is mostly its size.

By the way, I find it interesting that List.get() and String.length() have not been inlined. That is a topic for another post, I guess.

Warning
This is just an observation, not any confirmation. I am not a JVM expert, just a user with a touch more under-the-hood knowledge.

Ask the Perf Counters

We already found the explanation for why our classic for-loop suddenly shines. But here is one more thing to check - Perf[4].

CPU perf counters can tell us a lot about the code execution. Let’s query them with the JMH flag -prof perfnorm. You need a little bit of Linux kernel tooling and the right permissions for that.

Benchmark Results with Perf-Counters
Benchmark                                            (length)  Mode  Cnt     Score  Units
ForEachUnroll.classicFor                                  100  avgt    3   113.821  ns/op
ForEachUnroll.classicFor:IPC                              100  avgt          4.936  insns/clk
ForEachUnroll.classicFor:L1-dcache-loads                  100  avgt        522.912  #/op
ForEachUnroll.classicFor:branches                         100  avgt        513.400  #/op
ForEachUnroll.classicFor:cycles                           100  avgt        441.891  #/op
ForEachUnroll.classicFor:instructions                     100  avgt       2181.063  #/op
ForEachUnroll.classicFor:stalled-cycles-backend           100  avgt        203.976  #/op

ForEachUnroll.classicForAsWhile                           100  avgt    3   113.001  ns/op
ForEachUnroll.classicForAsWhile:IPC                       100  avgt          4.902  insns/clk
ForEachUnroll.classicForAsWhile:L1-dcache-loads           100  avgt        520.509  #/op
ForEachUnroll.classicForAsWhile:branches                  100  avgt        507.763  #/op
ForEachUnroll.classicForAsWhile:cycles                    100  avgt        438.979  #/op
ForEachUnroll.classicForAsWhile:instructions              100  avgt       2151.930  #/op
ForEachUnroll.classicForAsWhile:stalled-cycles-backend    100  avgt        202.121  #/op

ForEachUnroll.enhancedFor                                 100  avgt    3   182.292  ns/op
ForEachUnroll.enhancedFor:IPC                             100  avgt          4.450  insns/clk
ForEachUnroll.enhancedFor:L1-dcache-loads                 100  avgt        951.272  #/op
ForEachUnroll.enhancedFor:branches                        100  avgt        521.054  #/op
ForEachUnroll.enhancedFor:cycles                          100  avgt        706.595  #/op
ForEachUnroll.enhancedFor:instructions                    100  avgt       3144.677  #/op
ForEachUnroll.enhancedFor:stalled-cycles-backend          100  avgt        255.732  #/op

ForEachUnroll.enhancedForAsWhile                          100  avgt    3   195.116  ns/op
ForEachUnroll.enhancedForAsWhile:IPC                      100  avgt          4.360  insns/clk
ForEachUnroll.enhancedForAsWhile:L1-dcache-loads          100  avgt        943.672  #/op
ForEachUnroll.enhancedForAsWhile:branches                 100  avgt        516.808  #/op
ForEachUnroll.enhancedForAsWhile:cycles                   100  avgt        710.696  #/op
ForEachUnroll.enhancedForAsWhile:instructions             100  avgt       3098.458  #/op
ForEachUnroll.enhancedForAsWhile:stalled-cycles-backend   100  avgt        252.638  #/op

First of all, mileage may vary. This is AMD Ryzen 4xxx data; Intel data looks slightly different. I also took the liberty of removing everything that does not contribute to our discussion.

Just the most important lines, explained first. The data is given per method call and (important) might also include the benchmark framework code’s CPU usage.

  • IPC (instructions per cycle): How many machine-code instructions the CPU can execute per clock cycle. Modern CPUs can run up to 7 instructions per cycle and hence effectively execute code implicitly in parallel

  • L1-dcache-loads: How often we successfully fetched data from the L1 cache (the miss line has been omitted)

  • Branches: How often the CPU jumped

  • Stalled-cycles-backend: How often the CPU’s backend had to wait, e.g. because we missed the L2 and L3 caches. AMD CPUs don’t tell us which cache we missed

  • Instructions: How much machine code was executed

  • Cycles: How many clock ticks on the CPU we needed

And here is the verdict without any conclusion, just FYI. This is for our example with classic-for being faster than enhanced-for due to inlining.

  • IPC: We run more instructions per cycle, hence we fill the CPU better

  • L1 data cache loads: We have to load less from the caches, which saves time

  • Branches: Both loops run about the same number of branch instructions (give or take)

  • Instructions: We need way more instructions for the same work when we run enhanced for-loops

  • Stalled-cycles-backend: We have to go to main memory less often because we miss less in the caches

  • Cycles: A side effect of all this is that we need fewer clock ticks in the end, hence we are faster

By the way, it is unknown to me why we need so many more instructions in this example compared to our test without the String::replace overhead. Just for completeness, and because we have already listed so much data that one more listing won’t make a difference, here it is. Once again, the data is condensed to show only the important data points and, yes, this is all simplified for your pleasure.

Benchmark Result For-Loops without String.replace
Benchmark                                            (length)  Mode  Cnt     Score  Units
ForEachUnroll.classicFor                                  100  avgt    3    81.446  ns/op
ForEachUnroll.classicFor:IPC                              100  avgt          4.359  insns/clk
ForEachUnroll.classicFor:L1-dcache-loads                  100  avgt        526.824  #/op
ForEachUnroll.classicFor:branches                         100  avgt        167.188  #/op
ForEachUnroll.classicFor:cycles                           100  avgt        331.037  #/op
ForEachUnroll.classicFor:instructions                     100  avgt       1442.873  #/op
ForEachUnroll.classicFor:stalled-cycles-backend           100  avgt        217.216  #/op

ForEachUnroll.classicForAsWhile                           100  avgt    3    82.326  ns/op
ForEachUnroll.classicForAsWhile:IPC                       100  avgt          4.357  insns/clk
ForEachUnroll.classicForAsWhile:L1-dcache-loads           100  avgt        525.339  #/op
ForEachUnroll.classicForAsWhile:branches                  100  avgt        166.581  #/op
ForEachUnroll.classicForAsWhile:cycles                    100  avgt        330.463  #/op
ForEachUnroll.classicForAsWhile:instructions              100  avgt       1439.852  #/op
ForEachUnroll.classicForAsWhile:stalled-cycles-backend    100  avgt        216.553  #/op

ForEachUnroll.enhancedFor                                 100  avgt    3    82.635  ns/op
ForEachUnroll.enhancedFor:IPC                             100  avgt          4.463  insns/clk
ForEachUnroll.enhancedFor:L1-dcache-loads                 100  avgt        531.923  #/op
ForEachUnroll.enhancedFor:branches                        100  avgt        170.008  #/op
ForEachUnroll.enhancedFor:cycles                          100  avgt        332.392  #/op
ForEachUnroll.enhancedFor:instructions                    100  avgt       1483.547  #/op
ForEachUnroll.enhancedFor:stalled-cycles-backend          100  avgt        163.256  #/op

ForEachUnroll.enhancedForAsWhile                          100  avgt    3    86.888  ns/op
ForEachUnroll.enhancedForAsWhile:IPC                      100  avgt          4.443  insns/clk
ForEachUnroll.enhancedForAsWhile:L1-dcache-loads          100  avgt        528.162  #/op
ForEachUnroll.enhancedForAsWhile:branches                 100  avgt        167.795  #/op
ForEachUnroll.enhancedForAsWhile:cycles                   100  avgt        329.708  #/op
ForEachUnroll.enhancedForAsWhile:instructions             100  avgt       1465.001  #/op
ForEachUnroll.enhancedForAsWhile:stalled-cycles-backend   100  avgt        165.720  #/op
Caution
This does not say that enhanced for-loops are worse. It just says that we have hit a condition which makes code using enhanced for-loops less ideal for the CPU.

JDK 17

Just to confuse you even more, here is a JDK 11 to JDK 17 comparison.

## JDK 11.0.17
Benchmark                         (length)  Mode  Cnt    Score     Error  Units
ForEachUnroll.classicFor               100  avgt    3  113.667 ±  21.731  ns/op
ForEachUnroll.classicForAsWhile        100  avgt    3  116.929 ±  36.295  ns/op
ForEachUnroll.enhancedFor              100  avgt    3  183.416 ± 127.044  ns/op
ForEachUnroll.enhancedForAsWhile       100  avgt    3  194.511 ±  98.850  ns/op

## JDK 17.0.4
Benchmark                         (length)  Mode  Cnt    Score    Error  Units
ForEachUnroll.classicFor               100  avgt    3  144.518 ±  8.738  ns/op
ForEachUnroll.classicForAsWhile        100  avgt    3  150.021 ± 26.712  ns/op
ForEachUnroll.enhancedFor              100  avgt    3  191.386 ± 46.860  ns/op
ForEachUnroll.enhancedForAsWhile       100  avgt    3  189.995 ± 48.755  ns/op

NO! Are our results void now? Yes, they are. The difference between the two loop implementations got smaller, and the classic for-loop got slower, which is certainly unexpected. For now, I will stop drilling further into that. Another day, another post. Hint: Inlining might have changed due to changes in the underlying code.

Caution
Don’t assume that what you have learned is a law forever. Things can change, even for the worse.

Conclusion

Don’t take anything for granted. Small code variations might produce totally different results. If you benchmark, benchmark the code you will really run later, not some bogus shortcut.

As for which loop is better: you cannot tell easily. AND... these are all micro-benchmarks, hence when you use them in your program as shown here, you won’t see ANY runtime difference. You will only see an impact when you have computationally heavy programs and these loops are on your hot path.

Tip
Don’t worry too much about performance upfront. You need a problem first, you have to understand that problem, and you have to benchmark it properly.

Sure, when you start with the wrong algorithm, such as bubble sort instead of quicksort, things are messy right from the start. But that is not about language and compiler details; it is about a higher-level abstraction. A List might also be the wrong choice when you later want to access data by a unique key (a Map is preferred), but once again, that is an algorithmic change, not a low-level CPU/memory micro-optimization.