Loop Unrolling in Java - JVM JIT

How the JVM evaluates loops.

The difference between source code and how code is executed

Modern compilers optimize source code to ensure efficient execution. If code were executed exactly as it is written, it would often perform poorly. This is particularly evident in loops, where naive execution can significantly reduce efficiency. Without optimizations, the code generated for a loop does exactly what the source says on every iteration:

  • Execute the loop body.
  • Check the loop termination condition.
  • Jump back to the beginning of the loop.

Such execution is inefficient because modern processors overlap the stages of instruction processing (fetching the next instruction, decoding, executing, writing back the result), so several instructions are in flight in the same clock cycle. This technique is called pipelining. The pipeline depth (number of stages) varies between processors. For example:

  • Intel Pentium 4: 20 stages.
  • Intel Pentium 4 Prescott: 31 stages.

When a loop is executed naively (e.g., one loop-body operation per iteration), the frequent jumps back to the beginning of the loop can disrupt the pipeline. Whenever the processor has fetched the wrong instructions after such a jump, it must flush the pipeline, and that penalty is comparable to the cost of a cache miss.

Impact of Pipeline Flushes on Loop Execution

  • Pipeline Disruption: Each iteration invalidates the pipeline, causing the CPU to restart instruction execution, which wastes cycles.
  • Inefficiency: The processor is unable to fully utilize the pipelined execution, where multiple instructions are processed simultaneously.
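
Loop unrolling attacks this cost by doing the work of several iterations per pass, so the compare-and-jump at the end of the loop is executed less often. The following is a minimal hand-written sketch of the idea in Java. The class and method names are hypothetical and not part of the benchmark, and the JIT applies this transformation to machine code, not to source code.

```java
import java.util.Random;

// Hypothetical illustration class; not part of the original benchmark.
final class ManualUnroll {

    // Straightforward loop: one addition, one compare and one jump per element.
    static long sumNaive(long[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    // Hand-unrolled by 4: one compare and one jump per four elements.
    static long sumUnrolled4(long[] data) {
        long sum = 0;
        int i = 0;
        int limit = data.length - 3; // four elements must still fit at index i
        for (; i < limit; i += 4) {
            sum += data[i];
            sum += data[i + 1];
            sum += data[i + 2];
            sum += data[i + 3];
        }
        // Post-loop: handle the 0 to 3 elements that are left over.
        for (; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] data = new Random(42).longs(1_000_003).toArray();
        // Both variants add the same elements, so the sums are equal.
        System.out.println(sumNaive(data) == sumUnrolled4(data));
    }
}
```

The post-loop is needed because the array length is not always a multiple of the unroll factor. The same shape (a main loop followed by a short clean-up loop) shows up in the JIT-generated assembly later in this post.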

To better understand the impact of compiler optimizations, consider an experiment where:

  • Memory Allocation: Allocate an array of 1,000,000 long integers in memory.
  • Data Population: Fill the array with random values.

Comparison:

  • Analyze how loops compile in C without optimization.
  • Evaluate how HotSpot JIT optimizes loop execution in Java 11 and Java 17.

Loops in C code

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

unsigned long xorshift(unsigned long state[static 1]) {
    unsigned long x = state[0];
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    state[0] = x;
    return x;
}

long random_long(long min, long max) {
    int urandom = open("/dev/urandom", O_RDONLY);
    unsigned long state[1];
    read(urandom, state, sizeof(state));
    close(urandom);
    unsigned long range = (unsigned long) max - min + 1;
    unsigned long random_value = xorshift(state) % range;
    return (long) (random_value + min);
}


int main(int argc, char** argv) {
    int MAX = 1000000;

    long* data = (long*)calloc(MAX, sizeof(long));

    for (int i = 0; i < MAX; i++) {
        data[i] = random_long(0,MAX);
    }
}
gcc -S loopunrolling.c 

Let’s consider only the part of the assembly code that belongs to the main function. As we can see, there is exactly one call to the random_long function per loop iteration, which is expected.

main:
.LFB8:
	.cfi_startproc
	endbr64
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register 6
	pushq	%rbx
	subq	$40, %rsp
	.cfi_offset 3, -24
	movl	%edi, -36(%rbp)
	movq	%rsi, -48(%rbp)
	movl	$1000000, -28(%rbp)
	movl	-28(%rbp), %eax
	cltq
	movl	$8, %esi
	movq	%rax, %rdi
	call	calloc@PLT
	movq	%rax, -24(%rbp)
	movl	$0, -32(%rbp)
	jmp	.L7
.L8:
	movl	-28(%rbp), %eax
	cltq
	movl	-32(%rbp), %edx
	movslq	%edx, %rdx
	leaq	0(,%rdx,8), %rcx
	movq	-24(%rbp), %rdx
	leaq	(%rcx,%rdx), %rbx
	movq	%rax, %rsi
	movl	$0, %edi
	call	random_long
	movq	%rax, (%rbx)
	addl	$1, -32(%rbp)
.L7:
	movl	-32(%rbp), %eax
	cmpl	-28(%rbp), %eax
	jl	.L8
	movl	$0, %eax
	movq	-8(%rbp), %rbx
	leave
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc
.LFE8:
	.size	main, .-main
	.ident	"GCC: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0"
	.section	.note.GNU-stack,"",@progbits
	.section	.note.gnu.property,"a"
	.align 8
	.long	1f - 0f
	.long	4f - 1f
	.long	5
	

Loops in Java

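Now let’s do the same in Java: fill a long[] and then sum it. The Java code differs from the C version in one respect: the summing loop lives in a separate intStride1 method, because the JIT compiles code method by method (the method is the smallest unit of compilation).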

public class LoopUnroll {
    private static int MAX = 1000000;
    private static long[] data = new long[MAX];

    public static void main(String[] args) {
        java.util.Random random = new java.util.Random();

        for (int i = 0; i < MAX; i++) {
            data[i] = random.nextLong();
        }
        final long sum = intStride1();

        System.out.println("Out");
        System.out.println(sum);
    }

    private static long intStride1()
    {
        long sum = 0; // long, not int: an int accumulator would silently truncate each addition
        for (int i = 0; i < MAX; i += 1)
        {
            sum += data[i];
        }
        return sum;
    }
}

Bytecode

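In the bytecode, we focus on the private static long intStride1() method. Note that the listing below was produced from a variant in which both sum and the loop counter are declared long: the int field MAX is widened with i2l for the comparison, and the counter is narrowed with l2i before it indexes the array. The loop body contains two ladd instructions per iteration: one adds data[i] to sum (instruction 20) and the other increments the counter i (instruction 24). The bytecode therefore performs exactly one array addition per loop iteration. javac does not unroll the loop; any such optimization is left to the JIT compiler at run time.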

javap -p -v LoopUnroll.class
// -- omitted
  private static long intStride1();
    descriptor: ()J
    flags: (0x000a) ACC_PRIVATE, ACC_STATIC
    Code:
      stack=5, locals=4, args_size=0
         0: lconst_0
         1: lstore_0
         2: lconst_0
         3: lstore_2
         4: lload_2
         5: getstatic     #10                 // Field MAX:I
         8: i2l
         9: lcmp
        10: ifge          29
        13: lload_0
        14: getstatic     #16                 // Field data:[J
        17: lload_2
        18: l2i
        19: laload
        20: ladd
        21: lstore_0
        22: lload_2
        23: lconst_1
        24: ladd
        25: lstore_2
        26: goto          4
        29: lload_0
        30: lreturn
      LineNumberTable:
        line 21: 0
        line 22: 2
        line 24: 13
        line 22: 22
        line 26: 29
      StackMapTable: number_of_entries = 2
        frame_type = 253 /* append */
          offset_delta = 4
          locals = [ long, long ]
        frame_type = 250 /* chop */
          offset_delta = 24
// -- omitted
SourceFile: "LoopUnroll.java"

Benchmark

We will evaluate two loop variants that differ only in the counter type, one using int and the other using long, to observe how the counter type affects the code generated by the JIT compiler, loop unrolling, and safepoint placement. To ensure the methods are not inlined into the benchmark harness, we annotate them with @CompilerControl(CompilerControl.Mode.DONT_INLINE).

The benchmark will be run with different VM options to control code generation:

  • -XX:+UseCountedLoopSafepoints: Controls the presence of safepoints within the loop.
  • -XX:LoopStripMiningIter=<number_of_iterations>: Sets the number of iterations in the inner loop. A safepoint will be inserted in the outer loop, while the inner loop will remain safepoint-free. The default is 1,000 iterations.
  • -XX:LoopStripMiningIterShortLoop=<number_of_iterations>: Loops with fewer than the specified number of iterations will not have a safepoint.

Listing

package com.rkdeep;

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@State(Scope.Thread)
@Fork(value = 1, jvmArgsPrepend = {"-XX:+UnlockDiagnosticVMOptions", "-XX:-UseCompressedOops", "-XX:PrintAssemblyOptions=intel", "-XX:LoopStripMiningIter=10000", "-XX:-UseCountedLoopSafepoints"})
public class LoopUnrollBenchmark {

    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public void baseline() {
    }

    private static final int MAX = 1_000_000;

    private long[] data = new long[MAX];

    @Setup
    public void createData()
    {
        java.util.Random random = new java.util.Random();

        for (int i = 0; i < MAX; i++)
        {
            data[i] = random.nextLong();
        }
    }

    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public long intStride1()
    {
        long sum = 0;
        for (int i = 0; i < MAX; i++)
        {
            sum += data[i];
        }
        return sum;
    }

    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public long longStride1()
    {
        long sum = 0;
        for (long l = 0; l < MAX; l++)
        {
            sum += data[(int) l];
        }
        return sum;
    }
}
java  -jar target/benchmarks.jar -prof perfasm

Java 11 counter loop

Run the benchmark with the following VM arguments.

@Fork(value = 1, jvmArgsPrepend = {"-XX:+UnlockDiagnosticVMOptions", "-XX:-UseCompressedOops", "-XX:PrintAssemblyOptions=intel"})
c2, level 4, com.rkdeep.LoopUnrollBenchmark::intStride1, version 3, compile id 646 

              0x00007fdee83d10d0: cmp    r10d,0xf423f
              0x00007fdee83d10d7: jbe    0x00007fdee83d115d
              0x00007fdee83d10dd: mov    rax,QWORD PTR [r9+0x18]  ;*laload {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - com.rkdeep.LoopUnrollBenchmark::intStride1@16 (line 72)
              0x00007fdee83d10e1: mov    r10d,0x1           ;*goto {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - com.rkdeep.LoopUnrollBenchmark::intStride1@22 (line 70)
              0x00007fdee83d10e7: mov    r8d,0xfa0
           ↗  0x00007fdee83d10ed: mov    ecx,0xf423d
           │  0x00007fdee83d10f2: sub    ecx,r10d
           │  0x00007fdee83d10f5: cmp    ecx,r8d
   0.02%   │  0x00007fdee83d10f8: cmovg  ecx,r8d
           │  0x00007fdee83d10fc: add    ecx,r10d
           │  0x00007fdee83d10ff: nop                       ;*lload_1 {reexecute=0 rethrow=0 return_oop=0}; - com.rkdeep.LoopUnrollBenchmark::intStride1@10 (line 72)
   0.06%  ↗│  0x00007fdee83d1100: add    rax,QWORD PTR [r9+r10*8+0x18]
  31.11%  ││  0x00007fdee83d1105: add    rax,QWORD PTR [r9+r10*8+0x20]
  22.35%  ││  0x00007fdee83d110a: add    rax,QWORD PTR [r9+r10*8+0x28]
  22.42%  ││  0x00007fdee83d110f: add    rax,QWORD PTR [r9+r10*8+0x30]
          ││                                                ;*ladd {reexecute=0 rethrow=0 return_oop=0}
          ││                                                ; - com.rkdeep.LoopUnrollBenchmark::intStride1@17 (line 72)
  22.14%  ││  0x00007fdee83d1114: add    r10d,0x4           ;*iinc {reexecute=0 rethrow=0 return_oop=0}
          ││                                                ; - com.rkdeep.LoopUnrollBenchmark::intStride1@19 (line 70)
   0.02%  ││  0x00007fdee83d1118: cmp    r10d,ecx
          ╰│  0x00007fdee83d111b: jl     0x00007fdee83d1100  ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}; - com.rkdeep.LoopUnrollBenchmark::intStride1@7 (line 70)
           │  0x00007fdee83d111d: mov    r14,QWORD PTR [r15+0x108]
           │                                                ; ImmutableOopMap{r11=Oop r9=Oop };*goto {reexecute=1 rethrow=0 return_oop=0}; - com.rkdeep.LoopUnrollBenchmark::intStride1@22 (line 70)
   0.01%   │  0x00007fdee83d1124: test   DWORD PTR [r14],eax  ;*goto {reexecute=0 rethrow=0 return_oop=0}; - com.rkdeep.LoopUnrollBenchmark::intStride1@22 (line 70);   {poll}
   0.12%   │  0x00007fdee83d1127: cmp    r10d,0xf423d
           ╰  0x00007fdee83d112e: jl     0x00007fdee83d10ed
              0x00007fdee83d1130: cmp    r10d,0xf4240
              0x00007fdee83d1137: jge    0x00007fdee83d114d
              0x00007fdee83d1139: data16 xchg ax,ax         ;*lload_1 {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - com.rkdeep.LoopUnrollBenchmark::intStride1@10 (line 72)
              0x00007fdee83d113c: add    rax,QWORD PTR [r9+r10*8+0x18]
                                                            ;*ladd {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - com.rkdeep.LoopUnrollBenchmark::intStride1@17 (line 72)
....................................................................................................
  98.23%  <total for region 1>

....[Hottest Region 1]..............................................................................
c2, level 4, com.rkdeep.LoopUnrollBenchmark::longStride1, version 3, compile id 630 

               0x00007f8cc43cf124: jne    0x00007f8cbc84b080  ;   {runtime_call ic_miss_stub}
               0x00007f8cc43cf12a: xchg   ax,ax
               0x00007f8cc43cf12c: nop    DWORD PTR [rax+0x0]
             [Verified Entry Point]
               0x00007f8cc43cf130: mov    DWORD PTR [rsp-0x14000],eax
               0x00007f8cc43cf137: push   rbp
               0x00007f8cc43cf138: sub    rsp,0x30           ;*synchronization entry
                                                             ; - com.rkdeep.LoopUnrollBenchmark::longStride1@-1 (line 81)
               0x00007f8cc43cf13c: mov    r10,QWORD PTR [rsi+0x10]  ;*getfield data {reexecute=0 rethrow=0 return_oop=0}
                                                             ; - com.rkdeep.LoopUnrollBenchmark::longStride1@14 (line 84)
   0.00%       0x00007f8cc43cf140: mov    r9d,DWORD PTR [r10+0x10]  ;*laload {reexecute=0 rethrow=0 return_oop=0}
                                                             ; - com.rkdeep.LoopUnrollBenchmark::longStride1@19 (line 84)
                                                             ; implicit exception: dispatches to 0x00007f8cc43cf1a4
   0.01%       0x00007f8cc43cf144: xor    eax,eax            ;*goto {reexecute=0 rethrow=0 return_oop=0}
                                                             ; - com.rkdeep.LoopUnrollBenchmark::longStride1@26 (line 82)
               0x00007f8cc43cf146: xor    r11d,r11d
               0x00007f8cc43cf149: xor    r8d,r8d
          ╭    0x00007f8cc43cf14c: jmp    0x00007f8cc43cf153
          │    0x00007f8cc43cf14e: xchg   ax,ax
  12.00%  │ ↗  0x00007f8cc43cf150: mov    r11d,r8d           ;*lload_1 {reexecute=0 rethrow=0 return_oop=0}
          │ │                                                ; - com.rkdeep.LoopUnrollBenchmark::longStride1@12 (line 84)
  10.85%  ↘ │  0x00007f8cc43cf153: cmp    r11d,r9d
           ╭│  0x00007f8cc43cf156: jae    0x00007f8cc43cf184
   9.98%   ││  0x00007f8cc43cf158: add    rax,QWORD PTR [r10+r11*8+0x18]
           ││                                                ;*ladd {reexecute=0 rethrow=0 return_oop=0}
           ││                                                ; - com.rkdeep.LoopUnrollBenchmark::longStride1@20 (line 84)
  24.21%   ││  0x00007f8cc43cf15d: mov    r11,QWORD PTR [r15+0x108]
  11.56%   ││  0x00007f8cc43cf164: add    r8,0x1             ; ImmutableOopMap{r10=Oop rsi=Oop }
           ││                                                ;*goto {reexecute=1 rethrow=0 return_oop=0}
           ││                                                ; - com.rkdeep.LoopUnrollBenchmark::longStride1@26 (line 82)
  10.62%   ││  0x00007f8cc43cf168: test   DWORD PTR [r11],eax  ;*goto {reexecute=0 rethrow=0 return_oop=0}
           ││                                                ; - com.rkdeep.LoopUnrollBenchmark::longStride1@26 (line 82)
           ││                                                ;   {poll}
  18.83%   ││  0x00007f8cc43cf16b: cmp    r8,0xf4240
           │╰  0x00007f8cc43cf172: jl     0x00007f8cc43cf150  ;*ifge {reexecute=0 rethrow=0 return_oop=0}; - com.rkdeep.LoopUnrollBenchmark::longStride1@9 (line 82)
           │   0x00007f8cc43cf174: add    rsp,0x30
           │   0x00007f8cc43cf178: pop    rbp
   0.01%   │   0x00007f8cc43cf179: mov    r10,QWORD PTR [r15+0x108]
           │   0x00007f8cc43cf180: test   DWORD PTR [r10],eax  ;   {poll_return}
           │   0x00007f8cc43cf183: ret    
           ↘   0x00007f8cc43cf184: mov    rbp,rsi
               0x00007f8cc43cf187: mov    QWORD PTR [rsp],r8
               0x00007f8cc43cf18b: mov    QWORD PTR [rsp+0x8],rax
               0x00007f8cc43cf190: mov    QWORD PTR [rsp+0x10],r10
               0x00007f8cc43cf195: mov    DWORD PTR [rsp+0x18],r11d
               0x00007f8cc43cf19a: mov    esi,0xffffffe4
               0x00007f8cc43cf19f: call   0x00007f8cbc849e00  ; ImmutableOopMap{rbp=Oop [16]=Oop }
                                                             ;*laload {reexecute=0 rethrow=0 return_oop=0}
....................................................................................................
  98.08%  <total for region 1>

Benchmark                             Mode  Cnt          Score          Error  Units
LoopUnrollBenchmark.baseline         thrpt    5  420136389.339 ± 61698598.658  ops/s
LoopUnrollBenchmark.baseline:asm     thrpt                 NaN                   ---
LoopUnrollBenchmark.intStride1       thrpt    5       2457.647 ±      176.800  ops/s
LoopUnrollBenchmark.intStride1:asm   thrpt                 NaN                   ---
LoopUnrollBenchmark.longStride1      thrpt    5       1391.287 ±       85.554  ops/s
LoopUnrollBenchmark.longStride1:asm  thrpt                 NaN                   ---

When the counter type is int, C2 splits the loop into two loops: an outer loop and an inner loop. The body of the inner loop is unrolled by a factor of 4, so each pass of the inner loop adds four array elements. The safepoint poll (the test instruction marked {poll} in the listing) sits in the outer loop, after the inner loop has finished.
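
The hexadecimal constants in the int-counter listing above are easier to follow once converted to decimal. The sketch below does that arithmetic. The class and constant names are hypothetical, and the role assigned to each constant is my reading of the listing, not something stated by HotSpot.

```java
// Hypothetical helper for reading the listing; not part of the benchmark.
final class AsmConstants {
    static final int ARRAY_LENGTH = 0xf4240; // cmp r10d,0xf4240 (end of the post-loop)
    static final int MAIN_LIMIT   = 0xf423d; // cmp r10d,0xf423d (end of the unrolled main loop)
    static final int STRIP_LIMIT  = 0xfa0;   // mov r8d,0xfa0    (cap on elements per inner loop)
    static final int UNROLL       = 4;       // add r10d,0x4     (elements added per pass)

    public static void main(String[] args) {
        System.out.println(ARRAY_LENGTH);              // 1000000
        System.out.println(ARRAY_LENGTH - MAIN_LIMIT); // 3
        System.out.println(STRIP_LIMIT);               // 4000
        System.out.println(STRIP_LIMIT / UNROLL);      // 1000
        System.out.println(ARRAY_LENGTH / STRIP_LIMIT); // 250
    }
}
```

Under this reading, each run of the inner loop covers at most 4,000 elements, that is 1,000 passes of the 4x-unrolled body, which lines up with the 1,000-iteration LoopStripMiningIter default mentioned earlier. The main loop stops three elements short of the array length so that the four loads of one pass always stay in bounds.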

A safepoint is a point in the code where data is in a consistent state, allowing threads to be safely paused for operations such as stack trace collection or garbage collection (GC). For clarity, the executed loop can be represented as shown in the listing below.

        for (int j = 0; j < 250; j++) {
            int base = j * 4_000; // each outer iteration covers the next 4,000 elements
            for (int i = base; i < base + 4_000; i = i+4) {
                sum += data[i];
                sum += data[i+1];
                sum += data[i+2];
                sum += data[i+3];
            }
            // safepoint
        }
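
The listing above can be turned into a runnable sketch that works for any array length. Everything here is an illustration under stated assumptions: the class name, the stripElements parameter, and the polls counter are hypothetical, and the counter merely stands in for the safepoint poll that the JIT emits in the outer loop.

```java
import java.util.Random;

// Hypothetical illustration; names are not from the benchmark or from HotSpot.
final class StripMinedSum {

    // Number of simulated safepoint polls during the last call to sum().
    static int polls;

    static long sum(long[] data, int stripElements) {
        if (stripElements <= 0) {
            throw new IllegalArgumentException("stripElements must be positive");
        }
        polls = 0;
        long sum = 0;
        int i = 0;
        int mainLimit = data.length - 3; // four elements must still fit at index i
        while (i < mainLimit) {
            // End of the current strip, clamped to the main-loop limit.
            int stripEnd = (int) Math.min((long) i + stripElements, (long) mainLimit);
            for (; i < stripEnd; i += 4) { // inner loop: no poll inside
                sum += data[i];
                sum += data[i + 1];
                sum += data[i + 2];
                sum += data[i + 3];
            }
            polls++; // outer loop: stands in for the safepoint poll
        }
        for (; i < data.length; i++) { // post-loop for the leftover elements
            sum += data[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] data = new Random(42).longs(1_000_000).toArray();
        long total = sum(data, 4_000);
        System.out.println(total + " after " + polls + " polls");
    }
}
```

With 1,000,000 elements and strips of 4,000 elements, the outer loop runs 250 times, so a thread would reach a poll after at most 1,000 passes of the unrolled body instead of after every element.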

Unlike the loop with an int counter, the loop with a long counter is compiled without loop unrolling, and a safepoint poll is executed in every iteration. This behavior can be represented in pseudocode, as shown in the listing below.

        for (long i = 0; i < 1_000_000; i++) {
            sum += data[(int) i];
            // safepoint
        }
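
For code that must run on a JIT that treats long-counted loops this way, one source-level workaround is to split the long range into chunks and walk each chunk with an int counter, which is the loop shape that was unrolled in the int-counter listing. The sketch below is only an illustration: the class, the method names, and the chunk parameter are hypothetical, it assumes limit does not exceed data.length, and it makes no claim about how HotSpot itself handles long counters.

```java
import java.util.Random;

// Hypothetical illustration; not part of the benchmark.
final class LongCounterChunks {

    // Same shape as longStride1: the loop counter is a long.
    static long sumLongCounter(long[] data, long limit) {
        long sum = 0;
        for (long l = 0; l < limit; l++) {
            sum += data[(int) l];
        }
        return sum;
    }

    // Outer loop keeps the long counter; inner loop uses an int counter.
    static long sumChunked(long[] data, long limit, int chunk) {
        if (chunk <= 0) {
            throw new IllegalArgumentException("chunk must be positive");
        }
        long sum = 0;
        for (long base = 0; base < limit; base += chunk) {
            // Elements in this chunk; never larger than chunk, so it fits in an int.
            int n = (int) Math.min((long) chunk, limit - base);
            for (int i = 0; i < n; i++) {
                sum += data[(int) (base + i)];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] data = new Random(42).longs(1_000_000).toArray();
        System.out.println(
            sumLongCounter(data, data.length) == sumChunked(data, data.length, 1_000));
    }
}
```

Whether the inner loop is actually unrolled depends on the JVM version and flags, so the effect should be checked with the same perfasm approach used in this post.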

Let’s take the Java 11 results as a baseline.

Java 17 counter loop safepoints control

Benchmark without safepoints -XX:-UseCountedLoopSafepoints

Remove the safepoint poll from the loop with "-XX:-UseCountedLoopSafepoints" and set the inner loop to 10000 iterations with "-XX:LoopStripMiningIter=10000".

@Fork(value = 1, jvmArgsPrepend = {"-XX:+UnlockDiagnosticVMOptions", "-XX:-UseCompressedOops", "-XX:+UseSuperWord", "-XX:PrintAssemblyOptions=intel", "-XX:LoopStripMiningIter=10000", "-XX:-UseCountedLoopSafepoints"})
Result "com.rkdeep.LoopUnrollBenchmark.intStride1":
  2581.171 ±(99.9%) 14.527 ops/s [Average]
  (min, avg, max) = (2575.700, 2581.171, 2585.076), stdev = 3.773
  CI (99.9%): [2566.645, 2595.698] (assumes normal distribution)

Secondary result "com.rkdeep.LoopUnrollBenchmark.intStride1:asm":
PrintAssembly processed: 166164 total address lines.
Perf output processed (skipped 59.009 seconds):
 Column 1: cycles (49732 events)

Hottest code regions (>10.00% "cycles" events):
 Event counts are percents of total event count.

....[Hottest Region 1]..............................................................................
c2, level 4, com.rkdeep.LoopUnrollBenchmark::intStride1, version 3, compile id 721 

   0.01%      0x00007f3ee4fd7453:   mov    r8d,DWORD PTR [r10+0xc]      ; implicit exception: dispatches to 0x00007f3ee4fd751c
   0.01%      0x00007f3ee4fd7457:   test   r8d,r8d
              0x00007f3ee4fd745a:   jbe    0x00007f3ee4fd751c
              0x00007f3ee4fd7460:   cmp    r8d,0xf423f
              0x00007f3ee4fd7467:   jbe    0x00007f3ee4fd751c
              0x00007f3ee4fd746d:   mov    rax,QWORD PTR [r10+0x10]     ;*laload {reexecute=0 rethrow=0 return_oop=0}
                                                                        ; - com.rkdeep.LoopUnrollBenchmark::intStride1@16 (line 73)
              0x00007f3ee4fd7471:   mov    r11d,0x1
          ╭   0x00007f3ee4fd7477:   jmp    0x00007f3ee4fd7483
          │   0x00007f3ee4fd7479:   nop    DWORD PTR [rax+0x0]
   0.01%  │↗  0x00007f3ee4fd7480:   mov    r11d,r9d                     ;*lload_1 {reexecute=0 rethrow=0 return_oop=0}
          ││                                                            ; - com.rkdeep.LoopUnrollBenchmark::intStride1@10 (line 73)
   6.03%  ↘│  0x00007f3ee4fd7483:   add    rax,QWORD PTR [r10+r11*8+0x10]
   0.01%   │  0x00007f3ee4fd7488:   add    rax,QWORD PTR [r10+r11*8+0x18]
   5.98%   │  0x00007f3ee4fd748d:   add    rax,QWORD PTR [r10+r11*8+0x20]
   5.69%   │  0x00007f3ee4fd7492:   add    rax,QWORD PTR [r10+r11*8+0x28]
   5.73%   │  0x00007f3ee4fd7497:   add    rax,QWORD PTR [r10+r11*8+0x30]
   5.89%   │  0x00007f3ee4fd749c:   add    rax,QWORD PTR [r10+r11*8+0x38]
   7.75%   │  0x00007f3ee4fd74a1:   add    rax,QWORD PTR [r10+r11*8+0x40]
   5.88%   │  0x00007f3ee4fd74a6:   add    rax,QWORD PTR [r10+r11*8+0x48]
   5.69%   │  0x00007f3ee4fd74ab:   add    rax,QWORD PTR [r10+r11*8+0x50]
   5.94%   │  0x00007f3ee4fd74b0:   add    rax,QWORD PTR [r10+r11*8+0x58]
   6.17%   │  0x00007f3ee4fd74b5:   add    rax,QWORD PTR [r10+r11*8+0x60]
   5.94%   │  0x00007f3ee4fd74ba:   add    rax,QWORD PTR [r10+r11*8+0x68]
   5.84%   │  0x00007f3ee4fd74bf:   add    rax,QWORD PTR [r10+r11*8+0x70]
   5.71%   │  0x00007f3ee4fd74c4:   add    rax,QWORD PTR [r10+r11*8+0x78]
   8.30%   │  0x00007f3ee4fd74c9:   add    rax,QWORD PTR [r10+r11*8+0x80]
   6.04%   │  0x00007f3ee4fd74d1:   add    rax,QWORD PTR [r10+r11*8+0x88];*ladd {reexecute=0 rethrow=0 return_oop=0}; - com.rkdeep.LoopUnrollBenchmark::intStride1@17 (line 73)
   5.82%   │  0x00007f3ee4fd74d9:   mov    r9d,r11d
   0.00%   │  0x00007f3ee4fd74dc:   add    r9d,0x10                     ;*iinc {reexecute=0 rethrow=0 return_oop=0}; - com.rkdeep.LoopUnrollBenchmark::intStride1@19 (line 71)
           │  0x00007f3ee4fd74e0:   cmp    r9d,0xf4231
           ╰  0x00007f3ee4fd74e7:   jl     0x00007f3ee4fd7480           ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                                                                        ; - com.rkdeep.LoopUnrollBenchmark::intStride1@7 (line 71)
              0x00007f3ee4fd74e9:   cmp    r9d,0xf4240
              0x00007f3ee4fd74f0:   jge    0x00007f3ee4fd7509
              0x00007f3ee4fd74f2:   add    r11d,0x10
              0x00007f3ee4fd74f6:   xchg   ax,ax                        ;*lload_1 {reexecute=0 rethrow=0 return_oop=0}
                                                                        ; - com.rkdeep.LoopUnrollBenchmark::intStride1@10 (line 73)
              0x00007f3ee4fd74f8:   add    rax,QWORD PTR [r10+r11*8+0x10];*ladd {reexecute=0 rethrow=0 return_oop=0}
....................................................................................................
  98.44%  <total for region 1>


....[Hottest Region 1]..............................................................................
c2, level 4, com.rkdeep.LoopUnrollBenchmark::longStride1, version 3, compile id 719 

   0.00%     0x00007f72e4fd549a:   cmp    edx,r11d
             0x00007f72e4fd549d:   mov    r10d,0x80000000
             0x00007f72e4fd54a3:   cmovl  r11d,r10d
             0x00007f72e4fd54a7:   movsxd r10,r11d
             0x00007f72e4fd54aa:   cmp    r10,rbp
             0x00007f72e4fd54ad:   cmovg  r11d,edi
             0x00007f72e4fd54b1:   cmp    r11d,0x2
             0x00007f72e4fd54b5:   jle    0x00007f72e4fd55ad
             0x00007f72e4fd54bb:   mov    r10d,0x2                     ;*lload_1 {reexecute=0 rethrow=0 return_oop=0}
                                                                       ; - com.rkdeep.LoopUnrollBenchmark::longStride1@12 (line 85)
  10.74%  ↗  0x00007f72e4fd54c1:   cmp    r12d,DWORD PTR [rsp]
          │  0x00007f72e4fd54c5:   jae    0x00007f72e4fd5575
   0.02%  │  0x00007f72e4fd54cb:   add    rax,QWORD PTR [rcx+r12*8+0x10]
   0.31%  │  0x00007f72e4fd54d0:   movsxd rbx,r10d
   0.02%  │  0x00007f72e4fd54d3:   mov    r8,r9
  10.86%  │  0x00007f72e4fd54d6:   add    r8,rbx
          │  0x00007f72e4fd54d9:   mov    r12,rsi
   0.08%  │  0x00007f72e4fd54dc:   add    r12,rbx
   0.02%  │  0x00007f72e4fd54df:   mov    rbx,QWORD PTR [rcx+r12*8+0x48]
  20.31%  │  0x00007f72e4fd54e4:   mov    rdi,QWORD PTR [rcx+r12*8+0x40]
   0.31%  │  0x00007f72e4fd54e9:   mov    rdx,QWORD PTR [rcx+r12*8+0x38]
   0.44%  │  0x00007f72e4fd54ee:   mov    rbp,QWORD PTR [rcx+r12*8+0x30]
   0.19%  │  0x00007f72e4fd54f3:   mov    r13,QWORD PTR [rcx+r12*8+0x28]
  10.46%  │  0x00007f72e4fd54f8:   mov    r14,QWORD PTR [rcx+r12*8+0x20]
   0.03%  │  0x00007f72e4fd54fd:   mov    r12,QWORD PTR [rcx+r12*8+0x18];*laload {reexecute=0 rethrow=0 return_oop=0}; - com.rkdeep.LoopUnrollBenchmark::longStride1@19 (line 85)
   0.41%  │  0x00007f72e4fd5502:   add    rax,r12
   0.18%  │  0x00007f72e4fd5505:   add    rax,r14
  10.29%  │  0x00007f72e4fd5508:   add    rax,r13
   0.10%  │  0x00007f72e4fd550b:   add    rax,rbp
   0.47%  │  0x00007f72e4fd550e:   add    rax,rdx
  10.77%  │  0x00007f72e4fd5511:   add    rax,rdi
  11.09%  │  0x00007f72e4fd5514:   add    rax,rbx                      ;*ladd {reexecute=0 rethrow=0 return_oop=0}; - com.rkdeep.LoopUnrollBenchmark::longStride1@20 (line 85)
  11.06%  │  0x00007f72e4fd5517:   add    r8,0x8
   0.02%  │  0x00007f72e4fd551b:   mov    r12d,r8d                     ;*l2i {reexecute=0 rethrow=0 return_oop=0}; - com.rkdeep.LoopUnrollBenchmark::longStride1@18 (line 85)
          │  0x00007f72e4fd551e:   add    r10d,0x8
   0.01%  │  0x00007f72e4fd5522:   cmp    r10d,r11d
          ╰  0x00007f72e4fd5525:   jl     0x00007f72e4fd54c1           ;*ifge {reexecute=0 rethrow=0 return_oop=0}
                                                                       ; - com.rkdeep.LoopUnrollBenchmark::longStride1@9 (line 83)
             0x00007f72e4fd5527:   cmp    r10d,DWORD PTR [rsp+0x4]
             0x00007f72e4fd552c:   jge    0x00007f72e4fd5556
             0x00007f72e4fd552e:   xchg   ax,ax                        ;*lload_1 {reexecute=0 rethrow=0 return_oop=0}
                                                                       ; - com.rkdeep.LoopUnrollBenchmark::longStride1@12 (line 85)
             0x00007f72e4fd5530:   cmp    r12d,DWORD PTR [rsp]
             0x00007f72e4fd5534:   jae    0x00007f72e4fd55cb
             0x00007f72e4fd553a:   add    rax,QWORD PTR [rcx+r12*8+0x10];*ladd {reexecute=0 rethrow=0 return_oop=0}
                                                                       ; - com.rkdeep.LoopUnrollBenchmark::longStride1@20 (line 85)
....................................................................................................
  98.19%  <total for region 1>

Benchmark                             Mode  Cnt          Score          Error  Units
LoopUnrollBenchmark.baseline         thrpt    5  413725733.066 ± 20385130.808  ops/s
LoopUnrollBenchmark.baseline:asm     thrpt                 NaN                   ---
LoopUnrollBenchmark.intStride1       thrpt    5       2581.171 ±       14.527  ops/s
LoopUnrollBenchmark.intStride1:asm   thrpt                 NaN                   ---
LoopUnrollBenchmark.longStride1      thrpt    5       2427.188 ±        9.450  ops/s
LoopUnrollBenchmark.longStride1:asm  thrpt                 NaN                   ---

As the assembly shows, no separate inner loop is generated, because splitting the loop serves no purpose once the safepoint poll is removed. With an int counter, the loop body is unrolled 16 times. The resulting code can be represented as in the listing below:

        for (int i = 0; i < 1_000_000; i += 16) {
                sum += data[i];
                sum += data[i+1];
                sum += data[i+2];
                sum += data[i+3];
                sum += data[i+4];
                sum += data[i+5];
                sum += data[i+6];
                sum += data[i+7];
                sum += data[i+8];
                sum += data[i+9];
                sum += data[i+10];
                sum += data[i+11];
                sum += data[i+12];
                sum += data[i+13];
                sum += data[i+14];
                sum += data[i+15];
        }

For a long counter, the loop body is unrolled 8 times. In addition, with the long type the generated code first loads seven of the eight elements into registers (rbx, rdi, rdx, rbp, r13, r14, and r12) and only then performs the additions. The resulting compiled code can be represented as the following pseudocode:

        for (long i = 0; i < 1_000_000; i += 8) {
                sum += data[(int) i];
                sum += data[(int) i+1];
                sum += data[(int) i+2];
                sum += data[(int) i+3];
                sum += data[(int) i+4];
                sum += data[(int) i+5];
                sum += data[(int) i+6];
                sum += data[(int) i+7];
        }

As shown, Java 17 handles loops with long counters significantly better. The throughput of the long-counter loop relative to the int-counter loop rose from about 57% in Java 11 (1391 vs. 2458 ops/s) to about 94% in Java 17 (2427 vs. 2581 ops/s).
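
The relative throughput can be recomputed directly from the two benchmark tables. The short sketch below divides the longStride1 score by the intStride1 score for each Java version; the class and method names are hypothetical, and the numbers are copied from the tables above.

```java
import java.util.Locale;

// Hypothetical helper; the scores are taken from the JMH result tables.
final class ThroughputRatio {

    // Throughput of the long-counter loop as a percentage of the int-counter loop.
    static double percent(double longOps, double intOps) {
        return 100.0 * longOps / intOps;
    }

    public static void main(String[] args) {
        // Java 11: longStride1 1391.287 ops/s, intStride1 2457.647 ops/s
        System.out.println(String.format(Locale.ROOT, "Java 11: %.1f%%",
                percent(1391.287, 2457.647))); // Java 11: 56.6%
        // Java 17: longStride1 2427.188 ops/s, intStride1 2581.171 ops/s
        System.out.println(String.format(Locale.ROOT, "Java 17: %.1f%%",
                percent(2427.188, 2581.171))); // Java 17: 94.0%
    }
}
```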

Benchmark with safepoints -XX:+UseCountedLoopSafepoints and -XX:LoopStripMiningIter=1000

We will run the benchmark with the following parameters:

@Fork(value = 1, jvmArgsPrepend = {"-XX:+UnlockDiagnosticVMOptions", "-XX:-UseCompressedOops", "-XX:PrintAssemblyOptions=intel", "-XX:LoopStripMiningIter=1000"})
....[Hottest Region 1]..............................................................................
c2, level 4, com.rkdeep.LoopUnrollBenchmark::intStride1, version 3, compile id 720 

                0x00007f27f8fd76da:   jbe    0x00007f27f8fd77d8
                0x00007f27f8fd76e0:   cmp    r10d,0xf423f
                0x00007f27f8fd76e7:   jbe    0x00007f27f8fd77d8
                0x00007f27f8fd76ed:   mov    rax,QWORD PTR [r8+0x10]      ;*laload {reexecute=0 rethrow=0 return_oop=0}
                                                                          ; - com.rkdeep.LoopUnrollBenchmark::intStride1@16 (line 72)
                0x00007f27f8fd76f1:   mov    r12d,0x1                     ;*goto {reexecute=0 rethrow=0 return_oop=0}
                                                                          ; - com.rkdeep.LoopUnrollBenchmark::intStride1@22 (line 70)
                0x00007f27f8fd76f7:   mov    ebx,0x3e80
                0x00007f27f8fd76fc:   xor    ecx,ecx
          ╭     0x00007f27f8fd76fe:   jmp    0x00007f27f8fd777e
   0.02%  │↗    0x00007f27f8fd7703:   mov    r12d,r11d                    ;*lload_1 {reexecute=0 rethrow=0 return_oop=0}
          ││                                                              ; - com.rkdeep.LoopUnrollBenchmark::intStride1@10 (line 72)
   5.60%  ││ ↗  0x00007f27f8fd7706:   add    rax,QWORD PTR [r8+r12*8+0x10]
   0.01%  ││ │  0x00007f27f8fd770b:   add    rax,QWORD PTR [r8+r12*8+0x18]
   5.73%  ││ │  0x00007f27f8fd7710:   add    rax,QWORD PTR [r8+r12*8+0x20]
   5.75%  ││ │  0x00007f27f8fd7715:   add    rax,QWORD PTR [r8+r12*8+0x28]
   5.48%  ││ │  0x00007f27f8fd771a:   add    rax,QWORD PTR [r8+r12*8+0x30]
   5.59%  ││ │  0x00007f27f8fd771f:   add    rax,QWORD PTR [r8+r12*8+0x38]
   9.69%  ││ │  0x00007f27f8fd7724:   add    rax,QWORD PTR [r8+r12*8+0x40]
   5.59%  ││ │  0x00007f27f8fd7729:   add    rax,QWORD PTR [r8+r12*8+0x48]
   5.77%  ││ │  0x00007f27f8fd772e:   add    rax,QWORD PTR [r8+r12*8+0x50]
   5.38%  ││ │  0x00007f27f8fd7733:   add    rax,QWORD PTR [r8+r12*8+0x58]
   6.13%  ││ │  0x00007f27f8fd7738:   add    rax,QWORD PTR [r8+r12*8+0x60]
   5.56%  ││ │  0x00007f27f8fd773d:   add    rax,QWORD PTR [r8+r12*8+0x68]
   5.58%  ││ │  0x00007f27f8fd7742:   add    rax,QWORD PTR [r8+r12*8+0x70]
   5.56%  ││ │  0x00007f27f8fd7747:   add    rax,QWORD PTR [r8+r12*8+0x78]
   9.13%  ││ │  0x00007f27f8fd774c:   add    rax,QWORD PTR [r8+r12*8+0x80]
   5.83%  ││ │  0x00007f27f8fd7754:   add    rax,QWORD PTR [r8+r12*8+0x88];*ladd {reexecute=0 rethrow=0 return_oop=0}
          ││ │                                                            ; - com.rkdeep.LoopUnrollBenchmark::intStride1@17 (line 72)
   5.61%  ││ │  0x00007f27f8fd775c:   mov    r11d,r12d
   0.00%  ││ │  0x00007f27f8fd775f:   add    r11d,0x10                    ;*iinc {reexecute=0 rethrow=0 return_oop=0}
          ││ │                                                            ; - com.rkdeep.LoopUnrollBenchmark::intStride1@19 (line 70)
   0.00%  ││ │  0x00007f27f8fd7763:   cmp    r11d,r10d
          │╰ │  0x00007f27f8fd7766:   jl     0x00007f27f8fd7703           ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
          │  │                                                            ; - com.rkdeep.LoopUnrollBenchmark::intStride1@7 (line 70)
          │  │  0x00007f27f8fd7768:   mov    r9,QWORD PTR [r15+0x350]     ; ImmutableOopMap {r8=Oop rdi=Oop }
          │  │                                                            ;*goto {reexecute=1 rethrow=0 return_oop=0}
          │  │                                                            ; - (reexecute) com.rkdeep.LoopUnrollBenchmark::intStride1@22 (line 70)
   0.01%  │  │  0x00007f27f8fd776f:   test   DWORD PTR [r9],eax           ;*goto {reexecute=0 rethrow=0 return_oop=0}
          │  │                                                            ; - com.rkdeep.LoopUnrollBenchmark::intStride1@22 (line 70)
          │  │                                                            ;   {poll}
   0.02%  │  │  0x00007f27f8fd7772:   cmp    r11d,0xf4231
          │ ╭│  0x00007f27f8fd7779:   jge    0x00007f27f8fd77a5
          │ ││  0x00007f27f8fd777b:   mov    r12d,r11d
          ↘ ││  0x00007f27f8fd777e:   mov    r10d,0xf4231
   0.01%    ││  0x00007f27f8fd7784:   sub    r10d,r12d
            ││  0x00007f27f8fd7787:   cmp    r12d,0xf4231
            ││  0x00007f27f8fd778e:   cmovg  r10d,ecx
   0.00%    ││  0x00007f27f8fd7792:   cmp    r10d,0x3e80
            ││  0x00007f27f8fd7799:   cmova  r10d,ebx
   0.00%    ││  0x00007f27f8fd779d:   add    r10d,r12d
   0.00%    │╰  0x00007f27f8fd77a0:   jmp    0x00007f27f8fd7706
            ↘   0x00007f27f8fd77a5:   cmp    r11d,0xf4240
                0x00007f27f8fd77ac:   jge    0x00007f27f8fd77c5
                0x00007f27f8fd77ae:   add    r12d,0x10
                0x00007f27f8fd77b2:   xchg   ax,ax                        ;*lload_1 {reexecute=0 rethrow=0 return_oop=0}
                                                                          ; - com.rkdeep.LoopUnrollBenchmark::intStride1@10 (line 72)
                0x00007f27f8fd77b4:   add    rax,QWORD PTR [r8+r12*8+0x10];*ladd {reexecute=0 rethrow=0 return_oop=0}
                                                                          ; - com.rkdeep.LoopUnrollBenchmark::intStride1@17 (line 72)
                0x00007f27f8fd77b9:   inc    r12d                         ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                                                          ; - com.rkdeep.LoopUnrollBenchmark::intStride1@19 (line 70)
                0x00007f27f8fd77bc:   cmp    r12d,0xf4240
....................................................................................................
  98.07%  <total for region 1>

....[Hottest Region 1]..............................................................................
c2, level 4, com.rkdeep.LoopUnrollBenchmark::longStride1, version 3, compile id 717 

                 0x00007fe980fd651f:   cmovl  r10d,esi
                 0x00007fe980fd6523:   movsxd r9,r10d
                 0x00007fe980fd6526:   cmp    r9,r11
                 0x00007fe980fd6529:   cmovg  r10d,edi
                 0x00007fe980fd652d:   mov    DWORD PTR [rsp+0x8],r10d
                 0x00007fe980fd6532:   cmp    r10d,0x2
          ╭      0x00007fe980fd6536:   jle    0x00007fe980fd65e7
          │ ↗    0x00007fe980fd653c:   mov    r10d,DWORD PTR [rsp+0x8]
          │ │    0x00007fe980fd6541:   sub    r10d,ecx
          │ │    0x00007fe980fd6544:   mov    r9d,DWORD PTR [rsp+0x8]
   0.01%  │ │    0x00007fe980fd6549:   xor    r11d,r11d
          │ │    0x00007fe980fd654c:   cmp    r9d,ecx
   0.00%  │ │    0x00007fe980fd654f:   cmovl  r10d,r11d
   0.01%  │ │    0x00007fe980fd6553:   cmp    r10d,0x1f40
   0.00%  │ │    0x00007fe980fd655a:   mov    r9d,0x1f40
          │ │    0x00007fe980fd6560:   cmova  r10d,r9d
   0.01%  │ │    0x00007fe980fd6564:   add    r10d,ecx
   0.00%  │ │    0x00007fe980fd6567:   nop    WORD PTR [rax+rax*1+0x0]     ;*lload_1 {reexecute=0 rethrow=0 return_oop=0}
          │ │                                                              ; - com.rkdeep.LoopUnrollBenchmark::longStride1@12 (line 84)
  11.22%  │↗│    0x00007fe980fd6570:   cmp    ebx,DWORD PTR [rsp]
          │││    0x00007fe980fd6573:   jae    0x00007fe980fd6628
   0.03%  │││    0x00007fe980fd6579:   add    rax,QWORD PTR [r8+rbx*8+0x10]
   0.40%  │││    0x00007fe980fd657e:   movsxd r9,ecx
   0.04%  │││    0x00007fe980fd6581:   mov    rdx,r14
  10.88%  │││    0x00007fe980fd6584:   add    rdx,r9
   0.01%  │││    0x00007fe980fd6587:   mov    r11,rbp
   0.10%  │││    0x00007fe980fd658a:   add    r11,r9
   0.05%  │││    0x00007fe980fd658d:   mov    r9,QWORD PTR [r8+r11*8+0x48]
  18.17%  │││    0x00007fe980fd6592:   mov    r12,QWORD PTR [r8+r11*8+0x40]
   0.30%  │││    0x00007fe980fd6597:   mov    rbx,QWORD PTR [r8+r11*8+0x38]
   0.47%  │││    0x00007fe980fd659c:   mov    rdi,QWORD PTR [r8+r11*8+0x30]
   0.20%  │││    0x00007fe980fd65a1:   mov    rsi,QWORD PTR [r8+r11*8+0x28]
  10.59%  │││    0x00007fe980fd65a6:   mov    r13,QWORD PTR [r8+r11*8+0x20]
   0.09%  │││    0x00007fe980fd65ab:   mov    r11,QWORD PTR [r8+r11*8+0x18];*laload {reexecute=0 rethrow=0 return_oop=0}
          │││                                                              ; - com.rkdeep.LoopUnrollBenchmark::longStride1@19 (line 84)
   0.44%  │││    0x00007fe980fd65b0:   add    rax,r11
   0.22%  │││    0x00007fe980fd65b3:   add    rax,r13
  10.65%  │││    0x00007fe980fd65b6:   add    rax,rsi
   0.15%  │││    0x00007fe980fd65b9:   add    rax,rdi
   0.57%  │││    0x00007fe980fd65bc:   add    rax,rbx
  10.83%  │││    0x00007fe980fd65bf:   add    rax,r12
  11.50%  │││    0x00007fe980fd65c2:   add    rax,r9                       ;*ladd {reexecute=0 rethrow=0 return_oop=0}
          │││                                                              ; - com.rkdeep.LoopUnrollBenchmark::longStride1@20 (line 84)
  11.28%  │││    0x00007fe980fd65c5:   add    rdx,0x8
   0.01%  │││    0x00007fe980fd65c9:   mov    ebx,edx                      ;*l2i {reexecute=0 rethrow=0 return_oop=0}
          │││                                                              ; - com.rkdeep.LoopUnrollBenchmark::longStride1@18 (line 84)
   0.02%  │││    0x00007fe980fd65cb:   add    ecx,0x8
   0.04%  │││    0x00007fe980fd65ce:   cmp    ecx,r10d
          │╰│    0x00007fe980fd65d1:   jl     0x00007fe980fd6570           ;*ifge {reexecute=0 rethrow=0 return_oop=0}
          │ │                                                              ; - com.rkdeep.LoopUnrollBenchmark::longStride1@9 (line 82)
   0.02%  │ │    0x00007fe980fd65d3:   mov    r10,QWORD PTR [r15+0x350]    ; ImmutableOopMap {r8=Oop xmm0=Oop }
          │ │                                                              ;*goto {reexecute=1 rethrow=0 return_oop=0}
          │ │                                                              ; - (reexecute) com.rkdeep.LoopUnrollBenchmark::longStride1@26 (line 82)
   0.01%  │ │    0x00007fe980fd65da:   test   DWORD PTR [r10],eax          ;*goto {reexecute=0 rethrow=0 return_oop=0}
          │ │                                                              ; - com.rkdeep.LoopUnrollBenchmark::longStride1@26 (line 82)
          │ │                                                              ;   {poll}
   0.11%  │ │    0x00007fe980fd65dd:   cmp    ecx,DWORD PTR [rsp+0x8]
          │ ╰    0x00007fe980fd65e1:   jl     0x00007fe980fd653c
          ↘      0x00007fe980fd65e7:   cmp    ecx,DWORD PTR [rsp+0x4]
             ╭   0x00007fe980fd65eb:   jge    0x00007fe980fd660e
   0.00%     │   0x00007fe980fd65ed:   data16 xchg ax,ax                   ;*l2i {reexecute=0 rethrow=0 return_oop=0}; - com.rkdeep.LoopUnrollBenchmark::longStride1@18 (line 84)
             │↗  0x00007fe980fd65f0:   cmp    ebx,DWORD PTR [rsp]
             ││  0x00007fe980fd65f3:   jae    0x00007fe980fd666f
             ││  0x00007fe980fd65f5:   add    rax,QWORD PTR [r8+rbx*8+0x10];*ladd {reexecute=0 rethrow=0 return_oop=0}
             ││                                                            ; - com.rkdeep.LoopUnrollBenchmark::longStride1@20 (line 84)
   0.00%     ││  0x00007fe980fd65fa:   movsxd rdx,ecx
             ││  0x00007fe980fd65fd:   add    rdx,r14
             ││  0x00007fe980fd6600:   add    rdx,0x1
             ││  0x00007fe980fd6604:   mov    ebx,edx                      ;*l2i {reexecute=0 rethrow=0 return_oop=0}
             ││                                                            ; - com.rkdeep.LoopUnrollBenchmark::longStride1@18 (line 84)
   0.00%     ││  0x00007fe980fd6606:   inc    ecx
             ││  0x00007fe980fd6608:   cmp    ecx,DWORD PTR [rsp+0x4]
             │╰  0x00007fe980fd660c:   jl     0x00007fe980fd65f0           ;*ifge {reexecute=0 rethrow=0 return_oop=0}; - com.rkdeep.LoopUnrollBenchmark::longStride1@9 (line 82)
             ↘   0x00007fe980fd660e:   vmovq  r11,xmm0
                 0x00007fe980fd6613:   mov    r10d,DWORD PTR [rsp]
                 0x00007fe980fd6617:   cmp    rdx,0xf4240
                 0x00007fe980fd661e:   jge    0x00007fe980fd665c
                 0x00007fe980fd6620:   mov    r14,rdx
                 0x00007fe980fd6623:   jmp    0x00007fe980fd645e
....................................................................................................
  98.42%  <total for region 1>

Benchmark                             Mode  Cnt          Score          Error  Units
LoopUnrollBenchmark.baseline         thrpt    5  419882701.472 ± 13085589.188  ops/s
LoopUnrollBenchmark.baseline:asm     thrpt                 NaN                   ---
LoopUnrollBenchmark.intStride1       thrpt    5       2493.944 ±      102.343  ops/s
LoopUnrollBenchmark.intStride1:asm   thrpt                 NaN                   ---
LoopUnrollBenchmark.longStride1      thrpt    5       2446.834 ±      231.035  ops/s
LoopUnrollBenchmark.longStride1:asm  thrpt                 NaN                   ---

As the assembly shows, C2 unrolls both loops: 8 iterations per pass for the long counter and 16 for the int counter. In both cases the loop is split into an inner loop with a safepoint poll placed after it.
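The shape of the compiled loop can be sketched in plain Java. This is only an illustration of one reading of the listing, not what C2 literally generates: the class name `StripMiningSketch` and the constants `STRIP_ITERATIONS` and `UNROLL` are hypothetical. In the int listing the inner loop span is capped at `0x3e80` (16000 = 1000 × 16 elements), in the long listing at `0x1f40` (8000 = 1000 × 8), and the main loop limit `0xf4231` is 1,000,000 - 15, which appears to leave the last elements to a scalar tail loop.

```java
class StripMiningSketch {
    // Hypothetical constants chosen to mirror the int listing:
    // -XX:LoopStripMiningIter=1000 and an unroll factor of 16.
    static final int STRIP_ITERATIONS = 1000;
    static final int UNROLL = 16;

    /** The loop as written in the source. */
    static long plainSum(long[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    /** The same sum, restructured by hand into strips of unrolled iterations. */
    static long stripMinedSum(long[] data) {
        long sum = 0;
        int i = 0;
        // Indices below mainLimit still have UNROLL elements ahead of them.
        int mainLimit = data.length - (UNROLL - 1);
        while (i < mainLimit) {
            // Outer loop: limit the inner loop to STRIP_ITERATIONS unrolled passes.
            int stripEnd = (int) Math.min((long) i + (long) STRIP_ITERATIONS * UNROLL, mainLimit);
            for (; i < stripEnd; i += UNROLL) {
                // Stands in for the 16 straight-line add instructions.
                for (int k = 0; k < UNROLL; k++) {
                    sum += data[i + k];
                }
            }
            // In the compiled code the safepoint poll sits here, once per strip.
        }
        // Scalar tail for the elements that do not fill a whole unrolled pass.
        for (; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] data = new long[1_000_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        System.out.println(plainSum(data) == stripMinedSum(data));
    }
}
```

With these constants the safepoint poll is reached once per 16000 elements instead of once per element, which is the trade-off `-XX:LoopStripMiningIter` controls.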

Java 21

# JMH version: 1.37
# VM version: JDK 21.0.1, OpenJDK 64-Bit Server VM, 21.0.1+12-LTS
# VM invoker: /home/kirill/.sdkman/candidates/java/21.0.1-tem/bin/java
# VM options: -XX:+UnlockDiagnosticVMOptions -XX:-UseCompressedOops -XX:PrintAssemblyOptions=intel -XX:LoopStripMiningIter=1000

....[Hottest Region 1]..............................................................................
c2, level 4, com.rkdeep.LoopUnrollBenchmark::intStride1, version 3, compile id 764 

   0.01%         0x0000747cd43da1c5:   test   r10d,r10d
                 0x0000747cd43da1c8:   jbe    0x0000747cd43da2ac
                 0x0000747cd43da1ce:   cmp    r10d,0xf423f
                 0x0000747cd43da1d5:   jbe    0x0000747cd43da2ac
                 0x0000747cd43da1db:   mov    rax,QWORD PTR [r9+0x10]      ;*laload {reexecute=0 rethrow=0 return_oop=0}
                                                                           ; - com.rkdeep.LoopUnrollBenchmark::intStride1@16 (line 72)
                 0x0000747cd43da1df:   mov    r10d,0x1                     ;*goto {reexecute=0 rethrow=0 return_oop=0}
                                                                           ; - com.rkdeep.LoopUnrollBenchmark::intStride1@22 (line 70)
                 0x0000747cd43da1e5:   mov    r8d,0x3e80
          ╭      0x0000747cd43da1eb:   jmp    0x0000747cd43da268
   0.00%  │↗     0x0000747cd43da1f0:   mov    r10d,ebx                     ;*lload_1 {reexecute=0 rethrow=0 return_oop=0}
          ││                                                               ; - com.rkdeep.LoopUnrollBenchmark::intStride1@10 (line 72)
   0.00%  ││ ↗   0x0000747cd43da1f3:   add    rax,QWORD PTR [r9+r10*8+0x10]
   5.74%  ││ │   0x0000747cd43da1f8:   add    rax,QWORD PTR [r9+r10*8+0x18]
   6.09%  ││ │   0x0000747cd43da1fd:   add    rax,QWORD PTR [r9+r10*8+0x20]
   5.56%  ││ │   0x0000747cd43da202:   add    rax,QWORD PTR [r9+r10*8+0x28]
   5.94%  ││ │   0x0000747cd43da207:   add    rax,QWORD PTR [r9+r10*8+0x30]
   5.88%  ││ │   0x0000747cd43da20c:   add    rax,QWORD PTR [r9+r10*8+0x38]
   9.17%  ││ │   0x0000747cd43da211:   add    rax,QWORD PTR [r9+r10*8+0x40]
   5.87%  ││ │   0x0000747cd43da216:   add    rax,QWORD PTR [r9+r10*8+0x48]
   5.66%  ││ │   0x0000747cd43da21b:   add    rax,QWORD PTR [r9+r10*8+0x50]
   5.61%  ││ │   0x0000747cd43da220:   add    rax,QWORD PTR [r9+r10*8+0x58]
   5.83%  ││ │   0x0000747cd43da225:   add    rax,QWORD PTR [r9+r10*8+0x60]
   5.73%  ││ │   0x0000747cd43da22a:   add    rax,QWORD PTR [r9+r10*8+0x68]
   5.56%  ││ │   0x0000747cd43da22f:   add    rax,QWORD PTR [r9+r10*8+0x70]
   5.67%  ││ │   0x0000747cd43da234:   add    rax,QWORD PTR [r9+r10*8+0x78]
   9.08%  ││ │   0x0000747cd43da239:   add    rax,QWORD PTR [r9+r10*8+0x80]
   5.86%  ││ │   0x0000747cd43da241:   add    rax,QWORD PTR [r9+r10*8+0x88];*ladd {reexecute=0 rethrow=0 return_oop=0}
          ││ │                                                             ; - com.rkdeep.LoopUnrollBenchmark::intStride1@17 (line 72)
   5.65%  ││ │   0x0000747cd43da249:   lea    ebx,[r10+0x10]
   0.00%  ││ │   0x0000747cd43da24d:   cmp    ebx,r12d
          │╰ │   0x0000747cd43da250:   jl     0x0000747cd43da1f0           ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
          │  │                                                             ; - com.rkdeep.LoopUnrollBenchmark::intStride1@7 (line 70)
          │  │   0x0000747cd43da252:   mov    r12,QWORD PTR [r15+0x458]    ; ImmutableOopMap {r11=Oop r9=Oop }
          │  │                                                             ;*goto {reexecute=1 rethrow=0 return_oop=0}
          │  │                                                             ; - (reexecute) com.rkdeep.LoopUnrollBenchmark::intStride1@22 (line 70)
   0.01%  │  │   0x0000747cd43da259:   test   DWORD PTR [r12],eax          ;*goto {reexecute=0 rethrow=0 return_oop=0}
          │  │                                                             ; - com.rkdeep.LoopUnrollBenchmark::intStride1@22 (line 70)
          │  │                                                             ;   {poll}
   0.01%  │  │   0x0000747cd43da25d:   cmp    ebx,0xf4231
          │ ╭│   0x0000747cd43da263:   jge    0x0000747cd43da284
          │ ││   0x0000747cd43da265:   mov    r10d,ebx
          ↘ ││   0x0000747cd43da268:   mov    r12d,0xf4231
            ││   0x0000747cd43da26e:   sub    r12d,r10d
   0.01%    ││   0x0000747cd43da271:   cmp    r12d,0x3e80
            ││   0x0000747cd43da278:   cmova  r12d,r8d
   0.01%    ││   0x0000747cd43da27c:   add    r12d,r10d
            │╰   0x0000747cd43da27f:   jmp    0x0000747cd43da1f3
            ↘    0x0000747cd43da284:   add    r10d,0x10                    ;*lload_1 {reexecute=0 rethrow=0 return_oop=0}
                                                                           ; - com.rkdeep.LoopUnrollBenchmark::intStride1@10 (line 72)
              ↗  0x0000747cd43da288:   add    rax,QWORD PTR [r9+r10*8+0x10];*ladd {reexecute=0 rethrow=0 return_oop=0}
              │                                                            ; - com.rkdeep.LoopUnrollBenchmark::intStride1@17 (line 72)
              │  0x0000747cd43da28d:   inc    r10d                         ;*iinc {reexecute=0 rethrow=0 return_oop=0}
              │                                                            ; - com.rkdeep.LoopUnrollBenchmark::intStride1@19 (line 70)
              │  0x0000747cd43da290:   cmp    r10d,0xf4240
              ╰  0x0000747cd43da297:   jl     0x0000747cd43da288
                 0x0000747cd43da299:   add    rsp,0x10
....................................................................................................
  98.97%  <total for region 1>


....[Hottest Region 1]..............................................................................
c2, level 4, com.rkdeep.LoopUnrollBenchmark::longStride1, version 3, compile id 764 

               0x00007f7c7c3d94e0:   lea    r14d,[r13-0x4]
               0x00007f7c7c3d94e4:   mov    r8d,0x2
               0x00007f7c7c3d94ea:   cmp    r14d,0x2
          ╭    0x00007f7c7c3d94ee:   jle    0x00007f7c7c3d95cf
          │    0x00007f7c7c3d94f4:   vmovd  xmm1,r12d
          │    0x00007f7c7c3d94f9:   mov    rdi,rdx
          │    0x00007f7c7c3d94fc:   add    rdi,rbx
          │    0x00007f7c7c3d94ff:   mov    r12d,esi
          │    0x00007f7c7c3d9502:   vmovq  xmm0,rcx
          │ ↗  0x00007f7c7c3d9507:   mov    ebx,r13d
   0.00%  │ │  0x00007f7c7c3d950a:   sub    ebx,r8d
          │ │  0x00007f7c7c3d950d:   add    ebx,0xfffffffc
   0.01%  │ │  0x00007f7c7c3d9510:   xor    r9d,r9d
          │ │  0x00007f7c7c3d9513:   cmp    r14d,r8d
          │ │  0x00007f7c7c3d9516:   cmovl  ebx,r9d
   0.00%  │ │  0x00007f7c7c3d951a:   cmp    ebx,0x1f40
          │ │  0x00007f7c7c3d9520:   mov    r9d,0x1f40
   0.00%  │ │  0x00007f7c7c3d9526:   cmova  ebx,r9d
   0.01%  │ │  0x00007f7c7c3d952a:   add    ebx,r8d
   0.00%  │ │  0x00007f7c7c3d952d:   data16 xchg ax,ax                   ;*lload_1 {reexecute=0 rethrow=0 return_oop=0}
          │ │                                                            ; - com.rkdeep.LoopUnrollBenchmark::longStride1@12 (line 84)
  10.97%  │↗│  0x00007f7c7c3d9530:   cmp    r11d,r10d
          │││  0x00007f7c7c3d9533:   jae    0x00007f7c7c3d95fd
   0.03%  │││  0x00007f7c7c3d9539:   add    rax,QWORD PTR [rdx+r11*8+0x10]
   0.44%  │││  0x00007f7c7c3d953e:   lea    r11d,[r12+r8*1]
   0.03%  │││  0x00007f7c7c3d9542:   movsxd rcx,r8d
  10.59%  │││  0x00007f7c7c3d9545:   mov    r9,QWORD PTR [rdi+rcx*8+0x18]
   0.06%  │││  0x00007f7c7c3d954a:   mov    rbp,QWORD PTR [rdi+rcx*8+0x20];*laload {reexecute=0 rethrow=0 return_oop=0}
          │││                                                            ; - com.rkdeep.LoopUnrollBenchmark::longStride1@19 (line 84)
   0.21%  │││  0x00007f7c7c3d954f:   add    rax,r9
   0.11%  │││  0x00007f7c7c3d9552:   add    rax,rbp
  10.87%  │││  0x00007f7c7c3d9555:   lea    r9d,[r11+0x3]
   0.02%  │││  0x00007f7c7c3d9559:   cmp    r9d,r10d
          │││  0x00007f7c7c3d955c:   jae    0x00007f7c7c3d9606
   0.01%  │││  0x00007f7c7c3d9562:   add    rax,QWORD PTR [rdi+rcx*8+0x28];*ladd {reexecute=0 rethrow=0 return_oop=0}
          │││                                                            ; - com.rkdeep.LoopUnrollBenchmark::longStride1@20 (line 84)
  11.07%  │││  0x00007f7c7c3d9567:   add    r11d,0x4                     ;*l2i {reexecute=0 rethrow=0 return_oop=0}
          │││                                                            ; - com.rkdeep.LoopUnrollBenchmark::longStride1@18 (line 84)
   0.04%  │││  0x00007f7c7c3d956b:   cmp    r11d,r10d
          │││  0x00007f7c7c3d956e:   jae    0x00007f7c7c3d95f9
   0.04%  │││  0x00007f7c7c3d9574:   add    rax,QWORD PTR [rdi+rcx*8+0x30]
  19.82%  │││  0x00007f7c7c3d9579:   mov    r11,QWORD PTR [rdi+rcx*8+0x40]
   0.16%  │││  0x00007f7c7c3d957e:   mov    r9,QWORD PTR [rdi+rcx*8+0x38];*laload {reexecute=0 rethrow=0 return_oop=0}
          │││                                                            ; - com.rkdeep.LoopUnrollBenchmark::longStride1@19 (line 84)
   0.09%  │││  0x00007f7c7c3d9583:   lea    rbp,[rsi+rcx*1]
   0.02%  │││  0x00007f7c7c3d9587:   add    rax,r9
  11.11%  │││  0x00007f7c7c3d958a:   add    rax,r11                      ;*ladd {reexecute=0 rethrow=0 return_oop=0}
          │││                                                            ; - com.rkdeep.LoopUnrollBenchmark::longStride1@20 (line 84)
  11.23%  │││  0x00007f7c7c3d958d:   mov    r9d,ebp                      ;   {no_reloc}
   0.02%  │││  0x00007f7c7c3d9590:   add    r9d,0x7                      ;*l2i {reexecute=0 rethrow=0 return_oop=0}
          │││                                                            ; - com.rkdeep.LoopUnrollBenchmark::longStride1@18 (line 84)
   0.02%  │││  0x00007f7c7c3d9594:   cmp    r9d,r10d
          │││  0x00007f7c7c3d9597:   jae    0x00007f7c7c3d9602
   0.02%  │││  0x00007f7c7c3d9599:   add    rax,QWORD PTR [rdi+rcx*8+0x48];*ladd {reexecute=0 rethrow=0 return_oop=0}
          │││                                                            ; - com.rkdeep.LoopUnrollBenchmark::longStride1@20 (line 84)
  10.89%  │││  0x00007f7c7c3d959e:   add    rbp,0x8
   0.02%  │││  0x00007f7c7c3d95a2:   mov    r11d,ebp                     ;*l2i {reexecute=0 rethrow=0 return_oop=0}
          │││                                                            ; - com.rkdeep.LoopUnrollBenchmark::longStride1@18 (line 84)
   0.02%  │││  0x00007f7c7c3d95a5:   add    r8d,0x8
   0.04%  │││  0x00007f7c7c3d95a9:   cmp    r8d,ebx
          │╰│  0x00007f7c7c3d95ac:   jl     0x00007f7c7c3d9530           ;*ifge {reexecute=0 rethrow=0 return_oop=0}
          │ │                                                            ; - com.rkdeep.LoopUnrollBenchmark::longStride1@9 (line 82)
   0.02%  │ │  0x00007f7c7c3d95b2:   mov    r9,QWORD PTR [r15+0x458]     ; ImmutableOopMap {rdx=Oop rdi=Derived_oop_rdx xmm0=Oop }
          │ │                                                            ;*goto {reexecute=1 rethrow=0 return_oop=0}
          │ │                                                            ; - (reexecute) com.rkdeep.LoopUnrollBenchmark::longStride1@26 (line 82)
   0.07%  │ │  0x00007f7c7c3d95b9:   test   DWORD PTR [r9],eax           ;*goto {reexecute=0 rethrow=0 return_oop=0}
          │ │                                                            ; - com.rkdeep.LoopUnrollBenchmark::longStride1@26 (line 82)
          │ │                                                            ;   {poll}
   0.15%  │ │  0x00007f7c7c3d95bc:   cmp    r8d,r14d
          │ ╰  0x00007f7c7c3d95bf:   jl     0x00007f7c7c3d9507
          │    0x00007f7c7c3d95c5:   vmovq  rcx,xmm0
          │    0x00007f7c7c3d95ca:   vmovd  r12d,xmm1
          ↘    0x00007f7c7c3d95cf:   cmp    r8d,r12d
               0x00007f7c7c3d95d2:   jge    0x00007f7c7c3d9647           ;*lload_1 {reexecute=0 rethrow=0 return_oop=0}
                                                                         ; - com.rkdeep.LoopUnrollBenchmark::longStride1@12 (line 84)
               0x00007f7c7c3d95d4:   cmp    r11d,r10d
               0x00007f7c7c3d95d7:   jae    0x00007f7c7c3d967b
....................................................................................................
  98.25%  <total for region 1>

Benchmark                             Mode  Cnt          Score          Error  Units
LoopUnrollBenchmark.baseline         thrpt    5  435780715.553 ± 18841125.750  ops/s
LoopUnrollBenchmark.baseline:asm     thrpt                 NaN                   ---
LoopUnrollBenchmark.intStride1       thrpt    5       2605.923 ±      224.247  ops/s
LoopUnrollBenchmark.intStride1:asm   thrpt                 NaN                   ---
LoopUnrollBenchmark.longStride1      thrpt    5       2404.129 ±       66.965  ops/s
LoopUnrollBenchmark.longStride1:asm  thrpt                 NaN                   ---

As we can see, Java 21 unrolls these loops in the same way as Java 17: by 16 for the int counter and by 8 for the long counter.
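To compare the two result tables, the relative throughput of the long-counter loop can be computed directly from the reported scores. The helper below is a small sketch: the class and method names are hypothetical, and the numbers are copied from the Java 17 and Java 21 tables above.

```java
class RelativeThroughput {
    /** Throughput of the long-counter loop as a percentage of the int-counter loop. */
    static double percentOfInt(double longScore, double intScore) {
        return 100.0 * longScore / intScore;
    }

    /** True when [a - aErr, a + aErr] and [b - bErr, b + bErr] share at least one point. */
    static boolean intervalsOverlap(double a, double aErr, double b, double bErr) {
        return a - aErr <= b + bErr && b - bErr <= a + aErr;
    }

    public static void main(String[] args) {
        // Scores and errors (ops/s) from the longStride1 and intStride1 rows.
        System.out.println("Java 17: " + percentOfInt(2446.834, 2493.944));
        System.out.println("Java 21: " + percentOfInt(2404.129, 2605.923));
        System.out.println("Java 17 overlap: " + intervalsOverlap(2446.834, 231.035, 2493.944, 102.343));
        System.out.println("Java 21 overlap: " + intervalsOverlap(2404.129, 66.965, 2605.923, 224.247));
    }
}
```

The ratios come out at roughly 98% for Java 17 and 92% for Java 21. In both runs the reported error intervals of the int and long benchmarks overlap, so the remaining gap should be read with some caution.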

Conclusion

Java 11 optimizes loops with int and long counters differently: a loop with a long counter reaches only about 56% of the throughput of the same loop with an int counter, which makes it nearly 2 times slower. Java 17 and Java 21 unroll both kinds of loop and apply loop strip mining to them: the loop is split into an inner loop with a safepoint poll placed after it, so the frequency of safepoint checks during loop execution can be tuned with -XX:LoopStripMiningIter.

This post is licensed under CC BY 4.0 by the author.
