source/assembly/x86/perf/sum_low_nibbles.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171

# How to add low nibbles real fast

Once upon a time I happened to need to sum only the low nibble of each byte in an array.  

The initial version of this code was in C#:  
  public int sum(byte[] xs) {
    int sum = 0;
    for (int i = 0; i < xs.Length; i++) {
      sum += xs[i] & 0x0f;
    }
    return sum;
  }
Simple enough. Out of curiosity, I looked at the assembly mono produced (with mono --aot -O=all test.exe, then looking at sym.sum _TODO: NAME_ with radare):
  _TODO: RADARE OUTPUT_ with colors maybe?
  for now: https://www.iximeow.net/public_files/before.png
This is fractally subpar, and a little surprising considering -O=all should be the full gamut of optimization mono can do, the best it has to show!
0. Look at the trailing `movsxd rax, dword [r14 + 0x18]`. As evidenced by this and the above `lea rax, [r14 + rax + 0x20]`, Mono's array implementation is as a class (with likely some vtable-y thing at 0x08 and reflection info at 0x00. _TODO: VERIFY_) with array length at 0x18 into the struct and the data starting at 0x20. That means that in every iteration of the loop, wmono decided it must re-load the array length! Given this very tiny loop, an extra read from memory is very impactful on overall performance. It's not hard to see that r14 is never assigned to, and is local to this function, so the array length would never get changed, even if another thread were executing and reassigned the original array reference! I would have expected mono to hoist that length read to the top of the function, outside the loop body.
0. `mov byte [rsp + local_18h], al`. I can only imagine this is generated because the original bytecode stores the result of `xs[i]` in a local variable for a moment, and faithfully transcribes the local variable assignment into machine code. However, the jitter also realizes the value is already in a register (al), and continues to use the register. This value on the stack is, then, never read from, and overwritten every iteration of the loop. Again with the unnecessary memory activity!!
0. Register musical chairs. `rax` is used as an index at the start of the loop body, then overwritten with the array element, then used to hold the accumulated sum, then overwritten with the array length as described above. r15d holds the array index that's actually incremented each step of the loop. So, it should be entirely valid to have not produced the `movsxd rax, r15d`, then use `r15` in `lea rax, [r14 + r15 + 0x20]`. This frees up one use of rax, leaving it used as the array element, sum, and array length.
0. Also on the topic of the running sum, it's stored in r13, but this generated code moves from r13 to rax, adds the nibble from the arary element, and then moves the value back into r13! It is entirely valid to simply add directly into r13 and remove both `mov rax, r13` and `mov r13, rax` after.

These are the most obvious cases where this code is sub-optimal, so for grins I patched the compiled binary into something that should be a little more performant:  
  _TODO: Exactly that._

_TODO: deeper writeup of each change._
Because I wanted to avoid having to move code around (updating relative jumps is fairly annoting), I made most of these changes simultaneously.
0. Considering that we have a handful of unused registers in this function, it's reasonable to put the array length in a different register. I picked `rdx` for no particular reason.
  - Move `movsxd rax, dword [r14 + 0x18]` to `movsxd rdx, dword [r14 + 0x18]` where `jmp _cmp_` currently is. Move `jmp _cmp` down and fix up the offset to still jump to `cmp r15d, eax`. Rewrite `cmp r15d, eax` to `cmp r15d, edx`. For maximum carefulness, insert a `push edx`/`pop edx` pair around this loop now. (I didn't)
0. nop out `mov byte [rsp + local_18h], al`. Easy :)
0. nop out `mov rax, r13` and `mov r13, rax`. Rewrite `add eax, ecx` to `add r13d, ecx`.
0. nop out `movsxd rax, r15d` and `movsx rax, byte [rax]`, rewrite `lea rax, [r14 + rax + 0x20]` as `movsx rcx, byte [r14 + r15 + 0x20]`, simultaneously merging the two instructions and freeing two uses of rax
0. nop out `movzx rcx, al`
0. because it makes no sense to execute nops every iteration of this loop when we care about speed, pack all the above changes together before `cmp r15d, rdx`, with any added nop after the `xor r15d, r15d; xor r13d, r13d; jmp _loop_` region.
0. don't forget to fix up the backwards `jl _start_`! rewrite the `7cd4` with `7c <appropriate size>`

If you run it, you'll see it's twice as fast now! Too bad trying to ship hand-patched AOT'd mono binaries isn't a good idea :)

## But wait, there's more

At this point I was bitten by the performance bug. Excatly **HOW QUICKLY** can my CPU add a bunch of four bit numbers together? I figure at this poing it's easier to write the assembly myself instead of continuing to patch a binary that will be blown away as soon as I recompile.

So, the baseline is a straight copy of the code mono produced:
  section .text
  global _start:
    call do_work
    mov rax, 60
    mov rdi, r13
    syscall

  do_work:
    mov r14, arrrrrr
    xor r15, r15
    xor r13, r13
    jmp tst
    nop
  lp:
    movsxd rax, r15d
    lea rax, [r14 + rax + 0x8]
    movsx rax, byte [rax]
    mov byte [rsp + 0x18], al
    movzx rcx, al
    and ecx, 0xf
    mov rax, r13
    add eax, ecx
    inc r15
    mov r13, rax
  tst:
    movsxd rax, dword [r14]
    cmp r15, rax
    jl lp
    ret

  arrrrrr:
    dd 134217728
    times 134217728 db 57h

Which you can build with `nasm -f elf -o array.o <wherever you put it> && ld array.o`.

To no surprise, this is as slow as the initial mono code. Pay no attention to the superfluous stack write that trashes some data somewhere :) _TODO: FIX THAT._


### Onward, to our own code!

Now, I wrote up the hand-optimized version:
  section .text
  global _start:
    call do_work
    mov rax, 60
    mov rdi, r13
    syscall

  do_work:
    mov r14, arrrrrr
    xor r15, r15
    xor r13, r13
    movsxd rax, dword [r14]
    jmp tst
    nop
  lp:
    mov cl, byte [r14 + r15 + 0x8]
    and cl, 0xf
    add r13, rcx
    inc r15
  tst:
    cmp r15, rax
    jl lp
    ret

  arrrrrr:
    dd 134217728
    times 134217728 db 57h

And like before, this code is almost twice as fast on systems I tested on. Results are exactly what we should expect so far!

After some thoughtful consideration, I realized it might be slightly faster, if not definitely more compact, to leverage lodsb:  
  replace `mov cl, byte [r14 + r15 + 0x8]` with `lodsb`, having rsi initialized before the loop to `r14 + 0x8` rewrite uses of `cl` and `rcx` to `al` as appropriate (and put the array length in `rdi` instead of `rax`).
  remove `inc r15`, as `rsi` is incremented by lodsb automagically.

This ends up being another 10% improvement or so! Probably from a combination of two(?) less additions in the loop (by removing `r14 + r15 + 0x8` and `inc r15` for the implicit increment with `lodsb`)

But can we do better? `lodsb` has variants.. what about using `lodsd` to read four bytes at a time?  

_TODO: write up lodsd variant and masking by 0x0f0f0f0f, perf improvement again_

#### Enter MMX

Having heard of vector operations before, but never having a reason to use them myself (either not working in a low enough language to be able to write them, or working on problems that they don't help much with), I figured there's probably *something* to make adding four bytes at once *really really fast*. I ended up running across `psadbw`. summing each byte after pairwise subtraction of each byte of a pair of vector registers. sum({A,B,C,D,E,F,G,H} - {0,0,0,0,0,0,0,0}) is just the sum of each value though, so `psadbw` on a pair of registers, one with my nibbles, the other with zero, should do exactly the right addtion, in *fast*. I hope.

In addition, we have to load 64 bits of 0x0f0f0f0f0f0f0f0f into a reigster to mask the loadq'd value by. No need to worry about overflow here because the largest value is 8 * 0xf == 0x78.

In the end,
  section .text
  global _start:
    call do_work
    mov rax, 60
    mov rdi, r13
    syscall

  do_work:
    mov r14, arrrrrr
    xor r15, r15
    xor r13, r13
    mov rcx, 0x0f0f0f0f
    rcl rcx, 32
    or rcx, 0x0f0f0f0f
    mov rax, 0
    movq mm1, rax

    mov ebx, dword [r14]
    lea rsi, [r14 + 0x8]
    test rbx, rbx
    jmp tst
    nop
  lp:
    lodsq
    and rax, rcx
    movq mm0, rax
    psadbw mm0, mm1
    movq rax, mm0
    add r13, rax

    sub rbx, 8
  tst:
    jnz lp
    ret

  arrrrrr:
    dd 134217728
    times 134217728 db 57h

And this is *really fast*. Rough math says ~6gb/s of data processing through this loop on my desktop _TODO: STATS_