fix formatting issues, expand more on everything

author: iximeow <me@iximeow.net> 2017-07-24 08:55:25 -0700
committer: iximeow <me@iximeow.net> 2017-07-24 08:55:25 -0700
commit: 61783b45ac3b68e2026c45ae6b3e3533f62a4e53 (patch)
tree: 30342018cdcac553643b21ceafdfdd3190e676be /source/assembly/x86
parent: 9c7462d90cf6c48b0cd82cbf3ae4e8a458ed77f7 (diff)
1 files changed, 137 insertions, 130 deletions
diff --git a/source/assembly/x86/perf/sum_low_nibbles.md b/source/assembly/x86/perf/sum_low_nibbles.md
index 744bbb4..274ba08 100644
--- a/source/assembly/x86/perf/sum_low_nibbles.md
+++ b/source/assembly/x86/perf/sum_low_nibbles.md
@@ -3,35 +3,39 @@
 Once upon a time I happened to need to sum only the low nibble of each byte in an array.  
 
 The initial version of this code was in C#:  
-  public int sum(byte[] xs) {
-    int sum = 0;
-    for (int i = 0; i < xs.Length; i++) {
-      sum += xs[i] & 0x0f;
+
+    public int sum(byte[] xs) {
+      int sum = 0;
+      for (int i = 0; i < xs.Length; i++) {
+        sum += xs[i] & 0x0f;
+      }
+      return sum;
     }
-    return sum;
-  }
+
 Simple enough. Out of curiosity, I looked at the assembly mono produced (with mono --aot -O=all test.exe, then looking at sym.sum _TODO: NAME_ with radare):
   _TODO: RADARE OUTPUT_ with colors maybe?
-  for now: https://www.iximeow.net/public_files/before.png
+  for now: https://www.iximeow.net/public_files/before.png  
 This is fractally subpar, and a little surprising considering -O=all should be the full gamut of optimization mono can do, the best it has to show!
-0. Look at the trailing `movsxd rax, dword [r14 + 0x18]`. As evidenced by this and the above `lea rax, [r14 + rax + 0x20]`, Mono's array implementation is as a class (with likely some vtable-y thing at 0x08 and reflection info at 0x00. _TODO: VERIFY_) with array length at 0x18 into the struct and the data starting at 0x20. That means that in every iteration of the loop, wmono decided it must re-load the array length! Given this very tiny loop, an extra read from memory is very impactful on overall performance. It's not hard to see that r14 is never assigned to, and is local to this function, so the array length would never get changed, even if another thread were executing and reassigned the original array reference! I would have expected mono to hoist that length read to the top of the function, outside the loop body.
-0. `mov byte [rsp + local_18h], al`. I can only imagine this is generated because the original bytecode stores the result of `xs[i]` in a local variable for a moment, and faithfully transcribes the local variable assignment into machine code. However, the jitter also realizes the value is already in a register (al), and continues to use the register. This value on the stack is, then, never read from, and overwritten every iteration of the loop. Again with the unnecessary memory activity!!
-0. Register musical chairs. `rax` is used as an index at the start of the loop body, then overwritten with the array element, then used to hold the accumulated sum, then overwritten with the array length as described above. r15d holds the array index that's actually incremented each step of the loop. So, it should be entirely valid to have not produced the `movsxd rax, r15d`, then use `r15` in `lea rax, [r14 + r15 + 0x20]`. This frees up one use of rax, leaving it used as the array element, sum, and array length.
-0. Also on the topic of the running sum, it's stored in r13, but this generated code moves from r13 to rax, adds the nibble from the arary element, and then moves the value back into r13! It is entirely valid to simply add directly into r13 and remove both `mov rax, r13` and `mov r13, rax` after.
+
+* Look at the trailing `movsxd rax, dword [r14 + 0x18]`. As evidenced by this and the above `lea rax, [r14 + rax + 0x20]`, Mono's array implementation is as a class (with likely some vtable-y thing at 0x08 and reflection info at 0x00. _TODO: VERIFY_) with array length at 0x18 into the struct and the data starting at 0x20. That means that in every iteration of the loop, wmono decided it must re-load the array length! Given this very tiny loop, an extra read from memory is very impactful on overall performance. It's not hard to see that r14 is never assigned to, and is local to this function, so the array length would never get changed, even if another thread were executing and reassigned the original array reference! I would have expected mono to hoist that length read to the top of the function, outside the loop body.  
+* `mov byte [rsp + local_18h], al`. I can only imagine this is generated because the original bytecode stores the result of `xs[i]` in a local variable for a moment, and faithfully transcribes the local variable assignment into machine code. However, the jitter also realizes the value is already in a register (al), and continues to use the register. This value on the stack is, then, never read from, and overwritten every iteration of the loop. Again with the unnecessary memory activity!!  
+* Register musical chairs. `rax` is used as an index at the start of the loop body, then overwritten with the array element, then used to hold the accumulated sum, then overwritten with the array length as described above. r15d holds the array index that's actually incremented each step of the loop. So, it should be entirely valid to have not produced the `movsxd rax, r15d`, then use `r15` in `lea rax, [r14 + r15 + 0x20]`. This frees up one use of rax, leaving it used as the array element, sum, and array length.
+* Also on the topic of the running sum, it's stored in r13, but this generated code moves from r13 to rax, adds the nibble from the arary element, and then moves the value back into r13! It is entirely valid to simply add directly into r13 and remove both `mov rax, r13` and `mov r13, rax` after.
 
 These are the most obvious cases where this code is sub-optimal, so for grins I patched the compiled binary into something that should be a little more performant:  
   _TODO: Exactly that._
 
 _TODO: deeper writeup of each change._
 Because I wanted to avoid having to move code around (updating relative jumps is fairly annoting), I made most of these changes simultaneously.
-0. Considering that we have a handful of unused registers in this function, it's reasonable to put the array length in a different register. I picked `rdx` for no particular reason.
-  - Move `movsxd rax, dword [r14 + 0x18]` to `movsxd rdx, dword [r14 + 0x18]` where `jmp _cmp_` currently is. Move `jmp _cmp` down and fix up the offset to still jump to `cmp r15d, eax`. Rewrite `cmp r15d, eax` to `cmp r15d, edx`. For maximum carefulness, insert a `push edx`/`pop edx` pair around this loop now. (I didn't)
-0. nop out `mov byte [rsp + local_18h], al`. Easy :)
-0. nop out `mov rax, r13` and `mov r13, rax`. Rewrite `add eax, ecx` to `add r13d, ecx`.
-0. nop out `movsxd rax, r15d` and `movsx rax, byte [rax]`, rewrite `lea rax, [r14 + rax + 0x20]` as `movsx rcx, byte [r14 + r15 + 0x20]`, simultaneously merging the two instructions and freeing two uses of rax
-0. nop out `movzx rcx, al`
-0. because it makes no sense to execute nops every iteration of this loop when we care about speed, pack all the above changes together before `cmp r15d, rdx`, with any added nop after the `xor r15d, r15d; xor r13d, r13d; jmp _loop_` region.
-0. don't forget to fix up the backwards `jl _start_`! rewrite the `7cd4` with `7c <appropriate size>`
+
+* Considering that we have a handful of unused registers in this function, it's reasonable to put the array length in a different register. I picked `rdx` for no particular reason.
+* Move `movsxd rax, dword [r14 + 0x18]` to `movsxd rdx, dword [r14 + 0x18]` where `jmp _cmp_` currently is. Move `jmp _cmp` down and fix up the offset to still jump to `cmp r15d, eax`. Rewrite `cmp r15d, eax` to `cmp r15d, edx`. For maximum carefulness, insert a `push edx`/`pop edx` pair around this loop now. (I didn't)
+* nop out `mov byte [rsp + local_18h], al`. Easy :)
+* nop out `mov rax, r13` and `mov r13, rax`. Rewrite `add eax, ecx` to `add r13d, ecx`.
+* nop out `movsxd rax, r15d` and `movsx rax, byte [rax]`, rewrite `lea rax, [r14 + rax + 0x20]` as `movsx rcx, byte [r14 + r15 + 0x20]`, simultaneously merging the two instructions and freeing two uses of rax
+* nop out `movzx rcx, al`
+* because it makes no sense to execute nops every iteration of this loop when we care about speed, pack all the above changes together before `cmp r15d, rdx`, with any added nop after the `xor r15d, r15d; xor r13d, r13d; jmp _loop_` region.
+* don't forget to fix up the backwards `jl _start_`! rewrite the `7cd4` with `7c <appropriate size>`
 
 If you run it, you'll see it's twice as fast now! Too bad trying to ship hand-patched AOT'd mono binaries isn't a good idea :)
 
@@ -40,39 +44,40 @@ If you run it, you'll see it's twice as fast now! Too bad trying to ship hand-pa
 At this point I was bitten by the performance bug. Excatly **HOW QUICKLY** can my CPU add a bunch of four bit numbers together? I figure at this poing it's easier to write the assembly myself instead of continuing to patch a binary that will be blown away as soon as I recompile.
 
 So, the baseline is a straight copy of the code mono produced:
-  section .text
-  global _start:
-    call do_work
-    mov rax, 60
-    mov rdi, r13
-    syscall
-
-  do_work:
-    mov r14, arrrrrr
-    xor r15, r15
-    xor r13, r13
-    jmp tst
-    nop
-  lp:
-    movsxd rax, r15d
-    lea rax, [r14 + rax + 0x8]
-    movsx rax, byte [rax]
-    mov byte [rsp + 0x18], al
-    movzx rcx, al
-    and ecx, 0xf
-    mov rax, r13
-    add eax, ecx
-    inc r15
-    mov r13, rax
-  tst:
-    movsxd rax, dword [r14]
-    cmp r15, rax
-    jl lp
-    ret
-
-  arrrrrr:
-    dd 134217728
-    times 134217728 db 57h
+
+    section .text
+    global _start:
+      call do_work
+      mov rax, 60
+      mov rdi, r13
+      syscall
+
+    do_work:
+      mov r14, arrrrrr
+      xor r15, r15
+      xor r13, r13
+      jmp tst
+      nop
+    lp:
+      movsxd rax, r15d
+      lea rax, [r14 + rax + 0x8]
+      movsx rax, byte [rax]
+      mov byte [rsp + 0x18], al
+      movzx rcx, al
+      and ecx, 0xf
+      mov rax, r13
+      add eax, ecx
+      inc r15
+      mov r13, rax
+    tst:
+      movsxd rax, dword [r14]
+      cmp r15, rax
+      jl lp
+      ret
+
+    arrrrrr:
+      dd 134217728
+      times 134217728 db 57h
 
 Which you can build with `nasm -f elf -o array.o <wherever you put it> && ld array.o`.
 
@@ -82,90 +87,92 @@ To no surprise, this is as slow as the initial mono code. Pay no attention to th
 ### Onward, to our own code!
 
 Now, I wrote up the hand-optimized version:
-  section .text
-  global _start:
-    call do_work
-    mov rax, 60
-    mov rdi, r13
-    syscall
-
-  do_work:
-    mov r14, arrrrrr
-    xor r15, r15
-    xor r13, r13
-    movsxd rax, dword [r14]
-    jmp tst
-    nop
-  lp:
-    mov cl, byte [r14 + r15 + 0x8]
-    and cl, 0xf
-    add r13, rcx
-    inc r15
-  tst:
-    cmp r15, rax
-    jl lp
-    ret
-
-  arrrrrr:
-    dd 134217728
-    times 134217728 db 57h
-
-And like before, this code is almost twice as fast on systems I tested on. Results are exactly what we should expect so far!
+
+    section .text
+    global _start:
+      call do_work
+      mov rax, 60
+      mov rdi, r13
+      syscall
+
+    do_work:
+      mov r14, arrrrrr
+      xor r15, r15
+      xor r13, r13
+      movsxd rax, dword [r14]
+      jmp tst
+      nop
+    lp:
+      mov cl, byte [r14 + r15 + 0x8]
+      and cl, 0xf
+      add r13, rcx
+      inc r15
+    tst:
+      cmp r15, rax
+      jl lp
+      ret
+
+    arrrrrr:
+      dd 134217728
+      times 134217728 db 57h
+
+And like before, this code is almost twice as fast on systems I tested on. Results are exactly what we should expect so far!  
 
 After some thoughtful consideration, I realized it might be slightly faster, if not definitely more compact, to leverage lodsb:  
-  replace `mov cl, byte [r14 + r15 + 0x8]` with `lodsb`, having rsi initialized before the loop to `r14 + 0x8` rewrite uses of `cl` and `rcx` to `al` as appropriate (and put the array length in `rdi` instead of `rax`).
-  remove `inc r15`, as `rsi` is incremented by lodsb automagically.
+  replace `mov cl, byte [r14 + r15 + 0x8]` with `lodsb`, having rsi initialized before the loop to `r14 + 0x8` rewrite uses of `cl` and `rcx` to `al` as appropriate (and put the array length in `rdi` instead of `rax`).  
+  remove `inc r15`, as `rsi` is incremented by lodsb automagically.  
 
-This ends up being another 10% improvement or so! Probably from a combination of two(?) less additions in the loop (by removing `r14 + r15 + 0x8` and `inc r15` for the implicit increment with `lodsb`)
+This ends up being another 10% improvement or so! Probably from a combination of two(?) less additions in the loop (by removing `r14 + r15 + 0x8` and `inc r15` for the implicit increment with `lodsb`)  
 
 But can we do better? `lodsb` has variants.. what about using `lodsd` to read four bytes at a time?  
 
-_TODO: write up lodsd variant and masking by 0x0f0f0f0f, perf improvement again_
+_TODO: write up lodsd variant and masking by 0x0f0f0f0f, perf improvement again_  
 
 #### Enter MMX
 
-Having heard of vector operations before, but never having a reason to use them myself (either not working in a low enough language to be able to write them, or working on problems that they don't help much with), I figured there's probably *something* to make adding four bytes at once *really really fast*. I ended up running across `psadbw`. summing each byte after pairwise subtraction of each byte of a pair of vector registers. sum({A,B,C,D,E,F,G,H} - {0,0,0,0,0,0,0,0}) is just the sum of each value though, so `psadbw` on a pair of registers, one with my nibbles, the other with zero, should do exactly the right addtion, in *fast*. I hope.
-
-In addition, we have to load 64 bits of 0x0f0f0f0f0f0f0f0f into a reigster to mask the loadq'd value by. No need to worry about overflow here because the largest value is 8 * 0xf == 0x78.
-
-In the end,
-  section .text
-  global _start:
-    call do_work
-    mov rax, 60
-    mov rdi, r13
-    syscall
-
-  do_work:
-    mov r14, arrrrrr
-    xor r15, r15
-    xor r13, r13
-    mov rcx, 0x0f0f0f0f
-    rcl rcx, 32
-    or rcx, 0x0f0f0f0f
-    mov rax, 0
-    movq mm1, rax
-
-    mov ebx, dword [r14]
-    lea rsi, [r14 + 0x8]
-    test rbx, rbx
-    jmp tst
-    nop
-  lp:
-    lodsq
-    and rax, rcx
-    movq mm0, rax
-    psadbw mm0, mm1
-    movq rax, mm0
-    add r13, rax
-
-    sub rbx, 8
-  tst:
-    jnz lp
-    ret
-
-  arrrrrr:
-    dd 134217728
-    times 134217728 db 57h
-
-And this is *really fast*. Rough math says ~6gb/s of data processing through this loop on my desktop _TODO: STATS_
+Having heard of vector operations before, but never having a reason to use them myself (either not working in a low enough language to be able to write them, or working on problems that they don't help much with), I figured there's probably *something* to make adding four bytes at once *really really fast*. I ended up running across `psadbw`. summing each byte after pairwise subtraction of each byte of a pair of vector registers. sum({A,B,C,D,E,F,G,H} - {0,0,0,0,0,0,0,0}) is just the sum of each value though, so `psadbw` on a pair of registers, one with my nibbles, the other with zero, should do exactly the right addtion, in *fast*. I hope.  
+
+In addition, we have to load 64 bits of 0x0f0f0f0f0f0f0f0f into a reigster to mask the loadq'd value by. No need to worry about overflow here because the largest value is 8 * 0xf == 0x78.  
+
+In the end,  
+
+    section .text
+    global _start:
+      call do_work
+      mov rax, 60
+      mov rdi, r13
+      syscall
+
+    do_work:
+      mov r14, arrrrrr
+      xor r15, r15
+      xor r13, r13
+      mov rcx, 0x0f0f0f0f
+      rcl rcx, 32
+      or rcx, 0x0f0f0f0f
+      mov rax, 0
+      movq mm1, rax
+
+      mov ebx, dword [r14]
+      lea rsi, [r14 + 0x8]
+      test rbx, rbx
+      jmp tst
+      nop
+    lp:
+      lodsq
+      and rax, rcx
+      movq mm0, rax
+      psadbw mm0, mm1
+      movq rax, mm0
+      add r13, rax
+
+      sub rbx, 8
+    tst:
+      jnz lp
+      ret
+
+    arrrrrr:
+      dd 134217728
+      times 134217728 db 57h
+
+And this is *really fast*. Rough math says ~6gb/s of data processing through this loop on my desktop _TODO: STATS_
author	iximeow <me@iximeow.net>	2017-07-24 08:55:25 -0700
committer	iximeow <me@iximeow.net>	2017-07-24 08:55:25 -0700
commit	61783b45ac3b68e2026c45ae6b3e3533f62a4e53 (patch)
tree	30342018cdcac553643b21ceafdfdd3190e676be /source/assembly/x86
parent	9c7462d90cf6c48b0cd82cbf3ae4e8a458ed77f7 (diff)