Age | Commit message (Collapse) | Author |
|
|
|
unfortunately because of the layout of instruction information this
*adds* lines rather than removes them..
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
this applies
* f338c74656f6eef8b3080fa9f249b1cb733fd1a9
* bece19e6a69b158893abbf56a6cac25eb25d9a32
* 6353f58170d28a142e3b012c2c86f684d50dea45
* 67be1c0983244645a3c762b7aa0601f0d0ba4bb3
* 091f1d66ef853d6339a96e43d71c137ee7d3907a
as one unit to both the 16-bit and 32-bit decoders.
|
|
|
|
these don't need the extra `rex`-supporting index space, so they don't
have it.
|
|
|
|
|
|
the overwhelming majority of x86 instructions are either a single-byte
opcode or a single-byte opcode with a rex prefix. supporting these
specially means that we don't have to length-check on every byte or
go through the full decode loop while reading the most likely
instructions. this is a significant improvement on typical x86 streams,
but comes at a moderate penalty for crafted x86 instructions.
the penalty is still not very bad, as the fast path is exited in favor
of the full decode loop as soon as we see a non-rex prefix byte; this
adds maybe a dozen instructions to the slow path.
|
|
|
|
sharing vex/evex invalid prefix checks improves codegen a bit, but
ordering prefix checks by likeliest prefix first reduces time falling
through prefix handling arms. both together are a notable improvement in
throughput on typical x86 code.
bundled in here is some code motion to where `mem_size = 0` and
`operand_count = 2` are executed; this is because, at least on zen2 and
cascade lake parts, bunching all stores to the instruction together
caused small stalls getting into the decoder. spreading out stores seems
to mix these assignments with parts of code that was not using memory
anyway, and pipelines better.
|
|
cleanliness, but also slightly better codegen somehow?
|
|
the match compiled into some indirect branch awfulness!! no thank you
|
|
|
|
the correct bank is applied far after register numbers are read. a
correct annotation would need to know to defer emission until setting
register banks, but also would need to work backwards for the number of
bits between the current byte and modrm. not impossible, but substantial
refactoring.
|
|
|
|
|
|
|
|
this includes a `Makefile` that exercises the various crate configs.
most annoyingly, several doc comments needed to grow
`#[cfg(feature="fmt")]` blocks so docs continue to build with that
feature enabled or disabled.
carved out a way to run exhaustive tests; they should be written as
`#[ignore]`, and then the makefile will run even ignored tests on the
expectation that this will run the exhaustive (but slower) suite.
exhaustive tests are not yet written. they'll probably involve spanning
4 byte sequences from 0 to 2^32-1.
|
|
|
|
|
|
|
|
|
|
unfortunately something about the wrapper functions adjusts codegen even when
the wrapper functions themselves are just calls to inner functions. the in-tree
benchmark (known to not be comprehensive, but enough to spot a difference),
showed a ~3.5% regression in throughput with the prior commit, even though it
doesn't change behavior at all.
explicit #[inline(always)] gets things to a state where the wrapper functions
do not penalize performance. for an example of the differences in codegen, see
below.
before:
```
< 141d4: 48 39 fa cmp %rdi,%rdx
< 141d7: 0f 84 0b 4d 00 00 je 18ee8 <_ZN5bench16do_decode_swathe17h694154735739ce4cE+0x4e58>
< 141dd: 0f b6 0f movzbl (%rdi),%ecx
< 141e0: 48 83 c7 01 add $0x1,%rdi
< 141e4: 48 89 7c 24 38 mov %rdi,0x38(%rsp)
... snip ...
```
after:
```
> 141d4: 48 39 ea cmp %rbp,%rdx
> 141d7: 0f 84 97 4c 00 00 je 18e74 <_ZN5bench16do_decode_swathe17h694154735739ce4cE+0x4de4>
> 141dd: 0f b6 4d 00 movzbl 0x0(%rbp),%ecx
> 141e1: 48 83 c5 01 add $0x1,%rbp
> 141e5: 48 89 6c 24 38 mov %rbp,0x38(%rsp)
... snip ...
```
there are several spans of code with this kind of change involved; there are no
explicit calls to `get_kinda_unchecked` or `unreachable_kinda_unchecked` but
clearly a difference did make it through to the benchmark's code.
while the choice of `rbp` instead of `rdi` wouldn't seem very interesting, the
instructions themselves are more substantially different. `0fb60f` vs
`0fb64d00`; to encode `[rbp + 0]`, the instruction requires a displacement,
and is one byte longer as a result. there are several instructions
so-impacted, and i suspect the increased code size is what ended up changing
benchmark behavior.
after adding these `#[inline(always)]` annotations, there is no difference in
generated code with or without the `kinda_unchecked` helpers!
|
|
Closes https://github.com/iximeow/yaxpeax-x86/issues/16
|
|
actual release is being held until cargo fuzz runs a while without a panic
|
|
|
|
not only did the instruction have wrong data, but if displayed, the
formatter would panic.
|
|
in the process, fix 64-bit rex-byte limit, 32/16-bit mode mask reg limit
|
|
`apply_disp_scale` forgot that `wrapping_mul` exists, so we don't need
to explicitly write the size of value that `mem_size` should be cast to,
in casting to/from a signed integer. taken with `.into()`, we don't need
per-architecture stubs to make evex decoding work.
|
|
|
|
|
|
so multiplying to expand EVEX compressed offsets can overflow, and that
needs to be okay.
|
|
|
|
|
|
alas
|
|
|
|
|
|
|
|
This makes generated docs refer to a type and show said type in the list of all structs rather than rustdoc showing gray text in return types.
quote doc references
|
|
|
|
|
|
|
|
|