Age | Commit message (Collapse) | Author |
|
this profiles slightly better? not entirely sure why...
|
|
|
|
|
|
|
|
|
|
|
|
|
|
the evex route would allow "valid" instructions that have the opcode
`invalid`. this is.. not correct.
|
|
|
|
at least on my zen2.
when reading prefixes, optimize for the likely case of reading an
instruction rather than an invalid run of prefixes. checking if we've
exceeded the x86 length bound immediately after reading the byte is only
a benefit if we'd otherwise read an impossibly-long instruction; in this
case we can exit exactly at prefix byte 15 rather than potentially later
at byte 16 (assuming a one-byte instruction like `c3`), or byte ~24 (a
more complex store with immediate and displacement).
these cases are extremely unlikely in practice. more likely is that
reading a prefix byte is one of the first two or three bytes in an
instruction, and we will never benefit from checking the x86 length
bound at this point. instead, only check length bounds after decoding
the entire instruction. this penalizes the slowest path through the
decoder but speeds up the likely path about 5% on my zen2 processor.
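a minimal sketch of the deferred length check described above. this is not the yaxpeax-x86 decoder: `decode_len`, `is_legacy_prefix`, and `MAX_INSTRUCTION_LEN` are hypothetical names, and the toy decoder assumes every instruction is some run of legacy prefixes followed by a one-byte opcode. the point is only that the prefix loop carries no length comparison and the bound is checked once, after the whole instruction has been read.

```rust
/// Architectural upper bound on the length of one x86 instruction, in bytes.
const MAX_INSTRUCTION_LEN: usize = 15;

/// Hypothetical prefix test for this sketch: operand-size, address-size,
/// lock/rep, and segment-override prefixes only.
fn is_legacy_prefix(b: u8) -> bool {
    matches!(
        b,
        0x66 | 0x67 | 0xf0 | 0xf2 | 0xf3 | 0x2e | 0x36 | 0x3e | 0x26 | 0x64 | 0x65
    )
}

/// Toy decoder: consumes prefixes, then treats the next byte as a complete
/// one-byte opcode. Returns the instruction length, or `None` if the input
/// runs out or the instruction is longer than the x86 bound.
fn decode_len(bytes: &[u8]) -> Option<usize> {
    let mut len = 0;
    // Likely path: no length comparison per prefix byte.
    while is_legacy_prefix(*bytes.get(len)?) {
        len += 1;
    }
    // Count the opcode byte itself.
    len += 1;
    // Single length check, only after the whole instruction was read.
    if len > MAX_INSTRUCTION_LEN {
        None
    } else {
        Some(len)
    }
}
```

with this shape, a run of 15 prefixes followed by `c3` is rejected only after byte 16 has been read, which is the slow path the message accepts penalizing.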
additionally, begin reading instruction bytes as soon as we enter the
decoder, and before initial clearing of instruction data. again, this is
for zen2 pipeline reasons. reading the first byte and corresponding
`OPCODES` entry improves the odds that this data is available by the
time we check for `Interpretation::Prefix` in the opcode scanning
loop. then, if we did *not* load an instruction, we immediately know
another byte must be read; begin reading this byte before applying `rex`
prefixes, and as soon as a prefix is known to not be one of the
escape-code prefix bytes (c5, c4, 62, 0f). this clocked in at another ~5%
in total.
i've found that `read_volatile` is necessary to force rust to begin the
load where it's written, rather than reordering it over other data. i'm
not committed to this being a guaranteed truth.
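a small sketch of what an early load through `read_volatile` can look like. `load_first_byte` and `decode_first` are hypothetical helpers, not functions from the crate. `core::ptr::read_volatile` is a real function; the language only promises that volatile accesses are not elided or reordered relative to other volatile accesses, so whether this pins the load relative to ordinary stores is an observation, not a guarantee.

```rust
use core::ptr;

/// Hypothetical helper: issue the load of the first byte at this point in
/// the source, using a volatile read so the compiler cannot drop it.
fn load_first_byte(bytes: &[u8]) -> Option<u8> {
    let first: &u8 = bytes.first()?;
    // SAFETY: `first` is a valid, aligned reference to an initialized u8.
    Some(unsafe { ptr::read_volatile(first as *const u8) })
}

/// Hypothetical decoder entry: read the first byte before clearing
/// per-instruction state, mirroring the ordering described in the message.
fn decode_first(bytes: &[u8], state: &mut [u8; 16]) -> Option<u8> {
    // Written first, so the load is requested before the clearing stores.
    let first = load_first_byte(bytes)?;
    // Clearing of instruction data happens after the load was requested.
    *state = [0; 16];
    Some(first)
}
```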
also, don't bother checking for `Invalid`. again, `Opcode::Invalid` is a
relatively unlikely path through the decoder and `Nothing` is already
optimized for `None` cases. this appears to be another small improvement
in throughput but i wouldn't want to give it a number - it was
relatively small and may not be attributable to this effect.
|
|
|
|
|
|
|
|
|
|
this measures a bit faster. it doesn't seem like it should be. the rex
prefix checks compile identically, but a lea for a later expression
moves up and pipelines better?
|
|
also remove redundant assignments of operand_count and some OperandSpec,
and bulk-assign all registers and operands on entry to `read_instr`. this
all, taken together, shaves off about 7 cycles per decode.
|
|
|
|
|
|
|
|
|
|
|
|
also some long-mode cleanup in corresponding areas
|
|
|
|
|
|
|
|
|
|
|
|
i really didn't know rust could do this
|
|
|
|
instructions
|
|
|
|
|
|
in the future these can and will change (new operands, new instructions) and i would prefer they not be major breaking changes. applications can ignore them and probably do undesired variants anyway.
if you want to write a 1120-variant match, are you me? why would you do this
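one way rust lets an enum grow without a major version bump is the `#[non_exhaustive]` attribute. the text here does not say which mechanism the commit uses, so treat the following as an assumption: the `Opcode` enum is a three-variant hypothetical stand-in, and `is_return` is an illustrative helper.

```rust
/// Hypothetical, heavily shortened stand-in for a large opcode enum.
/// `#[non_exhaustive]` tells downstream crates that more variants may
/// appear in a later minor release.
#[non_exhaustive]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Opcode {
    Add,
    Mov,
    Ret,
}

/// Illustrative consumer: handles the variants it cares about and sends
/// everything else, including variants added later, to the wildcard arm.
pub fn is_return(op: Opcode) -> bool {
    match op {
        Opcode::Ret => true,
        // Crates outside the defining crate must include an arm like this
        // when matching on a `#[non_exhaustive]` enum.
        _ => false,
    }
}
```

inside the defining crate the attribute has no effect on matching; the wildcard requirement only applies to other crates, which is where the compatibility benefit comes from.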
|
|
the in-repo benchmark got better with this inlined but it's probably
better to leave it up to the compiler when finally stitching stuff
together. i suspect that having read_operands inlined resulted in just
too many live values, and the compiler was inspired to play hijinks that
pipelined poorly. disas-bench shows a ~15% improvement from this change.
|
|
|
|
vmov* are.. somehow messed up too
|
|
does intel know no bounds
|
|
|
|
|
|
|
|
|
|
decoder flag to come
|
|
this is... a more significant rewrite than i expected yaxpeax-x86 to
ever need. it turns out that capstone is extremely permissive about
duplicative 66/f2/f3 prefixes to the point that the implemented prefix
handling was unsalvageable.
while this replaces the *0f* opcode tables, i haven't profiled these
changes. it's possible this is a net improvement for single-byte
opcodes, it could be a net loss. code size may be severely impacted.
there is still work to do.
but this in total gets very close to iced/xed/zydis parity, far more
than before.
also adds several small extensions, gfni, 3dnow, enqcmd, invpcid, some
of cet, and a few missing avx instructions.
|
|
|
|
|
|
initial work to optionally discard any instruction printing support.
when using `-Z build-std` to fully remove .eh_frame, a stripped
long_mode_no_fmt .so is 61kb!
|
|
|
|
clearing reg_rrr and reg_mmm more efficiently is an extremely small win,
but a win
read_imm_signed generally should inline well and runs afoul of some
heuristic. inlining gets about 8% improved throughput on the
(unrealistic) in-repo benchmark
it would be great to be able to avoid bounds checks somehow; it looks
like they alone are another ~10% of decode time. i'm not sure how to
pull that off while retaining the generic iterator parameter. might just
not be possible.
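for illustration, a sketch of a signed immediate reader over a generic byte iterator with a forced-inline hint. the signature is hypothetical and may not match the crate's `read_imm_signed`; it only shows the shape being discussed. every `next()` returns an `Option`, which is the per-byte check the message would like to avoid.

```rust
/// Hypothetical signature: read a little-endian signed immediate of
/// `size` bytes (1, 2, or 4) and sign-extend it to i64.
#[inline(always)]
fn read_imm_signed<I: Iterator<Item = u8>>(bytes: &mut I, size: u8) -> Option<i64> {
    match size {
        // `as i8` reinterprets the byte, `as i64` sign-extends it.
        1 => Some(bytes.next()? as i8 as i64),
        2 => {
            // Array elements are evaluated left to right, so bytes are
            // consumed in stream order.
            let raw = [bytes.next()?, bytes.next()?];
            Some(i16::from_le_bytes(raw) as i64)
        }
        4 => {
            let raw = [bytes.next()?, bytes.next()?, bytes.next()?, bytes.next()?];
            Some(i32::from_le_bytes(raw) as i64)
        }
        // Other sizes are out of scope for this sketch.
        _ => None,
    }
}
```

`#[inline(always)]` is a strong hint rather than a hard guarantee, and whether it helps depends on the caller, which is consistent with the caveat that the in-repo benchmark is unrealistic.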
|
|
* `mwaitx`, `monitorx`, `rdpru`, and `clzero` are now supported
* swapgs is no longer decoded in protected mode
* rdpkru and wrpkru are no longer decoded if mod bits != 11
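the "mod bits != 11" condition refers to the top two bits of the ModRM byte. a minimal sketch of that check, with hypothetical helper names `modrm_fields` and `is_register_form` (not the crate's API):

```rust
/// Split a ModRM byte into its (mod, reg, rm) fields.
fn modrm_fields(modrm: u8) -> (u8, u8, u8) {
    (modrm >> 6, (modrm >> 3) & 0b111, modrm & 0b111)
}

/// True when the mod field is 0b11, i.e. the register-direct form.
/// A decoder would reject encodings that require this form when the
/// check fails.
fn is_register_form(modrm: u8) -> bool {
    modrm >> 6 == 0b11
}
```

for example, the byte `ee` splits into mod `11`, reg `101`, rm `110`, while `2e` has the same low six bits but mod `00`, so it would not pass the check.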
|
|
base 0b101
for memory operands with a base, index, and displacement either
the wrong base would be selected (register number ignored, so only
`*ax` or `r8*` would be reported), or yaxpeax-x86 would report a
base register is present when it is not (`RegIndexBaseScaleDisp`
when the operand is actually `RegScaleDisp`)
thank you to Evan Johnson for catching and reporting this bug!
also bump crate version to 0.1.4 as this will be immediately tagged and
released.
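for context on the base 0b101 case, here is a sketch of the standard x86 SIB rule, not the actual fix in this commit. `Sib` and `decode_sib` are hypothetical names, and REX/VEX extension bits are ignored, so register numbers are 0 through 7 only.

```rust
/// Hypothetical decoded form of a SIB byte for this sketch.
#[derive(Debug, PartialEq, Eq)]
struct Sib {
    scale: u8,
    index: Option<u8>,
    base: Option<u8>,
}

/// Decode a SIB byte given the ModRM mod field (extension bits ignored).
fn decode_sib(mod_bits: u8, sib: u8) -> Sib {
    let scale = 1u8 << (sib >> 6);
    let index = (sib >> 3) & 0b111;
    let base = sib & 0b111;
    Sib {
        scale,
        // Index 0b100 encodes "no index register".
        index: if index == 0b100 { None } else { Some(index) },
        // Base 0b101 with mod == 00 encodes "no base register"; a 32-bit
        // displacement follows instead. With mod 01 or 10 it is a real base.
        base: if base == 0b101 && mod_bits == 0b00 {
            None
        } else {
            Some(base)
        },
    }
}
```

so SIB byte `8d` under mod `00` is index register 1 scaled by 4 with a displacement and no base, while the same byte under mod `01` does have base register 5.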
|