Commit Graph

13433 Commits

Author SHA1 Message Date
Kovid Goyal
920b8a2496 Use VZEROUPPER in avx functions
See https://www.intel.com/content/dam/develop/external/us/en/documents/11mc12-avoiding-2bavx-sse-2btransition-2bpenalties-2brh-2bfinal-809104.pdf
2024-02-25 09:57:40 +05:30
Kovid Goyal
5a5e31c38b Also zero upper at start of function 2024-02-25 09:57:40 +05:30
Kovid Goyal
db2e0e816d Fix mixing of register types in the same function 2024-02-25 09:57:40 +05:30
Kovid Goyal
a298781b85 DRYer 2024-02-25 09:57:40 +05:30
Kovid Goyal
d5cd9ef2ca ... 2024-02-25 09:57:40 +05:30
Kovid Goyal
55c909c656 Use -mtune=intel for SIMD files when building without native optimizations 2024-02-25 09:57:40 +05:30
Kovid Goyal
da31db3212 ... 2024-02-25 09:57:40 +05:30
Kovid Goyal
601c4ad4df Fix some typos 2024-02-25 09:57:40 +05:30
Kovid Goyal
2549b4328f Update throughput comparison table in light of latest improvements 2024-02-25 09:57:40 +05:30
Kovid Goyal
68d800d4fa make clean should clean generated asm as well 2024-02-25 09:57:40 +05:30
Kovid Goyal
9fc3db1dd1 Work on C0 index func 2024-02-25 09:57:40 +05:30
Kovid Goyal
d4c4805f96 const away to glory 2024-02-25 09:57:40 +05:30
Kovid Goyal
161eae78b6 Make generated asm_* files world readable 2024-02-25 09:57:40 +05:30
Kovid Goyal
6cdc7ac91d A further 5% speedup for UTF-8 decoding
Achieved by decoding in larger chunks thereby amortizing the cost
of creating various constant vectors over larger chunks.
2024-02-25 09:57:40 +05:30
Kovid Goyal
0bccada9d1 No longer need to abort after dealing with trailing bytes 2024-02-25 09:57:40 +05:30
Kovid Goyal
9cb9373274 Allow unbounded output in UTF8Decoder
This will allow us to eventually decode more than a single
vector's worth in a fast inner loop
2024-02-25 09:57:39 +05:30
Kovid Goyal
d987ffe49a Use unaligned stores
Makes no measurable difference in the benchmark. And will eventually
allow us to process larger chunks of data without need to reset a bunch
of vector registers to constant values each time.
2024-02-25 09:57:39 +05:30
Kovid Goyal
77cfd44f24 More efficient clearing of register to all zeros or all ones 2024-02-25 09:57:39 +05:30
Kovid Goyal
59be7213cf Make set1_epi8 more general 2024-02-25 09:57:39 +05:30
Kovid Goyal
d60dacbd09 Implement > and < intrinsics for vector registers 2024-02-25 09:57:39 +05:30
Kovid Goyal
82b7b4fcce Make a re-useable template for generating ASM index functions with different tests 2024-02-25 09:57:39 +05:30
Kovid Goyal
fa9a2b1e2e Switch file input to use new SIMD parser to search for \n and \r in parallel 2024-02-25 09:57:39 +05:30
Kovid Goyal
4e6138d785 Generate SIMD code during build 2024-02-25 09:57:39 +05:30
Kovid Goyal
86a55e2c0a Use an aligned slice for file reads 2024-02-25 09:57:39 +05:30
Kovid Goyal
de8c1e0206 Work on porting SIMD vt parser to Go for the kittens 2024-02-25 09:57:39 +05:30
Kovid Goyal
131716da00 Ignore another warning on some compiler versions in simde 2024-02-25 09:57:39 +05:30
Kovid Goyal
4d35fc2928 Use a custom movmask for ARM rather than the one from simde
Supposedly faster, not that I can measure it, but...
Also gives neater code, so keep it.
2024-02-25 09:57:39 +05:30
Kovid Goyal
3b65c1a58a remove declaration without implementation 2024-02-25 09:57:39 +05:30
Kovid Goyal
9bca415af2 Use aligned loads when finding either of two bytes
No measurable performance improvement, but neater algorithm anyway.
2024-02-25 09:57:39 +05:30
Kovid Goyal
60bc8e6c25 ... 2024-02-25 09:57:39 +05:30
Kovid Goyal
8aa1b112b8 Turns out the simde implementation of movemask is not slow enough to compensate for the speed bump from 256 bit 2024-02-25 09:57:39 +05:30
Kovid Goyal
0bd47d8457 Cleanup KITTY_NO_SIMD compilation 2024-02-25 09:57:39 +05:30
Kovid Goyal
fcbda63023 Move finding byte code into separate functions
movemask() is inefficient on ARM64; this will allow us to use a dedicated
implementation for finding bytes on that platform
2024-02-25 09:57:38 +05:30
Kovid Goyal
1d59bfade3 ... 2024-02-25 09:57:38 +05:30
Kovid Goyal
fd7d0f8787 Fix event loop continuously ticking every input_delay seconds even when no input is available 2024-02-25 09:57:38 +05:30
Kovid Goyal
fa11858a72 Make bash integration tests more robust on macOS 2024-02-25 09:57:38 +05:30
Kovid Goyal
1293ee60e0 ... 2024-02-25 09:57:38 +05:30
Kovid Goyal
66341aa28e Make the env var controlling which SIMD level to use more capable 2024-02-25 09:57:38 +05:30
Kovid Goyal
73342411bc Don't build any SIMD code when the target is neither ARM64 nor x86/amd64 2024-02-25 09:57:38 +05:30
Kovid Goyal
8dd6f9b07c Get universal builds working again
Now we use lipo and build individually so we can pass the correct
compiler flags per arch
2024-02-25 09:57:38 +05:30
Kovid Goyal
7e77a196e6 Build only the SIMD code with SIMD compiler flags 2024-02-25 09:57:38 +05:30
Kovid Goyal
465616223c Drop using the v2 microarch
No significant performance impact and small risk of breakage
2024-02-25 09:57:38 +05:30
Kovid Goyal
9d4193f4ea Fix texture ref not useable on repurposed image object 2024-02-25 09:57:38 +05:30
Kovid Goyal
dafb876d75 Skip simd parser tests on machines without SIMD instructions 2024-02-25 09:57:38 +05:30
Kovid Goyal
4b846e0106 Turns out that using 256 bit code on ARM is slightly faster even though it is emulated with 128 bit registers 2024-02-25 09:57:38 +05:30
Kovid Goyal
76c6630084 Don't use 256 bit code paths on ARM
ARM only has 128 bit registers. simde simulates 256 bit operations using
them, which is fairly pointless for us.
2024-02-25 09:57:38 +05:30
Kovid Goyal
23a4012aeb Add an env var to turn off use of SIMD instructions 2024-02-25 09:57:38 +05:30
Kovid Goyal
eee14ae148 Workaround for machines on GitHub Actions that incorrectly report CPU vector instruction availability 2024-02-25 09:57:37 +05:30
Kovid Goyal
b0ccaa09be Clean up test env reporting 2024-02-25 09:57:37 +05:30
Kovid Goyal
bbaccfdaae DRYer 2024-02-25 09:57:37 +05:30