Kovid Goyal
920b8a2496
Use VZEROUPPER in avx functions
...
See https://www.intel.com/content/dam/develop/external/us/en/documents/11mc12-avoiding-2bavx-sse-2btransition-2bpenalties-2brh-2bfinal-809104.pdf
2024-02-25 09:57:40 +05:30
Kovid Goyal
5a5e31c38b
Also zero upper at start of function
2024-02-25 09:57:40 +05:30
Kovid Goyal
db2e0e816d
Fix mixing of register types in the same function
2024-02-25 09:57:40 +05:30
Kovid Goyal
a298781b85
DRYer
2024-02-25 09:57:40 +05:30
Kovid Goyal
d5cd9ef2ca
...
2024-02-25 09:57:40 +05:30
Kovid Goyal
55c909c656
Use -mtune=intel for SIMD files when building without native optimizations
2024-02-25 09:57:40 +05:30
Kovid Goyal
da31db3212
...
2024-02-25 09:57:40 +05:30
Kovid Goyal
601c4ad4df
Fix some typos
2024-02-25 09:57:40 +05:30
Kovid Goyal
2549b4328f
Update throughput comparison table in light of latest improvements
2024-02-25 09:57:40 +05:30
Kovid Goyal
68d800d4fa
make clean should clean generated asm as well
2024-02-25 09:57:40 +05:30
Kovid Goyal
9fc3db1dd1
Work on C0 index func
2024-02-25 09:57:40 +05:30
Kovid Goyal
d4c4805f96
const away to glory
2024-02-25 09:57:40 +05:30
Kovid Goyal
161eae78b6
Make generated asm_* files world readable
2024-02-25 09:57:40 +05:30
Kovid Goyal
6cdc7ac91d
A further 5% speedup for UTF-8 decoding
...
Achieved by decoding in larger chunks thereby amortizing the cost
of creating various constant vectors over larger chunks.
2024-02-25 09:57:40 +05:30
Kovid Goyal
0bccada9d1
No longer need to abort after dealing with trailing bytes
2024-02-25 09:57:40 +05:30
Kovid Goyal
9cb9373274
Allow unbounded output in UTF8Decoder
...
This will allow us to eventually decode more than a single
vector's worth in a fast inner loop
2024-02-25 09:57:39 +05:30
Kovid Goyal
d987ffe49a
Use unaligned stores
...
Makes no measurable difference in the benchmark, and will eventually
allow us to process larger chunks of data without needing to reset a bunch
of vector registers to constant values each time.
2024-02-25 09:57:39 +05:30
Kovid Goyal
77cfd44f24
More efficient clearing of register to all zeros or all ones
2024-02-25 09:57:39 +05:30
Kovid Goyal
59be7213cf
Make set1_epi8 more general
2024-02-25 09:57:39 +05:30
Kovid Goyal
d60dacbd09
Implement > and < intrinsics for vector registers
2024-02-25 09:57:39 +05:30
Kovid Goyal
82b7b4fcce
Make a reusable template for generating ASM index functions with different tests
2024-02-25 09:57:39 +05:30
Kovid Goyal
fa9a2b1e2e
Switch file input to use new SIMD parser to search for \n and \r in parallel
2024-02-25 09:57:39 +05:30
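To illustrate what searching for \n and \r "in parallel" means in the commit above, here is a minimal portable sketch in Go. It is not the project's SIMD parser or its generated assembly. It swaps in a simpler SWAR (SIMD within a register) technique that tests 8 bytes per step using ordinary 64-bit integer arithmetic. The names `indexEither` and `zeroByteMask` are hypothetical and exist only for this example.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

const (
	lo = 0x0101010101010101 // 0x01 repeated in every byte
	hi = 0x8080808080808080 // 0x80 repeated in every byte
)

// zeroByteMask returns a word that is non-zero iff v contains a zero byte.
// The lowest set bit of the result always marks the first zero byte.
func zeroByteMask(v uint64) uint64 { return (v - lo) &^ v & hi }

// indexEither returns the index of the first byte in data equal to a or b,
// or -1 if neither occurs. Hypothetical helper, for illustration only.
func indexEither(data []byte, a, b byte) int {
	pa, pb := uint64(a)*lo, uint64(b)*lo // broadcast each needle to all 8 bytes
	i := 0
	for ; i+8 <= len(data); i += 8 {
		w := binary.LittleEndian.Uint64(data[i:])
		// XOR turns matching bytes into zero bytes; test both needles at once.
		if m := zeroByteMask(w^pa) | zeroByteMask(w^pb); m != 0 {
			return i + bits.TrailingZeros64(m)/8
		}
	}
	// Scalar loop for the final, partial chunk.
	for ; i < len(data); i++ {
		if data[i] == a || data[i] == b {
			return i
		}
	}
	return -1
}

func main() {
	fmt.Println(indexEither([]byte("hello world\r\nbye"), '\n', '\r')) // prints 11
}
```

A real vectorized version would compare 16 or 32 bytes per instruction and extract a bitmask of the matches. The sketch only shows the overall shape: broadcast the needles, compare a whole chunk at once, then locate the first hit.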
Kovid Goyal
4e6138d785
Generate SIMD code during build
2024-02-25 09:57:39 +05:30
Kovid Goyal
86a55e2c0a
Use an aligned slice for file reads
2024-02-25 09:57:39 +05:30
Kovid Goyal
de8c1e0206
Work on porting SIMD vt parser to Go for the kittens
2024-02-25 09:57:39 +05:30
Kovid Goyal
131716da00
Ignore another warning on some compiler versions in simde
2024-02-25 09:57:39 +05:30
Kovid Goyal
4d35fc2928
Use a custom movmask for ARM rather than the one from simde
...
Supposedly faster, not that I can measure it, but...
Also gives neater code, so keep it.
2024-02-25 09:57:39 +05:30
Kovid Goyal
3b65c1a58a
remove declaration without implementation
2024-02-25 09:57:39 +05:30
Kovid Goyal
9bca415af2
Use aligned loads when finding either of two bytes
...
No measurable performance improvement, but neater algorithm anyway.
2024-02-25 09:57:39 +05:30
Kovid Goyal
60bc8e6c25
...
2024-02-25 09:57:39 +05:30
Kovid Goyal
8aa1b112b8
Turns out the simde implementation of movemask is not slow enough to offset the speedup from 256 bit
2024-02-25 09:57:39 +05:30
Kovid Goyal
0bd47d8457
Cleanup KITTY_NO_SIMD compilation
2024-02-25 09:57:39 +05:30
Kovid Goyal
fcbda63023
Move finding byte code into separate functions
...
movemask() is inefficient on ARM64; this will allow us to use a dedicated
implementation for finding bytes on that platform
2024-02-25 09:57:38 +05:30
Kovid Goyal
1d59bfade3
...
2024-02-25 09:57:38 +05:30
Kovid Goyal
fd7d0f8787
Fix event loop continuously ticking every input_delay seconds even when no input is available
2024-02-25 09:57:38 +05:30
Kovid Goyal
fa11858a72
Make bash integration tests more robust on macOS
2024-02-25 09:57:38 +05:30
Kovid Goyal
1293ee60e0
...
2024-02-25 09:57:38 +05:30
Kovid Goyal
66341aa28e
Make the env var controlling which SIMD level to use more capable
2024-02-25 09:57:38 +05:30
Kovid Goyal
73342411bc
Don't build any SIMD code when the target is neither ARM64 nor x86/amd64
2024-02-25 09:57:38 +05:30
Kovid Goyal
8dd6f9b07c
Get universal builds working again
...
Now we use lipo and build individually so we can pass the correct
compiler flags per arch
2024-02-25 09:57:38 +05:30
Kovid Goyal
7e77a196e6
Build only the SIMD code with SIMD compiler flags
2024-02-25 09:57:38 +05:30
Kovid Goyal
465616223c
Drop using the v2 microarch
...
No significant performance impact and small risk of breakage
2024-02-25 09:57:38 +05:30
Kovid Goyal
9d4193f4ea
Fix texture ref not usable on repurposed image object
2024-02-25 09:57:38 +05:30
Kovid Goyal
dafb876d75
Skip simd parser tests on machines without SIMD instructions
2024-02-25 09:57:38 +05:30
Kovid Goyal
4b846e0106
Turns out that using 256 bit code on ARM is slightly faster even though it is emulated with 128 bit registers
2024-02-25 09:57:38 +05:30
Kovid Goyal
76c6630084
Don't use 256 bit code paths on ARM
...
ARM only has 128 bit registers. simde simulates 256 bit operations using
them, which is fairly pointless for us.
2024-02-25 09:57:38 +05:30
Kovid Goyal
23a4012aeb
Add an env var to turn off use of SIMD instructions
2024-02-25 09:57:38 +05:30
Kovid Goyal
eee14ae148
Workaround for machines on GitHub Actions that incorrectly report CPU vector instruction availability
2024-02-25 09:57:37 +05:30
Kovid Goyal
b0ccaa09be
Clean up test env reporting
2024-02-25 09:57:37 +05:30
Kovid Goyal
bbaccfdaae
DRYer
2024-02-25 09:57:37 +05:30