feat: fused mul_hilo for 64-bit batches (shared 32x32->64 partials) #1367
Merged
serge-sans-paille merged 1 commit into xtensor-stack:master on Jun 11, 2026
mul_hilo<uint64_t> previously fell through to the generic common path
{ mul_hi(x, y), x * y }, deriving the high half (mulhi_u64_core: 4 vpmuludq)
and the low half (operator*: 3 vpmuludq) from separately-computed 32-bit
partials -- 7 vpmuludq per pair, none CSE-able because the two halves split
the operands differently (&mask/>>32 vs vpshufd).
Add detail::mulhilo_u64_core, which derives BOTH halves from one set of four
32x32->64 partials (ll, lh, hl, hh): 4 vpmuludq per pair. By construction
hi == mulhi_u64_core and lo == operator*, so the returned pair is
bit-identical to the unfused result.
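
As a scalar illustration of the shared-partials idea, the sketch below models a single 64-bit lane in plain C++. The name `mulhilo_u64_scalar` and the exact carry arrangement are assumptions made for this example; the real `detail::mulhilo_u64_core` works on SIMD registers through a widening functor and may order its additions differently.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical scalar model of one lane (not the xsimd code itself).
// Returns { hi, lo } of the full 128-bit product x * y.
inline std::pair<std::uint64_t, std::uint64_t>
mulhilo_u64_scalar(std::uint64_t x, std::uint64_t y)
{
    const std::uint64_t mask = 0xFFFFFFFFu;
    const std::uint64_t xl = x & mask, xh = x >> 32;
    const std::uint64_t yl = y & mask, yh = y >> 32;

    // The four 32x32->64 partial products, computed once and shared.
    const std::uint64_t ll = xl * yl;
    const std::uint64_t lh = xl * yh;
    const std::uint64_t hl = xh * yl;
    const std::uint64_t hh = xh * yh;

    // Middle column (bits 32..63 plus carries); at most 3 * (2^32 - 1),
    // so it cannot overflow 64 bits.
    const std::uint64_t mid = (ll >> 32) + (lh & mask) + (hl & mask);

    const std::uint64_t hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
    const std::uint64_t lo = (mid << 32) | (ll & mask);
    return { hi, lo };
}
```

Both halves read the same `ll`, `lh`, `hl`, `hh` values, which is why the fused form needs four widening multiplies per pair rather than seven.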
Native kernels for SSE4.1, AVX2 and AVX-512F each pass their _mm*_mul_epu32
widening functor, mirroring the existing mul_hi<uint64_t> structure. SSE2
keeps the common fallback (it has no fused mul_hi<uint64_t> either). Signed
int64 reuses the unsigned core through a single common overload (bitwise_cast
+ sign fixup on hi; lo is sign-invariant), so no per-arch signed overloads
are added.
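
The description does not spell out the sign fixup, so the following is only a sketch of one standard way to do it on a scalar. The helper name `signed_hi_from_unsigned` is hypothetical. It relies on the identity that reinterpreting a negative `x` as unsigned adds 2^64 to it, which inflates the unsigned high half by `y` (and symmetrically for a negative `y`); the low half is unaffected.

```cpp
#include <cstdint>

// Hypothetical helper: recover the signed high half of x * y from the
// high half of the unsigned product of the same bit patterns.
inline std::int64_t
signed_hi_from_unsigned(std::uint64_t uhi, std::int64_t x, std::int64_t y)
{
    const std::uint64_t ux = static_cast<std::uint64_t>(x);
    const std::uint64_t uy = static_cast<std::uint64_t>(y);

    // Subtract y when x is negative and x when y is negative (mod 2^64).
    const std::uint64_t fix = (x < 0 ? uy : std::uint64_t{0})
                            + (y < 0 ? ux : std::uint64_t{0});

    // Modular conversion back to signed: guaranteed since C++20 and the
    // behaviour of mainstream two's-complement compilers before that.
    return static_cast<std::int64_t>(uhi - fix);
}
```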
Verified bit-identical to __int128 for uint64/int64 across SSE4.1 and AVX2
on g++ and clang (including edge cases); avx2 asm shows 4 vpmuludq for the
fused mul_hilo vs 7 for the unfused { mul_hi, mul_lo }.
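
For readers who want to reproduce this kind of check, a reference value can be obtained from the `unsigned __int128` extension available in g++ and clang. The function name `ref_mulhilo_u64` is an assumption for this sketch and is not part of the PR's test suite.

```cpp
#include <cstdint>
#include <utility>

// GCC/clang extension type; not available on every compiler.
using u128 = unsigned __int128;

// Hypothetical reference: { hi, lo } of the exact 128-bit product.
inline std::pair<std::uint64_t, std::uint64_t>
ref_mulhilo_u64(std::uint64_t x, std::uint64_t y)
{
    const u128 p = static_cast<u128>(x) * y;
    return { static_cast<std::uint64_t>(p >> 64),
             static_cast<std::uint64_t>(p) };
}
```

Each lane of a SIMD result can then be compared against this pair for random inputs and for edge cases such as 0, 1 and all-ones operands.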
5d2490f
into
xtensor-stack:master
88 of 92 checks passed
Contributor
Thanks!