Overview
#579 made the elementwise kernel path adaptive (VectorBits ∈ {128, 256, 512}, detected once at startup), so add/sub/mul/div, comparisons, shift, modf, and the unary-math ops already use Vector512 on capable hardware. The migration stopped there: the reduction, NaN-masking, cast, and matmul subsystems are still hardcoded to Vector256<T> and never widen to Vector512. On AVX-512 hosts (Intel Ice Lake+/Sapphire Rapids, AMD Zen 4/5) these run at half the lane width they could.
This issue tracks completing the adaptive-width migration for the remaining subsystems.
Parent: #579 (Adaptive Vector Width: V128/256/512)
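The "detected once at startup" width selection described above could look roughly like the following. This is a minimal sketch, not the code from #579: the class `WidthProbe` and its members are hypothetical, and returning 0 for "no SIMD" is an assumption. Only the `IsHardwareAccelerated` properties are real .NET APIs (`Vector512` requires .NET 8 or later).

```csharp
using System.Runtime.Intrinsics;

// Hypothetical helper. The project exposes the detected width as VectorBits;
// its actual implementation is not shown in this issue.
public static class WidthProbe
{
    // Evaluated once, when the type is first used.
    public static readonly int VectorBits = Detect();

    private static int Detect()
    {
        if (Vector512.IsHardwareAccelerated) return 512;
        if (Vector256.IsHardwareAccelerated) return 256;
        if (Vector128.IsHardwareAccelerated) return 128;
        return 0; // assumption: 0 means "scalar only"
    }

    // Lanes per vector for an element of the given size in bytes.
    public static int LaneCount(int elementSizeInBytes) =>
        VectorBits == 0 ? 1 : VectorBits / 8 / elementSizeInBytes;
}
```

A kernel that reads its lane count from such a value, rather than from `Vector256<T>.Count`, is what "adaptive" means in the rest of this issue.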
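Problem
AVX-512 appears in only 4 files of Backends/Kernels/. Everywhere else the SIMD body is written against Vector256<T> with zero VectorBits references, i.e. it is width-locked at 256 and cannot use a 512-bit register even when one is available.
Survey (Vector256< occurrences; VectorBits refs = 0 ⇒ width-locked):

| Files (Vector256< count) | Affected ops / status |
| --- | --- |
| Reduction.Axis.Widening (66), .Simd (58), .VarStd (42), .Boolean (8), .NaN (8) | sum/prod/min/max/mean/var/std/any/all + nan-variants along an axis |
| Masking.NaN (44), Masking.VarStd (16), Masking.Boolean (8) | nansum/nanmean/nanvar/nanstd/nanmin/nanmax |
| Reduction.Boolean (12), Reduction.Arg (4), ILKernelGenerator.Reduction (3) | any/all, argmin/argmax, pairwise sum/prod |
| WeightedSum (14) | np.average |
| Cast.ToHalf (16), .ToBool (12), .Complex (9), .Half (8, partial 512), .FloatToUInt (5), .FloatNarrow, .FloatWideInt, .IntNarrow, .ShortNarrow, .Subword{Copy,Narrow,Widen} | most astype conversions are AVX2-only (source comments say "AVX2-only … no AVX512") |
| SimdMatMul.{cs,Double,Strided}, MatMul (5) | MatMul / dot: zero AVX-512; dot/matmul cap at V256 |

Already adaptive (use VectorBits, no change needed): Binary, Comparison, Shift, Modf, InnerLoop, and the load/store/op emit helpers in DirectILKernelGenerator.cs.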
Proposal
Apply the #579 adaptive-width pattern to each width-locked subsystem: drive lane count and load/store/op from VectorBits (via VectorMethodCache) instead of a literal Vector256, with a runtime capability probe + scalar/256 fallback so nothing regresses on non-AVX-512 hosts. Each item is independently shippable. Correctness can be verified on a non-AVX-512 dev box via JIT software-emulation of Vector512 (the same technique used to verify the Round/Truncate fix in dde0a0a9 / a0581c6f), with the actual speedup benchmarked on AVX-512 hardware.
Axis reductions: Reduction.Axis.Simd, .Widening, .VarStd, .Boolean, .NaN (sum/prod/min/max/mean/var/std/any/all along an axis). Largest cluster; the horizontal-reduction tail needs a width-specific lane-fold.
Masking.NaN, Masking.VarStd, Masking.Boolean, Reduction.NaN (nansum/nanmean/nanvar/nanstd/…).
Reduction.Boolean (any/all), Reduction.Arg (argmin/argmax), ILKernelGenerator.Reduction (pairwise sum/prod), WeightedSum (np.average).
Cast / astype subsystem: widen the AVX2-only conversion paths. Note the AVX-512 shuffles differ from AVX2 (e.g. VPERMB needs AVX512VBMI, VPMOVZX/VPMOV* truncating moves), so this is the most involved bucket and should be width-probed per conversion.
MatMul / dot: give SimdMatMul a Vector512 microkernel (most likely to show a clean win since matmul is compute-bound).
x86-512 binary routing polish: VectorMethodCache.ResolveX86BinaryApi(512, …) returns Avx512F for everything; wire Avx512BW (byte/word add/sub/min/max) and Avx512DQ (int64 multiply) so those take the x86 fast path instead of the cross-platform fallback. (Correctness-neutral: today it falls back to Vector512.*, which the JIT lowers correctly; this is pure perf.)
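To illustrate what "the horizontal-reduction tail needs a width-specific lane-fold" means in the axis-reduction item, here is a minimal sketch of a sum whose vector loop and final fold both depend on the selected width. It is not the project's kernel (those are emitted as IL): `WidthAdaptiveSum` and its `vectorBits` parameter are hypothetical stand-ins. The example uses int32 because wrapping integer addition is associative, so every width yields the same bits; a floating-point sum can depend on the fold order.

```csharp
using System.Runtime.Intrinsics;

public static class WidthAdaptiveSum
{
    // Wrapping int32 sum. vectorBits stands in for the project's VectorBits.
    public static int Sum(int[] data, int vectorBits)
    {
        int i = 0, total = 0;
        unchecked
        {
            if (vectorBits == 512)
            {
                Vector512<int> acc = Vector512<int>.Zero;
                for (; i <= data.Length - Vector512<int>.Count; i += Vector512<int>.Count)
                    acc += Vector512.LoadUnsafe(ref data[i]);
                // 512-bit lane-fold: 512 -> 256 -> 128 -> scalar.
                Vector256<int> half = acc.GetLower() + acc.GetUpper();
                total = Vector128.Sum(half.GetLower() + half.GetUpper());
            }
            else if (vectorBits == 256)
            {
                Vector256<int> acc = Vector256<int>.Zero;
                for (; i <= data.Length - Vector256<int>.Count; i += Vector256<int>.Count)
                    acc += Vector256.LoadUnsafe(ref data[i]);
                // 256-bit lane-fold: 256 -> 128 -> scalar.
                total = Vector128.Sum(acc.GetLower() + acc.GetUpper());
            }
            // Scalar tail; also the entire loop for any other vectorBits value.
            for (; i < data.Length; i++) total += data[i];
        }
        return total;
    }
}
```

The only width-specific parts are the accumulator type and the fold; the scalar tail is shared, which is the shape the proposal asks each reduction kernel to take.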
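Evidence
Width-lock confirmed by grep: the files above contain Vector256<…> with 0 VectorBits references, so the emitted IL pins a 256-bit register regardless of VectorBits.
Reduction.Axis.Simd.cs → AxisReductionSimdHelper<T> is generic over T but fixed at Vector256<T>.
Cast.ToHalf.cs header documents the AVX2-only design ("i64/u64 → f16 (Wave 17, AVX2-only)…").
SimdMatMul.* contains no Avx512/Vector512 references.
Files that do reference AVX-512 (the 4 noted under Problem): Cast.cs, Cast.Half.cs, DirectILKernelGenerator.cs, VectorMethodCache.cs.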
Scope / Non-goals
Non-goal: new ufuncs or dtype support. This is width widening of existing kernels only, with bit-for-bit identical results.
Non-goal: AVX-512 sub-feature detection policy beyond what each kernel needs (BW/DQ/VBMI probed where used).
Out of scope here: scalar-math unary ops (sin/cos/exp/log) that have no SIMD at any width.
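The "probed where used" idea, applied to the x86-512 routing item in the proposal, can be sketched as a small decision table. Everything named here (`BinaryOp`, `X86Routing512`, `Pick`, the string return values) is hypothetical; the real `VectorMethodCache.ResolveX86BinaryApi` signature is not shown in this issue. The `IsSupported` properties are real .NET 8 APIs, and the mapping follows the issue's description: Avx512BW for byte/word add/sub/min/max, Avx512DQ for int64 multiply.

```csharp
using System;
using System.Runtime.Intrinsics.X86;

public enum BinaryOp { Add, Subtract, Multiply, Min, Max }

public static class X86Routing512
{
    // Signals "use the cross-platform Vector512.* path".
    public const string Fallback = "Vector512";

    // Probes the current host.
    public static string Pick(Type t, BinaryOp op) =>
        Pick(t, op, Avx512F.IsSupported, Avx512BW.IsSupported, Avx512DQ.IsSupported);

    // Flags are parameters so the table can be exercised on any host.
    public static string Pick(Type t, BinaryOp op, bool f, bool bw, bool dq)
    {
        if (!f) return Fallback;

        bool byteOrWord = t == typeof(byte) || t == typeof(sbyte)
                       || t == typeof(short) || t == typeof(ushort);
        if (byteOrWord) // only the ops the issue lists are routed to BW
            return bw && op != BinaryOp.Multiply ? "Avx512BW" : Fallback;

        bool int64 = t == typeof(long) || t == typeof(ulong);
        if (int64 && op == BinaryOp.Multiply)
            return dq ? "Avx512DQ" : Fallback;

        bool coreType = int64 || t == typeof(int) || t == typeof(uint)
                     || t == typeof(float) || t == typeof(double);
        return coreType ? "Avx512F" : Fallback;
    }
}
```

Because a miss returns the fallback rather than failing, a host with Avx512F but without BW or DQ keeps today's behavior, which matches the "correctness-neutral" note in the proposal.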
Benchmark / Performance
Target: up to 2× throughput on the widened kernels on AVX-512 hosts; no regression on V256/V128/scalar hosts (capability-gated fallback).
Caveat: reductions and casts are frequently memory-bound, and heavy AVX-512 use can trigger frequency throttling on some Intel parts, so the win is hardware-dependent and must be measured on Zen 4+/Ice Lake+ before any speedup is claimed. Use benchmark/layout/ (reduction × layout × dtype) and benchmark/cast/ (astype src→dst matrix) to A/B V256 vs V512.
Dev-box correctness without AVX-512: force VectorBits = 512 semantics and rely on the JIT's software emulation of Vector512<T> to bit-compare against the V256/scalar result.
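The dev-box check can be sketched as follows. The mechanism for forcing VectorBits = 512 is not shown in this issue, so this hypothetical `WidthParity` class calls `Vector512` directly; on a host without AVX-512 those calls run through the runtime's software fallback. The kernel here is an elementwise float multiply, chosen because each output element is computed independently and should therefore match bit for bit across widths. The sketch assumes both inputs have the same length.

```csharp
using System;
using System.Runtime.Intrinsics;

public static class WidthParity
{
    public static float[] MulScalar(float[] a, float[] b)
    {
        var dst = new float[a.Length];
        for (int i = 0; i < a.Length; i++) dst[i] = a[i] * b[i];
        return dst;
    }

    public static float[] MulV256(float[] a, float[] b)
    {
        var dst = new float[a.Length];
        int i = 0;
        for (; i <= a.Length - Vector256<float>.Count; i += Vector256<float>.Count)
            (Vector256.LoadUnsafe(ref a[i]) * Vector256.LoadUnsafe(ref b[i])).StoreUnsafe(ref dst[i]);
        for (; i < a.Length; i++) dst[i] = a[i] * b[i]; // scalar tail
        return dst;
    }

    public static float[] MulV512(float[] a, float[] b)
    {
        var dst = new float[a.Length];
        int i = 0;
        for (; i <= a.Length - Vector512<float>.Count; i += Vector512<float>.Count)
            (Vector512.LoadUnsafe(ref a[i]) * Vector512.LoadUnsafe(ref b[i])).StoreUnsafe(ref dst[i]);
        for (; i < a.Length; i++) dst[i] = a[i] * b[i]; // scalar tail
        return dst;
    }

    // Compares raw bit patterns rather than using float equality.
    public static bool BitEqual(float[] x, float[] y)
    {
        if (x.Length != y.Length) return false;
        for (int i = 0; i < x.Length; i++)
            if (BitConverter.SingleToInt32Bits(x[i]) != BitConverter.SingleToInt32Bits(y[i]))
                return false;
        return true;
    }
}
```

An odd input length exercises both the vector body and the scalar tail of each path, so a parity test should include at least one such size.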
Related issues