[Performance] SIMD Vector512 Coverage Gaps #614

Description

@Nucs

Overview

#579 made the elementwise kernel path adaptive (VectorBits ∈ {128, 256, 512}, detected once at startup), so add/sub/mul/div/comparisons/shift/modf and the unary-math ops already use Vector512 on capable hardware. But the migration stopped there: the reduction, NaN-masking, cast, and matmul subsystems are still hardcoded to Vector256<T> and never widen to Vector512. On AVX-512 hosts (Intel Ice Lake+/Sapphire Rapids, AMD Zen 4/5) these run at half the lane width they could.

This issue tracks completing the adaptive-width migration for the remaining subsystems.

Parent: #579 (Adaptive Vector Width: V128/256/512)

Problem

AVX-512 appears in only 4 files of Backends/Kernels/. Everywhere else the SIMD body is written against Vector256<T> with zero VectorBits references — i.e. it is width-locked at 256 and cannot use a 512-bit register even when one is available.

Survey (Vector256< occurrences, VectorBits-refs = 0 ⇒ width-locked):

| Subsystem | Files (Vector256< count) | What stays 256-bit on AVX-512 |
|---|---|---|
| Axis reductions | Reduction.Axis.Widening(66), .Simd(58), .VarStd(42), .Boolean(8), .NaN(8) | sum/prod/min/max/mean/var/std/any/all + nan-variants along an axis |
| NaN masking | Masking.NaN(44), Masking.VarStd(16), Masking.Boolean(8) | nansum/nanmean/nanvar/nanstd/nanmin/nanmax |
| Flat reductions | Reduction.Boolean(12), Reduction.Arg(4), ILKernelGenerator.Reduction(3) | flat any/all, argmin/argmax, pairwise sum/prod |
| Weighted sum | WeightedSum(14) | np.average |
| Cast / astype | Cast.ToHalf(16), .ToBool(12), .Complex(9), .Half(8, partial 512), .FloatToUInt(5), .FloatNarrow, .FloatWideInt, .IntNarrow, .ShortNarrow, .Subword{Copy,Narrow,Widen} | most astype conversions are AVX2-only (source comments say "AVX2-only … no AVX512") |
| MatMul / dot | SimdMatMul.{cs,Double,Strided}, MatMul(5) | zero AVX-512; dot/matmul cap at V256 |

Already adaptive (use VectorBits, no change needed): Binary, Comparison, Shift, Modf, InnerLoop, and the load/store/op emit helpers in DirectILKernelGenerator.cs.

Proposal

Apply the #579 adaptive-width pattern to each width-locked subsystem: drive lane count and load/store/op from VectorBits (via VectorMethodCache) instead of a literal Vector256, with a runtime capability probe + scalar/256 fallback so nothing regresses on non-AVX-512 hosts. Each item is independently shippable; correctness can be verified on a non-AVX-512 dev box via JIT software-emulation of Vector512 (the same technique used to verify the Round/Truncate fix in dde0a0a9 / a0581c6f), with the actual speedup benchmarked on AVX-512 hardware.

  • Axis reductions: Reduction.Axis.Simd, .Widening, .VarStd, .Boolean, .NaN (sum/prod/min/max/mean/var/std/any/all along an axis). Largest cluster; the horizontal-reduction tail needs a width-specific lane-fold.
  • NaN-aware masking & reductions: Masking.NaN, Masking.VarStd, Masking.Boolean, Reduction.NaN (nansum/nanmean/nanvar/nanstd/…).
  • Flat reductions: Reduction.Boolean (any/all), Reduction.Arg (argmin/argmax), ILKernelGenerator.Reduction (pairwise sum/prod), WeightedSum (np.average).
  • Cast / astype subsystem: widen the AVX2-only conversion paths. Note that the AVX-512 shuffles differ from AVX2 (e.g. VPERMB requires AVX512VBMI, and widening/narrowing go through VPMOVZX and the VPMOV* truncating moves), so this is the most involved bucket and should be width-probed per conversion.
  • MatMul / dot: give SimdMatMul a Vector512 microkernel (most likely to show a clean win since matmul is compute-bound).
  • x86-512 binary routing polish: VectorMethodCache.ResolveX86BinaryApi(512, …) returns Avx512F for everything; wire Avx512BW (byte/word add/sub/min/max) and Avx512DQ (int64 multiply) so those take the x86 fast path instead of the cross-platform fallback. (Correctness-neutral: today it falls back to Vector512.*, which the JIT lowers correctly; this is pure perf.)
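
As a rough illustration of the adaptive-width pattern and the width-specific lane-fold mentioned for the reduction tail, a flat sum could be shaped like the sketch below. This is not the project's code: `AdaptiveSumSketch`, its `VectorBits` field and the `Sum*` methods are hypothetical names, and the sketch assumes .NET 8 or later. It sums `long` values so the result does not depend on accumulation order.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Illustrative only: hypothetical names, not the project's actual kernels.
public static class AdaptiveSumSketch
{
    // Probed once at type initialization (assumption: mirrors the startup probe from #579).
    public static readonly int VectorBits =
        Vector512.IsHardwareAccelerated ? 512 :
        Vector256.IsHardwareAccelerated ? 256 : 0;

    // Dispatch on the detected width; anything narrower than 256 goes scalar in this sketch.
    public static long Sum(ReadOnlySpan<long> data) => VectorBits switch
    {
        512 => Sum512(data),
        256 => Sum256(data),
        _ => SumScalar(data, 0, 0L),
    };

    public static long Sum512(ReadOnlySpan<long> data)
    {
        ref long start = ref MemoryMarshal.GetReference(data);
        Vector512<long> acc = Vector512<long>.Zero;
        int i = 0;
        for (; i <= data.Length - Vector512<long>.Count; i += Vector512<long>.Count)
            acc += Vector512.LoadUnsafe(ref start, (nuint)i);

        // Width-specific lane fold: 512 -> 256 -> 128 -> scalar.
        Vector256<long> half = acc.GetLower() + acc.GetUpper();
        Vector128<long> quarter = half.GetLower() + half.GetUpper();
        long folded = quarter.GetElement(0) + quarter.GetElement(1);
        return SumScalar(data, i, folded);
    }

    public static long Sum256(ReadOnlySpan<long> data)
    {
        ref long start = ref MemoryMarshal.GetReference(data);
        Vector256<long> acc = Vector256<long>.Zero;
        int i = 0;
        for (; i <= data.Length - Vector256<long>.Count; i += Vector256<long>.Count)
            acc += Vector256.LoadUnsafe(ref start, (nuint)i);

        // One fold step fewer than the 512-bit path: 256 -> 128 -> scalar.
        Vector128<long> quarter = acc.GetLower() + acc.GetUpper();
        long folded = quarter.GetElement(0) + quarter.GetElement(1);
        return SumScalar(data, i, folded);
    }

    // Scalar tail (and full fallback): adds data[from..] onto the running total.
    private static long SumScalar(ReadOnlySpan<long> data, int from, long total)
    {
        for (int i = from; i < data.Length; i++)
            total += data[i];
        return total;
    }
}
```

`Sum512` can be called directly on a host without AVX-512, since `Vector512<T>` has a software fallback there. For floating-point sums the fold order affects rounding, so a wider accumulator is not automatically bit-identical to the 256-bit result; the integer case sidesteps that.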

Evidence

  • Width-lock confirmed by grep: the files above contain Vector256<…> with 0 VectorBits references, so the emitted IL pins a 256-bit register regardless of VectorBits.
  • Reduction.Axis.Simd.cs: AxisReductionSimdHelper<T> is generic over T but fixed at Vector256<T>.
  • Cast.ToHalf.cs header documents the AVX2-only design ("i64/u64 → f16 (Wave 17, AVX2-only)…").
  • SimdMatMul.* contains no Avx512/Vector512 references.
  • AVX-512 currently used only in: Cast.cs, Cast.Half.cs, DirectILKernelGenerator.cs, VectorMethodCache.cs.

Scope / Non-goals

  • Non-goal: the elementwise path (#579, "[SIMD] Adaptive Vector Width: Support Vector128/256/512 Based on Hardware", already covers it).
  • Non-goal: new ufuncs or dtype support — this is width widening of existing kernels only, bit-for-bit identical results.
  • Non-goal: AVX-512 sub-feature detection policy beyond what each kernel needs (BW/DQ/VBMI probed where used).
  • Out of scope here: scalar-math unary ops (sin/cos/exp/log) that have no SIMD at any width.
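
To make "probed where used" concrete, the BW/DQ routing from the proposal could be gated roughly as follows. This is a sketch under assumptions: `Api512` and `Route512Sketch.Resolve` are hypothetical and do not reflect the real signature of `VectorMethodCache.ResolveX86BinaryApi`; only the `IsSupported` properties are actual .NET APIs.

```csharp
using System;
using System.Runtime.Intrinsics.X86;

// Hypothetical routing result; the real cache may represent this differently.
public enum Api512 { Avx512F, Avx512BW, Avx512DQ, CrossPlatform }

public static class Route512Sketch
{
    // Picks an x86 API class for a 512-bit binary op, probing only the
    // sub-feature that the (element type, op) pair actually needs.
    public static Api512 Resolve(Type elem, string op)
    {
        bool subword = elem == typeof(byte) || elem == typeof(sbyte)
                    || elem == typeof(short) || elem == typeof(ushort);
        bool int64 = elem == typeof(long) || elem == typeof(ulong);

        // Byte/word add/sub/min/max live in AVX-512BW.
        if (subword && (op == "Add" || op == "Subtract" || op == "Min" || op == "Max"))
            return Avx512BW.IsSupported ? Api512.Avx512BW : Api512.CrossPlatform;

        // Subword ops outside that set stay on the cross-platform path in this sketch.
        if (subword)
            return Api512.CrossPlatform;

        // 64-bit integer multiply lives in AVX-512DQ.
        if (int64 && op == "Multiply")
            return Avx512DQ.IsSupported ? Api512.Avx512DQ : Api512.CrossPlatform;

        // Everything else: AVX-512F when present, otherwise Vector512.* fallback.
        return Avx512F.IsSupported ? Api512.Avx512F : Api512.CrossPlatform;
    }
}
```

Falling back to `CrossPlatform` whenever a probe fails keeps the change correctness-neutral, matching the note in the proposal.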

Benchmark / Performance

  • Target: up to 2× throughput on the widened kernels on AVX-512 hosts; no regression on V256/V128/scalar hosts (capability-gated fallback).
  • Caveat: reductions and casts are frequently memory-bound, and heavy AVX-512 can trigger frequency throttling on some Intel parts — so the win is real but hardware-dependent and must be measured on Zen 4+/Ice Lake+ before claiming it. Use benchmark/layout/ (reduction × layout × dtype) and benchmark/cast/ (astype src→dst matrix) to A/B V256 vs V512.
  • Dev-box correctness without AVX-512: force VectorBits = 512 semantics and rely on the JIT's software emulation of Vector512<T> to bit-compare against the V256/scalar result.
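
The dev-box check in the last bullet could be sketched as below for one cast (int32 to float32). `WidthParitySketch` and its methods are hypothetical and stand in for a real kernel; the sketch assumes .NET 8 or later and relies on `Vector512<T>` operations still executing, via software fallback, on hosts without AVX-512.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Illustrative parity harness: hypothetical names, not the project's test code.
public static class WidthParitySketch
{
    // Scalar reference conversion.
    public static void CastScalar(ReadOnlySpan<int> src, Span<float> dst)
    {
        for (int i = 0; i < src.Length; i++)
            dst[i] = src[i];
    }

    // 512-bit conversion with a scalar tail; dst must be at least as long as src.
    public static void Cast512(ReadOnlySpan<int> src, Span<float> dst)
    {
        if (dst.Length < src.Length)
            throw new ArgumentException("dst is shorter than src");

        ref int s = ref MemoryMarshal.GetReference(src);
        ref float d = ref MemoryMarshal.GetReference(dst);
        int i = 0;
        for (; i <= src.Length - Vector512<int>.Count; i += Vector512<int>.Count)
        {
            Vector512<int> v = Vector512.LoadUnsafe(ref s, (nuint)i);
            Vector512.ConvertToSingle(v).StoreUnsafe(ref d, (nuint)i);
        }
        for (; i < src.Length; i++)
            dst[i] = src[i];
    }

    // Bitwise comparison: compares raw bytes rather than float equality,
    // so differing NaN payloads or signed zeros would be caught.
    public static bool BitIdentical(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
    {
        return MemoryMarshal.AsBytes(a).SequenceEqual(MemoryMarshal.AsBytes(b));
    }
}
```

A test would run both paths over the same input (including values above 2^24, where rounding matters) and assert `BitIdentical` on the outputs.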

Labels: core, enhancement, performance