[Performance] SIMD Vector512 Coverage Gaps #614

## Overview

#579 made the **elementwise** kernel path adaptive (`VectorBits` ∈ {128, 256, 512}, detected once at startup), so `add`/`sub`/`mul`/`div`/comparisons/`shift`/`modf` and the unary-math ops already use Vector512 on capable hardware. But the migration stopped there: the **reduction, NaN-masking, cast, and matmul** subsystems are still **hardcoded to `Vector256<T>`** and never widen to Vector512. On AVX-512 hosts (Intel Ice Lake+/Sapphire Rapids, AMD Zen 4/5) these run at **half the lane width they could**.

This issue tracks completing the adaptive-width migration for the remaining subsystems.

**Parent:** #579 (Adaptive Vector Width: V128/256/512)

## Problem

AVX-512 appears in only **4 files** of `Backends/Kernels/`. Everywhere else the SIMD body is written against `Vector256<T>` with **zero `VectorBits` references** — i.e. it is width-locked at 256 and cannot use a 512-bit register even when one is available.

Survey (`Vector256<` occurrences, `VectorBits`-refs = 0 ⇒ width-locked):

| Subsystem | Files (`Vector256<` count) | What stays 256-bit on AVX-512 |
|---|---|---|
| **Axis reductions** | `Reduction.Axis.Widening`(66), `.Simd`(58), `.VarStd`(42), `.Boolean`(8), `.NaN`(8) | `sum/prod/min/max/mean/var/std/any/all` + nan-variants along an axis |
| **NaN masking** | `Masking.NaN`(44), `Masking.VarStd`(16), `Masking.Boolean`(8) | `nansum/nanmean/nanvar/nanstd/nanmin/nanmax` |
| **Flat reductions** | `Reduction.Boolean`(12), `Reduction.Arg`(4), `ILKernelGenerator.Reduction`(3) | flat `any/all`, `argmin/argmax`, pairwise `sum/prod` |
| **Weighted sum** | `WeightedSum`(14) | `np.average` |
| **Cast / astype** | `Cast.ToHalf`(16), `.ToBool`(12), `.Complex`(9), `.Half`(8, partial 512), `.FloatToUInt`(5), `.FloatNarrow`, `.FloatWideInt`, `.IntNarrow`, `.ShortNarrow`, `.Subword{Copy,Narrow,Widen}` | most `astype` conversions are AVX2-only (source comments say "AVX2-only … no AVX512") |
| **MatMul / dot** | `SimdMatMul.{cs,Double,Strided}`, `MatMul`(5) | **zero** AVX-512 — `dot`/`matmul` cap at V256 |

Already adaptive (use `VectorBits`, **no change needed**): `Binary`, `Comparison`, `Shift`, `Modf`, `InnerLoop`, and the load/store/op emit helpers in `DirectILKernelGenerator.cs`.
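For readers unfamiliar with the #579 pattern, the "detected once at startup" idea might look roughly like the sketch below. This is not the project's actual code: `VectorWidthProbe` and `LaneCount<T>` are hypothetical names, the value `0` for scalar-only hosts is an assumption of this sketch, and `Vector512.IsHardwareAccelerated` assumes .NET 8 or later.

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

// Hypothetical stand-in for the one-time width probe described in #579.
public static class VectorWidthProbe
{
    // Widest hardware-accelerated vector width in bits.
    // 0 means "no SIMD, use the scalar path" (an assumption of this sketch).
    public static readonly int VectorBits =
        Vector512.IsHardwareAccelerated ? 512 :
        Vector256.IsHardwareAccelerated ? 256 :
        Vector128.IsHardwareAccelerated ? 128 : 0;

    // Number of elements of type T processed per vector at the detected width.
    public static int LaneCount<T>() where T : unmanaged =>
        VectorBits == 0 ? 1 : VectorBits / 8 / Unsafe.SizeOf<T>();
}
```

A width-locked kernel is one that ignores such a value and always steps by `Vector256<T>.Count`, which is why it cannot benefit from a 512-bit register.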

## Proposal

Apply the #579 adaptive-width pattern to each width-locked subsystem: drive lane count and load/store/op from `VectorBits` (via `VectorMethodCache`) instead of a literal `Vector256`, with a runtime capability probe + scalar/256 fallback so nothing regresses on non-AVX-512 hosts. Each item is independently shippable; correctness can be verified on a non-AVX-512 dev box via JIT software-emulation of `Vector512` (the same technique used to verify the Round/Truncate fix in dde0a0a9 / a0581c6f), with the actual speedup benchmarked on AVX-512 hardware.

- [ ] **Axis reductions**: `Reduction.Axis.Simd`, `.Widening`, `.VarStd`, `.Boolean`, `.NaN` (sum/prod/min/max/mean/var/std/any/all along an axis). Largest cluster; the horizontal-reduction tail needs a width-specific lane-fold.
- [ ] **NaN-aware masking & reductions**: `Masking.NaN`, `Masking.VarStd`, `Masking.Boolean`, `Reduction.NaN` (`nansum`/`nanmean`/`nanvar`/`nanstd`/…).
- [ ] **Flat reductions**: `Reduction.Boolean` (any/all), `Reduction.Arg` (argmin/argmax), `ILKernelGenerator.Reduction` (pairwise sum/prod), `WeightedSum` (np.average).
- [ ] **Cast / astype subsystem**: widen the AVX2-only conversion paths. The AVX-512 instruction mix differs from AVX2 (e.g. `VPERMB` requires AVX512VBMI, and narrowing can use the `VPMOV*` truncating down-converts, which have no AVX2 counterpart, alongside the `VPMOVZX` widening moves), so this is the most involved bucket and should be width-probed per conversion.
- [ ] **MatMul / dot**: give `SimdMatMul` a Vector512 microkernel (most likely to show a clean win since matmul is compute-bound).
- [ ] **x86-512 binary routing polish**: `VectorMethodCache.ResolveX86BinaryApi(512, …)` returns `Avx512F` for everything; wire **Avx512BW** (byte/word add/sub/min/max) and **Avx512DQ** (int64 multiply) so those take the x86 fast path instead of the cross-platform fallback. *(Correctness-neutral: today it falls back to `Vector512.*`, which the JIT lowers correctly; this is purely a performance change.)*
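To illustrate the "width-specific lane-fold" mentioned for the reduction items, here is a minimal sketch of a flat sum that picks its vector width from a caller-supplied value. `AdaptiveSum` and its `vectorBits` parameter are hypothetical and only stand in for the real `VectorBits`-driven kernels; `long` is used so that the fold is exact regardless of lane count. It assumes .NET 8 or later for `Vector512`.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Hypothetical sketch: a flat sum whose SIMD body follows the requested width.
public static class AdaptiveSum
{
    public static long Sum(ReadOnlySpan<long> data, int vectorBits)
    {
        ref long p = ref MemoryMarshal.GetReference(data);
        nuint i = 0, n = (nuint)data.Length;
        long total = 0;

        if (vectorBits >= 512)
        {
            nuint step = (nuint)Vector512<long>.Count;
            var acc = Vector512<long>.Zero;
            for (; i + step <= n; i += step)
                acc += Vector512.LoadUnsafe(ref p, i);

            // Width-specific lane fold: 512 -> 256 -> scalar.
            Vector256<long> folded = acc.GetLower() + acc.GetUpper();
            total = Vector256.Sum(folded);
        }
        else if (vectorBits >= 256)
        {
            nuint step = (nuint)Vector256<long>.Count;
            var acc = Vector256<long>.Zero;
            for (; i + step <= n; i += step)
                acc += Vector256.LoadUnsafe(ref p, i);

            // The 256-bit body only needs the final horizontal sum.
            total = Vector256.Sum(acc);
        }

        // Scalar tail (also the whole loop when no SIMD width was requested).
        for (; i < n; i++)
            total += data[(int)i];

        return total;
    }
}
```

Calling `AdaptiveSum.Sum(data, 512)` and `AdaptiveSum.Sum(data, 256)` on the same input should agree exactly for integer data. For floating-point element types the same structure changes the order of additions, so agreement there is something to verify rather than assume.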

## Evidence

- Width-lock confirmed by grep: the files above contain `Vector256<…>` with **0** `VectorBits` references, so the emitted IL pins a 256-bit register regardless of `VectorBits`.
- `Reduction.Axis.Simd.cs` → `AxisReductionSimdHelper<T>` is generic over `T` but fixed at `Vector256<T>`.
- `Cast.ToHalf.cs` header documents the AVX2-only design ("i64/u64 → f16 (Wave 17, AVX2-only)…").
- `SimdMatMul.*` contains no `Avx512`/`Vector512` references.
- AVX-512 is currently used only in: `Cast.cs`, `Cast.Half.cs`, `DirectILKernelGenerator.cs`, `VectorMethodCache.cs`.
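The grep-based survey above could be reproduced with a small script along these lines. This is only a sketch of the counting rule (`Vector256<` occurrences versus `VectorBits` references); `WidthLockSurvey` and its members are hypothetical names, and the directory you pass in is up to you.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper that applies the survey rule to C# source files.
public static class WidthLockSurvey
{
    public record Row(string Name, int Vector256Count, int VectorBitsRefs)
    {
        // Width-locked: uses Vector256<T> and never consults VectorBits.
        public bool WidthLocked => Vector256Count > 0 && VectorBitsRefs == 0;
    }

    // Counts occurrences in one file's text.
    public static Row Scan(string name, string source) => new Row(
        name,
        Regex.Matches(source, "Vector256<").Count,
        Regex.Matches(source, @"\bVectorBits\b").Count);

    // Scans every .cs file under a directory, largest Vector256 users first.
    public static IEnumerable<Row> ScanDirectory(string directory) =>
        Directory.EnumerateFiles(directory, "*.cs", SearchOption.AllDirectories)
                 .Select(path => Scan(Path.GetFileName(path), File.ReadAllText(path)))
                 .Where(row => row.Vector256Count > 0)
                 .OrderByDescending(row => row.Vector256Count);
}
```

A textual count like this can over-report (comments and strings are matched too), so it is a starting point for the survey and not a substitute for reading the kernels.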

## Scope / Non-goals

- **Non-goal:** the elementwise path (#579 already covers it).
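- **Non-goal:** new ufuncs or dtype support. This is width widening of existing kernels only, and results must stay bit-for-bit identical. For floating-point reductions that constrains the accumulation order, because a different lane count can reorder the additions.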
- **Non-goal:** AVX-512 sub-feature *detection* policy beyond what each kernel needs (BW/DQ/VBMI probed where used).
- **Out of scope here:** scalar-math unary ops (sin/cos/exp/log) that have no SIMD at any width.
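As a rough illustration of "probed where used", the routing for the BW/DQ cases named in the proposal might be sketched as follows. `X86Api`, `X86RoutingSketch`, and `Resolve` are hypothetical and do not describe the real `VectorMethodCache.ResolveX86BinaryApi` signature; only the `IsSupported` properties are actual .NET APIs (available from .NET 8). The `elementBytes <= 2` rule is meant for integer element types only.

```csharp
using System.Runtime.Intrinsics.X86;

// Hypothetical routing targets for a 512-bit binary op.
public enum X86Api { CrossPlatform, Avx512F, Avx512BW, Avx512DQ }

public static class X86RoutingSketch
{
    // Pure decision function: the capability flags are passed in so the
    // policy can be checked on any machine.
    public static X86Api Resolve(int elementBytes, bool isInt64Multiply,
                                 bool hasF, bool hasBW, bool hasDQ)
    {
        if (!hasF) return X86Api.CrossPlatform;

        // Byte/word integer add/sub/min/max live in AVX512BW.
        if (elementBytes <= 2)
            return hasBW ? X86Api.Avx512BW : X86Api.CrossPlatform;

        // 64-bit integer multiply lives in AVX512DQ.
        if (isInt64Multiply)
            return hasDQ ? X86Api.Avx512DQ : X86Api.CrossPlatform;

        return X86Api.Avx512F;
    }

    // Convenience overload that reads the flags from the running CPU.
    public static X86Api Resolve(int elementBytes, bool isInt64Multiply) =>
        Resolve(elementBytes, isInt64Multiply,
                Avx512F.IsSupported, Avx512BW.IsSupported, Avx512DQ.IsSupported);

    // Byte permutes (VPERMB) additionally need AVX512VBMI.
    public static bool CanUseBytePermute => Avx512Vbmi.IsSupported;
}
```

Falling back to `CrossPlatform` when a sub-feature is missing mirrors the correctness-neutral behaviour described in the proposal: the result is the same, only the instruction selection differs.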

## Benchmark / Performance

- Target: up to **2× throughput** on the widened kernels on AVX-512 hosts; **no regression** on V256/V128/scalar hosts (capability-gated fallback).
- **Caveat:** reductions and casts are frequently **memory-bound**, and heavy AVX-512 can trigger frequency throttling on some Intel parts — so the win is real but **hardware-dependent and must be measured** on Zen 4+/Ice Lake+ before claiming it. Use `benchmark/layout/` (reduction × layout × dtype) and `benchmark/cast/` (astype src→dst matrix) to A/B V256 vs V512.
- Dev-box correctness without AVX-512: force `VectorBits = 512` semantics and rely on the JIT's software emulation of `Vector512<T>` to bit-compare against the V256/scalar result.
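The dev-box check in the last bullet can be sketched as a bit-compare between a `Vector512` body, a `Vector256` body, and a scalar reference. The example below uses an `int` → `float` cast as a stand-in kernel; `WidthParityCheck` and its methods are hypothetical, not the project's test harness. It relies on `Vector512<T>` APIs falling back to software when the hardware lacks AVX-512 (assuming .NET 8 or later).

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Hypothetical parity harness: same cast at three widths, compared bitwise.
public static class WidthParityCheck
{
    public static void CastScalar(ReadOnlySpan<int> src, Span<float> dst)
    {
        for (int i = 0; i < src.Length; i++) dst[i] = src[i];
    }

    public static void CastV256(ReadOnlySpan<int> src, Span<float> dst)
    {
        if (dst.Length < src.Length) throw new ArgumentException("dst too short");
        ref int s = ref MemoryMarshal.GetReference(src);
        ref float d = ref MemoryMarshal.GetReference(dst);
        nuint i = 0, n = (nuint)src.Length, step = (nuint)Vector256<int>.Count;
        for (; i + step <= n; i += step)
            Vector256.ConvertToSingle(Vector256.LoadUnsafe(ref s, i)).StoreUnsafe(ref d, i);
        for (; i < n; i++) dst[(int)i] = src[(int)i]; // scalar tail
    }

    public static void CastV512(ReadOnlySpan<int> src, Span<float> dst)
    {
        if (dst.Length < src.Length) throw new ArgumentException("dst too short");
        ref int s = ref MemoryMarshal.GetReference(src);
        ref float d = ref MemoryMarshal.GetReference(dst);
        nuint i = 0, n = (nuint)src.Length, step = (nuint)Vector512<int>.Count;
        for (; i + step <= n; i += step)
            Vector512.ConvertToSingle(Vector512.LoadUnsafe(ref s, i)).StoreUnsafe(ref d, i);
        for (; i < n; i++) dst[(int)i] = src[(int)i]; // scalar tail
    }

    // Index of the first element whose bit pattern differs, or -1 if none.
    public static int FirstMismatch(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
    {
        if (a.Length != b.Length) return Math.Min(a.Length, b.Length);
        for (int i = 0; i < a.Length; i++)
            if (BitConverter.SingleToInt32Bits(a[i]) != BitConverter.SingleToInt32Bits(b[i]))
                return i;
        return -1;
    }
}
```

Comparing bit patterns instead of using `==` keeps the check strict for values such as signed zeros and NaNs in kernels where those can occur. Inputs whose length is not a multiple of the lane count exercise the scalar tail as well.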

## Related issues

- #579 — Adaptive Vector Width (parent; elementwise path)
- #587 — IL Kernel/Generation Migration: eliminate NPTypeCode switch/case
- #585 — IL-generated kernels for UnmanagedMemoryBlock


