Skip to content

[Major Release 0.60.0] NumPy nditer port, NDExpr/np.evaluate fusion, full advanced-indexing parity, byte-exact printing, C/F/A/K layout, stride-native matmul, 15 dtypes#611

Merged
Nucs merged 567 commits into
masterfrom
nditer
Jun 28, 2026

Conversation

@Nucs

@Nucs Nucs commented Apr 22, 2026

Copy link
Copy Markdown
Member

The progress to 0.60.0

A from-scratch port of NumPy 2.4.2's nditer engine and the entire compute stack built on it — merging the nditer branch into master. It adds a fused-expression DSL (np.evaluate), full advanced-indexing parity with NumPy (down to the memory-safety and error-text level), byte-exact NumPy array printing, C/F/A/K memory-layout support wired through the whole API, stride-native matmul, NumPy-seed-compatible np.random (MT19937), all 15 NumPy dtypes, 36+ new np.* functions, deterministic memory management, and a differential-fuzz pipeline that proves bit-exactness against NumPy. It ships as NumSharp 0.60.0 — the first stable (non-prerelease) release in the 0.x line.

620 commits · 1,309 files · +330,971 / −273,664 (vs master). The net is a large body of new engine code offset by an equally large deletion of legacy generated code (the Regen template engine, NDIterator, and MultiIterator are all gone). Everything below was validated against NumPy 2.4.2 ground truth — a ~40,000-case differential corpus, 566 iterator-parity scenarios, a 12,000+-case index oracle, ~18,000 array-print fuzz cases, and per-feature battle tests run on actual NumPy output.


TL;DR

  • NDIter — full port of NumPy 2.4.2's nditer (~12.5K lines): all iteration orders (C/F/A/K), all indexing modes, buffered casting, buffered-reduce double-loop, masking, memory-overlap protection (COPY_IF_OVERLAP), windowed buffering (DELAY_BUFALLOC), unlimited operands and dimensions. It is the production execution engine for the elementwise/reduce core, at or faster than NumPy on every probed aspect.
  • NDExpr DSL + three-tier custom-op API, exposed as the public np.evaluate — write your own ufunc (raw IL / element-wise SIMD / composable expression trees) and run fused expressions 3.2–6.1× faster than NumPy (which can't fuse), with per-node NumPy result_type typing and fused reductions.
  • Full advanced-indexing parity. A faithful port of NumPy's prepare_index + unified advanced-index gather/scatter takes the get/set surface from ~697 → 0 divergences (full parity) against a new committed differential index oracle, closes a class of memory-safety bugs (out-of-bounds gather/scatter on exotic mixed indices), and aligns every IndexError/ValueError text with NumPy.
  • Byte-exact NumPy array printing. NDArray.ToString() is now a 1-to-1 port of NumPy 2.4.2's array_str/array_repr/array2string + Dragon4 float formatting; new public np.array2string/np.array_repr/np.array_str/np.set_printoptions/np.printoptions/np.format_float_positional/np.format_float_scientific. ~18,000 fuzz cases byte-identical to NumPy.
  • out= / where= / dtype= ufunc kwargs across the elementwise API — binary, unary-math, comparison, predicate, and bitwise families with exact NumPy broadcast/cast/error-text semantics. Plus np.bitwise_and/or/xor, np.positive, and first-class np.maximum/minimum/fmax/fmin.
  • All 15 NumPy dtypes — adds int8 and float16 and promotes complex128 from a stub to first-class (was 12 dtypes), each wired through creation / astype / reductions / comparisons / IL kernels with full 15×15 parity; the six Complex transcendentals are implemented. NumPy 2.x type aliases (np.byte→int8, np.complex64 throws, np.intp/int_/uint) realigned.
  • 36+ new np.* APIssort/argsort, pad (11 modes), tile, median/percentile/quantile (all 13 interpolation methods) + nan* variants, average, ptp, take/put/place, extract/compress, diagonal/trace, argwhere/flatnonzero, unravel_index/ravel_multi_index/indices, delete/insert/append, split/array_split/hsplit/vsplit/dsplit, diff/ediff1d, asfortranarray/ascontiguousarray, np.multithreading, plus the printing APIs above.
  • np.random rebuilt for NumPy RNG parity — the legacy Knuth subtractive generator is replaced by MT19937 (NumPy's Mersenne Twister) for 1-to-1 seed/state compatibility, plus 24 new distribution samplers (weibull, vonmises, pareto, laplace, gumbel, dirichlet, multivariate_normal, noncentral_chisquare/noncentral_f, standard_t/standard_cauchy/standard_gamma/standard_exponential, triangular, zipf, logseries, rayleigh, wald, power, f, logistic, hypergeometric, multinomial, negative_binomial). Output is byte-identical to NumPy 2.4.2 at a given seed.
  • C/F/A/K order support wired through the whole APIShape understands F-contiguity, OrderResolver resolves NumPy order modes, ~68 layout bugs fixed.
  • Stride-native matmul/dot — BLIS-style GEBP GEMM absorbs arbitrary strides for all dtypes (kills a ~100× penalty on transposed inputs); fused 1-D dot 3.5–9× faster with zero GC; opt-in multithreaded dot ~2× faster than NumPy's default on 1M vectors.
  • Type casting (astype) faster than NumPy across the board — the entire copy/retype/cast surface is unified on one NDIter.CopyAs core (the 2,226-line legacy per-element cast loop deleted), then a SIMD campaign took the full 15×8×15 astype matrix from 716 → 129 lagging cells of 1,568 comparable (1,439 winning ≥1.0× vs NumPy 2.4.2).
  • NumPy-parity benchmark: geomean 1.26× at 10M elements (397 faster / 150 close / 42 slower of 615 ops; 1.14× at 1K, 0.90× at 100K), from a committed, reproducible BenchmarkDotNet-vs-NumPy harness — an op/dtype/N matrix over all 15 dtypes plus five appended subsystems (iterator, memory-layout, operand, cast, fusion) with per-release provenance snapshots.
  • Deterministic memory management — atomic reference counting + IDisposable on NDArray, plus a tcache-style buffer pool (1 B – 64 MiB window).
  • Differential fuzzing vs NumPy~51,000 committed bit-exact cases across 28 tiers (a ~40,000-case op corpus + the ~12,400-case index oracle), replayed with no Python at test time, plus a metamorphic-invariant tier, a seeded fuzzer + shrinker, the CI FuzzMatrix gate, and a nightly soak (§10).
  • Legacy stacks deleted outrightMultiIterator, NDIterator (interface + class + AsIterator), and the Regen template engine (87 inline #if _REGEN blocks across 35 files) are all gone.
  • Cross-platform — macOS/Apple-Silicon (ARM64) signed-zero + integer-widening reduction parity fixed.
  • Test suite: ~10,980 passed / 0 failed on net8.0 + net10.0, plus the differential-fuzz corpora replayed by the FuzzMatrix gate. 177 formerly-[OpenBugs] reproductions were promoted into regular CI tests as their bugs were fixed.

1. NDIter — full NumPy nditer port

From-scratch C# port of NumPy 2.4.2's iterator machinery under src/NumSharp.Core/Backends/Iterators/ (~12,557 lines), promoted to public API with NDArray overloads. The public surface includes the NumPy-named flag enums (NDIterFlags/NDIterOpFlags/NDIterGlobalFlags, NPY_ORDER, NPY_CASTING), the NDIterRef kernel handle, and the NDInnerLoopFunc per-chunk delegate; the standalone flat iterator is NDFlatIterator (drives np.broadcast(...).iters).

Capability Detail
Iteration orders C, F, A, K — incl. NEGPERM negative-stride handling, axis reordering + coalescing to full 1-D collapse
Indexing modes MULTI_INDEX, C_INDEX, F_INDEX, RANGE (parallel chunking), GotoIndex / GotoMultiIndex / GotoIterIndex
Buffering Buffered casting with all 5 casting rules, windowed buffered iteration, DELAY_BUFALLOC, buffered-reduce double-loop (incl. bufferSize < coreSize)
Reductions op_axes with -1 reduction axes, REDUCE_OK, IsFirstVisit, REUSE_REDUCE_LOOPS slab accumulation
Overlap safety COPY_IF_OVERLAP via a port of NumPy's mem_overlap solver (NDMemOverlap.cs) — overlapping in/out operands no longer silently corrupt
Masking WRITEMASKED + ARRAYMASK executed — the buffered window flush writes back only mask-nonzero elements; VIRTUAL operands construct with NumPy 2.x semantics
Operands / dims Unlimited operands (NumPy caps at NPY_MAXARGS=64) and unlimited dimensions (NumPy caps at NPY_MAXDIMS=64) via dynamic allocation
APIs Copy, GetIterView, RemoveAxis, RemoveMultiIndex, ResetBasePointers, IterRange, DebugPrint, fixed/axis stride queries, GetValue<T>/SetValue<T>, …
Casting parity NDIterCasting.CanCast matches NumPy's safe/same_kind lattice exactly

Validated by a dedicated battletest harness: 566 scenarios replayed against NumPy 2.4.2 byte-for-byte, plus a permanent variation-probe harness. Dozens of parity bugs found and fixed against NumPy ground truth (negative-stride flipping, NO_BROADCAST enforcement, F_INDEX coalescing, buffered-reduction stride inversion, K-order on broadcast inputs, the size-1 stride-0 invariant, op_axes out-of-bounds reads on stretched size-1 axes, write-broadcast validation, unit-axis absorption) — each reproduced against NumPy first, then fixed by adopting NumPy's constructor structure.

Execution at NumPy speed

NDIter isn't just correct — it is the production execution engine: DefaultEngine's binary, unary, and comparison ops (same- and mixed-dtype) route through the NDIter Tier-3B shell.

Aspect (float32) NumSharp NumPy Ratio (NPY/NS)
contig sqrt 10M 2.98 ms 3.24 ms 1.09×
contig add 10M 3.91 ms 4.09 ms 1.05×
strided add 1M 319 µs 416 µs 1.30×
strided sqrt 1M 206 µs 374 µs 1.82×
strided sum 1M 109 µs 205 µs 1.88×
fused a*b+c 10M 4.77 ms 13.38 ms 2.81×
fused (a-b)/(a+b) 10M 4.12 ms 22.33 ms 5.42×

Key mechanisms: an O(1) trivial-loop bypass that skips iterator construction for contiguous operands, identity-broadcast fast paths, AVX2 hardware-gather (vgatherdps) strided SIMD in the Tier-3B shell (NumPy uses scalar loops for strided binary/reduce — its floors are beatable), and strided-reduction kernels.

2. NDExpr DSL + np.evaluate (fusion)

User-extensible kernel layer on top of NDIter — the public answer to "how do I write my own ufunc":

  • Tier 3A — ExecuteRawIL: emit raw IL against the NumPy ufunc signature void(void** dataptrs, long* strides, long count, void* aux).
  • Tier 3B — ExecuteElementWise: provide scalar + vector IL; the shell supplies a 4×-unrolled SIMD loop, remainder vector, scalar tail, and strided fallback.
  • Tier 3C — ExecuteExpression: compose NDExpr trees with C# operators ((a - b) / (a + b)), 50+ node types (arithmetic, trig, exp/log, rounding, predicates, comparisons, Min/Max/Clamp/Where), plus Call() to splice any delegate/MethodInfo into a fused kernel. Compiled once, cached by structural key, ~5 ns dispatch.

Exposed publicly as np.evaluate(expr[, operands][, out]):

  • Per-node NumPy result_type typing — every node resolves to its NumPy 2.4.2 dtype, so mixed trees wrap correctly: (i4*i4)+f8 wraps the multiply in int32 (→ 1410065408) before promoting. Strong-strong NEP50 (incl. int/float tier crossing), weak python-scalar literals (i4+2 → i4, i4/2 → f8) with NumPy's exact OverflowError, and special resolvers (true_divide, arctan2, negative-integer-literal powerValueError, bool add=OR/multiply=AND).
  • Fused reductionsNDExpr.Sum/Prod/Min/Max/Mean compile a one-pass inner loop; sum(a*b) reads a and b once and never materializes the product. NumPy reduction dtypes (int→i64, uint→u64, mean→f64).
  • out= joins via the ufunc rules (same_kind validation, reference identity, overlap-safe aliasing through COPY_IF_OVERLAP); an EXTERNAL_LOOP guard prevents the silent count==1 slow path.
  • Measured (Release, 4M f64, NumPy 2.4.2): a*b+c 3.2×, (a-b)/(a+b) 6.1×, sum(a*b) 3.6×, sum f32 2.9×, i4*2+f8 3.5× faster. Permanent gate in benchmark/fusion/.
  • A runtime+type-aware SIMD gate decides per-node whether the rounding family (Floor/Ceil/Round/Truncate) vectorizes (float/double only, and only where the BCL provides Vector{N}.<op> at the active width), so fused integer rounding no longer crashes and SIMD is used wherever it exists on the running runtime.

3. Full advanced-indexing parity + memory-safety

A differential index oracle (NumPy 2.4.2 as the sole oracle) proved the common get/set surface was bit-exact but exposed ~697 divergences across exotic mixed advanced-index combinations (boolean-array + fancy, multi-dim fancy + slice, 0-d-bool + fancy, multi-fancy, empty combos) plus a flaky heap-corruption crash. This PR replaces the ad-hoc per-shape "Try*" patchwork with a faithful port of NumPy's mapping.c model and drives the oracle from ~697 → 0 divergences (full parity). The final fixes closed the remaining divergence categories — value-broadcast on empty/scalar selections, 0-d-base over-indexing, empty-advanced gather, empty-slice assignment, and non-consecutive 0-d-bool axis placement (which uncovered and fixed a latent Shape.Broadcast hash collision) — all pinned by a new Indexing.CombinatorialParity FuzzMatrix gate; the curated, dtype, and seeded-random oracle tiers are bit-exact. The one known open item is a flaky teardown heap-corruption crash on a single corpus shape (Index_Random stays [OpenBugs] for the crash, not for any parity gap).

  • PrepareIndex (Selection/NDArray.Indexing.PrepareIndex.cs) — a port of NumPy's prepare_index: one up-front pass that classifies the whole index tuple (HAS_* bitmask) and validates it before any kernel runs. Raises NumPy-verbatim texts for too-many-indices, boolean-array-dim mismatch, integer/array value out-of-bounds, and un-broadcastable advanced blocks.
  • TryBuildMultiAdvancedGrid — NumPy's general advanced-index algorithm (block broadcast + consec-aware axis placement). All advanced tuples (single or multiple fancy, fancy + slice/newaxis, 0-d bool joining the block, pure-advanced) now route through one gather/scatter; only pure-basic tuples fall through to the view path. Correct axis placement for ≥2 advanced indices mixed with slices (consecutive → in place; separated → advanced axes to front).
  • Memory-safety fixes (the heap-corruption class is closed): negative scalar-index assignment no longer writes one element before the buffer; fancy negative-OOB no longer reads OOB; per-axis integer OOB is validated (a[0,4] on a 3×4 now raises instead of returning a neighbour); too-many advanced indices raise instead of walking strides past the array; the negative-stride gather offset bound was corrected (valid rows on a[::-1] no longer rejected); block-copy bounds guards + an opt-in Windows page-heap (NUMSHARP_GUARD_PAGES=1) backstop the gather/scatter.
  • Combined boolean + advanced indexing — a boolean mask mixed with int/slice/array (arr[mask, int], arr[:, mask], arr[mask, 1:3], leading k-D mask + basic) now works for get and set, expanding each mask to its nonzero() components like NumPy.
  • Assignment validation — value-broadcast is checked on assignment (partial writes rejected with NumPy's could not broadcast … ValueError); a valid smaller value tiles like NumPy; a broadcastable value scatters correctly into a ≥2-D fancy subspace; empty-into-non-empty raises, empty-into-empty no-ops.
  • Input-form parity — raw bool[]/bool[,] (any rank) and IEnumerable<bool> are recognized as masks; C# tuples (nd[(1,2)]) spread to coordinates; List<int>/ArrayList/any IEnumerable coerce to a fancy index; uint64 scalar indices and int8 fancy-index dtype are accepted (all eight integer dtypes now index identically).
  • Differential index oracle committed as a FuzzMatrix gate (test/oracle/gen_index_oracle.py + Fuzz/IndexOracleTests.cs): index_curated.jsonl (2,265) + index_dtype.jsonl (104) run in CI bit-exact, and the 10,000-case seeded random tier is now 0 divergences too (replayed in soak; held out of the per-PR gate only by the teardown crash above). The combinatorial fixes are additionally pinned by the Indexing.CombinatorialParity gate.

Breaking: a raw int[]/long[] used as the sole index is now fancy indexing, not coordinate access — see Breaking Changes.

4. Byte-exact NumPy array printing

NDArray.ToString() is refactored from the legacy Python-list output ([0, 1, 2], no alignment/precision/summarization/dtype) to a 1-to-1 port of NumPy 2.4.2's array printing (numpy/_core/arrayprint.py + dragon4.c).

  • ToString() / ToString(false)np.array_str"[0 1 2]" (the str() form).
  • ToString(true)np.array_repr"array([0, 1, 2], dtype=…)" (the repr() form).
  • New subsystem src/NumSharp.Core/Backends/Printing/: PrintOptions (AsyncLocal context mirroring NumPy's format_options), Dragon4 (positional + scientific float→string), ElementFormatters (bool/int/float/complex with the two-pass exp-format decision, per-dtype cutoffs, column sizing, nan/inf fields), ArrayFormatter (recursive layout, line-wrap at linewidth, summarization with edgeitems, the 0-d str-vs-repr asymmetry, repr dtype/shape suffixes).
  • Float digit generation uses .NET's shortest round-trip ToString("R") (== Dragon4 unique) but routes all rounding through ToString("F"|"E"+precision) (rounds the true binary value, IEEE half-to-even) — never the shortest string (which diverges ~50% on adversarial ties).
  • Public API (APIs/np.array2string.cs): np.array2string, np.array_str, np.array_repr, np.set_printoptions, np.get_printoptions, np.printoptions (IDisposable context), np.format_float_positional, np.format_float_scientific.
  • NumSharp's Char dtype (string storage, no NumPy equivalent) keeps its legacy rendering, so GetString/AsString are unaffected.

Validated by ~18,000 differential-fuzz cases byte-identical to NumPy 2.4.2 across all dtypes (incl. float16/32/64 scientific, adversarial ties/carries), 1-D–5-D, sliced/broadcast/transposed views, complex, nan/inf, summarization, line-wrap, and print options; ~174 strict parity tests on net8.0 + net10.0.

Breaking: any code parsing the old ToString() format must update — see Breaking Changes.

5. C/F/A/K memory-layout support

  • Shape now tracks F-contiguity with NumPy-convention contiguity computation; new OrderResolver resolves C/F/A/K for every API with an order parameter.
  • Order support wired through: copy, array, asarray, asanyarray, *_like, astype, flatten, ravel, reshape, eye, concatenate, cumsum, argsort, tile, plus post-hoc F-contig preservation across the IL-kernel dispatchers.
  • New: np.asfortranarray, np.ascontiguousarray.
  • np.where selects C/F output layout the way NumPy does; ravel('F') of an F-contig source returns a view (was a 3,000× copy).
  • ~68 layout bugs fixed across 9 TDD fix groups, backed by ~3,300 lines of order tests (reductions/keepdims, matmul/dot/outer/convolve, broadcasting-from-F, manipulation, file I/O fortran_order, Decimal scalar path, fancy-write isolation, …).

6. New & completed np.* APIs

New functions:

Area APIs
Fused / ufunc np.evaluate (fused expressions — §2), np.bitwise_and, np.bitwise_or, np.bitwise_xor, np.positive, first-class np.maximum/np.minimum/np.fmax/np.fmin
Printing np.array2string, np.array_str, np.array_repr, np.set_printoptions, np.get_printoptions, np.printoptions, np.format_float_positional, np.format_float_scientific
Sorting np.sort (+ ndarray.sort; np.argsort reimplemented) — radix line-kernel on NDIter, stable, NaN-last, all axes / orders (closes a long-standing Missing Function)
Manipulation np.pad (all 11 NumPy modes + callable), np.tile, np.delete, np.insert, np.append
Splitting np.split, np.array_split (uneven splits), np.hsplit, np.vsplit, np.dsplit
Indexing/selection np.take, np.put, np.place, np.extract, np.compress, np.argwhere, np.flatnonzero, np.diagonal, np.trace, np.unravel_index, np.ravel_multi_index, np.indices
Statistics np.median, np.percentile, np.quantile (all 13 interpolation methods, tuple axis, out=, keepdims, QuickSelect), np.average (weights, returned, tuple-axis), np.ptp, np.nanmedian, np.nanpercentile, np.nanquantile
Math np.diff, np.ediff1d
Creation np.asfortranarray, np.ascontiguousarray
Runtime np.multithreading(enabled, max_threads) — opt-in threaded kernels

Rebuilt to full NumPy 2.x parity: np.clip (min=/max= aliases, None bounds, 2.x promotion, out=, 4×-unrolled SIMD kernel), np.unique (5 missing kwargs, NaN partitioning, up to 43× faster), np.searchsorted (side=/sorter=, IL binary-search 5–25× faster), np.copyto (casting=/where=, overlap-safe, SIMD fast paths), np.asarray (copy=/like=/device=/string dtype), np.concatenate, np.all/np.any (tuple-axis, out=, where=), np.expand_dims (tuple axis), np.repeat (axis=), np.power (integer-power semantics + crash fix), np.broadcast (N-operand 0..∞, live cursor, lazy .iters/.numiter). Engine completeness: bool/char max/min, Complex quantile, IsInf implemented, and the six Complex transcendentals sinh/cosh/tanh/arcsin/arccos/arctan (were NotSupportedException).

np.maximum/minimum/fmax/fmin are now first-class binary ufuncs (NEP50 promotion, broadcasting, out=, where=, every execution path) instead of clip-based shims — which fixes a real correctness bug: fmax/fmin now ignore NaN (return the finite operand) while maximum/minimum propagate it.

out= / where= / dtype= ufunc kwargs (NumPy parity)

The kwargs on every NumPy ufunc now span the elementwise core — binary (add, subtract, multiply, divide, true_divide, mod, power, floor_divide), unary-math (sqrt, exp, log, sin, cos, tan, abs/absolute, negative, square, …), the six comparisons, predicates (isnan/isfinite/isinf), bitwise, invert, arctan2 — each as one NumPy-shaped overload, every rule pinned against NumPy 2.4.2:

  • out joins the broadcast but never stretches (mismatched/stretchable out raise NumPy's exact texts, trailing space included); loop dtype resolved from inputs (NEP50); out only needs a same_kind cast; the provided instance is returned (reference identity).
  • where must be exactly bool; it broadcasts over operands and participates in output shape; mask-false slots keep prior out contents.
  • out aliasing an input is well-defined via COPY_IF_OVERLAP.
  • dtype= computes in the loop dtype (subtract(300, 5, dtype=i16) = 295), with the bool add→OR / multiply→AND remap keyed off the final loop dtype.

Random sampling — NumPy RNG parity (np.random)

np.random is rebuilt around NumPy's own bit generator and legacy distribution algorithms, so a seeded NumSharp stream is byte-identical to NumPy 2.4.2.

  • MT19937 bit generator replaces the legacy Knuth subtractive generator — a full Mersenne-Twister port (Seed(uint) / SeedByArray(uint[]), 53-bit NextDouble, rejection-sampled bounded ints, 624-word state Clone/SetState) with 1-to-1 seed/state compatibility: get_state()/set_state() use NumPy's state tuple format (Algorithm, Key[624], Pos, HasGauss, CachedGaussian). Gaussian generation moved from Box-Muller to the Marsaglia polar method with the _hasGauss/_gaussCache carry, matching NumPy's random_standard_normal (and its cached state) exactly.
  • 24 new distribution samplersdirichlet, f, gumbel, hypergeometric, laplace, logistic, logseries, multinomial, multivariate_normal, negative_binomial, noncentral_chisquare, noncentral_f, pareto, power, rayleigh, standard_cauchy, standard_exponential, standard_gamma, standard_t, triangular, vonmises, wald, weibull, zipf — joining the existing 16 (rand, randn, randint, uniform, normal, bernoulli, beta, binomial, chisquare, choice, exponential, gamma, geometric, lognormal, poisson, permutation/shuffle).
  • Legacy-algorithm alignment — the pre-existing samplers were corrected to NumPy's RandomState algorithms so they consume the RNG in the same order and emit byte-identical values: geometric (NumPy's search algorithm; previously returned negatives), beta (Jöhnk's algorithm for a,b ≤ 1), chisquare (SampleStandardGamma with Vaduva's algorithm for shape < 1), the lognormal(mean=0) NaN fix, and size=0 empty-array support across every sampler.
  • multivariate_normal uses an SVD transform (Jacobi eigendecomposition, eigenvalues sorted descending) matching NumPy's implementation; identity covariance is an exact sequence match.
  • Integer distributions return int64poisson, binomial, geometric, hypergeometric, zipf, logseries, negative_binomial (matching NumPy dtypes); scalar overloads return a 0-d NDArray; long is the canonical size/index type throughout (all int downcasts removed). Size/axis/seed validation, the return contract (size=None → scalar, size=() → 0-d), and the dual-overload API shape all follow NumPy.

Breaking: the RNG swap means a given seed now produces a different sequence than prior NumSharp versions — intentional, because the new sequence matches NumPy. See Breaking Changes.

7. Linear algebra

  • Stride-native GEMM for all 12 numeric dtypes — BLIS-style GEBP with stride-aware packers; the 8×16 Vector256 FMA micro-kernel reads packed panels, so transposed/sliced inputs cost nothing extra. Eliminates the ~100× fallback penalty (np.dot(x.T, grad): 240 ms → ~1 ms) and the boxing GetValue fallback chain.
  • Full matmul gufunc semantics — batched stacking, 1-D promotion/squeeze rules, validated by a differential matrix (816 cases).
  • Fused single-pass 1-D dot — 3.5–9× faster, zero GC (was up to 446 gen-0 collections per call at 100K).
  • np.multithreading — opt-in parallel 1-D dot (1M float dot 172 → 60 µs, ~2× NumPy's default). Off by default; bitwise-identical summation order when off.

8. Performance (engine, reductions, shifts, sort, casts)

Op Improvement (NPY/NS, >1 = NumSharp faster)
Axis reductions, narrow ints Widening SIMD (int16→int32 accum etc.): sum(int16, axis=1) 1058 ms → 2.7 ms (389×); also fixes a uint32 axis-sum corruption bug
mean/var/std (axis) mean 217×, var/std 21×; count_nonzero 20×
Flat min/max (f64/f32) raw Avx.Min/Max drops the JIT's redundant NaN fixup with a separate finite-mask + cold scan: 0.64×/0.69× → 1.55×/1.73× @100k; broadcast/neg-stride axis min/max SIMD-routed (was 0.07–0.17×)
np.left_shift/right_shift reworked into first-class binary ufuncs + SIMD (variable-shift VPSLLV*/VPSRAV* via Tier-3B): common cases 2–4× NumPy (was 0.05–0.34× scalar); fixes 7 correctness bugs incl. NEP50 promotion, negative-count, bool→int8, sbyte SIMD; adds <</>> operator overloads on NDArray
np.sort / argsort radix line-kernel + insertion-sort fast path for short lines (fixes a 25–35× short-line regression) + single-pass multi-histogram radix; int8/16 6–9×, argsort 1.3–8× on most dtypes
np.cumsum reimplemented as NDIter-driven add.accumulate with KEEPORDER output (C-src→C, F-src→F); 1.6–11.8× by axis; fixes empty/0-d/size-1 NEP50 widening + negative-axis validation
Boolean mask get+set unified on one NDIter gather/scatter — tall-thin axis-0 select 372.9 ms / 3.86 GB → 9.1 ms / ~0 MB (~41×)
np.nonzero IL SIMD kernel closes an 8–241× gap
Broadcast-reduce stride-0 axes folded algebraically — sum(broadcast_to(...)) ~534–700× faster, bit-exact
sum/mean (float) bit-exact NumPy pairwise summation (matches np.add.reduce bit-for-bit; unblocks float32)
np.any/np.all (bool/char) reinterpret to byte/ushort → integer SIMD path (was 5–12× scalar); fixes a latent AVX2 32-lane mask-overflow bug
np.zeros calloc/demand-zero — O(1) (10M f64: 14.3 ms → ~0.01 ms)
Casts (astype) full 15×8×15 SIMD campaign: 716 → 129 lagging cells of 1,568 comparable (1,439 winning ≥1.0×) vs NumPy — per-src geomean f16→narrow 3.8×, f32→narrow 2.0×, f64→narrow 1.4× (architecture below)
np.abs exact dtype= parity (complex magnitude `

High-performance type casting

The single largest engineering effort in this PR rebuilt the astype/copy/retype machinery from the ground up.

  • One unified core. Every copy, retype, and cast now routes through NDIter.CopyAs(dstType, src, order) — it resolves C/F/A/K via OrderResolver, allocates the destination, and fills it through the in-place Copy primitive. NDArray.copy() and DefaultEngine.Cast (astype) both fold into it; the old scalar / (1,) / same-type-Clone / F-contig-special / CastCrossType branch maze and the 2,226-line per-element Converts.FindConverter cast loop are deleted. A TryCopySameType fast path fills scalar-broadcast and gap-free contiguous destinations with one typed InitBlock/memset/SIMD fill (6–8× for 1-byte, ~2× wider).
  • SIMD campaign over the whole matrix. A Phase-0 discovery sweep benchmarked all 15 src × 8 layouts × 15 dst astype combinations at 1M against NumPy 2.4.2, producing a lagging-cell worklist that successive waves drove from 716 → 129 of 1,568 comparable cells (1,439 now win ≥1.0×). The kernels: float→narrow-int (cvtt + truncating Narrow), float/int→bool (!=0 compare), Half↔ via the Giesen bit-fiddle (widen/narrow, sNaN-preserving), complex→int/bool (real deinterleave), sub-word strided/reversed lane shuffles (VPSHUFB/VPACKUS), fused VPGATHER whole-array kernels for strided float→narrow, an IL-emitted scalar cast kernel for the Vector-less dtypes (direct Converts.ToX, 0.65 → 1.5–2.6×), and a KEEPORDER same-type copy. Per-src geomeans: f16→narrow 3.8×, f32→narrow 2.0×, f64→narrow 1.4×.
  • Correctness rounds. Alongside the speed work, multiple rounds of NumPy-parity fixes closed precision-boundary bugs in the double→int converters, ToUInt32(double) overflow → 0, DateTime/TimeSpan conversions, and the Half/Complex/char converter paths, and replaced the IConvertible constraint with a Converts<T> path — all pinned by the cast tier of the differential-fuzz corpus (full 15×15).

Also in this PR: np.where gained an IL-generated AVX2/SSE4.1 (Neon-safe) SIMD kernel — 5.4× over the old call-based kernel and ~3.9× faster than NumPy at 1M — plus NEP50 weak-scalar type promotion; np.asanyarray now accepts every built-in C# collection (List/HashSet/Queue/Stack/ImmutableArray/LinkedList/Memory<T>/ArraySegment/LINQ results, non-generic IEnumerable/IEnumerator like ArrayList, Tuple/ValueTuple, and mixed object[]) with NumPy-like type promotion; and the test suite was migrated to MSTest v3.

All figures in this section come from the committed, reproducible NumSharp-vs-NumPy benchmark harness — the op/dtype/N matrix plus five subsystems (iterator / layout / operand / cast / fusion), with per-release benchmark/history/ provenance snapshots and the NPY/NS convention (>1 = NumSharp faster). Full details in §10.

9. Memory management — ARC + IDisposable

  • NDArray implements IDisposable backed by atomic reference counting on the unmanaged block: CAS-driven TryAddRef/Release, idempotent Dispose, finalizer safety net, immortal non-owning wraps. Views keep parents alive; parent disposal never invalidates live views.
  • Hammered by a 15-case lifecycle suite incl. 32-thread × 1,000-op concurrency races and 50-way parallel dispose — zero corruption.
  • A tcache-style size-bucketed buffer pool with a 1 B – 64 MiB window (covers both small-N ufunc results and 4M+ outputs); deterministic release plus the pool removes most steady-state GC pressure (dot at 100K: 446 collections → 0).
  • Native allocations now register GC memory pressure (GC.AddMemoryPressure/RemoveMemoryPressure) so the runtime sees the true unmanaged footprint and collects on time — fixing a runaway-growth bug (a 1M-array loop peaked at 10+ GB → ~54 MB).

10. Validation & benchmarking infrastructure (NumPy as the oracle)

Two committed, reproducible systems back every correctness and performance claim in this PR. Both treat NumPy 2.4.2 as the sole oracle, both run with no Python at test time or in CI, and both ship as first-class, regeneratable infrastructure.

The NumPy differential oracle (correctness)

A property-style differential fuzzer that proves every NDIter-backed op is bit-identical to NumPy across the input space — caught systematically, not by hand-picked cases. (The motivating failure: a cast saturate-vs-wrap bug, latent in where/copyto/concatenate, that example-based tests missed. It must be impossible to ship again.)

How it works. Generators under test/oracle/ run real NumPy 2.4.2 and emit a committed, bytes-exact JSONL corpus; the C# harness under test/NumSharp.UnitTest/Fuzz/ replays the operand bytes and bit-compares. NumPy never runs at test time — the committed corpus is its frozen answer.

  • Corpus — ~51,000 cases across 28 JSONL tiers (a ~40,000-case op corpus over 25 tiers + the ~12,400-case index oracle over 3). Each case stores its operands as (dtype, shape, element-strides, element-offset, bufferSize, base-buffer-hex) plus NumPy's expected (result dtype + shape + C-contiguous result bytes) — or the exception NumPy raised. FuzzCorpus.Reconstruct rebuilds the exact logical array from the base bytes alone — broadcast (stride-0), negative strides, offset slices and all — by aliasing a contiguous storage with the operand's view shape (no validation, so even layouts the public API would normalize away replay faithfully).
  • Layouts — the "44 variations" (layout_catalog.py, mirrored 1:1 in C#): 26 single-array builders (C/F-contig 1–3-D, transposed, strided/step-2, negative-stride, offset slice, broadcast & scalar-broadcast stride-0, 0-D view at non-zero offset, singleton dims, newaxis, empty, high-rank 5-D, reshape-view), 9 pairwise builders (the SimdFull / SimdChunk / General / scalar-left-right / broadcast-row-col / negstride dispatch classes), and 5 where-triple builders. Value pools are edge-loaded — NaN, ±inf, −0.0, type min/max, and the float→int overflow boundaries are front-loaded so even an 8-element operand exercises the cvtt sentinel paths. Every builder self-validates that the operand is a true view into its base buffer, so the bytes fully determine the array.
  • Coverage — ~90 ops (OpRegistry): astype (15×15), binary arithmetic (NEP50), floor_divide/mod/power, the six comparisons, ~30 unary-math kernels (incl. the transcendental / hyperbolic / inverse-trig stragglers), every reduction + the NaN-aware family, cumulative scans + diff, statistics (median/percentile/quantile/average/ptp/count_nonzero/clip), logic & extrema (isnan/isinf/isfinite/maximum/minimum/fmax/fmin/isclose), bitwise + shift, where/place, ~25 manipulation ops, modf multi-output, sorting/searching, matmul/dot/outer, and copyto (incl. overlapping same-buffer copies). Every case is checked on three axes against NumPy: result dtype (NEP50), result shape (broadcasting), and the result bytes (bit-exact, NaN tokenized) — plus error-side parity (a tier where NumPy raises asserts NumSharp also throws).
  • The verdict is never silent. A case is bit-exact (pass), a documented difference logged by MisalignedRegistry (intended NEP50/algorithm divergences and a few tracked bugs — excused but printed, never hidden), or a failure (any unknown divergence — the gate goes red). A failing element-wise case is auto-shrunk to a 1-element minimal repro.
  • The index oracle — a separate getter/setter differential gate (gen_index_oracle.py + IndexOracleTests): a portable token-encoded index tuple over 15 base recipes (scalar, empty, 2-/3-D, transposed, strided, negative-stride, offset, broadcast), replayed for get and set with full error-side parity. Three tiers — index_curated (2,265, CI gate), index_dtype (104 forms × 13 dtypes, CI gate), and index_random (10,000 seeded, now 0 divergences). This is the gate that drove advanced indexing from ~697 → 0 divergences (§3).
  • Metamorphic tier — 12 oracle-free invariants (MetamorphicTests) that need no NumPy: -(-a)==a, (a+b)-b==a, a.T.T==a, reshape round-trip, widening-cast round-trip, a*1==a / a+0==a, abs idempotence, sum-all == flat-sum, concatenate split-free, argsort-of-sorted = identity, equality reflexivity. They catch internal-consistency bugs the differential corpus structurally cannot.
  • CI + soak. Every tier runs under the [FuzzMatrix] gate on each CI (Windows/Ubuntu/macOS), replaying the committed corpus with no Python. A nightly fuzz-soak (fuzz_random.py, deterministic from its seed) sweeps ~1M fresh cases/night; a divergence prints a shrunk repro to drop into corpus/regressions/, which then pins it on every CI thereafter. Regenerate with numpy==2.4.2; Char/Decimal (no NumPy analog) are covered by the separate Converts-oracle tests.

The benchmark harness (performance)

A committed, reproducible NumSharp-vs-NumPy comparison driven by one entry point (benchmark/run_benchmark.py, NumPy 2.4.2 pinned). It builds and runs the BenchmarkDotNet op/dtype/N matrix (benchmark/NumSharp.Benchmark.CSharp) — 1K / 100K / 10M elements × all 15 dtypes, ~615 ops per size (1,851 measured cells: ✅ 792 / 🟡 357 / 🟠 177 / 🔴 72) — joined per (op, dtype, N) to a warm NumPy process, then appends five subsystems that fill the axes the matrix can't express:

Subsystem What it isolates Headline (NPY/NS)
iterator (benchmark/nditer) the NDIter machinery itself — construction, traversal, reductions, selection, dtypes, pathologies, dividends — vs np.nditer, aspect × cache-tier 1.18× geomean; build+dispose 3.3× faster than np.nditer; bcast_reduce 538×
layout (benchmark/layout) reduction / copy / elementwise across 8 memory layouts (C/F/T/strided/sliced/negrow/negcol/bcast) — the op matrix is C-contiguous only at 1M copy ~2–3× & elementwise ~1.2–1.8×; the per-call dispatch tax shows at 100K
operand (benchmark/operand) 1-D / scalar / mixed-operand (C+F, C+T) / binary-broadcast layouts ~1.3–2.2× per case
cast (benchmark/cast) the full 15×15 astype src→dst × 8 layouts at 1M (no op-matrix coverage) 1,439 / 1,568 comparable cells win
fusion (benchmark/fusion) np.evaluate fused single-pass vs unfused np.* chains several-fold over NumPy (§2)

Methodology guards keep the numbers honest: an InProcessEmit BenchmarkDotNet toolchain (sibling worktrees hold same-named projects the out-of-process toolchain refuses), a 25 ms-capped 50-iteration job (so µs–ms array ops skip BDN's nanosecond invocation ramp), and an asserted -c Release build (file-based dotnet run defaults to Debug, which silently halves hand-written kernels). Each run writes a committable benchmark/history/<date>_<sha>/ snapshot (report + every subsystem sheet + cards + a provenance MANIFEST) and repoints a latest symlink; a decoupled post-release benchmark.yml regenerates it, renders the DocFX pages, and commits it to master. The convention is NPY/NS throughout — ratio = NumPy ÷ NumSharp time, >1 = NumSharp faster (the headline figures above are the 2026-06-23 snapshot on an i9-13900K).

11. Legacy stacks deleted

  • MultiIterator deleted; all callers migrated to NDIter.Copy / multi-operand execution.
  • NDIterator (interface + NDIterator<T> + AsIterator extensions) deleted entirely; production iteration runs through NDIter / NDIterRef / GetAtIndex / NDFlatIterator.
  • The Regen template engine is fully purged: 87 inline #if _REGEN … #else <generated> #endif blocks across 35 files collapsed to the generated code with the template DSL preserved as reference comments; all *.template.cs / .tt / GenerateCode.ps1 and their dead <Compile Remove> csproj guards are gone. ~400 per-dtype NPTypeCode switch sites were replaced by a generic NpFunc dispatch utility, and dead-code sweeps removed the 24 [Obsolete(error:true)] tombstones plus 10 confirmed-dead private methods.

12. New dtypes — int8, float16, complex128 (+ Char8/DateTime64)

NumSharp now supports all 15 NumPy dtypes (was 12). This PR adds SByte (int8) and Half (float16), and promotes Complex (complex128) from a stub to a first-class dtype — each mapped to its NumPy name and wired through array creation, np.dtype("int8"/"float16"/"complex128") parsing, astype, reductions, comparisons, and the IL kernels.

  • First-class across every hot path. A coverage audit of the NPTypeCode.Single => switch sites found the three dtypes silently dropping out of ~23 files (9 production crash sites + 5 perf gaps); each was filled so the full 15×15 dtype matrix works end-to-endNDIterCasting/NDIterBufferManager (safe-cast, buffered iteration, the Complex→real ComplexWarning drop), np.repeat, np.any/np.all axis, argmax/argmin axis (Half first-NaN-wins, Complex lexicographic real-then-imag), and the reduction identity / min-max kernels. The six Complex transcendentals sinh/cosh/tanh/arcsin/arccos/arctan were implemented (were NotSupportedException).
  • NumPy 2.x type-alias alignment + np.dtype parser rewrite. np.dtype(string) is now a FrozenDictionary lookup; finfo/iinfo extend to the new dtypes; and the class-level aliases match NumPy 2.4.2: np.byte → int8 (sbyte), np.int_intp, np.uintuintp, np.intp/np.uintp → long/ulong on 64-bit, and np.complex64/np.csingle now throw instead of silently widening to complex128. See Breaking Changes.
  • Half uses a scalar path (no Vector<Half> in the BCL) and routes conversions through double; Complex is excluded from the inherently-real ops (unique/clip/randint) per NumPy. Reductions for the three dtypes run through the NDIter path rather than a boxed loop (Decimal axis ops 5–13×, Half mean 1.6–3.7×), with bit-exact accumulation (Half/Complex flat sum/mean accumulate in a wider type and cast back).

Conversion-helper primitives — two further NumPy-parity types adapted from vendored .NET BCL sources under src/dotnet/, standalone helpers for now (not yet wired into NPTypeCode):

  • Char8 — the NumPy S1 / Python bytes(1) equivalent (readonly struct, 1 byte): conversions, operators, span helpers, and 100% Python bytes API parity validated against a Python oracle.
  • DateTime64 — the NumPy datetime64 equivalent (readonly struct over a long tick count): full long range and a NaT sentinel (long.MinValue) with IEEE-NaN-style propagation and NumPy comparison semantics (any ordering involving NaT is false, != is true, while Equals stays hash-contract-compliant). Implicit interop from DateTime/DateTimeOffset/long, checked conversion back, and Converts.ToDateTime64/ToX(DateTime64) matching NumPy exactly; calendar arithmetic delegates to System.DateTime.

13. Examples — trainable MNIST MLP

New examples/NeuralNetwork.NumSharp: a 2-layer MLP with a naive and a fused implementation (single-NDIter bias+ReLU fusion, fused softmax-cross-entropy backward, Adam optimizer). The stride-native GEMM made the old "copy transposed views before np.dot" workaround unnecessary; converges to >99% test accuracy in the bundled demo.

14. Cross-platform

macOS / Apple-Silicon (ARM64) reduction parity fixed (6 failing tests on macos-latest): maximum/minimum signed-zero ties (ARM FMAX/FMINNM vs x86 MAXPS/MINPS) now resolve to the second operand via an explicit strict-compare + ConditionalSelect; negate(+0.0) is an explicit sign-bit XOR (was 0 - x+0.0 on ARM); narrow-int sum/prod reductions use exact integer accumulators on ARM (the AVX2-gated widening kernel fell back to a saturating double path). Reproduced under linux/arm64 via QEMU and pinned as a committed parity suite.

15. Tests & CI

  • ~10,980 passed / 0 failed on net8.0 + net10.0, with zero regressions.
  • 177 formerly-[OpenBugs] reproductions promoted into regular CI tests as their bugs were fixed (each asserts NumPy-2.4.2-correct behavior). Deliberately kept flagged: 7 AVX-512-only and 2 timing-dependent repros.
  • New/expanded suites: the differential index oracle + exhaustive get/set parity matrices (basic/fancy/edge/layout/combined/boolean-mask), array-print parity (np.ArrayPrint.ParityTests), cumsum parity (54), abs parity (33), shift parity, np.evaluate/out=/where=/dtype= parity, NDIter battletests (566), order-support sections, ARC lifecycle, and the macOS/ARM64 signed-zero parity suite.
  • CI: the FuzzMatrix gate (the differential oracle + index oracle + metamorphic tiers — §10) runs on Windows/Ubuntu/macOS; nightly fuzz-soak.yml; a decoupled post-release benchmark.yml runs the whole NumSharp-vs-NumPy harness (op/dtype/N matrix + the five iterator/layout/operand/cast/fusion subsystems — §10), renders the DocFX benchmark pages, and auto-commits the refreshed report + cards + a committable benchmark/history/<date>_<sha>/ provenance snapshot (and its latest symlink) to master.
  • Known remaining gaps (checked in as failing-by-design tests rather than ignored): the still-unimplemented NumPy functions flip/fliplr/flipud/rot90, diag, gradient, and round (the function form; np.round_/np.around exist); the benchmark-surfaced slower paths, all tracked in the committed sheets — small-N (~1K) per-call dispatch overhead, the scalar Half/Decimal element paths (no BCL Vector<Half>/Vector<decimal>), large-N np.any full-scan, comparison→bool stores, and fancy gather/scatter; and a handful of iterator/indexing edge cases pinned as [OpenBugs].

Key highlights since 0.40.0

0.60.0 caps a three-release arc that rebuilt NumSharp's compute core from the ground up and aligned it with NumPy 2.x. For anyone upgrading from 0.40.0, the cumulative picture across the prerelease line:

0.41.0 — IL Kernel Generator (the compute-core rewrite, Mar 2026)

  • Runtime IL emission (System.Reflection.Emit.DynamicMethod) replaced the ~600 K-line Regen template engine with ~19 K lines — a net −533 K lines — with Vector128/256/512 SIMD and runtime width detection across every op.
  • NEP50 (NumPy 2.x) type promotion; single-threaded deterministic execution (SIMD in place of Parallel.*).
  • 35 new functions — the nan* reductions, cbrt, floor_divide, left/right_shift, cumprod, count_nonzero, isnan/isfinite/isinf/isclose, and the np.comparison + np.logical modules — plus the comparison/bitwise operators (==>=, &/|/^) implemented for the first time.
  • MatMul 35–100× faster (cache-blocked SIMD, 20+ GFLOPS); boolean indexing and axis reductions rewritten on SIMD.
  • 60+ NumPy-parity bug fixes; +4,200 tests; no breaking changes.

0.50.0 — Long Indexing (>2 GB arrays + the type system, Apr 2026)

  • Int64/long indexing migrated across Shape/NDArray/Storage/iterators/IL kernels — arrays beyond 2.1 billion elements (>2 GB) now work; np.argmax/np.nonzero return long. New UnmanagedSpan<T> (long-length Span parity), LongIntroSort, and unmanaged long index buffers.
  • 12 type-introspection APIscan_cast, promote_types, result_type, min_scalar_type, common_type, issubdtype, finfo, iinfo, isreal/iscomplex/isrealobj/iscomplexobj.
  • NumPy 2.x type systemnp.arange(10)int64, NPTypeHierarchy (bool not under Number), and 0-D scalar arrays (np.array(5) → 0-D).
  • Python container protocol__contains__/__len__/__iter__/__getitem__/__setitem__, plus tolist()/item(); np.frombuffer rewritten to the full NumPy signature (count/offset/big-endian/view semantics).
  • ValueTypeobject scalar migration; operator-overload cleanup (−74%); 600+ battle tests.

0.60.0 — nditer (this PR — the first stable)

  • Full NumPy nditer port as the execution engine; np.evaluate fusion (3.2–6.1×); full advanced-indexing parity + a differential index oracle; byte-exact array printing; out=/where=/dtype= ufunc kwargs; C/F/A/K memory layout; 36+ new np.* APIs (sort, pad, percentile/quantile, take/put, split, …); np.random rebuilt on MT19937 for NumPy seed parity + 24 new distributions; stride-native matmul and an astype SIMD campaign that both beat NumPy; ARC memory management + buffer pool; differential fuzzing vs NumPy; and the legacy iterator stack and the Regen engine deleted outright. (All detailed in the sections above.)

Breaking changes

Change Impact Migration
Raw int[]/long[] as the sole index is now FANCY nd[new int[]{0,2}] selects rows 0 and 2 (shape (2,…)), not the element at coordinate (0,2) Use nd.GetData(0, 2) for coordinate access. nd[0,2] (separate ints) is unchanged; NDArray<T>.this[int[]] is unchanged
NDArray.ToString() format changed to NumPy array_str/array_repr [0 1 2] (str) / array([0, 1, 2], dtype=int64) (repr) instead of [0, 1, 2] Update any code parsing ToString() output; use the typed accessors/GetData for values
np.left_shift/right_shift result dtype is now result_type(lhs, rhs) int8 << int32int32 (was int8, overflowing); bool << boolint8 — (matches NumPy)
np.fmax/fmin now ignore NaN return the finite operand; maximum/minimum still propagate NaN — (matches NumPy; fixes a prior correctness bug)
np.cumsum of empty / 0-d / size-1 integer input widens to int64 and never returns 0-d cumsum(empty int32) → fresh int64; cumsum(0-d)(1,) — (matches np.add.accumulate)
Boolean-mask axis-0/partial set with a 1-D count-length value now raises IncorrectShapeException was silently "one scalar per selected row" Use a (count, 1) value to fill one value per selected row
Over-indexing with slices now raises IndexError (too many indices) A[:, :, :] on a 2-D array Drop the extra indices (matches NumPy)
Per-axis / fancy out-of-bounds now raise IndexError A[0,4] on a 3×4, fancy -7 on a size-6 axis (was wrong value / OOB read) — (correctness + memory safety)
np.full argument order flipped to np.full(shape, fill_value, dtype) matches NumPy (was np.full(fill_value, shape, dtype)) Swap the first two arguments; dtype stays third
bool - bool, -bool, np.negative(bool) now throw Matches NumPy Use ^ / cast to int first
NaN <= / >= returns False Matches IEEE & NumPy Use np.isnan explicitly
floor_divide/mod divide-by-zero & floored results; np.negative(uint) wraps Matches NumPy
np.power(int, negative int) raises ValueError Matches NumPy Use float exponents
Cast edge cases (overflow/NaN/complex→bool/float→int truncation); transcendental NEP50 width promotion; np.clip/quantile dtype promotion Return values/dtypes may change
np.byte is now int8 (sbyte) was uint8 (byte) matches NumPy's C-char convention; use np.ubyte/np.uint8 for unsigned
np.complex64/np.csingle now throw was silent widening to complex128 use np.complex128
np.int_/np.uint/np.intp/np.uintp realigned to NumPy 2.x (intp/uintp; long/ulong on 64-bit) the dtype these aliases resolve to changes — (matches NumPy 2.x)
np.random seed sequences changed (Knuth subtractive generator → MT19937) Same seed now yields a different sequence — (intentional; output is now byte-identical to NumPy 2.4.2 at a given seed)
Integer np.random distributions return int64 poisson/binomial/geometric/hypergeometric/zipf/logseries/negative_binomial were double — (matches NumPy dtypes)
Broadcast views are read-only; broadcasting keeps rank for 1-D [1] Matches NumPy .copy() to write
MultiIterator and NDIterator (+ NDIterator<T>, AsIterator) removed Public types removed (threw at runtime anyway) Use NDIter / NDIter.Copy / NDFlatIterator
NDIter: MaxOperands=8 and 64-dim limits removed None (loosening)
np.copyto unwriteable-destination error type corrected Exception type change

Everything above was validated against NumPy 2.4.2 ground truth — by ~40,000 differential corpus cases, 566 iterator parity scenarios, a 12,000+-case index oracle, ~18,000 array-print fuzz cases, and per-feature battle tests run on actual NumPy output.


Closes: #435 #439 #456 #477 #480 #495 #501 #508 #515 #542 #567 #568 #604 #605 #608

@Nucs

Nucs commented Jun 5, 2026

Copy link
Copy Markdown
Member Author

📊 Benchmark & performance — nditer

Two performance items on this branch: a fused strided-SIMD unary kernel, and an official NumSharp-vs-NumPy benchmark across ~all op categories at three sizes (benchmarked Core state = d01f1d63).


1. Fused strided-SIMD unary IL kernel (d01f1d63)

New whole-array kernel for unary ops over non-contiguous 1-D inputs: strided gather → Vector{W}.Create → unary op → contiguous store, single pass, no scratch buffer, no per-tile dispatch.

Isolated kernel, np.sqrt(a[::2]), ns/elem:

N fused NumPy 2.4.2 NumSharp speedup
64 0.547 4.322 7.9×
4,096 0.549 0.660 1.20×
262,144 0.557 1.419 2.55×

The kernel is size-invariant (~0.55 ns/elem at every size) while NumPy degrades 2–6× as data spills out of cache.

All 11 ops on this path — speedup vs NumPy @262K (f64):

abs        3.37×   negate  3.15×   floor   3.07×   trunc   3.03×   round  3.00×
sqrt       2.55×   rad2deg 2.41×   deg2rad 2.22×   square  2.18×   reciprocal 1.72×

Verified 22,000 bit-exact checks (fused == contiguous kernel); full unit suite 9447/0/11.

Note: this is a DirectILKernelGenerator whole-array kernel that bypasses NpyIter by design — the fusion (gather folded into Vector.Create) is incompatible with NpyIter's gather/kernel separation, which is exactly the (slower) buffered path it replaces.


2. Official NumSharp-vs-NumPy benchmark (6038990f)

Methodology: BenchmarkDotNet Full — 50 iterations, InProcessEmit toolchain, iteration-time capped at 25 ms — × {1K / 100K / 10M} vs NumPy 2.4.2. i9-13900K · .NET 10.0.101 · Python 3.12.12. 1,813 C# measurements → 1,111 matched comparisons.

The iteration-time cap is what makes a Full run feasible: BDN's default Throughput strategy ramps to ~8192 invocations/iteration, so a 10M-element op at 50 iters took ~25 s per case. Capping it ⇒ ~15× faster (a 30-case set went 18 min → 70 s) with all 50 iterations preserved.

Headline — geomean (NumSharp ÷ NumPy, lower = better):

        slower ◄───────── 1.0 (parity) ─────────► faster
1K    ████████████████████  1.96×   (102 win / 212 lose)
100K  ██████████████████▎   1.83×   (109 win / 196 lose)
10M   ██████████▏ ........  1.00×   (166 win /  36 lose)   ◄ PARITY

At the memory-bound 10M size NumSharp is at parity across ~409 ops (166 faster, only 36 slower). Small-size cost is the per-element dispatch + result-allocation tax (~2×).

Per-suite geomean by size:

suite 1K 100K 10M
Statistics 0.19× 0.68× 0.48×
Sorting 0.41× 1.13× 0.45×
Comparison 1.27× 2.22× 0.50×
Bitwise 8.16× 1.16× 0.61×
Reduction 0.48× 0.94× 0.91×
Arithmetic 3.09× 2.62× 1.25× 🟡
Unary 3.50× 4.44× 1.53× 🟡
Creation 12.26× 2.92× 2.24× 🟠
LinearAlgebra 2.76× 1.66× 4.02× 🔴

🏆 Biggest wins (@10m, real ms):

op dtype NumPy NumSharp speedup
average f32 9.60 0.94 10.2×
nansum f32 14.35 1.49 10.0×
nanprod f32 18.52 1.90 9.7×
var f32 16.96 2.60 6.5×
count_nonzero f64 22.61 3.74 6.0×
nanmean f64 33.47 5.69 5.9×

🎯 Biggest gaps (@10m) — optimization targets:

op dtype NumPy NumSharp gap
sum axis=1 uint8 3.12 49.74 16.0×
dot f64 1.23 16.46 13.4×
matmul f64 0.72 4.26 5.9×
argsort int32 369 2162 5.9×

→ three fronts: narrow-int axis reductions (no widening-SIMD), linear algebra (no BLAS), sort.

Per-dtype @10m (geomean):

int64 0.91  uint64 0.92  f32 0.93  f64 0.98  uint8 1.00  uint32 0.99   ◄ strong
int32 1.11  int16 1.14   uint16 1.24   bool 1.60                       ◄ weak (bool, narrow-uint)

Dtype coverage: 10 dtypes compared vs NumPy; char/decimal measured but have no NumPy peer (C#-only). SByte/Half/Complex were uncovered and have since been added to the benchmark code (48e85528) — the next run produces the full 15-dtype matrix.


Reproducibility

  • Reusable cross-platform runner: python benchmark/run_benchmark.py (builds C#, runs BDN per-suite, sweeps NumPy at 3 sizes, merges, archives).
  • Full report: benchmark/benchmark-report.md (1,311 rows).
  • Provenance snapshot keyed by date+hash: benchmark/history/2026-06-05_6038990f/ (manifest + report + NumPy timings).

Nucs added 8 commits June 13, 2026 20:37
…m the changelog

Per review: the changelog should describe the final state, not the
development path. Removed the temporal 'Latest wave (Waves 1.3–6.1) —
added after the first changelog' umbrella section entirely and dissolved
its content into the proper topical sections, with all 'wave' terminology
and 'added after'/'previously absent'/'now reachable' path-language gone:

- np.evaluate folded into §2 (NDExpr DSL): per-node result_type typing,
  fused reductions, out= rules, EXTERNAL_LOOP guard, measured speedups.
- out=/where=/dtype= ufunc kwargs folded into §5 as a parity subsection.
- WRITEMASKED/ARRAYMASK execution, VIRTUAL operands, and the size-1
  stride-0 / op_axes-OOB / write-broadcast / PARALLEL_SAFE / unit-axis
  fixes folded into §1 (capability matrix + bug list); masked-write
  corruption fix added to §10.
- buffer-pool window (1 B–64 MiB), pool-side GC pressure, finalizer
  suppression folded into §7; TL;DR memory bullet updated.
- canonical NDIter benchmark, benchmark.yml CI, DocFX benchmark pages,
  and the honest frontier findings folded into §8/§15.
- 'NPYITER_GAPS_AND_ROADMAP … 6-wave plan' -> 'prioritized roadmap'.

Net: zero 'wave' occurrences; the 16-section topical structure is intact.
Same content (minus H1) pushed live to the PR #611 description.
… stat

Per updated direction: the ratio convention is NumPy ÷ NumSharp again (>1.0× = NumSharp
faster — bars grow right = faster, the original visual), AND every row now also carries
🕐 %NumPy = (NumSharp ÷ NumPy) × 100 = the share of NumPy's time NumSharp uses. So a win
reads two intuitive ways: "12.63× faster" and "🕐 8%" (takes only 8% of the time NumPy
would); parity is 🕐 100%; >100% is slower. Huge slowdowns compact to e.g. 🕐 881×NP.

render_dashboard.py:
* r["sp"] = numpy/numsharp (speedup), r["pct"] = numsharp/numpy*100 (share of NumPy time).
* headline + every bar/table show both: HEADLINE 0.74× geomean · 🕐 136%; by-suite e.g.
  statistics 2.28× 🕐 44%, reduction 1.21× 🕐 83%, creation 0.35× 🕐 283%; FASTEST nansum
  12.63× 🕐 8%; SLOWEST np.zeros 0.001× 🕐 881×NP.
* status-mix bands relabelled in %NumPy terms (faster ≤100% / close 100–200% / slower
  200–500% / much >500%), a legend line explains the 🕐 stat, pct_str() keeps the column
  narrow (NN% under 1000%, else NN×NP).

benchmarks-dashboard.md re-seeded with the matching note (heredoc — printf mis-read %NumPy
as a format spec); docfx build clean, emoji verified present (U+1F550 ×54).

Supersedes the brief NS/NP experiment (c0a5346). The op-matrix report (merge-results.py)
still uses NS/NP "lower is better", and the nditer sheet / cards use NP/NS without the
%NumPy stat — rolling the NP/NS + 🕐 %NumPy convention out to those is the next step,
pending confirmation.
Completes the rollout chosen after the dashboard fix: every benchmark surface now uses the
SAME convention — speedup = NumPy ÷ NumSharp (>1.0× = NumSharp faster) — and every surface
also carries 🕐 %NumPy = (NumSharp ÷ NumPy) × 100 = the share of NumPy's time NumSharp uses
(30% = takes only 30% of the time NumPy would; <100% = faster; huge slowdowns compact to
e.g. 880×NP). So a win reads two intuitive ways at once: "12.66× faster" and "🕐 8%".

Op-matrix report (merge-results.py) — FLIPPED from NS/NP to NP/NS (the one surface that was
"lower is better"):
* ratio = numpy_ms / numsharp_ms; new pct_numpy field on UnifiedResult (JSON + CSV).
* get_status bands inverted around >1 = faster (faster ≥1.0× / close 0.5–1.0× / slower
  0.2–0.5× / much <0.2×); classify() credibility gate flips to ratio > 20 (was < 1/20).
* Best/Worst now sort DESCENDING (fastest first); legend + tables + summary-by-size gain a
  🕐 %NumPy column; ratio_fmt keeps tiny slowdowns readable (0.001× not 0.00×).
* Regenerated from the on-disk run archive: Top Best nansum 12.66× 🕐 8%; Top Worst
  np.zeros 0.001× 🕐 880×NP; searchsorted stays negligible (now ratio>20). Counts
  unchanged (305/255/169/103/275/126) — same rows, just the direction relabelled.

nditer sheet (nditer_sheet.py) + cards (nditer_cards.py) — already NP/NS, ADDED 🕐 %NumPy:
* sheet: legend line + per-bar 🕐 %NumPy + headline "1.17× geomean · 🕐 85% of NumPy's time";
  re-rendered nditer_results.md (--render-only, AV block intact).
* cards: each bar label now "1.80× · 56%" (ops) / "4.3× · 23%" (dividends); footer explains
  the %. No emoji in matplotlib (DejaVu lacks the glyph) — the % carries it. Re-rendered.

Narrative benchmarks.md + README — already NP/NS, added the 🕐 %NumPy line to the convention
block, a %NumPy column to the by-class table, and a caption sentence.

DocFX pages (benchmark-matrix.md, benchmark-iterator.md) re-seeded from the regenerated
report + sheet; benchmarks.md updated; docfx build clean (0 errors). The dashboard
(render_dashboard.py / benchmarks-dashboard.md) already carries this convention (49af3af),
so the whole benchmark stack — report, dashboard, iterator sheet, cards, narrative, README —
is now identical: NumPy ÷ NumSharp speedup + 🕐 %NumPy.
The clock sat before the figure with the right-align padding landing between them
("🕐  87%"). Moved it to immediately follow the percentage, no space — "87%🕐" — across
every surface, and likewise the metric name (🕐 %NumPy → %NumPy🕐). The alignment padding
now sits before the number (where it belongs) instead of after the emoji.

* render_dashboard.py / nditer_sheet.py: bar values "{pct_str}🕐", headline "85%🕐 of
  NumPy's time", legend "%NumPy🕐 = …". Dashboard + sheet regenerated.
* merge-results.py: report legend, status-band table, summary-by-size "%NP🕐" column,
  Best/Worst note, and per-suite "%NumPy🕐" column headers. Report regenerated.
* benchmarks.md + README: convention line / table column / caption "%NumPy🕐".
* DocFX pages (matrix, iterator, dashboard) re-seeded; dashboard page note "%NumPy🕐".
  docfx build clean.

The matplotlib cards are unaffected (they show "1.80× · 56%" without the emoji — DejaVu
has no clock glyph — so there was never a gap to fix there).
… form

pct_str (dashboard/sheet) and pct_fmt (report) switched to a ×-multiplier form for huge
slowdowns (np.zeros etc.), so the %NumPy stat showed "880×NP🕐" / "880×" — breaking the
NN%🕐 depiction the column promises. Now they always render a percentage: np.zeros reads
"87957%" (report) / "88087%🕐" (dashboard) = takes ~880× as long, stated as a share of
NumPy's time like every other cell.

The ratio column is untouched — it legitimately uses × (0.001×, 12.65×); only the %NumPy
formatters changed. Report + sheet + dashboard regenerated, the three DocFX pages re-seeded,
docfx build clean.
…g from the report

The dashboard and benchmark-report.md disagreed on the SAME cell: np.nansum(f64,100K)
read 12.63× on the dashboard vs 12.65× in the report, np.zeros(i64,10M) read 88087% vs
87957%, quantile/percentile likewise — 161 rows printed a different ratio at 2dp between
the two committed surfaces.

Root cause: merge-results.py computes ratio = NumPy/NumSharp and pct_numpy from the
FULL-PRECISION means, then stores numpy_ms/numsharp_ms rounded to 4dp. render_dashboard.py
ignored the stored ratio/pct_numpy fields and RE-DIVIDED the rounded ms (r["numpy_ms"] /
r["numsharp_ms"]), so every row where the 4dp rounding moved a digit drifted from the
report. The report is correct (true ratio of the measured means); the dashboard was a
rounding artifact of its own recompute.

Fix: the credible loop now consumes r["ratio"] / r["pct_numpy"] straight from the JSON
(the same numbers benchmark-report.md prints), falling back to 100/ratio only if pct is
absent. Dashboard and report now agree cell-for-cell, and the per-suite/per-dtype geomeans
key off the same stored ratios the report's Summary-by-size uses.

Regenerated benchmark-dashboard.md (gitignored) and re-seeded the DocFX dashboard page;
header preserved, body refreshed. Verified: nansum 12.65×/8%, zeros 0.001×/87957%,
quantile 9.89×/10% identical on both surfaces; size tiers match Summary-by-size exactly.
…not run" cells

normalize_op_name dropped measured C# data on the floor whenever the C# benchmark label
and the NumPy suite name differed only cosmetically, so the report showed ⚪ "C# benchmark
not run" for ops that WERE run. Three archive-safe alias passes (applied identically to
both sides, so they only ever merge a true pair):

  * empty "()"  — a no-arg C# method call "a.flatten()" now meets NumPy's "a.flatten"
  * "->" spacing — C# "reshape 2D -> 1D" now meets NumPy's "reshape 2D->1D"
  * np.around    — IS np.round (NumPy alias); C# benchmarks rounding as np.around, NumPy
                   emits np.round, so the whole np.round family was ⚪ despite real data

Effect (re-merged from the same archive — no re-run): ⚪ no-data 126 → 116; the np.round
family gains 6 real rows (float32/float64 × 3 sizes), a.flatten +2 (100K/10M), reshape
2D->1D +2. Verified against the archive before editing: +10 joined cells, 0 regressions
(no previously-matched cell lost), 0 new key collisions.

Regenerated benchmark-report.{md,json,csv} + the dashboard (now 840 credible cells,
0.73× geomean) and re-seeded the matrix + dashboard DocFX pages (headers preserved
byte-for-byte). The dashboard stays cell-consistent with the report via the canonical
ratio/pct fix from the prior commit.

NOT fixed here (genuine gaps needing a benchmark re-run, not a name alias): np.prod has
no NumPy full-reduction row at all; isnan/isinf/isfinite/isclose/allclose/array_equal/
maximum/minimum have no C# benchmark; amax/amin/mean/std/var axis variants and np.mean
on uint*/int16 lack a counterpart on one side.
…lex (NumPy parity)

These six complex ufuncs previously threw NotSupportedException from the
EmitUnaryComplexOperation default arm, even though NumPy 2.x has complex
loops for all of them (csinh/ccosh/ctanh/casin/cacos/catan). This wires
them up with full NumPy 2.4.2 parity.

Approach (hybrid BCL + C99 fixups, mirroring the existing abs/log2/exp2
pattern): a bit-exact probe over a finite battery showed System.Numerics.
Complex matches NumPy to a few ULP on the finite interior, but diverges at
86/360 edge components -- it returns (NaN,NaN) for nearly all inf/NaN inputs
instead of the C99 Annex G values, drops the sign of zero on branch cuts,
and mishandles arctan's imaginary-axis cut. So:

- NDComplexMath.{Sinh,Cosh,Tanh,Asin,Acos,Atan} delegate the finite
  interior to the BCL and add the C99 fixups:
  * Non-finite inputs: special-value tables ported from NumPy's msun
    npy_csinh/ccosh/ctanh, with asin/atan reusing NumPy's own identities
    asin(z)=i*conj(casinh(i*conj z)) and atan(z)=i*conj(catanh(i*conj z)).
  * Branch-cut/signed-zero fixups (empirically derived against NumPy and
    verified on a 64-point signed-zero grid): asin negates Re on x=-0 and
    Im on y=-0; acos negates Im on the y=+0 cut; atan sets
    Re=copysign(|y|>1?pi/2:0, x) on the imaginary axis and negates Im on y=-0.
  * Where this NumPy build's system libm diverges from msun at infinities
    (sign-preserving sinh(-inf+i*inf).re, cosh's even-function +inf*sin(y)
    imaginary part, tanh's sign(y) zero, and the genuinely-unspecified
    zero signs), the helpers match the observed NumPy 2.4.2 output.
- DirectILKernelGenerator: register CachedMethods.Complex{Sinh,Cosh,Tanh,
  Asin,Acos,Atan} (pointing at NDComplexMath, not Complex.* directly) and
  add the six cases to EmitUnaryComplexOperation.

Verification: a bit-exact harness over a 117-point battery (finite + signed
zeros + branch cuts + inf/NaN) plus a 64-point grid, diffed against NumPy
2.4.2, gives 1402/1404 components matching (1249 bit-exact, 153 within
<=3 ULP). The only 2 residuals are arctan's finite interior (1e-10 tiny
input ~8e-8 rel; 100+100j at 3 ULP) -- .NET's Atan kernel is less accurate
than NumPy's log1p-based one; an accepted, documented divergence.

Tests:
- NewDtypesUnaryTests: 9 NumPy-verified cases covering interior, branch
  cuts, signed zeros, and C99 special values.
- Fuzz/MisalignedRegistry: the stale "complex kernel throws" excuse is
  corrected to Half-only; complex sinh/cosh/tanh/arcsin/arccos are now held
  to a tight 4-ULP gate (a real regression fails) instead of the blanket
  complex-unary excuse; arctan stays under the documented blanket for its
  accepted BCL-interior divergence.

All 609 Fuzz + NewDtypes tests pass (net10.0); the 26x5 complex corpus
cases for the five tightly-gated ops are all within 4 ULP.
Nucs added a commit that referenced this pull request Jun 13, 2026
…e nditer branch

Replaces the stale PR description (written ~64 commits in, +50k lines) with a
complete changelog of everything between the #612 merge-base (5eedb81) and
HEAD: 272 commits, 519 files, +198,407/-16,069 per the GitHub compare.

Compiled via a two-pass audit:
- Pass 1: every commit subject+body mined for features, perf numbers, and
  breaking changes; APIs/CI/benchmark/corpus facts verified against the live
  tree (test counts, fuzz corpus wc, Direct partial count, NpyIter LOC).
- Pass 2: all 279 local commits re-walked against the draft. Caught and
  fixed: np.median/percentile/quantile/average/ptp/tile did NOT exist on
  master (verified via git grep origin/master) — reclassified from 'rebuilt'
  to new, raising the new-API count 22 -> 30; removed an unverifiable test
  count; added the 15-dtype hot-path parity item (786d705) and the
  DefaultEngine->NpyIter Tier-3B production routing.

Scope note: SByte/Half/Complex + DateTime64 + casting rounds are PR #612
(already on master) and are intentionally excluded; the local master ref is
stale, which is why master..HEAD misleadingly shows 339 commits.

The same content (minus the H1) is now the live PR #611 description, pushed
via REST PATCH (gh pr edit requires read:org scope the token lacks).
Nucs added a commit that referenced this pull request Jun 13, 2026
… 1.3–6.1)

Branch advanced 31 substantive commits past the first changelog (which
described through 33058b8). The branch was rebased meanwhile — the original
changelog commit bb7ed7a8 is orphaned, its twin is 4140f4d, and 33058b8
remains an ancestor of HEAD, so 33058b8..HEAD is the true new-work boundary.

Learned and folded in:
- np.evaluate — Tier-3C fusion made public; per-node NumPy result_type typing
  (fixes the mixed-tree dtype bug: i4*i4+f8 must wrap in int32 first), fused
  reductions, EXTERNAL_LOOP guard, out= via ufunc rules. 3.2–6.1x vs NumPy.
- out=/where=/dtype= across the elementwise ufunc API (binary, unary-math,
  comparisons, predicates, bitwise, invert, arctan2) — one NumPy-shaped
  overload each, exact broadcast/cast/error-text semantics.
- New at np.*: bitwise_and/or/xor (were operator-only, CS0117) and positive.
- nditer: WRITEMASKED/ARRAYMASK execution + VIRTUAL operands (was silent
  masked-write corruption); Wave-1.4 fixes (size-1 stride-0 invariant,
  op_axes OOB, write-broadcast validation, PARALLEL_SAFE, unit-axis absorb).
- Alloc Wave 2.4: buffer-pool window 4KiB–1MiB -> 1B–64MiB, pool-side GC
  pressure, finalizer suppression.
- Canonical NpyIter benchmark suite + post-release benchmark.yml CI +
  DocFX Benchmarks-vs-NumPy website pages; honest frontier findings recorded
  (broadcast-reduce 54x, scalar np.any 14.5x, BUFFERED+REDUCE ForEach P0
  crash, parallel banding 4.7x win).

Stats refreshed: 272/519/+198k -> 312 commits, 615 files, +217,949/-16,402.
Tests: 9,447 -> 9,709 passed/0 failed (net10.0). New-API count 30 -> 35.
Same content (minus H1) pushed live to the PR #611 description via REST PATCH.
Nucs added a commit that referenced this pull request Jun 13, 2026
…m the changelog

Per review: the changelog should describe the final state, not the
development path. Removed the temporal 'Latest wave (Waves 1.3–6.1) —
added after the first changelog' umbrella section entirely and dissolved
its content into the proper topical sections, with all 'wave' terminology
and 'added after'/'previously absent'/'now reachable' path-language gone:

- np.evaluate folded into §2 (NpyExpr DSL): per-node result_type typing,
  fused reductions, out= rules, EXTERNAL_LOOP guard, measured speedups.
- out=/where=/dtype= ufunc kwargs folded into §5 as a parity subsection.
- WRITEMASKED/ARRAYMASK execution, VIRTUAL operands, and the size-1
  stride-0 / op_axes-OOB / write-broadcast / PARALLEL_SAFE / unit-axis
  fixes folded into §1 (capability matrix + bug list); masked-write
  corruption fix added to §10.
- buffer-pool window (1 B–64 MiB), pool-side GC pressure, finalizer
  suppression folded into §7; TL;DR memory bullet updated.
- canonical NpyIter benchmark, benchmark.yml CI, DocFX benchmark pages,
  and the honest frontier findings folded into §8/§15.
- 'NPYITER_GAPS_AND_ROADMAP … 6-wave plan' -> 'prioritized roadmap'.

Net: zero 'wave' occurrences; the 16-section topical structure is intact.
Same content (minus H1) pushed live to the PR #611 description.
Nucs added 15 commits June 13, 2026 23:13
…ndentals

Adds AggressiveInlining/AggressiveOptimization to the complex hyperbolic and
inverse-trig helpers and restructures them into a hot/cold split, so the JIT
folds the per-element math into the IL-emitted unary kernel without a call
frame:

- Sinh/Cosh/Tanh/Asin/Acos/Atan (+ Abs and the tiny IsNegZero/IsPosZero/
  HypotInf/ClogLarge helpers) are marked AggressiveInlining. Each public op
  is now a tiny finite-path wrapper (finite check -> Complex.* + fixups, or a
  cold-helper call) so it fits the inliner's budget.
- The non-finite C99 special-value tables move into cold helpers
  (SinhSpecial/CoshSpecial/TanhSpecial/CasinhNonFinite/CacosNonFinite/
  CatanhNonFinite) marked AggressiveOptimization -- kept out-of-line (so the
  hot wrapper stays inlineable) and fully optimized when actually hit.
  Behavior is identical to the prior inline form (verified below).

IL-inlining experiment (the "emit the formula instead of call" question):
benchmarked complex sinh both ways over 4M finite elements, median of 15 reps.
The real-decomposition formula (Math.Sinh(x)*Math.Cos(y), Math.Cosh(x)*Math.
Sin(y)) is bit-identical to Complex.Sinh (0/4M mismatches) but only 1.15x
faster than the call; cosh 1.06x; asin/acos/atan have no real-Math.* formula
(dominated by complex log/sqrt) so inlining would only drop a wrapper frame.
The per-element cost is dominated by the transcendental itself, so emitting
~6 hand-written IL formulas is not worth the duplication/risk -- especially as
the call-based kernel is already ~1.56x faster than NumPy 2.4.2 (np.sinh:
26.1 ns/elem vs NumPy 40.9). Decision: keep the handwritten methods; the
inlining attributes capture the (small, safe) wrapper-elimination gain.

Verified: NewDtypesUnaryTests + Fuzz UnaryExtra (4-ULP complex gate) green
(62/62); the hot/cold split changes no results.
…exponential

The np.allclose / np.random.exponential working-set leak guards
(np.allclose.UsingTests, np.random.exponential.UsingTests) failed in CI with
hundreds of MB of working-set growth (e.g. 551 MB on Linux, threshold 20 MB),
while passing on Windows. Root cause: both functions allocate several
NDArray intermediates per call and never dispose them — the unmanaged buffers
ride the finalizer queue instead of being released synchronously. In a tight
loop the managed wrappers are tiny so the GC rarely runs, leaving the
intermediates LIVE between collections; the allocator can't reuse that memory,
so the high-water mark balloons. On glibc (Linux) freed pages are retained in
the arena, so the process RSS stays high even after the test's final
GC.Collect()+WaitForPendingFinalizers() — hence the large WorkingSet64 delta.

np.isclose: |a-b| <= atol + rtol*|b| materialized ~5 float64 temps (≈400 KB
each at 50K elements) plus several bool temps, none disposed. Wrapped every
fresh allocation in `using` (the elementwise operators/ufuncs each return a new
array). x/y come from astype(copy:false), which returns the input itself when
no conversion is needed, so they are caller-owned and never disposed here. The
final combined array is captured in `using` too: MakeGeneric<bool>() takes its
own refcount on the shared buffer, so disposing the backing temp on return
keeps the result alive while keeping that last buffer off the finalizer queue.

np.random.exponential: β·(-log(1-U)) left uniform, (1-U) and negate()
intermediates un-disposed (only the log result was released). Now disposes all
of them; only the trailing `* scale` allocates the fresh array returned to the
caller.

Effect (measured, peak WorkingSet64 growth across a 1000-iter no-GC loop —
the CI failure mode): allclose 551 MB -> 3 MB, exponential -> 1 MB. Behavior
is unchanged: full suite green on net8.0 and net10.0 (9718 passed / 0 failed
under the CI filter), including the Logic fuzz corpus and the isclose/allclose/
exponential unit tests.
…ity with NumPy 2.4.2)

The complex (complex128) overloads of the unary math ops deferred to System.Numerics.Complex
for their finite interior. The BCL transcendentals diverge from NumPy on a wide range of
edge inputs — large magnitudes, the unit circle, tiny/subnormal values, branch cuts and
signed zeros — because they do NOT implement the careful FreeBSD msun algorithms NumPy uses.
This replaces those deferrals with direct ports of NumPy's own routines in NDComplexMath,
verified by a 504-point bit-exact sweep (Python struct-packed int64 references) classifying
every result as exact / <=3 ULP / signed-zero / special / sign-flip.

Result: 18 of 20 complex unary ops are now at full parity (0 divergence beyond <=3 ULP):
  exp, log, log10, log2, log1p, expm1, exp2, sin, cos, tan, sqrt, square, reciprocal,
  negative, sinh, cosh, tanh, arcsin, arctan.

Algorithms ported / fixes (src/NumSharp.Core/Utilities/NDComplexMath.cs):

- Log (npy_clog): real part = log|z| with the four-regime rescale — |z| huge (x2 down),
  subnormal (x2^53 up), near the unit circle (0.71<=|z|<=1.73 uses 0.5*log1p((m-1)(m+1)+n^2)
  via a Goldberg MathLog1p), and 0. Complex.Log cancels the real part to 0 near |z|=1
  (e.g. log(1+1e-10 i).real must be 5e-21, not 0). ComplexLog is repointed here, so np.log,
  np.log2 and np.log10 all inherit the accuracy.

- Tanh (npy_ctanh): Kahan's algorithm (t=tan(y); beta=1+t^2; s=sinh(x); rho=sqrt(1+s^2);
  tanh=(beta*rho*s + i t)/(1+beta*s^2)) plus the |x|>=22 overflow-safe branch. The BCL
  Complex.Tanh drifts ~33 ULP (tan(1.5) through the tan(z)=-i*tanh(iz) identity).

- Sin/Cos/Tan: now route ALWAYS through Sinh/Cosh/Tanh, exactly as NumPy defines
  npy_csin/ccos/ctan (= -i*sinh(iz) / cosh(iz) / -i*tanh(iz)), so they match NumPy bit-for-bit
  instead of only on the BCL's finite interior. Fixes sin(0+1e300 i).real = NaN (BCL did
  cosh(huge)*0); the Sinh/Cosh y==0 guard returns (sinh(x), y)/(cosh(x), x*y) so a large real
  no longer yields inf*0 = NaN.

- Expm1 (nc_expm1): real = expm1(x)*cos(y) - 2 sin^2(y/2), imag = exp(x)*sin(y); the real
  expm1 fallback uses the Goldberg identity (e^x-1)*x/log(e^x) which recovers the ~10 digits
  exp(x)-1 cancels and avoids underflow (expm1(1e-300)=1e-300, not 0). Fixes the non-finite
  imaginary (expm1(+Inf+0i).imag = exp(+Inf)*sin(0) = NaN) and origin signed zeros.

- Square (z*z with FMA contraction): (fma(re,re,-(im*im)), fma(re,im,im*re)). NumPy's complex
  multiply is FMA-contracted, so square(1e-10+1e-10 i).real = -2.275e-37 (exact re^2 minus
  rounded im^2) and square(1e300+1e300 i).real = -inf; Complex.op_Multiply (no FMA) returned
  0 and NaN.

- Atan (npy_catanh, full): atanh(x) on the real axis, atan(y) on the imaginary axis, and the
  log1p(4|x|/sumsq(|x|-1,|y|))/4 interior, plus _sum_squares and an exponent-classified
  _real_part_reciprocal (raw biased-exponent field, NOT Math.ILogB which maps 0/Inf to
  int.MinValue/MaxValue and overflows the subtraction). Complex.Atan cancelled / underflowed
  the tiny imaginary part (arctan(0+1e-10 i).imag must be 1e-10).

- Exp (npy_cexp): exp(-Inf + I(Inf|NaN)).imag = copysign(0, y) so exp(-inf-inf i).imag = -0
  (the system libm keeps sign(y); npy_cexp's flat (0,0) dropped it). exp2 inherits this.

- Reciprocal already used Smith's nc_recip (overflow-safe, correct signed zeros).

Engine wiring (DirectILKernelGenerator[.Unary.Decimal].cs): ComplexLog repointed to
NDComplexMath.Log; new cached methods ComplexExpm1 and ComplexSquare; the Expm1 and Square
cases in EmitUnaryComplexOperation now call the ported helpers instead of inline
Complex.Exp(z)-1 / Complex.op_Multiply.

Accepted residuals (pathological inputs only, documented in code + the fuzz registry):
- cos/sin with a NaN imaginary part: the resulting zero's sign is C99-UNSPECIFIED; the
  platform libm and the npy_ccos identity pick opposite signs (2 cases).
- arccos with a sub-DBL_MIN imaginary part: Complex.Acos flushes the denormal real part to 0
  where cacos's _do_hard_work keeps it (~5.8e-309); a denormal-range edge (4 cases).
- sinh/cosh at the overflow boundary |x| in [710, 710.13]: Windows' CRT sinh overflows to inf
  while .NET Math.Sinh stays finite (a platform-libm boundary, absent on glibc).

Tests: NewDtypesUnaryTests.cs adds 11 NumPy-2.4.2-verified cases for the huge-imaginary
sin/cos, large-real sinh/cosh overflow, Kahan tan accuracy, near-unit-circle clog, scaled
log10/log2/log1p, Goldberg expm1, exp2, FMA square, reciprocal signed zeros, catanh tiny/large
arctan, and the exp -inf signed zero. Fuzz/MisalignedRegistry.cs tightens the complex-unary
gate to <=3 ULP across the whole set (was a 4-ULP gate on 5 ops + a blanket excuse for the
rest), narrows the >3-ULP excuse to the named pathological ops, and adds a separate entry for
the (pre-existing) complex reduction/scan NaN-ordering divergence the old blanket covered.

Full CI-style suite (net10.0, exclude OpenBugs/HighMemory): 9729 passed, 0 failed. net8.0 +
net10.0 both build clean.
…-run

Adds the benchmark definitions that were missing on one side of the op-matrix join (so the
ops showed ⚪ "not run" or were discarded as C#-only), then re-runs the whole official suite
(all 14 comparison suites x 3 cache tiers, ~3h) to fill them in with live numbers. Result:
⚪ no-data 130 -> 76, and the headline moves from a stale 0.74x to 1.08x geomean (93%🕐)
over 1386 credible cells — the stale figure was dragged down by a broken searchsorted and by
simply missing most of NumSharp's fast reductions.

NumPy side (numpy_benchmark.py) — C# already benchmarked these; NumPy didn't:
  * unary: np.tan, np.exp2, np.expm1, np.log2, np.log1p, np.clip(a,-10,10),
    np.power(a,2|3|0.5)
  * reduction: np.cumsum (all arithmetic dtypes), np.prod + np.prod axis=0/1, and the axis
    variants np.amax/np.amin/np.mean axis=0(/1) and np.var/np.std axis=0
  All names normalize to the existing C# [Benchmark(Description=...)] so they join 1:1.

C# side:
  * ProdBenchmarks: was non-standard sizes (100/1000/10000) + method-form names (a.prod());
    nothing could join it. Switched to the standard Small/Medium/Large tiers and function-form
    np.prod(a)/np.prod(a, axis=k) — values stay in [0.5,1.0] so the product is overflow-safe at
    every size. prod now has full + axis coverage (18 cells).
  * MeanBenchmarks: CommonTypes -> ArithmeticTypes, closing the np.mean uint*/int16 ⚪ holes
    (15 cells) — matches SumBenchmarks/MinMaxBenchmarks.
  * LogicBenchmarks: isnan/isinf/isfinite/maximum/minimum/array_equal now join (54 cells).

Verified on the fresh run: searchsorted is purged of the 0.0000ms / >1e6x rows (now real,
1.16-1.44x faster), prod/cumsum/all axis reductions/the 6 predicates/mean-on-uint* all matched.
Regenerated benchmark-report.{md,json,csv} + dashboard and re-seeded the matrix + dashboard
DocFX pages.

KNOWN BUG surfaced (left as ⚪): np.isclose and np.allclose DETERMINISTICALLY segfault NumSharp
with the unmanaged-storage AccessViolation — each crashes even run alone, and in-class it killed
the whole logic suite before BenchmarkDotNet could export anything (took the 6 working predicates
down with it). Disabled both in LogicBenchmarks with a documented note; re-enable once the
NumSharp isclose/allclose lifetime bug is fixed. The 6 predicates were recovered by running each
in its own process (the same per-section isolation the NDIter harness uses for its AV).
…segfault) + doc review

Parity review of the complex unary math overloads (commit 416affc). Verified all 20 affected
ops across memory layouts (contiguous / F-contiguous / strided / transposed / both negative-stride
directions / sliced-offset / broadcast / 0-d / empty) with a fresh bit-exact sweep — every op is
layout-correct with 0 divergence — and confirmed the out=/where= ufunc parameters compose
bit-exactly with the new complex kernels (exp returns the same out instance; sqrt's where=mask
preserves masked-off slots).

The review surfaced a pre-existing MEMORY-SAFETY bug (segfault), now fixed:

  np.exp(complex_array, dtype=float64)   # and sqrt/log/log2/log10/log1p/expm1/exp2/sin/cos/tan/
                                         # sinh/cosh/tanh/arcsin/arccos/arctan

  segfaulted instead of raising. Root cause: ResolveUnaryFloatReturnType honored an explicit
  dtype= override after only rejecting integer/bool targets (over < Single -> "No loop matching").
  It never checked that the INPUT can reach the requested loop dtype by a same_kind cast. For a
  complex input + real-float dtype=, it returned the real type, ExecuteUnaryOp allocated an
  8-byte/element output buffer, and the 16-byte/element complex kernel overran it.

  NumPy 2.4.2 raises instead:
    "Cannot cast ufunc 'exp' input from dtype('complex128') to dtype('float64') with casting
     rule 'same_kind'"

  Fix: ResolveUnaryFloatReturnType now calls the existing ValidateUnaryInputCast (already used by
  square/reciprocal/negative, which were NOT affected) on the override path. This reuses
  NDIterCasting.CanCast(SAME_KIND), so it allows the legal narrowings (int->float32, float64->
  float32, float->complex) unchanged and rejects only the cross-kind complex->real cast, emitting
  NumPy's verbatim message. Probe matrix (complex/float/int inputs x float/complex/int dtype=)
  now matches NumPy across all 17 float-producing complex ufuncs; the order is preserved (integer
  dtype= still raises "No loop matching" before the cast check).

Also refreshes the NDComplexMath class doc comment, which still described the old fork state
("sinh/cosh/tanh/asin/acos/atan delegate straight to System.Numerics.Complex", "arctan's BCL
interior is the lone documented divergence") — it now lists the actual ported algorithms (npy_clog,
Kahan ctanh, csinh/ccosh, npy_catanh, npy_cexp/csqrt, nc_expm1/Goldberg, FMA square, nc_recip),
the two ops still delegating (asin/acos at parity), and the three accepted pathological residuals.

Tests: NewDtypesUnaryTests.cs adds Complex_FloatUfunc_NarrowingDtype_RaisesCastError_NotSegfault
(exp/log/sqrt/sin/tanh/arctan: complex+dtype=float64 raises the verbatim cast error, complex+
dtype=int64 raises "No loop matching", complex+dtype=complex128 returns complex). Full CI-style
suite (net10.0, exclude OpenBugs/HighMemory): 9730 passed, 0 failed. net8.0 + net10.0 build clean.

Note: ceil/floor/round/trunc on complex reject cleanly (no segfault) but with NumSharp's own
message rather than NumPy's "ufunc not supported for the input types" — left as-is (out of scope;
NumPy has no complex loop for them either). The int->exp2 InvalidProgramException (Single-output
kernel) remains a separate, already-tracked bug (fuzz registry W3-C), unrelated to complex.
…o (match/beat NumPy 2.4.2)

np.zeros was ~1000x slower than NumPy for large arrays (10M float64: 14.3 ms
vs NumPy 0.011 ms). Root cause: it allocated an uninitialized buffer and then
ran an eager per-element Fill loop that touched (and zeroed) every byte. NumPy
instead delegates zeroing to the OS: PyDataMem_NEW_ZEROED -> calloc, whose
demand-zero pages are committed and zeroed lazily on first write, so allocating
zeros is effectively O(1) regardless of size (numpy/_core/src/multiarray/alloc.c
npy_alloc_cache_zero: small sizes use a cache+memset, large sizes calloc).

This ports NumPy's structure. The zeroing is now done by the allocator/OS, never
an element loop — correct for all 15 dtypes because the all-zero bit pattern
equals default(T) for every one of them (incl. Half, Single, Double, Decimal,
Complex).

Implementation
--------------
- ArraySlice.Allocate(..., fillDefault: true) and Allocate<T>(..., true) now
  route to UnmanagedMemoryBlock<T>.AllocateZeroed instead of
  `new UnmanagedMemoryBlock<T>(count, default)` (Take + scalar Fill). All
  np.zeros overloads, np.zeros_like, np.eye/np.identity, and every internal
  fill-with-default allocation flow through here.
- SizeBucketedBufferPool.TakeZeroed: NativeMemory.AllocZeroed (calloc) with no
  dirty-bucket reuse — a recycled buffer would force a full memset, discarding
  the lazy demand-zero win for large sizes and being no cheaper than calloc for
  small ones.
- OsVirtualMemory (new, Windows-only): the Windows process heap eager-commits
  and memsets mid-size calloc requests (~256 KiB-2 MiB, ~0.05 ms for 800 KiB),
  unlike glibc/macOS which mmap large blocks lazily. For >= 128 KiB on Windows
  AllocateZeroed uses VirtualAlloc(MEM_COMMIT) (copy-on-write zero pages,
  ~0.002 ms) and a new Disposer AllocationType.Virtual that releases straight to
  the OS via VirtualFree (not pooled). Non-Windows and small sizes stay on
  calloc, which is already lazy/cheap there.

Benchmark fix (pre-existing bug)
--------------------------------
CreationBenchmarks returned each created array without disposing, leaking one
buffer per op. NumPy's harness (numpy_benchmark.py) discards each result inside
the timed loop, so CPython refcount frees it immediately — i.e. NumPy measures
alloc+free while the C# benchmark measured alloc-only (unfair) and leaked. Under
BenchmarkDotNet's thousands-of-ops-per-iteration, every untouched-but-committed
buffer still charges Windows commit, so any fast creation op OOMs at 10M
(np.empty(10M) already did; the old np.zeros only escaped by being slow enough
to throttle BDN to a couple ops/iteration). All creation benchmarks now dispose
per op, matching NumPy and bounding resident memory.

Results (this machine, vs NumPy 2.4.2; BDN alloc+free)
------------------------------------------------------
- 10M float64: 14.3 ms -> 0.0033 ms (was ~1000x slower; now 3.1x faster)
- medium (100K): 1.7-3.8x faster across i32/i64/f32/f64
- large (10M): 1.1-3.5x faster across i32/i64/f32/f64
- small (1K): ~1.5-2x slower — bounded by NDArray object construction
  (NDArray/Storage/Shape/ArraySlice/Disposer), shared by all creation APIs and
  sub-microsecond; the allocation itself is optimal.

Tests
-----
New Creation/np.zeros.AllocationTests.cs (12 tests): all 15 dtypes zeroed across
heap/VirtualAlloc size regimes, full-scan of a multi-MB array, VirtualAlloc
writeability/commit correctness, OwnsData, non-aliasing, reuse-after-dispose,
multi-dim/high-rank/empty/sliced, default dtype, all overloads. Full CI suite
(net8.0 + net10.0, excluding OpenBugs/HighMemory) green: 0 failed, 9742 passed.
…gramException)

np.exp2 threw InvalidProgramException for EVERY call resolving to a float32 (Single) output —
int16/uint16/char/float32 inputs (via ResolveUnaryFloatReturnType) or any input with
dtype=float32. exp2->float64 and exp2->float16 were unaffected.

Root cause: DirectILKernelGenerator.Unary.Math.cs, EmitExp2Call, the type == Single branch
emitted one spurious instruction. Math.Pow takes (base, exponent), so the exponent must be
stashed in a local and pushed after the base. The Single branch pushed an EXTRA `Ldc_R8 2.0`
before the Stloc, so:

  Conv_R8            ; [x_d]
  Ldc_R8 2.0         ; [x_d, 2.0]      <- spurious push
  Stloc locExp       ; [x_d]           locExp = 2.0  (saved the constant, NOT the exponent)
  Ldc_R8 2.0         ; [x_d, 2.0]
  Ldloc locExp       ; [x_d, 2.0, 2.0]
  Call Math.Pow      ; [x_d, 4.0]      Pow(2,2)=4, pops 2
  Conv_R4            ; [x_d, 4.0f]     <- TWO values left on the stack

The body left a net +1 on the evaluation stack, so the DynamicMethod failed IL verification ->
InvalidProgramException at JIT time. (Latent second defect, masked by the crash: locExp held the
constant 2.0, so even a balanced version would have returned Pow(2,2)=4 for every element.)

Fix: delete the spurious push so the Single branch mirrors the (correct) Double branch wrapped in
Conv_R8/Conv_R4 — Conv_R8 -> save exponent -> push base 2.0 -> push exponent -> Pow -> Conv_R4.
exp2(float32[0,.5,1,-1,3]) now returns [1, 1.4142135, 2, 0.5, 8], bit-matching NumPy 2.4.2.

Around-the-bug audit (no new findings):
- Code: the decimal exp2 path (Unary.Decimal.cs) already saves the exponent correctly, and every
  other unary transcendental routes through EmitMathCall (MathF.X for Single — one arg, balanced),
  so exp2's Single branch was the only emitter with this idiom.
- Empirical: swept all 20 float-producing unary ufuncs x 12 input dtypes (240 cells) vs NumPy ->
  0 InvalidProgram crashes, 0 dtype mismatches; the only value differences are the pre-existing,
  accepted 1-float32-ULP transcendental gaps (exp/sin/cos/rad2deg, MathF vs NumPy's float32 libm),
  unrelated to exp2 (which is now exact).

Tests:
- NpApiOverloadTests_UnaryMath: removed [OpenBugs] from Exp2_WithType_Compiles /
  Exp2_WithNPTypeCode_Compiles (they reproduced this bug) and strengthened them to assert the 2^x
  values [1,2,4], not just no-crash (guards the latent locExp=2.0 wrong-value defect).
- NewDtypesUnaryTests: new "Exp2 Single-output" region — int16/uint16/char inputs, float32
  preserve (incl. fractional sqrt(2)), dtype=float32 override from int and double, a strided
  float32 view, and the unaffected float64/float16 branches.
- MisalignedRegistry: removed the W3-C excuse so any regression of the crash now fails the fuzz gate.

Verified: full CI-style suite (net10.0, exclude OpenBugs/HighMemory) 9749 passed / 0 failed;
exp2 tests pass on net8.0 + net10.0 (runtime IL emission checked on both JITs).
np.power with a float16 (Half) scalar exponent threw InvalidCastException on the scalar-broadcast
path — e.g. np.power(f16[2,3], f16(2)). NumPy 2.4.2 returns float16 ([4,9]).

Root cause: Default.Power.cs ReadScalarAsDouble read the size-1 exponent via
Convert.ToDouble(object). System.Half does not implement IConvertible, so the boxed Half threw
InvalidCastException — even though the scalar-exponent fast paths (ScalarEqualsExact) explicitly
list Half as supported. The throw fired from TryScalarExponentFastPath -> IsScalarValueZero ->
ScalarEqualsExact -> ReadScalarAsDouble for ANY power call whose exponent value is a Half.

Fix: ReadScalarAsDouble casts a boxed Half directly to double (Half is the only supported
non-IConvertible dtype that reaches it; Complex is excluded upstream by ScalarEqualsExact's default
and the float-only fast-path guard). With this, the fast paths resolve correctly:
  power(f16[2,3], f16 2)   -> [4, 9]      (exp=2 multiply)
  power(f16[2,3], f16 3)   -> [8, 27]     (general path)
  power(f16[4,9], f16 0.5) -> [2, 3]      (sqrt fast path)
  power(f16[2,4], f16 -1)  -> [0.5, 0.25] (reciprocal fast path)
  power(f16 2, f16[2,3,4]) -> [4, 8, 16]  (scalar-broadcast base)
all float16, bit-matching NumPy.

Tests: NewDtypesComparisonTests.Half_Power_HalfExponent_ScalarBroadcast covers the formerly-throwing
Half-exponent cells (the existing Half_Power used an int exponent, which is IConvertible and never
hit the bug). MisalignedRegistry: removed the now-dead W1-B "power(float16): Half scalar path
InvalidCast" excuse so a regression of the crash fails the fuzz gate; the binary_divmod_power corpus
tally no longer lists it (was 3x). The kept power excuses are separate bugs: W1-F power(*,float16)
dtype widening and W1-C power(uint64,int64) negative-exponent throw.

Verified: power/Half/Fuzz suites 188 passed / 0 failed (net10.0); net8.0 + net10.0 build clean.
This is the documented NEP50 weak-scalar behavior for an INT exponent (power(f16, int_scalar)->f16
vs NumPy float64) is unchanged and out of scope (registry entry #1).
… (W11-A, W6-D)

np.maximum / np.minimum (and np.clip) must propagate NaN per NumPy 2.4.2 — the
result is NaN whenever EITHER operand is NaN. NumSharp routes all three through
the clip SIMD kernel, whose hardware MAXPS/MINPD intrinsics (Avx.Max / Avx.Min,
plus the Sse/Avx512 equivalents) return the SECOND operand on an unordered (NaN)
compare, silently dropping the NaN on every full SIMD vector. The scalar tail
(Math.Max/Min, HalfMaxNaN, ComplexMaxNaN) already propagated, so the bug only
surfaced once an array reached the vector width; small arrays (and earlier
size-2 spot checks) passed by luck.

Reproduction (float64, >= one V256<double> = 4 lanes):
  np.maximum([1,nan,3,nan,5,nan,7,8], [9,2,nan,4,nan,6,nan,nan])
    before: [9, 2, NaN, 4, NaN, 6, NaN, NaN]            (a==NaN lanes return b)
    after : [9, NaN, NaN, NaN, NaN, NaN, NaN, NaN]      == NumPy
  np.clip([nan,1,2,3,4,5,6,7], 2, 5)
    before: [2, 2, 2, 3, 4, 5, 5, 5]                    (NaN clamped to a_min)
    after : [NaN, 2, 2, 3, 4, 5, 5, 5]                  == NumPy

Fix (DirectILKernelGenerator):
  EmitVectorMinOrMax gains a `propagateNaN` flag. For Single/Double lanes it now
  emits the NaN-restoring select
      result = ConditionalSelect(Equals(a, a), hwMinMax(a, b), a)
  i.e. keep the hardware min/max where a is non-NaN, else re-inject a (= NaN).
  This is robust regardless of whether the underlying min/max already propagates
  (verified against Avx.Max/Min AND the cross-platform Vector{N}.Max/Min fallback).
  The raw hardware emit is factored into EmitRawVectorMinOrMax and is still used
  verbatim for integer lanes (no NaN) and any opt-out caller, so integer
  maximum/minimum are byte-for-byte unchanged. The two clip call sites
  (Max(vec,lo) / Min(vec,hi)) pass propagateNaN: true, fixing maximum (clip
  a_min), minimum (clip a_max), and clip (both bounds) in one place. clip with a
  NaN BOUND also propagates, matching NumPy's minimum(maximum(a,lo),hi)
  composition.

Scope: fmax/fmin (W7-B, the NaN-IGNORING opposite contract) still share the clip
path and remain divergent — a separate, already-tracked bug. This change does
not regress them (their fuzz excuse keys on a NaN appearing in NumSharp's output,
which still holds). The F-contiguous/strided element-pairing defect (W7-A) is
likewise untouched.

Registry: removed the now-dead W11-A (maximum_out/minimum_out), clip_out, and
W6-D (clip) NaN excuses from MisalignedRegistry so the differential matrix
verifies NaN propagation bit-exact; W7-A and W7-B excuses preserved.

Tests:
  - np.minimum.Test.cs: maximum/minimum NaN propagation on the SIMD path
    (float64, float32), strided views, the W11-A out=alias case
    maximum(a, roll(a), out=a), and an int64 no-NaN regression guard.
  - np.clip.Test.cs: clip(NaN) with scalar bounds, array bounds, and out=alias.
  All expected values verified against NumPy 2.4.2.

Validation: full suite green (9760 passed / 0 failed); FuzzMatrix green incl.
Logic, Stat, Aliasing, Tail — the Aliasing corpus now reports zero documented
divergences for maximum_out / minimum_out / clip_out.
…ng and float16 signed-zero (W7-A)

np.maximum / np.minimum / np.clip (and fmax/fmin, which share the same kernel) all route
through the flat clip IL kernel in DirectILKernelGenerator.Clip. That kernel walks
src/dst/lo/hi linearly and pairs them by C-order (row-major) index. The engine, however,
handed it operands in whatever layout they arrived in:

  - PrepareBound built an array bound via broadcast_to(...).astype(outType). astype keeps
    the source layout (order='K'), so an F-contiguous / strided / broadcast bound stayed
    non-C-contiguous and was read in raw buffer order.
  - The general src/dst path used Cast(lhs, outType, copy:true), which likewise preserves
    the source layout, so an F-contiguous / strided source produced a non-C-contiguous
    working buffer.
  - A non-C-contiguous user out= was clipped in place in its own (F/strided) buffer, again
    mismatching the C-order bound.

Result: every element was paired with the wrong bound whenever a second operand (or the
source, or out=) was F-contiguous / transposed / strided / negative-strided / broadcast.
maximum(a, b.T) returned b's raw buffer instead of the element-wise max. This is the W7-A
fuzz divergence (34x in logic.jsonl).

Fix (Default.ClipNDArray): normalize every operand the flat array-bounds kernel touches to
C-contiguous offset-0 before the call.
  - PrepareBound copies the prepared bound to C-order when it isn't already C-contiguous
    offset-0.
  - The general path re-materializes a C-order copy of dst when lhs is F-contiguous /
    strided / offset.
  - A non-C-contiguous out= is clipped into a C-order temp and copied back via np.copyto
    (which honors the out's strides), preserving the return-the-out instance contract.
  - Scalar bounds are order-insensitive (one value per lane) and keep their existing fast
    paths untouched.

Half signed-zero tie-break (DirectILKernelGenerator.Clip): the residual W7-A divergence
after the pairing fix was float16-only and showed even on both-contiguous inputs - a
separate bug. NumPy's float16 maximum/minimum return the FIRST operand on a tie, so
maximum(+0,-0)=+0, maximum(-0,+0)=-0 (likewise minimum). HalfMaxNaN/HalfMinNaN used strict
>/< and returned the SECOND operand on the +0/-0 tie (wrong sign). Switched to >= / <= so
the tie resolves to the first operand, matching NumPy. Only the signed-zero result changes;
every non-zero equal pair is bit-identical. float32/64 use the SIMD path / Math.Max and
were already correct.

Verified bit-exact vs NumPy 2.4.2 across C / F / strided / transposed / negative-stride /
broadcast / 3-D layouts, mixed-dtype bounds, two-bound clip, and non-C-contiguous out=. The
logic.jsonl fuzz corpus now reports zero W7-A divergences (was 34x); maximum/minimum are
bit-exact on every layout.

MisalignedRegistry: removed the now-dead W7-A pairing excuse (the fallback for extrema value
divergences). fmax/fmin keep only the NaN-propagation excuse (W7-B, a separate ignore-NaN
bug); the isclose strided-complex excuse (its own comparison kernel, not clip-routed)
remains.

Tests (np.minimum.Test.cs): added F-contiguous bound, F-contiguous source, two-bound
F-contiguous clip, strided-slice, non-C-contiguous out=, 3-D transposed, float16
F-contiguous, and float16 signed-zero tie-break regressions. Added a flat (logical C-order)
assertion helper since GetDouble(int) on a multi-dimensional result indexes the first axis,
not the flat buffer.

Full net10.0 suite: 9769 passed / 0 failed.
The Half/Complex clamp helpers (HalfMaxNaN, HalfMinNaN, ComplexIsNaN, ComplexMaxNaN,
ComplexMinNaN) are invoked once per element from the clip IL kernel's inner loop - the
non-SIMD scalar path for Half/Complex, which have no Vector<T> Min/Max. Marked them
[MethodImpl(AggressiveInlining | AggressiveOptimization)]:

  - AggressiveInlining lets the JIT fold them into the generated kernel where it can
    (e.g. ComplexIsNaN into ComplexMax/MinNaN, and the helpers into the dynamic-method
    body), removing the per-element call.
  - AggressiveOptimization gives them full tier-1 codegen from the first call when they
    are invoked standalone via the kernel's Call opcode, skipping the tier-0 quick-JIT
    penalty on a hot path.

Scope: only the per-element helpers in the 'Per-Element Helpers for Non-SIMD Types (called
from generated IL)' region. The one-time IL-codegen methods (Generate, Emit*, etc.) are
intentionally left untouched - they run once per (dtype, mode, kind) and are cached, so
AggressiveOptimization would only add JIT startup cost with no steady-state benefit.
… op helpers

Following the clip per-element helpers (d72822f), tag the other small per-element
scalar/generic op helpers that run inside kernel inner loops with
[MethodImpl(AggressiveInlining | AggressiveOptimization)] - inline into the loop where the
JIT can, full tier-1 codegen otherwise (skip the tier-0 quick-JIT penalty on the hot path).

Tagged (8 helpers, 3 files):
  - ComplexDivideNumPy (DirectILKernelGenerator.cs) - per-element complex divide called from
    the binary IL kernel.
  - HalfSignHelper (Unary.Decimal.cs) - per-element Half sign called from the unary IL kernel.
  - Reduction.Axis.cs accumulator helpers, each called once per element inside the
    axis-reduction loops: ReadAsDouble, CombineScalarsPromoted, ComplexLexPick, DivideByCount,
    ConvertToDouble<T>, ConvertFromDouble<T>. Their generic typeof(T) branches constant-fold to
    a single cast per instantiation, so the inlined bodies are tiny.

Deliberately NOT tagged:
  - Whole-array / whole-axis loop helpers (Nan*SimdHelper, Arg*Helper, CopyGeneralSameType, the
    *ReduceAxis* loops, scan helpers): they ARE the loop, called once per op/axis with large
    bodies - AggressiveInlining is ignored on them, and .NET on-stack-replacement already
    promotes their long loops to tier-1. Can be revisited with AggressiveOptimization-only if
    profiling shows a tier-0 entry cost.
  - One-time IL-codegen methods (Generate/Emit*): run once per cached kernel key, so
    AggressiveOptimization would only add JIT startup cost with no steady-state benefit.

JIT-only change - no behavior change. Full net10.0 suite: 9769 passed / 0 failed; FuzzMatrix
42/42.
…per-element op helpers

Sweep the remaining small per-element / per-vector / per-T op helpers in the kernel hot core so
they all carry [MethodImpl(AggressiveInlining | AggressiveOptimization)], extending the clip
(d72822f) and reduction-accumulator (06d942c) passes.

Verified the two flags compose with NO penalty: a clean per-variant microbenchmark (dedicated
AggressiveOptimization loop per variant, 1.5e9 iters) measured AggressiveInlining =
AggressiveInlining|AggressiveOptimization = AggressiveOptimization-only = 740-741 ms, vs
NoInlining = 1106 ms. AggressiveOptimization does NOT suppress the inlining that
AggressiveInlining requests; it only adds tier-1-from-first-call for the standalone version
(relevant when a helper is reached via the kernels' generated-IL Call rather than inlined).

Changes:
  - Upgraded 131 helpers that were already AggressiveInlining-only to add AggressiveOptimization
    (Reduction.Axis.Simd.cs, Reduction.Axis.Widening.cs, Reduction.Axis.VarStd.cs, Quantile.cs,
    Modf.cs, Where.cs, IndexCollector.cs, SimdMatMul.cs, SimdMatMul.Strided.cs, StrideDetector.cs).
  - Added the full attribute to 6 previously-untagged per-element / per-vector / seed helpers:
    CompareGreater<T>, CompareLess<T> (Reduction.Axis.Arg.cs - argmax/argmin per-element compare),
    GetNanIdentityValue<T> (Reduction.Axis.NaN.cs), GetIdentityValueTyped<T> (Reduction.Axis.cs),
    HorizontalReduce256<T>, HorizontalReduce128<T> (Reduction.Axis.Simd.cs - per-vector finalize).
  - Added using System.Runtime.CompilerServices; to the three files that lacked it.

No AggressiveInlining-only attributes remain in Backends/Kernels/ (150 now carry both flags).

JIT-only change - no behavior change. Full net10.0 suite: 9769 passed / 0 failed; FuzzMatrix 42/42.
…elpers codebase-wide

Completes the sweep started in the kernels (e3685be): every method already marked
[MethodImpl(MethodImplOptions.AggressiveInlining)] now also carries AggressiveOptimization.
The two flags compose with no penalty (benchmarked: AggressiveInlining = both =
AggressiveOptimization-only at 740-741ms, all far ahead of NoInlining at 1106ms) and the
change is JIT-only / behavior-neutral.

396 helpers across 48 files upgraded - Primitives (NDArray operators), Backends/Iterators
(NDIter/NDIterator per-element traversal), Utilities + Utilities/SpanSource, Backends/Default/Math
+ BLAS, Manipulation, Creation, RandomSampling, Backends/Unmanaged/Pooling, DateTime64, etc.
No AggressiveInlining-only attributes remain in NumSharp.Core (550 now carry both flags).

AggressiveOptimization is a no-op for methods that inline (most of these); its benefit is
tier-1-from-first-call for the standalone version when a helper is reached via a non-inlined
path (generated-IL Call, delegate, or a body too large to inline). Every changed line is the
attribute swap - no behavior or body changes.

Full net10.0 suite: 9769 passed / 0 failed; net8.0 core builds clean.
… axis-reduction perf

Investigation artifacts only — no library code changed. complex128 axis reductions
were the next-worst op cluster after np.zeros (sum/mean axis=0/1 ran 19–64× slower
than NumPy 2.4.2). These POCs find the root cause and benchmark every candidate fix
against NumPy 2.4.2, all re-runnable via `dotnet run -c Release` in benchmark/poc.

Root cause (complex_axis_probe / complex_alloc_probe / complex_why_probe):
- NOT boxing. np.sum(complex) allocates ~0 B/elem; the (object) casts in
  CombineScalarsPromoted JIT-elide. The earlier "30M boxes" theory was wrong.
- The real cost is scalar System.Numerics.Complex add where NumPy SIMD-vectorizes
  the reduction (complex == contiguous double pairs). A double-pair Vector256 add
  reaches NumPy parity: 7.52 vs 7.59 ms @10m, and beats it @1m (0.31 vs 0.33).
- Secondary: the scalar fallback's non-inlined per-element combine + op-switch +
  cache-hostile per-output column gather (axis=0); mean's MeanAxisComplex allocates
  an NDArray view + iterator per output row (6 B/elem).
- Small-N gap is compute (scalar), not allocation: np.zeros((N,)) output is ~4% of
  the op; pure compute is ~93%.

Candidate benchmark (complex_reduce_poc) — baseline / NDAxisIter / NDIter-2op via
ForEach / NDIter-2op via ExecuteGeneric / monolithic generic-math, both axes,
10K–10M, plus strided input, Half/Decimal sweep, and alloc/op:
- Monolithic generic-math is the most uniform (1.3–2.5× NumPy, never bad).
- NDAxisIter beats NumPy on inner-reduce (axis=1) but is cache-hostile on
  outer-reduce (axis=0, 6.6× @10m — per-output strided column gather).
- NDIter 2-op REDUCE is good on axis=0 but 5.8× on axis=1 (BUFFER+REDUCE inner
  double-loop unfinished) and allocates 2344 B/op for iterator construction.
- Half needs a wider <TInput,TAccum> accumulator (same-type is precision-wrong);
  Decimal/Complex are correct same-type.

generic_math_proto: proves .NET generic math (IAdditionOperators / IAdditiveIdentity)
yields ONE zero-boxing kernel covering Complex AND all numerics; np.cumsum(complex)
already rides this exact path (CumSumAxisKernel<T>).

Fallback handling (complex_fallback_poc): layout-aware dispatch keyed on which
LOGICAL axis is contiguous (NumPy's strategy), not the physical last axis —
  output-axis stride==1 -> SIMD slab-fold
  reduce-axis stride==1 -> SIMD 2-lane (re,im) contiguous reduce
  neither            -> strided scalar, smaller-stride axis innermost
Covers C-contig / row-sliced a[::2,:] / col-sliced a[:,::2] / transposed a.T, all
correct, 0.79–1.23× NumPy (beats it on 4 of 8). Mirrors the fast-path dispatch
AxisReductionSimdHelper already has for numeric dtypes, so the real fix = reuse that
dispatch + plug in two complex double-pair SIMD primitives.

Decision pending: (A) extend AxisReductionSimdHelper with Complex-as-double-pairs
(fastest, lowest risk, matches existing reduction design) vs (B) migrate reductions
to NDIter (long-term target; needs inner-reduce double-loop + small-N overhead work).
Nucs added 3 commits June 27, 2026 19:00
…e open item

Updates the combinatorial-indexing handover to the executed reality: all five
divergence buckets (R1 value-broadcast on empty/scalar selections, B2 0-d-base
over-index, B3 empty advanced gather, B4 empty-slice assignment, R2 non-consecutive
0-d-bool placement + the Shape.Broadcast hash collision it uncovered) are fixed,
committed (aea9fc78..7e968f5e), and pinned by the new Indexing.CombinatorialParity
[FuzzMatrix] gate. The random sweep is 0 divergences across every measurable window.

The only open item is R3, a pre-existing flaky teardown heap-corruption. Records the
diagnostics gathered this session that supersede the prior "pinned managed" guess: it
is a specific corpus shape (not allocation volume), it is NOT a pooled native-buffer
overrun (end-aligned guard pages run clean), and it resists deterministic repro
(crash point varies 6285-9879; ~1/3 even in the tightest window). Next-session plan:
a per-case red-zone on FromArray / the direct-VirtualAlloc zeroed path, or a
dotnet-dump capture. Index_Random stays [OpenBugs] purely on the crash, not parity.
Mechanical, behavior-preserving rename of NumSharp's iterator/expression
stack from the `Npy*` prefix to `ND*`. ND* matches NumSharp's house style
(`NDArray` ↔ numpy.ndarray, the retired NDIterator) and NumPy's Python
`nditer` name. The old `Npy*` prefix mirrored NumPy's C struct name
(`NpyIter`); `ND*` is the user-facing convention used everywhere else.

Scope (NumSharp-owned only):
- Types/delegates/enums: NpyIter→NDIter, NpyIterRef→NDIterRef,
  NpyExpr→NDExpr, NpyIterState/Flags/PerOpFlags/GlobalFlags/OpFlags,
  NpyFlatIterator→NDFlatIterator, NpyAxisIter→NDAxisIter,
  NpyMemOverlap→NDMemOverlap, reduction-kernel structs + interfaces
  (INpy…→IND…), NpyArrayMethodFlags→NDArrayMethodFlags, and the
  NpyIter_* C-API-style method names → NDIter_*.
- Utilities: NpyComplexMath→NDComplexMath, NpyDivision→NDDivision,
  NpyIntegerPower→NDIntegerPower.
- Benchmark subsystem: benchmark/npyiter → benchmark/nditer
  (npyiter_{bench,sheet,cards,results,headline} → nditer_*,
  --skip-npyiter → --skip-nditer).
- 65 files renamed via git mv; ~190 files content-swept; website docs,
  docs/numpy notes, and frozen benchmark/history snapshots included.

Preserved (genuine NumPy references, NOT the stack):
- src/numpy/** (the upstream clone — NpyIter is NumPy's real C type).
- The .npy/.npz file format: `#region NpyFormat` (np.save/np.load) and the
  SaveAndLoadWithNpyFileExt test.
- NumPy's C function names quoted in docs (npyiter_allocate_arrays,
  npyiter_coalesce_axes, … kept verbatim).

Build: solution green (0 errors). Tests: 10980 passed, 0 failed, 11 skipped
(net10.0, CI filter TestCategory!=OpenBugs&!=HighMemory).

Branch commit messages were rewritten Npy→ND separately (message-only
history rewrite; file blobs in historical commits untouched). This commit
is registered in .git-blame-ignore-revs as a mechanical rename.
…→ND rename

1. Expand all entries to full 40-char SHAs — git blame --ignore-revs-file
   ABORTS on abbreviated names ("fatal: invalid object name: ac02033").
2. Register the mechanical Npy→ND rename commit (301229b).
Nucs added a commit that referenced this pull request Jun 27, 2026
…e nditer branch

Replaces the stale PR description (written ~64 commits in, +50k lines) with a
complete changelog of everything between the #612 merge-base (5eedb81) and
HEAD: 272 commits, 519 files, +198,407/-16,069 per the GitHub compare.

Compiled via a two-pass audit:
- Pass 1: every commit subject+body mined for features, perf numbers, and
  breaking changes; APIs/CI/benchmark/corpus facts verified against the live
  tree (test counts, fuzz corpus wc, Direct partial count, NDIter LOC).
- Pass 2: all 279 local commits re-walked against the draft. Caught and
  fixed: np.median/percentile/quantile/average/ptp/tile did NOT exist on
  master (verified via git grep origin/master) — reclassified from 'rebuilt'
  to new, raising the new-API count 22 -> 30; removed an unverifiable test
  count; added the 15-dtype hot-path parity item (786d705) and the
  DefaultEngine->NDIter Tier-3B production routing.

Scope note: SByte/Half/Complex + DateTime64 + casting rounds are PR #612
(already on master) and are intentionally excluded; the local master ref is
stale, which is why master..HEAD misleadingly shows 339 commits.

The same content (minus the H1) is now the live PR #611 description, pushed
via REST PATCH (gh pr edit requires read:org scope the token lacks).
Nucs added a commit that referenced this pull request Jun 27, 2026
… 1.3–6.1)

Branch advanced 31 substantive commits past the first changelog (which
described through 33058b8). The branch was rebased meanwhile — the original
changelog commit bb7ed7a8 is orphaned, its twin is 4140f4d, and 33058b8
remains an ancestor of HEAD, so 33058b8..HEAD is the true new-work boundary.

Learned and folded in:
- np.evaluate — Tier-3C fusion made public; per-node NumPy result_type typing
  (fixes the mixed-tree dtype bug: i4*i4+f8 must wrap in int32 first), fused
  reductions, EXTERNAL_LOOP guard, out= via ufunc rules. 3.2–6.1x vs NumPy.
- out=/where=/dtype= across the elementwise ufunc API (binary, unary-math,
  comparisons, predicates, bitwise, invert, arctan2) — one NumPy-shaped
  overload each, exact broadcast/cast/error-text semantics.
- New at np.*: bitwise_and/or/xor (were operator-only, CS0117) and positive.
- nditer: WRITEMASKED/ARRAYMASK execution + VIRTUAL operands (was silent
  masked-write corruption); Wave-1.4 fixes (size-1 stride-0 invariant,
  op_axes OOB, write-broadcast validation, PARALLEL_SAFE, unit-axis absorb).
- Alloc Wave 2.4: buffer-pool window 4KiB–1MiB -> 1B–64MiB, pool-side GC
  pressure, finalizer suppression.
- Canonical NDIter benchmark suite + post-release benchmark.yml CI +
  DocFX Benchmarks-vs-NumPy website pages; honest frontier findings recorded
  (broadcast-reduce 54x, scalar np.any 14.5x, BUFFERED+REDUCE ForEach P0
  crash, parallel banding 4.7x win).

Stats refreshed: 272/519/+198k -> 312 commits, 615 files, +217,949/-16,402.
Tests: 9,447 -> 9,709 passed/0 failed (net10.0). New-API count 30 -> 35.
Same content (minus H1) pushed live to the PR #611 description via REST PATCH.
Nucs added 3 commits June 28, 2026 08:02
Rename the two standalone benchmark projects to names reflecting what they are:
- NumSharp.Benchmark.GraphEngine → NumSharp.Benchmark.CSharp — the comprehensive
  BenchmarkDotNet operation suite; pairs with NumSharp.Benchmark.Python (numpy baseline).
- NumSharp.Benchmark.Exploration → NumSharp.Benchmark.Sandbox — the SIMD & broadcasting
  experiment playground.

Project dirs, .csproj (AssemblyName/RootNamespace), all namespace/using decls (incl. the
partial-qualifier Exploration.Infrastructure → Sandbox.Infrastructure), and external
referencers. Neither project is in SciSharp.NumSharp.sln. Both build clean (0 errors).

NOTE: benchmark/CLAUDE.md excluded — concurrent uncommitted edits by another actor; its
references will be swept in a follow-up.

Left as prose: the Sandbox banner "Performance Exploration Suite" and a "#541" reference.
…3.0; rename GraphEngine->CSharp refs

RELEASE_0.60.0.md (new) — comprehensive release notes for NumSharp 0.60.0, the
first stable, consolidating everything since 0.50.0 (the nditer engine, np.evaluate
fusion, full advanced-indexing parity, byte-exact printing, C/F/A/K, np.random
MT19937 parity, 36+ new np.* APIs, stride-native matmul, ARC, differential fuzzing).
Benchmark claims pinned to the committed 2026-06-23 history snapshot:
  - TL;DR geomean corrected to 1.26x @10m (397 faster / 150 close / 42 slower of
    615 ops; 1.14x @1k, 0.90x @100k) -- was the stale 1.00x/409-op June-5 figure.
  - Added a "Benchmark harness" section in section 8: the op/dtype/N matrix
    (1,851 cells) plus the five iterator/layout/operand/cast/fusion subsystems,
    history snapshots + latest symlink, and the NPY/NS convention.
  - section 8 cast row refreshed to the final campaign state (716 -> 129 lagging of
    1,568 comparable; 1,439 winning) with the per-src narrow-int geomeans.
  - section 15 now discloses the benchmark-surfaced slower paths (small-N dispatch,
    scalar Half/Decimal, large-N any full-scan, comparison->bool, fancy gather).
  - np.random (MT19937 + exactly 24 new distribution files) and the split funcs
    verified new-since-0.50.0 against ground truth; np.random added to the opening.
  - Named the renamed C# bench project (benchmark/NumSharp.Benchmark.CSharp).

benchmark/CLAUDE.md — brought the development guide from v2.0 (2026-02-13) to v3.0:
run_benchmark.py documented as THE orchestrator; metrics/type-system/categories/
python-suite/directory sections updated to current state -- 15 dtypes (added
int8/float16/complex128), ~615 ops/size (1,851 cells), 14 comparison suites + 5
appended subsystems, 3 cache tiers (1K/100K/10M), NPY/NS convention, committable
history/ snapshots; run-benchmarks.ps1 marked legacy. All 13
NumSharp.Benchmark.GraphEngine references -> NumSharp.Benchmark.CSharp
(folder/csproj/namespace/paths) for the rename in 002bc62.

.github/workflows/benchmark.yml — stale "builds GraphEngine+Core" step comment ->
"builds the CSharp bench project + Core".
@Nucs Nucs changed the title [Major Rewrite] NumPy nditer port, NpyExpr DSL with 3-tier custom-op API, C/F/A/K memory layout support, stride-native matmul [Major Rewrite] NumPy nditer port, NDExpr/np.evaluate fusion, full advanced-indexing parity, byte-exact printing, C/F/A/K layout, stride-native matmul, 15 dtypes Jun 28, 2026
@Nucs Nucs marked this pull request as ready for review June 28, 2026 12:41
@Nucs Nucs self-assigned this Jun 28, 2026
@Nucs Nucs added enhancement New feature or request architecture Cross-cutting structural changes affecting multiple components NumPy 2.x Compliance Aligns behavior with NumPy 2.x (NEPs, breaking changes) performance Performance improvements or optimizations documentation Improvements or additions to documentation infrastructure CI/CD, build system, testing framework, tooling core Internal engine: Shape, Storage, TensorEngine, iterators api Public API surface (np.*, NDArray methods, operators) release A newly released version labels Jun 28, 2026
@Nucs Nucs changed the title [Major Rewrite] NumPy nditer port, NDExpr/np.evaluate fusion, full advanced-indexing parity, byte-exact printing, C/F/A/K layout, stride-native matmul, 15 dtypes [Major Release 0.60.0] NumPy nditer port, NDExpr/np.evaluate fusion, full advanced-indexing parity, byte-exact printing, C/F/A/K layout, stride-native matmul, 15 dtypes Jun 28, 2026
Nucs added 2 commits June 28, 2026 17:31
Deleted the np-tests skill (write/migrate NumPy tests for NumSharp):
- .claude/skills/np-tests/SKILL.md  (the skill definition, ~3 KB)
- .agents/skills/np-tests/SKILL.md  (symlink to the above)

Nothing else references the skill; the np-function skill is unaffected.
Release prep for the 0.60.0 major release — the nditer rewrite that aligns
NumSharp's compute core with NumPy 2.4.2.

- Version bump 0.40.0 -> 0.60.0 across NumSharp.Core and NumSharp.Bitmap
  (AssemblyVersion / FileVersion / Version / PackageVersion). The tagged CI
  release also injects the version from the v* tag; this keeps the repo
  metadata correct for local and non-tagged builds.
- Add docs/releases/RELEASE_0.60.0.md — the full 0.60.0 release notes: NumPy
  nditer port, np.evaluate fusion, full advanced-indexing parity, byte-exact
  array printing, C/F/A/K memory layout, np.random rebuilt on MT19937 + 24 new
  distributions, all 15 dtypes (int8/float16/complex128), stride-native matmul,
  the astype SIMD campaign, ARC memory management, and differential fuzzing.
- CI (build-and-release.yml): the GitHub Release body is now composed from
  docs/releases/RELEASE_<version>.md (install header + packages table + notes),
  falling back to auto-generated notes only when no notes file is present.
- CLAUDE.md refreshed to the final/complete project state.
- Website docs updated (il-generation, index, theme CSS).
@Nucs Nucs merged commit 680f349 into master Jun 28, 2026
3 checks passed
@Nucs

Nucs commented Jun 28, 2026

Copy link
Copy Markdown
Member Author

Yes, I reviewed 600 files. Deal with it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api Public API surface (np.*, NDArray methods, operators) architecture Cross-cutting structural changes affecting multiple components core Internal engine: Shape, Storage, TensorEngine, iterators documentation Improvements or additions to documentation enhancement New feature or request infrastructure CI/CD, build system, testing framework, tooling NumPy 2.x Compliance Aligns behavior with NumPy 2.x (NEPs, breaking changes) performance Performance improvements or optimizations release A newly released version

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant