[Major Release 0.60.0] NumPy nditer port, NDExpr/np.evaluate fusion, full advanced-indexing parity, byte-exact printing, C/F/A/K layout, stride-native matmul, 15 dtypes by Nucs · Pull Request #611 · SciSharp/NumSharp

Nucs · 2026-04-22T09:19:43Z

The progress to 0.60.0

A from-scratch port of NumPy 2.4.2's nditer engine and the entire compute stack built on it — merging the nditer branch into master. It adds a fused-expression DSL (np.evaluate), full advanced-indexing parity with NumPy (down to the memory-safety and error-text level), byte-exact NumPy array printing, C/F/A/K memory-layout support wired through the whole API, stride-native matmul, NumPy-seed-compatible np.random (MT19937), all 15 NumPy dtypes, 36+ new np.* functions, deterministic memory management, and a differential-fuzz pipeline that proves bit-exactness against NumPy. It ships as NumSharp 0.60.0 — the first stable (non-prerelease) release in the 0.x line.

620 commits · 1,309 files · +330,971 / −273,664 (vs master). The net is a large body of new engine code offset by an equally large deletion of legacy generated code (the Regen template engine, NDIterator, and MultiIterator are all gone). Everything below was validated against NumPy 2.4.2 ground truth — a ~40,000-case differential corpus, 566 iterator-parity scenarios, a 12,000+-case index oracle, ~18,000 array-print fuzz cases, and per-feature battle tests run on actual NumPy output.

TL;DR

NDIter — full port of NumPy 2.4.2's nditer (~12.5K lines): all iteration orders (C/F/A/K), all indexing modes, buffered casting, buffered-reduce double-loop, masking, memory-overlap protection (COPY_IF_OVERLAP), windowed buffering (DELAY_BUFALLOC), unlimited operands and dimensions. It is the production execution engine for the elementwise/reduce core, at or faster than NumPy on every probed aspect.
NDExpr DSL + three-tier custom-op API, exposed as the public np.evaluate — write your own ufunc (raw IL / element-wise SIMD / composable expression trees) and run fused expressions 3.2–6.1× faster than NumPy (which can't fuse), with per-node NumPy result_type typing and fused reductions.
Full advanced-indexing parity. A faithful port of NumPy's prepare_index + unified advanced-index gather/scatter takes the get/set surface from ~697 → 0 divergences (full parity) against a new committed differential index oracle, closes a class of memory-safety bugs (out-of-bounds gather/scatter on exotic mixed indices), and aligns every IndexError/ValueError text with NumPy.
Byte-exact NumPy array printing. NDArray.ToString() is now a 1-to-1 port of NumPy 2.4.2's array_str/array_repr/array2string + Dragon4 float formatting; new public np.array2string/np.array_repr/np.array_str/np.set_printoptions/np.printoptions/np.format_float_positional/np.format_float_scientific. ~18,000 fuzz cases byte-identical to NumPy.
out= / where= / dtype= ufunc kwargs across the elementwise API — binary, unary-math, comparison, predicate, and bitwise families with exact NumPy broadcast/cast/error-text semantics. Plus np.bitwise_and/or/xor, np.positive, and first-class np.maximum/minimum/fmax/fmin.
All 15 NumPy dtypes — adds int8 and float16 and promotes complex128 from a stub to first-class (was 12 dtypes), each wired through creation / astype / reductions / comparisons / IL kernels with full 15×15 parity; the six Complex transcendentals are implemented. NumPy 2.x type aliases (np.byte→int8, np.complex64 throws, np.intp/int_/uint) realigned.
36+ new np.* APIs — sort/argsort, pad (11 modes), tile, median/percentile/quantile (all 13 interpolation methods) + nan* variants, average, ptp, take/put/place, extract/compress, diagonal/trace, argwhere/flatnonzero, unravel_index/ravel_multi_index/indices, delete/insert/append, split/array_split/hsplit/vsplit/dsplit, diff/ediff1d, asfortranarray/ascontiguousarray, np.multithreading, plus the printing APIs above.
np.random rebuilt for NumPy RNG parity — the legacy Knuth subtractive generator is replaced by MT19937 (NumPy's Mersenne Twister) for 1-to-1 seed/state compatibility, plus 24 new distribution samplers (weibull, vonmises, pareto, laplace, gumbel, dirichlet, multivariate_normal, noncentral_chisquare/noncentral_f, standard_t/standard_cauchy/standard_gamma/standard_exponential, triangular, zipf, logseries, rayleigh, wald, power, f, logistic, hypergeometric, multinomial, negative_binomial). Output is byte-identical to NumPy 2.4.2 at a given seed.
C/F/A/K order support wired through the whole API — Shape understands F-contiguity, OrderResolver resolves NumPy order modes, ~68 layout bugs fixed.
Stride-native matmul/dot — BLIS-style GEBP GEMM absorbs arbitrary strides for all dtypes (kills a ~100× penalty on transposed inputs); fused 1-D dot 3.5–9× faster with zero GC; opt-in multithreaded dot ~2× faster than NumPy's default on 1M vectors.
Type casting (astype) faster than NumPy across the board — the entire copy/retype/cast surface is unified on one NDIter.CopyAs core (the 2,226-line legacy per-element cast loop deleted), then a SIMD campaign took the full 15×8×15 astype matrix from 716 → 129 lagging cells of 1,568 comparable (1,439 winning ≥1.0× vs NumPy 2.4.2).
NumPy-parity benchmark: geomean 1.26× at 10M elements (397 faster / 150 close / 42 slower of 615 ops; 1.14× at 1K, 0.90× at 100K), from a committed, reproducible BenchmarkDotNet-vs-NumPy harness — an op/dtype/N matrix over all 15 dtypes plus five appended subsystems (iterator, memory-layout, operand, cast, fusion) with per-release provenance snapshots.
Deterministic memory management — atomic reference counting + IDisposable on NDArray, plus a tcache-style buffer pool (1 B – 64 MiB window).
Differential fuzzing vs NumPy — ~51,000 committed bit-exact cases across 28 tiers (a ~40,000-case op corpus + the ~12,400-case index oracle), replayed with no Python at test time, plus a metamorphic-invariant tier, a seeded fuzzer + shrinker, the CI FuzzMatrix gate, and a nightly soak (§10).
Legacy stacks deleted outright — MultiIterator, NDIterator (interface + class + AsIterator), and the Regen template engine (87 inline #if _REGEN blocks across 35 files) are all gone.
Cross-platform — macOS/Apple-Silicon (ARM64) signed-zero + integer-widening reduction parity fixed.
Test suite: ~10,980 passed / 0 failed on net8.0 + net10.0, plus the differential-fuzz corpora replayed by the FuzzMatrix gate. 177 formerly-[OpenBugs] reproductions were promoted into regular CI tests as their bugs were fixed.

1. `NDIter` — full NumPy `nditer` port

From-scratch C# port of NumPy 2.4.2's iterator machinery under src/NumSharp.Core/Backends/Iterators/ (~12,557 lines), promoted to public API with NDArray overloads. The public surface includes the NumPy-named flag enums (NDIterFlags/NDIterOpFlags/NDIterGlobalFlags, NPY_ORDER, NPY_CASTING), the NDIterRef kernel handle, and the NDInnerLoopFunc per-chunk delegate; the standalone flat iterator is NDFlatIterator (drives np.broadcast(...).iters).

Capability	Detail
Iteration orders	C, F, A, K — incl. NEGPERM negative-stride handling, axis reordering + coalescing to full 1-D collapse
Indexing modes	`MULTI_INDEX`, `C_INDEX`, `F_INDEX`, `RANGE` (parallel chunking), `GotoIndex` / `GotoMultiIndex` / `GotoIterIndex`
Buffering	Buffered casting with all 5 casting rules, windowed buffered iteration, `DELAY_BUFALLOC`, buffered-reduce double-loop (incl. `bufferSize < coreSize`)
Reductions	`op_axes` with `-1` reduction axes, `REDUCE_OK`, `IsFirstVisit`, `REUSE_REDUCE_LOOPS` slab accumulation
Overlap safety	`COPY_IF_OVERLAP` via a port of NumPy's `mem_overlap` solver (`NDMemOverlap.cs`) — overlapping in/out operands no longer silently corrupt
Masking	`WRITEMASKED` + `ARRAYMASK` executed — the buffered window flush writes back only mask-nonzero elements; `VIRTUAL` operands construct with NumPy 2.x semantics
Operands / dims	Unlimited operands (NumPy caps at `NPY_MAXARGS=64`) and unlimited dimensions (NumPy caps at `NPY_MAXDIMS=64`) via dynamic allocation
APIs	`Copy`, `GetIterView`, `RemoveAxis`, `RemoveMultiIndex`, `ResetBasePointers`, `IterRange`, `DebugPrint`, fixed/axis stride queries, `GetValue<T>`/`SetValue<T>`, …
Casting parity	`NDIterCasting.CanCast` matches NumPy's `safe`/`same_kind` lattice exactly

Validated by a dedicated battletest harness: 566 scenarios replayed against NumPy 2.4.2 byte-for-byte, plus a permanent variation-probe harness. Dozens of parity bugs found and fixed against NumPy ground truth (negative-stride flipping, NO_BROADCAST enforcement, F_INDEX coalescing, buffered-reduction stride inversion, K-order on broadcast inputs, the size-1 stride-0 invariant, op_axes out-of-bounds reads on stretched size-1 axes, write-broadcast validation, unit-axis absorption) — each reproduced against NumPy first, then fixed by adopting NumPy's constructor structure.

Execution at NumPy speed

NDIter isn't just correct — it is the production execution engine: DefaultEngine's binary, unary, and comparison ops (same- and mixed-dtype) route through the NDIter Tier-3B shell.

Aspect (float32)	NumSharp	NumPy	Ratio (NPY/NS)
contig sqrt 10M	2.98 ms	3.24 ms	1.09×
contig add 10M	3.91 ms	4.09 ms	1.05×
strided add 1M	319 µs	416 µs	1.30×
strided sqrt 1M	206 µs	374 µs	1.82×
strided sum 1M	109 µs	205 µs	1.88×
fused `a*b+c` 10M	4.77 ms	13.38 ms	2.81×
fused `(a-b)/(a+b)` 10M	4.12 ms	22.33 ms	5.42×

Key mechanisms: an O(1) trivial-loop bypass that skips iterator construction for contiguous operands, identity-broadcast fast paths, AVX2 hardware-gather (vgatherdps) strided SIMD in the Tier-3B shell (NumPy uses scalar loops for strided binary/reduce — its floors are beatable), and strided-reduction kernels.

2. `NDExpr` DSL + `np.evaluate` (fusion)

User-extensible kernel layer on top of NDIter — the public answer to "how do I write my own ufunc":

Tier 3A — ExecuteRawIL: emit raw IL against the NumPy ufunc signature void(void** dataptrs, long* strides, long count, void* aux).
Tier 3B — ExecuteElementWise: provide scalar + vector IL; the shell supplies a 4×-unrolled SIMD loop, remainder vector, scalar tail, and strided fallback.
Tier 3C — ExecuteExpression: compose NDExpr trees with C# operators ((a - b) / (a + b)), 50+ node types (arithmetic, trig, exp/log, rounding, predicates, comparisons, Min/Max/Clamp/Where), plus Call() to splice any delegate/MethodInfo into a fused kernel. Compiled once, cached by structural key, ~5 ns dispatch.

Exposed publicly as np.evaluate(expr[, operands][, out]):

Per-node NumPy result_type typing — every node resolves to its NumPy 2.4.2 dtype, so mixed trees wrap correctly: (i4*i4)+f8 wraps the multiply in int32 (→ 1410065408) before promoting. Strong-strong NEP50 (incl. int/float tier crossing), weak python-scalar literals (i4+2 → i4, i4/2 → f8) with NumPy's exact OverflowError, and special resolvers (true_divide, arctan2, negative-integer-literal power → ValueError, bool add=OR/multiply=AND).
Fused reductions — NDExpr.Sum/Prod/Min/Max/Mean compile a one-pass inner loop; sum(a*b) reads a and b once and never materializes the product. NumPy reduction dtypes (int→i64, uint→u64, mean→f64).
out= joins via the ufunc rules (same_kind validation, reference identity, overlap-safe aliasing through COPY_IF_OVERLAP); an EXTERNAL_LOOP guard prevents the silent count==1 slow path.
Measured (Release, 4M f64, NumPy 2.4.2): a*b+c 3.2×, (a-b)/(a+b) 6.1×, sum(a*b) 3.6×, sum f32 2.9×, i4*2+f8 3.5× faster. Permanent gate in benchmark/fusion/.
A runtime+type-aware SIMD gate decides per-node whether the rounding family (Floor/Ceil/Round/Truncate) vectorizes (float/double only, and only where the BCL provides Vector{N}.<op> at the active width), so fused integer rounding no longer crashes and SIMD is used wherever it exists on the running runtime.

3. Full advanced-indexing parity + memory-safety

A differential index oracle (NumPy 2.4.2 as the sole oracle) proved the common get/set surface was bit-exact but exposed ~697 divergences across exotic mixed advanced-index combinations (boolean-array + fancy, multi-dim fancy + slice, 0-d-bool + fancy, multi-fancy, empty combos) plus a flaky heap-corruption crash. This PR replaces the ad-hoc per-shape "Try*" patchwork with a faithful port of NumPy's mapping.c model and drives the oracle from ~697 → 0 divergences (full parity). The final fixes closed the remaining divergence categories — value-broadcast on empty/scalar selections, 0-d-base over-indexing, empty-advanced gather, empty-slice assignment, and non-consecutive 0-d-bool axis placement (which uncovered and fixed a latent Shape.Broadcast hash collision) — all pinned by a new Indexing.CombinatorialParity FuzzMatrix gate; the curated, dtype, and seeded-random oracle tiers are bit-exact. The one known open item is a flaky teardown heap-corruption crash on a single corpus shape (Index_Random stays [OpenBugs] for the crash, not for any parity gap).

PrepareIndex (Selection/NDArray.Indexing.PrepareIndex.cs) — a port of NumPy's prepare_index: one up-front pass that classifies the whole index tuple (HAS_* bitmask) and validates it before any kernel runs. Raises NumPy-verbatim texts for too-many-indices, boolean-array-dim mismatch, integer/array value out-of-bounds, and un-broadcastable advanced blocks.
TryBuildMultiAdvancedGrid — NumPy's general advanced-index algorithm (block broadcast + consec-aware axis placement). All advanced tuples (single or multiple fancy, fancy + slice/newaxis, 0-d bool joining the block, pure-advanced) now route through one gather/scatter; only pure-basic tuples fall through to the view path. Correct axis placement for ≥2 advanced indices mixed with slices (consecutive → in place; separated → advanced axes to front).
Memory-safety fixes (the heap-corruption class is closed): negative scalar-index assignment no longer writes one element before the buffer; fancy negative-OOB no longer reads OOB; per-axis integer OOB is validated (a[0,4] on a 3×4 now raises instead of returning a neighbour); too-many advanced indices raise instead of walking strides past the array; the negative-stride gather offset bound was corrected (valid rows on a[::-1] no longer rejected); block-copy bounds guards + an opt-in Windows page-heap (NUMSHARP_GUARD_PAGES=1) backstop the gather/scatter.
Combined boolean + advanced indexing — a boolean mask mixed with int/slice/array (arr[mask, int], arr[:, mask], arr[mask, 1:3], leading k-D mask + basic) now works for get and set, expanding each mask to its nonzero() components like NumPy.
Assignment validation — value-broadcast is checked on assignment (partial writes rejected with NumPy's could not broadcast … ValueError); a valid smaller value tiles like NumPy; a broadcastable value scatters correctly into a ≥2-D fancy subspace; empty-into-non-empty raises, empty-into-empty no-ops.
Input-form parity — raw bool[]/bool[,] (any rank) and IEnumerable<bool> are recognized as masks; C# tuples (nd[(1,2)]) spread to coordinates; List<int>/ArrayList/any IEnumerable coerce to a fancy index; uint64 scalar indices and int8 fancy-index dtype are accepted (all eight integer dtypes now index identically).
Differential index oracle committed as a FuzzMatrix gate (test/oracle/gen_index_oracle.py + Fuzz/IndexOracleTests.cs): index_curated.jsonl (2,265) + index_dtype.jsonl (104) run in CI bit-exact, and the 10,000-case seeded random tier is now 0 divergences too (replayed in soak; held out of the per-PR gate only by the teardown crash above). The combinatorial fixes are additionally pinned by the Indexing.CombinatorialParity gate.

Breaking: a raw int[]/long[] used as the sole index is now fancy indexing, not coordinate access — see Breaking Changes.

4. Byte-exact NumPy array printing

NDArray.ToString() is refactored from the legacy Python-list output ([0, 1, 2], no alignment/precision/summarization/dtype) to a 1-to-1 port of NumPy 2.4.2's array printing (numpy/_core/arrayprint.py + dragon4.c).

ToString() / ToString(false) → np.array_str → "[0 1 2]" (the str() form).
ToString(true) → np.array_repr → "array([0, 1, 2], dtype=…)" (the repr() form).
New subsystem src/NumSharp.Core/Backends/Printing/: PrintOptions (AsyncLocal context mirroring NumPy's format_options), Dragon4 (positional + scientific float→string), ElementFormatters (bool/int/float/complex with the two-pass exp-format decision, per-dtype cutoffs, column sizing, nan/inf fields), ArrayFormatter (recursive layout, line-wrap at linewidth, summarization with edgeitems, the 0-d str-vs-repr asymmetry, repr dtype/shape suffixes).
Float digit generation uses .NET's shortest round-trip ToString("R") (== Dragon4 unique) but routes all rounding through ToString("F"|"E"+precision) (rounds the true binary value, IEEE half-to-even) — never the shortest string (which diverges ~50% on adversarial ties).
Public API (APIs/np.array2string.cs): np.array2string, np.array_str, np.array_repr, np.set_printoptions, np.get_printoptions, np.printoptions (IDisposable context), np.format_float_positional, np.format_float_scientific.
NumSharp's Char dtype (string storage, no NumPy equivalent) keeps its legacy rendering, so GetString/AsString are unaffected.

Validated by ~18,000 differential-fuzz cases byte-identical to NumPy 2.4.2 across all dtypes (incl. float16/32/64 scientific, adversarial ties/carries), 1-D–5-D, sliced/broadcast/transposed views, complex, nan/inf, summarization, line-wrap, and print options; ~174 strict parity tests on net8.0 + net10.0.

Breaking: any code parsing the old ToString() format must update — see Breaking Changes.

5. C/F/A/K memory-layout support

Shape now tracks F-contiguity with NumPy-convention contiguity computation; new OrderResolver resolves C/F/A/K for every API with an order parameter.
Order support wired through: copy, array, asarray, asanyarray, *_like, astype, flatten, ravel, reshape, eye, concatenate, cumsum, argsort, tile, plus post-hoc F-contig preservation across the IL-kernel dispatchers.
New: np.asfortranarray, np.ascontiguousarray.
np.where selects C/F output layout the way NumPy does; ravel('F') of an F-contig source returns a view (was a 3,000× copy).
~68 layout bugs fixed across 9 TDD fix groups, backed by ~3,300 lines of order tests (reductions/keepdims, matmul/dot/outer/convolve, broadcasting-from-F, manipulation, file I/O fortran_order, Decimal scalar path, fancy-write isolation, …).

6. New & completed `np.*` APIs

New functions:

Area	APIs
Fused / ufunc	`np.evaluate` (fused expressions — §2), `np.bitwise_and`, `np.bitwise_or`, `np.bitwise_xor`, `np.positive`, first-class `np.maximum`/`np.minimum`/`np.fmax`/`np.fmin`
Printing	`np.array2string`, `np.array_str`, `np.array_repr`, `np.set_printoptions`, `np.get_printoptions`, `np.printoptions`, `np.format_float_positional`, `np.format_float_scientific`
Sorting	`np.sort` (+ `ndarray.sort`; `np.argsort` reimplemented) — radix line-kernel on NDIter, stable, NaN-last, all axes / orders (closes a long-standing Missing Function)
Manipulation	`np.pad` (all 11 NumPy modes + callable), `np.tile`, `np.delete`, `np.insert`, `np.append`
Splitting	`np.split`, `np.array_split` (uneven splits), `np.hsplit`, `np.vsplit`, `np.dsplit`
Indexing/selection	`np.take`, `np.put`, `np.place`, `np.extract`, `np.compress`, `np.argwhere`, `np.flatnonzero`, `np.diagonal`, `np.trace`, `np.unravel_index`, `np.ravel_multi_index`, `np.indices`
Statistics	`np.median`, `np.percentile`, `np.quantile` (all 13 interpolation methods, tuple axis, `out=`, `keepdims`, QuickSelect), `np.average` (`weights`, `returned`, tuple-axis), `np.ptp`, `np.nanmedian`, `np.nanpercentile`, `np.nanquantile`
Math	`np.diff`, `np.ediff1d`
Creation	`np.asfortranarray`, `np.ascontiguousarray`
Runtime	`np.multithreading(enabled, max_threads)` — opt-in threaded kernels

Rebuilt to full NumPy 2.x parity: np.clip (min=/max= aliases, None bounds, 2.x promotion, out=, 4×-unrolled SIMD kernel), np.unique (5 missing kwargs, NaN partitioning, up to 43× faster), np.searchsorted (side=/sorter=, IL binary-search 5–25× faster), np.copyto (casting=/where=, overlap-safe, SIMD fast paths), np.asarray (copy=/like=/device=/string dtype), np.concatenate, np.all/np.any (tuple-axis, out=, where=), np.expand_dims (tuple axis), np.repeat (axis=), np.power (integer-power semantics + crash fix), np.broadcast (N-operand 0..∞, live cursor, lazy .iters/.numiter). Engine completeness: bool/char max/min, Complex quantile, IsInf implemented, and the six Complex transcendentals sinh/cosh/tanh/arcsin/arccos/arctan (were NotSupportedException).

np.maximum/minimum/fmax/fmin are now first-class binary ufuncs (NEP50 promotion, broadcasting, out=, where=, every execution path) instead of clip-based shims — which fixes a real correctness bug: fmax/fmin now ignore NaN (return the finite operand) while maximum/minimum propagate it.

`out=` / `where=` / `dtype=` ufunc kwargs (NumPy parity)

The kwargs on every NumPy ufunc now span the elementwise core — binary (add, subtract, multiply, divide, true_divide, mod, power, floor_divide), unary-math (sqrt, exp, log, sin, cos, tan, abs/absolute, negative, square, …), the six comparisons, predicates (isnan/isfinite/isinf), bitwise, invert, arctan2 — each as one NumPy-shaped overload, every rule pinned against NumPy 2.4.2:

out joins the broadcast but never stretches (mismatched/stretchable out raise NumPy's exact texts, trailing space included); loop dtype resolved from inputs (NEP50); out only needs a same_kind cast; the provided instance is returned (reference identity).
where must be exactly bool; it broadcasts over operands and participates in output shape; mask-false slots keep prior out contents.
out aliasing an input is well-defined via COPY_IF_OVERLAP.
dtype= computes in the loop dtype (subtract(300, 5, dtype=i16) = 295), with the bool add→OR / multiply→AND remap keyed off the final loop dtype.

Random sampling — NumPy RNG parity (`np.random`)

np.random is rebuilt around NumPy's own bit generator and legacy distribution algorithms, so a seeded NumSharp stream is byte-identical to NumPy 2.4.2.

MT19937 bit generator replaces the legacy Knuth subtractive generator — a full Mersenne-Twister port (Seed(uint) / SeedByArray(uint[]), 53-bit NextDouble, rejection-sampled bounded ints, 624-word state Clone/SetState) with 1-to-1 seed/state compatibility: get_state()/set_state() use NumPy's state tuple format (Algorithm, Key[624], Pos, HasGauss, CachedGaussian). Gaussian generation moved from Box-Muller to the Marsaglia polar method with the _hasGauss/_gaussCache carry, matching NumPy's random_standard_normal (and its cached state) exactly.
24 new distribution samplers — dirichlet, f, gumbel, hypergeometric, laplace, logistic, logseries, multinomial, multivariate_normal, negative_binomial, noncentral_chisquare, noncentral_f, pareto, power, rayleigh, standard_cauchy, standard_exponential, standard_gamma, standard_t, triangular, vonmises, wald, weibull, zipf — joining the existing 16 (rand, randn, randint, uniform, normal, bernoulli, beta, binomial, chisquare, choice, exponential, gamma, geometric, lognormal, poisson, permutation/shuffle).
Legacy-algorithm alignment — the pre-existing samplers were corrected to NumPy's RandomState algorithms so they consume the RNG in the same order and emit byte-identical values: geometric (NumPy's search algorithm; previously returned negatives), beta (Jöhnk's algorithm for a,b ≤ 1), chisquare (SampleStandardGamma with Vaduva's algorithm for shape < 1), the lognormal(mean=0) NaN fix, and size=0 empty-array support across every sampler.
multivariate_normal uses an SVD transform (Jacobi eigendecomposition, eigenvalues sorted descending) matching NumPy's implementation; identity covariance is an exact sequence match.
Integer distributions return int64 — poisson, binomial, geometric, hypergeometric, zipf, logseries, negative_binomial (matching NumPy dtypes); scalar overloads return a 0-d NDArray; long is the canonical size/index type throughout (all int downcasts removed). Size/axis/seed validation, the return contract (size=None → scalar, size=() → 0-d), and the dual-overload API shape all follow NumPy.

Breaking: the RNG swap means a given seed now produces a different sequence than prior NumSharp versions — intentional, because the new sequence matches NumPy. See Breaking Changes.

7. Linear algebra

Stride-native GEMM for all 12 numeric dtypes — BLIS-style GEBP with stride-aware packers; the 8×16 Vector256 FMA micro-kernel reads packed panels, so transposed/sliced inputs cost nothing extra. Eliminates the ~100× fallback penalty (np.dot(x.T, grad): 240 ms → ~1 ms) and the boxing GetValue fallback chain.
Full matmul gufunc semantics — batched stacking, 1-D promotion/squeeze rules, validated by a differential matrix (816 cases).
Fused single-pass 1-D dot — 3.5–9× faster, zero GC (was up to 446 gen-0 collections per call at 100K).
np.multithreading — opt-in parallel 1-D dot (1M float dot 172 → 60 µs, ~2× NumPy's default). Off by default; bitwise-identical summation order when off.

8. Performance (engine, reductions, shifts, sort, casts)

Op	Improvement (NPY/NS, >1 = NumSharp faster)
Axis reductions, narrow ints	Widening SIMD (int16→int32 accum etc.): `sum(int16, axis=1)` 1058 ms → 2.7 ms (389×); also fixes a uint32 axis-sum corruption bug
`mean`/`var`/`std` (axis)	mean 217×, var/std 21×; `count_nonzero` 20×
Flat `min`/`max` (f64/f32)	raw `Avx.Min/Max` drops the JIT's redundant NaN fixup with a separate finite-mask + cold scan: 0.64×/0.69× → 1.55×/1.73× @100k; broadcast/neg-stride axis min/max SIMD-routed (was 0.07–0.17×)
`np.left_shift`/`right_shift`	reworked into first-class binary ufuncs + SIMD (variable-shift `VPSLLV`/`VPSRAV` via Tier-3B): common cases 2–4× NumPy (was 0.05–0.34× scalar); fixes 7 correctness bugs incl. NEP50 promotion, negative-count, bool→int8, sbyte SIMD; adds `<<`/`>>` operator overloads on `NDArray`
`np.sort` / `argsort`	radix line-kernel + insertion-sort fast path for short lines (fixes a 25–35× short-line regression) + single-pass multi-histogram radix; int8/16 6–9×, argsort 1.3–8× on most dtypes
`np.cumsum`	reimplemented as NDIter-driven `add.accumulate` with KEEPORDER output (C-src→C, F-src→F); 1.6–11.8× by axis; fixes empty/0-d/size-1 NEP50 widening + negative-axis validation
Boolean mask get+set	unified on one NDIter gather/scatter — tall-thin axis-0 select 372.9 ms / 3.86 GB → 9.1 ms / ~0 MB (~41×)
`np.nonzero`	IL SIMD kernel closes an 8–241× gap
Broadcast-reduce	stride-0 axes folded algebraically — `sum(broadcast_to(...))` ~534–700× faster, bit-exact
`sum`/`mean` (float)	bit-exact NumPy pairwise summation (matches `np.add.reduce` bit-for-bit; unblocks float32)
`np.any`/`np.all` (bool/char)	reinterpret to byte/ushort → integer SIMD path (was 5–12× scalar); fixes a latent AVX2 32-lane mask-overflow bug
`np.zeros`	`calloc`/demand-zero — O(1) (10M f64: 14.3 ms → ~0.01 ms)
Casts (`astype`)	full 15×8×15 SIMD campaign: 716 → 129 lagging cells of 1,568 comparable (1,439 winning ≥1.0×) vs NumPy — per-src geomean f16→narrow 3.8×, f32→narrow 2.0×, f64→narrow 1.4× (architecture below)
`np.abs`	exact `dtype=` parity (complex magnitude `

High-performance type casting

The single largest engineering effort in this PR rebuilt the astype/copy/retype machinery from the ground up.

One unified core. Every copy, retype, and cast now routes through NDIter.CopyAs(dstType, src, order) — it resolves C/F/A/K via OrderResolver, allocates the destination, and fills it through the in-place Copy primitive. NDArray.copy() and DefaultEngine.Cast (astype) both fold into it; the old scalar / (1,) / same-type-Clone / F-contig-special / CastCrossType branch maze and the 2,226-line per-element Converts.FindConverter cast loop are deleted. A TryCopySameType fast path fills scalar-broadcast and gap-free contiguous destinations with one typed InitBlock/memset/SIMD fill (6–8× for 1-byte, ~2× wider).
SIMD campaign over the whole matrix. A Phase-0 discovery sweep benchmarked all 15 src × 8 layouts × 15 dst astype combinations at 1M against NumPy 2.4.2, producing a lagging-cell worklist that successive waves drove from 716 → 129 of 1,568 comparable cells (1,439 now win ≥1.0×). The kernels: float→narrow-int (cvtt + truncating Narrow), float/int→bool (!=0 compare), Half↔ via the Giesen bit-fiddle (widen/narrow, sNaN-preserving), complex→int/bool (real deinterleave), sub-word strided/reversed lane shuffles (VPSHUFB/VPACKUS), fused VPGATHER whole-array kernels for strided float→narrow, an IL-emitted scalar cast kernel for the Vector-less dtypes (direct Converts.ToX, 0.65 → 1.5–2.6×), and a KEEPORDER same-type copy. Per-src geomeans: f16→narrow 3.8×, f32→narrow 2.0×, f64→narrow 1.4×.
Correctness rounds. Alongside the speed work, multiple rounds of NumPy-parity fixes closed precision-boundary bugs in the double→int converters, ToUInt32(double) overflow → 0, DateTime/TimeSpan conversions, and the Half/Complex/char converter paths, and replaced the IConvertible constraint with a Converts<T> path — all pinned by the cast tier of the differential-fuzz corpus (full 15×15).

Also in this PR: np.where gained an IL-generated AVX2/SSE4.1 (Neon-safe) SIMD kernel — 5.4× over the old call-based kernel and ~3.9× faster than NumPy at 1M — plus NEP50 weak-scalar type promotion; np.asanyarray now accepts every built-in C# collection (List/HashSet/Queue/Stack/ImmutableArray/LinkedList/Memory<T>/ArraySegment/LINQ results, non-generic IEnumerable/IEnumerator like ArrayList, Tuple/ValueTuple, and mixed object[]) with NumPy-like type promotion; and the test suite was migrated to MSTest v3.

All figures in this section come from the committed, reproducible NumSharp-vs-NumPy benchmark harness — the op/dtype/N matrix plus five subsystems (iterator / layout / operand / cast / fusion), with per-release benchmark/history/ provenance snapshots and the NPY/NS convention (>1 = NumSharp faster). Full details in §10.

9. Memory management — ARC + `IDisposable`

NDArray implements IDisposable backed by atomic reference counting on the unmanaged block: CAS-driven TryAddRef/Release, idempotent Dispose, finalizer safety net, immortal non-owning wraps. Views keep parents alive; parent disposal never invalidates live views.
Hammered by a 15-case lifecycle suite incl. 32-thread × 1,000-op concurrency races and 50-way parallel dispose — zero corruption.
A tcache-style size-bucketed buffer pool with a 1 B – 64 MiB window (covers both small-N ufunc results and 4M+ outputs); deterministic release plus the pool removes most steady-state GC pressure (dot at 100K: 446 collections → 0).
Native allocations now register GC memory pressure (GC.AddMemoryPressure/RemoveMemoryPressure) so the runtime sees the true unmanaged footprint and collects on time — fixing a runaway-growth bug (a 1M-array loop peaked at 10+ GB → ~54 MB).

10. Validation & benchmarking infrastructure (NumPy as the oracle)

Two committed, reproducible systems back every correctness and performance claim in this PR. Both treat NumPy 2.4.2 as the sole oracle, both run with no Python at test time or in CI, and both ship as first-class, regeneratable infrastructure.

The NumPy differential oracle (correctness)

A property-style differential fuzzer that proves every NDIter-backed op is bit-identical to NumPy across the input space — caught systematically, not by hand-picked cases. (The motivating failure: a cast saturate-vs-wrap bug, latent in where/copyto/concatenate, that example-based tests missed. It must be impossible to ship again.)

How it works. Generators under test/oracle/ run real NumPy 2.4.2 and emit a committed, bytes-exact JSONL corpus; the C# harness under test/NumSharp.UnitTest/Fuzz/ replays the operand bytes and bit-compares. NumPy never runs at test time — the committed corpus is its frozen answer.

Corpus — ~51,000 cases across 28 JSONL tiers (a ~40,000-case op corpus over 25 tiers + the ~12,400-case index oracle over 3). Each case stores its operands as (dtype, shape, element-strides, element-offset, bufferSize, base-buffer-hex) plus NumPy's expected (result dtype + shape + C-contiguous result bytes) — or the exception NumPy raised. FuzzCorpus.Reconstruct rebuilds the exact logical array from the base bytes alone — broadcast (stride-0), negative strides, offset slices and all — by aliasing a contiguous storage with the operand's view shape (no validation, so even layouts the public API would normalize away replay faithfully).
Layouts — the "44 variations" (layout_catalog.py, mirrored 1:1 in C#): 26 single-array builders (C/F-contig 1–3-D, transposed, strided/step-2, negative-stride, offset slice, broadcast & scalar-broadcast stride-0, 0-D view at non-zero offset, singleton dims, newaxis, empty, high-rank 5-D, reshape-view), 9 pairwise builders (the SimdFull / SimdChunk / General / scalar-left-right / broadcast-row-col / negstride dispatch classes), and 5 where-triple builders. Value pools are edge-loaded — NaN, ±inf, −0.0, type min/max, and the float→int overflow boundaries are front-loaded so even an 8-element operand exercises the cvtt sentinel paths. Every builder self-validates that the operand is a true view into its base buffer, so the bytes fully determine the array.
Coverage — ~90 ops (OpRegistry): astype (15×15), binary arithmetic (NEP50), floor_divide/mod/power, the six comparisons, ~30 unary-math kernels (incl. the transcendental / hyperbolic / inverse-trig stragglers), every reduction + the NaN-aware family, cumulative scans + diff, statistics (median/percentile/quantile/average/ptp/count_nonzero/clip), logic & extrema (isnan/isinf/isfinite/maximum/minimum/fmax/fmin/isclose), bitwise + shift, where/place, ~25 manipulation ops, modf multi-output, sorting/searching, matmul/dot/outer, and copyto (incl. overlapping same-buffer copies). Every case is checked on three axes against NumPy: result dtype (NEP50), result shape (broadcasting), and the result bytes (bit-exact, NaN tokenized) — plus error-side parity (a tier where NumPy raises asserts NumSharp also throws).
The verdict is never silent. A case is bit-exact (pass), a documented difference logged by MisalignedRegistry (intended NEP50/algorithm divergences and a few tracked bugs — excused but printed, never hidden), or a failure (any unknown divergence — the gate goes red). A failing element-wise case is auto-shrunk to a 1-element minimal repro.
The index oracle — a separate getter/setter differential gate (gen_index_oracle.py + IndexOracleTests): a portable token-encoded index tuple over 15 base recipes (scalar, empty, 2-/3-D, transposed, strided, negative-stride, offset, broadcast), replayed for get and set with full error-side parity. Three tiers — index_curated (2,265, CI gate), index_dtype (104 forms × 13 dtypes, CI gate), and index_random (10,000 seeded, now 0 divergences). This is the gate that drove advanced indexing from ~697 → 0 divergences (§3).
Metamorphic tier — 12 oracle-free invariants (MetamorphicTests) that need no NumPy: -(-a)==a, (a+b)-b==a, a.T.T==a, reshape round-trip, widening-cast round-trip, a*1==a / a+0==a, abs idempotence, sum-all == flat-sum, concatenate split-free, argsort-of-sorted = identity, equality reflexivity. They catch internal-consistency bugs the differential corpus structurally cannot.
CI + soak. Every tier runs under the [FuzzMatrix] gate on each CI (Windows/Ubuntu/macOS), replaying the committed corpus with no Python. A nightly fuzz-soak (fuzz_random.py, deterministic from its seed) sweeps ~1M fresh cases/night; a divergence prints a shrunk repro to drop into corpus/regressions/, which then pins it on every CI thereafter. Regenerate with numpy==2.4.2; Char/Decimal (no NumPy analog) are covered by the separate Converts-oracle tests.

The benchmark harness (performance)

A committed, reproducible NumSharp-vs-NumPy comparison driven by one entry point (benchmark/run_benchmark.py, NumPy 2.4.2 pinned). It builds and runs the BenchmarkDotNet op/dtype/N matrix (benchmark/NumSharp.Benchmark.CSharp) — 1K / 100K / 10M elements × all 15 dtypes, ~615 ops per size (1,851 measured cells: ✅ 792 / 🟡 357 / 🟠 177 / 🔴 72) — joined per (op, dtype, N) to a warm NumPy process, then appends five subsystems that fill the axes the matrix can't express:

Subsystem	What it isolates	Headline (NPY/NS)
iterator (`benchmark/nditer`)	the `NDIter` machinery itself — construction, traversal, reductions, selection, dtypes, pathologies, dividends — vs `np.nditer`, aspect × cache-tier	1.18× geomean; build+dispose 3.3× faster than `np.nditer`; `bcast_reduce` 538×
layout (`benchmark/layout`)	reduction / copy / elementwise across 8 memory layouts (C/F/T/strided/sliced/negrow/negcol/bcast) — the op matrix is C-contiguous only	at 1M copy ~2–3× & elementwise ~1.2–1.8×; the per-call dispatch tax shows at 100K
operand (`benchmark/operand`)	1-D / scalar / mixed-operand (C+F, C+T) / binary-broadcast layouts	~1.3–2.2× per case
cast (`benchmark/cast`)	the full 15×15 `astype` src→dst × 8 layouts at 1M (no op-matrix coverage)	1,439 / 1,568 comparable cells win
fusion (`benchmark/fusion`)	`np.evaluate` fused single-pass vs unfused np.* chains	several-fold over NumPy (§2)

Methodology guards keep the numbers honest: an InProcessEmit BenchmarkDotNet toolchain (sibling worktrees hold same-named projects the out-of-process toolchain refuses), a 25 ms-capped 50-iteration job (so µs–ms array ops skip BDN's nanosecond invocation ramp), and an asserted -c Release build (file-based dotnet run defaults to Debug, which silently halves hand-written kernels). Each run writes a committable benchmark/history/<date>_<sha>/ snapshot (report + every subsystem sheet + cards + a provenance MANIFEST) and repoints a latest symlink; a decoupled post-release benchmark.yml regenerates it, renders the DocFX pages, and commits it to master. The convention is NPY/NS throughout — ratio = NumPy ÷ NumSharp time, >1 = NumSharp faster (the headline figures above are the 2026-06-23 snapshot on an i9-13900K).

11. Legacy stacks deleted

MultiIterator deleted; all callers migrated to NDIter.Copy / multi-operand execution.
NDIterator (interface + NDIterator<T> + AsIterator extensions) deleted entirely; production iteration runs through NDIter / NDIterRef / GetAtIndex / NDFlatIterator.
The Regen template engine is fully purged: 87 inline #if _REGEN … #else <generated> #endif blocks across 35 files collapsed to the generated code with the template DSL preserved as reference comments; all *.template.cs / .tt / GenerateCode.ps1 and their dead <Compile Remove> csproj guards are gone. ~400 per-dtype NPTypeCode switch sites were replaced by a generic NpFunc dispatch utility, and dead-code sweeps removed the 24 [Obsolete(error:true)] tombstones plus 10 confirmed-dead private methods.

12. New dtypes — `int8`, `float16`, `complex128` (+ `Char8`/`DateTime64`)

NumSharp now supports all 15 NumPy dtypes (was 12). This PR adds SByte (int8) and Half (float16), and promotes Complex (complex128) from a stub to a first-class dtype — each mapped to its NumPy name and wired through array creation, np.dtype("int8"/"float16"/"complex128") parsing, astype, reductions, comparisons, and the IL kernels.

First-class across every hot path. A coverage audit of the NPTypeCode.Single => switch sites found the three dtypes silently dropping out of ~23 files (9 production crash sites + 5 perf gaps); each was filled so the full 15×15 dtype matrix works end-to-end — NDIterCasting/NDIterBufferManager (safe-cast, buffered iteration, the Complex→real ComplexWarning drop), np.repeat, np.any/np.all axis, argmax/argmin axis (Half first-NaN-wins, Complex lexicographic real-then-imag), and the reduction identity / min-max kernels. The six Complex transcendentals sinh/cosh/tanh/arcsin/arccos/arctan were implemented (were NotSupportedException).
NumPy 2.x type-alias alignment + np.dtype parser rewrite. np.dtype(string) is now a FrozenDictionary lookup; finfo/iinfo extend to the new dtypes; and the class-level aliases match NumPy 2.4.2: np.byte → int8 (sbyte), np.int_ → intp, np.uint → uintp, np.intp/np.uintp → long/ulong on 64-bit, and np.complex64/np.csingle now throw instead of silently widening to complex128. See Breaking Changes.
Half uses a scalar path (no Vector<Half> in the BCL) and routes conversions through double; Complex is excluded from the inherently-real ops (unique/clip/randint) per NumPy. Reductions for the three dtypes run through the NDIter path rather than a boxed loop (Decimal axis ops 5–13×, Half mean 1.6–3.7×), with bit-exact accumulation (Half/Complex flat sum/mean accumulate in a wider type and cast back).

Conversion-helper primitives — two further NumPy-parity types adapted from vendored .NET BCL sources under src/dotnet/, standalone helpers for now (not yet wired into NPTypeCode):

Char8 — the NumPy S1 / Python bytes(1) equivalent (readonly struct, 1 byte): conversions, operators, span helpers, and 100% Python bytes API parity validated against a Python oracle.
DateTime64 — the NumPy datetime64 equivalent (readonly struct over a long tick count): full long range and a NaT sentinel (long.MinValue) with IEEE-NaN-style propagation and NumPy comparison semantics (any ordering involving NaT is false, != is true, while Equals stays hash-contract-compliant). Implicit interop from DateTime/DateTimeOffset/long, checked conversion back, and Converts.ToDateTime64/ToX(DateTime64) matching NumPy exactly; calendar arithmetic delegates to System.DateTime.

13. Examples — trainable MNIST MLP

New examples/NeuralNetwork.NumSharp: a 2-layer MLP with a naive and a fused implementation (single-NDIter bias+ReLU fusion, fused softmax-cross-entropy backward, Adam optimizer). The stride-native GEMM made the old "copy transposed views before np.dot" workaround unnecessary; converges to >99% test accuracy in the bundled demo.

14. Cross-platform

macOS / Apple-Silicon (ARM64) reduction parity fixed (6 failing tests on macos-latest): maximum/minimum signed-zero ties (ARM FMAX/FMINNM vs x86 MAXPS/MINPS) now resolve to the second operand via an explicit strict-compare + ConditionalSelect; negate(+0.0) is an explicit sign-bit XOR (was 0 - x → +0.0 on ARM); narrow-int sum/prod reductions use exact integer accumulators on ARM (the AVX2-gated widening kernel fell back to a saturating double path). Reproduced under linux/arm64 via QEMU and pinned as a committed parity suite.

15. Tests & CI

~10,980 passed / 0 failed on net8.0 + net10.0, with zero regressions.
177 formerly-[OpenBugs] reproductions promoted into regular CI tests as their bugs were fixed (each asserts NumPy-2.4.2-correct behavior). Deliberately kept flagged: 7 AVX-512-only and 2 timing-dependent repros.
New/expanded suites: the differential index oracle + exhaustive get/set parity matrices (basic/fancy/edge/layout/combined/boolean-mask), array-print parity (np.ArrayPrint.ParityTests), cumsum parity (54), abs parity (33), shift parity, np.evaluate/out=/where=/dtype= parity, NDIter battletests (566), order-support sections, ARC lifecycle, and the macOS/ARM64 signed-zero parity suite.
CI: the FuzzMatrix gate (the differential oracle + index oracle + metamorphic tiers — §10) runs on Windows/Ubuntu/macOS; nightly fuzz-soak.yml; a decoupled post-release benchmark.yml runs the whole NumSharp-vs-NumPy harness (op/dtype/N matrix + the five iterator/layout/operand/cast/fusion subsystems — §10), renders the DocFX benchmark pages, and auto-commits the refreshed report + cards + a committable benchmark/history/<date>_<sha>/ provenance snapshot (and its latest symlink) to master.
Known remaining gaps (checked in as failing-by-design tests rather than ignored): the still-unimplemented NumPy functions flip/fliplr/flipud/rot90, diag, gradient, and round (the function form; np.round_/np.around exist); the benchmark-surfaced slower paths, all tracked in the committed sheets — small-N (~1K) per-call dispatch overhead, the scalar Half/Decimal element paths (no BCL Vector<Half>/Vector<decimal>), large-N np.any full-scan, comparison→bool stores, and fancy gather/scatter; and a handful of iterator/indexing edge cases pinned as [OpenBugs].

Key highlights since 0.40.0

0.60.0 caps a three-release arc that rebuilt NumSharp's compute core from the ground up and aligned it with NumPy 2.x. For anyone upgrading from 0.40.0, the cumulative picture across the prerelease line:

0.41.0 — IL Kernel Generator (the compute-core rewrite, Mar 2026)

Runtime IL emission (System.Reflection.Emit.DynamicMethod) replaced the ~600 K-line Regen template engine with ~19 K lines — a net −533 K lines — with Vector128/256/512 SIMD and runtime width detection across every op.
NEP50 (NumPy 2.x) type promotion; single-threaded deterministic execution (SIMD in place of Parallel.*).
35 new functions — the nan* reductions, cbrt, floor_divide, left/right_shift, cumprod, count_nonzero, isnan/isfinite/isinf/isclose, and the np.comparison + np.logical modules — plus the comparison/bitwise operators (== … >=, &/|/^) implemented for the first time.
MatMul 35–100× faster (cache-blocked SIMD, 20+ GFLOPS); boolean indexing and axis reductions rewritten on SIMD.
60+ NumPy-parity bug fixes; +4,200 tests; no breaking changes.

0.50.0 — Long Indexing (>2 GB arrays + the type system, Apr 2026)

Int64/long indexing migrated across Shape/NDArray/Storage/iterators/IL kernels — arrays beyond 2.1 billion elements (>2 GB) now work; np.argmax/np.nonzero return long. New UnmanagedSpan<T> (long-length Span parity), LongIntroSort, and unmanaged long index buffers.
12 type-introspection APIs — can_cast, promote_types, result_type, min_scalar_type, common_type, issubdtype, finfo, iinfo, isreal/iscomplex/isrealobj/iscomplexobj.
NumPy 2.x type system — np.arange(10) → int64, NPTypeHierarchy (bool not under Number), and 0-D scalar arrays (np.array(5) → 0-D).
Python container protocol — __contains__/__len__/__iter__/__getitem__/__setitem__, plus tolist()/item(); np.frombuffer rewritten to the full NumPy signature (count/offset/big-endian/view semantics).
ValueType→object scalar migration; operator-overload cleanup (−74%); 600+ battle tests.

0.60.0 — nditer (this PR — the first stable)

Full NumPy nditer port as the execution engine; np.evaluate fusion (3.2–6.1×); full advanced-indexing parity + a differential index oracle; byte-exact array printing; out=/where=/dtype= ufunc kwargs; C/F/A/K memory layout; 36+ new np.* APIs (sort, pad, percentile/quantile, take/put, split, …); np.random rebuilt on MT19937 for NumPy seed parity + 24 new distributions; stride-native matmul and an astype SIMD campaign that both beat NumPy; ARC memory management + buffer pool; differential fuzzing vs NumPy; and the legacy iterator stack and the Regen engine deleted outright. (All detailed in the sections above.)

Breaking changes

Change	Impact	Migration
Raw `int[]`/`long[]` as the sole index is now FANCY	`nd[new int[]{0,2}]` selects rows 0 and 2 (shape `(2,…)`), not the element at coordinate `(0,2)`	Use `nd.GetData(0, 2)` for coordinate access. `nd[0,2]` (separate ints) is unchanged; `NDArray<T>.this[int[]]` is unchanged
`NDArray.ToString()` format changed to NumPy `array_str`/`array_repr`	`[0 1 2]` (str) / `array([0, 1, 2], dtype=int64)` (repr) instead of `[0, 1, 2]`	Update any code parsing `ToString()` output; use the typed accessors/`GetData` for values
`np.left_shift`/`right_shift` result dtype is now `result_type(lhs, rhs)`	`int8 << int32` → `int32` (was `int8`, overflowing); `bool << bool` → `int8`	— (matches NumPy)
`np.fmax`/`fmin` now ignore NaN	return the finite operand; `maximum`/`minimum` still propagate NaN	— (matches NumPy; fixes a prior correctness bug)
`np.cumsum` of empty / 0-d / size-1 integer input widens to int64 and never returns 0-d	`cumsum(empty int32)` → fresh `int64`; `cumsum(0-d)` → `(1,)`	— (matches `np.add.accumulate`)
Boolean-mask axis-0/partial set with a 1-D count-length value now raises `IncorrectShapeException`	was silently "one scalar per selected row"	Use a `(count, 1)` value to fill one value per selected row
Over-indexing with slices now raises `IndexError` (too many indices)	`A[:, :, :]` on a 2-D array	Drop the extra indices (matches NumPy)
Per-axis / fancy out-of-bounds now raise `IndexError`	`A[0,4]` on a 3×4, fancy `-7` on a size-6 axis (was wrong value / OOB read)	— (correctness + memory safety)
`np.full` argument order flipped to `np.full(shape, fill_value, dtype)`	matches NumPy (was `np.full(fill_value, shape, dtype)`)	Swap the first two arguments; `dtype` stays third
`bool - bool`, `-bool`, `np.negative(bool)` now throw	Matches NumPy	Use `^` / cast to int first
NaN `<=` / `>=` returns `False`	Matches IEEE & NumPy	Use `np.isnan` explicitly
`floor_divide`/`mod` divide-by-zero & floored results; `np.negative(uint)` wraps	Matches NumPy	—
`np.power(int, negative int)` raises `ValueError`	Matches NumPy	Use float exponents
Cast edge cases (overflow/NaN/`complex→bool`/`float→int` truncation); transcendental NEP50 width promotion; `np.clip`/quantile dtype promotion	Return values/dtypes may change	—
`np.byte` is now `int8` (sbyte)	was `uint8` (byte)	matches NumPy's C-`char` convention; use `np.ubyte`/`np.uint8` for unsigned
`np.complex64`/`np.csingle` now throw	was silent widening to complex128	use `np.complex128`
`np.int_`/`np.uint`/`np.intp`/`np.uintp` realigned to NumPy 2.x (`intp`/`uintp`; long/ulong on 64-bit)	the dtype these aliases resolve to changes	— (matches NumPy 2.x)
`np.random` seed sequences changed (Knuth subtractive generator → MT19937)	Same seed now yields a different sequence	— (intentional; output is now byte-identical to NumPy 2.4.2 at a given seed)
Integer `np.random` distributions return `int64`	`poisson`/`binomial`/`geometric`/`hypergeometric`/`zipf`/`logseries`/`negative_binomial` were `double`	— (matches NumPy dtypes)
Broadcast views are read-only; broadcasting keeps rank for 1-D `[1]`	Matches NumPy	`.copy()` to write
`MultiIterator` and `NDIterator` (+ `NDIterator<T>`, `AsIterator`) removed	Public types removed (threw at runtime anyway)	Use `NDIter` / `NDIter.Copy` / `NDFlatIterator`
`NDIter`: `MaxOperands=8` and 64-dim limits removed	None (loosening)	—
`np.copyto` unwriteable-destination error type corrected	Exception type change	—

Everything above was validated against NumPy 2.4.2 ground truth — by ~40,000 differential corpus cases, 566 iterator parity scenarios, a 12,000+-case index oracle, ~18,000 array-print fuzz cases, and per-feature battle tests run on actual NumPy output.

Closes: #435 #439 #456 #477 #480 #495 #501 #508 #515 #542 #567 #568 #604 #605 #608

Nucs · 2026-06-05T13:40:10Z

📊 Benchmark & performance — `nditer`

Two performance items on this branch: a fused strided-SIMD unary kernel, and an official NumSharp-vs-NumPy benchmark across ~all op categories at three sizes (benchmarked Core state = d01f1d63).

1. Fused strided-SIMD unary IL kernel (`d01f1d63`)

New whole-array kernel for unary ops over non-contiguous 1-D inputs: strided gather → Vector{W}.Create → unary op → contiguous store, single pass, no scratch buffer, no per-tile dispatch.

Isolated kernel, np.sqrt(a[::2]), ns/elem:

N	fused	NumPy 2.4.2	NumSharp speedup
64	0.547	4.322	7.9×
4,096	0.549	0.660	1.20×
262,144	0.557	1.419	2.55×

The kernel is size-invariant (~0.55 ns/elem at every size) while NumPy degrades 2–6× as data spills out of cache.

All 11 ops on this path — speedup vs NumPy @262K (f64):

abs        3.37×   negate  3.15×   floor   3.07×   trunc   3.03×   round  3.00×
sqrt       2.55×   rad2deg 2.41×   deg2rad 2.22×   square  2.18×   reciprocal 1.72×

Verified 22,000 bit-exact checks (fused == contiguous kernel); full unit suite 9447/0/11.

Note: this is a DirectILKernelGenerator whole-array kernel that bypasses NpyIter by design — the fusion (gather folded into Vector.Create) is incompatible with NpyIter's gather/kernel separation, which is exactly the (slower) buffered path it replaces.

2. Official NumSharp-vs-NumPy benchmark (`6038990f`)

Methodology: BenchmarkDotNet Full — 50 iterations, InProcessEmit toolchain, iteration-time capped at 25 ms — × {1K / 100K / 10M} vs NumPy 2.4.2. i9-13900K · .NET 10.0.101 · Python 3.12.12. 1,813 C# measurements → 1,111 matched comparisons.

The iteration-time cap is what makes a Full run feasible: BDN's default Throughput strategy ramps to ~8192 invocations/iteration, so a 10M-element op at 50 iters took ~25 s per case. Capping it ⇒ ~15× faster (a 30-case set went 18 min → 70 s) with all 50 iterations preserved.

Headline — geomean (NumSharp ÷ NumPy, lower = better):

        slower ◄───────── 1.0 (parity) ─────────► faster
1K    ████████████████████  1.96×   (102 win / 212 lose)
100K  ██████████████████▎   1.83×   (109 win / 196 lose)
10M   ██████████▏ ........  1.00×   (166 win /  36 lose)   ◄ PARITY

At the memory-bound 10M size NumSharp is at parity across ~409 ops (166 faster, only 36 slower). Small-size cost is the per-element dispatch + result-allocation tax (~2×).

Per-suite geomean by size:

suite	1K	100K	10M
Statistics	0.19×	0.68×	0.48× ✅
Sorting	0.41×	1.13×	0.45× ✅
Comparison	1.27×	2.22×	0.50× ✅
Bitwise	8.16×	1.16×	0.61× ✅
Reduction	0.48×	0.94×	0.91× ✅
Arithmetic	3.09×	2.62×	1.25× 🟡
Unary	3.50×	4.44×	1.53× 🟡
Creation	12.26×	2.92×	2.24× 🟠
LinearAlgebra	2.76×	1.66×	4.02× 🔴

🏆 Biggest wins (@10m, real ms):

op	dtype	NumPy	NumSharp	speedup
`average`	f32	9.60	0.94	10.2×
`nansum`	f32	14.35	1.49	10.0×
`nanprod`	f32	18.52	1.90	9.7×
`var`	f32	16.96	2.60	6.5×
`count_nonzero`	f64	22.61	3.74	6.0×
`nanmean`	f64	33.47	5.69	5.9×

🎯 Biggest gaps (@10m) — optimization targets:

op	dtype	NumPy	NumSharp	gap
`sum axis=1`	uint8	3.12	49.74	16.0×
`dot`	f64	1.23	16.46	13.4×
`matmul`	f64	0.72	4.26	5.9×
`argsort`	int32	369	2162	5.9×

→ three fronts: narrow-int axis reductions (no widening-SIMD), linear algebra (no BLAS), sort.

Per-dtype @10m (geomean):

int64 0.91  uint64 0.92  f32 0.93  f64 0.98  uint8 1.00  uint32 0.99   ◄ strong
int32 1.11  int16 1.14   uint16 1.24   bool 1.60                       ◄ weak (bool, narrow-uint)

Dtype coverage: 10 dtypes compared vs NumPy; char/decimal measured but have no NumPy peer (C#-only). SByte/Half/Complex were uncovered and have since been added to the benchmark code (48e85528) — the next run produces the full 15-dtype matrix.

Reproducibility

Reusable cross-platform runner: python benchmark/run_benchmark.py (builds C#, runs BDN per-suite, sweeps NumPy at 3 sizes, merges, archives).
Full report: benchmark/benchmark-report.md (1,311 rows).
Provenance snapshot keyed by date+hash: benchmark/history/2026-06-05_6038990f/ (manifest + report + NumPy timings).

…m the changelog Per review: the changelog should describe the final state, not the development path. Removed the temporal 'Latest wave (Waves 1.3–6.1) — added after the first changelog' umbrella section entirely and dissolved its content into the proper topical sections, with all 'wave' terminology and 'added after'/'previously absent'/'now reachable' path-language gone: - np.evaluate folded into §2 (NDExpr DSL): per-node result_type typing, fused reductions, out= rules, EXTERNAL_LOOP guard, measured speedups. - out=/where=/dtype= ufunc kwargs folded into §5 as a parity subsection. - WRITEMASKED/ARRAYMASK execution, VIRTUAL operands, and the size-1 stride-0 / op_axes-OOB / write-broadcast / PARALLEL_SAFE / unit-axis fixes folded into §1 (capability matrix + bug list); masked-write corruption fix added to §10. - buffer-pool window (1 B–64 MiB), pool-side GC pressure, finalizer suppression folded into §7; TL;DR memory bullet updated. - canonical NDIter benchmark, benchmark.yml CI, DocFX benchmark pages, and the honest frontier findings folded into §8/§15. - 'NPYITER_GAPS_AND_ROADMAP … 6-wave plan' -> 'prioritized roadmap'. Net: zero 'wave' occurrences; the 16-section topical structure is intact. Same content (minus H1) pushed live to the PR #611 description.

… stat Per updated direction: the ratio convention is NumPy ÷ NumSharp again (>1.0× = NumSharp faster — bars grow right = faster, the original visual), AND every row now also carries 🕐 %NumPy = (NumSharp ÷ NumPy) × 100 = the share of NumPy's time NumSharp uses. So a win reads two intuitive ways: "12.63× faster" and "🕐 8%" (takes only 8% of the time NumPy would); parity is 🕐 100%; >100% is slower. Huge slowdowns compact to e.g. 🕐 881×NP. render_dashboard.py: * r["sp"] = numpy/numsharp (speedup), r["pct"] = numsharp/numpy*100 (share of NumPy time). * headline + every bar/table show both: HEADLINE 0.74× geomean · 🕐 136%; by-suite e.g. statistics 2.28× 🕐 44%, reduction 1.21× 🕐 83%, creation 0.35× 🕐 283%; FASTEST nansum 12.63× 🕐 8%; SLOWEST np.zeros 0.001× 🕐 881×NP. * status-mix bands relabelled in %NumPy terms (faster ≤100% / close 100–200% / slower 200–500% / much >500%), a legend line explains the 🕐 stat, pct_str() keeps the column narrow (NN% under 1000%, else NN×NP). benchmarks-dashboard.md re-seeded with the matching note (heredoc — printf mis-read %NumPy as a format spec); docfx build clean, emoji verified present (U+1F550 ×54). Supersedes the brief NS/NP experiment (c0a5346). The op-matrix report (merge-results.py) still uses NS/NP "lower is better", and the nditer sheet / cards use NP/NS without the %NumPy stat — rolling the NP/NS + 🕐 %NumPy convention out to those is the next step, pending confirmation.

Completes the rollout chosen after the dashboard fix: every benchmark surface now uses the SAME convention — speedup = NumPy ÷ NumSharp (>1.0× = NumSharp faster) — and every surface also carries 🕐 %NumPy = (NumSharp ÷ NumPy) × 100 = the share of NumPy's time NumSharp uses (30% = takes only 30% of the time NumPy would; <100% = faster; huge slowdowns compact to e.g. 880×NP). So a win reads two intuitive ways at once: "12.66× faster" and "🕐 8%". Op-matrix report (merge-results.py) — FLIPPED from NS/NP to NP/NS (the one surface that was "lower is better"): * ratio = numpy_ms / numsharp_ms; new pct_numpy field on UnifiedResult (JSON + CSV). * get_status bands inverted around >1 = faster (faster ≥1.0× / close 0.5–1.0× / slower 0.2–0.5× / much <0.2×); classify() credibility gate flips to ratio > 20 (was < 1/20). * Best/Worst now sort DESCENDING (fastest first); legend + tables + summary-by-size gain a 🕐 %NumPy column; ratio_fmt keeps tiny slowdowns readable (0.001× not 0.00×). * Regenerated from the on-disk run archive: Top Best nansum 12.66× 🕐 8%; Top Worst np.zeros 0.001× 🕐 880×NP; searchsorted stays negligible (now ratio>20). Counts unchanged (305/255/169/103/275/126) — same rows, just the direction relabelled. nditer sheet (nditer_sheet.py) + cards (nditer_cards.py) — already NP/NS, ADDED 🕐 %NumPy: * sheet: legend line + per-bar 🕐 %NumPy + headline "1.17× geomean · 🕐 85% of NumPy's time"; re-rendered nditer_results.md (--render-only, AV block intact). * cards: each bar label now "1.80× · 56%" (ops) / "4.3× · 23%" (dividends); footer explains the %. No emoji in matplotlib (DejaVu lacks the glyph) — the % carries it. Re-rendered. Narrative benchmarks.md + README — already NP/NS, added the 🕐 %NumPy line to the convention block, a %NumPy column to the by-class table, and a caption sentence. DocFX pages (benchmark-matrix.md, benchmark-iterator.md) re-seeded from the regenerated report + sheet; benchmarks.md updated; docfx build clean (0 errors). The dashboard (render_dashboard.py / benchmarks-dashboard.md) already carries this convention (49af3af), so the whole benchmark stack — report, dashboard, iterator sheet, cards, narrative, README — is now identical: NumPy ÷ NumSharp speedup + 🕐 %NumPy.

The clock sat before the figure with the right-align padding landing between them ("🕐 87%"). Moved it to immediately follow the percentage, no space — "87%🕐" — across every surface, and likewise the metric name (🕐 %NumPy → %NumPy🕐). The alignment padding now sits before the number (where it belongs) instead of after the emoji. * render_dashboard.py / nditer_sheet.py: bar values "{pct_str}🕐", headline "85%🕐 of NumPy's time", legend "%NumPy🕐 = …". Dashboard + sheet regenerated. * merge-results.py: report legend, status-band table, summary-by-size "%NP🕐" column, Best/Worst note, and per-suite "%NumPy🕐" column headers. Report regenerated. * benchmarks.md + README: convention line / table column / caption "%NumPy🕐". * DocFX pages (matrix, iterator, dashboard) re-seeded; dashboard page note "%NumPy🕐". docfx build clean. The matplotlib cards are unaffected (they show "1.80× · 56%" without the emoji — DejaVu has no clock glyph — so there was never a gap to fix there).

… form pct_str (dashboard/sheet) and pct_fmt (report) switched to a ×-multiplier form for huge slowdowns (np.zeros etc.), so the %NumPy stat showed "880×NP🕐" / "880×" — breaking the NN%🕐 depiction the column promises. Now they always render a percentage: np.zeros reads "87957%" (report) / "88087%🕐" (dashboard) = takes ~880× as long, stated as a share of NumPy's time like every other cell. The ratio column is untouched — it legitimately uses × (0.001×, 12.65×); only the %NumPy formatters changed. Report + sheet + dashboard regenerated, the three DocFX pages re-seeded, docfx build clean.

…g from the report The dashboard and benchmark-report.md disagreed on the SAME cell: np.nansum(f64,100K) read 12.63× on the dashboard vs 12.65× in the report, np.zeros(i64,10M) read 88087% vs 87957%, quantile/percentile likewise — 161 rows printed a different ratio at 2dp between the two committed surfaces. Root cause: merge-results.py computes ratio = NumPy/NumSharp and pct_numpy from the FULL-PRECISION means, then stores numpy_ms/numsharp_ms rounded to 4dp. render_dashboard.py ignored the stored ratio/pct_numpy fields and RE-DIVIDED the rounded ms (r["numpy_ms"] / r["numsharp_ms"]), so every row where the 4dp rounding moved a digit drifted from the report. The report is correct (true ratio of the measured means); the dashboard was a rounding artifact of its own recompute. Fix: the credible loop now consumes r["ratio"] / r["pct_numpy"] straight from the JSON (the same numbers benchmark-report.md prints), falling back to 100/ratio only if pct is absent. Dashboard and report now agree cell-for-cell, and the per-suite/per-dtype geomeans key off the same stored ratios the report's Summary-by-size uses. Regenerated benchmark-dashboard.md (gitignored) and re-seeded the DocFX dashboard page; header preserved, body refreshed. Verified: nansum 12.65×/8%, zeros 0.001×/87957%, quantile 9.89×/10% identical on both surfaces; size tiers match Summary-by-size exactly.

…not run" cells normalize_op_name dropped measured C# data on the floor whenever the C# benchmark label and the NumPy suite name differed only cosmetically, so the report showed ⚪ "C# benchmark not run" for ops that WERE run. Three archive-safe alias passes (applied identically to both sides, so they only ever merge a true pair): * empty "()" — a no-arg C# method call "a.flatten()" now meets NumPy's "a.flatten" * "->" spacing — C# "reshape 2D -> 1D" now meets NumPy's "reshape 2D->1D" * np.around — IS np.round (NumPy alias); C# benchmarks rounding as np.around, NumPy emits np.round, so the whole np.round family was ⚪ despite real data Effect (re-merged from the same archive — no re-run): ⚪ no-data 126 → 116; the np.round family gains 6 real rows (float32/float64 × 3 sizes), a.flatten +2 (100K/10M), reshape 2D->1D +2. Verified against the archive before editing: +10 joined cells, 0 regressions (no previously-matched cell lost), 0 new key collisions. Regenerated benchmark-report.{md,json,csv} + the dashboard (now 840 credible cells, 0.73× geomean) and re-seeded the matrix + dashboard DocFX pages (headers preserved byte-for-byte). The dashboard stays cell-consistent with the report via the canonical ratio/pct fix from the prior commit. NOT fixed here (genuine gaps needing a benchmark re-run, not a name alias): np.prod has no NumPy full-reduction row at all; isnan/isinf/isfinite/isclose/allclose/array_equal/ maximum/minimum have no C# benchmark; amax/amin/mean/std/var axis variants and np.mean on uint*/int16 lack a counterpart on one side.

…lex (NumPy parity) These six complex ufuncs previously threw NotSupportedException from the EmitUnaryComplexOperation default arm, even though NumPy 2.x has complex loops for all of them (csinh/ccosh/ctanh/casin/cacos/catan). This wires them up with full NumPy 2.4.2 parity. Approach (hybrid BCL + C99 fixups, mirroring the existing abs/log2/exp2 pattern): a bit-exact probe over a finite battery showed System.Numerics. Complex matches NumPy to a few ULP on the finite interior, but diverges at 86/360 edge components -- it returns (NaN,NaN) for nearly all inf/NaN inputs instead of the C99 Annex G values, drops the sign of zero on branch cuts, and mishandles arctan's imaginary-axis cut. So: - NDComplexMath.{Sinh,Cosh,Tanh,Asin,Acos,Atan} delegate the finite interior to the BCL and add the C99 fixups: * Non-finite inputs: special-value tables ported from NumPy's msun npy_csinh/ccosh/ctanh, with asin/atan reusing NumPy's own identities asin(z)=i*conj(casinh(i*conj z)) and atan(z)=i*conj(catanh(i*conj z)). * Branch-cut/signed-zero fixups (empirically derived against NumPy and verified on a 64-point signed-zero grid): asin negates Re on x=-0 and Im on y=-0; acos negates Im on the y=+0 cut; atan sets Re=copysign(|y|>1?pi/2:0, x) on the imaginary axis and negates Im on y=-0. * Where this NumPy build's system libm diverges from msun at infinities (sign-preserving sinh(-inf+i*inf).re, cosh's even-function +inf*sin(y) imaginary part, tanh's sign(y) zero, and the genuinely-unspecified zero signs), the helpers match the observed NumPy 2.4.2 output. - DirectILKernelGenerator: register CachedMethods.Complex{Sinh,Cosh,Tanh, Asin,Acos,Atan} (pointing at NDComplexMath, not Complex.* directly) and add the six cases to EmitUnaryComplexOperation. Verification: a bit-exact harness over a 117-point battery (finite + signed zeros + branch cuts + inf/NaN) plus a 64-point grid, diffed against NumPy 2.4.2, gives 1402/1404 components matching (1249 bit-exact, 153 within <=3 ULP). The only 2 residuals are arctan's finite interior (1e-10 tiny input ~8e-8 rel; 100+100j at 3 ULP) -- .NET's Atan kernel is less accurate than NumPy's log1p-based one; an accepted, documented divergence. Tests: - NewDtypesUnaryTests: 9 NumPy-verified cases covering interior, branch cuts, signed zeros, and C99 special values. - Fuzz/MisalignedRegistry: the stale "complex kernel throws" excuse is corrected to Half-only; complex sinh/cosh/tanh/arcsin/arccos are now held to a tight 4-ULP gate (a real regression fails) instead of the blanket complex-unary excuse; arctan stays under the documented blanket for its accepted BCL-interior divergence. All 609 Fuzz + NewDtypes tests pass (net10.0); the 26x5 complex corpus cases for the five tightly-gated ops are all within 4 ULP.

…e nditer branch Replaces the stale PR description (written ~64 commits in, +50k lines) with a complete changelog of everything between the #612 merge-base (5eedb81) and HEAD: 272 commits, 519 files, +198,407/-16,069 per the GitHub compare. Compiled via a two-pass audit: - Pass 1: every commit subject+body mined for features, perf numbers, and breaking changes; APIs/CI/benchmark/corpus facts verified against the live tree (test counts, fuzz corpus wc, Direct partial count, NpyIter LOC). - Pass 2: all 279 local commits re-walked against the draft. Caught and fixed: np.median/percentile/quantile/average/ptp/tile did NOT exist on master (verified via git grep origin/master) — reclassified from 'rebuilt' to new, raising the new-API count 22 -> 30; removed an unverifiable test count; added the 15-dtype hot-path parity item (786d705) and the DefaultEngine->NpyIter Tier-3B production routing. Scope note: SByte/Half/Complex + DateTime64 + casting rounds are PR #612 (already on master) and are intentionally excluded; the local master ref is stale, which is why master..HEAD misleadingly shows 339 commits. The same content (minus the H1) is now the live PR #611 description, pushed via REST PATCH (gh pr edit requires read:org scope the token lacks).

… 1.3–6.1) Branch advanced 31 substantive commits past the first changelog (which described through 33058b8). The branch was rebased meanwhile — the original changelog commit bb7ed7a8 is orphaned, its twin is 4140f4d, and 33058b8 remains an ancestor of HEAD, so 33058b8..HEAD is the true new-work boundary. Learned and folded in: - np.evaluate — Tier-3C fusion made public; per-node NumPy result_type typing (fixes the mixed-tree dtype bug: i4*i4+f8 must wrap in int32 first), fused reductions, EXTERNAL_LOOP guard, out= via ufunc rules. 3.2–6.1x vs NumPy. - out=/where=/dtype= across the elementwise ufunc API (binary, unary-math, comparisons, predicates, bitwise, invert, arctan2) — one NumPy-shaped overload each, exact broadcast/cast/error-text semantics. - New at np.*: bitwise_and/or/xor (were operator-only, CS0117) and positive. - nditer: WRITEMASKED/ARRAYMASK execution + VIRTUAL operands (was silent masked-write corruption); Wave-1.4 fixes (size-1 stride-0 invariant, op_axes OOB, write-broadcast validation, PARALLEL_SAFE, unit-axis absorb). - Alloc Wave 2.4: buffer-pool window 4KiB–1MiB -> 1B–64MiB, pool-side GC pressure, finalizer suppression. - Canonical NpyIter benchmark suite + post-release benchmark.yml CI + DocFX Benchmarks-vs-NumPy website pages; honest frontier findings recorded (broadcast-reduce 54x, scalar np.any 14.5x, BUFFERED+REDUCE ForEach P0 crash, parallel banding 4.7x win). Stats refreshed: 272/519/+198k -> 312 commits, 615 files, +217,949/-16,402. Tests: 9,447 -> 9,709 passed/0 failed (net10.0). New-API count 30 -> 35. Same content (minus H1) pushed live to the PR #611 description via REST PATCH.

…m the changelog Per review: the changelog should describe the final state, not the development path. Removed the temporal 'Latest wave (Waves 1.3–6.1) — added after the first changelog' umbrella section entirely and dissolved its content into the proper topical sections, with all 'wave' terminology and 'added after'/'previously absent'/'now reachable' path-language gone: - np.evaluate folded into §2 (NpyExpr DSL): per-node result_type typing, fused reductions, out= rules, EXTERNAL_LOOP guard, measured speedups. - out=/where=/dtype= ufunc kwargs folded into §5 as a parity subsection. - WRITEMASKED/ARRAYMASK execution, VIRTUAL operands, and the size-1 stride-0 / op_axes-OOB / write-broadcast / PARALLEL_SAFE / unit-axis fixes folded into §1 (capability matrix + bug list); masked-write corruption fix added to §10. - buffer-pool window (1 B–64 MiB), pool-side GC pressure, finalizer suppression folded into §7; TL;DR memory bullet updated. - canonical NpyIter benchmark, benchmark.yml CI, DocFX benchmark pages, and the honest frontier findings folded into §8/§15. - 'NPYITER_GAPS_AND_ROADMAP … 6-wave plan' -> 'prioritized roadmap'. Net: zero 'wave' occurrences; the 16-section topical structure is intact. Same content (minus H1) pushed live to the PR #611 description.

…ndentals Adds AggressiveInlining/AggressiveOptimization to the complex hyperbolic and inverse-trig helpers and restructures them into a hot/cold split, so the JIT folds the per-element math into the IL-emitted unary kernel without a call frame: - Sinh/Cosh/Tanh/Asin/Acos/Atan (+ Abs and the tiny IsNegZero/IsPosZero/ HypotInf/ClogLarge helpers) are marked AggressiveInlining. Each public op is now a tiny finite-path wrapper (finite check -> Complex.* + fixups, or a cold-helper call) so it fits the inliner's budget. - The non-finite C99 special-value tables move into cold helpers (SinhSpecial/CoshSpecial/TanhSpecial/CasinhNonFinite/CacosNonFinite/ CatanhNonFinite) marked AggressiveOptimization -- kept out-of-line (so the hot wrapper stays inlineable) and fully optimized when actually hit. Behavior is identical to the prior inline form (verified below). IL-inlining experiment (the "emit the formula instead of call" question): benchmarked complex sinh both ways over 4M finite elements, median of 15 reps. The real-decomposition formula (Math.Sinh(x)*Math.Cos(y), Math.Cosh(x)*Math. Sin(y)) is bit-identical to Complex.Sinh (0/4M mismatches) but only 1.15x faster than the call; cosh 1.06x; asin/acos/atan have no real-Math.* formula (dominated by complex log/sqrt) so inlining would only drop a wrapper frame. The per-element cost is dominated by the transcendental itself, so emitting ~6 hand-written IL formulas is not worth the duplication/risk -- especially as the call-based kernel is already ~1.56x faster than NumPy 2.4.2 (np.sinh: 26.1 ns/elem vs NumPy 40.9). Decision: keep the handwritten methods; the inlining attributes capture the (small, safe) wrapper-elimination gain. Verified: NewDtypesUnaryTests + Fuzz UnaryExtra (4-ULP complex gate) green (62/62); the hot/cold split changes no results.

…exponential The np.allclose / np.random.exponential working-set leak guards (np.allclose.UsingTests, np.random.exponential.UsingTests) failed in CI with hundreds of MB of working-set growth (e.g. 551 MB on Linux, threshold 20 MB), while passing on Windows. Root cause: both functions allocate several NDArray intermediates per call and never dispose them — the unmanaged buffers ride the finalizer queue instead of being released synchronously. In a tight loop the managed wrappers are tiny so the GC rarely runs, leaving the intermediates LIVE between collections; the allocator can't reuse that memory, so the high-water mark balloons. On glibc (Linux) freed pages are retained in the arena, so the process RSS stays high even after the test's final GC.Collect()+WaitForPendingFinalizers() — hence the large WorkingSet64 delta. np.isclose: |a-b| <= atol + rtol*|b| materialized ~5 float64 temps (≈400 KB each at 50K elements) plus several bool temps, none disposed. Wrapped every fresh allocation in `using` (the elementwise operators/ufuncs each return a new array). x/y come from astype(copy:false), which returns the input itself when no conversion is needed, so they are caller-owned and never disposed here. The final combined array is captured in `using` too: MakeGeneric<bool>() takes its own refcount on the shared buffer, so disposing the backing temp on return keeps the result alive while keeping that last buffer off the finalizer queue. np.random.exponential: β·(-log(1-U)) left uniform, (1-U) and negate() intermediates un-disposed (only the log result was released). Now disposes all of them; only the trailing `* scale` allocates the fresh array returned to the caller. Effect (measured, peak WorkingSet64 growth across a 1000-iter no-GC loop — the CI failure mode): allclose 551 MB -> 3 MB, exponential -> 1 MB. Behavior is unchanged: full suite green on net8.0 and net10.0 (9718 passed / 0 failed under the CI filter), including the Logic fuzz corpus and the isclose/allclose/ exponential unit tests.

…ity with NumPy 2.4.2) The complex (complex128) overloads of the unary math ops deferred to System.Numerics.Complex for their finite interior. The BCL transcendentals diverge from NumPy on a wide range of edge inputs — large magnitudes, the unit circle, tiny/subnormal values, branch cuts and signed zeros — because they do NOT implement the careful FreeBSD msun algorithms NumPy uses. This replaces those deferrals with direct ports of NumPy's own routines in NDComplexMath, verified by a 504-point bit-exact sweep (Python struct-packed int64 references) classifying every result as exact / <=3 ULP / signed-zero / special / sign-flip. Result: 18 of 20 complex unary ops are now at full parity (0 divergence beyond <=3 ULP): exp, log, log10, log2, log1p, expm1, exp2, sin, cos, tan, sqrt, square, reciprocal, negative, sinh, cosh, tanh, arcsin, arctan. Algorithms ported / fixes (src/NumSharp.Core/Utilities/NDComplexMath.cs): - Log (npy_clog): real part = log|z| with the four-regime rescale — |z| huge (x2 down), subnormal (x2^53 up), near the unit circle (0.71<=|z|<=1.73 uses 0.5*log1p((m-1)(m+1)+n^2) via a Goldberg MathLog1p), and 0. Complex.Log cancels the real part to 0 near |z|=1 (e.g. log(1+1e-10 i).real must be 5e-21, not 0). ComplexLog is repointed here, so np.log, np.log2 and np.log10 all inherit the accuracy. - Tanh (npy_ctanh): Kahan's algorithm (t=tan(y); beta=1+t^2; s=sinh(x); rho=sqrt(1+s^2); tanh=(beta*rho*s + i t)/(1+beta*s^2)) plus the |x|>=22 overflow-safe branch. The BCL Complex.Tanh drifts ~33 ULP (tan(1.5) through the tan(z)=-i*tanh(iz) identity). - Sin/Cos/Tan: now route ALWAYS through Sinh/Cosh/Tanh, exactly as NumPy defines npy_csin/ccos/ctan (= -i*sinh(iz) / cosh(iz) / -i*tanh(iz)), so they match NumPy bit-for-bit instead of only on the BCL's finite interior. Fixes sin(0+1e300 i).real = NaN (BCL did cosh(huge)*0); the Sinh/Cosh y==0 guard returns (sinh(x), y)/(cosh(x), x*y) so a large real no longer yields inf*0 = NaN. - Expm1 (nc_expm1): real = expm1(x)*cos(y) - 2 sin^2(y/2), imag = exp(x)*sin(y); the real expm1 fallback uses the Goldberg identity (e^x-1)*x/log(e^x) which recovers the ~10 digits exp(x)-1 cancels and avoids underflow (expm1(1e-300)=1e-300, not 0). Fixes the non-finite imaginary (expm1(+Inf+0i).imag = exp(+Inf)*sin(0) = NaN) and origin signed zeros. - Square (z*z with FMA contraction): (fma(re,re,-(im*im)), fma(re,im,im*re)). NumPy's complex multiply is FMA-contracted, so square(1e-10+1e-10 i).real = -2.275e-37 (exact re^2 minus rounded im^2) and square(1e300+1e300 i).real = -inf; Complex.op_Multiply (no FMA) returned 0 and NaN. - Atan (npy_catanh, full): atanh(x) on the real axis, atan(y) on the imaginary axis, and the log1p(4|x|/sumsq(|x|-1,|y|))/4 interior, plus _sum_squares and an exponent-classified _real_part_reciprocal (raw biased-exponent field, NOT Math.ILogB which maps 0/Inf to int.MinValue/MaxValue and overflows the subtraction). Complex.Atan cancelled / underflowed the tiny imaginary part (arctan(0+1e-10 i).imag must be 1e-10). - Exp (npy_cexp): exp(-Inf + I(Inf|NaN)).imag = copysign(0, y) so exp(-inf-inf i).imag = -0 (the system libm keeps sign(y); npy_cexp's flat (0,0) dropped it). exp2 inherits this. - Reciprocal already used Smith's nc_recip (overflow-safe, correct signed zeros). Engine wiring (DirectILKernelGenerator[.Unary.Decimal].cs): ComplexLog repointed to NDComplexMath.Log; new cached methods ComplexExpm1 and ComplexSquare; the Expm1 and Square cases in EmitUnaryComplexOperation now call the ported helpers instead of inline Complex.Exp(z)-1 / Complex.op_Multiply. Accepted residuals (pathological inputs only, documented in code + the fuzz registry): - cos/sin with a NaN imaginary part: the resulting zero's sign is C99-UNSPECIFIED; the platform libm and the npy_ccos identity pick opposite signs (2 cases). - arccos with a sub-DBL_MIN imaginary part: Complex.Acos flushes the denormal real part to 0 where cacos's _do_hard_work keeps it (~5.8e-309); a denormal-range edge (4 cases). - sinh/cosh at the overflow boundary |x| in [710, 710.13]: Windows' CRT sinh overflows to inf while .NET Math.Sinh stays finite (a platform-libm boundary, absent on glibc). Tests: NewDtypesUnaryTests.cs adds 11 NumPy-2.4.2-verified cases for the huge-imaginary sin/cos, large-real sinh/cosh overflow, Kahan tan accuracy, near-unit-circle clog, scaled log10/log2/log1p, Goldberg expm1, exp2, FMA square, reciprocal signed zeros, catanh tiny/large arctan, and the exp -inf signed zero. Fuzz/MisalignedRegistry.cs tightens the complex-unary gate to <=3 ULP across the whole set (was a 4-ULP gate on 5 ops + a blanket excuse for the rest), narrows the >3-ULP excuse to the named pathological ops, and adds a separate entry for the (pre-existing) complex reduction/scan NaN-ordering divergence the old blanket covered. Full CI-style suite (net10.0, exclude OpenBugs/HighMemory): 9729 passed, 0 failed. net8.0 + net10.0 both build clean.

…-run Adds the benchmark definitions that were missing on one side of the op-matrix join (so the ops showed ⚪ "not run" or were discarded as C#-only), then re-runs the whole official suite (all 14 comparison suites x 3 cache tiers, ~3h) to fill them in with live numbers. Result: ⚪ no-data 130 -> 76, and the headline moves from a stale 0.74x to 1.08x geomean (93%🕐) over 1386 credible cells — the stale figure was dragged down by a broken searchsorted and by simply missing most of NumSharp's fast reductions. NumPy side (numpy_benchmark.py) — C# already benchmarked these; NumPy didn't: * unary: np.tan, np.exp2, np.expm1, np.log2, np.log1p, np.clip(a,-10,10), np.power(a,2|3|0.5) * reduction: np.cumsum (all arithmetic dtypes), np.prod + np.prod axis=0/1, and the axis variants np.amax/np.amin/np.mean axis=0(/1) and np.var/np.std axis=0 All names normalize to the existing C# [Benchmark(Description=...)] so they join 1:1. C# side: * ProdBenchmarks: was non-standard sizes (100/1000/10000) + method-form names (a.prod()); nothing could join it. Switched to the standard Small/Medium/Large tiers and function-form np.prod(a)/np.prod(a, axis=k) — values stay in [0.5,1.0] so the product is overflow-safe at every size. prod now has full + axis coverage (18 cells). * MeanBenchmarks: CommonTypes -> ArithmeticTypes, closing the np.mean uint*/int16 ⚪ holes (15 cells) — matches SumBenchmarks/MinMaxBenchmarks. * LogicBenchmarks: isnan/isinf/isfinite/maximum/minimum/array_equal now join (54 cells). Verified on the fresh run: searchsorted is purged of the 0.0000ms / >1e6x rows (now real, 1.16-1.44x faster), prod/cumsum/all axis reductions/the 6 predicates/mean-on-uint* all matched. Regenerated benchmark-report.{md,json,csv} + dashboard and re-seeded the matrix + dashboard DocFX pages. KNOWN BUG surfaced (left as ⚪): np.isclose and np.allclose DETERMINISTICALLY segfault NumSharp with the unmanaged-storage AccessViolation — each crashes even run alone, and in-class it killed the whole logic suite before BenchmarkDotNet could export anything (took the 6 working predicates down with it). Disabled both in LogicBenchmarks with a documented note; re-enable once the NumSharp isclose/allclose lifetime bug is fixed. The 6 predicates were recovered by running each in its own process (the same per-section isolation the NDIter harness uses for its AV).

…segfault) + doc review Parity review of the complex unary math overloads (commit 416affc). Verified all 20 affected ops across memory layouts (contiguous / F-contiguous / strided / transposed / both negative-stride directions / sliced-offset / broadcast / 0-d / empty) with a fresh bit-exact sweep — every op is layout-correct with 0 divergence — and confirmed the out=/where= ufunc parameters compose bit-exactly with the new complex kernels (exp returns the same out instance; sqrt's where=mask preserves masked-off slots). The review surfaced a pre-existing MEMORY-SAFETY bug (segfault), now fixed: np.exp(complex_array, dtype=float64) # and sqrt/log/log2/log10/log1p/expm1/exp2/sin/cos/tan/ # sinh/cosh/tanh/arcsin/arccos/arctan segfaulted instead of raising. Root cause: ResolveUnaryFloatReturnType honored an explicit dtype= override after only rejecting integer/bool targets (over < Single -> "No loop matching"). It never checked that the INPUT can reach the requested loop dtype by a same_kind cast. For a complex input + real-float dtype=, it returned the real type, ExecuteUnaryOp allocated an 8-byte/element output buffer, and the 16-byte/element complex kernel overran it. NumPy 2.4.2 raises instead: "Cannot cast ufunc 'exp' input from dtype('complex128') to dtype('float64') with casting rule 'same_kind'" Fix: ResolveUnaryFloatReturnType now calls the existing ValidateUnaryInputCast (already used by square/reciprocal/negative, which were NOT affected) on the override path. This reuses NDIterCasting.CanCast(SAME_KIND), so it allows the legal narrowings (int->float32, float64-> float32, float->complex) unchanged and rejects only the cross-kind complex->real cast, emitting NumPy's verbatim message. Probe matrix (complex/float/int inputs x float/complex/int dtype=) now matches NumPy across all 17 float-producing complex ufuncs; the order is preserved (integer dtype= still raises "No loop matching" before the cast check). Also refreshes the NDComplexMath class doc comment, which still described the old fork state ("sinh/cosh/tanh/asin/acos/atan delegate straight to System.Numerics.Complex", "arctan's BCL interior is the lone documented divergence") — it now lists the actual ported algorithms (npy_clog, Kahan ctanh, csinh/ccosh, npy_catanh, npy_cexp/csqrt, nc_expm1/Goldberg, FMA square, nc_recip), the two ops still delegating (asin/acos at parity), and the three accepted pathological residuals. Tests: NewDtypesUnaryTests.cs adds Complex_FloatUfunc_NarrowingDtype_RaisesCastError_NotSegfault (exp/log/sqrt/sin/tanh/arctan: complex+dtype=float64 raises the verbatim cast error, complex+ dtype=int64 raises "No loop matching", complex+dtype=complex128 returns complex). Full CI-style suite (net10.0, exclude OpenBugs/HighMemory): 9730 passed, 0 failed. net8.0 + net10.0 build clean. Note: ceil/floor/round/trunc on complex reject cleanly (no segfault) but with NumSharp's own message rather than NumPy's "ufunc not supported for the input types" — left as-is (out of scope; NumPy has no complex loop for them either). The int->exp2 InvalidProgramException (Single-output kernel) remains a separate, already-tracked bug (fuzz registry W3-C), unrelated to complex.

…o (match/beat NumPy 2.4.2) np.zeros was ~1000x slower than NumPy for large arrays (10M float64: 14.3 ms vs NumPy 0.011 ms). Root cause: it allocated an uninitialized buffer and then ran an eager per-element Fill loop that touched (and zeroed) every byte. NumPy instead delegates zeroing to the OS: PyDataMem_NEW_ZEROED -> calloc, whose demand-zero pages are committed and zeroed lazily on first write, so allocating zeros is effectively O(1) regardless of size (numpy/_core/src/multiarray/alloc.c npy_alloc_cache_zero: small sizes use a cache+memset, large sizes calloc). This ports NumPy's structure. The zeroing is now done by the allocator/OS, never an element loop — correct for all 15 dtypes because the all-zero bit pattern equals default(T) for every one of them (incl. Half, Single, Double, Decimal, Complex). Implementation -------------- - ArraySlice.Allocate(..., fillDefault: true) and Allocate<T>(..., true) now route to UnmanagedMemoryBlock<T>.AllocateZeroed instead of `new UnmanagedMemoryBlock<T>(count, default)` (Take + scalar Fill). All np.zeros overloads, np.zeros_like, np.eye/np.identity, and every internal fill-with-default allocation flow through here. - SizeBucketedBufferPool.TakeZeroed: NativeMemory.AllocZeroed (calloc) with no dirty-bucket reuse — a recycled buffer would force a full memset, discarding the lazy demand-zero win for large sizes and being no cheaper than calloc for small ones. - OsVirtualMemory (new, Windows-only): the Windows process heap eager-commits and memsets mid-size calloc requests (~256 KiB-2 MiB, ~0.05 ms for 800 KiB), unlike glibc/macOS which mmap large blocks lazily. For >= 128 KiB on Windows AllocateZeroed uses VirtualAlloc(MEM_COMMIT) (copy-on-write zero pages, ~0.002 ms) and a new Disposer AllocationType.Virtual that releases straight to the OS via VirtualFree (not pooled). Non-Windows and small sizes stay on calloc, which is already lazy/cheap there. Benchmark fix (pre-existing bug) -------------------------------- CreationBenchmarks returned each created array without disposing, leaking one buffer per op. NumPy's harness (numpy_benchmark.py) discards each result inside the timed loop, so CPython refcount frees it immediately — i.e. NumPy measures alloc+free while the C# benchmark measured alloc-only (unfair) and leaked. Under BenchmarkDotNet's thousands-of-ops-per-iteration, every untouched-but-committed buffer still charges Windows commit, so any fast creation op OOMs at 10M (np.empty(10M) already did; the old np.zeros only escaped by being slow enough to throttle BDN to a couple ops/iteration). All creation benchmarks now dispose per op, matching NumPy and bounding resident memory. Results (this machine, vs NumPy 2.4.2; BDN alloc+free) ------------------------------------------------------ - 10M float64: 14.3 ms -> 0.0033 ms (was ~1000x slower; now 3.1x faster) - medium (100K): 1.7-3.8x faster across i32/i64/f32/f64 - large (10M): 1.1-3.5x faster across i32/i64/f32/f64 - small (1K): ~1.5-2x slower — bounded by NDArray object construction (NDArray/Storage/Shape/ArraySlice/Disposer), shared by all creation APIs and sub-microsecond; the allocation itself is optimal. Tests ----- New Creation/np.zeros.AllocationTests.cs (12 tests): all 15 dtypes zeroed across heap/VirtualAlloc size regimes, full-scan of a multi-MB array, VirtualAlloc writeability/commit correctness, OwnsData, non-aliasing, reuse-after-dispose, multi-dim/high-rank/empty/sliced, default dtype, all overloads. Full CI suite (net8.0 + net10.0, excluding OpenBugs/HighMemory) green: 0 failed, 9742 passed.

…gramException) np.exp2 threw InvalidProgramException for EVERY call resolving to a float32 (Single) output — int16/uint16/char/float32 inputs (via ResolveUnaryFloatReturnType) or any input with dtype=float32. exp2->float64 and exp2->float16 were unaffected. Root cause: DirectILKernelGenerator.Unary.Math.cs, EmitExp2Call, the type == Single branch emitted one spurious instruction. Math.Pow takes (base, exponent), so the exponent must be stashed in a local and pushed after the base. The Single branch pushed an EXTRA `Ldc_R8 2.0` before the Stloc, so: Conv_R8 ; [x_d] Ldc_R8 2.0 ; [x_d, 2.0] <- spurious push Stloc locExp ; [x_d] locExp = 2.0 (saved the constant, NOT the exponent) Ldc_R8 2.0 ; [x_d, 2.0] Ldloc locExp ; [x_d, 2.0, 2.0] Call Math.Pow ; [x_d, 4.0] Pow(2,2)=4, pops 2 Conv_R4 ; [x_d, 4.0f] <- TWO values left on the stack The body left a net +1 on the evaluation stack, so the DynamicMethod failed IL verification -> InvalidProgramException at JIT time. (Latent second defect, masked by the crash: locExp held the constant 2.0, so even a balanced version would have returned Pow(2,2)=4 for every element.) Fix: delete the spurious push so the Single branch mirrors the (correct) Double branch wrapped in Conv_R8/Conv_R4 — Conv_R8 -> save exponent -> push base 2.0 -> push exponent -> Pow -> Conv_R4. exp2(float32[0,.5,1,-1,3]) now returns [1, 1.4142135, 2, 0.5, 8], bit-matching NumPy 2.4.2. Around-the-bug audit (no new findings): - Code: the decimal exp2 path (Unary.Decimal.cs) already saves the exponent correctly, and every other unary transcendental routes through EmitMathCall (MathF.X for Single — one arg, balanced), so exp2's Single branch was the only emitter with this idiom. - Empirical: swept all 20 float-producing unary ufuncs x 12 input dtypes (240 cells) vs NumPy -> 0 InvalidProgram crashes, 0 dtype mismatches; the only value differences are the pre-existing, accepted 1-float32-ULP transcendental gaps (exp/sin/cos/rad2deg, MathF vs NumPy's float32 libm), unrelated to exp2 (which is now exact). Tests: - NpApiOverloadTests_UnaryMath: removed [OpenBugs] from Exp2_WithType_Compiles / Exp2_WithNPTypeCode_Compiles (they reproduced this bug) and strengthened them to assert the 2^x values [1,2,4], not just no-crash (guards the latent locExp=2.0 wrong-value defect). - NewDtypesUnaryTests: new "Exp2 Single-output" region — int16/uint16/char inputs, float32 preserve (incl. fractional sqrt(2)), dtype=float32 override from int and double, a strided float32 view, and the unaffected float64/float16 branches. - MisalignedRegistry: removed the W3-C excuse so any regression of the crash now fails the fuzz gate. Verified: full CI-style suite (net10.0, exclude OpenBugs/HighMemory) 9749 passed / 0 failed; exp2 tests pass on net8.0 + net10.0 (runtime IL emission checked on both JITs).

np.power with a float16 (Half) scalar exponent threw InvalidCastException on the scalar-broadcast path — e.g. np.power(f16[2,3], f16(2)). NumPy 2.4.2 returns float16 ([4,9]). Root cause: Default.Power.cs ReadScalarAsDouble read the size-1 exponent via Convert.ToDouble(object). System.Half does not implement IConvertible, so the boxed Half threw InvalidCastException — even though the scalar-exponent fast paths (ScalarEqualsExact) explicitly list Half as supported. The throw fired from TryScalarExponentFastPath -> IsScalarValueZero -> ScalarEqualsExact -> ReadScalarAsDouble for ANY power call whose exponent value is a Half. Fix: ReadScalarAsDouble casts a boxed Half directly to double (Half is the only supported non-IConvertible dtype that reaches it; Complex is excluded upstream by ScalarEqualsExact's default and the float-only fast-path guard). With this, the fast paths resolve correctly: power(f16[2,3], f16 2) -> [4, 9] (exp=2 multiply) power(f16[2,3], f16 3) -> [8, 27] (general path) power(f16[4,9], f16 0.5) -> [2, 3] (sqrt fast path) power(f16[2,4], f16 -1) -> [0.5, 0.25] (reciprocal fast path) power(f16 2, f16[2,3,4]) -> [4, 8, 16] (scalar-broadcast base) all float16, bit-matching NumPy. Tests: NewDtypesComparisonTests.Half_Power_HalfExponent_ScalarBroadcast covers the formerly-throwing Half-exponent cells (the existing Half_Power used an int exponent, which is IConvertible and never hit the bug). MisalignedRegistry: removed the now-dead W1-B "power(float16): Half scalar path InvalidCast" excuse so a regression of the crash fails the fuzz gate; the binary_divmod_power corpus tally no longer lists it (was 3x). The kept power excuses are separate bugs: W1-F power(*,float16) dtype widening and W1-C power(uint64,int64) negative-exponent throw. Verified: power/Half/Fuzz suites 188 passed / 0 failed (net10.0); net8.0 + net10.0 build clean. This is the documented NEP50 weak-scalar behavior for an INT exponent (power(f16, int_scalar)->f16 vs NumPy float64) is unchanged and out of scope (registry entry #1).

… (W11-A, W6-D) np.maximum / np.minimum (and np.clip) must propagate NaN per NumPy 2.4.2 — the result is NaN whenever EITHER operand is NaN. NumSharp routes all three through the clip SIMD kernel, whose hardware MAXPS/MINPD intrinsics (Avx.Max / Avx.Min, plus the Sse/Avx512 equivalents) return the SECOND operand on an unordered (NaN) compare, silently dropping the NaN on every full SIMD vector. The scalar tail (Math.Max/Min, HalfMaxNaN, ComplexMaxNaN) already propagated, so the bug only surfaced once an array reached the vector width; small arrays (and earlier size-2 spot checks) passed by luck. Reproduction (float64, >= one V256<double> = 4 lanes): np.maximum([1,nan,3,nan,5,nan,7,8], [9,2,nan,4,nan,6,nan,nan]) before: [9, 2, NaN, 4, NaN, 6, NaN, NaN] (a==NaN lanes return b) after : [9, NaN, NaN, NaN, NaN, NaN, NaN, NaN] == NumPy np.clip([nan,1,2,3,4,5,6,7], 2, 5) before: [2, 2, 2, 3, 4, 5, 5, 5] (NaN clamped to a_min) after : [NaN, 2, 2, 3, 4, 5, 5, 5] == NumPy Fix (DirectILKernelGenerator): EmitVectorMinOrMax gains a `propagateNaN` flag. For Single/Double lanes it now emits the NaN-restoring select result = ConditionalSelect(Equals(a, a), hwMinMax(a, b), a) i.e. keep the hardware min/max where a is non-NaN, else re-inject a (= NaN). This is robust regardless of whether the underlying min/max already propagates (verified against Avx.Max/Min AND the cross-platform Vector{N}.Max/Min fallback). The raw hardware emit is factored into EmitRawVectorMinOrMax and is still used verbatim for integer lanes (no NaN) and any opt-out caller, so integer maximum/minimum are byte-for-byte unchanged. The two clip call sites (Max(vec,lo) / Min(vec,hi)) pass propagateNaN: true, fixing maximum (clip a_min), minimum (clip a_max), and clip (both bounds) in one place. clip with a NaN BOUND also propagates, matching NumPy's minimum(maximum(a,lo),hi) composition. Scope: fmax/fmin (W7-B, the NaN-IGNORING opposite contract) still share the clip path and remain divergent — a separate, already-tracked bug. This change does not regress them (their fuzz excuse keys on a NaN appearing in NumSharp's output, which still holds). The F-contiguous/strided element-pairing defect (W7-A) is likewise untouched. Registry: removed the now-dead W11-A (maximum_out/minimum_out), clip_out, and W6-D (clip) NaN excuses from MisalignedRegistry so the differential matrix verifies NaN propagation bit-exact; W7-A and W7-B excuses preserved. Tests: - np.minimum.Test.cs: maximum/minimum NaN propagation on the SIMD path (float64, float32), strided views, the W11-A out=alias case maximum(a, roll(a), out=a), and an int64 no-NaN regression guard. - np.clip.Test.cs: clip(NaN) with scalar bounds, array bounds, and out=alias. All expected values verified against NumPy 2.4.2. Validation: full suite green (9760 passed / 0 failed); FuzzMatrix green incl. Logic, Stat, Aliasing, Tail — the Aliasing corpus now reports zero documented divergences for maximum_out / minimum_out / clip_out.

…ng and float16 signed-zero (W7-A) np.maximum / np.minimum / np.clip (and fmax/fmin, which share the same kernel) all route through the flat clip IL kernel in DirectILKernelGenerator.Clip. That kernel walks src/dst/lo/hi linearly and pairs them by C-order (row-major) index. The engine, however, handed it operands in whatever layout they arrived in: - PrepareBound built an array bound via broadcast_to(...).astype(outType). astype keeps the source layout (order='K'), so an F-contiguous / strided / broadcast bound stayed non-C-contiguous and was read in raw buffer order. - The general src/dst path used Cast(lhs, outType, copy:true), which likewise preserves the source layout, so an F-contiguous / strided source produced a non-C-contiguous working buffer. - A non-C-contiguous user out= was clipped in place in its own (F/strided) buffer, again mismatching the C-order bound. Result: every element was paired with the wrong bound whenever a second operand (or the source, or out=) was F-contiguous / transposed / strided / negative-strided / broadcast. maximum(a, b.T) returned b's raw buffer instead of the element-wise max. This is the W7-A fuzz divergence (34x in logic.jsonl). Fix (Default.ClipNDArray): normalize every operand the flat array-bounds kernel touches to C-contiguous offset-0 before the call. - PrepareBound copies the prepared bound to C-order when it isn't already C-contiguous offset-0. - The general path re-materializes a C-order copy of dst when lhs is F-contiguous / strided / offset. - A non-C-contiguous out= is clipped into a C-order temp and copied back via np.copyto (which honors the out's strides), preserving the return-the-out instance contract. - Scalar bounds are order-insensitive (one value per lane) and keep their existing fast paths untouched. Half signed-zero tie-break (DirectILKernelGenerator.Clip): the residual W7-A divergence after the pairing fix was float16-only and showed even on both-contiguous inputs - a separate bug. NumPy's float16 maximum/minimum return the FIRST operand on a tie, so maximum(+0,-0)=+0, maximum(-0,+0)=-0 (likewise minimum). HalfMaxNaN/HalfMinNaN used strict >/< and returned the SECOND operand on the +0/-0 tie (wrong sign). Switched to >= / <= so the tie resolves to the first operand, matching NumPy. Only the signed-zero result changes; every non-zero equal pair is bit-identical. float32/64 use the SIMD path / Math.Max and were already correct. Verified bit-exact vs NumPy 2.4.2 across C / F / strided / transposed / negative-stride / broadcast / 3-D layouts, mixed-dtype bounds, two-bound clip, and non-C-contiguous out=. The logic.jsonl fuzz corpus now reports zero W7-A divergences (was 34x); maximum/minimum are bit-exact on every layout. MisalignedRegistry: removed the now-dead W7-A pairing excuse (the fallback for extrema value divergences). fmax/fmin keep only the NaN-propagation excuse (W7-B, a separate ignore-NaN bug); the isclose strided-complex excuse (its own comparison kernel, not clip-routed) remains. Tests (np.minimum.Test.cs): added F-contiguous bound, F-contiguous source, two-bound F-contiguous clip, strided-slice, non-C-contiguous out=, 3-D transposed, float16 F-contiguous, and float16 signed-zero tie-break regressions. Added a flat (logical C-order) assertion helper since GetDouble(int) on a multi-dimensional result indexes the first axis, not the flat buffer. Full net10.0 suite: 9769 passed / 0 failed.

The Half/Complex clamp helpers (HalfMaxNaN, HalfMinNaN, ComplexIsNaN, ComplexMaxNaN, ComplexMinNaN) are invoked once per element from the clip IL kernel's inner loop - the non-SIMD scalar path for Half/Complex, which have no Vector<T> Min/Max. Marked them [MethodImpl(AggressiveInlining | AggressiveOptimization)]: - AggressiveInlining lets the JIT fold them into the generated kernel where it can (e.g. ComplexIsNaN into ComplexMax/MinNaN, and the helpers into the dynamic-method body), removing the per-element call. - AggressiveOptimization gives them full tier-1 codegen from the first call when they are invoked standalone via the kernel's Call opcode, skipping the tier-0 quick-JIT penalty on a hot path. Scope: only the per-element helpers in the 'Per-Element Helpers for Non-SIMD Types (called from generated IL)' region. The one-time IL-codegen methods (Generate, Emit*, etc.) are intentionally left untouched - they run once per (dtype, mode, kind) and are cached, so AggressiveOptimization would only add JIT startup cost with no steady-state benefit.

… op helpers Following the clip per-element helpers (d72822f), tag the other small per-element scalar/generic op helpers that run inside kernel inner loops with [MethodImpl(AggressiveInlining | AggressiveOptimization)] - inline into the loop where the JIT can, full tier-1 codegen otherwise (skip the tier-0 quick-JIT penalty on the hot path). Tagged (8 helpers, 3 files): - ComplexDivideNumPy (DirectILKernelGenerator.cs) - per-element complex divide called from the binary IL kernel. - HalfSignHelper (Unary.Decimal.cs) - per-element Half sign called from the unary IL kernel. - Reduction.Axis.cs accumulator helpers, each called once per element inside the axis-reduction loops: ReadAsDouble, CombineScalarsPromoted, ComplexLexPick, DivideByCount, ConvertToDouble<T>, ConvertFromDouble<T>. Their generic typeof(T) branches constant-fold to a single cast per instantiation, so the inlined bodies are tiny. Deliberately NOT tagged: - Whole-array / whole-axis loop helpers (Nan*SimdHelper, Arg*Helper, CopyGeneralSameType, the *ReduceAxis* loops, scan helpers): they ARE the loop, called once per op/axis with large bodies - AggressiveInlining is ignored on them, and .NET on-stack-replacement already promotes their long loops to tier-1. Can be revisited with AggressiveOptimization-only if profiling shows a tier-0 entry cost. - One-time IL-codegen methods (Generate/Emit*): run once per cached kernel key, so AggressiveOptimization would only add JIT startup cost with no steady-state benefit. JIT-only change - no behavior change. Full net10.0 suite: 9769 passed / 0 failed; FuzzMatrix 42/42.

…per-element op helpers Sweep the remaining small per-element / per-vector / per-T op helpers in the kernel hot core so they all carry [MethodImpl(AggressiveInlining | AggressiveOptimization)], extending the clip (d72822f) and reduction-accumulator (06d942c) passes. Verified the two flags compose with NO penalty: a clean per-variant microbenchmark (dedicated AggressiveOptimization loop per variant, 1.5e9 iters) measured AggressiveInlining = AggressiveInlining|AggressiveOptimization = AggressiveOptimization-only = 740-741 ms, vs NoInlining = 1106 ms. AggressiveOptimization does NOT suppress the inlining that AggressiveInlining requests; it only adds tier-1-from-first-call for the standalone version (relevant when a helper is reached via the kernels' generated-IL Call rather than inlined). Changes: - Upgraded 131 helpers that were already AggressiveInlining-only to add AggressiveOptimization (Reduction.Axis.Simd.cs, Reduction.Axis.Widening.cs, Reduction.Axis.VarStd.cs, Quantile.cs, Modf.cs, Where.cs, IndexCollector.cs, SimdMatMul.cs, SimdMatMul.Strided.cs, StrideDetector.cs). - Added the full attribute to 6 previously-untagged per-element / per-vector / seed helpers: CompareGreater<T>, CompareLess<T> (Reduction.Axis.Arg.cs - argmax/argmin per-element compare), GetNanIdentityValue<T> (Reduction.Axis.NaN.cs), GetIdentityValueTyped<T> (Reduction.Axis.cs), HorizontalReduce256<T>, HorizontalReduce128<T> (Reduction.Axis.Simd.cs - per-vector finalize). - Added using System.Runtime.CompilerServices; to the three files that lacked it. No AggressiveInlining-only attributes remain in Backends/Kernels/ (150 now carry both flags). JIT-only change - no behavior change. Full net10.0 suite: 9769 passed / 0 failed; FuzzMatrix 42/42.

…elpers codebase-wide Completes the sweep started in the kernels (e3685be): every method already marked [MethodImpl(MethodImplOptions.AggressiveInlining)] now also carries AggressiveOptimization. The two flags compose with no penalty (benchmarked: AggressiveInlining = both = AggressiveOptimization-only at 740-741ms, all far ahead of NoInlining at 1106ms) and the change is JIT-only / behavior-neutral. 396 helpers across 48 files upgraded - Primitives (NDArray operators), Backends/Iterators (NDIter/NDIterator per-element traversal), Utilities + Utilities/SpanSource, Backends/Default/Math + BLAS, Manipulation, Creation, RandomSampling, Backends/Unmanaged/Pooling, DateTime64, etc. No AggressiveInlining-only attributes remain in NumSharp.Core (550 now carry both flags). AggressiveOptimization is a no-op for methods that inline (most of these); its benefit is tier-1-from-first-call for the standalone version when a helper is reached via a non-inlined path (generated-IL Call, delegate, or a body too large to inline). Every changed line is the attribute swap - no behavior or body changes. Full net10.0 suite: 9769 passed / 0 failed; net8.0 core builds clean.

@10m

… axis-reduction perf Investigation artifacts only — no library code changed. complex128 axis reductions were the next-worst op cluster after np.zeros (sum/mean axis=0/1 ran 19–64× slower than NumPy 2.4.2). These POCs find the root cause and benchmark every candidate fix against NumPy 2.4.2, all re-runnable via `dotnet run -c Release` in benchmark/poc. Root cause (complex_axis_probe / complex_alloc_probe / complex_why_probe): - NOT boxing. np.sum(complex) allocates ~0 B/elem; the (object) casts in CombineScalarsPromoted JIT-elide. The earlier "30M boxes" theory was wrong. - The real cost is scalar System.Numerics.Complex add where NumPy SIMD-vectorizes the reduction (complex == contiguous double pairs). A double-pair Vector256 add reaches NumPy parity: 7.52 vs 7.59 ms @10m, and beats it @1m (0.31 vs 0.33). - Secondary: the scalar fallback's non-inlined per-element combine + op-switch + cache-hostile per-output column gather (axis=0); mean's MeanAxisComplex allocates an NDArray view + iterator per output row (6 B/elem). - Small-N gap is compute (scalar), not allocation: np.zeros((N,)) output is ~4% of the op; pure compute is ~93%. Candidate benchmark (complex_reduce_poc) — baseline / NDAxisIter / NDIter-2op via ForEach / NDIter-2op via ExecuteGeneric / monolithic generic-math, both axes, 10K–10M, plus strided input, Half/Decimal sweep, and alloc/op: - Monolithic generic-math is the most uniform (1.3–2.5× NumPy, never bad). - NDAxisIter beats NumPy on inner-reduce (axis=1) but is cache-hostile on outer-reduce (axis=0, 6.6× @10m — per-output strided column gather). - NDIter 2-op REDUCE is good on axis=0 but 5.8× on axis=1 (BUFFER+REDUCE inner double-loop unfinished) and allocates 2344 B/op for iterator construction. - Half needs a wider <TInput,TAccum> accumulator (same-type is precision-wrong); Decimal/Complex are correct same-type. generic_math_proto: proves .NET generic math (IAdditionOperators / IAdditiveIdentity) yields ONE zero-boxing kernel covering Complex AND all numerics; np.cumsum(complex) already rides this exact path (CumSumAxisKernel<T>). Fallback handling (complex_fallback_poc): layout-aware dispatch keyed on which LOGICAL axis is contiguous (NumPy's strategy), not the physical last axis — output-axis stride==1 -> SIMD slab-fold reduce-axis stride==1 -> SIMD 2-lane (re,im) contiguous reduce neither -> strided scalar, smaller-stride axis innermost Covers C-contig / row-sliced a[::2,:] / col-sliced a[:,::2] / transposed a.T, all correct, 0.79–1.23× NumPy (beats it on 4 of 8). Mirrors the fast-path dispatch AxisReductionSimdHelper already has for numeric dtypes, so the real fix = reuse that dispatch + plug in two complex double-pair SIMD primitives. Decision pending: (A) extend AxisReductionSimdHelper with Complex-as-double-pairs (fastest, lowest risk, matches existing reduction design) vs (B) migrate reductions to NDIter (long-term target; needs inner-reduce double-loop + small-N overhead work).

…e open item Updates the combinatorial-indexing handover to the executed reality: all five divergence buckets (R1 value-broadcast on empty/scalar selections, B2 0-d-base over-index, B3 empty advanced gather, B4 empty-slice assignment, R2 non-consecutive 0-d-bool placement + the Shape.Broadcast hash collision it uncovered) are fixed, committed (aea9fc78..7e968f5e), and pinned by the new Indexing.CombinatorialParity [FuzzMatrix] gate. The random sweep is 0 divergences across every measurable window. The only open item is R3, a pre-existing flaky teardown heap-corruption. Records the diagnostics gathered this session that supersede the prior "pinned managed" guess: it is a specific corpus shape (not allocation volume), it is NOT a pooled native-buffer overrun (end-aligned guard pages run clean), and it resists deterministic repro (crash point varies 6285-9879; ~1/3 even in the tightest window). Next-session plan: a per-case red-zone on FromArray / the direct-VirtualAlloc zeroed path, or a dotnet-dump capture. Index_Random stays [OpenBugs] purely on the crash, not parity.

Mechanical, behavior-preserving rename of NumSharp's iterator/expression stack from the `Npy*` prefix to `ND*`. ND* matches NumSharp's house style (`NDArray` ↔ numpy.ndarray, the retired NDIterator) and NumPy's Python `nditer` name. The old `Npy*` prefix mirrored NumPy's C struct name (`NpyIter`); `ND*` is the user-facing convention used everywhere else. Scope (NumSharp-owned only): - Types/delegates/enums: NpyIter→NDIter, NpyIterRef→NDIterRef, NpyExpr→NDExpr, NpyIterState/Flags/PerOpFlags/GlobalFlags/OpFlags, NpyFlatIterator→NDFlatIterator, NpyAxisIter→NDAxisIter, NpyMemOverlap→NDMemOverlap, reduction-kernel structs + interfaces (INpy…→IND…), NpyArrayMethodFlags→NDArrayMethodFlags, and the NpyIter_* C-API-style method names → NDIter_*. - Utilities: NpyComplexMath→NDComplexMath, NpyDivision→NDDivision, NpyIntegerPower→NDIntegerPower. - Benchmark subsystem: benchmark/npyiter → benchmark/nditer (npyiter_{bench,sheet,cards,results,headline} → nditer_*, --skip-npyiter → --skip-nditer). - 65 files renamed via git mv; ~190 files content-swept; website docs, docs/numpy notes, and frozen benchmark/history snapshots included. Preserved (genuine NumPy references, NOT the stack): - src/numpy/** (the upstream clone — NpyIter is NumPy's real C type). - The .npy/.npz file format: `#region NpyFormat` (np.save/np.load) and the SaveAndLoadWithNpyFileExt test. - NumPy's C function names quoted in docs (npyiter_allocate_arrays, npyiter_coalesce_axes, … kept verbatim). Build: solution green (0 errors). Tests: 10980 passed, 0 failed, 11 skipped (net10.0, CI filter TestCategory!=OpenBugs&!=HighMemory). Branch commit messages were rewritten Npy→ND separately (message-only history rewrite; file blobs in historical commits untouched). This commit is registered in .git-blame-ignore-revs as a mechanical rename.

…→ND rename 1. Expand all entries to full 40-char SHAs — git blame --ignore-revs-file ABORTS on abbreviated names ("fatal: invalid object name: ac02033"). 2. Register the mechanical Npy→ND rename commit (301229b).

…e nditer branch Replaces the stale PR description (written ~64 commits in, +50k lines) with a complete changelog of everything between the #612 merge-base (5eedb81) and HEAD: 272 commits, 519 files, +198,407/-16,069 per the GitHub compare. Compiled via a two-pass audit: - Pass 1: every commit subject+body mined for features, perf numbers, and breaking changes; APIs/CI/benchmark/corpus facts verified against the live tree (test counts, fuzz corpus wc, Direct partial count, NDIter LOC). - Pass 2: all 279 local commits re-walked against the draft. Caught and fixed: np.median/percentile/quantile/average/ptp/tile did NOT exist on master (verified via git grep origin/master) — reclassified from 'rebuilt' to new, raising the new-API count 22 -> 30; removed an unverifiable test count; added the 15-dtype hot-path parity item (786d705) and the DefaultEngine->NDIter Tier-3B production routing. Scope note: SByte/Half/Complex + DateTime64 + casting rounds are PR #612 (already on master) and are intentionally excluded; the local master ref is stale, which is why master..HEAD misleadingly shows 339 commits. The same content (minus the H1) is now the live PR #611 description, pushed via REST PATCH (gh pr edit requires read:org scope the token lacks).

… 1.3–6.1) Branch advanced 31 substantive commits past the first changelog (which described through 33058b8). The branch was rebased meanwhile — the original changelog commit bb7ed7a8 is orphaned, its twin is 4140f4d, and 33058b8 remains an ancestor of HEAD, so 33058b8..HEAD is the true new-work boundary. Learned and folded in: - np.evaluate — Tier-3C fusion made public; per-node NumPy result_type typing (fixes the mixed-tree dtype bug: i4*i4+f8 must wrap in int32 first), fused reductions, EXTERNAL_LOOP guard, out= via ufunc rules. 3.2–6.1x vs NumPy. - out=/where=/dtype= across the elementwise ufunc API (binary, unary-math, comparisons, predicates, bitwise, invert, arctan2) — one NumPy-shaped overload each, exact broadcast/cast/error-text semantics. - New at np.*: bitwise_and/or/xor (were operator-only, CS0117) and positive. - nditer: WRITEMASKED/ARRAYMASK execution + VIRTUAL operands (was silent masked-write corruption); Wave-1.4 fixes (size-1 stride-0 invariant, op_axes OOB, write-broadcast validation, PARALLEL_SAFE, unit-axis absorb). - Alloc Wave 2.4: buffer-pool window 4KiB–1MiB -> 1B–64MiB, pool-side GC pressure, finalizer suppression. - Canonical NDIter benchmark suite + post-release benchmark.yml CI + DocFX Benchmarks-vs-NumPy website pages; honest frontier findings recorded (broadcast-reduce 54x, scalar np.any 14.5x, BUFFERED+REDUCE ForEach P0 crash, parallel banding 4.7x win). Stats refreshed: 272/519/+198k -> 312 commits, 615 files, +217,949/-16,402. Tests: 9,447 -> 9,709 passed/0 failed (net10.0). New-API count 30 -> 35. Same content (minus H1) pushed live to the PR #611 description via REST PATCH.

Rename the two standalone benchmark projects to names reflecting what they are: - NumSharp.Benchmark.GraphEngine → NumSharp.Benchmark.CSharp — the comprehensive BenchmarkDotNet operation suite; pairs with NumSharp.Benchmark.Python (numpy baseline). - NumSharp.Benchmark.Exploration → NumSharp.Benchmark.Sandbox — the SIMD & broadcasting experiment playground. Project dirs, .csproj (AssemblyName/RootNamespace), all namespace/using decls (incl. the partial-qualifier Exploration.Infrastructure → Sandbox.Infrastructure), and external referencers. Neither project is in SciSharp.NumSharp.sln. Both build clean (0 errors). NOTE: benchmark/CLAUDE.md excluded — concurrent uncommitted edits by another actor; its references will be swept in a follow-up. Left as prose: the Sandbox banner "Performance Exploration Suite" and a "#541" reference.

@10m

…3.0; rename GraphEngine->CSharp refs RELEASE_0.60.0.md (new) — comprehensive release notes for NumSharp 0.60.0, the first stable, consolidating everything since 0.50.0 (the nditer engine, np.evaluate fusion, full advanced-indexing parity, byte-exact printing, C/F/A/K, np.random MT19937 parity, 36+ new np.* APIs, stride-native matmul, ARC, differential fuzzing). Benchmark claims pinned to the committed 2026-06-23 history snapshot: - TL;DR geomean corrected to 1.26x @10m (397 faster / 150 close / 42 slower of 615 ops; 1.14x @1k, 0.90x @100k) -- was the stale 1.00x/409-op June-5 figure. - Added a "Benchmark harness" section in section 8: the op/dtype/N matrix (1,851 cells) plus the five iterator/layout/operand/cast/fusion subsystems, history snapshots + latest symlink, and the NPY/NS convention. - section 8 cast row refreshed to the final campaign state (716 -> 129 lagging of 1,568 comparable; 1,439 winning) with the per-src narrow-int geomeans. - section 15 now discloses the benchmark-surfaced slower paths (small-N dispatch, scalar Half/Decimal, large-N any full-scan, comparison->bool, fancy gather). - np.random (MT19937 + exactly 24 new distribution files) and the split funcs verified new-since-0.50.0 against ground truth; np.random added to the opening. - Named the renamed C# bench project (benchmark/NumSharp.Benchmark.CSharp). benchmark/CLAUDE.md — brought the development guide from v2.0 (2026-02-13) to v3.0: run_benchmark.py documented as THE orchestrator; metrics/type-system/categories/ python-suite/directory sections updated to current state -- 15 dtypes (added int8/float16/complex128), ~615 ops/size (1,851 cells), 14 comparison suites + 5 appended subsystems, 3 cache tiers (1K/100K/10M), NPY/NS convention, committable history/ snapshots; run-benchmarks.ps1 marked legacy. All 13 NumSharp.Benchmark.GraphEngine references -> NumSharp.Benchmark.CSharp (folder/csproj/namespace/paths) for the rename in 002bc62. .github/workflows/benchmark.yml — stale "builds GraphEngine+Core" step comment -> "builds the CSharp bench project + Core".

Deleted the np-tests skill (write/migrate NumPy tests for NumSharp): - .claude/skills/np-tests/SKILL.md (the skill definition, ~3 KB) - .agents/skills/np-tests/SKILL.md (symlink to the above) Nothing else references the skill; the np-function skill is unaffected.

Release prep for the 0.60.0 major release — the nditer rewrite that aligns NumSharp's compute core with NumPy 2.4.2. - Version bump 0.40.0 -> 0.60.0 across NumSharp.Core and NumSharp.Bitmap (AssemblyVersion / FileVersion / Version / PackageVersion). The tagged CI release also injects the version from the v* tag; this keeps the repo metadata correct for local and non-tagged builds. - Add docs/releases/RELEASE_0.60.0.md — the full 0.60.0 release notes: NumPy nditer port, np.evaluate fusion, full advanced-indexing parity, byte-exact array printing, C/F/A/K memory layout, np.random rebuilt on MT19937 + 24 new distributions, all 15 dtypes (int8/float16/complex128), stride-native matmul, the astype SIMD campaign, ARC memory management, and differential fuzzing. - CI (build-and-release.yml): the GitHub Release body is now composed from docs/releases/RELEASE_<version>.md (install header + packages table + notes), falling back to auto-generated notes only when no notes file is present. - CLAUDE.md refreshed to the final/complete project state. - Website docs updated (il-generation, index, theme CSS).

Nucs · 2026-06-28T17:47:47Z

Yes, I reviewed 600 files. Deal with it.

Nucs force-pushed the nditer branch from f5c05a7 to 574a0d8 Compare April 23, 2026 09:34

Nucs mentioned this pull request Apr 28, 2026

Add NDIterator<T> overload with support for specific axis. #363

Open

Nucs mentioned this pull request May 17, 2026

[Core] Layout 'F/A/K' support #610

Open

8 tasks

Nucs added 8 commits June 13, 2026 20:37

Nucs added 15 commits June 13, 2026 23:13

Nucs added 3 commits June 27, 2026 19:00

Nucs force-pushed the nditer branch from 2d0fc6b to d08c296 Compare June 27, 2026 16:16

Nucs added 3 commits June 28, 2026 08:02

chore(blame): ignore the benchmark-project rename (002bc62)

b0bdd55

Nucs added 4 commits June 28, 2026 13:44

docs: add benchmark dashboard

e8ce1a9

Update toc.yml

400a86f

docs: highlight extreme benchmark wins

857c070

docs: add benchmark function explorer

3f00304

Nucs marked this pull request as ready for review June 28, 2026 12:41

Nucs self-assigned this Jun 28, 2026

Nucs added 2 commits June 28, 2026 17:31

Nucs merged commit 680f349 into master Jun 28, 2026
3 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Major Release 0.60.0] NumPy nditer port, NDExpr/np.evaluate fusion, full advanced-indexing parity, byte-exact printing, C/F/A/K layout, stride-native matmul, 15 dtypes#611

[Major Release 0.60.0] NumPy nditer port, NDExpr/np.evaluate fusion, full advanced-indexing parity, byte-exact printing, C/F/A/K layout, stride-native matmul, 15 dtypes#611
Nucs merged 567 commits into
masterfrom
nditer

Nucs commented Apr 22, 2026 •

edited

Loading

Uh oh!

Nucs commented Jun 5, 2026

Uh oh!

Uh oh!

Nucs commented Jun 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

Nucs commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

The progress to 0.60.0

TL;DR

1. NDIter — full NumPy nditer port

Execution at NumPy speed

2. NDExpr DSL + np.evaluate (fusion)

3. Full advanced-indexing parity + memory-safety

4. Byte-exact NumPy array printing

5. C/F/A/K memory-layout support

6. New & completed np.* APIs

out= / where= / dtype= ufunc kwargs (NumPy parity)

Random sampling — NumPy RNG parity (np.random)

7. Linear algebra

8. Performance (engine, reductions, shifts, sort, casts)

High-performance type casting

9. Memory management — ARC + IDisposable

10. Validation & benchmarking infrastructure (NumPy as the oracle)

The NumPy differential oracle (correctness)

The benchmark harness (performance)

11. Legacy stacks deleted

12. New dtypes — int8, float16, complex128 (+ Char8/DateTime64)

13. Examples — trainable MNIST MLP

14. Cross-platform

15. Tests & CI

Key highlights since 0.40.0

0.41.0 — IL Kernel Generator (the compute-core rewrite, Mar 2026)

0.50.0 — Long Indexing (>2 GB arrays + the type system, Apr 2026)

0.60.0 — nditer (this PR — the first stable)

Breaking changes

Uh oh!

Nucs commented Jun 5, 2026

📊 Benchmark & performance — nditer

1. Fused strided-SIMD unary IL kernel (d01f1d63)

2. Official NumSharp-vs-NumPy benchmark (6038990f)

Reproducibility

Uh oh!

Uh oh!

Nucs commented Jun 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Nucs commented Apr 22, 2026 •

edited

Loading

1. `NDIter` — full NumPy `nditer` port

2. `NDExpr` DSL + `np.evaluate` (fusion)

6. New & completed `np.*` APIs

`out=` / `where=` / `dtype=` ufunc kwargs (NumPy parity)

Random sampling — NumPy RNG parity (`np.random`)

9. Memory management — ARC + `IDisposable`

12. New dtypes — `int8`, `float16`, `complex128` (+ `Char8`/`DateTime64`)

📊 Benchmark & performance — `nditer`

1. Fused strided-SIMD unary IL kernel (`d01f1d63`)

2. Official NumSharp-vs-NumPy benchmark (`6038990f`)