Source↔sink reachability API over the call graph (forward / reverse / path witnesses)

## Summary

We now have a precise interprocedural call graph available two ways: in-memory as a NetworkX `DiGraph`, and — with the new TypeScript Neo4j backend (`TSNeo4jBackend`) — as a property graph queryable with Cypher. On top of that call graph we should expose a small, composable family of **reachability** queries whose purpose is **source↔sink matching**.

This is intentionally broader than a single "reverse reachability" method. The objective is a coherent API in which **sources** and **sinks** are just sets of call-graph nodes, and we can answer the three fundamental questions between them (forward, reverse, and witness paths). Source↔sink matching is the headline use case; the three primitives compose into it.

## Motivation

"Can untrusted input reach a dangerous operation, and how?" is the core question behind taint/data-flow-style analysis, attack-surface mapping, and security triage. With the call graph as the substrate we can answer the call-graph-reachability approximation of this cheaply — especially over Neo4j, where transitive reachability is exactly what the engine is built for.

## Terminology

- **Source** — a callable where interesting/untrusted data originates: an HTTP route handler / entrypoint, `process.env`, `req.body`, a deserialization entry, etc.
- **Sink** — a callable or **external** (phantom/library) symbol where it is dangerous/interesting for data to arrive: `child_process.exec`, `fs.writeFile`, a DB driver call, `node:crypto.*`, etc.
- **Reaches** — `a` reaches `b` iff there is a directed path `a -CALLS*-> b` in the call graph.

Sources and sinks should be expressible flexibly: explicit signature sets, predicates (e.g. "every `External` symbol whose module is `node:child_process`"), or built-in catalogs (entrypoints as sources; a curated list of dangerous externals as sinks). The graph already distinguishes internal callables from `External` phantom symbols and marks entrypoints, so these catalogs are largely already present.

## The family of operations

1. **Forward reachability** — given a source, which sinks does it reach? → #171
2. **Reverse reachability** — given a sink, which sources reach it? → #172  *(the originally requested direction)*
3. **Path witnesses** — given a (source, sink) pair, return the connecting call-graph path(s) — the trace that explains *why* they match. → #173
4. **Many-to-many matching** *(composed)* — given a set of sources and a set of sinks, return the connected `(source, sink)` pairs, optionally each with a witness path. This is the headline "source↔sink matching" capability; it is built on 1–3.

## Why this fits the graph backends

The key insight is that **reverse reachability needs no new data** — it is the forward query with the edge direction flipped. The same holds across backends:

| Operation | Neo4j (Cypher) | In-memory (NetworkX) |
|---|---|---|
| Forward | `MATCH (s {signature:$src})-[:CALLS*1..]->(t)` | `nx.descendants(G, src)` |
| Reverse | `MATCH (s)-[:CALLS*1..]->(t {signature:$sink})` | `nx.ancestors(G, sink)` |
| Paths | `shortestPath(...)`, `allShortestPaths(...)`, APOC `apoc.path.*` for bounded k-paths | `nx.shortest_path`, `nx.all_shortest_paths`, `nx.all_simple_paths(cutoff=…)` |

Neo4j makes the transitive case a one-liner and scales it; the in-memory backends give identical semantics for languages without a graph DB.

## Design notes / open questions

- **Where it lives**: these belong on the backend contract (`TSAnalysisBackend` / `JavaAnalysisBackend` / `PythonAnalysisBackend`) and surfaced on the façades, so every language with a call graph gets them — Neo4j just makes them cheap. (The backend ABCs landed in #159.)
- **Parameters worth supporting**: `max_depth` / hop bound; `max_paths`; shortest-only vs. all; include/exclude `External` (phantom) nodes; filter by edge `provenance` / dispatch tags; restrict to a module subtree.
- **Return shapes**: matched node signatures, and where relevant the witness path as a list of signatures plus per-edge metadata (so a result is auditable, not just a boolean).
- **Sources/sinks as predicates and named catalogs**, not just explicit sets (proposed: yes).
- **Bounding**: simple-path enumeration can explode; default to shortest/`k`-bounded with an explicit opt-in for exhaustive enumeration, and always surface when results were truncated.

## Tracking

- [ ] #171 — Forward reachability (source → reachable sinks)
- [ ] #172 — Reverse reachability (sink → reaching sources)
- [ ] #173 — Path witnesses (source, sink → connecting paths)

The composed many-to-many "matching matrix" is the end goal and can be a thin convenience over the three once they exist.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Source↔sink reachability API over the call graph (forward / reverse / path witnesses) #170

Summary

Motivation

Terminology

The family of operations

Why this fits the graph backends

Design notes / open questions

Tracking

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Operation	Neo4j (Cypher)	In-memory (NetworkX)
Forward	`MATCH (s {signature:$src})-[:CALLS*1..]->(t)`	`nx.descendants(G, src)`
Reverse	`MATCH (s)-[:CALLS*1..]->(t {signature:$sink})`	`nx.ancestors(G, sink)`
Paths	`shortestPath(...)`, `allShortestPaths(...)`, APOC `apoc.path.*` for bounded k-paths	`nx.shortest_path`, `nx.all_shortest_paths`, `nx.all_simple_paths(cutoff=…)`

Source↔sink reachability API over the call graph (forward / reverse / path witnesses) #170

Description

Summary

Motivation

Terminology

The family of operations

Why this fits the graph backends

Design notes / open questions

Tracking

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions