Skip to content

Source↔sink reachability API over the call graph (forward / reverse / path witnesses) #170

Description

@rahlk

Summary

We now have a precise interprocedural call graph available two ways: in-memory as a NetworkX DiGraph, and — with the new TypeScript Neo4j backend (TSNeo4jBackend) — as a property graph queryable with Cypher. On top of that call graph we should expose a small, composable family of reachability queries whose purpose is source↔sink matching.

This is intentionally broader than a single "reverse reachability" method. The objective is a coherent API in which sources and sinks are just sets of call-graph nodes, and we can answer the three fundamental questions between them (forward, reverse, and witness paths). Source↔sink matching is the headline use case; the three primitives compose into it.

Motivation

"Can untrusted input reach a dangerous operation, and how?" is the core question behind taint/data-flow-style analysis, attack-surface mapping, and security triage. With the call graph as the substrate we can answer the call-graph-reachability approximation of this cheaply — especially over Neo4j, where transitive reachability is exactly what the engine is built for.

Terminology

  • Source — a callable where interesting/untrusted data originates: an HTTP route handler / entrypoint, process.env, req.body, a deserialization entry, etc.
  • Sink — a callable or external (phantom/library) symbol where it is dangerous/interesting for data to arrive: child_process.exec, fs.writeFile, a DB driver call, node:crypto.*, etc.
  • Reachesa reaches b iff there is a directed path a -CALLS*-> b in the call graph.

Sources and sinks should be expressible flexibly: explicit signature sets, predicates (e.g. "every External symbol whose module is node:child_process"), or built-in catalogs (entrypoints as sources; a curated list of dangerous externals as sinks). The graph already distinguishes internal callables from External phantom symbols and marks entrypoints, so these catalogs are largely already present.

The family of operations

  1. Forward reachability — given a source, which sinks does it reach? → Reachability API: forward reachability (given a source, find all reachable sinks) #171
  2. Reverse reachability — given a sink, which sources reach it? → Reachability API: reverse reachability (given a sink, find all sources that reach it) #172 (the originally requested direction)
  3. Path witnesses — given a (source, sink) pair, return the connecting call-graph path(s) — the trace that explains why they match. → Reachability API: path witnesses between a source and a sink #173
  4. Many-to-many matching (composed) — given a set of sources and a set of sinks, return the connected (source, sink) pairs, optionally each with a witness path. This is the headline "source↔sink matching" capability; it is built on 1–3.

Why this fits the graph backends

The key insight is that reverse reachability needs no new data — it is the forward query with the edge direction flipped. The same holds across backends:

Operation Neo4j (Cypher) In-memory (NetworkX)
Forward MATCH (s {signature:$src})-[:CALLS*1..]->(t) nx.descendants(G, src)
Reverse MATCH (s)-[:CALLS*1..]->(t {signature:$sink}) nx.ancestors(G, sink)
Paths shortestPath(...), allShortestPaths(...), APOC apoc.path.* for bounded k-paths nx.shortest_path, nx.all_shortest_paths, nx.all_simple_paths(cutoff=…)

Neo4j makes the transitive case a one-liner and scales it; the in-memory backends give identical semantics for languages without a graph DB.

Design notes / open questions

  • Where it lives: these belong on the backend contract (TSAnalysisBackend / JavaAnalysisBackend / PythonAnalysisBackend) and surfaced on the façades, so every language with a call graph gets them — Neo4j just makes them cheap. (The backend ABCs landed in Add Neo4j-backed TypeScript analysis backend and formalize backend contracts #159.)
  • Parameters worth supporting: max_depth / hop bound; max_paths; shortest-only vs. all; include/exclude External (phantom) nodes; filter by edge provenance / dispatch tags; restrict to a module subtree.
  • Return shapes: matched node signatures, and where relevant the witness path as a list of signatures plus per-edge metadata (so a result is auditable, not just a boolean).
  • Sources/sinks as predicates and named catalogs, not just explicit sets (proposed: yes).
  • Bounding: simple-path enumeration can explode; default to shortest/k-bounded with an explicit opt-in for exhaustive enumeration, and always surface when results were truncated.

Tracking

The composed many-to-many "matching matrix" is the end goal and can be a thin convenience over the three once they exist.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions