codellm-devkit/codeanalyzer-python

CodeLLM-DevKit

codeanalyzer-python (canpy)

A Python static-analysis toolkit: the CLDK backend that emits a canonical symbol table and call graph, as analysis.json or a Neo4j property graph.


canpy is a static analyzer for Python built on Jedi, with optional CodeQL-resolved call edges and Tree-sitter parsing. It produces the canonical CodeLLM-DevKit (CLDK) analysis.json (a symbol table plus a call graph) and can project that same analysis into a Neo4j property graph. It is the Python backend behind CLDK, mirroring its TypeScript (cants) and Java siblings.

Every run produces a symbol table and a call graph. Edges come from Jedi's lexical resolution by default; --codeql resolves additional edges (RPC / third-party / dynamically-dispatched targets) and merges them with the Jedi-derived edges, also backfilling callees Jedi could not resolve.

Features

  • Symbol table: modules, classes, functions, methods, variables, decorators, imports, and docstrings, with precise source spans.
  • Call graph: Jedi's lexical resolver by default, with optional CodeQL-resolved edges (--codeql) for RPC / third-party / dynamically-dispatched targets, merged with the Jedi edges; CodeQL also backfills callees Jedi could not resolve.
  • Neo4j output: project the analysis into a labeled property graph, either a self-contained graph.cypher snapshot or an incremental push to a live database over Bolt.
  • Versioned schema: a machine-readable, version-stamped Neo4j schema contract (--emit schema), checked in as schema.neo4j.json and shipped with every release.
  • Incremental cache: per-file results are cached under .codeanalyzer; --lazy (default) reuses them, --eager forces a clean rebuild. --ray distributes the work across cores.
  • Compact output: canonical analysis.json, or binary analysis.msgpack for smaller artifacts.

Installation

Prerequisites

  • Python 3.10 or newer.

  • A C toolchain and the venv / development headers. The analyzer builds an isolated virtual environment per project (via Python's venv) so Jedi can resolve types and imports:

    # Ubuntu / Debian
    sudo apt install python3-venv python3-dev build-essential
    
    # Fedora / RHEL / CentOS
    sudo dnf group install "Development Tools" && sudo dnf install python3-venv python3-devel
    
    # macOS
    xcode-select --install

Install via pip (PyPI)

pip install codeanalyzer-python
canpy --help

For the optional live Neo4j push (--emit neo4j --neo4j-uri …), install the neo4j extra:

pip install 'codeanalyzer-python[neo4j]'

Install via shell script

Install the CLI as an isolated tool with the one-line installer (provisions via uv / pipx / pip):

curl --proto '=https' --tlsv1.2 -LsSf https://github.com/codellm-devkit/codeanalyzer-python/releases/latest/download/canpy-installer.sh | sh

Install via Homebrew

brew install codellm-devkit/tap/codeanalyzer-python

The formula depends on uv and installs canpy as an isolated, version-pinned uv tool (the package and its dependencies are resolved and cached on first run).

Build from source

This project uses uv for dependency management.

git clone https://github.com/codellm-devkit/codeanalyzer-python
cd codeanalyzer-python
uv sync --all-groups
uv run canpy --help

Usage

canpy --input /path/to/python/project

With no --output, the analysis is printed to stdout as compact JSON; with --output <dir> it is written to analysis.json (or graph.cypher for --emit neo4j, or analysis.msgpack with --format msgpack) in that directory.
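
For scripted use, the same invocation can be assembled and run from Python. This is a minimal sketch, not part of canpy: `build_canpy_command` and `run_canpy` are hypothetical helpers that use only flags listed under Options. It assumes `canpy` is on PATH and that, without --output, stdout carries nothing but the JSON document.

```python
import json
import subprocess
from typing import Optional


def build_canpy_command(
    input_dir: str,
    output_dir: Optional[str] = None,
    fmt: str = "json",
    codeql: bool = False,
) -> list[str]:
    """Assemble a canpy argv list from flags documented in `canpy --help`."""
    if fmt not in ("json", "msgpack"):
        raise ValueError(f"unsupported format: {fmt!r}")
    cmd = ["canpy", "--input", input_dir]
    if output_dir is not None:
        cmd += ["--output", output_dir]
    if fmt != "json":
        cmd += ["--format", fmt]
    if codeql:
        cmd.append("--codeql")
    return cmd


def run_canpy(input_dir: str) -> dict:
    """Run canpy without --output and parse the JSON printed to stdout.

    Assumption: `canpy` is installed and stdout contains only the analysis.
    """
    result = subprocess.run(
        build_canpy_command(input_dir),
        check=True,           # raise CalledProcessError on a non-zero exit
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Passing the arguments as a list avoids shell quoting problems with project paths that contain spaces.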

Options

$ canpy --help

 Usage: canpy [OPTIONS] COMMAND [ARGS]...

 Static Analysis on Python source code using Jedi, CodeQL and Tree sitter.

╭─ Options ────────────────────────────────────────────────────────────────────╮
│ --input           -i                     PATH              Path to the       │
│                                                            project root      │
│                                                            directory (not    │
│                                                            required for      │
│                                                            --emit schema).   │
│ --output          -o                     PATH              Output directory  │
│                                                            for artifacts.    │
│ --format          -f                     [json|msgpack]    Output format for │
│                                                            --emit json: json │
│                                                            or msgpack.       │
│                                                            [default: json]   │
│ --emit                                   [json|neo4j|sche  Output target:    │
│                                          ma]               json              │
│                                                            (analysis.json,   │
│                                                            default) | neo4j  │
│                                                            (graph.cypher or  │
│                                                            live Bolt push) | │
│                                                            schema (the Neo4j │
│                                                            schema.json       │
│                                                            contract).        │
│                                                            [default: json]   │
│ --app-name                               TEXT              Logical           │
│                                                            application name  │
│                                                            for the graph     │
│                                                            :PyApplication    │
│                                                            anchor (default:  │
│                                                            input dir name).  │
│ --neo4j-uri                              TEXT              Push the graph to │
│                                                            a live Neo4j over │
│                                                            Bolt              │
│                                                            (incremental);    │
│                                                            omit to write     │
│                                                            graph.cypher.     │
│                                                            [env var:         │
│                                                            NEO4J_URI]        │
│ --neo4j-user                             TEXT              Neo4j username.   │
│                                                            [env var:         │
│                                                            NEO4J_USERNAME]   │
│                                                            [default: neo4j]  │
│ --neo4j-password                         TEXT              Neo4j password.   │
│                                                            Prefer the env    │
│                                                            var over the flag │
│                                                            (the flag is      │
│                                                            visible in shell  │
│                                                            history / process │
│                                                            list).            │
│                                                            [env var:         │
│                                                            NEO4J_PASSWORD]   │
│                                                            [default: neo4j]  │
│ --neo4j-database                         TEXT              Neo4j database    │
│                                                            name (default:    │
│                                                            server default).  │
│                                                            [env var:         │
│                                                            NEO4J_DATABASE]   │
│ --codeql              --no-codeql                          Enable            │
│                                                            CodeQL-based      │
│                                                            analysis.         │
│                                                            [default:         │
│                                                            no-codeql]        │
│ --ray                 --no-ray                             Enable Ray for    │
│                                                            distributed       │
│                                                            analysis.         │
│                                                            [default: no-ray] │
│ --eager               --lazy                               Enable eager or   │
│                                                            lazy analysis.    │
│                                                            Defaults to lazy. │
│                                                            [default: lazy]   │
│ --skip-tests          --include-tests                      Skip test files   │
│                                                            in analysis.      │
│                                                            [default:         │
│                                                            skip-tests]       │
│ --no-venv             --venv                               Skip virtualenv   │
│                                                            creation and      │
│                                                            dependency        │
│                                                            installation;     │
│                                                            resolve imports   │
│                                                            against the       │
│                                                            ambient Python    │
│                                                            environment       │
│                                                            instead.          │
│                                                            [default: venv]   │
│ --file-name                              PATH              Analyze only the  │
│                                                            specified file    │
│                                                            (relative to      │
│                                                            input directory). │
│ --cache-dir       -c                     PATH              Directory to      │
│                                                            store analysis    │
│                                                            cache. Defaults   │
│                                                            to                │
│                                                            '.codeanalyzer'   │
│                                                            in the input      │
│                                                            directory.        │
│ --clear-cache         --keep-cache                         Clear cache after │
│                                                            analysis. By      │
│                                                            default, cache is │
│                                                            retained.         │
│                                                            [default:         │
│                                                            keep-cache]       │
│                   -v                     INTEGER           Increase          │
│                                                            verbosity: -v,    │
│                                                            -vv, -vvv         │
│                                                            [default: 0]      │
│ --help                                                     Show this message │
│                                                            and exit.         │
╰──────────────────────────────────────────────────────────────────────────────╯

Examples

  1. Basic analysis to stdout, or to a file:

    canpy --input ./my-python-project                        # compact JSON on stdout
    canpy --input ./my-python-project --output ./out         # → ./out/analysis.json
  2. Binary output (msgpack):

    canpy --input ./my-python-project --output ./out --format msgpack   # → ./out/analysis.msgpack
  3. Resolve extra call edges with CodeQL:

    canpy --input ./my-python-project --codeql

    By default, edges come from Jedi's lexical analysis. Adding --codeql resolves additional edges (including RPC / third-party / dynamically-dispatched targets) and merges them with the Jedi-derived edges; CodeQL also backfills callees that Jedi could not resolve. CodeQL integration is experimental; the CodeQL CLI is downloaded into <cache_dir>/codeql/ on first use.

  4. Emit a Neo4j snapshot, or push to a live database:

    canpy --input ./my-python-project --emit neo4j --output ./out   # → ./out/graph.cypher
    canpy --input ./my-python-project --emit neo4j \
      --neo4j-uri bolt://localhost:7687 --neo4j-user neo4j --neo4j-password secret
  5. Emit the Neo4j schema contract:

    canpy --emit schema                   # print schema.json to stdout (no project needed)
    canpy --emit schema --output ./out    # → ./out/schema.json
  6. Force a clean rebuild with a custom cache directory:

    canpy --input ./my-python-project --eager --cache-dir /path/to/custom-cache

Output targets

canpy builds one analysis in memory and can emit it three ways (--emit):

analysis.json (default)

A PyApplication document, the canonical CLDK contract:

{
  "symbol_table": { /* file path → module (classes, functions, variables, imports, …) */ },
  "call_graph":   [ /* CALL_DEP edges: { source, target, weight, provenance } keyed by callable signature */ ]
}

By default this is printed to stdout in JSON; with --output it is written to analysis.json (or analysis.msgpack with --format msgpack, a more compact binary format).
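
A consumer can turn the call_graph edge list into an adjacency map. The sketch below is illustrative only: `callees_by_caller` is a hypothetical helper, and it assumes each edge is a JSON object whose `source` and `target` are signature strings, following the shape shown above. The signatures and provenance values in the sample are made up.

```python
import json
from collections import defaultdict


def callees_by_caller(analysis: dict) -> dict[str, list[str]]:
    """Group call-graph targets under their source signature."""
    adjacency: dict[str, list[str]] = defaultdict(list)
    for edge in analysis.get("call_graph", []):
        adjacency[edge["source"]].append(edge["target"])
    return dict(adjacency)


# Illustrative document; real signatures and provenance values may differ.
sample = json.loads("""
{
  "symbol_table": {},
  "call_graph": [
    {"source": "app.main", "target": "app.load", "weight": 1, "provenance": "jedi"},
    {"source": "app.main", "target": "requests.get", "weight": 2, "provenance": "codeql"}
  ]
}
""")

for caller, callees in callees_by_caller(sample).items():
    print(caller, "->", callees)
```

For a file written with --output, replace the inline sample with `json.load` on the analysis.json path.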

Neo4j graph

--emit neo4j projects the same analysis into a labeled property graph. Every node label is Py-prefixed and every relationship type is PY_-prefixed (e.g. :PyClass, PY_CALLS) so multiple language analyzers can share one database without label or relationship-type collisions. Declarations are keyed by their signature under a shared :PySymbol label; calls, imports, inheritance, decorators, and call sites are relationships. Where the graph goes depends on whether --neo4j-uri is given:

  • Without --neo4j-uri: writes a self-contained graph.cypher (constraints + indexes, a scoped wipe, then batched MERGEs). Load it with cypher-shell < graph.cypher. Needs no extra dependencies.
  • With --neo4j-uri: pushes to a live Neo4j over Bolt incrementally: only modules whose content hash changed are rewritten, and on a full run modules whose source file vanished are pruned. Requires the neo4j extra. Every graph carries a schema_version on its :PyApplication node.
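
The incremental rule in the second bullet (rewrite modules whose hash changed, prune modules whose file vanished) can be sketched as a comparison of two hash maps. This is not canpy's implementation: `content_hash` and `plan_incremental_push` are hypothetical names, and SHA-256 is an assumed hash function.

```python
import hashlib


def content_hash(source: bytes) -> str:
    """Hash a module's bytes. SHA-256 is an assumption, not canpy's documented choice."""
    return hashlib.sha256(source).hexdigest()


def plan_incremental_push(
    stored: dict[str, str], current: dict[str, str]
) -> tuple[set[str], set[str]]:
    """Return (modules to rewrite, modules to prune).

    `stored` maps module path -> hash already recorded in the database;
    `current` maps module path -> hash of the file as it exists now.
    """
    # New modules have no stored hash, so they also count as changed.
    rewrite = {path for path, digest in current.items() if stored.get(path) != digest}
    # Modules that were stored but no longer exist on disk are pruned.
    prune = set(stored) - set(current)
    return rewrite, prune
```

Unchanged modules appear in neither set, which is what keeps a repeated push cheap.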

Call-graph endpoints that aren't present in the symbol table (third-party / framework / RPC targets) are materialized as :PyExternal ghost nodes, mirroring the analyzer's own ghost-node behaviour.
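
Conceptually, the ghost nodes are the call-graph endpoints left over after subtracting every declared signature. A minimal sketch, assuming the declared signatures have already been collected into a flat set (how to extract them from the nested symbol table is not shown here) and that edges carry `source` and `target` strings; `external_endpoints` is a hypothetical helper:

```python
def external_endpoints(known_signatures: set[str], call_graph: list[dict]) -> set[str]:
    """Collect edge endpoints that have no declaration in the symbol table."""
    endpoints: set[str] = set()
    for edge in call_graph:
        endpoints.add(edge["source"])
        endpoints.add(edge["target"])
    return endpoints - known_signatures


# Made-up signatures for illustration.
edges = [
    {"source": "app.main", "target": "app.load"},
    {"source": "app.load", "target": "requests.get"},
]
print(external_endpoints({"app.main", "app.load"}, edges))
```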

The connection options also read from the standard Neo4j environment variables (NEO4J_URI, NEO4J_USERNAME, NEO4J_PASSWORD, NEO4J_DATABASE) when the corresponding flag is omitted; an explicit flag wins. Prefer the env var for the password so it doesn't land in shell history or the process list:

export NEO4J_URI=bolt://localhost:7687
export NEO4J_PASSWORD=secret
canpy -i ./my-project --emit neo4j     # credentials picked up from the environment
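
The precedence described above (explicit flag, then environment variable, then default) can be sketched as follows. `resolve_neo4j_options` is a hypothetical helper, not canpy's code; the variable names and defaults are the ones listed in `canpy --help`.

```python
import os
from typing import Mapping, Optional

# Option name -> (environment variable, default), as listed in `canpy --help`.
_NEO4J_OPTIONS = {
    "uri": ("NEO4J_URI", None),
    "user": ("NEO4J_USERNAME", "neo4j"),
    "password": ("NEO4J_PASSWORD", "neo4j"),
    "database": ("NEO4J_DATABASE", None),
}


def resolve_neo4j_options(
    flags: Mapping[str, Optional[str]],
    env: Mapping[str, str] = os.environ,
) -> dict[str, Optional[str]]:
    """Resolve each option: an explicit flag wins, then the env var, then the default."""
    resolved: dict[str, Optional[str]] = {}
    for name, (env_var, default) in _NEO4J_OPTIONS.items():
        value = flags.get(name)
        if value is None:
            value = env.get(env_var, default)
        resolved[name] = value
    return resolved
```

With `NEO4J_URI` and `NEO4J_PASSWORD` exported as in the shell example, calling `resolve_neo4j_options({})` would pick both up and fall back to the default username.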

Schema contract

--emit schema writes the machine-readable, version-stamped Neo4j schema (schema.json: node labels, relationships, properties, constraints, and indexes). It needs no project. The schema is checked into the repo as schema.neo4j.json and bundled in every release as a GitHub Release asset, so a consumer can validate producer/consumer compatibility without invoking the tool. The shape of the contract matches the codeanalyzer-typescript backend.

A UML of the analysis.json schema (the PyApplication containment tree) is checked in as schema-uml.drawio, and the property-graph schema as neo4j-schema.drawio.

Development

This project uses uv.

uv sync --all-groups
uv run canpy --input /path/to/project           # run from source
uv run canpy --emit schema > schema.neo4j.json  # regenerate the checked-in schema contract
uv run python scripts/update_readme.py          # regenerate the canpy --help block above
uv run pytest                                   # run the test suite

The Neo4j schema-conformance test always runs. The Neo4j bolt integration test spins up a real Neo4j via Testcontainers and is opt-in: it needs a container runtime (Docker or Podman) and is enabled with an environment variable:

RUN_CONTAINER_TESTS=1 uv run pytest test/test_neo4j_bolt.py -s
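
An opt-in gate of this kind can be sketched with the standard library. This is an illustration, not the repository's test code: the class and helper names are hypothetical, it uses unittest rather than pytest to stay dependency-free, and treating only the value "1" as enabled is an assumption.

```python
import os
import unittest
from typing import Mapping


def container_tests_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Assumption: only RUN_CONTAINER_TESTS=1 turns the container tests on."""
    return env.get("RUN_CONTAINER_TESTS") == "1"


@unittest.skipUnless(container_tests_enabled(), "set RUN_CONTAINER_TESTS=1 to run")
class Neo4jBoltIntegrationTest(unittest.TestCase):
    def test_placeholder(self):
        # A real test would start Neo4j in a container and push a graph here.
        self.assertTrue(container_tests_enabled())
```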

License

Apache 2.0; see LICENSE.
