Skip to content

Integrate DataFusion as execution engine for compute-heavy operations #3554

Description

@qzyu999

Feature Request / Improvement

Problem

PyIceberg cannot perform several operations at production scale due to unbounded memory requirements in the PyArrow execution path:

  • Tables with equality deletes are unreadable (hard ValueError)
  • CoW deletes OOM on large Parquet files (~1GB)
  • CoW overwrite OOMs (same pattern as delete)
  • Upsert uses O(n²) row-by-row comparison
  • Compaction not implemented, requires external sort (infeasible in-memory for large tables)
  • Orphan file deletion OOMs (LEFT ANTI JOIN of millions of paths)

The list goes on (documented below) with operations that don't scale in the typical single-node environment of PyIceberg. These block PyIceberg from achieving feature parity with Java Iceberg for V2/V3.

Proposed Solution

Integrate Apache DataFusion as an optional execution engine (pip install 'pyiceberg[pyiceberg-core]') behind an automatic engine-resolution layer. When installed, compute-heavy operations use DataFusion's spill-to-disk execution (bounded memory). When not installed, the existing PyArrow path remains unchanged (works for small data, OOMs gracefully on large).

No existing behavior changes. No forced dependency. DuckDB-style UX where only a developer only needs to configure a memory budget if they so choose.

Design Doc

Support for PyIceberg DataFusion Integration

Operations Unblocked

  • Equality delete read resolution
  • Streaming CoW delete/overwrite
  • Table compaction (sort + rewrite)
  • Orphan file deletion
  • Upsert via hash join
  • Equality-to-positional conversion
  • Position delete compaction
  • Full MoR compaction
  • Z-Order / Hilbert sorting
  • DV compaction
  • Incremental compaction
  • Sort-order enforcement on write
  • Dynamic partition overwrite (bounded memory)

Related Issues

PyIceberg

iceberg-rust

datafusion-python

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions