[spark] Record the write operation type in snapshot properties#8236

Open
Zouxxyy wants to merge 2 commits into apache:master from Zouxxyy:xinyu/paimon-operation

@Zouxxyy Zouxxyy commented Jun 14, 2026

Contributor

Purpose

Add a first-class operation field (Snapshot.Operation enum) to Snapshot, recording the logical operation type that produced it. This complements the physical CommitKind (APPEND/COMPACT/OVERWRITE/ANALYZE) and lets downstream tooling distinguish, e.g., an APPEND from INSERT vs. one from MERGE.
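
To illustrate the downstream use case, here is a minimal sketch of tooling that separates APPEND snapshots by the logical operation that produced them. `SnapshotInfo` and `SnapshotAudit` are hypothetical stand-ins, not Paimon classes, and the enums are trimmed copies. Modeling a COMPACT snapshot with a null operation is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-ins for snapshot metadata; not the real Paimon classes.
enum CommitKind { APPEND, COMPACT, OVERWRITE, ANALYZE }

enum Operation { WRITE, OVERWRITE, DELETE, TRUNCATE, UPDATE, MERGE }

final class SnapshotInfo {
    final long id;
    final CommitKind commitKind;
    final Operation operation; // null when the snapshot carries no operation

    SnapshotInfo(long id, CommitKind commitKind, Operation operation) {
        this.id = id;
        this.commitKind = commitKind;
        this.operation = operation;
    }
}

final class SnapshotAudit {
    /** Returns the ids of APPEND snapshots produced by the given logical operation. */
    static List<Long> appendsFrom(List<SnapshotInfo> snapshots, Operation wanted) {
        List<Long> ids = new ArrayList<>();
        for (SnapshotInfo s : snapshots) {
            if (s.commitKind == CommitKind.APPEND && s.operation == wanted) {
                ids.add(s.id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        List<SnapshotInfo> history = Arrays.asList(
                new SnapshotInfo(1, CommitKind.APPEND, null),            // old snapshot, no operation
                new SnapshotInfo(2, CommitKind.APPEND, Operation.WRITE), // INSERT INTO
                new SnapshotInfo(3, CommitKind.APPEND, Operation.MERGE), // MERGE INTO
                new SnapshotInfo(4, CommitKind.COMPACT, null));          // assumption: no operation
        System.out.println(appendsFrom(history, Operation.MERGE)); // prints [3]
    }
}
```

With `commitKind` alone, snapshots 1 to 3 would be indistinguishable. The extra field is what lets the filter pick out only the MERGE.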

Design:

  • Snapshot.Operation enum: WRITE, OVERWRITE, DELETE, TRUNCATE, UPDATE, MERGE, CREATE_TABLE_AS_SELECT, REPLACE_TABLE_AS_SELECT, CREATE_OR_REPLACE_TABLE_AS_SELECT
  • Nullable field with @JsonInclude(NON_NULL) — old snapshots deserialize as null, old readers ignore the unknown field via @JsonIgnoreProperties(ignoreUnknown = true)
  • BatchTableCommit.withOperation(Operation): default method, so no breaking change for existing implementations
  • FileStoreCommit.withOperation(Operation) — internal API, propagated to Snapshot construction in FileStoreCommitImpl
  • TRUNCATE is automatically set by TableCommitImpl.truncateTable()/truncatePartitions() in core, so callers don't need to handle it
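
The design points above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `SketchTableCommit`, `RecordingCommit`, and `LegacyCommit` are hypothetical names, and the real interfaces have more methods and different signatures.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified mirror of the design; not the real Paimon API.
enum Operation {
    WRITE, OVERWRITE, DELETE, TRUNCATE, UPDATE, MERGE,
    CREATE_TABLE_AS_SELECT, REPLACE_TABLE_AS_SELECT, CREATE_OR_REPLACE_TABLE_AS_SELECT
}

interface SketchTableCommit {
    void commit(List<String> messages);

    void truncateTable();

    // Default method: implementations written before the field existed still
    // compile and simply ignore the operation.
    default SketchTableCommit withOperation(Operation operation) {
        return this;
    }
}

final class RecordingCommit implements SketchTableCommit {
    private Operation operation; // stays null unless a caller sets it
    final List<Operation> committed = new ArrayList<>(); // one entry per snapshot

    @Override
    public SketchTableCommit withOperation(Operation operation) {
        this.operation = operation;
        return this;
    }

    @Override
    public void commit(List<String> messages) {
        // Stand-in for passing the operation into the snapshot being created.
        committed.add(operation);
    }

    @Override
    public void truncateTable() {
        // The core layer sets TRUNCATE itself, so callers do not have to.
        withOperation(Operation.TRUNCATE).commit(new ArrayList<>());
    }
}

// An implementation that predates the field: it does not override withOperation.
final class LegacyCommit implements SketchTableCommit {
    int commits = 0;

    @Override
    public void commit(List<String> messages) {
        commits++;
    }

    @Override
    public void truncateTable() {
        commits++;
    }
}
```

A caller would chain `commit.withOperation(Operation.MERGE).commit(messages)`. Against `LegacyCommit` the same chain still compiles and commits, with the operation dropped.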

Spark coverage (both v1 and v2 write paths):

| SQL | operation |
| --- | --- |
| INSERT INTO | WRITE |
| INSERT OVERWRITE | OVERWRITE |
| DELETE (row-level) | DELETE |
| DELETE (full-table / partition) | TRUNCATE |
| TRUNCATE TABLE | TRUNCATE |
| UPDATE | UPDATE |
| MERGE INTO | MERGE |
| CREATE TABLE AS SELECT | CREATE_TABLE_AS_SELECT |
| (CREATE OR) REPLACE TABLE AS SELECT | REPLACE_TABLE_AS_SELECT / CREATE_OR_REPLACE_TABLE_AS_SELECT |
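
The mapping above can be written as a small lookup, sketched here for illustration only. `SqlKind` and `OperationMapping` are hypothetical names, and the `rowLevel` flag is an assumed way to express the row-level vs. full-table/partition DELETE distinction. The actual Spark integration decides this inside its v1 and v2 write paths.

```java
// Hypothetical names for illustration; not part of Paimon or Spark.
enum Operation {
    WRITE, OVERWRITE, DELETE, TRUNCATE, UPDATE, MERGE,
    CREATE_TABLE_AS_SELECT, REPLACE_TABLE_AS_SELECT, CREATE_OR_REPLACE_TABLE_AS_SELECT
}

enum SqlKind {
    INSERT_INTO, INSERT_OVERWRITE, DELETE, TRUNCATE_TABLE, UPDATE, MERGE_INTO,
    CTAS, RTAS, CREATE_OR_REPLACE_TAS
}

final class OperationMapping {
    /**
     * Maps a SQL statement kind to the operation recorded in the snapshot.
     * rowLevel only matters for DELETE: a DELETE that drops a whole table or
     * whole partitions is recorded as TRUNCATE.
     */
    static Operation operationFor(SqlKind kind, boolean rowLevel) {
        switch (kind) {
            case INSERT_INTO:
                return Operation.WRITE;
            case INSERT_OVERWRITE:
                return Operation.OVERWRITE;
            case DELETE:
                return rowLevel ? Operation.DELETE : Operation.TRUNCATE;
            case TRUNCATE_TABLE:
                return Operation.TRUNCATE;
            case UPDATE:
                return Operation.UPDATE;
            case MERGE_INTO:
                return Operation.MERGE;
            case CTAS:
                return Operation.CREATE_TABLE_AS_SELECT;
            case RTAS:
                return Operation.REPLACE_TABLE_AS_SELECT;
            case CREATE_OR_REPLACE_TAS:
                return Operation.CREATE_OR_REPLACE_TABLE_AS_SELECT;
            default:
                throw new IllegalArgumentException("Unmapped statement kind: " + kind);
        }
    }
}
```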

Tests

  • SnapshotTest.testSnapshotWithOperation (paimon-core): JSON serialization round-trip, backward compatibility with old snapshots
  • SnapshotOperationTest (paimon-spark-ut): all operations under both use-v2-write=true/false, including CTAS/RTAS and truncate paths

Comment thread paimon-core/src/main/java/org/apache/paimon/table/sink/BatchTableCommit.java Outdated
@JingsongLi

Contributor

I think adding operation as a dedicated nullable field in Snapshot is a better direction than storing it in properties.
The parsing overhead should be negligible. Snapshot metadata is already read and deserialized as JSON, so one additional nullable string/enum field will not have meaningful performance impact compared with filesystem IO and manifest planning. With @JsonInclude(NON_NULL), old snapshots and snapshots without operation will not carry extra JSON size either.

Compatibility should also be fine:

  • Old snapshots do not have this field, so the new reader can treat it as null.
  • Older readers should ignore the new field because Snapshot already uses @JsonIgnoreProperties(ignoreUnknown = true).
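
As a rough model of these two compatibility cases, the sketch below uses plain maps in place of parsed JSON. It does not use Jackson. It only imitates the behavior described for `@JsonInclude(NON_NULL)` and `@JsonIgnoreProperties(ignoreUnknown = true)`. `SnapshotCompat`, its methods, and the two-field "old" schema are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the compatibility argument; maps stand in for parsed JSON.
final class SnapshotCompat {
    enum Operation { WRITE, MERGE }

    // Fields an "old" reader knows about (an assumed subset for the example).
    static final Set<String> OLD_KNOWN_FIELDS =
            new HashSet<>(Arrays.asList("id", "commitKind"));

    /** Writer: imitates NON_NULL by leaving the key out when operation is null. */
    static Map<String, String> write(long id, String commitKind, Operation operation) {
        Map<String, String> json = new LinkedHashMap<>();
        json.put("id", Long.toString(id));
        json.put("commitKind", commitKind);
        if (operation != null) {
            json.put("operation", operation.name());
        }
        return json;
    }

    /** New reader: a missing "operation" key is read as null. */
    static Operation readOperation(Map<String, String> json) {
        String raw = json.get("operation");
        return raw == null ? null : Operation.valueOf(raw);
    }

    /** Old reader: keeps the fields it knows and silently drops the rest. */
    static Map<String, String> oldRead(Map<String, String> json) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : json.entrySet()) {
            if (OLD_KNOWN_FIELDS.contains(e.getKey())) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }
}
```

Reading a snapshot written without the field yields a null operation, and an old reader given a snapshot with the field sees only `id` and `commitKind`.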

I would suggest modeling it as a first-class nullable enum or string field, for example Snapshot.Operation, rather than putting it into properties. commitKind describes the physical snapshot change, while operation describes the logical user operation, so both feel like core snapshot metadata.

This would also avoid introducing a generic withCommitProperties API just for one standard field, and avoids potential conflicts around the "operation" property key.

@Zouxxyy Zouxxyy force-pushed the xinyu/paimon-operation branch from 8f80251 to 1c14391 Compare June 19, 2026 06:57
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Zouxxyy Zouxxyy force-pushed the xinyu/paimon-operation branch from 1c14391 to 2705c1c Compare June 19, 2026 07:04
@Zouxxyy Zouxxyy force-pushed the xinyu/paimon-operation branch from 2705c1c to 87c8a42 Compare June 19, 2026 07:12