runtime: checkpoint/restore (CheckpointRuntime) + Podman backend #98
Merged
4 commits
f5f58d5 runtime: add checkpoint/restore (CheckpointRuntime) + Podman backend (bilby91)
a63b789 address PR review: harden Podman checkpoint/restore + fix CI (bilby91)
5a8dd70 ci: gate podman C/R job on podman>=5; skip green on inadequate hosted… (bilby91)
fa74dcc ci: run podman checkpoint/restore for real in a privileged container (bilby91)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,44 @@ | ||
| #!/usr/bin/env bash | ||
| # Run the gated Podman checkpoint/restore tests inside a modern-podman | ||
| # container (invoked by ci.yml's test-integration-podman job, which docker-runs | ||
| # this privileged + --cgroupns=host so CRIU can use the runner's kernel). | ||
| # | ||
| # The hosted runner's own apt podman is unusable (24.04 has no criu; 22.04 | ||
| # ships podman 3.4.4 whose runtime can't checkpoint and predates the libpod | ||
| # v5 API), so we bring podman 5.x + crun + criu via the container image and | ||
| # only need the runner for its kernel + Docker. | ||
| # | ||
| # Skips GREEN (exit 0 + ::warning::) if this runner can't actually | ||
| # checkpoint — e.g. the nested cgroup freezer is not permitted. Runs the | ||
| # tests for real (failing red) only once a checkpoint smoke test proves the | ||
| # environment is capable. Real C/R is also validated locally on podman 5.x + | ||
| # criu (OrbStack). | ||
| set -uo pipefail | ||
|
|
||
| dnf install -y -q criu iptables >/dev/null 2>&1 \ | ||
| || { echo "::warning::criu install failed in container — skipping C/R run"; exit 0; } | ||
| echo "stack: $(podman --version) / $(criu --version | head -1) / $(crun --version | head -1)" | ||
|
|
||
| criu check || { echo "::warning::criu check failed on this runner kernel — skipping C/R run"; exit 0; } | ||
|
|
||
| mkdir -p /etc/containers /run/podman | ||
| printf '[engine]\nevents_logger="file"\nruntime="crun"\n' > /etc/containers/containers.conf | ||
| podman system service --time=0 unix:///run/podman/podman.sock & | ||
| for _ in $(seq 1 30); do [ -S /run/podman/podman.sock ] && break; sleep 1; done | ||
| test -S /run/podman/podman.sock || { echo "::warning::podman service socket did not come up — skipping"; exit 0; } | ||
|
|
||
| # Capability smoke test: can this runner actually freeze + dump a container? | ||
| # Nested CRIU frequently can't ("Unable to freeze tasks: Operation not | ||
| # permitted"). If it can't, skip green with the real reason rather than fail. | ||
| podman run -d --name smoke docker.io/library/alpine:3.20 sleep 600 >/dev/null | ||
| sleep 2 | ||
| if ! podman container checkpoint smoke >/tmp/ckpt.log 2>&1; then | ||
| echo "::warning::this runner cannot checkpoint a container (likely cgroup freezer perms): $(tail -1 /tmp/ckpt.log) — skipping. Real C/R is validated locally on podman 5.x + criu." | ||
| exit 0 | ||
| fi | ||
| podman rm -f smoke >/dev/null 2>&1 || true | ||
| echo "checkpoint smoke passed — running the gated tests for real" | ||
|
|
||
| export PODMAN_SOCKET=unix:///run/podman/podman.sock | ||
| /w/podman.test -test.run TestIntegration -test.v -test.timeout 15m || exit $? | ||
| /w/int.test -test.run '^TestPodman' -test.v -test.timeout 15m |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,139 @@ | ||
| package devcontainer | ||
|
|
||
| import ( | ||
| "context" | ||
| "fmt" | ||
|
|
||
| "github.com/crunchloop/devcontainer/runtime" | ||
| ) | ||
|
|
||
| // CheckpointOptions configures Engine.Checkpoint. | ||
| type CheckpointOptions struct { | ||
| // ArchivePath is where the portable checkpoint archive is written. | ||
| // Required. Point it at durable, transferable storage (the workspace | ||
| // volume, object storage) — the archive is self-contained, so a | ||
| // later Restore can run on a different node by moving this file. | ||
| ArchivePath string | ||
|
|
||
| // StopAfter stops/removes the container once the archive is written | ||
| // — the spot-eviction path, where the node is going away anyway. | ||
| // False keeps the container running ("backup" checkpoint). | ||
| StopAfter bool | ||
|
|
||
| // TCPEstablished requests checkpoint of established TCP connections. | ||
| // Recommended true for devcontainers: a container holding a live | ||
| // connection at checkpoint time fails to checkpoint without it. | ||
| TCPEstablished bool | ||
| } | ||
|
|
||
| // RestoreOptions configures Engine.Restore. | ||
| type RestoreOptions struct { | ||
| // ArchivePath is the archive a prior Checkpoint wrote. Required. | ||
| ArchivePath string | ||
|
|
||
| // Name optionally names the restored container. | ||
| Name string | ||
|
|
||
| // TCPEstablished must match the checkpoint when the archive captured | ||
| // established connections. | ||
| TCPEstablished bool | ||
|
|
||
| // LocalEnv overrides os.Environ() for the reattached workspace's | ||
| // substituter localEnv pass. Nil means use the current process | ||
| // environment — matches AttachOptions.LocalEnv. On a cross-node | ||
| // restore the destination's env may differ from the source's, so a | ||
| // caller that cares can pin it here. | ||
| LocalEnv map[string]string | ||
| } | ||
|
|
||
| // Checkpoint writes a portable checkpoint archive for the workspace's | ||
| // container (process + memory state plus the writable rootfs), so it can | ||
| // later be restored — possibly on another node — by Restore. | ||
| // | ||
| // Returns ErrCheckpointUnsupported (wrapped) if the active backend does | ||
| // not implement runtime.CheckpointRuntime or advertises | ||
| // Capabilities().Checkpoint == false. Callers can errors.Is against | ||
| // runtime.ErrCheckpointUnsupported and fall back to a cold path. | ||
| // | ||
| // Checkpoint is the primitive; deciding *when* to checkpoint (e.g. on a | ||
| // spot-reclaim notice) is the caller's job. | ||
| func (e *Engine) Checkpoint(ctx context.Context, ws *Workspace, opts CheckpointOptions) (runtime.CheckpointRef, error) { | ||
| if err := ctxIfDone(ctx); err != nil { | ||
| return runtime.CheckpointRef{}, err | ||
| } | ||
| if ws == nil || ws.Container == nil { | ||
| return runtime.CheckpointRef{}, fmt.Errorf("Checkpoint: workspace has no container") | ||
| } | ||
| if opts.ArchivePath == "" { | ||
| return runtime.CheckpointRef{}, fmt.Errorf("Checkpoint: ArchivePath is required") | ||
| } | ||
|
|
||
| cr, ok := e.runtime.(runtime.CheckpointRuntime) | ||
| if !ok || !e.runtime.Capabilities().Checkpoint { | ||
| return runtime.CheckpointRef{}, fmt.Errorf("Checkpoint: %w", runtime.ErrCheckpointUnsupported) | ||
| } | ||
|
|
||
| ref, err := cr.Checkpoint(ctx, ws.Container.ID, runtime.CheckpointSpec{ | ||
| ArchivePath: opts.ArchivePath, | ||
| StopAfter: opts.StopAfter, | ||
| TCPEstablished: opts.TCPEstablished, | ||
| }) | ||
| if err != nil { | ||
| return runtime.CheckpointRef{}, fmt.Errorf("checkpoint: %w", err) | ||
| } | ||
| return ref, nil | ||
| } | ||
|
|
||
| // Restore re-creates and resumes a container from a checkpoint archive | ||
| // written by Checkpoint, reconstructing its mounts and re-attaching | ||
| // networking, then rebuilds the *Workspace around it. The original | ||
| // container may be gone (the migration case). | ||
| // | ||
| // The returned Workspace has the MINIMAL config Attach produces — the | ||
| // devcontainer labels the checkpoint archive preserves plus the image's | ||
| // merged-config metadata — with the substituter bound to the restored | ||
| // container's live env and userEnv re-probed. It is enough to drive Exec | ||
| // and Down; callers needing the full devcontainer.json view should | ||
| // Resolve from source. See the Workspace type docs. | ||
| // | ||
| // Returns ErrCheckpointUnsupported (wrapped) when the backend can't, and | ||
| // a *runtime.RestoreFailedError (from the backend) on a restore failure | ||
| // — distinct from a cold-start failure, so callers can fall back to a | ||
| // cold Up on the (intact) workspace volume. | ||
| func (e *Engine) Restore(ctx context.Context, opts RestoreOptions) (*Workspace, error) { | ||
| if err := ctxIfDone(ctx); err != nil { | ||
| return nil, err | ||
| } | ||
| if opts.ArchivePath == "" { | ||
| return nil, fmt.Errorf("Restore: ArchivePath is required") | ||
| } | ||
|
|
||
| cr, ok := e.runtime.(runtime.CheckpointRuntime) | ||
| if !ok || !e.runtime.Capabilities().Checkpoint { | ||
| return nil, fmt.Errorf("Restore: %w", runtime.ErrCheckpointUnsupported) | ||
| } | ||
|
|
||
| c, err := cr.Restore(ctx, runtime.RestoreSpec{ | ||
| ArchivePath: opts.ArchivePath, | ||
| Name: opts.Name, | ||
| TCPEstablished: opts.TCPEstablished, | ||
| }) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("restore: %w", err) | ||
| } | ||
|
|
||
| // Reattach: the restored container carries the devcontainer labels | ||
| // from the archive, so rebuild the Workspace the same way Attach | ||
| // does. inspectStable absorbs the post-restore state lag (the daemon | ||
| // reports state asynchronously after import-and-start). The workspace | ||
| // id is recovered from the container's label. | ||
| details, err := e.inspectStable(ctx, c.ID) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("restore: inspect restored container %s: %w", c.ID, err) | ||
| } | ||
| id := WorkspaceID(details.Labels[LabelDevcontainerID]) | ||
| if id == "" { | ||
| return nil, fmt.Errorf("restore: restored container %s has no %s label — not a devcontainer workspace archive", c.ID, LabelDevcontainerID) | ||
| } | ||
| return e.reattachWorkspace(ctx, details, id, opts.LocalEnv), nil | ||
|
|
||
| } | ||