Job Persistence
JobManager tracks long-running async tool executions (see issues #316, #318, #326, #371). By default it keeps all job state in-process, which means every Pending or Running row is lost if the server crashes or restarts.
Issue #328 adds an optional, feature-gated SQLite backend so tracked jobs survive a restart and so operators can prune completed history with a built-in MCP tool.
Design at a glance
┌──────────────┐ put/get/list/update_status/delete_older_than ┌─────────────────┐
│ JobManager │ ─────────────────────────────────────────────────────▶ │ JobStorage │
│ (in-proc) │ │ (trait) │
└──────────────┘ └─────────────────┘
│
┌───────────────────────────┼─────────────────────────┐
▼ ▼
InMemoryStorage SqliteStorage
(default, zero-dep) (feature job-persist-sqlite)JobStorageisSend + Syncand synchronous — write-through on every job transition. No background flush task, no batching.InMemoryStorageis the default and ships in every wheel. Same semantics as the trait, but obviously does not survive a restart.SqliteStorageis gated behind the Cargo featurejob-persist-sqliteand pulls inrusqlitewithbundledso no system SQLite is required.
Enabling SQLite persistence
Build
# default build — in-memory only, no SQLite code compiled in
vx just dev
# opt-in — adds SqliteStorage and the rusqlite dependency
cargo build --workspace --features job-persist-sqlite,python-bindings,ext-moduleConfigure
from dcc_mcp_core import McpHttpConfig, McpHttpServer, ToolRegistry
config = McpHttpConfig(port=8765)
config.job_storage_path = "/var/lib/dcc-mcp/jobs.sqlite3"
registry = ToolRegistry()
# register tools...
server = McpHttpServer(registry, config)
handle = server.start()If job_storage_path is set but the wheel was built withoutjob-persist-sqlite, server.start() fails fast with a descriptive error rather than silently falling back to the in-memory store.
Startup recovery
When JobManager boots with a storage backend, it scans for rows whose status is Pending or Running — those are jobs that were in flight when the previous process exited. Each one is rewritten to the new terminal JobStatus::Interrupted variant with error = "server restart" and a fresh updated_at, and a $/dcc.jobUpdated notification is emitted (if the JobNotifier is wired). Clients that re-subscribe after reconnect therefore see a clean terminal transition instead of a dangling Running job.
Recovery failures are logged at error level and do not abort startup — the in-process map simply starts empty and the process continues to serve new requests.
Recovery policy (issue #567)
McpHttpConfig::job_recovery selects how the recovery sweep treats in-flight rows. Two values are accepted today:
| Value (Rust) | Wire string | Effect |
|---|---|---|
JobRecoveryPolicy::Drop | "drop" | Default. Every Pending / Running row is rewritten to Interrupted. Always safe — never re-runs a partially-applied tool. Matches the behaviour described above. |
JobRecoveryPolicy::Requeue | "requeue" | Reserved. The contract is "re-submit idempotent in-flight jobs from the persisted spec". Today the jobs row stores only tool_name, not the original arguments, so true requeue is not yet possible. The variant is accepted but degrades to Drop — startup logs a WARN (requested_policy=requeue effective_policy=drop) and the recovery sweep behaves identically to Drop. The accepted-but-degraded contract lets DCC adapters (dcc-mcp-maya, dcc-mcp-houdini) plumb the knob today; the real implementation will land alongside tool-arg persistence without a config-shape break. |
from dcc_mcp_core import McpHttpConfig
config = McpHttpConfig(port=8765)
config.job_storage_path = "/var/lib/dcc-mcp/jobs.sqlite3"
config.job_recovery = "requeue" # accepted, currently behaves as "drop" + WARNuse dcc_mcp_http::config::{JobRecoveryPolicy, McpHttpConfig};
let cfg = McpHttpConfig::new(8765)
.with_job_storage_path("/var/lib/dcc-mcp/jobs.sqlite3")
.with_job_recovery(JobRecoveryPolicy::Requeue);If you want to guarantee drop semantics regardless of future releases, leave job_recovery at its default. If you want to opt in to true requeue when it ships, set job_recovery = "requeue" today and the same configuration will pick up the new behaviour automatically.
jobs.cleanup built-in tool
A SEP-986-compliant built-in MCP tool prunes terminal jobs:
// tools/call
{
"name": "jobs.cleanup",
"arguments": { "older_than_hours": 24 } // default: 24
}
// → { "removed": <count>, "older_than_hours": 24 }- Annotations:
destructive_hint: true,idempotent_hint: true,read_only_hint: false. - Only terminal statuses (
Completed,Failed,Cancelled,Interrupted) are eligible;PendingandRunningrows are never removed regardless of age. - Works against whichever backend is configured — in-memory or SQLite.
Storage schema
CREATE TABLE IF NOT EXISTS jobs (
job_id TEXT PRIMARY KEY,
parent_job_id TEXT,
tool TEXT NOT NULL,
status TEXT NOT NULL,
progress_json TEXT,
created_at TEXT NOT NULL,
updated_at TEXT NOT NULL,
error TEXT,
result_json TEXT
);
CREATE INDEX IF NOT EXISTS jobs_status_idx ON jobs(status);
CREATE INDEX IF NOT EXISTS jobs_parent_idx ON jobs(parent_job_id);
CREATE INDEX IF NOT EXISTS jobs_updated_idx ON jobs(updated_at);Timestamps are stored as RFC 3339 UTC strings. Progress and result payloads are JSON-serialized — the schema stays stable even if internal Job fields evolve.
Operational guidance
- Backup: the SQLite file is a single path; standard file-level snapshotting works. There is no WAL-mode promise across versions — treat it as a durable cache, not a system of record.
- Growth: call
jobs.cleanupon a schedule (cron, k8s CronJob, or from an orchestrator agent). Default 24h window works for most interactive use. - Migration: there is no cross-version migration today. If the schema changes in a future release, delete the file and let
JobManagerrecreate it — the file is meant to survive restarts, not upgrades. - Concurrency:
SqliteStorageserializes through aparking_lot::Mutexon the connection. Good enough for per-DCC servers; do not point multipleMcpHttpServerinstances at the same file.
Related issues
- #316 — Async job execution with
Pending/Running/Completedstates - #318 —
JobManagercore - #326 —
$/dcc.jobUpdatednotifications - #371 —
jobs.get_statustool - #328 — this document
- #567 —
job_recoverypolicy contract (drop/requeue)