Gateway
The gateway (McpHttpConfig::gateway_port > 0) is a first-wins HTTP façade that presents every live DCC instance under one MCP endpoint. A single client can talk to Maya, Blender and Houdini through the same /mcp URL; the gateway discovers live backends via FileRegistry, keeps its MCP tools/list bounded to discover+dispatch primitives, indexes backend capabilities on demand, routes search_tools / describe_tool / call_tool (or REST /v1/*) to the right backend, and multiplexes server-pushed notifications back to the originating client session.
Set gateway_name, --gateway-name, or DCC_MCP_GATEWAY_NAME on each candidate to make ownership explicit. The elected process writes this label to the __gateway__ sentinel and /admin/api/health.gateway.current; a challenger writes the same label with gateway_role=challenger, so operators can see both the current owner and the next peer trying to take over.
For production, prefer the machine-wide standalone gateway:
dcc-mcp-server gateway --port 9765 --name studio-gateway

Per-DCC sidecars now auto-launch that process when GET /health is not reachable. They use a single-flight gateway-launch.lock in the registry directory so three DCCs starting at once still spawn at most one gateway. Use dcc-mcp-server sidecar --no-ensure-gateway to disable auto-launch, or --legacy-gateway-election to restore the old per-DCC first-wins election.
Topology
            ┌──────────────── gateway ────────────────┐
client_A ──▶│ POST /mcp (tools/list, tools/call)      │───▶ backend (maya)
            │ GET  /mcp (SSE, MCP 2025-03-26)         │───▶ backend (blender)
client_B ──▶│ subscribers: per-client broadcast sink  │
            │ backend SSE sub: one per backend URL    │
            └─────────────────────────────────────────┘

SSE multiplex (#320)
When the gateway detects a new backend it opens a persistent SSE connection to <backend>/mcp (the same Streamable HTTP transport the client uses against the gateway). Notifications emitted by the backend are parsed as JSON-RPC messages and routed to the right client:
| MCP method | Correlation key | Source |
|---|---|---|
| notifications/progress | params.progressToken | Set by the gateway when the outbound tools/call carried _meta.progressToken |
| notifications/$/dcc.jobUpdated | params.job_id | Set from the backend reply's _meta.dcc.jobId / structuredContent.job_id |
| notifications/$/dcc.workflowUpdated | params.job_id | Same as above |
Pending buffer
Notifications that arrive before the correlation is known (race between backend SSE push and the tools/call HTTP reply) are held in a bounded per-backend queue: 256 events or 30 s, whichever comes first. When the mapping appears the buffer is drained; stale entries are dropped with a warn! log.
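The bounded pending buffer can be pictured with a small Python sketch. It is illustrative only and is not the gateway's Rust implementation: the class name PendingBuffer, the drop-oldest overflow policy, and the injectable clock are assumptions made for the example; only the 256-event and 30-second bounds come from the text above.

```python
import time
from collections import deque

MAX_EVENTS = 256      # bound stated in the text
MAX_AGE_SECS = 30.0   # bound stated in the text


class PendingBuffer:
    """Hypothetical per-backend queue for notifications with no known route yet."""

    def __init__(self, max_events=MAX_EVENTS, max_age=MAX_AGE_SECS, clock=time.monotonic):
        self._events = deque()  # (arrival_time, correlation_key, payload), oldest first
        self._max_events = max_events
        self._max_age = max_age
        self._clock = clock

    def _expire(self):
        """Drop entries older than max_age; return how many were dropped."""
        cutoff = self._clock() - self._max_age
        dropped = 0
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()
            dropped += 1
        return dropped

    def push(self, key, payload):
        """Buffer one event; return the number of older events dropped."""
        dropped = self._expire()
        # Assumption: when the queue is full, the oldest entry is evicted.
        while len(self._events) >= self._max_events:
            self._events.popleft()
            dropped += 1
        self._events.append((self._clock(), key, payload))
        return dropped

    def drain(self, key):
        """Remove and return the payloads buffered for `key`, in arrival order."""
        self._expire()
        matched = [p for (_, k, p) in self._events if k == key]
        self._events = deque(e for e in self._events if e[1] != key)
        return matched
```

A caller would invoke drain(job_id) as soon as the tools/call reply reveals the job_id, then deliver the returned payloads to the owning client session.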
Reconnect + synthetic $/dcc.gatewayReconnect
Each backend subscriber owns a reconnect loop with jittered exponential backoff (100 ms → 10 s, ±25% jitter). When a broken stream reconnects, the gateway emits a synthetic notifications/$/dcc.gatewayReconnect notification to every client that had an in-flight job on that backend:
{
  "jsonrpc": "2.0",
  "method": "notifications/$/dcc.gatewayReconnect",
  "params": { "backend_url": "http://127.0.0.1:18812/mcp" }
}

Clients use this to re-query in-flight jobs via jobs.get_status.
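For readers who want to see the shape of the reconnect schedule, here is a minimal Python sketch of jittered exponential backoff with the bounds quoted above (100 ms floor, 10 s cap, ±25% jitter). The doubling factor and the function name backoff_delay are assumptions; the gateway's actual growth factor is not stated here.

```python
import random

BASE_SECS = 0.1   # 100 ms floor from the text
CAP_SECS = 10.0   # 10 s cap from the text
JITTER = 0.25     # plus or minus 25% from the text


def backoff_delay(attempt, rng=random.random):
    """Delay in seconds before reconnect attempt `attempt` (0-based).

    Assumption: the nominal delay doubles on each attempt until it hits the cap.
    """
    # Clamp the exponent so very large attempt counts cannot overflow.
    nominal = min(CAP_SECS, BASE_SECS * (2 ** min(attempt, 10)))
    # rng() is in [0, 1), so the factor falls in [0.75, 1.25).
    factor = 1.0 + JITTER * (2.0 * rng() - 1.0)
    return nominal * factor
```

Passing a deterministic rng makes the schedule easy to test; in production use the default random source so that backends that dropped together do not reconnect in lockstep.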
Session lifecycle
Per-client SSE sinks are keyed on Mcp-Session-Id. A SessionCleanup RAII guard runs when the GET /mcp response body is dropped (client disconnect): the client's sink is removed from the subscriber manager and any job_routes / progress_token_routes bound to that session are scrubbed. Backend subscriptions stay alive — another client might still depend on them.
Self-loop guard + pre-subscribe hygiene (#419)
When a DCC process (Maya, Blender, Houdini…) wins gateway election it keeps two rows in FileRegistry: the __gateway__ sentinel and its own plain "maya" / "blender" / … row. Without filtering, the backend SSE subscriber would open a connection to its own /mcp endpoint — a self-loop that wastes a socket and floods the reconnect logs whenever the facade blips.
Two invariants prevent this:
- Self-exclusion in every fan-out path. GatewayState::live_instances skips rows whose (host, port) matches the gateway's own binding, using is_own_instance from crates/dcc-mcp-gateway/src/gateway/sentinel.rs. The helper normalises localhost aliases (localhost, ::1, 0.0.0.0, [::]) to 127.0.0.1 so an adapter that advertises its host as "localhost" is still filtered when the gateway is bound to 127.0.0.1. The backend_sub_handle subscription loop and the compute_tools_fingerprint_with_own watcher apply the same filter.
- Synchronous hygiene before the subscriber loop starts. Inside start_gateway_tasks, a one-shot prune_dead_pids() + cleanup_stale() pass runs before backend_sub_handle is spawned. The periodic cleanup task only ticks every 15 s; without the synchronous pre-pass, ghost rows left behind by a previous crash would eat the full exponential-backoff retry budget during the first ~15 s of gateway lifetime.
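The self-exclusion rule can be illustrated in a few lines of Python. This is a sketch of the idea, not a port of is_own_instance: the function names, the dict-shaped registry rows, and the case-insensitive comparison are assumptions; the alias list is the one given above.

```python
# Aliases the text says are normalised to 127.0.0.1.
LOOPBACK_ALIASES = {"localhost", "::1", "0.0.0.0", "[::]"}


def normalize_host(host):
    """Map the loopback aliases to 127.0.0.1; leave every other host unchanged."""
    h = host.strip().lower()
    return "127.0.0.1" if h in LOOPBACK_ALIASES else h


def is_own_instance_sketch(row_host, row_port, own_host, own_port):
    """True when a registry row points at the gateway's own listener."""
    return row_port == own_port and normalize_host(row_host) == normalize_host(own_host)


def live_backends(rows, own_host, own_port):
    """Filter out the gateway's own row before any fan-out (hypothetical row shape)."""
    return [
        r for r in rows
        if not is_own_instance_sketch(r["host"], r["port"], own_host, own_port)
    ]
```

With the gateway bound to 127.0.0.1:9765, a row advertising localhost:9765 is dropped, while a row on another port or on a routable address is kept.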
Instance and Diagnostics Discovery
The gateway exposes the live DCC registry as a gateway-native MCP resource (see also docs/api/http.md):
{"jsonrpc":"2.0","id":1,"method":"resources/read",
 "params":{"uri":"gateway://instances"}}

The payload includes live, stale, and unhealthy rows so clients can decide whether to route, reconnect, or ask the user to restart a DCC instance. Each entry already carries mcp_url, so clients that have read this resource can connect directly. Optional URI query parameters (?include_stale=false, ?include_dead=true) match the legacy tool flags. resources/list advertises only root pointers for gateway-native families; it does not enumerate every instance-specific URI. Backend capability indexes refresh on demand before search_tools / describe_tool, so instances registered after gateway startup are picked up without a restart.
Optional Instance Pooling
Instances can opt into warm-pool semantics through the registry fields surfaced under pool in the gateway://instances resource:
{
  "status": "busy",
  "pool": {
    "capacity": 1,
    "lease_owner": "workflow-42",
    "current_job_id": "render-001",
    "lease_expires_at": 1770000000,
    "available": false
  }
}

Gateway-local tools manage these leases:
| Tool | Purpose |
|---|---|
| acquire_dcc_instance | Reserve an idle instance by dcc_type (or a specific instance_id) and mark it busy |
| release_dcc_instance | Release the lease and mark the instance available again |
Pooling is optional. Adapters that never call these tools keep the previous single-instance behavior: entries default to capacity: 1, no lease owner, and status: "available".
Before the gateway routes REST traffic to a backend, it verifies that the target responds to GET /v1/readyz and falls back to GET /health only when the readiness surface is absent. This avoids treating non-MCP listeners such as Maya commandPort as routable backends; posting MCP JSON-RPC to commandPort was a pre-#818 failure mode that could trigger Maya's modal commandPort security dialog and block the DCC main thread.
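The probe order can be sketched as a pure function. This is an illustration under stated assumptions, not the gateway's code: is_routable and the injected http_get callable are hypothetical, and treating HTTP 404 as "readiness surface absent" and HTTP 200 as "ready" are assumptions the text does not spell out.

```python
def is_routable(base_url, http_get):
    """Decide whether a backend may receive REST traffic.

    `http_get(url)` is a caller-supplied function that returns the HTTP status
    code, or None when nothing answered (connection refused, timeout, ...).
    """
    status = http_get(base_url + "/v1/readyz")
    if status is None:
        return False  # no HTTP listener at all
    if status == 404:
        # Assumption: 404 means the readiness surface is absent,
        # so fall back to the older health endpoint.
        return http_get(base_url + "/health") == 200
    # Readiness surface exists: trust it, and do not fall back.
    return status == 200
```

A non-MCP listener such as a raw command port would fail both probes and never be selected as a routing target.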
Gateway-native diagnostics are always available as MCP resources (read via resources/read), even when no backend is routable:
| Resource URI | Purpose |
|---|---|
| gateway://diagnostics/process | Gateway process metadata plus live/stale/unhealthy instance counts. Optional ?dcc_type=<type> filter. |
| gateway://diagnostics/audit | Gateway pending-call and subscription summary |
| gateway://diagnostics/metrics | Gateway-local tool count, live backend count, and timeout settings |
Backend diagnostics tools remain available as normal prefixed instance tools when a DCC exposes them.
Operations: ingress limits, X-Forwarded-For, resilience, metrics
These knobs apply to the elected gateway process (the HTTP listener on McpHttpConfig::gateway_port). They are read once at process start from the environment unless noted otherwise.
Rate limiting and client IP
| Variable | Default | Meaning |
|---|---|---|
| DCC_MCP_GATEWAY_RATE_LIMIT_PER_MINUTE | 0 (off) | Max HTTP requests per client key per rolling UTC minute. OPTIONS is not counted. |
| DCC_MCP_GATEWAY_XFF_TRUSTED_DEPTH | 0 | When > 0, the client key for rate limiting prefers X-Forwarded-For: treat the rightmost depth comma-separated fields as trusted reverse-proxy hops; the next field to the left is the client IP. If the header is missing, malformed, or shorter than depth + 1, the TCP peer address is used. |
Security: only set DCC_MCP_GATEWAY_XFF_TRUSTED_DEPTH when every path to the gateway passes through that many trusted proxies that overwrite (not concatenate untrusted) X-Forwarded-For. A client that can reach the gateway directly could otherwise spoof the header unless your edge strips or replaces it.
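The client-key rule for DCC_MCP_GATEWAY_XFF_TRUSTED_DEPTH is easier to check against a concrete sketch. The function below is a Python illustration, not the gateway's implementation; the name client_key is hypothetical, and using ipaddress parsing as the test for "malformed" is an assumption.

```python
import ipaddress


def client_key(xff_header, peer_ip, trusted_depth):
    """Pick the rate-limit key from X-Forwarded-For, or fall back to the TCP peer."""
    if trusted_depth <= 0 or not xff_header:
        return peer_ip
    fields = [f.strip() for f in xff_header.split(",")]
    if len(fields) < trusted_depth + 1:
        return peer_ip  # header shorter than depth + 1
    # Skip the rightmost `trusted_depth` proxy hops; the next field left is the client.
    candidate = fields[-(trusted_depth + 1)]
    try:
        return str(ipaddress.ip_address(candidate))
    except ValueError:
        return peer_ip  # assumption: a non-IP field counts as malformed
```

With one trusted proxy, the header "198.51.100.1, 203.0.113.7, 10.0.0.1" yields 203.0.113.7: the leftmost entry is whatever the caller sent and is ignored, which is why the depth must match the real number of proxies in front of the gateway.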
Request body size
| Variable | Default | Meaning |
|---|---|---|
| DCC_MCP_GATEWAY_HTTP_BODY_LIMIT_BYTES | 16777216 (16 MiB) | Hard cap on non-streaming request bodies (tower_http::limit::RequestBodyLimitLayer). Long-lived GET /mcp SSE streams are not subject to a short global HTTP timeout. |
Backend retries and circuit breaker
| Variable | Default | Meaning |
|---|---|---|
| DCC_MCP_GATEWAY_READ_RETRY_MAX | 2 | Extra attempts for idempotent read REST hops (GET and read-like POST /v1/search) after transport / 5xx / 429 failures, with jittered backoff. Writes (POST /v1/call, JSON-RPC post) are not retried. |
| DCC_MCP_GATEWAY_CIRCUIT_FAILURE_THRESHOLD | 5 | Consecutive transport-class failures per backend REST base before the circuit opens. |
| DCC_MCP_GATEWAY_CIRCUIT_OPEN_SECS | 30 | How long to short-circuit new calls to that backend base. |
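How the two circuit settings interact can be shown with a minimal per-backend breaker in Python. It is a sketch, not the gateway's code: the class name, the absence of a half-open probing state, and resetting the failure count when the open window ends are all assumptions.

```python
import time


class CircuitBreaker:
    """Hypothetical breaker for one backend REST base."""

    def __init__(self, failure_threshold=5, open_secs=30.0, clock=time.monotonic):
        self._threshold = failure_threshold  # DCC_MCP_GATEWAY_CIRCUIT_FAILURE_THRESHOLD
        self._open_secs = open_secs          # DCC_MCP_GATEWAY_CIRCUIT_OPEN_SECS
        self._clock = clock
        self._failures = 0                   # consecutive transport-class failures
        self._opened_at = None               # None while the circuit is closed

    def allow(self):
        """Return True if a call to this backend may be attempted now."""
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self._open_secs:
            # Assumption: after the open window the circuit simply closes again.
            self._opened_at = None
            self._failures = 0
            return True
        return False

    def record_success(self):
        self._failures = 0  # only consecutive failures count

    def record_failure(self):
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()
```

While allow() returns False, a caller would fail fast, which corresponds to the circuit_open error kind in the metrics section.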
Durable admin audit / trace JSONL (optional)
When DCC_MCP_GATEWAY_AUDIT_DIR is set, audit rows and dispatch traces append to JSONL files under that directory.
| Variable | Default | Meaning |
|---|---|---|
| DCC_MCP_GATEWAY_AUDIT_MAX_ROWS | 5000 | Trim oldest lines when a file exceeds this row count. |
| DCC_MCP_GATEWAY_AUDIT_MAX_BYTES | 52428800 (50 MiB) | After row trim, drop oldest lines until each JSONL is under this size. |
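The two-stage trim (rows first, then bytes) can be sketched on an in-memory list of JSONL rows. The helper name trim_jsonl is hypothetical, and counting one newline byte per row and treating a file exactly at the limit as acceptable are assumptions.

```python
def trim_jsonl(lines, max_rows=5000, max_bytes=52428800):
    """Return the rows to keep. `lines` are JSONL rows, oldest first, no newlines."""
    # Stage 1: keep only the newest `max_rows` rows.
    kept = lines[-max_rows:] if max_rows > 0 else []
    # Stage 2: drop the oldest remaining rows until the size fits.
    size = sum(len(row.encode("utf-8")) + 1 for row in kept)  # +1 for the newline
    start = 0
    while start < len(kept) and size > max_bytes:
        size -= len(kept[start].encode("utf-8")) + 1
        start += 1
    return kept[start:]
```

Because both stages drop from the front, the newest audit rows always survive.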
Prometheus (GET /metrics)
Build dcc-mcp-http / dcc-mcp-gateway with the prometheus Cargo feature and expose GET /metrics on the gateway listener (see attach_gateway_metrics_route in crates/dcc-mcp-gateway).
Gateway → backend hop failures increment:
dcc_mcp_gateway_backend_errors_total{kind="…"}
kind is a small fixed vocabulary (low cardinality). Typical values:
| kind | When |
|---|---|
| transport | TCP/TLS/DNS errors, timeouts on send |
| unreachable | Readiness probe could not reach the backend |
| booting | /v1/readyz reports not ready |
| http_4xx / http_5xx / http_other | Non-success HTTP from the backend REST hop |
| read_body | Failed to read the HTTP response body |
| invalid_json | Response was not valid JSON where expected |
| jsonrpc_backend | JSON-RPC error object from the backend |
| empty_result | JSON-RPC success without result |
| circuit_open | Local circuit breaker is open for that backend base |
| other | REST string errors that do not match the patterns above |
Other series on the same registry (instance gauges, request histograms, etc.) are documented in crates/dcc-mcp-telemetry/src/prometheus.rs.
Admin dashboard
GET /admin/api/health includes rss_bytes, limits (echoing the env-backed values above, including xff_trusted_depth), and a circuits snapshot (tracked_backends, circuits_open).
Dynamic Capability Index and Bounded Tool Exposure (#652-#657)
For large multi-DCC deployments, the gateway never publishes every backend action directly through tools/list. The removed GatewayToolExposure enum, McpHttpConfig.gateway_tool_exposure, publishes_backend_tools, and --gateway-tool-exposure switch are pre-0.15 concepts. There is now one unconditional surface:
| Surface | What appears in tools/list | Agent workflow |
|---|---|---|
| Gateway MCP | Fixed discover+dispatch primitives: search_skills, load_skill, search_tools, describe_tool, call_tool, call_tools, and pooling tools. Instance registry, diagnostics, catalog, and the agent workflow guide are gateway-native resources (gateway://instances, gateway://diagnostics/*, gateway://catalog, gateway://docs/agent-workflows) read via resources/read, not tools | resources/read uri=gateway://instances (or skip it and go straight to search_tools → describe_tool → call_tool / call_tools). Optional: resources/read uri=gateway://docs/agent-workflows for MCP+resources+efficiency guidance |
| Gateway REST | /v1/search, /v1/load_skill, /v1/unload_skill, /v1/describe, /v1/call, /v1/call_batch, /v1/instances, plus /v1/resources*, /v1/prompts*, and /v1/jobs* | POST /v1/search → optional /v1/load_skill from next_step.arguments → /v1/describe → /v1/call (or POST /v1/call_batch for ordered batches); use resources/prompts/jobs routes for non-tool MCP primitives |
| Direct per-DCC MCP | One DCC server's skills and loaded tools | search_skills → load_skill → tool call |
The gateway capability index stores compact records keyed by <dcc>.<id8>.<tool> and refreshes on demand, so the first agent query after startup or load_skill sees fresh results without a polling delay. The fixed MCP wrappers are cursor-safe and stable:
| Tool | Purpose |
|---|---|
| search_tools | Search compact capability records by query, DCC type, tags, instance, scene hint, and pagination options |
| describe_tool | Fetch the full schema, annotations, and routing record for a selected tool_slug |
| call_tool | Invoke the selected backend capability with validated arguments and optional MCP _meta |
| call_tools | Ordered multi-invocation (max 25 items); optional stop_on_error. REST twin: POST /v1/call_batch |
Use this dynamic-capability flow whenever an agent is connected to the gateway. Use the per-DCC Skills-First flow (search_skills → load_skill → tool call) when the agent is connected directly to one DCC server.
For REST-only clients, POST /v1/search with loaded_only=false returns unloaded hits with a machine-executable next_step. POST that next_step.arguments object to /v1/load_skill, then search or describe again. The same shape is included in MCP search_tools results, so REST and MCP agents can share the same progressive-loading planner.
Gateway call wrapper payloads
call_tool, call_tools, POST /v1/call, and POST /v1/call_batch all share the same wrapper contract:
{
  "tool_slug": "maya.a1b2c3d4.maya_scripting__execute_python",
  "arguments": { "code": "cmds.polySphere()" },
  "meta": { "progressToken": "session-42" }
}

Only tool_slug, arguments, and meta belong at the wrapper top level. Backend-specific fields (code, script, file_path, radius, …) must be inside arguments. Missing / null / empty-string arguments normalize to {}; object roots pass through; object-shaped JSON strings are accepted for connector compatibility; arrays, numbers, booleans, and non-object strings are rejected by dcc-mcp-wire. Host adapters and connectors should reuse dcc_mcp_core.host.normalize_tool_arguments() / normalize_tool_meta() instead of each reimplementing coercion.
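To make the coercion rules concrete, the following Python sketch mirrors them case by case. It is not the dcc_mcp_core.host.normalize_tool_arguments() implementation and should not replace it; the function name normalize_arguments_sketch and the use of ValueError for rejections are assumptions made for illustration.

```python
import json


def normalize_arguments_sketch(arguments):
    """Illustrative version of the wrapper rules for the `arguments` field."""
    # Missing / null / empty string normalise to an empty object.
    if arguments is None or arguments == "":
        return {}
    # Object roots pass through unchanged.
    if isinstance(arguments, dict):
        return arguments
    # Strings are accepted only when they decode to a JSON object.
    if isinstance(arguments, str):
        try:
            parsed = json.loads(arguments)
        except json.JSONDecodeError:
            raise ValueError("arguments string is not valid JSON") from None
        if isinstance(parsed, dict):
            return parsed
        raise ValueError("arguments string must decode to a JSON object")
    # Arrays, numbers, and booleans are rejected.
    raise ValueError(f"arguments must be an object, got {type(arguments).__name__}")
```

For example, the string '{"radius": 2}' is accepted and becomes a dict, while "[1]" and 3 are rejected.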
Resources and Prompts Aggregation (#731, #732, #818)
The gateway also forwards MCP resources and prompts so agents can exchange hand-off artefacts and prompt templates across all live DCC instances without opening per-backend sessions. Since #818 the gateway's backend hop is REST, not backend JSON-RPC: GET /v1/resources, GET /v1/resources/{uri}, GET /v1/prompts, and GET /v1/prompts/{name}?args=<json>.
Resources workflow:
- Call resources/list on the gateway.
- Treat every returned URI as opaque. Gateway-native resources use gateway://instances, gateway://diagnostics/*, and gateway://catalog. Forwarded backend resources use a gateway-routable prefix so reads and subscriptions can find the owning backend.
- resources/list only emits root pointers for gateway-native families; it does not enumerate every gateway://instances/{id} or gateway://catalog/{name}. Read those single-entry URIs directly when you already know the id/name.
- Pass the exact URI returned by resources/list to resources/read, resources/subscribe, or resources/unsubscribe. Do not strip the instance prefix or rebuild URIs manually.
Prompts workflow: use prompts/list on the gateway to browse prompt templates from all live backends, then call prompts/get with the returned namespaced prompt name. Any MCP arguments object is forwarded through the REST args query parameter and rendered by the backend prompt provider. Backend prompt changes are surfaced through notifications/prompts/list_changed.
Code pointers
| Piece | File |
|---|---|
| Subscriber manager, reconnect loop | crates/dcc-mcp-gateway/src/gateway/sse_subscriber.rs |
| Per-session SSE plumbing | crates/dcc-mcp-gateway/src/gateway/handlers/ (handle_gateway_get) |
| tools/call correlation hooks | crates/dcc-mcp-gateway/src/gateway/aggregator.rs / aggregator/ |
| Subscription watcher and runtime tasks | crates/dcc-mcp-gateway/src/gateway/tasks.rs |
Waiting for terminal results from the gateway (#321)
The gateway applies two separate request budgets to an outbound tools/call:
| Case | Timeout | Source |
|---|---|---|
| Sync call (no _meta.dcc.async, no progressToken) | backend_timeout_ms (default 120 s) | McpHttpConfig |
| Async opt-in call (_meta.dcc.async=true or _meta.progressToken) | gateway_async_dispatch_timeout_ms (default 60 s) | McpHttpConfig |
| Async opt-in with _meta.dcc.wait_for_terminal=true | gateway_wait_terminal_timeout_ms (default 10 min) for the wait, gateway_async_dispatch_timeout_ms for the initial queuing step | McpHttpConfig |
Why two timeouts? An async-dispatched tool replies immediately with {status:"pending", job_id:"…"} once the job has been queued on the backend. Under cold-start conditions (Maya re-importing a heavy module, Blender firing up a fresh Python interpreter) even that queuing step can legitimately take >10 s, so the short sync timeout would surface a spurious transport error while the backend is still starting the work.
Response stitching (opt-in)
Clients that cannot consume SSE (plain curl, a batch script, a CI runner) can still get the final result in a single tools/call response by setting _meta.dcc.wait_for_terminal = true alongside _meta.dcc.async = true:
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "maya__bake_simulation",
    "arguments": {...},
    "_meta": {
      "dcc": {"async": true, "wait_for_terminal": true}
    }
  }
}

The gateway now:
- Forwards the call to the backend with the longer gateway_async_dispatch_timeout_ms budget.
- Receives the {pending, job_id} envelope and subscribes to the per-job broadcast bus owned by the SSE subscriber manager.
- Blocks the HTTP response until a notifications/$/dcc.jobUpdated frame with status in {completed, failed, cancelled, interrupted} arrives over the backend's SSE stream, or until gateway_wait_terminal_timeout_ms elapses.
- Merges the terminal status, result, and error into the original pending envelope's structuredContent and returns the resulting CallToolResult. isError is set for any non-completed status.
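The final merge step can be sketched as a small pure function. This is a Python illustration under assumptions, not the gateway's Rust code: stitch_terminal is a hypothetical name, and the exact field layout of the pending envelope and of the jobUpdated params (status, result, error at the top level) is assumed from the description above.

```python
TERMINAL = {"completed", "failed", "cancelled", "interrupted"}


def stitch_terminal(pending_envelope, job_update):
    """Merge a terminal job update into the original pending CallToolResult."""
    status = job_update.get("status")
    if status not in TERMINAL:
        raise ValueError(f"not a terminal status: {status!r}")
    # Copy so the caller's pending envelope is left untouched.
    structured = dict(pending_envelope.get("structuredContent", {}))
    structured["status"] = status
    for key in ("result", "error"):
        if key in job_update:
            structured[key] = job_update[key]
    stitched = dict(pending_envelope)
    stitched["structuredContent"] = structured
    # isError is set for any status other than "completed".
    stitched["isError"] = status != "completed"
    return stitched
```

Fields already present in the pending envelope, such as job_id, are preserved, so a caller can still poll jobs.get_status with the same identifier.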
Timeout semantics
If gateway_wait_terminal_timeout_ms elapses before a terminal event arrives, the gateway returns the last observed job envelope annotated with _meta.dcc.timed_out = true and leaves the job running on the backend. Callers can either reconnect over SSE or keep polling jobs.get_status to collect the eventual result.
Backend disconnect
If the backend SSE stream drops while a waiter is blocked, the gateway returns a JSON-RPC -32000 error identifying the backend and the job_id. The job itself is not cancelled — a subsequent restart of the backend may surface it as interrupted (issue #328) when the persisted job store rehydrates.
Job-to-backend routing cache (#322)
To forward a notifications/cancelled { requestId } from the client to the backend that actually owns the job, the gateway keeps a small cache:
pub struct JobRoute {
    pub client_session_id: ClientSessionId,
    pub backend_id: BackendId,            // e.g. http://127.0.0.1:8001/mcp
    pub tool: String,                     // for logs + cancel payload
    pub created_at: DateTime<Utc>,        // GC anchor
    pub parent_job_id: Option<String>,    // #318 cascade
}
// DashMap<Uuid, JobRoute>

Populated when the backend reply to a tools/call carries a job_id. Consumed by:
- notifications/cancelled { requestId }: the gateway resolves requestId → job_id → JobRoute and POSTs a cancel to backend_id.
- Parent-job cascade: if the cancelled job has a parent_job_id, or is itself a parent, the gateway walks the children_of index and fans the cancel out to every distinct backend_id. That backend may differ from the originating one: #318 only covered single-server cascade, and the gateway extends this across backends.
Lifecycle
- Insert: aggregator::route_tools_call → SubscriberManager::bind_job_route.
- Auto-evict: deliver() removes the route as soon as a $/dcc.jobUpdated with a terminal status (completed, failed, cancelled, interrupted) is observed.
- TTL GC: a background task sweeps routes older than gateway_route_ttl_secs (default 24 h) every 60 s, so a backend crash that never emits a terminal event doesn't leak the route.
- Per-session cap: gateway_max_routes_per_session (default 1 000). When a session is already holding cap live routes, a new dispatch is rejected with JSON-RPC -32005 too_many_in_flight_jobs.
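The four lifecycle rules fit together as in the Python sketch below. It models the behaviour only: RouteCache, its method names, and the tuple-based storage are assumptions for illustration, whereas the real cache is the Rust DashMap shown earlier.

```python
import time

TERMINAL = {"completed", "failed", "cancelled", "interrupted"}


class TooManyInFlightJobs(Exception):
    code = -32005  # JSON-RPC error code quoted in the text


class RouteCache:
    """Hypothetical job_id -> backend routing cache with TTL and per-session cap."""

    def __init__(self, ttl_secs=24 * 3600, max_per_session=1000, clock=time.monotonic):
        self._routes = {}  # job_id -> (session_id, backend_id, created_at)
        self._ttl = ttl_secs
        self._max_per_session = max_per_session
        self._clock = clock

    def bind(self, job_id, session_id, backend_id):
        """Insert a route, enforcing the per-session cap."""
        live = sum(1 for (s, _, _) in self._routes.values() if s == session_id)
        if live >= self._max_per_session:
            raise TooManyInFlightJobs("too_many_in_flight_jobs")
        self._routes[job_id] = (session_id, backend_id, self._clock())

    def on_job_updated(self, job_id, status):
        """Auto-evict as soon as a terminal status is observed."""
        if status in TERMINAL:
            self._routes.pop(job_id, None)

    def sweep(self):
        """TTL GC: drop routes older than the TTL; return how many were removed."""
        cutoff = self._clock() - self._ttl
        stale = [j for j, (_, _, created) in self._routes.items() if created < cutoff]
        for job_id in stale:
            del self._routes[job_id]
        return len(stale)

    def backend_for(self, job_id):
        """Resolve the owning backend for a cancel, or None if unknown."""
        route = self._routes.get(job_id)
        return route[1] if route else None
```

A periodic task would call sweep() on a fixed interval (60 s per the text), and a cancel handler would call backend_for(job_id) before POSTing to the backend.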
Python configuration
from dcc_mcp_core import McpHttpConfig

cfg = McpHttpConfig(
    port=0,
    gateway_route_ttl_secs=3600,  # 1 hour
    gateway_max_routes_per_session=500,
)

Both fields are also accessible as getters/setters on the returned McpHttpConfig instance.
Non-goals
HTTP/2 multiplexing tuning and multi-backend failover for the routing cache (routes are sticky) are out of scope for #320 / #321 / #322.