Automating the Rubber Stamp: What If an Agent Ran Your Deployment Gate?

2026-06-11 · by Matthew van Bird

The car park

Three weeks ago a colleague approved a production deployment from a Tesco car park. The GitHub notification arrived on their phone. They tapped "Review pending deployments", ticked the production environment, typed "lgtm" in the comment box, and hit approve. They had not looked at the staging environment. They had not looked at the e2e run. They had not looked at a single dashboard. They were loading shopping into a car.

I happened to be watching over their shoulder. Then I went through the audit log for the previous month. Thirty-one production approvals. Zero rejections. Median time between "review requested" and "approved": 47 seconds. Nobody reviews anything in 47 seconds. The staging-to-prod gate, the thing we point at when someone asks about our release controls, is a rubber stamp with a pulse attached. Approvals from a car park, a train, and once, memorably, a swimming pool changing room. The audit log records all three as considered decisions.

And it's not just me. Ask around. Every team I've worked with has the same gate and the same 47 seconds. The approval step exists because a compliance checklist somewhere says "human review before production", and the humans have collectively decided, without ever saying it out loud, that the review is a tap on a phone. We've built an industry-wide ritual where the control everyone points at is the control nobody performs.

So an agent should do it instead. Not because a model has better judgement than a human. It doesn't, and we'll get to the spectacular way that assumption fails. But because the human gate has become nothing, and a machine that actually runs the checklist would beat a person who pretends to.

What the gate was supposed to be

Think about what your production gate was meant to check when someone set it up. It was almost certainly a checklist like this:

The e2e suite passed against staging, on this exact build.
The staging error rate hasn't regressed since the deploy landed there.
No open Sev1 incidents.
We're not inside a change freeze.
Someone is around to watch it land.

Read that list again. Every item is checkable. The e2e result is a workflow conclusion in the GitHub API. The error rate is a CloudWatch query. The incidents are in PagerDuty. The freeze calendar is a YAML file in a repo. Not one item requires judgement, taste, or seniority. The human is in the loop because GitHub's required-reviewer box wants a name in it, and because "a person approved it" reads well in an audit.
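To make "checkable" concrete, here is a minimal sketch of the freeze item. It assumes the YAML calendar has already been parsed into a list of windows with ISO 8601 start and end fields. The field names, the sample window, and the in_change_freeze function are all illustrative, not taken from a real repo.

```python
from datetime import datetime, timezone

# Hypothetical freeze calendar, as it might look once the YAML file is parsed.
FREEZE_WINDOWS = [
    {
        "name": "year-end",
        "start": "2026-12-18T00:00:00+00:00",
        "end": "2027-01-04T00:00:00+00:00",
    },
]


def in_change_freeze(now: datetime, windows: list) -> bool:
    """Return True if `now` falls inside any freeze window (end is exclusive)."""
    for window in windows:
        start = datetime.fromisoformat(window["start"])
        end = datetime.fromisoformat(window["end"])
        if start <= now < end:
            return True
    return False


if __name__ == "__main__":
    print(in_change_freeze(datetime.now(timezone.utc), FREEZE_WINDOWS))
```

There is no branch in there that needs seniority. It is a date comparison.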

If the approval is a checklist, the checklist can be automated. The only hard question is what the model should be allowed to decide, and that's the question this whole idea lives or dies on.

The shape of it

Take an ordinary GitHub Actions pipeline: build, deploy to staging, a gated production deploy, a post-live job. The gate on the production environment becomes a custom deployment protection rule backed by a GitHub App. When a run reaches that environment, GitHub parks the run and fires a deployment_protection_rule webhook at you. The webhook lands on API Gateway, a thin Lambda validates the signature, and it invokes an agent running on Bedrock AgentCore Runtime. The agent gathers the evidence, then calls GitHub back on the URL in the webhook payload with approved or rejected, and the run either continues or dies.
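The signature check in that thin Lambda is small enough to sketch. GitHub signs each delivery with an HMAC-SHA256 of the raw request body, keyed with the webhook secret, and sends it in the X-Hub-Signature-256 header as sha256= followed by the hex digest. The function name and the way the secret reaches the function are assumptions here; in practice the secret would come from a secrets store, not a literal.

```python
import hashlib
import hmac
from typing import Optional


def signature_is_valid(secret: bytes, body: bytes, header: Optional[str]) -> bool:
    """Check an X-Hub-Signature-256 header against the raw request body."""
    if not header or not header.startswith("sha256="):
        return False
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, so a mismatch does not leak
    # how many leading characters were correct.
    return hmac.compare_digest(expected.encode(), header.encode())
```

The body has to be the exact bytes GitHub sent. Parsing the JSON and re-serialising it before hashing will change whitespace and break the comparison.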

After the production job finishes, a second webhook (workflow_run completed) wakes the same agent for post-live checks. If those fail, it triggers the rollback workflow and pages whoever is on call.

One gotcha worth a line before you sketch this on a whiteboard: custom deployment protection rules are free on public repos but need GitHub Enterprise on private ones. On a Team plan the fallback is a machine user as a required reviewer and the pending_deployments REST endpoint. Same agent, uglier plumbing.

The AWS pieces

AgentCore is Amazon's answer to "I have an agent in a Python file, now what". It decomposes into parts, and a deployment agent would use most of them:

Runtime hosts the agent itself. Serverless, session-isolated, and a session can run for up to eight hours, which matters later because watching a deploy is a long, boring job and you want one process to do the whole watch.
Gateway turns existing APIs into MCP tools. Point it at an OpenAPI spec for the slice of the GitHub REST API you need and at a Lambda that wraps your CloudWatch queries. The agent sees tools; the credentials and retries live elsewhere.
Identity is the token vault. The GitHub App's credentials sit there, not in an environment variable, and the agent gets short-lived tokens at call time. An agent that can approve production deploys is a credential worth treating properly.
Memory keeps the deployment history: what shipped, what the metrics did, what was rejected and why.
Observability gives you OpenTelemetry traces of every run. When a robot approves a production deploy, "why" needs to be a query, not an archaeology project.

The model is Claude on Bedrock, and the agent itself would be written with Strands, Amazon's agent SDK. AgentCore is framework-agnostic so you could use LangGraph or roll your own loop, but Strands is the path of least resistance on AWS and the whole agent stays small:

import json

from bedrock_agentcore.runtime import BedrockAgentCoreApp
from strands import Agent
from strands.models import BedrockModel

from gate.prompt import GATE_PROMPT
from gate.tools import (
    e2e_result, staging_error_delta, open_sev1_incidents,
    change_freeze_active, approve_gate, reject_gate, page_oncall,
)

app = BedrockAgentCoreApp()

agent = Agent(
    model=BedrockModel(
        model_id="anthropic.claude-sonnet-4-6",
        region_name="eu-west-2",
    ),
    system_prompt=GATE_PROMPT,
    tools=[
        e2e_result, staging_error_delta, open_sev1_incidents,
        change_freeze_active, approve_gate, reject_gate, page_oncall,
    ],
)

@app.entrypoint
def handle(payload):
    # payload is the deployment_protection_rule webhook, forwarded by the Lambda
    return agent(json.dumps(payload))

if __name__ == "__main__":
    app.run()

agentcore configure and agentcore launch from the starter toolkit put that into Runtime. The deployment agent would itself be deployed by a pipeline, which is the kind of sentence you have to stop thinking about after the second reading.

The trap

Here is the system prompt everyone would write on day one, because it's the obvious one:

You are the release manager. Review the evidence from staging and
approve the production deployment if you believe the release is safe.

And here is how it would betray you. A release goes through staging on a Thursday afternoon and the 5xx rate triples. The agent sees it. The agent reports it. And the agent approves the deploy anyway, with something like this in the comment:

> Staging error rate is elevated (0.9% vs 0.3% baseline), but the spike coincides with the scheduled 14:00 load test and is most likely unrelated to this change. Approving.

A confident, well-written, completely reasonable paragraph. Except there is no 14:00 load test. There never was, or it was retired two sprints ago, or it runs on Tuesdays. There is only the bug, and a model doing what models do: producing a plausible explanation in the absence of understanding. This is the single most documented failure mode of large language models, pointed directly at your production environment. A human in a hurry writes "lgtm". A model in no hurry at all writes three sentences of convincing fiction around the same wrong answer, and the fiction is the dangerous part, because an approval comment that cites a specific load test reads like diligence. Nobody questions it. The deploy goes out at 16:50 on a Thursday and your evening belongs to the incident channel.

I wrote a year ago that AI coding tools are a fast junior, not a replacement for judgement. The deployment version of that lesson: never, under any circumstances, let the model own the verdict.

Policy as code, agent as clerk

The fix is structural. You can't ask a model nicely to stop hallucinating. What you can do is take the decision away from it entirely. Every check becomes a tool that computes its own pass or fail in plain, deterministic Python, and returns a structured verdict:

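from strands import tool

# cloudwatch_5xx_rate, since_deploy and weekly_baseline are project helpers, not shown here.
@tool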
def staging_error_delta(run_id: int) -> dict:
    """Compare the staging 5xx rate in the 30 minutes since this build
    deployed to staging against the same window over the prior 7 days."""
    current = cloudwatch_5xx_rate("staging", since_deploy(run_id), minutes=30)
    baseline = cloudwatch_5xx_rate("staging", weekly_baseline(run_id), minutes=30)
    return {
        "check": "staging_error_delta",
        "pass": current <= baseline * 1.5,
        "current_pct": current,
        "baseline_pct": baseline,
        "threshold": "1.5x baseline",
    }

The threshold lives in code review, not in a model's mood. And the system prompt stops asking for an opinion:

Run every gate check tool exactly once. Then:

- If every check returned pass: true, call approve_gate with a one-line
  summary of the evidence.
- If any check returned pass: false, call reject_gate. Quote the failing
  check's numbers in the comment. Do not interpret, excuse, or explain
  away a failing check. A failing check is a rejection.
- If any check errored or returned incomplete data, call page_oncall
  and stop. Do not approve on partial evidence.

You do not have an opinion on whether the release is safe. The checks do.

Decision flow: four deterministic checks (e2e suite, error-rate delta, open Sev1s, change freeze) each emit pass or fail and feed a fixed policy. All pass leads to approve_gate. Any fail leads to reject_gate quoting the numbers. An errored or incomplete check leads to page_oncall. The model assembles evidence and writes the comments; it never decides.
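That fixed policy fits in a dozen lines. A sketch, assuming each check returns a dict with a check name and a boolean pass field as in the example above. The decide function and the REQUIRED_CHECKS set are illustrative, and so is one ordering choice: a failing check rejects even if another check is missing, because the prompt treats any failure as a rejection.

```python
REQUIRED_CHECKS = {
    "e2e_result",
    "staging_error_delta",
    "open_sev1_incidents",
    "change_freeze_active",
}


def decide(results: list) -> str:
    """Map check results to 'approve', 'reject' or 'page_oncall'."""
    by_name = {r.get("check"): r for r in results}

    # Any explicit failure is a rejection, whatever else happened.
    if any(r.get("pass") is False for r in by_name.values()):
        return "reject"

    # Approve only when every required check is present with pass == True.
    # A missing check, or a pass value that is not a bool, is incomplete evidence.
    complete = REQUIRED_CHECKS.issubset(by_name) and all(
        by_name[name].get("pass") is True for name in REQUIRED_CHECKS
    )
    return "approve" if complete else "page_oncall"
```

Running this outside the model, and having the gate tools refuse any verdict that disagrees with it, would be one way to make the prompt's rules enforceable instead of advisory.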

So what's left for the model? Everything that isn't the verdict, which turns out to be most of the work. It orchestrates the checks, handles the webhook plumbing, writes a rejection comment that quotes the actual numbers instead of "checks failed", and assembles a deploy narrative a human can read in ten seconds. The approval itself is one API call back to GitHub:

# github is assumed to be an authenticated HTTP client for the GitHub App (setup not shown).
@tool
def approve_gate(callback_url: str, environment: str, comment: str) -> dict:
    """Approve the pending production deployment gate."""
    return github.post(callback_url, json={
        "environment_name": environment,
        "state": "approved",
        "comment": comment,
    })

The workflow side needs nothing exotic. The gate is just an environment reference, and the run parks there until the agent answers:

jobs:
  deploy-staging:
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - run: ./deploy.sh staging

  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production   # the protection rule fires here
    steps:
      - run: ./deploy.sh production

  post-live:
    needs: deploy-production
    runs-on: ubuntu-latest
    steps:
      - run: ./trigger-post-live.sh

Post-live

Be honest about what happens after your deploys go out. The deploy job goes green, someone glances at a dashboard if they remember, and everyone goes back to what they were doing. The "watch it land" item on the checklist is the most fictional one of all. Nobody watches anything land at 17:30.

This is where the agent earns its keep, because this is the part no human was ever doing, car park or not. When the workflow_run completed event arrives, the agent starts a watch: it kicks the CloudWatch Synthetics canaries (a login, a checkout, whichever journeys pay your bills), then compares the production error rate and p95 latency against the same 30-minute window a week earlier, sampling every five minutes. This is what Runtime's long sessions are for: one invocation owns the whole watch instead of a chain of cron jobs passing state through a bucket.
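The watch itself is a loop with a threshold in it. A sketch, with several assumptions: six samples at five-minute intervals to cover the 30 minutes, the same 1.5x ratio as the staging check, and a sample callable standing in for whatever wraps the CloudWatch queries. None of these names come from a real library.

```python
import time
from typing import Callable


def watch(
    sample: Callable[[], dict],
    baseline: dict,
    *,
    samples: int = 6,
    interval_s: float = 300,
    max_ratio: float = 1.5,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Sample production metrics at fixed intervals; stop at the first problem."""
    for n in range(1, samples + 1):
        sleep(interval_s)
        current = sample()
        for metric, base in baseline.items():
            value = current.get(metric)
            if value is None:
                # No data is not a pass: hand it to a human.
                return {"verdict": "incomplete", "metric": metric, "sample": n}
            if value > base * max_ratio:
                return {
                    "verdict": "breach",
                    "metric": metric,
                    "sample": n,
                    "current": value,
                    "baseline": base,
                }
    return {"verdict": "ok", "samples": samples}
```

Passing sleep in as a parameter keeps the loop testable without waiting half an hour.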

If a threshold breaches, the agent fires workflow_dispatch on the rollback workflow, pages on-call, and writes the incident summary before the human has unlocked their laptop. Think about the Friday 17:40 deploy that goes subtly wrong, the one where the graphs drift for twenty minutes before anyone looks. The agent is eleven minutes into its watch when the error rate crosses the line, and the rollback is running before the first customer tweet. The rollback decision is the same deterministic shape as the gate: thresholds in code, model as clerk.
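Firing the rollback is a single REST call to GitHub's workflow dispatch endpoint. The sketch below only builds the request, so it can be inspected without touching a real repo. The owner, repo, and workflow file name are placeholders, the token would come from AgentCore Identity, and the rollback workflow itself has to declare a workflow_dispatch trigger for this to work.

```python
import json
import urllib.request


def rollback_request(
    owner: str, repo: str, workflow_file: str, ref: str, token: str
) -> urllib.request.Request:
    """Build the POST that triggers a workflow_dispatch run on the given ref."""
    url = (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/actions/workflows/{workflow_file}/dispatches"
    )
    body = json.dumps({"ref": ref}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


# Sending it is one more line, left out here on purpose:
# urllib.request.urlopen(rollback_request("acme", "shop", "rollback.yml", "main", token))
```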

If the watch passes, it writes a record to Memory and posts a summary on the deployment, something like:

> Deploy 4123 live for 30 minutes. Canaries green (login 1.2s, checkout 2.1s). Error rate 0.02% vs 0.03% baseline. p95 412ms vs 405ms. One new log signature: WARN PriceCache fallback, 14 occurrences, non-fatal.

That last line is the sleeper feature. No human post-deploy check ever run in the history of software would have surfaced a new warning signature with a count attached. The model doesn't decide anything there. It notices something, which is the job we keep pretending the approver is doing.
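Even the noticing can start out deterministic, with the model only writing the sentence. A rough sketch, assuming log lines are available as plain strings for a window before and after the deploy. The normalisation here is deliberately crude (every run of digits becomes a placeholder), and the function names are mine.

```python
import re
from collections import Counter


def signature(line: str) -> str:
    """Collapse digit runs so lines that differ only by ids or counts group together."""
    return re.sub(r"\d+", "<n>", line.strip())


def new_signatures(before: list, after: list) -> dict:
    """Return signatures seen after the deploy but not before, with their counts."""
    known = {signature(line) for line in before}
    counts = Counter(signature(line) for line in after)
    return {sig: n for sig, n in counts.items() if sig not in known}
```

Feed it the previous day's logs as before and the watch window as after, and anything it returns is a candidate for that last line of the summary.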

What you'd actually get

Run the thought experiment forward. Every deploy gets every check, every time, including the 17:40 Friday one and the one during the change freeze everybody forgot was in effect. The rejections come with the failing numbers quoted, not a shrug. The e2e failure that a hurried human waves through as "probably flaky" gets rejected, because the agent doesn't have a dinner reservation. And when CloudWatch returns no data because someone renamed a metric, the agent refuses to approve on partial evidence and wakes a person, which is the move the 47-second human never makes.

The cost would be pennies per deploy. The real spend is the plumbing: the GitHub App, the OpenAPI spec for Gateway, the thresholds, the rollback workflow. Expect the agent code to be the smallest file in the repo, and treat that as the health check. If your version of this build is 80% model and 20% checks, you've built the trap version, the one that invents load tests.

The remaining risk isn't the agent approving badly. The checks decide, and the checks are in code review. The risk is the one I keep writing about: in a year, will anyone on the team still know why the error-delta threshold is 1.5x, or will it be another number nobody owns, guarded by a robot nobody questions? The gate can be automated. The understanding can't be.

Closing

The uncomfortable part of this idea is that it doesn't propose removing human judgement from the deployment process. It points out that there's none left to remove. Thirty-one approvals, zero rejections, 47 seconds. A clerk that runs every check every time, writes down exactly what it saw, and cannot be talked into a yes is a lower bar than the release process we all claim to have, and a much higher one than the one we actually have.