A ranking engineer at Meta spends most of their day inside a loop. Look at a metric that moved the wrong way. Form a hypothesis about why. Write a feature or tweak a model. Ship it behind an A/B experiment. Wait. Read the readout. Decide. Repeat.
That loop is mostly mechanical. The judgement lives at the edges, in the hypothesis and the decision. Everything between is plumbing: pulling data, writing boilerplate feature code, launching the experiment with the right safety config, parsing the readout dashboard.
Meta's internal Ranking Engineer Agent automates the plumbing and takes a first pass at the judgement. It runs the whole loop on its own, generating hypotheses and testing them, but it does so inside a fence of hard guardrails. The autonomy is the headline. The guardrails are the actual engineering.
What "Autonomous Experimentation" Actually Means Here
The agent is not a chatbot that suggests ideas. It is a closed-loop system with a goal (improve a target metric without regressing guarded metrics) and the authority to act on a production ranking system, within limits.
A single autonomous cycle looks like this:
1. Observe. Read the current state: target metric, guarded metrics, recent experiments, feature catalog.
2. Hypothesize. Generate candidate explanations and interventions ("watch-time dipped for short-form because the freshness feature decays too fast; try a slower decay").
3. Rank hypotheses. Score candidates by expected impact, cost, and risk. Pick the most promising one that passes the guardrails.
4. Implement. Write the feature/model change as code against the ranking stack.
5. Guard-check. Validate the change against static guardrails before anything ships (blast radius, forbidden features, metric guards).
6. Experiment. Launch a small, capped A/B test with automatic kill-switches.
7. Read out. Parse the experiment results, including statistical significance and guarded-metric movement.
8. Decide. Ship, iterate, or discard. Then loop back to step 1.
The key shift from a human workflow: steps 1, 4, 5, 6, and 7 are fully mechanized, while steps 2, 3, and 8 are done by the agent, with a human approval gate before anything graduates from experiment to production.
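The cycle can be sketched as a single function that wires the eight steps together. This is a minimal illustration, not Meta's implementation: the names `Readout`, `Decision`, `decide`, and `run_cycle` are hypothetical, the step functions are passed in as plain callables, and the decision rule (discard on any guarded regression, ship only on a significant positive move) is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Decision(Enum):
    SHIP = "ship"        # still needs human approval before production
    ITERATE = "iterate"
    DISCARD = "discard"


@dataclass
class Readout:
    target_delta: float    # relative change in the target metric
    significant: bool      # did the change pass the significance test?
    guard_regressed: bool  # did any guarded metric regress past its threshold?


def decide(readout: Readout) -> Decision:
    """Step 8 (assumed rule): map a structured readout to a decision."""
    if readout.guard_regressed:
        return Decision.DISCARD
    if readout.significant and readout.target_delta > 0:
        return Decision.SHIP
    if readout.significant:
        return Decision.DISCARD   # significant, but in the wrong direction
    return Decision.ITERATE       # inconclusive: refine and try again


def run_cycle(
    observe: Callable[[], dict],
    propose: Callable[[dict], list],
    pick: Callable[[list], Optional[str]],
    implement: Callable[[str], str],
    guard_check: Callable[[str], bool],
    experiment: Callable[[str], Readout],
) -> Optional[Decision]:
    """One pass through steps 1-8. Returns None if nothing could be tested."""
    state = observe()                 # 1. observe
    candidates = propose(state)       # 2. hypothesize
    chosen = pick(candidates)         # 3. rank and select
    if chosen is None:
        return None
    change = implement(chosen)        # 4. implement
    if not guard_check(change):       # 5. guard-check before anything launches
        return None
    readout = experiment(change)      # 6-7. run the capped test, read it out
    return decide(readout)            # 8. decide


# Usage with stub steps standing in for the real systems.
result = run_cycle(
    observe=lambda: {"watch_time_delta": -0.02},
    propose=lambda state: ["slower freshness decay"],
    pick=lambda cands: cands[0] if cands else None,
    implement=lambda hyp: f"patch for: {hyp}",
    guard_check=lambda change: True,
    experiment=lambda change: Readout(0.01, True, False),
)
```

Note that `Decision.SHIP` here is only a recommendation. In the design described above, graduation still waits on a human.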
Architecture
The system is best read as a control loop wrapped in a policy layer. The inner loop does the experimentation. The outer policy layer (the guardrails) decides what the inner loop is allowed to do at every step.
                           RANKING ENGINEER AGENT
============================================================================
 +--------------------------------------------------------------------+
 |                       GUARDRAIL POLICY LAYER                       |
 |  (evaluated BEFORE every action -- nothing acts without passing)   |
 |                                                                    |
 |  [blast-radius cap]   [metric guards]   [forbidden-feature list]   |
 |  [budget / quota]     [rate limits]     [human-approval gate]      |
 +--------------------------------------------------------------------+
       ^                ^                ^                 ^
       | check          | check          | check           | check
       |                |                |                 |
  .----+-------.   .----+-------.   .----+-------.   .-----+--------.
  |  OBSERVE   |   | HYPOTHESIZE|   | IMPLEMENT  |   |  EXPERIMENT  |
  |            |   |            |   |            |   |              |
  | metrics    |-->| LLM core   |-->| codegen    |-->| capped A/B   |
  | feature    |   | + ranker   |   | + tests    |   | + kill-switch|
  | catalog    |   | (score by  |   | + lint     |   | + auto-      |
  | past exps  |   |  impact /  |   |            |   |   rollback   |
  |            |   |  risk/cost)|   |            |   |              |
  '------------'   '------------'   '------------'   '-----+--------'
       ^                                                   |
       |                                                   v
       |                                             .-----+------.
       |                                             |  READ OUT  |
       |                                             |            |
       |                                             | sig test   |
       |                                             | guarded    |
       |                                             | metric     |
       |                                             | delta      |
       |                                             '-----+------'
       |                                                   |
       |                                                   v
       |                                             .-----+------.
       |                      discard / iterate      |   DECIDE   |
       +---------------------------------------------|            |
       |                                             | ship?      |
       |                                             | iterate?   |
       |                                             | discard?   |
       |                                             '-----+------'
       |                                                   |
       |                                        ship (needs HUMAN gate)
       |                                                   v
       |                                             .-----+------.
       |                                             |  GRADUATE  |
       +---------------------------------------------|  to prod   |
  loop back to OBSERVE                               '------------'
============================================================================
Every arrow into an action node is intercepted by the guardrail layer.
A failed guardrail check aborts the action and feeds back as a constraint.
The inner loop components
- Observation interface. Read-only adapters over the metrics store, the experiment registry, and the feature catalog. This is what grounds the agent in reality instead of hallucination: every hypothesis must reference real features and real metric movements.
- Hypothesis engine. An LLM core that proposes interventions, plus a separate ranker that scores each candidate on expected impact, implementation cost, and risk. The ranker matters more than the generator: a generator that produces a hundred ideas is useless without a disciplined way to pick the two worth testing.
- Implementation layer. Codegen against the ranking stack, with mandatory unit tests and lint. The agent does not get to ship code that doesn't compile or that skips its own tests.
- Experiment harness. Launches A/B tests through the same internal experimentation platform humans use, but with safety config injected: small traffic allocation, kill-switches wired to guarded metrics, and a fixed maximum duration.
- Readout parser. Turns the experiment dashboard into a structured verdict: did the target metric move, was it significant, did any guarded metric regress.
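To make the ranker's role concrete, here is a small sketch of scoring candidates on impact, cost, and risk and keeping only the best few. The scoring formula, the `max_risk` ceiling, and the names `Hypothesis`, `score`, and `top_candidates` are all assumptions for illustration; the source does not say how the real ranker combines the three signals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    description: str
    expected_impact: float  # estimated relative lift on the target metric
    cost: float             # estimated implementation effort, must be > 0
    risk: float             # estimated chance of hurting a guarded metric, in [0, 1]


def score(h: Hypothesis) -> float:
    """Assumed rule: impact discounted by risk, per unit of cost."""
    return h.expected_impact * (1.0 - h.risk) / h.cost


def top_candidates(hypotheses, k=2, max_risk=0.5):
    """Drop candidates above the risk ceiling, then keep the k best by score."""
    eligible = [h for h in hypotheses if h.risk <= max_risk]
    return sorted(eligible, key=score, reverse=True)[:k]


candidates = [
    Hypothesis("slower freshness decay", expected_impact=0.02, cost=1.0, risk=0.1),
    Hypothesis("new embedding model", expected_impact=0.05, cost=10.0, risk=0.3),
    Hypothesis("aggressive boost", expected_impact=0.10, cost=1.0, risk=0.9),
]
picked = top_candidates(candidates)
```

In this toy example the highest-impact idea never reaches an experiment because its risk is over the ceiling, which is the kind of discipline the generator alone does not provide.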
The Guardrails Are the Product
On a system that ranks content for billions of people, an unconstrained autonomous agent is a liability, not an asset. The guardrails are what make the autonomy shippable. They fall into a few categories.
Blast-radius caps. Any experiment the agent launches is hard-limited to a tiny fraction of traffic. The agent cannot allocate more, ever. The cap is enforced by the platform, not by the agent's own judgement, so a reasoning error can't widen the blast radius.
Metric guards. A set of protected metrics (integrity, time-well-spent, ad load, latency) that may never regress beyond a threshold. If a guarded metric crosses the line mid-experiment, the kill-switch fires and the experiment auto-rolls-back without waiting for the agent to notice.
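A kill-switch check of this kind might look like the sketch below. It assumes each guard carries a maximum tolerated relative regression and a direction; the `MetricGuard` type, the function names, and the numbers in the usage example are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricGuard:
    name: str
    max_regression: float          # largest tolerated relative regression, e.g. 0.01 = 1%
    higher_is_better: bool = True  # latency-style metrics set this to False


def regression(guard: MetricGuard, control: float, treatment: float) -> float:
    """Relative regression of treatment vs. control; positive means worse.

    Assumes the control value is non-zero.
    """
    change = (treatment - control) / control
    return -change if guard.higher_is_better else change


def breached_guards(guards, control, treatment):
    """Return the names of guards whose regression exceeds their threshold."""
    return [
        g.name
        for g in guards
        if regression(g, control[g.name], treatment[g.name]) > g.max_regression
    ]


guards = [
    MetricGuard("latency_ms", max_regression=0.05, higher_is_better=False),
    MetricGuard("time_well_spent", max_regression=0.01),
]
control = {"latency_ms": 100.0, "time_well_spent": 50.0}
treatment = {"latency_ms": 110.0, "time_well_spent": 49.9}

# A non-empty result is the signal to roll back, with no agent in the loop.
breached = breached_guards(guards, control, treatment)
```

The point of the design is that this check runs in the platform on a schedule, so the rollback does not depend on the agent reading its own dashboard.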
Forbidden-feature lists. Certain features and signals are off-limits to the agent for legal, privacy, or integrity reasons. The implementation layer refuses to generate code that touches them.
Budget and rate limits. A cap on how many experiments the agent can run per unit time and how much compute it can spend. This bounds both cost and the rate at which mistakes can compound.
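One simple way to cap launches per unit time is a sliding-window counter, sketched here. The class name `ExperimentBudget` and the limits used are illustrative assumptions, and the sketch covers only the launch count, not compute spend.

```python
from collections import deque


class ExperimentBudget:
    """Sliding-window cap on experiment launches (limits are illustrative)."""

    def __init__(self, max_launches: int, window_seconds: float):
        self.max_launches = max_launches
        self.window_seconds = window_seconds
        self._launch_times = deque()

    def try_launch(self, now: float) -> bool:
        """Record a launch at time `now` (in seconds) if the budget allows it."""
        # Forget launches that have aged out of the window.
        while self._launch_times and now - self._launch_times[0] >= self.window_seconds:
            self._launch_times.popleft()
        if len(self._launch_times) >= self.max_launches:
            return False  # over budget: the caller should queue or throttle
        self._launch_times.append(now)
        return True


budget = ExperimentBudget(max_launches=2, window_seconds=60.0)
outcomes = [budget.try_launch(t) for t in (0.0, 10.0, 20.0, 60.0)]
```

Passing the clock in as `now` keeps the limiter deterministic and easy to test. A production version would read time from a trusted source rather than from the caller.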
Human approval gate. This is the most important one. The agent can run experiments and read them out autonomously, but graduating a change from experiment to production requires a human sign-off. The agent does the loop; a human owns the irreversible step.
The design principle underneath all of these: enforce guardrails outside the agent's reasoning. A guardrail that the agent can talk itself past is not a guardrail. Blast-radius caps, kill-switches, and forbidden-feature checks all live in the policy layer and the platform, where the agent's outputs are inputs to be validated, not commands to be obeyed.
AGENT PROPOSES        POLICY LAYER DECIDES        RESULT
--------------        --------------------        ------
"ship to 5%"     -->  cap = 1% (hard)        -->  clamped to 1%
"use feature X"  -->  X on forbidden list    -->  rejected, fed back
"run 50 exps"    -->  budget = 6 / day       -->  queued / throttled
guarded metric   -->  threshold breached     -->  auto-rollback, no ask
"graduate this"  -->  needs human approval   -->  paused for review
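The proposal-to-ruling pattern can be sketched as a pure function that treats the agent's output as data to validate. Everything named here (`Proposal`, `Ruling`, `evaluate`, the `"feature_x"` entry) is hypothetical, and the 1% cap is borrowed from the illustrative table rather than from any confirmed configuration. The budget and kill-switch rows are left out to keep the sketch short.

```python
from dataclasses import dataclass

TRAFFIC_CAP = 0.01                             # 1% hard cap, as in the table
FORBIDDEN_FEATURES = frozenset({"feature_x"})  # placeholder entry


@dataclass(frozen=True)
class Proposal:
    action: str                       # "launch" or "graduate"
    traffic_fraction: float = 0.0
    features: frozenset = frozenset()
    human_approved: bool = False


@dataclass(frozen=True)
class Ruling:
    allowed: bool
    traffic_fraction: float = 0.0
    reason: str = ""


def evaluate(p: Proposal) -> Ruling:
    """Validate a proposal; the agent's output is an input here, not a command."""
    banned = p.features & FORBIDDEN_FEATURES
    if banned:
        return Ruling(False, reason=f"forbidden features: {sorted(banned)}")
    if p.action == "graduate":
        if not p.human_approved:
            return Ruling(False, reason="paused for human review")
        return Ruling(True, reason="approved by a human")
    if p.action == "launch":
        # Clamp rather than reject: the agent cannot widen the blast radius.
        clamped = min(p.traffic_fraction, TRAFFIC_CAP)
        return Ruling(True, traffic_fraction=clamped, reason="within cap")
    return Ruling(False, reason=f"unknown action: {p.action}")


ruling = evaluate(Proposal("launch", traffic_fraction=0.05))
```

Because `evaluate` never consults the agent's reasoning, only its structured proposal, there is nothing for the agent to argue with.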
Why This Shape Wins
The agent does not try to be smarter than the experimentation system. It tries to be faster at the loop while the system stays the source of truth for safety. Three properties make it work:
- Grounding over generation. Every hypothesis is anchored to observed metrics and the real feature catalog, so the LLM core proposes inside a constrained space rather than free-associating.
- External enforcement. Guardrails live in the platform, not in the prompt. The agent cannot reason its way around a hard cap.
- Reversibility by default. Everything the agent does on its own is small, capped, and auto-rollback-able. The one irreversible action, graduating to production, is gated on a human.
That is the general recipe for autonomous agents in high-stakes systems: let the agent own the fast, reversible inner loop, and put hard, externally-enforced fences around every action that could cause real harm. The autonomy gets the headlines. The guardrails are why it ships.