0 · Orchestration model (the manager agent)
A single orchestrator coordinates everything as a manager, not a worker. The defining rule:
it decides what needs to happen and which agent does it, and it never does the work itself. It writes no code,
crunches no numbers, writes no slides, and reviews nothing.
- Delegates, doesn't perform. Dispatches the data tools, the Drafter, the Renderer; writes nothing into the deck or fact store.
- Commissions reviews. It never forms its own opinion of quality. It sends the Reviewer (L4) to produce the ranked weak-parts list, then a Reviser (L6) to fix what was found.
- Owns control flow, not content. It reads compact signals and routes the next move. Every arrow in §1 is an orchestrator decision.
0a · Context discipline
- Heavy artifacts live outside its context: decks, filings, parsed tables pass by reference.
- Workers return summaries, not dumps: counts, scores, pass/fail, a short reason, a handle.
- State lives in the persisted files (§7), not the prompt: context doesn't grow with iteration count.
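The "summaries, not dumps" rule can be sketched as a small return type. This is a minimal illustration, not a fixed schema: `WorkerResult`, `to_signal`, and the 200-character cap are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field

MAX_REASON_CHARS = 200  # assumed cap; the spec only says "a short reason"


@dataclass(frozen=True)
class WorkerResult:
    """Compact signal a worker returns to the orchestrator (hypothetical shape)."""
    worker: str                  # who produced it, e.g. "T5 Fact-check"
    passed: bool                 # pass/fail
    reason: str                  # short human-readable reason
    handle: str                  # reference to the heavy artifact, never the artifact itself
    counts: dict[str, int] = field(default_factory=dict)
    scores: dict[str, float] = field(default_factory=dict)


def to_signal(result: WorkerResult) -> str:
    """Render the one-line signal the orchestrator reads; long reasons are truncated."""
    status = "PASS" if result.passed else "FAIL"
    reason = result.reason[:MAX_REASON_CHARS]
    return f"{result.worker}: {status} | {reason} | handle={result.handle}"
```

The orchestrator only ever sees the handle string, so a 200-slide deck costs it the same context as a 5-slide one.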
1 · State machine
flowchart TD
P["Human prompt"] --> SC["L1 Scope Builder"]
SC --> RB["L2 Rubric Builder (R1+R2, versioned)"]
RB --> D0["T1–T4 Data pull · EDGAR/parse/analysis → fact store"]
D0 --> DR["First draft (Slide-create)"]
DR --> R1{"R1 gate: hard gates PASS + composite ≥ entry bar?"}
R1 -->|no, rebuilds left| DR
R1 -->|budget exhausted| CE["Escalate: can't clear entry bar"]
R1 -->|yes| EP
subgraph L["R2 refinement cycle · up to 2x/day"]
EP["Re-baseline epoch (rubric_version + data_version)"] --> RN["Render"]
RN --> FC["Deterministic fact-check: every figure ↔ fact store"]
FC --> SCO["L4 Scorer (LLM judge) · R2 per-dim + evidence"]
SCO --> G{"Accept? invariants hold AND no protected regression AND composite improves within epoch"}
G -->|accept| PS["Update working best-so-far + history + changelog"]
G -->|reject| RO["Rollback to best-so-far"]
PS --> ST{"Stop run? score ≥ ship bar OR margin < ε for N OR max iters"}
RO --> ST
ST -->|keep going| PL["L5 Planner → work order (target dim + action, chosen fresh each lap)"]
PL -->|refine numbers| D1["T1–T4 re-pull/re-run → bumps data_version"] --> BU
PL -->|refine story/visual/trace| BU["L6 Reviser executes work order"]
BU --> RN
ST -->|run done| FZ
end
FZ["Freeze best as immutable version vN"] --> NO["T9 Notifier: email/message + status + score delta + link"]
NO --> G2{"Publish to C-suite? (human approves)"}
G2 -->|approved| DP["Publish vN externally"]
G2 -->|not yet| HO["vN kept as internal version"]
CE --> NO
Two rubrics, two jobs. R1 is a one-time acceptance gate. R2 is the
per-iteration delta rubric. The Score → Plan → Revise sequence mirrors a research desk: a senior reviewer
marks up the deck, the work is triaged into one task, and an analyst does exactly that rework.
The Plan step is dynamic, not scripted. Each lap the Planner reads this lap's
comments and routes to whichever action the deck needs most, including re-pulling an analysis, rewriting a section,
or fixing one graphic. Two consecutive laps can take completely different actions.
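The dynamic Plan step could look roughly like the sketch below. Everything here is an assumption made for illustration: the shape of a weak-parts entry (`rank`, `dim`, `comment`, `needs_new_data`), the action names, and the dimension-to-action mapping are not defined by this spec.

```python
from dataclasses import dataclass

# Hypothetical mapping from a flagged R2 dimension to a Reviser action.
ACTION_FOR_DIM = {
    "actionability": "refine_story",
    "executive_altitude": "refine_story",
    "clean_story": "refine_story",
    "visual_digestibility": "refine_visual",
    "metric_traceability": "refine_trace",
}


@dataclass(frozen=True)
class WorkOrder:
    target_dim: str
    action: str
    note: str


def plan(weak_parts: list[dict]) -> WorkOrder:
    """Triage this lap's ranked weak-parts list into exactly one work order."""
    if not weak_parts:
        raise ValueError("nothing to plan: the weak-parts list is empty")
    top = min(weak_parts, key=lambda part: part["rank"])  # rank 1 = weakest part
    # A comment that needs fresh figures routes to the data tools, not the Reviser.
    if top.get("needs_new_data"):
        action = "refine_numbers"
    else:
        action = ACTION_FOR_DIM[top["dim"]]
    return WorkOrder(target_dim=top["dim"], action=action, note=top["comment"])
```

Because `plan` reads only the current lap's list, nothing carries over between laps, which is what lets two consecutive laps take unrelated actions.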
2 · How the rubrics are built
- Derive dimensions from the brief and domain invariants (in finance: every number sourced → hard gates).
- Anchor each dimension with concrete descriptors at each scale point.
- Assign weights in the rubric file, so they tune without code changes.
- Calibrate against gold examples until the LLM judge lands within ±1 of human scores.
- Validate & version: require judge↔human agreement before the rubric drives the loop.
Calibration is not optional. Without it, R2's "improvement" is just judge variance.
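One way to make the ±1 calibration check concrete is an agreement rate over the gold examples. This is a sketch under assumptions: the function names and the 0.9 pass threshold are hypothetical, since the text does not say what share of examples must land within ±1.

```python
def within_one_rate(judge: list[int], human: list[int]) -> float:
    """Share of gold examples where the judge lands within ±1 of the human score."""
    if not judge or len(judge) != len(human):
        raise ValueError("need two non-empty score lists of equal length")
    hits = sum(abs(j - h) <= 1 for j, h in zip(judge, human))
    return hits / len(judge)


def is_calibrated(judge: list[int], human: list[int], min_rate: float = 0.9) -> bool:
    """True when agreement meets the (assumed) minimum rate."""
    return within_one_rate(judge, human) >= min_rate
```

For example, judge scores `[3, 4, 2, 5, 1]` against human scores `[3, 5, 4, 5, 1]` agree on four of five examples, so the rate is 0.8 and the rubric would not yet pass a 0.9 threshold.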
3 · R1, Initial-Output Rubric (acceptance gate)
3a · Hard gates (binary, all must PASS)
| Gate | Pass condition |
| --- | --- |
| Source-of-truth | 100% of figures resolve to a fact-store entry with a citation. |
| Scope coverage | Every required section from the brief is present and non-empty. |
| Structural validity | Deck renders; no broken slides, no placeholder text. |
3b · Graded baseline (0 to 5; composite must clear the entry bar)
| Dimension | Weight | "5" looks like |
| --- | --- | --- |
| Thesis clarity | 0.30 | thesis up front, every section ladders to it |
| Coverage depth | 0.25 | drivers + risks + catalysts quantified |
| Narrative flow | 0.20 | story builds; each slide earns the next |
| Visual baseline | 0.15 | one idea per slide, charts where numbers belong |
| Actionability | 0.10 | clear, decision-relevant recommendation |
Entry bar: composite ≥ 3.2 / 5 AND all hard gates PASS.
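The R1 gate can be sketched as follows, using the weights and entry bar from the tables above. The snake_case keys and function names are illustrative assumptions.

```python
HARD_GATES = ("source_of_truth", "scope_coverage", "structural_validity")

R1_WEIGHTS = {
    "thesis_clarity": 0.30,
    "coverage_depth": 0.25,
    "narrative_flow": 0.20,
    "visual_baseline": 0.15,
    "actionability": 0.10,
}
ENTRY_BAR = 3.2


def r1_composite(scores: dict[str, float]) -> float:
    """Weighted 0-to-5 composite over the graded baseline dimensions."""
    return sum(weight * scores[dim] for dim, weight in R1_WEIGHTS.items())


def clears_r1(gates: dict[str, bool], scores: dict[str, float]) -> bool:
    """All hard gates must PASS and the composite must reach the entry bar."""
    # A gate that was never evaluated counts as a failure, not a pass.
    gates_pass = all(gates.get(gate, False) for gate in HARD_GATES)
    return gates_pass and r1_composite(scores) >= ENTRY_BAR
```

A draft scoring 4 on thesis clarity and 3 everywhere else has a composite of about 3.3, so it clears the bar only if every hard gate also passes. A perfect composite never rescues a failed gate.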
4 · R2, Loop Improvement Rubric
4a · Protected invariants (regression = auto-reject)
| Invariant | Measured by | Rule |
| --- | --- | --- |
| Truth | deterministic fact-check (not the judge) | Every figure traces to the fact store. Mismatch → reject. |
| Scope adherence | judge vs. brief | Required sections present. Falling below the R1 scope-coverage gate → reject. |
| Story integrity | judge, coherence check | The thesis through-line must not break. |
| Visual integrity | judge + render lint | No new walls of text, no chart that carried a number removed. |
4b · Improvement targets (the loop hill-climbs the composite)
| Dimension | Weight | "5" looks like |
| --- | --- | --- |
| Actionability | 0.25 | every section ends in a decision-relevant takeaway |
| Executive altitude | 0.20 | "so what" up front; right abstraction for a board; anticipates exec questions |
| Visual digestibility | 0.20 | each slide graspable in <10s |
| Clean & concise story | 0.20 | tight through-line, no redundant slides |
| Metric traceability | 0.15 | every metric labeled with source, period, units |
R2 = 0.25·Action + 0.20·Exec + 0.20·Visual + 0.20·Story + 0.15·Trace
(computed only when all 4a invariants hold)
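A minimal sketch of the R2 formula with its invariant guard. The weights come from 4b; the dictionary keys and the choice to return `None` when an invariant fails are assumptions made for the example.

```python
from typing import Optional

R2_WEIGHTS = {
    "actionability": 0.25,
    "executive_altitude": 0.20,
    "visual_digestibility": 0.20,
    "clean_story": 0.20,
    "metric_traceability": 0.15,
}


def r2_composite(scores: dict[str, float], invariants: dict[str, bool]) -> Optional[float]:
    """Weighted R2 composite, or None when any 4a invariant does not hold."""
    if not invariants or not all(invariants.values()):
        return None  # no composite is computed for a candidate that breaks an invariant
    missing = set(R2_WEIGHTS) - set(scores)
    if missing:
        raise KeyError(f"missing dimension scores: {sorted(missing)}")
    return sum(weight * scores[dim] for dim, weight in R2_WEIGHTS.items())
```

With scores of 4, 4, 3, 4, and 5 in the table's order, the composite is 0.25·4 + 0.20·4 + 0.20·3 + 0.20·4 + 0.15·5 = 3.95.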
5 · Accept / reject math
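Let b = best-so-far, c = candidate, and s_d(x) = the score of x on protected dimension d. Accept c iff: invariants hold,
no protected regression (s_d(c) ≥ s_d(b) − τ for every protected d, where τ is a small tolerance), and improvement by at least the margin
(R2(c) ≥ R2(b) + δ, δ ≈ 0.15). Otherwise roll back.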
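The accept rule can be sketched as a single function. The `Scored` type and the reason strings are hypothetical, and `TAU = 0.25` is a placeholder: the text gives δ but no value for τ.

```python
from dataclasses import dataclass

DELTA = 0.15  # improvement margin from the text
TAU = 0.25    # ASSUMED tolerance; the spec does not give a value for tau


@dataclass(frozen=True)
class Scored:
    invariants_ok: bool
    protected: dict[str, float]  # s_d for each protected dimension d
    r2: float                    # R2 composite


def accept(best: Scored, cand: Scored, delta: float = DELTA, tau: float = TAU) -> tuple[bool, str]:
    """Return (accepted, reason); any failed condition means roll back to best."""
    if not cand.invariants_ok:
        return False, "invariant violated"
    for dim, best_score in best.protected.items():
        # A dimension missing from the candidate is treated as a regression.
        if cand.protected.get(dim, float("-inf")) < best_score - tau:
            return False, f"protected regression on {dim}"
    if cand.r2 < best.r2 + delta:
        return False, "improvement below margin"
    return True, "accepted"
```

With a best-so-far composite of 3.6, a candidate at 3.8 is accepted (3.8 ≥ 3.75) while one at 3.7 is rolled back, even though it scored higher than the current best.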
Monotonicity holds only within an epoch by design. An epoch is a fixed
(rubric_version, data_version) pair. When fresh data lands, reality can lower a score and the
deck must follow, so the loop re-baselines. Truth wins over a prettier score.
Stop conditions: ship (R2(b) ≥ 4.3) · plateau (margin < ε for N laps) · max_iters. All route to "freeze version + notify."
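A sketch of the three stop conditions, checked in the order listed. The defaults for `eps`, `n`, and `max_iters` are placeholders, since §8 leaves them open, and `gains` is assumed to hold the per-lap change in the best-so-far composite (0 on a rejected lap).

```python
from typing import Optional

SHIP_BAR = 4.3  # ship bar from the text


def stop_reason(
    best_r2: float,
    gains: list[float],
    iters: int,
    *,
    eps: float = 0.05,   # ASSUMED plateau threshold
    n: int = 3,          # ASSUMED plateau window, in laps
    max_iters: int = 8,  # ASSUMED iteration cap
) -> Optional[str]:
    """Return "ship", "plateau", or "max_iters", or None to keep going."""
    if best_r2 >= SHIP_BAR:
        return "ship"
    if len(gains) >= n and all(gain < eps for gain in gains[-n:]):
        return "plateau"
    if iters >= max_iters:
        return "max_iters"
    return None
```

Whatever string comes back, the next step is the same: freeze the best-so-far as a version and notify.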
6 · Skill / agent catalog
| Manager | Responsibility |
| --- | --- |
| O0 Orchestrator | Decide what's next and who does it; route hand-offs; never performs work. Holds only plan + handles. |
6a · Deterministic tools (code, the source of truth)
| Tool | Responsibility |
| --- | --- |
| T1 EDGAR/SEC fetcher | Pull filings by ticker/CIK; cache raw |
| T2 Document parser | Extract tables & text from filings and uploads |
| T3 Analysis engine | Compute ratios, growth, margins, bridges |
| T4 Fact-store writer | Normalize each figure into a traceable record |
| T5 Fact-check / tracer | Confirm every tagged figure equals its fact-store value |
| T6 Renderer | Turn the deck spec into rendered slides |
| T7 Render lint | Detect overflow, walls of text, broken slides |
| T8 Score aggregator | Apply weights, compute composite, run accept/reject (§5) |
| T9 Notifier | Email/message on run completion with status + score delta + link |
| T10 Version writer | Freeze the run's deck as a new immutable version |
| T11 Judge-consistency | Re-score unchanged best-so-far to detect judge drift |
| T12 De-AI / Humanizer | Detects AI tells in the built deck: em-dashes, hedging ("it's worth noting"), generic filler, over-symmetrical lists, and AI-looking design (gradient/glow). Flags them on render; a Reviser rewrite resolves them so the deck reads as human-authored. |
6b · LLM agents (judgment)
| Agent | Responsibility |
| --- | --- |
| L1 Scope Builder | Prompt → frozen brief |
| L2 Rubric Builder | Author + calibrate R1/R2 |
| L3 Drafter | Write the deck; tag every figure with its fact_id |
| L4 Reviewer / Scorer | Review → per-dim scores + ranked weakest-parts list |
| L5 Planner | Triage the weak-parts list into one work order, fresh each lap |
| L6 Reviser | Execute the work order, fixing exactly what was flagged |
Two rules hold it together. (1) Reviewer ≠ Reviser. (2) LLMs never own a
number. Every figure originates in T1–T4, is verified by T5, and is only ever arranged by an LLM.
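Rule (2) is what makes the T5 check purely mechanical. A minimal sketch, assuming each figure in the deck carries a `fact_id` tag and the fact store can be read as a mapping from `fact_id` to value; the `TaggedFigure` type and the message format are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedFigure:
    slide: int
    fact_id: str
    shown: float  # the value as it appears in the deck


def fact_check(figures: list[TaggedFigure], fact_store: dict[str, float], tol: float = 0.0) -> list[str]:
    """Return one problem string per figure that does not match the fact store."""
    problems = []
    for fig in figures:
        if fig.fact_id not in fact_store:
            problems.append(f"slide {fig.slide}: unknown fact_id {fig.fact_id}")
        elif abs(fig.shown - fact_store[fig.fact_id]) > tol:
            expected = fact_store[fig.fact_id]
            problems.append(f"slide {fig.slide}: {fig.fact_id} shows {fig.shown}, store has {expected}")
    return problems
```

An empty list means the Truth invariant holds. Any entry is an automatic reject, with no judge involved.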
7 · Persisted state & version history
Outputs are immutable, versioned, and never overwritten. Every completed run freezes its deck as
a new versions/vN/ folder; best_so_far is only a working pointer inside a run.
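The freeze step could be sketched like this. It is an illustration under assumptions: `freeze_version` is a hypothetical name, `index.jsonl` is assumed to sit inside `versions/`, and only `deck.json` is written to keep the example short.

```python
import json
from pathlib import Path


def freeze_version(root: Path, deck: dict, row: dict) -> Path:
    """Write the deck into the next versions/vNNN/ folder and append one index row."""
    versions = root / "versions"
    versions.mkdir(parents=True, exist_ok=True)
    existing = sorted(p.name for p in versions.glob("v[0-9][0-9][0-9]") if p.is_dir())
    next_n = int(existing[-1][1:]) + 1 if existing else 1
    vdir = versions / f"v{next_n:03d}"
    vdir.mkdir(exist_ok=False)  # refuse to overwrite an existing version
    (vdir / "deck.json").write_text(json.dumps(deck, indent=2), encoding="utf-8")
    # Append-only index: earlier rows are never rewritten.
    with (versions / "index.jsonl").open("a", encoding="utf-8") as fh:
        fh.write(json.dumps({"version": vdir.name, **row}) + "\n")
    return vdir
```

Numbering from the folders already on disk, combined with `exist_ok=False`, means a bug can fail a run but cannot silently replace a frozen version.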
versions/
v001/ deck.json render.pdf scores.json changelog.jsonl
v002/ ...
index.jsonl ← one immutable row per completed run
{ "version":"v003", "status":"ready_for_review", "composite":4.3,
"delta_vs_prev":0.4, "rubric_version":"r2-1.2.0", "data_version":"d-2026Q1" }
8 · Open decisions to confirm before building
- Thresholds: entry bar (3.2), ship bar (4.3), margin δ (0.15), plateau N, max_iters.
- Weights: R2 weights reflect "actionability + exec-fit first"; adjust to the audience.
- Planner action taxonomy: confirm the five action types cover real reviewer comments.
- Judge model & determinism: pin the model, low temperature.
- Notification channel & cadence.
- Run concurrency: lock against overlapping runs.
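For the run-concurrency item, one simple option is an exclusive lock file. This is a sketch of that one approach, not a decision: `run_lock` and the lock path are hypothetical, and a crashed run would leave a stale lock that needs separate handling.

```python
import os
from contextlib import contextmanager


@contextmanager
def run_lock(path: str):
    """Hold an exclusive lock file for the duration of a run."""
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise RuntimeError(f"another run holds {path}") from None
    try:
        os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
        yield
    finally:
        os.close(fd)
        os.remove(path)
```

A second run that starts while the first is active fails fast with `RuntimeError` instead of racing on `best_so_far`.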
9 · Known weaknesses (honest risks)
- Three of four invariants rest on the LLM judge. Story integrity is irreducibly its opinion. Mitigated by calibration, T11, and the human gate.
- Greedy hill-climb finds local optima. Mitigation: a bounded multi-edit work order scored as one unit.
- Cost grows with cadence. Add a compute budget; short-circuit a run already ≥ ship bar with no new data.
- Calibration can rot. Re-calibrate periodically.
- Scope is frozen, the world is not. Only the human gate catches "the whole angle is now stale."
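The judge-drift check behind T11 can be sketched as a comparison between the scores recorded for the unchanged best-so-far and a fresh re-score of the same deck. The function names and the 0.5 limit are assumptions for illustration.

```python
def judge_drift(recorded: dict[str, float], rescored: dict[str, float]) -> dict[str, float]:
    """Per-dimension change when the same deck is scored again (rescored minus recorded)."""
    return {dim: rescored[dim] - recorded[dim] for dim in recorded}


def drift_exceeds(recorded: dict[str, float], rescored: dict[str, float], limit: float = 0.5) -> bool:
    """True when any dimension moved by more than the (assumed) limit."""
    return any(abs(change) > limit for change in judge_drift(recorded, rescored).values())
```

Since the deck did not change, any movement is judge variance. A drift above the limit is a signal to re-calibrate before trusting further accept/reject decisions.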