Autonomous equity-research decks

An agent that writes a deck, then critiques and rewrites it in a loop until it's boardroom-ready.

Give it a prompt. It pulls the real numbers from SEC filings, drafts the deck, then improves it in a loop, scoring each version against a fixed rubric and keeping only the changes that raise the score. The numbers always trace back to a filing.

Grounded in SEC / EDGAR filings · Scored against a mathematical rubric · Versioned, never overwritten · Notifies you on completion

The whole system at a glance

How it works

A one-time setup builds the first draft. Then a loop refines it through review, planning, revision, and scoring until it clears the bar. Walk through it live in the demo.

G1 · Setup, runs once

Human-injected prompt

A1 · Scope Builder: prompt → frozen brief

A2 · Rubric Builder: defines "good" before drafting

A3 · Deterministic tools: doc parse · analysis · SEC/EDGAR pull

A4 · LLM agents: slide-create · review · revise · de-AI pass

Initial Output gate · R1: first draft must pass before the loop

↓ feeds the loop

G2 · The loop, up to 2×/day

Composite

4.3

Board ready

Review

find the weakest part

Plan

choose the fix · adapts each lap

Revise

build just that one change

Score

fact-check · keep if better

Legend: LLM agent · Dynamic, picked each lap · Deterministic code

Watch it run on a real question

See the agent take the prompt "What is the impact on Virginia if they miss the AI boom?" and drive it through the pipeline and the scoring cycle all the way to a board-ready deck.

Why you can trust the output

Two kinds of work, kept strictly apart

The secret to a finance deck that won't hallucinate a number: never let the language model touch the math. Every job is either deterministic code (the source of truth) or LLM judgment (the writing and critiquing).

Deterministic tools

Code owns every number

SEC/EDGAR fetching, parsing, analysis, the fact store, fact-checking, rendering, scoring arithmetic, versioning, notifications. Same input → same output. If a figure isn't traceable to a filing, it can't appear in the deck.

LLM agents

Judgment owns the words

Interpreting the prompt, writing slides, reviewing for weak spots, planning a fix, rewriting. The models arrange facts and sharpen the story. They never invent a value and never get the final say on accuracy.

Reviewer ≠ Reviser

One agent finds the weakest parts; a different agent fixes them. A critic grading its own rework drifts toward self-justification.

The plan isn't fixed

Each lap, the Reviewer and Orchestrator look at what's weak right now and choose the focus, whether that means refreshing an analysis, rewriting the story, or fixing a graphic.

A manager, not a doer

An orchestrator decides what happens next and which agent does it. It writes no code and no slides. It routes the work and stays context-light.

The promise

It gets better at what matters without sacrificing the rest

What it improves

Actionability	a clear "so what" and what to do
Executive altitude	right level for a CEO/CFO/board
Visual digestibility	one idea per slide, grasped in seconds
Clean story	tight through-line, nothing redundant
Traceability	every metric labeled with its source

What it protects

Truth	every figure matches its SEC source (checked by code)
Scope	still answers the original brief
Story integrity	the thesis chain never breaks
Visual integrity	no walls of text, no lost charts

When the world changes, the deck follows the truth, even downward

If a fresh quarterly filing moves a number the wrong way, the deck updates to match reality. That isn't counted as the deck "getting worse." Truth always wins over a prettier score.

Progress you can see

Every run is a new version

Nothing is overwritten. Each completed run freezes an immutable version: v001, v002, v003 … so you can open any point in time and watch the deck mature.

v001 · 3.2 → v002 · 3.6 → v003 · 4.3 ✓

You'll always know

A message when each run finishes

Every loop completion sends an email or message: the new version link, the status, and how much the score moved. Publishing to the C-suite still waits for your approval.

"Run complete: v003, 3.6 → 4.3 (+0.7)"

Live demo · worked example

From a question to a board-ready deck

Press play and watch one real question travel the whole pipeline through research, a first draft, the de-AI pass, and the scoring loop until the deck is board-ready. Step through it at your own pace.

Step 1 of 8

It starts with one plain-language question.

You supply one idea. The agent carries the project through the full cycle, from scope to deliverable, and refines the final output until it is board-ready, without repeatedly asking for human input.

Audience: state leadership · Output: board-style deck · Human input: one prompt

From your question "What is the impact on Virginia if they miss the AI boom?"

The brief, frozen up front

✓ Subject: Virginia's economy & AI adoption

✓ Question: Cost of missing the AI boom

✓ Sections: Exposure · mechanism · scenarios · actions

The rubric, what "good" means

Dimension	Weight
Actionability	0.25
Executive altitude	0.20
Visual digestibility	0.20
Clean story	0.20
Traceability	0.15

"Good" is defined before a single slide exists, so improvement can be measured, not guessed.

Labor & jobs data (BLS · Census, public series)
Tech-sector share of jobs: 9.4%
5-yr AI-role growth: +38%

State budget & GDP (VA state filings, public)
Annual GDP: $0.68T
Tech share of GDP: 13%

SEC EDGAR, VA employers (10-K / 10-Q filings)
Filings parsed: 112
Figures stored: 1,940

AI-exposure by sector, built from the sources

Sector	Score	Tier
Data centers / cloud	94	High
Federal & defense tech	88	High
Logistics & ports	71	Watch
Higher education	63	Watch

Fixed, repeatable tools, never the model. Every source feeds the ranked table, and every number traces back to a public filing.

✓ Approved

Human gate 1 of 2

Approve the brief & sources

✓ Scope: Virginia's exposure if it misses the AI boom

✓ Sources: BLS, Census, VA budget, SEC EDGAR, all public

✓ Your call: adjust scope or sources before any deck is built

You're in control. The same approval ritual, every gate.

✍︎

Builder agent

assembles slides from the facts

◎

Fresh set of eyes

reviews next and flags the weak spots

✎

Separate fixer agent

fixes only what was flagged

issue · fixed

Each agent acts on the deck in turn: build, flag, fix. Every figure on every slide stays tagged to its source.

Before · AI tells

"In today's fast-paced landscape, it is important to note that Virginia stands at a pivotal crossroads, a state poised to leverage transformative opportunities."

Design tells: gradient banners, glow shadows, emoji bullets.

After · human

Virginia risks $14B in foregone GDP by 2030 if AI adoption lags peer states. The gap is concentrated in data centers and defense tech.

Design: flat layout, plain charts, one idea per slide.

Before · slide design

After · slide design

The de-AI / Humanizer tool strips em-dashes, hedging, filler, and AI-looking design, including gradients, glow, and oversized colored boxes, on every build, so the deck reads and looks like a person made it.

weak · fixed

◎

Fresh set of eyes

flags the weakest slide this lap

✎

Separate fixer agent

fixes only that, then re-scores

Composite

3.2

Lap 1 → 3.6 · Lap 2 → 4.0 · Lap 3 → 4.3

An edit is kept only if the score rises and no guardrail slips.

Composite

4.3

Board ready

Saved immutable version v003, nothing overwritten

Notified "Run complete: Virginia / AI boom · 3.2 → 4.3 (+1.1)"

Gate 2: publishing to leadership still waits for your approval

For builders

Technical specification

Everything a developer needs to rebuild this workflow, including the orchestration model, the two rubrics, the accept/reject math, the full skill catalog, and the persisted state schemas.

0 · Orchestration model (the manager agent)

A single orchestrator coordinates everything as a manager, not a worker. The defining rule: it decides what needs to happen and which agent does it, and never does the work itself: no code, no number-crunching, no slide-writing, no reviewing.

Delegates, doesn't perform. Dispatches the data tools, the Drafter, the Renderer; writes nothing into the deck or fact store.
Commissions reviews. It never forms its own opinion of quality. It sends the Reviewer (L4) to produce the ranked weak-parts list, then a Reviser (L6) to fix what was found.
Owns control flow, not content. It reads compact signals and routes the next move. Every arrow in §1 is an orchestrator decision.

0a · Context discipline

Heavy artifacts live outside its context: decks, filings, parsed tables pass by reference.
Workers return summaries, not dumps: counts, scores, pass/fail, a short reason, a handle.
State lives in the persisted files (§7), not the prompt: context doesn't grow with iteration count.
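
The "summaries, not dumps" rule can be sketched as a small return record. This is an illustration under assumptions: the `WorkerSummary` name, its fields, and the 200-character cap are hypothetical and not part of the spec.

```python
from dataclasses import dataclass
from typing import Optional

MAX_REASON_CHARS = 200  # assumed cap, not taken from the spec


@dataclass(frozen=True)
class WorkerSummary:
    """Compact signal a worker returns to the orchestrator (hypothetical shape)."""
    worker: str                    # which tool or agent ran, e.g. "T5"
    passed: bool                   # pass/fail
    count: int                     # e.g. number of figures checked
    reason: str                    # short reason, trimmed by summarize()
    handle: str                    # reference to the heavy artifact, never its contents
    score: Optional[float] = None  # composite, when the worker produces one


def summarize(worker, passed, count, reason, handle, score=None):
    """Build a summary, trimming the reason so the orchestrator's context stays small."""
    return WorkerSummary(worker, passed, count, reason[:MAX_REASON_CHARS], handle, score)


summary = summarize("T5", True, 1940, "all tagged figures match the fact store",
                    "versions/v003/deck.json")
print(summary)
```

The orchestrator only ever sees the handle, so the deck itself stays on disk and out of the prompt.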

1 · State machine

flowchart TD
  P["Human prompt"] --> SC["A1 Scope Builder"]
  SC --> RB["A2 Rubric Builder (R1+R2, versioned)"]
  RB --> D0["A3 Data pull · EDGAR/parse/analysis → fact store"]
  D0 --> DR["First draft (Slide-create)"]
  DR --> R1{"R1 gate: hard gates PASS + composite ≥ entry bar?"}
  R1 -->|no, rebuilds left| DR
  R1 -->|budget exhausted| CE["Escalate: can't clear entry bar"]
  R1 -->|yes| EP
  subgraph L["G2 refinement cycle · up to 2x/day"]
    EP["Re-baseline epoch (rubric_version + data_version)"] --> RN["Render"]
    RN --> FC["Deterministic fact-check: every figure ↔ fact store"]
    FC --> SCO["A4 Scorer (LLM judge) · R2 per-dim + evidence"]
    SCO --> G{"Accept? invariants hold AND no protected regression AND composite improves within epoch"}
    G -->|accept| PS["Update working best-so-far + history + changelog"]
    G -->|reject| RO["Rollback to best-so-far"]
    PS --> ST{"Stop run? score ≥ ship bar OR margin < ε for N OR max iters"}
    RO --> ST
    ST -->|keep going| PL["A4 Planner → work order (target dim + action, chosen fresh each lap)"]
    PL -->|refine numbers| D1["A3 re-pull/re-run → bumps data_version"] --> BU
    PL -->|refine story/visual/trace| BU["A4 Reviser executes work order"]
    BU --> RN
    ST -->|run done| FZ
  end
  FZ["Freeze best as immutable version vN"] --> NO["T9 Notifier: email/message + status + score delta + link"]
  NO --> G2{"Publish to C-suite? (human approves)"}
  G2 -->|approved| DP["Publish vN externally"]
  G2 -->|not yet| HO["vN kept as internal version"]
  CE --> NO

Two rubrics, two jobs. R1 is a one-time acceptance gate. R2 is the per-iteration delta rubric. The Score → Plan → Revise sequence mirrors a research desk: a senior reviewer marks up the deck, the work is triaged into one task, and an analyst does exactly that rework.

The Plan step is dynamic, not scripted. Each lap the Planner reads this lap's comments and routes to whichever action the deck needs most, including re-pulling an analysis, rewriting a section, or fixing one graphic. Two consecutive laps can take completely different actions.
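
A minimal sketch of how a work order might be routed once the Planner has chosen it, following the two Planner branches in the flowchart. The action names (`refine_numbers`, `refine_story`, `refine_visual`, `refine_trace`) and the work-order keys are assumptions made for illustration. The spec does not fix the action taxonomy.

```python
# Assumed action names, derived from the flowchart's branch labels.
ACTIONS = {"refine_numbers", "refine_story", "refine_visual", "refine_trace"}


def route(work_order):
    """Return the ordered steps for one lap, given a Planner work order."""
    action = work_order["action"]
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    steps = []
    if action == "refine_numbers":
        # Numbers are refreshed by deterministic tools before the Reviser runs.
        steps.append("A3 re-pull/re-run (bumps data_version)")
    steps += ["A4 Reviser", "Render"]
    return steps


# Two consecutive laps can take different paths.
lap_1 = route({"target_dim": "traceability", "action": "refine_numbers"})
lap_2 = route({"target_dim": "clean_story", "action": "refine_story"})
print(lap_1)
print(lap_2)
```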

2 · How the rubrics are built

Derive dimensions from the brief and domain invariants (in finance: every number sourced → hard gates).
Anchor each dimension with concrete descriptors at each scale point.
Assign weights in the rubric file, so they tune without code changes.
Calibrate against gold examples until the LLM judge lands within ±1 of human scores.
Validate & version: require judge↔human agreement before the rubric drives the loop.

Calibration is not optional. Without it, R2's "improvement" is just judge variance.
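
One way to read the ±1 calibration target is as a check over a set of gold examples. The sketch below assumes the judge and the human each give one score per gold deck. The function names and sample scores are hypothetical.

```python
def max_gap(judge_scores, human_scores):
    """Largest absolute disagreement between judge and human across gold examples."""
    return max(abs(j - h) for j, h in zip(judge_scores, human_scores))


def is_calibrated(judge_scores, human_scores, tolerance=1.0):
    """True when every judge score lands within the tolerance of the human score."""
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("need one judge score per human score")
    return max_gap(judge_scores, human_scores) <= tolerance


# Illustrative gold-set scores, not real calibration data.
human = [4, 3, 5, 2]
judge = [4, 4, 4, 2]
drifted_judge = [4, 4, 4, 4]

print(is_calibrated(judge, human))
print(is_calibrated(drifted_judge, human))
```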

3 · R1, Initial-Output Rubric (acceptance gate)

3a · Hard gates (binary, all must PASS)

Gate	Pass condition
Source-of-truth	100% of figures resolve to a fact-store entry with a citation.
Scope coverage	Every required section from the brief is present and non-empty.
Structural validity	Deck renders; no broken slides, no placeholder text.

3b · Graded baseline (0 to 5; composite must clear the entry bar)

Dimension	Weight	"5" looks like
Thesis clarity	0.30	thesis up front, every section ladders to it
Coverage depth	0.25	drivers + risks + catalysts quantified
Narrative flow	0.20	story builds; each slide earns the next
Visual baseline	0.15	one idea per slide, charts where numbers belong
Actionability	0.10	clear, decision-relevant recommendation

Entry bar: composite ≥ 3.2 / 5 AND all hard gates PASS.
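
The entry bar can be sketched as a single predicate. The weights and the 3.2 bar come from the tables above. The dictionary keys and the sample scores are illustrative assumptions.

```python
R1_WEIGHTS = {
    "thesis_clarity": 0.30,
    "coverage_depth": 0.25,
    "narrative_flow": 0.20,
    "visual_baseline": 0.15,
    "actionability": 0.10,
}
ENTRY_BAR = 3.2


def r1_composite(scores):
    """Weighted 0-to-5 composite, rounded to avoid float noise at the bar."""
    return round(sum(R1_WEIGHTS[d] * scores[d] for d in R1_WEIGHTS), 2)


def passes_r1(hard_gates, scores):
    """All hard gates must PASS and the composite must clear the entry bar."""
    return all(hard_gates.values()) and r1_composite(scores) >= ENTRY_BAR


gates = {"source_of_truth": True, "scope_coverage": True, "structural_validity": True}
draft = {"thesis_clarity": 4, "coverage_depth": 3, "narrative_flow": 3,
         "visual_baseline": 3, "actionability": 3}

print(r1_composite(draft), passes_r1(gates, draft))
```

A single failed hard gate rejects the draft regardless of how high the graded composite is.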

4 · R2, Loop Improvement Rubric

4a · Protected invariants (regression = auto-reject)

Invariant	Measured by	Rule
Truth	deterministic fact-check (not the judge)	Every figure traces to the fact store. Mismatch → reject.
Scope adherence	judge vs. brief	Required sections present. Coverage dropping below the R1 scope gate → reject.
Story integrity	judge, coherence check	The thesis through-line must not break.
Visual integrity	judge + render lint	No new walls of text, no chart that carried a number removed.
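
The Truth invariant is the one check that runs as code rather than judgment. A minimal sketch, assuming each figure in the deck carries a `fact_id` tag and the fact store is a mapping from that id to a record: the `Fact` shape, the ids, and the citation strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    fact_id: str
    value: float
    citation: str


def fact_check(tagged_figures, fact_store):
    """Return a list of (fact_id, problem); an empty list means the deck passes."""
    problems = []
    for fact_id, shown_value in tagged_figures:
        fact = fact_store.get(fact_id)
        if fact is None:
            problems.append((fact_id, "no fact-store entry"))
        elif not fact.citation:
            problems.append((fact_id, "missing citation"))
        elif shown_value != fact.value:
            problems.append((fact_id, "value mismatch"))
    return problems


# Illustrative records only.
store = {
    "tech_job_share": Fact("tech_job_share", 9.4, "BLS public series"),
    "tech_gdp_share": Fact("tech_gdp_share", 13.0, "VA state filings"),
}
clean_deck = [("tech_job_share", 9.4), ("tech_gdp_share", 13.0)]
bad_deck = [("tech_job_share", 9.9), ("ai_role_growth", 38.0)]

print(fact_check(clean_deck, store))
print(fact_check(bad_deck, store))
```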

4b · Improvement targets (the loop hill-climbs the composite)

Dimension	Weight	"5" looks like
Actionability	0.25	every section ends in a decision-relevant takeaway
Executive altitude	0.20	"so what" up front; right abstraction for a board; anticipates exec questions
Visual digestibility	0.20	each slide graspable in <10s
Clean & concise story	0.20	tight through-line, no redundant slides
Metric traceability	0.15	every metric labeled with source, period, units

R2 = 0.25·Action + 0.20·Exec + 0.20·Visual + 0.20·Story + 0.15·Trace
     (computed only when all 4a invariants hold)
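
A worked instance of the R2 formula, with the weights taken from table 4b. The per-dimension scores are made up for illustration, and returning `None` when an invariant fails is an assumed convention.

```python
R2_WEIGHTS = {
    "actionability": 0.25,
    "executive_altitude": 0.20,
    "visual_digestibility": 0.20,
    "clean_story": 0.20,
    "traceability": 0.15,
}


def r2_composite(scores, invariants_hold):
    """Weighted composite, computed only when all 4a invariants hold."""
    if not invariants_hold:
        return None
    return round(sum(R2_WEIGHTS[d] * scores[d] for d in R2_WEIGHTS), 2)


scores = {"actionability": 4, "executive_altitude": 4, "visual_digestibility": 5,
          "clean_story": 4, "traceability": 5}

# 0.25*4 + 0.20*4 + 0.20*5 + 0.20*4 + 0.15*5 = 1.00 + 0.80 + 1.00 + 0.80 + 0.75
print(r2_composite(scores, invariants_hold=True))
print(r2_composite(scores, invariants_hold=False))
```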

5 · Accept / reject math

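Let b = best-so-far and c = candidate, and let s_d(x) be the score of x on dimension d. Accept c iff all three hold: the invariants hold; there is no protected regression, meaning s_d(c) ≥ s_d(b) − τ for each protected dimension d, where τ is a small tolerance; and the composite improves by a margin, R2(c) ≥ R2(b) + δ with δ ≈ 0.15. Otherwise roll back to b.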
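
The accept rule above can be sketched as one function. δ = 0.15 comes from the text. The value of τ is not given, so `TAU = 0.25` is an assumption, as are the dictionary shapes and the small float slack.

```python
DELTA = 0.15  # improvement margin from the spec
TAU = 0.25    # assumed tolerance; the spec does not fix a value
SLACK = 1e-9  # guards the comparisons against float rounding


def accept(best, cand, invariants_hold, delta=DELTA, tau=TAU):
    """best and cand look like {"r2": float, "dims": {name: score}}."""
    if not invariants_hold:
        return False
    for dim, best_score in best["dims"].items():
        if cand["dims"][dim] < best_score - tau - SLACK:
            return False  # protected regression
    return cand["r2"] >= best["r2"] + delta - SLACK


best = {"r2": 3.6, "dims": {"story": 4.0, "visual": 3.5}}
better = {"r2": 4.0, "dims": {"story": 4.0, "visual": 4.0}}
marginal = {"r2": 3.7, "dims": {"story": 4.0, "visual": 3.5}}
regressed = {"r2": 4.0, "dims": {"story": 3.0, "visual": 4.5}}

for name, cand in [("better", better), ("marginal", marginal), ("regressed", regressed)]:
    print(name, accept(best, cand, invariants_hold=True))
```

The `marginal` candidate shows why the margin matters: a gain of 0.1 sits inside plausible judge noise, so it is rolled back.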

Monotonicity holds only within an epoch by design. An epoch is a fixed (rubric_version, data_version) pair. When fresh data lands, reality can lower a score and the deck must follow, so the loop re-baselines. Truth wins over a prettier score.
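
A sketch of epoch-scoped comparison, assuming the best-so-far deck is re-scored whenever the epoch changes. The `rescore` callback and the `d-2025Q4` version string are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Epoch:
    rubric_version: str
    data_version: str


def baseline(best_epoch, best_score, current_epoch, rescore):
    """Return the score a candidate must beat in the current epoch."""
    if current_epoch == best_epoch:
        return best_score
    # New rubric or new data: the old score is not comparable, so re-baseline.
    return rescore(current_epoch)


old = Epoch("r2-1.2.0", "d-2025Q4")
new = Epoch("r2-1.2.0", "d-2026Q1")

same_epoch = baseline(old, 4.1, old, rescore=lambda e: 3.9)
after_new_filing = baseline(old, 4.1, new, rescore=lambda e: 3.9)
print(same_epoch, after_new_filing)
```

In the second call the bar drops from 4.1 to 3.9. The drop is the deck following the data, and it is not counted as a regression.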

Stop conditions: ship (R2(b) ≥ 4.3) · plateau · max_iters. All route to "freeze version + notify."
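
The three stop conditions can be sketched as one check per lap. The ship bar of 4.3 is from the text. The plateau threshold `eps`, the window `plateau_n`, and `max_iters` are open decisions in this spec, so the defaults below are placeholders.

```python
def stop_reason(best_r2, recent_gains, iters,
                ship_bar=4.3, eps=0.05, plateau_n=3, max_iters=10):
    """Return "ship", "plateau", "max_iters", or None to keep going."""
    if best_r2 >= ship_bar:
        return "ship"
    window = recent_gains[-plateau_n:]
    if len(window) == plateau_n and all(g < eps for g in window):
        return "plateau"
    if iters >= max_iters:
        return "max_iters"
    return None


print(stop_reason(4.3, [0.4, 0.3], iters=3))
print(stop_reason(4.0, [0.01, 0.02, 0.0], iters=5))
print(stop_reason(4.0, [0.4], iters=10))
print(stop_reason(3.6, [0.4], iters=1))
```

Whatever the reason, the caller then freezes the version and notifies.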

6 · Skill / agent catalog

Manager	Responsibility
O0 Orchestrator	Decide what's next and who does it; route hand-offs; never performs work. Holds only plan + handles.

6a · Deterministic tools (code, the source of truth)

T1 EDGAR/SEC fetcher	Pull filings by ticker/CIK; cache raw
T2 Document parser	Extract tables & text from filings and uploads
T3 Analysis engine	Compute ratios, growth, margins, bridges
T4 Fact-store writer	Normalize each figure into a traceable record
T5 Fact-check / tracer	Confirm every tagged figure equals its fact-store value
T6 Renderer	Turn the deck spec into rendered slides
T7 Render lint	Detect overflow, walls of text, broken slides
T8 Score aggregator	Apply weights, compute composite, run accept/reject (§5)
T9 Notifier	Email/message on run completion with status + score delta + link
T10 Version writer	Freeze the run's deck as a new immutable version
T11 Judge-consistency	Re-score unchanged best-so-far to detect judge drift
T12 De-AI / Humanizer	Strips AI tells from the built deck: em-dashes, hedging ("it's worth noting"), generic filler, over-symmetrical lists, and AI-looking design (gradient/glow). Flags them on render; a Reviser rewrite resolves them so the deck reads as human-authored.
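
A minimal sketch of the text side of T12 as a deterministic lint. The patterns cover only the tells named on this page, and the regexes themselves are assumptions. Design tells such as gradients would need a separate render check.

```python
import re

# Patterns for the text tells named in this document; illustrative, not exhaustive.
AI_TELLS = {
    "em_dash": re.compile("\u2014"),
    "hedging": re.compile(r"it(?:'s| is) (?:worth noting|important to note)", re.I),
    "filler": re.compile(r"in today's fast-paced landscape", re.I),
}


def find_ai_tells(text):
    """Return the sorted names of every tell found in the slide text."""
    return sorted(name for name, pattern in AI_TELLS.items() if pattern.search(text))


before = ("In today's fast-paced landscape, it is important to note that "
          "Virginia stands at a pivotal crossroads.")
after = "Virginia risks $14B in foregone GDP by 2030 if AI adoption lags peer states."

print(find_ai_tells(before))
print(find_ai_tells(after))
```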

6b · LLM agents (judgment)

L1 Scope Builder	Prompt → frozen brief
L2 Rubric Builder	Author + calibrate R1/R2
L3 Drafter	Write the deck; tag every figure with its fact_id
L4 Reviewer / Scorer	Review → per-dim scores + ranked weakest-parts list
L5 Planner	Triage the weak-parts list into one work order, fresh each lap
L6 Reviser	Execute the work order, fixing exactly what was flagged

Two rules hold it together. (1) Reviewer ≠ Reviser. (2) LLMs never own a number. Every figure originates in T1–T4, is verified by T5, and is only ever arranged by an LLM.

7 · Persisted state & version history

Outputs are immutable, versioned, and never overwritten. Every completed run freezes its deck as a new versions/vN/ folder; best_so_far is only a working pointer inside a run.

versions/
  v001/  deck.json  render.pdf  scores.json  changelog.jsonl
  v002/  ...
  index.jsonl   ← one immutable row per completed run
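
The layout above can be sketched as a small version writer. The `freeze` and `next_version` helpers are hypothetical, only `deck.json` and `index.jsonl` are written here, and the demo runs in a temporary directory.

```python
import json
import tempfile
from pathlib import Path


def next_version(root):
    """Next free vNNN name under root, based on existing version folders."""
    existing = [int(p.name[1:]) for p in root.glob("v[0-9][0-9][0-9]") if p.is_dir()]
    return f"v{max(existing, default=0) + 1:03d}"


def freeze(root, deck, row):
    """Write the deck into a new folder and append one row to index.jsonl."""
    root.mkdir(parents=True, exist_ok=True)
    version = next_version(root)
    folder = root / version
    folder.mkdir(exist_ok=False)  # refuses to overwrite an existing version
    (folder / "deck.json").write_text(json.dumps(deck))
    with (root / "index.jsonl").open("a") as index:
        index.write(json.dumps({"version": version, **row}) + "\n")
    return version


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "versions"
    first = freeze(root, {"slides": []}, {"composite": 3.2})
    second = freeze(root, {"slides": []}, {"composite": 3.6})
    rows = [json.loads(line) for line in (root / "index.jsonl").read_text().splitlines()]

print(first, second, rows)
```

Opening the index in append mode and creating folders with `exist_ok=False` are the two places where "never overwritten" is enforced in this sketch.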

{ "version":"v003", "status":"ready_for_review", "composite":4.3,
  "delta_vs_prev":0.7, "rubric_version":"r2-1.2.0", "data_version":"d-2026Q1" }

8 · Open decisions to confirm before building

Thresholds: entry bar (3.2), ship bar (4.3), margin δ (0.15), plateau N, max_iters.
Weights: R2 weights reflect "actionability + exec-fit first"; adjust to the audience.
Planner action taxonomy: confirm the five action types cover real reviewer comments.
Judge model & determinism: pin the model, low temperature.
Notification channel & cadence.
Run concurrency: lock against overlapping runs.

9 · Known weaknesses (honest risks)

Three of four invariants rest on the LLM judge. Story integrity is irreducibly its opinion. Mitigated by calibration, T11, and the human gate.
Greedy hill-climb finds local optima. Mitigation: a bounded multi-edit work order scored as one unit.
Cost grows with cadence. Add a compute budget; short-circuit a run already ≥ ship bar with no new data.
Calibration can rot. Re-calibrate periodically.
Scope is frozen, the world is not. Only the human gate catches "the whole angle is now stale."