Autonomous equity-research decks

An agent that writes a deck, then critiques and rewrites it in a loop until it's boardroom-ready.

Give it a prompt. It pulls the real numbers from SEC filings, drafts the deck, then improves it in a loop, scoring each version against a fixed rubric and keeping only the changes that raise the score. The numbers always trace back to a filing.

Grounded in SEC / EDGAR filings · Scored against a mathematical rubric · Versioned, never overwritten · Notifies you on completion
The whole system at a glance

How it works

A one-time setup builds the first draft. Then a loop refines it through review, planning, revision, and scoring until it clears the bar. Walk through it live in the demo.

G1 · Setup, runs once

Human-injected prompt
A1 · Scope Builder: prompt → frozen brief
A2 · Rubric Builder: defines "good" before drafting
A3 · Deterministic tools: doc parse · analysis · SEC/EDGAR pull
A4 · LLM agents: slide-create · review · revise · de-AI pass
Initial Output gate · R1: first draft must pass before the loop
↓ feeds the loop

G2 · The loop, up to 2×/day

Composite 4.3 · Board ready

1 · Review: find the weakest part
2 · Plan: choose the fix · adapts each lap
3 · Revise: build just that one change
4 · Score: fact-check · keep if better

Legend: LLM agent · Dynamic, picked each lap · Deterministic code

Watch it run on a real question

See the agent take the prompt "What is the impact on Virginia if they miss the AI boom?" and drive it through the pipeline and the scoring cycle all the way to a board-ready deck.

Why you can trust the output

Two kinds of work, kept strictly apart

The secret to a finance deck that won't hallucinate a number: never let the language model touch the math. Every job is either deterministic code (the source of truth) or LLM judgment (the writing and critiquing).

Deterministic tools

Code owns every number

SEC/EDGAR fetching, parsing, analysis, the fact store, fact-checking, rendering, scoring arithmetic, versioning, notifications. Same input → same output. If a figure isn't traceable to a filing, it can't appear in the deck.

LLM agents

Judgment owns the words

Interpreting the prompt, writing slides, reviewing for weak spots, planning a fix, rewriting. The models arrange facts and sharpen the story. They never invent a value and never get the final say on accuracy.

Reviewer ≠ Reviser

One agent finds the weakest parts; a different agent fixes them. A critic grading its own rework drifts toward self-justification.

The plan isn't fixed

Each lap, the Reviewer and Orchestrator look at what's weak right now and choose the focus, whether that means refreshing an analysis, rewriting the story, or fixing a graphic.

A manager, not a doer

An orchestrator decides what happens next and which agent does it. It writes no code and no slides. It routes the work and stays context-light.

The promise

It gets better at what matters without sacrificing the rest

What it improves

Actionability: a clear "so what" and what to do
Executive altitude: right level for a CEO/CFO/board
Visual digestibility: one idea per slide, grasped in seconds
Clean story: tight through-line, nothing redundant
Traceability: every metric labeled with its source

What it protects

Truth: every figure matches its SEC source (checked by code)
Scope: still answers the original brief
Story integrity: the thesis chain never breaks
Visual integrity: no walls of text, no lost charts

When the world changes, the deck follows the truth, even downward

If a fresh quarterly filing moves a number the wrong way, the deck updates to match reality. That isn't counted as the deck "getting worse." Truth always wins over a prettier score.

Progress you can see

Every run is a new version

Nothing is overwritten. Each completed run freezes an immutable version: v001, v002, v003 … so you can open any point in time and watch the deck mature.

v001 · 3.2 → v002 · 3.6 → v003 · 4.3 ✓

You'll always know

A message when each run finishes

Every loop completion sends an email or message: the new version link, the status, and how much the score moved. Publishing to the C-suite still waits for your approval.

"Run complete: v003, 3.98 → 4.12 (+0.14)"
Live demo · worked example

From a question to a board-ready deck

Press play and watch one real question travel the whole pipeline through research, a first draft, the de-AI pass, and the scoring loop until the deck is board-ready. Step through it at your own pace.


It starts with one plain-language question.

 

You supply one idea. The agent carries the project through the full cycle, from scope to deliverable, and then refines the output until it is board-ready, without constantly asking for human input.

Audience: state leadership · Output: board-style deck · Human input: one prompt
From your question: "What is the impact on Virginia if they miss the AI boom?"

The brief, frozen up front

Subject: Virginia's economy & AI adoption
Question: Cost of missing the AI boom
Sections: Exposure · mechanism · scenarios · actions

The rubric, what "good" means

Actionability: 0.25
Executive altitude: 0.20
Visual digestibility: 0.20
Clean story: 0.20
Traceability: 0.15

"Good" is defined before a single slide exists, so improvement can be measured, not guessed.

BLS · Labor & jobs data (BLS · Census, public series)
  Tech-sector share of jobs: 9.4%
  5-yr AI-role growth: +38%
VA · State budget & GDP (VA state filings, public)
  Annual GDP: $0.68T
  Tech share of GDP: 13%
SEC · SEC EDGAR, VA employers (10-K / 10-Q filings)
  Filings parsed: 112
  Figures stored: 1,940

AI-exposure by sector (built from the sources)
Sector | Score | Tier
Data centers / cloud | 94 | High
Federal & defense tech | 88 | High
Logistics & ports | 71 | Watch
Higher education | 63 | Watch

Fixed, repeatable tools, never the model. Every source feeds the ranked table, and every number traces back to a public filing.

✓ Approved
Human gate 1 of 2

Approve the brief & sources

Scope: Virginia's exposure if it misses the AI boom
Sources: BLS, Census, VA budget, SEC EDGAR, all public
Your call: adjust scope or sources before any deck is built
You're in control. The same approval ritual, every gate.
Builder agent: assembles slides from the facts
Fresh set of eyes: reviews next and flags the weak spots
Separate fixer agent: fixes only what was flagged

Each agent acts on the deck in turn: build, flag, fix. Every figure on every slide stays tagged to its source.

Before · AI tells

"In today's fast-paced landscape, it is important to note that Virginia stands at a pivotal crossroads, a state poised to leverage transformative opportunities."

Design tells: gradient banners, glow shadows, emoji bullets.

After · human

Virginia risks $14B in foregone GDP by 2030 if AI adoption lags peer states. The gap is concentrated in data centers and defense tech.

Design: flat layout, plain charts, one idea per slide.


The de-AI / Humanizer tool strips em-dashes, hedging, filler, and AI-looking design, including gradients, glow, and oversized colored boxes, on every build, so the deck reads and looks like a person made it.

Fresh set of eyes: flags the weakest slide this lap
Separate fixer agent: fixes only that, then re-scores

Composite: 3.2 at start · Lap 1 → 3.6 · Lap 2 → 4.0 · Lap 3 → 4.3

An edit is kept only if the score rises and no guardrail slips.

Composite 4.3 · Board ready
Saved: immutable version v003, nothing overwritten
Notified: "Run complete: Virginia / AI boom · 3.2 → 4.3 (+1.1)"
Gate 2: publishing to leadership still waits for your approval

For builders

Technical specification

Everything a developer needs to rebuild this workflow, including the orchestration model, the two rubrics, the accept/reject math, the full skill catalog, and the persisted state schemas.

0 · Orchestration model (the manager agent)

A single orchestrator coordinates everything as a manager, not a worker. The defining rule: it decides what needs to happen and which agent does it, and never does the work itself. That means no code, no number-crunching, no slide-writing, and no reviewing.

  • Delegates, doesn't perform. Dispatches the data tools, the Drafter, the Renderer; writes nothing into the deck or fact store.
  • Commissions reviews. It never forms its own opinion of quality. It sends the Reviewer (L4) to produce the ranked weak-parts list, then a Reviser (L6) to fix what was found.
  • Owns control flow, not content. It reads compact signals and routes the next move. Every arrow in §1 is an orchestrator decision.

0a · Context discipline

  • Heavy artifacts live outside its context: decks, filings, parsed tables pass by reference.
  • Workers return summaries, not dumps: counts, scores, pass/fail, a short reason, a handle.
  • State lives in the persisted files (§7), not the prompt: context doesn't grow with iteration count.
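
The three rules above amount to a contract on what a worker hands back. A minimal sketch, assuming a hypothetical `WorkerResult` record and an invented report path (neither name appears in the spec):

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class WorkerResult:
    """Compact signal handed back to the orchestrator (hypothetical shape)."""
    worker: str                 # catalog ID from §6, e.g. "T5"
    passed: bool                # pass/fail
    reason: str                 # one short sentence, never a dump
    handle: str                 # reference to the heavy artifact, not its contents
    counts: dict = field(default_factory=dict)


def summarize_fact_check(mismatched_ids: list, total: int, report_path: str) -> WorkerResult:
    """Collapse a full fact-check report into the few fields the orchestrator reads."""
    n_bad = len(mismatched_ids)
    reason = "all figures match" if n_bad == 0 else f"{n_bad} of {total} figures mismatch"
    return WorkerResult("T5", n_bad == 0, reason, report_path,
                        {"figures": total, "mismatches": n_bad})


# The path below is illustrative only.
result = summarize_fact_check(["f-0012"], 40, "runs/r17/factcheck.json")
print(result.reason)  # 1 of 40 figures mismatch
```

The orchestrator sees a verdict, a count, and a handle. The full report stays on disk, so its context stays the same size on lap 20 as on lap 1.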

1 · State machine

flowchart TD
  P["Human prompt"] --> SC["A1 Scope Builder"]
  SC --> RB["A2 Rubric Builder (R1+R2, versioned)"]
  RB --> D0["A3 Data pull · EDGAR/parse/analysis → fact store"]
  D0 --> DR["First draft (Slide-create)"]
  DR --> R1{"R1 gate: hard gates PASS + composite ≥ entry bar?"}
  R1 -->|no, rebuilds left| DR
  R1 -->|budget exhausted| CE["Escalate: can't clear entry bar"]
  R1 -->|yes| EP
  subgraph L["G2 refinement cycle · up to 2x/day"]
    EP["Re-baseline epoch (rubric_version + data_version)"] --> RN["Render"]
    RN --> FC["Deterministic fact-check: every figure ↔ fact store"]
    FC --> SCO["A4 Scorer (LLM judge) · R2 per-dim + evidence"]
    SCO --> G{"Accept? invariants hold AND no protected regression AND composite improves within epoch"}
    G -->|accept| PS["Update working best-so-far + history + changelog"]
    G -->|reject| RO["Rollback to best-so-far"]
    PS --> ST{"Stop run? score ≥ ship bar OR margin < ε for N OR max iters"}
    RO --> ST
    ST -->|keep going| PL["A4 Planner → work order (target dim + action, chosen fresh each lap)"]
    PL -->|refine numbers| D1["A3 re-pull/re-run → bumps data_version"] --> BU
    PL -->|refine story/visual/trace| BU["A4 Reviser executes work order"]
    BU --> RN
    ST -->|run done| FZ
  end
  FZ["Freeze best as immutable version vN"] --> NO["T9 Notifier: email/message + status + score delta + link"]
  NO --> G2{"Publish to C-suite? (human approves)"}
  G2 -->|approved| DP["Publish vN externally"]
  G2 -->|not yet| HO["vN kept as internal version"]
  CE --> NO

Two rubrics, two jobs. R1 is a one-time acceptance gate. R2 is the per-iteration delta rubric. The Score → Plan → Revise sequence mirrors a research desk: a senior reviewer marks up the deck, the work is triaged into one task, and an analyst does exactly that rework.
The Plan step is dynamic, not scripted. Each lap the Planner reads this lap's comments and routes to whichever action the deck needs most, including re-pulling an analysis, rewriting a section, or fixing one graphic. Two consecutive laps can take completely different actions.
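
The routing that follows a work order can be sketched as below. The four action names are assumptions read off the edge labels in the flowchart ("refine numbers", "refine story/visual/trace"); the spec does not enumerate its action taxonomy, and the step labels are illustrative.

```python
from dataclasses import dataclass

# Hypothetical action names, read off the edge labels in the §1 flowchart.
RERUN_DATA = {"refine_numbers"}
REVISE_ONLY = {"refine_story", "refine_visual", "refine_trace"}


@dataclass(frozen=True)
class WorkOrder:
    target_dim: str   # the R2 dimension this lap aims to raise
    action: str       # one of the action names above


def route(order: WorkOrder) -> list:
    """Return the ordered steps the orchestrator dispatches for one work order."""
    if order.action in RERUN_DATA:
        # Numbers change only through the deterministic tools, then the Reviser.
        return ["A3_repull", "A4_reviser", "render"]
    if order.action in REVISE_ONLY:
        return ["A4_reviser", "render"]
    raise ValueError(f"unknown action: {order.action}")


lap1 = route(WorkOrder("Metric traceability", "refine_trace"))
lap2 = route(WorkOrder("Actionability", "refine_numbers"))
print(lap1, lap2)
```

Two consecutive laps take different paths here, which is the behavior the Planner note describes. An unknown action fails loudly instead of defaulting to a rewrite.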

2 · How the rubrics are built

  • Derive dimensions from the brief and domain invariants (in finance: every number sourced → hard gates).
  • Anchor each dimension with concrete descriptors at each scale point.
  • Assign weights in the rubric file, so they tune without code changes.
  • Calibrate against gold examples until the LLM judge lands within ±1 of human scores.
  • Validate & version: require judge↔human agreement before the rubric drives the loop.
Calibration is not optional. Without it, R2's "improvement" is just judge variance.
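
A minimal sketch of the calibration check, assuming paired judge and human scores on the same gold decks. The function name and the sample scores are illustrative, not project data.

```python
def judge_agreement(judge_scores, human_scores, tolerance=1.0):
    """Return (all_within, share_within) for paired scores on the 0-5 scale."""
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("need equal-length, non-empty score lists")
    hits = [abs(j - h) <= tolerance for j, h in zip(judge_scores, human_scores)]
    return all(hits), sum(hits) / len(hits)


# Illustrative scores only.
ok, share = judge_agreement([3.5, 4.0, 2.0, 4.5], [3.0, 4.5, 3.5, 4.0])
print(ok, share)  # False 0.75
```

Here the third gold example misses by 1.5, so the rubric would not yet be allowed to drive the loop.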

3 · R1, Initial-Output Rubric (acceptance gate)

3a · Hard gates (binary, all must PASS)

Gate | Pass condition
Source-of-truth | 100% of figures resolve to a fact-store entry with a citation.
Scope coverage | Every required section from the brief is present and non-empty.
Structural validity | Deck renders; no broken slides, no placeholder text.
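
The three gates can be sketched as one deterministic check. The data shapes are assumptions: figures carry a `fact_id`, the fact store maps IDs to records with a `citation`, and `render_errors` is whatever list of problems the render lint reports (broken slides, placeholder text).

```python
def r1_hard_gates(figures, fact_store, sections, required_sections, render_errors):
    """Evaluate the three binary R1 gates; returns {gate_name: passed}."""
    sourced = all(
        f["fact_id"] in fact_store and bool(fact_store[f["fact_id"]].get("citation"))
        for f in figures
    )
    covered = all(bool(sections.get(name, "").strip()) for name in required_sections)
    return {
        "source_of_truth": sourced,
        "scope_coverage": covered,
        "structural_validity": not render_errors,
    }


# Hypothetical inputs: one sourced figure, one empty required section.
gates = r1_hard_gates(
    figures=[{"fact_id": "f-001"}],
    fact_store={"f-001": {"value": 0.68, "citation": "hypothetical citation"}},
    sections={"Exposure": "text", "Actions": ""},
    required_sections=["Exposure", "Actions"],
    render_errors=[],
)
passed = all(gates.values())
print(gates, passed)
```

The empty "Actions" section fails scope coverage, so the draft goes back for a rebuild regardless of how well it would score.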

3b · Graded baseline (0 to 5; composite must clear the entry bar)

Dimension | Weight | "5" looks like
Thesis clarity | 0.30 | thesis up front, every section ladders to it
Coverage depth | 0.25 | drivers + risks + catalysts quantified
Narrative flow | 0.20 | story builds; each slide earns the next
Visual baseline | 0.15 | one idea per slide, charts where numbers belong
Actionability | 0.10 | clear, decision-relevant recommendation

Entry bar: composite ≥ 3.2 / 5 AND all hard gates PASS.

4 · R2, Loop Improvement Rubric

4a · Protected invariants (regression = auto-reject)

Invariant | Measured by | Rule
Truth | deterministic fact-check (not the judge) | Every figure traces to the fact store. Mismatch → reject.
Scope adherence | judge vs. brief | Required sections present. Drop below R1 → reject.
Story integrity | judge, coherence check | The thesis through-line must not break.
Visual integrity | judge + render lint | No new walls of text, no chart that carried a number removed.

4b · Improvement targets (the loop hill-climbs the composite)

Dimension | Weight | "5" looks like
Actionability | 0.25 | every section ends in a decision-relevant takeaway
Executive altitude | 0.20 | "so what" up front; right abstraction for a board; anticipates exec questions
Visual digestibility | 0.20 | each slide graspable in <10s
Clean & concise story | 0.20 | tight through-line, no redundant slides
Metric traceability | 0.15 | every metric labeled with source, period, units

R2 = 0.25·Action + 0.20·Exec + 0.20·Visual + 0.20·Story + 0.15·Trace
     (computed only when all 4a invariants hold)
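
The composite is plain arithmetic, so it belongs in code, not in the judge. A small sketch, using the weights above and the short dimension names from the formula; the sample scores are illustrative.

```python
R2_WEIGHTS = {"Action": 0.25, "Exec": 0.20, "Visual": 0.20, "Story": 0.20, "Trace": 0.15}


def r2_composite(scores, invariants_ok):
    """Weighted R2 composite on the 0-5 scale, or None when a 4a invariant fails."""
    if not invariants_ok:
        return None
    if set(scores) != set(R2_WEIGHTS):
        raise ValueError("scores must cover exactly the five R2 dimensions")
    return sum(R2_WEIGHTS[d] * scores[d] for d in R2_WEIGHTS)


# Illustrative per-dimension scores from the judge.
scores = {"Action": 4.0, "Exec": 4.5, "Visual": 4.0, "Story": 4.5, "Trace": 5.0}
value = r2_composite(scores, invariants_ok=True)
print(round(value, 2))  # 4.35
```

Returning `None` when an invariant fails keeps a broken deck from ever carrying a number that looks comparable to a valid one.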

5 · Accept / reject math

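Let b = best-so-far and c = candidate, with s_d(x) the score of x on dimension d. Accept c iff all three hold: (1) the 4a invariants hold; (2) no protected regression: s_d(c) ≥ s_d(b) − τ for each dimension d, where τ is a small tolerance; (3) strict improvement: R2(c) ≥ R2(b) + δ, δ ≈ 0.15. Otherwise roll back to b.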
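
The rule can be sketched directly. The spec gives no value for τ, so it is a required argument here and the 0.5 in the usage lines is purely illustrative; applying the regression check to every R2 dimension is also an assumption.

```python
WEIGHTS = {"Action": 0.25, "Exec": 0.20, "Visual": 0.20, "Story": 0.20, "Trace": 0.15}


def composite(scores):
    """R2 composite: weighted sum of per-dimension scores."""
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)


def accept(best, cand, invariants_ok, tau, delta=0.15):
    """True iff the candidate should replace best-so-far."""
    if not invariants_ok:
        return False                                   # protected invariant broke
    if any(cand[d] < best[d] - tau for d in WEIGHTS):
        return False                                   # a dimension regressed past tau
    return composite(cand) >= composite(best) + delta  # must clear the margin


best = {d: 3.0 for d in WEIGHTS}
small_gain = dict(best, Action=3.5)              # composite 3.125, below 3.0 + 0.15
real_gain = dict(best, Action=4.0)               # composite 3.25
lopsided = dict(best, Action=5.0, Story=2.0)     # composite 3.3, but Story fell by 1.0

print(accept(best, small_gain, True, tau=0.5))   # False
print(accept(best, real_gain, True, tau=0.5))    # True
print(accept(best, lopsided, True, tau=0.5))     # False
```

The third case is why the regression check exists: the composite rose, but only by trading away the story.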

Monotonicity holds only within an epoch by design. An epoch is a fixed (rubric_version, data_version) pair. When fresh data lands, reality can lower a score and the deck must follow, so the loop re-baselines. Truth wins over a prettier score.
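
A sketch of the re-baseline step, assuming the loop can re-score the current best deck on demand. `Epoch`, `baseline`, and the `d-2026Q2` label are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Epoch:
    rubric_version: str
    data_version: str


def baseline(best_epoch: Epoch, best_score: float, current: Epoch,
             rescore: Callable[[], float]) -> Tuple[Epoch, float]:
    """Keep the old baseline inside an epoch; re-score when the epoch changes."""
    if current == best_epoch:
        return best_epoch, best_score
    # New rubric or new data: the old score is no longer comparable,
    # so the bar to beat becomes the re-scored value, even if it is lower.
    return current, rescore()


old = Epoch("r2-1.2.0", "d-2026Q1")
new = Epoch("r2-1.2.0", "d-2026Q2")  # hypothetical next data_version
epoch, score = baseline(old, 4.1, new, rescore=lambda: 3.9)
print(epoch.data_version, score)  # d-2026Q2 3.9
```

Without this step, a candidate built on fresh data would be compared against a score earned on stale data and could be rejected for telling the truth.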

Stop conditions: ship (R2(b) ≥ 4.3) · plateau (margin < ε for N laps) · max_iters. All route to "freeze version + notify."
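
The three stop conditions can be checked from the history of best-so-far composites. ε, N, and max_iters are undecided in the spec, so they are required arguments here and the values in the usage lines are placeholders.

```python
from typing import Optional


def stop_reason(history, eps, plateau_n, max_iters, ship_bar=4.3) -> Optional[str]:
    """history[0] is the R1-passing draft; each later entry is best-so-far after a lap.

    Assumes plateau_n >= 1. Returns "ship", "plateau", "max_iters", or None.
    """
    laps = len(history) - 1
    if history[-1] >= ship_bar:
        return "ship"
    gains = [b - a for a, b in zip(history, history[1:])]
    if laps >= plateau_n and all(g < eps for g in gains[-plateau_n:]):
        return "plateau"
    if laps >= max_iters:
        return "max_iters"
    return None


# Lap scores from the worked demo; eps, plateau_n, max_iters are placeholders.
print(stop_reason([3.2, 3.6, 4.0, 4.3], eps=0.05, plateau_n=2, max_iters=10))  # ship
print(stop_reason([3.2, 3.25, 3.28], eps=0.1, plateau_n=2, max_iters=10))      # plateau
print(stop_reason([3.2, 3.6], eps=0.05, plateau_n=2, max_iters=10))            # None
```

A rejected lap leaves best-so-far unchanged, so it shows up in the history as a gain of zero and counts toward the plateau.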

6 · Skill / agent catalog

Manager | Responsibility
O0 Orchestrator | Decide what's next and who does it; route hand-offs; never performs work. Holds only plan + handles.

6a · Deterministic tools (code, the source of truth)

T1 EDGAR/SEC fetcher | Pull filings by ticker/CIK; cache raw
T2 Document parser | Extract tables & text from filings and uploads
T3 Analysis engine | Compute ratios, growth, margins, bridges
T4 Fact-store writer | Normalize each figure into a traceable record
T5 Fact-check / tracer | Confirm every tagged figure equals its fact-store value
T6 Renderer | Turn the deck spec into rendered slides
T7 Render lint | Detect overflow, walls of text, broken slides
T8 Score aggregator | Apply weights, compute composite, run accept/reject (§5)
T9 Notifier | Email/message on run completion with status + score delta + link
T10 Version writer | Freeze the run's deck as a new immutable version
T11 Judge-consistency | Re-score unchanged best-so-far to detect judge drift
T12 De-AI / Humanizer | Strips AI tells from the built deck: em-dashes, hedging ("it's worth noting"), generic filler, over-symmetrical lists, and AI-looking design (gradient/glow). Flags them on render; a Reviser rewrite resolves them so the deck reads as human-authored.

6b · LLM agents (judgment)

L1 Scope Builder | Prompt → frozen brief
L2 Rubric Builder | Author + calibrate R1/R2
L3 Drafter | Write the deck; tag every figure with its fact_id
L4 Reviewer / Scorer | Review → per-dim scores + ranked weakest-parts list
L5 Planner | Triage the weak-parts list into one work order, fresh each lap
L6 Reviser | Execute the work order, fixing exactly what was flagged

Two rules hold it together. (1) Reviewer ≠ Reviser. (2) LLMs never own a number. Every figure originates in T1–T4, is verified by T5, and is only ever arranged by an LLM.
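
Rule (2) is enforceable because the T5 check needs no judgment at all. A minimal sketch, assuming each deck figure carries a `fact_id` and a `value`; the IDs are invented, and the two values are taken from the worked demo.

```python
def fact_check(deck_figures, fact_store):
    """Return the fact_ids whose deck value is missing from, or differs from, the fact store."""
    bad = []
    for fig in deck_figures:
        record = fact_store.get(fig["fact_id"])
        if record is None or record["value"] != fig["value"]:
            bad.append(fig["fact_id"])
    return bad


# Hypothetical IDs; values from the demo (annual GDP in $T, tech share of GDP in %).
fact_store = {
    "va-gdp": {"value": 0.68},
    "va-tech-share": {"value": 13},
}
deck_figures = [
    {"fact_id": "va-gdp", "value": 0.86},         # transposed digits on the slide
    {"fact_id": "va-tech-share", "value": 13},
    {"fact_id": "va-ai-jobs", "value": 38},       # no fact-store entry at all
]
mismatches = fact_check(deck_figures, fact_store)
print(mismatches)  # ['va-gdp', 'va-ai-jobs']
```

The comparison is exact equality on stored values. Any rounding for display would have to happen in the renderer, after the check, for this to hold.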

7 · Persisted state & version history

Outputs are immutable, versioned, and never overwritten. Every completed run freezes its deck as a new versions/vN/ folder; best_so_far is only a working pointer inside a run.

versions/
  v001/  deck.json  render.pdf  scores.json  changelog.jsonl
  v002/  ...
  index.jsonl   ← one immutable row per completed run

{ "version":"v003", "status":"ready_for_review", "composite":4.3,
  "delta_vs_prev":0.7, "rubric_version":"r2-1.2.0", "data_version":"d-2026Q1" }
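
A sketch of how a version writer could honor the never-overwrite rule with this layout. `freeze_version` is a hypothetical name, and only `deck.json` and the index row are written here; the other files in the folder are omitted.

```python
import json
import tempfile
from pathlib import Path


def freeze_version(root: Path, deck: dict, row: dict) -> str:
    """Write the deck into the next versions/vNNN/ folder and append one index row."""
    root.mkdir(parents=True, exist_ok=True)
    existing = sorted(p.name for p in root.glob("v[0-9][0-9][0-9]") if p.is_dir())
    number = int(existing[-1][1:]) + 1 if existing else 1
    name = f"v{number:03d}"
    version_dir = root / name
    version_dir.mkdir(exist_ok=False)          # fails instead of overwriting
    (version_dir / "deck.json").write_text(json.dumps(deck))
    with (root / "index.jsonl").open("a") as index:   # append-only index
        index.write(json.dumps({"version": name, **row}) + "\n")
    return name


# Demo in a throwaway directory; the scores are illustrative.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "versions"
    first = freeze_version(root, {"slides": []}, {"composite": 3.2})
    second = freeze_version(root, {"slides": []}, {"composite": 3.6})
    rows = (root / "index.jsonl").read_text().splitlines()

print(first, second, len(rows))  # v001 v002 2
```

`mkdir(exist_ok=False)` is the guard: if two writers ever pick the same number, the second one raises instead of silently replacing a frozen deck.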

8 · Open decisions to confirm before building

  • Thresholds: entry bar (3.2), ship bar (4.3), margin δ (0.15), plateau N, max_iters.
  • Weights: R2 weights reflect "actionability + exec-fit first"; adjust to the audience.
  • Planner action taxonomy: confirm the five action types cover real reviewer comments.
  • Judge model & determinism: pin the model, low temperature.
  • Notification channel & cadence.
  • Run concurrency: lock against overlapping runs.
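
For the concurrency item, one common option is an exclusive lock file. This is a sketch of that general technique, not a decision the spec has made; the function names and lock path are hypothetical.

```python
import os
import tempfile


def acquire_run_lock(lock_path: str) -> bool:
    """Atomically create the lock file; return False if another run already holds it."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))   # record the owner for debugging
    return True


def release_run_lock(lock_path: str) -> None:
    """Remove the lock file at the end of a run."""
    os.remove(lock_path)


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    lock = os.path.join(tmp, "run.lock")
    first = acquire_run_lock(lock)     # this run gets the lock
    second = acquire_run_lock(lock)    # an overlapping run is refused
    release_run_lock(lock)
    third = acquire_run_lock(lock)     # the next scheduled run proceeds
    release_run_lock(lock)

print(first, second, third)  # True False True
```

A crashed run would leave a stale lock behind, so a real deployment would also need an expiry or an owner check. That policy is part of the open decision.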

9 · Known weaknesses (honest risks)

  • Three of four invariants rest on the LLM judge. Story integrity is irreducibly its opinion. Mitigated by calibration, T11, and the human gate.
  • Greedy hill-climb finds local optima. Mitigation: a bounded multi-edit work order scored as one unit.
  • Cost grows with cadence. Add a compute budget; short-circuit a run already ≥ ship bar with no new data.
  • Calibration can rot. Re-calibrate periodically.
  • Scope is frozen, the world is not. Only the human gate catches "the whole angle is now stale."