// CASE FILE 01 — CODENAME: VISION WATCH — FLAGSHIP · OBSERVABILITY

THE ALL-SEEING HARNESS, MACH2.

A multi-agent ReAct harness instrumented end-to-end — self-hosted Langfuse traces every thought, tool call, and LLM span, an LLM-as-a-Judge (Grok) scores each response for hallucination and quality, and a five-critic pipeline gates every answer before it reaches the user. Like Vision: it sees everything, and it judges what it sees.

Inspired By
Vision
Role
Creator & Architect
Type
Desktop Harness
Evaluation
LLM-as-a-Judge
Think · Act · Trace · Judge · Critique · Synthesise
FILE 01 — THE MISSION

NO ANSWER SHIPS UNJUDGED

Most agent demos optimise for the happy path. Mach2 — the successor to Mach1 — optimises for the question that actually matters in production: how good is this answer, and can you prove it?

Every run is traced end-to-end in self-hosted Langfuse, with each thought, tool call, and LLM span on record. An LLM-as-a-Judge (Grok) scores every response on hallucination and quality, surfacing live evaluation and performance metrics. Before anything reaches the user, the five-critic pipeline reviews it, with the aim of maximising answer quality per token.

"If you can't see the reasoning, you can't trust the answer."

FILE SNAPSHOT
  • ReAct core — LangGraph multi-agent harness
  • Tracing — self-hosted Langfuse, every span
  • Judge — Grok scoring hallucination + quality
  • Critics — Fact · Gap · Logic · Efficiency · Sign-off
  • Memory — shared ChromaDB across agents
  • Desktop — Electron + React, live SSE panel
FILE 02 — THE WATCHTOWER LOOP

EVERY THOUGHT TRACED, EVERY ANSWER GATED

01

Reason & Act

A LangGraph ReAct harness drives multi-agent reasoning — thoughts, tool calls, and observations in an auditable loop.
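
The thought, tool call, observation cycle can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the harness's actual LangGraph graph: `Step`, `react_loop`, `scripted_policy`, and the toy `lookup` tool are hypothetical names, and a scripted policy stands in for the LLM.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Step:
    thought: str
    tool: Optional[str] = None         # None means the agent is ready to answer
    tool_input: str = ""
    observation: Optional[str] = None  # filled in after the tool runs

def react_loop(question, policy, tools, max_steps=5):
    """Run think -> act -> observe until the policy stops calling tools."""
    history = []
    for _ in range(max_steps):
        step = policy(question, history)  # think: decide the next action
        if step.tool is not None:
            # act, then record what the tool returned (observe)
            step.observation = tools[step.tool](step.tool_input)
        history.append(step)              # every step stays on the record
        if step.tool is None:
            break
    return history

# Hypothetical scripted policy standing in for the LLM.
def scripted_policy(question, history):
    if not history:
        return Step("I need to look this up.", tool="lookup", tool_input=question)
    return Step(f"Answer: {history[-1].observation}")

TOOLS = {"lookup": lambda q: f"notes about {q!r}"}
trail = react_loop("span tracing", scripted_policy, TOOLS)
```

Because every `Step` is appended to `history` before the loop decides whether to stop, the returned trail is the auditable record of the run.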

02

Fan Out Research

A nested Research Agent dynamically spawns scoped scouting sub-agents (agent-as-tool), all backed by shared ChromaDB memory.
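
A rough sketch of the agent-as-tool pattern: each scout is wrapped as a plain callable, and all scouts read and write one shared store. The names `SharedMemory`, `make_scout`, and `research_agent` are hypothetical, and a dictionary stands in for ChromaDB so the example runs without any dependency.

```python
class SharedMemory:
    """In-memory stand-in for the shared store (the harness uses ChromaDB)."""

    def __init__(self):
        self._notes = {}

    def add(self, topic, note):
        self._notes.setdefault(topic, []).append(note)

    def recall(self, topic):
        return list(self._notes.get(topic, []))

def make_scout(scope, memory):
    """Wrap a scoped sub-agent as a plain callable, i.e. a tool."""
    def scout(query):
        prior = memory.recall(scope)  # earlier findings for this scope
        finding = f"[{scope}] {query} (prior notes: {len(prior)})"
        memory.add(scope, finding)    # later scouts can see this finding
        return finding
    return scout

def research_agent(query, scopes, memory):
    """Spawn one scout per scope on demand and collect their findings."""
    return [make_scout(scope, memory)(query) for scope in scopes]

memory = SharedMemory()
first = research_agent("rate limits", ["docs", "issues"], memory)
second = research_agent("retry policy", ["docs"], memory)
```

The second `docs` scout reports one prior note, which is the point of sharing memory: a freshly spawned scout starts from what earlier scouts already found.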

03

Trace Everything

Self-hosted Langfuse instruments the run end-to-end: every thought, tool call, and LLM span lands in the trace, and nothing happens off the record.
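
To make the idea of nested spans concrete, here is a toy recorder built on the standard library only. It is an assumption-laden stand-in for illustration and does not use or imitate the Langfuse SDK; `Tracer`, its `span` method, and the span kinds are hypothetical names.

```python
import time
from contextlib import contextmanager

class Tracer:
    """Toy span recorder; a stand-in for a tracing backend, not the Langfuse SDK."""

    def __init__(self):
        self.spans = []   # finished spans, in completion order
        self._stack = []  # currently open spans, used for parent links

    @contextmanager
    def span(self, name, kind):
        record = {
            "name": name,
            "kind": kind,  # e.g. "thought", "tool", "llm"
            "parent": self._stack[-1]["name"] if self._stack else None,
            "start": time.perf_counter(),
        }
        self._stack.append(record)
        try:
            yield record  # callers may attach inputs and outputs
        finally:
            record["duration_s"] = time.perf_counter() - record["start"]
            self._stack.pop()
            self.spans.append(record)

tracer = Tracer()
with tracer.span("run", "agent"):
    with tracer.span("plan", "thought"):
        pass
    with tracer.span("search", "tool") as s:
        s["output"] = "3 hits"
```

Spans are stored as they close, so children appear before their parent; the `parent` field is what lets a viewer rebuild the tree afterwards.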

04

Judge the Response

An LLM-as-a-Judge (Grok) scores each response on hallucination and quality — live evaluation and performance metrics, not vibes.
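
One way such a judge step might be wired, sketched with a fake judge so it runs offline. Everything here is an assumption for illustration: the JSON reply shape, the 0 to 1 score scale, and the names `Verdict`, `parse_verdict`, `judge_response`, and `fake_judge`. A real judge callable would call the model instead of returning a canned string.

```python
import json
from dataclasses import dataclass
from statistics import mean

@dataclass
class Verdict:
    hallucination: float  # assumed scale: 0.0 = fully grounded, 1.0 = fabricated
    quality: float        # assumed scale: 0.0 = unusable, 1.0 = excellent

def _clamp(value):
    """Force a score into the assumed [0, 1] range."""
    return max(0.0, min(1.0, float(value)))

def parse_verdict(raw):
    """Parse the judge's JSON reply; raises if a score is missing or malformed."""
    data = json.loads(raw)
    return Verdict(_clamp(data["hallucination"]), _clamp(data["quality"]))

def judge_response(question, answer, judge):
    """`judge` is any callable that maps a prompt string to a JSON string."""
    prompt = f"Question: {question}\nAnswer: {answer}\nReply as JSON."
    return parse_verdict(judge(prompt))

# Hypothetical canned judge; note the out-of-range quality score.
def fake_judge(prompt):
    return '{"hallucination": 0.1, "quality": 1.4}'

verdicts = [judge_response("q", "a", fake_judge) for _ in range(2)]
avg_quality = mean(v.quality for v in verdicts)
```

Validating and clamping the reply before it reaches any metric keeps one malformed judge response from skewing the aggregate.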

05

Critique, Gate, Re-Research

Fact, Gap, Logic, Efficiency, and Sign-off critics emit a severity-tagged critique ledger (blocker / major / minor) that gates synthesis and triggers targeted re-research — maximising answer quality per token.
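
The gating rule can be sketched as a small function over the ledger. The critic names and severity tiers come from the description above; the exact policy is an assumption made for this example (any blocker holds synthesis, Gap findings become re-research targets, major and minor findings are recorded without gating), and `Finding` and `gate` are hypothetical names.

```python
from dataclasses import dataclass

SEVERITIES = ("blocker", "major", "minor")

@dataclass
class Finding:
    critic: str    # "fact", "gap", "logic", "efficiency", or "sign-off"
    severity: str  # one of SEVERITIES
    note: str

def gate(ledger):
    """Return (action, re_research_targets) for a ledger of findings.

    Assumed policy: a blocker holds synthesis, otherwise Gap findings
    trigger targeted re-research, otherwise the answer is synthesised.
    """
    for finding in ledger:
        if finding.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {finding.severity!r}")
    blocked = any(f.severity == "blocker" for f in ledger)
    targets = [f.note for f in ledger if f.critic == "gap"]
    if blocked:
        action = "hold"
    elif targets:
        action = "re_research"
    else:
        action = "synthesise"
    return action, targets

ledger = [
    Finding("fact", "minor", "cite the source for claim 2"),
    Finding("gap", "major", "no coverage of rate limits"),
]
decision = gate(ledger)
```

Returning the Gap notes alongside the action is what makes the re-research targeted: the next research pass receives the specific weak spots rather than the whole question.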

FILE 03 — CORE SYSTEMS

BUILT TO WATCH ITSELF WORK

🔭

Langfuse Observability

Self-Hosted · Full-Span Tracing

Every thought, tool call, and LLM span traced end-to-end — the whole reasoning chain is inspectable after the fact.

Langfuse · Tracing
⚖️

LLM-as-a-Judge

Grok Scoring Every Response

Hallucination and quality scores on each answer, surfacing live evaluation and performance metrics across the harness.

Grok · Eval Metrics
🧾

Critique Ledger

Five Critics · Severity Tags

Fact, Gap, Logic, Efficiency, Sign-off — every response earns a blocker/major/minor ledger that gates synthesis and fires targeted re-research.

Critic Pipeline · Quality Gates
🛰️

Research Fan-Out

Agent-as-Tool Sub-Agents

A nested Research Agent spawns scoped scouting sub-agents on demand, sharing ChromaDB memory so no scout starts blind.

Agent-as-Tool · ChromaDB
🖥️

Live Thinking Panel

Electron + React · SSE Streaming

A desktop UI with a live Thinking & Critique panel streaming agent reasoning and critic verdicts over SSE as they happen.
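
On the wire, a Server-Sent Events stream is a sequence of text frames, each ending in a blank line. The sketch below shows how reasoning and critic events could be framed for such a panel; the event names `thought` and `critique`, the payload fields, and the helpers `sse_event` and `stream` are assumptions for illustration.

```python
import json

def sse_event(event, payload):
    """Encode one SSE frame: an event name, a data line, and a blank-line terminator."""
    # json.dumps escapes newlines, so the payload always fits on one data line.
    body = json.dumps(payload)
    return f"event: {event}\ndata: {body}\n\n"

def stream(steps):
    """Yield frames one at a time so the client can render them as they arrive."""
    for kind, payload in steps:
        yield sse_event(kind, payload)

frames = list(stream([
    ("thought", {"text": "check sources"}),
    ("critique", {"critic": "fact", "severity": "minor"}),
]))
```

On the React side, a browser `EventSource` can subscribe to each named event with `addEventListener` and parse the `data` field as JSON.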

Electron · React · SSE
🎯

Quality per Token

Gated Synthesis · Targeted Re-Research

The ledger decides what ships: blockers stop synthesis, gaps trigger re-research aimed exactly where the answer is weak.

ReAct · LangGraph
FILE 04 — THE BLUEPRINT

REASONING UNDER FULL SURVEILLANCE

Electron + React UI

Thinking & Critique Panel · SSE

▼ ▼ ▼
LangGraph ReAct Orchestrator

Multi-Agent Reasoning Loop

▼ ▼ ▼
Research Agent

Scoped Scout Sub-Agents

Critic Pipeline

Fact · Gap · Logic · Efficiency · Sign-off

LLM Judge

Grok · Hallucination + Quality

▼ ▼ ▼
ChromaDB

Shared Agent Memory

Langfuse

Self-Hosted · Every Span Traced

Nothing ships until the ledger clears: blockers gate synthesis, and gaps trigger re-research.
5 · Critics on Every Answer
100% · Spans Traced in Langfuse
3 · Severity Tiers Gating Synthesis
END OF FILE

MISSION LOGGED. RETURN TO BASE.