// CASE FILE 01 — CODENAME: VISION WATCH — FLAGSHIP · OBSERVABILITY

THE ALL-SEEING HARNESS, MACH2.

A multi-agent ReAct harness instrumented end to end: self-hosted Langfuse traces every thought, tool call, and LLM span; an LLM-as-a-Judge (Grok) scores each response for hallucination and quality; and a five-critic pipeline gates every answer before it reaches the user. Like Vision, it sees everything and judges what it sees.

Inspired By: Vision

Role: Creator & Architect

Type: Desktop Harness

Evaluation: LLM-as-a-Judge


Think ★ Act ★ Trace ★ Judge ★ Critique ★ Synthesise

FILE 01 — THE MISSION

NO ANSWER SHIPS UNJUDGED

Most agent demos optimise for the happy path. Mach2 — the successor to Mach1 — optimises for the question that actually matters in production: how good is this answer, and can you prove it?

Every run is traced end-to-end in self-hosted Langfuse, with each thought, tool call, and LLM span on record. An LLM-as-a-Judge (Grok) scores every response on hallucination and quality, surfacing live evaluation and performance metrics. Before anything reaches the user, the five-critic pipeline (Fact, Gap, Logic, Efficiency, Sign-off) gets its say, with the goal of maximising answer quality per token.

"If you can't see the reasoning, you can't trust the answer."

FILE SNAPSHOT
ReAct core — LangGraph multi-agent harness
Tracing — self-hosted Langfuse, every span
Judge — Grok scoring hallucination + quality
Critics — Fact · Gap · Logic · Efficiency · Sign-off
Memory — shared ChromaDB across agents
Desktop — Electron + React, live SSE panel

FILE 02 — THE WATCHTOWER LOOP

EVERY THOUGHT TRACED, EVERY ANSWER GATED

Reason & Act

A LangGraph ReAct harness drives multi-agent reasoning — thoughts, tool calls, and observations in an auditable loop.
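The thought, action, observation cycle can be sketched in plain Python. This is a minimal illustration, not the Mach2 code and not the LangGraph API: `react_loop`, `toy_policy`, and the `lookup` tool are hypothetical names, and the "model" is a stub so the loop runs offline.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One auditable turn of the loop."""
    thought: str
    action: str
    action_input: str
    observation: str


def react_loop(question, policy, tools, max_steps=5):
    """Repeat thought -> action -> observation until the policy finishes."""
    trail = []
    for _ in range(max_steps):
        thought, action, action_input = policy(question, trail)
        if action == "finish":
            return action_input, trail
        if action in tools:
            observation = tools[action](action_input)
        else:
            observation = f"unknown tool: {action}"
        trail.append(Step(thought, action, action_input, observation))
    return "no answer within step budget", trail


# Hypothetical stand-ins for an LLM policy and a tool registry.
def toy_policy(question, trail):
    if not trail:
        return ("I need to look this up first.", "lookup", question)
    return ("I have what I need.", "finish", f"Answer: {trail[-1].observation}")


TOOLS = {"lookup": lambda query: f"stub result for '{query}'"}

answer, trail = react_loop("capital of France", toy_policy, TOOLS)
```

The returned `trail` is what makes the loop auditable: every step keeps its thought, the tool it called, and what came back.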

Fan Out Research

A nested Research Agent dynamically spawns scoped scouting sub-agents (agent-as-tool), all backed by shared ChromaDB memory.
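A rough sketch of the agent-as-tool idea, under stated assumptions: each scout is wrapped as a plain callable the Research Agent can invoke like any other tool, and an in-memory list stands in for the shared ChromaDB store. `SharedMemory`, `make_scout`, and `research_agent` are illustrative names, not taken from the project.

```python
class SharedMemory:
    """In-memory stand-in for the shared store (assumption: not ChromaDB)."""

    def __init__(self):
        self._notes = []

    def add(self, note):
        self._notes.append(note)

    def recall(self):
        # Return a copy so callers cannot mutate the store by accident.
        return list(self._notes)


def make_scout(scope, memory):
    """Wrap a scoped sub-agent as a callable tool."""
    def scout(query):
        prior = memory.recall()  # read what earlier scouts already found
        finding = f"[{scope}] finding for '{query}' (prior notes: {len(prior)})"
        memory.add(finding)      # leave a note for later scouts
        return finding
    return scout


def research_agent(question, scopes, memory):
    """Spawn one scout per scope on demand and collect their findings."""
    scouts = [make_scout(scope, memory) for scope in scopes]
    return [scout(question) for scout in scouts]


memory = SharedMemory()
findings = research_agent("pricing", ["docs", "forums"], memory)
```

Here the second scout sees one prior note because the first scout wrote to the same store before it ran.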

Trace Everything

Self-hosted Langfuse instruments the run end-to-end: every thought, tool call, and LLM span lands in the trace, and nothing happens off the record.
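To make "every span" concrete, here is a standard-library sketch of nested span recording. It deliberately does not use the Langfuse SDK; `Tracer`, the `kind` labels, and the record fields are assumptions chosen only to show the parent/child shape a trace takes.

```python
import time
import uuid
from contextlib import contextmanager


class Tracer:
    """Toy span recorder: each span knows its parent and its duration."""

    def __init__(self):
        self.spans = []   # finished spans, in completion order
        self._stack = []  # ids of spans that are currently open

    @contextmanager
    def span(self, name, kind):
        record = {
            "id": uuid.uuid4().hex,
            "name": name,
            "kind": kind,  # e.g. "thought", "tool", "llm" (assumed labels)
            "parent": self._stack[-1] if self._stack else None,
        }
        started = time.perf_counter()
        self._stack.append(record["id"])
        try:
            yield record
        finally:
            record["duration_s"] = time.perf_counter() - started
            self._stack.pop()
            self.spans.append(record)


tracer = Tracer()
with tracer.span("run", "trace"):
    with tracer.span("plan", "thought"):
        pass
    with tracer.span("web_search", "tool") as tool_span:
        tool_span["output"] = "stub"  # attach whatever the tool returned
```

Because spans are appended when they close, children land in the list before their parent; the `parent` id is what lets a viewer rebuild the tree.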

Judge the Response

An LLM-as-a-Judge (Grok) scores each response on hallucination and quality — live evaluation and performance metrics, not vibes.
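One way an LLM-as-a-Judge step can be wired, as a sketch only: the judge is asked for a JSON verdict, and the harness validates it before trusting the numbers. The prompt wording, the 0 to 1 score range, and the names `Verdict`, `parse_verdict`, and `fake_judge_model` are assumptions; no real Grok API call is made here.

```python
import json
from dataclasses import dataclass


@dataclass
class Verdict:
    hallucination: float  # assumed scale: 0.0 grounded, 1.0 fabricated
    quality: float        # assumed scale: 0.0 poor, 1.0 excellent
    rationale: str


def build_judge_prompt(question, answer, sources):
    """Assemble the text sent to the judge model (wording is illustrative)."""
    joined = "\n".join(f"- {s}" for s in sources)
    return (
        "Score the answer. Reply with JSON containing the keys "
        "hallucination, quality, rationale.\n"
        f"Question: {question}\nAnswer: {answer}\nSources:\n{joined}"
    )


def parse_verdict(raw):
    """Parse and range-check the judge's reply instead of trusting it blindly."""
    data = json.loads(raw)
    scores = {}
    for key in ("hallucination", "quality"):
        value = float(data[key])
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{key} out of range: {value}")
        scores[key] = value
    return Verdict(scores["hallucination"], scores["quality"],
                   str(data.get("rationale", "")))


def judge(question, answer, sources, call_model):
    return parse_verdict(call_model(build_judge_prompt(question, answer, sources)))


# Stub standing in for the judge model so the sketch runs offline.
def fake_judge_model(prompt):
    return '{"hallucination": 0.1, "quality": 0.8, "rationale": "claims match sources"}'


verdict = judge("What is X?", "X is Y.", ["X is Y per the spec."], fake_judge_model)
```

Validating the reply matters because a judge is itself an LLM: a malformed or out-of-range score should fail loudly rather than flow into the metrics.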

Critique, Gate, Re-Research

Fact, Gap, Logic, Efficiency, and Sign-off critics emit a severity-tagged critique ledger (blocker / major / minor) that gates synthesis and triggers targeted re-research — maximising answer quality per token.
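The ledger and its gate can be sketched as a small data structure plus one decision function. The policy below is an assumption made for illustration: any blocker stops synthesis, and every blocker or major entry names a topic to re-research. The `topic` field and the `gate` function are hypothetical, not taken from the project.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    MINOR = 1
    MAJOR = 2
    BLOCKER = 3


@dataclass(frozen=True)
class Critique:
    critic: str       # Fact, Gap, Logic, Efficiency, or Sign-off
    severity: Severity
    note: str
    topic: str        # where to aim re-research (assumed field)


@dataclass(frozen=True)
class Decision:
    can_synthesise: bool
    re_research_topics: list


def gate(ledger):
    """Assumed policy: blockers stop synthesis; blocker/major topics get re-researched."""
    has_blocker = any(c.severity == Severity.BLOCKER for c in ledger)
    topics = sorted({c.topic for c in ledger if c.severity >= Severity.MAJOR})
    return Decision(can_synthesise=not has_blocker, re_research_topics=topics)


ledger = [
    Critique("Fact", Severity.BLOCKER, "cited figure not in sources", "market size"),
    Critique("Gap", Severity.MAJOR, "no competitor coverage", "competitors"),
    Critique("Efficiency", Severity.MINOR, "redundant paragraph", "style"),
]
first_pass = gate(ledger)

# After re-research clears the serious entries, only the minor note remains.
second_pass = gate([c for c in ledger if c.severity == Severity.MINOR])
```

Keeping the topic on each entry is what makes re-research targeted: the next round only revisits the areas the critics flagged, which is where the per-token savings would come from.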

FILE 03 — CORE SYSTEMS

BUILT TO WATCH ITSELF WORK

🔭

Langfuse Observability

Self-Hosted · Full-Span Tracing

Every thought, tool call, and LLM span traced end-to-end — the whole reasoning chain is inspectable after the fact.

Langfuse · Tracing

⚖️

LLM-as-a-Judge

Grok Scoring Every Response

Hallucination and quality scores on each answer, surfacing live evaluation and performance metrics across the harness.

Grok · Eval Metrics

🧾

Critique Ledger

Five Critics · Severity Tags

Fact, Gap, Logic, Efficiency, Sign-off — every response earns a blocker/major/minor ledger that gates synthesis and fires targeted re-research.

Critic Pipeline · Quality Gates

🛰️

Research Fan-Out

Agent-as-Tool Sub-Agents

A nested Research Agent spawns scoped scouting sub-agents on demand, sharing ChromaDB memory so no scout starts blind.

Agent-as-Tool · ChromaDB

🖥️

Live Thinking Panel

Electron + React · SSE Streaming

A desktop UI with a live Thinking & Critique panel streaming agent reasoning and critic verdicts over SSE as they happen.
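On the wire, Server-Sent Events are plain text frames: an optional `event:` line, one or more `data:` lines, and a blank line to end the frame. The sketch below shows how a backend might frame reasoning and critic updates for the panel; the event names `thought` and `critique` and the payload fields are assumptions, not the project's actual schema.

```python
import json


def format_sse(event, payload):
    """Serialise one update as an SSE frame (event name + single-line JSON data)."""
    # json.dumps escapes newlines inside strings, so the data stays on one line.
    data = json.dumps(payload, separators=(",", ":"))
    return f"event: {event}\ndata: {data}\n\n"


def stream_panel(updates):
    """Yield SSE frames lazily as (event, payload) updates arrive."""
    for event, payload in updates:
        yield format_sse(event, payload)


frames = list(stream_panel([
    ("thought", {"text": "checking sources"}),
    ("critique", {"critic": "Fact", "severity": "major"}),
]))
```

On the React side, a browser `EventSource` can subscribe per event name, so thoughts and critic verdicts can be routed to different parts of the panel.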

Electron · React · SSE

🎯

Quality per Token

Gated Synthesis · Targeted Re-Research

The ledger decides what ships: blockers stop synthesis, gaps trigger re-research aimed exactly where the answer is weak.

ReAct · LangGraph

FILE 04 — THE BLUEPRINT

REASONING UNDER FULL SURVEILLANCE

Electron + React UI

Thinking & Critique Panel · SSE

▼ ▼ ▼

LangGraph ReAct Orchestrator

Multi-Agent Reasoning Loop

▼ ▼ ▼

Research Agent

Scoped Scout Sub-Agents

Critic Pipeline

Fact · Gap · Logic · Efficiency · Sign-off

LLM Judge

Grok · Hallucination + Quality

▼ ▼ ▼

ChromaDB

Shared Agent Memory

Langfuse

Self-Hosted · Every Span Traced

Nothing ships until the ledger clears — blockers gate synthesis, gaps trigger re-research

5 · Critics on Every Answer

100% · Spans Traced in Langfuse

3 · Severity Tiers Gating Synthesis

END OF FILE

MISSION LOGGED. RETURN TO BASE.
