THE ALL-SEEING HARNESS, MACH2.
A multi-agent ReAct harness instrumented end-to-end — self-hosted Langfuse traces every thought, tool call, and LLM span, an LLM-as-a-Judge (Grok) scores each response for hallucination and quality, and a five-critic pipeline gates every answer before it reaches the user. Like Vision: it sees everything, and it judges what it sees.
NO ANSWER SHIPS UNJUDGED
Most agent demos optimise for the happy path. Mach2 — the successor to Mach1 — optimises for the question that actually matters in production: how good is this answer, and can you prove it?
Every run is traced end-to-end in self-hosted Langfuse, with each thought, tool call, and LLM span on record. An LLM-as-a-Judge (Grok) scores every response on hallucination and quality, surfacing live evaluation and performance metrics. And before anything reaches the user, the five-critic pipeline (Fact, Gap, Logic, Efficiency, Sign-off) gets its say, maximising answer quality per token.
"If you can't see the reasoning, you can't trust the answer."
FILE SNAPSHOT
- ReAct core — LangGraph multi-agent harness
- Tracing — self-hosted Langfuse, every span
- Judge — Grok scoring hallucination + quality
- Critics — Fact · Gap · Logic · Efficiency · Sign-off
- Memory — shared ChromaDB across agents
- Desktop — Electron + React, live SSE panel
EVERY THOUGHT TRACED, EVERY ANSWER GATED
Reason & Act
A LangGraph ReAct harness drives multi-agent reasoning — thoughts, tool calls, and observations in an auditable loop.
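The thought, tool call, observation cycle can be sketched in plain Python. This is a minimal illustration, not Mach2's LangGraph graph: `react_loop`, `Step`, `scripted_policy`, and the `lookup` tool are hypothetical names, and the policy is a scripted stand-in for the LLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """One auditable entry in the loop: what was thought, done, and seen."""
    thought: str
    tool: Optional[str]
    tool_input: Optional[str]
    observation: Optional[str] = None


# Hypothetical policy contract: return (thought, tool, tool_input).
# tool=None means the thought is the final answer.
Policy = Callable[[str, List[Step]], Tuple[str, Optional[str], Optional[str]]]


def react_loop(
    question: str,
    policy: Policy,
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> Tuple[Optional[str], List[Step]]:
    history: List[Step] = []
    for _ in range(max_steps):
        thought, tool, tool_input = policy(question, history)
        step = Step(thought, tool, tool_input)
        history.append(step)
        if tool is None:
            return thought, history  # final answer plus the full trail
        handler = tools.get(tool)
        # Record the observation so the next policy call can reason over it.
        step.observation = (
            handler(tool_input or "") if handler else f"unknown tool: {tool}"
        )
    return None, history  # step budget exhausted without an answer


def scripted_policy(question: str, history: List[Step]):
    """Stand-in for the LLM: look something up once, then answer."""
    if not history:
        return ("I should look this up.", "lookup", question)
    return (f"Answer: {history[-1].observation}", None, None)


TOOLS = {"lookup": lambda q: f"stub result for '{q}'"}

answer, trail = react_loop("what is ReAct?", scripted_policy, TOOLS)
```

Returning the history alongside the answer is what makes the loop auditable: every thought and observation survives the run.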
Fan Out Research
A nested Research Agent dynamically spawns scoped scouting sub-agents (agent-as-tool), all backed by shared ChromaDB memory.
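A rough sketch of the agent-as-tool fan-out, assuming each scout is a plain callable built per scope. `SharedMemory` is an in-memory stand-in for the shared store and does not use the ChromaDB API; `make_scout` and `research_agent` are hypothetical names, and recall here is naive keyword matching rather than vector search.

```python
from typing import Callable, Dict, List


class SharedMemory:
    """In-memory stand-in for shared agent memory (not the ChromaDB API)."""

    def __init__(self) -> None:
        self._notes: List[Dict[str, str]] = []

    def add(self, scope: str, text: str) -> None:
        self._notes.append({"scope": scope, "text": text})

    def recall(self, keyword: str) -> List[str]:
        # Naive substring match; a real store would do similarity search.
        needle = keyword.lower()
        return [n["text"] for n in self._notes if needle in n["text"].lower()]


def make_scout(scope: str, memory: SharedMemory) -> Callable[[str], str]:
    """Build a scoped sub-agent exposed as a tool (a plain callable)."""

    def scout(query: str) -> str:
        prior = memory.recall(scope)  # check shared memory before working
        finding = f"[{scope}] finding for '{query}' (prior notes: {len(prior)})"
        memory.add(scope, finding)  # leave a note for later scouts
        return finding

    return scout


def research_agent(question: str, scopes: List[str], memory: SharedMemory) -> List[str]:
    """Spawn one scout per scope on demand and collect their findings."""
    return [make_scout(scope, memory)(question) for scope in scopes]


memory = SharedMemory()
first = research_agent("vector databases", ["pricing", "latency"], memory)
second = research_agent("vector databases", ["latency"], memory)
```

The second run's latency scout sees the note left by the first, which is the point of sharing one memory across scouts.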
Trace Everything
Self-hosted Langfuse instruments the run end-to-end: every thought, tool call, and LLM span lands in the trace, so nothing happens off the record.
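To show the shape of span-level tracing, here is a small standard-library sketch. `Tracer` is a hypothetical stand-in and deliberately does not use the Langfuse SDK; the span kinds and field names are assumptions chosen for illustration.

```python
import time
import uuid
from contextlib import contextmanager
from typing import Any, Dict, Iterator, List


class Tracer:
    """Hypothetical in-process tracer; records nested spans in order."""

    def __init__(self) -> None:
        self.spans: List[Dict[str, Any]] = []
        self._stack: List[str] = []  # ids of currently open spans

    @contextmanager
    def span(self, kind: str, name: str, **attrs: Any) -> Iterator[Dict[str, Any]]:
        record: Dict[str, Any] = {
            "id": uuid.uuid4().hex,
            "parent": self._stack[-1] if self._stack else None,
            "kind": kind,  # e.g. "thought", "tool", "llm"
            "name": name,
            "attrs": attrs,
            "start": time.perf_counter(),
            "end": None,
            "error": None,
        }
        self.spans.append(record)
        self._stack.append(record["id"])
        try:
            yield record
        except Exception as exc:
            record["error"] = repr(exc)  # failures stay on the record too
            raise
        finally:
            record["end"] = time.perf_counter()
            self._stack.pop()


tracer = Tracer()
with tracer.span("agent", "run") as root:
    with tracer.span("thought", "plan"):
        pass
    with tracer.span("tool", "search", query="langfuse"):
        pass
    with tracer.span("llm", "synthesise", model="stub"):
        pass
```

Parent ids are what let a trace viewer rebuild the reasoning chain as a tree after the fact.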
Judge the Response
An LLM-as-a-Judge (Grok) scores each response on hallucination and quality — live evaluation and performance metrics, not vibes.
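One way to keep judge scores trustworthy is to parse and validate the judge's reply before recording it. The sketch below assumes the judge is asked to reply in JSON with scores in the range 0 to 1; that contract, the prompt wording, and the names `JudgeScore`, `parse_judge_reply`, and `judge_response` are all hypothetical, and the model call is stubbed rather than sent to Grok.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class JudgeScore:
    hallucination: float  # assumed scale: 0.0 grounded, 1.0 hallucinated
    quality: float        # assumed scale: 0.0 poor, 1.0 excellent
    rationale: str


def parse_judge_reply(raw: str) -> JudgeScore:
    """Parse the judge's JSON reply and reject out-of-range scores."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    scores = {}
    for key in ("hallucination", "quality"):
        value = float(data[key])
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{key} score out of range: {value}")
        scores[key] = value
    return JudgeScore(rationale=str(data.get("rationale", "")), **scores)


def judge_response(
    question: str, answer: str, call_model: Callable[[str], str]
) -> JudgeScore:
    """Build the judging prompt, call the model, and validate its reply."""
    prompt = (
        "Score the answer for hallucination and quality, each from 0 to 1. "
        'Reply as JSON with keys "hallucination", "quality", "rationale".\n'
        f"Question: {question}\nAnswer: {answer}"
    )
    return parse_judge_reply(call_model(prompt))


def stub_model(prompt: str) -> str:
    """Stand-in for the judge model; returns a fixed reply."""
    return '{"hallucination": 0.1, "quality": 0.8, "rationale": "stubbed"}'


score = judge_response("What is SSE?", "A one-way HTTP event stream.", stub_model)
```

Rejecting malformed or out-of-range replies keeps a misbehaving judge from silently polluting the metrics.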
Critique, Gate, Re-Research
Fact, Gap, Logic, Efficiency, and Sign-off critics emit a severity-tagged critique ledger (blocker / major / minor) that gates synthesis and triggers targeted re-research — maximising answer quality per token.
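The ledger and its gate can be sketched as a small data structure. The policy shown is an assumption for illustration: any blocker stops synthesis, otherwise any major finding triggers re-research aimed at the flagged notes, otherwise the answer ships. `Finding`, `Decision`, and `decide` are hypothetical names.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Severity(IntEnum):
    MINOR = 1
    MAJOR = 2
    BLOCKER = 3


@dataclass(frozen=True)
class Finding:
    critic: str  # e.g. "Fact", "Gap", "Logic", "Efficiency", "Sign-off"
    severity: Severity
    note: str


@dataclass(frozen=True)
class Decision:
    action: str         # "block", "re-research", or "ship"
    targets: List[str]  # notes that explain the block or steer re-research


def decide(ledger: List[Finding]) -> Decision:
    """Assumed gate policy: blockers first, then majors, else ship."""
    blockers = [f.note for f in ledger if f.severity == Severity.BLOCKER]
    if blockers:
        return Decision("block", blockers)
    majors = [f.note for f in ledger if f.severity == Severity.MAJOR]
    if majors:
        return Decision("re-research", majors)
    return Decision("ship", [])


ledger = [
    Finding("Fact", Severity.MINOR, "date format inconsistent"),
    Finding("Gap", Severity.MAJOR, "no pricing data for vendor B"),
]
decision = decide(ledger)
```

Carrying the notes through as `targets` is what lets re-research aim at the weak spot instead of redoing the whole query.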
BUILT TO WATCH ITSELF WORK
Langfuse Observability
Every thought, tool call, and LLM span traced end-to-end — the whole reasoning chain is inspectable after the fact.
LLM-as-a-Judge
Hallucination and quality scores on each answer, surfacing live evaluation and performance metrics across the harness.
Critique Ledger
Fact, Gap, Logic, Efficiency, Sign-off — every response earns a blocker/major/minor ledger that gates synthesis and fires targeted re-research.
Research Fan-Out
A nested Research Agent spawns scoped scouting sub-agents on demand, sharing ChromaDB memory so no scout starts blind.
Live Thinking Panel
A desktop UI with a live Thinking & Critique panel streaming agent reasoning and critic verdicts over SSE as they happen.
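For reference, a Server-Sent Events frame is plain text: an `event:` line, a `data:` line, and a blank line to end the frame. The sketch below formats agent events that way; the event names and payload fields are assumptions, not Mach2's actual schema.

```python
import json
from typing import Any, Dict, Iterable, Iterator, Tuple


def format_sse(event: str, payload: Dict[str, Any]) -> str:
    """Serialise one event as an SSE frame.

    json.dumps escapes newlines, so the payload always fits on a
    single data: line.
    """
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


def stream_events(events: Iterable[Tuple[str, Dict[str, Any]]]) -> Iterator[str]:
    """Yield frames one at a time, as a streaming HTTP response would."""
    for name, payload in events:
        yield format_sse(name, payload)


frames = list(
    stream_events(
        [
            ("thought", {"text": "check sources"}),
            ("critique", {"critic": "Fact", "severity": "major"}),
        ]
    )
)
```

On the client side, a browser `EventSource` can subscribe to each named event with `addEventListener`, so thoughts and critic verdicts can feed separate parts of the panel.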
Quality per Token
The ledger decides what ships: blockers stop synthesis, gaps trigger re-research aimed exactly where the answer is weak.
REASONING UNDER FULL SURVEILLANCE
Electron + React UI
Thinking & Critique Panel · SSE
LangGraph ReAct Orchestrator
Multi-Agent Reasoning Loop
Research Agent
Scoped Scout Sub-Agents
Critic Pipeline
Fact · Gap · Logic · Efficiency · Sign-off
LLM Judge
Grok · Hallucination + Quality
ChromaDB
Shared Agent Memory
Langfuse
Self-Hosted · Every Span Traced