LLMbench
Tier: comparative tool. Object: generated prose across models.
LLMbench sends a prompt to two models simultaneously, displays their responses side by side, and lets researchers build layered interpretive readings across both outputs. It is the hermeneutic surface instrument of the Vector Lab: the layer at which models are usually encountered. Where Manifold Atlas compares models at the geometric level and Vectorscope inspects internals, LLMbench handles comparative close reading of generated text.
Why LLMbench
AI-generated text is rarely read closely. It is skimmed, summarised, or dismissed. LLMbench asks a different question: what happens when two models generate in response to the same prompt, and what does the comparison reveal about the geometries below? The tool reuses annotation infrastructure from the Critical Code Studies Workbench, reoriented from source code analysis to LLM output comparison.
Modes
Modes are organised across three tiers: Compare, Analyse, and Investigate.
Compare tier
- Compare. Dual-panel generation from two models on the same prompt, with annotation and cross-panel linking.
Analyse tier
- Stochastic. Repeated generation from the same model to see the distribution of responses.
- Temperature. Generation across temperature settings to see how stochasticity plays out at different parameter values.
- Divergence. Cross-model divergence metrics including cosine similarity, Jaccard, word overlap (Dice), and uniqueness.
- Probs. Probability visualisation for models that return logprobs: heatmap, pixel map, and 3D probability net.
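The set-overlap and cosine metrics named under Divergence can be sketched as follows. This is a minimal illustration, not LLMbench's implementation: the function names, the naive ASCII tokeniser, and the choice of bag-of-words counts for cosine similarity are all assumptions (the tool may well compute cosine over embeddings instead). The uniqueness metric is left out because the description above does not define it.

```typescript
// Illustrative helpers only. Names and tokenisation are assumptions,
// not LLMbench's actual API.

/** Naive tokeniser: lowercase runs of ASCII letters, digits, and apostrophes. */
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9']+/g) ?? [];
}

/** Number of word types present in both sets. */
function intersectionSize(a: Set<string>, b: Set<string>): number {
  let shared = 0;
  a.forEach((word) => {
    if (b.has(word)) shared += 1;
  });
  return shared;
}

/** Jaccard index over word types: |A ∩ B| / |A ∪ B|. */
function jaccard(a: Set<string>, b: Set<string>): number {
  const shared = intersectionSize(a, b);
  const union = a.size + b.size - shared;
  return union === 0 ? 1 : shared / union;
}

/** Dice coefficient over word types: 2|A ∩ B| / (|A| + |B|). */
function dice(a: Set<string>, b: Set<string>): number {
  const total = a.size + b.size;
  return total === 0 ? 1 : (2 * intersectionSize(a, b)) / total;
}

/** Term-frequency vector as a word -> count map. */
function termCounts(tokens: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const t of tokens) counts.set(t, (counts.get(t) ?? 0) + 1);
  return counts;
}

/** Cosine similarity between two bag-of-words count vectors. */
function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((count, word) => {
    dot += count * (b.get(word) ?? 0);
    normA += count * count;
  });
  b.forEach((count) => {
    normB += count * count;
  });
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

/** Compare two model outputs on all three metrics. */
function compareOutputs(textA: string, textB: string) {
  const tokensA = tokenize(textA);
  const tokensB = tokenize(textB);
  const setA = new Set(tokensA);
  const setB = new Set(tokensB);
  return {
    cosine: cosine(termCounts(tokensA), termCounts(tokensB)),
    jaccard: jaccard(setA, setB),
    dice: dice(setA, setB),
  };
}
```

Jaccard and Dice ignore word frequency and differ only in how they normalise the shared vocabulary, so Dice is never lower than Jaccard for the same pair. The count-based cosine is sensitive to repetition, which is one reason to report the metrics together.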
Investigate tier
- Grammar Probe. Pattern-specific probes of generation behaviour, organised in five phases that are all now shipped: Phase A (prevalence heatmap, prompt × model × temperature), Phase B (continuation logprobs at each scaffold’s fork point, with a Spearman correlation between logprob and cosine(X, Y-phrase) for antithesis patterns), Phase C (forced continuation), Phase D (perturbation), Phase E (temperature sweep). Ships with a library of grammatical constructions (Not X but Y, Hyland hedging triplets, tricolon and parallelism, modal stacking), twenty default prompts across six registers, and ten thematic suites along Purpose and Domain axes. A global Stop button cancels any running phase mid-stream.
- Sampling Probe (added v2.15.0). Turns sampling itself into an inspectable surface. Features include a per-token transcript, an override audit log, completion-style prompting with an editable prompt, Panel B no-flash, a transcript-matches strip, sample prompts, and PDF export.
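Grammar Probe Phase B reports a Spearman correlation between continuation logprobs and cosine(X, Y-phrase). A minimal sketch of that statistic, assuming two equal-length arrays of paired observations and average ranks for ties, is given here. The function names are hypothetical, and the source does not say how LLMbench handles ties or degenerate inputs.

```typescript
// Hypothetical helper names. Tie handling (average ranks) is an assumption.

/** 1-based ranks; tied values share the mean of the ranks they span. */
function ranks(values: number[]): number[] {
  const order = values.map((v, i) => ({ v, i })).sort((x, y) => x.v - y.v);
  const result = new Array<number>(values.length);
  let start = 0;
  while (start < order.length) {
    let end = start;
    while (end + 1 < order.length && order[end + 1].v === order[start].v) {
      end += 1;
    }
    const meanRank = (start + end) / 2 + 1;
    for (let k = start; k <= end; k += 1) result[order[k].i] = meanRank;
    start = end + 1;
  }
  return result;
}

/** Pearson correlation; returns NaN when either series is constant. */
function pearson(x: number[], y: number[]): number {
  const n = x.length;
  const meanX = x.reduce((s, v) => s + v, 0) / n;
  const meanY = y.reduce((s, v) => s + v, 0) / n;
  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < n; i += 1) {
    const dx = x[i] - meanX;
    const dy = y[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  return varX === 0 || varY === 0 ? NaN : cov / Math.sqrt(varX * varY);
}

/** Spearman's rho: Pearson correlation computed on the ranks. */
function spearman(x: number[], y: number[]): number {
  if (x.length !== y.length || x.length < 2) {
    throw new Error("spearman needs two equal-length series of at least 2 points");
  }
  return pearson(ranks(x), ranks(y));
}
```

In this setting, `x` would hold the logprob of each Y-phrase continuation at the fork point and `y` the corresponding cosine value. Because the statistic depends only on rank order, it is unaffected by the very different scales of logprobs and cosines.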
Across the Compare and Analyse modes, ten guided exercises are available as scholarly presets drawing on Hyland, Lakoff and Johnson, Hayden White and others. Each exercise supplies a preset prompt, methodological context, and guided questions.
Annotation and cross-panel linking
Annotations are layered on each panel independently. Cross-panel links connect annotations across the two outputs with typed relations (contrast, parallel, divergence, convergence, echo, absence, note) and free-text notes. Annotations and links persist in a saved comparison format.
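One way to picture the data model described here is the sketch below. Only the seven relation names come from the description above. The type names, field names, character-offset spans, and the validation helper are hypothetical and are not taken from LLMbench's saved comparison format.

```typescript
// Hypothetical shapes for illustration; the real saved format is not shown in the source.

/** The seven typed relations listed above. */
type LinkRelation =
  | "contrast" | "parallel" | "divergence" | "convergence"
  | "echo" | "absence" | "note";

type PanelId = "A" | "B";

interface Annotation {
  id: string;
  panel: PanelId;
  start: number; // assumed: character offset where the annotated span begins
  end: number;   // assumed: character offset where it ends (exclusive)
  text: string;  // the researcher's comment
}

interface CrossPanelLink {
  id: string;
  from: string;  // annotation id on one panel
  to: string;    // annotation id on the other panel
  relation: LinkRelation;
  note?: string; // optional free-text note
}

interface SavedComparison {
  prompt: string;
  outputs: Record<PanelId, string>;
  annotations: Annotation[];
  links: CrossPanelLink[];
}

/** Report links that point at missing annotations or stay within one panel. */
function validateLinks(comparison: SavedComparison): string[] {
  const byId = new Map<string, Annotation>();
  for (const a of comparison.annotations) byId.set(a.id, a);

  const problems: string[] = [];
  for (const link of comparison.links) {
    const from = byId.get(link.from);
    const to = byId.get(link.to);
    if (!from || !to) {
      problems.push(`link ${link.id}: refers to a missing annotation`);
    } else if (from.panel === to.panel) {
      problems.push(`link ${link.id}: both ends are on panel ${from.panel}`);
    }
  }
  return problems;
}
```

Keeping links separate from annotations, and referring to annotations by id, means each panel's annotations can be edited on their own while the relations between panels stay checkable when a comparison is reloaded.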
Providers
Seven providers are supported: OpenAI, Anthropic, Google, Hugging Face, OpenRouter, OpenAI-compatible endpoints, and local Ollama. Logprob support varies by provider; the interface is explicit about what each provider can return.
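Being explicit about what a provider can return might look like the sketch below: a capability flag on each adapter that gates the logprob-dependent features. The interface, the feature identifiers, and the placeholder adapters are hypothetical, and the flag values are invented for illustration rather than statements about any real provider.

```typescript
// Hypothetical adapter shape; LLMbench's real interface is not shown in the source.

interface ProviderCapabilities {
  logprobs: boolean; // whether the provider can return per-token log-probabilities
}

interface ProviderAdapter {
  id: string;
  capabilities: ProviderCapabilities;
}

// Feature identifiers are illustrative labels for modes named in this document.
type Feature =
  | "compare"
  | "compareLogprobOverlay"
  | "samplingProbe"
  | "grammarProbePhaseB"
  | "probs";

/** Features assumed here to need token-level logprobs. */
const REQUIRES_LOGPROBS: Feature[] = [
  "compareLogprobOverlay",
  "samplingProbe",
  "grammarProbePhaseB",
  "probs",
];

interface FeatureStatus {
  available: boolean;
  reason?: string; // shown to the user when the feature is unavailable
}

/** Decide whether a feature can run against a given provider, and say why not. */
function featureStatus(adapter: ProviderAdapter, feature: Feature): FeatureStatus {
  const needsLogprobs = REQUIRES_LOGPROBS.indexOf(feature) !== -1;
  if (needsLogprobs && !adapter.capabilities.logprobs) {
    return {
      available: false,
      reason: `${adapter.id} does not return logprobs, which ${feature} requires`,
    };
  }
  return { available: true };
}

// Placeholder adapters with made-up flag values.
const withLogprobs: ProviderAdapter = { id: "example-a", capabilities: { logprobs: true } };
const withoutLogprobs: ProviderAdapter = { id: "example-b", capabilities: { logprobs: false } };
```

Returning a reason alongside the boolean lets the interface state the limitation directly instead of silently hiding a mode.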
Theoretical background
LLMbench operationalises the commitment that AI-generated text deserves hermeneutic reading, not just dismissal or statistical critique. It sits above the geometric instruments in the Vector Lab and is often the presentation layer for findings that begin lower down. A claim tested at the geometric level in Manifold Atlas can be illustrated, read closely, and compared across models in LLMbench.
Stack
Next.js, TypeScript, Three.js. Multi-provider adapters with explicit capability flags. Persistent local storage for prompt history, annotations, and saved comparisons.
Status
Mature. Now at v2.2, a milestone bump marking the local-Ollama path: LLMbench runs end-to-end against a local open-weight model from a deployed origin, with full token-level instrumentation. Ollama logprobs are wired through all four logprob endpoints (Compare overlay, Sampling Probe, Grammar Probe Phase B, Probs), so the same per-token reading that worked against hosted models now works against Gemma, Llama, Qwen, Phi, and DeepSeek running locally. The cumulative v2.15.34–v2.15.47 work brought browser-direct generation, structured error rendering, a copyable OLLAMA_ORIGINS command in the unreachable-error message, a proper README Ollama setup section, and a Design Rationale paragraph on why the closer the model sits to the researcher the more honest the reading becomes. All five Grammar Probe phases (A through E) and the Sampling Probe Investigate-tier mode are shipped. In active use for teaching and research.
Siblings
Manifold Atlas compares the same models at the geometric level; LLMbench is the presentation layer for that work. Vectorscope opens the internals behind the prose. Manifoldscope characterises a single manifold. Theoryscope applies comparable methods to corpora of theoretical texts.