LLMbench

Tier: comparative tool. Object: generated prose across models.

github.com/vector-lab-tools/LLMbench

LLMbench sends a prompt to two models simultaneously, displays their responses side-by-side, and enables researchers to build layered interpretive readings across both outputs. It is the hermeneutic surface instrument of the Vector Lab: the layer at which models are usually encountered. Where Manifold Atlas compares models at the geometric level and Vectorscope inspects internals, LLMbench handles comparative close reading of generated text.
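The two-model dispatch described above can be sketched roughly as follows. This is a minimal illustration, not LLMbench's actual code: the names `Generate`, `ModelSide`, `PanelResult`, `runSide`, and `comparePrompt` are hypothetical, and the stub models stand in for real provider calls. The sketch assumes each side should fail independently, so that one unreachable model still leaves the other panel readable.

```typescript
// Hypothetical shape: any function that turns a prompt into generated text.
type Generate = (prompt: string) => Promise<string>;

interface ModelSide {
  model: string;
  generate: Generate;
}

interface PanelResult {
  model: string;
  status: "ok" | "error";
  text?: string;
  error?: string;
}

// Run one side and convert a thrown error into a displayable result.
async function runSide(side: ModelSide, prompt: string): Promise<PanelResult> {
  try {
    const text = await side.generate(prompt);
    return { model: side.model, status: "ok", text };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { model: side.model, status: "error", error: message };
  }
}

// Start both requests before awaiting either, so they run concurrently.
function comparePrompt(
  prompt: string,
  left: ModelSide,
  right: ModelSide,
): Promise<[PanelResult, PanelResult]> {
  return Promise.all([runSide(left, prompt), runSide(right, prompt)]);
}

// Stub models for illustration only; no real provider is contacted.
const echoModel: ModelSide = {
  model: "stub-a",
  generate: async (p) => `A: ${p}`,
};
const failingModel: ModelSide = {
  model: "stub-b",
  generate: async () => {
    throw new Error("unreachable");
  },
};
```

Calling `comparePrompt("hello", echoModel, failingModel)` resolves to a pair of panel results, one per side, regardless of whether either request failed.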

Why LLMbench

AI-generated text is rarely read closely. It is skimmed, summarised, or dismissed. LLMbench asks a different question: what happens when two models generate in response to the same prompt, and what does the comparison reveal about the geometries underlying each model? The tool reuses annotation infrastructure from the Critical Code Studies Workbench, reoriented from source code analysis to LLM output comparison.

Modes

Modes are organised across three tiers: Compare, Analyse, and Investigate.

Compare tier

Analyse tier

Investigate tier

Across the Compare and Analyse modes, ten guided exercises are available as scholarly presets drawing on Hyland, Lakoff and Johnson, Hayden White and others. Each exercise supplies a preset prompt, methodological context, and guided questions.

Annotation and cross-panel linking

Annotations layer on both panels independently. A new cross-panel link feature connects annotations across the two outputs with typed relations (contrast, parallel, divergence, convergence, echo, absence, note) and free-text notes. Annotations and links persist in a saved comparison format.
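One way to picture the typed relations is as a small data model. The sketch below is an assumption for illustration: only the seven relation names come from the description above, while the field names (`panel`, `start`, `end`, `from`, `to`, `note`) and the helpers `isRelation` and `linkProblems` are hypothetical and do not reflect LLMbench's saved comparison format.

```typescript
// The seven relation types named in the text.
const RELATIONS = [
  "contrast",
  "parallel",
  "divergence",
  "convergence",
  "echo",
  "absence",
  "note",
] as const;
type Relation = (typeof RELATIONS)[number];

type Panel = "left" | "right";

// Hypothetical annotation: a character span on one panel plus a comment.
interface Annotation {
  id: string;
  panel: Panel;
  start: number;
  end: number;
  comment: string;
}

// Hypothetical link: joins two annotations by id with a typed relation.
interface CrossPanelLink {
  from: string;
  to: string;
  relation: Relation;
  note?: string;
}

// Type guard for relation names arriving as plain strings (e.g. from JSON).
function isRelation(value: string): value is Relation {
  return (RELATIONS as readonly string[]).indexOf(value) !== -1;
}

// Assumed rule: a link must join known annotations on opposite panels.
function linkProblems(link: CrossPanelLink, annotations: Annotation[]): string[] {
  const byId = (id: string) => annotations.filter((a) => a.id === id)[0];
  const from = byId(link.from);
  const to = byId(link.to);
  const problems: string[] = [];
  if (!from) problems.push(`unknown annotation: ${link.from}`);
  if (!to) problems.push(`unknown annotation: ${link.to}`);
  if (from && to && from.panel === to.panel) {
    problems.push("link must join annotations on different panels");
  }
  return problems;
}

// Illustrative data only.
const exampleAnnotations: Annotation[] = [
  { id: "a1", panel: "left", start: 0, end: 12, comment: "hedged opening" },
  { id: "b1", panel: "right", start: 0, end: 9, comment: "flat assertion" },
];
const exampleLink: CrossPanelLink = {
  from: "a1",
  to: "b1",
  relation: "contrast",
  note: "Same claim, different epistemic stance.",
};
```

Keeping the relation list as a single constant means the type and the runtime check cannot drift apart, which matters if saved comparisons are reloaded from disk.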

Providers

Seven providers are supported: OpenAI, Anthropic, Google, Hugging Face, OpenRouter, OpenAI-compatible endpoints, and local Ollama. Logprob support varies by provider; the interface is explicit about what each provider can return.
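A capability-flag approach to varying logprob support might look like the sketch below. Everything here is hypothetical: the `Capabilities` fields, the `ProviderAdapter` shape, and the example adapters are invented for illustration and make no claim about what any real provider returns. The only fixed fact used is that a log probability converts to a probability via the exponential function.

```typescript
// Hypothetical capability flags an adapter might declare up front.
interface Capabilities {
  logprobs: boolean; // can the provider return per-token log probabilities?
  maxTopLogprobs: number; // how many alternative tokens per position (0 = none)
}

interface ProviderAdapter {
  id: string;
  capabilities: Capabilities;
}

// Produce an explicit notice for the interface instead of failing silently.
function logprobNotice(provider: ProviderAdapter): string {
  const { logprobs, maxTopLogprobs } = provider.capabilities;
  if (!logprobs) {
    return `${provider.id}: token logprobs unavailable; probability views are disabled.`;
  }
  if (maxTopLogprobs === 0) {
    return `${provider.id}: logprobs for the chosen token only.`;
  }
  return `${provider.id}: logprobs with up to ${maxTopLogprobs} alternatives per token.`;
}

// Convert a natural-log probability into a probability in [0, 1].
function toProbability(logprob: number): number {
  return Math.exp(logprob);
}

// Made-up adapters; the values are placeholders, not real provider limits.
const exampleFull: ProviderAdapter = {
  id: "example-full",
  capabilities: { logprobs: true, maxTopLogprobs: 5 },
};
const exampleNone: ProviderAdapter = {
  id: "example-none",
  capabilities: { logprobs: false, maxTopLogprobs: 0 },
};
```

Declaring capabilities as data, rather than probing at request time, lets the interface state in advance which views will work for a given provider.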

Theoretical background

LLMbench operationalises the commitment that AI-generated text deserves hermeneutic reading, not just dismissal or statistical critique. It sits above the geometric instruments in the Vector Lab and is often the presentation layer for findings that begin lower down. A claim tested at the geometric level in Manifold Atlas can be illustrated, read closely, and compared across models in LLMbench.

Stack

Next.js, TypeScript, Three.js. Multi-provider adapters with explicit capability flags. Persistent local storage for prompt history, annotations, and saved comparisons.

Status

Mature. Now at v2.2, a milestone bump marking the local-Ollama path: LLMbench runs end-to-end against a local open-weight model from a deployed origin, with full token-level instrumentation. Ollama logprobs are wired through all four logprob endpoints (Compare overlay, Sampling Probe, Grammar Probe Phase B, Probs), so the same per-token reading that worked against hosted models now works against Gemma, Llama, Qwen, Phi, and DeepSeek running locally.

The cumulative v2.15.34–v2.15.47 work brought browser-direct generation, structured error rendering, a copyable OLLAMA_ORIGINS command in the unreachable-error message, a proper README Ollama setup section, and a Design Rationale paragraph on why, the closer the model sits to the researcher, the more honest the reading becomes.

All five Grammar Probe phases (A through E) and the Sampling Probe Investigate-tier mode are shipped. In active use for teaching and research.

Siblings

Manifold Atlas compares the same models at the geometric level; LLMbench is the presentation layer for that work. Vectorscope opens the internals behind the prose. Manifoldscope characterises a single manifold. Theoryscope applies comparable methods to corpora of theoretical texts.