LLMbench

Tier: comparative tool. Object: generated prose across models.

github.com/vector-lab-tools/LLMbench

LLMbench sends a prompt to two models simultaneously, displays their responses side-by-side, and enables researchers to build layered interpretive readings across both outputs. It is the hermeneutic surface instrument of the Vector Lab: the layer at which models are usually encountered. Where Manifold Atlas compares models at the geometric level and Vectorscope inspects internals, LLMbench handles comparative close reading of generated text.

Why LLMbench

AI-generated text is rarely read closely. It is skimmed, summarised, or dismissed. LLMbench asks a different question: what happens when two models generate in response to the same prompt, and what does the comparison reveal about the geometries below? The tool reuses annotation infrastructure from the Critical Code Studies Workbench, reoriented from source code analysis to LLM output comparison.

Modes

Six modes cover the range of comparative work:

Annotation and cross-panel linking

Annotations can be layered on each panel independently. A cross-panel link feature connects annotations across the two outputs with typed relations (contrast, parallel, divergence, convergence, echo, absence, note) and free-text notes. Annotations and links persist in a saved comparison format.
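As a sketch of what that data model might look like, here is a hypothetical TypeScript shape for annotations, typed cross-panel links, and a saved comparison. The interface and field names are illustrative assumptions, not LLMbench's actual schema; only the seven relation types come from the text above.

```typescript
// Hypothetical data model; names are illustrative, not LLMbench's schema.
type Relation =
  | "contrast" | "parallel" | "divergence"
  | "convergence" | "echo" | "absence" | "note";

interface Annotation {
  id: string;
  panel: "left" | "right"; // which model's output the span belongs to
  start: number;           // character offsets into the generated text
  end: number;
  body: string;            // the reader's note
}

interface CrossPanelLink {
  id: string;
  from: string;            // Annotation id in one panel
  to: string;              // Annotation id in the other panel
  relation: Relation;
  note?: string;           // free-text commentary on the relation
}

// A saved comparison bundles prompt, outputs, annotations, and links.
interface SavedComparison {
  prompt: string;
  outputs: { left: string; right: string };
  annotations: Annotation[];
  links: CrossPanelLink[];
}

// Collect every link that touches a given annotation.
function linksFor(c: SavedComparison, annotationId: string): CrossPanelLink[] {
  return c.links.filter(l => l.from === annotationId || l.to === annotationId);
}
```

Keeping links as id pairs rather than nested objects means either panel's annotations can be edited or deleted without rewriting the link records.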

Providers

Seven providers are supported: OpenAI, Anthropic, Google, Hugging Face, OpenRouter, OpenAI-compatible endpoints, and local Ollama. Logprob support varies by provider; the interface is explicit about what each provider can return.
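A minimal sketch of how per-provider capability flags might be declared and checked, so the UI can disable logprob-based views up front rather than failing at request time. The type and function names are assumptions for illustration; only the provider list comes from the text.

```typescript
// Illustrative capability flags; not LLMbench's actual adapter interface.
type Provider =
  | "openai" | "anthropic" | "google" | "huggingface"
  | "openrouter" | "openai-compatible" | "ollama";

interface Capabilities {
  provider: Provider;
  logprobs: boolean; // can this adapter return token log-probabilities?
}

// The UI consults declared capabilities before offering logprob views.
function canShowLogprobs(caps: Capabilities): boolean {
  return caps.logprobs;
}
```

Declaring capabilities on the adapter keeps the comparison UI provider-agnostic: each panel asks what its adapter can return instead of hard-coding provider names.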

Theoretical background

LLMbench operationalises the commitment that AI-generated text deserves hermeneutic reading, not just dismissal or statistical critique. It sits above the geometric instruments in the Vector Lab and is often the presentation layer for findings that begin lower down. A claim tested at the geometric level in Manifold Atlas can be illustrated, read closely, and compared across models in LLMbench.

Stack

Next.js, TypeScript, Three.js. Multi-provider adapters with explicit capability flags. Persistent local storage for prompt history, annotations, and saved comparisons.
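A sketch of how saved comparisons might round-trip through browser localStorage. The storage key and record shape are assumptions, not LLMbench's actual persistence format; the JSON helpers are kept separate from the storage calls so they can run outside a browser.

```typescript
// Hypothetical persistence helpers; key and shape are assumptions.
interface ComparisonRecord {
  prompt: string;
  savedAt: string; // ISO timestamp
}

const STORAGE_KEY = "llmbench:comparisons";

function serialize(records: ComparisonRecord[]): string {
  return JSON.stringify(records);
}

function deserialize(raw: string | null): ComparisonRecord[] {
  return raw ? (JSON.parse(raw) as ComparisonRecord[]) : [];
}

// In the browser these would wrap localStorage:
//   localStorage.setItem(STORAGE_KEY, serialize(records));
//   const records = deserialize(localStorage.getItem(STORAGE_KEY));
```

Treating `null` (no saved data yet) as an empty list means first-run and returning sessions go through the same code path.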

Status

Mature. Currently at v2.8.0 with prompt history in all Analyse modes, cross-panel annotation links, cosine similarity in the divergence view, and ten guided exercises. In active use for teaching and research.
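For reference, cosine similarity between two vectors is the standard measure a divergence view of this kind would compute; the sketch below shows the usual formulation (dot product over the product of norms), not LLMbench's actual implementation.

```typescript
// Cosine similarity between two equal-length vectors: a·b / (|a| |b|).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Identical directions score 1, orthogonal vectors score 0, which makes the measure a natural fit for flagging where two outputs' embeddings diverge.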

Siblings

Manifold Atlas compares the same models at the geometric level; LLMbench is the presentation layer for that work. Vectorscope opens the internals behind the prose. Manifoldscope characterises a single manifold. Theoryscope applies comparable methods to corpora of theoretical texts.