Use Case

Auto Research

Can models participate in the loops that drive scientific and engineering progress?

Definition

What is Auto Research?

Auto Research is the discipline of building AI agents that participate in the loops driving scientific and engineering progress — hypothesis generation, data selection, training pipeline design, experimentation, evaluation, and iteration.

Bake AI delivers Auto Research benchmarks as executable tasks: sandboxed Docker environments with fixed pipelines, fixed seeds, hardware budgets, a graded metric, a reference baseline, and a reference solution from a working ML researcher. Agents are scored on whether their submitted code actually moves the metric — not whether it compiles.

Task Anatomy

What ships in every task

instruction.md
Problem statement, setup, goal, and rules.
task.toml
Graded metric, baseline score, reference score, hardware budget.
environment/
Dockerfile, fixed training and conversion pipeline, datasets.
solution/
Reference solution from a working ML researcher.
tests/
Verifier scripts that run the pipeline and compare metrics.
Example: data_select_ifeval
Graded metric ifeval_prompt_strict — baseline 0.38, reference 0.42.
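Putting the pieces above together, a task manifest for something like data_select_ifeval might look like the following sketch. The field names are illustrative assumptions, not Bake AI's actual task.toml schema; only the metric name and the two scores come from the task card above.

```toml
# Illustrative sketch — field names are assumptions, not the real schema.
[task]
name = "data_select_ifeval"

[grading]
metric = "ifeval_prompt_strict"
baseline_score = 0.38    # random data selection
reference_score = 0.42   # working ML researcher's solution

[budget]
gpus = 1
gpu_type = "H100"
wall_clock_hours = 8
```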

Reference Benchmarks

Named benchmarks Auto Research builds on

AutoLab

Bake AI's public benchmark for autonomous research agents. Executable ML tasks with verified pipelines.

IFEval

Instruction-following benchmark (~541 prompts). Used as the graded metric in data_select_ifeval.
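IFEval's prompt-level strict accuracy works like this: each prompt carries one or more verifiable instructions, and the prompt scores 1 only if the raw response satisfies all of them. A minimal sketch (the checker functions here are simplified stand-ins for IFEval's real verifiers):

```python
# Sketch of prompt-level strict accuracy: a prompt counts only if EVERY
# verifiable instruction attached to it passes on the raw response.
# These checkers are simplified stand-ins, not IFEval's actual verifiers.

def check_word_count_at_least(response: str, n: int) -> bool:
    return len(response.split()) >= n

def check_no_commas(response: str) -> bool:
    return "," not in response

def prompt_strict(responses, instruction_sets):
    """Fraction of prompts where ALL attached checks pass."""
    hits = 0
    for response, checks in zip(responses, instruction_sets):
        if all(check(response) for check in checks):
            hits += 1
    return hits / len(responses)

score = prompt_strict(
    ["one two three four", "short, reply"],
    [
        [lambda r: check_word_count_at_least(r, 3)],  # all checks pass
        [check_no_commas],                            # comma present: fails
    ],
)
print(score)  # 0.5
```

This all-or-nothing scoring is why small gains on ifeval_prompt_strict are hard to fake: partially following an instruction set earns nothing.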

Qwen2.5-3B-Instruct

The starting model for the task, fine-tuned via LoRA on a single H100 within the agent's 8-hour budget.
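LoRA is what makes that budget realistic: instead of updating every weight, it trains a rank-r pair of matrices per adapted projection, so the trainable parameter count collapses. A back-of-envelope sketch (hidden size, layer count, and rank below are illustrative, not Qwen2.5-3B's exact config):

```python
# Why LoRA fits the budget: for a frozen weight W of shape (d_out, d_in),
# LoRA trains only B @ A of rank r, i.e. r * (d_in + d_out) parameters.
# All numbers below are illustrative assumptions, not the model's real config.

def lora_trainable_params(shapes, r):
    return sum(r * (d_in + d_out) for d_out, d_in in shapes)

hidden = 2048
# Suppose we adapt two projections per layer across 36 layers (an assumption):
shapes = [(hidden, hidden)] * 2 * 36

full = sum(d_out * d_in for d_out, d_in in shapes)   # frozen params in scope
lora = lora_trainable_params(shapes, r=16)           # trainable adapter params
print(f"{lora:,} trainable vs {full:,} frozen ({100 * lora / full:.2f}%)")
```

Training roughly 1–2% of the in-scope parameters is what lets a 3B model fit on one H100 with optimizer state to spare.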

The Problem

Where the research pipeline breaks down

Generating plausible code is easy. Beating a real baseline under a fixed budget — on a real dataset, with a real grader — is where agents fall apart.

Below Baseline

Code that runs cleanly but fails to beat the random baseline on the actual metric

Doesn't Generalize

Solutions that overfit one data pool or model and collapse when the seed, source mix, or checkpoint changes

Budget Blown

Training runs that silently blow the GPU, memory, or wall-clock budget and return no usable result
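The third failure mode above is avoidable with an explicit guard: check the wall clock every training step and fail loudly, with a checkpoint, instead of returning nothing. A minimal sketch (class and method names are hypothetical, not part of any Bake AI harness):

```python
# Minimal wall-clock budget guard — stop a run before it silently blows
# the limit. Names here are illustrative, not an actual harness API.
import time

class BudgetExceeded(Exception):
    pass

class WallClockBudget:
    def __init__(self, seconds: float):
        self.deadline = time.monotonic() + seconds

    def check(self):
        """Raise loudly instead of returning no usable result."""
        if time.monotonic() > self.deadline:
            raise BudgetExceeded("wall-clock budget exhausted")

budget = WallClockBudget(seconds=8 * 3600)
for step in range(3):     # stand-in for a training loop
    budget.check()        # call once per step; checkpoint before re-raising
```

The same pattern extends to GPU memory and token budgets: measure at a fixed cadence, raise a typed exception, and let the caller decide whether a partial checkpoint is still gradable.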

How It Works

From plausible code to a metric that actually moves

BakeLens audits each attempt against baseline and reference scores. Proof delivers verified task environments and expert solution traces.

BakeLens audits the auto-research pipeline

1

Trace every stage: task understanding, plan, code, training run, evaluation, retry

2

Score each attempt against the task's baseline and reference solution, not just whether the code ran

3

Surface where agents waste compute, pick the wrong heuristic, or hit silent budget limits
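One natural way to implement step 2 — scoring against baseline and reference rather than against "did it run" — is to normalize the metric so the baseline maps to 0.0 and the reference to 1.0. This formula is an assumption for illustration, not necessarily BakeLens's exact scoring rule:

```python
# Normalize an attempt's metric: 0.0 at the baseline, 1.0 at the reference
# solution, linearly interpolated and clipped. An illustrative formula,
# not necessarily the actual BakeLens scoring rule.

def normalized_score(metric: float, baseline: float, reference: float) -> float:
    if reference == baseline:
        raise ValueError("reference must differ from baseline")
    raw = (metric - baseline) / (reference - baseline)
    return max(0.0, min(raw, 1.0))

# data_select_ifeval numbers from the task card: baseline 0.38, reference 0.42
print(round(normalized_score(0.40, 0.38, 0.42), 3))  # 0.5 — halfway to reference
print(round(normalized_score(0.36, 0.38, 0.42), 3))  # 0.0 — below baseline, clipped
```

Clipping at 0.0 makes "runs cleanly but lands below baseline" score exactly as well as not running at all, which is the point.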

Diagnosed by BakeLens

Proof delivers verified research environments

1

Reproducible Docker task environments with fixed training pipelines, datasets, and graders

2

Baseline and reference solutions written by working ML researchers, with metric scores attached

3

Step-by-step expert traces showing how a researcher reasons from task spec to a metric-moving solution

Powered by Proof

What You Get

Deliverables

Verified Task Environments

Sandboxed Docker tasks with fixed pipelines, fixed seeds, hardware budgets, and a graded metric — built like AutoLab's data_select_ifeval

Expert Solution Traces

Reference solutions from ML researchers who beat the baseline, with the reasoning, code diffs, and ablations behind each decision

Auto Research Eval Suite

Held-out tasks that measure agents on real metric improvements — not whether the code compiles, but whether it actually wins

Built for AI Operating Beyond Benchmarks

Diagnosis, evaluation, expert data, and environments for production deployment.