Use Case

Auto Research

Can models participate in the loops that drive scientific and engineering progress?

Definition

What is Auto Research?

Auto Research is the discipline of building AI agents that participate in the loops driving scientific and engineering progress — hypothesis generation, data selection, training pipeline design, experimentation, evaluation, and iteration.

Bake AI delivers Auto Research benchmarks as executable tasks: sandboxed Docker environments with fixed pipelines, fixed seeds, hardware budgets, a graded metric, a reference baseline, and a reference solution from a working ML researcher. Agents are scored on whether their submitted code actually moves the metric — not whether it compiles.

Task Anatomy

What ships in every task

instruction.md
Problem statement, setup, goal, and rules.
task.toml
Graded metric, baseline score, reference score, hardware budget.
environment/
Dockerfile, fixed training and conversion pipeline, datasets.
solution/
Reference solution from a working ML researcher.
tests/
Verifier scripts that run the pipeline and compare metrics.
Example: data_select_ifeval
Graded metric ifeval_prompt_strict — baseline 0.38, reference 0.42.
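Putting the pieces above together, a task manifest for something like data_select_ifeval might look like the following sketch. The field names are illustrative assumptions, not Bake AI's actual task.toml schema; only the metric name and the two scores come from the task card above.

```toml
# Illustrative sketch — field names are assumptions, not the real schema.
[task]
name = "data_select_ifeval"

[grading]
metric = "ifeval_prompt_strict"
baseline_score = 0.38    # random data selection
reference_score = 0.42   # working ML researcher's solution

[budget]
gpus = 1
gpu_type = "H100"
wall_clock_hours = 8
```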

Reference Benchmarks

Named benchmarks Auto Research builds on

AutoLab

Bake AI's public benchmark for autonomous research agents. Executable ML tasks with verified pipelines.

IFEval

Instruction-following benchmark (~541 prompts). Used as the graded metric in data_select_ifeval.
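IFEval's prompt-level strict accuracy works like this: each prompt carries one or more verifiable instructions, and the prompt scores 1 only if the raw response satisfies all of them. A minimal sketch (the checker functions here are simplified stand-ins for IFEval's real verifiers):

```python
# Sketch of prompt-level strict accuracy: a prompt counts only if EVERY
# verifiable instruction attached to it passes on the raw response.
# These checkers are simplified stand-ins, not IFEval's actual verifiers.

def check_word_count_at_least(response: str, n: int) -> bool:
    return len(response.split()) >= n

def check_no_commas(response: str) -> bool:
    return "," not in response

def prompt_strict(responses, instruction_sets):
    """Fraction of prompts where ALL attached checks pass."""
    hits = 0
    for response, checks in zip(responses, instruction_sets):
        if all(check(response) for check in checks):
            hits += 1
    return hits / len(responses)

score = prompt_strict(
    ["one two three four", "short, reply"],
    [
        [lambda r: check_word_count_at_least(r, 3)],  # all checks pass
        [check_no_commas],                            # comma present: fails
    ],
)
print(score)  # 0.5
```

This all-or-nothing scoring is why small gains on ifeval_prompt_strict are hard to fake: partially following an instruction set earns nothing.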

Qwen2.5-3B-Instruct

The starting model for the task, fine-tuned via LoRA on a single H100 within the agent's 8-hour budget.
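LoRA is what makes that budget realistic: instead of updating every weight, it trains a rank-r pair of matrices per adapted projection, so the trainable parameter count collapses. A back-of-envelope sketch (hidden size, layer count, and rank below are illustrative, not Qwen2.5-3B's exact config):

```python
# Why LoRA fits the budget: for a frozen weight W of shape (d_out, d_in),
# LoRA trains only B @ A of rank r, i.e. r * (d_in + d_out) parameters.
# All numbers below are illustrative assumptions, not the model's real config.

def lora_trainable_params(shapes, r):
    return sum(r * (d_in + d_out) for d_out, d_in in shapes)

hidden = 2048
# Suppose we adapt two projections per layer across 36 layers (an assumption):
shapes = [(hidden, hidden)] * 2 * 36

full = sum(d_out * d_in for d_out, d_in in shapes)   # frozen params in scope
lora = lora_trainable_params(shapes, r=16)           # trainable adapter params
print(f"{lora:,} trainable vs {full:,} frozen ({100 * lora / full:.2f}%)")
```

Training roughly 1–2% of the in-scope parameters is what lets a 3B model fit on one H100 with optimizer state to spare.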

The Problem

Where the research pipeline breaks down

Generating plausible code is easy. Beating a real baseline under a fixed budget — on a real dataset, with a real grader — is where agents fall apart.

Below Baseline

Code that runs cleanly but fails to beat the random baseline on the actual metric

Doesn't Generalize

Solutions that overfit one data pool or model and collapse when the seed, source mix, or checkpoint changes

Budget Blown

Training runs that silently blow the GPU, memory, or wall-clock budget and return no usable result
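The third failure mode above is avoidable with an explicit guard: check the wall clock every training step and fail loudly, with a checkpoint, instead of returning nothing. A minimal sketch (class and method names are hypothetical, not part of any Bake AI harness):

```python
# Minimal wall-clock budget guard — stop a run before it silently blows
# the limit. Names here are illustrative, not an actual harness API.
import time

class BudgetExceeded(Exception):
    pass

class WallClockBudget:
    def __init__(self, seconds: float):
        self.deadline = time.monotonic() + seconds

    def check(self):
        """Raise loudly instead of returning no usable result."""
        if time.monotonic() > self.deadline:
            raise BudgetExceeded("wall-clock budget exhausted")

budget = WallClockBudget(seconds=8 * 3600)
for step in range(3):     # stand-in for a training loop
    budget.check()        # call once per step; checkpoint before re-raising
```

The same pattern extends to GPU memory and token budgets: measure at a fixed cadence, raise a typed exception, and let the caller decide whether a partial checkpoint is still gradable.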

How It Works

From plausible code to a metric that actually moves

BakeLens audits each attempt against baseline and reference scores. Proof delivers verified task environments and expert solution traces.

BakeLens audits the auto-research pipeline

1

Trace every stage: task understanding, plan, code, training run, evaluation, retry

2

Score each attempt against the task's baseline and reference solution, not just whether the code ran

3

Surface where agents waste compute, pick the wrong heuristic, or hit silent budget limits
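One natural way to implement step 2 — scoring against baseline and reference rather than against "did it run" — is to normalize the metric so the baseline maps to 0.0 and the reference to 1.0. This formula is an assumption for illustration, not necessarily BakeLens's exact scoring rule:

```python
# Normalize an attempt's metric: 0.0 at the baseline, 1.0 at the reference
# solution, linearly interpolated and clipped. An illustrative formula,
# not necessarily the actual BakeLens scoring rule.

def normalized_score(metric: float, baseline: float, reference: float) -> float:
    if reference == baseline:
        raise ValueError("reference must differ from baseline")
    raw = (metric - baseline) / (reference - baseline)
    return max(0.0, min(raw, 1.0))

# data_select_ifeval numbers from the task card: baseline 0.38, reference 0.42
print(round(normalized_score(0.40, 0.38, 0.42), 3))  # 0.5 — halfway to reference
print(round(normalized_score(0.36, 0.38, 0.42), 3))  # 0.0 — below baseline, clipped
```

Clipping at 0.0 makes "runs cleanly but lands below baseline" score exactly as well as not running at all, which is the point.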

Diagnosed by BakeLens

Proof delivers verified research environments

1

Reproducible Docker task environments with fixed training pipelines, datasets, and graders

2

Baseline and reference solutions written by working ML researchers, with metric scores attached

3

Step-by-step expert traces showing how a researcher reasons from task spec to a metric-moving solution

Powered by Proof

What You Get

Deliverables

Verified Task Environments

Sandboxed Docker tasks with fixed pipelines, fixed seeds, hardware budgets, and a graded metric — built like AutoLab's data_select_ifeval

Expert Solution Traces

Reference solutions from ML researchers who beat the baseline, with the reasoning, code diffs, and ablations behind each decision

Auto Research Eval Suite

Held-out tasks that measure agents on real metric improvements — not whether the code compiles, but whether it actually wins

Built for AI Operating Beyond Benchmarks

Diagnosis, evaluation, expert data, and environments for production deployment.