Use Case
Coding Models
Repo-level coding ≠ solving LeetCode.
Common failures
- Code that passes unit tests but breaks integration due to wrong abstraction or assumptions
- Debugging that patches symptoms without understanding the call graph
- Generated tests that cover happy paths and miss the failures that matter in production
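The first failure mode is easy to show in miniature. A minimal, hypothetical sketch (no real repo or BakeLens code): a helper that passes its unit test in isolation but breaks downstream because it picked the wrong abstraction — floats for money.

```python
# Hypothetical example: unit test passes, integration breaks.

def parse_amount(s: str) -> float:
    """Parse a money string like '$12.50' into a number."""
    return float(s.strip("$"))

# Unit test: passes in isolation.
assert parse_amount("$12.50") == 12.5

# Integration: downstream ledger code sums these values, and binary
# floating point accumulates rounding error — the wrong abstraction
# (float instead of integer cents) only shows up across modules.
ledger = [parse_amount("$0.10") for _ in range(3)]
total = sum(ledger)
assert total != 0.3  # 0.1 + 0.1 + 0.1 is not exactly 0.3 in binary float
```

Function-level tests would never flag this; only a trace across the call graph does.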
BakeLens traces the coding pipeline
- Follow the full chain: file read → edit → test → debug → commit, not just the final diff
- Classify failures: wrong file, wrong function, wrong logic, wrong test, wrong context
- Measure regression: does fixing one file break another? How often, and where?
Proof delivers repo-level expert data
- Senior engineers annotate real repo-level tasks with reasoning, not just correct outputs
- Debugging traces with root-cause analysis that explain why the fix works
- Integration test data covering cross-file dependencies and edge cases
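One way such an annotation record could be shaped — every field name below is illustrative, not Proof's actual schema. The point is that the reasoning chain, root cause, and integration coverage travel with the fix:

```python
# Hypothetical expert annotation record for one repo-level task.
annotation = {
    "task": "Fix drifting totals in billing/ledger.py",
    "reasoning": [
        "Totals drift only on multi-item carts, so the bug is cumulative.",
        "parse_amount returns binary floats; rounding error accumulates.",
        "Move the ledger to integer cents end to end.",
    ],
    "root_cause": "float currency representation (wrong abstraction)",
    "fix": "store amounts as int cents; convert to dollars at display time only",
    "integration_tests": [
        "cart of 3 x $0.10 totals exactly $0.30",
        "cross-module: invoice renderer accepts int cents",
    ],
}
```

A model trained on records like this sees why a fix works, not just that it does.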
Deliverables
Coding pipeline diagnosis
Where in the edit-test-debug loop your agent fails, and how often
Expert coding datasets
Repo-level tasks annotated by senior engineers with step-by-step rationale
Integration eval suite
Tests that catch cross-file and cross-module failures, not just function-level correctness