Simple, ergonomic, and high-performance evals. We use it at OpenAI as part of our infrastructure to run Preparedness evaluations.
- Minimal indirection. You should be able to implement and understand an eval in 100 lines.
- Separation of concerns. Keep data loading away from completions/parsing/different ways of running an eval.
- Fast iteration and testability. nanoeval should import in less than a second and be testable without a live LLM backend.
- High performance. Nanoeval should max out the compute resources available to it.
- `Eval`: A `chz` class. Enumerates a set of tasks and (typically) uses a `Solver` to solve them, then records the results. Can be configured in code or on the CLI using a chz entrypoint.
- `EvalSpec`: An eval to run, plus the runtime characteristics of how to run it (e.g. concurrency, recording, and other administrivia).
- `Task`: A single scoreable unit of work.
- `Solver`: A strategy (usually involving sampling a model) to go from a task to a result that can be scored. For example, there may be different ways to prompt a model to answer a multiple-choice question (e.g. looking at logits, few-shot prompting, etc.).
```bash
# Using https://github.com/astral-sh/uv (recommended)
uv add "git+https://github.com/openai/SWELancer-Benchmark#egg=nanoeval&subdirectory=project/nanoeval"

# Using pip
pip install "git+https://github.com/openai/SWELancer-Benchmark#egg=nanoeval&subdirectory=project/nanoeval"
```
nanoeval is pre-release software and may have breaking changes, so it's recommended that you pin your installation to a specific commit. The uv command above will do this for you.
See `gpqa_api.py` for an implementation of GPQA using the OpenAI API in <70 lines of code.
At the highest level: nanoeval is just a library. You can import it and call `nanoeval.run()` on an `EvalSpec`. nanoeval then loads all the tasks and runs `eval.evaluate()` in parallel using asyncio.
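For example, a minimal driver script might look like the sketch below. Treat it as an outline rather than a verbatim recipe: the import path for `EvalSpec`/`RunnerArgs`, the `GPQAEval` class, and the assumption that `nanoeval.run()` is awaitable are all illustrative; `gpqa_api.py` shows the real thing.

```python
import asyncio

import nanoeval
from nanoeval.evaluation import EvalSpec, RunnerArgs  # assumed import path; adjust to your checkout

from my_evals.gpqa import GPQAEval  # hypothetical module containing your Eval class


async def main() -> None:
    # An EvalSpec pairs the eval itself with runtime settings (concurrency,
    # recording, and other administrivia) held by RunnerArgs.
    spec = EvalSpec(
        eval=GPQAEval(),
        runner=RunnerArgs(),
    )
    # nanoeval.run() queues the eval's tasks in sqlite and drives the executors.
    await nanoeval.run(spec)


if __name__ == "__main__":
    asyncio.run(main())
```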
More granularly: nanoeval operates like a tiny distributed system. Eval state is tracked in a per-process sqlite database in `/dev/shm` (or `/tmp` on macOS). When you call `.run()`, it queues up the eval and all of its tasks in sqlite. It then starts one or more executors that continually poll the db for new tasks, run them, and put the results back in the db.
The executors can operate in two modes:
- In-process: The executor is just an async task running in the same process as the main eval script. This is the default.
- Multiprocessing: Starts a pool of executor processes that all poll the db. Use this via `spec.runner.experimental_use_multiprocessing=True`.
nanoeval has been tested up to ~5,000 concurrent rollouts and can likely go higher. For the highest performance, use multiprocessing with as many processes as your system's memory and core count allow. See `RunnerArgs` for documentation.
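As an illustration, enabling multiprocessing is a one-line change on the spec. In the sketch below, only `experimental_use_multiprocessing` comes from this document; the import path and the `concurrency` field name are assumptions, so consult `RunnerArgs` for the actual options.

```python
from nanoeval.evaluation import EvalSpec, RunnerArgs  # assumed import path

spec = EvalSpec(
    eval=my_eval,  # an Eval instance constructed elsewhere
    runner=RunnerArgs(
        experimental_use_multiprocessing=True,  # pool of executor processes polling the db
        concurrency=1024,  # assumed name for the maximum number of parallel rollouts
    ),
)
```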
nanoeval has a tiny built-in monitor to track ongoing evals. It's a Streamlit app that visualizes the internal run-state database, which can be helpful for diagnosing hangs on specific tasks. To use it:
```bash
# either set spec.runner.use_monitor=True OR run this command:
python3 -m nanoeval.bin.mon
```
Because nanoeval uses a persistent database to track the state of individual tasks in a run, you can restart an in-progress eval if it crashes. (In-progress rollouts will be restarted from scratch, but completed rollouts will be saved.) To do this:
```bash
# Restarts the eval in a new process
python3 -m nanoeval.bin.resume run_set_id=...
```
You can list all run sets (databases) using the following command:
```bash
ls -lh "$(python3 -c "from nanoeval.fs_paths import database_dir; print(database_dir())")"
```
The run set ID for each database is simply the filename, without the `.db*` suffix.
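Equivalently, a short Python sketch using the same `database_dir()` helper prints just the run set IDs:

```python
from pathlib import Path

from nanoeval.fs_paths import database_dir

# Each *.db file is one run set; the run set ID is the filename without the suffix.
for db_file in sorted(Path(database_dir()).glob("*.db")):
    print(db_file.stem)
```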
An eval is just a `chz` class that defines `get_name()`, `get_tasks()`, `evaluate()`, and `get_summary()`. Start with `gpqa_simple.py`; copy it and modify it to suit your needs. If necessary, drop down to the base `nanoeval.Eval` class instead of using `MCQEval`.
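A minimal skeleton might look like the sketch below. The four method names come from this document, but the exact signatures, the result type handed to `get_summary()`, and the `solver.solve()` call are assumptions; `gpqa_simple.py` remains the authoritative starting point.

```python
from typing import Any, Sequence

import chz
from nanoeval import Eval
from nanoeval.solvers.mcq import Answer, MCQSolver, MCQTask


@chz.chz
class MyEval(Eval[MCQTask, Answer]):
    solver: MCQSolver

    def get_name(self) -> str:
        return "my_eval"

    async def get_tasks(self) -> Sequence[MCQTask]:
        # Load your dataset here and turn each row into an MCQTask.
        return []

    async def evaluate(self, task: MCQTask) -> Answer:
        # Delegate to the solver; the .solve() method name is an assumption.
        return await self.solver.solve(task)

    async def get_summary(self, results: list[Answer]) -> dict[str, Any]:
        # Aggregate per-task results into summary metrics.
        return {"num_results": len(results)}
```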
The following sections describe common use cases and how to achieve them.
You may import code from any `nanoeval.*` package that does not start with an underscore. Functions and classes that start with an underscore are considered private.
Many things you might want to use in an eval (e.g., external resources, tools) require one-time creation and cleanup. To accommodate this, `nanoeval.run` will `__aenter__` an eval on start and `__aexit__` it on exit. We also have a simple wrapper for these functions called `HasAsyncContextManager` that lets you do something like this:
```python
from typing_extensions import override
from typing import AsyncGenerator

import chz
from nanoeval import Eval
from nanoeval.solvers.mcq import MCQTask, Answer, MCQSolver


@chz.chz
class MCQEval(Eval[MCQTask, Answer]):
    solver: MCQSolver
    n_consensus: int = 1

    @override
    async def _context(self) -> AsyncGenerator[None, None]:
        async with self.solver:
            yield
```
Nanoeval uses `multiprocessing` to execute rollouts in parallel. Sometimes, if you ctrl-c the main job, the multiprocessing executors don't have time to exit. A quick fix:
```bash
pkill -f multiprocessing.spawn
```
`py-spy` is an excellent tool to figure out where processes are stuck if progress isn't happening. You can check the monitor to find the PIDs of all the executors and py-spy them one by one. The executors also run `aiomonitor`, so you can connect to them via `python3 -m aiomonitor.cli ...` to inspect async tasks.
nanoeval relies heavily on Python asyncio for concurrency within each executor process, so blocking the main thread hurts performance and stalls the event loop. A common footgun is making a synchronous LLM or HTTP call, which can stall the main thread for dozens of seconds.
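The general fix is ordinary asyncio hygiene rather than anything nanoeval-specific: use an async client, or push the synchronous call onto a worker thread. A sketch (the `slow_http_get` helper and the use of `requests` are illustrative):

```python
import asyncio

import requests  # a synchronous HTTP client, used here only as an example


def slow_http_get(url: str) -> str:
    # Synchronous I/O: calling this directly from a coroutine would stall the event loop.
    return requests.get(url, timeout=60).text


async def fetch(url: str) -> str:
    # asyncio.to_thread runs the blocking call on a worker thread, so the
    # executor's event loop keeps making progress on other rollouts.
    return await asyncio.to_thread(slow_http_get, url)
```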
Tracking down blocking calls can be annoying, so nanoeval comes with some built-in features to diagnose these.
- Blocking synchronous calls will trigger a stacktrace dump to a temporary directory. You can see them by running `open "$(python3 -c "from nanoeval.fs_paths import stacktrace_root_dir; print(stacktrace_root_dir())")"`.
- Blocking synchronous calls will also trigger a console warning.