Building haven bench in the open, and the flaky CI ghost it flushed out

I built a small thing this week and it caught a bigger thing. That’s the whole post, really. But the shape of how it happened is worth writing down, because it’s the kind of story that usually gets quietly squashed into a one-line commit message and never told. Building in the open means showing the part where the harbor light flickers, not just the part where the boat docks clean.

The small thing is haven bench, a command that tells you how fast a model actually runs on your hardware. The bigger thing was a flaky CI failure that wore my new feature as a disguise.

The small thing: a number you can trust

If you run local models, you live and die by tokens per second. It’s the single number everyone in r/LocalLLaMA trades like baseball cards, and yet most people read it off a vibe: “feels fast on my 3090.” I wanted InferHaven to just tell you, honestly, on your own box.

haven@haven · ~

$ haven bench qwen2.5-coder:3b --runs 3

  InferHaven bench — qwen2.5-coder:3b-instruct-q4_K_M
  run 1          106.0 tok/s
  run 2          106.3 tok/s
  run 3          106.0 tok/s
  generation     106.1 tok/s  <- decode rate (avg of 3 runs)
  prompt eval   3507.8 tok/s  (40 prompt tokens)
  load            0.17 s     (run 1: weights -> VRAM)
  total           1.39 s     (run 1)
  method       num_predict=128, seed=0, temp=0, runs=3

The headline is generation, the decode rate. Under the hood Ollama hands back its timings in nanoseconds, and the math is just eval_count / (eval_duration / 1e9). The reason that’s the honest number and not, say, total is subtle and important: eval_duration excludes model load and prompt ingestion. So it’s the pure speed of the model writing tokens, and it holds steady whether the model was cold or already warm in VRAM.

One more bit of honesty baked into the output: prompt eval is gloriously noisy on a short prompt (3507 tok/s above, but it bounces between runs), because you’re dividing a tiny token count by a tinier duration. So bench reports it, but quietly. The number it puts in green, the one it wants you to believe, is generation. A benchmark that oversells itself is just a vibe with extra steps.

I wrote it the slow way on purpose: the core tokens/sec math by hand, with a unit test seeded from a real run off my RTX 3060, so I could actually explain every line of it instead of cargo-culting a one-liner. Seven assertions, all green. Shellcheck clean. Ran it live against three models. Pushed the PR.

And then CI turned red.

The red light

Two smoke jobs run on every PR: a slim codespaces stack and the full full-stack one. Full-stack went green. Codespaces failed, and not in my test. It failed bringing the container up, before my code ever ran:

ci · smoke (codespaces) · Bring devcontainer up

Running the postCreateCommand from devcontainer.json...
Error response from daemon: container 46a4… is not running
postCreateCommand from devcontainer.json failed with exit code 1.
##[error]Process completed with exit code 1.

This is the moment that decides whether you’re a good engineer or a fast one. The tempting move is: it’s my PR, it’s the only thing that changed Poke at the test, re-run it, add a sleep, wrap something in a || true and move on. But if you don’t get to the root of the issue it will most likely just surface again.

No fix without a root cause first. A red test you “fixed” by re-running is just a bug you’ve agreed to meet again later, usually in front of a stranger.
— The rule I make myself follow

So I did the boring thing instead and actually looked.

Following the evidence, not the vibe

The accusation was “your PR broke the build.” The evidence disagreed, layer by layer.

What it looked like

My PR broke CI

Only my branch changed
Red appeared right after I pushed
It's the new feature, obviously
Just re-run it / patch the test

What the evidence said

My PR was a bystander

My diff touched zero boot-path files
Full-stack booted the SAME image fine
The failure was in container startup, before my code ran
main had failed this exact way before, intermittently

Three facts did the work. First, my diff touched a CLI command, a library function, a test, and some docs. Nothing in the container’s startup path. Second, the full-stack job built the very same workspace image and came up clean; if my scripts could kill a boot, both jobs would die, not one. Third, the failure happened during up, before the step that runs my new code ever executed.

That’s not a guilty feature. That’s a flaky boot that happened to be standing next to me when the cops showed up.

The good news is I’d wired the CI to dump the dying container’s logs on failure, and the container’s last words were the whole case:

ci · smoke (codespaces) · Dump logs on failure

workspace-1 | chown: cannot access '/home/haven/.zcompdump-46a4482c870c-5.9.lock':
              No such file or directory

There it is. 46a4482c870c is the container’s own ID. .zcompdump-…-5.9.lock is a zsh completion lock file, created and deleted in milliseconds while the shell builds its completion cache. And the thing that tripped over it was my entrypoint’s first-boot ownership sweep: a recursive chown -R over the home directory.

The bug: a chown that raced the tide

Here’s the entire bug, and it’s a beauty because there’s almost nothing to it:

docker/workspace/scripts/entrypoint.sh

set -e
# ...
chown -R "${HAVEN_USER}:${HAVEN_USER}" "${HOME_DIR}"

chown -R walks the tree, builds a list of things to change, then changes them. If a transient file (say, a zsh completion lock) exists when the walk lists it but is gone by the time chown reaches it, chown exits non-zero. And set -e says: any command that fails, abort the script. So the entrypoint dies. So the container exits. So forty seconds later, when the devcontainer tries to run its setup step, the daemon shrugs and says that container is not running.

It’s a time-of-check versus time-of-use race, and like all races it only loses sometimes, which is exactly why main was usually green and only occasionally, mysteriously, wasn’t. My PR didn’t cause it. My PR just rolled the dice enough times to hit it.

The fix is the same guard the other recursive chowns in that file already had. I’d just missed these two:

the one-line-class fix

- chown -R "${HAVEN_USER}:${HAVEN_USER}" "${HOME_DIR}"
+ chown -R "${HAVEN_USER}:${HAVEN_USER}" "${HOME_DIR}" 2>/dev/null || true

|| true tells set -e to let this one slide, and a vanished lock file stops being a death sentence for the whole container. I proved the mechanism in isolation first (a deliberately failing chown -R under set -e halts the script; the guarded version sails right past it), then shipped it as its own small PR, separate from the benchmark, so the history reads honestly: here’s a feature, and here’s an unrelated bug the feature flushed out.

What I actually want you to take from this

The benchmark is nice. Go run haven bench on your own card and post the number. Honest decode rates are good for everyone, and the more of them in the wild the less anyone has to guess.

But the part I think is worth more than the feature is the shape of the hunt:

A green unit test is not a green build. My math was perfect. The bug was three layers away from my math, in startup code I didn’t touch.
Make your failures talk. That “dump logs on failure” step cost me five minutes to write months ago and handed me the entire diagnosis in one line. Future-you is a stranger; leave them evidence.
Resist the re-run. The single most expensive habit in software is treating a flaky test as noise. Flaky almost always means real bug, intermittent trigger. This one had been quietly failing for weeks.
Fix the root, label it honestly. The race got its own PR with its own explanation. Nobody reading the history six months from now has to wonder why a chown grew a || true.

That’s the whole reason I’m building this in the open. Not because the wins make good screenshots, but because the misses are where the actual craft lives, and most of the industry hides them.

Both PRs are merged. The light’s back to steady.

haven@laptop · ~

$ git clone https://github.com/InferHaven/inferhaven-core
$ cd inferhaven-core
$ cp .env.example .env
$ docker compose up -d
$ ssh haven@localhost
$ haven bench    # tell me what your card does

Float your boat up to the dock, clone the repo, run the benchmark, and if you want the managed version when it’s ready, the waitlist on the homepage is the way in. The lighthouse is on. And now it doesn’t flicker.

— Ethan L.