The Rails Infrastructure team has been working on making Bundler faster, and our work has paid off: a cold bundle install is 3x faster on a Gemfile with 452 gems compared to Bundler 2.7. But “faster” only means something if everyone agrees on what’s being measured. If two people run benchmarks with different definitions of “cold install” or different cache states, the results aren’t comparable. We needed a shared tool that would give us confidence, both internally and externally, that we’re tackling the right problems and actually making Bundler faster. Along the way we also learned when Claude can be helpful and when not to outsource our own expertise and thinking.

What affects bundle install time

Before building anything, we had to understand what variables go into bundle install performance. There are more than you’d expect:

  • Number of gems: A Gemfile with 35 gems takes less time to install than a Gemfile with 500 gems.
  • Depth of dependencies: A flat Gemfile with no transitive dependencies resolves faster than a deep dependency tree.
  • Native extensions: Gems like bigdecimal take orders of magnitude longer to install than pure Ruby gems (bigdecimal alone takes 3 seconds to install). The ratio of native extension gems to pure Ruby gems in your Gemfile changes the install profile significantly. It’s rare to have a Gemfile without at least a few dependencies on native extensions.
  • How Ruby is compiled: Optimization flags, compiler version, and platform all affect gem compilation time.
  • Network time: Downloading from rubygems.org introduces latency and rate limiting that can skew results.
  • Number of cores: Bundler parallelizes installs across worker threads, so core count matters.
  • Endpoint security software: On company-issued, managed machines, security software that scans file writes adds measurable time to every gem install. Running the same benchmark on a personal, non-managed device vs. a managed device produced very different numbers with no code change. If your benchmarks aren’t reproducible across machines, this is worth checking.
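Several of these variables are easy to inspect on your own machine before comparing numbers with anyone else. A small sketch using only the Ruby standard library:

```ruby
# Quick environment report before benchmarking: core count, Ruby build,
# and the compiler flags Ruby was built with (all from the standard library).
require "etc"
require "rbconfig"

puts "Cores:  #{Etc.nprocessors}"
puts "Ruby:   #{RUBY_DESCRIPTION}"
puts "CFLAGS: #{RbConfig::CONFIG["CFLAGS"]}"
```

If two machines disagree on any of these lines, their numbers aren’t directly comparable.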

We needed a way to remove as many of these variables as possible so when we made changes, we could trust our benchmarks were correct. The goal was to remove guesswork, ensure everyone is testing from the same starting point, and provide a straightforward way to run benchmarks when making changes.

What we built

It took a few weeks to get reliable results, and as part of building the benchmark we also implemented a full toolkit with scripts for installing, benchmarking, and profiling Ruby package managers.

Getting a reliable benchmark took a lot of iteration. The first version of the benchmark was basic, and every time we ran it we’d find something that wasn’t quite right. We worked with Claude to make the original benchmark and tweak it as we found issues with the runs. In some cases we had to do the tedious work of debugging the benchmark ourselves.

Back in 2018 when I was working on improving Rails integration test performance, I kept an entire repo with all my benchmarks and profile scripts so I could track changes over time, but also so I could share it with the community. When your benchmark is open source, you’re not working in a vacuum and everyone else can check your assumptions. That lesson stuck with me, and it’s why this toolkit follows a similar pattern.

The toolkit includes everything you need to benchmark and profile Bundler.

  • Setup scripts: Install scripts for both macOS and Linux that let you choose which package managers you want to benchmark.
  • Benchmarking tool: The benchmark tool uses hyperfine for statistical timing with standard deviation, min/max, and outlier detection.
    • It supports running against multiple branches and package managers, switching the Ruby version, changing the number of iterations, and provides multiple Gemfile scenarios.
    • It automatically runs both warm and cold scenarios and outputs how much faster or slower each is than the baseline.
    • It includes a fake gemserver. Thanks to Claude I was able to quickly build a fake gemserver that serves real gems, based on Aaron’s slow-gemserver, and use it to eliminate deviations caused by network round trips to rubygems.org and/or rate limiting.
  • Profiler tool for Bundler: The profiler currently only profiles Bundler, but includes everything you need to generate profiles using either samply or Vernier.
    • It supports switching the Ruby version, running with cold or warm cache mode, choosing the Gemfile scenario, and changing the output path of the profile.
    • It also can optionally use the fake gemserver to avoid profiling network time.

How we defined what to measure

As part of building this benchmark we also needed to define what we wanted to measure, so that everyone shares the same understanding of the scenarios we’re trying to improve.

Cold is defined as a first-ever install. Before each iteration, hyperfine’s --prepare hook nukes all caches: download cache, compact index cache, installed gems, bundle home, and removes the lockfile. The install has to resolve dependencies, download every gem, and install from scratch. This is the case where nothing is compiled or installed.

Warm is defined as reinstalling gems that have been downloaded previously. The benchmark setup first runs one full cold install to populate the download cache. Then for each timed iteration, the --prepare hook removes only installed gems and the .bundle directory, keeping the download cache and lockfile intact. The install runs with BUNDLE_FROZEN=1 so it skips resolution and only extracts and installs.
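The two definitions map directly onto hyperfine’s --prepare hook, which runs before each timed iteration. Here’s a minimal sketch of how the modes could be wired up; the cache paths and cleanup commands are illustrative, not the toolkit’s exact layout:

```ruby
# Hypothetical helper: build the hyperfine invocation for a given mode.
# Cold wipes everything including the lockfile; warm keeps the download
# cache and lockfile and runs with BUNDLE_FROZEN=1 to skip resolution.
def hyperfine_cmd(mode, cache_dir:)
  prepare =
    case mode
    when :cold
      # first-ever install: remove downloads, compact index cache,
      # installed gems, bundle home, and the lockfile
      "rm -rf #{cache_dir}/downloads #{cache_dir}/compact-index vendor .bundle Gemfile.lock"
    when :warm
      # keep the download cache and lockfile; remove only installed gems
      "rm -rf vendor .bundle"
    end
  env = mode == :warm ? "BUNDLE_FROZEN=1 " : ""
  ["hyperfine", "--prepare", prepare, "#{env}bundle install"]
end
```

The key point is that the cleanup happens before every iteration, not once per benchmark, so each timed run starts from the same state.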

Getting “warm” right was harder than it sounds. The Bundler, gel, rv, and scint package managers all have different cache structures. Early on, Bundler’s warm results were barely faster than cold, and we spent time debugging before realizing it was a cache isolation issue in the benchmark itself, not a Bundler problem. The benchmark was wrong, not the code. Interestingly, this was a case where AI wasn’t that helpful: Claude kept missing this specific environment variable, so we had to debug the hard way (this is also why the script has a BENCH_DEBUG mode). But it paid off, because we gained a better understanding of which environment variables affect the caches.

In the future we may want to define other cache scenarios to measure. There are scenarios between our pre-defined cold and warm scenarios like bundle update or having the gems installed but no lockfile so resolution needs to occur again. The beauty of this toolkit being open source is that if there’s a scenario you want to test, we can easily add that to the benchmark script.

Using the benchmark script

To support multiple tools, we implemented a --run argument specified as a LABEL:TOOL[:PATH] triple. The label isolates caches, the tool selects the package manager, and the path optionally points to a local checkout or git worktree. Multiple runs can be compared in a single invocation, with the first treated as the baseline.

ruby run_benchmark.rb \
  --run master:bundler:~/rubygems \
  --run patched:bundler:~/rubygems-patched \
  --scenario rails \
  --iterations 5 \
  --source http://localhost:9292
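Parsing that triple is simple; this hypothetical helper (not the toolkit’s actual code) shows the shape of it, splitting on at most three fields so the optional PATH stays intact:

```ruby
# Parse a --run argument of the form LABEL:TOOL[:PATH].
# PATH is optional; when present it is expanded to an absolute path.
def parse_run(arg)
  label, tool, path = arg.split(":", 3)
  raise ArgumentError, "expected LABEL:TOOL[:PATH], got #{arg.inspect}" unless label && tool
  { label: label, tool: tool, path: path && File.expand_path(path) }
end
```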

Caches are fully isolated per label under .caches/<label>/ using environment variables, so comparing two Bundler versions in the same invocation won’t contaminate results. The comparison output shows relative speed, so you can quickly see how many times faster or slower your change is than the baseline.
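Conceptually, the isolation boils down to pointing the cache-related environment variables into the label’s directory before each run. A sketch, where BUNDLE_USER_HOME is the variable discussed later in this post and GEM_HOME/GEM_PATH are standard RubyGems variables (the toolkit’s exact set may differ):

```ruby
# Build a per-label environment so two runs never share caches.
def isolated_env(label, root: File.expand_path(".caches"))
  home = File.join(root, label)
  {
    "BUNDLE_USER_HOME" => File.join(home, "bundle"), # Bundler's cache/config home
    "GEM_HOME"         => File.join(home, "gems"),   # where gems get installed
    "GEM_PATH"         => File.join(home, "gems"),   # where gems get looked up
  }
end

# Kernel#system accepts an env hash as its first argument:
#   system(isolated_env("master"), "bundle", "install")
```

Missing even one of these is exactly the kind of bug that makes “warm” results lie, which is why the isolation lives in one place.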

Each scenario is just a directory containing a Gemfile. The rails scenario represents a typical Rails application with 35 gems. The large scenario is a stress test with 452 gems. You can add your own by creating a directory with a Gemfile and passing --scenario yourdir to the script.
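For example, a custom scenario could be created like this (the scenarios/ directory name and the helper are hypothetical; any directory containing a Gemfile works):

```ruby
require "fileutils"

# Create a scenario directory containing only a Gemfile.
def create_scenario(name, gemfile, base: "scenarios")
  dir = File.join(base, name)
  FileUtils.mkdir_p(dir)
  File.write(File.join(dir, "Gemfile"), gemfile)
  dir
end

create_scenario("myapp", <<~GEMFILE)
  source "https://rubygems.org"

  gem "rails"
  gem "nokogiri" # native extension, will dominate cold install time
GEMFILE
```

Then run the benchmark with --scenario myapp.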

$ ruby run_benchmark.rb --run bundler27:bundler:~/bundler27 --run master:bundler:~/rubygems/ --scenario large --iterations 3 --source http://localhost:9292 --ruby /usr/local/bin/ruby
Benchmark matrix
Ruby: ruby 4.0.1 (2026-01-13 revision e04267a14b) +PRISM [x86_64-linux]
Source: http://localhost:9292
Iterations: 3
Runs:
  bundler27: bundler (/home/ubuntu/bundler27)
  master: bundler (/home/ubuntu/rubygems)

=== Scenario: large (452 gems, bundler27) ===
  Running cold benchmark (3 runs)...
Benchmark 1: bundler27 (cold)
  Time (mean ± σ):     51.368 s ±  0.058 s    [User: 106.055 s, System: 22.181 s]
  Range (min … max):   51.332 s … 51.435 s    3 runs

  Running warm benchmark (3 runs)...
Benchmark 1: bundler27 (warm)
  Time (mean ± σ):      8.895 s ±  0.150 s    [User: 7.743 s, System: 4.658 s]
  Range (min … max):    8.742 s …  9.042 s    3 runs

  Cold median: 51.34s  Warm median: 8.9s

=== Scenario: large (452 gems, master) ===
  Running cold benchmark (3 runs)...
Benchmark 1: master (cold)
  Time (mean ± σ):     16.012 s ±  0.023 s    [User: 107.344 s, System: 21.568 s]
  Range (min … max):   15.987 s … 16.033 s    3 runs

  Running warm benchmark (3 runs)...
Benchmark 1: master (warm)
  Time (mean ± σ):      7.202 s ±  0.027 s    [User: 4.908 s, System: 3.061 s]
  Range (min … max):    7.173 s …  7.228 s    3 runs

  Cold median: 16.02s  Warm median: 7.2s

Results written to /home/ubuntu/bundler-bench/results/bundler27_20260318_161305.json
Results written to /home/ubuntu/bundler-bench/results/master_20260318_161305.json

=== Comparison Summary ===

Scenario: large (452 gems)
                             Cold     +/-                        Warm     +/-
  ------------------------------------------------------------------------------
  bundler27                51.34s   0.06s  baseline             8.90s   0.15s  baseline
  master                   16.02s   0.02s  3.21x faster         7.20s   0.03s  1.24x faster

Note that these numbers will vary across macOS and Linux, as well as on machines with endpoint security software. While we aimed to reduce as many variables as possible, you still may not see the same absolute numbers; however, the speedup should land between 2-3.5x for cold and 1-1.5x for warm bundle installs. This script was run on an AWS sandbox, so no other traffic or endpoint security software altered the numbers. It also used the fake gemserver, so network round trips aren’t involved.

Using the profiling script

Benchmarks tell you whether something got faster. Profiles tell you why it’s slow. The profiling tool runs a single bundle install under Vernier or samply to produce flamegraphs.

ruby profile_bundler.rb \
  --run master:bundler:~/rubygems \
  --scenario rails \
  --mode warm \
  --profiler vernier

It supports both cold and warm modes so you can profile the specific phase you’re investigating. Profiles are written to profiles/ with filenames that include the label, scenario, mode, platform, and timestamp so you can compare across runs and machines.
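A sketch of that naming scheme (the separator, timestamp format, and file extension here are assumptions, not the toolkit’s exact output):

```ruby
require "rbconfig"

# Build a profile filename encoding label, scenario, mode, platform, and time,
# so profiles from different runs and machines never collide.
def profile_path(label:, scenario:, mode:, dir: "profiles")
  platform = RbConfig::CONFIG["host_os"]       # e.g. "linux" or "darwin23"
  stamp    = Time.now.strftime("%Y%m%d_%H%M%S")
  File.join(dir, "#{label}_#{scenario}_#{mode}_#{platform}_#{stamp}.json")
end
```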

Here’s an example of the Vernier output for the master branch on the AWS Linux sandbox in cold cache mode.

Vernier flamegraph for cold bundle install on Linux

See all the yellow bars? Those are native extensions compiling and blocking the threads from doing other work.

Setup scripts

Reproducing someone else’s benchmark results is only possible if you’re starting from the same place. The repository includes setup scripts for both macOS and Linux which will install Ruby, hyperfine, and profiling tools, and clone the repos you need:

./setup-benchmark-mac.sh --tools bundler,bundler27,gel,scint

The --tools flag lets you pick which package managers or versions to install. It defaults to bundler,bundler27 so you can compare the current master against the last stable release without extra setup. We wanted anyone on the team (or in the community) to be able to spin up a fresh machine and get comparable results without hunting down the right Ruby version, compiler flags, or repo branches.

What we learned

A shared benchmark is a source of truth. When someone says “my change is faster” and someone else disagrees, you need a neutral tool that everyone agreed on beforehand. Without that, performance discussions turn into competing anecdotes. The toolkit gives us a way to settle those disagreements with data instead of intuition.

Reproducibility matters just as much. If you can’t reproduce similar results on someone else’s machine, you can’t verify the claims. “It’s faster on my machine” isn’t useful if it’s not faster for everyone, or worse, faster on Linux but slower on macOS. When results differ across machines, we can start asking why instead of arguing about whether.

Back in 2018 I gave a talk called How to Performance which was on the surface about how I sped up integration tests in Rails, but really it was a talk about how to write benchmarks you could trust so you know when you actually made something faster. Many of the lessons I learned back then came up again during this project.

Profiles and guesswork are only one part of the equation. A profile can show you a hot spot, and you can write a fix that looks faster, but without a proper benchmark you don’t actually know. You don’t know if the gain on macOS is a regression on Linux. You don’t know if “cold” got faster but “warm” got slower. You don’t know if the improvement holds across different Gemfile sizes. The benchmark is what turns a hypothesis into evidence.

This matters even more now than it did in 2018. Engineering rigor is more important in the AI world than it was before. It’s easy to generate output that looks correct or looks faster. It’s easy to make a benchmark that looks reasonable but cheats on the warm caches. AI is good at producing plausible code and plausible explanations. Humans are good at critical thinking and using our gut to know when something doesn’t look right. We have taste, discernment, and scrutiny; AI has data.

That’s not a dig on AI. I used Claude extensively throughout this project. It was great at writing the setup scripts, which are uninteresting and error prone, and it wrote the original benchmark tool. But it also got things wrong. The warm cache bug I mentioned earlier? Claude missed setting BUNDLE_USER_HOME in the environment, which meant Bundler was writing to the system bundle home instead of the isolated one. Warm caches on the master branch looked broken because they were being shared across runs. I spent time debugging Bundler before I realized the benchmark itself was wrong. Claude didn’t catch it because it doesn’t have the deep institutional knowledge of how Bundler’s cache layers interact. I caught it because I knew what the numbers should look like and they didn’t add up.

That’s not a reason to stop using AI. It’s a reminder to not outsource our thinking and to always test our assumptions. Applying engineering rigor is how we can be sure the work we’re doing, whether it’s us or AI doing it, is valid and achieves our goals.

The toolkit is available at bundler-perf-toolkit. If you’re working on Bundler performance or just curious about how your Gemfile affects install times, give it a try. We welcome PRs with new scenarios, corrections to cache handling if you spot something we got wrong, and support for other tools to test against.