A new Register Allocator for ZJIT

We recently landed a new register allocator in ZJIT and I thought I’d write a post about it!

What is a register allocator?

Whenever a compiler generates machine code it needs to decide where to put values. Those values usually take the shape of a variable in your function, though the compiler can also compute intermediate values as well. When we need to perform a calculation on some value, the CPU needs to know how to find the value.

The CPU can typically only compute output based on inputs that are in registers, though some architectures (like x86) allow computations on values stored in memory. That said, reading and writing to registers is much faster than memory, so it behooves the compiler to keep values in registers as long as possible.

Any particular function in your program could have tons of variables, but the number of available registers is finite, and architecture dependent. This is where a register allocator comes in. The register allocator looks at all of the variables, then figures out which registers they should go in, and if there aren’t enough registers figures out how to “spill” those variables to memory.

How does it work?

There are several well-known approaches to register allocation, each with different trade-offs between compile time and code quality. For ZJIT, we chose to implement a linear scan register allocator based on a reduced version of Christian Wimmer’s paper titled “Linear Scan Register Allocation on SSA Form”. The paper isn’t very long so I highly recommend giving it a read when you have time! Alternatively, Max Bernstein wrote a blog post breaking down the paper as well.

ZJIT uses Static single-assignment form, or SSA form in its back end. “SSA form” is a representation of code that only allows a variable to be assigned once.

For example, this Ruby code:

a = 123
a += 1
a

Would be represented in our backend Low-level Intermediate Representation (or “LIR”) kind of like this:

v1 = Const(123)
v2 = Const(1)
v3 = Add v1, v2
CRet v3

In the above pseudocode (it’s not exactly the same as the IR we use in the backend, but quite close), v1, v2, and v3 are all different variables. No variables are allowed to be re-assigned like in the corresponding Ruby code. The generated intermediate representation has exactly the same semantics as the Ruby code, but adds the restriction that a variable can only be written to once. When ZJIT generates an intermediate representation of the Ruby code, it translates all variables to these numeric SSA variables and also uses SSA variables for temporary values. In the example translation above we can think of v1 as being equivalent to a, v2 as a temporary variable with the value 1, and v3 as a new version of a that has been added with v2.

Variables can be written to once, but you can read from the variable as many times as you want. We call the place where the variable was written the “definition” and the place where the variable is read a “use”.

Lifetimes

The first thing a register allocator needs to understand is when each value is alive and for how long. We call this duration a “lifetime” or “live range”. A value’s lifetime (or live range) starts at the point where it’s defined (the definition) and ends at the place it is last used (its “last use”). If two values have overlapping lifetimes they can’t share the same register. If there are more overlapping lifetimes than registers, then we know we need to spill a value to memory.

Since these ranges refer to the “lifetime” of the variable, it’s common to say that the variable “came to life” at its definition, and then “died” at its last use.

Let’s look at an example. Consider this Ruby method that is already in SSA form:

def add_twice(a, b, c)
  d = a + b
  e = d + c
  e
end

Here are the lifetimes for each value:

#  Instruction     |  a  |  b  |  c  |  d  |  e
-------------------+-----+-----+-----+-----+----
1  d = a + b       |  x  |  x  |  .  |  .  |
2  e = d + c       |     |     |  x  |  x  |  .
3  return e        |     |     |     |     |  x

A . character means that the variable is alive at that instruction. An x character means the variable dies, but is still used at that instruction.

We can also express these lifetimes as ranges. The live range for each value is [definition, last use]:

a: [0, 1]
b: [0, 1]
c: [0, 2]
d: [1, 2]
e: [2, 3]

Where instruction 0 represents the function entry (parameter definitions).

The parameters a, b, and c are all alive at instruction 1 because they were defined before the method body.
a and b are last used at instruction 1, so they die after that.
c stays alive through instruction 2 because that’s where it’s last used.
d is defined at instruction 1 and last used at instruction 2, so it’s alive for both.
e is defined at instruction 2 and last used at instruction 3.

At instruction 1, a and b die and d comes to life. Since we know that a and b are never used again, we’re free to reuse one of their registers for d. So we only need three registers at instruction 1: one each for a/d, b, and c. Similarly at instruction 2, c and d die as e is born, so we can reuse a register again.

Computing lifetimes involves a backward dataflow analysis over the control flow graph. We walk the instructions in reverse order, tracking which values are currently live. When we see the definition or use of a value, we know it is live at that instruction.

ZJIT has a debugging option to dump live range graphs like this:

$ ruby --zjit-call-threshold=2 --zjit-dump-lir=live_intervals ../test.rb

Here is an example of the output graph from one basic block as an excerpt from many blocks in a larger function:

          v0  v1  v2  v3  v4  v5  v6  v7  v8  v9  v10 v11
          --- --- --- --- --- --- --- --- --- --- --- ---
i0     :   .   .   .   .   .   .   .   .   .   .   .   .  FrameSetup
i2     :   v   .   .   .   .   .   .   .   .   .   .   .  v0 = Load [x21 - 0x28]
i4     :   █   v   .   .   .   .   .   .   .   .   .   .  v1 = Load [x21 - 0x20]
i6     :   █   █   .   .   .   .   .   .   .   .   .   .  Jmp bb3_l2([x19 + 0x18], v0, v1)

As mentioned earlier, in ZJIT, SSA variables names are just numbers. In the block above we have 12 variables numbered 0 through 11. These variables are listed in the first row along the top of the graph, numbered v0 through v11.

The first column lists the instruction numbers. In this case we have 4 instructions (i0 through i6), and for reasons outside the scope of this post, they’ve been numbered with even numbers.

The last column of the graph lists the actual LIR instruction. I’m not going to dive in to exactly what each of these instructions means, but you can see that some of them define variables (like i2 and i4 define v0 and v1), and some of them use variables (like i6 uses v0 and v1).

In the center of the chart, the v character indicates when a variable came to life, the solid block is when a variable is “alive”, and a ^ character indicates last use (though there are no “last uses” in this example).

Interference graphs

Once we know the lifetimes of all values, we need to figure out how many ranges overlap and where. One way to do this is to build an “interference graph”. An interference graph is a straightforward graph data structure where each node in the graph represents a live range, and each edge in the graph represents an “overlap”, or “interference”, with another live range.

Once you have an interference graph, you can treat allocation as a graph coloring problem, where each color represents a physical register available on that CPU. If we have k physical registers, we need a valid k-coloring of the interference graph. This is NP-complete in the general case, but heuristics can help simplify the problem.

While graph coloring produces excellent results, building and manipulating the interference graph can be expensive in terms of both time and memory, especially for large functions. JIT compilers should be fast, so we opted for a different algorithm: Linear Scan.

Linear Scan

As I mentioned earlier, we’re using a linear scan register allocator based on Christian Wimmer’s paper.

The algorithm is fairly straightforward. Once you’ve computed live ranges, iterate over those live ranges in order. When you get to an instruction where a live range starts, “pull” a free register from a pool, and assign the register to the live range. When you get to an instruction where a live range ends, put the register back in the pool.

If, at some point, you run out of registers in the pool, spill either the new live range, or an existing one.

Local vs global allocation

The techniques we’ve discussed so far like lifetimes, interference graphs, and linear scan can all be applied at different scopes. In the register allocator world there are two main types you’ll read about, and the difference is to do with how they process the program’s control flow graph.

First is the “local” register allocator. Every function is broken down in to multiple basic blocks, and program control flows through these blocks. A local register allocator will only ever “see” one single block at a time.

The second allocator is a “global” register allocator. A global register allocator will process the function’s entire graph at once. Confusingly, this allocator is not “global” in the sense of “global variables” in a program, but “global” in the sense that it analyzes an entire function at once rather than one basic block.

ZJIT’s previous register allocator was a local allocator inherited from YJIT. It only tracked lifetimes within a single basic block. This means that at every block boundary, all live values had to be moved to well-known locations (like the stack or fixed registers) so the next block could pick them up.

For a very lazy block-at-a-time compiler like YJIT, this is a reasonable trade-off. YJIT compiles one basic block at a time, so it never has the full picture of the function anyway. Using YJIT’s allocator helped us to bootstrap ZJIT quickly.

But ZJIT compiles entire methods, so we have all the information we need to do better.

A global allocator can keep a variable in the same register across block boundaries. If a value is defined in one block and used in a later block, the allocator can let its live range span both blocks and assign it a single register for the whole duration. This avoids unnecessary stores and loads at block boundaries, which can make a real difference in tight loops.

A global allocator also unlocks features that are difficult or impossible with a local one. For example, splitting and adding new basic blocks in various optimization passes was very tricky due to keeping track of what variables went where. Now basic blocks can be easily manipulated without worrying about which blocks use what variable. It’s also a prerequisite for method inlining, where the inlined callee’s code becomes part of the caller’s control flow graph and its values need to participate in the same allocation.

Where we are now

The new register allocator has landed and is working well. We’re now building on top of it! Method inlining is actively in progress and relies heavily on the global allocation we now have.

There are still improvements to make to the allocator itself. One big one is lifetime holes. Right now, a value’s live range is a single contiguous interval from its definition to its last use. But in practice, a value can be “dead” in the middle of its range.

For example, if a value is used in one branch of an if but not the other, the register allocator could accidentally keep the value “live” for one branch. This can be a problem because the value will end up using valuable resources (physical registers) that should be used for values that are actually alive at that time. The below Ruby program kind of visualizes this issue:

def example(cond, a, b)
  x = a + b

  if cond
    y = a * 2
    use(y)
  else
    use(x)
  end
end

Depending on how the code is linearized, the value x could be considered “alive” inside the true side of the if statement even though we can clearly see it’s not used. In other words, the register assigned to the value x will be reserved inside the true side even though it’s not actually used.

Representing these holes would let us reuse registers more aggressively in those gaps, reducing spills.

We’re excited about the foundation this gives us and looking forward to building on it!