Intro

Since the post at the end of last year, ZJIT has grown and changed in some exciting ways. This is the story of how a new, self-contained optimization pass causes ZJIT performance to surpass YJIT on an interesting microbenchmark. It has been 10 months since ZJIT was merged into Ruby, and we’re now beginning to see the design differences between YJIT and ZJIT manifest themselves in performance divergences. In this post, we will explore the details of one new optimization in ZJIT called load-store optimization. This implementation is part of ZJIT’s optimizer in HIR. Recall that the structure of ZJIT looks roughly like the following.

flowchart LR
        A(["Ruby"])
        A --> B(["YARV"])
        B --> C(["HIR"])
        C --> D(["LIR"])
        D --> E(["Assembly"])

This post will focus on optimization passes in HIR, or “High-level” Intermediate Representation. At the HIR level, we have two capabilities that the other compilation stages lack: an SSA (static single assignment) representation and the HIR instruction effect system. Our optimizations in HIR typically rely on both.

These are the current optimization passes in ZJIT without load-store optimization, listed in the order in which they are executed.

run_pass!(type_specialize);
run_pass!(inline);
run_pass!(optimize_getivar);
run_pass!(optimize_c_calls);
run_pass!(fold_constants);
run_pass!(clean_cfg);
run_pass!(remove_redundant_patch_points);
run_pass!(eliminate_dead_code);

Here’s where load-store optimization gets added.

  run_pass!(type_specialize);
  run_pass!(inline);
  run_pass!(optimize_getivar);
  run_pass!(optimize_c_calls);
+ run_pass!(optimize_load_store);
  run_pass!(fold_constants);
  run_pass!(clean_cfg);
  run_pass!(remove_redundant_patch_points);
  run_pass!(eliminate_dead_code);

Overview

Ruby is an object-oriented programming language, so CRuby needs to have some notion of object loads, modifications, and stores. In fact, this is a topic already covered by another Rails at Scale blog post. The shape system provides performance improvements in CRuby (both interpreter and JIT), but there is still plenty of opportunity to improve JIT performance. Sometimes optimizing interpreter opcodes one at a time leaves repeated loads or stores that can be cleaned up with a program analysis optimization pass. Before getting into the weeds about this pass, let’s talk performance.

Results

The setivar benchmark results for ZJIT changed dramatically on 2026-03-06, when load-store optimization landed in ZJIT. At the time of this writing, ZJIT takes an average of 2ms per iteration on this benchmark, while YJIT takes an average of 5ms.

The moment load-store optimization was added and ZJIT (yellow) overtook YJIT (green)

This is the second time that ZJIT has clearly surpassed YJIT. The first example is here.

At a high level, this means that ZJIT is over twice as fast as YJIT for repeated instance variable assignment, and more than 25 times faster than the interpreter!

A Troubling Development

However, there’s an important question we have to address: why should an optimization pass for object loads and stores have anything to do with instance variable assignment? It turns out that ZJIT’s High-level Intermediate Representation (HIR) uses LoadField and StoreField instructions both for object instance variables and for object shapes. We’re going to have to dig deeper into CRuby shapes and ZJIT HIR internals in order to make sense of this.

Background

So far, we’ve learned that HIR has LoadField and StoreField instructions. We’ve claimed that they are multi-purpose and that the performance wins come from optimizing object shapes, but that they can also apply to object instance variables. Because the algorithm works just as well for both situations, the rest of this post will focus on object instance variables. This allows us to demonstrate concepts in pure Ruby to make things more approachable.

Example

Let’s start with a simple example we can all agree on. Clearly this code snippet has a double store, and we can safely remove one of the @a = value assignments.

class C
  def initialize
    value = 1
    @a = value
    @a = value
  end
end

Here’s the same code snippet with an example of the call we remove. Here, we have elided a redundant StoreField instruction.

  class C
    def initialize
      value = 1
      @a = value
-     @a = value
    end
  end

When should we remove LoadField and StoreField instructions? The HIR code snippets will come later. For now, we only need to know the mapping between Ruby and HIR for instance variable loads and stores.

Ruby            HIR
@var = value    StoreField var, @obj@offset, value
@var            LoadField var, @obj@offset

Note: In a class’s initialize method, instance variable operations are likely to cause LoadField and StoreField instructions due to shape transitions. Outside of an initialize method, the loads and stores are more likely to be related to the instance variables themselves. We decided that more complicated Ruby code snippets would clarify the kind of LoadField or StoreField but overly clutter the code snippets in this post.

Cases

Let’s consider every edge case for our algorithm through short Ruby snippets to illustrate scenarios where we can and cannot elide LoadField or StoreField HIR instructions.

Note: The following examples could replace the value variable with the constant 1, but in ZJIT this could cause other optimizations such as constant folding to interfere with our load-store demonstrations. We will use these more complex code snippets in case the reader wants to follow along with a compiler explorer.

Redundant Store

class C
  def initialize
    value = 1
    @a = value
    # This store is redundant and should be elided in HIR
    @a = value
  end
end

Redundant Load

class C
  def initialize
    value = 1
    @a = value
    # We already know that this load is `value` and should be replaced
    @a
  end
end

Redundant Store with Aliasing

class C
  attr_accessor :a

  def initialize(value)
    @a = value
  end
end

class D
  attr_accessor :a

  def initialize(value)
    @a = value
  end
end

def multi_object_test
  x = C.new(1)
  y = D.new(1)
  new_x_val = 2
  new_y_val = 3
  x.a = new_x_val
  y.a = new_y_val
  # We would like to elide this (but currently do not)
  x.a = new_x_val
end

With variables pointing to distinct objects, we could elide the second store to object x. This is not currently implemented, but is a possible improvement with a technique called type-based alias analysis.

Required Store with Aliasing

class C
  attr_accessor :a

  def initialize(value)
    @a = value
  end
end

def multi_object_test
  x = C.new(1)
  y = x
  new_x_val = 2
  new_y_val = 3
  x.a = new_x_val
  y.a = new_y_val
  # We should not elide the second `x.a` assignment because the `y.a` assignment modifies `x`
  # The `x.a` store after this comment is no longer redundant
  x.a = new_x_val
end

With multiple variables aliasing the same object, we cannot elide the second store to x. While technically we could elide y.a = new_y_val and the initial y = x assignment, these improvements are out of scope for this post. The key point here is that aliasing needs to be considered. If we assume that y and x reference different objects and elide the second x.a = new_x_val assignment, we alter program behavior.
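
To make the behavior change concrete, here is a small pure-Ruby sketch. The class name `Box` and both method names are hypothetical stand-ins chosen for this illustration; the second method simulates by hand what an unsound elision would produce.

```ruby
# Box is a hypothetical stand-in for class C in the example above.
class Box
  attr_accessor :a

  def initialize(value)
    @a = value
  end
end

# The program as written: the final store puts 2 back into the shared object.
def aliased_keep_second_store
  x = Box.new(1)
  y = x        # y and x now reference the same object
  x.a = 2
  y.a = 3      # this also changes what x.a returns
  x.a = 2
  x.a
end

# The same program with the second `x.a = 2` removed by hand,
# as if an optimizer had wrongly assumed that x and y are distinct.
def aliased_drop_second_store
  x = Box.new(1)
  y = x
  x.a = 2
  y.a = 3
  x.a
end
```

The first method returns 2 and the second returns 3, so dropping the store is observable.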

Required Store with Effects

def scary_method(obj)
  obj.a = "We have modified the object. The second store is no longer redundant"
end

class C
  attr_accessor :a

  def initialize(value)
    @a = value
  end
end

def effectful_operations_between_stores_test
  x = C.new(1)
  x.a = 5
  scary_method(x)
  # We want to elide this but `scary_method` can modify `x`
  x.a = 5
end

In this case, the second store looks redundant, but it might not be. An arbitrary Ruby method (or C call, or some HIR instructions) could modify the x object and break the assumptions we can make about its state. In such cases, we cannot perform load-store optimization.
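
The same kind of hand-written comparison works for effects. In this sketch, `Cell` and the two `effect_*` method names are hypothetical; `scary_method` mirrors the one above with a shorter string.

```ruby
# Cell is a hypothetical stand-in for class C in the example above.
class Cell
  attr_accessor :a

  def initialize(value)
    @a = value
  end
end

# An arbitrary method that happens to write to the object it receives.
def scary_method(obj)
  obj.a = "modified"
end

# The program as written: the second store restores 5 after the call.
def effect_keep_second_store
  x = Cell.new(1)
  x.a = 5
  scary_method(x)
  x.a = 5
  x.a
end

# The same program with the second store removed by hand.
def effect_drop_second_store
  x = Cell.new(1)
  x.a = 5
  scary_method(x)
  x.a
end
```

Here the first method returns 5 while the second returns "modified", which is why a call that may write to objects has to invalidate what we know about x.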

The Algorithm

Key Idea

With these cases, we have covered everything needed to implement our load-store optimization algorithm. The algorithm is a lightweight abstract interpretation over objects. This approach allows us to minimize the computation required to perform our optimization pass while ensuring soundness. In layperson’s terms, this means that every load we replace and every store we eliminate will not change program behavior, but that we will potentially miss some loads or stores that could be eliminated.

Tricky Details

Basic Blocks

Our load-store optimization pass scans through basic blocks, searches for redundant loads and stores, and updates the HIR instructions accordingly. Unnecessary StoreField operations are elided, and unnecessary LoadField operations are replaced with the instruction already holding the value. While one key benefit of ZJIT is that it can optimize entire methods, load-store optimization is (for now) block-local only.

LoadField and StoreField Distinctions

So far, we’ve talked about elision and instruction removal. We can get away with deleting StoreField instructions because no other instructions point to StoreField instructions. Conversely, LoadField instructions do have dependencies and are referenced by other instructions. These references need to be fixed up. Each reference to LoadField gets replaced with the cached value that was the target of a load.
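
As a rough illustration of the fix-up step, the sketch below deletes one instruction and rewrites every operand that referred to it. The `Insn` struct and `replace_uses` helper are hypothetical names for this post and do not reflect how ZJIT stores or rewrites HIR internally.

```ruby
# Hypothetical, simplified instruction record: an SSA name, an opcode, and operands.
Insn = Struct.new(:id, :opcode, :operands)

# Remove the instruction named old_id and point all of its users at new_id.
def replace_uses(insns, old_id, new_id)
  insns
    .reject { |insn| insn.id == old_id }
    .map do |insn|
      operands = insn.operands.map { |op| op == old_id ? new_id : op }
      Insn.new(insn.id, insn.opcode, operands)
    end
end

# Usage, loosely modeled on a redundant load that feeds a Return.
block = [
  Insn.new(nil,  :StoreField, [:v30, :v13]),
  Insn.new(:v40, :LoadField,  [:v30]),
  Insn.new(nil,  :Return,     [:v40]),
]
rewritten = replace_uses(block, :v40, :v13)
```

After the call, the LoadField is gone and the Return reads `:v13` directly. A StoreField has no users, so removing it needs only the `reject` half of this helper.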

The WriteBarrier Instruction

ZJIT has WriteBarrier instructions to support garbage collection. These also can modify objects and act similarly to stores. We need to handle this case in our algorithm.

Pointer Intricacies

The pseudo-code we are about to introduce uses the term “offset” to denote the number of bytes from the object’s base address in memory. We use offsets to detect redundant loads and stores, and to decide which cache entries to clear when we encounter effectful instructions and write barriers. However, it is not immediately obvious that simply checking offsets would be enough. How can we be sure that the memory regions we are tracking remain untouched by some other instruction? Fortunately, HIR instructions always point to the base of an object and use offsets that are in bounds of the object. If we have two offsets that are not equal, they cannot reference the same region of memory. If the offsets are equal, then object aliasing must be considered.
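
The offset reasoning above can be written as a three-way decision. This is a minimal sketch: `alias_relation` and its return symbols are hypothetical names, and it assumes, as stated above, that every access is expressed as an object base plus an in-bounds offset.

```ruby
# Classify two field accesses, each given as (object, offset).
# Objects are compared by their HIR names here, so two different names
# at the same offset are treated conservatively as possibly the same object.
def alias_relation(obj_a, offset_a, obj_b, offset_b)
  # Different offsets can never touch the same field.
  return :no_alias if offset_a != offset_b
  # Same object and same offset: definitely the same field.
  return :must_alias if obj_a == obj_b
  # Same offset, but we cannot prove the objects are distinct.
  :may_alias
end
```

Only the `:must_alias` case lets us reuse a cached value. The `:may_alias` case is the reason a store clears every cache entry that shares its offset.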

Algorithm Sketch

Here’s the pseudo-code for a given basic block.

initialize an empty cache as a hashmap

For each HIR instruction in the basic block
    if instruction is `LoadField`
        check if the object, offset pair is in the cache
        if so, delete instruction and replace references to it with the cached value
        else, cache the loaded value with the object, offset pair as a key
        
    if instruction is `StoreField`
        check if the object, offset, and value triple is in the cache
        if so, delete the instruction
        else, remove each cache entry with the same offset (the flags field) to avoid aliasing issues,
              then cache the stored value with the object, offset pair as a key
        
    if instruction is `WriteBarrier`
        # This instruction is needed for the garbage collector and is complex
        # It works similarly to `StoreField` in practice
        # This instruction is never removed but the cache cleaning is still needed
        remove each cache entry with the same offset to avoid aliasing issues
        
    if instruction can modify objects
        flush the cache
    else
        continue
          
return the pruned HIR instructions
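
For readers who prefer something executable, here is a minimal Ruby sketch of the same block-local idea. It is not ZJIT's implementation (which is written in Rust): the `Load`, `Store`, `Effect`, and `Use` structs and the `optimize_load_store` method body are hypothetical simplifications. The sketch leaves out `WriteBarrier` handling, and it records the value written by each kept store so that a later load of the same field can be forwarded, as in the Redundant Load case.

```ruby
# Hypothetical, simplified instruction records (not ZJIT's real data structures).
Load   = Struct.new(:id, :obj, :offset)     # id = LoadField obj, offset
Store  = Struct.new(:obj, :offset, :value)  # StoreField obj, offset, value
Effect = Struct.new(:name)                  # any instruction that may modify objects
Use    = Struct.new(:name, :operand)        # any other instruction that reads a value

# Block-local pass over one basic block; returns the pruned instruction list.
def optimize_load_store(block)
  cache = {}     # [obj, offset] => value known to live in that field
  replaced = {}  # id of a deleted load => value that stands in for it
  resolve = ->(v) { replaced.fetch(v, v) }
  pruned = []

  block.each do |insn|
    case insn
    when Load
      key = [resolve.(insn.obj), insn.offset]
      if cache.key?(key)
        # Redundant load: drop it and remember what later uses should read.
        replaced[insn.id] = cache[key]
      else
        cache[key] = insn.id
        pruned << Load.new(insn.id, key[0], insn.offset)
      end
    when Store
      key = [resolve.(insn.obj), insn.offset]
      value = resolve.(insn.value)
      # Redundant store: the field already holds exactly this value.
      next if cache.key?(key) && cache[key] == value
      # Another name at the same offset might alias this object, so forget it.
      cache.delete_if { |cached_key, _value| cached_key[1] == insn.offset }
      cache[key] = value
      pruned << Store.new(key[0], insn.offset, value)
    when Effect
      # Anything could have been written, so nothing cached can be trusted.
      cache.clear
      pruned << insn
    when Use
      pruned << Use.new(insn.name, resolve.(insn.operand))
    end
  end
  pruned
end

# Usage: a double store followed by a load of the same field.
example = [
  Store.new(:v30, 0x10, :v13),
  Store.new(:v30, 0x10, :v13),
  Load.new(:v40, :v30, 0x10),
  Use.new(:Return, :v40),
]
optimized = optimize_load_store(example)
```

In this example the second store and the load are both removed, and the `Return` reads `:v13` directly. Keying the cache on the object and offset pair keeps lookups cheap, while clearing by offset on every store is the conservative answer to aliasing.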

Source Code

The source at the time of this writing can be found here.

HIR Improvements

After the optimization, here are examples of how the HIR changes.

This is the new HIR for our first redundant load example.

  fn initialize@../scripts/double_load.rb:3:
  bb1():
    EntryPoint interpreter
    v1:BasicObject = LoadSelf
    v2:NilClass = Const Value(nil)
    Jump bb3(v1, v2)
  bb2():
    EntryPoint JIT(0)
    v5:BasicObject = LoadArg :self@0
    v6:NilClass = Const Value(nil)
    Jump bb3(v5, v6)
  bb3(v8:BasicObject, v9:NilClass):
    v13:Fixnum[1] = Const Value(1)
    PatchPoint SingleRactorMode
    v30:HeapBasicObject = GuardType v8, HeapBasicObject
    v31:CShape = LoadField v30, :_shape_id@0x4
    v32:CShape[0x80000] = GuardBitEquals v31, CShape(0x80000)
    StoreField v30, :@a@0x10, v13
    WriteBarrier v30, v13
    v35:CShape[0x80008] = Const CShape(0x80008)
    StoreField v30, :_shape_id@0x4, v35
-   v20:HeapBasicObject = RefineType v8, HeapBasicObject
    PatchPoint SingleRactorMode
-   v38:CShape = LoadField v20, :_shape_id@0x4
-   v39:CShape[0x80008] = GuardBitEquals v38, CShape(0x80008)
-   v40:BasicObject = LoadField v20, :@a@0x10
    CheckInterrupts
-   Return v40
+   Return v13

This is the new HIR for our first redundant store example.

bb1():
  EntryPoint interpreter
  v1:BasicObject = LoadSelf
  v2:NilClass = Const Value(nil)
  Jump bb3(v1, v2)
bb2():
  EntryPoint JIT(0)
  v5:BasicObject = LoadArg :self@0
  v6:NilClass = Const Value(nil)
  Jump bb3(v5, v6)
bb3(v8:BasicObject, v9:NilClass):
  v13:Fixnum[1] = Const Value(1)
  PatchPoint SingleRactorMode
  v35:HeapBasicObject = GuardType v8, HeapBasicObject
  v36:CShape = LoadField v35, :_shape_id@0x4
  v37:CShape[0x80000] = GuardBitEquals v36, CShape(0x80000)
  StoreField v35, :@a@0x10, v13
  WriteBarrier v35, v13
  v40:CShape[0x80008] = Const CShape(0x80008)
  StoreField v35, :_shape_id@0x4, v40
  v20:HeapBasicObject = RefineType v8, HeapBasicObject
  PatchPoint NoEPEscape(initialize)
  PatchPoint SingleRactorMode
- v43:CShape = LoadField v20, :_shape_id@0x4
- v44:CShape[0x80008] = GuardBitEquals v43, CShape(0x80008)
- StoreField v20, :@a@0x10, v13
  WriteBarrier v20, v13
  CheckInterrupts
  Return v13

And that’s load-store optimization!

Design Discussion

You may notice that our optimization is pruning the graph of loads and stores on an object. We are solving a very similar problem to the SSA form baked into the HIR. While it would be great to have “more SSA” at the object level, this comes at a cost. Computing SSA at this level could necessitate structural changes to HIR and make things less ergonomic or more confusing in regions of the codebase outside of load-store optimization. In fact, this question of “more SSA” is a complex design decision and contentious topic with a rich history in compilers such as V8 or Jikes RVM. So far, we’ve decided to use a lightweight SSA representation in ZJIT that causes us to work a bit harder for certain optimization passes, yielding subtle design simplifications across the rest of HIR.

Future Work

There’s still a lot of exciting work to be done, and there are improvements to be made before we hit diminishing returns. Dead store elimination utilizes many of the same ideas and could help improve object initialization performance. We could implement type-based alias analysis, though this requires care, as type confusion bugs are quite dangerous in JIT compilers. See section 4.1 in the phrack article for further details.

Conclusion

Thanks for reading the first post about ZJIT’s optimizer. We have lots more to come, so stay tuned.