How ZJIT removes redundant object loads and stores
Intro
Since the post at the end of last year, ZJIT has grown and changed in some exciting ways. This is the story of how a new, self-contained optimization pass lets ZJIT surpass YJIT on an interesting microbenchmark. It has been 10 months since ZJIT was merged into Ruby, and we’re now beginning to see the design differences between YJIT and ZJIT show up as performance divergences. In this post, we will explore the details of one new optimization in ZJIT called load-store optimization, implemented as part of ZJIT’s optimizer in HIR. Recall that the structure of ZJIT looks roughly like the following.
flowchart LR
A(["Ruby"])
A --> B(["YARV"])
B --> C(["HIR"])
C --> D(["LIR"])
D --> E(["Assembly"])
This post will focus on optimization passes in HIR, or “High-level” Intermediate Representation. At the HIR level, we have two capabilities that are distinct from other compilation stages: an SSA representation and the HIR instruction effect system. Our optimizations in HIR typically rely on both.
These are the current optimization passes in ZJIT without load-store optimization, in the order in which they are executed.
run_pass!(type_specialize);
run_pass!(inline);
run_pass!(optimize_getivar);
run_pass!(optimize_c_calls);
run_pass!(fold_constants);
run_pass!(clean_cfg);
run_pass!(remove_redundant_patch_points);
run_pass!(eliminate_dead_code);
Here’s where load-store optimization gets added.
run_pass!(type_specialize);
run_pass!(inline);
run_pass!(optimize_getivar);
run_pass!(optimize_c_calls);
+ run_pass!(optimize_load_store);
run_pass!(fold_constants);
run_pass!(clean_cfg);
run_pass!(remove_redundant_patch_points);
run_pass!(eliminate_dead_code);
Overview
Ruby is an object-oriented programming language, so CRuby needs to have some notion of object loads, modifications, and stores. In fact, this is a topic already covered by another Rails at Scale blog post. The shape system provides performance improvements in CRuby (both interpreter and JIT), but there is still plenty of opportunity to improve JIT performance. Sometimes optimizing interpreter opcodes one at a time leaves repeated loads or stores that can be cleaned up with a program analysis optimization pass. Before getting into the weeds about this pass, let’s talk performance.
Results
The setivar benchmark for ZJIT changes dramatically on
2026-03-06, which is when load-store optimization landed in ZJIT. At the time of
this writing, ZJIT takes an average of 2ms per iteration on this benchmark,
while YJIT takes an average of 5ms.

This is the second time that ZJIT has clearly surpassed YJIT. The first example is here.
At a high level, this means that ZJIT is over twice as fast as YJIT for repeated instance variable assignment, and more than 25 times faster than the interpreter!
A Troubling Development
However, there’s an important question we have to address - why should an
optimization pass for object loads and stores have anything to do with instance
variable assignment? It turns out that ZJIT’s High-level Intermediate
Representation (HIR) uses LoadField and StoreField instructions both for
object instance variables and for object shapes. We’re going to have to dig deeper
into CRuby shapes and ZJIT HIR internals in order to make sense of this.
Background
So far, we’ve learned that HIR has LoadField and StoreField instructions.
We’ve claimed that they are multi-purpose and that the performance wins come
from optimizing object shapes, but that they can also apply to object instance
variables. Because the algorithm works just as well for both situations, the
rest of this post will focus on object instance variables. This allows us to
demonstrate concepts in pure Ruby to make things more approachable.
Example
Let’s start with a simple example we can all agree on. Clearly this code
snippet has a double store, and we can safely remove one of the @a = value
assignments.
class C
def initialize
value = 1
@a = value
@a = value
end
end
Here’s the same code snippet showing the assignment we remove; the
corresponding redundant StoreField instruction has been elided.
class C
def initialize
value = 1
@a = value
- @a = value
end
end
When should we remove LoadField and StoreField instructions? The HIR code
snippets will come later. For now, we only need to know the mapping between Ruby
and HIR for instance variable loads and stores.
| Ruby | HIR |
|---|---|
| @var = value | StoreField var, @obj@offset, value |
| @var | LoadField var, @obj@offset |
Note: In a class’s initialize method, instance variable operations are likely to cause LoadField and StoreField instructions due to shape transitions. Outside of an initialize method, the loads and stores are more likely to be related to the instance variables themselves. We decided that more complicated Ruby code snippets would clarify the kind of LoadField or StoreField, but would overly clutter the code snippets in this post.
Cases
Let’s consider every edge case for our algorithm through short Ruby snippets
to illustrate scenarios where we can and cannot elide LoadField or
StoreField HIR instructions.
Note: The following examples could replace the value variable with the constant 1, but in ZJIT this could cause other optimizations such as constant folding to interfere with our load-store demonstrations. We will use these more complex code snippets in case the reader wants to follow along with a compiler explorer.
Redundant Store
class C
def initialize
value = 1
@a = value
# This store is redundant and should be elided in HIR
@a = value
end
end
Redundant Load
class C
def initialize
value = 1
@a = value
# We already know that this load is `value` and should be replaced
@a
end
end
Redundant Store with Aliasing
class C
attr_accessor :a
def initialize(value)
@a = value
end
end
class D
attr_accessor :a
def initialize(value)
@a = value
end
end
def multi_object_test
x = C.new(1)
y = D.new(1)
new_x_val = 2
new_y_val = 3
x.a = new_x_val
y.a = new_y_val
# We would like to elide this (but currently do not)
x.a = new_x_val
end
With variables pointing to distinct objects, we could elide the second store to
object x. This is not currently implemented, but is a possible improvement
with a technique called type-based alias analysis.
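As a minimal sketch of the idea, with hypothetical names (ZJIT does not expose such an API, and its real type lattice is richer than this): if the tracked types of two references are exact and name distinct classes, the references cannot point to the same object, so a store through one cannot invalidate cached fields of the other.

```ruby
# Hypothetical exact-class type attached to a tracked reference.
ExactType = Struct.new(:klass)

# Two references whose exact types name different classes can never alias;
# the same class means they might refer to the same object, so be conservative.
def may_alias?(a, b)
  a.klass == b.klass
end

x_type = ExactType.new(:C)  # x = C.new(1)
y_type = ExactType.new(:D)  # y = D.new(1)
may_alias?(x_type, y_type)  # => false, so the second store to x could be elided
```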
Required Store with Aliasing
class C
attr_accessor :a
def initialize(value)
@a = value
end
end
def multi_object_test
x = C.new(1)
y = x
new_x_val = 2
new_y_val = 3
x.a = new_x_val
y.a = new_y_val
# We should not elide the second `x.a` assignment because the `y.a` assignment modifies `x`
# The `x.a` store after this comment is no longer redundant
x.a = new_x_val
end
With multiple variables aliasing the same object, we cannot elide
the second store to x. While technically we could elide y.a = new_y_val and
the initial y = x assignment, these improvements are out of scope for this
post. The key point here is that aliasing needs to be considered. If we assume
that y and x reference different objects and elide the second
x.a = new_x_val call, we alter program behavior.
Required Store with Effects
def scary_method(obj)
obj.a = "We have modified the object. The second store is no longer redundant"
end
class C
attr_accessor :a
def initialize(value)
@a = value
end
end
def effectful_operations_between_stores_test
x = C.new(1)
x.a = 5
scary_method(x)
# We want to elide this but `scary_method` can modify `x`
x.a = 5
end
In this case, the second store looks redundant, but it might not be. An
arbitrary Ruby method (or C call, or certain HIR instructions) could modify the x
object and break the assumptions we can make about the state of the x object.
In such cases, we cannot perform load-store optimization.
The Algorithm
Key Idea
With these cases, we have covered everything needed to implement our load-store optimization algorithm. The algorithm is a lightweight abstract interpretation over objects. This approach allows us to minimize the computation required to perform our optimization pass while ensuring soundness. In layperson’s terms, this means that every load we replace and every store we eliminate will not change program behavior, but that we will potentially miss some loads or stores that could be eliminated.
Tricky Details
Basic Blocks
Our load-store optimization pass scans through basic blocks, searches for
redundant loads and stores, and updates the HIR instructions accordingly.
Unnecessary StoreField operations are elided, and unnecessary LoadField
operations are replaced with the instruction already holding the value. While
one key benefit of ZJIT is that it can optimize entire methods, load-store
optimization is (for now) block-local only.
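To illustrate the block-local limitation with a hypothetical snippet (not taken from the benchmark suite): the second store below is redundant on the taken path, but it lives in a different basic block than the first, so the current pass leaves it alone.

```ruby
class C
  def initialize(flag)
    value = 1
    @a = value
    if flag
      # Redundant when flag is truthy, but this store sits in a different
      # basic block than the store above, so the block-local pass keeps it.
      @a = value
    end
  end
end
```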
LoadField and StoreField Distinctions
So far, we’ve talked about elision and instruction removal. We can get away with
deleting StoreField instructions because no other instructions point to
StoreField instructions. Conversely, LoadField instructions do have
dependencies and are referenced by other instructions. These references need to
be fixed up. Each reference to LoadField gets replaced with the cached value
that was the target of a load.
The WriteBarrier Instruction
ZJIT has WriteBarrier instructions to support garbage collection. These also
can modify objects and act similarly to stores. We need to handle this case in
our algorithm.
Pointer Intricacies
The pseudo-code we are about to introduce uses the term “offset” to denote the number of bytes from the object’s base address in memory. We use this to detect redundant loads and stores, as well as to clear cache entries invalidated by effectful instructions and write barriers. However, it is not immediately obvious that simply checking offsets is enough. How can we be sure that the memory regions we are tracking remain untouched by some other instruction? Fortunately, HIR instructions always point to the base of an object and use offsets that are in bounds of the object. If two offsets are not equal, they cannot reference the same region of memory. If the offsets are equal, then object aliasing must be considered.
Algorithm Sketch
Here’s the pseudo-code for a given basic block.
initialize an empty cache as a hashmap
for each HIR instruction in the basic block
    if instruction is `LoadField`
        check if the (object, offset) pair is in the cache
        if so, delete the instruction and replace references to it with the cached value
        else, cache the loaded value with the (object, offset) pair as a key
    if instruction is `StoreField`
        check if the (object, offset, value) triple is in the cache
        if so, delete the instruction
        else, remove each cache entry with the same offset to avoid aliasing issues,
            then cache the stored value
    if instruction is `WriteBarrier`
        # This instruction is needed for the garbage collector and is complex
        # It works similarly to `StoreField` in practice
        # This instruction is never removed but the cache cleaning is still needed
        remove each cache entry with the same offset to avoid aliasing issues
    if instruction can modify objects
        flush the cache
    else
        continue
return the pruned HIR instructions
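As a rough illustration (not ZJIT’s actual Rust implementation; the Insn struct, opcode symbols, and field names here are invented for this sketch), the pseudo-code can be expressed as a block-local pass over a toy instruction stream in plain Ruby:

```ruby
# Toy HIR instruction: op is :load_field, :store_field, :write_barrier,
# or anything else (treated as potentially effectful, e.g. a call).
Insn = Struct.new(:op, :obj, :offset, :value)

# Returns [kept_instructions, replacements], where replacements maps each
# elided LoadField to the cached value its uses should be rewritten to.
def optimize_load_store(insns)
  cache = {}          # [obj, offset] => last known value in that field
  replacements = {}   # elided LoadField => value to use instead
  kept = []

  insns.each do |insn|
    case insn.op
    when :load_field
      key = [insn.obj, insn.offset]
      if cache.key?(key)
        replacements[insn] = cache[key]   # redundant load: reuse cached value
      else
        cache[key] = insn.value           # remember what this load produced
        kept << insn
      end
    when :store_field
      key = [insn.obj, insn.offset]
      if cache[key] == insn.value
        # Redundant store: the field is already known to hold this value.
      else
        # Conservatively invalidate entries at the same offset (aliasing),
        # then record the newly stored value.
        cache.delete_if { |(_obj, off), _v| off == insn.offset }
        cache[key] = insn.value
        kept << insn
      end
    when :write_barrier
      # Needed for GC; never removed, but it cleans the cache like a store.
      cache.delete_if { |(_obj, off), _v| off == insn.offset }
      kept << insn
    else
      cache.clear                         # anything effectful flushes the cache
      kept << insn
    end
  end

  [kept, replacements]
end
```

Running this over a store, a duplicate store, and a load of the same field keeps only the first store and rewrites the load to the cached value; inserting an effectful instruction between the two stores forces both to be kept.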
Source Code
The source at the time of this writing can be found here.
HIR Improvements
After the optimization, here are examples of how the HIR changes.
This is the new HIR for our first redundant load example.
fn initialize@../scripts/double_load.rb:3:
bb1():
EntryPoint interpreter
v1:BasicObject = LoadSelf
v2:NilClass = Const Value(nil)
Jump bb3(v1, v2)
bb2():
EntryPoint JIT(0)
v5:BasicObject = LoadArg :self@0
v6:NilClass = Const Value(nil)
Jump bb3(v5, v6)
bb3(v8:BasicObject, v9:NilClass):
v13:Fixnum[1] = Const Value(1)
PatchPoint SingleRactorMode
v30:HeapBasicObject = GuardType v8, HeapBasicObject
v31:CShape = LoadField v30, :_shape_id@0x4
v32:CShape[0x80000] = GuardBitEquals v31, CShape(0x80000)
StoreField v30, :@a@0x10, v13
WriteBarrier v30, v13
v35:CShape[0x80008] = Const CShape(0x80008)
StoreField v30, :_shape_id@0x4, v35
- v20:HeapBasicObject = RefineType v8, HeapBasicObject
PatchPoint SingleRactorMode
- v38:CShape = LoadField v20, :_shape_id@0x4
- v39:CShape[0x80008] = GuardBitEquals v38, CShape(0x80008)
- v40:BasicObject = LoadField v20, :@a@0x10
CheckInterrupts
- Return v40
+ Return v13
This is the new HIR for our first redundant store example.
bb1():
EntryPoint interpreter
v1:BasicObject = LoadSelf
v2:NilClass = Const Value(nil)
Jump bb3(v1, v2)
bb2():
EntryPoint JIT(0)
v5:BasicObject = LoadArg :self@0
v6:NilClass = Const Value(nil)
Jump bb3(v5, v6)
bb3(v8:BasicObject, v9:NilClass):
v13:Fixnum[1] = Const Value(1)
PatchPoint SingleRactorMode
v35:HeapBasicObject = GuardType v8, HeapBasicObject
v36:CShape = LoadField v35, :_shape_id@0x4
v37:CShape[0x80000] = GuardBitEquals v36, CShape(0x80000)
StoreField v35, :@a@0x10, v13
WriteBarrier v35, v13
v40:CShape[0x80008] = Const CShape(0x80008)
StoreField v35, :_shape_id@0x4, v40
v20:HeapBasicObject = RefineType v8, HeapBasicObject
PatchPoint NoEPEscape(initialize)
PatchPoint SingleRactorMode
- v43:CShape = LoadField v20, :_shape_id@0x4
- v44:CShape[0x80008] = GuardBitEquals v43, CShape(0x80008)
- StoreField v20, :@a@0x10, v13
WriteBarrier v20, v13
CheckInterrupts
Return v13
And that’s load-store optimization!
Design Discussion
You may notice that our optimization is pruning the graph of loads and stores on an object. We are solving a very similar problem to the SSA form baked into the HIR. While it would be great to have “more SSA” at the object level, this comes at a cost. Computing SSA at this level could necessitate structural changes to HIR and make things less ergonomic or more confusing in regions of the codebase outside of load-store optimization. In fact, this question of “more SSA” is a complex design decision and contentious topic with a rich history in compilers such as V8 or Jikes RVM. So far, we’ve decided to use a lightweight SSA representation in ZJIT that causes us to work a bit harder for certain optimization passes, yielding subtle design simplifications across the rest of HIR.
Future Work
There’s still a lot of exciting work to be done, and there are improvements to be made before we hit diminishing returns. Dead store elimination utilizes many of the same ideas and could help improve object initialization performance. We could implement type-based alias analysis, though this requires care, as type confusion bugs are quite dangerous in JIT compilers. See section 4.1 in the Phrack article for further details.
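For instance, in an illustrative snippet like the following, the first store is dead: it is overwritten before any read, so dead store elimination could remove it entirely rather than merely deduplicating it.

```ruby
class C
  def initialize
    value = 1
    # Dead store: @a is never read before the assignment below overwrites it,
    # so this store could be eliminated outright.
    @a = value
    @a = value + 1
  end
end
```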
Conclusion
Thanks for reading the first post about ZJIT’s optimizer. We have lots more to come, so stay tuned.