Performance impact of the memoization idiom on modern Ruby
One major internal change in Ruby 3.2 was the introduction of object shapes.
In this post, we’ll try to cover why they were introduced, how they work, and what their limitations are.
How object instance variables are stored
Because of how dynamic Ruby is, an operation as simple as accessing an instance variable involves a lot of work.
In most cases, Ruby objects store their instance variables in an array of references.
For instance, let’s create a simple object with 2 instance variables:
class User
  def initialize(name, admin)
    @name = name
    @admin = admin
  end

  def admin?
    @admin
  end
end
User.new("John", true)
In MRI, this object will occupy a 40-byte object slot that is laid out like this:
+--------+----------+---------+---------+---------+
| flags | class | ivar0 | ivar1 | ivar2 |
+--------+----------+---------+---------+---------+
| 0x01 | User | "John" | true | UNDEF |
+--------+----------+---------+---------+---------+
Each cell is 8 bytes large. The first cell holds the object’s flags; they are used for many things, all of which are outside the scope of this post. The second cell stores a pointer to the object’s class, and the last 3 cells are used to store the object’s instance variables. In this case, since the object only has 2 instance variables, the third slot is not used.
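If you want to check this yourself, the objspace extension’s ObjectSpace.memsize_of reports the size of the underlying slot. On a typical 64-bit build of CRuby a small object like this one should report 40 bytes, though the exact figure is an implementation detail and may differ between builds:

require "objspace"

# `user` is the User instance created above.
# 40 bytes: 8 for the flags, 8 for the class pointer, and 3 x 8 for the
# embedded instance variable cells. The exact value is an implementation detail.
p ObjectSpace.memsize_of(user) # => 40 on a typical 64-bit CRuby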
Now when Ruby executes code that accesses an instance variable, it needs to be able to translate the variable name into an index.
In Ruby 3.1 and older, each class would keep a Hash that maps instance variable names to indexes; in pseudo-Ruby, it worked like this:
class Class
  def initialize
    @instance_variables_index = {}
  end

  def ivar_index(ivar_name)
    @instance_variables_index[ivar_name] ||= @instance_variables_index.size
  end
end

class Object
  def initialize
    @ivars = []
  end

  def instance_variable_get(ivar_name)
    index = self.class.ivar_index(ivar_name)
    value = @ivars[index]
    return value == UNDEF ? nil : value
  end

  def instance_variable_set(ivar_name, value)
    index = self.class.ivar_index(ivar_name)
    @ivars[index] = value
  end
end
user.instance_variable_get(:@admin) # => true
Inline Caches
As you can imagine, having to do a Hash lookup for every instance variable access is pretty slow.
To avoid repeating this work, the Ruby virtual machine uses what’s called inline caches: small memory regions inside the VM bytecode that can store the result of this kind of computation so it doesn’t have to be redone on every execution.
In pseudo-Ruby, it looks like this:
def ruby_vm_get_instance_variable(object, ivar_name, cache)
  if cache.last_class != object.class
    cache.last_class = object.class
    cache.index = object.class.ivar_index(ivar_name)
  end

  value = object.ivars[cache.index]
  return value == UNDEF ? nil : value
end
With such inline caches, if a method only ever deals with a single class, it only has to look up the instance variable index once. On any future execution, all it has to do is compare the cached class with the current object’s class, and if it matches it can directly re-use the cached index. This is much, much faster than having to look up the index.
Inline Caches and Polymorphism
The problem with such an approach though, is that it doesn’t perform well in the presence of simple polymorphism:
require "benchmark/ips"
class Parent
def name
@name
end
def initialize(name)
@name = name
end
end
class Subclass < Parent
end
stable_1 = Parent.new("Odile Deray")
stable_2 = Parent.new("Simon Jérémi")
unstable_1 = Parent.new("Peter")
unstable_2 = Subclass.new("Steven")
puts RUBY_DESCRIPTION
Benchmark.ips do |x|
x.report("stable class") do
stable_1.name
stable_2.name
end
x.report("unstable class") do
unstable_1.name
unstable_2.name
end
x.compare!(order: :baseline)
end
In the above benchmark, we compare the performance of a simple instance variable access on two instances of the same class against the same simple access but with two distinct classes.
ruby 3.1.4p223 (2023-03-30 revision 957bb7cb81) [arm64-darwin22]
stable class 14.222M (± 1.3%) i/s - 71.960M in 5.060612s
unstable class 11.625M (± 3.2%) i/s - 58.238M in 5.015375s
Comparison:
stable class: 14221847.8 i/s
unstable class: 11625350.9 i/s - 1.22x slower
As you can see, even though Subclass is in every shape and form identical to Parent, because the class is used as a cache key, whenever the class differs we have to go through the slow lookup again.
Introducing Object Shapes
To solve this problem (and a few others), Ruby 3.2 implemented object shapes, a technique pioneered by the Self language in the 80s.
To keep this article relatively brief, I will only offer a basic explanation of what object shapes are and how they work. If you’d like to dig deeper into the topic, I heavily recommend either reading Chris Seaton’s post about the general concept, or watching one of Jemma Issroff’s talks on how she implemented shapes in CRuby.
So without going into too much detail, shapes are a tree-like structure where each node represents an instance variable. For example, take the following code:
class Person
  def initialize(name)
    @name = name
  end
end

class Employee < Person
end

class Customer < Person
  def initialize(name)
    super
    @balance = 0
  end
end

person = Person.new("Peter")
employee = Employee.new("Steven")
customer = Customer.new("George")
Here the shape tree will look like this:
+----------+----------+-------+-----------+
| shape_id | ivar     | index | parent_id |
+----------+----------+-------+-----------+
| 12       | @name    | 0     | 0         |
+----------+----------+-------+-----------+
| 42       | @balance | 1     | 12        |
+----------+----------+-------+-----------+
In the above example, the person and employee instances will have the same shape (12), even though they are of different classes.
Any two objects that have the same list of instance variables defined in the same order will share a common shape.
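One way to observe this is to look at the shape_id reported by ObjectSpace.dump, assuming a CRuby 3.2+ build where the heap dump includes that field: person and employee should report the same shape_id despite being of different classes, while customer should report a different one.

require "objspace"
require "json"

# The exact ids depend on what ran before, but the first two should match.
p JSON.parse(ObjectSpace.dump(person))["shape_id"]   # => e.g. 12
p JSON.parse(ObjectSpace.dump(employee))["shape_id"] # => e.g. 12
p JSON.parse(ObjectSpace.dump(customer))["shape_id"] # => e.g. 42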
Thanks to this, the instance variable access code was updated to use the shape as a cache key:
def ruby_vm_get_instance_variable(object, ivar_name, cache)
  if cache.last_shape_id != object.shape_id
    cache.last_shape_id = object.shape_id
    cache.index = RubyVM.ivar_index(object.shape_id, ivar_name)
  end

  if cache.index == IVAR_NOT_SET
    return nil
  else
    return object.ivars[cache.index]
  end
end
Running the same benchmark on Ruby 3.2 confirms that now the two cases have identical performance:
ruby 3.2.2 (2023-03-30 revision e51014f9c0) [arm64-darwin22]
stable class 13.994M (± 1.2%) i/s - 70.504M in 5.038920s
unstable class 14.122M (± 1.5%) i/s - 71.367M in 5.054938s
Comparison:
stable class: 13993825.7 i/s
unstable class: 14121540.7 i/s - same-ish: difference falls within error
Thanks to shapes, the inline cache is no longer flip-flopping in the presence of simple polymorphism.
I won’t go into too much detail but note that this isn’t the only advantage of shapes.
The Memoization Idiom
However, shapes aren’t faster in all cases.
A very common idiom you have certainly encountered if you have ever worked with Ruby is the memoization pattern:
def dynamic_property
  @dynamic_property ||= something_slow_to_compute
end
Unfortunately, this idiom doesn’t play well with shapes, because now instances of the same class will frequently have different shapes, causing the inline cache hit rate to degrade.
A class with all its instance variables defined in initialize will only ever have one shape. But when extra instance variables can be defined in semi-random order, the number of shapes can quickly explode: a class with 1 memoized variable can produce 2 shapes, a class with 2 memoized variables up to 5 shapes, a class with 5 memoized variables up to 326 shapes, and so on.
Yes, this is factorial growth. Every ordered sequence in which a subset of the lazily defined variables can be set is a distinct shape, so with n such variables a class can produce up to 1 + n + n×(n-1) + ... + n! shapes, which grows roughly like e × n!.
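If you want to convince yourself of those numbers, here is a quick script that counts them: one shape per ordered sequence of a subset of the lazily defined variables (the empty sequence being the base shape).

# Counts the possible shapes for a class with `n` lazily defined instance
# variables: one per ordered sequence of a subset of them.
def possible_shapes(n)
  (0..n).sum { |k| (1..n).to_a.permutation(k).count }
end

(1..5).each do |n|
  puts "#{n} lazily defined variable(s) => up to #{possible_shapes(n)} shapes"
end
# 1 lazily defined variable(s) => up to 2 shapes
# 2 lazily defined variable(s) => up to 5 shapes
# 3 lazily defined variable(s) => up to 16 shapes
# 4 lazily defined variable(s) => up to 65 shapes
# 5 lazily defined variable(s) => up to 326 shapes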
But how bad is it? Let’s measure:
require "benchmark/ips"
class UnstableShape
attr_reader :name
def initialize(name)
@name = name
end
def first_name
@first_name ||= @name.split.first
end
def upcased_name
@upcased_name ||= @name.upcase
end
end
# Generate two objects with the same shape
stable_1 = UnstableShape.new("Odile Deray")
stable_2 = UnstableShape.new("Simon Jérémi")
stable_1.first_name
stable_1.upcased_name
stable_2.first_name
stable_2.upcased_name
# Generate two objects with distinct shapes
unstable_1 = UnstableShape.new("Peter")
unstable_2 = UnstableShape.new("Steven")
unstable_1.first_name
unstable_1.upcased_name
unstable_2.upcased_name
unstable_2.first_name
puts RUBY_DESCRIPTION
Benchmark.ips do |x|
x.report("stable shape") do
stable_1.name
stable_2.name
end
x.report("unstable shape") do
unstable_1.name
unstable_2.name
end
x.compare!(order: :baseline)
end
ruby 3.2.2 (2023-03-30 revision e51014f9c0) [arm64-darwin22]
stable shape 15.341M (± 0.9%) i/s - 76.755M in 5.003687s
unstable shape 13.259M (± 1.1%) i/s - 66.385M in 5.007561s
Comparison:
stable shape: 15340799.2 i/s
unstable shape: 13258541.5 i/s - 1.16x slower
16% slower isn’t too bad, but keep in mind that the overhead directly depends on how deep the shape is. If we perform the same benchmark but add an arbitrarily large number of variables to the shape, we can make the benchmark arbitrarily slower:
class UnstableShape
  def initialize(name)
    @name = name
    @ivar01 = true
    @ivar02 = true
    @ivar03 = true
    @ivar04 = true
    @ivar05 = true
    @ivar06 = true
    @ivar07 = true
    @ivar08 = true
    @ivar09 = true
    @ivar10 = true
    @ivar11 = true
    @ivar12 = true
    @ivar13 = true
    @ivar14 = true
    @ivar15 = true
    @ivar16 = true
    @ivar17 = true
    @ivar18 = true
    @ivar19 = true
    @ivar20 = true
    @ivar21 = true
    @ivar22 = true
    @ivar23 = true
    @ivar24 = true
    @ivar25 = true
  end
end
ruby 3.2.2 (2023-03-30 revision e51014f9c0) [arm64-darwin22]
stable shape 14.253M (± 0.9%) i/s - 71.296M in 5.002615s
unstable shape 5.821M (± 1.8%) i/s - 29.116M in 5.003786s
Comparison:
stable shape: 14253062.2 i/s
unstable shape: 5820686.1 i/s - 2.45x slower
The reason is that on a cache miss, Ruby has to walk the object’s shape tree until it finds the node it needs, so it’s an O(n) operation where n is the object’s shape depth, which is more or less its number of instance variables.
If you are curious, the function responsible for that is rb_shape_get_iv_index; in pseudo-Ruby, it looks like this:
def RubyVM.ivar_index(shape_id, ivar)
  shape = SHAPES[shape_id]
  while shape
    if shape.ivar == ivar
      return shape.index
    else
      shape = SHAPES[shape.parent_id]
    end
  end
  IVAR_NOT_SET
end
You may think that 25 instance variables is a bit over the top, but I chose that number because that is how many instance variables an ActiveRecord::Base instance has.
So this shows that using the memoization pattern in a single Rails model can significantly degrade the performance of most Active Record operations.
YJIT’s Polymorphic Caches
This unstable shape problem exists in large part because the virtual machine’s inline caches are monomorphic, meaning they can only record a single shape.
However, one of YJIT’s optimizations is that it implements polymorphic inline caches. In Ruby 3.2, YJIT’s inline caches can record up to 20 different shapes, but that’s a memory usage versus speed tradeoff, so based on production data this was reduced to 5 in Ruby 3.3.
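In the same pseudo-Ruby style as above (this is only a sketch of the idea, not how YJIT actually generates its machine code), a polymorphic cache looks roughly like this:

MAX_ENTRIES = 5 # 20 in Ruby 3.2’s YJIT, 5 in Ruby 3.3

def ruby_vm_get_instance_variable(object, ivar_name, cache)
  # Check every (shape_id, index) pair recorded in the cache.
  cache.entries.each do |cached_shape_id, cached_index|
    if cached_shape_id == object.shape_id
      return cached_index == IVAR_NOT_SET ? nil : object.ivars[cached_index]
    end
  end

  # Miss: fall back to the slow lookup, and record it if there is room left.
  index = RubyVM.ivar_index(object.shape_id, ivar_name)
  cache.entries << [object.shape_id, index] if cache.entries.size < MAX_ENTRIES
  index == IVAR_NOT_SET ? nil : object.ivars[index]
end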
Still, if we enable YJIT, we can see that we again have identical performance for this simple memoization benchmark:
ruby 3.2.2 (2023-03-30 revision e51014f9c0) +YJIT [arm64-darwin22]
stable shape 40.524M (± 1.4%) i/s - 203.140M in 5.013915s
unstable shape 40.557M (± 1.0%) i/s - 203.056M in 5.007196s
Comparison:
stable shape: 40523693.8 i/s
unstable shape: 40557046.2 i/s - same-ish: difference falls within error
But as explained before, the number of shapes can quickly explode if memoization is used extensively.
SHAPE_TOO_COMPLEX
This problem was spotted quite early during the implementation of object shapes, so a workaround was put in place to deal with pathological cases. When the virtual machine detects that a class is generating a lot of different shapes, it automatically marks that class as “too complex”, and all future instances will have a special unique shape called SHAPE_TOO_COMPLEX.
Objects with this special shape no longer store their instance variables in an internal array, but instead use a hash map. This uses more memory and makes for slower but more consistent access performance.
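Sticking with the same pseudo-Ruby model (again a sketch, not the actual CRuby code), you can picture the access path for such objects like this:

def ruby_vm_get_instance_variable(object, ivar_name, cache)
  if object.shape_id == SHAPE_TOO_COMPLEX
    # No index to cache: the variables live in a per-object hash table,
    # so every access is a hash lookup.
    return object.ivars_hash[ivar_name]
  end

  if cache.last_shape_id != object.shape_id
    cache.last_shape_id = object.shape_id
    cache.index = RubyVM.ivar_index(object.shape_id, ivar_name)
  end
  cache.index == IVAR_NOT_SET ? nil : object.ivars[cache.index]
end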
This can be demonstrated by modifying our benchmark a bit more:
require "benchmark/ips"
require "objspace"

class StableShape
  attr_reader :name

  def initialize(name)
    @name = name
  end
end

class TooComplexShape
  attr_reader :name

  def initialize(name)
    @name = name
  end
end

# Cause the class to be marked as too complex
10.times do |i|
  TooComplexShape.allocate.instance_variable_set("@iv_#{i}", i)
end

stable = StableShape.new("Test")
too_complex = TooComplexShape.new("George Abitbol")

puts ObjectSpace.dump(too_complex)
puts RUBY_DESCRIPTION

Benchmark.ips do |x|
  x.report("stable shape") do
    stable.name
  end

  x.report("too-complex shape") do
    too_complex.name
  end

  x.compare!(order: :baseline)
end
{"address":"0x104d991d0", "type":"OBJECT", ..., "too_complex_shape":true, ...}
ruby 3.2.2 (2023-03-30 revision e51014f9c0) +YJIT [arm64-darwin22]
stable shape 45.207M (± 1.0%) i/s - 226.553M in 5.011953s
too-complex shape 29.952M (± 1.1%) i/s - 150.621M in 5.029290s
Comparison:
stable shape: 45207185.7 i/s
too-complex shape: 29952498.7 i/s - 1.51x slower
So SHAPE_TOO_COMPLEX is slower than the happy path, but faster than the worst-case scenario where we have to constantly look up the variable index in the shape tree.
Common Ancestors
While working on this blog post, it occurred to me that we could make inline caches handle the memoization pattern better, without changes to any Ruby code.
On a cache miss, we start searching for the variable index from scratch. But if the inline cache is flip-flopping because of memoization, the shape that was stored in the cache and the current shape will have a common ancestor, and checking for it would allow us to stop walking the shape tree much sooner.
In the UnstableShape example, we had an object with 26 fixed instance variables and 2 memoized instance variables. The variable we’re looking for is the very first one, so on a miss we have to walk over at least 26 shapes, which is slow. But if we provide the stale shape_id, we’d only need to check at most 3 shapes, which is much faster.
C code can be a bit cryptic, so here’s the pseudo-Ruby version of it:
def RubyVM.ivar_index(shape_id, ivar, cached_shape_id, cached_index)
  shape = SHAPES[shape_id]
  while shape
    if cached_shape_id == shape.parent_id
      # The new shape is a descendant of the one we had in the cache,
      # so we know the cached index is still valid.
      return cached_index
    elsif shape.ivar == ivar
      return shape.index
    else
      shape = SHAPES[shape.parent_id]
    end
  end
  IVAR_NOT_SET
end
It’s still an O(n) operation, but now in most cases n is the number of lazily defined variables rather than the total number of variables, so n is significantly reduced.
When I shared this idea with some fellow Ruby committers, they pointed out that John Hawthorn had pretty much the same idea 6 months prior, with an even better implementation.
As I’m writing this, that patch isn’t merged yet, but I’m hoping some version of it will make it into Ruby 3.3.0 in December.
Indexing the Ancestors
The above patch would pretty much solve the performance concern of moderate users of the memoization pattern.
However, one pattern it doesn’t help with at all is code that is shared between radically different shapes:
module Synchronized
  def initialize
    @mutex = Mutex.new
  end

  def synchronized(...)
    @mutex.synchronize(...)
  end
end

class Car
  include Synchronized

  def initialize(brand, ...)
    super
    @brand = brand
    ...
  end
end

class Animal
  include Synchronized

  def initialize(breed, ...)
    super
    @breed = breed
    ...
  end
end
In the above example, the @mutex access in the synchronized method will witness radically different shapes, and the shape stored in the cache won’t even be an ancestor of the current one, so the access will remain O(n).
This issue was foreseen before the release of Ruby 3.2, but no solution could be implemented in time.
A possible solution is to keep an index of the parent shapes in the child shapes, so that on a miss we can look up the attribute index much more quickly. The downside, of course, is that such indexes would use more memory.
Aaron Patterson worked on implementing this idea and chose a red-black tree index because it makes it easy to share the parent shape’s index, keeping the memory overhead down. To further minimize the extra memory usage, his implementation only generates the index every 10 shapes, so on a miss Ruby has to walk at most 10 shapes and then do an index lookup.
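In pseudo-Ruby, and very much a sketch of the idea rather than of the actual patch, a lookup helped by such an index could look like this:

def RubyVM.ivar_index(shape_id, ivar)
  shape = SHAPES[shape_id]
  while shape
    return shape.index if shape.ivar == ivar

    # One shape out of 10 carries a red-black tree indexing all of its
    # ancestors, so instead of walking further up we do an O(log n) search.
    if shape.ancestors_index
      node = shape.ancestors_index.search(ivar)
      return node ? node.index : IVAR_NOT_SET
    end

    shape = SHAPES[shape.parent_id]
  end
  IVAR_NOT_SET
end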
This patch isn’t yet merged either, but will hopefully make it into Ruby 3.3 too.
The Importance of Shape-Friendly Code
As demonstrated in the above benchmarks, overuse of the memoization pattern can degrade the performance of your Ruby programs, so it can be effective to eagerly initialize instance variables in the constructor, at least in hot spots.
But if the pattern is used in moderation, it’s probably not a big deal and will be handled better by either YJIT or future versions of Ruby.
The one thing that is best avoided, though, is SHAPE_TOO_COMPLEX, as it is not only slower but also consumes more memory.
To help with that, starting in Ruby 3.3 a warning will be emitted when a class is marked as too complex. It is part of the performance warning category, which can be enabled by setting Warning[:performance] = true or by passing -W:performance on the command line:
$ ruby -W:performance /tmp/too-complex.rb
/tmp/too-complex.rb:19: warning: Maximum shapes variations (8) reached by TooComplexShape, instance variables accesses will be slower.
But overall, for moderate use of the pattern outside of hot spots, it’s not a major performance concern.
My rule of thumb is that one or two memoized variables in a class are fine, but more than that likely deserves a quick refactor to eagerly initialize the variables to nil in the constructor:
class StableShape
  def initialize
    @expensive_value = nil
  end

  def expensive_value
    @expensive_value ||= compute_expensive_value
  end
end
And in the case where the memoized value may be falsy, you can use a token module:
class StableShape
  NOT_SET = Module.new
  private_constant :NOT_SET

  def initialize
    @expensive_value = NOT_SET
  end

  def expensive_value
    return @expensive_value unless @expensive_value == NOT_SET

    @expensive_value = compute_expensive_value
  end
end