Performance impact of the memoization idiom on modern Ruby

One major internal change in Ruby 3.2 was the introduction of object shapes.

In this post, we’ll try to cover why they were introduced, how they work, and what their limitations are.

How object instance variables are stored

Because of how dynamic Ruby is, an operation as simple as accessing an instance variable is a lot of work.

In most cases, Ruby objects store their instance variables in an array of references.

For instance, let’s create a simple object with 2 instance variables:

class User
  def initialize(name, admin)
    @name = name
    @admin = admin
  end

  def admin?
    @admin
  end
end

User.new("John", true)

In MRI, this object will occupy a 40 bytes object slot that is laid out like this:

+--------+----------+---------+---------+---------+
|  flags |  class   |  ivar0  |  ivar1  |  ivar2  |
+--------+----------+---------+---------+---------+
|  0x01  |  User    |  "John" |  true   |  UNDEF  |
+--------+----------+---------+---------+---------+

Each cell is 8 bytes large, The first cell is used for the object’s flags, it’s used for many things, but it’s all outside of the scope of this post.

The second cell stores a pointer to the object’s class, and then the last 3 cells are used to store the object’s instance variable.

In this case, since the object only has 2 instance variables, the third slot is not used.

Now when Ruby executes code that accesses an instance variable, it needs to be able to translate the variable name into an index.

In Ruby 3.1 and older, each class would keep a Hash that maps instance variable names to indexes, in pseudo-Ruby it worked like this:

class Class
  def initialize
    @instance_variables_index = {}
  end

  def ivar_index(ivar_name)
    @instance_variables_index[ivar_name] ||= @instance_variables_index.size
  end
end

class Object
  def initialize
    @ivars = []
  end

  def instance_variable_get(ivar_name)
    index = self.class.ivar_index(ivar_name)
    value = @ivars[index]
    return value == UNDEF ? nil : value
  end

  def instance_variable_set(ivar_name, value)
    index = self.class.ivar_index(ivar_name)
    @ivars[index] = value
  end
end

user.instance_variable_get(:@admin) # => true

Inline Caches

As you can imagine, having to do a Hash lookup for every instance variable access is pretty slow.

To avoid doing this every time, the Ruby Virtual Machine uses what’s called inline caches. They are small memory regions inside the VM bytecode, that can store the result of this kind of computations to hopefully avoid doing it every time.

In pseudo-Ruby, it looks like this:

def ruby_vm_get_instance_variable(object, ivar_name, cache)
  if cache.last_class != object.class
    cache.last_class = object.class
    cache.index = object.class.ivar_index(ivar_name)
  end

  value = object.ivars[cache.index]
  return value == UNDEF ? nil : value
end

With such inline caches, if a method only ever deals with a single class, it only has to look up the instance variable index once. On any future execution, all it has to do is compare the cached class with the current object class, and if it matches it can directly re-use the cached index. This is much much faster than having to lookup the index.

Inline Caches and Polymorphism

The problem with such an approach though, is that it doesn’t perform well in the presence of simple polymorphism:

require "benchmark/ips"

class Parent
  def name
    @name
  end

  def initialize(name)
    @name = name
  end
end

class Subclass < Parent
end

stable_1 = Parent.new("Odile Deray")
stable_2 = Parent.new("Simon Jérémi")

unstable_1 = Parent.new("Peter")
unstable_2 = Subclass.new("Steven")

puts RUBY_DESCRIPTION
Benchmark.ips do |x|
  x.report("stable class") do
    stable_1.name
    stable_2.name
  end

  x.report("unstable class") do
    unstable_1.name
    unstable_2.name
  end

  x.compare!(order: :baseline)
end

In the above benchmark, we compare the performance of a simple instance variable access on two instances of the same class against the same simple access but with two distinct classes.

ruby 3.1.4p223 (2023-03-30 revision 957bb7cb81) [arm64-darwin22]

        stable class     14.222M (± 1.3%) i/s -     71.960M in   5.060612s
      unstable class     11.625M (± 3.2%) i/s -     58.238M in   5.015375s

Comparison:
        stable class: 14221847.8 i/s
      unstable class: 11625350.9 i/s - 1.22x  slower

As you can see, even though Subclass is in every shape and form identical to Parent because the class is used as a cache key when it differs we have to go through the slow lookup again.

Introducing Object Shapes

To solve this problem (and a few others), Ruby 3.2 implemented object shapes, which is a technique pioneered by Smalltalk in the 80’s.

To keep this article relatively brief, I will only offer a basic explanation of what object shapes are and how they work. If you’d like to dig deeper into the topic, I heavily recommend either reading Chris Seaton’s post about the general concept, or watching one of Jemma Issroff’s talks on how she implemented shapes in CRuby.

So without going into too much detail, shapes are a tree-like structure where each node represents an instance variable, for example with the following code:

class Person
  def initialize(name)
    @name = name
  end
end

class Employee < Person
end

class Customer < Person
  def initialize(name)
    super
    @balance = 0
  end
end

person = Person.new("Peter")
employee = Employee.new("Steven")
customer = Customer.new(George)

Here the shape tree will look like this:

+----------+----------+-------+-----------+
| shape_id |   ivar   | index | parent_id |
+----------+----------+-------+-----------+
+ 12       | @name    | 0     | 0         |
+----------+----------+-------+-----------+
+ 42       | @balance | 1     | 12        |
+----------+----------+-------+-----------+

And in the above example, the user and employee instance will have the same shape (12), even though they are of different class.

Any two objects that have the same list of instance variables defined in the same order will share a common shape.

Thanks to this, the instance variable access code was updated to use the shape as a cache key:

def ruby_vm_get_instance_variable(object, ivar_name, cache)
  if cache.last_shape_id != object.shape_id
    cache.last_shape_id = object.shape_id
    cache.index = RubyVM.ivar_index(shape_id, ivar_name)
  end

  if cache.index == IVAR_NOT_SET
    return nil
  else
    return object.ivars[cache.index]
  end
end

Running the same benchmark on Ruby 3.2 confirms that now the two cases have identical performance:

ruby 3.2.2 (2023-03-30 revision e51014f9c0) [arm64-darwin22]

        stable class     13.994M (± 1.2%) i/s -     70.504M in   5.038920s
      unstable class     14.122M (± 1.5%) i/s -     71.367M in   5.054938s

Comparison:
        stable class: 13993825.7 i/s
      unstable class: 14121540.7 i/s - same-ish: difference falls within error

Thanks to shapes, the inline cache is no longer flip-flopping in the presence of simple polymorphism.

I won’t go into too much detail but note that this isn’t the only advantage of shapes.

The Memoization Idiom

However, shapes aren’t faster in all cases.

A very common idiom you have certainly encountered if you ever worked with Ruby, is the memoization pattern:

def dynamic_property
  @dynamic_property ||= something_slow_to_compute
end

Unfortunately, this idiom doesn’t play well with shapes, because now instances of the same class will frequently have different shapes, causing the inline caches hit rate to degrade.

A class with all its instance variables defined in initialize will only ever have one shape.

But when extra instance variables can be defined in semi-random order, the number of shapes can quickly explode. A class with 1 memoized variable will have 2 shapes, a class with 2 memoized variables will have 4 shapes, a class with 5 memoized variables will have up to 121 shapes, and so on.

Yes, this is a factorial growth. The number of shapes a class may have is 1 + !lazily_defined_variables.

But how bad is it? Let’s measure:

require "benchmark/ips"

class UnstableShape
  attr_reader :name

  def initialize(name)
    @name = name
  end

  def first_name
    @first_name ||= @name.split.first
  end

  def upcased_name
    @upcased_name ||= @name.upcase
  end
end

# Generate two objects with the same shape
stable_1 = UnstableShape.new("Odile Deray")
stable_2 = UnstableShape.new("Simon Jérémi")
stable_1.first_name
stable_1.upcased_name
stable_2.first_name
stable_2.upcased_name

# Generate two objects with distinct shapes
unstable_1 = UnstableShape.new("Peter")
unstable_2 = UnstableShape.new("Steven")
unstable_1.first_name
unstable_1.upcased_name
unstable_2.upcased_name
unstable_2.first_name

puts RUBY_DESCRIPTION
Benchmark.ips do |x|
  x.report("stable shape") do
    stable_1.name
    stable_2.name
  end

  x.report("unstable shape") do
    unstable_1.name
    unstable_2.name
  end

  x.compare!(order: :baseline)
end

ruby 3.2.2 (2023-03-30 revision e51014f9c0) [arm64-darwin22]

        stable shape     15.341M (± 0.9%) i/s -     76.755M in   5.003687s
      unstable shape     13.259M (± 1.1%) i/s -     66.385M in   5.007561s

Comparison:
        stable shape: 15340799.2 i/s
      unstable shape: 13258541.5 i/s - 1.16x  slower

16% slower isn’t too bad, but keep in mind that the overhead directly depends on how deep the shape is, if we perform the same benchmark, but add an arbitrarily large number of variables to the shape, we can make the benchmark arbitrarily slower:

class UnstableShape
  def initialize(name)
    @name = name

    @ivar01 = true
    @ivar02 = true
    @ivar03 = true
    @ivar04 = true
    @ivar05 = true
    @ivar06 = true
    @ivar07 = true
    @ivar08 = true
    @ivar09 = true
    @ivar10 = true
    @ivar11 = true
    @ivar12 = true
    @ivar13 = true
    @ivar14 = true
    @ivar15 = true
    @ivar16 = true
    @ivar17 = true
    @ivar18 = true
    @ivar19 = true
    @ivar20 = true
    @ivar21 = true
    @ivar22 = true
    @ivar23 = true
    @ivar24 = true
    @ivar25 = true
  end
end

ruby 3.2.2 (2023-03-30 revision e51014f9c0) [arm64-darwin22]

        stable shape     14.253M (± 0.9%) i/s -     71.296M in   5.002615s
      unstable shape      5.821M (± 1.8%) i/s -     29.116M in   5.003786s

Comparison:
        stable shape: 14253062.2 i/s
      unstable shape:  5820686.1 i/s - 2.45x  slower

The reason is that on a cache miss, Ruby has to walk the object’s shape tree until it finds the node it needs, so it’s an O(n) operation where n is the object’s shape depth which is more or less its number of instance variables.

If you are curious, the function responsible for that is rb_shape_get_iv_index, in pseudo-Ruby, it looks like:

def RubyVM.ivar_index(shape_id, ivar)
  shape = SHAPES[shape_id]
  while shape
    if shape.ivar == ivar
      return shape.index
    else
      shape = SHAPES[shape.parent_id]
    end
  end
  IVAR_NOT_SET
end

You may think that 25 instance variables is a bit over the top, but I chose that number because that is how many instance variables an ActiveRecord::Base instance has. So this shows that using the memoization pattern in a single Rails model can significantly degrade the performance of most Active Record operations.

YJIT’s Polymorphic Caches

This unstable shape problem is in large part because the virtual machine’s inline caches are monomorphic, meaning they can only record a single shape.

However one of YJIT’s optimizations is that it implements polymorphic inline caches. In Ruby 3.2, YJIT inline caches can record up to 20 different shapes. But that’s a memory usage versus speed tradeoff, so based on production data, this was reduced to 5 in Ruby 3.3.

Still, if we enable YJIT, we can see that we again have identical performance for this simple memoization benchmark:

ruby 3.2.2 (2023-03-30 revision e51014f9c0) +YJIT [arm64-darwin22]

        stable shape     40.524M (± 1.4%) i/s -    203.140M in   5.013915s
      unstable shape     40.557M (± 1.0%) i/s -    203.056M in   5.007196s

Comparison:
        stable shape: 40523693.8 i/s
      unstable shape: 40557046.2 i/s - same-ish: difference falls within error

But as explained before, the number of shapes can quickly explode if memoization is used extensively.

`SHAPE_TOO_COMPLEX`

This problem was spotted quite early during object shapes implementation, so a workaround was implemented to deal with pathological cases. When the virtual machine detects that a class is generating a lot of different shapes, it automatically marks that class as “too complex”, and all future instances will have a special unique shape called SHAPE_TOO_COMPLEX.

Objects with this special shape no longer store their instance variables in an internal array, but instead use a hash map. This uses more memory and makes for slower but more consistent access performance.

This can be demonstrated by modifying our benchmark a bit more:

class StableShape
  attr_reader :name

  def initialize(name)
    @name = name
  end
end

class TooComplexShape
  attr_reader :name

  def initialize(name)
    @name = name
  end
end

# Cause the class to be marked as too complex
10.times do |i|
  TooComplexShape.allocate.instance_variable_set("@iv_#{i}", i)
end

stable = StableShape.new("Test")
too_complex = TooComplexShape.new("George Abitbol")
puts ObjectSpace.dump(too_complex)

puts RUBY_DESCRIPTION
Benchmark.ips do |x|
  x.report("stable shape") do
    stable.name
  end

  x.report("too-complex shape") do
    too_complex.name
  end

  x.compare!(order: :baseline)
end

{"address":"0x104d991d0", "type":"OBJECT", ..., "too_complex_shape":true, ...}

ruby 3.2.2 (2023-03-30 revision e51014f9c0) +YJIT [arm64-darwin22]

        stable shape     45.207M (± 1.0%) i/s -    226.553M in   5.011953s
   too-complex shape     29.952M (± 1.1%) i/s -    150.621M in   5.029290s

Comparison:
        stable shape: 45207185.7 i/s
   too-complex shape: 29952498.7 i/s - 1.51x  slower

So SHAPE_TOO_COMPLEX is slower than the happy path, but faster than the worst-case scenario where we have to constantly lookup the variable index in the shape tree.

Common Ancestors

While working on this blog post, it occurred to me that we could make inline caches handle the memoization pattern better, without changes to any Ruby code.

On a cache miss, we start searching for the variable index from scratch. But if the inline cache is flip-flopping because of memoization, the shape that was stored in the cache and the current shape will have a common ancestor, and checking for it would allow us to stop walking the shape tree much sooner.

In the UnstableShape example, we had an object with 26 fixed instance variables, and 2 memoized instance variables. The variable we’re looking for is the very first one, so we have to walk over at least 26 shapes which is slow.

But if we provide the stale shape_id, we’d only need to check at most 3 shapes, which is much faster.

C code can be a bit cryptic, so here’s the pseudo-Ruby version of it:

def RubyVM.ivar_index(shape_id, ivar, cached_shape_id, cached_index)
  shape = SHAPES[shape_id]
  while shape
    if cached_shape_id == shape.parent_id
      # The new shape is a descendant of the one we had in the cache
      # so we know the cached index is still valid.
      return cached_index
    elsif shape.ivar == ivar
      return shape.index
    else
      shape = SHAPES[shape.parent_id]
    end
  end
  IVAR_NOT_SET
end

It’s still an O(n) operation, but now in most cases n is the number of lazily defined variables rather than the total number of variables, so n is significantly reduced.

When I shared this idea with some fellow Ruby committers, they pointed out that John Hawthorn had pretty much the same idea 6 months prior, but even better implemented.

As I’m writing this, this patch isn’t yet merged, but I’m hoping some version of it will be for Ruby 3.3.0 in December.

Indexing the Ancestors

The above patch would pretty much solve the performance concern of moderate users of the memoization pattern.

However, one pattern it doesn’t help with at all is code that is shared between radically different shapes:

module Synchronized
  def initialize
    @mutex = Mutex.new
  end

  def synchronized(...)
    @mutex.synchronize(...)
  end
end

class Car
  include Synchronized

  def initialize(brand, ...)
    super
    @brand = brand
    ...
  end
end

class Animal
  include Synchronized

  def initialize(breed, ...)
    super
    @breed = breed
    ...
  end
end

In the above example, the @mutex access in the synchronized method will witness radically different shapes, and won’t even see a common ancestor, as such the access will remain O(n).

This issue was foreseen before the release of Ruby 3.2, but no solution could be implemented in time.

A possible solution is to keep an index of the parent shapes in the child shapes so that on a miss we can more quickly lookup the attribute index. The downside of course is that such indexes would use more memory.

Aaron Patterson worked on implementing this idea, and chose to use a Red-black tree index because it allows to easily share the parent shape’s index, keeping the memory overhead down. To further minimize the extra memory usage his implementation only generates the index every 10 shapes, so on a miss Ruby would have to at most walk 10 shapes and then do an index lookup.

This patch isn’t yet merged either, but will hopefully make it into Ruby 3.3 too.

The Importance of Shape-Friendly Code

As demonstrated in the above benchmarks, over-use of the memoization pattern can cause a degradation of performance in your Ruby programs, so it can be effective to eagerly initialize instance variables in the constructor to avoid this, at least in hot spots.

But if the pattern is used moderately, it’s probably not a big deal and will be handled better by either YJIT or future versions of Ruby.

The one thing that is best avoided though, is SHAPE_TOO_COMPLEX as it is not only slower but consumes more memory. To help with that, starting in Ruby 3.3 a warning will be emitted when a class is marked as too complex, and it can be enabled by setting Warning[:performance] = true or passing -W:performance to the command line:

$ ruby -W:performance /tmp/too-complex.rb 
/tmp/too-complex.rb:19: warning: Maximum shapes variations (8) reached by TooComplexShape, instance variables accesses will be slower.

But overall, for moderate use of the pattern, outside of hotspots, it’s not a major performance concern.

My rule of thumb is that one or two memoized variables in a class are fine, but more than that likely deserve a quick refactor to eagerly initialize the variables to nil in the constructor:

class StableShape
  def initialize
    @expensive_value = nil
  end

  def expensive_value
    @expensive_value ||= compute_expensive_value
  end
end

And in the case where the memoized value may be falsy, you can use a token module:

class StableShape
  NOT_SET = Module.new
  private_constant :NOT_SET

  def initialize
    @expensive_value = NOT_SET
  end

  def expensive_value
    return @expensive_value unless @expensive_value == NOT_SET
    @expensive_value = compute_expensive_value
  end
end