A short debugging story that taught us about the attributes of IO operations.

An error

Our team is working on making TruffleRuby support one of our larger internal Ruby application. Running tests in CI we’ve noticed a flaky test failure reporting:

undefined method 'write_without_cli_ui' for #<IO:<STDOUT>> (NoMethodError)

Debugging

While this looked an obvious mistake, our initial findings were confusing. Our application in test was using the cli-ui gem, where one adjustment the gem is making is aliasing the STDOUT.write method, so it can do writes with extra features.

To exclude basic mistakes we’ve confirmed:

  • the aliased method #write_without_cli_ui is never called outside of the gem (or this context), but purely from the re-defined STDOUT.write method
  • this gem code is the only place where this specific metaprogramming has been applied
  • this code (method aliasing and redefinition) is applied once
  • there is always only one STDOUT object used (same as $stdout)

We’ve also confirmed that when the error happens, the missing method had already been defined. We looked into a few theories how a method can be undefined (other metaprogramming, serialization/deserialization), but all failed.

At this point we had exhausted all trivial culprits, and started to look into the deeper layers, by examining:

  • running process / thread changes
  • object / class / singleton class

Here we’ve found our first surprising revelation: STDOUT’s singleton class is different at the point of error, while everything else remained the same (same object in memory).

We’ve looked into the CRuby implemantation of IO, and stumbled upon IO#reopen:

RBASIC_SET_CLASS(io, rb_obj_class(nfile));

(or the equivalent code in TruffleRuby):

Primitive.vm_set_class self, Primitive.class(io)

This is a method that replaces the underlying object’s class without providing a new object. The documentation also has a slight hint to it:

Reassociates the stream with another stream, which may be of a different class. This method may be used to redirect an existing stream to a new destination.

It didn’t take long to find where this was called by using ActiveSupport::Testing::Stream#capture.

We still had trouble explaining the error. If a singleton class has been replaced, and the specific #write_without_cli_ui method is only called via the redefined #write method, which also only lived on the singleton class, how are we able to call these methods after #reopen removed the singleton class from STDOUT?

That is a little trickier and can be explained by how TruffleRuby manages singleton class lookups. It had a basic expectation that an object’s singleton class cannot be replaced. For performance reasons, TruffleRuby makes extensive use of local caches for things like method resolution. Here, TruffleRuby assumed the singleton class couldn’t change and as such, the runtime could cache the method resolution. With the cache pointing at the wrong singleton class we were receiving the wrong method reference.

The fix

Reproducing and writing a regression test case for it was also non-obvious. TruffleRuby’s inline caching works on a per call-site basis. By making the method lookup (and through that a singleton class lookup) in a method context, we make the lookup result local to that context. With this in mind the following test was able to reproduce the error:

# In TruffleRuby:

def indirection(io)
  # In-method call to have a local context for a separate inline cache.
  io.foo
end

io = File.open("/tmp/bug", "a+")
io.define_singleton_method(:foo) { "old" }
# Trigger the lookup to establish the inline cached version.
indirection(io)

io.reopen($stderr)
io.define_singleton_method(:foo) { "new" }
indirection(io)
# Before fix: result -> "old" (incorrect)
#  After fix: result -> "new" (correct)

The fix at this point was rather trivial: in this PR we’ve updated the guards for the specialized meta class getter to exclude instances of IO type objects, which from this point always receive a fresh singleton class lookup (in another specialized getter).

We’ve opened a discussion on the Ruby project to see how can we reduce the surprise effect of this behavior. While we’ve updated TruffleRuby to match CRuby’s behavior, there’s still the open language design issue of IO#reopen from wiping the object’s class/singleton class, which could disable added functionality to IO objects.