Implementing Embedded TypedData Objects
Internally, CRuby’s objects are strongly typed, with various types such as Array, Hash, Regexp, and Object. There is also a type called TypedData, which is used internally and by native gems to store a native pointer to an arbitrary piece of data. Some types in Ruby that are TypedData objects include Time, Mutex, and Enumerator. Native extensions like Nokogiri, pg, mysql2, and liquid-c also use TypedData objects extensively.
Jean Boussier and I implemented TypedData objects on top of Variable Width Allocation in Ruby 3.3, which improves performance and reduces memory usage. In this blog post, we will explore what TypedData objects are, how the memory layout changes with embedded TypedData objects, and our progress in converting Ruby’s types to embedded TypedData objects.
What are TypedData objects?
In Ruby, a TypedData object appears no different from any other Ruby object: it is an instance of a class, has instance methods that can be called, and can hold instance variables. However, under the hood, it’s quite different from other types of Ruby objects. TypedData objects are designed to store a pointer to an arbitrary piece of data. Compared to using instance variables, this is faster because it avoids instance variable lookups, and it allows the developer to store data that are not Ruby objects. For a more detailed guide on TypedData objects, including how to use them, read my other blog post. A TypedData object looks as follows:
The fields in a TypedData object are:
- headers: data present in all Ruby objects that contains metadata about the object for the garbage collector.
- type: stores a pointer to the configuration for the TypedData object, including:
  - The name of the TypedData object.
  - The mark function, which is used for marking the Ruby objects this TypedData object refers to.
  - The free function, which is used for freeing the resources of the TypedData object when it is reclaimed by the garbage collector.
  - Flags for features that the TypedData object supports.
- typed_flag: not useful for us. It’s just there for legacy reasons.
- data: pointer to an arbitrary region of memory.
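To make the field list concrete, here is a rough model of that layout in plain C. It is only a sketch: the struct and field names (`model_data_type`, `model_typed_data`, and so on) are made up for illustration and are not CRuby's actual definitions, and the headers are assumed to occupy two machine words.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for the per-type configuration that `type` points to.
 * These names are hypothetical, not CRuby's real declarations. */
typedef struct model_data_type {
    const char *name;           /* name of the TypedData type */
    void (*mark)(void *data);   /* marks the Ruby objects the data refers to */
    void (*free)(void *data);   /* frees resources when the object is reclaimed */
    uintptr_t flags;            /* features the type supports */
} model_data_type;

/* Illustrative stand-in for a (non-embedded) TypedData object. */
typedef struct model_typed_data {
    uintptr_t headers[2];        /* GC metadata; assumed to be two words here */
    const model_data_type *type; /* configuration shared by objects of this type */
    uintptr_t typed_flag;        /* kept for legacy reasons */
    void *data;                  /* pointer to an arbitrary region of memory */
} model_typed_data;

/* Size of the modeled object, measured in pointer-sized words. */
size_t model_typed_data_words(void) {
    return sizeof(model_typed_data) / sizeof(void *);
}
```

On a typical 64-bit platform this model comes to five words, with the `data` pointer in the last one. That last word is the one an embedded TypedData object no longer needs as a pointer.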
Embedding TypedData objects
Jean Boussier and I implemented a new kind of TypedData object, called an embedded TypedData object. Rather than allocating its data externally, an embedded TypedData object stores the data directly after the object’s other fields, in the same allocation. The benefits of this feature include:
- This reduces the number of allocations from two (one Ruby object and one system allocation) to just a single Ruby object, significantly improving performance. Allocating memory from the system and releasing it back can have significant performance overhead.
- This also improves runtime performance as we reduce memory accesses when reading the data of the TypedData object because we no longer need to follow a pointer.
- This can also reduce memory usage because we no longer need to store an 8-byte pointer to the memory region, and we avoid the memory the system allocator uses for bookkeeping on each allocation.
- Some implementations of malloc can suffer from external memory fragmentation, which increases memory usage. Ruby’s garbage collector is designed to mitigate this issue.
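The difference in allocation count and pointer chasing can be sketched in plain C. This is a simplified model under stated assumptions: both the Ruby object and the external data are stood in for by `calloc`, and every name here (`counted_alloc`, `payload`, `external_obj`, `embedded_obj`) is hypothetical rather than taken from CRuby.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Counts allocator calls so the two layouts can be compared. */
static int alloc_count = 0;

static void *counted_alloc(size_t size) {
    assert(size > 0);
    alloc_count++;
    return calloc(1, size);
}

/* Hypothetical payload a type might want to store. */
typedef struct payload {
    long seconds;
    long nanoseconds;
} payload;

/* External layout: the object points at separately allocated data. */
typedef struct external_obj {
    unsigned long headers[2];
    void *data;
} external_obj;

/* Embedded layout: the data sits directly after the header fields. */
typedef struct embedded_obj {
    unsigned long headers[2];
    payload data;
} embedded_obj;

external_obj *external_new(void) {
    external_obj *obj = counted_alloc(sizeof(*obj));   /* allocation 1: the object */
    if (obj == NULL) return NULL;
    obj->data = counted_alloc(sizeof(payload));        /* allocation 2: the data */
    if (obj->data == NULL) { free(obj); return NULL; }
    return obj;
}

embedded_obj *embedded_new(void) {
    return counted_alloc(sizeof(embedded_obj));        /* one allocation for both */
}

/* Reading external data loads a pointer first; embedded data is at a fixed offset. */
payload *external_data(external_obj *obj) { return obj->data; }
payload *embedded_data(embedded_obj *obj) { return &obj->data; }
```

In this model `external_new` calls the allocator twice and `embedded_new` calls it once, and `embedded_data` computes an address instead of loading one.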
Each TypedData type can opt into this feature by setting the RUBY_TYPED_EMBEDDABLE flag. It is not enabled for every type automatically, because opting in requires minor changes to the type’s implementation. Additionally, the data must not be shared between multiple TypedData objects, since the address of the data may change when the object is moved during the compaction phase of garbage collection.
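The sharing restriction is easier to see with a small model. The sketch below is plain C, not CRuby's compaction code: `movable_obj`, `model_move`, and `model_data` are hypothetical names, and a "move" is modeled as copying the object into another slot and clearing the old one.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical object whose payload is embedded after its header fields. */
typedef struct movable_obj {
    unsigned long headers[2];
    int data[4];
} movable_obj;

/* Models a compacting collector relocating an object: copy it into a
 * different slot, then clear the old slot as if it were being reused. */
movable_obj *model_move(movable_obj *from, movable_obj *to) {
    assert(from != to);
    memcpy(to, from, sizeof(*to));
    memset(from, 0, sizeof(*from));
    return to;
}

/* The data address is derived from the object's own address. */
int *model_data(movable_obj *obj) {
    return obj->data;
}
```

If a second object had kept the pointer returned by `model_data` before the move, it would now refer to the old, cleared slot rather than to the relocated data. That is why embedded data has to be owned by exactly one object.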
Impacts of embedded TypedData objects
We have implemented this feature in over 30 of the most commonly used TypedData types, including Time, Enumerator, Method, and TracePoint.
By removing the need to allocate memory from the system using malloc, embedded TypedData objects can be significantly faster to allocate. For example, we saw an 80% speedup in Time.now, a 68% speedup in Object#to_enum, and nearly a 50% speedup in Object#method.
Conclusion
We looked at what TypedData objects are, how we implemented embedded TypedData objects, and the performance improvements that embedded TypedData objects bring. We look forward to opening up this API for third-party native extensions, allowing more of the community to benefit from the performance improvements this feature offers.