In Ruby 3.3.0, a new standard library was added to CRuby called Prism. Prism is a parser for the Ruby language, exposed as both a C library (optionally usable by CRuby) and a Ruby library (usable as a Ruby gem). The Prism project represents many person-years’ worth of effort, and is the result of a collaboration between Shopify, CRuby core contributors, other Ruby implementation authors, and Ruby tooling developers.
This post provides an overview of the Prism project — why it exists, where it stands today, and what the future holds. It also gives some insight into the broader ecosystem of Ruby parsers, intermediate representations, and tools. This includes some well-known projects that you are likely to have heard of (e.g., Ripper) and newer projects that you may not yet be familiar with (e.g., LRama).
If you’re low on time and just want the conclusion, here it is: if you need to parse Ruby code for whatever reason, use the Prism library. Regardless of any future decisions made by the CRuby core team, this library is guaranteed to live on in perpetuity as the definitive Ruby parser API. It is well-documented, error-tolerant, portable to every major Ruby implementation, and has a clear path to future improvements.
The story of the Ruby language frontend is long, fragmented, and complex. It includes many projects, each with their own (sometimes conflicting) goals. To understand the current state of affairs, it’s necessary to look back at the history of the Ruby frontend, to see how the language has evolved over time. There is a lot to digest here, so we will try to make this as brief as possible.
Ruby was created in early 1993. At the time, it was very common for language designers to generate their parsers using a tool called Yacc, which stood for “Yet Another Compiler Compiler”. Yacc takes a grammar file (a file suffixed with .y that describes the syntax of a language) and generates a parser for that language (in this case a .c file). Matz took this same approach; the first parser of the Ruby language shipped with Ruby 0.01 was generated by Yacc.
The oldest available changelog entry that has an explicit version attached is for CRuby 0.06. At this point the parser was generated by Yacc and written in C.
Fundamental to the current state of the Ruby frontend is that at this point Ruby was a tree-walk interpreter. This means that after generating a syntax tree, the Ruby runtime would walk the tree to execute the code. This is in contrast to the current CRuby runtime YARV, which is a bytecode interpreter. All syntax errors, warnings, and other diagnostics were generated by the parser itself, and the parser was tightly coupled to the runtime. The syntax tree that was generated was explicitly designed for speedy execution, not for analysis or transformation.
This is a very important point, and worth spending some extra time considering. You can still see its impact in the structure of the CRuby syntax tree today. You can use ruby --dump=parsetree -e [SOURCE] to see the parsed tree.
Consider how the following examples are represented and how they differ from the actual source code.
for left, right in elements do end
def foo = return :bar
puts "World!"; BEGIN { puts "Hello!" }
/foo #{bar}/o
Notice that all of the deviations in the syntax tree from the source code make sense for the use case of an interpreter and/or compiler — they make things more efficient. But also notice that they make things significantly more difficult for any other use case.
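As a rough sketch of how to explore this from Ruby rather than from the command line, the snippet below walks the tree returned by RubyVM::AbstractSyntaxTree (an experimental, CRuby-only API available since Ruby 2.6, discussed later in this post) and collects the type of every node. The node_types helper is an illustrative name of our own, and the exact node types differ between CRuby versions, so treat the output as something to inspect rather than rely on.

```ruby
# Assumes CRuby 2.6+; RubyVM::AbstractSyntaxTree is experimental and CRuby-only.

# Hypothetical helper: collect the type of every node in the tree, depth-first.
def node_types(node, types = [])
  # Children can also be symbols, arrays, or nil; only descend into real nodes.
  return types unless node.is_a?(RubyVM::AbstractSyntaxTree::Node)

  types << node.type
  node.children.each { |child| node_types(child, types) }
  types
end

source = "for left, right in elements do end"
tree = RubyVM::AbstractSyntaxTree.parse(source)

# Print the node types so they can be compared against the source text.
puts node_types(tree).inspect
```

Any node in the printed list that has no obvious counterpart in the source text is the kind of deviation described above.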
Ruby 0.95 was released at the end of 1995, and in this release there was a new entry in the ToDo file at the root of the repository: “hand written parser (recursive descent)”. We’ve written before about hand-written recursive descent parsers and how they differ from generated LALR parsers in the Prism announcement post, so we won’t rehash it here. Suffice it to say, we believe a hand-written recursive descent parser is the best choice for a language like Ruby, and it appears from this early version of CRuby that (at least at the time) Matz agreed.
In 2000, Dave Thomas created the nodeDump project, which walked the Ruby AST and generated documentation. To our knowledge, this is the first attempt to access and manipulate the Ruby syntax tree outside of the CRuby runtime itself. While this project was made obsolete by the Ruby 1.9 switch and therefore is no longer maintained, it is worth mentioning that from the earliest public days of Ruby a desire for a Ruby parser API existed.
In 2001, Jan Arne Petersen began work on JRuby, a reimplementation of Ruby for the Java Virtual Machine that was a direct port of the Ruby 1.6 code. This project still exists and is in use in production systems to this day. The parser took a copy of the parse.y grammar file used as the input to Yacc and rewrote its actions in Java. Since then, any changes to the Ruby grammar have been manually copied over to JRuby through this same process.
This is another important point: every change to the CRuby grammar file induced a change in the JRuby grammar file. This is not a unique story; every unique Ruby parser developed since these early days (14 by our count) has had to do the same thing. This is a significant investment of time and effort.
Somewhat incredibly, JRuby has managed to stay on top of the grammar changes and in its latest release supports nearly all of the syntax of CRuby 3.3.0 (the latest released version at the time of writing this post). We say nearly here because minute differences have existed between the two parsers since the beginning. While JRuby’s parser has been by far the most comprehensive alternative Ruby parser over the years, getting to 100% parity with all of the various eccentricities is extremely difficult.
In 2001, around the time of Ruby 1.7, Aoki Minero released the first version of the Ripper library. This was an event-driven parser that allowed users to build their own syntax trees. It worked by copying the Ruby grammar file and modifying the actions to dispatch events that called out to user-defined methods. Originally this project existed on its own, before the maintenance of it proved to be difficult and it was eventually merged into CRuby three years later.
This parser still exists today as a standard library. Although it has had the explicit “Ripper is still early-alpha version.” warning at the top of its documentation for the past 20 years, it has still served as the choice of parser for many projects (including yard, prettier, rubyfmt, etc.). These projects chose Ripper for myriad reasons, but the two that stand out are: it was the only standard library parser available and it was the only parser guaranteed to parse exactly as the current version of CRuby (a somewhat unique constraint necessary for tools like IRB).
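To make the event-driven model concrete, here is a minimal sketch of a Ripper subclass that reacts to a single parser event. The DefCollector class and its names accessor are hypothetical names chosen for illustration; only Ripper itself, its parse method, and the on_def event come from the standard library.

```ruby
require "ripper"

# Hypothetical example class: records the name of every method definition.
class DefCollector < Ripper
  attr_reader :names

  def initialize(source)
    super
    @names = []
  end

  # Parser event fired for each `def`. By default the identifier scanner event
  # returns the token text, so `name` arrives here as a String.
  def on_def(name, *rest)
    @names << name
    nil
  end
end

collector = DefCollector.new("def foo; end\ndef bar(x); x; end\n")
collector.parse
puts collector.names.inspect
```

Every other event falls through to Ripper's default handlers, which is what lets a tool build only the pieces of a syntax tree that it cares about.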
In 2003, CRuby released version 1.8.0, the last minor CRuby version to include the tree-walk interpreter. In this version, a new entry in the ToDo file appeared in CRuby: “Parser API”. Likely this was a reaction to the demand that was already present in the ecosystem: developers wanted to build tools on top of the Ruby syntax tree and couldn’t.
In contrast, there was prior art in other language ecosystems. Perl had the PPI module starting in 2001, which allowed developers to access the syntax tree without the Perl runtime. CPython shipped with the dis module starting in 1990, which gave developers access to the bytecode, and later in 2005 shipped the ast module. Being able to access the syntax tree and/or the bytecode is table stakes for creating high-quality tooling in any language ecosystem, and Ruby was behind in this regard.
In 2004, Ripper was merged and along with it CRuby switched from using Yacc to using its spiritual successor: GNU Bison. Bison was compatible with Yacc and boasted a number of improvements including reentrancy. Ripper was also built on top of Bison, meaning it was necessary to switch to Bison to support merging Ripper.
It’s important to note that this meant that in order to compile CRuby from source developers had to have Bison installed on their system. This was not a significant change (previously developers had to have Yacc installed), but it’s worth mentioning because this pain-point existed until only very recently.
Also in 2004, Ryan Davis released a library called ParseTree, which gave developers access to the CRuby syntax tree using a C extension that converted it into Ruby primitives (arrays, strings, symbols, integers, etc.). This relied heavily on the structure of the CRuby 1.8 syntax tree, effectively mirroring it into a Ruby structure. While this project did not survive the 1.8 to 1.9 migration, it was the spiritual predecessor to the ruby_parser gem.
In 2007, Ryan Davis closed down his work on the ParseTree library and instead replaced it with the ruby_parser gem, which worked on both the CRuby 1.8 and 1.9 branches. (At the time, 1.9 was a long-lived branch that housed the YARV bytecode interpreter). This project was different from ParseTree; it took the approach of copying the grammar file, rewriting the actions in Ruby, and then feeding it into the racc parser generator.
This project still exists today, and is the basis of many tools in the Ruby ecosystem (including but not limited to flog, dawnscanner, and fasterer). This is the first true fragmentation within the CRuby parser ecosystem; developers could now choose between using Ripper or ruby_parser.
At long last in 2007 CRuby released version 1.9, which among other things included the YARV bytecode interpreter. This was the first version that included Ripper as a standard library. Merging in Ripper meant that CRuby now maintained two parsers: the Bison-generated parser that was compiled into YARV bytecode, and the Ripper parser that was exposed as a standard library.
In order to marry these two requirements, Ripper was fashioned as a pre-processing step on the existing grammar file. Within the actions of the grammar file a special domain-specific language was used in C language comments to describe the actions that Ripper would take. A tool was created that would extract these comments and generate a clean grammar file that itself could then be passed into Bison. If this sounds complicated, that’s because it is. Ripper’s setup means that any changes to the grammar file might inadvertently change the semantics of Ripper, a caveat that exists to this day.
In 2012, four days after Ruby 1.9 was recognized by the ISO as an international standard, work began on a new Ruby implementation called mruby. It was meant as a “lightweight” version of Ruby suitable for smaller devices with limited memory.
The parser started out as a copy of an earlier version of the CRuby parser, but quickly developed its own syntax tree more in the style of a Lisp S-expression. While effort has been made to update its syntax tree to include newer CRuby features, it has proven both difficult and at times undesirable.
Importantly, mruby was designed to be embedded and portable. This meant it could serve as a submodule of other libraries. As such, its parser has been used to power other projects (notably Artichoke Ruby) because it is accessible from other languages.
One year later in 2013, the parser gem was created. Again, it took the grammar file from CRuby. It then rewrote the actions in Ruby. Using a lexer generated by ragel and a parser generated by racc, it provided a Ruby API for accessing the Ruby syntax tree.
This project proved quite popular, thanks in part to tireless efforts to exactly match CRuby parsing semantics for every version (a separate grammar file is checked in for each supported version). It was convenient enough that tools like rubocop ended up switching over from Ripper to use it. This is the most widely-used Ruby parser in the ecosystem today outside of the CRuby parser itself. It is the basis for most of the static analysis tooling in use today.
While this project was quite successful, the caveat is that it further fragmented the community. Now developers could choose between using Ripper, ruby_parser, and parser. Static analysis tools that were being developed at the time ended up not being able to reuse code from one another, and instead the community reimplemented the same logic in multiple projects. Worse still, the CRuby codebase only used its own parser, which meant any improvements to tooling generated by the ecosystem outside of CRuby were lost to the reference implementation.
To keep up with syntax changes, tooling has been developed to open issues on the repository any time a change to CRuby’s parse.y file is committed. This is a common story that we’ve already seen in this post going back to JRuby in 2001.
Also in 2013, Oracle Labs started a new Ruby implementation based on partial evaluation of self-optimizing AST interpreters. To do so, it used two Java-based technologies: the Truffle AST interpreter framework and the Graal JIT compiler. Given its basis in Java, TruffleRuby initially joined the JRuby project as an alternative runtime backend called JRuby+Truffle. However, as the projects drifted apart in design, TruffleRuby broke off as a standalone project again and forked the JRuby parser to adapt to its own core library.
TruffleRuby boasts the highest peak performance of any Ruby implementation to date. It leverages the power of multiple JIT compilers and the GraalVM ecosystem to achieve this. While it isn’t widely-used in production systems, the concepts that it has introduced have been quite influential in the Ruby community. Perhaps most notably, contributors to TruffleRuby have drastically advanced the Ruby Spec Suite, a comprehensive test suite for the Ruby language.
In 2017, a project called typedruby was created, which introduced gradual static typing for Ruby. In order to parse Ruby code it rewrote the lexer from ruby_parser, copied the grammar file from CRuby 2.4.0, and rewrote the actions in C++.
This project is no longer under active development, but it is worth mentioning here because the parser it created was eventually vendored into another C++ project: Sorbet. Sorbet is a gradual static type checker for Ruby that was developed by Stripe. It is still in active use in production systems of the largest Ruby codebases around today.
In 2018, commit 0f3dcbdf introduced an AST module in CRuby to help with writing tests for the parser. This ended up being renamed to RubyVM::AbstractSyntaxTree and was released as an experimental feature in Ruby 2.6.0. Many caveats were attached to this feature, including warnings of future changes and the fact that it was not guaranteed to be stable. This was the first time the CRuby parser was exposed as a public API from within CRuby itself and not from a community project.
For the most part, the warnings appear to have worked. Not many projects have been developed on top of this feature. The most notable exception is the error_highlight gem, a core library that provides better error messages by including snippets of the source code that generated the errors.
In 2019, CRuby introduced the concept of pattern matching. This was the largest influx of new syntax into the Ruby language since Ruby 1.9, 12 years earlier. The parse.y file added a whopping 714 lines of code to support this new feature.
Inherently, this meant that every parser in the ecosystem had to be updated to support this new syntax. Inadvertently, this spelled the end of claiming 100% compatibility for most of the parsers in the ecosystem for many years. Most of them started out with a subset of the whole feature, supporting only the most common use-cases. Here you can see a timeline of these efforts:
2019-05-08 - parser
2020-09-22 - Sorbet
2021-04-07 - JRuby
2021-08-30 - ruby_parser
2023-04-08 - TruffleRuby

In 2019, Tim Morgan created the Natalie project, an ahead-of-time compiled C++ Ruby implementation. The parser was hand-written, using the syntax tree structure developed by Ryan Davis in the ruby_parser project. Over time, the parser was extracted into its own project called natalie_parser.
In 2020, an alternative mruby implementation was created called PicoRuby. The project was meant to be a minimal, small-footprint implementation of mruby that would function well on microcontroller boards like a Raspberry Pi Pico. The parser was copied from the mruby project and then modified to better suit the needs and requirements of the project.
At this point in the timeline in late 2022, the first commit was made on the Prism project, the topic of this blog post. We’ll come back to this topic momentarily.
In 2023, Kaneko Yuichiro created commit a1b01e77, which added LRama as the new parser generator in CRuby. LRama is a reimplementation of the Bison parser generator written in Ruby. It took the same parse.y file that had been used for the past 20 years and generated the same parse.c file that Bison would have. This solved problems that had existed since the creation of the CRuby parser; developers compiling CRuby from source would no longer have to have Yacc/Bison installed on their system. Additionally, small differences in Bison versions would no longer accidentally break the parser.
LRama was a significant shift in the maintenance of the CRuby parser. Because the entire parser generation pipeline was now controlled by CRuby, it became possible to modify the grammar file in ways that were not previously possible with Bison. You can see this in commits that refactor Ripper and in commits that introduce the use of the ? operator to the grammar file, like terms? and '\n'?.
It’s important to note that this change is not without downsides. If you’ve been reading carefully, you’ll remember that almost every other parser in the ecosystem relied on mirroring changes in parse.y into their own grammar files. At the time of this commit that included the parsers in JRuby, TruffleRuby, ruby_parser, parser, and Sorbet, as well as all of the downstream tools that depend on these projects.
In mid-2023, Kaneko Yuichiro created CRuby commit b481b673, introducing the concept of a “universal” parser. You can read more about it in the issue. The idea was to incrementally extract the existing CRuby parser into a standalone library.
This was accomplished by providing a callback interface where consumers would implement all of the features needed by the parser. Over time, the number of callbacks necessary to implement as a consumer has decreased, with the stated goal of eventually having the fewest number of callbacks possible.
Work continues on this project today. At the time of writing, this interface can be optionally used by CRuby, but it has not been adopted elsewhere in the ecosystem.
I’ve purposefully omitted a large number of projects from this list, in a (somewhat failing) effort to keep this history as brief as possible. A number of other parsers and syntax trees were developed throughout this time and contributed to the overall development of the Ruby frontend ecosystem. Here are a few of them:
Project | Parser |
---|---|
Cardinal | A hand-written parser in Parrot |
IronRuby | A copy of the CRuby parser, rewritten in .NET |
MacRuby | A copy of the CRuby parser, rewritten in Objective-C |
Rubinius | Melbourne, a copy of the CRuby parser, rewritten in Ruby |
Ruby Intermediate Language | A hand-written parser in OCaml |
Syntax Tree | A syntax tree built on top of Ripper |
Topaz | A copy of the CRuby parser, rewritten in RPython |
At the time of Prism’s conception in early 2022, we faced a fragmented ecosystem with a large number of disparate requirements. Taking into account the history of the Ruby frontend and the current state of the ecosystem, it was clear that if a single parser were going to be written, it would have to solve everything at the same time to avoid risking becoming yet another option.
We wanted to solve these problems for a lot of reasons, but the biggest was the sheer maintenance cost. In early 2022, Shopify was investing in CRuby, TruffleRuby, parser (via downstream projects like rubocop and packwerk), and Sorbet. We had developers who were actively working on syntax updates for TruffleRuby. We also had developers who had recently invested months of effort bringing pattern matching to Sorbet. Simply put, exploring the possibility of a single parser that all of these projects could use made quite a lot of sense.
Taking stock of the community, we realized that we could never claim to be a universal parser unless we worked closely with the maintainers of all of the various parsers in the ecosystem to determine their needs and meet them. We would also need a clear migration path for all of the existing projects to move over to a new parser, which was no small feat. In the end, we came up with the following list:
Concern | Description |
---|---|
Compatibility | No one would want to adopt a new parser that didn’t parse the same code the exact same way as their existing parser. |
Maintainability | Every project in the ecosystem wanted a parser that was easy to maintain and update. |
Performance | No projects would want to adopt a new parser that was slower than their existing parser. |
Error‑tolerance | To be useable as the basis for IDE tooling, the new parser must be error-tolerant (a requirement of all of the implementations, as well as Sorbet and Ruby LSP). |
Portability | At this time there were parsers being actively maintained in C, C++, Ruby, Rust, and Java. All except the last could function through FFI, but requiring a Java process to call back and forth into native functions was not a good solution. Instead, we would need to develop a serialization format that could be used by the Java projects that could retrieve the syntax tree with a single FFI call. |
Reentrancy | The parser would need to be reentrant in order to be used in a multi-threaded environment. |
Identifiable | The parser would need to generate syntax trees whose nodes could be consistently identified across parses. This was a requirement of error_highlight, which reparses the source code to find the exact location of errors. |
Small footprint | The footprint would need to be small in order to be suitable for implementations like mruby and PicoRuby. |
Migration path | We would need to provide a clear migration path for all of the existing projects to move over to a new parser. |
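Two of the rows above, error-tolerance and portability, are easy to illustrate with the Ruby API that Prism ended up shipping. The sketch below assumes the prism gem is available (it is bundled with CRuby 3.3 and later); the summarize helper is a hypothetical name of our own, while Prism.parse, Prism.dump, and Prism.load are part of the gem.

```ruby
require "prism" # bundled with CRuby 3.3+, otherwise install the prism gem

# Hypothetical helper: parse source that may contain syntax errors and report
# the root node class plus the line of each error.
def summarize(source)
  result = Prism.parse(source)
  {
    root: result.value.class,
    error_lines: result.errors.map { |error| error.location.start_line }
  }
end

# The missing operand on line 2 is a syntax error, yet a tree is still returned.
broken = "def greet(name)\n  puts(\"Hello, \" + )\nend\n"
summary = summarize(broken)
puts summary.inspect

# Serialize the tree to a binary string and load it back.
valid = "puts 1 + 2\n"
serialized = Prism.dump(valid)
reloaded = Prism.load(valid, serialized)
puts reloaded.value.class
```

The serialized string is the kind of payload a non-Ruby consumer would read after a single call into the native library; it is loaded back in Ruby here only to keep the sketch self-contained.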
With this large, ambitious list of requirements, we went to work. After a year’s worth of work, we had:
At this point we opened our first pull request against CRuby to merge in our efforts in CRuby cc7f765f in late June of 2023. Because CRuby has been a bytecode interpreter since 1.9, our next step was then to generate the same bytecode instructions as the existing CRuby compiler, so we set out to do that.
In the meantime, we also began work on our migration story to make it easier for existing projects to migrate over to Prism. We began developing “translation” layers that would translate Prism’s syntax tree into the syntax tree of the other existing parsers. This included Ripper, ruby_parser, and parser.
The history and the work we’ve done on Prism bring us to today. After two years’ worth of work, here is where the ecosystem outside of CRuby currently stands:
Project | Status |
---|---|
Natalie | Toward the end of last year, Natalie completed its migration to Prism. This was the first adopter, and Tim helped find a number of bugs in Prism that we were able to fix. |
JRuby | Earlier this year, JRuby completed its migration to Prism. They were one of the earliest adopters, and after weathering all of the various breaking changes that Prism introduced while it was being developed, they can now boast a peak of a 3x speedup in parsing time. |
TruffleRuby | Also earlier this year, TruffleRuby completed its migration to Prism. In their announcement they mention that Prism is about twice as fast as their previous parser. |
PicoRuby | As of writing, the author of PicoRuby is working on experimenting with both Prism and the “universal” parser. They have not yet completed their migration, but plan to present their findings at RubyKaigi next month. |
ruby_parser | Earlier this year we completed our translation layer to the ruby_parser syntax tree. This can be used as the basis of tools relying on this parser to migrate over to Prism, and is showing a nice speedup in parsing time. |
parser | Earlier this year we also completed our translation layer to the parser syntax tree. It was successful enough that it was immediately adopted into rubocop and released as an optional configuration in the latest version. The author of rubocop also mentioned he would consider switching the parser over to Prism directly in the future. Other tools also began using this translation layer, including packwerk. |
Ripper | Most recently, we completed our translation layer to the Ripper event stream. This was the most complex translation layer to write, but it now allows tools to migrate over to Prism that may have been relying on this “experimental” tool. |
Sorbet | To reap the benefits of our work, we are considering attempting to migrate the Sorbet project over to using Prism. This work is currently being evaluated. |
A number of other open-source libraries and implementations have also adopted or are experimenting with Prism, a few of which include:
Numerous other closed-source tools have also been developed since the inception of Prism, both inside and outside of Shopify. These include tools developed in Ruby, C++, Rust, and JavaScript, all via the various bindings that Prism provides.
In total, at this point if PicoRuby (and potentially mruby) choose to adopt Prism going forward, this will mean every Ruby implementation and parser in the world will be using Prism, except CRuby.
Unsurprisingly, the reference implementation of Ruby is the most difficult to migrate. There are a number of factors that make switching to Prism difficult, laid out below.
Prism designed its own syntax tree from the ground up with the concerns of the various stakeholders in mind. This necessarily means we also need to write our own compiler to match the instruction sequences generated by CRuby. All told, at the time of writing, this accounts for about 9000 lines of code. With a change as large as replacing the parser and compiler, it’s difficult to ensure that every edge case is covered. Trepidation about the sheer size of the change is warranted and understandable.
While we continue to make progress on this every day, there are still a small number of edge cases indicating slight deviations with Prism. The vast majority of these have to do with the compiler, as opposed to the parser, within modules like TracePoint and Coverage. We are actively working on reducing these differences, but it is a slow process.
As of writing, we are failing 105/21813 tests and 42/32601 specs. We are confident that we will have them resolved by the next CRuby release (if not significantly sooner). In the meantime we will be testing Shopify’s core monolith with --parser=prism in our CI environment to ensure that we catch any regressions that might occur, in addition to passing all of the tests and specs.
As previously mentioned, the other effort within CRuby to improve the Ruby frontend ecosystem is the “universal” parser. While Prism has been adopted by nearly every other parser and implementation in the ecosystem, efforts have continued on the “universal” parser to ease the maintenance burden of the CRuby developers. While Prism has been merged into CRuby as a library, it has not yet been adopted as the default parser because both the Prism project and the “universal” parser project have been asked to compete.
Unfortunately this means developing CRuby itself is going to be difficult until such a time as a decision is made. Any changes to the grammar or compiler will have to be done twice: once for the existing pipeline and once for Prism. Fortunately for those not developing CRuby, to ease concerns about the future Matz has agreed that going forward the official parser API for Ruby will be the Prism API. This means regardless of which parser (Prism or the “universal” parser) is adopted as the official solution for CRuby, the developer-facing API will be the same. This is a significant win for the Ruby ecosystem, as it means developers can develop against Prism today and do not have to worry about whichever internal solution CRuby ends up choosing.
We hope this post has brought some clarity as to the relationship between Prism, LRama, the “universal” parser, and the broader Ruby frontend ecosystem. We are encouraged by the progress the whole community has made in rallying around a single syntax tree, and are very excited about the possibilities this enables. By sharing a single source of truth for parsing, the whole community can start to benefit from things like better error messages, shared indices for code navigation, and more. Overall, we look forward to seeing what the future holds for the Ruby frontend ecosystem, and are glad to be a part of it.