It’s 11 o’clock. Do you know where your variables are pointing?

def shout(obj)
  obj.to_s + "!"
end

It’s hard to tell just looking at the code what type obj is. We assume it has a to_s method, but many classes define methods named to_s. Which to_s method are we calling? What is the return type of shout? If to_s doesn’t return a String, it’s really hard to say.

Adding type annotations would help… a little. With types, it looks like we have full knowledge about what each thing is, but we actually don’t. Ruby, like many other object-oriented languages, has this thing called inheritance, which means that type signatures like Integer and String mean an instance of that class… or an instance of a subclass of that class.

Additionally, gradual type checkers such as Sorbet (for example) have features such as T.unsafe and T.untyped which make it possible to lie to the type checker. These annotations unfortunately render the type system unsound without run-time type checks, which makes it a poor basis for something we would like to use in program optimization. (For more information, see this blog post for how it affects Python in a similar way.)

In order to build an effective compiler for a dynamic language such as Ruby, the compiler needs precise type information. This means that as compiler designers, we have to take things into our own hands and track the types ourselves.

In this post, we show an interprocedural type analysis over a very small Ruby subset. Such an analysis could be used for program optimization by a sufficiently advanced compiler. This is not something Shopify is working on, but we are sharing this post and the attached analysis code because we think you will find it interesting.

Note that this analysis is not what people usually refer to as a type inference engine or type checker; unlike Hindley-Milner (see also previous writing) or similar constraint-based type systems, this type analysis tracks dataflow across functions.

This analysis might be able to identify all of the callers of shout, determine that to_s on every argument returns a String, and therefore conclude that the return type of shout is String. All from an un-annotated program.

The examples after the intro in this post will use more parentheses than the typical Ruby program because we also wrote a mini-Ruby parser and it does not support method calls without parentheses.

Static analysis

Let’s start from the top. We’ll go over some examples and then continue on into code and some benchmarks.

Do you know what type this program returns?

1

That’s right, it’s Integer[1]. Not only is it an Integer, but we have additional information about its exact value available at analysis time. That will come in handy later.

What about this variable? What type is a?

a = 1

Not a trick question, at least not yet. It’s still Integer[1]. But what if we assign to it twice?

a = 1
a = "hello"

Ah. Tricky. Things get a little complicated. If we split our program into segments based on logical execution “time”, we can say that a starts off as Integer[1] and then becomes String["hello"]. This is not super pleasant because it means that when analyzing the code, you have to carry around some notion of “time” state in your analysis. It would be much nicer if instead something rewrote the input code to look more like this:

a0 = 1
a1 = "hello"

Then we could easily tell the two variables apart at any time because they have different names. This is where static single assignment (SSA) form comes in. Automatically converting your input program to SSA introduces some complexity but gives you the guarantee that every variable has a single unchanging type. This is why we analyze SSA instead of some other form of intermediate representation (IR). Assume for the rest of this post that we are working with SSA.

Let’s continue with our analysis.

What types do the variables have in the below program?

a = 1
b = 2
c = a + b

We know a and b exactly because they are constants, so can we constant-fold a + b into 3? Kind of. Sort of. In Ruby, no: not without global knowledge that nobody has patched, or ever will patch, the Integer class or done a variety of other nasty things.

But let’s pretend for the duration of this post that we live in a world where it’s not possible to redefine the meaning of existing classes (remember, we’re looking at a Ruby-like language with different semantics but similar syntax) or add new classes at run-time (this is called a closed-world assumption). In that case, it is absolutely possible to fold those constants. So c has type Integer[3].
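
To make that concrete, here is a minimal sketch of constant folding over types, in the spirit of the Rust analysis code shown later in this post. The Type enum and fold_add are illustrative names of ours, not the exact definitions from the companion code:

#[derive(Clone, Debug, PartialEq)]
enum Type {
    Empty,         // unreachable code; no value at run-time
    IntConst(i64), // Integer[N]: a known constant
    Int,           // any Integer instance
    Any,           // we know nothing about the value
}

// Under the closed-world assumption, Integer#+ always means integer
// addition, so adding two known constants yields a known constant.
fn fold_add(l: &Type, r: &Type) -> Type {
    match (l, r) {
        (Type::IntConst(a), Type::IntConst(b)) => Type::IntConst(a + b),
        // Integer + Integer is still an Integer, just not a known one.
        (Type::IntConst(_) | Type::Int, Type::IntConst(_) | Type::Int) => Type::Int,
        _ => Type::Any,
    }
}

fn main() {
    // a = 1; b = 2; c = a + b
    assert_eq!(fold_add(&Type::IntConst(1), &Type::IntConst(2)), Type::IntConst(3));
}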

Let’s complicate things.

if condition
  a = 3
else
  a = 4
end

We said that each variable would only be assigned once, but SSA can represent such a program using Φ (phi) nodes. Phi nodes are special pseudo-instructions that track dataflow when it could come from multiple places. In this case, SSA would place one after the if to merge two differently named variables into a third one.

if condition
  a0 = 3
else
  a1 = 4
end
a2 = phi(a0, a1)

This also happens when using the returned value of an if expression:

a = if condition
      3
    else
      4
    end

The phi function exists to merge multiple input values.

For our analysis, the phi node does not do anything other than compute the type union of its inputs. We do this because we are treating a type as the set of all possible values it could represent. For example, Integer[3] is the set {3}. And Integer is the infinite, difficult-to-fit-into-memory set {..., -2, -1, 0, 1, 2, ...}.

This makes the type of a (a2) some type like Integer[3 or 4], but as we saw, that set could grow potentially without bound. In order to use a reasonable amount of memory and make sure our analysis runs in a reasonable amount of time, we have to limit our set size. This is where the notion of a finite-height lattice comes in. Wait, don’t click away! It’s going to be okay!

We’re using a lattice: a set with a little more structure to it. Instead of having a union operation that just expands and expands and expands, we give each level of the lattice a limited number of entries before it overflows to the next, less-specific level. It’s kind of like a finite state machine. This is a diagram of a subset of our type lattice:

[Figure: the type lattice, drawn vertically with arrows running bottom to top in decreasing specificity: Empty; known Integer constants and known String constants; Integer and String; a set of classes; and the topmost set of all objects, Any.]
An example lattice similar to the one we use in our demo static analysis. At the bottom are the more specific types and at the top are the less specific types. Arrows indicate that results can only become less precise as more merging happens.

All lattice elements in our program analysis start at Empty (unreachable) and incrementally add elements, following the state transition arrows as they grow. If we see one constant integer, we can go into the Integer[N] state. If we see another, different integer, we have to lose some information and transition to the Integer state. This state symbolically represents all instances of the Integer class. Losing information like this is a tradeoff between precision and analysis time.
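
A minimal sketch of that union operation, modeled on the diagram above. The names are ours, and the lattice in the companion code has more levels (nil, class sets, and so on):

#[derive(Clone, Debug, PartialEq)]
enum Type {
    Empty,                  // bottom: unreachable code
    IntConst(i64),          // Integer[N]
    Int,                    // all Integers
    StrConst(&'static str), // String["..."]
    Str,                    // all Strings
    Any,                    // top: the set of all objects
}

// Union only moves up the lattice: results can never get more precise.
fn union(l: &Type, r: &Type) -> Type {
    match (l, r) {
        // Empty is the identity element.
        (Type::Empty, t) | (t, Type::Empty) => t.clone(),
        // Identical types (including equal constants) stay where they are.
        (l, r) if l == r => l.clone(),
        // Two different Integers overflow to the Integer state; same for String.
        (Type::IntConst(_) | Type::Int, Type::IntConst(_) | Type::Int) => Type::Int,
        (Type::StrConst(_) | Type::Str, Type::StrConst(_) | Type::Str) => Type::Str,
        // Mixing classes jumps to the top of this mini-lattice.
        _ => Type::Any,
    }
}

// A phi node's type is just the union folded over all of its inputs.
fn phi_type(inputs: &[Type]) -> Type {
    inputs.iter().fold(Type::Empty, |acc, t| union(&acc, t))
}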

To bring this back to our example, this means that a (which merges Integer[3] and Integer[4]) would have the type Integer in our lattice.

Let’s complicate things further. Let’s say we know somehow at analysis time that the condition is truthy, perhaps because it’s an inline constant:

a = if true
      3
    else
      4
    end

Many analyses looking at this program would see two inputs to a with different values and therefore continue to conclude that the type of a is still Integer—even though as humans looking at it, we know that the else branch never happens. It’s possible to do a simplification in another analysis that deletes the else branch, but instead we’re going to use some excellent work by Zadeck and co called Sparse Conditional Constant Propagation (SCCP).

Sparse conditional constant propagation

Unlike many abstract-interpretation-based analyses, SCCP uses type information to inform its worklist-based exploration of the control-flow graph (CFG). If it knows that the condition of a branch instruction is a constant, it does not explore both branches of the conditional. Instead, it pushes only the relevant branch onto the worklist.

Because we’re working with SSA over a control-flow graph (CFG), our unit of granularity is the basic block: a chunk of instructions in which only the last instruction is allowed to transfer control elsewhere.
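
The snippets below are excerpts and elide most of their surroundings. Roughly, the IR shapes they assume look something like this (our reconstruction, not the exact types from the companion code):

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct InsnId(usize);

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct BlockId(usize);

enum Op {
    // A conditional branch: control-flow, so always last in its block.
    IfTrue { val: InsnId, then_block: BlockId, else_block: BlockId },
    // ... arithmetic, calls, phis, and so on, elided
}

struct Insn {
    op: Op,
    block_id: BlockId, // the basic block this instruction lives in
}

struct Program {
    insns: Vec<Insn>, // indexed by InsnId
    // ... blocks, functions, and the entrypoint, elided
}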

fn sctp(prog: &Program) -> AnalysisResult {
    // ...
    while !block_worklist.is_empty() || !insn_worklist.is_empty() {
        // Read an instruction from the instruction worklist
        while let Some(insn_id) = insn_worklist.pop_front() {
            let Insn { op, block_id, .. } = &prog.insns[insn_id.0];
            if let Op::IfTrue { val, then_block, else_block } = op {
                // If we know statically we won't execute a branch, don't
                // analyze it
                match type_of(val) {
                    // Empty represents code that is not (yet) reachable;
                    // it has no value at run-time.
                    Type::Empty => {},
                    Type::Const(Value::Bool(false)) => block_worklist.push_back(*else_block),
                    Type::Const(Value::Bool(true)) => block_worklist.push_back(*then_block),
                    _ => {
                        block_worklist.push_back(*then_block);
                        block_worklist.push_back(*else_block);
                    }
                }
                continue;
            }
        }
        // ...
    }
    // ...
}

This leaves us with a phi node that only sees one input operand, Integer[3], which gives us more precision to work with in later parts of the program. The original SCCP paper stops here (papers have page limits, after all) but we took it a little further. Instead of just reasoning about constants, we use our full type lattice. And we do it interprocedurally.

Let’s look at a small example of why interprocedural analysis matters before we move on to trickier snippets. Here we have a function decisions with one visible call site and that call site passes in true:

def decisions(condition)
  if condition
    3
  else
    4
  end
end

decisions(true)

If we were just looking at decisions in isolation, we would still think the return type is Integer. However, if we let information from all the call sites flow into the function, we can see that all (one) of the call sites pass true to the function… and therefore we should only look at one branch of the if.

Now, a reader familiar with SCCP might be wondering how this works interprocedurally. SCCP by definition requires knowing in advance what instructions use what other instructions: if you learn new facts about the output instruction A, you have to propagate this new information to all of the uses. In a single function’s control-flow graph, this isn’t so bad; we have full visibility into definitions and uses. It gets harder when we expand to multiple functions. In this example, we have to mark the condition parameter as a use of all of the (currently constant) actual arguments being passed in.
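
One way to do that bookkeeping is an explicit def-use map that grows as we discover call edges: wiring an actual argument to a parameter records the parameter as a use of the argument, so any new fact about the argument re-enqueues the parameter. A sketch, with names of our own choosing:

use std::collections::{HashMap, VecDeque};

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct InsnId(usize);

struct DefUse {
    // For each defining instruction, the instructions that consume its value.
    uses: HashMap<InsnId, Vec<InsnId>>,
}

impl DefUse {
    // Discovering a call edge wires each actual argument to its parameter.
    fn add_call_edge(&mut self, args: &[InsnId], params: &[InsnId]) {
        for (arg, param) in args.iter().zip(params) {
            self.uses.entry(*arg).or_default().push(*param);
        }
    }

    // When an instruction's type changes, all of its uses get re-analyzed.
    fn flow(&self, changed: InsnId, insn_worklist: &mut VecDeque<InsnId>) {
        for use_site in self.uses.get(&changed).into_iter().flatten() {
            insn_worklist.push_back(*use_site);
        }
    }
}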

But how do we know the callers?

Interprocedural SCCP

Let’s start at the entrypoint for an application. That’s normally a main function somewhere that allocates some objects and calls a couple of other functions. These functions might in turn call other functions, and so on and so forth, until the application terminates.

These calls and returns form a graph, but we don’t know it statically—we don’t know it at the start of the analysis. Instead, we have to incrementally build it as we discover call edges.

In the following code snippet, we would begin analysis at the entrypoint, which in this snippet is the main function. In it, we see a direct call to the foo function. We mark that foo is called by main—and not just by main, but by the specific call site inside main. Then we enqueue the start of the foo function—its entry basic block—onto the block worklist.

def bar(a, b)
  a + b
end

def foo()
  bar(1, 2) + bar(3, 4)
end

def main()
  foo()
end

At some point, the analysis will pop the entry basic block of foo off the worklist and analyze foo. For each of the direct calls to bar, it will create a call edge. In addition, because we are passing arguments, it will wire up 1 and 3 to the a parameter and 2 and 4 to the b parameter. It will enqueue bar’s entry block.

At this point, we’re merging Integer[1] and Integer[3] at the a parameter (and similarly at b). This is kind of like an interprocedural phi node and we have to do the same union operation on our type lattice.
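
In terms of the union sketch from earlier (reusing its hypothetical Type and union), a parameter behaves exactly like a phi over every incoming argument:

// The `a` parameter of bar merges Integer[1] and Integer[3]...
let a_type = union(&Type::IntConst(1), &Type::IntConst(3));
// ...and overflows to plain Integer, losing the constant values.
assert_eq!(a_type, Type::Int);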

This means that we won’t be able to fold a+b for either call to bar, unfortunately, but we will still get a return type of Integer, because we know that Integer+Integer=Integer.

Now, if there were a third call to bar that passed it Strings, every call site would lose. We would end up with ClassUnion[String, Integer] at each parameter and, worse, Any as the function result. We wouldn’t even get ClassUnion[String, Integer] as the result because we don’t keep each call site separate: from the perspective of the analysis, we could be looking at String+Integer, which doesn’t have a known type (in fact, it probably raises an exception or something).

But what if we kept each call site separate?

Sensitivity

This kind of thing is generally called something-sensitivity, where the something depends on what your strategy is to partition your analysis. One example of sensitivity is call-site sensitivity.

In particular, we might want to extend our current analysis with 1-call-site-sensitivity. The number is the k we can dial up for more precision at the cost of a slower analysis: how many levels of “call frames” the analysis keeps track of. This is especially useful for very commonly used library functions such as to_s and each, where each caller might be quite different.

In the above very not-representative example, 1-call-site-sensitivity would allow us to completely constant fold the entire program into Integer[10] (as 1 + 2 + 3 + 4 = 10). Wow! But it would slow down the analysis because it requires duplicating analysis work. To side-by-side the rough steps:

Without call-site sensitivity / 0-call-site-sensitive (what we currently have):

  • See call to bar with arguments 1 and 2
  • Mark bar’s parameters as being Integer[1] and Integer[2]
  • See bar add node with constant left and right operands
  • Mark bar add result as Integer[3]
  • Mark bar return as Integer[3]
  • Mark result of bar(1, 2) as Integer[3]
  • See call to bar with arguments 3 and 4
  • Mark bar’s parameters as being Integer and Integer (we have to union)
  • Mark bar add result as Integer (the arguments are not constant)
  • Mark bar return as Integer
  • See foo’s own add with operands Integer and Integer
  • Mark foo’s add as returning Integer

With 1-call-site-sensitive:

  • See call to bar from function foo with arguments 1 and 2
  • Make a new call context from foo
  • Mark foo0->bar parameters as being Integer[1] and Integer[2]
  • See foo0->bar add node with constant left and right operands
  • Mark foo0->bar add result as Integer[3]
  • Mark foo0->bar return as Integer[3]
  • Mark result of bar(1, 2) as Integer[3]
  • See call to bar from function foo with arguments 3 and 4
  • Make a new call context from foo
  • Mark foo1->bar parameters as being Integer[3] and Integer[4]
  • Mark foo1->bar add result as Integer[7]
  • Mark foo1->bar return as Integer[7]
  • See foo’s own add with constant operands Integer[3] and Integer[7]
  • Mark foo add as returning Integer[10]

See how we had to analyze bar once per call-site instead of merging call inputs and returns and moving up the lattice? That slows the analysis down.

There is also context sensitivity, which is about partitioning calls based on some computed property of a given call site instead of where it is in the program. Maybe it’s the tuple of argument types, or the tuple of argument types with any constant values removed, or something else entirely. Ideally, it should be fast to compute and fast to compare across call sites.
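
As a sketch of how this changes the analysis state: the key we analyze under grows an extra context component, and what goes into that context determines which flavor of sensitivity you get. These names are ours, not the companion code’s:

use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct FunId(usize);

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct CallSiteId(usize);

// What we analyze is no longer a function but a (context, function) pair.
#[derive(Clone, PartialEq, Eq, Hash)]
enum Context {
    // 0-call-site-sensitive: all calls to a function share one state.
    Insensitive,
    // 1-call-site-sensitive: one state per calling instruction.
    CallSite(CallSiteId),
    // Context-sensitive: partition by some computed property instead,
    // e.g. the tuple of argument types (the Type for that is elided here).
    // ArgTypes(Vec<Type>),
}

type AnalysisKey = (Context, FunId);

struct PerContextState; // parameter and return types, elided
type States = HashMap<AnalysisKey, PerContextState>;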

There are other kinds of sensitivity like object sensitivity, field sensitivity, and so on—but since this is a bit of a detour in the main article and we did not implement any of them, we instead leave them as breadcrumbs for you to follow and read about.

Let’s go back to the main interprocedural SCCP and add some more trickiness into the mix: objects.

Objects and method lookup

Ruby doesn’t just deal with integers and strings. Those are special cases of a larger object system in which objects have instance variables, methods, and so on, and are instances of user-defined classes.

class Point
  attr_accessor :x
  attr_accessor :y

  def initialize(x, y)
    @x = x
    @y = y
  end
end

p = Point.new(3, 4)
puts(p.x, p.y)

This means that we have to start tracking all classes in our static analysis or we will have a hard time being precise when answering questions such as “what type is the variable p?”

Knowing the type of p is nice—maybe we can fold some is_a? branches in SCCP—but the analysis becomes even more useful if we can keep track of the types of instance variables on objects. That would let us answer the question “what type is p.x?”

Per this paper (PDF), there are at least two ways to think about how we might store that kind of information. One, which the paper calls field-based, unifies the storage of field types based on their name. So in this case, all potential writes to any field x might fall into the same bucket and get unioned together.

Another, which the paper calls field-sensitive, unifies the storage of field types based on the receiver (object holding the field) class. In this case, we would differentiate all possible types of p at a given program point when writing to and reading from p.x.

We chose the latter approach in our static analysis: we made it field-sensitive.

fn sctp(prog: &Program) -> AnalysisResult {
    // ...
    while !block_worklist.is_empty() || !insn_worklist.is_empty() {
        // Read an instruction from the instruction worklist
        while let Some(insn_id) = insn_worklist.pop_front() {
            let Insn { op, block_id, .. } = &prog.insns[insn_id.0];
            // ...
            // Compute this instruction's type from its operands' types.
            let insn_type = match op {
                Op::GetIvar { self_val, name } => {
                    match type_of(self_val) {
                        Type::Object(classes) => {
                            // Union the ivar's type across every class the
                            // receiver could possibly be an instance of.
                            classes.iter().fold(Type::Empty, |acc, class_id| union(&acc, &ivar_types[class_id][name]))
                        }
                        ty => panic!("getivar on non-Object type {ty:?}"),
                    }
                }
                // ... other instructions elided
            };
            // ... record insn_type and re-enqueue its uses if it changed
        }
    }
}

This means that we have to do two things: 1) keep track of field types for each instance variable (ivar) of each class and then 2) at a given ivar read, union all of the field types from all of the potential classes of the receiver.

Unfortunately, it also creates a complicated uses relationship: any GetIvar instruction is a use of all possible SetIvar instructions that could affect it. This means that if we see a SetIvar that writes to T.X for some class T and field name X, we have to go and re-analyze all of the GetIvars that could read from that class (and propagate this information recursively to the other uses, as usual).
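
A sketch of that bookkeeping, again with names of our own: per-(class, ivar) types, plus a reverse map from each (class, ivar) to the GetIvars that might read it, so a SetIvar that widens a field re-enqueues all of its potential readers:

use std::collections::{HashMap, VecDeque};

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct ClassId(usize);

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct InsnId(usize);

#[derive(Clone, PartialEq)]
enum Type { Empty, Int, Any } // stand-in for the full lattice

fn union(l: &Type, r: &Type) -> Type {
    match (l, r) {
        (Type::Empty, t) | (t, Type::Empty) => t.clone(),
        (l, r) if l == r => l.clone(),
        _ => Type::Any,
    }
}

struct IvarState {
    // Field-sensitive: one lattice element per (class, ivar name).
    ivar_types: HashMap<(ClassId, String), Type>,
    // Every GetIvar instruction that might read a given (class, ivar name).
    readers: HashMap<(ClassId, String), Vec<InsnId>>,
}

impl IvarState {
    // Handle a SetIvar: union the written type into the field's type. If
    // the field widened, every potential reader must be re-analyzed.
    fn set_ivar(&mut self, class: ClassId, name: &str, val: &Type,
                insn_worklist: &mut VecDeque<InsnId>) {
        let key = (class, name.to_string());
        let old = self.ivar_types.get(&key).cloned().unwrap_or(Type::Empty);
        let new = union(&old, val);
        if new != old {
            for reader in self.readers.get(&key).into_iter().flatten() {
                insn_worklist.push_back(*reader);
            }
            self.ivar_types.insert(key, new);
        }
    }
}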

All of this union-ing and reflowing and graph exploration sounds slow. Even with pretty efficient data structures, there’s a lot of iteration going on. How slow is it really? To answer that, we built some “torture tests” to artificially create worst-case benchmarks.

Testing how it scales: generating torture tests

One of the big challenges when it comes to anything related to compiler design is that it’s difficult to find large, representative benchmarks. There are multiple reasons for this. Large programs tend to come with many dependencies, which makes them hard to distribute, install, and maintain. Some software is closed source or copyrighted. In our case, we’re working with a mini language that we created to experiment, so there are simply no real-world programs written in that language. What can we do?

The first question to ask is: what are we trying to measure? One of our main concerns in implementing and testing this analysis was to know how well it performed in terms of execution time. We would like to be confident that the analysis can cope with large, challenging programs. We know from experience that YJIT compiles over 9000 methods when running Shopify’s production code. If YJIT compiles 9000 “hot” methods, one could guess that the full program contains 10 times more code or more, so let’s say 100,000 methods. Although we don’t have human-crafted programs of that scale for our mini-language, we can generate synthetic programs of a similar scale. We figure that if our analysis can cope with a “torture test” that is designed to be large and inherently challenging, that gives us a good degree of confidence that it could cope with “real” programs.

To generate synthetic test programs, we want to generate a call graph of functions that call each other. Although this isn’t strictly necessary for our type analysis to work, we’d like to generate a program that isn’t infinitely recursive and always terminates. That’s not difficult to achieve because we can write a piece of code that directly generates a Directed Acyclic Graph (DAG). See the random_dag function in the loupe repository described at the end of this post. This function generates a directed graph that has a single “root” node with a number of interconnected child nodes such that there are no cycles between the nodes.
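
The trick random_dag relies on is simple enough to sketch: if every edge goes from a lower-numbered node to a higher-numbered one, the graph cannot contain a cycle. Here is a simplified version of ours; the real one differs in details, and the rand parameter is a stand-in for whatever randomness source you like, assumed to return a uniform value in 0..n:

// Generate call edges such that a function only ever calls higher-numbered
// functions, which rules out recursion (cycles) by construction.
fn random_dag(num_nodes: usize, mut rand: impl FnMut(usize) -> usize) -> Vec<Vec<usize>> {
    let mut callees = vec![Vec::new(); num_nodes];
    for caller in 0..num_nodes.saturating_sub(1) {
        // Each non-final node calls between 1 and 3 higher-numbered nodes.
        for _ in 0..(1 + rand(3)) {
            let callee = caller + 1 + rand(num_nodes - caller - 1);
            callees[caller].push(callee);
        }
    }
    callees // node 0 is the root; nodes with no callees are leaves
}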

For our first torture test (see gen_torture_test), we generated a graph of 200,000 functions that call each other. Some functions are leaf nodes, meaning they don’t call anybody; these functions directly return a constant integer or nil. The functions that have callees sum the return values of their children. If a child returns nil, the caller adds zero to the sum. This means that non-leaf functions contain dynamic branches that depend on type information.

As a second torture test (see gen_torture_test_2), we wanted to evaluate how well our analysis could cope with polymorphic and megamorphic call sites. A polymorphic call site is a function call that has to handle more than one class. A megamorphic call site is one that has to handle a large number of classes, such as 5-10 or more. We started by generating a large number of synthetic classes: we went with 5000 classes because that seemed like a realistic figure for the number of classes that might be contained in a large real-world program. Each class has 10 instance variables and 10 methods, with method names shared across all classes for the sake of convenience (that makes it easier to generate code).

In order to generate polymorphic and megamorphic call sites, we generate an instance of each class, and then we sample a random number of class instances from that set. We use a Pareto distribution to sample the number of classes because we believe this is similar to how real programs are generally structured. That is, most call sites are monomorphic, but a small number of call sites are highly megamorphic. We generate 200 random DAGs with 750 nodes each, and call the root node of each DAG with a random number of class instances. Each DAG then passes the object it receives from the root node through all of its children. This creates a large number of polymorphic and megamorphic call sites. Our synthetic program contains call sites that receive as many as 144 different classes.

The structure of each DAG in the second torture test is similar to the first one, with the difference that each function calls a randomly selected method of the object it receives as a parameter, and then calls its child functions in the DAG. Conveniently, since method names are shared across classes, it’s easy to select a random method that we know by construction is defined on all of our classes. This creates more polymorphic call sites, which is what we wanted to stress-test. The methods of each class are all leaf methods that can either return nil, a random integer, or a randomly selected instance variable.

How does it scale really?

Using the torture test generators, we generated two programs: one with classes and one without.

The program with classes has 175,000 reachable functions of 205,000 total generated, 3 million instructions, and megamorphic (up to 144 classes) method lookups. We complete the analysis in 2.5 seconds on a single core.

The program without classes has 200,000 functions and we analyze it in 1.3 seconds on a single core.

Now, these numbers don’t mean much in absolute terms—people have different hardware, different codebases, etc—but in relative terms they mean that this kind of analysis is more tractable than not. It doesn’t take hours to run. And our analysis is not even particularly well-optimized. We were actually surprised at how fast the analysis runs. Our initial hope was that the analysis could run on a program with 20,000 methods in less than 60 seconds, but we can analyze programs about 10 times that size much faster than expected. This makes it seem likely that the analysis could work on large human-crafted software.

Adding object sensitivity or increasing the k for call-site sensitivity would probably slow things down quite a bit. However, because we know the analysis is so fast, it seems possible to imagine that we could selectively split/specialize call sites of built-in methods to add sensitivity in specific places without increasing the running time by much. For example, in a language with methods on an Array class, such as Ruby, we could do splitting on all Array method calls to increase precision for those highly polymorphic functions.

Wrapping up

Thanks for reading about our little big static type analysis prototype. We published the code on GitHub as a static companion artifact to go with this article and nothing more; it is an experiment that we built, but not a prelude to a bigger project nor is it a tool we expect others to contribute to.

If you would like to read more about the big wide world of program analysis, we recommend searching for terms such as control-flow analysis (CFA) and points-to analysis. Here is an excellent lecture (PDF) by Ondřej Lhoták that gives a tour of the area.