Interprocedural Sparse Conditional Type Propagation
It’s 11 o’clock. Do you know where your variables are pointing?
def shout(obj)
obj.to_s + "!"
end
It’s hard to tell just looking at the code what type obj is. We assume it has a to_s method, but many classes define methods named to_s. Which to_s method are we calling? What is the return type of shout? If to_s doesn’t return a String, it’s really hard to say.
Adding type annotations would help… a little. With types, it looks like we have full knowledge about what each thing is, but we actually don’t. Ruby, like many other object-oriented languages, has this thing called inheritance, which means that type signatures like Integer and String mean an instance of that class… or an instance of a subclass of that class.
Additionally, gradual type checkers such as Sorbet have features such as T.unsafe and T.untyped which make it possible to lie to the type checker. These annotations unfortunately render the type system unsound without run-time type checks, which makes it a poor basis for something we would like to use in program optimization. (For more information, see this blog post for how it affects Python in a similar way.)
In order to build an effective compiler for a dynamic language such as Ruby, the compiler needs precise type information. This means that as compiler designers, we have to take things into our own hands and track the types ourselves.
In this post, we show an interprocedural type analysis over a very small Ruby subset. Such an analysis could be used for program optimization by a sufficiently advanced compiler. This is not something Shopify is working on, but we are sharing this post and the attached analysis code because we think you will find it interesting.
Note that this analysis is not what people usually refer to as a type inference engine or type checker; unlike Hindley-Milner (see also previous writing) or similar constraint-based type systems, this type analysis tracks dataflow across functions.
This analysis might be able to identify all of the callers to shout, determine that to_s returns String for all of the arguments, and therefore conclude that the return type is String. All from an un-annotated program.
The examples after the intro in this post will use more parentheses than the typical Ruby program because we also wrote a mini-Ruby parser and it does not support method calls without parentheses.
Static analysis
Let’s start from the top. We’ll go over some examples and then continue on into code and some benchmarks.
Do you know what type this program returns?
1
That’s right, it’s Integer[1]. Not only is it an Integer, but we have additional information about its exact value available at analysis time. That will come in handy later.
What about this variable? What type is a?
a = 1
Not a trick question, at least not yet. It’s still Integer[1]. But what if we assign to it twice?
a = 1
a = "hello"
Ah. Tricky. Things get a little complicated. If we split our program into segments based on logical execution “time”, we can say that a starts off as Integer[1] and then becomes String["hello"]. This is not super pleasant because it means that when analyzing the code, you have to carry around some notion of “time” state in your analysis. It would be much nicer if instead something rewrote the input code to look more like this:
a0 = 1
a1 = "hello"
Then we could easily tell the two variables apart at any time because they have different names. This is where static single assignment (SSA) form comes in. Automatically converting your input program to SSA introduces some complexity but gives you the guarantee that every variable has a single unchanging type. This is why we analyze SSA instead of some other form of intermediate representation (IR). Assume for the rest of this post that we are working with SSA.
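If you haven’t seen SSA written down as a data structure before, here is a rough sketch in Rust (the language our analysis is written in) of what such an IR might look like. The type and field names here are made up for illustration; the companion code has its own, richer versions of these.
// Minimal SSA-style IR sketch. Every value is defined by exactly one
// instruction, and phi instructions list the values they merge.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct InsnId(usize);
#[derive(Clone, Copy, Debug)]
struct BlockId(usize);
#[derive(Debug)]
enum Opcode {
    Const(i64),              // a0 = 1
    Add(InsnId, InsnId),     // c0 = a0 + b0
    Phi(Vec<InsnId>),        // a2 = phi(a0, a1)
    IfTrue { cond: InsnId, then_block: BlockId, else_block: BlockId },
    Return(InsnId),
}
#[derive(Debug)]
struct Insn {
    block: BlockId, // the basic block this instruction lives in
    op: Opcode,
}
#[derive(Debug)]
struct Function {
    insns: Vec<Insn>, // InsnId(i) names the value produced by insns[i]
}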
Let’s continue with our analysis.
What types do the variables have in the below program?
a = 1
b = 2
c = a + b
We know a and b because they are constants, so can we constant-fold a+b into 3? Kind of. Sort of. In Ruby, without global knowledge that someone has not and will not patch the Integer class or do a variety of other nasty things, no.
But let’s pretend for the duration of this post that we live in a world where it’s not possible to redefine the meaning of existing classes (remember, we’re looking at a Ruby-like language with different semantics but similar syntax) or add new classes at run-time (this is called a closed-world assumption). In that case, it is absolutely possible to fold those constants. So c has type Integer[3].
Let’s complicate things.
if condition
a = 3
else
a = 4
end
We said that each variable would only be assigned once, but SSA can represent such a program using Φ (phi) nodes. Phi nodes are special pseudo-instructions that track dataflow when it could come from multiple places. In this case, SSA would place one after the if to merge two differently named variables into a third one.
if condition
a0 = 3
else
a1 = 4
end
a2 = phi(a0, a1)
This also happens when using the returned value of an if expression:
a = if condition
3
else
4
end
The phi function exists to merge multiple input values.
For our analysis, the phi node does not do anything other than compute the type union of its inputs. We do this because we are treating a type as a set of all possible values that it could represent. For example, Integer[3] is the set {3}. And Integer is the infinite and difficult to fit into memory set {..., -2, -1, 0, 1, 2, ...}.
This makes the type of a (a2) some type like Integer[3 or 4], but as we saw, that set could grow potentially without bound. In order to use a reasonable amount of memory and make sure our analysis runs in a reasonable amount of time, we have to limit our set size. This is where the notion of a finite-height lattice comes in. Wait, don’t click away! It’s going to be okay!
We’re using a lattice as a set with a little more structure to it. Instead of having a union operation that just expands and expands and expands, we give each level of set a limited number of entries before it overflows to the next, less-specific level. It’s kind of like a finite state machine. This is a diagram of a subset of our type lattice:
All lattice elements in our program analysis start at Empty (unreachable) and incrementally add elements, following the state transition arrows as they grow. If we see one constant integer, we can go into the Integer[N] state. If we see another, different integer, we have to lose some information and transition to the Integer state. This state symbolically represents all instances of the Integer class. Losing information like this is a tradeoff between precision and analysis time.
To bring this back to our example, this means that a (which merges Integer[3] and Integer[4]) would have the type Integer in our lattice.
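To make the lattice a bit more concrete, here is a hedged Rust sketch of a union operation over a simplified version of it (just Empty, integer constants, Integer, and Any; the real lattice in the companion code has more levels and more classes):
// Simplified type lattice sketch: Empty <= Int(n) <= Integer <= Any.
// Illustrative only, not the exact lattice from the analysis.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Type {
    Empty,    // unreachable; no values
    Int(i64), // a known constant integer, e.g. Integer[3]
    Integer,  // any instance of Integer
    Any,      // could be anything
}
fn union(a: Type, b: Type) -> Type {
    use Type::*;
    match (a, b) {
        // Empty is the identity element: union with it changes nothing.
        (Empty, x) | (x, Empty) => x,
        // Two identical constants stay constant...
        (Int(x), Int(y)) if x == y => Int(x),
        // ...but two different constants overflow to the Integer level.
        (Int(_), Int(_)) => Integer,
        // A constant joined with the Integer level is still Integer.
        (Int(_), Integer) | (Integer, Int(_)) | (Integer, Integer) => Integer,
        // Anything else loses all precision.
        _ => Any,
    }
}
// union(Int(3), Int(4)) == Integer, matching the a2 = phi(a0, a1) example.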
Let’s complicate things further. Let’s say we know somehow at analysis time that the condition is truthy, perhaps because it’s an inline constant:
a = if true
3
else
4
end
Many analyses looking at this program would see two inputs to a with different values and therefore continue to conclude that the type of a is still Integer—even though as humans looking at it, we know that the else branch never happens. It’s possible to do a simplification in another analysis that deletes the else branch, but instead we’re going to use some excellent work by Zadeck and co called Sparse Conditional Constant Propagation (SCCP).
Sparse conditional constant propagation
Unlike many abstract interpretation based analyses, SCCP uses type information to inform its worklist-based exploration of the control-flow graph (CFG). If it knows from other information that the condition of a branch instruction is a constant, it does not explore both branches of the conditional. Instead, it only pushes the relevant branch onto the worklist.
Because we’re working inside an SSA control-flow graph (CFG), we separate control flow into basic blocks as our unit of granularity. These basic blocks are chunks of instructions where the only instruction allowed to do control flow is the last one.
fn sctp(prog: &Program) -> AnalysisResult {
// ...
while block_worklist.len() > 0 || insn_worklist.len() > 0 {
// Read an instruction from the instruction worklist
while let Some(insn_id) = insn_worklist.pop_front() {
let Insn { op, block_id, .. } = &prog.insns[insn_id.0];
if let Op::IfTrue { val, then_block, else_block } = op {
// If we know statically we won't execute a branch, don't
// analyze it
match type_of(val) {
// Empty represents code that is not (yet) reachable;
// it has no value at run-time.
Type::Empty => {},
Type::Const(Value::Bool(false)) => block_worklist.push_back(*else_block),
Type::Const(Value::Bool(true)) => block_worklist.push_back(*then_block),
_ => {
block_worklist.push_back(*then_block);
block_worklist.push_back(*else_block);
}
}
continue;
};
}
// ...
}
// ...
}
This leaves us with a phi node that only sees one input operand, Integer[3], which gives us more precision to work with in later parts of the program. The original SCCP paper stops here (papers have page limits, after all) but we took it a little further. Instead of just reasoning about constants, we use our full type lattice. And we do it interprocedurally.
Let’s look at a small example of why interprocedural analysis matters before we move on to trickier snippets. Here we have a function decisions with one visible call site, and that call site passes in true:
def decisions(condition)
if condition
3
else
4
end
end
decisions(true)
If we were just looking at decisions in isolation, we would still think the return type is Integer. However, if we let information from all the call sites flow into the function, we can see that all (one) of the call sites pass true to the function… and therefore we should only look at one branch of the if.
Now, a reader familiar with SCCP might be wondering how this works interprocedurally. SCCP by definition requires knowing in advance which instructions use which other instructions: if you learn new facts about the output of instruction A, you have to propagate this new information to all of its uses. In a single function’s control-flow graph, this isn’t so bad; we have full visibility into definitions and uses. It gets harder when we expand to multiple functions. In this example, we have to mark the condition parameter as a use of all of the (currently constant) actual arguments being passed in.
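As a sketch of what that extra bookkeeping could look like (with invented names; the companion code organizes this differently), think of a map from each parameter to the set of arguments flowing into it, plus a reverse “uses” map that drives re-analysis when an argument’s type changes:
use std::collections::{HashMap, HashSet};
// Illustrative instruction IDs; the real analysis has its own types.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct InsnId(usize);
// When we discover a call edge, every actual argument gains the callee's
// parameter as a use, so a new fact about the argument re-enqueues the
// parameter (and therefore the callee's blocks).
#[derive(Default)]
struct CallEdges {
    // parameter instruction -> all argument instructions flowing into it
    param_inputs: HashMap<InsnId, HashSet<InsnId>>,
    // argument instruction -> all parameter instructions it feeds (its uses)
    uses: HashMap<InsnId, HashSet<InsnId>>,
}
impl CallEdges {
    fn add_call_edge(&mut self, args: &[InsnId], params: &[InsnId]) {
        for (&arg, &param) in args.iter().zip(params.iter()) {
            self.param_inputs.entry(param).or_default().insert(arg);
            self.uses.entry(arg).or_default().insert(param);
        }
    }
}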
But how do we know the callers?
Interprocedural SCCP
Let’s start at the entrypoint for an application. That’s normally a main function somewhere that allocates some objects and calls a couple of other functions. These functions might in turn call other functions, and so on and so forth, until the application terminates.
These calls and returns form a graph, but we don’t know it statically—we don’t know it at the start of the analysis. Instead, we have to incrementally build it as we discover call edges.
In the following code snippet, we would begin analysis at the entrypoint, which in this snippet is the main function. In it, we see a direct call to the foo function. We mark that foo is called by main—and not just by main, but by the specific call site inside main. Then we enqueue the start of the foo function—its entry basic block—onto the block worklist.
def bar(a, b)
a + b
end
def foo()
bar(1, 2) + bar(3, 4)
end
def main()
foo()
end
At some point, the analysis will pop the entry basic block of foo off the worklist and analyze foo. For each of the direct calls to bar, it will create a call edge. In addition, because we are passing arguments, it will wire up 1 and 3 to the a parameter and 2 and 4 to the b parameter. It will enqueue bar’s entry block.
At this point, we’re merging Integer[1] and Integer[3] at the a parameter (and similarly at b). This is kind of like an interprocedural phi node and we have to do the same union operation on our type lattice.
This means that we won’t be able to fold a+b for either call to bar, unfortunately, but we will still get a return type of Integer, because we know that Integer+Integer=Integer.
Now, if there were a third call to bar that passed it Strings, every call site would lose precision. We would end up with ClassUnion[String, Integer] at each parameter and, worse, Any as the function result. We wouldn’t even get ClassUnion[String, Integer] because we don’t keep each call site separate, so from the perspective of the analysis, we could be looking at String+Integer, which doesn’t have a known type (in fact, it probably raises an exception or something).
But what if we kept each call site separate?
Sensitivity
This kind of thing is generally called something-sensitivity, where the something depends on what your strategy is to partition your analysis. One example of sensitivity is call-site sensitivity.
In particular, we might want to extend our current analysis with 1-call-site-sensitivity. The number, the k variable that we can dial for more precision and slower analysis, is the number of “call frames” we want to keep track of in the analysis. This stuff is good for very commonly used library functions such as to_s and each, where each caller might be quite different.
In the above very not-representative example, 1-call-site-sensitivity would allow us to completely constant fold the entire program into Integer[10] (as 1 + 2 + 3 + 4 = 10). Wow! But it would slow down the analysis because it requires duplicating analysis work. To compare the rough steps side by side:
Without call-site sensitivity / 0-call-site-sensitive (what we currently have):
- See call to bar with arguments 1 and 2
- Mark bar’s parameters as being Integer[1] and Integer[2]
- See bar add node with constant left and right operands
- Mark bar add result as Integer[3]
- Mark bar return as Integer[3]
- Mark result of bar(1, 2) as Integer[3]
- See call to bar with arguments 3 and 4
- Mark bar’s parameters as being Integer and Integer (we have to union)
- Mark bar add result as Integer (the arguments are not constant)
- Mark bar return as Integer
- See foo’s own add with operands Integer and Integer
- Mark foo’s add as returning Integer
With 1-call-site-sensitive:
- See call to bar from function foo with arguments 1 and 2
- Make a new call context from foo
- Mark foo0->bar parameters as being Integer[1] and Integer[2]
- See foo0->bar add node with constant left and right operands
- Mark foo0->bar add result as Integer[3]
- Mark foo0->bar return as Integer[3]
- Mark result of bar(1, 2) as Integer[3]
- See call to bar with arguments 3 and 4
- Make a new call context from foo
- Mark foo1->bar parameters as being Integer[3] and Integer[4]
- Mark foo1->bar add result as Integer[7]
- Mark foo1->bar return as Integer[7]
- See foo’s own add with constant operands Integer[3] and Integer[7]
- Mark foo’s add as returning Integer[10]
See how we had to analyze bar once per call site instead of merging call inputs and returns and moving up the lattice? That slows the analysis down.
There is also context sensitivity, which is about partitioning calls based on some computed property of a given call site instead of where it is in the program. Maybe it’s the tuple of argument types, or the tuple of argument types with any constant values removed, or something else entirely. Ideally it should be fast to generate and to compare across call sites.
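If we wanted to add one of these sensitivities, the main change is what we key the per-function analysis state on. Here is a hedged sketch of what such keys could look like; the names are invented for illustration and are not part of the companion code:
// Illustrative context keys for different sensitivities. In the
// 0-call-site-sensitive analysis, the key is effectively just the function.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct FunId(usize);
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct CallSiteId(usize);
// 1-call-site-sensitive: state is keyed by (caller call site, callee), so
// bar analyzed from foo's first call is separate from foo's second call.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct CallStringContext {
    call_site: CallSiteId,
    callee: FunId,
}
// Context-sensitive (one flavor): key by some computed property of the call,
// for example the tuple of argument classes with constant values stripped.
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
struct ArgTypeContext {
    callee: FunId,
    arg_classes: Vec<String>, // e.g. ["Integer", "String"]
}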
There are other kinds of sensitivity like object sensitivity, field sensitivity, and so on—but since this is a bit of a detour in the main article and we did not implement any of them, we instead leave them as breadcrumbs for you to follow and read about.
Let’s go back to the main interprocedural SCCP and add some more trickiness into the mix: objects.
Objects and method lookup
Ruby doesn’t just deal with integers and strings. Those are special cases of a larger object system where objects have instance variables, methods, and so on, and are instances of user-defined classes.
class Point
attr_accessor :x
attr_accessor :y
def initialize(x, y)
@x = x
@y = y
end
end
p = Point.new(3, 4)
puts(p.x, p.y)
This means that we have to start tracking all classes in our static analysis or we will have a hard time being precise when answering questions such as “what type is the variable p?”
Knowing the type of p is nice—maybe we can fold some is_a? branches in SCCP—but the analysis becomes even more useful if we can keep track of the types of instance variables on objects. That would let us answer the question “what type is p.x?”
Per this paper (PDF), there are at least two ways to think about how we might store that kind of information. One, which the paper calls field-based, unifies the storage of field types based on their name. So in this case, all potential writes to any field x might fall into the same bucket and get unioned together.
Another, which the paper calls field-sensitive, unifies the storage of field types based on the receiver (object holding the field) class. In this case, we would differentiate all possible types of p at a given program point when writing to and reading from p.x.
We chose to do the latter approach in our static analysis: we made it field sensitive.
fn sctp(prog: &Program) -> AnalysisResult {
// ...
while block_worklist.len() > 0 || insn_worklist.len() > 0 {
// Read an instruction from the instruction worklist
while let Some(insn_id) = insn_worklist.pop_front() {
let Insn { op, block_id, .. } = &prog.insns[insn_id.0];
// ...
match op {
Op::GetIvar { self_val, name } => {
let result = match type_of(self_val) {
Type::Object(classes) => {
// ...
classes.iter().fold(Type::Empty, |acc, class_id| union(&acc, &ivar_types[class_id][name]))
}
ty => panic!("getivar on non-Object type {ty:?}"),
};
result
}
}
}
}
}
This means that we have to do two things: 1) keep track of field types for each instance variable (ivar) of each class and then 2) at a given ivar read, union all of the field types from all of the potential classes of the receiver.
Unfortunately, it also creates a complicated uses relationship: any GetIvar instruction is a use of all possible SetIvar instructions that could affect it. This means that if we see a SetIvar that writes to T.X for some class T and field name X, we have to go and re-analyze all of the GetIvars that could read from that class (and propagate this information recursively to the other uses, as usual).
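For the write side, here is a sketch of what handling SetIvar might look like, mirroring the style of the GetIvar snippet above. The operand names and the helper enqueue_getivar_uses are illustrative, not the exact companion code:
// Sketch of the SetIvar side. Assumes ivar_types maps class -> ivar name
// -> Type, and enqueue_getivar_uses is a hypothetical helper that pushes
// every GetIvar reading this class/name pair back onto the worklist.
Op::SetIvar { self_val, name, val } => {
    let val_type = type_of(val);
    match type_of(self_val) {
        Type::Object(classes) => {
            // The receiver could be any of these classes, so the write
            // affects the ivar type of every one of them.
            for class_id in classes.iter() {
                let slot = ivar_types.entry(*class_id).or_default()
                                     .entry(name.clone()).or_insert(Type::Empty);
                let new_type = union(slot, &val_type);
                if new_type != *slot {
                    *slot = new_type;
                    // Every GetIvar that could read this class/name pair is
                    // a use of this write; re-enqueue them so the new type
                    // flows to them and, recursively, to their uses.
                    enqueue_getivar_uses(&mut insn_worklist, *class_id, name);
                }
            }
        }
        ty => panic!("setivar on non-Object type {ty:?}"),
    }
}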
All of this union-ing and reflowing and graph exploration sounds slow. Even with pretty efficient data structures, there’s a lot of iteration going on. How slow is it really? To answer that, we built some “torture tests” to artificially create some worst-case benchmarks.
Testing how it scales: generating torture tests
One of the big challenges when it comes to anything related to compiler design is that it’s difficult to find large, representative benchmarks. There are multiple reasons for this. Large programs tend to come with many dependencies, which makes them hard to distribute, install, and maintain. Some software is closed source or copyrighted. In our case, we’re working with a mini language that we created to experiment, so there are simply no real-world programs written in that language. What can we do?
The first question to ask is: what are we trying to measure? One of our main concerns in implementing and testing this analysis was to know how well it performed in terms of execution time. We would like to be confident that the analysis can cope with large challenging programs. We know from experience that YJIT compiles over 9000 methods when running Shopify’s production code. If YJIT compiles 9000 “hot” methods, then one could guess that the full program might contain 10 times more code or more, so let’s say 100,000 methods. As such, we figured that although we don’t have human-crafted programs of that scale for our mini-language, we could generate some synthetic programs that have a similar scale. We figure that if our analysis can cope with a “torture test” that is designed to be large and inherently challenging, that gives us a good degree of confidence that it could cope with “real” programs.
To generate synthetic test programs, we want to generate a call graph of functions that call each other. Although this isn’t strictly necessary for our type analysis to work, we’d like to generate a program that isn’t infinitely recursive and always terminates. That’s not difficult to achieve because we can write a piece of code that directly generates a Directed Acyclic Graph (DAG). See the random_dag function in the loupe repository described at the end of this post. This function generates a directed graph that has a single “root” node with a number of interconnected child nodes such that there are no cycles between the nodes.
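The trick for keeping the graph acyclic is simple: if edges only ever point from lower-numbered nodes to higher-numbered nodes, no cycle can form. Here is a hedged sketch of that idea in Rust; it is an illustration, not the actual random_dag implementation:
// Sketch of random DAG generation. Because every edge goes from a
// lower-numbered node to a higher-numbered one, cycles are impossible,
// and giving every non-root node a smaller-numbered caller keeps the
// whole graph reachable from node 0 (the "root").
fn random_dag(num_nodes: usize, mut rand_below: impl FnMut(usize) -> usize) -> Vec<Vec<usize>> {
    // callees[i] lists the children that node i calls
    let mut callees = vec![Vec::new(); num_nodes];
    for node in 1..num_nodes {
        let caller = rand_below(node); // expected to return an index in 0..node
        callees[caller].push(node);
    }
    callees
}
// Example use with a (not very random) stand-in for a real RNG:
// let dag = random_dag(200_000, |n| n / 2);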
For our first torture test (see gen_torture_test), we generated a graph of 200,000 functions that call each other. Some functions are leaf nodes, meaning they don’t call anybody; these functions directly return a constant integer or nil. The functions that have callees sum the return values of their children. If a child returns nil, they add zero to the sum. This means that non-leaf functions contain dynamic branches that depend on type information.
As a second torture test (see gen_torture_test_2), we wanted to evaluate how well our analysis could cope with polymorphic and megamorphic call sites. A polymorphic call site is a function call that has to handle more than one class. A megamorphic call site is one that has to handle a large number of classes, such as 5-10 or more. We started by generating a large number of synthetic classes; we went with 5000 classes because that seemed like a realistic figure for the number of classes that might be contained in a large real-world program. Each class has 10 instance variables and 10 methods, and for the sake of convenience the method names are the same across classes (that makes it easier to generate code).
In order to generate polymorphic and megamorphic call sites, we generate an instance of each class, and then we sample a random number of class instances from that set. We use a Pareto distribution to sample the number of classes because we believe this is similar to how real programs are generally structured. That is, most call sites are monomorphic, but a small number of call sites are highly megamorphic. We generate 200 random DAGs with 750 nodes each, and call the root node of each DAG with a random number of class instances. Each DAG then passes the object it receives from the root node through all of its children. This creates a large number of polymorphic and megamorphic call sites. Our synthetic program contains call sites that receive as many as 144 different classes.
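To make the Pareto sampling concrete, here is a small sketch using inverse transform sampling; the shape parameter and the clamping are illustrative and may differ from the real generator:
// Sketch: sample how many classes a call site should see, using inverse
// transform sampling of a Pareto distribution. With shape alpha, a uniform
// u in (0, 1] maps to scale / u^(1/alpha). Most samples land near 1
// (monomorphic call sites); the long tail produces megamorphic ones.
fn sample_num_classes(u: f64, alpha: f64, max_classes: usize) -> usize {
    let raw = 1.0 / u.powf(1.0 / alpha); // Pareto with scale = 1
    (raw.floor() as usize).clamp(1, max_classes)
}
// For example, with alpha = 1.2 and 5000 classes available:
// u = 0.9  -> 1 class (monomorphic)
// u = 0.01 -> dozens of classes (megamorphic)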
The structure of each DAG in the second torture test is similar to the first one, with the difference that each function calls a randomly selected method of the object it receives as a parameter, and then calls its child functions in the DAG. Conveniently, since the method names are the same for each class, it’s easy to select a random method that we know by construction is defined on all of our classes. This creates more polymorphic call sites, which is what we wanted to stress-test. The methods of each class are all leaf methods that can either return nil, a random integer, or a randomly selected instance variable.
How does it scale really?
Using the torture test generators, we generated two programs: one with classes and one without.
The program with classes has 175,000 reachable functions of 205,000 total generated, 3 million instructions, and megamorphic (up to 144 classes) method lookups. We complete the analysis in 2.5 seconds on a single core.
The program without classes has 200,000 functions and we analyze it in 1.3 seconds on a single core.
Now, these numbers don’t mean much in absolute terms—people have different hardware, different codebases, etc—but in relative terms they mean that this kind of analysis is more tractable than not. It doesn’t take hours to run. And our analysis is not even particularly well-optimized. We were actually surprised at how fast the analysis runs. Our initial hope was that the analysis could run on a program with 20,000 methods in less than 60 seconds, but we can analyze programs about 10 times that size much faster than expected. This makes it seem likely that the analysis could work on large human-crafted software.
Adding object sensitivity or increasing the k for call-site sensitivity would probably slow things down quite a bit. However, because we know the analysis is so fast, it seems possible to imagine that we could selectively split/specialize call sites of built-in methods to add sensitivity in specific places without increasing the running time by much. For example, in a language with methods on an Array class, such as Ruby, we could do splitting on all Array method calls to increase precision for those highly polymorphic functions.
Wrapping up
Thanks for reading about our little big static type analysis prototype. We published the code on GitHub as a static companion artifact to go with this article and nothing more; it is an experiment that we built, but not a prelude to a bigger project nor is it a tool we expect others to contribute to.
If you would like to read more about the big wide world of program analysis, we recommend searching for terms such as control-flow analysis (CFA) and points-to analysis. Here is an excellent lecture (PDF) by Ondřej Lhoták that gives a tour of the area.