How Mips2cs works

This page goes through some technical details of how Mips2cs works. The aim is to give a sort of overview of the system, and to provide some "context" for those who may wish to go diving into the source code.

It is assumed that the reader has some familiarity with MIPS assembly language and with Haskell and C# programming.

Overview

The basic "pipeline" of Mips2cs is summarized in the following diagram:

Diagram showing the Mips2cs pipeline

The sequence of events during a typical run is as follows:

  1. Open the input file (as a binary file) and read the contents into a ByteString.
  2. Parse out the ELF headers and the code and data sections (we use an external library for this).
  3. Disassemble the code. (We define an "Insn" datatype to represent the disassembled MIPS instructions.)
  4. Convert the assembly code into an "Intermediate" representation; this serves as a higher-level representation than the raw machine code, and is described in detail below.
  5. Optimize the Intermediate code, by performing a number of transformation passes on it.
  6. Convert the Intermediate code to C#, and write the output file.

The following sections give more details.

The "Intermediate" representation

The Intermediate representation is comprised of three main elements: Exprs, CondExprs and Statements. A full definition of these can be found in the source code (in module Mips2cs.Intermediate); here we give a brief description of each.

Expr

Represents an arbitrary integer expression. Examples:

CondExpr

Represents a condition (boolean expression). Examples:

Statement

Represents an operation performed by the MIPS machine. Examples include:

Variables

The Intermediate language contains the concept of variables. Variables are created by the Let statement and can be used in expressions via the VarExpr construct (see above).

Variables must follow two rules:

Variables are used extensively during the optimization passes of Mips2cs (this will be discussed further in the section on optimization, below).

Translating the MIPS machine code into Intermediate code

The idea is that the entire MIPS machine code program can be converted into an equivalent program in the Intermediate language. To do this, each MIPS instruction is converted into zero or more Statements. This is usually straightforward. For example:

Branch instructions

Branches require special handling because of the MIPS delay slot mechanism. Mips2cs creates two Statements for a branch instruction: one for the delay slot, and one for the branch itself. In addition, any registers referenced in the branch condition are saved (using Let statements) to ensure that the correct value gets used.

For example, the following disassembly:

  400084:       1060000c        beqz    v1,4000b8
  400088:       24630001        addiu   v1,v1,1

would be translated to the following three Statements:

  Let "delay1" (LoadRegExpr "v1")
  StoreReg "v1" (BinExpr Add (LoadRegExpr "v1") (LitExpr 1))
  Jump (BinCond Equal (VarExpr "delay1") (LitExpr 0)) 0x4000b8 0x40008c

Note that the Let is essential in this example; without it, the wrong value of v1 would be used in the branch condition.

Basic block analysis

After generating the Statements for each machine code instruction, we organize them into basic blocks. (A basic block is a chunk of code that has exactly one entry point and one exit point.)

In Mips2cs basic blocks are represented by

  1. an address (the start address of the block in the original machine code)
  2. a sequence of Statements.

The following invariants are maintained:

  1. The final Statement of a block is either a Jump or an IndirectJump.
  2. All other Statements in the block are non-jump statements.

The results of the basic block analysis are used extensively during the optimization and code generation passes.

Optimization

Once the Intermediate code has been produced, Mips2cs runs a series of optimization passes on it, in order to make it shorter and simpler. (This results in a shorter and faster C# program at the end of the process.)

The optimizations are done by the module Mips2cs.Simplifier, which runs several optimization passes, as follows.

Expression simplification

We run a pass to simplify Exprs at compile time. This uses simple pattern matching to calculate compile time constants (e.g. "1 + 2" is changed into "3") and apply arithmetic identities (e.g. "x + 0" is changed into "x", for any expression x).

Synthesizing many small expressions into larger ones

A naive translation of machine code into C# would tend to produce sequences like the following:

  v0 = s0 + s1;
  v0 = v0 + s2;
  v0 = v0 + 10;

(where each line corresponds to a single machine code instruction). However, it would be more natural to write this in a single line, as follows:

  v0 = s0 + s1 + s2 + 10;

Not only does this result in a shorter C# program, it also results in a smaller final compiled executable (at least when using the Microsoft C# compiler – which appears to generate a more efficient bytecode sequence for the second form of the code, as compared to the first).

Therefore, we introduce two optimization passes which are designed to turn the first form of the code into the second. These are as follows:

Converting StoreReg statements into Let statements

If a register is written multiple times, all writes bar the last can be changed into local variables (i.e. Let statements). For example, the following:

  StoreReg "v0" (BinExpr Add (LoadRegExpr "s0") (LoadRegExpr "s1"))
  StoreReg "v0" (BinExpr Add (LoadRegExpr "v0") (LoadRegExpr "s2"))
  StoreReg "v0" (BinExpr Add (LoadRegExpr "v0") (LitExpr 10))

can be changed into:

  Let "dummy_var_1" (BinExpr Add (LoadRegExpr "s0") (LoadRegExpr "s1"))
  Let "dummy_var_2" (BinExpr Add (VarExpr "dummy_var_1") (LoadRegExpr "s2"))
  StoreReg "v0" (BinExpr Add (VarExpr "dummy_var_2") (LitExpr 10))

Substitution

Because "Lets" are, by definition, local variables, it is legitimate to substitute the right-hand side of the Let directly into the expression where the variable is used, and then delete the Let. So, continuing our example, we may first substitute for dummy_var_1 to obtain:

  Let "dummy_var_2" (BinExpr Add (BinExpr Add (LoadRegExpr "s0") (LoadRegExpr "s1")) (LoadRegExpr "s2"))
  StoreReg "v0" (BinExpr Add (VarExpr "dummy_var_2") (LitExpr 10))

and then substitute for dummy_var_2 to obtain:

  StoreReg "v0" (BinExpr Add (BinExpr Add (BinExpr Add (LoadRegExpr "s0") (LoadRegExpr "s1")) (LoadRegExpr "s2")) (LitExpr 10))

We have now arrived at the desired form of the code, i.e. one single assignment instead of three separate assignments.

Note that there are some cases where we do not want to substitute out a Let statement. These are as follows:

Complex expressions

If we have something like

  Let "dummy_var_1" (some huge complex expression)
  StoreReg "v0" (VarExpr "dummy_var_1")
  StoreReg "v1" (VarExpr "dummy_var_1")

then we don't want to substitute for dummy_var_1, because we will then be evaluating the huge complex expression more than once. Therefore, before doing a substitution, Mips2cs checks that it is not going to result in repeated calculation being done.

Data hazards

Sometimes it is not safe to substitute a variable, e.g. in

  Let "foo" (LoadRegExpr "v1")
  StoreReg "v1" (LitExpr 1)
  StoreReg "v2" (VarExpr "foo")

we cannot substitute (LoadRegExpr "v1") in place of (VarExpr "foo"), because the value of v1 has changed in the interim. We refer to this situation as a data hazard, and in such cases, substitution is not performed.

Liveness analysis

We say that a register is "dead" if its value is no longer needed after the current basic block completes. In other words, future basic blocks will not read from that register (or at least, they will always write to it before they read it).

Mips2cs includes a liveness analysis pass which can detect dead registers in many cases. If a dead register is found, the StoreReg statement (where the dead value gets written) is changed into a Let. The substitution pass is then run again; this often allows the Let to be substituted into other expressions, or even removed entirely.

A common example is the MIPS multiply instruction which writes a 64-bit result to the hi and lo registers. In many cases, the hi value is never used. In these cases, the liveness analysis (and following substitution pass) is usually able to remove the assignment to hi entirely.

There are also examples where liveness analysis does not remove a value entirely, but it does lead to further optimizations being made. For an example see the Worked example.

Code generation

Once we have our optimized Intermediate code we can use it to generate the final C# code. By this point, all of the hard optimization work has been done, so we can translate the Intermediate code into C# in a relatively straightforward and mechanical way.

An example of the C# code generated by Mips2cs is given in cs_example.html.

Points to note are the following:

Worked example

If you want to see how the above works in a real example then you may want to look at the Worked example page. This illustrates the complete pipeline (including translation to Intermediate code, optimization, and C# code generation) for a simple factorial function.