The compiling process

What are compilers?

Software is written in a variety of languages:

  • Traditional imperative languages such as COBOL, FORTRAN, Pascal or C.
  • Object oriented languages, such as C++, Smalltalk or Java.
  • Functional or logic languages such as LISP, Erlang or Prolog
  • ...

Compilers are essentially sophisticated text-processing systems, whose task is usually viewed as consisting of two stages:

  • Analysis stage, in which the input text is analyzed
  • Synthesis stage, during which the machine-oriented representation is generated.

Compiler = translates information from one representation to another
input = program source code

Usually,
Translators = transform at the same level of abstraction
Compilers = transform from high-level source code to low-level (object) code

Examples

Typical: gcc, javac

Non-typical compiler:

  • latex: document compiler, transforms a document into DVI printing commands. Input = document (not program)
  • C-to-Silicon compiler: generates hardware circuits for C programs, output is lower-level than that of typical compilers

Translators:

  • f2c: Fortran to C (both high-level)
  • latex2html (both document)
  • dvi2ps: DVI-to-PostScript (both low-level)

Why do we need compilers?

Programs written in assembly language are difficult to write, debug, maintain and understand.

Tremendous increase in productivity when compilers appeared (in the 1950s)

There are still a few cases where it is better to use assembly-level code:

  • access to low-level resources, as in device drivers

Such code fragments are usually very small, and the compiler handles the rest of the code in the application.

Efficiency

Goal: Generate machine code which describes the same computation as the source code

Unique translation? => No!
Best translation? => No! (Best = smallest or fastest?)
Compiler optimizations = find better translations
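To make "no unique translation" concrete, here is a small sketch in C. It takes the expression used later in these notes, 4*n*n*(n+1)*(n+1), and writes it in two ways that describe the same computation. The function names are invented for this illustration, and the second form is a hand-made rewrite of the kind an optimizer looks for, not the output of any particular compiler.

```c
#include <assert.h>

/* Direct form: four multiplications and two additions. */
int expr_direct(int n)
{
	return 4 * n * n * (n + 1) * (n + 1);
}

/* Same value written as (2*n*(n+1))^2: three multiplications, one addition. */
int expr_factored(int n)
{
	int t = 2 * n * (n + 1);
	return t * t;
}

/* Checks that both forms agree on a small range (no overflow there). */
void check_same_computation(void)
{
	for (int n = -20; n <= 20; n++)
		assert(expr_direct(n) == expr_factored(n));
}
```

Both functions are valid translations of the same source expression. Which one is "best" depends on whether we count operations, code size or something else.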

Qualities we would like a compiler to have:

  1. efficient compilation
  2. minimal compiler size
  3. minimal size of object code
  4. production of efficient object code
  5. ease of portability
  6. ease of maintenance
  7. good usability, including good error diagnostics and error recovery

It is hard, or even impossible, to satisfy all of these at once: we have to choose and find good compromises.
That is why we teach compiler concepts!

  • (5) and (6) conflict with (2)
  • (4) conflicts with (1)
  • Teaching environment: (1) and (7) are important
  • Embedded systems: (3) and (4) are important

Compilers have flags to control how they operate, for instance which optimizations are applied (e.g. gcc's -O0 for no optimization, -O2 for speed, -Os for size).

Correctness

The generated code must execute precisely the same computation as the source code

  • Hard to debug programs with a broken compiler
  • Implications for development costs and security
  • Compiler courses: study techniques to ensure correct translation

A language compiler may also be referred to as an implementation of the language.
The object code can be in the form of machine code or assembly code, or possibly some intermediate code (to be further transformed).
Alternatively, the intermediate code may be directly executed by means of an interpreter.

Analysis: rather general (good for automation)
Synthesis: machine dependent

Automation is helped by tools such as Lex and Yacc.

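A compiler is itself implemented in a specific language. A T-diagram shows the source language (left), the target language (right) and the implementation language (bottom):

  Java ------- Bytecode
          |
          C

  Erlang ----- BEAM bytecode
          |
        Erlang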

T-diagrams can be combined: one compiler can be obtained from another by compiling it with a third one.

This is used for porting a compiler from one machine to another.


General Structure

High-level source code -> Compiler -> Low-level machine code

Source code: Optimized for human readability
Matches human notions of grammar, uses constructs with names, such as variables and procedures/function calls.

int expr(int n)
{
	int d;
	d = 4*n*n*(n+1)*(n+1);
	return d;
}

Assembly and Machine Code: Optimized for hardware
Consists of machine instructions, uses registers and unnamed memory locations
Much harder for humans to understand

Structure

Highly modular in design

Logically: compilation process is divided into stages, which in turn are divided into phases.
Physically: compiler is divided into passes.

  • 0. Pre-processing
  • 1. Analysis stage
    • Lexical analysis
    • Syntax analysis
    • Semantic analysis
  • 2. Synthesis stage

Analysis

Lexical analysis => Syntax analysis => Semantic analysis

Synthesis

Machine independent code generation => Optimization of machine independent code =>
Storage allocation =>
Machine code generation => Optimization of machine code

Lexical analysis

Relatively simple: symbols (or tokens) are formed, for example:

Keywords: for, do, while
Identifiers: name, salary, counter
Operators: ++, ==

Reads characters from the input stream and replaces them with language symbols.

A meaningless sequence of valid symbols, such as

; number int return do == ++

should still pass as "correct".
The lexical analyzer has no context to work with: it doesn't know which symbols have already been processed or which ones will follow.

Simple => easy to automate
Can be time-consuming (since we have more characters than symbols)
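The way characters are grouped into symbols can be sketched with a small hand-written scanner in C. This is only an illustration under assumptions: the names `Token`, `TokenKind` and `next_token`, the keyword list, and the 32-character limit are all made up here, and only the two-character operators ++ and == are recognized.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

typedef enum { TOK_KEYWORD, TOK_IDENT, TOK_NUMBER, TOK_OP, TOK_END } TokenKind;

typedef struct {
	TokenKind kind;
	char text[32];   /* spelling of the symbol, truncated if too long */
} Token;

/* Hypothetical keyword list, just enough for the examples in these notes. */
static int is_keyword(const char *s)
{
	static const char *const keywords[] = { "for", "do", "while", "int", "return" };
	for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
		if (strcmp(s, keywords[i]) == 0)
			return 1;
	return 0;
}

/* Reads one symbol starting at *src and advances *src past it. */
Token next_token(const char **src)
{
	Token tok = { TOK_END, "" };
	const char *p;
	size_t len = 0;

	assert(src != NULL && *src != NULL);
	p = *src;
	while (isspace((unsigned char)*p))      /* skip blanks between symbols */
		p++;

	if (isalpha((unsigned char)*p) || *p == '_') {
		/* a word: keyword or identifier */
		while (isalnum((unsigned char)*p) || *p == '_') {
			if (len < sizeof tok.text - 1)
				tok.text[len++] = *p;
			p++;
		}
		tok.kind = TOK_IDENT;
	} else if (isdigit((unsigned char)*p)) {
		while (isdigit((unsigned char)*p)) {
			if (len < sizeof tok.text - 1)
				tok.text[len++] = *p;
			p++;
		}
		tok.kind = TOK_NUMBER;
	} else if (*p != '\0') {
		tok.text[len++] = *p++;
		/* two-character operators: ++ and == */
		if ((tok.text[0] == '+' || tok.text[0] == '=') && *p == tok.text[0])
			tok.text[len++] = *p++;
		tok.kind = TOK_OP;
	}
	tok.text[len] = '\0';
	if (tok.kind == TOK_IDENT && is_keyword(tok.text))
		tok.kind = TOK_KEYWORD;
	*src = p;
	return tok;
}
```

Calling `next_token` repeatedly on a string yields one symbol per call until `TOK_END`. Note that the scanner looks at each character in isolation from the rest of the program: it never asks whether the resulting sequence of symbols makes sense.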

Syntax analysis

Phase where the overall structure of the program is identified.
Involves understanding the order of the symbols.
The syntax analyzer, or parser, needs to know the context in which it is operating (in terms of the symbols that have already been processed).
Very often gives a tree-like output: the syntax tree.

Example: (a+b)*(c+d)

        *
      /   \
     +     +
    / \   / \
   a   b c   d
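One common way to obtain such a tree is a recursive-descent parser, sketched below in C. The grammar (one-letter names, + and *, parentheses, no blanks) and all names (`Node`, `parse`, `parse_expr`, ...) are assumptions made for this example. Error handling is minimal: on a syntax error the functions return NULL and leak the partial tree.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Node {
	char symbol;               /* '+', '*' or a one-letter name */
	struct Node *left, *right; /* both NULL for a leaf */
} Node;

static Node *new_node(char symbol, Node *left, Node *right)
{
	Node *n = malloc(sizeof *n);
	assert(n != NULL);         /* sketch: no recovery from out-of-memory */
	n->symbol = symbol;
	n->left = left;
	n->right = right;
	return n;
}

static Node *parse_expr(const char **p);

/* factor -> letter | '(' expr ')' */
static Node *parse_factor(const char **p)
{
	if (**p == '(') {
		(*p)++;
		Node *inner = parse_expr(p);
		if (inner == NULL || **p != ')')
			return NULL;       /* syntax error: bad or unclosed group */
		(*p)++;
		return inner;
	}
	if (**p >= 'a' && **p <= 'z')
		return new_node(*(*p)++, NULL, NULL);
	return NULL;               /* syntax error at this point */
}

/* term -> factor { '*' factor } */
static Node *parse_term(const char **p)
{
	Node *left = parse_factor(p);
	while (left != NULL && **p == '*') {
		(*p)++;
		Node *right = parse_factor(p);
		if (right == NULL)
			return NULL;
		left = new_node('*', left, right);
	}
	return left;
}

/* expr -> term { '+' term } */
static Node *parse_expr(const char **p)
{
	Node *left = parse_term(p);
	while (left != NULL && **p == '+') {
		(*p)++;
		Node *right = parse_term(p);
		if (right == NULL)
			return NULL;
		left = new_node('+', left, right);
	}
	return left;
}

/* Parses a whole string; returns NULL if it is not a valid expression. */
Node *parse(const char *text)
{
	Node *tree = parse_expr(&text);
	return (tree != NULL && *text == '\0') ? tree : NULL;
}
```

For the string "(a+b)*(c+d)", `parse` returns a tree whose root is '*' with a '+' node on each side, matching the drawing above. Each function only looks at the current character and at what it has already built, which is the "context" the parser works with.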

What should happen on an error? It depends on the error recovery level.
The parser must at least indicate that "at this point" the input is invalid.
It often outputs more in an attempt to help, but the extra output might not help!

Syntax analysis drives the lexical phase and builds the structure upon which semantic analysis is performed => key phase

Semantic analysis

Separate from the syntax phase.
Some features, like the scope of a variable, cannot be checked at the syntax phase.
Very often built as commands or action code called by the parser, to set up and access appropriate tables (of information).
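As a sketch of such action code, the C fragment below keeps a table of declared names and answers the scope question that the syntax phase cannot. Everything here is hypothetical (the `SymbolTable` layout, the function names, the fixed capacity of 64 entries); a real parser would call these actions when it recognizes a block, a declaration or a use of a name.

```c
#include <assert.h>
#include <string.h>

#define MAX_SYMBOLS 64   /* arbitrary capacity for this sketch */

typedef struct {
	const char *name[MAX_SYMBOLS];  /* declared names, innermost last */
	int level[MAX_SYMBOLS];         /* block nesting depth of each name */
	int count;                      /* number of entries in use */
	int depth;                      /* current nesting depth */
} SymbolTable;

void table_init(SymbolTable *t) { t->count = 0; t->depth = 0; }

/* Action for "block opened". */
void enter_scope(SymbolTable *t) { t->depth++; }

/* Action for "block closed": forget the names declared inside it. */
void leave_scope(SymbolTable *t)
{
	assert(t->depth > 0);
	while (t->count > 0 && t->level[t->count - 1] == t->depth)
		t->count--;
	t->depth--;
}

/* Action for a declaration; returns 0 on a duplicate in the same block.
   The table keeps the pointer, not a copy of the string. */
int declare(SymbolTable *t, const char *name)
{
	for (int i = t->count - 1; i >= 0 && t->level[i] == t->depth; i--)
		if (strcmp(t->name[i], name) == 0)
			return 0;
	assert(t->count < MAX_SYMBOLS);
	t->name[t->count] = name;
	t->level[t->count] = t->depth;
	t->count++;
	return 1;
}

/* Action for a use of a name; returns 1 if the name is visible here. */
int is_visible(const SymbolTable *t, const char *name)
{
	for (int i = t->count - 1; i >= 0; i--)
		if (strcmp(t->name[i], name) == 0)
			return 1;
	return 0;
}
```

A name declared inside a block stops being visible once `leave_scope` has run, while names from enclosing blocks remain visible. Using a name for which `is_visible` returns 0 is a semantic error even though the program is syntactically fine.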

UIL

Universal Intermediate Language

Why produce machine-independent code?

  • Helps portability (of the compiler)
  • Helps to separate language dependencies and machine dependencies in the compiler

Many compilers produce some type of intermediate code.
UIL: a desirable but elusive goal

Intermediate languages for established languages:

  • P-code for Pascal
  • Diana for Ada
  • Bytecode for Java
  • BEAM bytecode for Erlang
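To give a feel for what machine-independent intermediate code can look like, the sketch below walks a syntax tree and emits code for an imaginary stack machine. The instruction names PUSH, ADD and MUL are invented for this example and are not the actual instructions of P-code, Diana, Java bytecode or BEAM. The `Node` type and the `emit` function are likewise assumptions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef struct Node {
	char symbol;                     /* '+', '*' or a one-letter name */
	const struct Node *left, *right; /* both NULL for a leaf */
} Node;

/* Appends stack code for the tree to the string in out (post-order walk).
   out must hold a valid, possibly empty, string of capacity size. */
void emit(const Node *n, char *out, size_t size)
{
	size_t used;

	assert(n != NULL && out != NULL);
	if (n->left == NULL) {           /* leaf: push the value of the name */
		used = strlen(out);
		snprintf(out + used, size - used, "PUSH %c\n", n->symbol);
		return;
	}
	emit(n->left, out, size);        /* code for both operands first... */
	emit(n->right, out, size);
	used = strlen(out);              /* ...then the operator itself */
	snprintf(out + used, size - used, "%s\n", n->symbol == '+' ? "ADD" : "MUL");
}
```

For the tree of (a+b)*(c+d) this produces PUSH a, PUSH b, ADD, PUSH c, PUSH d, ADD, MUL, one instruction per line. Nothing in this output refers to registers or addresses of a particular machine, which is what leaves room for a separate, machine-dependent code generator.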

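The UIL problem:

    m languages
     \ \ \ \   / / / /
           UIL
     / / / /   \ \ \ \
    n machines

Without a UIL, m languages on n machines require m × n compilers; with a UIL, m front ends and n back ends are enough.
The difficulty is to design and define it: at what level should the UIL be?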

Storage Allocation

Every constant and variable appearing in the program must have storage space allocated for its value (=> address)

  • Static storage: allocate and never release: lifetime = lifetime of the program
  • Dynamic storage: allocate and release when done: lifetime = lifetime of the block or procedure
  • Global storage: lifetime is unknown at compile time. Allocated and deallocated at run-time. Implies run-time overhead.
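The three kinds of storage can be illustrated in C. This is one reading of the list above, not a definition: in particular, mapping "global storage" to heap allocation with `malloc` is an assumption of this sketch, and the function names are made up.

```c
#include <assert.h>
#include <stdlib.h>

/* Static storage: one location for the whole run of the program. */
static int calls = 0;

/* Dynamic storage: 'square' exists only while the procedure is active. */
int count_and_square(int x)
{
	int square = x * x;
	calls++;
	return square;
}

/* Storage whose lifetime is decided at run time (assumed here to be the
   notes' "global storage"): it outlives the call that created it. */
int *make_counter(int start)
{
	int *counter = malloc(sizeof *counter);
	assert(counter != NULL);   /* sketch: no recovery from out-of-memory */
	*counter = start;
	return counter;            /* the caller must free() it when done */
}

/* Reads the static variable, which keeps its value between calls. */
int call_count(void)
{
	return calls;
}
```

The compiler can fix the address of `calls` once and for all, and can place `square` relative to the current procedure activation. For the object created by `make_counter` it can only generate calls to run-time allocation routines, which is where the run-time overhead comes from.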

Synthesis stage

Not well suited to automation. No widely available tools.
Early idea: compiler-compiler.
input: specification of a language and a machine
output: implementation of the language on the machine (the compiler)

This has been achieved for the analysis stage, but much less so for the synthesis stage.

One pass = one complete reading of the source code.
Various phases can be executed in parallel (no complex communication between the passes)
Sometimes a single pass is not possible and the source must be swept several times!
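A classic situation where a single sweep is awkward is a jump to a label that is only defined further down. The C sketch below resolves such jumps in two passes over a tiny made-up program representation. The `Line` structure, the `resolve_jumps` function and the limit of 16 labels are all assumptions for this illustration.

```c
#include <assert.h>
#include <string.h>

#define MAX_LABELS 16    /* arbitrary capacity for this sketch */

typedef struct {
	const char *label;   /* label defined on this line, or NULL */
	const char *target;  /* label this line jumps to, or NULL */
} Line;

/* Fills dest[i] with the line index that line i jumps to (-1 if it does
   not jump). Returns 0 if some jump refers to an undefined label. */
int resolve_jumps(const Line *prog, int n, int *dest)
{
	const char *names[MAX_LABELS];
	int where[MAX_LABELS];
	int labels = 0;

	/* Pass 1: record where every label is defined. */
	for (int i = 0; i < n; i++)
		if (prog[i].label != NULL) {
			assert(labels < MAX_LABELS);
			names[labels] = prog[i].label;
			where[labels] = i;
			labels++;
		}

	/* Pass 2: every jump, forward or backward, can now be resolved. */
	for (int i = 0; i < n; i++) {
		dest[i] = -1;
		if (prog[i].target == NULL)
			continue;
		for (int k = 0; k < labels; k++)
			if (strcmp(names[k], prog[i].target) == 0)
				dest[i] = where[k];
		if (dest[i] < 0)
			return 0;    /* undefined label */
	}
	return 1;
}
```

When the first line of a program jumps to a label defined on the last line, a single sweep reaches the jump before it knows where the label is. The first pass here exists only to collect that information for the second.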


Typical compiler course:

  • Intro
  • Analysis stage
    • Lexical analysis
    • Syntax analysis
    • Semantic analysis
  • Synthesis stage
    • Intermediate code
    • Control flow
    • Dataflow analysis
    • Optimization
    • Register Allocation
    • Advanced topics

Our focus: Analysis stage