Software is written in a variety of languages:
Compilers are essentially sophisticated text-processing systems whose task is usually thought of as consisting of two stages, analysis and synthesis.
Compiler = translates information from one representation to another
input = program source code
Usually,
Translators = transform at the same level of abstraction
Compilers = transform from high-level source code to low-level (object) code
Typical: gcc, javac
Non-typical compiler:
Translators:
Programs written in assembly language are difficult to write, debug, maintain and understand
Productivity increased tremendously when compilers appeared (50 years ago)
There are still a few cases where it is better to use assembly-level code:
Goal: Generate machine code which describes the same computation as the source code
Unique translation? => No!
Best translation? => No! (Best = smallest or fastest?)
Compiler optimizations = find better translations
It is hard, or even impossible, to satisfy all criteria at once: we have a choice of translations and must find the "good" ones.
That is why we teach compiler concepts!
Compilers have flags that control the mode in which they operate, for example which optimizations are applied.
The generated code must execute precisely the same computation as in the source code
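To illustrate that a translation is neither unique nor obviously "best", here is a minimal sketch of two C functions that describe the same computation with a different number of operations. The names expr_direct, expr_factored and same_results are hypothetical, and the two versions agree only as long as the arithmetic does not overflow an int.

```c
/* Direct form of the expression: four multiplications, two additions. */
int expr_direct(int n)
{
    return 4 * n * n * (n + 1) * (n + 1);
}

/* Equivalent form: 4*n*n*(n+1)*(n+1) == (2*n*(n+1))^2,
   so the shared factor is computed once (three multiplications). */
int expr_factored(int n)
{
    int t = 2 * n * (n + 1);
    return t * t;
}

/* Returns 1 if both versions agree for every n in [0, limit], else 0.
   limit is assumed small enough that no int overflow occurs. */
int same_results(int limit)
{
    for (int n = 0; n <= limit; n++) {
        if (expr_direct(n) != expr_factored(n))
            return 0;
    }
    return 1;
}
```

Whether the second form is "better" depends on the criterion (size, speed, target machine), which is exactly the choice an optimizing compiler has to make.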
A language compiler may also be referred to as an implementation of the language.
The object code can be in the form of the machine code or assembly code, or possibly some intermediate code (to be further transformed)
Alternatively, the intermediate code may be directly executed by means of an interpreter.
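As a rough illustration of executing intermediate code directly, the sketch below interprets a made-up stack-machine code. The instruction set (OP_PUSH, OP_ADD, OP_MUL, OP_HALT), the run function and example_program are all assumptions for this example and do not correspond to any real bytecode format.

```c
#include <stddef.h>

/* Hypothetical intermediate code for a tiny stack machine. */
typedef enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT } OpCode;

typedef struct {
    OpCode op;
    int    arg;   /* used by OP_PUSH only */
} Instr;

/* Executes instructions until OP_HALT and returns the top of the stack.
   Assumes a well-formed program (no stack underflow or overflow). */
int run(const Instr *code)
{
    int stack[64];
    int sp = 0;   /* number of values currently on the stack */

    for (size_t pc = 0; ; pc++) {
        switch (code[pc].op) {
        case OP_PUSH:
            stack[sp++] = code[pc].arg;
            break;
        case OP_ADD:
            sp--;
            stack[sp - 1] = stack[sp - 1] + stack[sp];
            break;
        case OP_MUL:
            sp--;
            stack[sp - 1] = stack[sp - 1] * stack[sp];
            break;
        case OP_HALT:
            return stack[sp - 1];
        }
    }
}

/* Intermediate code for (1+2)*(3+4). */
static const Instr example_program[] = {
    { OP_PUSH, 1 }, { OP_PUSH, 2 }, { OP_ADD, 0 },
    { OP_PUSH, 3 }, { OP_PUSH, 4 }, { OP_ADD, 0 },
    { OP_MUL, 0 },  { OP_HALT, 0 }
};
```

Calling run(example_program) evaluates the expression without ever producing machine code for it.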
Analysis: rather general (well suited to automation)
Synthesis: machine dependent
Automation is helped by tools such as Lex and YACC.
The compiler itself is implemented in a specific language:
Java ----- Bytecode
      |
      C
Erlang ----- BEAM Bytecode
        |
      Erlang
T diagrams can be combined: One compiler can be obtained from another one after being compiled by a third one.
Used for porting a compiler from one machine to another.
High-level source code -> Compiler -> Low-level machine code
Source code: Optimized for human readability
Matches human notions of grammar, uses constructs with names, such as variables and procedures/function calls.
int expr(int n)
{
    int d;
    d = 4*n*n*(n+1)*(n+1);
    return d;
}
Assembly and Machine Code: Optimized for hardware
Consists of machine instructions, uses registers and unnamed memory locations
Much harder for humans to understand
Highly modular in design
Logically: compilation process is divided into stages, which in turn are divided into phases.
Physically: compiler is divided into passes.
Analysis
Lexical analysis => Syntax analysis => Semantic analysis
Synthesis
Machine independent code generation => Optimization of machine independent code =>
Storage allocation =>
Machine code generation => Optimization of machine code
Relatively simple: Symbols (or tokens) are formed.
keywords: for, do, while
identifiers: name, salary, counter
operators: ++, ==
Reads characters from the input stream and replaces them by language symbols.
; number int return do == ++
should be a "correct" pass
The lexical analyzer has no context to work with: it doesn't know what symbols have been processed and which ones will be.
Simple => easy to automate
Can be time-consuming (since we have more characters than symbols)
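The lexical phase described above can be sketched as a function that skips white space and groups the following characters into one symbol. This is only an illustration: the Token type, the keyword list and next_token are assumptions, and real lexers (for example those generated by Lex) handle many more cases such as comments and string literals.

```c
#include <ctype.h>
#include <string.h>

typedef enum { TOK_KEYWORD, TOK_IDENT, TOK_NUMBER, TOK_OP, TOK_END } TokenKind;

typedef struct {
    TokenKind kind;
    char      text[32];   /* longer symbols are split in this sketch */
} Token;

static int is_keyword(const char *s)
{
    static const char *keywords[] = { "for", "do", "while", "int", "return" };
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(s, keywords[i]) == 0)
            return 1;
    return 0;
}

/* Reads the next symbol starting at *p and advances *p past it.
   Note that it looks only at characters, never at earlier symbols. */
Token next_token(const char **p)
{
    Token t = { TOK_END, "" };
    const char *s = *p;
    size_t len = 0;

    while (isspace((unsigned char)*s))
        s++;

    if (*s == '\0') {
        /* end of input: t is already TOK_END */
    } else if (isalpha((unsigned char)*s) || *s == '_') {
        while ((isalnum((unsigned char)*s) || *s == '_') && len < sizeof(t.text) - 1)
            t.text[len++] = *s++;
        t.kind = is_keyword(t.text) ? TOK_KEYWORD : TOK_IDENT;
    } else if (isdigit((unsigned char)*s)) {
        while (isdigit((unsigned char)*s) && len < sizeof(t.text) - 1)
            t.text[len++] = *s++;
        t.kind = TOK_NUMBER;
    } else {
        /* Any other character is an operator; "++" and "==" take two. */
        t.text[len++] = *s++;
        if ((t.text[0] == '+' || t.text[0] == '=') && *s == t.text[0])
            t.text[len++] = *s++;
        t.kind = TOK_OP;
    }
    *p = s;
    return t;
}
```

Calling next_token repeatedly on "while (counter == 10) counter++;" yields the symbols while, (, counter, ==, 10, ), counter, ++, ; and then TOK_END.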
Phase where the overall structure of the program is identified.
Involves understanding of the order of the symbols.
The syntax analyser or parser needs to know the context in which it is operating.
(in terms of symbols that have been already processed)
Very often gives a tree-like output: the syntax tree.
Example: (a+b)*(c+d)
        *
       / \
      /   \
     +     +
    / \   / \
   a   b c   d
Syntax analysis drives the lexical phase and builds the structure upon which semantic analysis is performed => key phase
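One possible way to obtain the tree for (a+b)*(c+d) is a small recursive-descent parser, sketched below. The grammar, the Node type and the functions parse and to_postfix are assumptions made for this example. Input is assumed to be well formed, with single-letter operands and no spaces.

```c
#include <stddef.h>

typedef struct Node {
    char         symbol;        /* operator ('+', '*') or operand name */
    struct Node *left, *right;  /* both NULL for an operand */
} Node;

static Node        pool[64];    /* fixed pool: no error handling in this sketch */
static int         used;
static const char *input;       /* next unread character */

static Node *new_node(char symbol, Node *left, Node *right)
{
    Node *n = &pool[used++];
    n->symbol = symbol;
    n->left = left;
    n->right = right;
    return n;
}

static Node *parse_expr(void);

/* factor := letter | '(' expr ')' */
static Node *parse_factor(void)
{
    if (*input == '(') {
        input++;                     /* consume '(' */
        Node *inner = parse_expr();
        input++;                     /* consume ')' (assumed present) */
        return inner;
    }
    return new_node(*input++, NULL, NULL);
}

/* term := factor ('*' factor)* */
static Node *parse_term(void)
{
    Node *left = parse_factor();
    while (*input == '*') {
        input++;
        left = new_node('*', left, parse_factor());
    }
    return left;
}

/* expr := term ('+' term)* */
static Node *parse_expr(void)
{
    Node *left = parse_term();
    while (*input == '+') {
        input++;
        left = new_node('+', left, parse_term());
    }
    return left;
}

/* Parses a whole expression and returns the root of its syntax tree. */
Node *parse(const char *text)
{
    used = 0;
    input = text;
    return parse_expr();
}

/* Writes the tree in postfix order into out; returns the new end of out. */
char *to_postfix(const Node *n, char *out)
{
    if (n == NULL)
        return out;
    out = to_postfix(n->left, out);
    out = to_postfix(n->right, out);
    *out++ = n->symbol;
    *out = '\0';
    return out;
}
```

For "(a+b)*(c+d)" the root is '*' with two '+' children, and the postfix traversal gives ab+cd+*. The parentheses disappear because the tree shape already records the grouping.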
Separate from the syntax phase.
Some features, like the scope of a variable, cannot be checked at the syntax phase.
Very often built as commands or action codes called by the parser, to set up and access appropriate tables (of information)
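A scope check of this kind can be sketched with a small symbol table that the parser's action codes would call. The functions enter_scope, exit_scope, declare and lookup are hypothetical names, and the table stores only a name and a nesting level. A real compiler would also record types, addresses and so on.

```c
#include <string.h>

#define MAX_SYMBOLS 64

typedef struct {
    const char *name;    /* caller keeps the string alive */
    int         level;   /* nesting depth of the declaring block */
} Symbol;

static Symbol table[MAX_SYMBOLS];
static int    count;     /* number of live entries */
static int    level;     /* current nesting depth */

/* Called when the parser enters a block. */
void enter_scope(void) { level++; }

/* Leaving a block discards every name declared inside it. */
void exit_scope(void)
{
    while (count > 0 && table[count - 1].level == level)
        count--;
    level--;
}

/* Returns 1 on success, 0 if name is already declared in the
   current block or the table is full. */
int declare(const char *name)
{
    for (int i = count - 1; i >= 0 && table[i].level == level; i--)
        if (strcmp(table[i].name, name) == 0)
            return 0;
    if (count == MAX_SYMBOLS)
        return 0;
    table[count].name = name;
    table[count].level = level;
    count++;
    return 1;
}

/* Returns the level of the innermost visible declaration of name,
   or -1 if the name is not in scope. */
int lookup(const char *name)
{
    for (int i = count - 1; i >= 0; i--)
        if (strcmp(table[i].name, name) == 0)
            return table[i].level;
    return -1;
}
```

A use of a variable is then valid only if lookup does not return -1, which is a check the grammar alone cannot express.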
Universal Intermediate Language
Why produce machine-independent code?
Many compilers produce some type of intermediate code.
UIL: a desirable but elusive goal
Intermediate language for established languages:
The UIL problem:
  m languages
\ \ \ \ / / / /
      UIL
/ / / / \ \ \ \
  n machines
Every constant and variable appearing in the program must have storage space allocated for its value (=> address)
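A minimal sketch of storage allocation: each variable receives an offset in a data area, in declaration order. The Variable type and allocate_storage are hypothetical, and the sketch assumes that a variable's alignment equals its size, which is a simplification that real targets do not always follow.

```c
#include <stddef.h>

typedef struct {
    const char *name;
    size_t      size;     /* bytes for the value; also used as alignment */
    size_t      offset;   /* filled in by allocate_storage */
} Variable;

/* Rounds offset up to the next multiple of align (align must be > 0). */
static size_t align_up(size_t offset, size_t align)
{
    return (offset + align - 1) / align * align;
}

/* Assigns each variable an offset from the start of the data area
   and returns the total size of the area. */
size_t allocate_storage(Variable *vars, size_t n)
{
    size_t next = 0;
    for (size_t i = 0; i < n; i++) {
        next = align_up(next, vars[i].size);
        vars[i].offset = next;
        next += vars[i].size;
    }
    return next;
}
```

With variables of sizes 1, 4 and 8 bytes declared in that order, the offsets come out as 0, 4 and 8, for a total of 16 bytes.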
Not well suited to automation. No widely available tools.
Early idea: compiler-compiler.
input: specification of a language and a machine
output: implementation of the language on the machine (the compiler)
This has been achieved for the analysis stage but much less so for the synthesis stage.
One pass = one complete reading of the source code.
Various phases can be executed in parallel (no complex communication between the passes)
Sometimes a single pass is not possible and the source must be swept several times!
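One common reason for needing more than one pass is a forward reference, for instance a jump to a label defined later. The notes do not name a reason, so this example is an assumption. The sketch below resolves jumps in two sweeps over a hypothetical list of lines; Line, Label and resolve_jumps are invented for the illustration.

```c
#include <string.h>

#define MAX_LABELS 16

typedef struct {
    const char *label;    /* non-NULL: this line defines a label */
    const char *target;   /* non-NULL: this line jumps to that label */
    int         address;  /* resolved jump address, set by pass 2 */
} Line;

typedef struct {
    const char *name;
    int         address;
} Label;

/* Returns 1 if every jump was resolved, 0 if some label is undefined. */
int resolve_jumps(Line *lines, int n)
{
    Label labels[MAX_LABELS];
    int nlabels = 0;

    /* Pass 1: record the address (line index) of every label. */
    for (int i = 0; i < n; i++) {
        if (lines[i].label != NULL && nlabels < MAX_LABELS) {
            labels[nlabels].name = lines[i].label;
            labels[nlabels].address = i;
            nlabels++;
        }
    }

    /* Pass 2: replace each jump target by the recorded address. */
    for (int i = 0; i < n; i++) {
        if (lines[i].target == NULL)
            continue;
        lines[i].address = -1;
        for (int k = 0; k < nlabels; k++)
            if (strcmp(labels[k].name, lines[i].target) == 0)
                lines[i].address = labels[k].address;
        if (lines[i].address < 0)
            return 0;   /* undefined label */
    }
    return 1;
}
```

A single sweep could not fill in the address of a forward jump, because the label's position is not yet known when the jump is read.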
Typical compiler course:
Our focus: Analysis stage