Understanding Compilers from scratch
zero to something guide…
Technologies are evolving so fast. New programming languages rising up while legacy programming languages are improving and going deeper. Last year I came across with new language — Ballerina, a open source programming language — which grabbed my attention in the first sight. After testing few sample code I wanted to learn how it is built, so as the first step I tried learning its compiler design.
So in this article we are going to learn some key concepts and build a small compiler program, not a complete solution but a simple application which will translate assembly like small new language to machine code which can be fed into a processor designed on Altera FPGA.
What is compiler?
A program that translate a program from source language into a target language. In Ballerina, it is from Ballerina source code to a bytecode which fed to machine later. Source code is in high level programming language like C/Java then Compiler will translate them to assembly code latter MIPS assembler will convert bytecode into machine code.
Compiler Parts
Compilers consist three main parts.
- Front-end
- Middle ware
- Back-end
Front-end works with the higher level programming language and does main analysis to identify whether the source code is in a good heath to be translated. So it carries three analysis.
- Lexical Analysis
- Syntactical Analysis
- Semantic Analysis
Middle ware concern about preparing the intermediate representation of the code and its optimization
In the back-end, the real translation happens. Well it has two main components.
- Code generation
- Optimization
so lets talk about how really the compiler does all these work.
How compiler works
Above illustration shows the whole work done in a compiler.Lets break into small pieces and talk.
Scanner
This component scans the source code and generate tokens. Its main responsibility is to read character by character and load it to token buffer and generate tokens according to the lexical rules. Lexical rules are the criterion about how the language is declared. For a example variable names should not start with a underscore (like _var is illegal). Scanner will work on Maximum Length algorithm so it will tokenize a maximum length word. Then this tokens will be sen to Parser. Lexical rules are defined using Regular expressions which we are not going to talk about here.
Parser
Parser will do two main analysis ; Syntactical analysis and Sentiment analysis. After parser gets a token from the tokenizer then it looks for the symbol tables to do the other analysis. Syntactical analysis check whether the grammar of the sentence is correct. A design of the grammar should be done properly to remove syntactical ambiguity. For a example,
Here,according java grammar every statement should end with semicolon(;) but it is missing so the parser will raise syntax error.
Lets understand semantic analysis. sometimes grammar can be correct but it can be incorrect in terms of the context. There is a catch here — we don’t do semantic analysis because it is very costly and need extra work, so first syntax is checked then only semantic analysis done. This way we can reduce the workload.
Here grammar can be correct but variable c is not defined so it is not available in this context. So this raises a semantic error.
If everything is okay then the healthy code is translated into intermediate representation (not the bytecode) only to support translation. This will a simpler version of the sorce code. Example,
d = a + b + c
//this will be translated to
temp = a + b
d = temp + c
After this phase, assembly code will be generated and some optimizations will be there.
For a example lets do a simple code translation.
This will be a small assembler
This small language is designed for a image down sampling project. Lets look the source code for adding two numbers.
Assembler coded in python.
Usage of the programming language.
Resources
Hope you learned something.
Happy coding.