LitePAC Compiler
================

Compiler Overview
-----------------

The vmlang program is the compiler. The program compiles one source file at
a time. As input it takes a C-like language source file and converts it to an
assembler listing of the target architecture.

The process of compilation is divided into several stages:

- Lexical analysis.
- Syntax analysis.
- Type checking analysis (aka semantic analysis).
- Intermediate representation.
- Peephole optimization.
- Generation of the architecture specific code.

The stages are listed in execution order. During these stages the source
program is represented in several different forms. Each stage is described in
more detail below.

[NOTE]
To explain how the compiler works we have to describe some theory. At the
beginning it may be a little confusing why we need it, but be patient and
believe that it makes sense.

Lexical analysis
----------------

Introduction
~~~~~~~~~~~~

Tokens are the base elements of the lexical analysis. The lexical analyser
sees the source file as a stream of tokens. Lexeme is another name for token.
The lexical analyser is called from the syntax analyser and returns one token
at a time. A token is a simple data structure: each token has its own type
and an attribute value which refers to this token. For instance, if the
lexical analyser finds the integer number +1024+, the syntax analyser will
receive the integer number token (+TOKEN_INTEGER+) and the attribute value
will contain +1024+. For some tokens the attribute value means nothing and it
is empty. For instance, arithmetic operations (addition, subtraction, etc.)
don't need the attribute and the lexical analyser only returns the
appropriate token type (+TOKEN_PLUS+, +TOKEN_MINUS+, etc.).

LitePAC Lexer (aka mlex)
~~~~~~~~~~~~~~~~~~~~~~~~

In the compiler the lexical analyser is implemented as one big function. It
returns the token type and sets the attribute value in the global variable
_mlex_. The full list of all tokens can be found in the _ file. The global
variable which contains all information about a token is presented below.

-----------------------------------------------
struct mlex {
        token_t token;
        union {
                int integer;
                float real;
                char ch;
                char *id;
                char *str;
        };
};
-----------------------------------------------

Let's look at another example. Assume we have the expression +a = b + c * 10+.
The lexical analyser sees this expression as:

+(id, a) (=) (id, b) (+++) (id, c) (*) (num, 10)+

The attribute value is placed after the comma. For the LitePAC compiler the
stream of tokens looks the following way:

+TOKEN_ID+ _->_ +TOKEN_EQUAL+ _->_ +TOKEN_ID+ _->_ +TOKEN_PLUS+ _->_
+TOKEN_ID+ _->_ +TOKEN_ASTERISK+ _->_ +TOKEN_INTEGER+

The global variable _mlex_ contains the appropriate value for each token. In
the case of +TOKEN_ID+ the anonymous union in _mlex_ keeps the pointer to the
identifier name. In the case of an integer number it keeps the value of the
number.
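To make the interaction between the lexer and its caller concrete, here is a
minimal sketch of a loop that drains the token stream and prints the
attribute values. The entry point name _lexer()_ and the +TOKEN_EOF+
end-of-input token are assumptions made only for this illustration.

------------------------------------------------------------
#include <stdio.h>

/* Assumed entry point: returns the token type and fills the global mlex. */
token_t lexer(void);
extern struct mlex mlex;

void dump_tokens(void)
{
        token_t tok;

        /* TOKEN_EOF is an assumed end-of-input marker. */
        while ((tok = lexer()) != TOKEN_EOF) {
                switch (tok) {
                case TOKEN_ID:
                        printf("identifier: %s\n", mlex.id);
                        break;
                case TOKEN_INTEGER:
                        printf("integer: %d\n", mlex.integer);
                        break;
                default:
                        /* Operators and keywords carry no attribute value. */
                        printf("token type: %d\n", (int)tok);
                        break;
                }
        }
}
------------------------------------------------------------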
Regular expressions
~~~~~~~~~~~~~~~~~~~

Now we describe in a formal way, with the help of regular expressions, what
kind of tokens the lexer recognizes.

[NOTE]
Certainly, if you need to write a lexer you will never describe its tokens
with the help of regular expressions -- everything will be in your head. We
use regular expressions in this chapter only as a helpful notation.

Regular expressions help to describe what kind of tokens the LitePAC lexer
has to recognize. This formal description is also the technical specification
for a programmer.

+'+'+ _->_ one or more elements

+'?'+ _->_ zero or one element

+'*'+ _->_ zero or more elements

White spaces (the lexer skips them and doesn't return tokens for them):

+ws+ _->_ (blank | tab | newline)+

Digits:

+digit+ _->_ +[0-9]+

Numbers:

+digits+ _->_ +digit++

Integer numbers and float numbers (including the exponent representation):

+number -> digits(.digits)?(E[+++|-]?digits)?+

English letters in lower and upper case and the underscore symbol:

+letter_+ _->_ +[A-Za-z_]+

An identifier can begin with the underscore symbol or a letter:

+id+ _->_ +letter_[letter_|digit]*+

Different operations:

+assign_op+ _->_ +=+

+arith_op+ _->_ +++ _|_ +-+ _|_ +*+ _|_ +/+ _|_ +%+

+rel_op+ _->_ +<+ _|_ +>+ _|_ +<=+ _|_ +>=+ _|_ +==+ _|_ +!=+

+not_op+ _->_ +!+

+logic_op+ _->_ +||+ _|_ +&&+

Basic symbols:

+basic_sym+ _->_ +{+ _|_ +(+ _|_ +[+ _|_ +}+ _|_ +)+ _|_ +]+ _|_ +.+ _|_ +,+ _|_ +;+ _|_ +&+

Keyword tokens (see the explanation below):

_if ->_ +if+

_else ->_ +else+

_for ->_ +for+

_etc_

Each language has a set of keywords. Later we will speak about handling the
keywords in more detail. At this point we only need to know that a keyword is
a common identifier +id+; when the lexical analyser finds an +id+ it makes a
lookup in the keyword table. If the identifier is a keyword the lexical
analyser returns the token of this keyword, otherwise it returns +TOKEN_ID+
with the name of the identifier in the attribute value. So each keyword has
its own token.

*for* -- +TOKEN_FOR+

*if* -- +TOKEN_IF+

*else* -- +TOKEN_ELSE+

*while* -- +TOKEN_WHILE+

...

Keyword table
-------------

In the last chapter we already talked about the keywords. Now you know that
for each keyword the lexer returns its own token. When the lexer finds an
identifier it needs to detect whether it is a simple identifier or a keyword.
The keyword table can be found in the _keyword.c_ file. The keyword table
uses a hash table as a backend, and before the parser starts the table is
initialized with the keywords.

A keyword is represented with a simple structure, where _name_ is the keyword
string and _type_ is the token type (_token_t_ is defined in __):

-----------------------
struct keyword {
        char *name;
        token_t type;
};
-----------------------

The following piece of code shows what is going on in the lexer:

------------------------------------
kword = keyword_table_lookup(cbuf);
if (kword != NULL) {
        mlex.token = kword->type;
        return kword->type;
}

mlex.id = xstrdup(cbuf);
mlex.token = TOKEN_ID;

return TOKEN_ID;
------------------------------------

The code was simplified to show the core of the handling. _cbuf_ contains the
string being examined. The lookup function is in fact a simple wrapper around
the hash table lookup function. If we find a match in the keyword table the
lexer returns the type from the _type_ field. If this identifier is not a
keyword we return +TOKEN_ID+ and set the attribute value to the found string.
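Before parsing starts the keyword table is filled with all keywords. A
minimal sketch of such an initialization is shown below; the helper name
_keyword_table_insert()_ and the exact keyword list are assumptions for
illustration, not the actual LitePAC code.

---------------------------------------------------------------
/* Hypothetical initialization of the keyword table; the helper name
 * keyword_table_insert() and the keyword list are assumptions. */
static struct keyword keywords[] = {
        { "if",    TOKEN_IF    },
        { "else",  TOKEN_ELSE  },
        { "for",   TOKEN_FOR   },
        { "while", TOKEN_WHILE },
        /* ... the rest of the keywords ... */
};

void keyword_table_init(void)
{
        unsigned int i;

        for (i = 0; i < sizeof(keywords) / sizeof(keywords[0]); i++)
                keyword_table_insert(&keywords[i]);     /* hash table insert */
}
---------------------------------------------------------------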
Syntax analysis
---------------

Introduction
~~~~~~~~~~~~

The syntax analyser is also called the parser. The syntax analyser is the
heart of the compiler: it knows about all the syntax constructions which can
be used in the language. In general the syntax describes the language, and
the parser knows which syntax constructions are permitted.

As mentioned earlier, the syntax analyser receives tokens from the lexical
analyser. When the parser finishes its work it has built an unambiguous tree.
The tree is the same source program but represented in tree form. Traversing
the tree can restore the original source program.

The syntax parser in LitePAC is implemented as a recursive descent predictive
parser. In spite of the long name this kind of parser is easier to implement
than others. To build the tree the compiler has a simple API for all the node
types needed to represent the original program.

[NOTE]
Before describing the LitePAC compiler syntax (aka grammar) we give some
theoretical basis for how it works and why it works this way. The main reason
we do this is that a correctly written grammar maps _directly_ to the source
code of the syntax parser.

Backus-Naur notation
~~~~~~~~~~~~~~~~~~~~

The syntax analysis is based on the parser rules. The rules should be
described before writing a parser. There is a whole mathematical theory for
the formal description of languages; we only need to know what the
Backus-Naur notation is and how to use it. This notation consists of a few
main components:

- a set of terminals (aka tokens);
- a set of nonterminals (aka syntax variables);
- a set of productions;
- one of the nonterminals has to be the start symbol.

A terminal is a base language symbol. A nonterminal consists of terminals;
the terminals in nonterminals are organised in a way that describes a syntax
construction. A production consists of one nonterminal -- the head or the
left side of the production -- and a set of terminals and/or nonterminals --
the body or the right side of the production. The sides are divided by an
arrow '->'; sometimes we use the '|' (or) symbol to describe alternative
variants of the right side.

Again it sounds quite difficult, so let's look at an example. Assume we need
to describe the grammar of this expression:

+2 + 4 - 8 - 5 + ... + 9+

The rules for this grammar look the following way:

+list+ _->_ +list '+' digit | list '-' digit | digit+

+digit+ _->_ +0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9+

Each line is a production: the left side is the head of the production, the
right side is the body of the production. The terminals are '+ - 0 1 ... 9'.
The signs '+' and '-' don't stand for addition or subtraction here, they are
only basic tokens. The nonterminals are the digit and the list; the list is
also the start symbol.

The tree is the result of the parsing. The tree has some important
properties:

- The root is the start symbol;
- Each leaf is a terminal;
- The intermediate nodes are nonterminals.

The tree for the expression +2 + 4 - 8 - 5+:

                   list
                 /  |   \
              list '-'  digit
            /  |   \        |
         list '-'  digit    5
        /  |   \      |
     list '+' digit   8
      |         |
    digit       4
      |
      2

The main goal of the parser is, beginning from the start symbol, to locate
the right syntax constructions.

Associativity of operators
~~~~~~~~~~~~~~~~~~~~~~~~~~

The next important point worth mentioning is the associativity of operators.
The associativity of operators is the order of their execution if they (the
operators) have the same priority. Priority will be described in the next
chapter. For now let's assume that we have only the '+' and '-' operators,
and it is well known that they have the same priority. In most languages all
arithmetic operations are left-associative. For example, take the expression
+4 + 7 + 3+, where we have 2 operators and 3 operands. The operand 7 has a
'+' operator on both sides, but only the left one applies to it: the operator
to the right of 7 takes the whole sum +4 + 7+ as its left operand. This is
very obvious for human beings, but it also should be obvious for the parser.
Knowing this we can say that +4 + 7 + 3+ is the same as +(4 + 7) + 3+.
But it is not the same as +4 + (7 + 3)+, and this is only because we defined
earlier that arithmetic operations are left-associative. There are also
right-associative operators; for example '=' is a right-associative operator.
Above we already described the rules for the '+' and '-' operators. These
rules are for the '=' operator:

+right+ _->_ +letter '=' right | letter+

+letter+ _->_ +a | b | c ... | z+

This grammar can describe an expression like this:

+c = d = e = z+

Below you can see the difference between the trees of left-associative and
right-associative operators.

            right
          /   |   \
     letter  '='   right
        |         /  |  \
        c    letter '='  right
                |       /  |  \
                d   letter '='  letter
                       |           |
                       e           z

The tree of left-associative operators grows down and to the left, the tree
of right-associative operators grows down and to the right. Again it is very
obvious for us, but for the parser everything is just tokens. We use rules
that are well known to people, but we could invent something unusual and the
parser would parse it exactly as we said.

Precedence of operators
~~~~~~~~~~~~~~~~~~~~~~~

In the last chapter we assumed that our grammar is very simple: it consists
of the '+' and '-' operators and they have the same priority (aka
precedence). But real languages have lots of operators. Before describing the
LitePAC compiler grammar let's look at another example, a little bit more
difficult. Assume we have this expression:

+6 - (2 + 5 * 3) * 9 / (3 - 1) + ... / 2+

We have 4 operators ++ - * /+ and let's say that +*+ and +/+ have a higher
priority than +'+'+ and +-+. Also we use brackets to give the expression in
brackets a higher priority. Using the known notation we describe the grammar:

+summ_expr+ _->_ +summ_expr '+' mult_expr | summ_expr - mult_expr | mult_expr+

+mult_expr+ _->_ +mult_expr * term | mult_expr / term | term+

+term+ _->_ +digit | (summ_expr) | E+

+digit+ _->_ +0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9+

summ_expr is the start symbol. We begin the lookup from this symbol and go
deeper through the chains of rules until we find a match. Note that we added
the bracketed expression to the _term_ production; moreover, the rules now
describe deeply nested expressions in brackets like this:

+1 + (2 * (4 + 6))+

Also we added a special terminal +E+ which means an empty expression, or just
nothing. This grammar fully describes our example, and the sample expression
is quite big and complex. The grammar also describes very simple expressions
like +9+ (just one digit, no operators at all) or even nothing (the empty
expression). Now you see how powerful this technique is; using it we can
describe a quite complex language grammar. Soon we will describe the grammar
of the LitePAC compiler, but first we need to step back.

Recursive descent predictive parser
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Using this kind of parser we can handle the grammar of the LitePAC compiler
in one pass from left to right. The input file is scanned terminal by
terminal (token by token). The syntax analyser receives tokens from the
lexical analyser. The parser always checks the current token against its
rules and if it finds a match it follows this rule further until the end of
the rule or an error. It is called predictive because the parser follows the
rule and predicts that the next incoming token will be the token the chosen
rule says. If it is not the token from the rule, it means we found a
construction in the source file that doesn't fit our grammar, and it is
definitely a user error.
[NOTE]
To sum up, it is worth saying that the LitePAC syntaxer parses the source
file in one pass from left to right. It looks one token ahead and creates the
tree nodes as it handles tokens.

Let's support the explanation with an example. Assume we parse a +for+ loop
and we have these grammar rules to describe a +for+ loop:

+stmt+ _->_ +for_expr | if_expr | ... | expr+

+for_expr+ _->_ +for (optexpr ; optexpr ; optexpr) stmt+

+optexpr+ _->_ +expr | E+

+expr+ _->_ +it describes assignment, arith and condition expressions etc+

The expression being parsed looks the following way:

------------------------------------
for ( ; i < 10 ; i = i + 1) k = k + 1
------------------------------------

As described in the Lexical analysis chapter, if the lexical analyser detects
that an identifier is a keyword it returns the token of the keyword. Again,
the syntax analyser starts from the start symbol +stmt+. The parser receives
the +for+ token (+TOKEN_FOR+) from the lexical analyser. According to the
grammar the parser follows the +for_expr+ rule. Now the parser predicts that
the next token should be +(+ (+TOKEN_LPARENTH+). If it gets a match it
continues following the rule. Now it tries to parse the +optexpr+ rule; here
it is the empty expression and the +optexpr+ rule permits +E+. Then the
parser predicts that the next token should be +;+ (+TOKEN_SEMICOLON+). And so
on and so on. Finally the syntaxer parses the whole loop expression, creates
the for node and puts it in the tree.

[NOTE]
We have come to the point where we can clarify why we need the Backus-Naur
notation. If you open the source code of the parser and find the +stmt+ and
+for_expr+ handlers, you'll see that all the functions which parse the +for+
expression have exactly the same names as described in the grammar, and they
are also called in the same order. The _match()_ function checks the
predicted token against the current one.

---------------------------------
match('for');
match('(');
optexpr();
match(';');
optexpr();
match(';');
optexpr();
match(')');
stmt();
---------------------------------

(This is not the whole truth because the sample grammar was simplified, but
this is the approach used in the compiler.)
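The _match()_ helper itself is not shown above. A minimal sketch of how it
could look is given below; the one-token lookahead variable, the
_next_token()_ helper and the _lexer()_ entry point are assumptions made for
the illustration.

---------------------------------------------------------------
/* Assumed one-token lookahead kept by the parser. */
static token_t lookahead;

static void next_token(void)
{
        lookahead = lexer();    /* lexer() is the assumed lexer entry point */
}

/* Compare the predicted token with the current one and advance on success;
 * the caller reports the syntax error if the prediction failed. */
static int match(token_t expected)
{
        if (lookahead != expected)
                return 0;

        next_token();
        return 1;
}
---------------------------------------------------------------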
Left recursion
~~~~~~~~~~~~~~

There is a small pitfall when we write a recursive descent parser, and it is
the left recursion. Everything described before is still true, we only need
to change some productions. As described earlier we can have a production
like this:

+mult_expr+ _->_ +mult_expr * term | term+

The problem is that in the body of the production, in the leftmost position,
we have exactly the same nonterminal as the head. As described in the last
chapter, the elements of the production are mapped to the source code as
function names. For our example it means that the first function which the
parser should call from _mult_expr()_ is the _mult_expr()_ function itself,
so the parser gets stuck in an infinite loop. To fix this problem we only
need to rewrite the rules where we can find this pattern.

To rewrite the rules we use the following approach. Assume that A is a
nonterminal with two productions:

+A -> Aa | B+

where +a+ and +B+ are sequences of terminals and/or nonterminals and they
don't start with +A+. We see that +A+ has a left recursion. To remove the
recursion we rewrite it in the following way:

+A -> BR+

+R -> aR | E+

The nonterminal +R+ has a right recursion. Let's look at our example. We see
that:

- +A+ is +mult_expr+
- +a+ is +* term+
- +B+ is +term+

Using this rule we rewrite the production in the following way, where +R+ is
+rest_mult+:

+mult_expr+ _->_ +term rest_mult+

+rest_mult+ _->_ +* term rest_mult | E+

Let's finish the conversion for the example where we had 4 operators
++ - * /+ and +*+ and +/+ have the higher priority. The original and
converted grammars look the following way:

Original grammar:

+summ_expr+ _->_ +summ_expr '+' mult_expr | summ_expr - mult_expr | mult_expr+

+mult_expr+ _->_ +mult_expr * term | mult_expr / term | term+

+term+ _->_ +digit | (summ_expr) | E+

+digit+ _->_ +0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9+

Converted grammar:

+summ_expr+ _->_ +mult_expr rest_summ+

+rest_summ+ _->_ +'+' mult_expr rest_summ | - mult_expr rest_summ | E+

+mult_expr+ _->_ +term rest_mult+

+rest_mult+ _->_ +* term rest_mult | / term rest_mult | E+

+term+ _->_ +digit | (summ_expr)+

+digit+ _->_ +0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9+

The current form is not as obvious as it was before, but we can't write a
recursive descent parser if we don't perform this conversion. When we
describe the LitePAC compiler grammar it will be presented in the converted
form, without left recursions.

LitePAC grammar rules
~~~~~~~~~~~~~~~~~~~~~

This is almost the whole grammar of the LitePAC compiler.

_programme -> stmts_

_stmts -> stmt stmts_rest_

_stmts_rest -> stmt stmts_rest | E_

_stmt -> scope_expr | if_expr | for_expr | while_expr | break_expr |
continue_expr | return_expr | declaration_expr | other_expr | unknown_expr_

_declaration_expr -> process_function | process_array | process_variable_

_or_expr -> and_expr rest_or_

_rest_or -> '||' and_expr rest_or | E_

_and_expr -> rel_expr rest_and_

_rest_and -> '&&' rel_expr rest_and | E_

_expr -> id '=' expr | or_expr_

_summ_expr -> mult_expr rest_summ_

_rest_summ -> '+' mult_expr rest_summ | '-' mult_expr rest_summ | E_

_mult_expr -> term_expr rest_mult_

_rest_mult -> '*' term_expr rest_mult | '/' term_expr rest_mult |
'%' term_expr rest_mult | E_

_term_expr -> term | '(' expr ')'_

_term -> integer | real | id | id_or_indexed_

_id_or_indexed -> id | indexed_id_

_indexed_id -> id '[' expr ']' rest_indexed_id_

_rest_indexed_id -> '[' expr ']' rest_indexed_id | E_

Abstract syntax tree
~~~~~~~~~~~~~~~~~~~~

As mentioned earlier, while the parser processes the input file it builds a
tree. In the LitePAC implementation the main parser function is
_programme()_, which returns the root of the tree, where the tree leaves are
terminals and the interior nodes are nonterminals.

[NOTE]
The tree must be unambiguous, which means that each expression can have only
one tree representation (we can't represent one expression in two different
ways with two different trees). The tree is unambiguous because the grammar
was developed correctly.

Each significant language construction is a node. The whole list of node
types, structures and constructors can be found in the __ file. In this
chapter we look at a real example of how the while node is built. According
to the LitePAC grammar we call _while_expr()_ from _stmt()_.
The simplified _while_expr()_ function looks the following way:

-------------------------------------------------------------------------------
static struct ast_node *while_expr()
{
        struct ast_node_while *while_node;
        struct ast_node *expr_node;
        struct ast_node *stmt_node;

        if (!match(TOKEN_LPARENTH))
                err_msg("( expected after `while'");

        /* Parse the loop condition. */
        expr_node = or_expr();
        if (expr_node == NULL)
                err_msg("`while' ( expr ) can't be empty");

        if (!match(TOKEN_RPARENTH))
                err_msg(") expected after `while (expr'");

        /* Parse the loop body. */
        stmt_node = stmt();
        if (stmt_node == NULL)
                err_msg("`while' body statement can't be empty");

        while_node = ast_node_while(expr_node, stmt_node);

        return AST_NODE(while_node);
}
-------------------------------------------------------------------------------

To make the listing shorter and simpler some checks were removed. The
function checks the left bracket, the condition expression (it can't be
empty), the right bracket, and finally parses the loop body (it also can't be
empty). You can see that _or_expr()_ and _stmt()_ are functions and they
return the subtrees of the parsed expressions. Then we build the while node
and return it to the calling function (_stmt()_). In spite of its name the
_or_expr()_ function can return any kind of operation node; following the
grammar rules the parser descends through them. Most other nodes are parsed
in exactly the same way.

Symbol table
------------

Introduction
~~~~~~~~~~~~

Symbol tables keep all the information about declared symbols. Each program
has at least one symbol table -- the global symbol table. Let's look at an
example.

------------------------------------------
int global_var;

void function(int arg1)
{
        int local1, local2, local3;
        ...
        {
                int local1, local2;
                ...
                local1 = local2 + local3;
        }

        local1 = global_var + local2;
}
------------------------------------------

We have a simple example where _global_var_ is the global variable. It lives
in the global symbol table. This table is created by the syntax analyser
before the parsing of the source file.

[NOTE]
There is a simple rule -- create a new symbol table for each scope. A scope
is a new namespace of variables. It means that names from the old tables are
overlapped by names from the new table.

Our function is a new namespace: _arg1_, _local1_, _local2_ and _local3_ live
in the new symbol table, and we still have access to the global variable.
Inside the function we create a new namespace where we declare two variables,
_local1_ and _local2_. Assume that the global table is number *1*, the
function scope is number *2* and the new table is number *3*. Now the names
from the previous tables are overlapped by the new symbols. In our example,
when we add the two variables the first variable _local2_ is from table *3*
and the second variable _local3_ is from table *2*. When we leave the scope
we forget about this table and change the current table back to the previous
one -- number *2*. The second assignment refers to _local1_ and _local2_ from
table *2* and to _global_var_ from the global table.

LitePAC implementation
~~~~~~~~~~~~~~~~~~~~~~

Now let's look at how it works in the LitePAC compiler. The symbol tables are
created by the parser. All tables are kept in a singly linked list, where the
head of the list is the current symbol table. At the very beginning the
current table points to the global table.

 +----------+    +-----------+        +----------+    +----------+
 | table N  | -> | table N-1 | -> ... | table 2  | -> | table 1  |
 +----------+    +-----------+        +----------+    +----------+
  Current table                                        Global table

The following code is the core of the scope handling.
---------------------------------------------------------
static struct ast_node *scope_expr()
{
        struct ast_node_scope *scope_node;
        struct ast_node *stmt_node;
        struct symbol_table *table;

        /* Push a new symbol scope. */
        symbol_table_push();

        /* Parse the scope. */
        stmt_node = stmts();

        /* Keep the table of this scope in the node. */
        table = symbol_table_get_current_table();
        scope_node = ast_node_scope(table, stmt_node);

        /* Pop the symbol scope. */
        symbol_table_pop();

        return AST_NODE(scope_node);
}
---------------------------------------------------------

The code is simplified: the code related to handling function arguments was
removed. When the parser finds +{+ it calls _scope_expr()_ to process the
scope. First of all we push a new symbol table onto the stack; now the
current table is the new, empty one. Then we parse the scope; if the parser
finds declarations it adds them to the current symbol table. After that we
get the current symbol table and create the tree node, attaching the symbol
table to the node. Finally we pop the symbol table from the stack. Now the
current table is the previous one (the bottom table is the global table).

Symbol types
~~~~~~~~~~~~

The LitePAC language is very simple and it has a limited set of symbol types:

- Variable or function argument symbol.
- Array symbol.
- Function or function declaration symbol.

There is an enum with the symbol types:

------------------------------
typedef enum {
        SYMBOL_TYPE_UNKNOWN,
        SYMBOL_TYPE_VARIABLE,
        SYMBOL_TYPE_ARGUMENT,
        SYMBOL_TYPE_ARRAY,
        SYMBOL_TYPE_FUNC,
        SYMBOL_TYPE_FUNC_DECL
} sym_type_t;
------------------------------

There is a base structure:

--------------------------------------
struct symbol {
        sym_type_t type;
        unsigned int ref_count;

        data_type_t decl_type;
        storage_class_t klass;

        char *name;

        const char *file_name;
        unsigned int line_number;

        symbol_release_t release;

        unsigned int nr_references;

        struct frame_pos fr;
};
--------------------------------------

All symbols have a bunch of common fields:

- A declaration type.
- A symbol name.
- A storage class (symbol attributes: *module*, *extern*).
- A file name and the number of the line where it was declared.
- An auxiliary structure for the code generation stage.

Each symbol inherits all these fields, then each type adds its own fields.
All the structures can be found in the __ file.

Symbol lookup
~~~~~~~~~~~~~

A symbol table uses a hash table as the backend, so when we search for a
symbol in the symbol table we actually call a hash table lookup function. In
our implementation we have two lookup functions.

---------------------------------------------------------
struct symbol *symbol_table_lookup_top(const char *name);
struct symbol *symbol_table_lookup_all(const char *name);
---------------------------------------------------------

The first makes a lookup only in the current table, the second makes a lookup
in all tables on the stack. The best explanation of when and how they are
used can be found in the _syntax.c_ file.
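As a rough illustration of when each lookup is typically used (this is a
sketch, not the actual _syntax.c_ code; _symbol_table_insert()_ is an assumed
helper name): the top-table lookup catches redeclarations in the current
scope, while the all-tables lookup resolves a name that is being used.

---------------------------------------------------------
/* Sketch only: symbol_table_insert() is an assumed helper name. */
static void declare_symbol(struct symbol *sym)
{
        /* Only the current scope matters for a redeclaration check. */
        if (symbol_table_lookup_top(sym->name) != NULL)
                err_msg("symbol redeclared in the same scope");

        symbol_table_insert(sym);
}

static struct symbol *resolve_symbol(const char *name)
{
        /* A use of a name may refer to any enclosing scope. */
        struct symbol *sym = symbol_table_lookup_all(name);

        if (sym == NULL)
                err_msg("undeclared symbol");

        return sym;
}
---------------------------------------------------------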
Intermediate representation
---------------------------

Introduction
~~~~~~~~~~~~

The code generation could be performed without converting the tree to an
intermediate representation; it could be done directly from the tree. But
modern compilers convert the tree to another kind of representation which is
close to the machine code but is not the machine code yet. Indeed, the syntax
tree is already an intermediate representation of the source file, but in
this chapter we talk about the three-address code translation.

Three-address code is a form of representing intermediate code. This new
representation is easier to work with than the tree, and all optimization
techniques are applied to this new form. There is also another reason why
this representation is useful: if the compiler has to generate code for
several architectures it is easier to do it from the intermediate
representation, and the lexer, the parser and the tree (the compiler front
end) are separated from the code generation (the compiler back end).

For example, in LitePAC the statement +z = x + y * r+ after the conversion
looks the following way:

-----------------------
t1 = y
t2 = r
t3 = t1 * t2
t4 = x
t5 = t3 + t4
z = t5
-----------------------

If you just think about how a real computer works, you will see that each
line is nearly a real instruction. We can rewrite the last listing with the
instructions of some hypothetical architecture the following way:

-----------------------
LOAD y
LOAD r
MUL
LOAD x
ADD
STORE z
-----------------------

*LOAD* and *STORE* are trivial instructions and may work with registers or
the memory to save the results of instructions. Now you see how simply the
code generation is performed from the three-address code. We will speak about
the LitePAC code generation stage later.

LitePAC quadruples
~~~~~~~~~~~~~~~~~~

The syntaxer works with tokens and, using the rules, creates nodes. The new
subsystem works with nodes and creates quadruples. As we have a useful API
for nodes, we now have a similar API for quadruples. A quadruple is a simple
data structure; each node is mapped to one or several quadruples. The tree is
not a flat data structure, and to convert the tree to quadruples we need to
traverse it. As input this subsystem takes the tree, and as output we get a
linear array of quadruples.

[NOTE]
Why quadruples? In books about writing compilers you can find that there is
indeed a data structure with this name. The main idea is that we need only 4
fields (a 4-tuple) in the data structure, and using basic quadruples we can
represent any kind of node. The fields of each quadruple are used for
different purposes, and in each case we have to interpret the fields
according to the quadruple type. In our implementation we only borrow this
definition; a sketch of a possible layout is shown at the end of this
section.

The subsystem has a very limited set of quadruples:

- Assignment instructions of the form +x = y op z+, where +op+ in our
  implementation is only an arithmetic operation.
- Copy instructions of the form +x = y+.
- An unconditional jump +goto Ln+, where +n+ is the label number.
- Conditional jumps +if x op y goto Ln+, where +op+ is a relational operation
  and +n+ is the label number.
- Function calls: +param x+ for a function argument; if the function returns
  a value, +x = call name, n+; if it doesn't, +call name, n+, where +n+ is
  the number of arguments. A sample call of _func(x1, x2, x3)_ looks the
  following way (arguments are represented in reverse order -- right to
  left):

-------------
param x3
param x2
param x1
call func, n
-------------

- Indexed copy instructions of the form +x = y[i]+ and +x[i] = y+. In our
  implementation they are two different types of quadruples.
- Numbered labels +Ln:+, where +n+ is the label number.
- Also in our implementation we use function and scope quadruples; a scope
  quadruple contains the symbol table for this scope.
- Memory area load and store operations.

This is enough to represent all the constructions of the language. Later, at
the code generation stage, each quadruple is mapped to a block of
architecture dependent instructions. The quadruple API can be found in the __
file, the tree conversion in the _tree2inter.c_ file.
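The actual definitions live in the quadruple header mentioned above. Purely
as an illustration of the 4-tuple idea, a layout could look roughly like this
(the type names and fields below are assumptions, not the real LitePAC
structures):

---------------------------------------------------------------
/* Illustrative only: a possible 4-tuple layout, not the real header. */
typedef enum {
        QUAD_OP,        /* x = y op z        */
        QUAD_COPY,      /* x = y             */
        QUAD_GOTO,      /* goto Ln           */
        QUAD_IF,        /* if x op y goto Ln */
        QUAD_LABEL,     /* Ln:               */
        QUAD_PARAM,     /* param x           */
        QUAD_CALL,      /* x = call name, n  */
        /* ... indexed copies, function/scope markers, memory area ops ... */
} quad_type_t;

struct quad {
        quad_type_t type;       /* which instruction this quadruple encodes */
        struct symbol *arg1;    /* first source operand                     */
        struct symbol *arg2;    /* second source operand                    */
        struct symbol *result;  /* destination temporary or variable        */
        /* Jump targets and argument counts can be encoded in the unused
         * operand slots and interpreted according to the quadruple type.   */
};
---------------------------------------------------------------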
Node to quadruple conversion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Let's look at an example. Assume we have the following statement:

------------------
i = 0;
while (i < 10) {
        b = a[i];
        i = i + 1;
}
------------------

When the parser finished its work, it built the following tree:

NEEDS A PICTURE

Now we convert the tree to the array of quadruples; the resulting output
looks the following way:

---------------------------
t0 = 0
i = t0
L1:
t1 = i
if t1 < 10 goto L2
goto L3
L2:
t2 = i
t3 = a[t2]
b = t3
t4 = i
t5 = 1
t6 = t4 + t5
i = t6
goto L1
L3:
---------------------------

Each line is a quadruple. You can see that the code is not very optimal;
peephole optimizations can fix it.

Peephole optimization
---------------------

At present several basic optimizations are implemented. At this point we
already have an array of quadruples: the previous stage converted the tree to
the array of quadruples, but some sequences of quadruples are not very
effective. The main goal is to find unsatisfactory patterns in the array and
replace them with better ones. For example, using this approach we delete
redundant conditional and unconditional jumps. In the example above the +if+
quadruple and the +goto+ that follows it look bad. A simple optimization
replaces the following code:

----------------------------
if t1 < 10 goto L2
goto L3
L2:
...
L3:
----------------------------

with this code:

----------------------------
if t1 >= 10 goto L3
...
L3:
----------------------------

We don't need the label +L2+ at all: we only need to change the condition in
the +if+ quadruple to the opposite one, delete the label +L2+ and jump to the
label +L3+. All the patterns and explanations can be found in the comments in
the source file _peephole.c_.
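To illustrate how such a pattern can be detected, here is a minimal sketch of
one pass over the quadruple array. It reuses the illustrative quadruple
layout sketched earlier, and the helpers _quad_label()_, _quad_set_label()_,
_quad_invert_relop()_ and _quads_remove()_ are likewise assumed names; the
real patterns live in _peephole.c_.

---------------------------------------------------------------
/* Sketch of the "if ... goto L2; goto L3; L2:" rewrite described above;
 * not the actual peephole.c code. */
static void fold_if_goto(struct quad **quads, unsigned int *nr_quads)
{
        unsigned int i;

        for (i = 0; i + 2 < *nr_quads; i++) {
                struct quad *cond = quads[i];
                struct quad *jump = quads[i + 1];
                struct quad *label = quads[i + 2];

                if (cond->type != QUAD_IF || jump->type != QUAD_GOTO ||
                    label->type != QUAD_LABEL)
                        continue;

                /* The conditional jump must target the label placed right
                 * after the unconditional goto. */
                if (quad_label(cond) != quad_label(label))
                        continue;

                /* Invert the condition, retarget it at the goto's label and
                 * drop the goto and the now unused label. */
                quad_invert_relop(cond);
                quad_set_label(cond, quad_label(jump));
                quads_remove(quads, nr_quads, i + 1, 2);
        }
}
---------------------------------------------------------------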
Code generation
---------------

Introduction
~~~~~~~~~~~~

In this chapter we only describe the design of the code generation subsystem.
For the code generation we have only the linear array of quadruples. Right
now LitePAC has two architecture implementations:

- "Internal" architecture implementation.
- LitePAC virtual machine implementation.

The first implementation is used for debugging and hacking purposes only. It
is not real machine code generation; its output looks like the last sample.
It gives the full picture of how language constructions are represented in
quadruples. This implementation is very simple and it can be found in the
_machine_internal.c_ file. The _vmlang_ program has the option *-i* to choose
this code generation.

The second implementation is described in the separate paper about the
architecture of the LitePAC virtual machine.

Interface
~~~~~~~~~

The code generation subsystem is implemented in a generic way. You can see
more details in the __ file. A programmer who implements code generation for
a specific architecture only needs to write and set callbacks for each
quadruple type.

The structure of callbacks looks the following way:

---------------------------------------------------------------------
struct q_vfuncs {
        void (*q_label)(struct quad *, struct quads_ctx *);
        void (*q_goto)(struct quad *, struct quads_ctx *);
        void (*q_if)(struct quad *, struct quads_ctx *);
        void (*q_call)(struct quad *, struct quads_ctx *);
        void (*q_param)(struct quad *, struct quads_ctx *);
        void (*q_op)(struct quad *, struct quads_ctx *);
        void (*q_return)(struct quad *, struct quads_ctx *);
        void (*q_assign)(struct quad *, struct quads_ctx *);
        void (*q_store_access)(struct quad *, struct quads_ctx *);
        void (*q_load_access)(struct quad *, struct quads_ctx *);

#ifdef _MEM_AREA
        void (*q_store_mem_area)(struct quad *, struct quads_ctx *);
        void (*q_load_mem_area)(struct quad *, struct quads_ctx *);
#endif

        void (*q_func)(struct quad *, struct quads_ctx *);
        void (*q_scope)(struct quad *, struct quads_ctx *);
        void (*q_scope_end)(struct quad *, struct quads_ctx *);
        void (*q_func_end)(struct quad *, struct quads_ctx *);
};
---------------------------------------------------------------------

In the listing you can see all the types of quads we have. When a programmer
has written all the callbacks, he needs to call the following function to set
them up:

-------------------------------------------------
void machine_set_vfuncs(struct q_vfuncs *funcs);
-------------------------------------------------

The last step is running the code generation. There is a function:

-------------------------------------------------------------------------------------------
void machine_gen_code(struct quad **quads, unsigned int quads_size, code_print_out_t func);
-------------------------------------------------------------------------------------------

The first argument is the array of quadruple pointers, the second argument is
its size, and the last one is the print-out function pointer. The print-out
function has the following prototype:

--------------------------------------
void print_out(const char *fmt, ...);
--------------------------------------

If the pointer is NULL, the code generation subsystem uses _vprintf()_ by
default. When a programmer writes his own implementation for an architecture,
he includes __ and gets access to the external _arch_print_out()_ function.
This function is used to print out instructions. Indeed, _arch_print_out()_
is just a function pointer which points to the user-defined print-out
function or to the default one. More explanation of how it works can be found
in the sources. The internal implementation was also written for this purpose
and it is very easy to understand.
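Putting the pieces together, a new back end might look roughly like the
sketch below. Only two callbacks are shown, and _quad_label()_ is again an
assumed accessor for the label number stored in a quadruple; a real back end
fills in the whole +q_vfuncs+ table.

---------------------------------------------------------------
/* Sketch of a minimal back end; quad_label() is an assumed accessor. */
static void my_q_label(struct quad *q, struct quads_ctx *ctx)
{
        arch_print_out("L%u:\n", quad_label(q));
}

static void my_q_goto(struct quad *q, struct quads_ctx *ctx)
{
        arch_print_out("\tJMP L%u\n", quad_label(q));
}

static struct q_vfuncs my_vfuncs = {
        .q_label = my_q_label,
        .q_goto  = my_q_goto,
        /* ... all the remaining callbacks must be set as well ... */
};

void my_arch_gen_code(struct quad **quads, unsigned int nr_quads)
{
        machine_set_vfuncs(&my_vfuncs);

        /* A NULL print-out function selects the default vprintf() based one. */
        machine_gen_code(quads, nr_quads, NULL);
}
---------------------------------------------------------------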