Runtime virtual machine
=======================

Overview
--------

The runtime environment is a binary interpreter. The interpreter is a finite state automaton, but we call it a virtual machine.

The LitePAC virtual machine is a stack machine. Nevertheless, the machine has some registers. Almost all instructions expect their operands on the stack and store the result there. The stack grows toward _lower_ addresses.

The machine has three registers:

- *sp*, the stack pointer. It points to the top of the stack.
- *pc*, the program counter.
- *fp*, the stack frame pointer. It is used to address arguments and local function variables.

Registers can't be changed manually; only instructions change register values. We use the +%+ sign to distinguish register names from identifiers.

Instruction groups:

- Data instructions.
- Memory instructions.
- Reference instructions.
- Arithmetic instructions.
- Control flow instructions.
- Function call and return instructions.

Opcodes
-------

Data instructions
~~~~~~~~~~~~~~~~~

*PUSH* -- put a four-byte constant on the top of the stack. The constant is in the text section.

Put the constant +0xdeadbeef+ on the stack.

-------------------
PUSH 0xdeadbeef
-------------------

  high address              high address
  +------------+            +------------+
  |            |            |            |
  +------------+ <- %sp     +------------+
  |            |            | 0xdeadbeef |
  +------------+            +------------+ <- %sp
  low address               low address

*STRPUSH* -- put a string constant on the top of the stack. The instruction works with the read-only section. The offset of the string in the read-only section is in the text section. All strings in the read-only section are null-terminated.

Put the constant string on the stack. The string is read from the read-only section at the byte offset +0x12+. A string on the stack occupies a fixed size: even if its length is 5 bytes, the stack grows (or rather shrinks, since it grows toward lower addresses) by +STRING_LEN+ bytes.

------------------
STRPUSH 0x00000012
------------------

  high address              high address
  +------------+            +------------+
  |            |            |            |
  +------------+ <- %sp     +------------+
  |            |            | string is  |
  +------------+            +    here    +
                            |            |
                            +------------+ <- %sp
  low address               low address

Memory instructions
~~~~~~~~~~~~~~~~~~~

All memory instructions have at least two operands:

- The area type.
- A four-byte value that is interpreted according to the area type: it can be an offset in the area, or a value that is decoded for inputs and outputs.

Both operands are placed in the text section.

There are several memory areas:

- Stack.
- Data section.
- Inputs.
- Outputs.

Let's look at two examples.

Put the value of the variable +a+ on the top of the stack. The variable +a+ is in the data section.

-----------
LOAD a
-----------

At the assembling stage, the _vmasm_ program writes to the object file:

. The data section type.
. The calculated offset of the variable +a+ in the data section.

Put a value from the stack (its position is relative to *fp*) on the top of the stack.

-----------
LOAD 4(%fp)
-----------

At the assembling stage, the _vmasm_ program writes to the object file:

. The stack area type.
. The offset value +4+.

There are also the input and output memory types, but we will talk about them later. They are handled in a special way: the second operand is not an offset but a coded value.

[NOTE]
This approach works for all memory instructions. These examples use the *LOAD* instruction because it is the simplest one. Other instructions have additional operands, which are placed on the stack at run time.

*LOAD* -- put a four-byte value from the specified memory area on the top of the stack. It works with two operands:

- The area type.
- The four-byte value (offset or coded value).

Put the value of the variable +a+ on the top of the stack.

---------
LOAD a
---------

  high address              high address
  +------------+            +------------+
  |            |            |            |
  +------------+ <- %sp     +------------+
  |            |            |     a      |
  +------------+            +------------+ <- %sp
  low address               low address

*STORE* -- take a four-byte value from the top of the stack and put it in the specified memory area. It works with three operands:

- The area type.
- The four-byte value (offset or coded value).
- The assigned value, which is on the top of the stack.

Assign the number +127+ to +a+.

---------
STORE a
---------

  high address              high address
  +------------+            +------------+
  |            |            |            |
  +------------+            +------------+ <- %sp
  |    127     |            |            |
  +------------+ <- %sp     +------------+
  low address               low address

*ALOAD* -- put a four-byte value from an array in the specified memory area on the top of the stack. It works with three operands:

- The area type.
- The four-byte value (offset or coded value).
- The array index (on the top of the stack).

Put the value of +a[10]+ on the top of the stack. Let's say that +a[10]+ holds +314+.

--------
ALOAD a
--------

  high address              high address
  +------------+            +------------+
  |            |            |            |
  +------------+            +------------+
  |     10     |            |    314     |
  +------------+ <- %sp     +------------+ <- %sp
  low address               low address

*ASTORE* -- take a four-byte value from the stack and put it in an array in the specified memory area. It works with four operands:

- The area type.
- The four-byte value (offset or coded value).
- The array index (on the top of the stack).
- The assigned value (it is below the array index).

Assign the number +163+ to +a[10]+.

---------
ASTORE a
---------

  high address              high address
  +------------+            +------------+
  |            |            |            |
  +------------+            +------------+ <- %sp
  |    163     |            |            |
  +------------+            +------------+
  |     10     |
  +------------+ <- %sp
  low address               low address

*STRLOAD* -- put a string from the specified memory area on the top of the stack. It works with two operands:

- The area type.
- The four-byte value (offset only).

Put the string +str+ on the top of the stack.

------------
STRLOAD str
------------

  high address              high address
  +------------+            +------------+
  |            |            |            |
  +------------+ <- %sp     +------------+
  |            |            | string is  |
  +------------+            +    here    +
                            |            |
                            +------------+ <- %sp
  low address               low address

*STRSTORE* -- take a string from the top of the stack and put it in the specified memory area. It works with three operands:

- The area type.
- The four-byte value (offset only).
- The string itself.

Assign a string from the stack to +str+.

-------------
STRSTORE str
-------------

  high address              high address
  +------------+            +------------+
  |            |            |            |
  +------------+            +------------+ <- %sp
  | string is  |            |            |
  +    here    +            +------------+
  |            |
  +------------+ <- %sp
  low address               low address

*ASTRLOAD* -- put a string from an array in the specified memory area on the top of the stack. It works with three operands:

- The area type.
- The four-byte value (offset only).
- The array index (on the top of the stack).

Put the string from +str[10]+ on the top of the stack.

------------
ASTRLOAD str
------------

  high address              high address
  +------------+            +------------+
  |            |            |            |
  +------------+            +------------+
  |     10     |            | string is  |
  +------------+ <- %sp     +    here    +
                            |            |
                            +------------+ <- %sp
  low address               low address

*ASTRSTORE* -- take a string from the stack and put it in an array in the specified memory area. It works with four operands:

- The area type.
- The four-byte value (offset only).
- The array index (on the top of the stack).
- The string itself (it is below the array index).

Assign a string from the stack to +str[10]+.

-------------
ASTRSTORE str
-------------

  high address              high address
  +------------+            +------------+
  |            |            |            |
  +------------+            +------------+ <- %sp
  | string is  |            |            |
  +    here    +            +------------+
  |            |
  +------------+
  |     10     |
  +------------+ <- %sp
  low address               low address

Reference instructions
~~~~~~~~~~~~~~~~~~~~~~

In the LitePAC language, references can be used only as function arguments.
This is enough for the language, but it could easily be extended to support full-featured references. With a reference we do not work with a copy of a variable; we work with the original data. This means that when we change the value of a reference variable, the change is visible to the calling code.

In fact, a reference variable does not hold the variable's value. It holds just a "pointer" (an offset in a section) to the variable. Since there are different sections (the stack and the data section), a user can change a variable either on the stack or in the data section. Thus it is not enough for a reference variable to keep just the offset in the section; it also needs to know which section that is. For this reason a reference variable always occupies _eight_ bytes regardless of the data type (integer, float, or string).

[NOTE]
From this point on, a reference is an 8-byte variable. It consists of the section type and the absolute offset in that section.

To make the description clear, we divide the reference instructions into two groups:

- Instructions that create references and put them on the stack.
- Instructions that follow a reference to read or write data.

Reference create instructions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

*REFPUT* -- put a reference to a variable on the top of the stack.

Let's say we have a function prototype:

-----------------------------
void func(int byref var);
-----------------------------

Somewhere we call this function:

-----------------------------
void init()
{
    int a;
    ...
    func(a);
    ...
}
-----------------------------

Since the prototype declares the argument as a reference, the generated code looks like this:

-----------------
REFPUT a
CALL func
-----------------

This is how the stack looks just before the *CALL* instruction. After the call it looks different; see the chapter about the *CALL* instruction for more details.

  high address              high address
  +------------+            +------------+
  |            |            |            |
  +------------+            +------------+
                            | ref. base  |
                            +------------+
                            | abs. offs  |
                            +------------+ <- %sp
  low address               low address

[NOTE]
A *reference base* is the type of the section (stack or data section); an absolute offset is an offset in the specified section.

*REFCOPY* -- copy a reference.

Let's look at an example. Assume that we have two function prototypes:

--------------------------
void func1(int byref var);
void func2(int byref var);
--------------------------

First we call the first function, and then inside the first function we call the second one.

---------------------------
void func1(int byref var)
{
    ...
    func2(var);
    ...
}

void init()
{
    int a;
    ...
    func1(a);
    ...
}
---------------------------

The _func2()_ argument is already a reference, so we do not make a reference to a reference; we just copy the original reference. This instruction copies 8 bytes and puts them on the top of the stack.

--------------
REFCOPY 8(%fp)
CALL func2
...
REFPUT a
CALL func1
...
--------------

Reference modify instructions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

All load and store reference instructions work in a similar way. They have two operands, both in the text section:

- The area type (currently only the stack).
- The four-byte value (offset).

Once an instruction has read these two operands it can reach the actual reference, and then follow the reference to get to the data. Load reference instructions read the value that the reference points to. Store reference instructions write the value there.

The stack looks as follows. There is a reference somewhere on the stack. In this picture the reference is +8+ bytes above the *fp* register (the first function argument). This means that the first operand of the instruction is the stack area and the second operand is +8+. Again, both of them are read from the text section.

  high address
  +------------+
  |            |
  +------------+
  | ref. base  |
  +------------+ +12(%fp)
  | abs. offs  |
  +------------+ +8(%fp)
  |    ....    |
  +            +
  |    ....    |
  +------------+ <- %sp
  low address

[NOTE]
This can be a little confusing, because the instruction has two operands (the memory area and the offset), and a reference also consists of two values: the section type (the reference base) and the absolute offset, prepared by either the *REFCOPY* or the *REFPUT* instruction. The first pair is needed to find where the reference is. The second pair is used to follow the reference.

In the next examples we use the +8(%fp)+ operand, which means +8+ bytes above the *fp* register, that is, the first function argument. So we assume that the following pieces of code relate to this function prototype:

--------------------------
void test(int byref ref);
--------------------------

*REFLOAD* -- read the reference and put the original four-byte value on the top of the stack.

---------------
REFLOAD 8(%fp)
---------------

  high address              high address
  +------------+            +------------+
  |            |            |            |
  +------------+ <- %sp     +------------+
                            |   result   |
                            +------------+ <- %sp
  low address               low address

*REFSTORE* -- read the reference and write the four-byte value from the top of the stack to the place the reference points to.

----------------
REFSTORE 8(%fp)
----------------

  high address              high address
  +------------+            +------------+
  |            |            |            |
  +------------+            +------------+ <- %sp
  |    327     |
  +------------+ <- %sp
  low address               low address

*REFSTRLOAD* -- read the reference and put the original string on the top of the stack.

------------------
REFSTRLOAD 8(%fp)
------------------

  high address              high address
  +------------+            +------------+
  |            |            |            |
  +------------+ <- %sp     +------------+
                            | string is  |
                            |    here    |
                            +------------+ <- %sp
  low address               low address

*REFSTRSTORE* -- read the reference and write the string from the top of the stack to the place the reference points to.

-------------------
REFSTRSTORE 8(%fp)
-------------------

  high address              high address
  +------------+            +------------+
  |            |            |            |
  +------------+            +------------+ <- %sp
  | string is  |
  |    here    |
  +------------+ <- %sp
  low address               low address

Arithmetic instructions
~~~~~~~~~~~~~~~~~~~~~~~

There are two types of arithmetic instructions:

- Integer arithmetic instructions.
- Float arithmetic instructions.

These instructions always expect two operands on the top of the stack and return the result on the stack.

----------------
PUSH 0x12121212
PUSH 0x34343434
ADD
----------------

  high address              high address
  +------------+            +------------+
  |            |            |            |
  +------------+            +------------+
  | 0x12121212 |            |   result   |
  +------------+            +------------+ <- %sp
  | 0x34343434 |
  +------------+ <- %sp
  low address               low address

The *ADD* instruction is used only as an example. All arithmetic instructions act the same way.

*ADD SUB MUL DIV* -- integer arithmetic instructions.

*FADD FSUB FMUL FDIV* -- float arithmetic instructions.

Control flow instructions
~~~~~~~~~~~~~~~~~~~~~~~~~

This group of instructions consists of:

- The unconditional jump instruction.
- Compare instructions.
- Conditional jump instructions.

*JMP* -- the unconditional jump instruction. It has only one operand: an offset relative to the current *pc*. The operand is added to *pc*. The operand is placed in the text section.

*CMP FCMP STRCMP* -- compare instructions. Compare instructions work with two operands. Like arithmetic instructions, they return a value on the top of the stack, but they return only +0+, +1+ or +-1+. The return value is used as one of the operands of the conditional jump instructions. For *CMP* and *FCMP* the return value is +0+ when the operands are equal, +1+ when the first operand is bigger, and +-1+ when the second operand is bigger. *STRCMP* acts the same way as the _strcmp()_ function.

*JG JGE JL JLE JEQ JNEQ* -- conditional jump instructions.
Conditional jump instructions work with two operands:

- The result of the last compare instruction: +1+, +0+ or +-1+ (it is on the top of the stack).
- An offset relative to the current *pc*. It is placed in the text section.

The *pc* is changed in the following cases:

- *JG* changes *pc* if the first operand is +1+.
- *JGE* changes *pc* if the first operand is +0+ or +1+.
- *JL* changes *pc* if the first operand is +-1+.
- *JLE* changes *pc* if the first operand is +0+ or +-1+.
- *JEQ* changes *pc* if the first operand is +0+.
- *JNEQ* changes *pc* if the first operand is +1+ or +-1+.

Otherwise the program continues execution with the next instruction.

Function call and return instructions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This chapter only lists the instructions. Their full usage is described in the following chapters.

*CALL* -- call a function. It puts the return address on the top of the stack and loads the function address into the *pc* register. There is only one operand in the text section: the function address.

*ENT* -- the prologue instruction. It makes some preparations for function execution (more in the next chapter).

*LEAVE* -- the epilogue instruction. It is the opposite of the prologue.

*GROW* -- grow the stack size. There is only one operand (in the text section): the size in bytes.

*SHRINK* -- shrink the stack size. There is only one operand (in the text section): the size in bytes.

Function call
-------------

In this chapter and the next one we use the following example.

A function prototype:

--------------------------------------------
string test(int arg1, float arg2, int arg3);
--------------------------------------------

A function call:

--------------------------------------------
{
    string str;
    ...
    str = test(a, 2.1, 15);
    ...
}
--------------------------------------------

Function arguments are pushed on the stack in reverse order (right to left).

The following sequence of assembler instructions puts the arguments on the stack.

------------
PUSH 15
PUSH 2.1
LOAD a
------------

  high address
  +-------------+
  |             |
  +-------------+
  |     15      |
  +-------------+
  |     2.1     |
  +-------------+
  |      a      |
  +-------------+ <- %sp
  |             |
  +-------------+
  low address

Call the function once all arguments have been pushed.

------------
CALL test
------------

  high address
  +-------------+
  |             |
  +-------------+
  |     15      |
  +-------------+
  |     2.1     |
  +-------------+
  |      a      |
  +-------------+
  | ret address |
  +-------------+ <- %sp
  |             |
  +-------------+
  low address

The call instruction puts the return address on the top of the stack and loads the function address into the *pc* register.

Call stack
----------

Each function starts its execution with a _prologue_. The prologue saves the previous value of the *fp* register and fixes a new value of *fp*. Once the new *fp* is fixed, we can address all variables and arguments of the function without caring about *sp*. We know only that *sp* points to the result of the last operation.

---------
ENT
---------

  high address
  +-------------+
  |             |
  +-------------+
  |    arg3     |
  +-------------+
  |    arg2     |
  +-------------+
  |    arg1     |
  +-------------+
  | ret address |
  +-------------+
  |     %fp     | <- previous frame pointer
  +-------------+ <- %sp (new fixed %fp, %fp == %sp)
  |             |
  +-------------+
  low address

At this point we shrink the stack size to make room for the function's local variables. We shrink and grow the stack size each time we enter a new scope or leave it, respectively. Since the stack grows toward lower addresses, we need to shrink it to reserve space, and vice versa. In our example we need room for three 4-byte variables.

----------
SHRINK 12
----------

  high address
  +-------------+
  |             |
  +-------------+
  |    arg3     |
  +-------------+ +16(%fp)
  |    arg2     |
  +-------------+ +12(%fp)
  |    arg1     |
  +-------------+ +8(%fp)
  | ret address |
  +-------------+ +4(%fp)
  |     %fp     |
  +-------------+ <- %fp
  | local var1  |
  +-------------+ -4(%fp)
  | local var2  |
  +-------------+ -8(%fp)
  | local var3  |
  +-------------+ -12(%fp) <- %sp
  |             |
  +-------------+
  low address

The parts described above are common to all functions. After that the function body is executed by the virtual machine. Most instructions use the stack as a holder for their operands. The stack in the virtual machine is balanced, which means that it can't overflow: most instructions pop their operands and push the result back on the top of the stack.

If a function returns a value, we need to move it to the right place at the end of the function execution, just before the epilogue. Besides all the function arguments, the calling code also reserves a place for the return value. So the result of the last operation has to be moved to the return value place.

  high address
  +-------------+
  |             |
  +-------------+
  | return value|
  |    place    |
  |             |
  +-------------+ +20(%fp)
  |    arg3     |
  +-------------+ +16(%fp)
  |    arg2     |
  +-------------+ +12(%fp)
  |    arg1     |
  +-------------+ +8(%fp)
  | ret address |
  +-------------+ +4(%fp)
  |     %fp     |
  +-------------+ <- %fp
  | local var1  |
  +-------------+ -4(%fp)
  | local var2  |
  +-------------+ -8(%fp)
  | local var3  |
  +-------------+ -12(%fp) <- %sp (in case of the void return value)
  | last operat.|
  |   result    |
  |             |
  +-------------+ <- %sp
  |             |
  +-------------+
  low address

If a function does not return a value (the void data type), this code is not generated. Otherwise we need to put the return value into the return value place, which lies beyond all the arguments, with a single instruction that depends on the return data type.

In case of a four-byte return value:

----------------
LOAD 20(%fp)
----------------

In case of a string return value:

----------------
STRLOAD 20(%fp)
----------------

The top of the stack holds the result of the last operation, so it is our return value.

The last instruction in a function is *LEAVE*; this is the _epilogue_. Every function finishes with the epilogue. It performs the opposite actions to the prologue. It also takes the return address from the top of the stack and changes *pc* to return to the calling code.

------
LEAVE
------

  high address
  +-------------+
  | return value|
  |             |
  |             |
  +-------------+
  |    arg3     |
  +-------------+
  |    arg2     |
  +-------------+
  |    arg1     |
  +-------------+ <- %sp
  |             |
  +-------------+
  low address

The calling code grows the stack to remove all the arguments.

-------
GROW 12
-------

Now only the return value is on the top of the stack.

  high address
  +-------------+
  | return value|
  |             |
  |             |
  +-------------+ <- %sp
  low address

Library call
------------

The runtime environment has a gate to the "real" world. We can call external functions of the operating system. They can be written in C. There are library header files and library object files. The former provide function prototypes; the latter provide the interface to the real call. Library object files are ordinary object files whose symbol table is full of external symbols, so we can link our files against them. Library object files are very simple wrappers around real calls.

*LIBCALL* -- the real call to an external function. It has only one operand in the text section: the function number. Real calls are registered in the library call table when the machine starts up. They come from shared libraries (.so binary files). Each library call has its own number.

The library call has the following structure:

------------------------------------
struct libcall {
    const char *name;
    unsigned int num;
    syscall_handler_t handler;
};
------------------------------------

The _name_ field stores the name of the call. The _num_ field stores the number of the call. The _handler_ field is the function handler; the handler takes one argument, the machine context structure (*struct vm*).

A call has to be registered with the following function, which fills in a record of the call table.

---------------------------------------------
ret_t libcall_register(struct libcall *lc);
---------------------------------------------

When the machine encounters a *LIBCALL* instruction, it extracts the call number from the text section and calls the following function.

---------------------------------------------------------------
ret_t libcall_call_by_num(struct vm *machine, u_int32_t num);
---------------------------------------------------------------

This function finds the real handler in the table and calls it. At this point we leave the machine. The called code can do anything with the machine. At the very least it should follow the function prototype, so that the machine is still in a working state when the called code returns.
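
To make the registration and dispatch steps more concrete, here is a minimal sketch of how a call table could work. Only the +struct libcall+ fields and the two function names come from the text above. Everything else is an assumption made for illustration: the +ret_t+ values (+RET_OK+, +RET_ERR+), the shape of +syscall_handler_t+, the tiny +struct vm+ with an array-based stack, the fixed-size table, the linear lookup, and the +lib_add+ and +run_demo+ helpers. The sketch also uses +uint32_t+ from +<stdint.h>+ in place of +u_int32_t+ for portability.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed for this sketch: ret_t is a plain int with two values. */
typedef int ret_t;
#define RET_OK   0
#define RET_ERR  (-1)

/* Hypothetical machine context: a small stack that grows downward. */
struct vm {
    int32_t stack[16];
    int     sp;          /* index of the top element; decreases on push */
};

/* Assumed handler shape: one argument, the machine context. */
typedef ret_t (*syscall_handler_t)(struct vm *machine);

struct libcall {
    const char *name;
    unsigned int num;
    syscall_handler_t handler;
};

#define LIBCALL_MAX 32
static struct libcall libcall_table[LIBCALL_MAX];
static size_t libcall_count;

/* Copy the record into the table; reject bad input, duplicates, overflow. */
ret_t libcall_register(struct libcall *lc)
{
    if (lc == NULL || lc->handler == NULL || libcall_count == LIBCALL_MAX)
        return RET_ERR;
    for (size_t i = 0; i < libcall_count; i++)
        if (libcall_table[i].num == lc->num)
            return RET_ERR;
    libcall_table[libcall_count++] = *lc;
    return RET_OK;
}

/* Find the handler by number and hand the machine context over to it. */
ret_t libcall_call_by_num(struct vm *machine, uint32_t num)
{
    for (size_t i = 0; i < libcall_count; i++)
        if (libcall_table[i].num == num)
            return libcall_table[i].handler(machine);
    return RET_ERR;
}

/* Hypothetical library function: pops two integers and pushes their sum,
 * so the stack stays balanced when control returns to the machine. */
static ret_t lib_add(struct vm *m)
{
    int32_t b = m->stack[m->sp++];
    int32_t a = m->stack[m->sp++];
    m->stack[--m->sp] = a + b;
    return RET_OK;
}

/* Register lib_add as call number 7, run it on 40 and 2, and return
 * the value left on the top of the stack. Intended to be called once,
 * because a second registration of number 7 is rejected. */
int32_t run_demo(void)
{
    struct libcall lc = { "add", 7, lib_add };
    struct vm m;

    memset(&m, 0, sizeof m);
    m.sp = 16;                 /* empty stack: sp is one past the end */
    m.stack[--m.sp] = 40;
    m.stack[--m.sp] = 2;

    if (libcall_register(&lc) != RET_OK)
        return -1;
    if (libcall_call_by_num(&m, 7) != RET_OK)
        return -1;
    assert(m.sp == 15);        /* two operands popped, one result pushed */
    return m.stack[m.sp];
}
```

In this sketch the handler plays by the same rule as an arithmetic instruction: it consumes its operands and leaves exactly one result, which is one way to read the requirement that the called code keeps the machine in a working state. A real table would more likely be filled from the shared libraries at start-up and might index handlers directly by number instead of searching.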