Virtual machine file format
===========================

Overview
--------

A LitePAC project can consist of several source files. Each file is compiled
separately from the others. When _vmlang_ compiles a source file, it produces
an assembler listing. That listing must then be assembled: the _vmasm_ program
converts the assembler source file to an object file. The object file is not
yet executable. To build the executable file we use the _vmlink_ program, which
takes one or several object files and produces one executable file.

The linker does not just gather all files into one. Object files can contain
unresolved symbols. For example, we call a function that lives in another file
or is a library call. So _vmlink_ tries to resolve symbols and patches in the
function address. We look at this process in more detail later. Both programs
use well-known options for assembling and linking files.

So assembling and linking are two different stages. There are several reasons:

- Export only necessary symbols in object and executable files.
- Ability to use project build systems (for example _make_).
- Only changed files need to be recompiled and assembled.
- A program consists of logical blocks (object files).

_vmobjdump_ is a very useful tool for hacking the file format.

File format
-----------

The file format is close to ELF. Some ideas were taken from the ELF
specification. At the same time the file format is much simpler than ELF, at
least because we don't need to relocate symbols dynamically when we begin
executing the file. Object and executable files are platform independent: they
can be built on one platform and executed or rebuilt on another. The byte order
for the file is _big-endian_ (BE).
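
Because the byte order is fixed to big-endian, a reader or writer cannot simply
copy multi-byte fields to and from memory on a little-endian host. The
following is a minimal sketch of byte-order-independent helpers. The names
`put_be32` and `get_be32` are hypothetical and are not taken from the LitePAC
sources, and the sketch uses the standard `uint32_t` in place of the
`u_int32_t` type used in the structures of this paper.

```c
#include <stdint.h>

/* Hypothetical helpers, not part of the LitePAC sources. */

/* Store a 32-bit value in buf[0..3], most significant byte first. */
static void put_be32(unsigned char *buf, uint32_t v)
{
    buf[0] = (unsigned char)(v >> 24);
    buf[1] = (unsigned char)(v >> 16);
    buf[2] = (unsigned char)(v >> 8);
    buf[3] = (unsigned char)v;
}

/* Read a 32-bit big-endian value from buf[0..3]. */
static uint32_t get_be32(const unsigned char *buf)
{
    return ((uint32_t)buf[0] << 24) |
           ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8)  |
            (uint32_t)buf[3];
}
```

Shifting individual bytes instead of casting pointers gives the same result on
any host, which is what makes files built on one platform readable on another.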

The file looks the following way:

  +-------------+ <-- 0 offset
  | file header |
  +-------------+
  | section 1   |
  +-------------+
  | section 2   |
  +-------------+
  |             |
  +-------------+
  | section N   |
  +-------------+ <-- o_shoff
  | sect.head 1 |
  +-------------+
  | sect.head 2 |
  +-------------+
  |             |
  +-------------+
  | sect.head N |
  +-------------+

The file consists of the file header, sections and section headers. There are
7 types of sections:

- Text section (_.text_).
- Initialization section (_.init_).
- Data section (_.data_).
- Section with read-only data, it keeps all constant strings that are used in
  the source file (_.ro_).
- Symbol table section (_.symtab_).
- Relocation table section (_.reloctab_).
- String section, it keeps all strings (identifiers) that are related to this
  object or executable file (_.string_).

For the data section we have only a section header and no section body in the
file. The runtime only needs to know the final size of the section; it will be
filled with zeros.

File header
-----------

Let's look at the file header.

-------------------------------------------------------------------------------
struct obj_header {
    unsigned char o_magic[MAGIC_LEN]; /* ".VM" + '\0' terminator */
    file_type_t   o_type;             /* file type object or executable */
    u_int16_t     o_maj_vers;         /* major version number */
    u_int16_t     o_min_vers;         /* minor version number */
    u_int32_t     o_src_name;         /* source file name - file.vm */
    u_int32_t     o_shoff;            /* section headers table offset in
                                       * file */
    u_int32_t     o_shnum;            /* number of entries in section
                                       * headers table */
    u_int32_t     o_shentsize;        /* the size in bytes of each entry */
    u_int32_t     o_entry;            /* start execution address */
};
-------------------------------------------------------------------------------

The file header has a magic signature and major and minor version numbers.
These fields are used for sanity checks. The _o_type_ field stores the file
type, whether it is an object or an executable file.
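
The magic-signature sanity check can be sketched as follows. This is only an
illustration under stated assumptions: `MAGIC_LEN` is taken to be 4 (the three
characters ".VM" plus the terminator, as the structure comment suggests), and
the helper name `has_vm_magic` is hypothetical, not a function from the LitePAC
sources.

```c
#include <string.h>

#define MAGIC_LEN 4 /* assumption: ".VM" plus the '\0' terminator */

/*
 * Hypothetical helper: returns 1 when the buffer starts with the magic
 * signature, 0 otherwise (including when the buffer is too short).
 */
static int has_vm_magic(const unsigned char *buf, size_t len)
{
    static const unsigned char magic[MAGIC_LEN] = { '.', 'V', 'M', '\0' };

    if (len < MAGIC_LEN)
        return 0;
    return memcmp(buf, magic, MAGIC_LEN) == 0;
}
```

A tool would typically run such a check on the first bytes of the file before
trusting any other header field, and then compare the version numbers against
the versions it understands.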

The _o_src_name_ field stores the offset in the string section where the real
file name is located.

The _o_shoff_ field stores the offset in the file where the first section
header is located. So this field is calculated as the size of the file header
plus the sizes of all sections (see the picture above).

The _o_shnum_ field stores the number of section headers.

The _o_shentsize_ field stores the size of the *section_header* structure.

The last field, _o_entry_, is valid only for executable files. It keeps the
start address, or entry point, of the program.

The header should probably also store a file checksum, so that we could detect
whether the file is corrupt or intact.

Section header
--------------

Let's look at the section header.

------------------------------------------------------------------------------
struct section_header {
    u_int32_t sh_name;    /* section name */
    sh_type_t sh_type;    /* section type */
    u_int32_t sh_off;     /* first byte offset in file */
    u_int32_t sh_size;    /* section size */
    u_int32_t sh_entsize; /* entry size, for example symtable
                           * entry or relocation table entry */
};
------------------------------------------------------------------------------

The _sh_name_ field stores the offset in the string section where the real
section name is.

The _sh_type_ field stores the section type; all types were described above.

The _sh_off_ field stores the offset of the first byte of the section from the
beginning of the file.

The _sh_size_ field stores the size of the section.

The _sh_entsize_ field matters only in two cases: the symbol table section and
the relocation table section. It stores the size of one table entry, that is,
the size of the *sym_tb_entry* structure or of the *rel_tb_entry* structure.
Knowing this size we can calculate the number of entries in the table
(section_size/entry_size). In all other cases this field is zero.

Symbol table
------------

A symbol table contains only symbols that were declared in the global scope of
the original source file. This symbol table is not related to the compiler
symbol table.
They are two different tables. A symbol table entry has the following
structure:

------------------------------------------------------------------------------
struct sym_tb_entry {
    u_int32_t  st_name;   /* symbol name */
    u_int32_t  st_offset; /* offset in section */
    u_int32_t  st_size;   /* size of variable or function */
    sym_type_t st_visib;  /* module, global */
    sh_type_t  st_shtype; /* section type */
};
------------------------------------------------------------------------------

The _st_name_ field stores the offset in the string section where the real
symbol name is.

The _st_offset_ field stores the offset of the symbol in the section given by
_st_shtype_.

The _st_size_ field stores the size of the variable or the function.

The _st_shtype_ field stores the type of the section the symbol lives in. When
the variable or function was declared with the *extern* keyword, the field has
the special value +SHT_UNDEF+ and _st_offset_ is zero. This means that we do
not yet know where the symbol is; the linker resolves it.

The _st_visib_ field stores the symbol visibility, which for now can be global
or module (a module symbol was declared in the source file with the *module*
keyword).

The symbol table is very important. It is always present in object and
executable files. The symbol table is also used during symbol relocation.

Relocation table
----------------

A relocation table section is present only in object files, not in the
executable file. The table is created by _vmasm_ and is used by _vmlink_ to
resolve unknown symbols. A relocation table uses a symbol table. We can have a
relocation record in three cases:

- A variable was declared with the *extern* keyword.
- We call an external function and have only its prototype, meaning that the
  called function can live in another file or be a library call.
- A special case for the read-only section, needed only for the *STRPUSH*
  instruction. We need to backpatch its operand. The operand is the offset in
  the read-only section. _vmlink_ recalculates the offset and patches the
  operand with the new value.

A relocation table entry has the following structure:

------------------------------------------------------------------------------
struct rel_tb_entry {
    u_int32_t r_info;   /* index in a symbol table or the type
                         * of the section (r_flag) */
    u_int32_t r_flag;   /* how to use r_info field */
    u_int32_t r_offset; /* offset in section */
    sh_type_t r_shtype; /* .init or .text */
};
------------------------------------------------------------------------------

The _r_info_ field is interpreted according to the _r_flag_ field. It can be
the index of the symbol in the symbol table (in fact we need only the symbol
name) or it can be a section type (for now only the read-only section type).

The _r_flag_ field can have two values, +RF_SYMTB+ or +RF_SECT+, which select
between the two interpretations of _r_info_ described above.

The _r_offset_ field stores the offset in the _r_shtype_ section. _vmlink_
patches the _r_shtype_ section at this offset.

The _r_shtype_ field stores the type of the section where the relocation is
needed. It can be either the text section or the init section.

More explanation can be found in the _vmlink.c_ file, which is well commented.

vmasm
-----

As we said earlier, the _vmasm_ program converts the assembler listing to an
object file. The _vmasm_ program is very simple: it reads the assembler file
line by line and processes the instructions. Each instruction has its own
handler. The program has a table with all known instructions and their
handlers.

-----------------------------------
struct code_entry {
    const char *name;
    instr_handler_t handler;
} instr_handlers[] = {
    ...
    { "add", arith },
    ...
    { "jge", jump },
    ...
};
-----------------------------------

First the program reads a line and finds the identifier, then looks it up in
the instruction table.
If it finds a match, it calls the handler. The handler then continues parsing
the line and processes the operands (if the instruction has any).

_vmasm_ uses a buffer API to create, fill in and write to the file the main
header, the sections and the section headers.

The most difficult part of the _vmasm_ program is creating symbol table entries
and relocation table entries. An overview of both tables is given earlier in
this paper.

vmlink
------