Virtual machine file format
===========================

Overview
--------
A LitePAC project can consist of several source files. Each file is compiled
separately from the others. When _vmlang_ compiles a source file, it produces
an assembler listing. That listing must then be assembled: the _vmasm_ program
converts the assembler source file into an object file. The object file is not
yet executable. To build the executable file we use the _vmlink_ program, which
takes one or more object files and produces one executable file. The linker
does not simply concatenate the files. Object files can contain unresolved
symbols, for example when we call a function that lives in another file or in
a library. _vmlink_ tries to resolve such symbols and patches in the function
address. We look at this process in more detail later. Both programs use
well-known options for assembling and linking files.


Assembling and linking are therefore two separate stages. There are several
reasons for this:

- Only the necessary symbols are exported in object and executable files.
- Build systems (for example _make_) can be used.
- Only changed files need to be recompiled and assembled.
- A program consists of logical blocks (object files).

_vmobjdump_ is a very useful tool for examining files in this format.


File format
-----------
The file format is close to ELF, and some ideas were taken from the ELF
specification. At the same time the format is much simpler than ELF, not least
because we do not need to relocate symbols dynamically when the file starts
executing.

Object and executable files are platform independent: they can be built on
one platform and executed or rebuilt on another. The byte order of the
file is _big-endian_ (BE).
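
Because the byte order is fixed, multi-byte fields have to be converted
explicitly rather than copied straight from memory. The following is a minimal
sketch of such helpers. The names +put_be32+ and +get_be32+ are hypothetical
and are not taken from the LitePAC sources.

```c
#include <stdint.h>

/* Store a 32-bit value in buf[0..3], most significant byte first. */
void put_be32(unsigned char *buf, uint32_t v)
{
    buf[0] = (unsigned char)(v >> 24);
    buf[1] = (unsigned char)(v >> 16);
    buf[2] = (unsigned char)(v >> 8);
    buf[3] = (unsigned char)v;
}

/* Load a 32-bit big-endian value from buf[0..3]. */
uint32_t get_be32(const unsigned char *buf)
{
    return ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];
}
```

Shifting byte by byte gives the same result on little-endian and big-endian
hosts, which is what makes the files portable between platforms.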

The file is laid out as follows:


        +-------------+ <-- 0 offset
        | file header |
        +-------------+
        |  section 1  |
        +-------------+
        |  section 2  |
        +-------------+
        |     ...     |
        +-------------+
        |  section N  |
        +-------------+ <-- o_shoff
        | sect.head 1 |
        +-------------+
        | sect.head 2 |
        +-------------+
        |     ...     |
        +-------------+
        | sect.head N |
        +-------------+

The file consists of the file header, sections and section headers.

There are 7 types of sections:

- Text section (_.text_).
- Initialization section (_.init_).
- Data section (_.data_).
- Read-only data section, which holds all constant strings that are used in
the source file (_.ro_).
- Symbol table section (_.symtab_).
- Relocation table section (_.reloctab_).
- String section, which holds all strings (identifiers) that are related to
this object or executable file (_.string_).

The data section has a section header but no section body in the file.
The runtime only needs to know the final size of the section, which it fills
with zeros.


File header
-----------
Let's look at the file header.

-------------------------------------------------------------------------------
struct obj_header {
        unsigned char   o_magic[MAGIC_LEN]; /* ".VM" + '\0' terminator */
        file_type_t     o_type;             /* file type object or executable */
        u_int16_t       o_maj_vers;         /* major version number */
        u_int16_t       o_min_vers;         /* minor version number */
        u_int32_t       o_src_name;         /* source file name - file.vm */
        u_int32_t       o_shoff;            /* section headers table offset in
                                             * file */
        u_int32_t       o_shnum;            /* number of entries in section
                                             * headers table */
        u_int32_t       o_shentsize;        /* the size in bytes of each entry */
        u_int32_t       o_entry;            /* start execution address */
};
-------------------------------------------------------------------------------

The file header has a magic signature, major and minor version numbers. These
fields are used for sanity checks.

The _o_type_ field stores the file type: object file or executable file.

The _o_src_name_ field stores the offset in the string section where the real
file name is located.

The _o_shoff_ field stores the offset in the file where the first section header
is located. It is therefore equal to the size of the file header plus the sizes
of all sections stored in the file (see the diagram above). The _o_shnum_ field
stores the number of section headers. The _o_shentsize_ field stores the size
of the *section_header* structure.
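
The calculation of _o_shoff_ can be sketched as follows. This is an
illustration only: the +sect_plan+ structure, the +sect_kind+ enumeration and
the +plan_layout+ function are hypothetical, and the sketch assumes that a
data section contributes no bytes to the file and simply records the current
offset in its header.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical section kinds; only the "no bytes in the file" property
 * of the data section matters for this sketch. */
enum sect_kind { SECT_HAS_BYTES, SECT_DATA_NO_BYTES };

struct sect_plan {
    enum sect_kind kind;
    uint32_t       size;  /* becomes sh_size */
    uint32_t       off;   /* becomes sh_off, filled in by plan_layout() */
};

/* Place the sections one after another behind the file header and
 * return the offset that follows the last one, i.e. o_shoff. */
uint32_t plan_layout(struct sect_plan *s, size_t n, uint32_t header_size)
{
    uint32_t off = header_size;
    size_t i;

    for (i = 0; i < n; i++) {
        s[i].off = off;
        if (s[i].kind == SECT_HAS_BYTES)
            off += s[i].size;   /* data sections occupy no file space */
    }
    return off;
}
```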

The last field _o_entry_ is valid only for executable files. It keeps the start
address or the entry point of the program.

The header should probably also store a checksum of the file, so that we can
detect whether the file is corrupt.
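
The sanity checks mentioned above might look like the sketch below. It works
on a header that has already been decoded into memory. Several things here are
assumptions: +MAGIC_LEN+ is taken to be 4 (the three characters of ".VM" plus
the terminator), +file_type_t+ is replaced by a 32-bit integer, the
+u_int*_t+ types are written as their +<stdint.h>+ equivalents, and the
function +header_is_sane+ is hypothetical. Version checks are left out because
the valid version numbers are not given here.

```c
#include <stdint.h>
#include <string.h>

#define MAGIC_LEN 4              /* assumption: ".VM" plus '\0' */

typedef uint32_t file_type_t;    /* assumption: the real type is not shown */

struct obj_header {
    unsigned char o_magic[MAGIC_LEN];
    file_type_t   o_type;
    uint16_t      o_maj_vers;
    uint16_t      o_min_vers;
    uint32_t      o_src_name;
    uint32_t      o_shoff;
    uint32_t      o_shnum;
    uint32_t      o_shentsize;
    uint32_t      o_entry;
};

/* Return 1 if the decoded header passes basic checks, 0 otherwise.
 * expected_shentsize is the section header size the reader understands. */
int header_is_sane(const struct obj_header *h, uint32_t file_size,
                   uint32_t expected_shentsize)
{
    uint64_t table_end;

    if (memcmp(h->o_magic, ".VM", MAGIC_LEN) != 0)
        return 0;
    if (h->o_shentsize != expected_shentsize)
        return 0;

    /* The whole section header table must fit inside the file.
     * 64-bit arithmetic avoids overflow on hostile input. */
    table_end = (uint64_t)h->o_shoff +
                (uint64_t)h->o_shnum * h->o_shentsize;
    return table_end <= file_size;
}
```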


Section header
--------------
Let's look at the section header.

------------------------------------------------------------------------------
struct section_header {
        u_int32_t       sh_name;        /* section name */
        sh_type_t       sh_type;        /* section type */
        u_int32_t       sh_off;         /* first byte offset in file */
        u_int32_t       sh_size;        /* section size */
        u_int32_t       sh_entsize;     /* entry size, for example symtable
                                         * entry or relocation table entry */
};
------------------------------------------------------------------------------

The _sh_name_ field stores the offset in the string section where the real
section name is.
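
Name lookups of this kind (_sh_name_, _o_src_name_, _st_name_) all reduce to
indexing into the string section. A minimal sketch follows, assuming that the
strings in the section are NUL-terminated as in ELF, which is not stated
explicitly here. The function name +strtab_get+ is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Return the string that starts at offset `off` in the string section,
 * or NULL if the offset is out of range or the string is unterminated. */
const char *strtab_get(const char *tab, size_t tab_size, uint32_t off)
{
    if (off >= tab_size)
        return NULL;
    if (memchr(tab + off, '\0', tab_size - off) == NULL)
        return NULL;
    return tab + off;
}
```

Checking for the terminator inside the section keeps a damaged file from
making the reader run past the end of the buffer.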

The _sh_type_ field stores the section type. The section types are listed
above.

The _sh_off_ field stores the offset of the first byte of the section from the
beginning of the file.

The _sh_size_ field stores the size of the section.

The _sh_entsize_ field is meaningful in only two cases, the symbol table and
the relocation table. It stores the size of one table entry: the size of the
*sym_tb_entry* structure or of the *rel_tb_entry* structure. Knowing this size
we can calculate the number of entries in the table (section_size/entry_size).
In all other cases this field is zero.
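
The entry count calculation is a single division, but a reader should guard
against a zero or inconsistent entry size. A minimal sketch, with the
hypothetical function name +table_entry_count+:

```c
#include <stdint.h>

/* Compute the number of entries in a table section.
 * Returns 0 and stores the count on success; returns -1 if sh_entsize is
 * zero or sh_size is not a whole number of entries. */
int table_entry_count(uint32_t sh_size, uint32_t sh_entsize, uint32_t *count)
{
    if (sh_entsize == 0 || sh_size % sh_entsize != 0)
        return -1;
    *count = sh_size / sh_entsize;
    return 0;
}
```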


Symbol table
------------
A symbol table contains only symbols that were declared in the global scope
of the original source file. This symbol table is not related to the compiler
symbol table. They are two different tables.

A symbol table entry has the following structure:

------------------------------------------------------------------------------
struct sym_tb_entry {
        u_int32_t       st_name;        /* symbol name */
        u_int32_t       st_offset;      /* offset in section */
        u_int32_t       st_size;        /* size of variable or function */
        sym_type_t      st_visib;       /* module, global */
        sh_type_t       st_shtype;      /* section type */
};
------------------------------------------------------------------------------

The _st_name_ field stores the offset in the string section where the real
symbol name is.

The _st_offset_ field stores the offset in the section _st_shtype_.

The _st_size_ field stores the size of the variable or the function.

The _st_shtype_ field stores the type of the section the symbol lives in. If
the variable or function was declared with the *extern* keyword, the field
holds the special value +SHT_UNDEF+ and _st_offset_ is zero. This means that
we do not yet know where the symbol is. The linker resolves it.

The _st_visib_ field stores the symbol visibility. At present it can be global
or module (a module symbol was declared in the source file with the *module*
keyword).

The symbol table is very important. It is always present in both object and
executable files. The symbol table is also used when relocating symbols.
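
To make the role of +SHT_UNDEF+ concrete, the sketch below searches a decoded
symbol table for a definition that could satisfy an undefined symbol. It is an
illustration, not the _vmlink_ algorithm. The +sym+ structure, the function
+find_definition+, the names +SYM_MODULE+ and +SYM_GLOBAL+ and all numeric
values are hypothetical. The sketch also assumes that module symbols are
private to their file, which is not stated explicitly here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { SHT_UNDEF = 0, SHT_TEXT = 1 };      /* values are assumptions */
enum { SYM_MODULE = 0, SYM_GLOBAL = 1 };   /* hypothetical names and values */

/* A symbol after decoding: st_name has already been turned into a string. */
struct sym {
    const char *name;
    uint32_t    st_offset;
    int         st_visib;
    int         st_shtype;
};

/* Return the first global, defined symbol called `name`, or NULL. */
const struct sym *find_definition(const struct sym *tab, size_t n,
                                  const char *name)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (tab[i].st_shtype != SHT_UNDEF &&
            tab[i].st_visib == SYM_GLOBAL &&
            strcmp(tab[i].name, name) == 0)
            return &tab[i];
    }
    return NULL;
}
```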

 
Relocation table
----------------
A relocation table section is present only in object files, not in the
executable file. The table is created by _vmasm_ and is used by _vmlink_ to
resolve unknown symbols. A relocation table refers to the symbol table.

A relocation record is created in three cases:

- A variable was declared with the *extern* keyword.
- We call an external function for which we have only the prototype. The
called function can live in another file or be a library call.
- A special case for the read-only section, needed only for the *STRPUSH*
instruction. Its operand, the offset in the read-only section, must be
backpatched: _vmlink_ recalculates the offset and patches the operand with the
new value.

A relocation table entry has the following structure:

------------------------------------------------------------------------------
struct rel_tb_entry {
        u_int32_t       r_info;         /* index in a symbol table or the type
                                         * of the section (r_flag) */
        u_int32_t       r_flag;         /* how to use r_info field */
        u_int32_t       r_offset;       /* offset in section */
        sh_type_t       r_shtype;       /* .init or .text */
};
------------------------------------------------------------------------------

The _r_info_ field is interpreted according to the _r_flag_ field. It is either
the index of the symbol in the symbol table (in fact we need only the symbol
name) or a section type (at present only the read-only section type).

The _r_flag_ field has one of two values, +RF_SYMTB+ or +RF_SECT+, which select
between these two interpretations of _r_info_.

The _r_offset_ field stores the offset in the _r_shtype_ section. _vmlink_
patches the _r_shtype_ section at this offset.

The _r_shtype_ field stores the type of the section where the relocation is
needed. It can be either the text section or the init section.

More explanation can be found in the _vmlink.c_ file, which is well commented.
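
As a rough illustration of what patching at _r_offset_ involves, consider the
sketch below. It is not taken from _vmlink.c_. It assumes that the operand
being patched is a 32-bit big-endian value, and that for the read-only section
case the new offset is the old operand plus the position of this object's
read-only data inside the merged read-only section. The function names
+patch_operand+ and +rebase_ro_operand+ are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Write `value` big-endian into the 4 bytes at sect[r_offset].
 * Returns 0 on success, -1 if the operand would fall outside the section. */
int patch_operand(unsigned char *sect, size_t sect_size,
                  uint32_t r_offset, uint32_t value)
{
    if (sect_size < 4 || r_offset > sect_size - 4)
        return -1;
    sect[r_offset]     = (unsigned char)(value >> 24);
    sect[r_offset + 1] = (unsigned char)(value >> 16);
    sect[r_offset + 2] = (unsigned char)(value >> 8);
    sect[r_offset + 3] = (unsigned char)value;
    return 0;
}

/* Read-only section case (assumed behaviour): read the old operand and
 * add ro_base, the start of this object's data in the merged section. */
int rebase_ro_operand(unsigned char *sect, size_t sect_size,
                      uint32_t r_offset, uint32_t ro_base)
{
    uint32_t old;

    if (sect_size < 4 || r_offset > sect_size - 4)
        return -1;
    old = ((uint32_t)sect[r_offset] << 24) |
          ((uint32_t)sect[r_offset + 1] << 16) |
          ((uint32_t)sect[r_offset + 2] << 8) |
           (uint32_t)sect[r_offset + 3];
    return patch_operand(sect, sect_size, r_offset, old + ro_base);
}
```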


vmasm
-----
As we said earlier, the _vmasm_ program converts the assembler listing to the
object file. The _vmasm_ program is very simple: it reads the assembler file
line by line and processes the instructions.

Each instruction has its own handler. The program has a table with all known
instructions and their handlers.

-----------------------------------
struct code_entry {
	const char	*name;
	instr_handler_t	handler;
} instr_handlers[] = {
	...
	{ "add", arith },
	...
	{ "jge", jump },
	...
};
-----------------------------------

First the program reads a line and finds the identifier, then looks it up in
the instruction table. If it finds a match it calls the handler. The handler
then continues parsing the line and processes the operands (if the instruction
has operands).
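
The lookup and dispatch step can be sketched as follows. The handler
signature, the stub handler bodies and the function +assemble_line+ are
assumptions made for the example. Only the shape of the table follows the
listing above, reduced to the two entries shown there.

```c
#include <stddef.h>
#include <string.h>

/* Assumed signature: a handler receives the rest of the line. */
typedef int (*instr_handler_t)(const char *operands);

/* Stub handlers that only report which one was called. */
static int arith(const char *operands) { (void)operands; return 1; }
static int jump(const char *operands)  { (void)operands; return 2; }

struct code_entry {
    const char      *name;
    instr_handler_t  handler;
} instr_handlers[] = {
    { "add", arith },
    { "jge", jump },
};

/* Find the mnemonic at the start of `line` and call its handler with the
 * remainder of the line. Returns -1 for an unknown instruction. */
int assemble_line(const char *line)
{
    size_t i, len;

    line += strspn(line, " \t");        /* skip leading blanks */
    len = strcspn(line, " \t\n");       /* length of the mnemonic */

    for (i = 0; i < sizeof instr_handlers / sizeof instr_handlers[0]; i++) {
        if (strlen(instr_handlers[i].name) == len &&
            strncmp(instr_handlers[i].name, line, len) == 0)
            return instr_handlers[i].handler(line + len);
    }
    return -1;
}
```

A linear search is enough for a small instruction set. A sorted table with
+bsearch+ would be an alternative if the table grew large.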

_vmasm_ uses a buffer API to create and fill the main header, the sections and
the section headers, and to write them to the file. The most difficult part of
the _vmasm_ program is creating the symbol table entries and the relocation
table entries. An overview of both tables is given earlier in this document.

vmlink
------