Small Virtual Machine executing its own byte-code

Question

I want present to you my little Virtual Machine. It nothing really sophisticated. It is an stack only virtual machine with the exception for some 'global' variables which i would count as registers but besides that all calculation done by my VM is done on stack.

What i planning now is to show the most interesting parts of the project How i implement this VM.

Reason why I do not post the complete code directly is you can find it on GitHub where you can clone it and test out locally. I recommend if you really interested download the project, execute and take a look the code and THEN comeback and read the post.

It makes it probably easier to understand. It's not that i do have a lot of files but to keep the code readable I think that this is the best choice. Let me know if you think otherwise!

Also Unfortunately i am not good writer, not just only because English is my second language... I try my best to correct as much orthographic and grammatical mistakes as possible.

Lets go through my code.

ByteCode.h

#include <memory> #include <functional> namespace vm { class VM; enum ByteCodes : unsigned long long { NOP=0, IADD, ISUB, IMUL, ILT, //Integer Less than IEQ, //Integer EQUAL IPUSH, //Integer FADD, FSUB, FMUL, FLT, //Float Less than FEQ, //Float EQUAL FPUSH, //Floating Point PRINT, HALT, POP, GSTORE, //Stores Value Globally GLOAD, STORE, //Stores Value Locally LOAD, BRANCH, // unconditional branch BRTRUE, // conditional branches BRFALSE, CALL, //Function call RET, //Returns to the Point after the Call MAXCODE }; struct Instruction { std::string mName; int mNumericName; int mOperandCount; std::function<void()> mInstruction; Instruction() = default; Instruction(const std::string& OpCode, int Value, int operands); Instruction(const std::string& OpCode, int Value); Instruction(const std::string& OpCode, int Value, int operands, std::function<void()> lambda); Instruction(const std::string& OpCode, int Value, std::function<void()> lambda); virtual ~Instruction() { }; static void Init( VM*); }; extern std::unique_ptr<Instruction[]> InstructionCode; }

This header i believe is pretty straight forward to understand. You find three important parts in it.

Enumeration of the Bytes which could trigger actions
Very small Abstraction of an Instruction containing some Meta Information
A table of Instruction where the indices correspond the values of the enumeration

What is more interesting is now how such an instruction is implemented. I will not show the whole implementation file but instead a small section of it:

ByteCode.cpp

namespace vm { std::unique_ptr<Instruction[]> InstructionCode = std::make_unique<Instruction[]>(MAXCODE);; Instruction::Instruction(const std::string& OpCode, int Value, int operands) : mName(OpCode), mNumericName(Value), mOperandCount(operands) { InstructionCode[Value] = (*this); } Instruction::Instruction(const std::string& OpCode, int Value) : mName(OpCode), mNumericName(Value), mOperandCount(0) { InstructionCode[Value] = (*this); } Instruction::Instruction(const std::string& OpCode, int Value, int operands, std::function<void()> lambda) : mName(OpCode), mNumericName(Value), mOperandCount(operands), mInstruction(lambda) { InstructionCode[Value] = (*this); } Instruction::Instruction(const std::string& OpCode, int Value, std::function<void()> lambda) : mName(OpCode), mNumericName(Value), mOperandCount(0), mInstruction(lambda) { InstructionCode[Value] = (*this); } void Instruction::Init( VM* Machine) { Instruction("NOP", 0, []() {}); Instruction("IADD", IADD, [Machine]( ) { auto rhs = Machine->stack[Machine->stackPointer--]; //Right-Hand-Side operand auto lhs = Machine->stack[Machine->stackPointer--]; //Left-Hand-Side operand Machine->stack.pop_back(); Machine->stack.pop_back(); Machine->stack.push_back(VM::Type{ VM::Type::INT,{ (lhs.mValue.mValue + rhs.mValue.mValue) } }); Machine->stackPointer++; }); }

What you see is the that each instruction if constructed is automatically be stored in the global array you have seen in the header file. This shall insure that every instruction ever instantiated is accessible via this very array just mentioned. This is important later on.

An Instruction saves 4 kinds of information

Name as String
Name as Number Code
amount of Operands
Implementation of the Instruction

The really interesting part start now with the static function Init. Init is called from the VM at beginning of it's life cycle. It does constructed all instruction just by calling the constructor of the class Instruction.

The Init function expects just the pointer of the current VM instance( a class shown later) which is given to an lambda who implements the behaviour of a certain instruction.

The lambda which contains the behaviour of the instruction captures Initss function argument, which is the pointer to the current instance of the VM. Through this i enable the lambda to access the VM's representation of the stack where all operation happen.

You see two Instructions implemented here. NOP and IADD. NOP simply does nothing so the lambda is empty. IADD on the other hand adds to integer variables which are stored on the stack before this instruction is called. if nothing is there, well that would be a problem then.

This scheme just repeats for every instruction defined in the enumeration in the header file.

VM.h

The header contains the nested struct Type. Type represent any values possible in the VM. To make life easier anything which is not an Pointer is stored within an double regardless if less bytes are need. Efficiency was not the purpose here. In essence Type is a tagged union. Because Every Instruction or Opcode is represented by an Byte, ergo is an Value, it can be expressed as an Type in the "Code Section" of the VM.

namespace vm { class VM { public: struct Type { enum types { INT, POINTER, FLOAT, CHAR } mObjecttype; union Value { double mValue; size_t mPointer; Value() = default; Value(double i) : mValue(i) {}; Value(size_t i) : mPointer(i) {}; }mValue; Type() = default; Type(types t, Value val) : mObjecttype(t), mValue(val) {} Type(const Type& t) : mObjecttype(t.mObjecttype), mValue(t.mValue) {} }; public: typedef std::vector<Type> Memory; typedef std::unique_ptr<char[]> HeapMemory; private: friend Instruction; //<-- Ugly but it works, to my surprise //Memory of the Machine //HeapMemory heap; Memory stack; Memory globals; Memory code; //!<-- Copy of the Code to execute Statistics stats; //!<-- Does Statistics about the execution of the Code //Rudimentary special purpose register size_t instructionPointer = -1; size_t stackPointer = -1; size_t framePointer = -1; //Debug bool trace = true; protected: void pushOntoStack(Type T); Type popOffStack(); size_t popPointer(); size_t HeapSize() const { return Kilobytes(512); } std::string disassemble() const; std::string stackString() const; public: VM(const std::vector<Type>& code, int main); virtual ~VM(); void cpu(); void operator()() { cpu(); } }; }

VM.cpp

The Constructor of the VM is quite simple it expects an vector of Types which stores simply a copy of the code. The second argument is the "line" number or the index from the vector from which the VM shall start executing those sweet little bytes of Opcode/Instruction. You also find the call to the static function Init from the class Instructions earlier explained.

namespace vm { VM::VM(const std::vector<Type>& code, int main) : globals(10)//,heap(std::make_unique<char[]>(HeapSize())) { this->code = code; this->instructionPointer = main; Instruction::Init(this); } VM::~VM() { }

The member function cpu() is quite interesting and a bit difficult to read. It first checks if the stacktrace is enabled; if so it prints the first line as you can read.

Then we enter our interpreting loop; for each instruction contained in our code given to the constructor earlier. this loop runs as long as our instruction pointer is less than the code size, otherwise we would try to access code which does not exist.

The first step is to fetch now our first Opcode and increase our IP by one. We than check if the recived opcode is valid and if so proceed.

 void VM::cpu() { if (trace) std::cout << std::left << std::setw(44) << "Instruction & Results:" << std::right << "Stack Memory:\n"; while (instructionPointer < code.size()) { try { auto opcode = code[instructionPointer].mValue.mValue;// fetch OpCode instructionPointer++; if (opcode < 0) throw std::runtime_error("Opcode Mismatch No Negative instructions please "); else if (opcode > MAXCODE) throw std::runtime_error("Opcode not supported: " + std::to_string(opcode) + "! at Instruction Pointer(" + std::to_string(instructionPointer - 1) + ")."); else if(opcode == 0) throw std::runtime_error("Opcode 0 Does not exist yet!\n ");

If the trace is enabled we print the current state of the stack as well as the instruction which causes the current stack.

 if (trace) { std::cout << std::left << std::setw(40) << disassemble() << std::right << stackString() << "\n";

The function AddMeasurement from the class Statistic does what you expect. it stores some kind of Measurement(here time). It's not really well coded and if you interested please look up the code at my repository it is in the same project. The Argument given to AddMeasurement calls the lambda of our Opcode/Instruction and Measures how long it takes to execute and stores this information for analysis later. I did this because it seems interesting to me to know just how fast is my Virtual Machine.

 stats.AddMeasurement({ InstructionCode[static_cast<size_t>(opcode)].mName, (measure<std::chrono::nanoseconds>:: duration( InstructionCode[static_cast<size_t>(opcode)].mInstruction )).count() }); //Decode and Execute Opcode } else { stats.AddMeasurement({ InstructionCode[static_cast<size_t>(opcode)].mName, (measure<std::chrono::nanoseconds>:: duration( InstructionCode[static_cast<size_t>(opcode)].mInstruction )).count() });//Decode and Execute Opcode }

What i personally find awesome is I do not have an awfull crazy long switch statement which chooses what code must be executed. In an earlier version i had such an switch statement and it drove me crazy looking at it. So i came up the with global array. The Array you find the Bytecode.h file; where the bytecode corresponds to the index of the array for appropriate instruction need if accessed.

 } catch (std::out_of_range& e) { std::cerr << e.what(); } catch (std::exception& e) { std::cerr << e.what(); } catch (...) { std::cerr << "Totally lost... do not know what happend but i'm dead for now!!!\n"; } } }

The Dissamble function takes the current Opcode pointed by the IP and prints the meta information stored in it including the line number within the code as well as the Name and Value of the operands.

std::string VM::disassemble() const { std::stringstream buffer; auto opcode = static_cast<size_t>(code[instructionPointer-1].mValue.mValue); if (opcode != 0 && opcode < MAXCODE) //Print Code section { buffer << std::showbase << std::setbase(10) << std::setw(4 * 2) << std::left << instructionPointer-1 << std::dec << ": \t" << std::setw(4 * 2) << std::left << InstructionCode[static_cast<size_t>(opcode)].mName; if (InstructionCode[static_cast<size_t>(opcode)].mOperandCount > 0) { auto nargs = InstructionCode[static_cast<size_t>(opcode)].mOperandCount; std::vector<std::string> operands; for (auto i = instructionPointer ; i <= instructionPointer + nargs-1; i++) { switch (code[i].mObjecttype) { case Type::POINTER: operands.push_back(std::to_string(code[i].mValue.mPointer)); break; default: operands.push_back(std::to_string(code[i].mValue.mValue)); break; } } for (auto item = 0; item < operands.size(); item++) { if (item > 0) buffer << std::left << ", "; buffer << std::setw(2) << std::left << " " << operands[item]; } } } return buffer.str(); }

Simply prints the current state of the Stack

std::string VM::stackString() const { std::stringstream buffer; buffer << "\tstack["; if (stack.size() != 0) { for (size_t i = 0; i <= stackPointer; ++i) { switch (stack[i].mObjecttype) { case Type::INT: buffer << " " << (long long)stack[i].mValue.mValue; break; case Type::FLOAT: buffer << " " << stack[i].mValue.mValue; break; } } } buffer << " ]"; return buffer.str(); } void VM::pushOntoStack(Type T) { stack.push_back(T); stackPointer++; } auto VM::popOffStack() -> VM::Type { auto tmp = stack[stackPointer--]; stack.pop_back(); return tmp; } size_t VM::popPointer() { auto tmp = popOffStack(); if (tmp.mObjecttype != Type::POINTER) throw std::runtime_error("Tried to access a pointer but received:" + std::to_string(tmp.mObjecttype) + "!"); return tmp.mValue.mPointer; } }

The main programm

Here you see the snippet containing an array of Type. This shall be the code executed by the VM. I tried to make the definition of this array more readable by using c-style macros. The project is not yet at the point that i can read from a file and execute the code stored there. Any code must be hard coded right know.

int main() { //Recursive Type factorial[] = { //Instruction // ADDRESS ///Function factorial(var N) // ARGS=1, LOCALS=0 INSTR(LOAD, -3.0,INT), // [0] Copy Given Argument N Into Current StackFrame INSTR(IPUSH, 2.0,INT), // [2] Push Constant (2) Onto Stack SINSTR(ILT), // [4] IF ((N < 2) == TRUE) Push TRUE ELSE Push FALSE Onto Stack INSTR(BRFALSE, 10ull,POINTER), // [5] Jump to Addr 10 if on Stack lays FALSE INSTR(IPUSH, 1.0,INT), // [7] Push Constant (1) Onto Stack SINSTR(RET), // [9] Return Constant (1)[7] // // RETURN N * factorial( N-1 ) INSTR(LOAD, -3.0,INT), // [10] Copy Given Argument N Into Current StackFrame INSTR(LOAD, -3.0,INT), // [12] Copy Given Argument N Into Current StackFrame INSTR(IPUSH, 1.0,INT), // [14] Push Constant (1) Onto Stack SINSTR(ISUB), // [16] Substract Constant (1) of Value (N) DINSTR(CALL, 0ull, 1.0,POINTER,INT), // [17] Call Function: factorial(var N) SINSTR(IMUL), // [20] Multiplicate Result of (factorial(var N))[17] times Value (N)[10] SINSTR(RET), // [21] Return Result[20] /// Function Main() // ARGS=0, LOCALS=0 // <-- MAIN METHOD! INSTR(IPUSH, N,INT), // [22] Push Constant (5) Onto StacK DINSTR(CALL, 0ull, 1.0,POINTER,INT), // [25] Call Function: factorial(var N) SINSTR(PRINT), // [26] Print Stack Top SINSTR(HALT) // [27] Abort }; VM::Memory CodeInstructions(std::begin(factorial), std::end(factorial)); VM Machine(CodeInstructions, 22); try { ///auto exe = &VM::cpu; auto avg = (measure<std::chrono::nanoseconds>::duration(Machine) ); //Machine.cpu(); } catch (std::exception& e) { std::cerr << e.what() << "\n"; } std::cin.get(); return 0; }

The Result of the programm you have seen above:

 Instruction & Results: Stack Memory: 22 : IPUSH 5.000000 stack[ ] 24 : CALL 0, 1.000000 stack[ 5 ] 0 : LOAD -3.000000 stack[ 5 1 ] 2 : IPUSH 2.000000 stack[ 5 1 5 ] 4 : ILT stack[ 5 1 5 2 ] 5 : BRFALSE 10 stack[ 5 1 0 ] 10 : LOAD -3.000000 stack[ 5 1 ] 12 : LOAD -3.000000 stack[ 5 1 5 ] 14 : IPUSH 1.000000 stack[ 5 1 5 5 ] 16 : ISUB stack[ 5 1 5 5 1 ] 17 : CALL 0, 1.000000 stack[ 5 1 5 4 ] 0 : LOAD -3.000000 stack[ 5 1 5 4 1 ] 2 : IPUSH 2.000000 stack[ 5 1 5 4 1 4 ] 4 : ILT stack[ 5 1 5 4 1 4 2 ] 5 : BRFALSE 10 stack[ 5 1 5 4 1 0 ] 10 : LOAD -3.000000 stack[ 5 1 5 4 1 ] 12 : LOAD -3.000000 stack[ 5 1 5 4 1 4 ] 14 : IPUSH 1.000000 stack[ 5 1 5 4 1 4 4 ] 16 : ISUB stack[ 5 1 5 4 1 4 4 1 ] 17 : CALL 0, 1.000000 stack[ 5 1 5 4 1 4 3 ] 0 : LOAD -3.000000 stack[ 5 1 5 4 1 4 3 1 ] 2 : IPUSH 2.000000 stack[ 5 1 5 4 1 4 3 1 3 ] 4 : ILT stack[ 5 1 5 4 1 4 3 1 3 2 ] 5 : BRFALSE 10 stack[ 5 1 5 4 1 4 3 1 0 ] 10 : LOAD -3.000000 stack[ 5 1 5 4 1 4 3 1 ] 12 : LOAD -3.000000 stack[ 5 1 5 4 1 4 3 1 3 ] 14 : IPUSH 1.000000 stack[ 5 1 5 4 1 4 3 1 3 3 ] 16 : ISUB stack[ 5 1 5 4 1 4 3 1 3 3 1 ] 17 : CALL 0, 1.000000 stack[ 5 1 5 4 1 4 3 1 3 2 ] 0 : LOAD -3.000000 stack[ 5 1 5 4 1 4 3 1 3 2 1 ] 2 : IPUSH 2.000000 stack[ 5 1 5 4 1 4 3 1 3 2 1 2 ] 4 : ILT stack[ 5 1 5 4 1 4 3 1 3 2 1 2 2 ] 5 : BRFALSE 10 stack[ 5 1 5 4 1 4 3 1 3 2 1 0 ] 10 : LOAD -3.000000 stack[ 5 1 5 4 1 4 3 1 3 2 1 ] 12 : LOAD -3.000000 stack[ 5 1 5 4 1 4 3 1 3 2 1 2 ] 14 : IPUSH 1.000000 stack[ 5 1 5 4 1 4 3 1 3 2 1 2 2 ] 16 : ISUB stack[ 5 1 5 4 1 4 3 1 3 2 1 2 2 1 ] 17 : CALL 0, 1.000000 stack[ 5 1 5 4 1 4 3 1 3 2 1 2 1 ] 0 : LOAD -3.000000 stack[ 5 1 5 4 1 4 3 1 3 2 1 2 1 1 ] 2 : IPUSH 2.000000 stack[ 5 1 5 4 1 4 3 1 3 2 1 2 1 1 1 ] 4 : ILT stack[ 5 1 5 4 1 4 3 1 3 2 1 2 1 1 1 2 ] 5 : BRFALSE 10 stack[ 5 1 5 4 1 4 3 1 3 2 1 2 1 1 1 ] 7 : IPUSH 1.000000 stack[ 5 1 5 4 1 4 3 1 3 2 1 2 1 1 ] 9 : RET stack[ 5 1 5 4 1 4 3 1 3 2 1 2 1 1 1 ] 20 : IMUL stack[ 5 1 5 4 1 4 3 1 3 2 1 2 1 ] 21 : RET stack[ 5 1 5 4 1 4 3 1 3 2 1 2 ] 20 : IMUL stack[ 5 1 5 4 1 4 3 1 3 2 ] 21 : RET stack[ 5 1 5 4 1 4 3 1 6 ] 20 : IMUL stack[ 5 1 5 4 1 4 6 ] 21 : RET stack[ 5 1 5 4 1 24 ] 20 : IMUL stack[ 5 1 5 24 ] 21 : RET stack[ 5 1 120 ] 27 : Print stack[ 120 ] 120 28 : HALT stack[ ] Name ---> | Min | Max | Average | Sum | count(ct) | avrg/ct | ================================================================================================================================== ISUB ---> | 3.9e+03 ns | 4.3e+03 ns | 4.1e+03 ns | 1.7e+04 ns | 4 | 1036.13 ns/ct | IMUL ---> | 3.6e+03 ns | 4.3e+03 ns | 3.9e+03 ns | 1.6e+04 ns | 4 | 986.875 ns/ct | ILT ---> | 3.9e+03 ns | 4.7e+03 ns | 4.2e+03 ns | 2.1e+04 ns | 5 | 836.8 ns/ct | IPUSH ---> | 2e+03 ns | 6.7e+03 ns | 2.9e+03 ns | 3.2e+04 ns | 11 | 264.248 ns/ct | Print ---> | 1.5e+04 ns | 1.5e+04 ns | 1.5e+04 ns | 1.5e+04 ns | 1 | 15395 ns/ct | LOAD ---> | 2.4e+03 ns | 8.3e+03 ns | 3.6e+03 ns | 4.7e+04 ns | 13 | 280.278 ns/ct | BRFALSE ---> | 2e+03 ns | 2.8e+03 ns | 2.3e+03 ns | 1.1e+04 ns | 5 | 457.92 ns/ct | CALL ---> | 4.3e+03 ns | 1.5e+04 ns | 7.8e+03 ns | 3.9e+04 ns | 5 | 1563.08 ns/ct | ----------------------------------------------------------------------------------------------------------------------------------- Sum | 37498 ns | 61973 ns | 44326.34 ns | 198549 ns | 48 | 20820.33 | Average | 781.2083 ns/ct | 1291.104 ns/ct | 923.4655 ns/ct | 4136.438ns/ct | | |

The worst part of all I can see is no actual tests for instructions. — Sugar, CommentedJul 30, 2021 at 15:22
Why is there an explicit type in enum ByteCodes : unsigned long long, but with a type that is way too large for the few constants defined? A unsigned char would be plenty large. — Cris Luengo, CommentedDec 7, 2024 at 21:12
@CrisLuengo sorry i cannot remember anymore. You are certainly not wrong. — ExOfDe, CommentedDec 10, 2024 at 9:14

pacmaninbw · Accepted Answer · 2024-12-07 16:19:49Z

General Observations

The rules in the Code Review Community on Stack Exchange require that only code to be reviewed must be in the post, however, the only way to actually review the code in the question is to clone the GitHub repository.

The code is not currently portable. Some operating systems such as Windows don't differentiate between capitals and lower case in file names, while others such as Linux based systems do. This code doesn't compile on a Linux based system because #include "vm.h" and #include "VM.h" are refering to different files. This code was obviously developed on a Windows system using some version of Visual Studio.

When using Visual Studio compile with the /W3 switch to catch most of the portability issues. Examples of warnings that might be found by `/W3' are:

/VM/VM/Measurement.h:121:35: error: comparison of integer expressions of different signedness: ‘int’ and ‘long long unsigned int’ [-Werror=sign-compare] 121 | for (int i = 1; i < vm::MAXCODE - 1; i++)

In many cases it is better to use std::size_t as the loop control variable when the loop control variable can't go negative, that should solve the previous compiler error.

Missing Include Files

The file Measurement.h needs to include to compile, Visual Studio apparently doesn't need this.

Prefer `std::size_t` Rather than `size_t` in C++

The C++ standard provides std::size_t rather than size_t. Many compilers allow size_t as well as std::size_t to be backward compatible with the C programming language, but you might end up with a type mismatch.

Prefer Using Statements Over `typedef`

Since C++11 the using statement is preferred over typedef.

In VM.h rather than:

 typedef std::vector<Type> Memory; typedef std::array<char, Kilobytes(512)> HeapMemory;

It would be better to have:

 using Memory = std::vector<Type>; using HeapMemory = std::array<char, Kilobytes(512)>;

Magic Numbers

There are Magic Numbers in the main() function (22 and 66 which is commented out) in the creation of the VM variable. It isn't clear from the code what these numbers are. It might be better to create symbolic constants for them to make the code more readble and easier to maintain. These numbers may be used in many places and being able to change them by editing only one line makes maintainence easier.

Numeric constants in code are sometimes referred to as Magic Numbers, because there is no obvious meaning for them. There is a discussion of this on stackoverflow.

Stack Exchange Network

Small Virtual Machine executing its own byte-code

ByteCode.h

ByteCode.cpp

VM.h

VM.cpp

The main programm

1 Answer 1

General Observations

Missing Include Files

Prefer `std::size_t` Rather than `size_t` in C++

Prefer Using Statements Over `typedef`

Magic Numbers

Hot Network Questions

Small Virtual Machine executing its own byte-code

ByteCode.h

ByteCode.cpp

VM.h

VM.cpp

The main programm

1 Answer 1

General Observations

Missing Include Files

Prefer std::size_t Rather than size_t in C++

Prefer Using Statements Over typedef

Magic Numbers

Related

Hot Network Questions

Prefer `std::size_t` Rather than `size_t` in C++

Prefer Using Statements Over `typedef`