Parsing Z80 assembler in C++

Question

I'm just starting with C++ (I have worked with C for more than a year). As a first program I wanted to write a cleaner version of a Z80 assembler I had previously written in C, this time trying to do it the C++ way rather than using C++ as just 'plain C with some extra features'. I clearly failed: I copied some C code and modified it heavily to translate it to C++ (surely making a lot of mistakes as I tried different approaches and messed up much of the code), but I could not really organize my code or express my ideas in C++.

I tried with std::istringstream but ended up using a string and an index variable. Right now the objective of the code is to just separate each statement into tokens.

The rules for the assembler are the following:

Registers start with % and can have spaces between % and the first char (e.g. ADD % A, %B)
Numbers start with $ and follow the same rules as registers
Strings and chars are between double and single quotes respectively, as in C/C++
Indirect operands, like indirect address numbers, are written between parentheses
Everything else is considered as a label

For include: push the new file onto the file stack and return from read statements. If the file being pushed is already on the stack, we have an include cycle: abort.

Algorithm:

1.- Read every line, adding each statement to a vector of statements

2.- For each label, add it to the map and set it as ready

3.- For each macro, read all its statements and store them inside the macro struct

4.- If we read an undefined opcode, check if it is a macro and if so, copy all of its statements to the statements vector

5.- Resolve all equ labels that do not have a value. If one references an address label, move it to the address labels group and copy the index of the statement after its referenced label.

6.- After reading a complete file, pass through all statements checking args and adding up sizes. Give each address label the total size before its statement, plus an offset if the user requested one

7.- Pass again through the statements, this time assembling them, as by now we should have every label

So how should someone implement this? Or how could I improve this?

    #include <unordered_map>
    #include <iostream>
    #include <string>
    #include <vector>
    #include <cctype>
    #include <sstream>

    enum class OperandType {
        // Registers
        rB, rC, rD, rE, rH, rL, rA, rI, rR,
        // Double Registers
        rrBC, rrDE, rrHL, rrSP, rrIX, rrIY, rrAF, rrAFp,
        // Indirect Registers
        iBC, iDE, iHL, iSP, iIX, iIY,
        // Numbers and others
        NN, STRING, LABEL,
        // Flags
        fNZ, fZ, fNC, fC, fPO, fPE, fP, fM
    };

    enum class Opcode {
        LD, EX, PUSH, POP, ADD, ADC, SBC, SUB, AND, OR, XOR, CP,
        INC, DEC, BIT, SET, RES, IM, RLC, RL, RRC, RR, SLA, SRA,
        SRL, JP, JR, DJNZ, CALL, RET, RST, IN, OUT, LDI, LDIR, LDD,
        LDDR, CPI, CPIR, CPD, CPDR, NEG, RLD, RRD, INI, INIR, IND, INDR,
        OUTI, OTIR, OUTD, OTDR, RETI, RETN, EXX, DAA, CPL, CCF, SCF, NOP,
        HALT, DI, EI, RLCA, RLA, RRCA, RRA, DB, DS, EQU, INCLUDE, MACRO
    };

    static inline bool uses_flags(Opcode opcode)
    {
        return (opcode == Opcode::JP || opcode == Opcode::JR ||
                opcode == Opcode::CALL || opcode == Opcode::RET);
    }

    static inline bool is_valid_char(int c)
    {
        return (isalnum(c) || c == '_' || c == '-' || c == '%' ||
                c == '$' || c == '\'');
    }

    const std::unordered_map<std::string, Opcode> opcodes = {
        {"LD", Opcode::LD}, {"EX", Opcode::EX}, {"PUSH", Opcode::PUSH},
        {"POP", Opcode::POP}, {"ADD", Opcode::ADD}, {"ADC", Opcode::ADC},
        {"SBC", Opcode::SBC}, {"SUB", Opcode::SUB}, {"AND", Opcode::AND},
        {"OR", Opcode::OR}, {"XOR", Opcode::XOR}, {"CP", Opcode::CP},
        {"INC", Opcode::INC}, {"DEC", Opcode::DEC}, {"BIT", Opcode::BIT},
        {"SET", Opcode::SET}, {"RES", Opcode::RES}, {"IM", Opcode::IM},
        {"RLC", Opcode::RLC}, {"RL", Opcode::RL}, {"RRC", Opcode::RRC},
        {"RR", Opcode::RR}, {"SLA", Opcode::SLA}, {"SRA", Opcode::SRA},
        {"SRL", Opcode::SRL}, {"JP", Opcode::JP}, {"JR", Opcode::JR},
        {"DJNZ", Opcode::DJNZ}, {"CALL", Opcode::CALL}, {"RET", Opcode::RET},
        {"RST", Opcode::RST}, {"IN", Opcode::IN}, {"OUT", Opcode::OUT},
        {"LDI", Opcode::LDI}, {"LDIR", Opcode::LDIR}, {"LDD", Opcode::LDD},
        {"LDDR", Opcode::LDDR}, {"CPI", Opcode::CPI}, {"CPIR", Opcode::CPIR},
        {"CPD", Opcode::CPD}, {"CPDR", Opcode::CPDR}, {"NEG", Opcode::NEG},
        {"RLD", Opcode::RLD}, {"RRD", Opcode::RRD}, {"INI", Opcode::INI},
        {"INIR", Opcode::INIR}, {"IND", Opcode::IND}, {"INDR", Opcode::INDR},
        {"OUTI", Opcode::OUTI}, {"OTIR", Opcode::OTIR}, {"OUTD", Opcode::OUTD},
        {"OTDR", Opcode::OTDR}, {"RETI", Opcode::RETI}, {"RETN", Opcode::RETN},
        {"EXX", Opcode::EXX}, {"DAA", Opcode::DAA}, {"CPL", Opcode::CPL},
        {"CCF", Opcode::CCF}, {"SCF", Opcode::SCF}, {"NOP", Opcode::NOP},
        {"HALT", Opcode::HALT}, {"DI", Opcode::DI}, {"EI", Opcode::EI},
        {"RLCA", Opcode::RLCA}, {"RLA", Opcode::RLA}, {"RRCA", Opcode::RRCA},
        {"RRA", Opcode::RRA}, {"DB", Opcode::DB}, {"DS", Opcode::DS},
        {"EQU", Opcode::EQU}, {"INCLUDE",Opcode::INCLUDE}, {"MACRO", Opcode::MACRO}
    };

    const std::unordered_map<std::string, OperandType> registers = {
        {"A", OperandType::rA}, {"B", OperandType::rB}, {"C", OperandType::rC},
        {"D", OperandType::rD}, {"E", OperandType::rE}, {"H", OperandType::rH},
        {"L", OperandType::rL}, {"I", OperandType::rI}, {"R", OperandType::rR},
        {"BC", OperandType::rrBC}, {"DE", OperandType::rrDE}, {"HL", OperandType::rrHL},
        {"SP", OperandType::rrSP}, {"AF", OperandType::rrAF}, {"IX", OperandType::rrIX},
        {"IY", OperandType::rrIY}, {"AF'", OperandType::rrAFp}
    };

    const std::unordered_map<std::string, OperandType> flags = {
        {"NZ", OperandType::fNZ}, {"Z", OperandType::fZ}, {"NC", OperandType::fNC},
        {"C", OperandType::fC}, {"PO", OperandType::fPO}, {"PE", OperandType::fPE},
        {"P", OperandType::fP}, {"M", OperandType::fM}
    };

    std::vector<std::string> tokens;

    inline void eat_spaces(const std::string& line, int& it)
    {
        while( it < line.size() && isspace(line[it]) )
            it ++;
    }

    void read_operands(const std::string& line, int& it)
    {
        std::string delim = ";";

        // Read operands
        while(it < line.size()) {
            std::string token;

            switch( line[it] ) {
            case ';':
                return;
            case '\'':
            case '\"':
            case '(': {
                auto next = line.find_first_of(delim + line[it], it + 1);
                if(next == std::string::npos || line[next] == ';')
                    std::cout << "Error Operands" << std::endl;
                token = line.substr(it, next - it + 1);
                it = (next + 1);
                }
                break;
            case '%':
            case '$':
                token += line[it ++];
                eat_spaces(line, it);
                while(isalnum(line[it]))
                    token += line[it ++];
                break;
            default:
                while(is_valid_char(line[it]))
                    token += line[it ++];
                break;
            }

            tokens.push_back(token);
            eat_spaces(line, it);

            if(it == line.size() || line[it] == ';')
                break;
            else if(line[it] != ',') {
                std::cout << "Garbage1:" << line[it] << std::endl;
                return;
            }

            // Eat comma
            it ++;

            // Check if after comma actually comes another operand
            eat_spaces(line, it);
            if(it == line.size() || line[it] == ';') {
                std::cout << "Garbage2:" << line[it] << std::endl;;
                return;
            }
        }
    }

    void read_opcode(const std::string& line, int& it)
    {
        // Read posible label and opcode
        for(int i = 0; i < 2; i ++) {
            std::string token;

            eat_spaces(line, it);
            if(line[it] == ';')
                return;
            if( !isalpha(line[it]) )
                std::cout << "Error Opcode\n";

            while( is_valid_char(line[it]) )
                token += line[it ++];

            if(line[it] == ':') {
                it ++;
            } else {
                tokens.push_back(token);
                break;
            }
        }
        eat_spaces(line, it);
    }

    void read_statements(void)
    {
        std::vector<Operand> operands;
        std::string line = "RET \"go\", b ;comment";
        Opcode opcode;
        int it = 0;

        read_opcode(line, it);
        int first_operand = tokens.size();

        try {
            opcode = opcodes.at(tokens[0]);
        } catch(std::out_of_range) {
            std::cout << "Unknown Opcode: " << tokens[0] << std::endl;
        }

        read_operands(line, it);

        if((opcode == Opcode::RET && tokens.size() > 1) ||
           (uses_flags(opcode) && tokens.size() > 2)) {
            try {
                operands[0].type = flags.at(tokens[1]);
            } catch(std::out_of_range) {
                std::cout << "Unknown Flag:" << tokens[1] << std::endl;
            }
        }

        for(int i = 0; i < tokens.size(); i ++)
            std::cout << tokens[i] << std::endl;

        return;
    }

    int main()
    {
        read_statements();
    }

What is Operand? An enum? A namespace? Something else? — Edward, Commented Dec 27, 2017 at 11:54
It's a class, but that line is just a leftover as it's not yet used. — user156642, Commented Dec 27, 2017 at 12:51
Hmm. I don't think it will be possible to do much with a review without at least a hint of what's in there and in OperandType. — Edward, Commented Dec 27, 2017 at 16:49
Well, I added the enum definitions, but really there's nothing more; that is the complete source. At this point I just want the program to separate tokens following the rules described above. — user156642, Commented Dec 27, 2017 at 22:50
Do you have a link to the Z80 assembly instructions? I sold my book on Z80 assembly when I sold my ZX Spectrum many years ago. — Loki Astari, Commented Dec 29, 2017 at 19:57

Toby Speight · Accepted Answer · 2018-03-13 18:06:42Z

This is a fairly straightforward way to implement this. I think it's a pretty good first try. Your naming is pretty good and understandable, and I can read all of this code very easily. Here are some things I would change.

Globals

The tokens vector is declared globally. This is a bad idea. It means that any function can change it at any time, and it will be difficult to figure out why it changed.

It also limits you to only ever working on a single file at a time. My main compiler will compile 1 text file per logical CPU core at a time, greatly reducing the amount of time it takes to build my stuff. You may want to do the same in the future.

I recommend passing tokens into the functions that need it rather than making it global both to reduce confusion over who changed what in it, and to allow multiple files to be assembled at once in the future.
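As a minimal sketch of that idea, each call can own its result and hand it back to the caller. The name `tokenize_operands` and the simplified comma-splitting rule are my assumptions for illustration, not your tokenizer; quotes and parentheses are ignored to keep it short.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: splits the operand part of one line on commas and
// returns the tokens instead of appending to a global vector.
// Quoted strings and parentheses are deliberately not handled here.
std::vector<std::string> tokenize_operands(const std::string& text)
{
    std::vector<std::string> tokens;
    std::string current;
    for (char c : text) {
        if (c == ';') {                       // start of a comment: stop
            break;
        }
        if (c == ',') {                       // operand separator
            tokens.push_back(current);
            current.clear();
        } else if (!std::isspace(static_cast<unsigned char>(c))) {
            current += c;                     // spaces inside "% A" are dropped
        }
    }
    if (!current.empty()) {
        tokens.push_back(current);
    }
    return tokens;
}
```

Because nothing is shared, two files (or two threads) can each call this on their own lines without stepping on one another.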

Error Handling

I see 3 strategies for handling errors throughout the code:

Print an error but continue on (for example in the second set of case statements in read_operands())
Print an error then bail (for example in the rest of read_operands())
Use exceptions (in read_statements())

Of those, the first one seems dangerous to me. You got some junk input, why would you continue on? Even if you can skip to the next line and continue your assembly, the output won't be right because the input was garbage.

The second one is better in that you stop reading the garbage operands, but you don't convey any information back to the caller.

Given that it probably makes the most sense to stop once you hit an error, my suggestion is to use exceptions and throw on any bad input.
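Here is a rough sketch of what throwing on bad input could look like. `ParseError` and `lookup_opcode` are hypothetical names, and the two-entry table only stands in for your full opcode map.

```cpp
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical exception type that remembers which source line was bad.
class ParseError : public std::runtime_error {
public:
    ParseError(int line, const std::string& message)
        : std::runtime_error("line " + std::to_string(line) + ": " + message),
          line_(line) {}
    int line() const { return line_; }
private:
    int line_;
};

// Hypothetical lookup: returns the encoding of a known mnemonic and throws
// for anything else. The table is a tiny stand-in for the real one.
int lookup_opcode(const std::string& name, int line)
{
    static const std::unordered_map<std::string, int> table = {
        {"NOP", 0x00}, {"HALT", 0x76}
    };
    auto found = table.find(name);
    if (found == table.end()) {
        throw ParseError(line, "unknown opcode '" + name + "'");
    }
    return found->second;
}
```

The caller can then wrap the whole assembly of a file in a single try block, print `e.what()` to std::cerr, and stop.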

Simplify

I see several things happening in read_statements() that I think could be simplified. There are 2 parts to parsing a language file: lexical analysis and syntactic analysis. Tokenizing the input string is the lexical part. Validating that the tokens are valid and that each opcode has the correct number of operands, etc. is the syntactic part. I think you should split read_statements() into 2 separate functions along that line. Something like this:

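    void read_statements()
    {
        // Same test line as in the original code.
        std::string line = "RET \"go\", b ;comment";
        std::vector<std::string> tokens;

        parse_line(line, tokens);   // lexical step
        check_tokens(tokens);       // syntactic step
    }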

Then parse_line() would figure out if there are labels, opcodes, and operands and put them into tokens:

    void parse_line(const std::string& line, std::vector<std::string>& tokens)
    {
        std::string::const_iterator it = line.cbegin();
        // A ':' anywhere on the line is taken to mean the line starts with a label.
        if (line.find_first_of(":", 0) != std::string::npos) {
            it = parse_label(line, it, tokens);
        }
        it = parse_opcode(line, it, tokens);
        it = parse_operands(line, it, tokens);
    }

The check_tokens() function would do the actual checking to ensure that the tokens are valid:

    void check_tokens(const std::vector<std::string>& tokens)
    {
        Opcode opcode;
        try {
            opcode = opcodes.at(tokens[0]);
        } catch (const std::out_of_range&) {
            std::cout << "Unknown Opcode: " << tokens[0] << std::endl;
            return;   // opcode was never set, so there is nothing more to check
        }

        if ((opcode == Opcode::RET && tokens.size() > 1) ||
            (uses_flags(opcode) && tokens.size() > 2)) {
            try {
                // operands is assumed to be reachable from here, as in the
                // original read_statements(), and to have at least one element.
                operands[0].type = flags.at(tokens[1]);
            } catch (const std::out_of_range&) {
                std::cout << "Unknown Flag:" << tokens[1] << std::endl;
            }
        }
    }

C++

You asked about C++-specific critique so here are some thoughts on that. It's great that you used std::string instead of C strings! They have many advantages.

In the above example, I used iterators instead of just an index to keep track of where we are in the string. In fact, I used a const_iterator because we have no need to change the string. This is a more idiomatic way to handle things in C++.
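The same idea applies to the small helpers. As a sketch (the name `skip_spaces` is hypothetical), skipping whitespace with iterators could be written on top of std::find_if:

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: returns an iterator to the first non-space character
// at or after `it`, or `end` if only whitespace remains.
std::string::const_iterator skip_spaces(std::string::const_iterator it,
                                        std::string::const_iterator end)
{
    return std::find_if(it, end, [](unsigned char c) {
        return !std::isspace(c);
    });
}
```

Comparing the result against `line.cend()` then replaces the `it < line.size()` checks of the index-based version.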

Generally, you'll want errors to go to std::cerr instead of std::cout. And I would have all the functions throw an exception whenever they hit an error that requires stopping processing. At the very least, you can return errors from your functions rather than returning nothing. This allows you to skip syntactic processing if lexical processing failed, for example.
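To illustrate the "return errors" option, here is a minimal sketch in which the first step reports failure through its return value, so the second step can be skipped. All names (`LexResult`, `lex_line`, `check_syntax`, `process_line`) and both checks are simplified assumptions, not your assembler's real rules.

```cpp
#include <cctype>
#include <cstddef>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical result type: the tokens plus a flag saying whether the step worked.
struct LexResult {
    bool ok;
    std::vector<std::string> tokens;
};

// First step (simplified): split on whitespace and reject a line with an
// odd number of double quotes as an unterminated string.
LexResult lex_line(const std::string& line)
{
    LexResult result{true, {}};
    std::size_t quotes = 0;
    for (char c : line) {
        if (c == '"') ++quotes;
    }
    if (quotes % 2 != 0) {
        std::cerr << "Unterminated string: " << line << '\n';
        result.ok = false;
        return result;
    }
    std::istringstream stream(line);
    std::string token;
    while (stream >> token) {
        result.tokens.push_back(token);
    }
    return result;
}

// Second step (simplified): the first token must start with a letter.
bool check_syntax(const std::vector<std::string>& tokens)
{
    if (tokens.empty() ||
        !std::isalpha(static_cast<unsigned char>(tokens[0][0]))) {
        std::cerr << "Expected an opcode\n";
        return false;
    }
    return true;
}

// Runs the second step only when the first one succeeded.
bool process_line(const std::string& line)
{
    LexResult lexed = lex_line(line);
    return lexed.ok && check_syntax(lexed.tokens);
}
```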

Shouldn’t it be syntax analysis instead of semantics analysis? — Incomputable, Commented Dec 28, 2017 at 9:52
Thank you very much for your help. One last question about parse_line(): the only minor problem I see is when a line does not have a label but the comment does contain ':'. Would it be a good idea to remove the comments in the first place, for example at the time of reading from the file? And is there a problem with having global unordered_maps? Maybe I could make them static so they stay within the file. About using exceptions, I have heard bad stuff about them, but is it OK if I use them to throw errors when parsing/checking tokens? — user156642, Commented Dec 28, 2017 at 14:32
@Incomputable I'll have to think about that. It's been a while since I needed to use it in the real world. But I thought that lexicographical analysis figured out which keywords you were using (like if and for in a C-based language), but semantic analysis figured out if the structure was correct (like did you write if (x = 0; x < max; x++) instead of for(x = 0; x < max; x++)), but maybe it's not that simple. — user1118321, Commented Dec 28, 2017 at 16:00
@user1118321, semantic analysis deals with types, array bounds, and disambiguation after syntax analysis. Syntax analysis mostly deals with tokens being in the correct order and the correct amount. Maybe during your education they were merged, as some compilers do indeed merge the syntax analysis and semantic analysis steps. — Incomputable, Commented Dec 28, 2017 at 16:02
@Thomas If it's possible to have a ':' in comments, then you'll need another strategy. Removing comments and normalizing whitespace (collapsing it all to a single space, for example) as a first step is frequently used. I recommend it. — user1118321, Commented Dec 28, 2017 at 16:03

Toby Speight · Accepted Answer · 2018-03-13 18:23:11Z

Design

If I were going to do it, then I would design it so it uses the operator>> to read opcodes from the input. This is more C++ like.

    OpCode code;
    while (stream >> code) {
        // I have successfully read one op code from the input stream.
        // This could be a file/std::cin or even a string stream.
    }

To do this you would need to define the OpCode class and its input function.

    class OpCode
    {
        public:
            // define the input operator.
            friend std::istream& operator>>(std::istream& str, OpCode& op)
            {
                op.readOperation(str);
                return str;
            }
            void readOperation(std::istream& str);  // reads the next opcode from the
                                                    // stream and initializes
                                                    // this object.
    };

Code Review

Please avoid Global Variables.

std::vector<std::string> tokens;

Please don't use `inline` unless it is required.

inline void eat_spaces(const std::string& line, int& it)

It makes moving code harder and provides no practical benefit here. Modern compilers largely ignore the keyword as an optimization hint and use their own heuristics to determine when a function should be inlined (and they are better than humans at doing it).

Avoid Input/Output parameters.

    inline void eat_spaces(const std::string& line, int& it)
    {
        while( it < line.size() && isspace(line[it]) )
            it ++;
    }

Much easier to read if you write this as taking a start point and returning the new start point.

    int eat_spaces(const std::string& line, int it)
    {
        while( it < line.size() && isspace(line[it]) ) {
            it ++;
        }
        return it;
    }

Now the usage becomes:

it = eat_spaces(line, it); // I can see that it is being modified.

Don't write your own space ignore function.

This is already done automatically by the stream's formatted input operators.

    stream >> X;  // This ignores any white space.
                  // Then correctly reads in an object of type typeof(X)
                  // If the read fails the stream is either marked bad
                  // or an exception is thrown.
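A small sketch of that behaviour, using std::istringstream and a hypothetical `split_words` helper: the formatted read skips the leading whitespace before each word, so no hand-written skipping is needed.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: collects the whitespace-separated words of a line.
// operator>> on std::string skips leading spaces and tabs by itself.
std::vector<std::string> split_words(const std::string& line)
{
    std::istringstream stream(line);
    std::vector<std::string> words;
    std::string word;
    while (stream >> word) {
        words.push_back(word);
    }
    return words;
}
```

Note that this only splits on whitespace: in `LD %A, $12` the comma stays attached to `%A,`, and `% A` would come out as two words, so the separators and the "spaces after %" rule still need their own handling.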

A function that has a return type must return a value

    int read_operands(const std::string& line, int it)
    {
        // STUFF
        case ';':
            return;   // You must return something.
                      // I am surprised the compiler lets this happen.

This is undefined behavior.

Have to mention your brace style.

It's a minor thing; but:

 if(line[it] == ':') { it ++; } else { tokens.push_back(token); break; }

This is a less common style but I have seen it before. I'm not going to say it's good or bad, but whatever style you pick you need to be consistent about its usage. You are consistent on the small scale, but there are a couple of places where you slip.

Examples:

 std::cout << "Unknown Flag:" << tokens[1] << std::endl; } } // This is indented one past the brace. // It should be two before the brace. for(int i = 0; i < tokens.size(); i ++) std::cout << tokens[i] << std::endl;

And

 break; } // This is equal to the brace should be 2 before. tokens.push_back(token); eat_spaces(line, it);

When reading a lot of code, consistency is important. Personally I don't think this style helps with consistency (but it can be done).

ReDesign

Assembly program

Each line of Z80 Assembly has 3 parts:

Label (optional)   Instruction (optional)   Comment (optional)

So we are going to end up with a map of labels to instruction positions and a vector of instructions as a result of parsing. This is assuming we are throwing away the comments.

    class Program
    {
        std::map<std::string, int>  labelPoints;
        std::vector<Instruction>    instructions;

        public:
            friend std::istream& operator>>(std::istream& str, Program& program)
            {
                program.readProgram(str);
                return str;
            }
            void readProgram(std::istream& str);
    };

    void Program::readProgram(std::istream& str)
    {
        std::string line;
        while (std::getline(str, line)) {
            // Have a line
            // Find the end of label
            std::size_t label = line.find(':');
            label = label == std::string::npos ? 0 : label + 1;

            // Find the start of a comment.
            std::size_t comment = line.find(';', label);
            comment = comment == std::string::npos ? line.size() : comment;

            // If we have a label then add it to the map
            if (label != 0) {
                labelPoints[line.substr(0, label - 1)] = instructions.size();
            }

            // Now we make a stream for the instruction so we can
            // parse that separately.
            std::stringstream instructionStream(line.substr(label, (comment - label)));
            Instruction instruction;
            if (instructionStream >> instruction) {
                instructions.push_back(instruction);
            }
            else {
                // If reading the instruction fails then mark the fail bit
                // on the input stream as we need to force the stream to
                // stop being read in the while loop. This will allow any
                // external user to detect if the input was read correctly.
                str.setstate(std::ios::failbit);
            }
        }
    }

So this then makes the main look like this:

    int main(int argc, char* argv[])
    {
        // Doing input checks left as an exercise.
        std::ifstream input(argv[1]);
        Program program;
        if (input >> program) {
            // Program was correctly read from the stream.
        }
    }

Assembly Instruction

An assembly instruction is in 3 parts.

OpCode
Source (optional)
Destination (optional)

Whether an instruction has a source and a destination depends on the instruction. But we will concentrate on parsing the input here and leave validation as a separate exercise.

    class Instruction
    {
        OpCode  code;
        Param   source;
        Param   destination;

        public:
            friend std::istream& operator>>(std::istream& str, Instruction& ins)
            {
                ins.readInstruction(str);
                return str;
            }
            void readInstruction(std::istream& str);
    };

    void Instruction::readInstruction(std::istream& str)
    {
        source.used = false;
        destination.used = false;
        if (str >> code >> source >> destination) {
            // The read worked.
            validateInstruction();
        }
    }

    std::istream& operator>>(std::istream& str, OpCode& code)
    {
        std::string op;
        str >> op;
        auto find = stringToOpCode.find(op);
        if (find != stringToOpCode.end()) {
            code = find->second;
        }
        else {
            str.setstate(std::ios::failbit);
        }
        return str;
    }

    std::istream& operator>>(std::istream& str, Param& param)
    {
        // So now we just need to decide if this is a:
        //  * register      A B C D E H L AF BC DE HL IX IY SP
        //  * value         Number
        //  * address       (Number) or (HL) or (I[XY]+Number)
        //  * flag          NC C NZ Z PO PE P M
        std::string paramValue;
        if (str >> paramValue) {
            if (paramValue[0] == '(') {
                parseAddress(paramValue, str);
            }
            else if (std::isdigit(static_cast<unsigned char>(paramValue[0]))) {
                parseNumber(paramValue, str);
            }
            else {
                auto reg = stringToRegister.find(paramValue);
                if (reg != stringToRegister.end()) {
                    parseRegister(reg, str);
                }
                else {
                    auto flag = stringToFlag.find(paramValue);
                    if (flag != stringToFlag.end()) {
                        parseFlag(flag, str);
                    }
                    else {
                        str.setstate(std::ios::failbit);
                    }
                }
            }
        }
        else {
            str.setstate(std::ios::failbit);
        }
        return str;
    }
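For the validation step that I left as an exercise, one possible shape is a table of allowed operand counts per mnemonic. This is only a sketch under my own assumptions: `OperandCount` and `operand_count_ok` are hypothetical names, the table covers just five mnemonics, and a real check would also look at operand types.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical record: the smallest and largest operand count a mnemonic accepts.
struct OperandCount {
    int min;
    int max;
};

// Hypothetical check: true when `count` operands are acceptable for `mnemonic`.
// Unknown mnemonics are rejected.
bool operand_count_ok(const std::string& mnemonic, int count)
{
    static const std::unordered_map<std::string, OperandCount> limits = {
        {"NOP",  {0, 0}},
        {"RET",  {0, 1}},   // RET or RET cc
        {"PUSH", {1, 1}},
        {"LD",   {2, 2}},
        {"JP",   {1, 2}},   // JP nn, JP (HL) or JP cc, nn
    };
    auto found = limits.find(mnemonic);
    if (found == limits.end()) {
        return false;
    }
    return count >= found->second.min && count <= found->second.max;
}
```

Keeping the limits in data rather than in a chain of if statements means adding an instruction is a one-line change.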

Thanks a lot. The inconsistencies in styling, the global data, and that return statement were actually leftovers from experimentation that I forgot to change, but thanks for noticing them. I'll try to modify the current program taking your recommendations into account. Right now the program is almost done, but I would like your opinion about the main algorithm and whether there is a way of assembling each instruction other than brute force. — user156642, Commented Dec 30, 2017 at 18:58
By the way, here is a link to the Zilog user manual, where all instructions are detailed: zilog.com/index.php?option=com_doc&Itemid=99 — user156642, Commented Dec 30, 2017 at 19:19
