Dimacs parser written in C++

Question

I am a beginner in C++. I recently wrote a dimacs parser in C++ as a learning exercise. Can you suggest improvements in my code?

#include<cstdlib> #include<cstdio> #include<vector> #include<cstring> #define vector std::vector #define memcmp std::memcmp typedef struct { char *current; int line; int nvars; int nclauses; } Parser; int parse_literal(Parser *parser); void skip(Parser *parser); void parse_header(Parser *parser); vector<int> parse_clause(Parser *parser); vector<vector<int>> parse_dimacs(Parser *parser); void print_clause(vector<int> clause); int main(int argc, char *argv[]) { if(argc < 2) { printf("Usage: ./dimacs_parser <dimacs_file>\n"); exit(EXIT_FAILURE); } // read file and allocate large enough string FILE *file = fopen(argv[1], "rb"); fseek(file, 0L, SEEK_END); size_t filesize = ftell(file); rewind(file); char *dimacs = (char *)malloc(filesize); size_t bytes_read = fread(dimacs, sizeof(char), filesize, file); dimacs[bytes_read] = '\0'; // EOF // parse Parser *parser = (Parser *)malloc(sizeof(Parser)); parser->current = dimacs; vector<vector<int>> clauses = parse_dimacs(parser); for(auto &clause: clauses) { print_clause(clause); putchar('\n'); } // free the resources fclose(file); free(dimacs); free(parser); } void print_clause(vector<int> clause) { printf("["); for(int i = 0; i < clause.size() - 1; i++) printf("%d, ", clause[i]); printf("%d]", clause.back()); } int parse_literal(Parser *parser) { skip(parser); bool is_negative = false; int literal = 0; if(*parser->current == '-') { is_negative = true; parser->current++; } while(*parser->current != ' ' && *parser->current != '\n') { switch(*parser->current) { case '0': case '1': case '2': case '3': case '4': case '5': case '6': case '7': case '8': case '9': literal = 10 * literal + (*parser->current - '0'); parser->current++; break; default: printf("Unexpected character %c at [line: %d]", *parser->current, parser->line); exit(EXIT_FAILURE); } } return is_negative ? -literal : literal; } void skip(Parser *parser) { switch(*parser->current) { case ' ': case '\t': case '\r': parser->current++; skip(parser); break; case '\n': parser->current++; parser->line++; skip(parser); break; case 'c': while(*parser->current != '\n') parser->current++; skip(parser); break; default: return; } } void parse_header(Parser *parser) { #define skip_whitespace while(*parser->current++ != ' '); skip(parser); if(*parser->current != 'p') { printf("expect header at [line: %d]", parser->line); exit(EXIT_FAILURE); } parser->current++; skip_whitespace; #undef skip_whitespace if(memcmp(parser->current, "cnf", 3) != 0) { printf("expect 'cnf' in header at [line: %d]", parser->line); exit(EXIT_FAILURE); } parser->current += 3; skip(parser); parser->nvars = parse_literal(parser); skip(parser); parser->nclauses = parse_literal(parser); } vector<int> parse_clause(Parser *parser) { skip(parser); vector<int> clause; while(*parser->current != '\n' && *parser->current != '\0') { int literal = parse_literal(parser); if(abs(literal) > parser->nvars) { printf("literal at [line: %d] exceeds total no.of vars", parser->line); exit(EXIT_FAILURE); } clause.push_back(literal); } if(clause.empty()) { printf("Unexpected EOF at [line: %d]", parser->line); exit(EXIT_FAILURE); } if(clause.back() != 0) { printf("expect '0' at the end of the clause at [line: %d]", parser->line); exit(EXIT_FAILURE); } clause.pop_back(); return clause; } vector<vector<int>> parse_dimacs(Parser *parser) { vector<vector<int>> clauses; parser->line = 1; parse_header(parser); clauses.reserve(parser->nclauses); skip(parser); if(*parser->current == '\0') { printf("expect clauses at [line: %d]", parser->line); exit(EXIT_FAILURE); } while(*parser->current != '\0' && clauses.size() < parser->nclauses) { vector<int> clause = parse_clause(parser); clauses.push_back(clause); } skip(parser); return clauses; }

What is a "dimacs parser", please? Can you share sample inputs? It seems odd to use #define vector std::vector rather than using std::vector;, along with malloc, manual memory management and pointers. You might want to provide context for why you've avoided most standard C++ idioms here in favor of a "C with vectors" flavor. — ggorlen, CommentedApr 4, 2024 at 21:17
This website contains explanation about dimacs format along with sample dimacs file jix.github.io/varisat/manual/0.2.0/formats/dimacs.html. I am completely new to C++, hence the unidiomatic code. — Sai Charan Marrivada, CommentedApr 4, 2024 at 21:36
I want to conform to simplified dimacs format. The specification is mentioned here web.archive.org/web/20190325181937/https://…. — Sai Charan Marrivada, CommentedApr 4, 2024 at 22:43
@SaiCharanMarrivada You've clarified a lot in the comments (thanks for that--many askers simply disappear). I suggest editing the post to canonize those clarifications, especially the sample inputs and context/specification for what "Dimacs" is and which flavor you're parsing. This way, the question is complete, answerable and standalone, without having to look at the comments. As for being a beginner in C++, no problem, but you might want to elaborate a bit further on that--something like "I'm a beginner in C++ and haven't used any C++ features besides vectors" or similar. Thanks. — ggorlen, CommentedApr 4, 2024 at 23:51
Welcome! Please don't edit the code in the question once it's posted. The reason is that readers will lose context and already posted answers may not contain relevant info anymore. If you'd like post a new version of the code, consider making a new post and linking this post to the new one. — Rish, CommentedApr 5, 2024 at 7:33

Toby Speight · Accepted Answer · 2024-04-12 07:40:25Z

As others have already mentioned, your implementation is more 'C with std::vectors` than C++.

Don't use C headers

<cstdio>, <cstdlib> and <cstring> are all C headers. They still have their place in C++, but in most cases, you can replace them with a better C++ equivalent, or use a more idiomatic mechanism to achieve the same goals.

<cstdio> for printf: This can be replaced by using <iostream> and std::cout. If you're C++23, you can use the more ergonomic std::print in <print>.
<cstdlib> for malloc: This can be replaced by using either unique_ptr or shared_ptr in <memory>; even better, use the types already provided by the C++ standard library to deal with managing memory like std::vector or std::string.
<cstring> for memcmp: In your specific case, can probably be replaced by something like std::string or std::string_view's operator==. There's also std::equal.

Don't `#define`

You shouldn't be using #define this way. Really, the convention is to just use the fully qualified name, i.e. std::vector. If you really want to use just vector, consider using using declarations:

using std::vector;

Now you can refer to the type as vector.

One thing you should be careful of is never to use using declarations in a header and/or in the global scope and keep the declaration as local as possible.

#include <vector> // using std::vector; <- Bad, pollutes the global namespace void my_func() { using std::vector; // Fine, you can use 'vector' within my_func vector<int> vec; } void my_foo() { vector<int> vec; // Oops, doesn't work! }

`typedef struct`

You don't need typedef struct. This is a holdover from C. In C++, you can simply declare it as struct Parser { ... };

Reading the file

There are a couple of issues with the way you're reading the file.

The C way

I'll call your implementation 'the C way'. One problem is the lack of error handling. fread may read less than the desired amount. You need to check whether any occurred or you have EOF by calling ferror and feof respectively, and handle those conditions accordingly.

Second, the dimacs[bytes_read] = '\0'; line is writing to memory which doesn't exist, which is undefined behavior. The remedy is this to allocate 1 more than required during malloc.

The C++ way

The idiomatic C++ way of reading a file is using ifstream.

auto file = std::ifstream(argv[1]); if (!file) { // File does not exist, handle accordingly }

You can then read the file into a std::string. There are multiple ways to do it, none of them particularly 'elegant'. See this SO answer for more details.

If you're using C++17 or later, you can also use std::filesystem::file_size to get the size of the file (which calls into the underlying OS' filesystem APIs) instead of relying on fseek or ftell.

Another benefit of using ifstream is that you don't need to worry about doing an fclose at the end, due to RAII. See this SO answer to learn more about RAII.

Allocate on the stack

You allocate the parser object on the heap, but you need to ask if that's really necessary. There are 3 main reasons to allocate on the heap:

You don't know the allocation size until runtime (this is not the case).
The object is large enough that it doesn't make sense to allocate on the stack (this is not the case, and also very unlikely).
The lifetime of the object needs to extend beyond the lifetime of the enclosing scope (again, not the case).

In this case, consider just allocating the object on the stack. Another benefit is that you don't need to do free(parser) at the end, because it will automatically be destroyed when the enclosing scope is destroyed.

Parser parser;

Another issue is that malloc does not initialize the memory it allocates. So you might end up in situations where your fields are initialized to garbage values. lines, nvar, nclauses can be either 0, 5, 42, 20000000, or any other value.

In C++, you can simply initialize the object members inline.

struct Parser { int lines = 0; int nvar = 0; int nclauses = 0; };

Now, whether you allocate on the stack or on the heap (using make_unique), you will end up with sensible default values.

Recursive `skip`

Due the recursive nature of the skip function, it's possible for someone to craft a file in such a way as to cause a stack overflow.

On my MacBook, the stack size is 8MiB. In this case, I just needed to construct an input text file with 300,000 whitespaces to trigger the stack overflow.

$ for i in {0..300000}; do echo ' ' >> test.txt; done $ ./main test.txt [1] 31679 segmentation fault ./main test.txt

Consider implementing it iteratively.

Other general comments

Using reference instead of pointers: Instead of passing Parser* to your functions, consider passing Parser& since it's more idiomatic. It is also more difficult to end up in situation where Parser& points to invalid memory.

The switch in parse_literals can be replaced by an if.

if (auto ch = *parser->current; ch >= '0' && ch <= '9') { // Do something } else { // Do something else }

memcmp(parser->current, "cnf", 3) can result in undefined behavior if parser->current + 3 points to unallocated memory. Similarly, parser->current += 3.

thank you for you comments. The skip function is tail-recursive and an optimizing compiler will replace that with a loop instead of function calls. — Sai Charan Marrivada, CommentedApr 5, 2024 at 7:56
Yes, but 'will replace that' is doing a lot of heavy lifting there. Compilers are under no obligation to optimize this, and one can use -fno-optimize-siblilng-calls to disable tail recursion anyway. Maybe someone with a better grasp of the standard can confirm, but this would fall the umbrella of undefined behavior. — Rish, CommentedApr 13, 2024 at 2:42
I have a bit of functional programming background, so I cannot sometimes resist recursion. — Sai Charan Marrivada, CommentedApr 13, 2024 at 20:45

Madagascar · Accepted Answer · 2024-04-05 17:59:11Z

You have already recieved a review for the C++ part, so I'd not repeat that.

fseek(file, 0L, SEEK_END); size_t filesize = ftell(file);

The fseek and ftell functions are both defined by the ISO C language standard. The following is from the public latest draft of the C2X standard:

7.23.9.2:

The fseek function sets the file position indicator for the stream pointed to by stream. If a read or write error occurs, the error indicator for the stream is set and fseek fails.
For a binary stream, the new position, measured in characters from the beginning of the file, is obtained by adding offset to the position specified by whence. The specified position is the beginning of the file if whence is SEEK_SET, the current value of the file position indicator if SEEK_CUR, or end-of-file if SEEK_END. A binary stream need not meaningfully support fseek calls with a whence value of SEEK_END.

A footnote below "7.23.4.2 The rename function" also states:

Setting the file position indicator to end-of-file, as with fseek(file, 0, SEEK_END), has undefined behavior for a binary stream (because of possible trailing null characters) or for any stream with state-dependent encoding that does not assuredly end in the initial shift state.

As a binary stream need not support SEEK_END, you can not rely on it.

There's no portable away to find the size of a file in C, except for reading the whole file. POSIX's stat/fstat is another option, which are implemented by Windows as well, and are pretty portable.

But you don't need to meddle with fseek or stat in C++.

FILE *file = fopen(argv[1], "rb");

fopen returns a nullptr on failure. Failing to check for it risks invoking undefined behavior by a subsequent null pointer dereference. Same goes for the calls to malloc(), which went unchecked in the original post.

char *dimacs = (char *)malloc(filesize);

That's a C-style cast. The C++ way is:

char *dimacs = reinterpret_cast<char *> (malloc(filesize));

Note that the cast is redundant and frowned upon in C, but necessary in C++, because C allows implicit conversion from void * to any other pointer type, whilst C++ does not. The cast was also necessary prior to C89, when there was no void *, and malloc() and family used to return a char * instead.

Preferably use new in C++. (Or don't use it at all. @Rish has suggested alternatives.)

Note that malloc(filesize) should be malloc(filesize + 1) for the null terminator.

size_t bytes_read = fread(dimacs, sizeof(char), filesize, file);

sizeof(char) is defined to be 1, so you can elide that. More importantly, you've failed to check the return value of fread, which can be a short count if end-of-file was reached or an error occurred.

// free the resources fclose(file); free(dimacs); free(parser);

That's an unnecessary comment. It doesn't add any extra information to the code and should be removed.

switch(*parser->current) { case '0': case '1': case '2': case '3': case '4': case '5': case '6': case '7': case '8': case '9':

This switch can be replaced with 1 call to isdigit from cctype.

if (isdigit(static_cast<unsigned char> (*parser->current))) { ... }

#define skip_whitespace while(*parser->current++ != ' ');

This should be a function, not a macro.

printf("expect header at [line: %d]", parser->line);

Error messages go to stderr, not stdout.

Instead of printf, consider:

fprintf(stderr, "expect header at [line: %d]", parser->line); std::cerr << "expect header at [line: " << parser->line << "]"; std::cerr << std::format("expect header at [line: {}]", 42); std::print(stderr, "expect header at [line: {}]", parser->line);

Stack Exchange Network

Dimacs parser written in C++

2 Answers 2

Don't use C headers

Don't `#define`

`typedef struct`

Reading the file

The C way

The C++ way

Allocate on the stack

Recursive `skip`

Other general comments

Hot Network Questions

Dimacs parser written in C++

2 Answers 2

Don't use C headers

Don't #define

typedef struct

Reading the file

The C way

The C++ way

Allocate on the stack

Recursive skip

Other general comments

Related

Hot Network Questions

Don't `#define`

`typedef struct`

Recursive `skip`