
I have long been interested, mostly for educational reasons, in writing a parser that reads ASCII files. My idea was not to use tools like Bison or Yacc but to write the parser by hand. I finally gave it a try, with the objective of reading a file representing the description of a 3D scene (based on the USDA format, for those of you familiar with it).

Some context: A scene is made up of primitives (like a 3D mesh, a light, etc.) that are defined by a series of attributes. These attributes can be scalars or tuples, and an attribute can hold an array of those or just a single element. Attributes are defined by their name (identifier) as well as their value type, such as float, float2, float3, or point3f, which represent a float scalar, a 2-tuple of floats, a 3-tuple of floats, etc. Of course, besides storing the primitives themselves, I also need to store the attributes that belong to each primitive, along with their name, value type, base type (int, float), and value(s).

As primitives can be nested, the prim_t structure has a next variable that stores a sibling in the primitive hierarchy. So siblings are stored as a linked list at the moment. Similarly, attributes belonging to a primitive are stored as a linked list.

Since this is the first time I’m doing this (and I wrote it in C rather than C++, which I know a bit better), I would really appreciate feedback from the community. Here are the points that I’m particularly interested in / less interested in:

Interested:

  • I am interested in feedback about the scanner/lexer side of it. I have written down the rules following a context-free grammar syntax to serve as guidelines for implementing the flow the program is supposed to follow. Feedback on how characters are fetched, how tokens are composed or fetched, etc., would be great.
  • I am also interested in how to bridge the value types read from the file (like point3f) to C types. I’ve chosen to declare a macro (see #define DATA_TYPES) where I make an equivalence between the value type tokens and their C equivalent types, then build various arrays based on that to establish the bridge. Feedback on this approach would be great! Is it a good method? Could it be improved (which I’m sure it can)?
  • The attributes use a kind of type erasure. I don’t know how to do it differently. Is this the best way or the only way to do it? Can I improve the code here? I have been using a union to store attributes with a single element (which can be a tuple or a scalar) and a pointer to a data structure for arrays. Is this good practice? Can it be done differently?

Less Interested:

  • I am not particularly interested in speed here. I am more focused on feedback regarding general code quality, with a focus on the two techniques I described earlier: the scanner and the best way to store the attributes.
  • I know I can use mmap to make it faster, so no need to focus on that.
  • I am also less interested in error reporting for now. I’m aware that just calling die is not the best way to handle things and that the current implementation isn't production-ready in terms of error handling.

PS: I’ve done my best to remove the warnings, but I couldn’t completely get rid of unsafe buffer access.

All feedback greatly appreciated and humbly taken.

Code:

/**
 * clang src/readgeo.c -Weverything -std=c11 -o build/readgeo.exe -O3
 */
#pragma clang diagnostic ignored "-Wunsafe-buffer-usage"
#define _CRT_SECURE_NO_WARNINGS

#include <stdlib.h>
#include <stdio.h>
#include <assert.h>
#include <ctype.h>
#include <string.h>
#include <stdbool.h>
#include <stdint.h>

#ifdef _WIN32
#include <Windows.h>
#else
#include <time.h>
#include <sys/time.h>
#endif

#define TOKEN_MAX_LEN 256

typedef enum {
    BT_BOOL,
    BT_INT32,
    BT_FLOAT,
    BT_TOKEN,
} BaseType;

typedef enum {
    AT_NONE,
    AT_BOOL,
    AT_INT32,
    AT_FLOAT,
    AT_FLOAT2,
    AT_FLOAT3,
    AT_TOKEN,
} AttributeType;

typedef struct {
    size_t capacity; /* number of stored elements */
    size_t size;     /* max number of elements that can be stored */
    BaseType type;   /* base type: float, bool, uint32, ... */
    uint32_t bytes;
    void* memory;
} data_t;

typedef struct attribute_t {
    char identifier[TOKEN_MAX_LEN];
    AttributeType attr_type;
    BaseType base_type;
    struct attribute_t* next;
    union {
        bool b[4];  /* bool, bool2, bool2, bool4 */
        int i[4];   /* int, int2, int3, int4 */
        float f[4]; /* float, float2, float3, float4 */
    };
    data_t* data;
} attribute_t;

typedef struct prim_t {
    int schid; // schema id
    char identifier[TOKEN_MAX_LEN];
    struct prim_t* parent;
    struct prim_t* next; /* child-sibling. Linked list */
    attribute_t* attributes;
} prim_t;

static prim_t* root;
static int ch;
static FILE* file;
static int line;
static int col;

static void inp(void)
{
    ch = fgetc(file);
    if (ch == '\n') { line++; col = 0; }
    col = (ch == '\t') ? col + 4 : col + 1;
}

static void unget(void)
{
    ungetc(ch, file);
    col--;
}

typedef enum {
    TOKEN_RESERVED,
    TOKEN_IDENTIFIER,
    TOKEN_NUMBER,
    TOKEN_LEFT_BRACE,
    TOKEN_RIGHT_BRACE,
    TOKEN_LEFT_BRACKET,
    TOKEN_RIGHT_BRACKET,
    TOKEN_LEFT_PARENTHESIS,
    TOKEN_RIGHT_PARENTHESIS,
    TOKEN_EQUAL,
    TOKEN_COMMMA,
    TOKEN_COLON,
    TOKEN_INT,
    TOKEN_FLOAT,
    TOKEN_UNKNOWN
} TokenCode;

typedef struct {
    char content[TOKEN_MAX_LEN];
    TokenCode type;
} token_t;

#define MAX_TABS 16

static void free_alloc(prim_t* const prim, size_t l)
{
    int nattr;
    char tabs[MAX_TABS];
    attribute_t* attr;
    if (prim->next)
        free_alloc(prim->next, l+1L);
    assert(l < MAX_TABS);
    memset(tabs, '\t', l);
    tabs[(l == 0) ? 0 : l] = '\0';
    printf("%s%s\n", tabs, prim->identifier);
    fflush(stdout);
    nattr = 0;
    attr = prim->attributes;
    while (attr != NULL) {
        nattr++;
        printf("%s%04d: %s\n", tabs, nattr, attr->identifier);
        fflush(stdout);
        if (attr->data != NULL) {
            if (attr->data->memory != NULL)
                free(attr->data->memory);
            memset(attr->data, 0x0, sizeof(data_t));
        }
        attr = attr->next;
    }
    free(prim);
}

static _Noreturn void die(const char* err, int _line_)
{
    fprintf(stderr, "%s (line: %d, col: %d, last char read: %c), at %d\n",
        err, line, col, ch, _line_);
    fclose(file);
    free_alloc(root, 0);
    abort();
}

static token_t next_token(void)
{
    token_t token;
    int tl;
    memset(token.content, 0x0, TOKEN_MAX_LEN);
    token.type = TOKEN_UNKNOWN;
    tl = 0;
    inp();
    while (isspace(ch) || ch == '\n' || ch == '\t')
        inp();
    while (ch == '#') {
        while (ch != '\n') {
            inp();
        }
        return next_token();
    }
    if (isalpha(ch)) {
        do {
            token.content[tl++] = (char)ch;
            assert(tl < TOKEN_MAX_LEN-1);
            inp();
        } while (isalnum(ch));
        if (ch != ' ')
            unget();
        token.content[tl] = '\0';
        token.type = TOKEN_RESERVED;
    }
    else if (ch == '"') {
        inp();
        while (ch != '"' && ch != EOF) {
            if (!isalnum(ch) && ch != '_' && ch != ':' && ch != '!')
                die("error while reading identifier", __LINE__);
            token.content[tl++] = (char)ch;
            inp();
        }
        token.content[tl] = '\0';
        token.type = TOKEN_IDENTIFIER;
        inp();
        if (ch != ' ')
            unget();
    }
    else if (isdigit(ch) || ch == '-') {
        do {
            token.content[tl++] = (char)ch;
            inp();
        } while (isdigit(ch) || ch == '.' || ch == 'e' || ch == '-');
        if (ch != ' ')
            unget();
        token.content[tl] = '\0';
        token.type = TOKEN_NUMBER;
    }
    else if (ch == '{') token.type = TOKEN_LEFT_BRACE;
    else if (ch == '}') token.type = TOKEN_RIGHT_BRACE;
    else if (ch == '[') token.type = TOKEN_LEFT_BRACKET;
    else if (ch == ']') token.type = TOKEN_RIGHT_BRACKET;
    else if (ch == '(') token.type = TOKEN_LEFT_PARENTHESIS;
    else if (ch == ')') token.type = TOKEN_RIGHT_PARENTHESIS;
    else if (ch == '=') token.type = TOKEN_EQUAL;
    else if (ch == ',') token.type = TOKEN_COMMMA;
    else if (ch == ':') token.type = TOKEN_COLON;
    else {}
    return token;
}

static const char* spec_tokens[] = {NULL, "def", "over", "class", NULL};
static const char* schema_tokens[] = {NULL, "Mesh", "Camera", "Light", "Xform", "GeomSubset", NULL};
static const char* var_tokens[] = {NULL, "uniform", NULL}; /* variabiliy. Default to varying */

/* https://openusd.org/dev/api/_usd__page__datatypes.html */
/* Value Type Token | Base Type | Tuple Size */
#define DATA_TYPES \
    X("bool",       AT_BOOL,   1, bool) \
    X("int",        AT_INT32,  1, int32_t) \
    X("float",      AT_FLOAT,  1, float) \
    X("float2",     AT_FLOAT2, 2, float) \
    X("float3",     AT_FLOAT3, 3, float) \
    X("double",     AT_FLOAT,  1, float) \
    X("double3",    AT_FLOAT3, 3, float) \
    X("point3f",    AT_FLOAT3, 3, float) \
    X("color3f",    AT_FLOAT3, 3, float) \
    X("normal3f",   AT_FLOAT3, 3, float) \
    X("texCoord2f", AT_FLOAT2, 2, float) \
    X("token",      AT_TOKEN,  1, char)

static const char* value_type_tokens[] = {
#define X(a, b, c, d) a,
    NULL, DATA_TYPES NULL
#undef X
};

static AttributeType attribute_types[] = {
#define X(a, b, c, d) (b),
    0, DATA_TYPES 0
#undef X
};

static size_t tuple_sizes[] = {
#define X(a, b, c, d) (c),
    AT_NONE, DATA_TYPES AT_NONE
#undef X
};

static int find(const char* token, const char* const tokens[])
{
    for (int i = 1; tokens[i] != NULL; ++i) {
        if (strcmp(token, tokens[i]) == 0)
            return i;
    }
    return 0;
}

/**
 * @todo padding here is probably a good idea?
 */
static void init_data(data_t** data, BaseType type)
{
    *data = (void*)malloc(sizeof(data_t));
    (*data)->capacity = 16; /* 16x int, or 16x (float, float, float) if vec3f for instance */
    (*data)->size = 0;
    (*data)->type = type;
    (*data)->bytes = (type == BT_BOOL || type == BT_TOKEN) ? 1 : 4;
    (*data)->memory = (void*)malloc((*data)->bytes * (*data)->capacity);
}

static void push(data_t* const data, void* v)
{
    void* temp;
    assert(data->capacity > 0);
    if (data->size + 1 > data->capacity) {
        temp = (void*)realloc(data->memory, data->capacity * 2 * data->bytes);
        if (temp == NULL)
            die("can't allocate memory", __LINE__);
        data->memory = temp;
        data->capacity *= 2;
    }
    memcpy((char*)data->memory + data->size * data->bytes, v, data->bytes);
    data->size++;
}

static void push_token(data_t* const data, const char* const token)
{
    size_t size;
    void* temp;
    size_t req_capacity, new_capacity;
    assert(data->bytes == 1 && data->capacity > 0);
    size = strlen(token);
    req_capacity = data->size + size + 1; /* need space to store token + \0 */
    if (req_capacity > data->capacity) {
        new_capacity = max(data->capacity * 2, req_capacity);
        temp = (void*)realloc(data->memory, new_capacity);
        if (temp == NULL)
            die("can't allocate memory", __LINE__);
        data->memory = temp;
        data->capacity = new_capacity;
    }
    memcpy((char*)data->memory + data->size, token, size);
    memset((char*)data->memory + data->size + size, '\0', 1);
    data->size += size + 1;
}

typedef void (*conv_and_push_func)(data_t*, const char* const);

static void conv_push_bool(data_t* data, const char* const token)
{
    bool b;
    assert(strlen(token) != 0 && (token[0] == '0' || token[0] == '1'));
    b = (token[0] == '0') ? false : true;
    push(data, &b);
}

static void conv_push_int(data_t* data, const char* const token)
{
    int a;
    a = atoi(token);
    push(data, &a);
}

static void conv_push_float(data_t* data, const char* const token)
{
    float f;
    f = strtof(token, NULL);
    push(data, &f);
}

static void read_array_scalar(data_t* data, BaseType type, conv_and_push_func conv_and_push)
{
    token_t token;
    token = next_token();
    if (token.type != TOKEN_LEFT_BRACKET)
        die("expected [ after = for arrays", __LINE__);
    while (1) {
        token = next_token();
        if ((type < BT_TOKEN && token.type != TOKEN_NUMBER) ||
            (type == BT_TOKEN && token.type != TOKEN_IDENTIFIER)) {
            die("expected a string or a number (bad formatting)", __LINE__);
        }
        conv_and_push(data, token.content);
        token = next_token();
        if (token.type != TOKEN_COMMMA)
            break;
    }
    if (token.type != TOKEN_RIGHT_BRACKET)
        die("expected ]", __LINE__);
}

static void read_array_tuple(data_t* data, size_t tuple, conv_and_push_func conv_and_push)
{
    token_t token;
    size_t i;
    token = next_token();
    if (token.type != TOKEN_LEFT_BRACKET)
        die("expected [ after = for arrays", __LINE__);
    while (1) {
        token = next_token();
        if (token.type != TOKEN_LEFT_PARENTHESIS)
            die("error while reading tuple", __LINE__);
        for (i = 0; i < tuple;) {
            token = next_token();
            if (token.type != TOKEN_NUMBER) {
                die("expected a number (bad formatting)", __LINE__);
            }
            conv_and_push(data, token.content);
            token = next_token();
            if (!(++i < tuple) ? token.type == TOKEN_COMMMA : token.type == TOKEN_RIGHT_PARENTHESIS)
                die("bad formating in tuple", __LINE__);
        }
        token = next_token();
        if (token.type != TOKEN_COMMMA)
            break;
    }
    if (token.type != TOKEN_RIGHT_BRACKET)
        die("expected ]", __LINE__);
}

static data_t* read_attr_array(BaseType type, size_t tuple)
{
    data_t* data;
    conv_and_push_func conv_and_push;
    init_data(&data, type);
    switch (type) {
        case BT_BOOL:  conv_and_push = conv_push_bool; break;
        case BT_INT32: conv_and_push = conv_push_int; break;
        case BT_FLOAT: conv_and_push = conv_push_float; break;
        case BT_TOKEN: assert(tuple == 1); conv_and_push = push_token; break;
    }
    if (tuple == 1)
        read_array_scalar(data, type, conv_and_push);
    else
        read_array_tuple(data, tuple, conv_and_push);
    return data;
}

typedef void (*conv_and_store_func)(attribute_t* const, const char*, size_t);

static void conv_and_store_bool(attribute_t* const attr, const char* token, size_t i)
{
    assert(strlen(token) != 0 && (token[0] == '0' || token[0] == '1') && i < 4 && attr->base_type == BT_BOOL);
    attr->b[i] = (token[0] == '0') ? false : true;
}

static void conv_and_store_int(attribute_t* const attr, const char* token, size_t i)
{
    assert(i < 4 && attr->base_type == BT_INT32);
    attr->i[i] = atoi(token);
}

static void conv_and_store_float(attribute_t* const attr, const char* token, size_t i)
{
    assert(i < 4 && attr->base_type == BT_FLOAT);
    attr->f[i] = strtof(token, NULL);
}

static void read_attr_value(BaseType type, size_t tuple, attribute_t* const attr)
{
    token_t token;
    conv_and_store_func conv_and_store;
    conv_and_store = NULL;
    switch(type) {
        case BT_BOOL:  conv_and_store = conv_and_store_bool; break;
        case BT_INT32: conv_and_store = conv_and_store_int; break;
        case BT_FLOAT: conv_and_store = conv_and_store_float; break;
        case BT_TOKEN: break;
    }
    if (tuple == 1) {
        token = next_token();
        if (type < BT_TOKEN) {
            if (token.type != TOKEN_NUMBER)
                die("expected a number (bad formatting)", __LINE__);
            assert(conv_and_store != NULL);
            conv_and_store(attr, token.content, 0);
        }
        else {
            if (token.type != TOKEN_IDENTIFIER)
                die("expected a token (bad formatting)", __LINE__);
            // NOT IMPLEMENTED strcpy(attr->token, token.content);
        }
    }
    else {
        assert(type != BT_TOKEN); /* tokens are not stored as tuples */
        token = next_token();
        if (token.type != TOKEN_LEFT_PARENTHESIS)
            die("error while reading tuple", __LINE__);
        for (size_t i = 0; i < tuple;) {
            token = next_token();
            if (token.type != TOKEN_NUMBER) {
                die("expected a number (bad formatting)", __LINE__);
            }
            conv_and_store(attr, token.content, i);
            token = next_token();
            if (!(++i < tuple) ? token.type == TOKEN_COMMMA : token.type == TOKEN_RIGHT_PARENTHESIS)
                die("bad formating in tuple", __LINE__);
        }
    }
}

static void read_prim(prim_t* const prim, size_t l);

static void read_prim_body(prim_t* const prim, size_t l)
{
    token_t token;
    prim_t* cur_child;
    attribute_t* cur_attr;
    prim_t** next_prim_ptr;
    /*
    char tabs[16];
    assert(l <= 16L);
    memset(tabs, ' ', l * 2);
    tabs[(l == 0) ? 0 : l * 2] = '\0';
    */
    cur_attr = NULL;
    next_prim_ptr = NULL;
    cur_child = prim->next;
    assert(prim->attributes == NULL);
    while(1) {
        token = next_token();
        assert(!feof(file));
        /* nested prim definition */
        if (find(token.content, spec_tokens)) {
            prim_t* child = (prim_t*)malloc(sizeof(prim_t));
            memset(child, 0x0, sizeof(prim_t));
            child->parent = prim;
            next_prim_ptr = (prim->next == NULL) ? &prim->next : &cur_child->next;
            *next_prim_ptr = child;
            cur_child = child;
            read_prim(child, l+1L);
        }
        /* end of block for current prim definition */
        else if (token.type == TOKEN_RIGHT_BRACE) {
            break;
        }
        /* dictionary - optional */
        else if (token.type == TOKEN_LEFT_PARENTHESIS) {
            do { inp(); } while (ch != ')');
        }
        /* reading a prim's attribute */
        else {
            int vti, vi; /* index into the value_type_tokens array, variability index */
            bool is_array;
            char identifier[TOKEN_MAX_LEN];
            int off;
            BaseType type;
            attribute_t* attr;           /* attribute to be created */
            attribute_t** next_attr_ptr; /* next pointer to assign new attribute to */
            off = 0;
            is_array = false;
            next_attr_ptr = NULL;
            /* variability: optional */
            if ((vi = find(token.content, var_tokens))) {
                token = next_token();
            }
            /* type */
            if (!(vti = find(token.content, value_type_tokens))) {
                die("expected type declaration", __LINE__);
            }
            token = next_token();
            /* is array? */
            if (token.type == TOKEN_LEFT_BRACKET) {
                token = next_token();
                if (token.type != TOKEN_RIGHT_BRACKET)
                    die("syntax error, missing closing brack \']\'", __LINE__);
                is_array = true;
                token = next_token();
            }
            /* read attribute, optional `:` seperated words */
            while(1) {
                assert(token.type == TOKEN_RESERVED && strlen(token.content) != 0);
                memcpy(identifier + off, token.content, strlen(token.content) + 1); /* include '\0' in copy */
                off += strlen(token.content);
                token = next_token();
                if (token.type != TOKEN_COLON)
                    break;
                identifier[off++] = ':';
                token = next_token();
            }
            /*
            printf("%s * %s %s%s %s\n", tabs,
                (vi == 0 ? "\b" : var_tokens[vi]),
                value_type_tokens[vti],
                (is_array ? "[]" : ""),
                identifier
            );
            */
            /* equal */
            if (token.type != TOKEN_EQUAL) {
                die("= expected after attribute declaration", __LINE__);
            }
            switch(attribute_types[vti]) {
                case AT_BOOL:  type = BT_BOOL;  break;
                case AT_INT32: type = BT_INT32; break;
                case AT_FLOAT:
                case AT_FLOAT2:
                case AT_FLOAT3: type = BT_FLOAT; break;
                case AT_TOKEN:  type = BT_TOKEN; break;
                case AT_NONE: assert(false); __builtin_unreachable();
            }
            attr = (attribute_t*)malloc(sizeof(attribute_t));
            memset(attr, 0x0, sizeof(attribute_t));
            strcpy((char*)attr->identifier, identifier);
            attr->base_type = type;
            //attr->value_type =
            next_attr_ptr = (prim->attributes == NULL) ? &prim->attributes : &cur_attr->next;
            assert((*next_attr_ptr) == NULL);
            *next_attr_ptr = attr;
            cur_attr = attr;
            /* reading data */
            if (is_array) {
                attr->data = read_attr_array(type, tuple_sizes[vti]);
            }
            else {
                read_attr_value(type, tuple_sizes[vti], attr);
            }
        }
    }
}

static void read_prim(prim_t* const prim, size_t l)
{
    token_t token;
    //char tabs[16];
    int schid;
    token = next_token();
    if (!(schid = find(token.content, schema_tokens))) {
        die("expected schemas", __LINE__);
    }
    token = next_token();
    if (token.type != TOKEN_IDENTIFIER) {
        die("expected identifier", __LINE__);
    }
    strcpy((char*)prim->identifier, token.content);
    /*
    assert(l <= 16L);
    memset(tabs, ' ', l * 2);
    tabs[(l == 0) ? 0 : l * 2] = '\0';
    printf("%s-%s/%s\n", tabs, schema_tokens[schid], prim->identifier);
    fflush(stdout);
    */
    token = next_token();
    if (token.type == TOKEN_LEFT_PARENTHESIS) {
        do { inp(); } while (ch != ')');
        token = next_token();
    }
    if (token.type != TOKEN_LEFT_BRACE) {
        die("expected {", __LINE__);
    }
    read_prim_body(prim, l);
}

int main(int argc, char** argv)
{
    token_t token;
    double elapsed = 0.0;
#ifdef _WIN32
    LARGE_INTEGER frequency; // ticks per second
    LARGE_INTEGER start, end;
#else
    long seconds, nanoseconds;
    struct timespec start, end;
#endif
    line = 1;
    col = 0;
    if (argc-- > 1) {
        file = fopen(argv[1], "r");
    }
    assert(file != NULL);
    root = (prim_t*)malloc(sizeof(prim_t));
    assert(root != NULL);
    memset(root, 0x0, sizeof(prim_t));
    strcpy((char*)root->identifier, "root");
    /*
    Context Free Grammar for USDA file format:
    ==========================================
    S -> defBlock | overBlock | classBlock
    defBlock -> "def" schemas identifier '{' body '}'
    overBlock -> "over" schemas identifier '{' body '}'
    classBlock -> "class" schemas identifier '{' body '}'
    schemas -> "Mesh" | "Camera" | "Light" | "Xform" | ...
    identifier -> [a-zA-Z0-9_:]+
    body -> (defBlock | overBlock | classBlock | statement)*
    statement -> attribute | otherPrims
    attribute -> [uniform] type array? identifier '=' value
    type -> "float" | "int" | "float2" | "float3" | ...
    array -> "[]" // Array specifier
    identifier -> [a-zA-Z_:][a-zA-Z0-9_:]*
    value -> singleValue | tupleValue | arrayValue
    singleValue -> number
    tupleValue -> '(' number (',' number)* ')'
    arrayValue -> '[' (singleValue | tupleValue) (',' (singleValue | tupleValue))* ']'
    otherPrims -> ... // Define other primitives here if needed
    number -> [0-9]+(\.[0-9]+)? // Basic number definition
    */
#ifdef _WIN32
    QueryPerformanceFrequency(&frequency);
    QueryPerformanceCounter(&start);
#else
    clock_gettime(CLOCK_MONOTONIC, &start);
#endif
    while (1) {
        token = next_token();
        if (feof(file))
            break;
        if (!find(token.content, spec_tokens)) {
            die("expected specifier", __LINE__);
        }
        read_prim(root, 0);
    }
#ifdef _WIN32
    QueryPerformanceFrequency(&frequency);
    QueryPerformanceCounter(&end);
    // Calculate elapsed time in seconds
    elapsed = (double)(end.QuadPart - start.QuadPart) / (double)frequency.QuadPart;
    printf("Time taken: %f seconds\n", elapsed);
#else
    clock_gettime(CLOCK_MONOTONIC, &end);
    seconds = end.tv_sec - start.tv_sec;
    nanoseconds = end.tv_nsec - start.tv_nsec;
    elapsed = seconds + nanoseconds * 1e-9;
    printf("Time taken: %f seconds\n", elapsed);
#endif
    fclose(file);
    //die("test", __LINE__);
    free_alloc(root, 0);
    return 0;
}

Small file for testing if you want to:

#usda 1.0

def Mesh "pCube1" (
    kind = "component"
)
{
    uniform bool doubleSided = 1
    float3[] extent = [(-0.5, -0.5, -0.5), (0.5, 0.5, 0.5)]
    int[] faceVertexCounts = [4, 4, 4, 4, 4, 4]
    int[] faceVertexIndices = [0, 1, 3, 2, 2, 3, 5, 4, 4, 5, 7, 6, 6, 7, 1, 0, 1, 7, 5, 3, 6, 0, 2, 4]
    normal3f[] normals = [(0, 0, 1), (0, 0, 1), (0, 0, 1), (0, 0, 1), (0, 1, 0), (0, 1, 0), (0, 1, 0), (0, 1, 0), (0, 0, -1), (0, 0, -1), (0, 0, -1), (0, 0, -1), (0, -1, 0), (0, -1, 0), (0, -1, 0), (0, -1, 0), (1, 0, 0), (1, 0, 0), (1, 0, 0), (1, 0, 0), (-1, 0, 0), (-1, 0, 0), (-1, 0, 0), (-1, 0, 0)] (
        interpolation = "faceVarying"
    )
    point3f[] points = [(-0.5, -0.5, 0.5), (0.5, -0.5, 0.5), (-0.5, 0.5, 0.5), (0.5, 0.5, 0.5), (-0.5, 0.5, -0.5), (0.5, 0.5, -0.5), (-0.5, -0.5, -0.5), (0.5, -0.5, -0.5)]
    color3f[] primvars:displayColor = [(0.13320851, 0.13320851, 0.13320851)] (
        customData = {
            dictionary Maya = {
                bool generated = 1
            }
        }
    )
    texCoord2f[] primvars:st = [(0.375, 0), (0.625, 0), (0.375, 0.25), (0.625, 0.25), (0.375, 0.5), (0.625, 0.5), (0.375, 0.75), (0.625, 0.75), (0.375, 1), (0.625, 1), (0.875, 0), (0.875, 0.25), (0.125, 0), (0.125, 0.25)] (
        customData = {
            dictionary Maya = {
                token name = "map1"
            }
        }
        interpolation = "faceVarying"
    )
    int[] primvars:st:indices = [0, 1, 3, 2, 2, 3, 5, 4, 4, 5, 7, 6, 6, 7, 9, 8, 1, 10, 11, 3, 12, 0, 2, 13]
    uniform token subdivisionScheme = "none"

    def GeomSubset "back"
    {
        uniform token elementType = "face"
        uniform token familyName = "componentTag"
        int[] indices = [2]
    }

    def GeomSubset "bottom"
    {
        uniform token elementType = "face"
        uniform token familyName = "componentTag"
        int[] indices = [3]
    }

    def GeomSubset "front"
    {
        uniform token elementType = "face"
        uniform token familyName = "componentTag"
        int[] indices = [0]
    }

    def GeomSubset "left"
    {
        uniform token elementType = "face"
        uniform token familyName = "componentTag"
        int[] indices = [5]
    }

    def GeomSubset "right"
    {
        uniform token elementType = "face"
        uniform token familyName = "componentTag"
        int[] indices = [4]
    }

    def GeomSubset "top"
    {
        uniform token elementType = "face"
        uniform token familyName = "componentTag"
        int[] indices = [1]
    }
}
  • (*data)->bytes = (type == BT_BOOL || type == BT_TOKEN) ? 1 : 4; looks like code assumes bool has size 1. Do you want code to work when bool is larger?
    – chux, Sep 16, 2024 at 20:41

2 Answers


I would suggest making a string struct and some helper functions to simplify null terminators / reallocation / calls to strlen. This would also make TOKEN_MAX_LEN and MAX_TABS unnecessary. For example:

typedef struct {
    char *v;
    size_t count;
    size_t capacity;
} stringi_raw;
typedef stringi_raw *stringi;

stringi stringi_create();
void stringi_remove(stringi *target);         // free struct
void stringi_resize(stringi target);          // double capacity
void stringi_putchar(stringi target, char c); // append
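A minimal sketch of how those helpers could be implemented (hypothetical code, keeping the buffer null-terminated at all times and omitting malloc failure checks for brevity):

```c
#include <stdlib.h>

typedef struct {
    char *v;
    size_t count;
    size_t capacity;
} stringi_raw;
typedef stringi_raw *stringi;

/* Create an empty string that is always null-terminated. */
stringi stringi_create(void)
{
    stringi s = malloc(sizeof *s);
    s->capacity = 16;
    s->count = 0;
    s->v = malloc(s->capacity);
    s->v[0] = '\0';
    return s;
}

/* Double the backing buffer's capacity. */
void stringi_resize(stringi target)
{
    target->capacity *= 2;
    target->v = realloc(target->v, target->capacity);
}

/* Append one character, keeping the terminator in place. */
void stringi_putchar(stringi target, char c)
{
    if (target->count + 2 > target->capacity)   /* room for c and '\0' */
        stringi_resize(target);
    target->v[target->count++] = c;
    target->v[target->count] = '\0';
}

/* Free both the buffer and the struct, clearing the caller's pointer. */
void stringi_remove(stringi *target)
{
    free((*target)->v);
    free(*target);
    *target = NULL;
}
```

The scanner could then grow tokens one character at a time without any fixed TOKEN_MAX_LEN bound.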

Also, I would rewrite init_data to return a pointer:

static data_t *init_data(BaseType type)
{
    data_t *data = malloc(sizeof(data_t));
    assert(data != NULL);
    data->capacity = 1;
    data->size = 0;
    data->type = type;
    data->bytes = (type == BT_BOOL || type == BT_TOKEN) ? 1 : 4;
    data->memory = malloc(data->bytes);
    return data;
}

The use of NULL to terminate arrays of strings is also a bit confusing to me - you could instead store the length, which would make find more readable. A few other observations:

  • I might remove bytes as a member of data_t, because it can be determined from type.
  • The program does not appear to raise an error when the final token is an opening " without a closing ", which I am not sure is intended.
  • I would rename inp() because it is an intrinsic function in Visual Studio.

The test file is highly appreciated.
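As a sketch of the explicit-length suggestion (hypothetical names; note that this version is 0-based and returns -1 for "not found", unlike the original 1-based find() that returns 0):

```c
#include <stddef.h>
#include <string.h>

/* Token table with an explicit count instead of NULL sentinels. */
static const char *spec_tokens[] = { "def", "over", "class" };
static const size_t spec_token_count = sizeof spec_tokens / sizeof spec_tokens[0];

/* Return the 0-based index of token in tokens, or -1 when absent. */
static int find(const char *token, const char *const tokens[], size_t count)
{
    for (size_t i = 0; i < count; ++i)
        if (strcmp(token, tokens[i]) == 0)
            return (int)i;
    return -1;
}
```

The loop bounds are now visible at the call site, and the sentinel entries disappear from every table.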

\$\endgroup\$
  • Thanks. Yes, I have seen the use of a string struct in which the length of the string is stored alongside the string content. It can be done, indeed, but it remains an implementation detail IMO. But thanks for taking the time to look at it and comment. – Commented Sep 13, 2024 at 7:33

Unfortunately the code doesn't compile - perhaps you have some non-portable assumptions?

gcc-14 -std=c23 -fPIC -gdwarf-4 -g -Wall -Wextra -Wwrite-strings -Wno-parentheses -Wpedantic -Warray-bounds -Wmissing-braces -Wconversion -Wstrict-prototypes -fanalyzer 293675.c -o 293675
293675.c:5: warning: ignoring ‘#pragma clang diagnostic’ [-Wunknown-pragmas]
    5 | #pragma clang diagnostic ignored "-Wunsafe-buffer-usage"
293675.c: In function ‘push_token’:
293675.c:316:24: error: implicit declaration of function ‘max’ [-Wimplicit-function-declaration]
  316 |         new_capacity = max(data->capacity * 2, req_capacity);
      |                        ^~~
293675.c:316:24: warning: conversion to ‘size_t’ {aka ‘long unsigned int’} from ‘int’ may change the sign of the result [-Wsign-conversion]
293675.c: In function ‘read_prim_body’:
293675.c:581:21: warning: conversion to ‘size_t’ {aka ‘long unsigned int’} from ‘int’ may change the sign of the result [-Wsign-conversion]
  581 |                 off += strlen(token.content);
      |                     ^~
293675.c:581:24: warning: conversion from ‘size_t’ {aka ‘long unsigned int’} to ‘int’ may change value [-Wconversion]
  581 |                 off += strlen(token.content);
      |                        ^~~~~~
293675.c: In function ‘main’:
293675.c:735:5: error: implicit declaration of function ‘clock_gettime’ [-Wimplicit-function-declaration]
  735 |     clock_gettime(CLOCK_MONOTONIC, &start);
      |     ^~~~~~~~~~~~~
293675.c:735:19: error: ‘CLOCK_MONOTONIC’ undeclared (first use in this function)
  735 |     clock_gettime(CLOCK_MONOTONIC, &start);
      |                   ^~~~~~~~~~~~~~~
293675.c:735:19: note: each undeclared identifier is reported only once for each function it appears in
293675.c:758:37: warning: conversion from ‘long int’ to ‘double’ may change value [-Wconversion]
  758 |     elapsed = seconds + nanoseconds * 1e-9;
      |                                     ^
293675.c:758:23: warning: conversion from ‘long int’ to ‘double’ may change value [-Wconversion]
  758 |     elapsed = seconds + nanoseconds * 1e-9;

Adding -D_POSIX_C_SOURCE=199309L to the build command adds clock_gettime() to <time.h>, but there's still no max() in scope.

I guessed at the intention:

static size_t max(size_t a, size_t b) { return a < b ? b : a; } 

After adding that, the next problem was an assertion failure:

293675: 293675.c:703: main: Assertion `file != NULL' failed.
==2889477==
==2889477== Process terminating with default action of signal 6 (SIGABRT): dumping core
==2889477==    at 0x49093AC: __pthread_kill_implementation (pthread_kill.c:44)
==2889477==    by 0x48BA4F1: raise (raise.c:26)
==2889477==    by 0x48A34EC: abort (abort.c:79)
==2889477==    by 0x48A3414: __assert_fail_base.cold (assert.c:94)
==2889477==    by 0x48B3011: __assert_fail (assert.c:103)
==2889477==    by 0x10AF26: main (293675.c:703)

That comes from the first of these unjustifiable assertions:

    if (argc-- > 1) {
        file = fopen(argv[1], "r");
    }
    assert(file != NULL);

    root = (prim_t*)malloc(sizeof(prim_t));
    assert(root != NULL);

fopen() and malloc() are both specifically documented as returning null pointers in particular circumstances, so those assertions are wrong, and the code needs to deal with such values (perhaps by aborting, but preferably by emitting a message to stderr and returning EXIT_FAILURE).
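For the fopen() case, a sketch of that preferred shape might look like this (hypothetical helper name; the caller then returns EXIT_FAILURE instead of letting assert() abort - and remember that asserts vanish entirely under -DNDEBUG):

```c
#include <stdio.h>

/* Report an unopenable input on stderr and return NULL, so the caller
   can exit cleanly instead of aborting inside an assert(). */
static FILE* open_input(const char* path)
{
    FILE* file = fopen(path, "r");
    if (file == NULL)
        fprintf(stderr, "readgeo: cannot open '%s'\n", path);
    return file;
}
```

In main() that becomes: if ((file = open_input(argv[1])) == NULL) return EXIT_FAILURE; and the malloc() result deserves the same treatment.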

When I changed to pass a filename as argument, it ran to completion, but leaked a substantial amount of memory:

==2891855== HEAP SUMMARY:
==2891855==     in use at exit: 8,736 bytes in 42 blocks
==2891855==   total heap usage: 73 allocs, 31 frees, 18,648 bytes allocated
==2891855==
==2891855== 3,216 (296 direct, 2,920 indirect) bytes in 1 blocks are definitely lost in loss record 4 of 6
==2891855==    at 0x4842808: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==2891855==    by 0x10ABB5: read_prim_body (293675.c:628)
==2891855==    by 0x10AE7F: read_prim (293675.c:686)
==2891855==    by 0x10B03A: main (293675.c:748)
==2891855==
==2891855== 5,520 (1,776 direct, 3,744 indirect) bytes in 6 blocks are definitely lost in loss record 6 of 6
==2891855==    at 0x4842808: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==2891855==    by 0x10ABB5: read_prim_body (293675.c:628)
==2891855==    by 0x10AE7F: read_prim (293675.c:686)
==2891855==    by 0x10A857: read_prim_body (293675.c:538)
==2891855==    by 0x10AE7F: read_prim (293675.c:686)
==2891855==    by 0x10B03A: main (293675.c:748)

It's clear that the teardown code is missing some free()s.


Having done some basic experimentation, let's start reading the source:

typedef struct … data_t; 

Beware that in POSIX environments, names ending in _t are reserved for library expansion, so avoid them in user code.

static prim_t* root;
static int ch;
static FILE* file;
static int line;
static int col;

These globals are harmful - they appear to prevent the use of more than one of these parsers at a time. If nothing else, that causes all your unit tests to interact with each other unless you run them one at a time and reset the state after each one.

col = (ch == '\t') ? col + 4 : col + 1; 

That's a very strange interpretation of a tab - normally, tab advances to the next multiple of 8 rather than representing a fixed number of spaces. Moreover, the only use of this variable is in the error reporting, where consumers expect the character position in the line, not display position.
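If display position is really what is wanted, the usual behaviour can be sketched as (hypothetical helper, assuming 8-column tab stops and a 0-based column):

```c
/* Advance a 0-based column past a '\t' to the next tab stop
   (multiples of 8), matching how terminals typically render tabs. */
static int next_tab_stop(int col)
{
    return (col / 8 + 1) * 8;
}
```

A tab at column 0 lands on column 8, but one at column 7 also lands on 8 - not 7 + 4 as the current code computes.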

free_alloc() is a misleading name. It seems to print as well as to free.

Here's another assertion that seems to be invented rather than provable:

 assert(l < MAX_TABS); 

This looks an unwieldy way to print a given number of tab characters:

char tabs[MAX_TABS];
memset(tabs, '\t', l);
tabs[(l == 0) ? 0 : l] = '\0';
printf("%s\n", tabs);

Even ignoring the weird (l == 0) ? 0 : l where plain l would be sufficient, this whole business can be replaced with a fixed-width formatting of a longer string:

printf("%.*s\n", (int)l, "\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t"); 

This also has the advantage of falling back to a maximal string when l is longer than the largest expected length.

The test here is unnecessary:

 if (attr->data->memory != NULL) free(attr->data->memory); 

Since free() already does nothing when passed a null pointer, we should simply free(attr->data->memory) without the test.

I don't like this memset():

 memset(attr->data, 0x0, sizeof(data_t)); 

Firstly, it requires me to go and check that the type of attr->data matches data_t, whereas if we'd used sizeof *attr->data the correspondence would be obvious and automatic. Secondly, data_t contains a pointer member, so setting it to all-zeros is implementation-specific (all-bits-zero is not necessarily a null pointer).

The die() function seems to be the handler for malformed input, and sometimes for allocation failure. So it seems wrong to abort() (which produces a core dump in my environment, for example). For something as banal as a syntax error in input, exit(EXIT_FAILURE) would be more appropriate.

Also consider the de-facto standard format for error messages, beginning with the filename and line number, to make it easier for users to go straight to them with the usual tools:

 fprintf(stderr, "%s:" "%u:" "%u: %s (last char read: %c)\n", file, line, col, err, ch); 

This allocation is poor:

*data = (void*)malloc(sizeof(data_t));
(*data)->capacity = 16; /* 16x int, or 16x (float, float, float) if vec3f for instance */

There's no need to cast the result of malloc() since it returns void* and that's assignable to any kind of object pointer. Also, like for memset(), it's better to match it to the size of the target, rather than manually writing its type:

*data = malloc(sizeof *data); 

And before we dereference it, we absolutely need to handle a null pointer return.
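Putting those points together, an allocation helper might look like this - a sketch only, with a hypothetical `make_data` and a stand-in `data_t`, where failure is reported to the caller rather than dereferencing an unchecked pointer:

```c
#include <stdlib.h>

/* Hypothetical stand-in for the reviewed data_t. */
typedef struct {
    void*  memory;
    size_t size;
    size_t capacity;
    size_t bytes;
} data_t;

/* Allocate and initialize a data_t; returns NULL on failure. */
static data_t* make_data(size_t bytes)
{
    data_t* data = malloc(sizeof *data);  /* no cast; sized from the target */
    if (data == NULL)
        return NULL;                      /* caller decides how to report */
    *data = (data_t){0};
    data->capacity = 16;
    data->bytes = bytes;
    return data;
}
```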

We have much better handling here:

    if (data->size + 1 > data->capacity) {
        temp = (void*)realloc(data->memory, data->capacity * 2 * data->bytes);
        if (temp == NULL)
            die("can't allocate memory", __LINE__);
        data->memory = temp;
        data->capacity *= 2;
    }

Apart from the needless cast, this is almost the idiom of realloc(). It is somewhat unclear why we multiply capacity by 2 * bytes rather than by a constant such as 2 - we only need to take bytes into account for the initial capacity, I think.
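A growth routine along those lines might look like the following sketch, where `ensure_capacity` is a hypothetical helper: capacity counts elements, doubles on growth, and `bytes` scales only the allocation size:

```c
#include <stdlib.h>

/* Hypothetical stand-in for the reviewed data_t. */
typedef struct {
    void*  memory;
    size_t size;      /* elements in use */
    size_t capacity;  /* elements allocated */
    size_t bytes;     /* size of one element */
} data_t;

/* Grow to hold at least `wanted` elements; returns 0 on failure,
 * leaving the old buffer intact (the realloc() idiom). */
static int ensure_capacity(data_t* data, size_t wanted)
{
    if (wanted <= data->capacity)
        return 1;
    size_t new_cap = data->capacity ? data->capacity * 2 : 16;
    while (new_cap < wanted)
        new_cap *= 2;
    void* temp = realloc(data->memory, new_cap * data->bytes);
    if (temp == NULL)
        return 0;                 /* old buffer still valid and owned */
    data->memory = temp;
    data->capacity = new_cap;
    return 1;
}
```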

These conversions of void* member memory wouldn't be necessary if we changed its type to char*:

    memcpy((char*)data->memory + data->size, token, size);
    memset((char*)data->memory + data->size + size, '\0', 1);

There's no good reason for it to be void*.

Here's another unjustified assert:

    static void conv_push_bool(data_t* data, const char* const token)
    {
        bool b;
        assert(strlen(token) != 0 && (token[0] == '0' || token[0] == '1'));
        b = (token[0] == '0') ? false : true;
        push(data, &b);
    }

Since token comes from input, we can't be sure it begins with 0 or 1. We need to test that.

The expression (token[0] == '0') ? false : true looks to be a very cumbersome way to write token[0] != '0'.

And do we really want to accept everything that merely begins with 0 or 1 as a valid boolean value? I think we need to do more checking and reporting.
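A validating version could match the whole token and report failure to the caller, as in this sketch (where `parse_bool` is a hypothetical helper, and the accepted spellings - exactly "0" or "1" - are an assumption about the format):

```c
#include <stdbool.h>
#include <string.h>

/* Accept exactly "0" or "1" (the full token, not just its first
 * character); return 0 for malformed input so the caller can call
 * die() or otherwise report it. */
static int parse_bool(const char* token, bool* out)
{
    if (strcmp(token, "1") == 0) { *out = true;  return 1; }
    if (strcmp(token, "0") == 0) { *out = false; return 1; }
    return 0;
}
```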

Similarly, these functions accept any old rubbish, resulting in zero value for invalid input:

    static void conv_push_int(data_t* data, const char* const token)
    {
        int a;
        a = atoi(token);
        push(data, &a);
    }

    static void conv_push_float(data_t* data, const char* const token)
    {
        float f;
        f = strtof(token, NULL);
        push(data, &f);
    }
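Using strtol() with its end pointer and errno lets us reject trailing garbage, empty tokens and out-of-range values. A sketch of a checked integer conversion (`parse_int` is a hypothetical helper; the same pattern applies to strtof()):

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Checked replacement for atoi(): returns 0 on malformed or
 * out-of-range input, 1 on success. */
static int parse_int(const char* token, int* out)
{
    char* end;
    errno = 0;
    long v = strtol(token, &end, 10);
    if (end == token || *end != '\0')   /* nothing parsed, or junk after */
        return 0;
    if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return 0;
    *out = (int)v;
    return 1;
}
```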

The read_array() functions contain the following pattern:

while (1) { if (condition) break; } 

That would be more naturally expressed using do-while.
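For illustration, here is the shape of that transformation on a small, self-contained example (a hypothetical scanner fragment, not the reviewed read_array() itself):

```c
#include <stddef.h>

/* Consume characters up to and including a stop character, written
 * with do-while rather than while(1) { ... if (...) break; }. */
static size_t scan_until(const char* s, char stop)
{
    size_t i = 0;
    char c;
    do {
        c = s[i++];
    } while (c != stop && c != '\0');
    return i;  /* characters consumed, including the stop char */
}
```

The loop body runs at least once either way; do-while states the termination condition exactly once, in one place.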

\$\endgroup\$
  • \$\begingroup\$Sorry this is incomplete - that's as far as I got before running out of time.\$\endgroup\$ – Commented Sep 14, 2024 at 10:24
  • \$\begingroup\$Thanks, these are very good comments and much appreciated. I learned quite a few things. Your comments are mostly C-related, which is great, but I would have also liked more general comments regarding the way I handle scanning and type erasure. Still, your feedback is great and I appreciate the time (I compiled on Windows, so it's true I didn't check on a Unix-type env).\$\endgroup\$ – Commented Sep 14, 2024 at 12:47
  • \$\begingroup\$Can you cite a specific paragraph for that claim? C23 has §6.3.2.3.3: "An integer constant expression with the value 0, [or] such an expression cast to type void *" can be converted to a null pointer. But I can't find anything that specifies that memory with all bits zero, when interpreted as a pointer, must result in a null pointer.\$\endgroup\$ – Commented Sep 16, 2024 at 14:00
  • \$\begingroup\$"guarantees that a pointer set to NULL is represented by all bits zero." is very common yet not specified by C. p = NULL; may set the pointer to non-zero bits even if NULL is 0. Avoid assuming the integer bit pattern is the same as the pointer bit pattern. In other words, a null pointer's bit pattern (of which there may be many) is an implementation detail.\$\endgroup\$ – chux, Commented Sep 16, 2024 at 20:19
  • \$\begingroup\$@chux, your wording "even if NULL is 0" is possibly a red herring. NULL is either an integer 0 or that integer converted to void*, but the point is that C guarantees the conversion from 0 to pointer always results in a null pointer even if all-bits-zero isn't naturally a null pointer on the target. Also that reinterpreting memory doesn't make the same guarantee (it can't, because all-bits-zero may represent a valid, non-null, pointer).\$\endgroup\$ – Commented Sep 20, 2024 at 7:15
