7
\$\begingroup\$

Trying to write fast non dependency IP4 string validator for embedded system, It should counts leading zeros, values over 255 and other corrupted strings. Not sure if I missed something. Сan you guys take a look:

// Return 0 on error, store result to segs, if segs != 0. int is_valid_ip4(const char *str, unsigned char* segs) { int dots = 0; // Dot count. int chcnt = 0; // Character count within segment. int accum = 0; // Accumulator for segment. // Catch NULL pointer. if (str == 0) return 0; // Process every character in string. for (;;) { // Check numeric. if (*str < '0' || *str > '9') return 0; // Prevent leading zeros. if (chcnt > 0 && accum == 0) return 0; // Accumulate and check segment. accum = accum * 10 + *str - '0'; if (accum > 255) return 0; // Store result. if (segs) segs[dots] = (unsigned char)accum; // Advance other segment specific stuff. ++chcnt; ++str; // Finish at the end of the string when 3 points are found. if (*str == '\0' && dots == 3) return 1; // Segment changeover. if (*str == '.') { // Limit number of segments. ++dots; if (dots == 4) return 0; // Reset segment values. chcnt = 0; accum = 0; ++str; } } } 
\$\endgroup\$
7
  • 10
    \$\begingroup\$C or C++? They are different languages with very different programming idioms and techniques.\$\endgroup\$
    – L. F.
    CommentedAug 1, 2019 at 9:31
  • \$\begingroup\$The whole project has written on cpp, however, there are not too much space in MCU Flash/RAM memory, so I can’t use std::strings, std::vectors etc. I agree, for clarity, it is better to remove the cpp tag, thank you!\$\endgroup\$CommentedAug 1, 2019 at 15:47
  • 4
    \$\begingroup\$Any reason for not using inet_pton and friends?\$\endgroup\$
    – vnp
    CommentedAug 1, 2019 at 16:02
  • 1
    \$\begingroup\$@VladimirGamalyan Oh I missed it, that's rather obscure. I would suggest you make an IPv4Address struct, or something to that effect, and rename this function to something like IPv4Address_new()\$\endgroup\$
    – Alexander
    CommentedAug 1, 2019 at 16:03
  • 1
    \$\begingroup\$Next time you should post your test suite as well so that we can immediately see which strings are accepted and which are not.\$\endgroup\$CommentedAug 2, 2019 at 15:18

3 Answers 3

9
\$\begingroup\$

<stdbool.h>

Unless you need compatibility with C89 for some reason, I would use bool for the return type of the function and true and false as the possible values.

It helps readability.


sscanf()

This one depends on your performance needs. A solution using sscanf() will be much easier to understand with a simple look and also shorter in code, but may also be much slower (a benchmark would be appropriate).

Update: I would discard sscanf() because it would accept spaces before the numbers; better use inet_pton() as @vnp suggested.


<stdint.h>

Same as with bool: segs seems at first glance to be some string, but after some time I realized it's just an array of 8-bit unsigned integers. You should use uint8_t for that.

In case of <stdint.h> I would say that if you don't have it, you should do the typedefs yourself (enclosed in some #if) anyway. Don't do that with bool, however, which is very dangerous.


Unneeded cast

unsigned char *p; int i; ... *p = (unsigned char)i; 

This cast is not needed.

Usually casts are very dangerous: they can hide bugs that otherwise the compiler would catch easily. Don't ever cast, unless you know a very good reason to.


BUG! (No, I was wrong)

// Check numeric. if (*str < '0' || *str > '9') return 0; 

This line returns 0 at the first dot it finds. The function doesn't work!

This if that comes later is always false due to that:

// Segment changeover. if (*str == '.') { /// Unreachable code!!! } 

<ctype.h>

There are some functions in <ctype.h> that you should use:

// Check numeric. if (*str < '0' || *str > '9') return 0; 

That code above can be simplified to this one (and the comment is now even more obvious and therefore removed ;-):

if (!isdigit((unsigned char)*str)) return 0; 

NOTES: The standards require that the argument c for these functions is either EOF or a value that is representable in the type unsigned char. If the argument c is of type char, it must be cast to unsigned char.

This is necessary because char may be the equivalent of signed char, in which case a byte where the top bit is set would be sign extended when converting to int, yielding a value that is outside the range of unsigned char.

See man isdigit.

If you're going to use these functions, I would create an unsigned char pointer to alias the char * parameter, so that you don't need to cast all the time.

\$\endgroup\$
5
  • \$\begingroup\$Thank you! I added recommended changes to my project. BTW - it is intentionally to check only digits at the very begin of loop, because valid IP strings cannot begin with a dot, and, when we see the digits, we move string position forward (++str), where we can see the dot (after which there should be digit on the next iteration of the loop).\$\endgroup\$CommentedAug 1, 2019 at 15:55
  • \$\begingroup\$@VladimirGamalyan But at some iteration you will want to find a dot,and you won't. Try the function yourself and see.\$\endgroup\$CommentedAug 1, 2019 at 16:53
  • \$\begingroup\$I agree with Vladimir: after a digit is found, the string pointer gets advanced; if the next character is a dot, the string pointer is advanced again. This seems good to me as it doesn't allow a dot at the start (just as it shouldn't) but catches dots after digits.\$\endgroup\$CommentedAug 1, 2019 at 18:03
  • \$\begingroup\$@Guntram @Vladimir Oh, now I see it. The iterations where the pointer would point to the dots are being skipped. I would use inet_pton() as @vnp mentioned in a comment, or use sscanf(); any of them would be much more readable.\$\endgroup\$CommentedAug 1, 2019 at 18:48
  • 2
    \$\begingroup\$@VladimirGamalyan I would say however, that a forever loop isn't the most readable way to write this. Your intentions are not a forever loop, so you should avoid the form of a forever loop.\$\endgroup\$CommentedAug 1, 2019 at 18:50
5
\$\begingroup\$

I'll flag another issue that wasn't pointed out: it seems that the output array of segs (maybe rename it to segments?) is assumed to be allocated by the caller. Since there is no comment that describes that it needs to be of size 4, then the caller might send a smaller array and then the segs[dots] access would seg fault in the best case and overwrite memory silently in the worst case. I would either document that requirement or allocate the memory internally and ask the caller to delete[] it.

\$\endgroup\$
1
  • 1
    \$\begingroup\$uint8_t segs[static restrict 4] would be a good way to document this :)\$\endgroup\$CommentedAug 2, 2019 at 21:09
4
\$\begingroup\$

The first thing to improve the readability of the code is to define proper data types. Having an unsigned char * is confusing, both to humans and to the compiler.

#include <stdint.h> typedef struct { uint8_t part[4]; } ip4_address; 

Next, I found your code hard to follow because of the forever loop and the many if clauses. You do have a point though because you mention '0' and '9' and '.' only once in the code, which is good.

One thing I don't like is that you write unfinished parsing results to the returned IPv4 address. For example, when parsing 123.456, the returned IPv4 address would be 123.45.?.?. It's just the unnecessary memory access that I don't like. I prefer code that keeps the intermediate values in local variables and only writes them back when it is completely calculated.

I find it much simpler to read if the code is grouped by things from the application domain, which in this case is numbers and dots. I would write it like this:

#include <stdbool.h> #include <stdint.h> #include <stdio.h> typedef struct { uint8_t part[4]; } ip4_address; bool is_valid_ip4(const char *str, ip4_address *addr) { if (!str) return false; for (int i = 0; i < 4; i++) { if (!(*str >= '1' && *str <= '9')) return false; unsigned n = *str++ - '0'; if (*str >= '0' && *str <= '9') { n = 10 * n + *str++ - '0'; if (*str >= '0' && *str <= '9') { n = 10 * n + *str++ - '0'; if (n > 255) return false; } } addr->part[i] = n; if (i == 3) return *str == '\0'; if (*str++ != '.') return false; } return false; } int main(int argc, char **argv) { ip4_address addr; if (1 < argc && is_valid_ip4(argv[1], &addr)) { printf("%d.%d.%d.%d\n", addr.part[0], addr.part[1], addr.part[2], addr.part[3]); } else return 1; } 

An IPv4 address, in the usual syntax, consists of 4 numbers. This is expressed by the for loop that contains exactly 4 iterations.

An IPv4 address must start with a number. Therefore the upper part of the for loop is concerned with parsing and validating this number. The lower part handles the dots.

Organizing the code like this produces fewer jumps when stepping through it, and as a result, the execution position is simple to follow. Inside the for loop, it just goes straight from the top to the bottom. In your code, on the other hand, it jumps around much more often, which makes the code hard to follow.

Sure, my code repeats the character constants more often and is therefore more susceptible to typos, but finding these is the job of the test suite.

I took care of only defining the minimum necessary variables since keeping track of changing variable values is difficult.

As in your code, I explicitly test against the digits 0 to 9 instead of calling isdigit, just in case the code runs with an Arabic execution character set that defines some more digits. IPv4 addresses, according to the RFC, are written in ASCII digits.

I didn't write any comments in my code because I think that its structure is so simple and follows common practice that there is nothing surprising to it. By contrast, you wrote many comments, most of which are redundant.

\$\endgroup\$
3
  • \$\begingroup\$Thank you! I like this shortened version! If correctly understood, leading zeros are not taken into account here? (like 192.168.1.001)\$\endgroup\$CommentedAug 3, 2019 at 14:11
  • 1
    \$\begingroup\$Oops, that was a mistake on my side. I fixed it.\$\endgroup\$CommentedAug 4, 2019 at 8:37
  • \$\begingroup\$I used your code with slight modification -- reverted first condition to *str >= '0' && *str <= '9' and added if (!n) return false; after second condition. Otherwise lines with only zero in octets (192.168.0.1) do not pass the test. Thank you!\$\endgroup\$CommentedAug 8, 2019 at 4:16

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.