5
\$\begingroup\$

This is a Python script I wrote to identify the country code of a given IP address, using data obtained from tor.

It uses geoip.txt to identify country code for IPv4 addresses, and geoip6.txt to do so for IPv6 addresses.

This is done by converting the address to an integer, then find the index of the starting IP address of IP ranges for that IP type closest to the integer, by binary search using bisect.bisect. Then check the ending IP address of IP ranges located at that index, if the integer is no greater than the ending IP address, return the corresponding country code located at the index.

I wrote the script both as a programming challenge and as a library to be used by later scripts. I have specifically chosen not to use ipaddress module, instead I wrote custom functions to convert IPv4 and IPv6 to int and back. And I have benchmarked my code against ipaddress module, and I found my code to be more time-efficient.

The code:

import re from bisect import bisect MAX_IPV4 = 2**32-1 MAX_IPV6 = 2**128-1 DIGITS = set('0123456789abcdef') le255 = '(25[0-5]|2[0-4]\d|[01]?\d\d?)' IPV4_PATTERN = re.compile(f'^({le255}\.){{3}}{le255}$') EMPTY = re.compile(r':?\b(?:0\b:?)+') def parse_ipv4(ip: str) -> int: assert isinstance(ip, str) and IPV4_PATTERN.match(ip) a, b, c, d = ip.split('.') return (int(a) << 24) + (int(b) << 16) + (int(c) << 8) + int(d) def to_ipv4(n: int) -> str: assert isinstance(n, int) and 0 <= n <= MAX_IPV4 return ".".join(str(n >> i & 255) for i in range(24, -1, -8)) def parse_ipv6(ip: str) -> int: assert isinstance(ip, str) and len(ip) <= 39 segments = ip.lower().split(":") l, n, p, fields, compressed = len(segments), 0, 7, 0, False last = l - 1 for i, s in enumerate(segments): assert fields <= 8 and len(s) <= 4 and not set(s) - DIGITS if not s: if i in (0, last): continue assert not compressed p = l - i - 2 compressed = True else: n += int(s, 16) << p*16 p -= 1 fields += 1 return n def to_ipv6(n: int, compress: bool = False) -> str: assert isinstance(n, int) and 0 <= n <= MAX_IPV6 ip = '{:032_x}'.format(n).replace('_', ':') if compress: ip = ':'.join(s.lstrip('0') if s != '0000' else '0' for s in ip.split(':')) longest = max(EMPTY.findall(ip)) if len(longest) > 2: ip = ip.replace(longest, '::', 1) return ip def parse_entry4(e: str) -> tuple: a, b, c = e.split(",") return (int(a), int(b), c) def parse_entry6(e: str) -> tuple: a, b, c = e.split(",") return (parse_ipv6(a), parse_ipv6(b), c) with open("D:/network_guard/geoip.txt", "r") as file: data4 = list(map(parse_entry4, file.read().splitlines())) starts4, ends4, countries4 = zip(*data4) with open("D:/network_guard/geoip6.txt", "r") as file: data6 = list(map(parse_entry6, file.read().splitlines())) starts6, ends6, countries6 = zip(*data6) class IP: parse = [parse_ipv4, parse_ipv6] starts = [starts4, starts6] ends = [ends4, ends6] countries = [countries4, countries6] def geoip_country(ip: str, mode: int=0) -> str: assert mode in {0, 1} n = IP.parse[mode](ip) if not (i := bisect(IP.starts[mode], n)): return False i -= 1 return False if n > IP.ends[mode][i] else IP.countries[mode][i] if __name__ == '__main__': ipv6s = [ '2404:6800:4003:c03::88', '2404:6800:4004:80f::200e', '2404:6800:4006:802::200e', '2607:f8b0:4004:800::200e', '2607:f8b0:4005:801::200e', '2607:f8b0:4006:81c::200e', '2607:f8b0:4006:822::200e', '2607:f8b0:4006:823::200e', '2607:f8b0:4006:824::200e', '2607:f8b0:400a:809::200e', '2800:3f0:4001:80b::200e', '2a00:1450:400b:c01::be' ] ipv4s = [ '74.125.24.93', '74.125.24.136', '74.125.24.190', '74.125.68.91', '74.125.68.93', '74.125.68.136', '74.125.193.91', '74.125.193.93', '74.125.193.136', '74.125.193.190', '74.125.200.91', '142.250.64.78' ] for ipv4 in ipv4s: n = parse_ipv4(ipv4) print(ipv4, n, to_ipv4(n), geoip_country(ipv4)) for ipv6 in ipv6s: n = parse_ipv6(ipv6) print(ipv6, n, to_ipv6(n, 1), geoip_country(ipv6, 1)) 

Output:

74.125.24.93 1249712221 74.125.24.93 US 74.125.24.136 1249712264 74.125.24.136 US 74.125.24.190 1249712318 74.125.24.190 US 74.125.68.91 1249723483 74.125.68.91 US 74.125.68.93 1249723485 74.125.68.93 US 74.125.68.136 1249723528 74.125.68.136 US 74.125.193.91 1249755483 74.125.193.91 US 74.125.193.93 1249755485 74.125.193.93 US 74.125.193.136 1249755528 74.125.193.136 US 74.125.193.190 1249755582 74.125.193.190 US 74.125.200.91 1249757275 74.125.200.91 US 142.250.64.78 2398765134 142.250.64.78 US 2404:6800:4003:c03::88 47875086426100614638538221612324356232 2404:6800:4003:c03::88 AU 2404:6800:4004:80f::200e 47875086426101804896252833647432835086 2404:6800:4004:80f::200e AU 2404:6800:4006:802::200e 47875086426104222508084389947558076430 2404:6800:4006:802::200e AU 2607:f8b0:4004:800::200e 50552053919386769199309343019258355726 2607:f8b0:4004:800::200e US 2607:f8b0:4005:801::200e 50552053919387978143575701722142613518 2607:f8b0:4005:801::200e US 2607:f8b0:4006:81c::200e 50552053919389187567457406341475213326 2607:f8b0:4006:81c::200e US 2607:f8b0:4006:822::200e 50552053919389187678137870783732523022 2607:f8b0:4006:822::200e US 2607:f8b0:4006:823::200e 50552053919389187696584614857442074638 2607:f8b0:4006:823::200e US 2607:f8b0:4006:824::200e 50552053919389187715031358931151626254 2607:f8b0:4006:824::200e US 2607:f8b0:400a:809::200e 50552053919394022920247727457692557326 2607:f8b0:400a:809::200e US 2800:3f0:4001:80b::200e 53169199713192736830836323499043201038 2800:3f0:4001:80b::200e AR 2a00:1450:400b:c01::be 55827987829231936335941766789076091070 2a00:1450:400b:c01::be IE 

It is indeed working, but I want my code to be more concise and efficient, and more Pythonic. How can I do so?

(P.S. I also generated a version with documentation using mintlify just for the gigs, I am not responsible for most of the docs, I just pressed CTRL+. in Visual Studio Code, but I edited here and there. As it is way too verbose I uploaded it to Google Drive)

\$\endgroup\$

    1 Answer 1

    4
    \$\begingroup\$

    The syntax for :: IPv6 addresses can be a bit involved. You chose to go it alone, to avoid using an existing well-tested parsing library. I predict you will want to revisit that choice.


    We see that a Public API has been defined here. But it is not documented, beyond annotated signatures.

    We need to see """docstrings""" on these.

    Cite your references. Perhaps there is an RFC you strive to conform to?


    IPV4_PATTERN = re.compile(f'^({le256}\.){{3}}{le256}$') 

    This is very nice, kudos on DRYing it up.

    le256 = '(25[0-5]|2[0-4]\d|[01]?\d\d?)' 

    This "less than or equal to 256", OTOH, impresses me as the Wrong approach.

    Use the right tool for the job. Parse out digits with a regex, and then evaluate whether integer A <= B using other language facilities.

    The name is definitely wrong and should be fixed -- it appears that Author's Intent was lt256 (or perhaps le255).

    The DIGITS identifier is kind of OK. But consider renaming it to HEXDIGITS.


    I'm a little sad that parse_ipv4 and to_ipv4 do not have parallel structure. That is, the parse is accomplished with a manually unrolled loop, while the inverse is a bit more elegant due to the explicit loop.

    A single assert, executed four times, would be a natural way to express that we want a 0 <= n < 256 byte value.

    The type annotations are lovely; I thank you.


    Right, let's parse an IPv6!

     l, n, p, fields, compressed = len(segments), 0, 7, 0, False 

    No, please don't do that. Yes, it is shorter. No, it does not help the Gentle Reader understand what's going on. Saving four SLOC just isn't worth it.

    Also, l is clearly "length", but I don't know how I'm supposed to mentally pronounce p. Maybe rename it to pos for "position"? Could we perhaps find a more specific name?

    Pep-8 advises you to choose a different spelling for your length variable.

    Repeatedly asserting not set(s) - DIGITS works. But it seems more convenient to do that character set validation just once on the whole string up front.

    Consider renaming s to seg.

    Overall, this just isn't a very nice function to work with. It is doing more than one thing, and those things are interleaved across loop iterations.

    Consider reworking this function to split work into three phases:

    1. check there's only valid characters
    2. normalize, so there's no :: abbreviations
    3. convert to integer

    As written, passing in the "" empty string returns zero. I guess that is kind of correct? But I am reluctant to say the empty string is a valid IPv6 address. In any event it certainly isn't the all-zeros broadcast address, since there isn't one, with ff02::1 multicasts taking on that role.


    When converting back to string, it's not clear that the anti-pattern of checking isinstance(n, int) really does anything for us. Duck typing suffices, given that we immediately compare against zero and then use a hex format. The type annotated signature is spelling this out very clearly for mypy and for humans.

    Consider embracing the duck, and removing isinstance from four functions.

     ip = ':'.join(s.lstrip('0') if s != '0000' else '0' for s in ip.split(':')) 

    No, this expression is not maintainable. It is correct, sure, but that's uninteresting. Break out a named helper, which can be called by a unit test.

    This appears to produce the wrong answer, one which is too short: to_ipv6(parse_ipv6(""))

    It seems that Author's Intent was a format string of '{:039_x}'


    The entry{4,6} parsing belongs in a separate module, as those functions are about a separate concept. We have moved from generic IP addresses to GeoIP-specific file formats.

    with open("D:/network_guard/geoip.txt", ... 

    Avoid doing such work at a module's toplevel. Why? So a unit test doing an import of this code cannot fail with FileNotFoundError. Prefer to bury this code in a function that can be called when needed. You (very nicely) wrote a __main__ guard with similar motivation, to avoid side effects at import time.


    class IP: 

    This is a great class name. Consider creating IPv4 and IPv6 classes which conform to a contract specified by an abstract base class.


    def geoip_country(ip: str, mode: int=0) -> str: 

    The mode identifier is rather vague. The intent here is ip_type or ip_version.

    Given that we won't be using the integers 4 or 6, this should probably be an enum.

     return False 

    No. You just told me, a few lines up, that you shall return str, and bool is not a string.

    Also, prefer to return None instead of False, if for some reason you feel "" empty string is not a good match for your use case.

    So the signature would end with: ...) -> Optional[str]:


     print(ipv6, n, to_ipv6(n, 1), geoip_country(ipv6, 1)) 

    No.

    You told me in the signature that callers should instead say this:

     print(ipv6, n, to_ipv6(n, True), geoip_country(ipv6, 1)) 

    This codebase achieves a subset of its design goals. It works correctly on a portion of the input space. It is less well tested and documented than competing libraries.

    This codebase curiously avoids a uniform representation of an IP address as a byte vector, preferring to work with strings and large integers.

    I would be willing to accept, but not delegate, maintenance tasks on this codebase in its current form. Things that should be addressed sooner than later are

    • lack of docstrings
    • lack of unit tests
    • cryptic identifiers

    I just took a look at the mintlify output. It's a utility I had never heard of.

    The text it generates is just terrible, similar to

     i += 1 # increment the value of index i by exactly one 

    Recommend that you not use it in coding projects. Write a single sentence for each docstring, optionally followed by blank line and a paragraph.

    Do consider using doctest, where you append ">>> some(expression)" and its result.

    The trouble with comments is that sometimes they lie, they get out of sync with evolving code edits. The brilliant thing about doctest comments is they verifiably tell the truth, even after edits.

    \$\endgroup\$

      Start asking to get answers

      Find the answer to your question by asking.

      Ask question

      Explore related questions

      See similar questions with these tags.