This is a Python script I wrote to identify the country code of a given IP address, using data obtained from Tor.
It uses geoip.txt to look up the country code for IPv4 addresses, and geoip6.txt to do the same for IPv6 addresses.
This is done by converting the address to an integer, then using a binary search (bisect.bisect) over the starting addresses of the IP ranges for that address type to find the last range that starts at or below the integer. The ending address of the range at that index is then checked: if the integer is no greater than the ending address, the country code stored at that index is returned.
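The lookup described above can be sketched in isolation as follows. This is a minimal illustration, not the script itself: the RANGES table and the lookup name are hypothetical, and the integers and country codes are made up rather than taken from geoip.txt.

```python
from bisect import bisect

# Hypothetical, hand-made ranges (NOT real geoip.txt data): (start, end, country).
# The ranges are sorted by start and do not overlap.
RANGES = [
    (10, 19, "AA"),
    (30, 39, "BB"),
    (40, 49, "CC"),
]
STARTS, ENDS, COUNTRIES = zip(*RANGES)


def lookup(n: int):
    """Return the country code of the range containing n, or None if there is none."""
    # bisect (bisect_right) returns the insertion point after any equal start,
    # so the candidate range is the one just before that point.
    i = bisect(STARTS, n) - 1
    if i < 0 or n > ENDS[i]:
        # n is below the first start, or falls in a gap between two ranges.
        return None
    return COUNTRIES[i]


print(lookup(10), lookup(25), lookup(40))
```

Returning None for a miss is a choice made for this sketch; the script in the post returns False in that case.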
I wrote the script both as a programming challenge and as a library to be used by later scripts. I specifically chose not to use the ipaddress module; instead, I wrote custom functions to convert IPv4 and IPv6 addresses to int and back. I benchmarked my code against the ipaddress module and found mine to be more time-efficient.
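For anyone who wants to reproduce this kind of comparison, here is a rough sketch of one way to cross-check and time an IPv4 conversion against the standard library. It is not the benchmark I actually ran: parse_ipv4 is repeated here without its validation so the snippet is self-contained, parse_ipv4_stdlib is a helper name made up for this example, and the timings will vary by machine.

```python
import ipaddress
import timeit


def parse_ipv4(ip: str) -> int:
    """Dotted-quad string to int (same arithmetic as in the script, no validation)."""
    a, b, c, d = ip.split(".")
    return (int(a) << 24) + (int(b) << 16) + (int(c) << 8) + int(d)


def parse_ipv4_stdlib(ip: str) -> int:
    """Hypothetical helper: the same conversion done by the ipaddress module."""
    return int(ipaddress.IPv4Address(ip))


SAMPLE = "74.125.24.93"

# Both conversions should agree before their speed is compared.
assert parse_ipv4(SAMPLE) == parse_ipv4_stdlib(SAMPLE)

for fn in (parse_ipv4, parse_ipv4_stdlib):
    seconds = timeit.timeit(lambda: fn(SAMPLE), number=10_000)
    print(f"{fn.__name__}: {seconds:.4f} s per 10,000 calls")
```

Note that ipaddress.IPv4Address also validates its input, so the comparison is only fair if validation is either included on both sides or left out on both.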
The code:
import re
from bisect import bisect

MAX_IPV4 = 2**32-1
MAX_IPV6 = 2**128-1
DIGITS = set('0123456789abcdef')
le255 = '(25[0-5]|2[0-4]\d|[01]?\d\d?)'
IPV4_PATTERN = re.compile(f'^({le255}\.){{3}}{le255}$')
EMPTY = re.compile(r':?\b(?:0\b:?)+')


def parse_ipv4(ip: str) -> int:
    assert isinstance(ip, str) and IPV4_PATTERN.match(ip)
    a, b, c, d = ip.split('.')
    return (int(a) << 24) + (int(b) << 16) + (int(c) << 8) + int(d)


def to_ipv4(n: int) -> str:
    assert isinstance(n, int) and 0 <= n <= MAX_IPV4
    return ".".join(str(n >> i & 255) for i in range(24, -1, -8))


def parse_ipv6(ip: str) -> int:
    assert isinstance(ip, str) and len(ip) <= 39
    segments = ip.lower().split(":")
    l, n, p, fields, compressed = len(segments), 0, 7, 0, False
    last = l - 1
    for i, s in enumerate(segments):
        assert fields <= 8 and len(s) <= 4 and not set(s) - DIGITS
        if not s:
            if i in (0, last):
                continue
            assert not compressed
            p = l - i - 2
            compressed = True
        else:
            n += int(s, 16) << p*16
            p -= 1
            fields += 1
    return n


def to_ipv6(n: int, compress: bool = False) -> str:
    assert isinstance(n, int) and 0 <= n <= MAX_IPV6
    ip = '{:032_x}'.format(n).replace('_', ':')
    if compress:
        ip = ':'.join(s.lstrip('0') if s != '0000' else '0' for s in ip.split(':'))
        longest = max(EMPTY.findall(ip))
        if len(longest) > 2:
            ip = ip.replace(longest, '::', 1)
    return ip


def parse_entry4(e: str) -> tuple:
    a, b, c = e.split(",")
    return (int(a), int(b), c)


def parse_entry6(e: str) -> tuple:
    a, b, c = e.split(",")
    return (parse_ipv6(a), parse_ipv6(b), c)


with open("D:/network_guard/geoip.txt", "r") as file:
    data4 = list(map(parse_entry4, file.read().splitlines()))

starts4, ends4, countries4 = zip(*data4)

with open("D:/network_guard/geoip6.txt", "r") as file:
    data6 = list(map(parse_entry6, file.read().splitlines()))

starts6, ends6, countries6 = zip(*data6)


class IP:
    parse = [parse_ipv4, parse_ipv6]
    starts = [starts4, starts6]
    ends = [ends4, ends6]
    countries = [countries4, countries6]


def geoip_country(ip: str, mode: int=0) -> str:
    assert mode in {0, 1}
    n = IP.parse[mode](ip)
    if not (i := bisect(IP.starts[mode], n)):
        return False
    i -= 1
    return False if n > IP.ends[mode][i] else IP.countries[mode][i]


if __name__ == '__main__':
    ipv6s = [
        '2404:6800:4003:c03::88',
        '2404:6800:4004:80f::200e',
        '2404:6800:4006:802::200e',
        '2607:f8b0:4004:800::200e',
        '2607:f8b0:4005:801::200e',
        '2607:f8b0:4006:81c::200e',
        '2607:f8b0:4006:822::200e',
        '2607:f8b0:4006:823::200e',
        '2607:f8b0:4006:824::200e',
        '2607:f8b0:400a:809::200e',
        '2800:3f0:4001:80b::200e',
        '2a00:1450:400b:c01::be'
    ]
    ipv4s = [
        '74.125.24.93',
        '74.125.24.136',
        '74.125.24.190',
        '74.125.68.91',
        '74.125.68.93',
        '74.125.68.136',
        '74.125.193.91',
        '74.125.193.93',
        '74.125.193.136',
        '74.125.193.190',
        '74.125.200.91',
        '142.250.64.78'
    ]
    for ipv4 in ipv4s:
        n = parse_ipv4(ipv4)
        print(ipv4, n, to_ipv4(n), geoip_country(ipv4))

    for ipv6 in ipv6s:
        n = parse_ipv6(ipv6)
        print(ipv6, n, to_ipv6(n, 1), geoip_country(ipv6, 1))
Output:
74.125.24.93 1249712221 74.125.24.93 US
74.125.24.136 1249712264 74.125.24.136 US
74.125.24.190 1249712318 74.125.24.190 US
74.125.68.91 1249723483 74.125.68.91 US
74.125.68.93 1249723485 74.125.68.93 US
74.125.68.136 1249723528 74.125.68.136 US
74.125.193.91 1249755483 74.125.193.91 US
74.125.193.93 1249755485 74.125.193.93 US
74.125.193.136 1249755528 74.125.193.136 US
74.125.193.190 1249755582 74.125.193.190 US
74.125.200.91 1249757275 74.125.200.91 US
142.250.64.78 2398765134 142.250.64.78 US
2404:6800:4003:c03::88 47875086426100614638538221612324356232 2404:6800:4003:c03::88 AU
2404:6800:4004:80f::200e 47875086426101804896252833647432835086 2404:6800:4004:80f::200e AU
2404:6800:4006:802::200e 47875086426104222508084389947558076430 2404:6800:4006:802::200e AU
2607:f8b0:4004:800::200e 50552053919386769199309343019258355726 2607:f8b0:4004:800::200e US
2607:f8b0:4005:801::200e 50552053919387978143575701722142613518 2607:f8b0:4005:801::200e US
2607:f8b0:4006:81c::200e 50552053919389187567457406341475213326 2607:f8b0:4006:81c::200e US
2607:f8b0:4006:822::200e 50552053919389187678137870783732523022 2607:f8b0:4006:822::200e US
2607:f8b0:4006:823::200e 50552053919389187696584614857442074638 2607:f8b0:4006:823::200e US
2607:f8b0:4006:824::200e 50552053919389187715031358931151626254 2607:f8b0:4006:824::200e US
2607:f8b0:400a:809::200e 50552053919394022920247727457692557326 2607:f8b0:400a:809::200e US
2800:3f0:4001:80b::200e 53169199713192736830836323499043201038 2800:3f0:4001:80b::200e AR
2a00:1450:400b:c01::be 55827987829231936335941766789076091070 2a00:1450:400b:c01::be IE
It does work, but I want my code to be more concise, more efficient, and more Pythonic. How can I do so?
(P.S. I also generated a documented version using mintlify, just for giggles. I am not responsible for most of the docs: I just pressed CTRL+. in Visual Studio Code and edited here and there. As it is way too verbose, I uploaded it to Google Drive.)