
Posted on Wed 29 October 2014

IP Address lookups using Python

Knowing the country to which an IP address is mapped is very helpful for localising an experience on a website and providing demographics information to website owners.

External API-based lookups

Different environments and use cases demand different solutions. For non-time-sensitive, low-volume cases an external, often free, service can be very quick to implement and requires no database maintenance on the consuming service's side.

But making service requests can often take 250 to 500 milliseconds. If the information is needed on the spot in order to decide which content to serve or which business logic to use, this can be a huge overhead.

There is also the risk that a flood of traffic could cause the remote API to throttle requests. Networking woes can also cause the information to be unavailable.

As an anecdotal exercise I ran the following lookup three times:

$ time \
  curl -s freegeoip.net/json/24.24.24.24 | \
  python -m json.tool | \
  grep -oP 'country_code": "\K([A-Z]+)'

The responses took 314ms, 561ms and 289ms respectively to return. For non-time-sensitive tasks this would be fine (e.g. an admin interface with IP address investigation tools).
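
If you do go down the external API route it's worth isolating the JSON parsing and failing soft, so that a slow, unreachable or malformed response degrades to "country unknown" rather than an exception. A minimal sketch (Python 3 shown; the freegeoip.net endpoint matches the curl example above, and the 500ms default timeout is my own choice based on the response times seen):

```python
import json
import socket
from urllib.request import urlopen


def parse_country(payload):
    """Pull country_code out of a geolocation API's JSON response."""
    try:
        return json.loads(payload).get('country_code')
    except ValueError:
        return None  # Malformed JSON is treated as "country unknown"


def lookup_country(ip_address, timeout=0.5):
    """Query the remote API, returning None on any network failure."""
    url = 'http://freegeoip.net/json/%s' % ip_address
    try:
        return parse_country(urlopen(url, timeout=timeout).read())
    except (socket.timeout, OSError):
        return None
```

Callers then treat None as "no localisation available" and fall back to a default experience.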

Local database lookups

If you're serving web requests where the user's location is of importance then a local database will save you the external network overhead and communication error risks.

Not all solutions are created equal. Below I compare a C-based lookup against a pure python lookup (useful for cloud services which only allow pure python modules) and against lookups in a local redis database.

C-based lookup

The first module I'll look at is GeoIP. It's written in C and offers the fastest lookups I've seen of any solution.

I ran the following on an Ubuntu 14 machine to install it:

$ sudo apt-get install python-dev libgeoip-dev
$ pip install GeoIP

Next, I put together a benchmark. In each iteration, I generate a random IP address and look up its country:

import timeit


lookup_code = '''
ip_address = '.'.join([str(randint(0, 255)) for _ in range(0, 4)])
gi.country_code_by_addr(ip_address)
'''
setup_code = '''
from random import randint

import GeoIP


gi = GeoIP.new(GeoIP.GEOIP_MEMORY_CACHE)
'''

I ran the code one million times and it took 3.603 seconds to complete:

>>> timeit.timeit(stmt=lookup_code, setup=setup_code, number=1000000)

Pure python-based lookup

Google App Engine, among other hosting providers, only allows customers to use pure python modules (unless they've already provided the module themselves). GeoIP won't work but pygeoip, a pure python module, will.

You will need MaxMind's GeoIP database. If you install libgeoip-dev on Ubuntu then it'll be stored in /usr/share/GeoIP/GeoIP.dat.

I ran the following on an Ubuntu 14 machine to install pygeoip:

$ sudo apt-get install libgeoip-dev # Installs GeoIP.dat
$ pip install pygeoip

I built a variation of the previous benchmark, replacing GeoIP with pygeoip:

import timeit


lookup_code = '''
ip_address = '.'.join([str(randint(0, 255)) for _ in range(0, 4)])
gi.country_code_by_addr(ip_address)
'''
setup_code = '''
from random import randint

import pygeoip


gi = pygeoip.GeoIP('/usr/share/GeoIP/GeoIP.dat',
                   flags=pygeoip.const.MMAP_CACHE)
'''

I ran the code one million times and it took 33.394 seconds to complete:

>>> timeit.timeit(stmt=lookup_code, setup=setup_code, number=1000000)

For a small number of lookups this library will hardly be noticeably slower, but if you're dealing with high volumes or a large batch job then this is a significant slowdown.

Redis-based lookup

I wondered whether using redis as a data source would be faster or slower than either of the above solutions. It turns out I'm not the first to wonder: there's a helpful question on Stack Overflow covering this exact approach.

To start I needed to import MaxMind's GeoIP Country CSV file into redis. First I downloaded and unzipped the database:

$ wget http://geolite.maxmind.com/download/geoip/database/GeoIPCountryCSV.zip
$ unzip GeoIPCountryCSV.zip

I then installed redis client bindings for python:

$ pip install redis

Then I ran a script that loads the CSV data I was interested in into redis:

import csv
import socket
import struct

import redis


def ip2long(ip):
    """
    Convert an IP string to long
    """
    packedIP = socket.inet_aton(ip)
    return struct.unpack("!L", packedIP)[0]


if __name__ == "__main__":
    redis_con = redis.StrictRedis(host='localhost', port=6379, db=0)

    with open('GeoIPCountryWhois.csv', 'rb') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')

        for row in csv_reader:
            ip_range_start, country_code = ip2long(row[0]), row[4]
            redis_con.zadd('countries',
                           ip_range_start,
                           '%s@%d' % (country_code, ip_range_start))
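
As a sanity check on the conversion, here is the same ip2long helper (repeated so the snippet is self-contained). The scores stored above are plain unsigned 32-bit integers, so numeric order matches address order:

```python
import socket
import struct


def ip2long(ip):
    """Convert a dotted-quad IP string to an unsigned 32-bit integer."""
    # "!L" unpacks in network byte order, so the first octet becomes the
    # most significant byte; that's what keeps ranges numerically sorted.
    return struct.unpack("!L", socket.inet_aton(ip))[0]


ip2long('24.24.24.24')  # 404232216
ip2long('127.0.0.1')    # 2130706433
```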

The space requirements to store the data in redis were pretty minimal:

$ redis-cli info | grep used_memory_peak_human
used_memory_peak_human:14.69M

For comparison GeoIP.dat was 808KB and GeoIPCountryWhois.csv was 7.4MB on my machine at the time of writing.

I then built a lookup benchmark:

import timeit


lookup_code = '''
ip_address = ip2long('.'.join([str(randint(0, 255)) for _ in range(0, 4)]))
resp = redis_con.zrangebyscore(name='countries',
                               min=ip_address,
                               max='+inf',
                               start=0,
                               num=1)

country = resp[0].split('@')[0] if resp else None
'''
setup_code = '''
from random import randint
import socket
import struct

import redis


def ip2long(ip):
    """
    Convert an IP string to long
    """
    packedIP = socket.inet_aton(ip)
    return struct.unpack("!L", packedIP)[0]


redis_con = redis.StrictRedis(host='localhost', port=6379, db=0)
'''

I ran the code one million times and it took 64.495 seconds to complete:

>>> timeit.timeit(stmt=lookup_code, setup=setup_code, number=1000000)
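
Dividing each total by the million iterations puts the three in-memory benchmarks on a per-lookup basis (figures taken from the timeit runs above):

```python
# Total seconds for 1,000,000 lookups, from the timeit runs above
totals = {'GeoIP (C)': 3.603, 'pygeoip': 33.394, 'redis': 64.495}

# With a million iterations, the total in seconds is numerically the
# per-lookup cost in microseconds: ~3.6us, ~33.4us and ~64.5us
per_lookup_us = {name: total / 1e6 * 1e6 for name, total in totals.items()}

# Relative to the C module, pygeoip is ~9.3x slower and redis ~17.9x slower
slowdown = {name: round(total / totals['GeoIP (C)'], 1)
            for name, total in totals.items()}
```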

These benchmarks are only useful for batch processing

The first two benchmarks already had the IP address database in memory so there was little per-lookup overhead. The redis database also held its data in memory but added client-server communication overhead on every lookup.

To level the playing field I created three scripts that would only do a single lookup on a randomly-generated IP address. There was a script for the C-based lookup, pure python-based lookup and redis-based lookup:

c_based.py:

from random import randint

import GeoIP


if __name__ == "__main__":
    gi = GeoIP.new(GeoIP.GEOIP_MEMORY_CACHE)
    ip_address = '.'.join([str(randint(0, 255)) for _ in range(0, 4)])
    gi.country_code_by_addr(ip_address)

pure_python_based.py:

from random import randint

import pygeoip


if __name__ == "__main__":
    gi = pygeoip.GeoIP('/usr/share/GeoIP/GeoIP.dat',
                       flags=pygeoip.const.MMAP_CACHE)
    ip_address = '.'.join([str(randint(0, 255)) for _ in range(0, 4)])
    gi.country_code_by_addr(ip_address)

redis_based.py:

from random import randint
import socket
import struct

import redis


def ip2long(ip):
    """
    Convert an IP string to long
    """
    packedIP = socket.inet_aton(ip)
    return struct.unpack("!L", packedIP)[0]


if __name__ == "__main__":
    redis_con = redis.StrictRedis(host='localhost', port=6379, db=0)
    ip_address = ip2long('.'.join([str(randint(0, 255))
                                   for _ in range(0, 4)]))
    resp = redis_con.zrangebyscore(name='countries',
                                   min=ip_address,
                                   max='+inf',
                                   start=0,
                                   num=1)

    country = resp[0].split('@')[0] if resp else None

I then created a bash file that would run each script 1000 times and output how long each took:

$ cat benchmark.sh
#!/bin/bash

function c_based {
    for i in `seq 1 1000`;
    do
        python ./c_based.py
    done
}

function pure_python_based {
    for i in `seq 1 1000`;
    do
        python ./pure_python_based.py
    done
}

function redis_based {
    for i in `seq 1 1000`;
    do
        python ./redis_based.py
    done
}

time c_based
time pure_python_based
time redis_based

Here is the result of running the benchmark:

$ ./benchmark.sh

# C-based
real    0m8.407s
user    0m5.688s
sys     0m2.588s

# Pure python-based
real    0m17.498s
user    0m13.492s
sys     0m3.737s

# redis-based
real    0m29.075s
user    0m21.545s
sys     0m6.939s
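
Dividing the wall-clock times by the 1,000 invocations gives the per-process cost, interpreter start-up and database load included:

```python
# Real (wall-clock) seconds for 1,000 single-lookup processes,
# from the benchmark.sh output above
totals = {'c_based': 8.407, 'pure_python_based': 17.498,
          'redis_based': 29.075}

# Seconds per 1,000 runs is numerically milliseconds per run:
# ~8.4ms, ~17.5ms and ~29.1ms respectively
per_run_ms = {name: round(total / 1000 * 1000, 1)
              for name, total in totals.items()}
```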

For ad hoc requests the time it takes to load the database into memory levels the playing field a lot. The C-based approach is still about twice as fast as the pure python-based approach, but now the redis-based approach is only around twice as slow as the pure python-based approach.

The curious case of 24.24.24.24

The CSV database I downloaded from MaxMind and the binary one installed by the libgeoip-dev package had differences between them. One of the test IP addresses I used when I started building these scripts was 24.24.24.24. According to whois, 24.24.24.24 is mapped to a network in Herndon, VA, USA and sits in the net range 24.24.0.0 - 24.29.255.255.

When I ran a redis lookup manually though, it came back with Romania as the country the IP address maps to:

$ redis-cli
127.0.0.1:6379> ZRANGEBYSCORE countries 2130706433 +inf LIMIT 0 1
1) "RO@2147483648"

The member with the lowest score at or above the IP address will always be returned by the redis lookup implementation used in this post. That means that if the IP address being looked up doesn't fall inside any range in the database, the miss won't be flagged; the lookup simply answers with whatever range comes next.
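
This behaviour is easy to reproduce outside of redis: ZRANGEBYSCORE <min> +inf LIMIT 0 1 on a sorted set is equivalent to a bisect over the sorted scores. A sketch (the helper name is mine; the two sample rows come from the CSV excerpt below, and 404232216 is the long for 24.24.24.24):

```python
import bisect

# Two rows from GeoIPCountryWhois.csv: range-start longs as scores and
# the country@start members the import script would store for them
scores = [405012480, 405143552]             # 24.36.0.0 and 24.38.0.0
members = ['CA@405012480', 'US@405143552']


def first_member_at_or_above(ip_long):
    """Mimic ZRANGEBYSCORE countries <ip> +inf LIMIT 0 1."""
    idx = bisect.bisect_left(scores, ip_long)
    return members[idx] if idx < len(scores) else None


# 24.24.24.24 (404232216) falls inside no stored range, yet the query
# still answers with the next range's country instead of a miss:
first_member_at_or_above(404232216)  # 'CA@405012480'
```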

I looked at the CSV file and it turns out there are no mappings for any 24.x.x.x ranges before 24.36.x.x:

$ grep '^"24\.' GeoIPCountryWhois.csv | head
"24.36.0.0","24.37.255.255","405012480","405143551","CA","Canada"
"24.38.0.0","24.38.143.255","405143552","405180415","US","United States"
"24.38.144.0","24.38.159.255","405180416","405184511","CA","Canada"
"24.38.160.0","24.41.95.255","405184512","405364735","US","United States"
"24.41.96.0","24.41.127.255","405364736","405372927","CA","Canada"
"24.41.128.0","24.42.63.255","405372928","405422079","PR","Puerto Rico"
"24.42.64.0","24.47.255.255","405422080","405798911","US","United States"
"24.48.0.0","24.48.127.255","405798912","405831679","CA","Canada"
"24.48.128.0","24.48.175.255","405831680","405843967","US","United States"
"24.48.176.0","24.48.191.255","405843968","405848063","CA","Canada"

At this point I wondered if any IP addresses would return the same results from all three implementations and whois. I picked 24.244.192.0 from the CSV file:

$ grep 24.244.192.0 GeoIPCountryWhois.csv
"24.244.192.0","24.244.255.255","418693120","418709503","CA","Canada"

Whois said the IP address is mapped to a network in Richmond Hill, Ontario, Canada. All three lookup implementations returned Canada as their answer as well.

Thank you for taking the time to read this post. I offer consulting, architecture and hands-on development services to clients in Europe. If you'd like to discuss how my offerings can help your business please contact me via LinkedIn.

Copyright © 2014 - 2017 Mark Litwintschik.