Knowing which country an IP address maps to is very helpful for localising a website's experience and for providing demographic information to website owners.
External API-based lookups
Different environments and use cases demand different solutions. For non-time-sensitive, low-volume cases an external, often free, service can be very quick to implement and requires no database maintenance on the consuming service's side.
But each external request can take 250 to 500 milliseconds. If the information is needed on the spot, in order to decide which content to serve or which business logic to apply, this is a huge overhead.
There is also the risk that a flood of traffic could cause the remote API to throttle requests. Networking woes can also cause the information to be unavailable.
As an anecdotal exercise I ran the following lookup three times:
$ time \
    curl -s freegeoip.net/json/24.24.24.24 | \
    python -m json.tool | \
    grep -oP 'country_code": "\K([A-Z]+)'
The responses took 314ms, 561ms and 289ms respectively to return. For non-time-sensitive tasks this would be fine (e.g. admin interface with IP address investigation tools).
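If you do use an external service in production it's worth guarding the call with a short timeout and a fallback so a slow or unavailable API degrades gracefully. Here is a minimal sketch of that pattern; the endpoint is the freegeoip.net one used above, while the timeout value and the None fallback are assumptions you'd tune for your own service:
import requests

def country_for_ip(ip_address, timeout=0.5):
    """Return a two-letter country code or None if the lookup fails.

    The 0.5-second timeout and None fallback are assumptions; pick
    values that suit your own latency budget.
    """
    try:
        resp = requests.get('http://freegeoip.net/json/%s' % ip_address,
                            timeout=timeout)
        resp.raise_for_status()
        return resp.json().get('country_code')
    except (requests.RequestException, ValueError):
        return None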
Local database lookups
If you're serving web requests where the user's location matters then a local database will save you the external network overhead and the risk of communication errors.
Not all solutions are created equal. Below I compare a C-based lookup against a pure python lookup (useful for cloud services which only allow pure python modules) and against looking up results in a local redis database.
C-based lookup
The first module I'll look at is GeoIP. It's written in C and offers the fastest lookups I've seen of any solution.
I ran the following on an Ubuntu 14 machine to install it:
$ sudo apt install \
    python-dev \
    libgeoip-dev
$ pip install GeoIP
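With the module installed, a one-off lookup from the python REPL makes for a quick sanity check; the expected 'CA' result for this address comes from MaxMind's data, as covered at the end of this post:
>>> import GeoIP
>>> gi = GeoIP.new(GeoIP.GEOIP_MEMORY_CACHE)
>>> gi.country_code_by_addr('24.244.192.0')
'CA'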
Next, I put together a benchmark. In each iteration, I generate a random IP address and look up its country:
import timeit
lookup_code = '''
ip_address = '.'.join([str(randint(0, 255)) for _ in range(0, 4)])
gi.country_code_by_addr(ip_address)
'''
setup_code = '''
from random import randint
import GeoIP
gi = GeoIP.new(GeoIP.GEOIP_MEMORY_CACHE)
'''
I ran the code one million times and it took 3.603 seconds to complete, around 3.6 microseconds per lookup:
>>> timeit.timeit(stmt=lookup_code, setup=setup_code, number=1000000)
Pure python-based lookup
Google App Engine, among other hosting providers, only allows customers to use pure python modules (unless they've already provided the module themselves). GeoIP won't work but pygeoip, a pure python module, will.
You will need MaxMind's GeoIP database. If you install libgeoip-dev on Ubuntu then it'll be stored in /usr/share/GeoIP/GeoIP.dat.
I ran the following on an Ubuntu 14 machine to install pygeoip:
$ sudo apt install \
    libgeoip-dev # Installs GeoIP.dat
$ pip install pygeoip
I built a variation of the previous benchmark, replacing GeoIP with pygeoip:
import timeit
lookup_code = '''
ip_address = '.'.join([str(randint(0, 255)) for _ in range(0, 4)])
gi.country_code_by_addr(ip_address)
'''
setup_code = '''
from random import randint
import pygeoip
gi = pygeoip.GeoIP('/usr/share/GeoIP/GeoIP.dat',
                   flags=pygeoip.const.MMAP_CACHE)
'''
I ran the code one million times and it took 33.394 seconds to complete, around 33 microseconds per lookup:
>>> timeit.timeit(stmt=lookup_code, setup=setup_code, number=1000000)
For a small number of lookups the slowdown with this library will hardly be noticeable, but if you're dealing with high volumes or a large batch job then it is significant.
Redis-based lookup
I wondered whether using redis as a data source would be faster or slower than either of the above solutions. It turns out I'm not the first to wonder this; there is a helpful question about it on Stack Overflow.
To start, I needed to import MaxMind's GeoIP Country CSV file into redis. First I downloaded and unzipped the database:
$ wget http://geolite.maxmind.com/download/geoip/database/GeoIPCountryCSV.zip
$ unzip GeoIPCountryCSV.zip
I then installed redis client bindings for python:
$ pip install redis
Then I ran a script that imports the relevant CSV columns into redis:
import csv
import socket
import struct

import redis


def ip2long(ip):
    """
    Convert an IP string to long
    """
    packedIP = socket.inet_aton(ip)
    return struct.unpack("!L", packedIP)[0]


if __name__ == "__main__":
    redis_con = redis.StrictRedis(host='localhost', port=6379, db=0)

    with open('GeoIPCountryWhois.csv', 'rb') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')

        for row in csv_reader:
            # Columns 0 and 4 hold the range's starting IP address
            # and its two-letter country code respectively.
            ip_range_start, country_code = ip2long(row[0]), row[4]

            # Score each member by the range's starting address so
            # sorted-set range queries can be used for lookups.
            redis_con.zadd('countries',
                           ip_range_start,
                           '%s@%d' % (country_code, ip_range_start))
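Once the import finishes, a quick sanity check is to confirm the sorted set's cardinality matches the CSV's row count; a minimal sketch using the same connection settings as above:
import redis

redis_con = redis.StrictRedis(host='localhost', port=6379, db=0)

# Count the members in the sorted set; this should equal the
# number of rows in GeoIPCountryWhois.csv.
print(redis_con.zcard('countries'))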
The space requirements to store the data in redis were pretty minimal:
$ redis-cli info | grep used_memory_peak_human
used_memory_peak_human:14.69M
For comparison, GeoIP.dat was 808KB and GeoIPCountryWhois.csv was 7.4MB on my machine at the time of writing.
I then built a lookup benchmark:
import timeit

lookup_code = '''
ip_address = ip2long('.'.join([str(randint(0, 255)) for _ in range(0, 4)]))

resp = redis_con.zrangebyscore(name='countries',
                               min=ip_address,
                               max='+inf',
                               start=0,
                               num=1)
country = resp[0].split('@')[0] if resp else None
'''

setup_code = '''
from random import randint
import socket
import struct

import redis


def ip2long(ip):
    """
    Convert an IP string to long
    """
    packedIP = socket.inet_aton(ip)
    return struct.unpack("!L", packedIP)[0]


redis_con = redis.StrictRedis(host='localhost', port=6379, db=0)
'''
I ran the code one million times and it took 64.495 seconds to complete, around 64 microseconds per lookup:
>>> timeit.timeit(stmt=lookup_code, setup=setup_code, number=1000000)
These benchmarks are only useful for batch processing
The first two benchmarks already had the IP address database in memory, so there was little overhead for them. The redis database also held its data in memory but adds a client-server communication overhead on every lookup.
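If redis is the data store you're stuck with, pipelining is one way to claw back some of that round-trip cost on batch workloads. This is a minimal sketch, not something I benchmarked here, and it assumes the same 'countries' sorted set built above:
import redis

redis_con = redis.StrictRedis(host='localhost', port=6379, db=0)

# Queue many ZRANGEBYSCORE queries and send them in a single
# round trip instead of one round trip per lookup.
pipe = redis_con.pipeline()

for ip_address in [404232216, 418693120]:  # example long-form IPs
    pipe.zrangebyscore(name='countries', min=ip_address, max='+inf',
                       start=0, num=1)

responses = pipe.execute()  # one list of results, in query order
countries = [r[0].split('@')[0] if r else None for r in responses]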
To level the playing field I created three scripts that each do a single lookup on a randomly-generated IP address: one for the C-based lookup, one for the pure python-based lookup and one for the redis-based lookup:
c_based.py:
from random import randint

import GeoIP


if __name__ == "__main__":
    gi = GeoIP.new(GeoIP.GEOIP_MEMORY_CACHE)
    ip_address = '.'.join([str(randint(0, 255)) for _ in range(0, 4)])
    gi.country_code_by_addr(ip_address)
pure_python_based.py:
from random import randint

import pygeoip


if __name__ == "__main__":
    gi = pygeoip.GeoIP('/usr/share/GeoIP/GeoIP.dat',
                       flags=pygeoip.const.MMAP_CACHE)
    ip_address = '.'.join([str(randint(0, 255)) for _ in range(0, 4)])
    gi.country_code_by_addr(ip_address)
redis_based.py:
from random import randint
import socket
import struct

import redis


def ip2long(ip):
    """
    Convert an IP string to long
    """
    packedIP = socket.inet_aton(ip)
    return struct.unpack("!L", packedIP)[0]


if __name__ == "__main__":
    redis_con = redis.StrictRedis(host='localhost', port=6379, db=0)
    ip_address = ip2long('.'.join([str(randint(0, 255))
                                   for _ in range(0, 4)]))
    resp = redis_con.zrangebyscore(name='countries',
                                   min=ip_address,
                                   max='+inf',
                                   start=0,
                                   num=1)
    country = resp[0].split('@')[0] if resp else None
I then created a bash script that runs each python script 1,000 times and reports how long each batch took:
$ cat benchmark.sh
#!/bin/bash

function c_based {
    for i in `seq 1 1000`;
    do
        python ./c_based.py
    done
}

function pure_python_based {
    for i in `seq 1 1000`;
    do
        python ./pure_python_based.py
    done
}

function redis_based {
    for i in `seq 1 1000`;
    do
        python ./redis_based.py
    done
}

time c_based
time pure_python_based
time redis_based
Here is the result of running the benchmark:
$ ./benchmark.sh
# C-based
real 0m8.407s
user 0m5.688s
sys 0m2.588s
# Pure python-based
real 0m17.498s
user 0m13.492s
sys 0m3.737s
# redis-based
real 0m29.075s
user 0m21.545s
sys 0m6.939s
For ad hoc requests the time it takes to load the database into memory levels the playing field a lot; each run also pays the cost of starting the python interpreter. The C-based approach is still about twice as fast as the pure python-based approach, but the redis-based approach is now only around 1.7 times slower than the pure python-based approach.
The curious case of 24.24.24.24
The CSV database I downloaded from MaxMind and the binary one I installed via the libgeoip-dev package had differences between them. One of the test IP addresses I used when I started building these scripts was 24.24.24.24. According to whois 24.24.24.24 the IP address is mapped to a network in Herndon, VA, USA and sits in the net range 24.24.0.0 - 24.29.255.255.
When I ran a redis lookup manually, though, it came back with Romania as the country the IP address maps to:
$ redis-cli
127.0.0.1:6379> ZRANGEBYSCORE countries 2130706433 +inf LIMIT 0 1
1) "RO@2147483648"
With the lookup implementation used in this blog, the entry with the closest range start at or above the queried address is always returned. That means that if no range in the database actually contains the IP address being looked up, the miss won't be flagged up; the query silently returns the next range instead.
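One way to catch these misses, sketched below rather than benchmarked, is to store the range's end address in the member as well and verify that the IP actually falls inside the returned range. The member layout and helper function here are hypothetical variations on the import script above:
# Assumes members shaped like 'CA@405012480-405143551', scored by the
# range start. This layout is an assumption, not what the import
# script above produces.

def lookup_country(redis_con, ip_long):
    # Find the range with the greatest start at or below the address.
    resp = redis_con.zrevrangebyscore(name='countries',
                                      max=ip_long,
                                      min='-inf',
                                      start=0,
                                      num=1)
    if not resp:
        return None

    country, ip_range = resp[0].split('@')
    range_start, range_end = [int(x) for x in ip_range.split('-')]

    # Only trust the result if the address is inside the range.
    if range_start <= ip_long <= range_end:
        return country

    return None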
I looked at the CSV file and it turns out there are no mappings for any 24.x.x.x ranges before 24.36.x.x:
$ grep '^"24\.' GeoIPCountryWhois.csv | head
"24.36.0.0","24.37.255.255","405012480","405143551","CA","Canada"
"24.38.0.0","24.38.143.255","405143552","405180415","US","United States"
"24.38.144.0","24.38.159.255","405180416","405184511","CA","Canada"
"24.38.160.0","24.41.95.255","405184512","405364735","US","United States"
"24.41.96.0","24.41.127.255","405364736","405372927","CA","Canada"
"24.41.128.0","24.42.63.255","405372928","405422079","PR","Puerto Rico"
"24.42.64.0","24.47.255.255","405422080","405798911","US","United States"
"24.48.0.0","24.48.127.255","405798912","405831679","CA","Canada"
"24.48.128.0","24.48.175.255","405831680","405843967","US","United States"
"24.48.176.0","24.48.191.255","405843968","405848063","CA","Canada"
At this point I wondered if any IP addresses would return the same results from all three implementations and whois. I picked 24.244.192.0 from the CSV file:
$ grep 24.244.192.0 GeoIPCountryWhois.csv
"24.244.192.0","24.244.255.255","418693120","418709503","CA","Canada"
Whois said the IP address is mapped to a network in Richmond Hill, Ontario, Canada. The C-based, pure python-based and redis-based lookups all returned Canada as their answer as well.
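For completeness, here is a small script that runs the same address through all three lookups side by side; it assumes the redis database built earlier in this post is still populated:
import socket
import struct

import GeoIP
import pygeoip
import redis


def ip2long(ip):
    packedIP = socket.inet_aton(ip)
    return struct.unpack("!L", packedIP)[0]


if __name__ == "__main__":
    ip_address = '24.244.192.0'

    # C-based lookup
    gi_c = GeoIP.new(GeoIP.GEOIP_MEMORY_CACHE)
    print(gi_c.country_code_by_addr(ip_address))

    # Pure python-based lookup
    gi_py = pygeoip.GeoIP('/usr/share/GeoIP/GeoIP.dat',
                          flags=pygeoip.const.MMAP_CACHE)
    print(gi_py.country_code_by_addr(ip_address))

    # redis-based lookup
    redis_con = redis.StrictRedis(host='localhost', port=6379, db=0)
    resp = redis_con.zrangebyscore(name='countries',
                                   min=ip2long(ip_address),
                                   max='+inf',
                                   start=0,
                                   num=1)
    print(resp[0].split('@')[0] if resp else None)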