UPDATE: Since writing this blog post I've had other developers get in touch with ideas for improving the script described in this post. I've created a repo on GitHub where pull requests can be submitted.
I recently published a blog post on finding the fastest way to look up the country mapping of any given IP address. Within a day I came across some interesting insights from gigarray on /r/Python:
"MaxMind's data is widely known to be fairly garbage." followed by "converting the IP address ... to an assigned block and doing a single lookup for that block ... it converts 4 billion IPv4 addresses down to 65K ASNs"
With that I thought, "How hard could it be to scrape WHOIS databases for all known IPv4 addresses?" If I only needed to make a little more than 65,000 WHOIS queries then it shouldn't take long to map out the world's IPv4 assignments and have some interesting data to analyse.
Making the problem ~66,153 times smaller
When you query WHOIS for an IP address you often get back the net range the address sits in:
$ whois 24.0.0.0
...
NetRange: 24.0.0.0 - 24.15.255.255
When you see that 24.0.0.0 - 24.15.255.255 is the net range, you can make your next query for 24.16.0.0 instead of 24.0.0.1.
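In principle the walk looks something like the following sketch, which shells out to the system whois client and assumes ARIN-style NetRange: lines (the helper name here is made up for illustration; the real script below uses the ipwhois library instead):

import subprocess


def netrange_end_via_whois(ip):
    """Return the last address of the net range containing ip, or None."""
    # Hypothetical helper for illustration only.
    output = subprocess.check_output(['whois', ip])
    for line in output.splitlines():
        if line.startswith('NetRange:'):
            # e.g. 'NetRange:       24.0.0.0 - 24.15.255.255'
            return line.split('-')[-1].strip()
    return None

print netrange_end_via_whois('24.0.0.0')  # '24.15.255.255', so query 24.16.0.0 next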
Parsing WHOIS records
WHOIS records look uniform at first glance but there are many small differences between them.
$ whois 24.0.0.0
...
NetRange: 24.0.0.0 - 24.15.255.255
CIDR: 24.0.0.0/12
NetName: EASTERNSHORE-1
NetHandle: NET-24-0-0-0-1
Parent: NET24 (NET-24-0-0-0-0)
NetType: Direct Allocation
OriginAS:
Organization: Comcast Cable Communications, Inc. (CMCS)
RegDate: 2003-10-06
Updated: 2012-03-02
Comment: ADDRESSES WITHIN THIS BLOCK ARE NON-PORTABLE
Ref: http://whois.arin.net/rest/net/NET-24-0-0-0-1
...
The first library I looked for was one that could perform the WHOIS query and parse the response; ipwhois does just that. Looking at some of its code I could see it handled a lot of edge cases when parsing records and returned the result as a dictionary.
In [1]: from ipwhois import IPWhois
In [2]: IPWhois('24.24.24.24').lookup_rws()
Out[2]:
{'asn': '11351',
'asn_cidr': '24.24.0.0/18',
'asn_country_code': 'US',
'asn_date': '2000-06-09',
'asn_registry': 'arin',
'nets': [{'abuse_emails': 'abuse@rr.com',
'address': '13820 Sunrise Valley Dr',
'cidr': '24.24.0.0/14, 24.28.0.0/15',
'city': 'Herndon',
'country': 'US',
'created': '2000-06-09T00:00:00-04:00',
'description': 'Time Warner Cable Internet LLC',
'handle': u'NET-24-24-0-0-1',
'misc_emails': None,
'name': 'ROAD-RUNNER-1',
'postal_code': '20171',
'range': u'24.24.0.0 - 24.29.255.255',
'state': 'VA',
'tech_emails': 'abuse@rr.com',
'updated': '2011-07-06T16:44:52-04:00'}],
'query': '24.24.24.24',
'raw': None}
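The parsed dictionary makes it easy to pull out just the fields of interest:

In [3]: resp = IPWhois('24.24.24.24').lookup_rws()

In [4]: resp['nets'][0]['country'], resp['nets'][0]['range']
Out[4]: ('US', u'24.24.0.0 - 24.29.255.255')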
4.3 billion addresses but not all are for public use
Next I needed to make sure I didn't waste queries on IP ranges that would never return a proper result. 10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16 are well known private network blocks, 127.0.0.0/8 is for loopback, 224.0.0.0/4 is for multicast, the first and last 256 addresses of the 169.254.0.0/16 block are reserved for future use, and the list goes on...
I found the ipaddr.py module had a decent list of defined networks so I built my list from there:
import socket
import struct
import ipcalc

# Note: ip2long() and get_netrange_end() are defined in the full script
# at the bottom of this post.
def get_next_ip(ip_address):
"""
:param str ip_address: ipv4 address
:return: next ipv4 address
:rtype: str
>>> get_next_ip('0.0.0.0')
'0.0.0.1'
>>> get_next_ip('24.24.24.24')
'24.24.24.25'
>>> get_next_ip('24.24.255.255')
'24.25.0.0'
>>> get_next_ip('255.255.255.255') is None
True
"""
assert ip_address.count('.') == 3, \
'Must be an IPv4 address in str representation'
if ip_address == '255.255.255.255':
return None
try:
return socket.inet_ntoa(struct.pack('!L', ip2long(ip_address) + 1))
except Exception, error:
print 'Unable to get next IP for %s' % ip_address
raise error
def get_next_undefined_address(ip):
"""
Get the next non-private IPv4 address if the address sent is private
:param str ip: IPv4 address
    :return: IPv4 address of the next non-private address
:rtype: str
>>> get_next_undefined_address('0.0.0.0')
'1.0.0.0'
>>> get_next_undefined_address('24.24.24.24')
'24.24.24.24'
>>> get_next_undefined_address('127.0.0.1')
'128.0.0.0'
>>> get_next_undefined_address('255.255.255.256') is None
True
"""
try:
# Should weed out many invalid IP addresses
ipcalc.Network(ip)
except ValueError, error:
return None
defined_networks = (
'0.0.0.0/8',
'10.0.0.0/8',
'127.0.0.0/8',
        '169.254.0.0/16',
        '172.16.0.0/12',
'192.0.0.0/24',
'192.0.2.0/24',
'192.88.99.0/24',
'192.168.0.0/16',
'198.18.0.0/15',
'198.51.100.0/24',
'203.0.113.0/24',
'224.0.0.0/4',
'240.0.0.0/4',
'255.255.255.255/32',
)
for network_cidr in defined_networks:
if ip in ipcalc.Network(network_cidr):
return get_next_ip(get_netrange_end(network_cidr))
return ip
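As a quick sanity check of the list above, ipaddr.py can also classify an individual address directly; a small sketch, assuming its is_private, is_loopback and is_multicast properties:

import ipaddr

for ip in ('10.1.2.3', '127.0.0.1', '224.0.0.5', '8.8.8.8'):
    addr = ipaddr.IPAddress(ip)
    # Each property tests membership of the relevant special-use block
    print ip, addr.is_private, addr.is_loopback, addr.is_multicast

# 10.1.2.3 True False False
# 127.0.0.1 False True False
# 224.0.0.5 False False True
# 8.8.8.8 False False False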
Now I could start from 0.0.0.0 and work my way up to 255.255.255.255. Before querying each IP address I check to see if it's defined and, if it is, get the next undefined address back:
>>> get_next_undefined_address('0.0.0.0')
'1.0.0.0'
The ipcalc module came in handy when trying to see if an IP address was within a CIDR-defined range.
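For example, membership testing works directly with string addresses:

In [1]: import ipcalc

In [2]: '24.24.24.24' in ipcalc.Network('24.24.0.0/14')
Out[2]: True

In [3]: '24.28.0.1' in ipcalc.Network('24.24.0.0/14')
Out[3]: False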
How many IPs are unassigned?
One problem I came up against: when I found a range of unassigned IP addresses I wasn't told how large the range was. So when I hit 192.0.1.0 I just had to keep stepping through addresses one at a time until I found an assigned range again. This took a long time and felt unproductive.
$ whois 192.0.1.0
#
# ARIN WHOIS data and services are subject to the Terms of Use
# available at: https://www.arin.net/whois_tou.html
#
# If you see inaccuracies in the results, please report at
# http://www.arin.net/public/whoisinaccuracy/index.xhtml
#
No match found for n + 192.0.1.0.
#
# ARIN WHOIS data and services are subject to the Terms of Use
# available at: https://www.arin.net/whois_tou.html
#
# If you see inaccuracies in the results, please report at
# http://www.arin.net/public/whoisinaccuracy/index.xhtml
#
My script would just print out the following and try the next IP address:
Missing ASN CIDR in whois resp: {
'asn_registry': 'arin',
'asn_date': '',
'asn_country_code': '',
'raw': None,
'asn_cidr': 'NA',
'query': '192.0.1.86',
'nets': [],
'asn': 'NA'
}
To minimise the time spent collecting no data I decided to break up the job. My script accepts a number of threads and divides the IPv4 address space into equal chunks, one for each thread to handle.
def break_up_ipv4_address_space(num_threads=8):
"""
>>> break_up_ipv4_address_space() == \
[('0.0.0.0', '31.255.255.255'), ('32.0.0.0', '63.255.255.255'),\
('64.0.0.0', '95.255.255.255'), ('96.0.0.0', '127.255.255.255'),\
('128.0.0.0', '159.255.255.255'), ('160.0.0.0', '191.255.255.255'),\
('192.0.0.0', '223.255.255.255'), ('224.0.0.0', '255.255.255.255')]
True
"""
ranges = []
multiplier = 256 / num_threads
for marker in range(0, num_threads):
starting_class_a = (marker * multiplier)
ending_class_a = ((marker + 1) * multiplier) - 1
ranges.append(('%d.0.0.0' % starting_class_a,
'%d.255.255.255' % ending_class_a))
return ranges
gevent is used to launch each worker greenlet asynchronously:
threads = [gevent.spawn(get_netranges, starting_ip, ending_ip, ...)
           for starting_ip, ending_ip in break_up_ipv4_address_space(num_threads)]
gevent.joinall(threads)
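One caveat: ipwhois performs its lookups over ordinary blocking sockets, so for the greenlets to run truly concurrently the standard library should be monkey-patched before anything else is imported. A minimal sketch:

# Must run before socket/ipwhois are imported so that blocking WHOIS
# queries yield to other greenlets instead of stalling them.
from gevent import monkey
monkey.patch_all()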
What was collected?
I stored the various items of information ipwhois returned in Elasticsearch, along with the starting and ending IP address of each range and the number of addresses within it. I then created a small method to show (up to) the top 10 countries and cities by the number of IP addresses assigned to networks within them, along with those counts.
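Each stored record looks roughly like this (a trimmed, illustrative example based on the 24.24.24.24 lookup above; the full script below also stores the handle, contact emails, dates and the raw WHOIS response):

entry = {
    'netblock_start': '24.24.0.0',
    'netblock_end': '24.29.255.255',
    'block_size': 393216,  # number of addresses in the range
    'name': 'ROAD-RUNNER-1',
    'country': 'US',
    'state': 'VA',
    'city': 'Herndon',
}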
I didn't run the scrape through the whole of the IPv4 space as this was just an experiment. The following stats come from just a few minutes of collected data:
$ ./whois.py stats http://127.0.0.1:9200/ netblocks
Top 10 netblock locations by country
67,836,672 us
327,680 eu
73,728 ca
65,536 gb
65,536 ie
32,768 th
20,480 jp
15,872 cn
6,656 ro
2,048 dk
Top 10 netblock locations by city
16,842,752 columbus
16,785,408 houston
16,777,216 lake mary
16,252,928 ann arbor
524,288 littleton
262,144 herdon
131,072 nashville
131,072 sioux falls
65,536 toronto
61,184 spanish fork
There probably is a better way of doing this
I wouldn't be surprised if someone already provides a data dump of all WHOIS records for the IPv4 space somewhere online. In 2014, this seems like a lot of effort just to see the state of IPv4 assignments.
There were a lot of edge cases I came up against and I'm in deep admiration of anyone who can scrape this data consistently, quickly and reliably.
I time-boxed my efforts on this code to one day so it's far from a shining example of what I can do when I'm at my best. I welcome any feedback and suggestions on the code.
My script and its requirements
$ cat requirements.txt
argparse==1.2.1
dnspython==1.12.0
docopt==0.6.2
gevent==1.0.1
greenlet==0.4.5
ipaddr==2.1.11
ipcalc==1.1.3
ipwhois==0.9.1
pyelasticsearch==0.7.1
requests==2.4.3
simplejson==3.6.5
six==1.8.0
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
IPv4 Whois data collection and analysis tool
Usage:
./whois.py collect <elastic_search_url> <index_name> <doc_name>
[--sleep_min=<n>] [--sleep_max=<n>] [--threads=<n>]
./whois.py stats <elastic_search_url> <index_name>
./whois.py test
./whois.py (-h | --help)
Options:
-h, --help Show this screen and exit.
--sleep_min=<n> Least number of seconds to sleep for [Default: 1]
--sleep_max=<n> Most number of seconds to sleep for [Default: 5]
--threads=<n> Number of threads [Default: 8]
Examples:
./whois.py collect http://127.0.0.1:9200/ netblocks netblock
./whois.py stats http://127.0.0.1:9200/ netblocks
License:
The MIT License (MIT)
Copyright (c) 2014 Mark Litwintschik
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
"""
import json
from random import randint
import socket
import struct
import sys
from docopt import docopt
import ipcalc
from ipwhois import IPWhois
import gevent
from pyelasticsearch import ElasticSearch
from pyelasticsearch.exceptions import \
ElasticHttpError, ElasticHttpNotFoundError
import requests
def ip2long(ip):
"""
Convert IPv4 address in string format into an integer
:param str ip: ipv4 address
:return: ipv4 address
:rtype: integer
"""
packed_ip = socket.inet_aton(ip)
return struct.unpack("!L", packed_ip)[0]
def get_next_ip(ip_address):
"""
:param str ip_address: ipv4 address
:return: next ipv4 address
:rtype: str
>>> get_next_ip('0.0.0.0')
'0.0.0.1'
>>> get_next_ip('24.24.24.24')
'24.24.24.25'
>>> get_next_ip('24.24.255.255')
'24.25.0.0'
>>> get_next_ip('255.255.255.255') is None
True
"""
assert ip_address.count('.') == 3, \
'Must be an IPv4 address in str representation'
if ip_address == '255.255.255.255':
return None
try:
return socket.inet_ntoa(struct.pack('!L', ip2long(ip_address) + 1))
except Exception, error:
print 'Unable to get next IP for %s' % ip_address
raise error
def get_netrange_end(asn_cidr):
"""
:param str asn_cidr: ASN CIDR
:return: ipv4 address of last IP in netrange
:rtype: str
"""
try:
last_in_netrange = \
ip2long(str(ipcalc.Network(asn_cidr).host_first())) + \
ipcalc.Network(asn_cidr).size() - 2
except ValueError, error:
print 'Issue calculating size of %s network' % asn_cidr
raise error
return socket.inet_ntoa(struct.pack('!L', last_in_netrange))
def get_next_undefined_address(ip):
"""
Get the next non-private IPv4 address if the address sent is private
:param str ip: IPv4 address
    :return: IPv4 address of the next non-private address
:rtype: str
>>> get_next_undefined_address('0.0.0.0')
'1.0.0.0'
>>> get_next_undefined_address('24.24.24.24')
'24.24.24.24'
>>> get_next_undefined_address('127.0.0.1')
'128.0.0.0'
>>> get_next_undefined_address('255.255.255.256') is None
True
"""
try:
# Should weed out many invalid IP addresses
ipcalc.Network(ip)
except ValueError, error:
return None
defined_networks = (
'0.0.0.0/8',
'10.0.0.0/8',
'127.0.0.0/8',
        '169.254.0.0/16',
        '172.16.0.0/12',
'192.0.0.0/24',
'192.0.2.0/24',
'192.88.99.0/24',
'192.168.0.0/16',
'198.18.0.0/15',
'198.51.100.0/24',
'203.0.113.0/24',
'224.0.0.0/4',
'240.0.0.0/4',
'255.255.255.255/32',
)
for network_cidr in defined_networks:
if ip in ipcalc.Network(network_cidr):
return get_next_ip(get_netrange_end(network_cidr))
return ip
def break_up_ipv4_address_space(num_threads=8):
"""
>>> break_up_ipv4_address_space() == \
[('0.0.0.0', '31.255.255.255'), ('32.0.0.0', '63.255.255.255'),\
('64.0.0.0', '95.255.255.255'), ('96.0.0.0', '127.255.255.255'),\
('128.0.0.0', '159.255.255.255'), ('160.0.0.0', '191.255.255.255'),\
('192.0.0.0', '223.255.255.255'), ('224.0.0.0', '255.255.255.255')]
True
"""
ranges = []
multiplier = 256 / num_threads
for marker in range(0, num_threads):
starting_class_a = (marker * multiplier)
ending_class_a = ((marker + 1) * multiplier) - 1
ranges.append(('%d.0.0.0' % starting_class_a,
'%d.255.255.255' % ending_class_a))
return ranges
def get_netranges(starting_ip='1.0.0.0',
last_ip='2.0.0.0',
elastic_search_url='http://127.0.0.1:9200/',
index_name='netblocks',
doc_name='netblock', sleep_min=1, sleep_max=5):
connection = ElasticSearch(elastic_search_url)
current_ip = starting_ip
while True:
# See if we've finished the range of work
if ip2long(current_ip) > ip2long(last_ip):
return
current_ip = get_next_undefined_address(current_ip)
        if current_ip is None:  # No more undefined ip addresses
return
print current_ip
try:
whois_resp = IPWhois(current_ip).lookup_rws()
except Exception as error:
"""
If a message like: 'STDERR: getaddrinfo(whois.apnic.net): Name or
            service not known' appears then print it out and try the next
IP address.
"""
print type(error), error
current_ip = get_next_ip(current_ip)
if current_ip is None:
return # No more undefined ip addresses
gevent.sleep(randint(sleep_min, sleep_max))
continue
if 'asn_cidr' in whois_resp and \
whois_resp['asn_cidr'] is not None and \
whois_resp['asn_cidr'].count('.') == 3:
last_netrange_ip = get_netrange_end(whois_resp['asn_cidr'])
else:
try:
last_netrange_ip = \
whois_resp['nets'][0]['range'].split('-')[-1].strip()
assert last_netrange_ip.count('.') == 3
            except Exception:
# No match found for n + 192.0.1.0.
print 'Missing ASN CIDR in whois resp: %s' % whois_resp
current_ip = get_next_ip(current_ip)
if current_ip is None:
return # No more undefined ip addresses
gevent.sleep(randint(sleep_min, sleep_max))
continue
assert last_netrange_ip is not None and \
last_netrange_ip.count('.') == 3, \
'Unable to find last netrange ip for %s: %s' % (current_ip,
whois_resp)
# Save current_ip and whois_resp
entry = {
'netblock_start': current_ip,
'netblock_end': last_netrange_ip,
'block_size': ip2long(last_netrange_ip) - ip2long(current_ip) + 1,
'whois': json.dumps(whois_resp),
}
keys = ('cidr', 'name', 'handle', 'range', 'description',
'country', 'state', 'city', 'address', 'postal_code',
'abuse_emails', 'tech_emails', 'misc_emails', 'created',
'updated')
for _key in keys:
entry[_key] = str(whois_resp['nets'][0][_key]) \
if _key in whois_resp['nets'][0] and \
whois_resp['nets'][0][_key] else None
if _key == 'city' and entry[_key] and ' ' in entry[_key]:
entry[_key] = entry[_key].replace(' ', '_')
try:
connection.index(index_name, doc_name, entry)
except ElasticHttpError, error:
print 'At %s. Unable to save record: %s' % (current_ip, entry)
raise error
current_ip = get_next_ip(last_netrange_ip)
if current_ip is None:
return # No more undefined ip addresses
gevent.sleep(randint(sleep_min, sleep_max))
def stats(elastic_search_url, index_name):
fields = ('country', 'city')
url = '%s/%s/_search?fields=aggregations' % (elastic_search_url, index_name)
for field in fields:
data = {
"aggs": {
field: {
"terms": {
"field": field,
"order": {"total_ips": "desc"}
},
"aggs": {
"total_ips": {"sum": {"field": "block_size"}}
}
}
}
}
resp = requests.get(url, data=json.dumps(data))
assert resp.status_code == 200, \
'Did not get HTTP 200 back: %s' % resp.status_code
_stats = json.loads(resp.content)["aggregations"][field]["buckets"]
_stats = {stat['key']: int(stat['total_ips']['value'])
for stat in _stats}
print 'Top 10 netblock locations by %s' % field
for _key in sorted(_stats, key=_stats.get, reverse=True):
print "{:14,d}".format(_stats[_key]), _key.replace('_', ' ')
print
def main(argv):
"""
:param dict argv: command line arguments
"""
opt = docopt(__doc__, argv)
if opt['collect']:
sleep_min = int(opt['--sleep_min']) \
if opt['--sleep_min'] is not None else randint(1, 5)
sleep_max = int(opt['--sleep_max']) \
if opt['--sleep_max'] is not None else randint(1, 5)
num_threads = int(opt['--threads'])
if sleep_min > sleep_max:
sleep_min, sleep_max = sleep_max, sleep_min
        threads = [gevent.spawn(get_netranges, starting_ip, ending_ip,
                                opt['<elastic_search_url>'], opt['<index_name>'],
                                opt['<doc_name>'], sleep_min, sleep_max)
                   for starting_ip, ending_ip in
                   break_up_ipv4_address_space(num_threads)]
gevent.joinall(threads)
if opt['stats']:
        stats(opt['<elastic_search_url>'],
              opt['<index_name>'])
if opt['test']:
import doctest
doctest.testmod()
if __name__ == "__main__":
try:
main(sys.argv[1:])
except KeyboardInterrupt:
pass
Newer versions of the above script can be found on GitHub.