
Posted on Tue 26 April 2016 under DevOps and Networking

Mass IP Address WHOIS Collection with Django & Kafka

In two previous blog posts I looked at ways of collecting a snapshot of the WHOIS data for all IPv4 addresses. I wanted to try a third method of collecting this information using a cluster of AWS EC2 spot instances. I wasn't sure how well this method would work so I time-boxed the experiment to two days.

The code base can be found here. Keep in mind it's undergone many changes since this blog post was originally published. To read about the improvements please see my "Faster IPv4 WHOIS Crawling" blog post.

The Architecture

I used two different types of nodes in the cluster: a single coordinator node and 50 worker nodes.

The coordinator generates 4.7 million IPv4 addresses, avoiding known reserved addresses and large class A blocks with well-known owners. These IPs are sampled from across the 4 billion+ IPv4 addresses that exist. IP addresses tend to be allocated in fairly large blocks, so 4.7 million lookups should be able to find a large portion of allocated addresses.

The worker nodes would then ask for 1,000 of those addresses at a time. Going through each of those 1,000 addresses, a worker node would check whether the IP address hits any CIDR blocks that other nodes had already discovered during their own lookups. If no other node had come across a CIDR block containing that IP address, the worker node would then work out which of the five major registries manages the IPv4 space that IP sits in. The IP would then be added to a queue assigned to deal with that registry's IPs.

There is a queue for each registry for doing WHOIS lookups; the queues run concurrently and are rate limited to avoid any blacklisting.
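As an illustration, the per-registry queues map naturally onto rate-limited Celery tasks, one queue per registry. This is only a sketch of the idea: the task layout, the '30/m' rate limit and the use of the ipwhois package are assumptions, not details taken from the actual code base.

from celery import Celery

app = Celery('ips', broker='redis://localhost:6379/0')

# The five regional registries; each gets its own queue and its own
# independently rate-limited task so no single registry blacklists us.
REGISTRIES = ('arin', 'ripencc', 'apnic', 'lacnic', 'afrinic')


def make_registry_task(registry):
    @app.task(name='whois.%s' % registry, rate_limit='30/m')
    def lookup(ip):
        from ipwhois import IPWhois        # assumed WHOIS client
        return IPWhois(ip).lookup_whois()
    return lookup


LOOKUP_TASKS = {r: make_registry_task(r) for r in REGISTRIES}


def dispatch(ip, registry):
    # Route the IP onto the queue dedicated to its registry.
    LOOKUP_TASKS[registry].apply_async(args=[ip], queue=registry)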

When a WHOIS request completes successfully, the data is published to a Kafka 'results' topic that all the worker nodes and the coordinator have access to.

The coordinator monitors the results topic and stores each unique CIDR block in Redis. This persists the list of known CIDRs and is used by the endpoint that tells worker nodes if the IP address they have is in a known CIDR block or not.
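A minimal sketch of that hit check, assuming the CIDRs are kept in a Redis set and using netaddr (which the analysis at the end of this post also relies on); the key name and function shapes are hypothetical.

import redis
from netaddr import IPAddress, IPSet

r = redis.StrictRedis(host='localhost', port=6379, db=0)


def store_cidr(cidr):
    # Called by the coordinator as results come off the Kafka topic.
    r.sadd('known_cidrs', cidr)


def in_known_cidr_block(ip):
    # True if the IP falls inside any CIDR block seen so far.
    cidrs = [c.decode('utf-8') for c in r.smembers('known_cidrs')]
    return IPAddress(ip) in IPSet(cidrs)

However the real endpoint is implemented, this membership test is the hot path: it gets called for almost every IP every worker processes, which is worth keeping in mind for later.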

When the IPs are sent out to workers they're from random parts of the IPv4 address space. The hope is that as the workers work their way through the queue of IPs, more and more will hit CIDRs that other workers found before them and not bother with the lookup. It's much quicker to see if an IP has a hit than to query a registry.

Launching a Cluster on AWS

I first created a security group in eu-west-1 called 'ip-whois-sg'. It allowed my home IP address to connect via port 22 and allowed instances in the security group to communicate with one another on ports 8000 (Django), 2181 (ZooKeeper) and 9092 (Kafka).
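For reference, the same rules could be scripted with boto3 rather than set up by hand; this is just a sketch of the equivalent calls (the placeholder home IP and the assumption of a default VPC are mine), not how the group was actually created.

import boto3

ec2 = boto3.client('ec2', region_name='eu-west-1')

sg_id = ec2.create_security_group(
    GroupName='ip-whois-sg',
    Description='IP WHOIS crawling cluster')['GroupId']

HOME_IP = '203.0.113.10/32'  # placeholder for my home IP address

# SSH from home only.
ec2.authorize_security_group_ingress(
    GroupId=sg_id,
    IpPermissions=[{'IpProtocol': 'tcp', 'FromPort': 22, 'ToPort': 22,
                    'IpRanges': [{'CidrIp': HOME_IP}]}])

# Django (8000), ZooKeeper (2181) and Kafka (9092) between group members.
for port in (8000, 2181, 9092):
    ec2.authorize_security_group_ingress(
        GroupId=sg_id,
        IpPermissions=[{'IpProtocol': 'tcp', 'FromPort': port, 'ToPort': port,
                        'UserIdGroupPairs': [{'GroupId': sg_id}]}])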

I launched an on-demand t2.small instance for the coordinator and 50 m4.large spot instances which acted as worker nodes. I used the ami-f95ef58a Ubuntu 14.04 LTS image for every instance.

I launched the spot instances after 3PM UTC, as spot prices were around 50% of what they had been just a few hours earlier. Launching them in eu-west-1a also brought savings over eu-west-1b and eu-west-1c.

Prices for m4.large spot instances as of 2016-04-26 16:37:29 UTC:

  • eu-west-1a: $0.0186 / hour
  • eu-west-1b: $0.0242 / hour
  • eu-west-1c: $0.0216 / hour

I bid a maximum of $0.02 / hour for the spot instances. The whole cluster cost, at most, $1.028 / hour (50 spot instances at $0.02 / hour plus the on-demand t2.small at $0.028 / hour).

After I'd made the requests many spot instances were up within a few minutes, but a number of others reported issues such as "instance terminated capacity oversubscribed" and "capacity oversubscribed". In the end only 35 of the 50 spot requests were filled. I was keen on using 50 instances but the 35 that had launched had already cost me money as they're billed by the hour, rounded up.

There's a saying that you can pay for one machine for 10 hours or 10 machines for one hour. With spot instances, the minimum price for certain instance types in certain regions changes a lot throughout the day, so it's important to get the work done sooner rather than later; the longer a job runs, the greater the risk of instances being shut down because your maximum bid is no longer competitive. This is why I try to guess how many machines are needed to complete a job within a few hours.

Bootstrapping Instances

To keep things simple I created a zip file that contains all the source code and other files needed to run each of the nodes. An Ansible script would later upload this file when setting up each of the nodes.

$ zip -r \
    app.zip \
    ips/ *.txt \
    -x *.sqlite3 \
    -x *.pid \
    -x *.pyc

Then, to avoid any prompts by Ansible, I added the ECDSA key fingerprints of every EC2 instance to my list of known hosts.

$ EC2_IPS=$(aws ec2 describe-instances \
              --query 'Reservations[].Instances[].[PublicIpAddress]' \
              --output text |
              sort |
              uniq |
              grep -v None)

$ for IP in $EC2_IPS; do
      ssh -i ~/.ssh/ip_whois.pem \
          -o StrictHostKeyChecking=no \
          ubuntu@$IP \
          "uptime" &
  done

Without anything fancier than echoing out the value of EC2_IPS and a bit of column editing, I created an inventory file for Ansible (a sketch of scripting this step follows the listing below).

$ vi devops/inventory
[coordinator]
coord1 ansible_host=54.229.76.227 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem

[worker]
worker1 ansible_host=54.171.68.229 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker2 ansible_host=54.194.200.82 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker3 ansible_host=54.194.58.88 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker4 ansible_host=54.171.222.219 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker5 ansible_host=54.171.110.235 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker6 ansible_host=54.171.180.145 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker7 ansible_host=54.171.74.153 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker8 ansible_host=54.171.248.30 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker9 ansible_host=54.194.159.165 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workera ansible_host=54.171.70.55 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerb ansible_host=54.171.226.234 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerc ansible_host=54.229.97.204 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerd ansible_host=54.171.154.107 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workere ansible_host=54.171.69.162 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerf ansible_host=54.229.96.245 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerg ansible_host=54.194.200.148 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerh ansible_host=54.171.117.35 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workeri ansible_host=54.171.160.49 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerj ansible_host=54.229.27.31 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerk ansible_host=54.171.69.4 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerl ansible_host=54.194.228.213 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerm ansible_host=54.171.178.129 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workern ansible_host=54.194.44.204 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workero ansible_host=54.194.105.191 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerp ansible_host=54.171.181.17 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerq ansible_host=54.171.71.135 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerr ansible_host=54.171.83.54 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workers ansible_host=54.171.213.63 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workert ansible_host=54.229.81.6 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workeru ansible_host=54.194.201.129 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerv ansible_host=54.171.179.116 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerw ansible_host=54.171.208.231 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerx ansible_host=54.171.238.185 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workery ansible_host=54.171.248.114 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerz ansible_host=54.171.73.179 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
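For what it's worth, the column editing could be swapped for a few lines of Python. This sketch assumes the contents of EC2_IPS have been dumped into a file called ips.txt with the coordinator's IP on the first line; the naming scheme matches the inventory above.

from string import ascii_lowercase

SUFFIX = ('ansible_user=ubuntu '
          'ansible_private_key_file=~/.ssh/ip_whois.pem')

# worker1..worker9, then workera..workerz, matching the listing above.
labels = list('123456789') + list(ascii_lowercase)

with open('ips.txt') as f:
    ips = [line.strip() for line in f if line.strip()]

print('[coordinator]')
print('coord1 ansible_host=%s %s' % (ips[0], SUFFIX))
print('')
print('[worker]')
for label, ip in zip(labels, ips[1:]):
    print('worker%s ansible_host=%s %s' % (label, ip, SUFFIX))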

With that in place I ran the bootstrap playbook:

$ cd devops
$ ansible-playbook bootstrap.yml

After that finished running there were a few machines that had issues.

NO MORE HOSTS LEFT *************************************************************
        to retry, use: --limit @bootstrap.retry

PLAY RECAP *********************************************************************
coord1                     : ok=37   changed=14   unreachable=0    failed=0
worker1                    : ok=0    changed=0    unreachable=1    failed=0
worker2                    : ok=16   changed=6    unreachable=0    failed=0
worker3                    : ok=16   changed=6    unreachable=0    failed=0
worker4                    : ok=16   changed=6    unreachable=0    failed=0
worker5                    : ok=16   changed=6    unreachable=0    failed=0
worker6                    : ok=16   changed=6    unreachable=0    failed=0
worker7                    : ok=16   changed=11   unreachable=0    failed=0
worker8                    : ok=16   changed=6    unreachable=0    failed=0
worker9                    : ok=16   changed=6    unreachable=0    failed=0
workera                    : ok=16   changed=6    unreachable=0    failed=0
workerb                    : ok=16   changed=6    unreachable=0    failed=0
workerc                    : ok=16   changed=6    unreachable=0    failed=0
workerd                    : ok=16   changed=6    unreachable=0    failed=0
workere                    : ok=16   changed=6    unreachable=0    failed=0
workerf                    : ok=16   changed=9    unreachable=0    failed=0
workerg                    : ok=16   changed=6    unreachable=0    failed=0
workerh                    : ok=16   changed=6    unreachable=0    failed=0
workeri                    : ok=16   changed=6    unreachable=0    failed=0
workerj                    : ok=16   changed=6    unreachable=0    failed=0
workerk                    : ok=16   changed=6    unreachable=0    failed=0
workerl                    : ok=16   changed=6    unreachable=0    failed=0
workerm                    : ok=14   changed=5    unreachable=0    failed=1
workern                    : ok=14   changed=5    unreachable=0    failed=1
workero                    : ok=16   changed=7    unreachable=0    failed=0
workerp                    : ok=16   changed=7    unreachable=0    failed=0
workerq                    : ok=16   changed=7    unreachable=0    failed=0
workerr                    : ok=14   changed=5    unreachable=1    failed=0
workers                    : ok=14   changed=5    unreachable=1    failed=0
workert                    : ok=16   changed=7    unreachable=0    failed=0
workeru                    : ok=16   changed=7    unreachable=0    failed=0
workerv                    : ok=16   changed=7    unreachable=0    failed=0
workerw                    : ok=16   changed=11   unreachable=0    failed=0
workerx                    : ok=16   changed=11   unreachable=0    failed=0
workery                    : ok=16   changed=11   unreachable=0    failed=0
workerz                    : ok=0    changed=0    unreachable=1    failed=0

In an attempt to fix them I ran the following:

$ ansible-playbook bootstrap.yml \
    --limit @bootstrap.retry

That had a bit more success but I had a few instances that still weren't playing ball.

PLAY RECAP *********************************************************************
worker1                    : ok=0    changed=0    unreachable=1    failed=0
workerr                    : ok=16   changed=7    unreachable=0    failed=0
workers                    : ok=16   changed=7    unreachable=0    failed=0
workerz                    : ok=0    changed=0    unreachable=1    failed=0

It was at this point I regretted not baking an AMI for the worker nodes. Ansible took 30 minutes to run and didn't finish properly. Launching from an AMI normally takes two minutes and everything is already set up and good to go.

Generating 4.7M IP Addresses

I wrote a small management command in Django that would generate ~4.7 million IP addresses and store them in a database table. Each address would be tagged when it was assigned to a worker so that workers would only be given unassigned addresses.
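The real command isn't reproduced here, but its shape was along these lines. The LookupTarget model, its fields, the sampling stride and the reserved list below are all illustrative; only the ~4.7 million target and the rule of skipping reserved space come from the actual design.

import random

from django.core.management.base import BaseCommand
from ipaddress import ip_address, ip_network

from ips.models import LookupTarget  # hypothetical model

# Standard reserved ranges; the real command also skips some large,
# well-known class A allocations.
RESERVED = [ip_network(n) for n in (
    '0.0.0.0/8', '10.0.0.0/8', '127.0.0.0/8', '169.254.0.0/16',
    '172.16.0.0/12', '192.168.0.0/16', '224.0.0.0/4', '240.0.0.0/4')]


class Command(BaseCommand):
    help = 'Generate ~4.7M IPv4 addresses sampled across the address space'

    def handle(self, *args, **options):
        batch = []

        # One sample per 912-address stride gives roughly 4.7M rows spread
        # evenly across the 32-bit space.
        for base in range(0, 2 ** 32, 912):
            top = min(base + 911, 2 ** 32 - 1)
            addr = ip_address(random.randint(base, top))

            if any(addr in net for net in RESERVED):
                continue

            batch.append(LookupTarget(ip=str(addr), assigned=False))

            if len(batch) >= 10000:
                LookupTarget.objects.bulk_create(batch)
                batch = []

        LookupTarget.objects.bulk_create(batch)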

I SSH'ed into the coordinator to run this generator. Keep in mind the coordinator's public IP address was 54.229.76.227 and its private address (which is referenced a lot by the workers) was 172.30.0.172.

$ ssh -i ~/.ssh/ip_whois.pem ubuntu@54.229.76.227
$ screen
$ cd /home/ubuntu/ips && \
  source /home/ubuntu/.ips/bin/activate && \
  python manage.py gen_ips

The above is computationally expensive and took about 20 minutes to complete on the t2.small instance it ran on. In hindsight I'd have built this dataset once before launching the coordinator and pushed it up at launch time; generating it on the instance kept the cluster from getting started sooner.

Launching Coordinator Services

The reference WSGI server that Django's runserver uses doesn't do a good job of queuing requests the way nginx does. My hope with this experiment was that it would hold up well enough against all the worker nodes that it wouldn't warrant installing nginx, nor the time spent writing the Ansible scripts and configuration files to go with it. There were performance problems down the line, but they weren't strictly related to the reference WSGI server itself.

$ ansible coordinator \
    -m shell \
    -a 'bash -c "cd /home/ubuntu/ips &&
                 source /home/ubuntu/.ips/bin/activate &&
                 nohup python manage.py runserver 0.0.0.0:8000 &"'

I created a collect_whois management task for the coordinator to collect all the results from the Kafka 'results' topic and store the CIDR blocks in Redis. Below is the command I used to launch that task.

$ ansible coordinator \
    -m shell \
    -a 'bash -c "cd /home/ubuntu/ips &&
                 source /home/ubuntu/.ips/bin/activate &&
                 KAFKA_HOST=172.30.0.172:9092 \
                 nohup python manage.py collect_whois &"'
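The task itself boils down to a consume-and-store loop. A rough sketch, assuming the kafka-python client and the Redis set from the architecture sketch earlier; the real command may differ in its details.

import json
import os

import redis
from kafka import KafkaConsumer

r = redis.StrictRedis()
consumer = KafkaConsumer('results',
                         bootstrap_servers=os.environ['KAFKA_HOST'],
                         auto_offset_reset='earliest')

for message in consumer:
    result = json.loads(message.value.decode('utf-8'))
    cidr = result.get('Whois', {}).get('asn_cidr')

    # Only store real CIDR blocks; failed lookups come back without one.
    if cidr and cidr != 'NA':
        r.sadd('known_cidrs', cidr)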

Launching Worker Services

Celery didn't like being launched via an Ansible shell statement for some reason.

$ ansible worker \
    -m shell \
    -a 'bash -c "cd /home/ubuntu/ips &&
                 source /home/ubuntu/.ips/bin/activate &&
                 KAFKA_HOST=172.30.0.172:9092 \
                 nohup python manage.py celeryd --concurrency=30 &"'

This would have been the point to configure Supervisor scripts to run this and the other tasks, but in the interests of time I ran a hacky bash loop to launch celeryd on each worker node.

$ WORKER_IPS=$(aws ec2 describe-instances \
              --query 'Reservations[].Instances[].[PublicIpAddress]' \
              --output text |
              sort |
              uniq |
              grep -v None |
              grep -v '54.229.76.227')

$ for IP in $WORKER_IPS; do
      ssh -i ~/.ssh/ip_whois.pem \
          -o StrictHostKeyChecking=no \
          ubuntu@$IP \
          "cd /home/ubuntu/ips &&
                 source /home/ubuntu/.ips/bin/activate &&
                 KAFKA_HOST=172.30.0.172:9092 \
                 nohup python manage.py celeryd --concurrency=30 &" &
  done

The above would create ssh connections that would just hang after they launched celeryd so I had to kill them.

$ killall ssh

I hate leaving so much unpolished process lying around, but in the back of my mind I was worried that if the cluster didn't perform well, all the time spent building a perfect, automated deployment would have been for nothing.

The following launched the metrics reporter. It would communicate telemetry from each of the workers back to Kafka. I could then follow the 'metrics' topic and get a general idea of how the cluster was performing.

$ ansible worker \
    -m shell \
    -a 'bash -c "cd /home/ubuntu/ips &&
                 source /home/ubuntu/.ips/bin/activate &&
                 KAFKA_HOST=172.30.0.172:9092 \
                 nohup python manage.py celerybeat &"'
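Behind that command is a celerybeat-scheduled task that pushes each worker's counters onto the 'metrics' topic on a schedule. A sketch of the idea, again assuming kafka-python; the counter store, the one-minute schedule and the task wiring are assumptions, though the field names match the output further down.

import json
import os
import socket
from datetime import datetime

from celery import Celery
from kafka import KafkaProducer

app = Celery('ips', broker='redis://localhost:6379/0')
app.conf.beat_schedule = {
    'report-metrics': {'task': 'report_metrics', 'schedule': 60.0},
}

producer = KafkaProducer(bootstrap_servers=os.environ['KAFKA_HOST'])

# Incremented elsewhere in the worker, e.g. counters['Got WHOIS'] += 1.
counters = {}


@app.task(name='report_metrics')
def report_metrics():
    payload = dict(counters,
                   Host=socket.gethostbyname(socket.gethostname()),
                   Timestamp=datetime.utcnow().isoformat())
    producer.send('metrics', json.dumps(payload).encode('utf-8'))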

Finally I launched the management task that takes 1,000 IPs at a time from the coordinator and processes them.

$ ansible worker \
    -m shell \
    -a 'bash -c "cd /home/ubuntu/ips &&
                 source /home/ubuntu/.ips/bin/activate &&
                 COORDINATOR_ENDPOINT=http://172.30.0.172:8000/coordinator/ \
                 HIT_ENDPOINT=http://172.30.0.172:8000/coordinator/cidr-hit/ \
                 KAFKA_HOST=172.30.0.172:9092 \
                 nohup python manage.py get_ips_from_coordinator &"'
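Stripped of error handling, that task reduces to: fetch a batch of 1,000 IPs, skip anything already covered by a known CIDR block, and queue the rest for WHOIS. A sketch only; the JSON shapes returned by the two endpoints and the two helper functions are guesses, not the real API.

import os

import requests

COORDINATOR = os.environ['COORDINATOR_ENDPOINT']
HIT = os.environ['HIT_ENDPOINT']

while True:
    # Ask the coordinator for the next batch of unassigned IPs.
    ips = requests.get(COORDINATOR).json()

    if not ips:
        break

    for ip in ips:
        # Skip the lookup if another worker already found a CIDR block
        # containing this IP; this is far cheaper than querying a registry.
        hit = requests.get(HIT, params={'ip': ip}).json()

        if hit.get('within_known_cidr'):
            continue

        registry = resolve_registry(ip)  # hypothetical helper
        dispatch(ip, registry)           # per-registry queue, as sketched earlier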

Coordinator Under Siege

Once everything was launched I started following the 'metrics' topic.

$ ssh -i ~/.ssh/ip_whois.pem ubuntu@54.229.76.227 \
    "/tmp/kafka_2.11-0.8.2.1/bin/kafka-console-consumer.sh \
    --zookeeper localhost:2181 \
    --topic metrics \
    --from-beginning"

Within a few minutes I could see that some workers were getting on fine while others were struggling.

{"Awaiting Registry": 1, "Failed to lookup WHOIS": 48, "Got WHOIS": 544, "Host": "172.30.0.121", "Timestamp": "2016-04-26T17:35:00.048931", "Within Known CIDR Block": 208}
{"Awaiting Registry": 1, "Failed to lookup WHOIS": 49, "Got WHOIS": 552, "Host": "172.30.0.124", "Timestamp": "2016-04-26T17:35:00.046517", "Within Known CIDR Block": 243}
{"Awaiting Registry": 1, "Failed to lookup WHOIS": 50, "Got WHOIS": 535, "Host": "172.30.0.143", "Timestamp": "2016-04-26T17:35:00.057408", "Within Known CIDR Block": 186}
{"Awaiting Registry": 1, "Failed to lookup WHOIS": 50, "Got WHOIS": 554, "Host": "172.30.0.249", "Timestamp": "2016-04-26T17:35:00.008689", "Within Known CIDR Block": 242}
{"Awaiting Registry": 1, "Failed to lookup WHOIS": 51, "Got WHOIS": 584, "Host": "172.30.0.20", "Timestamp": "2016-04-26T17:35:00.047658", "Within Known CIDR Block": 202}
{"Awaiting Registry": 1, "Failed to lookup WHOIS": 54, "Got WHOIS": 481, "Host": "172.30.0.250", "Timestamp": "2016-04-26T17:35:00.058991", "Within Known CIDR Block": 212}
{"Awaiting Registry": 1, "Failed to lookup WHOIS": 54, "Got WHOIS": 525, "Host": "172.30.0.238", "Timestamp": "2016-04-26T17:35:00.057304", "Within Known CIDR Block": 243}
{"Awaiting Registry": 1, "Failed to lookup WHOIS": 54, "Got WHOIS": 525, "Host": "172.30.0.240", "Looking up WHOIS": 1, "Timestamp": "2016-04-26T17:35:00.033937", "Within Known CIDR Block": 253}
{"Awaiting Registry": 1, "Failed to lookup WHOIS": 56, "Got WHOIS": 511, "Host": "172.30.0.236", "Timestamp": "2016-04-26T17:35:00.058570", "Within Known CIDR Block": 222}
{"Awaiting Registry": 1, "Failed to lookup WHOIS": 57, "Got WHOIS": 492, "Host": "172.30.0.239", "Timestamp": "2016-04-26T17:35:00.058613", "Within Known CIDR Block": 184}
{"Awaiting Registry": 1, "Failed to lookup WHOIS": 57, "Got WHOIS": 545, "Host": "172.30.0.170", "Timestamp": "2016-04-26T17:35:00.058690", "Within Known CIDR Block": 213}
{"Awaiting Registry": 1, "Failed to lookup WHOIS": 57, "Got WHOIS": 569, "Host": "172.30.0.117", "Looking up WHOIS": 1, "Timestamp": "2016-04-26T17:35:00.062300", "Within Known CIDR Block": 219}
{"Awaiting Registry": 1, "Failed to lookup WHOIS": 59, "Got WHOIS": 566, "Host": "172.30.0.112", "Timestamp": "2016-04-26T17:35:00.054850", "Within Known CIDR Block": 229}
{"Awaiting Registry": 1, "Failed to lookup WHOIS": 60, "Got WHOIS": 515, "Host": "172.30.0.88", "Looking up WHOIS": 2, "Timestamp": "2016-04-26T17:35:00.022801", "Within Known CIDR Block": 238}
{"Awaiting Registry": 1, "Failed to lookup WHOIS": 60, "Got WHOIS": 561, "Host": "172.30.0.13", "Timestamp": "2016-04-26T17:35:00.038099", "Within Known CIDR Block": 210}
{"Awaiting Registry": 1, "Failed to lookup WHOIS": 61, "Got WHOIS": 520, "Host": "172.30.0.136", "Timestamp": "2016-04-26T17:35:00.057452", "Within Known CIDR Block": 237}
{"Awaiting Registry": 1, "Failed to lookup WHOIS": 61, "Got WHOIS": 553, "Host": "172.30.0.71", "Looking up WHOIS": 1, "Timestamp": "2016-04-26T17:35:00.034859", "Within Known CIDR Block": 259}
{"Awaiting Registry": 1, "Failed to lookup WHOIS": 8, "Got WHOIS": 89, "Host": "172.30.0.11", "Timestamp": "2016-04-26T17:35:00.047294", "Within Known CIDR Block": 54}
{"Host": "172.30.0.109", "Timestamp": "2016-04-26T17:35:00.059843"}
{"Host": "172.30.0.135", "Timestamp": "2016-04-26T17:35:00.058640"}
{"Host": "172.30.0.15", "Timestamp": "2016-04-26T17:35:00.060111"}
{"Host": "172.30.0.167", "Timestamp": "2016-04-26T17:35:00.059880"}
{"Host": "172.30.0.188", "Timestamp": "2016-04-26T17:35:00.060530"}
{"Host": "172.30.0.193", "Timestamp": "2016-04-26T17:35:00.028907"}
{"Host": "172.30.0.45", "Timestamp": "2016-04-26T17:35:00.059367"}
{"Host": "172.30.0.47", "Timestamp": "2016-04-26T17:35:00.059336"}
{"Host": "172.30.0.50", "Timestamp": "2016-04-26T17:35:00.059753"}
{"Host": "172.30.0.6", "Timestamp": "2016-04-26T17:35:00.059641"}
{"Host": "172.30.0.62", "Timestamp": "2016-04-26T17:35:00.059618"}
{"Host": "172.30.0.69", "Timestamp": "2016-04-26T17:35:00.059966"}
{"Host": "172.30.0.86", "Timestamp": "2016-04-26T17:35:00.060063"}

If a line didn't have at least a single "Awaiting Registry" entry then that worker hadn't even collected its initial 1,000 IP addresses. This turned out to be a problem: the coordinator was so busy with the CIDR-hit endpoint that connections to all the other Django endpoints were beginning to time out.

What's worse, an early version of the worker code would raise a timeout exception and then stop. Running the process via Supervisor would have meant it was restarted automatically, adding some resilience.
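For illustration, a Supervisor program entry along these lines would have brought the process back up whenever it died; the program name, log path and which task it wraps are illustrative.

[program:get_ips]
command=/home/ubuntu/.ips/bin/python manage.py get_ips_from_coordinator
directory=/home/ubuntu/ips
environment=KAFKA_HOST="172.30.0.172:9092",COORDINATOR_ENDPOINT="http://172.30.0.172:8000/coordinator/",HIT_ENDPOINT="http://172.30.0.172:8000/coordinator/cidr-hit/"
user=ubuntu
autostart=true
autorestart=true
stderr_logfile=/var/log/get_ips.err.log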

Here's the output of top on the coordinator at one point during this exercise.

top - 18:21:27 up  1:49,  1 user,  load average: 4.56, 2.84, 3.14
Tasks: 119 total,   3 running, 115 sleeping,   0 stopped,   1 zombie
%Cpu0  : 31.6 us,  4.4 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si, 64.0 st
KiB Mem:   2048516 total,  1922796 used,   125720 free,   110600 buffers
KiB Swap:        0 total,        0 used,        0 free.  1165020 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
31495 ubuntu    20   0  775088  83276   4760 S 76.9  4.1  56:28.14 /home/ubuntu/.ips/bin/python manage.py +
 4896 ubuntu    20   0 1905688 196972  12428 S  4.7  9.6   2:14.61 java -Xmx1G -Xms1G -server -XX:+UseParN+
 3524 ubuntu    20   0 1323452  95312  12324 S  3.3  4.7   0:49.79 /usr/lib/jvm/java-7-oracle/bin/java -Xm+
31517 ubuntu    20   0  392488  34884   4272 S  2.7  1.7   1:18.73 python manage.py collect_whois
...

And here's the output of top on a worker. As you can see it's underutilised.

top - 18:26:34 up  1:46,  1 user,  load average: 0.00, 0.01, 0.05
Tasks: 143 total,   2 running, 141 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.2 us,  0.0 sy,  0.0 ni, 99.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:   8175632 total,  1938020 used,  6237612 free,   111492 buffers
KiB Swap:        0 total,        0 used,        0 free.   655132 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
31170 ubuntu    20   0   23716   1864   1280 R   0.3  0.0   0:00.05 top
26735 ubuntu    20   0  105632   1880    896 S   0.0  0.0   0:00.00 sshd: ubuntu@notty
26737 ubuntu    20   0   11152    932    692 S   0.0  0.0   0:00.00 bash -c cd /home/ubuntu/ips &&...
26742 ubuntu    20   0  121116  39464   4404 S   0.0  0.5   0:06.02 python manage.py celeryd --concurrency=30
26747 ubuntu    20   0  426108  44460   3596 S   0.0  0.5   0:00.50 python manage.py celeryd --concurrency=30
...

I ended up killing a number of the less-productive spot instances to ease the burden on the coordinator.

If I end up coming back to this experiment I'd use Redis slaves to share the data among the worker nodes and have them calculate the CIDR hits without contacting the coordinator.

Collecting the Results

After two hours the cluster had collected 11,125 results. I terminated the cluster and examined what it had gathered.

$ ssh -i ~/.ssh/ip_whois.pem ubuntu@54.229.76.227

$ /tmp/kafka_2.11-0.8.2.1/bin/kafka-console-consumer.sh \
    --zookeeper localhost:2181 \
    --topic results \
    --from-beginning > results &

# Wait here till you see the results file stop growing.

$ gzip results

$ gunzip -c results.gz | wc -l
11154

Due to what I believe is a race condition (two workers can look up IPs from the same block before the coordinator has recorded it), the 11,154 results boiled down to just 7,435 unique CIDR blocks.

$ gunzip results.gz
$ ipython
import json


results = [json.loads(line)
           for line in open('results').read().split('\n')
           if line.strip()]

cidr = set([res['Whois']['asn_cidr']
            for res in results
            if 'Whois' in res and
               'asn_cidr' in res['Whois'] and
               res['Whois']['asn_cidr'] != 'NA'])

print len(cidr)
7435

Wikipedia states there are 592,708,864 reserved IP addresses in the IPv4 address space, leaving 3,702,258,432 for allocation. The unique CIDR blocks I collected represented 1,372,794,884 unique IP addresses.

from netaddr import *


print sum([IPNetwork(c).size
           for c in cidr if c])
1372794884
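Dividing that total by the 3,702,258,432 allocatable addresses gives the coverage figure quoted below.

print(1372794884 / 3702258432.0)  # roughly 0.37, i.e. 37%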

This means some 37% of the allocatable IPv4 address space is accounted for in the WHOIS records I've collected. I think that isn't too bad for a little under $2, two hours of run time and two days of my time.

There is a lot of room for improvement and many more ways this data can be collected so I might revisit this at some point.

