The code base discussed in this blog can be found on GitHub.
A few days ago I put together an IPv4 WHOIS crawler using Django, Redis and Kafka and launched it on what should have been a 51-node cluster on AWS EC2. While the cluster ran I was able to identify several shortcomings of the code and style of execution. Nonetheless, I could also see that there were good performance characteristics that could be further improved upon. These findings are discussed throughout my blog post "Mass IP Address WHOIS Collection with Django & Kafka".
Yesterday I sat down for a few hours and made some architectural changes to the code base and took it for another spin.
Architectural Improvements
First, each worker now finds out on its own machine whether the IP address it's looking up falls within a CIDR block that has already been crawled. This means the master node isn't performing CPU-intensive lookups on behalf of 50 worker nodes. To do this, each worker speaks to a local Redis instance that is a slave of the Redis instance on the coordinator. When a WHOIS query is successfully completed, the worker node passes the result to Kafka. The coordinator pulls out every unique CIDR block it sees in Kafka and stores them all as a single string value in Redis. Redis then replicates that key across all the slave nodes, and that key is used in the CIDR hit calculations on each worker.
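For illustration, here's a minimal sketch of what that per-worker check might look like. This isn't the actual code from the repository: the 'cidrs' key name is taken from the telemetry section later in this post, while the space-separated encoding and the function name are assumptions.

import redis
from netaddr import IPAddress, IPNetwork

# Local Redis slave on the worker; its data is replicated from the coordinator.
local_redis = redis.StrictRedis(host='127.0.0.1', port=6379)

def within_known_cidr_block(ip):
    # Assumes the coordinator stores the blocks as one space-separated string.
    raw = local_redis.get('cidrs') or b''
    blocks = raw.decode('utf-8').split()
    return any(IPAddress(ip) in IPNetwork(cidr) for cidr in blocks)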
Second, all the Django-based processes on the worker nodes now run via Supervisor. If a process exits with an exception, a reasonable number of attempts are made to restart the process. If the exception is a one-off or a rare occurrence then the worker node can continue to be productive rather than just sit idle.
Third, all worker nodes pull their configuration settings from Redis. I can set the configuration keys via a management command on the coordinator and Redis will replicate them to each worker's Redis instance. Workers are designed to wait and try again if they can't get the Redis key with the HTTP endpoint for the coordinator and/or the Kafka host. When they can see both of those values they will then begin working. This makes deployment of the workers a lot easier as I don't need to know any configuration settings for them in advance.
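A rough sketch of that start-up behaviour could look like the following; the key names coordinator_endpoint and kafka_host are assumptions, as the actual key names aren't given in this post.

import time
import redis

local_redis = redis.StrictRedis(host='127.0.0.1', port=6379)

def wait_for_config():
    # Block until both settings have been replicated from the coordinator.
    while True:
        endpoint = local_redis.get('coordinator_endpoint')  # assumed key name
        kafka_host = local_redis.get('kafka_host')          # assumed key name
        if endpoint and kafka_host:
            return endpoint, kafka_host
        time.sleep(5)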
Fourth, I've generated the seed list of 4.7 million IPv4 addresses in advance in an SQLite3 file and I push it to the coordinator after the coordinator has been deployed. This saves me time getting the coordinator up and running and gets the workers working sooner.
Fifth, I've created a management command to display aggregated telemetry so I can see overall progress when the cluster is running.
1 Coordinator, 1 Worker, Up & Running
To start I'll add a rule to the ip-whois-sg security group to allow all EC2 instances within the group to speak to one another on Redis' port 6379.
Then I launched two on-demand instances using the ami-f95ef58a Ubuntu 14.04 LTS image. The first instance is a t2.small for the coordinator. It has the public and private IP addresses of 54.171.53.151 and 172.30.0.239 respectively.
The second instance I launched was an on-demand t2.medium instance with the public IP address of 54.171.49.114. This instance will be set up as a worker, an AMI image will be baked from it and then the instance will be terminated. The AMI image will then be used to launch 50 spot instances.
The last time I set up this cluster I used Ansible to provision each worker and a number of them didn't provision properly even though multiple attempts were made. Not only is using an AMI more reliable, it's much faster than Ansible and its thousands of SSH connections.
With those two instances launched I created a devops/inventory file.
[coordinator]
coord1 ansible_host=54.171.53.151 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
[worker]
worker1 ansible_host=54.171.49.114 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
I then ran two SSH commands to add their ECDSA key fingerprints to my list of known hosts.
$ ssh -i ~/.ssh/ip_whois.pem \
-o StrictHostKeyChecking=no \
ubuntu@54.171.53.151 \
"test"
$ ssh -i ~/.ssh/ip_whois.pem \
-o StrictHostKeyChecking=no \
ubuntu@54.171.49.114 \
"test"
I then zipped up the code base so that Ansible would be able to deploy it to the instances.
$ zip -r \
app.zip \
ips/ *.txt \
-x *.sqlite3 \
-x *.pid \
-x *.pyc
With the zip file in place I was then able to run the Ansible-based bootstrap script.
$ cd devops
$ ansible-playbook bootstrap.yml
When that completed I checked that the Supervisor-managed processes were running on the worker. It's important that these three processes start properly when a worker instance boots up. When Supervisor is installed via apt it installs scripts to start itself when the machine is launched. Then, if the virtual environment wrapper script works properly and the code base is in place, each of the three processes should launch correctly and consistently.
$ ssh -i ~/.ssh/ip_whois.pem \
ubuntu@54.171.49.114 \
'sudo supervisorctl status'
celerybeat RUNNING pid 26436, uptime 0:01:08
celeryd RUNNING pid 26435, uptime 0:01:08
get_ips_from_coordinator RUNNING pid 26437, uptime 0:01:08
The three processes above are configured for Supervisor in the devops/config/worker.supervisor.conf file. Those processes are:
- get_ips_from_coordinator takes batches of 1,000 IPv4 addresses from the coordinator, checks whether each IP address falls within an already-crawled CIDR block and, if it doesn't, finds the registry for that address and queues up the WHOIS lookup (a sketch of this loop follows the list).
- celeryd runs the celery queues that look up WHOIS details on each of the five registries.
- celerybeat will feed telemetry back to Kafka that will be picked up by the coordinator node.
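To make the get_ips_from_coordinator flow concrete, here is a loose sketch of what one cycle might look like. The batch size and the overall flow come from this post; the endpoint, the lookup_whois Celery task and the reuse of the within_known_cidr_block helper from the earlier sketch are assumptions.

import requests

from ips.tasks import lookup_whois  # hypothetical Celery task

def process_batch(coordinator_endpoint):
    # Fetch a batch of 1,000 IPv4 addresses from the coordinator.
    batch = requests.get(coordinator_endpoint).json()

    for ip in batch:
        if within_known_cidr_block(ip):  # CIDR hit check sketched earlier
            continue                     # already covered by a crawled block
        # Registry detection omitted; the real code finds the registry here
        # and queues the lookup on that registry's Celery queue.
        lookup_whois.delay(ip)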
Below is the supervisor configuration file.
[program:celeryd]
autorestart=true
autostart=true
command=/home/ubuntu/.ips/bin/exec python manage.py celeryd --concurrency=30
directory=/home/ubuntu/ips
redirect_stderr=True
startsecs=10
stdout_logfile=/home/ubuntu/celeryd.log
stopasgroup=true
stopsignal=KILL
stopwaitsecs=60
user=ubuntu
[program:celerybeat]
autorestart=true
autostart=true
command=/home/ubuntu/.ips/bin/exec python manage.py celerybeat
directory=/home/ubuntu/ips
redirect_stderr=True
startsecs=10
stdout_logfile=/home/ubuntu/celerybeat.log
stopasgroup=true
stopsignal=KILL
stopwaitsecs=60
user=ubuntu
[program:get_ips_from_coordinator]
autorestart=true
autostart=true
command=/home/ubuntu/.ips/bin/exec python manage.py get_ips_from_coordinator
directory=/home/ubuntu/ips
redirect_stderr=True
startsecs=10
stdout_logfile=/home/ubuntu/get_ips_from_coordinator.log
stopasgroup=true
stopsignal=KILL
stopwaitsecs=60
user=ubuntu
With the worker behaving as expected I baked an AMI image called 'worker' and terminated the on-demand instance.
Pre-generated IPv4 Seed List
To avoid spending 25 minutes running a CPU-intensive IPv4 generation task on the coordinator, I ran the gen_ips management command on my own, more powerful local machine.
$ python manage.py gen_ips
I then compressed the 109 MB db.sqlite3 database file my local instance of Django was using, uploaded it to the coordinator and decompressed it in place ready to go.
$ gzip db.sqlite3
$ scp -i ~/.ssh/ip_whois.pem \
db.sqlite3.gz \
ubuntu@54.171.53.151:/home/ubuntu/ips/
$ cd devops
$ ansible coordinator \
-m shell \
-a 'bash -c "cd /home/ubuntu/ips &&
gunzip -f db.sqlite3.gz"'
I then checked the file was where I expected it to be and had been restored to its original 109 MB size.
$ ssh -i ~/.ssh/ip_whois.pem \
ubuntu@54.171.53.151 \
'ls -lh ips/db.sqlite3'
-rw-r--r-- 1 ubuntu ubuntu 109M Apr 29 18:55 ips/db.sqlite3
Launching 50 EC2 Spot Instances
With the coordinator already up I now needed to launch a cluster of 50 worker spot instances. The smallest type of spot instance I can launch is the m4.large. I bid a maximum of $0.02 / hour for each instance, bringing my total cluster cost, including the on-demand coordinator, to a maximum of $1.028 / hour.
When I requested the spot instances I asked that they use the 'worker' AMI image I had baked. That way each of the spot instances would launch with all their software already in place and Supervisor can launch the three processes they need to run automatically on boot.
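For reference, a spot request like this could also be made programmatically with boto3 along the following lines; the AMI ID and key pair name are placeholders and the exact launch specification is an assumption.

import boto3

ec2 = boto3.client('ec2')

# Request 50 m4.large spot instances based on the baked 'worker' AMI.
ec2.request_spot_instances(
    SpotPrice='0.02',
    InstanceCount=50,
    LaunchSpecification={
        'ImageId': 'ami-xxxxxxxx',         # placeholder for the 'worker' AMI
        'InstanceType': 'm4.large',
        'KeyName': 'ip_whois',             # assumed key pair name
        'SecurityGroups': ['ip-whois-sg'],
    })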
Within two minutes all of my spot instances had been provisioned and were running. I then collected the public IP addresses of each of the worker instances and added their ECDSA key fingerprints to my list of known hosts.
$ WORKER_IPS=$(aws ec2 describe-instances \
--query 'Reservations[].Instances[].[PublicIpAddress]' \
--output text |
sort |
uniq |
grep -v None |
grep -v '54.171.53.151')
$ for IP in $WORKER_IPS; do
ssh -i ~/.ssh/ip_whois.pem \
-o StrictHostKeyChecking=no \
ubuntu@$IP \
"test" &
done
I then rewrote my devops/inventory file, replacing the original worker entry with the 50 new workers. I wish I'd written a fancy script for this task (something like the sketch after the listing below would have done the job) but instead I used some search/replace and column editing in my text editor.
Here is the resulting devops/inventory file:
[coordinator]
coord1 ansible_host=54.171.53.151 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
[worker]
worker1 ansible_host=54.171.109.146 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker2 ansible_host=54.171.109.215 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker3 ansible_host=54.171.109.55 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker4 ansible_host=54.171.114.203 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker5 ansible_host=54.171.115.48 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker6 ansible_host=54.171.118.226 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker7 ansible_host=54.171.119.195 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker8 ansible_host=54.171.120.62 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker9 ansible_host=54.171.129.29 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker0 ansible_host=54.171.139.137 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workera ansible_host=54.171.142.194 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerb ansible_host=54.171.152.199 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerc ansible_host=54.171.158.140 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerd ansible_host=54.171.159.0 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workere ansible_host=54.171.174.252 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerf ansible_host=54.171.175.16 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerg ansible_host=54.171.175.180 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerh ansible_host=54.171.175.225 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workeri ansible_host=54.171.176.62 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerj ansible_host=54.171.177.14 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerk ansible_host=54.171.177.213 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerl ansible_host=54.171.208.177 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerm ansible_host=54.171.209.128 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workern ansible_host=54.171.210.135 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workero ansible_host=54.171.210.4 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerp ansible_host=54.171.212.94 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerq ansible_host=54.171.222.148 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerr ansible_host=54.171.222.249 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workers ansible_host=54.171.224.201 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workert ansible_host=54.171.226.27 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workeru ansible_host=54.171.51.109 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerv ansible_host=54.171.51.188 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerw ansible_host=54.171.52.148 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerx ansible_host=54.171.52.212 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workery ansible_host=54.171.54.52 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
workerz ansible_host=54.171.55.140 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker11 ansible_host=54.171.56.152 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker12 ansible_host=54.171.57.251 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker13 ansible_host=54.171.69.208 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker14 ansible_host=54.171.69.4 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker15 ansible_host=54.171.70.196 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker16 ansible_host=54.171.71.153 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker17 ansible_host=54.171.71.156 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker18 ansible_host=54.171.74.186 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker19 ansible_host=54.171.74.205 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker20 ansible_host=54.171.74.34 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker21 ansible_host=54.171.74.92 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker23 ansible_host=54.171.81.69 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker24 ansible_host=54.171.82.207 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
worker25 ansible_host=54.171.83.114 ansible_user=ubuntu ansible_private_key_file=~/.ssh/ip_whois.pem
Please ignore the naming strategy; it's not something I thought through.
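For what it's worth, a throwaway script to generate the worker entries might look something like this; it assumes the addresses gathered by the WORKER_IPS snippet earlier have been saved to a worker_ips.txt file.

# Append a worker entry per public IP address to the Ansible inventory.
worker_ips = open('worker_ips.txt').read().split()

with open('devops/inventory', 'a') as inventory:
    for num, ip in enumerate(worker_ips, start=1):
        inventory.write(
            'worker%d ansible_host=%s ansible_user=ubuntu '
            'ansible_private_key_file=~/.ssh/ip_whois.pem\n' % (num, ip))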
Coordinator Services Up & Running
I've created a Django management command that lets me set the configuration the cluster needs by doing nothing more than passing in the coordinator's private IP address. That value is then used to set multiple Redis key/value pairs.
$ ansible coordinator \
-m shell \
-a 'bash -c "cd /home/ubuntu/ips &&
source /home/ubuntu/.ips/bin/activate &&
python manage.py set_config 172.30.0.239"'
I then launched the reference WSGI web server and the collect_whois process that monitors Kafka and builds the Redis key of all the unique CIDR blocks seen across the successful WHOIS queries.
$ ansible coordinator \
-m shell \
-a 'bash -c "cd /home/ubuntu/ips &&
source /home/ubuntu/.ips/bin/activate &&
nohup python manage.py runserver 0.0.0.0:8000 &"'
$ ansible coordinator \
-m shell \
-a 'bash -c "cd /home/ubuntu/ips &&
source /home/ubuntu/.ips/bin/activate &&
nohup python manage.py collect_whois &"'
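A loose sketch of what collect_whois does might look like the following, assuming the kafka-python package and a space-separated encoding for the 'cidrs' value; only the topic name, the Whois/asn_cidr fields and the key name come from this post, the rest is assumption.

import json

import redis
from kafka import KafkaConsumer

master_redis = redis.StrictRedis(host='127.0.0.1', port=6379)
consumer = KafkaConsumer('results', bootstrap_servers='localhost:9092')

seen_cidrs = set()

for message in consumer:
    result = json.loads(message.value)
    cidr = result.get('Whois', {}).get('asn_cidr')

    if cidr and cidr != 'NA' and cidr not in seen_cidrs:
        seen_cidrs.add(cidr)
        # Store every unique block in one string value; Redis then replicates
        # the whole key to each worker's slave instance.
        master_redis.set('cidrs', ' '.join(sorted(seen_cidrs)))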
With those in place I'll tell each Redis instance across the cluster of worker nodes the private IP address of the Redis master. The worker nodes are already up and running but won't begin to work until they can collect their configuration from their local Redis instance. The master Redis instance already has these configuration keys in place and once they have been replicated to the slaves the workers will get started.
$ ansible worker \
-m shell \
-a "echo 'slaveof 172.30.0.239 6379' | redis-cli"
Cluster Telemetry
I have two primary commands that report back on the progress the cluster is making. The first shows the per-minute, per-node telemetry which can be found simply by following the 'metrics' Kafka topic.
$ ssh -i ~/.ssh/ip_whois.pem \
ubuntu@54.171.53.151 \
"/tmp/kafka_2.11-0.8.2.1/bin/kafka-console-consumer.sh \
--zookeeper localhost:2181 \
--topic metrics \
--from-beginning"
Here is an example output line (formatted and key-sorted for clarity).
{
"Host": "172.30.0.12",
"Timestamp": "2016-04-29T19:09:59.575451",
"Within Known CIDR Block": 93,
"Awaiting Registry": 1,
"Found Registry": 135,
"Looking up WHOIS": 10,
"Got WHOIS": 191,
"Failed to lookup WHOIS": 10
}
The second command collects the latest telemetry from each individual host seen in the 'metrics' topic and sums the values of each reported metric. This lets me see a running total of the cluster's overall performance.
$ ssh -i ~/.ssh/ip_whois.pem \
ubuntu@54.171.53.151 \
"cd /home/ubuntu/ips &&
source /home/ubuntu/.ips/bin/activate &&
python manage.py telemetry"
Here is an example output line (formatted and key-sorted for clarity).
{
"Within Known CIDR Block": 1953,
"Awaiting Registry": 47,
"Found Registry": 2303,
"Looking up WHOIS": 378,
"Got WHOIS": 4080,
"Failed to lookup WHOIS": 128
}
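As a rough illustration, the aggregation behind that management command could be implemented along these lines with kafka-python; the per-host "latest message wins" behaviour is described above, everything else here is an assumption.

import json
from collections import Counter

from kafka import KafkaConsumer

consumer = KafkaConsumer('metrics',
                         bootstrap_servers='localhost:9092',
                         auto_offset_reset='earliest',
                         consumer_timeout_ms=5000)

# Keep only the most recent metrics payload seen for each host.
latest_by_host = {}

for message in consumer:
    metrics = json.loads(message.value)
    latest_by_host[metrics.pop('Host')] = metrics

# Sum every metric across hosts, ignoring the timestamps.
totals = Counter()

for metrics in latest_by_host.values():
    for key, value in metrics.items():
        if key != 'Timestamp':
            totals[key] += value

print(dict(totals))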
In the previous deployment of this cluster the coordinator was under heavy load from performing CPU-intensive CIDR hit calculations on behalf of all the worker nodes. I've since moved that task onto each of the worker nodes themselves. 45 minutes after the cluster was launched I ran top on the coordinator and one of the workers to see how much pressure they were under.
The following is from the coordinator. As you can see it's pretty quiet.
top - 19:46:17 up 1:14, 1 user, load average: 0.04, 0.18, 0.25
Tasks: 108 total, 2 running, 105 sleeping, 0 stopped, 1 zombie
%Cpu0 : 4.5 us, 2.7 sy, 0.0 ni, 88.7 id, 0.7 wa, 0.0 hi, 2.7 si, 0.7 st
KiB Mem: 2048516 total, 1915612 used, 132904 free, 131260 buffers
KiB Swap: 0 total, 0 used, 0 free. 940036 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
29070 redis 20 0 538276 290828 984 S 2.2 14.2 0:59.54 /usr/bin/redis-server 0.0.0.0:6379
4960 ubuntu 20 0 1902972 257788 12408 S 2.9 12.6 1:46.59 java -Xmx1G -Xms1G -server -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:+CMSScavengeBeforeRemark -XX:+DisableExplicitGC -Djava+
30391 ubuntu 20 0 654672 46840 4700 S 19.7 2.3 3:35.93 /home/ubuntu/.ips/bin/python manage.py runserver 0.0.0.0:8000
29179 zookeep+ 20 0 1244636 42356 11180 S 0.0 2.1 0:03.56 /usr/bin/java -cp /etc/zookeeper/conf:/usr/share/java/jline.jar:/usr/share/java/log4j-1.2.jar:/usr/share/java/xercesImpl.jar:/usr/share/java/xmlParserAPIs.j+
11192 rabbitmq 20 0 594416 40704 2464 S 0.0 2.0 0:06.21 /usr/lib/erlang/erts-5.10.4/bin/beam -W w -K true -A30 -P 1048576 -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq+
30415 ubuntu 20 0 393460 37320 4264 S 3.2 1.8 1:48.28 python manage.py collect_whois
30386 ubuntu 20 0 84720 26176 4020 S 0.0 1.3 0:00.20 python manage.py runserver 0.0.0.0:8000
...
Here is one of the workers. It's using a fair amount of CPU but not an excessive amount. The networking and CPU loads are now better balanced.
top - 19:47:08 up 45 min, 1 user, load average: 1.37, 1.04, 0.87
Tasks: 146 total, 3 running, 143 sleeping, 0 stopped, 0 zombie
%Cpu0 : 30.7 us, 0.7 sy, 0.0 ni, 67.7 id, 1.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 59.2 us, 0.0 sy, 0.0 ni, 40.5 id, 0.3 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 8175632 total, 1550272 used, 6625360 free, 136020 buffers
KiB Swap: 0 total, 0 used, 0 free. 242072 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1208 rabbitmq 20 0 1196892 111400 2564 S 0.7 1.4 0:21.47 /usr/lib/erlang/erts-5.10.4/bin/beam.smp -W w -K true -A30 -P 1048576 -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rab+
1460 ubuntu 20 0 120696 45008 4944 R 88.7 0.6 28:55.69 python manage.py get_ips_from_coordinator
1593 ubuntu 20 0 430460 44828 4160 S 0.0 0.5 0:01.47 python manage.py celeryd --concurrency=30
...
The Redis key that stores the list of CIDR blocks will keep growing as the cluster works its way through the workload. After an hour and 20 minutes of running the key's value had reached 700,771 bytes in length.
$ ssh -i ~/.ssh/ip_whois.pem \
ubuntu@54.171.53.151 \
'echo "GET cidrs" |
redis-cli |
wc -c'
700771
Every time the coordinator updates the 'cidrs' key, Redis replicates it to all 50 slaves. The more CIDR blocks in that list, the longer it takes each worker node to determine whether the IP address it's about to look up is already covered. I expected the CPU and network usage to eventually grow out of hand, but the AWS CloudWatch charts showed the CPU on the worker nodes plateauing around 50% on average, and the network usage, despite sending ~700 KB of data to each of the 50 machines (roughly 35 MB of replication traffic) after every successful WHOIS lookup, remained relatively low.
Getting Through The Workload
After an hour and 15 minutes the telemetry management command was reporting a great deal of progress.
{
"Within Known CIDR Block": 133181,
"Awaiting Registry": 49,
"Found Registry": 5447,
"Looking up WHOIS": 946,
"Got WHOIS": 103091,
"Failed to lookup WHOIS": 14721
}
257,435 of the 4.7 million seed IPv4 addresses had either been processed or were being processed, and 51% of those required no external WHOIS lookup at all. These performance metrics gave me a lot of confidence that if I were to spin up 200 spot instances as worker nodes they could reliably perform their tasks and get through the lion's share of the work in 3 to 4 hours.
I did spot-check the error log on one of the workers. AFRINIC and LACNIC were no longer responding to WHOIS requests but the other three registries were responding well.
$ tail -n12 ~/celeryd.log
... HTTP lookup failed for http://rdap.afrinic.net/rdap/ip/154.126.120.129.
... HTTP lookup failed for http://rdap.afrinic.net/rdap/ip/196.36.216.193.
... HTTP lookup failed for http://rdap.lacnic.net/rdap/ip/200.0.36.65.
... HTTP lookup failed for http://rdap.afrinic.net/rdap/ip/102.16.144.65.
... HTTP lookup failed for http://rdap.afrinic.net/rdap/ip/102.229.204.65.
... HTTP lookup failed for http://rdap.afrinic.net/rdap/ip/102.240.12.193.
... HTTP lookup failed for http://rdap.lacnic.net/rdap/ip/177.28.132.1.
... HTTP lookup failed for http://rdap.lacnic.net/rdap/ip/187.248.216.1.
... HTTP lookup failed for http://rdap.lacnic.net/rdap/ip/186.230.24.193.
... HTTP lookup failed for http://rdap.lacnic.net/rdap/ip/186.61.132.65.
... HTTP lookup failed for http://rdap.lacnic.net/rdap/ip/187.57.48.129.
... ASN registry lookup failed.
Two Hour Cut-Off
My intention with this cluster was to see if I could improve both the reliability of the code and the performance seen in the previous run. I decided to shut down the cluster before the two-hour mark and collect the IPv4 WHOIS results. At an hour and 45 minutes, just before I shut down the cluster, this was the output from the telemetry command:
{
"Within Known CIDR Block": 176524,
"Awaiting Registry": 49,
"Found Registry": 65,
"Looking up WHOIS": 527,
"Got WHOIS": 128541,
"Failed to lookup WHOIS": 18254
}
Collecting the Results
I ran the following to collect the WHOIS results off the coordinator.
$ ssh -i ~/.ssh/ip_whois.pem \
ubuntu@54.171.53.151
$ /tmp/kafka_2.11-0.8.2.1/bin/kafka-console-consumer.sh \
--zookeeper localhost:2181 \
--topic results \
--from-beginning > results &
# Wait here till you see the results file stop growing.
$ gzip results
The results file is 240 MB when uncompressed and contains 129,183 lines of WHOIS results in line-delimited JSON format.
How much IPv4 space was covered?
I ran two calculations on the data. The first was to find how many distinct CIDR blocks were successfully collected. The answer is 50,751.
import json

# Load the line-delimited JSON results, skipping any blank lines.
results = [json.loads(line)
           for line in open('results').read().split('\n')
           if line.strip()]

# Collect the distinct CIDR blocks from the successful WHOIS lookups.
cidr = set([res['Whois']['asn_cidr']
            for res in results
            if 'Whois' in res and
               'asn_cidr' in res['Whois'] and
               res['Whois']['asn_cidr'] != 'NA'])

print len(cidr)
50751
I then wanted to see how many distinct IPv4 addresses this represented and what proportion of the non-reserved IPv4 address space is covered.
from netaddr import IPNetwork

# Sum the sizes of the distinct CIDR blocks to count distinct IPv4 addresses.
print sum([IPNetwork(c).size
           for c in cidr if c])
2390992225
The cluster managed to find WHOIS details on CIDR blocks representing 2,390,992,225 distinct IPv4 addresses covering over 64% of the entire non-reserved IPv4 address space.
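As a quick sanity check on that figure, dividing the address count by a rough value for the non-reserved IPv4 address space (the ~3.7 billion figure below is an assumed ballpark, not an exact count) gives a little under two thirds.

print(2390992225 / 3.7e9)  # ~0.646, i.e. a little over 64%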