Posted on Wed 25 August 2021 under Rust

Faster Top Level Domain Name Extraction with Rust

Last year, I built a database of second-level domains from the reverse DNS names for 1.27 billion IPv4 addresses. I covered the steps I took to create the dataset in my Fast IPv4 to Host Lookups blog post. The source data came from Rapid7's Reverse DNS (RDNS) Study and is formatted as line-delimited JSON. I used a Python library called tldextract to extract the second-level domain from each record's full domain name. For example, "company-name" would be extracted from "test.system.company-name.co.uk".

The extraction process took a day. I raised a ticket to see if there were any obvious optimisations that could be used to speed up the process. None of the suggestions looked like they'd bring the processing time down from a day to minutes. I recently revisited the problem and came across a Rust library called tldextract-rs, whose author, Weiyuan Wu, states that he ported the Python-based tldextract to Rust.

In this post, I'll use the above Rust library to see if I can speed up the extraction process.

Rust Up & Running

The system used in this blog post is a step up from the one used in 2019. It has been upgraded to Ubuntu 20.04 LTS with 16 GB of RAM and 1 TB of SSD capacity. The CPU is still the same: a 4-core Intel Core i5-4670K clocked at 3.4 GHz.

I'll use Rustup to install Rust. At the time of writing, the latest stable release was version 1.54.0.

$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
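
The installer defaults to the latest stable toolchain. If a newer release has shipped since this post was written and you want to match the version used here, rustup can pin it; rustup default is a standard subcommand and will install the toolchain if it isn't already present.

$ rustup default 1.54.0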

I'll then install jq, OpenSSL's development files, pkg-config and pigz.

$ sudo apt update
$ sudo apt install \
    jq \
    libssl-dev \
    pkg-config \
    pigz

If you're using macOS, the following will install the above prerequisites via Homebrew.

$ brew install \
    coreutils \
    jq \
    openssl \
    pigz

Rapid7's Reverse DNS Dataset

The following will download the RDNS dataset from Rapid7. The 11 GB, GZIP-compressed, line-delimited JSON archive contains 1,242,695,760 lines and just over 125 GB of uncompressed data. Note that this is an updated dataset containing ~30M fewer records than the one I used in 2019.

$ wget -c https://opendata.rapid7.com/sonar.rdns_v2/2021-07-28-1627430820-rdns.json.gz
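
If you want to sanity-check the download, the following will stream-decompress the archive and count its lines. Expect it to take a while given there's 125 GB of JSON to scan through.

$ pigz -dc 2021-07-28-1627430820-rdns.json.gz | wc -l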

This is what the first record in the archive looks like.

$ pigz -dc 2021-07-28-1627430820-rdns.json.gz \
    | head -n1 \
    | jq
{
  "timestamp": "1627467007",
  "name": "1.120.175.74",
  "value": "cpe-1-120-175-74.4cbp-r-037.cha.qld.bigpond.net.au",
  "type": "ptr"
}
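
To preview what the extraction will produce for a hostname like the one above, below is a minimal, standalone sketch using the same tldextract 0.5.1 API that the full program further down relies on. Given this record's "value", I'd expect "bigpond" back as the domain.

use tldextract::{TldExtractor, TldOption};

fn main() {
    let options = TldOption {
       cache_path:      Some(".tld_cache".to_string()),
       private_domains: false,
       update_local:    false,
       naive_mode:      false,
    };
    let tld_ex = TldExtractor::new(options);

    // The extractor expects a URL, so the hostname gets a scheme prefix.
    let parts = tld_ex
        .extract("https://cpe-1-120-175-74.4cbp-r-037.cha.qld.bigpond.net.au")
        .unwrap();

    // "net.au" is a public suffix, so the registered label is "bigpond".
    assert_eq!(parts.domain, Some("bigpond".to_string()));
}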

The Rust code below will only max out a single core on my system. I'll split the JSON file up into four separate files so that I can run four processes, one for each file, at the same time. This will cause all four cores on my CPU to max out and the job should complete ~4x faster than it would otherwise.

$ pigz -dc 2021-07-28-1627430820-rdns.json.gz \
    | split --lines=310673940 \
            --filter="pigz > rdns_\$FILE.json.gz"

If you're running the above on macOS, replace split with gsplit. split substitutes $FILE with each output file's generated suffix, so the above produces rdns_xaa.json.gz through rdns_xad.json.gz.

A Data Transformer Built with Rust

I'll use Rust's build system and package manager "Cargo" to start a new project.

$ cargo new rdns
$ cd rdns/

I'll then add the four external packages this project depends on: flate2 for GZIP decompression, json for parsing records, structopt for command-line argument handling and tldextract for extracting domains.

$ vi Cargo.toml
[package]
name = "rdns"
version = "0.1.0"
edition = "2018"

[dependencies]
flate2 = "1.0.20"
json = "0.12.4"
structopt = "0.3.13"
tldextract = "0.5.1"

There is a single source code file in this project. Below is a CLI application that takes a file's path, opens it and iterates through it, parsing JSON from each line.

Each JSON record will contain a "name" attribute, which is an IPv4 address, and a "value" attribute, which is a full domain name. I'll prepend "https://" to each domain name as the parser expects a URL. If extraction succeeds and a domain is found, I'll print the integer value of the IPv4 address alongside the second-level domain.
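
To make the integer encoding concrete, here's a minimal, standalone sketch of the conversion using nothing beyond the standard library. The first record's address, 1.120.175.74, works out to (1 × 2^24) + (120 × 2^16) + (175 × 2^8) + 74 = 24,686,410.

use std::net::Ipv4Addr;
use std::str::FromStr;

fn main() {
    // Each octet occupies one byte of the 32-bit integer.
    let ipv4: u32 = Ipv4Addr::from_str("1.120.175.74").unwrap().into();
    assert_eq!(ipv4, 24_686_410);

    // The conversion round-trips back to the dotted-quad form.
    assert_eq!(Ipv4Addr::from(24_686_410u32).to_string(), "1.120.175.74");
}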

$ vi src/main.rs
use flate2::read::GzDecoder;
use json::JsonValue;
use std::fs::File;
use std::io::{self, prelude::*, BufReader, BufWriter};
use std::net::Ipv4Addr;
use std::str::FromStr;
use structopt::StructOpt;
use tldextract::{TldExtractor, TldOption};

#[derive(StructOpt)]
struct Cli {
    #[structopt(parse(from_os_str))]
    path: std::path::PathBuf,
}

fn main() -> io::Result<()> {
    let args       = Cli::from_args();
    let file       = File::open(&args.path)?;
    let mut reader = BufReader::new(GzDecoder::new(file));
    let mut buffer = BufWriter::new(io::stdout());
    let mut line   = String::new();

    // Cache the public suffix list on disk so repeated runs don't re-fetch it.
    let options = TldOption {
       cache_path:      Some(".tld_cache".to_string()),
       private_domains: false,
       update_local:    false,
       naive_mode:      false,
    };
    let tld_ex = TldExtractor::new(options);

    while reader.read_line(&mut line).unwrap_or(0) > 0 {
        let record: JsonValue = json::parse(&line).unwrap();

        // The extractor expects a URL, so prefix the hostname with a scheme.
        let https_domain = format!("https://{}", &record["value"].to_string());

        // Convert the dotted-quad address into its 32-bit integer form.
        let ipv4: u32 = Ipv4Addr::from_str(&record["name"].to_string())
                            .unwrap().into();

        if let Ok(domain_parts) = tld_ex.extract(&https_domain) {
            if let Some(domain) = domain_parts.domain {
                writeln!(buffer, "{},{}", ipv4, domain).unwrap();
            }
        }

        line.clear();
    }

    Ok(())
}

I'll build the above with optimisations enabled and then run it in four separate processes, one for each JSON file. xargs' -P4 flag keeps four processes running at any one time.

$ RUSTFLAGS='-Ctarget-cpu=native' \
    cargo build --release
$ ls ../rdns_*.json.gz \
    | xargs \
        -P4 \
        -n1 \
        -I {} \
        sh -c "target/release/rdns {} > {}.csv"

The above finished in 33 minutes and 3 seconds. The 125 GB of raw JSON was processed at a rate of ~64.5 MB/s. I watched htop while the above was running. Each process used around 11 MB of resident memory and all four CPU cores were maxed out during the run.

The combined size of the four output files is a little over 22 GB. Below is a sample of the CSV-formatted results.

$ head ../rdns_xaa.json.gz.csv
24686410,bigpond
24686411,bigpond
24686412,bigpond
24686413,bigpond
24686414,bigpond
24686415,bigpond
24686344,bigpond
24686416,bigpond
24686417,bigpond
24686418,bigpond

The original version of the above code took just over 48 minutes to complete, but thanks to some helpful feedback I received on lobste.rs, I was able to bring the processing time down considerably.

Concluding Thoughts

The dataset and hardware differences make it hard to do a direct comparison but, nonetheless, speeding up the process ~43x over what it took last year is fantastic.

The Rust script, albeit with no built-in multi-threading functionality, is only a few lines longer than the Python script I wrote last year. It easily fits on a single screen and I can read it without a great deal of cognitive overhead. The tooling around Rust that I have used works as I expected and error messages are helpful without being too verbose.

I've been coding in Python for more than a decade and it feels as native to me as the English language. With that being said, Rust's syntax is very readable and, overall, the language and tooling are agreeable with the way I work.

Thank you for taking the time to read this post. I offer both consulting and hands-on development services to clients in North America and Europe. If you'd like to discuss how my offerings can help your business please contact me via LinkedIn.
