Building PostgreSQL Extensions with Rust

Please do check back with this post. I will update it as fixes come through.

Last year, I built a database of second-level domains from the reverse DNS names for 1.27 billion IPv4 addresses. For example, "company-name" would be extracted from "test.system.company-name.co.uk". I originally wrote the code for this in Python but last week I ported it to Rust and saw a 43x speed up.

I recently came across pgx. It's a framework that lets you build extensions for PostgreSQL using Rust. This means that a transformation process like the kind I did in the above posts can be largely wrapped up inside of SQL function calls.

The pgx project is about two years old and is made up of ~11K lines of Rust when you exclude tests and auto-generated code. There are a large number of examples to learn from within its code repository. Eric Ridge, the lead developer of pgx, has also produced video walkthroughs for eight of the examples as of this writing. This is one of the first projects where I've seen a section on "bad ideas" in its training literature.

In this blog post, I'll attempt to build a PostreSQL extension in Rust and use it to enrich 1.27B records.

PGX Up & Running

The system used in this blog post is running Ubuntu 20.04 LTS with 16 GB of RAM and 1 TB of SSD capacity. The CPU is a 4-core, Intel Core i5 4670K clocked at 3.4 GHz.

I'll use Rustup to install Rust version 1.54.0.

$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

I'll then install jq, pigz and a few tools that will be used to build PostgreSQL.

$ sudo apt update
$ sudo apt install \
    bison \
    flex \
    jq \
    libclang-dev \
    libreadline-dev \
    pigz \
    python3-pip

The following will install pgx.

$ cargo install cargo-pgx

Building a PostgreSQL Extension

I'll initialise pgx with the init command below. Among other tasks, it will download the source code for four different versions of PostgreSQL. As of this writing, they are version 13.4, 12.8, 11.13 and 10.18. These live in the ~/.pgx folder. You can choose any and all the above versions to test your extensions against. Each instance of PostgreSQL will run independently of the others.

$ cargo pgx init

I'll set up a new project for my extension.

$ cargo pgx new sld_ext
$ cd sld_ext

Below is what the Cargo file for this project looks like after I added tldextract as a dependency.

$ vi Cargo.toml

[package]
name = "sld_ext"
version = "0.0.0"
edition = "2018"

[lib]
crate-type = ["cdylib"]

[features]
default = ["pg13"]
pg10 = ["pgx/pg10", "pgx-tests/pg10" ]
pg11 = ["pgx/pg11", "pgx-tests/pg11" ]
pg12 = ["pgx/pg12", "pgx-tests/pg12" ]
pg13 = ["pgx/pg13", "pgx-tests/pg13" ]
pg_test = []

[dependencies]
pgx = "0.1.22"
pgx-macros = "0.1.22"
tldextract = "0.5.1"

[dev-dependencies]
pgx-tests = "0.1.22"

[profile.dev]
panic = "unwind"

[profile.release]
panic = "unwind"
opt-level = 3
lto = "fat"
codegen-units = 1

Below is the code for the PostgreSQL extension. A function called get_sld will accept a hostname and return its second-level domain.

$ vi src/lib.rs

use pgx::*;
use tldextract::{TldExtractor, TldOption};

pg_module_magic!();

#[pg_extern]
fn get_sld(hostname: &str) -> String {
    let https_domain = format!("https://{}", &hostname);

    let options = TldOption {
       cache_path:      Some(".tld_cache".to_string()),
       private_domains: false,
       update_local:    false,
       naive_mode:      false,
    };
    let tld_ex = TldExtractor::new(options);

    if let Ok(domain_parts) = tld_ex.extract(&https_domain) {
        domain_parts.domain.unwrap()
    } else {
        error!("Unable to extract SLD from {}", hostname)
    }
}

#[cfg(any(test, feature = "pg_test"))]
mod tests {
    use pgx::*;

    #[pg_test]
    fn test_get_sld() {
        assert_eq!("company-name",
                   crate::get_sld("test.system.company-name.co.uk"));
    }

}

#[cfg(test)]
pub mod pg_test {
    pub fn setup(_options: Vec<&str>) {
        // perform one-off initialization when the pg_test framework starts
    }

    pub fn postgresql_conf_options() -> Vec<&'static str> {
        // return any postgresql.conf settings that are required for your tests
        vec![]
    }
}

In a previous version of this post, there was a tokio-timer issue that has since been corrected by running cargo update.

$ cargo update

Updating crates.io index
Updating cc v1.0.69 -> v1.0.70
Updating proc-macro2 v1.0.28 -> v1.0.29
Updating siphasher v0.3.6 -> v0.3.7
Updating tokio v1.10.1 -> v1.11.0
Updating typenum v1.13.0 -> v1.14.0

The Rust code above contains a unit test. I'll run the test suite to make sure it runs without issue.

$ cargo pgx test pg13

test tests::pg_test_get_sld ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 10.61s

I'll build a release version of the extension and launch PostgreSQL 13.

$ RUSTFLAGS='-Ctarget-cpu=native' \
    cargo pgx run pg13 --release

The above produced a 5 MB binary that can be compressed with a 3:1 ratio with gzip and a 4:1 ratio with xz.

I'll create a new function in PostgreSQL that will make the extension callable.

CREATE OR REPLACE FUNCTION
    "get_sld" ("hostname" TEXT)
RETURNS TEXT STRICT
LANGUAGE c
    AS '/home/mark/.pgx/13.4/pgx-install/lib/postgresql/sld_ext.so',
    'get_sld_wrapper';

Below you can see it now appears in PostgreSQL's functions list.

\x on
\df+ get_sld

List of functions
-[ RECORD 1 ]-------+----------------
Schema              | public
Name                | get_sld
Result data type    | text
Argument data types | hostname text
Type                | func
Volatility          | volatile
Parallel            | unsafe
Owner               | mark
Security            | invoker
Access privileges   |
Language            | c
Source code         | get_sld_wrapper
Description         |

The following shows that the extension can produce an error properly within PostgreSQL.

\x off
SELECT get_sld('@');

ERROR:  Unable to extract SLD from @
CONTEXT:  src/lib.rs:21:9

Rapid7's Reverse DNS Dataset

The following will download the Reverse DNS (RDNS) dataset from Rapid7. The 11 GB GZIP-compressed, JSON line-delimited archive contains 1,242,695,760 lines and just over 125 GB of uncompressed data. Note, this is an updated dataset containing ~30M fewer records than the one I used in 2019.

$ wget -c -O rdns.json.gz \
    https://opendata.rapid7.com/sonar.rdns_v2/2021-07-28-1627430820-rdns.json.gz

This is what the first record in the archive looks like.

$ pigz -dc rdns.json.gz \
    | head -n1 \
    | jq

{
  "timestamp": "1627467007",
  "name": "1.120.175.74",
  "value": "cpe-1-120-175-74.4cbp-r-037.cha.qld.bigpond.net.au",
  "type": "ptr"
}

I found using PostgreSQL's JSON loading functionality to be fairly sensitive. I'll transform the JSON into CSV first and patch a few odd hostnames before loading everything into PostgreSQL. Below is the loading script.

$ vi loading.py

import json
import sys


for line in sys.stdin:
    rec = json.loads(line)
    if 'name' in rec and 'value' in rec:
        print(rec['name'] + ',' +
              rec['value'].replace('"', '')
                          .replace(',', '.'))

The above will produce an outlook that looks like the following.

1.120.175.74,cpe-1-120-175-74.4cbp-r-037.cha.qld.bigpond.net.au
1.120.175.75,cpe-1-120-175-75.4cbp-r-037.cha.qld.bigpond.net.au
1.120.175.76,cpe-1-120-175-76.4cbp-r-037.cha.qld.bigpond.net.au

I'll launch PostgreSQL 13 again, create a table to load the RDNS data into and measure the import duration.

$ RUSTFLAGS='-Ctarget-cpu=native' \
    cargo pgx run pg13 --release

CREATE TABLE rdns (
  ipv4      INET NOT NULL,
  hostname  TEXT
);
\timing on
\copy rdns FROM PROGRAM 'pigz -dc ~/rdns.json.gz | python3 loading.py' CSV;

The above took 1 hour, 38 minutes and 29 seconds to complete with a throughput rate of 21.7 MB/s. According to the following SQL, the 12 GB GZIP file containing 125 GB of uncompressed JSON ends up as a little more than 92 GBs in PostgreSQL's internal storage format.

SELECT   a.table_name,
         PG_RELATION_SIZE(quote_ident(a.table_name))
             AS num_bytes
FROM     information_schema.tables a
WHERE    table_schema = 'public'
ORDER BY num_bytes DESC;

Enrich using SQL

The table with the RDNS data looks like the following at the moment.

SELECT *
FROM   rdns
LIMIT  3;

     ipv4     |                      hostname
--------------+----------------------------------------------------
 1.120.175.74 | cpe-1-120-175-74.4cbp-r-037.cha.qld.bigpond.net.au
 1.120.175.75 | cpe-1-120-175-75.4cbp-r-037.cha.qld.bigpond.net.au
 1.120.175.76 | cpe-1-120-175-76.4cbp-r-037.cha.qld.bigpond.net.au

I'll create an enriched table where I'll store the IPv4 address along with the second-level domain.

CREATE TABLE rdns_enriched (
  ipv4      INET NOT NULL,
  hostname  TEXT
);

This is a limited enrichment SQL command. I'll see how quickly 10K records can be enriched so I can estimate how long 1.27B records could take.

INSERT INTO rdns_enriched (ipv4, hostname)
SELECT ipv4,
       get_sld(hostname)
FROM   rdns
LIMIT  10000;

The resulting dataset looks as follows.

SELECT *
FROM   rdns_enriched
LIMIT  3;

     ipv4     | hostname
--------------+----------
 1.120.175.74 | bigpond
 1.120.175.75 | bigpond
 1.120.175.76 | bigpond

The INSERT command finished in 20 minutes and 13 seconds. If that pace remained unchanged it could take 4.75 years to run the above on all 1.27B records. By comparison, the file-based version of this Rust project took a little over 33 minutes to complete.

I noticed during the execution that the process running the command never sat at more than 15% CPU utilisation according to htop. I've raised an issue to see if there are any performance improvements pgx can offer.

Thank you for taking the time to read this post. I offer both consulting and hands-on development services to clients in North America and Europe. If you'd like to discuss how my offerings can help your business please contact me via LinkedIn.