Home | Benchmarks | Categories | Atom Feed

Posted on Sun 17 September 2023 under GIS

Enhancing ClickHouse's Geospatial Support

For much of this year, I've been replacing PostGIS with DuckDB's Spatial extension in many of my GIS workflows. For datasets that can be broken up into 500K-row chunks or less, the performance has been pretty good and there are many quality-of-life enhancements over other databases I've worked with in the past.

But often USA-wide datasets are over 100M rows. These row counts were never an issue for ClickHouse, even on modest hardware but GIS support is ClickHouse is much more limited than what PostGIS or DuckDB offer. The primary missing features were the ability to work with WKB and GeoJSON strings.

When representing geometry in text form, these three forms are the most common.

This is a WKT (well-known text) string:

POINT(11.912291 55.01179)

This is the same point in GeoJSON:

{"type": "Point", "coordinates": [11.912291, 55.01179]}

This is the same point in WKB (well-known binary):

0101000000cade52ce17d32740bd35b05582814b40

Earlier this week, Nabil Servais, a Data Tech Lead from Avignon, France, reached out to me stating he'd added support for WKB and GeoJSON to ClickHouse via his Clickgis extension. In this post, I'll examine his geospatial extension for ClickHouse.

A Rust Build Environment

I'm using a fresh install of Ubuntu 22.04 LTS with an Intel Core i5 4670K clocked at 3.4 GHz, 16 GB of DDR3 RAM and 2 TB of NVMe SSD capacity.

I'll set up a Rust build environment.

$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
$ source ~/.cargo/env

I'll then build Nabil's extension. The resulting binary, which was compiled in debug mode, is 21 MB.

$ git clone https://github.com/blackrez/clickgis ~/clickgis
$ cd ~/clickgis
$ cargo build
$ ls -lht target/debug/clickgis # 21 MB

I'll download and install ClickHouse version 23.9.1.951 and then copy over and configure Nabil's extension.

$ cd ~
$ curl --user-agent "curl/7.54.1" https://clickhouse.com/ | sh
$ sudo ./clickhouse install
$ sudo cp ~/clickgis/target/debug/clickgis \
          /var/lib/clickhouse/user_scripts/
$ sudo chown clickhouse:clickhouse \
             /var/lib/clickhouse/user_scripts/clickgis
$ sudo vi /var/lib/clickhouse/preprocessed_configs/clickgis_function.xml
<functions>
    <function>
        <type>executable_pool</type>
        <name>st_fromwkb</name>
        <return_type>String</return_type>
        <argument>
            <type>String</type>
            <name>value</name>
        </argument>
        <format>TabSeparated</format>
        <command>clickgis --function st_fromwkb</command>
    </function>
    <function>
        <type>executable</type>
        <name>st_asgeojson</name>
        <return_type>String</return_type>
        <argument>
            <type>String</type>
            <name>value</name>
        </argument>
        <format>TabSeparated</format>
        <command>clickgis --function st_asgeojson</command>
    </function>
</functions>
$ sudo chown clickhouse:clickhouse \
             /var/lib/clickhouse/preprocessed_configs/clickgis_function.xml

I won't launch ClickHouse Server as I'll be running ClickHouse in local mode.

$ cd ~/
$ sudo clickhouse \
        local \
        -- \
        --user_scripts_path=/var/lib/clickhouse/user_scripts \
        --user_defined_executable_functions_config=/var/lib/clickhouse/preprocessed_configs/clickgis_function.xml

Importing Places into ClickHouse

The Overture Foundation released a 'places' dataset in July which uses WKB within Parquet files to store geometry. There are 30 Parquet files for the places dataset which are between 260 and 290 MB in size each. I'll import the smallest of these into a table in ClickHouse.

CREATE TABLE places ENGINE=Log AS
    SELECT *
    FROM s3('https://overturemaps-us-west-2.s3.amazonaws.com/release/2023-07-26-alpha.0/theme=places/type=place/*70ddc600d953', '', '', 'Parquet');

The above imported 1,860,989 rows off of S3 in 51 seconds. The following is an example record. The geometry field is in binary so I'll use the hex function to convert it into a hexadecimal string.

SELECT id,
       updatetime,
       version,
       names,
       categories,
       confidence,
       websites,
       socials,
       emails,
       phones,
       brand,
       addresses,
       sources,
       bbox,
       hex(geometry)
FROM places
LIMIT 1;\G
id:            tmp_C19C32131FC4F3A4CFC09B65F994C1EB
updatetime:    2023-07-24T00:00:00.000
version:       0
names:         {'common':[{'value':'Vivians Hundesalon Vordingborg','language':'da'}]}
categories:    (NULL,[])
confidence:    0.6
websites:      []
socials:       []
emails:        []
phones:        []
brand:         ({},NULL)
addresses:     [{'locality':'Vordingborg','postcode':'4760','freeform':'Prins Jørgens Alle 3','region':'Zealand','country':'DK'}]
sources:       [{'dataset':'msft','property':'','recordid':'1125899910784139'}]
bbox:          (11.912291,11.912291,55.01179,55.01179)
hex(geometry): 0101000000CADE52CE17D32740BD35B05582814B40

New Geospatial Functions

The following is a function that can take a hexadecimal WKB string and convert it into a WKT string.

SELECT st_fromwkb('0101000000CADE52CE17D32740BD35B05582814B40') AS geom
┌─geom──────────────────────┐
│ POINT(11.912291 55.01179) │
└───────────────────────────┘

Below I'll cast the WKB data from the 1,860,989 records in the places table into GeoJSON. This operation took 23.226 seconds.

CREATE TABLE places_geojson ENGINE=Log AS
    SELECT st_asgeojson(hex(geometry)) geom
    FROM places;

This is an example record from the new table.

SELECT geom
FROM places_geojson
LIMIT 1;\G
{"type": "Point", "coordinates": [11.912291,55.01179]}

Made with Rust

The extension itself is made up of 11 lines of TOML and 40 lines of Rust. The geospatial functionality comes from the GeoZero crate. This extension is called via the command line and uses clap to handle any incoming arguments.

I'm blown away by how small and simple this extension is. Below is all the code from the src/main.rs file that was used in the above examples.

use geozero::wkb::{Wkb};
use geozero::wkt::{WktReader, WktWriter};
use clap::{arg, command};
use geozero::{ToWkt, ToGeo, ToJson};
use std::convert::TryFrom;
use std::io::{self, BufRead};

fn st_fromwkb(hexwkb: &str) -> String{
    let wkb = Wkb(hex::decode(hexwkb.to_string()).unwrap());
    return wkb.to_wkt().unwrap()
}

fn st_asgeojson(hexwkb: &str) -> String {
    let wkb = Wkb(hex::decode(hexwkb.to_string()).unwrap());
    return wkb.to_json().unwrap()
}

fn main() {
    let matches = command!()
        .arg(arg!(--function <VALUE>).required(true))
        .get_matches();

    let function = matches.get_one::<String>("function").expect("required");

    loop {
        let mut input = String::new();
        match io::stdin().read_line(&mut input) {
            Ok(n) => {
                if n == 0 {
                    // End files
                    break;
                }
                let mut result = match function.as_str() {
                    "st_asgeojson" => st_asgeojson(input.trim_end_matches('\n')),
                    "st_fromwkb" => st_fromwkb(input.trim_end_matches('\n')),
                    _ => panic!("not implemented"),
                };
                println!("{}", result);
            }
            Err(error) => {
                println!("Read error : {}", error);
                break;
            }
        }
    }
}

GeoZero is a mature library so I want to encourage Nabil and the wider open source geospatial community to continue working on this extension.

Thank you for taking the time to read this post. I offer both consulting and hands-on development services to clients in North America and Europe. If you'd like to discuss how my offerings can help your business please contact me via LinkedIn.

Copyright © 2014 - 2024 Mark Litwintschik. This site's template is based off a template by Giulio Fidente.