Home | Benchmarks | Categories | Atom Feed

Posted on Sun 09 April 2023 under GIS

DuckDB's Spatial Extension

Max Gabrielsson recently reached out to me announcing that he was working on a DuckDB extension that would allow it to read and write to over 50 geospatial file formats natively. Max is originally from Sweden but now lives and works as a Software Engineer for DuckDB Labs in Amsterdam.

DuckDB's Spatial extension is made up of ~7K lines of C++ excluding its dependencies.

There are three major GIS projects this extension uses to provide its functionality. The first is GDAL, a 1.7M-line C++ project that provides file conversion functionality for both raster and vector-based GIS file formats. The second is PROJ, a 150K-line C++ project that provides coordinate transformations. The third is GEOS, a 150K-line C++ project that provides geometry manipulation functionality.

All of these dependency projects were started more than 20 years ago, have been extensively battle-tested and seem to underpin almost every GIS project and service I've come across.

In this post, I'll walk through some example GIS workflows with the DuckDB Spatial extension.

DuckDB, Up & Running

I've attached an external storage drive to my 2020 MacBook Pro. I'll be using it to store all the GIS data, source code and compiled binaries used in this post.

$ cd /Volumes/Seagate
$ mkdir -p gis_data

I've installed Homebrew and I'll use it to install h3, jq, Python and some build tools.

$ brew install \
    cmake \
    gdal \
    h3 \
    jq \
    ninja \
    openssl@3 \
    virtualenv

I'll clone the Spatial extension's repo which includes DuckDB's source code and all of the dependency projects. I'll use Ninja to build everything as it'll use all the CPU cores on my system by default. This reduces the build time over just using make release substantially.

$ git clone  --recurse-submodules \
    https://github.com/duckdblabs/duckdb_spatial
$ cd duckdb_spatial
$ GEN=ninja \
    OPENSSL_ROOT_DIR=/usr/local/Cellar/openssl\@3/3.1.0/ \
    make release

I'll build the H3 extension for DuckDB as well.

$ cd /Volumes/Seagate

$ git clone https://github.com/isaacbrodsky/h3-duckdb duckdb_h3
$ cd ./duckdb_h3
$ git submodule update --init
$ GEN=ninja make release

Below are three Python libraries I'll use throughout this post.

$ virtualenv ~/.gis
$ source ~/.gis/bin/activate
$ pip install \
    geojson \
    h3

I'll launch DuckDB from the GIS working folder on my external drive.

$ cd /Volumes/Seagate/gis_data
$ /Volumes/Seagate/duckdb_spatial/build/release/duckdb \
    -unsigned \
    aus.duckdb

I'll load in the Spatial, H3 and JSON extensions for DuckDB.

LOAD '/Volumes/Seagate/duckdb_spatial/build/debug/extension/spatial/spatial.duckdb_extension';
LOAD '/Volumes/Seagate/duckdb_h3/build/release/h3.duckdb_extension';
INSTALL json;
LOAD json;

Supported GIS Formats

The DuckDB Spatial extension used in this post has a lot of functionality overlap with the DuckDB geo extension I covered in my Geospatial DuckDB post last month. The biggest feature difference between these two extensions is the Spatial extension's ability to read and write to over 50 GIS file formats. This is thanks to its use of the GDAL library.

.maxrows 51

SELECT short_name,
       long_name,
       can_create,
       can_copy,
       can_open
FROM ST_DRIVERS()
ORDER BY 1;
┌────────────────┬──────────────────────────────────────────────────────┬────────────┬──────────┬──────────┐
│   short_name   │                      long_name                       │ can_create │ can_copy │ can_open │
│    varchar     │                       varchar                        │  boolean   │ boolean  │ boolean  │
├────────────────┼──────────────────────────────────────────────────────┼────────────┼──────────┼──────────┤
│ AVCBin         │ Arc/Info Binary Coverage                             │ false      │ false    │ true     │
│ AVCE00         │ Arc/Info E00 (ASCII) Coverage                        │ false      │ false    │ true     │
│ AmigoCloud     │ AmigoCloud                                           │ true       │ false    │ true     │
│ CAD            │ AutoCAD Driver                                       │ false      │ false    │ true     │
│ CSV            │ Comma Separated Value (.csv)                         │ true       │ false    │ true     │
│ CSW            │ OGC CSW (Catalog  Service for the Web)               │ false      │ false    │ true     │
│ Carto          │ Carto                                                │ true       │ false    │ true     │
│ DGN            │ Microstation DGN                                     │ true       │ false    │ true     │
│ DXF            │ AutoCAD DXF                                          │ true       │ false    │ true     │
│ EDIGEO         │ French EDIGEO exchange format                        │ false      │ false    │ true     │
│ ESRI Shapefile │ ESRI Shapefile                                       │ true       │ false    │ true     │
│ ESRIJSON       │ ESRIJSON                                             │ false      │ false    │ true     │
│ Elasticsearch  │ Elastic Search                                       │ true       │ false    │ true     │
│ FlatGeobuf     │ FlatGeobuf                                           │ true       │ false    │ true     │
│ GML            │ Geography Markup Language (GML)                      │ true       │ false    │ true     │
│ GPKG           │ GeoPackage                                           │ true       │ true     │ true     │
│ GPSBabel       │ GPSBabel                                             │ true       │ false    │ true     │
│ GPX            │ GPX                                                  │ true       │ false    │ true     │
│ GeoJSON        │ GeoJSON                                              │ true       │ false    │ true     │
│ GeoJSONSeq     │ GeoJSON Sequence                                     │ true       │ false    │ true     │
│ GeoRSS         │ GeoRSS                                               │ true       │ false    │ true     │
│ Geoconcept     │ Geoconcept                                           │ true       │ false    │ true     │
│ JML            │ OpenJUMP JML                                         │ true       │ false    │ true     │
│ KML            │ Keyhole Markup Language (KML)                        │ true       │ false    │ true     │
│ LVBAG          │ Kadaster LV BAG Extract 2.0                          │ false      │ false    │ true     │
│ MVT            │ Mapbox Vector Tiles                                  │ true       │ false    │ true     │
│ MapInfo File   │ MapInfo File                                         │ true       │ false    │ true     │
│ MapML          │ MapML                                                │ true       │ false    │ true     │
│ Memory         │ Memory                                               │ true       │ false    │ true     │
│ NGW            │ NextGIS Web                                          │ true       │ true     │ true     │
│ OAPIF          │ OGC API - Features                                   │ false      │ false    │ true     │
│ ODS            │ Open Document/ LibreOffice / OpenOffice Spreadsheet  │ true       │ false    │ true     │
│ OGR_GMT        │ GMT ASCII Vectors (.gmt)                             │ true       │ false    │ true     │
│ OGR_VRT        │ VRT - Virtual Datasource                             │ false      │ false    │ true     │
│ OSM            │ OpenStreetMap XML and PBF                            │ false      │ false    │ true     │
│ OpenFileGDB    │ ESRI FileGDB                                         │ true       │ false    │ true     │
│ PGDUMP         │ PostgreSQL SQL dump                                  │ true       │ false    │ false    │
│ PLSCENES       │ Planet Labs Scenes API                               │ false      │ false    │ true     │
│ S57            │ IHO S-57 (ENC)                                       │ true       │ false    │ true     │
│ SQLite         │ SQLite / Spatialite                                  │ true       │ false    │ true     │
│ SVG            │ Scalable Vector Graphics                             │ false      │ false    │ true     │
│ SXF            │ Storage and eXchange Format                          │ false      │ false    │ true     │
│ Selafin        │ Selafin                                              │ true       │ false    │ true     │
│ TIGER          │ U.S. Census TIGER/Line                               │ false      │ false    │ true     │
│ TopoJSON       │ TopoJSON                                             │ false      │ false    │ true     │
│ UK .NTF        │ UK .NTF                                              │ false      │ false    │ true     │
│ VDV            │ VDV-451/VDV-452/INTREST Data Format                  │ true       │ false    │ true     │
│ VFK            │ Czech Cadastral Exchange Data Format                 │ false      │ false    │ true     │
│ WAsP           │ WAsP .map format                                     │ true       │ false    │ true     │
│ WFS            │ OGC WFS (Web Feature Service)                        │ false      │ false    │ true     │
│ XLSX           │ MS Office Open XML spreadsheet                       │ true       │ false    │ true     │
├────────────────┴──────────────────────────────────────────────────────┴────────────┴──────────┴──────────┤
│ 51 rows                                                                                        5 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Loading GIS Data into DuckDB

Two years ago, Microsoft's Bing Maps team released a database of 48,900,000 KM of roads, including 1,165,000 KM of roads missing from Open Street Map. These were produced by examining satellite imagery from Bing Maps, Maxar and Airbus. They have since continued to refresh the dataset with 15 new updates last year and an update for Japan six weeks ago.

First, I'll download and decompress the dataset for Oceania. The 205 MB ZIP file contains a 5,545,580-line, 1.3 GB TSV file.

$ wget https://usaminedroads.blob.core.windows.net/road-detections/Oceania-Full.zip
$ unzip -ojq Oceania-Full.zip

Each line of the TSV file contains a 3-character ISO country identifier followed by a GeoJSON record containing the road vectors.

$ head -n1 Oceania-Full.tsv
AUS {"type":"Feature","geometry":{"type":"LineString","coordinates":[[147.084939479828,-30.7773677126431],[147.0849609375,-30.7773584948334],[147.084971666336,-30.7773400592114]]},"properties":{}}

Below is the country count. Australia makes up 83% of this dataset.

$ cut -f1 Oceania-Full.tsv \
    | sort \
    | uniq -c \
    | sort -rn
4628534 AUS
 709436 NZL
 116000 PNG
  35416 VUT
  22367 FJI
  20340 SLB
   5622 TON
   4243 WSM
   1060 FSM
    831 KIR
    811 PLW
    555 MHL
    225 TUV
    140 NRU

I'll extract the Australian records into their own line-delimited GeoJSON file.

$ grep AUS Oceania-Full.tsv \
    | cut -f2 \
    > Oceania_AUS.geojson

GeoPackage files support R-Tree spatial indices. This lets them load much quicker than GeoJSON in QGIS and numerous other GIS tools. I'll use GDAL's ogr2ogr to convert the 1.2 GB GeoJSON file into an 800 MB GeoPackage file.

$ ogr2ogr \
    -f GPKG \
    Oceania_AUS.gpkg \
    Oceania_AUS.geojson

I'll launch DuckDB and load the GeoPackage file into a new table.

$ /Volumes/Seagate/duckdb_spatial/build/release/duckdb \
    -unsigned \
    aus.duckdb
CREATE OR REPLACE TABLE ms_roads AS
    SELECT *
    FROM st_read('Oceania_AUS.gpkg');

I downloaded the Tourist Spots Shapefile from the Australia Bureau of Statistics. This will help me group the roads into their respective Australian tourist regions. The 27 MB ZIP file below contains ~40 MB of Shapefile data.

$ wget https://www.abs.gov.au/statistics/standards/australian-statistical-geography-standard-asgs-edition-3/jul2021-jun2026/access-and-downloads/digital-boundary-files/TR_2021_AUST_GDA2020_SHP.zip
$ unzip -ojq TR_2021_AUST_GDA2020_SHP.zip

$ /Volumes/Seagate/duckdb_spatial/build/release/duckdb \
    -unsigned \
    aus.duckdb
CREATE OR REPLACE TABLE tourist_spots AS
    SELECT *
    FROM st_read('TR_2021_AUST_GDA2020.shx');

Enriching Geospatial Data in DuckDB

The geometry fields use the WKB blob data type. I need to convert them into GEOMETRY types as well as convert the tourist regions data from the EPSG:7844 projection into EPSG:4326 so it matches the roads dataset. WKB blob fields will incur conversion penalties when working with GIS functions so converting them early prevents slower query times later on.

CREATE OR REPLACE TABLE roads2 AS
    SELECT ST_GEOMFROMWKB(geom) geom
    FROM ms_roads;

CREATE OR REPLACE TABLE tourist_spots2 AS
    SELECT * EXCLUDE(wkb_geometry),
           ST_TRANSFORM(ST_GEOMFROMWKB(wkb_geometry),
                        'EPSG:7844',
                        'EPSG:4326') geom
    FROM tourist_spots;

I'll join these two datasets so every road record has its corresponding tourism region details.

CREATE OR REPLACE TABLE joined1 AS
    SELECT a.geom,
           b.TR_CODE21,
           b.TR_NAME21,
           b.STE_CODE21,
           b.STE_NAME21,
           b.AUS_CODE21,
           b.AUS_NAME21,
           b.AREASQKM21,
           b.LOCI_URI21,
           b.SHAPE_Leng,
           b.SHAPE_Area
    FROM roads2 a
    JOIN tourist_spots2 b
        ON ST_WITHIN(a.geom, b.geom);

Below is the schema for the new table.

.schema --indent joined1
CREATE TABLE joined1(
  geom GEOMETRY,
  "TR_CODE21" VARCHAR,
  "TR_NAME21" VARCHAR,
  "STE_CODE21" VARCHAR,
  "STE_NAME21" VARCHAR,
  "AUS_CODE21" VARCHAR,
  "AUS_NAME21" VARCHAR,
  "AREASQKM21" DOUBLE,
  "LOCI_URI21" VARCHAR,
  "SHAPE_Leng" DOUBLE,
  "SHAPE_Area" DOUBLE
);;

Below is an example record.

.mode line

SELECT *
FROM joined1
LIMIT 1;
      geom = LINESTRING (151.076881885529 -33.495588797786, 151.076753139496 -33.4951324954955)
 TR_CODE21 = 1R040
 TR_NAME21 = Sydney
STE_CODE21 = 1
STE_NAME21 = New South Wales
AUS_CODE21 = AUS
AUS_NAME21 = Australia
AREASQKM21 = 7288.3598
LOCI_URI21 = http://linked.data.gov.au/dataset/asgsed3/TR/1R040
SHAPE_Leng = 14.6210946243
SHAPE_Area = 0.708622228461

Exporting to GeoPackage

I'll export the Gold Coast records to a GeoPackage file so I can examine them in QGIS. The geometry will need to be converted back into a WKB blob field type in order for QGIS to recognise the geometry. The GeoPackage file below is 96K.

COPY (SELECT ST_AsWKB(geom) geom,
             TR_CODE21,
             TR_NAME21,
             STE_CODE21,
             STE_NAME21,
             AUS_CODE21,
             AUS_NAME21,
             AREASQKM21,
             LOCI_URI21,
             SHAPE_Leng,
             SHAPE_Area
      FROM joined1
      WHERE TR_NAME21='Gold Coast')
TO 'joined.gpkg'
WITH (FORMAT GDAL,
      DRIVER 'GPKG',
      LAYER_CREATION_OPTIONS 'WRITE_BBOX=YES');

I can't create a multi-layer GeoPackage file with DuckDB so I'll export a second GeoPackage file of the Gold Coast boundaries. The GeoPackage file below is 236K.

COPY (SELECT *
      FROM tourist_spots
      WHERE TR_NAME21='Gold Coast')
TO 'tourist_spots.gold_coast.gpkg'
WITH (FORMAT GDAL,
      DRIVER 'GPKG',
      LAYER_CREATION_OPTIONS 'WRITE_BBOX=YES');

Clustering Roads

I will also export the H3 zoom level 2 counts for the dataset. This will be useful for visualising record density across Australia.

$ vi h3.sql
.mode csv
LOAD h3;
LOAD json;

SELECT printf('%x',
              h3_latlng_to_cell(
                  ST_Y(ST_CENTROID(geom)),
                  ST_X(ST_CENTROID(geom)),
                  2)::bigint) as h3_2,
       count(*) count
FROM  roads2
GROUP BY 1;
$ /Volumes/Seagate/duckdb_spatial/build/release/duckdb \
        -unsigned \
        aus.duckdb \
    < h3.sql \
    | tail +2 \
    > h3.csv
$ python3
import csv

from geojson import (dumps,
                     Feature,
                     FeatureCollection,
                     Polygon)
import h3


fc = FeatureCollection([
    Feature(geometry=Polygon([h3.h3_to_geo_boundary(h3_id,
                                                    geo_json=True)]),
            properties={'sample_cnt': sample_cnt})
    for h3_id, sample_cnt in csv.reader(open('h3.csv'))])

open('h3.geojson', 'w').write(dumps(fc))

I dragged the h3.geojson file onto a new QGIS document and added the Esri World Imagery basemap with the Tile+ plugin. I then set the h3's layer's simple fill stroke width to 5, fill style to "Dense 4" and used the following expression for the fill and stroke colours:

CASE
  WHEN "sample_cnt" >= 40 THEN '#e93a27'
  WHEN "sample_cnt" >= 30 THEN '#ef684b'
  WHEN "sample_cnt" >= 20 THEN '#ed8d75'
  WHEN "sample_cnt" >= 10 THEN '#f7c4b1'
  ELSE '#e5cccf'
END
H3 Layer Style in QGIS

The following shows the majority of records are clustered around the major population centres.

Australian H3 GIS data in QGIS

Satellite Analysis in QGIS

The Tile+ plugin lets you import satellite and pre-rendered basemaps into QGIS projects. It supports a large number of sources and I'll be using it to fetch satellite imagery from Bing, Esri and Google for the examples below.

$ grep -Eo 'https?://.*\]]' /Users/mark/Library/Application\ Support/QGIS/QGIS3/profiles/default/python/plugins/tile_plus/basemap.xml \
    | python3 -c "import sys; from urllib.parse import unquote; print(unquote(sys.stdin.read()));" \
    | sed 's/]]//g' \
    | sort
http://basemaps.cartocdn.com/dark_all/{z}/{x}/{y}.png
http://basemaps.cartocdn.com/light_all/{z}/{x}/{y}.png
http://ecn.t3.tiles.virtualearth.net/tiles/a{q}.jpeg?g=1
http://realearth.ssec.wisc.edu/tiles/Earthquake-mag/{z}/{x}/{y}.png
http://realearth.ssec.wisc.edu/tiles/GLOBALterratc/{z}/{x}/{y}.png
http://realearth.ssec.wisc.edu/tiles/NESDIS-GHE-HourlyRainfall/{z}/{x}/{y}.png
http://realearth.ssec.wisc.edu/tiles/NESDIS-SST/{z}/{x}/{y}.png
http://realearth.ssec.wisc.edu/tiles/NightLightsColored/{z}/{x}/{y}.png
http://realearth.ssec.wisc.edu/tiles/VIIRS-Fire/{z}/{x}/{y}.png
http://realearth.ssec.wisc.edu/tiles/VIIRS-MASK-54000x27000/{z}/{x}/{y}.png
http://realearth.ssec.wisc.edu/tiles/globalir-avn/{z}/{x}/{y}.png
http://realearth.ssec.wisc.edu/tiles/globalir-rr/{z}/{x}/{y}.png
http://realearth.ssec.wisc.edu/tiles/globalir/{z}/{x}/{y}.png
http://realearth.ssec.wisc.edu/tiles/globalvis/{z}/{x}/{y}.png
http://realearth.ssec.wisc.edu/tiles/globalwv-grad/{z}/{x}/{y}.png
http://realearth.ssec.wisc.edu/tiles/lsat8-llook-fc/{z}/{x}/{y}.png
http://services.arcgisonline.com/ArcGIS/rest/services/Canvas/World_Dark_Gray_Base/MapServer/tile/{z}/{y}/{x}
http://services.arcgisonline.com/ArcGIS/rest/services/Canvas/World_Light_Gray_Base/MapServer/tile/{z}/{y}/{x}
http://services.arcgisonline.com/ArcGIS/rest/services/World_Topo_Map/MapServer/tile/{z}/{y}/{x}
http://tile.openstreetmap.fr/hot/{z}/{x}/{y}.png
http://tile.openstreetmap.org/{z}/{x}/{y}.png
http://tile.stamen.com/terrain/{z}/{x}/{y}.png
http://tile.stamen.com/toner-lite/{z}/{x}/{y}.png
http://tile.stamen.com/toner/{z}/{x}/{y}.png
http://tile.stamen.com/watercolor/{z}/{x}/{y}.jpg
http://tiles.wmflabs.org/bw-mapnik/{z}/{x}/{y}.png
https://mt1.google.com/vt/lyrs=m&x={x}&y={y}&z={z}
https://mt1.google.com/vt/lyrs=p&x={x}&y={y}&z={z}
https://mt1.google.com/vt/lyrs=s&x={x}&y={y}&z={z}
https://mt1.google.com/vt/lyrs=t&x={x}&y={y}&z={z}
https://mt1.google.com/vt/lyrs=y&x={x}&y={y}&z={z}
https://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/tile/{z}/{y}/{x}
https://server.arcgisonline.com/ArcGIS/rest/services/World_Terrain_Base/MapServer/tile/{z}/{y}/{x}
https://services.arcgisonline.com/ArcGIS/rest/services/Ocean/World_Ocean_Base/MapServer/tile/{z}/{y}/{x}

I've started a new project in QGIS, created a basemap using Bing's Virtual Earth satellite imagery with Tile+, dragged the roads GeoPackage file in and coloured them yellow as well as imported the Gold Coast boundary GeoPackage file and coloured it purple.

Australian GIS data in QGIS 1

Zooming into Surfers Paradise there appears to be near-complete road coverage from Microsoft's dataset.

Australian GIS data in QGIS 2

This includes public footpaths on private property. They've also been clever enough to not include footpaths in forested areas that most likely don't support regular vehicles.

Australian GIS data in QGIS 3

The 76-floor tower on 88 Esplanade is blocking Bing's satellite's view of Staghorn Avenue, yet the road is still properly marked. I've rotated the map below 90 degrees and it's facing west.

Australian GIS data in QGIS 4

Google's map has a more top-down view before the tower was taller than its neighbours. I doubt this was a source image for Bing but it would be interesting to see what the Maxar and Airbus images that Microsoft used look like.

Australian GIS data in QGIS 5

Below is Esri's image of the same location with the buildings tilting in the other direction. It's possible an image from this angle was blended with the other angle to 'look around' buildings and identify the roads next to them.

Australian GIS data in QGIS 6

HERE's satellite map credits Maxar. Given the building is photographed from its Northern side it could be Microsoft mixed these photos to look around the building.

Australian GIS data in QGIS 7

I'm not sure if Microsoft had photos from multiple angles or the same angle but over a long period of time or if they rely on nearby roads to hint there may be one here. Nonetheless, collecting a complete list of roads anywhere in the world is still an unsolved problem and their attempt is fantastic.

That said, I was able to spot some unconnected roads. Here is one leading up to a golf course clubhouse.

Australian GIS data in QGIS 4

And a very long diving board off of a bridge.

Australian GIS data in QGIS 5

There are attempts to mark out lanes in parking lots. Some are better than others.

Australian GIS data in QGIS 6

I found one example where a road looks to lead right into a mall's multi-level parking lot and abruptly stops. I suspect the 3rd dimension isn't being taken into account.

Australian GIS data in QGIS 8

Some roads are picked out of areas with dense vegetation obstructing the satellite's view. They seem to get this right more often than not.

Australian GIS data in QGIS 7
Thank you for taking the time to read this post. I offer both consulting and hands-on development services to clients in North America and Europe. If you'd like to discuss how my offerings can help your business please contact me via LinkedIn.

Copyright © 2014 - 2024 Mark Litwintschik. This site's template is based off a template by Giulio Fidente.