Max Gabrielsson recently reached out to me announcing that he was working on a DuckDB extension that would allow it to read and write to over 50 geospatial file formats natively. Max is originally from Sweden but now lives and works as a Software Engineer for DuckDB Labs in Amsterdam.
DuckDB's Spatial extension is made up of ~7K lines of C++ excluding its dependencies.
There are three major GIS projects this extension uses to provide its functionality. The first is GDAL, a 1.7M-line C++ project that provides file conversion functionality for both raster and vector-based GIS file formats. The second is PROJ, a 150K-line C++ project that provides coordinate transformations. The third is GEOS, a 150K-line C++ project that provides geometry manipulation functionality.
All of these dependency projects were started more than 20 years ago, have been extensively battle-tested and seem to underpin almost every GIS project and service I've come across.
In this post, I'll walk through some example GIS workflows with the DuckDB Spatial extension.
DuckDB, Up & Running
I've attached an external storage drive to my 2020 MacBook Pro. I'll be using it to store all the GIS data, source code and compiled binaries used in this post.
$ cd /Volumes/Seagate
$ mkdir -p gis_data
I've installed Homebrew and I'll use it to install h3, jq, Python and some build tools.
$ brew install \
cmake \
gdal \
h3 \
jq \
ninja \
openssl@3 \
virtualenv
I'll clone the Spatial extension's repo which includes DuckDB's source code and all of the dependency projects. I'll use Ninja to build everything as it'll use all the CPU cores on my system by default. This reduces the build time over just using make release substantially.
$ git clone --recurse-submodules \
https://github.com/duckdblabs/duckdb_spatial
$ cd duckdb_spatial
$ GEN=ninja \
OPENSSL_ROOT_DIR=/usr/local/Cellar/openssl\@3/3.1.0/ \
make release
I'll build the H3 extension for DuckDB as well.
$ cd /Volumes/Seagate
$ git clone https://github.com/isaacbrodsky/h3-duckdb duckdb_h3
$ cd ./duckdb_h3
$ git submodule update --init
$ GEN=ninja make release
Below are three Python libraries I'll use throughout this post.
$ virtualenv ~/.gis
$ source ~/.gis/bin/activate
$ pip install \
geojson \
h3
I'll launch DuckDB from the GIS working folder on my external drive.
$ cd /Volumes/Seagate/gis_data
$ /Volumes/Seagate/duckdb_spatial/build/release/duckdb \
-unsigned \
aus.duckdb
I'll load in the Spatial, H3 and JSON extensions for DuckDB.
LOAD '/Volumes/Seagate/duckdb_spatial/build/debug/extension/spatial/spatial.duckdb_extension';
LOAD '/Volumes/Seagate/duckdb_h3/build/release/h3.duckdb_extension';
INSTALL json;
LOAD json;
Supported GIS Formats
The DuckDB Spatial extension used in this post has a lot of functionality overlap with the DuckDB geo extension I covered in my Geospatial DuckDB post last month. The biggest feature difference between these two extensions is the Spatial extension's ability to read and write to over 50 GIS file formats. This is thanks to its use of the GDAL library.
.maxrows 51
SELECT short_name,
long_name,
can_create,
can_copy,
can_open
FROM ST_DRIVERS()
ORDER BY 1;
┌────────────────┬──────────────────────────────────────────────────────┬────────────┬──────────┬──────────┐
│ short_name │ long_name │ can_create │ can_copy │ can_open │
│ varchar │ varchar │ boolean │ boolean │ boolean │
├────────────────┼──────────────────────────────────────────────────────┼────────────┼──────────┼──────────┤
│ AVCBin │ Arc/Info Binary Coverage │ false │ false │ true │
│ AVCE00 │ Arc/Info E00 (ASCII) Coverage │ false │ false │ true │
│ AmigoCloud │ AmigoCloud │ true │ false │ true │
│ CAD │ AutoCAD Driver │ false │ false │ true │
│ CSV │ Comma Separated Value (.csv) │ true │ false │ true │
│ CSW │ OGC CSW (Catalog Service for the Web) │ false │ false │ true │
│ Carto │ Carto │ true │ false │ true │
│ DGN │ Microstation DGN │ true │ false │ true │
│ DXF │ AutoCAD DXF │ true │ false │ true │
│ EDIGEO │ French EDIGEO exchange format │ false │ false │ true │
│ ESRI Shapefile │ ESRI Shapefile │ true │ false │ true │
│ ESRIJSON │ ESRIJSON │ false │ false │ true │
│ Elasticsearch │ Elastic Search │ true │ false │ true │
│ FlatGeobuf │ FlatGeobuf │ true │ false │ true │
│ GML │ Geography Markup Language (GML) │ true │ false │ true │
│ GPKG │ GeoPackage │ true │ true │ true │
│ GPSBabel │ GPSBabel │ true │ false │ true │
│ GPX │ GPX │ true │ false │ true │
│ GeoJSON │ GeoJSON │ true │ false │ true │
│ GeoJSONSeq │ GeoJSON Sequence │ true │ false │ true │
│ GeoRSS │ GeoRSS │ true │ false │ true │
│ Geoconcept │ Geoconcept │ true │ false │ true │
│ JML │ OpenJUMP JML │ true │ false │ true │
│ KML │ Keyhole Markup Language (KML) │ true │ false │ true │
│ LVBAG │ Kadaster LV BAG Extract 2.0 │ false │ false │ true │
│ MVT │ Mapbox Vector Tiles │ true │ false │ true │
│ MapInfo File │ MapInfo File │ true │ false │ true │
│ MapML │ MapML │ true │ false │ true │
│ Memory │ Memory │ true │ false │ true │
│ NGW │ NextGIS Web │ true │ true │ true │
│ OAPIF │ OGC API - Features │ false │ false │ true │
│ ODS │ Open Document/ LibreOffice / OpenOffice Spreadsheet │ true │ false │ true │
│ OGR_GMT │ GMT ASCII Vectors (.gmt) │ true │ false │ true │
│ OGR_VRT │ VRT - Virtual Datasource │ false │ false │ true │
│ OSM │ OpenStreetMap XML and PBF │ false │ false │ true │
│ OpenFileGDB │ ESRI FileGDB │ true │ false │ true │
│ PGDUMP │ PostgreSQL SQL dump │ true │ false │ false │
│ PLSCENES │ Planet Labs Scenes API │ false │ false │ true │
│ S57 │ IHO S-57 (ENC) │ true │ false │ true │
│ SQLite │ SQLite / Spatialite │ true │ false │ true │
│ SVG │ Scalable Vector Graphics │ false │ false │ true │
│ SXF │ Storage and eXchange Format │ false │ false │ true │
│ Selafin │ Selafin │ true │ false │ true │
│ TIGER │ U.S. Census TIGER/Line │ false │ false │ true │
│ TopoJSON │ TopoJSON │ false │ false │ true │
│ UK .NTF │ UK .NTF │ false │ false │ true │
│ VDV │ VDV-451/VDV-452/INTREST Data Format │ true │ false │ true │
│ VFK │ Czech Cadastral Exchange Data Format │ false │ false │ true │
│ WAsP │ WAsP .map format │ true │ false │ true │
│ WFS │ OGC WFS (Web Feature Service) │ false │ false │ true │
│ XLSX │ MS Office Open XML spreadsheet │ true │ false │ true │
├────────────────┴──────────────────────────────────────────────────────┴────────────┴──────────┴──────────┤
│ 51 rows 5 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Loading GIS Data into DuckDB
Two years ago, Microsoft's Bing Maps team released a database of 48,900,000 KM of roads, including 1,165,000 KM of roads missing from Open Street Map. These were produced by examining satellite imagery from Bing Maps, Maxar and Airbus. They have since continued to refresh the dataset with 15 new updates last year and an update for Japan six weeks ago.
First, I'll download and decompress the dataset for Oceania. The 205 MB ZIP file contains a 5,545,580-line, 1.3 GB TSV file.
$ wget https://usaminedroads.blob.core.windows.net/road-detections/Oceania-Full.zip
$ unzip -ojq Oceania-Full.zip
Each line of the TSV file contains a 3-character ISO country identifier followed by a GeoJSON record containing the road vectors.
$ head -n1 Oceania-Full.tsv
AUS {"type":"Feature","geometry":{"type":"LineString","coordinates":[[147.084939479828,-30.7773677126431],[147.0849609375,-30.7773584948334],[147.084971666336,-30.7773400592114]]},"properties":{}}
Below is the country count. Australia makes up 83% of this dataset.
$ cut -f1 Oceania-Full.tsv \
| sort \
| uniq -c \
| sort -rn
4628534 AUS
709436 NZL
116000 PNG
35416 VUT
22367 FJI
20340 SLB
5622 TON
4243 WSM
1060 FSM
831 KIR
811 PLW
555 MHL
225 TUV
140 NRU
I'll extract the Australian records into their own line-delimited GeoJSON file.
$ grep AUS Oceania-Full.tsv \
| cut -f2 \
> Oceania_AUS.geojson
GeoPackage files support R-Tree spatial indices. This lets them load much quicker than GeoJSON in QGIS and numerous other GIS tools. I'll use GDAL's ogr2ogr to convert the 1.2 GB GeoJSON file into an 800 MB GeoPackage file.
$ ogr2ogr \
-f GPKG \
Oceania_AUS.gpkg \
Oceania_AUS.geojson
I'll launch DuckDB and load the GeoPackage file into a new table.
$ /Volumes/Seagate/duckdb_spatial/build/release/duckdb \
-unsigned \
aus.duckdb
CREATE OR REPLACE TABLE ms_roads AS
SELECT *
FROM st_read('Oceania_AUS.gpkg');
I downloaded the Tourist Spots Shapefile from the Australia Bureau of Statistics. This will help me group the roads into their respective Australian tourist regions. The 27 MB ZIP file below contains ~40 MB of Shapefile data.
$ wget https://www.abs.gov.au/statistics/standards/australian-statistical-geography-standard-asgs-edition-3/jul2021-jun2026/access-and-downloads/digital-boundary-files/TR_2021_AUST_GDA2020_SHP.zip
$ unzip -ojq TR_2021_AUST_GDA2020_SHP.zip
$ /Volumes/Seagate/duckdb_spatial/build/release/duckdb \
-unsigned \
aus.duckdb
CREATE OR REPLACE TABLE tourist_spots AS
SELECT *
FROM st_read('TR_2021_AUST_GDA2020.shx');
Enriching Geospatial Data in DuckDB
The geometry fields use the WKB blob data type. I need to convert them into GEOMETRY types as well as convert the tourist regions data from the EPSG:7844 projection into EPSG:4326 so it matches the roads dataset. WKB blob fields will incur conversion penalties when working with GIS functions so converting them early prevents slower query times later on.
CREATE OR REPLACE TABLE roads2 AS
SELECT ST_GEOMFROMWKB(geom) geom
FROM ms_roads;
CREATE OR REPLACE TABLE tourist_spots2 AS
SELECT * EXCLUDE(wkb_geometry),
ST_TRANSFORM(ST_GEOMFROMWKB(wkb_geometry),
'EPSG:7844',
'EPSG:4326') geom
FROM tourist_spots;
I'll join these two datasets so every road record has its corresponding tourism region details.
CREATE OR REPLACE TABLE joined1 AS
SELECT a.geom,
b.TR_CODE21,
b.TR_NAME21,
b.STE_CODE21,
b.STE_NAME21,
b.AUS_CODE21,
b.AUS_NAME21,
b.AREASQKM21,
b.LOCI_URI21,
b.SHAPE_Leng,
b.SHAPE_Area
FROM roads2 a
JOIN tourist_spots2 b
ON ST_WITHIN(a.geom, b.geom);
Below is the schema for the new table.
.schema --indent joined1
CREATE TABLE joined1(
geom GEOMETRY,
"TR_CODE21" VARCHAR,
"TR_NAME21" VARCHAR,
"STE_CODE21" VARCHAR,
"STE_NAME21" VARCHAR,
"AUS_CODE21" VARCHAR,
"AUS_NAME21" VARCHAR,
"AREASQKM21" DOUBLE,
"LOCI_URI21" VARCHAR,
"SHAPE_Leng" DOUBLE,
"SHAPE_Area" DOUBLE
);;
Below is an example record.
.mode line
SELECT *
FROM joined1
LIMIT 1;
geom = LINESTRING (151.076881885529 -33.495588797786, 151.076753139496 -33.4951324954955)
TR_CODE21 = 1R040
TR_NAME21 = Sydney
STE_CODE21 = 1
STE_NAME21 = New South Wales
AUS_CODE21 = AUS
AUS_NAME21 = Australia
AREASQKM21 = 7288.3598
LOCI_URI21 = http://linked.data.gov.au/dataset/asgsed3/TR/1R040
SHAPE_Leng = 14.6210946243
SHAPE_Area = 0.708622228461
Exporting to GeoPackage
I'll export the Gold Coast records to a GeoPackage file so I can examine them in QGIS. The geometry will need to be converted back into a WKB blob field type in order for QGIS to recognise the geometry. The GeoPackage file below is 96K.
COPY (SELECT ST_AsWKB(geom) geom,
TR_CODE21,
TR_NAME21,
STE_CODE21,
STE_NAME21,
AUS_CODE21,
AUS_NAME21,
AREASQKM21,
LOCI_URI21,
SHAPE_Leng,
SHAPE_Area
FROM joined1
WHERE TR_NAME21='Gold Coast')
TO 'joined.gpkg'
WITH (FORMAT GDAL,
DRIVER 'GPKG',
LAYER_CREATION_OPTIONS 'WRITE_BBOX=YES');
I can't create a multi-layer GeoPackage file with DuckDB so I'll export a second GeoPackage file of the Gold Coast boundaries. The GeoPackage file below is 236K.
COPY (SELECT *
FROM tourist_spots
WHERE TR_NAME21='Gold Coast')
TO 'tourist_spots.gold_coast.gpkg'
WITH (FORMAT GDAL,
DRIVER 'GPKG',
LAYER_CREATION_OPTIONS 'WRITE_BBOX=YES');
Clustering Roads
I will also export the H3 zoom level 2 counts for the dataset. This will be useful for visualising record density across Australia.
$ vi h3.sql
.mode csv
LOAD h3;
LOAD json;
SELECT printf('%x',
h3_latlng_to_cell(
ST_Y(ST_CENTROID(geom)),
ST_X(ST_CENTROID(geom)),
2)::bigint) as h3_2,
count(*) count
FROM roads2
GROUP BY 1;
$ /Volumes/Seagate/duckdb_spatial/build/release/duckdb \
-unsigned \
aus.duckdb \
< h3.sql \
| tail +2 \
> h3.csv
$ python3
import csv
from geojson import (dumps,
Feature,
FeatureCollection,
Polygon)
import h3
fc = FeatureCollection([
Feature(geometry=Polygon([h3.h3_to_geo_boundary(h3_id,
geo_json=True)]),
properties={'sample_cnt': sample_cnt})
for h3_id, sample_cnt in csv.reader(open('h3.csv'))])
open('h3.geojson', 'w').write(dumps(fc))
I dragged the h3.geojson file onto a new QGIS document and added the Esri World Imagery basemap with the Tile+ plugin. I then set the h3's layer's simple fill stroke width to 5, fill style to "Dense 4" and used the following expression for the fill and stroke colours:
CASE
WHEN "sample_cnt" >= 40 THEN '#e93a27'
WHEN "sample_cnt" >= 30 THEN '#ef684b'
WHEN "sample_cnt" >= 20 THEN '#ed8d75'
WHEN "sample_cnt" >= 10 THEN '#f7c4b1'
ELSE '#e5cccf'
END
The following shows the majority of records are clustered around the major population centres.
Satellite Analysis in QGIS
The Tile+ plugin lets you import satellite and pre-rendered basemaps into QGIS projects. It supports a large number of sources and I'll be using it to fetch satellite imagery from Bing, Esri and Google for the examples below.
$ grep -Eo 'https?://.*\]]' /Users/mark/Library/Application\ Support/QGIS/QGIS3/profiles/default/python/plugins/tile_plus/basemap.xml \
| python3 -c "import sys; from urllib.parse import unquote; print(unquote(sys.stdin.read()));" \
| sed 's/]]//g' \
| sort
http://basemaps.cartocdn.com/dark_all/{z}/{x}/{y}.png
http://basemaps.cartocdn.com/light_all/{z}/{x}/{y}.png
http://ecn.t3.tiles.virtualearth.net/tiles/a{q}.jpeg?g=1
http://realearth.ssec.wisc.edu/tiles/Earthquake-mag/{z}/{x}/{y}.png
http://realearth.ssec.wisc.edu/tiles/GLOBALterratc/{z}/{x}/{y}.png
http://realearth.ssec.wisc.edu/tiles/NESDIS-GHE-HourlyRainfall/{z}/{x}/{y}.png
http://realearth.ssec.wisc.edu/tiles/NESDIS-SST/{z}/{x}/{y}.png
http://realearth.ssec.wisc.edu/tiles/NightLightsColored/{z}/{x}/{y}.png
http://realearth.ssec.wisc.edu/tiles/VIIRS-Fire/{z}/{x}/{y}.png
http://realearth.ssec.wisc.edu/tiles/VIIRS-MASK-54000x27000/{z}/{x}/{y}.png
http://realearth.ssec.wisc.edu/tiles/globalir-avn/{z}/{x}/{y}.png
http://realearth.ssec.wisc.edu/tiles/globalir-rr/{z}/{x}/{y}.png
http://realearth.ssec.wisc.edu/tiles/globalir/{z}/{x}/{y}.png
http://realearth.ssec.wisc.edu/tiles/globalvis/{z}/{x}/{y}.png
http://realearth.ssec.wisc.edu/tiles/globalwv-grad/{z}/{x}/{y}.png
http://realearth.ssec.wisc.edu/tiles/lsat8-llook-fc/{z}/{x}/{y}.png
http://services.arcgisonline.com/ArcGIS/rest/services/Canvas/World_Dark_Gray_Base/MapServer/tile/{z}/{y}/{x}
http://services.arcgisonline.com/ArcGIS/rest/services/Canvas/World_Light_Gray_Base/MapServer/tile/{z}/{y}/{x}
http://services.arcgisonline.com/ArcGIS/rest/services/World_Topo_Map/MapServer/tile/{z}/{y}/{x}
http://tile.openstreetmap.fr/hot/{z}/{x}/{y}.png
http://tile.openstreetmap.org/{z}/{x}/{y}.png
http://tile.stamen.com/terrain/{z}/{x}/{y}.png
http://tile.stamen.com/toner-lite/{z}/{x}/{y}.png
http://tile.stamen.com/toner/{z}/{x}/{y}.png
http://tile.stamen.com/watercolor/{z}/{x}/{y}.jpg
http://tiles.wmflabs.org/bw-mapnik/{z}/{x}/{y}.png
https://mt1.google.com/vt/lyrs=m&x={x}&y={y}&z={z}
https://mt1.google.com/vt/lyrs=p&x={x}&y={y}&z={z}
https://mt1.google.com/vt/lyrs=s&x={x}&y={y}&z={z}
https://mt1.google.com/vt/lyrs=t&x={x}&y={y}&z={z}
https://mt1.google.com/vt/lyrs=y&x={x}&y={y}&z={z}
https://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/tile/{z}/{y}/{x}
https://server.arcgisonline.com/ArcGIS/rest/services/World_Terrain_Base/MapServer/tile/{z}/{y}/{x}
https://services.arcgisonline.com/ArcGIS/rest/services/Ocean/World_Ocean_Base/MapServer/tile/{z}/{y}/{x}
I've started a new project in QGIS, created a basemap using Bing's Virtual Earth satellite imagery with Tile+, dragged the roads GeoPackage file in and coloured them yellow as well as imported the Gold Coast boundary GeoPackage file and coloured it purple.
Zooming into Surfers Paradise there appears to be near-complete road coverage from Microsoft's dataset.
This includes public footpaths on private property. They've also been clever enough to not include footpaths in forested areas that most likely don't support regular vehicles.
The 76-floor tower on 88 Esplanade is blocking Bing's satellite's view of Staghorn Avenue, yet the road is still properly marked. I've rotated the map below 90 degrees and it's facing west.
Google's map has a more top-down view before the tower was taller than its neighbours. I doubt this was a source image for Bing but it would be interesting to see what the Maxar and Airbus images that Microsoft used look like.
Below is Esri's image of the same location with the buildings tilting in the other direction. It's possible an image from this angle was blended with the other angle to 'look around' buildings and identify the roads next to them.
HERE's satellite map credits Maxar. Given the building is photographed from its Northern side it could be Microsoft mixed these photos to look around the building.
I'm not sure if Microsoft had photos from multiple angles or the same angle but over a long period of time or if they rely on nearby roads to hint there may be one here. Nonetheless, collecting a complete list of roads anywhere in the world is still an unsolved problem and their attempt is fantastic.
That said, I was able to spot some unconnected roads. Here is one leading up to a golf course clubhouse.
And a very long diving board off of a bridge.
There are attempts to mark out lanes in parking lots. Some are better than others.
I found one example where a road looks to lead right into a mall's multi-level parking lot and abruptly stops. I suspect the 3rd dimension isn't being taken into account.
Some roads are picked out of areas with dense vegetation obstructing the satellite's view. They seem to get this right more often than not.