Satellites Spotting Depth

Depth Anything V2 is a depth estimation model that was released last year. It was developed by a team from TikTok and the University of Hong Kong (HKU). Roughly 600K synthetic, labelled images and over 62M real, unlabelled images were used in its training.

In this post, I'll run Depth Anything V2's largest model against Maxar's 2025 satellite imagery of Bangkok, Thailand.

My Workstation

I'm using a 5.7 GHz AMD Ryzen 9 9950X CPU. It has 16 cores and 32 threads and 1.2 MB of L1, 16 MB of L2 and 64 MB of L3 cache. It has a liquid cooler attached and is housed in a spacious, full-sized Cooler Master HAF 700 computer case.

The system has 96 GB of DDR5 RAM clocked at 4,800 MT/s and a 5th-generation, Crucial T700 4 TB NVMe M.2 SSD which can read at speeds up to 12,400 MB/s. There is a heatsink on the SSD to help keep its temperature down. This is my system's C drive.

The system is powered by a 1,200-watt, fully modular Corsair Power Supply and is sat on an ASRock X870E Nova 90 Motherboard.

I'm running Ubuntu 24 LTS via Microsoft's Ubuntu for Windows on Windows 11 Pro. In case you're wondering why I don't run a Linux-based desktop as my primary work environment, I'm still using an Nvidia GTX 1080 GPU which has better driver support on Windows and I use ArcGIS Pro from time to time which only supports Windows natively.

Installing Prerequisites

I'm running Esri's ArcGIS Pro 3.5. This version is their latest and was released last week.

I'll also use Python 3.12.3 in this post.

$ sudo add-apt-repository ppa:deadsnakes/ppa
$ sudo apt update
$ sudo apt install \
    python3-pip \
    python3.12-venv

I'll clone the Depth Anything V2 repo.

$ git clone https://github.com/DepthAnything/Depth-Anything-V2 \
    ~/Depth-Anything-V2

I'll set up a Python Virtual Environment and install Depth Anything's dependencies.

$ python3 -m venv ~/.depth_anything_v2
$ source ~/.depth_anything_v2/bin/activate
$ python3 -m pip install \
    -r ~/Depth-Anything-V2/requirements.txt

Depth Anything V2 has three different pre-trained models available. I'll download the largest, a 335.3M-parameter model with a footprint of ~1.3 GB.

$ mkdir -p ~/Depth-Anything-V2/checkpoints
$ cd ~/Depth-Anything-V2/checkpoints
$ wget -O depth_anything_v2_vitl.pth \
    'https://huggingface.co/depth-anything/Depth-Anything-V2-Large/resolve/main/depth_anything_v2_vitl.pth?download=true'
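
The ~1.3 GB footprint is about what the parameter count would suggest if every weight is stored as a 32-bit float. That storage format is an assumption here rather than something confirmed above, so treat this as a rough sanity check only.

```python
# Assumption: each parameter is stored as a 4-byte float32.
PARAMETERS = 335.3e6
BYTES_PER_PARAMETER = 4

# Decimal gigabytes (1 GB = 1e9 bytes).
footprint_gb = PARAMETERS * BYTES_PER_PARAMETER / 1e9
print(f"{footprint_gb:.2f} GB")
```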

Maxar's Bangkok Satellite Imagery

Maxar have an open data programme that I wrote a post on a few years ago. I revisited this feed last month after the earthquake struck Myanmar and Thailand.

I'll be using two images from this feed in this post. The first image covers part of the Chatuchak district and includes Ratchadaphisek Road which has several tall towers along it.

The image is a GeoTIFF pyramid containing a 17408x17408-pixel JPEG. Going by the projected bounding box in its metadata, the imagery covers an area of ~5.3 x 3.8 km. The image was captured on February 14th by Maxar's WorldView 3 satellite at a resolution of 38 cm.

$ wget https://maxar-opendata.s3.amazonaws.com/events/Earthquake-Myanmar-March-2025/ard/47/122022102203/2025-02-14/10400100A4C67F00-visual.tif

This is the image below in relation to its surrounding area in Bangkok.

Below I've zoomed into Ratchadaphisek Road where you can see the tall towers.

This is the metadata for the above image.

{
  "geometry": {
    "coordinates": [
      [
        [
          100.574350016387555,
          13.833249903918164
        ],
        [
          100.525209298594646,
          13.833560533960037
        ],
        [
          100.525429970258031,
          13.86737334
        ],
        [
          100.574579936582708,
          13.86737334
        ],
        [
          100.574350016387555,
          13.833249903918164
        ]
      ]
    ],
    "type": "Polygon"
  },
  "properties": {
    "ard_metadata_version": "0.0.1",
    "catalog_id": "10400100A4C67F00",
    "data-mask": "https://maxar-opendata.s3.amazonaws.com/events/Earthquake-Myanmar-March-2025/ard/47/122022102203/2025-02-14/10400100A4C67F00-data-mask.gpkg",
    "datetime": "2025-02-14T04:02:15Z",
    "grid:code": "MXRA-Z47-122022102203",
    "gsd": 0.38,
    "ms_analytic": "https://maxar-opendata.s3.amazonaws.com/events/Earthquake-Myanmar-March-2025/ard/47/122022102203/2025-02-14/10400100A4C67F00-ms.tif",
    "pan_analytic": "https://maxar-opendata.s3.amazonaws.com/events/Earthquake-Myanmar-March-2025/ard/47/122022102203/2025-02-14/10400100A4C67F00-pan.tif",
    "platform": "WV03",
    "proj:bbox": "664843.75,1529843.75,670156.25,1533619.8386004784",
    "proj:code": "EPSG:32647",
    "proj:geometry": {
      "coordinates": [
        [
          [
            670156.25,
            1529843.75
          ],
          [
            664843.75,
            1529843.75
          ],
          [
            664843.75,
            1533585.6070091636
          ],
          [
            670156.25,
            1533619.8386004784
          ],
          [
            670156.25,
            1529843.75
          ]
        ]
      ],
      "type": "Polygon"
    },
    "quadkey": "122022102203",
    "tile:clouds_area": 0.0,
    "tile:clouds_percent": 0,
    "tile:data_area": 19.9,
    "utm_zone": 47,
    "view:azimuth": 243.9,
    "view:incidence_angle": 59.9,
    "view:off_nadir": 27.2,
    "view:sun_azimuth": 139.3,
    "view:sun_elevation": 55.3,
    "visual": "https://maxar-opendata.s3.amazonaws.com/events/Earthquake-Myanmar-March-2025/ard/47/122022102203/2025-02-14/10400100A4C67F00-visual.tif"
  },
  "type": "Feature"
}
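
The proj:bbox and proj:geometry values above are in metres (EPSG:32647, UTM zone 47N), so the footprint can be worked out with plain arithmetic. Below is a minimal sketch; the two helper names are mine and aren't part of any Maxar tooling. The area it reports should land close to the tile:data_area value of 19.9.

```python
def bbox_extent_km(proj_bbox):
    """Return (width, height) in km for a 'min_x,min_y,max_x,max_y' string in metres."""
    min_x, min_y, max_x, max_y = (float(v) for v in proj_bbox.split(","))
    return (max_x - min_x) / 1000, (max_y - min_y) / 1000


def polygon_area_km2(ring):
    """Shoelace area of a closed ring of (x, y) points given in metres."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2 / 1e6


# Values copied from the metadata above.
width_km, height_km = bbox_extent_km(
    "664843.75,1529843.75,670156.25,1533619.8386004784"
)
ring = [
    (670156.25, 1529843.75),
    (664843.75, 1529843.75),
    (664843.75, 1533585.6070091636),
    (670156.25, 1533619.8386004784),
    (670156.25, 1529843.75),
]
area_km2 = polygon_area_km2(ring)

print(f"{width_km:.1f} x {height_km:.1f} km, {area_km2:.2f} km2")
```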

The second image has been cropped out of Maxar's original source imagery and is focused on the intersection between Rattanathibet Road and Tiwanon Road in the Bang Kraso district of Northern Bangkok.

The above image is much smaller than the other one for reasons I'll get to later on in this post. It's a 3829x1936-pixel JPEG screenshot captured at 100% from its source imagery.

First Inference Attempt

I'll create an output folder and run the larger image through Depth Anything V2's largest model.

$ cd ~/Depth-Anything-V2/
$ mkdir -p out

$ python run.py \
      --encoder vitl \
      --pred-only \
      --grayscale \
      --img-path 10400100A4C67F00-visual.tif \
      --outdir out/

The resulting depth map didn't highlight any of the buildings in the image.

This is likely because part of the source image is completely black. This threw off the model, which treated the empty area as the peak of the image.
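
Here is a toy illustration of how one extreme region can flatten everything else. It assumes the depth map is min-max scaled to 8-bit values before it's saved, and the numbers are made up rather than taken from the model's output.

```python
def min_max_to_uint8(values):
    """Linearly rescale values so the smallest maps to 0 and the largest to 255."""
    lo, hi = min(values), max(values)
    return [int((v - lo) / (hi - lo) * 255) for v in values]


# Made-up relative depths for the ground and three rooftops.
scene = [10.0, 10.5, 11.0, 12.0]

# The same scene plus one extreme value standing in for the black region.
scene_with_void = scene + [100.0]

print(min_max_to_uint8(scene))            # [0, 63, 127, 255]
print(min_max_to_uint8(scene_with_void))  # [0, 1, 2, 5, 255]
```

In the second case the rooftops are squeezed into the bottom few grey levels, which would make them all but invisible in a greyscale depth map.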

Second Inference Attempt

I'll run the smaller image through Depth Anything V2's largest model.

$ python run.py \
      --encoder vitl \
      --pred-only \
      --grayscale \
      --img-path Photos_epK2mkc7uS.jpg \
      --outdir out/

This result was much better. The location data is missing from the screenshot so I georeferenced the depth map back into position.
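
One lightweight way to script this kind of georeferencing, assuming the crop's bounding box in a projected coordinate system is known, is a world file saved next to the image (.pgw for a PNG, .jgw for a JPEG). The sketch below is not necessarily how the depth map above was positioned, and the bounds in it are hypothetical placeholders rather than the real extent of the screenshot.

```python
def world_file_values(min_x, min_y, max_x, max_y, width_px, height_px):
    """Six world-file values for a north-up image with no rotation."""
    pixel_w = (max_x - min_x) / width_px
    pixel_h = (max_y - min_y) / height_px
    return [
        pixel_w,              # A: pixel width in map units
        0.0,                  # D: rotation term
        0.0,                  # B: rotation term
        -pixel_h,             # E: negative because rows run north to south
        min_x + pixel_w / 2,  # C: x of the centre of the top-left pixel
        max_y - pixel_h / 2,  # F: y of the centre of the top-left pixel
    ]


# Hypothetical bounds in metres for a 3829x1936-pixel image.
values = world_file_values(660000.0, 1530000.0, 661914.5, 1530968.0, 3829, 1936)

# A world file is these six numbers, one per line.
world_file_text = "\n".join(f"{v:.6f}" for v in values)
print(world_file_text)
```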

Below is the depth map in relation to the original source image from Maxar.

The depth information itself is relative so work would be needed to figure out the height of the tallest building in the image and adjust the scale accordingly.
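
A minimal sketch of that adjustment is below. It assumes the relationship between the depth map's pixel values and real-world height is linear, which is a simplification that would need checking, and it uses a made-up 100 m building.

```python
def relative_to_metres(pixel_values, tallest_height_m):
    """Map the lowest pixel value to 0 m and the highest to tallest_height_m."""
    ground, peak = min(pixel_values), max(pixel_values)
    return [(v - ground) * tallest_height_m / (peak - ground) for v in pixel_values]


# Made-up 8-bit depth values and a hypothetical 100 m tallest building.
heights = relative_to_metres([0, 51, 255], 100.0)
print(heights)
```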

It's conceivable to put together a workflow where images are tiled and Overture's building dataset could be sourced to find the tallest building within any one tile. Then the height scale could be set based on that.
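
The tiling half of that workflow could start with something like the sketch below. The 2,048-pixel tile size is an arbitrary choice of mine, and the Overture lookup for each tile isn't shown.

```python
def tile_windows(width, height, tile_size):
    """Return (x, y, w, h) pixel windows that cover the image without overlap."""
    windows = []
    for y in range(0, height, tile_size):
        for x in range(0, width, tile_size):
            w = min(tile_size, width - x)   # edge tiles may be narrower
            h = min(tile_size, height - y)  # edge tiles may be shorter
            windows.append((x, y, w, h))
    return windows


# The full Maxar image is 17408x17408 pixels.
windows = tile_windows(17408, 17408, 2048)
print(len(windows), windows[0], windows[-1])
```

Each window could then be cropped out, run through the model on its own and scaled using the tallest known building that falls inside it.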

Aerial Imagery

The model also does a good job with images taken at height, such as from tall buildings. Below is an example taken of Tallinn's Old Town from the top of the Viru Hotel.
