Posted on Tue 18 November 2025 under DevOps and Networking

American Data Centers

In September, Business Insider published a video on the locations, ownership and power and water consumption of America's Data Centers. They collected diesel generator permits from every state in the US and mapped the applicant's company name back to their parent company. This revealed 1,240 sites along with a large amount of metadata for each location.

They published an interactive map with the underlying GeoJSON dataset embedded as JavaScript.

In this post, I'll explore Business Insider's dataset.

My Workstation

I'm using a 5.7 GHz AMD Ryzen 9 9950X CPU. It has 16 cores and 32 threads, with 1.2 MB of L1, 16 MB of L2 and 64 MB of L3 cache. It has a liquid cooler attached and is housed in a spacious, full-sized Cooler Master HAF 700 computer case.

The system has 96 GB of DDR5 RAM clocked at 4,800 MT/s and a 5th-generation, Crucial T700 4 TB NVMe M.2 SSD which can read at speeds up to 12,400 MB/s. There is a heatsink on the SSD to help keep its temperature down. This is my system's C drive.

The system is powered by a 1,200-watt, fully modular Corsair Power Supply and is sat on an ASRock X870E Nova 90 Motherboard.

I'm running Ubuntu 24 LTS via Microsoft's Ubuntu for Windows on Windows 11 Pro. In case you're wondering why I don't run a Linux-based desktop as my primary work environment, I'm still using an Nvidia GTX 1080 GPU which has better driver support on Windows and ArcGIS Pro only supports Windows natively.

Installing Prerequisites

I'll use Python 3.12.3 and a few other tools to help analyse the data in this post.

$ sudo add-apt-repository ppa:deadsnakes/ppa
$ sudo apt update
$ sudo apt install \
    jq \
    python3-pip \
    python3.12-venv

I'll set up a Python Virtual Environment and install a JavaScript parser for Python.

$ python3 -m venv ~/.dcs
$ source ~/.dcs/bin/activate
$ python -m pip install \
    duckdb \
    esprima \
    levenshtein

I'll use DuckDB v1.4.1, along with its H3, JSON, Lindel, Parquet and Spatial extensions, in this post.

$ cd ~
$ wget -c https://github.com/duckdb/duckdb/releases/download/v1.4.1/duckdb_cli-linux-amd64.zip
$ unzip -j duckdb_cli-linux-amd64.zip
$ chmod +x duckdb
$ ~/duckdb
INSTALL h3 FROM community;
INSTALL lindel FROM community;
INSTALL json;
INSTALL parquet;
INSTALL spatial;

I'll set up DuckDB to load every installed extension each time it launches.

$ vi ~/.duckdbrc
.timer on
.width 180
LOAD h3;
LOAD lindel;
LOAD json;
LOAD parquet;
LOAD spatial;

The maps in this post were rendered using QGIS version 3.44. QGIS is a desktop application that runs on Windows, macOS and Linux. The application has grown in popularity in recent years and has ~15M application launches from users all around the world each month.

I used QGIS' Tile+ plugin to add basemaps from Esri and Bing to the maps in this post.

Analysis-Ready Data

I'll download the JavaScript used for Business Insider's interactive map.

$ mkdir -p ~/business_insider
$ cd ~/business_insider

$ wget https://tbimedia.s3.us-east-1.amazonaws.com/bistudios/_00/dev_edit/graphics/2025/09/2025-09-datacenters-map-table/index.js

I'll use Python to convert the dataset into line-delimited JSON.

$ python3
import json

import esprima


resp = esprima.parseScript(
            open('index.js', 'r')
                .read()
                .split('features:Q0},')[-1]
                .split(',I9=[];')[0])


def get_key(prop):
    if prop.key.value:
        return prop.key.value.lower()

    if prop.key.name:
        return prop.key.name.lower()

    return 'unknown'


dataset = [
    {get_key(prop): prop.value.value
     for prop in element.properties}
    for element in resp.body[0].expression.right.elements]

with open('data_centres.json', 'w') as f:
    for rec in dataset:
        f.write(json.dumps(rec, sort_keys=True) + '\n')
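
Before moving on, it can be worth checking that the line-delimited JSON reads back cleanly. Below is a minimal sketch, using only Python's standard library, that counts the records and how often each key appears. The function name and the sample lines are illustrative and aren't part of the original workflow.

```python
import json
from collections import Counter


def summarise_jsonl(lines):
    """Count records and how often each key appears across them."""
    key_counts = Counter()
    num_records = 0

    for line in lines:
        if not line.strip():
            continue  # Skip blank lines

        rec = json.loads(line)
        num_records += 1
        key_counts.update(rec.keys())

    return num_records, key_counts


# Illustrative sample; the second record is missing the "state" key
sample = ['{"brand": "Apple", "state": "AZ"}',
          '{"brand": "Amazon"}',
          '']

num_records, key_counts = summarise_jsonl(sample)
print(num_records, 'records,', len(key_counts), 'distinct keys')
```

Passing open('data_centres.json') instead of the sample list should run the same check against the real file. Any key that appears in fewer records than the total will surface as NULLs once DuckDB reads the file.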

I'll use DuckDB to convert the JSON into a Parquet file. This will make analysis quicker than working with either the JavaScript or JSON files above.

$ ~/duckdb
COPY (
    SELECT * EXCLUDE(long, lat),
           ST_POINT(long::FLOAT, lat::FLOAT) geometry
    FROM 'data_centres.json'
    ORDER BY HILBERT_ENCODE([ST_Y(ST_CENTROID(geometry)),
                             ST_X(ST_CENTROID(geometry))]::double[2])
) TO 'data_centres.parquet' (
        FORMAT 'PARQUET',
        CODEC  'ZSTD',
        COMPRESSION_LEVEL 22,
        ROW_GROUP_SIZE 15000);

The above produced a 234 KB, 1,240-row, 98-column Parquet file.
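
The ORDER BY HILBERT_ENCODE clause in the COPY statement above sorts the rows along a space-filling curve, so sites that are close to one another geographically tend to sit close to one another in the file. Below is a minimal sketch of the classic 2D Hilbert curve index on an integer grid. It is not Lindel's implementation, which encodes the numeric values it is given; the function name and the 4x4 grid are assumptions made for illustration.

```python
def hilbert_index(n, x, y):
    """Return the position of cell (x, y) along a Hilbert curve that
    fills an n-by-n grid, where n is a power of two."""
    d = 0
    s = n // 2

    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)

        # Rotate / flip the quadrant so the sub-curve lines up
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x

        s //= 2

    return d


# List every cell of a 4x4 grid in the order the curve visits them
cells = sorted(((x, y) for x in range(4) for y in range(4)),
               key=lambda c: hilbert_index(4, c[0], c[1]))
print(cells)
```

Each cell in the sorted list is a direct neighbour of the one before it, which is the property that keeps nearby points together once a table is sorted by this index.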

Data Fluency

Below is an example record from this dataset. It's of a hyperscaler-level site owned by Apple in Arizona.

$ echo "FROM  'data_centres.parquet'
        WHERE address LIKE '%Mesa AZ 85212'
        AND   brand = 'Apple'
        LIMIT 1" \
    | ~/duckdb -json \
    | jq -S .
[
  {
    "# facilities note": "",
    "# of tier 2": "51",
    "# of tier 3": "",
    "# of tier 4": "",
    "address": "3740 S Signal Butte Rd, Mesa AZ 85212",
    "annual water consumption (gallons)": "-",
    "aquifer name": "-",
    "avert region": "Southwest",
    "brand": "Apple",
    "brand source": "https://www.sec.gov/Archives/edgar/data/1394954/000144530514004770/a8-kapplesettlementagreeme.htm",
    "case status desc": "-",
    "city": "Mesa",
    "co tpy": "-",
    "co2e tpy": "-",
    "company": "Platypus Development LLC",
    "county": "Maricopa",
    "daily water consumption (gallons)": "-",
    "data center construct year": "",
    "estimate power consumption in kw/hr (calculated 30%)": "55864",
    "estimate power consumption in kw/hr (calculated 50%)": "93106",
    "estimate power consumption in kw/hr (calculated 60%)": "111,727",
    "estimated at 2n": "46553",
    "estimated at 2n (in twh)": "0.408",
    "estimated at n +1(in twh)": "0.652",
    "estimated at n+1": "74484.8",
    "estimated data center electricity use in megawatt-hours at 50% capacity": "93.11",
    "estimated data center electricity use in terrawatt-hours a year at 50% capacity": "0.82",
    "first permit issue year": "2020",
    "generator type": "Caterpillar",
    "geometry": "POINT (-111.60407257080078 33.34719467163086)",
    "large facilites": "Possible hyperscaler",
    "latest permit issue year": "2022",
    "link to records": "Apple - Platypus Development LLC.docx",
    "major basin name": "North America, Colorado",
    "minor basin name": "Middle Gila",
    "new permit or existing update?": "Existing update",
    "nox tpy": "90",
    "number of buildings": "",
    "number of epa enforcements": "-",
    "page": "5",
    "pm tpy": "-",
    "pm10 tpy": "-",
    "pm2.5 tpy": "-",
    "population within 1-mile radius - 10,000": "fewer than 10,000",
    "population within 1-mile radius - 5,000": "More than 5,000",
    "primary law": "-",
    "private equity or asset manager?": "",
    "private equity or asset manager? (source)": "",
    "rate capacity original": "249715 HP",
    "reporter": "Narimes",
    "reporter notes": "eight emergency generators rated at 3604 HP, two rated at 2206 HP, one rated at 762 HP, one rated at 1141 HP, fifty seven rated at 5646 HP",
    "size capacity over 100 mw": "-",
    "size category at 30%": "Possible hyperscaler",
    "size category at 50% capacity": "Possible hyperscaler",
    "size category at 50% capacity 'large' vs 'small'": "large-scale",
    "size category at 60%": "Possible hyperscaler",
    "sox tpy": "-",
    "state": "AZ",
    "state environmental justice concern": "no",
    "state percentile for ej index for diesel particulate matter": "32",
    "state percentile for ej index for drinking water non-compliance": "68",
    "state percentile for ej index for hazardous waste proximity": "39",
    "state percentile for ej index for lead paint indicator": "48",
    "state percentile for ej index for nitrogen dioxide (no2)": "27",
    "state percentile for ej index for ozone": "52",
    "state percentile for ej index for particulate matter": "35",
    "state percentile for ej index for rmp proximity": "44",
    "state percentile for ej index for superfund proximity": "68",
    "state percentile for ej index for toxic releases to air": "55",
    "state percentile for ej index for traffic proximity and volume": "33",
    "state percentile for ej index for underground storage tanks (ust) indicator": "36",
    "state percentile for ej index for wastewater discharge indicator": "32",
    "statename": "Arizona",
    "total generator rate capacity kw": "186212",
    "total penalties assessed": "-",
    "total population within 1 mile of site": "9,026",
    "unique id": "37",
    "unknown": "60518.9",
    "us environmental justice concern": "no",
    "us percentile for ej index for diesel particulate matter": "45",
    "us percentile for ej index for drinking water non-compliance": "79",
    "us percentile for ej index for hazardous waste proximity": "33",
    "us percentile for ej index for lead paint indicator": "15",
    "us percentile for ej index for nitrogen dioxide (no2)": "47",
    "us percentile for ej index for ozone": "62",
    "us percentile for ej index for particulate matter": "25",
    "us percentile for ej index for rmp proximity": "41",
    "us percentile for ej index for superfund proximity": "66",
    "us percentile for ej index for toxic releases to air": "51",
    "us percentile for ej index for traffic proximity and volume": "42",
    "us percentile for ej index for underground storage tanks (ust) indicator": "38",
    "us percentile for ej index for wastewater discharge indicator": "50",
    "vocs tpy": "-",
    "water notes": "-",
    "water record link": "-",
    "water requested?": "denied",
    "water stress": "Extremely High (>80%)",
    "zip": "85212"
  }
]
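
Several of the power figures in the record above appear to be derived from the generators' rated horsepower. Below is a sketch that reproduces them. The 0.7457 kW-per-horsepower conversion is the standard rounded value for mechanical horsepower, while the 25% and 40% factors for the 2N and N+1 estimates are inferred from this single record's numbers and aren't documented in the dataset.

```python
HP_TO_KW = 0.7457       # Kilowatts per mechanical horsepower (rounded)
HOURS_PER_YEAR = 8760   # 365 days x 24 hours

rated_hp = 249_715                      # "rate capacity original"
total_kw = round(rated_hp * HP_TO_KW)   # "total generator rate capacity kw"

at_30 = round(total_kw * 0.30)          # "... (calculated 30%)"
at_50 = round(total_kw * 0.50)          # "... (calculated 50%)"
at_60 = round(total_kw * 0.60)          # "... (calculated 60%)"

# Assumption: these two factors are inferred from the record, not documented
at_2n = round(total_kw * 0.25)          # "estimated at 2n"
at_n1 = round(total_kw * 0.40, 1)       # "estimated at n+1"


def kw_to_twh_per_year(kw):
    """Convert a constant draw in kW into terawatt-hours per year."""
    return kw * HOURS_PER_YEAR / 1e9


print(total_kw, at_30, at_50, at_60, at_2n, at_n1)
print(round(kw_to_twh_per_year(at_2n), 3),
      round(kw_to_twh_per_year(at_n1), 3),
      round(kw_to_twh_per_year(at_50), 2))
```

These line up with the 186212, 55864, 93106, 111,727, 46553 and 74484.8 values in the record, as well as its 0.408, 0.652 and 0.82 TWh estimates.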

Below are the field names, data types, percentages of NULLs per column, number of unique values and minimum and maximum values for each column.

$ ~/duckdb
.maxrows 500

SELECT   column_name,
         column_type,
         null_percentage,
         approx_unique,
         min[:30],
         max[:30]
FROM     (SUMMARIZE
          FROM   'data_centres.parquet')
ORDER BY 1;
┌─────────────────────────────────────────────────────────────────────────────────┬─────────────┬─────────────────┬───────────────┬────────────────────────────────┬────────────────────────────────┐
│                                   column_name                                   │ column_type │ null_percentage │ approx_unique │            min[:30]            │            max[:30]            │
│                                     varchar                                     │   varchar   │  decimal(9,2)   │     int64     │            varchar             │            varchar             │
├─────────────────────────────────────────────────────────────────────────────────┼─────────────┼─────────────────┼───────────────┼────────────────────────────────┼────────────────────────────────┤
│ # facilities note                                                               │ VARCHAR     │            0.00 │             9 │                                │ 1 of 8 facilities              │
│ # of tier 2                                                                     │ VARCHAR     │            0.00 │            12 │                                │ 8                              │
│ # of tier 3                                                                     │ VARCHAR     │            0.00 │             2 │                                │ 1                              │
│ # of tier 4                                                                     │ VARCHAR     │            0.00 │             2 │                                │ 414                            │
│ address                                                                         │ VARCHAR     │            0.40 │          1360 │ 0 Williams Road, Palmetto, Geo │ intersection of Morse Rd and B │
│ annual water consumption (gallons)                                              │ VARCHAR     │            0.00 │            67 │                                │ aggregate                      │
│ aquifer name                                                                    │ VARCHAR     │            0.00 │             6 │ -                              │ Northern Great Plains / Interi │
│ avert region                                                                    │ VARCHAR     │            0.00 │            14 │ California                     │ Texas                          │
│ brand                                                                           │ VARCHAR     │            0.00 │           345 │ 11:11 Systems                  │ unWired Broadband              │
│ brand source                                                                    │ VARCHAR     │            0.24 │           620 │                                │ zColo = Databank               │
│ case status desc                                                                │ VARCHAR     │            0.00 │             5 │                                │ Resolved                       │
│ city                                                                            │ VARCHAR     │            0.00 │           376 │                                │ Wood Dale                      │
│ co tpy                                                                          │ VARCHAR     │            0.00 │           604 │                                │ 99.9                           │
│ co2e tpy                                                                        │ VARCHAR     │            0.00 │            38 │                                │ 9772.31                        │
│ company                                                                         │ VARCHAR     │            0.08 │           742 │ 1000 Coit Road, L.P.           │ zColo, LLC                     │
│ county                                                                          │ VARCHAR     │            0.00 │           198 │ Accomack                       │ Yolo                           │
│ daily water consumption (gallons)                                               │ VARCHAR     │            0.00 │            10 │                                │ aggregate water                │
│ data center construct year                                                      │ VARCHAR     │            0.00 │             5 │                                │ in progress                    │
│ estimate power consumption in kw/hr (calculated 30%)                            │ VARCHAR     │            0.00 │           630 │ 100                            │ redacted                       │
│ estimate power consumption in kw/hr (calculated 50%)                            │ VARCHAR     │            0.00 │           801 │ 100                            │ redacted                       │
│ estimate power consumption in kw/hr (calculated 60%)                            │ VARCHAR     │            0.00 │           778 │ 1,048                          │ redacted                       │
│ estimated at 2n                                                                 │ VARCHAR     │            0.00 │           930 │ 100                            │ redacted                       │
│ estimated at 2n (in twh)                                                        │ VARCHAR     │            0.00 │           335 │ 0                              │ redacted                       │
│ estimated at n +1(in twh)                                                       │ VARCHAR     │            0.00 │           388 │ 0                              │ redacted                       │
│ estimated at n+1                                                                │ VARCHAR     │            0.00 │           881 │ 100                            │ redacted                       │
│ estimated data center electricity use in megawatt-hours at 50% capacity         │ VARCHAR     │            0.00 │           820 │ 0.02                           │ redacted                       │
│ estimated data center electricity use in terrawatt-hours a year at 50% capacity │ VARCHAR     │            0.00 │           130 │ 0                              │ redacted                       │
│ first permit issue year                                                         │ VARCHAR     │            0.00 │            38 │ 1976                           │ 2024                           │
│ generator type                                                                  │ VARCHAR     │            0.08 │           134 │                                │ Waukesha; Doosan               │
│ geometry                                                                        │ GEOMETRY    │            0.00 │          1165 │ POINT (-112.328125 33.57888793 │ POINT (-121.95311737060547 37. │
│ large facilites                                                                 │ VARCHAR     │            0.00 │             5 │ Possible hyperscaler           │ redacted                       │
│ latest permit issue year                                                        │ VARCHAR     │            0.00 │            34 │ 1977                           │ 2025                           │
│ link to records                                                                 │ VARCHAR     │            0.00 │           876 │                                │ wsdc0101.pdf wsdc0102.pdf wsdc │
│ major basin name                                                                │ VARCHAR     │            0.00 │            11 │                                │ United States, North Atlantic  │
│ minor basin name                                                                │ VARCHAR     │            0.00 │           184 │                                │ Wheeler Lake                   │
│ new permit or existing update?                                                  │ VARCHAR     │            0.00 │             2 │ Existing update                │ New                            │
│ nox tpy                                                                         │ VARCHAR     │            0.00 │           712 │ -                              │ 99.97                          │
│ number of buildings                                                             │ VARCHAR     │            0.00 │            10 │                                │ 8                              │
│ number of epa enforcements                                                      │ VARCHAR     │            0.00 │             5 │                                │ 3                              │
│ page                                                                            │ VARCHAR     │            0.00 │           209 │                                │ rows 9481-9484                 │
│ pm tpy                                                                          │ VARCHAR     │            0.00 │           280 │                                │ 9.9                            │
│ pm10 tpy                                                                        │ VARCHAR     │            0.00 │           345 │                                │ 9.97                           │
│ pm2.5 tpy                                                                       │ VARCHAR     │            0.00 │           344 │                                │ 9.9                            │
│ population within 1-mile radius - 10,000                                        │ VARCHAR     │            0.00 │             3 │ -                              │ fewer than 10,000              │
│ population within 1-mile radius - 5,000                                         │ VARCHAR     │            0.00 │             3 │                                │ fewer than 5,000               │
│ primary law                                                                     │ VARCHAR     │            0.00 │             4 │                                │ CWA                            │
│ private equity or asset manager?                                                │ VARCHAR     │            0.00 │            20 │                                │ pg 1 PI32426_PCP210001.pdf     │
│ private equity or asset manager? (source)                                       │ VARCHAR     │            0.00 │            29 │                                │ https://www.streamdatacenters. │
│ rate capacity original                                                          │ VARCHAR     │            0.00 │           453 │                                │ [combined]                     │
│ reporter                                                                        │ VARCHAR     │            0.00 │            12 │                                │ Yuheng/Rosemarie               │
│ reporter notes                                                                  │ VARCHAR     │           16.69 │          1142 │                                │ two operating scenarios: 1) fo │
│ size capacity over 100 mw                                                       │ VARCHAR     │            0.00 │             3 │                                │ over 100 MW                    │
│ size category at 30%                                                            │ VARCHAR     │            0.00 │             5 │ Possible hyperscaler           │ redacted                       │
│ size category at 50% capacity                                                   │ VARCHAR     │            0.00 │             5 │ Possible hyperscaler           │ redacted                       │
│ size category at 50% capacity 'large' vs 'small'                                │ VARCHAR     │            0.00 │             5 │ large-scale                    │ small-scale                    │
│ size category at 60%                                                            │ VARCHAR     │            0.00 │             5 │ Possible hyperscaler           │ redacted                       │
│ sox tpy                                                                         │ VARCHAR     │            0.00 │           243 │                                │ 9.907                          │
│ state                                                                           │ VARCHAR     │            0.00 │            50 │ AL                             │ WY                             │
│ state environmental justice concern                                             │ VARCHAR     │            0.00 │             4 │                                │ yes                            │
│ state percentile for ej index for diesel particulate matter                     │ VARCHAR     │            0.00 │            85 │                                │ 99                             │
│ state percentile for ej index for drinking water non-compliance                 │ VARCHAR     │            0.00 │            41 │                                │ 99                             │
│ state percentile for ej index for hazardous waste proximity                     │ VARCHAR     │            0.00 │            84 │                                │ 99                             │
│ state percentile for ej index for lead paint indicator                          │ VARCHAR     │            0.00 │            92 │                                │ 98                             │
│ state percentile for ej index for nitrogen dioxide (no2)                        │ VARCHAR     │            0.00 │            88 │                                │ 98                             │
│ state percentile for ej index for ozone                                         │ VARCHAR     │            0.00 │            92 │                                │ 99                             │
│ state percentile for ej index for particulate matter                            │ VARCHAR     │            0.00 │            90 │                                │ 99                             │
│ state percentile for ej index for rmp proximity                                 │ VARCHAR     │            0.00 │            84 │                                │ 99                             │
│ state percentile for ej index for superfund proximity                           │ VARCHAR     │            0.00 │            59 │                                │ 99                             │
│ state percentile for ej index for toxic releases to air                         │ VARCHAR     │            0.00 │            92 │                                │ 98                             │
│ state percentile for ej index for traffic proximity and volume                  │ VARCHAR     │            0.00 │            92 │                                │ 99                             │
│ state percentile for ej index for underground storage tanks (ust) indicator     │ VARCHAR     │            0.00 │            81 │                                │ 98                             │
│ state percentile for ej index for wastewater discharge indicator                │ VARCHAR     │            0.00 │            92 │                                │ 99                             │
│ statename                                                                       │ VARCHAR     │            0.00 │            49 │ Alabama                        │ Wyoming                        │
│ total generator rate capacity kw                                                │ VARCHAR     │            0.00 │           825 │ 100                            │ redacted                       │
│ total penalties assessed                                                        │ VARCHAR     │            0.00 │            38 │                                │ -                              │
│ total population within 1 mile of site                                          │ VARCHAR     │            0.00 │          1166 │ -                              │ 994                            │
│ unique id                                                                       │ VARCHAR     │            0.00 │          1103 │ 10                             │ 999                            │
│ unknown                                                                         │ VARCHAR     │            0.00 │           923 │ 1006.53                        │ redacted                       │
│ us environmental justice concern                                                │ VARCHAR     │            0.00 │             4 │                                │ yes                            │
│ us percentile for ej index for diesel particulate matter                        │ VARCHAR     │            0.00 │            92 │                                │ 99                             │
│ us percentile for ej index for drinking water non-compliance                    │ VARCHAR     │            0.00 │            27 │                                │ 99                             │
│ us percentile for ej index for hazardous waste proximity                        │ VARCHAR     │            0.00 │            78 │                                │ 99                             │
│ us percentile for ej index for lead paint indicator                             │ VARCHAR     │            0.00 │            78 │                                │ 97                             │
│ us percentile for ej index for nitrogen dioxide (no2)                           │ VARCHAR     │            0.00 │            92 │                                │ 99                             │
│ us percentile for ej index for ozone                                            │ VARCHAR     │            0.00 │            92 │                                │ 99                             │
│ us percentile for ej index for particulate matter                               │ VARCHAR     │            0.00 │            88 │ -                              │ 99                             │
│ us percentile for ej index for rmp proximity                                    │ VARCHAR     │            0.00 │            69 │                                │ 99                             │
│ us percentile for ej index for superfund proximity                              │ VARCHAR     │            0.00 │            39 │                                │ 98                             │
│ us percentile for ej index for toxic releases to air                            │ VARCHAR     │            0.00 │            92 │                                │ 99                             │
│ us percentile for ej index for traffic proximity and volume                     │ VARCHAR     │            0.00 │            91 │                                │ 99                             │
│ us percentile for ej index for underground storage tanks (ust) indicator        │ VARCHAR     │            0.00 │            69 │                                │ 99                             │
│ us percentile for ej index for wastewater discharge indicator                   │ VARCHAR     │            0.00 │            91 │                                │ 98                             │
│ vocs tpy                                                                        │ VARCHAR     │            0.00 │           382 │                                │ 9.94                           │
│ water notes                                                                     │ VARCHAR     │            0.08 │            51 │                                │ usage is from Jan 2023-Dec 202 │
│ water record link                                                               │ VARCHAR     │            0.00 │            47 │                                │ https://www.denverpost.com/202 │
│ water requested?                                                                │ VARCHAR     │            0.00 │             7 │                                │ yes                            │
│ water stress                                                                    │ VARCHAR     │            0.00 │             6 │ Arid and Low Water Use         │ Medium - High (20-40%)         │
│ zip                                                                             │ VARCHAR     │            0.00 │           552 │ -                              │ 99019                          │
├─────────────────────────────────────────────────────────────────────────────────┴─────────────┴─────────────────┴───────────────┴────────────────────────────────┴────────────────────────────────┤
│ 98 rows                                                                                                                                                                                 6 columns │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

A lot could be done to group these fields into dictionaries and normalise their values. If this post proves popular, I might revisit this.
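
As a starting point, below is a minimal sketch of how the numeric columns could be normalised. Every column was read in as a VARCHAR and values mix thousands separators, unit suffixes and placeholders like "-" and "redacted". The helper name and the set of placeholders are assumptions based on the values shown above rather than an exhaustive survey of the dataset.

```python
import re

# Placeholders seen above that carry no numeric value
MISSING = {'', '-', 'redacted'}

# A leading number with optional thousands separators and decimals
NUMBER = re.compile(r'-?\d[\d,]*(?:\.\d+)?')


def to_number(raw):
    """Convert strings such as '111,727' or '249715 HP' to a float.

    Returns None for placeholders and for text that doesn't start
    with a number."""
    text = raw.strip()

    if text.lower() in MISSING:
        return None

    match = NUMBER.match(text)

    if match is None:
        return None

    return float(match.group().replace(',', ''))


for value in ('111,727', '249715 HP', '0.408', '-', 'redacted', 'aggregate'):
    print(repr(value), '->', to_number(value))
```

Free-text columns that happen to start with a digit would also be converted by this helper, so it should only be applied to columns that are expected to be numeric.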

Diesel Generator Permits

Virginia has published a list of 188 diesel generator permits they've issued to data centers in their state.

In some cases, the well-known trading name of the applicant will be stated. The following is a quote from a permit issued to AWS.

Attached is a permit to construct and operate emergency diesel engine generator sets (gen-sets) at Amazon Data Services’ data centers (IAD-51, IAD-56, IAD-88, IAD-89, IAD-192, and IAD214), in accordance with the provisions of the Commonwealth of Virginia State Air Pollution Control Board (Board’s) Regulations for the Control and Abatement of Air Pollution (Regulations). This permit document combines the terms and conditions from, and supersedes your permit document dated October 12, 2023.

Below is a manifest of diesel generators they're using at their site.

Not all states make permits accessible via Google searches. If anyone is looking to continue this research, the BBC listed the top data center jurisdictions, both inside and outside of the US, along with their market share. This can help narrow the search space.

14% Northern Virginia
 6% Oregon
 4% Iowa
 3% Dallas, Texas
 2% Arizona
 2% Nebraska
 2% Illinois
 6% Beijing
 4% Dublin
 2% Shanghai

Data Center Owners

Below are the most and least common brands represented in this dataset.

$ ~/duckdb
SELECT   COUNT(*),
         brand
FROM     'data_centres.parquet'
GROUP BY 2
ORDER BY 1 DESC;
┌──────────────┬──────────────────────────────────────────────────────────────┐
│ count_star() │                            brand                             │
│    int64     │                           varchar                            │
├──────────────┼──────────────────────────────────────────────────────────────┤
│          177 │ Amazon                                                       │
│           70 │ Digital Realty                                               │
│           53 │ Equinix                                                      │
│           47 │ Google                                                       │
│           44 │ Microsoft                                                    │
│           34 │ QTS                                                          │
│           33 │ DataBank                                                     │
│           33 │ Centersquare                                                 │
│           32 │ Meta                                                         │
│           31 │ CyrusOne                                                     │
│           31 │ Lumen Technologies                                           │
│           23 │ Verizon                                                      │
│           19 │ TierPoint                                                    │
│           18 │ Flexential                                                   │
│           17 │ CoreSite                                                     │
│           16 │ Iron Mountain                                                │
│           15 │ NTT                                                          │
│           14 │ Stack Infrastructure                                         │
│           13 │ Cogent                                                       │
│           13 │ EdgeConneX                                                   │
│            · │     ·                                                        │
│            · │     ·                                                        │
│            · │     ·                                                        │
│            1 │ SOMO village                                                 │
│            1 │ Datacate                                                     │
│            1 │ PC Solutions                                                 │
│            1 │ Showers Development, LLC                                     │
│            1 │ US Signal                                                    │
│            1 │ Reliance Industries                                          │
│            1 │ Inova                                                        │
│            1 │ George Washington University                                 │
│            1 │ The Vanguard Group                                           │
│            1 │ University of Minnesota - Minnesota Supercomputing Institute │
│            1 │ WW Grainger                                                  │
│            1 │ MasterCard                                                   │
│            1 │ PSCU Financial Services                                      │
│            1 │ SRI Ten 706 Wilshire LLC                                     │
│            1 │ Rosegate LLC                                                 │
│            1 │ Stanford University Data Center                              │
│            1 │ California Legislative Counsel                               │
│            1 │ Adobe                                                        │
│            1 │ American Honda Motor Company                                 │
│            1 │ NeuStar                                                      │
├──────────────┴──────────────────────────────────────────────────────────────┤
│ 305 rows (40 shown)                                               2 columns │
└─────────────────────────────────────────────────────────────────────────────┘

Locations Heatmap

Below is a heatmap of the data center locations. The brighter hexagons have more sites.

$ ~/duckdb
CREATE OR REPLACE TABLE h3_3_stats AS
    SELECT   H3_LATLNG_TO_CELL(
                ST_Y(ST_CENTROID(geometry)),
                ST_X(ST_CENTROID(geometry)), 3) AS h3_3,
             COUNT(*) num_buildings
    FROM     'data_centres.parquet'
    GROUP BY 1;

COPY (
    SELECT ST_ASWKB(H3_CELL_TO_BOUNDARY_WKT(h3_3)::geometry) geometry,
           num_buildings
    FROM   h3_3_stats
    WHERE  ST_XMIN(geometry::geometry) BETWEEN -179 AND 179
    AND    ST_XMAX(geometry::geometry) BETWEEN -179 AND 179
) TO 'h3_4_stats.parquet' (
        FORMAT 'PARQUET',
        CODEC  'ZSTD',
        COMPRESSION_LEVEL 22,
        ROW_GROUP_SIZE 15000);

American Data Centers
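
The two longitude checks in the COPY statement above appear to be there to drop hexagons that straddle the antimeridian, since their boundaries jump between -180 and 180 and tend to render as stripes across the whole map. Below is a minimal Python sketch of that idea. The `within_safe_longitudes` helper and both sample rings are hypothetical and aren't part of the pipeline above.

```python
def within_safe_longitudes(ring, limit=179.0):
    """Return True when every vertex longitude sits within [-limit, limit].

    `ring` is a list of (lon, lat) tuples. This mirrors the ST_XMIN and
    ST_XMAX checks in spirit only; it is not DuckDB's implementation.
    """
    lons = [lon for lon, _ in ring]
    return -limit <= min(lons) and max(lons) <= limit


# Hypothetical hexagon well inside the continental US.
inland = [(-97.0, 35.0), (-96.5, 35.4), (-96.0, 35.0),
          (-96.0, 34.5), (-96.5, 34.1), (-97.0, 34.5)]

# Hypothetical hexagon whose vertices wrap around the antimeridian.
wrapped = [(179.6, 52.0), (-179.8, 52.3), (-179.4, 52.0),
           (-179.4, 51.6), (-179.8, 51.3), (179.6, 51.6)]

# Only rings that stay clear of the +/-180 degree seam are kept.
kept = [ring for ring in (inland, wrapped) if within_safe_longitudes(ring)]
```

A filter like this trades a handful of cells near the date line for a cleaner map, which is a reasonable trade for a dataset centred on the US.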

In many cases, a single record points to a single building.

American Data Centers

But in other cases, the point rests on the parcel of land that more than one facility is located on.

American Data Centers

The metadata does mention when a record refers to more than one building, but it would be nice to see this dataset turned into a building footprint-specific dataset at some point.

Hyperscalers

Each location has some indication of whether it's a hyperscaler location or not.

SELECT   "large facilites",
         COUNT(*)
FROM     'data_centres.parquet'
GROUP BY 1;
┌──────────────────────┬──────────────┐
│   large facilites    │ count_star() │
│       varchar        │    int64     │
├──────────────────────┼──────────────┤
│ Possible hyperscaler │          156 │
│ multiple permited    │          165 │
│ no value             │           19 │
│ not a hyperscaler    │          880 │
│ redacted             │           20 │
└──────────────────────┴──────────────┘
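
As a quick sanity check, the five counts above add up to the 1,240 sites in the dataset. The Python sketch below turns them into percentage shares. The counts are copied by hand from the table rather than queried, so treat it as an illustration only.

```python
# Counts copied from the query result above. The labels keep the
# dataset's own spelling.
counts = {
    'Possible hyperscaler': 156,
    'multiple permited': 165,
    'no value': 19,
    'not a hyperscaler': 880,
    'redacted': 20,
}

total = sum(counts.values())

# Share of all records per label, rounded to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

for label, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f'{pct:5.1f}%  {label}')
```

Roughly seven in ten records are flagged as not being a hyperscaler, while the possible hyperscalers make up about one in eight.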

Below is the number of possible hyperscaler sites by brand. It's interesting to see Apple only has two locations.

SELECT   brand,
         COUNT(*)
FROM     'data_centres.parquet'
WHERE    "large facilites" = 'Possible hyperscaler'
GROUP BY 1
ORDER BY 2 DESC;
┌─────────────────────────────┬──────────────┐
│            brand            │ count_star() │
│           varchar           │    int64     │
├─────────────────────────────┼──────────────┤
│ Amazon                      │           45 │
│ Microsoft                   │           21 │
│ Google                      │           12 │
│ QTS                         │           12 │
│ Aligned Data Centers        │            9 │
│ Digital Realty              │            9 │
│ Meta                        │            8 │
│ CyrusOne                    │            6 │
│ Vantage Data Centers        │            5 │
│ NTT                         │            5 │
│ CloudHQ                     │            4 │
│ Stack Infrastructure        │            4 │
│ Compass Datacenters         │            3 │
│ Apple                       │            2 │
│ Yondr                       │            1 │
│ Stream Data Centers         │            1 │
│ Edged Energy                │            1 │
│ Sabey Data Centers          │            1 │
│ Corscale Data Centers       │            1 │
│ Iron Mountain               │            1 │
│ CoreSite                    │            1 │
│ Skybox                      │            1 │
│ Cologix                     │            1 │
│ Equinix                     │            1 │
│ US National Security Agency │            1 │
├─────────────────────────────┴──────────────┤
│ 25 rows                          2 columns │
└────────────────────────────────────────────┘

Below are the individual hyperscaler locations along with the year of their first permit.

American Data Centers

OpenStreetMap Data

I looked for metadata in OpenStreetMap (OSM) for data centers so I could get an idea of how it compares against Business Insider's dataset. I was only able to locate ~900 building footprints with OSM's Layercake dataset, which is updated weekly.

$ ~/duckdb
SELECT   COUNT(*),
         tags.building,
         tags."building:use"
FROM     'https://data.openstreetmap.us/layercake/buildings.parquet'
WHERE    tags.building       ILIKE '%data%'
OR       tags."building:use" ILIKE '%data%'
GROUP BY 2, 3
ORDER BY 1 DESC;
┌──────────────┬────────────────────────────────────────────────────────────┬──────────────────┐
│ count_star() │                          building                          │   building:use   │
│    int64     │                          varchar                           │     varchar      │
├──────────────┼────────────────────────────────────────────────────────────┼──────────────────┤
│          901 │ data_center                                                │ NULL             │
│            8 │ datacenter                                                 │ NULL             │
│            8 │ data_centre                                                │ NULL             │
│            3 │ data                                                       │ NULL             │
│            3 │ industrial                                                 │ data_center      │
│            2 │ data center                                                │ NULL             │
│            1 │ Structure added, not on data, appears on satellite imagery │ NULL             │
│            1 │ sa_data_yaye                                               │ NULL             │
│            1 │ data_center                                                │ industrial       │
│            1 │ Ciber DataClic                                             │ NULL             │
│            1 │ office                                                     │ data_center      │
│            1 │ yes                                                        │ NDATANG K KORUNG │
│            1 │ industrial                                                 │ datacenter       │
│            1 │ apartments                                                 │ data_center      │
├──────────────┴────────────────────────────────────────────────────────────┴──────────────────┤
│ 14 rows                                                                            3 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────┘
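
The ILIKE '%data%' filter above also picks up a few unrelated values. One way to tighten it is to match only the spelling variants of "data center". The Python sketch below does that against the rows of the table above, which are copied in by hand. The `DC_TAG` pattern and the `is_data_center` helper are hypothetical names, and leaving out the bare `data` value is an assumption on my part.

```python
import re

# Matches 'data_center', 'datacenter', 'data_centre' and 'data center',
# ignoring case. The bare value 'data' is deliberately left out.
DC_TAG = re.compile(r'^data[ _]?cent(er|re)$', re.IGNORECASE)


def is_data_center(building, building_use):
    """Return True if either tag value looks like a data center label."""
    return any(value is not None and DC_TAG.match(value.strip())
               for value in (building, building_use))


# (count, building, building:use) rows copied from the table above.
rows = [
    (901, 'data_center', None),
    (8, 'datacenter', None),
    (8, 'data_centre', None),
    (3, 'data', None),
    (3, 'industrial', 'data_center'),
    (2, 'data center', None),
    (1, 'Structure added, not on data, appears on satellite imagery', None),
    (1, 'sa_data_yaye', None),
    (1, 'data_center', 'industrial'),
    (1, 'Ciber DataClic', None),
    (1, 'office', 'data_center'),
    (1, 'yes', 'NDATANG K KORUNG'),
    (1, 'industrial', 'datacenter'),
    (1, 'apartments', 'data_center'),
]

matched = sum(n for n, building, use in rows if is_data_center(building, use))
total = sum(n for n, _, _ in rows)
```

With these rows, 926 of the 933 footprints are kept and the seven remaining records are left for manual review.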

Below are the data centers in OSM for the Bay Area.

American Data Centers

Below are Business Insider's.

American Data Centers

I noted that telecom: data_center or variations of that attribute are used from time to time in OSM. Below is one example for an Apple data center.

American Data Centers

This attribute isn't making its way into Layercake. I've raised a ticket, so hopefully that will help fill in the gaps in OSM's DC coverage at some point.

$ echo "SELECT * EXCLUDE(tags),
               tags: tags::JSON
        FROM   'https://data.openstreetmap.us/layercake/buildings.parquet'
        WHERE  id = 300974499
        LIMIT  1" \
    | ~/duckdb -json \
    | jq -S .
[
  {
    "bbox": "{'xmin': -111.60577, 'ymin': 33.34491, 'xmax': -111.60252, 'ymax': 33.349155}",
    "geometry": "MULTIPOLYGON (((-111.6057634 33.3462637, -111.6057589 33.3458225, -111.6054351 33.3458248, -111.6054259 33.3449094, -111.6027724 33.344928, -111.60278 33.3456873, -111.6025203 33.3456891, -111.6025379 33.3474469, -111.6025464 33.3482889, -111.6028057 33.3482871, -111.6028144 33.3491573, -111.6054686 33.3491387, -111.6054397 33.346266, -111.6057634 33.3462637)))",
    "id": 300974499,
    "tags": {
      "access": null,
      "addr:city": "Mesa",
      "addr:housenumber": "3740",
      "addr:postcode": "85212",
      "addr:street": "South Signal Butte Road",
      "building": "industrial",
      "building:colour": null,
      "building:flats": null,
      "building:levels": null,
      "building:material": null,
      "building:part": null,
      "building:use": "data_center",
      "height": null,
      "name": "Apple Data Center",
      "roof:colour": null,
      "roof:height": null,
      "roof:levels": null,
      "roof:material": null,
      "roof:orientation": null,
      "roof:shape": null,
      "start_date": "2012",
      "website": null,
      "wheelchair": null,
      "wikidata": null,
      "wikipedia": null
    },
    "type": "way"
  }
]

Diesel Generators

I wanted to do some analysis of the manufacturers of the diesel generators. Around a third of the records either don't name the manufacturer(s) or weren't handled properly by my normalisation script below.

$ python3
from collections import Counter
import re

import duckdb
from   Levenshtein import levenshtein_cpp


brands = [
    'Allis-Chalmers',
    'Alzeta',
    'Baudouin',
    'Baylor',
    'Blue Star',
    'Caterpillar',
    'Clarke',
    'Cummins',
    'Detroit Diesel',
    'Deutz',
    'Doosan',
    'Generac',
    'Hino',
    'Hitec',
    'John Deere',
    'Katolight',
    'Kohler',
    'Mercedes-Benz',
    'Mitsubishi',
    'MTU',
    'Olympian',
    'Onan',
    'Perkins',
    'Rolls Royce',
    'Solar Saturn',
    'Turbinen-Union',
    'Volvo',
    'Waukesha',]

duckdb.sql('INSTALL parquet')
duckdb.sql('LOAD parquet')

sql = '''SELECT "generator type"
         FROM   'data_centres.parquet';'''

resp = []

for row in duckdb.sql(sql).fetchall():
    if row is None or row[0] is None:
        continue

    # Split the free-text field into alphabetic tokens.
    tokens = re.split('[^a-zA-Z]', row[0])

    brand_set = set()

    # Compare each full brand name against every token. A brand is
    # recorded when any single token is similar enough to the whole name.
    for brand in brands:
        for token in tokens:
            if float(levenshtein_cpp.seqratio(brand, token)) > 0.7:
                brand_set.add(brand)

    resp.append(sorted(brand_set))

Counter(','.join(list(x)) for x in resp).most_common()
[('', 417),
 ('Caterpillar', 364),
 ('Cummins', 182),
 ('Caterpillar,Cummins', 69),
 ('MTU', 54),
 ('Kohler', 21),
 ('Caterpillar,MTU', 16),
 ('Caterpillar,Kohler', 9),
 ('Generac', 9),
 ('Caterpillar,Cummins,MTU', 7),
 ('Cummins,Kohler', 7),
 ('Volvo', 7),
 ('Mitsubishi', 6),
 ('Cummins,MTU', 6),
 ('Waukesha', 5),
 ('Kohler,MTU', 4),
 ('Clarke', 4),
 ('Caterpillar,Clarke', 3),
 ('Caterpillar,Cummins,Kohler', 3),
 ('Cummins,Generac', 3),
 ('Katolight', 2),
 ('Cummins,Onan', 2),
 ('Caterpillar,Onan', 2),
 ('Onan', 2),
 ('Cummins,Katolight', 2),
 ('Deutz', 2),
 ('Mercedes-Benz', 2),
 ('Clarke,Cummins', 2),
 ('Caterpillar,MTU,Mitsubishi', 2),
 ('Doosan,Waukesha', 2),
 ('Caterpillar,Katolight,MTU', 1),
 ('Caterpillar,Generac', 1),
 ('Caterpillar,Clarke,MTU', 1),
 ('Caterpillar,Clarke,Volvo', 1),
 ('Caterpillar,Kohler,MTU', 1),
 ('Clarke,Cummins,MTU', 1),
 ('Caterpillar,Cummins,MTU,Olympian', 1),
 ('Baylor', 1),
 ('Caterpillar,Cummins,Hitec,Kohler', 1),
 ('Allis-Chalmers,Caterpillar,Cummins', 1),
 ('Alzeta,Caterpillar', 1),
 ('Allis-Chalmers', 1),
 ('Caterpillar,Perkins', 1),
 ('Caterpillar,Cummins,Generac,Hino', 1),
 ('Caterpillar,Cummins,Generac', 1),
 ('Caterpillar,Cummins,Generac,MTU', 1),
 ('Cummins,Mitsubishi', 1),
 ('Cummins,Perkins', 1),
 ('Caterpillar,Olympian,Perkins', 1),
 ('Caterpillar,Olympian', 1),
 ('Kohler,Turbinen-Union', 1),
 ('Caterpillar,Waukesha', 1),
 ('Baudouin', 1)]
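
The 0.7 threshold goes some way to explaining why a few multi-word brands in the list never show up in the results. The script compares the whole brand name against one token at a time, so a token has to cover most of the name to clear the bar. The Python sketch below uses a plain longest-common-subsequence ratio as a stand-in. That it behaves like the library's ratio is an assumption and hasn't been confirmed against the library's source. Both `lcs_length` and `similarity` are hypothetical helpers.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Indel-style ratio: 2 * LCS / (len(a) + len(b))."""
    total = len(a) + len(b)
    return 1.0 if total == 0 else 2 * lcs_length(a, b) / total


# 'Chalmers' covers 8 of the 22 combined characters twice over: 16 / 22.
hit = similarity('Allis-Chalmers', 'Chalmers')

# 'Detroit' only covers 7 of 'Detroit Diesel': 14 / 21.
miss = similarity('Detroit Diesel', 'Detroit')
```

Under this stand-in, `hit` comes to roughly 0.73 and clears the threshold, while `miss` comes to roughly 0.67 and doesn't. That would be consistent with Allis-Chalmers appearing in the results and Detroit Diesel being absent.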

I suspect running AI and OCR on the underlying permit PDFs would yield a richer dataset where model numbers, power ratings and other details could be extracted and used to model things like power usage during blackouts.

I haven't seen the underlying PDFs published anywhere yet but hopefully they will be in the near future.

Thank you for taking the time to read this post. I offer both consulting and hands-on development services to clients in North America and Europe. If you'd like to discuss how my offerings can help your business, please contact me via LinkedIn.

Copyright © 2014 - 2025 Mark Litwintschik. This site's template is based off a template by Giulio Fidente.