I break up an OSM file into 1,087 themed GeoPackage files.
I explore adsb.lol's flight tracking dataset.
I explore the Estonian Land Board's LiDAR scans dataset.
I explore Natural Earth's freely available global geospatial datasets.
I explore 1 TB of Maxar's freely available satellite imagery.
I explore Overture's three global and free-to-use mapping dataset releases.
A review of their six-week spatial imagery course.
I walk through setting up a research and development environment for H.266 / VVC encoding.
I identify objects in aerial and phone camera imagery using Meta AI's Segmentation Model.
A review of their six-week course which focuses on their ArcGIS Pro offering.
I review Clickgis, a Rust-based extension that adds WKB and GeoJSON support to ClickHouse.
I ask Platypus2 13B questions about a PDF.
I revisit Uber's H3 with a more concise method for producing geospatial clusters.
I've extracted the most popular commercial airline passenger routes from 21 GB of Wikipedia articles.
I walk through hosting streaming videos using FFmpeg, Bento4, Caddy Server and HLS.
I walk through IPinfo's free IPv4 and IPv6 location database.
DuckDB can now open 50+ GIS file formats. I use it to help examine the Bing Maps team's AI road detection project.
I walk through basic geospatial workflows in DuckDB.
I build a pan-European Bus Route Planner.
I compare shipping data via CSV and Parquet from PostgreSQL to BigQuery.
I investigate how fast DoubleCloud can query 1.1 billion taxi journeys using their managed ClickHouse solution.
I show how you can create beautiful isochrone maps using Valhalla and QGIS.
I explore a Python wrapper for Apache ECharts.
I explore Altair, a concise API for charting in Python.
I walk through setting up BastionZero on an AWS EC2 instance.
I show how you can create beautiful maps in Python.
I walk through a GIS toolchain for creating heatmaps.
A review of the Rust-based Web Framework Poem.
I review the features and community benchmarks of the Rust-based Web Framework Axum.
Cost-effective, mobile-friendly file sharing using two Go-based offerings.
I explain how Open5G digs through 3.5 trillion records produced by a deep learning algorithm trained on a massive cluster in Switzerland that was fed imagery of the entire earth from two satellites to decide how to roll out 5G in California.
I walk through a GIS toolchain for visualising the streets of Monaco and its Formula 1 circuit.
I look at the latest way to get ClickHouse running quickly.
I compare latitude and longitude to H3 binning times between PostgreSQL, BigQuery and ClickHouse.
I describe how IPinfo finds the location of almost every IP address on earth.
I port a Python-based TLD extraction script to Go.
I look at an implementation of FizzBuzz that can generate output at a rate of 56 GB/s.
I review the features and benchmark ROAPI.
I review the features and community benchmarks of Actix.
I review the features and community benchmarks of Rocket.
I build a PostgreSQL function in Rust and use it to try and transform 1.27B records.
I port a Python-based TLD extraction script to Rust.
I walk through tracking changes in rich documents using Git.
I walk through installing and running Snappy's S2 extension.
I walk through installing and running MeiliSearch.
I walk through running an AWS S3-compatible storage service on HDFS.
Keep an eye on ClickHouse with Prometheus and Grafana.
Build a better understanding of your data in PostgreSQL.
I examine the performance of Hydrolix against my 1.1B taxi rides benchmark.
I investigate how fast OmniSciDB can query 1.1 billion taxi journeys using a 16" MacBook Pro.
Proxy Python and curl web requests through WireGuard and OpenSSH.
I compare PostgreSQL and ClickHouse performance characteristics while performing IPv4 to hostname lookups.
I compare the decompression times of various DEFLATE implementations.
I compare import times of various formats into ClickHouse.
I analyse material recently published on Google's "Procella" query processing engine which powers YouTube.
I analyse and debate arguments surrounding the "demise" of Hadoop.
I look at various aspects of lossless compression.
I look for faster ways of transferring files between HDFS and AWS S3.
I take a look at Apache Flume and walk through an example using it to connect Kafka to HDFS.
I take a short look at FoundationDB and walk through a leaderboard example using Python.
I review the Hadoop-focused book "Architecting Modern Data Platforms".
I investigate how fast ClickHouse 18.16.1 can query 1.1 billion taxi journeys on a 3-node, 108-core AWS EC2 cluster.
I compare the ORC file construction times of Spark 2.4.0, Hive 2.3.4 and Presto 0.214.
I investigate how fast Spark and Presto can query 1.1 billion taxi journeys using a 21-node EMR cluster.
I explore several HDFS interfaces and compare them to the JVM-based Apache Hadoop HDFS CLI.
An examination and comparison of top, Htop and Glances, three tools for performing ad-hoc monitoring of system and application performance.
This tutorial covers converting Wikipedia's XML dump of its English-language site into CSV, JSON, AVRO and ORC file formats as well as analysing the data using ClickHouse.
This tutorial covers importing CSV data into SQL Server 2017, automating data pipeline tasks via Apache Airflow and visualising data using Pandas and Jupyter Notebooks.
I investigate how fast SQLite can query 1.1 billion taxi journeys from Parquet files off of HDFS.
I walk through setting up Apache Airflow to use Dask.distributed, PostgreSQL and logging to AWS S3, as well as creating user accounts and plugins.
A guide to connecting to five different data stores using Presto.
A guide to running Airflow and Jupyter Notebook with Hadoop 3, Spark & Presto.
I investigate how fast Spark and Presto can query 1.1 billion taxi journeys using an i3.8xlarge EC2 instance with 1.7 TB of NVMe storage versus a 21-node EMR cluster.
A simple Hadoop 3 installation guide for Ubuntu 16 that includes Hive, Spark and Presto.
I investigate how fast BrytlytDB 2.1 can query 1.1 billion taxi journeys using five IBM Minsky servers with 20 Nvidia P100 GPUs.
I investigate how fast BrytlytDB 2.0 can query 1.1 billion taxi journeys using two p2.16xlarge AWS EC2 instances.
This tutorial covers importing CSV data into SQLite 3, manipulating data via Python and visualising data using Pandas and Jupyter Notebooks.
I investigate how fast Spark 2.2 can query 1.1 billion taxi journeys using a cluster of three Raspberry Pis.
I investigate how fast BrytlytDB can query 1.1 billion taxi journeys using two p2.16xlarge AWS EC2 instances.
In this tutorial I walk through building MapD from source on an Ubuntu 16.04.2 machine.
I investigate how fast MapD 3.0 can query 1.1 billion taxi journeys using two p2.8xlarge AWS EC2 instances.
I explore the task of bot detection in web traffic logs.
I walk through using TensorFlow to train AI Bots to play Doom, a classic first-person shooter.
I demonstrate how to extract analytical data from petabytes worth of websites collected by Common Crawl.
I review an early release of Martin Kleppmann's book "Designing Data-Intensive Applications".
I investigate how fast ClickHouse can query 1.1 billion taxi journeys on an Intel Core i5 4670K.
I investigate how fast Vertica Community Edition 8.0.1 can query 1.1 billion taxi journeys on an Intel Core i5 4670K.
I investigate how fast an 11-node Spark 2.1.0 cluster can query over a billion records.
I investigate how fast kdb+/q can query 1.1 billion taxi journeys on 4 Intel Xeon Phi 7210 CPUs.
I investigate how fast Amazon Athena can query 1.1 billion taxi journeys.
I walk through installing, loading in data and querying Alenka.
I investigate how fast MapD can query 1.1 billion taxi journeys using 8 Nvidia Pascal-based Titan X cards.
I walk through setting up TensorFlow, a Deep Learning Framework, on Ubuntu 16 with an Nvidia GTX 1080 and use it to build "Deep Fizz buzz".
I walk through setting up a data pipeline for currency exchange rates using Airflow, PostgreSQL and Redis.
I investigate how fast MapD can query 1.1 billion taxi journeys using 4 g2.8xlarge EC2 instances.
I investigate how fast MapD can query 1.1 billion taxi journeys using 4 Nvidia Titan X cards.
I investigate how fast MapD can query 1.1 billion taxi journeys using 8 Nvidia Tesla K80 GPU cards.
I investigate how fast a series of graphs generated using R can be created across 4 different types of AWS RDS instances.
I investigate how fast a 6-node ds2.8xlarge Redshift Cluster can query over a billion records.
I investigate how fast a single Redshift ds2.xlarge instance can query over a billion records.
I look at ways of fitting every column of the 1.1 billion taxi rides into Elasticsearch on a single, 850 GB SSD.
I investigate how fast a 50-node Dataproc cluster queries the metadata of 1.1 billion taxi trips.
I investigate the performance impact of ORC file sizes on Presto query times using Google Cloud's Dataproc service.
I examine the performance and reliability increases from using Redis across a 51-node IPv4 WHOIS crawling cluster.
I look at speeding up Presto queries on 1.1 billion records run on a 10-node Dataproc cluster.
I investigate how fast a cluster of EC2 instances can collect WHOIS records of IPv4 addresses.
I investigate the speed differences between S3 and HDFS when querying over a billion records using Presto on AWS EMR.
I investigate how fast a small Dataproc cluster can query over a billion records using Presto.
I investigate how fast a 50-node AWS EMR cluster can query over a billion records using Presto.
I investigate how fast BigQuery can query the metadata of 1.1 billion NYC taxi journeys.
I investigate how fast a 40-node Hadoop cluster on AWS EMR can collect WHOIS records of IPv4 addresses.
I look at query speeds on 1.1 billion records on a single PostgreSQL installation running on an SSD.
I investigate how fast a single instance of Elasticsearch can query over a billion records.
I investigate how fast a small AWS EMR cluster can query over a billion records using Spark.
I investigate how fast a small AWS EMR cluster can query over a billion records using Presto.
I look at the relationship between topic counts and producer latency with Kafka.
Import the metadata of over a billion Yellow and Green Taxi and Uber rides in New York City into ORC-formatted, columnar-based files on HDFS and query them using Hive & Presto.
Import the metadata of over a billion Yellow and Green Taxi and Uber rides in New York City into a columnar-based Data Warehouse.
Using Airpal to execute queries on Parquet-formatted data via Presto.
Parallel imports of CSV data from AWS S3 into Redshift.
I explore three ways to get Hadoop installed and running.
Reduce the I/O overhead of running tests in Django.
Scraping 29K Wikipedia pages to find the most popular commercial airline passenger routes.
An end-to-end guide to building a film recommendation engine.
A strategy for blocking dictionary attacks and restricting access to a white list of IP addresses.
Parsing and linting UK postcodes is rife with edge cases.
A review of Django auth's password storage format and password storage upgrading capabilities.
Six tips for speeding up Python code.
A strategy for crushing, caching and deploying front-end-optimised Django sites.
Python's most popular package management tool is pip. I explore some tools to increase its functionality.
Set up a load-balanced, two-node Django cluster with a minimal Ansible footprint.
Run Django tests concurrently with pytest-xdist.
How to capture, monitor and analyse exceptions raised from a Django project.
I look into the steps of creating a blog using Pelican and hosting it with low-cost CDN services from Amazon with the help of S3cmd.
An exploratory effort to see how hard it is to collect all IPv4's WHOIS records.
I stopped coding in PHP in 2011; here are the thoughts that led me to that decision.
How to upload files to Amazon S3 from a form in Django as well as (very important) how to test the upload process.
A comparison of four methods used to find the country of an IP address.
django-jsonview offers a method decorator which will cause all responses (including exceptions) to return in an API-friendly JSON format.
GAE strips HTTP body payloads if sent via HTTP GET. Elasticsearch expects post bodies sent via HTTP GET. Rewriting the HTTP verb fixes the communications problem.
Copyright © 2014 - 2024 Mark Litwintschik. This site's template is based off a template by Giulio Fidente.