Category: Databases

1.1 Billion Taxi Rides in ClickHouse on DoubleCloud

I investigate how fast DoubleCloud can query 1.1 billion taxi journeys using their managed ClickHouse solution.


Install ClickHouse Faster

I look at the latest way to get ClickHouse running quickly.


1.1 Billion Taxi Rides using Hydrolix on AWS

I examine the performance of Hydrolix against my 1.1B taxi rides benchmark.


1.1 Billion Taxi Rides using OmniSciDB and a MacBook Pro

I investigate how fast OmniSciDB can query 1.1 billion taxi journeys using a 16" MacBook Pro.


Faster ClickHouse Imports

I compare import times of various formats into ClickHouse.


YouTube's Database "Procella"

I analyse material recently published on Google's "Procella" query processing engine which powers YouTube.


Is Hadoop Dead?

I analyse and debate arguments surrounding the "demise" of Hadoop.


Faster File Distribution with HDFS and S3

I look for faster ways of transferring files between HDFS and AWS S3.


A Minimalist Guide to Flume

I take a look at Apache Flume and walk through an example using it to connect Kafka to HDFS.


A Minimalist Guide to FoundationDB

I take a short look at FoundationDB and walk through a leaderboard example using Python.


1.1 Billion Taxi Rides: 108-core ClickHouse Cluster

I investigate how fast ClickHouse 18.16.1 can query 1.1 billion taxi journeys on a 3-node, 108-core AWS EC2 cluster.


Convert CSVs to ORC Faster

I compare the ORC file construction times of Spark 2.4.0, Hive 2.3.4 and Presto 0.214.


1.1 Billion Taxi Rides: Spark 2.4.0 versus Presto 0.214

I investigate how fast Spark and Presto can query 1.1 billion taxi journeys using a 21-node EMR cluster.


Working with the Hadoop Distributed File System

I explore several HDFS interfaces and compare them to the JVM-based Apache Hadoop HDFS CLI.


A Minimalist Guide to Microsoft SQL Server 2017 on Ubuntu Linux

This tutorial covers importing CSV data into SQL Server 2017, automating data pipeline tasks via Apache Airflow and visualising data using Pandas and Jupyter Notebooks.


1.1 Billion Taxi Rides with SQLite, Parquet & HDFS

I investigate how fast SQLite can query 1.1 billion taxi journeys from Parquet files stored on HDFS.


Using SQL to query Kafka, MongoDB, MySQL, PostgreSQL and Redis with Presto

A guide to connecting to five different data stores using Presto.


1.1 Billion Taxi Rides: EC2 versus EMR

I investigate how fast Spark and Presto can query 1.1 billion taxi journeys using an i3.8xlarge EC2 instance with 1.7 TB of NVMe storage versus a 21-node EMR cluster.


Hadoop 3 Single-Node Install Guide

A simple Hadoop 3 installation guide for Ubuntu 16 that includes Hive, Spark and Presto.


1.1 Billion Taxi Rides with BrytlytDB 2.1 & a 5-node IBM Minsky Cluster

I investigate how fast BrytlytDB 2.1 can query 1.1 billion taxi journeys using five IBM Minsky servers with 20 Nvidia P100 GPUs.


1.1 Billion Taxi Rides with BrytlytDB 2.0 & 2 GPU-Powered p2.16xlarge EC2 Instances

I investigate how fast BrytlytDB 2.0 can query 1.1 billion taxi journeys using two p2.16xlarge AWS EC2 instances.


A Minimalist Guide to SQLite

This tutorial covers importing CSV data into SQLite 3, manipulating data via Python and visualising data using Pandas and Jupyter Notebooks.


1.1 Billion Taxi Rides with Spark 2.2 & 3 Raspberry Pi 3 Model Bs

I investigate how fast Spark 2.2 can query 1.1 billion taxi journeys using a cluster of three Raspberry Pis.


1.1 Billion Taxi Rides with BrytlytDB & 2 GPU-Powered p2.16xlarge EC2 Instances

I investigate how fast BrytlytDB can query 1.1 billion taxi journeys using two p2.16xlarge AWS EC2 instances.


Compiling MapD's Source Code

In this tutorial I walk through building MapD from source on an Ubuntu 16.04.2 machine.


1.1 Billion Taxi Rides with MapD 3.0 & 2 GPU-Powered p2.8xlarge EC2 Instances

I investigate how fast MapD 3.0 can query 1.1 billion taxi journeys using two p2.8xlarge AWS EC2 instances.


Analysing Petabytes of Websites

I demonstrate how to extract analytical data from petabytes worth of websites collected by Common Crawl.


1.1 Billion Taxi Rides on ClickHouse & an Intel Core i5

I investigate how fast ClickHouse can query 1.1 billion taxi journeys on an Intel Core i5 4670K.


1.1 Billion Taxi Rides on Vertica & an Intel Core i5

I investigate how fast Vertica Community Edition 8.0.1 can query 1.1 billion taxi journeys on an Intel Core i5 4670K.


1.1 Billion Taxi Rides on AWS EMR 5.3.0 & Spark 2.1.0

I investigate how fast an 11-node Spark 2.1.0 cluster can query over a billion records.


1.1 Billion Taxi Rides on kdb+/q & 4 Xeon Phi CPUs

I investigate how fast kdb+/q can query 1.1 billion taxi journeys on 4 Intel Xeon Phi 7210 CPUs.


1.1 Billion Taxi Rides on Amazon Athena

I investigate how fast Amazon Athena can query 1.1 billion taxi journeys.


Alenka: A GPU-Driven, Open Source Database

I walk through installing Alenka, loading data into it and querying it.


1.1 Billion Taxi Rides with MapD & 8 Nvidia Pascal Titan Xs

I investigate how fast MapD can query 1.1 billion taxi journeys using 8 Nvidia Pascal-based Titan X cards.


1.1 Billion Taxi Rides with MapD & AWS EC2

I investigate how fast MapD can query 1.1 billion taxi journeys using 4 g2.8xlarge EC2 instances.


1.1 Billion Taxi Rides with MapD & 4 Nvidia Titan Xs

I investigate how fast MapD can query 1.1 billion taxi journeys using 4 Nvidia Titan X cards.


1.1 Billion Taxi Rides with MapD & 8 Nvidia Tesla K80s

I investigate how fast MapD can query 1.1 billion taxi journeys using 8 Nvidia Tesla K80 GPU cards.


1.2 Billion Taxi Rides on AWS RDS running PostgreSQL

I investigate how fast a series of graphs generated using R can be created across 4 different types of AWS RDS instances.


1.1 Billion Taxi Rides on a Large Redshift Cluster

I investigate how fast a 6-node ds2.8xlarge Redshift Cluster can query over a billion records.


All 1.1 Billion Taxi Rides on Redshift

I investigate how fast a single Redshift ds2.xlarge instance can query over a billion records.


All 1.1 Billion Taxi Rides in Elasticsearch

I look at ways of fitting every column of the 1.1 billion taxi rides into Elasticsearch on a single, 850 GB SSD.


50-node Presto Cluster on Google Cloud's Dataproc

I investigate how fast a 50-node Dataproc cluster queries the metadata of 1.1 billion taxi trips.


Performance Impact of File Sizes on Presto Query Times

I investigate the performance impact of ORC file sizes on Presto query times using Google Cloud's Dataproc service.


33x Faster Queries on Google Cloud's Dataproc

I look at speeding up Presto queries on 1.1 billion records run on a 10-node Dataproc cluster.


A Billion Taxi Rides: AWS S3 versus HDFS

I investigate the speed differences between S3 and HDFS when querying over a billion records using Presto on AWS EMR.


A Billion Taxi Rides on Google's Dataproc running Presto

I investigate how fast a small Dataproc cluster can query over a billion records using Presto.


50-node Presto Cluster on Amazon EMR

I investigate how fast a 50-node AWS EMR cluster can query over a billion records using Presto.


A Billion Taxi Rides on Google's BigQuery

I investigate how fast BigQuery can query the metadata of 1.1 billion NYC taxi journeys.


A Billion Taxi Rides in PostgreSQL

I look at query speeds on 1.1 billion records on a single PostgreSQL installation running on an SSD.


A Billion Taxi Rides in Elasticsearch

I investigate how fast a single instance of Elasticsearch can query over a billion records.


A Billion Taxi Rides on Amazon EMR running Spark

I investigate how fast a small AWS EMR cluster can query over a billion records using Spark.


A Billion Taxi Rides on Amazon EMR running Presto

I investigate how fast a small AWS EMR cluster can query over a billion records using Presto.


Kafka Producer Latency with Large Topic Counts

I look at the relationship between topic counts and producer latency with Kafka.


A Billion Taxi Rides in Hive & Presto

Import the metadata of over a billion Yellow and Green Taxi and Uber rides in New York City into ORC-formatted, columnar-based files on HDFS and query them using Hive & Presto.


A Billion Taxi Rides in Redshift

Import the metadata of over a billion Yellow and Green Taxi and Uber rides in New York City into a columnar-based Data Warehouse.


Presto, Parquet & Airpal

Using Airpal to execute queries on Parquet-formatted data via Presto.


A Million Songs on AWS Redshift

Parallel imports of CSV data from AWS S3 into Redshift.


Hadoop Up and Running

I explore three ways to get Hadoop installed and running.


Recommendation Engine built using Spark and Python

An end-to-end guide to building a film recommendation engine.


Querying Elasticsearch from Google App Engine

GAE strips HTTP body payloads if sent via HTTP GET. Elasticsearch expects request bodies sent via HTTP GET. Re-writing the HTTP verb fixes the communications problem.
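
As a rough illustration of the verb re-write described above, here is a minimal sketch using only Python's standard library. It assumes the search body is sent with POST rather than GET, which Elasticsearch's search endpoint also accepts. The host, index name and query are hypothetical placeholders, and the post itself may use a different HTTP client.

```python
import json
import urllib.request


def build_search_request(base_url, index, query):
    """Build an Elasticsearch _search request that uses POST instead of GET.

    Sending the query body with POST avoids relying on a GET request
    carrying a payload, which some platforms strip.
    """
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host, index and query, used only for illustration.
request = build_search_request(
    "http://localhost:9200", "articles", {"match": {"title": "taxi"}}
)

# Nothing is sent over the network here; the request is only constructed.
print(request.get_method(), request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without an Elasticsearch instance.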

Copyright © 2014 - 2022 Mark Litwintschik. This site's template is based off a template by Giulio Fidente.