
Posted on Mon 24 September 2018 under Data Science

Working with Data Feeds

As companies become more data-driven, they often end up consuming a growing number of datasets from both internal sources and third parties. Rarely have I seen firms try to centralise where datasets are stored. Instead, data is usually copied onto infrastructure belonging to individual teams and departments. This lets each team work without disrupting, or being disrupted by, the others.

Data sources are often refreshed in batches, anywhere from every few minutes to once a month. The file formats, compression schemes and encryption systems used to distribute these datasets can vary greatly.

There is no single tool I use for collecting and analysing new datasets. I do my best to pick tools that help me avoid writing a lot of bespoke code while taking advantage of the hardware available on whichever system I'm using.

In this guide I'll walk through an exercise in consuming, transforming and analysing a data dump of the English-language version of Wikipedia.

Installing Prerequisites

The following commands were run on a fresh install of Ubuntu 16.04.2 LTS on a system with 16 GB of RAM.

I'll first add the repository information needed to install ClickHouse, an open source analytics database, from Debian packages.

$ sudo apt-key adv \
    --keyserver hkp://keyserver.ubuntu.com:80 \
    --recv E0C56BD4
$ echo "deb http://repo.yandex.ru/clickhouse/deb/stable/ main/" | \
    sudo tee /etc/apt/sources.list.d/clickhouse.list
$ sudo apt update

I'll then install ClickHouse, Python, a JSON manipulation tool and a few decompression utilities (some of which can take advantage of multiple CPU cores).

$ sudo apt install \
    clickhouse-client \
    clickhouse-server-common \
    jq \
    lbzip2 \
    pigz \
    python-pip \
    python-virtualenv \
    software-properties-common \
    unzip

I'll be using a number of Java-based utilities in this exercise so I'll install Oracle's Java 8 distribution.

$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt update
$ sudo apt install oracle-java8-installer

I'll create a Python virtual environment and install a few data transformation and analysis utilities.

$ virtualenv ~/.feed
$ source ~/.feed/bin/activate
$ pip install \
    csvkit \
    pandas \
    xmltodict

Finally, I'll download and unpack a few libraries for the Hadoop-centric file formats I'll be working with.

$ wget -c https://repo1.maven.org/maven2/org/apache/orc/orc-tools/1.5.2/orc-tools-1.5.2-uber.jar
$ wget -c https://www-eu.apache.org/dist/avro/avro-1.8.2/java/avro-tools-1.8.2.jar
$ wget -c https://github.com/jairamc/csv-avro-converter/releases/download/v0.1-alpha/csv-avro-converter-cli-1.0.zip
$ unzip csv-avro-converter-cli-1.0.zip

Downloading Wikipedia

Wikipedia dumps the page contents and metadata of its English-language site every couple of weeks. As of this writing the dump is made up of 55 bzip2-compressed XML files, each around 300 MB compressed and roughly 1.5 GB uncompressed.

The following will download the 55 compressed XML files, creating a little over 14 GB of compressed content. The command pulls the URLs from a manifest file and transforms them into complete URLs. The URLs are piped through xargs so that two files are downloaded in parallel at any one time.

$ wget -qO- https://dumps.wikimedia.org/enwiki/20180920/dumpstatus.json \
    | jq '.jobs.articlesdump.files[].url' \
    | sed 's/\"//g' \
    | sed 's/^/https:\/\/dumps.wikimedia.org/' \
    | xargs -n 1 \
            -P 2 \
            wget -c
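
For reference, the same manifest-driven download can be sketched in Python. This is a minimal, illustrative version that assumes the requests package is available (it isn't part of the setup above); it fetches the manifest, builds absolute URLs and downloads two files at a time, just like the pipeline above.

# A minimal sketch assuming the requests package is installed (it isn't
# included in the setup above). It mirrors the jq/sed/xargs pipeline:
# fetch the manifest, build absolute URLs and download two files at a time.
from multiprocessing.dummy import Pool  # a thread pool from the standard library
import os

import requests

MANIFEST = 'https://dumps.wikimedia.org/enwiki/20180920/dumpstatus.json'
BASE_URL = 'https://dumps.wikimedia.org'


def download(url):
    # Stream each ~300 MB archive to disk rather than holding it in memory.
    local_name = os.path.basename(url)
    resp = requests.get(url, stream=True)
    resp.raise_for_status()

    with open(local_name, 'wb') as fh:
        for chunk in resp.iter_content(chunk_size=1024 * 1024):
            fh.write(chunk)

    return local_name


manifest = requests.get(MANIFEST).json()
urls = [BASE_URL + meta['url']
        for meta in manifest['jobs']['articlesdump']['files'].values()]

pool = Pool(2)  # two downloads in flight at once, like xargs -P 2

for name in pool.map(download, urls):
    print(name)

pool.close()
pool.join()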

Converting XML into JSON

XML files can be challenging to work with. Some libraries expect to load files in their entirety and can quickly exhaust system memory with simple operations. Others are unintuitive and demand a great deal of cognitive effort for simple tasks. For these reasons I try to convert XML files into JSON as soon as I receive them.

I'll stream records out of each bzip2-compressed XML file and dump each record onto its own line using JSON serialisation. The xmltodict library will first convert the XML into Python dictionaries and then the Python script below will convert each dictionary into a JSON string.

$ vi dump_pages.py
import json
import marshal
import sys


# xmltodict's command-line mode marshals (path, element) tuples to stdout.
# Read them back off stdin and print each element as a single JSON line.
while True:
    try:
        _, page = marshal.load(sys.stdin)
        print json.dumps(page)
    except EOFError:
        break

Below is a bash script that will find each of the bzip2-compressed XML files. Each file is decompressed with lbunzip2, a multi-core bzip2 utility. The decompressed XML is converted into Python dictionaries with the xmltodict library. The script above then converts each record into a single-line JSON string. Finally, pigz, a multi-core drop-in replacement for gzip, compresses the output.

$ for FILE in *.bz2; do
      echo $FILE
      lbunzip2 --stdout $FILE \
        | python xmltodict.py 2 \
        | python dump_pages.py \
        | pigz > $FILE.json.gz
  done
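
For illustration, the same transformation can be done in a single Python process using xmltodict's streaming interface. This is a minimal sketch; note that Python's bz2 and gzip modules are single-threaded, so the shell pipeline above will make far better use of a multi-core machine.

import bz2
from functools import partial
from glob import glob
import gzip
import json

import xmltodict


def handle_page(path, element, out_file):
    # Write each depth-2 element (a <page> record) as one JSON line.
    out_file.write((json.dumps(element) + '\n').encode('utf-8'))
    return True  # keep streaming


for filename in glob('*.bz2'):
    with gzip.open(filename + '.json.gz', 'wb') as out_file, \
         bz2.BZ2File(filename) as in_file:
        xmltodict.parse(in_file,
                        item_depth=2,
                        item_callback=partial(handle_page, out_file=out_file))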

The content grew by 33% going from bzip2-compressed XML files to line-delimited, gzip-compressed JSON files.

Converting JSON into CSV

Now that I have the data in compressed JSON format I want to be able to flatten out the nested elements into their own columns. For reference, this is a truncated and pretty-printed view of what one JSON record currently looks like:

{
    "id": "41067206",
    "ns": "0",
    "revision": {
        "comment": "###",
        "contributor": {
            "id": "244263",
            "username": "Czar"
        },
        "format": "text/x-wiki",
        "id": "817456361",
        "model": "wikitext",
        "parentid": "788230430",
        "sha1": "2egdyphfnavxhsjrxvhjpfvb6ndh0wi",
        "text": {
            "#text": "###",
            "@xml:space": "preserve"
        },
        "timestamp": "2017-12-28T14:11:08Z"
    },
    "title": "###"
}

I'm looking to transform the above into the following columns. Note I'm using an underscore to denote a nested child node.

id
ns
redirect_@title
revision_comment
revision_contributor_id
revision_contributor_username
revision_format
revision_id
revision_model
revision_parentid
revision_sha1
revision_text_#text
revision_text_@xml:space
revision_timestamp
title
revision_contributor_ip
revision_minor
restrictions
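
To make the underscore convention concrete, below is a minimal sketch of how pandas' nested_to_record helper (used in the script further down) flattens a nested dictionary. The record is an abbreviated version of the example above.

from pandas.io.json.normalize import nested_to_record


record = {'id': '41067206',
          'revision': {'contributor': {'id': '244263',
                                       'username': 'Czar'},
                       'timestamp': '2017-12-28T14:11:08Z'}}

# Nested keys are joined with an underscore, one column per leaf value.
print(nested_to_record(record, sep='_'))
# {'id': '41067206',
#  'revision_contributor_id': '244263',
#  'revision_contributor_username': 'Czar',
#  'revision_timestamp': '2017-12-28T14:11:08Z'}   (key order may vary)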

The following Python script will iterate through each of the compressed JSON files, flatten out their nested data structures and save the data as a gzip-compressed CSV file. The only dataset-specific check I have in this code is making sure there is a title element in each record. This distinguishes the page records from other types held in the dump.

$ python
from   glob import glob
import gzip
import json

import pandas as pd
from   pandas.io.json.normalize import nested_to_record


for filename in glob('*.json.gz'):
    df = pd.DataFrame()

    # Flatten each JSON record's nested keys into underscore-separated columns.
    for line in gzip.open(filename, 'rb'):
        page = nested_to_record(json.loads(line),
                                sep='_')

        # Only page records carry a title; skip every other record type.
        if 'title' in page.keys():
            df = df.append(page, ignore_index=True)

    df.to_csv('%s.csv.gz' % filename.split('.')[0],
              index=False,
              encoding='utf-8',
              compression='gzip')

The gzip-compressed CSV files are now a little over double the size of the bzip2-compressed XML files.

Note that the columns in each CSV file may not align with the columns in the other CSV files if the datasets vary in structure. You can check the consistency of the headers across the gzip-compressed CSV files with the following:

$ for FILE in *.csv.gz; do
      echo $FILE
      pigz -d -c $FILE | head -n1
  done
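
With 55 files, eyeballing headers gets tedious. Below is a minimal sketch that reads the first line of each gzip-compressed CSV file and flags any file whose header differs from the most common one.

from collections import Counter
from glob import glob
import gzip


headers = {}

for filename in glob('*.csv.gz'):
    with gzip.open(filename, 'rb') as gz_file:
        headers[filename] = gz_file.readline().strip()

# Treat the most common header as the reference and report any outliers.
reference = Counter(headers.values()).most_common(1)[0][0]

for filename, header in sorted(headers.items()):
    if header != reference:
        print('%s: %s' % (filename, header))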

Cherry-Picking Columns of Interest

I want to cut down on the amount of data I'll be working with in this exercise. To start, I'll examine the first thousand lines of one of the gzip-compressed CSV files and see what field names and data types it contains.

Note that I've raised the maximum field size csvstat will accept to eight million bytes so that the lengthy wikitext contents don't cause an issue. I've also had to declare the delimiter explicitly: records can span multiple lines, so the 1,000th line cuts the final record off mid-way, which throws off csvstat's delimiter auto-detection.

$ pigz -d -c enwiki-20180920-pages-articles1.csv.gz \
    | head -n1000 \
    | csvstat --maxfieldsize 8000000 \
              --delimiter ',' \
              --type
 1. id: Number
 2. ns: Boolean
 3. redirect_@title: Text
 4. revision_comment: Text
 5. revision_contributor_id: Number
 6. revision_contributor_username: Text
 7. revision_format: Text
 8. revision_id: Number
 9. revision_model: Text
10. revision_parentid: Number
11. revision_sha1: Text
12. revision_text_#text: Text
13. revision_text_@xml:space: Text
14. revision_timestamp: DateTime
15. title: Text
16. revision_contributor_ip: Text
17. revision_minor: Boolean
18. restrictions: Boolean

I'll create a new gzip-compressed CSV file made up of five columns of interest.

$ pigz -d -c enwiki-20180920-pages-articles1.csv.gz \
    | csvcut --maxfieldsize 8000000 \
             --columns 1,5,6,14,15 \
    | pigz > metadata.csv.gz

I'll now examine the statistics of each column in the newly-created gzip-compressed CSV file. Note that csvstat can read the compressed file directly; there's no need to decompress it ahead of time.

$ csvstat metadata.csv.gz
1. "id"

      Type of data:          Number
      Contains null values:  False
      Unique values:         19833
      Smallest value:        10
      Largest value:         30,302
      Sum:                   305,772,174
      Mean:                  15,417.344
      Median:                15,524
      StDev:                 8,567.878
      Most common values:    10 (1x)
                             12 (1x)
                             13 (1x)
                             14 (1x)
                             15 (1x)

2. "revision_contributor_id"

      Type of data:          Number
      Contains null values:  True (excluded from calculations)
      Unique values:         5197
      Smallest value:        0
      Largest value:         34,699,348
      Sum:                   204,280,488,782
      Mean:                  11,801,299.179
      Median:                9,612,106
      StDev:                 11,423,342.666
      Most common values:    None (2523x)
                             9,784,415 (1248x)
                             27,015,025 (462x)
                             212,624 (412x)
                             194,203 (390x)

3. "revision_contributor_username"

      Type of data:          Text
      Contains null values:  True (excluded from calculations)
      Unique values:         5198
      Longest value:         51 characters
      Most common values:    None (2523x)
                             Tom.Reding (1248x)
                             InternetArchiveBot (462x)
                             MZMcBride (412x)
                             Graham87 (390x)

4. "revision_timestamp"

      Type of data:          DateTime
      Contains null values:  False
      Unique values:         19788
      Smallest value:        2002-02-25 15:43:11+00:00
      Largest value:         2018-09-20 11:47:36+00:00
      Most common values:    2002-02-25 15:51:15+00:00 (11x)
                             2002-02-25 15:43:11+00:00 (8x)
                             2018-09-16 18:47:23+00:00 (2x)
                             2018-09-20 06:53:14+00:00 (2x)
                             2018-09-13 20:16:20+00:00 (2x)

5. "title"

      Type of data:          Text
      Contains null values:  True (excluded from calculations)
      Unique values:         19833
      Longest value:         97 characters
      Most common values:    AccessibleComputing (1x)
                             Anarchism (1x)
                             AfghanistanHistory (1x)
                             AfghanistanGeography (1x)
                             AfghanistanPeople (1x)

The above is incredibly handy for building data fluency. I have enough information to create a fairly granular schema in a structured data store.

Note that the values printed above have been transformed to be more consistent and readable: timestamps share the same format and numbers are printed with thousands separators.

The following will print the first ten records of the new CSV file. Note the data has been formatted for easier reading here as well.

$ pigz -d -c metadata.csv.gz \
    | head -n10 \
    | csvlook
| id | revision_contributor_id | revision_contributor_username |        revision_timestamp | title                          |
| -- | ----------------------- | ----------------------------- | ------------------------- | ------------------------------ |
| 10 |              23,257,138 | Godsy                         | 2018-08-14 06:47:24+00:00 | AccessibleComputing            |
| 12 |                         |                               | 2018-09-19 12:07:26+00:00 | Anarchism                      |
| 13 |               9,784,415 | Tom.Reding                    | 2017-06-05 04:18:18+00:00 | AfghanistanHistory             |
| 14 |               9,784,415 | Tom.Reding                    | 2017-06-05 04:18:23+00:00 | AfghanistanGeography           |
| 15 |               9,784,415 | Tom.Reding                    | 2017-06-05 04:19:42+00:00 | AfghanistanPeople              |
| 18 |               9,784,415 | Tom.Reding                    | 2017-06-05 04:19:45+00:00 | AfghanistanCommunications      |
| 19 |               9,784,415 | Tom.Reding                    | 2017-06-04 21:42:11+00:00 | AfghanistanTransportations     |
| 20 |               9,784,415 | Tom.Reding                    | 2017-06-04 21:43:11+00:00 | AfghanistanMilitary            |
| 21 |               9,784,415 | Tom.Reding                    | 2017-06-04 21:43:14+00:00 | AfghanistanTransnationalIssues |

If you want to see the data in tabular form but without these transformations, run the command again with the --no-inference flag.

$ pigz -d -c metadata.csv.gz \
    | head -n10 \
    | csvlook --no-inference
| id | revision_contributor_id | revision_contributor_username | revision_timestamp   | title                          |
| -- | ----------------------- | ----------------------------- | -------------------- | ------------------------------ |
| 10 | 23257138                | Godsy                         | 2018-08-14T06:47:24Z | AccessibleComputing            |
| 12 |                         |                               | 2018-09-19T12:07:26Z | Anarchism                      |
| 13 | 9784415                 | Tom.Reding                    | 2017-06-05T04:18:18Z | AfghanistanHistory             |
| 14 | 9784415                 | Tom.Reding                    | 2017-06-05T04:18:23Z | AfghanistanGeography           |
| 15 | 9784415                 | Tom.Reding                    | 2017-06-05T04:19:42Z | AfghanistanPeople              |
| 18 | 9784415                 | Tom.Reding                    | 2017-06-05T04:19:45Z | AfghanistanCommunications      |
| 19 | 9784415                 | Tom.Reding                    | 2017-06-04T21:42:11Z | AfghanistanTransportations     |
| 20 | 9784415                 | Tom.Reding                    | 2017-06-04T21:43:11Z | AfghanistanMilitary            |
| 21 | 9784415                 | Tom.Reding                    | 2017-06-04T21:43:14Z | AfghanistanTransnationalIssues |

The csvlook utility does support a --max-rows parameter as well, but it takes a lot longer to pull the first ten records from a compressed CSV file that way than piping through pigz and head does.

Converting CSV to AVRO

AVRO files are self-describing: the schema for the records is stored in the file alongside the data. Schema evolution is supported as well, with separate schemas for reading and writing records. AVRO files can be read and worked with using a large variety of tools in the Hadoop ecosystem as well as with various Python and Java libraries. These features make AVRO an excellent transport format for micro-batches of data.
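
To illustrate what self-describing means in practice, here is a minimal sketch using the fastavro package (an assumption; it isn't installed above) that writes a record with an embedded schema and then reads both back out of the file.

# A minimal sketch using the fastavro package (not part of the setup above).
from fastavro import reader, writer


schema = {'type': 'record',
          'name': 'page',
          'fields': [{'name': 'id', 'type': 'int'},
                     {'name': 'title', 'type': 'string'},
                     {'name': 'revision_timestamp', 'type': 'string'}]}

records = [{'id': 10,
            'title': 'AccessibleComputing',
            'revision_timestamp': '2018-08-14T06:47:24Z'}]

with open('example.avro', 'wb') as out_file:
    writer(out_file, schema, records)

# The schema travels inside the file, so the reader needs nothing else.
with open('example.avro', 'rb') as in_file:
    avro_reader = reader(in_file)
    print(avro_reader.writer_schema)
    for record in avro_reader:
        print(record)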

Below I'll use a simple CSV to AVRO conversion tool. It doesn't support reading data from gzip-compressed files so I'll decompress the CSV file first.

$ pigz -d --keep metadata.csv.gz
$ ~/csv-avro-converter-cli-1.0/bin/csv-avro-converter-cli \
    --in  metadata.csv \
    --out metadata.avro

The file sizes of the uncompressed CSV and AVRO are about the same.

The tool performs some basic analysis of the CSV contents and puts together a schema for the dataset. Note that the timestamp column ended up being stored as a string rather than any sort of more compact timestamp type. Below is a dump of the schema the conversion tool put together.

$ java -jar ~/avro-tools-1.8.2.jar \
       getschema \
       metadata.avro
{
  "type" : "record",
  "name" : "csv",
  "namespace" : "metadata",
  "fields" : [ {
    "name" : "id",
    "type" : "int"
  }, {
    "name" : "revision_contributor_id",
    "type" : [ "null", "int" ],
    "default" : null
  }, {
    "name" : "revision_contributor_username",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "revision_timestamp",
    "type" : "string"
  }, {
    "name" : "title",
    "type" : "string"
  } ]
}

Below I'll extract a record from the AVRO file. Note that the data hasn't been truncated or degraded in any way.

$ java -jar ~/avro-tools-1.8.2.jar \
       tojson \
       metadata.avro \
    | head -n1 \
    | python -mjson.tool
{
    "id": 10,
    "revision_contributor_id": {
        "int": 23257138
    },
    "revision_contributor_username": {
        "string": "Godsy"
    },
    "revision_timestamp": "2018-08-14T06:47:24Z",
    "title": "AccessibleComputing"
}

Converting CSV to ORC

ORC files store data in a columnar, compressed form. The project began in early 2013 and has been led by Hortonworks co-founder and Technical Fellow Owen O'Malley. The format is very popular in the Hadoop ecosystem; in 2014, Facebook announced they had converted tens of petabytes of data in their then-300 PB data warehouse into ORC format.

Statistics are kept for each column across the ORC file as a whole and for each group of 10,000 rows within each column. This lets queries skip the parts of a file that aren't relevant, narrowing the search space and speeding things up.

When the ORC project started it was closely tied to Hive, but since 2016 its releases no longer depend on it. Different Hadoop projects often don't share the same version of the underlying ORC library, but this year it was announced that the next release of Spark, version 2.4, will use the same version of the ORC library as Hive 3.0.

Below I'll convert the compressed CSV file into an ORC file. The tool I'm using doesn't support feeding in compressed data nor can it skip the header row so I'll decompress all but the first line of data into a new CSV file.

$ pigz -d -c metadata.csv.gz \
    | tail -n +2 \
    > metadata.noheader.csv

ORC files have schemas with field names and data types. Below I'll define the schema for the five fields within the CSV file. I've kept the field names a, b, c, d and e in order to keep the command short; longer, more descriptive field names are supported.

$ java -jar ~/orc-tools-1.5.2-uber.jar \
     convert \
     metadata.noheader.csv \
     --schema "struct<a:int,b:int,c:string,d:timestamp,e:string>" \
     --timestampformat "yyyy-MM-dd'T'HH:mm:ss'Z'" \
     --output metadata.orc

The resulting ORC file is around 25% of the size of the uncompressed CSV data.
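
Because the layout is columnar, a reader can pick out individual columns without scanning the rest of the file. Below is a minimal sketch using pyarrow's ORC module (an assumption; pyarrow isn't installed above) that prints the file's schema and reads it back into a pandas DataFrame.

# A minimal sketch using pyarrow's ORC reader (pyarrow isn't installed above).
from pyarrow import orc


orc_file = orc.ORCFile('metadata.orc')

# The schema matches the struct declared on the command line above.
print(orc_file.schema)

# Read the file into a pandas DataFrame for inspection.
df = orc_file.read().to_pandas()
print(df.head())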

I'll print out the properties of the resulting ORC file. Below you can see the file is compressed using zlib, has a single stripe and carries statistics on the minimum, maximum and sum values of each column. The third column, which contains the revision contributor usernames, is dictionary-encoded.

$ java -jar ~/orc-tools-1.5.2-uber.jar \
     meta \
     metadata.orc
Processing data file metadata.orc [length: 341494]
Structure for metadata.orc
File Version: 0.12 with ORC_135
Rows: 19833
Compression: ZLIB
Compression size: 262144
Type: struct<a:int,b:int,c:string,d:timestamp,e:string>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 19833 hasNull: false
    Column 1: count: 19833 hasNull: false bytesOnDisk: 5511 min: 10 max: 30302 sum: 305772174
    Column 2: count: 17310 hasNull: true bytesOnDisk: 44996 min: 0 max: 34699348 sum: 204280488782
    Column 3: count: 17310 hasNull: true bytesOnDisk: 64840 min: "Serious" Sam Stone max: 칼빈500 sum: 173207
    Column 4: count: 19833 hasNull: false bytesOnDisk: 70303 min: 2002-02-25 15:43:11.0 max: 2018-09-20 11:47:36.0
    Column 5: count: 19833 hasNull: false bytesOnDisk: 154814 min: "Hello, World!" program max: ♯P sum: 312852

File Statistics:
  Column 0: count: 19833 hasNull: false
  Column 1: count: 19833 hasNull: false bytesOnDisk: 5511 min: 10 max: 30302 sum: 305772174
  Column 2: count: 17310 hasNull: true bytesOnDisk: 44996 min: 0 max: 34699348 sum: 204280488782
  Column 3: count: 17310 hasNull: true bytesOnDisk: 64840 min: "Serious" Sam Stone max: 칼빈500 sum: 173207
  Column 4: count: 19833 hasNull: false bytesOnDisk: 70303 min: 2002-02-25 15:43:11.0 max: 2018-09-20 11:47:36.0
  Column 5: count: 19833 hasNull: false bytesOnDisk: 154814 min: "Hello, World!" program max: ♯P sum: 312852

Stripes:
  Stripe: offset: 3 data: 340464 rows: 19833 tail: 144 index: 412
    Stream: column 0 section ROW_INDEX start: 3 length 21
    Stream: column 1 section ROW_INDEX start: 24 length 62
    Stream: column 2 section ROW_INDEX start: 86 length 62
    Stream: column 3 section ROW_INDEX start: 148 length 115
    Stream: column 4 section ROW_INDEX start: 263 length 62
    Stream: column 5 section ROW_INDEX start: 325 length 90
    Stream: column 1 section DATA start: 415 length 5511
    Stream: column 2 section PRESENT start: 5926 length 1686
    Stream: column 2 section DATA start: 7612 length 43310
    Stream: column 3 section PRESENT start: 50922 length 1686
    Stream: column 3 section DATA start: 52608 length 27735
    Stream: column 3 section LENGTH start: 80343 length 3229
    Stream: column 3 section DICTIONARY_DATA start: 83572 length 32190
    Stream: column 4 section DATA start: 115762 length 70288
    Stream: column 4 section SECONDARY start: 186050 length 15
    Stream: column 5 section DATA start: 186065 length 140495
    Stream: column 5 section LENGTH start: 326560 length 14319
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2
    Encoding column 2: DIRECT_V2
    Encoding column 3: DICTIONARY_V2[5197]
    Encoding column 4: DIRECT_V2
    Encoding column 5: DIRECT_V2

File length: 341494 bytes
Padding length: 0 bytes
Padding ratio: 0%

I'll dump out the first ten rows of data from the ORC file and, as you can see, the data looks to be intact.

$ java -jar ~/orc-tools-1.5.2-uber.jar \
        data \
        metadata.orc \
    | head
{"a":10,"b":23257138,"c":"Godsy","d":"2018-08-14 06:47:24.0","e":"AccessibleComputing"}
{"a":12,"b":null,"c":null,"d":"2018-09-19 12:07:26.0","e":"Anarchism"}
{"a":13,"b":9784415,"c":"Tom.Reding","d":"2017-06-05 04:18:18.0","e":"AfghanistanHistory"}
{"a":14,"b":9784415,"c":"Tom.Reding","d":"2017-06-05 04:18:23.0","e":"AfghanistanGeography"}
{"a":15,"b":9784415,"c":"Tom.Reding","d":"2017-06-05 04:19:42.0","e":"AfghanistanPeople"}
{"a":18,"b":9784415,"c":"Tom.Reding","d":"2017-06-05 04:19:45.0","e":"AfghanistanCommunications"}
{"a":19,"b":9784415,"c":"Tom.Reding","d":"2017-06-04 21:42:11.0","e":"AfghanistanTransportations"}
{"a":20,"b":9784415,"c":"Tom.Reding","d":"2017-06-04 21:43:11.0","e":"AfghanistanMilitary"}
{"a":21,"b":9784415,"c":"Tom.Reding","d":"2017-06-04 21:43:14.0","e":"AfghanistanTransnationalIssues"}
{"a":23,"b":9784415,"c":"Tom.Reding","d":"2017-06-05 04:19:50.0","e":"AssistiveTechnology"}

For comparison, this is the data for the first ten rows in the source CSV file.

$ csvlook --no-header-row \
          --no-inference \
          --max-rows 10 \
          metadata.noheader.csv
| a  | b        | c          | d                    | e                              |
| -- | -------- | ---------- | -------------------- | ------------------------------ |
| 10 | 23257138 | Godsy      | 2018-08-14T06:47:24Z | AccessibleComputing            |
| 12 |          |            | 2018-09-19T12:07:26Z | Anarchism                      |
| 13 | 9784415  | Tom.Reding | 2017-06-05T04:18:18Z | AfghanistanHistory             |
| 14 | 9784415  | Tom.Reding | 2017-06-05T04:18:23Z | AfghanistanGeography           |
| 15 | 9784415  | Tom.Reding | 2017-06-05T04:19:42Z | AfghanistanPeople              |
| 18 | 9784415  | Tom.Reding | 2017-06-05T04:19:45Z | AfghanistanCommunications      |
| 19 | 9784415  | Tom.Reding | 2017-06-04T21:42:11Z | AfghanistanTransportations     |
| 20 | 9784415  | Tom.Reding | 2017-06-04T21:43:11Z | AfghanistanMilitary            |
| 21 | 9784415  | Tom.Reding | 2017-06-04T21:43:14Z | AfghanistanTransnationalIssues |
| 23 | 9784415  | Tom.Reding | 2017-06-05T04:19:50Z | AssistiveTechnology            |

Importing CSVs into ClickHouse

Once I've gotten a large sample of a data feed together, it's helpful to build fluency in the dataset using an analytics-focused database. For this exercise I'll import the CSV data into ClickHouse.

ClickHouse has done well in my 1.1 Billion Taxi Rides benchmarks and, even though it doesn't execute queries on GPUs, it is very performant, even on older Intel CPUs. Installation is a simple apt install, the default configuration works well and there's no licence fee for commercial use.

I'll make sure the ClickHouse server is running and then connect to it with the client.

$ sudo service clickhouse-server start
$ clickhouse-client

I'll first create a table called pages that will store the data in more or less the same form as it appears in the CSV file. ClickHouse won't interpret the Z suffix in the timestamp strings properly, so I'll store the timestamp initially as a string and transform it later. The Z refers to Zulu time / UTC.

CREATE TABLE pages (
    page_id                       UInt32,
    revision_contributor_id       UInt32,
    revision_contributor_username Nullable(String),
    revision_timestamp            String,
    title                         Nullable(String)
) ENGINE=Log;

The following will feed the decompressed CSV data into the table while skipping the header row.

$ pigz -d -c metadata.csv.gz \
    | tail -n +2 \
    | clickhouse-client \
        --query="INSERT INTO pages FORMAT CSV"

Now I can create a new pages_mt table which will use the faster MergeTree engine. This will do a good job at speeding up queries on this dataset.

$ clickhouse-client
CREATE TABLE pages_mt ENGINE=MergeTree(date_, page_id, 8192) AS
    SELECT page_id,
           revision_contributor_id,
           revision_contributor_username,
           toDateTime(replaceOne(revision_timestamp, 'Z', '')) as revision_timestamp_,
           toDate(replaceOne(revision_timestamp, 'Z', '')) as date_,
           title
    FROM pages;

The following is an example query looking for the ten most frequent years for revisions.

SELECT toYear(date_) AS year,
       count(*) AS revisions
FROM pages_mt
GROUP BY year
ORDER BY 2 DESC
LIMIT 10
┌─year─┬─revisions─┐
│ 2002 │       289 │
│ 2003 │       176 │
│ 2004 │       193 │
│ 2005 │       159 │
│ 2006 │       211 │
│ 2007 │       125 │
│ 2008 │       129 │
│ 2009 │       186 │
│ 2010 │       200 │
│ 2011 │       188 │
└──────┴───────────┘
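
ClickHouse also exposes an HTTP interface on port 8123, which makes it easy to run the same queries from Python. Below is a minimal sketch assuming the requests package and the default, unauthenticated local configuration.

# A minimal sketch assuming the requests package and ClickHouse's default,
# unauthenticated configuration listening on localhost port 8123.
import requests


QUERY = '''
    SELECT toYear(date_) AS year,
           count(*) AS revisions
    FROM pages_mt
    GROUP BY year
    ORDER BY 2 DESC
    LIMIT 10
    FORMAT TabSeparatedWithNames
'''

resp = requests.post('http://localhost:8123/', data=QUERY)
resp.raise_for_status()
print(resp.text)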
