Home | Benchmarks | Archives | Atom Feed

Posted on Sat 23 January 2016

A Million Songs on AWS Redshift

In this post I'll walk through extracting data from a database of one million contemporary popular music tracks, loading them into AWS Redshift via S3 and running some simple analysis via PostgreSQL's interactive terminal psql.

Installing Requirements

The following requirements were all installed on a fresh Ubuntu 15.10 installation.

➫ sudo apt-get update
➫ sudo apt-get install \
    python-pip \
    python-virtualenv \
    postgresql-client-9.4 \
    sqlite3 \
    parallel

For the AWS CLI tool and s3cmd I'll install them via pip into a virtual environment.

➫ virtualenv redshift
➫ source redshift/bin/activate

➫ pip install \
    awscli \
    https://github.com/s3tools/s3cmd/archive/v1.6.1.zip#egg=s3cmd

Fetching the Metadata

The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks. The dataset is the result of a collaboration between The Echo Nest and LabROSA at Columbia University, supported in part by the NSF.

Within this dataset there are several subsets of data. The one I'm interested in is the million track metadata extract in SQLite3 format. The following will download it from Columbia University's Servers.

➫ wget -c http://labrosa.ee.columbia.edu/millionsong/sites/default/files/AdditionalFiles/track_metadata.db

Exploring the Metadata

Before I load the data into Redshift I'll first explore the shape of the data and value distributions.

➫ sqlite3 track_metadata.db

The database has a single table within it. Most of the columns aren't very granular in terms of data types so I'll need to explore how small I can make each column when designing the Redshift table schema later on.

sqlite> .schema
CREATE TABLE songs (
    track_id            text PRIMARY KEY,
    title               text,
    song_id             text,
    release             text,
    artist_id           text,
    artist_mbid         text,
    artist_name         text,
    duration            real,
    artist_familiarity  real,
    artist_hotttnesss   real,
    year                int,
    track_7digitalid    int,
    shs_perf            int,
    shs_work            int
);
...

There are indeed one million records within this table.

sqlite> select count(*) from songs;
1000000

I'll examine a sample record to get an idea of the nature of the data stored.

sqlite> .mode line
sqlite> SELECT * FROM songs LIMIT 1;
          track_id = TRMMMYQ128F932D901
             title = Silent Night
           song_id = SOQMMHC12AB0180CB8
           release = Monster Ballads X-Mas
         artist_id = ARYZTJS1187B98C555
       artist_mbid = 357ff05d-848a-44cf-b608-cb34b5701ae5
       artist_name = Faster Pussy cat
          duration = 252.05506
artist_familiarity = 0.649822100201
 artist_hotttnesss = 0.394031892714
              year = 2003
  track_7digitalid = 7032331
          shs_perf = -1
          shs_work = 0

I'll then check the cardinatily of the track and song identifier columns. The track_id column is unique where a song_id can have multiple track_ids.

sqlite> SELECT count(*) cnt, track_id
        FROM songs
        GROUP BY track_id
        ORDER BY cnt desc
        LIMIT 1;
     cnt = 1
track_id = TRAAAAK128F9318786
sqlite> SELECT count(*) cnt, song_id
        FROM songs
        GROUP BY song_id
        ORDER BY cnt DESC
        LIMIT 1;
    cnt = 3
song_id = SOBPAEP12A58A77F49

I'll see how long the various identifier columns are and the maximum values of the two integer fields I'm interested in. Knowing these values will help me pick the smallest column types for the Redshift table schema later on.

sqlite> .mode csv
sqlite> .headers off

sqlite> SELECT MAX(LENGTH(artist_id)) FROM songs;
18

sqlite> SELECT MAX(LENGTH(artist_familiarity)) FROM songs;
16

sqlite> SELECT MAX(LENGTH(artist_hotttnesss)) FROM songs;
15

sqlite> SELECT MAX(LENGTH(track_id)) FROM songs;
18

sqlite> SELECT MAX(LENGTH(duration)) FROM songs;
10

sqlite> SELECT MAX(year) FROM songs;
2011

sqlite> select MAX(duration) FROM songs;
3034.90567

Exporting Data of Interest

I've created a query to pull six fields of interest from every record the table and I'll export that data out into a CSV file.

➫ sqlite3 track_metadata.db <<!
.headers off
.mode csv
.output track_metadata.csv
SELECT track_id,
       artist_id,
       artist_familiarity,
       artist_hotttnesss,
       duration,
       year
FROM songs;
!

Split and Compress

The fastest way to load data into Redshift is by breaking up your data into separate files and compressing them. Below I'll break the one-million-row CSV file up into twenty CSV files of 50K rows each and compress them.

➫ split -l 50000 \
         --additional-suffix=.csv \
         track_metadata.csv \
         track_metadata_

➫ gzip track_metadata_*

The resulting GZIP files are about 1.8MB each in size.

➫ ls -lh track_metadata*.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_aa.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_ab.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_ac.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_ad.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_ae.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_af.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_ag.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_ah.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_ai.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_aj.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_ak.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_al.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_am.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_an.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_ao.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_ap.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_aq.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_ar.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_as.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_at.csv.gz

Uploading in Parallel to S3

You want to make sure your S3 bucket is created in the same region that you'll be launching your Redshift instance in. Amazon's us-east-1 region has the cheapest price on Redshift clusters so I'll use that region to create my S3 bucket in.

➫ s3cmd --configure
...
Default Region [US]: US
...
➫ s3cmd mb s3://track_metadata_example
Bucket 's3://track_metadata_example/' created

The following will use GNU's parallel command to upload eight files at a time to Amazon S3.

➫ find track_metadata_*gz | \
    parallel -j8 \
    s3cmd put {/} s3://track_metadata_example/

Redshift needs a manifest of the files we'll be loading in.

➫ vi songs.manifest
{
    "entries": [
        {"url": "s3://track_metadata_example/track_metadata_aa.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_ab.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_ac.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_ad.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_ae.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_af.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_ag.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_ah.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_ai.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_aj.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_ak.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_al.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_am.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_an.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_ao.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_ap.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_aq.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_ar.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_as.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_at.csv.gz", "mandatory": true}
    ]
}

The manifest itself will also need to live on S3.

➫ s3cmd put songs.manifest s3://track_metadata_example/

Launching the Redshift Cluster

The cheapest Redshift cluster I could find on Amazon's pricing page was a dc1.large in us-east-1 at $0.25 / hour. It comes with 2 vCPUs, 15 GB of RAM, 0.16 TB of SSD-backed storage and throughput support of 0.20GB / second.

Below I'll set the environment variables for my AWS access credentials and set the master username and password for my Redshift instance.

read AWS_ACCESS_KEY_ID
➫ read AWS_SECRET_ACCESS_KEY
➫ export AWS_ACCESS_KEY_ID
➫ export AWS_SECRET_ACCESS_KEY
read MASTER_USERNAME
➫ read MASTER_PASSWORD
➫ export MASTER_USERNAME
➫ export MASTER_PASSWORD

Before creating the Redshift cluster make sure the AWS CLI tool's default region is the same region your S3 bucket is located in. You can run the configure command to update any settings and ensure they're as you expect.

➫ aws configure
...
Default region name [None]: us-east-1
...

This will create a Redshift cluster with your default security group.

➫ aws redshift create-cluster \
    --db-name songs \
    --cluster-type single-node \
    --node-type dc1.large \
    --master-username $MASTER_USERNAME \
    --master-user-password $MASTER_PASSWORD \
    --publicly-accessible \
    --cluster-identifier song-data \
    --availability-zone us-east-1a
{
    "Cluster": {
        "ClusterVersion": "1.0",
        "AvailabilityZone": "us-east-1a",
        "NodeType": "dc1.large",
        "PubliclyAccessible": true,
        "Tags": [],
        "MasterUsername": "mark",
        "ClusterParameterGroups": [
            {
                "ParameterGroupName": "default.redshift-1.0",
                "ParameterApplyStatus": "in-sync"
            }
        ],
        "Encrypted": false,
        "ClusterSecurityGroups": [
            {
                "Status": "active",
                "ClusterSecurityGroupName": "default"
            }
        ],
        "AllowVersionUpgrade": true,
        "VpcSecurityGroups": [],
        "NumberOfNodes": 1,
        "AutomatedSnapshotRetentionPeriod": 1,
        "ClusterStatus": "creating",
        "ClusterIdentifier": "song-data",
        "DBName": "songs",
        "PreferredMaintenanceWindow": "mon:09:00-mon:09:30",
        "PendingModifiedValues": {
            "MasterUserPassword": "****"
        }
    }
}

Once you execute that command Amazon will set about creating the cluster. In my experience it usually takes a few minutes. I tend to run a watch command to keep an eye on the cluster status and wait till I can see the cluster's endpoint address and see the status set to "available".

➫ watch -n10 aws redshift describe-clusters
{
    "Clusters": [
        {
            "PubliclyAccessible": true,
            "MasterUsername": "mark",
            "VpcSecurityGroups": [],
            "ClusterPublicKey": "...",
            "NumberOfNodes": 1,
            "PendingModifiedValues": {},
            "ClusterVersion": "1.0",
            "Tags": [],
            "AutomatedSnapshotRetentionPeriod": 1,
            "ClusterParameterGroups": [
                {
                    "ParameterGroupName": "default.redshift-1.0",
                    "ParameterApplyStatus": "in-sync"
                }
            ],
            "DBName": "songs",
            "PreferredMaintenanceWindow": "mon:09:00-mon:09:30",
            "Endpoint": {
                "Port": 5439,
                "Address": "song-data.cttuaolixpsz.us-east-1.redshift.amazonaws.com"
            },
            "AllowVersionUpgrade": true,
            "ClusterCreateTime": "2016-01-23T16:59:38.615Z",
            "ClusterSecurityGroups": [
                {
                    "Status": "active",
                    "ClusterSecurityGroupName": "default"
                }
            ],
            "ClusterIdentifier": "song-data",
            "ClusterNodes": [
                {
                    "NodeRole": "SHARED",
                    "PrivateIPAddress": "10.234.73.99",
                    "PublicIPAddress": "54.157.227.115"
                }
            ],
            "AvailabilityZone": "us-east-1a",
            "NodeType": "dc1.large",
            "Encrypted": false,
            "ClusterRevisionNumber": "1019",
            "ClusterStatus": "available"
        }
    ]
}

In the above you can see the security group is set to my default one. Within that group I've allowed my local IP address to connect on port 5439. I tend to set my security groups up via the AWS console rather than the CLI tool so I don't have a CLI example to copy-and-paste.

Once you've enabled access from your IP address to the Redshift cluster via port 5439 you'll be able to access it via the PostgreSQL interactive terminal. Below I'm using version 9.4.5. I've seen 9.3 connect to Redshift just fine as well.

➫ psql --version
psql (PostgreSQL) 9.4.5

➫ PGPASSWORD=$MASTER_PASSWORD \
    psql -h song-data.cttuaolixpsz.us-east-1.redshift.amazonaws.com \
         -p 5439 \
         -U $MASTER_USERNAME songs

Loading Compressed CSV data into Redshift

I'll create an exploratory schema for the track data to load into.

CREATE TABLE tracks (
    track_id            VARCHAR(18) NOT NULL DISTKEY ENCODE LZO,
    artist_id           VARCHAR(18) NOT NULL ENCODE LZO,
    artist_familiarity  DECIMAL(16,15) ENCODE MOSTLY8,
    artist_hotttnesss   DECIMAL(16,15) ENCODE MOSTLY8,
    duration            DECIMAL(12,8) ENCODE MOSTLY8,
    year                SMALLINT ENCODE MOSTLY8,
    primary key(track_id)
) sortkey(year);

I'll then issue a command to load in the data from the GZIP'ed CSV files on S3. This command has place holders where your AWS access key identifier and secret access key need to go. Please change these before executing this command.

COPY tracks
  FROM 's3://track_metadata_example/songs.manifest'
  CREDENTIALS
    'aws_access_key_id=...;aws_secret_access_key=...'
  DELIMITER ','
  EMPTYASNULL
  ESCAPE
  GZIP
  MANIFEST
  MAXERROR 1000
  REMOVEQUOTES
  TRIMBLANKS
  TRUNCATECOLUMNS;

When I ran this command 77MB of data (when uncompressed) was loaded in 9.92 seconds (~7.76 MB / second). The networking overhead of fetching so many small files probably contributed to the slow performance of this operation.

INFO:  Load into table 'tracks' completed, 1000000 record(s) loaded successfully.
COPY
Time: 9921.177 ms

I'll then run the VACUUM command to make sure the data is sorted properly.

VACUUM tracks;

Querying data on Redshift

Now that the data is loaded I can run some SQL commands to explore the data.

SELECT year, AVG(duration)
FROM tracks
WHERE year > 1970
GROUP BY year
ORDER BY year;
 year |     avg
------+--------------
 1971 | 245.59519935
 1972 | 249.72693913
 1973 | 250.97967923
 1974 | 248.27482999
 1975 | 258.27489716
 1976 | 255.10848195
 1977 | 250.99925862
 1978 | 245.41075019
 1979 | 245.40586793
 1980 | 237.43758797
 1981 | 238.90122831
 1982 | 232.92932395
 1983 | 239.75413748
 1984 | 243.27543189
 1985 | 250.00972298
 1986 | 244.71989363
 1987 | 237.88305477
 1988 | 242.14960836
 1989 | 241.54657340
 1990 | 240.88123135
 1991 | 242.51786532
 1992 | 245.00212516
 1993 | 246.61140532
 1994 | 246.56816199
 1995 | 250.44964826
 1996 | 247.59096758
 1997 | 252.23722730
 1998 | 252.05911886
 1999 | 251.08285731
 2000 | 251.41572952
 2001 | 254.42115771
 2002 | 249.59224630
 2003 | 246.78189993
 2004 | 248.30687510
 2005 | 245.18002133
 2006 | 247.90531512
 2007 | 249.69939469
 2008 | 250.91283764
 2009 | 249.94171984
 2010 | 250.41501357
 2011 | 185.05098000
(41 rows)

Improving the Column Encoding Settings

When I created the table in Redshift I took a guess as to the encoding methods to use on each field. With the data now loaded I can analyse 100K records and see which encoding method will result in the smallest amount of space being used. Redshift is primarily I/O constrained so generally the less disk space you data takes up the faster your queries will run. There are exceptions to this rule in relation to sort keys though.

ANALYZE COMPRESSION tracks;
 Table  |       Column       | Encoding
--------+--------------------+----------
 tracks | track_id           | lzo
 tracks | artist_id          | lzo
 tracks | artist_familiarity | lzo
 tracks | artist_hotttnesss  | lzo
 tracks | duration           | lzo
 tracks | year               | lzo
(6 rows)

The above suggests I use LZO compression on every single column in the table.

Thank you for taking the time to read this post. I offer consulting, architecture and hands-on development services to clients in Europe. If you'd like to discuss how my offerings can help your business please contact me via LinkedIn.

Copyright © 2014 - 2017 Mark Litwintschik. This site's template is based off a template by Giulio Fidente.