In this post I'll walk through extracting metadata from a database of one million contemporary popular music tracks, loading it into AWS Redshift via S3 and running some simple analysis using PostgreSQL's interactive terminal, psql.
Installing Requirements
The following requirements were all installed on a fresh Ubuntu 15.10 installation.
$ sudo apt update
$ sudo apt install \
python-pip \
python-virtualenv \
postgresql-client-9.4 \
sqlite3 \
parallel
I'll install the AWS CLI tool and s3cmd via pip into a virtual environment.
$ virtualenv redshift
$ source redshift/bin/activate
$ pip install \
awscli \
https://github.com/s3tools/s3cmd/archive/v1.6.1.zip#egg=s3cmd
Fetching the Metadata
The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks. The dataset is the result of a collaboration between The Echo Nest and LabROSA at Columbia University, supported in part by the NSF.
Within this dataset there are several subsets of data. The one I'm interested in is the track metadata for all one million songs, which is distributed as an SQLite3 database. The following will download it from Columbia University's servers.
$ wget -c http://labrosa.ee.columbia.edu/millionsong/sites/default/files/AdditionalFiles/track_metadata.db
Exploring the Metadata
Before I load the data into Redshift I'll first explore the shape of the data and value distributions.
$ sqlite3 track_metadata.db
The database contains a single table. SQLite's column types (text, real, int) aren't very granular, so I'll need to explore how small I can make each column when designing the Redshift table schema later on.
sqlite> .schema
CREATE TABLE songs (
track_id text PRIMARY KEY,
title text,
song_id text,
release text,
artist_id text,
artist_mbid text,
artist_name text,
duration real,
artist_familiarity real,
artist_hotttnesss real,
year int,
track_7digitalid int,
shs_perf int,
shs_work int
);
...
There are indeed one million records within this table.
sqlite> select count(*) from songs;
1000000
I'll examine a sample record to get an idea of the nature of the data stored.
sqlite> .mode line
sqlite> SELECT * FROM songs LIMIT 1;
track_id = TRMMMYQ128F932D901
title = Silent Night
song_id = SOQMMHC12AB0180CB8
release = Monster Ballads X-Mas
artist_id = ARYZTJS1187B98C555
artist_mbid = 357ff05d-848a-44cf-b608-cb34b5701ae5
artist_name = Faster Pussy cat
duration = 252.05506
artist_familiarity = 0.649822100201
artist_hotttnesss = 0.394031892714
year = 2003
track_7digitalid = 7032331
shs_perf = -1
shs_work = 0
I'll then check the cardinality of the track and song identifier columns. The track_id column is unique, whereas a single song_id can map to multiple track_ids.
sqlite> SELECT count(*) cnt, track_id
FROM songs
GROUP BY track_id
ORDER BY cnt desc
LIMIT 1;
cnt = 1
track_id = TRAAAAK128F9318786
sqlite> SELECT count(*) cnt, song_id
FROM songs
GROUP BY song_id
ORDER BY cnt DESC
LIMIT 1;
cnt = 3
song_id = SOBPAEP12A58A77F49
I'll check the maximum lengths of the various identifier and numeric columns, along with the largest values of the year and duration fields. Knowing these values will help me pick the smallest column types for the Redshift table schema later on.
sqlite> .mode csv
sqlite> .headers off
sqlite> SELECT MAX(LENGTH(artist_id)) FROM songs;
18
sqlite> SELECT MAX(LENGTH(artist_familiarity)) FROM songs;
16
sqlite> SELECT MAX(LENGTH(artist_hotttnesss)) FROM songs;
15
sqlite> SELECT MAX(LENGTH(track_id)) FROM songs;
18
sqlite> SELECT MAX(LENGTH(duration)) FROM songs;
10
sqlite> SELECT MAX(year) FROM songs;
2011
sqlite> select MAX(duration) FROM songs;
3034.90567
Exporting Data of Interest
I've written a query to pull six fields of interest from every record in the table and I'll export that data to a CSV file.
$ sqlite3 track_metadata.db <<!
.headers off
.mode csv
.output track_metadata.csv
SELECT track_id,
artist_id,
artist_familiarity,
artist_hotttnesss,
duration,
year
FROM songs;
!
Split and Compress
The fastest way to load data into Redshift is to split your data across multiple files and compress them; Redshift can then load the files in parallel. Below I'll break the one-million-row CSV file into twenty files of 50,000 rows each and compress them with gzip.
$ split -l 50000 \
--additional-suffix=.csv \
track_metadata.csv \
track_metadata_
$ gzip track_metadata_*
The resulting GZIP files are about 1.8MB each in size.
$ ls -lh track_metadata*.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_aa.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_ab.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_ac.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_ad.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_ae.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_af.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_ag.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_ah.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_ai.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_aj.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_ak.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_al.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_am.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_an.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_ao.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_ap.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_aq.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_ar.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_as.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_at.csv.gz
Uploading in Parallel to S3
Make sure your S3 bucket is created in the same region you'll be launching your Redshift cluster in. Amazon's us-east-1 region has the cheapest Redshift pricing, so that's where I'll create my S3 bucket.
$ s3cmd --configure
...
Default Region [US]: US
...
$ s3cmd mb s3://track_metadata_example
Bucket 's3://track_metadata_example/' created
The following will use GNU's parallel command to upload eight files at a time to Amazon S3.
$ find track_metadata_*gz | \
parallel -j8 \
s3cmd put {/} s3://track_metadata_example/
Redshift needs a manifest of the files we'll be loading in.
$ vi songs.manifest
{
"entries": [
{"url": "s3://track_metadata_example/track_metadata_aa.csv.gz", "mandatory": true},
{"url": "s3://track_metadata_example/track_metadata_ab.csv.gz", "mandatory": true},
{"url": "s3://track_metadata_example/track_metadata_ac.csv.gz", "mandatory": true},
{"url": "s3://track_metadata_example/track_metadata_ad.csv.gz", "mandatory": true},
{"url": "s3://track_metadata_example/track_metadata_ae.csv.gz", "mandatory": true},
{"url": "s3://track_metadata_example/track_metadata_af.csv.gz", "mandatory": true},
{"url": "s3://track_metadata_example/track_metadata_ag.csv.gz", "mandatory": true},
{"url": "s3://track_metadata_example/track_metadata_ah.csv.gz", "mandatory": true},
{"url": "s3://track_metadata_example/track_metadata_ai.csv.gz", "mandatory": true},
{"url": "s3://track_metadata_example/track_metadata_aj.csv.gz", "mandatory": true},
{"url": "s3://track_metadata_example/track_metadata_ak.csv.gz", "mandatory": true},
{"url": "s3://track_metadata_example/track_metadata_al.csv.gz", "mandatory": true},
{"url": "s3://track_metadata_example/track_metadata_am.csv.gz", "mandatory": true},
{"url": "s3://track_metadata_example/track_metadata_an.csv.gz", "mandatory": true},
{"url": "s3://track_metadata_example/track_metadata_ao.csv.gz", "mandatory": true},
{"url": "s3://track_metadata_example/track_metadata_ap.csv.gz", "mandatory": true},
{"url": "s3://track_metadata_example/track_metadata_aq.csv.gz", "mandatory": true},
{"url": "s3://track_metadata_example/track_metadata_ar.csv.gz", "mandatory": true},
{"url": "s3://track_metadata_example/track_metadata_as.csv.gz", "mandatory": true},
{"url": "s3://track_metadata_example/track_metadata_at.csv.gz", "mandatory": true}
]
}
The manifest itself will also need to live on S3.
$ s3cmd put songs.manifest s3://track_metadata_example/
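Since every entry in the manifest is marked as mandatory, the COPY command later on will fail if any of the files are missing, so it's worth a quick listing to confirm all twenty parts and the manifest itself are in the bucket (output omitted here).
$ s3cmd ls s3://track_metadata_example/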
Launching the Redshift Cluster
The cheapest Redshift node type I could find on Amazon's pricing page was a dc1.large in us-east-1 at $0.25 / hour. It comes with 2 vCPUs, 15 GB of RAM, 0.16 TB of SSD-backed storage and 0.20 GB/second of I/O throughput.
Below I'll set the environment variables for my AWS access credentials and set the master username and password for my Redshift instance.
$ read AWS_ACCESS_KEY_ID
$ read AWS_SECRET_ACCESS_KEY
$ export AWS_ACCESS_KEY_ID
$ export AWS_SECRET_ACCESS_KEY
$ read MASTER_USERNAME
$ read MASTER_PASSWORD
$ export MASTER_USERNAME
$ export MASTER_PASSWORD
Before creating the Redshift cluster, make sure the AWS CLI tool's default region matches the region your S3 bucket is located in. You can run the configure command to review and update these settings.
$ aws configure
...
Default region name [None]: us-east-1
...
This will create a Redshift cluster with your default security group.
$ aws redshift create-cluster \
--db-name songs \
--cluster-type single-node \
--node-type dc1.large \
--master-username $MASTER_USERNAME \
--master-user-password $MASTER_PASSWORD \
--publicly-accessible \
--cluster-identifier song-data \
--availability-zone us-east-1a
{
"Cluster": {
"ClusterVersion": "1.0",
"AvailabilityZone": "us-east-1a",
"NodeType": "dc1.large",
"PubliclyAccessible": true,
"Tags": [],
"MasterUsername": "mark",
"ClusterParameterGroups": [
{
"ParameterGroupName": "default.redshift-1.0",
"ParameterApplyStatus": "in-sync"
}
],
"Encrypted": false,
"ClusterSecurityGroups": [
{
"Status": "active",
"ClusterSecurityGroupName": "default"
}
],
"AllowVersionUpgrade": true,
"VpcSecurityGroups": [],
"NumberOfNodes": 1,
"AutomatedSnapshotRetentionPeriod": 1,
"ClusterStatus": "creating",
"ClusterIdentifier": "song-data",
"DBName": "songs",
"PreferredMaintenanceWindow": "mon:09:00-mon:09:30",
"PendingModifiedValues": {
"MasterUserPassword": "****"
}
}
}
Once you execute that command Amazon will set about creating the cluster, which in my experience usually takes a few minutes. I tend to run a watch command to keep an eye on the cluster and wait until the endpoint address appears and the status is set to "available".
$ watch -n10 aws redshift describe-clusters
{
"Clusters": [
{
"PubliclyAccessible": true,
"MasterUsername": "mark",
"VpcSecurityGroups": [],
"ClusterPublicKey": "...",
"NumberOfNodes": 1,
"PendingModifiedValues": {},
"ClusterVersion": "1.0",
"Tags": [],
"AutomatedSnapshotRetentionPeriod": 1,
"ClusterParameterGroups": [
{
"ParameterGroupName": "default.redshift-1.0",
"ParameterApplyStatus": "in-sync"
}
],
"DBName": "songs",
"PreferredMaintenanceWindow": "mon:09:00-mon:09:30",
"Endpoint": {
"Port": 5439,
"Address": "song-data.cttuaolixpsz.us-east-1.redshift.amazonaws.com"
},
"AllowVersionUpgrade": true,
"ClusterCreateTime": "2016-01-23T16:59:38.615Z",
"ClusterSecurityGroups": [
{
"Status": "active",
"ClusterSecurityGroupName": "default"
}
],
"ClusterIdentifier": "song-data",
"ClusterNodes": [
{
"NodeRole": "SHARED",
"PrivateIPAddress": "10.234.73.99",
"PublicIPAddress": "54.157.227.115"
}
],
"AvailabilityZone": "us-east-1a",
"NodeType": "dc1.large",
"Encrypted": false,
"ClusterRevisionNumber": "1019",
"ClusterStatus": "available"
}
]
}
In the above you can see the cluster is using my default security group. Within that group I've allowed my local IP address to connect on port 5439. I tend to set my security groups up via the AWS console rather than the CLI tool, but a rough CLI equivalent is sketched below.
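For reference, something along these lines should work for the classic (non-VPC) cluster security group named default shown above; the IP address below is a placeholder you'd swap for your own. If your cluster lives inside a VPC you'd instead add an inbound rule for port 5439 to the relevant VPC security group.
$ aws redshift authorize-cluster-security-group-ingress \
    --cluster-security-group-name default \
    --cidrip 203.0.113.10/32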
Once you've allowed your IP address to reach the Redshift cluster on port 5439 you'll be able to connect with the PostgreSQL interactive terminal. Below I'm using version 9.4.5; I've seen 9.3 connect to Redshift just fine as well.
$ psql --version
psql (PostgreSQL) 9.4.5
$ PGPASSWORD=$MASTER_PASSWORD \
psql -h song-data.cttuaolixpsz.us-east-1.redshift.amazonaws.com \
-p 5439 \
-U $MASTER_USERNAME songs
Loading Compressed CSV data into Redshift
I'll create an exploratory table schema for the track data to be loaded into.
CREATE TABLE tracks (
track_id VARCHAR(18) NOT NULL DISTKEY ENCODE LZO,
artist_id VARCHAR(18) NOT NULL ENCODE LZO,
artist_familiarity DECIMAL(16,15) ENCODE MOSTLY8,
artist_hotttnesss DECIMAL(16,15) ENCODE MOSTLY8,
duration DECIMAL(12,8) ENCODE MOSTLY8,
year SMALLINT ENCODE MOSTLY8,
primary key(track_id)
) sortkey(year);
I'll then issue a COPY command to load the data from the GZIP'ed CSV files on S3. The command has placeholders where your AWS access key ID and secret access key need to go; replace these before executing it.
COPY tracks
FROM 's3://track_metadata_example/songs.manifest'
CREDENTIALS
'aws_access_key_id=...;aws_secret_access_key=...'
DELIMITER ','
EMPTYASNULL
ESCAPE
GZIP
MANIFEST
MAXERROR 1000
REMOVEQUOTES
TRIMBLANKS
TRUNCATECOLUMNS;
When I ran this command, 77 MB of data (uncompressed size) was loaded in 9.92 seconds (~7.76 MB / second). The networking overhead of fetching so many small files probably contributed to the relatively slow throughput of this operation.
INFO: Load into table 'tracks' completed, 1000000 record(s) loaded successfully.
COPY
Time: 9921.177 ms
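Because the COPY above runs with MAXERROR 1000, Redshift is allowed to skip up to a thousand problem rows rather than fail the load, so it's worth confirming nothing was silently rejected. One way to do that is to query the stl_load_errors system table; the columns below are just a handful of the ones it exposes. An empty result set means no rows were rejected by recent loads.
SELECT starttime,
       filename,
       line_number,
       colname,
       err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;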
I'll then run the VACUUM command to make sure the data is sorted according to the table's sort key.
VACUUM tracks;
Querying data on Redshift
Now that the data is loaded I can run some SQL queries to explore it.
SELECT year, AVG(duration)
FROM tracks
WHERE year > 1970
GROUP BY year
ORDER BY year;
year | avg
------+--------------
1971 | 245.59519935
1972 | 249.72693913
1973 | 250.97967923
1974 | 248.27482999
1975 | 258.27489716
1976 | 255.10848195
1977 | 250.99925862
1978 | 245.41075019
1979 | 245.40586793
1980 | 237.43758797
1981 | 238.90122831
1982 | 232.92932395
1983 | 239.75413748
1984 | 243.27543189
1985 | 250.00972298
1986 | 244.71989363
1987 | 237.88305477
1988 | 242.14960836
1989 | 241.54657340
1990 | 240.88123135
1991 | 242.51786532
1992 | 245.00212516
1993 | 246.61140532
1994 | 246.56816199
1995 | 250.44964826
1996 | 247.59096758
1997 | 252.23722730
1998 | 252.05911886
1999 | 251.08285731
2000 | 251.41572952
2001 | 254.42115771
2002 | 249.59224630
2003 | 246.78189993
2004 | 248.30687510
2005 | 245.18002133
2006 | 247.90531512
2007 | 249.69939469
2008 | 250.91283764
2009 | 249.94171984
2010 | 250.41501357
2011 | 185.05098000
(41 rows)
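As another example of the kind of exploration this table supports, the following query (output omitted) would rank artists with at least ten tracks in the dataset by their average hotttnesss score; the column names match the tracks table created earlier.
SELECT artist_id,
       AVG(artist_hotttnesss) AS avg_hotttnesss,
       COUNT(*) AS tracks_count
FROM tracks
GROUP BY artist_id
HAVING COUNT(*) >= 10
ORDER BY avg_hotttnesss DESC
LIMIT 10;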
Improving the Column Encoding Settings
When I created the table in Redshift I took a guess at which encoding method to use on each field. With the data now loaded I can have Redshift analyse 100K records and see which encoding would result in the smallest amount of space being used. Redshift is primarily I/O-constrained, so generally the less disk space your data takes up the faster your queries will run. There are exceptions to this rule in relation to sort keys though.
ANALYZE COMPRESSION tracks;
Table | Column | Encoding
--------+--------------------+----------
tracks | track_id | lzo
tracks | artist_id | lzo
tracks | artist_familiarity | lzo
tracks | artist_hotttnesss | lzo
tracks | duration | lzo
tracks | year | lzo
(6 rows)
The above suggests I use LZO compression on every single column in the table.
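Those recommendations already match the VARCHAR columns but differ for the numeric ones, which I'd guessed MOSTLY8 for. If I wanted to apply them, one approach (a sketch, not something I ran as part of this exercise) would be to create a second table with the suggested encodings, deep-copy the rows into it and swap the table names.
CREATE TABLE tracks_lzo (
    track_id           VARCHAR(18) NOT NULL DISTKEY ENCODE LZO,
    artist_id          VARCHAR(18) NOT NULL ENCODE LZO,
    artist_familiarity DECIMAL(16,15) ENCODE LZO,
    artist_hotttnesss  DECIMAL(16,15) ENCODE LZO,
    duration           DECIMAL(12,8) ENCODE LZO,
    year               SMALLINT ENCODE LZO,
    PRIMARY KEY(track_id)
) SORTKEY(year);

INSERT INTO tracks_lzo SELECT * FROM tracks;

DROP TABLE tracks;
ALTER TABLE tracks_lzo RENAME TO tracks;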