In 2015, Todd W. Schneider published a blog post titled Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance in which he analysed the metadata of 1.1 billion taxi journeys made in New York City between 2009 and 2015. Included with this work was a link to a GitHub repository where he published the SQL, Shell and R files he used, along with instructions on how to get everything up and running using PostgreSQL.
Since discovering this blog post I've looked at the OLAP performance of various data stores with this dataset, including Redshift, Presto on AWS EMR, Elasticsearch and even PostgreSQL with an OLAP extension. In this post I'll take a look at Google Cloud's BigQuery and see how fast it can query the metadata of 1.1 billion taxi trips.
If you've already explored BigQuery you might be aware that Google has already created datasets of the taxi trips. Unfortunately, the data hasn't been federated into a single table nor combined with weather data.
Creating a Google Cloud Platform Account
When you hit the landing page for Google Cloud there is an offer to sign up and receive $300 of credits towards usage of their platform. Once I got my account set up I created a project called "taxis", which they then turned into "taxis-1273" for uniqueness.
Google Cloud SDK Up and Running
To communicate with Google Cloud I'll install a few tools on a fresh installation of Ubuntu 14.04.3 LTS.
I'll first add Google Cloud's package distribution details.
$ export CLOUD_SDK_REPO="cloud-sdk-$(lsb_release -c -s)"
$ echo "deb http://packages.cloud.google.com/apt $CLOUD_SDK_REPO main" | \
sudo tee /etc/apt/sources.list.d/google-cloud-sdk.list
$ curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | \
sudo apt-key add -
$ sudo apt update
The following will install, among others, the gcloud and bq (as in BigQuery) commands.
$ sudo apt install \
gcc \
google-cloud-sdk \
libffi-dev \
python-dev \
python-pip \
python-setuptools
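If you want to make sure everything installed properly, both tools can report their version numbers. This is just a sanity check and isn't required for anything that follows.
$ gcloud version
$ bq version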
I'll then initialise my environment. When this command was run I was given a URL to paste into a browser. The browser then asked me to log in with my Google account before presenting me with a secret verification code to paste back into the CLI.
$ gcloud init
Welcome! This command will take you through the configuration of gcloud.
Your current configuration has been set to: [default]
To continue, you must log in. Would you like to log in (Y/n)? y
Go to the following link in your browser:
https://accounts.google.com/o/oauth2/auth?redirect_uri=...
Enter verification code: ...
You are now logged in as: [mark@...]
After that I was asked to set a default project.
Pick cloud project to use:
[1] [...]
[2] [taxis-1273]
Please enter your numeric choice: 2
Your current project has been set to: [taxis-1273].
...
gcloud has now been configured!
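If you ever need to double-check which account and project gcloud is pointed at, the following will print the active configuration.
$ gcloud config list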
104 GB of Data on Google Cloud Storage
There is a Python-based package called gsutil that can be used to create buckets on Google Cloud Storage and upload files. The following will install it via pip.
$ sudo pip install gsutil
I'll then authorise my account with this tool as well. It's the same process as above: you're given a long URL to paste into a browser, you log in with your Google account and you get a verification code to paste back.
$ gsutil config
Please navigate your browser to the following URL:
https://accounts.google.com/o/oauth2/auth?scope=...
Enter the authorization code: ...
Once authorised I can tell gsutil which project I'm working with.
What is your project-id? taxis-1273
With gsutil configured I'll create a bucket to store the 56 gzip files containing 1.1 billion denormalised records in CSV format. If you want to create a copy of this data please see the instructions in my Billion Taxi Rides in Redshift blog post.
$ gsutil mb \
-l EU \
gs://taxi-trips
The bucket will physically live in the EU. Google have a number of other locations to choose from if you want to pick one closer to you.
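If you want to confirm where the bucket ended up, asking gsutil for the bucket's metadata should show a location constraint field in its output.
$ gsutil ls -L -b gs://taxi-trips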
I've kept the gzip files stored in ~/taxi-trips/ on my machine. The following will upload them to the taxi-trips bucket on Google Cloud Storage using parallel uploads. When I ran this command I saw the 30 Mbit/s upload capacity I have available from my ISP was saturated for the duration of the upload.
$ screen
$ gsutil \
-o GSUtil:parallel_composite_upload_threshold=150M \
-m cp \
~/taxi-trips/*.csv.gz \
gs://taxi-trips
During the upload the files are broken up into ~60 MB chunks and stored in temporary folders in the bucket. Once uploaded, the chunks are assembled and the files then appear with their contents intact.
The following is what the upload process looked like in my terminal.
Copying file:///home/mark/taxi-trips/trips_xaa.csv.gz [Content-Type=text/csv]...
Uploading ...util_help_cp/0f8cc0bfa8b4b45ef456ccf23c0cdf45_29: 60.32 MiB/60.32 MiB
Uploading ...util_help_cp/0f8cc0bfa8b4b45ef456ccf23c0cdf45_15: 60.32 MiB/60.32 MiB
...
Uploading ...sutil_help_cp/0f8cc0bfa8b4b45ef456ccf23c0cdf45_4: 60.32 MiB/60.32 MiB
Uploading ...sutil_help_cp/0f8cc0bfa8b4b45ef456ccf23c0cdf45_0: 60.32 MiB/60.32 MiB
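If you want to check that all 56 files arrived intact and see how much space they're taking up, the following two commands should do the trick.
$ gsutil ls -l gs://taxi-trips/*.csv.gz
$ gsutil du -s -h gs://taxi-trips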
BigQuery Up and Running
The following will create a dataset in my project called trips. If you want your data in the EU then keep the data_location flag as it is. If you don't, specify another location or leave the flag out and the data will be stored in the US. If this location differs from that of the Google Cloud Storage bucket you could see the ~25-minute load process take an hour.
$ bq mk \
--data_location=EU \
trips
Dataset 'taxis-1273:trips' successfully created.
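If you want to confirm the dataset was created as expected, bq can describe it.
$ bq show trips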
I'll then create a table within that dataset called taxi_trips.
$ bq mk trips.taxi_trips
Table 'taxis-1273:trips.taxi_trips' successfully created.
I'll define the schema for the taxi_trips table. This will be fed in as a string with the load command. For readability I've formatted the schema declaration here and will strip out the whitespace with a bash command afterwards.
$ read -r -d '' SCHEMA <<- FINISHED
trip_id :integer,
vendor_id :string,
pickup_datetime :timestamp,
dropoff_datetime :timestamp,
store_and_fwd_flag :string,
rate_code_id :integer,
pickup_longitude :float,
pickup_latitude :float,
dropoff_longitude :float,
dropoff_latitude :float,
passenger_count :integer,
trip_distance :float,
fare_amount :float,
extra :float,
mta_tax :float,
tip_amount :float,
tolls_amount :float,
ehail_fee :float,
improvement_surcharge:float,
total_amount :float,
payment_type :string,
trip_type :integer,
pickup :string,
dropoff :string,
cab_type :string,
precipitation :integer,
snow_depth :integer,
snowfall :integer,
max_temperature :integer,
min_temperature :integer,
average_wind_speed :integer,
pickup_nyct2010_gid :integer,
pickup_ctlabel :string,
pickup_borocode :string,
pickup_boroname :string,
pickup_ct2010 :string,
pickup_boroct2010 :string,
pickup_cdeligibil :string,
pickup_ntacode :string,
pickup_ntaname :string,
pickup_puma :string,
dropoff_nyct2010_gid :integer,
dropoff_ctlabel :string,
dropoff_borocode :integer,
dropoff_boroname :string,
dropoff_ct2010 :string,
dropoff_boroct2010 :string,
dropoff_cdeligibil :string,
dropoff_ntacode :string,
dropoff_ntaname :string,
dropoff_puma :string
FINISHED
$ SCHEMA=`echo $SCHEMA | tr -d '[[:space:]]'`
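To sanity-check the transformed string you can break it back out into one field per line; the first entry should read trip_id:integer.
$ echo $SCHEMA | tr ',' '\n' | head -n5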
I'll then call BigQuery to load the data from the taxi-trips bucket into the taxis-1273:trips.taxi_trips table. The following took 24 minutes and 27 seconds to complete.
$ bq load \
--max_bad_records=10000 \
--field_delimiter="," \
--source_format=CSV \
taxis-1273:trips.taxi_trips \
gs://taxi-trips/*.csv.gz \
$SCHEMA
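Once the load job finishes, describing the table should report its schema along with its total row and byte counts; I'd expect to see around 1.1 billion rows.
$ bq show trips.taxi_trips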
Benchmarking BigQuery
I was absolutely blown away by how fast these queries executed. At some point I'll need to launch another Presto cluster on AWS EMR and see how much horsepower will be required there to match these query times.
The following completed in 2 seconds.
bq query "SELECT cab_type,
count(*)
FROM [taxis-1273:trips.taxi_trips]
GROUP BY cab_type;"
The following completed in 2 seconds.
bq query "SELECT passenger_count,
avg(total_amount)
FROM [taxis-1273:trips.taxi_trips]
GROUP BY passenger_count;"
The following completed in 1 second.
bq query "SELECT passenger_count,
year(pickup_datetime),
count(*)
FROM [taxis-1273:trips.taxi_trips]
GROUP BY 1, 2;"
The following completed in 3 seconds.
bq query "SELECT passenger_count,
year(pickup_datetime),
round(trip_distance),
count(*)
FROM [taxis-1273:trips.taxi_trips]
GROUP BY 1, 2, 3
ORDER BY 2, 4 desc;"
Costs of Cloud Storage and BigQuery
I used the pricing calculator Google provides to figure out how much all this is going to cost. If you spot any mistakes hit me up on Twitter or fire me an email.
I'm using 104 GB of Standard Cloud Storage to hold the gzip files, which will cost $0.09 / day. Once that data is loaded into BigQuery I can remove the files from Cloud Storage, so I shouldn't need this for more than a day.
On BigQuery the data is uncompressed and takes up about 500 GB of space, so I'll have to pay $0.42 / day for that.
BigQuery also charges for each query, which should come out to around $0.07 for each one run on this dataset.
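If you want to estimate what a query will cost before running it, bq supports a dry run which reports how many bytes would be processed without actually executing, or billing for, the query. At $5 per TB processed you can work out the cost of any query from the byte count it reports.
$ bq query --dry_run "SELECT cab_type,
                             count(*)
                      FROM [taxis-1273:trips.taxi_trips]
                      GROUP BY cab_type;"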