
Posted on Sat 01 November 2014 under Python

Python's killer apps for blogging: Pelican and S3cmd

I came across a thread on Hacker News this morning where a blogger wanted to move their site away from Digital Ocean and was looking into alternatives. justinsb commented in the thread:

"If its static, then run it from S3. It'll be cheaper, and there are no servers to secure."

I've been blogging since 2001 and I've used a number of different platforms and hosting options. For the past few years I've been using a small droplet (1 GB of memory, 1 CPU, 30 GB of SSD storage and 2 TB of bandwidth a month) on Digital Ocean that has always been more than enough for all the different sites I host on it. The droplet costs $10 a month, which is cheap enough that I never felt motivated to try and lower that cost.

But, without using something like Cloudflare's CDN service, every request to this blog in particular has to travel all the way from each reader's location to the data centre in Amsterdam where this blog is hosted.

If the site's contents were uploaded to Amazon S3 then I could use Amazon CloudFront to globally distribute this site to 50+ servers around the world. Readers would then automagically connect to their closest server and, in theory, download the content quicker.

Hosting costs (tl;dr: $1.28 / month)

I'm going to use the site structure and traffic profile of an imaginary blog for my calculations:

  1. 1MB of content (HTML, CSS and JPEGs) served from the site itself.
  2. 1,000 daily visitors.
  3. 75% of those visitors are from the contiguous United States.
  4. The second-biggest country of origin after the US accounts for only 10% of the US readership (basically, the rest of the readership is widely distributed around the world).

With those numbers I ran a pricing calculator for Amazon S3.

1,000 people a day is 30,000 a month. With 1.5 pages per person and 150 KB downloaded by each reader, that's 153,600 bytes * 30,000 = ~4.3 GB a month in downloaded content.
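That back-of-the-envelope sum in Python, using the assumed figures from the list above:

bytes_per_reader = 150 * 1024          # 150 KB of HTML, CSS and JPEGs per visit
visits_per_month = 1000 * 30           # 1,000 daily visitors

total_bytes = bytes_per_reader * visits_per_month
print(total_bytes / float(1024 ** 3))  # => ~4.29, i.e. ~4.3 GB a month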

With an over-estimated 0.005 GB of storage, 200 PUT/COPY/POST/LIST requests, 120,000 GET and other requests, 4.3 GB of data transfer out and 1 GB of data transfer in, the bill (assuming I wasn't using the free tier) would be $0.05 per month on Amazon's US East / US Standard (Virginia) servers.

But that would only host content from servers in Virginia. In order to serve content closer to readers' various locations I would need to use CloudFront on top of S3. As of this writing Amazon says they have 20 serving locations in the US, 15 in Europe, 11 in Asia, 2 in Australia and 2 in South America.

Pricing is based on the number of terabytes transferred and the first cut-off point is at 10 TB.

The world has been carved up into seven regions and each has its own distinct pricing:

First 10 TB / month:

  • $0.120 United States
  • $0.120 Europe
  • $0.190 Hong Kong, Philippines, S. Korea, Singapore & Taiwan
  • $0.190 Japan
  • $0.250 South America
  • $0.190 Australia
  • $0.170 India

So the moment someone in Australia requests a page and is served by a server in Australia, you'll be billed $0.190 for that month. With 1,000 readers a day it's likely you'll serve content from all of Amazon's regions, so the minimum each month is likely to be $1.23.
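Adding the $0.05 S3 estimate from above to that CloudFront floor gives the tl;dr figure:

# Minimum monthly CloudFront bill if every region serves at least one page.
regional_rates = [0.120, 0.120, 0.190, 0.190, 0.250, 0.190, 0.170]
cloudfront_floor = sum(regional_rates)  # => 1.23
print(cloudfront_floor + 0.05)          # => ~1.28 / month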

Setting up S3 and CloudFront

To start, I created an S3 bucket called static-blog-test in Amazon's AWS console, enabled 'Static Website Hosting' and set the Index Document to 'index.html'.
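For anyone who would rather script those steps, here is a minimal sketch using boto, the AWS SDK for Python (an assumption on my part; I did this in the console):

import boto

# Credentials are read from ~/.boto or the environment.
conn = boto.connect_s3()
bucket = conn.create_bucket('static-blog-test')

# Serve index.html when a directory-style URL is requested.
bucket.configure_website(suffix='index.html')

# Make the bucket itself publicly readable; the individual objects are made
# public later on by s3cmd's --acl-public flag.
bucket.set_acl('public-read')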

I then created a CloudFront distribution, set the origin domain name to static-blog-test.s3.amazonaws.com, chose the "Use All Edge Locations (Best Performance)" price class, added static-blog-test.marksblogg.com to the alternate domain names and set the default root object to 'index.html'.
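The distribution itself could also be created with boto. This is a rough sketch of the console steps above, not what I actually ran; the price class and default root object would still need setting separately:

import boto
from boto.cloudfront.origin import S3Origin

conn = boto.connect_cloudfront()
origin = S3Origin('static-blog-test.s3.amazonaws.com')

distribution = conn.create_distribution(
    origin=origin,
    enabled=True,
    comment='static-blog-test',
    cnames=['static-blog-test.marksblogg.com'])

# This is the hostname the CNAME record should point at.
print(distribution.domain_name)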

CloudFront then assigned d3bj15xywd5c9c.cloudfront.net to the distribution. I added a CNAME record to my marksblogg.com zone file pointing static-blog-test at d3bj15xywd5c9c.cloudfront.net and made sure the record resolved from where I am:

$ dig -t CNAME static-blog-test.marksblogg.com | grep -A1 'ANSWER SECTION'
;; ANSWER SECTION:
static-blog-test.marksblogg.com. 5 IN   CNAME   d3bj15xywd5c9c.cloudfront.net.
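The same lookup can be scripted with dnspython (pip install dnspython), an alternative to dig I'm suggesting here rather than something used elsewhere in this post:

import dns.resolver

for record in dns.resolver.query('static-blog-test.marksblogg.com', 'CNAME'):
    print(record.target)  # => d3bj15xywd5c9c.cloudfront.net.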

Just as a note, if you edit anything in the CloudFront console, go back to the CloudFront distributions list page and check whether your distribution's status is "In Progress" or "Deployed". I found deployments can take a few minutes, so make sure they're done before you start questioning why something isn't working.

Creating content

I tend to use Pelican and S3cmd for most static blogs I make. Both projects are written in Python and are actively maintained.

First, I'll install Pelican and generate a small blog:

$ pip install pelican
$ pelican-quickstart
Welcome to pelican-quickstart v3.4.0.

This script will help you create a new Pelican-based website.

Please answer the following questions so this script can generate the files
needed by Pelican.


> Where do you want to create your new web site? [.] .
> What will be the title of this web site? Static Blog
> Who will be the author of this web site? Mark Litwintschik
> What will be the default language of this web site? [en]
> Do you want to specify a URL prefix? e.g., http://example.com   (Y/n) n
> Do you want to enable article pagination? (Y/n) n
> Do you want to generate a Fabfile/Makefile to automate generation and publishing? (Y/n) Y
> Do you want an auto-reload & simpleHTTP script to assist with theme and site development? (Y/n) Y
> Do you want to upload your website using FTP? (y/N) N
> Do you want to upload your website using SSH? (y/N) N
> Do you want to upload your website using Dropbox? (y/N) N
> Do you want to upload your website using S3? (y/N) y
> What is the name of your S3 bucket? [my_s3_bucket] static-blog-test
> Do you want to upload your website using Rackspace Cloud Files? (y/N) N
> Do you want to upload your website using GitHub Pages? (y/N) N
Done. Your new project is available at /home/mark/static_blog

When I filled in the questionnaire above I forgot to add a URL prefix, so I had to add it afterwards. Here are the two files the setting lives in and the value I used:

$ grep SITEURL {pelicanconf,publishconf}.py
pelicanconf.py:SITEURL = 'http://static-blog-test.marksblogg.com'
publishconf.py:SITEURL = 'http://static-blog-test.marksblogg.com'

I then created a small blog post and saved it to content/hello-world.rst:

$ cat content/hello-world.rst
:title: Hello, World.
:date: 2014-11-01 12:19
:slug: hello-world
:summary: This article is greeting the world.

Hello, World.

Deploying to Amazon S3

S3cmd has been available via apt install s3cmd for years now, but I was worried that the packaged version might be old. I assumed that installing via pip install s3cmd would get me a newer version. I assumed wrong:

$ apt-cache show s3cmd
Package: s3cmd
...
Version: 1.1.0~beta3-2
...
$ pip freeze | grep s3cmd
s3cmd==1.0.1

It turns out 1.1.0~beta3-2 was released in January 2012 while 1.0.1 was released in June 2011.

I didn't want to run a beta version, and the stable version 1.0.1 hasn't caused me any issues in the past, so I stuck with installing S3cmd via pip.

It would be worth seeing what has changed since January 2012, as the contribution graph on GitHub shows the project has been pretty active since then.

With S3cmd installed I configured it:

$ s3cmd --configure

Enter new values or accept defaults in brackets with Enter.
Refer to user manual for detailed description of all options.

Access key and Secret key are your identifiers for Amazon S3
Access Key:
...
Use HTTPS protocol [No]: Yes

New settings:
  ...

Test access with supplied credentials? [Y/n] Y
Please wait...
Success. Your access key and secret key worked fine :-)

Now verifying that encryption works...
Not configured. Never mind.

Save settings? [y/N] y
Configuration saved to '/home/mark/.s3cfg'

Generate and publish

Other blog posts suggested that only one command is needed to both generate the blog and upload its files to Amazon S3, but when I ran that command it generated the blog content without uploading anything:

$ make s3_upload
pelican /home/mark/static_blog/content -o /home/mark/static_blog/output -s /home/mark/static_blog/publishconf.py
Done: Processed 1 article(s), 0 draft(s) and 0 page(s) in 0.07 seconds.

$ s3cmd ls s3://static-blog-test
$

So I manually ran S3cmd and synced the files with my S3 bucket:

$ s3cmd sync output/ s3://static-blog-test \
  --acl-public --delete-removed --guess-mime-type
output/archives.html -> s3://static-blog-test/archives.html  [1 of 37]
 2718 of 2718   100% in    0s     3.16 kB/s  done
...
Done. Uploaded 62069 bytes in 31.4 seconds, 1974.73 B/s

I could then see my content in my bucket:

$ s3cmd ls s3://static-blog-test
                       DIR   s3://static-blog-test/author/
                       DIR   s3://static-blog-test/category/
                       DIR   s3://static-blog-test/feeds/
                       DIR   s3://static-blog-test/theme/
2014-11-01 10:33      2718   s3://static-blog-test/archives.html
2014-11-01 10:33      2730   s3://static-blog-test/authors.html
2014-11-01 10:33      2595   s3://static-blog-test/categories.html
2014-11-01 10:33      3385   s3://static-blog-test/hello-world.html
2014-11-01 10:33      3278   s3://static-blog-test/index.html
2014-11-01 10:33      2603   s3://static-blog-test/tags.html

Make sure the content serves reliably

It's all well and good to see the content serve reliably once or twice, but it's better to simulate traffic at busier times. 1,000 readers a day averages out to around 42 an hour, so it's fair to assume you could see around 40 near-simultaneous visitors at the busiest of times.

I sanity checked that the first blog post was serving:

$ curl --silent http://static-blog-test.marksblogg.com/hello-world.html | head -n5
<!DOCTYPE html>
<html lang="en">
<head>
        <meta charset="utf-8" />
        <title>Hello, World.</title>

Then I installed ab, Apache's HTTP benchmarking tool, to see if anything other than HTTP 200 responses would occur during a small traffic simulation.

$ sudo apt install \
    apache2-utils

When I ran the simulation I saw 1,000 requests completed in 1.764 seconds with 0 failures.

$ ab -n 1000 -c 40 http://static-blog-test.marksblogg.com/hello-world.html
...
Benchmarking static-blog-test.marksblogg.com (be patient)
...
Finished 1000 requests
...
Concurrency Level:      40
Time taken for tests:   1.764 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      3943140 bytes
HTML transferred:       3385000 bytes
Requests per second:    566.84 [#/sec] (mean)
Time per request:       70.566 [ms] (mean)
Time per request:       1.764 [ms] (mean, across all concurrent requests)
Transfer rate:          2182.76 [Kbytes/sec] received
...

Versioning and rollbacks

One thing missing from this exercise is rollback functionality. If there were a case where I needed to roll back, I would have to have stored each deployment as its own commit in git, check out a known good deployment commit and redeploy before investigating what had caused the previous deployment to fail.

That is only one way of handling versioning and rollbacks, and I'm sure there are more elaborate and flexible ways of accomplishing these tasks; one such alternative is sketched below.
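For example, S3 has its own object versioning, which could be enabled with boto so that every overwrite or delete keeps the previous copies around (a sketch of an approach I haven't used here, not a recommendation):

import boto

conn = boto.connect_s3()
bucket = conn.get_bucket('static-blog-test')

# Every subsequent overwrite or delete keeps the old object versions.
bucket.configure_versioning(True)

# Listing the versions shows what a rollback could restore.
for version in bucket.list_versions():
    print('%s %s' % (version.name, version.version_id))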

Why isn't this blog on S3?

This blog is hosted on a small droplet on Digital Ocean that has never been fully utilised. I use the droplet for more than just hosting static content: it also runs Django-based sites, periodic cron jobs, SSH tunnelling and so on. I like the idea of not having a server to maintain, but my server needs to be maintained for those other projects anyway, so I wouldn't be lowering my workload by much.

Google Analytics' site speed measurements tell me most pages on this blog load, on average, in a second or less around the world.

If I were only hosting static content on my Digital Ocean droplet it would make sense to move: one month on Digital Ocean's cheapest $5 droplet would buy almost four months of Amazon S3 and CloudFront hosting.

Thank you for taking the time to read this post. I offer both consulting and hands-on development services to clients in North America and Europe. If you'd like to discuss how my offerings can help your business please contact me via LinkedIn.
