The whole playbook from this blog post can be seen in this gist.
Ansible is a Python-based tool for automating application deployments and infrastructure setups. It's often compared with Capistrano, Chef, Puppet and Fabric. These comparisons don't always compare apples to apples as each tool has its own distinctive capabilities and philosophy on how best to automate deployments and system setups. The flexibility and conciseness with which tasks can be described also varies widely between tools.
I've had to use every tool mentioned above at one time or another for various clients. What I like most about Ansible is that it keeps playbooks concise without abstracting anything away. Some tools try to hide the differences between apt install and yum install, but I found those abstractions made for a steeper learning curve and made out-of-the-ordinary changes take longer to get working.
Ansible can be installed via pip and just needs an inventory file to begin being useful.
For this post I've tried to keep Ansible's files to a minimum. You can organise playbooks into separate files, set up Travis CI to test them and so on, but for the sake of simplicity I stick to the task of getting a load-balanced, two-node Django cluster set up with as few lines of code as I could.
A cluster of machines
I launched three Ubuntu 14.04 virtual machines on my workstation, configured them with the user mark, identical passwords and sudo access. I then added the three virtual machine IP addresses to my hosts file:
$ grep 192 /etc/hosts
192.168.223.131 web1
192.168.223.132 web2
192.168.223.133 lb
I copied my SSH public key to the ~/.ssh/authorized_keys file on each VM:
$ ssh-copy-id web1
$ ssh-copy-id web2
$ ssh-copy-id lb
I then made sure each host's ECDSA key fingerprint was stored in my ~/.ssh/known_hosts file by connecting to each host:
$ ssh lb uptime
$ ssh web1 uptime
$ ssh web2 uptime
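If you'd rather not connect to each host by hand, ssh-keyscan can collect the host keys in one pass. A quick sketch; note that this blindly trusts whatever keys the hosts present, so only use it on a network you trust:
$ ssh-keyscan lb web1 web2 >> ~/.ssh/known_hosts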
Running Ansible with an inventory
I installed Ansible:
$ pip install ansible==1.7.2
And created an inventory file:
$ cat inventory
[load_balancers]
lb ansible_ssh_host=lb
[app_servers]
web1 ansible_ssh_host=web1
web2 ansible_ssh_host=web2
The first token is the host alias used by Ansible; the second tells Ansible how to connect. I had already added web1, web2 and lb to my /etc/hosts file so I simply referred to those hostnames.
With an inventory file in place it's possible to test that Ansible can connect and communicate with each node in the cluster:
$ ansible all -i inventory -a uptime
lb | success | rc=0 >>
09:07:18 up 7:18, 2 users, load average: 0.05, 0.03, 0.05
web2 | success | rc=0 >>
09:07:18 up 7:19, 2 users, load average: 0.01, 0.02, 0.05
web1 | success | rc=0 >>
09:07:19 up 7:23, 2 users, load average: 0.00, 0.01, 0.05
A cluster of configuration files
Ansible will need to deploy configuration files, keys and certificates to our cluster. Below is the file and folder layout of this project:
$ tree
.
├── files
│   ├── nginx-app.conf
│   ├── nginx-load-balancer.conf
│   ├── nginx.crt
│   ├── nginx.key
│   ├── ntp.conf
│   ├── supervisord.conf
│   └── venv_activate.sh
├── inventory
└── playbook.yml
SSL-terminating load balancer
To create files/nginx.crt and files/nginx.key I generated a self-signed SSL certificate using openssl. The certificate won't be of much use to HTTPS clients that verify certificates against trusted authorities, but it's enough to demonstrate SSL termination by the load balancer in a local environment.
$ openssl req -x509 -nodes -days 365 \
-newkey rsa:2048 \
-keyout files/nginx.key \
-out files/nginx.crt
...
Common Name (e.g. server FQDN or YOUR name) []:localhost
...
There are two distinctive nginx configurations in this setup: the first is for the load balancer and the second is for the app servers.
I chose nginx for the load balancer as it supports SSL and caching, and it handles app servers restarting more gracefully than HAProxy.
$ cat files/nginx-load-balancer.conf
upstream app_servers {
    {% for host in groups['app_servers'] %}
    server {{ hostvars[host]['ansible_eth0']['ipv4']['address'] }} fail_timeout=5s;
    {% endfor %}
}

server {
    listen 80;
    server_name localhost;
    return 301 https://$server_name$request_uri;
}

server {
    listen 443;
    server_name localhost;

    ssl on;
    ssl_certificate /etc/nginx/ssl/nginx.crt;
    ssl_certificate_key /etc/nginx/ssl/nginx.key;

    location / {
        proxy_pass http://app_servers;
    }
}
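For reference, assuming each app server's eth0 address matches its entry in /etc/hosts above, the upstream block would render to something like this:
upstream app_servers {
    server 192.168.223.131 fail_timeout=5s;
    server 192.168.223.132 fail_timeout=5s;
}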
The nginx config for the app servers simply proxies requests to gunicorn:
$ cat files/nginx-app.conf
server {
    listen 80;
    server_name localhost;

    location / {
        proxy_pass http://127.0.0.1:8000;
    }
}
Keeping clocks in sync
I wanted to keep the clocks on each node in the cluster in sync so I created an NTP configuration file. I chose the European pool but there are pools all around the world listed on the NTP Pool Project site.
$ cat files/ntp.conf
server 0.europe.pool.ntp.org
server 1.europe.pool.ntp.org
server 2.europe.pool.ntp.org
server 3.europe.pool.ntp.org
If you wanted to use the North American pool you could replace the contents with the following:
server 0.north-america.pool.ntp.org
server 1.north-america.pool.ntp.org
server 2.north-america.pool.ntp.org
server 3.north-america.pool.ntp.org
Managing gunicorn and celeryd
I'll use supervisor to run the Django app server and the celery task queue. The files below contain a number of template variables; Ansible will fill them in when it deploys the files.
$ cat files/supervisord.conf
[program:web_app]
autorestart=true
autostart=true
command={{ home_folder }}/.virtualenvs/{{ venv }}/exec gunicorn faulty.wsgi:application -b 127.0.0.1:8000
directory={{ home_folder }}/faulty
redirect_stderr=True
stdout_logfile={{ home_folder }}/faulty/supervisor.log
user=mark
[program:celeryd]
autorestart=true
autostart=true
command={{ home_folder }}/.virtualenvs/{{ venv }}/exec python manage.py celeryd
directory={{ home_folder }}/faulty
redirect_stderr=True
stdout_logfile={{ home_folder }}/faulty/supervisor.log
user=mark
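With home_folder set to /home/mark and venv set to faulty (the values defined later in the playbook), the web_app command line would render to roughly the following:
command=/home/mark/.virtualenvs/faulty/exec gunicorn faulty.wsgi:application -b 127.0.0.1:8000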
I also need a bash file that will activate the virtualenv used by Django:
$ cat files/venv_activate.sh
#!/bin/bash
source {{ home_folder }}/.virtualenvs/{{ venv }}/bin/activate
$@
When venv_activate.sh is uploaded to each web server it'll be installed as /home/mark/.virtualenvs/faulty/exec.
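Any management command can then be run through that wrapper from the project directory, for example (assuming the paths above):
$ cd /home/mark/faulty
$ /home/mark/.virtualenvs/faulty/exec python manage.py migrate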
Writing the playbook
I wanted to keep the number of Ansible files I was working with as low as I thought sensible for this blog post. Ansible is very flexible in terms of file organisation so a more broken-up and organised approach is possible, but for this example I'll just use one playbook.
There's much more hardening that could have been done with these instances. For the sake of conciseness I kept the number of tasks down.
I've broken playbook.yml into sections and will walk through each one, explaining what actions are taking place and the reasoning behind them.
To see the whole playbook.yml file please see this gist.
SSH tightening
The first tasks disable root's ssh account and remove support for password-based authentication. I've already stored my public ssh key in each server's authorized_keys file so there's no need to log in with a password. Once these configuration changes are made the ssh server is restarted:
---
- name: SSH tightening
  hosts: all
  sudo: True
  tasks:
    - name: Disable root's ssh account
      action: >
        lineinfile
        dest=/etc/ssh/sshd_config
        regexp="^PermitRootLogin"
        line="PermitRootLogin no"
        state=present
      notify: Restart ssh

    - name: Disable password authentication
      action: >
        lineinfile
        dest=/etc/ssh/sshd_config
        regexp="^PasswordAuthentication"
        line="PasswordAuthentication no"
        state=present
      notify: Restart ssh

  handlers:
    - name: Restart ssh
      action: service name=ssh state=restarted
The three dashes at the top of the file are a YAML convention marking the start of a document.
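Once this play has run you can sanity-check it by attempting a login without your key; with password authentication disabled the connection should be refused (assuming your public key is the only credential installed on the server):
$ ssh -o PubkeyAuthentication=no web1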
Package cache
Every system needs APT's package cache updated. You can append update_cache=yes to individual package installations, but I found it was needed for every installation so I perform the update once per machine rather than once per package.
- name: Update APT package cache
  hosts: all
  gather_facts: False
  sudo: True
  tasks:
    - name: Update APT package cache
      action: apt update_cache=yes
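If you'd rather bundle the refresh with a specific installation instead, the apt module's cache_valid_time parameter updates the cache only when it's older than the given number of seconds. A sketch of what such a task could look like:
- name: Install ntp, refreshing the cache if it's over an hour old
  apt: name=ntp update_cache=yes cache_valid_time=3600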
Syncing to UTC
I then set the time zone on each machine to UTC and set up NTP to synchronise their clocks with the European NTP pool. Note that when dealing with dpkg-reconfigure you should pass --frontend noninteractive, otherwise Ansible will hang while dpkg-reconfigure waits for input that Ansible isn't configured to provide.
- name: Set timezone to UTC
  hosts: all
  gather_facts: False
  sudo: True
  tasks:
    - name: Set timezone variables
      copy: >
        content='Etc/UTC'
        dest=/etc/timezone
        owner=root
        group=root
        mode=0644
        backup=yes
      notify:
        - Update timezone
  handlers:
    - name: Update timezone
      command: >
        dpkg-reconfigure
        --frontend noninteractive
        tzdata

- name: Synchronise clocks
  hosts: all
  sudo: True
  tasks:
    - name: install ntp
      apt: name=ntp
    - name: copy ntp config
      copy: src=files/ntp.conf dest=/etc/ntp.conf
    - name: restart ntp
      service: name=ntp state=restarted
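A quick way to confirm the time zones and clocks agree across the cluster is to run date on every host:
$ ansible all -i inventory -a date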
I also made sure each machine has unattended upgrades installed.
- name: Setup unattended upgrades
  hosts: all
  gather_facts: False
  sudo: True
  tasks:
    - name: Install unattended upgrades package
      apt: name=unattended-upgrades
      notify:
        - dpkg reconfigure
  handlers:
    - name: dpkg reconfigure
      command: >
        dpkg-reconfigure
        --frontend noninteractive
        -plow unattended-upgrades
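dpkg-reconfigure -plow unattended-upgrades writes its settings to /etc/apt/apt.conf.d/20auto-upgrades, so one way to verify the handler ran is to read that file back (it's normally world-readable so sudo isn't needed):
$ ansible all -i inventory -a "cat /etc/apt/apt.conf.d/20auto-upgrades"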
Setting up the Django app servers
The Django app servers have the largest number of tasks. Here is a summarised list of what's being performed:
- Set up the Uncomplicated Firewall (ufw) to block all incoming traffic except TCP ports 22 and 80; HTTP traffic should only come from the load balancer and ssh logins are rate-limited to slow down brute-force attacks.
- Install Python's virtualenv and development libraries.
- Install git and checkout a public repo of a Django project hosted on Bitbucket.
- Copy over the virtual environment activation script which is used by every Django command.
- Install a virtual environment and the Django project's requirements.
- Set up and migrate Django's database. The database doesn't need to be shared between web servers for this application so it's just a local SQLite3 file on each individual server.
- Install supervisor and have it run the gunicorn app server and celery task runner.
- Launch an nginx reverse proxy which acts as a buffer between gunicorn and the load balancer.
- name: Setup App Server(s)
  hosts: app_servers
  sudo: True
  vars:
    home_folder: /home/mark
    venv: faulty
  tasks:
    - ufw: state=enabled logging=on
    - ufw: direction=incoming policy=deny
    - ufw: rule=limit port=ssh proto=tcp
    - ufw: rule=allow port=22 proto=tcp
    - ufw: >
        rule=allow
        port=80
        proto=tcp
        from_ip={{ hostvars['lb']['ansible_default_ipv4']['address'] }}

    - name: Install python virtualenv
      apt: name=python-virtualenv
    - name: Install python dev
      apt: name=python-dev
    - name: Install git
      apt: name=git

    - name: Checkout Django code
      git: >
        repo=https://bitbucket.org/marklit/faulty.git
        dest={{ home_folder }}/faulty
        update=no
    - file: >
        path={{ home_folder }}/faulty
        owner=mark
        group=mark
        mode=755
        state=directory
        recurse=yes

    - name: Install Python requirements
      pip: >
        requirements={{ home_folder }}/faulty/requirements.txt
        virtualenv={{ home_folder }}/.virtualenvs/{{ venv }}
    - template: >
        src=files/venv_activate.sh
        dest={{ home_folder }}/.virtualenvs/{{ venv }}/exec
        mode=755

    - command: >
        {{ home_folder }}/.virtualenvs/{{ venv }}/exec
        python manage.py syncdb --noinput
      args:
        chdir: '{{ home_folder }}/faulty'
    - command: >
        {{ home_folder }}/.virtualenvs/{{ venv }}/exec
        python manage.py migrate
      args:
        chdir: '{{ home_folder }}/faulty'

    - name: Install supervisor
      apt: name=supervisor
    - template: >
        src=files/supervisord.conf
        dest=/etc/supervisor/conf.d/django_app.conf
    - command: /usr/bin/supervisorctl reload
    - supervisorctl: name=web_app state=restarted
    - supervisorctl: name=celeryd state=restarted

    - name: Install nginx
      apt: name=nginx
    - name: copy nginx config file
      template: >
        src=files/nginx-app.conf
        dest=/etc/nginx/sites-available/default
    - name: enable configuration
      file: >
        dest=/etc/nginx/sites-enabled/default
        src=/etc/nginx/sites-available/default
        state=link
    - service: name=nginx state=restarted
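After this play finishes it's worth confirming that supervisor actually brought both programs up. A quick ad-hoc check (sudo is needed to talk to supervisor's socket):
$ ansible app_servers -i inventory --sudo --ask-sudo-pass -a "supervisorctl status"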
The load balancer
The load balancer has a simpler task list:
- Block all incoming traffic except TCP ports 22, 80 and 443; rate-limit ssh.
- Install nginx and copy in the self-signed certificate and key.
- Copy in the load balancer configuration and launch nginx.
- name: Setup Load balancer(s)
  hosts: load_balancers
  sudo: True
  tasks:
    - ufw: state=enabled logging=on
    - ufw: direction=incoming policy=deny
    - ufw: rule=limit port=ssh proto=tcp
    - ufw: rule=allow port=22 proto=tcp
    - ufw: rule=allow port=80 proto=tcp
    - ufw: rule=allow port=443 proto=tcp

    - apt: name=nginx
    - name: copy nginx config file
      template: >
        src=files/nginx-load-balancer.conf
        dest=/etc/nginx/sites-available/default
    - copy: src=files/nginx.key dest=/etc/nginx/ssl/
    - copy: src=files/nginx.crt dest=/etc/nginx/ssl/
    - name: enable configuration
      file: >
        dest=/etc/nginx/sites-enabled/default
        src=/etc/nginx/sites-available/default
        state=link
    - service: name=nginx state=restarted
Running the playbook
I used the following command to run the playbook and setup the cluster:
$ ansible-playbook -i inventory --ask-sudo-pass playbook.yml
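If you later want to re-run only part of the setup, ansible-playbook's --limit flag restricts the run to a single group or host from the inventory, for example:
$ ansible-playbook -i inventory --ask-sudo-pass --limit app_servers playbook.yml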
I then tested that I could communicate with the app servers via the load balancer. If --insecure isn't passed to curl the request won't complete because curl refuses to trust self-signed SSL certificates by default:
$ curl --insecure https://lb
k2b71#v!l0_sf7y$0)x(=cw2u_^q05etbf9ediptp(#0m+&=^0
81jy$7n=!3ay%p3o%$e!iv8hknbuyl64*o-sue1xcgygp^owlb
fne-$j$^qyv*^me3r5kx=p^#*+y!t)gq!^a)9_dhs4afcx2x!2
7s5@po!&)zo#ca=16-o0gmv!440%1$q2xgne+uerpp7@*bt*l8
m!y*$2o)8r(tmf!b(*72$knb$&(gt1jspn&h4tu^s#9-3(+x&b
s#(vta0x68#4ihpw1sds06=fjcj9!am8c4c32zy95_0=%==$s(
-j(3pnb^4x)##(^@n)&)fe3#zl2mb&(s1qj5#)9%+ng6%sj%7n
c02$ahq#t$t)1s12-nj!yolz+v687zpefug_o7!+w7055gt5g$
7j8v%$)o50ch(-^#q3^7(dtgl3lvg2orirk$e54l&k89jxj#-1
g@^_eanx#*@4&8kg!xi(va^_@@4xyjz7h497$iw*1=^sb797il
88hmb=+c9+^#2r3x$e7nl)nlf8rb^