Django exception archaeology

Almost every single customer I've worked with has had a different way of capturing and analysing activity in their production systems. Researching and comparing offerings feels like an endless task and a source of endless debate. Regardless of the large number of choices, being able to centralise and analyse application server exceptions is an important step towards improving the reliability of code in production.

I wanted to see how Logstash using Elasticsearch as a data store and Kibana as an analysis tool could be used to monitor a Django application.

To start, I created a small Django application that served random data, raised exceptions occasionally via a catch-all view and had a celery task which would randomly pick a log level to record to every time a new task was scheduled. With all these combinations it should make for a good set of example data to dig through and analyse.

One Django project, many response types

I installed Django, django-celery, SQLAlchemy (needed to record celery's results to an SQLite file), gunicorn and python-logstash. I then created a project called faulty and an app within it called example.

$ pip install Django==1.7.1 \
              django-celery==3.1.16 \
              gunicorn==19.1.1 \
              SQLAlchemy==0.9.8 \
              python-logstash==0.4.2
$ django-admin startproject faulty
$ cd faulty
$ django-admin startapp example

I made the following changes to faulty/settings.py:

import djcelery

INSTALLED_APPS = (
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'kombu.transport.django',
    'djcelery',
    'example',
)

BROKER_URL = 'django://'
CELERY_RESULT_BACKEND = 'db+sqlite:///results.db'
CELERY_SEND_EVENTS = True
CELERY_TASK_RESULT_EXPIRES = 10
CELERY_TIMEZONE = 'UTC'
CELERYBEAT_SCHEDULER = "djcelery.schedulers.DatabaseScheduler"

djcelery.setup_loader()

LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        'logstash': {
            'level': 'WARNING',
            'class': 'logstash.LogstashHandler',
            'host': 'localhost',
            'port': 5000,
            'version': 1,
        },
    },
    'loggers': {
        'django.request': {
            'handlers': ['logstash'],
            'level': 'WARNING',
            'propagate': True,
        },
        'example.tasks': {
            'handlers': ['logstash'],
            'level': 'WARNING',
        },
    },
}

I then created a task that would accept a randomly-generated content string, its reported length and content type. It would create a sha1 hash of the content and output that hash, the reported content length and the content type to a random log level.

The content length would be the length before the content was wrapped. I could have derived the length in the task itself or hashed the content in the view but the purpose of this exercise is to create noise.

example/tasks.py:

import hashlib
import logging
from random import choice

from celery import task
from celery.utils.log import get_task_logger


logger = get_task_logger(__name__)


@task()
def log_response(content, content_length, content_type):
    log_method = choice((logger.info,
                         logger.warning,
                         logger.error,
                         logger.critical))

    hasher = hashlib.sha1()
    hasher.update(content)
    log_method('{0} {1} {2}'.format(hasher.hexdigest(),
                                    content_length,
                                    content_type))
    return True

I configured the project to direct all traffic to a single endpoint.

faulty/urls.py:

from django.conf.urls import patterns, url


urlpatterns = patterns('',
    url(r'^$', 'example.views.endpoint'),
)

I then created a view that has a 25% chance of raising an exception and if it doesn't, generates random content, sends that content to the celery task and serves the content.

example/views.py:

import mimetypes
from random import choice, randint
from textwrap import wrap

from django.http import HttpResponse

from .tasks import log_response


def endpoint(request):
    if randint(0, 3) == 3: # 25% chance of raising an error
        raise choice((AssertionError, IndexError, KeyError))

    # Pick a random content type:
    content_type = choice(mimetypes.types_map.values())
    content_length = randint(512, 2048)

    # Create a random string between 512 and 2048 characters long
    # and line wrap it every 50 characters
    choices = 'abcdefghijklmnopqrstuvwxyz0123456789!@#$%^&*(-_=+)'
    content = ''.join([choice(choices)
                       for _ in range(content_length)])
    content = '\n'.join(wrap(content, width=50))

    # Send the content, its length and type to this task
    log_response.delay(content, content_length, content_type)

    return HttpResponse(content, content_type=content_type)

Running the app server and task handler

I then setup the database (as celery needed it to keep track of its tasks) and launched celeryd and gunicorn.

$ python manage.py syncdb
$ python manage.py migrate
$ python manage.py celeryd &
$ gunicorn faulty.wsgi:application -b 127.0.0.1:8050 &

I tested the endpoint a few times and never did the responses look alike. Here is one response I got back:

$ curl -v localhost:8050
...
< HTTP/1.1 200 OK
< Server: gunicorn/19.1.1
< Date: Mon, 03 Nov 2014 09:53:34 GMT
< Connection: close
< Transfer-Encoding: chunked
< Vary: Cookie
< X-Frame-Options: SAMEORIGIN
< Content-Type: image/jpeg
<
9h*mybxd!*+bf4%#hir+3w)kmp)%_vc%idx6qvz@7678=pkf-p
4h%(d9=ip7snmb&$s0@wrpj2*s_uif&saam=_-%u50n0_4*vh#
i9+n_c@!c1&zi__9yxx$4#tfy0a1-mv+6oyl@-1(h4_fk^jm-o
!7^ak(ih)q=q@@n0_0!$zy^a==-k99%%q8^20q$k4(_3qvt30e
k=l(i4_2sxq^zt=5mncp!saue0b@&$do-04%wrx4(#kn2pwzl+
%@d50ib#rlz2av*wxjg47i&m7da2ff^9g74-w$fd3_^34bzz%*
945$woykwd4wot4*1l+lcca2+5%riudbxyy0d=jm-pkp123v@v
xibyy0rc00!cwggfoa2k$q-qd)e%-f&gbn2042@1(0bomv7%j4
jk7llo7x6!oj($l-+mcncp$4k^uw-ldi=-$zyxq*0+9@y2)-62
3^=nywszj3tlb@!1nk659r1jn_s6zk(sllfd9+3iqfqr(^-8da
*ysw9s6huzt^n1r_0eg=ap&mzgd#r(=0*5a^39lh!(%v0f-fv+
@pxhpokbt(zum3s2i4*(@n_qszcfl+8kmi$d#ixi(&!1l#f*@3
c8o8lc)f@ts+mde4m))h9%lqaqqafh7(3@#=wli_veot7*1lwr
kr#u!#pu7t_d34l^flqt$brxkh+4hd9e58na0)rz!8+0=+w9pr
#ca0=dkk9na_9_hbammn==7hi3aflyr^$_1e$s+%!hrb7^6j*&
d5$wish65s1b3!q*^*f7bhuxfi!=tv^=$-#p5(m^-fzc_zpq@1
8anr)2av*+pb$14lfz1_f-^n34n%3%9*y8^++ijun_e@zm=&_0
@5^rpf$c&3&zsofrqjeuz2i(b9k=6j_c%q##jc#j+7=&iwa5t=
aise&^owl$iho0k(*33r-l-=s+3kdd2x)ee^$+8&qk$nsk&h&h
i2zxl7&s^9zpbd+r3qp)d%ste9=)(hn5**&kvi^(w(d^0bklrk
4#_&*

And another response, this time returning HTTP 404:

$ curl -v localhost:8050
...
< HTTP/1.1 404 NOT FOUND
< Server: gunicorn/19.1.1
< Date: Mon, 03 Nov 2014 09:59:28 GMT
< Connection: close
< Transfer-Encoding: chunked
< Vary: Cookie
< X-Frame-Options: SAMEORIGIN
< Content-Type: text/html
<

<!DOCTYPE html>
<html lang="en">
<head>
  <meta http-equiv="content-type" content="text/html; charset=utf-8">
  <title>Page not found at /</title>
...

Exception archaeology

In this setup, Logstash will transport exceptions originating from Django, store them in Elasticsearch and Kibana will be used to view and analyse the exceptions.

These are some condensed installation commands I ran on my Ubuntu 14 system:

Install the Java runtime:

$ sudo apt install \
    default-jre

Install and start Elasticsearch:

$ wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.3.4.deb
$ sudo dpkg -i elasticsearch-1.3.4.deb
$ sudo /etc/init.d/elasticsearch start

Install Logstash:

$ wget https://download.elasticsearch.org/logstash/logstash/packages/debian/logstash_1.4.2-1-2c0f5a1_all.deb
$ sudo dpkg -i logstash_1.4.2-1-2c0f5a1_all.deb

Install Kibana and launch a web server serving its static content on port 8000:

$ wget https://download.elasticsearch.org/kibana/kibana/kibana-3.1.1.tar.gz
$ tar xzf kibana-3.1.1.tar.gz
$ cd kibana-3.1.1/
$ python -m SimpleHTTPServer &

Capturing exceptions

With all our required packages installed and the web application and task handler already running, I created the following configuration for Logstash. It collects messages from syslog and tcp:5000 (used by Django for reporting exceptions) and sends them both to Elasticsearch and prints them to stdout.

$ cat logstash.conf
input {
  tcp { port => 5000 codec => json }
  udp { port => 5000 type => syslog }
}

output {
  elasticsearch_http {
    host => "127.0.0.1"
    port => 9200
  }
  stdout { codec => rubydebug }
}

I used this command to launch Logstash:

$ /opt/logstash/bin/logstash agent -f logstash.conf

Generating traffic

I used ab to generate traffic for Django:

$ ab -n 200 -c 10 http://127.0.0.1:8050/

With a decent amount of exceptions captured and sent to Elasticsearch, I looked to see what index name Logstash had picked, it turned out to be logstash-2014.11.04:

$ curl --silent localhost:9200/_cat/indices?v
health index               pri rep docs.count docs.deleted store.size pri.store.size
yellow logstash-2014.11.04   5   1        848            0    453.8kb        453.8kb

I then open Kibana and loaded the default dashboard/file/logstash.json file with the URL below and saw an interface that looked like this.

http://127.0.0.1:8000/index.html#/dashboard/file/logstash.json

Further efforts

I'm still looking at ways to tokenise the messages so their properties are easier to isolate, search and aggregate. At the moment messages are stored as a JSON blob rather than as separate keys in a dictionary:

{
       "message" => "{\"host\": \"ubuntu\", \"logger\": \"example.tasks\", \"type\": \"logstash\", \"tags\": [], \"path\": \"/home/mark/faulty/example/tasks.py\", \"@timestamp\": \"2014-11-04T12:37:17.477207Z\", \"@version\": 1, \"message\": \"371679e24301dc140fbb7d0fd7957adb9ad32e41 1625 audio/mpeg\", \"levelname\": \"ERROR\"}",
      "@version" => "1",
    "@timestamp" => "2014-11-04T12:37:17.477Z",
          "type" => "syslog",
          "host" => "127.0.0.1"
}
{
       "message" => "{\"host\": \"ubuntu\", \"logger\": \"example.tasks\", \"type\": \"logstash\", \"tags\": [], \"path\": \"/home/mark/faulty/example/tasks.py\", \"@timestamp\": \"2014-11-04T12:37:17.567138Z\", \"@version\": 1, \"message\": \"9885e4c43c07e399a9a94d934e7f642a724e819b 1667 image/x-cmu-raster\", \"levelname\": \"CRITICAL\"}",
      "@version" => "1",
    "@timestamp" => "2014-11-04T12:37:17.567Z",
          "type" => "syslog",
          "host" => "127.0.0.1"
}

The other task I want to look at is generating random data with distinctive distribution curves. With the way I use random.choice at the moment each choice tends to get picked as often as any other choice but in a random order. When summarising statistics almost every exception happens an equal number of times so the pie charts slices are even and time series lines overlap one another quiet often.

In [1]: from collections import Counter

In [2]: from random import randint

In [3]: Counter([randint(0, 3) for _ in range(0, 200)])
Out[3]: Counter({1: 51, 3: 51, 0: 50, 2: 48})

Thank you for taking the time to read this post. I offer both consulting and hands-on development services to clients in North America and Europe. If you'd like to discuss how my offerings can help your business please contact me via LinkedIn.