Postcode validation is a requirement that comes up in a lot of my UK-based client projects. Parsing and linting UK postcodes is rife with edge cases.
Postcodes change from time to time. Users may be unaware that their postcode has changed, or they may have submitted it before it was decommissioned. Mail might still be delivered to decommissioned postcodes, and geographic boundary information can still exist for them, so they can remain useful.
Network requests to validate postcodes can add a lot of overhead when you're batch processing. New postcodes aren't guaranteed to be in every 3rd-party database either, so there is an error rate to take into account. Some services take a flat number or building name plus a postcode from the user and return the full address, which saves time and avoids spelling mistakes. These services often charge for querying their database, so it's worth minimising calls to them.
Perfect is the enemy of good, and sometimes just knowing that a postcode fits the format of a UK postcode is enough for certain postcode-related functionality. Linting a postcode before checking it against a 3rd-party database can reduce costs as well.
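A minimal sketch of that lint-then-lookup idea follows. The loose pattern, the function names and the `paid_lookup` callable are all assumptions made up for illustration; the pattern only checks the rough shape of a postcode and ignores special cases such as GIR 0AA.

```python
import re

# Deliberately loose shape check (an assumption for this sketch): an outward
# code of 2-4 characters, an optional space, then a digit and two letters.
_LOOSE_POSTCODE_RE = re.compile(r'^[A-Z]{1,2}[0-9][A-Z0-9]?\s?[0-9][A-Z]{2}$')


def looks_like_postcode(postcode):
    """Return True if the string has the rough shape of a UK postcode."""
    return _LOOSE_POSTCODE_RE.match(postcode.strip().upper()) is not None


def lookup_addresses(postcodes, paid_lookup):
    """Call paid_lookup (a hypothetical paid service) only for linted input."""
    results = {}
    for postcode in postcodes:
        if looks_like_postcode(postcode):
            results[postcode] = paid_lookup(postcode)
        else:
            results[postcode] = None  # rejected locally, no request made
    return results
```

Anything rejected by the lint never reaches the paid service, so typos and junk input cost nothing. The trade-off is that a lint which is too strict will also block valid postcodes, which is the problem the rest of this post runs into.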
There are a lot of regex snippets and libraries for parsing UK postcodes. I took a look at a couple of them to see if they could stand up to a database of 2.5 million current and past UK postcodes.
Rob Cowie's Postcode library
The first I looked at was postcode by Rob Cowie. I quickly found that two digits in the outward code (the 2-4 characters before the final three of a UK postcode) caused the library to raise a TypeError:
$ pip install -e git+https://github.com/robcowie/postcode.git#egg=postcode
>>> from postcode import uk
>>> uk.validate('s11 7ty')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../postcode/uk.py", line 82, in validate
    parts[0] = parts[0][0]
TypeError: 'tuple' object does not support item assignment
I raised an issue and began looking around for another library.
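The error class itself is easy to reproduce outside the library. In the sketch below the tuple is a stand-in of my own; the traceback does not show how the library builds `parts`, so this only illustrates why the assignment fails and one way around it.

```python
# Stand-in value (an assumption): the traceback does not show how the
# library really builds `parts`, only that it is a tuple.
parts = ('S11', '7TY')

try:
    parts[0] = parts[0][0]  # the same statement as in the traceback
except TypeError as error:
    message = str(error)  # tuples are immutable, so assignment fails

# Copying into a list first makes the same assignment legal.
editable = list(parts)
editable[0] = editable[0][0]
```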
Simon Hayward's UK Postcode Parser
The next library I looked at was a fork of ukpostcodeparser by Simon Hayward.
$ pip install -e git+https://github.com/simonhayward/ukpostcodeparser.git#egg=ukpostcodeparser
I downloaded a list of UK postcodes and ran all of them through the parser to see if it raised any exceptions:
$ curl -O http://www.doogal.co.uk/files/postcodes.zip
$ unzip postcodes.zip
from ukpostcodeparser import parse_uk_postcode

# The layout of postcodes.csv looks like the following:
#   AB1 0AD,57.10056,-2.248342,385053,...
#   AB1 0AE,57.084447,-2.255708,384600,...
#   AB1 0AF,57.096659,-2.258103,384460,...
with open('postcodes.csv') as csv_file:
    for line in csv_file:
        pieces = line.strip().split(',')
        # Skip lines that don't have a postcode plus at least one more column
        if len(pieces) < 2:
            print('line invalid %s' % line)
            continue
        # Remove the space between the outward and inward codes
        postcode = pieces[0].replace(' ', '')
        try:
            _postcode = parse_uk_postcode(postcode)
        except Exception as error:
            # The parser raises ValueError('Invalid postcode') on a rejection
            print('%s %s' % (error, postcode))
            continue
        if _postcode is None:
            print('Invalid postcode %s' % postcode)
The CSV file contained 2,545,662 postcodes and of them 7,085 came back as invalid.
$ wc -l postcodes.csv
2545662 postcodes.csv
$ python check.py > results
$ wc -l results
7085 results
I took a sampling of the invalid postcodes to see what they looked like:
$ sort --random-sort results | head
Invalid postcode NPT6ZE
Invalid postcode W1R5HD
Invalid postcode NPT7HS
Invalid postcode W1X8NJ
Invalid postcode NPT8AD
Invalid postcode NPT5LU
Invalid postcode W1M0BN
Invalid postcode W1R0DS
Invalid postcode NPT1JW
Invalid postcode NPT2TW
The NPT outward code for Newport is no longer in use, so I'm not so concerned with that one, but W1R covers part of central London. I experimented with a few combinations of postcodes and it appeared that if a letter came after the digits in the outward code, the library would treat the postcode as invalid even though it is valid. For example, "Golden Square, London, W1R 3AD":
>>> parse_uk_postcode('w1r3ad')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../ukpostcodeparser/parser.py", line 129, in parse_uk_postcode
    raise ValueError('Invalid postcode')
ValueError: Invalid postcode
But then I tried another postcode, this one for "216 Oxford Street, London, W1D 1LA" and it did work:
>>> parse_uk_postcode('W1D1LA')
('W1D', '1LA')
Before looking to patch the library I wanted to see if there were any other obvious solutions.
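To see which outward codes account for the rejections, the result lines can be grouped by outward code. This is a rough sketch rather than the script I ran: the function names are made up, and it assumes each line ends with an unspaced postcode, as in the sample above. Because the inward code is always three characters, the outward code is whatever precedes them.

```python
from collections import Counter


def outward_code(postcode):
    """Return everything before the three-character inward code."""
    compact = postcode.replace(' ', '').upper()
    return compact[:-3]


def tally_outward_codes(result_lines):
    """Count flagged postcodes per outward code.

    Assumes each line ends with an unspaced postcode, for example
    'Invalid postcode W1R5HD'.
    """
    counts = Counter()
    for line in result_lines:
        fields = line.split()
        if fields:
            counts[outward_code(fields[-1])] += 1
    return counts


sample = [
    'Invalid postcode NPT6ZE',
    'Invalid postcode W1R5HD',
    'Invalid postcode NPT7HS',
    'Invalid postcode W1X8NJ',
    'Invalid postcode W1R0DS',
]
counts = tally_outward_codes(sample)
# counts['NPT'] == 2, counts['W1R'] == 2, counts['W1X'] == 1
```

Calling `counts.most_common()` on the full results file would list the affected outward codes in order of how often they were flagged.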
Googling for Regexes
I began googling for regexes which claimed to parse UK postcodes. I came across the following regex and ran it against the list of 2.5 million postcodes.
import re

# Raw strings keep the backslash escapes intact for the regex engine
pattern = (
    r'^([A-PR-UWYZ]([0-9]{1,2}|([A-HK-Y][0-9]|[A-HK-Y][0-9]([0-9]|'
    r'[ABEHMNPRV-Y]))|[0-9][A-HJKS-UW])\ [0-9][ABD-HJLNP-UW-Z]{2}|'
    r'(GIR\ 0AA)|(SAN\ TA1)|(BFPO\ (C\/O\ )?[0-9]{1,4})|'
    r'((ASCN|BBND|[BFS]IQQ|PCRN|STHL|TDCU|TKCA)\ 1ZZ))$'
)
_POSTCODE_RE = re.compile(pattern)


def is_postcode(postcode):
    """Return True if an upper-case, space-separated postcode matches."""
    return _POSTCODE_RE.match(postcode) is not None


# The layout of postcodes.csv looks like the following:
#   AB1 0AD,57.10056,-2.248342,385053,...
#   AB1 0AE,57.084447,-2.255708,384600,...
#   AB1 0AF,57.096659,-2.258103,384460,...
with open('postcodes.csv') as csv_file:
    for line in csv_file:
        pieces = line.strip().split(',')
        # Skip lines that don't have a postcode plus at least one more column
        if len(pieces) < 2:
            print('line invalid %s' % line)
            continue
        # Make sure the postcode is in upper case
        postcode = pieces[0].upper()
        if not is_postcode(postcode):
            print('invalid postcode %s' % postcode)
This failed against 8,614 postcodes. Here is a sampling of a few of them:
$ sort --random-sort results | head
invalid postcode W1V 9PD
invalid postcode W1Y 8HE
invalid postcode W1Y 8DH
invalid postcode W1P 7FW
invalid postcode W1Y 1AR
invalid postcode W1R 6JJ
invalid postcode W1M 5AE
invalid postcode W1R 1FH
invalid postcode NPT 8ET
invalid postcode W1R 0HD
Ignoring the old Newport postcodes, W1R was still being caught. I could see that only certain W1[A-Z] outward codes were being caught out, so I got a list of them together and found only 7 were being flagged as invalid. I adjusted the regular expression to allow for M, N, P, R, V, X and Y after the digit in the outward code:
# Same pattern as before, with M, N, P, R, V, X and Y added to the
# character class that follows a single digit in the outward code
pattern = (
    r'^([A-PR-UWYZ]([0-9]{1,2}|([A-HK-Y][0-9]|[A-HK-Y][0-9]([0-9]|'
    r'[ABEHMNPRV-Y]))|[0-9][A-HJKMNPRS-UVWXY])\ [0-9][ABD-HJLNP-UW-Z]{2}|'
    r'(GIR\ 0AA)|(SAN\ TA1)|(BFPO\ (C\/O\ )?[0-9]{1,4})|'
    r'((ASCN|BBND|[BFS]IQQ|PCRN|STHL|TDCU|TKCA)\ 1ZZ))$'
)
_POSTCODE_RE = re.compile(pattern)
I ran the script again and only the 2,418 decommissioned Newport postcodes were seen as invalid.
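As a quick spot check that doesn't need the full CSV, the original and adjusted patterns can be compared on a handful of the postcodes mentioned in this post. The names `ORIGINAL_RE`, `ADJUSTED_RE` and `compare` are just labels for this sketch.

```python
import re

# The regex as found, before the adjustment
ORIGINAL_RE = re.compile(
    r'^([A-PR-UWYZ]([0-9]{1,2}|([A-HK-Y][0-9]|[A-HK-Y][0-9]([0-9]|'
    r'[ABEHMNPRV-Y]))|[0-9][A-HJKS-UW])\ [0-9][ABD-HJLNP-UW-Z]{2}|'
    r'(GIR\ 0AA)|(SAN\ TA1)|(BFPO\ (C\/O\ )?[0-9]{1,4})|'
    r'((ASCN|BBND|[BFS]IQQ|PCRN|STHL|TDCU|TKCA)\ 1ZZ))$'
)

# The adjusted regex, which also allows M, N, P, R, V, X and Y after a digit
ADJUSTED_RE = re.compile(
    r'^([A-PR-UWYZ]([0-9]{1,2}|([A-HK-Y][0-9]|[A-HK-Y][0-9]([0-9]|'
    r'[ABEHMNPRV-Y]))|[0-9][A-HJKMNPRS-UVWXY])\ [0-9][ABD-HJLNP-UW-Z]{2}|'
    r'(GIR\ 0AA)|(SAN\ TA1)|(BFPO\ (C\/O\ )?[0-9]{1,4})|'
    r'((ASCN|BBND|[BFS]IQQ|PCRN|STHL|TDCU|TKCA)\ 1ZZ))$'
)


def compare(postcode):
    """Return (matches original, matches adjusted) for one postcode."""
    return (
        ORIGINAL_RE.match(postcode) is not None,
        ADJUSTED_RE.match(postcode) is not None,
    )


results = {pc: compare(pc) for pc in ['W1R 3AD', 'W1D 1LA', 'W1Y 8HE', 'NPT 8ET']}
# W1R 3AD and W1Y 8HE: rejected by the original, accepted by the adjusted
# W1D 1LA: accepted by both; NPT 8ET: rejected by both
```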
Having seen that the W1M, W1N, W1P, W1R, W1V, W1X and W1Y outward codes were the edge cases that tripped up Simon Hayward's library, I created a pull request.