Web scrapers often find their IPv4 addresses showing up in aggregated traffic metrics and then being subjected to rate limiting. Using a larger number of IPv6 addresses can help mitigate this, but not all websites support IPv6. Spreading connections across many IPv4 addresses reduces the risk of any single address being rate limited.
To add to this, Cloud-originating IPv4 addresses are easily identifiable and are often assumed to host synthetic traffic. Residential IPv4 addresses face less scrutiny.
Using VPNs and/or other tunnelling techniques can go a long way to keeping crawlers under the radar and collecting data as productively as possible. These can be hosted in both Cloud and residential environments.
In this post I'll explore two solutions: the first uses WireGuard and the second uses an OpenSSH SOCKS5 proxy.
WireGuard: A Modern VPN
WireGuard is a modern VPN solution that has been built by Jason A. Donenfeld over the past five years. It breaks from the traditional prime number-based cryptography schemes by using Elliptic Curves. For the past few decades, prime number schemes have been plagued by side-channel, padding, replay and forgery attacks as well as implementation errors that in some cases left contents unencrypted. In 2017, researchers developed an attack named ROBOT that allowed them to sign messages with Facebook's and PayPal's private keys.
WireGuard uses Curve25519, which was developed by Daniel J. Bernstein in 2005. The Diffie-Hellman key exchange function built on this curve is called X25519 and the digital signature scheme is called Ed25519. Curve25519 requires much less computation than previous prime number-based schemes. To add to that, Curve25519 is among the fastest curves available, isn't covered by any known patents and its implementation is in the public domain.
Client-side tools may not notice much of a reduction in computational requirements but servers handling a large number of encrypted requests will be able to handle a far larger number of workloads thanks to these efficiencies. Low-powered Raspberry Pis can happily sustain 20 Mbps when tunnelling WireGuard traffic and I've witnessed WireGuard's Android VPN client battery consumption match that of Spotify's and WhatsApp's.
Ed25519 public keys are short: in OpenSSH's public key format they only need 68 base64 characters. This contrasts with the 717 characters needed for a 4096-bit RSA public key.
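The 68-character figure can be reproduced with a few lines of Python. This is a rough sketch that assumes the OpenSSH public key layout of a length-prefixed key type name followed by the length-prefixed 32-byte key. The helper names are my own and the key material is a placeholder of zero bytes rather than a real key.

```python
import base64
import struct


def ssh_string(data):
    """Encode bytes as an SSH wire string: 4-byte big-endian length, then the data."""
    return struct.pack('>I', len(data)) + data


def openssh_ed25519_blob(raw_key):
    """Build the blob that is base64-encoded in an ssh-ed25519 public key line."""
    if len(raw_key) != 32:
        raise ValueError('Ed25519 public keys are 32 bytes long')
    return ssh_string(b'ssh-ed25519') + ssh_string(raw_key)


# Placeholder key material: 32 zero bytes stand in for a real public key.
placeholder_key = bytes(32)

openssh_length = len(base64.b64encode(openssh_ed25519_blob(placeholder_key)))
raw_length = len(base64.b64encode(placeholder_key))

print(openssh_length)  # 68
print(raw_length)      # 44
```

The second figure, 44 characters, is the length of a bare 32-byte key in base64, which is the form WireGuard's own keys take in its configuration files.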
Curve25519 is one of many Elliptic Curves. Some of the others are suspected of being hard to use safely, lacking rigidity and potentially containing back doors. Daniel J. Bernstein and Tanja Lange have painstakingly catalogued a database of eleven mathematical characteristics and judged a wide variety of curves against these criteria. This exercise aims to show that each curve's underlying discrete logarithm problem is sufficiently difficult, or to flag it when it isn't. Curves meeting all of the criteria, such as Curve25519, have been deemed to be "Safe Curves".
WireGuard also supports peers pre-sharing a 256-bit symmetric key, which adds a further layer of protection against future quantum computing-based attacks.
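A pre-shared key is nothing more than 256 bits of random data. Below is a minimal sketch of producing one in Python, assuming the base64 text form seen in WireGuard configuration files. The function name is my own. In practice WireGuard's own `wg genpsk` command does this job and PiVPN generates the key for you.

```python
import base64
import os


def generate_preshared_key():
    """Return 32 random bytes (256 bits) from the OS, base64-encoded as text."""
    return base64.b64encode(os.urandom(32)).decode('ascii')


psk = generate_preshared_key()
print(len(psk))  # 44
```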
RSA's prime number-based schemes started life in the 1970s and pre-dated the maturity of Elliptic Curves by some 20 years. By 2005, America's National Security Agency (NSA) had promulgated a suite of cryptographic algorithms that included Elliptic Curves. This gave Elliptic Curves credibility and showed that these schemes can be used to protect sensitive information. In 2017, the National Institute of Standards and Technology (NIST) approved Curve25519 for use by the US Federal government.
As of this writing, WireGuard is made up of 5,478 lines of C code and headers, making it one of the simplest VPN solutions to date. In contrast, OpenVPN, when compiled with OpenSSL, which in turn can be compiled with MIT Kerberos, sits at over one million lines of C code and headers combined. This is even before you begin to count its compression library dependencies like LZO.
WireGuard will be embedded into version 5.6 of the Linux Kernel. This will remove the overhead of context switching between the Kernel and User space while enjoying a very wide installation base. WireGuard also ships as a standalone package for anyone using a previous version of the Kernel.
Other popular applications implementing Curve25519 include Facebook Messenger, OpenSSH, Signal, Tor, Viber and WhatsApp.
OpenSSH: Ubiquitous Encrypted Tunnelling
The main feature of focus in OpenSSH in this post is the SOCKS5 proxy support. It allows users to set up local ports that can tunnel TCP traffic through a remote OpenSSH server.
The Secure Shell protocol (SSH) was invented in Finland by Tatu Ylönen in 1995. Though he had produced an open-source implementation it came with various restrictions and has since become proprietary software. In 1999, Damien Miller and Darren Tucker forked the SSH code base and created OpenSSH, a suite of tools designed to bring compressed and encrypted tunnelling to various web-centric communication protocols. They released their work under a BSD license.
The OpenSSH suite is probably better known by the tools it bundles. These include SSH (a telnet replacement), SFTP (an FTP replacement), SCP (an RCP replacement) and SSHD (a server daemon for the above tools). These tools are installed on nearly every internet-facing UNIX system.
Both Damien and Darren went on to be employed by Google where they've been working as an Information Security Engineer and Site Reliability Engineer respectively for the better part of the past two decades. Damien Miller's LinkedIn describes his role as "Helping prevent Google from getting hacked".
OpenSSH supports a wide variety of prime number-based cryptography schemes and added support for Curve25519 in 2013. To see which digital signature, encryption and compression schemes are supported by both your client and any given SSH server you can connect to, replace <hostname> in the following with the hostname of the target server.
$ ssh -vvv <hostname> uptime 2>&1 | grep -i kex
Below you can see the algorithms supported by your client.
debug2: local client KEXINIT proposal
debug2: KEX algorithms: curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1,ext-info-c
These are the algorithms supported by the Server.
debug2: peer server KEXINIT proposal
debug2: KEX algorithms: ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1,diffie-hellman-group1-sha1
This is what your client and the Server have agreed upon using.
debug1: kex: algorithm: ecdh-sha2-nistp256
debug1: kex: host key algorithm: ecdsa-sha2-nistp256
debug1: kex: server->client cipher: aes128-ctr MAC: umac-64-etm@openssh.com compression: none
debug1: kex: client->server cipher: aes128-ctr MAC: umac-64-etm@openssh.com compression: none
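The agreed key exchange algorithm can be worked out from the two proposals above. The sketch below assumes the simplified rule that the first algorithm on the client's list that the server also lists wins. The real negotiation has extra conditions around host key compatibility that I've left out, and the function names are my own.

```python
def parse_kex_algorithms(debug_line):
    """Extract the comma-separated list from a 'debug2: KEX algorithms: ...' line."""
    _, _, algorithms = debug_line.partition('KEX algorithms: ')
    return [name for name in algorithms.strip().split(',') if name]


def pick_kex(client_algorithms, server_algorithms):
    """Return the first client preference that the server also lists, or None."""
    server_set = set(server_algorithms)
    for name in client_algorithms:
        # ext-info-c signals extension support rather than naming a key exchange
        if name != 'ext-info-c' and name in server_set:
            return name
    return None


client_line = ('debug2: KEX algorithms: curve25519-sha256@libssh.org,'
               'ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,'
               'diffie-hellman-group-exchange-sha256,'
               'diffie-hellman-group-exchange-sha1,'
               'diffie-hellman-group14-sha1,ext-info-c')
server_line = ('debug2: KEX algorithms: ecdh-sha2-nistp256,ecdh-sha2-nistp384,'
               'ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,'
               'diffie-hellman-group-exchange-sha1,'
               'diffie-hellman-group14-sha1,diffie-hellman-group1-sha1')

chosen = pick_kex(parse_kex_algorithms(client_line),
                  parse_kex_algorithms(server_line))
print(chosen)  # ecdh-sha2-nistp256
```

The client prefers Curve25519 but this server doesn't offer it, so the two fall back to the next shared entry, which matches the debug1 line above.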
Set up a WireGuard Server
For this example, I'll set up a WireGuard VPN Server on AWS EC2. If you want to run the following on a Raspberry Pi running Raspbian on a residential internet connection the steps will be much the same. I'll launch an on-demand t3.micro instance in eu-west-1 running Ubuntu 16. It'll cost $8.32 / month + VAT and has 1 GB of RAM, 2 vCPUs and up to 5 Gbps of network connectivity. I'll set up 8 GB of Magnetic EBS storage which will incur additional costs.
I'll create a new security group called vpn-farm. I'll open up TCP port 22 to my IP address and UDP port 51220 (not TCP, UDP) to the vpn-farm security group.
The external address of this EC2 instance is 54.246.243.162 and the private address, which is accessible across the VPC this EC2 instance lives in, is 172.30.2.186.
To set up the machine I'll first SSH into it. Note that, for all my efforts championing Ed25519 above, AWS doesn't support it for key pairs at this time. As far as I can find, only RSA keys are supported.
$ ssh ubuntu@54.246.243.162
I'll refresh the packages list and then install WireGuard via PiVPN's installer. When installing WireGuard clients I tend to use WireGuard and Linux tooling directly but for WireGuard servers, PiVPN wraps up a lot of complexity and edge case coverage into its installer.
$ sudo apt update
$ wget -qO- https://install.pivpn.io | bash
You'll be given the option of installing either OpenVPN or WireGuard. Choose WireGuard. Select UDP port 51220 as WireGuard's port.
You'll be presented with a list of DNS providers such as Quad9, OpenDNS, Level3, DNS.WATCH, Norton, FamilyShield, CloudFlare, Google or Custom. Choose what you're comfortable with using.
You can configure WireGuard to work with a domain name or IPv4 address. For this exercise I'm using the private IPv4 address alone.
I'll enable unattended-upgrades of security patches. Following all the above, PiVPN asked to reboot the system.
Once WireGuard was set up I created a new user account called scrape.
$ pivpn add --name scrape
::: Client Keys generated
::: Client config generated
::: Updated server config
::: WireGuard restarted
======================================================================
::: Done! scrape.conf successfully created!
::: scrape.conf was copied to /home/ubuntu/configs for easy transfer.
::: Please use this profile only on one device and create additional
::: profiles for other devices. You can also use pivpn -qr
::: to generate a QR Code you can scan with the mobile app.
======================================================================
Below is a truncated version of the configuration file generated. I'll use this on the scraping machine to connect to the WireGuard Server.
$ sudo cat /etc/wireguard/configs/scrape.conf
[Interface]
PrivateKey = A...=
Address = 10.6.0.2/24
DNS = 1.1.1.1, 1.0.0.1
[Peer]
PublicKey = A...=
PresharedKey = A...=
Endpoint = 172.30.2.186:51220
AllowedIPs = 0.0.0.0/0, ::0/0
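The configuration file is close enough to INI format that Python's standard library can read it, which is handy for sanity-checking a profile before shipping it to a scraping machine. This is only a sketch: the function names are mine, the keys are the truncated placeholders from above, and it assumes a single [Peer] section, which is what PiVPN generates for a client.

```python
import configparser
import ipaddress

SAMPLE_CONF = """\
[Interface]
PrivateKey = A...=
Address = 10.6.0.2/24
DNS = 1.1.1.1, 1.0.0.1

[Peer]
PublicKey = A...=
PresharedKey = A...=
Endpoint = 172.30.2.186:51220
AllowedIPs = 0.0.0.0/0, ::0/0
"""


def load_wireguard_conf(text):
    """Parse a single-peer WireGuard profile into a ConfigParser object."""
    parser = configparser.ConfigParser(delimiters=('=',), interpolation=None)
    parser.optionxform = str  # keep WireGuard's mixed-case key names
    parser.read_string(text)
    return parser


def allowed_networks(conf):
    """Return the peer's AllowedIPs as a list of ipaddress network objects."""
    items = conf['Peer']['AllowedIPs'].split(',')
    return [ipaddress.ip_network(item.strip()) for item in items]


def routes_all_ipv4(conf):
    """True when AllowedIPs contains the IPv4 default route."""
    return any(net.version == 4 and net.prefixlen == 0
               for net in allowed_networks(conf))


conf = load_wireguard_conf(SAMPLE_CONF)
host, port = conf['Peer']['Endpoint'].rsplit(':', 1)
print(host, port, routes_all_ipv4(conf))
```

A server-side configuration with several [Peer] sections would need a different approach, since configparser rejects duplicate section names by default.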
Set up a WireGuard Client
I'll set up another Ubuntu 16 Server for scraping. This should have more vCPUs, RAM and disk space as it'll be used for parsing and storing data collected from scraping. The specifications of these sorts of machines are very much dependent on their workloads so I'll refrain from making generic recommendations.
I'll install Python and a utility that will allow installing WireGuard from a 3rd-party repository.
$ sudo apt update
$ sudo apt install \
python-pip \
python-virtualenv \
software-properties-common
The following gives the system the details of the 3rd-party repository hosting the WireGuard package we're interested in and then installs it along with OpenResolv.
$ sudo add-apt-repository ppa:wireguard/wireguard
$ sudo apt update
$ sudo apt install \
openresolv \
wireguard
OpenResolv triggered the removal of resolvconf which requires a system reboot.
$ sudo reboot
The scrape.conf file generated on the WireGuard Server has been saved to the home folder I'm using on this machine. I'll copy the configuration into WireGuard's configuration folder.
$ cd ~
$ sudo install \
-o root -g root -m 600 \
scrape.conf \
/etc/wireguard/wg0.conf
I'll then launch WireGuard and tell the system to launch it after any reboot.
$ sudo systemctl start wg-quick@wg0
$ sudo systemctl enable wg-quick@wg0
Run the following to make sure the service launched without issue.
$ sudo systemctl status wg-quick@wg0 | tail -n1
Apr 12 00:11:03 ubuntu systemd[1]: Started WireGuard via wg-quick(8) for wg0.
If you see anything other than the above, try starting the service again.
$ sudo systemctl start wg-quick@wg0
WireGuard should now be able to report its telemetry.
$ sudo wg
interface: wg0
public key: a...=
private key: (hidden)
listening port: 53806
fwmark: 0xca6c
peer: A...=
preshared key: (hidden)
endpoint: 172.30.2.186:51220
allowed ips: 0.0.0.0/0, ::/0
latest handshake: 30 seconds ago
transfer: 204 B received, 292 B sent
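If you want a scraper to confirm the tunnel is alive before it starts work, the report above is easy enough to parse. Below is a rough sketch using the output from this post as sample text. The function names are mine, and it leans on my understanding that the latest handshake line only appears once a handshake has taken place.

```python
import re

SAMPLE_WG_OUTPUT = """\
interface: wg0
public key: a...=
private key: (hidden)
listening port: 53806
fwmark: 0xca6c
peer: A...=
preshared key: (hidden)
endpoint: 172.30.2.186:51220
allowed ips: 0.0.0.0/0, ::/0
latest handshake: 30 seconds ago
transfer: 204 B received, 292 B sent
"""


def parse_wg_status(text):
    """Split each 'name: value' line of the report into a dictionary entry."""
    fields = {}
    for line in text.splitlines():
        key, separator, value = line.strip().partition(': ')
        if separator:
            fields[key] = value
    return fields


def has_handshake(fields):
    """True when the report includes a latest handshake line."""
    return 'latest handshake' in fields


def transfer_amounts(fields):
    """Return the (received, sent) amounts as strings, units included."""
    match = re.match(r'(.+) received, (.+) sent$', fields.get('transfer', ''))
    return match.groups() if match else None


status = parse_wg_status(SAMPLE_WG_OUTPUT)
print(has_handshake(status), transfer_amounts(status))
```

This handles a single peer only. For anything more involved, `wg show wg0 dump` prints a tab-separated form that is better suited to scripts.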
Any networking software on the machine making any new connections will automatically tunnel via WireGuard.
$ wget -qO- https://ipv4.icanhazip.com
54.246.243.162
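The same check can be made from Python before a crawl starts. This sketch uses only the standard library so it runs without a virtual environment. The function names are my own and the fetch parameter exists purely so the check can be exercised without touching the network.

```python
import urllib.request


def fetch_public_ip(url='https://ipv4.icanhazip.com', timeout=10):
    """Ask the lookup service which IPv4 address our traffic appears to come from."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode('ascii').strip()


def tunnel_is_active(expected_ip, fetch=fetch_public_ip):
    """True when the public address seen by the outside world is the VPN server's."""
    return fetch() == expected_ip
```

On the scraping machine, `tunnel_is_active('54.246.243.162')` should return True while WireGuard is up and False if traffic is leaving via the machine's own address.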
Here is a short Python example. I'll create a virtual environment with requests which will handle all HTTP and HTTPS calls and BeautifulSoup which will handle the parsing of any HTML returned.
$ virtualenv ~/.scrape
$ source ~/.scrape/bin/activate
$ pip install \
beautifulsoup4 \
requests
The following sets the User-Agent header to that of a recent version of Chrome. Replace <hostname> with a server of your choice. I've set up a session so cookies will follow any subsequent requests.
$ python
from bs4 import BeautifulSoup
import requests
headers = {
'User-Agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
'AppleWebKit/537.36 (KHTML, like Gecko) '
'Chrome/78.0.3904.97 Safari/537.36'
}
session = requests.Session()
resp = session.get('https://<hostname>/', headers=headers)
assert resp.status_code == 200, 'Unexpected HTTP %d' % resp.status_code
The following will parse and print out the contents of any H1 tags found in the above call.
# Name the parser explicitly so BeautifulSoup doesn't have to guess one
soup = BeautifulSoup(resp.text, 'html.parser')
print([x.text.strip().lower()
       for x in soup.find_all('h1')])
All of the above ran via WireGuard automatically.
Setting up an OpenSSH SOCKS5 Proxy
The machines set up on AWS EC2 already come with OpenSSH installed and will have SSH public keys dropped into /home/ubuntu/.ssh/authorized_keys. This means I can launch a SOCKS5 proxy with the following on the client system.
$ ssh -D9090 \
-o ServerAliveInterval=50 \
ubuntu@172.30.2.186
Once connected to the server, run top to keep the connection from going stale.
$ top
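If the tunnel needs to be started by the scraper itself, an interactive session running top isn't practical. Below is a sketch of launching it from Python instead, assuming OpenSSH's -N flag, which tells ssh not to run a remote command and just hold the forwarded port open. The function names and the polling approach are my own.

```python
import socket
import subprocess
import time


def ssh_socks_command(target, local_port=9090, keepalive=50):
    """Build the ssh argument list for a SOCKS5 proxy on local_port."""
    return [
        'ssh',
        '-N',  # no remote command, forwarding only
        '-D', str(local_port),
        '-o', 'ServerAliveInterval=%d' % keepalive,
        target,
    ]


def wait_for_port(port, host='127.0.0.1', timeout=10.0):
    """Poll until something accepts TCP connections on host:port or time runs out."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            socket.create_connection((host, port), timeout=1).close()
            return True
        except OSError:
            time.sleep(0.2)
    return False


def open_tunnel(target, local_port=9090):
    """Start ssh in the background and return once the proxy port is listening."""
    process = subprocess.Popen(ssh_socks_command(target, local_port))
    if not wait_for_port(local_port):
        process.terminate()
        raise RuntimeError('SOCKS5 proxy did not come up on port %d' % local_port)
    return process
```

Calling `open_tunnel('ubuntu@172.30.2.186')` returns a process handle and `terminate()` on that handle closes the tunnel again. This relies on key-based authentication already working, as ssh has no terminal to prompt on.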
The ssh -D9090 command above opens TCP port 9090 locally. Tools that use libcurl should support SOCKS5 proxy settings being defined in the ALL_PROXY environment variable.
$ sudo apt install curl
$ export ALL_PROXY=socks5h://localhost:9090
$ curl https://ipv4.icanhazip.com
For Python, we'll need the SOCKS extra that can be installed alongside Requests.
$ source ~/.scrape/bin/activate
$ pip install -U 'requests[socks]'
$ python
import requests
headers = {
'User-Agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
'AppleWebKit/537.36 (KHTML, like Gecko) '
'Chrome/78.0.3904.97 Safari/537.36'
}
proxies = {
'http': 'socks5://localhost:9090',
'https': 'socks5://localhost:9090'
}
session = requests.Session()
resp = session.get('https://ipv4.icanhazip.com',
headers=headers,
proxies=proxies)
assert resp.status_code == 200, 'Unexpected HTTP %d' % resp.status_code
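One detail worth noting is the difference between the socks5h scheme used with curl and the socks5 scheme used with Requests. With socks5, hostnames are resolved on the scraping machine. With socks5h, they're resolved at the far end of the tunnel. Below is a sketch of a helper that builds the proxies dictionary either way and rotates across several tunnels. The helper names are mine and ports 9091 and 9092 are hypothetical, each standing for the -D port of a separate ssh session to a different server.

```python
import itertools


def socks_proxies(port, host='localhost', remote_dns=True):
    """Build a Requests-style proxies dict for a local SOCKS5 listener."""
    scheme = 'socks5h' if remote_dns else 'socks5'
    url = '%s://%s:%d' % (scheme, host, port)
    return {'http': url, 'https': url}


# Hypothetical ports, one per ssh -D tunnel
TUNNEL_PORTS = [9090, 9091, 9092]
proxy_cycle = itertools.cycle([socks_proxies(port) for port in TUNNEL_PORTS])


def next_proxies():
    """Return the proxies dict for the next tunnel in round-robin order."""
    return next(proxy_cycle)
```

Passing `proxies=next_proxies()` to each `session.get()` call would spread requests evenly across the tunnels, which gets back to the original goal of keeping any single IPv4 address from attracting attention.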