Posted on Mon 13 August 2018

A Minimalist Guide to Microsoft SQL Server 2017 on Ubuntu Linux

SQL Server is Microsoft's enterprise relational database offering. It was first released in 1989 and has been supported on various Windows and OS/2 platforms since its release. In October 2017, Microsoft released SQL Server 2017 for Linux. To date, Ubuntu 16.04, Red Hat Enterprise Linux 7.3 and 7.4 as well as SUSE Enterprise Linux Server v12 are supported.

Though the Linux release is missing some features found in the Windows offering, the result is a very useful, feature-rich database that fits in well in a UNIX environment.

In this post I'll walk through setting up SQL Server 2017, performing basic data import and export tasks as well as building reports via Jupyter Notebook and automating tasks using Apache Airflow.

Installing SQL Server 2017 on Ubuntu

To start I'll add the public key and package listings from Microsoft to my apt repository. The following was run on a fresh install of Ubuntu 16.04.2 LTS.

$ wget -q https://packages.microsoft.com/keys/microsoft.asc -O - \
    | sudo apt-key add -
$ wget -q https://packages.microsoft.com/config/ubuntu/16.04/mssql-server-2017.list -O - \
    | sudo tee /etc/apt/sources.list.d/mssql-server-2017.list
$ wget -q https://packages.microsoft.com/config/ubuntu/16.04/prod.list -O - \
    | sudo tee /etc/apt/sources.list.d/msprod.list
$ sudo apt update

I'll then install SQL Server, one of its tooling packages, git and Python.

$ sudo apt install \
    git \
    mssql-server \
    mssql-tools \
    python-pip \
    python-virtualenv

The following will allow you to configure the SQL Server license as well as set the SQL Server system administrator password.

$ sudo /opt/mssql/bin/mssql-conf setup

Below are the various licensing options available.

Choose an edition of SQL Server:
  1) Evaluation (free, no production use rights, 180-day limit)
  2) Developer (free, no production use rights)
  3) Express (free)
  4) Web (PAID)
  5) Standard (PAID)
  6) Enterprise (PAID)
  7) Enterprise Core (PAID)
  8) I bought a license through a retail sales channel and have a product key to enter.

Details about editions can be found at
https://go.microsoft.com/fwlink/?LinkId=852748&clcid=0x409

Use of PAID editions of this software requires separate licensing through a
Microsoft Volume Licensing program.
By choosing a PAID edition, you are verifying that you have the appropriate
number of licenses in place to install and run this software.

Enter your edition(1-8):

I'll add the tools binaries folder to my PATH environment variable.

$ echo 'export PATH="$PATH:/opt/mssql-tools/bin"' >> ~/.bashrc
$ source ~/.bashrc

MSSQL-CLI Up & Running

Most SQL Server practitioners will be familiar with sqlcmd as the tool you most often come across when interacting with the server via the command line. But in late 2017, Microsoft released a new CLI tool. It's open source, written in Python and feels a lot more like the CLI tools that accompany Presto and PostgreSQL. The following will create a Python virtual environment and install the CLI tool.

$ virtualenv ~/.ms
$ source ~/.ms/bin/activate
$ pip install mssql-cli

The CLI picks up on various environment variables. The following will set the username and password for the CLI so that they don't have to be entered each time the tool is launched. Set the user to SA and type in its password below; the -s flag stops the password being echoed to the terminal.

$ read MSSQL_CLI_USER
$ read -s MSSQL_CLI_PASSWORD
$ export MSSQL_CLI_USER
$ export MSSQL_CLI_PASSWORD

Microsoft has included optional telemetry collection in this tool. The following will opt out of monitoring.

$ export MSSQL_CLI_TELEMETRY_OPTOUT=True

By default the CLI tool expects each SQL command to sit on a single line, delimited from other commands by a carriage return. To execute one or more commands, a GO command is issued alone on its own line. This aligns the tool's behaviour with sqlcmd.

The following will change this behaviour so that SQL commands can span multiple lines and execute once a semi-colon followed by a carriage return is entered. This is referred to as "Multi-line mode".

$ mkdir -p ~/.config/mssqlcli/
$ vi ~/.config/mssqlcli/config
[main]
multi_line = True
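
If you provision machines from a script rather than by hand, the same file can be written programmatically. Below is a minimal Python sketch that produces the config shown above; the path and option name mirror the manual steps.

```python
import os

# Create the mssql-cli config directory and write the multi-line
# mode setting, mirroring the mkdir/vi steps above.
config_dir = os.path.expanduser('~/.config/mssqlcli')
if not os.path.isdir(config_dir):
    os.makedirs(config_dir)

with open(os.path.join(config_dir, 'config'), 'w') as fh:
    fh.write('[main]\nmulti_line = True\n')
```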

Bulk Loading

I'll clone fivethirtyeight's data repository so that I can import its 2014 FIFA World Cup tournament predictions dataset.

$ git clone https://github.com/fivethirtyeight/data.git

I'll then launch the CLI tool. It will pick up on various environment variables and connect to the local SQL Server installation.

$ mssql-cli

I'll create a WorldCup2014 database and a Predictions table within it.

CREATE DATABASE WorldCup2014;
USE WorldCup2014;

CREATE TABLE Predictions (
    country     NVARCHAR(50),
    country_id  NVARCHAR(3),
    group_      NVARCHAR(1),
    spi         NUMERIC(4,2),
    spi_offense NUMERIC(5,4),
    spi_defense NUMERIC(5,4),
    win_group   NUMERIC(16,15),
    sixteen     NUMERIC(16,15),
    quarter     NUMERIC(16,15),
    semi        NUMERIC(16,15),
    cup         NUMERIC(16,15),
    win         NUMERIC(16,15));

I'll then load in the predictions CSV data. Note that I'm using the full path of the CSV file. This command skips the header row in the CSV.

BULK INSERT Predictions
FROM "/home/mark/data/world-cup-predictions/wc-20140611-132709.csv"
WITH (FIRSTROW = 2,
      FIELDTERMINATOR = ',',
      ROWTERMINATOR = '\n',
      MAXERRORS = 0,
      KEEPIDENTITY);
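
Note that BULK INSERT requires the server process itself to be able to read the file, which is why the full path matters. When that isn't possible, for instance when the CSV lives on another machine, the rows can be pushed through a client driver instead. The following is a sketch of the client-side half; with pymssql (installed later in this post) the resulting tuples would feed a parameterised cursor.executemany(), an assumption about how you'd wire it up rather than a required step.

```python
import csv

def csv_rows(path):
    """Parse a CSV file, skipping the header row (as FIRSTROW = 2
    does above), and return one tuple per record."""
    with open(path) as fh:
        reader = csv.reader(fh)
        next(reader)  # skip the header row
        return [tuple(row) for row in reader]

# With pymssql the rows could then be loaded with a parameterised
# INSERT instead of BULK INSERT, e.g.:
#
#     cursor.executemany(
#         "INSERT INTO Predictions VALUES (%s, %s, ..., %s)",
#         csv_rows('wc-20140611-132709.csv'))
```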

Exporting Data / Backups

The Linux version of SQL Server allows you to backup and restore databases using the .bak file format. These are binary files that hold a complete picture of a database's schema, stored procedures and data along with its transaction log.

To demonstrate I'll create a db.bak file in my home folder and change its group ownership to mssql so the server has permissions to work with the file.

$ touch db.bak
$ sudo chown mark:mssql db.bak

I'll then launch the CLI and issue a command to backup the WorldCup2014 database to the file I just created.

$ mssql-cli
BACKUP DATABASE WorldCup2014 TO DISK = '/home/mark/db.bak';

To restore the database I'll first pick out the logical names of the data and log files. Below you can see they're WorldCup2014 and WorldCup2014_log respectively.

RESTORE FILELISTONLY FROM DISK = '/home/mark/db.bak';
-[ RECORD 1 ]-------------------------
LogicalName          | WorldCup2014
PhysicalName         | /var/opt/mssql/data/WorldCup2014.mdf
Type                 | D
FileGroupName        | PRIMARY
Size                 | 8388608
MaxSize              | 35184372080640
FileId               | 1
CreateLSN            | 0
DropLSN              | 0
UniqueId             | cd3f2e59-0848-4e62-adcd-5555a95021f4
ReadOnlyLSN          | 0
ReadWriteLSN         | 0
BackupSizeInBytes    | 2883584
SourceBlockSize      | 4096
FileGroupId          | 1
LogGroupGUID         | NULL
DifferentialBaseLSN  | 0
DifferentialBaseGUID | 00000000-0000-0000-0000-000000000000
IsReadOnly           | 0
IsPresent            | 1
TDEThumbprint        | NULL
SnapshotUrl          | NULL
-[ RECORD 2 ]-------------------------
LogicalName          | WorldCup2014_log
PhysicalName         | /var/opt/mssql/data/WorldCup2014_log.ldf
Type                 | L
FileGroupName        | NULL
Size                 | 8388608
MaxSize              | 2199023255552
FileId               | 2
CreateLSN            | 0
DropLSN              | 0
UniqueId             | f9909495-16b1-4d42-85d9-ac56f85cf611
ReadOnlyLSN          | 0
ReadWriteLSN         | 0
BackupSizeInBytes    | 0
SourceBlockSize      | 4096
FileGroupId          | 0
LogGroupGUID         | NULL
DifferentialBaseLSN  | 0
DifferentialBaseGUID | 00000000-0000-0000-0000-000000000000
IsReadOnly           | 0
IsPresent            | 1
TDEThumbprint        | NULL
SnapshotUrl          | NULL

I'll create two target files in my home folder to restore the data and log files to and assign their group ownership to the mssql group.

$ touch ~/restored.{mdf,ldf}
$ sudo chown mark:mssql ~/restored.{mdf,ldf}

I will then launch the CLI, create a new database called restored, switch into single user mode and restore the backup to the two restored files in my home directory.

$ mssql-cli
CREATE DATABASE restored;
ALTER DATABASE restored
    SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
RESTORE DATABASE restored
    FROM DISK = '/home/mark/db.bak'
    WITH REPLACE, RECOVERY,
        MOVE 'WorldCup2014' TO '/home/mark/restored.mdf',
        MOVE 'WorldCup2014_log' TO '/home/mark/restored.ldf';

With the database recovered I can switch to the restored database and select out the three countries the model predicted to top the 2014 World Cup as well as their probabilities of winning.

USE restored;

SELECT TOP 3 country, win
FROM Predictions
ORDER BY win DESC;
+-----------+-------------------+
| country   | win               |
|-----------+-------------------|
| Brazil    | 0.453437174610376 |
| Argentina | 0.127798660739543 |
| Germany   | 0.106981061567489 |
+-----------+-------------------+

Export to CSV

If you want to dump the output of a SELECT statement to CSV files and the output is made up of nothing more than integers, fractional numbers and strings that don't require escaping, then sqlcmd can be of good use.

The following will set the environment variable holding the database password for sqlcmd.

$ export SQLCMDPASSWORD=$MSSQL_CLI_PASSWORD

The following will dump the predictions to a CSV file.

$ sqlcmd \
    -S localhost \
    -U SA \
    -d WorldCup2014 \
    -s',' \
    -W \
    -h-1 \
    -Q "SET NOCOUNT ON; SELECT * FROM Predictions" \
    > out.csv
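
When the output does contain strings with embedded commas, quotes or newlines, it's safer to let a proper CSV writer handle the quoting. Below is a sketch using Python's csv module; the commented pymssql call is an assumption about how it would be wired to the live server rather than a tested pipeline.

```python
import csv

def dump_rows(rows, fh):
    """Write an iterable of row tuples as CSV, quoting any field
    containing commas, double quotes or newlines."""
    writer = csv.writer(fh)
    writer.writerows(rows)

# Against the live server the rows would come from a pymssql cursor:
#
#     with pymssql.connect(server='127.0.0.1', user='SA',
#                          password=os.environ['MSSQL_CLI_PASSWORD'],
#                          database='WorldCup2014') as conn:
#         cursor = conn.cursor()
#         cursor.execute('SELECT * FROM Predictions')
#         with open('out.csv', 'w') as out:
#             dump_rows(cursor, out)
```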

Export to SQL

In early 2017, work began on mssql-scripter, a Python-based CLI tool that could export SQL Server databases to plain-text SQL files. The following will install the tool and set the environment variable for the database password so you don't need to enter it each time you use the tool.

$ pip install mssql-scripter
$ export MSSQL_SCRIPTER_PASSWORD=$MSSQL_CLI_PASSWORD

The following will dump out the World Cup 2014 database.

$ mssql-scripter \
    --server localhost \
    --database WorldCup2014 \
    --user SA \
    --schema-and-data \
    > dump.sql

A lot of effort has gone into ensuring as much of the database's state as possible is preserved in the .sql dump file. These are the first 30 lines of the above command's output.

$ head -n30 dump.sql
USE [master]
GO
/****** Object:  Database [WorldCup2014]    Script Date: 8/12/18 7:37:35 AM ******/
CREATE DATABASE [WorldCup2014]
 CONTAINMENT = NONE
 ON  PRIMARY
( NAME = N'WorldCup2014', FILENAME = N'/var/opt/mssql/data/WorldCup2014.mdf' , SIZE = 8192KB , MAXSIZE = UNLIMITED, FILEGROWTH = 65536KB )
 LOG ON
( NAME = N'WorldCup2014_log', FILENAME = N'/var/opt/mssql/data/WorldCup2014_log.ldf' , SIZE = 8192KB , MAXSIZE = 2048GB , FILEGROWTH = 65536KB )
GO
IF (1 = FULLTEXTSERVICEPROPERTY('IsFullTextInstalled'))
begin
EXEC [WorldCup2014].[dbo].[sp_fulltext_database] @action = 'enable'
end
GO
ALTER DATABASE [WorldCup2014] SET ANSI_NULL_DEFAULT OFF
GO
ALTER DATABASE [WorldCup2014] SET ANSI_NULLS OFF
GO
ALTER DATABASE [WorldCup2014] SET ANSI_PADDING OFF
GO
ALTER DATABASE [WorldCup2014] SET ANSI_WARNINGS OFF
GO
ALTER DATABASE [WorldCup2014] SET ARITHABORT OFF
GO
ALTER DATABASE [WorldCup2014] SET AUTO_CLOSE OFF
GO
ALTER DATABASE [WorldCup2014] SET AUTO_SHRINK OFF
GO
ALTER DATABASE [WorldCup2014] SET AUTO_UPDATE_STATISTICS ON
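
One thing to keep in mind: GO is not T-SQL. It's a batch separator interpreted client-side by tools like sqlcmd, so replaying dump.sql through a driver such as pymssql means splitting the script into batches first. A minimal sketch:

```python
def split_batches(script):
    """Split a T-SQL script on lines containing only GO, the
    client-side batch separator used throughout dump.sql."""
    batches, current = [], []
    for line in script.splitlines():
        if line.strip().upper() == 'GO':
            if current:
                batches.append('\n'.join(current))
            current = []
        else:
            current.append(line)
    if current:
        batches.append('\n'.join(current))
    return batches
```

Each returned batch could then be passed to cursor.execute() in turn.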

Plotting Data with Jupyter Notebook

You can source SQL Server-stored data for analysis in Jupyter Notebook via the pymssql library. The following will install Pandas, Jupyter Notebook, Matplotlib as well as pymssql.

$ pip install \
    jupyter \
    matplotlib \
    pandas \
    pymssql

I'll set a password for the Notebook server and then launch it.

$ mkdir ~/.jupyter
$ jupyter-notebook password
$ jupyter-notebook \
        --no-browser \
        --ip=0.0.0.0 \
        --NotebookApp.iopub_data_rate_limit=100000000

With the server running I'll navigate to http://127.0.0.1:8888 in a web browser.

When the Web UI loads, select "New" in the top right corner and then "Python 2". You should now see a new, blank notebook. The following plots the relationship between the probabilities of various competing countries reaching the quarter and semi finals of the 2014 World Cup.

I usually keep my imports in the top cell.

import os

import pandas as pd
import pymssql

The code to generate the scatter plot lives in the second cell.

with pymssql.connect(server="127.0.0.1",
                     user="SA",
                     password=os.environ['MSSQL_CLI_PASSWORD'],
                     database="WorldCup2014") as conn:
    df = pd.read_sql("SELECT quarter, semi FROM Predictions", conn)
    df.plot.scatter(x='quarter', y='semi', c='DarkBlue')
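
The same DataFrame can also quantify the relationship the scatter plot shows. The following is a follow-up cell sketch; the literal values below are stand-ins for the quarter and semi columns, which against the live database would come from the pd.read_sql call above.

```python
import pandas as pd

# Stand-in values; in the notebook df is read from SQL Server
# via pd.read_sql in the previous cell.
df = pd.DataFrame({'quarter': [0.81, 0.52, 0.33, 0.10],
                   'semi':    [0.63, 0.30, 0.18, 0.04]})

# Pearson correlation between the probabilities of reaching
# the quarter finals and the semi finals.
correlation = df['quarter'].corr(df['semi'])
```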

Using Airflow to Automate SQL Server Tasks

Apache Airflow is a popular tool for automating routine data pipeline tasks. I go into its features in my Building a Data Pipeline with Airflow blog post as well as my Customising Airflow: Beyond Boilerplate Settings blog post. The following will set up a minimal Airflow installation.

$ pip install apache-airflow
$ airflow initdb
$ airflow upgradedb

I'll launch the web server and then navigate to http://127.0.0.1:8080/ in a web browser.

$ airflow webserver

With the Airflow Web UI visible I'll click on "Admin" in the navigation followed by "Connections". I'll then edit the existing "mssql_default" connection. I'll change the Login to "SA", set its password and change the schema to "WorldCup2014".

With that in place I'll launch a Python REPL.

$ python

The following will create a DAG with a one-off task to count the number of predictions in the World Cup 2014 database.

import airflow
from airflow.operators.mssql_operator import MsSqlOperator


default_args = {
    'owner':            'airflow',
    'depends_on_past':  False,
    'start_date':       airflow.utils.dates.days_ago(0),
}

dag = airflow.DAG('world_cup',
                  default_args=default_args,
                  description='Count the predictions')

task = MsSqlOperator(dag=dag,
                     mssql_conn_id='mssql_default',
                     task_id='one-off',
                     sql="""SELECT COUNT(*) FROM Predictions;""")

task.run()

The following will print out the three teams deemed most likely to win the tournament.

from airflow.hooks.mssql_hook import MsSqlHook


hook = MsSqlHook()
res = hook.get_records("""SELECT TOP 3 country, win
                          FROM Predictions
                          ORDER BY win DESC;""")

for record in res:
    print record

Thank you for taking the time to read this post. I offer consulting, architecture and hands-on development services to clients in North America & Europe. If you'd like to discuss how my offerings can help your business please contact me via LinkedIn.

Copyright © 2014 - 2017 Mark Litwintschik. This site's template is based off a template by Giulio Fidente.