Systems Monitoring: top vs Htop vs Glances

When developing a piece of software or monitoring a running system, both telemetry and context are important. Once one understands what normal behaviour looks like in a historical context, the two most pressing questions are often (1) what's changed? and (2) what's acting abnormally?

In this post, I'm going to look at three popular tools often used for ad-hoc monitoring, as well as a simple approach to monitoring distributed systems.

top

In virtually any modern UNIX-like system you can type top and see a variety of system performance metrics updating every few seconds.

$ top -b -n2 -d5

top - 09:43:05 up  1:08,  0 users,  load average: 0.52, 0.58, 0.59
Tasks:   4 total,   1 running,   3 sleeping,   0 stopped,   0 zombie
%Cpu0  :  4.1 us, 22.2 sy,  0.0 ni, 72.3 id,  0.0 wa,  1.4 hi,  0.0 si,  0.0 st
%Cpu1  :  4.3 us,  7.1 sy,  0.0 ni, 87.7 id,  0.0 wa,  0.9 hi,  0.0 si,  0.0 st
%Cpu2  :  4.4 us,  9.0 sy,  0.0 ni, 85.3 id,  0.0 wa,  1.2 hi,  0.0 si,  0.0 st
%Cpu3  :  3.6 us,  6.7 sy,  0.0 ni, 88.6 id,  0.0 wa,  1.0 hi,  0.0 si,  0.0 st
KiB Mem:  33431016 total,  9521052 used, 23909964 free,    34032 buffers
KiB Swap: 62455548 total,    27064 used, 62428484 free.   188576 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    1 root      20   0    8304    132    104 S   0.0  0.0   0:00.14 /init ro
    3 root      20   0    8308     96     56 S   0.0  0.0   0:00.00 /init ro
    4 mark      20   0   17856   5308   5192 S   0.0  0.0   0:00.35 -bash
  228 mark      20   0   14452   1668   1172 R   0.0  0.0   0:00.01 top -b -n2 -d5
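
Batch mode makes top's output easy to post-process. Below is a minimal Python sketch that pulls the three load averages out of the summary line shown above. The function name parse_load_average is my own invention and the regular expression assumes the procps layout seen in this output; other versions of top may format the line differently.

```python
import re

# Matches the tail of top's first summary line, e.g.
# "load average: 0.52, 0.58, 0.59" (procps layout assumed).
LOAD_RE = re.compile(r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)")


def parse_load_average(line):
    """Return the 1, 5 and 15-minute load averages as floats, or None."""
    match = LOAD_RE.search(line)
    if match is None:
        return None
    return tuple(float(value) for value in match.groups())


if __name__ == "__main__":
    sample = "top - 09:43:05 up  1:08,  0 users,  load average: 0.52, 0.58, 0.59"
    print(parse_load_average(sample))
```

Lines that don't carry a load average, such as the Tasks line, simply return None.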

The binary running is most likely a version of top written by James Warner of Comcast. This version of top was an entirely new implementation, built as a replacement for a previous version written by developers from a variety of organisations including Lockheed Martin and Heidelberg University.

The top.c source code itself is reasonably simple and, as of this writing, is around 4,900 lines of C code. Top is still in active development and its source code can be found with the rest of the procps repository on GitLab. Other utilities found in this repo include kill, ps, sysctl, uptime and watch.

The default layout feels timeless to me but over the decades I've been working with UNIX systems I've developed muscle memory for typing zc1M every time I bring up top on a new machine.

Top uses a monochrome display mode by default so z will toggle into a colour-mapping mode. The number 1 will display separate CPU states and does a good job of highlighting loads that are bound to a single CPU core. I like to view processes sorted by their pressure on memory capacity by typing M. In total there are 49 metrics top can display and sort on.

Commands are truncated by default and typing c will give more extended information on their paths and arguments. My only complaint with this is that it's the end of each command and its arguments that is truncated; it would be more useful to keep both the beginning and the end of each command and argument string in order to differentiate between processes.
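
Keeping both ends of a long command line is straightforward to sketch. The Python below is only an illustration of the idea, not anything top does; truncate_middle and the example command are hypothetical.

```python
def truncate_middle(command, width, marker="..."):
    """Shorten command to at most width characters, keeping both ends."""
    if width < len(marker) + 2:
        raise ValueError("width too small to keep both ends")
    if len(command) <= width:
        return command
    keep = width - len(marker)
    head = (keep + 1) // 2  # the beginning gets the extra character
    tail = keep - head
    return command[:head] + marker + command[-tail:]


if __name__ == "__main__":
    # Hypothetical commands that only differ in their final argument.
    for suffix in ("a.xml", "b.xml"):
        print(truncate_middle("/usr/bin/example-server --config=/etc/example/" + suffix, 20))
```

Two processes that share a long prefix but differ in their final argument stay distinguishable, which is exactly what truncating only the end loses.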

The changes to top's configuration will only last as long as the session. To avoid this, type uppercase W and it'll save the current configuration to ~/.toprc by default. My only annoyance with this file is that it contains byte values above 0x7F and isn't easy to edit outside of top.

$ hexdump -C ~/.toprc | head

00000000  74 6f 70 27 73 20 43 6f  6e 66 69 67 20 46 69 6c  |top's Config Fil|
00000010  65 20 28 4c 69 6e 75 78  20 70 72 6f 63 65 73 73  |e (Linux process|
00000020  65 73 20 77 69 74 68 20  77 69 6e 64 6f 77 73 29  |es with windows)|
00000030  0a 49 64 3a 69 2c 20 4d  6f 64 65 5f 61 6c 74 73  |.Id:i, Mode_alts|
00000040  63 72 3d 30 2c 20 4d 6f  64 65 5f 69 72 69 78 70  |cr=0, Mode_irixp|
00000050  73 3d 31 2c 20 44 65 6c  61 79 5f 74 69 6d 65 3d  |s=1, Delay_time=|
00000060  33 2e 30 2c 20 43 75 72  77 69 6e 3d 30 0a 44 65  |3.0, Curwin=0.De|
00000070  66 09 66 69 65 6c 64 73  63 75 72 3d a5 a8 b3 b4  |f.fieldscur=....|
00000080  bb bd c0 c4 b7 ba b9 c5  26 27 29 2a 2b 2c 2d 2e  |........&')*+,-.|
00000090  2f 30 31 32 35 36 38 3c  3e 3f 41 42 43 46 47 48  |/012568<>?ABCFGH|
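
To see where the awkward bytes sit, a few lines of Python will do. This is a minimal sketch; the helper name high_bytes is my own, and the sample is the row at offset 0x70 of the hexdump above.

```python
def high_bytes(data):
    """Return (offset, value) pairs for every byte above 0x7F."""
    return [(offset, value) for offset, value in enumerate(data) if value > 0x7F]


if __name__ == "__main__":
    # Bytes from offsets 0x70 to 0x7f of the hexdump above.
    row = bytes.fromhex("66 09 66 69 65 6c 64 73 63 75 72 3d a5 a8 b3 b4")
    for offset, value in high_bytes(row):
        print(f"+{offset:#04x}: {value:#04x}")
```

Passing the bytes of the whole file to the same function should list every position a plain ASCII editor would struggle with.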

Htop

In 2004, Hisham Muhammad began work on creating a distinctly different systems telemetry monitor. Htop put a focus on telemetry display organisation. There are bar charts for key CPU and memory metrics, processes can toggle between a flat list and a hierarchy via the F5 shortcut, field sorting can be done via mouse clicks and seven different colour modes are supported.

The software does a good job of keeping you within the application. If you want to inspect the files a process is using, you can select the process and simply type l. If you want to run the process through strace, simply type s while running htop as a privileged user.

The following will install and run htop on Ubuntu 16.04.2 LTS.

$ sudo apt install htop
$ htop

 1  [                                         0.0%]   Tasks: 37, 145 thr; 1 running
 2  [                                         0.0%]   Load average: 0.03 0.05 0.07
 3  [                                         0.0%]   Uptime: 01:31:42
 4  [                                         0.0%]
 Mem[||||||||||||||||||||||||||||||||  1.03G/3.84G]
 Swp[                                     0K/4.00G]

  PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
    1 root       20   0 37556  5668  4004 S  0.0  0.1  0:03.03 /sbin/init noprompt
27884 clickhous  20   0 3716M  359M 49184 S  0.7  9.1  0:24.93 ├─ /usr/bin/clickhouse-server --config=/etc/cli
29668 clickhous  20   0 3716M  359M 49184 S  0.0  9.1  0:00.10 │  ├─ /usr/bin/clickhouse-server --config=/etc/
29667 clickhous  20   0 3716M  359M 49184 S  0.0  9.1  0:01.02 │  ├─ /usr/bin/clickhouse-server --config=/etc/
29666 clickhous  20   0 3716M  359M 49184 S  0.0  9.1  0:00.08 │  ├─ /usr/bin/clickhouse-server --config=/etc/
29665 clickhous  20   0 3716M  359M 49184 S  0.0  9.1  0:00.48 │  ├─ /usr/bin/clickhouse-server --config=/etc/
29409 clickhous  20   0 3716M  359M 49184 S  0.0  9.1  0:03.48 │  ├─ /usr/bin/clickhouse-server --config=/etc/
29408 clickhous  20   0 3716M  359M 49184 S  0.0  9.1  0:02.15 │  ├─ /usr/bin/clickhouse-server --config=/etc/

In terms of configuration, any changes made while using the software will be saved automatically to ~/.config/htop/htoprc by default. This file is text-based but comes with the following warning:

$ head -n2 ~/.config/htop/htoprc

# Beware! This file is rewritten by htop when settings are changed in the interface.
# The parser is also very primitive, and not human-friendly.
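
Given that warning, reading the file is safer than editing it. Below is a minimal Python sketch for inspecting the settings, assuming the rest of the file is made up of key=value lines. The function name parse_htoprc is hypothetical and the two settings in the sample are assumed examples rather than lines copied from a real file.

```python
def parse_htoprc(text):
    """Return the key=value settings found in htoprc-style text."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the comment header htop writes at the top.
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition("=")
        if separator:
            settings[key.strip()] = value.strip()
    return settings


if __name__ == "__main__":
    # The two settings below are assumed examples, not copied from a real file.
    sample = (
        "# Beware! This file is rewritten by htop when settings are changed in the interface.\n"
        "# The parser is also very primitive, and not human-friendly.\n"
        "color_scheme=0\n"
        "delay=15\n"
    )
    print(parse_htoprc(sample))
```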

The source code is still quite small given the functionality on offer. As of this writing, there's a total of ~12,000 lines of C code with other files making up a further ~3,000 lines of code.

Glances

Glances is a Python-based systems telemetry monitor. The project was started by Nicolas Hennion in 2011. Nicolas' LinkedIn profile states he works in the South of France as a Project Manager in the Satellite Control Centre Department for Thales Alenia Space.

When you launch Glances, in addition to the usual CPU and memory metrics and the process list, you'll see the cloud virtual machine type as well as network, disk and Docker container activity, to name just a few items.

$ glances

ubuntu (Ubuntu 16.04 64bit / Linux 4.4.0-62-generic)                                            Uptime: 18:55:00

CPU  [  1.7%]   CPU -     1.7%  nice:     0.0%  ctx_sw:   923   MEM -   53.1%   SWAP -    0.1%   LOAD    4-core
MEM  [ 53.1%]   user:     0.8%  irq:      0.0%  inter:    587   total:  3.84G   total:   4.00G   1 min:    0.20
SWAP [  0.1%]   system:   0.7%  iowait:   0.0%  sw_int:   786   used:   2.04G   used:    3.27M   5 min:    0.14
                idle:    98.4%  steal:    0.0%                  free:   1.80G   free:    3.99G   15 min:   0.10

NETWORK       Rx/s   Tx/s   TASKS 203 (349 thr), 1 run, 202 slp, 0 oth sorted automatically by CPU consumption
ens33         152b    3Kb
lo            59Kb   59Kb   CPU%   MEM%  VIRT  RES      PID USER          TIME+ THR  NI S  R/s W/s  Command
                            2.6    4.5   524M  178M   16470 mark          35:48 1     0 S    0 0    /home/mark/.
DISK I/O       R/s    W/s   2.3    0.6   372M  24.5M  14672 mark           0:01 1     0 R    0 0    /home/mark/.
fd0              0      0   1.0    23.7  5.42G 931M   21151 root          13:00 71    0 S    ? ?    java -Xmx1G
loop0            0      0   0.7    9.8   3.71G 385M   27884 clickhous      5:29 46    0 S    ? ?    /usr/bin/cli
loop1            0      0   0.3    2.8   3.53G 109M   12883 zookeeper      1:36 20    0 S    ? ?    /usr/bin/jav
loop2            0      0   0.3    0.2   31.4M 6.80M    333 root           0:53 1     0 S    ? ?    /lib/systemd
loop3            0      0   0.3    0.1   13.8M 2.68M   4353 mark           1:07 1     0 S    0 0    watch ifconf
loop4            0      0   0.0    0.3   186M  9.86M   1447 root           0:35 2     0 S    ? ?    /usr/bin/vmt
loop5            0      0   0.0    0.2   75.2M 8.11M   1470 root           0:00 1     0 S    ? ?    /usr/bin/VGA
loop6            0      0   0.0    0.2   90.6M 6.59M   4381 root           0:00 1     0 S    ? ?    sshd: mark [
loop7            0      0   0.0    0.1   269M  5.75M    595 root           0:13 3     0 S    ? ?    /usr/lib/acc
sda              0    78K   0.0    0.1   36.7M 5.37M      1 root           0:37 1     0 S    ? ?    /sbin/init n
sda1             0    78K   0.0    0.1   64.0M 5.31M   4246 root           0:00 1     0 S    ? ?    /usr/sbin/ss
sda2             0      0   0.0    0.1   44.3M 5.05M   3402 mark           0:00 1     0 S    0 0    /lib/systemd
sda5             0      0   0.0    0.1   21.8M 5.04M   4403 mark          27:23 1     0 S    0 0    -bash
sr0              0      0   0.0    0.1   21.8M 4.93M  21493 mark           0:10 1     0 S    0 0    /bin/bash
sr1              0      0   0.0    0.1   21.7M 4.62M  16114 mark           0:03 1     0 S    0 0    /bin/bash
                            0.0    0.1   21.7M 4.47M  21119 mark           0:00 1     0 S    0 0    /bin/bash
FILE SYS      Used  Total   0.0    0.1   90.6M 4.14M   4402 mark           0:08 1     0 S    ? ?    0
/ (sda1)     2.48G  15.6G   0.0    0.1   250M  3.97M    588 syslog         0:28 4     0 S    ? ?    /usr/sbin/rs
                            0.0    0.1   21.8M 3.87M   3407 mark           0:04 1     0 S    0 0    -bash
SENSORS                     0.0    0.1   51.5M 3.76M  21144 root           0:00 1     0 S    ? ?    sudo nohup /
Physical id          100C   0.0    0.1   41.9M 3.64M    597 messagebu      0:00 1     0 S    ? ?    /usr/bin/dbu
Core 0               100C   0.0    0.1   43.2M 3.45M    396 root           0:01 1     0 S    ? ?    /lib/systemd
Core 1               100C   0.0    0.1   64.3M 3.21M   3377 root           0:00 1     0 S    ? ?    /bin/login -
Core 2               100C   0.0    0.1   28.0M 2.91M    592 root           0:00 1     0 S    ? ?    /lib/systemd
Core 3               100C   0.0    0.1   26.7M 2.86M  16113 mark           0:06 1     0 S    ? ?    SCREEN
                            0.0    0.1   15.7M 2.81M    774 root           0:00 1     0 S    ? ?    /sbin/dhclie

Glances is written in ~10K lines of Python and ~25K lines of JavaScript and relies on the psutil package for its telemetry collection. There is a huge variety of plugins including support for monitoring GPUs, Kafka, RAID setups, folder monitoring and WiFi, to name a few.

In addition to the ncurses-based interface, Glances can also run as a web application. When you run glances via cmd.exe on Windows 10 it'll launch a Bottle-based Web Application on TCP port 61209. When you load up http://127.0.0.1:61209/ in a web browser you'll be greeted with an AngularJS-based Application that mimics the ncurses interface.

There is an API exposed as well if you want to consume it with other tools.

$ curl http://127.0.0.1:61209/api/3/all \
    | python -mjson.tool \
    | head -n50

{
    "alert": [],
    "amps": [],
    "batpercent": [],
    "cloud": {},
    "core": {
        "log": 4,
        "phys": 4
    },
    "cpu": {
        "cpucore": 4,
        "ctx_switches": 182358,
        "idle": 82.9,
        "interrupts": 113134,
        "soft_interrupts": 0,
        "syscalls": 215848,
        "system": 12.5,
        "time_since_update": 8.532670974731445,
        "total": 9.8,
        "user": 3.1
    },
    "diskio": [
        {
            "disk_name": "PhysicalDrive6",
            "key": "disk_name",
            "read_bytes": 0,
            "read_count": 0,
            "time_since_update": 8.492774963378906,
            "write_bytes": 0,
            "write_count": 0
        },
        {
            "disk_name": "PhysicalDrive2",
            "key": "disk_name",
            "read_bytes": 0,
            "read_count": 0,
            "time_since_update": 8.492774963378906,
            "write_bytes": 0,
            "write_count": 0
        },
...
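
Once the response is decoded, consuming it from Python is a matter of picking fields out of a dictionary. The sketch below works on a trimmed copy of the response above; cpu_summary and FIELDS are my own names, and the code assumes the cpu section keeps the keys shown here.

```python
import json

# Keys taken from the "cpu" section of the response shown above.
FIELDS = ("cpucore", "total", "user", "system", "idle")


def cpu_summary(payload):
    """Pick a few CPU fields out of a decoded /api/3/all response."""
    cpu = payload.get("cpu", {})
    return {name: cpu.get(name) for name in FIELDS}


if __name__ == "__main__":
    # Trimmed copy of the response shown above.
    sample = '{"cpu": {"cpucore": 4, "idle": 82.9, "system": 12.5, "total": 9.8, "user": 3.1}}'
    print(cpu_summary(json.loads(sample)))
```

Missing keys come back as None rather than raising, which seems the friendlier behaviour if a plugin is disabled or the response changes shape.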

The default configuration file is somewhat lengthy but is friendly enough for human editing.

Glances also supports exporting telemetry to over 16 different targets including statsd, Kafka, RabbitMQ, JSON, SVG, Elasticsearch, CSV as well as to bespoke RESTful APIs.

Feeding Glances into Kafka

Below I'll walk through exporting telemetry to a CSV file and then feeding that into Kafka. My thinking behind this is that local disk is usually more reliable than a network connection, and if the network connection were to fail, the local file could be backfilled into Kafka afterwards.

The following was run on a fresh installation of Ubuntu 16.04.2 LTS.

$ sudo apt update
$ sudo apt install \
    kafkacat \
    python-pip \
    python-virtualenv \
    screen \
    zookeeperd

I'll install Kafka manually using the binary package distributed by one of Apache's mirrors.

$ sudo mkdir -p /opt/kafka
$ wget -c -O kafka.tgz \
    http://www-eu.apache.org/dist/kafka/1.1.1/kafka_2.11-1.1.1.tgz
$ sudo tar xzvf kafka.tgz \
    --directory=/opt/kafka \
    --strip 1

I'll then create a log file for Kafka which will be owned by my UNIX account.

$ sudo touch /var/log/kafka.log
$ sudo chown mark /var/log/kafka.log

Much of Kafka's distributed functionality is supported by ZooKeeper. The following command will launch the service.

$ sudo /etc/init.d/zookeeper start

With ZooKeeper up, I'll launch Kafka's server process.

$ sudo nohup /opt/kafka/bin/kafka-server-start.sh \
             /opt/kafka/config/server.properties \
             > /var/log/kafka.log 2>&1 &
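
Kafka can take a few seconds to start accepting connections. A quick way to check is to try opening a TCP connection to the broker. This is a minimal Python sketch; port_open is a hypothetical helper, and it assumes the broker listens on localhost port 9092, the address used with kafkacat later in this post.

```python
import socket


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Assumes the broker address used with kafkacat later in this post.
    print(port_open("localhost", 9092))
```

A successful connection only shows that something is listening on the port, not that the broker is healthy, so treat it as a first sanity check.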

I'll then create a Python virtual environment and install Glances as well as CSVKit so I can analyse the CSV output from Glances.

$ virtualenv ~/.monitoring
$ source ~/.monitoring/bin/activate
$ pip install \
    csvkit \
    glances

Below I'll launch a screen session and start Glances. It will both display the ncurses interface and write 215 pieces of telemetry to ~/glances.csv.

$ screen
$ glances --export csv \
          --export-csv-file ~/glances.csv

Once that's running type CTRL-A and then CTRL-D to return to your regular shell.

As you can see, there is a large amount of telemetry being collected.

$ csvstat --type ~/glances.csv | tail

206. mem_available: Number
207. mem_used: Number
208. mem_cached: Number
209. mem_percent: Number
210. mem_free: Number
211. mem_inactive: Number
212. mem_active: Number
213. mem_shared: Number
214. mem_total: Number
215. mem_buffers: Number

Kafkacat is a non-JVM Kafka producer and consumer that's written in C. When statically linked it's less than 150 KB in size. Below I'll use it to feed the contents of ~/glances.csv into a Kafka topic called "glances_log" and use Snappy compression on the contents.

$ screen
$ tail -F ~/glances.csv \
    | kafkacat -b localhost:9092 \
               -t glances_log \
               -z snappy

Again, once that's running type CTRL-A and then CTRL-D to return to your regular shell.

Any of the above commands running in screen sessions could be easily added to Supervisord. It would do a good job of restarting processes if they were to fail for any reason.
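
The backfill idea mentioned earlier needs some record of how much of the CSV has already been shipped. Below is a minimal Python sketch of that bookkeeping: it returns the complete lines found after a saved byte offset along with the new offset to store. The function name read_new_lines is hypothetical, and sending the lines to Kafka and persisting the offset are left out.

```python
import os
import tempfile


def read_new_lines(path, offset):
    """Return (lines, new_offset) for complete lines found after offset."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        chunk = handle.read()
    # Ignore a trailing partial line; the writer may still be appending to it.
    end = chunk.rfind(b"\n") + 1
    lines = chunk[:end].decode("utf-8").splitlines()
    return lines, offset + end


if __name__ == "__main__":
    # Demonstrate with a throwaway file whose last line is incomplete.
    with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
        tmp.write(b"a,1\nb,2\nc,")
    lines, offset = read_new_lines(tmp.name, 0)
    print(lines, offset)
    os.unlink(tmp.name)
```

Only advancing the offset past complete lines means a half-written record is picked up on the next pass rather than sent truncated.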

With all the above running I'll take a look at the first three columns of data for the first 100 records.

$ /opt/kafka/bin/kafka-console-consumer.sh \
        --topic glances_log \
        --from-beginning \
        --zookeeper localhost:2181 \
    | head -n100 \
    | csvstat --columns 1-3 \
              --no-header-row

Below are the statistics for the timestamp column, number of CPU cores and 1-minute load averages collected in the first 100 records.

1. "a"

      Type of data:          DateTime
      Contains null values:  False
      Unique values:         100
      Smallest value:        2018-10-07 05:53:49
      Largest value:         2018-10-07 05:58:55
      Most common values:    2018-10-07 05:53:49 (1x)
                             2018-10-07 05:53:52 (1x)
                             2018-10-07 05:53:55 (1x)
                             2018-10-07 05:53:58 (1x)
                             2018-10-07 05:54:01 (1x)

2. "b"

      Type of data:          Number
      Contains null values:  False
      Unique values:         1
      Smallest value:        4
      Largest value:         4
      Sum:                   400
      Mean:                  4
      Median:                4
      StDev:                 0
      Most common values:    4 (100x)

3. "c"

      Type of data:          Number
      Contains null values:  False
      Unique values:         18
      Smallest value:        0.02
      Largest value:         0.22
      Sum:                   6.57
      Mean:                  0.066
      Median:                0.05
      StDev:                 0.045
      Most common values:    0.04 (15x)
                             0.02 (14x)
                             0.03 (13x)
                             0.06 (9x)
                             0.05 (9x)
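
If csvkit isn't available, the headline figures are easy to reproduce with Python's standard library. The sketch below summarises one column of headerless CSV text; column_stats is my own name and the three records are illustrative rather than taken from the topic.

```python
import csv
import io
import statistics


def column_stats(text, index):
    """Summarise one numeric column of headerless CSV text."""
    values = [float(row[index]) for row in csv.reader(io.StringIO(text)) if row]
    return {
        "count": len(values),
        "smallest": min(values),
        "largest": max(values),
        "sum": sum(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


if __name__ == "__main__":
    # Three illustrative records in the same column order:
    # timestamp, CPU cores, 1-minute load average.
    sample = (
        "2018-10-07 05:53:49,4,0.04\n"
        "2018-10-07 05:53:52,4,0.02\n"
        "2018-10-07 05:53:55,4,0.06\n"
    )
    print(column_stats(sample, 2))
```

The mean is simply the sum divided by the record count, which matches the output above: 6.57 / 100 rounds to 0.066.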

Thank you for taking the time to read this post. I offer both consulting and hands-on development services to clients in North America and Europe. If you'd like to discuss how my offerings can help your business please contact me via LinkedIn.