Home | Benchmarks | Categories | Atom Feed

Posted on Tue 24 August 2021 under File and Data Management

Track changes in Excel, Word, PowerPoint, PDFs, Images & Videos with Git

Git is an open-source, distributed version control system. It was designed by Linus Torvalds in the Spring of 2005 with the goal of improving the workflow around the development of the Linux Kernel. Within the space of a month, Linus, along with Junio Hamano and a handful of others, had written the initial C code and used Git to merge multiple branches together. By June of 2006, Git had been used for the Linux Kernel's version 2.6.12 release.

Git can describe changes in files well when the contents can be rendered as plain text. Unfortunately, the underlying diff tool being used will simply state that binary files "differ".

$ git diff test.mp4
diff --git a/test.mp4 b/test.mp4
index 9a9efb7..d9607c6 100644
Binary files a/test.mp4 and b/test.mp4 differ

Microsoft Office began using XML files within PKZip containers in 2007. This made the task of building open-source readers for Word, Excel and PowerPoint documents much easier compared to Microsoft's previous formats. Between Pandas and Pandoc you can dump the contents of any given Office document into plain text. These plain text dumps can then be compared against one another and this process can be wrapped up inside the git diff command.

Likewise, Phil Harvey's ExifTool tool can be used to dump metadata around images and videos. While these aren't visual comparisons, they can still be a useful starting place for understanding when change has taken place.

In this post, I'll describe ways of using alternative deserialisation tools so that you can get more informative difference information between binary files using Git.

Installing Tooling

Using Homebrew, I'll install ExifTool, Git, Pandas and Python 3 on my MacBook Pro.

$ brew install \
    exiftool \
    git \
    pandoc \
    virtualenv

I'll set up a virtual environment and install some Python packages. These will be used for extracting document contents.

$ virtualenv ~/.officecomp
$ source ~/.officecomp/bin/activate
$ python3 -m pip install \
    openpyxl \
    Pandas \
    python-pptx \
    textract

Comparing Office Documents

I'll initialise a new Git repository, configure it and then populate it with blank Word, Excel and PowerPoint documents.

$ mkdir -p ~/repo
$ cd ~/repo
$ git init

The following will map various file extensions to alternative diff configurations.

$ vi .gitattributes
*.docx diff=word
*.pptx diff=powerpoint
*.xlsx diff=excel
*.pdf diff=pdf
*.jpg diff=images_videos
*.png diff=images_videos
*.gif diff=images_videos
*.mp4 diff=images_videos

Inside of the repository's configuration, I'll set up BASH commands that will either extract a file's contents as plain text or dump its metadata. I will also set up a wdiff alias which will provide a colourful and concise output for the diff command.

$ vi .git/config
[core]
    repositoryformatversion = 0
    filemode = true
    bare = false
    logallrefupdates = true
    ignorecase = true

[alias]
    wdiff = diff --word-diff=color --unified=1

[diff "word"]
    textconv=pandoc --to=markdown
    binary=true
    prompt=false

[diff "excel"]
    textconv=python3 /Users/mark/Downloads/test/xlsx-dump.py
    binary=true

[diff "powerpoint"]
    textconv=python /Users/mark/Downloads/test/pptx-dump.py
    binary=true

[diff "pdf"]
    textconv=python3 /Users/mark/Downloads/test/pdf-dump.py
    binary=true

[diff "images_videos"]
    textconv=exiftool
    binary=true

Below you can see the three Office documents. Their file extensions hide that they are actually ZIP files containing XML data.

$ ls -l *.{docx,pptx,xlsx}
-rw-r--r--@ 1 mark  staff   8145 Aug 24 10:34 Book1.xlsx
-rw-r--r--@ 1 mark  staff  11588 Aug 24 10:34 Doc1.docx
-rw-r--r--@ 1 mark  staff  32326 Aug 24 10:35 Presentation1.pptx

I'll add them to the repository and commit them so that we can compare future changes to these files.

$ git add \
    Book1.xlsx \
    Doc1.docx \
    Presentation1.pptx
$ git commit -am'Adding blank documents'

Comparing Word Documents

I've added a heading to the Word document and saved it. Below is the output when running wdiff.

$ git wdiff Doc1.docx
diff --git a/Doc1.docx b/Doc1.docx
index 90ff36f..5b96082 100644
--- a/Doc1.docx
+++ b/Doc1.docx
@@ -1 +1 @@
# Hello

I can also view changes to this file in reverse chronological order.

$ git log -p --word-diff=color Doc1.docx
commit c824375dac2f78cc8bcfacdb6766e00219efe0aa
Author: Mark Litwintschik
Date:   Tue Aug 24 11:14:53 2021 +0300

    Adding initial screenshot

diff --git a/Doc1.docx b/Doc1.docx
index 5b96082..d05790f 100644
--- a/Doc1.docx
+++ b/Doc1.docx
@@ -1 +1 @@
# HelloHello2

commit cfc43b74eee780bcb18ec3dae37d910daa86a428
Author: Mark Litwintschik
Date:   Tue Aug 24 11:01:42 2021 +0300

    Saving inital PDF

diff --git a/Doc1.docx b/Doc1.docx
index 90ff36f..5b96082 100644
--- a/Doc1.docx
+++ b/Doc1.docx
@@ -1 +1 @@
# Hello

commit 78584bb9b7464a02e4e3784418e13465c612f21d
Author: Mark Litwintschik
Date:   Tue Aug 24 10:36:05 2021 +0300

    Adding blank documents

diff --git a/Doc1.docx b/Doc1.docx
new file mode 100644
index 0000000..90ff36f
--- /dev/null
+++ b/Doc1.docx
@@ -0,0 +1 @@

Comparing PowerPoint Documents

I'll build a small PowerPoint to plain text dump utility using the python-pptx library that was installed earlier.

$ vi pptx-dump.py
import sys

from pptx import Presentation


for slide in Presentation(sys.argv[1]).slides:
    for shape in slide.shapes:
        if not shape.has_text_frame:
            continue

        for paragraph in shape.text_frame.paragraphs:
            for run in paragraph.runs:
                print(run.text)

I'll then add a heading to Presentation1.pptx and view the difference using Git.

$ git wdiff Presentation1.pptx
diff --git a/Presentation1.pptx b/Presentation1.pptx
index 9df8333..ff5383c 100644
--- a/Presentation1.pptx
+++ b/Presentation1.pptx
@@ -0,0 +1 @@
Hello

Comparing Excel Documents

I'll build a small Excel to plain text dump utility using Pandas and OpenPyXL. This will iterate through each worksheet and dumps its contents, including headers, in CSV format.

$ vi xlsx-dump.py
from io import StringIO
import sys

import pandas as pd


for sheet_name in pd.ExcelFile(sys.argv[1]).sheet_names:
    output = StringIO()
    print('Sheet: %s' % sheet_name)
    pd.read_excel(sys.argv[1], sheet_name=sheet_name)\
      .to_csv(output,
              header=True,
              index=False)
    print(output.getvalue())

I'll then change some values on both Sheet1 and Sheet2 within Book1.xlsx and view the difference.

$ git wdiff Book1.xlsx
diff --git a/Book1.xlsx b/Book1.xlsx
index 1678ee3..58278dc 100644
--- a/Book1.xlsx
+++ b/Book1.xlsx
@@ -1,6 +1,6 @@
 Sheet: Sheet1
-Hello
+Hello!

 Sheet: Sheet2
-Hello Again
+Hello Again!

Comparing PDF Documents

I'll print Doc1.docx to Doc1.pdf and add it to the repository.

$ git add Doc1.pdf
$ git commit -am"Saving inital PDF"

Then, using textract, I'll create a small PDF to text dump utility.

$ vi pdf-dump.py
import sys

import textract


print(textract.process(sys.argv[1]).decode('utf-8'))

I'll then change the heading in the Word document and print it out to Doc1.pdf again. Below is what the difference looks like.

$ git wdiff Doc1.pdf
diff --git a/Doc1.pdf b/Doc1.pdf
index 964baee..35f2f4b 100644
--- a/Doc1.pdf
+++ b/Doc1.pdf
@@ -1,2 +1,2 @@
HelloHello2

Keep in mind there are red and green background colours behind the text that differs. These won't render on this website but will inside the Terminal.

Comparing Images

The temptation to use an image-to-ASCII renderer here was strong but for the sake of simplicity, I'll only compare the image metadata.

I took a screenshot, named it test.png and added it to the repository.

$ git add test.png
$ git commit -am'Adding initial screenshot'

I then took another screenshot and replaced test.png with its contents. Below is what the difference looks like.

$ git wdiff test.png
diff --git a/test.png b/test.png
index af264c8..d82cfad 100644
--- a/test.png
+++ b/test.png
@@ -1,9 +1,9 @@
ExifTool Version Number         : 12.30
File Name                       : yDRG2g_test.pngtest.png
File Size                       : 151684 KiB
File Modification Date/Time     : 2021:08:24 11:16:10+03:002021:01:20 21:28:57+02:00
File Access Date/Time           : 2021:08:24 11:16:10+03:00
File Inode Change Date/Time     : 2021:08:24 11:16:10+03:0011:15:54+03:00
File Permissions                : -rw--------rw-r--r--
File Type                       : PNG
@@ -11,4 +11,4 @@ File Type Extension             : png
MIME Type                       : image/png
Image Width                     : 10201738
Image Height                    : 8842056
Bit Depth                       : 8
@@ -54,4 +54,4 @@ Resolution Unit                 : inches
User Comment                    : Screenshot
Exif Image Width                : 10201738
Exif Image Height               : 8842056
Pixels Per Unit X               : 5669
@@ -61,3 +61,3 @@ XMP Toolkit                     : XMP Core 6.0.0
Apple Data Offsets              : (Binary data 28 bytes, use -b option to extract)
Image Size                      : 1020x8841738x2056
Megapixels                      : 0.9023.6

Again, there are red and green background colours behind the text that differs. These won't render on this website but will inside the Terminal.

Comparing Videos

I copied a video off of my phone, named it test.mp4 and added it to the repository.

$ git add test.mp4
$ git commit -am'Adding initial video'

I then replaced that video with another video off my phone. Below is the difference.

diff --git a/test.mp4 b/test.mp4
index 9a9efb7..d9607c6 100644
--- a/test.mp4
+++ b/test.mp4
@@ -1,9 +1,9 @@
ExifTool Version Number         : 12.30
File Name                       : saDQQs_test.mp4test.mp4
File Size                       : 6063 MiB
File Modification Date/Time     : 2021:08:24 11:23:30+03:002019:12:14 15:12:25+02:00
File Access Date/Time           : 2021:08:24 11:23:30+03:00
File Inode Change Date/Time     : 2021:08:24 11:23:30+03:0011:23:08+03:00
File Permissions                : -rw--------rw-r--r--
File Type                       : MP4
@@ -14,9 +14,9 @@ Minor Version                   : 0.0.0
Compatible Brands               : isom, mp42
Media Data Size                 : 6320243665774570
Media Data Offset               : 32
Movie Header Version            : 0
Create Date                     : 2019:12:14 13:03:1213:04:10
Modify Date                     : 2019:12:14 13:03:1213:04:10
Time Scale                      : 1000
Duration                        : 10.4310.90 s
Preferred Rate                  : 1
@@ -34,6 +34,6 @@ Android Version                 : 9
Track Header Version            : 0
Track Create Date               : 2019:12:14 13:03:1213:04:10
Track Modify Date               : 2019:12:14 13:03:1213:04:10
Track ID                        : 1
Track Duration                  : 10.4310.90 s
Track Layer                     : 0
@@ -52,9 +52,9 @@ Pixel Aspect Ratio              : 65536:65536
Color Representation            : nclx 1 1 1
Video Frame Rate                : 30.02429.992
Matrix Structure                : 1 0 0 0 1 0 0 0 1
Media Header Version            : 0
Media Create Date               : 2019:12:14 13:03:1213:04:10
Media Modify Date               : 2019:12:14 13:03:1213:04:10
Media Time Scale                : 48000
Media Duration                  : 10.4010.86 s
Handler Type                    : Audio Track
@@ -68,3 +68,3 @@ Image Size                      : 3840x2160
Megapixels                      : 8.3
Avg Bitrate                     : 48.548.3 Mbps
GPS Latitude                    : 0 deg 0' 0.00" N
Thank you for taking the time to read this post. I offer both consulting and hands-on development services to clients in North America and Europe. If you'd like to discuss how my offerings can help your business please contact me via LinkedIn.

Copyright © 2014 - 2024 Mark Litwintschik. This site's template is based off a template by Giulio Fidente.