Posted on Mon 25 November 2024 under Artificial Intelligence

Language Translation with Python

I've worked on client projects where we try to analyse second-hand markets or real estate feeds across countries that don't use English as their official language. In these cases, having a rough translation can be good enough for any automated analysis.

I recently came across a minimalist, Python-based API server called LibreTranslate that wraps Argos Translate's language models and provides an API to detect the language of a piece of text and to translate it.

I've been able to translate 20-25 short Estonian sentences per minute into English with LibreTranslate. A single server process maxes out two of my CPU cores. I'm in the process of finding out how to maximize the number of translations I can put through a single system. As I uncover performance improvements, I'll update this post.

My Workstation

I'm using a 6 GHz Intel Core i9-14900K CPU. It has 8 performance cores and 16 efficiency cores with a total of 32 threads and 32 MB of L2 cache. It has a liquid cooler attached and is housed in a spacious, full-sized, Cooler Master HAF 700 computer case. I've come across videos on YouTube where people have managed to overclock the i9-14900KF to 9.1 GHz.

The system has 96 GB of DDR5 RAM clocked at 6,000 MT/s and a 5th-generation, Crucial T700 4 TB NVMe M.2 SSD which can read at speeds up to 12,400 MB/s. There is a heatsink on the SSD to help keep its temperature down. This is my system's C drive.

The system is powered by a 1,200-watt, fully modular, Corsair Power Supply and is sat on an ASRock Z790 Pro RS Motherboard.

I'm running Ubuntu 22 LTS via Microsoft's Ubuntu for Windows on Windows 11 Pro. In case you're wondering why I don't run a Linux-based desktop as my primary work environment, I'm still using an Nvidia GTX 1080 GPU which has better driver support on Windows and I use ArcGIS Pro from time to time which only supports Windows natively.

Installing Prerequisites

I'll use Python and jq in this post.

$ sudo apt update
$ sudo apt install \
    jq \
    python3-pip \
    python3-virtualenv

I'll set up a Python Virtual Environment and install LibreTranslate.

$ python3 -m venv ~/.lt
$ source ~/.lt/bin/activate
$ pip install \
    libretranslate \
    requests

Launching LibreTranslate

To start the translation server API, run the following with the languages you're interested in working with. There are 92 different models to choose from as of this writing.

$ libretranslate --load-only en,et

I haven't found a way to add more models piecemeal after this step. In fact, when I tried, the server launched quietly and ignored the new languages I specified.

$ libretranslate --load-only en,et,ru

To get around this, I had to run LibreTranslate with the --update-models flag, which ended up downloading every supported model.

$ libretranslate --update-models
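
One way to confirm which languages a running server actually loaded is to ask it. The sketch below assumes the server exposes a GET /languages endpoint that returns a list of objects with a code field. fetch_languages and missing_languages are hypothetical helper names for illustration, not part of LibreTranslate. It uses urllib from the standard library so it runs without extra packages.

```python
import json
from urllib.request import urlopen


def fetch_languages(base_url='http://127.0.0.1:5000'):
    """Fetch the language list from a running server.

    Assumption: GET /languages returns JSON like
    [{'code': 'en', 'name': 'English', ...}, ...].
    """
    with urlopen(f'{base_url}/languages') as resp:
        return json.load(resp)


def missing_languages(wanted, languages):
    """Return the wanted language codes the server did not report, sorted."""
    loaded = {entry['code'] for entry in languages}
    return sorted(set(wanted) - loaded)


# Example, with the server running:
# missing_languages(['en', 'et', 'ru'], fetch_languages())
```

An empty list would suggest every requested language was loaded. Anything else points at a model the server skipped.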

LibreTranslate's API

The following translates a Russian sentence into English. This endpoint requires you to specify both the source and target languages.

$ curl -sX POST \
       http://127.0.0.1:5000/translate \
       -d "q=С 28 ноября в экзамен по теории вождения будут включены обновленные вопросы&source=ru&target=en" \
    | jq -S .
{
  "translatedText": "From November 28, updated questions will be included in the driving theory exam"
}

If you want to find the language(s) for a given piece of text, there is a detection endpoint.

$ curl -sX POST \
       http://127.0.0.1:5000/detect \
       -d "q=Vasakpoolne hoone on minu oma" \
    | jq -S .
[
  {
    "confidence": 100,
    "language": "et"
  }
]
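
The detection response is a list, so a caller has to decide which entry to trust. Below is a minimal sketch that assumes the response shape shown above. The best_language name and the 50-point confidence threshold are illustrative choices, not anything LibreTranslate prescribes.

```python
def best_language(detections, min_confidence=50.0):
    """Return the most confident language code, or None if nothing qualifies.

    detections is the parsed JSON list from /detect, for example
    [{'confidence': 100, 'language': 'et'}].
    min_confidence is a hypothetical cut-off; tune it for your data.
    """
    if not detections:
        return None
    # Pick the entry the server is most sure about.
    top = max(detections, key=lambda d: d['confidence'])
    if top['confidence'] < min_confidence:
        return None
    return top['language']
```

Returning None for weak guesses lets an automated pipeline skip or flag a record rather than translate it from the wrong source language.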

Alternative Translations

For some language pairs, you can request alternative translations with the alternatives parameter. The example below asks for three and gets two back.

$ curl -sX POST \
       http://127.0.0.1:5000/translate \
       -d "q=das Gebäude links gehört mir&source=de&target=en&alternatives=3" \
    | jq -S .
{
  "alternatives": [
    "the building to the left is mine",
    "the building to the left belongs to me"
  ],
  "translatedText": "the building on the left is mine"
}

When I entered the Estonian for "the building on the left is mine" on the https://libretranslate.com/ homepage, it returned the following.

{
    "alternatives": [
        "I own the building on the left.",
        "The building on my left is mine.",
        "The building on the left is my building."
    ],
    "detectedLanguage": {
        "confidence": 100,
        "language": "et"
    },
    "translatedText": "The building on the left is mine."
}

I haven't managed to get my local installation to return any alternatives for Estonian yet, but if and when I do, I'll update this post.
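
Since the alternatives key is only present for some requests, code consuming these responses should treat it as optional. Below is a small sketch, assuming the response shapes shown above. candidate_translations is a hypothetical helper name.

```python
def candidate_translations(response):
    """Return the primary translation followed by any distinct alternatives.

    response is the parsed JSON from /translate. The 'alternatives' key
    may be absent, in which case only the primary translation is returned.
    """
    candidates = [response['translatedText']]
    for alt in response.get('alternatives', []):
        # Skip alternatives that merely repeat an earlier candidate.
        if alt not in candidates:
            candidates.append(alt)
    return candidates
```

Keeping the primary translation first means callers that only want one answer can take the first element and ignore the rest.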

Models

By default, the models are kept in the following folder.

$ du -hs ~/.local/share/argos-translate/packages/*
165M  ../translate-en_et-1_9
165M  ../translate-et_en-1_9

The model weights, configuration and a README explaining the origins of the model and how it was put together can be found in these folders.

$ tree ~/.local/share/argos-translate/packages/translate-et_en-1_9/
~/.local/share/argos-translate/packages/translate-et_en-1_9/
├── README.md
├── metadata.json
├── model
│   ├── config.json
│   ├── model.bin
│   └── shared_vocabulary.json
├── sentencepiece.model
└── stanza
    ├── et
    │   └── tokenize
    │       └── edt.pt
    └── resources.json
$ head ~/.local/share/argos-translate/packages/translate-et_en-1_9/README.md
# Estonian - English version 1.9

Data compiled by [Opus](https://opus.nlpl.eu/).

Dictionary data from Wiktionary using [Wiktextract](https://github.com/tatuylonen/wiktextract).

Includes pretrained models from [Stanza](https://github.com/stanfordnlp/stanza/).

author = {Aleksey Kutashov}
$ jq -S .  ~/.local/share/argos-translate/packages/translate-et_en-1_9/metadata.json
{
  "argos_version": "1.9.0",
  "from_code": "et",
  "from_name": "Estonian",
  "package_version": "1.9",
  "to_code": "en",
  "to_name": "English"
}
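
Because each package folder carries a metadata.json like the one above, the installed language pairs can be listed without starting the server. This is a rough sketch that assumes every package follows the layout shown here. installed_pairs is a hypothetical helper name.

```python
import json
from pathlib import Path

# Default package location shown earlier in this post.
PACKAGES_DIR = Path.home() / '.local/share/argos-translate/packages'


def installed_pairs(packages_dir=PACKAGES_DIR):
    """Return (from_code, to_code, package_version) for each installed model."""
    pairs = []
    # Each package keeps its metadata.json directly inside its own folder.
    for metadata_path in sorted(Path(packages_dir).glob('*/metadata.json')):
        meta = json.loads(metadata_path.read_text(encoding='utf-8'))
        pairs.append((meta['from_code'], meta['to_code'], meta['package_version']))
    return pairs
```

This only reads what is on disk. Whether a running server has loaded a given pair still depends on the --load-only setting it was started with.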

Using Python

The following are two helper functions: one detects the language of a given piece of text and the other translates a given piece of text into English.

$ python3
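import requests

# Base URL of the local LibreTranslate server started earlier.
API = 'http://127.0.0.1:5000'


def detect(text):
    """Return the server's language guesses for text."""
    resp = requests.post(API + '/detect', data={'q': text})
    resp.raise_for_status()  # Fail loudly on HTTP errors.
    return resp.json()


def translate(text, from_lang):
    """Translate text from from_lang into English."""
    resp = requests.post(API + '/translate',
                         data={'q': text,
                               'source': from_lang,
                               'target': 'en'})
    resp.raise_for_status()  # Fail loudly on HTTP errors.
    return resp.json()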

This is a language detection example.

detect('das Gebäude links gehört mir')
[{'confidence': 100.0, 'language': 'de'}]

This translates the same sentence into English.

translate('das Gebäude links gehört mir', 'de')
{'translatedText': 'the building on the left is mine'}
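
To put a number on throughput, such as the 20-25 sentences per minute mentioned at the start of this post, a small timing harness can help. The sketch below is one possible approach and not how that figure was measured. measure_throughput is a hypothetical helper, and whether extra worker threads help at all will depend on how the server handles concurrent requests.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_throughput(translate_fn, sentences, workers=1):
    """Translate every sentence and return (results, sentences per minute).

    translate_fn is any callable that takes one string and returns its
    translation. workers is the number of concurrent client threads.
    """
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() keeps results in input order even if calls finish out of order.
        results = list(pool.map(translate_fn, sentences))
    elapsed = time.perf_counter() - started
    per_minute = len(sentences) / elapsed * 60 if elapsed > 0 else float('inf')
    return results, per_minute


# Example with the translate() helper above, assuming Estonian input:
# results, rate = measure_throughput(
#     lambda s: translate(s, 'et')['translatedText'], sentences, workers=4)
```

Running the same batch with different worker counts gives a quick way to see where a single server process stops scaling.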

Copyright © 2014 - 2024 Mark Litwintschik. This site's template is based off a template by Giulio Fidente.