Posted on Sat 14 August 2021 under Full-Text Search

MeiliSearch: A Minimalist Full-Text Search Engine

MeiliSearch is a fast, feature-rich full-text search engine. It's built on top of the LMDB key-value store and installs as a 35 MB binary on Ubuntu or macOS. It comes with a built-in client, server and WebUI. Features such as stemming, stop words, synonyms, ranking, filters and faceting all work out of the box, use sensible defaults and can be easily customised.

The MeiliSearch project began in 2018 with much of the early code written by Clément Renault. Clément later went on to found Meili, a Paris-based firm that provides services around this offering. They currently employ 18 staff according to LinkedIn.

MeiliSearch's Source Code

The first thing about the MeiliSearch codebase that caught my eye was that, excluding unit tests, it's made up of 7,600 lines of Rust. Competing offerings often have more features, but they are also far larger: Elasticsearch is made up of almost 2 million lines of Java, Apache Solr, which builds on Apache Lucene, comes in at around 1.3 million lines of Java, Groonga is made up of 600K+ lines of C, Manticore Search is 150K lines of C++, Sphinx is 100K lines of C++, Typesense is 50K lines of C++ including headers and Tantivy, which is also written in Rust, has 40K lines of code. Bleve, a text-indexing library for Go that doesn't include any client or server interfaces, is made up of 83K+ lines of code, though a standalone search engine called SRCHX can sit on top of it and comes in at a few hundred lines of Go.

One of the reasons behind this seemingly small codebase is that areas of concern have been broken up into separate repositories. The indexer, milli, sits at 17K lines of Rust, the tokenizer is 1,200 lines long and the WebUI dashboard is a 3.5K-line React app.

The developers have gone out of their way not to reinvent the wheel, leaning on third-party libraries instead. These include heed, which they use to wrap LMDB, Tokio for networking, Actix as their web framework, futures for streaming functionality and parking_lot for locking and synchronisation.

Both Sled and RocksDB were candidates considered for the embedded database backend before the team settled on LMDB. They cited LMDB as having the best combination of performance and stability for this use case.

The choice of Rust, with its rich ecosystem of libraries, concise syntax and ability to produce performant binaries, looks to have paid off well. Rust started off as a personal project of Mozilla staffer Graydon Hoare back in 2006. In 2020, a Stack Overflow survey found Rust to be the most loved programming language among its respondents.

MeiliSearch Up & Running

I ran the following on a 2020 MacBook Pro with a 1.4 GHz Quad-Core Intel Core i5, 8 GB of RAM and an external SSD connected via Thunderbolt. I'll be using Homebrew as my package manager. The following will install version 0.20.0 of MeiliSearch as well as virtualenv, jq and curl, which are used throughout this post.

$ brew update
$ brew install \
    curl \
    jq \
    meilisearch \
    virtualenv

If you're running this on Ubuntu 20.04 LTS, the following will install the above.

$ echo "deb [trusted=yes] https://apt.fury.io/meilisearch/ /" | \
    sudo tee /etc/apt/sources.list.d/fury.list
$ sudo apt update
$ sudo apt install \
    curl \
    jq \
    meilisearch-http \
    python3-pip \
    python3-virtualenv

I'll launch MeiliSearch within a screen so it will remain running in the background. I'll be launching this from a working folder on an external SSD to avoid any excess wear on my laptop's primary drive.

$ screen
$ MEILI_NO_ANALYTICS=1 \
  MEILI_HTTP_PAYLOAD_SIZE_LIMIT=1073741824 \
    meilisearch \
        --db-path ./meilifiles \
        --http-addr '127.0.0.1:7700'

Type CTRL-A and then CTRL-D to detach the screen.
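
Before moving on, it's worth checking the server is reachable. The version endpoint should echo back the release that's running.

$ curl http://127.0.0.1:7700/version | jq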

Importing Data

MeiliSearch has indices that contain documents. Each index has its own settings. Documents are made up of fields that have a name and a value that can be a string, integer, float, boolean, array, dictionary or NULL. Dates and times have no native representation. I've seen timestamps converted into integers to get around this limitation, so midnight UTC on March 18th, 1995 would be expressed as the UNIX timestamp 795484800.
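
That conversion is straightforward to do ahead of indexing. The following one-liner, using nothing beyond Python's standard library, reproduces the timestamp above.

$ python3 -c "from datetime import datetime, timezone; \
              print(int(datetime(1995, 3, 18, tzinfo=timezone.utc).timestamp()))"
795484800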

Wikipedia produces a dump of their site's contents every few days. I'll pull down one of the dump's 239 MB, bzip2-compressed segments.

$ wget -c https://dumps.wikimedia.org/enwiki/20210801/enwiki-20210801-pages-articles1.xml-p1p41242.bz2

The archive contains 879 MB of XML. I'll build a Python script that extracts part of its contents and converts them into a JSON format MeiliSearch can ingest.

The following will set up a virtual environment where I'll install lxml, an XML library for Python.

$ virtualenv ~/.meilisearch
$ source ~/.meilisearch/bin/activate
$ python3 -m pip install lxml

The following is the conversion script I've put together.

$ vi convert.py
import bz2
import io
import json
import sys

from lxml.etree import iterparse as xml_parse


def get_parser(bz2_file):
    # Every tag in the MediaWiki export is namespaced with this prefix.
    prefix = '{http://www.mediawiki.org/xml/export-0.10/}'

    # Stream-parse the XML, visiting each element as its closing tag is read.
    for event, element in xml_parse(bz2_file, events=('end',)):
        if element.tag.endswith('page'):
            # Namespace 0 contains articles; talk pages, templates and
            # other namespaces are skipped.
            if element.find(prefix + 'ns').text == '0':
                id_tag    = element.find(prefix + 'id')
                title_tag = element.find(prefix + 'title')
                text_tag  = element.find(prefix + 'revision')\
                                   .find(prefix + 'text')
                yield id_tag.text, title_tag.text, text_tag.text

            # Discard the element's children to keep memory usage down.
            element.clear()


# Read the bzip2-compressed dump from stdin and wrap it for decompression.
parser = get_parser(
            bz2.BZ2File(
                io.BytesIO(
                    sys.stdin.buffer.read()), 'r'))

# Emit a single JSON list of documents, the format MeiliSearch accepts.
print(json.dumps([{'id':    id_,
                   'title': title,
                   'text':  text}
                  for id_, title, text in parser],
                 ensure_ascii=False))

The following took 102 seconds to complete. It converted 27,199 documents into JSON at a rate of ~8.6 MB/s and 267 documents/s. The resulting JSON file is 842 MB when decompressed.

$ cat enwiki-20210801-pages-articles1.xml-p1p41242.bz2 \
    | python3 convert.py \
    > enwiki-20210801-pages-articles1.json-p1p41242

Below is what the first dictionary, within the single list in the above JSON file, looks like.

$ jq 'nth(0)' enwiki-20210801-pages-articles1.json-p1p41242
{
  "id": "10",
  "title": "AccessibleComputing",
  "text": "#REDIRECT [[Computer accessibility]]\n\n{{rcat shell|\n{{R from move}}\n{{R from CamelCase}}\n{{R unprintworthy}}\n}}"
}
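
jq can also confirm the document count matches what the conversion script reported.

$ jq 'length' enwiki-20210801-pages-articles1.json-p1p41242
27199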

The following POST took 3 seconds to complete. Indexing happens in the background so this is purely the delivery time of the raw JSON to MeiliSearch.

$ curl \
      -X POST 'http://127.0.0.1:7700/indexes/articles/documents' \
      --data @enwiki-20210801-pages-articles1.json-p1p41242 \
      | jq

This is MeiliSearch's reply with the update identifier.

{
  "updateId": 0
}
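
That updateId can be polled to find out when background indexing has finished. On this version of MeiliSearch the per-index updates endpoint reports a status that moves from enqueued to processed; later releases replaced it with a tasks API.

$ curl 'http://127.0.0.1:7700/indexes/articles/updates/0' | jq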

Searching and Ranking

Opening http://127.0.0.1:7700/ in a web browser will display a WebUI with a live search box. Results will display as you type out and refine your query.

Querying via curl on the CLI is also supported. Below is an example where I've searched for "programming languages" and asked for 10 results, each containing the title field, to be returned. I've run the results through jq so that they are more concisely formatted.

$ curl 'http://127.0.0.1:7700/indexes/articles/search' \
        --data '{"q": "programming languages",
                 "attributesToRetrieve": ["title"],
                 "limit": 10}' \
    | jq '.hits | map(.title)'
[
  "ProgrammingLanguages",
  "Timeline of programming languages",
  "List of object-oriented programming languages",
  "Programming language/Timeline",
  "Fourth-generation programming language",
  "Logic programming",
  "Lynx (programming language)",
  "Procedural programming",
  "Class (computer programming)",
  "Structured programming"
]

There are six default ranking rules with MeiliSearch.

  1. typo prioritises documents matching your query terms with the fewest typos.
  2. words prioritises documents that contain all of your query terms ahead of documents matching only some of them.
  3. proximity prioritises documents where your query terms appear closest to one another.
  4. attribute prioritises matches by the field they appear in. If your attribute ranking order is title, description, author, then matches in the title field carry more weight than those in the description or author fields.
  5. words position prioritises documents where your search terms appear closest to the beginning of a field.
  6. exactness prioritises documents that match your query most exactly.

These can be further extended, removed and/or rearranged by posting to the ranking-rules endpoint. This setting is index-specific. Below is an example where hypothetical release_date and rank fields are taken into account when deciding how relevant any matching document is during any given query.

$ curl \
  -X POST 'http://127.0.0.1:7700/indexes/articles/settings/ranking-rules' \
  --data '[
      "typo",
      "words",
      "proximity",
      "attribute",
      "wordsPosition",
      "exactness",
      "asc(release_date)",
      "desc(rank)"
  ]' | jq

The following is a response from the server with the update identifier.

{"updateId": 0}

Dumping MeiliSearch Indices

The following will instruct MeiliSearch to begin producing a dump of its contents.

$ curl -X POST 'http://127.0.0.1:7700/dumps' | jq
{
  "uid": "20210813-193103820",
  "status": "in_progress"
}

This process can be monitored via a status call.

$ curl 'http://127.0.0.1:7700/dumps/20210813-193103820/status' | jq
{
  "uid": "20210813-193103820",
  "status": "done"
}

Once completed, a .dump file will appear within the dumps folder inside MeiliSearch's working folder.

$ ls -lh dumps/20210813-193103820.dump
-rwxrwxrwx  1 mark  staff   291M Aug 13 22:32 dumps/20210813-193103820.dump

The dump file is a GZIP-compressed tar archive. It leads with a metadata.json file describing each index before line-delimited JSON is used to serialise the documents.

$ gunzip -c dumps/20210813-193103820.dump \
    | head -c1500 \
    | hexdump -C
00000000  2e 2f 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |./..............|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000060  00 00 00 00 30 30 34 30  37 35 35 00 30 30 30 30  |....0040755.0000|
00000070  37 36 35 00 30 30 30 30  30 32 34 00 30 30 30 30  |765.0000024.0000|
00000080  30 30 30 30 30 30 30 00  31 34 31 30 35 35 34 34  |0000000.14105544|
00000090  31 36 37 00 30 30 30 37  34 30 34 00 35 00 00 00  |167.0007404.5...|
000000a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000100  00 75 73 74 61 72 20 20  00 00 00 00 00 00 00 00  |.ustar  ........|
00000110  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000140  00 00 00 00 00 00 00 00  00 30 30 30 30 30 30 30  |.........0000000|
00000150  00 30 30 30 30 30 30 30  00 00 00 00 00 00 00 00  |.0000000........|
00000160  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000200  6d 65 74 61 64 61 74 61  2e 6a 73 6f 6e 00 00 00  |metadata.json...|
00000210  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000260  00 00 00 00 30 31 30 30  36 34 34 00 30 30 30 30  |....0100644.0000|
00000270  37 36 35 00 30 30 30 30  30 32 34 00 30 30 30 30  |765.0000024.0000|
00000280  30 30 30 30 33 30 30 00  31 34 31 30 35 35 34 34  |0000300.14105544|
00000290  31 36 37 00 30 30 31 31  37 31 30 00 30 00 00 00  |167.0011710.0...|
000002a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000300  00 75 73 74 61 72 20 20  00 00 00 00 00 00 00 00  |.ustar  ........|
00000310  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000340  00 00 00 00 00 00 00 00  00 30 30 30 30 30 30 30  |.........0000000|
00000350  00 30 30 30 30 30 30 30  00 00 00 00 00 00 00 00  |.0000000........|
00000360  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000400  7b 22 69 6e 64 65 78 65  73 22 3a 5b 7b 22 6e 61  |{"indexes":[{"na|
00000410  6d 65 22 3a 22 61 72 74  69 63 6c 65 73 22 2c 22  |me":"articles","|
00000420  75 69 64 22 3a 22 61 72  74 69 63 6c 65 73 22 2c  |uid":"articles",|
00000430  22 63 72 65 61 74 65 64  41 74 22 3a 22 32 30 32  |"createdAt":"202|
00000440  31 2d 30 38 2d 31 33 54  31 32 3a 32 33 3a 32 32  |1-08-13T12:23:22|
00000450  2e 33 33 30 35 39 36 5a  22 2c 22 75 70 64 61 74  |.330596Z","updat|
00000460  65 64 41 74 22 3a 22 32  30 32 31 2d 30 38 2d 31  |edAt":"2021-08-1|
00000470  33 54 31 32 3a 32 33 3a  32 32 2e 33 33 33 32 34  |3T12:23:22.33324|
00000480  38 5a 22 2c 22 70 72 69  6d 61 72 79 4b 65 79 22  |8Z","primaryKey"|
00000490  3a 22 69 64 22 7d 5d 2c  22 64 62 56 65 72 73 69  |:"id"}],"dbVersi|
000004a0  6f 6e 22 3a 22 30 2e 32  30 2e 30 22 2c 22 64 75  |on":"0.20.0","du|
000004b0  6d 70 56 65 72 73 69 6f  6e 22 3a 22 56 31 22 7d  |mpVersion":"V1"}|
000004c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000005d0  00 00 00 00 00 00 00 00  00 00 00 00              |............|
000005dc

The following is the first JSON-based piece of metadata inside of the dump file.

$ gunzip -c dumps/20210813-193103820.dump \
    | strings -50 \
    | head -n1 \
    | jq
{
  "indexes": [
    {
      "name": "articles",
      "uid": "articles",
      "createdAt": "2021-08-13T12:23:22.330596Z",
      "updatedAt": "2021-08-13T12:23:22.333248Z",
      "primaryKey": "id"
    }
  ],
  "dbVersion": "0.20.0",
  "dumpVersion": "V1"
}

The following is the first document within the dump.

$ gunzip -c dumps/20210813-193103820.dump \
    | strings -50 \
    | grep '^{"id' \
    | head -n1 \
    | jq
{
  "id": "10",
  "title": "AccessibleComputing",
  "text": "#REDIRECT [[Computer accessibility]]\n\n{{rcat shell|\n{{R from move}}\n{{R from CamelCase}}\n{{R unprintworthy}}\n}}"
}

MeiliSearch dumps are compatible across different versions of the software, whereas snapshots, which are faster to produce, can only be loaded by the same version that produced them.

The following will import a dump into MeiliSearch.

$ MEILI_NO_ANALYTICS=1 \
    meilisearch --import-dump dumps/20210813-193103820.dump
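
Snapshots, the faster but version-locked alternative mentioned above, are configured with launch flags rather than an HTTP call. The sketch below assumes the --snapshot-dir, --schedule-snapshot and --import-snapshot options documented for this era of MeiliSearch; check meilisearch --help for the exact names and defaults on your build.

$ MEILI_NO_ANALYTICS=1 \
    meilisearch \
        --db-path ./meilifiles \
        --snapshot-dir ./snapshots \
        --schedule-snapshot

Restoring is then a matter of pointing --import-snapshot at a file from the ./snapshots folder when launching a fresh instance.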

Limitations of MeiliSearch

As of this writing, the largest dataset MeiliSearch is being officially tested against has 120 million documents. The software could probably support more, but this is the largest deployment I could find any mention of.

MeiliSearch's database size is something to keep an eye on as, by default, it's limited to 100 GB. This can be changed by passing overriding parameters at launch, as sketched below.
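
On the version used here the relevant launch options appear to be --max-mdb-size for the main database and --max-udb-size for the update database, both taking a size in bytes, though meilisearch --help is the authoritative list. A sketch raising the main database limit to 214,748,364,800 bytes (200 GiB) might look like this.

$ MEILI_NO_ANALYTICS=1 \
    meilisearch \
        --db-path ./meilifiles \
        --max-mdb-size 214748364800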

There is also a hard-coded limit of 200 indices and only the first 1,000 words of any attribute will be indexed.

The following endpoint will report on the database size and give statistics for each index hosted by this given instance of MeiliSearch.

$ curl http://127.0.0.1:7700/stats | jq
{
  "databaseSize": 4465295366,
  "lastUpdate": "2021-08-13T19:51:25.342231Z",
  "indexes": {
    "articles": {
      "numberOfDocuments": 27199,
      "isIndexing": false,
      "fieldsDistribution": {
        "id": 27199,
        "text": 27199,
        "title": 27199
      }
    }
  }
}
Thank you for taking the time to read this post. I offer both consulting and hands-on development services to clients in North America and Europe. If you'd like to discuss how my offerings can help your business please contact me via LinkedIn.
