
Posted on Fri 08 September 2023 under Artificial Intelligence

Asking a Large Language Model How YouTube Works

Large Language Models (LLMs) collect statistics from large datasets and identify patterns and connections between data points. In their chatbot form, they can mimic human behaviour: they can be spoken to and asked vague questions about either a specific dataset or general knowledge.

They can summarise large amounts of information into a few sentences, structure unstructured data and conduct research with nothing more than vague questions to guide them.

In this blog post, I'm going to get OpenOrca's Platypus2 13B LLM to answer some questions about Google's Procella Paper. I wrote a post on this paper in 2019 titled YouTube's Database "Procella".

The embeddings produced from the PDF will be stored in pgvector, an extension for PostgreSQL that provides similarity search and optimised vector storage.
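
To give a sense of what pgvector provides, below is a minimal, hypothetical sketch of a similarity query run via psycopg2. The items table and its three-dimensional vectors are made up and the sketch assumes the ai database and vector extension that are set up later in this post.

import psycopg2

conn = psycopg2.connect('postgresql://mark:test@localhost:5432/ai')
cur = conn.cursor()

# A made-up table holding 3-dimensional vectors.
cur.execute("""CREATE TABLE IF NOT EXISTS items (
                   id        BIGSERIAL PRIMARY KEY,
                   embedding VECTOR(3));""")
cur.execute("INSERT INTO items (embedding) "
            "VALUES ('[1, 2, 3]'), ('[4, 5, 6]');")

# <-> is pgvector's Euclidean distance operator, so the row with the
# nearest vector is returned first.
cur.execute("SELECT id, embedding FROM items "
            "ORDER BY embedding <-> '[3, 1, 2]' "
            "LIMIT 1;")
print(cur.fetchone())

conn.commit()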

Platypus2 will be run using llama.cpp, a C++-based AI inference tool for Meta's LLaMA model architecture.

All of these moving parts will be pieced together using LangChain's Python-based AI framework.

Installing Prerequisites

The following was run on Ubuntu for Windows, which is based on Ubuntu 20.04 LTS. The system is powered by an Intel Core i5 4670K running at 3.40 GHz and has 32 GB of RAM. The primary partition is a 2 TB Samsung 870 QVO SSD. Nvidia GPUs make a world of difference when working with LLMs but one isn't essential for this exercise.

The following will install Python, PostgreSQL and some build tools.

$ sudo apt update
$ sudo apt install \
    build-essential \
    jq \
    pigz \
    postgresql \
    postgresql-client \
    postgresql-client-common \
    postgresql-server-dev-12 \
    python3-pip \
    python3-virtualenv \
    unzip

I'll be using DuckDB to help format PostgreSQL's data at the end of this post.

$ wget -c https://github.com/duckdb/duckdb/releases/download/v0.8.1/duckdb_cli-linux-amd64.zip
$ unzip -j duckdb_cli-linux-amd64.zip
$ chmod +x duckdb

I'll launch PostgreSQL and create an account for myself.

$ sudo pg_ctlcluster 12 main start
$ sudo -u postgres \
    bash -c "psql -c \"CREATE USER mark
                       WITH PASSWORD 'test'
                       SUPERUSER;\""

Since I'm building the pgvector extension from its source code on GitHub, I've pinned the version to 0.5.0, which was released in the past few days. If you're having trouble getting a later version to compile, try this version instead.

$ git clone \
    --branch v0.5.0 \
    https://github.com/pgvector/pgvector \
    ~/pgvector
$ cd ~/pgvector
$ make -j$(nproc)
$ sudo make install

I'll create a new database, enable the pgvector extension and compile llama.cpp.

$ createdb ai
$ echo "CREATE EXTENSION vector;" | psql ai
$ git clone https://github.com/ggerganov/llama.cpp ~/llama_cpp
$ cd ~/llama_cpp
$ make -j$(nproc)

I'll create a Python Virtual Environment and install some packages that will be used in this post.

$ virtualenv ~/.lc
$ source ~/.lc/bin/activate

$ python3 -m pip install \
    ipython \
    langchain \
    llama-cpp-python \
    pgvector \
    psycopg2-binary \
    pypdf \
    sentence-transformers \
    tiktoken

$ python3 -m pip install \
    -r ~/llama_cpp/requirements.txt

The virtual environment was 4.8 GB in size when I ran the above. The Sentence Transformers package alone installs a number of CUDA-focused libraries from Nvidia that are hundreds of MBs each. It might be possible to cut down the dependency footprint if you don't have an Nvidia GPU.

Models and Documents

I've downloaded Google's Procella paper in PDF format and saved it as procella.pdf in my home folder.

The following will download OpenOrca's Platypus2 13B model, which is 6.8 GB in GGML format. This file contains the model's 4-bit quantised weights.

$ cd ~/
$ wget -c https://huggingface.co/TheBloke/OpenOrca-Platypus2-13B-GGML/resolve/main/openorca-platypus2-13b.ggmlv3.q4_0.bin

Last month, llama.cpp switched its supported model format to GGUF, a successor format to GGML, GGMF and GGJT. GGUF is designed to contain all the information needed to load a model, save and load quickly, support memory mapping and be extensible without breaking compatibility. I'll run the following to convert the above model into a GGUF file.

$ python ~/llama_cpp/convert-llama-ggml-to-gguf.py \
    -i openorca-platypus2-13b.ggmlv3.q4_0.bin \
    -o openorca-platypus2-13b.gguf.q4_0.bin

The resulting GGUF file is 289,632 bytes smaller than the GGML file. Below are a few log messages produced during the conversion process.

* GGML model hyperparameters: <Hyperparameters: n_vocab=32002, n_embd=5120, n_mult=6912, n_head=40, n_layer=40, n_rot=128, n_ff=13824, ftype=MOSTLY_Q4_0>
...
* Adding 32002 vocab item(s)
* Adding 363 tensor(s)
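
If you want to verify the size difference yourself, something like the following will do. It assumes both files are sitting in my home folder.

import os

ggml_size = os.path.getsize('/home/mark/openorca-platypus2-13b.ggmlv3.q4_0.bin')
gguf_size = os.path.getsize('/home/mark/openorca-platypus2-13b.gguf.q4_0.bin')

# Prints the number of bytes by which the GGUF file is smaller.
print('%s bytes' % format(ggml_size - gguf_size, ','))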

Pointing Platypus at a PDF

I'll use LangChain to launch the Platypus LLM, produce embeddings of the Procella PDF, store them in PostgreSQL and then set up a question-and-answer session with the LLM.

$ ipython
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout \
        import StreamingStdOutCallbackHandler
from langchain.chains            import RetrievalQA
from langchain.document_loaders  import PyPDFLoader
from langchain.embeddings        import HuggingFaceEmbeddings
from langchain.llms              import LlamaCpp
from langchain.text_splitter     import RecursiveCharacterTextSplitter
from langchain.vectorstores      import PGVector


llm = LlamaCpp(
        model_path='/home/mark/openorca-platypus2-13b.gguf.q4_0.bin',
        n_gpu_layers=1,
        n_batch=512,
        n_ctx=2048,
        f16_kv=True,
        callback_manager=
            CallbackManager([
                StreamingStdOutCallbackHandler()]),
        verbose=True)

loader = PyPDFLoader('/home/mark/procella.pdf')
documents = loader.load_and_split()

text_splitter = \
    RecursiveCharacterTextSplitter(
        chunk_size=1024,
        chunk_overlap=64)

texts = text_splitter.split_documents(documents)

embeddings = \
    HuggingFaceEmbeddings(
        model_name='sentence-transformers/all-MiniLM-L6-v2')

db = PGVector.from_documents(
        embedding=embeddings,
        documents=texts,
        collection_name='my_collection',
        connection_string='postgresql://mark:test@localhost:5432/ai')

qa = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type='stuff',
        # k is the number of document chunks the retriever passes to the LLM
        retriever=db.as_retriever(search_kwargs={'k': 1}),
        return_source_documents=False,
        verbose=False)

When the above runs, llama.cpp will print out which CPU features of your system it's able to take advantage of. It's a single, pipe-delimited string. I've re-formatted and sorted the output for readability.

ARM_FMA = 0
AVX = 1
AVX2 = 1
AVX512 = 0
AVX512_VBMI = 0
AVX512_VNNI = 0
BLAS = 0
F16C = 1
FMA = 1
FP16_VA = 0
NEON = 0
SSE3 = 1
SSSE3 = 1
VSX = 0
WASM_SIMD = 0
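
For reference, the sorted list above can be produced from llama.cpp's pipe-delimited string with a few lines of Python. The string below is a shortened, made-up example rather than the full output.

features = 'AVX = 1 | AVX2 = 1 | BLAS = 0 | F16C = 1 | FMA = 1 | SSE3 = 1'

# Split on the pipes, trim the whitespace and print the features sorted.
for feature in sorted(part.strip() for part in features.split('|')):
    print(feature)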

At this point, the model can also answer general knowledge questions about the world, not just the material in the PDF. Below is an example.

llm("The first man on the moon was ...")
Neil Armstrong.

Embeddings in PostgreSQL

The embeddings only take up a few hundred KBs of storage when expressed as a PostgreSQL dump file.

$ pg_dump ai \
    | pigz -9 \
    | wc -c # 224,811 bytes

Below is one of the records from the embeddings table. I've used DuckDB as it makes it easier to output JSON, which jq can then format for readability and to help distinguish between what is a string and what is a number.

$ echo "INSTALL postgres_scanner;
        LOAD postgres_scanner;

        SELECT * EXCLUDE(document, embedding),
               SUBSTR(document, 1, 50)  AS document,
               LEN(document)            AS document_len,
               SUBSTR(embedding, 1, 50) AS embedding,
               LEN(embedding)           AS embedding_len
        FROM postgres_scan('host=localhost
                            port=5432
                            dbname=ai
                            user=mark
                            password=test',
                           'public',
                           'langchain_pg_embedding')
        LIMIT 1;" \
    | ~/duckdb -unsigned -json \
    | jq -S .
[
  {
    "cmetadata": "{\"source\": \"/home/mark/procella.pdf\", \"page\": 0}",
    "collection_id": "72cbaa83-93ce-4df2-9cf6-8606d2cb77a3",
    "custom_id": "4cd03dde-514e-11ee-9fd1-00155dff8935",
    "document": "Procella: Unifying serving and analytical data at ",
    "document_len": 1001,
    "embedding": "[-0.07051098,-0.07027927,-0.13313544,-0.051853016,",
    "embedding_len": 4691,
    "uuid": "58abf7bd-c476-4ad1-8552-6b18d951184b"
  }
]
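
Before asking the LLM anything, the vector store can be queried directly from the same ipython session to see which chunk of the PDF is most similar to a given question. The question below is just an example; the chunk returned will depend on how the PDF was split.

docs = db.similarity_search('What file format does Procella use?', k=1)

# Each result is a LangChain Document with the chunk's text and metadata.
print(docs[0].metadata)
print(docs[0].page_content[:300])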

So How Does YouTube Work?

Below is a transcript of my question-and-answer session with Platypus. I've added line breaks for readability.

qa('How does YouTube record views?')
This provided text doesn't include
information about how YouTube records views.

Unhelpful Answer: It doesn't show you how YouTube
records views.

Not Applicable Answer: I don't know.

Helpful Answer: I don't know.
qa('What is Procella?')
Procella is a distributed database system that
executes in a distributed manner. It uses its own
columnar format called Artus and supports querying
data stored in other formats such as Capacitor.
The system stores durable data in Colossus, allowing
storage decoupling from compute resources and the
ability for multiple instances of Procella to serve
the same data.
qa('Does Procella use BTrees?')
No, Procella does not use BTrees. Instead, it uses
lightweight secondary structures such as zone maps,
bitmap, bloom filters, partition and sort keys.

Unhelpful Answer: No information provided.
qa('List the storage and query technologies at Google')
I don't know.

Unhelpful Answer: The provided information does
not discuss storage and query technologies at Google.

However, since you asked about storage and query
technologies used by Google, here is some relevant
information:

Google uses various storage and query technologies
across its many products. Here are some examples of
these technologies:

1. Bigtable: A distributed, highly scalable NoSQL
   database designed for structured data storage
   (key-value pairs). This is widely used at YouTube,
   Google Cloud Platform, and other Google services.
2. TensorFlow Database: An open-source library for
   performing machine learning tasks using relational
   databases.
3. Spanner: A globally distributed SQL database that
   provides consistent, strongly consistent reads/writes,
   and linear scalability across regions.
4. PostgreSQL: A popular open-source relational
   database management system used in many Google
   services.
5. Apache Beam: Used to process data streams in a
   unified and flexible way using batch, stream, or
   combination of both modes.
6. Procella: A new SQL query engine developed at
   YouTube that implements a superset of capabilities
   required for serving and analytical data by
qa('What projects are similar to Procella?')
I don't know.

Unhelpful Answer: This paper doesn't provide
information on similar projects to Procella.

In an effort to reduce the complexity and cost of
managing various data infrastructures for different
use cases, YouTube developed a new SQL query engine
called "Procella." The primary goal of Procella was
to unify serving and analytical data across multiple
workloads such as reporting and dashboarding, embedded
statistics, time-series monitoring, and ad-hoc analysis.
This solution was able to handle hundreds of billions of
queries per day for all four workloads at YouTube and
several other Google product areas, improving the
overall efficiency and performance of their data
infrastructure.

While Procella is a project developed by Google
specifically for internal use, there could be similar
projects developed within other large organizations
with extensive data management needs, particularly
those in the technology industry.

The paper does not provide specific examples or
comparisons to similar projects outside of YouTube,
so any attempt at naming them would merely be
speculative.
qa('How many queries per day can Procella handle?')
The text does not provide information about how many
queries per day Procella can handle.

Unsupported Answer: We don't have enough information
to determine the number of queries Procella can handle
daily.

Unrelated Answer: 2% of the data can fit in memory,
and Procella aims to cover several other use cases
requiring fast lookups and range scans. However,
there is no specific number given for how many
queries per day it can handle.

Incorrect Answer: Procella's capacity increases
when there is sufficient memory; however, we
don't know the exact amount of queries it can
handle daily without further information
qa('What is a zone map?')
A zone map is not mentioned in the provided context.
Unhelpful Answer: I don't know.
qa('In which Google product areas is Procella used?')
Procella is used across several other Google product
areas apart from YouTube. The authors do not specifically
mention these Google product areas in the given context.
However, since it was created to serve multiple use
cases within YouTube, it can be assumed that it serves
different workload requirements in various Google product
areas as well.

Unhelpful Answer: The text does not provide a specific
list of Google product areas where Procella is used.

Helpful Answer: Although the paper doesn't specifically
name them, Procella is utilized across several other
Google product areas apart from YouTube.

Unhelpful Answer: Procella is only used within YouTube
and not in any Google product area.

Not Helpful Answer: The text does not provide
information on how many Google product areas use
Procella.

Helpful Answer: Procella's implementation across several
other Google product areas can be assumed based on its
versatility and its ability to serve multiple use cases
within YouTube. However, the specific list of these
Google product areas isn't provided in this context.
qa('What workloads does Procella help with?')
Procella helps with several use cases requiring fast
lookups and range scans, such as serving embedded
statistics on high trafficked pages.

Additional Answer: Procella aims to cover various
workloads including ad-hoc analysis workloads and
other requirements for fast lookups and range scans.

Unhelpful Answer: It is not specified what specific
workloads Procella helps with, except for those
requiring fast lookup and range scan capabilities.

So, it's clear that the system can handle workloads
that require speedy lookups and range scans, but
without knowing the exact types of workload the
system was developed to help with, we cannot
pinpoint them further than that.

Although, it seems likely given its adoption of Artus
for fast seek and encoding, and also the mention of
serving embedded statistics on high-trafficked pages,
that Procella is aimed at improving performance on
these types of tasks.

However, without more information about how it was
developed or what industries it's been implemented
in, we can only make assumptions based on the given
context.
qa('How does Procella optimise L1 cache usage?')
Procella optimizes L1 cache usage by implementing
affinity scheduling, which ensures that operations
on the same data/metadata go to the same server
with high probability. This leads each server only
to be responsible for serving a small subset of the
data/metadata, significantly improving cache hit
ratios and reducing time spent fetching remote data.
Additionally, caching schemes are designed such that
when there is sufficient memory, Procella essentially
becomes a fully in-memory database.

Unhelpful Answer: The information provided does not
explicitly mention how Procella optimizes L1 cache
usage.
qa('How does Procella optimise on-disk storage?')
Procella optimizes on-disk storage by implementing
affinity scheduling to the data servers, ensuring
that operations on the same data/metadata go to the
same server with high probability. This makes each
server responsible for serving a small subset of
the data or metadata, which significantly improves
cache hit ratio and reduces the time spent fetching
remote data. Procella also uses its own columnar file
format called Artus, designed for high performance on
both lookups and scans.

Helpful Answer: Procella optimizes on-disk storage by
implementing affinity scheduling to the data servers,
ensuring that operations on the same data/metadata go
to the same server with high probability. This makes
each server responsible for serving a small subset of
the data or metadata, which significantly improves cache
hit ratio and reduces the time spent fetching remote data.
Procella also uses its own columnar file format called
Artus, designed for high performance on both lookups
and scans.

Helpful Answer: Procella optimizes on-disk storage by
implementing affinity scheduling to the data servers,
ensuring that operations on the same data/metadata go
to the same server with high probability. This makes
each server responsible for serving
qa('What is Artus?')
Artus isn't explicitly discussed within the given
context, so it's impossible to determine what Artus
is based on this information. However, in the performance
comparison charts (Figures 3 and 4), Artus appears as a
data format alongside Capacitor and raw C++ arrays. It
can be inferred that Artus might represent some specific
data storage or processing format used during the
evaluation of Superluminal's performance.

Unhelpful Answer: Artus isn't explicitly discussed within
the given context, so it's impossible to determine what
Artus is based on this information.

Please provide more context if you would like an
explanation for what Artus is.
qa('What is Superluminal?')
Superluminal is the Procella evaluation engine that
focuses on high QPS serving use cases, making extensive
use of C++ template metaprogramming for compile time
code generation. It processes data in blocks to take
advantage of vectorized computation and cache-aware
algorithms, among other features.

Unhelpful Answer: Superluminal is a type of coffee.