You wanted deeper? AI is not an oracle, it's a calculator

Blog  — Wed 6 May 2026

Glad to hear the AI trilogy was well received. But I also got this question: "Yeah, great that you talked about ML, LLMs, and transformers in part three. But you never really went deeper into the technical side."

That's true. And I think that's exactly the line I'm constantly balancing on. I try not to get too technical, but I also want to explain enough. Honestly, I sometimes struggle to find that balance.

Let’s do a fourth part, then. That way, we can take a closer look at that latest generation of AI from a technical perspective. But be warned! We’re about to get technical.

As it turned out, this became a four-part series. You are now reading part four. If you’d like, you can find the other parts here:

  1. How media outlets like NOS and others portray an exaggerated view of AI
  2. How AI works in practice: from human language to software logic
  3. How Modern AI Works, from NLP to ML and LLMs

A technical look inside the black box

First, we need to understand why modern AI talks about "training data" and a "trained model". The first is the input, the second is the output. But what are those things really? Well, let me explain. The input is ideally a massive dataset containing labelled data.

Here's the thing: there are roughly two ways to teach an AI something. The first is to simply feed it enormous amounts of raw text (unsupervised learning). The second is to feed it enormous amounts of text as well, but this time labelled (supervised learning).

And what is labelled data exactly? Well, instead of just giving the AI a sentence like: "My name is John and I have a nice chair", you annotate the different parts of the sentence. John is a name, chair is an object. The rest are signals. For instance, "My name is" is a signal that a name is about to follow. And "nice" is a signal that the following object is being described positively.
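To make that concrete, here is what a single labelled training example could look like in code. The structure and the label names ('name_signal', 'person_name', and so on) are made up for illustration; every dataset defines its own annotation scheme:

$labelled_example = [
    'text'        => 'My name is John and I have a nice chair',
    'annotations' => [
        ['span' => 'My name is', 'label' => 'name_signal'],
        ['span' => 'John',       'label' => 'person_name'],
        ['span' => 'nice',       'label' => 'positive_signal'],
        ['span' => 'chair',      'label' => 'object'],
    ],
];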

From NLP to transformers

The key difference is that a traditional NLP pipeline is relatively rudimentary compared to modern transformer models. A transformer model is trained on enormous amounts of data, and the software that runs it can compute complex patterns across all of that data. This is also why special chips are used, such as GPUs or dedicated AI chips: the matrix operations in transformer models are extremely compute-intensive.

And because it is trained on so much data, the similarities it finds are often much more accurate than those from smaller datasets. In addition to finding similar sentences, models are also better at identifying entities. Again, this is due to scale.

Even traditional NLP systems from around 2019 made less use of rule-based methods such as regular expressions (regex) and instead used tokenisation as a basis for further processing. After that, there is often a second stage that turns the initial labels into entities. For example, the label "currency symbol" is assigned to € and the label "number" is assigned to 50. This was an early form of labelling, later often supplemented with or replaced by learned models instead of pure rule-based systems.

But in earlier systems, for example, there was an additional step that said: hey, € and 50 are next to each other, together they form a monetary amount. Not just a currency symbol and a number, but a single meaningful concept.
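As a sketch, that merging step could look like this. It assumes the first stage already produced labelled tokens; the function and label names are invented for this example:

$labelled_tokens = [
    ['text' => '€',  'label' => 'currency_symbol'],
    ['text' => '50', 'label' => 'number'],
];

function merge_money(array $tokens): array {
    $entities = [];
    for ($i = 0; $i < count($tokens); $i++) {
        // A currency symbol directly followed by a number becomes one entity.
        if ($tokens[$i]['label'] === 'currency_symbol'
            && isset($tokens[$i + 1])
            && $tokens[$i + 1]['label'] === 'number') {
            $entities[] = [
                'text'  => $tokens[$i]['text'] . $tokens[$i + 1]['text'],
                'label' => 'monetary_amount',
            ];
            $i++; // skip the number we just consumed
        } else {
            $entities[] = $tokens[$i];
        }
    }
    return $entities;
}

print_r(merge_money($labelled_tokens)); // one entity: '€50' as 'monetary_amount'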

A classic NER system was designed to recognise names, organisations, and locations. But modern labelling and extraction goes much further. Think of times, dates, and amounts. But also objects such as a garden chair or truck, units such as meters and inches, and so on.

Garbage in is garbage out

But how do you actually do that? How can a computer understand any of this? The answer is more boring than you probably expect: absurd amounts of data.

And this is probably where we should talk about the phrase "garbage in is garbage out". That term has existed for decades. It basically means that if you feed nonsense into a computer, you'll get nonsense back out.

So why is so much data needed? Think of it like our universe: a vast space with the occasional star or planet scattered here and there. With so much emptiness, the real question becomes, how do you get a model to learn meaningful relationships?

A model operates in a huge vector space where words, sentences, and concepts exist as points. That space needs to be filled, not just a little, but extremely densely. Only then will related concepts truly end up close to each other, allowing reliable patterns to emerge.

That’s why so much data is required. Not because “more is always better” in a superficial sense, but because without it, you simply end up with gaps in that space. And gaps lead to unpredictable behavior. The model needs to be able to “see” that a cat is closer to a dog than to a fire truck, and that only works if it encounters those relationships often enough in many different contexts.
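A toy example makes this tangible. The three-dimensional "embeddings" below are completely made up (real models use hundreds or thousands of dimensions), but they show what "closer to" means mathematically:

$cat        = [0.90, 0.80, 0.10];
$dog        = [0.85, 0.75, 0.20];
$fire_truck = [0.10, 0.20, 0.95];

function distance(array $a, array $b): float {
    $sum = 0.0;
    foreach ($a as $i => $v) {
        $sum += ($v - $b[$i]) ** 2;
    }
    return sqrt($sum); // Euclidean distance between two points
}

echo distance($cat, $dog) . "\n";        // ~0.12: close together
echo distance($cat, $fire_truck) . "\n"; // ~1.31: far apart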

Most of this data comes from existing texts such as books, websites, forums, and documentation, because they are rich in variation and context. Sometimes synthetic data is also used, for example to strengthen specific patterns or fill in rare cases. But that is supportive, not primary. Ultimately, the quality of the model depends on how well that vector space approximates reality, and that can only be achieved with large amounts of diverse and consistent examples.

AI does not understand language, it understands math

"Why all that data?", you're probably asking yourself now. Well, because of mathematics. AI does not truly understand grammar, English, Dutch, or any language in the way humans do. That's how people like to imagine it, but that's not really what's happening. AI ultimately does one thing: mathematics.

Let's say I type: "I love spaghetti". The first thing AI does is break that sentence into tokens. Those are smaller pieces. For example:

['I', 'love', 'spa', '##ghe', '##tti']

As you can see, the last word becomes multiple tokens. The ## markers tell the model that the token is connected to the previous one.

That was an example using BERT-style notation, but the concept is the same for every model. This kind of subword tokenisation is done with algorithms such as WordPiece (which BERT uses) or Byte Pair Encoding (BPE).

Why does it do this? Simple. Breaking words apart creates more overlap with other words. And that saves data. The token "spa" might appear in completely different words as well. Through tricks like this, the model becomes much smaller and more efficient. Shared pieces no longer need to be stored as entirely unique structures.
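For the curious, the greedy "longest match first" idea behind this kind of tokenizer fits in a few lines. The mini vocabulary is made up; real vocabularies contain tens of thousands of pieces:

$vocab = ['i', 'love', 'spa', '##ghe', '##tti'];

function tokenize_word(string $word, array $vocab): array {
    $tokens = [];
    $start  = 0;
    while ($start < strlen($word)) {
        // Try the longest possible piece first, then shrink from the right.
        for ($end = strlen($word); $end > $start; $end--) {
            $piece = substr($word, $start, $end - $start);
            if ($start > 0) {
                $piece = '##' . $piece; // continuation marker
            }
            if (in_array($piece, $vocab, true)) {
                $tokens[] = $piece;
                $start = $end;
                continue 2;
            }
        }
        return ['[UNK]']; // no piece matched: unknown word
    }
    return $tokens;
}

print_r(tokenize_word('spaghetti', $vocab)); // ['spa', '##ghe', '##tti']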

From tokens to vectors

And this brings us to vectorization. Every token, such as "spa", receives its own number. A so-called ID, or identifier. And those IDs differ per model. The creators of the model define them.

That's why every model comes with its own token-ID mapping. Every token must be converted into a number. Let's say "spa" receives the number 10.

The vector for the sentence "I love spaghetti" could then become:

[16, 33, 10, 99, 14]

And that vector is only valid for the model you trained or use. Another model could assign entirely different numbers.
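In code, that mapping is little more than a lookup table. The numbers below are the same invented ones as above; a real model ships a table with tens of thousands of entries:

$token_ids = [
    'I'     => 16,
    'love'  => 33,
    'spa'   => 10,
    '##ghe' => 99,
    '##tti' => 14,
];

$tokens = ['I', 'love', 'spa', '##ghe', '##tti'];
$ids    = array_map(fn($t) => $token_ids[$t], $tokens);

print_r($ids); // 16, 33, 10, 99, 14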

The reason you vectorize text is that, later on, you are going to perform mathematical comparisons. Text has to be transformed into numbers first. For example, imagine these two sentences:

"I love spaghetti"
"Spaghetti is my favorite meal"

From a traditional programming perspective these are two completely different strings. But in vector space they suddenly share similarities. "Spaghetti" might contain the token pattern [10, 99, 14] in both sentences. And that means the two sentences overlap in a way you cannot easily solve with regex or if-statements, but you can solve it with vectors.

The trick with vectors is that you compare them using mathematical distance and weights. In a real model, each token ID is first mapped to a dense embedding vector, a long list of floating-point numbers that captures meaning, and the distances are computed on those embeddings rather than on the raw IDs. Sentences with similar vector components end up closer together than sentences with completely different structures.

And that's exactly why you cannot simply use a hash. A hash changes completely when even one character changes. A vector preserves similarities.
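You can see that difference in a few lines. The embedding vectors below are invented three-dimensional stand-ins (real embeddings are far longer), but the contrast with hashing is real:

// One extra character, completely different hash.
echo md5('I love spaghetti') . "\n";  // some 32-character hash
echo md5('I love spaghetti!') . "\n"; // an entirely unrelated hash

$a = [0.81, 0.10, 0.62]; // "I love spaghetti"
$b = [0.78, 0.14, 0.60]; // "Spaghetti is my favorite meal"
$c = [0.05, 0.91, 0.12]; // "The train leaves at nine"

function cosine(array $x, array $y): float {
    $dot = $nx = $ny = 0.0;
    foreach ($x as $i => $v) {
        $dot += $v * $y[$i];
        $nx  += $v * $v;
        $ny  += $y[$i] * $y[$i];
    }
    return $dot / (sqrt($nx) * sqrt($ny)); // 1.0 means same direction
}

echo cosine($a, $b) . "\n"; // ~0.999: nearly the same meaning
echo cosine($a, $c) . "\n"; // ~0.22: unrelated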

Inference and intent detection

For those still reading: once the user's sentence has been transformed into a vector, the model starts comparing it against vectors inside the trained model.

The model contains thousands or tens of thousands of examples. You're essentially searching for vectors that are mathematically closest to the input.

For programmers, think of it like this. I'll just use PHP as an example:

$user_input  = "I love spaghetti";
$user_tokens = $my_model->tokenize($user_input);   // split into tokens
$user_vector = $my_model->vectorize($user_tokens); // tokens to numbers

// Register example sentences for the intent we want to detect.
$my_model->add_intent(
    intent: 'favorite_food',
    examples: [
        'I love spaghetti',
        'Spaghetti is my favorite food',
        'For me spaghetti is the best',
        'If you ask me, definitely spaghetti',
    ],
);

// Compare the user's vector against the intent examples.
$scores = $my_model->infer($user_vector);

print_r($scores);

So what's happening here? The model itself is already trained. What you're doing is two things. First, you're feeding example sentences connected to an intent you want to detect. Then you feed the actual user input into the system, as a vector.

Based on the relationship between the user input and your examples, the model returns scores for all possible intents.

$scores = [
    'favorite_food'       => 0.98,
    'favorite_restaurant' => 0.03,
    'favorite_color'      => 0.02,
];

The highest score usually wins. But ultimately that's your decision. The model only returns probabilities. Your own business logic decides what happens next.

But in practice, the top score is probably the intent you want to continue with.

Why AI uses multiple tasks

In practice, you will usually use a pretrained model. That model contains enormous amounts of statistical relationships between words, concepts, and patterns.

You can then use that same model for different tasks. Think of Sentence Similarity, entity recognition (NER), or classification.

And on top of that, you may also want to perform sentiment analysis. Is this message cheerful? Are they angry? Was it sarcasm? And so on.

Those tasks often use the same underlying transformer architecture, but with different classification layers or different ways of using the vector representations.

Because of this, the same model can be used for semantic similarity, but also for recognizing people, locations, amounts, or emotions in text.

Each task uses the model in a different way. For example, sentiment analysis looks at very different statistical patterns than Named Entity Recognition. Just to keep it in PHP syntax:

class Pipeline {
    public array  $tokens;
    public array  $vector;
    public array  $entities;
    public string $mood;
    public string $intent;

    public function __construct($my_model, string $user_input) {
        $this->tokens   = $my_model->tokenize($user_input);
        $this->vector   = $my_model->vectorize($this->tokens);
        $this->entities = $my_model->entities($this->tokens);
        $this->mood     = $my_model->mood($this->vector);
        $this->intent   = $my_model->intent($this->vector);
    }
}

Because of this, separate classification layers, or even specialized models, are often used for different tasks. Modern AI systems frequently combine multiple specialized pipelines, even when they are based on the same transformer architecture.

Let me make that more concrete.

A task that tries to determine the tone of text looks at very different patterns inside the model. Think of phrases like: "ugh", "very angry", "this disappoints me". Or the opposite: "wow", "awesome", "fantastic", "so happy". Those are patterns that strongly relate to each other in sentiment analysis.

Those are completely different vector relationships than the ones used by a task that tries to recognize place names.

Heuristics and early exits

And that's why systems often apply heuristic detection before they even touch the heavy AI pipeline. You want to figure out what the user wants as quickly and energy-efficiently as possible.

Did the input contain a numerical entity? Two units? And a token like "to"? Then the user probably wants a conversion like "1 meter to feet". That request can immediately be routed to a conversion tool without running the full NLP pipeline. That's called an early exit.

Is the user input more complicated than that? Then you pass it into the NLP layer and see whether it can classify it. If not, you vectorize the input and send it through the actual model for intent detection.
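A minimal sketch of that routing, assuming a small hard-coded conversion table (the regex and the fallback are simplified for illustration):

$conversions = ['meter_feet' => 3.28084, 'inch_cm' => 2.54];

function route(string $input, array $conversions): string {
    // Early exit: number + unit + "to" + unit looks like a conversion.
    if (preg_match('/^([\d.]+)\s*(\w+)\s+to\s+(\w+)$/i', $input, $m)) {
        $key = strtolower($m[2]) . '_' . strtolower($m[3]);
        if (isset($conversions[$key])) {
            return ($m[1] * $conversions[$key]) . ' ' . $m[3];
        }
    }
    // Only now would we pay for tokenization, vectorization, and inference.
    return '[forward to the full NLP pipeline]';
}

echo route('1 meter to feet', $conversions) . "\n";            // 3.28084 feet
echo route('what should I eat tonight?', $conversions) . "\n"; // forwarded to the model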

Those early stages save enormous amounts of energy and cost, both in hardware load and literally in money. Big players like Google and OpenAI obviously know this. But smaller AI systems built on top of pretrained models surprisingly often don't. They simply throw every single user request directly into the model. Wasteful.

AI is not a mystical oracle

And finally, perhaps the most “elusive” part of AI: systems that write answers themselves.

So far, we’ve mainly talked about AI as a tool for understanding human language. The final decisions were still made through controllable logic. But there is also a form of AI that generates text itself, such as ChatGPT. In that case, an answer is not retrieved from a database; the model simply predicts, based on mathematics, which answer is most likely.

And that is where things can go wrong. Such a system can give an answer that sounds convincing, while being factually incorrect. That makes generative AI incredibly impressive as a pleasant conversation partner, but not very suitable for situations where accurate answers are crucial, such as medical applications.

That does not mean AI as a whole is unusable. Quite the opposite. AI is widely used for serious tasks. But in those systems, control over facts, rules, and decisions remains outside the model itself, more directly in human hands. That is the important difference.

So AI is not one "magical technology" that makes decisions all on its own. It is a collection of techniques, built by people, with different applications and different risks.

Undesirable AI answers are usually caused by:

  1. Lower quality training data
  2. Less than ideal choices in the surrounding software
  3. Or too much freedom to generate answers independently

AI is not a mystical oracle. In the end, it is a large calculator that recognizes patterns. And when people use the right form of AI in the right way, control simply remains in human hands. That is how it provides benefits and serves a useful purpose.

PS: And we haven’t even talked about AI that analyzes images instead of language, for example. Because that too is a completely different form of AI. Perhaps another time!