Modeling the Voynich Manuscript with AI

Not to Translate It, But to Test It

May 18, 2025

I’m not a linguist. I’m not a cryptographer. But I like to think I can build things, and I was curious whether the Voynich Manuscript, arguably the world’s most unreadable book, could be modeled like a real language.

Not translated. Not decoded. Just… modeled.

I wasn’t interested in fairy tale interpretations or elaborate cipher trees. I wanted structure. Syntax. Patterns. Something I could throw modern NLP tools at and ask:
Does this behave like language — or just look like it?

So I built a pipeline. Not a fancy one. Just enough to run the manuscript through modern AI methods and see if anything broke.

Spoiler: it flinched. In all the right ways.

What Is the Voynich Manuscript?

The Voynich Manuscript is a 15th-century illustrated book written in a writing system no one has been able to read. It contains strange plants, astronomical charts, and naked women bathing in connected tubes, all accompanied by a looping, alien script that doesn’t match any known language, cipher, or invented system. It’s been studied by cryptographers, historians, linguists, and AI researchers, and it still refuses to give up its secrets.

What I Built

The core of the project is a full NLP pipeline — open-source, reproducible, and annotated for anyone who wants to pick it apart.

Here’s what it does:

Strips common suffixes from Voynich words — things like aiin, dy, chy. My assumption was these were phonetic or rhythmic filler, not core roots. That’s a big assumption, and it affected everything downstream.
Clusters root-like words using SBERT (Sentence-BERT, here with a multilingual sentence embedding model) + KMeans. This gives us “semantic neighborhoods” based on how words behave, not what they mean.
Maps each line of the manuscript to a sequence of cluster IDs — using the transliteration as input.
Infers POS roles like Function, Root, Modifier — based on position, variety, and frequency. (If it shows up at the start of every line and only has 12 variations? Probably a function word.)
Builds a transition matrix — how likely is it for Cluster 8 to follow Cluster 3? The result looked suspiciously like grammar.
Segments by manuscript section — Botanical, Biological, Cosmological, etc. Turns out different parts of the book use different grammatical structures.
Generates a lexicon hypothesis — not meanings, but a data-driven guess at which words do what, and where.
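
To make the mapping and transition-matrix steps concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the code from the repo: the function names, the toy tokens, and the word-to-cluster assignments are all invented for the example, and it assumes each line is a whitespace-separated string of already-stripped words.

```python
from collections import Counter, defaultdict

def lines_to_cluster_ids(lines, word_to_cluster):
    """Map each transliterated line to a list of cluster IDs.

    Words with no cluster assignment are skipped.
    """
    return [
        [word_to_cluster[w] for w in line.split() if w in word_to_cluster]
        for line in lines
    ]

def transition_matrix(sequences):
    """Row-normalized bigram probabilities: P(next cluster | current cluster)."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return {
        current: {nxt: n / sum(following.values()) for nxt, n in following.items()}
        for current, following in counts.items()
    }

# Toy data: tokens and cluster IDs are made up for illustration only.
word_to_cluster = {"qok": 8, "ched": 3, "shed": 3, "ol": 5}
lines = ["qok ched ol", "qok shed ol", "qok ched ched"]

sequences = lines_to_cluster_ids(lines, word_to_cluster)
matrix = transition_matrix(sequences)
print(matrix[8])  # {3: 1.0}
```

In this toy corpus Cluster 8 is always followed by Cluster 3, so that row holds a single probability of 1.0. On real data the interesting question is how far each row departs from what shuffled lines would produce.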

What I Found

Cluster 8 behaves like a function word group. Low diversity, high frequency, often starts lines.
Cluster 3 is root-heavy — highly diverse and central to most line structures.
Transition matrices showed a consistent flow: sequences like Function → Root → Modifier → Object recur across lines.
Grammatical patterns vary by folio type. Botanical pages don’t structure themselves like Biological ones.
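
The “low diversity, high frequency, often starts lines” test can be written down as a small heuristic. The sketch below is one hypothetical way to do it, not the repo’s implementation: the profile fields, the thresholds (0.5 for both), the role labels, and the toy tokens are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict

def cluster_profiles(lines, word_to_cluster):
    """Per-cluster token count, diversity, and line-start rate."""
    tokens = Counter()        # how many tokens fall in each cluster
    starts = Counter()        # how many of those open a line
    vocab = defaultdict(set)  # distinct words seen in each cluster
    for line in lines:
        words = [w for w in line.split() if w in word_to_cluster]
        for position, word in enumerate(words):
            cid = word_to_cluster[word]
            tokens[cid] += 1
            vocab[cid].add(word)
            if position == 0:  # first known word on the line
                starts[cid] += 1
    return {
        cid: {
            "tokens": tokens[cid],
            "diversity": len(vocab[cid]) / tokens[cid],
            "line_start_rate": starts[cid] / tokens[cid],
        }
        for cid in tokens
    }

def guess_role(profile, max_diversity=0.5, min_start_rate=0.5):
    """Label a cluster; both thresholds are arbitrary assumptions."""
    if (profile["diversity"] <= max_diversity
            and profile["line_start_rate"] >= min_start_rate):
        return "function-like"
    return "root-like"

# Toy data: tokens and cluster IDs are invented for this example.
word_to_cluster = {"qo": 8, "ched": 3, "shed": 3, "kar": 3}
lines = ["qo ched", "qo shed", "qo kar"]

profiles = cluster_profiles(lines, word_to_cluster)
roles = {cid: guess_role(p) for cid, p in profiles.items()}
print(roles)  # {8: 'function-like', 3: 'root-like'}
```

The labels are only as good as the thresholds, so it is worth sweeping them and checking whether the same clusters keep landing in the same role.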

This doesn’t mean we’ve cracked anything.
It just means we can stop pretending the text is random.

What This Isn’t

I didn’t train a model to “translate” Voynichese.
I didn’t guess what words mean.
I didn’t invent a cipher model or claim it’s a conlang.

This is not about breaking the code. It’s about modeling the bones underneath it — and seeing whether those bones suggest a brain.

Why Suffixes Matter

One of the biggest variables in all this was the suffix-stripping step.

By removing the most common endings from each word, I made an assumption: that those endings were structurally or phonetically repetitive — not semantic. It definitely helped the clustering. But it might’ve thrown out meaningful information too.
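
A rough way to see what this step does to the vocabulary is to strip the endings and count how many distinct forms collapse together. The sketch below uses only the three suffixes named in this post; the actual suffix list, the longest-match-first rule, and the `min_root` guard are assumptions for the example, and the word list is a toy sample.

```python
# Suffixes named in the post; the full list used in the repo may differ.
SUFFIXES = ("aiin", "chy", "dy")

def strip_suffix(word, suffixes=SUFFIXES, min_root=1):
    """Remove the longest matching suffix, keeping at least min_root characters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_root:
            return word[: -len(suffix)]
    return word

# Toy sample of transliterated words, for illustration only.
words = ["qokaiin", "qokdy", "chedy", "shedy", "daiin", "ol"]
roots = [strip_suffix(w) for w in words]

print(roots)                             # ['qok', 'qok', 'che', 'she', 'd', 'ol']
print(len(set(words)), len(set(roots)))  # 6 5
```

Here two distinct words merge into the single root qok. That merging is exactly what helps the clustering, and it is also where any meaning carried by the endings would be lost.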

If you fork the repo and rerun the pipeline without stripping suffixes, I’d love to see what changes.

Why I Did This

This started as a way to learn NLP by throwing it at a hard, unsolved problem.
I didn’t expect a result. I expected to flounder.

Instead, I found what I think is evidence that the Voynich Manuscript:

Has structure.
Behaves like a language.
Changes syntax depending on content.

Is that proof of meaning? No. But it’s a solid case for structure — and that’s step one in any linguistic model.

The Code

Everything’s open and annotated here:
github.com/brianmg/voynich-nlp-analysis

If you’re a linguist, a cryptographer, or just someone who likes weird puzzles and reproducible pipelines — have at it.

Feedback is welcome. Corrections are even more welcome.