Notes
2026/05/12

Hypertext and grammar-based compression

What if we took the idea of linking together shared parts of a text to its extreme conclusion and built a hypertext system that links together every text fragment that occurs more than once, even for text fragments smaller than single words? So that e.g. apple pie and apple juice are interlinked because they both contains apple, but so are mismatch and matching because they both contain match.

(This is mostly an intellectual exercise which might be interesting from a linguistic perspective and perhaps has some similarities to the analysis proposed by Hjelmslev in his “Prolegomena to a Theory of Language”. Linking together every shared fragment of text would result in a data structure with so many indirection through pointers that it's most likely relatively useless in practice.)

The starting point is grammar-based compression. The most well known examples of algorithms are Sequitur and Re-Pair, which both compress text as a grammar.

In practice, Re-Pair seems to be faster and thus used more than Sequitur. However, as is pointed out in the paper The Smallest Grammar Problem, Sequitur has a property that Re-Pair lacks and which makes it interesting for a potential hypertext system: In contrast to Sequitur, Re-Pair is a global algorithm, which needs to process the entire input text and cannot start producing the final grammar until the entire input text is known. Sequitur doesn't have this limitation and can build a grammar piece by piece, before the entire input text is known.

Associative grammars

For our hypothetical hypertext system, we want the ability to make edits to texts in any order and in parallel, and still end up with the same grammar as if we had used a different order and processed everything sequentially. It is important to point out that “any order” refers to the order of building up texts from smaller fragment, but it does not mean that the order of the text fragments doesn't matter: Building up the text “abc” by adding “a” to the front of “bc” should be the same as adding “c” to “ab”, but adding “a” to the front of “bc” is very different from adding “a” to the back of “bc”.

In other words, we want our grammar-based compression to be associative (but not commutative), so that building up a hypertext from the texts (a + b) + c is the same as building it up as a + (b + c). Re-Pair is not a good fit, because it requires the full text to be known in advance. Sequitur on the other hand allows us to process a text piece by piece.

Uniqueness of a grammar

There is still a problem though. Lets say we have the text “abab”, which can be represented using a grammar as two times “ab”. If we now add “a” to the end of it, we would expect the text to be represented as two times “ab” + “a”. But what if we start with “baba” and then add “a” to the front? We would expect “baba” to be represented using a grammar as two times “ba”, and, after adding the “a” to the front, the entire text as “a” + two times “ba”. But now we have two separate ways of representing “ababa”!

    +--+--+     +--+--+
a + |ba|ba| -> a|ba|ba|
    +--+--+     +--+--+

+--+--+        +--+--+
|ab|ab| + b -> |ab|ab|a
+--+--+        +--+--+

Lets instead write [...] as a notation for shared fragments:

a + [ba][ba] -> a[ba][ba]

[ab][ab] + b -> [ab][ab]b

This is a contrived example, but there are many practical examples: Lets say we have the three texts “motor”, “motel”, and “hotel”. We would end up with different grammars depending on whether we combine “motor” and “motel” first, or “motel” and “hotel” first:

                           +----+     +----+
                           | +--+     | +--+--+    +--+--+
(motor + motel) + hotel -> |m|ot|or + |m|ot|el| + h|ot|el|
                           | +--+     | +--+--+    +--+--+
                           +----+     +----+

                                      +-----+    +-----+
                            +--+      +--+  |    +--+  |
motor + (motel + hotel) -> m|ot|or + m|ot|el| + h|ot|el|
                            +--+      +--+  |    +--+  |
                                      +-----+    +-----+

Or using the more succinct [...] notation:

motor + motel -> [mot]or + [mot]el

(motor + motel) + hotel -> [m[ot]]or + [m[ot]][el] +  h[ot][el]

motel + hotel -> m[otel] + h[otel]

motor + (motel + hotel) -> m[ot]or + m[[ot]el] + h[[ot]el]

The problem is that we are trying to use a grammar, represented as a textual tree, for something that is a graph, because there are multiple possible minimal grammars for a given text. (We could represent this ambiguity as overlapping [...] and {...}, such as [m{ot]el}, but this doesn't solve the problem of finding an unambiguous data representation using trees.)

How could the uniqueness property be enforced? Here are some options, loosely ranked from least to most promising:

We store all possible minimal grammars at the same time. Easy in principle, but potentially extremely wasteful in practice.
If we ever end up with a text that could have multiple different minmal grammars, we pick one of these options as the canonical normal form, deterministically and independent of how we arrived at the text. This would require us to e.g. “flip” from a[ba][ba] to [ab][ab]a and is potentially expensive because we need to check for ambiguity after a grammar is built.
We could represent parts of the grammar as overlapping “superpositions”, so that “motel” is a superposition of “mot” and “otel”, with neither fragment taking precedence over the other. Instead of being represented as a sequence of grammar rules, texts would then be represented as a sequence of grammar rules with optional offsets so that multiple rules can overlap.

Uniqueness, utility, and unambiguity

We can now summarize the above as three properties. The first two, uniqueness and utility, are a core part of the Sequitur algorithm, whereas the third one, unambiguity, goes beyond Sequitur.

Uniqueness: No sequence of N or more characters appears twice without being represented by the same grammar rule.
Utility: No grammar rule is used only once (except for top-level grammar rules).
Unambiguity: No matter in which order editing operations are applied, every text has an unambiguous grammar that represents it.

The other significant difference between Sequitur and the hypertext system discussed here is that Sequitur operates only on sequential text, whereas a hypertext system is an unordered collection of texts, with each text potentially built up incrementally as a sequence of editing operations. Both of these cases have been implicitly discussed so far, but it makes sense to call out these differences: A grammar-based hypertext system must support editing “abab” to become “ababa” (sequence of editing operations), as well as adding three separate texts “motor”, “motel”, and “hotel” (unordered collection of texts).

Is such a hypertext system practical? Probably not, but it would be interesting to see how far it can be pushed in practice.