Notes
2026/04/12

Hypertext: similarity and ancestors

In last week's note I sketched out a hypertext system built on implicit links: pages are linked together at the paragraphs they share, in the style of a subway network with transfer hubs.

One of the central ideas is to treat identical paragraphs as being implicitly linked across documents. Identical paragraphs? Yes. So whitespace matters and a single extra space means paragraphs won't be linked? Correct. Here's why:

Linking together identical content does not by itself lead to a great UX, because changes that we might consider incidental will completely break links. But linking together identical content (which we can identify with a content hash) is the basic building block that allows us to build fuzzy ways of linking together content on top of the existing hypertext system.
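
As a rough illustration of this building block, here is a minimal sketch of paragraph-level content hashing. It assumes that paragraphs are separated by blank lines and uses SHA-256 as the hash; both choices, as well as the function names, are assumptions made for this example and not part of the system described in last week's note.

```python
import hashlib
from collections import defaultdict

def paragraphs(text):
    """Split a page into paragraphs on blank lines (an assumed convention)."""
    return [p for p in text.split("\n\n") if p.strip()]

def paragraph_hash(paragraph):
    # Hash the exact text without any normalization, so a single
    # extra space produces a different hash and therefore no link.
    return hashlib.sha256(paragraph.encode("utf-8")).hexdigest()

def implicit_links(pages):
    """Map each paragraph hash to the names of the pages that contain it,
    keeping only hashes shared by at least two pages."""
    index = defaultdict(set)
    for name, text in pages.items():
        for p in paragraphs(text):
            index[paragraph_hash(p)].add(name)
    return {h: names for h, names in index.items() if len(names) > 1}

pages = {
    "a": "Shared.\n\nOnly in a.",
    "b": "Shared.\n\nOnly in b.",
    "c": "Shared. \n\nOnly in c.",  # trailing space: not linked
}
links = implicit_links(pages)
```

Here `links` contains a single entry that connects pages "a" and "b"; page "c" stays unlinked because of the trailing space.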

Do one thing and do it well

Let's say we have the following page:

# A book (draft 2)

The hero ventures into the unknown.

There are many trials along the way.

The hero returns changed.

Now we replace “The hero” with “Our hero” and “the way” with “their way”:

# A book (draft 3)

Our hero ventures into the unknown.

There are many trials along their way.

Our hero returns changed.

Since the two pages don't share any identical paragraphs, no links will be generated by a hypertext system built on content hashing. We could of course try to make our hypertext system smarter, by considering textual similarity based on inverted indexes, or throwing in the towel and outsourcing it to AI, whatever that might mean in practice. But even if we do that, what kind of textual similarity do we actually want? N-gram similarity (bigrams? trigrams? both?), over entire words? Or characters, so that a single letter typo doesn't affect the similarity too much? What if our language doesn't have the concept of whitespace between words, or is highly agglutinative? Do we want to support synonyms for words? Which ones? What if we want to support similarity for non-textual hypermedia?
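
To make one of these choices concrete, here is a small sketch of character trigram similarity using the Jaccard index. It is only one of the many possible notions listed above, and the function names and the default n = 3 are assumptions for illustration.

```python
def char_ngrams(text, n=3):
    """Return the set of all character n-grams in the text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b, n=3):
    """Jaccard index of the character n-gram sets of a and b (0.0 to 1.0)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga and not gb:
        return 1.0  # both texts are shorter than n: treat them as identical
    return len(ga & gb) / len(ga | gb)

score = jaccard("The hero returns changed.", "Our hero returns changed.")
```

For the two versions of the last paragraph, `score` comes out well above 0.5, while a content hash would treat them as completely unrelated.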

Point being, our notion of what we consider similar depends on our content; there is no “one size fits all” solution that would work for hypertext in general. But instead of trying to build all of the different notions into one hypertext system or tool, we can build a general system that is based on content hashing and then let other tools extend it.

This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.

In the example above, an external program could build an inverted index for the two pages and then create extra pages that link similar paragraphs together by including both of them. In addition to the two draft pages, it could spit out the following pages:

# Similar: A book (draft 2), A book (draft 3)

The hero ventures into the unknown.

Our hero ventures into the unknown.

# Similar: A book (draft 2), A book (draft 3)

There are many trials along the way.

There are many trials along their way.

# Similar: A book (draft 2), A book (draft 3)

The hero returns changed.

Our hero returns changed.
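
A similarity tool along these lines could be sketched as follows. Instead of an inverted index, this sketch simply compares every pair of paragraphs with `difflib.SequenceMatcher` from the Python standard library; the threshold of 0.8, the function names and the blank-line page format are assumptions for this example.

```python
from difflib import SequenceMatcher

def split_page(text):
    """Return (title, paragraphs) for a page that starts with a '# ' heading."""
    blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
    title = blocks[0][2:] if blocks[0].startswith("# ") else blocks[0]
    return title, blocks[1:]

def similarity_pages(page_a, page_b, threshold=0.8):
    """Emit one small page per pair of similar (but not identical) paragraphs."""
    title_a, paras_a = split_page(page_a)
    title_b, paras_b = split_page(page_b)
    out = []
    for pa in paras_a:
        for pb in paras_b:
            # Identical paragraphs are already linked by content hashing.
            if pa != pb and SequenceMatcher(None, pa, pb).ratio() >= threshold:
                out.append(f"# Similar: {title_a}, {title_b}\n\n{pa}\n\n{pb}")
    return out

draft2 = ("# A book (draft 2)\n\nThe hero ventures into the unknown.\n\n"
          "There are many trials along the way.\n\nThe hero returns changed.")
draft3 = ("# A book (draft 3)\n\nOur hero ventures into the unknown.\n\n"
          "There are many trials along their way.\n\nOur hero returns changed.")
helper_pages = similarity_pages(draft2, draft3)
```

With the two drafts as input, `helper_pages` holds three pages, one per pair of corresponding paragraphs. The tool only writes pages; it never needs to know how the hypertext system displays links.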

If we now run our original content-based hypertext tool on the two drafts plus the three similarity pages, the similarity pages act as short transfer lines that take us from one draft to the other. Draft 3, for example, would be displayed as follows:

# A book (draft 3)

Our hero ventures into the unknown.
  +------------------------------------------------------+
  | -> [...] Similar: A book (draft 2), A book (draft 3) |
  +------------------------------------------------------+

There are many trials along their way.
  +------------------------------------------------------+
  | -> [...] Similar: A book (draft 2), A book (draft 3) |
  +------------------------------------------------------+

Our hero returns changed.
  +------------------------------------------------------+
  | -> [...] Similar: A book (draft 2), A book (draft 3) |
  +------------------------------------------------------+

The titles of the similarity pages aren't very informative. Instead, we could change our similarity tool (while keeping it completely independent from our hypertext tool) and have it produce titles of the form “The/Our hero ventures into the unknown.”, which would immediately provide some context:

# A book (draft 3)

Our hero ventures into the unknown.
  +-----------------------------------------------------+
  | -> [...] The/Our hero ventures into the unknown.    |
  +-----------------------------------------------------+

There are many trials along their way.
  +-----------------------------------------------------+
  | -> [...] There are many trials along the/their way. |
  +-----------------------------------------------------+

Our hero returns changed.
  +-----------------------------------------------------+
  | -> [...] The/Our hero returns changed.              |
  +-----------------------------------------------------+

Again, the hypertext tool can't know whether showing the changes right in the title would be a good idea or not, because it depends on the content. It might work for short lines, but will quickly break down for longer paragraphs. Or perhaps not? Perhaps having verbose links there is fine, because there will be few links overall? Either way, the hypertext system provides the foundation, whereas the actual structure is up to whoever is providing the content.
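
Titles of the form “The/Our hero ventures into the unknown.” could be produced by a word-level diff. The following is a minimal sketch that uses `difflib.SequenceMatcher` on lists of words; the function name `merged_title` and the choice to split on whitespace are assumptions, and the latter would not suit languages without spaces between words.

```python
from difflib import SequenceMatcher

def merged_title(old, new):
    """Merge two paragraphs into one title, showing changed words as old/new."""
    a, b = old.split(), new.split()
    parts = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if tag == "equal":
            parts.extend(a[i1:i2])
        else:
            # For pure insertions or deletions one side is empty,
            # which yields "/word" or "word/".
            parts.append(" ".join(a[i1:i2]) + "/" + " ".join(b[j1:j2]))
    return " ".join(parts)

title = merged_title("The hero ventures into the unknown.",
                     "Our hero ventures into the unknown.")
# title == "The/Our hero ventures into the unknown."
```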

Linking to unique ancestors

While this is a decent foundation, we can do better. We started with two pages that had similar text but weren't linked because the paragraphs weren't identical. Now we have “helper” pages that establish links between similar paragraphs. But the links that are displayed on the page for draft 3 don't point to draft 2 (which is what we wanted), they point to the helper pages, which then point to draft 2.

In a way, the helper pages “braid” the pages for draft 2 and draft 3 together in a very regular fashion, with paragraph 1 of draft 2 being linked to paragraph 1 of draft 3, same for paragraphs 2 and 3. If we viewed these links as links between drafts 2 and 3 (basically skipping the intermediary “transfer hub” established by each helper page), we would see that there is an unbroken 1:1 correspondence between the drafts.

Can we capture the notion of helper pages (that we can skip when displaying links) in our hypertext system? As something that we build into the foundation of the system, because it affects how content-based links are displayed?

We could try to distinguish helper pages from regular pages by looking at the markup and deciding, for example, that pages with titles (in our example # ... headings at the beginning of the document) are treated as regular pages, whereas pages that start without a title are treated as helpers. When displaying the links, we walk up from helper pages to their “parents”, defined as the pages that link to or include the helper. But what if a helper page has multiple parents? And do we really want to couple links this tightly to our markup?

Here's a better option: A page is considered to be a helper page when it has a unique parent that links to it directly by including its first paragraph. We then change our similarity tool to generate one page that “braids” the similar paragraphs together in addition to the three helper pages:

# Similar: A book (draft 2), A book (draft 3)

A book (draft 2), A book (draft 3), paragraph 1

A book (draft 2), A book (draft 3), paragraph 2

A book (draft 2), A book (draft 3), paragraph 3

A book (draft 2), A book (draft 3), paragraph 1

The hero ventures into the unknown.

Our hero ventures into the unknown.

A book (draft 2), A book (draft 3), paragraph 2

There are many trials along the way.

There are many trials along their way.

A book (draft 2), A book (draft 3), paragraph 3

The hero returns changed.

Our hero returns changed.
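
The helper rule (a unique parent that includes the page's first paragraph) can be sketched in a few lines. Pages are modeled here as lists of paragraphs keyed by a short name; this data model, the page names and all function names are assumptions for illustration, not the actual implementation.

```python
def parents(name, pages):
    """Pages (other than `name`) that include the first paragraph of `name`."""
    first = pages[name][0]
    return [other for other, paras in pages.items()
            if other != name and first in paras]

def unique_ancestor(name, pages):
    """Walk up while the current page is a helper, i.e. has exactly one parent."""
    seen = {name}
    while True:
        ps = parents(name, pages)
        if len(ps) != 1 or ps[0] in seen:  # not a helper, or a cycle
            return name
        name = ps[0]
        seen.add(name)

def displayed_links(name, pages):
    """For each paragraph, list the ancestors of the pages that share it."""
    result = []
    for para in pages[name]:
        targets = {unique_ancestor(other, pages)
                   for other, paras in pages.items()
                   if other != name and para in paras}
        targets.discard(name)
        result.append((para, sorted(targets)))
    return result

prefix = "A book (draft 2), A book (draft 3), paragraph "
pages = {
    "draft3": ["# A book (draft 3)", "Our hero ventures into the unknown."],
    "braid": ["# Similar: A book (draft 2), A book (draft 3)", prefix + "1"],
    "helper1": [prefix + "1", "The hero ventures into the unknown.",
                "Our hero ventures into the unknown."],
}
links = displayed_links("draft3", pages)
```

In this reduced example, the first paragraph of draft 3 resolves to "braid" and not to "helper1", because "helper1" has "braid" as its unique parent.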

When we now display links on the page of draft 3, the hypertext system can walk up from the helper pages and show the links in terms of the unique ancestor. Now that several paragraphs in the page of draft 3 are linked to the same ancestor page “Similar: A book (draft 2), A book (draft 3)”, we can see that the three paragraphs of draft 3 link to neighboring paragraphs in the similarity document. In other words, the linked paragraphs are consecutive, and we can indicate this by displaying the first link (the incoming branch) more prominently than the other two:

# A book (draft 3)

Our hero ventures into the unknown.
  +------------------------------------------------------+
  | -> [...] Similar: A book (draft 2), A book (draft 3) |
  +------------------------------------------------------+

There are many trials along their way.
    -> Similar: A book (draft 2), A book (draft 3)

Our hero returns changed.
    -> Similar: A book (draft 2), A book (draft 3)
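
Deciding which link gets the prominent box could work roughly as in the sketch below. It assumes that each paragraph's link has already been resolved to a pair of ancestor page and position within that ancestor; the names `mark_runs`, "branch" and "continuation" are made up for this example.

```python
def mark_runs(links):
    """links has one entry per paragraph: None, or (ancestor, position).

    A link that continues the previous paragraph's link (same ancestor,
    next position) is marked "continuation"; any other link is a "branch".
    """
    styles = []
    prev = None
    for link in links:
        if link is None:
            styles.append(None)
        elif prev is not None and link[0] == prev[0] and link[1] == prev[1] + 1:
            styles.append("continuation")
        else:
            styles.append("branch")
        prev = link
    return styles

similar = "Similar: A book (draft 2), A book (draft 3)"
styles = mark_runs([None, (similar, 1), (similar, 2), (similar, 3)])
# styles == [None, "branch", "continuation", "continuation"]
```

The title of draft 3 has no link, the first paragraph starts the run and would get the box, and the other two are shown as plain continuations.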

Bottom line: The combination of implicit content-based links and displaying them based on unique ancestors makes it possible to extend the system with other tools (each with their own notion of when and how content should be linked) without changing the foundation.