Measuring the platonic representation

The platonic representation hypothesis says capable models converge toward one shared picture of reality. A language-anchored multimodal encoder lets you test a sharp version of it: encode one concept through four senses and measure what they agree on.

The platonic representation hypothesis is the claim that sufficiently capable models, trained on different data and in different modalities, converge toward the same internal picture of reality. If it holds, the structure a network learns is less an artefact of its training set and more a property of the world the data came from. It’s a big claim, and most of the evidence for it is correlational: line up two models, measure how similar their representations are, note that the number keeps rising as models get more capable.

A language-anchored multimodal encoder lets you test a sharper, more local version of the same idea. Take one concept - “thunder” - and encode it four ways: the word, an image of a storm, an audio clip, a short video. Each is a different sense arriving at the same shared space. The hypothesis predicts they should land in roughly the same place. The question is what “roughly” means, and where the disagreement lives.
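The crudest reading of "roughly the same place" is pairwise cosine similarity between the four embeddings. Here is a minimal sketch of that baseline, assuming each modality has already been encoded into a non-zero vector in the shared space; the function name `pairwise_cosine` and the dict-of-vectors input are illustrative, not taken from the project's code.

```python
import numpy as np

def pairwise_cosine(embeddings):
    """Cosine similarity between every pair of modality embeddings.

    `embeddings` maps a modality name to a 1-D vector in the shared space.
    Assumes no vector is all zeros (the norm would be 0).
    Returns the sorted names and a symmetric similarity matrix.
    """
    names = sorted(embeddings)
    mat = np.stack([np.asarray(embeddings[n], dtype=float) for n in names])
    unit = mat / np.linalg.norm(mat, axis=1, keepdims=True)  # L2-normalise rows
    return names, unit @ unit.T

# Toy 2-D vectors standing in for real encoder outputs.
names, sim = pairwise_cosine({
    "text": [1.0, 0.0],
    "audio": [0.0, 1.0],
    "image": [1.0, 1.0],
})
```

Each entry of `sim` is one number per pair of modalities, which says how close they are but not what they agree on.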

Decompose, then compare

A dense cosine similarity between the four embeddings gives you a single blurry number. Factor each embedding into a sparse decomposition first - a handful of named features from a sparse autoencoder - and the comparison gets legible. Now you can ask not just how much the four senses agree, but which features they share and which each one carries alone.
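To make "factor each embedding into a sparse decomposition" concrete, here is a rough sketch, assuming a sparse autoencoder with a plain ReLU encoder and a top-k cut-off. The names `active_features`, `w_enc`, `b_enc` and the default `k` are hypothetical; the actual encoder may be parameterised differently.

```python
import numpy as np

def active_features(embedding, w_enc, b_enc, k=8):
    """Indices of the (at most) k strongest positive SAE features.

    embedding: shape (d,)   vector from the shared space
    w_enc:     shape (d, m) assumed encoder weights
    b_enc:     shape (m,)   assumed encoder bias
    """
    acts = np.maximum(np.asarray(embedding) @ w_enc + b_enc, 0.0)  # ReLU
    top = np.argsort(acts)[::-1][:k]  # strongest activations first
    # Drop features that did not fire at all.
    return {int(i) for i in top if acts[i] > 0.0}

# Toy encoder: identity weights, so each feature just copies one input dimension.
w = np.eye(4)
b = np.zeros(4)
features = active_features(np.array([0.5, -1.0, 2.0, 0.0]), w, b, k=2)
```

Returning a set of feature indices rather than the activation values is a deliberate simplification: it throws away magnitude, but it makes the later comparison a matter of plain set operations.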

The intersection is the modality-agnostic concept: the features that fire whether thunder arrives as a word, a picture, a sound, or a video. That shared set is the closest thing the system has to a platonic core - the part of “thunder” that survives the change of sense. The per-modality remainder is what each sense contributes on top: the audio decomposition carries low-frequency and temporal features the word never touches; the image carries visual-storm features the sound can’t express.
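With each modality reduced to a set of features, the core and the remainders are ordinary set operations. This is a sketch under that assumption; the feature names below are made up for illustration and are not real autoencoder labels, and "remainder" is taken here to mean everything a modality fires beyond the shared core.

```python
def core_and_remainders(features_by_modality):
    """Split feature sets into a shared core and per-modality remainders.

    core:       features present in every modality
    remainders: for each modality, its features minus the core
    """
    sets = {m: set(f) for m, f in features_by_modality.items()}
    if not sets:
        return set(), {}
    core = set.intersection(*sets.values())
    remainders = {m: s - core for m, s in sets.items()}
    return core, remainders

# Hypothetical feature names for the concept "thunder".
thunder = {
    "text": {"storm", "loud", "weather_word"},
    "image": {"storm", "loud", "dark_sky"},
    "audio": {"storm", "loud", "low_frequency", "rumble"},
    "video": {"storm", "loud", "dark_sky", "flash"},
}
core, remainders = core_and_remainders(thunder)
```

In this toy input the core is `{"storm", "loud"}` and the audio remainder is `{"low_frequency", "rumble"}`. A stricter variant would keep only features unique to one modality, which would drop `dark_sky` from both the image and the video remainder.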

The delta is the carve-out

That remainder is not noise to be averaged away. It is the measurement. The features audio adds and text lacks are a direct readout of what hearing thunder encodes that naming it does not. Cross-modal agreement on the feature sets - a Jaccard overlap, concept by concept - turns the platonic hypothesis from a vibe into a number you can compute per concept and compare across them.
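The Jaccard overlap of two feature sets is the size of their intersection divided by the size of their union. A minimal sketch of the per-concept agreement score follows, assuming feature sets like those described above; the function names and the choice to average over all modality pairs are illustrative, not necessarily what the repository does.

```python
from itertools import combinations

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|. Two empty sets count as full agreement (a convention)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)

def pairwise_agreement(features_by_modality):
    """Jaccard overlap for every unordered pair of modalities."""
    return {
        (m1, m2): jaccard(features_by_modality[m1], features_by_modality[m2])
        for m1, m2 in combinations(sorted(features_by_modality), 2)
    }

def mean_agreement(features_by_modality):
    """One number per concept: the average over all modality pairs."""
    scores = pairwise_agreement(features_by_modality)
    return sum(scores.values()) / len(scores) if scores else 1.0

# Hypothetical feature indices for one concept.
concept = {
    "text": {1, 2, 3},
    "image": {2, 3, 4},
    "audio": {2, 3, 5, 6},
    "video": {2, 3, 4, 7},
}
scores = pairwise_agreement(concept)
```

Keeping the pairwise table alongside the mean matters: two concepts can share an average while one is dragged down by a single modality and the other disagrees everywhere.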

And it cuts both ways, which is what makes it honest. High overlap across modalities is evidence for a convergent core: the senses really are arriving at one representation. Low overlap is evidence against it for that concept - a sign the shared space is held together more by training than by reality, with each modality keeping its own private structure under a thin coat of alignment. Either result tells you something. The experiment is worth running precisely because it can come back negative.

The code is on GitHub; the anchor-compare command encodes one concept through each modality and reports the agreement, so the carve-out is something you can read off a single run.