Mechanistic interpretability as generative art
If a network has learned a concept, that concept is a location in its embedding space. Steer a generator toward that location and you don't get a diagram of what the model knows - you get a picture of it.
A neural network that has learned the concept “ocean” has not learned a definition. It has learned a location - a point, or more honestly a region, in a high-dimensional space where the things it associates with oceans cluster together. Interpretability research usually treats that fact as something to diagram: probe the space, label the axes, write the paper. I became more interested in a different question. If the concept is a place, what does it look like to go there?
That question turned into a project. It optimises generative outputs - images, audio, video - toward target coordinates in LanguageBind’s 768-dimensional multimodal embedding space, doing concept algebra across modalities to surface the platonic ideals the network has learned. The honest description is that it sits exactly on the seam between research and aesthetics, and that seam is more interesting than either side of it alone.
One space for everything
What makes LanguageBind useful here is that it embeds image, text, audio, and video into a single shared space, anchored on language. A photo of a beach, the word “beach,” and the sound of waves all land near each other, because the model was trained to align them. The space doesn’t care which door a concept came in through. “Ocean” is a region whether you arrived by image, by word, or by sound. Anchoring on language earns one more thing: any point in the space can be projected back toward the text vocabulary, so even a non-text coordinate has a running readout in words. You can ask what the current image embeds as and get back something like {storm, crackling, electric}, which turns the optimisation from a black box into something you can follow as it runs.
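The readout in words can be sketched as a nearest-neighbour lookup: compare the current embedding against a bank of text embeddings and report the closest words. This is a minimal sketch, not the project's code. The function `text_readout`, the six-word vocabulary, and the random vectors standing in for real LanguageBind embeddings are all assumptions made for illustration.

```python
import numpy as np

def text_readout(query, vocab_embeddings, vocab_words, k=3):
    """Return the k vocabulary words most cosine-similar to `query`."""
    q = query / np.linalg.norm(query)
    v = vocab_embeddings / np.linalg.norm(vocab_embeddings, axis=1, keepdims=True)
    sims = v @ q                      # cosine similarity to every word
    top = np.argsort(-sims)[:k]       # indices of the k highest similarities
    return [(vocab_words[i], float(sims[i])) for i in top]

# Stand-in data (assumption): random vectors in place of real text embeddings.
rng = np.random.default_rng(0)
vocab_words = ["storm", "crackling", "electric", "beach", "calm", "forest"]
vocab_embeddings = rng.normal(size=(len(vocab_words), 768))

# A fake "current image" embedding, built to sit near the first three words.
image_embedding = vocab_embeddings[:3].sum(axis=0) + 0.1 * rng.normal(size=768)

readout = text_readout(image_embedding, vocab_embeddings, vocab_words)
print([word for word, _ in readout])
```

Calling something like this every few optimisation steps is what would give the running commentary: the words change as the output drifts through the space.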
That shared geometry is what lets you do something that shouldn’t intuitively work: algebra across senses. Take the embedding of a sound, subtract the embedding of a word, add the embedding of an image, and you land somewhere new - a coordinate that no single input could have named. It’s the classic king − man + woman ≈ queen move from word vectors, except the operands can be a photograph, a field recording, and a sentence, mixed freely.
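In code, the algebra is nothing more than vector addition and subtraction. The sketch below assumes each operand is normalised to unit length first and the result is renormalised afterwards, which is a common choice for cosine-similarity spaces but is my assumption here, not something the model dictates. The helper `concept_algebra` and the random stand-in embeddings are hypothetical.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def concept_algebra(plus, minus=()):
    """Add the `plus` embeddings, subtract the `minus` ones, renormalise."""
    total = np.zeros_like(plus[0], dtype=float)
    for p in plus:
        total += unit(p)
    for m in minus:
        total -= unit(m)
    return unit(total)

# Stand-in data (assumption): in practice these would come from the audio,
# text, and image encoders of the shared embedding model.
rng = np.random.default_rng(1)
sound_embedding = rng.normal(size=768)   # e.g. a field recording
word_embedding = rng.normal(size=768)    # e.g. a single word
image_embedding = rng.normal(size=768)   # e.g. a photograph

# sound - word + image: a coordinate no single input names.
target = concept_algebra(plus=[sound_embedding, image_embedding],
                         minus=[word_embedding])
print(target.shape, float(np.linalg.norm(target)))
```

Because every operand already lives in the same space, the function never needs to know which modality a vector came from.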
Generation as a search for a place
Here’s the move that turns the geometry into pictures. Pick a target coordinate - a concept, or a piece of concept algebra. Then run a generator (a diffusion model for images, the equivalents for audio and video) and optimise its output so that its embedding lands as close as possible to that target. You’re not prompting “draw an ocean.” You’re telling the system: produce something - anything - that this network would file in the same place it files oceans, and let the network be the judge.
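The loop described above can be sketched with a toy stand-in. A real pipeline would backpropagate through a generator and the embedding model. Here both are replaced by a single frozen random linear map, so the gradient can be written by hand. Everything named below (`embed`, `steer`, the 64-dimensional space, the learning rate and step count) is an assumption for illustration. Only the shape of the procedure is the point: embed the output, score it against the target by cosine similarity, nudge the output uphill, repeat.

```python
import numpy as np

EMB_DIM = 64     # shrunk stand-in for the 768-dim space, so the toy runs fast
OUT_DIM = 1024   # stand-in for the generator's output (pixels, latents, samples)

rng = np.random.default_rng(0)
# Frozen toy "encoder" (assumption): a random linear map, not a real model.
W = rng.normal(size=(EMB_DIM, OUT_DIM)) / np.sqrt(OUT_DIM)

def embed(x):
    """Map a raw output vector into the toy embedding space."""
    return W @ x

def cosine_and_grad(x, target):
    """Cosine similarity between embed(x) and a unit `target`, plus its gradient in x."""
    z = embed(x)
    z_norm = np.linalg.norm(z)
    cos = float(z @ target) / z_norm
    grad_z = target / z_norm - cos * z / z_norm**2   # d cos / d z
    return cos, W.T @ grad_z                          # chain rule back to x

def steer(target, steps=500, lr=0.2):
    """Gradient ascent on cosine similarity, starting from a random output."""
    target = target / np.linalg.norm(target)
    x = 0.1 * rng.normal(size=OUT_DIM)
    history = []
    for _ in range(steps):
        cos, grad = cosine_and_grad(x, target)
        history.append(cos)
        x = x + lr * grad
    return x, history

target = rng.normal(size=EMB_DIM)
x, history = steer(target)
print(f"similarity at start: {history[0]:.3f}, at end: {history[-1]:.3f}")
```

Note what the loop never sees: a prompt, a label, or a reference image. The only supervision is the judge's own notion of where the output landed.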
The output is not an illustration of the concept. It’s the generator’s best attempt to occupy the coordinate. Sometimes that’s a recognisable beach. More often it’s stranger and more revealing: the features the network most strongly associates with the region, rendered without the constraint of looking like any real photograph. You are seeing the concept the way the model holds it, not the way the world presents it.
Why this is interpretability, not just a filter
It would be easy to dismiss this as a stylised image generator. The reason it’s interpretability is that the output is diagnostic. When you steer toward a concept and the result is surprising - when “trust” renders as something you wouldn’t have predicted, or when an audio target produces an image whose logic only makes sense once you hear the sound - you’ve learned something concrete about how the network organises that region of its space. The art is the readout.
Concept algebra is where this gets sharpest. If summer − heat lands somewhere coherent, the model has factored those two ideas apart cleanly. If it lands in noise, it hasn’t - the concepts are entangled in a way the geometry won’t let you separate. The generated output makes that legible in a way a cosine-similarity table never does. You can see whether the model’s internal world is well-organised, and where it isn’t.
This is the same instinct that drives the better interpretability work in the field - the feature-visualisation lineage, the “what is this neuron looking for” question - pointed at the multimodal case and pushed until the answer is an image.
Why the aesthetics aren’t decoration
I want to defend the art half directly, because the reflex is to treat it as a decorative wrapper on the research. It isn’t decorative. It’s the most faithful available rendering of an object - a learned concept - that has no native visual form.
A concept in embedding space is genuinely 768-dimensional. Any honest depiction of it has to throw most of those dimensions away. A scatter plot throws away all but two and asks you to trust the projection. A generated image throws away fewer, because the generator is optimised against the full target coordinate, every dimension at once. The image is a lossy compression of the concept, but it’s a richer lossy compression than the chart - and it carries information the chart structurally cannot. The aesthetics are doing epistemic work.
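To put a rough number on what a scatter plot discards, the sketch below measures how much variance the best possible two-dimensional linear projection (the top two principal components) keeps. The data is an assumption: 200 random, structureless points in 768 dimensions. Real embeddings have more structure and would keep more, so treat the figure as an illustration of the mechanism, not a measurement of LanguageBind.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data (assumption): 200 structureless points in a 768-dim space.
points = rng.normal(size=(200, 768))
centred = points - points.mean(axis=0)

# Singular values come back sorted in descending order; their squares are
# proportional to the variance along each principal component.
singular_values = np.linalg.svd(centred, compute_uv=False)
variance = singular_values**2

# Share of total variance that survives in the best 2-D linear projection.
retained = variance[:2].sum() / variance.sum()
print(f"variance kept by the best 2-D projection: {retained:.1%}")
```

The same calculation run on real embeddings of a concept's neighbourhood would say how much to trust any chart of it.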
There’s also a claim hiding in the word “platonic.” If many different networks, trained on different data, converge on similar internal structure - and there’s a growing body of evidence they do - then steering toward a concept and rendering it isn’t just visualising one model’s quirks. It’s getting at something closer to the shape the concept takes whenever a system this size learns it from the world. That’s a stronger claim than “pretty pictures from a model,” and it’s the one the work is actually making.
The seam is the point
The cleanest framing I have: most interpretability tells you that a model represents a concept and roughly where. This tries to show you what it’s like for the model to represent it. That’s a question with a research answer and an aesthetic answer at the same time, and refusing to separate them is deliberate. The picture is the finding.
The code is on GitHub if you want to steer toward your own coordinates and see what the network thinks lives there.