Multimodal embedding art: concept algebra in a shared representation space

Optimising generative models toward coordinates in a 768-dimension multimodal embedding space: surfacing the platonic ideals a neural net has learned, then doing arithmetic on them across text, image, audio, and video.

Independent research · 2026


Context

A text-to-image model can draw you a goldfish. That’s a useful trick and a dull question. The more interesting one sits underneath it: what does goldfish look like to the model itself - not a goldfish, but the direction in representation space that means goldfish, cranked to maximum?

Neural networks learn internal representations of concepts. A multimodal encoder learns them in a space shared across modalities, so the embedding for the word “thunder,” an audio clip of thunder, and an image of a storm all land near each other. That shared space is a map of what the network thinks concepts are. This project is an attempt to render points on that map directly, by treating generation as an optimisation problem in embedding space rather than a prompt.

It sits where mechanistic interpretability meets generative art. The interpretability question (what has the network actually learned, and can we see it) is the same question whether you frame the output as a diagnostic or as an image worth looking at. The aesthetic and the analytic are the same act here: maximally activating an internal representation and looking at what comes out.

Honest versus natural. A hand-built evocation, not a system output: the honest side holds the representation the model optimises toward; the natural side resolves it under a diffusion prior toward a recognisable image. The gap between them is the subject.

Decision

The framing decision was to optimise toward embedding coordinates rather than condition on a prompt.

A normal generative pipeline takes a prompt, runs it through the model, and returns a sample. The concept lives in the prompt; you never touch the representation. The decision here was to invert that: pick a target point in the embedding space, then optimise a generator’s latent to maximise the similarity between its output’s embedding and that target. The concept becomes a coordinate you can manipulate, not a sentence you have to phrase.

That choice is what makes concept algebra possible. Once a concept is a vector, you can do arithmetic on it: add two concepts, subtract an attribute, interpolate between two points. You can’t cleanly add “fire” and “water” in a prompt; you can add their embeddings. The output isn’t training data retrieved and blended; it’s the model’s own answer to “what lies at this coordinate,” reconstructed from scratch by optimisation.

The encoder is LanguageBind, which binds image, audio, video, and text into one 768-dimension space anchored on language. A target can be assembled from any mix of those modalities, and a generator for one modality can be optimised toward a target defined in another. That cross-modal property is the whole reason to use a shared space rather than a per-modality one. Anchoring on language earns a second thing for free: any point in the space can be projected back toward the text encoder’s vocabulary, so a non-text embedding has a running human-readable readout. Mid-optimisation you can ask “what does the current image embed as,” and get back {storm, crackling, electric}. That single property is what turns the optimisation loop from a black box into something you can watch think. (The original build used ImageBind in a 1024-dimension space; LanguageBind replaces it, covers the four modalities used here, and exposes ViT patch tokens natively, which the loss now leans on.)
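The readout idea can be sketched as a nearest-neighbour lookup: score the current embedding against a table of word embeddings and report the closest few. This is a minimal sketch, not the project's code. The function name `text_readout`, the five-word vocabulary, and the 3-dimensional vectors are all made up for illustration; the real space is 768-dimensional and the word vectors would come from the text encoder.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def text_readout(embedding, vocab, k=3):
    """Return the k vocabulary words whose embeddings are closest by cosine."""
    ranked = sorted(vocab, key=lambda word: cosine(embedding, vocab[word]), reverse=True)
    return ranked[:k]

# Hypothetical toy vocabulary; the numbers are invented, not encoder output.
VOCAB = {
    "storm":     [0.9, 0.1, 0.0],
    "electric":  [0.7, 0.6, 0.1],
    "crackling": [0.6, 0.3, 0.4],
    "goldfish":  [0.0, 0.1, 0.9],
    "meadow":    [0.1, 0.9, 0.2],
}

# Stand-in for "what the current image embeds as" part-way through a run.
current = [0.8, 0.3, 0.1]
words = text_readout(current, VOCAB)
```

Calling this every few steps gives the running readout described above; with a real vocabulary the table would be precomputed once and the lookup done as a single matrix product.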

The deeper representation decision is that a concept is not really the dense vector; it’s a sparse feature decomposition, and the vector is derived from it. A Matryoshka sparse autoencoder factors each embedding into a handful of named, active features. That matters because arithmetic in dense space is blunt: adding two normalised vectors and renormalising averages every dimension, and subtraction is a vague nudge. Arithmetic over decompositions is sharp: addition unions the named features, subtraction removes a specific one, interpolation morphs the feature mixture. So a Concept carries its decomposition as primary state and composes in feature space whenever both operands have one. “Fire plus water” stops being a blended average and becomes a union of what the model knows about each.
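To make the three feature-space operations concrete, here is a minimal sketch that treats a decomposition as a dictionary from feature name to activation. Everything in it is an assumption for illustration: the function names, the feature names, the activations, and the choice to sum activations on overlap and drop features that fall to zero. The project's own composition rules may differ.

```python
def add_features(a, b):
    """Union of named features; activations sum where both operands share one."""
    out = dict(a)
    for name, weight in b.items():
        out[name] = out.get(name, 0.0) + weight
    return out

def subtract_features(a, b, scale=1.0):
    """Remove b's features from a; anything reduced to zero or below is dropped."""
    out = {}
    for name, weight in a.items():
        remaining = weight - scale * b.get(name, 0.0)
        if remaining > 0.0:
            out[name] = remaining
    return out

def interpolate_features(a, b, t):
    """Morph the mixture: each feature's activation moves linearly from a to b."""
    names = sorted(set(a) | set(b))
    return {n: (1.0 - t) * a.get(n, 0.0) + t * b.get(n, 0.0) for n in names}

# Hypothetical decompositions; real ones would come from the sparse autoencoder.
fire = {"flame": 1.0, "heat": 0.5}
water = {"liquid": 1.0, "ripple": 0.5}
dog = {"canine": 1.0, "fur": 0.75}
fur = {"fur": 1.0}

fire_water = add_features(fire, water)       # all four features survive intact
hairless = subtract_features(dog, fur)       # only "canine" is left
halfway = interpolate_features(fire, water, 0.5)
```

The contrast with dense arithmetic is that nothing here touches features the operands do not name: subtracting fur leaves the canine activation exactly where it was.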

What I built

A system that takes any combination of text, image, audio, and video as input concepts, combines them with feature-space arithmetic, and optimises a generative model’s latent toward the target. One command, showcase, renders a single target across all four modalities at once and emits an interpretation bundle and an evaluation card alongside the artefacts.

Generation runs through modality-specific decoders: the Stable Diffusion 3.5 Medium latent for images, Stable Audio Open for audio, and LTX-Video for video. Optimisation is gradient descent on the generator’s latent, with a default of 2000 steps at a learning rate of 0.1. (The original build paired ImageBind with the SDXL VAE, AudioLDM 2, and Stable Video Diffusion; those still exist as legacy ablation backbones, but the flow-matching SD3.5 latent and the newer audio and video models are sharper and the licences are cleaner.)
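The optimisation loop itself is small. The sketch below runs it on a toy problem in plain Python so it has no dependencies: a fixed 2×3 matrix stands in for the whole "decode the latent, then encode the output" pipeline, and the loss is one minus cosine similarity to the target. The matrix, the target, the starting latent, and the function name are assumptions for illustration; only the defaults of 2000 steps and a learning rate of 0.1 come from the text. The real system would get its gradients from autograd through the decoder and encoder rather than from a hand-written formula.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def optimise_latent(pipeline, target, z0, steps=2000, lr=0.1):
    """Gradient descent on loss = 1 - cos(pipeline @ z, target) over the latent z."""
    z = list(z0)
    t_norm = math.sqrt(dot(target, target))
    losses = []
    for _ in range(steps):
        y = [dot(row, z) for row in pipeline]   # toy embedding of the output
        y_norm = math.sqrt(dot(y, y))
        sim = dot(y, target) / (y_norm * t_norm)
        losses.append(1.0 - sim)
        # d(1 - cos)/dy = -(t / (|y||t|) - cos * y / |y|^2)
        grad_y = [-(t / (y_norm * t_norm) - sim * yi / y_norm ** 2)
                  for t, yi in zip(target, y)]
        # Chain rule through the linear map: grad_z = pipeline^T @ grad_y
        grad_z = [sum(row[j] * g for row, g in zip(pipeline, grad_y))
                  for j in range(len(z))]
        z = [zj - lr * gj for zj, gj in zip(z, grad_z)]
    return z, losses

# Hypothetical stand-ins: a linear "generator + encoder" and a target direction.
PIPELINE = [[1.0, 0.0, 0.5],
            [0.0, 1.0, 0.5]]
TARGET = [0.0, 1.0]

z, losses = optimise_latent(PIPELINE, TARGET, z0=[1.0, 0.0, 0.0])
final_embedding = [dot(row, z) for row in PIPELINE]
```

The starting latent embeds at right angles to the target, so the first recorded loss is 1.0, and the loss falls toward zero as the latent rotates the output's embedding onto the target direction.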

The concept algebra is the interface:

# Addition: combine two concepts
fire_water = Concept.from_text("fire", encoder) + Concept.from_text("water", encoder)

# Weighted combination
sunset_ocean = 0.3 * Concept.from_text("sunset", encoder) + 0.7 * Concept.from_text("ocean", encoder)

# Subtraction: remove an attribute
hairless = Concept.from_text("dog", encoder) - 0.3 * Concept.from_text("fur", encoder)

# Spherical interpolation between two points
midpoint = Concept.slerp(concept_a, concept_b, t=0.5)

# Cross-modal: an audio concept plus a text concept, rendered as an image
thunder_purple = Concept.from_audio("thunder.wav", encoder) + 0.3 * Concept.from_text("purple", encoder)
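Spherical interpolation is the one operation above with a standard closed form, so it is worth spelling out. The sketch below works on plain lists rather than Concept objects and is not the project's implementation; the function names are illustrative. It normalises both inputs, walks along the great circle between them, and refuses the antipodal case, where the path is not unique.

```python
import math

def normalise(v):
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def slerp(a, b, t):
    """Interpolate along the great circle between the directions of a and b."""
    a_n, b_n = normalise(a), normalise(b)
    # Clamp to guard acos against rounding just outside [-1, 1].
    cos_omega = max(-1.0, min(1.0, sum(x * y for x, y in zip(a_n, b_n))))
    omega = math.acos(cos_omega)
    if omega < 1e-6:
        return a_n                      # directions coincide; nothing to interpolate
    if math.pi - omega < 1e-6:
        raise ValueError("antipodal vectors: the great-circle path is not unique")
    sin_omega = math.sin(omega)
    w_a = math.sin((1.0 - t) * omega) / sin_omega
    w_b = math.sin(t * omega) / sin_omega
    return [w_a * x + w_b * y for x, y in zip(a_n, b_n)]

mid = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
mid_norm = math.sqrt(sum(x * x for x in mid))

# For comparison: the straight-line midpoint falls inside the unit sphere.
linear_mid = [0.5, 0.5]
linear_norm = math.sqrt(sum(x * x for x in linear_mid))
```

The reason to prefer this over a straight-line blend is visible in the last two lines: the slerp midpoint keeps unit length, while the linear midpoint of two orthogonal unit vectors shrinks to about 0.71 and has to be renormalised.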

What the optimiser actually matches

The naive loss is cosine similarity on the pooled embedding, and it has a known failure mode: it matches the model’s final answer the way an adversarial image matches a classifier’s top label, which is to say with high-frequency noise that scores well and looks like nothing. Pooled cosine is now a metric I log at every step, not the signal I optimise. The signal is two tighter terms. Patch-token alignment matches LanguageBind’s ViT patch features rather than the pooled summary, which preserves spatial structure: a goldfish-shaped goldfish instead of goldfish texture smeared across the frame. Feature-direction alignment matches the SAE decomposition, so the optimiser is pushed toward the named features the target carries, not just its dense direction.
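A small experiment shows why the pooled score cannot see spatial structure. The sketch below assumes mean pooling over patch tokens, which is a simplification (a ViT's pooled output is typically a class token passed through a projection), and uses three invented 2-dimensional patch vectors. Shuffling the patches leaves the pooled similarity at 1.0 while the patch-wise alignment drops sharply.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mean_pool(patches):
    """Average the patch tokens into one summary vector (assumed pooling)."""
    count = len(patches)
    return [sum(column) / count for column in zip(*patches)]

def pooled_similarity(output_patches, target_patches):
    return cosine(mean_pool(output_patches), mean_pool(target_patches))

def patch_alignment(output_patches, target_patches):
    """Mean cosine between patches at the same position."""
    pairs = list(zip(output_patches, target_patches))
    return sum(cosine(o, t) for o, t in pairs) / len(pairs)

# Hypothetical patch tokens for a target and a spatially scrambled output.
target = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scrambled = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]

pooled = pooled_similarity(scrambled, target)
aligned = patch_alignment(scrambled, target)
```

The scrambled output contains exactly the right content in the wrong places. Pooled cosine rewards it fully; patch alignment does not, which is the property that turns smeared texture into a shaped subject.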

Holding the output inside the distribution the decoder expects used to be a hand-tuned job: total variation for smoothness, a spectral penalty on high frequencies, a latent-norm term. Those still exist on the legacy path, but the load-bearing version now is a learned one. A variational score-distillation prior pulls the latent toward the natural-image manifold a frozen diffusion model already knows. The prior is trained as a small per-concept LoRA, so it regularises without collapsing every concept to the same generic mean. The regulariser weights I used to pick by feel are replaced by a prior the model derived.
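For reference, two of the legacy hand-tuned terms are simple enough to sketch: total variation and the latent-norm penalty. This is an illustration on nested lists, not the project's code. The function names, the anisotropic form of total variation, and the weights are assumptions, and the spectral penalty and the learned prior are left out.

```python
def total_variation(image):
    """Anisotropic TV: sum of absolute differences between neighbouring pixels."""
    rows, cols = len(image), len(image[0])
    horizontal = sum(abs(image[r][c + 1] - image[r][c])
                     for r in range(rows) for c in range(cols - 1))
    vertical = sum(abs(image[r + 1][c] - image[r][c])
                   for r in range(rows - 1) for c in range(cols))
    return horizontal + vertical

def latent_norm_penalty(latent, expected_norm):
    """Squared gap between the latent's L2 norm and the norm the decoder expects."""
    norm = sum(x * x for x in latent) ** 0.5
    return (norm - expected_norm) ** 2

def regularised_loss(alignment_loss, image, latent,
                     tv_weight=0.01, norm_weight=0.001, expected_norm=1.0):
    """Alignment term plus hand-weighted regularisers (weights are placeholders)."""
    return (alignment_loss
            + tv_weight * total_variation(image)
            + norm_weight * latent_norm_penalty(latent, expected_norm))

# A checkerboard is maximally rough; a flat image has no variation at all.
checkerboard = [[0.0, 1.0], [1.0, 0.0]]
flat = [[0.5, 0.5], [0.5, 0.5]]
rough_tv = total_variation(checkerboard)
flat_tv = total_variation(flat)
```

The two weights are exactly the kind of number that had to be picked by feel, which is the motivation for replacing them with a learned prior.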

What comes out

Four renders, four concepts: water, thunder, fire and goldfish, listed here in a different order than they appear below. See if you can match each to a tile before you turn it over.

Tile: dense multicoloured speckle with clusters of orange, fan-shaped forms scattered across it.

Tile: multicoloured speckle with branching orange tongues rising through the centre.

Four concepts, each optimised from noise toward its target embedding; nothing retouched.

These are not meant to be beautiful, and it’s worth being plain about what they are. Each one is the literal output for a concept: the optimiser starts from random noise and nudges every pixel until the whole image embeds close to the target point. Nobody ever shows it the subject or tells it what one looks like. So where matching the concept doesn’t pull on a region, the pixels stay as the high-frequency speckle they started as. Most of the frame is static because, to the encoder, most of the frame doesn’t matter.

What matters is that the subject still surfaces. Look past the speckle and forms start to organise out of it - edges, repeating shapes, the rough silhouette of whatever the model is reaching for. In places it reaches for the written word rather than the thing, which is its own tell that the encoder is anchored on language. None of it resolves into a clean picture, but the concept is unmistakably there once you spot it; that is the game above, four renders with nothing labelled until you turn them over. That partial, patchy legibility is the honest result, and it forces a choice: leave the representation raw, or resolve it toward something that reads as a real image.

Honest and natural

The most interesting decision was to render two tracks of the same concept side by side. A VAE-style render answers “what is the nearest natural image whose embedding matches this concept” - it leans on the decoder’s training distribution to produce something photo-real and recognisable. That’s the natural track, VSD-prior conditioned. The honest track dials the prior down and lets the optimisation sit closer to the raw representation, so the output is what the model thinks the concept is rather than the closest real photograph of it. Neither is the true one. The contrast between them is the artefact: here is the model’s idea of “thunder,” and here is the nearest thing in the world to it.

What every render ships with

Each run emits an interpretation bundle and an evaluation card, so an artefact is never just an image. The bundle carries the text-anchor readout (what the output embeds as, in words), the named SAE features that fired hardest, attribution maps from attention rollout and integrated gradients that show where in the frame each feature lives, and linear-probe activations for coarse questions like “is this animal” or “is this danger-coded.” The card carries the numbers: cross-modal agreement (do the image, audio, and video for one target re-encode to neighbours), cross-encoder probes that re-score the output with SigLIP2 and CLAP rather than the encoder it was optimised against, and seed-stability across repeated runs. Evaluation is built into the pipeline rather than bolted on after, which is the part of this build I’d most defend.
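Two of the card's numbers can be sketched with one helper. The text does not say how agreement or stability is scored, so the mean pairwise cosine used below is an assumption, as are the function names, the dictionary layout, and the toy 2-dimensional embeddings.

```python
import math
from itertools import combinations

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over every unordered pair of embeddings."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

def evaluation_card(modality_embeddings, seed_embeddings):
    """Assemble the two scores from re-encoded outputs.

    modality_embeddings: modality name -> embedding of that modality's render.
    seed_embeddings: embeddings of the same render repeated with different seeds.
    """
    return {
        "cross_modal_agreement": mean_pairwise_cosine(list(modality_embeddings.values())),
        "seed_stability": mean_pairwise_cosine(seed_embeddings),
    }

# Hypothetical re-encoded outputs: image and audio agree, video has drifted.
card = evaluation_card(
    {"image": [1.0, 0.0], "audio": [1.0, 0.0], "video": [0.0, 1.0]},
    [[1.0, 0.0], [1.0, 0.0]],
)
```

In this toy case the card reports perfect seed stability but a cross-modal agreement of one third, which is the kind of disagreement between a confident single render and a weak cross-modal result that the card is meant to expose.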

The whole thing runs on Apple Silicon: PyTorch on the MPS backend, tuned on an M1 Max with 64GB of unified memory. A full four-modality dual-track run holds the LanguageBind encoders, three decoders, and optimisation overhead in roughly 19GB; a single-modality image render fits comfortably on 16GB. The Apple Silicon path is deliberate, and it took real work to make fast: bf16 autocast on the forward pass, torch.compile on the optimisation hot loop, and flash-attention routing through the ViT together buy a 3-5x wall-time improvement over naive fp32. This is exploratory work where the loop is run-look-adjust, and a fast local iteration cycle on hardware I already own beats renting a faster card.

Outcome

The system does concept algebra across four modalities in a single shared embedding space: addition, subtraction, and spherical interpolation over text, image, audio, and video, with cross-modal targets that let an audio concept shape an image or a text concept colour a sound.

What it surfaces is a more interesting result than any single picture: the platonic ideals a network has learned. The outputs are what the model thinks a concept looks like: maximally activated representations reconstructed by optimisation, not training examples retrieved and recombined. As an interpretability artefact, that’s the point. You’re looking at the geometry of a learned representation directly, and the interpretation bundle lets you read it rather than guess at it: which features fired, where they live in the frame, what the output embeds as in words.

The evaluation card turns “did this work” from a judgement made by eye into something with numbers behind it. Cross-modal agreement says whether the four renderings of a target really landed near each other, cross-encoder probes say whether a second model agrees about what’s in the output, and seed-stability says whether the result is a property of the concept or an accident of initialisation. It still makes no benchmark claim and it isn’t a paper; the numbers exist to keep the framing honest, not to top a leaderboard.

I’m treating this as an essay-grade study. Its value is in the framing - generation as optimisation in a shared space, concepts as feature decompositions you can do arithmetic on, the honest and natural tracks shown together - and in the artefacts that framing produces. The natural next surface is the interactive one: an embedding-space view where you move through the concept manifold and watch the output change, which is where the lab takes it.

The code is on GitHub.