Multimodal AI: more than just words
Modern AI doesn't just read and write. It sees, hears, and creates pictures, sounds, and video.
A bigger world for AI
Until now, we’ve talked about AI as if it only knew text. For a long time, that was true.
In 2026, that’s no longer the case. The newest models are multimodal, which is a fancy way of saying they handle more than one kind of input and output at once: text, images, audio, and video. They can:
- Look at a photo and describe what’s in it, or read the writing on a whiteboard.
- Listen to your voice and answer in their own voice, in near-real time.
- Watch short videos and summarize what happened.
- Make images, music, and even short video clips from a text prompt.
GPT-4o (from OpenAI), Claude (Anthropic), and Gemini (Google) are all multimodal, and there are open-source multimodal models too. Fun fact: the “o” in GPT-4o stands for omni, meaning all the senses in one model.
How is this possible?
The trick is simpler than it sounds. Remember from the tokens lesson that an LLM reads words as tokens, then turns those into numbers?
A multimodal model does the same thing for everything. An image becomes a list of numbers. A sound clip becomes a list of numbers. A video becomes a long list of numbers. All those numbers get fed into the same model, which learned how each kind of input relates to the others.
The model never really “sees” the image the way you do. It just learned that pictures of golden retrievers tend to make these numbers, and the word “dog” tends to make these other numbers, and the two patterns are close together.
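If you'd like to see that idea in code, here's a toy sketch. It is not how any real model is implemented, and the vocabulary, patch size, and file name are made up for illustration; the point is only that a sentence and a photo both end up as plain lists of numbers.

```python
# Toy sketch: text and an image both become lists of numbers.
# The vocabulary, patch size, and "dog.jpg" are made up for illustration.

import numpy as np
from PIL import Image

# Text -> numbers: a tiny invented vocabulary mapping words to token IDs.
vocab = {"a": 0, "golden": 1, "retriever": 2, "in": 3, "the": 4, "park": 5}
text_tokens = [vocab[w] for w in "a golden retriever in the park".split()]
print(text_tokens)       # [0, 1, 2, 3, 4, 5]

# Image -> numbers: resize, cut into 16x16 patches, flatten each patch into
# a vector -- roughly what vision transformers do before embedding.
img = np.asarray(Image.open("dog.jpg").convert("RGB").resize((224, 224)))
patches = img.reshape(14, 16, 14, 16, 3).swapaxes(1, 2).reshape(196, -1)
print(patches.shape)     # (196, 768): 196 patches, each a vector of 768 numbers

# A multimodal model learns to project both kinds of vectors into the same
# space, so "retriever" tokens and dog-photo patches can land near each other.
```

Real models learn those projections from enormous amounts of paired data; the takeaway is simply that every modality is turned into vectors before the model ever works with it.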
Try image generation
Type a description. The model on the other side will paint you a picture from those words alone.
Generated by an image model on Cloudflare's edge. The same prompt run twice will look different, because image outputs are sampled, just like text.
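If you're curious what a demo like this does behind the scenes, here is a minimal sketch of calling a hosted text-to-image model over HTTP. The endpoint pattern, model name, and credentials below are assumptions for illustration (Cloudflare's Workers AI exposes image models roughly this way, but check the current docs for exact names):

```python
# Minimal sketch of calling a text-to-image model over HTTP.
# The endpoint, model name, and credentials are assumptions for illustration.

import requests

ACCOUNT_ID = "your-account-id"   # hypothetical placeholder
API_TOKEN = "your-api-token"     # hypothetical placeholder
MODEL = "@cf/stabilityai/stable-diffusion-xl-base-1.0"  # assumed model name

resp = requests.post(
    f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/{MODEL}",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"prompt": "a red balloon on the left, a yellow umbrella on the right, watercolor"},
)

# The response body should be the generated image. Run it twice and you'll
# get two different pictures, because generation is sampled, not deterministic.
with open("generated.png", "wb") as f:
    f.write(resp.content)
```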
Things to try
- Add a style: “in watercolor”, “as a pencil sketch”, “pixel art”.
- Add a mood: “moody”, “cozy”, “joyful”.
- Be specific about what’s where: “a red balloon on the left, a yellow umbrella on the right.”
What this unlocks
Once a single model can handle multiple senses, a lot of new uses appear:
- Accessibility. Describe what’s on the screen for someone who can’t see it.
- Education. Snap a photo of a math problem and get a worked-out explanation.
- Translation. Real-time speech translation that preserves the speaker’s voice and tone.
- Creative tools. Write a story, generate the illustrations, narrate it, animate it.
By 2026, roughly two out of three large companies are actively using multimodal AI in real products.
The same warnings still apply
Multimodal models still hallucinate. They can misread an image, hear the wrong word, generate something inaccurate or unfair, or invent details that aren’t really there. Bias from training data shows up in pictures too. So all the rules from earlier modules (verify, give context, keep oversight) still apply, just with more senses to watch.
Quick check
- 1. What does 'multimodal' mean for AI?
- 2. Inside the model, how is an image handled?
- 3. True or false: multimodal models never hallucinate, since they can 'see' the source.