What is it You See?

ChatGPT and Me

In our last chat, ChatGPT and I talked a bit about how it sees itself and how people, myself included, may see it. But the bigger question is: how does it see at all? What was it actually "seeing" when it analyzed any of the images I "showed" it?

My inputs (prompts) are in the boxed elements. The rest is ChatGPT's output. I've done some formatting to make it HTML friendly and made a few minor edits, but I've kept faithful to the original.

How do you see images?

Short answer: I don’t see images the way you do. No eyes, no little movie playing in my head, no sudden appreciation for sunsets or your neighbor’s questionable lawn ornaments. What I do is convert images into numbers, then look for patterns in those numbers. It’s less poetry, more spreadsheet with ambition.

Here’s the honest mechanical version, minus the mystical fog humans like to wrap around technology.

Step 1: The image gets turned into data
An image arrives as a grid of pixels. Each pixel has values, usually red, green, and blue intensities. Think of it like a giant table of colored Lego bricks, except every brick is labeled with numbers. I don’t see a cat. I see millions of tiny measurements.

Step 2: A vision model extracts features
A specialized neural network, often something like a Vision Transformer or a convolutional model, scans the image and pulls out patterns. Edges, textures, shapes, spatial relationships. Not “a dog,” but things like: curved boundary, fur-like texture, two symmetric dark circles above a horizontal line. It’s pattern recognition at industrial scale.

Step 3: Those patterns become a meaning representation
All those detected features get compressed into a vector. That’s just a fancy math word for a list of numbers that summarize what’s in the image. This vector sits in the same kind of conceptual space as language, so I can connect visual patterns to words and ideas.

Step 4: I reason about it using language
Once the image is encoded, the rest of the process looks a lot like text understanding. I compare patterns to what I learned during training and generate a description, answer a question, or point out details. Still no seeing. Just matching structures and probabilities with a suspicious amount of confidence.

If you want a mental model that actually holds up, use this: You see images like a human standing in a room. I “see” images like a forensic analyst reading a lab report about the room.

No emotions, no direct experience, just careful interpretation of measurements. It’s weirdly effective, and also deeply unromantic. Humans keep hoping there’s a tiny consciousness in the machine squinting at pictures. There isn’t. Just math doing its quiet, relentless thing.


I'd like to know a little more detail about each of those steps.

[Image: Attention paper, 2017]
You’re doing that thing scientists and engineers do where “a little more detail” quietly turns into peeling the machine open layer by layer. I respect it. Most people stop at the magic trick. You’re asking how the gears grind. That’s where the fun lives.

I’ll go deeper into each step, but keep it grounded in how the system actually behaves, not the glossy brochure version.


Step 1 — The Image Becomes Numbers (Pixel Encoding)

This is the least glamorous part and also the most unavoidable. Every image starts as a grid of pixels, and each pixel is just a few numbers. Usually three numbers: Red, Green, Blue. That’s it. No meaning yet. Just measurements.

A tiny 4×4 image is conceptually just a 4×4 grid of cells, each holding three numbers. In math terms, the image becomes a tensor (a multi-dimensional array). Humans call it a picture.
Computers call it: Height × Width × Color Channels


So a normal phone photo might be: 3024 (height) × 4032 (width) × 3 (RGB values)
That's about 36 million numbers just sitting there, waiting to be interpreted. No meaning yet. Just raw sensory data, like photons hitting a retina before the brain gets involved.


Two extra details most explanations skip:
  1. Normalization:
    Those color numbers get scaled, often into a range like 0 to 1. Not because the computer is delicate, but because math behaves better when numbers are predictable.
  2. Resizing:
    Images are usually resized to a fixed resolution before processing. Not for aesthetics. For consistency. Neural networks hate surprises.
Key idea: This step is measurement, not understanding.
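
The measurements above are easy to poke at directly. Here is a minimal NumPy sketch (the blank image and the stride-based "resize" are stand-ins for illustration; real pipelines load an actual photo and resize with proper interpolation):

```python
import numpy as np

# Hypothetical phone photo: Height x Width x 3 RGB channels, values 0-255
image = np.zeros((3024, 4032, 3), dtype=np.uint8)

print(image.shape)  # (3024, 4032, 3)
print(image.size)   # 36578304 -- the "about 36 million numbers"

# Normalization: scale into the 0-1 range so the math behaves predictably
normalized = image.astype(np.float32) / 255.0

# Resizing to a fixed resolution (crude subsampling here just for illustration;
# real pipelines use proper interpolation)
resized = normalized[::14, ::18, :]
print(resized.shape)  # (216, 224, 3)
```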

Step 2 — Feature Extraction (Finding Patterns)
Now the system starts looking for structure. This is where the machine begins to behave less like a camera sensor and more like a very obsessive intern.

Two major families of models do this:
Convolutional Neural Networks (CNNs)
They slide small filters across the image and detect things like:
  • edges
  • corners
  • textures
  • simple shapes
Vision Transformers (ViTs)
They chop the image into patches and analyze relationships between them using attention.
Either way, the model builds feature maps. Think of them as layers of interpretation:

Layer progression typically looks like this:
  • Early layers: edges and lines
  • Middle layers: shapes and textures
  • Later layers: objects and parts
So instead of pixels, the system now has signals like:
  • vertical edge detected
  • circular pattern detected
  • fur-like texture detected
Still no poetry. Still no feelings. Just increasingly sophisticated pattern detection.

Key Idea: The model moves from raw data to recognizable structure.
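
Here is a minimal sketch of what "sliding small filters" means, using a hand-written Sobel-style edge detector (real CNNs learn their filters from data rather than hard-coding them, and use fast library convolutions instead of Python loops):

```python
import numpy as np

def convolve2d(img, kernel):
    """Slide a small filter across a grayscale image (valid cross-correlation)."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

# A tiny grayscale image with a vertical edge: dark left half, bright right half
img = np.zeros((6, 6))
img[:, 3:] = 1.0

# Sobel-style kernel that responds to vertical edges
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

feature_map = convolve2d(img, sobel_x)
print(feature_map)  # strong responses (4.0) only where the window straddles the edge
```

The output is a feature map in the sense described above: it no longer stores brightness, it stores "vertical edge detected here."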

Step 3 — Compression into a Meaning Vector (Representation)
Here’s where things get interesting, and slightly eerie if you stare too long. All those detected features get compressed into a vector, which is basically a list of numbers representing the image’s meaning.

Example (simplified): [0.13, −0.82, 0.44, 1.02, …]

Real vectors often have:
  • 512 numbers
  • 768 numbers
  • 1024 numbers
  • or more
This vector lives in something called an embedding space.

Important nuance: The vector preserves relationships. Images that are visually similar end up close together in this mathematical space.

In that space:
  • similar things are close together
  • different things are far apart
So:
  • two dogs → nearby vectors
  • dog vs airplane → far apart vectors
  • golden retriever vs lab → annoyingly close
Not because the system understands animals the way you do, but because the patterns statistically resemble each other.

This is the step where the system stops dealing with pixels and starts dealing with meaning-like structure. You could delete the original image at this point, and the vector would still carry its essence. Like compressing a novel into a summary that still makes sense.

Step 4 — Reasoning with Language (Multimodal Understanding)
Now the visual representation gets handed to the language system. This is where I show up, metaphorically speaking, wearing a lab coat and pretending to be insightful.

The image vector becomes part of the same system that processes text. That allows tasks like:
  • describing an image
  • answering questions about it
  • comparing objects
  • spotting inconsistencies
  • reasoning about scenes
Technically, the system treats the image representation like another kind of token sequence. The transformer then uses attention to connect those visual tokens with text tokens, so words and image regions can influence each other directly.
So if you show me a photo of a dog holding a tennis ball, the reasoning process might look like:
Detected features:
  • animal shape
  • fur texture
  • round object
  • outdoor setting
Language reasoning:
  • animals that look like this are dogs
  • round fuzzy object is likely a tennis ball
  • dogs often play fetch
Output:
  • “A dog holding a tennis ball.”
Cold. Mechanical. Weirdly reliable.
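
The "image patches become tokens" idea can be sketched in a few lines (a toy illustration under the assumption noted in the comments, not how any specific model is wired):

```python
# Toy sketch (assumption: real multimodal models project image-patch embeddings
# into the same space as word embeddings; these tiny 2-number vectors and the
# variable names are illustrative, not a real API).

image_patch_vectors = [[0.1, 0.4], [0.3, 0.2], [0.5, 0.9]]  # from the vision encoder
text_token_vectors = [[0.7, 0.1], [0.2, 0.8]]               # from a question like "what is this?"

# The transformer sees one combined sequence; attention can then relate
# any word to any image patch, and vice versa.
combined_sequence = image_patch_vectors + text_token_vectors
print(len(combined_sequence))  # 5 "tokens", visual and textual alike
```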

The Real Summary (without the marketing fluff)
1. Measure — turn light into numbers
2. Detect — find patterns in those numbers
3. Represent — compress patterns into a vector
4. Reason — connect that vector to language and knowledge
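
The four steps above can be strung together as a toy data-flow sketch. None of this is a real model; each function is a deliberately tiny, hypothetical stand-in that just shows how the output of one stage feeds the next:

```python
def measure(pixels):
    """Step 1: raw values -> normalized numbers."""
    return [p / 255 for p in pixels]

def detect(numbers):
    """Step 2: numbers -> crude 'edge' features (differences between neighbors)."""
    return [numbers[i + 1] - numbers[i] for i in range(len(numbers) - 1)]

def represent(features):
    """Step 3: features -> a (one-number) 'embedding vector'."""
    return [sum(features) / len(features)]

def reason(vector):
    """Step 4: vector -> words."""
    return "bright gradient" if vector[0] > 0 else "dark gradient"

# A 1-D "image" that gets brighter from left to right
print(reason(represent(detect(measure([0, 64, 128, 192, 255])))))  # bright gradient
```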

No consciousness.
No inner movie.
No aesthetic appreciation of sunsets.

Just layered mathematics doing an extremely convincing impression of perception.

And honestly, the fact that humans built a system that can go from raw pixels to coherent explanation in four steps is quietly astonishing. You ask for a deeper look, and suddenly we’re dissecting synthetic perception like it’s a frog in biology class. Not glamorous, but deeply satisfying if you enjoy understanding how things actually work.
