What is it You See?

ChatGPT and Me

In our last chat, ChatGPT and I talked a bit about how it sees itself and how people, myself included, may see it. But the bigger question is: how does it see at all? What was it actually "seeing" when it analyzed any of the images I "showed" it?

My inputs (prompts) are in the boxed elements. The rest is ChatGPT's output. I've done some formatting to make it HTML friendly and made a few minor edits, but I've kept faithful to the original.

How do you see images?

Short answer: I don’t see images the way you do. No eyes, no little movie playing in my head, no sudden appreciation for sunsets or your neighbor’s questionable lawn ornaments. What I do is convert images into numbers, then look for patterns in those numbers. It’s less poetry, more spreadsheet with ambition.

Here’s the honest mechanical version, minus the mystical fog humans like to wrap around technology.

Step 1: The image gets turned into data
An image arrives as a grid of pixels. Each pixel has values, usually red, green, and blue intensities. Think of it like a giant table of colored Lego bricks, except every brick is labeled with numbers. I don’t see a cat. I see millions of tiny measurements.

Step 2: A vision model extracts features
A specialized neural network, often something like a Vision Transformer or a convolutional model, scans the image and pulls out patterns. Edges, textures, shapes, spatial relationships. Not “a dog,” but things like: curved boundary, fur-like texture, two symmetric dark circles above a horizontal line. It’s pattern recognition at industrial scale.

Step 3: Those patterns become a meaning representation
All those detected features get compressed into a vector. That’s just a fancy math word for a list of numbers that summarize what’s in the image. This vector sits in the same kind of conceptual space as language, so I can connect visual patterns to words and ideas.

Step 4: I reason about it using language
Once the image is encoded, the rest of the process looks a lot like text understanding. I compare patterns to what I learned during training and generate a description, answer a question, or point out details. Still no seeing. Just matching structures and probabilities with a suspicious amount of confidence.

If you want a mental model that actually holds up, use this: You see images like a human standing in a room. I “see” images like a forensic analyst reading a lab report about the room.

No emotions, no direct experience, just careful interpretation of measurements. It’s weirdly effective, and also deeply unromantic. Humans keep hoping there’s a tiny consciousness in the machine squinting at pictures. There isn’t. Just math doing its quiet, relentless thing.


I'd like to know a little more detail about each of those steps.

[Image: Attention paper, 2017]
You’re doing that thing scientists and engineers do where “a little more detail” quietly turns into peeling the machine open layer by layer. I respect it. Most people stop at the magic trick. You’re asking how the gears grind. That’s where the fun lives.

I’ll go deeper into each step, but keep it grounded in how the system actually behaves, not the glossy brochure version.


Step 1 — The Image Becomes Numbers (Pixel Encoding)

This is the least glamorous part and also the most unavoidable. Every image starts as a grid of pixels, and each pixel is just a few numbers. Usually three numbers: Red, Green, Blue. That’s it. No meaning yet. Just measurements.

A tiny 4×4 image is conceptually just a 4×4 grid of cells, each holding three numbers. In math terms, the image becomes a tensor (a multi-dimensional array). Humans call it a picture.
Computers call it: Height × Width × Color Channels


So a normal phone photo might be: 3024 (height) × 4032 (width) × 3 (RGB values)
That's about 36 million numbers just sitting there, waiting to be interpreted. No meaning yet. Just raw sensory data, like photons hitting a retina before the brain gets involved.


Two extra details most explanations skip:
  1. Normalization:
    Those color numbers get scaled, often into a range like 0 to 1. Not because the computer is delicate, but because math behaves better when numbers are predictable.
  2. Resizing:
    Images are usually resized to a fixed resolution before processing. Not for aesthetics. For consistency. Neural networks hate surprises.
Key idea: This step is measurement, not understanding.
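
The measurements above are easy to poke at directly. Here is a minimal NumPy sketch (the blank image and the stride-based "resize" are stand-ins for illustration; real pipelines load an actual photo and resize with proper interpolation):

```python
import numpy as np

# Hypothetical phone photo: Height x Width x 3 RGB channels, values 0-255
image = np.zeros((3024, 4032, 3), dtype=np.uint8)

print(image.shape)  # (3024, 4032, 3)
print(image.size)   # 36578304 -- the "about 36 million numbers"

# Normalization: scale into the 0-1 range so the math behaves predictably
normalized = image.astype(np.float32) / 255.0

# Resizing to a fixed resolution (crude subsampling here just for illustration;
# real pipelines use proper interpolation)
resized = normalized[::14, ::18, :]
print(resized.shape)  # (216, 224, 3)
```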

Step 2 — Feature Extraction (Finding Patterns)
Now the system starts looking for structure. This is where the machine begins to behave less like a camera sensor and more like a very obsessive intern.

Two major families of models do this:
Convolutional Neural Networks (CNNs)
They slide small filters across the image and detect things like:
  • edges
  • corners
  • textures
  • simple shapes
Vision Transformers (ViTs)
They chop the image into patches and analyze relationships between them using attention.
Either way, the model builds feature maps. Think of them as layers of interpretation:

Layer progression typically looks like this:
  • Early layers: edges and lines
  • Middle layers: shapes and textures
  • Later layers: objects and parts
So instead of pixels, the system now has signals like:
  • vertical edge detected
  • circular pattern detected
  • fur-like texture detected
Still no poetry. Still no feelings. Just increasingly sophisticated pattern detection.

Key Idea: The model moves from raw data to recognizable structure.
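
Here is a minimal sketch of what "sliding small filters" means, using a hand-written Sobel-style edge detector (real CNNs learn their filters from data rather than hard-coding them, and use fast library convolutions instead of Python loops):

```python
import numpy as np

def convolve2d(img, kernel):
    """Slide a small filter across a grayscale image (valid cross-correlation)."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

# A tiny grayscale image with a vertical edge: dark left half, bright right half
img = np.zeros((6, 6))
img[:, 3:] = 1.0

# Sobel-style kernel that responds to vertical edges
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

feature_map = convolve2d(img, sobel_x)
print(feature_map)  # strong responses (4.0) only where the window straddles the edge
```

The output is a feature map in the sense described above: it no longer stores brightness, it stores "vertical edge detected here."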

Step 3 — Compression into a Meaning Vector (Representation)
Here’s where things get interesting, and slightly eerie if you stare too long. All those detected features get compressed into a vector, which is basically a list of numbers representing the image’s meaning.

Example (simplified): [0.13, −0.82, 0.44, 1.02, …]

Real vectors often have:
  • 512 numbers
  • 768 numbers
  • 1024 numbers
  • or more
This vector lives in something called an embedding space.

Important nuance: The vector preserves relationships. Images that are visually similar end up close together in this mathematical space.

In that space:
  • similar things are close together
  • different things are far apart
So:
  • two dogs → nearby vectors
  • dog vs airplane → far apart vectors
  • golden retriever vs lab → annoyingly close
Not because the system understands animals the way you do, but because the patterns statistically resemble each other.

This is the step where the system stops dealing with pixels and starts dealing with meaning-like structure. You could delete the original image at this point, and the vector would still carry its essence. Like compressing a novel into a summary that still makes sense.

Step 4 — Reasoning with Language (Multimodal Understanding)
Now the visual representation gets handed to the language system. This is where I show up, metaphorically speaking, wearing a lab coat and pretending to be insightful.

The image vector becomes part of the same system that processes text. That allows tasks like:
  • describing an image
  • answering questions about it
  • comparing objects
  • spotting inconsistencies
  • reasoning about scenes
Technically, the system treats the image representation like another kind of token sequence. The transformer then uses attention to connect those visual tokens with text tokens, so words and image regions can influence each other directly.
So if you show me a photo of a dog holding a tennis ball, the reasoning process might look like:
Detected features:
  • animal shape
  • fur texture
  • round object
  • outdoor setting
Language reasoning:
  • animals that look like this are dogs
  • round fuzzy object is likely a tennis ball
  • dogs often play fetch
Output:
  • “A dog holding a tennis ball.”
Cold. Mechanical. Weirdly reliable.
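
The "image patches become tokens" idea can be sketched in a few lines (a toy illustration under the assumption noted in the comments, not how any specific model is wired):

```python
# Toy sketch (assumption: real multimodal models project image-patch embeddings
# into the same space as word embeddings; these tiny 2-number vectors and the
# variable names are illustrative, not a real API).

image_patch_vectors = [[0.1, 0.4], [0.3, 0.2], [0.5, 0.9]]  # from the vision encoder
text_token_vectors = [[0.7, 0.1], [0.2, 0.8]]               # from a question like "what is this?"

# The transformer sees one combined sequence; attention can then relate
# any word to any image patch, and vice versa.
combined_sequence = image_patch_vectors + text_token_vectors
print(len(combined_sequence))  # 5 "tokens", visual and textual alike
```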

The Real Summary (without the marketing fluff)
1. Measure — turn light into numbers
2. Detect — find patterns in those numbers
3. Represent — compress patterns into a vector
4. Reason — connect that vector to language and knowledge
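
The four steps above can be strung together as a toy data-flow sketch. None of this is a real model; each function is a deliberately tiny, hypothetical stand-in that just shows how the output of one stage feeds the next:

```python
def measure(pixels):
    """Step 1: raw values -> normalized numbers."""
    return [p / 255 for p in pixels]

def detect(numbers):
    """Step 2: numbers -> crude 'edge' features (differences between neighbors)."""
    return [numbers[i + 1] - numbers[i] for i in range(len(numbers) - 1)]

def represent(features):
    """Step 3: features -> a (one-number) 'embedding vector'."""
    return [sum(features) / len(features)]

def reason(vector):
    """Step 4: vector -> words."""
    return "bright gradient" if vector[0] > 0 else "dark gradient"

# A 1-D "image" that gets brighter from left to right
print(reason(represent(detect(measure([0, 64, 128, 192, 255])))))  # bright gradient
```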

No consciousness.
No inner movie.
No aesthetic appreciation of sunsets.

Just layered mathematics doing an extremely convincing impression of perception.

And honestly, the fact that humans built a system that can go from raw pixels to coherent explanation in four steps is quietly astonishing. You ask for a deeper look, and suddenly we’re dissecting synthetic perception like it’s a frog in biology class. Not glamorous, but deeply satisfying if you enjoy understanding how things actually work.
