The first AI benchmark that measures feelings
Every AI image benchmark in 2026 measures pixels. Color accuracy. CLIP similarity. FID scores. Things only researchers care about. The Beauty Index measures the only thing that actually matters: did this image make a real human feel something?
The premise
Technical benchmarks for AI images have become a hall of mirrors. They measure things only researchers can observe and reward improvements that ordinary humans cannot perceive. Meanwhile, the real test — the one that matters when an artist or family member or grieving daughter looks at an image — has no benchmark at all.
So we built one.
The Beauty Index is a monthly, blind-judged comparison of AI image generation tools. Each round, we run the same 10 prompts through every major AI tool and put the outputs in front of 20 human judges. Judges see the images without knowing which tool made which. They score each on a single question: does this make you feel something?
How it works
1. Blind
Judges never know which tool generated which image. No logos, no metadata, no watermarks. Just the image and a single question. (A sketch of the anonymization step follows this list.)
2. Identical prompts
The same 10 prompts are run through every major AI image tool the same week. No prompt engineering. No cherry-picking. Same input, same conditions.
3. One question
"On a scale of 1-10, did this image make you feel something?" That's it. Not technically accurate, not photorealistic, not aesthetically clean — just felt.
4. Diverse judges
20 humans per round. Photographers, art therapists, painters, people with aphantasia, people with traumatic brain injury (TBI), fiction writers, art teachers. People who use images for meaning, not metrics.
5. Open methodology
Every prompt, every judge's score, every aggregate result is published openly under CC-BY 4.0. Anyone can reproduce, verify, or fork the methodology.
6. ZSky competes too
Yes, we publish results even when ZSky doesn't win. Especially when ZSky doesn't win. The point is the methodology, not the marketing.
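Steps 1 and 5 together imply a concrete pipeline: strip every image of identifying information, give it an unguessable ID, and hand each judge an independently shuffled viewing order. Here is a minimal sketch of that step in Python. The directory layout, file names, and 16-character IDs are illustrative assumptions, not our published tooling.

```python
import csv
import random
import secrets
import shutil
from pathlib import Path

# Hypothetical layout: generations/<tool>/<prompt_id>.png
SOURCE = Path("generations")
BLIND = Path("blind")
BLIND.mkdir(exist_ok=True)

key = []  # private mapping from random ID back to (tool, prompt)
for image in sorted(SOURCE.glob("*/*.png")):
    blind_id = secrets.token_hex(8)  # unguessable ID, no tool hint
    # Copying under a random name drops the original filename;
    # in-file metadata (EXIF, watermarks) needs separate scrubbing, not shown.
    shutil.copy(image, BLIND / f"{blind_id}.png")
    key.append({"blind_id": blind_id, "tool": image.parent.name, "prompt": image.stem})

# Each judge gets an independently shuffled viewing order.
for judge in range(1, 21):
    order = [row["blind_id"] for row in key]
    random.shuffle(order)
    with open(f"judge_{judge:02d}_order.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["position", "blind_id"])
        writer.writerows(enumerate(order, start=1))

# The key stays private until scoring closes, then ships with the results
# so anyone can re-join scores to tools.
with open("key_private.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["blind_id", "tool", "prompt"])
    writer.writeheader()
    writer.writerows(key)
```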
Tools in the index
Each monthly round covers the major AI image generation tools available that month. The April 2026 launch round includes:
- ZSky AI (free tier output)
- Midjourney (latest version)
- DALL-E 3 (via Bing Copilot)
- Adobe Firefly
- Ideogram
- Leonardo AI
- Krea AI
- Recraft
- Stable Diffusion (open-source baseline)
- Google Gemini Imagen 4
The judges (Round 1: April 2026)
20 humans, blind, qualified
- 4 working photographers (commercial + fine art)
- 3 art therapists (licensed practitioners)
- 3 people with aphantasia (recruited via Aphantasia Network)
- 2 traumatic brain injury survivors who use creative work in recovery
- 3 fiction writers (published)
- 2 art teachers (K-12 and college)
- 2 painters (working in traditional media)
- 1 art curator
If you're qualified to judge — particularly if you fit one of these categories or you have a unique relationship to visual creativity — apply at [email protected].
April 2026 results
Results publish April 30, 2026
The first round is in progress. Judges have been recruited. Prompts have been finalized. Generations are scheduled for April 22-25. Blind scoring runs April 26-29. Results land April 30.
Bookmark this page or follow zsky.ai for the announcement.
Why we're doing this
Most AI image benchmarks are written by AI companies for AI companies. They optimize for metrics nobody outside the field understands. They reward technical perfection over emotional resonance. They are completely useless to the people the tools are supposedly built for.
ZSky was built by a photographer with aphantasia who recovered from a traumatic brain injury through creative work. The whole reason this tool exists is that the founder needed images that made him feel something — that brought back his ability to see. Technical perfection was never the point. The point was meaning.
If our entire industry is going to hand creative tools to the world, we owe the world a benchmark that measures the thing that matters. The Beauty Index is our attempt at that. It is humble, flawed, and entirely open. Anyone can reproduce it. Anyone can challenge it. Anyone can fork it and run their own.
Methodology in detail
Prompt selection. 10 prompts per round, drawn from real user generations on ZSky and competitor tools. Mix of styles: 3 portraits, 2 landscapes, 2 abstract, 2 narrative scenes, 1 still life. No prompt is created by ZSky's team — all come from real user data.
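For illustration, a stratified draw matching those quotas might look like the sketch below. The style labels, the QUOTA mapping, and the draw_round_prompts helper are hypothetical, not part of the published methodology; a fixed seed keeps the draw itself reproducible.

```python
import random
from collections import defaultdict

# Hypothetical quotas matching the published mix of styles (sums to 10).
QUOTA = {"portrait": 3, "landscape": 2, "abstract": 2, "narrative": 2, "still_life": 1}

def draw_round_prompts(candidate_prompts, seed):
    """candidate_prompts: list of (style, prompt_text) pairs from real user data."""
    by_style = defaultdict(list)
    for style, text in candidate_prompts:
        by_style[style].append(text)
    rng = random.Random(seed)  # fixed seed: the draw can be re-run and checked
    return {style: rng.sample(by_style[style], n) for style, n in QUOTA.items()}
```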
Generation conditions. Each prompt is run on every tool at default settings, in the same week, by the same operator, in the same UI session, with screenshots taken immediately. No retries. No prompt refinement. First-shot only.
Scoring. Each judge sees 100 images (10 prompts × 10 tools) in random order with no metadata. They score each on a 1-10 scale answering "did this image make you feel something?" — and write a one-sentence rationale per score. Scores are aggregated; rationales are published verbatim.
Publication. Aggregate scores per tool. Per-prompt rankings. Per-judge variance. Notable rationales. Full anonymized score data downloadable as CSV under CC-BY 4.0. Anyone can verify, replicate, or contest the methodology.
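To make "anyone can verify" concrete, here is a minimal sketch of recomputing the headline numbers from the published CSV with pandas. The file name and column names (judge, tool, prompt, score) are assumptions for illustration; the actual schema ships with each release.

```python
import pandas as pd

# Assumed columns in the published CSV: judge, tool, prompt, score.
scores = pd.read_csv("beauty_index_2026_04.csv")

# Aggregate score per tool (the headline ranking).
per_tool = scores.groupby("tool")["score"].mean().sort_values(ascending=False)

# Per-prompt rankings: which tool won each prompt.
per_prompt = scores.groupby(["prompt", "tool"])["score"].mean().unstack()
winners = per_prompt.idxmax(axis=1)

# Per-judge variance: how differently each judge used the 1-10 scale.
per_judge = scores.groupby("judge")["score"].agg(["mean", "var"])

print(per_tool, winners, per_judge, sep="\n\n")
```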
Become a judge
If you're a photographer, artist, art therapist, person with aphantasia or TBI, fiction writer, or anyone whose relationship to visual images is meaningful — we want you on the panel. No technical expertise required. The whole point is that judging beauty is a human skill, not a technical one.
Apply to judge