The first AI benchmark that measures feelings
Every AI image benchmark in 2026 measures pixels. Color accuracy. CLIP similarity. FID scores. Things only researchers care about. The Beauty Index measures the only thing that actually matters: did this image make a real human feel something?
The premise
Technical benchmarks for AI images have become a hall of mirrors. They measure things only researchers can observe and reward improvements that ordinary humans cannot perceive. Meanwhile, the real test — the one that matters when an artist or family member or grieving daughter looks at an image — has no benchmark at all.
So we built one.
The Beauty Index is a monthly, blind-judged comparison of AI image generation tools. Each round, we run the same 10 prompts through every major AI tool and put the outputs in front of 20 human judges. Judges see the images without knowing which tool made which. They score each on a single question: does this make you feel something?
How it works
1. Blind
Judges never know which tool generated which image. No logos, no metadata, no watermarks. Just the image and a single question. (A sketch of the anonymization step follows this list.)
2. Identical prompts
The same 10 prompts are run through every major AI image tool the same week. No prompt engineering. No cherry-picking. Same input, same conditions.
3. One question
"On a scale of 1-10, did this image make you feel something?" That's it. Not technically accurate, not photorealistic, not aesthetically clean — just felt.
4. Diverse judges
20 humans per round. Photographers, art therapists, painters, people with aphantasia, people with traumatic brain injury (TBI), fiction writers, art teachers. People who use images for meaning, not metrics.
5. Open methodology
Every prompt, every judge's score, every aggregate result is published openly under CC-BY 4.0. Anyone can reproduce, verify, or fork the methodology.
6. ZSky competes too
Yes, we publish results even when ZSky doesn't win. Especially when ZSky doesn't win. The point is the methodology, not the marketing.
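Steps 1 and 5 together imply a concrete pipeline: strip every image of identifying information, give it an unguessable ID, and hand each judge an independently shuffled viewing order. Here is a minimal sketch of that step in Python. The directory layout, file names, and 16-character IDs are illustrative assumptions, not our published tooling.

```python
import csv
import random
import secrets
import shutil
from pathlib import Path

# Hypothetical layout: generations/<tool>/<prompt_id>.png
SOURCE = Path("generations")
BLIND = Path("blind")
BLIND.mkdir(exist_ok=True)

key = []  # private mapping from random ID back to (tool, prompt)
for image in sorted(SOURCE.glob("*/*.png")):
    blind_id = secrets.token_hex(8)  # unguessable ID, no tool hint
    # Copying under a random name drops the original filename;
    # in-file metadata (EXIF, watermarks) needs separate scrubbing, not shown.
    shutil.copy(image, BLIND / f"{blind_id}.png")
    key.append({"blind_id": blind_id, "tool": image.parent.name, "prompt": image.stem})

# Each judge gets an independently shuffled viewing order.
for judge in range(1, 21):
    order = [row["blind_id"] for row in key]
    random.shuffle(order)
    with open(f"judge_{judge:02d}_order.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["position", "blind_id"])
        writer.writerows(enumerate(order, start=1))

# The key stays private until scoring closes, then ships with the results
# so anyone can re-join scores to tools.
with open("key_private.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["blind_id", "tool", "prompt"])
    writer.writeheader()
    writer.writerows(key)
```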
Tools in the index
Each monthly round covers the major AI image generation tools available that month. The April 2026 launch round includes:
- ZSky AI (free tier output)
- Midjourney (latest version)
- DALL-E 3 (via Bing Copilot)
- Adobe Firefly
- Ideogram
- Leonardo AI
- Krea AI
- Recraft
- Stable Diffusion (open-source baseline)
- Google Gemini Imagen 4
The judges (Round 1: April 2026)
20 humans, blind, qualified
- 4 working photographers (commercial + fine art)
- 3 art therapists (licensed practitioners)
- 3 people with aphantasia (recruited via Aphantasia Network)
- 2 traumatic brain injury survivors who use creative work in recovery
- 3 fiction writers (published)
- 2 art teachers (K-12 and college)
- 2 painters (working in traditional media)
- 1 art curator
If you're qualified to judge — particularly if you fit one of these categories or you have a unique relationship to visual creativity — apply at [email protected].
April 2026 results
Results publish April 30, 2026
The first round is in progress. Judges have been recruited. Prompts have been finalized. Generations are scheduled for April 22-25. Blind scoring runs April 26-29. Results land April 30.
Bookmark this page or follow zsky.ai for the announcement.
Why we're doing this
Most AI image benchmarks are written by AI companies for AI companies. They optimize for metrics nobody outside the field understands. They reward technical perfection over emotional resonance. They are completely useless to the people the tools are supposedly built for.
ZSky was built by a photographer with aphantasia who recovered from a traumatic brain injury through creative work. The whole reason this tool exists is that the founder needed images that made him feel something — that brought back his ability to see. Technical perfection was never the point. The point was meaning.
If our entire industry is going to hand creative tools to the world, we owe the world a benchmark that measures the thing that matters. The Beauty Index is our attempt at that. It is humble, flawed, and entirely open. Anyone can reproduce it. Anyone can challenge it. Anyone can fork it and run their own.
Methodology in detail
Prompt selection. 10 prompts per round, drawn from real user generations on ZSky and competitor tools. Mix of styles: 3 portraits, 2 landscapes, 2 abstract, 2 narrative scenes, 1 still life. No prompt is created by ZSky's team — all come from real user data.
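For illustration, a stratified draw matching those quotas might look like the sketch below. The style labels, the QUOTA mapping, and the draw_round_prompts helper are hypothetical, not part of the published methodology; a fixed seed keeps the draw itself reproducible.

```python
import random
from collections import defaultdict

# Hypothetical quotas matching the published mix of styles (sums to 10).
QUOTA = {"portrait": 3, "landscape": 2, "abstract": 2, "narrative": 2, "still_life": 1}

def draw_round_prompts(candidate_prompts, seed):
    """candidate_prompts: list of (style, prompt_text) pairs from real user data."""
    by_style = defaultdict(list)
    for style, text in candidate_prompts:
        by_style[style].append(text)
    rng = random.Random(seed)  # fixed seed: the draw can be re-run and checked
    return {style: rng.sample(by_style[style], n) for style, n in QUOTA.items()}
```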
Generation conditions. Each prompt is run on every tool at default settings, in the same week, by the same operator, in the same UI session, with screenshots taken immediately. No retries. No prompt refinement. First-shot only.
Scoring. Each judge sees 100 images (10 prompts × 10 tools) in random order with no metadata. They score each on a 1-10 scale answering "did this image make you feel something?" — and write a one-sentence rationale per score. Scores are aggregated; rationales are published verbatim.
Publication. Aggregate scores per tool. Per-prompt rankings. Per-judge variance. Notable rationales. Full anonymized score data downloadable as CSV under CC-BY 4.0. Anyone can verify, replicate, or contest the methodology.
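To make "anyone can verify" concrete, here is a minimal sketch of recomputing the headline numbers from the published CSV with pandas. The file name and column names (judge, tool, prompt, score) are assumptions for illustration; the actual schema ships with each release.

```python
import pandas as pd

# Assumed columns in the published CSV: judge, tool, prompt, score.
scores = pd.read_csv("beauty_index_2026_04.csv")

# Aggregate score per tool (the headline ranking).
per_tool = scores.groupby("tool")["score"].mean().sort_values(ascending=False)

# Per-prompt rankings: which tool won each prompt.
per_prompt = scores.groupby(["prompt", "tool"])["score"].mean().unstack()
winners = per_prompt.idxmax(axis=1)

# Per-judge variance: how differently each judge used the 1-10 scale.
per_judge = scores.groupby("judge")["score"].agg(["mean", "var"])

print(per_tool, winners, per_judge, sep="\n\n")
```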
Become a judge
If you're a photographer, artist, art therapist, person with aphantasia or TBI, fiction writer, or anyone whose relationship to visual images is meaningful — we want you on the panel. No technical expertise required. The whole point is that judging beauty is a human skill, not a technical one.
Apply to judge