Image Feedback Research

1. The Problem

When an LLM "evaluates" a generated image with a numeric score (e.g., "aesthetics: 7.2, originality: 8.1"), it is essentially hallucinating. A language model has no trained aesthetic perception pipeline — it produces plausible-sounding numbers based on text description and prompt, but these numbers do not systematically correlate with human perception.

Core Problem

If the feedback is fiction, the entire system is optimizing fiction. This is a fundamental issue for gen-emerge: the 8 feedback channels that caused convergence were built on unreliable signals.

The engineering and scientific community has developed a sophisticated apparatus for solving this problem over the last 10 years. Below is a systematic review of all major approaches.

2. Taxonomy of Image Evaluation Approaches

All existing approaches fall into 7 paradigms, each with distinct strengths and limitations:

Seven Evaluation Paradigms

2.1. Neural Aesthetic Scorers

Train CNN/ViT to predict the distribution of human ratings on annotated photography datasets.

NIMA (Google, 2017)

ImageNet-pretrained CNN + 10-class output (ratings 1–10). Trained on AVA: 255K photos, ~200 judges each. Predicts the full distribution of ratings via an Earth Mover's Distance loss, so it distinguishes "everyone gave 5" from "half gave 2, half gave 8." Spearman rank correlation (SRCC) with mean human ratings: ~0.61.

LAION Aesthetic V2 (2022)

Radically simpler: nn.Linear(768, 1) on CLIP ViT-L/14 embeddings. Trained on ~5K image-rating pairs. Used to filter LAION-5B → LAION-Aesthetics (120M images) for Stable Diffusion v1 training. A linear probe on rich features.

Limitations for Gen-Emerge

All models trained on photographs, not generative art. AVA = photo competitions: landscapes, portraits, macro. Aesthetic score ≠ art quality — a technically perfect sunset gets 8/10, a radically innovative abstraction gets 3/10. Useful only as a technical quality floor.

2.2. CLIP-Based Text-Image Alignment

Measure how well a generated image matches the text prompt via cosine similarity of CLIP embeddings.

CLIP Score = cos(CLIP_image(image), CLIP_text(prompt))

Text-Image Alignment Score
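
As a minimal sketch, the score above is plain cosine similarity between two embedding vectors. The function below assumes the image and text embeddings have already been produced by a CLIP encoder; the function names and the toy vectors in the usage line are illustrative, not real CLIP outputs.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-norm embedding")
    return dot / (norm_a * norm_b)


def clip_score(image_emb: Sequence[float], text_emb: Sequence[float]) -> float:
    """CLIP Score = cos(CLIP_image(image), CLIP_text(prompt)).

    Both arguments are assumed to be precomputed embeddings.
    """
    return cosine_similarity(image_emb, text_emb)


# Toy 2-D vectors standing in for real embeddings.
score = clip_score([3.0, 4.0], [4.0, 3.0])
```

Because the score depends only on the angle between the vectors, it is unaffected by embedding length, which is why it is usually compared against a threshold rather than read as an absolute quality value.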

CLIP Distance ≠ Perceptual Distance

DreamSim (Fu et al., NeurIPS 2023): two images with CLIP cosine similarity 0.95 can look radically different to humans (different palette, mood), while two with 0.80 can look nearly identical (same scene at different times). Using CLIP distance as a diversity metric is an error.

2.3. Human Preference Reward Models

Train a model to predict which of two images a human will prefer, based on pairwise comparisons. Direct analogue of reward models from RLHF in NLP.

Model	Training Data	Architecture	Accuracy
ImageReward	137K expert pairwise comparisons	BLIP backbone, (prompt, image) → scalar	65.14%
HPS v2	HPD v2 (large & diverse)	CLIP-based preference model	~65%
VisionReward	48K images, structured	64 binary questions → 5 dimensions → weighted sum	>65%, multi-dim

Comparison baseline: CLIP Score achieves only 54.82%, NIMA aesthetic score 57.35%. ImageReward's ReFL-tuned Stable Diffusion wins against vanilla SD in 58.4% of human evaluations.

P(A ≻ B) = σ(β_A − β_B)

Bradley-Terry Model (1952) — Mathematical Foundation of All Reward Models

Why Pairwise is Better than Absolute

Thurstone (1927): relative judgment is cognitively simpler and more stable than absolute scoring. Pairwise eliminates scale bias — different raters use different parts of the scale, but relative judgment is consistent. ICC for absolute aesthetic ratings: ~0.40 in crowdsourcing vs ~0.94 in lab. Bradley-Terry produces transitive ranking from noisy individual judgments.
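
To make the Bradley-Terry formula concrete, here is a small sketch that fits the strengths β from a list of (winner, loser) judgments by gradient ascent on the average log-likelihood. The function name, the step size, and the small L2 penalty (added so that an item that never loses still gets a finite β) are assumptions of this sketch, not part of any cited reward model.

```python
import math
from typing import Dict, Iterable, Tuple


def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def fit_bradley_terry(
    comparisons: Iterable[Tuple[str, str]],
    steps: int = 2000,
    lr: float = 1.0,
    l2: float = 1e-3,
) -> Dict[str, float]:
    """Estimate beta so that P(A beats B) = sigmoid(beta_A - beta_B).

    comparisons: (winner, loser) pairs from noisy pairwise judgments.
    """
    pairs = list(comparisons)
    if not pairs:
        raise ValueError("need at least one comparison")
    beta = {item: 0.0 for pair in pairs for item in pair}
    n = len(pairs)
    for _ in range(steps):
        # Start from the gradient of the L2 penalty, then add data terms.
        grad = {item: -l2 * b for item, b in beta.items()}
        for winner, loser in pairs:
            # d/d(beta_winner) of log sigmoid(beta_w - beta_l) = 1 - sigmoid(...)
            residual = 1.0 - sigmoid(beta[winner] - beta[loser])
            grad[winner] += residual / n
            grad[loser] -= residual / n
        for item in beta:
            beta[item] += lr * grad[item]
    return beta


# Toy data: A beats B 8 of 10 times, B beats C 7 of 10 times.
judgments = [("A", "B")] * 8 + [("B", "A")] * 2 + [("B", "C")] * 7 + [("C", "B")] * 3
beta = fit_bradley_terry(judgments)
```

Sorting items by their fitted β gives a single transitive ranking even when the individual judgments contradict each other.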

2.4. VLM-as-a-Judge

Show an image to a multimodal model (GPT-4V, Claude, Gemini) and ask it to evaluate. This is exactly what gen-emerge does now — and exactly what's problematic.

79%: pair comparison agreement

~70%: scoring agreement

42%: batch ranking agreement

Persistent biases: position bias (first shown rates higher), verbosity bias (more detailed descriptions rate higher), self-preference bias (model prefers its own outputs), and hallucination of non-existent image elements.

Critical Finding

An LLM evaluating a text description of an image (without vision) achieves Pearson similarity 0.435 in scoring — better than some MLLMs at judging. Most of the "evaluation" is evaluation of the description, not the image.

2.5. Learned Perceptual Metrics

Models trained to predict how similar two images look to humans. Not quality assessment — difference assessment.

LPIPS (Zhang et al., 2018)

Distance in VGG/AlexNet feature space, calibrated on human perceptual judgments. Sensitive to color, texture, edges. Not sensitive to semantic content or layout.

DISTS (2020)

Improvement over LPIPS — separately models structure and texture components.

DreamSim (Fu et al., NeurIPS 2023 Spotlight)

Breakthrough. Trained on 20K triplets from diffusion models. Concatenation of CLIP + OpenCLIP + DINO embeddings + LoRA fine-tuning on human judgments. 96.16% agreement with humans. Focuses on foreground objects and semantic content while being sensitive to color and layout — the middle of the spectrum that existing metrics miss.

Key Recommendation

DreamSim is the best available metric for diversity measurement in gen-emerge. Instead of CLIP distance (poorly correlated with perceived difference), DreamSim distance = perceptually calibrated "how different are they." Direct basis for diversity gate (B10) and fingerprint comparison.

2.6. Distribution-Level Metrics

Evaluate not a single image, but the quality and diversity of an entire set.

Metric	What It Measures	Limitations / Notes
FID	Distance from generated to real distribution (Inception features)	Assumes Gaussian; biased estimator; Inception features outdated
CMMD (CVPR 2024)	Same via CLIP embeddings + MMD	No Gaussian assumption; unbiased; sample-efficient
sFID	FID with spatial features	Better for textures; still Inception-based
Self-Similarity	Diversity within generated set	Direct diagnostic for convergence

FID/CMMD don't apply to gen-emerge (no reference dataset of "ideal generative art"). But self-similarity within a series — mean pairwise DreamSim distance — is a simple and interpretable convergence diagnostic.
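
The self-similarity diagnostic can be sketched as a mean over all unordered pairs in a series. The distance function is passed in as a parameter; in practice it would be a DreamSim distance, but the usage line below substitutes a toy absolute-difference distance on scalars so the sketch stays self-contained. The function name is illustrative.

```python
from itertools import combinations
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def mean_pairwise_distance(
    items: Sequence[T], distance: Callable[[T, T], float]
) -> float:
    """Average distance over all unordered pairs in a series.

    A value that shrinks from one series to the next suggests the
    generator is converging on similar outputs.
    """
    pairs = list(combinations(items, 2))
    if not pairs:
        raise ValueError("need at least two items")
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


# Toy stand-in: scalars instead of images, |a - b| instead of DreamSim.
spread = mean_pairwise_distance([0.0, 1.0, 3.0], lambda a, b: abs(a - b))
```

The cost grows quadratically with series length, so for long series one might compute it over a sliding window of recent images instead.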

2.7. Decomposed Evaluation

Break "do you like it?" into specific measurable dimensions. Each dimension evaluated separately, then aggregated.

Alignment: text-image coherence
Aesthetics: composition, color, style
Detail: artifacts, clarity
Safety: NSFW, harmful content
Bias: stereotypes, representation

VisionReward decomposes human preferences across these 5 dimensions using 64 binary (yes/no) judgment questions. Accuracy grows monotonically with the number of questions — fine-grained decomposition beats scalar judgment.

Gen-Emerge Specific Dimensions

Constraint adherence: were the given constraints met? (binary per constraint)
Technical quality: are there artifacts? (binary)
Novelty: differs from previous works? (DreamSim distance, continuous)
Coherence: is there a visual narrative? (VLM judgment)
Style ambiguity: can the style be easily classified? (softmax entropy, continuous)

Each dimension can be validated separately. If VLM is bad at aesthetic scoring but good at binary "are there artifacts?" — decomposed approach isolates weak dimensions from strong ones.
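
The style-ambiguity dimension can be sketched as the normalized entropy of a style classifier's softmax output. The logits are assumed to come from some style classifier; the function names and the normalization to the range 0 to 1 are choices made for this sketch.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy divided by log(num_classes), so the result is in [0, 1].

    Near 0: the classifier is confident about one style (easy to classify).
    Near 1: probability is spread evenly (style is ambiguous).
    """
    if len(probs) < 2:
        raise ValueError("need at least two classes")
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


ambiguous = normalized_entropy(softmax([0.0, 0.0, 0.0, 0.0]))
confident = normalized_entropy(softmax([50.0, 0.0, 0.0]))
```

Normalizing by log(num_classes) keeps the value comparable if the number of style classes changes.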

3. The Subjectivity Problem

3.1. How Objective Is Aesthetic Judgment?

Lab, experts: ICC 0.94
Lab, non-experts: ICC 0.68
Crowdsourcing: ICC 0.40

ICC ~0.40 means: less than half the variance in ratings is explained by differences between images — the rest is rater differences. This is not noise — it's genuine disagreement. Different people like different things.
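
For readers who want to compute such a figure on their own ratings, below is a sketch of the one-way random-effects form, ICC(1,1). Which ICC variant the cited numbers use is not stated in the sources above, so the choice of this form is an assumption, as are the function name and the toy rating tables.

```python
from typing import Sequence


def icc_oneway(ratings: Sequence[Sequence[float]]) -> float:
    """ICC(1,1) for a table with one row per image and k ratings per row.

    ICC = (MS_between - MS_within) / (MS_between + (k - 1) * MS_within)
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need >= 2 images, each with the same k >= 2 ratings")
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    # Variation between images (signal) ...
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    # ... and variation among raters of the same image (disagreement).
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))
    denom = ms_between + (k - 1) * ms_within
    if denom == 0.0:
        raise ValueError("ratings have no variance")
    return (ms_between - ms_within) / denom


perfect = icc_oneway([[1, 1], [2, 2], [3, 3]])  # raters always agree
partial = icc_oneway([[1, 3], [3, 5]])          # images differ, raters differ too
```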

Experts vs laypeople: Non-experts orient on semantic features (what is depicted), experts on formal features (how: composition, color, rhythm). "Objective" aesthetic score is an oxymoron.

4. Comprehensive Comparison

Approach	Measures	Grounding	Cost	Accuracy	Gen-Emerge Fit
NIMA	Photo aesthetics	AVA (255K)	~0	SRCC 0.61	Low (photo bias)
LAION Aesthetic	Generic appeal	CLIP + 5K	~0	Not validated	Low (circular bias)
CLIP Score	Text-image align	400M pairs	~0	Moderate	Medium — prompt check
ImageReward	Human preference	137K expert	~0	65%	High — best single-score
VisionReward	Multi-dim pref.	48K structured	~0	>65%	Very high — decomposed
DreamSim	Perceptual sim.	20K triplets	~0	96%	Very high — diversity
VLM Pairwise	Comparative	VLM training	Low	79%	High — natural
VLM Absolute	Score assign	VLM training	Low	42–70%	Low — current bad approach
Human Pairwise	Gold standard	Direct	High	Ground truth	Ideal but bottleneck
Structured Checklist	Binary per-dim	VLM + design	Medium	High per Q	Very high

5. Recommended Feedback Architecture

5.1. Principle: Multi-Signal, Not Single-Score

No single method provides a reliable single score. The solution is an ensemble of heterogeneous signals, each with understood limitations.

5.2. Three Feedback Channels

Channel A: Automated Metrics (zero-cost, every cycle)

Technical quality floor: NIMA score > threshold. Prompt adherence: CLIP Score > threshold. Diversity gate: DreamSim distance to nearest neighbor > threshold. Style ambiguity: WikiArt classifier softmax entropy. All four are gating (pass/fail), not scoring.

Channel B: Structured VLM Evaluation (low-cost, every cycle)

Show image to VLM. Do NOT ask for numeric score. Instead: binary checklist (10–15 yes/no questions per dimension), descriptive summary (2–3 sentences), and pairwise comparison with current best in series. Binary questions minimize hallucination.

Channel C: Human Feedback (high-cost, async, periodic)

Not every cycle — every N cycles or on request. Pairwise comparison (not rating): 3–5 pairs, "which is better?" Confidence-weighted favorites: "interesting" (free) vs "breakthrough" (max 1 per series). Compass update. Feeds into Bradley-Terry model for calibration.

5.3. Aggregation: Constraint Satisfaction, Not Average

Instead of "average score = X" →

✓ 4/4 automated gates passed
✓ 11/15 binary checks passed
✓ Won pairwise vs current best
✗ Human marked as breakthrough: no

A profile, not a number

For the QD-archive: the profile determines cell coordinates (where in descriptor space), gates determine minimum viability, pairwise determines replace-if-better.
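
A minimal sketch of that archive rule, under stated assumptions: the `Candidate` fields, the `try_insert` name, and the `beats` callback (standing in for the pairwise comparison) are all hypothetical, and the cell is assumed to have been derived from the profile beforehand.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Hashable


@dataclass(frozen=True)
class Candidate:
    image_id: str
    cell: Hashable        # descriptor-space coordinates derived from the profile
    gates_passed: bool    # True only if every automated gate passed


def try_insert(
    archive: Dict[Hashable, Candidate],
    cand: Candidate,
    beats: Callable[[Candidate, Candidate], bool],
) -> bool:
    """Insert cand into its cell if it is viable and wins the pairwise check.

    beats(new, old) stands in for the pairwise comparison (VLM or human).
    Returns True if the archive changed.
    """
    if not cand.gates_passed:
        return False  # gates determine minimum viability
    incumbent = archive.get(cand.cell)
    if incumbent is None or beats(cand, incumbent):
        archive[cand.cell] = cand  # empty cell, or replace-if-better
        return True
    return False


archive: Dict[Hashable, Candidate] = {}
# Toy pairwise rule for the demo only: the lexicographically later id wins.
toy_beats = lambda new, old: new.image_id > old.image_id
first = try_insert(archive, Candidate("img-001", ("warm", "abstract"), True), toy_beats)
second = try_insert(archive, Candidate("img-002", ("warm", "abstract"), True), toy_beats)
failed = try_insert(archive, Candidate("img-003", ("cold", "figurative"), False), toy_beats)
```

Note that no numeric score is stored or compared anywhere; the only ordering information is the outcome of the pairwise check within a cell.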

5.4. Calibration Over Time

Metrics drift. A NIMA threshold calibrated on the first 10 images may be inadequate at image 500. Each human pairwise judgment is a calibration point. If humans consistently disagree with automated metrics — adjust thresholds. Periodic calibration rounds: human ranks 10 random archive images, compared against automated ranking.
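
One way to run such a calibration round is to compare the human ranking with the automated ranking via Spearman rank correlation. The sketch below assumes scores without ties and uses illustrative function names; a low or negative value would be the signal to revisit thresholds.

```python
from typing import List, Sequence


def to_ranks(scores: Sequence[float]) -> List[int]:
    """Rank 1 = highest score. Assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def spearman(human_scores: Sequence[float], metric_scores: Sequence[float]) -> float:
    """Spearman rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid without ties."""
    n = len(human_scores)
    if n != len(metric_scores) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    rh, rm = to_ranks(human_scores), to_ranks(metric_scores)
    d2 = sum((a - b) ** 2 for a, b in zip(rh, rm))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Toy example: human and metric disagree only on the order of the last two images.
rho = spearman([1, 2, 3, 4], [1, 2, 4, 3])
```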

5.5. Minimal Implementation Pipeline

[Image Generated]
      │
      ▼
[Gate 1: NIMA > 3.5?] ── if no ──→ reject + log
      │ yes
      ▼
[Gate 2: CLIP Score > 0.25?] ── if no ──→ reject + log
      │ yes
      ▼
[Gate 3: DreamSim(img, nearest) > 0.15?] ── if no ──→ reject as duplicate
      │ yes
      ▼
[VLM Binary Checklist: 10 yes/no questions]
      │
      ▼
[VLM Pairwise: vs current best in series]
      │
      ▼
[Store: image + profile + VLM description]
      │
      ▼
[Every 10 cycles: Human pairwise calibration]

Cost per cycle: ~4 model inferences (NIMA + CLIP + DreamSim + VLM call). Zero "made-up numbers."
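
The three automated gates of the pipeline can be sketched as a short-circuiting check. The thresholds are the ones shown in the pipeline; the function and field names are hypothetical, the scores are assumed to be computed elsewhere, and treating a missing nearest neighbor (the first image in a series) as passing the diversity gate is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GateResult:
    passed: bool
    reason: Optional[str] = None  # which gate rejected the image, if any


def run_gates(
    nima: float,
    clip: float,
    dreamsim_to_nearest: Optional[float],
    *,
    nima_min: float = 3.5,
    clip_min: float = 0.25,
    dreamsim_min: float = 0.15,
) -> GateResult:
    """Apply the gates in pipeline order; stop at the first failure."""
    if not nima > nima_min:
        return GateResult(False, "technical quality (NIMA)")
    if not clip > clip_min:
        return GateResult(False, "prompt adherence (CLIP Score)")
    # None means the archive is empty, so there is nothing to duplicate.
    if dreamsim_to_nearest is not None and not dreamsim_to_nearest > dreamsim_min:
        return GateResult(False, "duplicate (DreamSim)")
    return GateResult(True)


ok = run_gates(nima=5.1, clip=0.31, dreamsim_to_nearest=0.40)
dup = run_gates(nima=5.1, clip=0.31, dreamsim_to_nearest=0.05)
```

Each gate returns only pass or fail plus a reason for the log, so the scores never leave the gate as numbers that downstream steps could start optimizing.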

6. Anti-Patterns to Avoid

Avoid: LLM Numeric Scoring

An LLM writing "aesthetic quality: 7.2/10" is producing fiction: there is no trained mapping between visual perception and numbers. The output is non-deterministic, biased toward mid-scale values, and subject to self-preference bias.

Avoid: Single Metric Rules All

Even ImageReward optimizes toward average human preference, the median-taste attractor. One metric never captures everything.

Avoid: Optimization on Proxy

Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." Reward hacking is inevitable once a metric is optimized directly. Use metrics for gating, not as optimization targets.

Avoid: Assuming Objectivity

ICC = 0.40 in crowdsourcing doesn't mean "bad workers"; it means aesthetics is subjective. Model one specific human (the creator), not "average taste."

7. Ready-to-Use Tools

Tool	Type	How to Use
NIMA (MobileNet)	Aesthetic scorer	`pip install`, CPU/GPU inference — github
LAION Aesthetic V2	Aesthetic scorer	`nn.Linear(768,1)` on CLIP embed — github
ImageReward	Preference scorer	`pip install image-reward`, (prompt, image) → score — github
VisionReward	Multi-dim scorer	Checklist queries, HuggingFace model — github
DreamSim	Perceptual distance	`pip install dreamsim`, (img1, img2) → distance — github
CLIP	Text-image alignment	`pip install open-clip-torch` — github

8. Open Questions

Fine-tuning reward model for one human

ImageReward trained on average preference. Can it be fine-tuned on a specific creator's preferences? HBT (Hierarchical Bradley-Terry) allows modeling individual rater preferences on top of population-level model. ~100-200 pairwise judgments → personalized reward model.

Temporal drift in preferences

Taste changes. Creator who loved minimal style initially may switch to dense composition months later. Online updating of preference model with exponential decay of old judgments — FadeMem for preferences.
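
The exponential-decay idea can be sketched with a half-life weight on each judgment. The half-life parameterization, the function names, and the toy judgments are assumptions of this sketch; the source does not specify a decay schedule.

```python
from typing import Iterable, Tuple


def decay_weight(age_days: float, half_life_days: float) -> float:
    """Weight of a judgment made age_days ago; halves every half_life_days."""
    if half_life_days <= 0 or age_days < 0:
        raise ValueError("half_life_days must be > 0 and age_days >= 0")
    return 0.5 ** (age_days / half_life_days)


def weighted_preference(
    judgments: Iterable[Tuple[float, bool]], half_life_days: float
) -> float:
    """Decay-weighted share of judgments that preferred option A.

    judgments: (age_days, preferred_a) pairs.
    """
    total = 0.0
    for_a = 0.0
    for age, preferred_a in judgments:
        w = decay_weight(age, half_life_days)
        total += w
        if preferred_a:
            for_a += w
    if total == 0.0:
        raise ValueError("no judgments")
    return for_a / total


# A fresh vote for A outweighs a 30-day-old vote against it (half-life 30 days).
share = weighted_preference([(0.0, True), (30.0, False)], half_life_days=30.0)
```

The same weights could be applied to each term of a Bradley-Terry likelihood so that recent comparisons dominate the fitted preferences.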

Adversarial evaluation

Instead of "how good?" → "what's wrong?". VLM as critic: "find three weaknesses." Negative feedback may be more reliable than positive scoring — easier and more reliable to describe a problem than to quantify quality.

Self-evaluation via reconstruction

If a model can describe an image as text, and the text generates another image → measure consistency. Low reconstruction consistency = ambiguous or low-quality image. Internal metric, no external model needed.

QDHF — personalized diversity axes

Instead of hand-picked axes (palette, subject, composition) — learned axes reflecting what this specific human considers "different." DreamSim as stepping stone, personalized DreamSim as goal.

9. Conclusions

Main Conclusion

An LLM assigning numeric scores to images is a hallucination engine, not an evaluation engine. 50 years of image quality assessment research has created a powerful apparatus — and it should be used.

Replace LLM scoring with multi-signal gating

Not "rate 1 to 10" but "pass 4 gates + answer 10 binary questions + win pairwise."

DreamSim is the key diversity metric

CLIP distance is not calibrated against human perception; DreamSim is (96% agreement). Use it for the diversity gate and fingerprint comparison.

Pairwise comparison always beats absolute scoring

For VLMs: 79% agreement in pairwise comparison vs ~70% in absolute scoring and 42% in batch ranking. For humans: Thurstone (1927). Move the entire feedback architecture to comparative mode.

Structured binary checklist reduces hallucination

VLM reliable for binary classification, unreliable for numeric scoring. 15 binary questions > 1 scalar score.

Never optimize proxy

Goodhart's Law + reward hacking. Metrics for gating and diagnosis. Primary objective: coverage (QD), not score.
