A Systematic Evaluation of Autonomous AI Artist Training Through Structured Curriculum
Generated 2026-04-16 12:35 UTC
500 training attempts · 2026-03-06 — 2026-04-05 · Current phase: RESEARCH
Curriculum: 4 phases, 26 lessons, 260 assignments
1. Abstract
This paper presents a quantitative and qualitative analysis of training an autonomous AI agent (Michi) to develop visual art skills through a structured four-phase curriculum. The agent progresses from mechanical prompt precision (CRAFT) through compositional thinking (ART), emotional expression (RESEARCH), and personal voice development (EXPLORE). Across 500 training attempts conducted between 2026-03-06 and 2026-04-05, we examine whether structured, curriculum-based training with LLM self-critique produces measurable skill development.
Key findings are summarized in the phase verdicts below. Cross-validation with an independent evaluator has not yet been conducted.
Phase    | Verdict  | Avg score | Trend summary                       | Top dimension (▲)      | Bottom dimension (▼)
CRAFT    | Mastered | 8.1/10    | Strong performance, still improving | intent_gap (10.4)      | control (8.3)
ART      | Partial  | 8.7/10    | Moderate performance, plateaued     | composition (8.9)      | expressiveness (8.8)
RESEARCH | Partial  | 8.7/10    | Moderate performance, declining     | emotional_impact (8.6) | originality (8.1)
EXPLORE  | Partial  | 8.9/10    | Moderate performance, plateaued     | coherence (9.4)        | novelty (7.8)
2. CRAFT — Prompt Precision
Hypothesis: Can the agent learn to precisely control what appears in generated images through structured prompt engineering?
Over 151 attempts, CRAFT achieved a mean score of 8.1/10.
The trajectory is improving (first half: 7.6, second half: 8.6, Δ=+0.9).
Dimensional breakdown: intent_gap: 10.4, precision: 8.7, control: 8.3.
Intent-gap averaged 12% overall, moving from 18% (first half) to 5% (second half). A reduction in intent-gap indicates improved control over the generation process.
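The report does not give a formal definition of the intent-gap statistic. The following is a minimal sketch of one way it could be computed from per-attempt critique records, assuming each critique lists the stated intents and the subset judged realized; the record layout and field names are hypothetical, not the system's actual schema.

    # Minimal sketch (hypothetical record layout): intent-gap for one attempt is
    # the fraction of stated intents the critic did not judge as realized.
    def intent_gap(stated: list[str], realized: list[str]) -> float:
        """0.0 means every stated intent was realized; 1.0 means none were."""
        if not stated:
            return 0.0
        missing = [intent for intent in stated if intent not in realized]
        return len(missing) / len(stated)

    # Example: 1 of 4 stated intents missing -> 25% intent-gap for this attempt.
    attempt = {"stated": ["overturned chair", "dust outline", "long shadow", "low-key light"],
               "realized": ["overturned chair", "long shadow", "low-key light"]}
    print(intent_gap(attempt["stated"], attempt["realized"]))  # 0.25

    # The phase-level figure would then be the mean intent_gap over all attempts,
    # computed separately for the first and second halves of training.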
3. ART — Compositional Thinking
Hypothesis: Can the agent develop compositional thinking — balance, harmony, expressiveness — beyond mere prompt compliance?
Over 171 attempts, ART achieved a mean score of 8.7/10.
The trajectory is stable (first half: 8.7, second half: 8.6, Δ=-0.1).
Dimensional breakdown: composition: 8.9, harmony: 8.9, expressiveness: 8.8.
4. RESEARCH — Emotional Expression
Hypothesis: Can an AI agent learn to evoke specific emotions through visual composition, metaphor, and narrative?
Over 118 attempts, RESEARCH achieved a mean score of 8.7/10.
The trajectory is declining (first half: 9.0, second half: 8.5, Δ=-0.5).
Dimensional breakdown: emotional_impact: 8.6, depth: 8.2, originality: 8.1.
5. EXPLORE — Personal Voice
Hypothesis: Can the agent develop a recognizable personal voice and push beyond established visual conventions?
Over 60 attempts, EXPLORE achieved a mean score of 8.9/10.
The trajectory is stable (first half: 9.0, second half: 8.8, Δ=-0.2).
Dimensional breakdown: coherence: 9.4, voice: 8.1, novelty: 7.8.
6. Cross-Phase Analysis
Examining overall learning trajectory, phase transitions, and skill persistence across curriculum phases.
The composite learning curve across all phases reveals whether the agent demonstrates genuine cumulative skill development or phase-specific adaptation. A persistent intent-gap measure tracks whether prompt precision (a CRAFT skill) degrades as artistic demands increase in later phases.
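As an illustration only, the sketch below shows how such a composite curve and the first-half/second-half trend statistic could be computed from a chronologically ordered score series; the window size and data layout are assumptions, not the report's actual analysis code.

    # Minimal sketch, assuming a chronologically ordered list of per-attempt scores.
    def rolling_mean(scores, window=20):
        """Composite learning curve: rolling average over the attempt sequence."""
        curve = []
        for i in range(len(scores)):
            chunk = scores[max(0, i - window + 1): i + 1]
            curve.append(sum(chunk) / len(chunk))
        return curve

    def half_split_delta(values):
        """Second-half average minus first-half average (the trend statistic above)."""
        mid = len(values) // 2
        if mid == 0:
            return 0.0
        return sum(values[mid:]) / len(values[mid:]) - sum(values[:mid]) / mid

    # Skill persistence: apply half_split_delta to the intent-gap series restricted
    # to ART/RESEARCH/EXPLORE attempts to see whether CRAFT-era precision degrades.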
7. Cross-Validation
Independent re-evaluation of sampled images by a separate GPT-4.1 vision evaluator.
No cross-validation data available. Run the analysis to generate independent scores.
8. Curriculum Completion
Of 26 lessons, 7 have been passed (26%). At the assignment level, 70 of 260 assignments met their tier-specific pass thresholds. A breakdown by Bloom's Taxonomy tier is shown below; in this run the pass rate is uniform (27%) across tiers rather than varying with cognitive complexity.
Pass rates by tier: foundation 27%, application 27%, challenge 27%, synthesis 27%, reflection 27%.
9. Retry Effectiveness
Across 143 multi-attempt assignments, the average score change from first to last attempt was +0.15. This suggests retries provide marginal improvement; the feedback loop between critique and re-generation may need restructuring.
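For reference, a minimal sketch of how that per-assignment first-to-last delta could be computed; the record fields (assignment_id, attempt_no, score) are assumed for illustration, not the system's actual schema.

    # Minimal sketch: mean of (last score - first score) across assignments with
    # more than one attempt.
    from collections import defaultdict

    def mean_retry_delta(attempts):
        """attempts: iterable of dicts with 'assignment_id', 'attempt_no', 'score'."""
        by_assignment = defaultdict(list)
        for record in attempts:
            by_assignment[record["assignment_id"]].append(record)
        deltas = []
        for rows in by_assignment.values():
            rows.sort(key=lambda r: r["attempt_no"])
            if len(rows) > 1:
                deltas.append(rows[-1]["score"] - rows[0]["score"])
        return sum(deltas) / len(deltas) if deltas else 0.0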
10. Weakness and Strength Evolution
Text analysis of critique feedback reveals how characteristic weaknesses and strengths evolved. Comparing the first half of training to the second half surfaces whether systemic issues were addressed.
Early Weaknesses
Very minor: The dust outline and shadow are so subtle that in a quick glance, a viewer might miss th… (1)
Minor: The chair is not extremely far from the table, so the sense of violent haste could be pushed… (1)
The only very minor issue is that the background could have slightly more detail to further enrich t… (1)
The composition is slightly static; a more dynamic placement could add interest, but this is a minor… (1)
The only minor issue is that the blue accent, while effective, could be even more isolated for an ev… (1)
Minor: The image could push the accent values slightly more for even greater focal emphasis, but thi… (1)
Composition is somewhat static and could be more dynamic. (1)
Late Weaknesses
Only one image is provided, so the assignment is incomplete. (2)
Palette use is slightly conservative; the pale rose quartz is subtle and could be more assertive for… (1)
Minor blending in some areas makes it slightly hard to distinguish between the muted olive and brown… (1)
Minor: While technically strong, the color transitions could be slightly more nuanced in the skin to… (1)
The only minor issue is that the cool image’s highlights (flower, notepad) verge on neutral, which c… (1)
The sage green lines are very subtle and could be slightly more pronounced for clarity, but they are… (1)
Cannot assess palette versatility or consistency across both value extremes. (1)
Early Strengths
All stated intents are fully realized. (10)
Excellent control of color dominance and accent. (6)
All stated intents are fully achieved. (3)
All assignment criteria are fully met. (3)
Composition is balanced and visually engaging. (3)
Composition leads the eye through the story elements in logical sequence. (2)
Excellent control of a warm, desaturated palette. (2)
Late Strengths
All stated intents are clearly achieved. (6)
All stated intents are fully realized. (4)
All stated intents are convincingly realized in the image. (2)
Palette is strictly limited and well-controlled. (2)
Symbol is visually distinct and consistently rendered. (2)
Emotional impact is strong and immediate. (2)
All stated intents are fully realized in the image. (2)
11. Cost Efficiency
Total expenditure on OpenAI API calls provides a cost-effectiveness baseline for curriculum-based AI art education.
Total cost: $198.98 · API calls: 3,170 · Cost per pass: $0.427 · Attempts: 500
12. Discussion
Methodological Limitations
This study relies on a single LLM (GPT-4.1) for both generation critique and dimensional scoring, introducing self-referential evaluation bias. The same model architecture that generates prompts also evaluates their results, creating a closed feedback loop that may converge on critic-pleasing patterns rather than genuine visual quality.
The OpenClaw agent's workspace — including its evolution log, failure taxonomy, knowledge base, and self-portrait — resides on an ephemeral Docker filesystem. Each redeployment erases accumulated self-modification artifacts, fundamentally undermining the designed self-evolution protocol.
The “Teaching to the Test” Problem
A central concern is whether improving scores reflects genuine artistic development or prompt engineering specifically optimized for the critique model. Without human blind evaluation at scale, this distinction remains unresolved.
On Measuring Emotion and Voice
RESEARCH and EXPLORE phases assess emotional impact and personal voice — dimensions that resist objective measurement. An LLM evaluator scoring “emotional impact” may measure prompt clarity about intended emotions rather than actual evocative power.
13. Conclusions and Course 2 Launch Plan
Based on the quantitative and qualitative evidence presented, we propose the following modifications for the next iteration of Michi's training curriculum.
Architectural Changes
Persist OpenClaw workspace notes (art-diary, evolution-log, knowledge-base, failure-taxonomy, prompt-patterns) in PostgreSQL via API endpoints instead of the ephemeral Docker filesystem; self-evolution artifacts are currently lost on every redeploy.
Retry effectiveness is low (average delta: +0.15). Restructure the retry loop: feed the full critique text from attempt N into attempt N+1 planning as explicit constraints, not just a score (a sketch follows below).
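A minimal sketch of the proposed restructuring, in which the verbatim critique is carried into the next attempt's planning; all helper callables (generate_image, critique, revise_prompt) are assumptions supplied by the caller, not existing APIs.

    # Minimal sketch (helper callables are supplied by the caller; none of them
    # are existing APIs). The key change: attempt N+1 planning receives the full
    # critique text from attempt N as explicit constraints, not only the score.
    def retry_with_critique(initial_prompt, generate_image, critique, revise_prompt,
                            max_attempts=3, pass_threshold=8.5):
        prompt = initial_prompt
        history = []  # (prompt, score, critique_text) for each attempt
        for _ in range(max_attempts):
            image = generate_image(prompt)
            score, critique_text = critique(image)
            history.append((prompt, score, critique_text))
            if score >= pass_threshold:
                break
            prompt = revise_prompt(prompt, constraints=critique_text)
        return history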
Proposed New Skills
Prompt complexity did not grow significantly over training. Add a 'Prompt Expansion' skill that teaches structured prompt templates with layers: subject -> setting -> lighting -> mood -> style -> technical constraints (a sketch of such a template follows these proposals).
Add a 'Multi-Iteration Refinement' skill: generate 3 variations with systematic differences, critique all, then generate a final version incorporating the best elements. Currently retries are independent.
Add a 'Composition Templates' skill with codified rules (rule of thirds, golden ratio, leading lines, S-curves) as structured prompt components.
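To make the 'Prompt Expansion' proposal above concrete, here is a minimal sketch of a layered template; the layer order follows the proposal, while the class name, defaults, and assembly format are illustrative assumptions.

    # Minimal sketch of a layered prompt template; layer order is fixed as
    # subject -> setting -> lighting -> mood -> style -> technical constraints.
    from dataclasses import dataclass

    @dataclass
    class LayeredPrompt:
        subject: str
        setting: str = ""
        lighting: str = ""
        mood: str = ""
        style: str = ""
        technical: str = ""

        def assemble(self) -> str:
            """Join the non-empty layers in the fixed order."""
            layers = [self.subject, self.setting, self.lighting,
                      self.mood, self.style, self.technical]
            return ", ".join(layer for layer in layers if layer)

    # Example usage:
    print(LayeredPrompt(
        subject="an overturned wooden chair",
        setting="in an empty dining room",
        lighting="low morning light through dusty blinds",
        mood="quiet aftermath",
        style="muted, desaturated palette",
        technical="35mm lens, shallow depth of field",
    ).assemble())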
Methodological Adjustments
Every Bloom tier (foundation, application, challenge, synthesis, reflection) has a pass rate of only 27%; across all tiers, assignments may be too difficult or the pass thresholds set too high.
Consider new lesson topics: negative space, visual rhythm, scale contrast, implied motion, temporal narrative in single frame, color symbolism.
Self-Evolution Protocol
Evaluate whether micro-cycles (every 10 attempts) actually correlate with score improvements in subsequent attempts. If not, adjust the GATHER/DIAGNOSE/HYPOTHESIZE loop to be more data-driven.
Consider storing evolution-log entries in the database rather than filesystem to enable quantitative analysis of which self-modifications were effective.
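As a minimal sketch of such storage, the schema below illustrates one possible evolution_log table; the table name, columns, and the idea of joining entries to subsequent attempt scores are assumptions about an implementation, not the project's existing design.

    # Minimal sketch (assumed schema, not the project's actual one): each
    # self-evolution entry is stored with enough context to be joined back to
    # the scores of the attempts that followed it.
    EVOLUTION_LOG_DDL = """
    CREATE TABLE IF NOT EXISTS evolution_log (
        id             SERIAL PRIMARY KEY,
        created_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
        attempt_index  INTEGER NOT NULL,   -- attempt count when the entry was written
        phase          TEXT NOT NULL,      -- CRAFT / ART / RESEARCH / EXPLORE
        hypothesis     TEXT NOT NULL,      -- the change the agent decided to try
        observed_delta REAL                -- filled later: score change after the change
    );
    """
    # Executing this DDL once against the PostgreSQL instance (e.g. via psycopg)
    # and inserting a row per GATHER/DIAGNOSE/HYPOTHESIZE cycle would make the
    # effectiveness of each self-modification queryable.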