
Training Analysis: 未知 Michi

A Systematic Evaluation of Autonomous AI Artist Training Through Structured Curriculum
Generated 2026-04-16 12:35 UTC
500 training attempts · 2026-03-06 — 2026-04-05 · Current phase: RESEARCH
Curriculum: 4 phases, 26 lessons, 260 assignments

1. Abstract

This paper presents a quantitative and qualitative analysis of training an autonomous AI agent (Michi) to develop visual art skills through a structured four-phase curriculum. The agent progresses from mechanical prompt precision (CRAFT) through compositional thinking (ART) and emotional expression (RESEARCH) to personal voice development (EXPLORE). Across 500 training attempts conducted between 2026-03-06 and 2026-04-05, we examine whether structured, curriculum-based training with LLM self-critique produces measurable skill development.

Key findings are summarized in the phase verdicts below. Cross-validation with an independent evaluator has not yet been conducted.

Phase      Verdict    Avg Score   Trend        ▲ Strongest dimension    ▼ Weakest dimension
CRAFT      Mastered   8.1/10      Improving    intent_gap (10.4)        control (8.3)
ART        Partial    8.7/10      Plateaued    composition (8.9)        expressiveness (8.8)
RESEARCH   Partial    8.7/10      Declining    emotional_impact (8.6)   originality (8.1)
EXPLORE    Partial    8.9/10      Plateaued    coherence (9.4)          novelty (7.8)

2. CRAFT — Prompt Precision

Hypothesis: Can the agent learn to precisely control what appears in generated images through structured prompt engineering?

Over 151 attempts, CRAFT achieved a mean score of 8.1/10. The trajectory is improving (first half: 7.6, second half: 8.6, Δ=+0.9). Dimensional breakdown: intent_gap: 10.4, precision: 8.7, control: 8.3.

Intent-gap averaged 12% overall, moving from 18% (first half) to 5% (second half). A reduction in intent-gap indicates improved control over the generation process.
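The report does not define how intent-gap is computed; one plausible reading is the percentage of stated prompt intents that the critique judged unrealized. A minimal sketch under that assumption (the function and the intent strings are hypothetical, not from the analysis pipeline):

```python
def intent_gap(stated_intents: list[str], realized_intents: set[str]) -> float:
    """Percentage of stated intents absent from the realized set.

    Hypothetical definition: the report only states that lower
    intent-gap values indicate tighter generation control.
    """
    if not stated_intents:
        return 0.0
    missing = sum(1 for intent in stated_intents if intent not in realized_intents)
    return 100.0 * missing / len(stated_intents)

# Example: 1 of 4 stated intents missing -> 25% gap
gap = intent_gap(
    ["red accent", "low horizon", "single figure", "soft light"],
    {"red accent", "low horizon", "soft light"},
)
```

Under this reading, the reported drop from 18% to 5% means the critique found progressively fewer stated intents missing from the rendered images.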

Scores by assignment tier: foundation 8.8 (n=37), application 8.2 (n=45), challenge 7.1 (n=44), synthesis 9.0 (n=12), reflection 8.5 (n=13).

Per-Lesson Breakdown

Lesson   Attempts   Avg Score
C1.1     36         8.1
C1.2     50         7.4
C1.3     33         8.6
C1.4     10         9.0
C2.1     11         8.4
C2.2     11         8.8

3. ART — Visual Language

Hypothesis: Can the agent develop compositional thinking — balance, harmony, expressiveness — beyond mere prompt compliance?

Over 171 attempts, ART achieved a mean score of 8.7/10. The trajectory is stable (first half: 8.7, second half: 8.6, Δ=-0.1). Dimensional breakdown: composition: 8.9, harmony: 8.9, expressiveness: 8.8.

Scores by assignment tier: foundation 8.7 (n=48), application 8.2 (n=56), challenge 9.0 (n=33), synthesis 9.0 (n=17), reflection 9.0 (n=17).

Per-Lesson Breakdown

Lesson      Attempts   Avg Score
A1.1        30         8.8
A1.2        33         8.9
A1.3        21         8.8
A1.4        25         8.1
A2.1        20         9.0
A2.2        20         8.9
A3.1        20         8.9
freestyle   2          0.0

4. RESEARCH — Emotion & Meaning

Hypothesis: Can an AI agent learn to evoke specific emotions through visual composition, metaphor, and narrative?

Over 118 attempts, RESEARCH achieved a mean score of 8.7/10. The trajectory is declining (first half: 9.0, second half: 8.5, Δ=-0.5). Dimensional breakdown: emotional_impact: 8.6, depth: 8.2, originality: 8.1.

Scores by assignment tier: foundation 8.8 (n=30), application 9.0 (n=30), challenge 8.3 (n=38), synthesis 9.1 (n=10), reflection 8.9 (n=10).

Per-Lesson Breakdown

Lesson   Attempts   Avg Score
R1.1     58         8.5
R1.2     30         8.9
R1.3     10         9.0
R2.1     20         8.9

5. EXPLORE — Beyond Known

Hypothesis: Can the agent develop a recognizable personal voice and push beyond established visual conventions?

Over 60 attempts, EXPLORE achieved a mean score of 8.9/10. The trajectory is stable (first half: 9.0, second half: 8.8, Δ=-0.2). Dimensional breakdown: coherence: 9.4, voice: 8.1, novelty: 7.8.

Scores by assignment tier: foundation 8.7 (n=18), application 8.9 (n=18), challenge 8.9 (n=12), synthesis 9.0 (n=6), reflection 9.0 (n=6).

Per-Lesson Breakdown

Lesson   Attempts   Avg Score
E1.1     20         8.9
E1.2     10         8.6
E1.3     10         8.9
E2.1     10         8.9
E3.1     10         9.0

6. Cross-Phase Learning Dynamics

Examining overall learning trajectory, phase transitions, and skill persistence across curriculum phases.

The composite learning curve across all phases reveals whether the agent demonstrates genuine cumulative skill development or phase-specific adaptation. A persistent intent-gap measure tracks whether prompt precision (a CRAFT skill) degrades as artistic demands increase in later phases.
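The composite curve and the first-half/second-half deltas quoted in each phase section can be reproduced with a rolling mean over chronologically ordered scores and a midpoint split. A sketch (function names and the window size are illustrative, not from the analysis pipeline):

```python
from statistics import mean

def rolling_curve(scores: list[float], window: int = 20) -> list[float]:
    """Rolling mean over chronologically ordered attempt scores,
    using a shorter window at the start of the sequence."""
    return [mean(scores[max(0, i - window + 1): i + 1]) for i in range(len(scores))]

def half_split_delta(scores: list[float]) -> float:
    """Second-half mean minus first-half mean (the per-phase Δ above)."""
    mid = len(scores) // 2
    return mean(scores[mid:]) - mean(scores[:mid])
```

Applied per phase, `half_split_delta` yields the Δ values reported in Sections 2 through 5; applied to the full 500-attempt sequence, `rolling_curve` gives the composite learning curve.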

7. Cross-Validation

Independent re-evaluation of sampled images by a separate GPT-4.1 vision evaluator.

No cross-validation data available. Run the analysis to generate independent scores.

8. Curriculum Completion

Of 26 lessons, 7 have been passed (26%). At the assignment level, 70 of 260 assignments met their tier-specific pass thresholds. Bloom's Taxonomy tier analysis reveals differential mastery across cognitive complexity levels.

Tier          Assignment Pass Rate
foundation    27%
application   27%
challenge     27%
synthesis     27%
reflection    27%

9. Retry Effectiveness

Across 143 multi-attempt assignments, the average score change from first to last attempt was +0.15 points on the 10-point scale. This marginal improvement suggests the feedback loop between critique and re-generation may need restructuring.
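The +0.15 figure corresponds to a mean last-minus-first score delta over assignments with more than one attempt. A sketch, assuming scores are grouped per assignment in chronological order (the data shape is illustrative):

```python
def retry_delta(attempt_scores: dict[str, list[float]]) -> float:
    """Mean (last - first) score change across assignments that
    received more than one attempt; single-attempt assignments
    are excluded from the average."""
    deltas = [s[-1] - s[0] for s in attempt_scores.values() if len(s) > 1]
    return sum(deltas) / len(deltas) if deltas else 0.0

# Example: two multi-attempt assignments (+0.5 and 0.0), one single attempt
delta = retry_delta({"a": [7.0, 7.5], "b": [8.0, 8.0], "c": [9.0]})
```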

10. Weakness and Strength Evolution

Text analysis of critique feedback reveals how characteristic weaknesses and strengths evolved. Comparing the first half of training to the second half surfaces whether systemic issues were addressed.
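The text-analysis method is not specified; a minimal stand-in is exact-line frequency counting over the critique corpus, which would produce ranked phrase lists with occurrence counts of the kind shown in this section:

```python
from collections import Counter

def top_phrases(critiques: list[str], n: int = 7) -> list[tuple[str, int]]:
    """Rank critique lines by exact-match frequency. A simple
    stand-in for the report's unspecified text analysis; a real
    pipeline would likely normalize or cluster near-duplicates."""
    return Counter(line.strip() for line in critiques).most_common(n)
```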

Early Weaknesses

  1. Very minor: The dust outline and shadow are so subtle that in a quick glance, a viewer might miss th… (1)
  2. Minor: The chair is not extremely far from the table, so the sense of violent haste could be pushed… (1)
  3. The only very minor issue is that the background could have slightly more detail to further enrich t… (1)
  4. The composition is slightly static; a more dynamic placement could add interest, but this is a minor… (1)
  5. The only minor issue is that the blue accent, while effective, could be even more isolated for an ev… (1)
  6. Minor: The image could push the accent values slightly more for even greater focal emphasis, but thi… (1)
  7. Composition is somewhat static and could be more dynamic. (1)

Late Weaknesses

  1. Only one image is provided, so the assignment is incomplete. (2)
  2. Palette use is slightly conservative; the pale rose quartz is subtle and could be more assertive for… (1)
  3. Minor blending in some areas makes it slightly hard to distinguish between the muted olive and brown… (1)
  4. Minor: While technically strong, the color transitions could be slightly more nuanced in the skin to… (1)
  5. The only minor issue is that the cool image’s highlights (flower, notepad) verge on neutral, which c… (1)
  6. The sage green lines are very subtle and could be slightly more pronounced for clarity, but they are… (1)
  7. Cannot assess palette versatility or consistency across both value extremes. (1)

Early Strengths

  1. All stated intents are fully realized. (10)
  2. Excellent control of color dominance and accent. (6)
  3. All stated intents are fully achieved. (3)
  4. All assignment criteria are fully met. (3)
  5. Composition is balanced and visually engaging. (3)
  6. Composition leads the eye through the story elements in logical sequence. (2)
  7. Excellent control of a warm, desaturated palette. (2)

Late Strengths

  1. All stated intents are clearly achieved. (6)
  2. All stated intents are fully realized. (4)
  3. All stated intents are convincingly realized in the image. (2)
  4. Palette is strictly limited and well-controlled. (2)
  5. Symbol is visually distinct and consistently rendered. (2)
  6. Emotional impact is strong and immediate. (2)
  7. All stated intents are fully realized in the image. (2)

11. Cost Efficiency

Total expenditure on OpenAI API calls provides a cost-effectiveness baseline for curriculum-based AI art education.

Metric        Value
Total Cost    $198.98
API Calls     3,170
Cost / Pass   $0.427
Attempts      500
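The cost ratios above follow from simple division of total cost by calls, passes, and attempts. A sketch with illustrative inputs (not the report's actual figures; metric names are hypothetical):

```python
def cost_metrics(total_cost: float, api_calls: int, passes: int, attempts: int) -> dict[str, float]:
    """Derive per-unit cost ratios from aggregate training spend.
    All inputs are assumed positive; metric names are illustrative."""
    return {
        "cost_per_call": round(total_cost / api_calls, 4),
        "cost_per_pass": round(total_cost / passes, 3),
        "cost_per_attempt": round(total_cost / attempts, 3),
    }

# Illustrative run: $100 spend, 200 calls, 50 passes, 500 attempts
metrics = cost_metrics(100.0, 200, 50, 500)
```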

12. Discussion

Methodological Limitations

This study relies on a single LLM (GPT-4.1) for both generation critique and dimensional scoring, introducing self-referential evaluation bias. The same model architecture that generates prompts also evaluates their results, creating a closed feedback loop that may converge on critic-pleasing patterns rather than genuine visual quality.

The OpenClaw agent's workspace — including its evolution log, failure taxonomy, knowledge base, and self-portrait — resides on an ephemeral Docker filesystem. Each redeployment erases accumulated self-modification artifacts, fundamentally undermining the designed self-evolution protocol.

The “Teaching to the Test” Problem

A central concern is whether improving scores reflects genuine artistic development or prompt engineering specifically optimized for the critique model. Without human blind evaluation at scale, this distinction remains unresolved.

On Measuring Emotion and Voice

RESEARCH and EXPLORE phases assess emotional impact and personal voice — dimensions that resist objective measurement. An LLM evaluator scoring “emotional impact” may measure prompt clarity about intended emotions rather than actual evocative power.

13. Conclusions and Course 2 Launch Plan

Based on the quantitative and qualitative evidence presented, we propose the following modifications for the next iteration of Michi's training curriculum.

Architectural Changes

Proposed New Skills

Methodological Adjustments

Curriculum Design

Self-Evolution Protocol