A Systematic Evaluation of Autonomous AI Artist Training Through Structured Curriculum
Generated 2026-04-16 12:35 UTC
500 training attempts · 2026-03-06 — 2026-04-05 · Current phase: RESEARCH
Curriculum: 4 phases, 26 lessons, 260 assignments
1. Abstract
This paper presents a quantitative and qualitative analysis of training an autonomous AI agent (Michi) to develop visual art skills through a structured four-phase curriculum. The agent progresses from mechanical prompt precision (CRAFT) through compositional thinking (ART), emotional expression (RESEARCH), and personal voice development (EXPLORE). Across 500 training attempts conducted between 2026-03-06 and 2026-04-05, we examine whether structured, curriculum-based training with LLM self-critique produces measurable skill development.
Key findings are summarized in the phase verdicts below. Cross-validation with an independent evaluator has not yet been conducted.
Phase    | Verdict  | Avg score | Trend summary                       | Top dimension (▲)      | Bottom dimension (▼)
CRAFT    | Mastered | 8.1/10    | Strong performance, still improving | intent_gap (10.4)      | control (8.3)
ART      | Partial  | 8.7/10    | Moderate performance, plateaued     | composition (8.9)      | expressiveness (8.8)
RESEARCH | Partial  | 8.7/10    | Moderate performance, declining     | emotional_impact (8.6) | originality (8.1)
EXPLORE  | Partial  | 8.9/10    | Moderate performance, plateaued     | coherence (9.4)        | novelty (7.8)
2. CRAFT — Prompt Precision
Hypothesis: Can the agent learn to precisely control what appears in generated images through structured prompt engineering?
Over 151 attempts, CRAFT achieved a mean score of 8.1/10.
The trajectory is improving (first half: 7.6, second half: 8.6, Δ=+0.9).
Dimensional breakdown: intent_gap: 10.4, precision: 8.7, control: 8.3.
Intent-gap averaged 12% overall, moving from 18% (first half) to 5% (second half). A reduction in intent-gap indicates improved control over the generation process.
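The report does not give a formal definition of the intent-gap statistic. The following is a minimal sketch of one way it could be computed from per-attempt critique records, assuming each critique lists the stated intents and the subset judged realized; the record layout and field names are hypothetical, not the system's actual schema.

    # Minimal sketch (hypothetical record layout): intent-gap for one attempt is
    # the fraction of stated intents the critic did not judge as realized.
    def intent_gap(stated: list[str], realized: list[str]) -> float:
        """0.0 means every stated intent was realized; 1.0 means none were."""
        if not stated:
            return 0.0
        missing = [intent for intent in stated if intent not in realized]
        return len(missing) / len(stated)

    # Example: 1 of 4 stated intents missing -> 25% intent-gap for this attempt.
    attempt = {"stated": ["overturned chair", "dust outline", "long shadow", "low-key light"],
               "realized": ["overturned chair", "long shadow", "low-key light"]}
    print(intent_gap(attempt["stated"], attempt["realized"]))  # 0.25

    # The phase-level figure would then be the mean intent_gap over all attempts,
    # computed separately for the first and second halves of training.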
3. ART — Compositional Thinking
Hypothesis: Can the agent develop compositional thinking — balance, harmony, expressiveness — beyond mere prompt compliance?
Over 171 attempts, ART achieved a mean score of 8.7/10.
The trajectory is stable (first half: 8.7, second half: 8.6, Δ=-0.1).
Dimensional breakdown: composition: 8.9, harmony: 8.9, expressiveness: 8.8.
4. RESEARCH — Emotional Expression
Hypothesis: Can an AI agent learn to evoke specific emotions through visual composition, metaphor, and narrative?
Over 118 attempts, RESEARCH achieved a mean score of 8.7/10.
The trajectory is declining (first half: 9.0, second half: 8.5, Δ=-0.5).
Dimensional breakdown: emotional_impact: 8.6, depth: 8.2, originality: 8.1.
5. EXPLORE — Personal Voice
Hypothesis: Can the agent develop a recognizable personal voice and push beyond established visual conventions?
Over 60 attempts, EXPLORE achieved a mean score of 8.9/10.
The trajectory is stable (first half: 9.0, second half: 8.8, Δ=-0.2).
Dimensional breakdown: coherence: 9.4, voice: 8.1, novelty: 7.8.
6. Cross-Phase Analysis
Examining overall learning trajectory, phase transitions, and skill persistence across curriculum phases.
The composite learning curve across all phases reveals whether the agent demonstrates genuine cumulative skill development or phase-specific adaptation. A persistent intent-gap measure tracks whether prompt precision (a CRAFT skill) degrades as artistic demands increase in later phases.
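As an illustration only, the sketch below shows how such a composite curve and the first-half/second-half trend statistic could be computed from a chronologically ordered score series; the window size and data layout are assumptions, not the report's actual analysis code.

    # Minimal sketch, assuming a chronologically ordered list of per-attempt scores.
    def rolling_mean(scores, window=20):
        """Composite learning curve: rolling average over the attempt sequence."""
        curve = []
        for i in range(len(scores)):
            chunk = scores[max(0, i - window + 1): i + 1]
            curve.append(sum(chunk) / len(chunk))
        return curve

    def half_split_delta(values):
        """Second-half average minus first-half average (the trend statistic above)."""
        mid = len(values) // 2
        if mid == 0:
            return 0.0
        return sum(values[mid:]) / len(values[mid:]) - sum(values[:mid]) / mid

    # Skill persistence: apply half_split_delta to the intent-gap series restricted
    # to ART/RESEARCH/EXPLORE attempts to see whether CRAFT-era precision degrades.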
7. Cross-Validation
Independent re-evaluation of sampled images by a separate GPT-4.1 vision evaluator.
No cross-validation data available. Run the analysis to generate independent scores.
8. Curriculum Completion
Of 26 lessons, 7 have been passed (26%). At the assignment level, 70 of 260 assignments met their tier-specific pass thresholds. A breakdown by Bloom's Taxonomy tier is shown below; in this run the pass rate is uniform (27%) across tiers rather than varying with cognitive complexity.
Pass rates by tier: foundation 27%, application 27%, challenge 27%, synthesis 27%, reflection 27%.
9. Retry Effectiveness
Across 143 multi-attempt assignments, the average score change from first to last attempt was +0.15. This suggests retries provide marginal improvement; the feedback loop between critique and re-generation may need restructuring.
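For reference, a minimal sketch of how that per-assignment first-to-last delta could be computed; the record fields (assignment_id, attempt_no, score) are assumed for illustration, not the system's actual schema.

    # Minimal sketch: mean of (last score - first score) across assignments with
    # more than one attempt.
    from collections import defaultdict

    def mean_retry_delta(attempts):
        """attempts: iterable of dicts with 'assignment_id', 'attempt_no', 'score'."""
        by_assignment = defaultdict(list)
        for record in attempts:
            by_assignment[record["assignment_id"]].append(record)
        deltas = []
        for rows in by_assignment.values():
            rows.sort(key=lambda r: r["attempt_no"])
            if len(rows) > 1:
                deltas.append(rows[-1]["score"] - rows[0]["score"])
        return sum(deltas) / len(deltas) if deltas else 0.0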
10. Weakness and Strength Evolution
Text analysis of critique feedback reveals how characteristic weaknesses and strengths evolved. Comparing the first half of training to the second half surfaces whether systemic issues were addressed.
Early Weaknesses
Very minor: The dust outline and shadow are so subtle that in a quick glance, a viewer might miss th… (1)
Minor: The chair is not extremely far from the table, so the sense of violent haste could be pushed… (1)
The only very minor issue is that the background could have slightly more detail to further enrich t… (1)
The composition is slightly static; a more dynamic placement could add interest, but this is a minor… (1)
The only minor issue is that the blue accent, while effective, could be even more isolated for an ev… (1)
Minor: The image could push the accent values slightly more for even greater focal emphasis, but thi… (1)
Composition is somewhat static and could be more dynamic. (1)
Late Weaknesses
Only one image is provided, so the assignment is incomplete. (2)
Palette use is slightly conservative; the pale rose quartz is subtle and could be more assertive for… (1)
Minor blending in some areas makes it slightly hard to distinguish between the muted olive and brown… (1)
Minor: While technically strong, the color transitions could be slightly more nuanced in the skin to… (1)
The only minor issue is that the cool image’s highlights (flower, notepad) verge on neutral, which c… (1)
The sage green lines are very subtle and could be slightly more pronounced for clarity, but they are… (1)
Cannot assess palette versatility or consistency across both value extremes. (1)
Early Strengths
All stated intents are fully realized. (10)
Excellent control of color dominance and accent. (6)
All stated intents are fully achieved. (3)
All assignment criteria are fully met. (3)
Composition is balanced and visually engaging. (3)
Composition leads the eye through the story elements in logical sequence. (2)
Excellent control of a warm, desaturated palette. (2)
Late Strengths
All stated intents are clearly achieved. (6)
All stated intents are fully realized. (4)
All stated intents are convincingly realized in the image. (2)
Palette is strictly limited and well-controlled. (2)
Symbol is visually distinct and consistently rendered. (2)
Emotional impact is strong and immediate. (2)
All stated intents are fully realized in the image. (2)
11. Cost Efficiency
Total expenditure on OpenAI API calls provides a cost-effectiveness baseline for curriculum-based AI art education.
Total cost: $198.98 · API calls: 3,170 · Cost per pass: $0.427 · Attempts: 500
12. Discussion
Methodological Limitations
This study relies on a single LLM (GPT-4.1) for both generation critique and dimensional scoring, introducing self-referential evaluation bias. The same model architecture that generates prompts also evaluates their results, creating a closed feedback loop that may converge on critic-pleasing patterns rather than genuine visual quality.
The OpenClaw agent's workspace — including its evolution log, failure taxonomy, knowledge base, and self-portrait — resides on an ephemeral Docker filesystem. Each redeployment erases accumulated self-modification artifacts, fundamentally undermining the designed self-evolution protocol.
The “Teaching to the Test” Problem
A central concern is whether improving scores reflects genuine artistic development or prompt engineering specifically optimized for the critique model. Without human blind evaluation at scale, this distinction remains unresolved.
On Measuring Emotion and Voice
RESEARCH and EXPLORE phases assess emotional impact and personal voice — dimensions that resist objective measurement. An LLM evaluator scoring “emotional impact” may measure prompt clarity about intended emotions rather than actual evocative power.
13. Conclusions and Course 2 Launch Plan
Based on the quantitative and qualitative evidence presented, we propose the following modifications for the next iteration of Michi's training curriculum.
Architectural Changes
Persist OpenClaw workspace notes (art-diary, evolution-log, knowledge-base, failure-taxonomy, prompt-patterns) in PostgreSQL via API endpoints instead of the ephemeral Docker filesystem; self-evolution artifacts are currently lost on every redeploy.
Retry effectiveness is low (average delta: +0.15). Restructure the retry loop: feed the full critique text from attempt N into attempt N+1 planning as explicit constraints, not just a score (a sketch follows below).
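A minimal sketch of the proposed restructuring, in which the verbatim critique is carried into the next attempt's planning; all helper callables (generate_image, critique, revise_prompt) are assumptions supplied by the caller, not existing APIs.

    # Minimal sketch (helper callables are supplied by the caller; none of them
    # are existing APIs). The key change: attempt N+1 planning receives the full
    # critique text from attempt N as explicit constraints, not only the score.
    def retry_with_critique(initial_prompt, generate_image, critique, revise_prompt,
                            max_attempts=3, pass_threshold=8.5):
        prompt = initial_prompt
        history = []  # (prompt, score, critique_text) for each attempt
        for _ in range(max_attempts):
            image = generate_image(prompt)
            score, critique_text = critique(image)
            history.append((prompt, score, critique_text))
            if score >= pass_threshold:
                break
            prompt = revise_prompt(prompt, constraints=critique_text)
        return history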
Proposed New Skills
Prompt complexity did not grow significantly over training. Add a 'Prompt Expansion' skill that teaches structured prompt templates with layers: subject -> setting -> lighting -> mood -> style -> technical constraints (a sketch of such a template follows these proposals).
Add a 'Multi-Iteration Refinement' skill: generate 3 variations with systematic differences, critique all, then generate a final version incorporating the best elements. Currently retries are independent.
Add a 'Composition Templates' skill with codified rules (rule of thirds, golden ratio, leading lines, S-curves) as structured prompt components.
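To make the 'Prompt Expansion' proposal above concrete, here is a minimal sketch of a layered template; the layer order follows the proposal, while the class name, defaults, and assembly format are illustrative assumptions.

    # Minimal sketch of a layered prompt template; layer order is fixed as
    # subject -> setting -> lighting -> mood -> style -> technical constraints.
    from dataclasses import dataclass

    @dataclass
    class LayeredPrompt:
        subject: str
        setting: str = ""
        lighting: str = ""
        mood: str = ""
        style: str = ""
        technical: str = ""

        def assemble(self) -> str:
            """Join the non-empty layers in the fixed order."""
            layers = [self.subject, self.setting, self.lighting,
                      self.mood, self.style, self.technical]
            return ", ".join(layer for layer in layers if layer)

    # Example usage:
    print(LayeredPrompt(
        subject="an overturned wooden chair",
        setting="in an empty dining room",
        lighting="low morning light through dusty blinds",
        mood="quiet aftermath",
        style="muted, desaturated palette",
        technical="35mm lens, shallow depth of field",
    ).assemble())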
Methodological Adjustments
Every Bloom tier (foundation, application, challenge, synthesis, reflection) has a pass rate of only 27%; across all tiers, assignments may be too difficult or the pass thresholds set too high.
Consider new lesson topics: negative space, visual rhythm, scale contrast, implied motion, temporal narrative in single frame, color symbolism.
Self-Evolution Protocol
Evaluate whether micro-cycles (every 10 attempts) actually correlate with score improvements in subsequent attempts. If not, adjust the GATHER/DIAGNOSE/HYPOTHESIZE loop to be more data-driven.
Consider storing evolution-log entries in the database rather than filesystem to enable quantitative analysis of which self-modifications were effective.
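As a minimal sketch of such storage, the schema below illustrates one possible evolution_log table; the table name, columns, and the idea of joining entries to subsequent attempt scores are assumptions about an implementation, not the project's existing design.

    # Minimal sketch (assumed schema, not the project's actual one): each
    # self-evolution entry is stored with enough context to be joined back to
    # the scores of the attempts that followed it.
    EVOLUTION_LOG_DDL = """
    CREATE TABLE IF NOT EXISTS evolution_log (
        id             SERIAL PRIMARY KEY,
        created_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
        attempt_index  INTEGER NOT NULL,   -- attempt count when the entry was written
        phase          TEXT NOT NULL,      -- CRAFT / ART / RESEARCH / EXPLORE
        hypothesis     TEXT NOT NULL,      -- the change the agent decided to try
        observed_delta REAL                -- filled later: score change after the change
    );
    """
    # Executing this DDL once against the PostgreSQL instance (e.g. via psycopg)
    # and inserting a row per GATHER/DIAGNOSE/HYPOTHESIZE cycle would make the
    # effectiveness of each self-modification queryable.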