A text model learns to draw,
guided by a vision critic.
An Artist language model writes an SVG sketch from a text prompt. It cannot see. A separate Critic vision model looks at the rendered drawing and describes, in plain language, what is wrong. The Artist revises and the loop repeats. This work investigates whether a vision model can measurably improve the output of a text-only model with which it shares no weights, no architecture, and no training data.
Keywords. multimodal critique; SVG generation; iterative refinement; text-only drawing
$$\text{prompt} \to \text{Artist} \to \text{SVG} \to \text{render} \to \text{Critic} \to \text{feedback} \tag{1}$$
$$s_t = A(p, f_t), \qquad (r_t, v_t, f_{t+1}) = C(\mathrm{render}(s_t)) \tag{2}$$
The interface is intentionally narrow: the Artist emits SVG, while the Critic emits only scalar judgment and prose.
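The narrow interface can be made concrete with a few type definitions. This is a minimal sketch, not the project's actual code: the names `Critique`, `Artist`, and `Critic` are hypothetical, and it assumes that the triple the Critic returns in Eq. (2) is the score, the verdict, and the feedback, with the score on the 0-10 scale.

```python
from dataclasses import dataclass
from typing import Literal, Protocol

Verdict = Literal["revise", "accept"]


@dataclass(frozen=True)
class Critique:
    """Hypothetical container for everything the Critic sends back."""

    score: float      # scalar judgment, assumed to lie in [0, 10]
    verdict: Verdict  # "revise" or "accept"
    feedback: str     # plain-language description of what is wrong

    def __post_init__(self) -> None:
        # Reject values outside the assumed interface.
        if not 0 <= self.score <= 10:
            raise ValueError(f"score out of range: {self.score}")
        if self.verdict not in ("revise", "accept"):
            raise ValueError(f"unknown verdict: {self.verdict!r}")


class Artist(Protocol):
    """Text in, SVG text out. The Artist never receives pixels."""

    def __call__(self, prompt: str, feedback: str) -> str: ...


class Critic(Protocol):
    """Rendered image in, scalar judgment and prose out."""

    def __call__(self, image: bytes) -> Critique: ...
```

Keeping the Critic's output to one small record means the two models can be swapped independently, since neither depends on the other's weights or architecture.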
- Input: text prompt p, maximum iterations T
- Initialize feedback f as the empty string.
- for t = 1, ..., T do
  - Artist writes SVG from p and previous feedback f.
  - Render SVG to a raster image.
  - Critic assigns score, verdict, and natural-language feedback.
  - if verdict is accept then return drawing.
  - Set f to the Critic feedback.
- end for
- return final drawing and Critic report.
Note. The Artist never sees the image; it only ever reads the Critic’s words. The run halts when the Critic accepts the drawing or the iteration cap is reached.
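The procedure above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the project's implementation: `refine`, `Report`, and the three `toy_*` functions are hypothetical names, the toy Artist and Critic are hard-coded stand-ins for real models, and `toy_render` merely encodes the SVG text instead of rasterizing it.

```python
from typing import Callable, NamedTuple, Tuple


class Report(NamedTuple):
    score: float
    verdict: str   # "revise" or "accept"
    feedback: str


def refine(
    prompt: str,
    artist: Callable[[str, str], str],
    render: Callable[[str], bytes],
    critic: Callable[[bytes], Report],
    max_iters: int,
) -> Tuple[str, Report, int]:
    """Run the Artist/Critic loop; return (svg, last report, iterations used)."""
    if max_iters < 1:
        raise ValueError("max_iters must be at least 1")
    feedback = ""                        # empty feedback on the first pass
    for t in range(1, max_iters + 1):
        svg = artist(prompt, feedback)   # the Artist sees text only
        image = render(svg)              # only the Critic sees the image
        report = critic(image)
        if report.verdict == "accept":
            return svg, report, t
        feedback = report.feedback       # carry the critique forward
    return svg, report, max_iters        # cap reached without acceptance


# Toy stand-ins (assumptions for illustration, not real models).
def toy_artist(prompt: str, feedback: str) -> str:
    ears = "<path d='M3 2 L4 0 L5 2'/>" if "ears" in feedback else ""
    return (
        "<svg xmlns='http://www.w3.org/2000/svg'>"
        f"<circle cx='5' cy='5' r='3'/>{ears}</svg>"
    )


def toy_render(svg: str) -> bytes:
    return svg.encode("utf-8")           # placeholder for a real rasterizer


def toy_critic(image: bytes) -> Report:
    if b"<path" in image:
        return Report(9.0, "accept", "")
    return Report(4.0, "revise", "Add ears on top of the head.")


svg, report, used = refine("a cat", toy_artist, toy_render, toy_critic, 4)
```

With these stand-ins the first pass is rejected, the feedback mentions ears, and the second pass is accepted, so the loop stops before the cap.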
A representative pass of the loop. The Artist sketches the subject stroke by stroke; the Critic looks at the result and answers in plain words. This is iteration three of a run on “a cat”: the drawing now reads clearly but still needs one more refinement pass.
(a) Artist sketch s3, rendered from SVG path data.
| Prompt | “a cat” |
|---|---|
| Iteration | 3 / 4 |
| Score | 8 / 10 |
| Verdict | revise |
…
(b) Critic output returned to the Artist as text.
Every iteration is scored from 0 to 10 by the Critic, which also returns a verdict: revise or accept. The loop is useful when the score rises under critique and the final verdict changes from revise to accept.
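One way to turn this criterion into a check over a finished run is sketched below. It encodes a single reading of the sentence above, namely that the last score exceeds the first and the verdict moves from revise to accept; the function name `loop_was_useful` and the example histories are hypothetical and are not results from the project.

```python
from typing import Sequence, Tuple


def loop_was_useful(history: Sequence[Tuple[float, str]]) -> bool:
    """history holds one (score, verdict) pair per iteration, in order."""
    if len(history) < 2:
        return False  # a single pass involves no critique-driven revision
    first_score, first_verdict = history[0]
    last_score, last_verdict = history[-1]
    score_rose = last_score > first_score
    verdict_flipped = first_verdict == "revise" and last_verdict == "accept"
    return score_rose and verdict_flipped


# Made-up histories for illustration only.
improving = [(4, "revise"), (6, "revise"), (9, "accept")]
stalled = [(5, "revise"), (5, "revise")]
```

Here `loop_was_useful(improving)` is true, while `loop_was_useful(stalled)` is false because the score never rose and the verdict never changed.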
- Source code and local reproduction instructions. learn-to-draw-step-by-step, 2026.
This page also serves as a small reproduction cell. Live runs stream stroke by stroke from the inference backend.
Local reproduction command:
make local ARTIST=gemma3:27b CRITIC=blaifa/InternVL3_5:8b
Full setup is in the setup notes.