The model is terrible at generating humans. How many Japanese girls was this model trained on?
oh dear...
Wrong resolution, wrong CFG, wrong steps, one line prompt without prompt enhancement, what did you expect?
A human with two arms? Sorry, SDXL could do this YEARS AGO. Well, mostly. But it had the 2023 excuse.
Well, flux.2 is as bad at that. Having checked a few tests and reviews, Ernie seems to be REALLY bad at anatomy. At least flux2 only freaks out with more than one person and kinda "complicated" poses. And you can't blame that on lazy prompts. Besides, when I read "prompt enhancer" (that comes with it) I auto-translate it to "censorship". Hence LTX2.
Quoting 20 pages of benchmarks doesn't help when a model fails at the level of counting arms and hands.
Edit: having tested it a bit myself (turbo version), it may not be QUITE as bad as Flux.2 Klein9b (not to mention 4b) at counting arms and hands, but there's not much between them. And that doesn't change with the 6 GB crutch of a prompt enhancer. It also seems to have a strong tendency to make everybody Chinese, no matter what is prompted. It is fast, I'll give it that. But better than Z-image? I really don't think so.
And while I complain a lot about flux.2's abysmally bad anatomy, it DOES have very powerful multi-reference editing capabilities, and it does quite high resolution at good image quality. Which puts it, despite the issues, in a different league, while Ernie is pure text to image. And on that level of only t2i and speed, z-image with a few loras (like zit_sda_v1 for diversity) just feels much better. With Ernie, like Flux.2, for anything slightly more challenging (than a lady just standing there), you have to run multiple images to get one that is OK. Z-image (or Qwen btw) very rarely messes things up like that.
OK, why not go into it. Prompt (geometrically slightly challenging, but nothing crazy): three young, turkish men in elegant but dirty, slim fit plain, blue dress shirts run at speed slightly behind each other, along a dirt path through a dark, lush forest.
There's an interesting little look at the end into the over 6 GB "prompt enhancer" that comes with Ernie.
The enhancer does this: A highly cinematic, high-definition photograph. The scene features three young, slender Turkish men wearing simple dark blue shirts. The shirts have visible stains and signs of wear, giving them a rustic yet elegant appearance. The three men are running forward along a winding dirt path, keeping a slight distance from one another to create a dynamic formation. The background is a dense, dark forest with tall, naturally scattered trees and thick foliage that casts mottled shadows. Light filters through the gaps in the canopy onto the ground, creating an interplay of light and shadow. The composition uses a wide aspect ratio of 1440x1024, shot from a side-rear angle, perfectly capturing the subjects' running postures as they blend with the environment. The overall color palette features cool, stark natural tones, emphasizing an atmosphere of outdoor adventure or wilderness survival. (Gemini translated from Chinese.) Note how it just changed "dress shirts" into "simple shirts"; see the last paragraph (tl;dr: it is the enhancer's fault).
Results for Ernie: one is good but failed the prompt quite a bit; one is IDENTICAL but with chineseyfied subjects; the third (English-translated extended prompt) shows the first bad mess-up: weird directions, weird arm. It's the ONLY anatomy mess-up I can see in all these attempts, and it's from Ernie. These are not the worst of a lot of attempts, these are 3 out of 4.
4 (not shown here) was again VERY similar. Ernie turbo seems even more static than Z-image turbo. It seems to create practically identical copies from the same prompt (random seed).
1
2 (remember identical prompt, different seed)
3. Note the guys don't really look Turkish like the prompt says; it's kind of a hybrid with Asian. But possibly there are ethnicities matching that.
z-image (with the extended prompt; I changed shirts back to dress shirts). Did another one (not shown); it had four guys in it but was otherwise good too (and while similar, it wasn't identical):
z-image with the basic prompt:
flux.2 klein9b extended prompt:
Ernie has imho by FAR the worst results (prompt adherence, anatomy), with and without prompt enhancing. All else left at comfy ernie template defaults.
At first Gemini didn't translate but nanobanana'd the prompt. So why not, here it is:
Had a little word with Gemini about the dress shirt thing (https://gemini.google.com/share/14670956d722):
summary: ...."Where the change happened:
When your original basic prompt (which included "dress shirt") was expanded into that detailed Chinese paragraph, the translator or AI generalized "dress shirt" into the standard word for "shirt" (衬衫) and added the adjective "simple" (简约的)."
So the prompt enhancer also seems not very good at maintaining precision while enhancing, changing "blue dress shirts" into "simple blue shirts". That's on top of a model that seems to treat a prompt more like a careful suggestion.
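For anyone who wants to catch this kind of drift automatically, here is a rough sketch of a check that flags key phrases from the basic prompt that no longer show up in the enhanced one. It's plain Python with nothing Ernie-specific in it; `missing_terms` and the list of required terms are made up for illustration, and the enhanced text is a shortened excerpt of the translated enhancer output quoted above.

```python
def missing_terms(required_terms, enhanced_prompt):
    """Return the required terms that no longer appear in the enhanced prompt.

    Hypothetical helper: a plain case-insensitive substring check,
    not part of Ernie, ComfyUI, or any enhancer tooling.
    """
    text = enhanced_prompt.lower()
    return [term for term in required_terms if term.lower() not in text]


# Key phrases picked by hand from the basic prompt (an assumption about what matters).
required = ["dress shirt", "turkish", "dirt path", "forest"]

# Shortened excerpt of the translated enhancer output.
enhanced = (
    "The scene features three young, slender Turkish men wearing simple dark blue shirts. "
    "The three men are running forward along a winding dirt path. "
    "The background is a dense, dark forest."
)

print(missing_terms(required, enhanced))  # ['dress shirt']
```

A substring check like this is crude (it won't catch paraphrases), but it is enough to flag the "dress shirts" to "simple shirts" swap before burning a generation on it.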
You know that the prompt enhancer is an LLM and you can change the system prompt to let it make changes and add real detail... Did so and it is like a different model. Granted, the resolution is lacking in comparison to z-image, but there are no finetunes out yet; z-image and flux2 klein needed those before being usable or more useful. The flux2 klein core is a complete mess on anatomical steroids, and the only issue Ernie had out of the box for me was a faded limb here or there, which can also be fixed with proper prompting or the prompt enhancer with a custom sys prompt...
Give it a bit of time and it will be a great addition to the variety of modern models.
You may as well use an external solution and not waste 6 GB on yet another LLM (which has to be different from the one used for clip?!). But anyway, the arms and legs issue and the tendency to make every person Chinese is not the prompt, it's the model itself. And it's baaad at arms and legs.
(I still use chroma and wan a lot and I don't buy that (umt) t5xxl is noticeably worse than all these (different) clip LLMs.) Chroma and wan (remix) can handle quite complex interactions between people and my overcomplicated prompts, and very rarely mess anatomy up. Less than Ernie and Flux.2, definitely.
