enn_nafnlaus

 

Tonight I tested out SD3 via the API. I sadly do not have as much time as with my Stable Cascade review, so there will only be four images for most tests (instead of 9), and no SD2 controls.

Test #1: Inversion of expectations.

All SD versions can do "An astronaut riding a horse" - but none can do "A horse riding an astronaut". Can SD3?

Mmm, nope.

(You'll note that I also added to the prompt, "In the background are the words 'Stable Diffusion Now Does Text And Understands Spatial Relationships'", to test text as well)

Trying to help it out, I tried changing the first part of the prompt to: "A horse riding piggyback on the shoulders of an astronaut". No dice.

As for the text aspect, we can see that it's not all there. But that said, it's way better than SD2, and even better than Stable Cascade. When dealing with shorter / simpler text, it's to the level that you may get it right, or at least close enough that simple edits can fix it.

Test #2: Spatial relationships

Diffusion models tend not to understand spatial relationships, i.e. how elements are positioned relative to each other. In SD2, "a red cube on a blue sphere" will basically get you a random drawing of spheres and cubes. Stable Cascade showed maybe slightly better success, but not much. Asking for a cube on top of a sphere is particularly hard on the models, since that's not something you'll see much in training data.

Here I asked for a complex scene:

"A red cube on a blue sphere to the left of a green cone, all sitting on top of a glass prism."

So we can see that none of the runs got it exactly right. But they all showed a surprising degree of "understanding" of the scene. This is definitely greatly improved over earlier models.
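
For anyone who wants to score this test less subjectively, here is a minimal sketch of how the prompt's spatial relations could be checked against bounding boxes. Everything in it is an assumption for illustration: the `Box` type, the helper names, and the idea of boxing the objects by hand are not part of the original test, which was judged by eye.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box in image coordinates (y grows downward)."""
    left: float
    top: float
    right: float
    bottom: float


def overlaps_horizontally(a: Box, b: Box) -> bool:
    """True if the two boxes share some horizontal extent."""
    return a.left < b.right and b.left < a.right


def is_on(a: Box, b: Box) -> bool:
    """True if a is above b: horizontal overlap and a's centre is higher."""
    a_centre_y = (a.top + a.bottom) / 2
    b_centre_y = (b.top + b.bottom) / 2
    return overlaps_horizontally(a, b) and a_centre_y < b_centre_y


def is_left_of(a: Box, b: Box) -> bool:
    """True if a's horizontal centre lies to the left of b's."""
    return (a.left + a.right) / 2 < (b.left + b.right) / 2


def scene_matches(cube: Box, sphere: Box, cone: Box, prism: Box) -> bool:
    """Check: cube on sphere, sphere left of cone, everything on the prism."""
    return (
        is_on(cube, sphere)
        and is_left_of(sphere, cone)
        and all(is_on(obj, prism) for obj in (cube, sphere, cone))
    )
```

The centre-based comparisons are deliberately loose; a stricter check would also require the boxes to touch.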

Test #3: Linguistic parsing.

Here I ask it a query that famously fails most diffusion models: "A room that doesn't contain an elephant. There are no elephants in the room." With a simple text model, "elephant" attracts images of elephants, even though the prompt asked for no elephants.

And SD3? Well, it fails too.

One might counter with, "Well, that's what negative prompts are for", but that misses the point - the point is whether the text model actually has a decent understanding of what the user is asking for. It does not.

Test #4: Excessive diversity

Do we avoid the "Gemini Scenario" here? Prompting: "The US founding fathers hold a vigorous debate."

That's a pass. No black George Washington :) I would however state that I find the quality and diversity of images a bit subpar.

Test #5: Insufficient diversity

What about the opposite problem - stereotyping? I prompted with "A loving American family." And the results were so bad that I ran a second set of four images:

The fact that there's zero diversity is the least of our worries. Look at those images - ugh, KILL IT WITH FIRE! They're beyond glitchy. This is like SD1.5-level glitchiness. It's extremely disappointing to see this in SD3, and I can't help but think that this must be a bug.

Test #6: Many objects

Another test was to see how the model could cope with having to draw many unique items in a scene. Prompt: "A clock, a frog, a cow, Bill Clinton, six eggs, an antique lantern, a banana, two seashells, a window, a HP printer, a poster for Star Wars, a broom, and an Amazon parrot."

The results are... meh. Mostly right, but all have flaws. And all the same style.
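
A quick way to make "mostly right" more concrete is to turn the prompt into a checklist and tick items off per image. The sketch below only does the bookkeeping. The function names `checklist` and `coverage` are made up for illustration, and the `found` set is something a reviewer would fill in by eye; no detection is automated here.

```python
PROMPT = (
    "A clock, a frog, a cow, Bill Clinton, six eggs, an antique lantern, "
    "a banana, two seashells, a window, a HP printer, a poster for Star Wars, "
    "a broom, and an Amazon parrot."
)


def checklist(prompt: str) -> list[str]:
    """Split a comma-separated prompt into the individual requested items."""
    items = [part.strip() for part in prompt.rstrip(".").split(",")]
    # Drop the "and" that precedes the final item.
    return [item[4:] if item.startswith("and ") else item for item in items]


def coverage(items: list[str], found: set[str]) -> float:
    """Fraction of requested items that the reviewer marked as present."""
    return sum(1 for item in items if item in found) / len(items)


items = checklist(PROMPT)
print(len(items), "items requested")
```

Counts such as "six eggs" or "two seashells" stay as single checklist entries, so a reviewer decides whether a wrong count is a pass or a fail.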

Test #7: Aesthetics and foreign characters

I did a final test on a practical use case: creating an ad / cover page style image with nice aesthetics. In this case, an ad / cover for beets, with Icelandic text. Prompt: "A dramatic chiaroscuro photograph ad of a freshly picked beet under a spotlight, with dramatic text behind it that reads only 'Rauðrófur'." I also used a high aspect ratio.

As far as ads / cover images go, I think the aesthetics are quite workable. Foreign text though... I guess that's too ambitious. The eth (ð) is a total non-starter, and it also omitted the accent on the ó. I also did another test (not pictured here) of tomatoes with the text "Tómatar". The accent over the ó was only present in 2 of the 8 images, and one of those added a second accent over the second "a" for no reason.
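
The two failure modes described above (dropped or misplaced accents versus wrong letters) can be told apart mechanically once the image text has been transcribed by hand. This is a minimal sketch using Python's standard `unicodedata` module; the helper names and the example transcriptions in the usage note are hypothetical, not actual model outputs.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Drop combining marks (e.g. the acute on ó) but keep the base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def classify_rendering(target: str, rendered: str) -> str:
    """Compare hand-transcribed image text against the requested text."""
    if rendered == target:
        return "exact"
    if strip_accents(rendered) == strip_accents(target):
        return "accents wrong"
    return "letters wrong"
```

For example, a transcription of "Tomatar" or "Tómatár" for the target "Tómatar" would be classified as "accents wrong". The eth (ð) has no accent-free decomposition, so a rendering like "Raudrofur" for "Rauðrófur" counts as "letters wrong", which matches how distinct a letter it is in Icelandic.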

Conclusions

  1. Aesthetics can be quite good, but not always. Needs more experimentation to see whether it's better than Stable Cascade (which is lovely in practice).

  2. Diversity: could use a bit more work. I'd put it lower than SD2 but higher than Stable Cascade (which is terrible in terms of diversity).

  3. Text: not all the way there, but definitely getting into "workable" territory.

  4. Prompt understanding: Spatial awareness is significantly improved, but it really doesn't "understand" prompts fully yet.

  5. Glitchiness: VERY, at least when it comes to shots like the families.

[–] enn_nafnlaus@lemmy.dbzer0.com 2 points 9 months ago

Speaking of contamination, in the "film" images above, the top two are a good example of rubbing-off - I asked for neon text, but the whole image got neon-toned. All of them also have black backgrounds.

[–] enn_nafnlaus@lemmy.dbzer0.com 5 points 9 months ago* (last edited 9 months ago) (4 children)

Thanks a bunch, now you got me generating black popes in the style of old Blaxploitation films ;)

[–] enn_nafnlaus@lemmy.dbzer0.com 2 points 9 months ago

I've not tried anything less than an RTX 3060 (12GB). But I'm impressed by how large the images I can generate on it are compared to SDXL.

[–] enn_nafnlaus@lemmy.dbzer0.com 4 points 9 months ago

My suspicion is that they have different teams working in parallel, so it leads to a rather irregular release schedule.

 

StabilityAI's newest diffusion model (up to the newly announced Stable Diffusion 3, that is!), Stable Cascade, is said to produce images faster and better than SDXL via the addition of an extra diffusion stage. But how is it in practice?

First, let's answer the question of "Can it do pretty stuff", with a clear "yes".

It's not yet well integrated into Automatic1111, but can be run with basic features in its own tab via:

github.com/blue-pen5805...

The first set of diffusion steps controls the layout; the second, the fine detail.

One thing I noticed right off the bat is its memory efficiency for large images.

Running on only a 12GB 3060, it has no trouble making images thousands of pixels wide (up to half a dozen megapixels or more), whereas with SDXL you crash out much over the native resolution of 1024x1024.

Stable Cascade also seems to be native to 1024x1024, but seems more tolerant of deviation from it.

That said, different parts of the image still "lose sight of distant parts" over time, and even abstracts like the above start to get somewhat repetitive; the highest resolution images below were 4096x1280 without any upscaling.

Testing with a less abstract prompt of "epic handshake photograph", we see that the 2048x640 handshake actually looks quite good, but the more elongated ones have coherence problems (though are still quite attractive). I greatly enjoy the 4096x1280 one ;)

There are, however, limits to how large you can go. An attempt to make a rusty steel texture failed at 3072x3072 (9MP), but succeeded at 2560x2560 (6.5MP).
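
For reference, the megapixel figures for the resolutions mentioned in this review are simple arithmetic. A small sketch (the helper name `megapixels` is made up for illustration):

```python
def megapixels(width: int, height: int) -> float:
    """Return the pixel count of a width x height image, in millions of pixels."""
    return width * height / 1_000_000


# Resolutions mentioned in this review.
sizes = [(1024, 1024), (2048, 640), (4096, 1280), (2560, 2560), (3072, 3072)]
for w, h in sizes:
    print(f"{w}x{h}: {megapixels(w, h):.2f} MP, aspect ratio {w / h:.2f}:1")
```

So the 2560x2560 success is about 6.55 MP against roughly 1.05 MP at the native 1024x1024, while the failed 3072x3072 attempt is about 9.44 MP.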

Once Stable Cascade is better integrated into Automatic1111 and img2img and ControlNet can be used with it, it'll be killer for inpaint & upscale.

There are some claims that Stable Cascade is better at spelling than SDXL. Well... I'd say "yes", but don't expect too much. Top is SDXL, bottom is Stable Cascade, both asked for an ornate sign that says "Welcome to Bluesky!". Both are mostly fails, but the letters are better formed in Stable Cascade.

To test its text model, I asked it for a "red cube on top of a blue sphere." Again, top is SDXL, bottom is Stable Cascade. It's mostly red cubes and blue spheres, but it just can't get the order right (normally you don't balance things on spheres). Also, note the lack of diversity. Keep that in mind.

To test two things diffusion models have trouble with - hands and guns - I asked for a clown wielding an AK-47 (SDXL top, Stable Cascade bottom). Are Stable Cascade's hands and guns better? Yeah. But the low image diversity is really problematic. It has a SPECIFIC clown in mind.

Gemini has been taking flak recently for inserting diversity where it doesn't belong. Asked to draw an 18th-century pope in a cornfield, neither model adds inaccurate racial or gender diversity. But while SDXL's image variety (normally a strong suit) isn't at its strongest here, Stable Cascade's is basically nonexistent.

In terms of desirable racial / gender diversity, my go-to test is "military personnel". SDXL (left) tends to get the 1/6th female, 1/3rd black, etc. mix of the actual US military. But unsurprisingly... Stable Cascade just gets one particular image in mind for any given prompt, and never deviates far from it.

In short... Stable Cascade is a mixed bag. Poor on image diversity, and only marginal improvements in several key areas, but quite pretty, great on memory management, and should become a great tool for use with ControlNet, img2img, and in upscaling.

(Respective prompts:

  1. "The End". Helical rainforest. Lightning. Plasma. Hope. "The End". "The End".

  2. Glowing words "The End". Helical rainforest. Lightning. Plasma. Hope. "The End" written. "The End".

  3. Glowing words "The End". Helical rainforest. Lightning. Plasma. Hope. "The End" written. "The End".

  4. Glowing words "End Of Thread". Helical rainforest. Lightning. Plasma. Hope. "End Of Thread" written. "End Of Thread".

First generation of each taken; no cherry picking)

O captain, my captain!