StableDiffusion

201
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/b-monster666 on 2024-10-04 19:27:29+00:00.


Dude just keeps posting "Early Access" checkpoints for millions of credits in donations

202
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/lostinspaz on 2024-10-04 16:04:02+00:00.


A while ago, I did some black-box analysis of CLIP (L, G) to learn more about them.

Now I'm starting to do similar things with T5 (specifically, t5xxl-enconly).

One odd thing I have discovered so far: It uses SentencePiece as its tokenizer, and from a human perspective, it can be stupid/wasteful.

Not as bad as the CLIP-L used in SD(xl), but still...

It is case-sensitive. In some limited contexts I could see that as a benefit, but it's stupid in specific cases like the following:

It has a fixed number of unique token IDs: around 32,000.

Of those, 9,000 are tied to explicit uppercase use.

Some of them make sense. But then there are things like this:

"Title" and "title" have their own unique token IDs

"Cushion" and "cushion" have their own unique token IDs.

????

I haven't done a comprehensive analysis, but I would guess somewhere between 200 and 900 entries are like this. The waste makes me sad.

Why does this matter?

Because any time a word doesn't have its own unique token ID, it has to be represented by multiple tokens. Multiple tokens mean multiple encodings (note: CLIP coalesces multiple tokens into a single text embedding; T5 does NOT!), which means more work, which means calculations and generations take longer.
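
If anyone wants to poke at this themselves, here's a minimal sketch using the Hugging Face tokenizer (my assumption: google/t5-v1_1-xxl ships the same ~32k SentencePiece vocab that t5xxl-enconly uses):

```python
# Minimal sketch, assuming the transformers and sentencepiece packages are installed.
from transformers import T5Tokenizer

tok = T5Tokenizer.from_pretrained("google/t5-v1_1-xxl")
print(tok.vocab_size)  # ~32k token IDs

# Inspect the case-sensitive pairs mentioned above.
for word in ["Title", "title", "Cushion", "cushion"]:
    print(word, tok.encode(word, add_special_tokens=False))

# Rough count of vocab entries that start with an uppercase letter
# ("▁" marks a word boundary in SentencePiece).
uppercase_entries = [t for t in tok.get_vocab() if t.lstrip("▁")[:1].isupper()]
print(len(uppercase_entries))
```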

PS: my ongoing tools will be updated at

203
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/tintwotin on 2024-10-04 17:24:45+00:00.

204
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/jonesaid on 2024-10-04 13:45:13+00:00.


This looks like an interesting approach to using LLMs to help generate prompt-specific workflows for ComfyUI.

205
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/rerri on 2024-10-04 13:16:04+00:00.


Blog post:

Samples:

Paper:

  • 30B-parameter model capable of generating videos and images
  • Video resolution of 768x768, or a similar pixel count in other aspect ratios
  • Video length of 16 seconds
  • The blog mentions a "potential future release", whatever that means
206
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/EntertainerOk9595 on 2024-10-04 08:23:33+00:00.

207
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/camenduru on 2024-10-03 23:49:17+00:00.

208
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/Robos_Basilisk on 2024-10-04 08:30:29+00:00.

209
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/rawker86 on 2024-10-04 04:28:07+00:00.

210
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/isnaiter on 2024-10-03 20:14:51+00:00.


Going straight to the point: I fixed Prodigy's main issue. With my fix, you can train the Unet and TEs for as long as you want without frying the TEs or undertraining the Unet. To use it, just get the code I submitted in a PR on Prodigy's GitHub. I don't know if they'll accept it, so you'll probably have to replace the file manually in the venv.

Edit: it's also possible to set a different LR for each network.

As for the loss modifier, I made it based on my limited knowledge of diffusion training and machine learning. It's not perfect, and it's not the holy grail, but my training runs always turn out better when I use it.

Feel free to suggest ways to improve it.

For convenience, I replaced OneTrainer's min snr gamma function with my own, so all I need to do is activate min snr gamma and my function will take over.

I’m not going to post any examples here, but if anyone’s curious, I uploaded a training I did of my ugly face in the training results channel on the OT discord.

Edit:

To use the Prodigy fix, get prodigy.py here:

and put it in this folder:

C:\your-trainer-folder\OneTrainer\venv\Lib\site-packages\prodigyopt\

That's it. All the settings in OT stay the same, unless you want to set different LRs for each network, which is possible now.

To use my custom loss modifier, get ModelSetupDiffusionLossMixin.py here:

and put it in this folder:

C:\your-trainer-folder\OneTrainer\modules\modelSetup\mixin
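
If you'd rather script the copies than move the files by hand, here's a minimal Python sketch (assuming both files were downloaded to your current folder; "your-trainer-folder" is a placeholder for your actual OneTrainer install path):

```python
# Minimal sketch: copy the two replacement files into place.
# Adjust the destination paths to your own OneTrainer install.
import shutil

shutil.copy(
    "prodigy.py",
    r"C:\your-trainer-folder\OneTrainer\venv\Lib\site-packages\prodigyopt\prodigy.py",
)
shutil.copy(
    "ModelSetupDiffusionLossMixin.py",
    r"C:\your-trainer-folder\OneTrainer\modules\modelSetup\mixin\ModelSetupDiffusionLossMixin.py",
)
```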

Then, in OT's UI, select MIN_SNR_GAMMA as the Loss Weight Function on the training tab and enter any positive value.

The value itself doesn't matter; it's only there to make OT trigger the conditionals that call the min snr gamma function, which my function has now replaced.

There was a typo in the function name in the loss modifier file; it's fixed now. The name was missing an underscore.

211
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/Anibaaal on 2024-10-04 03:12:57+00:00.

212
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/Striking-Long-2960 on 2024-10-03 23:48:21+00:00.


cinematic, beautiful, in the street of a city, a red car is moving towards the camera

cinematic, beautiful, in the street of a city, a red car is moving towards the camera

cinematic, beautiful, in a park, in the background a samoyedan dog is moving towards the camera

After some initial bad results, I decided to give Cogvideoxfun Pose a second chance, this time using some basic 3D renders as control... And oooooh boy, this is impressive. The basic workflow is in the ComfyUI-CogVideoXWrapper folder, and you can also find it here:

These are tests done with Cogvideoxfun-2B at low resolutions and with a low number of steps, just to show how powerful this technique is.

cinematic, beautiful, in a park, a samoyedan dog is moving towards the camera

NOTE: Prompts are very important; poor word order can lead to unexpected results. For example:

cinematic, beautiful, a beautiful red car in a city at morning

213
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/EldrichArchive on 2024-10-03 21:21:33+00:00.

214
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/anekii on 2024-10-03 15:21:33+00:00.

215
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/_Vikthor on 2024-10-03 20:28:19+00:00.

216
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/jenza1 on 2024-10-03 15:34:33+00:00.

217
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/bipolaridiot_ on 2024-10-03 14:57:01+00:00.

218
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/LewdGarlic on 2024-10-03 14:46:21+00:00.

219
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/tevlon on 2024-10-03 13:41:38+00:00.

220
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/Total_Kangaroo_7140 on 2024-10-03 13:27:21+00:00.

221
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/Current_Wind_2667 on 2024-10-03 08:11:27+00:00.


222
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/LevelKnown4922 on 2024-10-02 22:18:11+00:00.

223
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/Sea-Resort730 on 2024-10-03 05:07:45+00:00.

224
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/SteffanWestcott on 2024-10-02 23:34:13+00:00.


Powder workflow for ComfyUI is available here

I found a way of using masked conditioning for part of the image inference to combine high and low Flux guidance into a single pass text-to-image workflow. I use this to remove the waxy look of skin textures for photorealistic portraits in Flux Dev, where the overall image needs to use high Flux guidance for good prompt adherence or character Lora likeness.

Please give Powder a try!

Instructions are in the workflow; I've copied them below:

Powder is a single-pass text-to-image workflow for Flux.1 [dev] based checkpoints. It is designed for photorealistic portraits that require high Flux guidance (3.5 or above) for the overall image. It aims to improve skin contrast and detail, avoiding the shiny, waxy, smoothed look.

High Flux guidance is required for good prompt adherence, image composition, colour saturation and close likeness with character Loras. Lower Flux guidance (1.7 to 2.2) improves skin contrast and detail but loses the mentioned benefits of high guidance for the overall image. Powder uses masked conditioning with varied Flux guidance according to a 3 phase schedule. It also uses masked noise injection to add skin blemishes. It can be run completely automatically, though there is a recommended optional step to manually edit the skin mask. Powder can be used with any Loras and controlnets that work with a standard KSampler, but it does not work with Flux.1 [schnell].

Powder uses an Ultralytics detector for skin image segments. Install the detector model using ComfyUI Manager > Model Manager and search for skin_yolov8n-seg_800.pt

Image inference uses a KSampler as usual, but the scheduled steps are split into 3 phases:

  • Phase 1: Each KSampler step uses a single (high) Flux guidance value for the whole image.
  • Phase 2: Latent noise is injected into the masked region. Then, inference proceeds like in Phase 1, except for a different (lower) Flux guidance value used for the masked region.
  • Phase 3: Similar to Phase 2, but using different settings for the injected noise and Flux guidance value applied to the masked region.
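
To make the step accounting concrete, here's a small Python illustration of how the scheduled steps could be divided (this is not the workflow's actual code, and the even Phase 2/3 split is my assumption purely for illustration):

```python
# Illustration only: divide the total scheduled steps across the three phases.
# The even Phase 2/3 split is an assumption, not taken from the workflow itself.
def split_steps(total_steps: int, phase1_proportion: float):
    phase1 = round(total_steps * phase1_proportion)  # e.g. 0.24 * 50 = 12
    remaining = total_steps - phase1
    phase2 = remaining // 2
    phase3 = remaining - phase2
    return phase1, phase2, phase3

print(split_steps(50, 0.24))  # -> (12, 19, 19); the 12 matches the "Phase 1 steps proportion" tip below
```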

At the end of Phase 1, the workflow pauses. Right-click on the image in "Edit skin mask" and select "Open in MaskEditor". The image will be fuzzy because it is not fully resolved, but its composition should be apparent. A rough mask will have been automatically generated. The mask should cover skin only; ensure hair, eyes, lips, teeth, nails and jewellery are not masked. Make any corrections to the mask and click "Save to node". Queue another generation, and the workflow will complete the remaining phases.

To make a new image, click "New Fixed Random" in the "Seed - All phases" node before queueing another generation.

Tips:

  • "Schedule steps" is the total number of steps used for all phases. This should be at least 40; I recommend 50.
  • "Phase 1 steps proportion" ranges from 0 to 1 and controls the number of steps in Phase 1. Higher numbers ensure the image composition more closely matches a hypothetical image generated purely using the Flux guidance value for Phase 1, but at the cost of fewer steps in Phases 2 and 3 to impact the masked region. 0.24 seems to work well; for 50 schedule steps this gives 0.24 * 50 = 12 steps for Phase 1.
  • "Flux guidance - Phase 1" should be at least 3.5 for good prompt adherence, well-formed composition of all objects in the image, aesthetic colour saturation and good likeness when using character Loras.
  • You may need to experiment with "Flux guidance (masked) - Phases 2/3" settings to work well with your choice of checkpoint and style Lora, if any.
  • Latent noise is added to the masked region at the start of Phases 2 and 3. The noise strengths can be adjusted in the "Inject noise - Phase 2/3" nodes to vary the level of skin blemishes added.
  • To skip mask editing and use the automatically generated mask each time, click on "block" in the "Edit skin mask" node to select "never".
  • Consider excluding fingers or fingertips from the mask, particularly small ones. Images of fingers and small objects at lower Flux guidance are often posed incorrectly or crumble into a chaotic mess.
  • Feel free to change the sampler and scheduler. I find deis / ddim_uniform works well, as it converges sufficiently for Phase 1.
  • After completing all phases to generate a final image, you may fine-tune the mask by pasting the final image into the "Preview Bridge - Phase 1" node. To do this, right-click on "Preview Image - Powder" (right of this node group) and select "Copy (Clipspace)". Then right-click on "Preview Bridge - Phase 1" and select "Paste (Clipspace)". Queue a generation for a mask to be automatically generated and edit the mask as before. Then, queue another generation to restart the process from Phase 2.
  • Images should be larger than 1 megapixel in area for good results. I often use 1.6 megapixels.
  • Consider using a finetuned checkpoint. I find Acorn is Spinning gives good realistic results.
  • Use Powder as a first step in a larger workflow. Powder is not designed to generate final completed images.
  • Not every image can be improved satisfactorily. Sometimes a base image will be so saturated or lacking detail that it cannot be salvaged. Just reroll and try again!
225
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/Total-Resort-3120 on 2024-10-02 20:31:46+00:00.
