StableDiffusion

98 readers
1 user here now

/r/StableDiffusion is back open after the protest of Reddit killing open API access, which will bankrupt app developers, hamper moderation, and...

founded 1 year ago
MODERATORS
376
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/lazyspock on 2024-09-21 23:28:34+00:00.


I've been using (and struggling with) the original Flux Dev FP16 since launch. It works on my 3060 12GB, but it takes 2 to 3 minutes to generate an image without any LoRA, makes the computer unusable for anything else during generation, and it's even worse with bigger LoRAs (when it needs to reload the model for each generation). But that was the price to pay for being able to use Flux Dev, right?

Wrong. After this marvelous post from u/Lory1998 (thanks again!), I decided to test Flux-Dev-Q5_1.gguf as he suggested and, man, what a difference! Now I can generate images considerably faster even with two LoRAs, as the model fits entirely in my VRAM. The model never has to reload as long as I don't change the checkpoint, and even the LoRAs load in an instant. Also, I can use my computer for other non-intensive things like YouTube, Reddit, etc. while generating, without Windows almost choking and without making the generation slower. And the best part is that there are no discernible quality differences in the generated images.

So, if you're also a 12GB VRAM person, try it. It's worth it.
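
For a rough sense of why the quantized file fits: the Flux Dev transformer has roughly 12B parameters, and a back-of-the-envelope estimate per quantization level (the bits-per-weight figures below are approximate, and real GGUF files differ a bit because of block scales and layers kept at higher precision) looks like this:

```python
# Back-of-the-envelope VRAM estimate for a ~12B-parameter transformer.
# Bits-per-weight values are approximate GGUF figures; actual file sizes
# vary because some tensors stay at higher precision.
PARAMS = 12e9

formats = {
    "FP16": 16.0,
    "Q8_0": 8.5,   # 8-bit weights + per-block scale
    "Q5_1": 6.0,   # 5-bit weights + per-block scale/offset
    "Q4_0": 4.5,
}

for name, bits in formats.items():
    gib = PARAMS * bits / 8 / 1024**3
    verdict = "fits" if gib < 12 else "does not fit"
    print(f"{name:>5}: ~{gib:.1f} GiB -> {verdict} in 12 GB of VRAM (before activations and LoRAs)")
```

The FP16 weights alone come out around 22 GiB, which is why the full model spills out of a 12 GB card and keeps reloading, while Q5_1 lands around 8-9 GiB and stays resident.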

377
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/ComprehensiveHand515 on 2024-09-21 22:51:32+00:00.


We’re launching ComfyAI.run, a cloud platform that allows you to run ComfyUI quickly from anywhere without the need to set up your own GPU machines. This is our first Alpha release.

Serverless Example Workflows: SD, SD with ControlNet, Flux

Key Features:

  • Access from anywhere: Just click the link to launch ComfyUI and start creating immediately.
  • No setup required: Get started right away without worrying about technical installations.
  • Free cloud GPUs: No need to manage your own local or cloud-based GPUs.
  • Shareable link to the cloud: Create a link for easy collaboration or sharing.

Alpha Version Limitations:

  • Supports only a limited number of SD15, SDXL, and Flux checkpoints.
  • Supports a limited number of Custom Nodes.
  • Free machine pools are shared. If many users are running jobs simultaneously, you may experience a wait time in the queue.

Goal:

We would like to enable anyone to participate in the image generation workflow with easy-to-access and shareable infrastructure.

Feedback

Feedback and suggestions are always welcome! I'm sharing this to gather your input. Since it's still early, feel free to share any feature requests you may have.

378
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/renderartist on 2024-09-21 18:30:21+00:00.

379
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/VirusCharacter on 2024-09-21 20:58:17+00:00.


These are the only scheduler/sampler combinations worth the time with Flux-dev-fp8. I'm sure the other checkpoints will get similar results, but that is up to someone else to spend their time on 😎

I have removed the sampler/scheduler combinations so they don't take up valuable space in the table.

🟢=Good 🟡= Almost good 🔴= Really bad!

Here I have compared all sampler/scheduler combinations by speed for flux-dev-fp8, and it's apparent that the scheduler doesn't change speed much, but the sampler does. The fastest ones are DPM++ 2M and Euler, and the slowest one is HeunPP2.

Percentage speed differences between sampler/scheduler combinations

From the following analysis it's clear that the Beta scheduler consistently delivers the best images across samplers. The runner-up is the Normal scheduler!

  • SGM Uniform: This scheduler consistently produced clear, well-lit images with balanced sharpness. However, the overall mood and cinematic quality were often lacking compared to the other schedulers. It's great for crispness and technical accuracy but doesn't add much dramatic flair.
  • Simple: The Simple scheduler performed adequately but didn't excel in either sharpness or atmosphere. The images had good balance, but the results were often less vibrant or dynamic. It's a solid, consistent performer without any extremes in quality or mood.
  • Normal: The Normal scheduler frequently produced vibrant, sharp images with good lighting and atmosphere. It was one of the stronger performers, especially in creating dynamic lighting, particularly in portraits and scenes involving cars. It's a solid choice for a balance of mood and clarity.
  • DDIM: DDIM was strong in atmospheric and cinematic results, but it often came at the cost of sharpness. The mood it created, especially in scenes with fog or dramatic lighting, was a strong point. However, if you prioritize sharpness and fine detail, DDIM occasionally fell short.
  • Beta: Beta consistently delivered the best overall results. The lighting was dynamic, the mood was cinematic, and the details remained sharp. Whether it was the portrait, the orange, the fisherman, or the SUV scenes, Beta created images that were both technically strong and atmospherically rich. It’s clearly the top performer across the board.

When it comes to which sampler is best, it's not as clear-cut, mostly because it's in the eye of the beholder. I believe this should be enough guidance to know what to try. If not, you can go through the tiled images yourself and be the judge 😉

PS. I don't get reddit... I uploaded all the tiled images and it looked like it worked, but when posting, they are gone. Sorry 🤔😥
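
If you want to rerun this kind of grid yourself, the sweep is easy to script against a local ComfyUI instance. A minimal sketch, assuming a workflow exported with "Save (API Format)" to flux_workflow_api.json and a KSampler node with id "3" (both assumptions; adjust to your own workflow), using ComfyUI's /prompt endpoint:

```python
import copy
import json
import urllib.request

# Sweep sampler/scheduler combinations against a local ComfyUI server.
# Assumes the workflow was exported via "Save (API Format)" and that the
# KSampler node has id "3" -- change these for your own setup.
SAMPLERS = ["euler", "dpmpp_2m", "heunpp2"]
SCHEDULERS = ["normal", "simple", "sgm_uniform", "ddim_uniform", "beta"]

with open("flux_workflow_api.json") as f:
    base = json.load(f)

for sampler in SAMPLERS:
    for scheduler in SCHEDULERS:
        wf = copy.deepcopy(base)
        wf["3"]["inputs"]["sampler_name"] = sampler
        wf["3"]["inputs"]["scheduler"] = scheduler
        req = urllib.request.Request(
            "http://127.0.0.1:8188/prompt",
            data=json.dumps({"prompt": wf}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            print(sampler, scheduler, resp.status)  # 200 means the job was queued
```

Keep the seed fixed in the workflow so the only thing changing between images is the sampler/scheduler pair.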

380
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/HonorableFoe on 2024-09-21 20:56:13+00:00.

Original Title: My comfyui Cog video workflow with adtailer using the fun_5b model, with some examples of outputs. You need to really dive in with some prompting, describing clothing and objects being held helps a lot too. Comfy workflow in the comments.

381
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/jenza1 on 2024-09-21 14:35:23+00:00.

382
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/fpgaminer on 2024-09-21 18:37:01+00:00.


This is an update and follow-up to my previous post (). To recap, JoyCaption is being built from the ground up as a free, open, and uncensored captioning VLM model for the community to use in training Diffusion models.

  • Free and Open: It will be released for free, open weights, no restrictions, and just like bigASP, will come with training scripts and lots of juicy details on how it gets built.
  • Uncensored: Equal coverage of SFW and NSFW concepts. No "cylindrical shaped object with a white substance coming out on it" here.
  • Diversity: All are welcome here. Do you like digital art? Photoreal? Anime? Furry? JoyCaption is for everyone. Pains are being taken to ensure broad coverage of image styles, content, ethnicity, gender, orientation, etc.
  • Minimal filtering: JoyCaption is trained on large swathes of images so that it can understand almost all aspects of our world. almost. Illegal content will never be tolerated in JoyCaption's training.

The Demo

WARNING ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ This is a preview release, a demo, alpha, highly unstable, not ready for production use, not indicative of the final product, may irradiate your cat, etc.

JoyCaption is still under development, but I like to release early and often to garner feedback, suggestions, and involvement from the community. So, here you go!

What's New

Wow, it's almost been two months since the Pre-Alpha! The comments and feedback from the community have been invaluable, and I've spent the time since then working to improve JoyCaption and bring it closer to my vision for version one.

  • First and foremost, based on feedback, I expanded the dataset in various directions to hopefully improve: anime/video game character recognition, classic art, movie names, artist names, watermark detection, male nsfw understanding, and more.
  • Second, and perhaps most importantly, you can now control the length of captions JoyCaption generates! You'll find in the demo above that you can ask for a number of words (20 to 260 words), a rough length (very short to very long), or "Any", which gives JoyCaption free rein.
  • Third, you can now control whether JoyCaption writes in the same style as the Pre-Alpha release, which is very formal and clinical, or a new "informal" style, which will use such vulgar and non-Victorian words as "dong" and "chick".
  • Fourth, there are new "Caption Types" to choose from. "Descriptive" is just like the pre-alpha, purely natural language captions. "Training Prompt" will write random mixtures of natural language, sentence fragments, and booru tags, to try and mimic how users typically write Stable Diffusion prompts. It's highly experimental and unstable; use with caution. "rng-tags" writes only booru tags. It doesn't work very well; I don't recommend it. (NOTE: "Caption Tone" only affects "Descriptive" captions.)

The Details

It has been a grueling month. I spent the majority of the time manually writing 2,000 Training Prompt captions from scratch to try and get that mode working. Unfortunately, I failed miserably. JoyCaption Pre-Alpha was turning out to be quite difficult to fine-tune for the new modes, so I decided to start back at the beginning and massively rework its base training data to hopefully make it more flexible and general. "rng-tags" mode was added to help it learn booru tags better. Half of the existing captions were re-worded into "informal" style to help the model learn new vocabulary. 200k brand new captions were added with varying lengths to help it learn how to write more tersely. And I added a LoRA on the LLM module to help it adapt.

The upshot of all that work is the new Caption Length and Caption Tone controls, which I hope will make JoyCaption more useful. The downside is that none of that really helped Training Prompt mode function better. The issue is that, in that mode, it will often go haywire and spiral into a repeating loop. So while it kinda works, it's too unstable to be useful in practice. 2k captions is also quite small and so Training Prompt mode has picked up on some idiosyncrasies in the training data.

That said, I'm quite happy with the new length conditioning controls on Descriptive captions. They help a lot with reducing the verbosity of the captions. And for training Stable Diffusion models, you can randomly sample from the different caption lengths to help ensure that the model doesn't overfit to a particular caption length.
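
As a purely illustrative sketch of that random-sampling idea (the field names and length buckets here are made up for the example; adapt them to however you store your JoyCaption outputs):

```python
import random

# Hypothetical example: pick a random caption-length variant per training
# sample so the diffusion model doesn't overfit to one caption length.
def pick_caption(captions_by_length: dict[str, str]) -> str:
    """captions_by_length maps a length bucket (e.g. 'very short', 'long')
    to a caption generated at that length."""
    bucket = random.choice(list(captions_by_length))
    return captions_by_length[bucket]

example = {
    "very short": "A red fox in snow.",
    "long": "A red fox stands alert in deep snow at dusk, its breath visible "
            "in the cold air, with bare birch trees blurred in the background.",
}
print(pick_caption(example))
```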

Caveats

As stated, Training Prompt mode is still not working very well, so use with caution. rng-tags mode is mostly just there to help expand the model's understanding, I wouldn't recommend actually using it.

Informal style is ... interesting. For training Stable Diffusion models, I think it'll be helpful because it greatly expands the vocabulary used in the captions. But I'm not terribly happy with the particular style it writes in. It very much sounds like a boomer trying to be hip. Also, the informal style was made by having a strong LLM rephrase half of the existing captions in the dataset; they were not built directly from the images they are associated with. That means that the informal style captions tend to be slightly less accurate than the formal style captions.

And the usual caveats from before. I think the dataset expansion did improve some things slightly like movie, art, and character recognition. OCR is still meh, especially on difficult to read stuff like artist signatures. And artist recognition is ... quite bad at the moment. I'm going to have to pour more classical art into the model to improve that. It should be better at calling out male NSFW details (erect/flaccid, circumcised/uncircumcised), but accuracy needs more improvement there.

Feedback

Please let me know what you think of the new features, if the model is performing better for you, or if it's performing worse. Feedback, like before, is always welcome and crucial to me improving JoyCaption for everyone to use.

383
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/stockimgai on 2024-09-21 16:43:51+00:00.

384
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/randomvariable56 on 2024-09-21 14:53:54+00:00.

385
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/diogodiogogod on 2024-09-21 14:04:04+00:00.


My Civitai article:

So, Flux is great with prompt adherence, right? Right…

but writing directions can be tricky for the model. How would Flux interpret "A full body man with a watch on his right wrist"? It will most probably output a man in front view, with the watch on his LEFT wrist, positioned on the RIGHT side of the image. That's not what we asked for.

"Full body shot of a man with a watch on his right wrist" 0 out of 2 here

Sometimes Flux gets it right, but often it doesn’t. And that’s mostly because of how we write our prompts.

A warning first: this is in no way perfect. Based on my experimentation, it helps, but it won't be 100%.

Describing body parts from the character's perspective (like "his left") leads to confusion. Instead, it's better to use the image's perspective. For example, say "on the left side" instead of "his left". Adding "side" helps the model a lot. You can also reference specific areas of the image, like "on the left bottom corner", "on the top-left corner", "on the center", or "on the bottom" of the image.

"Full body shot of a man with a watch on his wrist on the left side" 0.5 out of 2, getting there

NEVER use "his right X body part", ever. "On the left" is already way better than "on his left", but it still generates a lot of wrong perspectives. More recently I have been experimenting with removing "him/her" from the prompt completely, and I think it works even better.

"Full body shot of a man with a watch on the wrist on the left side" 1 out of 2, better.

Another example would be:

"A warrior man from behind, climbing stepping up a stone. The leg on the left side is extended down, the leg on the right is bent at the knee. He is wearing a magical glowing green bracelet on the hand on the left side. The hand on the right side is holding the sword vertically upward. The background is the entrance of a magical dark cave, with multiple glowing red neon lights on the top-right side corner inside the cave resembling eyes."

Definitely not all is correct. But it's more consistent.

For side views, when both body parts are on the same side, you can use foreground and background to clarify:

A photo of man in side view wearing an orange tank top and green shorts. He is touching a brick wall arching, leaning forward to the left side. His hand on the background is up touching the wall on the left side. His hand in the foreground is hanging down on the left side.

This is way more inconsistent. It's a hit-and-miss most of the time.

Using these strategies, Flux performs better for inference. But what about training with auto captions like Joy Caption?

There has been a trend of claiming the model doesn't need captions, but I still don't buy it. For simple objects or faces, trigger words might be enough, but for complex poses or anatomy, captions still seem important. I haven't tested enough, though, so I could be wrong.

With the help of ChatGPT I created a script that updates all the text files in a folder to the format I mentioned. It's not perfect, but you can tweak it or ask ChatGPT for more body-part examples (I also just recently added "to" instead of only "on").

https://github.com/diodiogod/Search-Replace-Body-Pos
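
The actual script is in the repo above; as a rough illustration of the kind of substitution it performs (a simplified sketch, not the repo code: the regex only covers a few body parts, the "captions" folder name is an assumption, and the marker string is the one mentioned later in this post):

```python
import re
from pathlib import Path

# Simplified sketch of the substitution idea: rewrite character-perspective
# phrases like "his left wrist" into image-perspective phrasing
# ("the wrist on the left side") and append a marker so every change
# can be reviewed by hand afterwards.
MARKER = " <###---------####>"
PATTERN = re.compile(
    r"\b(his|her|its)\s+(left|right)\s+(wrist|arm|hand|leg|knee|foot|shoulder|eye)\b",
    re.IGNORECASE,
)

def rewrite(text: str) -> str:
    def repl(m: re.Match) -> str:
        _, side, part = m.groups()
        return f"the {part.lower()} on the {side.lower()} side{MARKER}"
    return PATTERN.sub(repl, text)

for txt in Path("captions").glob("*.txt"):
    txt.write_text(rewrite(txt.read_text(encoding="utf-8")), encoding="utf-8")
```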

A simpler and faster option would be to just add "side" after "right/left". But it would still be ambiguous. For example, "her left side arm" might mean her side, not the image's side. So you need to include all the prepositions, like "on the left leg" > "on the leg on the left side", "on his left X" > "on his X on the left side", etc.

But another big problem is that Joy Caption and all the other auto captioners are very inconsistent. They often get left and right wrong, probably because of the perspective problem I mentioned. So it's kind of essential to check manually… That's why the script adds a marker after each substitution, so I can easily find and verify them. You can then search and replace that string with Taggui, Notepad++ or another tool.

But manually switching left and right can be tedious. So I built another tool to make it easier: a floating box for fast text swaps. I organize my window so I can manually check each text file, spot substitutions, and easily swap "left side" and "right side".

https://github.com/diodiogod/Floating-FAST-Text-Swapper
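
The repo above has the GUI; the core of the swap itself is just a three-step replace through a temporary placeholder so the two phrases don't overwrite each other, roughly:

```python
def swap_sides(text: str) -> str:
    # Swap "left side" <-> "right side" via a placeholder so the first
    # replacement doesn't get clobbered by the second.
    placeholder = "\0SIDE\0"
    text = text.replace("left side", placeholder)
    text = text.replace("right side", "left side")
    return text.replace(placeholder, "right side")

print(swap_sides("a watch on the wrist on the left side"))
# -> a watch on the wrist on the right side
```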

Using the preview panel, I would organize my window just like this: manually clicking on every txt file, I could easily spot in the preview panel any file that had a substitution by looking for the <###---------####> marker and check whether it was correct. If not, I could drag the txt file over and easily swap "left side" <> "right side".

This process isn’t perfect, and you’ll still need to do some manual edits.

But anyway, that’s it. Hope this can help anyone with their captions, or just with their prompt writing.

386
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/fab1an on 2024-09-21 13:35:29+00:00.

387
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/cgpixel23 on 2024-09-21 10:16:05+00:00.

388
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/3dmindscaper2000 on 2024-09-21 06:27:05+00:00.

389
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/stbl_reel on 2024-09-21 06:10:02+00:00.

390
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/Angrypenguinpng on 2024-09-20 19:07:26+00:00.

391
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/ol_barney on 2024-09-20 16:09:32+00:00.

392
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/Glass-Caterpillar-70 on 2024-09-20 14:09:10+00:00.

393
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/mardy_grass on 2024-09-20 18:12:46+00:00.

394
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/tintwotin on 2024-09-20 16:45:25+00:00.

395
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/rolux on 2024-09-20 16:24:18+00:00.

396
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/cocktail_peanut on 2024-09-20 15:52:30+00:00.

397
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/jjjnnnxxx on 2024-09-20 12:18:58+00:00.

398
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/zazaoo19 on 2024-09-20 04:01:52+00:00.

399
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/theninjacongafas on 2024-09-20 11:38:06+00:00.

400
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/mrfofr on 2024-09-20 10:14:56+00:00.
