StableDiffusion

251
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/OkSpot3819 on 2024-10-01 09:22:59+00:00.


  • Interesting find of the week: Kat, an engineer, built a tool for visualizing time-based media with gestures.
  • Flux updates:
    • Outpainting: ControlNet Outpainting using FLUX.1 Dev in ComfyUI demonstrated, with workflows provided for implementation.
    • Fine-tuning: Flux fine-tuning can now be performed with 10GB of VRAM, making it more accessible to users with mid-range GPUs.
    • Quantized model: Flux-Dev-Q5_1.gguf quantized model significantly improves performance on GPUs with 12GB VRAM, such as the NVIDIA RTX 3060.
    • New Controlnet models: New depth, upscaler, and surface normals models released for image enhancement in Flux.
    • CLIP and Long-CLIP models: Fine-tuned versions of the CLIP-L and Long-CLIP models are now fully integrated with the Hugging Face Diffusers pipeline (see the sketch after this list).
  • James Cameron joins Stability AI: Renowned filmmaker James Cameron has joined Stability AI's Board of Directors, bringing his expertise in merging cutting-edge technology with storytelling to the AI company.
  • Put This On Your Radar:
    • MIMO: Controllable character video synthesis model for creating realistic character videos with controllable attributes.
    • Google's Zero-Shot Voice Cloning: New technique that can clone voices using just a few seconds of audio sample.
    • Leonardo AI's Image Upscaling Tool: New high-definition image enlargement feature rivaling existing tools like Magnific.
    • PortraitGen: AI portrait video editing tool enabling multi-modal portrait editing, including text-based and image-based effects.
    • FaceFusion 3.0.0: Advanced face swapping and editing tool with new features like "Pixel Boost" and face editor.
    • CogVideoX-I2V Workflow Update: Improved image-to-video generation in ComfyUI with better output quality and efficiency.
    • Ctrl-X: New tool for image generation with structure and appearance control, without requiring additional training or guidance.
    • Invoke AI 5.0: Major update to open-source image generation tool with new features like Control Canvas and Flux model support.
    • JoyCaption: Free and open uncensored vision-language model (Alpha One release) for captioning images used to train diffusion models.
    • ComfyUI-Roboflow: Custom node for image analysis in ComfyUI, integrating Roboflow's capabilities.
    • Tiled Diffusion with ControlNet Upscaling: Workflow for generating high-resolution images with fine control over details in ComfyUI.
    • 2VEdit: Video editing tool that transforms entire videos by editing just the first frame.
    • Flux LoRA showcase: New FLUX LoRA models including Simple Vector Flux, How2Draw, Coloring Book, Amateur Photography v5, Retro Comic Book, and RealFlux 1.0b.
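
For the CLIP and Long-CLIP item above, here is a minimal sketch of what loading a fine-tuned CLIP-L text encoder into a Flux pipeline looks like with the Diffusers component-override pattern. The fine-tuned checkpoint id below is a placeholder; the newsletter's links point to the actual repo.

```python
# Minimal sketch: swapping a fine-tuned CLIP-L text encoder into FLUX.1 Dev
# via Diffusers. "your-org/fine-tuned-clip-l" is a placeholder repo id.
import torch
from transformers import CLIPTextModel
from diffusers import FluxPipeline

clip = CLIPTextModel.from_pretrained(
    "your-org/fine-tuned-clip-l", torch_dtype=torch.bfloat16
)

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder=clip,  # the CLIP-L slot; text_encoder_2 is the T5 encoder
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe("a watercolor fox in a misty forest",
             num_inference_steps=28).images[0]
image.save("fox.png")
```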

📰 Full newsletter with relevant links, context, and visuals available in the original document.

🔔 If you're having a hard time keeping up in this domain, consider subscribing. We send out our newsletter every Sunday.

252
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/Opening-Ad5541 on 2024-10-01 07:51:06+00:00.

253
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/formalsystem on 2024-10-01 03:43:29+00:00.

254
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/EldrichArchive on 2024-10-01 01:54:45+00:00.

255
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/MikirahMuse on 2024-09-30 18:33:51+00:00.

256
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/urgettingtallpip on 2024-09-30 22:33:03+00:00.


Has anybody heard of new Flux ControlNets being trained or coming out soon? The current ones released by XLabs and InstantX feel mediocre at best.

257
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/jenza1 on 2024-09-30 15:07:47+00:00.

258
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/Angrypenguinpng on 2024-09-30 21:58:03+00:00.

259
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/Striking-Long-2960 on 2024-09-30 19:51:34+00:00.


New versions of CogVideoX-Fun 5B and 2B have been released, including a new model that I believe is intended for animating humans.

  • [2024.09.29] Retrained the i2v model and added noise to increase the motion amplitude of the video. Uploaded the control-model training code and the control model.

5B

2B

The ComfyUI custom node CogVideoXWrapper has initial support for these new models.
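
For readers who prefer plain Diffusers over ComfyUI, here is a rough sketch of the image-to-video flow using the base CogVideoX-5b-I2V pipeline. One assumption to flag: CogVideoX-Fun ships its own pipeline code, so this shows the general shape of the i2v call, not the Fun variant's API.

```python
# Sketch of CogVideoX image-to-video with Diffusers, using the base
# THUDM/CogVideoX-5b-I2V checkpoint (CogVideoX-Fun's own pipeline differs).
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # trades speed for much lower VRAM use

image = load_image("first_frame.png")
frames = pipe(
    prompt="a person waving at the camera, smooth natural motion",
    image=image,
    num_frames=49,
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]
export_to_video(frames, "output.mp4", fps=8)
```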

260
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/idunno63 on 2024-09-30 15:47:13+00:00.

261
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/theroom_ai on 2024-09-30 12:28:23+00:00.

262
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/woadwarrior on 2024-09-30 13:39:26+00:00.

263
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/gpahul on 2024-09-30 12:39:16+00:00.

264
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/Halodri88 on 2024-09-30 11:23:12+00:00.

265
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/hackerzcity on 2024-09-30 04:39:03+00:00.


This model was trained on large numbers of artificially damaged images (noise, blur, compression artifacts, and the like). By learning to undo those degradations, it can turn blurry pictures into clearer ones.
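
As a concrete illustration of the training-data side, here is a small sketch of how such (damaged, clean) pairs are typically synthesized: random blur, noise, and a JPEG round-trip applied to clean images. The exact degradation recipe of this particular model may differ.

```python
# Illustrative degradation pipeline for building (damaged, clean) training
# pairs for an image-restoration model. Parameters are arbitrary examples.
import io
import random
import numpy as np
from PIL import Image, ImageFilter

def degrade(img: Image.Image) -> Image.Image:
    # Random Gaussian blur
    img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.5, 3.0)))
    # Additive Gaussian noise
    arr = np.asarray(img).astype(np.float32)
    arr += np.random.normal(0.0, random.uniform(2.0, 15.0), arr.shape)
    img = Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
    # JPEG round-trip at a random low quality setting
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=random.randint(20, 70))
    return Image.open(io.BytesIO(buf.getvalue())).convert("RGB")

clean = Image.open("clean.png").convert("RGB")
damaged = degrade(clean)  # network input; `clean` is the training target
```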

266
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/an303042 on 2024-09-30 08:31:46+00:00.

267
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/Jonfreakr on 2024-09-30 08:17:15+00:00.

268
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/ninjasaid13 on 2024-09-30 05:21:42+00:00.


Paper: (pdf link is broken for some reason)

Project Page:

Code:

Model: (Apache License for all models) and the vision tokenizer

Disclaimer: I am not the author.

Overview

While next-token prediction is considered a promising path towards AGI, it has struggled to excel in multimodal tasks, which are still dominated by diffusion models (e.g., Stable Diffusion) and compositional approaches (e.g., CLIP combined with LLMs). In this work, we introduce Emu3, a new suite of state-of-the-art multimodal models trained solely with next-token prediction. By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences.
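
To make that concrete, here is a toy sketch of the core recipe: text tokens and vision-tokenizer codes share one flat vocabulary, get concatenated into a single sequence, and a causal transformer is trained with ordinary next-token cross-entropy. Sizes and architecture are illustrative, not the paper's.

```python
# Toy next-token prediction over a mixed text+image token sequence.
# Vocabulary sizes, model width, and depth are made up for illustration;
# positional encodings are omitted for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB, VISION_VOCAB = 32_000, 16_384
VOCAB = TEXT_VOCAB + VISION_VOCAB  # one shared discrete vocabulary

class TinyCausalLM(nn.Module):
    def __init__(self, d=512):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d)
        layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d, VOCAB)

    def forward(self, tokens):
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.shape[1])
        return self.head(self.blocks(self.embed(tokens), mask=causal))

# A multimodal sequence: text tokens, then image codes offset by TEXT_VOCAB
# so the two vocabularies never collide.
text = torch.randint(0, TEXT_VOCAB, (1, 16))
image = torch.randint(0, VISION_VOCAB, (1, 64)) + TEXT_VOCAB
seq = torch.cat([text, image], dim=1)

logits = TinyCausalLM()(seq)
loss = F.cross_entropy(logits[:, :-1].reshape(-1, VOCAB),
                       seq[:, 1:].reshape(-1))
```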

Examples

They introduce Emu3, a new suite of state-of-the-art multimodal models trained solely with next-token prediction. By tokenizing images, text, and videos into a discrete space, they train a single transformer from scratch on a mixture of multimodal sequences.

Emu3 excels in both generation and perception

Emu3 outperforms several well-established task-specific models in both generation and perception tasks, surpassing flagship open models such as SDXL, LLaVA-1.6 and OpenSora-1.2, while eliminating the need for diffusion or compositional architectures.

Video Generation

Emu3 is capable of generating videos. Unlike Sora, which employs a video diffusion model to generate video from noise, Emu3 simply generates a video causally by predicting the next token in a video sequence.

Video Prediction

With a video in context, Emu3 can naturally extend the video and predict what will happen next. The model can simulate some aspects of the environment, people and animals in the physical world.

Vision-Language Understanding

Emu3 demonstrates strong perception capabilities to understand the physical world and provides coherent text responses. Notably, this capability is achieved without depending on a CLIP and a pretrained LLM.

269
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/Pretend_Potential on 2024-09-30 02:52:22+00:00.


Just going to post the link to the news article rather than quote the entire article.

270
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/stupidxthrowaway on 2024-09-30 01:05:21+00:00.

271
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/myf4pacc0unt on 2024-09-29 21:06:37+00:00.

272
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/nootropicMan on 2024-09-29 21:47:40+00:00.

273
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/Devajyoti1231 on 2024-09-29 08:47:42+00:00.

274
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/os75 on 2024-09-29 02:53:10+00:00.


Hey guys, so I've been working on this thing I'm calling lorakit. It's just a little toolkit I threw together for training SDXL LoRA models. It's heavily based on DreamBooth from AutoTrain, but with a configuration style similar to ai-toolkit. Nothing fancy, but it's been pretty handy for quick experiments and prototyping. Thought some of you might wanna check it out:
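
lorakit's own configuration isn't shown in this archive, but for context, here is roughly the Diffusers/PEFT plumbing that SDXL LoRA trainers like this typically build on. Rank, learning rate, and target modules below are illustrative defaults, not lorakit's actual settings.

```python
# Sketch of attaching trainable LoRA adapters to an SDXL UNet with
# Diffusers + PEFT; lorakit's real API and defaults may differ.
import torch
from diffusers import StableDiffusionXLPipeline
from peft import LoraConfig

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)

pipe.unet.requires_grad_(False)  # freeze the base weights
pipe.unet.add_adapter(LoraConfig(
    r=8,
    lora_alpha=8,
    init_lora_weights="gaussian",
    target_modules=["to_k", "to_q", "to_v", "to_out.0"],  # attention projections
))

# Only the injected LoRA parameters remain trainable.
params = [p for p in pipe.unet.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(params, lr=1e-4)
```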

275
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/smusamashah on 2024-09-29 13:06:03+00:00.
