This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/terminusresearchorg on 2024-08-30 21:30:30+00:00.


I'm sure it's all a big mystery how Fal is managing to train a 100-step LoRA in 5 minutes.

They've given precious few hints on Twitter so far:

  • Using H100 for training
  • Uses the GPUs "more efficiently", "among other changes"
  • "Totally new training process"

We considered a few different potential approaches, but the one that felt most eerily similar was HyperDreamBooth.

When you read the HyperDreamBooth paper, you discover that the main improvement it proposes is a hypernetwork that predicts better starting weights for a rank-1 LoRA. But this approach has its own limitations, chief among them that the hypernetwork itself has to be trained, which brings its own issues.
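
For intuition, the core idea can be sketched like this (a toy PyTorch sketch of the concept only; the layer sizes, names, and architecture here are invented and far simpler than the paper's actual design):

import torch
import torch.nn as nn

class Rank1LoRAPredictor(nn.Module):
    """Toy hypernetwork: maps a subject embedding (e.g. from a face/CLIP
    encoder) to rank-1 LoRA factors for one attention projection.
    HyperDreamBooth predicts such factors for many layers at once."""

    def __init__(self, embed_dim=512, in_features=3072, out_features=3072):
        super().__init__()
        self.to_a = nn.Sequential(nn.Linear(embed_dim, 1024), nn.SiLU(),
                                  nn.Linear(1024, in_features))
        self.to_b = nn.Sequential(nn.Linear(embed_dim, 1024), nn.SiLU(),
                                  nn.Linear(1024, out_features))

    def forward(self, subject_embedding):
        a = self.to_a(subject_embedding).unsqueeze(0)   # (1, in_features)
        b = self.to_b(subject_embedding).unsqueeze(-1)  # (out_features, 1)
        delta_w = b @ a                                 # rank-1 update to W
        return delta_w

predictor = Rank1LoRAPredictor()
delta_w = predictor(torch.randn(512))
print(delta_w.shape)  # torch.Size([3072, 3072])

The predicted weights are only an initialisation; the paper still runs a short fine-tune on top of them, which is where its speed comes from.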

Simply guessing isn't really enough for me, so I went ahead and contacted someone who had trained a model on their API. Upon viewing the weights, it looks like a normal LoRA. There is no training-config metadata saved into the safetensors file, which is the obvious place to look, so of course it's not there anymore.
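
If you want to poke at a LoRA file yourself, this is roughly how to check (the filename is a placeholder; the key naming assumes the usual Diffusers/PEFT conventions):

import os
from safetensors import safe_open

path = "downloaded_lora.safetensors"  # placeholder filename

with safe_open(path, framework="pt", device="cpu") as f:
    # Most trainers embed their config in the header metadata; here it was empty.
    print("metadata:", f.metadata())

    rank = None
    for key in f.keys():
        # Diffusers/PEFT keys look something like
        # "transformer.transformer_blocks.0.attn.to_q.lora_A.weight"
        if rank is None and ("lora_A" in key or "lora_down" in key):
            rank = f.get_tensor(key).shape[0]

print(f"file size: {os.path.getsize(path) / 1e6:.0f} MB, LoRA rank: {rank}")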

However, there's a notorious issue with Fal's API where error messages manage to reveal more information than the successful outputs do. I crafted a special image for training that passes image loading but then fails to process correctly, producing an error message that contained a link:

[removed for privacy reasons]

It's a 172M rank-16 LoRA in the Diffusers state-dict key format. But there was more...

At the bottom is the downloaded LoRA, and just above that you see the LoRA config and... what the hell? A folder of images?!

[removed for privacy reasons]

The filenames give it away: these are just photos of a random person from Facebook.

Here's the LoRA config:

{
    "images_data_url": "[removed for privacy reasons]",
    "trigger_word": "women danish 36y",
    "disable_captions": false,
    "disable_segmentation_and_captioning": false,
    "resolution_buckets": "512",
    "iter_multiplier": 1.0,
    "is_style": false,
    "is_input_format_already_preprocessed": false,
    "instance_prompt": "women danish 36y"
}

That's super interesting.

  • There is a resolution_buckets option, which is set to 512px here for some reason
  • The trigger phrase is women danish 36y
  • My requested LoRA is then trained as a continuation of this pre-existing one (see the sketch just below)
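
As far as I can tell, "continued from" means nothing fancier than loading that pre-existing LoRA as the adapter's initial weights before your own steps run. A rough sketch of what that mechanism looks like with diffusers + peft (this is my reconstruction, not Fal's code; the model ID, path, rank, and target modules are assumptions):

import torch
from diffusers import FluxTransformer2DModel
from safetensors.torch import load_file
from peft import LoraConfig, set_peft_model_state_dict

# The Flux transformer is the module the LoRA gets attached to.
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev", subfolder="transformer",
    torch_dtype=torch.bfloat16,
)
transformer.add_adapter(LoraConfig(
    r=16, lora_alpha=16,
    init_lora_weights="gaussian",
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
))

# Load the pre-trained "starter" LoRA and use it as the initial adapter weights.
# (Key prefixes may need remapping between the Diffusers and PEFT conventions.)
starter = load_file("starter_subject_lora.safetensors")  # placeholder path
set_peft_model_state_dict(transformer, starter, adapter_name="default")

# ...then train as normal: fresh optimiser, fresh RNG state, your own images.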

Who the heck is she?

A real person!

So, we did an experiment (we being JimmyCarter / uptightmoose, from my research group):

  • retrieve 4 images of a subject that Flux doesn't know (Dr. Seuss)
  • rent an A100-80G for $1/hr
  • train at a very high batch size (16) on those 4 images for 100 steps (at that batch size, each image is sampled roughly 400 times over the run)

The result:

a fully generalised Dr. Seuss

We didn't even use starting LoRA weights. I'm not sure why they do that; it seems entirely unnecessary. Too bad they didn't explain this fast LoRA training process themselves, leaving the rest of us to reverse engineer it and explain it to others.

The optimiser states and random states don't even seem to be reused from the initial LoRA - it's literally just using it as starting weights.

Cost comparison

Fal's training costs $2 for 2 minutes of runtime and produces LoRAs inferior to a full hour of training with a proper set of hyperparameters. At that rate their compute effectively costs about $60/hr, versus renting your own H100 for $3.99/hr. Heck, you can rent an MI300X on RunPod for $3.99/hr and get 192 GB of VRAM.

You can achieve similar results locally by using segmented training masks and a high batch size.
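
For the mask part, the idea is just to weight the denoising loss with a per-pixel subject mask so the background doesn't pull the LoRA around. A minimal sketch, assuming you already have masks from some off-the-shelf segmentation model (the 0.1 background floor is an arbitrary choice):

import torch
import torch.nn.functional as F

def masked_mse_loss(model_pred, target, subject_mask, background_weight=0.1):
    """MSE loss weighted by a per-pixel subject mask.

    model_pred, target: (B, C, H, W) tensors in latent space.
    subject_mask: (B, 1, h, w), 1.0 over the subject, 0.0 over background,
                  at any resolution (it gets resized to the latent grid here).
    """
    mask = F.interpolate(subject_mask, size=model_pred.shape[-2:], mode="nearest")
    # Keep a small weight on the background so it still gets some signal.
    weight = mask.clamp(min=background_weight).expand_as(model_pred)
    loss = (model_pred - target) ** 2 * weight
    return loss.sum() / weight.sum().clamp(min=1e-6)

# Example shapes only: a 16-image batch of 64x64 latents, 128x128 masks.
pred, tgt = torch.randn(16, 16, 64, 64), torch.randn(16, 16, 64, 64)
mask = (torch.rand(16, 1, 128, 128) > 0.5).float()
print(masked_mse_loss(pred, tgt, mask))

Combine that with an effective batch size of 16 (repeat your handful of images, or use gradient accumulation on a smaller card) and around 100 steps, and you're in the same ballpark.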

Hope this helps others and makes you worry less that you're missing out on some great new advancement.
