This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/ninjasaid13 on 2024-09-30 05:21:42+00:00.


Paper: (pdf link is broken for some reason)

Project Page:

Code:

Model: (Apache License for all models) and the vision tokenizer

Disclaimer: I am not the author.

Overview

While next-token prediction is considered a promising path towards AGI, it has struggled to excel in multimodal tasks, which are still dominated by diffusion models (e.g., Stable Diffusion) and compositional approaches (e.g., CLIP combined with LLMs). In this work, we introduce Emu3, a new suite of state-of-the-art multimodal models trained solely with next-token prediction. By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences.
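As a rough illustration of that recipe (a minimal PyTorch sketch, not the authors' code; the vocabulary split, model size, and toy data below are assumptions), text tokens and discretized vision tokens can share one vocabulary, and a single causal transformer is trained with an ordinary next-token cross-entropy loss over the mixed sequence:

```python
# Minimal sketch of next-token prediction over a mixed text+vision sequence.
# Vocabulary sizes, layer counts, and the toy batch are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB = 32_000      # hypothetical text-token range
VISION_VOCAB = 32_768    # hypothetical vision-codebook size
VOCAB = TEXT_VOCAB + VISION_VOCAB

class TinyCausalLM(nn.Module):
    def __init__(self, d_model=512, n_layers=4, n_heads=8, max_len=2048):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, ids):
        b, t = ids.shape
        x = self.tok(ids) + self.pos(torch.arange(t, device=ids.device))
        # Causal mask so each position only attends to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(t).to(ids.device)
        return self.head(self.blocks(x, mask=mask))

# One training step: a caption followed by its image's discrete codes,
# treated as a single sequence to be predicted left to right.
model = TinyCausalLM()
text_ids = torch.randint(0, TEXT_VOCAB, (1, 16))
image_ids = torch.randint(TEXT_VOCAB, VOCAB, (1, 64))  # offset into shared vocab
seq = torch.cat([text_ids, image_ids], dim=1)
logits = model(seq[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
loss.backward()
```

Generation and perception then differ only in which part of the sequence is supplied as context and which part the model is asked to predict.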

Examples

They introduce Emu3, a new suite of state-of-the-art multimodal models trained solely with next-token prediction. By tokenizing images, text, and videos into a discrete space, they train a single transformer from scratch on a mixture of multimodal sequences.

Emu3 excels in both generation and perception

Emu3 outperforms several well-established task-specific models in both generation and perception tasks, surpassing flagship open models such as SDXL, LLaVA-1.6 and OpenSora-1.2, while eliminating the need for diffusion or compositional architectures.


Video Generation

Emu3 is capable of generating videos. Unlike Sora, which employs a video diffusion model to generate the video from noise, Emu3 simply generates a video causally by predicting the next token in a video sequence.
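To make the contrast with diffusion concrete, here is a hedged sampling sketch (the `model` callable, tokens-per-frame count, and vocabulary size are illustrative assumptions, not Emu3's actual values): frames are just successive runs of discrete vision tokens, emitted left to right and decoded afterwards by the vision tokenizer.

```python
# Sketch of causal video generation: instead of denoising a whole clip
# from noise, tokens are sampled one at a time and later decoded into
# frames. TOKENS_PER_FRAME, VOCAB, and `model` are stand-ins.
import torch

TOKENS_PER_FRAME = 256   # assumed number of discrete codes per frame
VOCAB = 64_768           # assumed shared text+vision vocabulary size

@torch.no_grad()
def generate_video_tokens(model, prompt_ids, n_frames, temperature=1.0):
    """Autoregressively append vision tokens until n_frames are covered."""
    seq = prompt_ids
    for _ in range(n_frames * TOKENS_PER_FRAME):
        logits = model(seq)[:, -1, :] / temperature       # next-token logits
        next_id = torch.multinomial(logits.softmax(-1), 1)
        seq = torch.cat([seq, next_id], dim=1)
    video = seq[:, prompt_ids.shape[1]:]                   # strip the prompt
    return video.view(1, n_frames, TOKENS_PER_FRAME)

# Stand-in model so the loop runs end to end; a trained transformer
# (e.g. the sketch in the Overview section) would go here instead.
dummy = lambda ids: torch.randn(ids.shape[0], ids.shape[1], VOCAB)
clip = generate_video_tokens(dummy, torch.randint(0, VOCAB, (1, 16)), n_frames=2)
```

The same loop covers the video-prediction setting described next: condition on the tokens of an existing clip instead of a text prompt and keep sampling.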

Video Prediction

With a video in context, Emu3 can naturally extend the video and predict what will happen next. The model can simulate some aspects of the environment, people and animals in the physical world.

Vision-Language Understanding

Emu3 demonstrates strong perception capabilities to understand the physical world and provides coherent text responses. Notably, this capability is achieved without relying on a CLIP encoder or a pretrained LLM.
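A sketch of how understanding can use the same interface (hypothetical token IDs and end-of-answer token; not the released API): the image's discrete codes sit in context ahead of the question, and the answer is decoded greedily from the same transformer, with no CLIP encoder or separate LLM involved.

```python
# Sketch of vision-language understanding with one autoregressive model:
# discrete image codes + question tokens go in, answer tokens come out.
# All token IDs, including EOS_ID, are illustrative assumptions.
import torch

EOS_ID = 2   # assumed end-of-answer token

@torch.no_grad()
def answer(model, image_ids, question_ids, max_new_tokens=64):
    seq = torch.cat([image_ids, question_ids], dim=1)  # image codes, then text
    for _ in range(max_new_tokens):
        next_id = model(seq)[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy
        seq = torch.cat([seq, next_id], dim=1)
        if next_id.item() == EOS_ID:
            break
    # Return only the newly generated answer tokens.
    return seq[:, image_ids.shape[1] + question_ids.shape[1]:]
```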
