TAESD-encoded latents are too dark #4676
Oops, it's probably this (@sayakpaul 😅)
Yeah, I was side-eyeing that suspiciously.
I'm not sure what's meant by that comment. Maybe the encode-decode round-trip is best done by referring to @madebyollin's original notebook here:
Is there a round-trip in that notebook? I don't see any encoding.
@sayakpaul I think either @keturn's sample code needs to use a special preprocessor for AutoencoderTiny, or AutoencoderTiny needs to rescale its inputs itself. Right now the encoder is getting an image in [-1, 1], encoding the values that are in [0, 1] (above 50% brightness), and clamping the rest of the values (everything below 50% brightness) to black, which is why the decoder decodes a darkened image. (My fault for not following up on the review comment - sorry!)

@keturn Since a few people have now been (understandably 😅) tripped up by the way TAESD bakes in the SD-VAE input / output scale-shift transforms, I've added an example "Encoding / Decoding" notebook to hopefully clear things up. I also added (hopefully) clearer language to the README.
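To make the failure mode concrete, here is a toy illustration (not actual TAESD internals; variable names are made up) of what happens when weights trained on [0, 1] pixel values receive diffusers' zero-centered [-1, 1] images: everything below 50% brightness is crushed to black.

```python
import torch

# a five-step gradient in diffusers' [-1, 1] image convention
signed_image = torch.linspace(-1, 1, steps=5)

# an encoder expecting [0, 1] effectively loses the negative half of the range
as_seen_by_raw_taesd = signed_image.clamp(0, 1)
print(as_seen_by_raw_taesd)  # tensor([0.0000, 0.0000, 0.0000, 0.5000, 1.0000])
```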
It seems that it's available for […]. Add […]? But it seems that the workflow would be a little bit complicated if it needs switching between […].

Test code:

```python
import diffusers, torch
from PIL.Image import Image, open as image_open

device = torch.device("cuda:0")

# load TAESD and the reference SD 1.5 VAE
with torch.inference_mode():
    taesd = diffusers.AutoencoderTiny.from_pretrained("madebyollin/taesd", torch_dtype=torch.float16).to(device=device)
    vaesd = diffusers.AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae", variant="fp16", torch_dtype=torch.float16).to(device=device)

from diffusers.utils.testing_utils import load_image

image = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/versatile_diffusion/benz.jpg"
)

from diffusers.image_processor import VaeImageProcessor

# preprocess without normalization, so pixel values stay in [0, 1]
vae_processor = VaeImageProcessor(do_normalize=False)
image_tensor: torch.FloatTensor = vae_processor.preprocess(image).to(dtype=torch.float16, device=device)
print(f"image tensor range: {image_tensor.min()} < {image_tensor.mean()} < {image_tensor.max()}")

# encode with both autoencoders and compare the latent ranges
with torch.inference_mode():
    taesd_latents = taesd.encode(image_tensor).latents
    print(f"taesd-encoded latent range: {taesd_latents.min()} < {taesd_latents.mean()} (σ={taesd_latents.std()}) < {taesd_latents.max()}")
    vaesd_latents = vaesd.encode(image_tensor).latent_dist.sample()
    print(f"vaesd-encoded latent range: {vaesd_latents.min()} < {vaesd_latents.mean()} (σ={vaesd_latents.std()}) < {vaesd_latents.max()}")

# round-trip: decode the TAESD latents back into an image
with torch.inference_mode():
    redecoded_tensor = taesd.decode(taesd_latents).sample

redecoded_image = vae_processor.postprocess(redecoded_tensor, do_denormalize=[True])
display(image, redecoded_image[0])  # display() is available in notebook environments
```

Outputs:
Yes. Given that TAESD was explicitly designed as a drop-in replacement for the Stable Diffusion VAE, and the diffusers library implements them both, it would be very much appreciated if the library offered an interface that's consistent across both.
Even though it was developed as a drop-in replacement, I think its main usefulness lies in speedy decoding. For now, almost all (if not all) encoders we have in the library expect the value range to be in [-1, 1], so we likely won't be changing that. But happy to review any PRs in this regard.
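For reference, a minimal sketch of the convention described above: the default VaeImageProcessor (with do_normalize=True) maps pixels into [-1, 1] before they reach an encoder. The image URL is reused from the test code earlier in the thread.

```python
from diffusers.image_processor import VaeImageProcessor
from diffusers.utils.testing_utils import load_image

image = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/versatile_diffusion/benz.jpg"
)

processor = VaeImageProcessor()  # do_normalize=True by default
image_tensor = processor.preprocess(image)
print(image_tensor.min(), image_tensor.max())  # roughly -1.0 and 1.0
```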
* Add [-1, 1] -> [0, 1] rescaling to EncoderTiny (this fixes huggingface#4676)
* Move [0, 1] -> [-1, 1] rescaling from AutoencoderTiny.decode to DecoderTiny (i.e. immediately after the final conv, as early as possible)
* Fix missing [0, 255] -> [0, 1] rescaling in AutoencoderTiny.forward
* Update AutoencoderTinyIntegrationTests to protect against scaling issues. The new test constructs a simple image, round-trips it through AutoencoderTiny, and confirms the decoded result is approximately equal to the source image. This test checks behavior with and without tiling enabled, and will fail if new AutoencoderTiny scaling issues are introduced.
* Context: Raw TAESD weights expect images in [0, 1], but diffusers' convention represents images with zero-centered values in [-1, 1], so AutoencoderTiny needs to scale / unscale images at the start of encoding and at the end of decoding in order to work with diffusers.
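For intuition, a minimal sketch of the scale/unscale placement the commit message describes (simplified stand-in modules, not the actual diffusers source):

```python
import torch
from torch import nn

class EncoderTinySketch(nn.Module):
    """Simplified stand-in for EncoderTiny: shift diffusers' [-1, 1] images
    into the [0, 1] range the raw TAESD weights expect before encoding."""

    def __init__(self, layers: nn.Module):
        super().__init__()
        self.layers = layers

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x.add(1).div(2))  # [-1, 1] -> [0, 1]

class DecoderTinySketch(nn.Module):
    """Simplified stand-in for DecoderTiny: shift decoded [0, 1] images back
    to [-1, 1] immediately after the final conv."""

    def __init__(self, layers: nn.Module):
        super().__init__()
        self.layers = layers

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x).mul(2).sub(1)  # [0, 1] -> [-1, 1]
```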
@sayakpaul I agree, [-1, 1] is the correct value convention for diffusers.

@keturn I've tested the PR on your sample code, and I think it should now work without modifications (though the printed latent ranges are still different, because your sample code isn't manually applying the scaling factor).
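Continuing with the variables from the test code earlier in the thread, a sketch of the scaling-factor comparison hinted at above: TAESD bakes the SD-VAE scaling_factor into its weights, so the AutoencoderKL latents need it applied manually before the two ranges line up.

```python
# taesd_latents, vaesd_latents, and vaesd come from the test code above
scaled_vaesd_latents = vaesd_latents * vaesd.config.scaling_factor  # 0.18215 for SD 1.x
print(f"scaled vaesd latent range: {scaled_vaesd_latents.min()} < {scaled_vaesd_latents.mean()} < {scaled_vaesd_latents.max()}")
# after scaling, this should roughly match the taesd-encoded latent range
```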
Thank you, and for the note about the scaling factor as well -- I was wondering about that discrepancy. I have some follow-up questions about that, but they go beyond the scope of this AutoencoderTiny issue; I'll find somewhere else to post them.
Describe the bug
The AutoencoderTiny (TAESD) decoder seems to work fine. Encoding, on the other hand, is producing poor results, and an encode-decode round-trip turns out poorly:
input:

output:

Reproduction
see https://gist.github.com/keturn/b0a10a3b388e1e49cdf38567b76eb30c
System Info
diffusers version: 0.20.0

Who can help?
No response