Cosmos #10660
base: main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
To match our sigmas to the original exactly, without any rounding errors, I had to use …
Also, we only match the sigmas if we set our …
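(As an illustrative aside, not the exact code from this PR: one way to verify schedule parity without rounding drift is to build the sigmas in float64 and only compare or cast at the end. The EDM/Karras-style schedule and the default values below are assumptions made for this sketch.)

```python
import torch

# Hypothetical sketch: compute a Karras/EDM-style sigma schedule in float64 so
# that small rounding differences don't accumulate, then compare against a
# reference schedule dumped from the original codebase.
def karras_sigmas(num_steps, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    ramp = torch.linspace(0, 1, num_steps, dtype=torch.float64)
    min_inv_rho = sigma_min ** (1 / rho)
    max_inv_rho = sigma_max ** (1 / rho)
    return (max_inv_rho + ramp * (min_inv_rho - max_inv_rho)) ** rho

sigmas = karras_sigmas(35)
# reference_sigmas = ...  # sigmas saved from the original Cosmos code (float64)
# assert torch.equal(sigmas, reference_sigmas)  # exact match, no rounding slack
print(sigmas[:5])
```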
This PR doesn't seem to include the guardrail model: https://huggingface.co/nvidia/Cosmos-1.0-Guardrail
@asfiyab-nvidia I didn't think to add the guardrail models because they essentially work as preprocessors/postprocessors outside the core diffusion-related aspects. I can definitely do a follow-up adding support for them. Additionally, the prompt upsampler isn't added for similar reasons. The upsampling can be run via any language model (independent of diffusers), but I'll update the docs to point to Pixtral-12B, as used in the original codebase, as an example. This PR contains only the parts relevant for running the diffusion sampling and generating videos.
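As a rough illustration of that workflow (not the official recipe), prompt upsampling can be done with any instruction-tuned language model via transformers before calling the diffusers pipeline. The model id and system prompt below are placeholders, not what the original codebase uses:

```python
from transformers import pipeline

# Illustrative only: any chat-capable LLM can act as the prompt upsampler.
# The model id below is a placeholder; the original codebase uses Pixtral-12B.
upsampler = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-7B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "system", "content": "Expand the user's short prompt into a detailed, cinematic video description."},
    {"role": "user", "content": "a robot standing in a warehouse"},
]
result = upsampler(messages, max_new_tokens=256)
upsampled_prompt = result[0]["generated_text"][-1]["content"]
# upsampled_prompt can then be passed to the Cosmos pipeline as `prompt`.
```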
@a-r-r-o-w Not including the guardrail model violates the License in https://huggingface.co/nvidia/Cosmos-1.0-Diffusion-7B-Text2World. cc @pjannaty for comment on this
@asfiyab-nvidia Thanks for the notice! I didn't check the license until now. In that case, I'll implement the guardrails tomorrow.
@asfiyab-nvidia @pjannaty The CosmosGuardrail has been integrated as well. The relevant class to review is …
Thanks for patiently reviewing this! If everything looks good to merge, please let us know. We plan to do a diffusers release over the weekend or on Monday. It would be great to ship the Cosmos integration as well for this release cycle. In order to proceed with that, we'll have to host diffusers-format weights for the following repositories:
To host the weights, none of the existing files will be modified apart from README.md (which we can use to showcase how to run inference with diffusers). The diffusers-format folder structure would look something like the layout sketched below.

I've opened an example PR for the 7B Text-to-World weights here: https://huggingface.co/nvidia/Cosmos-1.0-Diffusion-7B-Text2World/discussions/9. Once I have the go-ahead from your end that these changes are good, I can open PRs to all the other repositories.
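For reference, a hedged sketch of what that diffusers-format layout typically looks like; the component folder names follow the usual diffusers conventions and are illustrative here, so the exact contents of the final repos may differ:

```
Cosmos-1.0-Diffusion-7B-Text2World/
├── model_index.json
├── scheduler/
│   └── scheduler_config.json
├── text_encoder/        # T5 text encoder weights
├── tokenizer/
├── transformer/         # diffusion transformer weights + config
├── vae/                 # causal video VAE weights + config
└── README.md            # existing model card, plus a diffusers usage example
```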
@a-r-r-o-w I'm running into the below issue during pipeline load FYI. Is this expected?
Another note re the attention definition here. Enabling GQA breaks ONNX export due to https://github.com/pytorch/pytorch/blob/main/torch/onnx/symbolic_opset14.py#L152. Can this be addressed?
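For context, a common workaround is to materialize the grouped key/value heads before calling scaled_dot_product_attention instead of relying on enable_gqa=True, which keeps the traced graph ONNX-exportable at the cost of a bit of extra memory. A minimal sketch, assuming the attention block currently uses SDPA with grouped KV heads:

```python
import torch
import torch.nn.functional as F

def gqa_sdpa_onnx_friendly(q, k, v):
    # q: (batch, num_q_heads, seq, head_dim)
    # k, v: (batch, num_kv_heads, seq, head_dim) with num_kv_heads < num_q_heads
    num_q_heads, num_kv_heads = q.shape[1], k.shape[1]
    if num_kv_heads != num_q_heads:
        # Expand KV heads so every query head has a matching key/value head,
        # instead of passing enable_gqa=True to SDPA.
        repeats = num_q_heads // num_kv_heads
        k = k.repeat_interleave(repeats, dim=1)
        v = v.repeat_interleave(repeats, dim=1)
    return F.scaled_dot_product_attention(q, k, v)

q = torch.randn(1, 32, 16, 64)
k = torch.randn(1, 8, 16, 64)
v = torch.randn(1, 8, 16, 64)
out = gqa_sdpa_onnx_friendly(q, k, v)  # shape: (1, 32, 16, 64)
```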
@asfiyab-nvidia I'm testing a non-… I'm not sure why you get the error about …

```python
import torch
from diffusers import CosmosPipeline
from diffusers.utils import export_to_video

model_id = "nvidia/Cosmos-1.0-Diffusion-7B-Text2World"
pipe = CosmosPipeline.from_pretrained(model_id, revision="refs/pr/9", torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "A sleek, humanoid robot stands in a vast warehouse filled with neatly stacked cardboard boxes on industrial shelves. The robot's metallic body gleams under the bright, even lighting, highlighting its futuristic design and intricate joints. A glowing blue light emanates from its chest, adding a touch of advanced technology. The background is dominated by rows of boxes, suggesting a highly organized storage system. The floor is lined with wooden pallets, enhancing the industrial setting. The camera remains static, capturing the robot's poised stance amidst the orderly environment, with a shallow depth of field that keeps the focus on the robot while subtly blurring the background for a cinematic effect."

output = pipe(prompt=prompt).frames[0]
export_to_video(output, "output.mp4", fps=30)
```

I'll try to dig in more soon to see if it errors out with a different environment.
@asfiyab-nvidia Could you review again and let us know if everything looks good and if we can move forward with hosting the diffusers folder-format weights in each repo? Thanks!
I opened a PR for the 7B Video-to-World model as well for easier testing: https://huggingface.co/nvidia/Cosmos-1.0-Diffusion-7B-Video2World/discussions/2

```python
import torch
from diffusers import CosmosVideoToWorldPipeline
from diffusers.utils import export_to_video, load_video

model_id = "nvidia/Cosmos-1.0-Diffusion-7B-Video2World"
pipe = CosmosVideoToWorldPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16, revision="refs/pr/2")
pipe.to("cuda")

prompt = "The video depicts a winding mountain road covered in snow, with a single vehicle traveling along it. The road is flanked by steep, rocky cliffs and sparse vegetation. The landscape is characterized by rugged terrain and a river visible in the distance. The scene captures the solitude and beauty of a winter drive through a mountainous region."
video = load_video(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cosmos/cosmos-video2world-input-vid.mp4"
)[:21]  # This example uses only the first 21 frames

video = pipe(video=video, prompt=prompt).frames[0]
export_to_video(video, "output.mp4", fps=30)
```
@a-r-r-o-w Thanks for the update! I'm still seeing the issue mentioned in #10660 (comment). I see that …
Could you try after …
Thanks. Installing …
Can we please rename this pipeline_cosmos.py to pipeline_cosmos_text2world.py, and accordingly, to be consistent with the Cosmos implementation?
I have considered doing a bit more refactoring by consolidating all reusable code into a common file. However, I think it might be best to adhere to the basic Diffusers design philosophy of "explicit is better than implicit" and "simple is better than complex", so I'll pass on this idea for now.
Yes, we can rename the file. It will require updating the remote checkpoint PRs, so I'll do it when the PR is close to merge and we've gotten the thumbs-up from your team after verification 🤗
@a-r-r-o-w I'm also noticing that the guardrail is only applied to the input text prompt, but not to the generated video.
@asfiyab-nvidia Could you check if the video guardrail is erroring out with an exception on your end? If a guardrail raises an exception due to an error, it is simply ignored (this behaviour is consistent with the official implementation). On my end, the generated videos have the robot's face blurred in most generations.
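To illustrate the behaviour described here (the names below are hypothetical; this is not the actual diffusers implementation): a guardrail that raises is logged and skipped, so generation still returns a result, roughly like this:

```python
import logging

logger = logging.getLogger(__name__)

def run_guardrails(video_frames, guardrails):
    # Hypothetical sketch: each guardrail (e.g. face blur, content filter) is
    # applied in sequence; if one raises, it is logged and skipped so that
    # generation still completes and returns frames.
    for guardrail in guardrails:
        try:
            video_frames = guardrail(video_frames)
        except Exception as err:
            logger.warning("Guardrail %s failed and was skipped: %s", type(guardrail).__name__, err)
    return video_frames
```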
@a-r-r-o-w I have been testing pipelines with guardrails, and so far everything seems to be working smoothly; however, I am looking to add some more test cases to ensure comprehensive coverage. @asfiyab-nvidia Regarding the …
I resolved this problem by installing …
The cosmos is within us. We are made of star-stuff. We are a way for the universe to know itself.
Transformer
- test attention
- test ff
- test timesteps
- test patch embed
- test positional embed
- test transformer block
- test transformer
- test transformer video

VAE
- test vae attention
- test vae
Text-to-World:
Video-to-World (image-conditioning):
Video-to-World (video-conditioning):
Note that the model repos are not yet compatible with Diffusers loading. I'll open PRs for the weights once the NVIDIA team gives the thumbs-up.
Inference code (old)