A Diffusion Transformer model for 3D video-like data, introduced by NVIDIA in "Cosmos World Foundation Model Platform for Physical AI".
The model can be loaded with the following code snippet.
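import torch
from diffusers import CosmosTransformer3DModel

# Load only the transformer weights from the checkpoint, in bfloat16 to reduce memory use.
transformer = CosmosTransformer3DModel.from_pretrained("nvidia/Cosmos-1.0-Diffusion-7B-Text2World", subfolder="transformer", torch_dtype=torch.bfloat16)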
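To give a rough sense of why the snippet passes `torch_dtype=torch.bfloat16`, the sketch below estimates the memory needed for the weights alone. It assumes about 7 billion parameters, a figure inferred only from the "7B" in the checkpoint name and not an exact count; `estimate_weight_memory_gib` is a hypothetical helper, not part of diffusers.

```python
# Bytes per element for a few common dtypes (standard sizes).
BYTES_PER_ELEMENT = {"float32": 4, "bfloat16": 2, "float16": 2}


def estimate_weight_memory_gib(num_params: int, dtype: str) -> float:
    """Return the approximate size of the weights alone, in GiB."""
    return num_params * BYTES_PER_ELEMENT[dtype] / 1024**3


# Assumption: ~7e9 parameters, inferred from the "7B" in the checkpoint name.
ASSUMED_NUM_PARAMS = 7_000_000_000

for dtype in ("float32", "bfloat16"):
    size = estimate_weight_memory_gib(ASSUMED_NUM_PARAMS, dtype)
    print(f"{dtype}: ~{size:.1f} GiB")
```

The estimate covers weights only; activations and any other pipeline components need additional memory.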
[[autodoc]] CosmosTransformer3DModel
[[autodoc]] models.modeling_outputs.Transformer2DModelOutput