conditional Unet1D #3044
Comments
Hi @lucala! Thanks for being willing to do this :) Could you help us understand the impact of this feature, e.g., some works that would benefit deeply from it?
Hi, I would like this feature as well, and I would also like to help put it into the library. My use case is speech synthesis; it would be helpful to synthesise audio representations conditioned on text input.
We have an AudioDiffusionPipeline that leverages the existing 2D UNet by treating mel spectrograms as images.
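For reference, a minimal usage sketch of that pipeline. The checkpoint name is an assumption (one of the community audio-diffusion models); substitute any compatible one:

```python
import torch
from diffusers import AudioDiffusionPipeline

# Load a community audio-diffusion checkpoint (assumed name; any
# compatible model from the Hub works the same way).
pipe = AudioDiffusionPipeline.from_pretrained("teticio/audio-diffusion-256")
pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")

# The pipeline denoises a mel-spectrogram image and converts it back to audio.
output = pipe()
spectrogram = output.images[0]  # generated mel spectrogram (PIL image)
waveform = output.audios[0]     # reconstructed raw audio array
```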
Treating the mel spectrogram as images might not be the best inductive bias for audio synthesis: there is no temporal relation in the frequency domain, i.e. on the y-axis. So, just for experimentation purposes, it would be nice to try it with a 1D model.
Yes, AudioDiffusionPipeline works on 2D spectrograms, which can also work, but in my case I'm trying to experiment on raw 1D audio signals. This approach has been shown to work in Dance Diffusion, but it only exists for the unconditional case.
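For context, the unconditional case is already covered by diffusers' UNet1DModel, whose defaults follow the Dance Diffusion architecture. A minimal sketch (the sample length and batch size are illustrative):

```python
import torch
from diffusers import UNet1DModel

# Unconditional 1D UNet over raw waveforms, as used by Dance Diffusion:
# the model takes (batch, channels, length) samples plus a timestep.
model = UNet1DModel(sample_size=65536, in_channels=2, out_channels=2)

noisy_audio = torch.randn(1, 2, 65536)        # a stereo waveform
timestep = torch.tensor([10])
output = model(noisy_audio, timestep).sample  # same shape as the input
```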
Cool discussion, sorry not to have replied earlier! I definitely agree that working directly in the audio domain is better than log-mel space (e.g. NaturalSpeech2 > DiffGAN-TTS). I think if there's a new pipeline/model that we want to add to diffusers that leverages a 1D conditional UNet, it'd make sense to add the class. I'm not sure how useful a standalone 1D conditional UNet model would be, though.
I feel it would be useful for research purposes: stronger conditioning via cross-attention would definitely perform better than concatenating conditioning channels at the input (which is what I am doing right now).
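To make that workaround concrete, here is a minimal sketch of conditioning by channel concatenation with the existing UNet1DModel; all channel counts, lengths, and tensors are illustrative assumptions:

```python
import torch
from diffusers import UNet1DModel

# Conditioning by concatenation: one audio channel plus one conditioning
# channel go in, and only the audio channel comes out.
model = UNet1DModel(sample_size=32768, in_channels=2, out_channels=1)

audio = torch.randn(4, 1, 32768)  # noisy mono waveform
cond = torch.randn(4, 1, 32768)   # conditioning signal, resampled to the same length
sample = torch.cat([audio, cond], dim=1)
output = model(sample, torch.tensor([50])).sample  # -> (4, 1, 32768)
```

Cross-attention would instead let the model attend to a variable-length conditioning sequence (e.g. text embeddings) at every resolution, rather than forcing the condition into the waveform's shape.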
One potential audio application is using a conditional UNet1D to forecast pitch from MIDI notes and phones, which could be useful for developing pitch prediction models.
I'll leave this to @sayakpaul to comment on whether adding individual components for research purposes is part of the current design philosophy in Diffusers. From what I gather, the library is typically developed the other way round, where new research is done in standalone repositories and then merged into Diffusers when released.
I think for now we will just keep the UNet1D model as is, and later we can revisit any modification if there's a model/pipeline that would benefit from it. This is in line with what we have been doing for some of our recent work such as ControlNet and the T2I adapter (being worked on here: #2555).
I don't think there is a need for the conditional UNet1D.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
If anyone has a conditional 1D implementation, please share it here!
I built one on top of lucidrains' Karras ("elucidated") diffusion implementation; it's in a bit of a "research code" stage right now, though.
Is your feature request related to a problem? Please describe.
I would like to work on text-conditioned diffusion of 1D signals. There exists a variety of building blocks for a conditional UNet2D but not for UNet1D.
Describe the solution you'd like
Support for a conditional UNet1D (unet_1d_condition.py and the respective conditional building blocks); see the sketch below for what such a block could look like.
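As an illustration only, here is a hypothetical sketch in plain PyTorch of a 1D cross-attention building block, mirroring how the 2D conditional UNet injects encoder hidden states. The class name, shapes, and hyperparameters are invented for this sketch and are not an existing diffusers API:

```python
import torch
from torch import nn

class CrossAttnBlock1D(nn.Module):
    """Hypothetical conditional 1D block: residual conv + cross-attention."""

    def __init__(self, channels: int, cond_dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.GroupNorm(8, channels)
        self.attn = nn.MultiheadAttention(
            embed_dim=channels, kdim=cond_dim, vdim=cond_dim,
            num_heads=num_heads, batch_first=True,
        )
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, hidden_states, encoder_hidden_states):
        # hidden_states: (batch, channels, length)
        # encoder_hidden_states: (batch, seq_len, cond_dim), e.g. text embeddings
        residual = hidden_states
        x = self.norm(hidden_states).transpose(1, 2)  # -> (batch, length, channels)
        x, _ = self.attn(x, encoder_hidden_states, encoder_hidden_states)
        return self.conv(x.transpose(1, 2)) + residual

block = CrossAttnBlock1D(channels=64, cond_dim=768)
x = torch.randn(2, 64, 1024)        # 1D feature map
text_emb = torch.randn(2, 77, 768)  # e.g. a text encoder's output
out = block(x, text_emb)            # -> (2, 64, 1024)
```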
Describe alternatives you've considered
Other repos, but they do not offer the amazing ecosystem Hugging Face has.