Add dedicated transcription interface for audio-to-text models #92


Open

keithrbennett opened this issue Apr 3, 2025 · 3 comments

@keithrbennett
Contributor

Current Behavior

The README currently shows audio transcription support through the chat interface:

# Analyze audio recordings
chat.ask 'Describe this meeting', with: { audio: 'meeting.wav' }

However, this chat-based path covers only audio-capable chat models. The library also lists dedicated transcription models (gpt-4o-transcribe, gpt-4o-mini-transcribe), but attempting to use them results in errors. These models are distinct from both audio conversation models (gpt-4o-audio-preview) and text-to-speech models (gpt-4o-mini-tts).

  • Using the chat interface fails because transcription models aren't chat models:

    chat = RubyLLM.chat(model: 'gpt-4o-transcribe')
    chat.ask('Transcribe this', with: { audio: 'audio.mp3' })
    # Error: This is not a chat model and thus not supported in the v1/chat/completions endpoint
  • No dedicated transcription method exists:

    RubyLLM.transcribe('audio.mp3', model: 'gpt-4o-transcribe')
    # Error: undefined method 'transcribe' for module RubyLLM

Desired Behavior

Add a dedicated transcription interface consistent with other RubyLLM operations:

# Simple usage
transcription = RubyLLM.transcribe('audio.mp3', model: 'gpt-4o-transcribe')
puts transcription.text

# With options
transcription = RubyLLM.transcribe('audio.mp3',
  model: 'gpt-4o-transcribe',
  language: 'en',  # Optional language hint
  prompt: 'This is a technical discussion'  # Optional context
)

This would:

  1. Provide a consistent interface for audio transcription
  2. Support different transcription models
  3. Match the pattern of other RubyLLM operations (chat, paint, embed)
  4. Allow for future expansion to other providers' transcription models
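
As a rough sketch only (none of the following exists in RubyLLM today; the Transcription struct is hypothetical, and the ruby-openai call is hard-wired purely for illustration), the top-level method could start out as thin as:

require 'openai' # ruby-openai gem

module RubyLLM
  # Hypothetical result object; the proposal above assumes only #text.
  Transcription = Struct.new(:text, :model, keyword_init: true)

  # Hypothetical top-level helper mirroring RubyLLM.paint and RubyLLM.embed.
  # Hard-wired to OpenAI for illustration; a real version would dispatch on
  # the model's provider (point 4 above).
  def self.transcribe(audio_path, model:, language: nil, prompt: nil)
    File.open(audio_path, 'rb') do |file|
      params = { model: model, file: file,
                 language: language, prompt: prompt }.compact
      response = OpenAI::Client.new.audio.transcribe(parameters: params)
      Transcription.new(text: response['text'], model: model)
    end
  end
end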

Current Workaround

Until this feature is implemented, users need to call the OpenAI API directly for transcription, e.g. via the ruby-openai gem:

require 'openai' # ruby-openai gem

def transcribe_audio
  # Assumes the client is already configured (e.g. an access token set
  # via OpenAI.configure or passed to OpenAI::Client.new).
  client = OpenAI::Client.new
  File.open('audio.mp3', 'rb') do |file|
    transcription = client.audio.transcribe(
      parameters: {
        model: 'gpt-4o-transcribe',
        file: file
      }
    )
    transcription['text'] # the block's value becomes the method's return value
  end
end
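
OpenAI's transcription endpoint also documents optional language and prompt fields, so the workaround can pass through the same hints proposed above (assuming the chosen model honors them). Inside the File.open block:

transcription = client.audio.transcribe(
  parameters: {
    model: 'gpt-4o-transcribe',
    file: file,
    language: 'en',                           # optional language hint
    prompt: 'This is a technical discussion'  # optional context
  }
)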

Documentation

The README should be updated to document the new dedicated transcription interface, making it clear that transcription is a separate operation from chat (which already handles audio input on audio-capable models), similar to how image generation (paint) and embeddings are handled.

@crmne
Owner

crmne commented Apr 3, 2025

The example in the README isn't misleading – we absolutely do support audio in chat today. Our test suite has working examples with gpt-4o-audio-preview models.
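
For example, the README pattern works as-is once an audio-capable chat model is selected:

chat = RubyLLM.chat(model: 'gpt-4o-audio-preview')
chat.ask 'Describe this meeting', with: { audio: 'meeting.wav' }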

What we don't have is a dedicated transcription-only interface for models like gpt-4o-transcribe, which is a fair point.

Go ahead and open a PR! This would be a nice addition to the API that fits our pattern of simple, top-level methods.

@keithrbennett
Contributor Author

@crmne Sorry, you're absolutely right, and it's right there in the chat guide. I have edited the reported issue to remove any reference to the lack of audio support. I'll take a look at putting together a PR.
