Add dedicated transcription interface for audio-to-text models #92


Open

keithrbennett opened this issue Apr 3, 2025 · 3 comments

@keithrbennett
Contributor

Current Behavior

The README currently shows audio transcription support through the chat interface:

# Analyze audio recordings
chat.ask 'Describe this meeting', with: { audio: 'meeting.wav' }

However, this chat-based path covers only audio-capable chat models. The library also lists dedicated transcription models (gpt-4o-transcribe, gpt-4o-mini-transcribe), but attempting to use them results in errors. These models are distinct from both audio conversation models (gpt-4o-audio-preview) and text-to-speech models (gpt-4o-mini-tts).

  • Using the chat interface fails because transcription models aren't chat models:

    chat = RubyLLM.chat(model: 'gpt-4o-transcribe')
    chat.ask('Transcribe this', with: { audio: 'audio.mp3' })
    # Error: This is not a chat model and thus not supported in the v1/chat/completions endpoint
  • No dedicated transcription method exists:

    RubyLLM.transcribe('audio.mp3', model: 'gpt-4o-transcribe')
    # Error: undefined method 'transcribe' for module RubyLLM

Desired Behavior

Add a dedicated transcription interface consistent with other RubyLLM operations:

# Simple usage
transcription = RubyLLM.transcribe('audio.mp3', model: 'gpt-4o-transcribe')
puts transcription.text

# With options
transcription = RubyLLM.transcribe('audio.mp3',
  model: 'gpt-4o-transcribe',
  language: 'en',  # Optional language hint
  prompt: 'This is a technical discussion'  # Optional context
)

This would:

  1. Provide a consistent interface for audio transcription
  2. Support different transcription models
  3. Match the pattern of other RubyLLM operations (chat, paint, embed)
  4. Allow for future expansion to other providers' transcription models
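
As a rough sketch only (none of the following exists in RubyLLM today; the Transcription struct is hypothetical, and the ruby-openai call is hard-wired purely for illustration), the top-level method could start out as thin as:

require 'openai' # ruby-openai gem

module RubyLLM
  # Hypothetical result object; the proposal above assumes only #text.
  Transcription = Struct.new(:text, :model, keyword_init: true)

  # Hypothetical top-level helper mirroring RubyLLM.paint and RubyLLM.embed.
  # Hard-wired to OpenAI for illustration; a real version would dispatch on
  # the model's provider (point 4 above).
  def self.transcribe(audio_path, model:, language: nil, prompt: nil)
    File.open(audio_path, 'rb') do |file|
      params = { model: model, file: file,
                 language: language, prompt: prompt }.compact
      response = OpenAI::Client.new.audio.transcribe(parameters: params)
      Transcription.new(text: response['text'], model: model)
    end
  end
end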

Current Workaround

Until this feature is implemented, users need to call the OpenAI API directly for transcription, e.g. via the ruby-openai gem:

require 'openai' # ruby-openai gem

def transcribe_audio
  # Assumes the client is already configured (e.g. an access token set
  # via OpenAI.configure or passed to OpenAI::Client.new).
  client = OpenAI::Client.new
  File.open('audio.mp3', 'rb') do |file|
    transcription = client.audio.transcribe(
      parameters: {
        model: 'gpt-4o-transcribe',
        file: file
      }
    )
    transcription['text'] # the block's value becomes the method's return value
  end
end
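
OpenAI's transcription endpoint also documents optional language and prompt fields, so the workaround can pass through the same hints proposed above (assuming the chosen model honors them). Inside the File.open block:

transcription = client.audio.transcribe(
  parameters: {
    model: 'gpt-4o-transcribe',
    file: file,
    language: 'en',                           # optional language hint
    prompt: 'This is a technical discussion'  # optional context
  }
)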

Documentation

The README should be updated to document the new dedicated transcription interface, making it clear that transcription is a separate operation from chat (which already handles audio input on audio-capable models), similar to how image generation (paint) and embeddings are handled.

@crmne
Owner

crmne commented Apr 3, 2025

The example in the README isn't misleading – we absolutely do support audio in chat today. Our test suite has working examples with gpt-4o-audio-preview models.
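
For example, the README pattern works as-is once an audio-capable chat model is selected:

chat = RubyLLM.chat(model: 'gpt-4o-audio-preview')
chat.ask 'Describe this meeting', with: { audio: 'meeting.wav' }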

What we don't have is a dedicated transcription-only interface for models like gpt-4o-transcribe, which is a fair point.

Go ahead and open a PR! This would be a nice addition to the API that fits our pattern of simple, top-level methods.

@keithrbennett
Contributor Author

@crmne Sorry, you're absolutely right, and it's right there in the chat guide. I have edited the reported issue to remove any reference to the lack of audio support. I'll take a look at putting together a PR.
