Using custom Whisper models in Pipeline Builder

Does anyone know if it is possible to use a custom Whisper model in the Audio to Text Pipeline Builder board?

Hi, this currently isn’t supported.

2.5 Adjacent terminology-related questions

Context: :jigsaw:

I was watching this video (https://vimeo.com/1001669623) (AIP for Developers in 10 minutes - Songshare.ai by Jeg)
His “workflow” was as follows:

  1. User uploads song.wav file
  2. Pipeline transformation 1 - converts media set to table rows
  3. Pipeline transformation 2 - transcribes audio into text (with Whisper?)
  4. Does a bunch of Dev Console OSDK stuff
  5. Presses “Generate Lyrics”
  6. Notices output isn’t perfect
  7. At 4:40 - 4:50 he states: “The transcription that’s coming natively out of Whisper may not be exactly right so I may want to use my own model”

My end goal is to build an app where:

  1. The user talks into the microphone (input.wav)
  2. (input.wav) gets transcribed into text
  3. Text gets shown back on screen

Questions: :question:

  1. In the video, would we say that @Jeg is using just a “standard Whisper model” (from OpenAI)?
  2. Is using https://deepgram.com/ an example of the desired “custom Whisper model” concept that (@278d8f915952e8d66637) is talking about?
    • If yes, I’d want to implement Deepgram as a “custom Whisper model”. Would I implement this via a pipeline compute module where:
      • Input - (input.wav)
      • Output - (dataset that updates an object property)

Hi @9aebee43292b35f5ac9a

  1. Pretty much, yes!

  2. A pipeline compute module would work here or you could write a Python Transform to take in the Media and call the model.
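
For illustration, a Python Transform along these lines could work. This is only a sketch: it assumes a dataset of raw .wav files (swap in the media-set transform API if your audio lives in a media set), an egress policy permitting api.deepgram.com, and a Deepgram API key provisioned as a secret; all paths and names here are hypothetical.

```python
# Sketch of a Python Transform that transcribes audio files via Deepgram.
# Assumes raw-file access on the input dataset; paths and the API key
# placeholder are hypothetical.
import requests
from transforms.api import transform, Input, Output

DEEPGRAM_URL = "https://api.deepgram.com/v1/listen"

def deepgram_transcribe(audio_bytes: bytes, api_key: str) -> str:
    """Send raw WAV bytes to Deepgram's pre-recorded endpoint; return the transcript."""
    resp = requests.post(
        DEEPGRAM_URL,
        headers={"Authorization": f"Token {api_key}", "Content-Type": "audio/wav"},
        data=audio_bytes,
    )
    resp.raise_for_status()
    return resp.json()["results"]["channels"][0]["alternatives"][0]["transcript"]

@transform(
    audio=Input("/Project/raw/audio"),                 # hypothetical input path
    transcripts=Output("/Project/clean/transcripts"),  # hypothetical output path
)
def compute(ctx, audio, transcripts):
    fs = audio.filesystem()
    rows = []
    for status in fs.ls(glob="*.wav"):
        with fs.open(status.path, "rb") as f:
            rows.append((status.path, deepgram_transcribe(f.read(), "<DEEPGRAM_API_KEY>")))
    transcripts.write_dataframe(
        ctx.spark_session.createDataFrame(rows, ["path", "transcript"])
    )
```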

For your end goal, however, it sounds like you want the frontend to return the transcription interactively. I would suggest creating a Python or TypeScript function (depending on which language you feel most comfortable in) via Code Repositories. You can then call that function from your frontend via the OSDK: if you base64-encode your audio on the client, the function can call the Deepgram API and return the transcription to the frontend.
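
To make that concrete, here is a minimal sketch of such a Python function, assuming the Foundry Python functions entry point (`from functions.api import function`) and the same hypothetical Deepgram key handling as above; the function name `transcribe_audio` is illustrative, not prescribed by the thread.

```python
# Sketch of a Python function (published from Code Repositories) that a
# frontend can call via the OSDK. Verify the functions.api import and the
# secret/egress wiring against your enrollment's docs.
import base64

import requests
from functions.api import function

DEEPGRAM_URL = "https://api.deepgram.com/v1/listen"

@function
def transcribe_audio(audio_base64: str) -> str:
    """Decode base64-encoded WAV audio, send it to Deepgram, return the transcript."""
    audio_bytes = base64.b64decode(audio_base64)
    resp = requests.post(
        DEEPGRAM_URL,
        headers={
            "Authorization": "Token <DEEPGRAM_API_KEY>",  # store as a secret, not a literal
            "Content-Type": "audio/wav",
        },
        data=audio_bytes,
    )
    resp.raise_for_status()
    return resp.json()["results"]["channels"][0]["alternatives"][0]["transcript"]
```

On the frontend you would capture the microphone input, base64-encode the WAV bytes, and invoke this function through the generated OSDK client to display the transcript on screen.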