Using custom Whisper models in Pipeline Builder

Does anyone know if it is possible to use a custom Whisper model in the Audio to Text Pipeline Builder board?

Hi, this currently isn’t supported.

2.5 Adjacent terminology-related questions

Context: :jigsaw:

I was watching this video (https://vimeo.com/1001669623) (AIP for Developers in 10 minutes - Songshare.ai by Jeg)
His “workflow” was as follows:

  1. User uploads song.wav file
  2. Pipeline transformation 1 - converts media set to table rows
  3. Pipeline transformation 2 - transcribes audio into text (with Whisper?)
  4. Does a bunch of Dev Console OSDK stuff
  5. Presses “Generate Lyrics”
  6. Notices output isn’t perfect
  7. At 4:40 - 4:50 he states: “The transcription that’s coming natively out of Whisper may not be exactly right so I may want to use my own model”

My end goal is to build an app where:

  1. The user talks into the microphone (input.wav)
  2. (input.wav) gets transcribed into text
  3. Text gets shown back on screen

Questions: :question:

  1. In the video, would we say that @Jeg is using just a “standard Whisper model” (from OpenAI)?
  2. Is using https://deepgram.com/ an example of the desired “custom Whisper model” concept that (@278d8f915952e8d66637) is talking about?
    • If yes, I’d want to implement Deepgram as a “custom Whisper model”. Would I implement this via a pipeline compute module where:
      • Input - (input.wav)
      • Output - (dataset that updates an object property)

Hi @9aebee43292b35f5ac9a

  1. Pretty much, yes!

  2. A pipeline compute module would work here or you could write a Python Transform to take in the Media and call the model.
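
For illustration, a Python Transform along these lines could work. This is only a sketch: it assumes a dataset of raw .wav files (swap in the media-set transform API if your audio lives in a media set), an egress policy permitting api.deepgram.com, and a Deepgram API key provisioned as a secret; all paths and names here are hypothetical.

```python
# Sketch of a Python Transform that transcribes audio files via Deepgram.
# Assumes raw-file access on the input dataset; paths and the API key
# placeholder are hypothetical.
import requests
from transforms.api import transform, Input, Output

DEEPGRAM_URL = "https://api.deepgram.com/v1/listen"

def deepgram_transcribe(audio_bytes: bytes, api_key: str) -> str:
    """Send raw WAV bytes to Deepgram's pre-recorded endpoint; return the transcript."""
    resp = requests.post(
        DEEPGRAM_URL,
        headers={"Authorization": f"Token {api_key}", "Content-Type": "audio/wav"},
        data=audio_bytes,
    )
    resp.raise_for_status()
    return resp.json()["results"]["channels"][0]["alternatives"][0]["transcript"]

@transform(
    audio=Input("/Project/raw/audio"),                 # hypothetical input path
    transcripts=Output("/Project/clean/transcripts"),  # hypothetical output path
)
def compute(ctx, audio, transcripts):
    fs = audio.filesystem()
    rows = []
    for status in fs.ls(glob="*.wav"):
        with fs.open(status.path, "rb") as f:
            rows.append((status.path, deepgram_transcribe(f.read(), "<DEEPGRAM_API_KEY>")))
    transcripts.write_dataframe(
        ctx.spark_session.createDataFrame(rows, ["path", "transcript"])
    )
```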

For your end goal, however, it sounds like you want the frontend to return the transcription interactively. I would suggest creating a Python or TypeScript function (depending on which language you feel most comfortable in) via Code Repositories. You can then call that function from your frontend via the OSDK: if you base64-encode your audio on the client, the function can call the Deepgram API and return the transcription to the frontend.
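
To make that concrete, here is a minimal sketch of such a Python function, assuming the Foundry Python functions entry point (`from functions.api import function`) and the same hypothetical Deepgram key handling as above; the function name `transcribe_audio` is illustrative, not prescribed by the thread.

```python
# Sketch of a Python function (published from Code Repositories) that a
# frontend can call via the OSDK. Verify the functions.api import and the
# secret/egress wiring against your enrollment's docs.
import base64

import requests
from functions.api import function

DEEPGRAM_URL = "https://api.deepgram.com/v1/listen"

@function
def transcribe_audio(audio_base64: str) -> str:
    """Decode base64-encoded WAV audio, send it to Deepgram, return the transcript."""
    audio_bytes = base64.b64decode(audio_base64)
    resp = requests.post(
        DEEPGRAM_URL,
        headers={
            "Authorization": "Token <DEEPGRAM_API_KEY>",  # store as a secret, not a literal
            "Content-Type": "audio/wav",
        },
        data=audio_bytes,
    )
    resp.raise_for_status()
    return resp.json()["results"]["channels"][0]["alternatives"][0]["transcript"]
```

On the frontend you would capture the microphone input, base64-encode the WAV bytes, and invoke this function through the generated OSDK client to display the transcript on screen.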