Galène videoconferencing server discussion list archives
* [Galene] More work on speech-to-text
@ 2024-11-08 14:54 Juliusz Chroboczek
From: Juliusz Chroboczek @ 2024-11-08 14:54 UTC (permalink / raw)
  To: galene

Hi,

I've just finished doing some more work on speech-to-text support.
Galene-stt can now run in three modes:

  - dump a transcript to standard output; this is the default, and is
    useful if you're trying to follow a meeting that's not in a language
    you understand well;

  - dump a transcript to the chat; this is requested with the option
    "-chat", and I don't think it's very useful;

  - generate proper captions; this is requested with the option
    "-caption", and is pretty useful in general.
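For concreteness, the three modes might be invoked as follows; the model path and group URL are just the ones used later in this message, so substitute your own:

```shell
# Default mode: dump the transcript to standard output
./galene-stt -model models/ggml-tiny-q5_1.bin https://galene.org:8443/group/public/stt

# Post the transcript to the group's chat
./galene-stt -model models/ggml-tiny-q5_1.bin -chat https://galene.org:8443/group/public/stt

# Publish proper captions (the user needs the "caption" permission)
./galene-stt -model models/ggml-tiny-q5_1.bin -caption https://galene.org:8443/group/public/stt
```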

You first need to create a group with a user "speech-to-text" with the
permission to publish captions.  Here's what I did in order to create the
group <https://galene.org:8443/group/public/stt/>:

    galenectl create-group -group public/stt
    galenectl create-user -group public/stt -user speech-to-text -permissions caption
    galenectl set-password -group public/stt -user speech-to-text -type wildcard
    galenectl create-user -group public/stt -wildcard
    galenectl set-password -group public/stt -wildcard -type wildcard

Now run the galene-stt client on the fastest machine you have access to:

    ./galene-stt -model models/ggml-tiny-q5_1.bin -caption https://galene.org:8443/group/public/stt

Type `./galene-stt -help` for other options.  Whisper.cpp has a lot of
other options that I haven't exported in galene-stt; please let me know
if there are any that you'd find useful.

  https://github.com/ggerganov/whisper.cpp/blob/master/examples/main/main.cpp#L125

The problem is, of course, that whisper.cpp (the speech-to-text library
I'm using) is too slow to produce real-time output; on my
(eight-year-old) laptop, I can only keep up with real time using the
"tiny" model, which does not produce useful output in practice.  I'm
experimenting with running it on the GPU, but with little success so far.
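If you want to experiment with GPU inference yourself, whisper.cpp can be built with CUDA enabled; the exact flag name has changed between releases, so check the whisper.cpp README for your version (the commands below are a sketch, not a recipe I've validated with galene-stt):

```shell
# Build whisper.cpp with CUDA support (flag name per the whisper.cpp README;
# older releases used WHISPER_CUBLAS instead of GGML_CUDA)
cmake -B build -DGGML_CUDA=1
cmake --build build -j --config Release
```

Whether galene-stt actually benefits then depends on linking it against the GPU-enabled build of the library.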

The obvious solution would be to use a cloud-hosted Whisper service
instead of running the inference locally, but that raises serious privacy
issues.
I won't be implementing it myself, but if you're not concerned about
privacy, please feel free to fork the galene-stt tool and announce your
fork on the list.

Next steps:

  - more work on audio segmentation;
  - GPU support.

-- Juliusz
