I more or less accidentally stumbled upon whisper.cpp. It is an open-source speech-to-text converter with a built-in translator: a C++ port of Whisper, the model from OpenAI (who also made ChatGPT and DALL-E). I wanted to try it.

How to build:

git clone
cd whisper.cpp
cd models
bash ./ large
cd ..
# check that everything built and downloaded correctly
./main -m models/ggml-large.bin -f samples/jfk.wav

How to prepare the data:

cd samples
brew install ffmpeg yt-dlp
# download my video
yt-dlp -x --audio-format wav -o test-u8mWIpv2zaA.wav -- u8mWIpv2zaA
# convert to the required format
ffmpeg -i test-u8mWIpv2zaA.wav -ar 16000 -ac 1 -c:a pcm_s16le test-u8mWIpv2zaA_16bit.wav
# run
cd ..
./main -f samples/test-u8mWIpv2zaA_16bit.wav -l ru -m models/ggml-large.bin

A selection of the results:

[00:00:37.000 --> 00:00:50.000]   Потому что даже если посмотреть какие-нибудь конференции DevOps, панельные дискуссии о том, что такое DevOps, чем он отличается от чего-то другого,
[00:06:28.000 --> 00:06:35.000]   Условно говоря, когда мы выбираем два маршрута, маршрут tramway и маршрут троллейбус,
[00:11:48.000 --> 00:11:55.000]   Какие фичи нужно сделать на нашем сайте, чтобы увеличить продажи на 20% - никто не знает.
[00:12:34.000 --> 00:12:53.000]   Можно купить за 25 тысяч рублей исходники сайта какого-то и мобильного приложения, ну, может быть, с мобильным, под 75, по две платформы, на сайтах, которые занимаются продажей готовых продуктов.
[00:13:20.000 --> 00:13:28.000]   И когда мы смотрим эти цифры, непонятно, почему в Uber 4,5 тысяч программистов.
[00:15:44.000 --> 00:15:48.000]   Так называемые professional services.
[00:16:41.000 --> 00:16:50.000]   Да, ну это может быть аптека, которая заказывает доработку 1С,
[00:25:29.000 --> 00:25:37.000]   Это исправление багов, это поддержка обновленной экосистемы, новой версии Android и iOS.
[00:31:52.000 --> 00:31:59.000]   Это основная задача программы.
[00:31:59.000 --> 00:32:08.000]   [неразборчиво]
[00:32:08.000 --> 00:32:10.000]   Это правильно.
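
To look at the output selectively like this, it can help to save the transcript to a file and filter it by keyword. A minimal sketch, assuming the transcript lines go to standard output and were redirected into a hypothetical file transcript.txt; the function name find_segments is made up for illustration:

```shell
# find_segments TERM FILE
# Prints every transcript line in FILE that contains TERM.
# TERM is matched as a fixed string, so brackets need no escaping.
find_segments() {
    grep -F -e "$1" -- "$2"
}

# Example usage (transcript.txt is a hypothetical file name):
#   ./main -f samples/test-u8mWIpv2zaA_16bit.wav -l ru -m models/ggml-large.bin > transcript.txt
#   find_segments 'Uber' transcript.txt
#   find_segments '[неразборчиво]' transcript.txt
```

Fixed-string matching is used so that markers such as [неразборчиво] are not interpreted as a regular-expression character class.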

What can I say?

  1. Recognizes Russian speech well, including interspersed English.
  2. Recognizes brand names and terms well (Uber, Android, iOS, DevOps, 1C, …).
  3. Recognizes numbers well.
  4. When it cannot make out the speech, it writes `[неразборчиво]` ("inaudible"). In this video those are questions from the audience: a person would understand them, but they were not spoken into the microphone, so this is quite acceptable.
  5. There are occasional mistakes (tramway), but very few.
  6. It runs faster than real time: 21 minutes for a 63-minute video with the slowest variant (the large model), i.e. about 3x on my hardware (M1 Pro, 16 GB).
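
The speed figure in the last item is just audio length divided by processing time. A small sketch of that arithmetic, using the numbers from this post (the variable names are arbitrary):

```shell
# Numbers from the run described above.
audio_minutes=63
processing_minutes=21

# awk handles the division so that non-integer ratios also work.
speedup=$(awk -v a="$audio_minutes" -v p="$processing_minutes" 'BEGIN { printf "%.1f", a / p }')

echo "${speedup}x faster than real time"
```

Swapping in your own durations gives a quick way to compare models or machines.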

If you want to double-check a piece of the text and correct it, you can use Aegisub. To save the whisper.cpp result in SRT format, add the arguments --output-srt --max-len 47.
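
Put together with the earlier run command, the subtitle export might look like the sketch below. It assumes the flag spellings --output-srt and --max-len as printed by the whisper.cpp help output; the wrapper name whisper_srt and the WHISPER_BIN variable are made up for illustration.

```shell
# Path to the whisper.cpp binary; ./main is the name used in this post.
WHISPER_BIN="${WHISPER_BIN:-./main}"

# whisper_srt AUDIO MODEL
# Hypothetical wrapper: transcribes AUDIO (16 kHz mono WAV) in Russian
# and asks whisper.cpp to also write an SRT subtitle file,
# with segments limited to 47 characters.
whisper_srt() {
    "$WHISPER_BIN" -m "$2" -f "$1" -l ru --output-srt --max-len 47
}

# Example:
#   whisper_srt samples/test-u8mWIpv2zaA_16bit.wav models/ggml-large.bin
```

The resulting subtitle file can then be opened in Aegisub for corrections.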

There is also the MacWhisper app, which lets you dictate and get the text recognized immediately. It is more for getting acquainted with the technology, but it is interesting too (if you don't want to build everything yourself).

There is also an open-source program, Buzz. It is installed with:

brew install --cask buzz

Buzz currently crashes for me after downloading any model, but I think it will be fixed sooner or later.

Whisper has a built-in translator into English. You only need to add the -tr option. English is the only target language, but that is what is needed in most cases. It also means that support for other languages is a matter of incremental improvement rather than a fundamental problem.
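
To get both versions for a side-by-side comparison, the same file can be run twice, once without and once with -tr. A minimal sketch, assuming the transcript is printed to standard output; the function name transcribe_both, the WHISPER_BIN variable, and the output file naming are hypothetical:

```shell
# Path to the whisper.cpp binary; ./main is the name used in this post.
WHISPER_BIN="${WHISPER_BIN:-./main}"

# transcribe_both AUDIO MODEL PREFIX
# Writes the Russian transcript to PREFIX.ru.txt and the English
# translation (same run plus -tr) to PREFIX.en.txt.
transcribe_both() {
    "$WHISPER_BIN" -m "$2" -f "$1" -l ru     > "$3.ru.txt" &&
    "$WHISPER_BIN" -m "$2" -f "$1" -l ru -tr > "$3.en.txt"
}

# Example:
#   transcribe_both samples/test-u8mWIpv2zaA_16bit.wav models/ggml-large.bin talk
```

The segment boundaries of the two runs do not have to match, so the files are easier to compare by timestamp than line by line.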

Let’s compare the beginning in both languages:

[00:00:00.000 --> 00:00:13.000]   И сейчас пойдет более-менее сложный материал, соответственно, как встанете, говорите, будем проверивать.
[00:00:13.000 --> 00:00:27.000]   А, собственно, зачем это все? Вот я сейчас начну говорить про вещи, о которых редко говорят и редко задумываются.
[00:13:58.000 --> 00:14:10.000]   Там, по-моему, порядка 700 человек лет они тратят в год на инфраструктуру, на ее оптимизацию, разработку.
[00:14:10.000 --> 00:14:16.000]   А остальное - это как раз какие-то новые гипотезы, которые они проверяют.
[00:14:16.000 --> 00:14:26.000]   Поэтому простой вопрос, сколько стоит продукт, он имеет бесконечную стоимость.

[00:00:00.000 --> 00:00:09.000]   And now we'll have more or less complex material.
[00:00:09.000 --> 00:00:13.000]   So, as you get up, we'll talk and improvise.
[00:00:13.000 --> 00:00:18.000]   Actually, why is all this?
[00:00:18.000 --> 00:00:27.000]   I'm going to start talking about things that are rarely talked about and rarely thought about.
[00:13:58.000 --> 00:14:10.000]   I think they spend about 700 people a year on infrastructure, on its optimization, development, and development.
[00:14:10.000 --> 00:14:16.000]   And the rest is just some new hypotheses that they check.
[00:14:16.000 --> 00:14:26.000]   Therefore, the simple question of how much a product costs is of infinite value.

The original was recognized slightly incorrectly; it should read:

[00:00:00.000 --> 00:00:13.000]   И сейчас пойдет более-менее сложный материал, соответственно, как устанете, говорите, будем перерывы делать.

If you enable the -pc option, the text is printed in color (green for confident recognition, red for uncertain). Errors usually show up in red.

Nevertheless, the translation is quite adequate. Of course, it needs proofreading, but it already feels easier to proofread than to translate from scratch. And it all runs locally. And with the source code.

Overall, I am delighted. Most likely, I will use it actively.

On the downside, the data on which the models were trained does not seem to be available. Strictly speaking, this means partial openness, not complete. Nevertheless, progress has been made.

Additional materials:

  1. – good introduction
  2. – the original Whisper in Python with GPU support, from OpenAI
  3. – C++ version without GPU
  4. – original page