I can talk a lot quicker than I can write. I always thought about using dictation to increase productivity in writing code, but so far it hasn’t really been possible. Code is highly structured text which is really hard to dictate. Try dictating this:

swaymsg -t get_outputs | jq '.[] | {name, make, rect}'

With large language models this has changed, since they take inputs mostly in the form of plain-language descriptions, not structured text. A description like that is much quicker to say than to type out.

This idea prompted me to hack together a tool for dictation under Linux, since I couldn’t find any good existing options. This has been made somewhat superfluous by GPT-4o, but I thought it was worth sharing nonetheless, even if it’s just to show how amazingly simple it is to put together things like this these days.

Implementation

The architecture is very straightforward. There’s a Flask app that I run at startup, which accepts a /state command that reports whether it’s currently recording. It also accepts /start and /end commands to start and end recordings. The /end command returns a JSON object with a result field, containing the text that was spoken while recording.
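
To make the three commands concrete, here is a minimal, framework-free sketch of the state machine behind them. It is not the actual app: the class name, the two injected callables, the `recording` field and the error responses are all assumptions for illustration. Only the `status` and `result` fields of the /end response come from the real code. In a Flask app, each method would sit behind its own route.

```python
class DictationService:
    """Hypothetical model of the /state, /start and /end commands."""

    def __init__(self, start_recording, stop_and_transcribe):
        # Both callables are placeholders for the real audio and
        # transcription plumbing, injected so the logic stays testable.
        self._start_recording = start_recording
        self._stop_and_transcribe = stop_and_transcribe
        self._recording = False

    def state(self):
        # Assumed response shape: a boolean "recording" field.
        return {"status": "ok", "recording": self._recording}

    def start(self):
        if self._recording:
            return {"status": "error", "message": "already recording"}
        self._start_recording()
        self._recording = True
        return {"status": "ok"}

    def end(self):
        if not self._recording:
            return {"status": "error", "message": "not recording"}
        text = self._stop_and_transcribe()
        self._recording = False
        return {"status": "ok", "result": text}
```

Keeping the state in one small object like this makes it easy to reject a second /start or a stray /end instead of crashing on them.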

There is a wrapper around the app that starts a recording if none is running, and stops the current one otherwise. If it just stopped a recording, it puts the transcribed text into my clipboard.
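
A rough sketch of what such a wrapper could look like, using only the standard library. Several things here are assumptions rather than details of the real tool: the server address (Flask's default development port), plain GET requests, a `recording` boolean in the /state response, and `wl-copy` as the clipboard command (on X11 something like `xclip` would take its place).

```python
import json
import subprocess
import urllib.request

BASE_URL = "http://127.0.0.1:5000"  # assumption: Flask's default dev port


def http_get_json(path):
    """Fetch BASE_URL + path and decode the JSON body."""
    with urllib.request.urlopen(BASE_URL + path) as response:
        return json.loads(response.read().decode("utf-8"))


def copy_to_clipboard(text):
    """Assumption: wl-copy (Wayland) is installed; it reads text from stdin."""
    subprocess.run(["wl-copy"], input=text.encode("utf-8"), check=True)


def toggle(fetch=http_get_json, copy=copy_to_clipboard):
    """Start recording if idle; otherwise stop and copy the transcript."""
    if fetch("/state").get("recording"):
        # strip() because transcripts may carry leading whitespace.
        text = fetch("/end")["result"].strip()
        copy(text)
        return text
    fetch("/start")
    return None
```

Binding a script that calls `toggle()` to a single hotkey gives a push-to-start, push-to-stop workflow. The `fetch` and `copy` parameters exist only so the logic can be exercised without a running server or clipboard.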

The audio is recorded using PyAudio and transcribed using AI magic (Whisper). Essentially:

import whisper
model = whisper.load_model('base.en')

# Do some plumbing to record audio in a separate thread
# ...

result = model.transcribe(audio_recording_thread.get_audio())
return {'status': 'ok', 'result': result['text']}

The tricky bits were actually unrelated to AI. One was capturing the audio in a separate thread, so that it wouldn’t block the Flask app’s worker thread. This was easily done using the threading module, no doubt thanks to some brain-bending work done by PyAudio to evade the GIL:

import logging
import threading

import numpy as np
import pyaudio

class AudioRecordingThread(threading.Thread):
    RATE = 16000
    CHUNKSIZE = int(RATE * 0.5)
    MAX_SECONDS = 120

    def __init__(self):
        super().__init__()
        self._stop_event = threading.Event()

    def run(self):
        def record_audio_chunk():
            data = stream.read(AudioRecordingThread.CHUNKSIZE)
            return np.frombuffer(data, dtype=np.float32)

        chunks = []
        p = pyaudio.PyAudio()
        stream = p.open(
            format=pyaudio.paFloat32,
            channels=1,
            rate=AudioRecordingThread.RATE,
            input=True,
            frames_per_buffer=AudioRecordingThread.CHUNKSIZE,
        )
        logging.info("Started recording")
        for _ in range(int(AudioRecordingThread.MAX_SECONDS * AudioRecordingThread.RATE / AudioRecordingThread.CHUNKSIZE)):
            if self._stop_event.is_set():
                break
            chunks.append(record_audio_chunk())
        logging.info("Finished recording")

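        stream.stop_stream()
        stream.close()
        p.terminate()  # release the PortAudio resources as well

        # np.concatenate raises on an empty list, so guard against a stop
        # that arrives before the first chunk has been read.
        self.audio = np.concatenate(chunks) if chunks else np.zeros(0, dtype=np.float32)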

    def get_audio(self):
        return self.audio[int(AudioRecordingThread.RATE * 0.25):]

    def stop(self):
        self._stop_event.set()
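
One detail of this class is easy to trip over: `self.audio` is only assigned at the very end of `run()`, so the caller has to `join()` the thread after `stop()` before calling `get_audio()`. The sketch below shows that lifecycle with a stand-in class that needs no audio hardware. `FakeRecordingThread` and `record_for` are hypothetical names, and the integers stand in for audio chunks.

```python
import threading
import time


class FakeRecordingThread(threading.Thread):
    """Stand-in with the same shape as AudioRecordingThread, minus PyAudio."""

    def __init__(self):
        super().__init__()
        self._stop_event = threading.Event()
        self.audio = None

    def run(self):
        chunks = []
        for i in range(1000):  # upper bound, playing the role of MAX_SECONDS
            if self._stop_event.is_set():
                break
            chunks.append(i)  # pretend this is one stream.read(...) call
            time.sleep(0.01)
        self.audio = chunks  # only assigned once the loop has exited

    def stop(self):
        self._stop_event.set()


def record_for(seconds):
    """Start a recording, let it run for a while, then collect the result."""
    thread = FakeRecordingThread()
    thread.start()
    time.sleep(seconds)
    thread.stop()
    thread.join()  # wait for run() to finish so that .audio exists
    return thread.audio
```

Because the stop event is only checked between chunks, stopping can lag by up to one chunk length, which is half a second with the real class's CHUNKSIZE.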

The second was playing a sound when recording starts and stops, to confirm that the system is working. I did this using pygame because it’s the most reliable way I know of to play sound on Linux. It pulls in lots of unneeded dependencies, but oh well.

import os

import pygame

# Assumes pygame.mixer.init() has already been called once at startup.
def play_mp3(file_name, block=False):
    resource_path = os.path.abspath(
        os.path.join(
            os.path.dirname(__file__),
            '../data/sfx/',
            file_name,
        )
    )
    pygame.mixer.music.load(resource_path)
    pygame.mixer.music.play()

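    if block:
        # Poll about ten times per second until playback has finished,
        # reusing a single Clock instead of creating one per iteration.
        clock = pygame.time.Clock()
        while pygame.mixer.music.get_busy():
            clock.tick(10)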

Repo link here.

Usefulness

It’s so fucking good. It is very useful for talking to ChatGPT, because I don’t need to structure my thoughts very clearly and the language model will usually deal with it just fine. With LLMs, the time savings are often small: the model finds me a bit of documentation maybe 10 seconds quicker than I could myself. If I were interacting with it using text, the time spent typing out my query would negate the benefit. This dictation tool takes the friction out of the process, which means that I get lots of small time savings that I wouldn’t otherwise have. Also, asking the computer questions by voice doesn’t break flow as much as looking for information on the internet would.

I am less sure about its use for other kinds of writing. There I need to think more carefully about what I want to write, which makes dictation feel less fluid. Also, the ease of producing lots of text gives an incentive to be less concise, which is a bad thing. But I think with practice I might see productivity gains here as well.

Of course, I cannot use this tool in an open-plan office, so for now it must be confined to the realm of personal projects.