Python / Groq Whisper API / faster-whisper (local fallback)

undertone

Voice typing for Linux, published on PyPI

Overview

A desktop voice typing tool for Linux. Hold a hotkey, speak, release -- your words are typed at the cursor. Built because existing voice typing tools were Mac-only, required a browser, or had unacceptable latency. Published on PyPI as a pip-installable package.

Architecture

1. Audio capture uses sounddevice with a callback-buffered InputStream at 16kHz mono. Recording starts on hotkey press (Right Ctrl or F8 via pynput) and stops on release, outputting WAV via BytesIO.
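The buffer-then-WAV step can be sketched with the stdlib wave module. The Recorder class below and its wiring to sounddevice are illustrative assumptions, not the tool's actual code:

```python
import io
import wave

SAMPLE_RATE = 16_000  # 16kHz mono, matching the capture pipeline

class Recorder:
    """Buffers raw PCM frames pushed by a stream callback, then emits WAV."""

    def __init__(self):
        self._chunks = []

    def callback(self, indata, frames, time_info, status):
        # In the real tool this would be the sounddevice InputStream
        # callback, where indata is an int16 buffer. Here we copy the bytes.
        self._chunks.append(bytes(indata))

    def to_wav(self) -> io.BytesIO:
        """Serialize the buffered frames as a 16-bit mono WAV in memory."""
        buf = io.BytesIO()
        with wave.open(buf, "wb") as wf:
            wf.setnchannels(1)          # mono
            wf.setsampwidth(2)          # 16-bit PCM
            wf.setframerate(SAMPLE_RATE)
            wf.writeframes(b"".join(self._chunks))
        buf.seek(0)
        return buf

# Hypothetical wiring (requires the sounddevice package):
# import sounddevice as sd
# rec = Recorder()
# with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="int16",
#                     callback=rec.callback):
#     ...  # record while the hotkey is held
```

Keeping the WAV entirely in a BytesIO avoids touching disk between capture and the transcription request.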

2. Transcription follows a Groq-first-with-local-fallback strategy: it tries the Groq API (whisper-large-v3-turbo) with retries and exponential backoff on 429/5xx errors, and falls back to local faster-whisper (distil-large-v3, CPU int8, VAD filter enabled) on any failure.
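A minimal sketch of that strategy, with the Groq and faster-whisper calls abstracted as injected callables; the function names, retry counts, and delays here are assumptions, not the tool's actual values:

```python
import time

RETRYABLE = {429, 500, 502, 503, 504}  # rate limits and server errors

class TransientAPIError(Exception):
    """Stand-in for an HTTP error from the transcription endpoint."""
    def __init__(self, status):
        self.status = status

def transcribe(wav_bytes, groq_call, local_call, retries=3, base_delay=0.5):
    """Groq-first transcription with exponential backoff and local fallback.

    groq_call / local_call are injected for illustration; the real tool
    calls the Groq whisper-large-v3-turbo endpoint and faster-whisper.
    """
    for attempt in range(retries):
        try:
            return groq_call(wav_bytes)
        except TransientAPIError as e:
            if e.status not in RETRYABLE:
                break  # non-retryable API error: go straight to fallback
            if attempt < retries - 1:
                time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s...
        except Exception:
            break  # network/auth failure: fall back immediately
    return local_call(wav_bytes)  # local faster-whisper path
```

Injecting the two backends as callables keeps the retry policy testable without network access.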

3. Intent classification is pure keyword matching -- no LLM in the loop. It checks for action keywords (open, find, search, grep, copy) and returns 'act' or 'ask'. Action planning maps transcripts to six tools: find_file (ripgrep), search_repo, open_file, open_url, open_app, copy_text.
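The classifier amounts to a set-membership check over the transcript's words. A hedged sketch -- the tool's exact keyword list and matching rules may differ:

```python
# Action vocabulary from the description above; the real list may be longer.
ACTION_KEYWORDS = {"open", "find", "search", "grep", "copy"}

def classify_intent(transcript: str) -> str:
    """Return 'act' if any action keyword appears in the utterance, else 'ask'."""
    words = transcript.lower().split()
    return "act" if any(w in ACTION_KEYWORDS for w in words) else "ask"
```

Because this is a set lookup per word, classification is effectively free compared to the transcription step itself.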

4. Text injection uses xclip for X11 and wl-copy for Wayland. App-aware cleanup applies context-specific transformations. Custom dictionaries and snippets are stored in ~/.config/undertone/.
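Session detection plus clipboard dispatch might look like the following sketch. pick_clipboard_cmd is a hypothetical helper, and the real tool may also type or paste at the cursor after copying:

```python
import os
import shutil
import subprocess

def pick_clipboard_cmd(env, available):
    """Choose the clipboard command for the session (pure, testable helper)."""
    if env.get("WAYLAND_DISPLAY") and "wl-copy" in available:
        return ["wl-copy"]
    if "xclip" in available:
        return ["xclip", "-selection", "clipboard"]
    return None

def inject_text(text: str) -> None:
    """Copy text to the clipboard via wl-copy (Wayland) or xclip (X11)."""
    available = {t for t in ("wl-copy", "xclip") if shutil.which(t)}
    cmd = pick_clipboard_cmd(os.environ, available)
    if cmd is None:
        raise RuntimeError("no clipboard tool found (install wl-clipboard or xclip)")
    subprocess.run(cmd, input=text.encode(), check=True)
```

Separating command selection from the subprocess call lets the Wayland/X11 logic be tested without a display server.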

5. The TTS system has five backends (three local: Qwen3-TTS, Supertonic v2 ONNX, KittenTTS Mini; two hosted: Inworld, ElevenLabs) with a built-in bakeoff function for side-by-side comparison.
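The bakeoff idea reduces to running every backend on the same text and collecting the results. The interface below (a name-to-callable mapping) is an assumption for illustration, not the tool's actual API:

```python
def bakeoff(text, backends):
    """Synthesize the same text with every registered TTS backend.

    `backends` maps a backend name to a synth callable returning audio
    bytes. Exceptions are captured per backend so one failing engine
    doesn't abort the whole comparison.
    """
    results = {}
    for name, synth in backends.items():
        try:
            results[name] = synth(text)
        except Exception as exc:
            results[name] = exc  # record the failure for side-by-side review
    return results
```

In practice the caller would then play each result back-to-back to judge voice quality and latency.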

Design Decisions

Groq API as primary over local-only transcription
Groq's Whisper endpoint returns in ~200ms for typical utterances, while local faster-whisper takes 1-3s on CPU. For voice typing, that latency difference is the difference between feeling instant and feeling slow. The local fallback ensures it works offline or when Groq is down.
Keyword-based intent classification over LLM routing
Adding an LLM call for intent classification would add 500ms+ latency to every utterance. Keyword matching is instant and covers the action vocabulary (open, find, search, copy) with near-perfect accuracy for this use case.
Half-duplex (push-to-talk) over continuous listening
Continuous listening requires always-on VAD and uses significant CPU/battery. Push-to-talk is explicit, private (no accidental recording), and simpler to implement correctly. Full-duplex is planned for v2.

Tech Stack

Python · Groq Whisper API · faster-whisper (local fallback) · sounddevice · pynput (hotkeys) · xclip / wl-copy · PyPI (published)