Midpath | Speech To Text Bridge

Simple and elegant voice typing on your laptop using your phone.

About STT Bridge

Laptop microphones are mediocre and prone to picking up CPU fan noise. Phone keyboards have had excellent speech-to-text built in for years. The STT Bridge exploits that gap: it serves a tiny web page from your laptop, you open it on your phone, and whatever your phone's voice keyboard types into that page is replayed on your laptop via xdotool. No cloud service, no app install on your phone, no special driver - just a browser tab.

Get started with a one-liner on Linux


# Download pre-built binary
curl -L https://midpath.in/cdn/sttbridge/stt -o stt && chmod +x stt
./stt # listens on :8080 by default

Here's a demo:

Demo: voice dictation on the phone appearing live in a desktop application.

Why not X?

Tool	Why / Why not?
STT Bridge	Single binary on laptop, quick and good.
KDE Connect	Daemon on laptop, app on phone. Keyboard event transmission / configuration issues.
Remote Mouse / similar	Invasive install on laptop. Ads on phone app.
whisper.cpp / similar	Complex setup. Often runs into NVIDIA issues. Hard to share with teammates.

KDE Connect is great, but it requires installing an app on the phone and a daemon on the laptop, and it is tightly coupled to the KDE ecosystem. Remote Mouse (and similar tools) require a proprietary app on the phone. whisper.cpp runs an ASR model locally on the laptop, which is quite powerful, but only if you can get it running.

STT Bridge has none of these trade-offs: the phone's built-in voice keyboard already runs a fast, high-quality on-device model; we just pipe its output to the laptop over HTTP.

How it works

The whole system is a single Go binary (built from ~200 lines of source) that does two things: serve a web UI and translate incoming text into X11 keystrokes.
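
To make those two jobs concrete, here is a minimal sketch of what such a server could look like. It is not the actual STT Bridge source: the page markup, the newMux helper, and the plain-text request body are assumptions made for illustration. Only the POST /type endpoint and the :8080 default come from this page.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

// page is a hypothetical stand-in for the web UI: a single textarea whose
// full value is posted to /type on every input event.
const page = `<!doctype html>
<meta name="viewport" content="width=device-width, initial-scale=1">
<textarea id="t" rows="10" style="width:100%" autofocus></textarea>
<script>
const t = document.getElementById("t");
t.addEventListener("input", () => fetch("/type", {method: "POST", body: t.value}));
</script>`

// newMux wires up both jobs. onText receives the full textarea value from
// each POST; a real bridge would diff it and call xdotool at that point.
func newMux(onText func(text string)) *http.ServeMux {
	mux := http.NewServeMux()

	// Job 1: serve the web UI.
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/" {
			http.NotFound(w, r)
			return
		}
		w.Header().Set("Content-Type", "text/html; charset=utf-8")
		fmt.Fprint(w, page)
	})

	// Job 2: accept text from the phone.
	mux.HandleFunc("/type", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "POST only", http.StatusMethodNotAllowed)
			return
		}
		// Cap the body at 1 MiB so a stray client cannot exhaust memory.
		body, err := io.ReadAll(io.LimitReader(r.Body, 1<<20))
		if err != nil {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		onText(string(body))
		w.WriteHeader(http.StatusNoContent)
	})
	return mux
}

func main() {
	mux := newMux(func(text string) {
		// Placeholder: log instead of sending keystrokes.
		log.Printf("received %d bytes of text", len(text))
	})
	log.Println("listening on :8080")
	log.Fatal(http.ListenAndServe(":8080", mux))
}
```

Passing the text handler in as a callback keeps the HTTP layer separate from the keystroke layer, so the sketch runs on a machine without xdotool.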

Nice things in STT Bridge

Type into anything

Because keystrokes are injected at the X11 level through xdotool, whatever application has focus receives them - exactly as if you had typed on a physical keyboard. This means you can dictate into:

Websites - Notion, WhatsApp Web, Gmail, GitHub comments
Chat tools - Slack, Discord, Mattermost (browser or Electron)
Terminal emulators - bash, zsh, fish; works perfectly with shell history and readline
Editors - Vim, Neovim, VS Code, Emacs (insert mode for Vim, obviously)
Office suites - LibreOffice Writer, Google Docs in the browser, Microsoft Word via Wine
Games - anything that reads keyboard input from the X server

Focus the window, switch to your phone, dictate. That's it.

Because the keystrokes land exactly like real keyboard input, you can freely mix voice and keyboard typing in the same document. Type a sentence on your laptop, continue with a voice paragraph on your phone, then jump back to the keyboard to fix a word - all without switching context in the application. This makes STT Bridge genuinely powerful for long-form writing, coding comments, and chat: use whichever input mode is faster for the next thing you want to say.

The phone does a LOT of stuff!

Modern on-device STT keyboards (Google Gboard, Samsung Keyboard, SwiftKey, etc.) do a lot of heavy lifting that you would otherwise need to implement yourself in a model-based pipeline:

Punctuation. Say "full stop" or simply pause and the keyboard inserts a full stop (.). Say "comma", "question mark", "exclamation mark" and they appear as the correct character.
Capitalisation. The first word after a sentence-ending punctuation is automatically capitalised. Proper nouns and "I" are handled too.
Auto-correction. Minor mispronunciations are silently corrected against a language model baked into the keyboard.
Newlines. Say "new line" or "new paragraph" and the keyboard emits the appropriate whitespace.
Numbers and symbols. "Twenty three" becomes 23; many keyboards handle currency, percentages, and common symbols by voice.

All of this happens before the text even reaches the server - you get clean, formatted output without writing a single line of post-processing code.

The bridge does diffs, leading to consistent typing

Mobile STT keyboards don't emit individual keystrokes; they replace the entire textarea value on each recognition event. The server keeps the previous text it received. On each POST /type it computes a longest-common-prefix diff:

Find where the old and new strings first diverge.
Send xdotool key BackSpace once per character that was deleted.
Send xdotool type for the characters that were added.

This keeps keystrokes minimal and avoids retyping the entire transcript on every update. A mutex serialises concurrent requests so xdotool calls never interleave.
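
The three steps can be sketched in Go roughly as follows. This is an illustration under assumptions, not the project's actual code: the diff function, the bridge type, and its apply method are hypothetical names. The xdotool invocations are the two named above, plus the conventional "--" separator so that dictated text starting with a dash is not parsed as an option.

```go
package main

import (
	"fmt"
	"os/exec"
	"sync"
)

// diff compares the previous and new textarea contents and returns how many
// trailing characters to erase and which text to type afterwards.
// It compares runes, so a multi-byte character counts as one BackSpace.
func diff(prev, next string) (backspaces int, typed string) {
	p, n := []rune(prev), []rune(next)
	i := 0
	for i < len(p) && i < len(n) && p[i] == n[i] {
		i++ // still inside the common prefix
	}
	return len(p) - i, string(n[i:])
}

// bridge remembers the last text received; mu serialises xdotool calls so
// concurrent requests never interleave their keystrokes.
type bridge struct {
	mu   sync.Mutex
	prev string
}

// apply replays the change from b.prev to next in the focused window.
func (b *bridge) apply(next string) error {
	b.mu.Lock()
	defer b.mu.Unlock()

	backspaces, typed := diff(b.prev, next)
	for i := 0; i < backspaces; i++ {
		if err := exec.Command("xdotool", "key", "BackSpace").Run(); err != nil {
			return err
		}
	}
	if typed != "" {
		// "--" ends option parsing so text beginning with "-" is typed literally.
		if err := exec.Command("xdotool", "type", "--", typed).Run(); err != nil {
			return err
		}
	}
	b.prev = next
	return nil
}

func main() {
	// The keyboard corrected "wrold" to "world" and appended a word.
	fmt.Println(diff("hello wrold", "hello world today")) // prints: 4 orld today
}
```

In the example, the two strings share the prefix "hello w", so the bridge sends four BackSpace presses and then types "orld today" instead of retyping all 17 characters.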

Keep your keys secure! Use over VPN on public nets

By default the server listens on your LAN. If you want to dictate from your phone while your laptop is on a different network - say, your laptop is connected to public Wi-Fi and your phone is on mobile data - you can still get it to work: install a VPN like Tailscale on both devices. Tailscale assigns each device a stable private IP (e.g. 100.x.y.z) that works regardless of which physical network either device is on. Point your phone's browser at http://<tailscale-ip>:8080 and everything works exactly the same way, with end-to-end WireGuard encryption for free.

No port forwarding, no dynamic DNS, no VPN configuration files - just install Tailscale and go.

Running it

Single binary

Install xdotool by following its installation instructions. Then download the stt binary and run it.


# Download pre-built binary
curl -L https://midpath.in/cdn/sttbridge/stt -o stt && chmod +x stt
./stt # listens on :8080 by default

Open http://<your-laptop-ip>:8080 on your phone, and start typing away!
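
If you are not sure which address to type into the phone, a small helper along these lines can list the candidates. It is a hypothetical sketch rather than part of STT Bridge: isTailscaleAddr and phoneURL are made-up names, port 8080 is the default mentioned above, and the Tailscale label is only a hint based on the 100.64.0.0/10 range, which other carrier-grade NAT setups also use.

```go
package main

import (
	"fmt"
	"net"
	"net/netip"
	"strconv"
)

// cgnat is 100.64.0.0/10, the range Tailscale assigns device addresses from.
var cgnat = netip.MustParsePrefix("100.64.0.0/10")

// isTailscaleAddr reports whether addr falls inside 100.64.0.0/10.
// It returns false for anything that does not parse as an IP address.
func isTailscaleAddr(addr string) bool {
	ip, err := netip.ParseAddr(addr)
	return err == nil && cgnat.Contains(ip.Unmap())
}

// phoneURL builds the address to open in the phone's browser.
func phoneURL(addr string, port int) string {
	return "http://" + net.JoinHostPort(addr, strconv.Itoa(port))
}

func main() {
	addrs, err := net.InterfaceAddrs()
	if err != nil {
		fmt.Println("could not list interfaces:", err)
		return
	}
	for _, a := range addrs {
		ipnet, ok := a.(*net.IPNet)
		if !ok || ipnet.IP.IsLoopback() || ipnet.IP.To4() == nil {
			continue // skip loopback and IPv6 to keep the list short
		}
		label := "local network"
		if isTailscaleAddr(ipnet.IP.String()) {
			label = "likely Tailscale"
		}
		fmt.Printf("%-17s %s\n", label, phoneURL(ipnet.IP.String(), 8080))
	}
}
```

The list may also include virtual interfaces (for example a Docker bridge), so pick the address on the network your phone can actually reach.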

Docker

You can put the binary into a Docker container along with xdotool to avoid installing xdotool on your system entirely. Docker also lets you remap ports. We provide the output of docker image save, which you can load directly:

curl -L https://midpath.in/cdn/sttbridge/stt.tar.gz | docker load

Then allow Docker to send keystrokes to your X session and run the container:

xhost +local:docker # To allow X session keystrokes

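# Comments cannot follow a line-continuation backslash, so the flags are explained here:
#   -p 39215:8080       remap ports in case 8080 is occupied
#   --restart always    auto start on reboot (Docker rejects this together with --rm)
#   -e DISPLAY          pass in the display
#   -v /tmp/.X11-unix   share the socket used for X11 communication
docker run \
  -p 39215:8080 \
  --restart always \
  -e DISPLAY="$DISPLAY" \
  -v /tmp/.X11-unix:/tmp/.X11-unix \
  midpath_stt_bridge:latest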

Limitations / Future

Linux/X11 only. xdotool does not work on Wayland (yet) or macOS. Maybe add support for other systems based on demand?
STT quality depends entirely on the phone's keyboard - Google Gboard works very well so far, and phone keyboards are expected to keep improving over time.
The diff is prefix-only. Mid-word autocorrections that change an earlier part of the string will cause extra backspaces followed by a retype of the tail.

Support this work

STT Bridge is MIT-licensed and free to use, fork, and modify. If it saves you time or you just want to encourage more tools like this, consider supporting us by paying for this tool.

Every contribution helps us spend more time on open, dependency-light developer tools.
