voicemode-windows
Verified Safe
by cescroca1976
Overview
An MCP server enabling real-time voice interaction (Speech-to-Text and Text-to-Speech) for AI agents, integrating local services like Whisper and Kokoro, with configurable cloud fallback (OpenAI).
Installation
python -m voice_mode
Environment Variables
- OPENAI_API_KEY
- VOICEMODE_WHISPER_MODEL
- VOICEMODE_WHISPER_PORT
- VOICEMODE_KOKORO_PORT
- VOICEMODE_TTS_BASE_URLS
- VOICEMODE_STT_BASE_URLS
- VOICEMODE_AUDIO_FEEDBACK
- VOICEMODE_DISABLE_SILENCE_DETECTION
- VOICEMODE_VAD_AGGRESSIVENESS
- VOICEMODE_DEFAULT_LISTEN_DURATION
- VOICEMODE_SERVICE_AUTO_ENABLE
- VOICEMODE_SKIP_TTS
- VOICEMODE_PRONOUNCE
- VOICEMODE_PRONUNCIATION_ENABLED
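To review which of the variables listed above are set before launching the server, a minimal sketch along these lines may help. It only reads the variable names from this listing and assumes no default values; the `mask` and `report` helpers are hypothetical and are not part of voicemode-windows.

```python
import os

# Variable names copied from the listing above; nothing here is imported
# from voice_mode itself.
VOICEMODE_VARS = [
    "OPENAI_API_KEY",
    "VOICEMODE_WHISPER_MODEL",
    "VOICEMODE_WHISPER_PORT",
    "VOICEMODE_KOKORO_PORT",
    "VOICEMODE_TTS_BASE_URLS",
    "VOICEMODE_STT_BASE_URLS",
    "VOICEMODE_AUDIO_FEEDBACK",
    "VOICEMODE_DISABLE_SILENCE_DETECTION",
    "VOICEMODE_VAD_AGGRESSIVENESS",
    "VOICEMODE_DEFAULT_LISTEN_DURATION",
    "VOICEMODE_SERVICE_AUTO_ENABLE",
    "VOICEMODE_SKIP_TTS",
    "VOICEMODE_PRONOUNCE",
    "VOICEMODE_PRONUNCIATION_ENABLED",
]


def mask(value: str, visible: int = 4) -> str:
    """Hypothetical helper: hide all but the last `visible` characters."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


def report(env=os.environ) -> dict:
    """Map each known variable name to a printable value."""
    shown = {}
    for name in VOICEMODE_VARS:
        value = env.get(name)
        if value is None:
            shown[name] = "<unset>"
        elif name == "OPENAI_API_KEY":
            # Never print the key itself, only a masked form.
            shown[name] = mask(value)
        else:
            shown[name] = value
    return shown


if __name__ == "__main__":
    for name, value in report().items():
        print(f"{name}={value}")
```

Passing a plain dictionary instead of `os.environ` makes the report easy to check without touching the real environment.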
Security Notes
The server makes extensive `subprocess` calls to install and manage external tools (git, cmake, package managers, whisper.cpp, kokoro-fastapi), which broadens the attack surface. Its security therefore depends heavily on the integrity of those external projects and on the user's system configuration. By default, `whisper-server` binds to `0.0.0.0`, so it can be reached from other machines on the network unless a firewall blocks it. The `OPENAI_API_KEY` is read from an environment variable and masked in logs and outputs. No direct `eval` calls or malicious patterns were found, and `shlex.split` is used for parsing rules. The test suite includes measures to prevent dangerous commands, which indicates developer awareness of subprocess risks.
Similar Servers
voicemode
Provides robust voice interaction capabilities for Model Context Protocol (MCP) agents, enabling real-time speech-to-text (STT) and text-to-speech (TTS) functionalities, with support for local and cloud-based services. It also includes tools for audio playback (DJ), service management, and diagnostics.
mcp-discord
Enables AI assistants to interact with the Discord platform by providing a set of Discord-related functionalities via the Model Context Protocol (MCP).
stt-mcp-server-linux
Local speech-to-text server for Linux, designed to integrate with Claude Code via the MCP protocol or run in standalone mode to inject transcribed text into a Tmux session.
glm-asr
An all-in-one service for high-accuracy speech recognition (ASR) across multiple languages, featuring Web UI, REST API, SSE streaming, and MCP server integration.