Voice & TTS

Hermes Agent supports both text-to-speech output and voice message transcription across all messaging platforms.

Text-to-Speech

Convert text to speech with five providers:

| Provider | Quality | Cost | API Key |
| --- | --- | --- | --- |
| Edge TTS (default) | Good | Free | None needed |
| ElevenLabs | Excellent | Paid | ELEVENLABS_API_KEY |
| OpenAI TTS | Good | Paid | VOICE_TOOLS_OPENAI_KEY |
| MiniMax TTS | Excellent | Paid | MINIMAX_API_KEY |
| NeuTTS | Good | Free | None needed |

Platform Delivery

| Platform | Delivery | Format |
| --- | --- | --- |
| Telegram | Voice bubble (plays inline) | Opus .ogg |
| Discord | Voice bubble (Opus/OGG), falls back to file attachment | Opus/MP3 |
| WhatsApp | Audio file attachment | MP3 |
| CLI | Saved to ~/.hermes/audio_cache/ | MP3 |

Configuration

# In ~/.hermes/config.yaml
tts:
  provider: "edge"  # "edge" | "elevenlabs" | "openai" | "minimax" | "neutts"
  edge:
    voice: "en-US-AriaNeural"  # 322 voices, 74 languages
  elevenlabs:
    voice_id: "pNInz6obpgDQGcFmaJgB"  # Adam
    model_id: "eleven_multilingual_v2"
  openai:
    model: "gpt-4o-mini-tts"
    voice: "alloy"  # alloy, echo, fable, onyx, nova, shimmer
    base_url: "https://api.openai.com/v1"  # Override for OpenAI-compatible TTS endpoints
  minimax:
    model: "speech-2.8-hd"  # speech-2.8-hd (default), speech-2.8-turbo
    voice_id: "English_Graceful_Lady"  # See https://platform.minimax.io/faq/system-voice-id
    speed: 1  # 0.5 - 2.0
    vol: 1  # 0 - 10
    pitch: 0  # -12 - 12
  neutts:
    ref_audio: ''
    ref_text: ''
    model: neuphonic/neutts-air-q4-gguf
    device: cpu
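
The comments on the MiniMax settings give a numeric range for speed, vol, and pitch. As a rough illustration, those ranges could be checked before a request is sent. The function name `validate_minimax` is hypothetical and not part of Hermes, and treating the bounds as inclusive is an assumption.

```python
# Hypothetical validator; the ranges are copied from the config comments above.
# Treating both ends of each range as inclusive is an assumption.
MINIMAX_RANGES = {
    "speed": (0.5, 2.0),
    "vol": (0, 10),
    "pitch": (-12, 12),
}


def validate_minimax(settings: dict) -> list[str]:
    """Return a list of problems; the list is empty when every value is in range."""
    problems = []
    for key, (low, high) in MINIMAX_RANGES.items():
        if key not in settings:
            continue  # key not set, nothing to check
        value = settings[key]
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{key}: expected a number, got {value!r}")
        elif not low <= value <= high:
            problems.append(f"{key}: {value} is outside {low} to {high}")
    return problems
```

Calling `validate_minimax({"speed": 1, "vol": 1, "pitch": 0})` on the defaults shown above returns an empty list.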

Telegram Voice Bubbles & ffmpeg

Telegram voice bubbles require audio in Opus/OGG format:

  • OpenAI and ElevenLabs produce Opus natively, so no extra setup is needed
  • Edge TTS (default) outputs MP3 and needs ffmpeg to convert it
  • MiniMax TTS outputs MP3 and needs ffmpeg to convert it
  • NeuTTS outputs WAV and needs ffmpeg to convert it

Install ffmpeg with your system's package manager:

# Ubuntu/Debian
sudo apt install ffmpeg

# macOS
brew install ffmpeg

# Fedora
sudo dnf install ffmpeg

Without ffmpeg, Edge TTS, MiniMax TTS, and NeuTTS audio are sent as regular audio files (playable, but shown as a rectangular player instead of a voice bubble).
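
The conversion step and its fallback can be sketched in a few lines of Python. This is an illustration rather than Hermes' actual code: the function names and the 32k bitrate are assumptions, and it relies only on ffmpeg's standard `-i`, `-c:a libopus`, and `-b:a` options.

```python
import shutil
import subprocess
from pathlib import Path


def opus_command(src: Path, dst: Path, bitrate: str = "32k") -> list[str]:
    """Build an ffmpeg command that re-encodes src as Opus in an OGG container."""
    return ["ffmpeg", "-y", "-i", str(src), "-c:a", "libopus", "-b:a", bitrate, str(dst)]


def to_voice_bubble(src: Path) -> Path:
    """Return the converted .ogg path, or the original file if conversion is not possible."""
    if shutil.which("ffmpeg") is None:
        return src  # no ffmpeg on PATH: deliver as a regular audio file
    dst = src.with_suffix(".ogg")
    result = subprocess.run(opus_command(src, dst), capture_output=True)
    # Fall back to the original file if ffmpeg reported an error.
    return dst if result.returncode == 0 and dst.exists() else src
```

Returning the original path on any failure mirrors the behavior described above: the audio still reaches the user, just without the voice-bubble presentation.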

Tip

If you want voice bubbles without installing ffmpeg, switch to the OpenAI or ElevenLabs provider.

Voice Message Transcription (STT)

Voice messages sent on Telegram, Discord, WhatsApp, Slack, or Signal are automatically transcribed and injected as text into the conversation. The agent sees the transcript as normal text.

| Provider | Quality | Cost | API Key |
| --- | --- | --- | --- |
| Local Whisper (default) | Good | Free | None needed |
| Groq Whisper API | Good–Best | Free tier | GROQ_API_KEY |
| OpenAI Whisper API | Good–Best | Paid | VOICE_TOOLS_OPENAI_KEY or OPENAI_API_KEY |

Zero Config

Local transcription works out of the box when faster-whisper is installed. If that's unavailable, Hermes can also use a local whisper CLI from common install locations (like /opt/homebrew/bin) or a custom command via HERMES_LOCAL_STT_COMMAND.

Configuration

# In ~/.hermes/config.yaml
stt:
  provider: "local"  # "local" | "groq" | "openai" | "mistral"
  local:
    model: "base"  # tiny, base, small, medium, large-v3
  openai:
    model: "whisper-1"  # whisper-1, gpt-4o-mini-transcribe, gpt-4o-transcribe
  mistral:
    model: "voxtral-mini-latest"  # voxtral-mini-latest, voxtral-mini-2602

Provider Details

Local (faster-whisper) — Runs Whisper locally via faster-whisper. Uses CPU by default, GPU if available. Model sizes:

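| Model | Size | Speed | Quality |
| --- | --- | --- | --- |
| tiny | ~75 MB | Fastest | Basic |
| base | ~150 MB | Fast | Good (default) |
| small | ~500 MB | Medium | Better |
| medium | ~1.5 GB | Slower | Great |
| large-v3 | ~3 GB | Slowest | Best |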

Groq API — Requires GROQ_API_KEY. Good cloud fallback when you want a free hosted STT option.

OpenAI API — Accepts VOICE_TOOLS_OPENAI_KEY first and falls back to OPENAI_API_KEY. Supports whisper-1, gpt-4o-mini-transcribe, and gpt-4o-transcribe.
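
The key lookup order for the OpenAI provider can be pictured as a small helper. This is a sketch only: `resolve_openai_key` is a hypothetical name, and treating an empty or whitespace-only value as unset is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_openai_key(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the first non-empty key, preferring VOICE_TOOLS_OPENAI_KEY."""
    for name in ("VOICE_TOOLS_OPENAI_KEY", "OPENAI_API_KEY"):
        value = env.get(name, "").strip()
        if value:  # assumption: blank values count as "not set"
            return value
    return None
```

Passing the environment as a parameter keeps the helper easy to exercise with a plain dict instead of real environment variables.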

Mistral API (Voxtral Transcribe) — Requires MISTRAL_API_KEY. Uses Mistral's Voxtral Transcribe models. Supports 13 languages, speaker diarization, and word-level timestamps. Install with pip install hermes-agent[mistral].

Custom local CLI fallback — Set HERMES_LOCAL_STT_COMMAND if you want Hermes to call a local transcription command directly. The command template supports {input_path}, {output_dir}, {language}, and {model} placeholders.
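
To make the placeholders concrete, here is one way such a template could be expanded into an argument list. How Hermes actually fills and runs the template is not documented here, so `render_stt_command`, the use of `str.format`, and the example template are all assumptions for illustration.

```python
import shlex


def render_stt_command(template: str, input_path: str, output_dir: str,
                       language: str = "en", model: str = "base") -> list[str]:
    """Fill the four placeholders and split the result into an argv list."""
    filled = template.format(
        # Quote each value so paths with spaces survive the split below.
        input_path=shlex.quote(input_path),
        output_dir=shlex.quote(output_dir),
        language=shlex.quote(language),
        model=shlex.quote(model),
    )
    return shlex.split(filled)


# Illustrative template only; adjust it to the CLI you actually have installed.
EXAMPLE_TEMPLATE = (
    "whisper {input_path} --model {model} "
    "--language {language} --output_dir {output_dir}"
)
```

With `EXAMPLE_TEMPLATE`, an input path such as `/tmp/voice note.ogg` stays a single argument because it is quoted before the command is split.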

Fallback Behavior

If your configured provider isn't available, Hermes automatically falls back:

  • Local faster-whisper unavailable → Tries a local whisper CLI or HERMES_LOCAL_STT_COMMAND before cloud providers
  • Groq key not set → Falls back to local transcription, then OpenAI
  • OpenAI key not set → Falls back to local transcription, then Groq
  • Mistral key/SDK not set → Skipped in auto-detect; falls through to next available provider
  • Nothing available → Voice messages pass through with an accurate note to the user
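
The fallback rules above can be summarized as an ordered lookup. The sketch below is not Hermes' implementation: `pick_stt_provider` is a hypothetical name, "local" stands for any local option (faster-whisper, a whisper CLI, or HERMES_LOCAL_STT_COMMAND), and the order of cloud providers in the local and mistral rows is an assumption, since the list does not specify it.

```python
from typing import Optional

# Order tried for each configured provider. The groq and openai rows follow the
# list above; the local and mistral rows are assumptions.
FALLBACK_ORDER = {
    "local": ["local", "groq", "openai"],
    "groq": ["groq", "local", "openai"],
    "openai": ["openai", "local", "groq"],
    "mistral": ["mistral", "local", "groq", "openai"],
}


def pick_stt_provider(configured: str, available: set[str]) -> Optional[str]:
    """Return the first usable provider, or None when nothing is available."""
    for candidate in FALLBACK_ORDER.get(configured, ["local"]):
        if candidate in available:
            return candidate
    return None  # caller passes the voice message through with a note
```

For example, `pick_stt_provider("groq", {"local", "openai"})` selects "local", matching the rule that a missing Groq key falls back to local transcription before OpenAI.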