# AI Outbound Agent State

The `AIOutboundAgentState` extends the regular AI Agent state to automate outbound interactions (e.g., phone calls, chat messages, or messaging-app conversations) directly from a workflow. In addition to the usual LLM configuration, tools, and outcomes, the state lets you specify:

- Outbound channel details (phone, Zalo, WhatsApp, Telegram, …) via `outboundConfig`
- Realtime voice features (STT/TTS/VAD) via `voiceConfig`
## AIOutboundAgentState

| Parameter | Description | Type | Required |
| --- | --- | --- | --- |
| agentName | The name of the agent. | string | yes |
| aiModel | The name of the AI language model. Default is 'gpt-4o'. | string | no |
| llmConfig | The configuration for the language model. See LLMConfig. | object | no |
| systemMessage | The system message used for constructing the LLM prompt. Defaults to "You are a helpful AI Assistant." | string | yes |
| userMessage | The user message. | string | yes |
| maxToolExecutions | The maximum number of tool executions. Default is 10. | integer | no |
| memory | The memory of the agent. If not specified, the workflow process instance scope is used. See ChatMemory. | object | no |
| output | JSON schema for agent data output. See AgentDataOutput. | object | yes |
| tools | Defines the list of tools. Each tool is described by the ToolForAI schema. | array | no |
| onAgentOutcomes | The list of agent outcomes. Each outcome is described by the OnAgentOutcome schema. | array | yes |
| stateDataFilter | Filter to apply to the state data. | string | no |
| outboundConfig | Channel-specific outbound settings. See OutboundConfig. | object | yes |
| voiceConfig | Voice features (STT, TTS, VAD) for realtime calls. See VoiceConfig. | object | no |
See also: LLMConfig, ChatMemory, AgentDataOutput, OnAgentOutcome, ToolForAI.
## OutboundConfig

The `OutboundConfig` defines the channel-specific outbound settings.

| Parameter | Description | Type | Required |
| --- | --- | --- | --- |
| greeting | The static greeting message used by the agent when it is first initialized. | string | no |
| greetingInstructions | Instructions for the LLM to use when generating the greeting message. Takes precedence over the static greeting message. | string | no |
| outboundTarget | The outbound target configuration, defining the target for the outbound agent. See OutboundTarget. | object | yes |
## OutboundTarget

The `OutboundTarget` defines the target for the outbound agent.

| Parameter | Description | Type | Required |
| --- | --- | --- | --- |
| targetType | The type of the outbound target: voice, zalo, whatsapp, etc. Default is voice. | string | yes |
| targetAddress | The address of the outbound target: an email address, phone number, Zalo ID, etc. The format depends on the target type. | string | yes |
| targetName | The name of the outbound target, used for identification purposes. | string | yes |
## VoiceConfig

`VoiceConfig` is the single block that tells the workflow how to listen, think, and speak during a telephone or voice-chat session. Because speech is a round-trip of audio → text → LLM → text → audio, `VoiceConfig` is split into four conceptual sub-modules, each matching one step in that loop:

- VAD – "Is anyone talking right now?"
- STT – "What did they just say?"
- LLM – "How should the agent respond?"
- TTS – "Say it out loud, in a human voice."

The pipeline looks like this:

audio in → VAD → STT → LLM → TTS → audio out

You can mix and match providers for every step; each has its own latency, cost, language coverage, and feature set.
### Why each component matters

#### Voice-Activity Detection (VAD)

**Purpose:** Detects the precise start and end of human speech in the inbound audio stream.

**Why it's critical:** If VAD fires too late you waste the caller's first syllables; if it fires too early you feed silence or background noise into STT and spend tokens on "uh …". Good VAD also enables barge-in (interrupting TTS mid-sentence) and double-talk detection.

Typical knobs inside the `vad` block (see the sketch after this list):

- provider name (`silero` is the default implementation)
- energy / probability thresholds
- timeouts for "no-speech" and "end-of-speech"
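A hedged sketch of such a `vad` block follows. Only `provider` is part of the documented VAD schema below; the threshold and timeout fields are hypothetical illustrations of the knobs just listed, and the actual names depend on the VAD implementation:

```yaml
vad:
  provider: silero          # documented field; 'silero' is the default
  # The fields below are hypothetical illustrations of typical knobs;
  # check your VAD implementation for the exact option names.
  activationThreshold: 0.5  # speech-probability threshold
  minSilenceDuration: 0.8   # seconds of silence treated as end-of-speech
```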
#### Speech-to-Text (STT)

**Purpose:** Transforms raw audio chunks into partial and final transcripts.

**Why it's critical:** Whatever the LLM "hears" comes from STT; recognition accuracy drives the entire conversational quality, and latency drives perceived responsiveness.

Key configuration areas:

- Provider & model – e.g. OpenAI Whisper large-v3, Deepgram Nova-2, Google STT tel-alpha
- Language/locale – supply a BCP-47 code like `vi-VN` or `en-US` so the model loads the right phoneme set
- Streaming vs batch – most providers stream; some cheaper models require a full clip upload
- Vocabulary bias / hints – business terms, proper names, SKU codes
- Post-processing – capitalization, profanity masking, punctuation injection
- Security – API key, private endpoint, or on-prem GPU deployment
#### Large-Language Model (LLM)

**Purpose:** Understands user intent, decides on tool calls, chooses the next action, and produces a textual reply (or a JSON payload if your state's `output` schema demands structured data).

Inside a voice agent the LLM sits in the tightest latency loop after STT, so choosing how the LLM delivers its tokens changes the entire user experience.

**Realtime LLM**

A realtime model ingests raw audio, reasons over it, and streams synthetic speech back without any external STT or TTS step.

What changes in the pipeline:

- No separate STT/TTS blocks. The model hears tone, hesitations, laughter: cues that are normally lost in a transcript.
- Built-in turn detection. Most providers decide when you've finished speaking; we recommend relying on that internal detector. If you want to fall back to the default turn detector, you must still bolt on an STT plugin so the detector can read interim transcripts.
**Non-realtime LLM (Classic)**

The classic voice-AI stack separates concerns: dedicated STT, LLM, and TTS components, each chosen and tuned independently.

Why it's still popular:

- Deterministic text flow. Every turn yields clean, timestamped transcripts, which is great for analytics, compliance, and post-call RAG pipelines.
- Fine-grained control. You choose best-of-breed STT, specialised LLM tooling, and premium or budget TTS per use-case.
- Script fidelity. A TTS engine will read a legal disclaimer exactly as written.

Trade-offs:

- Extra integration work and roughly 1-2 s of additional latency.
- Voices may sound less expressive unless you invest in neural styles.
#### Text-to-Speech (TTS)

**Purpose:** Transforms the LLM's textual reply into audio the caller hears.

**Why it's critical:** Humans judge "bot-ness" mainly by voice quality and timing. A 220 ms chunk-synthesis delay feels natural; 800 ms feels robotic.

TTS options worth documenting:

- Voice/character – Rachel, en-US-Wavenet-D, Alloy-en-v2
- Style & prosody controls – speaking rate, pitch, emotion, stability, pronunciation lexicons
- Streaming support – mandatory for realtime pipelines; optional for batch
- Silence trimming & filler – some providers auto-trim leading breaths; some insert breathing/fillers you may want to disable
- Bandwidth – telephony lines are 8 kHz mono; web or app can handle 22 kHz stereo
## VoiceConfig Properties

| Parameter | Description | Type | Required |
| --- | --- | --- | --- |
| stt | The speech-to-text configuration. See STT. | object | no |
| tts | The text-to-speech configuration. See TTS. | object | no |
| vad | The voice activity detection configuration. See VAD. | object | no |
| allowInterruptions | Whether to allow interruptions (barge-in) during the voice interaction. Default is false. | boolean | no |
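Putting the pieces together, a sketch of a `voiceConfig` block using the fields above; the provider, language, and voice values are illustrative, and the sub-blocks are detailed in the following sections:

```yaml
voiceConfig:
  allowInterruptions: true    # enable barge-in; default is false
  stt:
    provider: deepgram        # deepgram | openai | google | elevenlabs | fal | groq
    language: vi-VN
  tts:
    provider: google          # openai | deepgram | google | elevenlabs | groq
    voice: en-US-Wavenet-D
  vad:
    provider: silero          # the default provider
```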
## STT

The `STT` object defines the configuration for the Speech To Text (STT) engine used by the AI Agent.

| Parameter | Description | Type | Required |
| --- | --- | --- | --- |
| provider | The name of the STT provider. Allowed values: 'deepgram', 'openai', 'google', 'elevenlabs', 'fal', 'groq'. This determines which STT backend will be used. | string | yes |
| model | The model to use for speech recognition. Provider-specific. | string | no |
| language | The language code for recognition. Provider-specific, e.g. 'en-US', 'vi-VN'. | string | no |
| apiKey | The API key or credentials for the STT service. Required by most providers to authenticate requests. | string | no |
| baseUrl | The base URL for the STT service, used for custom endpoints or self-hosted deployments. Optional for most cloud providers. | string | no |
| providerOptions | Provider-specific configuration options for STT. Use this to supply additional settings required by your provider. | object | no |

Supported options by STT provider:

- deepgram:
- openai:
- google:
- elevenlabs: None
- fal: None
- groq: None
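For instance, a hedged `stt` sketch; the model name and API-key reference are illustrative, and `providerOptions` is left empty because the supported keys are provider-specific (see the list above):

```yaml
stt:
  provider: deepgram
  model: nova-2                    # provider-specific model name (illustrative)
  language: vi-VN                  # BCP-47 language code
  apiKey: "${DEEPGRAM_API_KEY}"    # illustrative secret reference
  providerOptions: {}              # Deepgram-specific settings go here
```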
## TTS

The `TTS` object defines the configuration for Text To Speech (TTS) used by the AI Agent.

| Parameter | Description | Type | Required |
| --- | --- | --- | --- |
| provider | The name of the TTS provider. Allowed values: 'openai', 'deepgram', 'google', 'elevenlabs', 'groq'. This determines which TTS backend will be used. | string | no |
| model | The model to use for TTS. Provider-specific. | string | no |
| voice | The voice to use for TTS. Provider-specific; may refer to a named voice (e.g. 'en-US-Wavenet-D' for Google, 'Rachel' for ElevenLabs). | string | no |
| language | The language code for TTS. Provider-specific; may affect pronunciation and available voices. | string | no |
| apiKey | The API key or credentials for the TTS service. Required by most providers to authenticate requests. | string | no |
| baseUrl | The base URL for the TTS service, used for custom endpoints or self-hosted deployments. Optional for most cloud providers. | string | no |
| providerOptions | Provider-specific configuration options for TTS. Use this to supply additional settings required by your provider. | object | no |

Supported options by TTS provider:

- openai:
- deepgram:
- google:
- elevenlabs:
- groq: None
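Similarly, a hedged `tts` sketch; the voice name and API-key reference are illustrative:

```yaml
tts:
  provider: elevenlabs
  voice: Rachel                      # provider-specific named voice
  language: en-US
  apiKey: "${ELEVENLABS_API_KEY}"    # illustrative secret reference
  providerOptions: {}                # ElevenLabs-specific settings go here
```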
## VAD

The `VAD` object defines the configuration for the Voice Activity Detection (VAD) used by the AI Agent.

| Parameter | Description | Type | Required |
| --- | --- | --- | --- |
| provider | The name of the provider. Default is 'silero'. | string | no |
Example:
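A minimal end-to-end sketch of an `AIOutboundAgentState`, assuming YAML workflow syntax; the state wrapper (`name`/`type`), agent name, target details, and provider choices are illustrative, and the required `output` and `onAgentOutcomes` blocks are omitted for brevity:

```yaml
- name: callCustomer
  type: AIOutboundAgentState         # assumed state wrapper; follow your workflow DSL
  agentName: order-reminder-agent
  aiModel: gpt-4o                    # the default model
  systemMessage: "You are a helpful AI Assistant."
  userMessage: "Remind the customer about their pending order."
  maxToolExecutions: 10              # the default limit
  outboundConfig:
    greetingInstructions: "Greet the customer by name in a friendly tone."
    outboundTarget:
      targetType: voice
      targetAddress: "+84901234567"  # illustrative phone number
      targetName: "Nguyen Van A"
  voiceConfig:
    allowInterruptions: true
    stt:
      provider: deepgram
      language: vi-VN
    tts:
      provider: google
      voice: en-US-Wavenet-D
    vad:
      provider: silero
```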
This document provides a detailed view of the `AIOutboundAgentState` state and its related objects, including comprehensive schema definitions, required fields, and descriptions for each attribute within the `AIOutboundAgentState` and associated schemas. This specification ensures clarity and completeness for integrating outbound AI agents within serverless workflows.