AI Outbound Agent State

The AIOutboundAgentState extends the regular AI Agent state to automate outbound interactions—e.g., phone calls, chat messages, or messaging-app conversations—directly from a workflow. In addition to the usual LLM configuration, tools, and outcomes, the state lets you specify:

  • Outbound channel details (phone, Zalo, WhatsApp, Telegram, …) via outboundConfig

  • Realtime voice features (STT/TTS/VAD) via voiceConfig

AIOutboundAgentState

| Parameter | Description | Type | Required |
| --- | --- | --- | --- |
| agentName | The name of the agent. | string | yes |
| aiModel | The name of the AI language model. Default value is 'gpt-4o'. | string | no |
| llmConfig | The configuration for the language model. See LLMConfig. | object | no |
| systemMessage | The system message used for constructing the LLM prompt. Defaults to "You are a helpful AI Assistant." | string | yes |
| userMessage | The user message. | string | yes |
| maxToolExecutions | The maximum number of tool executions. Default is 10. | integer | no |
| chatMemory | The memory of the agent. If not specified, the workflow process instance scope is used. See ChatMemory. | object | no |
| agentDataOutput | JSON schema for agent data output. See AgentDataOutput. | object | yes |
| tools | The list of tools. Each tool is described by the ToolForAI schema. | array | no |
| agentOutcomes | The list of agent outcomes. Each outcome is described by the OnAgentOutcome schema. | array | yes |
| stateDataFilter | Filter to apply to the state data. | string | no |
| outboundConfig | Channel-specific outbound settings. See OutboundConfig. | object | yes |
| voiceConfig | Voice features (STT, TTS, VAD) for realtime calls. See VoiceConfig. | object | no |

LLMConfig

The same as LLMConfig from AIAgent State

ChatMemory

The same as ChatMemory from AIAgent State

AgentDataOutput

The same as AgentDataOutput from AIAgent State

OnAgentOutcome

The same as OnAgentOutcome from AIAgent State

ToolForAI

The same as ToolForAI from AIAgent State

OutboundConfig

The OutboundConfig defines the channel-specific outbound settings.

| Property | Type | Description | Required |
| --- | --- | --- | --- |
| greeting | string | The static greeting message to be used by the agent when it is first initialized. | no |
| greetingInstructions | string | Instructions for the LLM to use when generating the greeting message. Takes precedence over the static greeting message. | no |
| outboundTarget | object | The outbound target configuration, defining the target for the outbound agent. See OutboundTarget. | yes |

OutboundTarget

The OutboundTarget defines the target for the outbound agent.

| Property | Type | Description | Required |
| --- | --- | --- | --- |
| targetType | string | The type of the outbound target. This can be voice, zalo, whatsapp, etc. Default is voice. | yes |
| targetAddress | string | The address of the outbound target. This can be an email address, phone number, Zalo ID, etc. The format depends on the target type. | yes |
| targetName | string | The name of the outbound target, used for identification purposes. | yes |
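
For example, an outboundConfig for a phone call might look like the sketch below. The outboundTarget property name is inferred from the OutboundTarget schema above, and the target values are illustrative:

```yaml
outboundConfig:
  greeting: "Hello! This is the delivery assistant from Example Corp."
  outboundTarget:                # property name inferred from the OutboundTarget schema
    targetType: voice            # default channel; also zalo, whatsapp, etc.
    targetAddress: "+84123456789"
    targetName: "Nguyen Van A"
```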

VoiceConfig

VoiceConfig is the single block that tells the workflow how to listen, think, and speak during a telephone or voice-chat session. Because speech is a round-trip of audio → text → LLM → text → audio, VoiceConfig is split into four conceptual sub-modules, each matching one step in that loop:

  1. VAD – “Is anyone talking right now?”

  2. STT – “What did they just say?”

  3. LLM – “How should the agent respond?”

  4. TTS – “Say it out loud—in a human voice.”

The pipeline looks like this: inbound audio → VAD → STT → LLM → TTS → outbound audio.

You can mix and match providers for every step; each has its own latency, cost, language coverage, and feature set.

Why each component matters

Voice-Activity Detection (VAD)

Purpose: Detects the precise start and end of human speech in the inbound audio stream. Why it’s critical: If VAD fires too late you waste the caller’s first syllables; if it fires too early you feed silence or background noise into STT and spend tokens on “uh …”. Good VAD also enables barge-in (interrupting TTS mid-sentence) and double-talk detection.

Typical knobs inside the vad block

  • provider name (silero is the default implementation)

  • energy / probability thresholds

  • timeouts for “no-speech” and “end-of-speech”
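
For instance, a vad block might look like the following sketch. Only the provider property is formally documented in the VAD schema later on this page, so the tuning knobs are illustrative assumptions about provider-specific settings:

```yaml
voiceConfig:
  vad:
    provider: silero             # default implementation
    providerOptions:             # hypothetical tuning knobs, provider-dependent
      activationThreshold: 0.5   # assumed speech-probability threshold
      minSilenceDuration: 0.8    # assumed end-of-speech timeout, in seconds
```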

Speech-to-Text (STT)

Purpose: Transforms raw audio chunks into partial and final transcripts. Why it’s critical: Whatever the LLM “hears” comes from STT; recognition accuracy drives the entire conversational quality. Latency drives perceived responsiveness.

Key configuration areas

  • Provider & model – e.g. OpenAI Whisper large-v3, Deepgram Nova-2, Google STT tel-alpha

  • Language/locale – supply a BCP-47 code like vi-VN or en-US so the model loads the right phoneme set

  • Streaming vs batch – most providers stream; some cheaper models require a full clip upload

  • Vocabulary bias / hints – business terms, proper names, SKU codes

  • Post-processing – capitalization, profanity masking, punctuation injection

  • Security – API key, private endpoint, or on-prem GPU deployment
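
A minimal stt block using the documented STT properties (see the STT schema below) might look like this; the model name and language are illustrative:

```yaml
voiceConfig:
  stt:
    provider: deepgram           # deepgram, openai, google, elevenlabs, fal, or groq
    model: nova-2                # provider-specific model name
    language: vi-VN              # BCP-47 locale
    apiKey: "<your-api-key>"
```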

Large-Language Model (LLM)

Purpose: Understands user intent, decides on tool calls, chooses the next action, and produces a textual reply (or a JSON payload if your state’s output schema demands structured data).

Inside a voice agent the LLM sits in the tightest latency loop after STT, so choosing how the LLM delivers its tokens changes the entire user experience:

  • Realtime LLM

    • A realtime model ingests raw audio, reasons over it, and streams synthetic speech back without any external STT or TTS step.

    • What changes in the pipeline

      • No separate STT/TTS blocks. The model hears tone, hesitations, laughter—cues that are normally lost in a transcript.

      • Built-in turn detection. Most providers decide when you’ve finished speaking; we recommend relying on that internal detector. If you want to fall back to the default turn detector, you must still bolt on an STT plugin so the detector can read interim transcripts.

      • No hard-scripted speech. You can cue the model with instructions, but you cannot guarantee it will read a line verbatim. For legally approved disclaimers, attach a conventional TTS plugin and use greetingInstructions for that segment.

  • Non-realtime LLM (Classic)

The classic voice-AI stack separates concerns: audio → STT → text → LLM → text → TTS → audio.

Why it’s still popular:

| Advantage | Implication |
| --- | --- |
| Deterministic text flow. Every turn yields clean, timestamped transcripts. | Great for analytics, compliance, and post-call RAG pipelines. |
| Fine-grained control. You choose best-of-breed STT, specialised LLM tooling, and premium or budget TTS per use case. | Extra integration work and ~1-2 s additional latency. |
| Script fidelity. A TTS engine will read a legal disclaimer exactly as written. | Voices may sound less expressive unless you invest in neural styles. |

Text-to-Speech (TTS)

Purpose: Transforms the LLM’s textual reply into audio the caller hears. Why it’s critical: Humans judge “bot-ness” mainly by voice quality and timing. A 220 ms chunk-synthesis delay feels natural; 800 ms feels robotic.

TTS options worth documenting

  • Voice/character – Rachel, en-US-Wavenet-D, Alloy-en-v2

  • Style & prosody controls – speaking rate, pitch, emotion, stability, pronunciation lexicons

  • Streaming support – mandatory for realtime pipelines; optional for batch

  • Silence trimming & filler – some providers auto-trim leading breaths; some insert breathing/fillers you may want to disable

  • Bandwidth – telephony lines are 8 kHz mono; web or app can handle 22 kHz stereo
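
A minimal tts block using the documented TTS properties might look like this; the voice name is provider-specific and shown only as an illustration:

```yaml
voiceConfig:
  tts:
    provider: elevenlabs         # openai, deepgram, google, elevenlabs, or groq
    voice: Rachel                # provider-specific voice name
    language: en-US
    apiKey: "<your-api-key>"
```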

VoiceConfig Properties

| Property | Type | Description | Required |
| --- | --- | --- | --- |
| stt | object | The speech-to-text configuration. See STT. | no |
| tts | object | The text-to-speech configuration. See TTS. | no |
| vad | object | The voice activity detection configuration. See VAD. | no |
| allowInterruptions | boolean | Whether to allow interruptions (barge-in) during the voice interaction. Default is false. | no |
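
Putting the pieces together, a voiceConfig for a classic (non-realtime) pipeline might look like this sketch; the provider and voice choices are illustrative:

```yaml
voiceConfig:
  allowInterruptions: true       # enable barge-in over TTS playback
  vad:
    provider: silero
  stt:
    provider: deepgram
    language: vi-VN
  tts:
    provider: elevenlabs
    voice: Rachel
```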

STT

The STT object defines the configuration for the Speech-To-Text (STT) engine used by the AI Agent.

| Property | Type | Description | Required |
| --- | --- | --- | --- |
| provider | string | The name of the STT provider. Allowed values: 'deepgram', 'openai', 'google', 'elevenlabs', 'fal', 'groq'. This determines which STT backend will be used. | yes |
| model | string | The model to use for speech recognition. This is provider-specific. | no |
| language | string | The language code for recognition, e.g. 'en-US' or 'vi-VN'. This is provider-specific. | no |
| apiKey | string | The API key or credentials for the STT service. This is required for most providers to authenticate requests. | no |
| baseUrl | string | The base URL for the STT service, used for custom endpoints or self-hosted deployments. Optional for most cloud providers. | no |
| providerOptions | object | Provider-specific configuration options for STT. Use this to supply additional settings required by your provider. | no |

Supported options by STT provider:

  • deepgram:

  • openai:

  • google:

  • elevenlabs: None

  • fal: None

  • groq: None

  • azure: None
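
As an illustration, a providerOptions block forwarding typical Deepgram settings might look like the sketch below; the option keys are assumptions based on Deepgram's public API, not a documented contract of this state:

```yaml
voiceConfig:
  stt:
    provider: deepgram
    model: nova-2
    providerOptions:
      punctuate: true            # assumed pass-through of Deepgram's punctuation option
      smart_format: true         # assumed pass-through of Deepgram's formatting option
```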

TTS

The TTS object defines the configuration for the Text-To-Speech (TTS) engine used by the AI Agent.

| Property | Type | Description | Required |
| --- | --- | --- | --- |
| provider | string | The name of the TTS provider. Allowed values: 'openai', 'deepgram', 'google', 'elevenlabs', 'groq'. This determines which TTS backend will be used. | no |
| model | string | The model to use for TTS. This is provider-specific. | no |
| voice | string | The voice to use for TTS. This is provider-specific and may refer to a named voice (e.g., 'en-US-Wavenet-D' for Google, 'Rachel' for ElevenLabs). | no |
| language | string | The language code for TTS. This is provider-specific and may affect pronunciation and available voices. | no |
| apiKey | string | The API key or credentials for the TTS service. This is required for most providers to authenticate requests. | no |
| baseUrl | string | The base URL for the TTS service. This is used for custom endpoints or self-hosted deployments. Optional for most cloud providers. | no |
| providerOptions | object | Provider-specific configuration options for TTS. Use this to supply additional settings required by your provider. | no |

Supported options by TTS provider:

  • openai:

  • deepgram:

  • google:

  • elevenlabs:

  • groq: None

  • azure:
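
As an illustration, providerOptions forwarding ElevenLabs voice settings might look like this sketch; the keys are assumptions based on ElevenLabs' public API, not a documented contract of this state:

```yaml
voiceConfig:
  tts:
    provider: elevenlabs
    voice: Rachel
    providerOptions:
      stability: 0.5             # assumed ElevenLabs voice-stability setting
      similarity_boost: 0.75     # assumed ElevenLabs similarity setting
```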

VAD

The VAD object defines the configuration for the Voice Activity Detection (VAD) engine used by the AI Agent.

| Property | Type | Description | Required |
| --- | --- | --- | --- |
| provider | string | The name of the provider. Default is 'silero'. | no |

Example:

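The following is a sketch of a complete AIOutboundAgentState definition. Property names inferred from the schema names on this page (such as outboundTarget) and all provider choices are illustrative assumptions:

```yaml
agentName: order-confirmation-agent
aiModel: gpt-4o
systemMessage: "You are a helpful AI Assistant."
userMessage: "Call the customer and confirm tomorrow's delivery slot."
maxToolExecutions: 10
agentDataOutput:                 # JSON schema for the agent's structured output
  type: object
  properties:
    confirmed:
      type: boolean
outboundConfig:
  greetingInstructions: "Greet the customer by name and explain why you are calling."
  outboundTarget:                # property name inferred from the OutboundTarget schema
    targetType: voice
    targetAddress: "+84123456789"
    targetName: "Nguyen Van A"
voiceConfig:
  allowInterruptions: true
  vad:
    provider: silero
  stt:
    provider: deepgram
    model: nova-2
    language: vi-VN
  tts:
    provider: elevenlabs
    voice: Rachel
    language: vi-VN
# Agent outcomes (required) follow the OnAgentOutcome schema from the AIAgent
# state documentation and are omitted from this sketch.
```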

This document provides a detailed view of the AIOutboundAgentState and its related objects, including comprehensive schema definitions, required fields, and descriptions for each attribute within the AIOutboundAgentState and associated schemas. This specification ensures clarity and completeness for integrating outbound AI agents within serverless workflows.
