Introduction

Users have grown accustomed to talking to their devices. Siri, Google Assistant, and Alexa have normalised voice interaction. Now users expect similar experiences in the apps they use—ordering food by voice, checking account balances through conversation, getting support without typing.

Building conversational AI is more accessible than ever, but there’s still nuance in getting it right. This guide covers how to implement voice interfaces and dialogue management that actually work.

Understanding the Conversational AI Stack

A conversational interface has several components:

┌─────────────────────────────────────────────────────────────┐
│                        User Input                           │
│                    (Voice or Text)                          │
└────────────────────────────┬────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────┐
│               Speech-to-Text (if voice)                     │
│          (Whisper, Google STT, Apple Speech)                │
└────────────────────────────┬────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────┐
│              Natural Language Understanding                 │
│         (Intent Recognition, Entity Extraction)             │
└────────────────────────────┬────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────┐
│                   Dialogue Manager                          │
│        (Context, State, Business Logic)                     │
└────────────────────────────┬────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────┐
│                 Response Generation                         │
│            (Templates, LLM, Dynamic Content)                │
└────────────────────────────┬────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────┐
│              Text-to-Speech (if voice)                      │
│        (System TTS, Cloud TTS, Voice Cloning)               │
└─────────────────────────────────────────────────────────────┘

Let’s implement each component.

Natural Language Understanding

NLU converts user text into structured data: what they want (intent) and the details (entities).

Using Rasa NLU

Rasa is an open-source option you can self-host:

# data/nlu.yml
version: "3.1"
nlu:
  - intent: check_balance
    examples: |
      - what's my balance
      - how much do I have
      - show me my account balance
      - what's in my account
      - check my balance

  - intent: transfer_money
    examples: |
      - transfer [50](amount) dollars to [John](recipient)
      - send [100](amount) to [Sarah](recipient)
      - pay [John Smith](recipient) [25](amount)
      - move [200](amount) to my savings

  - intent: find_transactions
    examples: |
      - show my recent transactions
      - what did I spend at [Woolworths](merchant) last week
      - show purchases from [yesterday](date)
      - transactions over [100](amount) dollars

A thin backend service can wrap the trained model:

# Backend: NLU service
# Note: rasa.nlu.model.Interpreter was the Rasa 2.x API; with Rasa 3.x
# models (matching the 3.1 training data above), load via Agent instead.
from rasa.core.agent import Agent

class NLUService:
    def __init__(self, model_path: str):
        self.agent = Agent.load(model_path)

    async def parse(self, text: str) -> dict:
        result = await self.agent.parse_message(text)
        return {
            "intent": result["intent"]["name"],
            "confidence": result["intent"]["confidence"],
            "entities": [
                {
                    "entity": e["entity"],
                    "value": e["value"],
                    "start": e["start"],
                    "end": e["end"]
                }
                for e in result["entities"]
            ]
        }

# Usage
nlu = NLUService("./models/nlu")
result = await nlu.parse("transfer 50 dollars to John")
# {
#   "intent": "transfer_money",
#   "confidence": 0.95,
#   "entities": [
#     {"entity": "amount", "value": "50", "start": 9, "end": 11},
#     {"entity": "recipient", "value": "John", "start": 23, "end": 27}
#   ]
# }

Using LLMs for Intent Recognition

Modern LLMs can handle intent recognition without training:

import OpenAI from 'openai';

const openai = new OpenAI();

interface ParsedIntent {
  intent: string;
  entities: Record<string, string>;
  confidence: number;
}

async function parseWithLLM(userMessage: string): Promise<ParsedIntent> {
  const systemPrompt = `You are an intent classifier for a banking app.

Available intents:
- check_balance: User wants to see their account balance
- transfer_money: User wants to send money (entities: amount, recipient)
- find_transactions: User wants to see transaction history (entities: merchant, date, amount)
- get_help: User needs assistance
- unknown: Intent doesn't match any category

Respond with JSON only:
{
  "intent": "intent_name",
  "entities": {"entity_name": "value"},
  "confidence": 0.0-1.0
}`;

  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: systemPrompt },
      { role: 'user', content: userMessage },
    ],
    response_format: { type: 'json_object' },
    temperature: 0,
  });

  return JSON.parse(response.choices[0].message.content!);
}
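
The confidence score here is self-reported by the model, so treat it as a rough signal rather than a calibrated probability. It is still worth gating on before acting:

// Sketch: only act on confident classifications; the 0.7 threshold is
// illustrative and worth tuning against real traffic
async function classifyOrClarify(userMessage: string): Promise<ParsedIntent | string> {
  const parsed = await parseWithLLM(userMessage);

  if (parsed.intent === 'unknown' || parsed.confidence < 0.7) {
    return "I didn't quite get that. Could you rephrase it?";
  }

  return parsed;
}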

Building the Dialogue Manager

The dialogue manager maintains conversation context and decides what happens next.

State Machine Approach

For simple flows, a state machine works well:

interface ConversationState {
  currentStep: string;
  context: Record<string, any>;
  history: Array<{ role: 'user' | 'assistant'; content: string }>;
}

interface DialogueFlow {
  steps: Record<string, DialogueStep>;
  initialStep: string;
}

interface DialogueStep {
  prompt?: string | ((context: Record<string, any>) => string);
  requiredEntities?: string[];
  validate?: (context: Record<string, any>) => boolean;
  onComplete: string | ((context: Record<string, any>) => Promise<string>);
}

const transferMoneyFlow: DialogueFlow = {
  initialStep: 'get_recipient',
  steps: {
    get_recipient: {
      prompt: "Who would you like to send money to?",
      requiredEntities: ['recipient'],
      onComplete: 'get_amount',
    },
    get_amount: {
      prompt: "How much would you like to transfer?",
      requiredEntities: ['amount'],
      validate: (ctx) => parseFloat(ctx.amount) > 0,
      onComplete: 'confirm',
    },
    confirm: {
      prompt: (ctx) => `Transfer $${ctx.amount} to ${ctx.recipient}. Is that correct?`,
      requiredEntities: ['confirmation'],
      onComplete: async (ctx) => {
        if (ctx.confirmation === 'yes') {
          await executeTransfer(ctx.recipient, ctx.amount);
          return 'success';
        }
        return 'cancelled';
      },
    },
    success: {
      prompt: (ctx) => `Done! $${ctx.amount} has been sent to ${ctx.recipient}.`,
      onComplete: 'end',
    },
    cancelled: {
      prompt: "Transfer cancelled. Is there anything else I can help with?",
      onComplete: 'end',
    },
  },
};

class DialogueManager {
  private state: ConversationState;
  private flows: Record<string, DialogueFlow>;

  constructor() {
    this.state = {
      currentStep: 'idle',
      context: {},
      history: [],
    };
    this.flows = {};
  }

  registerFlow(intent: string, flow: DialogueFlow) {
    this.flows[intent] = flow;
  }

  async processInput(
    text: string,
    parsedIntent: ParsedIntent
  ): Promise<string> {
    // Add to history
    this.state.history.push({ role: 'user', content: text });

    // If idle, start new flow based on intent
    if (this.state.currentStep === 'idle') {
      const flow = this.flows[parsedIntent.intent];
      if (flow) {
        this.state.currentStep = flow.initialStep;
        // Merge any entities from initial message
        Object.assign(this.state.context, parsedIntent.entities);
      } else {
        return "I'm not sure how to help with that.";
      }
    } else {
      // In active flow, extract entities
      Object.assign(this.state.context, parsedIntent.entities);
    }

    return this.advanceFlow();
  }

  private async advanceFlow(): Promise<string> {
    const flow = this.getCurrentFlow();
    if (!flow) return "Something went wrong. Let's start over.";

    const step = flow.steps[this.state.currentStep];
    if (!step) return "Something went wrong.";

    // Check if we have all required entities
    const missingEntities = (step.requiredEntities || []).filter(
      (e) => !this.state.context[e]
    );

    if (missingEntities.length > 0) {
      // Need more info
      const prompt = typeof step.prompt === 'function'
        ? await step.prompt(this.state.context)
        : step.prompt;
      return prompt || `I need the ${missingEntities[0]}`;
    }

    // Validate if needed
    if (step.validate && !step.validate(this.state.context)) {
      return "That doesn't seem right. Could you try again?";
    }

    // Move to next step
    const nextStep = typeof step.onComplete === 'function'
      ? await step.onComplete(this.state.context)
      : step.onComplete;

    this.state.currentStep = nextStep;

    if (nextStep === 'end') {
      // Resolve this step's prompt (e.g. the success or cancelled message)
      // before reset() clears the context it references
      const finalPrompt = typeof step.prompt === 'function'
        ? step.prompt(this.state.context)
        : step.prompt;
      this.reset();
      return finalPrompt ?? "Is there anything else I can help you with?";
    }

    // Get next prompt
    return this.advanceFlow();
  }

  private getCurrentFlow(): DialogueFlow | null {
    for (const flow of Object.values(this.flows)) {
      if (flow.steps[this.state.currentStep]) {
        return flow;
      }
    }
    return null;
  }

  private reset() {
    this.state = {
      currentStep: 'idle',
      context: {},
      history: [],
    };
  }
}
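
Wiring the classifier and the dialogue manager together might look like this (a sketch, reusing the parseWithLLM function from earlier):

const manager = new DialogueManager();
manager.registerFlow('transfer_money', transferMoneyFlow);

async function handleUserMessage(text: string): Promise<string> {
  const parsed = await parseWithLLM(text);
  return manager.processInput(text, parsed);
}

// "send 50 to John" fills recipient and amount from the first message,
// so the flow skips straight to: "Transfer $50 to John. Is that correct?"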

LLM-Powered Dialogue

For more natural conversations, use an LLM as the dialogue manager:

interface ConversationContext {
  userId: string;
  history: Array<{ role: 'user' | 'assistant'; content: string }>;
  userData: {
    accountBalance?: number;
    recentTransactions?: Transaction[];
    contacts?: Contact[];
  };
}

class LLMDialogueManager {
  private context: ConversationContext;

  constructor(userId: string) {
    this.context = {
      userId,
      history: [],
      userData: {},
    };
  }

  async processMessage(userMessage: string): Promise<string> {
    // Add user message to history
    this.context.history.push({ role: 'user', content: userMessage });

    // Fetch relevant user data
    await this.updateUserData(userMessage);

    // Build system prompt with context
    const systemPrompt = this.buildSystemPrompt();

    const response = await openai.chat.completions.create({
      model: 'gpt-4o',
      messages: [
        { role: 'system', content: systemPrompt },
        // The history already ends with the latest user message (pushed
        // above), so don't append it again
        ...this.context.history.slice(-10),
      ],
      tools: this.getAvailableTools(),
      temperature: 0.7,
    });

    const assistantMessage = response.choices[0].message;

    // Handle tool calls: execute them, then let the model phrase the result
    if (assistantMessage.tool_calls) {
      const toolResults = await this.executeTools(assistantMessage.tool_calls);
      return this.generateFinalResponse(assistantMessage, toolResults);
    }

    const reply = assistantMessage.content || "I'm not sure how to help with that.";
    this.context.history.push({ role: 'assistant', content: reply });

    return reply;
  }

  private buildSystemPrompt(): string {
    return `You are a helpful banking assistant for an Australian mobile app.

Current user data:
- Account balance: $${this.context.userData.accountBalance ?? 'unknown'}
- Recent transactions: ${JSON.stringify(this.context.userData.recentTransactions?.slice(0, 5) ?? [])}

Guidelines:
- Be concise and helpful
- For sensitive actions (transfers, payments), always confirm before executing
- Use Australian English spelling
- Format currency as AUD
- If you need information to complete a task, ask for it naturally
- Never make up account information`;
  }

  private getAvailableTools() {
    return [
      {
        type: 'function' as const,
        function: {
          name: 'get_account_balance',
          description: 'Get the current account balance',
          parameters: { type: 'object', properties: {} },
        },
      },
      {
        type: 'function' as const,
        function: {
          name: 'transfer_money',
          description: 'Transfer money to another account',
          parameters: {
            type: 'object',
            properties: {
              recipient: { type: 'string', description: 'Name or account of recipient' },
              amount: { type: 'number', description: 'Amount in AUD' },
            },
            required: ['recipient', 'amount'],
          },
        },
      },
      {
        type: 'function' as const,
        function: {
          name: 'get_transactions',
          description: 'Get recent transactions',
          parameters: {
            type: 'object',
            properties: {
              days: { type: 'number', description: 'Number of days to look back' },
              merchant: { type: 'string', description: 'Filter by merchant name' },
            },
          },
        },
      },
    ];
  }

  private async executeTools(toolCalls: any[]): Promise<Record<string, any>> {
    const results: Record<string, any> = {};

    for (const call of toolCalls) {
      const args = JSON.parse(call.function.arguments);

      switch (call.function.name) {
        case 'get_account_balance':
          results[call.id] = await this.getBalance();
          break;
        case 'transfer_money':
          results[call.id] = await this.initiateTransfer(args.recipient, args.amount);
          break;
        case 'get_transactions':
          results[call.id] = await this.getTransactions(args.days, args.merchant);
          break;
      }
    }

    return results;
  }
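
  // One way to implement generateFinalResponse (a sketch): feed the tool
  // results back to the model so it can phrase the final answer. The
  // tool-message shape follows the OpenAI tool-calling protocol.
  private async generateFinalResponse(
    assistantMessage: any,
    toolResults: Record<string, any>
  ): Promise<string> {
    const followUp = await openai.chat.completions.create({
      model: 'gpt-4o',
      messages: [
        { role: 'system', content: this.buildSystemPrompt() },
        ...this.context.history.slice(-10),
        assistantMessage, // the assistant message containing tool_calls
        ...(assistantMessage.tool_calls ?? []).map((call: any) => ({
          role: 'tool' as const,
          tool_call_id: call.id,
          content: JSON.stringify(toolResults[call.id] ?? null),
        })),
      ],
    });

    const reply = followUp.choices[0].message.content ?? 'Done.';
    this.context.history.push({ role: 'assistant', content: reply });
    return reply;
  }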

  // ... implement getBalance, initiateTransfer and getTransactions against real banking APIs
}

Mobile Integration

iOS Voice Interface

On iOS, SFSpeechRecognizer provides speech-to-text and AVSpeechSynthesizer provides text-to-speech:

import Speech
import AVFoundation

class VoiceAssistant: ObservableObject {
    @Published var isListening = false
    @Published var transcript = ""
    @Published var response = ""
    @Published var isSpeaking = false

    private let speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-AU"))
    private var recognitionTask: SFSpeechRecognitionTask?
    private let audioEngine = AVAudioEngine()
    private let synthesizer = AVSpeechSynthesizer()
    // AVSpeechSynthesizer.delegate is weak, so keep a strong reference here
    private var speechDelegate: SpeechDelegate?

    func startConversation() async {
        guard await requestPermissions() else { return }

        await MainActor.run {
            isListening = true
            transcript = ""
        }

        do {
            try await startListening()
        } catch {
            await MainActor.run {
                isListening = false
            }
        }
    }

    private func startListening() async throws {
        // Configure the audio session for recording before tapping the mic
        let session = AVAudioSession.sharedInstance()
        try session.setCategory(.playAndRecord, mode: .measurement, options: .duckOthers)
        try session.setActive(true, options: .notifyOthersOnDeactivation)

        let request = SFSpeechAudioBufferRecognitionRequest()
        request.shouldReportPartialResults = true

        let inputNode = audioEngine.inputNode
        let recordingFormat = inputNode.outputFormat(forBus: 0)

        inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { buffer, _ in
            request.append(buffer)
        }

        audioEngine.prepare()
        try audioEngine.start()

        recognitionTask = speechRecognizer?.recognitionTask(with: request) { [weak self] result, error in
            if let result = result {
                Task { @MainActor in
                    self?.transcript = result.bestTranscription.formattedString
                }

                if result.isFinal {
                    self?.stopListening()
                    Task {
                        await self?.processTranscript(result.bestTranscription.formattedString)
                    }
                }
            }
        }
    }

    private func stopListening() {
        audioEngine.stop()
        audioEngine.inputNode.removeTap(onBus: 0)
        recognitionTask?.cancel()

        Task { @MainActor in
            isListening = false
        }
    }

    private func processTranscript(_ text: String) async {
        // Send to backend dialogue manager
        do {
            let response = try await sendToDialogueManager(text)

            await MainActor.run {
                self.response = response
            }

            // Speak the response
            await speak(response)

        } catch {
            await MainActor.run {
                self.response = "Sorry, I couldn't process that. Please try again."
            }
        }
    }

    private func sendToDialogueManager(_ text: String) async throws -> String {
        var request = URLRequest(url: URL(string: "\(Config.apiURL)/assistant/chat")!)
        request.httpMethod = "POST"
        request.setValue("application/json", forHTTPHeaderField: "Content-Type")
        request.httpBody = try JSONEncoder().encode(["message": text])

        let (data, _) = try await URLSession.shared.data(for: request)
        let result = try JSONDecoder().decode(AssistantResponse.self, from: data)

        return result.message
    }

    private func speak(_ text: String) async {
        await MainActor.run {
            isSpeaking = true
        }

        let utterance = AVSpeechUtterance(string: text)
        utterance.voice = AVSpeechSynthesisVoice(language: "en-AU")
        utterance.rate = AVSpeechUtteranceDefaultSpeechRate

        await withCheckedContinuation { continuation in
            // SpeechDelegate (defined below) resumes the continuation from
            // the synthesizer's didFinish callback
            speechDelegate = SpeechDelegate {
                continuation.resume()
            }
            synthesizer.delegate = speechDelegate
            synthesizer.speak(utterance)
        }

        await MainActor.run {
            isSpeaking = false
        }
    }
}
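
The helper types referenced above are small. A sketch, with the AssistantResponse shape assumed to match the backend's JSON:

// Response payload from the backend (shape is an assumption)
struct AssistantResponse: Codable {
    let message: String
}

// Bridges AVSpeechSynthesizerDelegate to a closure so speak(_:) can await
// the end of the utterance
final class SpeechDelegate: NSObject, AVSpeechSynthesizerDelegate {
    private let onFinish: () -> Void

    init(onFinish: @escaping () -> Void) {
        self.onFinish = onFinish
    }

    func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer,
                           didFinish utterance: AVSpeechUtterance) {
        onFinish()
    }
}

extension VoiceAssistant {
    // Speech recognition and microphone access must both be granted
    func requestPermissions() async -> Bool {
        let speechStatus = await withCheckedContinuation { continuation in
            SFSpeechRecognizer.requestAuthorization { status in
                continuation.resume(returning: status)
            }
        }
        guard speechStatus == .authorized else { return false }

        return await withCheckedContinuation { continuation in
            AVAudioSession.sharedInstance().requestRecordPermission { granted in
                continuation.resume(returning: granted)
            }
        }
    }
}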

Android Voice Interface

On Android, SpeechRecognizer and TextToSpeech fill the same roles:

// Note: holding a Context in a ViewModel risks leaks; in production,
// prefer AndroidViewModel and the application context.
class VoiceAssistant(private val context: Context) : ViewModel() {

    private val _isListening = MutableStateFlow(false)
    val isListening: StateFlow<Boolean> = _isListening

    private val _transcript = MutableStateFlow("")
    val transcript: StateFlow<String> = _transcript

    private val _response = MutableStateFlow("")
    val response: StateFlow<String> = _response

    private var speechRecognizer: SpeechRecognizer? = null
    private var textToSpeech: TextToSpeech? = null

    init {
        textToSpeech = TextToSpeech(context) { status ->
            if (status == TextToSpeech.SUCCESS) {
                textToSpeech?.language = Locale("en", "AU")
            }
        }
    }

    fun startConversation() {
        if (!SpeechRecognizer.isRecognitionAvailable(context)) {
            _response.value = "Speech recognition not available"
            return
        }

        speechRecognizer = SpeechRecognizer.createSpeechRecognizer(context)
        speechRecognizer?.setRecognitionListener(createListener())

        val intent = Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH).apply {
            putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL, RecognizerIntent.LANGUAGE_MODEL_FREE_FORM)
            putExtra(RecognizerIntent.EXTRA_LANGUAGE, "en-AU")
            putExtra(RecognizerIntent.EXTRA_PARTIAL_RESULTS, true)
        }

        speechRecognizer?.startListening(intent)
        _isListening.value = true
    }

    private fun createListener() = object : RecognitionListener {
        override fun onResults(results: Bundle?) {
            _isListening.value = false

            val matches = results?.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION)
            val text = matches?.firstOrNull() ?: return

            _transcript.value = text

            viewModelScope.launch {
                processTranscript(text)
            }
        }

        override fun onPartialResults(partialResults: Bundle?) {
            val matches = partialResults?.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION)
            matches?.firstOrNull()?.let { _transcript.value = it }
        }

        // ... other required overrides
        override fun onReadyForSpeech(params: Bundle?) {}
        override fun onBeginningOfSpeech() {}
        override fun onRmsChanged(rmsdB: Float) {}
        override fun onBufferReceived(buffer: ByteArray?) {}
        override fun onEndOfSpeech() { _isListening.value = false }
        override fun onError(error: Int) { _isListening.value = false }
        override fun onEvent(eventType: Int, params: Bundle?) {}
    }

    private suspend fun processTranscript(text: String) {
        try {
            val response = sendToDialogueManager(text)
            _response.value = response
            speak(response)
        } catch (e: Exception) {
            _response.value = "Sorry, something went wrong."
        }
    }

    private suspend fun sendToDialogueManager(text: String): String {
        // In production, create one HttpClient and reuse it across requests
        val client = HttpClient(CIO) {
            install(ContentNegotiation) { json() }
        }

        val response: AssistantResponse = client.post("${Config.API_URL}/assistant/chat") {
            contentType(ContentType.Application.Json)
            setBody(mapOf("message" to text))
        }.body()

        return response.message
    }

    private fun speak(text: String) {
        textToSpeech?.speak(text, TextToSpeech.QUEUE_FLUSH, null, "response")
    }

    override fun onCleared() {
        speechRecognizer?.destroy()
        textToSpeech?.shutdown()
    }
}
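
As on iOS, the response payload is a small DTO; the shape below is an assumption to match the backend:

import kotlinx.serialization.Serializable

// Response payload from the backend (shape is an assumption)
@Serializable
data class AssistantResponse(val message: String)

Note that SpeechRecognizer requires the RECORD_AUDIO permission: declare it in the manifest and request it at runtime before calling startConversation().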

Handling Errors Gracefully

Conversational AI fails often: speech gets misheard, intents come back low-confidence, backends time out. Handle failure well:

async function processWithFallback(userMessage: string): Promise<string> {
  try {
    // Try primary NLU/dialogue
    return await dialogueManager.processMessage(userMessage);
  } catch (error) {
    // Log for debugging
    console.error('Dialogue error:', error);

    // Provide helpful fallback
    return getFallbackResponse(userMessage);
  }
}

function getFallbackResponse(userMessage: string): string {
  // Check for common patterns even without NLU
  const lowerMessage = userMessage.toLowerCase();

  if (lowerMessage.includes('balance')) {
    return "I'm having trouble right now. You can check your balance in the Accounts tab.";
  }

  if (lowerMessage.includes('transfer') || lowerMessage.includes('send')) {
    return "I can't process that right now. To transfer money, tap the Transfer button in the app.";
  }

  if (lowerMessage.includes('help')) {
    return "I'm having trouble understanding. You can reach our support team at 1300 XXX XXX or through the Help section in the app.";
  }

  return "I didn't quite catch that. Could you try saying it differently, or use the app menu for help?";
}
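
Both mobile clients above POST to an /assistant/chat endpoint. A minimal sketch of that endpoint (Express is an assumption here; authentication and per-user session lookup are omitted):

import express from 'express';

const app = express();
app.use(express.json());

app.post('/assistant/chat', async (req, res) => {
  // In production, resolve a per-user dialogue manager from the session
  const message = await processWithFallback(req.body.message);
  res.json({ message });
});

app.listen(3000);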

Conclusion

Conversational AI in mobile apps requires balancing sophistication with reliability. The key components are:

  1. NLU for understanding user intent and extracting entities
  2. Dialogue management for maintaining context and guiding conversations
  3. Graceful fallbacks when things don’t work

Start simple: a state machine for defined flows handles most use cases. Add LLM-powered dialogue for more natural interactions once you have the basics working. Always provide clear fallbacks when the AI fails—and it will fail.

The best conversational interfaces feel helpful without trying to be too smart. Match your implementation complexity to your actual user needs. Users appreciate a voice interface that handles three things well over one that attempts everything poorly.