Audio Multimodality: Expanding AI Interaction with Spring AI and OpenAI
This blog post is co-authored by our great contributor Thomas Vitale.
OpenAI provides specialized models for speech-to-text and text-to-speech conversion, recognized for their performance and cost-efficiency. Spring AI integrates these capabilities via its Voice-to-Text and Text-to-Speech (TTS) APIs.
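As a rough sketch of what these two APIs look like in Spring AI (assuming the OpenAI starter is on the classpath and the `transcriptionModel`/`speechModel` beans are injected; the audio file name is hypothetical):

```java
import org.springframework.ai.audio.transcription.AudioTranscriptionPrompt;
import org.springframework.ai.openai.audio.speech.SpeechPrompt;
import org.springframework.core.io.ClassPathResource;

// Speech-to-text: transcribe an audio file ("speech.mp3" is a placeholder resource)
String transcript = transcriptionModel
        .call(new AudioTranscriptionPrompt(new ClassPathResource("speech.mp3")))
        .getResult()
        .getOutput();

// Text-to-speech: synthesize spoken audio (as bytes) from a text prompt
byte[] audio = speechModel
        .call(new SpeechPrompt("Welcome to Spring AI!"))
        .getResult()
        .getOutput();
```

Both calls follow the same prompt/response shape Spring AI uses across its model APIs, so switching providers later means swapping the injected model bean rather than rewriting call sites.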
The new Audio Generation feature (gpt-4o-audio-preview) goes further, enabling mixed input and output modalities. Audio inputs can carry richer information than text alone, conveying nuances such as tone and inflection; combined with audio outputs, this enables asynchronous speech-to-speech interactions.
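A sketch of a speech-to-speech exchange through Spring AI's `ChatClient`, assuming an injected OpenAI `chatModel` bean; the prompt text and the `question.mp3` resource are illustrative placeholders:

```java
import java.util.List;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.openai.OpenAiChatOptions;
import org.springframework.ai.openai.api.OpenAiApi;
import org.springframework.ai.openai.api.OpenAiApi.ChatCompletionRequest.AudioParameters;
import org.springframework.core.io.ClassPathResource;
import org.springframework.util.MimeTypeUtils;

// Send an audio recording as input and request both text and audio back
var response = ChatClient.create(chatModel).prompt()
        .options(OpenAiChatOptions.builder()
                .withModel(OpenAiApi.ChatModel.GPT_4_O_AUDIO_PREVIEW)
                .withOutputModalities(List.of("text", "audio"))
                .withOutputAudio(new AudioParameters(
                        AudioParameters.Voice.ALLOY,
                        AudioParameters.AudioResponseFormat.WAV))
                .build())
        .user(u -> u.text("Answer the question in this recording")
                .media(MimeTypeUtils.parseMimeType("audio/mp3"),
                        new ClassPathResource("question.mp3")))
        .call()
        .chatResponse();

// The reply carries both modalities: a text transcript and synthesized speech
String text = response.getResult().getOutput().getContent();
byte[] audioReply = response.getResult().getOutput()
        .getMedia().get(0).getDataAsByteArray();
```

Because the model answers in one round trip with both modalities, the text reply can be shown immediately while the audio is played or stored, which is what makes the asynchronous speech-to-speech pattern practical.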
Additionally, this new multimodality opens up possibilities for innovative applications such as structured data extraction. Developers can extract structured information not just from plain text, but also from images and audio, building complex, structured objects.
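To illustrate the structured-extraction idea, here is a hedged sketch using `ChatClient`'s `entity(...)` mapping; the `WeatherReport` record, the prompt text, and the `forecast.mp3` resource are all hypothetical names chosen for this example:

```java
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.core.io.ClassPathResource;
import org.springframework.util.MimeTypeUtils;

// A hypothetical target type; Spring AI maps the model's output onto it
record WeatherReport(String location, double temperatureCelsius, String conditions) {}

// Extract a structured object directly from a spoken weather bulletin
WeatherReport report = ChatClient.create(chatModel).prompt()
        .user(u -> u.text("Extract the weather report from this recording")
                .media(MimeTypeUtils.parseMimeType("audio/mp3"),
                        new ClassPathResource("forecast.mp3")))
        .call()
        .entity(WeatherReport.class);
```

The same `entity(...)` call works whether the input media is text, an image, or audio, which is what turns multimodal input into typed domain objects rather than free-form strings.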