Voice to Vision: Building a Real-Time Speech-to-Visual Pipeline in Daydream Scope


What if the audience could speak a world into existence?

That's the question driving The Mirror's Echo, an interactive AI projection installation I'm developing that transforms spoken language into living visual landscapes. A viewer steps up to a microphone, says "crystalline forest under a blood moon," and within seconds the projection shifts — trees crystallize, the sky bleeds red, light scatters through impossible geometries.

This article describes how I built the real-time voice-to-visual pipeline powering this work, using Daydream Scope's StreamDiffusionV2 pipeline and a custom audio-transcription preprocessor plugin.

The Architecture

The system chains together several components in a real-time loop:

Microphone → Whisper AI → spaCy NLP → StreamDiffusionV2 → Projection

  1. Audio capture continuously listens through the system microphone
  2. OpenAI Whisper (tiny model, running locally) transcribes speech to text every 3 seconds
  3. spaCy NLP extracts nouns from the transcription — filtering out filler words, leaving only the visual essence of what was said
  4. StreamDiffusionV2 receives these nouns as prompts, generating imagery in real-time
  5. The output feeds to a projector via Spout/NDI for large-scale projection mapping
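The capture side of the loop boils down to buffering microphone samples until a full 3-second window is ready for Whisper. Here's a minimal sketch of that buffering step; in the actual plugin a sounddevice input callback would feed `push()`, and the class name and structure are my illustration, not Scope's API:

```python
from collections import deque

SAMPLE_RATE = 16_000           # Whisper models expect 16 kHz audio
CHUNK_SECONDS = 3              # matches the 3-second transcription interval
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS

class AudioChunker:
    """Accumulate microphone samples and emit fixed 3-second windows."""

    def __init__(self):
        self._buf = deque()
        self._count = 0

    def push(self, samples):
        # In the real pipeline, a sounddevice InputStream callback calls this.
        self._buf.append(list(samples))
        self._count += len(samples)

    def pop_chunk(self):
        """Return one CHUNK_SAMPLES-long window, or None if not enough audio yet."""
        if self._count < CHUNK_SAMPLES:
            return None
        flat = []
        while self._buf and len(flat) < CHUNK_SAMPLES:
            flat.extend(self._buf.popleft())
        if len(flat) > CHUNK_SAMPLES:
            # Keep any overflow for the next window
            self._buf.appendleft(flat[CHUNK_SAMPLES:])
        self._count -= CHUNK_SAMPLES
        return flat[:CHUNK_SAMPLES]
```

Each chunk that `pop_chunk()` returns would then be handed to `whisper.transcribe()` on a worker thread so the generation loop never blocks on transcription.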

The key insight: people don't speak in prompts. They say "oh wow, that's like a, um, stained glass butterfly or something." The NLP layer distills that into stained glass butterfly — exactly what the diffusion model needs.
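That distillation step is simple once the transcript is part-of-speech tagged. In the installation, spaCy's en_core_web_sm tagger supplies the labels via `nlp(transcript)`; in this sketch the tagged tokens are passed in directly so the filtering logic stays self-contained. Keeping adjectives alongside nouns (so modifiers like "stained" survive) is my assumption about the filter, and the filler list is illustrative:

```python
# Filler words to drop even when the tagger keeps them
FILLER = {"um", "uh", "like", "so", "oh", "wow", "something"}

def extract_prompt(tagged_tokens):
    """Keep nouns, proper nouns, and their adjectives; drop everything else."""
    kept = [
        text for text, pos in tagged_tokens
        if pos in {"NOUN", "PROPN", "ADJ"} and text.lower() not in FILLER
    ]
    return " ".join(kept)

# POS tags as spaCy would assign them for:
# "oh wow, that's like a, um, stained glass butterfly or something"
tagged = [
    ("oh", "INTJ"), ("wow", "INTJ"), ("that", "PRON"), ("'s", "AUX"),
    ("like", "INTJ"), ("a", "DET"), ("um", "INTJ"),
    ("stained", "ADJ"), ("glass", "NOUN"), ("butterfly", "NOUN"),
    ("or", "CCONJ"), ("something", "PRON"),
]
print(extract_prompt(tagged))  # → stained glass butterfly
```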

Why Scope?

I explored several approaches before landing on Scope. I've worked extensively with TouchDesigner and StreamDiffusion's TD plugin, but Scope's preprocessor architecture solved a fundamental problem: how do you inject prompts from an external source into a running diffusion pipeline?

Scope's preprocessor system lets you intercept the pipeline at the frame level. My audio-transcription plugin sits between the input and StreamDiffusionV2, passing video frames through untouched while injecting voice-derived prompts into the generation parameters. The pipeline doesn't know or care that its prompts are coming from a microphone — it just receives text and generates.

The input_mode: "text" override was critical. StreamDiffusionV2 normally expects video input for img2img generation. By forcing text-only mode, the model generates purely from the prompt, creating imagery that responds to speech rather than transforming a camera feed.
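Conceptually, the preprocessor's job is tiny: pass the frame through and override the generation parameters. The sketch below shows that shape; the function signature and field names are illustrative, not Scope's actual plugin API:

```python
def process(frame, state):
    """Pass the video frame through untouched while injecting
    voice-derived parameters. Illustrative sketch, not Scope's API."""
    params = {
        "input_mode": "text",  # force text-only generation, ignore img2img input
        "prompt": state.get("voice_prompt") or state["ui_prompt"],
    }
    return frame, params       # frame is returned unmodified
```

The point is that StreamDiffusionV2 downstream sees only a prompt string; where it came from is invisible to the model.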

The Dual-Prompt System

The installation needs two modes:

Voice Mode (Green): When someone is actively speaking and nouns are detected, their words drive the visuals. "Ocean waves crashing" produces ocean imagery. "Cathedral ceiling" shifts to architecture. The transition between prompts uses Scope's cache reset for hard cuts — each new noun phrase gets a fresh generation.

Text Box Fallback (Yellow): When no one is speaking (10 seconds of silence), the system falls back to whatever prompt is set in Scope's UI. This serves as an ambient visual state — a default aesthetic that plays between interactions. Gallery staff can change this by typing in the prompt box without touching code.
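The switch between the two modes is just a timestamp comparison. A minimal sketch of that selection logic, with illustrative names (the injectable clock makes it testable without waiting ten real seconds):

```python
import time

SILENCE_TIMEOUT = 10.0  # seconds of silence before falling back to the UI prompt

class PromptSelector:
    """Choose between voice-derived prompts and the UI fallback.
    Illustrative sketch of the dual-prompt logic."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last_voice = None
        self._last_heard = -SILENCE_TIMEOUT  # start in fallback mode

    def on_nouns(self, prompt):
        """Called whenever the NLP layer extracts a non-empty noun phrase."""
        if prompt:
            self._last_voice = prompt
            self._last_heard = self._clock()

    def current(self, ui_prompt):
        """Return (prompt, mode) — 'voice' (green) or 'fallback' (yellow)."""
        if self._last_voice and self._clock() - self._last_heard < SILENCE_TIMEOUT:
            return self._last_voice, "voice"
        return ui_prompt, "fallback"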

A prompt monitor overlay (a small tkinter window) shows the current state in real-time: which mode is active, what nouns were extracted, the microphone amplitude, and whether transcription is happening. This is essential for debugging during installation and for gallery staff to understand what the system is doing.

Running on Limited Hardware

My development machine has an 8GB GPU — far from the 5090s in Ryan's VACE demos. Making this work required aggressive optimization:

  • LightVAE (75% pruned) trades some quality for dramatically faster generation
  • 144×144 resolution is the sweet spot for 8GB VRAM — small, but when projected at scale, the low resolution becomes an aesthetic feature rather than a limitation
  • Denoising steps [47, 23] (two steps) minimizes compute per frame
  • 3-second audio processing interval balances responsiveness with GPU headroom
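Gathered in one place, the low-VRAM settings above look roughly like this. The key names are illustrative, mirroring the article's values rather than Scope's actual configuration schema:

```python
# Hypothetical config mirroring the 8GB-GPU optimizations; key names are
# illustrative, not Scope's actual schema.
LOW_VRAM_SETTINGS = {
    "vae": "lightvae",          # 75%-pruned VAE: trades quality for speed
    "width": 144,
    "height": 144,              # sweet spot for 8GB VRAM
    "t_index_list": [47, 23],   # two denoising steps per frame
    "audio_interval_s": 3.0,    # transcription cadence
}
```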

The result is approximately 3 fps of AI-generated imagery driven by voice. Not silky smooth, but for a projection installation where the visual shifts are the spectacle, it works. The dreamy, slightly stuttered quality actually reinforces the feeling that you're watching something being imagined in real-time.

Lessons from the Build

Nouns are everything. Early versions sent the full transcription to StreamDiffusion. The results were incoherent — diffusion models don't know what to do with "um, so like, maybe a." spaCy's noun extraction was the breakthrough: it turns rambling speech into clean, generative prompts.

Queue architecture matters. Scope's parameter queue can flood when a preprocessor sends updates too frequently. The solution was a bypass that merges prompt parameters directly, skipping the queue entirely. Without this, voice prompts would get dropped in favor of the UI prompt that the frontend sends every frame.
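The essence of that bypass is ordering: drain the regular queue first, then merge the voice update last so it wins over the UI prompt the frontend re-sends every frame. A sketch of the idea, not Scope's actual internals:

```python
def apply_updates(params, queue_updates, voice_update=None):
    """Apply queued parameter updates, then merge the voice-derived
    update directly so it always takes priority. Illustrative sketch."""
    merged = dict(params)
    for update in queue_updates:
        merged.update(update)
    if voice_update:
        merged.update(voice_update)  # bypass: applied after the queue
    return merged
```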

The fallback needs to be graceful. Hard-cutting from voice-driven imagery to a static prompt looks jarring. The cache reset smooths transitions, and the 10-second timeout gives speakers natural breathing room without immediately snapping to the fallback.

Monitor everything. You cannot debug a real-time audio-visual pipeline by reading logs after the fact. The prompt monitor overlay was an afterthought that became essential. Seeing "VOICE: crystalline forest" flash green while the projection shifts gives you immediate confirmation that the whole chain is working.

What's Next

The Mirror's Echo is being developed for exhibition at the Columbus Museum of Art's Wonderball 2026, alongside baroque-themed projection pieces. The voice pipeline will be the centerpiece — an interactive station where guests speak and watch their words become visual worlds.

I'm exploring several extensions:

  • Emotion detection from vocal tone to influence color palette and visual intensity
  • Multi-speaker blending where overlapping voices create composite imagery
  • VACE integration for structural control — imagine speaking "a cathedral" while a depth map from a real architectural column guides the generation
  • Higher resolution on beefier exhibition hardware, potentially leveraging Scope's cloud GPU support

Try It Yourself

The audio-transcription preprocessor is built as a Scope plugin. The core requirements:

  • Daydream Scope (free, open source)
  • A microphone
  • Python packages: openai-whisper, spacy, sounddevice
  • The en_core_web_sm spaCy model for noun extraction
  • An NVIDIA GPU (8GB minimum with the optimizations described above)

The plugin architecture means you can drop this into any Scope pipeline — not just StreamDiffusionV2. As new real-time models land in Scope (LongLive, VACE, MemFlow), the voice input layer stays the same.

Speaking to machines and having them dream back at you — it's the most natural interface I've ever built. No keyboard. No touchscreen. Just your voice and an AI that listens.

Krista Faist is a VR/AI/moving image artist represented by Chaos Contemporary Craft gallery, a 2024 Fuse Factory Artist-in-Residence, and founding board member of Mural ReMix. Her work explores perception, wonder, and technological mediation through interactive installations and projection mapping. She splits her time between Columbus, Ohio and Sarasota, Florida.

Find her on Daydream: @Eicos73