Beyond Quality: Why Control Wins

TL;DR

The convergence of real-time video generation and world models represents a shift in how we'll interact with generative AI systems. The most successful applications will craft intuitive and expressive authorship experiences around increasingly commoditized model capabilities.

The broad adoption of high-ceiling tools like TouchDesigner and ComfyUI illustrates this principle; they leverage a node-based interface and a robust plugin ecosystem to unlock incredible customizability and control.

This article explores the future of UX for real-time interactive video and world models, and concludes that use-case-specific controllability is the foundation for success at the application layer.

The Technical Convergence

We're witnessing a fascinating collapse of modality boundaries in real-time AI video and world models. Technical challenges from causal generation to frame compression are converging rapidly, and we're seeing patterns emerge.

  • Video-to-video transformation is a subset of noise-to-video generation. When you can generate video from pure noise, transforming existing video becomes a special case where you're working with structured starting conditions rather than random initial conditions.
  • Similarly, real-time world models are converging with real-time video models. The distinction between "arrow keys triggering a new render" and "prompt updates generating new frames" is dissolving. Both represent the same fundamental operation: translating control signals into visual outputs that maintain as much temporal and physical coherence as possible.
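The reduction above can be made concrete. In a diffusion-style generator, video-to-video is just noise-to-video with a structured initial latent: instead of starting from pure noise, you start from the source frame partially noised. The sketch below is illustrative only; `denoise_fn` stands in for a real few-step denoiser, and the `strength` blend is a common (but simplified) img2img-style initialization, not any specific model's API.

```python
import numpy as np

def generate(init_latent, denoise_fn, steps=4):
    """Run a (hypothetical) few-step denoiser from an initial latent."""
    x = init_latent
    for t in np.linspace(1.0, 0.0, steps, endpoint=False):
        x = denoise_fn(x, t)  # one denoising step at noise level t
    return x

def noise_to_video(shape, denoise_fn, rng):
    # Pure noise-to-video: random initial conditions.
    return generate(rng.standard_normal(shape), denoise_fn)

def video_to_video(source_latent, denoise_fn, rng, strength=0.6):
    # Video-to-video: a *structured* start -- the source latent,
    # partially noised. strength=1.0 recovers noise-to-video.
    noise = rng.standard_normal(source_latent.shape)
    init = (1 - strength) * source_latent + strength * noise
    return generate(init, denoise_fn)
```

The only difference between the two entry points is the initial condition, which is the sense in which one is a special case of the other.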

As this convergence accelerates, we will see a new wave of world models that are:

  • Controllable: Accepting and responding to real-time inputs across all modalities
  • High Quality: Producing visually stunning outputs with minimal temporal artifacts
  • Physically Deterministic: Understanding and respecting the constraints of physical space

The Authorship Experience as Competitive Moat

Modern AI models and workflows are incredibly feature-rich, but a workflow is only effective if a user can adequately control it to achieve a use-case-specific goal. 

As models improve and modalities collapse, the best authorship experience at the app layer will win.

The best authorship experience is a function of controllability and quality. Quality will be commoditized at the model layer, and controllability may eventually be commoditized at the infrastructure layer (though it is likely to remain fragmented). At the application layer, controllability may be standardized, but it will never be fully commoditized; there are simply too many market opportunities and too many possibilities.

Everyone is an author now.

There used to be a clear distinction between creators who produce and audiences who consume, but real-time controllable AI enables a new model: every interaction becomes an opportunity for transformation. Instead of watching a video, playing a game, or viewing content, users fork it, remix it, and make it their own in real-time. This shift manifests across both single-player and multiplayer contexts.

  • In single-player modes, users iterate privately, experimenting and refining until they achieve their vision, with the option to share or keep their creations private.
  • In multiplayer environments, the creative process itself becomes social: users vote on style changes in live broadcasts, collaboratively build worlds, or compete to create the most compelling variations of shared templates.

This is an incredible change to how we think about authorship. But if everyone is now a creator — and a workflow has hundreds of implicit and explicit parameters that affect output — how do you expose the right controls for a user to achieve their goals?

Breaking down authorship

Every use case demands a slightly different authorship experience, even when built on the same underlying workflow. Crafting a great authorship experience starts with understanding who is doing the creating, and why they're doing it.

Here are a few examples from domains where real-time AI and world models are being deployed today:

Human Authors Creating for Other Humans

  • Content Creation: Streamers want to increase audience engagement with VFX, and need to easily multitask
  • Game Development: Developers need precise, reproducible controls
  • Art Installations: Artists need expressive, gestural inputs
  • Character Stylization: Visual designers need to maintain a high level of consistency and fidelity to reference styles
  • Embodied Avatars: Users need detailed control over the physical attributes of their avatar

Human Authors Guiding AI Systems

Even autonomous systems need human-designed control interfaces. Examples include:

  • Automated Content: Creators need to set effective guardrails for AI agents to create content that adheres to certain goals and styles
  • Generative Open-World Games: Game designers need to set rules for consistent procedural generation while maintaining the fun of generative exploration
  • Synthetic World Data: Engineers need to tune prediction parameters and validate outputs

The Controllability Stack for Applications

While all aspects of quality will eventually become commoditized as models improve, controllability will remain a complex, multi-dimensional challenge at all layers of the stack.

For application developers, this presents opportunities to create powerful and differentiated user experiences.

Let's dig deeper into three aspects of controllability that are most relevant to UX: Control Surface, Action Latency, and Workflow Composability.

Control Surface

Control surface refers to how users supply information to a workflow to control its behavior.

Frontier models ultimately ingest data; that data can be supplied by a user in many ways. The choices you make when designing your control surface define your application's expressiveness.

There are many ways you can allow users to control the underlying workflows, including:

  • Text and number entry
  • Gestures
  • Images
  • Gamepads and physical controls
  • Audio / Voice

Within these modalities, there is a nearly unlimited design space.
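One way to think about this design space is as a routing layer that maps raw input signals (a gamepad trigger, a gesture, a typed prompt) onto workflow parameters. The sketch below is a minimal, hypothetical abstraction; the names (`ControlSurface`, `bind`, `apply`) are invented for illustration and don't correspond to any particular framework.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Control:
    """One mapping from a raw input signal to a workflow parameter."""
    param: str                          # workflow parameter to drive
    transform: Callable[[Any], Any]     # raw signal -> parameter value

class ControlSurface:
    """Routes raw input events (text, gesture, gamepad...) to parameters."""
    def __init__(self):
        self.bindings: dict[str, Control] = {}

    def bind(self, signal: str, param: str, transform=lambda v: v):
        self.bindings[signal] = Control(param, transform)

    def apply(self, signal: str, value, params: dict) -> dict:
        # Return an updated copy of the workflow parameters.
        ctrl = self.bindings.get(signal)
        if ctrl is None:
            return params
        return {**params, ctrl.param: ctrl.transform(value)}

# Example: map a gamepad trigger (0..255) to denoise strength (0..1).
surface = ControlSurface()
surface.bind("gamepad.trigger", "strength", lambda v: v / 255)
params = surface.apply("gamepad.trigger", 128, {"strength": 0.5})
```

The same parameter can be bound to very different signals per use case: a slider for a casual user, a MIDI knob for a VJ, an agent-written value for automated content.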

Action Latency

Action latency refers to the time between a user action and a visible response in the output.

Action latency determines whether your application feels like a powerful tool or a tech demo. This requires intentional architecture throughout your entire pipeline: ingest, pre-processing, inference, and transport.

Gaming is a great example of the importance of action latency. These are a few benchmarks for the relationship between perception and latency:

  • Less than 50ms: Feels near instantaneous
  • 50-100ms: Acceptable in many games
  • 100-150ms: May be acceptable in slower-paced games, but not in action or competitive settings
  • More than 150ms: Usually perceived as laggy; poor for real-time interactivity
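Because latency accumulates across the whole pipeline, it helps to treat it as an explicit budget per stage. The sketch below is illustrative; the per-stage numbers are made up, and the thresholds simply encode the benchmarks listed above.

```python
def action_latency_ms(budget: dict) -> float:
    """Sum per-stage latencies for one user action -> visible frame."""
    return sum(budget.values())

def feel(latency_ms: float) -> str:
    # Thresholds mirror the perception benchmarks above.
    if latency_ms < 50:
        return "near instantaneous"
    if latency_ms < 100:
        return "acceptable in many games"
    if latency_ms < 150:
        return "ok only in slower-paced games"
    return "laggy"

# Illustrative (made-up) per-stage numbers for a real-time pipeline.
pipeline = {"ingest": 8, "pre_processing": 5, "inference": 45, "transport": 20}
total = action_latency_ms(pipeline)  # 78 ms
```

A budget like this makes trade-offs explicit: shaving 20ms off inference (e.g. with a smaller model) may matter more to perceived quality than any visual improvement it sacrifices.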

Workflow Composability

Different user groups, even within the same use case, have different needs. Professional content creators might want node-based editors with explicit control over every parameter, whereas casual users might want intelligent defaults with optional refinement.

Moreover, the ancillary requirements of each use case—such as content moderation, foreground/background segmentation, and easy recording—often determine the category winner.

Because small changes to the sequencing and configuration of your workflow can significantly impact your ability to meet the needs of a certain user group, it's crucial to think through how precisely a workflow will be configured.
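One way to keep sequencing flexible is to treat each workflow step as a composable stage, so the same parts can be reordered or omitted per user group. This is a minimal sketch; the stage names (`segment`, `stylize`, `moderate`) are hypothetical placeholders, and `Frame` is a stand-in for a real frame or latent type.

```python
from typing import Callable

Frame = dict  # stand-in for a real frame/latent type
Stage = Callable[[Frame], Frame]

def compose(*stages: Stage) -> Stage:
    """Chain workflow stages left-to-right into one callable."""
    def run(frame: Frame) -> Frame:
        for stage in stages:
            frame = stage(frame)
        return frame
    return run

# Hypothetical stages: same parts, different sequencing per audience.
def segment(f):  return {**f, "mask": "fg/bg"}
def stylize(f):  return {**f, "style": "applied"}
def moderate(f): return {**f, "moderated": True}

# A streamer build: segment first so style only hits the foreground.
streamer_workflow = compose(segment, stylize, moderate)
# A casual build: skip segmentation, rely on defaults.
casual_workflow = compose(stylize, moderate)
```

Node-based tools like ComfyUI expose exactly this kind of recomposition to end users; the application-layer question is how much of it to surface for each audience.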

The Next Frontier of Control

The evolution of controllability won't stop at traditional input devices and UX patterns.

Forward-thinking application developers are experimenting with generative UIs and control patterns such as:

  • Learned Preferences: Applications that observe user choices and automatically adjust control mappings to match individual working styles.
  • Collaborative Control: Multiple users simultaneously directing different aspects of generation, with the application mediating conflicts and maintaining coherence.
  • Predictive Assistance: Applications that anticipate user intent and pre-generate likely next steps, making complex workflows feel effortless.

At the hardware and firmware layers, we're starting to see developments that will transform human-computer interaction:

  • Brain-Computer Interfaces: EEG and EMG inputs that translate thought and muscle signals directly into computer manipulation.
  • Wearable Interfaces: As AR/VR hardware matures, gestural computing and spatial inputs will become standard.
  • Enhanced I/O: Novel communication protocols for agents are emerging, enabling richer collaborative generation between AI systems.

Controllability as a Differentiator

Quality improvements in base models and workflows will continue, but over time they'll become table stakes. The applications that win will be those that build the most expressive, responsive, and flexible control systems around these workflows, and tailor them to serve a specific use case.

AI has changed many fundamentals of product development, but creating a great user experience still comes down to the same thing: deeply understand a user's intent and craft a set of controls that lets them achieve it.

The applications that recognize this early and architect their stack accordingly will define the interaction paradigms that become industry standards.