SoundAct: Learning Spatial Sound Awareness for Egocentric Robot Manipulation with Stereo Audio

Abstract

Humans naturally use auditory and visual cues to interact with objects beyond sight. Yet most robot manipulation frameworks rely solely on vision, limiting their ability to handle audio-driven tasks such as ring-off and out-of-view events. We propose SoundAct, a sound-aware egocentric manipulation framework that integrates stereo microphones with a wrist-mounted camera for beyond-sight spatial audio reasoning. We encode directional cues from stereo audio as magnitude spectrograms and fuse them with visual features through an attention mechanism, enabling the policy to adapt its reliance on auditory and visual inputs. A spatial audio augmentation method further ensures robustness under audio distractors. On a beyond-sight ring-off task, SoundAct effectively manipulates sound-source objects that begin entirely outside the camera's field of view.

The Key Idea

Vision stops at the edge of the frame.
Sound doesn't.

A wrist camera sees only a narrow field of view (FOV) in front of it. When the target lies outside that FOV, vision offers no direction to search. Stereo audio fills the gap: inter-channel differences implicitly encode where a sound comes from, guiding the arm toward an object long before it appears in the frame.

**Overview of SoundAct.** (a) System setup: a stereo microphone array and a wrist-mounted egocentric camera on a Franka arm. (b) Beyond-sight ring-off task: the robot localizes a ringing alarm clock outside its field of view using stereo audio, then turns it off once the object becomes visible.

Stereo sound, not contact

Prior work treats audio as a contact or context signal. SoundAct uses stereo audio as a directional cue for action beyond sight.

Egocentric & real-world

A real-world egocentric manipulation policy that fuses stereo spatial audio with vision via behavior cloning, where precise action matters.

Robust to distractors

A spatial audio augmentation independently scales per-channel noise, teaching the policy to lock onto the target sound amid competing noises.

Method

One policy, three senses.

A multi-modal diffusion policy fuses vision, stereo audio, and proprioception. Self-attention lets it rely on audio while the target is out of view and shift to vision once the target enters the frame.

**SoundAct architecture.** A CLIP-pretrained ViT encodes egocentric frames; stereo audio is converted to magnitude spectrograms via STFT and encoded by a ResNet-18. Multi-modal self-attention fuses the two streams with end-effector pose, conditioning a diffusion policy that denoises 16-step relative action trajectories [Δxyz, Δrot, Δgrip] over 50 denoising steps.

Vision encoding

A CLIP-pretrained ViT extracts [CLS] tokens per frame, concatenated into visual latent features.

Stereo audio encoding

Each channel's waveform is converted to a magnitude spectrogram via STFT, which preserves inter-channel level cues while dropping unstable phase. A ResNet-18 encodes the two-channel spectrogram, and an MLP projects the result to align with the visual features.
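As a rough illustration of the two-channel magnitude representation, the sketch below computes a per-channel STFT with NumPy and keeps only the magnitude. The function name `stereo_magnitude_spectrogram`, the 16 kHz sample rate, the FFT size, the hop length, and the Hann window are all assumptions made for this example; the source does not state the STFT parameters it uses.

```python
import numpy as np

def stereo_magnitude_spectrogram(wave, n_fft=512, hop=128):
    """Magnitude STFT of a stereo clip.

    wave: float array of shape (2, num_samples), left channel first.
    Returns an array of shape (2, n_fft // 2 + 1, num_frames).
    n_fft and hop are illustrative values, not taken from the source.
    """
    window = np.hanning(n_fft)
    num_frames = 1 + (wave.shape[1] - n_fft) // hop
    # Index matrix that slices each channel into overlapping frames.
    idx = np.arange(n_fft)[None, :] + hop * np.arange(num_frames)[:, None]
    frames = wave[:, idx] * window          # (2, num_frames, n_fft)
    spec = np.fft.rfft(frames, axis=-1)     # complex STFT per channel
    mag = np.abs(spec)                      # keep magnitude, discard phase
    return mag.transpose(0, 2, 1)           # (2, freq_bins, num_frames)

# Toy input: a 1 kHz tone that is half as loud in the right channel.
sample_rate = 16_000                        # assumed sample rate
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000.0 * t)
stereo = np.stack([tone, 0.5 * tone])

mag = stereo_magnitude_spectrogram(stereo)
# Inter-channel level difference in dB, recovered from magnitudes alone.
ild_db = 20 * np.log10(mag[0].sum() / mag[1].sum())
print(mag.shape, round(float(ild_db), 2))
```

In this toy case the level difference between channels (about 6 dB for a factor of two in amplitude) survives the removal of phase, which is the property the magnitude-only representation relies on.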

Attention fusion → diffusion

A Transformer with self-attention dynamically weighs audio vs. vision by target visibility; the fused state conditions a 1D-UNet diffusion policy.
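The fusion step can be pictured as self-attention over one token per modality. The following is a minimal single-head sketch with random, untrained weights; the token dimension, the single head, the helper names, and the flattening of the fused tokens into one conditioning vector are assumptions for illustration and are not details given in the source.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    tokens: (num_tokens, d). Returns (fused_tokens, attention_weights).
    """
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)      # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
d = 64                                      # assumed token dimension
# Placeholder features standing in for the three encoders' outputs.
vision_tok = rng.standard_normal(d)
audio_tok = rng.standard_normal(d)
proprio_tok = rng.standard_normal(d)
tokens = np.stack([vision_tok, audio_tok, proprio_tok])

w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
fused, weights = self_attention(tokens, w_q, w_k, w_v)

# One possible way to form the conditioning vector for the policy.
condition = fused.reshape(-1)
print(weights.round(2))
```

Each row of `weights` shows how much one modality's token draws on the others. In a trained model these weights depend on the input, which is the mechanism that would let the policy lean on audio or on vision depending on target visibility.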

Spatial audio augmentation

Identical noise on both channels never perturbs directional cues. So a distractor clip n sampled from ESC-50 (following ManiWAV) is mixed with independent per-channel scaling, simulating audio distractors with varying left–right balance:

x̃_L = x_L + s_L·n, x̃_R = x_R + s_R·n,
s_L, s_R ~ U(0,1), sampled independently

Exposure to diverse per-channel distractor configurations teaches the policy to distinguish the target source from competitors and to avoid overfitting to absolute loudness.
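The mixing rule above can be sketched in a few lines of NumPy. The function name `spatial_audio_augment`, the clip length, and the synthetic Gaussian signals standing in for a target recording and an ESC-50 clip are assumptions for this example; only the per-channel mixing with gains drawn from U(0,1) comes from the text.

```python
import numpy as np

def spatial_audio_augment(stereo, distractor, rng):
    """Mix a mono distractor into a stereo clip with independent gains.

    stereo: (2, T) target recording, left channel first.
    distractor: (T,) mono distractor clip n.
    Returns the augmented clip and the sampled gains (s_left, s_right).
    """
    # Independent per-channel gains, each drawn from U(0, 1).
    s_left, s_right = rng.uniform(0.0, 1.0, size=2)
    out = stereo.astype(np.float64)         # astype returns a copy
    out[0] += s_left * distractor
    out[1] += s_right * distractor
    return out, (float(s_left), float(s_right))

rng = np.random.default_rng(0)
num_samples = 16_000                        # assumed clip length
target = 0.1 * rng.standard_normal((2, num_samples))   # stand-in recording
distractor = rng.standard_normal(num_samples)          # stand-in ESC-50 clip

augmented, (s_left, s_right) = spatial_audio_augment(target, distractor, rng)
print(augmented.shape, round(s_left, 3), round(s_right, 3))
```

Because the two gains are drawn separately, the distractor's left-right balance changes from sample to sample. Using a single shared gain would always present the distractor equally in both channels, which is the case the text describes as leaving directional cues unperturbed.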

Experiments

Task: Beyond-sight ring-off.

Trained on 60 teleoperated episodes with the target placed in a 20 cm × 20 cm region on either beyond-sight side, then evaluated over 10 in-distribution trials and 5 trials per audio-distractor type (music, conversation, typing, another beep, another clock alarm).

**Beyond-sight ring-off evaluation.** Vision-only fails to localize the out-of-view source and audio-only fails to turn it off in time; without spatial audio augmentation the policy locks onto a distractor. SoundAct localizes via stereo audio and successfully turns the target off. A third-person camera is shown for visualization only.

Success rates on the beyond-sight ring-off task
Method	Success / Trials
In-Distribution · (a) Modalities
Vision-Only	5 / 10
Audio-Only	7 / 10
Mono Audio (Only Left)	8 / 10
Stereo Audio (Ours)	10 / 10
In-Distribution · (b) Audio representation
Log-Mel Spectrogram	5 / 10
STFT Spectrogram 4-Ch (Mag&Phase)	2 / 10
STFT Spectrogram 2-Ch Mag (Ours)	10 / 10
Audio distractors · (c) Audio generalization
No Audio Augmentation	0 / 5
Background Audio Augmentation	0 / 5
Spatial Audio Augmentation (Ours)	4 / 5

Rows labeled (Ours) are SoundAct configurations. Stereo audio guides the beyond-sight search while vision enables a precise final turn-off: the combination solves every in-distribution trial, and only spatial audio augmentation holds up under unseen distractors (4 / 5, versus 0 / 5 without it or with background audio augmentation).

Modalities

Vision-only misses out-of-view cues and mono audio lacks direction; audio-only localizes but lacks precision. Fusing stereo audio with vision performs best.

Representation

Log-mel loses spatial cues and 4-channel mag&phase is over-sensitive to reflections. The 2-channel magnitude spectrogram gives the best localization.

Generalization

Spatial augmentation brings robustness to distractor loudness and context shifts, though periodic beeps similar to the target remain challenging.

Limitations & future work. The current evaluation is confined to a task-specific environment. Future work targets in-the-wild settings and more complex visual scenes for robust object interaction.

Supplementary Video

Vision stops at the edge of the frame. Sound doesn't.