Humans naturally use auditory and visual cues to interact with objects beyond sight. Yet most robot manipulation frameworks rely solely on vision, limiting their ability to handle audio-driven tasks such as ring-off and out-of-view events. We propose SoundAct, a sound-aware egocentric manipulation framework that integrates stereo microphones with a wrist-mounted camera for beyond-sight spatial audio reasoning. We encode directional cues from stereo audio as magnitude spectrograms and fuse them with visual features through an attention mechanism, enabling the policy to adapt its reliance on auditory and visual inputs. A spatial audio augmentation method further ensures robustness under audio distractors. On a beyond-sight ring-off task, SoundAct effectively manipulates sound-source objects that begin entirely outside the camera's field of view.
Vision stops at the edge of the frame.
Sound doesn't.
A wrist camera sees only a narrow FOV in front of it. When the target lies outside that FOV, vision offers no direction to search. Stereo audio fills the gap: inter-channel differences implicitly encode where a sound comes from, guiding the arm toward an object long before it appears on screen.
Stereo sound, not contact
Prior work treats audio as a contact or context signal. SoundAct uses stereo audio as a directional cue for action beyond sight.
Egocentric & real-world
SoundAct is a real-world egocentric manipulation policy, trained by behavior cloning, that fuses stereo spatial audio with vision for tasks where precise action matters.
Robust to distractors
A spatial audio augmentation independently scales per-channel noise, teaching the policy to lock onto the target sound amid competing noises.
One policy, three senses.
A multi-modal diffusion policy fuses vision, stereo audio, and proprioception. Self-attention lets it weigh audio when the target is occluded and vision once it comes into view.
Vision encoding
A CLIP-pretrained ViT extracts [CLS] tokens per frame, concatenated into visual latent features.
Stereo audio encoding
Waveforms are converted to magnitude spectrograms via STFT, encoded by a ResNet-18, and projected by an MLP to align with the visual features. This preserves inter-channel level cues while dropping unstable phase.
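As a rough illustration of the first step only, the sketch below computes a 2-channel magnitude spectrogram with NumPy and checks that a left-right loudness difference survives once phase is discarded. The function names, the Hann window, `n_fft=512`, `hop=128`, and the 16 kHz sample rate are assumptions chosen for the example, not values reported for SoundAct; the ResNet-18 encoder and MLP projection are omitted.

```python
import numpy as np

def stft_magnitude(wave, n_fft=512, hop=128):
    """Magnitude STFT of a mono waveform; returns (n_fft // 2 + 1, frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # rfft keeps the non-negative frequencies; abs() discards the phase.
    return np.abs(np.fft.rfft(frames, axis=1)).T

def stereo_magnitude_spectrogram(stereo, n_fft=512, hop=128):
    """stereo: (2, samples) -> (2, freq_bins, frames), one magnitude map per channel."""
    return np.stack([stft_magnitude(ch, n_fft, hop) for ch in stereo])

# Toy signal (assumption): a 1 kHz tone that is twice as loud on the left.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)
stereo = np.stack([1.0 * tone, 0.5 * tone])

spec = stereo_magnitude_spectrogram(stereo)
# Inter-channel level difference in dB, aggregated over all bins.
ild_db = 20 * np.log10(spec[0].sum() / spec[1].sum())
print(spec.shape, round(ild_db, 2))  # (2, 257, 122) 6.02
```

The 6 dB gap between channels is the kind of level cue a downstream encoder could pick up from magnitudes alone.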
Attention fusion → diffusion
A Transformer with self-attention dynamically weighs audio vs. vision by target visibility; the fused state conditions a 1D-UNet diffusion policy.
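To make the fusion step concrete, here is a minimal single-head self-attention sketch in NumPy. It is a simplification under stated assumptions: one token per modality, a shared width `d = 64`, and random projection weights are all hypothetical, and a single attention layer stands in for the Transformer; the diffusion policy it would condition is not shown.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over (n_tokens, d) inputs."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (n_tokens, n_tokens)
    return weights @ v, weights

rng = np.random.default_rng(0)
d = 64  # hypothetical shared feature width

# Hypothetical per-modality tokens, already projected to width d.
vision, audio, proprio = rng.normal(size=(3, d))
tokens = np.stack([vision, audio, proprio])

# Random projections stand in for learned weights.
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

fused, weights = self_attention(tokens, w_q, w_k, w_v)
state = fused.mean(axis=0)  # pooled vector that could condition a policy head
```

Each row of `weights` sums to 1 and says how much one token draws on vision, audio, and proprioception. In a trained model these rows are what would shift toward audio when the target is out of view.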
Spatial audio augmentation
Identical noise on both channels never perturbs directional cues. So a distractor clip n sampled from ESC-50 (following ManiWAV) is mixed with independent per-channel scaling, simulating audio distractors with varying left–right balance:
s_c ~ U(0, 1), drawn independently for each channel c ∈ {L, R}
By exposing the policy to diverse per-channel distractor configurations, it learns to distinguish the target source from competitors and avoids overfitting to absolute loudness.
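The augmentation described above can be sketched in a few lines of NumPy. This assumes the distractor is added to the waveform (the text says only that it is "mixed"); the function name and the toy signals are hypothetical, and loading clips from ESC-50 is left out.

```python
import numpy as np

def spatial_audio_augment(stereo, distractor, rng):
    """Mix a mono distractor into a stereo clip with independent per-channel gains.

    stereo:     (2, samples) target recording
    distractor: (samples,) mono clip n
    Returns the augmented clip and the two sampled gains.
    """
    # One gain per channel, each drawn independently from U(0, 1).
    scales = rng.uniform(0.0, 1.0, size=(2, 1))
    # Additive mixing is an assumption of this sketch.
    return stereo + scales * distractor[None, :], scales.ravel()

rng = np.random.default_rng(0)
target = np.stack([np.ones(8), 0.5 * np.ones(8)])  # toy stereo clip
noise = rng.normal(size=8)                         # toy distractor clip

mixed, scales = spatial_audio_augment(target, noise, rng)
```

Sampling a single gain and applying it to both channels would add identical noise to each, which is the case the text notes never perturbs directional cues.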
Task: Beyond-sight ring-off.
The policy is trained on 60 teleoperated episodes with the target placed in a 20 cm × 20 cm region on either side outside the camera's view. It is then evaluated over 10 in-distribution trials and 5 trials per audio-distractor type (music, conversation, typing, another beep, another clock alarm).
| Method | Success / Trials |
|---|---|
| In-Distribution · (a) Modalities | |
| Vision-Only | 5 / 10 |
| Audio-Only | 7 / 10 |
| Mono Audio (Only Left) | 8 / 10 |
| Stereo Audio (Ours) | 10 / 10 |
| In-Distribution · (b) Audio representation | |
| Log-Mel Spectrogram | 5 / 10 |
| STFT Spectrogram 4-Ch (Mag&Phase) | 2 / 10 |
| STFT Spectrogram 2-Ch Mag (Ours) | 10 / 10 |
| Audio distractors · (c) Audio generalization | |
| No Audio Augmentation | 0 / 5 |
| Background Audio Augmentation | 0 / 5 |
| Spatial Audio Augmentation (Ours) | 4 / 5 |
Modalities
Vision-only misses out-of-view cues and mono audio lacks direction; audio-only localizes but lacks precision. Fusing stereo audio with vision performs best.
Representation
Log-mel loses spatial cues and 4-channel mag&phase is over-sensitive to reflections. The 2-channel magnitude spectrogram gives the best localization.
Generalization
Spatial augmentation brings robustness to distractor loudness and context shifts, though periodic beeps similar to the target sound remain challenging.
Limitations & future work. The current evaluation is confined to a task-specific environment. Future work targets in-the-wild settings and more complex visual scenes for robust object interaction.