A Leap Forward in Spoken Dialogue: An In-Depth Look at 'SALMONN-omni'

/AI Chasm Catalyst

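Overview: A New Chapter for Full-Duplex Speech LLMs

On May 17, 2025, a paper titled SALMONN-omni was posted to arXiv. According to its authors, it is the first single, standalone LLM to achieve full-duplex speech conversation without audio codecs in its token space. In other words, one model handles listening and speaking at the same time.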

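The Problem: Limits of Existing Systems

Existing full-duplex speech dialogue systems have often used modular architectures that combine components such as a voice activity detector, an interrupter, a conversation state predictor, or multiple LLMs. This approach suffers from error accumulation across modules and struggles with context-dependent barge-in and echo cancellation.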

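Core Mechanisms of SALMONN-omni

1. Streaming speech encoder + LLM + synthesizer in one pipeline

Rather than operating as separate modules with separate functions, a streaming speech encoder, an LLM backbone, and a streaming speech synthesizer interact in a continuous embedding space. This lets the model keep listening and speaking without interruption.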

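2. A 'thinking' strategy

The model is trained to decide for itself, through special tokens, when to speak and when to listen. Dubbed 'thinking', this strategy treats a change of conversational state like the generation of an ordinary text token.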
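To make the idea concrete, here is a minimal sketch of how state changes could ride on ordinary token generation. It is an illustration only, not the paper's implementation: the token names `<start_speak>`, `<stop_speak>` and `<think>`, and the helper functions, are hypothetical and do not come from the article.

```python
from enum import Enum


class State(Enum):
    LISTENING = "listening"
    SPEAKING = "speaking"


# Hypothetical control tokens: the article does not name the real ones.
START_SPEAK = "<start_speak>"
STOP_SPEAK = "<stop_speak>"
THINK = "<think>"  # placeholder emitted while the model keeps listening
CONTROL_TOKENS = (START_SPEAK, STOP_SPEAK, THINK)


def next_state(state, token):
    """Return the dialogue state after the model emits `token`."""
    if state is State.LISTENING and token == START_SPEAK:
        return State.SPEAKING
    if state is State.SPEAKING and token == STOP_SPEAK:
        return State.LISTENING
    return state  # any other token leaves the state unchanged


def spoken_tokens(generated):
    """Collect the ordinary tokens produced while in the SPEAKING state."""
    state = State.LISTENING
    spoken = []
    for token in generated:
        if state is State.SPEAKING and token not in CONTROL_TOKENS:
            spoken.append(token)
        state = next_state(state, token)
    return spoken


stream = [THINK, THINK, START_SPEAK, "It", "is", "raining", STOP_SPEAK, THINK]
print(spoken_tokens(stream))
```

Because the switch is just another token in the output stream, no separate turn-taking module is needed in this sketch; the same decoder that produces words also produces the decision to start or stop talking.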

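3. Periodic synchronization mechanism

The model is made aware of the passage of time: incoming audio and outgoing speech are aligned and processed in fixed-duration time blocks. This keeps the conversation flowing naturally and without gaps.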
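The block-wise alignment can be pictured with a short sketch. It assumes, purely for illustration, that each time block carries a fixed number of input audio frames and a fixed budget of output tokens; the function name and the block sizes are hypothetical and are not taken from the paper.

```python
from math import ceil


def interleave_blocks(audio_frames, response_tokens,
                      frames_per_block=4, tokens_per_block=2):
    """Pair what is heard and what is said within each fixed time block.

    Block sizes are illustrative assumptions, not values from the paper.
    """
    n_blocks = max(
        ceil(len(audio_frames) / frames_per_block),
        ceil(len(response_tokens) / tokens_per_block),
    )
    blocks = []
    for i in range(n_blocks):
        heard = audio_frames[i * frames_per_block:(i + 1) * frames_per_block]
        said = response_tokens[i * tokens_per_block:(i + 1) * tokens_per_block]
        blocks.append((heard, said))
    return blocks


for heard, said in interleave_blocks(list(range(6)), ["a", "b", "c"]):
    print(heard, said)
```

Each loop iteration stands for one tick of the shared clock, so input and output stay in step even when one side has run out of material for that block.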

4. Reinforcement learning (RL)

Reinforcement learning, in particular Direct Preference Optimization (DPO), is applied so that the model learns conversational dynamics more effectively.
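For readers unfamiliar with DPO, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. How SALMONN-omni builds its preference pairs is not described in this article, so the example values and the `beta` setting are assumptions chosen only to show the shape of the objective.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    loss = -log(sigmoid(beta * margin)), where margin measures how much more
    the policy favors the chosen response than the reference model does.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # Numerically stable -log(sigmoid(x)) = softplus(-x).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Illustrative log-probabilities only (not results from the paper).
print(dpo_loss(-10.0, -14.0, -12.0, -12.0))
```

The loss falls below log 2 once the policy prefers the chosen response more strongly than the reference model does, and rises above it in the opposite case.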

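Performance: Impressive Benchmark Results

The model achieves at least a 30% relative performance improvement over existing open-source full-duplex models, with a relative accuracy gain of 35.9% on some metrics.

In turn-taking, the ability to decide when to cut in and when to stop during a conversation, it showed a higher prediction success rate than other models.

It also proved robust and flexible in natural conversational situations such as echo cancellation, barge-in handling, and backchanneling.

It is competitive with half-duplex (turn-based) systems, and does so with substantially less training data than those systems use.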
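Note that the improvement figures above are relative, not absolute percentage points. The short sketch below shows the difference; the scores in it are made-up placeholders for illustration and are not results reported for SALMONN-omni.

```python
def relative_improvement(baseline, new):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# Hypothetical scores: a 12-point absolute gain on a baseline of 40
# corresponds to a 30% relative improvement.
print(relative_improvement(40.0, 52.0))
```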

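Expert View: How the Research Community Sees It

Cited at venues such as ICLR 2025, the work is regarded as the first in the SALMONN series to concretely realize a full-duplex, codec-free strategy.

The comparable MinMo model built a full-duplex-capable architecture on large-scale speech data, but it had limitations such as a large training volume and the need to run a separate LLM instance alongside it.

Against this background, SALMONN-omni is seen as a step forward in efficiency and architectural integration.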

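Use Cases and Examples

Example dialogue: an exchange such as "How's the weather today?" → "It's raining. You'd better take an umbrella." is carried out naturally in real time.

Even on fast-paced datasets such as TriviaQA and AlpacaEval, the turn-taking success rate holds up, allowing the model to respond flexibly in rapid question-answering settings.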

Source: https://arxiv.org/abs/2505.17060

SALMONN-omni: A Standalone Speech LLM without Codec Injection for Full-duplex Conversation

In order to enable fluid and natural human-machine speech interaction, existing full-duplex conversational systems often adopt modular architectures with auxiliary components such as voice activity detectors, interrupters, conversation state predictors, or multiple LLMs. These systems, however, suffer from error accumulation across modules and struggle with key challenges such as context-dependent barge-in and echo cancellation. Recent approaches, most notably Moshi, simplify the pipeline by injecting audio codecs into the token space of a single LLM. However, such methods still incur significant performance degradation when operating on the speech rather than text modality. In this paper, we introduce SALMONN-omni, the first single, standalone full-duplex speech LLM that operates without audio codecs in its token space. It features a novel dynamic thinking mechanism within the LLM backbone, enabling the model to learn when to transition between speaking and listening states. Experiments on widely used benchmarks for spoken question answering and open-domain dialogue show that SALMONN-omni achieves at least 30% relative performance improvement over existing open-source full-duplex models and performs highly competitively to half-duplex and turn-based systems, despite using substantially less training data. Moreover, SALMONN-omni demonstrates strong performance in complex conversational scenarios, including turn-taking, backchanneling, echo cancellation and context-dependent barge-in, with further improvements achieved through reinforcement learning. Some demo conversations between user and SALMONN-omni are provided in the following repository https://github.com/bytedance/SALMONN.

Makers Journal, Editor-in-Chief Lee Gil-hwan (이길환), happytalkman@weai.kr
