arxiv:2102.01192

Generative Spoken Language Modeling from Raw Audio

Published on Feb 1, 2021

Abstract

Generative Spoken Language Modeling learns acoustic and linguistic characteristics from raw audio without labels and evaluates learned representations using proposed metrics.

We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text), all trained without supervision, and validate the proposed metrics with human evaluation. Across 3 speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.
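To make the encoder and language-model stages of this pipeline concrete, here is a minimal toy sketch. It is not the paper's implementation. Everything in it is an assumption for illustration: nearest-centroid assignment stands in for the discrete speech encoder, a smoothed bigram model stands in for the generative language model, and the names `quantize`, `collapse_repeats`, and `BigramUnitLM` are hypothetical. The speech decoder (units to waveform) is omitted.

```python
import math
from collections import Counter, defaultdict


def quantize(frames, centroids):
    """Map each feature vector to the index of its nearest centroid.

    Toy stand-in for a discrete speech encoder: the returned indices
    play the role of pseudo-text units.
    """
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(centroids)), key=lambda k: dist2(frame, centroids[k]))
        for frame in frames
    ]


def collapse_repeats(units):
    """Merge runs of identical consecutive units into a single unit."""
    out = []
    for u in units:
        if not out or out[-1] != u:
            out.append(u)
    return out


class BigramUnitLM:
    """Add-one smoothed bigram model over unit sequences (illustrative only)."""

    def __init__(self, num_units):
        self.num_units = num_units
        self.counts = defaultdict(Counter)

    def fit(self, sequences):
        for seq in sequences:
            for prev, nxt in zip(seq, seq[1:]):
                self.counts[prev][nxt] += 1

    def prob(self, prev, nxt):
        c = self.counts[prev]
        return (c[nxt] + 1) / (sum(c.values()) + self.num_units)

    def log_prob(self, seq):
        return sum(math.log(self.prob(p, n)) for p, n in zip(seq, seq[1:]))


# Hypothetical 2-D "features" and 3 centroids (real systems use 50 to 200 units).
centroids = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
frames = [(0.1, 0.0), (0.0, 0.1), (0.9, 1.1), (0.1, 0.9), (0.1, 0.95)]

units = collapse_repeats(quantize(frames, centroids))  # [0, 1, 2]

lm = BigramUnitLM(num_units=len(centroids))
lm.fit([[0, 1, 2], [0, 1, 2], [0, 2, 1]])
print(units, lm.log_prob(units))
```

In this toy setup, the number of centroids corresponds to the unit count (50, 100, or 200) that the abstract reports as task-dependent and encoder-dependent. A sequence seen often in training, such as `[0, 1, 2]`, scores a higher log-probability than an unseen ordering such as `[2, 1, 0]`.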


