szuweifu committed
Commit 9bfa26c · verified · 1 Parent(s): efab889

Add SEMamba safetensors checkpoint and config
Files changed (3)
  1. README.md +6 -159
  2. config.json +38 -2
  3. model.safetensors +3 -0
README.md CHANGED
@@ -1,163 +1,10 @@
  ---
- license: other
- track_downloads: true
- pipeline_tag: audio-to-audio
- library_name: mamba-ssm
  tags:
- - universal speech enhancement
- - multiple input sampling rates
- - language-agnostic
  ---
- # **<span style="color:#76b900;">🤫 RE-USE: Multilingual Universal Speech Enhancement</span>**
- # Model Overview
-
- ## Description
- In universal speech enhancement, the goal is to restore the **quality** of diverse degraded speech while preserving **fidelity**, i.e., ensuring that all other factors remain unchanged, such as linguistic content, speaker identity, emotion, accent, and other paralinguistic attributes. Inspired by the **distortion–perception trade-off theory**, our proposed single model achieves a good balance between these two objectives and has the following desirable properties:
-
- - Robustness to **diverse degradations**, including additive noise, reverberation, clipping, bandwidth limitation, codec artifacts, packet loss, and low-quality microphones.
- - Support for **multiple input sampling rates**: 8, 16, 22.05, 24, 32, 44.1, and 48 kHz.
- - Strong **language-agnostic** capability, enabling effective performance across different languages.
-
- This model is for research and development only.
-
- ## Usage
- Try our [**Gradio Interactive Demo**](https://huggingface.co/spaces/nvidia/RE-USE) directly by uploading your noisy audio/video!
-
- ## Environment Setup
- 1. Pre-built Docker environments can be downloaded [here](https://github.com/RoyChao19477/SEMamba?tab=readme-ov-file#-docker-support) to simplify the **Mamba** setup.
-
- 2. If you need bandwidth extension:
-
- ```bash
- pip install resampy
- ```
- 3. Download and navigate to the Hugging Face repository:
- ```bash
- huggingface-cli download nvidia/RE-USE --local-dir ./REUSE --local-dir-use-symlinks False
- cd ./REUSE
- ```
-
- ## Inference
- Follow the simple steps below to generate enhanced speech with our model:
- 1. Place your noisy speech files in the folder `noisy_audio/`.
- 2. Run the following command:
- ```bash
- sh inference.sh
- ```
- 3. The enhanced speech files will be saved in `enhanced_audio/`.
-
- That's all!
-
- **Note:**
-
- a. You can enable bandwidth extension by setting the target bandwidth via the `BWE` argument in the script.
-
- ---
-
- If your noisy speech files are **long and may cause GPU out-of-memory (OOM)** errors, use the following procedure instead:
- 1. Place your long noisy speech files in the folder `long_noisy_audio/`.
- 2. Run the following command:
- ```bash
- sh inference_chunk_wise.sh
- ```
- 3. The enhanced speech files will be saved in `Long_enhanced_audio/`.
-
- **Note:**
-
- a. You can enable bandwidth extension by setting the target bandwidth via the `BWE` argument in the script.
-
- b. You can also configure `chunk_size_in_seconds` and `hop_length_portion` directly in the script.
-
- ---
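The chunk-wise procedure above amounts to an overlap-add loop over fixed-length windows. Below is a minimal pure-Python sketch of that idea, not the repository's actual `inference_chunk_wise.sh` logic; `enhance_fn` is a hypothetical stand-in for the model call, and the parameter names mirror the script's `chunk_size_in_seconds` and `hop_length_portion`:

```python
def enhance_chunked(audio, enhance_fn, sr, chunk_size_in_seconds=10.0, hop_length_portion=0.5):
    """Enhance a long signal chunk by chunk, averaging overlapping outputs."""
    chunk = int(chunk_size_in_seconds * sr)        # chunk length in samples
    hop = max(1, int(chunk * hop_length_portion))  # stride between chunk starts
    out = [0.0] * len(audio)
    weight = [0.0] * len(audio)
    start = 0
    while start < len(audio):
        enhanced = enhance_fn(audio[start:start + chunk])
        for i, v in enumerate(enhanced):
            out[start + i] += v                    # accumulate overlapping outputs
            weight[start + i] += 1.0
        if start + chunk >= len(audio):
            break                                  # last chunk reached the end
        start += hop
    return [o / w for o, w in zip(out, weight)]    # average where chunks overlap
```

Averaging the overlapped regions smooths chunk-boundary artifacts; a smaller `hop_length_portion` means more overlap and more compute. The repository script presumably also handles file I/O and model loading on top of this.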
-
- ## License/Terms of Use
- This model is released under the [NVIDIA One-Way Noncommercial License (NSCLv1)](https://github.com/NVlabs/HMAR/blob/main/LICENSE).
-
- ## Deployment Geography
- Global.
-
- ## Use Case
- Researchers and general users can use this model to enhance the quality of their speech data.
-
- ## Release Date
- Hugging Face 2026/03/18
-
- ## References
- [1] [Rethinking Training Targets, Architectures and Data Quality for Universal Speech Enhancement](https://arxiv.org/abs/2603.02641), 2026.
- (Note: The released model checkpoint differs from the one reported in the paper. It incorporates additional degradation types (e.g., microphone response and more codecs) and is fine-tuned on a smaller, high-quality clean subset.)
-
- ## Model Architecture
- **Architecture Type:** Convolutional encoder, convolutional decoder, and Mamba for time–frequency modeling <br>
- **Network Architecture:** Bidirectional Mamba with 30 layers <br>
- **Number of model parameters:** 9.6M <br>
-
- ## Input
- Input Type(s): Audio <br>
- Input Format(s): .wav files <br>
- Input Parameters: One-Dimensional (1D) <br>
- Other Properties Related to Input: 8000 Hz - 48000 Hz mono-channel audio <br>
-
- ## Output
- Output Type(s): Audio <br>
- Output Format: .wav files <br>
- Output Parameters: One-Dimensional (1D) <br>
- Other Properties Related to Output: 8000 Hz - 48000 Hz mono-channel audio <br>
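Since the card lists a discrete set of supported input rates, a caller with audio at some other rate would need to resample first. A tiny illustrative helper for picking the closest listed rate (not part of the released code; the helper name is hypothetical):

```python
# Input sampling rates listed in the model card, in Hz.
SUPPORTED_RATES_HZ = [8000, 16000, 22050, 24000, 32000, 44100, 48000]

def nearest_supported_rate(sr_hz):
    """Return the listed rate closest to sr_hz (hypothetical helper)."""
    return min(SUPPORTED_RATES_HZ, key=lambda r: abs(r - sr_hz))
```

For example, 11025 Hz maps to 8000 Hz; whether rounding down or up is preferable depends on the true bandwidth of the content.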
-
- Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
-
- ## Software Integration
- **Runtime Engine(s):**
- * Not Applicable (N/A)
-
- **Supported Hardware Microarchitecture Compatibility:**
- * NVIDIA Ampere (A100)
-
- **Preferred Operating System(s):**
- * Linux
-
- The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
-
- ## Model Version(s)
- Current version: 30USEMamba_peak+GAN_tel_mic_1134k
-
- ## Training Datasets
- **Data Modality:**
- Audio
-
- **Audio Training Data Size:**
- Less than 10,000 hours
-
- * [LibriVox data from DNS5 challenge (EN)](https://github.com/microsoft/DNS-Challenge/tree/master) (~350 hours of speech data)
- * [LibriTTS (EN)](https://openslr.org/60/) (~200 hours of speech data)
- * [VCTK (EN)](https://datashare.ed.ac.uk/handle/10283/3443) (~80 hours of speech data)
- * [WSJ (EN)](https://catalog.ldc.upenn.edu/LDC93S6A) (~85 hours of speech data)
- * [EARS (EN)](https://sp-uhh.github.io/ears_dataset/) (~100 hours of speech data)
- * [Multilingual Librispeech (De, En, Es, Fr)](https://www.openslr.org/94/) (~450 hours of speech data)
- * [CommonVoice 19.0 (De, En, Es, Fr, zh-CN)](https://huggingface.co/datasets/fsicoli/common_voice_19_0) (~1300 hours of speech data)
- * [Audioset+FreeSound noise in DNS5 challenge](https://github.com/microsoft/DNS-Challenge/tree/master) (~180 hours of noise data)
- * [WHAM! Noise](http://wham.whisper.ai/) (~80 hours of noise data)
- * [FSD50K (human voice filtered)](https://huggingface.co/datasets/Fhrozen/FSD50k) (~100 hours of non-speech data)
- * [(Part of) Free Music Archive (medium)](https://github.com/mdeff/fma) (~200 hours of non-speech data)
- * [Simulated RIRs from DNS5 challenge](https://github.com/microsoft/DNS-Challenge/tree/master) (~60k samples of room impulse response)
- * [MicIRP](https://micirp.blogspot.com/p/about-micirp.html) (~70 samples of microphone impulse response)
-
- ## Inference
- **Acceleration Engine:** None <br>
- **Test Hardware:** NVIDIA A100
-
- ## Ethical Considerations
- NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
- Please report model quality, risk, security vulnerabilities, or NVIDIA AI concerns [here](https://app.intigriti.com/programs/nvidia/nvidiavdp/detail).
-
- ## Citation
- Please consider citing our paper and this framework if they are helpful in your research.
-
- ```bibtex
- @article{fu2026rethinking,
-   title={Rethinking Training Targets, Architectures and Data Quality for Universal Speech Enhancement},
-   author={Fu, Szu-Wei and Chao, Rong and Yang, Xuesong and Huang, Sung-Feng and Zezario, Ryandhimas E and Nasretdinov, Rauf and Juki{\'c}, Ante and Tsao, Yu and Wang, Yu-Chiang Frank},
-   journal={arXiv preprint arXiv:2603.02641},
-   year={2026}
- }
- ```
 
  ---
  tags:
+ - model_hub_mixin
+ - pytorch_model_hub_mixin
  ---

+ This model has been pushed to the Hub using the [PyTorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
+ - Code: [More Information Needed]
+ - Paper: [More Information Needed]
+ - Docs: [More Information Needed]
config.json CHANGED
@@ -1,4 +1,40 @@
  {
- "model_type": "mamba",
- "architectures": ["SEMamba"]
  }
 
  {
+ "env_setting": {
+ "checkpoint_interval": 5000,
+ "dist_cfg": {
+ "dist_backend": "nccl",
+ "dist_url": "tcp://localhost:19478",
+ "world_size": 1
+ },
+ "num_gpus": 8,
+ "num_workers": 20,
+ "persistent_workers": true,
+ "pin_memory": true,
+ "prefetch_factor": 8,
+ "seed": 1234,
+ "stdout_interval": 5000,
+ "validation_interval": 5000
+ },
+ "model_cfg": {
+ "beta": 2.0,
+ "compress_factor": "relu_log1p",
+ "d_conv": 4,
+ "d_state": 16,
+ "expand": 4,
+ "hid_feature": 64,
+ "inner_mamba_nlayer": 1,
+ "input_channel": 2,
+ "mapping": true,
+ "nonlinear": "None",
+ "norm_epsilon": 1e-05,
+ "num_tfmamba": 30,
+ "output_channel": 1
+ },
+ "stft_cfg": {
+ "hop_size": 40,
+ "n_fft": 320,
+ "sampling_rate": 8000,
+ "sfi": true,
+ "win_size": 320
+ }
  }
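The `stft_cfg` block above fixes the analysis resolution at the 8 kHz base rate. A quick plain-Python sanity check of what those numbers mean in time and frequency terms (values copied from the diff; the `sfi` flag presumably enables a sampling-frequency-independent mode, though that reading is inferred from the name):

```python
import json

# stft_cfg values copied from the config.json diff above.
stft_cfg = json.loads('{"hop_size": 40, "n_fft": 320, "sampling_rate": 8000, "win_size": 320}')

win_ms = 1000 * stft_cfg["win_size"] / stft_cfg["sampling_rate"]  # 40 ms analysis window
hop_ms = 1000 * stft_cfg["hop_size"] / stft_cfg["sampling_rate"]  # 5 ms frame hop
n_freq_bins = stft_cfg["n_fft"] // 2 + 1                          # 161 one-sided bins
print(win_ms, hop_ms, n_freq_bins)  # 40.0 5.0 161
```

So each spectral frame covers 40 ms of audio with a new frame every 5 ms, giving 161 one-sided frequency bins.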
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:87a4a970ce9aa79d5d92e71899ab034defcf13d93e5e4393ec0dc7db6d4ec048
+ size 38592940
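Note that the `model.safetensors` content shown above is a Git LFS pointer, not the checkpoint itself; the actual weights are fetched when the repository is downloaded. The pointer format is simple key/value lines, which a small illustrative parser can read:

```python
def parse_lfs_pointer(text):
    """Split a Git LFS pointer file into its key/value fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

# Pointer content copied from the diff above.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:87a4a970ce9aa79d5d92e71899ab034defcf13d93e5e4393ec0dc7db6d4ec048
size 38592940"""

info = parse_lfs_pointer(pointer)
print(info["size"])  # 38592940
```

The ~38.6 MB size is consistent with the model card's 9.6M parameters stored as 32-bit floats (9.6M * 4 bytes ≈ 38.4 MB).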