Hybrid voice activity detection system based on LSTM and auditory speech features


Date

2023

Journal Title

Journal ISSN

Volume Title

Publisher

Elsevier Sci Ltd

Access Rights

info:eu-repo/semantics/closedAccess

Abstract

Voice Activity Detection (VAD), sometimes called Speech Activity Detection, is the process of extracting speech regions from audio recordings that contain many types of sounds. Because undesired data increases computational complexity and wastes time, most speech-based applications process only the speech part (the region of interest) and ignore the rest. This is the main reason VAD serves as a preliminary stage in applications such as automatic speech recognition (ASR), speaker identification/verification, speech enhancement, and speaker diarization. In this study, a semi-supervised, two-stage VAD system, which we named hybrid-VAD, was proposed, especially for environments with a high signal-to-noise ratio (SNR). First, a VAD decision was obtained from a relatively simple Long Short-Term Memory (LSTM) network trained on auditory speech features such as energy, zero crossing rate (ZCR), and 13th-order Mel Frequency Cepstral Coefficients (MFCC). A second VAD decision was then obtained by applying a thresholding strategy to the same features, and the two decisions were combined with logical operators. Surprisingly, the results showed that the final VAD decision has low FEC and OVER errors, which are especially critical for any speaker diarization system, mostly in environments with high SNR.
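The thresholding stage and the logical combination described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, frame sizes, threshold values, and the choice of AND/OR are all assumptions made for illustration, MFCCs are omitted, and the LSTM output is stood in for by a hypothetical boolean array `model_vad`.

```python
import numpy as np


def frame_signal(x, frame_len, hop):
    """Slice a 1-D signal into overlapping frames (rows)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]


def short_time_energy(frames):
    """Mean squared amplitude of each frame."""
    return np.mean(frames ** 2, axis=1)


def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs whose sign differs, per frame."""
    signs = np.signbit(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)


def threshold_vad(energy, zcr, energy_thr, zcr_thr):
    """Rule-based decision: speech if energy is high and ZCR is low.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    return (energy > energy_thr) & (zcr < zcr_thr)


def combine_decisions(model_vad, rule_vad, mode="and"):
    """Fuse two per-frame boolean decisions with a logical operator.

    The abstract does not state which operator is used; both are offered.
    """
    if mode == "and":
        return np.logical_and(model_vad, rule_vad)
    if mode == "or":
        return np.logical_or(model_vad, rule_vad)
    raise ValueError("mode must be 'and' or 'or'")


# Toy input: 0.1 s of silence followed by 0.1 s of a 200 Hz tone at 16 kHz.
sr = 16000
silence = np.zeros(1600)
tone = 0.5 * np.sin(2 * np.pi * 200 * np.arange(1600) / sr)
signal = np.concatenate([silence, tone])

frames = frame_signal(signal, frame_len=400, hop=160)  # 25 ms frames, 10 ms hop
rule_vad = threshold_vad(
    short_time_energy(frames),
    zero_crossing_rate(frames),
    energy_thr=0.01,  # assumed threshold
    zcr_thr=0.3,      # assumed threshold
)

# Hypothetical stand-in for the LSTM's per-frame decision.
model_vad = np.ones(len(rule_vad), dtype=bool)
final_vad = combine_decisions(model_vad, rule_vad, mode="and")
```

With AND, a frame is marked as speech only when both stages agree, which tends to suppress false alarms; with OR, either stage alone suffices, which tends to reduce missed speech. Which trade-off the paper adopts is not stated in the abstract.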

Description

Keywords

Voice Activity Detection, Zero Crossing Rate, MFCC, LSTM

Source

Biomedical Signal Processing and Control

WoS Q Value

Q2

Scopus Q Value

Q1

Volume

80

Issue

Citation