milVAD: A bag-level MNIST modelling of voice activity detection using deep multiple instance learning
Citation
Korkmaz, Y. and Boyacı, A. (2022). milVAD: A bag-level MNIST modelling of voice activity detection using deep multiple instance learning. Biomedical Signal Processing and Control, 74, 103520.
Abstract
Voice Activity Detection (VAD), which serves as an initial step for the majority of applications in the Digital Speech Processing (DSP) area, is defined as the process of identifying speech regions in an audio recording. It is mostly used in automatic speech recognition, speaker identification/verification, speech enhancement, speaker diarization, etc., in order to reduce output errors and increase the overall effectiveness of these systems. In this study, a bag-level MNIST modelling of VAD was proposed using the Deep Multiple Instance Learning (Deep MIL) approach. Because this is, to the best of our knowledge, the first attempt in the literature to model VAD as a MIL problem, we named it "milVAD". The MNIST dataset was modified to obtain a bag-level classifier model for the VAD framework, while the MIL algorithm was implemented inside a Convolutional Neural Network (CNN) as an embedded layer using the Noisy-And pooling method. The proposed modelling scenario achieved a surprisingly high training accuracy of approximately 99.91% after only nine epochs via Deep MIL at bag level. These results showed that MIL can be used efficiently for VAD systems framed as binary classification.
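The Noisy-And pooling step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it follows the commonly cited Noisy-AND formulation, in which the mean of the instance probabilities in a bag is passed through a rescaled sigmoid. The function names and the values of the slope `a` and threshold `b` are assumptions chosen for illustration; in a trained network `b` would typically be a learned parameter rather than a constant.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def noisy_and(instance_probs, a: float = 10.0, b: float = 0.5) -> float:
    """Pool per-instance probabilities into a single bag-level probability.

    a: slope of the activation (hypothetical fixed value).
    b: fraction of positive instances at which the bag output rises
       (hypothetical fixed value; usually learned during training).
    """
    if not instance_probs:
        raise ValueError("a bag needs at least one instance")
    mean_p = sum(instance_probs) / len(instance_probs)
    low = sigmoid(-a * b)          # value of the raw sigmoid at mean_p = 0
    high = sigmoid(a * (1.0 - b))  # value of the raw sigmoid at mean_p = 1
    # Rescale so the output is exactly 0 at mean_p = 0 and 1 at mean_p = 1.
    return (sigmoid(a * (mean_p - b)) - low) / (high - low)


# A bag with mostly positive instances scores higher than a mostly negative one.
mostly_positive = noisy_and([0.9, 0.8, 0.95, 0.7])
mostly_negative = noisy_and([0.1, 0.2, 0.05, 0.3])
```

Because the pooled value depends only on the mean instance probability, the bag-level decision reflects how many instances are positive rather than which ones, which is what allows a classifier to be trained with bag-level labels alone.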
WoS Q Category
Q2
Scopus Q Category
Q1
Volume
74
URI
https://www.sciencedirect.com/science/article/pii/S1746809422000428?via%3Dihub
https://hdl.handle.net/11468/11551