
Browsing by Author "Korkmaz, Yunus"

Now showing 1 - 5 of 5
  • Item
    A comprehensive Turkish accent/dialect recognition system using acoustic perceptual formants
    (Elsevier Sci Ltd, 2022) Korkmaz, Yunus; Boyaci, Aytug
    Accent or dialect is one of the hot topics of emerging technology in speech processing. In an audio recording, extracting accent clues from a speech signal can help investigators form an idea about where a speaker is from. It is mostly used for detecting the regional origin/ethnicity of speakers in real-time surveillance systems, as well as in demographic research on how human voices vary by geography. In this work, Turkish accent analysis was performed using the formant frequencies (F1, F2 and F3) of vowels. We divided our work into two approaches: statistical and classification. In both, evaluations were done by virtually splitting Turkey's map into 2 and 3 dialect regions. In total, 112 monolingual university students (72 males, 40 females) uttered 103 meaningful Turkish syllables. Because formant frequencies can vary with gender, males and females were evaluated separately in both the statistical and classification analyses. The results surprisingly showed that the isolated vowel 'e' in particular can classify a male speaker known in advance to be from the Mediterranean or Eastern Anatolia region with an accuracy of 90% using a KNN classifier. (c) 2022 Elsevier Ltd. All rights reserved.
  • Item
    Hybrid voice activity detection system based on LSTM and auditory speech features
    (Elsevier Sci Ltd, 2023) Korkmaz, Yunus; Boyaci, Aytug
    Voice Activity Detection (VAD), sometimes called Speech Activity Detection, is the process of extracting speech regions from audio recordings that contain many types of sounds. Because undesired data causes both computational complexity and wasted time, most speech-based applications consider only the speech part (the region of interest) and ignore the rest. This is the main reason VAD serves as a preliminary stage in applications like automatic speech recognition (ASR), speaker identification/verification, speech enhancement, speaker diarization, etc. In this study, a successful two-stage semi-supervised VAD system, which we named hybrid-VAD, was proposed, especially for environments with a high signal-to-noise ratio (SNR). First, a VAD decision was obtained from a relatively simple Long Short-Term Memory (LSTM) network trained on auditory speech features such as energy, zero crossing rate (ZCR) and 13th-order Mel Frequency Cepstral Coefficients (MFCCs). After applying a reasonable thresholding strategy to the same features to obtain a second VAD decision, we combined both decisions with logical operators. The results surprisingly showed that the final VAD decision has low FEC and OVER errors, which are especially critical for any speaker diarization system, mostly in environments with high SNR.
  • Item
    milVAD: A bag-level MNIST modelling of voice activity detection using deep multiple instance learning
    (Elsevier SCI LTD., 2022) Korkmaz, Yunus; Boyacı, Aytuğ
    Voice Activity Detection (VAD), used as an initial step in the majority of Digital Speech Processing (DSP) applications, is defined as the process of identifying speech regions in an audio recording. It is mostly used in automatic speech recognition, speaker identification/verification, speech enhancement, speaker diarization, etc. to reduce output errors and increase the overall effectiveness of those systems. In this study, a bag-level MNIST modelling of VAD was proposed using the Deep Multiple Instance Learning (Deep MIL) approach. To the best of our knowledge, this is the first attempt in the literature to model VAD as a MIL problem, so we named it "milVAD". The MNIST dataset was modified to obtain a bag-level classifier model for the VAD framework, while the MIL algorithm was implemented inside a Convolutional Neural Network (CNN) as an embedded layer using the Noisy-And pooling method. The proposed modelling scenario surprisingly achieved high training accuracy of approximately 99.91% in only nine epochs via Deep MIL at bag level. These results showed that MIL can be used efficiently for VAD systems as a binary classification approach.
  • Item
    SS-ESC: a spectral subtraction denoising based deep network model on environmental sound classification
    (Springer London Ltd, 2025) Korkmaz, Yunus
    Environmental Sound Classification (ESC), also referred to as Sound Event Classification, is an essential part of many speech processing applications in terms of separating background audio from the original signal. With recent developments in deep learning, ESC studies have also improved significantly. Because of the nature of digital sound signals, ESC systems have so far mostly been developed using manually extracted one-dimensional (1D) features. In this paper, a novel ESC pipeline based on deep learning architectures, which uses spectral subtraction denoising as a preliminary stage, was proposed. The well-known deep learning architectures GoogLeNet, AlexNet, ShuffleNet, SqueezeNet and ResNet-18 were run on the ESC problem using the ESC-10 benchmark dataset. Log-mel spectrogram images were used as feature matrices for these networks. The results showed that the proposed SS-ESC model achieved the best results and outperformed many state-of-the-art methods, with a test accuracy of 99.17% on ESC-10 with the help of AlexNet. These findings showed that spectral subtraction denoising, when used as a preliminary stage, can improve classification accuracy in environmental sound classification.
  • Item
    Unsupervised and supervised VAD systems using combination of time and frequency domain features
    (Elsevier Sci Ltd, 2020) Korkmaz, Yunus; Boyaci, Aytug
    Voice Activity Detection (VAD), also referred to as Speech Activity Detection (SAD), is the process of identifying speech/non-speech regions in digital speech recordings. It is used as a preliminary stage to reduce errors and increase effectiveness in most speech-based applications, like automatic speech recognition (ASR), speaker identification/verification, speech enhancement, speaker diarization, etc. In this study, two independent VAD structures were proposed for unsupervised and supervised approaches, using both time and frequency domain features. In the supervised approach, autocorrelation-based pitch contour estimation was used together with a 1NN Cosine classifier trained on a 21-column feature matrix comprising energy, zero crossing rate (ZCR), 13th-order Mel Frequency Cepstral Coefficients (MFCCs) and Shannon entropies of a Daubechies-filtered, depth-5 Wavelet Packet Transform (WPT) to obtain the VAD decision, while in the unsupervised approach, methods like normalization, thresholding and median filtering were applied to the same feature set. The proposed unsupervised VAD achieved error rates of 4%, 19%, 0.02% and 0.7% for FEC, MSC, OVER and NDS, respectively, at 0 dB SNR. The VAD decisions of both the supervised and unsupervised systems showed that the proposed methods can be used efficiently either in silent environments or in environments with noise similar to Additive White Gaussian Noise (AWGN). (C) 2020 Elsevier Ltd. All rights reserved.
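The accent-recognition abstract above rests on vowel formant frequencies (F1, F2, F3). A common way to estimate them, not necessarily the one the authors used, is LPC root-finding; the sketch below is a minimal numpy version, and the filter order, pre-emphasis coefficient and synthetic test signal are all illustrative assumptions.

```python
import numpy as np

def estimate_formants(signal, fs, order=8):
    """Estimate formant frequencies by LPC root-finding (a standard approach;
    the abstract does not state which extraction method the authors used)."""
    # Pre-emphasis and windowing to flatten and localize the spectrum
    x = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    x = x * np.hamming(len(x))
    # Autocorrelation method: solve the Yule-Walker equations for LPC coefficients
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])          # predictor coefficients
    roots = np.roots(np.concatenate(([1.0], -a)))   # poles of the LPC filter
    roots = roots[np.imag(roots) > 0]               # one of each conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)      # pole angles -> frequencies (Hz)
    return sorted(f for f in freqs if 90 < f < fs / 2 - 90)

# Synthetic vowel-like signal: damped resonances at hypothetical formant positions
fs = 8000
t = np.arange(0, 0.05, 1 / fs)
sig = sum(np.exp(-60 * t) * np.sin(2 * np.pi * f * t) for f in (500, 1800, 2500))
formants = estimate_formants(sig, fs)
print(formants[:3])  # F1, F2, F3 estimates in Hz
```

The lowest three returned frequencies play the roles of F1, F2 and F3 that the statistical and classification analyses in the paper operate on.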
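The hybrid-VAD abstract combines an LSTM decision with a thresholding decision over the same auditory features using logical operators. A toy sketch of the thresholding branch and the combination step follows; the frame size, thresholds and the stand-in for the network output are illustrative assumptions, not the authors' values, and MFCCs are omitted for brevity.

```python
import numpy as np

def frame_features(x, frame_len=400, hop=200):
    """Short-time energy and zero crossing rate per frame (two of the
    auditory features named in the abstract)."""
    frames = [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]
    energy = np.array([np.sum(f ** 2) for f in frames])
    zcr = np.array([np.mean(np.abs(np.diff(np.sign(f))) > 0) for f in frames])
    return energy, zcr

def threshold_vad(energy, zcr):
    """Second-stage decision: simple relative thresholds (illustrative values)."""
    return (energy > 0.1 * energy.max()) & (zcr < 0.5)

fs = 16000
t = np.arange(fs) / fs
speech_like = np.sin(2 * np.pi * 200 * t) * (t >= 0.5)   # silence, then a tone
energy, zcr = frame_features(speech_like)
thr_decision = threshold_vad(energy, zcr)
lstm_decision = thr_decision.copy()   # stand-in for the trained LSTM's output
final = lstm_decision & thr_decision  # logical-AND combination of both stages
```

In the paper the first-stage decision comes from a trained LSTM; it is stubbed here so that only the feature extraction and the two-stage combination logic are shown.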
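The milVAD abstract embeds Noisy-And pooling in a CNN to turn instance-level probabilities inside a bag into one bag-level decision. The pooling function itself is small enough to sketch in numpy; the slope a and threshold b below are illustrative (in the paper's Deep MIL setting b would be a learnable parameter).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def noisy_and(instance_probs, a=10.0, b=0.5):
    """Noisy-And pooling: the bag-level probability rises from 0 toward 1
    as the mean instance activation crosses the threshold b."""
    p_mean = np.mean(instance_probs)
    num = sigmoid(a * (p_mean - b)) - sigmoid(-a * b)
    den = sigmoid(a * (1.0 - b)) - sigmoid(-a * b)
    return num / den

print(noisy_and(np.zeros(10)))  # 0.0: no active instances in the bag
print(noisy_and(np.ones(10)))   # 1.0: all instances active
```

The normalization by `den` pins the output to exactly 0 when the mean activation is 0 and exactly 1 when it is 1, which is what makes the layer usable as a bag-level binary classifier head.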
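The SS-ESC abstract applies spectral subtraction denoising before the spectrogram features are computed. A minimal magnitude-domain version is sketched below; the paper's exact variant is not stated, so the frame length, spectral floor and the noise-only prefix used for the noise estimate are all assumptions.

```python
import numpy as np

def spectral_subtract(noisy, noise_ref, frame_len=512, floor=0.05):
    """Frame-wise magnitude spectral subtraction with a spectral floor.
    noise_ref: a noise-only excerpt used to estimate the noise magnitude."""
    noise_mag = np.abs(np.fft.rfft(noise_ref[:frame_len]))
    out = np.zeros_like(noisy)
    for start in range(0, len(noisy) - frame_len + 1, frame_len):
        spec = np.fft.rfft(noisy[start:start + frame_len])
        # Subtract the noise magnitude estimate, keep a small floor to limit
        # musical-noise artefacts, and reuse the noisy phase for resynthesis.
        mag = np.maximum(np.abs(spec) - noise_mag, floor * np.abs(spec))
        out[start:start + frame_len] = np.fft.irfft(mag * np.exp(1j * np.angle(spec)))
    return out

rng = np.random.default_rng(0)
fs = 8000
t = np.arange(fs) / fs
noise = 0.3 * rng.standard_normal(fs)
clean = np.sin(2 * np.pi * 440 * t) * (t >= 0.25)   # recording starts noise-only
noisy = clean + noise
enhanced = spectral_subtract(noisy, noisy[:512])    # prefix serves as noise estimate
```

In the SS-ESC pipeline the enhanced signal, not the raw one, would then be converted to a log-mel spectrogram image and fed to the CNN.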
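The unsupervised branch of the last abstract applies normalization, thresholding and median filtering to the frame features. The median-filter step, which cleans up a binary frame decision track, can be sketched as follows; the window width and the example track are illustrative, not taken from the paper.

```python
import numpy as np

def median_smooth(decisions, width=5):
    """Median-filter a binary VAD decision track: isolated spurious frames
    are flipped to match the majority of their neighbourhood."""
    pad = width // 2
    padded = np.pad(decisions.astype(int), pad, mode="edge")
    return np.array([int(np.median(padded[i:i + width]))
                     for i in range(len(decisions))]).astype(bool)

# Raw per-frame decisions: one false-positive frame, then a genuine speech run
raw = np.array([0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0], dtype=bool)
smoothed = median_smooth(raw)
```

The isolated positive frame at index 3 is removed while the genuine run at indices 7-11 survives, which is exactly the kind of cleanup that reduces OVER/NDS-style errors in the reported results.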


This site is protected under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.


Dicle Üniversitesi, Diyarbakır, TÜRKİYE

Powered by İdeal DSpace

DSpace software copyright © 2002-2025 LYRASIS
