Found: 398
Comparative performance analysis of end-to-end ASR models on Indo-Aryan and Dravidian languages within India’s linguistic landscape
Jain P., Bhowmick A.
Q2
Springer Nature
Eurasip Journal on Audio, Speech, and Music Processing, 2025, citations: 0,
Open access,
PDF, doi.org
Enhancing Speaker Recognition with CRET Model: a fusion of CONV2D, RESNET and ECAPA-TDNN
Li P., Hoi L.M., Wang Y., Yang X., Im S.K.
Q2
Springer Nature
Eurasip Journal on Audio, Speech, and Music Processing, 2025, citations: 0,
Open access,
PDF, doi.org, Abstract
In today’s society, speaker recognition plays an increasingly important role. Currently, neural networks are widely employed for extracting speaker features. Although the Emphasized Channel Attention, Propagation, and Aggregation in Time Delay Neural Network (ECAPA-TDNN) model can obtain temporal context information through dilated convolution to some extent, this model falls short in acquiring fully comprehensive speech features. To further improve the accuracy of the model, better capture temporal context information, and make ECAPA-TDNN robust to small offsets in the frequency domain, we combine a two-dimensional convolutional network (Conv2D), a residual network (ResNet), and ECAPA-TDNN to form a novel CRET model. In this study, two CRET models are proposed and compared with the baseline models Multi-Scale Backbone Architecture (Res2Net) and ECAPA-TDNN across different channel counts and datasets. The experimental findings indicate that our proposed models exhibit strong performance across various experiments conducted on both training and test sets, even when the network is deep. Our model performs best on the VoxCeleb2 dataset with 1024 channels, achieving an accuracy of 0.97828, an equal error rate (EER) of 0.03612 on the VoxCeleb1-O dataset, and a minimum detection cost function (MinDCF) of 0.43967. This technology can improve public safety and service efficiency in smart city construction, benefit finance, education, and other fields, and bring more convenience to people's lives.
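The equal error rate quoted above is the operating point where false acceptance and false rejection rates coincide; a minimal sketch of one common way to compute it from speaker verification trial scores (assuming NumPy and scikit-learn; the array names are illustrative, not taken from the paper):

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    """labels: 1 for target (same-speaker) trials, 0 for impostor trials.
    scores: similarity scores, higher meaning 'more likely the same speaker'."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))      # point where the two error rates cross
    return 0.5 * (fpr[idx] + fnr[idx])

# Toy usage: compute_eer(np.array([1, 1, 0, 0]), np.array([0.9, 0.4, 0.6, 0.1]))
```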
Investigations on higher-order spherical harmonic input features for deep learning-based multiple speaker detection and localization
Poschadel N., Preihs S., Peissig J.
Q2
Springer Nature
Eurasip Journal on Audio, Speech, and Music Processing, 2025, citations: 0,
Open access,
PDF, doi.org, Abstract
In this paper, a detailed investigation of deep learning-based speaker detection and localization (SDL) with higher-order Ambisonics signals is conducted. Different spherical harmonic (SH) input features such as the higher-order pseudointensity vector (HO-PIV), relative harmonic coefficients (RHCs), and the spatially-localized pseudointensity vector (SL-PIV), a feature proposed for the first time as an input feature for deep learning-based SDL, are examined using first- to fourth-order SH signals. The trained neural networks, optimized with a single loss function for the combined tasks of detection and localization, are then evaluated in detail for overall SDL performance as well as their performance in the sub-tasks of detection and, particularly, localization. The results are further analyzed as a function of room reverberation, signal-to-interference ratio (SIR), and the number of and distances between multiple simultaneously active speakers, utilizing both simulated and measured data. The findings indicate an overall improvement in SDL performance up to third-order Ambisonics for all investigated features, while using fourth-order signals does not yield any further improvement or sometimes even delivers worse results. Notably, the HO-PIV and the SL-PIV, both extensions of the first-order pseudointensity vector (FO-PIV), have proven to be suitable input features. In particular, the newly proposed SL-PIV has been found to be the best of the investigated features on third- and fourth-order Ambisonics signals, especially in the most demanding scenarios on measured data, with multiple, closely located speakers and poor SIR.
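For reference, the first-order pseudointensity vector that the HO-PIV and SL-PIV features extend can be computed directly from the B-format STFT; the sketch below is a generic FO-PIV estimate (sign and normalization conventions depend on the Ambisonics format and are assumptions here, not taken from the paper):

```python
import numpy as np

def fo_pseudointensity(W, X, Y, Z):
    """First-order pseudointensity vector per time-frequency bin.
    W, X, Y, Z: complex STFTs of the first-order Ambisonics channels,
    each of shape (frames, bins). Returns an array of shape (frames, bins, 3)."""
    V = np.stack([X, Y, Z], axis=-1)               # velocity-channel proxy
    return np.real(np.conj(W)[..., None] * V)      # active intensity estimate

def doa_unit_vectors(piv, eps=1e-12):
    """Normalize the PIV to per-bin direction estimates."""
    norm = np.linalg.norm(piv, axis=-1, keepdims=True)
    return piv / np.maximum(norm, eps)
```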
AI-based Chinese-style music generation from video content: a study on cross-modal analysis and generation methods
Cao M., Zheng J., Zhang C.
Q2
Springer Nature
Eurasip Journal on Audio, Speech, and Music Processing, 2025, citations: 0,
Open access,
PDF, doi.org, Abstract
In recent years, Artificial Intelligence Generated Content (AIGC) technologies have advanced rapidly, with models such as Stable Diffusion and GPT garnering significant attention across various domains. Against this backdrop, AI-driven music composition techniques have also made significant progress. However, no existing model has yet demonstrated the capability to generate Chinese-style music corresponding to Chinese-style videos. To address this gap, this study proposes a novel Chinese-style video music generation model based on the Latent Diffusion Model (LDM) and Diffusion Transformers (DiT). Experimental results demonstrate that the proposed model generates Chinese-style music from Chinese-style videos and achieves performance comparable to the baseline models in audio quality, distribution fitting, musicality, rhythmic stability, and audio-visual synchronization. These findings indicate that the model captures the stylistic features of Chinese music. This research not only demonstrates the feasibility of applying artificial intelligence to music creation but also provides a new technological approach to preserving and renewing traditional Chinese music culture in the digital era. Furthermore, it explores new possibilities for the dissemination and innovation of Chinese cultural arts in the digital age.
A speech recognition method with enhanced transformer decoder
Hu H., Niu T., He Z.
Q2
Springer Nature
Eurasip Journal on Audio, Speech, and Music Processing, 2025, citations: 0,
Open access,
PDF, doi.org, Abstract
To address the issue that the Transformer decoder struggles to capture the local features needed for monotonic alignment in speech recognition, and to incorporate language model cold fusion training into the decoder, an enhanced decoder-based speech recognition model is investigated. The enhanced decoder separates and recombines the two attention mechanisms of the Transformer decoder into cross-attention layers and a self-attention language model module. The cross-attention layers are utilized to capture local features more efficiently from the encoder output, and the self-attention language model module is pre-trained with additional domain-related text, followed by cold fusion training. Experimental results on the Mandarin Aishell-1 dataset demonstrate that when the encoder is a Conformer, the enhanced decoder achieves a 16.1% reduction in character error rate compared to the Transformer decoder. Furthermore, when the language model is pre-trained with suitable text data, the performance of the cold fusion-trained model is further enhanced.
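Cold fusion, as originally described by Sriram et al. (2017), gates language-model features into the decoder state during training; a PyTorch sketch of that gating mechanism is shown below (dimensions and layer choices are assumptions, and the paper's enhanced decoder may wire this differently):

```python
import torch
import torch.nn as nn

class ColdFusionGate(nn.Module):
    """Fuse decoder states with pre-trained language-model features via a gate."""
    def __init__(self, dec_dim, lm_dim, hidden_dim):
        super().__init__()
        self.lm_proj = nn.Linear(lm_dim, hidden_dim)             # h_LM = DNN(l_t)
        self.gate = nn.Linear(dec_dim + hidden_dim, hidden_dim)  # gate g_t
        self.out = nn.Linear(dec_dim + hidden_dim, dec_dim)

    def forward(self, dec_state, lm_features):
        h_lm = torch.relu(self.lm_proj(lm_features))
        g = torch.sigmoid(self.gate(torch.cat([dec_state, h_lm], dim=-1)))
        fused = torch.cat([dec_state, g * h_lm], dim=-1)         # gated LM contribution
        return torch.relu(self.out(fused))
```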
Improving multi-talker binaural DOA estimation by combining periodicity and spatial features in convolutional neural networks
Varzandeh R., Doclo S., Hohmann V.
Q2
Springer Nature
Eurasip Journal on Audio, Speech, and Music Processing, 2025, citations: 0,
Open access,
PDF, doi.org, Abstract
Deep neural network-based direction of arrival (DOA) estimation systems often rely on spatial features as input to learn a mapping for estimating the DOA of multiple talkers. Aiming to improve the accuracy of multi-talker DOA estimation for binaural hearing aids with a known number of active talkers, we investigate the usage of periodicity features as a footprint of speech signals in combination with spatial features as input to a convolutional neural network (CNN). In particular, we propose a multi-talker DOA estimation system employing a two-stage CNN architecture that utilizes cross-power spectrum (CPS) phase as spatial features and an auditory-inspired periodicity feature called periodicity degree (PD) as spectral features. The two-stage CNN incorporates a PD feature reduction stage prior to the joint processing of PD and CPS phase features. We investigate different design choices for the CNN architecture, including varying temporal reduction strategies and spectro-temporal filtering approaches. The performance of the proposed system is evaluated in static source scenarios with 2–3 talkers in two reverberant environments under varying signal-to-noise ratios using recorded background noises. To evaluate the benefit of combining PD features with CPS phase features, we consider baseline systems that utilize either only CPS phase features or combine CPS phase and magnitude spectrogram features. Results show that combining PD and CPS phase features in the proposed system consistently improves DOA estimation accuracy across all conditions, outperforming the two baseline systems. Additionally, the PD feature reduction stage in the proposed system improves DOA estimation accuracy while significantly reducing computational complexity compared to a baseline system without this stage, demonstrating its effectiveness for multi-talker DOA estimation.
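The cross-power spectrum phase used as the spatial feature can be obtained directly from two channel STFTs; a minimal NumPy sketch (the periodicity-degree feature requires an auditory front end and is not reproduced here):

```python
import numpy as np

def cps_phase(stft_left, stft_right):
    """Phase of the cross-power spectrum between two binaural channels.
    Inputs are complex STFTs of shape (frames, bins); output has the same shape."""
    cross_power = stft_left * np.conj(stft_right)
    return np.angle(cross_power)
```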
A big data dynamic approach for adaptive music instruction with deep neural fuzzy logic control
Li D., Liu Z.
Q2
Springer Nature
Eurasip Journal on Audio, Speech, and Music Processing, 2025, citations: 0,
Open access,
PDF, doi.org, Abstract
Music training for learners has improved greatly in recent years with the inclusion of information technology and optimization methods. The improvements focus on assisted learning, instruction suggestions, and performance assessments. An adaptive instructive suggestion method (AISM) using deep neural fuzzy control (FC) is introduced in this paper to provide persistent assistance for technology-based music classrooms. The proposed method reduces learning errors by pursuing instructions based on the learner’s level. The instructions are adaptable depending on the error and level, independent of different suggestions. The suggestions are replicated for similar issues across various music learning classrooms, retaining constant fuzzification. The fuzzy control deviates at every new level, and errors are identified over the deviations from the instructions pursued. This control process verifies the input based on instruction deviations to prevent error repetitions. Therefore, the fuzzification relies on error normalization using common adaptive suggestions for different learning sessions. If the fuzzy control fails to match the existing instruction pursued, then new instructions are augmented to reduce errors, which serves as the FC constraint. This constraint is pursued for unresolved previous errors to improve learning efficacy. Compared to other methods, the system improves adaptability by 13.9%, efficiency by 9.02%, and constraint detection by 10.26%.
A review on speech recognition approaches and challenges for Portuguese: exploring the feasibility of fine-tuning large-scale end-to-end models
Li Y., Wang Y., Hoi L.M., Yang D., Im S.
Q2
Springer Nature
Eurasip Journal on Audio, Speech, and Music Processing, 2025, citations: 0,
Open access,
Review, PDF, doi.org, Abstract
At present, automatic speech recognition has become an important bridge for human-computer interaction and is widely applied in multiple fields. The Portuguese speech recognition task is gradually receiving attention owing to the unique position of the Portuguese language. However, relatively scarce data resources have constrained the development and application of Portuguese speech recognition systems, and the neglect of accent issues also hampers the adoption of such systems. This study focuses on the research progress of end-to-end technology for the Portuguese speech recognition task. It discusses relevant research in two directions, Brazilian Portuguese recognition and European Portuguese recognition, and organizes available corpus resources for potential researchers. Then, taking European Portuguese speech recognition as an example, it uses Fairseq-S2T and Whisper as benchmarks, tested on a 500-h European Portuguese dataset, to assess the performance of large-scale pre-trained models and fine-tuning techniques. Whisper obtained a WER of 5.11%, which indicates that multilingual joint training can enhance generalization ability. Finally, in view of the existing problems in Portuguese speech recognition, the study explores future research directions, providing new ideas for the next stage of research and system construction.
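The WER figure quoted above is the usual edit-distance-based metric; a self-contained sketch of its computation (illustrative only, not the evaluation script used in the study):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over words, divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```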
Performance evaluation of perceptible impulsive noise detection methods based on auditory models
Özdoğru A., Rund F., Fliegel K.
Q2
Springer Nature
Eurasip Journal on Audio, Speech, and Music Processing, 2025, citations: 0,
Open access,
PDF, doi.org, Abstract
Reference-free audio quality assessment is a valuable tool in many areas, such as audio recordings, vinyl production, and communication systems. Therefore, evaluating the reliability and performance of such tools is crucial. This paper builds on previous research by analyzing the performance of four additional algorithms in detecting perceptible impulsive noise (clicks) based on auditory models. We compared the results of eight algorithms, hypothesizing that computationally simpler algorithms could perform as well as more complex ones. We obtained a set of audio signals, with and without clicks, annotated by human subjects from a publicly available dataset. The audio signal sets are categorized based on the obtained annotation results to train the algorithms for different levels of the experiments. Cross-validated experiments are conducted for multiple algorithm parameter settings. The algorithm training is based on maximizing a discriminability metric ($A'$). The evaluation criteria for the algorithms included the hit rate, false alarm rate, $A'$, and computational time. Our findings indicate that computationally simpler auditory models perform as well as computationally more complex ones, while conventional models exhibit lower performance. In conclusion, the ERBlet-transform-based algorithm demonstrated superior performance in terms of $A'$ and robustness. This paper provides insights into the capabilities of auditory models in a practical use case of perceptible click detection. The results presented here can help the research and development of such algorithms for vinyl production, audio archiving, podcasting, music production, and telecommunications.
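A common nonparametric formulation of the discriminability metric $A'$ (Pollack and Norman) from hit and false-alarm rates is sketched below; the paper may use an equivalent variant:

```python
def a_prime(hit_rate: float, false_alarm_rate: float) -> float:
    """Nonparametric discriminability index A' in [0, 1]; 0.5 means chance level."""
    h, f = hit_rate, false_alarm_rate
    if h >= f:
        return 0.5 + ((h - f) * (1.0 + h - f)) / (4.0 * h * (1.0 - f) + 1e-12)
    return 0.5 - ((f - h) * (1.0 + f - h)) / (4.0 * f * (1.0 - h) + 1e-12)
```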
Sound recurrence analysis for acoustic scene classification
Abeßer J., Liang Z., Seeber B.
Q2
Springer Nature
Eurasip Journal on Audio, Speech, and Music Processing, 2025, citations: 0,
Open access,
PDF, doi.org, Abstract
In everyday life, people experience different soundscapes in which natural sounds, animal noises, and man-made sounds blend together. Although there have been several studies on the importance of recurring sound patterns in music and language, the relevance of this phenomenon in natural soundscapes is still largely unexplored. In this article, we study the repetition patterns of harmonic and transient sound events as potential cues for acoustic scene classification (ASC). In the first part of our study, our aim is to identify acoustic scene classes that exhibit characteristic sound repetition patterns concerning harmonic and transient sounds. We propose three metrics to measure the overall prevalence of sound repetitions as well as their repetition periods and temporal stability. In the second part, we evaluate three strategies to incorporate self-similarity matrices as an additional input feature to a convolutional neural network architecture for ASC. We observe the characteristic repetition of transient sounds in recordings of “park” and “street traffic” as well as harmonic sound repetitions in acoustic scene classes related to public transportation. In the ASC experiments, hybrid network architectures, which combine spectrogram features and features from sound recurrence analysis, show increased accuracy for those classes with prominent sound repetition patterns. Our findings provide additional perspective on the distinctions among acoustic scenes previously primarily ascribed in the literature to their spectral features.
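A self-similarity matrix of the kind used as an additional CNN input can be computed from frame-wise features with a cosine similarity; a minimal sketch (the feature choice and any harmonic/transient separation are left out):

```python
import numpy as np

def self_similarity_matrix(features):
    """Cosine self-similarity between time frames.
    features: array of shape (frames, feature_dims), e.g. log-Mel frames.
    Returns a (frames, frames) matrix; repeated sounds show up as off-diagonal stripes."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)
    return unit @ unit.T
```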
Can all variations within the unified mask-based beamformer framework achieve identical peak extraction performance?
Hiroe A., Itoyama K., Nakadai K.
Q2
Springer Nature
Eurasip Journal on Audio, Speech, and Music Processing, 2024, citations: 0,
Open access,
PDF, doi.org, Abstract
This study investigates mask-based beamformers (BFs), which estimate filters for target sound extraction (TSE) using time-frequency masks. Although multiple mask-based BFs have been proposed, no consensus has been reached on which one offers the best target-extraction performance. Previously, we found that maximum signal-to-noise ratio and minimum mean square error (MSE) BFs can achieve the same extraction performance as the theoretical upper-bound performance, with each BF containing a different optimal mask. However, two issues remained unsolved: only two BFs were covered, excluding the minimum variance distortionless response BF; and ideal scaling (IS) was employed to ideally adjust the output scale, which is not applicable to realistic scenarios. To address these issues, this study proposes a unified framework for mask-based BFs comprising two processes: filter estimation that can cover all possible BFs and scaling applicable to realistic scenarios by employing a mask to generate a scaling reference. Based on the operators and covariance matrices used in BF formulas, all possible BFs can be classified into 12 variations, including two new ones. Optimal masks for both processes are obtained by minimizing the MSE between the target and BF output. The experimental results using the CHiME-4 dataset suggested that 1) all 12 variations can achieve the theoretical upper-bound performance, and 2) mask-based scaling can behave like IS, even when constraining the temporal mean of a non-negative mask to one. These results can be explained by considering the practical parameter count of the masks. These findings contribute to 1) designing a TSE system, 2) improving scaling accuracy through mask-based scaling, and 3) estimating the extraction performance of a BF.
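A typical mask-based filter estimation step shared by many of the BF variants discussed above first builds mask-weighted spatial covariance matrices and then derives a filter from them; the sketch below shows one common (Souden-style MVDR) instance and is not the paper's unified framework itself:

```python
import numpy as np

def masked_covariances(stft, mask):
    """stft: complex array (channels, frames, bins); mask: (frames, bins) in [0, 1].
    Returns per-bin spatial covariance matrices of shape (bins, channels, channels)."""
    X = np.transpose(stft, (2, 1, 0))                    # (bins, frames, channels)
    w = mask.T[..., None]                                # (bins, frames, 1)
    R = np.einsum('ftc,ftd->fcd', w * X, np.conj(X))
    return R / np.maximum(w.sum(axis=1, keepdims=True), 1e-8)

def mvdr_filters(R_target, R_noise, ref_channel=0):
    """Per-bin MVDR filters from target/noise covariances (reference-channel form)."""
    n_bins, n_ch, _ = R_target.shape
    w = np.zeros((n_bins, n_ch), dtype=complex)
    for f in range(n_bins):
        num = np.linalg.solve(R_noise[f] + 1e-6 * np.eye(n_ch), R_target[f])
        w[f] = (num / (np.trace(num) + 1e-12))[:, ref_channel]
    return w
```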
Acoustic scene classification using inter- and intra-subarray spatial features in distributed microphone array
Kawamura T., Kinoshita Y., Ono N., Scheibler R.
Q2
Springer Nature
Eurasip Journal on Audio, Speech, and Music Processing, 2024, citations: 0,
Open access,
PDF, doi.org, Abstract
In this study, we investigate the effectiveness of spatial features in acoustic scene classification using distributed microphone arrays. Under the assumption that multiple subarrays, each equipped with microphones, are synchronized, we investigate two types of spatial feature: intra- and inter-generalized cross-correlation phase transforms (GCC-PHATs). These are derived from channels within the same subarray and between different subarrays, respectively. Our approach treats the log-Mel spectrogram as a spectral feature and intra- and/or inter-GCC-PHAT as a spatial feature. We propose two integration methods for spectral and spatial features: (a) middle integration, which fuses embeddings obtained by spectral and spatial features, and (b) late integration, which fuses decisions estimated using spectral and spatial features. The evaluation experiments showed that, when using only spectral features, employing all channels did not markedly improve the F1-score compared with the single-channel case. In contrast, integrating both spectral and spatial features improved the F1-score compared with using only spectral features. Additionally, we confirmed that the F1-score for late integration was slightly higher than that for middle integration.
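Both the intra- and inter-subarray spatial features are GCC-PHAT functions computed between channel pairs; a minimal single-pair sketch (frame handling and pair selection are omitted):

```python
import numpy as np

def gcc_phat(x, y):
    """GCC-PHAT between two time-domain channels; the peak lag approximates
    the time difference of arrival between them."""
    n = len(x) + len(y)
    X, Y = np.fft.rfft(x, n=n), np.fft.rfft(y, n=n)
    cross = X * np.conj(Y)
    cross /= np.maximum(np.abs(cross), 1e-12)     # PHAT weighting
    cc = np.fft.irfft(cross, n=n)
    return np.fft.fftshift(cc)                    # zero lag moved to the center
```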
Domain-weighted transfer learning and discriminative embeddings for low-resource speaker verification
Wang H., He M., Zhang M., Luo C., Xu L.
Q2
Springer Nature
Eurasip Journal on Audio, Speech, and Music Processing, 2024, citations: 0,
Open access,
PDF, doi.org, Abstract
Transfer learning has been shown to be effective in enhancing speaker verification performance in low-resource conditions. However, the inclusion of additional datasets may cause domain mismatch. Additionally, mismatched data volume and model complexity during fine-tuning can degrade speaker verification performance. In this paper, we propose a domain-weighted allocation fine-tuning strategy that employs the Kernel Mean Matching (KMM) algorithm to adjust the distribution differences between the in-domain and out-of-domain datasets. It assigns weights to each sample in the source datasets and utilizes the maximum mean discrepancy (MMD) distance to measure the effectiveness of distribution adaptation. The domain-weighted allocation fine-tuning strategy (DWA-FT) effectively mitigates the issue of domain mismatch during model training. We also propose two backend canonical correlation analysis (CCA) embedding transformation methods, the CCA embedding fusion and the CCA embedding constraint. These methods aim to enhance the quality of speaker embeddings. The experimental results demonstrate that the proposed methods effectively enhance the performance of the speaker verification system in low-resource scenarios. Compared to the baseline, our methods achieve relative improvements of 51.03% in PLDA scoring and 46.02% in cosine similarity scoring on the Himia dataset.
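The MMD distance used to check the effectiveness of distribution adaptation can be computed with a kernel two-sample statistic; a minimal RBF-kernel sketch (the importance weights from KMM are omitted, and the kernel bandwidth is an assumption):

```python
import numpy as np

def _rbf(a, b, gamma):
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def mmd_rbf(source, target, gamma=1.0):
    """Squared maximum mean discrepancy between two embedding sets.
    source: (n, d), target: (m, d). Smaller values mean better-matched domains."""
    return (_rbf(source, source, gamma).mean()
            + _rbf(target, target, gamma).mean()
            - 2.0 * _rbf(source, target, gamma).mean())
```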
Variants of LSTM cells for single-channel speaker-conditioned target speaker extraction
Sinha R., Rollwage C., Doclo S.
Q2
Springer Nature
Eurasip Journal on Audio, Speech, and Music Processing, 2024, citations: 0,
Open access,
PDF, doi.org, Abstract
Speaker-conditioned target speaker extraction aims at estimating the target speaker from a mixture of speakers utilizing auxiliary information about the target speaker. In this paper, we consider a single-channel target speaker extraction system consisting of a speaker embedder network and a speaker separator network. Instead of using standard long short-term memory (LSTM) cells in the separator network, we propose two variants of LSTM cells that are customized for speaker-conditioned target speaker extraction. The first variant customizes both the forget gate and input gate of the LSTM cell, aiming at retaining only relevant features related to the target speaker and disregarding the interfering speakers by simultaneously resetting and updating the cell state using the speaker embedding. For the second variant, we introduce a new gate within the LSTM cell, referred to as the auxiliary-modulation gate. This gate modulates the information processing during cell state reset, aiming at learning the long-term and short-term discriminative features of the target speaker. Both in unidirectional and bidirectional mode, experimental results on 2-speaker mixtures, 3-speaker mixtures, and noisy mixtures (containing 1, 2, or 3 speakers) show that both proposed variants of LSTM cells outperform the standard LSTM cells for target speaker extraction, where the best performance is obtained using the auxiliary-gated LSTM cells.
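A speaker-conditioned gate can be added to a standard LSTM cell by letting the speaker embedding modulate the cell-state update; the PyTorch sketch below illustrates the general idea and is not the authors' exact cell design:

```python
import torch
import torch.nn as nn

class AuxGatedLSTMCell(nn.Module):
    """LSTM cell with an extra gate driven by a target-speaker embedding."""
    def __init__(self, input_size, hidden_size, embedding_size):
        super().__init__()
        self.x2h = nn.Linear(input_size, 4 * hidden_size)
        self.h2h = nn.Linear(hidden_size, 4 * hidden_size)
        self.aux = nn.Linear(embedding_size, hidden_size)   # auxiliary-modulation gate

    def forward(self, x, state, speaker_embedding):
        h, c = state
        i, f, g, o = (self.x2h(x) + self.h2h(h)).chunk(4, dim=-1)
        a = torch.sigmoid(self.aux(speaker_embedding))      # speaker-dependent reset
        c = torch.sigmoid(f) * a * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)
```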
A lightweight approach to real-time speaker diarization: from audio toward audio-visual data streams
Kynych F., Cerva P., Zdansky J., Svendsen T., Salvi G.
Q2
Springer Nature
Eurasip Journal on Audio, Speech, and Music Processing, 2024, citations: 0,
Open access,
PDF, doi.org, Abstract
This manuscript deals with the task of real-time speaker diarization (SD) for stream-wise data processing. Therefore, in contrast to most of the existing papers, it considers not only the accuracy but also the computational demands of individual investigated methods. We first propose a new lightweight scheme allowing us to perform speaker diarization of streamed audio data. Our approach utilizes a modified residual network with squeeze-and-excitation blocks (SE-ResNet-34) to extract speaker embeddings in an optimized way using cached buffers. These embeddings are subsequently used for voice activity detection (VAD) and block-online k-means clustering with a look-ahead mechanism. The described scheme yields results similar to the reference offline system while operating solely on a CPU with a low real-time factor (RTF) below 0.1 and a constant latency of around 5.5 s. In the next part of the work, our research moves toward much more demanding and complex real-time processing of audio-visual data streams. For this purpose, we extend the above-mentioned scheme for audio data processing by adding an audio-video module. This module utilizes SyncNet combined with visual embeddings for identity tracking. Our resulting multi-modal SD framework then combines the outputs from audio and audio-video modules by using a new overlap-based fusion strategy. It yields diarization error rates that are competitive with the existing state-of-the-art offline audio-visual methods while allowing us to process various audio-video streams, e.g., from Internet or TV broadcasts, in real-time using GPU and with the same latency as for audio data processing.
Analysis of spatial filtering in neural spatiospectral filters and its dependence on training target characteristics
Briegleb A., Kellermann W.
Q2
Springer Nature
Eurasip Journal on Audio, Speech, and Music Processing, 2024, citations: 0,
Open access,
PDF, doi.org, Abstract
Mask-based multichannel speech enhancement methods based on artificial neural networks estimate a mask that is applied to the multichannel input signal or a reference channel to obtain the estimated desired signal. For the estimation, both spectral and spatial cues from the multichannel input can be used. However, the interplay of the two inside the neural network is typically unknown. In this contribution, we propose a framework to analyze neural spatiospectral filters (NSSFs) with respect to their capabilities to extract and represent spatial information. We explicitly take the characteristics of the training target signal into account and analyze its effect on the functionality of the NSSF. Using two conceptually different NSSFs as examples, we show that not all NSSFs use spatial information under all circumstances and that the training target signal has a significant influence on the spatial filtering behavior of an NSSF. These insights help to assess the signal processing capabilities of neural networks and allow informed decisions to be made when configuring, training, and deploying NSSFs.
Modelling note’s pitch and duration in trained professional singers
Faghih B., Shoari Nejad A., Timoney J.
Q2
Springer Nature
Eurasip Journal on Audio, Speech, and Music Processing, 2024, citations: 0,
Open access,
PDF, doi.org, Abstract
Performing musical notes correctly does not mean that all the performers will play the notes at the exact same pitch and duration. However, it does imply that they are performing the notes within acceptable psychoacoustic ranges. Therefore, this article aims to find the range of a note's duration and pitch according to its position in a piece of music by analysing several parameters of trained professional singers' behaviours in singing notes. To achieve this goal, the variations of eight variables across 2688 solo singing recordings by trained professional singers were investigated to find the relationships between a performed note's F0 and duration and these variables. The variables considered in this study are the interval to the following and previous notes, the existence of a rest before or after the note, the note's MIDI pitch code and duration in the music score, and the particular singing technique applied. A Bayesian hierarchical model was used to find the effect of the variables on the pitch and duration of a note sung by professional singers, mainly in the opera style. The investigation confirms that these parameters affect the pitch and duration of notes performed by professional singers. Finally, this paper proposes formulas to calculate the pitch frequency and duration of notes according to the variables, to simulate the behaviour of trained professional singers in performing notes' pitches and durations.
Steered Response Power for Sound Source Localization: a tutorial review
Grinstein E., Tengan E., Çakmak B., Dietzen T., Nunes L., van Waterschoot T., Brookes M., Naylor P.A.
Q2
Springer Nature
Eurasip Journal on Audio, Speech, and Music Processing, 2024, citations: 1,
Open access,
Review, PDF, doi.org, Abstract
In the last three decades, the Steered Response Power (SRP) method has been widely used for the task of Sound Source Localization (SSL), due to its satisfactory localization performance on moderately reverberant and noisy scenarios. Many works have analysed and extended the original SRP method to reduce its computational cost, to allow it to locate multiple sources, or to improve its performance in adverse environments. In this work, we review over 200 papers on the SRP method and its variants, with emphasis on the SRP-PHAT method. We also present eXtensible-SRP, or X-SRP, a generalized and modularized version of the SRP algorithm which allows the reviewed extensions to be implemented. We provide a Python implementation of the algorithm which includes selected extensions from the literature.
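The core SRP-PHAT computation steers PHAT-weighted cross-power spectra toward candidate directions and sums their responses; a simplified far-field, single-plane sketch is given below (this is not the X-SRP implementation released with the paper):

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def srp_phat_map(stft, mic_xy, fs, n_fft, azimuths):
    """stft: complex (channels, frames, bins); mic_xy: (channels, 2) positions in m;
    azimuths: candidate directions in radians. Returns one SRP value per azimuth."""
    n_ch, _, n_bins = stft.shape
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)[:n_bins]
    srp = np.zeros(len(azimuths))
    for a, az in enumerate(azimuths):
        unit = np.array([np.cos(az), np.sin(az)])
        delays = mic_xy @ unit / SPEED_OF_SOUND              # far-field delay per channel
        for i in range(n_ch):
            for j in range(i + 1, n_ch):
                cross = stft[i] * np.conj(stft[j])
                cross /= np.maximum(np.abs(cross), 1e-12)    # PHAT weighting
                steer = np.exp(2j * np.pi * freqs * (delays[i] - delays[j]))
                srp[a] += np.real((cross * steer).sum())
    return srp
```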
Multi-channel neural audio decorrelation using generative adversarial networks
Anemüller C., Thiergart O., Habets E.A.
Q2
Springer Nature
Eurasip Journal on Audio, Speech, and Music Processing, 2024, citations: 0,
Open access,
PDF, doi.org, Abstract
The degree of correlation between the sounds received by the ears significantly influences the spatial perception of a sound image. Audio signal decorrelation is, therefore, a commonly used tool in various spatial audio rendering applications. In this paper, we propose a multi-channel extension of a previously proposed decorrelation method based on generative adversarial networks. A separate generator network is employed for each output channel. All generator networks are optimized jointly to obtain a multi-channel output signal with the desired properties. The training objective includes a number of individual loss terms to control both the input-output and the inter-channel correlation as well as the quality of the individual output channels. The proposed approach is trained on music signals and evaluated both objectively and through formal listening tests. Thereby, a comparison with two classical signal processing-based multi-channel decorrelators is performed. Additionally, the influence of the number of output channels, the individual loss term weightings, and the employed training data on the proposed method’s performance is investigated.
Multi-scale Information Aggregation for Spoofing Detection
Li C., Wan Y., Yang F., Yang J.
Q2
Springer Nature
Eurasip Journal on Audio, Speech, and Music Processing, 2024, citations: 0,
Open access,
PDF, doi.org, Abstract
Synthesis artifacts that span scales from small to large are important cues for spoofing detection. However, few spoofing detection models leverage artifacts across different scales together. In this paper, we propose a spoofing detection system built on SincNet and Deep Layer Aggregation (DLA), which leverages speech representations at different levels to distinguish synthetic speech. DLA is fully convolutional with an iterative tree-like structure. The unique topology of DLA makes it possible to compound speech features from convolution layers at different depths, and therefore the local and the global speech representations can be incorporated simultaneously. Moreover, SincNet is employed as the frontend feature extractor to circumvent manual feature extraction and selection. SincNet can learn fine-grained features directly from the input speech waveform, thus making the proposed spoofing detection system end-to-end. The proposed system outperforms the baselines when tested on the ASVspoof LA and DF datasets. Notably, our single model surpasses all competing systems in the ASVspoof DF competition with an equal error rate (EER) of 13.99%, which demonstrates the importance of multi-scale information aggregation for synthetic speech detection.
Point neuron learning: a new physics-informed neural network architecture
Bi H., Abhayapala T.D.
Q2
Springer Nature
Eurasip Journal on Audio, Speech, and Music Processing, 2024, citations: 0,
Open access,
PDF, doi.org, Abstract
Machine learning and neural networks have advanced numerous research domains, but challenges such as large training data requirements and inconsistent model performance hinder their application in certain scientific problems. To overcome these challenges, researchers have investigated integrating physics principles into machine learning models, mainly through (i) physics-guided loss functions, generally termed physics-informed neural networks, and (ii) physics-guided architectural design. While both approaches have demonstrated success across multiple scientific disciplines, they have limitations, including being trapped in local minima, poor interpretability, and restricted generalizability beyond the sampled data range. This paper proposes a new physics-informed neural network (PINN) architecture that combines the strengths of both approaches by embedding the fundamental solution of the wave equation into the network architecture, enabling the learned model to strictly satisfy the wave equation. The proposed point neuron learning method can model an arbitrary sound field based on microphone observations without any dataset. Compared to other PINN methods, our approach directly processes complex numbers, offers better interpretability, and can be generalized to out-of-sample scenarios. We evaluate the versatility of the proposed architecture on a sound field reconstruction problem in a reverberant environment. Results indicate that the point neuron method outperforms two competing methods and can efficiently handle noisy environments with sparse microphone observations.
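The fundamental solution embedded in the point-neuron architecture is the free-field Green's function of the Helmholtz equation; as a rough illustration of the underlying model (not the learning method itself), point-source amplitudes can be fitted to microphone observations at a single frequency by least squares, with positions and sign conventions treated as assumptions:

```python
import numpy as np

def fit_point_sources(mic_pos, pressures, source_pos, freq_hz, c=343.0):
    """mic_pos: (M, 3) microphone positions; pressures: (M,) complex observations;
    source_pos: (N, 3) candidate point-source positions. Returns complex amplitudes."""
    k = 2.0 * np.pi * freq_hz / c                          # wavenumber
    dist = np.linalg.norm(mic_pos[:, None, :] - source_pos[None, :, :], axis=-1)
    G = np.exp(-1j * k * dist) / (4.0 * np.pi * np.maximum(dist, 1e-6))
    amplitudes, *_ = np.linalg.lstsq(G, pressures, rcond=None)
    return amplitudes, G

# Reconstructed field at the microphones: G @ amplitudes
```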
SVQ-MAE: an efficient speech pre-training framework with constrained computational resources
Zhuang X., Qian Y., Wang M.
Q2
Springer Nature
Eurasip Journal on Audio, Speech, and Music Processing, 2024, citations: 0,
Open access,
PDF, doi.org, Abstract
Self-supervised learning for speech pre-training models has achieved remarkable success in acquiring superior speech contextual representations by learning from unlabeled audio, excelling in numerous downstream speech tasks. However, the pre-training of these models necessitates significant computational resources and training duration, presenting a high barrier to entry into the realm of pre-training research. In this work, by amalgamating the resource-efficient benefits of the generative learning model Masked Auto Encoder with the efficacy of the vector quantization method in discriminative learning, we introduce a novel pre-training framework: Speech Vector Quantization Masked Auto Encoder (SVQ-MAE). Distinct from the majority of SSL frameworks, which require simultaneous construction of speech contextual representations and mask reconstruction within an encoder-only module, we have designed a decoupled decoder for pre-training SVQ-MAE. This allows the additional decoupled decoder to undertake the mask reconstruction task solely, reducing the learning complexity of pretext tasks and enhancing the encoder's efficiency in extracting speech contextual representations. Owing to this design, using only 4 GPUs, SVQ-MAE can achieve performance comparable to wav2vec 2.0, which requires 64 GPUs for training. On the Speech Processing Universal Performance Benchmark, SVQ-MAE surpasses wav2vec 2.0 in both keyword spotting and emotion recognition tasks. Furthermore, in cross-lingual ASR for Mandarin, upon fine-tuning on AISHELL-1, SVQ-MAE achieves a Character Error Rate of 4.09%, outperforming all supervised ASR models.
UTran-DSR: a novel transformer-based model using feature enhancement for dysarthric speech recognition
Irshad U., Mahum R., Ganiyu I., Butt F.S., Hidri L., Ali T.G., El-Sherbeeny A.M.
Q2
Springer Nature
Eurasip Journal on Audio, Speech, and Music Processing, 2024, citations: 0,
Open access,
PDF, doi.org, Abstract
Over the past decade, the prevalence of neurological diseases has risen significantly due to population growth and aging. Individuals suffering from spastic paralysis, stroke, and idiopathic Parkinson's disease (PD), among other neurological illnesses, commonly suffer from dysarthria. Early detection and treatment of dysarthria in these patients are essential for effectively managing the progression of their disease. This paper presents UTrans-DSR, a novel encoder-decoder architecture for analyzing mel-spectrograms (generated from audio) and classifying speech as healthy or dysarthric. Our model employs transformer encoder features based on a hybrid design, which includes the feature enhancement block (FEB) and vision transformer (ViT) encoders. This combination effectively extracts global and local pixel information regarding localization while optimizing the mel-spectrogram feature extraction process. We keep the original class-token grouping sequence in the vision transformer while generating a new equivalent expanding route. More specifically, two unique growing pathways use a deep-supervision approach to increase spatial data recovery and expedite model convergence. We add consecutive residual connections to the system to reduce feature loss while increasing spatial data retrieval. Our technique is based on identifying gaps in mel-spectrograms that distinguish between normal and dysarthric speech. We conducted several experiments on UTrans-DSR using the UA speech and TORGO datasets, and it outperformed the existing top models. The model performed strongly in localized pixel and spatial feature extraction, effectively detecting and classifying spectral gaps. The UTrans-DSR model outperforms previous research models, achieving an accuracy of 97.75%.
Ensemble width estimation in HRTF-convolved binaural music recordings using an auditory model and a gradient-boosted decision trees regressor
Antoniuk P., Zieliński S.K., Lee H.
Q2
Springer Nature
Eurasip Journal on Audio, Speech, and Music Processing, 2024, citations: 0,
Open access,
PDF, doi.org, Abstract
Binaural audio recordings are becoming increasingly popular in multimedia repositories, posing new challenges in the indexing, searching, and retrieval of such excerpts in terms of their spatial audio scene characteristics. This paper presents a new method for the automatic estimation of one of the most important spatial attributes of binaural recordings of music, namely “ensemble width.” The method has been developed using a repository of 23,040 binaural excerpts synthesized by convolving 192 multi-track music recordings with 30 sets of head-related transfer functions (HRTF). The synthesized excerpts represented various spatial distributions of music sound sources along a frontal semicircle in the horizontal plane. A binaural auditory model was exploited to derive the standard binaural cues from the synthesized excerpts, yielding a dataset representing interaural level and time differences, complemented by interaural cross-correlation coefficients. Subsequently, a regression method based on gradient-boosted decision trees was applied to the formerly calculated dataset to estimate ensemble width values. According to the obtained results, the mean absolute error of the ensemble width estimation, averaged across experimental conditions, amounts to 6.63° (SD 0.12°). The accuracy of the method is highest for recordings with ensembles narrower than 30°, yielding mean absolute errors ranging between 0.8° and 10.2°. The performance of the proposed algorithm is relatively uniform regardless of the horizontal position of an ensemble. However, its accuracy deteriorates for wider ensembles, with the error reaching 25.2° for music ensembles spanning 90°. The developed method exhibits satisfactory generalization properties when evaluated both under music-independent and HRTF-independent conditions. The proposed method outperforms the technique based on “spatiograms” recently introduced in the literature.
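The pipeline combines standard binaural cues with a gradient-boosted regressor; the sketch below extracts broadband ILD, ITD, and IACC from one binaural frame and shows the regressor stage (the paper computes these cues per frequency band with an auditory model, so this is only a simplified illustration):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def binaural_cues(left, right, fs, max_itd_s=0.001):
    """Broadband ILD (dB), ITD (s), and IACC from one binaural signal frame."""
    ild = 10.0 * np.log10(((left ** 2).sum() + 1e-12) / ((right ** 2).sum() + 1e-12))
    cc = np.correlate(left, right, mode='full')
    lags = np.arange(-(len(right) - 1), len(left))
    keep = np.abs(lags) <= int(max_itd_s * fs)              # physiological ITD range
    cc = cc[keep] / (np.sqrt((left ** 2).sum() * (right ** 2).sum()) + 1e-12)
    best = np.argmax(cc)
    return ild, lags[keep][best] / fs, cc[best]

# Regressor stage: per-excerpt cue features -> ensemble width in degrees.
width_model = GradientBoostingRegressor()   # hyperparameters are not from the paper
# width_model.fit(cue_features, widths); width_model.predict(new_features)
```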
DOA-informed switching independent vector extraction and beamforming for speech enhancement in underdetermined situations
Ueda T., Nakatani T., Ikeshita R., Araki S., Makino S.
Q2
Springer Nature
Eurasip Journal on Audio, Speech, and Music Processing, 2024, citations: 0,
Open access,
PDF, doi.org, Abstract
This paper proposes novel methods for extracting a single Speech signal of Interest (SOI) from a multichannel observed signal in underdetermined situations, i.e., when the observed signal contains more speech signals than microphones. It focuses on extracting the SOI using prior knowledge of the SOI’s Direction of Arrival (DOA). Conventional beamformers (BFs) and Blind Source Separation (BSS) with spatial regularization struggle to suppress interference speech signals in such situations. Although Switching Minimum Power Distortionless Response BF (Sw-MPDR) can handle underdetermined situations using a switching mechanism, its estimation accuracy significantly decreases when it relies on a steering vector determined by the SOI’s DOA. Spatially-Regularized Independent Vector Extraction (SRIVE) can robustly enhance the SOI based solely on its DOA using spatial regularization, but its performance degrades in underdetermined situations. This paper extends these conventional methods to overcome their limitations. First, we introduce a time-varying Gaussian (TVG) source model to Sw-MPDR to effectively enhance the SOI based solely on the DOA. Second, we introduce the switching mechanism to SRIVE to improve its speech enhancement performance in underdetermined situations. These two proposed methods are called Switching weighted MPDR (Sw-wMPDR) and Switching SRIVE (Sw-SRIVE). We experimentally demonstrate that both surpass conventional methods in enhancing the SOI using the DOA in underdetermined situations.