Frontier AI Research, Open to Everyone

Our Research Mission

We aim to make a meaningful contribution to the development of artificial intelligence by regularly sharing our state-of-the-art algorithms and research. We hope that each publication helps young scientists advance more rapidly and unlock new, exciting discoveries.
Low-complexity acoustic scene classification using mobile inverted bottleneck blocks

Abstract

This technical report describes our approaches to Task 1A (Low-Complexity Acoustic Scene Classification with Multiple Devices) of the DCASE 2021 Challenge. We propose a new architecture for acoustic scene classification built from mobile inverted bottleneck blocks (Fused-MBConv and MBConv) and based on EfficientNetV2. Our models have a very small number of parameters, and we apply several data augmentation techniques during training. Our best model has 62,346 non-zero parameters and achieves a classification macro-average accuracy of 70.5% and an average multiclass cross-entropy (log loss) of 0.848 on the development dataset. The resulting model size is just 121.8 KB (the model parameters are quantized to float16 after training).
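
As a sanity check on the reported size, 62,346 parameters stored in float16 (2 bytes each) occupy about 121.8 KB. Below is a minimal PyTorch sketch of the two block types mentioned above, following their definitions in EfficientNetV2; the channel widths, expansion ratio, strides, and any squeeze-and-excitation details of the submitted model are not specified here, so the values below are illustrative placeholders.

```python
# Minimal sketch of the EfficientNetV2 block types; all sizes are placeholders.
import torch.nn as nn


class MBConv(nn.Module):
    """Mobile inverted bottleneck: 1x1 expansion -> 3x3 depthwise conv -> 1x1 projection."""

    def __init__(self, channels: int, expansion: int = 4):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),  # depthwise
            nn.BatchNorm2d(hidden),
            nn.SiLU(),
            nn.Conv2d(hidden, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)  # residual connection


class FusedMBConv(nn.Module):
    """Fused variant: the expansion and depthwise convolutions are merged into one 3x3 conv."""

    def __init__(self, channels: int, expansion: int = 4):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 3, padding=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.SiLU(),
            nn.Conv2d(hidden, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)
```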

Unsupervised anomalous sound detection using multiple time-frequency representations

Abstract

This task aims to continue research on unsupervised anomalous sound detection and develop new high-performing systems for monitoring the condition of machines. In contrast to the DCASE 2021 Challenge Task 2, the 2022 task primarily focuses on domain generalization. First and foremost, we propose the idea of using ensembles of 2D CNN-based systems that utilize different time-frequency representations as input features. We use normal sound clips and their section indices to train our anomalous sound detection (ASD) systems for each machine type, and embedding vectors extracted from our CNNs, cosine similarity, and the k-nearest neighbors algorithm (k-NN) to calculate the anomaly scores of test clips. As a result, our method achieves an official score of 0.725 on the development dataset and significantly outperforms the baseline systems.
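
The scoring step described above can be sketched in a few lines: embeddings of normal clips (from the same machine type and section as the test clip) form a reference set, and the anomaly score of a test clip is taken as the mean cosine distance to its k nearest reference embeddings. The value of k and the mean aggregation are assumptions for illustration, not the exact settings of the submitted systems.

```python
# Minimal sketch of cosine-similarity + k-NN anomaly scoring over CNN embeddings.
import numpy as np


def anomaly_score(test_emb: np.ndarray, reference_embs: np.ndarray, k: int = 2) -> float:
    """Return an anomaly score for one test clip; higher means more anomalous."""
    # Normalize so that dot products equal cosine similarities.
    test = test_emb / np.linalg.norm(test_emb)
    refs = reference_embs / np.linalg.norm(reference_embs, axis=1, keepdims=True)
    cosine_dist = 1.0 - refs @ test      # distance to every normal (reference) embedding
    nearest = np.sort(cosine_dist)[:k]   # the k nearest neighbors
    return float(nearest.mean())
```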

Conversational RuBERT for detecting competitive interruptions in ASR-transcribed dialogues

Abstract

An interruption in a dialogue occurs when the listener begins speaking before the current speaker has finished. Interruptions can be broadly divided into two groups: cooperative (the listener wants to support the speaker) and competitive (the listener tries to take control of the conversation against the speaker's will). A system that automatically classifies interruptions can be used in call centers, specifically for customer satisfaction monitoring and agent monitoring. In this study, we developed a text-based interruption classification model using an in-house dataset of ASR-transcribed customer support telephone dialogues in Russian. We fine-tuned Conversational RuBERT on our dataset, optimized its hyperparameters, and the model performed well. With further improvements, the proposed model can be applied to automatic monitoring systems.
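
For illustration, the sketch below shows how a single ASR-transcribed turn could be scored with a Conversational RuBERT sequence classifier via the Hugging Face Transformers library. The checkpoint id, the two-label mapping, and the example utterance are assumptions; the actual model was fine-tuned on the in-house dataset with tuned hyperparameters.

```python
# Illustrative inference sketch; the checkpoint id and label names are assumptions,
# and the classification head must first be fine-tuned on labeled interruption data.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "DeepPavlov/rubert-base-cased-conversational"  # assumed Conversational RuBERT checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# A listener turn that overlaps the current speaker ("Wait, I'll explain everything myself").
turn = "Подождите, я сейчас сам всё объясню"
inputs = tokenizer(turn, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    logits = model(**inputs).logits
label = ["cooperative", "competitive"][logits.argmax(-1).item()]
print(label)
```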

Text-based detection of on-hold scripts in contact center calls

Abstract

Average hold time is a concern for call centers because it affects customer satisfaction. Contact centers should instruct their agents to use special on-hold scripts to maintain positive interactions with clients. This study presents a natural language processing model that detects on-hold phrases in customer service calls transcribed by automatic speech recognition technology. The task of finding hold scripts in dialogue was formulated as a multiclass text classification problem with three mutually exclusive classes: scripts for putting a client on hold, scripts for returning to a client, and phrases irrelevant to on-hold scripts. We collected an in-house dataset of calls and labeled each dialogue turn in each call. We fine-tuned RuBERT on the dataset by exploring various hyperparameter sets and achieved high model performance. The developed model can help agent monitoring by providing a way to check whether an agent follows predefined on-hold scripts.
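
A minimal sketch of the three-class formulation is given below: each dialogue turn is labeled as a hold script, a return script, or irrelevant, and a RuBERT sequence classifier is trained with cross-entropy over those labels. The checkpoint id, example turns, and label names are illustrative assumptions; the real model was fine-tuned on the in-house labeled call dataset with tuned hyperparameters.

```python
# Illustrative single training step for the three-class turn classifier.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["hold_script", "return_script", "irrelevant"]
MODEL_NAME = "DeepPavlov/rubert-base-cased"  # assumed RuBERT checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=len(LABELS))

# Toy labeled turns: "Please stay on the line, I'll check the information" (hold)
# and "Thank you for waiting, I'm back with you" (return).
turns = [
    "Оставайтесь, пожалуйста, на линии, я уточню информацию",
    "Спасибо за ожидание, я вернулся к вам",
]
labels = torch.tensor([0, 1])

batch = tokenizer(turns, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)  # cross-entropy loss over the three classes
outputs.loss.backward()                  # an optimizer step would follow (omitted here)
```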

ERANNs: Efficient residual audio neural networks for audio pattern recognition

Abstract

Audio pattern recognition (APR) is an important research topic with applications in many areas of everyday life, so accurate and efficient APR systems are needed for real-world use. In this paper, we propose a new convolutional neural network (CNN) architecture and a method for improving the inference speed of CNN-based systems for APR tasks. The proposed method also improves the performance of our systems, as confirmed in experiments conducted on four audio datasets. In addition, we investigate the impact of data augmentation techniques and transfer learning on the performance of our systems.
Our best system achieves a mean average precision (mAP) of 0.450 on the AudioSet dataset. Although this value is lower than that of the state-of-the-art system, the proposed system is 7.1x faster and 9.7x smaller. On the ESC-50, UrbanSound8K, and RAVDESS datasets, we obtain state-of-the-art results with accuracies of 0.961, 0.908, and 0.748, respectively. Our system for the ESC-50 dataset is 1.7x faster and 2.3x smaller than the previous best system. For the RAVDESS dataset, our system is 3.3x smaller than the previous best system. We name our systems "Efficient Residual Audio Neural Networks" (ERANNs).
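
The AudioSet figure quoted above is a mean average precision (mAP): the average precision is computed per class and then averaged over the 527 AudioSet classes. The sketch below shows how such a score can be computed with scikit-learn on randomly generated placeholder labels and scores, not the actual model outputs.

```python
# Illustrative mAP computation for a multi-label task; labels and scores are random placeholders.
import numpy as np
from sklearn.metrics import average_precision_score

num_clips, num_classes = 1000, 527  # AudioSet has 527 classes
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(num_clips, num_classes))  # multi-hot ground-truth labels
y_score = rng.random(size=(num_clips, num_classes))         # model output probabilities

# Average precision per class, then the unweighted mean over classes (macro mAP).
mAP = average_precision_score(y_true, y_score, average="macro")
print(f"mAP = {mAP:.3f}")
```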