Detecting aggressiveness in calls to call centers, monitoring stress in aircraft pilots, or enabling chat services in psychiatry and psychology are just some of the possible applications of systems that recognize emotions through the voice.
Although it is a relatively new field, researchers in Spain from the Higher Technical School of Computer Engineering (ETSIINF) of the Polytechnic University of Madrid (UPM), in collaboration with the Computational Intelligence Group of the University of the Basque Country (UPV/EHU), are working to create a model based on deep neural networks that, using deep learning techniques (a form of artificial intelligence), can recognize emotions in spoken language. In this way, “the system can react in one way or another according to each case, monitoring the responses to guide the dialogue or redirect it to a human,” explains Javier de Lope, professor at the UPM and a member of the project.
The research group, which belongs to the Department of Artificial Intelligence at the ETSIINF, has been working for years on emotion recognition systems, both with machine learning (a classic form of artificial intelligence) and with deep learning techniques.
“The model proposed in this work uses the latter. We focus on the recognition of a basic set of eight primary emotions, following one of the most widely accepted models in behavioral disciplines such as psychology and neurology,” he adds. These emotions correspond to states of calm, happiness, sadness, anger, fear, disgust and surprise, to which a neutral state is added.
According to De Lope, “emotion recognition from the voice is a much less studied field than speech recognition. The objective is not only to identify the word, but also to capture the way it is said, which is associated with the mood of the speaker.” These techniques have applications in many fields where the social aspect is relevant, such as social robotics, which aims to make up for or complement affective and relational shortcomings, or in helping to detect states of anxiety or depression.
In essence, a special type of spectrogram is generated from the audio and used to feed the neural network (a network that emulates the operation of a set of neurons). The proposed model processes the spectrogram images as sequences: a first set of convolutional layers extracts features from the images, followed by further layers that handle the temporal information inherent in speech. The model outputs a set of values from which the emotions associated with the input audio are determined.
Starting from a spectrogram of the voice to be analysed, the system processes the information and reaches a conclusion about the person’s emotional state. (Image: UPM)
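To illustrate this kind of pipeline, the sketch below splits a log-mel spectrogram into windows, applies a small convolutional feature extractor to each window, and runs an LSTM over the resulting sequence to classify eight emotions, in PyTorch. It is only a minimal approximation of the approach described above: the layer sizes, spectrogram parameters and the use of librosa and an LSTM are assumptions for illustration, not the authors' published architecture.

```python
# Minimal sketch (not the authors' exact model): time-distributed CNN + LSTM
# over log-mel spectrogram windows, assuming 8 emotion classes.
import librosa
import numpy as np
import torch
import torch.nn as nn

EMOTIONS = ["neutral", "calm", "happy", "sad", "angry", "fearful", "disgust", "surprised"]

def audio_to_spectrogram_windows(path, sr=16000, n_mels=64, frames_per_window=32):
    """Load an audio file and split its log-mel spectrogram into fixed-size windows."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    logmel = librosa.power_to_db(mel)                      # shape: (n_mels, total_frames)
    n_windows = logmel.shape[1] // frames_per_window
    windows = [logmel[:, i * frames_per_window:(i + 1) * frames_per_window]
               for i in range(n_windows)]
    # shape: (seq_len, 1 channel, n_mels, frames_per_window)
    return torch.tensor(np.stack(windows), dtype=torch.float32).unsqueeze(1)

class TimeDistributedCNNLSTM(nn.Module):
    def __init__(self, n_classes=len(EMOTIONS), n_mels=64, frames=32, hidden=128):
        super().__init__()
        # Convolutional feature extractor applied to every spectrogram window.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
        )
        feat_dim = 32 * (n_mels // 4) * (frames // 4)
        # Recurrent layer that models the temporal order of the windows.
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, x):                    # x: (batch, seq_len, 1, n_mels, frames)
        b, t = x.shape[:2]
        feats = self.cnn(x.reshape(b * t, *x.shape[2:])).reshape(b, t, -1)
        _, (h, _) = self.lstm(feats)         # last hidden state summarizes the sequence
        return self.classifier(h[-1])        # logits over the emotion classes

# Usage (hypothetical file path):
#   windows = audio_to_spectrogram_windows("speech.wav")
#   scores = TimeDistributedCNNLSTM()(windows.unsqueeze(0))
#   print(EMOTIONS[scores.argmax()])
```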
The results achieved so far are satisfactory. “With the current prototype, it has been possible to exceed the performance of most state-of-the-art models, while at the same time reducing the computational requirements of the neural network model,” explains the researcher. “We continue to test improvements and optimizations, both in the deep learning models and in the preprocessing of the data generated from the speech audio used to train the networks. We therefore anticipate an increase in performance in future versions,” he concludes.
The study, titled “A Hybrid Time-Distributed Deep Neural Architecture for Speech Emotion Recognition,” has been published in the academic journal International Journal of Neural Systems. (Source: UPM)