Optimizing Emotional Insight through Unimodal and Multimodal Long Short-term Memory Models
Abstract
Multimodal emotion recognition is attracting growing research interest. It involves analyzing human emotions across multiple modalities, such as acoustic, visual, and language, and it is more effective as a multimodal learning task than when it relies on a single modality. In this paper, we present unimodal and multimodal long short-term memory (LSTM) models with a class-weight parameter technique for emotion recognition on the CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset. Two challenges complicate this task. First, the number of samples per emotion class in CMU-MOSEI is highly imbalanced, which biases models and reduces accuracy on the less frequent emotion classes. Second, selecting the most effective method for fusing the modalities is not straightforward. To address the fusion challenge, we applied four techniques: early fusion, late fusion, deep fusion, and tensor fusion; each improved multimodal emotion recognition compared with the unimodal approaches. To address the class imbalance, we added a class-weight parameter to the training objective, which leads our models to outperform the state of the art on all three modalities (acoustic, visual, and language) as well as on all the fusion models. Our proposed models improve on the state-of-the-art results by 2–3% in the unimodal setting and by 2% in the multimodal setting.
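To make the two core ideas of the abstract concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of per-modality LSTM encoders, an early-fusion classification head over the concatenated embeddings, and a class-weighted loss to counter imbalanced emotion classes. It assumes a PyTorch setup; the feature dimensions, the six-class label set, the class counts, and the inverse-frequency weighting scheme are illustrative assumptions rather than values taken from the paper.

```python
# Hypothetical sketch: modality-specific LSTM encoders, early fusion by
# concatenation, and a class-weighted cross-entropy loss. All sizes and
# counts below are illustrative assumptions, not values from the paper.
import torch
import torch.nn as nn

NUM_CLASSES = 6                                    # e.g., six emotion labels (assumption)
ACOUSTIC_DIM, VISUAL_DIM, TEXT_DIM = 74, 35, 300   # illustrative feature sizes
HIDDEN = 64

class ModalityLSTM(nn.Module):
    """Encodes one modality's sequence into a fixed-size vector (last hidden state)."""
    def __init__(self, input_dim, hidden=HIDDEN):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden, batch_first=True)

    def forward(self, x):                 # x: (batch, time, input_dim)
        _, (h_n, _) = self.lstm(x)
        return h_n[-1]                    # (batch, hidden)

class EarlyFusionClassifier(nn.Module):
    """Concatenates the three modality embeddings before a shared classifier."""
    def __init__(self):
        super().__init__()
        self.acoustic = ModalityLSTM(ACOUSTIC_DIM)
        self.visual = ModalityLSTM(VISUAL_DIM)
        self.text = ModalityLSTM(TEXT_DIM)
        self.head = nn.Sequential(nn.Linear(3 * HIDDEN, HIDDEN),
                                  nn.ReLU(),
                                  nn.Linear(HIDDEN, NUM_CLASSES))

    def forward(self, a, v, t):
        fused = torch.cat([self.acoustic(a), self.visual(v), self.text(t)], dim=-1)
        return self.head(fused)

# Class weights: inverse-frequency weighting so rare emotions contribute more to the loss.
class_counts = torch.tensor([12000., 4000., 2500., 1800., 1500., 900.])  # made-up counts
class_weights = class_counts.sum() / (NUM_CLASSES * class_counts)
criterion = nn.CrossEntropyLoss(weight=class_weights)

model = EarlyFusionClassifier()
a = torch.randn(8, 20, ACOUSTIC_DIM)      # dummy batch: 8 clips, 20 time steps each
v = torch.randn(8, 20, VISUAL_DIM)
t = torch.randn(8, 20, TEXT_DIM)
labels = torch.randint(0, NUM_CLASSES, (8,))
loss = criterion(model(a, v, t), labels)  # weighted loss used during training
loss.backward()
```

In this sketch only the fusion head would change across the four strategies: late fusion would instead train a classifier per modality and combine their predictions, deep fusion would pass the concatenated embeddings through additional shared layers before classification, and tensor fusion would replace the concatenation with an outer product of the modality embeddings.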
Copyright (c) 2024 Hemin F. Ibrahim, Chu K. Loo, Shreeyash Y. Geda, Abdulbasit K. Al-Talabani
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.