Overview
One of the pillars of human-human communication is the capability to perceive, understand and respond to social interactions, usually conveyed through affective expression[1]. Equipping robots with emotion expression recognition can therefore drastically change our interaction with them[2]. A robot capable of understanding emotion expressions can improve its own problem-solving capability by using these expressions in its decision-making process, in a similar way to humans[3]. In particular, the use of emotional information between robots and older adults can considerably improve the interaction[4].
Research has indicated major differences in emotional reactions and regulation in older adults[5]. Therefore, the projects in WP2 will collaborate closely to investigate emotions in the aging society and to investigate synergies between different modalities. The final goal is integrated emotion recognition that takes into account age-related variations of Interaction Quality. Such variations may be caused by physical or cognitive fatigue, and also by large variations between individuals. A possible common application for the work in WP2 is social companion robots such as the PARO therapeutic robot, made available via our partner ADELE. PARO has been found to reduce patient stress and to stimulate interaction between elderly residents and caregivers, and is used in nursing homes. Equipping such robots with a multi-modal system for emotion recognition has the potential to improve interaction.
Tasks and Deliverables
TASKS
T2.1 Development of deep neural architecture incorporating unsupervised approaches to learn and recognise visual emotional states (ESR1)
T2.2 Auditory and language features for emotion recognition and development of deep neural architecture for classification (ESR2)
T2.3 Development of models for emotion recognition and generation based on gait (ESR3)
T2.4 Investigation of how Interaction Quality depends on fusing modalities for emotion recognition (ESR1, ESR2, ESR3)
DELIVERABLES
D2.1 Deep neural model for recognition of emotional state using visual upper-body and face input (ESR1) M23
D2.2 Deep neural model for recognition of emotional states using auditory input (ESR2) M23
D2.3 Model for recognition of emotional states from gait (ESR3) M23
D2.4 Report on emotional state recognition using visual and auditory input with deep neural architectures (ESR1, ESR2) M32
D2.5 Report on emotion generation on a TIAGo robot (ESR3) M32
D2.6 Report on multi-modal recognition of emotional states (ESR1, ESR2, ESR3) M40
Involved ESRs
ESR1 (HAM) Learning face and upper-body emotion recognition
ESR1 will focus on face and upper-body emotion recognition, since psychological studies[6] have shown that in non-verbal communication facial expressions and body motion complement each other, are perceived differently when shown individually, and together lead to more robust recognition of emotional states. Building on our previous work[7], where emotions were recognised from visual features, ESR1 will develop an architecture that combines unsupervised learning techniques with the dimensional model of emotions[8]. In this model, emotional states are represented by values along two universal dimensions, pleasure-displeasure (valence) and activation-deactivation (arousal); such a continuous representation is needed because even the same person can express the same emotion in different ways. Using this model, the aim is to develop a deep neural system capable of learning facial and upper-body features, creating an intensity grid of emotion expressions, and clustering them with self-organising layers to distinguish different and so far unknown emotional states. This is especially needed to compensate for varying Interaction Quality when interacting with older adults, who show different variance in emotional reactions, e.g. less distinct facial expressions or more aroused reactions. Several categories of complementary datasets will be used: the Cohn-Kanade dataset[9] contains acted emotion expressions, and CAM3D[10] contains spontaneous emotion expressions. The network will also be trained with data from the Emotions in the Wild dataset[11]. During a secondment to ESR3@BGU, general visual features will be investigated. During an industrial secondment to FHG, the model will be integrated on the Care-O-bot platform and evaluated (this is a tentative plan that may be adjusted to best fit actual research).
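To make the intended valence-arousal representation concrete, below is a minimal PyTorch sketch; it is purely illustrative, and the network size, input resolution and tanh-bounded two-dimensional output are assumptions rather than the project's actual design. It maps a face or upper-body crop to continuous (valence, arousal) values; in the envisaged architecture such representations would then be clustered by self-organising layers to discover so far unknown emotional states.

```python
import torch
import torch.nn as nn

class ValenceArousalCNN(nn.Module):
    """Maps an RGB face/upper-body crop to a (valence, arousal) pair in [-1, 1]."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),            # fixed-size feature map regardless of input size
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 4 * 4, 64), nn.ReLU(),
            nn.Linear(64, 2),                   # two outputs: (valence, arousal)
            nn.Tanh(),                          # constrain both dimensions to [-1, 1]
        )

    def forward(self, x):                       # x: (batch, 3, H, W)
        return self.head(self.features(x))

model = ValenceArousalCNN()
crops = torch.randn(8, 3, 96, 96)               # placeholder face/upper-body crops
print(model(crops).shape)                       # torch.Size([8, 2])
```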
ESR2 (HAM) Learning emotion recognition through auditory cues and language
Compared to facial expressions, the detection of emotion from sound and language depends more strongly on temporal features. In speech processing, deep learning has emerged as a prominent and highly successful research direction over the past few years[12],[13]. ESR2 will aim at a neural architecture that learns to incorporate auditory features to detect emotional states[14], ranging from low-level prosodic features to higher-level cues at the word or sentence level (grammar, sentence structure, specific word combinations signalling anger, etc.), exploring recurrent architectures on top of several different feature extractors working at different time scales. Because it combines several features for more robust detection, the chosen architecture can increase auditory communication quality with older adults who exhibit less distinct auditory features. The architecture will substantially extend previous work on multi-modal feature extraction[15] and will be evaluated with databases for auditory affect recognition[16], e.g. the VAM database[17]. During a secondment to ESR12@UWE, data will be recorded and Wizard-of-Oz experiments will be conducted. Possibilities to apply the results on a social companion robot will be investigated during a secondment to ADELE using their PARO robot (this is a tentative plan that may be adjusted to best fit actual research).
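As a concrete illustration of recurrent processing over feature streams at different time scales, the following sketch fuses a frame-level acoustic stream with a coarser word-level language stream; it is an assumed example rather than the project's architecture, and the GRU units, feature dimensions and valence-arousal output head are all placeholder choices.

```python
import torch
import torch.nn as nn

class TwoTimescaleEmotionNet(nn.Module):
    """Fuses a fast acoustic stream and a slow language stream into one emotion estimate."""
    def __init__(self, frame_dim=40, word_dim=100, hidden=64):
        super().__init__()
        self.frame_rnn = nn.GRU(frame_dim, hidden, batch_first=True)  # prosodic/acoustic frames
        self.word_rnn = nn.GRU(word_dim, hidden, batch_first=True)    # word-level embeddings
        self.head = nn.Linear(2 * hidden, 2)                          # (valence, arousal)

    def forward(self, frames, words):
        _, h_frame = self.frame_rnn(frames)     # final hidden state of the acoustic stream
        _, h_word = self.word_rnn(words)        # final hidden state of the language stream
        fused = torch.cat([h_frame[-1], h_word[-1]], dim=-1)
        return torch.tanh(self.head(fused))

net = TwoTimescaleEmotionNet()
frames = torch.randn(4, 300, 40)                # 4 utterances, 300 acoustic frames each
words = torch.randn(4, 20, 100)                 # 4 utterances, 20 word vectors each
print(net(frames, words).shape)                 # torch.Size([4, 2])
```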
ESR3 (BGU) Emotion recognition and expression based on Human Motion
ESR3 will examine the effect of people’s emotional experiences on their gait. Recent research has investigated the association between emotional state and human body motion[18], but a quantitative assessment of the effect of emotions on body motion, i.e. gait, is still lacking. The complexity and versatility of the musculoskeletal motor system, and its intricate connections with the emotional system, require multidisciplinary efforts, from performance arts to neuroscience and human-computer interaction. Building on our earlier work[19],[20], the methodology will combine practices used in psychology research and in biomechanics. Emotional states will be manipulated in a within-subjects design across four conditions: happiness, relaxation, sadness, and fear. The specific emotions to be analysed will be based on focus groups in elderly homes, in collaboration with ESR15@BGU. Since human mobility varies significantly among older adults, this will be taken into account and considered as part of the quality of interaction. The results will be used to develop models and algorithms for human-robot interaction such that the robot can both express emotions through body motion and understand the human’s emotional state. Regression and decision tree classification methods will be developed to relate how human motion and posture characteristics represent different emotions, as sketched below. Since research has indicated that emotion expression and regulation differ with age, comparisons to other populations will be carried out throughout the research to ensure the findings are valid for other applications. During a secondment to ESR2@HAM, the focus will be on developing recurrent algorithms for classification and combining them with the work on auditory cues. During a secondment to PAL-R, the algorithms for generating robot motions that represent emotions will be implemented and evaluated on their TIAGo robot platform, aided by the PAL engineers (this is a tentative plan that may be adjusted to best fit actual research).
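A minimal scikit-learn sketch of the decision-tree part of this analysis follows; the gait features, their names and the synthetic data are hypothetical placeholders rather than the project's feature set or results.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic placeholder features: [stride length (m), cadence (steps/min), trunk sway (deg)]
X = rng.normal(size=(200, 3))
# Placeholder labels: 0=happiness, 1=relaxation, 2=sadness, 3=fear
y = rng.integers(0, 4, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
print(f"held-out accuracy on synthetic data: {clf.score(X_test, y_test):.2f}")
```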
References
[1] Foroni, F., & Semin, G. R. (2009). Language that puts you in touch with your bodily feelings: The multimodal responsiveness of affective expressions. Psychological Science, 20(8), 974–980.
[2] Rani, P., & Sarkar, N. (2004). Emotion-sensitive robots – a new paradigm for human-robot interaction. In IEEE/RAS International Conference on Humanoid Robots (Humanoids 2004), Vol. 1, pp. 149–167.
[3] Bandyopadhyay, D., Pammi, V. C., & Srinivasan, N. (2013). Chapter 3 – Role of affect in decision making. In V. C. Pammi & N. Srinivasan (Eds.), Decision Making: Neural and Behavioural Approaches, Progress in Brain Research, Vol. 202, pp. 37–53. Elsevier.
[4] Castillo, J. C., et al. (2014). A framework for recognizing and regulating emotions in the elderly. In Ambient Assisted Living and Daily Activities, pp. 320–327.
[5] Scheibe, S., & Carstensen, L. L. (2010). Emotional aging: Recent findings and future trends. Journal of Gerontology: Psychological Sciences, 1–10.
[6] Gu, Y., Mai, X., & Luo, Y.-j. (2013). Do bodily expressions compete with facial expressions? Time course of integration of emotional signals from the face and the body. PLoS ONE, 8(7), e66762.
[7] Barros, P., Weber, C., & Wermter, S. (2015). Emotional expression recognition with a cross-channel convolutional neural network for human-robot interaction. In Proceedings of the IEEE-RAS International Conference on Humanoid Robots (Humanoids), Seoul, South Korea.
[8] Kim, S. M., Valitutti, A., & Calvo, R. A. (2010). Evaluation of unsupervised emotion models to textual affect recognition. In Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, pp. 62–70. Association for Computational Linguistics.
[9] Lucey, P., Cohn, J., Kanade, T., Saragih, J., Ambadar, Z., & Matthews, I. (2010). The Extended Cohn-Kanade Dataset (CK+): A complete dataset for action unit and emotion-specified expression. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 94–101.
[10] Mahmoud, M., Baltrusaitis, T., Robinson, P., & Riek, L. (2011). 3D corpus of spontaneous complex mental states. In S. D’Mello, A. Graesser, B. Schuller, & J.-C. Martin (Eds.), Affective Computing and Intelligent Interaction, Lecture Notes in Computer Science, Vol. 6974, pp. 205–214.
[11] Dhall, A., et al. (2012). Collecting large, richly annotated facial-expression databases from movies. IEEE MultiMedia, 19(3), 34–41.
[12] Bellegarda, J. R., & Monz, C. (2016). State of the art in statistical methods for language and speech processing. Computer Speech & Language, 35, 163–184.
[13] Amodei, D., Anubhai, R., … & Elsen, E. (2015). Deep Speech 2: End-to-end speech recognition in English and Mandarin. arXiv:1512.02595.
[14] Weninger, F., Eyben, F., Schuller, B. W., Mortillaro, M., & Scherer, K. R. (2013). On the acoustics of emotion in audio: What speech, music, and sound have in common. Frontiers in Psychology, 4.
[15] Barros, P., Jirak, D., Weber, C., & Wermter, S. (2015). Multimodal emotional state recognition using sequence-dependent deep hierarchical features. Neural Networks, 72, 140–151.
[16] Zeng, Z., Pantic, M., Roisman, G., & Huang, T. S. (2009). A survey of affect recognition methods: Audio, visual, and spontaneous expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(1), 39–58.
[17] Grimm, M., Kroschel, K., & Narayanan, S. (2008). The Vera am Mittag German audio-visual emotional speech database. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Hannover, pp. 865–868.
[18] Gross, M. M., Crane, E. A., & Fredrickson, B. L. (2007). Effect of felt and recognized emotions on gait kinematics. American Society of Biomechanics.
[19] Riemer, R., Hsiao-Wecksler, E., & Zhang, X. (2008). An analysis of uncertainties in inverse dynamics solutions using a 2D model. Gait & Posture, 27(4).
[20] Riemer, R., & Hsiao-Wecksler, E. (2009). Improving net joint torque calculations through a two-step optimization method for estimating body segment parameters. Journal of Biomechanical Engineering, 131, 11071–11077.