Introduction
Social media, as a network platform where users create, share, and communicate, makes it more convenient for users to access information and gives them more choice and editorial control. Unlike print media such as newspapers, social media supports a variety of content forms (Sahoo & Gupta, 2021; Ahmed et al., 2022; Almomani et al., 2022). Beyond text, social media can deliver more intuitive, multidimensional content through modalities such as voice and images. Images, speech, and text are the modalities encountered most often in daily life (Su et al., 2023; Gao et al., 2022; Balcilar et al., 2021).
Sentiments play a crucial role in our daily lives, helping us communicate, learn, and make decisions. Researchers have long worked on analyzing human sentiments with machines (Tiwari et al., 2021; Schneider et al., 2023; Singh & Sachan, 2021). Early multimodal sentiment analysis (MSA) research often focused on information from a single modality, such as sound, text, visual, or biological signals. However, a single modality often fails to capture users' sentiments accurately (Salhi et al., 2021; Mohammed et al., 2022; Garcia-Garcia, 2023): the same text may express opposite meanings in different contexts, so it is difficult to predict a user's sentiment from one modality alone. Because single-modal sentiment analysis cannot effectively process heterogeneous data or exploit the diversity of available information, it is no longer suited to today's complex environment (Sun et al., 2020; Yuan et al., 2021; Zhang et al., 2022).
As research has deepened, researchers have found that multimodal information is more effective than single-modal information for analyzing human sentiments. In daily life, people communicate and express sentiments through a fusion of sound, text, and visual modalities. Building on single-modal approaches, MSA mines user opinions and emotional states from data such as text, vision, and speech (Chen et al., 2022; Niu et al., 2021; Poria et al., 2023).
Multimodal emotion recognition can be used to analyze user emotions on social media: by combining text, image, and video data, users' emotional tendencies and states can be understood more accurately. It also plays an important role in education, where analyzing students' speech, facial expressions, and gestures can reveal their emotional state and learning outcomes and support personalized teaching and feedback. A multimodal emotion recognition system can be integrated with an existing system or platform through an API or SDK, and its results can serve as input for decision-making, personalized recommendation, sentiment analysis, and other functions.
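To make the fusion step concrete, the sketch below shows one simple way the per-modality results described above could be combined: late fusion by weighted averaging of independent sentiment scores. The function name, score range, and weights are illustrative assumptions, not the API of any specific system discussed in this article.

```python
# Minimal late-fusion sketch (illustrative, not a specific system's API).
# Assumption: each modality model has already produced a sentiment score
# in [-1, 1]; a weighted average yields a single fused score that can be
# passed to downstream functions such as recommendation or decision-making.

def fuse_sentiments(scores, weights=None):
    """Combine per-modality sentiment scores by weighted averaging.

    scores  -- dict mapping modality name to a score in [-1, 1]
    weights -- optional dict of per-modality weights (default: equal)
    """
    if not scores:
        raise ValueError("at least one modality score is required")
    if weights is None:
        weights = {m: 1.0 for m in scores}
    total = sum(weights[m] for m in scores)
    # Weighted average keeps the fused score in the same [-1, 1] range.
    return sum(scores[m] * weights[m] for m in scores) / total

# Example: positive text, mildly positive image, neutral audio.
fused = fuse_sentiments(
    {"text": 0.8, "image": 0.3, "audio": 0.0},
    weights={"text": 0.5, "image": 0.3, "audio": 0.2},
)
# fused = (0.8*0.5 + 0.3*0.3 + 0.0*0.2) / 1.0 = 0.49
```

Late fusion is only one design choice; early fusion (concatenating features before classification) or hybrid schemes are common alternatives when modality interactions matter.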