Optimized Feature Selection Approach with Elicit Conditional Generative Adversarial Network Based Class Balancing Approach for Multimodal Sentiment Analysis in Car Reviews
Abstract
Multimodal Sentiment Analysis (MSA) is a growing area of emotional computing that involves analyzing data from three different modalities. Gathering data from Multimodal Sentiment Analysis in Car Reviews (MuSe-CaR) is challenging due to data imbalance across modalities. To address this, an effective data augmentation approach is proposed by combining dynamic synthetic minority oversampling with a multimodal elicit conditional generative adversarial network for emotion recognition using audio, text, and visual data. The balanced data is then fed into a granular elastic-net regression with a hybrid feature selection method based on dandelion Fick's law optimization to analyze sentiments. The selected features are input into a multilabel wavelet convolutional neural network to classify emotion states accurately. The proposed approach, implemented in Python, outperforms existing methods in terms of trustworthiness (0.695), arousal (0.723), and valence (0.6245) on the car review dataset. Additionally, the feature selection method achieves high accuracy (99.65%), recall (99.45%), and precision (99.66%). This demonstrates the effectiveness of the proposed MSA approach, even with three modalities of data.
1. Introduction
With the increasing wealth of information available on the internet, individuals can enhance their product choices and lifestyles based on textual details.1 Businesses, recognizing the importance of understanding customer opinions, employ sentiment analysis through social media monitoring, brand tracking, and emotions found in emails and comments.2 Sentiment analysis helps to deduce the emotional tone underlying text to comprehend user opinions.2 However, analyzing sentiment in large datasets, such as user reviews, poses challenges due to variations in sequence length and textual order. Deep learning approaches are introduced in text classification and question answering to address these challenges.3,4
Machine learning and deep learning techniques were applied to MSA, even in domains such as analyzing sentiments from coronavirus-related tweets.5 Innovative applications, such as sentiment classification in Massive Open Online Courses (MOOC), have utilized methods organized under ensemble learning paradigms.6 The integration of deep learning approaches serves as a robust baseline model for feature extraction in sentiment analysis.7,39,40 Various sentiment analysis methods use lexicon-based approaches to identify the orientation of text documents, while machine learning-based methods leverage labeled datasets to train models.8,9
In the context of MSA, polarity detection is often considered the most relevant information.10 Classification tasks such as subjective classification, opinion extraction, word sentiment classification, and document sentiment classification are employed to identify classes in datasets.11 The domain of sentiment analysis has witnessed growing interest in applications such as visual summaries and trend analysis.12 Furthermore, hyperparameter optimization strategies play a crucial role in minimizing generalization errors in machine learning approaches.13,14
Unlike single-modal sentiment analysis,33,34 MSA poses several challenges and opportunities. Early approaches relied on handcrafted features, limiting sentiment abstraction and yielding suboptimal results. Explaining multimodal models is complex due to the need to relate model performance to diverse input data. The heterogeneity and dimensionality of human behaviors hinder feature interpretation and understanding of model decisions. Additionally, research on compact, human-friendly data summaries is scarce. Interpreting inter-modal interactions is non-trivial, despite their importance. For instance, discerning positive sentiment from neutral voice and facial cues remains challenging, highlighting the unique nature of MSA. Therefore, a new MSA approach is developed to differentiate emotions using different modalities such as audio, video, and text.
Motivation
In the digital age, user-generated content on online platforms has created a rich repository of multimodal data, including text, audio, and video. This diversity presents both an opportunity and a challenge for sentiment analysis, especially in understanding and interpreting human emotions comprehensively. MSA emerges as a critical field of emotional computing research, leveraging varied data to gain deeper insights into user sentiments. MSA integrates and analyzes multiple data modalities, providing an accurate understanding of sentiments compared to unimodal approaches. For instance, text alone might convey a different emotional tone when not considered alongside corresponding vocal intonations or facial expressions present in audio and video data. By combining these modalities, MSA offers a holistic view of sentiments, leading to more robust and reliable analysis outcomes. The application of MSA is particularly impactful in car review analysis, where user feedback encompasses various forms of expression. Car reviews often include detailed text descriptions, vocal commentary, and visual demonstrations, making them an ideal candidate for multimodal analysis. Understanding the sentiments behind these reviews provides automotive companies with valuable insights into customer preferences, and overall satisfaction. This aids in product improvement, marketing strategies, and customer engagement, ultimately enhancing the consumer experience and driving business success.
Contributions
Social media information is a valuable resource for sentiment analysis applications, aiding in the identification of user emotions to solve various problems. This work contributes to the field by emphasizing texture classification through spectral analysis.
This research work utilizes the Granular Elastic-Net Regression (GENR) model to enhance texture classification and improve the accuracy and robustness of MSA, especially in scenarios with intricate textual patterns.
The proposed Multilabel Wavelet Convolutional Neural Network (MWCNN) offers a comprehensive analysis by considering both low and high-frequency components in textual data, improving sensitivity to a wider range of features compared to traditional CNNs.
Integration of Dynamic Synthetic Minority Over-sampling TEchnique (DSMOTE) provides a sophisticated mechanism to address biases in dataset sampling, ensuring a more representative dataset compared to traditional methods.
Also, GENR is introduced to refine the regularization process and improve model adaptability, overcoming sparsity limitations inherent in Elastic Net Regression (ENR) for more precise sentiment analysis tasks.
This research work is organized as follows: An overview of several existing sentiment analyses with the MuSe-CaR dataset is provided in Section 2. The developed model is briefly explained in Section 3 with the steps of data acquisition, balancing class samples, feature selection, and classification. Section 4 gives a detailed explanation of the experimental result. Finally, the whole research work is concluded in Section 5.
2. Literature Survey
Exploring emotion recognition and sentiment analysis through multimodal data, the literature survey covers diverse methodologies, ranging from extensive annotation and fusion techniques to unique feature extraction approaches. Various research works based on MSA have previously been carried out. Some of them are reviewed and listed in Table 1.
Authors | Methods | Pros | Cons |
---|---|---|---|
Stappen et al.15 | Multi head Attention Network (MAN) | Continuous prediction capability of multitasking | Limited in the multimodal data analysis |
Sun et al.16 | Self-attention based Long Short-Term Memory (LSTM) with Recurrent Neural Network (RNN) | The performance is improved by continuous prediction | Limited to explore multitask learning and advanced fusion models |
Stappen et al.17 | Rater Aligned Annotation Weighting (RAAW) | It offered a cohesive approach to developing regression | Limited to reduced dimension |
Baird et al.18 | Lexical knowledge-based extraction approach | Low computational power and improved the linguistic baseline | Did not combine high-level features from different modalities using unsupervised methods |
Vlasenko et al.19 | Ensemble based classifier | Classification performance was improved by Natural Language Processing based approaches | Did not provide an improved method combining all three modalities |
Schuller et al.20 | Support Vector Regression (SVR) | Consistently more robust prediction for video likes | Did not accurately predict hate likes and fake news |
Cambria et al.21 | LSTM-RNN | High possible transparency with a large set of features | Limited to combine modalities at an earlier stage |
Jiang et al.22 | Hybrid temporal model | The best prediction result was achieved by multimodal features | Did not achieve better results with combined arousal and valence |
Yadav et al.23 | An effective rater ensemble model | It takes all information about human emotions without losing any subjective information | More advanced architectures are needed to improve the performance (Accuracy: 77.89%) of the model |
Padminivalli et al.24 | Spatial-temporal deep neural network | Overfitting is avoided by introducing cross-validation during testing and training | Struggled with recognizing neutral emotions (98.3 %) |
Liu et al.30 | Modality Translation-based MSA model (MTMSA) | Minimizing the complexity of the MSA model | The accuracy of the suggested model sharply decreases as the missing rates increase (67.29% to 47.5%) |
Chandrasekaran et al.31 | Long short-term memory eXtreme Gradient Boosting (LXGB) | The suggested model excels in MSA, accurately classifying emotions across diverse data types | The model struggled with predicting neutral emotions (92.9%) |
Wang et al.32 | Cross-Correlation in Dual-Attention (CCDA) | Better performance in MSA | Increased computational complexity |
Stappen et al.15 presented an extensive annotation process covering all the emotions and selected videos with changing shots and dynamic backgrounds. The intensity class of trustworthiness is predicted based on audio and visual behaviors. Sun et al.16 presented a human emotion/sentiment analysis with videos, audio, and text to automate sentiment analysis in many areas. The relevant features are extracted using low-level and hybrid deep learning approaches to explore robust feature extraction. The model performance is further enhanced by introducing a fusion technique for arousal and valence.
Stappen et al.17 presented an annotation toolkit relating the various types of fusion techniques in continuous annotation. The relevant configuration parameters were pointed out in a systematic manner to improve the capability of the toolkit. Another MSA model was developed by Baird et al.18 for exploring sub-symbolic representations of sentic information within the emotional information provided by videos. It also demonstrated the usefulness of high-level features from text and audio. With these, better learning to predict valence and arousal was obtained.
Automatic emotion recognition is presented by Vlasenko et al.19 with turn-level prediction of emotions using the valence and arousal dimensions. Different approaches were investigated to fuse the text and audio features, and various deep learning structures were explored for cross-dependencies. Schuller et al.20 have presented a first approach to feature extraction and selection without using audio, video, and text; there is no prediction of YouTube videos against time-series features. Different trade-offs between interpretability and accuracy were chosen for the prediction method.
Cambria et al.21 have presented a challenging sentiment analysis approach with a large set of features using open-source software. They then described multimodal feature extraction in detail, with preprocessing and alignment applied for the baseline modelling. The best prediction outcomes were obtained using multimodal features. Jiang et al.22 have explained the mapping of continuous dimensional emotions to discrete classes. They predicted emotion classes based on audio and video from user reviews. For the multimodal features, feature fusion was best for segmenting the data.
Yadav et al.23 have presented emotion recognition with audio and video data. It also explored the fusion of ensemble predictions with several techniques. The main motive of this model is to retain the whole information from raw annotation to predicted emotions. Padminivalli et al.24 have presented an audio-visual block that is analyzed with a temporal convolutional network. Along with this block, a leader-follower attentive fusion block is used; after that, cross-modality fusion is achieved to obtain a noise-removed network.
Liu et al.30 suggested an MTMSA robust to uncertain missing modalities. The model translates visual and audio data to text and fuses them into Missing Joint Features (MJFs). A transformer encoder, supervised by a pre-trained model, encodes MJFs to approximate complete modalities. The transformer decoder learns intermodal dependencies, facilitating sentiment classification. However, the suggested model’s accuracy sharply decreases with increased missing rates, impacting intermodal feature projection and affecting visual and auditory feature space projection onto text.
Chandrasekaran et al.31 utilized the hybrid LXGB technique, which combined LSTM and eXtreme Gradient Boosting (XGBoost) classifiers for MSA. The suggested model addressed emotional understanding across image, textual, and audio data, showcasing effectiveness in capturing different sentiments.
Wang et al.32 developed the CCDA model and utilized dual attention mechanisms to obtain inter and intramodal dynamics efficiently. The model incorporates a cross-correlation loss to acquire attention correlation and utilizes relevant coefficients for effective feature integration. However, these computations, relying on matrix multiplication, increase model complexity and involve redundant information.
The disadvantages in existing methods are considered to improve the proposed system by introducing separate hybrid methods for data balancing and feature selection. The developed scheme is aimed at attaining a better outcome with the dimension of arousal, valence, and trustworthiness.
Problem statement
Many approaches have been proposed for analyzing MSA. However, the existing approaches are constrained by the lack of combined modalities, which may result in poor prediction performance. Also, the existing methods do not provide outcomes across all three emotional states, namely arousal, valence, and trustworthiness. Since most sentiment analyses are data driven, the capability of machine learning methods is limited by the training data. Looking at existing MSA, it seems that most people express positive opinions, as there is a lack of negative opinions.16,17,18,19,20,21,22,23,24
3. Proposed Methodology for MSA
Since users generate videos, audio, and text on online websites, it is important to understand human emotions and sentiments to analyze the reviews of a given product. Figure 1 shows the workflow of this research work.
• MSA is a rapidly growing field of emotional computing research, where reviews are gathered from the MuSe-CaR dataset.
• These reviews span three different modalities, namely audio, video, and text, and pose challenges due to imbalanced data, especially when multimodal data from various modalities is involved; this is addressed by an effective data augmentation approach combining DSMOTE with a multimodal Elicit Conditional Generative Adversarial Network (ECGAN).
• The balanced data is then given to the GENR with the Hybrid Dandelion Fick's Law Optimization (HDFLO) based feature selection method.
• The selected features are given to the MWCNN for accurate classification of sentiment states as valence, arousal, and trustworthiness.

Fig. 1. The workflow of the proposed system.
3.1. Multimodal data collection
Data is sourced from MuSe-CaR, a large car review dataset with multimodal data, featuring videos with varied face angles, backgrounds, and occlusions; audio with high noise; and text containing colloquialisms. Given these challenges, sentiment analysis faces imperfect classification accuracy. To address this, an imbalanced-data augmentation approach is applied to achieve balanced datasets.
3.2. Data augmentation method for imbalanced data using combined DSMOTE with ECGAN
The proposed MSA system aims to categorize the data into three classes, but the dataset is affected by an imbalanced class distribution problem, which degrades classifier performance. ECGAN is combined with DSMOTE to obtain efficiently balanced classes at the classifier without blind oversampling. Moreover, the existing models lack control over the locations where synthetic samples are produced in the data space, leading to two key issues: (1) noisy minority samples create more noisy samples during oversampling, and (2) noisy samples within the majority class area cause under-fitting due to class overlap. The proposed data augmentation method has different phases. Initially, data are separated by class labels. Next, minority class samples are categorized into border and noisy types to oversample borderline samples while ignoring noisy ones, avoiding misclassification. Finally, range-controlled oversampling prevents minority samples from being generated within majority class regions.
Initially, the border samples are oversampled to balance the dataset by generating new synthetic samples. New synthetic samples ($X_{new}$) are generated along randomly selected line segments between a minority sample ($\hat{x}_i$) and its neighbors ($x_i$). The generated new samples are expressed in Eq. (1).
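For illustration, a minimal sketch of this border/noise categorization and segment-wise interpolation is given below; the k-NN based labeling rule and the scikit-learn usage are assumptions of the sketch, not the exact DSMOTE–ECGAN procedure.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def categorize_minority(X_min, X_maj, k=5):
    """Label each minority sample as 'noisy', 'border', or 'safe' from its k nearest
    neighbours in the full data (hypothetical rule: noisy if every neighbour is
    majority-class, border if the neighbourhood is mixed)."""
    X_all = np.vstack([X_min, X_maj])
    y_all = np.r_[np.ones(len(X_min)), np.zeros(len(X_maj))]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_all)
    _, idx = nn.kneighbors(X_min)
    maj_frac = (y_all[idx[:, 1:]] == 0).mean(axis=1)   # skip the sample itself
    return np.where(maj_frac == 1.0, "noisy",
                    np.where(maj_frac > 0.0, "border", "safe"))

def oversample_border(X_min, labels, n_new, k=5, seed=0):
    """Generate synthetic samples on line segments between border minority samples
    and their minority-class neighbours (SMOTE-style interpolation)."""
    rng = np.random.default_rng(seed)
    X_border = X_min[labels == "border"]
    nn = NearestNeighbors(n_neighbors=k).fit(X_min)
    _, idx = nn.kneighbors(X_border)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_border))
        x_hat = X_border[i]                      # border minority sample
        x_j = X_min[rng.choice(idx[i])]          # one of its minority neighbours
        lam = rng.random()                       # random point on the segment
        new.append(x_hat + lam * (x_j - x_hat))
    return np.asarray(new)
```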
3.3. Feature selection using GENR HDFLO
GENR is also applied to multimodal data, such as audio, video, and text, by using recursive feature elimination to determine the appropriate number of features. In this context, each modality represents a different set of features, and the goal is to identify the most relevant features within each modality as well as potential interactions between modalities. The block diagram of the proposed feature selection (GENR HDFLO) method is shown in Fig. 2.

Fig. 2. The block diagram of GENR HDFLO.
The GENR HDFLO technique enhances feature selection by identifying clusters of correlated features rather than individual ones, improving predictive performance and understanding of data relationships. Each modality’s features are treated as separate clusters, applying GENR within them to account for interdependencies. The method integrates L1 and L2 regularizations (lasso and ridge) into an ENR framework, balancing sparsity and coefficient magnitude. The HDFLO algorithm combines Fick’s Law Optimization (FLO) and Dandelion Optimization (DO) to balance exploration and exploitation during optimization. DO simulates dandelion seed dispersion stages, while FLO’s Steady-State Operator (SSO) ensures efficient navigation through optimization stages.
Instead of simply selecting individual features, the method identifies and selects groups or clusters of features that exhibit similar behavior or high correlations. This leads to improved predictive performance and a better understanding of the underlying relationships in the data. GENR is applied to each modality separately by treating the features from each modality as a separate cluster and performing ENR within that cluster. This step helps to identify the most relevant features within each modality while considering their interdependencies.
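A minimal sketch of this per-modality selection step is shown below, assuming scikit-learn's ElasticNet and recursive feature elimination as stand-ins for GENR; the function name and the fraction of retained features are illustrative.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import ElasticNet

def select_per_modality(modalities, y, keep_frac=0.5, alpha=1.0, l1_ratio=0.5):
    """Illustrative per-modality feature selection: fit an elastic-net regressor
    inside each modality "cluster" and keep the strongest features via recursive
    feature elimination. `modalities` maps a modality name (e.g. 'audio',
    'video', 'text') to its feature matrix."""
    selected = {}
    for name, X in modalities.items():
        enet = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=10_000)
        rfe = RFE(estimator=enet,
                  n_features_to_select=max(1, int(keep_frac * X.shape[1])))
        rfe.fit(X, y)
        selected[name] = X[:, rfe.support_]      # features kept in this cluster
    # concatenate the per-modality selections for the downstream classifier
    return np.hstack(list(selected.values())), selected
```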
L1 and L2 Regularization: L2 regularization involves adding a regularization parameter β multiplied by the sum of the squares of the weights to the loss function LF. This method is known as Tikhonov regularization. L1 regularization, on the other hand, replaces the squared weights with the absolute values of the weights. The mathematical representation is formulated in Eq. (3),
Ridge Regression: Ridge regression incorporates L2 regularization along with the Mean Squared Error (MSE) loss function. Its cost function G(δ) is depicted in Eq. (4),
Lasso Regression: Lasso regression utilizes L1 regularization, follows similar processes as ridge regression, and employs the same MSE loss function. The cost function G(δ) in lasso regression is represented by Eq. (5).
ENR: In this network, the cost function combines the MSE loss function with both L2 and L1 regularization, expressed in Eq. (6),
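Since the equation bodies are not reproduced here, the standard cost functions that Eqs. (4)–(6) correspond to can be sketched as follows (generic ridge, lasso, and elastic-net forms rather than the paper's exact notation):

$$
\begin{aligned}
G_{\text{ridge}}(\delta) &= \frac{1}{m}\sum_{i=1}^{m}\bigl(v_i-\hat{v}_i\bigr)^{2} + \beta\sum_{k}\delta_{k}^{2},\\
G_{\text{lasso}}(\delta) &= \frac{1}{m}\sum_{i=1}^{m}\bigl(v_i-\hat{v}_i\bigr)^{2} + \beta\sum_{k}\lvert\delta_{k}\rvert,\\
G_{\text{ENR}}(\delta)   &= \frac{1}{m}\sum_{i=1}^{m}\bigl(v_i-\hat{v}_i\bigr)^{2} + \beta_{1}\sum_{k}\lvert\delta_{k}\rvert + \beta_{2}\sum_{k}\delta_{k}^{2}.
\end{aligned}
$$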
HDFLO: The algorithm integrates FLO and DO, combining their update rules to optimize exploration and exploitation. DO mimics dandelion seed flight stages such as rising in spirals or drifting locally, adjusting globally, and landing randomly to grow. Meanwhile, FLO includes an SSO ensuring a balance between exploration and exploitation. By incorporating FLO’s SSO into DO, the algorithm achieves enhanced capability in navigating diverse optimization stages effectively, avoiding local optima stagnation, and ensuring robust exploration-exploitation trade-offs. A detailed explanation of the mathematical model behind this new meta-heuristic algorithm, which draws inspiration from the optimal reproduction location of dandelion seeds as they mature, is provided. In the initialization phase DO randomly generates a candidate solution which is given in Eq. (8),
The fitness value $F(X_i)$ of the $i$th seed in the population is taken to minimize the loss function. The fitness function is expressed in Eq. (9),
Rising stage: In this stage, the weather conditions along with the speed of wind jointly find the dandelion seed’s height. In the search space, dandelion seeds are blown to various locations, rising higher and scattering farther with stronger wind, following a spiral motion influenced by wind speed and vortex adjustments, which is expressed in Eq. (10),
Input: pop, T, Dim, and C.
Output: $X_{best}$ (best position of seed)
• Initialize the positions of the dandelion seeds.
• Compute the fitness function as $\text{Fitness} = \min\bigl(\tfrac{1}{n}\sum_{j=1}^{m}(v-\hat{v})^2\bigr)$
while $t \le T$ do
  for $i = 1$ to pop do
    if randn $< 1.5$ then (rising stage, clear day)
      Compute the TF by Eq. (13)
      if TF $< 0.9$ then
        Update the position by Eq. (15)
      else
        Update the dandelion seeds
      end if
    else (rising stage, rainy day)
      Update the dandelion seeds by multiplying the flux with the position as
      $X_i^{t+1} = X_i^t + \alpha \cdot s_x \cdot s_y \cdot \ln Y \cdot (X_s^t - X_i^t)$
    end if
  end for
  for $i = 1$ to pop do
    Update the descending-stage position
  end for
  for $i = 1$ to pop do
    Update the landing-stage position
  end for
  for $i = 1$ to pop do
    Arrange the seeds based on their fitness values
  end for
  if $f(X_{elite}) < f(X_{best})$ then
    $X_{best} = X_{elite}$
    $f_{best} = f(X_{elite})$ // best position and fitness value
  end if
end while
Descending stage: In this phase, the dandelion population travels to the appropriate location for reproduction, which is reflected by the mean location in the rising stage. The descending phase ($X_i^{t+1}$) is mathematically expressed in Eq. (11),
Landing stage: In this stage, local neighborhood development is carried out to reach the global optimum, where local exploitation is guided by the current elite information. The landing-stage population ($X_i^{t+1}$) is expressed in Eq. (12),
SSO of FLO: Successful optimization algorithms hinge on transitioning between exploration and exploitation effectively. Therefore, the Transfer Function (TF) is computed to smoothly navigate between the exploration and exploitation phases, improving adaptability and performance, as provided in Eq. (13).
This optimization algorithm, inspired by dandelion seed dispersal, initializes a population with random seed positions. Each seed's fitness is evaluated based on an objective function, iteratively refining positions over generations. The algorithm balances exploration and exploitation, inspired by FLO, using a transfer function that adjusts the seeds' exploration and exploitation strategies. During the rising stage, seeds adjust positions influenced by parameters like wind speed and direction, simulating natural dispersal patterns. In the descending stage, seeds move towards optimal locations identified in previous stages, and in the landing stage, positions are further refined for optimal growth conditions. The algorithm continuously updates the best seed position based on fitness, converging towards an optimal solution by the end of the iterations, effectively navigating diverse optimization stages while avoiding local optima.
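A highly simplified skeleton of this loop is sketched below. The stage updates and the transfer function are placeholders standing in for Eqs. (10)–(15), so this is an illustrative sketch rather than the authors' implementation.

```python
import numpy as np

def hdflo(fitness, dim, pop=30, T=100, lb=-1.0, ub=1.0, seed=0):
    """Simplified skeleton of the hybrid dandelion / Fick's-law loop: initialize
    seed positions, switch between exploration- and exploitation-leaning updates
    via a transfer function, and track the best (elite) seed."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(lb, ub, size=(pop, dim))          # initial seed positions
    fit = np.apply_along_axis(fitness, 1, X)
    best = fit.argmin()
    x_best, f_best = X[best].copy(), fit[best]

    for t in range(1, T + 1):
        trans = t / T                                  # placeholder transfer function (Eq. (13))
        for i in range(pop):
            if trans < 0.9:                            # exploration-leaning update (rising stage)
                step = rng.normal(scale=1.0 - t / T, size=dim)
            else:                                      # exploitation-leaning update (descend/land)
                step = rng.normal(scale=0.1, size=dim) * (x_best - X[i])
            X[i] = np.clip(X[i] + step, lb, ub)
        fit = np.apply_along_axis(fitness, 1, X)
        elite = fit.argmin()
        if fit[elite] < f_best:                        # keep the elite seed as the running best
            x_best, f_best = X[elite].copy(), fit[elite]
    return x_best, f_best

# Example: minimize a sphere loss as a stand-in for the GENR loss function
x_opt, f_opt = hdflo(lambda x: float(np.sum(x ** 2)), dim=10)
```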
Thus, the selected features are based on multimodal data where the loss function of GENR is minimized with HDFLO to efficiently extract relevant features. Then these features are taken to classify the sentiments as three dimensions.
3.4. Emotional states classification using MWCNN
The selected features are classified as three types of emotional states such as valence, arousal, and trustworthiness from audio, video, and text. For classification, a formulation is achieved to connect convolution and pooling with multimodal data analysis. The MWCNN classifier integrates CNN and spectral analyses, capturing spatial and frequency domain features. By using wavelet transforms, this model extends traditional CNN architecture, enhancing the feature extraction and classification process.
3.4.1. Classification based on emotional states
A CNN is a variant of a Neural Network (NN) with sparsely connected deep layers. In a conventional NN, each input in one layer is connected to every unit in the next layer; in the proposed classifier, alongside the activation function and fully connected layer, the CNN introduces convolution/pooling layers that connect only the local receptive field around each input.
A fusion of CNNs and spectral analyses is employed to address texture classification challenges. CNNs process textures directly to capture spatial statistics, while spectral analysis transforms textures into frequency domains for scale-invariant features. This unified approach integrates spatial and spectral information within a single model. By extending the traditional CNN architecture with multiresolution analysis through wavelet transforms, the pooling and convolution layers mimic aspects of spectral analysis. This innovative model, termed wavelet CNNs, combines CNN strengths with spectral analysis for comprehensive texture feature extraction. Figure 3 illustrates the configuration of the MWCNN-based classifier.

Fig. 3. The architecture of MWCNN based classifier.
Convolutional layer: For an n-component input vector, a convolutional layer produces an output vector with the same number of components. The output vector ($Y_i$) is expressed as Eq. (15),
CNNs achieve translation invariance in the image space and reduce the parameter count by parameter sharing. The definition of Yi is essentially the result of convolving xj with a filtering kernel wj hence it is referred to as a convolution layer. Consequently, the output (y) is expressed as Eq. (16),
Pooling layer: Pooling layers are commonly employed right after convolution layers to simplify data representation. Although max pooling finds extensive use in CNN applications, average pooling is better suited for feature extraction. Consequently, the focus here lies on average pooling, which offers the advantage of revealing the association with multiresolution analysis. The output of pooling is expressed as Eq. (17),
3.4.2. Generalized convolution and pooling
By combining the output of convolution and average pooling the generalized form is taken as Eq. (18),
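For reference, the standard forms these layers reduce to can be written as follows (generic notation); average pooling is a convolution with a uniform kernel and stride p, which is the kind of generalization Eq. (18) refers to:

$$
\begin{aligned}
\text{convolution:}\quad & y_i = \sum_{j} w_j\,x_{i+j} + b,\\
\text{average pooling:}\quad & y_i = \frac{1}{p}\sum_{j=0}^{p-1} x_{pi+j},\\
\text{generalized form:}\quad & y_i = \sum_{j} w_j\,x_{pi+j}\qquad\bigl(\text{pooling} \Leftrightarrow w_j = \tfrac{1}{p},\ \text{stride } p\bigr).
\end{aligned}
$$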
Implementation: The network structure is modeled after VGG-19, selected for its effectiveness in texture feature extraction, using 3×3 convolutional kernels with 1×1 padding to preserve input size. Convolution layers with increased stride are utilized; incorporating 1×1 padding and a stride of two results in output size reduction, replacing max-pooling without compromising accuracy. To align with the size reduction in multiresolution analysis, decomposed images are integrated with feature maps. An energy layer preceding fully connected layers enhances performance with fewer parameters. The wavelet CNN model comprises nine convolution layers, aligned with decomposition levels, featuring an energy layer followed by three fully connected layers. Input size constraints (32×32) necessitate training images to be scaled to 256×256, then randomly cropped and flipped for diversity, effectively mitigating overfitting.
In contrast, the proposed wavelet CNN incorporates all components, including the high-frequency components, so that no information from the input x is lost, aligning with the principles of multiresolution analysis. The classified output is then compared with the existing methods to evaluate its performance.
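A compact sketch of such a wavelet-CNN-style network is given below, assuming tf.keras; 2×2 average pooling stands in for the Haar low-pass decomposition, global average pooling stands in for the energy layer, and the layer counts are abbreviated relative to the nine-convolution-layer model described above.

```python
from tensorflow.keras import layers, Model

def conv_bn_relu(x, filters, stride=1):
    """3x3 convolution with 'same' padding, batch normalization before the ReLU."""
    x = layers.Conv2D(filters, 3, strides=stride, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def build_wavelet_cnn(input_shape=(256, 256, 3), n_outputs=3):
    inp = layers.Input(shape=input_shape)
    # Multiresolution side branches: 2x2 average pooling approximates the
    # low-frequency band at each decomposition level (an assumption of this sketch).
    lvl1 = layers.AveragePooling2D(2)(inp)
    lvl2 = layers.AveragePooling2D(2)(lvl1)

    x = conv_bn_relu(inp, 64)
    x = conv_bn_relu(x, 64, stride=2)            # strided conv replaces max pooling
    x = layers.Concatenate()([x, conv_bn_relu(lvl1, 64)])
    x = conv_bn_relu(x, 128, stride=2)
    x = layers.Concatenate()([x, conv_bn_relu(lvl2, 128)])
    x = conv_bn_relu(x, 256, stride=2)

    x = layers.GlobalAveragePooling2D()(x)       # "energy layer" analogue
    x = layers.Dense(512, activation="relu")(x)
    x = layers.Dense(512, activation="relu")(x)
    out = layers.Dense(n_outputs, activation="softmax")(x)  # valence / arousal / trustworthiness
    return Model(inp, out)
```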
This research work gathered input from the MuSe-CaR dataset to tackle data imbalance in sentiment analysis across modalities by employing a comprehensive approach. This involves combining DSMOTE with an ECGAN for emotion recognition using audio, text, and visual data. The balanced data is then processed through GENR with an HDFLO to select relevant features for sentiment analysis. Subsequently, a MWCNN is employed to accurately classify emotion states. By integrating these techniques, the research aims to overcome challenges associated with diverse data types and provide a robust framework for MSA in the context of car reviews.
4. Experiment Results
The MuSe-CaR dataset largely maintains content and context without breaking recordings into equally sized segments, which is better than other large datasets. This section also analyzes the Python implementation in terms of three dimensions, namely arousal, valence, and trustworthiness, against some conventional approaches.15,16,17,18,19,20,21,22,23,24 The performance metrics accuracy, recall, and precision are used to compare the proposed feature selection method GENR HDFLO against existing methods.26,27,28,29
4.1. Experimental setup
The experiments utilized Anaconda Navigator Spyder with Python 3.10 on an Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz, 16.0GB RAM, and Windows 10. Following network design, factors impacting performance were analyzed, including epochs, iterations, and learning rate. With the learning rate set at 0.001 and the iterations at 100, experiments ran for 1000 epochs with 10 timesteps. Using a patch size of 32×32 enabled efficient comparison with existing methods. The proposed GENR model, implemented with the TensorFlow library, underwent evaluation based on different features in a multimodal approach. The MuSe-CaR dataset was split into a 60:20:20 ratio for the training, development, and test sets to ensure robust evaluation.
Data preprocessing: Data is gathered from the MuSe-CaR dataset, which includes multimodal data such as audio, video, and text reviews. For initial pre-processing, missing data is handled by removing incomplete entries. Since the annotation for MuSe-CaR is dense (4 frames per second), each video is split into segments with a window size of 200 frames (50s) and a hop size of 100 frames (25s). This segmentation enriches the training samples and facilitates model convergence. A moving average filter is utilized to smooth the input signal of each sliding frame for better intelligibility.
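A minimal sketch of this segmentation and smoothing step is shown below; the moving-average window length is an assumption, as its size is not stated above.

```python
import numpy as np

def segment_annotations(signal, fps=4, window_s=50, hop_s=25):
    """Split a densely annotated sequence (4 labels per second in MuSe-CaR) into
    overlapping segments of 200 frames (50 s) with a hop of 100 frames (25 s)."""
    win, hop = window_s * fps, hop_s * fps        # 200 and 100 frames
    return [signal[i:i + win] for i in range(0, len(signal) - win + 1, hop)]

def moving_average(x, k=5):
    """Simple moving-average smoothing of one segment (window size k assumed)."""
    kernel = np.ones(k) / k
    return np.convolve(x, kernel, mode="same")

# Usage on a synthetic annotation track: 300 s of 4 Hz annotations
track = np.random.rand(1200)
segments = [moving_average(s) for s in segment_annotations(track)]
```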
Model Training: The developed scheme is executed using Python 3.10. The network structure is based on VGG-19 for effective texture feature extraction, using 3×3 convolutional kernels with 1×1 padding. The wavelet CNN model includes nine convolution layers, an energy layer, and three fully connected layers. Training images are scaled to 256×256, then randomly cropped and flipped for diversity.
Key hyperparameters: Learning rate 0.001, 1000 epochs, and Adam optimizer. Batch normalization is used before activation layers, with ReLU as the activation function.
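A hypothetical training configuration mirroring these hyperparameters might look as follows; build_wavelet_cnn, x_train, y_train, and the batch size are assumptions carried over from the earlier sketches.

```python
import tensorflow as tf

# Assumed setup: learning rate 0.001 with the Adam optimizer, 1000 epochs;
# batch size and validation split are illustrative choices, not stated values.
model = build_wavelet_cnn()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
history = model.fit(x_train, y_train, validation_split=0.2,
                    batch_size=32, epochs=1000)
```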
4.2. Dataset description
MuSe-CaR is a comprehensive dataset of multimodal data collected in real-world settings, aimed at understanding emotional engagement in product reviews, particularly automotive ones. It ensures high-quality voice and video recordings, offering valuable social media content. The sample video image of the MuSe-CaR dataset is shown in Fig. 4.

Fig. 4. Sample images from the Videos of the MuSe-CaR dataset.
4.3. Density estimation of the dimensions
Figure 5 presents a comparison of the distributions for the manually labeled annotations of arousal, valence, and trustworthiness following feature selection.

Fig. 5. Density estimation of the dimensions.
While arousal demonstrates a near-perfect Gaussian distribution, valence exhibits a positive skewness, leaning towards the positive end of the spectrum. Trustworthiness, on the other hand, displays a highly peaked distribution with a strong left-skew. The model aims to analyze the phenomenon of feature selection due to the high-density distribution observed across all three dimensions.
4.4. The feature set of each modality in the MuSe-CaR dataset
Table 3 illustrates the feature set of each modality with the corresponding arousal and valence dimension where “A” is the representation of audio, “V” is the representation of video, and “T” is the representation of text.
Features | Modality | Arousal | Valence |
---|---|---|---|
Energy | A | 0.4684 | 0.2633 |
Pitch | A | 0.4596 | 0.2484 |
MFCC | A | 0.4313 | 0.2646 |
Wav2vec | A | 0.4818 | 0.333 |
VGGFace | V | 0.4023 | 0.2241 |
HOG | V | 0.4653 | 0.1059 |
SeNetFace | V | 0.4678 | 0.1543 |
ResNetFace | V | 0.4311 | 0.1344 |
BERT | T | 0.4325 | 0.5624 |
RoBERT | T | 0.4256 | 0.6132 |
ALBERT | T | 0.4356 | 0.5532 |
BOW | T | 0.4567 | 0.5689 |
Word2vec | T | 0.4675 | 0.6732 |
The analysis reveals several key findings: Firstly, in terms of arousal, audio features demonstrate higher effectiveness compared to visual and text features. Secondly, for valence, the text modality exhibits significantly superior performance compared to the other two modalities. Lastly, the visual modality generally yields the least favorable results. The researchers speculate that the dominance of audio features in arousal assessment stems from the reliance on speech-related aspects such as intonation and tone, whereas valence assessment is primarily influenced by verbal content.
4.5. Confusion matrix of three-dimension prediction using classifier
This experiment focused on features from the acoustic, linguistic, and visual modalities. Figure 6 displays the confusion matrix derived from using these features to forecast emotion categories in arousal, valence, and trustworthiness.

Fig. 6. Confusion matrix of prediction using classifier. (a) Valence. (b) Arousal. and (c) Trustworthiness.
When comparing the confusion matrix obtained from a three-dimensional representation with the corresponding fixed-length representation, there is an improvement in the weighting along the diagonal of the confusion matrix. The findings indicate that, overall, valence yields better results than arousal for this task, and trustworthiness also performs better than arousal. Consequently, predicting arousal proves to be more challenging, particularly for the audio modality, compared to the other dimensions.
4.6. Continuous prediction analysis with frequency distribution
With the MuSe-CaR dataset, the sentiment dimensions (valence, arousal, and trustworthiness) are predicted continuously over time. Sentiment analysis includes the emotional component of valence, and the two terms are often used interchangeably.
The process of manually annotating continuous emotions by humans often results in disagreements due to variations in perception and reaction time. To mitigate these issues, a feature selection model is employed. Various selection methods are discussed in existing literature, and this advancement has also sparked new challenges. In Fig. 7, a feature selection method is depicted which utilizes the distribution of proposed features from a minimum of five different ratings. Additionally, the model takes into account the diverse reaction times observed during the process.

Fig. 7. Continuous prediction analysis with frequency distribution. (a) Valence. (b) Arousal. (c) Trustworthiness.
4.7. Comparison of unimodal and multimodal using 6-fold cross validation
Figure 8 denotes the results of a 6-fold cross-validation experiment designed to compare two emotion recognition models: unimodal and multimodal. The unimodal model is trained solely on video frame data, while the multimodal model capitalizes on information from three different modalities (visual, audio, and text). The experiment employed a 6-fold cross-validation technique, where the data is split into six sections. Each section is then used for testing, while the remaining sections are employed for training. This process is repeated six times, ensuring a robust evaluation of both models.

Fig. 8. Comparison of unimodal and multimodal using 6-fold cross validation. (a) Valence. (b) Arousal. (c) Trustworthiness.
The key finding from Fig. 8 is the clear superiority of the multimodal model. It consistently outperforms the unimodal model across all three emotional dimensions: valence, arousal, and trustworthiness. This indicates that incorporating additional data modalities significantly enhances the model’s ability to accurately detect emotions compared to relying solely on video information. Interestingly, the results also suggest that trustworthiness might be a more readily detectable emotion, as it exhibits the highest accuracy across both models.
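The 6-fold protocol described above can be sketched as follows; fit_predict_score is a hypothetical stand-in for training and scoring either the unimodal or the multimodal model.

```python
import numpy as np
from sklearn.model_selection import KFold

def six_fold_scores(X, y, fit_predict_score, seed=0):
    """Each fold is held out once for testing while the remaining five folds are
    used for training; the mean and spread of the fold scores are returned."""
    kf = KFold(n_splits=6, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in kf.split(X):
        scores.append(fit_predict_score(X[train_idx], y[train_idx],
                                        X[test_idx], y[test_idx]))
    return np.mean(scores), np.std(scores)
```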
4.8. Comparison of the proposed MSA system against existing sentiment analysis systems
The sentiment analysis with the MuSe-CaR dataset provides three dimensions (valence, arousal, and trustworthiness) to analyze the system. In this section, sentiment analysis systems such as the Multi-head Attention Network,15 attention-enhanced Recurrent Model,16 MuSe-Toolbox,17 Sentiment Analysis of YouTube Comments,20 LSTM-RNN,21 and temporal convolutional network24 are taken to compare with the proposed MSA system.
Figure 9 shows that the proposed system attains high outcomes across the emotion dimensions of valence, arousal, and trustworthiness. The valence, arousal, and trustworthiness dimensions attain 0.6245, 0.723, and 0.695, respectively, which is up to 20% higher than other existing methods. Thus, recent research is motivated to use multimodal data for sentiment analysis. The effectiveness of the proposed model is due to several key factors. Initially, the system uses DSMOTE combined with ECGAN to address data imbalance, which ensures the generation of high-quality, diverse synthetic samples. Subsequently, feature selection is optimized using GENR HDFLO, which captures relevant features and their interactions, thus improving predictive performance. Finally, the MWCNN classifier, incorporating wavelet transforms and a robust convolutional architecture, accurately classifies emotional states from multimodal data. These advancements contribute to the system's superior performance in valence, arousal, and trustworthiness, achieving up to 20% better results compared to other methods.

Fig. 9. Comparison of proposed MSA system against existing sentiment analysis systems. (a) Valence. (b) Arousal. (c) Trustworthiness.
4.9. Comparison against existing sentiment analysis systems
The classification performance is improved by limiting the large amount of data in the dataset. Thus, the feature selection approach is motivated to choose a set of relevant features. This research work uses the GENR HDFLO based selection method. GENR is applied to multimodal data, such as audio, video, and text by using recursive feature elimination to determine the appropriate number of features which is optimized to reduce loss function using HDFLO.
The existing approaches Principal Component Analysis (PCA),26 Feature Weight-based Decision Tree algorithm (FWDT),27 Deep Convolutional Network (DCN),28 and Attention-based Bidirectional Convolutional Deep Model (ABCDM)29 are taken to compare with the proposed GENR HDFLO based feature selection method in terms of accuracy, precision, and recall. Figure 10 shows that the proposed GENR HDFLO method obtains 0.9965 accuracy, 0.9945 recall, and 0.9966 precision, which are higher than the existing methods. The proposed GENR HDFLO model improves classification across a wide range of dimensions. Its superior performance compared to other methods is due to its comprehensive and innovative approach to handling multimodal data and addressing class imbalance. By using DSMOTE with ECGAN for data augmentation, GENR HDFLO for feature selection, and MWCNN for classification, the model effectively captures and analyzes precise emotional states from diverse inputs, resulting in more accurate and reliable sentiment analysis.

Fig. 10. Comparison of GENR HDFLO based feature selection against existing methods. (a) Accuracy. (b) Recall. (c) Precision.
Table 4 presents a performance comparison between analyses conducted without and with DSMOTE. Without DSMOTE, the model obtained an accuracy of 82.76%, recall of 80.27%, and precision of 80.46%. Introducing DSMOTE significantly improves the performance, resulting in a notable increase in accuracy to 94.76%, recall to 91.62%, and precision to 90.26%, respectively. This demonstrates a substantial percentage increase in model performance across all metrics with the incorporation of DSMOTE.
Performance metrics | Analysis without DSMOTE | Analysis with DSMOTE |
---|---|---|
Accuracy | 0.8276 | 0.9476 |
Recall | 0.8027 | 0.9162 |
Precision | 0.8046 | 0.9026 |
4.10. Comparison of feature selection approaches
This section compares the proposed feature selection approach with existing models such as RAAW,15 MAN,17 and ensemble models.19 Table 5 compares the different feature selection methods based on their performance metrics. RAAW achieved an accuracy of 82.76%, recall of 80.28%, and precision of 81.74%. MAN demonstrated improved performance with 84.75%, 82.87%, and 82.54% in accuracy, recall, and precision, respectively. Ensemble models yielded lower metrics at 74.62%, 72.94%, and 71.63%.
Methods | Accuracy | Recall | Precision |
---|---|---|---|
RAAW15 | 0.8276 | 0.8028 | 0.8174 |
MAN17 | 0.8475 | 0.8287 | 0.8254 |
Ensemble models19 | 0.7462 | 0.7294 | 0.7163 |
GENR HDFLO (proposed) | 0.9476 | 0.9162 | 0.9026 |
The proposed method, GENR HDFLO, outperformed all, achieving a significantly higher accuracy of 94.76%, recall of 91.62%, and precision of 90.26%. GENR introduces a sophisticated feature selection mechanism, refining traditional ENR. This approach overcomes sparsity limitations, allowing for a more precise and adaptive model. By incorporating granular selection, GENR enhances the regularization process, leading to improved accuracy and robustness, particularly beneficial in sentiment analysis tasks with complex textual data.
Table 6 presents the performance analysis of the developed GENR HDFLO model based on different learning rates. As the learning rate decreases from 0.1 to 0.00001, there are changes in the model’s performance metrics. Table 6 demonstrates that the developed GENR HDFLO scheme achieved the optimal performance with a learning rate of 0.001, showing higher accuracy, recall, and precision compared to other learning rates. This finding suggests that a learning rate of 0.001 optimally balances the model’s ability to learn from the data, leading to improved overall performance in terms of classification accuracy and precision.
Model | Learning rate | Accuracy | Recall | Precision |
---|---|---|---|---|
GENR HDFLO (proposed) | 0.1 | 0.8912 | 0.8573 | 0.8421 |
0.01 | 0.9045 | 0.8673 | 0.8592 | |
0.001 | 0.9476 | 0.9162 | 0.9026 | |
0.0001 | 0.9228 | 0.8841 | 0.8721 | |
0.00001 | 0.9389 | 0.8943 | 0.8842 |
The developed model is evaluated across various patch sizes to identify the optimal setting, with 32×32 emerging as the most effective size. Figure 11 presents the comparative results achieved with different patch sizes, highlighting the performance variations observed across the range of sizes tested. This analysis underscores the significance of patch size selection in influencing model outcomes and showcases the superiority of the 32×32 patch size based on the experimental findings.

Fig. 11. Classification results for various patch sizes.
The accuracy graph depicted in Fig. 12 provides valuable insight into the learning progress of a model over training epochs. This visualization illustrates the model’s accuracy improvement with each iteration during training. Ideally, the graph shows a steady increase in accuracy as the model learns from the training data.

Fig. 12. Accuracy vs epochs.
The proposed GENR HDFLO model demonstrates superior performance compared to existing models, as shown by the metrics in Table 7. It achieves an exceptionally low Mean Absolute Error (MAE) of 0.451, which is significantly lower than the MAE of HyCon (0.71) and AdaMoW (0.69), indicating high precision in predictions. The model also boasts an impressive F1-score of 94.21%, surpassing all other models and reflecting an excellent balance between precision and recall. This high F1-score highlights the model’s effectiveness in accurately identifying positive instances. Additionally, the model’s accuracy of 94.76% is notably high, demonstrating robust performance. This combination of a low error rate, high F1-score, and substantial accuracy underscores the GENR HDFLO model’s advanced capability in handling multimodal sentiment analysis, offering a more reliable and effective understanding of emotions in the data.
Techniques | MAE | F1-score (%) | Accuracy (%) |
---|---|---|---|
HyCon37 | 0.71 | 85.1 | 85.2 |
AdaMoW38 | 0.69 | 86.57 | 86.57 |
GENR HDFLO [proposed] | 0.451 | 99.42 | 99.65 |
Figure 13 compares the performance of the proposed GENR HDFLO model against other models (ABCDM, DCN, FWDT, PCA) based on different missing rates. In Fig. 13(a), the accuracy of each model decreases as the missing rate increases, but the GENR HDFLO model consistently achieves higher accuracy across all missing rates, demonstrating its robustness in handling incomplete data. In Fig. 13(b), the MAE increases with higher missing rates for all models. However, the GENR HDFLO model maintains a lower MAE compared to others, indicating its superior capability in minimizing prediction errors even with substantial data loss. These results underscore the effectiveness and reliability of the GENR HDFLO model in managing and analyzing multimodal data with missing values. The model excels in this area due to several key factors: it employs DSMOTE and ECGAN for effective data augmentation, which ensures a balanced dataset and minimizes issues such as noisy sample creation. Additionally, GENR HDFLO optimizes feature selection by considering the interdependencies among features from different modalities. MWCNN further enhances emotion classification by integrating spatial and spectral information, allowing it to capture complex data patterns more effectively. Together, these techniques result in higher accuracy and lower error rates compared to existing methods.

Fig. 13. Performance comparison of proposed model based on missing rates (a) accuracy (b) MAE.
4.11. Discussions
The proposed framework for MSA demonstrates significant advantages in effectively integrating audio, visual, and text modalities to understand emotions across diverse data sources. It exhibits superior performance compared to existing methods, achieving higher accuracy, recall, and precision in emotion classification tasks, particularly with the use of GENR HDFLO-based feature selection.
To demonstrate the novelty of the proposed work, this research conducted a comparative analysis with the existing models.35,36 Table 8 presents a detailed comparison highlighting the advancements of the proposed model over the existing ones.
Reference | Technique | Audio | Video | Text | Accuracy (%) |
---|---|---|---|---|---|
35 | SVM | × | × | ✓ | 91 |
36 | BiGRU | × | × | ✓ | 91 |
— | GENR HDFLO | ✓ | ✓ | ✓ | 94.76 |
The GENR HDFLO model demonstrates advanced capabilities by integrating audio, video, and text modalities to achieve a high accuracy of 94.76%, surpassing existing SVM35 and BiGRU36 classifiers. Unlike SVM and BiGRU, which process only textual data, GENR HDFLO harnesses rich information from multiple modalities, enhancing its sentiment analysis capabilities. Existing models like SVM and BiGRU are limited by their reliance on single-modal data, which may hinder their ability to capture essential sentiment across different modalities. Additionally, single-modal classifiers may struggle with the inherent complexity and variability present in multimodal data, potentially resulting in lower accuracy compared to the proposed approach GENR HDFLO that integrates and interprets multiple modalities simultaneously. Therefore, the proposed model represents a significant advancement in MSA, addressing the limitations of existing single-modal approaches.
The proposed method, GENR HDFLO, offers several distinct advantages over existing methods in handling multimodal data with missing values. Firstly, it integrates DSMOTE and ECGAN for data augmentation, which effectively addresses class imbalance and ensures a more robust dataset without introducing noisy samples or overlapping classes. This approach leads to improved generalization and accuracy in classification tasks compared to traditional augmentation methods. Secondly, the use of GENR HDFLO for feature selection enhances the model’s ability to identify and incorporate the most relevant features from different modalities. This not only improves prediction accuracy but also ensures a more comprehensive representation of complex data relationships. By considering interdependencies among features across modalities, the model captures different patterns that may be missed by methods focusing solely on individual modalities.
However, the framework may suffer from high computational complexity, especially during feature selection and optimization stages. Future research will focus on optimizing model parameters to strike a better balance between performance and complexity, enhancing the framework’s scalability and efficiency. Additionally, efforts are needed to enhance the model’s robustness in handling missing modal information, ensuring its effectiveness in practical applications, and improving the interpretability of inter-modal interactions for better user understanding.
5. Conclusion
In this research, the MuSe-CaR dataset is introduced and collected from user-generated audio-visual and transcript recordings, which present imbalanced classes for testing and training. To address this, a data augmentation approach is employed to balance the three classes of multimodal data emotions. Various multimodal features were deeply represented with GENR, which is optimized with HDFLO to minimize the loss function and select relevant features. These features are then utilized to classify emotions as valence, arousal, and trustworthiness using MWCNN, achieving high performance compared to existing methods with 0.6245 valence, 0.723 arousal, and 0.695 trustworthiness on MuSe-CaR. Notably, the feature selection method achieved 99.65% accuracy, 99.45% recall, and 99.66% precision, outperforming existing deep learning and machine learning classifiers. Despite these achievements, future research efforts should prioritize (1) reducing model parameters while maintaining performance, (2) decreasing computational complexity, and (3) mitigating redundant information during model training. However, a notable challenge lies in enhancing the model's robustness and accuracy in handling missing modal information, particularly in real-world scenarios. Addressing these challenges will expand the model's applicability and reliability in practical settings involving MSA.
Acknowledgement
None
Data Availability Statements
Data sharing not applicable to this article as no datasets were generated or analysed during the current study.
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.