
Optimized Feature Selection Approach with Elicit Conditional Generative Adversarial Network Based Class Balancing Approach for Multimodal Sentiment Analysis in Car Reviews

    https://doi.org/10.1142/S0218488525500035

    Abstract

    Multimodal Sentiment Analysis (MSA) is a growing area of emotional computing that involves analyzing data from three different modalities. Gathering data from Multimodal Sentiment Analysis in Car Reviews (MuSe-CaR) is challenging due to data imbalance across modalities. To address this, an effective data augmentation approach is proposed by combining dynamic synthetic minority oversampling with a multimodal elicit conditional generative adversarial network for emotion recognition using audio, text, and visual data. The balanced data is then fed into a granular elastic-net regression with a hybrid feature selection method based on Dandelion Fick's law optimization to analyze sentiments. The selected features are input into a multilabel wavelet convolutional neural network to classify emotion states accurately. The proposed approach, implemented in Python, outperforms existing methods in terms of trustworthiness (0.695), arousal (0.723), and valence (0.6245) on the car review dataset. Additionally, the feature selection method achieves high accuracy (99.65%), recall (99.45%), and precision (99.66%). This demonstrates the effectiveness of the proposed MSA approach, even with three modalities of data.

    1. Introduction

    With the increasing wealth of information available on the internet, individuals can enhance their product choices and lifestyles based on textual details.1 Businesses, recognizing the importance of understanding customer opinions, employ sentiment analysis through social media monitoring, brand tracking, and emotions found in emails and comments.2 Sentiment analysis helps to deduce the emotional tone underlying text to comprehend user opinions.2 However, analyzing sentiment in large datasets, such as user reviews, poses challenges due to variations in sequence length and textual order. Deep learning approaches are introduced in text classification and question answering to address these challenges.3,4

    Machine learning and deep learning techniques were applied to MSA, even in domains such as analyzing sentiments from coronavirus-related tweets.5 Innovative approaches, like Massive Open Online Courses (MOOC), have utilized sentiment classification methods organized by ensemble learning paradigms.6 The integration of deep learning approaches serves as a robust baseline model for feature extraction in sentiment analysis.7,39,40 Various sentiment analysis methods use lexicon-based approaches to identify the orientation of text documents, while machine learning-based methods leverage labeled datasets to train models.8,9

    In the context of MSA, polarity detection is often considered the most relevant information.10 Classification tasks such as subjective classification, opinion extraction, word sentiment classification, and document sentiment classification are employed to identify classes in datasets.11 The domain of sentiment analysis has witnessed growing interest in applications such as visual summaries and trend analysis.12 Furthermore, hyperparameter optimization strategies play a crucial role in minimizing generalization errors in machine learning approaches.13,14

    Unlike single-modality sentiment analysis,33,34 MSA poses several challenges and opportunities. Early approaches relied on handcrafted features, limiting sentiment abstraction and yielding suboptimal results. Explaining multimodal models is complex due to the need to relate model performance to diverse input data. The heterogeneity and dimensionality of human behaviors hinder feature interpretation and understanding of model decisions. Additionally, research on compact, human-friendly data summaries is scarce. Interpreting inter-modal interactions is non-trivial, despite their importance. For instance, discerning positive sentiment from neutral voice and facial cues remains challenging, highlighting the unique nature of MSA. Therefore, a new MSA approach is developed to differentiate emotions using different modalities such as audio, video, and text.

    Motivation

    In the digital age, user-generated content on online platforms has created a rich repository of multimodal data, including text, audio, and video. This diversity presents both an opportunity and a challenge for sentiment analysis, especially in understanding and interpreting human emotions comprehensively. MSA emerges as a critical field of emotional computing research, leveraging varied data to gain deeper insights into user sentiments. MSA integrates and analyzes multiple data modalities, providing an accurate understanding of sentiments compared to unimodal approaches. For instance, text alone might convey a different emotional tone when not considered alongside corresponding vocal intonations or facial expressions present in audio and video data. By combining these modalities, MSA offers a holistic view of sentiments, leading to more robust and reliable analysis outcomes. The application of MSA is particularly impactful in car review analysis, where user feedback encompasses various forms of expression. Car reviews often include detailed text descriptions, vocal commentary, and visual demonstrations, making them an ideal candidate for multimodal analysis. Understanding the sentiments behind these reviews provides automotive companies with valuable insights into customer preferences, and overall satisfaction. This aids in product improvement, marketing strategies, and customer engagement, ultimately enhancing the consumer experience and driving business success.

    Contributions

    Social media information is a valuable resource for sentiment analysis applications, aiding in the identification of user emotions to solve various problems. This work contributes to the field by emphasizing texture classification through spectral analysis.

    • This research work utilizes the Granular Elastic-Net Regression (GENR) model to enhance texture classification and improve the accuracy and robustness of MSA, especially in scenarios with intricate textual patterns.

    • The proposed Multilabel Wavelet Convolutional Neural Network (MWCNN) offers a comprehensive analysis by considering both low and high-frequency components in textual data, improving sensitivity to a wider range of features compared to traditional CNNs.

    • Integration of Dynamic Synthetic Minority Over-sampling TEchnique (DSMOTE) provides a sophisticated mechanism to address biases in dataset sampling, ensuring a more representative dataset compared to traditional methods.

    • Also, GENR is introduced to refine the regularization process and improve model adaptability, overcoming sparsity limitations inherent in Elastic Net Regression (ENR) for more precise sentiment analysis tasks.

    This research work is organized as follows: An overview of several existing sentiment analyses with the MuSe-CaR dataset is provided in Section 2. The developed model is briefly explained in Section 3 with the steps of data acquisition, balancing class samples, feature selection, and classification. Section 4 gives a detailed explanation of the experimental result. Finally, the whole research work is concluded in Section 5.

    2. Literature Survey

    Exploring emotion recognition and sentiment analysis through multimodal data, the literature survey covers diverse methodologies, ranging from extensive annotation and fusion techniques to unique feature extraction approaches. Various research works based on MSA have previously been carried out; some of them are reviewed and listed in Table 1.

    Table 1. Comparison of existing methods.

    Authors | Methods | Pros | Cons
    Stappen et al.15 | Multi-head Attention Network (MAN) | Continuous prediction capability for multitasking | Limited in multimodal data analysis
    Sun et al.16 | Self-attention based Long Short-Term Memory (LSTM) with Recurrent Neural Network (RNN) | Performance is improved by continuous prediction | Limited exploration of multitask learning and advanced fusion models
    Stappen et al.17 | Rater Aligned Annotation Weighting (RAAW) | Offered a cohesive approach to developing regression | Limited to reduced dimensions
    Baird et al.18 | Lexical knowledge-based extraction approach | Low computational power and improved the linguistic baseline | Did not combine high-level features from different modalities using unsupervised methods
    Vlasenko et al.19 | Ensemble based classifier | Classification performance was improved by Natural Language Processing based approaches | Did not provide an improved method combining three modalities
    Schuller et al.20 | Support Vector Regression (SVR) | Consistently more robust prediction for video likes | Does not accurately predict hate likes and fake news
    Cambria et al.21 | LSTM-RNN | High transparency with a large set of features | Limited to combining modalities at an earlier stage
    Jiang et al.22 | Hybrid temporal modal | Best prediction results achieved by multimodal features | Did not obtain better results with combined arousal and valence
    Yadav et al.23 | An effective rater ensemble model | Takes all information about human emotions without losing any subjective information | More advanced architectures are needed to improve the performance (Accuracy: 77.89%) of the model
    Padminivalli et al.24 | Spatial-temporal deep neural network | Overfitting is avoided by introducing cross-validation during testing and training | Struggled with recognizing neutral emotions (98.3%)
    Liu et al.30 | Modality Translation-based MSA model (MTMSA) | Minimizes the complexity of the MSA model | Accuracy of the suggested model sharply decreases as the missing rate increases (67.29% to 47.5%)
    Chandrasekaran et al.31 | Long short-term memory eXtreme Gradient Boosting (LXGB) | Excels in MSA, accurately classifying emotions across diverse data types | Struggled with predicting neutral emotions (92.9%)
    Wang et al.32 | Cross-Correlation in Dual-Attention (CCDA) | Better performance in MSA | Increased computational complexity

    Stappen et al.15 have presented an extensive annotation process covering all the emotions and selecting videos with changing shots and dynamic backgrounds. The intensity class of trustworthiness is predicted based on audio and visual behaviors. Sun et al.16 have presented a human emotion/sentiment analysis with videos, audio, and text to automate sentiment analysis in many areas. The relevant features are extracted using low-level and hybrid deep learning approaches to explore robust feature extraction. The model performance is further enhanced by introducing a fusion technique for arousal and valence.

    Stappen et al.17 have presented an annotation toolkit to relate the various types of fusion techniques in continuous annotation. The relevant configuration parameters were pointed out in a systematic manner to improve the capability of the toolkit. Another MSA model was developed by Baird et al.18 for exploring sub-symbolic representations from sentic information in the emotional information provided by videos. It also demonstrated the usefulness of high-level features from text and audio, with which better learning to predict valence and arousal was obtained.

    Automatic emotion recognition is presented by Vlasenko et al.19 with turn-level prediction of emotions using the valence and arousal dimensions. Different approaches were investigated to fuse the text and audio features, and various deep learning structures were explored for cross-dependencies. Schuller et al.20 have presented a first approach to feature extraction and selection without using audio, video, and text; there is no YouTube video prediction against time-series features. Different trade-offs between interpretability and accuracy were chosen for the prediction method.

    Cambria et al.21 have presented a challenging sentiment analysis approach with a large set of features using open-source software. It then provided a detailed description of multimodal feature extraction with applied preprocessing and alignment for the baseline modelling. The best prediction outcomes were obtained using multimodal features. Jiang et al.22 have explained the mapping of continuous dimensional emotions to discrete classes. It predicted emotion classes based on audio and video of user reviews. For the multimodal features, feature fusion was the best for segmenting data.

    Yadav et al.23 have presented an emotion recognition approach with audio and video data. It also explored the fusion of ensemble predictions with several techniques. The main motive of this model is to preserve the whole information from raw annotation to predicted emotions. Padminivalli et al.24 have presented an audio-visual based block that is analyzed with a temporal convolutional network, along with a block leader-followers attentive fusion block; after that, cross-modality fusion is applied to obtain a noise-removed network.

    Liu et al.30 suggested an MTMSA robust to uncertain missing modalities. The model translates visual and audio data to text and fuses them into Missing Joint Features (MJFs). A transformer encoder, supervised by a pre-trained model, encodes MJFs to approximate complete modalities. The transformer decoder learns intermodal dependencies, facilitating sentiment classification. However, the suggested model’s accuracy sharply decreases with increased missing rates, impacting intermodal feature projection and affecting visual and auditory feature space projection onto text.

    Chandrasekaran et al.31 utilized the hybrid LXGB technique, which combined LSTM and eXtreme Gradient Boosting (XGBoost) classifiers for MSA. The suggested model addressed emotional understanding across image, textual, and audio data, showcasing effectiveness in capturing different sentiments.

    Wang et al.32 developed the CCDA model and utilized dual attention mechanisms to obtain inter and intramodal dynamics efficiently. The model incorporates a cross-correlation loss to acquire attention correlation and utilizes relevant coefficients for effective feature integration. However, these computations, relying on matrix multiplication, increase model complexity and involve redundant information.

    The disadvantages in existing methods are considered to improve the proposed system by introducing separate hybrid methods for data balancing and feature selection. The developed scheme is aimed at attaining a better outcome with the dimension of arousal, valence, and trustworthiness.

    Problem statement

    Many approaches have been proposed for MSA. However, the existing approaches are constrained by a lack of modality combination, which may result in low prediction performance. Also, existing methods do not provide outcomes for all three emotional dimensions, namely arousal, valence, and trustworthiness. Since most sentiment analyses are data-driven, the capability of machine learning methods is limited by the training data. Looking at existing MSA datasets, it appears that most people express positive opinions, and negative opinions are comparatively scarce.16,17,18,19,20,21,22,23,24

    3. Proposed Methodology for MSA

    For the users generating videos, audio, and text on online websites, it is important to understand human emotions and sentiments to analyze the reviews of a certain product. Figure 1 shows the workflow of this research work:

    • MSA is a rapidly growing field of emotional computing research where reviews are gathered from the MuSe-CaR dataset.

    • These reviews come in three different modalities (audio, video, and text), which poses challenges due to imbalanced data across modalities; this is addressed by an effective data augmentation approach that combines DSMOTE with a multimodal Elicit Conditional Generative Adversarial Network (ECGAN).

    • Then the balanced data is given to the GENR with a Hybrid Dandelion Fick's Law Optimization (HDFLO) based feature selection method.

    • The selected features are given to MWCNN for accurate classification of sentiment states as valence, arousal, and trustworthiness.

    Fig. 1. The workflow of the proposed system.

    3.1. Multimodal data collection

    Data is sourced from MuSe-CaR, a large car review dataset with multimodal data, featuring videos with varied face angles, backgrounds, and occlusions, audio with high noise, and text containing colloquialisms. Given these challenges, sentiment analysis faces imperfect classification accuracy. To address this, an imbalanced data augmentation approach is applied to achieve balanced datasets.

    3.2. Data augmentation method for imbalanced data using combined DSMOTE with ECGAN

    The proposed MSA system aims to categorize the data as three classes but the dataset is affected by an imbalanced class distribution problem which affects the performance of the classifier. ECGAN is combined with DSMOTE to get efficient balanced classes at the classifier without blind oversampling. Moreover, the existing models lack control over the locations where synthetic samples are produced in the data space, leading to two key issues: (1) noisy minority samples create more noisy samples during oversampling, and (2) noisy samples within the majority class area cause under-fitting due to class overlap. The proposed data augmentation method has different phases. Initially, data are separated by class labels. Next, minority class samples are categorized into border and noisy types to oversample borderline samples while ignoring noisy ones, avoiding misclassification. Finally, range-controlled oversampling prevents minority samples from being generated within majority class regions.

    Initially, the border samples are oversampled to balance the dataset by generating new synthetic samples. New synthetic samples ($X_{new}$) are generated along randomly selected line segments between a minority sample ($\hat{x}_i$) and its neighbors ($x_i$). The generated samples are expressed as Eq. (1),

    $X_{new} = x_i + r(\hat{x}_i - x_i)$ (1)
    where $r$ is a random number in $[0,1]$. These samples are fed into ECGAN, which generates new samples with a distribution matching that of the dataset. ECGAN proves to be more suitable for multimodal translation tasks. It incorporates an additional input ($X_{new}$) from DSMOTE, sampled from the true data distributions based on the minimum amount of data, from both the minority and majority classes. The learned distributions are then used to generate synthetic images. The ECGAN architecture contains two networks: a generator $G$ and a discriminator $D$, represented by differentiable functions. The discriminator $D$ is trained by learning from the generator $G$. The generator network takes a noise vector $z$ from a latent space and maps it through a differentiable function to fake image, audio, and text data, denoted as $G(z) \rightarrow X$. Simultaneously, the discriminator network distinguishes between raw data (labeled 1) and data artificially generated by the generator network (labeled 0), expressed as $D(X_{new}) \in [0,1]$. The objective function of the ECGANs for the three data types (video $V(D,G)$, audio $A(D,G)$, and text $T(D,G)$) is defined as Eq. (2),
    $\min_G \max_D V(D,G) = \mathbb{E}_{x \sim p_r(x)}[\log(D(X_{new}))] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$
    $\min_G \max_D A(D,G) = \mathbb{E}_{x \sim p_r(x)}[\log(D(X_{new}))] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$
    $\min_G \max_D T(D,G) = \mathbb{E}_{x \sim p_r(x)}[\log(D(X_{new}))] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$ (2)
    where the random noise vector and the real data are represented as $z$ and $x$, the real data distribution and the noise prior are denoted as $p_r(x)$ and $p_z(z)$, $\mathbb{E}$ represents the expectation, $X_{new}$ is the output of DSMOTE, and $G$ and $D$ represent the generator and discriminator. The generator's goal is to minimize $\log(1 - D(G(z)))$ and the discriminator aims to maximize $\log(D(X_{new}))$. The balanced dataset of audio, video, and text-based reviews with three classes is then passed on for relevant feature selection.
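    A minimal sketch of the borderline oversampling step of Eq. (1), assuming NumPy feature arrays for the minority and majority classes and a k-nearest-neighbour test to separate border from noisy minority samples (the exact border/noise criterion and the subsequent ECGAN stage are assumptions for illustration, not the authors' implementation):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def oversample_borderline(X_min, X_maj, k=5, n_new=100, seed=0):
    """Generate synthetic minority samples along segments between a borderline
    minority sample and a random minority neighbour (Eq. (1)). Noisy minority
    samples (all k neighbours in the majority class) are skipped."""
    rng = np.random.default_rng(seed)
    X_all = np.vstack([X_min, X_maj])
    y_all = np.hstack([np.ones(len(X_min)), np.zeros(len(X_maj))])

    # Label each minority sample from its k nearest neighbours in the full data.
    nn_all = NearestNeighbors(n_neighbors=k + 1).fit(X_all)
    _, idx = nn_all.kneighbors(X_min)
    maj_ratio = (y_all[idx[:, 1:]] == 0).mean(axis=1)
    # Border: mostly-majority neighbourhood; noisy (ratio == 1) is skipped.
    border = X_min[(maj_ratio >= 0.5) & (maj_ratio < 1.0)]
    if len(border) == 0:
        return np.empty((0, X_min.shape[1]))

    # Minority-only neighbours supply the interpolation partners.
    nn_min = NearestNeighbors(n_neighbors=min(k, len(X_min) - 1) + 1).fit(X_min)
    _, idx_min = nn_min.kneighbors(border)

    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(border))
        x_i = border[i]
        x_hat = X_min[rng.choice(idx_min[i, 1:])]     # random minority neighbour of x_i
        r = rng.random()                              # r in [0, 1]
        synthetic.append(x_i + r * (x_hat - x_i))     # X_new = x_i + r * (x_hat - x_i)
    return np.asarray(synthetic)

    In the proposed pipeline, the returned samples would then be fed to the ECGAN generator/discriminator pair together with the real data to produce the final balanced set.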

    3.3. Feature selection using GENR HDFLO

    GENR is also applied to multimodal data, such as audio, video, and text by using recursive feature elimination to determine the appropriate number of features. In this context, each modality represents a different set of features, and the goal is to classify the most relevant features within each modality as well as potential interactions between modalities. The block diagram of the proposed feature selection (GENR HDFLO) method is shown in Fig. 2.

    Fig. 2. The block diagram of GENR HDFLO.

    The GENR HDFLO technique enhances feature selection by identifying clusters of correlated features rather than individual ones, improving predictive performance and understanding of data relationships. Each modality’s features are treated as separate clusters, applying GENR within them to account for interdependencies. The method integrates L1 and L2 regularizations (lasso and ridge) into an ENR framework, balancing sparsity and coefficient magnitude. The HDFLO algorithm combines Fick’s Law Optimization (FLO) and Dandelion Optimization (DO) to balance exploration and exploitation during optimization. DO simulates dandelion seed dispersion stages, while FLO’s Steady-State Operator (SSO) ensures efficient navigation through optimization stages.

    Instead of simply selecting individual features, it identifies and selects groups or clusters of features that exhibit similar behavior or have high correlations. This leads to improved predictive performance and a better understanding of the underlying relationships in the data. GENR is applied to each modality separately by treating the features from each modality as a separate cluster and performing ENR within that cluster. This step helps to identify the most relevant features within each modality while considering their interdependencies.

    L1 and L2 Regularization: L2 regularization adds a regularization parameter $\beta$ multiplied by the sum of the squares of the weights to the loss function $LF$; this method is known as Tikhonov regularization. L1 regularization instead replaces the squared weights with the absolute values of the weights. The mathematical representation is formulated in Eq. (3),

    $LF = \frac{1}{n}\sum_{j=1}^{n}(v - \hat{v})^2$
    $L1 = LF + \beta\sum_{j=1}^{n}|u_j| = \frac{1}{n}\sum_{j=1}^{n}(v - \hat{v})^2 + \beta\sum_{j=1}^{n}|u_j|$
    $L2 = LF + \beta\sum_{j=1}^{n}u_j^2 = \frac{1}{n}\sum_{j=1}^{n}(v - \hat{v})^2 + \beta\sum_{j=1}^{n}u_j^2$ (3)
    where $\beta$ represents the strength of regularization, $n$ denotes the total number of rows, $u$ represents the weights, and $j$ indexes the terms of the sum. $v$ represents the ground-truth values used during the training phase of the regression model, while $\hat{v}$ represents the predicted values generated by the model. The variables $n, m$ denote the total number of rows and columns, respectively, in the training dataset. L2 regularization is used in ridge regression, whereas L1 regularization is used in lasso regression. Because it incorporates both L2 and L1 regularization in the same expression, ENR serves as a bridge between lasso and ridge. The ENR formulation is given as follows.

    Ridge Regression: Ridge regression incorporates L2 regularization along with the Mean Squared Error (MSE) loss function. Its cost function $G(\delta)$ is depicted in Eq. (4),

    $G(\delta) = \frac{1}{n}\sum_{j=1}^{n}(v - \hat{v})^2 + \beta\sum_{i=1}^{m}u_i^2$ (4)
    where $\frac{1}{n}\sum_{j=1}^{n}(v - \hat{v})^2$ denotes the MSE of the loss function and $\beta\sum_{i=1}^{m}u_i^2$ denotes the penalty (L2 regularization). Gradient descent is used by all regression algorithms to determine the optimal or minimum weights and biases. This is accomplished by first calculating the partial derivatives of this cost function with respect to the bias and weights.

    Lasso Regression: Lasso regression utilizes L1 regularization, follows similar processes to ridge regression, and employs the same MSE loss function. The cost function $G(\delta)$ in lasso regression is represented by Eq. (5),

    $G(\delta) = \frac{1}{n}\sum_{j=1}^{n}(v - \hat{v})^2 + \beta\sum_{i=1}^{m}|u_i|$ (5)
    where $\beta\sum_{i=1}^{m}|u_i|$ is the L1 regularization and $\frac{1}{n}\sum_{j=1}^{n}(v - \hat{v})^2$ denotes the MSE of the loss function.

    ENR: In this network, the cost function combines the MSE loss function with both L2 and L1 regularization, expressed in Eq. (6),

    $G(\delta) = \frac{1}{n}\sum_{j=1}^{n}(v - \hat{v})^2 + s\beta\sum_{i=1}^{m}|u_i| + \frac{1-s}{2}\beta\sum_{i=1}^{m}u_i^2$ (6)
    where $s\beta\sum_{i=1}^{m}|u_i|$ is the L1 regularization, $\frac{1-s}{2}\beta\sum_{i=1}^{m}u_i^2$ is the L2 regularization, and $\frac{1}{n}\sum_{j=1}^{n}(v - \hat{v})^2$ represents the MSE loss function. The hyperparameter $s$, the L1 ratio, controls the behavior of elastic net regression, which combines aspects of both lasso and ridge regression. When $s = 0.5$, the elastic net weighs the ridge and lasso penalties equally. As $s$ decreases towards zero the elastic net approximates ridge regression, becoming pure ridge at $s = 0$; as $s$ increases towards one it behaves like lasso, becoming pure lasso regression at $s = 1$.25 Consider a multimodal dataset with three modalities: audio (A), video (V), and text (T). Each modality has its own set of features, denoted by $A = [a_1, a_2, \ldots, a_i]$, $V = [v_1, v_2, \ldots, v_i]$, and $T = [t_1, t_2, \ldots, t_i]$, respectively. The GENR objective $G$ is formulated as Eq. (7),
    $G = \text{Loss function} + \lambda_1\big(\sum_i|\beta_{a_i}| + \sum_i|\beta_{v_i}| + \sum_i|\beta_{t_i}|\big) + \lambda_2\big(\sum_i\beta_{a_i}^2 + \sum_i\beta_{v_i}^2 + \sum_i\beta_{t_i}^2\big)$ (7)
    where the loss function represents the specific regression loss used for feature selection (e.g., mean squared error). $\beta_{a_i}$, $\beta_{v_i}$, and $\beta_{t_i}$ are the regression coefficients associated with the features from the audio, video, and text modalities, respectively. $\lambda_1$ is the regularization parameter that controls the L1 (lasso) penalty term and encourages sparsity by pushing coefficients to zero. $\lambda_2$ is the regularization parameter that controls the L2 (ridge) penalty term and helps control the overall magnitude of the coefficients. The loss function is minimized by the HDFLO model.
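    A minimal sketch of the Eq. (7) objective, assuming NumPy feature blocks and coefficient vectors grouped by modality (names and shapes are illustrative; in the proposed scheme this is the quantity that HDFLO minimizes):

import numpy as np

def genr_cost(coeffs, X_blocks, y, lam1=0.01, lam2=0.01):
    """Eq. (7)-style cost: regression loss (MSE here) plus L1 and L2 penalties
    accumulated over the audio, video, and text coefficient groups.
    X_blocks[m] has shape (n, d_m) and coeffs[m] has shape (d_m,)."""
    y_hat = sum(X_blocks[m] @ coeffs[m] for m in X_blocks)   # combined multimodal prediction
    mse = np.mean((y - y_hat) ** 2)                          # loss function
    l1 = sum(np.abs(coeffs[m]).sum() for m in coeffs)        # lambda_1 (lasso) term
    l2 = sum((coeffs[m] ** 2).sum() for m in coeffs)         # lambda_2 (ridge) term
    return mse + lam1 * l1 + lam2 * l2

    A per-modality elastic net (for example, sklearn.linear_model.ElasticNet with its l1_ratio parameter) is a reasonable stand-in for the inner regression when experimenting with this grouping.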

    HDFLO: The algorithm integrates FLO and DO, combining their update rules to optimize exploration and exploitation. DO mimics dandelion seed flight stages such as rising in spirals or drifting locally, adjusting globally, and landing randomly to grow. Meanwhile, FLO includes an SSO ensuring a balance between exploration and exploitation. By incorporating FLO’s SSO into DO, the algorithm achieves enhanced capability in navigating diverse optimization stages effectively, avoiding local optima stagnation, and ensuring robust exploration-exploitation trade-offs. A detailed explanation of the mathematical model behind this new meta-heuristic algorithm, which draws inspiration from the optimal reproduction location of dandelion seeds as they mature, is provided. In the initialization phase DO randomly generates a candidate solution which is given in Eq. (8),

    $X_{i,j} = rand \times (ub_j - lb_j) + lb_j, \quad i = 1, 2, \ldots, pop; \; j = 1, 2, \ldots, Dim$ (8)
    where $X_{i,j}$ represents the candidate solution, $rand$ is a random number, the population size and the dimensionality of the variables are represented as $pop$ and $Dim$, and $ub$ and $lb$ are the upper and lower bounds of the seed position.

    The fitness value $F(X_i)$ of the $i$-th seed in the population is evaluated with a fitness function chosen to minimize the loss function. The fitness function is expressed in Eq. (9),

    $\text{Fitness function} = \text{Min}\Big(\frac{1}{n}\sum_{j=1}^{m}(v - \hat{v})^2\Big)$ (9)
    where $v$ represents the ground truth, $\hat{v}$ represents the value predicted by the model, $n, m$ denote the number of rows and columns in the training dataset, and Min represents minimization of the loss function.

    Rising stage: In this stage, the weather conditions along with the speed of wind jointly find the dandelion seed’s height. In the search space, dandelion seeds are blown to various locations, rising higher and scattering farther with stronger wind, following a spiral motion influenced by wind speed and vortex adjustments, which is expressed in Eq. (10),

    $X_i^{t+1} = X_i^t + \alpha\, s_x s_y \ln(Y)\,(X_s^t - X_i^t)$ (10)
    where the seed position at iteration $t$ is expressed as $X_i^t$, the randomly selected position at iteration $t$ is denoted as $X_s^t$, $\ln(Y)$ denotes a lognormal distribution with $\mu = 0$, $\sigma^2 = 1$, the adaptive parameter is denoted as $\alpha$, and the seed's lift component coefficients are represented as $s_x$ and $s_y$. The diffusion coefficient represents the variability or exploration level of the algorithm, and the concentration gradient is related to the fitness landscape or objective function. Table 2 gives the pseudocode for the HDFLO algorithm.

    Table 2. Pseudocode for HDFLO algorithm.

    Input: pop, T, Dim, and C.
    Output: X_best (best position of seed)
    • Initialize the positions of the dandelion seeds.
    • Compute the fitness function as Fitness function = Min((1/n) Σ_{j=1}^{m} (v − v̂)²)
    while (t ≤ T) do
     for i = 1 to pop do
      if randn < 1.5 then
       Compute the TF by Eq. (13)
       if TF < 0.9 then
        Update the position by Eq. (14)
       else
        Update the dandelion seeds
       end if
      else // rising stage on a rainy day
       Update the dandelion seeds by X_i^{t+1} = X_i^t + α s_x s_y ln(Y) (X_s^t − X_i^t)
      end if
     end for
     for i = 1 to pop do
      Update the descending-stage position by Eq. (11)
     end for
     for i = 1 to pop do
      Update the landing-stage position by Eq. (12)
     end for
     Sort the seeds based on their fitness values
     if f(X_elite) < f(X_best) then
      X_best = X_elite
      f_best = f(X_elite) // best position and fitness value
     end if
    end while

    Descending stage: In this phase, the dandelion population travels to the appropriate location for reproduction which is reflected by the mean location in the rising stage. The descending phase (Xt+1i) is mathematically expressed in Eq. (11),

    $X_i^{t+1} = X_i^t - \alpha\,\beta_t\,(X_{mean\_t} - \alpha\,\beta_t\,X_i^t)$ (11)
    where the seed position is expressed as $X_i^t$, the adaptive parameter is denoted as $\alpha$, Brownian motion is expressed as $\beta_t$, and the mean location of the population at the $t$-th iteration is denoted as $X_{mean\_t}$.

    Landing stage: In this stage, local neighborhood development is carried out around the current elite information to reach the global optimum. The landing-stage population $X_i^{t+1}$ is expressed in Eq. (12),

    $X_i^{t+1} = X_{elite} + \lambda\,\alpha\,(X_{elite} - X_i^t\,\delta), \quad \delta = \frac{2t}{T}$ (12)
    where $t$ represents the current iteration, the seed position at iteration $t$ is expressed as $X_i^t$, $\delta$ indicates the linear coefficient, the best seed location is denoted as $X_{elite}$, the maximum number of iterations is denoted as $T$, and $\lambda$ denotes a function taking values in $[0, 2]$.

    SSO of FLO: Successful optimization algorithms hinge on transitioning between exploration and exploitation effectively. Therefore, the Transfer Function (TF) is computed to smoothly navigate between the exploration and exploitation phases, improving adaptability and performance, as provided in Eq. (13).

    $TF = \sinh\big(\frac{t}{T}\big)^{c_1}$ (13)
    where $c_1 = 0.5$. The final stage in the optimization search is the exploitation phase, where molecules update their positions $X_i^{t+1}$ for stability, as presented in Eq. (14),
    $X_i^{t+1} = X_S^t + Q_g^t \times X_i^t + Q_g^t \times (MS_i^t \times X_S^t - X_i^t)$ (14)
    where $X_i^t$ denotes the position of the particle, $X_S^t$ denotes the steady-state location, $MS_i^t$ refers to the motion step, and $Q_g^t$ represents the relative quantity of the region.

    This optimization algorithm, inspired by dandelion seed dispersal, initializes a population with random seed positions. Each seed's fitness is evaluated based on an objective function, iteratively refining positions over generations. The algorithm balances exploration and exploitation, inspired by FLO, using a transfer function that adjusts the seeds' exploration and exploitation strategies. During the rising stage, seeds adjust positions influenced by parameters like wind speed and direction, simulating natural dispersal patterns. In the descending stage, seeds move towards optimal locations identified in previous stages, and in the landing stage, positions are further refined for optimal growth conditions. The algorithm continuously updates the best seed position based on fitness, converging towards an optimal solution by the end of the iterations, effectively navigating diverse optimization stages while avoiding local optima.
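    The loop below condenses the stages above into a runnable sketch (a simplified illustration, not the exact update rules: the adaptive parameter, the Brownian-motion term, and the Eq. (14) exploitation step are replaced by simple stand-ins, and all constants are illustrative):

import numpy as np

def hdflo(fitness, dim, lb, ub, pop=30, T=200, c1=0.5, seed=0):
    """Condensed sketch of the hybrid Dandelion / Fick's-law loop (Eqs. (8)-(14)).
    `fitness` maps a position vector to a scalar that is to be minimized."""
    rng = np.random.default_rng(seed)
    X = rng.random((pop, dim)) * (ub - lb) + lb               # Eq. (8): random initialization in [lb, ub]
    fit = np.apply_along_axis(fitness, 1, X)
    elite = X[fit.argmin()].copy()                            # current elite seed
    best, f_best = elite.copy(), fit.min()

    for t in range(1, T + 1):
        tf = np.sinh(t / T) ** c1                             # Eq. (13): transfer function
        alpha = rng.random() * (1.0 - t / T)                  # adaptive parameter (simple stand-in)
        for i in range(pop):
            if tf < 0.9:                                      # exploration: rising stage, Eq. (10)
                s = X[rng.integers(pop)]                      # randomly selected seed X_s
                X[i] = X[i] + alpha * rng.lognormal(0.0, 1.0) * (s - X[i])
            else:                                             # exploitation around the elite (stand-in for Eq. (14))
                X[i] = elite + rng.random() * (elite - X[i])
        mean = X.mean(axis=0)
        X = X - alpha * 0.1 * (mean - alpha * X)              # descending stage, Eq. (11), fixed Brownian factor
        delta = 2.0 * t / T
        X = elite + rng.uniform(0.0, 2.0) * alpha * (elite - X * delta)   # landing stage, Eq. (12)
        X = np.clip(X, lb, ub)
        fit = np.apply_along_axis(fitness, 1, X)
        elite = X[fit.argmin()].copy()
        if fit.min() < f_best:                                # keep the best seed found so far
            best, f_best = elite.copy(), fit.min()
    return best, f_best

    With the Eq. (9) fitness, `fitness` would wrap the GENR loss on the training split, and the position vector would encode the quantities being tuned (e.g., the feature-selection mask or the regression coefficients).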

    Thus, the selected features are based on multimodal data where the loss function of GENR is minimized with HDFLO to efficiently extract relevant features. Then these features are taken to classify the sentiments as three dimensions.

    3.4. Emotional states classification using MWCNN

    The selected features are classified as three types of emotional states such as valence, arousal, and trustworthiness from audio, video, and text. For classification, a formulation is achieved to connect convolution and pooling with multimodal data analysis. The MWCNN classifier integrates CNN and spectral analyses, capturing spatial and frequency domain features. By using wavelet transforms, this model extends traditional CNN architecture, enhancing the feature extraction and classification process.

    3.4.1. Classification based on emotional states

    A CNN is a variant of a Neural Network (NN) with sparsely connected deep layers. In a conventional NN, each input in one layer is connected to every unit in the next layer; in the proposed classifier, besides activation functions and fully connected layers, the CNN introduces convolution/pooling layers that connect only the local receptive field around each input.

    A fusion of CNNs and spectral analyses is employed to address texture classification challenges. CNNs process textures directly to capture spatial statistics, while spectral analysis transforms textures into frequency domains for scale-invariant features. This unified approach integrates spatial and spectral information within a single model. By extending the traditional CNN architecture with multiresolution analysis through wavelet transforms, the pooling and convolution layers mimic aspects of spectral analysis. This innovative model, termed wavelet CNNs, combines CNN strengths with spectral analysis for comprehensive texture feature extraction. Figure 3 illustrates the configuration of the MWCNN-based classifier.

    Fig. 3. The architecture of MWCNN based classifier.

    Convolutional layer: For the n component input vector, a convolutional layer produces the output vector of the same number of components. The output vector (Yi) is expressed as Eq. (15),

    $Y_i = \sum_{j \in N_i} w_j x_j$ (15)
    where $x_j = x_0, x_1, \ldots, x_{n-1}$ are the $n$ components of the input vector, $w_j$ is a weight, and $N_i$ is the set of indices in the local receptive field of the input vector and weights. The weights include a bias term by having a constant input of 1; thus, the output is the sum of the weighted inputs and this constant.

    CNNs achieve translation invariance in the image space and reduce the parameter count by parameter sharing. The definition of Yi is essentially the result of convolving xj with a filtering kernel wj hence it is referred to as a convolution layer. Consequently, the output (y) is expressed as Eq. (16),

    $y = x * w$ (16)
    where $w$ is the set of weights for the same input, and the output results in a concatenated vector.

    Pooling layer: Pooling layers are commonly employed right after convolution layers to simplify the data representation. Although max pooling finds extensive use in CNN applications, average pooling is better suited for feature extraction. Consequently, the focus here lies on average pooling, which offers the advantage of revealing the association with multiresolution analysis. The output of pooling is expressed as Eq. (17),

    $y = (x * P) \downarrow p$ (17)
    where $P = \big(\frac{1}{p}, \ldots, \frac{1}{p}\big)$ is the support of pooling, an averaging filter that reduces the number of outputs by averaging over blocks of $p$ inputs, $x$ is the input vector, and $y$ is the output vector. Average pooling thus amounts to convolution with $P$ followed by downsampling by a factor $p$.

    3.4.2. Generalized convolution and pooling

    By combining the output of convolution and average pooling the generalized form is taken as Eq. (18),

    $y = (x * k) \downarrow p$ (18)
    where $x$ is the input vector, $y$ is the output vector, and $k$ is the kernel: $k = w$ with $p = 1$ yields convolution, $k = P$ with $p > 1$ yields pooling, and $k = w * P$ with $p > 1$ gives the generalized form. Given a pair of high-pass and low-pass filters, the data can be decomposed into low- and high-frequency parts with $p = 2$. Traditional CNNs are known for utilizing only the low-frequency components and disregarding the wavelet decomposition.
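    A one-dimensional NumPy illustration of Eq. (18) (the kernels and the signal are illustrative; the MWCNN applies the same idea in 2-D with wavelet low- and high-pass pairs):

import numpy as np

def generalized_layer(x, k, p):
    """Eq. (18): convolve the input with kernel k, then keep every p-th sample."""
    y = np.convolve(x, k, mode="same")
    return y[::p]

x = np.arange(8, dtype=float)
w = np.array([0.25, 0.5, 0.25])          # stands in for a learned filter w
P = np.array([0.5, 0.5])                 # averaging kernel with support p = 2

conv_out = generalized_layer(x, w, p=1)                     # k = w, p = 1: plain convolution
pool_out = generalized_layer(x, P, p=2)                     # k = P, p = 2: average pooling
combined = generalized_layer(x, np.convolve(w, P), p=2)     # k = w * P: convolution followed by pooling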

    Implementation: The network structure is modeled after VGG-19, selected for its effectiveness in texture feature extraction, using 3×3 convolutional kernels with 1×1 padding to preserve input size. Convolution layers with increased stride are utilized; incorporating 1×1 padding and a stride of two results in output size reduction, replacing max-pooling without compromising accuracy. To align with the size reduction in multiresolution analysis, decomposed images are integrated with feature maps. An energy layer preceding fully connected layers enhances performance with fewer parameters. The wavelet CNN model comprises nine convolution layers, aligned with decomposition levels, featuring an energy layer followed by three fully connected layers. Input size constraints (32×32) necessitate training images to be scaled to 256×256, then randomly cropped and flipped for diversity, effectively mitigating overfitting.
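    An illustrative Keras sketch of the trunk described above (this is not the authors' code: the filter counts, the placement of the stride-2 layers, and the dense-layer widths are assumptions, the energy layer is approximated by a global average over each feature map, and the concatenation of wavelet-decomposed inputs at matching resolutions is omitted):

import tensorflow as tf
from tensorflow.keras import layers, models

def build_mwcnn(input_shape=(32, 32, 3), n_outputs=3):
    """Nine 3x3 convolutions with stride-2 convolutions in place of max-pooling,
    an 'energy' layer, and three fully connected layers; the sigmoid head gives
    one multilabel output per emotion dimension."""
    cfg = [(64, 1), (64, 1), (128, 2), (128, 2), (256, 2),
           (256, 1), (512, 1), (512, 1), (512, 1)]            # nine convolution layers (assumed widths)
    inp = layers.Input(shape=input_shape)
    x = inp
    for filters, stride in cfg:
        x = layers.Conv2D(filters, 3, strides=stride, padding="same")(x)
        x = layers.BatchNormalization()(x)                     # batch normalization before activation
        x = layers.ReLU()(x)
    x = layers.GlobalAveragePooling2D()(x)                     # energy layer (spatial mean per feature map)
    x = layers.Dense(2048, activation="relu")(x)
    x = layers.Dense(2048, activation="relu")(x)
    out = layers.Dense(n_outputs, activation="sigmoid")(x)     # valence / arousal / trustworthiness
    return models.Model(inp, out)

model = build_mwcnn()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="binary_crossentropy", metrics=["accuracy"])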

    In contrast, proposed wavelet CNNs incorporate all components, including the high-frequency components so that no information from the input x is lost, aligning with the principles of multimodal analysis. The classified output is then compared with the existing methods to find its performance.

    This research work gathered input from the MuSe-CaR dataset to tackle data imbalance in sentiment analysis across modalities by employing a comprehensive approach. This involves combining DSMOTE with an ECGAN for emotion recognition using audio, text, and visual data. The balanced data is then processed through GENR with HDFLO to select relevant features for sentiment analysis. Subsequently, an MWCNN is employed to accurately classify emotion states. By integrating these techniques, the research aims to overcome challenges associated with diverse data types and provide a robust framework for MSA in the context of car reviews.

    4. Experiment Results

    The MuSe-CaR dataset largely preserves content and context without being broken into equally sized segments, which distinguishes it from other large datasets. This section analyzes the Python implementation in terms of three dimensions (arousal, valence, and trustworthiness) against some conventional approaches.15,16,17,18,19,20,21,22,23,24 The performance metrics accuracy, recall, and precision are used to compare the proposed GENR HDFLO feature selection method with existing methods.26,27,28,29

    4.1. Experimental setup

    The experiments utilized Anaconda Navigator Spyder with Python 3.10 on an Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz, 16.0GB RAM, and Windows 10. Following network design, factors impacting performance were analyzed, including epochs, iterations, and learning rate. The learning rate and number of iterations were set to 0.001 and 100, respectively, and experiments ran for 1000 epochs with 10 timesteps. Using a patch size of 32×32 enabled efficient comparison with existing methods. The proposed GENR model, implemented with the TensorFlow library, underwent evaluation based on different features in a multimodal approach. The MuSe-CaR dataset was split in a 60:20:20 ratio into training, development, and test sets to ensure robust evaluation.
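    A minimal sketch of this split, assuming a simple index-level shuffle (the exact partitioning protocol of the dataset release may differ):

import numpy as np

def split_60_20_20(n_samples, seed=0):
    """Shuffle sample indices and partition them 60/20/20 into train, development, and test sets."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    n_train, n_dev = int(0.6 * n_samples), int(0.2 * n_samples)
    return idx[:n_train], idx[n_train:n_train + n_dev], idx[n_train + n_dev:]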

    Data preprocessing: Gather data from the MuSe-CaR dataset, which includes multimodal data such as audio, video, and text reviews. For initial pre-processing, handle missing data by removing incomplete entries. Since the annotation for MuSe-CaR is dense (4 frames per second), split each video into segments with a window size of 200 frames (50s) and a hop size of 100 frames (25s). This segmentation enriches the training samples and facilitates model convergence. A moving average filter is applied to smooth the input signal within each sliding frame for better intelligibility.
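    The windowing described above can be sketched as follows (a minimal illustration, assuming per-frame feature arrays at 4 fps; the moving-average width is an assumption, as it is not specified here):

import numpy as np

def segment_sequence(frames, win=200, hop=100):
    """Split a per-frame annotation/feature sequence (4 fps) into overlapping
    segments of 200 frames (50 s) with a hop of 100 frames (25 s)."""
    return [frames[s:s + win] for s in range(0, len(frames) - win + 1, hop)]

def moving_average(x, width=5):
    """Moving-average filter used to smooth the signal within each sliding frame
    (a width of 5 is illustrative)."""
    kernel = np.ones(width) / width
    return np.convolve(x, kernel, mode="same")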

    Model Training: The developed scheme is executed using Python 3.10. The network structure is based on VGG-19 for effective texture feature extraction, using 3×3 convolutional kernels with 1×1 padding. The wavelet CNN model includes nine convolution layers, an energy layer, and three fully connected layers. Training images are scaled to 256×256, then randomly cropped and flipped for diversity.

    Key hyperparameters: Learning rate 0.001, 1000 epochs, and Adam optimizer. Batch normalization is used before activation layers, with ReLU as the activation function.

    4.2. Dataset description

    MuSe-CaR is a comprehensive dataset of multimodal data collected in real-world settings, aimed at understanding emotional engagement in product reviews, particularly automotive ones. It ensures high-quality voice and video recordings, offering valuable social media content. The sample video image of the MuSe-CaR dataset is shown in Fig. 4.

    Fig. 4. Sample images from the Videos of the MuSe-CaR dataset.

    4.3. Density estimation of the dimensions

    Figure 5 presents a comparison of the distributions for the manually labeled annotations of arousal, valence, and trustworthiness following feature selection.

    Fig. 5. Density estimation of the dimensions.

    While arousal demonstrates a near-perfect Gaussian distribution, valence exhibits a positive skewness, leaning towards the positive end of the spectrum. Trustworthiness, on the other hand, displays a highly peaked distribution with a strong left-skew. The model aims to analyze the phenomenon of feature selection due to the high-density distribution observed across all three dimensions.

    4.4. The feature set of each modality in the MuSe-CaR dataset

    Table 3 illustrates the feature set of each modality with the corresponding arousal and valence dimension where “A” is the representation of audio, “V” is the representation of video, and “T” is the representation of text.

    Table 3. Feature set of each modality in MuSe-CaR dataset.

    Features | Modality | Arousal | Valence
    Energy | A | 0.4684 | 0.2633
    Pitch | A | 0.4596 | 0.2484
    MFCC | A | 0.4313 | 0.2646
    Wav2vec | A | 0.4818 | 0.333
    VGGFace | V | 0.4023 | 0.2241
    HOG | V | 0.4653 | 0.1059
    SeNetFace | V | 0.4678 | 0.1543
    ResNetFace | V | 0.4311 | 0.1344
    BERT | T | 0.4325 | 0.5624
    RoBERT | T | 0.4256 | 0.6132
    ALBERT | T | 0.4356 | 0.5532
    BOW | T | 0.4567 | 0.5689
    Word2vec | T | 0.4675 | 0.6732

    The analysis reveals several key findings: Firstly, in terms of arousal, audio features demonstrate higher effectiveness compared to visual and text features. Secondly, for valence, the text modality exhibits significantly superior performance compared to the other two modalities. Lastly, the visual modality generally yields the least favorable results. The researchers speculate that the dominance of audio features in arousal assessment stems from the reliance on speech-related aspects such as intonation and tone, whereas valence assessment is primarily influenced by verbal content.

    4.5. Confusion matrix of three-dimension prediction using classifier

    This experiment focused on features from the acoustic, linguistic, and visual modalities. Figure 6 displays the confusion matrix derived from using these features to forecast emotion categories in arousal, valence, and trustworthiness.

    Fig. 6. Confusion matrix of prediction using classifier. (a) Valence. (b) Arousal. and (c) Trustworthiness.

    When comparing the confusion matrix obtained from a three-dimensional representation with the corresponding fixed-length representation, there is an improvement in the weighting along the diagonal of the confusion matrix. The findings indicate that, overall, valence yields better results than arousal for this task, and trustworthiness performs better than arousal as well. Consequently, predicting arousal, which relies mainly on the audio modality, proves more challenging than predicting the other dimensions.

    4.6. Continuous prediction analysis with frequency distribution

    The MuSe-CaR dataset supports predicting the sentiment dimensions (valence, arousal, and trustworthiness) continuously over time. Sentiment and the emotional component of valence are closely related and often used interchangeably.

    The process of manually annotating continuous emotions by humans often results in disagreements due to variations in perception and reaction time. To mitigate these issues, a feature selection model is employed. Various selection methods are discussed in existing literature, and this advancement has also sparked new challenges. In Fig. 7, a feature selection method is depicted which utilizes the distribution of proposed features from a minimum of five different ratings. Additionally, the model takes into account the diverse reaction times observed during the process.

    Fig. 7. Continuous prediction analysis with frequency distribution. (a) Valence. (b) Arousal. (c) Trustworthiness.

    4.7. Comparison of unimodal and multimodal using 6-fold cross validation

    Figure 8 denotes the results of a 6-fold cross-validation experiment designed to compare two emotion recognition models: unimodal and multimodal. The unimodal model is trained solely on video frame data, while the multimodal model capitalizes on information from three different modalities (visual, audio, and text). The experiment employed a 6-fold cross-validation technique, where the data is split into six sections. Each section is then used for testing, while the remaining sections are employed for training. This process is repeated six times, ensuring a robust evaluation of both models.
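    A minimal sketch of this protocol, assuming a Keras-style model returned by a user-supplied build_model function (a hypothetical helper, e.g. the MWCNN sketch in Section 3.4):

import numpy as np
from sklearn.model_selection import KFold

def six_fold_scores(X, y, build_model, epochs=10):
    """6-fold cross-validation: each fold serves once as the test split while the
    remaining five folds are used for training; scores are averaged over folds."""
    scores = []
    for train_idx, test_idx in KFold(n_splits=6, shuffle=True, random_state=0).split(X):
        model = build_model()
        model.fit(X[train_idx], y[train_idx], epochs=epochs, verbose=0)
        scores.append(model.evaluate(X[test_idx], y[test_idx], verbose=0))
    return np.mean(scores, axis=0)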

    Fig. 8. Comparison of unimodal and multimodal using 6-fold cross validation. (a) Valence. (b) Arousal. (c) Trustworthiness.

    The key finding from Fig. 8 is the clear superiority of the multimodal model. It consistently outperforms the unimodal model across all three emotional dimensions: valence, arousal, and trustworthiness. This indicates that incorporating additional data modalities significantly enhances the model’s ability to accurately detect emotions compared to relying solely on video information. Interestingly, the results also suggest that trustworthiness might be a more readily detectable emotion, as it exhibits the highest accuracy across both models.

    4.8. Comparison of the proposed MSA system against existing sentiment analysis systems

    The sentiment analysis with the MuSe-CaR dataset provides three dimensions (valence, arousal, and trustworthiness) to analyze the system. In this section, sentiment analysis systems such as the Multi-head Attention Network,15 the attention-enhanced Recurrent Model,16 the MuSe-Toolbox,17 Sentiment Analysis of YouTube Comments,20 LSTM-RNN,21 and the temporal convolutional network24 are compared with the proposed MSA system.

    Figure 9 shows that the proposed system attains high scores across the emotion dimensions. Valence, arousal, and trustworthiness reach 0.6245, 0.723, and 0.695, respectively, roughly a 20% higher rate than the existing methods; results like these are why recent research is increasingly motivated by multimodal data for sentiment analysis. The effectiveness of the proposed model is due to several key factors. Initially, the system uses DSMOTE combined with ECGAN to address data imbalance, which ensures the generation of high-quality, diverse synthetic samples. Subsequently, feature selection is optimized using GENR HDFLO, which captures relevant features and their interactions, thus improving predictive performance. Finally, the MWCNN classifier, incorporating wavelet transforms and a robust convolutional architecture, accurately classifies emotional states from multimodal data. These advancements contribute to the system's superior performance in valence, arousal, and trustworthiness, achieving up to 20% better results compared to other methods.

    Fig. 9. Comparison of proposed MSA system against existing sentiment analysis systems. (a) Valence. (b) Arousal. (c) Trustworthiness.

    4.9. Comparison against existing sentiment analysis systems

    The classification performance is improved by limiting the large amount of data in the dataset; thus, a feature selection approach is adopted to choose a set of relevant features. This research work uses the GENR HDFLO based selection method. GENR is applied to the multimodal data (audio, video, and text) using recursive feature elimination to determine the appropriate number of features, and its loss function is minimized using HDFLO.

    The existing approaches Principal Component Analysis (PCA),26 Feature Weight-based Decision Tree algorithm (FWDT),27 Deep Convolutional Network (DCN),28 and Attention-based Bidirectional Convolutional Deep Model (ABCDM)29 are compared with the proposed GENR HDFLO based feature selection method in terms of accuracy, precision, and recall. Figure 10 shows that the proposed GENR HDFLO method obtains 0.9965 accuracy, 0.9945 recall, and 0.9966 precision, which are higher than the existing methods. The proposed GENR HDFLO model improves classification across a wide range of dimensions. Its superior performance compared to other methods is due to its comprehensive and innovative approach to handling multimodal data and addressing class imbalance. By using DSMOTE with ECGAN for data augmentation, GENR HDFLO for feature selection, and MWCNN for classification, the model effectively captures and analyzes precise emotional states from diverse inputs, resulting in more accurate and reliable sentiment analysis.

    Fig. 10. Comparison of GENR HDFLO based feature selection against existing methods. (a) Accuracy. (b) Recall. (c) Precision.

    Table 4 presents a performance comparison between analyses conducted without and with DSMOTE. Without DSMOTE, the model obtained an accuracy of 82.76%, recall of 80.27%, and precision of 80.46%. Introducing DSMOTE significantly improves the performance, resulting in a notable increase in accuracy to 94.76%, recall to 91.62%, and precision to 90.26%, respectively. This demonstrates a substantial percentage increase in model performance across all metrics with the incorporation of DSMOTE.

    Table 4. Performance comparisons without DSMOTE and with DSMOTE.

    Performance metrics | Analysis without DSMOTE | Analysis with DSMOTE
    Accuracy | 0.8276 | 0.9476
    Recall | 0.8027 | 0.9162
    Precision | 0.8046 | 0.9026

    4.10. Comparison of feature selection approaches

    This section compares the proposed feature selection approach with existing models such as RAAW,17 MAN,15 and ensemble models.19 Table 5 compares the different feature selection methods based on their performance metrics. RAAW achieved an accuracy of 82.76%, recall of 80.28%, and precision of 81.74%. MAN demonstrated improved performance with 84.75%, 82.87%, and 82.54% in accuracy, recall, and precision, respectively. Ensemble models yielded lower metrics at 74.62%, 72.94%, and 71.63%.

    Table 5. Comparison of feature selection methods.

    Methods | Accuracy | Recall | Precision
    RAAW17 | 0.8276 | 0.8028 | 0.8174
    MAN15 | 0.8475 | 0.8287 | 0.8254
    Ensemble models19 | 0.7462 | 0.7294 | 0.7163
    GENR HDFLO (proposed) | 0.9476 | 0.9162 | 0.9026

    The proposed method, GENR HDFLO, outperformed all, achieving a significantly higher accuracy of 94.76%, recall of 91.62%, and precision of 90.26%. GENR introduces a sophisticated feature selection mechanism, refining traditional ENR. This approach overcomes sparsity limitations, allowing for a more precise and adaptive model. By incorporating granular selection, GENR enhances the regularization process, leading to improved accuracy and robustness, particularly beneficial in sentiment analysis tasks with complex textual data.

    Table 6 presents the performance analysis of the developed GENR HDFLO model based on different learning rates. As the learning rate decreases from 0.1 to 0.00001, there are changes in the model’s performance metrics. Table 6 demonstrates that the developed GENR HDFLO scheme achieved the optimal performance with a learning rate of 0.001, showing higher accuracy, recall, and precision compared to other learning rates. This finding suggests that a learning rate of 0.001 optimally balances the model’s ability to learn from the data, leading to improved overall performance in terms of classification accuracy and precision.

    Table 6. Performance analysis of the proposed model concerning learning rate.

    Model | Learning rate | Accuracy | Recall | Precision
    GENR HDFLO (proposed) | 0.1 | 0.8912 | 0.8573 | 0.8421
     | 0.01 | 0.9045 | 0.8673 | 0.8592
     | 0.001 | 0.9476 | 0.9162 | 0.9026
     | 0.0001 | 0.9228 | 0.8841 | 0.8721
     | 0.00001 | 0.9389 | 0.8943 | 0.8842

    The developed model is evaluated across various patch sizes to identify the optimal setting, with 32×32 emerging as the most effective size. Figure 11 presents the comparative results achieved with different patch sizes, highlighting the performance variations observed across the range of sizes tested. This analysis underscores the significance of patch size selection in influencing model outcomes and showcases the superiority of the 32×32 patch size based on the experimental findings.

    Fig. 11. Classification results for various patch sizes.

    The accuracy graph depicted in Fig. 12 provides valuable insight into the learning progress of a model over training epochs. This visualization illustrates the model’s accuracy improvement with each iteration during training. Ideally, the graph shows a steady increase in accuracy as the model learns from the training data.

    Fig. 12. Accuracy vs epochs.

    The proposed GENR HDFLO model demonstrates superior performance compared to existing models, as shown by the metrics in Table 7. It achieves an exceptionally low Mean Absolute Error (MAE) of 0.451, which is significantly lower than the MAE of HyCon (0.71) and AdaMoW (0.69), indicating high precision in predictions. The model also boasts an impressive F1-score of 94.21%, surpassing all other models and reflecting an excellent balance between precision and recall. This high F1-score highlights the model’s effectiveness in accurately identifying positive instances. Additionally, the model’s accuracy of 94.76% is notably high, demonstrating robust performance. This combination of a low error rate, high F1-score, and substantial accuracy underscores the GENR HDFLO model’s advanced capability in handling multimodal sentiment analysis, offering a more reliable and effective understanding of emotions in the data.

    Table 7. Error rate, and F1-score analysis of the proposed model over existing models.

    Techniques | MAE | F1-score (%) | Accuracy (%)
    HyCon37 | 0.71 | 85.1 | 85.2
    AdaMoW38 | 0.69 | 86.57 | 86.57
    GENR HDFLO (proposed) | 0.451 | 99.42 | 99.65

    Figure 13 compares the performance of the proposed GENR HDFLO model against other models (ABCDM, DCN, FWDT, PCA) based on different missing rates. In Fig. 13(a), the accuracy of each model decreases as the missing rate increases, but the GENR HDFLO model consistently achieves higher accuracy across all missing rates, demonstrating its robustness in handling incomplete data. In Fig. 13(b), the MAE increases with higher missing rates for all models. However, the GENR HDFLO model maintains a lower MAE compared to others, indicating its superior capability in minimizing prediction errors even with substantial data loss. These results underscore the effectiveness and reliability of the GENR HDFLO model in managing and analyzing multimodal data with missing values. The model excels in this area due to several key factors: it employs DSMOTE and ECGAN for effective data augmentation, which ensures a balanced dataset and minimizes issues such as noisy sample creation. Additionally, GENR HDFLO optimizes feature selection by considering the interdependencies among features from different modalities. MWCNN further enhances emotion classification by integrating spatial and spectral information, allowing it to capture complex data patterns more effectively. Together, these techniques result in higher accuracy and lower error rates compared to existing methods.

    Fig. 13. Performance comparison of proposed model based on missing rates (a) accuracy (b) MAE.

    4.11. Discussions

    The proposed framework for MSA demonstrates significant advantages in effectively integrating audio, visual, and text modalities to understand emotions across diverse data sources. It exhibits superior performance compared to existing methods, achieving higher accuracy, recall, and precision in emotion classification tasks, particularly with the use of GENR HDFLO-based feature selection.

    To demonstrate the novelty of the proposed work, this research conducted a comparative analysis with the existing models.35,36 Table 8 presents a detailed comparison highlighting the advancements of the proposed model over the existing ones.

    Table 8. Comparison analysis of the proposed model over existing models.

    Reference | Technique | Audio | Video | Text | Accuracy (%)
    35 | SVM | × | × | ✓ | 91
    36 | BiGRU | × | × | ✓ | 91
    Proposed | GENR HDFLO | ✓ | ✓ | ✓ | 94.76

    The GENR HDFLO model demonstrates advanced capabilities by integrating audio, video, and text modalities to achieve a high accuracy of 94.76%, surpassing existing SVM35 and BiGRU36 classifiers. Unlike SVM and BiGRU, which process only textual data, GENR HDFLO harnesses rich information from multiple modalities, enhancing its sentiment analysis capabilities. Existing models like SVM and BiGRU are limited by their reliance on single-modal data, which may hinder their ability to capture essential sentiment across different modalities. Additionally, single-modal classifiers may struggle with the inherent complexity and variability present in multimodal data, potentially resulting in lower accuracy compared to the proposed approach GENR HDFLO that integrates and interprets multiple modalities simultaneously. Therefore, the proposed model represents a significant advancement in MSA, addressing the limitations of existing single-modal approaches.

    The proposed method, GENR HDFLO, offers several distinct advantages over existing methods in handling multimodal data with missing values. Firstly, it integrates DSMOTE and ECGAN for data augmentation, which effectively addresses class imbalance and ensures a more robust dataset without introducing noisy samples or overlapping classes. This approach leads to improved generalization and accuracy in classification tasks compared to traditional augmentation methods. Secondly, the use of GENR HDFLO for feature selection enhances the model’s ability to identify and incorporate the most relevant features from different modalities. This not only improves prediction accuracy but also ensures a more comprehensive representation of complex data relationships. By considering interdependencies among features across modalities, the model captures different patterns that may be missed by methods focusing solely on individual modalities.

    However, the framework may suffer from high computational complexity, especially during feature selection and optimization stages. Future research will focus on optimizing model parameters to strike a better balance between performance and complexity, enhancing the framework’s scalability and efficiency. Additionally, efforts are needed to enhance the model’s robustness in handling missing modal information, ensuring its effectiveness in practical applications, and improving the interpretability of inter-modal interactions for better user understanding.

    5. Conclusion

    In this research, the MuSe-CaR dataset, collected from user-generated audio-visual and transcript recordings, is introduced; it presents imbalanced classes for testing and training. To address this, a data augmentation approach is employed to balance the three classes of multimodal data emotions. Various multimodal features were deeply represented with GENR, which is optimized to minimize the loss function with HDFLO to select relevant features. These features are then utilized to classify emotions as valence, arousal, and trustworthiness using MWCNN, achieving high performance compared to existing methods with 0.6245 valence, 0.723 arousal, and 0.695 trustworthiness on MuSe-CaR. Notably, the feature selection method achieved 99.65% accuracy, 99.45% recall, and 99.66% precision, outperforming the existing deep learning and machine learning classifiers. Despite these achievements, future research efforts should prioritize (1) reducing model parameters while maintaining performance, (2) decreasing computational complexity, and (3) mitigating redundant information during model training. However, a notable challenge lies in enhancing the model's robustness and accuracy in handling missing modal information, particularly in real-world scenarios. Addressing these challenges will expand the model's applicability and reliability in practical settings involving MSA.

    Acknowledgement

    None

    Data Availability Statements

    Data sharing not applicable to this article as no datasets were generated or analysed during the current study.

    Funding

    This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.