
Real-Time Change Detection with Convolutional Density Approximation

    https://doi.org/10.1142/S219688882350015X

    Abstract

    Background Subtraction (BgS) is a widely researched technique to develop online Change Detection algorithms for static video cameras. Many BgS methods have employed the unsupervised, adaptive approach of the Gaussian Mixture Model (GMM) to produce decent backgrounds, but they lack proper consideration of scene semantics to produce better foregrounds. On the other hand, with considerable computational expense, BgS with Deep Neural Networks (DNN) is able to produce accurate background and foreground segments. In our research, we blend both approaches to get the best of both. First, we formulate a network called Convolutional Density Approximation (CDA) for direct density estimation of background models. Then, we propose a self-supervised training strategy for CDA to adaptively capture high-frequency color distributions for the corresponding backgrounds. Finally, we show that background models can indeed assist foreground extraction through an efficient Neural Motion Subtraction (NeMos) network. Our experiments verify competitive results in the balance between effectiveness and efficiency.

    1. Introduction

    Change Detection is a fundamental semantic segmentation task that handles the identification of changing or moving areas in the field of view of a camera. With the swift progress in computer vision, its practical utilization in visual systems has involved advanced tasks such as behavior analysis,1,2 instance segmentation3 and traffic analysis.4 While various online and offline algorithms have been proposed, online algorithms are arguably much more favorable as they can make predictions on demand for large-scale inputs of essentially all tasks. By making continuous predictions on demand, and even detecting and dealing with out-of-distribution signals, online Change Detection algorithms have been pivotal for proper, timely understanding of scene dynamics and extraction of interesting attributes in many systems.

    One popular approach is Change Detection via background modeling and Background Subtraction (BgS). The background modeling step aims to construct ideal backgrounds, which are scene captures containing only stationary objects and features (e.g. streets, houses) uninteresting to the system's analytic purposes. Then, by comparing visual inputs from a video sequence with their backgrounds, the BgS technique can localize all desired targets (i.e. so-called foregrounds like cars, pedestrians, etc.) for analysis. Despite having been challenged by a plethora of real-life scenarios such as shadows, illumination changes, and dynamic backgrounds, among others,5 background modeling and BgS remain prominent topics of research toward a wide range of applications including video surveillance, smart environments, and content retrieval.

    Prominent online approaches6 include pixel-based statistical frameworks such as the Gaussian Mixture Models (GMMs).7,8,9,10 These GMM frameworks are based on the hypothesis that background intensities are observed most frequently in a video sequence recorded from a still camera, thereby creating explicit mathematical structures for simple inferences. Additionally, this design notably entails general applicability under illumination changes (e.g. from moving clouds), view noises (e.g. rain, snowflakes), and implicit motions (e.g. river water). However, they struggle to perform effectively when the background intensity hypothesis fails (i.e. stopped objects or visual noises in the scene presenting prolonged intensity values in certain regions), resulting in corrupted backgrounds and inaccurate foreground estimations. On the whole, it is their simplicity and efficiency on CPUs that make GMM-driven approaches very appealing, but their lack of an explicit parallel-computing design for GPUs, together with their limited consideration of scene semantics for better foregrounds, has made them much less relevant in modern research, especially concerning big data and deep learning (DL).

    On the other hand, ever-advancing processing units specialized for large-scale data are making Deep Neural Networks (DNNs) not only powerful, but also tractable. However, in addition to a reliance on labels for practical utilization in real scenarios, existing architectures face inevitable trade-offs between computational efficiency and high accuracy, and DL architectures for background modeling and Change Detection are no exception. With respect to labeled data for Change Detection in particular, the authors in Ref. 11 saw a clear lack of labels for training general motion detectors, and there is currently no universal dataset that can ensure all possible scenes' true properties are appropriately represented. These findings have obviously presented many challenges, but they have also motivated research into designing DNNs that can achieve high accuracy while utilizing as little labeled data as possible.12,13 Nevertheless, because domain generalization14 is a complex, still unsolved research problem in overcoming data biases, especially with regard to a semantic segmentation task like Change Detection, learned models can always be susceptible to unseen contextual variations that may occur in the real world.15,16

    Through highly parallelizable neural architectures, the literature on DNNs has shown that they can approximate any function to arbitrary accuracy. This signifies that we can efficiently utilize their parallelism not only to approximate the mechanism behind the optimization of GMMs for background modeling, but also to facilitate a more efficient data-driven foreground extractor that uses few labels. Unfortunately, little research has focused on striking a balance between effectiveness and efficiency for real-time, scalable, and reliable processing.

    Hence, in this paper, we propose a real-time, highly effective network design for BgS that uses few learning labels. Our novel approach essentially preserves most of DNNs' benefits, addresses the sequentialism of GMM-based background modeling, and shows that backgrounds can be used to streamline foreground extraction at high accuracy. Essentially, we develop a dual framework of BgS consisting of two modules: (1) Background Modeling by Convolutional Density Approximation (CDA) for direct density estimation of background distributions; and (2) Foreground Extraction by Neural Motion Subtraction (NeMos), which estimates changed regions based on contextual constraints. Our contributions are summarized as follows:

    First, inspired by the existing computing technologies and Bishop,17 we present our formulation of a GMM-based background solver via CDA. It is a feed-forward, 2800-parameter parallelizable Convolutional Neural Network (CNN) that simulates a posterior probability function conditioned on the temporal history at each pixel location. The architecture is lightweight, compressed, and efficient by addressing the conventional sequentialism of GMM-based background models, and it performs as an effective codebook for mapping arrays of pixel values to the corresponding GMM functions.

    Second, to technically model the underlying generator of input data, we propose a self-supervised learning strategy based on unsupervised learning and data augmentation. In particular, the strategy includes an unsupervised objective function that guides CDA to approximate the parameters of GMMs via expectation maximization, and teaches it to behave as a permutation invariant network. The proposed background modeling architecture not only achieves high degrees of mathematical interpretability, but also possesses adaptation to contextual dynamics with the neural statistical analysis. Furthermore, with self-supervision, the framework can be pre-trained with an inexhaustible amount of data.

    Third, we propose to use a context-driven, 700-parameter neural foreground extraction component called NeMos, on top of background models, for effectively and efficiently segmenting the difference mapping between input frames and their corresponding background estimations. This is motivated not only by our construction of GMM-driven background models with CDA to provide for summarized semantic understanding of a context-rich scene, but also by addressing the prohibitive expenses of existing segmentation networks for Change Detection. The network can properly maintain generalization across a scenario’s dynamics in real time.

    The organization of this paper is as follows. Section 2 encapsulates the synthesis of recent approaches in background initialization and foreground segmentation. The proposed method is described in Sec. 3. Experimental evaluations are discussed in Sec. 4. Finally, our conclusion and motivations toward future works are discussed in Sec. 5.

    2. Related Works

    The new era of video analysis has witnessed a proliferation of methods that concentrate on Change Detection. In fact, studies in recent decades have been encapsulated in various conceptual and experimental perspectives.6,15,18 The literature has specifically remarked on both unsupervised and supervised learning, particularly on the two most prominent concepts used in BgS or foreground detection: unsupervised statistics-based approaches and supervised DNNs. This work extends our preprint19 to further investigate self-supervision and experimental results.

    2.1. Statistics

    Statistics-driven methods have been widely studied in terms of both research and practical applications due to their simplicity, light weight, and online adaptation to scene dynamics without label training. Deployed methods of this category are usually sample-based (e.g. temporal median,20 histograms,21 codebooks22) or based on estimations of a multi-modular probability density function (PDF) (e.g. the GMMs7) over data inputs.

    Sample-based approaches essentially record the history of observed input pixels as sets of intensity values representing a background model. Given a new input value, an algorithm compares the corresponding set to that pixel value to determine whether the pixel belongs to the background, and selectively adapts the model. For example, a codebook algorithm like Ref. 23 records all intensities in the YCrCb color space at each pixel over a period of time through quantization of scene multi-modularity. In a similar way, Ref. 24 estimates visual changes by extracting histogram features and thresholding their means over their most probable occurrence. Another recently proposed approach employs a weight-sample-based strategy25 that rapidly adapts to changing scenarios via a reward-and-penalty function on samples. Recently, Agrawal and Natu26 presented a two-level adaptive thresholding algorithm to remove shadow pixels and detect foregrounds. The algorithm is based on the YCbCr color space, and uses the intensity ratio method for improved pixel-wise recognition.

    In parallel, popular approaches also aim to construct the PDF of the data, where pixels' spatio-temporal visual features are captured in corresponding probabilistic models at either pixel level or region level. Over the last decades, scientists have proposed a variety of statistical models to resolve the problem of background modeling and subtraction. Stauffer and Grimson7 proposed a pioneering work that handled gradual changes in outdoor scenes using a pixel-level GMM with a sequential K-means distribution matching algorithm. To enhance the foreground/background discrimination ability regarding scene dynamics, Pulgarin-Giraldo et al.27 improved the GMM with a contextual sensitivity that used a Least Mean Squares formulation to update the parameter estimation framework. Validating the robustness of background modeling under a high amount of dynamic scene changes, Ha et al.28 proposed a GMM with a high-variation removal module using entropy estimation. On the other hand, Zhao et al.29 showed that BgS is possible with the integration of alternative cues about foreground and background on freely-moving cameras, where foreground cues can be extracted from the GMM compensated with image alignment, and background cues can be obtained from the spatio-temporal features filtered by the homography transformation. Then, in an effort to address the sequential bottleneck among statistical methods in pixel-wise learning, an unsupervised, tensor-driven framework of GMM was proposed by Ha et al.10 with a balanced trade-off between satisfactory foreground masks and exceptional processing speed. However, the approach's parameters require a lot of manual tuning. Overall, statistical models were developed with explicit probabilistic hypotheses to sequentially represent the correlation of historical observations at each image point or pixel block, combined with a global thresholding approach to extract foregrounds. This global thresholding technique for foreground detection usually leads to a compromise between the segregation of slow-moving objects and rapid adaptation to sudden scene changes within short-term measurement. This trade-off usually degrades BgS quality in multi-contextual scenarios, which is a sensitive concern in motion estimation. Hence, regarding foreground segmentation from background modeling, it is critical to improve frame differencing from constructed background scenes with a better approximation mechanism, and to utilize parallel technologies.

    Drawn from the published methods, statistical studies essentially aim to characterize the history of pixels' intensities with generalistic background models. The construction of these models is conveniently unsupervised and can be effectively adaptive to the dynamics of their input domains. However, indiscriminate adaptability entails compromises between valuable incorporation of domain contexts and over-adaptation of foreground objects into background models. While efforts to address these effects have shown promising results, they entail extra computational burden in exchange for improved accuracy, and still lack full consideration of scene properties for segmenting accurate foregrounds. Regarding GMM-based approaches specifically, the GMM mathematical framework has not only demonstrated strong multi-modular approximation of input statistics, where effective background extraction procedures may excel, but the extensive literature has also shown how highly customizable it is for addressing specific problems. Nevertheless, in terms of computations, there has yet to be a common, explicit computing framework for GMM-based approaches in which pixel-wise processing for high dimensionalities and scales can be accomplished with GPUs.

    2.2. Deep neural networks

    Unlike statistical frameworks, neural networks can explicitly exploit nonlinear data manipulations on parallel distributed computing paradigms with modern technologies by label training. Their goal is to generalize an equivariant function of foreground segmentation across video sequences, where visual changes of varying degrees of complexity may occur, either by BgS or by direct foreground extraction.

    Recently, there have been many attempts to apply DNNs to BgS. Inspired by LeNet-530 used for handwritten digit recognition, one of the earliest efforts to subtract the background from the input image frame was made by Braham et al.31 This work explores the potential of visual features learned by hidden layers for foreground–background pixel classification. Similarly, Wang et al.32 proposed a deep CNN trained on only a small subset of frames, as there is a large redundancy in a video taken by surveillance systems. The model requires a hand-labeled segmentation of moving regions as an indicator in observed scenes. Lim et al.33 constructed an encoder–decoder architecture with the encoder inherited from VGG-16.34 The proposed encoder–decoder network takes a video frame, along with its corresponding grayscale background and its previous frame, as the network's inputs to compute their latent representations, and deconvolves these latent features into a foreground binary map. Another method is DeepBS,35 proposed by Babaee et al., which computes the background model using both SuBSENSE36 and the Flux Tensor method.37 The authors extract the foreground mask from a small patch of the current video frame and its corresponding background fed into the CNN, and the mask is later post-processed to give the result. Nguyen et al. proposed a motion feature network38 to exploit motion patterns by encoding motion features from small samples of images. The method's experimental results showed that the network obtained promising results and performed well on several unseen data sequences.

    Regarding direct foreground extraction, these models essentially construct implicit backgrounds within their hidden states, and cluster pixel regions by recognizing semantic classes of interest in the training set. A notable approach is the scene-specific FgSegNet series of encoder–decoder architectures12,13 proposed by Long et al. FgSegNet is one of the top-performing approaches in Change Detection and is built on top of the convolutional layers of VGG-16. Chen et al.39 introduced a pixel-wise deep sequence learning architecture for Change Detection, which exploits high-level spatial-temporal features with a deep pixel-wise attention mechanism and convolutional long short-term memory (ConvLSTM). On the other hand, Yang et al.40 proposed an end-to-end multi-scale spatiotemporal propagation network to detect motions. Instead of using ConvLSTM or 3D convolutions, they developed a feature aggregation block to fuse motion features of various scales. Similarly, to take multi-scale features into account, Houhou et al.41 presented a deep multi-scale network for BgS, which fuses the RGB color channels and depth maps to perform spatio-semantic BgS at various scales. Recently, Gouizi and Megherbi42 extended the U-net architecture with more skip connections on residual micro-autoencoder blocks. The approach, called Nested-Net, produces high accuracy at the expense of significantly higher computational costs than U-net.

    All things considered, neural-network-based methods significantly benefit from learning a transformation from an input batch of consecutive frames to manually labeled foregrounds of visual changes. From training with selected samples, these approaches are able to accurately generalize to varying degrees of contextual dynamics within a scene, essentially by constructing a numerical understanding of foreground extraction within the network's parameters. However, recent DNN-based methods do not ensure real-time performance, which is a crucial requirement for practical systems that need on-the-fly predictions. Despite the fact that DNNs can utilize the parallel-computing mechanisms of modern hardware very well, and can also make use of data for high-accuracy prediction, hardly any work has been done to investigate a proper balance between effectiveness and efficiency for DNN-based Change Detection models.

    Therefore, inspired by how statistics-based models are very popular with application scientists,6 we advocate real-time processing and high accuracy to account for the convenience, scalability, and functionality of deploying DNNs in practical scenarios.

    3. Methodology

    In our work, we propose a framework consisting of two CNNs, as shown in Fig. 1. First, grounded on a generalized GMM model like Ref. 7, the first network models posterior PDFs conditioned on records of temporal information to construct background scenes. The vanilla form of GMM on background modeling is very simple for neural networks to approximate, as highly frequent intensity values are skewed toward background values. Then, the second component is developed to perform deep BgS across thresholds of differences, in which we show that a CNN-based encoder–decoder not only can be used in estimating frame-to-background differences like Ref. 35, but by leveraging generalized backgrounds, it can also make accurate predictions efficiently.

    Fig. 1. The overview of the proposed method for background modeling and foreground detection.

    3.1. Convolutional density approximation of Gaussians

    In this section, we first propose to formulate the GMM problem under DNNs' perspectives. Following Zivkovic,8 let $\chi_T^c = \{x_1, x_2, \ldots, x_T \mid x_i \in [0, 255]^c\}$ be the set of $T$ observed color signals at a pixel position, where $c$ is the number of dimensions in the color space; the distribution of pixel intensity $x_i$ can be modeled by a linear combination of $K$ probabilistic components $\theta_k$ and their corresponding posterior functions $P(x_i \mid \theta_k)$. The marginal probability $P(x_i)$ of the mixture is defined in the following equation:

    $P(x) = \sum_{k=1}^{K} P(\theta_k)\, P(x \mid \theta_k) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \sigma_k),$   (1)
    where $P(\theta_k) = \pi_k$ is the non-negative mixing coefficient that sums to unity over all $k$'s, representing the likelihood of occurrence of the $k$th Gaussian distribution $\theta_k$.

    In practice, real-life recorded scenes often present various degrees of changing context dynamics (e.g. bodies of water, waving trees, changing weather, illumination, etc.). Obviously, while also taking into account acquisition noises, a single Gaussian would not be sufficient to model the pixel's values. This multi-modality ought to be captured by a mixture of adaptive Gaussians. To also avoid performing costly matrix inversion,7 each color channel in the color space is assumed to be distributed independently, thus each Gaussian component in the mixture is described with a scalar variance $\sigma_k$.

    $P(x \mid \theta_k) = \mathcal{N}(x \mid \mu_k, \sigma_k) = \dfrac{1}{\sqrt{(2\pi)^c \sigma_k^c}} \exp\!\left(-\dfrac{\|x - \mu_k\|^2}{2\sigma_k}\right),$   (2)
    where $\mu_k$ is the estimated mean and $\sigma_k$ is the estimated universal covariance of the color channels in the $k$th Gaussian component.
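    For concreteness, the mixture density of Eqs. (1) and (2) for a single pixel can be sketched as follows (a minimal NumPy illustration, not the paper's implementation; all parameter values below are purely illustrative):

```python
# Minimal sketch of Eqs. (1)-(2): a GMM with a scalar variance per component,
# evaluated for one c-dimensional pixel value. Values below are illustrative only.
import numpy as np

def gmm_density(x, pi, mu, sigma):
    """x: (c,) pixel value; pi: (K,) mixing weights; mu: (K, c) means; sigma: (K,) scalar variances."""
    c = x.shape[0]
    sq_dist = np.sum((x[None, :] - mu) ** 2, axis=1)        # ||x - mu_k||^2
    norm = 1.0 / np.sqrt((2.0 * np.pi) ** c * sigma ** c)    # normalizing constant of Eq. (2)
    comp = norm * np.exp(-sq_dist / (2.0 * sigma))           # N(x | mu_k, sigma_k)
    return float(np.sum(pi * comp))                          # Eq. (1): sum_k pi_k N(...)

# Example with K = 3 components over a normalized RGB pixel (c = 3).
pi = np.array([0.7, 0.2, 0.1])
mu = np.array([[0.5, 0.5, 0.5], [0.9, 0.1, 0.1], [0.1, 0.9, 0.1]])
sigma = np.array([0.01, 0.02, 0.02])
print(gmm_density(np.array([0.52, 0.49, 0.51]), pi, mu, sigma))
```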

    From this hypothesis, we propose an architecture called Convolutional Density Approximation (CDA), which employs a set of nonlinear transformations $f_\theta(\cdot)$ to formulate a conditional GMM-based density function of $x$ given a set of randomly selected, vectorized data points $\chi_T^c$:

    $y_T = f_\theta(\chi_T^c) \mapsto P(x \mid \chi_T^c).$   (3)

    In this work, we incorporate the mixture density model with a CNN instead of the multi-layer perceptron used by Bishop in the original research.17 In the proposed scheme, the network itself learns to act as a feature extractor that formulates statistical inferences on temporal series of intensity values. First, as the background image contains the most frequently presented intensities in the sequence of observed scenes, CDA exploits this intuition to infer the most likely intensity value that will arise in the background image via consideration of the temporal arrangement. Second, the memory requirement to store so many weights in a multi-layer perceptron may rule out certain hardware implementations. In convolutional layers, the weight-sharing scheme of the proposed CNN reduces the number of parameters, making CDA lighter and exploiting parallel processing of multiple pixel-wise analyses within a batch of video frames.

    The architecture of CDA contains seven learned layers, not counting the input — two depthwise convolutional, two convolutional, and three dense layers. Our network is summarized in Fig. 2. The input of the proposed network is a time series of color intensities at each pixel, which is analyzed with noncomplete connection schemes in four convolution layers along the temporal perspective. Finally, the feature map of the last convolution layer is connected to three different configurations of dense layers to form a three-fold output that presents the kernel parameters of the GMM.

    Fig. 2. The proposed architecture of the Convolutional Density Network of GMM.

    The main goal of CDA is to construct a CNN architecture that presents a multivariate mapping in the form of a GMM through offline learning. With the simulated probabilistic function, we aim to model the description of the most likely background scenes from actually observed data. In other words, the regularities in the proposed CNN should cover a generalized presentation of the intensity series of a set of consecutive frames at the pixel level. To achieve this proposition, instead of using a separate GMM for each pixel-wise statistical learning task, we consider using a single GMM to formulate the temporal history of all pixels in the whole image. Accordingly, the CDA architecture is extended spatially over the temporal data at image points, following the scheme defined in Fig. 2.

    The network output $y_T$, whose dimension is $(c+2) \times K$, is partitioned into three portions $y^\mu(\chi_T^c)$, $y^\sigma(\chi_T^c)$, and $y^\pi(\chi_T^c)$ of the GMM:

    $y_T = \left[y^\mu(\chi_T^c),\, y^\sigma(\chi_T^c),\, y^\pi(\chi_T^c)\right] = \left[y_1^\mu, \ldots, y_K^\mu,\; y_1^\sigma, \ldots, y_K^\sigma,\; y_1^\pi, \ldots, y_K^\pi\right].$   (4)

    With our goal of formulating the GMM, we impose different restrictions on the threefold outputs of the network:

    First, as the mixing coefficients $\pi_k$ indicate the proportion of data accounted for by mixture component $k$, they must be non-negative and sum to unity. To achieve this regulation, in principle, we activate the corresponding network outputs with a softmax activation function:

    $\pi_k(\chi_T^c) = \dfrac{\exp(y_k^\pi)}{\sum_{l=1}^{K} \exp(y_l^\pi)}.$   (5)

    Second, in realistic scenarios, the measured intensity of observed image signals may fluctuate due to a variety of factors, including illumination transformations, dynamic contexts, and bootstrapping. Hence, we restrict the variance of each component to the range $[\bar{\sigma}_{\min}, \bar{\sigma}_{\max}]$ so that each component neither spreads over the entire color space nor contracts to a single color point:

    $\sigma_k(\chi_T^c) = \dfrac{\bar{\sigma}_{\min} \times (1 - \hat{\sigma}_k) + \bar{\sigma}_{\max} \times \hat{\sigma}_k}{255},$   (6)
    where $\sigma_k(\chi_T^c)$ is normalized toward a range of $[0, 1]$ by the maximum color intensity value, 255; and $\hat{\sigma}_k$ is the normalized variance activated through a hard-sigmoid function from the output neurons $y^\sigma$ that correspond to the variances:
    $\hat{\sigma}_k(\chi_T^c) = \max\!\left[0,\, \min\!\left(1,\, \dfrac{2 \times y_k^\sigma + 5}{10}\right)\right].$   (7)
    In this work, we adopt the hard-sigmoid function because of its piecewise-linear property and its correspondence to a bounded form of the rectified linear unit (ReLU). Furthermore, it was proposed and shown to be more efficient in both software and specialized hardware implementations by Courbariaux et al.43

    Third, the mean of the probabilistic mixture is considered on a normalized RGB color space where the intensity values remain in the range $[0, 1]$, so that they can be approximated correspondingly with the normalized input. Similar to the normalized variance $\hat{\sigma}_k$, we have

    $\mu_k(\chi_T^c) = \max\!\left[0,\, \min\!\left(1,\, \dfrac{2 \times y_k^\mu + 5}{10}\right)\right].$   (8)

    From the proposed CNN, we extract the periodical background image for each block of pixel-wise time series of data over a period of $T$. This can be done by taking the weighted average of the estimated means, essentially summarizing the contextual dynamics of the scene into one background image.

    $BG(\chi_T^c) = \sum_{k=1}^{K} \pi_k(\chi_T^c)\, \mu_k(\chi_T^c).$   (9)
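    Putting Eqs. (4)-(9) together, the post-processing of a raw CDA output vector can be sketched as below (an illustrative NumPy snippet, not the released code; the variance bounds of 16 and 32 follow the values discussed later in Sec. 4.2):

```python
# Sketch of Eqs. (4)-(9): split y_T of size (c + 2) * K into (mu, sigma, pi),
# apply the activations, and compose the per-pixel background estimate.
import numpy as np

def hard_sigmoid(y):
    # max[0, min(1, (2y + 5) / 10)], as in Eqs. (7)-(8).
    return np.clip((2.0 * y + 5.0) / 10.0, 0.0, 1.0)

def decode_output(y_T, K, c, sigma_min=16.0, sigma_max=32.0):
    # Partition y_T = [y^mu (K*c values), y^sigma (K), y^pi (K)], Eq. (4).
    y_mu = y_T[: K * c].reshape(K, c)
    y_sigma = y_T[K * c : K * c + K]
    y_pi = y_T[K * c + K :]

    mu = hard_sigmoid(y_mu)                                              # Eq. (8)
    sigma_hat = hard_sigmoid(y_sigma)                                    # Eq. (7)
    sigma = (sigma_min * (1 - sigma_hat) + sigma_max * sigma_hat) / 255  # Eq. (6)
    pi = np.exp(y_pi) / np.sum(np.exp(y_pi))                             # softmax, Eq. (5)

    bg = np.sum(pi[:, None] * mu, axis=0)                                # Eq. (9)
    return pi, mu, sigma, bg

K, c = 3, 3
pi, mu, sigma, bg = decode_output(np.random.randn((c + 2) * K), K, c)
print(bg)  # one background color for this pixel, in normalized [0, 1] RGB
```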

    3.2. Learning posterior estimation

    In practice, particularly in each real-life scenario, the background model must capture multiple degrees of dynamics, which is more challenging by the fact that scene dynamics may also change gradually under external effects (e.g. lighting deviations). These effects convey the latest information regarding contextual deviations that may constitute new background predictions. Therefore, the modeling of backgrounds must not only take into account the various degrees of dynamics across multiple imaging pixels of the data source, but it must also be able to adaptively update its predictions concerning semantic changes.

    Equivalently, to approximate a statistical mapping function for background modeling, the proposed neural network function has to be capable of approximating a conditional PDF, thereby estimating a multi-modular distribution conditioned on its time-wise latest raw imaging inputs. The criteria for the neural statistical function to be instituted can be summarized as follows:

    By taking adaptiveness into account, the neural probabilistic density function can directly interpolate predictions in evolving scenes upon reception of new data.

    As a metric for estimating distributions, input data sequences cannot be weighted in terms of order.

    Hence, we have developed a self-supervised approach.

    3.2.1. Adaptive objective function

    To satisfy the first criterion, we propose to use an unsupervised loss function capable of directing CDA’s parameters toward adaptively capturing the conditional distribution of data inputs.

    At every single pixel, the proposed CNN estimates the probabilistic density function on the provided data by parameterizing the GMM. Specifically, given a set $\chi_T^c$ of vectorized data points, $\pi_k$, $\mu_k$ and $\sigma_k$ shall be functions parameterized by the set. Thus, Eq. (B.1) can be modified for the target $x$:

    $P(x) = \sum_{k=1}^{K} \pi_k(\chi_T^c)\, \mathcal{N}(x \mid \mu_k, \sigma_k),$   (10)
    where
    $\mathcal{N}(x \mid \mu_k, \sigma_k) = \dfrac{1}{\sqrt{(2\pi)^c\, \sigma_k^c(\chi_T^c)}} \exp\!\left\{-\dfrac{\|x - \mu_k(\chi_T^c)\|^2}{2\sigma_k(\chi_T^c)}\right\}.$   (11)
    In this loss objective, the data distribution to be approximated is the set of data points relevant to background construction. This is rationalized by the goal of directing the neural network’s variables toward generalizing universal statistical mapping functions. Even with constantly evolving scenes where the batches of data values also vary, this loss measure can constitute fair weighting on the sequence of inputs thanks to explicit design to capture various pixel-wise dynamics over a video scene, and encompass unseen perspectives.

    Practical modeling: We establish the mapping function on the RGB color space, which requires optimizing the loss not just on any 3-channeled pixel, but over $b = H \times W$ spatial blocks of image intensity data, along the temporal data axis $T$:

    $\mathcal{L} = \sum_{i}^{b} \mathcal{L}^{(i)}(\chi_T^c) = \sum_{i}^{b} \sum_{j}^{T} \mathcal{L}_j^{(i)}$   (12)
    with
    $\mathcal{L}_j^{(i)} = -\ln\!\left[\sum_{k=1}^{K} \pi_k^{(i)}\, \mathcal{N}\!\left(x_j \mid \mu_k^{(i)}, \sigma_k^{(i)}\right)\right],$   (13)
    where $x_j$ is the $j$th element of the $i$th time series $\chi_T^{c,(i)}$ of pixel values; $\pi^{(i)}$, $\mu^{(i)}$, and $\sigma^{(i)}$ are, respectively, the desired mixing coefficients, means, and variances that commonly model the distribution of $\chi_T^{c,(i)}$ in the GMM.

    We define $\mathcal{L}_j^{(i)}$ as the error function for our learned estimation on an observed data point $x_j$, given the locally relevant dataset $\chi_T^{c,(i)}$ for the neural function. $\mathcal{L}_j^{(i)}$ is the negative of the statistical log-likelihood. Hence, by minimizing this loss measure, we essentially maximize the expected likelihood of the GMM-based neural probabilistic density function $P(x)$.

    Employing stochastic gradient descent on the negative logarithmic function $\mathcal{L}_j^{(i)}$ not only involves monotonic decreases, which are steep close to zero, but upon convergence it also leads the proposed neural function to approach an optimized mixture-of-Gaussians PDF. In addition, since this loss function depends entirely on the input and the output of the network (i.e. without external data labels), it is completely unsupervised. Optimization of the function is intended for the network to generalize on new data that becomes available on the fly without labels.
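    The loss of Eqs. (12)-(13) reduces to a negative log-likelihood of each windowed pixel history under its own predicted mixture; a compact sketch (PyTorch assumed here as the framework, with a log-sum-exp for numerical stability) could look like:

```python
# Sketch of the unsupervised objective in Eqs. (12)-(13): negative log-likelihood
# of every observed pixel value x_j under the GMM predicted for its own window.
import math
import torch

def gmm_nll(x, pi, mu, sigma):
    """x: (b, T, c) pixel histories; pi, sigma: (b, K); mu: (b, K, c)."""
    c = x.shape[-1]
    diff = x.unsqueeze(2) - mu.unsqueeze(1)                    # (b, T, K, c)
    sq_dist = (diff ** 2).sum(dim=-1)                          # ||x_j - mu_k||^2
    log_norm = -0.5 * c * (math.log(2 * math.pi) + torch.log(sigma))   # log prefactor of Eq. (2)
    log_comp = log_norm.unsqueeze(1) - sq_dist / (2 * sigma.unsqueeze(1))
    # log sum_k pi_k N(x_j | mu_k, sigma_k), Eq. (13), via logsumexp for stability.
    log_mix = torch.logsumexp(torch.log(pi).unsqueeze(1) + log_comp, dim=-1)
    return -log_mix.sum()                                      # Eq. (12): sum over j, then i
```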

    Learning by back-propagation: Learning can only be achieved if we can obtain suitable equations for the partial derivatives of the error with respect to the outputs of the network. As described in the previous section, $y^\mu$, $y^\sigma$, and $y^\pi$ denote the proposed CDA's outputs that map to the latent variables of the GMM. The partial derivatives $\partial \mathcal{L}_j^{(i)} / \partial y^{(k)}$ can be evaluated for a particular pattern and then summed up to produce the derivative of the error function $\mathcal{L}$. To simplify the further analysis of the derivatives, it is convenient to introduce the following notation for the posterior probability of component $k$ in the mixture, using Bayes' theorem:

    $\Pi_k^{(i)} = \dfrac{\pi_k^{(i)}\, \mathcal{N}(x_j \mid \mu_k^{(i)}, \sigma_k^{(i)})}{\sum_{l=1}^{K} \pi_l^{(i)}\, \mathcal{N}(x_j \mid \mu_l^{(i)}, \sigma_l^{(i)})}.$   (14)

    First, we need to consider the derivatives of the loss function with respect to the network's outputs $y^\pi$ that correspond to the mixing coefficients $\pi_k$. Using Eqs. (B.14) and (B.15), we obtain

    $\dfrac{\partial \mathcal{L}_j^{(i)}}{\partial \pi_k^{(i)}} = -\dfrac{\Pi_k^{(i)}}{\pi_k^{(i)}}.$   (15)
    From this expression, we perceive that the value of $\pi_k^{(i)}$ explicitly depends on $y_\pi^{(l)}$ for $l = 1, 2, \ldots, K$, as $\pi_k^{(i)}$ is the result of the softmax mapping from $y_\pi^{(l)}$ as indicated in Eq. (B.6). We continue to examine the partial derivative of $\pi_k^{(i)}$ with respect to a particular network output $y_\pi^{(l)}$, which is
    $\dfrac{\partial \pi_k^{(i)}}{\partial y_\pi^{(l)}} = \begin{cases} \pi_k^{(i)}\left(1 - \pi_l^{(i)}\right) & \text{if } k = l, \\ -\pi_l^{(i)} \pi_k^{(i)} & \text{otherwise.} \end{cases}$   (16)
    By the chain rule, we have
    $\dfrac{\partial \mathcal{L}_j^{(i)}}{\partial y_\pi^{(l)}} = \sum_{k} \dfrac{\partial \mathcal{L}_j^{(i)}}{\partial \pi_k^{(i)}}\, \dfrac{\partial \pi_k^{(i)}}{\partial y_\pi^{(l)}}.$   (17)
    From Eqs. (B.15), (B.18), (B.19), and (B.21), we then obtain
    $\dfrac{\partial \mathcal{L}_j^{(i)}}{\partial y_\pi^{(l)}} = \pi_l^{(i)} - \Pi_l^{(i)}.$   (18)

    For $y_\sigma^{(k)}$, we make use of Eqs. (B.3), (B.7), (B.34), (B.14), and (B.15), by differentiation, to obtain

    $\dfrac{\partial \mathcal{L}_j^{(i)}}{\partial y_\sigma^{(k)}} = \dfrac{0.2\,(\bar{\sigma}_{\max} - \bar{\sigma}_{\min})}{255}\, \Pi_k \left(\dfrac{c}{2\sigma_k^{(i)}} - \dfrac{\|x_j - \mu_k\|^2}{2\left(\sigma_k^{(i)}\right)^2}\right)$   (19)
    for $-2.5 < y_\sigma^{(k)} < 2.5$. This is because of the piecewise property in the definition of the hard-sigmoid activation function.

    Finally, for $y_\mu^{(k)}$, let $\mu_{k,l}^{(i)}$ be the $l$th element of the mean vector, where $l$ is an integer lying in $[0, c)$, and suppose that $\mu_{k,l}^{(i)}$ corresponds to an output $y_\mu^{(k,l)}$ of the network. We can obtain the derivative of $\mu_{k,l}^{(i)}$ by taking Eqs. (B.3), (B.9), (B.14), and (B.15) into the differentiation process:

    $\dfrac{\partial \mathcal{L}_j^{(i)}}{\partial y_\mu^{(k,l)}} = -0.2 \times \Pi_k^{(i)} \left[\dfrac{x_{j,l}^{(i)} - \mu_{k,l}^{(i)}}{\sigma_k^{(i)}}\right]$   (20)
    for $-2.5 < y_\mu^{(k)} < 2.5$.

    From Eqs. (B.22), (B.34), and (B.31), we validate the primary conceptualization that the loss objective is differentiable and optimizable by our CDA formulation.
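    As a sanity check on the closed forms above, the derivative of Eq. (18) can be verified against finite differences; the short NumPy sketch below (illustrative only) fixes the means and variances and perturbs only the pre-softmax outputs:

```python
# Finite-difference check of Eq. (18): dL_j / dy_pi^(l) = pi_l - Pi_l.
import numpy as np

def loss(y_pi, x, mu, sigma):
    pi = np.exp(y_pi) / np.exp(y_pi).sum()                   # softmax, Eq. (5)
    c = x.shape[0]
    comp = np.exp(-np.sum((x - mu) ** 2, axis=1) / (2 * sigma)) \
           / np.sqrt((2 * np.pi) ** c * sigma ** c)
    return -np.log(np.sum(pi * comp))                        # Eq. (13)

rng = np.random.default_rng(0)
K, c = 3, 3
y_pi, x = rng.normal(size=K), rng.uniform(size=c)
mu, sigma = rng.uniform(size=(K, c)), rng.uniform(0.05, 0.2, size=K)

pi = np.exp(y_pi) / np.exp(y_pi).sum()
comp = np.exp(-np.sum((x - mu) ** 2, axis=1) / (2 * sigma)) / np.sqrt((2 * np.pi) ** c * sigma ** c)
Pi = pi * comp / np.sum(pi * comp)                           # posterior responsibilities, Eq. (14)
analytic = pi - Pi                                           # Eq. (18)

eps = 1e-6
numeric = np.array([(loss(y_pi + eps * np.eye(K)[l], x, mu, sigma)
                     - loss(y_pi - eps * np.eye(K)[l], x, mu, sigma)) / (2 * eps)
                    for l in range(K)])
print(np.max(np.abs(analytic - numeric)))                    # should be vanishingly small
```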

    3.2.2. Inducing permutation invariance

    True to fundamental theories in statistics, the posterior PDF on the history of intensity occurrences ought not to depend on the order of appearances. To satisfy the second requirement, the order of the inputs should not matter upon loading, which is proper for any statistical function that estimates PDFs. In other words, regardless of what sampled pixel values appear first, estimates of the population distribution only depend on their frequency. We propose an augmentation method to revise the loss objective as a self-supervised procedure to induce permutation invariance.

    We denote $\rho_n(\chi_T^c)$ as the $n$th permutation of $\chi_T^c$, such that $n$ is an integer and $\rho_1(\chi_T^c) = \chi_T^c$. Thus, we aim to satisfy

    $\pi_k(\chi_T^c) \approx \pi_k(\rho_n(\chi_T^c)), \quad \sigma_k(\chi_T^c) \approx \sigma_k(\rho_n(\chi_T^c)), \quad \mu_k(\chi_T^c) \approx \mu_k(\rho_n(\chi_T^c)),$   (21)
    which applies for all $n$ such that $1 \le n \le T!$.

    We implicitly drive the model's parameters to achieve condition (21) by a slight modification of the loss function on random samples of the integer $n$:

    $\mathcal{L} = \sum_{i}^{b} \mathcal{L}^{(i)}\left(\rho_n(\chi_T^c)\right),$   (22)
    which regularizes the model parameters for generalized inference across a diverse range of cases, as convolutional operations have demonstrated rotational and translational robustness, but not invariance to permutations of the input.
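    In practice, this amounts to shuffling the temporal axis of each training window before it is fed to the network, while the loss is still evaluated over the same (order-free) set of values; a minimal sketch (PyTorch assumed, helper names hypothetical) follows:

```python
# Sketch of the permutation augmentation behind Eqs. (21)-(22): feed the network a
# randomly permuted history rho_n(chi) so its GMM estimate cannot depend on ordering.
import torch

def permute_window(chi, generator=None):
    """chi: (b, T, c) batch of pixel histories; returns rho_n(chi) with a random n per sample."""
    b, T, _ = chi.shape
    perm = torch.stack([torch.randperm(T, generator=generator) for _ in range(b)])  # (b, T)
    return torch.gather(chi, 1, perm.unsqueeze(-1).expand_as(chi))

# Hypothetical training step: the loss of Eq. (22) is the same GMM negative
# log-likelihood as before, only computed with the permuted window as input.
#   pi, mu, sigma = decode(cda(permute_window(chi)))   # 'cda' and 'decode' are placeholders
#   loss = gmm_nll(chi, pi, mu, sigma)                 # same set of values, order-free
```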

    3.3. Background modeling for efficient foreground extraction

    In this section, we show that the utilization of background models can provide sufficient information for foreground extraction, thereby reducing the required computational expenses involved while maintaining decent accuracies. Hence, we developed a convolutional auto-encoder, called NeMos, to simulate nonlinear frame-background differencing for foreground detection on background models.

    Traditionally, thresholding schemes are employed to find the highlighted difference between an imaging input and its corresponding static view in order to segment motion. For example, Stauffer and Grimson7 employed variance thresholding on background-input pairs by modeling the static view with GMM. While experimental results suggest certain degrees of applicability due to its simplicity, the approach lacks flexibility as the background model is usually not static and may contain various motion effects such as occlusions, stopped objects and shadow effects.

    In practice, a good design of a difference function between the current frame and its background must be capable of facilitating segmentation across a plethora of scenarios and effects. However, given the countless scenarios in real life with unique image features and object behaviors, there is not yet any explicit mathematical model that is general enough to cover them all. Thus, effective subtraction requires a high degree of nonlinearity in order to approximate a model of the underlying mathematical framework. Following the universal approximation theorem,44 we design a technologically parallelizable neural function to approximate such a framework. Specifically, we make use of a CNN to construct a foreground segmentation network. The motivation is twofold, as listed below.

    CNNs have long been known for their effectiveness in approximating nonlinear functions with arbitrary accuracy.

    CNNs are capable of balancing between both speed and generalization accuracy, especially when given an effective design and enough representative training data.

    We exploit a pair of the current video frame and its corresponding background as the input to the neural function and extract a motion estimation. By combining this with a suitable learning objective, we explicitly provide the neural function with enough information to mold itself into a context-driven nonlinear difference function, thereby restricting model behavior and its search directions. This also allows us to scale down the network's parameter size, width, and depth to focus on learning representations while maintaining generalization for unseen cases. As empirically shown in the experiments, the proposed architecture is lightweight in terms of the number of parameters, and is also extremely resource-efficient.

    Compared to approaches that perform semantic segmentation on single images to cluster pixels of certain known classes in the training set (e.g. FgSegNet12), NeMos relies on existing input data to perform a learned pixel-wise subtraction procedure on input signals, conditioned on obvious and implicit distinctions. Essentially, given

    $I_t = BG_t + \phi(FG_t),$   (23)
    where $FG_t$ is a binary foreground map and $\phi$ represents the pixel-wise transformation function such that $\phi(FG_t) = I_t - BG_t$. Thus, to solve the equation $FG_t = \phi^{-1}(I_t - BG_t) = \Phi(I_t, BG_t)$, we seek to approximate the equivariant function $\Phi$ by a neural network that operates using all necessary information in $I_t$ and $BG_t$. This means we can not only circumvent expensive analytic operations on spatio-temporal 4D colored tensors, but can also avoid associating pixel regions with certain target classes that may require more data or heavier architectures for full semantic segmentation operations.

    3.3.1. Architectural design

    The overall flow of the NeMos is shown in Fig. 3. We employ the encoder–decoder design approach for our segmentation function. With this approach, data inputs are compressed into a low-dimensional latent space of learned informative variables in the encoder, and the encoded feature map is then passed into the decoder, thereby generating foreground masks.

    Fig. 3. The proposed architecture of NeMos grounded on a convolutional autoencoder for foreground detection.

    Not only do we reduce the network size compared to FgSegNet, we also employ the depthwise separable convolutions introduced in MobileNets45 so that our method can be suitable for mobile vision applications. Because this type of layer significantly scales down the number of convolutional parameters, we reduced the number of parameters of our network by approximately 81.7% compared to using only standard 2D convolutions, rendering a lightweight network of around 2,800 parameters. Interestingly, even with such a small set of parameters, the network does not lose its ability to generalize predictions at high accuracy. Our architecture also employs normalization layers, but only in the decoder. This design choice avoids the loss of information when projecting the contextual differences of background-input pairs into the latent space via the encoder, while applying normalization to boost the decoder's learning.

    Encoder. The encoder can be thought of as a folding function that projects the loaded data into an information-rich low-dimensional feature space. In our architecture, the background image estimated by CDA is concatenated with imaging signals such that raw information can be preserved for the neural network to freely learn to manipulate. Moreover, with the background image also in its raw form, context-specific scene dynamics (e.g. moving waves, camera jittering, intermittent objects) are also captured. In addition, by explicitly providing a pair of the current input frame and its background image to segment foregrounds, our designed network essentially constructs a simple difference function that is capable of extending its behaviors to accommodate contextual effects. Thus, we theorize that approximating this neural difference function would not require an enormous number of parameters. In other words, it is possible to reduce the number of layers and the weight size of the foreground extraction network to accomplish the task. Hence, the encoder only consists of a few convolutional layers, with two max-pooling layers for downsampling contextual attributes into a feature-rich latent space.

    Decoder. The decoder of our network serves to unfold the encoded feature map into the foreground space using convolutional layers with two upsampling layers to restore the original resolution of its input data. In order to facilitate faster training and better estimation of the final output, we engineered the decoder to include instance normalization, which is more efficient than batch normalization.46 Using upsampling to essentially expand the latent tensors, the decoder also employs convolutional layers to induce nonlinearity like the encoder.

    The final output of the decoder is a grayscale probability map where each pixel's value represents the chance that it is a component of a foreground object. We use the hard-sigmoid activation function because of its property of allowing faster gradient propagation, which results in less training time. At inference time, the final segmentation result is a binary image obtained by placing a constant threshold $\epsilon$, which is experimentally determined, on the generated probability map.
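    To make the description above concrete, a NeMos-style encoder-decoder can be sketched as follows in PyTorch; the layer counts and channel widths here are assumptions for illustration, not the published configuration:

```python
# Illustrative NeMos-like encoder-decoder: depthwise separable convolutions,
# max-pooling in the encoder, upsampling + instance normalization in the decoder,
# and a hard-sigmoid foreground probability map. Channel widths are assumed.
import torch
import torch.nn as nn

def sep_conv(cin, cout):
    # Depthwise separable convolution: depthwise 3x3 followed by pointwise 1x1.
    return nn.Sequential(
        nn.Conv2d(cin, cin, 3, padding=1, groups=cin),
        nn.Conv2d(cin, cout, 1),
        nn.ReLU(inplace=True),
    )

class NeMosLike(nn.Module):
    def __init__(self):
        super().__init__()
        # Input: current frame and CDA background concatenated along channels (3 + 3).
        self.encoder = nn.Sequential(
            sep_conv(6, 8), nn.MaxPool2d(2),
            sep_conv(8, 16), nn.MaxPool2d(2),
        )
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2), sep_conv(16, 8), nn.InstanceNorm2d(8),
            nn.Upsample(scale_factor=2), sep_conv(8, 8), nn.InstanceNorm2d(8),
            nn.Conv2d(8, 1, 1),
        )

    def forward(self, frame, background):
        z = self.encoder(torch.cat([frame, background], dim=1))
        logits = self.decoder(z)
        return torch.clamp(0.2 * logits + 0.5, 0.0, 1.0)    # hard sigmoid probability map

model = NeMosLike()
prob = model(torch.rand(1, 3, 240, 320), torch.rand(1, 3, 240, 320))
mask = (prob > 0.3).float()                                  # threshold epsilon = 0.3 (Sec. 4.2)
```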

    3.3.2. Loss objective

    We penalize the output of the network using the cross-entropy loss function $\mathcal{L}[x, y]$ commonly used for segmentation tasks, as the goal of the model is to threshold the value of each pixel. The loss function is described as follows:

    $\mathcal{L} = -\sum_{i=1}^{H} \sum_{j=1}^{W} \left[Y_{i,j}\, \log(\hat{Y}_{i,j}) + (1 - Y_{i,j})\, \log(1 - \hat{Y}_{i,j})\right],$   (24)
    where $Y$ is the corresponding target set of foreground binary masks for $\hat{Y}$. We minimize $\mathbb{E}(\mathcal{L})$ on batches of predicted foreground probability maps. The network is trained for about 1000 epochs for each sequence in CDnet using the Adam optimizer with a learning rate of 0.005. The designed architecture learns not only pixel-wise motion estimates of the training set, but it is also taught to recognize inherent dynamics in its data to accurately interpolate region-wise foreground predictions for unseen perspectives.
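    A single supervised update under this objective can be sketched as follows (PyTorch assumed; the helper takes any model with the frame-plus-background interface described above and applies the Adam learning rate of 0.005 stated in the text):

```python
# Sketch of one training step for the foreground network with the loss of Eq. (24).
import torch
import torch.nn.functional as F

def train_step(model, optimizer, frame, background, target_mask):
    """target_mask: (b, 1, H, W) binary ground-truth foreground labels."""
    prob = model(frame, background).clamp(1e-6, 1 - 1e-6)    # avoid log(0)
    loss = F.binary_cross_entropy(prob, target_mask)         # Eq. (24), averaged over pixels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# e.g. optimizer = torch.optim.Adam(model.parameters(), lr=0.005)
```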

    4. Experiments and Discussion

    4.1. Experimental setup

    In this section, we verify experimentally the capabilities of the proposed method via comparative evaluations. Our goals are to evaluate the effectiveness and efficiency of CDA and NeMos in background modeling and subtraction. Our proposed scheme is designed to explicitly incorporate probabilistic density properties into the architecture to achieve accurate adaptiveness, while taking advantage of the parallel computing technologies often used with DNNs to compete with state-of-the-art works in speed given its light structure. Therefore, we compare the accuracy of the proposed framework not only with unsupervised approaches that are lightweight and generalizable without pretraining: GMM — Stauffer & Grimson,7 GMM — Zivkovic,8 SuBSENSE,36 PAWCS,47 TensorMoG,10 BMOG,9 FTSG,37 SWCD,48 but also with data-driven, supervised models that trade computational expense for high accuracy: FgSegNet_S,12 FgSegNet,12 FgSegNet_v2,13 Cascade CNN,32 DeepBS,35 STAM.49

    First, in terms of BgS results, we employ quantitative analysis on the CDnet-201450 dataset. Our metrics are those that can be appraised from confusion matrices, i.e. Precision, Recall, F-Measure, False-Negative Rate (FNR), False-Positive Rate (FPR), and Percentage of Wrong Classification (PWC). With overall results being drawn from the combination of all confusion matrices across given scenarios, the benchmarks on CDnet-2014 were performed by comparing foreground predictions against provided ground-truths. Through our results, we observe the capabilities of NeMos in leveraging background models of CDA for context-driven BgS.
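    These detection metrics follow directly from the accumulated confusion matrix; as a quick reference, a sketch of the standard definitions used on CDnet is given below:

```python
# Standard confusion-matrix metrics used on CDnet (TP, FP, FN, TN accumulated
# over all frames of all scenarios).
def cdnet_metrics(tp, fp, fn, tn):
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)                            # False-Positive Rate
    fnr = fn / (tp + fn)                            # False-Negative Rate
    pwc = 100.0 * (fn + fp) / (tp + fp + fn + tn)   # Percentage of Wrong Classification
    return dict(Recall=recall, Precision=precision, FMeasure=f_measure,
                FPR=fpr, FNR=fnr, PWC=pwc)
```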

    Then, we proceed with an ablation study to evaluate the contribution of the background generator, CDA, to the overall architecture on the Scene Background Modeling (SBMnet) dataset.51 The metrics include AGE (Average Gray-level Error), pEPs (Percentage of Error Pixels), pCEPs (Percentage of Clustered Error Pixels), MS-SSIM (MultiScale Structural Similarity Index), PSNR (Peak-Signal-to-Noise-Ratio), and CQM (Color image Quality Measure). The first three measure the intensity-level error difference between the algorithm’s output with the provided ground-truth, where lower estimation values indicate better background estimates. In contrast to how the first three are sensitive to small variations and require intensity-level exactness with referenced ground-truth, the latter three focus on quantifying the visual and structural quality of the background image generated by an algorithm. As exact background images are virtually impossible to obtain due to unavoidable variations of the camera’s capturing process, these structural- and visual-focused metrics provide for more objectiveness in background evaluations against reference ground-truths (higher values indicate better results).

    Finally, we will also analyze all methods in terms of processing speed with the image resolution of 320×240 and draw final conclusions.

    4.2. Implementation

    In our experiments, the number of Gaussians $K$ is chosen empirically and heuristically to balance the CDA's capability of modeling constantly evolving contexts (e.g. a moving body of water) under the effects of potentially corruptive noises. If $K$ is too big, many GMM components may be unused, or they may simply capture various noises within the contextual dynamics. As the Gaussian component corresponding to the background intensity revolves around the most frequently occurring color subspaces to draw predictions, the extra components serve only as placeholders for abrupt changes in backgrounds, remain empty, or capture intermittent noises of various degrees. In practice, noise Gaussian components in the GMM are pulse-like, as they appear only for short durations, and low-weighted, because they are not matched as often as background components. Nevertheless, they still present corruptive effects to our model. Our proposed CDA model was set up with $K = 3$ Gaussian components for all experimented sequences, and was trained on the CDnet-2014 dataset with the Adam optimizer using a learning rate of $\alpha = 10^{-4}$.

    In addition, the constants $\bar{\sigma}_{\min}$ and $\bar{\sigma}_{\max}$ were chosen such that no Gaussian component spans the whole color space while not contracting to a single point that represents noises. If the $[\bar{\sigma}_{\min}, \bar{\sigma}_{\max}]$ interval is too small, all of the Gaussian components will be likely to focus on one single color cluster. Otherwise, if the interval is too large, some of the components might still cover all intensity values, making it hard to find the true background intensity. Based on this assumption and experimental observations, we find that the difference between color clusters usually does not exceed approximately 16 at minimum and 32 at maximum.

    Regarding NeMos, the value of $\epsilon$ was empirically chosen to be 0.3 to extract the foreground effectively even under high color similarity between objects and background.

    The training dataset for NeMos is chosen by hand so that the data maintains a balance between background labels and foreground labels since imbalanced data will increase the model’s likelihood of being overfitted. We chose just 200 labeled ground truths to train the model. This is only up to 20% of the number of labeled frames for some sequences in CDnet, and 8.7% of CDnet’s labeled data overall. During training, the associated background of each chosen frame is directly generated using CDA as NeMos is trained separately from CDA because of the manually chosen input-label pairs.

    4.3. Results on CDnet 2014 benchmarks

    With 53 video sequences (lengths varying from 1,000 to 7,000 frames) spread over 11 different scenarios, the CDnet-2014 dataset50 is currently the biggest, most comprehensive large-scale public dataset for evaluating algorithms in the field of online video Change Detection. Using it, we demonstrate empirically the effectiveness of our proposed approach across a plethora of scenarios and effects. For each thousands-frame sequence of a scenario, we sample only 200 foreground images for training our foreground estimator. This sampling strategy for supervised learning is the same as that of FgSegNet and Cascade CNN. The experimental results are summarized in Table 1, which highlights the F-measure results of our approach compared against several existing state-of-the-art approaches. Despite its compact architecture, the proposed approach is shown to be capable of significantly outperforming unsupervised methods, and of competing with complex deep-learning-based, supervised approaches in terms of accuracy.

    Table 1. F-measure comparisons over all of 11 categories in the CDnet 2014 dataset.

    Notes: Semi-unsupervised methods are marked. Experimented scenarios include bad weather (BDW), low frame rate (LFR), night videos (NVD), turbulence (TBL), baseline (BSL), dynamic background (DBG), camera jitter (CJT), intermittent object motion (IOM), shadow (SHD), thermal (THM), and pan-tilt-zoom (PTZ). In each column, the best, second-best, and third-best results are highlighted.

    In comparison with unsupervised models built on the GMM background modeling framework, like GMM — Stauffer & Grimson, GMM — Zivkovic, BMOG, and TensorMoG, the proposed approach is better augmented by the context-driven motion estimation plugin, without being constrained by simple thresholding schemes. Thus, it is able to provide remarkably superior F-measure results across the scenarios, especially on those with high degrees of noise or background dynamics like LFR, NVD, IOM, CJT, DBG and TBL. However, it is a little worse than TensorMoG on BDW, SHD, IOM, and CJT, which may be attributed to TensorMoG's carefully tuned hyperparameters for segmenting foregrounds, thereby suggesting that the proposed method is still limited, possibly by its architectural size and training data. A comparison with other unsupervised methods is also conducted, using mathematically rigorous approaches such as SuBSENSE, PAWCS, FTSG, and SWCD that are designed to tackle scenarios commonly seen in real life (i.e. BSL, DBG, SHD, and BDW). Nevertheless, the F-measure results of the proposed approach, at around 0.90, suggest that it is still able to outperform these complex unsupervised approaches, possibly owing to its use of hand-labeled data for explicitly enabling context capturing.

    In comparison with supervised approaches, the proposed approach is apparently very competitive against the more computationally expensive state of the art. For instance, our approach considerably surpasses the generalistic methods of STAM and DeepBS on LFR and NVD, but it loses against both of these methods on SHD and CJT, and is especially outperformed by STAM on many scenarios. While STAM and DeepBS are constructed using only 5% of CDnet-2014, they demonstrate good generalization capability across multiple scenarios by capturing the holistic features of their training dataset. However, despite being trained on all scenarios, their behaviors showcase higher degrees of instability (e.g. with LFR, NVD) than our proposed approach on scenarios that deviate from common features of the dataset. Finally, when comparing our proposed method against similarly scene-specific approaches like the FgSegNets and Cascade CNN, the results were within expectations: for almost all scenarios, ours is not significantly outperformed, as the compared models can accommodate various features of each sequence in their big architectures. Surprisingly, however, our method surpasses even these computationally expensive models to rank at the top in the LFR scenario. This suggests that, with a background for facilitating motion segmentation from an input, our trained model can tackle scenarios where objects are constantly changing and moving better than even the existing state of the art.

    Interestingly, NeMos+CDA on the PTZ sequence returns substantially correct results, even better than its performance on BDW, NVD, SHD, or CJT. It can be hypothesized that NeMos would rather work with averaged-out backgrounds to perform raw semantic extraction than with noisy motions (i.e. snow droplets, shadows, jitters, lighting shifts) for context-driven BgS. Nevertheless, the sub-dataset PTZ is more limited in terms of observable objects and images in the region and scenario of interest compared to others, as can also be observed in the poor performance of DeepBS, which attempted to generalize learning on imbalanced training data.

    Overall, with small training sets, NeMos+CDA achieved decent results in Precision, Recall, FPR, FNR, PWC, and a score of 0.8774 in average F-measure, which is much higher than any compared unsupervised approach and can practically compete with other, more computationally expensive, supervised approaches despite its lightweight structure. Table 2 presents the evaluation metrics of a confusion matrix.

    Table 2. Result of quantitative evaluation on CDnet 2014 dataset.

    Notes: Semi-unsupervised methods are marked. In each column, the best, second-best, and third-best results are highlighted.

    4.4. Results on SBMnet benchmarks

    We perform an empirical ablation study of how the background generator, CDA, contributes to the overall architecture with the SBMnet dataset51 for evaluating background estimation results. The SBMnet dataset has 80 real-life video sequences and their corresponding ground-truth backgrounds for references over eight scenarios (illumination changes, cluttering, camera jitter, intermittent motion, etc.). It is an often-used dataset to quantitatively evaluate background modeling algorithms. Some of the algorithms that do not model the background, e.g. FgSegNet, SWCD, etc., are left out by default. For brevity, Table 3 provides the overall quantitative rankings (across all dataset sequences) of the proposed method along with state-of-the-art background estimation algorithms which are originally based on Gaussian Mixture Estimation.

    Table 3. Comparison on the SBMnet dataset.

    Notes: In each column, the best, second-best, and third-best results are highlighted.

    In general, Table 3 demonstrates that the traditional GMM-based methods, GMM — Stauffer & Grimson, GMM — Zivkovic, and TensorMoG, are the top-performing methods in the background modeling domain. The proposed CDA module is outperformed by these traditional GMM-based algorithms in terms of the pixel-based metrics, i.e. AGE, pEPs, and pCEPs, which measure the intensity difference between the generated background and the ground-truth. However, the gains of CDA on visual quality measurements, i.e. MS-SSIM, PSNR, and CQM, signify that the background generated by CDA is competitive against the top GMM-based methods on the background estimation domain in terms of textural and semantic information compared to the ground-truth.

The shortcomings of the proposed background extraction method in exact background grayscale estimation show up clearly in the three metrics AGE, pEPs, and pCEPs, where lower values are better: the background component of the proposed method consistently falls out of the top three. There are two main possible reasons for this. First, the three grayscale-based metrics are highly sensitive to small variations in the estimated background, as they measure the estimate entirely by its absolute difference from the provided ground truth. In real life, however, obtaining a completely accurate background image is inherently impossible, since a camera cannot consistently capture the same signal for every pixel. Thus, while these metrics provide some degree of confidence in the quality of the computed background image, they cannot serve as an absolute determination of image quality. Second, our design of CDA focuses on speed efficiency with a small temporal window, in contrast to traditional GMM-based methods that can capture long-term pixel signals to estimate the background intensities with high accuracy. This tradeoff between efficiency and effectiveness results in a clear disadvantage for CDA in estimating the grayscale signal as closely as possible to the ground truth compared to the other five methods. However, the absolute difference of the CDA module from the top-performing method on each metric is still within an acceptable margin: 6.7557 in AGE (compared to GMM — Zivkovic), 0.0758 in pEPs (compared to GMM — Stauffer & Grimson), and 0.0741 in pCEPs (compared to TensorMoG).
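For reference, these pixel-level error metrics are straightforward to compute. The following minimal NumPy sketch illustrates AGE, pEPs, and pCEPs; the error threshold of 20 gray levels is the value commonly used by the SBMnet tooling and is an assumption here, not a detail taken from this paper.

```python
import numpy as np

def sbm_pixel_metrics(est_bg, gt_bg, tau=20):
    """Minimal sketch of AGE, pEPs, and pCEPs for grayscale backgrounds.

    est_bg, gt_bg : HxW uint8 grayscale images (estimated and ground-truth background).
    tau           : error threshold in gray levels (assumed value).
    """
    diff = np.abs(est_bg.astype(np.float64) - gt_bg.astype(np.float64))
    age = diff.mean()                      # Average Gray-level Error
    err = diff > tau                       # error-pixel mask
    peps = err.mean()                      # percentage of Error Pixels
    # clustered error pixels: error pixels whose 4-connected neighbours are also errors
    padded = np.pad(err, 1, mode="constant", constant_values=False)
    clustered = (err
                 & padded[:-2, 1:-1] & padded[2:, 1:-1]
                 & padded[1:-1, :-2] & padded[1:-1, 2:])
    pceps = clustered.mean()               # percentage of Clustered Error Pixels
    return age, peps, pceps
```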

In contrast, on the other three metrics (MS-SSIM, PSNR, and CQM), which measure the visual and structural distortion of estimated backgrounds against ground truths, the proposed CDA yields very good results. These metrics aim to quantify, as closely as possible, the errors in an artificially generated image's visual quality as perceived by humans; higher values correspond to better background estimates. The quality of CDA's background images is showcased by (1) the difference from the top method in MS-SSIM, TensorMoG with parameter tuning, being only a marginal 0.0364, and (2) CDA consistently ranking as the second-best method in PSNR and CQM. Thus, backgrounds approximated by CDA are images of decent quality, with good textural and semantic information relative to the ground truths.
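Of these, PSNR is the simplest to reproduce; a minimal sketch is shown below (MS-SSIM and CQM require their own dedicated implementations and are omitted here).

```python
import numpy as np

def psnr(est_bg, gt_bg, peak=255.0):
    """PSNR (dB) between an estimated background and its ground truth; higher is better."""
    mse = np.mean((est_bg.astype(np.float64) - gt_bg.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```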

As an unsupervised, generalist approach, although our proposed CDA module is less competitive against traditional GMM-based methods on grayscale error estimation of the background, the good results on visual quality metrics imply that the semantic information of the generated background and the input image are very similar. This semantic similarity between the background and the input frame likely suppresses a large number of background distractors from the input frame for the imbalanced foreground segmentation learning task (the number of pixels classified as background vastly exceeds the number of foreground pixels). However, it should be noted that CDA remains independent of NeMos, which means that any background generation algorithm can theoretically replace CDA in reducing background distractors.

Nevertheless, there are two main reasons to prefer CDA over other algorithms. First, because CDA maintains adaptation in cases where environmental changes happen often (e.g. illumination changes), like traditional GMM approaches, its learning to generalize over a small local temporal history is comparable to the slow, gradual adaptation of the GMM-based family of algorithms. Thus, with CDA providing suitable backgrounds (via feed-forward GMM approximations of windowed data) in a timely manner, NeMos is supported with suppression of distractors, which is a large part of why NeMos can be so lightweight while still maintaining effective context-driven segmentation. Second, and most importantly, CDA follows the more modern paradigm of background modeling with GMM: it is highly parallelizable on modern hardware and avoids the speed-throttling, sequential, pixel-wise background generation of methods such as GMM — Stauffer & Grimson, GMM — Zivkovic, SuBSENSE, and PAWCS.

    4.5. Computational speed comparison

The proposed framework was implemented on a CUDA-capable machine with an NVIDIA GTX 1070Ti GPU, along with the methods that require the CUDA runtime, i.e. TensorMoG, DeepBS, STAM, FgSegNet, and Cascade CNN. For unsupervised approaches, we conducted our speed tests on an Intel Core i7 with 16GB RAM. Our results are reported as execution performance in frames per second (fps), and as time (milliseconds) versus accuracy in Fig. 4. At an overall speed of 129.4510 fps (from about 3,500 parameters), with the CDA module (about 2,800 parameters) processing at 402.1087 fps, NeMos+CDA is much faster than the other supervised deep learning approaches, of which the fastest (FgSegNet_S) runs at 23.1275 fps. By concatenating estimations of background scenes with raw signals for foreground extraction, our approach makes efficient use of hardware resources thanks to its lightweight architecture and latent-space-limitation approach. In contrast, other DNNs are burdened with a large number of trainable parameters to achieve accurate input-target mapping. Furthermore, the proposed scheme dominates mathematically rigorous unsupervised frameworks such as SuBSENSE, SWCD, and PAWCS in both speed and accuracy, as their sequential processing paradigms incur significant execution penalties. Notably, the average speeds of the top three methods are dramatically disparate. With the objective of parallelizing the traditionally sequential outline of statistical learning on GMM, TensorMoG reformulates a tensor-based framework that surpasses our dual architecture at 302.5261 fps. On the other hand, GMM — Zivkovic's design focuses on optimizing its mixture components, significantly trading off accuracy to attain the highest speed.
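For context, throughput figures of this kind can be reproduced in spirit with a rough timing harness such as the sketch below; the frame resolution, warm-up count, and the hypothetical `model` argument are illustrative assumptions rather than the paper's exact protocol.

```python
import time
import torch

def measure_fps(model, frame_shape=(1, 3, 240, 320), n_iters=200, device="cuda"):
    """Rough frames-per-second measurement for any frame-wise torch.nn.Module.
    Frame shape and iteration counts are arbitrary choices made for illustration."""
    model = model.to(device).eval()
    x = torch.rand(frame_shape, device=device)
    sync = torch.cuda.synchronize if device == "cuda" else (lambda: None)
    with torch.no_grad():
        for _ in range(10):          # warm-up so one-off initialisation is not timed
            model(x)
        sync()
        start = time.perf_counter()
        for _ in range(n_iters):
            model(x)
        sync()
    return n_iters * frame_shape[0] / (time.perf_counter() - start)
```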


    Fig. 4. Computational speed and average F-measure comparison with state-of-the-art methods.

Notwithstanding, our proposed framework gives the most balanced trade-off (top-left-most in Fig. 4) in addressing the speed-and-accuracy dilemma. Our model outperforms other top-accuracy approaches while processing at exceptionally high speed, and obtains good accuracy scores: over 90% on more than half of CDnet's categories and at least 84% on the rest.

    5. Conclusion

This paper has developed a novel, two-stage BgS framework consisting of a GMM-based CNN for background modeling and a convolutional auto-encoder, NeMos, that simulates input-background subtraction for foreground detection; the framework can thus be viewed as a search-space-limitation approach to compressing a DNN model while maintaining good accuracy. Our first and second contributions are a pixel-wise, lightweight, feed-forward CNN representing a multi-modal conditional PDF of the temporal history of data, and a corresponding self-supervised training strategy that lets the CNN learn from virtually inexhaustible data to approximate the mixture-of-Gaussians density function. In this way, the proposed CDA not only gains better adaptation to contextual dynamics, with humanly interpretable statistical learning that can be extended, but is also designed in tensor form to exploit modern parallel hardware. Finally, we showed that incorporating such statistical features into NeMos's motion-region extraction phase promises more efficient use of powerful hardware, with prominent speed performance and high accuracy, along with decent generalization from a small-scale set of training labels, in a deep nonlinear scheme of only a few thousand parameters.

Since CDA constructs GMMs from each pixel's fixed-size temporal window, neighborhood information is not captured, while redundant temporal information may be incorporated. Inspired by DGCNN,52 which learns irregular neighborhood patterns through a GMM and optimizes the neural network kernels accordingly over graph data, we are investigating irregular patterns of convolution on spatio-temporal data to efficiently and adaptively distinguish background and foreground features. In particular, sampling window widths from the constructed GMMs can potentially overcome the issues of fixed-width temporal convolution, so an extension to spatio-temporal irregular convolution could address the efficiency issues commonly observed in 3D convolution.

    Acknowledgments

    This research is funded by the Vietnam National University HoChiMinh City (VNU-HCM) under grant number DS2022-28-04. We thank the Ho Chi Minh City International University — Vietnam National University (HCMIU-VNU) for facilitating this work. Our sincere appreciation also goes to our colleagues for their support, which significantly improved this paper.

    Appendix A. Formulation of Equation (2)

    Definition 3.2.1, Eqs. (3.2.1) and (3.2.2) in the textbook by Tong53 can be revisited in Fig. A.1.


    Fig. A.1. Snippet from p. 26 of Ref. 53.

We adopt the original mathematical formulation of the Gaussian distribution to the setting of background modeling on multi-channel videos. To avoid performing costly matrix inversion, each color channel in the color space is assumed to be distributed independently, so each Gaussian component in the mixture is simply described by a positive scalar variance value (i.e. $\sigma$). Hence, for each Gaussian distribution we get a positive definite covariance matrix (i.e. $\Sigma$), represented as a diagonal matrix with the same positive scalar along the diagonal. As a result, for general input data of $c$ dimensions (e.g. if the video is encoded in RGB, then $c=3$), the determinant of $\Sigma$ is the product of the same value $\sigma$ across the $c$ diagonal entries (i.e. $|\Sigma| = \sigma^c$), and the inverse of $\Sigma$ is simply $\Sigma^{-1} = \frac{1}{\sigma}I$, so that $\Sigma^{-1}\Sigma$ is the identity matrix.

$$\Sigma = \begin{bmatrix} \sigma & & \\ & \ddots & \\ & & \sigma \end{bmatrix}. \tag{A.1}$$
Equations (3.2.1) and (3.2.2) are combined and adopted as
$$f(x;\mu,\Sigma) = \frac{1}{(2\pi)^{\frac{c}{2}}|\Sigma|^{\frac{1}{2}}}\exp\!\left(-\frac{(x-\mu)^{\top}\Sigma^{-1}(x-\mu)}{2}\right),\quad x\in(\mathbb{R}^{+})^{c}, \tag{A.2}$$
then equivalently for our case, at the $k$th Gaussian component,
$$f(x;\mu_k,\Sigma_k) = \frac{1}{\sqrt{(2\pi)^{c}|\Sigma_k|}}\exp\!\left(-\frac{(x-\mu_k)^{\top}\Sigma_k^{-1}(x-\mu_k)}{2}\right) = \frac{1}{\sqrt{(2\pi)^{c}\sigma_k^{c}}}\exp\!\left(-\frac{\|x-\mu_k\|^{2}}{2\sigma_k}\right),\quad x\in(\mathbb{R}^{+})^{c}. \tag{A.3}$$
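As a concrete illustration, Eq. (A.3) can be transcribed numerically as in the minimal sketch below; the input scaling and the example variance are assumptions made only for illustration.

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """Eq. (A.3): c-dimensional Gaussian with a single shared scalar variance `sigma`.
    x, mu : length-c vectors; sigma : shared per-channel variance (a positive scalar)."""
    c = x.shape[-1]
    norm = np.sqrt((2.0 * np.pi) ** c * sigma ** c)
    return np.exp(-np.sum((x - mu) ** 2, axis=-1) / (2.0 * sigma)) / norm

# Example (values are arbitrary): density of an RGB sample near the component mean.
# gaussian_density(np.array([0.5, 0.5, 0.5]), np.array([0.48, 0.52, 0.5]), sigma=0.1)
```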

    Appendix B. Formulation Proof of CDA-GM

    B.1. Formulation of CDA-GM

Let $\chi_T^c = \{x_1, x_2, \ldots, x_T \mid x_i \in [0,255]^c\}$ be the time series of the $T$ most recently observed color signals of a pixel, where $c$ is the dimension of the vector $x_i$ in the color space. The distribution of pixel intensity $x_i$ can be modeled by a linear combination of $K$ probabilistic components $\theta_k$ and their corresponding conditional PDFs $P(x_i|\theta_k)$. The marginal probability $P(x_i)$ of the mixture is

$$P(x) = \sum_{k=1}^{K} P(\theta_k)\,P(x|\theta_k) = \sum_{k=1}^{K}\pi_k\,\mathcal{N}(x|\mu_k,\sigma_k), \tag{B.1}$$
where $\pi_k$ is a non-negative mixing coefficient for $\theta_k$:
$$\sum_{k=1}^{K}\pi_k = 1. \tag{B.2}$$
We use the re-formulated multivariate Gaussian distribution as
$$\mathcal{N}(x|\mu_k,\sigma_k) = \frac{1}{\sqrt{(2\pi)^{c}\sigma_k^{c}}}\exp\!\left(-\frac{\|x-\mu_k\|^{2}}{2\sigma_k}\right), \tag{B.3}$$
where $\mu_k$ is the estimated mean and $\sigma_k$ is the estimated universal covariance of the examined color channels in the $k$th Gaussian component $\theta_k$.
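For illustration, the marginal density of Eqs. (B.1)-(B.3) for one pixel value can be evaluated as in the sketch below; the array shapes are assumptions made for clarity, not an interface prescribed by the paper.

```python
import numpy as np

def gmm_density(x, pi, mu, sigma):
    """Eqs. (B.1)-(B.3): mixture density of a pixel value x under K components.
    x : (c,) pixel value; pi : (K,) mixing coefficients summing to 1;
    mu : (K, c) component means; sigma : (K,) shared per-channel variances."""
    c = mu.shape[1]
    norm = np.sqrt((2.0 * np.pi) ** c * sigma ** c)            # (K,)
    sq_dist = np.sum((x[None, :] - mu) ** 2, axis=1)           # (K,)
    return np.sum(pi * np.exp(-sq_dist / (2.0 * sigma)) / norm)
```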

Our proposed CNN formulates a conditional GMM density function of $x$ given a set of randomly selected, vectorized data points $\chi_T^c$:

$$y_T = f_\theta(\chi_T^c) \approx P(x\,|\,\chi_T^c), \tag{B.4}$$
where $f_\theta(\cdot)$ is a set of nonlinear transformations.

The network output $y_T$, whose dimension is $(c+2)\times K$, is partitioned into three portions $y^\mu(\chi_T^c)$, $y^\sigma(\chi_T^c)$, and $y^\pi(\chi_T^c)$ of the GMM:

$$y_T = \left[y^\mu(\chi_T^c),\, y^\sigma(\chi_T^c),\, y^\pi(\chi_T^c)\right] = \left[y_1^\mu,\ldots,y_K^\mu,\; y_1^\sigma,\ldots,y_K^\sigma,\; y_1^\pi,\ldots,y_K^\pi\right]. \tag{B.5}$$

    With our goal of formulating the GMM, we restate the three restrictions on network outputs:

First, as $\pi_k$ indicates the proportion of data accounted for by mixture component $k$, they are defined as independent, weighted scores:

$$\pi_k(\chi_T^c) = \frac{\exp(y_k^\pi)}{\sum_{l=1}^{K}\exp(y_l^\pi)}. \tag{B.6}$$

Second, we restrict the value of the variance of each component to the range $[\bar{\sigma}_{\min}, \bar{\sigma}_{\max}]$ so that the components do not span the entire color space:

$$\sigma_k(\chi_T^c) = \frac{\bar{\sigma}_{\min}\times(1-\hat{\sigma}_k) + \bar{\sigma}_{\max}\times\hat{\sigma}_k}{255}, \tag{B.7}$$
where $\hat{\sigma}_k$ is the normalized variance, activated through a hard-sigmoid function from the output neurons $y^\sigma$:
$$\hat{\sigma}_k(\chi_T^c) = \begin{cases} 0 & \text{if } y_k^\sigma < -2.5,\\ 0.2\times y_k^\sigma + 0.5 & \text{if } -2.5 \le y_k^\sigma \le 2.5,\\ 1 & \text{otherwise.}\end{cases} \tag{B.8}$$

Third, the mixture mean is standardized from the corresponding network outputs with a hard-sigmoid function:

$$\mu_k(\chi_T^c) = \begin{cases} 0 & \text{if } y_k^\mu < -2.5,\\ 0.2\times y_k^\mu + 0.5 & \text{if } -2.5 \le y_k^\mu \le 2.5,\\ 1 & \text{otherwise.}\end{cases} \tag{B.9}$$

We choose the hard-sigmoid function for both the means and the variances, as explained earlier.
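The three restrictions of Eqs. (B.6)-(B.9) can be summarized in a short sketch that maps raw head outputs onto valid GMM parameters; the variance bounds used here are placeholders, not the paper's tuned values.

```python
import numpy as np

def gmm_head(y_mu, y_sigma, y_pi, sigma_min=15.0, sigma_max=80.0):
    """Maps raw network outputs onto valid GMM parameters, following Eqs. (B.6)-(B.9).
    y_mu : (K, c), y_sigma : (K,), y_pi : (K,) raw outputs for one pixel.
    sigma_min/sigma_max stand in for sigma_bar_min/max (assumed values)."""
    hard_sigmoid = lambda z: np.clip(0.2 * z + 0.5, 0.0, 1.0)    # Eqs. (B.8)/(B.9)
    pi = np.exp(y_pi - y_pi.max())
    pi /= pi.sum()                                               # Eq. (B.6), softmax
    sigma_hat = hard_sigmoid(y_sigma)
    sigma = (sigma_min * (1.0 - sigma_hat) + sigma_max * sigma_hat) / 255.0   # Eq. (B.7)
    mu = hard_sigmoid(y_mu)                                      # Eq. (B.9), means in [0, 1]
    return pi, sigma, mu
```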

From the proposed CNN, we extract the periodic background image for each block of pixel-wise time-series data over a period of $T$ by taking the weighted average of the estimated means,

$$BG(\chi_T^c) = \sum_{k=1}^{K}\pi_k(\chi_T^c)\,\mu_k(\chi_T^c). \tag{B.10}$$

Accordingly, the corresponding frame-wise foreground mask of each input frame is extracted from the Gaussian mixtures at each pixel location. Specifically, we apply a threshold $\xi$ to the squared Mahalanobis distance between the input frame and the background distribution:

$$FG(x_T^c) = \left[\frac{\left(x_T^c - BG(\chi_T^c)\right)^{2}}{\tilde{\sigma}_t^{2}} > \xi\right], \tag{B.11}$$
where
$$\tilde{\sigma}_t^{2} = \max\!\left[\sigma_k^{2}(\chi_T^c)\,\hat{BG}_{k,T}(\chi_T^c)\right],\quad \text{for } k\in[1,K]. \tag{B.12}$$
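A minimal sketch of Eqs. (B.10) and (B.11) is given below; the threshold value is a placeholder, and the variance term of Eq. (B.12) is passed in precomputed rather than re-derived, both of which are assumptions made for illustration.

```python
import numpy as np

def extract_background(pi, mu):
    """Eq. (B.10): background colour as the mixing-coefficient-weighted mean.
    pi : (K,), mu : (K, c); returns a (c,) background colour."""
    return np.sum(pi[:, None] * mu, axis=0)

def extract_foreground(x, bg, sigma_tilde_sq, xi=9.0):
    """Eq. (B.11): threshold on the squared Mahalanobis-style distance.
    sigma_tilde_sq follows Eq. (B.12) and is supplied by the caller;
    xi=9.0 is a placeholder threshold, not the paper's tuned value."""
    dist_sq = np.sum((x - bg) ** 2) / sigma_tilde_sq
    return dist_sq > xi          # True means the pixel is classified as foreground
```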

    B.2. Unsupervised training via backpropagation

Specifically, given the set $\chi_T^c$ of randomly selected, vectorized data points, it is possible to retrieve the continuous conditional distribution of the data target $x$.

In our proposed loss function, the data distributions to be approximated are the very sets of data points relevant to background construction.

$$\mathcal{L} = \sum_{i}^{b}\sum_{j}^{T}\ell_j^{(i)}, \tag{B.13}$$
where
$$\ell_j^{(i)} = -\ln\!\left(\sum_{k=1}^{K}\pi_k^{(i)}\,\mathcal{N}\!\left(x_j\,\middle|\,\mu_k^{(i)},\sigma_k^{(i)}\right)\right), \tag{B.14}$$
where $x_j$ is the $j$th element of the $i$th time series $\chi_T^{c,(i)}$ of pixel values; $\pi^{(i)}$, $\mu^{(i)}$, and $\sigma^{(i)}$ are, respectively, the desired mixing coefficients, means, and variances that commonly model the distribution of $\chi_T^{c,(i)}$ in the GMM. We define $\ell_j^{(i)}$ as the error function for our learned estimate on an observed data point $x_j$, given the locally relevant dataset $\chi_T^{c,(i)}$ for the neural function. $\ell_j^{(i)}$ is based on the statistical log-likelihood function and is equal to its negative. Hence, by minimizing this loss measure, we essentially maximize the expectation of the GMM-based neural probability density function $P(x)$ over the history of pixel intensities at a pixel position.
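A direct numerical transcription of Eqs. (B.13) and (B.14) for one pixel's temporal window might look like the following sketch; the array shapes and the numerical-stability epsilon are assumptions made for illustration.

```python
import numpy as np

def gmm_nll(x_batch, pi, mu, sigma, eps=1e-12):
    """Eqs. (B.13)-(B.14): negative log-likelihood of observed pixel samples under the
    predicted mixture. x_batch : (T, c) samples from one pixel's temporal window;
    pi : (K,), mu : (K, c), sigma : (K,)."""
    c = mu.shape[1]
    norm = np.sqrt((2.0 * np.pi) ** c * sigma ** c)                          # (K,)
    sq = np.sum((x_batch[:, None, :] - mu[None, :, :]) ** 2, axis=2)         # (T, K)
    likelihood = np.sum(pi * np.exp(-sq / (2.0 * sigma)) / norm, axis=1)     # (T,)
    return -np.sum(np.log(likelihood + eps))                                 # sum over j of l_j
```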

The key question here is whether the neural network can learn to optimize this loss function with the standard stochastic gradient descent algorithm via back-propagation. To simplify the subsequent analysis of the derivatives using Bayes' theorem, it is convenient to introduce the following notation:

$$\Pi_k = \frac{\pi_k\,\mathcal{N}(x_j|\mu_k,\sigma_k)}{\sum_{l=1}^{K}\pi_l\,\mathcal{N}(x_j|\mu_l,\sigma_l)}. \tag{B.15}$$

First, we consider the derivatives of the loss function with respect to the network outputs $y^\pi$ that correspond to the mixing coefficients $\pi_k$. Using Eqs. (B.14) and (B.15), we obtain

$$\frac{\partial \ell_j}{\partial y_k^\pi} = \frac{\partial \ell_j}{\partial \pi_k}\,\frac{\partial \pi_k}{\partial y_k^\pi}. \tag{B.16}$$
Thus,
$$\frac{\partial \ell_j}{\partial \pi_k} = -\frac{\mathcal{N}(x_j|\mu_k,\sigma_k)}{\sum_{l=1}^{K}\pi_l\,\mathcal{N}(x_j|\mu_l,\sigma_l)} = -\frac{\Pi_k}{\pi_k}. \tag{B.17}$$
From this expression, we see that the value of $\pi_k$ explicitly depends on $y_l^\pi$ for $l = 1,2,\ldots,K$, as $\pi_k$ is the result of the softmax mapping from $y^\pi$ indicated in Eq. (B.6). We continue to examine the partial derivative of $\pi_k$ with respect to a particular network output $y_l^\pi$, which is
$$\frac{\partial \pi_k}{\partial y_l^\pi} = \begin{cases} \dfrac{\exp(y_k^\pi)\sum_{p=1}^{K}\exp(y_p^\pi) - \left[\exp(y_k^\pi)\right]^{2}}{\left[\sum_{p=1}^{K}\exp(y_p^\pi)\right]^{2}} & \text{if } k = l,\\[3ex] -\dfrac{\exp(y_l^\pi)\exp(y_k^\pi)}{\left[\sum_{p=1}^{K}\exp(y_p^\pi)\right]^{2}} & \text{otherwise,} \end{cases} \tag{B.18}$$
which can be simplified to
$$\frac{\partial \pi_k}{\partial y_l^\pi} = \begin{cases} \pi_k(1-\pi_k) & \text{if } k = l,\\ -\pi_l\pi_k & \text{otherwise.} \end{cases} \tag{B.19}$$
By the chain rule, we have
$$\frac{\partial \ell_j}{\partial y_l^\pi} = \sum_{k}\frac{\partial \ell_j}{\partial \pi_k}\,\frac{\partial \pi_k}{\partial y_l^\pi}; \tag{B.20}$$
thus,
$$\frac{\partial \ell_j}{\partial y_l^\pi} = \sum_{k}\left(-\frac{\Pi_k}{\pi_k}\right)\frac{\partial \pi_k}{\partial y_l^\pi} = \left(\sum_{k}\Pi_k\right)\pi_l - \Pi_l. \tag{B.21}$$
From Eqs. (B.15), (B.18), (B.19), and (B.21), and noting that $\sum_{k}\Pi_k = 1$, we then obtain
$$\frac{\partial \ell_j}{\partial y_l^\pi} = \pi_l - \Pi_l. \tag{B.22}$$
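As a quick sanity check on Eq. (B.22), the analytic gradient can be compared against a central finite-difference estimate; the sketch below does this for randomly drawn parameters (all values are synthetic and chosen only for illustration).

```python
import numpy as np

def check_pi_gradient(seed=0, K=4, c=3, eps=1e-6):
    """Finite-difference check of Eq. (B.22): d l_j / d y_l^pi = pi_l - Pi_l."""
    rng = np.random.default_rng(seed)
    y_pi = rng.normal(size=K)
    mu = rng.uniform(0.0, 1.0, size=(K, c))
    sigma = rng.uniform(0.05, 0.2, size=K)
    x = rng.uniform(0.0, 1.0, size=c)

    def loss(y):
        pi = np.exp(y - y.max()); pi /= pi.sum()                 # Eq. (B.6)
        norm = np.sqrt((2.0 * np.pi) ** c * sigma ** c)
        comp = pi * np.exp(-np.sum((x - mu) ** 2, axis=1) / (2.0 * sigma)) / norm
        return -np.log(comp.sum()), comp / comp.sum(), pi        # (l_j, Pi, pi)

    _, Pi, pi = loss(y_pi)
    analytic = pi - Pi                                           # Eq. (B.22)
    numeric = np.zeros(K)
    for l in range(K):
        d = np.zeros(K); d[l] = eps
        numeric[l] = (loss(y_pi + d)[0] - loss(y_pi - d)[0]) / (2 * eps)
    return np.max(np.abs(analytic - numeric))                    # should be negligibly small
```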

For $y_k^\sigma$, we make use of Eqs. (B.3), (B.7), (B.8), (B.14), and (B.15); by differentiation, we obtain

$$\begin{aligned}
\frac{\partial \ell_j}{\partial \sigma_k} &= -\frac{\partial}{\partial \sigma_k}\left[\ln\!\left(\sum_{l=1}^{K}\pi_l\frac{1}{\sqrt{(2\pi)^{c}\sigma_l^{c}}}\exp\!\left(-\frac{\|x_j-\mu_l\|^{2}}{2\sigma_l}\right)\right)\right]\\
&= -\frac{\frac{\partial}{\partial \sigma_k}\left[\sum_{l=1}^{K}\pi_l\frac{1}{\sqrt{(2\pi)^{c}\sigma_l^{c}}}\exp\!\left(-\frac{\|x_j-\mu_l\|^{2}}{2\sigma_l}\right)\right]}{\sum_{l=1}^{K}\pi_l\frac{1}{\sqrt{(2\pi)^{c}\sigma_l^{c}}}\exp\!\left(-\frac{\|x_j-\mu_l\|^{2}}{2\sigma_l}\right)}\\
&= -\frac{\pi_k\frac{1}{\sqrt{(2\pi)^{c}}}\frac{\partial}{\partial \sigma_k}\left[\frac{1}{\sqrt{\sigma_k^{c}}}\exp\!\left(-\frac{\|x_j-\mu_k\|^{2}}{2\sigma_k}\right)\right]}{\sum_{l=1}^{K}\pi_l\frac{1}{\sqrt{(2\pi)^{c}\sigma_l^{c}}}\exp\!\left(-\frac{\|x_j-\mu_l\|^{2}}{2\sigma_l}\right)}.
\end{aligned} \tag{B.23}$$

Let $A = \pi_k\frac{1}{\sqrt{(2\pi)^{c}}}$, and $B = \sum_{l=1}^{K}\pi_l\frac{1}{\sqrt{(2\pi)^{c}\sigma_l^{c}}}\exp\!\left(-\frac{\|x_j-\mu_l\|^{2}}{2\sigma_l}\right)$.

Thus,

$$\frac{\partial \ell_j}{\partial \sigma_k} = -\frac{A\,\frac{\partial}{\partial \sigma_k}\left[\frac{1}{\sqrt{\sigma_k^{c}}}\exp\!\left(-\frac{\|x_j-\mu_k\|^{2}}{2\sigma_k}\right)\right]}{B}. \tag{B.24}$$

We also let

$$C = \frac{\partial}{\partial \sigma_k}\left[\frac{1}{\sqrt{\sigma_k^{c}}}\exp\!\left(-\frac{\|x_j-\mu_k\|^{2}}{2\sigma_k}\right)\right] = \frac{\partial}{\partial \sigma_k}\left[\frac{\exp\!\left(-\frac{\|x_j-\mu_k\|^{2}}{2\sigma_k}\right)}{\sqrt{\sigma_k^{c}}}\right]. \tag{B.25}$$
Thus,
$$\begin{aligned}
C &= \frac{\left[\exp\!\left(-\frac{\|x_j-\mu_k\|^{2}}{2\sigma_k}\right)\left(\frac{\|x_j-\mu_k\|^{2}}{2}\right)\left(\frac{1}{\sigma_k^{2}}\right)\sqrt{\sigma_k^{c}}\right] - \left[\frac{c}{2}\sigma_k^{\frac{c}{2}-1}\exp\!\left(-\frac{\|x_j-\mu_k\|^{2}}{2\sigma_k}\right)\right]}{\sigma_k^{c}}\\
&= \frac{1}{\sigma_k^{c}}\exp\!\left(-\frac{\|x_j-\mu_k\|^{2}}{2\sigma_k}\right)\left[\frac{\|x_j-\mu_k\|^{2}}{2}\sigma_k^{\frac{c}{2}-2} - \frac{c}{2}\sigma_k^{\frac{c}{2}-1}\right]\\
&= \frac{1}{\sqrt{\sigma_k^{c}}}\exp\!\left(-\frac{\|x_j-\mu_k\|^{2}}{2\sigma_k}\right)\left[\frac{\|x_j-\mu_k\|^{2}}{2\sigma_k^{2}} - \frac{c}{2\sigma_k}\right].
\end{aligned} \tag{B.26}$$

So, plugging in $A$, $B$, and $C$, we get

$$\frac{\partial \ell_j}{\partial \sigma_k} = -\frac{\pi_k\frac{1}{\sqrt{(2\pi)^{c}\sigma_k^{c}}}\exp\!\left(-\frac{\|x_j-\mu_k\|^{2}}{2\sigma_k}\right)}{\sum_{l=1}^{K}\pi_l\frac{1}{\sqrt{(2\pi)^{c}\sigma_l^{c}}}\exp\!\left(-\frac{\|x_j-\mu_l\|^{2}}{2\sigma_l}\right)}\left[\frac{\|x_j-\mu_k\|^{2}}{2\sigma_k^{2}} - \frac{c}{2\sigma_k}\right] = \Pi_k\left[\frac{c}{2\sigma_k} - \frac{\|x_j-\mu_k\|^{2}}{2\sigma_k^{2}}\right]. \tag{B.27}$$
We also have
$$\frac{\partial \sigma_k}{\partial \hat{\sigma}_k} = \frac{\bar{\sigma}_{\max} - \bar{\sigma}_{\min}}{255}, \tag{B.28}$$
and
$$\frac{\partial \hat{\sigma}_k}{\partial y_k^\sigma} = \begin{cases} 0.2 & \text{if } -2.5 \le y_k^\sigma \le 2.5,\\ 0 & \text{otherwise.}\end{cases} \tag{B.29}$$
Thus,
$$\frac{\partial \ell_j}{\partial y_k^\sigma} = \frac{\partial \ell_j}{\partial \sigma_k}\,\frac{\partial \sigma_k}{\partial \hat{\sigma}_k}\,\frac{\partial \hat{\sigma}_k}{\partial y_k^\sigma} = \frac{0.2\,(\bar{\sigma}_{\max} - \bar{\sigma}_{\min})}{255}\,\Pi_k\left[\frac{c}{2\sigma_k} - \frac{\|x_j-\mu_k\|^{2}}{2\sigma_k^{2}}\right] \tag{B.30}$$

for $-2.5 < y_k^\sigma < 2.5$. This is because of the piecewise property in the definition of the hard-sigmoid activation function.
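Eq. (B.30) can be verified in the same way inside the linear region of the hard-sigmoid; in the sketch below, the variance bounds stand in for $\bar{\sigma}_{\min}$ and $\bar{\sigma}_{\max}$ and, like all the sampled values, are arbitrary.

```python
import numpy as np

def check_sigma_gradient(seed=1, K=3, c=3, eps=1e-6, s_min=15.0, s_max=80.0):
    """Finite-difference check of Eq. (B.30) inside the linear region of the hard-sigmoid."""
    rng = np.random.default_rng(seed)
    y_sig = rng.uniform(-2.0, 2.0, size=K)         # stay strictly inside (-2.5, 2.5)
    pi = rng.dirichlet(np.ones(K))
    mu = rng.uniform(0.0, 1.0, size=(K, c))
    x = rng.uniform(0.0, 1.0, size=c)

    def sigma_of(y):
        s_hat = np.clip(0.2 * y + 0.5, 0.0, 1.0)                  # Eq. (B.8)
        return (s_min * (1 - s_hat) + s_max * s_hat) / 255.0      # Eq. (B.7)

    def loss(y):
        sigma = sigma_of(y)
        norm = np.sqrt((2 * np.pi) ** c * sigma ** c)
        comp = pi * np.exp(-np.sum((x - mu) ** 2, axis=1) / (2 * sigma)) / norm
        return -np.log(comp.sum()), comp / comp.sum(), sigma

    _, Pi, sigma = loss(y_sig)
    d2 = np.sum((x - mu) ** 2, axis=1)
    analytic = 0.2 * (s_max - s_min) / 255.0 * Pi * (c / (2 * sigma) - d2 / (2 * sigma ** 2))
    numeric = np.array([(loss(y_sig + eps * np.eye(K)[k])[0]
                         - loss(y_sig - eps * np.eye(K)[k])[0]) / (2 * eps) for k in range(K)])
    return np.max(np.abs(analytic - numeric))      # should be negligibly small
```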

Finally, for $y_k^\mu$, let $\mu_{k,l}$ be the $l$th element of the mean vector, where $l$ is an integer in $[0, c)$, and suppose that $\mu_{k,l}$ corresponds to an output $o_k^\mu$ of the network. We obtain the derivative with respect to $\mu_{k,l}$ by taking Eqs. (B.3), (B.9), (B.14), and (B.15) into the differentiation process:

$$\begin{aligned}
\frac{\partial \ell_j}{\partial \mu_k} &= -\frac{\partial}{\partial \mu_k}\left[\ln\!\left(\sum_{l=1}^{K}\pi_l\frac{1}{\sqrt{(2\pi)^{c}\sigma_l^{c}}}\exp\!\left(-\frac{\|x_j-\mu_l\|^{2}}{2\sigma_l}\right)\right)\right]\\
&= -\frac{\frac{\partial}{\partial \mu_k}\left[\sum_{l=1}^{K}\pi_l\frac{1}{\sqrt{(2\pi)^{c}\sigma_l^{c}}}\exp\!\left(-\frac{\|x_j-\mu_l\|^{2}}{2\sigma_l}\right)\right]}{\sum_{l=1}^{K}\pi_l\frac{1}{\sqrt{(2\pi)^{c}\sigma_l^{c}}}\exp\!\left(-\frac{\|x_j-\mu_l\|^{2}}{2\sigma_l}\right)}\\
&= -\frac{\pi_k\frac{1}{\sqrt{(2\pi)^{c}\sigma_k^{c}}}\frac{\partial}{\partial \mu_k}\left[\exp\!\left(-\frac{\|x_j-\mu_k\|^{2}}{2\sigma_k}\right)\right]}{\sum_{l=1}^{K}\pi_l\frac{1}{\sqrt{(2\pi)^{c}\sigma_l^{c}}}\exp\!\left(-\frac{\|x_j-\mu_l\|^{2}}{2\sigma_l}\right)}\\
&= -\frac{\pi_k\frac{1}{\sqrt{(2\pi)^{c}\sigma_k^{c}}}\exp\!\left(-\frac{\|x_j-\mu_k\|^{2}}{2\sigma_k}\right)}{\sum_{l=1}^{K}\pi_l\frac{1}{\sqrt{(2\pi)^{c}\sigma_l^{c}}}\exp\!\left(-\frac{\|x_j-\mu_l\|^{2}}{2\sigma_l}\right)}\,\frac{\partial}{\partial \mu_k}\left[-\frac{\|x_j-\mu_k\|^{2}}{2\sigma_k}\right].
\end{aligned} \tag{B.31}$$

Then, we get

$$\frac{\partial \ell_j}{\partial \mu_k} = \frac{\Pi_k}{2\sigma_k}\,\frac{\partial}{\partial \mu_k}\left[[x_j-\mu_k]^{\top}[x_j-\mu_k]\right] = -\frac{\Pi_k}{\sigma_k}\,[x_j-\mu_k]. \tag{B.32}$$

For the data at each color channel $l$, we have

$$\frac{\partial \mu_{k,l}}{\partial y_{k,l}^\mu} = \begin{cases} 0.2 & \text{if } -2.5 \le y_{k,l}^\mu \le 2.5,\\ 0 & \text{otherwise.}\end{cases} \tag{B.33}$$

Thus, for $-2.5 < y_{k,l}^\mu < 2.5$,

$$\frac{\partial \ell_j}{\partial y_{k,l}^\mu} = \frac{\partial \ell_j}{\partial \mu_{k,l}}\,\frac{\partial \mu_{k,l}}{\partial y_{k,l}^\mu} = -0.2\times\Pi_k\left[\frac{x_{j,l}-\mu_{k,l}}{\sigma_k}\right]. \tag{B.34}$$
From Eqs. (B.22), (B.30), and (B.34), when CDA-GM performs data-driven learning individually on each video sequence using the Adam optimizer with a learning rate of $\alpha$, the process regulates the values of the latent parameters of the mixture model by minimizing the negative log-likelihood function.
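Under these gradients, training reduces to ordinary back-propagation. A condensed PyTorch sketch of the objective and a single Adam step is given below; the hypothetical `model`, the variance bounds, and the normalized input range are assumptions made for illustration, not the released implementation.

```python
import math
import torch
import torch.nn.functional as F

def hard_sigmoid(z):                       # Eqs. (B.8)/(B.9)
    return torch.clamp(0.2 * z + 0.5, 0.0, 1.0)

def nll_loss(y, window, K, sigma_min=15.0, sigma_max=80.0):
    """y : (B, (c+2)K) raw head outputs per pixel; window : (B, T, c) normalised samples.
    sigma_min/sigma_max stand in for sigma_bar_min/max (assumed values); see Eq. (B.7)."""
    B, T, c = window.shape
    y_mu, y_sigma, y_pi = torch.split(y, [c * K, K, K], dim=1)
    mu = hard_sigmoid(y_mu).view(B, K, c)                                     # Eq. (B.9)
    s_hat = hard_sigmoid(y_sigma)                                             # Eq. (B.8)
    sigma = (sigma_min * (1 - s_hat) + sigma_max * s_hat) / 255.0             # Eq. (B.7)
    pi = F.softmax(y_pi, dim=1)                                               # Eq. (B.6)
    sq = ((window.unsqueeze(2) - mu.unsqueeze(1)) ** 2).sum(-1)               # (B, T, K)
    log_norm = 0.5 * (c * math.log(2 * math.pi) + c * torch.log(sigma))       # (B, K)
    log_comp = (torch.log(pi + 1e-12).unsqueeze(1)
                - log_norm.unsqueeze(1)
                - sq / (2 * sigma.unsqueeze(1)))
    return -torch.logsumexp(log_comp, dim=2).sum()                            # Eqs. (B.13)-(B.14)

# One optimisation step with Adam (model is any network producing the (c+2)K head):
# opt = torch.optim.Adam(model.parameters(), lr=alpha)
# loss = nll_loss(model(batch_windows), batch_windows, K)
# opt.zero_grad(); loss.backward(); opt.step()
```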

    ORCID

    Synh Viet-Uyen Ha  https://orcid.org/0000-0002-5056-8337

    Tien-Cuong Nguyen  https://orcid.org/0000-0001-9084-8977

    Hung Ngoc Phan  https://orcid.org/0000-0003-1909-1793

    Phuong Hoai Ha  https://orcid.org/0000-0001-8366-5590