An Analysis of the Cluster-Detecting Property of the BCM Neuron
Abstract
The BCM learning rule, named for Elie Bienenstock, Leon Cooper, and Paul Munro, was first proposed to explain the development of stimulus selectivity in neurons of the primary visual cortex and its dependence on the neuronal input environment. We show that an artificial BCM neuron has the ability to detect clusters in a dataset. By exploring the qualitative behaviors of an underlying system of differential equations, we present a rigorous mathematical analysis of this cluster-detecting property. While the focus of this work is not to develop a robust state-of-the-art clustering method, we also analyze and discuss the performance of a resulting preliminary clustering algorithm.
1. Introduction
It is well corroborated that when trained in the right environment, a sensory neuron becomes selective; that is, it responds to a certain class of stimuli but not others. Experiments by Hubel and Wiesel, for instance, revealed that neurons of the visual cortex of cats may respond preferentially to stimulus orientation.1 Their studies also show that these neurons detect edges regardless of where — in the receptive field of the neuron — they are placed, and can preferentially detect motion in certain directions.1,2,3 While all sensory neurons are selective, their preference and degree of selectivity vary.4,5,6,7
Synaptic plasticity is an umbrella term used to describe how the strength of the synapse between a neuron and a stimulus — or between a neuron and another neuron — is modified over time. One of the most cited theories in this field is the Hebbian theory of synaptic plasticity.8,9 Hebb in 1949 proposed that when neuron A repeatedly participates in firing neuron B, the strength of the influence of A onto B increases. If we think of A as the pre-synaptic neuron and B as the post-synaptic neuron, Hebbian theory implies that changes in synaptic strengths in a neural network are a function of the pre- and post-synaptic neural activities. A few decades later, Nass and Cooper10 developed a Hebbian synaptic modification theory for the synapses of the visual cortex, which was later extended to a threshold-dependent setup by Cooper et al.11 In this setup, the sign of a synaptic strength modification is based on whether the post-synaptic response is below or above a static threshold. A response above the threshold is meant to strengthen the active synapse, and a response below the threshold should lead to a weakening of the active synapse.
One of the most accurate models of synaptic plasticity to date is the BCM learning rule proposed by Elie Bienenstock, Leon Cooper, and Paul Munro in 1982. By incorporating a dynamic threshold that is a function of the average post-synaptic activity over time, the BCM learning rule captures the development of stimulus selectivity in the primary visual cortex of higher vertebrates.12 Mathematically speaking, the BCM learning rule is a system of differential equations involving the synaptic strengths (also known as synaptic weights), the stimulus being received by the neuron, the activity response of the neuron to the stimulus, and the threshold for the activity. Unlike its predecessors, which use static thresholds to modulate neuronal activity response, the BCM learning rule allows the threshold to be dynamic. This dynamic threshold provides stability to the learning rule, and from a biological perspective, provides homeostasis to the neural system.
In corroborating the BCM learning rule, Shouval et al. showed that a network of BCM neurons develops orientation selectivity and ocular dominance in a natural scene environment.13 The BCM learning rule has been reformulated and adapted to suit various interaction environments of biological neural networks, including laterally interacting neurons14,15 and stimulus-generalizing neurons.4 The BCM learning rule has also been used to demonstrate a few machine learning applications of neuronal selectivity. Intrator et al. showed that a BCM neural network can perform projection pursuit,16,17,18 i.e., it can find projections in which a dataset departs from statistical normality. This is an important finding that highlights the feature-detecting property of a BCM neural model. As a result, the BCM neural network has been successfully applied to some specific pattern recognition tasks. For example, Bachman et al. incorporated the BCM learning rule in their algorithm for classifying radar data.19 Intrator et al. developed an algorithm for recognizing 3D objects from 2D views by combining existing statistical feature extraction models with the BCM model.20,21 In Ref. 22, Poljovka and Benuskova presented a preliminary simulation on how the BCM learning rule has the potential to identify alphanumeric letters.
In this work, we show that a BCM neuron has the ability to detect clusters in a dataset. Our focus is not to develop a robust state-of-the-art clustering method, but to present a rigorous mathematical analysis of this cluster-detecting property. In Sec. 2, we discuss the BCM learning rule as a system of stochastic differential equations and explain its variables and parameters as well as how synaptic weight training is achieved. In Sec. 3, we illustrate a simulation in the simplest case of a single neuron with two weights and two different competing stimuli. We then extend this simulation to competing clusters of data points and note the variations in selectivity behavior as a function of the sizes of the clusters and the angular distance between them. In Sec. 4, we derive the mean field system of equations for the BCM neuron and show that there are changes in stability as the synaptic threshold time scale changes. In Sec. 5, we present a resulting preliminary clustering algorithm, analyze its computational complexity, and compare its performance with benchmark algorithms that perform similar tasks.
2. Mathematical Methods
We model a neuron with the following variables:
x={x1,x2,…,xn}, a vector representing the stimulus pattern being received by the neuron.
w={w1,w2,…,wn}, a vector of synaptic weights by which the components of x are scaled.
v=w⋅x, the response of the neuron to the stimulus x.
This model is usually referred to as the linear model, in that the response is a weighted linear combination of each component of the stimulus pattern. See Fig. 1(a) for a visual representation of the linear neuron model.

Fig. 1. (a) A linear neuron. x={x1,x2,…,xn} is a vector representing the stimulus pattern being received by the neuron. w={w1,w2,…,wn} is a vector of synaptic weights by which the components of x are scaled. v=w⋅x is the response of the neuron to the stimulus x. (b) A nonlinear function ϕ of the post-synaptic neuronal activity, v, and a threshold θ, of the activity. For low values of the post-synaptic activity (i.e., v<θ), ϕ is negative; for high values of the post-synaptic activity (i.e., v>θ), ϕ is positive.
For a single neuron, the BCM learning rule expresses the changes in the synaptic weights, w, as a product of the input stimulus vector, x, and a nonlinear function, ϕ, of the neuronal response, v, and a dynamic threshold, θ, of the response. If at any time the neuron receives a stimulus x from a stimulus set, say {x(1),x(2),…,x(n)}, the synaptic weight vector, w, evolves according to the BCM rule as follows:
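In outline, writing E[·] for a moving temporal average of the response and p>1 for a fixed exponent (and omitting the exact normalization of the threshold used in Ref. 12), the rule takes the form

$$\frac{dw}{dt} \;=\; \phi\big(v(t),\theta(t)\big)\,x(t), \qquad \theta(t) \;=\; E^{p}\big[v(t)\big].$$

This is a structural sketch of the original formulation rather than a verbatim restatement of it; the temporal-average threshold E^p[v] is what the later formulations discussed next replace.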
Later formulations of the learning rule — for instance, the one presented in Ref. 16 — have shown that a spatial average can be used in lieu of a temporal average, and that, with p=2, E[v^p] is a good approximation of E^p[v]. We can also replace the moving temporal average of v with a first-order low-pass filter. Thus, a differential form of the learning rule (as presented in Ref. 23) is
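A sketch of this differential form, written with the time-scale factors τw and τθ used throughout our simulations and with the quadratic threshold dynamics referred to in Sec. 6, is

$$\tau_w\,\frac{dw}{dt} \;=\; \phi(v,\theta)\,x, \qquad \tau_\theta\,\frac{d\theta}{dt} \;=\; v^{2}-\theta,$$

where v=w⋅x. One common choice of nonlinearity, consistent with the selective responses obtained in Theorem 1 below, is ϕ(v,θ)=v(v−θ).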
Once its dynamics are at a stable steady state, the BCM neuron is selective. This means that it exhibits high-response activities to a relatively small subset of the stimuli presented to it and exhibits very low-response activities to other stimuli. With data points as stimuli, our conjecture is that points that are close enough — in the Euclidean sense — to each other would yield comparable responses. Thus, based on this conjecture, the main focus of this work is to explore the ability of the BCM neuron to detect a cluster within a set of data points.
3. Preliminary Experiments
We proceed to numerically simulate different scenarios of the BCM learning rule that are vital to the present discussion.
3.1. Two stimuli, two weights
Consider a BCM neuron that receives a stimulus input x stochastically from a set {x(1),x(2)} with equal probabilities; that is, Pr[x(t)=x(1)]=Pr[x(t)=x(2)]=1/2. We create a simple hybrid stochastic system where the value of x switches between the pair {x(1),x(2)} as in a two-state Markov process. Figure 2 shows results when τw=100, τθ=10, the stimuli are x(1)=(cosπ/24,sinπ/24) and x(2)=(sinπ/24,cosπ/24) — respectively, the red dot and the blue dot on the left panel — and the initial conditions of w1, w2, and θ lie in the interval (0,0.5). We concede that the choice of π/24 is arbitrary; however, the decision to employ unit vectors for the stimuli is motivated by analyses presented in later sections. The two plots in the right panel of Fig. 2 show the response of the neuron to each of the two stimuli over time. The colors of the plots correspond to the colors of the stimuli. As can be inferred from the steady state behavior of the neuron, it becomes selective and chooses x(1), with a response of approximately 2, over x(2), which yields a response of approximately 0. We note that since the presentation of stimuli is stochastic, it is possible that if this experiment is repeated, the neuron will choose x(2) over x(1).
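The following is a minimal sketch of this experiment (not the exact simulation code behind Fig. 2), assuming forward-Euler integration, an assumed step size and step count, and the quadratic nonlinearity ϕ(v,θ)=v(v−θ) sketched in Sec. 2:

```python
# Minimal sketch of the two-stimulus experiment of Sec. 3.1 (assumed step size dt).
import numpy as np

rng = np.random.default_rng(0)

tau_w, tau_theta = 100.0, 10.0          # time-scale factors used in Fig. 2
dt, n_steps = 0.1, 200_000              # step size / number of steps are assumptions
a = np.pi / 24
stimuli = np.array([[np.cos(a), np.sin(a)],     # x^(1)
                    [np.sin(a), np.cos(a)]])    # x^(2)

w = rng.uniform(0.0, 0.5, size=2)       # initial weights in (0, 0.5)
theta = rng.uniform(0.0, 0.5)           # initial threshold in (0, 0.5)

for _ in range(n_steps):
    x = stimuli[rng.integers(2)]        # each stimulus presented with probability 1/2
    v = w @ x                           # linear response v = w . x
    w += dt / tau_w * v * (v - theta) * x        # dw/dt = phi(v, theta) x / tau_w
    theta += dt / tau_theta * (v * v - theta)    # dtheta/dt = (v^2 - theta) / tau_theta

v1, v2 = stimuli @ w
print(f"steady-state responses: v1 = {v1:.2f}, v2 = {v2:.2f}")
```

On most runs the printed responses approach (2, 0) or (0, 2), reflecting the selectivity discussed above.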

Fig. 2. Left: Two stimuli presented to a BCM neuron. x(1)=(cosπ/24,sinπ/24) is red and x(2)=(sinπ/24,cosπ/24) is blue. Right: response of the neuron over time to each of the two stimuli. The colors of the plots correspond to the colors of the stimuli. The initial conditions of w1, w2, and θ lie in the interval (0,0.5). τw=100 and τθ=10.
3.2. Cluster analysis
We define a stimulus cluster as a group of stimuli with some sort of similarity. If the stimuli are represented as vectors — as we have been doing — the center of a cluster is thus the mean of all the vectors in the cluster. Motivated by the setup and results of Sec. 3.1, we construct an artificial set of stimuli whose members have the form x=(cosα+noise,sinα+noise) or x=(sinα+noise,cosα+noise). This is equivalent to creating two clusters with centers (cosα,sinα) and (sinα,cosα).
According to a theorem in Ref. 12, a BCM neuron is most selective when the set of stimulus patterns is linearly independent. When α=π/4, the two centers described above coincide. When α=−π/4, the angle between the two centers is π, implying that they point in opposite directions and are thus linearly dependent. Therefore, it is important to restrict α to −π/4<α<π/4.
Lemma 1. For any two angles β1 and β2, there exists an angle α such that a uniform rotation converts the set of unit vectors {x(1)=(cosβ1,sinβ1),x(2)=(cosβ2,sinβ2)} to {x(1)=(cosα,sinα),x(2)=(sinα,cosα)}.
Proof. It suffices to find the required rotation angle. If this angle is h, then α is such that rotating x(1) by h gives (cosα,sinα), i.e., β1+h=α.
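The remaining steps can be sketched as follows, matching angles modulo 2π. Rotating x(2) by the same h must give

$$\big(\cos(\beta_2+h),\,\sin(\beta_2+h)\big)=(\sin\alpha,\,\cos\alpha)=\big(\cos(\tfrac{\pi}{2}-\alpha),\,\sin(\tfrac{\pi}{2}-\alpha)\big),$$

i.e., β2+h=π/2−α. Solving the two equations β1+h=α and β2+h=π/2−α gives

$$h=\frac{\pi}{4}-\frac{\beta_1+\beta_2}{2}, \qquad \alpha=\frac{\pi}{4}+\frac{\beta_1-\beta_2}{2},$$

which completes the proof.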
Thus, for the sake of our analysis in this section and beyond, we measure the distance between two stimuli as the non-reflex angle between the two vectors representing the points. The prescribed range of α above captures the desired angular measure because any value outside this range would result in a reflex angle. Figure 3 pictorially explains the relationship between the two centers for α>0 and α<0.

Fig. 3. Cluster centers defined as pattern vectors x=(cosα,sinα) and x=(sinα,cosα) for (a) α>0 (b) α=0 (c) α<0.
Now, let C1 be the cluster whose center is (cosα,sinα) and C2 the cluster whose center is (sinα,cosα). We parameterize these clusters as follows:
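The parameterization, repeated here from the caption of Fig. 4 (where b controls the tightness of the clusters, the ϵ's are noise terms, and N1 and N2 are the cluster sizes), is

$$C_1=\big\{x^{(i)}=(\cos\alpha+b\epsilon_i,\ \sin\alpha+b\epsilon_i)\ \big|\ b>0,\ -1<\epsilon_i<1,\ i=1,2,\dots,N_1\big\},$$
$$C_2=\big\{x^{(j)}=(\sin\alpha+b\epsilon_j,\ \cos\alpha+b\epsilon_j)\ \big|\ b>0,\ -1<\epsilon_j<1,\ j=1,2,\dots,N_2\big\}.$$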
We proceed to train a BCM neuron by presenting to the neuron a stimulus from C1∪C2 at each iteration. Each row of Fig. 4 shows the result of carrying out an instance of this simulation. In each instance, the initial conditions of w1, w2, and θ lie in the interval (0,0.5), τw=100, and τθ=10. We keep the clusters in each instance moderately tight by setting b=0.35. The left column shows the distribution of the stimuli on the xy plane; the middle column shows plots of the responses of the neuron to the cluster centers. As in Fig. 2, the colors of the plots correspond to the colors of the cluster centers. The right column shows box plot distributions of the responses.

Fig. 4. C1={x(i)=(cosα+bϵi,sinα+bϵi) | b>0, −1<ϵi<1, i=1,2,…,N1} and C2={x(j)=(sinα+bϵj,cosα+bϵj) | b>0, −1<ϵj<1, j=1,2,…,N2}. The initial conditions of w1, w2, and θ lie in the interval (0,0.5), τw=100, τθ=10, b=0.35. (a) N1=50, N2=50, and α=0 (b) N1=50, N2=50, and α=π/12 (c) N1=70, N2=30, and α=π/12.
In Fig. 4(a), N1=50, and N2=50. With α=0, the clusters in this case are clearly far enough apart from each other. At steady state, it can be seen that the responses to the centers are stable, and the neuron selects the center of cluster C1. The box plot shows that the responses to each stimulus cluster in fact form a cluster of their own and that there is no intersection between the clusters of responses. The responses to the stimulus cluster whose center is selected by the neuron are higher than the responses to the stimulus cluster whose center is not selected.
In Fig. 4(b), N1=50, and N2=50. With α=π/12, the clusters are still far enough from each other, although not as far apart as in Fig. 4(a). At steady state, it can be seen that the responses to the centers are stable, and the neuron selects the center of cluster C2. The box plots show the responses to the stimulus clusters. A conclusion similar to that made for the box plots in Fig. 4(a) can also be made here.
In Fig. 4(c), N1=70, and N2=30. With α=π/12, the clusters in this case are still far enough from each other. At steady state, it can be seen that the responses to the centers are still stable, and the neuron selects the center of cluster C2. Note that the response to the center of C2 here is not as high as the response to C2 in Fig. 4(b).
We have shown that as long as the clusters are far enough from each other, the BCM neuron will exhibit a clear difference between the two sets of responses to the two different clusters. Note that the ratio of N1 to N2 is a factor in determining the value of the response to the selected cluster center. This will be considered in future analysis.
3.3. Oscillatory responses
We have observed that the response values, and thus the selectivity, of the neuron depend on the ratio of the sizes of the clusters. This also implies that in a two-stimulus case, the relative values of the responses depend on the probabilities with which the stimuli are presented. Our studies have also shown that when the threshold, θ, changes more slowly than the weights, w, the dynamics of the BCM neuron take on a different kind of behavior: the responses oscillate. Figure 5 shows results for a slightly modified version of the simulation performed in Sec. 3.1. Again the two stimuli are x(1)=(cosπ/24,sinπ/24) and x(2)=(sinπ/24,cosπ/24) — respectively, the red dot and the blue dot on the left panel — and the initial conditions of w1, w2, and θ lie in the interval (0,0.5). Also Pr[x(t)=x(1)]=Pr[x(t)=x(2)]=1/2; however, τw=10 while τθ=135.

Fig. 5. Left: Two stimuli presented to a BCM neuron. x(1)=(cosπ/24,sinπ/24) is red and x(2)=(sinπ/24,cosπ/24) is blue. Right: oscillatory response of the neuron over time to each of the two stimuli. The colors of the plots correspond to the colors of the stimuli. The initial conditions of w1, w2, and θ lie in the interval (0,0.5). τw=10 and τθ=135.
As can be seen, the dynamics of the system do not yield a stable steady state; they give rise to an oscillation, albeit a stable one. This behavior does not lend itself well to selectivity. Let v1=w⋅x(1) be the response of the neuron to x(1), and v2=w⋅x(2) the response of the neuron to x(2). Even though v1 appears to be always higher than v2, the overall ranges of v1 and v2 intersect. The oscillatory behavior observed here has to do with the new choices of τw=10 and τθ=135 and will be studied further in Sec. 4.
4. Steady State and Stability
Although the scope of this work does not include an in-depth study of the dynamical system of the BCM learning rule — which has been explored in other work, including Ref. 24 — we use this section to explore the plausible stability regimes of the learning rule as it relates to the goal of this work. The dynamics of the BCM neuron (Eq. (2)) are stochastic in nature since, at each time step, the neuron randomly receives one stimulus out of a set of stimuli. One way to gain more insight into the nature of these dynamics is to study a mean-field deterministic approximation of the learning rule. If the rate of change of the stimuli is rapid compared to that of the weights and threshold, then we can average over the fast timescale to get a mean field or averaged model and then study this through the usual methods of dynamical systems. Consider a BCM neuron that receives a stimulus input x stochastically from the set {x(1)=(x11,x12,…,x1n),x(2)=(x21,x22,…,x2n)} such that Pr[x(t)=x(1)]=ρ and Pr[x(t)=x(2)]=1−ρ. A mean field equation for the synaptic weights is
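A sketch of this averaged system, obtained by weighting the right-hand sides of the learning rule by the presentation probabilities and written with v1=w⋅x(1) and v2=w⋅x(2), is

$$\tau_w\,\frac{dw}{dt} \;=\; \rho\,\phi(v_1,\theta)\,x^{(1)} + (1-\rho)\,\phi(v_2,\theta)\,x^{(2)}, \qquad \tau_\theta\,\frac{d\theta}{dt} \;=\; \rho\,v_1^{2} + (1-\rho)\,v_2^{2} - \theta.$$

The fixed points discussed below are written in the coordinates (v1,v2,θ) of this averaged system.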
We will ignore the fixed points (0,0,0) and (1,1,1), as they are neither stable nor selective — and thus not useful to the task at hand. References 12 and 24 discuss the stability of these fixed points as it pertains to the original formulation. References 14 and 16 give a similar treatment to the objective function formulation. In the context of the main focus of this paper, the following theorem summarizes the steady state result detailed above.
Theorem 1. Assume that a BCM neuron is trained with a set of two points {x(1),x(2)}, and that the data points are presented with respective probabilities ρ and 1−ρ. Let the responses of the neuron to x(1) and x(2) be v1 and v2, respectively. Then, at selective steady state, either (v1,v2)=(1/ρ,0) or (v1,v2)=(0,1/(1−ρ)).
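As a short verification, assuming the quadratic nonlinearity ϕ(v,θ)=v(v−θ) sketched in Sec. 2, these values follow directly from the averaged system: at a fixed point with v2=0 and v1≠0,

$$\theta=\rho v_1^{2} \quad\text{and}\quad \phi(v_1,\theta)=0 \;\Rightarrow\; v_1=\theta=\rho v_1^{2} \;\Rightarrow\; v_1=\frac{1}{\rho},$$

and, symmetrically, v2=1/(1−ρ) when v1=0.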
As a side note, we observe that when ρ=1/2, the selective fixed point is either (2,0,2) or (0,2,2), which corresponds to the results seen in Figs. 2, 4(a) and 4(b). The plots in these figures suggest that the combination of parameters we employed in the simulations allows for stability of the fixed point in each case. Note, however, that ρ=1/2 in Fig. 5 as well, yet the plots show that the fixed point in this case is not stable. We proceed to explore the stability of these fixed points as a function of the angle between the stimuli, the probability factor of the stimuli, ρ, and the ratio of τθ to τw.
To simplify our calculations, we translate the vectors x(1) and x(2) so that they share a common tail and assume — without loss of generality — that they have equal, unit magnitudes, i.e., x(1)⋅x(1)=x(2)⋅x(2)=1. To further simplify the calculations, we let τ=τθ/τw, β=x(1)⋅x(2), and r=ρ/(1−ρ). Note that r∈(0,∞), with r=1 being the case of equal probability.
The Jacobian of Eq. (7) is
Applying a similar treatment to z2≡(0, 1/(1−ρ), 1/(1−ρ)), we obtain the Jacobian
Collating the discussion that led to Eqs. (12) and (15), we obtain the following theorem about the stability of our selective steady state.
Theorem 2. Let β=x(1)⋅x(2), τ=τθ/τw, and r=ρ/(1−ρ). Then the qualitative behaviors of the steady state solutions presented in Theorem 1 can be summarized as follows:
(i) (v1,v2)=(1/ρ,0) is linearly asymptotically stable if and only if τ<1 and (r+τ−τr)(r−τ+β²+1)+τr(β²−1)>0.
(ii) (v1,v2)=(0,1/(1−ρ)) is linearly asymptotically stable if and only if τ<1 and (r+τr−τ)(1−τr+β²τr+r)+r(β²−1)>0.
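As a worked example of these conditions, take r=1 (equal presentation probabilities). The second inequality in (i) then reduces to

$$2(1-\tau)+\beta^{2}(1+\tau)>0,$$

and the one in (ii) to

$$(1-\tau)+\beta^{2}(1+\tau)>0,$$

both of which hold automatically whenever τ<1. Thus, for equally probable stimuli, linear stability of either selective fixed point reduces to the single requirement τ=τθ/τw<1, consistent with the stable behavior of Figs. 2 and 4 (where τ=0.1) and with the oscillations of Fig. 5 (where τ>1).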
5. A Single-Neuron BCM Clustering Algorithm
We propose a working preliminary algorithm that makes use of the selectivity property of the BCM neuron to cluster a dataset. The algorithm for this task involves the variables X, n, d, v, w, θ, R, τw, τθ, numItr, κ, and δ, characterized as follows:
(1) X is an n×d matrix containing the dataset.
(2) n is the number of data points in the dataset.
(3) d is the number of attributes of each data point, i.e., the dimension of the dataset.
(4) v is the activity of the neuron.
(5) w is a d-dimensional vector of the synaptic weights of the neuron. To avoid convergence to a degenerate fixed point, w should not be initialized with the zero vector.
(6) θ is the activity threshold of the neuron.
(7) R is an n-dimensional vector. The element Ri corresponds to the response of the neuron to the ith data point of X.
(8) τw is the time-scale factor for w.
(9) τθ is the time-scale factor for θ. As has been shown in Secs. 3 and 4, steady state stability is guaranteed if τθ/τw<1.
(10) numItr is the number of iterations during which the neuron is trained.
(11) κ and δ are cluster-detection cutoff factors. δ is the expected average difference in neuronal responses between two data points in the same cluster; this is usually very small (on the order of 0.1). κδ is the expected minimum difference in neuronal responses between two data points in two different clusters, so κ should be chosen such that κδ is large enough to reflect this. However, as can be deduced from the algorithms that follow, if κ is too large, points belonging to different clusters may be grouped together. Other than by experimentation, there is currently no hard rule for determining the values of κ and δ.
The mechanism of the proposed single-neuron BCM clustering algorithm (Algorithm 2) can be described as detect and eliminate. This means that it trains the BCM neuron with the dataset, detects the high-response data points, designates these points as a cluster, and eliminates them from the dataset. It then repeats this procedure until the dataset is empty. The algorithm can easily be modified to accept a pre-defined number of desired clusters.
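The following is a minimal sketch of this detect-and-eliminate procedure. It is not the exact Algorithm 2 listing (whose line numbers are referred to in Sec. 5.2): the train_bcm helper is an assumed implementation of the single-neuron training step, and the cutoff rule based on κ and δ is one plausible reading of the description given above.

```python
# Sketch of a single-neuron BCM clustering procedure (detect and eliminate).
import numpy as np

def train_bcm(X, tau_w=10.0, tau_theta=1.0, num_itr=10_000, dt=0.1, rng=None):
    """Train one BCM neuron on the rows of X and return its response to every row."""
    rng = rng or np.random.default_rng()
    n, d = X.shape
    w = rng.uniform(0.0, 0.3, size=d)           # avoid the zero vector
    theta = rng.uniform(0.0, 0.3)               # assumed initialization for theta
    for _ in range(num_itr):
        x = X[rng.integers(n)]                  # present a random data point
        v = w @ x
        w += dt / tau_w * v * (v - theta) * x   # assumed phi(v, theta) = v (v - theta)
        theta += dt / tau_theta * (v * v - theta)
    return X @ w                                # vector R of responses

def bcm_cluster(X, kappa=10.0, delta=0.05, **train_kwargs):
    """Detect-and-eliminate clustering with a single BCM neuron."""
    remaining = np.arange(len(X))
    labels = np.full(len(X), -1)
    cluster_id = 0
    while remaining.size > 0:
        R = train_bcm(X[remaining], **train_kwargs)
        order = np.argsort(R)[::-1]             # sort responses, highest first
        gaps = -np.diff(R[order])               # drops between consecutive responses
        big = np.flatnonzero(gaps > kappa * delta)
        cut = big[0] + 1 if big.size else remaining.size  # first large drop marks the cluster edge
        labels[remaining[order[:cut]]] = cluster_id       # high-response points form a cluster
        remaining = remaining[order[cut:]]                 # eliminate them and repeat
        cluster_id += 1
    return labels
```

In this sketch, the responses are sorted, the first gap larger than κδ is taken as the boundary of the high-response cluster, and the corresponding points are removed before the neuron is retrained on the remainder.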
5.1. Performance
The primary goal of this work is to analyze the cluster-detection property of the BCM neuron. For this reason, we choose datasets that will properly demonstrate this goal. We concede that future work will include fine-tuning the above preliminary clustering algorithm to account for probability distribution and other relevant factors. With that being said, we proceed to apply Algorithm 2 on some select datasets. In each simulation, τw=10, τθ=1, numItr=10,000, κ=10, δ=0.05, and w is initialized with numbers in the interval (0,0.3).
Artificially generated dataset: A dataset is said to be linearly separable if for each cluster in the dataset, there exists a hyperplane that can be used to separate the cluster from the remainder of the dataset. The datasets in Fig. 6 are generated in the same manner that the ones in Fig. 4 were generated. Figures 6(a) and 6(b) show the best performances of Algorithm 2 on two linearly separable datasets. The numbers on the plots show the order in which the clusters were detected. Figure 6(c) shows the performance of the algorithm on a dataset that is not linearly separable. In each case, we specified the number of desired clusters.
Iris dataset: This dataset was introduced by Fisher in 1936,27 and can be found on the UCI machine learning repository.28 The set contains 150 data points on three species of the Iris flower, comprising 50 samples from the Iris setosa species, 50 from Iris virginica, and 50 from Iris versicolor. It measures (in cm) the following four attributes for each sample: the length of the sepal, the width of the sepal, the length of the petal, and the width of the petal. This dataset is difficult to analyze by clustering since it contains only two obviously separable clusters (within the three groups): one cluster contains Iris setosa, while the other contains both Iris virginica and Iris versicolor and is not separable without the species information Fisher used. Figure 7(a) shows a 3D projection of the Iris data points. The cyan points are samples of Iris setosa, the purple points are samples of Iris virginica, and the yellow points are samples of Iris versicolor. Figure 7(b) shows a performance of Algorithm 2 that comes very close to mirroring the species clusters of the dataset. Again, the numbers on the plots show the order in which the clusters were detected, and the number of desired clusters, 3, was pre-defined.
Cardiotocography dataset: This dataset consists of measurements of fetal heart rate (FHR) and uterine contraction (UC) features on cardiotocograms classified by expert obstetricians. This dataset can be found on the UCI machine learning repository.28 The dataset consists of 2126 samples, and each sample has 23 attributes. With respect to morphological patterns, each sample in the dataset can be placed into one of 10 groups A, B, C, D, …, J. Each of the samples is already pre-classified to be in one of these groups. We set Algorithm 2 to detect exactly 10 clusters in order to see how the clusters compare to the original 10 groups in the dataset. We first choose a representative centroid for each of the 10 groups A, B, C, D, …, J. This representative centroid is the point in the group that is closest to the mean of the group. Before clustering, we associate each data point with a representative centroid from the original grouping. We thus consider a point to be correctly clustered if, after the clustering, it belongs in the same cluster as its associated representative centroid. Over 100 runs of Algorithm 2, we found the average percentage of correctly clustered points to be 79.23%.

Fig. 6. (a) and (b) Best performances of Algorithm 2 on two linearly separable datasets (c) Performance of the algorithm on a dataset that is not linearly separable. The numbers on the plots show the order in which the clusters were detected.

Fig. 7. (a) The cyan points are samples of Iris setosa, the purple points are samples of Iris virginica, and the yellow points are samples of Iris versicolor (b) A performance of Algorithm 2 that comes very close to mirroring the species clusters of the dataset. The numbers on the plots show the order in which the clusters were detected.
To further evaluate the performance of Algorithm 2, we compute the Davies–Bouldin (DB) index29 for the clusters produced by the algorithm, as well as the DB index for clusters produced by a few other popular clustering algorithms. The DB index takes into consideration a combination of the similarity of points within clusters and the dissimilarity between clusters, and is calculated in the following way:
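In its standard form, which is consistent with the symbol definitions that follow, the index is

$$DB \;=\; \frac{1}{n}\sum_{i=1}^{n}\max_{j\neq i}\frac{\sigma_i+\sigma_j}{d(c_i,c_j)},$$

where: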
n is the number of clusters.
cx is the centroid of cluster x.
σx is the average distance of all elements in cluster x to centroid cx.
d(ci,cj) is the distance between centroids ci and cj.
In a clustering result with high intracluster similarity and low intercluster similarity, the numerator of each term in the summation above should be smaller than the denominator. Thus, a reasonable DB index should be less than one. In addition, the algorithm that yields the best clustering result should have the lowest DB index.
Table 1 shows the DB indices for different algorithms on the datasets described above. Each algorithm is adjusted to yield the same number of clusters associated with each dataset, and each number is the average DB index over 100 runs. As can be seen, Algorithm 2 performs very well when compared with the other clustering algorithms. The bolded numbers highlight cases where, of all the algorithms, Algorithm 2 gives the best results. As expected, the algorithm performs best when the clusters are clearly linearly separable.
Dataset | BCM (Algorithm 2) | k-means | EM | CLINK
---|---|---|---|---
Figure 6(a) dataset (3 clusters) | **0.3001** | 0.3228 | 0.3180 | 0.3110
Figure 6(b) dataset (4 clusters) | 0.3009 | 0.3100 | 0.3013 | 0.2910
Figure 6(c) dataset (3 clusters) | **0.3611** | 0.4200 | 0.3991 | 0.4710
Iris dataset (3 clusters) | 0.3910 | 0.4012 | 0.4024 | 0.3211
Cardiotocography dataset (10 clusters) | **0.3411** | 0.3510 | 0.3521 | 0.4437
5.2. Convergence and computational complexity
The BCM clustering algorithm uses Euler's method to solve the system of ordinary differential equations in the BCM learning rule. The convergence rate of Euler's method can be determined analytically for Lipschitz continuous functions, but is considerably more difficult to determine for functions that are not Lipschitz continuous, especially when stochasticity is involved, as in Algorithm 2. Even though the BCM learning rule is not Lipschitz continuous and determining its convergence rate is not within the scope of this study, our experimental simulations (see Fig. 4) show that as long as τθ/τw<1, the learning rule converges in a reasonable amount of time.
In terms of computational complexity, the parts of Algorithm 1 that determine its complexity are the for-loop that starts on line 2 and the matrix-vector multiplication in line 8. The for-loop runs a constant number of times (numItr), which does not depend on the size of the data. Line 5 is the dot product of two d-dimensional vectors and has a complexity of O(d). Each of the two update lines that follow has a complexity of O(1), since the dominant operation in each of them is the multiplication of a vector by a constant. Thus, the complexity of lines 2–7 of the algorithm is O(d). Line 8 is the multiplication of an n×d matrix by a d-dimensional vector, which has a complexity of O(nd). Therefore, the overall complexity of Algorithm 1 is O(nd).

In terms of computational complexity, the major parts of Algorithm 2 are line 2, line 3, and the if-else block (lines 6–20). The contributions of these major parts can be enumerated as follows:
As explained above, line 2 has a complexity of O(nd).
Line 3 can be done with a merge sort, which has a complexity of O(nlogn).
The most computationally complex part of the if-else block is lines 17–20. Line 17 is a sequential comparison of consecutive elements of an n-dimensional vector, and has a complexity of O(n). Line 18 also has a complexity of O(n) since it is the sequential comparison of the elements of an n-dimensional vector to a threshold. Line 19 is the deletion or flagging of at most n rows in a matrix, and has a complexity of O(n).

If m is the number of clusters detected, the while loop that starts at line 1 and ends at line 21 runs exactly m times. In the worst-case scenario, which is very unlikely, m=n; otherwise m<n. Thus, the overall complexity of Algorithm 2 is O(m(n log n + nd)), where m is the number of clusters, d is the number of attributes of each data point, and n is the number of points in the dataset. This is generally better than the complexity of the other three algorithms we compared with Algorithm 2: each of the EM and k-means clustering algorithms has a best-case complexity of O(n^(md+2) log n),30 and the complete-linkage hierarchical clustering algorithm has a complexity of O(n^3).31
Table 2 shows the mean CPU time, in seconds, (over 100 runs) of each clustering algorithm on the datasets already discussed. As can be seen, the BCM clustering algorithm is faster than the other algorithms. On the cardiotocography dataset for instance, the algorithm took only about 80% of the CPU time of k-means.
Dataset | BCM (Algorithm 2) | k-means | EM | CLINK
---|---|---|---|---
Figure 6(a) dataset | 0.4108 | 0.5004 | 0.5891 | 0.5021
Figure 6(b) dataset | 0.5108 | 0.5704 | 0.6891 | 0.6021
Figure 6(c) dataset | 0.2108 | 0.3004 | 0.3891 | 0.4021
Iris dataset | 0.4113 | 0.6100 | 0.8711 | 0.4364
Cardiotocography dataset | 5.8753 | 7.2953 | 7.7613 | 7.8553
6. Discussion
We have explored the ability of the BCM neuron to detect clusters in a dataset. Our main approach involved treating data points in a dataset as stimuli presented to a BCM neuron. We have shown that in this context, the following factors contribute to the stability of the selective fixed points of the BCM learning rule for a single neuron: the angle between the competing stimuli; the probabilities with which the stimuli are presented; and the time scale of the sliding threshold. Similar results would be possible if the quadratic term, v^2, in the second line of Eq. (2) were replaced with v^p, p>2, since the original formulation in Ref. 12 suggests that the fixed point structures are preserved for any p>1.
Our analytical results in Secs. 3 and 4 leaned heavily on scenarios where the neuron is selecting between two data points or two clusters. This is sufficient since our overall approach is for the neuron to detect one cluster at a time; thus, if there are more than two clusters, the analyses in these sections still apply if one uses a one-versus-rest approach. We also considered other approaches but decided not to include them in this paper. One such approach is to use a small network of coupled BCM neurons, with the number of neurons corresponding to the number of desired clusters. To make such a network receive stimulus patterns from a common set and work for the task at hand, it is important to incorporate a mechanism for competitive selectivity within the network. A mechanism of this sort, found in visual processes32 (and also in tactile,33 auditory,34 and olfactory processing35), is called lateral inhibition. During lateral inhibition, an excited neuron reduces the activity of its neighbors by preventing the spread of action potentials to neighboring neurons in the lateral direction. It is thus expected that, in the end, each neuron would select one unique cluster. We reserve this approach for future work.
Another approach we considered involves observing the responses to all the clusters at once and labeling points based on where their responses lie within the overall distribution of responses. For instance, Fig. 8 shows the distribution of the responses of a BCM neuron to three separable data clusters generated in a fashion similar to how the clusters of Fig. 4 were generated. We found, however, that this approach is not robust enough when the number of clusters is large.

Fig. 8. The distribution of the responses of a BCM neuron to three separable data clusters.
The model neuron we used in this paper has been assumed to have linear response properties, which may be seen as oversimplified. However, this approach is sufficient for the scope of our present study. The introduction of a nonlinear transfer function to the BCM learning rule has been addressed by Ref. 16. In their formulation, the learning rule is derived as a gradient descent rule on an objective function that is cubic in the nonlinear response. Our decision to use linear neurons is motivated by their amenability to formal analysis.
The proposed clustering algorithm is preliminary and has limitations. While it shows good performance on the limited datasets discussed here, it still needs some fine-tuning in order to be applied to a more general class of data. Furthermore, the algorithm only deals with data with numerical values and does not handle missing attributes. Thus, future work should gear toward improving the algorithm to account for a wider variety of data; this will involve incorporating other distance measures apart from the Euclidean distance measure. To further probe the limitations of the proposed algorithm, we compare it to the k-means algorithm for a class of datasets with similar centers. In particular, we generate two isotropic Gaussian blobs using the "make_blobs" generator from scikit-learn36 and vary the dispersion within the clusters while keeping the centers fixed. Figure 9(a) shows an instance of this dataset in which the within-cluster standard deviation is 0.3. Figure 9(b) shows the mean silhouette coefficient as a function of the within-cluster standard deviation. The silhouette coefficient of a sample is computed as (b−a)/max(a,b), where b is the mean distance between the sample and the points in the nearest cluster that the sample is not part of, and a is the mean intracluster distance between the sample and the other data points in its cluster. Note that the best value of the silhouette coefficient is 1 and the worst value is −1; values near 0 indicate overlapping clusters, and negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar. As can be seen, for this specific class of data, the k-means clustering algorithm outperforms the BCM clustering algorithm.
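The following is a sketch of how such a comparison can be set up, not the exact experiment behind Fig. 9. The fixed centers and sample size are assumptions; the scikit-learn calls (make_blobs, KMeans, silhouette_score) are standard API, and the BCM side would use a clusterer such as the bcm_cluster sketch of Sec. 5.

```python
# Sketch: two Gaussian blobs with fixed (assumed) centers, increasing within-cluster
# standard deviation, scored with the mean silhouette coefficient.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

centers = [(0.0, 0.0), (3.0, 3.0)]   # fixed centers (assumed values)
for std in (0.3, 0.5, 1.0, 1.5):
    X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=std, random_state=0)
    km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(f"std = {std:.1f}: k-means silhouette = {silhouette_score(X, km_labels):.3f}")
    # For the BCM side, replace km_labels with labels from a BCM clusterer, e.g.:
    # bcm_labels = bcm_cluster(X); print(silhouette_score(X, bcm_labels))
```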

Fig. 9. (a) Two isotropic Gaussian blobs, each with a within-cluster standard deviation of 0.5. (b) Performance of k-means and BCM as a function of the within-cluster standard deviation.
Acknowledgment
This work is based upon work supported, in part, by the U.S. Army Research Office Grant No. W911NF-21-1-0192.
ORCID
Lawrence Udeigwe https://orcid.org/0000-0002-7627-4820