An Analysis of the Cluster-Detecting Property of the BCM Neuron
Abstract
The BCM learning rule, named for Elie Bienenstock, Leon Cooper, and Paul Munro, was first proposed to explain the development of stimulus selectivity in neurons of the primary visual cortex and its dependence on the neuronal input environment. We show that an artificial BCM neuron has the ability to detect clusters in a dataset. By exploring the qualitative behaviors of an underlying system of differential equations, we present a rigorous mathematical analysis of this cluster-detecting property. While the focus of this work is not to develop a robust state-of-the-art clustering method, we also analyze and discuss the performance of a resulting preliminary clustering algorithm.
1. Introduction
It is well corroborated that when trained in the right environment, a sensory neuron becomes selective; that is, it responds to a certain class of stimuli but not others. Experiments by Hubel and Wiesel, for instance, revealed that neurons of the visual cortex of cats may respond preferentially to stimulus orientation.1 Their studies also show that these neurons detect edges regardless of where — in the receptive field of the neuron — they are placed, and can preferentially detect motion in certain directions.1,2,3 While all sensory neurons are selective, their preference and degree of selectivity vary.4,5,6,7
Synaptic plasticity is an umbrella term used to describe how the strength of the synapse between a neuron and a stimulus — or between a neuron and another neuron — is modified over time. One of the most cited theories in this field is the Hebbian theory of synaptic plasticity.8,9 Hebb in 1949 proposed that when neuron A repeatedly participates in firing neuron B, the strength of the influence of A onto B increases. If we think of A as the pre-synaptic neuron and B as the post-synaptic neuron, Hebbian theory implies that changes in synaptic strengths in a neural network are a function of the pre- and post-synaptic neural activities. A few decades later, Nass and Cooper10 developed a Hebbian synaptic modification theory for the synapses of the visual cortex, which was later extended to a threshold-dependent setup by Cooper et al.11 In this setup, the sign of a synaptic strength modification is based on whether the post-synaptic response is below or above a static threshold. A response above the threshold is meant to strengthen the active synapse, and a response below the threshold should lead to a weakening of the active synapse.
One of the most accurate models of synaptic plasticity to date is the BCM learning rule proposed by Elie Bienenstock, Leon Cooper, and Paul Munro in 1982. By incorporating a dynamic threshold that is a function of the average post-synaptic activity over time, the BCM learning rule captures the development of stimulus selectivity in the primary visual cortex of higher vertebrates.12 Mathematically speaking, the BCM learning rule is a system of differential equations involving the synaptic strengths (also known as synaptic weights), the stimulus being received by the neuron, the activity response of the neuron to the stimulus, and the threshold for the activity. Unlike its predecessors, which use static thresholds to modulate neuronal activity response, the BCM learning rule allows the threshold to be dynamic. This dynamic threshold provides stability to the learning rule, and from a biological perspective, provides homeostasis to the neural system.
In corroborating the BCM learning rule, Shouval et al. showed that a network of BCM neurons develops orientation selectivity and ocular dominance in a natural scene environment.13 The BCM learning rule has been reformulated and adapted to suit various interaction environments of biological neural networks, including laterally interacting neurons14,15 and stimulus-generalizing neurons.4 The BCM learning rule has also been used to demonstrate a few machine learning applications of neuronal selectivity. Intrator et al. showed that a BCM neural network can perform projection pursuit,16,17,18 i.e., it can find projections in which a dataset departs from statistical normality. This is an important finding that highlights the feature-detecting property of a BCM neural model. As a result, the BCM neural network has been successfully applied to some specific pattern recognition tasks. For example, Bachman et al. incorporated the BCM learning rule in their algorithm for classifying radar data.19 Intrator et al. developed an algorithm for recognizing 3D objects from 2D views by combining existing statistical feature extraction models with the BCM model.20,21 In Ref. 22, Poljovka and Benuskova presented a preliminary simulation on how the BCM learning rule has the potential to identify alphanumeric letters.
In this work, we show that a BCM neuron has the ability to detect clusters in a dataset. Our focus is not to develop a robust state-of-the-art clustering method, but to present a rigorous mathematical analysis of this cluster-detecting property. In Sec. 2, we discuss the BCM learning rule as a system of stochastic differential equations and explain its variables and parameters as well as how synaptic weight training is achieved. In Sec. 3, we illustrate a simulation in the simplest case of a single neuron with two weights and two different competing stimuli. We then extend this simulation to competing clusters of data points and note the variations in selectivity behavior as a function of the sizes of the clusters and the angular distance between them. In Sec. 4, we derive the mean field system of equations for the BCM neuron and show that there are changes in stability as the synaptic threshold time scale changes. In Sec. 5, we present a resulting preliminary clustering algorithm, analyze its computational complexity, and compare its performance with benchmark algorithms that perform similar tasks.
2. Mathematical Methods
We model a neuron with the following variables:
x={x1,x2,…,xn}, a vector representing the stimulus pattern being received by the neuron.
w={w1,w2,…,wn}, a vector of synaptic weights by which the components of x are scaled.
v=w⋅x, the response of the neuron to the stimulus x.
This model is usually referred to as the linear model, in that the response is a weighted linear combination of each component of the stimulus pattern. See Fig. 1(a) for a visual representation of the linear neuron model.

Fig. 1. (a) A linear neuron. x={x1,x2,…,xn} is a vector representing the stimulus pattern being received by the neuron. w={w1,w2,…,wn} is a vector of synaptic weights by which the components of x are scaled. v=w⋅x is the response of the neuron to the stimulus x. (b) A nonlinear function ϕ of the post-synaptic neuronal activity, v, and a threshold θ, of the activity. For low values of the post-synaptic activity (i.e., v<θ), ϕ is negative; for high values of the post-synaptic activity (i.e., v>θ), ϕ is positive.
For a single neuron, the BCM learning rule expresses the changes in the synaptic weights, w, as a product of the input stimulus vector, x, and a nonlinear function, ϕ, of the neuronal response, v, and a dynamic threshold, θ, of the response. If at any time the neuron receives a stimulus x from a stimulus set, say {x(1),x(2),…,x(n)}, the synaptic weight vector, w, evolves according to the BCM rule as follows:
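In outline, writing E[·] for a moving temporal average of the response and p>1 for a fixed exponent (and omitting the exact normalization of the threshold used in Ref. 12), the rule takes the form

$$\frac{dw}{dt} \;=\; \phi\big(v(t),\theta(t)\big)\,x(t), \qquad \theta(t) \;=\; E^{p}\big[v(t)\big].$$

This is a structural sketch of the original formulation rather than a verbatim restatement of it; the temporal-average threshold E^p[v] is what the later formulations discussed next replace.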
Later formulations of the learning rule — for instance, the one presented in Ref. 16 — have shown that a spatial average can be used in lieu of a temporal average, and that, with p=2, E[v^p] is a good approximation of E^p[v]. We can also replace the moving temporal average of v with a first-order low-pass filter. Thus, a differential form of the learning rule (as presented in Ref. 23) is
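A sketch of this differential form, written with the time-scale factors τw and τθ used throughout our simulations and with the quadratic threshold dynamics referred to in Sec. 6, is

$$\tau_w\,\frac{dw}{dt} \;=\; \phi(v,\theta)\,x, \qquad \tau_\theta\,\frac{d\theta}{dt} \;=\; v^{2}-\theta,$$

where v=w⋅x. One common choice of nonlinearity, consistent with the selective responses obtained in Theorem 1 below, is ϕ(v,θ)=v(v−θ).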
Once its dynamics are at a stable steady state, the BCM neuron is selective. This means that it exhibits high-response activities to a relatively small subset of the stimuli presented to it and exhibits very low-response activities to other stimuli. With data points as stimuli, our conjecture is that points that are close enough — in the Euclidean sense — to each other would yield comparable responses. Thus, based on this conjecture, the main focus of this work is to explore the ability of the BCM neuron to detect a cluster within a set of data points.
3. Preliminary Experiments
We proceed to numerically simulate different scenarios of the BCM learning rule that are vital to the present discussion.
3.1. Two stimuli, two weights
Consider a BCM neuron that receives a stimulus input x stochastically from a set {x(1),x(2)} with equal probabilities; that is, Pr[x(t)=x(1)]=Pr[x(t)=x(2)]=1/2. We create a simple hybrid stochastic system where the value of x switches between the pair {x(1),x(2)} as in a two-state Markov process. Figure 2 shows results when τw=100, τθ=10, the stimuli are x(1)=(cosπ/24,sinπ/24) and x(2)=(sinπ/24,cosπ/24) — respectively, the red dot and the blue dot on the left panel — and the initial conditions of w1, w2, and θ lie in the interval (0,0.5). We concede that the choice of π/24 is arbitrary; however, the decision to employ unit vectors for the stimuli is motivated by analyses presented in later sections. The two plots in the right panel of Fig. 2 show the response of the neuron to each of the two stimuli over time. The colors of the plots correspond to the colors of the stimuli. As can be inferred from the steady state behavior of the neuron, it becomes selective and chooses x(1), with a response of approximately 2, over x(2), which yields a response of approximately 0. We note that since the presentation of stimuli is stochastic, it is possible that if this experiment is repeated, the neuron will choose x(2) over x(1).
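The following is a minimal sketch of this experiment (not the exact simulation code behind Fig. 2), assuming forward-Euler integration, an assumed step size and step count, and the quadratic nonlinearity ϕ(v,θ)=v(v−θ) sketched in Sec. 2:

```python
# Minimal sketch of the two-stimulus experiment of Sec. 3.1 (assumed step size dt).
import numpy as np

rng = np.random.default_rng(0)

tau_w, tau_theta = 100.0, 10.0          # time-scale factors used in Fig. 2
dt, n_steps = 0.1, 200_000              # step size / number of steps are assumptions
a = np.pi / 24
stimuli = np.array([[np.cos(a), np.sin(a)],     # x^(1)
                    [np.sin(a), np.cos(a)]])    # x^(2)

w = rng.uniform(0.0, 0.5, size=2)       # initial weights in (0, 0.5)
theta = rng.uniform(0.0, 0.5)           # initial threshold in (0, 0.5)

for _ in range(n_steps):
    x = stimuli[rng.integers(2)]        # each stimulus presented with probability 1/2
    v = w @ x                           # linear response v = w . x
    w += dt / tau_w * v * (v - theta) * x        # dw/dt = phi(v, theta) x / tau_w
    theta += dt / tau_theta * (v * v - theta)    # dtheta/dt = (v^2 - theta) / tau_theta

v1, v2 = stimuli @ w
print(f"steady-state responses: v1 = {v1:.2f}, v2 = {v2:.2f}")
```

On most runs the printed responses approach (2, 0) or (0, 2), reflecting the selectivity discussed above.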

Fig. 2. Left: Two stimuli presented to a BCM neuron. x(1)=(cosπ/24,sinπ/24) is red and x(2)=(sinπ/24,cosπ/24) is blue. Right: response of the neuron over time to each of the two stimuli. The colors of the plots correspond to the colors of the stimuli. The initial conditions of w1, w2, and θ lie in the interval (0,0.5). τw=100 and τθ=10.
3.2. Cluster analysis
We define a stimulus cluster as a group of stimuli with some sort of similarity. If the stimuli are represented as vectors — as we have been doing — the center of a cluster is thus the mean of all the vectors in the cluster. Motivated by the setup and results of Sec. 3.1, we construct an artificial set of stimuli whose members have the form x=(cosα+noise,sinα+noise) or x=(sinα+noise,cosα+noise). This is equivalent to creating two clusters with centers (cosα,sinα) and (sinα,cosα).
According to a theorem in Ref. 12, a BCM neuron is most selective when the set of stimulus patterns is linearly independent. When α=π/4, the two centers described above coincide. When α=−π/4, the angle between the two centers is π, implying that they point in opposite directions and are thus linearly dependent. Therefore, it is important to restrict α to −π/4<α<π/4.
Lemma 1. For any two angles β1 and β2, there exists an angle α such that a uniform rotation converts the set of unit vectors {x(1)=(cosβ1,sinβ1),x(2)=(cosβ2,sinβ2)} to {x(1)=(cosα,sinα),x(2)=(sinα,cosα)}.
Proof. It suffices to find the required rotation angle. If this angle is h, then α is such that rotating x(1) by h gives (cosα,sinα), i.e., β1+h=α.
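The remaining steps can be sketched as follows, matching angles modulo 2π. Rotating x(2) by the same h must give

$$\big(\cos(\beta_2+h),\,\sin(\beta_2+h)\big)=(\sin\alpha,\,\cos\alpha)=\big(\cos(\tfrac{\pi}{2}-\alpha),\,\sin(\tfrac{\pi}{2}-\alpha)\big),$$

i.e., β2+h=π/2−α. Solving the two equations β1+h=α and β2+h=π/2−α gives

$$h=\frac{\pi}{4}-\frac{\beta_1+\beta_2}{2}, \qquad \alpha=\frac{\pi}{4}+\frac{\beta_1-\beta_2}{2},$$

which completes the proof.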
Thus, for the sake of our analysis in this section and beyond, we measure the distance between two stimuli as the non-reflex angle between the two vectors representing the points. The prescribed range of α above captures the desired angular measure because any value outside this range would result in a reflex angle. Figure 3 pictorially explains the relationship between the two centers for α>0 and α<0.

Fig. 3. Cluster centers defined as pattern vectors x=(cosα,sinα) and x=(sinα,cosα) for (a) α>0 (b) α=0 (c) α<0.
Now, let C1 be the cluster whose center is (cosα,sinα) and C2 the cluster whose center is (sinα,cosα). We parameterize these clusters as follows:
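The parameterization, repeated here from the caption of Fig. 4 (where b controls the tightness of the clusters, the ϵ's are noise terms, and N1 and N2 are the cluster sizes), is

$$C_1=\big\{x^{(i)}=(\cos\alpha+b\epsilon_i,\ \sin\alpha+b\epsilon_i)\ \big|\ b>0,\ -1<\epsilon_i<1,\ i=1,2,\dots,N_1\big\},$$
$$C_2=\big\{x^{(j)}=(\sin\alpha+b\epsilon_j,\ \cos\alpha+b\epsilon_j)\ \big|\ b>0,\ -1<\epsilon_j<1,\ j=1,2,\dots,N_2\big\}.$$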
We proceed to train a BCM neuron by presenting to the neuron a stimulus from C1∪C2 at each iteration. Each row of Fig. 4 shows the result of carrying out an instance of this simulation. In each instance, the initial conditions of w1, w2, and θ lie in the interval (0,0.5), τw=100, and τθ=10. We keep the clusters in each instance moderately tight by setting b=0.35. The left column shows the distribution of the stimuli on the xy plane; the middle column shows plots of the responses of the neuron to the cluster centers. As in Fig. 2, the colors of the plots correspond to the colors of the cluster centers. The right column shows box plot distributions of the responses.

Fig. 4. C1={x(i)=(cosα+bϵi,sinα+bϵi) | b>0, −1<ϵi<1, i=1,2,…,N1} and C2={x(j)=(sinα+bϵj,cosα+bϵj) | b>0, −1<ϵj<1, j=1,2,…,N2}. The initial conditions of w1, w2, and θ lie in the interval (0,0.5), τw=100, τθ=10, b=0.35. (a) N1=50, N2=50, and α=0 (b) N1=50, N2=50, and α=π/12 (c) N1=70, N2=30, and α=π/12.
In Fig. 4(a), N1=50, and N2=50. With α=0, the clusters in this case are clearly far enough apart from each other. At steady state, it can be seen that the responses to the centers are stable, and the neuron selects the center of cluster C1. The box plot shows that the responses to each stimulus cluster in fact form a cluster of their own and that there is no intersection between the clusters of responses. The responses to the stimulus cluster whose center is selected by the neuron are higher than the responses to the stimulus cluster whose center is not selected.
In Fig. 4(b), N1=50, and N2=50. With α=π/12, the clusters are still far enough from each other, although not as far apart as in Fig. 4(a). At steady state, it can be seen that the responses to the centers are stable, and the neuron selects the center of cluster C2. The box plots show the responses to the stimulus clusters. A conclusion similar to that made for the box plots in Fig. 4(a) can also be made here.
In Fig. 4(c), N1=70, and N2=30. With α=π/12, the clusters in this case are still far enough from each other. At steady state, it can be seen that the responses to the centers are still stable, and the neuron selects the center of cluster C2. Note that the response to the center of C2 here is not as high as the response to C2 in Fig. 4(b).
We have shown that as long as the clusters are far enough from each other, the BCM neuron will exhibit a clear difference between the two sets of responses to the two different clusters. Note that the ratio of N1 to N2 is a factor in determining the value of the response to the selected cluster center. This will be considered in future analysis.
3.3. Oscillatory responses
We have observed that the response values, and thus the selectivity, of the neuron depend on the ratio of the sizes of the clusters. This also implies that in a two-stimulus case, the relative values of the responses depend on the probabilities with which the stimuli are presented. Our studies have also shown that when the threshold, θ, changes more slowly than the weights, w, the dynamics of the BCM neuron take on a different kind of behavior: the responses oscillate. Figure 5 shows results for a slightly modified version of the simulation performed in Sec. 3.1. Again the two stimuli are x(1)=(cosπ/24,sinπ/24) and x(2)=(sinπ/24,cosπ/24) — respectively, the red dot and the blue dot on the left panel — and the initial conditions of w1, w2, and θ lie in the interval (0,0.5). Also Pr[x(t)=x(1)]=Pr[x(t)=x(2)]=1/2; however, τw=10 while τθ=135.

Fig. 5. Left: Two stimuli presented to a BCM neuron. x(1)=(cosπ/24,sinπ/24) is red and x(2)=(sinπ/24,cosπ/24) is blue. Right: oscillatory response of the neuron over time to each of the two stimuli. The colors of the plots correspond to the colors of the stimuli. The initial conditions of w1, w2, and θ lie in the interval (0,0.5). τw=10 and τθ=135.
As can be seen, the dynamics of the system do not yield a stable steady state; they give rise to an oscillation, albeit a stable one. This behavior does not lend itself well to selectivity. Let v1=w⋅x(1) be the response of the neuron to x(1), and v2=w⋅x(2) the response of the neuron to x(2). Even though v1 appears to be always higher than v2, the overall ranges of v1 and v2 intersect. The oscillatory behavior observed here has to do with the new choices of τw=10 and τθ=135 and will be studied further in Sec. 4.
4. Steady State and Stability
Although the scope of this work does not include an in-depth study of the dynamical system of the BCM learning rule — which has been explored in other work, including Ref. 24 — we use this section to explore the plausible stability regimes of the learning rule as it relates to the goal of this work. The dynamics of the BCM neuron (Eq. (2)) are stochastic in nature since, at each time step, the neuron randomly receives one stimulus out of a set of stimuli. One way to gain more insight into the nature of these dynamics is to study a mean-field deterministic approximation of the learning rule. If the rate of change of the stimuli is rapid compared to that of the weights and threshold, then we can average over the fast timescale to get a mean field or averaged model and then study this through the usual methods of dynamical systems. Consider a BCM neuron that receives a stimulus input x stochastically from the set {x(1)=(x11,x12,…,x1n),x(2)=(x21,x22,…,x2n)} such that Pr[x(t)=x(1)]=ρ and Pr[x(t)=x(2)]=1−ρ. A mean field equation for the synaptic weights is
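A sketch of this averaged system, obtained by weighting the right-hand sides of the learning rule by the presentation probabilities and written with v1=w⋅x(1) and v2=w⋅x(2), is

$$\tau_w\,\frac{dw}{dt} \;=\; \rho\,\phi(v_1,\theta)\,x^{(1)} + (1-\rho)\,\phi(v_2,\theta)\,x^{(2)}, \qquad \tau_\theta\,\frac{d\theta}{dt} \;=\; \rho\,v_1^{2} + (1-\rho)\,v_2^{2} - \theta.$$

The fixed points discussed below are written in the coordinates (v1,v2,θ) of this averaged system.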
We will ignore the fixed points (0,0,0) and (1,1,1), as they are neither stable nor selective — and thus not useful to the task at hand. References 12 and 24 discuss the stability of these fixed points as it pertains to the original formulation. References 14 and 16 give a similar treatment to the objective function formulation. In the context of the main focus of this paper, the following theorem summarizes the steady state result detailed above.
Theorem 1. Assume that a BCM neuron is trained with a set of two points {x(1),x(2)}, and that the data points are presented with respective probabilities ρ and 1−ρ. Let the responses of the neuron to x(1) and x(2) be v1 and v2, respectively. Then, at selective steady state, either (v1,v2)=(1/ρ,0) or (v1,v2)=(0,1/(1−ρ)).
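As a short verification, assuming the quadratic nonlinearity ϕ(v,θ)=v(v−θ) sketched in Sec. 2, these values follow directly from the averaged system: at a fixed point with v2=0 and v1≠0,

$$\theta=\rho v_1^{2} \quad\text{and}\quad \phi(v_1,\theta)=0 \;\Rightarrow\; v_1=\theta=\rho v_1^{2} \;\Rightarrow\; v_1=\frac{1}{\rho},$$

and, symmetrically, v2=1/(1−ρ) when v1=0.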
As a side note, we observe that when ρ=1/2, the selective fixed point is either (2,0,2) or (0,2,2), which corresponds to the results seen in Figs. 2, 4(a) and 4(b). The plots in these figures suggest that the combination of parameters we employed in the simulations allows for stability of the fixed point in each case. Note, however, that ρ=1/2 in Fig. 5 as well, yet the plots show that the fixed point in this case is not stable. We proceed to explore the stability of these fixed points as a function of the angle between the stimuli, the probability factor of the stimuli, ρ, and the ratio of τθ to τw.
To simplify our calculations, we translate the vectors x(1) and x(2) so that they share a common tail and assume — without loss of generality — that they have equal, unit magnitudes, i.e., x(1)⋅x(1)=x(2)⋅x(2)=1. To further simplify the calculations, we let τ=τθ/τw, β=x(1)⋅x(2), and r=ρ/(1−ρ). Note that r∈(0,∞), with r=1 being the case of equal probability.
The Jacobian of Eq. (7) is
Applying a similar treatment to z2≡(0, 1/(1−ρ), 1/(1−ρ)), we obtain the Jacobian
Collating the discussion that led to Eqs. (12) and (15), we obtain the following theorem about the stability of our selective steady state.
Theorem 2. Let β=x(1)⋅x(2), τ=τθ/τw, and r=ρ/(1−ρ). Then the qualitative behaviors of the steady state solutions presented in Theorem 1 can be summarized as follows:
(i) (v1,v2)=(1/ρ,0) is linearly asymptotically stable if and only if τ<1 and (r+τ−τr)(r−τ+β²+1)+τr(β²−1)>0.
(ii) (v1,v2)=(0,1/(1−ρ)) is linearly asymptotically stable if and only if τ<1 and (r+τr−τ)(1−τr+β²τr+r)+r(β²−1)>0.
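As a worked example of these conditions, take r=1 (equal presentation probabilities). The second inequality in (i) then reduces to

$$2(1-\tau)+\beta^{2}(1+\tau)>0,$$

and the one in (ii) to

$$(1-\tau)+\beta^{2}(1+\tau)>0,$$

both of which hold automatically whenever τ<1. Thus, for equally probable stimuli, linear stability of either selective fixed point reduces to the single requirement τ=τθ/τw<1, consistent with the stable behavior of Figs. 2 and 4 (where τ=0.1) and with the oscillations of Fig. 5 (where τ>1).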
5. A Single-Neuron BCM Clustering Algorithm
We propose a working preliminary algorithm that makes use of the selectivity property of the BCM neuron to cluster a dataset. The algorithm for this task involves the variables X, n, d, v, w, θ, R, τw, τθ, numItr, κ, and δ, characterized as follows:
(1) X is an n×d matrix containing the dataset.
(2) n is the number of data points in the dataset.
(3) d is the number of attributes of each data point, i.e., the dimension of the dataset.
(4) v is the activity of the neuron.
(5) w is a d-dimensional vector of the synaptic weights of the neuron. To avoid convergence to a degenerate fixed point, w should not be initialized with the zero vector.
(6) θ is the activity threshold of the neuron.
(7) R is an n-dimensional vector. The element Ri corresponds to the response of the neuron to the ith data point of X.
(8) τw is the time-scale factor for w.
(9) τθ is the time-scale factor for θ. As has been shown in Secs. 3 and 4, steady state stability is guaranteed if τθ/τw<1.
(10) numItr is the number of iterations during which the neuron is trained.
(11) κ and δ are cluster-detection cutoff factors. δ is the expected average difference in neuronal responses between two data points in the same cluster; this is usually very small (on the order of 0.1). κδ is the expected minimum difference in neuronal responses between two data points in two different clusters, so κ should be chosen such that κδ is large enough to reflect this. However, as can be deduced from the algorithms that follow, if κ is too large, points belonging to different clusters may be grouped together. Other than by experimentation, there is currently no hard rule for determining the values of κ and δ.
The mechanism of the proposed single-neuron BCM clustering algorithm (Algorithm 2) can be described as detect and eliminate. This means that it trains the BCM neuron with the dataset, detects the high-response data points, designates these points as a cluster, and eliminates them from the dataset. It then repeats this procedure until the dataset is empty. The algorithm can easily be modified to accept a pre-defined number of desired clusters.
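The following is a minimal sketch of this detect-and-eliminate procedure. It is not the exact Algorithm 2 listing (whose line numbers are referred to in Sec. 5.2): the train_bcm helper is an assumed implementation of the single-neuron training step, and the cutoff rule based on κ and δ is one plausible reading of the description given above.

```python
# Sketch of a single-neuron BCM clustering procedure (detect and eliminate).
import numpy as np

def train_bcm(X, tau_w=10.0, tau_theta=1.0, num_itr=10_000, dt=0.1, rng=None):
    """Train one BCM neuron on the rows of X and return its response to every row."""
    rng = rng or np.random.default_rng()
    n, d = X.shape
    w = rng.uniform(0.0, 0.3, size=d)           # avoid the zero vector
    theta = rng.uniform(0.0, 0.3)               # assumed initialization for theta
    for _ in range(num_itr):
        x = X[rng.integers(n)]                  # present a random data point
        v = w @ x
        w += dt / tau_w * v * (v - theta) * x   # assumed phi(v, theta) = v (v - theta)
        theta += dt / tau_theta * (v * v - theta)
    return X @ w                                # vector R of responses

def bcm_cluster(X, kappa=10.0, delta=0.05, **train_kwargs):
    """Detect-and-eliminate clustering with a single BCM neuron."""
    remaining = np.arange(len(X))
    labels = np.full(len(X), -1)
    cluster_id = 0
    while remaining.size > 0:
        R = train_bcm(X[remaining], **train_kwargs)
        order = np.argsort(R)[::-1]             # sort responses, highest first
        gaps = -np.diff(R[order])               # drops between consecutive responses
        big = np.flatnonzero(gaps > kappa * delta)
        cut = big[0] + 1 if big.size else remaining.size  # first large drop marks the cluster edge
        labels[remaining[order[:cut]]] = cluster_id       # high-response points form a cluster
        remaining = remaining[order[cut:]]                 # eliminate them and repeat
        cluster_id += 1
    return labels
```

In this sketch, the responses are sorted, the first gap larger than κδ is taken as the boundary of the high-response cluster, and the corresponding points are removed before the neuron is retrained on the remainder.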
5.1. Performance
The primary goal of this work is to analyze the cluster-detection property of the BCM neuron. For this reason, we choose datasets that will properly demonstrate this goal. We concede that future work will include fine-tuning the above preliminary clustering algorithm to account for probability distribution and other relevant factors. With that being said, we proceed to apply Algorithm 2 on some select datasets. In each simulation, τw=10, τθ=1, numItr=10,000, κ=10, δ=0.05, and w is initialized with numbers in the interval (0,0.3).
Artificially generated dataset: A dataset is said to be linearly separable if for each cluster in the dataset, there exists a hyperplane that can be used to separate the cluster from the remainder of the dataset. The datasets in Fig. 6 are generated in the same manner that the ones in Fig. 4 were generated. Figures 6(a) and 6(b) show the best performances of Algorithm 2 on two linearly separable datasets. The numbers on the plots show the order in which the clusters were detected. Figure 6(c) shows the performance of the algorithm on a dataset that is not linearly separable. In each case, we specified the number of desired clusters.
Iris dataset: This dataset was introduced by Fisher in 1936,27 and can be found on the UCI machine learning repository.28 The set contains 150 data points on three species of the Iris flower, comprising 50 samples from the Iris setosa species, 50 from Iris virginica, and 50 from Iris versicolor. It measures (in cm) the following four attributes for each sample: the length of the sepal, the width of the sepal, the length of the petal, and the width of the petal. This dataset is difficult to analyze by clustering since it contains only two obviously separable clusters (within the three groups): one cluster contains Iris setosa, while the other contains both Iris virginica and Iris versicolor and is not separable without the species information Fisher used. Figure 7(a) shows a 3D projection of the Iris data points. The cyan points are samples of Iris setosa, the purple points are samples of Iris virginica, and the yellow points are samples of Iris versicolor. Figure 7(b) shows a performance of Algorithm 2 that comes very close to mirroring the species clusters of the dataset. Again, the numbers on the plots show the order in which the clusters were detected, and the number of desired clusters, 3, was pre-defined.
Cardiotocography dataset: This dataset consists of measurements of fetal heart rate (FHR) and uterine contraction (UC) features on cardiotocograms classified by expert obstetricians. This dataset can be found on the UCI machine learning repository.28 The dataset consists of 2126 samples, and each sample has 23 attributes. With respect to morphological patterns, each sample in the dataset can be placed into one of 10 groups A, B, C, D, …, J. Each of the samples is already pre-classified to be in one of these groups. We set Algorithm 2 to detect exactly 10 clusters in order to see how the clusters compare to the original 10 groups in the dataset. We first choose a representative centroid for each of the 10 groups A, B, C, D, …, J. This representative centroid is the point in the group that is closest to the mean of the group. Before clustering, we associate each data point with a representative centroid from the original grouping. We thus consider a point to be correctly clustered if, after the clustering, it belongs in the same cluster as its associated representative centroid. Over 100 runs of Algorithm 2, we found the average percentage of correctly clustered points to be 79.23%.

Fig. 6. (a) and (b) Best performances of Algorithm 2 on two linearly separable datasets (c) Performance of the algorithm on a dataset that is not linearly separable. The numbers on the plots show the order in which the clusters were detected.

Fig. 7. (a) The cyan points are samples of Iris setosa, the purple points are samples of Iris virginica, and the yellow points are samples of Iris versicolor (b) A performance of Algorithm 2 that comes very close to mirroring the species clusters of the dataset. The numbers on the plots show the order in which the clusters were detected.
To further evaluate the performance of Algorithm 2, we compute the Davies–Bouldin (DB) index29 for the clusters produced by the algorithm, as well as the DB index for clusters produced by a few other popular clustering algorithms. The DB index takes into consideration a combination of the similarity of points within clusters and the dissimilarity between clusters, and is calculated in the following way:
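In its standard form, which is consistent with the symbol definitions that follow, the index is

$$DB \;=\; \frac{1}{n}\sum_{i=1}^{n}\max_{j\neq i}\frac{\sigma_i+\sigma_j}{d(c_i,c_j)},$$

where: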
n is the number of clusters.
cx is the centroid of cluster x.
σx is the average distance of all elements in cluster x to centroid cx.
d(ci,cj) is the distance between centroids ci and cj.
In a clustering result with high intracluster similarity and low intercluster similarity, the numerator of each term in the summation above should be smaller than the denominator. Thus, a reasonable DB index should be less than one. In addition, the algorithm that yields the best clustering result should have the lowest DB index.
Table 1 shows the DB indices for different algorithms on the datasets described above. Each algorithm is adjusted to yield the same number of clusters associated with each dataset, and each number is the average DB index over 100 runs. As can be seen, Algorithm 2 performs very well when compared with the other clustering algorithms. The bolded numbers highlight cases where, of all the algorithms, Algorithm 2 gives the best results. As expected, the algorithm performs best when the clusters are clearly linearly separable.
Dataset | BCM (Algorithm 2) | k-means | EM | CLINK
---|---|---|---|---
Figure 6(a) dataset (3 clusters) | **0.3001** | 0.3228 | 0.3180 | 0.3110
Figure 6(b) dataset (4 clusters) | 0.3009 | 0.3100 | 0.3013 | 0.2910
Figure 6(c) dataset (3 clusters) | **0.3611** | 0.4200 | 0.3991 | 0.4710
Iris dataset (3 clusters) | 0.3910 | 0.4012 | 0.4024 | 0.3211
Cardiotocography dataset (10 clusters) | **0.3411** | 0.3510 | 0.3521 | 0.4437
5.2. Convergence and computational complexity
The BCM clustering algorithm uses Euler's method to solve the system of ordinary differential equations in the BCM learning rule. The convergence rate of Euler's method can be determined analytically for Lipschitz continuous functions, but is considerably more difficult to determine for functions that are not Lipschitz continuous, especially when stochasticity is involved, as in Algorithm 2. Even though the BCM learning rule is not Lipschitz continuous and determining its convergence rate is not within the scope of this study, our experimental simulations (see Fig. 4) show that as long as τθ/τw<1, the learning rule converges in a reasonable amount of time.
In terms of computational complexity, the parts of Algorithm 1 that determine its complexity are the for-loop that starts on line 2 and the matrix-vector multiplication in line 8. The for-loop runs a constant number of times (numItr), which does not depend on the size of the data. Line 5 is the dot product of two d-dimensional vectors and has a complexity of O(d). Each of the two update lines that follow has a complexity of O(1), since the dominant operation in each of them is the multiplication of a vector by a constant. Thus, the complexity of lines 2–7 of the algorithm is O(d). Line 8 is the multiplication of an n×d matrix by a d-dimensional vector, which has a complexity of O(nd). Therefore, the overall complexity of Algorithm 1 is O(nd).

In terms of computational complexity, the major parts of Algorithm 2 are line 2, line 3, and the if-else block (lines 6–20). The contributions of these major parts can be enumerated as follows:
As explained above, line 2 has a complexity of O(nd).
Line 3 can be done with a merge sort, which has a complexity of O(nlogn).
The most computationally complex part of the if-else block is lines 17–20. Line 17 is a sequential comparison of consecutive elements of an n-dimensional vector, and has a complexity of O(n). Line 18 also has a complexity of O(n) since it is the sequential comparison of the elements of an n-dimensional vector to a threshold. Line 19 is the deletion or flagging of at most n rows in a matrix, and has a complexity of O(n).

If m is the number of clusters detected, the while loop that starts at line 1 and ends at line 21 runs exactly m times. In the worst-case scenario, which is very unlikely, m=n; otherwise m<n. Thus, the overall complexity of Algorithm 2 is O(m(n log n + nd)), where m is the number of clusters, d is the number of attributes of each data point, and n is the number of points in the dataset. This is generally better than the complexity of the other three algorithms we compared with Algorithm 2: each of the EM and k-means clustering algorithms has a best-case complexity of O(n^(md+2) log n),30 and the complete-linkage hierarchical clustering algorithm has a complexity of O(n^3).31
Table 2 shows the mean CPU time, in seconds, (over 100 runs) of each clustering algorithm on the datasets already discussed. As can be seen, the BCM clustering algorithm is faster than the other algorithms. On the cardiotocography dataset for instance, the algorithm took only about 80% of the CPU time of k-means.
Dataset | BCM (Algorithm 2) | k-means | EM | CLINK
---|---|---|---|---
Figure 6(a) dataset | 0.4108 | 0.5004 | 0.5891 | 0.5021
Figure 6(b) dataset | 0.5108 | 0.5704 | 0.6891 | 0.6021
Figure 6(c) dataset | 0.2108 | 0.3004 | 0.3891 | 0.4021
Iris dataset | 0.4113 | 0.6100 | 0.8711 | 0.4364
Cardiotocography dataset | 5.8753 | 7.2953 | 7.7613 | 7.8553
6. Discussion
We have explored the ability of the BCM neuron to detect clusters in a dataset. Our main approach involved treating data points in a dataset as stimuli presented to a BCM neuron. We have shown that in this context, the following factors contribute to the stability of the selective fixed points of the BCM learning rule for a single neuron: the angle between the competing stimuli; the probabilities with which the stimuli are presented; and the time scale of the sliding threshold. Similar results would be possible if the quadratic term, v^2, in the second line of Eq. (2) were replaced with v^p, p>2, since the original formulation in Ref. 12 suggests that the fixed point structures are preserved for any p>1.
Our analytical results in Secs. 3 and 4 leaned heavily on scenarios where the neuron is selecting between two data points or two clusters. This is sufficient since our overall approach is for the neuron to detect one cluster at a time; thus, if there are more than two clusters, the analyses in these sections still apply if one uses a one-versus-rest approach. We also considered other approaches but decided not to include them in this paper. One such approach is to use a small network of coupled BCM neurons, with the number of neurons corresponding to the number of desired clusters. To make such a network receive stimulus patterns from a common set and work for the task at hand, it is important to incorporate a mechanism for competitive selectivity within the network. A mechanism of this sort, found in visual processes32 (and also in tactile,33 auditory,34 and olfactory processing35), is called lateral inhibition. During lateral inhibition, an excited neuron reduces the activity of its neighbors by preventing the spread of action potentials to neighboring neurons in the lateral direction. It is thus expected that, in the end, each neuron would select one unique cluster. We reserve this approach for future work.
Another approach we considered involves observing the responses to all the clusters at once and labeling points based on where their responses lie within the overall distribution of responses. For instance, Fig. 8 shows the distribution of the responses of a BCM neuron to three separable data clusters generated in a fashion similar to how the clusters of Fig. 4 were generated. We found, however, that this approach is not robust enough when the number of clusters is large.

Fig. 8. The distribution of the responses of a BCM neuron to three separable data clusters.
The model neuron we used in this paper has been assumed to have linear response properties, which may be seen as oversimplified. However, this approach is sufficient for the scope of our present study. The introduction of a nonlinear transfer function to the BCM learning rule has been addressed by Ref. 16. In their formulation, the learning rule is derived as a gradient descent rule on an objective function that is cubic in the nonlinear response. Our decision to use linear neurons is motivated by their amenability to formal analysis.
The proposed clustering algorithm is preliminary and has limitations. While it shows good performance on the limited datasets discussed here, it still needs some fine-tuning in order to be applied to a more general class of data. Furthermore, the algorithm only deals with data with numerical values and does not handle missing attributes. Thus, future work should gear toward improving the algorithm to account for a wider variety of data; this will involve incorporating other distance measures apart from the Euclidean distance measure. To further probe the limitations of the proposed algorithm, we compare it to the k-means algorithm for a class of datasets with similar centers. In particular, we generate two isotropic Gaussian blobs using the "make_blobs" generator from scikit-learn36 and vary the dispersion within the clusters while keeping the centers fixed. Figure 9(a) shows an instance of this dataset in which the within-cluster standard deviation is 0.3. Figure 9(b) shows the mean silhouette coefficient as a function of the within-cluster standard deviation. The silhouette coefficient of a sample is computed as (b−a)/max(a,b), where b is the mean distance between the sample and the points in the nearest cluster that the sample is not part of, and a is the mean intracluster distance between the sample and the other data points in its cluster. Note that the best value of the silhouette coefficient is 1 and the worst value is −1; values near 0 indicate overlapping clusters, and negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar. As can be seen, for this specific class of data, the k-means clustering algorithm outperforms the BCM clustering algorithm.
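The following is a sketch of how such a comparison can be set up, not the exact experiment behind Fig. 9. The fixed centers and sample size are assumptions; the scikit-learn calls (make_blobs, KMeans, silhouette_score) are standard API, and the BCM side would use a clusterer such as the bcm_cluster sketch of Sec. 5.

```python
# Sketch: two Gaussian blobs with fixed (assumed) centers, increasing within-cluster
# standard deviation, scored with the mean silhouette coefficient.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

centers = [(0.0, 0.0), (3.0, 3.0)]   # fixed centers (assumed values)
for std in (0.3, 0.5, 1.0, 1.5):
    X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=std, random_state=0)
    km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(f"std = {std:.1f}: k-means silhouette = {silhouette_score(X, km_labels):.3f}")
    # For the BCM side, replace km_labels with labels from a BCM clusterer, e.g.:
    # bcm_labels = bcm_cluster(X); print(silhouette_score(X, bcm_labels))
```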

Fig. 9. (a) Two isotropic Gaussian blobs, each with a within-cluster standard deviation of 0.5. (b) Performance of k-means and BCM as a function of the within-cluster standard deviation.
Acknowledgment
This work is based upon work supported, in part, by the U.S. Army Research Office Grant No. W911NF-21-1-0192.
ORCID
Lawrence Udeigwe https://orcid.org/0000-0002-7627-4820