A Forward Learning Algorithm for Neural Memory Ordinary Differential Equations

    https://doi.org/10.1142/S0129065724500485

    Abstract

    The deep neural network, based on the backpropagation learning algorithm, has achieved tremendous success. However, the backpropagation algorithm is widely considered biologically implausible. Many efforts have recently been made to address these biological implausibility issues; nevertheless, these methods are tailored to discrete neural network structures. Continuous neural networks are crucial for investigating novel neural network models with more biologically plausible dynamical characteristics and for the interpretability of large language models. The neural memory ordinary differential equation (nmODE) is a recently proposed continuous neural network model that exhibits several intriguing properties. In this study, we present a forward learning algorithm, called nmForwardLA, for nmODE. This algorithm has lower computational dimensionality and greater efficiency. Compared with other learning algorithms, experimental results on MNIST, CIFAR10, and CIFAR100 demonstrate its potency.

    1. Introduction

    The introduction of residual modules addresses issues such as gradient vanishing in deep neural networks, removing the limitation that depth imposes on network scale.1 As a result, large language model (LLM)2 technology has experienced rapid development in recent years. As network models expand in depth following the pattern of residual blocks, they gradually transition from a discrete structure toward a continuous one.3 Thus, we believe that the continuous form of neural networks is of significant importance for proposing novel and more powerful models and for investigating the interpretability of current LLMs.

    In recent years, neural ordinary differential equations (neuralODEs)3,4,5 have garnered extensive attention in the field of deep learning. NeuralODEs are described by ordinary differential equations, and they offer many advantages over traditional multi-layer neural networks, such as memory efficiency, adaptive computation, parameter efficiency, and more. Furthermore, neuralODEs are suitable for solving traditional deep-learning problems such as classification6 and segmentation.7

    The learning algorithms for current neuralODE models are inspired by the backpropagation (BP) algorithm.8,9 Despite the significant success of the BP algorithm in various fields10,11,12,13,14,15 of artificial intelligence, it is still considered to lack biological plausibility, as the brain does not perform symmetric backward connections or synchronized computations. From an engineering perspective, BP scales poorly to massive neuralODEs and restricts potential hardware designs.

    Existing biologically plausible approaches mainly address weight transport, nonlocality, activation freezing, and update locking.16 These methods can be broadly categorized into two classes based on whether a backward feedback path is required.16,17,18,19,20,21 Almost all of these methods are proposed for discrete neural networks; however, compared with discrete neural networks, continuous networks with dynamical characteristics may inherently possess greater biological plausibility. Forward learning algorithms based on continuous neural networks support real-time learning and offer superior computational efficiency.

    The recently proposed neural memory ordinary differential equation (nmODE) is a novel continuous artificial neural network model.5 The model is described in continuous form using ODEs and possesses a simple, special structure that separates learning from memory, endowing it with several interesting properties. First, it has a global attractor for each external input. Second, nmODE does not suffer from the restriction, common in most existing neuralODEs, of learning only features homeomorphic to the input data space.4 A learning algorithm called nmLA has been developed for nmODE, but it still relies on a backpropagation mechanism.5

    In this study, we propose another learning algorithm for the nmODE. This learning algorithm calculates gradients in a forward manner and does not require backpropagation of gradients, making it more biologically plausible. We evaluate the proposed learning algorithm primarily on supervised image classification datasets, including MNIST,22 CIFAR10, and CIFAR100,23 and show that it outperforms other algorithms.

    The primary contributions of this study include the following:

    A more biologically plausible learning algorithm, called nmForwardLA, is proposed for nmODE, free of the weight transport, nonlocality, frozen activity, and update locking issues. We derive its core mathematical formula (Eq. (12)) through precise mathematical derivation.

    This method involves only one forward pass of a two-dimensional ODE, enabling real-time inference and gradient calculation for updates. Compared with the other neuralODE training algorithms, it has a more straightforward computation and lower space complexity.

    We validate the feasibility and correctness of our proposed method on three classic datasets, achieving competitive results compared with other learning algorithms.

    2. Related Works

    Deep neural networks’ success stems from powerful, nonlinearly expressive network architectures and simple yet effective learning algorithms. This section reviews relevant work on neural network architectures and biologically plausible learning methods.

    2.1. Neural network architecture

    Neural networks were invented in the 1950s, and artificial neural network models have since achieved significant success in various fields. Their architectures can be divided into discrete and continuous.

    Discrete neural networks The industry’s most commonly used and highly successful neural networks are discrete neural networks. The earliest ones are single-layer networks,24 such as perceptrons, which easily handle linearly separable problems but cannot solve linearly inseparable problems, such as the XOR problem.25 To address this issue, later researchers added one or more hidden layers to the perceptron, allowing the network model to handle more general functional relationships. These models are typically referred to as multi-layer perceptrons (MLPs).26 Inspired by the cat’s visual system, Kunihiko Fukushima introduced a multi-layer neural network with convolution and subsampling operators on top of the multi-layer perceptron, called the Neocognitron.27 LeCun built upon this foundation and invented Convolutional Neural Networks (CNNs), achieving significant success in handwritten digit recognition.28 Hinton et al. employed a layer-wise pretraining method, initially pretraining each layer’s parameters and then fine-tuning them using backpropagation (BP), thus effectively training deep neural networks.29 The first modern deep convolutional neural network model, AlexNet,30 marked the beginning of a breakthrough in deep learning technology, particularly in image classification. In the past decade, deep neural networks have continuously expanded in scale, focusing on increasing depth.31,32,33 The introduction of ResNet effectively addressed the issue of gradient vanishing, enabling neural networks to be extended to greater depths.1 In recent years, models have grown increasingly large, and the transformer architecture has unified effective semantic learning for text and image features.2

    Continuous neural networks Models like residual networks,1 transformer networks,2 and normalizing flows construct intricate transformations by sequentially composing a series of transformations of a hidden state. As the number of layers in a network model increases and the time steps become smaller, in the limit the continuous dynamics of the hidden units can be parametrized by an ordinary differential equation (ODE) specified by a neural network. Many successfully applied neural network models, including ResNet1 and the Transformer,2 are essentially particular forms of discretized neuralODEs. As the network scale increases, these models will likely converge toward neuralODEs. Thus, we believe that researching continuous forms of neural networks is of significant importance for proposing novel and more powerful models and investigating the interpretability of current large language models (LLMs).

    The research on continuous artificial neural networks mainly focuses on the following two aspects. (1) Many recent works have proposed learning differential equations from data. Some work trained feed-forward or recurrent neural networks to approximate a differential equation,34,35 with applications such as fluid simulation.36 There was also significant work on connecting Gaussian processes (GPs) with ODE solvers, and GPs were adapted to fit differential equations.37 Stochastic variational inference was utilized to recover the solution of a given stochastic differential equation.38 (2) Another category of work mainly focused on implementing differentiation through ODE solvers.39,40 The neuralODE was first proposed in Ref. 3 and trained directly through an ODE solver end-to-end. This work highlighted the potential of a general integration of black-box ODE solvers into automatic differentiation for deep learning and generative modeling. It was proved that neuralODEs using the initial value as data input can learn only features homeomorphic to the input data space.4 Dupont et al. augmented the networks with additional dimensions to address this issue; however, additional computation is required.4 Neither method needed to store intermediate states during the forward computation; instead, they achieved optimal storage efficiency by dynamically reconstructing these states during the backward process for gradient computation. Yi constructed a novel type of continuous neural network with evident dynamical characteristics and a global attractor, without additional computational costs.5 Additionally, Yi proposed a novel learning algorithm for nmODE, called nmLA, which is decoupled into a three-dimensional adjoint equation, thereby significantly enhancing computational efficiency.5

    The nmODE is a more biologically plausible network architecture with nonlinear expressive capabilities and attractor-based memory. While nmLA dramatically reduces the complexity of ODE training through a decoupled three-dimensional ODE solver, it still requires a reverse process, which leads to weight transport and update locking issues.

    2.2. Biologically plausible learning methods

    Since the era of perceptrons, developing learning algorithms for neural networks, particularly those that biological brains could implement, has been a central focus. Dellaferrera and Kreiman16 systematically summarize biologically plausible learning algorithms.

    Learning with a feedback path The backpropagation algorithm consists of both a forward and a backward pass.8,9 The signal progresses from the input to the output in the forward pass, and the error is determined by the difference between the model’s output and the target. During the backward pass, the error travels from the output layer back to the input layer through the same weights used in the forward pass, giving rise to the weight symmetry issue. Recently, various learning algorithms have been suggested to alleviate the constraint imposed by symmetric weights. Neural networks can learn effectively when the learning signal is backpropagated through connections that share only the sign, not the magnitude, of the feedforward weights (the sign symmetry algorithm).18,41 Alternatively, effective learning was accomplished through fixed random connections, called feedback alignment (FA).19 The effectiveness of FA suggested that the modulatory learning signals conveyed by fixed random matrices are valuable. This is due to the alignment of the forward weights with the fixed backward matrices during the update process.19 Furthermore, Nøkland illustrated that learning signals can be transferred effectively even through random feedback matrices.20 Akrout et al.42 proposed a neural circuit called a weight mirror to fine-tune FA’s initially random feedback weights and improve their alignment with the forward path, resulting in enhanced performance compared with FA. However, it is worth noting that these approaches all necessitate a feedback path to guide the model’s learning and thus cannot address the nonlocality, update locking, and activity freezing issues.16

    Learning without a feedback path Recent studies have employed methods based on local error handling to address the update locking issue, proving effective in training neural networks.43,44,45 These methods trained each layer independently by employing auxiliary fixed random classifiers. However, they still faced weight transport issues at the classifier level and necessitated a significantly greater computational workload. To achieve the same objective, Frenkel et al. proposed a method named DRTP that utilized fixed random learning signals to update parameters using labels instead of the network error.46 This strategy successfully overcame the weight symmetry and update locking issues while incurring no additional computational overhead. However, it still required frozen activity while the learning information propagated through the network. In addition, DRTP exhibited a more noticeable decline in performance than the BP and FA algorithms. More recently, a category of training schemes known as “forward-only” algorithms has emerged,16,21,47 replacing the backward path with another forward one. This category includes the Forward–Forward algorithm (FF)21 and PEPITA.16,17 Both algorithms involve presenting a clean input sample during the initial forward pass. In the case of FF, the second forward pass introduces a corrupted data sample obtained by merging different samples with masks. In contrast, PEPITA modulates the input in the second forward pass with information about the error from the first forward pass. Some other original methods were designed for specific networks with greater biological plausibility. Clark et al. proposed a method, called GEVB, which does not transfer learning information by fixed random weights and resolves the weight transport and locality issues.48 Ren et al.49 proposed a forward gradient learning algorithm inspired by the design of MLPMixer50 to compute a noisy directional gradient. Chen et al. introduced a learning method that treats the ODE solver as a black box and computes gradients using the adjoint sensitivity method.3 This method does not require freezing intermediate states during the forward process but dynamically recalculates these results in real time during the backward pass. However, the algorithm still relies on a globally defined cost function over time. Yi proposed nmLA to train nmODE, sampling at each observation time and updating weights in real time, further addressing the nonlocality issue.5 While neither method requires storing the activation values used during the forward computation, both still involve a backward process that uses the same weights as the forward process, thus not addressing the issue of weight transport.

    Most current biologically plausible learning methods revolve around discrete neural network architectures. Only a few studies have recently begun to explore learning algorithms for continuous neural network architectures, and they have yet to fully address the biological implausibility issues.

    3. The Proposed nmForwardLA

    3.1. nmODE

    The traditional mathematical model of a neuron can be described simply by

    $$p = f(\gamma), \tag{1}$$
    where $\gamma$ denotes the total input to the neuron and $f$ denotes an activation function. This model is a simple static mapping from the total input $\gamma$ to the output $p$. However, in the brain, electrical activity flows continuously among neurons, so the traditional neuron model is a greatly simplified mathematical model. In Ref. 5, a new neuron model described by a one-dimensional ODE is proposed:
    $$\dot{p} = -p + \sin^2(p + \gamma), \tag{2}$$
    where $p$ denotes the state of the neuron. This model generalizes the traditional neuron model (1) from a static mapping to a dynamical system (2).

    Utilizing the proposed neuron model, a novel continuous neural network model, referred to as nmODE (neural memory ordinary differential equation), is introduced in Ref. 5 by

    $$\begin{cases} \dot{y}(t) = -y(t) + \sin^2[\,y(t) + \gamma\,], \\ \gamma = W^{(1)} x + b, \end{cases} \tag{3}$$
    where $x \in \mathbb{R}^m$ denotes the external input, $y(t) \in \mathbb{R}^n$ denotes the state of the memory neurons at time $t$, $W^{(1)} = (w^{(1)}_{ij}) \in \mathbb{R}^{n \times m}$ is the connection weight matrix, and $b \in \mathbb{R}^n$ denotes the bias. This model is special in that the dynamics of the memory neurons are uncoupled, which makes the dynamics of nmODE extremely simple and clear. To enable the nmODE to learn efficiently, a set of decision-making neurons, denoted by $a = [a_1, \ldots, a_r]^T \in \mathbb{R}^r$, is introduced as
    $$\begin{cases} a_i(t) = s(z_i(t)), \\ z_i(t) = \sum_{j=1}^{n} w^{(2)}_{ij}\, y_j(t), \end{cases} \tag{4}$$
    where $W^{(2)} = (w^{(2)}_{ij}) \in \mathbb{R}^{r \times n}$ is another connection weight matrix for learning, and $s(\cdot)$ is the softmax activation function; see Fig. 1 for an illustration.

    Fig. 1. The architecture of the nmODE network.
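
    To make this structure concrete, the following is a minimal PyTorch sketch of Eqs. (3) and (4), not the authors' released code: the module names, tensor shapes, and the use of torchdiffeq.odeint (the solver library of Ref. 3) are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint  # black-box ODE solver, as in Ref. 3


class NmODEFunc(nn.Module):
    """Memory dynamics of Eq. (3): dy/dt = -y + sin^2(y + gamma)."""

    def __init__(self, in_dim, mem_dim):
        super().__init__()
        self.W1 = nn.Linear(in_dim, mem_dim)  # gamma = W1 x + b
        self.gamma = None                     # set per batch before integration

    def set_input(self, x):
        self.gamma = self.W1(x)               # the external input enters only through gamma

    def forward(self, t, y):
        return -y + torch.sin(y + self.gamma) ** 2


class NmODENet(nn.Module):
    """nmODE memory block followed by the decision neurons of Eq. (4)."""

    def __init__(self, in_dim, mem_dim, num_classes, t_bar=1.0):
        super().__init__()
        self.func = NmODEFunc(in_dim, mem_dim)
        self.W2 = nn.Linear(mem_dim, num_classes, bias=False)  # z = W2 y
        self.t_bar = t_bar

    def forward(self, x):
        self.func.set_input(x)
        y0 = torch.zeros(x.size(0), self.func.W1.out_features, device=x.device)
        t = torch.tensor([0.0, self.t_bar], device=x.device)
        y_bar = odeint(self.func, y0, t)[-1]          # y(t_bar)
        return torch.softmax(self.W2(y_bar), dim=-1)  # a = softmax(z)
```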

    3.2. The learning rule

    We present a novel forward learning algorithm for the nmODE. Define

    $$\gamma_i = \sum_{j=1}^{m} w^{(1)}_{ij} x_j + b_i,$$
    then, we can rewrite the nmODE as
    $$\dot{y}_i(t) = -y_i(t) + \sin^2[\,y_i(t) + \gamma_i\,], \qquad i = 1, \ldots, n.$$
    Let the network learn a target $d_i$ $(i = 1, \ldots, r)$. Given a fixed time $\bar{t}$, a cost function $J(\bar{t})$ can be constructed from the output $a_i(\bar{t})$ together with the learning target $d_i$. The gradients with respect to the weights $W^{(1)}$ can be calculated as follows:
    $$\frac{\partial J(\bar{t})}{\partial w^{(1)}_{ij}} = \frac{\partial J(\bar{t})}{\partial y_i(\bar{t})}\, \frac{\partial y_i(\bar{t})}{\partial w^{(1)}_{ij}} \tag{5}$$
    $$= \delta_{y_i}(\bar{t})\, \frac{\partial y_i(\bar{t})}{\partial w^{(1)}_{ij}} \tag{6}$$
    $$= \delta_{y_i}(\bar{t})\, \frac{\partial y_i(\bar{t})}{\partial (w^{(1)}_{ij} x_j)}\, x_j, \tag{7}$$
    where $\delta_{y_i}(\bar{t}) = \partial J(\bar{t})/\partial y_i(\bar{t})$, and
    $$\frac{\partial J(\bar{t})}{\partial b_i} = \delta_{y_i}(\bar{t})\, \frac{\partial y_i(\bar{t})}{\partial b_i}. \tag{8}$$
    Next, we need $\partial y_i(\bar{t})/\partial (w^{(1)}_{ij} x_j)$ and $\partial y_i(\bar{t})/\partial b_i$. They cannot be calculated directly, so we establish a set of ODEs to solve this problem. Using
    $$\dot{y}_i(t) = -y_i(t) + \sin^2[\,y_i(t) + \gamma_i\,], \tag{9}$$
    and noting that the initial value $y_i(0)$ is independent of $w^{(1)}_{ij} x_j$ and $b_i$, taking the partial derivative of both sides of the equation, we get
    $$\begin{cases} \dfrac{d}{dt}\!\left[\dfrac{\partial y_i(t)}{\partial (w^{(1)}_{ij} x_j)}\right] = -\dfrac{\partial y_i(t)}{\partial (w^{(1)}_{ij} x_j)} + \sin\!\big[2\big(y_i(t) + \gamma_i\big)\big]\left[1 + \dfrac{\partial y_i(t)}{\partial (w^{(1)}_{ij} x_j)}\right], \\[2mm] \dfrac{d}{dt}\!\left[\dfrac{\partial y_i(t)}{\partial b_i}\right] = -\dfrac{\partial y_i(t)}{\partial b_i} + \sin\!\big[2\big(y_i(t) + \gamma_i\big)\big]\left[1 + \dfrac{\partial y_i(t)}{\partial b_i}\right], \\[2mm] \dfrac{\partial y_i(0)}{\partial (w^{(1)}_{ij} x_j)} = 0, \qquad \dfrac{\partial y_i(0)}{\partial b_i} = 0. \end{cases} \tag{10}$$
    Solving this ODE from $t = 0$ to $t = \bar{t}$, we obtain $\partial y_i(\bar{t})/\partial (w^{(1)}_{ij} x_j)$ and $\partial y_i(\bar{t})/\partial b_i$. Using gradient descent, we obtain the following weight learning rule:
    $$\begin{cases} w^{(2)}_{ij} \leftarrow w^{(2)}_{ij} - \alpha\, \delta_{z_i}(\bar{t})\, y_j(\bar{t}), \\[1mm] w^{(1)}_{ij} \leftarrow w^{(1)}_{ij} - \beta\, \delta_{y_i}(\bar{t})\, \dfrac{\partial y_i(\bar{t})}{\partial (w^{(1)}_{ij} x_j)}\, x_j, \\[1mm] b_i \leftarrow b_i - \beta\, \delta_{y_i}(\bar{t})\, \dfrac{\partial y_i(\bar{t})}{\partial b_i}, \end{cases} \tag{11}$$
    where $\alpha$ and $\beta$ are learning rates and $\delta_{z_i}(\bar{t}) = \partial J(\bar{t})/\partial z_i(\bar{t})$.
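
    As a minimal vectorized illustration of rule (11), the sketch below applies one update once the terminal state and sensitivities are available. It assumes plain tensors $W^{(1)} \in \mathbb{R}^{n\times m}$, $W^{(2)} \in \mathbb{R}^{r\times n}$ and a single sample; the function name and argument names are placeholders, not the paper's code.

```python
import torch


def nmforward_update(W1, b, W2, x, y_bar, delta_y, delta_z, dy_dwx, dy_db,
                     alpha=1e-5, beta=1e-5):
    """One application of the learning rule (11), for a single sample.

    delta_z : dJ/dz at time t_bar, shape (r,)
    delta_y : dJ/dy at time t_bar, shape (n,)
    dy_dwx  : dy_i(t_bar)/d(w_ij x_j), shape (n,) (depends only on i)
    dy_db   : dy_i(t_bar)/db_i,        shape (n,)
    """
    with torch.no_grad():
        # w2_ij <- w2_ij - alpha * delta_z_i * y_j(t_bar)
        W2 -= alpha * torch.outer(delta_z, y_bar)
        # w1_ij <- w1_ij - beta * delta_y_i * dy_i/d(w_ij x_j) * x_j
        W1 -= beta * torch.outer(delta_y * dy_dwx, x)
        # b_i  <- b_i  - beta * delta_y_i * dy_i/db_i
        b -= beta * delta_y * dy_db
    return W1, b, W2
```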

    Directly solving the ODE (10) is not easy in practical applications, since its dimension can be very high. Fortunately, these equations are uncoupled in their variables, so we can define another ODE, called the adjODE, as

    $$\begin{cases} \dot{p} = -p + \sin^2(p + \gamma), \\ \dot{q} = -q + \sin\!\big[2(p + \gamma)\big](1 + q). \end{cases} \tag{12}$$
    The adjODE (12) can be viewed as a basic building block for solving (10). In the adjODE, $p$ represents the state of a neuron and $q$ can be viewed as the corresponding gradient (sensitivity) variable.
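
    A minimal sketch of how the adjODE (12) could be integrated: the state $p$ and the sensitivity $q$ are stacked and solved jointly in a single forward pass. The helper name make_adj_ode, the dimensions, and the use of torchdiffeq.odeint are illustrative assumptions, not the paper's implementation.

```python
import torch
from torchdiffeq import odeint  # black-box ODE solver, as in Ref. 3


def make_adj_ode(gamma):
    """Right-hand side of the adjODE (12) for a fixed external input gamma."""
    def adj_ode_rhs(t, state):
        p, q = state
        dp = -p + torch.sin(p + gamma) ** 2                 # nmODE dynamics
        dq = -q + torch.sin(2.0 * (p + gamma)) * (1.0 + q)  # sensitivity dynamics
        return torch.stack([dp, dq])
    return adj_ode_rhs


# usage: one forward solve yields both y(t_bar) and dy(t_bar)/d(w x) (= dy/db)
gamma = torch.randn(1024)          # gamma = W1 x + b for one sample, 1024 memory neurons
state0 = torch.zeros(2, 1024)      # p(0) = 0 and q(0) = 0
t = torch.tensor([0.0, 1.0])       # integrate from 0 to t_bar (here t_bar = 1.0)
p_bar, q_bar = odeint(make_adj_ode(gamma), state0, t)[-1]
```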

    3.3. Implementation of nmForwardLA

    The above gives a learning method for the nmODE at time $\bar{t}$. To make the nmODE learn efficiently, it should learn iteratively over many time steps, where $K$ denotes the number of steps of length $\bar{t}$. Algorithm 1 shows detailed pseudocode, and a training-step sketch is given below.
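
    The following is a hedged sketch (not the paper's Algorithm 1) of how inference and learning might be interleaved over the $K$ time steps, reusing nmforward_update from the sketch after Eq. (11). The attributes model.W1, model.b, model.W2 and the helper model.adj_solve are illustrative assumptions, and the delta terms follow Eqs. (5)-(8) for a softmax cross-entropy cost evaluated at each observation time.

```python
import torch
import torch.nn.functional as F


def train_step(model, x, target, K=10, alpha=1e-5, beta=1e-5):
    """One nmForwardLA pass over K successive intervals of length t_bar (single sample).

    `model` is assumed to expose plain tensors W1 (n x m), b (n,), W2 (r x n) and an
    `adj_solve(gamma, p0, q0)` helper that integrates the adjODE (12) over one
    interval; these names are assumptions, not the authors' API.
    """
    p = torch.zeros(model.W1.size(0))          # memory state y(0)
    q = torch.zeros_like(p)                    # sensitivity dy/d(w x) = dy/db
    for k in range(K):
        gamma = model.W1 @ x + model.b         # gamma = W1 x + b (re-evaluated after updates)
        p, q = model.adj_solve(gamma, p, q)    # forward solve of Eq. (12) over one interval
        z = model.W2 @ p                       # decision neurons, Eq. (4)
        loss = F.cross_entropy(z.unsqueeze(0), target.unsqueeze(0))
        delta_z = torch.softmax(z, dim=-1) - F.one_hot(target, z.numel()).float()
        delta_y = model.W2.T @ delta_z         # dJ/dy at the current observation time
        # learning rule (11), applied at every observation time; whether q is
        # reset between intervals is an implementation choice not fixed by the text.
        nmforward_update(model.W1, model.b, model.W2, x, p,
                         delta_y, delta_z, q, q, alpha=alpha, beta=beta)
    return loss.item()
```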

    3.4. Analysis of the advantages of nmForwardLA

    We discuss the differences among three neuralODE learning algorithms, illustrating their inference and learning processes and their biological plausibility features. As shown in Fig. 3(a), Chen et al.3 introduce a reverse-mode augmented ordinary differential equation (ODE) for computing gradients of a scalar-valued loss with respect to any of the ODE solver’s inputs. This method comprises two processes: a forward computation and a backward computation. The forward computation no longer necessitates storing intermediate results; instead, a reverse ODE is used to reconstruct the intermediate states required for computing gradients. This enables training models with a constant memory cost and addresses a significant bottleneck in training deep models. However, the method necessitates the simultaneous computation of four high-dimensional adjoint sensitivity equations, still retaining high time and space complexity. nmLA5 is the learning algorithm proposed for nmODE. Due to the exquisitely designed structure of nmODE, this method can be decoupled into three low-dimensional adjoint sensitivity equations. This approach has better time and space complexity than the first method.3 Compared with these two methods, the nmForwardLA proposed in this study exhibits superior optimization performance, requiring only a single forward pass of the adjoint equations to compute both the inference state and the gradients directly.

    Fig. 2. An illustration for network learning iteratively in many time steps (K).

    Fig. 3. Intuitive comparison of three learning algorithms for neural ordinary differential equations: (a) reverse-mode augmented ODE learning algorithm, (b) nmLA, and (c) nmForwardLA. The top row presents an intuitive illustration of the forward computation and learning process. The middle row provides detailed descriptions of the forward and backward computation processes. The bottom row shows the resolution status of the four biologically implausible issues. Figure 2 shows an intuitive illustration. Theoretically, the proposed method offers better efficiency, less error, and greater biological plausibility.

    The first two methods in the figure do not require recording intermediate states during the forward process. They rely solely on the final state of the forward process to reconstruct the intermediate states needed for computing gradients. Therefore, these two methods do not suffer from the issue of frozen activity. In both of these methods, weight updates are based on the forward pass and do not need to wait for the completion of the backward pass; therefore, these two methods only partially address the issue of update locking. Compared with the first method, nmLA computes a cost function that is local in time, thus adhering to the locality rule in terms of biological plausibility. However, both of these methods use the same weights W in the forward and backward computation processes, hence also encountering the issue of weight transport. The nmForwardLA proposed in this study is a perpetually forward method in time, with no reverse process; therefore, it does not suffer from the issue of weight transport. Additionally, both the inference and learning processes are real-time, eliminating the issues of update locking and activity freezing. Furthermore, this method also adheres to the locality rule in time.

    Further analysis is provided in Fig. 3. Methods (a) and (b) reconstruct the observation states required for gradient updates at the termination time of the forward computation. Our approach, by contrast, computes the observation states needed for calculating gradients in real time during the forward computation. The observation values obtained by the former methods may contain reconstruction errors compared with the actual observation values, while the observations we obtain are the actual ones. Therefore, we can obtain more accurate learning gradients.

    4. Experimental Results

    In this section, we aim to demonstrate the effectiveness of our proposed learning algorithm (nmForwardLA) through a set of experiments. First, we present the results of our experiments on the widely known MNIST22 classification dataset. Second, we showcase the results of our proposed forward learning method on CIFAR-10 and CIFAR-100.23 We implemented the proposed method using PyTorch 1.11.0 with Python 3.7.4. The experiments were conducted on an Ubuntu Server 18.04 machine equipped with a Titan RTX (24GB) GPU using CUDA 10.2.

    4.1. Image classification experiments

    Dataset: MNIST is a widely used dataset that serves as a benchmark for testing newly created neural network models. The dataset consists of 50,000 training images and 10,000 testing images, each of which is 28×28 pixels in size. CIFAR-10, on the other hand, is a natural image dataset primarily utilized in the field of computer vision. This dataset contains a total of 60,000 32×32 color images, distributed among 10 categories. CIFAR-100 is another natural image dataset and also contains a total of 60,000 32×32 color images, distributed among 100 categories.

    Experimental settings: We utilized an nmODE5 block and implemented Eq. (12) using the torchdiffeq framework.3 To do this, we re-implemented the backward and forward methods using PyTorch. We utilized Adam51 as the optimizer for the model, with an initial learning rate of 1e-5. The batch size is set to 256. The total number of training epochs is set to 3000, and the best parameters were saved based on the test set results. Every 1000 training epochs, the learning rate was reduced by a factor of ten, and the best model parameters were reloaded to continue training. The MNIST dataset generally converges to its optimum within 1000 epochs, while the CIFAR-10 dataset is relatively difficult to converge and still shows a downward trend beyond 1000 epochs. The selection of hyperparameters, such as K and the number of hidden neurons, requires further analysis for our proposed learning algorithm; a minimal configuration sketch is given below.
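
    For concreteness, the optimizer and schedule described above could be set up roughly as follows; the stand-in linear model is only a placeholder for the nmODE network, and the per-1000-epoch checkpoint reloading is not shown.

```python
import torch
from torch import nn

# hyperparameters as described in the text
LR0, BATCH_SIZE, EPOCHS = 1e-5, 256, 3000

model = nn.Linear(784, 10)  # stand-in for the nmODE network used in the experiments
optimizer = torch.optim.Adam(model.parameters(), lr=LR0)
# every 1000 training epochs the learning rate is reduced by a factor of ten
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.1)
```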

    Classification results on the two 10-category datasets: The network architecture used in the experiment is based on nmODE,5 which does not have the concept of layers found in traditional networks. However, the number of neurons in the hidden layer of the model has a significant impact on the final performance. Here, we compare the effect of different numbers of neurons, namely 128, 256, 512, 1024, 2048, 4096, and 8192, on the model’s performance using the nmForwardLA learning algorithm on MNIST and CIFAR-10. As shown in Table 1 and Fig. 4(c), the test performance of the model on both datasets initially increases with the number of neurons and then decreases. On MNIST, the best performance is achieved with 4096 neurons, while on CIFAR-10, the best performance is achieved with 2048 neurons.

    Fig. 4. The analysis of loss during training and testing.

    Table 1. Accuracy of the proposed method using different numbers of neurons.

    Num. of Neur.    Num. of Para.    MNIST            CIFAR-10
    128              199,424          86.95% ± 0.23    47.32% ± 2.13
    256              199,424          89.56% ± 0.46    51.86% ± 1.61
    512              170,752          89.14% ± 0.48    56.27% ± 1.32
    1,024            797,696          99.11% ± 0.10    61.17% ± 0.66
    2,048            1,595,392        99.37% ± 0.18    62.34% ± 0.15
    4,096            3,190,784        99.53% ± 0.04    55.89% ± 1.56
    8,192            6,504,448        99.42% ± 0.09    55.80% ± 0.59

    We compared the performance of our proposed nmForwardLA algorithm with other forward algorithms on the two datasets, and the results are shown in Table 2. Among them, FF21 refers to Hinton’s forward–forward learning, FG47 refers to the forward gradient algorithm proposed by Baydin et al., and PEPITA (WM)17 refers to PEPITA trained using the recently proposed weight mirroring strategy. As shown in the table, our learning algorithm achieved better performance. In addition, we also compared our model with two other ODE-based network models, namely NODE3 and ANODE,4 and our model also achieved the best results among them.

    Table 2. Comparison with other algorithms.

    Methods          MNIST           CIFAR-10
    FF21             99.36%          59.00%
    FG47             90.91%          N/A
    PEPITA (WM)17    98.29% ± 0.13   56.33% ± 1.35
    NODE3            96.40%          53.70%
    ANODE4           98.20%          60.60%
    nmODE5           99.19% ± 0.10   N/A
    nmForwardLA      99.37% ± 0.18   62.34% ± 0.15

    Analysis of training and validation loss: Here, we analyze the changes in loss during the training process (Fig. 4). As shown in Fig. 4(a), we display the changes in loss over 1000 training epochs with K=10, so we recorded 10,000 observation values of the loss during MNIST training. The small graph in the upper right corner of Fig. 4(a) shows 200 selected observation values (the area between the two green lines). As shown in the figure, the loss oscillates locally within the K cycles but shows a downward trend overall. Under our learning algorithm, the network converges over the training epochs (Fig. 4(b)). We take the average loss of all batches within an epoch at time $K\bar{t}$ and plot the model’s loss values on the training and testing sets, as shown in Fig. 4(b). The model quickly and stably converges on both the training and testing sets.

    Classification results on the 100 categories dataset: We applied our proposed learning algorithm to more categories of classification tasks. We trained and validated it on the CIFAR-100 dataset, comparing it with several more biologically inspired learning algorithms. The relevant results are shown in Table 3. We also achieved optimal performance on the CIFAR-100 dataset.

    Table 3. Experimental results on CIFAR-100.

    Methods        CIFAR-100
    FA19           22.75% ± 0.28
    DRTP46         17.59% ± 0.18
    PEPITA16       27.04% ± 0.19
    nmForwardLA    28.27% ± 0.23

    4.2. Experimental analysis of efficiency

    Analysis of hyper-parameter K: Here, we mainly analyze the impact of the hyperparameter K of nmForwardLA on the final performance of the model. For K, we select the following values: 1, 3, 5, 10, 20, 50, and 100. The comparison indicators are accuracy and running efficiency. The corresponding results are shown in Table 4 and Fig. 4(d). The accuracy is highest at K=10 and decreases for larger K, whereas the running efficiency decreases monotonically as K increases. Compared with the changes in accuracy, the running efficiency shows a more significant variation with K.

    Table 4. Performance of the proposed method for different values of the hyper-parameter K.

    K                           1             3             5             10            20            50            100
    Acc.                        99.34%±0.07   99.56%±0.04   99.40%±0.06   99.60%±0.05   99.29%±0.10   99.12%±0.05   98.82%±0.03
    Efficiency (samples/sec)    11585.85      10315.61      9242.53       7383.78       4822.16       3667.46       2285.52

    Training efficiency comparison with other ODE-like methods: This section compares the forward learning algorithm proposed in this study with two other commonly used ODE learning algorithms. The specific processes of these algorithms were analyzed in Sec. 3.4. The dataset is MNIST, and the ODE solver relies on the torchdiffeq library.3 The hyperparameter $\bar{t}$ for odeint is set to 0.05. The batch size for network training is set to 256. We measure the processing speed during training, assessed in terms of samples processed per second (samples/sec). The corresponding statistics are shown in Table 5. The training speeds for the three algorithms are 4437.35, 5755.23, and 12027.13 samples/sec, respectively. Among them, the method proposed in this study has the fastest training speed.

    Table 5. Efficiency comparison on ODE methods.

    Methods                     NODE3,4    nmLA5      nmForwardLA
    Efficiency (samples/sec)    4437.35    5755.23    12027.13

    5. Discussion and Future Work

    The nmODE is a newly proposed neural network architecture,5 and related research has shown its effectiveness in tasks such as classification6 and segmentation.7 Like other ODE-based deep neural networks,3,4 nmODE suffers from significant computational overhead in practical applications, and when the dimensions of the ODEs are huge, direct solution computation may not be feasible. This paper presents a more efficient and biologically plausible learning algorithm for the nmODE network architecture. The proposed nmForwardLA algorithm completes inference and gradient computation in one forward pass. The proposed learning algorithm has been rigorously tested and validated on real datasets, including the widely used MNIST, CIFAR10, and CIFAR100. The experimental results demonstrate the algorithm’s correctness and effectiveness, further supporting its potential in practical applications. Additionally, the experiments analyze the impact of the hyperparameter K on the algorithm’s final performance and efficiency.

    Although this study provides evidence and validation of the effectiveness and correctness of the proposed forward algorithm, there is still much work to be done. First, the algorithm has only been preliminarily validated on classification data. Further validation in various scenarios,7,10,11,12,13,14,15 such as segmentation and computer-aided medical applications, is needed before practical use. Second, the algorithm derives its core formula using ODEs and is implemented using an ODE solver. Designing a faster and more accurate implementation for continuous neural networks remains a future research direction.

    6. Conclusion

    This study proposes a forward learning algorithm for nmODE, deriving its core mathematical formula (Eq. (12)) through precise mathematical derivation and providing specific implementation details. The algorithm has three advantages. First, it exhibits greater biological plausibility, being free of weight transport, nonlocality, frozen activation, and update locking. Second, during training, the algorithm requires only a single forward pass of the ODE solver to complete both the forward computation of the network model and the gradient updates. Compared with existing ODE network learning algorithms,3,4,5 it is faster, as verified through theoretical analysis and experimental results; we have implemented the algorithm and validated it on three classic classification datasets, demonstrating its correctness and potency. Third, existing methods reconstruct the observation states for gradient updates at the end of the forward computation, while our approach computes them in real time, ensuring accurate learning gradients.

    Novel continuous neural networks characterized by ordinary differential equations (ODEs) are gradually progressing. However, such models still need further improvement for practical applications because of the requirements of precise computation and operational efficiency. From a biological plausibility perspective, we are developing a concise algorithm to make breakthroughs. We hope our conceptual and technical contributions can lead to new continuous neural network algorithms achieved through simplicity, principled approaches, and less engineering.

    Acknowledgments

    This work was supported by the National Major Science and Technology Projects of China under Grant 2018AAA0100201, the National Natural Science Foundation of China under Grant 62106163, the Natural Science Foundation Project of Sichuan Province under Grant 2023YFG0283, and the CAAI-Huawei MindSpore Open Fund under Grant 21H1235.

    ORCID

    Xiuyuan Xu  https://orcid.org/0000-0002-2505-4350

    Haiying Luo  https://orcid.org/0009-0004-7005-4490

    Zhang Yi  https://orcid.org/0000-0002-5867-9322

    Haixian Zhang  https://orcid.org/0000-0002-9821-508X