A Forward Learning Algorithm for Neural Memory Ordinary Differential Equations

    https://doi.org/10.1142/S0129065724500485

    Abstract

    The deep neural network, based on the backpropagation learning algorithm, has achieved tremendous success. However, the backpropagation algorithm is widely considered biologically implausible. Many efforts have recently been made to address these biological implausibility issues; nevertheless, these methods are tailored to discrete neural network structures. Continuous neural networks are crucial for investigating novel neural network models with more biologically plausible dynamical characteristics and for the interpretability of large language models. The neural memory ordinary differential equation (nmODE) is a recently proposed continuous neural network model that exhibits several intriguing properties. In this study, we present a forward learning algorithm, called nmForwardLA, for nmODE. This algorithm has lower computational dimensionality and greater efficiency. Compared with other learning algorithms, experimental results on MNIST, CIFAR10, and CIFAR100 demonstrate its potency.

    1. Introduction

    The introduction of residual modules addresses issues such as gradient vanishing in deep neural networks, removing the limitation that depth imposes on network scale.1 As a result, large language model (LLM)2 technology has experienced rapid development in recent years. As network models expand in depth following the pattern of residual blocks, they gradually transition from a discrete structure toward a continuous one.3 Thus, we believe that the continuous form of neural networks is of significant importance for proposing novel and more powerful models and for investigating the interpretability of current LLMs.

    In recent years, neural ordinary differential equations (neuralODEs)3,4,5 have garnered extensive attention in the field of deep learning. NeuralODEs are described by ordinary differential equations, and they offer many advantages over traditional multi-layer neural networks, such as memory efficiency, adaptive computation, parameter efficiency, and more. Furthermore, neuralODEs are suitable for solving traditional deep-learning problems such as classification6 and segmentation.7

    The learning algorithms for current neuralODE models are inspired by the backpropagation (BP) algorithm.8,9 Despite the significant success of the BP algorithm in various fields10,11,12,13,14,15 of artificial intelligence, it is still considered to lack biological plausibility, as the brain does not perform symmetric backward connections or synchronized computations. From an engineering perspective, BP scales poorly to massive neuralODEs and restricts potential hardware designs.

    Existing biologically plausible approaches mainly address weight transport, nonlocality, activation freezing, and update locking.16 These methods can be broadly categorized into two classes based on whether a backward feedback path is required.16,17,18,19,20,21 Almost all of these methods are proposed for discrete neural networks; however, compared with discrete neural networks, continuous networks with dynamical characteristics may inherently possess greater biological plausibility. Forward learning algorithms based on continuous neural networks support real-time learning and offer superior computational efficiency.

    The recently proposed neural memory ordinary differential equation (nmODE) is a novel continuous artificial neural network model.5 The model is described in continuous form using ODEs and possesses a simple, special structure that separates learning from memory, endowing it with several interesting properties. First, it has a global attractor for each external input. Second, nmODE does not suffer from the restriction, common in most existing neuralODEs, of learning only features homeomorphic to the input data space.4 A learning algorithm called nmLA has been developed for nmODE, but it still relies on a backpropagation mechanism.5

    In this study, we propose another learning algorithm for the nmODE. This learning algorithm calculates gradients in a forward manner and does not require backpropagation of gradients, making it more biologically plausible. We evaluate the proposed learning algorithm primarily on supervised image classification datasets, including MNIST,22 CIFAR10, and CIFAR100,23 and show that it outperforms other algorithms.

    The primary contributions of this study include the following:

    A more biologically plausible learning algorithm, called nmForwardLA, is proposed for nmODE, free of the weight transport, nonlocality, frozen activity, and update locking issues. We derive its core mathematical formula (Eq. (12)) through precise mathematical derivation.

    This method involves only one forward pass of a two-dimensional ODE, enabling real-time inference and gradient calculation for updates. Compared with the other neuralODE training algorithms, it has a more straightforward computation and lower space complexity.

    We validate the feasibility and correctness of our proposed method on three classic datasets, achieving competitive results compared with other learning algorithms.

    2. Related Works

    Deep neural networks’ success stems from powerful, nonlinearly expressive network architectures and simple yet effective learning algorithms. This section reviews relevant work on neural network architectures and biologically plausible learning methods.

    2.1. Neural network architecture

    Neural networks were invented in the 1950s, and artificial neural network models have since achieved significant success in various fields. Their architectures can be divided into discrete and continuous.

    Discrete neural networks The industry’s most commonly used and highly successful neural networks are discrete neural networks. The earliest ones are single-layer networks,24 such as perceptrons, which easily handle linearly separable problems but cannot solve linearly inseparable problems, such as the XOR problem.25 To address this issue, later researchers added one or more hidden layers to the perceptron, allowing the network model to handle more general functional relationships. These models are typically referred to as multi-layer perceptrons (MLPs).26 Inspired by the cat’s visual system, Kunihiko Fukushima introduced a multi-layer neural network with convolution and subsampling operators on top of the multi-layer perceptron, called the Neocognitron.27 LeCun built upon this foundation and invented Convolutional Neural Networks (CNNs), achieving significant success in handwritten digit recognition.28 Hinton et al. employed a layer-wise pretraining method, initially pretraining each layer’s parameters and then fine-tuning them using backpropagation (BP), thus effectively training deep neural networks.29 The first modern deep convolutional neural network model, AlexNet,30 marked the beginning of a breakthrough in deep learning technology, particularly in image classification. In the past decade, deep neural networks have continuously expanded in scale, focusing on increasing depth.31,32,33 The introduction of ResNet effectively addressed the issue of gradient vanishing, enabling neural networks to be extended to greater depths.1 In recent years, models have grown increasingly large, and the transformer architecture has unified effective semantic learning for text and image features.2

    Continuous neural networks Models like residual networks,1 transformer networks,2 and normalizing flows construct intricate transformations by sequentially composing a series of transformations of a hidden state. As the number of layers in a network model increases and the time steps become smaller, in the limit the continuous dynamics of the hidden units can be parametrized by an ordinary differential equation (ODE) specified by a neural network. Many successfully applied neural network models, including ResNet1 and the Transformer,2 are essentially particular forms of discretized neuralODEs. As the network scale increases, these models will likely converge toward neuralODEs. Thus, we believe that researching continuous forms of neural networks is of significant importance for proposing novel and more powerful models and investigating the interpretability of current large language models (LLMs).

    The research on continuous artificial neural networks mainly focuses on the following two aspects. (1) Many recent works have proposed learning differential equations from data. Some work trained feed-forward or recurrent neural networks to approximate a differential equation,34,35 with applications such as fluid simulation.36 There was also significant work on connecting Gaussian processes (GPs) with ODE solvers, and GPs were adapted to fit differential equations.37 Stochastic variational inference was utilized to recover the solution of a given stochastic differential equation.38 (2) Another category of work mainly focused on implementing differentiation through ODE solvers.39,40 The neuralODE was first proposed in Ref. 3 and trained directly through an ODE solver end-to-end. This work highlighted the potential of a general integration of black-box ODE solvers into automatic differentiation for deep learning and generative modeling. It was proved that neuralODEs using the initial value as data input can learn only features homeomorphic to the input data space.4 Dupont et al. augmented the networks with additional dimensions to address this issue; however, additional computation is required.4 Neither method needed to store intermediate states during the forward computation; instead, they achieved optimal storage efficiency by dynamically reconstructing these states during the backward process for gradient computation. Yi constructed a novel type of continuous neural network with evident dynamical characteristics and a global attractor, without additional computational costs.5 Additionally, Yi proposed a novel learning algorithm for nmODE, called nmLA, which is decoupled into a three-dimensional adjoint equation, thereby significantly enhancing computational efficiency.5

    The nmODE is a more biologically plausible network architecture with nonlinear expressive capabilities and attractor-based memory. While nmLA dramatically reduces the complexity of ODE training through a decoupled three-dimensional ODE solver, it still requires a reverse process, which leads to weight transport and update locking issues.

    2.2. Biologically plausible learning methods

    Since the era of perceptrons, developing learning algorithms for neural networks, particularly those that biological brains could implement, has been a central focus. Dellaferrera and Kreiman16 systematically summarize biologically plausible learning algorithms.

    Learning with a feedback path The backpropagation algorithm consists of both a forward and a backward pass.8,9 The signal progresses from the input to the output in the forward pass, and the error is determined by the difference between the model’s output and the target. During the backward pass, the error travels from the output layer back to the input layer through the same weights used in the forward pass, giving rise to the weight symmetry issue. Recently, various learning algorithms have been suggested to alleviate the constraint imposed by symmetric weights. Neural networks can learn effectively when the learning signal is backpropagated through connections that share only the sign, not the magnitude, of the feedforward weights (the sign symmetry algorithm).18,41 Alternatively, effective learning was accomplished through fixed random connections, called feedback alignment (FA).19 The effectiveness of FA suggested that the modulatory learning signals conveyed by fixed random matrices are valuable. This is due to the alignment of the forward weights with the fixed backward matrices during the update process.19 Furthermore, Nøkland illustrated that learning signals can be transferred effectively even through random feedback matrices.20 Akrout et al.42 proposed a neural circuit called a weight mirror to fine-tune FA’s initially random feedback weights and improve their alignment with the forward path, resulting in enhanced performance compared with FA. However, it is worth noting that these approaches all necessitate a feedback path to guide the model’s learning and thus cannot address the nonlocality, update locking, and activity freezing issues.16

    Learning without a feedback path Recent studies have employed methods based on local error handling to address the update locking issue, proving effective in training neural networks.43,44,45 These methods trained each layer independently by employing auxiliary fixed random classifiers. However, they still faced weight transport issues at the classifier level and necessitated a significantly greater computational workload. To achieve the same objective, Frenkel et al. proposed a method named DRTP that utilized fixed random learning signals to update parameters using labels instead of the network error.46 This strategy successfully overcame the weight symmetry and update locking issues while incurring no additional computational overhead. However, it still required frozen activity while the learning information propagated through the network. In addition, DRTP exhibited a more noticeable decline in performance than the BP and FA algorithms. More recently, a category of training schemes known as “forward-only” algorithms has emerged,16,21,47 replacing the backward path with another forward one. This category includes the Forward–Forward algorithm (FF)21 and PEPITA.16,17 Both algorithms involve presenting a clean input sample during the initial forward pass. In the case of FF, the second forward pass introduces a corrupted data sample obtained by merging different samples with masks. In contrast, PEPITA modulates the input in the second forward pass with information about the error from the first forward pass. Some other original methods were designed for specific networks with greater biological plausibility. Clark et al. proposed a method, called GEVB, which does not transfer learning information by fixed random weights and resolves the weight transport and locality issues.48 Ren et al.49 proposed a forward gradient learning algorithm inspired by the design of MLPMixer50 to compute a noisy directional gradient. Chen et al. introduced a learning method that treats the ODE solver as a black box and computes gradients using the adjoint sensitivity method.3 This method does not require freezing intermediate states during the forward process but dynamically recalculates these results in real time during the backward pass. However, the algorithm still relies on a globally defined cost function over time. Yi proposed nmLA to train nmODE, sampling at each observation time and updating weights in real time, further addressing the nonlocality issue.5 While neither method requires storing the activation values used during the forward computation, both still involve a backward process that uses the same weights as the forward process, thus not addressing the issue of weight transport.

    Most current biologically plausible learning methods revolve around discrete neural network architectures. Only a few studies have recently begun to explore learning algorithms for continuous neural network architectures, and they have yet to fully address the biological implausibility issues.

    3. The Proposed nmForwardLA

    3.1. nmODE

    The traditional mathematical model of a neuron can be described simply by

    $$p = f(\gamma), \tag{1}$$
    where $\gamma$ denotes the total input to the neuron and $f$ denotes an activation function. This model is a simple static mapping from the total input $\gamma$ to the output $p$. However, in the brain, electrical activity flows continuously among neurons, so the traditional neuron model is a greatly simplified mathematical model. In Ref. 5, a new neuron model described by a one-dimensional ODE is proposed:
    $$\dot{p} = -p + \sin^2(p + \gamma), \tag{2}$$
    where $p$ denotes the state of the neuron. This model generalizes the traditional neuron model (1) from a static mapping to a dynamical system (2).

    Utilizing the proposed neuron model, a novel continuous neural network model, referred to as nmODE (neural memory ordinary differential equation), is introduced in Ref. 5 by

    $$\begin{cases} \dot{y}(t) = -y(t) + \sin^2[\,y(t) + \gamma\,], \\ \gamma = W^{(1)} x + b, \end{cases} \tag{3}$$
    where $x \in \mathbb{R}^m$ denotes the external input, $y(t) \in \mathbb{R}^n$ denotes the state of the memory neurons at time $t$, $W^{(1)} = (w^{(1)}_{ij}) \in \mathbb{R}^{n \times m}$ is the connection weight matrix, and $b \in \mathbb{R}^n$ denotes the bias. This model is special in that the dynamics of the memory neurons are uncoupled, which makes the dynamics of nmODE extremely simple and clear. To enable the nmODE to learn efficiently, a set of decision-making neurons, denoted by $a = [a_1, \ldots, a_r]^T \in \mathbb{R}^r$, is introduced as
    $$\begin{cases} a_i(t) = s(z_i(t)), \\ z_i(t) = \sum_{j=1}^{n} w^{(2)}_{ij}\, y_j(t), \end{cases} \tag{4}$$
    where $W^{(2)} = (w^{(2)}_{ij}) \in \mathbb{R}^{r \times n}$ is another connection weight matrix for learning, and $s(\cdot)$ is the softmax activation function; see Fig. 1 for an illustration.

    Fig. 1. The architecture of the nmODE network.
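
    To make this structure concrete, the following is a minimal PyTorch sketch of Eqs. (3) and (4), not the authors' released code: the module names, tensor shapes, and the use of torchdiffeq.odeint (the solver library of Ref. 3) are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint  # black-box ODE solver, as in Ref. 3


class NmODEFunc(nn.Module):
    """Memory dynamics of Eq. (3): dy/dt = -y + sin^2(y + gamma)."""

    def __init__(self, in_dim, mem_dim):
        super().__init__()
        self.W1 = nn.Linear(in_dim, mem_dim)  # gamma = W1 x + b
        self.gamma = None                     # set per batch before integration

    def set_input(self, x):
        self.gamma = self.W1(x)               # the external input enters only through gamma

    def forward(self, t, y):
        return -y + torch.sin(y + self.gamma) ** 2


class NmODENet(nn.Module):
    """nmODE memory block followed by the decision neurons of Eq. (4)."""

    def __init__(self, in_dim, mem_dim, num_classes, t_bar=1.0):
        super().__init__()
        self.func = NmODEFunc(in_dim, mem_dim)
        self.W2 = nn.Linear(mem_dim, num_classes, bias=False)  # z = W2 y
        self.t_bar = t_bar

    def forward(self, x):
        self.func.set_input(x)
        y0 = torch.zeros(x.size(0), self.func.W1.out_features, device=x.device)
        t = torch.tensor([0.0, self.t_bar], device=x.device)
        y_bar = odeint(self.func, y0, t)[-1]          # y(t_bar)
        return torch.softmax(self.W2(y_bar), dim=-1)  # a = softmax(z)
```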

    3.2. The learning rule

    We present a novel forward learning algorithm for the nmODE. Define

    $$\gamma_i = \sum_{j=1}^{m} w^{(1)}_{ij} x_j + b_i,$$
    then, we can rewrite the nmODE as
    $$\dot{y}_i(t) = -y_i(t) + \sin^2[\,y_i(t) + \gamma_i\,], \qquad i = 1, \ldots, n.$$
    Let the network learn a target $d_i$ $(i = 1, \ldots, r)$. Given a fixed time $\bar{t}$, a cost function $J(\bar{t})$ can be constructed from the output $a_i(\bar{t})$ together with the learning target $d_i$. The gradients with respect to the weights $W^{(1)}$ can be calculated as follows:
    $$\frac{\partial J(\bar{t})}{\partial w^{(1)}_{ij}} = \frac{\partial J(\bar{t})}{\partial y_i(\bar{t})}\, \frac{\partial y_i(\bar{t})}{\partial w^{(1)}_{ij}} \tag{5}$$
    $$= \delta_{y_i}(\bar{t})\, \frac{\partial y_i(\bar{t})}{\partial w^{(1)}_{ij}} \tag{6}$$
    $$= \delta_{y_i}(\bar{t})\, \frac{\partial y_i(\bar{t})}{\partial (w^{(1)}_{ij} x_j)}\, x_j, \tag{7}$$
    where $\delta_{y_i}(\bar{t}) = \partial J(\bar{t})/\partial y_i(\bar{t})$, and
    $$\frac{\partial J(\bar{t})}{\partial b_i} = \delta_{y_i}(\bar{t})\, \frac{\partial y_i(\bar{t})}{\partial b_i}. \tag{8}$$
    Next, we need $\partial y_i(\bar{t})/\partial (w^{(1)}_{ij} x_j)$ and $\partial y_i(\bar{t})/\partial b_i$. They cannot be calculated directly, so we establish a set of ODEs to solve this problem. Using
    $$\dot{y}_i(t) = -y_i(t) + \sin^2[\,y_i(t) + \gamma_i\,], \tag{9}$$
    and noting that the initial value $y_i(0)$ is independent of $w^{(1)}_{ij} x_j$ and $b_i$, taking the partial derivative of both sides of the equation, we get
    $$\begin{cases} \dfrac{d}{dt}\!\left[\dfrac{\partial y_i(t)}{\partial (w^{(1)}_{ij} x_j)}\right] = -\dfrac{\partial y_i(t)}{\partial (w^{(1)}_{ij} x_j)} + \sin\!\big[2\big(y_i(t) + \gamma_i\big)\big]\left[1 + \dfrac{\partial y_i(t)}{\partial (w^{(1)}_{ij} x_j)}\right], \\[2mm] \dfrac{d}{dt}\!\left[\dfrac{\partial y_i(t)}{\partial b_i}\right] = -\dfrac{\partial y_i(t)}{\partial b_i} + \sin\!\big[2\big(y_i(t) + \gamma_i\big)\big]\left[1 + \dfrac{\partial y_i(t)}{\partial b_i}\right], \\[2mm] \dfrac{\partial y_i(0)}{\partial (w^{(1)}_{ij} x_j)} = 0, \qquad \dfrac{\partial y_i(0)}{\partial b_i} = 0. \end{cases} \tag{10}$$
    Solving this ODE from $t = 0$ to $t = \bar{t}$, we obtain $\partial y_i(\bar{t})/\partial (w^{(1)}_{ij} x_j)$ and $\partial y_i(\bar{t})/\partial b_i$. Using gradient descent, we obtain the following weight learning rule:
    $$\begin{cases} w^{(2)}_{ij} \leftarrow w^{(2)}_{ij} - \alpha\, \delta_{z_i}(\bar{t})\, y_j(\bar{t}), \\[1mm] w^{(1)}_{ij} \leftarrow w^{(1)}_{ij} - \beta\, \delta_{y_i}(\bar{t})\, \dfrac{\partial y_i(\bar{t})}{\partial (w^{(1)}_{ij} x_j)}\, x_j, \\[1mm] b_i \leftarrow b_i - \beta\, \delta_{y_i}(\bar{t})\, \dfrac{\partial y_i(\bar{t})}{\partial b_i}, \end{cases} \tag{11}$$
    where $\alpha$ and $\beta$ are learning rates and $\delta_{z_i}(\bar{t}) = \partial J(\bar{t})/\partial z_i(\bar{t})$.
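
    As a minimal vectorized illustration of rule (11), the sketch below applies one update once the terminal state and sensitivities are available. It assumes plain tensors $W^{(1)} \in \mathbb{R}^{n\times m}$, $W^{(2)} \in \mathbb{R}^{r\times n}$ and a single sample; the function name and argument names are placeholders, not the paper's code.

```python
import torch


def nmforward_update(W1, b, W2, x, y_bar, delta_y, delta_z, dy_dwx, dy_db,
                     alpha=1e-5, beta=1e-5):
    """One application of the learning rule (11), for a single sample.

    delta_z : dJ/dz at time t_bar, shape (r,)
    delta_y : dJ/dy at time t_bar, shape (n,)
    dy_dwx  : dy_i(t_bar)/d(w_ij x_j), shape (n,) (depends only on i)
    dy_db   : dy_i(t_bar)/db_i,        shape (n,)
    """
    with torch.no_grad():
        # w2_ij <- w2_ij - alpha * delta_z_i * y_j(t_bar)
        W2 -= alpha * torch.outer(delta_z, y_bar)
        # w1_ij <- w1_ij - beta * delta_y_i * dy_i/d(w_ij x_j) * x_j
        W1 -= beta * torch.outer(delta_y * dy_dwx, x)
        # b_i  <- b_i  - beta * delta_y_i * dy_i/db_i
        b -= beta * delta_y * dy_db
    return W1, b, W2
```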

    Directly solving the ODE (10) is not easy in practical applications, since its dimension can be very high. Fortunately, these equations are uncoupled in their variables, so we can define another ODE, called the adjODE, as

    $$\begin{cases} \dot{p} = -p + \sin^2(p + \gamma), \\ \dot{q} = -q + \sin\!\big[2(p + \gamma)\big](1 + q). \end{cases} \tag{12}$$
    The adjODE (12) can be viewed as a basic building block for solving (10). In the adjODE, $p$ represents the state of a neuron and $q$ can be viewed as the corresponding gradient (sensitivity) variable.
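
    A minimal sketch of how the adjODE (12) could be integrated: the state $p$ and the sensitivity $q$ are stacked and solved jointly in a single forward pass. The helper name make_adj_ode, the dimensions, and the use of torchdiffeq.odeint are illustrative assumptions, not the paper's implementation.

```python
import torch
from torchdiffeq import odeint  # black-box ODE solver, as in Ref. 3


def make_adj_ode(gamma):
    """Right-hand side of the adjODE (12) for a fixed external input gamma."""
    def adj_ode_rhs(t, state):
        p, q = state
        dp = -p + torch.sin(p + gamma) ** 2                 # nmODE dynamics
        dq = -q + torch.sin(2.0 * (p + gamma)) * (1.0 + q)  # sensitivity dynamics
        return torch.stack([dp, dq])
    return adj_ode_rhs


# usage: one forward solve yields both y(t_bar) and dy(t_bar)/d(w x) (= dy/db)
gamma = torch.randn(1024)          # gamma = W1 x + b for one sample, 1024 memory neurons
state0 = torch.zeros(2, 1024)      # p(0) = 0 and q(0) = 0
t = torch.tensor([0.0, 1.0])       # integrate from 0 to t_bar (here t_bar = 1.0)
p_bar, q_bar = odeint(make_adj_ode(gamma), state0, t)[-1]
```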

    3.3. Implementation of nmForwardLA

    The above gives a learning method for the nmODE at time $\bar{t}$. To make the nmODE learn efficiently, it should learn iteratively over many time steps, where $K$ denotes the number of steps of length $\bar{t}$. Algorithm 1 shows detailed pseudocode, and a training-step sketch is given below.
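
    The following is a hedged sketch (not the paper's Algorithm 1) of how inference and learning might be interleaved over the $K$ time steps, reusing nmforward_update from the sketch after Eq. (11). The attributes model.W1, model.b, model.W2 and the helper model.adj_solve are illustrative assumptions, and the delta terms follow Eqs. (5)-(8) for a softmax cross-entropy cost evaluated at each observation time.

```python
import torch
import torch.nn.functional as F


def train_step(model, x, target, K=10, alpha=1e-5, beta=1e-5):
    """One nmForwardLA pass over K successive intervals of length t_bar (single sample).

    `model` is assumed to expose plain tensors W1 (n x m), b (n,), W2 (r x n) and an
    `adj_solve(gamma, p0, q0)` helper that integrates the adjODE (12) over one
    interval; these names are assumptions, not the authors' API.
    """
    p = torch.zeros(model.W1.size(0))          # memory state y(0)
    q = torch.zeros_like(p)                    # sensitivity dy/d(w x) = dy/db
    for k in range(K):
        gamma = model.W1 @ x + model.b         # gamma = W1 x + b (re-evaluated after updates)
        p, q = model.adj_solve(gamma, p, q)    # forward solve of Eq. (12) over one interval
        z = model.W2 @ p                       # decision neurons, Eq. (4)
        loss = F.cross_entropy(z.unsqueeze(0), target.unsqueeze(0))
        delta_z = torch.softmax(z, dim=-1) - F.one_hot(target, z.numel()).float()
        delta_y = model.W2.T @ delta_z         # dJ/dy at the current observation time
        # learning rule (11), applied at every observation time; whether q is
        # reset between intervals is an implementation choice not fixed by the text.
        nmforward_update(model.W1, model.b, model.W2, x, p,
                         delta_y, delta_z, q, q, alpha=alpha, beta=beta)
    return loss.item()
```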

    3.4. Analysis of the advantages of nmForwardLA

    We discuss the differences among three neuralODE learning algorithms, illustrating their inference and learning processes and their biological plausibility features. As shown in Fig. 3(a), Chen et al.3 introduce a reverse-mode augmented ordinary differential equation (ODE) for computing gradients of a scalar-valued loss with respect to any of the ODE solver’s inputs. This method comprises two processes: a forward computation and a backward computation. The forward computation no longer necessitates storing intermediate results; instead, a reverse ODE is used to reconstruct the intermediate states required for computing gradients. This enables training models with a constant memory cost and addresses a significant bottleneck in training deep models. However, the method necessitates the simultaneous computation of four high-dimensional adjoint sensitivity equations, still retaining high time and space complexity. nmLA5 is the learning algorithm proposed for nmODE. Due to the exquisitely designed structure of nmODE, this method can be decoupled into three low-dimensional adjoint sensitivity equations. This approach has better time and space complexity than the first method.3 Compared with these two methods, the nmForwardLA proposed in this study exhibits superior optimization performance, requiring only a single forward pass of the adjoint equations to compute both the inference state and the gradients directly.

    Fig. 2. An illustration for network learning iteratively in many time steps (K).

    Fig. 3. Intuitive comparison of three learning algorithms for neural ordinary differential equations: (a) reverse-mode augmented ODE learning algorithm, (b) nmLA, and (c) nmForwardLA. The top row presents an intuitive illustration of the forward computation and learning process. The middle row provides detailed descriptions of the forward and backward computation processes. The bottom row shows the resolution status of the four biologically implausible issues. Figure 2 shows an intuitive illustration. Theoretically, the proposed method offers better efficiency, less error, and greater biological plausibility.

    The first two methods in the figure do not require recording intermediate states during the forward process. They rely solely on the final state of the forward process to reconstruct the intermediate states needed for computing gradients. Therefore, these two methods do not suffer from the issue of frozen activity. In both of these methods, weight updates are based on the forward pass and do not need to wait for the completion of the backward pass; therefore, these two methods only partially address the issue of update locking. Compared with the first method, nmLA computes a cost function that is local in time, thus adhering to the locality rule in terms of biological plausibility. However, both of these methods use the same weights W in the forward and backward computation processes, hence also encountering the issue of weight transport. The nmForwardLA proposed in this study is a perpetually forward method in time, with no reverse process; therefore, it does not suffer from the issue of weight transport. Additionally, both the inference and learning processes are real-time, eliminating the issues of update locking and activity freezing. Furthermore, this method also adheres to the locality rule in time.

    Further analysis is provided in Fig. 3. Methods (a) and (b) reconstruct the observation states required for gradient updates at the termination time of the forward computation. Our approach, by contrast, computes the observation states needed for calculating gradients in real time during the forward computation. The observation values obtained by the former methods may contain reconstruction errors compared with the actual observation values, while the observations we obtain are the actual ones. Therefore, we can obtain more accurate learning gradients.

    4. Experimental Results

    In this section, we aim to demonstrate the effectiveness of our proposed learning algorithm (nmForwardLA) through a set of experiments. First, we present the results of our experiments on the widely known MNIST22 classification dataset. Second, we showcase the results of our proposed forward learning method on CIFAR-10 and CIFAR-100.23 We implemented the proposed method using PyTorch 1.11.0 with Python 3.7.4. The experiments were conducted on an Ubuntu Server 18.04 machine equipped with a Titan RTX (24GB) GPU using CUDA 10.2.

    4.1. Image classification experiments

    Dataset: MNIST is a widely used dataset that serves as a benchmark for testing newly created neural network models. The dataset consists of 50,000 training images and 10,000 testing images, each of which is 28×28 pixels in size. CIFAR-10, on the other hand, is a natural image dataset primarily utilized in the field of computer vision. This dataset contains a total of 60,000 32×32 color images, distributed among 10 categories. CIFAR-100 is another natural image dataset and also contains a total of 60,000 32×32 color images, distributed among 100 categories.

    Experimental settings: We utilized an nmODE5 block and implemented Eq. (12) using the torchdiffeq framework.3 To do this, we re-implemented the backward and forward methods using PyTorch. We utilized Adam51 as the optimizer for the model, with an initial learning rate of 1e-5. The batch size is set to 256. The total number of training epochs is set to 3000, and the best parameters were saved based on the test set results. Every 1000 training epochs, the learning rate was reduced by a factor of ten, and the best model parameters were reloaded to continue training. The MNIST dataset generally converges to its optimum within 1000 epochs, while the CIFAR-10 dataset is relatively difficult to converge and still shows a downward trend beyond 1000 epochs. The selection of hyperparameters, such as K and the number of hidden neurons, requires further analysis for our proposed learning algorithm; a minimal configuration sketch is given below.
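
    For concreteness, the optimizer and schedule described above could be set up roughly as follows; the stand-in linear model is only a placeholder for the nmODE network, and the per-1000-epoch checkpoint reloading is not shown.

```python
import torch
from torch import nn

# hyperparameters as described in the text
LR0, BATCH_SIZE, EPOCHS = 1e-5, 256, 3000

model = nn.Linear(784, 10)  # stand-in for the nmODE network used in the experiments
optimizer = torch.optim.Adam(model.parameters(), lr=LR0)
# every 1000 training epochs the learning rate is reduced by a factor of ten
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.1)
```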

    Classification results on the two 10-category datasets: The network architecture used in the experiment is based on nmODE,5 which does not have the concept of layers found in traditional networks. However, the number of neurons in the hidden layer of the model has a significant impact on the final performance. Here, we compare the effect of different numbers of neurons, namely 128, 256, 512, 1024, 2048, 4096, and 8192, on the model’s performance using the nmForwardLA learning algorithm on MNIST and CIFAR-10. As shown in Table 1 and Fig. 4(c), the test performance of the model on both datasets initially increases with the number of neurons and then decreases. On MNIST, the best performance is achieved with 4096 neurons, while on CIFAR-10, the best performance is achieved with 2048 neurons.

    Fig. 4. The analysis of loss during training and testing.

    Table 1. Accuracy of the proposed method using different numbers of neurons.

    Num. of Neur.    Num. of Para.    MNIST            CIFAR-10
    128              199,424          86.95% ± 0.23    47.32% ± 2.13
    256              199,424          89.56% ± 0.46    51.86% ± 1.61
    512              170,752          89.14% ± 0.48    56.27% ± 1.32
    1,024            797,696          99.11% ± 0.10    61.17% ± 0.66
    2,048            1,595,392        99.37% ± 0.18    62.34% ± 0.15
    4,096            3,190,784        99.53% ± 0.04    55.89% ± 1.56
    8,192            6,504,448        99.42% ± 0.09    55.80% ± 0.59

    We compared the performance of our proposed nmForwardLA algorithm with other forward algorithms on the two datasets, and the results are shown in Table 2. Among them, FF21 refers to Hinton’s forward–forward learning, FG47 refers to the forward gradient algorithm proposed by Baydin et al., and PEPITA (WM)17 refers to PEPITA trained using the recently proposed weight mirroring strategy. As shown in the table, our learning algorithm achieved better performance. In addition, we also compared our model with two other ODE-based network models, namely NODE3 and ANODE,4 and our model also achieved the best results among them.

    Table 2. Comparison with other algorithms.

    Methods          MNIST           CIFAR-10
    FF21             99.36%          59.00%
    FG47             90.91%          N/A
    PEPITA (WM)17    98.29% ± 0.13   56.33% ± 1.35
    NODE3            96.40%          53.70%
    ANODE4           98.20%          60.60%
    nmODE5           99.19% ± 0.10   N/A
    nmForwardLA      99.37% ± 0.18   62.34% ± 0.15

    Analysis of training and validation loss: Here, we analyze the changes in loss during the training process (Fig. 4). As shown in Fig. 4(a), we display the changes in loss over 1000 training epochs with K=10, so we recorded 10,000 observation values of the loss during MNIST training. The small graph in the upper right corner of Fig. 4(a) shows 200 selected observation values (the area between the two green lines). As shown in the figure, the loss oscillates locally within the K cycles but shows a downward trend overall. Under our learning algorithm, the network converges over the training epochs (Fig. 4(b)). We take the average loss of all batches within an epoch at time $K\bar{t}$ and plot the model’s loss values on the training and testing sets, as shown in Fig. 4(b). The model quickly and stably converges on both the training and testing sets.

    Classification results on the 100 categories dataset: We applied our proposed learning algorithm to more categories of classification tasks. We trained and validated it on the CIFAR-100 dataset, comparing it with several more biologically inspired learning algorithms. The relevant results are shown in Table 3. We also achieved optimal performance on the CIFAR-100 dataset.

    Table 3. Experimental results on CIFAR-100.

    Methods        CIFAR-100
    FA19           22.75% ± 0.28
    DRTP46         17.59% ± 0.18
    PEPITA16       27.04% ± 0.19
    nmForwardLA    28.27% ± 0.23

    4.2. Experimental analysis of efficiency

    Analysis of hyper-parameter K: Here, we mainly analyze the impact of the hyperparameter K of nmForwardLA on the final performance of the model. For K, we select the following values: 1, 3, 5, 10, 20, 50, and 100. The comparison indicators are accuracy and running efficiency. The corresponding results are shown in Table 4 and Fig. 4(d). The accuracy is highest at K=10 and decreases for larger K, whereas the running efficiency decreases monotonically as K increases. Compared with the changes in accuracy, the running efficiency shows a more significant variation with K.

    Table 4. Performance of the proposed method for different values of the hyper-parameter K.

    K                           1             3             5             10            20            50            100
    Acc.                        99.34%±0.07   99.56%±0.04   99.40%±0.06   99.60%±0.05   99.29%±0.10   99.12%±0.05   98.82%±0.03
    Efficiency (samples/sec)    11585.85      10315.61      9242.53       7383.78       4822.16       3667.46       2285.52

    Training efficiency comparison with other ODE-like methods: This section compares the forward learning algorithm proposed in this study with two other commonly used ODE learning algorithms. The specific processes of these algorithms were analyzed in Sec. 3.4. The dataset is MNIST, and the ODE solver relies on the torchdiffeq library.3 The hyperparameter $\bar{t}$ for odeint is set to 0.05. The batch size for network training is set to 256. We measure the processing speed during training, assessed in terms of samples processed per second (samples/sec). The corresponding statistics are shown in Table 5. The training speeds for the three algorithms are 4437.35, 5755.23, and 12027.13 samples/sec, respectively. Among them, the method proposed in this study has the fastest training speed.

    Table 5. Efficiency comparison on ODE methods.

    Methods                     NODE3,4    nmLA5      nmForwardLA
    Efficiency (samples/sec)    4437.35    5755.23    12027.13

    5. Discussion and Future Work

    The nmODE is a newly proposed neural network architecture,5 and related research has shown its effectiveness in tasks such as classification6 and segmentation.7 Like other ODE-based deep neural networks,3,4 nmODE suffers from significant computational overhead in practical applications, and when the dimensions of the ODEs are huge, direct solution computation may not be feasible. This paper presents a more efficient and biologically plausible learning algorithm for the nmODE network architecture. The proposed nmForwardLA algorithm completes inference and gradient computation in one forward pass. The proposed learning algorithm has been rigorously tested and validated on real datasets, including the widely used MNIST, CIFAR10, and CIFAR100. The experimental results demonstrate the algorithm’s correctness and effectiveness, further supporting its potential in practical applications. Additionally, the experiments analyze the impact of the hyperparameter K on the algorithm’s final performance and efficiency.

    Although this study provides evidence and validation of the effectiveness and correctness of the proposed forward algorithm, there is still much work to be done. First, the algorithm has only been preliminarily validated on classification data. Further validation in various scenarios,7,10,11,12,13,14,15 such as segmentation and computer-aided medical applications, is needed before practical use. Second, the algorithm derives its core formula using ODEs and is implemented using an ODE solver. Designing a faster and more accurate implementation for continuous neural networks remains a future research direction.

    6. Conclusion

    This study proposes a forward learning algorithm for nmODE, deriving its core mathematical formula (Eq. (12)) through precise mathematical derivation and providing specific implementation details. The algorithm has three advantages. First, it exhibits greater biological plausibility, being free of weight transport, nonlocality, frozen activation, and update locking. Second, during training, the algorithm requires only a single forward pass of the ODE solver to complete both the forward computation of the network model and the gradient updates. Compared with existing ODE network learning algorithms,3,4,5 it is faster, as verified through theoretical analysis and experimental results; we have implemented the algorithm and validated it on three classic classification datasets, demonstrating its correctness and potency. Third, existing methods reconstruct the observation states for gradient updates at the end of the forward computation, while our approach computes them in real time, ensuring accurate learning gradients.

    Novel continuous neural networks characterized by ordinary differential equations (ODEs) are gradually progressing. However, such models still need further improvement for practical applications because of the requirements of precise computation and operational efficiency. From a biological plausibility perspective, we are developing a concise algorithm to make breakthroughs. We hope our conceptual and technical contributions can lead to new continuous neural network algorithms achieved through simplicity, principled approaches, and less engineering.

    Acknowledgments

    This work was supported by the National Major Science and Technology Projects of China under Grant 2018AAA0100201, the National Natural Science Foundation of China under Grant 62106163, the Natural Science Foundation Project of Sichuan Province under Grant 2023YFG0283, and the CAAI-Huawei MindSpore Open Fund under Grant 21H1235.

    ORCID

    Xiuyuan Xu  https://orcid.org/0000-0002-2505-4350

    Haiying Luo  https://orcid.org/0009-0004-7005-4490

    Zhang Yi  https://orcid.org/0000-0002-5867-9322

    Haixian Zhang  https://orcid.org/0000-0002-9821-508X