Decoding Continuous Tracking Eye Movements from Cortical Spiking Activity
Abstract
Eye movements are the primary way primates interact with the world. Understanding how the brain controls the eyes is therefore crucial for improving human health and designing visual rehabilitation devices. However, brain activity is challenging to decipher. Here, we leveraged machine learning algorithms to reconstruct tracking eye movements from high-resolution neuronal recordings. We found that continuous eye position could be decoded with high accuracy using spiking data from only a few dozen cortical neurons. We tested eight decoders and found that neural network models yielded the highest decoding accuracy. Simpler models performed well above chance with a substantial reduction in training time. We measured the impact of data quantity (e.g. number of neurons) and data format (e.g. bin width) on training time, inference time, and generalizability. Training models with more input data improved performance, as expected, but the format of the behavioral output was critical for emphasizing or omitting specific oculomotor events. Our results provide the first demonstration, to our knowledge, of continuously decoded eye movements across a large field of view. Our comprehensive investigation of predictive power and computational efficiency for common decoder architectures provides a much-needed foundation for future work on real-time gaze-tracking devices.
1. Introduction
Eye movements play an essential role in vision by redirecting the highest-acuity portion of the retina (“fovea”) toward points of interest. However, accurate eye movements pose a challenge to the nervous system because objects in the world can move unpredictably, and self-generated movements also cause changes in the images on the fovea. Thus, even seemingly simple behaviors like pouring a glass of water or crossing the street require the execution of rapid, coordinated eye movements. Visual deficits such as strabismus and macular degeneration therefore have debilitating effects on quality of life measures and on a patient’s ability to successfully navigate the world. Inspired by work on invasive Brain–Computer Interfaces (iBCIs),1,2 which decode reach kinematics from motor cortical activity for rehabilitation devices, we apply similar information theoretic approaches to the oculomotor system.3,4,5 This approach allows us to reconstruct eye movement trajectories from the activity of populations of neurons, establishing an upper bound on the neuronal information used to guide eye behavior. The development of an oculomotor BCI is a promising step toward incorporating signals from multi-effector iBCIs, offering improved speed and adaptability in real-world scenarios.6,7,8
To keep the fovea stabilized on moving objects, neural circuits must process what is seen and generate precise motor commands to the eye muscles. For over half a century, researchers have focused on how information that enters the eye is propagated through and processed by the brain to control eye movements.9,10 However, the complexity of neuronal activity makes it difficult to decipher the specific neuronal patterns that represent commands to move the eyes. Machine learning tools have the potential to provide insights into the composition of cortical representations and their suitability for reconstructing eye movement trajectories. A neuron-to-gaze decoder could be useful for developing real-time (or online) brain–computer interfaces to assist patients who cannot produce eye movements, such as those locked-in from strokes or neurodegenerative diseases. It can also be used as an offline tool for providing insights into the mappings between brain and behavior. Thus, tracking eye movements will provide an excellent model system for understanding the mappings and moment-by-moment feedback between visual-motor neurons and motor behavior.
Prior work has demonstrated that decoding eye movements from neuronal activity is possible, though it was limited in scope. McFarland and Butts developed a powerful decoder using an Expectation–Maximization (EM) algorithm to infer small fixational eye movements (less than 1∘ in amplitude) based on neuronal activity in the primary visual cortex (V1).11 However, given the relatively limited amount of visual space that a patch of V1 represents, applying this approach to natural visual behavior would require an impractically large number of implanted electrodes over a large swath of occipital cortex. Likewise, previous efforts to decode eye movement direction have relied on coarse classification (e.g. up, down, left, or right) with primarily linear discriminant analysis and support vector machine models, which would have limited utility in natural viewing environments.3,12 Here, we sought to reconstruct eye movement trajectories using the spiking activity of small neuronal populations in cerebral cortex. The shift in focus toward continuous behavioral readout is crucial for advancing BCI technology because accurate decoding of eye movements is essential for interacting with visualized objects.13 Current state-of-the-art devices, such as Apple's Vision Pro and Meta's Quest 3, record and process eye movements to update the visual display. However, any method that relies on post-movement processing, rather than predictive eye movement control, is likely insufficient for real-time perceptual rehabilitation because of the delays inherent in movement recording and processing.
Neurons in cortical brain regions, like the Frontal Eye Fields (FEF)14 and Middle Temporal (MT) area,15 are ideally suited to iBCI efforts because they contain information regarding motion, visual processing, and eye movement generation. For example, electrically micro-stimulating FEF neurons with small amounts of current evokes eye movements in primates,16,17 and neurons in these brain regions are sensitive to the tuning properties (e.g. direction and speed) of targets appearing and moving in the visual scene.18 Given the sensitivity of cortical neurons to gaze-related information, supervised models trained with neuronal spiking and eye tracking data should be able to reconstruct eye trajectories and provide insights into how visual and motor information is integrated to guide behavior. Thus, our goal in this work is not to devise a model of oculomotor brain function, but to evaluate existing decoding models for eye movements. While previous work has shown limited success in classifying fast, exploratory eye movements called saccades, we know of no studies that have shown temporally precise decoding of continuous eye movements across the visual field.
In this work, we demonstrate that a diverse set of eight decoding models — including the Wiener Filter (WF), Wiener Cascade (WC) filter, eXtreme Gradient Boosting (XGBoost), Support Vector Regression (SVR), Deep Neural Network (DNN), Recurrent Neural Network (RNN), Gated Recurrent Unit (GRU), and Long Short-Term Memory (LSTM) network — can reliably reconstruct moment-by-moment eye movement trajectories from neuronal spiking activity. We found that: (1) neural network models yielded the highest prediction accuracy; (2) accuracy decreased with kinematic derivative (position > velocity > acceleration); and (3) the relative performances of kinematic derivatives and model architectures were consistent across sessions. To facilitate the translation of such methods to real-time BCIs, we probed how the quantity and format of the data influenced the decoders’ performance. We found that increasing the training set size and neuronal population size improved prediction accuracy. Decoding performance remained stable across different input bin widths, with the total amount of prior input context having a greater impact on accuracy than how it was segmented. Although finer input bins typically increased accuracy, it is important to balance this with computational efficiency, as excessive neuronal context can lead to overfitting or overwhelm real-time processing. Lastly, while increasing output bin width improved accuracy, it also obscured finer details like saccades. This trade-off suggests that relying solely on how well the model’s predictions match the data (explained variance) is not enough to fully assess its usefulness. Instead, the output bin width should be carefully chosen to balance prediction accuracy with ensuring that key behavioral events are accurately captured.19
Our results introduce a novel approach for understanding the neuronal mechanisms of oculomotor control and open avenues for brain–computer interfaces aimed at restoring or enhancing motor function. Expanding on established methods used in arm movement and reach-BCI applications, we adapt these techniques to eye movements. While past studies have focused on classifying coarse eye movements, we take a step further by decoding temporally-precise eye movements, similar to how reach and cursor trajectories are decoded. By leveraging the similarities between arm and smooth visual tracking, we advance neuron-to-gaze interface development, bringing us closer to real-time, fine-tuned eye movement prediction.
2. Methods
2.1. Data acquisition and processing
Extracellular neuronal activity in area MT and FEF was recorded at 40 kHz simultaneously in the right hemisphere of an adult male rhesus monkey (Macaca mulatta) using 24-contact linear electrode arrays (Plexon S-probes; Dallas, TX) and a Plexon Omniplex recording system. MT and FEF were identified based on stereotaxic coordinates, anatomical landmarks, and neurophysiological properties such as cortical depth, receptive field features, and latency of responses to task-relevant events. Neuronal activity across channels was monitored during recording, but there was no requirement for the activity to be solely related to movement of the eyes. The Institutional Animal Care and Use Committee of Duke University approved all procedures.
Signals from each microelectrode were amplified and bandpass filtered. Waveform segments that exceeded a threshold were digitized (30 kHz) and sorted off-line with an automated sorting algorithm.20 We removed trials where the neuronal response exceeded three standard deviations from the mean to eliminate outliers and ensure more reliable input for the decoders.21,22 Isolated neuronal units that failed to fire consistently during recording sessions were excluded from our analyses, ensuring each neuron fired at least once during every trial. The scleral search coil technique was used to track eye movements, and horizontal and vertical eye position data were stored at 1 kHz.23,24 We employed a Savitzky–Golay digital filter with a polynomial order of 2 and frame length of 25 for smoothing the eye traces.25
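As a minimal sketch of this smoothing step (assuming the 1 kHz eye traces are stored as NumPy arrays; the traces and variable names are illustrative):

```python
import numpy as np
from scipy.signal import savgol_filter

# Placeholder 1 kHz eye-position traces in degrees of visual angle.
eye_x = np.cumsum(np.random.randn(1600)) * 0.01  # horizontal position
eye_y = np.cumsum(np.random.randn(1600)) * 0.01  # vertical position

# Savitzky-Golay smoothing with the parameters reported above:
# frame length of 25 samples (25 ms at 1 kHz) and polynomial order 2.
eye_x_smooth = savgol_filter(eye_x, window_length=25, polyorder=2)
eye_y_smooth = savgol_filter(eye_y, window_length=25, polyorder=2)
```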
2.2. Behavioral task
Trials began with the presentation of a static patch of black and white dots (72 dots, each dot 5 pixels square at either 12% or 100% contrast) in a 4∘×4∘ circular aperture in the center of the screen. The dot patch contained a small black fixation point (0.4∘×0.4∘) in its center to encourage stable fixation. The stimulus remained on while the animal fixated for 800–1600ms, after which the dots drifted, within an invisible static aperture, in the upcoming direction of pursuit for 100ms. This brief drifting stimulus was used to encourage smooth pursuit initiation in the absence of saccadic eye movements.26 After 100ms, the entire dot patch stimulus translated across the screen at a fixed velocity of either 10 or 20∘/s for 800ms. In each recording session, the dot patch moved in one of four possible directions, pseudorandomly selected on each trial, with directions spaced 90∘ from each other. Stimuli were presented on a gamma-corrected CRT monitor (80Hz refresh rate, 2304×1440 pixels resolution) on a uniform gray background. Eye position accuracy was maintained within a 2∘×2∘ window during fixation and a 4∘×4∘ window during pursuit. The monkey was required to maintain its gaze within the window over the course of the trial to receive a liquid reward.
2.3. Decoder architectures
We tested eight decoding methods: two statistical model-based methods (WF and WC), two ensemble and kernel methods (XGBoost and SVR), and four neural network methods (Feedforward Dense Neural Network, Vanilla RNN, GRU, and LSTM). All models use neuronal spiking activity as inputs, with each feature representing a single neuron. They predict horizontal (x) and vertical (y) eye movements. The models vary in approach, from mathematical methods to neural networks, providing different interpretations of the link between neuronal data and eye behavior. The base Python code for WF, WC, and the machine learning methods is from the freely-available Neural Decoding package.27
2.3.1. Statistical model-based methods
Statistical model-based methods use mathematical principles and signal processing to interpret relationships between variables and make predictions based on observed information.
Wiener Filter (WF). The WF, as implemented in our approach, combines a preprocessing filter step with a linear regression model to predict output from discrete inputs. First, the input signals are passed through a WF (from the scipy.signal module) to enhance relevant components while reducing noise.28 This filter works by analyzing the statistical properties of the input signal to distinguish between consistent patterns (signal) and random fluctuations (noise). By leveraging information about both the signal and noise, and assuming the signal’s properties remain consistent over time, the WF enhances parts of the signal that correlate with the desired output while reducing uncorrelated noise. The filtered inputs are then used in a linear regression model, which determines the least-squared-error linear solution to construct predictions. The model assumes a linear mapping between spikes in relevant time bins and the output. Importantly, the WF, when combined with the linear regression model, is suitable for stationary systems as it assumes a consistent correlation and mean between neuron inputs and output kinematics without the need for hyperparameterization.29 WFs are often used in real-time BCI applications due to their simplicity and efficiency in processing neural data.30,31
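The following sketch illustrates this two-stage structure, assuming flattened spike-count inputs and two-dimensional kinematic outputs; the array shapes and variable names are illustrative and only loosely follow the Neural Decoding package implementation:

```python
import numpy as np
from scipy.signal import wiener
from sklearn.linear_model import LinearRegression

# Placeholder data: 1000 observations of 65 neurons x 8 preceding bins (flattened).
X = np.random.poisson(2.0, size=(1000, 65 * 8)).astype(float)
Y = np.random.randn(1000, 2)  # horizontal (x) and vertical (y) eye kinematics

# Step 1: Wiener-filter each input feature to suppress uncorrelated noise.
X_filt = np.column_stack([wiener(X[:, i]) for i in range(X.shape[1])])

# Step 2: least-squares linear regression from filtered spike counts to kinematics.
wf_model = LinearRegression().fit(X_filt, Y)
Y_pred = wf_model.predict(X_filt)
```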
Wiener Cascade (WC). The WC addresses nonlinear signal processing by layering sequential filters to refine predictions. This cascade integrates linear filters and static nonlinear filters. In the initial preprocessing step, linear filters reduce noise by enhancing components correlated with the desired output, while nonlinear filters capture and process more complex relationships within the data. The combination of both types of filters can enhance the WC model’s accuracy compared to the basic WF. The only hyperparameter is the polynomial degree, which defines the complexity of the nonlinear filter component. After preprocessing, the filtered inputs are used in a linear regression model to make final predictions.32
2.3.2. Ensemble and kernel methods
Ensemble and kernel methods differ from statistical-based models by combining models or specialized kernels to capture complex relationships within data, offering increased flexibility and improved accuracy compared to traditional statistical approaches.
eXtreme Gradient Boosting (XGBoost). XGBoost is an ensemble learning algorithm that combines multiple models through a boosting process using gradient descent to minimize errors. The iterative process aims to enhance model accuracy by adding weaker models to the ensemble. XGBoost assumes nonlinear mappings for input/output relationships. The hyperparameters for XGBoost—including maximum tree depth, number of boosting rounds, learning rate, and subsample ratio—control the complexity of individual trees, the number of iterations in boosting, the rate at which the model learns during training, and the fraction of samples used in each boosting round, respectively.33
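As a rough sketch of such a decoder using the xgboost Python API, with one regressor per output coordinate (an illustrative choice for handling the two-dimensional output) and mid-range values from the search space in Table 1:

```python
import numpy as np
from xgboost import XGBRegressor

X = np.random.poisson(2.0, size=(1000, 65 * 8)).astype(float)  # flattened spike counts
Y = np.random.randn(1000, 2)                                   # x and y eye kinematics

# Hyperparameters correspond to max tree depth, boosting rounds,
# learning rate, and subsample ratio (values are illustrative).
models = []
for k in range(Y.shape[1]):
    reg = XGBRegressor(max_depth=5, n_estimators=600, learning_rate=0.1, subsample=0.5)
    reg.fit(X, Y[:, k])
    models.append(reg)

Y_pred = np.column_stack([m.predict(X) for m in models])
```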
Support Vector Regression (SVR). SVR, an adaptation of Support Vector Machines for regression tasks, operates by projecting input data into a high-dimensional space via a nonlinear kernel. This transformation facilitates the linear mapping of data points to outputs while minimizing an objective function. By allowing for a nonlinear input/output relationship, SVR aims to find the hyperplane that best fits the data within a specified tolerance margin while retaining flexibility through its parameters. The epsilon parameter (ϵ), which delineates the width of the hyperplane tolerance, was set to its default value of 0.1. However, included in the hyperparameters fitted during model validation was a regularization parameter (C), which influences the model’s flexibility and closeness of fit to the data. Higher C values decrease regularization, enabling a tighter fit to the data, while lower C values increase regularization, leading to a smoother decision surface and potentially narrower margin around the hyperplane. C, a positive parameter, impacts the penalty through a squared L2 method, penalizing poorly fit data points. The hyperparameter tuning process also involves adjusting the maximum iterations parameter, which sets the limit for the number of iterations executed when searching for an optimal hyperplane.34 In neural decoding applications, SVR has been used to predict where the eyes will move next by analyzing broadband brain signals from the prefrontal cortex just before the movement starts, as shown in a previous study.3
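A comparable sketch for SVR using scikit-learn, with ϵ left at its default of 0.1 and illustrative values for C and the maximum iterations; wrapping the regressor in MultiOutputRegressor to predict x and y jointly is an illustrative choice, not necessarily the original implementation:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.multioutput import MultiOutputRegressor

X = np.random.randn(1000, 65 * 8)  # z-scored, flattened spike counts
Y = np.random.randn(1000, 2)       # z-scored x and y eye kinematics (see Sec. 2.4.3)

# Nonlinear (RBF) kernel SVR; epsilon stays at its default of 0.1, while
# C and max_iter are the hyperparameters adjusted during validation.
svr = MultiOutputRegressor(SVR(kernel="rbf", epsilon=0.1, C=3.0, max_iter=2000))
svr.fit(X, Y)
Y_pred = svr.predict(X)
```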
2.3.3. Neural network models
Neural network models leverage interconnected nodes to relate input and output data. These models excel at handling large datasets and can successfully extract relevant features, allowing them to adapt to diverse tasks without relying on explicitly predefined features or kernels. If neuronal activity relates nonlinearly to the output behavior, using a decoder that does not assume a specific input/output relationship can be advantageous. Many recent papers have demonstrated that neural networks perform exceptionally well in decoding applications, consistently achieving high accuracy across various neural datasets.35,36
Deep Neural Network (DNN). A feedforward DNN, also known as a fully-connected or dense neural network, consists of multiple layers of interconnected nodes (neuron-like units) organized in a hierarchical manner. Nodes in each hidden layer receive inputs from all preceding layer nodes and apply activation functions to the inputs’ weighted sums to generate outputs. This architecture enables complex nonlinear transformations of input data, sequentially processing inputs through hidden units and onward to the output layer via linear mappings followed by nonlinear operations. The hyperparameters for the four neural network architectures employed in this work (DNN, RNN, GRU, LSTM) consist of the number of units, dropout fraction, batch size, and number of epochs. These parameters determine the network architecture’s configuration and regulate the number of neurons, the dropout rate for regularization, the size of data batches processed during training, and the number of training epochs defining the iterations over the dataset.37,38
Recurrent Neural Network (RNN). The vanilla-RNN accommodates flexible nonlinear input/output mappings with varying lengths, making it well-suited for processing sequentially dependent data. The hidden state, resulting from a combination of inputs and the previous hidden state, undergoes an output nonlinearity before being linearly mapped to generate the output. Unlike feedforward networks, RNNs explicitly model temporal changes within a system, allowing them to capture dependencies and patterns over time. This recurrent architecture enables the network to retain memory of past inputs, providing context and enabling more informed predictions. The hyperparameters are number of units, dropout fraction, batch size, and number of epochs.39
Gated Recurrent Unit (GRU). GRU, a variant of RNNs, enhances the ability of the model to capture long-term dependencies (e.g. compared to traditional RNNs) by introducing two specialized gates: a reset gate and an update gate. These gates regulate the flow of information within the network, allowing it to selectively retain and update relevant information over time. By leveraging hidden states instead of cell states, GRUs address the short-term memory limitations of vanilla RNNs and allow for more effective learning and representation of sequential data. The hyperparameters are number of units, dropout fraction, batch size, and number of epochs.40
Long Short-Term Memory (LSTM). LSTM, a type of RNN, addresses the limitations of vanishing gradients and models long-term dependencies using specialized cells equipped with three key gates: input, output, and forget gates. These gates regulate the flow of information within the network, allowing it to selectively retain, update, or discard relevant information over time. By leveraging these gating mechanisms, LSTM can effectively capture the context and temporal dynamics of sequential data, making it particularly well-suited for tasks requiring long-term dependencies. Additionally, using unidirectional LSTM as we do here, the model processes the input sequence from left to right, ensuring that it maintains a consistent flow of information throughout the prediction process. The hyperparameters include the number of units, dropout fraction, batch size, and number of epochs.41
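As a minimal sketch of such a recurrent decoder in Keras (hyperparameter values are illustrative picks from the ranges in Table 1; the exact layer configuration of the original models is not reproduced here):

```python
import numpy as np
from tensorflow import keras

# Inputs: T observations x B preceding bins x D neurons (see Fig. 2).
X = np.random.poisson(2.0, size=(1000, 8, 65)).astype("float32")
Y = np.random.randn(1000, 2).astype("float32")  # x and y eye kinematics

num_units, frac_dropout, batch_size, n_epochs = 150, 0.25, 64, 10  # illustrative

model = keras.Sequential([
    keras.Input(shape=(X.shape[1], X.shape[2])),        # (B, D)
    keras.layers.LSTM(num_units, dropout=frac_dropout),  # unidirectional LSTM
    keras.layers.Dense(2),                               # linear readout of x and y
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, Y, batch_size=batch_size, epochs=n_epochs, verbose=0)
Y_pred = model.predict(X)
```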
2.4. Model training and evaluation
2.4.1. Data formatting
Previous neural decoding studies used a “continuous” dataset of the entire recording session, including both within-trial and inter-trial time periods.27,42,43 Due to software limitations, we excluded inter-trial intervals in our study. As a result, our recordings started at the onset of the fixation point for each trial and ended shortly after reward delivery. Instead, we concatenated segments for all correct trials (as shown in Fig. 1(c) in the Results, Sec. 3.1), where the monkey maintained fixation on the target within the allowable eye tracking window (see above). This approach allowed us to create a dataset, made up of trials from all task conditions (target direction, contrast, speed), for model training and testing. Trial lengths were variable due to the timing of target movement, which prevented the monkey from anticipating motion onset. To standardize trial length, we extracted a 1600ms window from each trial, spanning from 500ms before the target began to move to 300ms after it stopped. This window provided sufficient time bins for each trial, capturing the full duration of target motion while excluding the time bins when a visual stimulus was not present on the screen.

Fig. 1. (Color online) Neuron-to-gaze decoding pipeline. (a) Cortical spiking activity (D total channels) and eye movements are recorded as the subject visually tracks a moving target on a screen. Spike counts are detected via threshold crossings in the raw recordings and discretized into time bins, with average eye coordinates calculated for each bin. The supervised decoding model is trained to map preceding spike count bins to corresponding eye kinematics; during evaluation, it predicts eye kinematics from held-out spike counts. Model performance is assessed by calculating the R2 metric, comparing predicted (pink line) to recorded eye kinematics (black line, ground truth). (b) Four representative trials illustrate the target position (black dashed line), recorded eye position (gray line), and predicted eye position by a long short-term memory (lstm) network (pink line). Traces are shown in degrees of visual angle, with the x-axis for horizontal and y-axis for vertical directions. In the right panel, recorded (black) and predicted (pink) horizontal (x) and vertical (y) eye positions over time (50-ms bin size) are shown for the same four trials. Trials initiate with the monkey’s eyes fixated on the visual target (0∘; gray horizontal line). Eye position diverges as the target moves in one of four directions. (c) Temporally-aligned spike rasters and eye traces are concatenated and split into training, validation, and test datasets for K-fold cross-validation. During training, the model receives spiking activity (input) and eye kinematics (output), while during testing, it predicts eye movements from unlabeled inputs using learned weights. This is repeated K times (K=10), resulting in K R2 values for each model. The right panel displays performance accuracy (¯R2x,y) for each model (different colors) and cross-validation fold (x markers) for one recording session. Error bars represent ± the standard error of the mean (SEM) across cross-validation folds (n=10).
We aligned the neuronal and behavioral recordings, combined the neurons from both brain areas into a single input dataset, and collapsed all trials across target conditions (direction, speed, contrast). While conventional methods typically use equal bin widths for discretizing movement kinematics and neuronal firing rates, we independently varied input and output bin widths to investigate the temporal relationship between eye movements and neural activity with our model. We segmented continuous eye trajectories into bins and computed the average horizontal (x) and vertical (y) eye position coordinates within each temporal bin (see Fig. 1(a) in the Results, Sec. 3.1). Next, we extracted neuronal spike times leading up to and including each output bin, serving as the antecedent neuronal information for making predictions (see Fig. 2 in the Results, Sec. 3.2). This time window was then divided into multiple smaller bins, and within each, we counted the number of spikes, resulting in multiple bins of firing rates corresponding to each output bin (e.g. 400ms of preceding neural data, split into eight 50-ms bins, used to predict a movement in the current 50ms bin).

Fig. 2. Detailed decoding schematic. (a) The spike trains of D neurons (neuronal population size) across B discrete time bins (number of preceding input bins) with a given input bin width are used to decode the output kinematics for each time bin (shown here with B=4, including 3 preceding bins and 1 concurrent bin). The training set size is the total duration of the training split from cross-validation, divided into bins of output bin width. The gaze outputs (G) shown are horizontal (x) and vertical (y) eye position over time for two example trials. (b) The input data (N) is structured as T training observations × B preceding input bins × D neurons, where each point in the matrix corresponds to the firing rate of an individual neuron. The output data has dimensions T×2 for simultaneous x- and y-kinematics prediction. Evaluation involves measuring the decoder’s accuracy in predicting x and y gaze outputs from unseen neuronal inputs using the R2 scoring metric.
We excluded initial output bins for each trial within the dataset because using “prior” neuronal activity to predict the outputs at the beginning of each trial would inappropriately involve information from the previous trial and potentially include large gaps of time in between the trials. Additionally, we removed any output eye kinematics bins that averaged over the eye kinematics between two trials. Since overlapping neuronal data is used to forecast target eye kinematics, we assume statistical dependence in predictions due to the interconnected and sequential nature of the overlapping input signals.
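A minimal sketch of this formatting step, assuming spikes have already been counted into input bins of a chosen width (function and variable names are illustrative; the trial-boundary exclusions described above are not shown):

```python
import numpy as np

def make_lagged_inputs(spike_counts, n_lag_bins):
    """Map each output bin to its B preceding (and concurrent) input bins.

    spike_counts : array of shape (n_bins, n_neurons), binned spike counts.
    n_lag_bins   : B, the number of input bins used per output bin.
    Returns an array of shape (n_bins - B + 1, B, n_neurons).
    """
    n_bins, _ = spike_counts.shape
    rows = [spike_counts[t - n_lag_bins + 1 : t + 1]
            for t in range(n_lag_bins - 1, n_bins)]
    return np.stack(rows)

# Example: 65 neurons, 50-ms bins, eight bins of context per output bin.
counts = np.random.poisson(2.0, size=(2000, 65))
X_recurrent = make_lagged_inputs(counts, n_lag_bins=8)   # (1993, 8, 65), for RNN/GRU/LSTM
X_flat = X_recurrent.reshape(X_recurrent.shape[0], -1)   # (1993, 8 * 65), for WF/WC/XGB/SVR/DNN
```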
2.4.2. Scoring metric
To determine the goodness of fit for the supervised decoding model, we calculated the coefficient of determination (R2) as

R2 = 1 − [Σi (yi − ŷi)2] / [Σi (yi − ȳ)2],

where yi is the recorded kinematic value in output bin i, ŷi is the corresponding model prediction, and ȳ is the mean of the recorded values. R2 was computed separately for the horizontal (x) and vertical (y) coordinates and averaged to yield the reported ¯R2x,y.
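In practice (a sketch assuming the recorded and predicted kinematics are stored as observations × 2 NumPy arrays), the score can be computed per coordinate and averaged with scikit-learn:

```python
import numpy as np
from sklearn.metrics import r2_score

Y_true = np.random.randn(500, 2)                  # recorded x and y kinematics (placeholder)
Y_pred = Y_true + 0.3 * np.random.randn(500, 2)   # placeholder model predictions

# R2 for each coordinate, then averaged to give the reported score.
r2_xy = r2_score(Y_true, Y_pred, multioutput="raw_values")  # [R2_x, R2_y]
r2_mean = r2_xy.mean()
```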
2.4.3. Cross-validation and regularization
For each recording session (n=10), we conducted ten-fold cross-validation by dividing the dataset into 70% training, 20% validation, and 10% test sets. The test set consisted of successive trials to maintain temporal continuity, ensuring no overlap of trials across different cross-validation folds. The training and validation sets were formed from the remainder of the trials, so across different cross-validation folds there was some overlap in data used to train the models. To optimize hyperparameters, Bayesian optimization was conducted on the validation set, aiming to maximize the R2 value (averaged for x and y eye coordinates) for the specified kinematic variable.45
Input data during training were z-scored, and the output was zero-centered by subtracting the mean, except in the case of SVR, for which the output was also z-scored to better handle varying scales in the output data.46 During validation and testing, datasets were mean-centered using the mean derived from the training dataset. This process maintained consistency in preprocessing across training, validation, and test sets and ensured that the models were trained and evaluated on comparable data distributions. To address the interdependence of estimates across folds due to overlapping training sets, the Standard Error of the Mean (SEM) was computed across cross-validation folds using the formula σ × √(1/J + 1/(J−1)), where σ represents the standard deviation across folds and J denotes the number of folds used in the cross-validation process.47
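A small sketch of this corrected SEM computation (the per-fold scores shown are placeholders):

```python
import numpy as np

r2_folds = np.array([0.86, 0.88, 0.85, 0.87, 0.89,
                     0.84, 0.88, 0.86, 0.87, 0.85])  # illustrative per-fold R2 values
J = len(r2_folds)                                    # number of cross-validation folds
sigma = r2_folds.std(ddof=1)                         # standard deviation across folds

# Corrected SEM accounting for the overlap in training data across folds.
sem = sigma * np.sqrt(1.0 / J + 1.0 / (J - 1))
```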
Finally, we applied early stopping as a regularization technique during model training to prevent overfitting on the training data. Specifically, a subset of the training data was reserved to monitor the loss function’s convergence. We set a patience of five epochs to halt training when the validation loss ceased to decrease, thereby preventing the model from learning noise in the data and improving its generalization performance. Several hyperparameters, optimized through Bayesian optimization on the validation set, serve to regulate model complexity and prevent overfitting. For instance, parameters like the number of boosting rounds in XGBoost or the dropout fraction in neural network models act as regularization measures by discouraging excessively intricate models and promoting generalization.
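For the neural network models, this kind of early stopping can be expressed with a Keras callback, sketched below (restoring the best weights is an illustrative choice, not something stated above):

```python
from tensorflow import keras

# Stop training once the held-out loss has not improved for five epochs.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                           restore_best_weights=True)

# Illustrative usage: a fraction of the training data is reserved to monitor the loss.
# model.fit(X_train, Y_train, validation_split=0.1,
#           epochs=n_epochs, batch_size=batch_size, callbacks=[early_stop])
```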
2.4.4. Other evaluation metrics
Training time. Training time, measured in minutes per cross-validation fold, represents the time required for the model to fit to the training data and optimize hyperparameters using the validation set. The training time is averaged across cross-validation folds and training repetitions.48
Inference time. Inference time quantifies the average rate, measured in seconds per output bin, required for a trained decoder to generate predictions from unseen input data. In online applications or tasks requiring quick responses, faster inference is desirable. However, it is crucial to balance speed with accuracy.49 In typical real-time applications, inference time also includes preprocessing the input data (e.g. spike sorting and binning).
Generalization gap. The train-test generalization gap (Δ) is defined as the mean percentage difference in prediction accuracy between the training and test datasets, calculated as Δ = 100 × (R2train − R2test)/R2train, averaged across cross-validation folds. Larger values of Δ indicate greater overfitting to the training data.
2.4.5. Hyperparameter optimization
To determine the optimal hyperparameters for each validation dataset, we employed Bayesian optimization, a technique designed to navigate complex search spaces efficiently.51 This approach systematically explores the hyperparameter space using the Upper Confidence Bound (UCB) acquisition function. The UCB function, configured with a kappa value of 10, serves to balance the trade-off between exploration (searching less certain regions) and exploitation (searching regions likely to offer improvement based on current knowledge).52 Prioritizing exploration over exploitation by increasing the kappa value promotes broad sampling within the search space and prevents the algorithm from getting trapped in local minima, enabling the discovery of near optimal hyperparameter configurations.
The hyperparameter search spaces for each model architecture are summarized in Table 1. The number of iterations in Bayesian optimization determines the quantity of steps performed during the process. More iterations enhance the likelihood of discovering a favorable maximum. Additionally, the number of initial points denotes the random exploration steps executed at the beginning of the process, which aids in diversifying the exploration space. The objective was to identify the optimal combination of parameters and the corresponding target value to achieve the best performance.
Table 1. Bayesian optimization settings and hyperparameter search spaces for each decoding model.

| | wf | wc | xgb | svr | dnn | rnn | gru | lstm |
|---|---|---|---|---|---|---|---|---|
| Optimization parameter | | | | | | | | |
| Initial points (init_points) | — | 20 | 10 | 5 | 10 | 10 | 10 | 10 |
| Iterations (n_iters) | — | 20 | 10 | 5 | 10 | 10 | 10 | 10 |
| Hyperparameter | | | | | | | | |
| Polynomial degree (degree) | — | [1, 10] | — | — | — | — | — | — |
| Max tree depth (max_depth) | — | — | [2, 8] | — | — | — | — | — |
| Boosting rounds (num_round) | — | — | [200, 1000] | — | — | — | — | — |
| Subsample ratio (subsample) | — | — | [0.25, 0.75] | — | — | — | — | — |
| Learning rate (eta) | — | — | [0.01, 0.3] | — | — | — | — | — |
| Regularization parameter (C) | — | — | — | [0.5, 10] | — | — | — | — |
| Number of units (num_units) | — | — | — | — | [50, 300] | [50, 300] | [50, 300] | [50, 300] |
| Dropout fraction (frac_dropout) | — | — | — | — | [0.1, 0.5] | [0.1, 0.5] | [0.1, 0.5] | [0.1, 0.5] |
| Batch size (batch_size) | — | — | — | — | [32, 128] | [32, 128] | [32, 128] | [32, 128] |
| Number of epochs (n_epochs) | — | — | — | — | [2, 15] | [2, 15] | [2, 15] | [2, 15] |
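As a rough sketch of this optimization loop, assuming the pre-2.0 API of the bayes_opt (bayesian-optimization) package, in which the acquisition function and kappa are passed directly to maximize; the objective below is a stand-in for retraining a decoder and returning its validation ¯R2x,y, and the bounds follow the neural network ranges in Table 1:

```python
from bayes_opt import BayesianOptimization

def validation_r2(num_units, frac_dropout):
    # Stand-in objective: in practice this would retrain the decoder with the
    # proposed hyperparameters and return its validation R2 (mean of x and y).
    return -((num_units - 175.0) / 250.0) ** 2 - (frac_dropout - 0.3) ** 2

optimizer = BayesianOptimization(
    f=validation_r2,
    pbounds={"num_units": (50, 300), "frac_dropout": (0.1, 0.5)},
    random_state=1,
)
# UCB acquisition with kappa = 10 favors exploration of uncertain regions.
optimizer.maximize(init_points=10, n_iter=10, acq="ucb", kappa=10)
best = optimizer.max  # best hyperparameters and objective value found
```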
2.5. Statistics
To assess whether there was a statistically significant difference between decoders trained with sub-sampled data (fewer training observations or neurons) and those trained with the full dataset, we used the nonparametric Mann–Whitney U test via the mannwhitneyu function from the scipy.stats module in Python.53,54 Each decoder was run 1000 times (100 repetitions, 10 folds), providing a robust distribution of performance metrics for comparison. This large sample size ensured reliable data for comparison, and the Mann–Whitney U test provided a robust analysis without assuming normal distribution of performance metrics.
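A minimal sketch of one such comparison (the R2 distributions are placeholders):

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Illustrative R2 distributions: 1000 runs (100 repetitions x 10 folds) each for
# decoders trained on the full dataset versus a sub-sampled training set.
rng = np.random.default_rng(0)
r2_full = rng.normal(0.88, 0.02, size=1000)
r2_subsampled = rng.normal(0.85, 0.03, size=1000)

# Two-sided Mann-Whitney U test comparing the two conditions.
stat, p_value = mannwhitneyu(r2_full, r2_subsampled, alternative="two-sided")
significant = p_value < 0.0001  # threshold used (after Bonferroni correction) in Fig. 5
```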
Resampling training data. To evaluate how the size of the training set impacts decoding performance, we systematically varied the training set size for each model-kinematic variable pair. This process involved pseudorandomly sub-sampling the trials in the dataset 100 times, without replacement, for each training set size to ensure robustness in our analyses and estimate performance variability. Importantly, the validation and test sets remained unchanged throughout these analyses. This approach allowed us to assess how changes in training set size affected decoding accuracy across different decoding architectures and kinematics.
Resampling neuronal population. To investigate the relationship between performance and the number of neurons used to predict eye behavior, we ran each model-kinematic variable pair 100 times to minimize the likelihood that the results were directly related to the specific neurons selected for one particular run. Within each of the 100 runs, we pseudorandomly selected a subset of the recorded neurons without replacement. This process involved selecting a distinct set of D neurons from the distribution, where D denotes the number of neurons chosen for training and testing the decoder in each iteration.
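A minimal sketch of this neuron-resampling step (array handling is omitted; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_neurons_total = 65   # neurons available in the example session
D = 20                 # population size evaluated in this iteration
n_runs = 100

subsets = []
for run in range(n_runs):
    # Pseudorandomly choose a distinct set of D neurons without replacement;
    # the same subset is used to train and test the decoder for this run.
    subsets.append(rng.choice(n_neurons_total, size=D, replace=False))
```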
2.6. Computer cluster and parallelization
Decoding models were executed in parallel on a shared memory cluster with a 24-core Skylake CPU (Intel Xeon Gold 6126 2.60GHz) and 192GB RAM/node; this was possible due to the independent and “embarrassingly” parallel nature of each decoding run. For neural network models, specific settings utilizing the workers parameter in TensorFlow allowed for further threading and optimization of computations within individual model runs, leveraging multiple CPU cores to enhance the efficiency of complex operations during training or inference processes. Each decoder was trained and tested using two CPU cores (workers), when measuring efficiency comparisons across models.
2.7. Data availability
Pre-processed data files and source code are available upon request from the authors.
3. Results
We used eight decoders of varying complexity to determine if tracking eye movements can be reliably reconstructed from the spiking activity of neuronal populations in the cerebral cortex. Models tested ranged from simple linear filters to RNNs. Our overarching goals were threefold: (1) Benchmark how well various model architectures extract gaze-related information from populations of cortical neurons; (2) Measure how data quantity and format influence decoder performance, illustrating how different data representations can impact the model’s ability to leverage information from the data; (3) Evaluate the practicality and generalizability of these decoders for real-time applications.
3.1. Pipeline for decoding continuous eye kinematics
We measured the spiking activity of cortical neurons while a macaque monkey maintained its gaze on a moving patch of dots (Fig. 1(a)). We segmented the continuous neuronal and gaze data into bins to enable time-based predictions. Extracellular voltage signals were evaluated for threshold crossings and transformed into a binary format (i.e. spike counts). The eye behavior (i.e. position, velocity, or acceleration) was similarly discretized into time bins to align with neural data, ensuring the total number of observations remained consistent. We trained decoding models to predict horizontal (x) and vertical (y) eye kinematics over time using binned firing rates from a neuronal population. During training, the model received labeled inputs, with each output bin associated with multiple preceding input bins of neuronal firing rates.55,56 To assess the model’s ability to generalize, we evaluated its prediction accuracy on a separate held-out test dataset, comparing the predicted eye kinematics to the recorded eye kinematics (ground truth) using the Coefficient of Determination (R2, see Sec. 2.4.2). This scoring metric is commonly used in neural decoding applications to gauge the degree of similarity between predicted and recorded movements.42,57
During a single recording session, thousands of repetitions (“trials”) of the visual tracking task were performed. Each trial began when the monkey fixated on a small patch of dots in the center of the screen. As the target linearly translated away from the screen center at a constant velocity, the monkey had to maintain its gaze on the moving target until it stopped before the monitor edge. Successful tracking was defined as keeping the gaze within a 4∘×4∘ invisible window centered on the stimulus, and it resulted in the monkey receiving a liquid reward. The target could move in one of four possible directions (90∘ visual angle increments; Fig. 1(b), left), selected pseudo-randomly on each trial (see Sec. 2.2). In Fig. 1(b), target position, recorded eye position, and predicted eye position from the LSTM decoding model are depicted for one example trial in each of the four directions. The similarity between the recorded and predicted eye traces in the four trials indicates that the model can approximate the initiation, direction, and finer dynamics of the eye movements.
To create a dataset for model training, the neuronal activity and eye traces from all trials in a recording session were concatenated into a single sequence (see Sec. 2.4.1). The dataset was then partitioned into a 70:20:10 ratio for training, validation, and testing (Fig. 1(c), left). During training and validation, the supervised model received labels representing the x and y eye coordinates for each set of preceding time bins of spiking activity; during testing, the model predicted eye kinematics from the neuronal data without access to the labels. To mitigate overfitting, we employed 10-fold cross-validation, where each iteration (“fold”) minimized the mean squared error between predicted and recorded kinematics for distinct data splits (see Sec. 2.4.3). We trained eight decoding algorithms, including the Wiener Filter (wf), Wiener Cascade (wc), XGBoost (xgb), Support Vector Regression (svr), Feedforward Deep Neural Network (dnn), Recurrent Neural Network (rnn), Gated Recurrent Unit (gru), and Long Short-Term Memory (lstm), to predict eye kinematics over time. These models operate on distinct assumptions and capture varying aspects of the data (see Sec. 2.3).
The prediction accuracy (R2) for each cross-validation fold of an example recording session is displayed for each model architecture in the right panel of Fig. 1(c). Higher R2 values indicate a better fit between the model’s predictions and the variance in the recorded data. Simpler models (e.g. WF) explained around 70–75% of the variance in eye position, whereas more complex neural network models (e.g. feedforward neural network, LSTM network) could explain up to 85–90% of the variance. Prediction accuracy across cross-validation folds for each model type had relatively small error bars, indicating high consistency in performance across different subsets of the data.
3.2. Detailed decoding schematic and data formatting parameters
Given that work on continuous eye movement decoding is relatively novel, we aimed to establish the performance and stability of these models by assessing their sensitivity to data quantity and formatting. Here, we outline each parameter and demonstrate how each affects the dimensionality of input and output data. This detailed examination is crucial for interpreting how and why these parameters impact different aspects of model performance.
In Fig. 2(a), we detail the structure of the data used for training and testing the decoding models. The formatting parameters (blue text) determine the amount of information available to the model during training and influence how each model exploits relevant input patterns. The data parameters influencing the “amount” of data available for decoding, such as training set size and neuronal population size, are often constrained by data acquisition factors such as neurons per channel or the number of trials completed by the subject. On the other hand, data “format” parameters, like bin widths and number of preceding input bins are deliberate design decisions established prior to model training. Format parameters are therefore informed by the scientific questions under investigation and the behavioral outcomes of the decoding task.
We explicitly define each parameter and consider its influence on the decoding model:
(i) Training set size: the duration of data provided to the model during training, divided into numerous time bins or observations. Insufficient training data may lead to overfitting, where the model memorizes inputs instead of learning meaningful patterns, often as a result of constraints like session length or subject trial count.

(ii) Neuronal population size: the quantity of model features, where each feature represents the firing rate of an individual neuron in a given time bin. Increasing the population size may enhance the model’s ability to capture relevant neuronal signals related to the decoded behavior but may also introduce redundant or irrelevant information, potentially leading to increased computational complexity and longer training times.

(iii) Input bin width: the bin size (in ms) for discretizing the spikes. Opting for smaller bin sizes allows the model to capture rapid fluctuations in neuronal activity, which can aid in the interpretation of transient neuronal dynamics related to eye kinematics. Coarser input bins offer a more generalized view of the data but risk oversimplification, potentially obscuring pertinent neuronal information related to eye behavior.

(iv) Number of preceding input bins: the extent of prior neuronal information available to predict each gaze output bin, spanning temporal connections across B consecutive bins. Increasing the number of input bins (i.e. more past data points) can help capture long-term dependencies, but too many bins can introduce irrelevant information and increase computational load.

(v) Output bin width: the bin size (in ms) for discretizing the output eye kinematics. Larger bin widths smooth over finer timescale variations in the output data, resulting in a temporally less precise representation of the eye movement behavior and fewer total observations.
During model training, the decoding algorithm received labeled input–output pairs consisting of spike counts and eye coordinates (Fig. 2(b)). Hyperparameters, which control how the model learns and operates (such as learning rate or regularization strength), were optimized using Bayesian optimization on the validation dataset (see Sec. 2.4.5). The model’s performance was then evaluated on a separate, held-out test dataset to ensure the results were robust. The data formatting parameters affect the dimensionality, temporal resolution, and computational requirements of the input and output data matrices. The training set size (T), determined by both the total duration of the training set (in minutes) and the output bin width, dictates the number of observations (rows of the input and output matrices, N and M, respectively) available for learning. The number of preceding bins (B) defines how many input bins are mapped to each output bin, providing the temporal context for recurrent network models, and the neuronal population size (D) represents the number of features in the input matrix. For feedforward decoders, not depicted in Fig. 2(b), the features are consolidated into a two-dimensional matrix of size (T×(B×D)); consequently, these models lack access to temporal connections, as all neurons and bins are combined and treated as independent features, which limits their ability to capture sequential dependencies in the data.
We systematically varied the five parameters described above to evaluate their impact on decoding performance across different model architectures and kinematic derivatives. This analysis, which adjusts the amount and format of information available to the models, is valuable for designing real-time gaze decoders and for understanding the learning dynamics and stability of these models. Parameter selection depends heavily on the objective of the decoding task, which can vary based on the behavior, neuronal data, and goals of the application. For real-time applications like BCIs, minimizing prediction lag is critical, so speed is prioritized over detailed accuracy. However, in other cases, such as studying neural representations of eye movements, higher temporal resolution may be more valuable, depending on the specific behavioral or neural data being analyzed. In this work, we benchmark eight model types for this behavior and neuronal data, providing a foundation for future decoding applications and developing new tools to help better understand how the brain controls eye movements.
3.3. Decoding benchmarks across models and kinematic variables
To determine how well each decoding model could reconstruct eye kinematics during visual tracking, we trained and evaluated decoders to predict eye position, eye velocity, and eye acceleration from neuronal population responses. Although we explored a range of data formats (see Sec. 3.5 below), we initially used eight 50-ms bins of input spiking activity recorded from 65 neurons in a single recording session to decode 50-ms output bins of eye movement kinematics. These initial values were chosen based on prior studies for decoding limb movements from cortical spiking activity.27,43,58
Eye trajectories from the test data split of a representative recording session are displayed in Fig. 3. Recorded (gray lines) and predicted (colored lines) eye trajectories provide a qualitative intuition of R2 values and facilitate comparisons between decoders. Impressively, all decoders approximated the recorded eye traces, as the predicted (colored) traces closely overlay the recorded (gray) traces for these four example trials. Decoding accuracy rose with model complexity, with all eight models achieving relatively high accuracy when decoding eye position (left column). However, R2 values uniformly decreased for all model architectures when predicting eye velocity (Fig. 3, middle column), and further decreased when predicting eye acceleration (Fig. 3, right column). Note that in the velocity and acceleration traces, the sharp peaks represent instances where the monkey made corrective saccadic eye movements during the pursuit task.

Fig. 3. (Color online) Eye kinematic reconstructions. Predicted eye movement trajectories from eight decoders are shown, derived from the same four trials as in Fig. 1 (∼6.4s). Predictions include horizontal (x, solid) and vertical (y, dashed) kinematics, for position, velocity, and acceleration (left to right sub-panels). Recorded eye kinematics are shown as gray lines, while predicted eye kinematics are color-coded by model architecture; wf = wiener filter, wc = wiener cascade, xgb = XGBoost, svr = support vector regression, dnn = dense (feed-forward) neural network, rnn = recurrent (vanilla) neural network, gru = gated recurrent unit, lstm = long short-term memory. R2 values for x and y coordinates are shown above each kinematic prediction. Eight 50-ms input bins were used to decode each 50-ms output bin.
To summarize decoder performance across sessions, we assessed individual decoders for each model architecture and kinematic variable (Fig. 4). The mean training (gray lines) and testing (black lines) prediction accuracies are illustrated for the representative session used throughout this paper (solid lines) and for four other sessions (dotted lines). Although maximum performance varied between recording sessions, the relative trends across model types and kinematic derivatives remained consistent. In agreement with the reconstructed eye traces in Fig. 3, eye position decoding (left column) consistently exhibited the highest prediction accuracy across sessions and decoders. Eye velocity decoding (middle column) yielded intermediate prediction accuracy, while acceleration decoding (right column) consistently exhibited the lowest accuracy. Across all kinematic derivatives and recording sessions, neural network models (rightmost points) consistently outperformed the other model types (leftmost points).

Fig. 4. (Color online) Decoder performance and generalization across recording sessions. Prediction accuracies (R2) for each decoding model (colors along the x-axis) and kinematic variable (left to right sub-panels) are shown for an example recording session (thick, solid lines). Error bars represent the mean ± SEM across ten cross-validation folds, where each • is the ¯R2x,y for each fold (n=10). Mean cross-validated R2 values for nine other sessions are shown (thin, dotted lines). Markers and error bars with higher saturation represent the test prediction accuracy, while the lower saturation lines show training prediction accuracy. wf = wiener filter, wc = wiener cascade, xgb = XGBoost, svr = support vector regression, dnn = deep neural network, rnn = recurrent neural network, gru = gated recurrent unit, lstm = long short-term memory.
Analyzing both train (R2train) and test (R2test) prediction accuracy allowed us to assess the models’ performance on the data they were trained on and their ability to generalize to unseen data. The difference between test and train accuracy was most pronounced for acceleration decoding (Fig. 4; right column), with a drop in R2test compared to R2train (50–100% lower), across all model types. In contrast, position and velocity decoding (Fig. 4; left, middle columns) exhibited smaller discrepancies between train and test performance. For all kinematic derivatives, XGBoost and SVR (green and teal points) showed the greatest tendency to overfit. Table 2 shows the mean decoding accuracies for position and velocity across five example sessions, alongside results from previous work, where cursor velocities controlled by a manipulandum were decoded from motor cortex spiking activity using the same eight models.27 Our findings reveal that the trends in model performance for skeletomotor decoding are consistent with our gaze position decoding, with comparable prediction accuracies. However, gaze velocity decoding explains about 20–30% less variance compared to their reach velocity decoding.
Table 2. Decoding accuracy (R2) for each model. For each session, the upper row gives eye-position decoding and the lower row gives eye-velocity decoding; the final row gives cursor-velocity decoding from Glaser et al.27 for comparison.

| Dataset | Dataset size (min) | Population size (neurons) | Kinematic | wf | wc | xgb | svr | dnn | rnn | gru | lstm |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Session 26 | 19.3 | 73 | Position | 0.85 | 0.86 | 0.85 | 0.90 | 0.91 | 0.89 | 0.93 | 0.93 |
| | | | Velocity | 0.51 | 0.50 | 0.49 | 0.54 | 0.54 | 0.55 | 0.57 | 0.57 |
| Session 29 | 63.6 | 65 | Position | 0.72 | 0.72 | 0.73 | 0.80 | 0.88 | 0.87 | 0.89 | 0.89 |
| | | | Velocity | 0.50 | 0.50 | 0.48 | 0.52 | 0.60 | 0.62 | 0.64 | 0.64 |
| Session 24 | 60.1 | 52 | Position | 0.56 | 0.56 | 0.60 | 0.68 | 0.81 | 0.78 | 0.81 | 0.81 |
| | | | Velocity | 0.46 | 0.45 | 0.49 | 0.53 | 0.61 | 0.63 | 0.66 | 0.66 |
| Session 22 | 46.6 | 52 | Position | 0.40 | 0.40 | 0.42 | 0.52 | 0.65 | 0.65 | 0.66 | 0.66 |
| | | | Velocity | 0.36 | 0.34 | 0.35 | 0.42 | 0.48 | 0.51 | 0.52 | 0.52 |
| Session 32 | 47.6 | 55 | Position | 0.42 | 0.43 | 0.40 | 0.49 | 0.60 | 0.59 | 0.59 | 0.62 |
| | | | Velocity | 0.31 | 0.32 | 0.28 | 0.35 | 0.42 | 0.45 | 0.45 | 0.47 |
| Glaser et al.27 | 10 | 164 | Cursor velocity | 0.78 | 0.79 | 0.82 | 0.81 | 0.86 | 0.84 | 0.87 | 0.88 |
In summary, we demonstrated that decoding models can accurately predict eye kinematics from neuronal population responses during visual tracking. Position decoding consistently outperformed velocity and acceleration decoding in terms of both prediction accuracy and generalization. This pattern held across recording sessions and model types, with neural networks achieving the highest performance. Impressively, these models explained up to 90% of the variance in eye position. This level of accuracy underscores the potential of these models for developing precise and reliable brain–machine interfaces for real-time applications. The ability to decode eye movements with such high fidelity is a major advancement, opening new possibilities for neuroprosthetics and other technologies that rely on accurate real-time neural decoding.
3.4. Impact of data quantity on decoding performance
The duration of the training set and size of the neuronal population impact the information available for model learning in decoding tasks. Recording more data, either with more electrodes or by having the subject perform more task repetitions, is expected to improve the model’s predictive accuracy on unseen data by providing more diverse training samples. By analyzing variations in R2 values across different training set durations and neuronal population sizes, we can gain insights into the relationship between data quantity and performance. If the models effectively learn from the neuronal data, we would expect test R2 values to increase with larger training sets and neuronal populations, indicating successful pattern capture and strong generalization to unseen examples. Quantifying model performance with limited data is particularly valuable for real-life applications, where extensive datasets may not be available.
3.4.1. Amount of training data
To determine how training set size affected prediction accuracy in an example recording session, we randomly sub-sampled the trials in the training data, which comprised 70% of the dataset. We explored training data percentages ranging from 2% (25 trials, 40s, 800 observations) to 100% (1275 trials, 34min, 40,800 observations) of the training set. For each sub-sample percentage (e.g. 20%), we conducted the sub-sampling procedure 100 times for each of the 10 cross-validation folds, resulting in 1000 repetitions per percentage (see Sec. 2.5). The same sub-sampling procedure was applied across all model architectures and kinematic derivatives, allowing us to compare how training set size affected performance. The models were tested on the same test data for each cross-validation fold.
As expected, prediction accuracy on test data improved steeply as the training set size increased (Fig. 5, top row). The increase was particularly pronounced between 2% and 10% of the training data, after which the improvement started to plateau. We tested for differences between decoding accuracy when the model was trained with all of the data versus smaller training sets using the two-sided Mann–Whitney U test (see Sec. 2.5). We found that, across most models and kinematic derivatives, decreasing the training set size resulted in a statistically significant (p<0.0001) decrease in accuracy (Fig. 5, top insets); even small differences, such as 10% less training data (e.g. from 34min to 30min), led to statistically significant differences (p<0.0001).

Fig. 5. (Color online) Decoder performance in relation to (a) training set size (top sub-panels) and (b) neuronal population size (bottom sub-panels), for a representative recording session. The y-axis shows the ¯R2x,y for each model type (colors) and kinematic variable (left to right sub-panels); error bars denote the mean ± SEM across ten cross-validation folds and 100 sub-sampling iterations (n=1000). The inset heat map in each sub-panel shows which R2 values for different training set sizes or neuronal population sizes are statistically significantly different from the full dataset (shaded vertical bar). The x-axis of each heat map matches the x-axis in the sub-panel, and the y-axis corresponds to each model type (color-coded). A colored box in the heat map indicates a significant difference (p<0.0001 after Bonferroni correction) from the full dataset according to the Mann–Whitney–Wilcoxon two-sided test, while a white box indicates no significant difference for that data-sampling/model combination.
Note: Eight 50-ms input bins were used to decode each 50-ms output bin.
However, the performance of a few decoders did not systematically improve with increased training set size. For example, XGBoost showed no significant difference (p>0.0001) in R2 between the full dataset and sub-sampled datasets for training set sizes ranging from 40s to 25min (Fig. 5, green lines on top right sub-panel). Additionally, SVR displayed an inverted-U relationship in R2 for velocity and acceleration decoding (Fig. 5, teal lines on top middle/right sub-panels), where performance initially increased but then decreased with more training data, suggesting overfitting.
Overall, providing decoders with more training data generally improved prediction accuracy. However, increasing the training set size introduced redundant information that did not linearly improve performance. Incorporating more diverse training data with varied target conditions and behavioral variability would likely enhance model performance more effectively than merely adding more repetitions of the same task (see Sec. 4). Additionally, some decoders, such as XGBoost, did not consistently benefit from increased training samples, indicating potential shortcomings in their architecture for learning meaningful relationships between input neuronal activity and output eye kinematics.
3.4.2. Neuronal population size
We expect that a larger number of neurons will enhance decoding performance by providing the model with a richer set of inputs to learn from. However, this outcome hinges upon the information content of the neurons being sampled. As observed in the previous section, the effect of adding neurons may reach an asymptote if the information conveyed by additional neurons is redundant with what is already provided. By characterizing the effect of different population sizes on accuracy, we can estimate the amount of redundant information in small populations of simultaneously recorded neurons.
For an example recording session, we trained individual decoders using relatively small neuron populations, ranging from 2 to 65 neurons. This amount of data is readily attainable given current invasive neuronal recording technologies.59 Since the information content of individual neurons is based on many factors that are difficult to control in extracellular recordings (e.g. cell type and network position), it was crucial to randomly select subsets of neurons at each iteration of the decoding process (see Sec. 2.5). Specifically, we selected D neurons to train the decoder and used the same D neurons to test it, iterating this process 100 times for each value of D.
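A compact sketch of this neuron-subset selection (hypothetical variable names, synthetic data):

    import numpy as np

    rng = np.random.default_rng(0)
    X_train = rng.poisson(2, size=(40_800, 65))    # toy binned spike counts (observations x neurons)
    X_test = rng.poisson(2, size=(17_400, 65))

    def select_neuron_subset(X_train, X_test, n_neurons, rng):
        """Pick the same random D neurons (columns) for both training and testing."""
        cols = rng.choice(X_train.shape[1], size=n_neurons, replace=False)
        return X_train[:, cols], X_test[:, cols]

    # e.g. D = 10 neurons; in the full analysis this is repeated 100 times per value of D.
    X_train_sub, X_test_sub = select_neuron_subset(X_train, X_test, n_neurons=10, rng=rng)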
We found that prediction accuracy on unseen test data improved steadily across all model and kinematic types as the neuronal population size increased (Fig. 5, bottom row). This was particularly evident for position and velocity decoding, where any decrease in number of neurons below the full population (D=65 neurons) resulted in a statistically significant (p<0.0001) decrease in performance (Fig. 5, insets). R2 rose steeply as the first neurons were added and then more gradually, reaching 0.65–0.9 for position decoding and 0.4–0.6 for velocity decoding. Decoding performance with the smallest populations was near zero for all kinematic variables. Additionally, the error bars decreased in size with larger populations, indicating more stable performance.
However, performance improvements were less pronounced for acceleration decoding, especially for the statistical model-based (WF, WC) and ensemble methods (XGBoost), where accuracy remained near zero (Fig. 5, orange/yellow/green lines on bottom right sub-panel). This result suggests that acceleration decoding remains challenging across all model architectures. Nevertheless, neural networks showed slight but consistent improvements with more neurons, indicating that additional neurons provide useful information for acceleration decoding and hold potential for further gains.
Overall, increasing the neuronal population size improved prediction accuracy across models and kinematic types, particularly for position and velocity decoding. This finding suggests that larger populations provide a richer set of inputs for the model to learn from, despite some information redundancy. Compared to increasing the training set size, adding more neurons had a more pronounced effect on decoding performance and showed fewer signs of saturation. Thus, for this data set, recording from additional neurons is more beneficial than having the subject perform more trials of the same behavior. Both factors are important, and in general, providing the model with as many neurons and as much training data as possible yields the best performance when computational constraints are not a concern.
In Sec. 3.6, we explore how these data parameters impact other evaluation metrics (e.g. training time and inference time) and discuss why providing as much data as possible may not be the best strategy for balancing prediction accuracy and computational efficiency.
3.5. Impact of data format on decoding performance
In Sec. 3.4, we demonstrated that prediction accuracy increased with data quantity. However, the temporal resolution of the data and the time period of preceding input neuronal activity used to predict behavior must also be considered. Specifically, these decisions involve the data “format” parameters: input bin width, number of preceding input bins, and output bin width, as described in Sec. 3.2. Traditionally, input and output bin sizes are equivalent for simplicity in processing and interpretation.27,43 However, input and output bin sizes do not have to be identical, and each can have a different effect on decoding performance. These design choices are highly dependent on the specific application and whether real-time predictions are necessary.
In this section, we independently vary these parameters to evaluate their impact on accuracy and stability across different model types and kinematic derivatives. This approach helps determine whether changes in accuracy stem from the models’ adaptability to specific formats or inherent differences in the encoded information within the neuronal population. Additionally, it offers insights into the models’ robustness in extracting behaviorally relevant information from diverse data formats.
3.5.1. Input bin width and context window
We investigated how the extent and granularity of input neuronal data influence decoding performance while keeping the bin width of the output eye kinematics constant at 50ms. We varied the input bin width and the number of preceding input bins independently to assess their individual impact on the models’ ability to extract behaviorally-relevant information for predicting eye movements. Understanding these dependencies across different model architectures — some of which do not account for the temporal arrangement of the input bins — and kinematic variables helps us pinpoint the optimal data format for accurate eye movement predictions. This analysis is essential for designing robust gaze BCIs.
We systematically varied the input bin width from 10ms to 300ms, and the number of preceding input bins, so the total input duration preceding each output bin ranged from 10ms to 900ms. We trained separate decoders — for each model, kinematic variable, and cross-validation fold — for each possible combination of input format parameters. To illustrate the impact of these variations, consider the scenario in which the model is provided with twenty 10-ms input bins versus two 100-ms input bins per output bin, resulting in the same total duration of prior neuronal context but different input bin widths. Alternatively, suppose the model is given ten 10-ms input bins versus twenty 10-ms input bins per output bin, maintaining the same input bin width but varying the duration of prior neuronal context. Here, we asked: does the format of the input data impact prediction accuracy, and, if so, to what extent?
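To make the input format concrete, the following sketch builds a decoder's feature matrix from spike counts that have already been binned at the chosen input bin width (the exact alignment of input bins to each output bin follows Sec. 2; names and toy data are illustrative):

    import numpy as np

    def build_design_matrix(spike_counts, n_prev_bins):
        """Stack the B input bins preceding each output-aligned bin into one feature vector.
        spike_counts: array of shape (n_bins, n_neurons)."""
        n_bins, n_neurons = spike_counts.shape
        rows = [spike_counts[t - n_prev_bins + 1 : t + 1].ravel()
                for t in range(n_prev_bins - 1, n_bins)]
        return np.asarray(rows)

    rng = np.random.default_rng(0)
    counts_10ms = rng.poisson(0.5, size=(1000, 65))    # 10-ms bins
    counts_100ms = rng.poisson(5.0, size=(100, 65))    # 100-ms bins

    # Twenty 10-ms bins and two 100-ms bins both span 200 ms of prior context,
    # but yield feature vectors of length 20 x 65 versus 2 x 65.
    X_fine = build_design_matrix(counts_10ms, n_prev_bins=20)
    X_coarse = build_design_matrix(counts_100ms, n_prev_bins=2)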
To compare model performance across input bin widths and durations, we constructed heat maps (Fig. 6) where the x-axis represents the input bin width and the y-axis denotes the input duration preceding each output bin (calculated as the product of input bin width and number of preceding input bins). The heat maps reveal how surrogate measures of data granularity (x-axis) and prior context (y-axis) affect mean prediction accuracy. The color bars of each heat map are normalized, with the highest saturation square representing the maximum prediction accuracy and the lowest representing the minimum. Additionally, contour lines of Gaussian fits to the data, along with white dots at the fitted peaks, summarize regions of optimal performance for different models and kinematic variables.
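The Gaussian summary of each heat map can be approximated as follows (an axis-aligned 2D Gaussian fitted with scipy; the published fits may additionally include an orientation term, and the toy surface below merely stands in for the binned R2 values):

    import numpy as np
    from scipy.optimize import curve_fit

    def gauss2d(xy, amp, x0, y0, sx, sy, offset):
        """Axis-aligned 2D Gaussian surface."""
        x, y = xy
        return amp * np.exp(-((x - x0) ** 2 / (2 * sx ** 2)
                              + (y - y0) ** 2 / (2 * sy ** 2))) + offset

    # Toy surface: x = input bin width (ms), y = input duration preceding output (ms).
    xw, yd = np.meshgrid(np.linspace(10, 300, 9), np.linspace(10, 900, 9))
    xw, yd = xw.ravel(), yd.ravel()
    rng = np.random.default_rng(0)
    z = gauss2d((xw, yd), 0.3, 40, 600, 80, 250, 0.5) + rng.normal(0, 0.01, xw.size)

    p0 = [z.max() - z.min(), 50, 500, 100, 300, z.min()]
    popt, _ = curve_fit(gauss2d, (xw, yd), z, p0=p0)
    peak_bin_width, peak_duration = popt[1], popt[2]   # fitted peak (white dot in Fig. 6)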

Fig. 6. (Color online) Decoder performance assessed in relation to input data parameters. Heat maps show how ¯R2x,y varies with input bin width (x-axis) and input duration preceding output (y-axis, calculated as input bin width × B preceding input bins). Each sub-panel represents a different model type (top to bottom) and kinematic variable (left to right). R2 values are divided into nine quantiles for input bin width and input duration, resulting in equal-frequency bins, with bin dimensions reflecting the quantile widths and heights. Color bars are scaled for each heat map to show the minimum and maximum R2 for each model-kinematic combination. Gaussian fits are shown as contour lines to highlight the central tendency, spread, and orientation of the data. Each subplot title includes the Gaussian fit’s R2gaussian, with the white • indicating the fitted peak position.
For position and velocity decoding, prediction accuracy (R2) remained relatively stable across a wide range of input bin widths and input durations preceding output (Fig. 6, left and middle columns). A smaller input bin width (10–40ms) combined with a longer input duration preceding output (>528ms) generally yielded the best prediction accuracy and created a broad near-optimal region. For neural networks (bottom four rows), the peak was more defined and shifted slightly lower on the y-axis, suggesting these models reached an asymptote in prediction accuracy more quickly with increasing temporal context compared to simpler models. Interestingly, the heat maps for the deep feedforward neural network (DNN) and recurrent networks (RNN, GRU, LSTM) were similar, despite only the recurrent models having access to the temporal order of input bins. This suggests the feedforward network effectively captures temporal dynamics without explicitly modeling sequence order.
In contrast, the Gaussian fits for acceleration decoding (right column) tended to be elongated and inconsistent. The larger optimal input bin widths (10–100ms) and shorter optimal input durations (431–623ms), relative to those for position and velocity, suggest less reliance on prior context. However, it is important to note that acceleration accuracies were low across all input data parameters, indicating that decoding acceleration eye kinematics is barely feasible. Overall, the broad optimal regions on the heat maps suggest that models are fairly robust to variations in input data format, allowing flexibility in selecting input parameters. The heat maps were similar across models and kinematic variables, with only subtle differences in the location of the optimal spot or the spread of the Gaussian contours. These variations indicate that input data formats have a minor influence on the temporal dynamics of predicted kinematic variables. While using larger numbers of fine input bins generally maximized accuracy, real-world applications must balance this with considerations of computational efficiency and robustness. Providing too much neuronal context can lead to overfitting or create too many features for real-time processing.
3.5.2. Bin width of output eye kinematics
In this section, we will vary the bin widths of output eye kinematics while keeping all input data parameters constant (eight 50-ms input bins per output) to isolate and assess the impact of output bin size on prediction accuracy. Increasing the output bin width involves discretizing eye traces into larger time bins, creating fewer observations for the model to fit and predict. This temporal smoothing can enhance decoding prediction accuracy by capturing broader trends in eye movements. However, the improvement comes at the cost of discriminating finer behavioral details that may be crucial for analysis at smaller timescales. Selecting an appropriate output bin width is particularly important for BCI applications, where precise and timely predictions are essential for effective real-time control and responsiveness. Balancing the trade-off between capturing overall trends and maintaining detailed temporal resolution is key to optimizing BCI performance.
To test these ideas, we systematically modified the output bin width from 10ms to 300ms in 10ms increments, assessing how accuracy changed with different output bin widths. We trained separate decoders — one for each model, kinematic variable, and cross-validation fold — for each output bin size. To illustrate the impact of output bin width on eye kinematics, we present the horizontal (x) eye traces from the same four-trial snippet shown in Fig. 3. For each row in Fig. 7(a), we show the recorded (black line) and predicted (colored lines) eye kinematics binned using different output bin widths (top to bottom: 10, 20, 50, 300ms). In each sub-panel, we report the R2 value for the LSTM model.
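The re-binning itself reduces to averaging the continuously sampled eye trace within non-overlapping windows, sketched below (a 1-kHz sampling rate and averaging within bins are assumptions for illustration; the actual pre-processing is described in Sec. 2):

    import numpy as np

    def rebin_trace(trace, fs_hz, bin_ms):
        """Average a continuously sampled trace into non-overlapping bins of width bin_ms."""
        samples_per_bin = int(round(fs_hz * bin_ms / 1000))
        n_bins = len(trace) // samples_per_bin
        return trace[:n_bins * samples_per_bin].reshape(n_bins, samples_per_bin).mean(axis=1)

    rng = np.random.default_rng(0)
    eye_x = np.cumsum(rng.normal(0, 0.05, 6400))         # toy 6.4-s horizontal eye trace at 1 kHz
    fine = rebin_trace(eye_x, fs_hz=1000, bin_ms=10)      # 640 bins; rapid events preserved
    coarse = rebin_trace(eye_x, fs_hz=1000, bin_ms=300)   # 21 bins; saccade-like events smoothed away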

Fig. 7. (Color online) Decoder performance as a function of chosen output bin width. (a) A ∼6.4-s snippet of horizontal (x) eye trajectories, for each decoding model (colors) and kinematic output type (left to right sub-panels), across different output bin sizes (top to bottom sub-panels). Recorded eye movement trajectories are represented by black lines, while predicted trajectories are in color. The output bin width increases vertically in each row (10, 50, 100, 300ms); a shaded gray bar on the far left of each row, with a fixed width of 0.6 s, shows how many bins fit into this duration based on the output bin width. Prediction accuracy R2x for the LSTM model is indicated in the bottom right corner of each sub-panel. Note that the y-axis limits vary across position and velocity (left and middle columns) versus acceleration (right column) plots for illustration purposes. (b) ¯R2x,y values are condensed into 15 evenly-spaced value ranges of output bin width on the x-axis. Error bars denote the mean ± SEM across cross-validation folds and output bin widths within each value range (n=20).
For position (Fig. 7(a), left sub-panels) decoding, increasing the output bin width removes some of the variation in behavior but does not change the general eye movement patterns. However, for velocity (middle sub-panels) and even more so for acceleration (right sub-panels) decoding, increasing the output bin width smoothed the sharp deviations in eye kinematics at times in which the monkey made rapid corrective eye movements (saccades). Saccades appear as sharp peaks in recorded x eye traces, indicating moments when the monkey realigned its eyes with the moving target. As the output bin width increases, the overall variance in eye kinematics decreases. This reduction in variance directly impacts the R2 value because R2 measures the proportion of behavioral variance captured by the predictions. Therefore, with less variance in the binned eye kinematics, R2 values were higher, reflecting the model’s improved ability to capture the smoother, less variable signal.
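This dependence follows directly from the standard definition of the coefficient of determination,

    R^2 = 1 - \frac{\sum_t (y_t - \hat{y}_t)^2}{\sum_t (y_t - \bar{y})^2}.

Wider output bins preferentially remove the high-frequency components (such as saccadic transients) that the models predict least well, so the residual error in the numerator falls proportionally faster than the total variance in the denominator, and R2 rises even if the absolute prediction error changes little.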
Figure 7(b) shows prediction accuracy as a function of output bin width. For position (left), larger bin widths slightly reduced accuracy, whereas for velocity and acceleration, larger bin widths improved performance. For acceleration decoding, neural network models improved relative to other model types as output bin width increased. Support Vector Regression (SVR, teal line) struggled with the smallest bin width (10ms) for velocity and acceleration. Despite this, the relationship between output bin width and prediction accuracy was fairly consistent across all model types.
In summary, increased output bin width enhanced prediction accuracy but also removed finer behavioral details. Larger output bin widths can completely obscure and even eliminate saccadic eye movements. Depending on the application, this may be desired or problematic (see Sec. 4). Given this trade-off, R2 alone is insufficient for describing model utility. The optimal output bin size depends on the specific application, and selecting bin size based solely on prediction accuracy may overlook important temporal details. This trade-off underscores the need to consider both prediction accuracy and the preservation of critical behavioral events when determining the appropriate output bin width.
3.6. Balancing predictive power and computational load
Thus far we have focused on investigating the impact of data quantity and format on the decoding accuracy of a held-out dataset. Across models and kinematic variables, we observed comparable effects on prediction accuracy when altering input and output bin width, the number of input bins preceding output bins, and the number of neurons. The similar patterns of results may be surprising given the highly variable model architectures (see Sec. 2). However, these analyses assumed unlimited computational resources and therefore ignored practical constraints needed for creating reliable real-time decoders. Further assessments of factors affecting model training and robustness are needed.
To evaluate how data parameters impact computational load and generalizability, we used three standard metrics (see Sec. 2.4.4), illustrated in the sketch after the list below:
(i) Training time: The duration, measured in minutes per cross-validation fold using fixed computational resources, required for the model to train on the pre-processed data. Longer training times indicate increased computational demands during the model’s learning process, which includes model fitting and hyperparameter tuning.
(ii) Inference time: The time, measured in seconds per output bin, required for the model to generate a prediction on unseen input data using tuned weights and hyperparameters. A shorter inference time indicates a more computationally efficient model during real-time predictions (in our case, predicted eye kinematics).
(iii) Generalization gap: The percent difference between train and test prediction accuracy, averaged across decoding runs. This metric reflects model generalization, with a larger gap indicating a higher likelihood of overfitting, where the model performs well on training data but poorly on unseen test data.
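A rough, single-fold sketch of how these metrics can be computed (using a stand-in scikit-learn model and synthetic data; the generalization-gap formula shown is one plausible formulation, with the exact definition given in Sec. 2.4.4):

    import time
    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.metrics import r2_score

    rng = np.random.default_rng(0)
    X_train = rng.poisson(2, size=(4000, 520)).astype(float)    # 65 neurons x 8 preceding bins
    X_test = rng.poisson(2, size=(1700, 520)).astype(float)
    w = rng.normal(size=(520, 2))
    y_train = X_train @ w + rng.normal(0, 5, size=(4000, 2))    # toy x/y eye position
    y_test = X_test @ w + rng.normal(0, 5, size=(1700, 2))

    model = Ridge()                                    # stand-in for any of the eight decoders

    t0 = time.perf_counter()
    model.fit(X_train, y_train)                        # in the full analysis this includes hyperparameter tuning
    train_minutes = (time.perf_counter() - t0) / 60    # (i) training time per fold

    t0 = time.perf_counter()
    y_pred = model.predict(X_test)
    infer_s_per_bin = (time.perf_counter() - t0) / len(X_test)   # (ii) inference time per output bin

    r2_train = r2_score(y_train, model.predict(X_train))
    r2_test = r2_score(y_test, y_pred)
    gap_percent = 100 * (r2_train - r2_test) / r2_train           # (iii) generalization gap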
Figure 8 summarizes the effects of data parameters (columns) on the three evaluation metrics defined above (rows). Because the relationships between data parameters and evaluation metrics were highly similar across kinematic derivatives, we show only position decoding. Data parameters are divided into evenly spaced ranges, and each subplot features a colored line for each model if the Pearson’s correlation coefficient between the data parameter and evaluation metric was statistically significant (p<0.0001). Bolded axes in each row indicate the data parameter(s) with the highest average Pearson’s correlation across all model types. The y-axis is log-scaled to account for differences across model architectures, allowing us to observe how each model’s relative changes in computational efficiency and robustness are influenced by varying data parameters. For most data parameter changes, model performances scaled predictably, maintaining their relative positions to one another.

Fig. 8. Computational load and robustness of position decoding models. Each sub-panel shows changes in (a) training time, (b) inference time, and (c) generalization gap based on data parameters (left to right sub-panels): training set size, neuronal population size, input bin width, number of preceding input bins, and output bin width. Error bars indicate the mean ± SEM across 10 cross-validation folds, 100 repetitions (for the far left two columns) and parameter ranges, with sample sizes for each column (data parameter) as follows: n=2000,2000,20,40,20. Bolded axes indicate data parameters with the highest (or joint highest) overall correlation across all models for each metric; y-axes are plotted in log-scale.
Training time (Fig. 8(a)) is crucial when evaluating model efficiency, particularly when computational resources are limited or when balancing the trade-off between computational load and accuracy. SVR (teal line) consistently required the most training time, while the WF (orange line) required the least, as WF has no hyperparameters to tune and its outputs are a simple linear combination of the inputs. The strongest linear relationship between data parameters and training time was observed with training set size (Fig. 8(a), far left); larger training sets resulted in longer training times. Training time also increased with the number of model features, such as neuronal population size and the number of preceding input bins. Conversely, training time decreased with larger input and output bin widths because smaller bins result in more observations for the same training set size.
Inference time (Fig. 8(b)) is vital for applications requiring real-time predictions because it measures how quickly a trained model can generate predictions. Support vector regression (SVR, teal line) showed the slowest inference times, presumably because of its quadratic time complexity and memory-intensive nature. Conversely, XGBoost (green line) demonstrated the fastest inference times. Parameters such as training set size (far left column) and output bin width (far right column) had little effect on inference time, as they primarily affect the training phase and the number of predictions, respectively, rather than the time taken to generate each individual prediction. However, increasing the number of model features, especially the number of preceding input bins (Fig. 8(b), fourth column), increased inference time. This effect was most pronounced in non-neural network models (yellow, orange, green, and teal lines), where neurons and preceding bins are combined into one feature space, leading to greater computational demands (e.g. 65 neurons and 15 preceding bins result in 975 independent features).
Lastly, the generalization gap (Fig. 8(c)) indicates how well the model generalizes to unseen data, providing a measure of model robustness and reliability beyond just test accuracy. Across the data parameters, XGBoost (green line) exhibited the highest generalization gap, likely a result of its complexity and tendency to overfit, while the Wiener models (WF and WC, orange and yellow lines) had the lowest gap, likely because of their relative simplicity. However, a small generalization gap can also mean that training and test accuracy are both low. WC filters (yellow line) were more unstable than simple WFs (orange line) due to their nonlinear stage, causing both train and test accuracy to drop sharply in some instances, leading to a near-zero generalization gap and inconsistent error bars as training set size increased. Increasing training set size (far left column) and neuronal population size (second column) both decreased the generalization gap, highlighting their importance in model robustness. Interestingly, while the benefits of increased training set size on test prediction accuracy tend to plateau (Fig. 5(a)), the generalization gap continued to narrow with more data. The two data parameters showing the strongest linear relationship with the generalization gap were the number of preceding input bins (fourth column) and the output bin width (far right column). Increasing the number of preceding input bins decreased the generalization gap for most models. Additionally, decreasing the output bin width reduced the generalization gap for the feedforward models, suggesting that finer temporal resolution helps these models generalize better to unseen data.
The results indicate that evaluation metrics such as training time, inference time, and generalization gap are important for understanding model performance and practical application, especially in real-time BCI scenarios. Balancing predictive power with computational efficiency and robustness is crucial for determining the feasibility and reliability of using the models in real-world settings. For instance, XGBoost, despite being the fastest in terms of inference speed, exhibited the highest tendency to overfit with the specific parameters and hyperparameters used. Notably, neural network models were generally more stable across data parameter changes, as indicated by the flatter slopes of their lines compared to those of non-neural network models. This comprehensive evaluation highlights the importance of considering multiple metrics to ensure that models are not only accurate but also efficient and robust.
4. Discussion
We trained eight decoding models to successfully reconstruct tracking eye movements from cortical spiking activity. While previous research has primarily focused on decoding reach trajectories from motor cortical activity,27,31,42 or classifying eye movement directions from visual-motor neurons,12,58,60 we decoded eye kinematics on a moment-by-moment basis. We evaluated how different model architectures extract gaze-related information, assessed the impact of data quantity and format on decoder stability, and determined the practicality of the decoders for real-time applications. Our results showed that all eight models can accurately decode eye kinematics, with prediction accuracy generally decreasing as a function of kinematic derivative (position > velocity > acceleration). Neural network models consistently outperformed traditional linear filters or ensemble methods in terms of prediction accuracy (R2). Interestingly, the performance of RNNs was comparable to that of feedforward neural network models (Fig. 4).
Our results are largely consistent with previous skeletomotor studies. For example, decoding arm kinematics from primary motor cortical (M1) neurons explained around 75–90% of the variance in velocity.27,42 Many studies with high-performing BCIs, using neurons from M1 and pre-motor brain regions, found that velocity decoding performance was highest when using 16-ms bin widths,61 whereas other works have suggested that 25–50ms bins may be best for closed-loop control.62 However, offline decoders typically show that velocity signals in the motor cortex are more robust and easier to decode than other movement kinematics,43 while we found that position was more reliable to decode than higher derivatives. Follow-up work is needed to determine if the difference in motor decoding across modalities is a function of the motor dynamics or the properties of the specific brain regions involved.
We also found that, as expected, prediction accuracy increased with training set size and neuronal population size. However, the improvement with increasing training set size plateaued at around 10min of data (Fig. 5(a)), likely because of factors such as the constrained nature of the task, behavioral variability, and the way in which we stitched together within-trial recordings rather than using continuous recordings from the entire session. Additionally, providing the models with more training data increased the amount of time required for training but also improved the generalization of the model, as seen by the narrowing difference between train and test accuracy (Fig. 8(c)). Performance improvements began to asymptote at around 50 out of 65 neurons, near the maximum number of neurons recorded in a single session, suggesting shared information content among neurons (Fig. 5(b)). The local co-variability of neighboring neuronal populations is a well-established neurophysiological finding.63 Thus, our “pure” information-based decoders reproduced this classical result regarding neuronal population size, suggesting that these decoders are viable tools for understanding brain physiology.
Lastly, we demonstrated that most decoders were robust to changes in data formatting, with mid-range parameter values generally yielding reliable performance. Regarding neuronal data parameters, input bin width had minimal impact on performance metrics, while the number of preceding input bins had a more significant effect. This indicates that the total amount of input context has a more direct impact on prediction accuracy than the way it is segmented. Although more preceding input bins provided additional information (Fig. 6), it also increased inference times (Fig. 8(b)). Output bin width is crucial for determining the timescale of eye kinematics, where wider bins may improve R2 by reducing variance but obscure finer behavioral details (Fig. 7) and promote overfitting, particularly in simpler models (Fig. 8(c)). Thus, differences in R2 do not necessarily reflect differences in the amount of gaze-related information in neurons, but rather how changing bin widths can affect the model’s ability to exploit that information. A high R2 should not be a goal in itself, as it can be increased by methods that lower the external validity of the model’s coefficients, thus not necessarily improving our understanding or prediction of the eye behavior.
When evaluating decoding performance, it is essential to consider the specific goals and requirements of the application. While offline comparisons can provide valuable insights, they do not necessarily translate to online performance.49 For offline decoding, maximizing prediction accuracy might justify sacrificing computational efficiency. However, online decoding demands a balance between accuracy and computational constraints. Although longer training times — caused by smaller output bin widths or more training data — can slow down online decoders, they do not necessarily impair real-time prediction accuracy. For real-time applications, it is advisable to select the maximum number of preceding input bins that does not excessively delay predictions. Coarsely binning the input data reduces the number of model features, striking a balance between providing sufficient input context and maintaining efficient inference times. While offline analyses might recommend wider output bins for better accuracy, this can introduce lags in online systems, making them less responsive to real-time adjustments. Conversely, extremely small output bin widths may not be practical for complex models, as the inference time could exceed the output bin width, leading to delayed predictions. Additionally, exploring newer machine learning methods, such as advanced supervised approaches (e.g. Neural Dynamic Classification, Dynamic Ensemble Learning, and Finite Element Machine),64,65,66 self-supervised learning (e.g. transformers), and reinforcement learning could enhance model efficiency and make these decoders valuable for more complex decoding tasks.
A caveat to our successful decoding efforts is that, while these models can effectively extract gaze-relevant information from neuronal activity, this does not necessarily imply that the brain employs similar mechanisms for processing such information. The training procedures and search algorithms, such as gradient descent, are inconsistent with what is known about neuronal processing. Thus, the neurophysiological implications of our work are limited. A parsimonious biological perspective suggests that decoding performance may estimate the upper bound of behaviorally-relevant information encoded in cortical spiking activity. Future work may help bridge this knowledge gap by examining how prediction accuracy varies across different neuronal sub-populations. For example, while our study incorporated all eye movement data across various experimental conditions to enhance decoder training, subsequent studies could assess decoding abilities by separating training and test sets based on factors such as sensory-motor conditions, brain regions, cortical depth, or functional properties like direction selectivity.67
Although decoders are inadequate estimates of cortical brain function, such cross-decoding approaches may help us place constraints on the type of information retained across different eye movement scenarios. These avenues for future work are important because the neuronal activity recorded in our study is multiple synapses away from the eye muscles and therefore can contain many types of brain signals. In this respect, our work differs from work on limb movement BCIs, which often rely on activity in primary motor cortex. This fact complicates the neurophysiological interpretation of the results, but strengthens the utility of our approach for vision-related BCIs because it does not rely on pre-selected types of brain activity. On the other hand, our work is indebted to the limb movement literature for elucidating the methods for decoding continuous movement signals, which we demonstrate also apply to tracking eye movements. While performance in our hands was high and comparable to performance in limb movement studies with more than twice as many recording channels,30,68 we attribute such success to the limited degrees of freedom for eye movements (three pairs of muscles) compared to limb movements with seven or more degrees of freedom.
Despite these limitations, the practical implications of our results for the design, interpretation, and implementation of vision-related BCIs are substantial. First and foremost, we demonstrated that a range of decoders can use cortical activity to predict eye movements on a moment-by-moment (<50ms) basis. While such “real-time” decoding of actions is increasingly common in the field of limb movement research,36,43,61 our results are a breakthrough in the field of vision-related BCIs. Eye movements are more rapid and precise than arm movements, providing an additional level of complexity for neuronal encoding in the brain and decoding in our models. Eye movements can be brief (100ms duration) and change mid-flight based on the demands of the environment.69 Thus, previous approaches such as the coarse classification of eye movement direction are likely insufficient for visual rehabilitation devices aimed at controlling a cursor or maintaining visual stability. In these domains, continuous decoding is key.
5. Conclusion
We demonstrated that a variety of machine learning models can use activity from a relatively small number of cortical neurons to accurately decode eye position on a moment-by-moment basis. Neural networks yielded the highest accuracy for decoding. Yet simpler decoders, which showed a 20% decrease in performance (Fig. 4), exhibited at least a thousandfold increase in computational efficiency (Fig. 8). Models’ performances improved significantly as data quantity increased (Fig. 5), as expected. We found that high performance was achieved across a broad range of data formats for each type of decoder (Fig. 6). Notably, output bin size remained a critical factor given the rapid interactions of different types of eye movement (Fig. 7) and the goals of decoding. Our comprehensive assessment of decoding models, kinematic variables, and input/output parameters sets the stage for the use of neuronally-guided BCIs in less constrained tasks and the development of oculomotor BCIs for visual assistive devices.
Acknowledgments
This work was supported by the Whitehall Foundation, NIH Fellowship T32EY17271 (K.K.N), CORE Grant P30EY08098 to the Department of Ophthalmology, the Eye and Ear Foundation of Pittsburgh, and unrestricted funds from Research to Prevent Blindness, New York, NY. This research was supported in part by the University of Pittsburgh Center for Research Computing, RRID:SCR_022735 through the H2P cluster, which is supported by NSF award number OAC-2117681. We thank Stephen G. Lisberger for the use of the data, and Bing Liu for help with code and analysis.
ORCID
Kendra K. Noneman https://orcid.org/0000-0003-1972-5477
J. Patrick Mayo https://orcid.org/0000-0001-5696-4667