Forecasts of Residential Real Estate Price Indices for Ten Major Chinese Cities through Gaussian Process Regressions
Abstract
Due to the rapid growth of the Chinese housing market over the past ten years, forecasting home prices has become a crucial issue for investors and authorities alike. In this research, utilising Bayesian optimisation and cross validation, we investigate Gaussian process regressions across various kernels and basis functions for monthly residential real estate price index projections for ten major Chinese cities from July 2005 to April 2021. The developed models provide accurate out-of-sample forecasts for the ten price indices from May 2019 to April 2021, with relative root mean square errors ranging from 0.0207% to 0.2818%. Our findings could be used individually or in combination with other projections to formulate theories about trends in the residential real estate price index and to carry out additional policy analysis.
Introduction
The last ten years have seen a significant expansion of the Chinese real estate sector. Predicting real estate prices has accordingly grown to be among the top concerns for investors and policymakers. Understanding real estate price trends and fluctuations is crucial because they directly affect people’s decisions about where to live and how to invest in real estate, as well as the development and execution of regulatory agencies’ policies. Real estate price prediction is thus of interest to many consumers and providers of forecasts.
Many academics and professionals are interested in making accurate and reliable predictions of financial and economic time-series data. For various forecasting applications, certain crucial time-series techniques, such as the autoregressive (AR), vector autoregressive (VAR), and vector error correction (VEC) models, as well as a wide variety of their extensions and adaptations, have been researched (Jin and Xu, 2024b; Yang et al., 2018). Neural networks, Gaussian process regressions, support vector regressions, regression trees, random forests, nearest neighbours, deep learning, ensemble learning, boosting, and bagging are just a few examples of the many machine learning techniques that have recently been found to be effective and promising solutions to a variety of real estate price forecasting problems. These assessments, while not exhaustive, are generally consistent with a variety of empirical studies on the adoption of machine learning techniques (Alade et al., 2021) for forecasting in finance and economics (Yang et al., 2008), and the neural network model appears to be one of the most widely used methods (Jin and Xu, 2024a,n) for predicting real estate prices. However, Gaussian process regression has not been thoroughly investigated for predictions of real estate price index time-series data.
Neal’s research on Bayesian learning for neural networks forms the basis of a distinctive regression technique (Neal, 2012). Since the technique relies on priors over functions in the form of Gaussian processes, it is suitable for modelling noisy data. It has been shown that, in the limit of infinite network width, many different kinds of neural network-based Bayesian regression models converge to Gaussian processes (Neal, 2012; Jin and Xu, 2024h,i). The Gaussian process has been effectively used in regressions to model both noisy (Jin and Xu, 2024l) and noise-free (Neal, 1997; Jin and Xu, 2024m) data. Brahim-Belhouari and Vesin (2001) compared radial basis function neural networks and Gaussian processes for forecasting problems involving stationary time-series data and found that Bayesian learning produces better prediction outcomes. Brahim-Belhouari and Bermak’s (2004) research suggests that it is advantageous to examine various covariance functions, which is the approach used in this study, and that prediction techniques based on Gaussian processes can successfully resolve forecasting problems for non-stationary time-series data. According to the same research, Gaussian process regressions outperform radial basis function neural networks (Brahim-Belhouari and Bermak, 2004). Additionally, the exact matrix operations used to integrate the prior and noise models are what give the Gaussian process formulation its value and advantage (Brahim-Belhouari and Bermak, 2004). Similar to how we utilise model averaging, Brahim-Belhouari and Bermak (2004) also suggested the strategy of multi-model forecasting using Gaussian process predictors. A recent study suggests that Gaussian process regressions might be used to accurately forecast steel prices for the resource industry.
Forecasting efforts have also been observed for residential real estate prices using classic econometric models and, more recently, machine learning techniques. For example, semiparametric models are more useful for forecasting and evaluating residential housing prices, according to Gençay and Yang’s (1996a,b) comparison of parametric and semiparametric classical econometric models. Glennon et al. (2018) discover that utilising several models increases the accuracy of property valuations based on house price indexes. Clapp and Giaccotto (1992) show that the evaluated value approach is more effective than the repeat sales methodology by creating a mechanism for reducing the impact of measurement mistakes connected with it. According to Kaboudan and Sarkar (2007), projections derived from equations calculated from city-wide disaggregated data have an advantage over those derived from local average pricing equations. Mei and Fang (2017) create a dynamic state forecasting model of the average selling price using multiple regression and trend analysis. Levesque (1994) uses an airport noise case study to examine the breakdown of residential property values. The performance of the AR integrated moving average model is examined by Hepşen and Vatansever (2011). Principal components analysis is used by Baroni et al. (2005) to create a repeat sales index that predicts apartment prices. Guo (2020) examines future price stability using both linear and non-linear regression models. For machine learning approaches, Paris (2008) investigates artificial neural networks to predict alterations in national and local price indices for the UK residential real estate market. Chi (2017) suggests using a spatial back-propagation neural network to estimate the price of residential real estate. In order to estimate demand for residential development, Bee-Hua (2000) proposes combining neural networks with evolutionary algorithms. Štubňová et al. 
(2020) discover that neural networks outperform regression models for estimating residential real estate market prices. For predicting residential unit rent prices, Seya and Shiroi (2021) compare the deep neural network with the nearest neighbour Gaussian process and conclude that the former has greater potential. In order to estimate construction cost from economic variables and indices, Rafiei and Adeli (2018) suggest using an unsupervised deep Boltzmann machine (DBM) learning approach, a softmax layer to extract pertinent features from the input data, and a three-layer back-propagation neural network (or support vector machine) to transform the trained unsupervised DBM into a supervised regression network. The extra-trees regression technique and the radial basis function-based support vector regression algorithm are both effective in modelling the fine-scale spatiotemporal distribution of residential land values, according to Zhang et al. (2021). Yoo et al. (2012) examine the hedonic modelling of residential property sales prices using cubist, random forest, and conventional ordinary least squares, and discover that the random forest produces the most accurate results. Hong et al. (2020), Dimopoulos and Bakas (2019), and Dimopoulos et al. (2018) all show how machine learning models may be used to increase the accuracy of real estate mass assessments. Machine learning models are also helpful for residential land evaluations, according to Ai et al. (2020). Picchetti (2017) demonstrates how the gradient tree boosting approach may be utilised to get around the problem of sample heterogeneity in hedonic geospatial residential property price assessments. Different machine learning methods have demonstrated promising accuracy in the literature for real estate price forecasts.
Based upon different empirical evidence, the mean absolute percentage forecast errors have ranged from below 1% to above 10%, considering that various time-series data have varied features and some are more difficult to forecast than others.
To continue this theme, we focus on Gaussian process regressions for residential real estate price index forecasts for ten major Chinese cities between July 2005 and April 2021, a period during which the real estate market saw rapid growth. This is, to the best of our knowledge, the first forecast study that employs Gaussian process regressions to examine residential real estate price indices in the Chinese market. In past research, the Gaussian process regression approach was frequently used to investigate Boston housing-related issues from the standpoint of property valuation. Given the prominence of the residential real estate market in China, little additional motivation should be required. Forecasts of residential real estate price indices should thus be crucial, and potentially difficult, for investors and policymakers, since a thorough grasp of pricing trends may benefit decision making. There are several forecasting methodologies, including econometric and machine learning-based models. The Gaussian process regression technique is chosen because of the non-linear patterns presented by the residential real estate price indices under consideration here, as well as its recognised value and promise for real estate price forecasting in the literature. To train the Gaussian process regression, a Bayesian optimisation strategy using a range of basis functions and kernels, as well as the cross validation technique, is employed. Because most earlier studies focused on a single location when evaluating other types of real estate assets, our findings may contribute to a better understanding of applying machine learning technology to predict residential real estate pricing in the constantly expanding Chinese market.
The coverage of these cities in the current work, combined with the availability of data, should represent an economically natural way to explore the forecast problem for residential real estate, given that demand and supply are most active in these major cities and that each one may have distinct price characteristics worth examining. Because prior research in this area has typically concentrated on older time periods when researching other forms of real estate, such as 1Q1981–4Q2002, 1Q2000–3Q2010, 7M2013–12M2013, 6M1996–8M2014, 1M2013–2M2017, 1M2010–7M2017, 12M2010–10M2017, 1M2011–12M2017, and 1M2005–11M2018, our findings might provide a more contemporary perspective on the efficacy of machine learning approaches for real estate price index estimates for the Chinese market. The current state of uncertainty in the residential real estate market may cause price behaviour to display increased non-linearities. This study adds to the body of literature on the usefulness of the Gaussian process regression for real estate price index forecasts in dynamic environments by building Gaussian process regression models based on a more recent time period, from July 2005 to April 2021. It also provides policymakers and investors with timely forecast tools for potential use. For many less sophisticated forecast users, machine learning methods may appear more challenging than econometric models. As a result, to assist with technical predictions, we develop comparatively simple yet accurate Gaussian process regressions. Given that machine learning models, like econometric models, are susceptible to overfitting or underfitting, our model development has taken into account a trade-off between prediction accuracy and stability. We specifically undertake out-of-sample projections from May 2019 to April 2021 and obtain relative root mean square errors for the ten price indices that range from 0.0207% to 0.2818%.
Our findings might be used individually or in combination with other forecasts to formulate theories about the trends in the residential real estate price index and carry out additional policy analysis.
Data
The data used for this study come from the China Real Estate Index System (CREIS), an analytical tool created to reflect the status of the real estate markets and growth trends in major Chinese cities. The platform was created in 1994 by the Real Estate Association, the Development Research Center of the State Council, and the National Real Estate Development Group Corporation. In 1995 and 2005, CREIS was audited by specialists from the Ministry of Land and Resources, the Ministry of Construction, the Real Estate Association, the Development Research Center of the State Council, the Banking Regulatory Commission, as well as various universities. Currently, CREIS publishes a number of real estate price indices on a monthly basis, including rental price indices, residential real estate sales price indices for both existing homes and newly constructed homes, price indices for villas, retail real estate price indices, and office price indices, among many others. The system has since expanded to cover the majority of the Chinese real estate markets. In this study, we focus on investigating forecasting issues using residential real estate price indices.
Residential real estate price indices collected from CREIS cover the following ten major Chinese cities: Wuhan, Chengdu, Hangzhou, Nanjing, Shenzhen, Guangzhou, Tianjing, Chongqing, Beijing, and Shanghai. CREIS collects data through phone surveys, field surveys, and web surveys. The samples used to produce the price index comprise all residential real estate in a given city that is available for sale in a specific month. The base-period price index is set to the price of Beijing’s residential real estate in 12M2000, with an index value of 1,000. Residential real estate price indices for other locations and months are created by normalising against this base-period price.
According to CREIS, its residential real estate price index for a given city is calculated as I′_t = (Σ_i P_t^i A_{t−1}^i / Σ_i P_{t−1}^i A_{t−1}^i) · I′_{t−1}, where I′_t and I′_{t−1} denote the price indices at times t and t−1, respectively, A_{t−1}^i denotes Project i’s total area of construction at time t−1, and P_t^i and P_{t−1}^i denote the average prices of Project i’s residential real estate at times t and t−1, respectively. It is important to highlight that the residential real estate price indices of the ten cities are the only data evaluated in the current investigation; we do not have access to any further CREIS platform data. The time period covered by the monthly data in this analysis is 7M2005–4M2021. Figure 1 shows visualisations of the ten price indices, their first differences, distributional plots using histograms with kernel estimations, and quantile–quantile plots, using Beijing, Shanghai, Shenzhen, and Guangzhou as examples. The ten price indices and their first differences are summarised in Table 1. Based on the p-values of the Anderson–Darling and Kolmogorov–Smirnov tests provided in Table 1, none of the ten price indices follows a normal distribution at the 1% significance level. Based on the p-values of the Jarque–Bera test shown in Table 1, none of the ten price indices follows a normal distribution at the 5% significance level. For many forms of financial and economic time-series data, non-normality is hardly surprising (Jin and Xu, 2024d; Jin et al., 2024). The price indices of Beijing, Tianjing, Chongqing, Hangzhou, Wuhan, and Chengdu are left-skewed, and the price indices of Shanghai, Shenzhen, Guangzhou, and Nanjing are right-skewed. All ten price indices are platykurtic.
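As an illustration of the chaining formula above, the following sketch computes one update step with hypothetical project prices and construction areas (the actual CREIS project-level inputs are not publicly available):

```python
# One chaining step of the CREIS residential price index:
# I'_t = (sum_i P_t^i * A_{t-1}^i / sum_i P_{t-1}^i * A_{t-1}^i) * I'_{t-1}.
# Project prices/areas below are made up purely for illustration.

def chain_index(prev_index, prices_t, prices_prev, areas_prev):
    """Update the price index from t-1 to t using area-weighted prices."""
    num = sum(p * a for p, a in zip(prices_t, areas_prev))
    den = sum(p * a for p, a in zip(prices_prev, areas_prev))
    return prev_index * num / den

# Hypothetical example: two projects, base index 1000 (Beijing, 12M2000).
index = chain_index(1000.0,
                    prices_t=[10500.0, 9800.0],
                    prices_prev=[10000.0, 9500.0],
                    areas_prev=[120.0, 80.0])
print(round(index, 2))
```

Each month’s index is thus the previous month’s index scaled by the area-weighted ratio of current to previous project prices.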

Fig. 1. Residential real estate price indices of ten major cities in China during 7M2005–4M2021.
Table 1. Summary statistics of the ten residential real estate price indices and their first differences during 7M2005–4M2021 (the last three columns report p-values of the corresponding normality tests).

City | Series | Minimum | Mean | Median | Standard deviation | Maximum | Skewness | Kurtosis | Jarque–Bera p-value | Anderson–Darling p-value | Kolmogorov–Smirnov p-value
---|---|---|---|---|---|---|---|---|---|---|---
Beijing | Price | 1,227 | 3329.126 | 3478.5 | 1075.114 | 4,565 | −0.437 | 1.943 | 0.006 | <0.001 | <0.001
 | First difference | −47 | 17.550 | 7.0 | 29.365 | 121 | 1.193 | 4.294 | <0.001 | <0.001 | <0.001
Shanghai | Price | 1,567 | 2659.653 | 2509.5 | 630.990 | 3,590 | 0.113 | 1.828 | 0.011 | <0.001 | <0.001
 | First difference | −45 | 10.624 | 6.0 | 21.173 | 102 | 1.631 | 6.957 | <0.001 | <0.001 | <0.001
Tianjing | Price | 909 | 1653.221 | 1660.0 | 302.764 | 2,039 | −0.581 | 2.738 | 0.011 | <0.001 | <0.001
 | First difference | −70 | 5.635 | 2.0 | 15.963 | 77 | 0.714 | 7.911 | <0.001 | <0.001 | <0.001
Chongqing | Price | 580 | 911.642 | 938.5 | 154.988 | 1,126 | −0.592 | 2.408 | 0.007 | <0.001 | <0.001
 | First difference | −27 | 2.741 | 2.0 | 10.430 | 43 | 0.541 | 5.730 | <0.001 | <0.001 | <0.001
Shenzhen | Price | 1,176 | 3405.211 | 3096.5 | 1185.993 | 4,966 | 0.050 | 1.725 | 0.008 | <0.001 | <0.001
 | First difference | −25 | 19.963 | 7.0 | 39.994 | 225 | 2.561 | 10.861 | <0.001 | <0.001 | <0.001
Guangzhou | Price | 1,068 | 2261.089 | 2300.5 | 654.347 | 3,254 | 0.022 | 1.928 | 0.019 | <0.001 | <0.001
 | First difference | −38 | 11.344 | 8.0 | 21.305 | 77 | 0.720 | 3.616 | 0.003 | <0.001 | <0.001
Hangzhou | Price | 1,206 | 1913.342 | 1913.0 | 373.503 | 2,488 | −0.163 | 2.131 | 0.034 | <0.001 | <0.001
 | First difference | −54 | 6.783 | 3.0 | 18.192 | 108 | 1.422 | 9.539 | <0.001 | <0.001 | <0.001
Nanjing | Price | 706 | 1336.289 | 1289.0 | 337.566 | 1,828 | 0.007 | 1.841 | 0.013 | <0.001 | <0.001
 | First difference | −103 | 5.180 | 3.0 | 13.812 | 59 | −1.706 | 23.102 | <0.001 | <0.001 | <0.001
Wuhan | Price | 539 | 1164.584 | 1154.0 | 335.519 | 1,630 | −0.143 | 1.941 | 0.017 | <0.001 | <0.001
 | First difference | −15 | 5.751 | 3.0 | 8.961 | 43 | 1.219 | 4.803 | <0.001 | <0.001 | <0.001
Chengdu | Price | 611 | 942.289 | 959.0 | 152.381 | 1,181 | −0.216 | 2.148 | 0.030 | <0.001 | <0.001
 | First difference | −50 | 2.434 | 2.0 | 9.059 | 34 | −0.560 | 8.965 | <0.001 | <0.001 | <0.001
In the fields of finance and economics, manifestations of non-linear features at higher moments have been extensively reported across a wide range of time-series data (Yang et al., 2008; Jin and Xu, 2024g). We apply the Brock–Dechert–Scheinkman (BDS) test (Brock et al., 1996) to examine the ten residential real estate price time series for potential non-linear patterns. We implement the BDS test with embedding dimensions of 2–10 and with 0.5, 1.0, 1.5, 2.0, 2.5, and 3.0 times a particular price index time series’ standard deviation as the distance ε used to assess the proximity of different data points. The resultant p-values of the tests are all virtually zero. These findings suggest that each of the ten price indices exhibits non-linearities. Given these facts, this work seeks to forecast the ten non-normal and non-linear residential real estate price indices using Gaussian process regressions.
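For intuition, the proximity measure underlying the BDS test is the correlation integral: the fraction of pairs of m-dimensional embedded points that lie within ε of each other. A minimal numpy sketch follows (illustration only; the full BDS statistic additionally standardises C_m − C_1^m by its asymptotic variance, which is omitted here):

```python
import numpy as np

def correlation_integral(x, m, eps):
    """Fraction of pairs of m-histories within eps of each other (sup norm)."""
    x = np.asarray(x, dtype=float)
    n = len(x) - m + 1
    # Each row is an m-history (x_t, x_{t+1}, ..., x_{t+m-1}).
    emb = np.column_stack([x[i:i + n] for i in range(m)])
    dist = np.abs(emb[:, None, :] - emb[None, :, :]).max(axis=2)
    iu = np.triu_indices(n, k=1)  # distinct pairs only
    return float((dist[iu] < eps).mean())

# For i.i.d. noise, C_m should be close to C_1**m; non-linear dependence
# shows up as a systematic departure from that relationship.
rng = np.random.default_rng(0)
series = rng.normal(size=300)
eps = 1.5 * series.std()
c1 = correlation_integral(series, 1, eps)
c2 = correlation_integral(series, 2, eps)
print(abs(c2 - c1 ** 2) < 0.05)
```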
Method
The Gaussian process regression, a kind of probabilistic kernel model that has been demonstrated to be effective at forecasting a variety of non-linear patterns across scientific disciplines (Jin and Xu, 2024c,f), is the forecasting technique examined in this study. To illustrate the model, the training data with an unknown distribution are denoted by {(xi,yi); i=1,2,…,T}, the d-dimensional predictors by xi∈ℝd, and the target by yi∈ℝ. The real estate price indices for each city are projected using 20 lagged price indices as predictors. For example, to forecast the price index for the 21st month, the price indices of the previous 20 consecutive months are used as predictors.
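The construction of the lagged predictor matrix can be sketched as follows (a minimal numpy illustration; variable names are ours):

```python
import numpy as np

def make_lagged(series, n_lags=20):
    """Return (X, y) where row t of X holds the n_lags values before y[t]."""
    s = np.asarray(series, dtype=float)
    # Column i contains the series shifted by i; row t is (s_t, ..., s_{t+n_lags-1}).
    X = np.column_stack([s[i:len(s) - n_lags + i] for i in range(n_lags)])
    y = s[n_lags:]  # each target is the value right after its 20 lags
    return X, y

# With 30 months of data and 20 lags, 10 predictor/target pairs remain.
X, y = make_lagged(np.arange(30.0), n_lags=20)
print(X.shape, y.shape)  # (10, 20) (10,)
```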
Let y=xTβ+ε denote a linear regression, where ε∼N(0,σ2) denotes the error term. In contrast, Gaussian process regressions use explicit basis functions and latent variables to define the target variable (Jin and Xu, 2024e). The basis function is expressed using b, and the latent variables from the Gaussian process are expressed using l(xi), such that they jointly follow a Gaussian distribution. The covariance function of the latent variables represents the target’s smoothness, and the basis function’s role is to project the predictors onto the feature space (Jin and Xu, 2024j,k).
The covariance and mean are two metrics that are frequently used to define a Gaussian process (GP). We are going to express the mean using m(x)=E(l(x)) and the covariance using k(x,x′)=Cov[l(x),l(x′)]. Then, we are going to express the Gaussian process regression using y=b(x)Tβ+l(x), where l(x)∼GP(0,k(x,x′)) and b(x)∈ℝp. Via θ, a hyper-parameter, we are going to parameterise k(x,x′) using k(x,x′|θ). When using a specific technique to train a Gaussian process regression, the following variables will normally be estimated: σ2, θ, and β. We are also going to define kernels, expressed as k’s, and basis functions, expressed as b’s, to be adopted for model training.
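For concreteness, when the basis term is absent (the empty basis function case), the posterior mean of the GP regression at new inputs has the standard closed form K*(K + σ²I)⁻¹y. A minimal numpy sketch, assuming a squared exponential kernel purely for illustration:

```python
import numpy as np

def sq_exp_kernel(A, B, sigma_l=0.2, sigma_f=1.0):
    """Squared exponential kernel matrix between row sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return sigma_f ** 2 * np.exp(-0.5 * d2 / sigma_l ** 2)

def gp_posterior_mean(X_train, y_train, X_new, sigma_noise=0.05):
    """Posterior mean of a zero-mean GP regression at X_new."""
    K = sq_exp_kernel(X_train, X_train)
    K_star = sq_exp_kernel(X_new, X_train)
    # alpha = (K + sigma^2 I)^{-1} y; the posterior mean is K_* alpha.
    alpha = np.linalg.solve(K + sigma_noise ** 2 * np.eye(len(X_train)), y_train)
    return K_star @ alpha

# Toy usage: smooth a noisy-free sine observed at 8 points.
X = np.linspace(0.0, 1.0, 8)[:, None]
y = np.sin(2.0 * np.pi * X[:, 0])
pred = gp_posterior_mean(X, y, X)
print(pred.shape)  # (8,)
```

In practice the kernel, its hyper-parameters θ, the noise variance σ², and any basis coefficients β would all be estimated, as described below.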
The two types of kernels considered in the current study are isotropic kernels and non-isotropic kernels (automatic relevance determination kernels). Five distinct kernels are examined in both their isotropic and non-isotropic forms. Equations (A.1)–(A.10) in the appendix give the specifications of all kernels under consideration, where σl denotes an isotropic kernel’s characteristic length scale, α>0 denotes the scale-mixture parameter, σf denotes the standard deviation of the signal, and r=√((xi−xj)′(xi−xj)). The positiveness of σl and σf is enforced through the parameterisation θ=(θ1,θ2)=(log σl, log σf). For non-isotropic kernels, each predictor m (m=1,…,d) has its own length scale σm, so the length scales are expressed as (σ1,…,σd) and, correspondingly, θ is expressed as θ=(θ1,…,θd,θd+1)=(log σ1,…,log σd, log σf).
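For illustration, the isotropic exponential, rational quadratic, and Matern 3/2 kernels selected later for the ten cities can be written as functions of r as follows (standard textbook forms; we assume the appendix equations (A.1), (A.4), and (A.5) follow these definitions):

```python
import numpy as np

def k_exponential(r, sigma_l, sigma_f):
    """Isotropic exponential kernel (cf. Eq. (A.1), assumed standard form)."""
    return sigma_f ** 2 * np.exp(-r / sigma_l)

def k_rational_quadratic(r, sigma_l, sigma_f, alpha):
    """Isotropic rational quadratic kernel (cf. Eq. (A.4), assumed standard form)."""
    return sigma_f ** 2 * (1.0 + r ** 2 / (2.0 * alpha * sigma_l ** 2)) ** (-alpha)

def k_matern32(r, sigma_l, sigma_f):
    """Isotropic Matern 3/2 kernel (cf. Eq. (A.5), assumed standard form)."""
    s = np.sqrt(3.0) * r / sigma_l
    return sigma_f ** 2 * (1.0 + s) * np.exp(-s)

# All three equal sigma_f^2 at r = 0 and decay as r grows.
r = np.array([0.0, 1.0, 2.0])
print(k_matern32(r, sigma_l=1.0, sigma_f=1.0))
```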
In a manner similar to how different kernels are considered, this work takes into account four different basis functions, which are detailed in Eqs. (A.11)–(A.14) in the appendix.
Ten-fold cross validation and Bayesian optimisation, based on the expected improvement per second plus (EIPSP) technique, are used to estimate the model parameters. Let f denote the objective and Q(f) a Gaussian process model of f. The Bayesian approach first evaluates the objective at a small number of randomly chosen seed points inside the variable boundaries; if evaluation faults are encountered, the algorithm keeps gathering data points until it reaches enough successful evaluation cases. The algorithm’s first and second steps are then repeated, as shown below. The first step is the updating of Q(f) to produce the posterior distribution over the objective. The second step is choosing a new data point x that optimises the acquisition function a(x). A maximum of 100 iterations is used. The purpose of a(x) is to evaluate the goodness of x with respect to Q. Rather than evaluating values that would elevate the objective function, expected improvement acquisition functions evaluate expected amounts of improvement in the objective. Let μQ(xbest) express the lowest posterior mean and xbest the data point at which the lowest posterior mean is reached. We can express the expected improvement (EI) as EI(x, Q) = EQ[max(0, μQ(xbest) − f(x))]. The Bayesian strategy can offer higher advantages per unit of time by applying a time-weighting scheme to the acquisition function, since the amount of time required to assess the objective may vary depending on the location. Throughout the optimisation process, an additional Bayesian model of the amount of time needed to evaluate the objective as a function of x is maintained. In light of this, we can express the acquisition function’s EI per second (EIPS) as EIPS(x) = EI(x, Q)/μS(x), where μS(x) expresses the posterior mean of this additional timing GP model.
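Under a Gaussian posterior, the expected improvement above has a closed form; a minimal sketch for objective minimisation follows (the per-second variant would divide this quantity by the predicted evaluation time at x):

```python
import math

def expected_improvement(mu_x, sigma_x, mu_best):
    """Closed-form EI of evaluating x, for objective minimisation.

    mu_x, sigma_x: posterior mean and standard deviation of the objective at x.
    mu_best: lowest posterior mean found so far.
    """
    if sigma_x <= 0.0:
        return max(0.0, mu_best - mu_x)
    z = (mu_best - mu_x) / sigma_x
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal PDF
    return (mu_best - mu_x) * cdf + sigma_x * pdf

# A point whose mean equals the incumbent still has positive EI through
# its posterior uncertainty, which is what drives exploration.
print(round(expected_improvement(1.0, 0.5, 1.0), 4))
```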
To avoid the acquisition function overutilising a specific area and becoming stuck at a local minimum of the objective, its behaviour is modified as follows. Let σF(x) express the posterior objective’s standard deviation corresponding to x and σ express the additive noise’s posterior standard deviation, so that the overall posterior standard deviation σQ(x) satisfies σQ(x)2 = σF(x)2 + σ2. Let tσ express the exploration ratio. After each iteration, the acquisition function based on the EIPSP algorithm determines whether the next data point x satisfies σF(x) < tσσ. If this criterion is met, x is regarded as overexploiting, and the kernel function is modified by multiplying θ by the number of iterations (Bull, 2011). In essence, the EIPSP adjustment raises σF for data points between observations. A new data point is then produced using the newly fitted kernel. If the new data point is similarly overexploiting, θ is multiplied by an additional factor of ten in subsequent trials. This strategy is limited to five repetitions in order to obtain a data point x that is not considered overexploiting; the modified x is then accepted as the next point by the EIPSP algorithm. To obtain a more accurate overall response, the algorithm strikes a balance between focusing on previously investigated nearby data points and looking at new data points.
Bayesian optimisation procedures are carried out over basis functions, kernels, and whether or not predictors are standardised. Forecast performance is determined according to the relative root mean square error (RRMSE), which allows for comparisons of different prediction outcomes across different models or targets (Li et al., 2013). The RRMSE can be expressed as RRMSE = 100% × √((1/n) Σi=1..n ((yi − ŷi)/yi)2), where ŷi expresses the target’s predicted numerical value, n expresses the number of observations utilised for performance assessments, and yi expresses the target variable’s observed numerical value. Two additional performance metrics are adopted to assess prediction accuracy: the mean absolute error (MAE) and root mean square error (RMSE), whose units are identical to the target variable and whose magnitudes are related to the target variable. The RMSE can be expressed as RMSE = √((1/n) Σi=1..n (yi − ŷi)2). The MAE can be expressed as MAE = (1/n) Σi=1..n |yi − ŷi|. Finally, we also consider the correlation coefficient (CC) for measuring performance, which can be expressed as CC = Σi=1..n (yi − ȳ)(ŷi − ŷ̄) / √(Σi=1..n (yi − ȳ)2 Σi=1..n (ŷi − ŷ̄)2), where ȳ and ŷ̄ stand for the averages of the observed and predicted values.
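The four performance metrics can be sketched as follows (RRMSE here follows the per-observation relative form stated above; the hypothetical forecast values are for illustration only):

```python
import numpy as np

def forecast_metrics(y_true, y_pred):
    """RRMSE (%), RMSE, MAE, and correlation coefficient of forecasts."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    rrmse = 100.0 * np.sqrt(np.mean((err / y_true) ** 2))  # relative, in %
    rmse = np.sqrt(np.mean(err ** 2))  # same units as the target
    mae = np.mean(np.abs(err))
    cc = np.corrcoef(y_true, y_pred)[0, 1]
    return rrmse, rmse, mae, cc

# Hypothetical index observations vs. one-month-ahead forecasts.
y_true = [3400.0, 3450.0, 3500.0]
y_pred = [3395.0, 3455.0, 3490.0]
rrmse, rmse, mae, cc = forecast_metrics(y_true, y_pred)
print(round(rrmse, 4), round(mae, 2))
```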
Result
For each city, data from its residential real estate price indices are utilised for model training from 7M2005 to 4M2019, and for model performance testing for one-month ahead forecasts from 5M2019 to 4M2021. Figure 2 shows the outcomes of EIPSP optimisations based on training data for all price indices. These results indicate that (a) the isotropic rational quadratic kernel (Eq. (A.4)), empty basis function (Eq. (A.11)) and standardised predictors are chosen for Beijing, (b) the isotropic rational quadratic kernel (Eq. (A.4)), empty basis function (Eq. (A.11)), and non-standardised predictors are chosen for Shanghai, (c) the isotropic exponential kernel (Eq. (A.1)), linear basis function (Eq. (A.13)), and standardised predictors are chosen for Tianjing, (d) the isotropic rational quadratic kernel (Eq. (A.4)), constant basis function (Eq. (A.12)), and non-standardised predictors are chosen for Chongqing, (e) the isotropic Matern 3/2 kernel (Eq. (A.5)), constant basis function (Eq. (A.12)), and standardised predictors are chosen for Shenzhen, (f) the isotropic exponential kernel (Eq. (A.1)), empty basis function (Eq. (A.11)), and standardised predictors are chosen for Guangzhou, (g) the isotropic exponential kernel (Eq. (A.1)), constant basis function (Eq. (A.12)), and standardised predictors are chosen for Hangzhou, (h) the isotropic rational quadratic kernel (Eq. (A.4)), empty basis function (Eq. (A.11)), and standardised predictors are chosen for Nanjing, (i) the isotropic exponential kernel (Eq. (A.1)), empty basis function (Eq. (A.11)), and standardised predictors are chosen for Wuhan, and (j) the isotropic exponential kernel (Eq. (A.1)), empty basis function (Eq. (A.11)), and standardised predictors are chosen for Chengdu. For the ten GPR models created using the ten-fold cross validation for the residential real estate price index of each city, the results of parameter estimates are shown in Table 2. 
The initials ‘CV1’, ‘CV2’, …, and ‘CV10’ are used to denote these parameter estimations, where ‘CV’ stands for ‘cross validation’.

Fig. 2. Optimisation processes based upon the EIPSP algorithm for monthly residential real estate price indices.
Table 2. Parameter estimates of the GPR models across the ten cross-validation folds (CV1–CV10) for each city’s residential real estate price index.

Parameter | CV1 | CV2 | CV3 | CV4 | CV5 | CV6 | CV7 | CV8 | CV9 | CV10
---|---|---|---|---|---|---|---|---|---|---
Parameter | Beijing: Isotropic rational quadratic kernel, empty basis function, and standardised predictors | |||||||||
8.713 | 8.947 | 8.998 | 8.697 | 9.072 | 8.826 | 8.715 | 8.919 | 8.974 | 8.692 | |
14.406 | 14.249 | 14.125 | 14.492 | 14.222 | 14.488 | 14.374 | 14.005 | 14.627 | 14.176 | |
0.004 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.004 | |
3394.053 | 3376.831 | 3359.211 | 3380.081 | 3370.464 | 3362.822 | 3364.386 | 3379.430 | 3381.626 | 3371.865 | |
Parameter | Shanghai: Isotropic rational quadratic kernel, empty basis function, and non-standardised predictors | |||||||||
5.736 | 5.531 | 7.232 | 5.695 | 5.661 | 7.164 | 5.572 | 7.152 | 7.363 | 5.589 | |
9671.714 | 10403.556 | 9929.080 | 9429.178 | 9643.957 | 10045.575 | 9462.530 | 9846.397 | 10026.156 | 10190.834 | |
[Table 2, continued: estimated parameters of the ten cross-validation GPR models (‘CV1’–‘CV10’) for each city. The parameter row labels are not recoverable here; the per-city model settings are as follows.]

City | Kernel | Basis function | Predictors |
---|---|---|---|
Tianjing | Isotropic exponential | Linear | Standardised |
Chongqing | Isotropic rational quadratic | Constant | Non-standardised |
Shenzhen | Isotropic Matern 3/2 | Constant | Standardised |
Guangzhou | Isotropic exponential | Empty | Standardised |
Hangzhou | Isotropic exponential | Constant | Standardised |
Nanjing | Isotropic rational quadratic | Empty | Standardised |
Wuhan | Isotropic exponential | Empty | Standardised |
Chengdu | Isotropic exponential | Empty | Standardised |
Models ‘CV1’, ‘CV2’, …, and ‘CV10’, the ten GPR models created for the residential real estate price index of each city and presented in Table 2, are used to predict the numerical values of the price index for the testing period of 5M2019 to 4M2021. Each price index therefore has ten projected values for each month of the testing period, and the final price index estimate for a particular month is the average of the ten projections. By averaging out any idiosyncratic predictions produced by a particular sub-model, this technique may help provide reliable and stable projections. The desirable qualities and benefits of equal weighting have been discussed in the literature. Fig. 3 compares the projected and observed residential real estate price indices for each city, and Fig. 4 shows the corresponding percentage forecast errors. The projected price indices clearly follow the observed price indices quite closely. Additional prediction performance measures, in terms of the RMSE, RRMSE, MAE, and CC, are summarised in Table 3 for the results in Figs. 3 and 4. In particular, the RRMSEs for the ten price indices range from 0.0207% to 0.2818%. Based upon RRMSE thresholds from previous research that rate model prediction accuracy as excellent, good, fair, or poor, the GPR models constructed here achieve a high degree of prediction accuracy.
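The equal-weight ensemble described above can be sketched as follows. This is a minimal illustration with a synthetic monthly series, not the paper's implementation: scikit-learn's `GaussianProcessRegressor` stands in for the original models, and the ten sub-models are trained on random subsamples as a stand-in for the paper's cross-validation folds.

```python
# Equal-weight ensemble of ten GPR sub-models, averaged into one forecast.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RationalQuadratic

rng = np.random.default_rng(0)
t = np.arange(190, dtype=float)                                  # monthly index
y = 100 + 0.5 * t + 5 * np.sin(t / 12) + rng.normal(0, 1, t.size)  # toy price index

X_train, y_train = t[:166, None], y[:166]   # training window
X_test = t[166:, None]                      # 24-month testing window

preds = []
for seed in range(10):                      # ten sub-models, 'CV1'..'CV10'
    idx = rng.choice(X_train.shape[0], size=150, replace=False)
    gpr = GaussianProcessRegressor(kernel=RationalQuadratic(),
                                   normalize_y=True, random_state=seed)
    gpr.fit(X_train[idx], y_train[idx])
    preds.append(gpr.predict(X_test))

ensemble_forecast = np.mean(preds, axis=0)  # equal-weight average
print(ensemble_forecast.shape)              # (24,): one value per test month
```

The equal-weight average requires no tuning and, as noted in the text, damps idiosyncratic errors of any single sub-model.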

Fig. 3. The plot of forecasted vs. observed series for residential real estate price indices during the testing phase from 5M2019 to 4M2021.

Fig. 4. The plot of percentage forecast errors for residential real estate price indices during the testing phase from 5M2019 to 4M2021.
City | Testing RRMSE | Testing RMSE | Testing MAE | Testing CC |
---|---|---|---|---|
Beijing | 0.095% | 4.301 | 3.268 | 92.645% |
Shanghai | 0.087% | 3.079 | 2.392 | 99.135% |
Tianjing | 0.042% | 0.839 | 0.622 | 99.887% |
Chongqing | 0.282% | 3.129 | 2.085 | 93.804% |
Shenzhen | 0.096% | 4.733 | 3.604 | 88.225% |
Guangzhou | 0.061% | 1.944 | 1.467 | 99.899% |
Hangzhou | 0.021% | 0.506 | 0.427 | 99.981% |
Nanjing | 0.092% | 1.665 | 1.347 | 99.433% |
Wuhan | 0.036% | 0.579 | 0.432 | 99.833% |
Chengdu | 0.025% | 0.291 | 0.181 | 99.961% |
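The four accuracy measures in Table 3 can be computed as below. The exact formulas are assumptions based on their standard definitions (RRMSE as RMSE relative to the mean observed level, CC as the Pearson correlation coefficient), since the source does not spell them out.

```python
# Forecast accuracy metrics: RMSE, RRMSE (%), MAE, and CC (%).
import numpy as np

def forecast_metrics(observed, forecast):
    observed = np.asarray(observed, float)
    forecast = np.asarray(forecast, float)
    err = forecast - observed
    rmse = np.sqrt(np.mean(err ** 2))
    rrmse = rmse / np.mean(observed) * 100            # relative to mean level
    mae = np.mean(np.abs(err))
    cc = np.corrcoef(observed, forecast)[0, 1] * 100  # Pearson correlation
    return {"RMSE": rmse, "RRMSE%": rrmse, "MAE": mae, "CC%": cc}

obs = np.array([100.0, 102.0, 101.0, 104.0])   # toy observed index values
fc  = np.array([100.5, 101.5, 101.2, 103.6])   # toy forecasts
print(forecast_metrics(obs, fc))
```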
Fig. 5 shows the results of an error autocorrelation analysis conducted to evaluate the suitability of the built models. The analysis considers normalised autocorrelations for up to 20 lags. The results reveal no obvious autocorrelations and confirm the overall validity of the models; significant autocorrelations would instead suggest room for improving forecast accuracy by modelling them further. It might also be worth noting that, although the empirical evidence is mixed, incorporating the AR conditional heteroskedasticity effect into a prediction model may improve its performance.

Fig. 5. Analysis of autocorrelations of errors based upon the GPR models for residential real estate price indices of ten cities.
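The autocorrelation check above can be sketched as follows: sample autocorrelations of the forecast errors are computed for up to 20 lags and compared against approximate 95% white-noise bounds of ±1.96/√n. The bound and the biased ACF estimator are standard choices, assumed rather than taken from the source.

```python
# Sample autocorrelations of forecast errors, with white-noise bounds.
import numpy as np

def sample_autocorr(errors, max_lag=20):
    e = np.asarray(errors, float) - np.mean(errors)
    denom = np.sum(e ** 2)
    return np.array([np.sum(e[k:] * e[:-k]) / denom if k else 1.0
                     for k in range(max_lag + 1)])

rng = np.random.default_rng(1)
errors = rng.normal(0, 1, 24)               # 24 monthly forecast errors (toy)
acf = sample_autocorr(errors, max_lag=20)
bound = 1.96 / np.sqrt(errors.size)         # approx. 95% white-noise bound
flagged = np.nonzero(np.abs(acf[1:]) > bound)[0] + 1
print(acf[0], flagged)                      # lag-0 autocorrelation is 1
```

An empty `flagged` array corresponds to the "no obvious autocorrelations" finding in the text.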
We benchmark the GPR models against the following models that use the same predictors: the support vector regression (SVR) model, the regression tree (RT) model, and the AR model. Table 4 compares these models based on the RMSE for the testing phase from 5M2019 to 4M2021; the GPR models yield the lowest RMSE for each city. We also perform the Diebold–Mariano test (Diebold and Mariano, 2002) to assess the significance of the differences in forecast accuracy between the GPR models and each benchmark model. The p-values are all below 0.01, suggesting that the GPR models deliver statistically significantly better forecast performance than the benchmark models for the price index of each city.
City | GPR | SVR | RT | AR |
---|---|---|---|---|
Beijing | 4.301 | 8.261 | 9.519 | 12.186 |
Shanghai | 3.079 | 5.659 | 7.423 | 7.374 |
Tianjing | 0.839 | 1.247 | 1.886 | 1.885 |
Chongqing | 3.129 | 6.439 | 7.450 | 8.925 |
Shenzhen | 4.733 | 5.935 | 7.655 | 13.392 |
Guangzhou | 1.944 | 3.767 | 3.922 | 5.905 |
Hangzhou | 0.506 | 1.013 | 0.869 | 1.517 |
Nanjing | 1.665 | 2.391 | 2.935 | 4.605 |
Wuhan | 0.579 | 1.073 | 0.947 | 1.314 |
Chengdu | 0.291 | 0.549 | 0.569 | 0.814 |
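A minimal sketch of the Diebold–Mariano test underlying Table 4 follows, assuming squared-error loss and a one-step forecast horizon (so the long-run variance of the loss differential reduces to its sample variance); the paper's exact test configuration is not stated.

```python
# Diebold-Mariano test for equal forecast accuracy (squared-error loss, h=1).
import numpy as np
from math import erf, sqrt

def diebold_mariano(e1, e2):
    d = np.asarray(e1, float) ** 2 - np.asarray(e2, float) ** 2  # loss differential
    n = d.size
    dm = np.mean(d) / np.sqrt(np.var(d) / n)         # DM statistic
    p = 1 - erf(abs(dm) / sqrt(2))                   # two-sided normal p-value
    return dm, p

rng = np.random.default_rng(2)
e_gpr = rng.normal(0, 1.0, 24)   # toy GPR forecast errors
e_ar  = rng.normal(0, 3.0, 24)   # toy AR benchmark errors (larger spread)
stat, pval = diebold_mariano(e_gpr, e_ar)
print(round(stat, 3), round(pval, 4))
```

A negative statistic with a small p-value favours the first model, matching the direction of the comparisons reported in the text.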
Implication
For investors and governments, forecasts of residential real estate price indices are an important topic. Investors need real estate price forecasts for portfolio allocation and adjustment, strategic planning, and risk management. Policymakers need them for market assessments and for policy development, implementation, and modification, especially for preventing market overheating and stimulating the economy when necessary. To the best of the authors’ knowledge, the forecasting and valuation methods employed by numerous investors, including those in the public sector, are often based on econometric techniques, particularly time-series methods where price indices are relevant, and professional judgments from experts are still employed as well. This has a reasonable basis: econometric methods and expert assessments are relatively easy to develop, use, and maintain, have been widely adopted by many forecast users for many years, and many of them can offer a respectable level of prediction accuracy. Some policymakers and investors may find it difficult to adopt machine learning models, since some decision-makers still view them as overly complex forecasting tools, but it is generally agreed that these models are worth investigating for their potential, especially given increasingly accessible computational capabilities and the realistic possibility of irregularities in price time-series data. Indeed, several decision-makers and savvy investors have recently expressed growing interest in machine learning methods for forecasting real estate values. The research here continues the tradition of investigating the potential of Gaussian process regressions to address forecasting problems for residential real estate price indices.
Given the approach provided here for developing such forecast models for ten major Chinese cities, and the demonstrated prediction accuracy and stability, the results suggest that machine learning techniques are well worth investigating, possibly for a greater diversity of real estate types and a wider coverage of locations.
Conclusion
The current study focuses on residential real estate price index forecasts for ten major Chinese cities. Using the Gaussian process regression approach and monthly data from 7M2005 to 4M2021, we construct forecast models. When building the models with Bayesian optimisations and cross validation, we pay particular attention to four basis functions, ten kernels, and two approaches to predictor standardisation. With relative root mean square errors ranging from 0.0207% to 0.2818% for the ten price indices over the two-year period from 5M2019 to 4M2021, the built models produce solid out-of-sample projections. Market participants and policymakers might use these forecast models to enhance their understanding of the residential real estate market. Future research may find it interesting to examine additional Bayesian optimisation techniques beyond the expected improvement per second plus algorithm considered here. The forecasting exercise may also be broadened to incorporate other cities and a variety of other real estate price indices.
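The kernel-selection exercise described in the study can be sketched as below. This is an illustrative stand-in only: scikit-learn's kernels and cross-validation replace the paper's ten kernels, four basis functions, and Bayesian optimisation, and standard k-fold splitting (rather than a time-ordered scheme) is used on a synthetic series.

```python
# Comparing candidate GPR kernels by cross-validated RMSE.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, RationalQuadratic
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
t = np.arange(120, dtype=float)
y = 100 + 0.8 * t + rng.normal(0, 2, t.size)   # toy price-index series
X = t[:, None]

candidates = {
    "squared exponential": RBF(),
    "Matern 3/2": Matern(nu=1.5),
    "rational quadratic": RationalQuadratic(),
}
scores = {}
for name, kernel in candidates.items():
    gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    cv_rmse = -cross_val_score(gpr, X, y, cv=5,
                               scoring="neg_root_mean_squared_error").mean()
    scores[name] = cv_rmse                     # lower is better

best = min(scores, key=scores.get)
print(best, round(scores[best], 3))
```

For real monthly index data, a rolling-origin or blocked time-series split would respect the temporal ordering better than plain k-fold.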
Appendix: Explored Kernels and Basis Functions
In this appendix, we list all explored kernels in Eqs. (A.1)–(A.10) and basis functions in Eqs. (A.11)–(A.14):
ORCID
Bingzi Jin https://orcid.org/0009-0005-1620-7772
Xiaojie Xu https://orcid.org/0000-0002-4452-1540