Forecasts of Residential Real Estate Price Indices for Ten Major Chinese Cities through Gaussian Process Regressions

    https://doi.org/10.1142/S2810943024500136 | Cited by: 7 (Source: Crossref)

    Abstract

    Due to the rapid growth of the Chinese housing market over the past ten years, forecasting home prices has become a crucial issue for investors and authorities alike. In this research, utilising Bayesian optimisations and cross validation, we investigate Gaussian process regressions across various kernels and basis functions for monthly residential real estate price index projections for ten significant Chinese cities from July 2005 to April 2021. The developed models provide accurate out-of-sample forecasts for the ten price indices from May 2019 to April 2021, with relative root mean square errors varying from 0.0207% to 0.2818%. Our findings could be used individually or in combination with other projections to formulate theories about the trends in the residential real estate price index and carry out additional policy analysis.

    Introduction

    The last ten years have seen a significant expansion of the Chinese real estate sector. Issues with predicting real estate prices have undoubtedly grown to be among the top concerns for investors and politicians. Understanding real estate price trends and fluctuations is crucial because they directly affect people’s decisions about where to live and how to invest in real estate, as well as the development and execution of regulatory agencies’ policies. Of course, many consumers and providers of projections are interested in predicting real estate prices.

    Many academics and professionals are interested in making accurate and reliable predictions of financial and economic time-series data. Certain crucial time-series techniques, such as the autoregressive (AR), vector autoregressive (VAR), and vector error correction (VEC) models, as well as a wide variety of their extensions and adaptations, have been researched for various forecasting applications (Jin and Xu, 2024b; Yang et al., 2018). Neural networks, Gaussian process regressions, support vector regressions, regression trees, random forests, nearest neighbours, deep learning, ensemble learning, boosting, and bagging are just a few examples of the many machine learning techniques that have recently been found to be effective and promising solutions to a variety of real estate price forecasting problems. These assessments, while not exhaustive, are generally consistent with a variety of empirical studies on the adoption of machine learning techniques (Alade et al., 2021) for forecasting in finance and economics (Yang et al., 2008), and the neural network model appears to be one of the most widely used methods (Jin and Xu, 2024a,n) for predicting prices of real estate properties. However, the Gaussian process regression has not been thoroughly investigated for predictions of time-series data from the real estate price index.

    Neal’s research on Bayesian learning for neural networks is the basis for a unique regression technique (Neal, 2012). Since the technique depends on priors over functions of Gaussian processes, it is suitable for modelling noisy data. It was shown that at the limit of an infinite network, many different kinds of neural network-based Bayesian regression models would converge to Gaussian processes (Neal, 2012; Jin and Xu, 2024h,i). The Gaussian process has been effectively used in regressions to simulate both noisy (Jin and Xu, 2024l) and noise-free (Neal, 1997; Jin and Xu, 2024m) data. In their study, Brahim-Belhouari and Vesin (2001) compared radial basis function neural networks and Gaussian processes for forecasting issues involving stationary time-series data and found that Bayesian learning produces better prediction outcomes. Brahim-Belhouari and Bermak’s (2004) research suggests that it is advantageous to examine various covariance functions, which is the approach used in this study, and that prediction techniques based on Gaussian processes may be used to successfully resolve forecasting issues for non-stationary time-series data. Radial basis function neural networks are outperformed by Gaussian process regressions, according to the research by Brahim-Belhouari and Bermak (2004). Additionally, the precise matrix operations used to integrate the prior and noise models are what give the Gaussian process formulation its value and advantage (Brahim-Belhouari and Bermak, 2004). Similar to how we would utilise model averaging, Brahim-Belhouari and Bermak (2004) also suggested using the strategy of multi-model forecasting using Gaussian process predictors. A new study suggests that Gaussian process regressions might be used to accurately forecast steel prices for the resource industry.

    Forecasting efforts have also been observed for residential real estate prices using classic econometric models and, more recently, machine learning techniques. For example, semiparametric models are more useful for forecasting and evaluating residential housing prices, according to Gençay and Yang’s (1996a,b) comparison of parametric and semiparametric classical econometric models. Glennon et al. (2018) discover that utilising several models increases the accuracy of property valuations based on house price indexes. Clapp and Giaccotto (1992) show that the evaluated value approach is more effective than the repeat sales methodology by creating a mechanism for reducing the impact of measurement mistakes connected with it. According to Kaboudan and Sarkar (2007), projections derived from equations calculated from city-wide disaggregated data have an advantage over those derived from local average pricing equations. Mei and Fang (2017) create a dynamic state forecasting model of the average selling price using multiple regression and trend analysis. Levesque (1994) uses an airport noise case study to examine the breakdown of residential property values. The performance of the AR integrated moving average model is examined by Hepşen and Vatansever (2011). Principal components analysis is used by Baroni et al. (2005) to create a repeat sales index that predicts apartment prices. Guo (2020) examines future price stability using both linear and non-linear regression models. For machine learning approaches, Paris (2008) investigates artificial neural networks to predict alterations in national and local price indices for the UK residential real estate market. Chi (2017) suggests using a spatial back-propagation neural network to estimate the price of residential real estate. In order to estimate demand for residential development, Bee-Hua (2000) proposes combining neural networks with evolutionary algorithms. Štubňová et al. (2020) discover that neural networks outperform regression models for estimating residential real estate market prices. For predicting residential unit rent prices, Seya and Shiroi (2021) compare the deep neural network with the nearest neighbour Gaussian process and conclude that the former has greater potential. In order to estimate construction cost from economic variables and indices, Rafiei and Adeli (2018) suggest using an unsupervised deep Boltzmann machine (DBM) learning approach, a softmax layer to extract pertinent features from the input data, and a three-layer back-propagation neural network (or support vector machine) to transform the trained unsupervised DBM into a supervised regression network. The extra-trees regression technique and radial basis function-based support vector regression algorithm are both effective in modelling the fine-scale spatiotemporal distribution of residential land values, according to Zhang et al. (2021). Yoo et al. (2012) examine the hedonic modelling of residential property sales prices using cubist, random forest, and conventional ordinary least squares, and discover that the random forest produces the most accurate results. Hong et al. (2020), Dimopoulos and Bakas (2019), and Dimopoulos et al. (2018) all show how machine learning models may be used to increase the accuracy of real estate mass assessments. Machine learning models are also helpful for residential land evaluations, according to Ai et al. (2020).
Picchetti (2017) demonstrates how the gradient tree boosting approach may be utilised to get around the problem of sample heterogeneity in hedonic geospatial residential property price assessments. Different machine learning methods have demonstrated promising accuracy in the literature for real estate price forecasts. Based upon different empirical evidence, the mean absolute percentage forecast errors have ranged from below 1% to above 10%, considering that various time-series data have varied features and some are more difficult to forecast than others.

    To continue this theme, we focus on Gaussian process regressions for residential real estate price index estimations for ten major Chinese cities between July 2005 and April 2021, a period during which the real estate market saw rapid growth. This is, to the best of our knowledge, the first forecast study that employs Gaussian process regressions to examine residential real estate price indices in the Chinese market. In past research, the Gaussian process regression approach was frequently used to investigate Boston housing-related issues from the standpoint of property valuation. Given the prominence of the residential real estate market in China, little additional motivation for this focus should be required. Forecasts of residential real estate price indices should thus be crucial and potentially difficult for investors and policymakers, since having a thorough grasp of pricing trends may benefit decision making. There are several forecasting methodologies, including econometric and machine learning-based models. The Gaussian process regression technique is chosen because of the non-linear patterns presented by the residential real estate price indices under consideration here, as well as the literature’s recognised value and promise for real estate price forecasting. To train the Gaussian process regression, a Bayesian optimisation strategy using a range of basis functions and kernels, as well as the cross validation technique, will be employed. Because most earlier studies focused on a single location when evaluating other types of real estate assets, our findings may contribute to a better understanding of applying machine learning technology to predicting residential real estate prices in the constantly expanding Chinese market. The coverage of these cities in the current work, combined with the availability of data, should represent an economically natural way to explore the forecast problem for residential real estate, given that demand and supply are most active in these major cities and that each one may have distinct price characteristics that are worth examining. Because prior research in this area has typically concentrated on older time periods when researching other forms of real estate, such as 1Q1981–4Q2002, 1Q2000–3Q2010, 7M2013–12M2013, 6M1996–8M2014, 1M2013–2M2017, 1M2010–7M2017, 12M2010–10M2017, 1M2011–12M2017, and 1M2005–11M2018, our findings might potentially provide a more contemporary perspective on the efficacy of machine learning approaches for real estate price index estimates for the Chinese market. The current condition of uncertainty in the residential real estate market may cause price behaviour to display increased non-linearities. This study adds to the body of literature on the usefulness of the Gaussian process regression for real estate price index forecasts in dynamic environments by building Gaussian process regression models based on a more recent time period, from July 2005 to April 2021. This study also provides policymakers and investors with timely forecast tools for potential use. For many less sophisticated forecast users, machine learning methods may appear more challenging than econometric models. As a result, to assist technical predictions, we develop comparatively simple yet accurate Gaussian process regressions. Given that machine learning models, like econometric models, have the potential for overfitting or underfitting, our model development techniques have taken into account a trade-off between prediction accuracy and stability.
We specifically undertake out-of-sample projections from May 2019 to April 2021 and obtain relative root mean square errors for the ten price indices that range from 0.0207% to 0.2818%. Our findings might be used individually or in combination with other forecasts to formulate theories about the trends in the residential real estate price index and carry out additional policy analysis.

    Data

    The data used for this study come from the China Real Estate Index System (CREIS). It is an analytical tool created to reflect the status of the real estate markets and growth trends in the major Chinese cities. The platform was created in 1994 by the Real Estate Association, the Development Research Center of the State Council, and the National Real Estate Development Group Corporation. In 1995 and 2005, CREIS was audited by specialists from the Ministry of Land and Resources, the Ministry of Construction, the Real Estate Association, the Development Research Center of the State Council, the Banking Regulatory Commission, as well as various universities. Currently, CREIS publishes a number of real estate price indices on a monthly basis, including rental price indices, residential real estate sales price indices for both existing homes and newly constructed homes, price indices for villas, retail real estate price indices, and office price indices, among many others. The system has developed into a platform covering the majority of the Chinese real estate markets. In this study, we focus on investigating forecasting issues using residential real estate price indices.

    Residential real estate price indices collected from CREIS cover the following ten significant Chinese cities: Wuhan, Chengdu, Hangzhou, Nanjing, Shenzhen, Guangzhou, Tianjing, Chongqing, Beijing, and Shanghai. According to CREIS, it uses phone surveys, field surveys, and web surveys to collect data. The samples used to produce the price index comprise all residential real estate in a certain city that is available for sale in a specific month. The base-period price index is based on the price of Beijing’s residential real estate in 12M2000, with an index value of 1,000. By normalising against this base-period price, residential real estate price indices for other locations and months are created.

    According to CREIS, its residential real estate price index for a certain city is calculated as follows: $I_t = \frac{\sum_i P_t^i A_{t-1}^i}{\sum_i P_{t-1}^i A_{t-1}^i}\, I_{t-1}$, where $I_t$ and $I_{t-1}$ indicate price indices related to time $t$ and time $t-1$, respectively, $A_{t-1}^i$ indicates Project $i$'s total area of construction related to time $t-1$, and $P_t^i$ and $P_{t-1}^i$ indicate average prices of Project $i$'s residential real estate related to time $t$ and time $t-1$, respectively. It is important to highlight that the residential real estate price indices of the ten cities are the only data that have been evaluated in the current investigation. We do not have access to any further CREIS platform data. The time period covered by the monthly data in this analysis is 7M2005–4M2021. Figure 1 shows the visualisations of the ten price indices, their first differences, distributional plots using histograms with kernel estimations, and quantile–quantile plots, using Beijing, Shanghai, Shenzhen, and Guangzhou as examples. The ten price indices and their first differences are summarised in Table 1. None of the ten price indices, based on the p-values of the Anderson–Darling and Kolmogorov–Smirnov tests provided in Table 1, follows a normal distribution at the 1% significance level. None of the ten price indices, based on the p-values of the Jarque–Bera test shown in Table 1, follows a normal distribution at the 5% significance level. For many forms of financial and economic time-series data, non-normality is probably not shocking (Jin and Xu, 2024d; Jin et al., 2024). The price indices of Beijing, Tianjing, Chongqing, Hangzhou, Wuhan, and Chengdu are left-skewed, and the price indices of Shanghai, Shenzhen, Guangzhou, and Nanjing are right-skewed. Platykurtic price indices are determined for all of the ten cities.
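    As a concrete illustration of the chained index construction described above, the following Python sketch computes such an index from hypothetical project-level prices and floor areas; the function name, variable names, and figures are purely illustrative and are not CREIS data.

```python
import numpy as np

def chained_price_index(prices, areas, base_index=1000.0):
    """Chain I_t = (sum_i P_t^i A_{t-1}^i) / (sum_i P_{t-1}^i A_{t-1}^i) * I_{t-1}.

    prices: (T, n_projects) array of average prices per project and month.
    areas:  (T, n_projects) array of total construction areas per project and month.
    """
    T = prices.shape[0]
    index = np.empty(T)
    index[0] = base_index
    for t in range(1, T):
        numer = np.sum(prices[t] * areas[t - 1])
        denom = np.sum(prices[t - 1] * areas[t - 1])
        index[t] = numer / denom * index[t - 1]
    return index

# Illustrative example with two hypothetical projects over four months.
prices = np.array([[10000., 12000.], [10200., 12100.], [10500., 12300.], [10400., 12600.]])
areas = np.array([[50000., 30000.], [52000., 31000.], [51000., 33000.], [50500., 32000.]])
print(chained_price_index(prices, areas))
```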


    Fig. 1. Residential real estate price indices of ten major cities in China during 7M2005–4M2021.

    Table 1. Summary statistics of residential real estate price indices of ten major cities in China during 7M2005–4M2021.

    City | Series | Minimum | Mean | Median | Standard deviation | Maximum | Skewness | Kurtosis | Jarque–Bera (p-value) | Anderson–Darling (p-value) | Kolmogorov–Smirnov (p-value)
    Beijing | Price | 1,227 | 3329.126 | 3478.5 | 1075.114 | 4,565 | −0.437 | 1.943 | 0.006 | <0.001 | <0.001
    Beijing | First difference | −47 | 17.550 | 7.0 | 29.365 | 121 | 1.193 | 4.294 | <0.001 | <0.001 | <0.001
    Shanghai | Price | 1,567 | 2659.653 | 2509.5 | 630.990 | 3,590 | 0.113 | 1.828 | 0.011 | <0.001 | <0.001
    Shanghai | First difference | −45 | 10.624 | 6.0 | 21.173 | 102 | 1.631 | 6.957 | <0.001 | <0.001 | <0.001
    Tianjing | Price | 909 | 1653.221 | 1660.0 | 302.764 | 2,039 | −0.581 | 2.738 | 0.011 | <0.001 | <0.001
    Tianjing | First difference | −70 | 5.635 | 2.0 | 15.963 | 77 | 0.714 | 7.911 | <0.001 | <0.001 | <0.001
    Chongqing | Price | 580 | 911.642 | 938.5 | 154.988 | 1,126 | −0.592 | 2.408 | 0.007 | <0.001 | <0.001
    Chongqing | First difference | −27 | 2.741 | 2.0 | 10.430 | 43 | 0.541 | 5.730 | <0.001 | <0.001 | <0.001
    Shenzhen | Price | 1,176 | 3405.211 | 3096.5 | 1185.993 | 4,966 | 0.050 | 1.725 | 0.008 | <0.001 | <0.001
    Shenzhen | First difference | −25 | 19.963 | 7.0 | 39.994 | 225 | 2.561 | 10.861 | <0.001 | <0.001 | <0.001
    Guangzhou | Price | 1,068 | 2261.089 | 2300.5 | 654.347 | 3,254 | 0.022 | 1.928 | 0.019 | <0.001 | <0.001
    Guangzhou | First difference | −38 | 11.344 | 8.0 | 21.305 | 77 | 0.720 | 3.616 | 0.003 | <0.001 | <0.001
    Hangzhou | Price | 1,206 | 1913.342 | 1913.0 | 373.503 | 2,488 | −0.163 | 2.131 | 0.034 | <0.001 | <0.001
    Hangzhou | First difference | −54 | 6.783 | 3.0 | 18.192 | 108 | 1.422 | 9.539 | <0.001 | <0.001 | <0.001
    Nanjing | Price | 706 | 1336.289 | 1289.0 | 337.566 | 1,828 | 0.007 | 1.841 | 0.013 | <0.001 | <0.001
    Nanjing | First difference | −103 | 5.180 | 3.0 | 13.812 | 59 | −1.706 | 23.102 | <0.001 | <0.001 | <0.001
    Wuhan | Price | 539 | 1164.584 | 1154.0 | 335.519 | 1,630 | −0.143 | 1.941 | 0.017 | <0.001 | <0.001
    Wuhan | First difference | −15 | 5.751 | 3.0 | 8.961 | 43 | 1.219 | 4.803 | <0.001 | <0.001 | <0.001
    Chengdu | Price | 611 | 942.289 | 959.0 | 152.381 | 1,181 | −0.216 | 2.148 | 0.030 | <0.001 | <0.001
    Chengdu | First difference | −50 | 2.434 | 2.0 | 9.059 | 34 | −0.560 | 8.965 | <0.001 | <0.001 | <0.001

    In the fields of finance and economics, manifestations of non-linear features at higher moments have already been extensively reported across a wide range of time-series data (Yang et al., 2008; Jin and Xu, 2024g). We apply the Brock–Dechert–Scheinkman (BDS) (Brock et al., 1996) test to examine the ten residential real estate price time-series data for any potential non-linear trends. We implement the BDS test based upon 2–10 as the embedding dimensions and 0.5, 1.0, 1.5, 2.0, 2.5, and 3.0 times a particular price index time series' standard deviation as the ϵ distances utilised for assessing the proximity of different data points. We determine that the resultant p-values of the tests are all virtually zero. These findings suggest that each of the ten price indices has non-linearities. Given these facts, this work seeks to predict the ten non-normal and non-linear residential real estate price indices using Gaussian process regressions.
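    A minimal sketch of how such a BDS screening could be run in Python is shown below, assuming the statsmodels implementation of the test is available; the file name and column name are placeholders rather than the actual CREIS data files.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import bds  # assumed available in statsmodels

# Placeholder: a single price index series; replace with the actual data source.
series = pd.read_csv("beijing_price_index.csv")["price_index"].to_numpy()

# Embedding dimensions 2-10; epsilon set to 0.5-3.0 times the sample standard deviation.
for multiplier in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0):
    stats, pvalues = bds(series, max_dim=10, distance=multiplier)
    print(f"epsilon = {multiplier} * sd, p-values for dims 2-10:", np.round(pvalues, 4))
```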

    Method

    The Gaussian process regression, a kind of probabilistic kernel model that has been demonstrated to be effective at forecasting a variety of non-linear patterns in a number of scientific disciplines (Jin and Xu, 2024c,f), is the forecasting technique being examined in this study. For the illustration of the model, the training data with an unknown distribution are indicated by $\{(x_i, y_i);\ i = 1, 2, \ldots, T\}$, the $d$-dimensional predictors are indicated by $x_i \in \mathbb{R}^d$, and the target is indicated by $y_i$. The real estate price indices for each city are projected using 20 lagged price indices as predictors. For example, to forecast the price index for the 21st month, the 20 price indices from the previous 20 consecutive months will be used as predictors.
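    As an illustration of this lagged-predictor design, the sketch below builds a 20-lag feature matrix and the corresponding targets from a single monthly price index series; the function name build_lagged_features is our own and not from the paper.

```python
import numpy as np

def build_lagged_features(series, n_lags=20):
    """Return X (T-n_lags, n_lags) of lagged values and y (T-n_lags,) of targets.

    Row i of X holds the n_lags observations preceding y[i], so the price index of
    month t is predicted from the indices of months t-20, ..., t-1.
    """
    series = np.asarray(series, dtype=float)
    X = np.column_stack([series[lag:len(series) - n_lags + lag] for lag in range(n_lags)])
    y = series[n_lags:]
    return X, y

# Illustrative usage with a synthetic series of 190 monthly observations.
rng = np.random.default_rng(0)
index = 1000 + np.cumsum(rng.normal(5, 10, size=190))
X, y = build_lagged_features(index, n_lags=20)
print(X.shape, y.shape)  # (170, 20) (170,)
```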

    Let $y = x^T\beta + \varepsilon$ indicate a linear regression, where $\varepsilon \sim N(0, \sigma^2)$ indicates the error term. Contrarily, Gaussian process regressions use explicit basis functions and latent variables to define the target variable (Jin and Xu, 2024e). The basis function might be expressed using $b$, and the latent variables from the Gaussian process might be expressed using $l(x_i)$ such that they meet the requirement of the joint Gaussian distribution. The covariance function of latent variables will represent the target's smoothness, and the basis function's goal is to project various predictors onto the space of features (Jin and Xu, 2024j,k).

    The covariance and mean are two metrics that are frequently used to define a Gaussian process (GP). We are going to express the mean using $m(x) = E(l(x))$ and the covariance using $k(x, x') = \mathrm{Cov}[l(x), l(x')]$. Then, we are going to express the Gaussian process regression using $y = b(x)^T\beta + l(x)$, where $l(x) \sim \mathrm{GP}(0, k(x, x'))$ and $b(x) \in \mathbb{R}^p$. Via $\theta$, a hyper-parameter, we are going to parameterise $k(x, x')$ using $k(x, x'\,|\,\theta)$. When using a specific technique to train a Gaussian process regression, the following variables will normally be estimated: $\sigma^2$, $\theta$, and $\beta$. We are also going to define kernels, expressed as $k$'s, and basis functions, expressed as $b$'s, to be adopted for model training.

    The two types of kernels considered in the current study are isotropic kernels and non-isotropic kernels (automatic relevance determination kernels). Both isotropic and non-isotropic kernels are examined using five distinct kernels. Equations (A.1)–(A.10) shown in the appendix offer specifications for all kernels under consideration. $\sigma_l$ is utilised to indicate isotropic kernels' characteristic length scale, $\alpha > 0$ is utilised to indicate the scale-mixture parameter, $\sigma_f$ is utilised to indicate the standard deviation of the signal, and $r = \sqrt{(x_i - x_j)^T(x_i - x_j)}$. The positiveness of $\sigma_l$ and $\sigma_f$ is going to be achieved with $\theta = (\theta_1, \theta_2) = (\log\sigma_l, \log\sigma_f)$. Each predictor is going to have a distinct length scale for non-isotropic kernels, which is expressed as $\sigma_m$ ($m = 1, 2, \ldots, d$). Correspondingly, $\theta$ is expressed as $\theta = (\theta_1, \theta_2, \ldots, \theta_d, \theta_{d+1}) = (\log\sigma_1, \log\sigma_2, \ldots, \log\sigma_d, \log\sigma_f)$.
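    For readers who prefer code, a minimal sketch of a Gaussian process regression with one of the isotropic kernels is given below using scikit-learn, whose RationalQuadratic, Matern, and white-noise kernels broadly correspond to the specifications in the appendix; scikit-learn assumes a zero-mean prior rather than explicit basis functions, so this is an approximation of the setup described here, not a replication of it.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RationalQuadratic, ConstantKernel, WhiteKernel

# Isotropic rational quadratic kernel with a signal variance term (ConstantKernel)
# and additive observation noise (WhiteKernel); length_scale plays the role of
# sigma_l and alpha the role of the scale-mixture parameter.
kernel = ConstantKernel(1.0) * RationalQuadratic(length_scale=10.0, alpha=0.01) + WhiteKernel(1.0)

gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)

# X, y would be the 20-lag predictor matrix and targets built earlier (placeholders here).
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 20))
y = X @ rng.normal(size=20) + rng.normal(scale=0.1, size=150)
gpr.fit(X, y)
mean, std = gpr.predict(X[:5], return_std=True)  # posterior mean and standard deviation
```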

    In a manner similar to how different kernels are considered, this work takes into account four different basis functions, which are detailed in Eqs. (A.11)–(A.14) in the appendix. In these equations,

    $X = (x_1, x_2, \ldots, x_n)'$, $\quad X^2 = \begin{pmatrix} x_{11}^2 & x_{12}^2 & \cdots & x_{1d}^2 \\ x_{21}^2 & x_{22}^2 & \cdots & x_{2d}^2 \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1}^2 & x_{n2}^2 & \cdots & x_{nd}^2 \end{pmatrix}$, $\quad$ and $\quad B = (b(x_1), b(x_2), \ldots, b(x_n))'$.

    Ten-fold cross validation and Bayesian optimisation, which is based on the expected improvement per second plus (EIPSP) technique, are used to estimate the model parameters. Let a GP model be expressed by $f(x)$. The Bayesian approach evaluates corresponding $y_i = f(x_i)$ by selecting $N_s$ randomly chosen data points of $x_i$'s inside variable boundaries. Here, the number of data points utilised for preliminary judgments is $N_s = 4$. The algorithm will keep gathering data after encountering evaluation faults until it reaches $N_s$ successful evaluation cases. The algorithm's first and second steps are then repeated, as shown below. The updating of $f(x)$ will be the first stage in producing the posterior distribution over $Q(f \mid x_i, y_i \ \text{for}\ i = 1, \ldots, T)$. The second step will include choosing a new data point ($x$) in order to determine the acquisition function's ($a(x)$) minimisation goal. A maximum of 100 iterations will be used. The purpose of using $a(x)$ is to evaluate $x$'s goodness in regards to $Q$. Instead of evaluating values that would elevate the objective function, expected improvement acquisition functions evaluate expected amounts of improvements to the objective function. We are going to let $\mu_Q(x_{\mathrm{best}})$ express the corresponding numerical value of the lowest mean and $x_{\mathrm{best}}$ express the data point at which the lowest posterior mean is reached. We can express the expected improvement (EI) using $\mathrm{EI}(x, Q) = E_Q[\max(0, \mu_Q(x_{\mathrm{best}}) - f(x))]$. The Bayesian strategy can offer higher advantages per unit of time by using the time-weighting scheme on the acquisition function since the amount of time required to assess the objective may vary depending on the location. Throughout optimisation processes, it is possible to keep a further Bayesian model of the amount of time needed to evaluate the objective as a function of $x$. In light of this, we can express the acquisition function's EI per second (EIPS) as $\mathrm{EIPS}(x) = \frac{\mathrm{EI}_Q(x)}{\mu_S(x)}$, where $\mu_S(x)$ expresses the posterior mean associated with this extra timing GP model. To avoid the acquisition function overutilising a specific area and preventing a local minimum of the objective, the following changes to its behaviour might be made. Let $\sigma_F(x)$ express the posterior objective's standard deviation corresponding to $x$ and $\sigma_{PN}$ express the additive noises' posterior standard deviation meeting the requirement of $\sigma_Q^2(x) = \sigma_F^2(x) + \sigma_{PN}^2$. We are going to express the exploration ratio using $t_{\sigma} > 0$. After each iteration, the acquisition function based on the EIPSP algorithm determines if the following data point, $x$, meets the requirement of $\sigma_F(x) < t_{\sigma}\sigma_{PN}$. The kernel function will be modified if this criterion is met by multiplying $\theta$ by the number of iterations, with $x$ being regarded as overexploiting (Bull, 2011). In essence, the EIPSP technique adjustment raises $\sigma_Q$ for data points between observations. A new data point will then be produced using the newly fitted kernel. If it turns out that the new data point is similarly overexploiting, $\theta$ will be increased by ten times in subsequent trials. This strategy will be limited to five times in order to get a data point, $x$, that is not considered overexploited. The modified $x$ will then be accepted as the next exploration point by the EIPSP algorithm. To obtain a more accurate overall response, the algorithm strikes a balance between focusing on previously investigated nearby data points and looking at new data points.
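    Under a Gaussian posterior, the expected improvement above has a closed form; the short sketch below evaluates it for candidate points, and is a generic illustration of the acquisition function rather than the EIPSP implementation used in the paper (which additionally weights by evaluation time and adjusts the kernel against overexploitation).

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, mu_best):
    """EI(x) = E[max(0, mu_best - f(x))] for a minimisation problem,
    where f(x) ~ N(mu, sigma^2) under the posterior Q."""
    sigma = np.maximum(sigma, 1e-12)           # guard against zero posterior variance
    z = (mu_best - mu) / sigma
    return (mu_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# Illustrative posterior summaries at three candidate hyper-parameter settings.
mu = np.array([0.30, 0.25, 0.40])      # posterior means of the objective (e.g. CV loss)
sigma = np.array([0.05, 0.10, 0.02])   # posterior standard deviations
print(expected_improvement(mu, sigma, mu_best=0.28))
```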

    Bayesian optimisation procedures will be carried out over $\sigma$, basis functions, kernels, and whether or not predictors are standardised. Forecast performance is going to be determined according to the relative root mean square error (RRMSE), which allows for comparisons of different prediction outcomes across different models or targets (Li et al., 2013). The RRMSE can be expressed as $\mathrm{RRMSE} = \frac{\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i^{\mathrm{obs}} - y_i^{\mathrm{for}})^2}}{\frac{1}{n}\sum_{i=1}^{n} y_i^{\mathrm{obs}}}$, where $y^{\mathrm{for}}$ expresses the target's predicted numerical value, $n$ expresses the number of observations utilised for performance assessments, and $y^{\mathrm{obs}}$ expresses the target variable's observed numerical value. Two additional performance metrics are adopted to assess prediction accuracy: the mean absolute error (MAE) and root mean square error (RMSE), whose units are identical to the target variable and whose magnitudes are related to the target variable. The RMSE can be expressed as $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i^{\mathrm{obs}} - y_i^{\mathrm{for}})^2}$. The MAE can be expressed as $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i^{\mathrm{obs}} - y_i^{\mathrm{for}}|$. Finally, we also consider the correlation coefficient (CC) for measuring performance, which can be expressed as $\mathrm{CC} = \frac{\sum_{i=1}^{n}(y_i^{\mathrm{obs}} - \overline{y^{\mathrm{obs}}})(y_i^{\mathrm{for}} - \overline{y^{\mathrm{for}}})}{\sqrt{\sum_{i=1}^{n}(y_i^{\mathrm{obs}} - \overline{y^{\mathrm{obs}}})^2}\sqrt{\sum_{i=1}^{n}(y_i^{\mathrm{for}} - \overline{y^{\mathrm{for}}})^2}}$, where $\overline{y^{\mathrm{obs}}}$ and $\overline{y^{\mathrm{for}}}$ stand for averages.
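    The four accuracy metrics are straightforward to compute; a small sketch follows, with a function name of our own choosing.

```python
import numpy as np

def forecast_metrics(y_obs, y_for):
    """Return RRMSE, RMSE, MAE, and the correlation coefficient for observed vs. forecast values."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_for = np.asarray(y_for, dtype=float)
    rmse = np.sqrt(np.mean((y_obs - y_for) ** 2))
    rrmse = rmse / np.mean(y_obs)                     # relative RMSE, scale-free
    mae = np.mean(np.abs(y_obs - y_for))
    cc = np.corrcoef(y_obs, y_for)[0, 1]              # Pearson correlation coefficient
    return {"RRMSE": rrmse, "RMSE": rmse, "MAE": mae, "CC": cc}

# Illustrative usage with made-up observed and forecast index values.
print(forecast_metrics([3400, 3420, 3435], [3398, 3425, 3430]))
```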

    Result

    For each city, data from its residential real estate price indices are utilised for model training from 7M2005 to 4M2019, and for model performance testing for one-month ahead forecasts from 5M2019 to 4M2021. Figure 2 shows the outcomes of EIPSP optimisations based on training data for all price indices. These results indicate that (a) the isotropic rational quadratic kernel (Eq. (A.4)), empty basis function (Eq. (A.11)) and standardised predictors are chosen for Beijing, (b) the isotropic rational quadratic kernel (Eq. (A.4)), empty basis function (Eq. (A.11)), and non-standardised predictors are chosen for Shanghai, (c) the isotropic exponential kernel (Eq. (A.1)), linear basis function (Eq. (A.13)), and standardised predictors are chosen for Tianjing, (d) the isotropic rational quadratic kernel (Eq. (A.4)), constant basis function (Eq. (A.12)), and non-standardised predictors are chosen for Chongqing, (e) the isotropic Matern 3/2 kernel (Eq. (A.5)), constant basis function (Eq. (A.12)), and standardised predictors are chosen for Shenzhen, (f) the isotropic exponential kernel (Eq. (A.1)), empty basis function (Eq. (A.11)), and standardised predictors are chosen for Guangzhou, (g) the isotropic exponential kernel (Eq. (A.1)), constant basis function (Eq. (A.12)), and standardised predictors are chosen for Hangzhou, (h) the isotropic rational quadratic kernel (Eq. (A.4)), empty basis function (Eq. (A.11)), and standardised predictors are chosen for Nanjing, (i) the isotropic exponential kernel (Eq. (A.1)), empty basis function (Eq. (A.11)), and standardised predictors are chosen for Wuhan, and (j) the isotropic exponential kernel (Eq. (A.1)), empty basis function (Eq. (A.11)), and standardised predictors are chosen for Chengdu. For the ten GPR models created using the ten-fold cross validation for the residential real estate price index of each city, the results of parameter estimates are shown in Table 2. The initials ‘CV1’, ‘CV2’, …, and ‘CV10’ are used to denote these parameter estimations, where ‘CV’ stands for ‘cross validation’.
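    The selected configurations could be organised, for example, as a simple mapping from city to kernel, basis function, and standardisation choice before model fitting; the sketch below is an illustrative way to encode the choices listed above (using scikit-learn kernel classes as stand-ins for the isotropic kernels, with the exponential kernel corresponding to Matern with nu = 0.5), not code from the paper.

```python
from sklearn.gaussian_process.kernels import Matern, RationalQuadratic

# Kernel, basis function, and predictor-standardisation choices selected by the
# Bayesian optimisation for each city (basis functions would need to be handled
# separately, e.g. by regressing out a constant or linear trend before the GP fit).
city_configs = {
    "Beijing":   {"kernel": RationalQuadratic(), "basis": "empty",    "standardise": True},
    "Shanghai":  {"kernel": RationalQuadratic(), "basis": "empty",    "standardise": False},
    "Tianjing":  {"kernel": Matern(nu=0.5),      "basis": "linear",   "standardise": True},
    "Chongqing": {"kernel": RationalQuadratic(), "basis": "constant", "standardise": False},
    "Shenzhen":  {"kernel": Matern(nu=1.5),      "basis": "constant", "standardise": True},
    "Guangzhou": {"kernel": Matern(nu=0.5),      "basis": "empty",    "standardise": True},
    "Hangzhou":  {"kernel": Matern(nu=0.5),      "basis": "constant", "standardise": True},
    "Nanjing":   {"kernel": RationalQuadratic(), "basis": "empty",    "standardise": True},
    "Wuhan":     {"kernel": Matern(nu=0.5),      "basis": "empty",    "standardise": True},
    "Chengdu":   {"kernel": Matern(nu=0.5),      "basis": "empty",    "standardise": True},
}
print(city_configs["Beijing"])
```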


    Fig. 2. Optimisation processes based upon the EIPSP algorithm for monthly residential real estate price indices.

    Table 2. Parameter estimates of ten GPR models for the residential real estate price index of each city.

    Parameter | CV1 | CV2 | CV3 | CV4 | CV5 | CV6 | CV7 | CV8 | CV9 | CV10
    Beijing: Isotropic rational quadratic kernel, empty basis function, and standardised predictors
    σ | 8.713 | 8.947 | 8.998 | 8.697 | 9.072 | 8.826 | 8.715 | 8.919 | 8.974 | 8.692
    σl | 14.406 | 14.249 | 14.125 | 14.492 | 14.222 | 14.488 | 14.374 | 14.005 | 14.627 | 14.176
    α | 0.004 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.004
    σf | 3394.053 | 3376.831 | 3359.211 | 3380.081 | 3370.464 | 3362.822 | 3364.386 | 3379.430 | 3381.626 | 3371.865
    Shanghai: Isotropic rational quadratic kernel, empty basis function, and non-standardised predictors
    σ | 5.736 | 5.531 | 7.232 | 5.695 | 5.661 | 7.164 | 5.572 | 7.152 | 7.363 | 5.589
    σl | 9671.714 | 10403.556 | 9929.080 | 9429.178 | 9643.957 | 10045.575 | 9462.530 | 9846.397 | 10026.156 | 10190.834
    α | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.003 | 0.002 | 0.002 | 0.002 | 0.002
    σf | 2855.151 | 2850.939 | 2849.535 | 2855.446 | 2841.516 | 2842.975 | 2854.876 | 2852.664 | 2849.486 | 2858.678
    Tianjing: Isotropic exponential kernel, linear basis function, and standardised predictors
    σ | 2.270 | 2.299 | 2.312 | 2.332 | 2.282 | 2.344 | 2.261 | 2.258 | 2.324 | 2.283
    β0 | 1717.865 | 1727.215 | 1728.874 | 1715.903 | 1727.174 | 1726.565 | 1732.113 | 1725.618 | 1726.524 | 1730.831
    β1 | −23.855 | −19.491 | −22.554 | −21.757 | −28.676 | −25.240 | −17.130 | −22.339 | −4.123 | −25.719
    β2 | −5.173 | 20.536 | 6.709 | −0.360 | 1.963 | 3.967 | −4.806 | 4.840 | −32.665 | 9.578
    β3 | 61.436 | 25.971 | 54.152 | 57.942 | 61.270 | 55.223 | 48.676 | 56.648 | 68.301 | 50.558
    β4 | −28.691 | −24.983 | −31.967 | −37.638 | −31.608 | −27.420 | −38.024 | −42.840 | −17.456 | −16.017
    β5 | 8.072 | 17.076 | 2.928 | 16.652 | 1.471 | 13.791 | 27.455 | 1.943 | −3.757 | −50.195
    β6 | 11.769 | 8.668 | 24.403 | 22.846 | 35.550 | 18.028 | 25.853 | 30.516 | 19.727 | 63.047
    β7 | −32.785 | −26.763 | −37.517 | −50.651 | −48.305 | −30.389 | −44.989 | −22.315 | −37.739 | −25.656
    β8 | −2.616 | 26.624 | 6.064 | 15.176 | 11.833 | −4.324 | −6.950 | 2.218 | −5.043 | −14.760
    β9 | 48.837 | 34.885 | 26.952 | 33.456 | 23.624 | −10.683 | 36.212 | 16.553 | 30.067 | 28.475
    β10 | −16.779 | −82.903 | −9.891 | −12.818 | −12.622 | 28.260 | −14.746 | −1.969 | 20.852 | 17.430
    β11 | −40.011 | 7.574 | −44.541 | −41.274 | −31.914 | −33.874 | −19.516 | −38.922 | −56.235 | −56.915
    β12 | 14.900 | −7.082 | 14.207 | 6.809 | 15.505 | −1.397 | 2.261 | −5.796 | 13.798 | 38.316
    β13 | −36.517 | −3.004 | −45.249 | −29.111 | −49.212 | −29.749 | −36.862 | −19.959 | −33.266 | −83.117
    β14 | 59.751 | 30.071 | 81.973 | 55.822 | 69.940 | 46.794 | 61.953 | 55.091 | 66.448 | 65.669
    β15 | −50.921 | −26.616 | −47.322 | −37.636 | −47.118 | −34.808 | −42.435 | −45.082 | −62.480 | 17.268
    β16 | 45.260 | 22.869 | 21.828 | 20.794 | 38.321 | 43.650 | 20.278 | 48.551 | 38.952 | −16.148
    β17 | −54.270 | −32.550 | −43.782 | −39.035 | −58.090 | −43.307 | −35.116 | −62.919 | −43.149 | −65.776
    β18 | 33.010 | 40.456 | 46.602 | 31.022 | 35.392 | 32.274 | 15.469 | 37.788 | 21.900 | 45.796
    β19 | −30.932 | −5.555 | 4.554 | 10.706 | 5.001 | −32.783 | 22.810 | 5.460 | −17.691 | −17.220
    β20 | 265.865 | 225.342 | 223.488 | 231.818 | 235.252 | 266.019 | 225.041 | 227.571 | 265.693 | 263.809
    σl | 0.233 | 0.507 | 0.398 | 0.392 | 0.320 | 0.390 | 0.336 | 0.401 | 0.266 | 0.182
    σf | 11.831 | 13.142 | 13.437 | 13.007 | 12.424 | 11.762 | 12.165 | 13.196 | 11.459 | 9.576
    Chongqing: Isotropic rational quadratic kernel, constant basis function, and non-standardised predictors
    σ | 4.140 | 3.910 | 3.473 | 3.950 | 4.206 | 3.835 | 3.685 | 3.582 | 4.019 | 3.635
    β | 877.736 | 880.853 | 886.232 | 886.659 | 882.419 | 885.206 | 894.449 | 882.460 | 880.916 | 883.589
    σl | 1079.054 | 948.102 | 956.706 | 1061.836 | 1107.167 | 918.901 | 909.187 | 1048.151 | 1037.970 | 953.585
    α | 0.086 | 0.078 | 0.063 | 0.072 | 0.086 | 0.071 | 0.076 | 0.075 | 0.078 | 0.078
    σf | 269.608 | 243.520 | 238.179 | 260.364 | 273.387 | 232.212 | 229.531 | 258.342 | 259.731 | 243.181
    Shenzhen: Isotropic Matern 3/2 kernel, constant basis function, and standardised predictors
    σ | 10.590 | 10.418 | 10.458 | 10.561 | 10.610 | 10.575 | 10.429 | 10.585 | 10.571 | 10.556
    β | 3405.083 | 3446.051 | 3449.565 | 3447.997 | 3465.798 | 3469.546 | 3461.686 | 3450.087 | 3465.318 | 3457.192
    σl | 7.530 | 4.524 | 4.453 | 4.501 | 4.141 | 4.188 | 4.226 | 4.972 | 4.249 | 4.151
    σf | 1088.517 | 735.484 | 710.153 | 710.102 | 684.236 | 690.440 | 682.685 | 745.662 | 679.481 | 686.742
    Guangzhou: Isotropic exponential kernel, empty basis function, and standardised predictors
    σ | 5.707 | 5.785 | 5.717 | 5.698 | 5.642 | 5.764 | 5.622 | 5.716 | 5.692 | 5.738
    σl | 2212.542 | 2171.164 | 2177.475 | 2182.294 | 2382.944 | 2191.010 | 2220.762 | 2183.514 | 2207.534 | 2156.864
    σf | 2517.003 | 2516.762 | 2516.039 | 2515.704 | 2538.658 | 2517.833 | 2509.334 | 2516.505 | 2518.143 | 2516.713
    Hangzhou: Isotropic exponential kernel, constant basis function, and standardised predictors
    σ | 3.081 | 3.158 | 3.169 | 3.158 | 3.121 | 3.205 | 3.090 | 3.070 | 3.158 | 3.142
    β | 1918.529 | 1919.895 | 1921.062 | 1921.204 | 1920.555 | 1920.507 | 1934.084 | 1919.661 | 1921.003 | 1921.275
    σl | 354.900 | 350.720 | 351.694 | 353.756 | 362.072 | 343.792 | 355.115 | 362.508 | 354.651 | 371.115
    σf | 676.403 | 677.372 | 675.998 | 678.294 | 676.735 | 676.884 | 664.110 | 676.710 | 678.894 | 680.808
    Nanjing: Isotropic rational quadratic kernel, empty basis function, and standardised predictors
    σ | 2.974 | 3.003 | 2.994 | 2.963 | 2.922 | 2.957 | 2.989 | 2.874 | 2.953 | 2.974
    σl | 18.620 | 18.121 | 18.124 | 18.587 | 18.364 | 19.057 | 18.121 | 18.658 | 18.126 | 18.214
    α | 0.003 | 0.003 | 0.002 | 0.003 | 0.002 | 0.002 | 0.002 | 0.002 | 0.003 | 0.003
    σf | 1429.918 | 1426.269 | 1426.653 | 1430.252 | 1432.148 | 1427.748 | 1428.890 | 1430.833 | 1429.406 | 1428.115
    Wuhan: Isotropic exponential kernel, empty basis function, and standardised predictors
    σ | 2.842 | 2.878 | 2.872 | 2.845 | 2.882 | 2.848 | 2.893 | 2.811 | 2.839 | 2.864
    σl | 2630.953 | 2601.939 | 2603.949 | 2643.151 | 2612.150 | 2659.777 | 2663.659 | 2793.757 | 2701.073 | 2606.324
    σf | 1263.865 | 1263.303 | 1263.888 | 1264.088 | 1263.440 | 1263.480 | 1264.242 | 1274.134 | 1264.117 | 1263.951
    Chengdu: Isotropic exponential kernel, empty basis function, and standardised predictors
    σ | 1.300 | 1.325 | 1.304 | 1.289 | 1.271 | 1.313 | 1.330 | 1.321 | 1.319 | 1.281
    σl | 4618.425 | 4517.889 | 4687.322 | 4623.863 | 4734.923 | 4675.811 | 4580.948 | 4520.558 | 4670.689 | 4759.040
    σf | 991.235 | 989.991 | 990.639 | 990.847 | 990.908 | 990.612 | 991.128 | 990.176 | 990.655 | 993.269

    Models ‘CV1’, ‘CV2’, …, and ‘CV10’, which are the ten GPR models created for the residential real estate price index of each city and presented in Table 2, are used to predict the numerical values of the price index for the testing time period of 5M2019 to 4M2021. Thus, for each month of the testing period, each price index has ten projected values. To obtain the final price index estimate for a particular month, the average of the ten projections is used. By smoothing out any potential idiosyncratic predictions created by a particular sub-model, this technique may help provide reliable and stable projections. The literature has discussed the desirable qualities and benefits of equal weighting. The projected and actual residential real estate price indices for each city are compared in Fig. 3. Figure 4 shows the percentage forecast errors corresponding to Fig. 3. It is clear that projected price indices typically follow observed price indices quite closely. Additional prediction performance data in terms of the RMSE, RRMSE, MAE, and CC are summarised in Table 3 for the results in Figs. 3 and 4. In particular, RRMSEs for the ten price indices range from 0.0207% to 0.2818%. Based upon previous research, model prediction accuracy might be rated at the excellent level if RRMSE < 10%, at the good level if 10% < RRMSE < 20%, at the fair level if 20% < RRMSE < 30%, and at the poor level if RRMSE ≥ 30%. According to these criteria, the GPR models constructed here have a high degree of prediction accuracy.
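    The equal-weight combination of the ten cross-validation sub-models can be written compactly; the sketch below averages the predictions of a list of fitted models and is a generic illustration with placeholder names, not the paper's code.

```python
import numpy as np

def ensemble_forecast(models, X_test):
    """Average the point forecasts of the CV sub-models (equal weighting)."""
    predictions = np.column_stack([model.predict(X_test) for model in models])
    return predictions.mean(axis=1)

# Usage (assuming `cv_models` holds ten fitted GaussianProcessRegressor objects and
# `X_test` the lagged predictors for the 24 testing months):
# final_forecast = ensemble_forecast(cv_models, X_test)
```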


    Fig. 3. The plot of forecasted vs. observed series for residential real estate price indices during the testing phase from 5M2019 to 4M2021.


    Fig. 4. The plot of percentage forecast errors for residential real estate price indices during the testing phase from 5M2019 to 4M2021.

    Table 3. Forecast performance of the GPR models for residential real estate price indices of ten cities during the testing phase from 5M2019 to 4M2021.

    City | Testing RRMSE | Testing RMSE | Testing MAE | Testing CC
    Beijing | 0.095% | 4.301 | 3.268 | 92.645%
    Shanghai | 0.087% | 3.079 | 2.392 | 99.135%
    Tianjing | 0.042% | 0.839 | 0.622 | 99.887%
    Chongqing | 0.282% | 3.129 | 2.085 | 93.804%
    Shenzhen | 0.096% | 4.733 | 3.604 | 88.225%
    Guangzhou | 0.061% | 1.944 | 1.467 | 99.899%
    Hangzhou | 0.021% | 0.506 | 0.427 | 99.981%
    Nanjing | 0.092% | 1.665 | 1.347 | 99.433%
    Wuhan | 0.036% | 0.579 | 0.432 | 99.833%
    Chengdu | 0.025% | 0.291 | 0.181 | 99.961%

    The findings of an error autocorrelation study conducted to evaluate the suitability of the built models are shown in Fig. 5. With a focus on normalised autocorrelations, the analysis is performed for up to 20 lags. These results show no obvious autocorrelations and thus confirm the overall validity of the models; significant autocorrelations would instead suggest room for improving forecast accuracy by further modelling them. It might also be important to note that, although empirical evidence may be conflicting, incorporating the autoregressive conditional heteroskedasticity effect into a prediction model may increase its performance.


    Fig. 5. Analysis of autocorrelations of errors based upon the GPR models for residential real estate price indices of ten cities.

    We benchmark the GPR models against the following models that use the same predictors as the GPR models: the support vector regression (SVR) model, regression tree (RT) model, and AR model. For the testing phase from 5M2019 to 4M2021, Table 4 displays comparisons of these models based on the RMSE, where it can be seen that the GPR models lead to the lowest RMSE for each city. We also perform the Diebold–Mariano (Diebold and Mariano, 2002) test for assessing significance of the difference between the GPR models and each of the benchmark models in terms of forecast accuracy. It turns out that the p-values are all below 0.01, suggesting that the GPR models lead to statistically significant better forecast performance than the benchmark models for the price index of each city.
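    A basic Diebold–Mariano comparison of two forecast error series can be implemented as follows; this sketch uses the squared-error loss and a normal approximation for the test statistic, and is a generic illustration rather than the exact procedure used in the paper.

```python
import numpy as np
from scipy.stats import norm

def diebold_mariano(errors_a, errors_b, h=1):
    """DM test on squared-error loss differentials; h is the forecast horizon."""
    d = np.asarray(errors_a) ** 2 - np.asarray(errors_b) ** 2
    n = len(d)
    d_mean = d.mean()
    # Long-run variance of the loss differential with h-1 autocovariance terms.
    gamma = [np.sum((d[k:] - d_mean) * (d[:n - k] - d_mean)) / n for k in range(h)]
    long_run_var = gamma[0] + 2 * sum(gamma[1:])
    dm_stat = d_mean / np.sqrt(long_run_var / n)
    p_value = 2 * (1 - norm.cdf(abs(dm_stat)))   # two-sided normal approximation
    return dm_stat, p_value

# Usage (assuming one-step-ahead forecast errors from the GPR and a benchmark model):
# stat, p = diebold_mariano(gpr_errors, ar_errors, h=1)
```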

    Table 4. Benchmark analysis: Comparisons of RMSEs for the testing phase from 5M2019 to 4M2021.

    City | GPR | SVR | RT | AR
    Beijing | 4.301 | 8.261 | 9.519 | 12.186
    Shanghai | 3.079 | 5.659 | 7.423 | 7.374
    Tianjing | 0.839 | 1.247 | 1.886 | 1.885
    Chongqing | 3.129 | 6.439 | 7.450 | 8.925
    Shenzhen | 4.733 | 5.935 | 7.655 | 13.392
    Guangzhou | 1.944 | 3.767 | 3.922 | 5.905
    Hangzhou | 0.506 | 1.013 | 0.869 | 1.517
    Nanjing | 1.665 | 2.391 | 2.935 | 4.605
    Wuhan | 0.579 | 1.073 | 0.947 | 1.314
    Chengdu | 0.291 | 0.549 | 0.569 | 0.814

    Implication

    For investors and governments, forecasts of residential real estate price indices are an important topic. Investors need real estate price forecasts for portfolio allocation and adjustments, strategic planning, and risk management. Real estate price forecasts are crucial to policymakers for market assessments, policy development, implementation, and modification, especially for preventing market overheating and stimulating the economy when necessary. To the best of the authors’ knowledge, forecasting and valuation methods employed by numerous investors, including those in the public sector, are often based on econometric techniques, particularly time-series methods where price indices are relevant. Additionally, professional judgments from experts are still employed. This has a reasonable basis because econometric methods and expert assessments are presumably relatively easy to develop, use, and maintain, have been widely adopted by many forecast users for many years, and many of them might be able to offer a respectable level of prediction accuracy. Some policymakers and investors may find it difficult to consider machine learning models because they still view such designs as overly complex forecasting tools. Nevertheless, these models are generally agreed to be worth investigating for their potential, especially given that computational capabilities are becoming more and more accessible and that price time-series data can realistically exhibit irregularities. Indeed, several decision-makers and savvy investors have recently expressed increasing interest in machine learning methods for forecasting real estate values. The research being done here continues the tradition of investigating the possibility of Gaussian process regressions to resolve forecasting problems for residential real estate price indices. With the approach provided here for developing such forecast models for the ten major Chinese cities and the demonstrated prediction accuracy and stability, the results suggest that machine learning techniques are well worth investigating, possibly for a greater diversity of real estate types and wider coverage of locations.

    Conclusion

    The topic of residential real estate price index estimates for ten major Chinese cities is the focus of the current study. By using the Gaussian process regression approach and monthly data from 7M2005 to 4M2021, we construct forecast models. When creating prediction models using Bayesian optimisations and cross validation, we pay particular attention to four basis functions, ten kernels, and two approaches to predictor standardisation. With relative root mean square errors ranging from 0.0207% to 0.2818% for the ten price indices over a two-year period from 5M2019 to 4M2021, we find that the built models produce solid out-of-sample projections. These forecast models might be used by market players and policymakers to enhance their understanding of the residential real estate industry. Future research may find it interesting to examine additional Bayesian optimisation techniques beyond the expected improvement per second plus algorithm taken into consideration here. Additionally, the forecasting process may be broadened to incorporate other cities and a variety of different real estate price indices.

    Appendix: Explored Kernels and Basis Functions

    In this appendix, we list all explored kernels in Eqs. (A.1)–(A.10) and basis functions in Eqs. (A.11)–(A.14):

    Isotropic Exponential: $k(x_i, x_j|\theta) = \sigma_f^2 \exp\left(-\frac{r}{\sigma_l}\right)$, (A.1)
    Isotropic Squared Exponential: $k(x_i, x_j|\theta) = \sigma_f^2 \exp\left(-\frac{1}{2}\frac{(x_i - x_j)^T(x_i - x_j)}{\sigma_l^2}\right)$, (A.2)
    Isotropic Matern 5/2: $k(x_i, x_j|\theta) = \sigma_f^2\left(1 + \frac{\sqrt{5}\,r}{\sigma_l} + \frac{5r^2}{3\sigma_l^2}\right)\exp\left(-\frac{\sqrt{5}\,r}{\sigma_l}\right)$, (A.3)
    Isotropic Rational Quadratic: $k(x_i, x_j|\theta) = \sigma_f^2\left(1 + \frac{r^2}{2\alpha\sigma_l^2}\right)^{-\alpha}$, (A.4)
    Isotropic Matern 3/2: $k(x_i, x_j|\theta) = \sigma_f^2\left(1 + \frac{\sqrt{3}\,r}{\sigma_l}\right)\exp\left(-\frac{\sqrt{3}\,r}{\sigma_l}\right)$, (A.5)
    Nonisotropic Exponential: $k(x_i, x_j|\theta) = \sigma_f^2 \exp\left(-\sqrt{\sum_{m=1}^{d}\frac{(x_{im} - x_{jm})^2}{\sigma_m^2}}\right)$, (A.6)
    Nonisotropic Squared Exponential: $k(x_i, x_j|\theta) = \sigma_f^2 \exp\left(-\frac{1}{2}\sum_{m=1}^{d}\frac{(x_{im} - x_{jm})^2}{\sigma_m^2}\right)$, (A.7)
    Nonisotropic Matern 5/2: $k(x_i, x_j|\theta) = \sigma_f^2\left(1 + \sqrt{5\sum_{m=1}^{d}\frac{(x_{im} - x_{jm})^2}{\sigma_m^2}} + \frac{5}{3}\sum_{m=1}^{d}\frac{(x_{im} - x_{jm})^2}{\sigma_m^2}\right) \times \exp\left(-\sqrt{5\sum_{m=1}^{d}\frac{(x_{im} - x_{jm})^2}{\sigma_m^2}}\right)$, (A.8)
    Nonisotropic Rational Quadratic: $k(x_i, x_j|\theta) = \sigma_f^2\left(1 + \frac{1}{2\alpha}\sum_{m=1}^{d}\frac{(x_{im} - x_{jm})^2}{\sigma_m^2}\right)^{-\alpha}$, (A.9)
    Nonisotropic Matern 3/2: $k(x_i, x_j|\theta) = \sigma_f^2\left(1 + \sqrt{3\sum_{m=1}^{d}\frac{(x_{im} - x_{jm})^2}{\sigma_m^2}}\right)\exp\left(-\sqrt{3\sum_{m=1}^{d}\frac{(x_{im} - x_{jm})^2}{\sigma_m^2}}\right)$, (A.10)
    Empty: $B = $ empty matrix, (A.11)
    Constant: $B = \mathbf{1}_{n\times 1}$, (A.12)
    Linear: $B = [\mathbf{1}, X]$, (A.13)
    Pure Quadratic: $B = [\mathbf{1}, X, X^2]$. (A.14)
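    For illustration, the two isotropic kernels most often selected in the results, the exponential kernel of Eq. (A.1) and the rational quadratic kernel of Eq. (A.4), might be coded as follows; this is a plain NumPy sketch of the formulas with illustrative parameter values loosely based on Table 2, not the implementation used for the reported estimates.

```python
import numpy as np

def isotropic_exponential(xi, xj, sigma_l, sigma_f):
    """Eq. (A.1): k = sigma_f^2 * exp(-r / sigma_l) with r the Euclidean distance."""
    r = np.linalg.norm(np.asarray(xi) - np.asarray(xj))
    return sigma_f**2 * np.exp(-r / sigma_l)

def isotropic_rational_quadratic(xi, xj, sigma_l, sigma_f, alpha):
    """Eq. (A.4): k = sigma_f^2 * (1 + r^2 / (2 * alpha * sigma_l^2))^(-alpha)."""
    r2 = np.sum((np.asarray(xi) - np.asarray(xj)) ** 2)
    return sigma_f**2 * (1.0 + r2 / (2.0 * alpha * sigma_l**2)) ** (-alpha)

# Illustrative evaluation at two arbitrary 20-dimensional lag vectors.
x1, x2 = np.full(20, 3400.0), np.full(20, 3410.0)
print(isotropic_exponential(x1, x2, sigma_l=14.4, sigma_f=3394.0))
print(isotropic_rational_quadratic(x1, x2, sigma_l=14.4, sigma_f=3394.0, alpha=0.004))
```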

    ORCID

    Bingzi Jin  https://orcid.org/0009-0005-1620-7772

    Xiaojie Xu  https://orcid.org/0000-0002-4452-1540