Filling and validating rainfall data based on statistical techniques and artificial intelligence

The study of the hydric regime of rainfall helps in management analysis and decisionmaking in hydrographic basins, but a fundamental condition is the need for continuous time series of data. Therefore, this study compared gap filling methods in precipitation data and validated them using robust statistical techniques. Precipitation data from the municipality of Itirapina, which has four monitoring stations, were used. Four gap filling techniques were used, namely: normal ratio method, inverse distance weighting, multiple regression and artificial neural networks, in the period from 1979 to 1989. For validation and performance evaluation, the coefficient of determination (R2), mean absolute error (MAE), mean squared error (RMSE), Nash-Sutcliffe coefficient (Nash), agreement index (D), confidence index were used (C) and through non-parametric techniques with Mann-Witney and Kruskal-Wallis test. Excellent performances of real data were verified in comparison with estimated data, with values above 0.8 of the coefficient of determination (R2) and of Nash. Kruskal-Wallis and Mann-Whitney tests were not significant in Stations C1 and C2, demonstrating that there is a difference between real and estimated data and between the proposed methods. It was concluded that the multiple regression and neural network methods showed the best performance. From this study, efficient tools were found to fill the gap, thus promoting better management and operation of water resources.


INTRODUCTION
Rainfall is one of the variables with the greatest influence on society, environment and economy, as it has direct implications for agriculture, climate, hydrology, disaster management, among others. Therefore, evaluating its behavior allows it to assist in the analysis of water availability and decision-making in hydrographic basins. However, for such verifications to be carried out, a fundamental condition is the need for continuous time series of data in order to obtain consistent and reliable results (Correia et al. 2016).
A constant problem in developing countries is the absence of continuous data on meteorological variables and rainfall stations in different regions, which reinforces the importance of making the most of existing data. Since gaps in databases can influence data analysis and inferences, missing data-filling methods, such as multiple regression method (MR), inverse distance weighting (IDW), normal ratio method (NRM), and neural networks (NN), stand out in the scientific environment due to their excellent performance in calculating estimates (Depiné et al., 2014;Wanderley et al., 2014;Khosravi et al., 2015;Correia et al., 2016;Bier and Ferraz, 2017;Coutinho et al. 2018).
Several studies have highlighted the multiple linear regression method as an efficient tool used to fill gaps in time series, such as rainfall, temperature and humidity data, among others. According to Oliveira et al. (2010), multiple linear regression and regional weighting perform better in filling in gaps than regional vector and regional weighting methods based on linear regressions. However, the aforementioned authors have emphasized that these methodologies should not be used without prior regional analysis of their performance.
On the other hand, the inverse distance weighting method is one of the techniques mostly used to estimate missing data in hydrology and geographic sciences, since it presents satisfactory results in filling in gaps about rainfall data (Teegavarapu and Chandramouli, 2005;Shepard, 1968). Junqueira et al. (2018) have compared different rainfall missing data-filling methodologies and found that methods such as regional weighting, arithmetic mean and regional weighting based on regression have overestimated rainfall rates in the Mortes River Basin (MG), whereas linear regression, multiple regression and inverse distance weighting methods have underestimated them. According to Bier and Ferraz (2017), regional weighting was the most suitable method used for missing rainfall data-filling purposes, but it did not significantly stand out in comparison to methods such as multiple linear regression, regional weighting, inverse distance weighting, normal ratio method, United Kingdom traditional method and simple arithmetic mean. Normal ratio method has shown satisfactory results and low mean absolute errors (18.1%) in estimates.
The aforementioned authors have also pointed out that these estimates have represented monthly rainfall variations in a reasonable way. They were capable of detecting monthly rainfall peaks between original and estimated series, and it has evidenced the possibility of generating good estimates for monthly and annual rainfall data (Bier and Ferraz, 2017). Coutinho et al. (2018) used a Multilayer Perceptron-type neural network fault filling tool comparing monthly meteorological variables with Multiple Regression models in four stations in the state of Rio de Janeiro from 2002 to 2014. The neural network model presented a high linear correlation (r) with the recorded data of maximum air temperature (r from 0.94 to 0.98), obtaining a mean percentage error (EMP) between 1.05% and 2 .32%. Regarding the relative air humidity, the r coefficient remained between 0.77 and 0.94, and the use of RNA to estimate this variable resulted in an EMP between 1.85% and 2.41%. They concluded that the neural network method is an effective tool for filling and reliably estimating meteorological variables, as the estimated data were close to the real data.
Given the range of procedures and statistical techniques used to fill in missing data in climate series, it is of paramount importance to select the most appropriate methodology capable of meeting the needs of the study by taking into consideration the climatic and geographical reality the meteorological stations are inserted in, as well as of statistically proving the veracity of estimated data (Fante and Sant'Anna Neto, 2016).
Assessing and validating the performance of hydrological models is a crucial process to justify their continued use, as well as verifying their limitations. In this way, it is necessary to apply statistical criteria to quantify the quality and accuracy of the adjustment, in relation to the measured and estimated data, which enables a comprehensive evaluation of the models, enabling their proper reproduction, the prediction of future behavior and the identification of improvements in modeling.
The most commonly used criteria for evaluating models are commonly divided between performance indices, which include: coefficient of determination (R²), agreement (d) and confidence (c) index, and Nash-Sutcliffe efficiency (Nash) and error checking measures such as mean absolute error (MAE) and mean squared error (RMSE). There are no standard procedures in the literature to assess the performance of models, but it is preferable that different statistical techniques are applied, as different metrics quantify various aspects of model fit and accuracy, providing a quantitative and objective assessment of the agreement between simulated and estimated data (Jackson, 2019).
Thus, the present study compared gap filling methods in precipitation data and validated them using robust statistical techniques.

Study site featuring
The study site is located in Itirapina County-SP (Figure 1), approximately 218 km away from the capital; its population is estimated as 18,157 inhabitants and its territorial area covers 564.60 km² comprising Cerrado and Atlantic Forest biomes (IBGE, 2019).
The county has two important conservation units, namely: Experimental Station and Ecological Station. They are managed by the Forestry Institute, which manages an area of 5,512 hectares and aims at environmental preservation, research and education. The water network in these units, together with the other water bodies, are of paramount importance because they 4 Camila Bermond Ruezzene et al. play a key role in the overall balance of the region (Silva et al., 2006). Itirapina County is located in the Cuestas Basálticas region, on São Carlos Plateau. The climate in the region is classified as Cwa, based on Köppen's classification. Summer is hot and rainy, whereas winter is dry. Seasonality observed in the region comprises dry semester from April to September, as well as rainy semester from October to March. Mean annual rainfall reaches 1,450 mm per year and mean annual temperature is approximately 20.8°C (CEPAGRI, 2009;Santos et al., 2018b).
According to Santos et al. (2018a), land use and cover classification in Itirapina County corresponds to native vegetation (30.68%), agricultural crops such as sugarcane (19.93%), forestry (18.37%), citrus culture (1.55%), exposed soil (11.97%) and pasture (14.55%). The aforementioned authors have emphasized that the region presents low environmental vulnerability level due to its flat terrain and to the incidence of Oxisol in the county.

Rainfall data
Monthly historical data about rainfall recorded in Itirapina County (SP) from 1979 to 1989 were selected from stations available at the HIDROWEB portal, on the National Water Agency (ANA, 2020) platform. Table 1 shows the stations and their respective codes, altitude, geographic coordinates, as well as the time (in years) of each analyzed series -these data were consolidated before they were made available. The period from 1979 to 1989 was adopted due to the need to work with a continuous series of data, allowing better representation of the characteristics present in each station and the comparison of real and estimated data for each proposed method, using all stations available for that municipality. The climatological station (C4) located at the Center for Water Resources and Environmental Studies (CRHEA/USP) was included in the group of meteorological stations because it makes it possible to work with primary data. Considering all available data from this period of study, 11.37% of the data were removed and thus obtained in a homogeneous series and with the same number of data for all stations according to the methodology of Coutinho et al. (2018).

Missing data-filling
Criteria adopted to select the tested methods took into consideration methodologies already consolidated in the field. Thus, rather than being not limited to a single methodology, the current study used several of them in order to compare and validate them, based on different statistical techniques that will be addressed throughout the methodology and results. The following techniques stood out among the main missing data-filling methods and they were used in the current study: multiple regression (MR); inverse distance weighting (IDW), normal ratio method (NRM) and artificial neural networks (ANNs).

Multiple regression (MR)
Rainfall information about the behavior of a dependent variable Y in multiple regression depends on two, or more, independent variables Xj, j = 1, ..., p (Naghettini and Pinto, 2007). Therefore, a model likely to evaluate this association is enabled by Equation 1.
Where in: n is the number of observations, Yi is the observation of the dependent variable for the i-th individual, Xi = (Xi1, Xi2, ..., Xip) is a vector of observations of independent variables for the i-th individual, β = (β0, β1, β2, ..., βp) is a vector of regression coefficients (parameters) and ei is a random error component.It is presumed that these errors are independent and follow normal distribution with mean equal to zero and unknown variance σ 2 .

Inverse Distance Weighting (IDW)
The inverse distance weighting method is applied through the linear combination of observations within a given research radius, whose influence decreases as distance increases. According to Hubbard (1994), the IDW method for missing data-filling is calculated based on Equation 2.
Where in: Dx is the missing monthly data to be filled in the test station, Di corresponds to data deriving from the neighboring station of order "i" in the month when the failure in the test station takes place, and di is the distance between the test station and the neighboring station of order "i".

Normal Ratio Method (NRM)
According to Young (1992), the normal ratio method lies on weighting data based on records performed at neighboring stations; such a ratio can be calculated through Equation 3.
Where in: Dx is the monthly data that needs to be filled in the test station, Di corresponds to data deriving from the neighboring station of order "i" in the month when the failure in the test station takes place, and wi is the weight assigned to each neighboring station of order "i", as described in Equation 4.
Where in: ri is the correlation between the test station and the neighboring station of order "i", and ni is the number of months when data overlapped between the test station and the neighboring station of order "i". In other words, it is the size of the data series used to calculate the correlation coefficient.

Artificial Neural Networks (ANN)
Neural networks are calculated through mathematical functions; they are naturally prone to store knowledge and make it useful, like the process carried out by the human brain. Nonlinear functions are calculated, which can be appropriate for complex analyses, such as estimating rainfall data (Di Piazza et al., 2011;Depiné et al., 2014;Wanderley et al., 2014;Correia et al., 2016;Coutinho et al., 2018).
The current study has used Multilayer Perceptron (MLP) neural networks due to their greater versatility and applicability in this field. This network type can be used to estimate information and new desired conditions, as well as to find accurate answers to the analyses in question (Wanderley et al., 2014). Figure 2 shows an example of the neural network architecture corresponding to a station. Networks were trained in the MATLAB software, Version R2015a (https://www.mathworks.com/products/matlab.html) developed by the MathWorks company. It was done by using the Feed-forward backpropagation network type and mean square error as a performance function. The current research has defined that the network architecture should have 3 inputs (corresponding to the stations located in the county), 2 layers, 10 neurons, 1 output and a tan-sigmoid activation function. With respect to network training and validation, 70% of data were used for training; 15%, for testing; and 15%, for validation purposes, as established by the software itself. As in the example in Figure 2, to fill in the precipitation data for Station 2247180, all available data for that station was removed and three stations were used as input data and one 7 Filling and validating rainfall data based on … Rev. Ambient. Água vol. 16 n. 6, e2767 -Taubaté 2021 for output, obtaining a total of four networks for this municipality carrying out this entire process for the four seasons. Each trained neural network recognizes and adapts to the characteristic patterns of each station's rainfall data, providing gap-filling results.

Method-performance validation and evaluation
The coefficient of determination (R²) was calculated to verify the relationship between estimated and measured data. To assess the performance and errors of the failure filling methods, parameters such as mean absolute error (MAE), mean square error (RMSE), index of agreement (D), Pearson's correlation coefficient (r), confidence index (C) and Nash-Sutcliffe efficiency coefficient (Nash), which are applied in several hydrological studies (Goyal, 2014;Pereira et al., 2014;Wanderley et al., 2014;Bier and Ferraz, 2017;Coutinho et al., 2018;Junqueira et al., 2018).

Coefficient of determination (R²)
The coefficient of determination (R²) ( Equation 5) assesses the quality of model fit and indicates the extent to which it was capable of explaining the reference data -the higher the recorded value, the better it fits the model.
Where in: R² is the coefficient of determination measured in (%), Yi is the observed value of the dependent variable, ̂ is the estimated value of the dependent variable, and is the mean recorded for the dependent variable.

Mean absolute error (MAE)
According to Alves et al. (2012), MAE refers to the mean absolute deviation of interpolated values in comparison to the observed ones. It is considered an accurate and robust measure to check numerical models; ideally, its values should be as close or equal to zero as possible (Equation 6).
Where in: MAE is the mean absolute error (mm), Oj concerns values observed in the measurement stations, xj corresponds to values estimated through the missing data-filling method, and "n" is the number of observations.

Root mean square error (RMSE)
RMSE enables checking the mean magnitude of estimated errors. The obtained value is always positive; the closer to zero, the better the estimated values. This parameter can be calculated through Equation 7.
Where in: RMSE is the mean square error (mm), Oj concerns values observed in the measurement stations, xj corresponds to values estimated by the missing data-filling method, and "n" is the number of observations.

Confidence (c) and agreement (d) indices
The confidence index enables checking the precision and accuracy of results. The index of agreement is used in different simulations of a single phenomenon. Values recorded for this index range from 0 (lack of agreement) to 1 (excellent agreement). Table 2 shows the criteria used to assess performance. These parameters can be calculated through Equations 8, 9 and 10.
Where in: D refers to the agreement index (dimensionless), r Pearson's correlation coefficient (dimensionless), C confidence index (dimensionless); Oj are the values observed at the measurement stations, mean of observed values, mean estimated values, xj correspond to values estimated by the filling method and n to the number of observations.

Nash-Sutcliffe efficiency coefficient (Nash)
The Nash-Sutcliffe efficiency coefficient (Equation 11) is one of the most important and usual statistical methods applied in hydrology to assess the performance of hydrological models, as described by Pereira et al. (2014). This coefficient can range from -∞ to 1; the value corresponding to 1 represents the ideal adjustment of estimated data.
Where in: Nash is the Nash-Sutcliffe efficiency coefficient (dimensionless), are the observed rainfall data, are the rainfall data simulated by the model, is the mean recorded for data observed during the simulation period, and n is the number of events.
Model classification was herein adopted based on Silva et al. (2006): models presenting coefficient value higher than 0.75 were classified as adequate and good, those whose coefficient value ranged from 0.36 to 0.75 were classified as acceptable, whereas the ones presenting coefficient value lower than 0.36 were considered unacceptable.

Descriptive and inferential analysis and data normality tests
The assumption of normality was checked through the Anderson Darling test, due to its 9 Filling and validating rainfall data based on … Rev. Ambient. Água vol. 16 n. 6, e2767 -Taubaté 2021 sensitivity in giving more weight to distribution tail points (Espinosa et al., 2004). Shapiro-Wilk test was also applied, since it is one of the most efficient tests used to identify non-normal data (Shapiro and Wilk, 1965). Both normality tests adopted a significance level of 0.05. H0 -which corresponds to data presenting normal distribution -was rejected whenever p-values were lower than 0.05. Thus, whenever such data did not show normality, they were subjected to non-parametric analysis techniques, as well as to data median analysis.
The following hypotheses were used in the Anderson Darling and Shapiro-Wilk tests: H0: data follow normal distribution.
H1: data do not follow normal distribution.

Mann-Whitney (MW) and Kruskal-Wallis (KW) non-parametric tests
Mann-Whitney non-parametric test was carried out at a significance level of 0.05 in order to test whether there were significant differences between the value estimated through the methods and the real rainfall reference data. H0 was accepted and H1 was rejected whenever the p-value was higher than the significance level -the two samples presented the same distribution (Triola, 2008).

Hypothesis testing:
H0: data derived from equal samples. H1: data derived from different samples.
Non-parametric KW test initially suggested by Kruskal and Wallis (1952) was also applied at significance level of 0.05 in order to determine whether the medians among all four (4) missing data-filling methods used in the current study have significantly differed from each other. Thus, the null hypothesis was rejected whenever p-value was lower than, or equal to, the significance level, and it led to the conclusion that not all medians were equal to each other. Figure 3 shows the time-based rainfall variability among all four stations located in Itirapina County. They followed similar behavior, i.e., they recorded the highest rainfall values in January 1983 and 1981 (554 mm and 441.2 mm, respectively), which corresponded to the C4 station of CRHEA-USP.  Souza and Galvani (2017) have analyzed rainfall events recorded by 16 stations located in the Jacaré Guaçu River Basin, Itirapina County, from 1968 to 1998. Results have shown that this region presents great spatial and temporal variability in rainfall events, as seen in tropical areas. The annual rainfall amplitude among the analyzed stations reached 366.6 mm, whereas the seasonal amplitude reached 70.8 mm in the lesser rainy period and 291.4 mm in the rainy period.

RESULTS AND DISCUSSION
The full analysis of all series played a key role in helping to find likely explanations for temporal-spatial rainfall dynamics and for its main factors acting in the investigated river basin. Such representative features of each station have directly influenced results recorded for each method.
After the preliminary analysis was over, Anderson Darling and Shapiro-Wilk normality tests were applied to all real and estimated data; results of both tests have shown p-value lower than the significance level of 0.05, and it enabled concluding that the analyzed data did not follow normal distribution. Lima et al. (2008) have found similar results and indicated that high monthly rainfall variability is a decisive factor for such a behavior. Thus, non-parametric statistics should be used for other analyses.

Evaluating the performance of the missing data-filling method applied in Itirapina County
Tables 3, 4, 5 and 6 show the performance analysis applied to the normal ratio method, inverse distance weighting, multiple regression and neural networks, respectively, based on coefficient of determination (R²), mean absolute error (MAE), root mean square error (RMSE), Nash-Sutcliffe efficiency coefficient (Nash), index of agreement (D), confidence index (C) and performance.  All four methods used in the current researchnamely: NRM, IDW, MR and NNpresented excellent performances in all stations in Itirapina County. Values recorded for coefficient of determination (R²) and Nash coefficient were higher than 0.8; except for station C4, which recorded Nash coefficient value of 0.795. The other parameters presented low error values. Based on the MR method, stations C1 and C4 have shown the best performances in R² (0.906 and 0.931, respectively). Based on NN, station C1 presented the best performance in R² and Nash coefficient, as well as the lowest error values in comparison to other methods. Similar results were reported by Coutinho et al. (2018), who analyzed multiple linear regression and neural network models. According to them, linear regression methods have shown satisfactory results, high correlation indices and low mean errors in comparison to real data. Depiné et al. (2014) Correia et al. (2016 and Wanderley et al. (2014) have also concluded that the neural network method has efficiently reproduced the missing rainfall data-filling process and made it possible to compare the complex inputs and outputs of simulations. However, the aforementioned authors have pointed out that there may be variations in the results due to the procedure adopted for network testing, training and validation processes. They also highlighted that estimated data are more accurate when there is lower spatial variability in rainfall events. Table 7 shows the p-values of the Mann-Whitney test corresponding to each method and station. It was done to investigate whether there was significant difference between real and estimated data. The p-values of Kruskal-Wallis test are also shown to help identify whether there was significant difference among the proposed methods. Thus, NRM was not significant for Station C1 since the p-value was lower than 0.05. With respect to the IDW method, two of the four stations did not show significant values, namely: Stations C1 (p-value 0.024) and C2 (p-value 0). The other two stations have reached p-values of 0.459 and 0.125 for Stations C1 and C3, respectively. The MR and NN methods, on the other hand, recorded p-values higher than the significance level, indicating lack of difference between real and estimated data. Based on the NN method, all stations recorded a p-value higher than 0.7, except for Station C1. Thus, it is worth emphasizing the importance of adopting in-depth non-parametric statistical methods to perform this analysis type, since the exclusive use of Nash and determination coefficients may not help in identifying such issues.
The Krukal-Wallis test has evidenced significant differences among the four methods applied to Stations C1 and C2, which recorded p-values of 0.018 and 0.001, respectively. Significant p-values were observed for the other stations, although there was no difference among the four analyzed methods. It is worth emphasizing that the neural networks subjected to KW test recorded the best performance for Station C1, whereas MR, NRM and IDW presented inferior performances; however, any of the investigated methods can be used for missing data-filling purposes.
Non-parametric analyses applied to the two stations (C1 and C2) that did not record significant p-values have indicated significant differences between actual and estimated data. This outcome can be explained by the orographic effect associated with that region, as well as by episodes of rainfall rate fluctuations that were not identified by the NRM and IDW methods, a fact that hindered the estimates.

FINAL CONSIDERATIONS
The missing data-filling methods used and validated through the herein-adopted statistical techniques presented excellent performances. Based on analyses applied to all four methods, neural networks and multiple regression were the ones presenting the best results; thus, they are the most suitable methods to be used for missing rainfall data-filling purposes.
Thus, based on the use of efficient missing data-filling tools, it is possible to minimize the social, environmental and economic consequences by promoting better water-resource management and operation by public and private agents, as well as by society as a whole. In addition, these tools can help mitigate issues such as flooding, drought, water supply, electric power generation, among others, enabling the use of continuous data series for these studies.