Methodological approaches for imputing missing data into monthly flows series

Missing data is one of the main difficulties in working with fluviometric records. Database gaps may result from fluviometric stations components problems, monitoring interruptions and lack of observers. Incomplete series analysis generates uncertain results, negatively impacting water resources management. Thus, proper missing data consideration is very important to ensure better information quality. This work aims to analyze, comparatively, missing data imputation methodologies in monthly river-flow time series, considering, as a case study, the Doce River, located in Southeast Brazil. Missing data were simulated in 5%, 10%, 15%, 25% and 40% proportions following a random distribution pattern, ignoring the missing data generation mechanisms. Ten missing data imputation methodologies were used: arithmetic mean, median, simple and multiple linear regression, regional weighting, spline and Stineman interpolation, Kalman smoothing, multiple imputation and maximum likelihood. Their performances were compared through bias, root mean square error, absolute mean percentage error, determination coefficient and concordance index. Results indicate that for 5% missing data, any methodology for imputing can be considered, recommending caution for arithmetic mean method application. However, as the missing data proportion increases, it is recommended to use multiple imputation and maximum likelihood methodologies when there are support stations for imputation, and the Stineman interpolation and Kalman Smoothing methods when only the studied series is available.


INTRODUCTION
Knowledge about the water regime in a river basin is fundamental in hydrological studies, being an indispensable factor for adequate water-resource management (Moreira, 2006). In this sense, Brazilian Law 9,433 (Brasil, 1997), which instituted the National Water Resources Policy, indicates, among the management instruments, the National Water Resources Information System. In this context, the hydrometeorological monitoring network maintained by the National Water Agency (ANA) and the availability of the databases generated in the Hydrological Information System (HIDROWEB) become relevant.
The instrument National Water Resources Information System, however, is still incipient in relation to the others, mainly due to the limited number of monitoring stations and the incompleteness of the data generated . The existence of gaps in the historical series is due to technical or maintenance problems, non-ideal climatic conditions, instrumental failures or device errors during data collection, human error during data entry, calibration processes and/or data damage due to malfunction of machines storage, construction and organization of hydrometric databases (Gao, 2017;Johnston, 1999;Peña-Angulo et al., 2019;Tencaliec, 2017;Tucci, 1997).
According to McKnight et al. (2007), "in general, the term missing data means that some type of information about the phenomenon in which we are interested is missing" and, therefore, the sample is called incomplete. The analysis of incomplete flow-time series produces negative impacts, especially on stochastic decompositions of series, compromising information such as trend, stationarity, cycle and seasonality (Box and Cox, 1964).
As discussed by Roth et al. (1999) and Pigott (2001), the existence of missing values in time series generally decreases the capacity and precision of statistical analysis approaches and contributes to biased estimates of the relationship between variables, which may cause inaccurate assumptions in data set exploration which can negatively impact water resources management, for example, in determination of maximum permissible uptake and ecological flows, extreme flows estimation, flows forecasting, hydraulic systems designs, among others. Due to this fact, the reconstruction of incomplete series and the treatment of missing data must be seen as a priority in the data preparation procedure (Hamzah et al., 2020).
Rev. Ambient. Água vol. 17 n. 2, e2795 -Taubaté 2022 Because missing data imputation is a useful tool in water-resource management studies (Barnetche and Kobiyama, 2006), several authors have worked on the application of techniques for imputing missing data in hydrological studies resulting in a variety of methods ranging from simple imputation by mean or median to widely used statistical methods such as Regional Weighting (Ely et al., 2021); interpolations (linear, quadratic and cubic) (Gyau-Boakye andSchultz, 1994, Hamzah et al., 2020); methods based on linear regressions (single and multiple) (Kamwaga et al., 2018;Khalifeloo et al., 2015); Self Organizing Map (SOM) and Soil and Water Assessment Tool (SWAT) (Kim et al., 2015); to more advanced and robust methods, such as different Artificial Neural Network approaches (Canchala-Nastar et al., 2019;Elshorbagy et al., 2000;Nkiaka et al., 2016;Starrett et al., 2010;Vega-Garcia et al., 2019); machine learning methods (Heras and Matovelle, 2021;Rado et al., 2019); satellite radar altimetry and multiple imputation (Ekeu-Wei et al., 2018); combination of regression and autoregressive integrated moving average (ARIMA) models called dynamic regression (Tencaliec et al., 2015); Singular Spectrum Analysis (SSA) and Multichannel Singular Spectrum Analysis (MSSA) (Semiromi and Koch, 2019); among many others. The many methods that can be used for hydrological missing data imputation resulted in literature reviews as can be seen in Ben Aissia et al. (2017) and Hamzah et al. (2020). Ventura et al. (2016) carried out a study to compare statistical methods for filling gaps and to verify which method presents better results for meteorological data series. Three weather stations located in Porto Alegre, Rio de Janeiro and Manaus cities, in Brazil, were chosen. Failures were simulated in real data series and the performances of four methods were compared: simple average, moving average, simple linear regression and multiple linear regression. To verify the obtained results, the mean absolute error and the correlation coefficient were used. The results showed excellent performance of the multiple linear regression method for the variables temperature, humidity and dew point, while the simple average had the best result for the variable atmospheric pressure. None of the four methods presented good results for the variable solar radiation. Nunes et al. (2009) carried out a study with the objective of publicizing the Multiple Imputation (MI) method. The authors selected a 470 surgical patient death outcome data set and adjusted logistic models to it. Two incomplete data sets were generated, one presenting 5% and the other 20% of missing data for the variable albumin. Models were adjusted to the complete series, to the series presenting missing data and the series filled by using MI. The estimates obtained by the analysis of the series presenting missing data and with the filled series were different, mainly for those presenting 20% of missing data. The utilized MI was efficient, because the results achieved with the series filled by imputations were close to those obtained with the complete series. The results obtained considering series filled by using MI were superior to those obtained for series with missing data. Junger and Leon (2015) presented an imputation method via Maximum Likelihood (ML) that is suitable for multivariate time series using the EM algorithm (Expectation and Maximization) under the assumption of normal distribution. The authors used a database related with tem PM10 monitoring stations located in São Paulo city, Brazil. Different approaches were considered to filter the temporal component. A simulation study was carried out to compare the proposed and some frequently used methods of quality and performance. The simulations showed that when the amount of missing data was less than 5%, the complete data analysis generated satisfactory results, regardless of the mechanism that generated the missing data. Imputation quality began to degenerate when missing data proportions exceeded 10%. The proposed imputation method presented good accuracy and precision in different configurations with respect to the missing observation patterns. Most imputations obtained valid results, even under the non-random losses mechanism. Sattari et al. (2017) evaluated different methods of imputing missing data in monthly rainfall time series collected at six stations in southern Iran. Imputation methodologies analyzed include arithmetic mean, inverse distance interpolation, linear regression, multiple imputations, multiple linear regression analysis, non-linear iterative partial least squares algorithm, NR method, single best estimator, UK traditional method and M5 decision model tree. Results showed that arithmetic averaging method, multiple linear regression method and nonlinear iterative partial least squares algorithm perform best. Multiple regression methods provided a successful missing precipitation data estimation. Multiple imputation methods produced the most accurate results for precipitation data from five dependent stations. Finally, the decisiontree algorithm is explicit, and therefore it is used when insights into decision making are needed. Chen et al. (2019) verified the impact of using different methods for imputing missing data in rainfall series on the forecasting hydrological and non-point (H/NPS) pollution performance using the Soil and Water Assessment Tool (SWAT) model. Multiple imputation (MI) and maximum likelihood methods using expectation-maximization bootstrap algorithm (EMB) were considered. Different imputed data sets effects were investigated through a case study in the Daning River Basin, Three Gorges Reservoir Region, China. Results indicate that rainfall data imputation and H/NPS model performance obtained by EMB algorithm are superior to MI performance. Authors highlight the important implications for choosing appropriate imputation methods in H/NPS models to solve data scarcity problems for watershed studies. Hamzah et al. (2022) evaluated the performance of multiple imputations by chained equations (MICE) approach to predicting recurrence in streamflow datasets. To evaluate and verify MICE approach effectiveness in treating missing streamflow data, complete historical daily streamflow series from 2012 to 2014 were used. Later, MICE methods coupled with multiple linear regression (MLR) were applied to restore streamflow rates in Malaysia's Langat River Basin from 1978 to 2016. The best estimation methods are validated with tests such as adjusted R-squared (Adj R 2 ), residual standard error (RSE) and mean absolute percentage error (MAPE). Findings revealed that the classification and regression tree (CART) method combined with MLR outperformed the other approaches tested, with highest Adj R 2 value and lowest RSE and MAPE values observed regardless of missing conditions. Abu Romman et al. (2021) compared ten imputation methods that were used to impute rainfall depth data in an arid region of the Mediterranean. Series mean, linear interpolation, linear trend, arithmetic mean, normal ratio, inverse distance weighting, linear regression with GPCC data, linear regression with satellite data, stepwise multiple linear regression and multiple imputation were used for these imputations. The results showed that for intervals between 5 and 20% of failures, the stepwise multiple linear regression method produced best results with a root mean square error (RMSE) and mean absolute error (MAE) less than 7 and 2 mm, respectively. This was followed by the Monte Carlo Markov chain expectationmaximization-based multiple imputation method, which had an RSME and MAE of 1.01 and 0.08 mm, respectively, when the series had 20% failures. On the other hand, satellite data use for imputation was adequate for failures between 10 and 15%.
Currently, it is unclear in the literature which method is the most appropriate to deal with missing value imputation in river-flow time series. Considering this scenario, this research evaluates ten imputation techniques, based on single and multiple imputation methods: Arithmetic Mean (AM), Median (M), Simple Linear Regression (SLR), Multiple Linear Rev. Ambient. Água vol. 17 n. 2, e2795 -Taubaté 2022 Regression (MLR), Regional Weighting (RW), Spline Interpolation (SPLINE), Stineman interpolation (STINE), Kalman Smoothing with (KALMAN), Multiple Imputation (MI) and Maximum Likelihood (ML). It is important to emphasize that there are still few studies that use MI and ML imputation methodologies dealing with missing value imputation in river-flow time series, evidencing the need to develop works analyzing their performance, and these results can help sustainable water resource management. In this context, the present research objective is to comparatively analyze methodologies for imputing missing data by an application to Doce River, Brazil, monthly average fluviometric flow-time series.

Study Area
The Doce River watershed, Figure 1, is located in Southeast Brazil, occupying portions of Minas Gerais and Espírito Santo states between the parallels 17°45' and 21°15' South latitude and the meridians 39°55' and 43º45' West longitude. The Doce River presents 853 km total extension and 83,465 km 2 drainage area (Coelho, 2007). Of this area, 86% belongs to Minas Gerais state and the remaining 14% to Espírito Santo state, being, therefore, a federal dominion watershed.

Data
Six ANA hydrometeorological network Doce River fluviometric stations were considered in this work. The monthly flow data was obtained from the HIDROWEB Hydrological Information System (Agência Nacional de Águas, 2018). Station-selection criteria were related to the existence of consistent historical average monthly flow-time series for at least 20 years, according to literature recommendations (Longhi and Formiga, 2011;Sarmento, 2007).
Historical series stationarity analysis, that verifies mean and variance identity of two distinct hydrological series subperiods, utilized Fisher's F and t-Student tests, in order to evaluate time series hydrological regime behavior changes, which can be caused by several factors, such as reservoir construction upstream of the fluviometric station, water withdrawal for irrigation agricultural activities use and even local climate regime changes over time (Sousa et al., 2009). For the analysis, the software SisCAH 1.0 -Computational System for Hydrological Analysis -was used (Sousa et al., 2009).
The selected stations are shown in Figure 1. The list of stations considered is presented in Table 1, with respective ANA identification codes, identification nomenclature adopted in this study, geographic coordinates and drainage areas. Colatina station is located in Espírito Santo state, and the others in Minas Gerais state. After analyzing the available data, the study base period 1987 to 2014 was determined.

Missing data patterns
According to Silva (2012), the presence of missing data in a database can be characterized by observed failure behavior patterns which is of paramount importance to describe missing value locations in the series. First, Collins et al. (1991) described and divided the pattern of missingness into two groups: general (random) and special patterns. Special patterns including univariate missing data, unit nonresponse, and monotone missing data. "Missing general" or "random pattern" is where missing data occur in any of the variables in any position. In a "special pattern" case, if there is only one variable with missing data while the other variables are completely recorded, the pattern is called "univariate missing data". Additionally, when the multivariate pattern is detected, means that the missing values arise in more than one variable; if there are missing values on a block of variables for the same set of cases, and the remaining of the variables are all complete, the missing data pattern is called "unit nonresponse"; and, the pattern is said to be "monotone" whenever observations are ordered and item k is missing, and all k + 1,..., n cases are also missing (Collins et al., 1991;Silva, 2012). For the present study, the general pattern will be considered, because it assumes that the missing data follow a random distribution.

Missing data mechanisms
Little and Rubin (2019) classified three possible ways that data may go missing: Missing Completely at Random (MCAR), Missing at Random (MAR) and Missing Not at Random (MNAR). As discussed by Hamzah et al. (2020) and Gao et al., (2018), MCAR describes data where the gaps are distinct from any of the variables in the dataset. In any event, the missing values probably correlated to other observed values, yet not to missing values; in that case, the missingness is assumed to be MAR. Missing data which are dependent on the observed value is known as MNAR. Missing data can be presented in the form of a probabilistic process that describes the association among the measured variables and the probability of missing value. For more details about Missing Data Mechanisms, see Little and Rubin (2019); based on these authors, missing value in streamflow study is determined as MCAR because the missingness episode in streamflow data of an area is not influenced by data in that area or any area. This mechanism of missing data is classified as ignorable, that is, there isn't need to model the mechanism as part of the estimation process (Allison, 2001).

Missing data imputation methods
The missing data imputation methodologies can be classified in three general classes: Single Imputation (SI), Multiple Imputation (MI) and Estimation. The basic principle of SI methods is to impute one value to each database missing data and, then, analyze it as if there Rev. Ambient. Água vol. 17 n. 2, e2795 -Taubaté 2022 weren't missing data (McKnight et al., 2007). The MI is characterized by being a method that consists of imputing a certain missing data by a data set and, after analyzing it, determining the best value to be taken to impute it. The Estimation method consists in estimating parameters that govern the missing values distribution from the observed data. For the present study, the Maximum Likelihood (ML) methodology will be employed. In addition to this general classification and more widely used in the literature to deal with missing data, imputation methodologies can also be classified as univariate or multivariate. The first occurs when the series itself provides information that can be used by imputation methodology. The second one is when it is necessary to use support stations to impute interest series. In this study, AM, M, SPLINE, STINE and KALMAN were classified as univariate imputation methods. In turn, SLR, MLR, RW, MI and ML are multivariate methods.

Single Imputation Methods Arithmetic Mean (AM)
This is the most commonly used single imputation technique where the missing values are replaced with the mean value of the time series. The mean of a series of values x1, x2, ..., xn is given by Equation 1.

Median (M)
Median imputation is another simple method often appropriate for highly skewed data (Junger and Leon, 2015). This method calculates the median of the variable based on all cases that have data for any variable and replaces the series missing values with the median of the variable (Kabir et al., 2019). The median of a series of values x1, x2, ..., xn is can be obtained by Equation 2 after sorting the dataset in ascending order: (2)

Simple (SLR) and Multiple Linear Regression (MLR)
Imputation with linear regression uses information from complete series to fill the missing values of the interest series. It can be of two types: (i) imputation by simple linear regression, when the purpose is to predict the dependent series behavior as a function of only one independent series (Equation 3) or (ii) imputation by multiple linear regression, when the dependent series behavior is a function of more than one independent series (Equation 4).
In Equation 3, X represents the linear regression equation dependent series; β0 a linear coefficient vector; β the angular coefficient; Υ the independent station and ε the model residuals. As in SLR case, only one independent series is used to adjust the regression, Pearson correlation coefficient was used as a criterion for choosing this series and, after the test was applied, the station time series that presented highest correlation coefficient with E6, which is the interest station, was then chosen to perform imputation.
In Equation 4, X represents the linear regression equation dependent series; β0 a linear coefficient vector; βi the angular coefficients; Υi the independent station and ε the model residuals. For MLR, stations E1, E2, E3, E4 and E5 series were used as independent series to fit the regression with E6.

Regional Weighting (RW)
Imputation by regional weighting method establishes linear regressions between the station that has series with missing data, X, and each neighboring station series, Υ1, Υ1, ..., Υn, incorporating distance information (Mello et al., 2017;Pruski et al., 2004). From each linear regression performed, a correlation coefficient is obtained for the data to be estimated. Equation 5 denotes the regional weighting method.
In Equation 5, NX and Ni represents the monthly average flows data for the station with missing data to be imputed and the order "i" neighboring station monthly average flow, respectively (m 3 .s -1 ); Di denotes the values observed in the order "i" neighboring stations during the month of occurrence in the station with the data to be imputed (m 3 .s -1 ); and n is the number of neighboring stations considered.

Stineman interpolation (STINE)
This is an advanced interpolation method where interpolation occurs based on (i) whether values of the specified points ordinates change monotonically and (ii) the slopes of the line segments joining specified points change monotonically (Turicchi et al., 2020).

Kalman Smoothing (KALMAN)
The Kalman filter calculates the mean and variance of the unobserved state, given the observations. This filter is a recursive algorithm; the current best estimate is updated whenever a new observation is obtained. Kalman Smoothing takes the form of a backwards recursion and it can be used to compute a smoothed estimator of the disturbance vector (Wijesekara and Liyanage, 2020).
In the present work, R package imputeTS was used for Spline Interpolation, Stineman Interpolation and Kalman smoothing imputations.

Multiple Imputation
In an attempt to develop a method that could reflect uncertainty over missing data imputations, Rubin (1987) created the MI method, in which each missing value is replaced by a set of plausible values representing this uncertainty about the value to be imputed. MI consists of the following three steps: (i) m complete databases Υobs, Υmis are obtained through appropriate imputation techniques; (ii) separately, m data banks are analyzed by a traditional statistical method, as if they were indeed complete data groups; (iii) the m results found in step (ii) are combined in a simple and suitable way for obtaining the so-called repeated imputation inference.
In this research, the imputation model used (i) is adjusted by the Bayesian Paradigm: from the result of the posterior distribution, a set of random extractions is made for the missing data from the observed data, thus obtaining the complete database, that is, multiple imputations are Rev. Ambient. Água vol. 17 n. 2, e2795 -Taubaté 2022 made using the linear regression method (Υ = α + βX), Υ ~ N (Xβ; I σ 2 ), where the response variable Υ will be the variable to be imputed for which the parameters are estimated from its own posterior distribution. The predicted values for Υobs and Υmis are calculated and, for each predicted Υmis, the observed unit with the closest predicted value is sought using it as the value to be imputed. The variability between imputations is generated through the steps used to estimate β and σ and which are repeated m times; in the next step (ii) Q of each imputed data set is estimated; finally, in step (iii), the m results obtained can be combined using the rules proposed by Rubin (1987). The idea is that from each analysis the estimates for the parameter of interest Q are obtained, that is, Qi for i = 1,2,3 ..., m. According to Schafer (1999), Q can be any scalar measure to be estimated, such as mean, correlation, regression coefficient or odds ratio. Then, the combined estimate will be the average of the individual estimates ̅ = 1 ∑= 1 . For the combined variance, the variance is first calculated within the imputations ̅ = 1 ∑ =1 and the variance between imputations = 1 −1 Then, the total variance, which is the combined variance, will be = ̅ + (1 + 1 ) .
The literature standardizes m = 5 imputations, a value that, as indicated by Nunes (2007) emerged from the researcher's experiences. MI is implemented in the mice R software library, developed by Buuren and Groothuis-Oudshoorn (2010).

Maximum Likelihood
The ML method's basic principle is to choose as an estimation of parameters those values that, if true, would maximize the probability of observing what was actually observed (Allison, 2001). The parameters estimation is performed by maximizing the likelihood function, which, in most cases, can't be performed analytically. It is necessary to employ numerical methods, such as the EM algorithm. This maximization method is very popular when the data considered in the estimation aren't complete (Allison, 2001).
The EM algorithm is an iterative procedure consisting of two steps repetition: Estimation (E) and maximization (M). The process begins with mean vector and covariance matrix estimation by using only the complete data. According to Junger and Leon (2015), being , ( = 1, … , ), the ℎ random vector realization, with multivariate normal distribution and components not observed. The vector can be arranged in such a way that the missing components are placed in the first positions, that is, = ( 1 , … , , ( +1) , … , ) , and represented as = ( 1 , 2 ) . Consider B windows with different covariance regimes over time. The estimation of the mean vector at the instant and window b, (b = 1, …, B), can be partitioned following the same configuration as that corresponding to components, according to equations 7 and 8.
The imputation algorithm consists of (i) replace the missing values for estimates values; (ii) estimate the normal model parameters and and each univariate time series level; (iii) re-estimate the missing values considering just the updated parameters and each time series level. This process is repeated until the estimated values stop to vary (Junger and Leon, 2015). Junger and Leon (2015) report that initial estimates ̃0 and ̃0 are, respectively, the mean vector and the sample covariance matrix, considering only the observed data. In iteration ( + 1) of the step E for the EM algorithm, the missing values are imputed as the conditional mean to the observed values and the parameters estimated in the previous interaction appropriated using equation (9), with the contributions to covariance estimated by Equations 10 and 11.
Junger and Leon (2015) emphasize the need to use additional models to estimate the contribution of the time component for each univariate series, that is, µt value. The temporal trajectory of the series considered in this study was modeled with use of cubic non-parametric splines because, according to the same authors, this trajectory was the one that presented the best performance in relation to the regression models and ARIMA (Autoregressive Integrated Moving Average) models for air pollution time series missing data imputation.
Therefore, consider that µt can be estimated by a smooth function gj with j = 1, ..., p. The curve gj, in turn, is estimated in such a way that the functional ( ) = ∑ [ − ( )] 2 + =1 ∫ ( ′′) 2 be minimized. The points v1, v2, ..., vk, ordered in the interval [a, b], are the nodes and λ is the curve smoothing parameter (Junger and Leon, 2015). This results in a natural cubic spline (Green and Silverman, 1993). Each variable XJ has its level given by µjt = g(Xjt).
In this paper, the EM algorithm proposed by Junger (2008) was used, which makes use of the mtsdi platform in the R software, which was implemented by the author.

Performance Indexes
For the performance evaluation of different missing data, imputation methods in effecting the estimates were employed as can be seen below (Equations 12, 13, 14, 15 and 16).
The BIAS measure quantifies underestimation and overestimation estimates with respect to the average observations (Bier and Ferraz, 2017;Junger, 2008); RMSE and MAPE are accuracy estimates measures. In equations 12-16, represents the number of missing data in the modeled data set, the observed data, ̃ the imputed data, = 1, … , , ̅ the observed values mean and ̃̅ the imputed data mean (Pinto, 2013).

Application
From the Doce River monthly average flow-time series, an algorithm was created to simulate five incomplete data banks, with 5%, 10%, 15%, 25% and 40% missing data. The routines for simulating the missing data percentages, imputations and analyzes were implemented by using the R software (R Development Core Team, 2018). The website for that software access is http://www.r-project.org.

RESULTS AND DICUSSION
For a better understanding of the studied average monthly flow variable, each station's descriptive statistics are presented in Table 2. It's observed that, during the study period, the monthly average flow rates ranged from 162.48 m 3 .s -1 in E1 to 797.77 m 3 .s -1 in E6. These values represent the lowest and highest average monthly flows, considering all stations. A high standard deviation and coefficient of variation values can be observed, an aspect that allows us to conclude that the means aren't representative, a fact that, according to Bayer et al. (2012), can be associated with the large data intra-annual variability, indicating a seasonal component. It is also noted that the observed extreme monthly average flow values, maximum and minimum, were 3,528.32 and 41.58 m 3 .s -1 , at Stations E6 and E1, respectively. The first was registered on January 12, 2013, while the second was observed on October 1, 2014.
The fluviometric stations' monthly average flow distributions present positive asymmetry values ranging from 1.79 to 2.04. For kurtosis, the variation was from 3.24 to 4.63. The mean values are greater than the respective median and mode values, and the data distribution values' degree of concentration is classified as leptokurtic, according to Fonseca and Martins (2011). Figure 2 shows the six stations' average monthly flow-time evolution. As can be observed, the time series present intra-annual variability pattern, with floods periods followed by drought periods, which characterizes seasonality property, already indicated by the descriptive statistics and confirmed in Pinto et al. (2015) studies for E6 monthly mean flows and by Bleidorn et al. (2018) studies, for minimum monthly flows for all stations considered in this study. It's relevant to note that both studies suggest a seasonal component with a 12-month period, with a flood period from November to May, and drought period from June to October.  According to the Fischer F and t-Student tests, the time series under study are homogeneous, that is, they do not present variances and mean changes over time, considering five-year analysis periods. This fact is important to validate studied stations' data consistency. The imputation methods considered in this work require that the data follow a Normal distribution; however, according to Shapiro and Wilk (1965) and Bera and Jarque (1981) tests, the data normality null hypothesis is rejected. According to Junger (2008) and Sabino et al. (2014), environmental data, commonly, do not follow Normal distribution. Thus, statistical tests are performed to check the application of a transformation that could stabilize the variance of the observations. In the case under study, the smoothing parameter λ was estimated, as suggested by Box and Cox (1964) to define the type of transformation. According to Reisen et al. (2008), often the transformation not only stabilizes the variance but also improves the data distribution approximation to the Normal distribution. Thus, all imputations were performed using the natural logarithm transformation in order to improve the approximation with the Rev. Ambient. Água vol. 17 n. 2, e2795 -Taubaté 2022 Normal distribution and to stabilize the variance. The imputed data was back transformed for subsequent analysis.
Linear Pearson correlation coefficients for the average monthly flows r between stations under study indicate a highly homogeneous distribution, because among the pairs of stations the lowest value was 0.9240, as shown in Table 3. This condition was expected, because the stations E1 and E6 present the highest distance between stations identified in the study area, which suggests that better performance can be achieved by using the multivariate imputation methods (SLR, MLR, RW, MI and ML). In order to validate the missing data imputation methods considered in this study, the E6 database underwent 1000 replications for each failure ratio. Table 4 shows the performanceindicator means for the missing data imputation methods. In general, there is a decreasing imputation quality gradient presented by the performance indexes (PI) because of missing data increase. The SI (AM, M, SLR, MLR and RW) methods showed considerably low BIAS values and high RMSE and MAPE values in relation to SI (SPLINE, STINE e KALMAN) and mainly to multivariate attribution (MI and ML) methods. SI (AM, SLR, MLR and RW) methods showed a considerable increase in BIAS when missing data proportion increased, making imputed series underestimated in relation to observed one. M showed a similar behavior, however, with lower intensity. The increase in bias values was observed to a lesser extent for SI methods (SPLINE, STINE and KALMAN) and mainly for MI and ML, suggesting a small imputed data variance loss in relation to observed ones. In addition, the methods that lost most quality for R 2 and d2 indicators were SI (AM, M, SLR, MLR and RW) followed by SI (SPLINE, STINE and KALMAN) and with little loss for MI and ML.This confirms MI and ML methods good performance, being that the last corroborates with the efficiency proven by Junger (2008) and reforced by Burger et al. (2018).
Both AM and M methodologies are central tendency measures. However, the second is a better alternative for variables that do not follow a Normal distribution, that is, the median better represents the central tendency of a distribution that presents large deviations from the Normal distribution (McKnight et al., 2007;Pinto, 2013). Therefore, for variables that do not follow a Normal distribution, such as the river average flow rate, under study, the imputation methodology by AM is not efficient, especially for faults above 5% even under natural logarithm transformation. This conclusion was also found in Pinto (2013), where the detrimental effect of this method on the imputation of the variable PM10 (Inhalable Particulate Matter with aerodynamic diameter less than 10 µm) is reduced only in samples with a small percentage of missing data (5%). The efficiency of the imputation methodology by M decreases slightly for faults above 10%. Hence, this method is not indicated for these situations. In the hydrology area, the idea of using a single imputation method to recover the missing value (mainly AM and M) is widely used by several authors, as can be seen Ben Aissia et al. (2017); Gao et al. (2018); Kabir et al. (2019); Norliyana et al. (2017) and Rahman et al. (2017) works. However, despite the method's simplicity the researchers agree that reconstructing the missing value using the same "number" does not reflect the variation that would likely occur if the variables were observed. The real numbers possibly differ from those imputed. Thus, the variation of those same variables is underestimated.
SLR methodology presents better results in relation to MLR according to all performance indicators and regardless of the number of failures. MLR involves more independent variables to predict an adjustment model and, theoretically, its complexity results in a higher efficiency than the corresponding SLR. However, for the SLR methodology, an algorithm was developed in which the independent station was taken as a function of the correlation coefficient r higher value and, thus, alternate stations were candidates according to the replication progress. For the imputation methodology via MLR, based on the correlation coefficient r (which assumed values greater than 0.924 between the station pairs), all the other stations (E1, E2, E3, E4 and E5) were considered as adjusted model independent variables, for the dependent variable E6. This justifies the better performance of SLR over MLR in this study. The imputation methodology via RW, which is widely used in hydrological variables series imputation studies, showed superior efficiency than the corresponding to other SI (AM, M, SLR and MLR) methods only for the 5% failure rate and presented less efficiency than those corresponding to SI (SPLINE, STINE and KALMAN) in addition to MI and ML methods, independently of the missing data percentage. Oliveira et al. (2010) concluded that the MLR method showed, according to the performance indexes BIAS, MAE and r, the best results in annual rainfall faults imputation for six stations located in the state of Goiás, Brazil, followed by vector regional combined with multiple potential regression (MPR), RW, regional vector combined with multiple linear regression, regional weighting based on linear regression, regional vector combined with regional weighting and, lastly, regional vector method. This result corroborates those found by (i) Ventura et al. (2016), that, considering BIAS, MAE and r, verified the superiority of the MLR method for temperature, relative humidity and dew point meteorological variables, considering Manaus, Rio de Janeiro and Porto Alegre Brazilian cities stations series; (ii) those found by Mello et al. (2017) that, according to the MAE, for precipitation series corresponding to eight rainfall stations located in Santa Catarina state, Brazil, the MLR method was the one that presented the best performance, followed by RW, regional weighting based on linear regressions and, finally, SLR; and (iii) those found by Bier and Ferraz (2017) who concluded that, for average compensated temperature for meteorological stations located in Rio Grande do Sul state, Brazil, the MLR and the RW methods were the most adequate for missing data estimates, whereas for precipitation there was no method that could be considered best. According to the same authors, this was due to the fact that neighboring stations' precipitation data was less correlated if compared to average temperature compensated data, generating estimates less related to the observed series and presenting larger estimative errors.
Low BIAS, RMSE and MAPE values and high R² and d2 values confirm univariate single imputation methods (SPLINE, STINE and KALMAN) good performance, however, inferior to MI and ML multiple multivariate imputation methods. Additionally, it is possible to verify a slight superiority of imputation quality via ML methodology in relation to MI, regardless of missing data proportion. Figures 3 and 5 show the simulated results for 5 and 40% missing data proportions and Figures 4 and 6 show a visual comparison between observed and imputed data. This comparison shows good performance by all methodologies for the 5% missing data proportion. Therefore, for this smaller data failure percentage, any imputation methodology could be assumed without causing significant changes in the time series characteristics. Thus, caution is suggested in the use of AM, even for small failure proportions. However, for the 40% proportion, the analysis shows SI methods (SPLINE, STINE and KALMAN) good performance and superiority of MI and ML methods, while for other SI methodologies (AM, M, SLR, MLR and RW) imputed data variability underestimation is notable. Therefore, it can be concluded that the use of such methodologies is not recommended to carry out imputation with high percentages of missing data, as the time series characteristics can be extremely altered.     Table 5 shows relative difference results between the quality indicators used in imputation methods evaluation in this study for Doce River flow series in a scenario where missing data percentage increased from 5% to 40%. As can be seen, all methodologies tested in this study showed a loss of quality with the missing data proportion increase and, based on that, the table was organized in order to order the methodologies compared in this study from the one with greatest to the least loss of quality according to indicators used. From this, we can conclude that, in cases where there are support stations for imputation, the best methodologies for imputing missing data in flow data are the multivariate ones MI and ML, while, in cases where there are no support stations for imputation, the best options of imputation methodologies to be used are the univariate ones Kalman Smoothing and Stineman interpolation. Based on these results, it can be said that, multivariate MI and ML methodologies application in river flow rates missing data imputation, confirmed their good performance, already proven for imputations considering other variables (Junger, 2008;Pinto, 2013;Nunes et al., 2010;Vinha and Laros, 2018), as occurred in cases of application univariate methodologies KALMAN and STINE whose good performance had also been confirmed to handle missing data in exchange rate data (Burger et al., 2018). Therefore, it is suggested that both methodologies should be taken as quality references for fluviometric variables missing data imputation, especially the ML. Lastly, Table 6 shows the main advantages and disadvantages for each methodology evaluated in the present study.
In this way, fluviometric monitoring and historical series data consistency as well as the adequate missing data treatment are fundamental for water-resource studies and projects, turning possible adequate water-resource use planning, river basin management, flow forecasting, industrial and agricultural public supply, navigation, basic sanitation, water concession and granting, academic studies and water use conflicts resolution.

CONCLUSIONS
Reliable flow series are essential for studies related to water resources. However, incomplete series can result even from the operation of large monitoring networks with data quality control. This fact indicates the importance of techniques that allow imputation of missing data. However, these techniques have been little used in hydrological studies. In this study, it was verified that, for the Doce River flow series, the multiple imputation and maximum likelihood methodologies, considered as references in the imputation of missing data for series Rev. Ambient. Água vol. 17 n. 2, e2795 -Taubaté 2022 involving several environmental and social variables, performed better than those most commonly utilized. It was also concluded that the improvement in the results of the imputations in relation to the others increases as the flow series present greater proportions of the missing data. Simulations showed that, for 5% missing data, all the imputation methods present good performances. For this small missing data proportion, statistical efficiency is not compromised. Even so, for this missing data amount, the arithmetic mean methodology, which presents the worse results, should be avoided. The imputation method quality begins to become different for proportions above 10%. The arithmetic mean, simple and multiple linear regression and regional weighting imputation methodologies may be considered limited, even though the last three methods are multivariate and under the condition that the stations showed high homogeneity among them. Therefore, multivariate methods that consider the single imputation principle were not efficient. On the contrary, univariate methodologies Spline interpolation, Stine interpolation and Kalman Smoothing being single methods showed good results in imputation at high proportions of missing data. Also, multiple imputation and maximum likelihood multivariate methods were shown to be more robust and accurate for missing data treatment of the variable under study, being recommended for failure imputation procedures. For this, it is fundamental that the variables' data present homogeneity among them. Therefore, such methods are recommended for the proper treatment of missing data, which allows us to guarantee quality in imputed time series, thus being able to subsidize the planning and control of water resources.
The method of imputation by multiple linear regression can be improved, since the adjustment of ni < 5 neighboring stations were based on the correlation coefficient and, according to the replication progress of replication, it is possible that one or more stations do not present a qualified correlation coefficient. As for multiple imputation, it is concluded that its efficiency can be increased as an ideal m imputation value is established for the variable under study, considering different proportions of missing data. Therefore, it is recommended that future studies be developed because the procedure efficiency is related to the number of m databases generated and, in general, the higher the percentage of missing values, the greater is m. It is also recommended that studies be developed regarding the imputation of missing data for river flows in daily and monthly series.