Introduction
Aerobiological data (e.g., pollen and spores) are typically measured on a daily basis. Current monitoring relies largely on manual sampling, which can result in incomplete datasets due to malfunctions or voluntary interruptions of sampling (Navares & Aznarte, 2019). Various imputation methods are used to address missing data, including statistical approaches (e.g., linear or moving mean interpolation) and machine learning (ML) techniques (e.g., Convolutional Neural Networks, the k-Nearest Neighbours algorithm) (Navares & Aznarte, 2019; Picornell et al., 2021; Marchetti et al., 2017). However, the application of ML approaches to the imputation of missing aerobiological data has not yet been well explored. In this research, a simulation study was conducted to evaluate, for the first time, the effectiveness of Gappy Singular Value Decomposition (GSVD) in imputing aerobiological datasets, comparing it with a well-known statistical method, the moving mean algorithm. The GSVD method is part of ModelFLOWs-app and has been adapted for the present study.
Aerobiological data
Daily pollen concentrations for the period 2018-2022 were downloaded from the POLLnet Italian aerobiological monitoring network using the “pollnet” R package. Alnus and Poaceae pollen measured at two monitoring stations in northern Italy were selected for the analysis because of their different pollen loads and seasonality. The 2022 pollen season was chosen as the simulation period so that a complete data series was available for the preceding years, ensuring that the data-driven method could be applied.
Imputation methods
The moving mean method of the “AeRobiology” R package, which was developed specifically for aerobiological datasets, was used. The GSVD method was tailored to aerobiological datasets in the present study. Different settings of the GSVD algorithm were tested by varying the initial values assigned to the gaps and the number of modes. Two configurations are presented: gaps initialised with the mean value of the time series, using 5 modes, and gaps initialised by linear interpolation between values of the time series, using 10 modes.
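To make the two approaches concrete, the following Python sketch shows a generic centred moving mean and a generic iterative gappy SVD reconstruction. It is an illustration under stated assumptions, not the AeRobiology or ModelFLOWs-app implementation: the function names, the window size, the convergence tolerance, and the arrangement of the data as a matrix (for example, days in rows and years in columns) are all hypothetical choices made for this example.

```python
import numpy as np


def moving_mean_fill(series, window=5):
    """Fill each NaN with the mean of the observed values in a centred window.

    Generic sketch; the window size is an assumption, and a gap wider than
    the window is left as NaN.
    """
    x = np.asarray(series, dtype=float)
    filled = x.copy()
    half = window // 2
    for i in np.flatnonzero(np.isnan(x)):
        neighbours = x[max(0, i - half): i + half + 1]
        if not np.all(np.isnan(neighbours)):
            filled[i] = np.nanmean(neighbours)
    return filled


def gappy_svd_fill(matrix, n_modes=5, init="mean", max_iter=200, tol=1e-8):
    """Iteratively reconstruct NaN entries with a truncated SVD.

    Assumes the data are arranged as a 2-D matrix (hypothetical layout).
    init="mean" starts the gaps at the overall mean; init="linear" starts
    them from column-wise linear interpolation.
    """
    a = np.asarray(matrix, dtype=float)
    gaps = np.isnan(a)
    filled = a.copy()
    filled[gaps] = np.nanmean(a)  # initial guess: mean of observed values
    if init == "linear":
        rows = np.arange(a.shape[0])
        for j in range(a.shape[1]):
            g = gaps[:, j]
            if g.any() and not g.all():
                filled[g, j] = np.interp(rows[g], rows[~g], a[~g, j])
    for _ in range(max_iter):
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        k = min(n_modes, s.size)
        approx = (u[:, :k] * s[:k]) @ vt[:k]  # rank-k reconstruction
        change = np.linalg.norm(approx[gaps] - filled[gaps])
        filled[gaps] = approx[gaps]  # observed entries are never modified
        if change <= tol * max(1.0, np.linalg.norm(filled)):
            break
    return filled
```

In this sketch only the gap entries are updated at each iteration, so the observed values pass through unchanged; the number of retained modes controls how much structure the reconstruction can capture.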
Simulation scenarios
For each pollen type and station, 12 simulation scenarios were generated by combining 3 proportions of missing data (NAs: 5%, 10%, 25%) with 4 gap lengths (number of consecutive missing days: 3, 5, 7, 10). For each simulation scenario, 100 simulated datasets were produced. This gave a total of 48 simulations (12 scenarios × 2 stations × 2 pollen types), each comprising 100 time series for imputation.
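The scenario grid described above can be sketched in Python as follows. This is only one plausible way to place gaps, not the procedure used in the study: the function names, the rounding of the number of gaps, and the rule that gaps never touch or overlap are assumptions made for illustration. Because the number of gaps is rounded, the realised proportion of missing days only approximates the target.

```python
from itertools import product

import numpy as np

PROPORTIONS = (0.05, 0.10, 0.25)  # target share of missing days
GAP_LENGTHS = (3, 5, 7, 10)       # consecutive missing days per gap
SCENARIOS = list(product(PROPORTIONS, GAP_LENGTHS))  # 3 x 4 combinations


def insert_gaps(series, proportion, gap_length, rng):
    """Return a copy of `series` with fixed-length, non-touching NaN gaps."""
    x = np.asarray(series, dtype=float).copy()
    n_gaps = max(1, round(proportion * x.size / gap_length))
    n_observed = x.size - n_gaps * gap_length
    if n_gaps > n_observed + 1:
        raise ValueError("series too short for the requested gaps")
    # Pick distinct slots between observed days, so that consecutive gaps
    # are always separated by at least one observed day.
    slots = np.sort(rng.choice(n_observed + 1, size=n_gaps, replace=False))
    starts = slots + np.arange(n_gaps) * gap_length
    for start in starts:
        x[start:start + gap_length] = np.nan
    return x


def simulate(series, n_replicates=100, seed=0):
    """Build `n_replicates` gappy copies of `series` for every scenario."""
    rng = np.random.default_rng(seed)
    return {
        (p, g): [insert_gaps(series, p, g, rng) for _ in range(n_replicates)]
        for p, g in SCENARIOS
    }
```

Calling `simulate` once per pollen type and station would yield the 12 scenarios with 100 gappy series each.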
Imputation and accuracy evaluation
The moving mean and GSVD methods were applied to the gappy datasets (the 100 incomplete time series for each pollen type, station, and simulation scenario). The Root Mean Square Error (RMSE) was used as an indicator of imputation accuracy, comparing the reconstructed datasets with the observed time series. After each imputation, RMSE was calculated for each repetition and pollen time series. Finally, the median RMSE was calculated for each imputation method and each combination of missing-data proportion and gap length. Pollen variability, which can affect imputation accuracy, was also analysed using the Variation Index (VIn) (Picornell et al., 2021), an indicator of variability in pollen concentrations between consecutive days. Additionally, multiple linear regression models were fitted to further investigate the relationship of log-transformed RMSE with imputation method and pollen variability, accounting for gap length and proportion of missing values.
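The accuracy summary described above can be sketched as follows. The function names and the optional `mask` argument are hypothetical: the text does not specify whether RMSE was computed over the whole series or only over the imputed days, so the sketch supports both.

```python
import numpy as np


def rmse(observed, imputed, mask=None):
    """Root Mean Square Error between observed and imputed values.

    If `mask` is given, only the entries where it is True (for example,
    the days that were removed and then imputed) enter the calculation.
    """
    observed = np.asarray(observed, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    if mask is None:
        mask = np.ones(observed.shape, dtype=bool)
    mask = np.asarray(mask, dtype=bool)
    diff = imputed[mask] - observed[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


def median_rmse(observed, imputed_replicates, masks=None):
    """Median RMSE across the replicates of one scenario and method."""
    if masks is None:
        masks = [None] * len(imputed_replicates)
    scores = [rmse(observed, imp, m)
              for imp, m in zip(imputed_replicates, masks)]
    return float(np.median(scores))
```

The median is less sensitive than the mean to a few replicates in which a gap happens to fall on a sharp pollen peak.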
Results
The results were promising: GSVD performed as well as the well-established moving mean method, demonstrating strong generalization capabilities across different data types. However, the imputation error was driven primarily by pollen characteristics and location, regardless of the imputation method used. High variability in pollen concentrations and the distribution of missing data both reduced imputation accuracy.
An explanation of this paper is also available in the following video:
The datasets mentioned above and the code to run the GSVD method can be downloaded at the following link.
Funding
A.M. received grants to conduct the MEETOUT study from the European Union through the Italian Ministry of University and Research under the ESF REACT-EU Green and Innovation funding programme (Ministerial Decree 1061/2021) and the NextGenerationEU funding programme (Ministerial Decree 737/2021). Article processing charges were supported by the special fund at the University of Verona dedicated to Open Access publications. S.L.C. and A.C. acknowledge the grants PID2023-147790OB-I00, TED2021-129774B-C21 and PLEC2022-009235 funded by MCIN/AEI/10.13039/501100011033 and by the European Union “NextGenerationEU”/PRTR. The authors acknowledge the MODELAIR and ENCODING projects, which have received funding from the European Union’s Horizon Europe research and innovation programme under the Marie Skłodowska-Curie grant agreements No. 101072559 and No. 101072779, respectively. The results of this publication reflect only the view of the author(s) and do not necessarily reflect those of the European Union. The European Union cannot be held responsible for them. A.C. acknowledges the support of Universidad Politécnica de Madrid under the ‘Programa Propio’ programme. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. No additional external funding was received for this study.