FUNCTIONAL REGRESSION MODELS FOR THE PREDICTION OF CoViD-19

1 Background

Datasets
Variables definition

2 Methodology

Filtering
Fit the model
Predictions

3 Shiny app

Hosting and deployment
R packages
The R Graph Gallery

Objective

Predict the growth rate at horizon \(k\) from the growth rate observed over the last \(l\) days.

SHINY APP http://modestya.securized.net/covid19prediction

Background

The motivation for using a functional regression model to predict CoViD-19 cases comes from the classic SIR epidemiological model, whose equation for infections is
\[\frac{dI}{dt}=\beta S I-\gamma I\]
where \(\beta\) is the infection rate, \(\gamma\) is the recovery rate, \(S\) is the population susceptible to infection and \(I\) is the number of infections. Dividing both sides by \(I\) gives
\[GR=\frac{dI/dt}{I}=\beta S - \gamma\]
which shows that the growth rate of infections (\(GR\)) is a function of \(S\). Extending this idea and discretizing the equation, we can pose the following functional regression model:
\[ GR_{t+h}=f(S_{t-l}^t,GR_{t-l}^t,I_{t-l}^t,\ldots)+\epsilon_{t+h} \]
where the notation \(X_{t-l}^t\) refers to the process \(X\) on the interval \([t-l,t]\), \(h\) is the prediction horizon, and
\[GR_{t+h}=\frac{I_{t+h}-I_{t}}{I_{t}+c}\]
(a constant \(c\) is added to the denominator to avoid division by zero).
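As a rough illustration of the link between the SIR equation and the discrete growth rate, the following Python sketch (the app itself is built in R) simulates the model with unit Euler steps and computes \(GR_{t+h}\). The names `simulate_sir` and `growth_rate` are hypothetical, the parameter values are arbitrary, and the susceptible equation \(dS/dt=-\beta S I\) is the standard SIR companion equation, which the text above does not state.

```python
def simulate_sir(beta, gamma, s0, i0, steps, dt=1.0):
    """Euler steps of dS/dt = -beta*S*I and dI/dt = beta*S*I - gamma*I.

    S and I are population fractions; returns a list of (S, I) pairs.
    """
    s, i = s0, i0
    path = [(s, i)]
    for _ in range(steps):
        new_infections = beta * s * i * dt
        s, i = s - new_infections, i + new_infections - gamma * i * dt
        path.append((s, i))
    return path


def growth_rate(infected, h, c=1.0):
    """Discrete growth rate (I_{t+h} - I_t) / (I_t + c) for every valid t."""
    return [
        (infected[t + h] - infected[t]) / (infected[t] + c)
        for t in range(len(infected) - h)
    ]


# Arbitrary illustrative parameters, not estimates for any real epidemic.
path = simulate_sir(beta=0.3, gamma=0.1, s0=0.99, i0=0.01, steps=30)
infected = [i for _, i in path]
one_step = growth_rate(infected, h=1, c=0.0)

# Largest gap between the one-step growth rate and beta * S_t - gamma.
gap = max(abs(g - (0.3 * path[t][0] - 0.1)) for t, g in enumerate(one_step))
```

With unit steps, \(h=1\) and \(c=0\), the one-step growth rate coincides with \(\beta S_t-\gamma\) up to rounding, which is the discrete counterpart of the identity above. Setting \(c=0\) is safe here only because \(I_t\) stays positive in the simulation.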

The function \(f\) links the scalar response to the functional covariates; the literature on functional data offers several choices for it.

APP version 1

Datasets

World country dataset. From the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE), updated daily in DATA1. The latest time series files (since 22/3) are:

time_series_covid19_confirmed_global.csv for cumulative confirmed cases.
time_series_covid19_deaths_global.csv for cumulative deaths.
time_series_covid19_recovered_global.csv for cumulative recovered cases.

Spanish region dataset. Confirmed, hospitalised, intensive care unit (ICU), deaths and recovered cases by Autonomous Community of Spain, available at Situation of COVID-19 in Spain from Instituto de Salud Carlos III. Data updated daily in DATA2. The structure of this file is not stable over time. The current variables are: CCAA, FECHA, CASOS, PCR+, TestAc+, Hospitalizados, UCI, Fallecidos, Recuperados. Please read the notes at the end of the CSV.
Italian region dataset. Confirmed, hospitalised, intensive care unit (ICU), deaths and recovered cases by region of Italy, available at COVID-19 Italia - Monitoraggio situazione from Presidenza del Consiglio dei Ministri - Dipartimento della Protezione Civile. Data updated daily in DATA3.
Catalonia region dataset. These data come from the RSAcovid19 registry of the Health Department. They give the cumulative positive cases, that is, those that tested positive on a diagnostic test (PCR or rapid test), and the cumulative suspected cases, that is, people who presented symptoms at some point and were classified by a health professional as a possible case but have no positive diagnostic test (PCR or rapid test). The surveillance service activated all the cases, and each case is assigned to the area of residence recorded on the person’s health card. The information is updated daily as open data at Dades obertes de Catalunya.
Madrid region dataset. Available at Portal de Datos Abiertos de la Comunidad de Madrid and new app.

The availability of reliable, up-to-date data determines which training sample can be selected and the resolution at which predictions can be made.
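The global time series files listed above are stored in wide format, with one row per province or country and one column per date. The Python sketch below shows one possible way to aggregate such a file by country. It assumes the column layout `Province/State, Country/Region, Lat, Long` followed by date columns; the function name `cumulative_by_country` is hypothetical and the sample rows are made-up toy values, not real counts.

```python
import csv
import io
from collections import defaultdict

# Toy excerpt in the assumed wide layout; the numbers are invented.
SAMPLE = """Province/State,Country/Region,Lat,Long,3/20/20,3/21/20,3/22/20
,Spain,40.0,-4.0,100,150,210
Hubei,China,30.9,112.2,50,55,60
Beijing,China,40.1,116.4,5,6,8
"""


def cumulative_by_country(text):
    """Sum the cumulative series of all provinces belonging to each country."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    dates = header[4:]  # the first four columns are identifiers
    totals = defaultdict(lambda: [0] * len(dates))
    for row in reader:
        country = row[1]
        for position, value in enumerate(row[4:]):
            totals[country][position] += int(value)
    return dates, dict(totals)


dates, totals = cumulative_by_country(SAMPLE)
```

Aggregating at country level is only one choice; keeping the province rows would give a finer resolution where the data quality allows it.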

Other datasets

Variables definition

Cumulative cases at day \(t\): \(x_t^{(j)}\) with \(j\in \{1,\ldots,5\}\) denoting, respectively, confirmed, deaths, hospitalized, ICU and recovered cases.
New cases at day \(t\): \(x_t^{(j)} - x_{t-1}^{(j)}\).
Growth rate of cases - H\(_k\): \(r_{k}^{(j)}(t)=\frac{x_{t+k}^{(j)} - x_{t}^{(j)}}{x_{t}^{(j)} + 1}\) for \(t=\ldots,t_0-1\) and \(k=1,\ldots,5\).
Active cases at day \(t\): \(a_t = x_{t}^{(1)} - x_{t}^{(2)} - x_{t}^{(5)}\).
Hospitalised and ICU cases are only available for the regions of Spain.

Note: new active cases can be negative on some days, namely when the new recoveries plus new deaths on that day exceed the new confirmed cases.
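The definitions above translate directly into code. The following Python sketch uses hypothetical function names and made-up toy series; it also reproduces the situation described in the note, where new active cases turn negative.

```python
def new_cases(cumulative):
    """Daily new cases x_t - x_{t-1} from a cumulative series."""
    return [cumulative[t] - cumulative[t - 1] for t in range(1, len(cumulative))]


def growth_rate_h(cumulative, k):
    """Growth rate r_k(t) = (x_{t+k} - x_t) / (x_t + 1) for every valid t."""
    return [
        (cumulative[t + k] - cumulative[t]) / (cumulative[t] + 1)
        for t in range(len(cumulative) - k)
    ]


def active_cases(confirmed, deaths, recovered):
    """Active cases a_t = confirmed - deaths - recovered, day by day."""
    return [c - d - r for c, d, r in zip(confirmed, deaths, recovered)]


# Toy cumulative series (invented values).
confirmed = [100, 120, 150, 155]
deaths = [2, 3, 5, 9]
recovered = [10, 15, 30, 45]

active = active_cases(confirmed, deaths, recovered)
new_active = new_cases(active)
```

On the last toy day there are 5 new confirmed cases but 4 new deaths and 15 new recoveries, so the active count drops by 14.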

Methodology

In line with the idea of “flattening the curve”, we consider the curve \(r_{1}^{(j)}(t)\), which captures how the growth rate changes over time. We also smooth this signal to dampen sudden changes in notification (such as the weekend effect).

Objective: Predict the growth rate at horizon \(k\) from the last 15 days of the growth rate H\(_1\):
\[R_{1}(0)=\{r_1^{(j)}(-14),\ldots,r_1^{(j)}(0)\}\]
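One way to build the predictor \(R_1(0)\) is sketched below in Python. The text does not say which smoother is used, so the trailing 7-day moving average here is purely an assumption, chosen because it averages out a weekly reporting pattern; the function names are hypothetical.

```python
def moving_average(values, window=7):
    """Trailing moving average; early points use the values available so far."""
    smoothed = []
    for t in range(len(values)):
        chunk = values[max(0, t - window + 1): t + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def last_window(values, length=15):
    """Return the last `length` values, i.e. r_1(-14), ..., r_1(0) for length 15."""
    if len(values) < length:
        raise ValueError("series is shorter than the requested window")
    return values[-length:]


def build_predictor(daily_growth_rates, length=15, window=7):
    """Smooth the H_1 growth rate and keep its most recent `length` days."""
    return last_window(moving_average(daily_growth_rates, window), length)
```

A call such as `build_predictor(rates)` on a series of at least 15 daily growth rates returns the 15 smoothed values that play the role of the functional covariate.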

Algorithm steps:

1. Filtering

Data from some countries or regions are excluded because of inconsistencies in their records: “Diamond Princess”, “Iran”, “Japan”, “Bahrain” and “Qatar”.
For the \(r_{t+k}^{(1)}\) response (confirmed cases), we use the countries or regions with more than 200 confirmed cases at time \(t\).
For the \(r_{t+k}^{(2)}\) response (deaths), we use the countries or regions with more than 30 deaths at time \(t\).
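The filtering rules above can be sketched in Python as follows. The function name `eligible_regions` and the region labels and counts in the example are hypothetical; the excluded names and the two thresholds are the ones given in the text.

```python
EXCLUDED = {"Diamond Princess", "Iran", "Japan", "Bahrain", "Qatar"}
MIN_CONFIRMED = 200  # threshold for the confirmed-cases response
MIN_DEATHS = 30      # threshold for the deaths response


def eligible_regions(cumulative_at_t, threshold):
    """Names of regions that are not excluded and strictly exceed the threshold."""
    return sorted(
        name
        for name, count in cumulative_at_t.items()
        if name not in EXCLUDED and count > threshold
    )


# Invented cumulative confirmed counts at time t.
confirmed_at_t = {"Region A": 1500, "Region B": 180, "Region C": 200, "Iran": 20000}
training_regions = eligible_regions(confirmed_at_t, MIN_CONFIRMED)
```

Because the rule is "more than 200", a region with exactly 200 confirmed cases is left out of the training sample.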

Data incidents affecting only the Instituto de Salud Carlos III (ISCIII) file

The file obtained from Instituto de Salud Carlos III (ISCIII) has undergone changes over time in the units of its variables. Typically, the historical data are not reconstructed.

Apr 4th, 2020. Hospitalized - Extremadura. Adjustment (-36).
Apr 8th, 2020. ICU - C. Valenciana. Cumulative instead of prevalence.
Apr 11th, 2020. Hospitalized - Castilla La Mancha. Cumulative instead of prevalence.
Apr 12th, 2020. ICU - Castilla La Mancha. Cumulative instead of prevalence.
Apr 16th, 2020. ICU - Castilla y León. Cumulative instead of prevalence.
Apr 16th, 2020. ICU - Aragón. Adjustment (-51).
Apr 17-18th, 2020. Recovered - Galicia. Missing data.
Apr 23rd, 2020. ICU - Extremadura. Adjustment.
Apr 26th, 2020. ICU and Hospitalized - Madrid. Cumulative instead of prevalence.
Apr 28th, 2020. ICU - Galicia. Cumulative instead of prevalence (+235).
Apr 28th, 2020. Recovered - Galicia. Increased the number of recovered people at home (+3552).
Apr 28th, 2020. Hospitalized - Galicia. Adjustment (-22).
Apr 29th, 2020. Confirmed - Galicia. Adjustment (-769).
May 21st, 2020. Due to the new surveillance and control strategy, the notification by the Spanish regions (CCAA) has changed. The predictions for Spain and its regions will be updated when this information becomes available again (source: ISCIII).

2. Fit the model: Functional regression models

All these models are implemented in the fda.usc package (Febrero-Bande and Oviedo de la Fuente 2012).

fregre.lm, Linear Model (FLM) (Cardot, Ferraty, and Sarda 1999): a linear operator on the functional space is used, \(f(X_{a}^b,Y_{c}^d,\ldots)=\alpha+\int_{a}^b\beta_X(t)X(t)dt+\int_c^d\beta_Y(t)Y(t)dt+\ldots\)
fregre.gsam, Spectral Additive Model (FSAM) (Müller and Yao 2008): given a finite representation of the curves on a basis of the Hilbert space, \(X_{t-l+1}^t\approx\sum_{k=1}^{K_X} c_k^X\phi_k^X\), the model is \(f(X_{a}^b,Y_{c}^d,\ldots)=\alpha+\sum_{k=1}^{K_X}f_k^X(c_k^X)+\sum_{k=1}^{K_Y}f_k^Y(c_k^Y)+\ldots\), where the functions \(f_k^X(c_k^X)\) and \(f_k^Y(c_k^Y)\) of the scalar coefficients of the basis representation are smooth.
fregre.gkam, Additive Kernel Model (FKAM) (Febrero-Bande and González-Manteiga 2013): \(f(X_{a}^b,Y_{c}^d,\ldots)=\alpha+f_X(X_{a}^b)+f_Y(Y_{c}^d)+\ldots\), where the function \(f_X\) (resp. \(f_Y\)) is estimated using a functional kernel.
Other models

fregre.gls, Functional GLS model (Oviedo de la Fuente et al. 2018)
fregre.basis.fr, Functional Response Model (Chiou et al. 2004)
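To make the functional linear model (FLM) listed above more concrete, here is a small self-contained Python sketch with a single functional covariate. It does not reproduce how fda.usc fits the model. It assumes, for illustration only, that \(\beta_X(t)=b_0+b_1 t\), so that \(\int\beta_X(t)X(t)dt=b_0\int X(t)dt+b_1\int tX(t)dt\) and the model reduces to ordinary least squares on two integrated scores. All function names are hypothetical.

```python
def integrate(values, step):
    """Trapezoidal rule on an equally spaced grid."""
    return step * (sum(values) - 0.5 * (values[0] + values[-1]))


def scores(curve, grid):
    """Design row [1, int X(t) dt, int t X(t) dt] for one discretized curve."""
    step = grid[1] - grid[0]
    z0 = integrate(curve, step)
    z1 = integrate([t * x for t, x in zip(grid, curve)], step)
    return [1.0, z0, z1]


def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting (assumes a non-singular system)."""
    n = len(rhs)
    m = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_flm(curves, grid, y):
    """Least-squares estimate of [alpha, b0, b1] via the normal equations."""
    design = [scores(curve, grid) for curve in curves]
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


def predict_flm(coef, curve, grid):
    """Predicted scalar response alpha + int beta(t) X(t) dt for a new curve."""
    return sum(c * z for c, z in zip(coef, scores(curve, grid)))
```

In this setting each curve would be a 15-day growth-rate window and `y` the growth rate at horizon \(k\). A richer basis for \(\beta_X\) only adds columns to the design row, at the cost of needing more training curves.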

3. Predictions

Re-estimate the functional models (Step 2) when new data are available (all countries and regions of DATA1, DATA2 and DATA3).
Reconstruct the expected number of cumulative cases and deduce the new cases at each horizon (confirmed, deaths and active).
Calculate the incidence rate by country or region.
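The reconstruction step follows from inverting the growth-rate definition: since \(r_k(t)=\frac{x_{t+k}-x_t}{x_t+1}\), the predicted cumulative count is \(x_{t+k}=x_t+r_k(t)(x_t+1)\). The Python sketch below applies this for each horizon. The function names are hypothetical, and expressing the incidence rate as cases per 100,000 inhabitants is an assumption, since the text does not define it.

```python
def reconstruct_cumulative(x_t, predicted_rates):
    """Cumulative counts x_{t+k} = x_t + r_k * (x_t + 1) for k = 1, ..., K.

    predicted_rates[k - 1] holds the predicted growth rate at horizon k.
    """
    return [x_t + rate * (x_t + 1) for rate in predicted_rates]


def new_cases_by_horizon(x_t, cumulative):
    """New cases at each horizon: x_{t+k} - x_{t+k-1}, starting from x_t."""
    previous = [x_t] + cumulative[:-1]
    return [current - before for current, before in zip(cumulative, previous)]


def incidence_per_100k(cases, population):
    """Cases per 100,000 inhabitants (assumed definition of the incidence rate)."""
    return [100_000 * c / population for c in cases]
```

For example, with \(x_t=999\) and predicted rates of 0.01, 0.02 and 0.03 for horizons 1 to 3, the reconstructed cumulative counts are about 1009, 1019 and 1029, that is, roughly 10 new cases per day.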

More information available in Informest

SHINY APP

Shiny is an R package that makes it easy to build interactive web apps straight from R. You can host standalone apps on a webpage, embed them in R Markdown documents, or build dashboards. You can also extend your Shiny apps with CSS themes, htmlwidgets, and JavaScript actions; see R Shiny.

Shiny apps are easy to write. No web development skills are required.

Hosting and deployment

Shinyapps.io: Host your Shiny apps on the web in minutes with Shinyapps.io. It is easy to use, secure, and scalable. No hardware, installation, or annual purchase contract required. Free and paid options available.

Shiny server: Deploy your Shiny apps and interactive documents on-premises with open source Shiny Server, which offers features such as multiple apps on a single server and deployment of apps behind firewalls.

RStudio server: RStudio Server enables you to provide a browser-based interface to a version of R running on a remote Linux server, bringing the power and productivity of the RStudio IDE to server-based deployments of R.

R packages

R Markdown: Analyze. Share. Reproduce.

htmlwidgets: Embed widgets in R Markdown documents and Shiny web applications

readxl: Read Excel files

jsonlite: A reasonably fast JSON parser and generator, optimized for statistical data and the web

DT::datatable: Creates an HTML widget to display R data objects

foreign: Reading and writing data stored by some versions of ‘Epi Info’, ‘Minitab’, ‘S’, ‘SAS’, ‘SPSS’, ‘Stata’, ‘Systat’, ‘Weka’, and for reading and writing some ‘dBase’ files

tabulizer: Extracts tables from PDFs in R

The R Graph Gallery

ColorBrewer palettes

dygraphs: Automatically plots xts time series objects (or any object convertible to xts).

leaflet: Embed maps in knitr/R Markdown documents and Shiny apps

plotly: Plotly’s R graphing library makes interactive, publication-quality graphs

dplyr: A grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges

rgdal: Bindings for the ‘Geospatial’ Data Abstraction Library

Other R tools

ggplot2: Pyramid plot in R

googleAnalyticsR: R library for working with Google Analytics data

maptools: Exploratory spatial data analysis is a set of techniques to describe and visualize spatial distributions, identify atypical locations or spatial outliers, discover patterns of spatial association, clusters or hot spots, and suggest spatial regimes or other forms of spatial heterogeneity (Dell’arba, 2005: Anselin, 1988)

Funding

This work has been supported by Project MTM2016-76969-P from Ministerio de Economía y Competitividad - Agencia Estatal de Investigación and European Regional Development Fund (ERDF) and IAP network StUDyS from Belgian Science Policy.

Acknowledgements

Thanks to Diego Campanario for creating the Shiny server.

References

Cardot, Hervé, Frédéric Ferraty, and Pascal Sarda. 1999. “Functional Linear Model.” Statistics & Probability Letters 45 (1): 11–22.

Chiou, Jeng-Min, Hans-Georg Müller, Jane-Ling Wang, and others. 2004. “Functional Response Models.” Statistica Sinica 14 (3): 675–94.

Febrero-Bande, Manuel, and Wenceslao González-Manteiga. 2013. “Generalized Additive Models for Functional Data.” Test 22 (2): 278–92. http://dx.doi.org/10.1007/s11749-012-0308-0.

Febrero-Bande, Manuel, and M Oviedo de la Fuente. 2012. “Statistical Computing in Functional Data Analysis: The R Package fda.usc.” Journal of Statistical Software 51 (4): 1–28.

Müller, HG, and F Yao. 2008. “Functional Additive Model.” Journal of the American Statistical Association 103: 1534–44.

Oviedo de la Fuente, Manuel, Manuel Febrero-Bande, María Pilar Muñoz, and Àngela Domínguez. 2018. “Predicting Seasonal Influenza Transmission Using Functional Regression Models with Temporal Dependence.” PloS One 13 (4): e0194250.