
TL;DR

This is a longer article explaining the workings and the choices behind the nowcasting model and its advantages and limitations.

As Before…

As in the Get Started vignette, we start by loading the package and its lazy data:

library(nowcaster)
# Loading Belo Horizonte SARI data set
data(sragBH)

Non-structured data

The Get Started example uses non-structured data; here we give a more detailed description of this type of data and how it changes the nowcasting.

When we call the nowcasting_inla() function with its default parameterization, it takes the data and estimates in the non-structured form. The estimation fits a negative binomial distribution, $NegBinom(\lambda_{t,d}, \phi)$, to the case count at time $t$ with delay $d$, where $\phi$ is the dispersion parameter. The rate $\lambda_{t,d}$ is then parameterized in a log-linear form with a constant term plus structured delay random effects and structured time random effects. This is the premise of nowcasting: we model the distribution of the case count at time $t$ given the counts already observed at that time and at past times, and given the delay.

Hence, the model is given by the following:

$$\begin{equation} Y_{t,d} \sim NegBinom(\lambda_{t,d}, \phi), \\ \log(\lambda_{t,d}) = \alpha + \beta_t + \gamma_d, \\ \beta_t - 2\beta_{(t-1)} + \beta_{(t-2)} \sim N(0,\tau_{\beta}); \quad \gamma_d - \gamma_{(d-1)} \sim N(0, \tau_{\gamma}), \\ t=1,2,\ldots,T, \\ d=1,2,\ldots,D, \end{equation}$$

where the intercept $\alpha$ follows a Gaussian distribution with a very large variance, $\beta_t$ follows a second-order random walk with precision $\tau_\beta$, and $\gamma_d$ a first-order random walk with precision $\tau_\gamma$. The model is then completed by INLA's default prior distributions for $\phi$, $\tau_\beta$, and $\tau_\gamma$. See the nbinomial, rw1 and rw2 INLA help pages.
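
To make this structure concrete, the sketch below shows one way the same likelihood and random effects could be written as an INLA formula. It is only an illustration of the model structure, assuming a data set with columns Y, Time and delay (as in the data element returned later in this vignette); it is not the exact call that nowcasting_inla() builds internally.

library(INLA)

# Illustrative sketch only: negative binomial counts with an RW2 effect over
# time and an RW1 effect over delay, mirroring the equation above
sketch_formula <- Y ~ 1 +
  f(Time,  model = "rw2") +  # beta_t, second-order random walk in time
  f(delay, model = "rw1")    # gamma_d, first-order random walk in delay
# fit <- inla(sketch_formula, family = "nbinomial",
#             data = nowcasting_bh_no_age$data)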

The call of the function is straightforward: it simply needs a data set as input, here the LazyData loaded in the package namespace. The function has 3 mandatory parameters: dataset, to pass the data set to be nowcasted; date_onset, to pass the name of the column with the date of onset of symptoms; and date_report, to pass the name of the column with the date on which the case was reported. Here, these columns are “DT_SIN_PRI” and “DT_DIGITA”, respectively. Two dates are needed because the delay is modeled as part of the nowcasting estimation, whereas nowcasting models usually assume a delay distribution and apply it to the observed cases.

nowcasting_bh_no_age <- nowcasting_inla(dataset = sragBH, 
                                        date_onset = DT_SIN_PRI, 
                                        date_report = DT_DIGITA)
head(nowcasting_bh_no_age$total)
#> # A tibble: 6 × 7
#>    Time dt_event   Median    LI    LS   LIb   LSb
#>   <int> <date>      <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1    17 2021-12-13    444  442    448   443  445 
#> 2    18 2021-12-20    632  626    641   630  634 
#> 3    19 2021-12-27    736  727    749   733  740 
#> 4    20 2022-01-03    759  746.   779   754  765.
#> 5    21 2022-01-10    879  861    903   872  887 
#> 6    22 2022-01-17    786  765    816   778  795

The above call returns only the nowcasting estimate and its credible intervals at two credibility levels: LIb and LSb are the lower and upper limits of the 50% credible interval, and LI and LS are the lower and upper limits of the 95% credible interval.
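
As a quick illustration of how these columns can be used (plain data manipulation, not part of the package API), the interval widths per week can be computed as:

# Width of the 50% and 95% credible intervals for each nowcast week
nowcasting_bh_no_age$total |> 
  dplyr::mutate(width_50 = LSb - LIb, 
                width_95 = LS - LI) |> 
  head()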

The nowcasting_inla() function also has the option to return the data over the window of action of the model: if the data.by.week parameter is set to TRUE, the second element of the output list contains the data summarized by week.

nowcasting_bh_no_age <- nowcasting_inla(dataset = sragBH, 
                                        date_onset = DT_SIN_PRI, 
                                        date_report = DT_DIGITA, 
                                        data.by.week = T)
head(nowcasting_bh_no_age$data)
#> # A tibble: 6 × 4
#>   dt_event   delay     Y  Time
#>   <date>     <dbl> <dbl> <int>
#> 1 2021-08-23     0     8     1
#> 2 2021-08-23     1    70     1
#> 3 2021-08-23     2    92     1
#> 4 2021-08-23     3    68     1
#> 5 2021-08-23     4    32     1
#> 6 2021-08-23     5    32     1

This element contains the count of cases by each amount of delay. It is known as the delay triangle: if we tabulate the amount of delay against the date of onset of symptoms, we can see the pattern of the delay for the cases.

library(dplyr)

# Keep delays below the 30-week maximum and drop the internal Time index
data_triangle <- nowcasting_bh_no_age$data |> 
  filter(delay < 30) |> 
  arrange(delay) |> 
  select(-Time)

# Delay triangle: onset dates in the last 84 days, delays up to 10 weeks,
# one column per week of delay
data_triangle |> 
  filter(dt_event >= (max(dt_event) - 84),
         delay <= 10) |> 
  tidyr::spread(key = delay, value = Y)
#> # A tibble: 13 × 12
#>    dt_event     `0`   `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`   `9`  `10`
#>    <date>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 2021-12-27     5   102   129    80    68   103    92    38    34    26    16
#>  2 2022-01-03    14    94   101   108    91   113    74    63    28    15    21
#>  3 2022-01-10     0    98   151   102   140   116    89    68    32    26    19
#>  4 2022-01-17     4   128   106   131    98    90    78    51    21    29    NA
#>  5 2022-01-24     9    88   121    81    63    71    28    37    30    NA    NA
#>  6 2022-01-31     6    76    90    80    77    38    33    39    NA    NA    NA
#>  7 2022-02-07    12    71    77    60    33    36    36    NA    NA    NA    NA
#>  8 2022-02-14     5    77    60    48    69    45    NA    NA    NA    NA    NA
#>  9 2022-02-21    16    52    81    39    29    NA    NA    NA    NA    NA    NA
#> 10 2022-02-28     7    57    52    39    NA    NA    NA    NA    NA    NA    NA
#> 11 2022-03-07     5    57    75    NA    NA    NA    NA    NA    NA    NA    NA
#> 12 2022-03-14     6    57    NA    NA    NA    NA    NA    NA    NA    NA    NA
#> 13 2022-03-21     3    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA

Above we look at the cases with at most 10 weeks of delay and with onset dates at most 84 days (12 weeks) before the latest date. By default, a maximum delay of 30 weeks is considered in the nowcasting estimation.

If this element is grouped and summarized by the date of onset of symptoms, here DT_SIN_PRI, we recover the observed epidemiological curve. To exemplify it, we plot the estimate and the epidemiological curve together.

library(ggplot2)

data_by_week <- nowcasting_bh_no_age$data |> 
  dplyr::group_by(dt_event) |> 
  dplyr::reframe(
    observed = sum(Y, na.rm = T)
  ) |>
  dplyr::filter(dt_event >= max(dt_event)-270)

nowcasting_bh_no_age$total |> 
  filter(dt_event >= (max(dt_event)-270)) |> 
  ggplot(aes(x = dt_event, y = Median, col = 'Nowcasting')) +
  geom_line(data = data_by_week, 
            aes(x = dt_event, y = observed, col = 'Observed'))+
  geom_ribbon(aes(ymin = LI, ymax = LS), color = NA, alpha = 0.2, show.legend = F)+
  geom_line()+
  theme_bw()+
  theme(legend.position = "bottom", axis.text.x = element_text(angle = 90)) +
  scale_color_manual(values = c('grey50', 'black'), name = '')+
  scale_x_date(date_breaks = '2 weeks', date_labels = '%V/%y', name = 'Date in Weeks')+
  labs(x = '', y = 'Nº Cases')

Structured data, Age

The first improvement on the baseline model for nowcasting with non-structured data is to fit the same model within categorical classes of the case counts: due to the epidemiological course of SARS-CoV-2, we expect different age groups to have different delay distributions. We call this the structured data, as the data now has identifiers for time, delay, and the age class of the cases. For the structured data we again fit a negative binomial distribution to the case count at time $t$ with delay $d$. Differently from the non-structured case, the model now places random effects on the delay distribution and on the time distribution for each of the age classes chosen by the user to break the data. The model now has the form:

$$\begin{equation}Y_{t,d,a} \sim NegBinom(\lambda_{t,d,a}, \phi), \\ \log(\lambda_{t,d,a}) = \alpha_a + \beta_{t,a} + \gamma_{d,a}, \\ \beta_{t,a} - 2\beta_{(t-1),a} + \beta_{(t-2),a} \sim N(0,\tau_{a, \beta}); \quad \gamma_{d,a} - \gamma_{(d-1),a} \sim N(0, \tau_{a, \gamma}), \\ t=1,2,\ldots,T, \\ d=1,2,\ldots,D, \\ a=1,2,\ldots,A, \end{equation}$$

where each age class $a$ has an intercept $\alpha_a$ following a Gaussian distribution with a very large variance. The time-age random effects, $\beta_{t,a}$, follow a joint multivariate Gaussian distribution with separable variance components: an independent Gaussian term for the age classes with precision $\tau_{a,\beta}$ and a second-order random walk term for time with precision $\tau_{\beta}$. Analogously, the delay-age random effects, $\gamma_{d,a}$, follow a joint multivariate Gaussian distribution with separable variance components: an independent Gaussian term for the age classes with precision $\tau_{a,\gamma}$ and a first-order random walk term for the delay with precision $\tau_{\gamma}$. The model is then completed by INLA's default prior distributions for $\phi$, $\tau_{a,\beta}$, $\tau_{a,\gamma}$, $\tau_{\beta}$ and $\tau_\gamma$. See the nbinomial, rw1 and rw2 INLA help pages.
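
For intuition only, the sketch below shows one way such separable age-by-time and age-by-delay effects could be expressed in an INLA formula, using the group argument of f(). The integer index column age_class is hypothetical and used purely for illustration; nowcasting_inla() assembles its own formula internally.

# Illustrative sketch only: separable age x time (RW2) and age x delay (RW1)
# random effects through INLA's "group" mechanism. "age_class" is a
# hypothetical integer index for the age classes, not a package column.
sketch_formula_age <- Y ~ 1 + factor(age_class) +
  f(Time,  model = "rw2", group = age_class,
    control.group = list(model = "iid")) +
  f(delay, model = "rw1", group = age_class,
    control.group = list(model = "iid"))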

This new model corrects the delay taking into account the effects of the age classes and the interaction of each age class with time and delay. The model now needs a parameter indicating which column of the data set will be used to break the data into age classes, and another indicating how the age classes will be split. These are given by the parameters age_col and bins_age. We also pass two additional parameters: data.by.week, to return the epidemiological curve over the window of action of the nowcasting estimate, and return.age, to state whether we want the nowcasting result in two forms, the total aggregated estimate and the age-stratified estimate. The call of the function has the following form:

nowcasting_bh_age <- nowcasting_inla(dataset = sragBH, 
                                     bins_age = "10 years",
                                     data.by.week = T, 
                                     date_onset = DT_SIN_PRI, 
                                     date_report = DT_DIGITA,
                                     age_col = Idade)

Each of the estimates returned by nowcasting_inla() has the same form as in the non-structured case: for the nowcasting estimates, it returns a data.frame with the posterior median and the 50% and 95% credible intervals, (LIb, LSb) and (LI, LS) respectively.
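
For instance, the aggregated estimate from the age-structured run can be inspected in the same way as before:

# Total (aggregated over age classes) nowcasting estimate; the columns are
# the same as in the non-structured case
head(nowcasting_bh_age$total)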

library(ggplot2)

dados_by_week <- nowcasting_bh_age$data |> 
  dplyr::group_by(dt_event) |> 
  dplyr::reframe(
    observed = sum(Y, na.rm = T)
  ) |>
  dplyr::filter(dt_event >= max(dt_event)-270)


nowcasting_bh_age$total |> 
  ggplot()+
  geom_line(aes(x = dt_event, y = Median, 
                col = 'Nowcasting'))+
  geom_line(data = dados_by_week, 
            aes(x = dt_event, y = observed, 
                col = "Observed"))+
  geom_ribbon(aes(x = dt_event, y = Median,
                  ymin = LI, ymax = LS), 
              alpha = 0.2, show.legend = F)+
  theme_bw()+
  theme(legend.position = "bottom", axis.text.x = element_text(angle = 90))+
  scale_color_manual(values = c('grey50', 'black'), name = '')+
  scale_x_date(date_breaks = '2 weeks', date_labels = '%V/%y', name = 'Date in Weeks')+
  labs(x = '', y = 'Nº Cases')

For the sake of completeness, we plot both estimates together to check how different they are.



ggplot()+
  geom_line(data = dados_by_week, 
            aes(x = dt_event, y = observed, 
                color = "Observed"))+
  geom_line(data = nowcasting_bh_no_age$total |> 
              filter(dt_event >= (max(dt_event)-270)),
            aes(x = dt_event, y = Median, 
                color = 'Nowcasting - Non-structured'))+
  geom_ribbon(data = nowcasting_bh_no_age$total |> 
              filter(dt_event >= (max(dt_event)-270)),
              aes(x = dt_event, y = Median,
                  ymin = LI, ymax = LS,
                  fill = "Nowcasting - Non-structured"), 
              alpha = 0.5, show.legend = F)+
  geom_line(data = nowcasting_bh_age$total |> 
              filter(dt_event >= (max(dt_event)-270)),
            aes(x = dt_event, y = Median, 
                color = 'Nowcasting - Structured'))+
  geom_ribbon(data = nowcasting_bh_age$total |> 
              filter(dt_event >= (max(dt_event)-270)),
              aes(x = dt_event, y = Median,
                  ymin = LI, ymax = LS,
                  fill = "Nowcasting - Structured"), 
              alpha = 0.5, show.legend = F)+
  theme_bw()+
  theme(legend.position = "bottom", 
        axis.text.x = element_text(angle = 90))+
  scale_color_manual(values = c("lightblue4", 
                                "orange4", 
                                "black"),
                     name = "")+
  scale_fill_manual(values = c("lightblue1", 
                                "orange1", 
                                "black"),
                    name = "")+
  scale_x_date(date_breaks = '2 weeks', 
               date_labels = '%V/%y', 
               name = 'Date in Weeks')+
  labs(x = '', 
       y = 'Nº Cases')

The nowcasting that uses the age information is narrower and shows a less pronounced decreasing tendency at the end of the time series. This is because each age class has its own nowcasting model acting on it, which can capture effects that are masked when the data is not broken down by age.
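
A rough way to check the first claim is to compare the average width of the 95% credible intervals of the two estimates over the nowcasting window, using the columns described earlier:

# Mean width of the 95% credible interval for each model
dplyr::bind_rows(
  `Non-structured` = nowcasting_bh_no_age$total,
  Structured       = nowcasting_bh_age$total,
  .id = "model"
) |> 
  group_by(model) |> 
  summarise(mean_width_95 = mean(LS - LI))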

Conclusion

In this vignette, we learned how to use nowcasting_inla() and how nowcaster can employ two different models, one that ignores differences in delay by age class and another that accounts for them. Finally, we compared both estimates, showing that the model that uses the age class information produces narrower nowcasting estimates.