Exercises: Introduction to dplyr

Exercises: Introduction to dplyr

Introduction

The dplyr package provides simple “verbs”, functions that correspond to the most common data manipulation tasks, to help you translate your thoughts into code.

These verbs can be organised into three categories based on the component of the dataset that they work with:

  • Rows:
    • filter() chooses rows based on column values.
    • slice() chooses rows based on location.
    • arrange() changes the order of the rows.
  • Columns:
    • select() changes whether or not a column is included.
    • rename() changes the name of columns.
    • mutate() changes the values of columns and creates new columns.
    • relocate() changes the order of the columns.
  • Groups of rows:
    • summarise() collapses a group into a single row.

In this practical we are going to use these by refactoring the 3-state Markov model. The available functions and what they can do is still in active development and we will only be touching on what possible here. You can keep up to date on the tidyverse webpages.

Cost and QALYs input matrices

Open up the Markov_model_original.R script. This is a simplified version of the whole script which just simulated the population counts.

The original cost-effectiveness Markov model uses R matrices and arrays so we are first going to convert these to long format. Remember that this is in contrast to wide format where a column represents a variable value. In long format all values are by row.

For example, state_c_matrix was a 2 by 3 matrix of costs. We will append _m here just to make it clear in this parctical this is a matrix. In general you won’t want to do this.

state_c_matrix_m <-
  matrix(c(cAsymp, cProg, 0,
           cAsymp + cDrug, cProg, 0),
         byrow = TRUE,
         nrow = n_treatments,
         dimnames = list(t_names,
                         s_names))

The first thing to do is convert to data frame (or tibble). dplyr operations won’t work on matrices.

state_c_matrix <-
  state_c_matrix_m |>
  as.data.frame()

This will appear the same as before but you can check using class() or str()

class(state_c_matrix)
str(state_c_matrix)

Previously treatment was a row name but now we wish to have treatment as its own column. dplyr has a function specifically for dealing with this.

state_c_matrix <-
  state_c_matrix_m |>
  as.data.frame() |>
  rownames_to_column(var = "treatment")

Finally, we are in a position to pivot from the wide format to long format. pivot_wider() and pivot_longer() are relatively new functions, renamed because people would get confused with the previous names. This is also called casting and melting in the reshape2 package. The tidyr package also has gather and spread. Try them out in your work and see which you prefer.

state_c_matrix <-
  state_c_matrix_m |>
  as.data.frame() |>
  rownames_to_column(var = "treatment") |>
  tidyr::pivot_longer(cols = Asymptomatic_disease:Dead,
                      names_to = "state",
                      values_to = "val")

Can you do the same conversion from wide to long format for state_q_matrix and trans_c_matrix?

Of course, you don’t have to go in this round about way to get the input data. You could specify the data frame directly with

state_c_matrix <-
  data.frame(treatment = rep(t_names, each = n_states),
             state = s_names,
             val = c(cAsymp, cProg, 0,
                     cAsymp + cDrug, cProg, 0))

Transition probabilities array

The original probabilities were contained in an array. Again, we’ll append _a just to be super clear this is the array version.

p_matrix_a <- array(data = 0,
                  dim = c(n_states, n_states, n_treatments),
                  dimnames = list(from = s_names,
                                  to = s_names,
                                  t_names))

Now its less clear how to “lengthen” this object. For this we will use the metl() function in the reshape2 package (this is named a oddly a bit like ggplot2. There is a reshape package but no one uses it!).

There are different melt arguments depending on if its a matrix, data frame or array. Have a look at the help documentation to see the differences. In our case we have

p_matrix <-
  reshape2::melt(p_matrix_a,
                 varnames = c("from", "to", "treatment"),
                 value.name = "val")

Population array

The population array pop is slightly different to the other array because nearly all of the values are empty except for at time 1.

Therefore, we will create a data frame with all empty values and then substitute in the extra values. So firstly simpy create the data frame as follows

pop <- 
  data.frame(treatment = rep(t_names, each = n_cycles*n_states),
             cycle = rep(1:n_cycles, each = n_states),
             state = s_names,
             val = NA)

Now, we will use the case_when() function. This is an extension to the ubiquitous ifelse function. case_when is originally taken from SQL where this sort of operation is very common. It allows us to make multiple ifelse logical statements with the need to keep nesting them, which can quickly become very untidy, hard to read and error prone.

The syntax is to first give the logical statement and if it is true then insert the value after the ~. If none of the statements are true then it is good practice to put TRUE ~ at the end to catch all.

pop <- 
  data.frame(treatment = rep(t_names, each = n_cycles*n_states),
             cycle = rep(1:n_cycles, each = n_states),
             state = s_names,
             val = NA) %>% 
  as_tibble() |> 
  mutate(val = 
           case_when(cycle == 1 & state == "Asymptomatic_disease" ~ n_pop,
                     cycle == 1 & state == "Progressive_disease" ~ 0,
                     cycle == 1 & state == "Dead" ~ 0,
                     TRUE ~ NA_real_))

Now we are prepared to run the population simulation. The original Markov model relied on matrix multiplication and dot products which is not so easy now that we’ve converted the code into a tidy data format.

The first thing to do is combine the population data frame pop and transition probability matrix p_matrix. We can do this using the join operators in dplyr. Joins are a way to match rows according to their entries in specified columns. This is a large field, again very important in SQL when using relational databases. I recommend checking out the help documentation. The base R version merge does lots of the same things as *_join.

The joined data frame looks like the following. We also filter the data frame for one particular cycle and take the product of the state populations and transition probabilities from that state.

dplyr::full_join(pop, p_matrix,
                 by = c("treatment" = "treatment",
                        "state" = "from")) |>
  filter(cycle == j - 1) |> 
  mutate(prop = val.x*val.y) 

Now we wish to sum up the total number of transitions in to a state. To do this we need to specify which groups of rows to do this by using the group_by() function. So we don’t include the from column in this list so all from states to a particular to state are aggregated. How they are aggregated is specified by summarise() which in this case is sum but this could be any statistic plugged in here.

dplyr::full_join(pop, p_matrix,
                 by = c("treatment" = "treatment",
                        "state" = "from")) |>
  filter(cycle == j - 1) |> 
  mutate(prop = val.x*val.y) |>
  group_by(treatment, cycle, to) |>
  summarise(val = sum(prop), .groups = "keep") 

Finally, we need to insert these newly calculated population values in to the pop data frame. We can do this with another full_join. The join doesn’t overwrite the val column though, even though they’re called the same thing. It creates two new columns called val.x and val.y. However, we do want the new population val to be inserted into the current list. This is something called coalescing and again is something from SQL. Generally speaking, coalesce returns the first non-empty value from a set of available values so in our case it replaces empty val with values from the val.y column. To clean up we just remove the val.x and val.y columns using select and a negative sign in front of the names.

dplyr::full_join(pop, p_matrix,
                 by = c("treatment" = "treatment",
                        "state" = "from")) |>
  filter(cycle == j - 1) |> 
  mutate(prop = val.x*val.y) |>
  group_by(treatment, cycle, to) |>
  summarise(val = sum(prop), .groups = "keep") |> 
  mutate(cycle = j) |> 
  rename(state = to) |> 
  full_join(pop, by = c("treatment", "cycle", "state")) |> 
  mutate(val = coalesce(val.x, val.y)) |>
  select(-val.x, -val.y)

Clearly, this is not as elegant as the original code and is a good example of when certain approaches are appropriate for certain jobs. The full solutions for these exercise is in the file Markov_model_tidyverse.R. Try running the whole thing and understanding all of the new elements.