Introduction to socialmixr

Sebastian Funk

2018-08-09

socialmixr is an R package to derive social mixing matrices from survey data. These are particularly useful for age-structured infectious disease models. For background on age-specific mixing matrices and the data that inform them, see, for example, the paper on POLYMOD by Mossong et al.

Usage

The latest stable version of the socialmixr package can be installed via

install.packages('socialmixr')

The latest development version of the socialmixr package can be installed via

devtools::install_github('sbfnk/socialmixr')

To load the package, use

library('socialmixr')
#> 
#> Attaching package: 'socialmixr'
#> The following object is masked from 'package:utils':
#> 
#>     cite

At the heart of the socialmixr package is the contact_matrix function. This extracts a contact matrix from survey data. You can use the R help to find out about usage of the contact_matrix function, including a list of examples:

?contact_matrix

The POLYMOD data are included with the package and can be loaded using

data(polymod)

An example use would be

contact_matrix(polymod, countries = "United Kingdom", age.limits = c(0, 1, 5, 15))
#> Using POLYMOD social contact data. To cite this in a publication, use the 'cite' function
#> $matrix
#>          contact.age.group
#> age.group      [0,1)     [1,5)   [5,15)      15+
#>    [0,1)  0.40000000 0.8000000 1.266667 5.933333
#>    [1,5)  0.11250000 1.9375000 1.462500 5.450000
#>    [5,15) 0.02450980 0.5049020 7.946078 6.215686
#>    15+    0.03230337 0.3581461 1.290730 9.594101
#> 
#> $participants
#>    lower.age.limit participants proportion
#> 1:           [0,1)           15 0.01483680
#> 2:           [1,5)           80 0.07912957
#> 3:          [5,15)          204 0.20178042
#> 4:             15+          712 0.70425321

This generates a contact matrix from the UK part of the POLYMOD study, with age groups 0-1, 1-5, 5-15 and 15+ years. It contains the mean number of contacts that each member of an age group (row) has reported with members of the same or another age group (column).

Methodology

The contact_matrix function requires a survey given as a list of two elements, both given as data.frames: participants and contacts. They must be linked by an ID column that refers to the identity of the queried participants (by default global_id, but this can be changed using the id.column argument). The participants data frame, as a minimum, must have the ID column and a column denoting participant age (which can be set by the part.age.column argument, by default participant_age). The contacts data frame, similarly, must have the ID column and a column denoting age (which can be set by the contact.age.column argument, by default cnt_age_mean).

The function then either randomly samples participants (if bootstrap is set to TRUE) or takes all participants in the survey and determines the mean number of contacts in each age group given the age group of the participant. The age groups can be set using the age.limits argument, which should be set to the lower limits of the age groups (e.g., age.limits=c(0, 5, 10) for age groups 0-5, 5-10, 10+). If these are not given, the narrowest age groups possible given survey and demographic data are used.
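
The calculation described above can be sketched in base R without the package. This is an illustration only, not the internal code of contact_matrix: the toy data are made up, the column names follow the defaults named above, and the helper age_group is hypothetical.

```r
# Toy survey in the two-table layout described above (all values are made up).
participants <- data.frame(
  global_id = 1:4,
  participant_age = c(3, 7, 12, 40)
)
contacts <- data.frame(
  global_id = c(1, 1, 2, 2, 2, 3, 4, 4),
  cnt_age_mean = c(4, 35, 8, 9, 38, 11, 2, 45)
)

# Lower limits of the age groups 0-5, 5-10 and 10+.
age.limits <- c(0, 5, 10)
labels <- c("[0,5)", "[5,10)", "10+")

# Hypothetical helper: assign each age to the group whose lower limit is
# the largest one that does not exceed the age.
age_group <- function(age) {
  factor(labels[findInterval(age, age.limits)], levels = labels)
}
participants$age.group <- age_group(participants$participant_age)
contacts$contact.age.group <- age_group(contacts$cnt_age_mean)

# Link contacts to participants through the shared ID column.
linked <- merge(contacts, participants, by = "global_id")

# Count contacts per (participant group, contact group), then divide each
# row by the number of participants in that group to get mean contacts.
counts <- table(age.group = linked$age.group,
                contact.age.group = linked$contact.age.group)
n_part <- as.vector(table(participants$age.group))
toy_matrix <- unclass(counts) / n_part
print(toy_matrix)
```

Dividing by the number of participants per group, rather than by the number of participants who reported contacts, keeps participants without any contacts in the denominator.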

Surveys

The key argument to the contact_matrix function is the survey that it is supposed to use. The socialmixr package includes the POLYMOD survey, which will be used if no survey is specified. It also provides access to all surveys in the Social contact data community on Zenodo. The available surveys can be listed (if an internet connection is available) with

list_surveys()
#>    id       date                             title            creator
#> 1:  1 2018-01-23        France social contact data   Guillaume Béraud
#> 2:  2 2017-11-07       POLYMOD social contact data       Joël Mossong
#> 3:  3 2017-12-07      Peruvian social contact data Carlos G. Grijalva
#> 4:  4 2018-06-14   Social contact data for Vietnam        Horby Peter
#> 5:  5 2017-12-22      Zimbabwe social contact data   Alessia Melegaro
#> 6:  6 2018-02-05 Social contact data for Hong Kong       Kathy  Leung
#>                                       url
#> 1: https://doi.org/10.5281/zenodo.1157918
#> 2: https://doi.org/10.5281/zenodo.1043437
#> 3: https://doi.org/10.5281/zenodo.1095664
#> 4: https://doi.org/10.5281/zenodo.1289473
#> 5: https://doi.org/10.5281/zenodo.1127693
#> 6: https://doi.org/10.5281/zenodo.1165561

A survey can be downloaded using the get_survey command. This will get the relevant data of a survey given either its ID (the first column in the output of the list_surveys command) or Zenodo DOI (also returned by list_surveys). All other relevant commands in the socialmixr package accept a DOI or ID, but if a survey is to be used repeatedly it is worth downloading it and storing it locally to avoid the need for a network connection and speed up processing.

peru <- get_survey("https://doi.org/10.5281/zenodo.1095664")
saveRDS(peru, "peru.rds")

This way, the peru data set can be loaded in the future without the need for an internet connection using

peru <- readRDS("peru.rds")

Some surveys may contain data from multiple countries. To check this, use the survey_countries function

survey_countries(polymod)
#> Using POLYMOD social contact data. To cite this in a publication, use the 'cite' function
#> [1] "Italy"          "Germany"        "Luxembourg"     "Netherlands"   
#> [5] "Poland"         "United Kingdom" "Finland"        "Belgium"

If one wishes to get a contact matrix for one or more specific countries, a countries argument can be passed to contact_matrix. If this is not done, the different surveys contained in a dataset are combined as if they were one single sample (i.e., not applying any population-weighting by country or other correction).

By default, socialmixr uses the POLYMOD survey. A reference for any given survey can be obtained using cite, e.g.

cite(polymod)
#> Using POLYMOD social contact data. To cite this in a publication, use the 'cite' function
#> 
#> To cite POLYMOD social contact data in publications use:
#> 
#> Mossong J, Hens N, Jit M, Beutels P, Auranen K, Mikolajczyk R,
#> Massari M, Salmaso S, Tomba GS, Wallinga J, Heijne J,
#> Sadkowska-Todys M, Rosinska M, Edmunds WJ (2017). "POLYMOD social
#> contact data." doi: https://doi.org/10.5281/zenodo.1157934 (URL:
#> http://doi.org/https://doi.org/10.5281/zenodo.1157934), Version
#> 1.1.
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Misc{,
#>     title = {POLYMOD social contact data},
#>     author = {Joël Mossong and Niel Hens and Mark Jit and Philippe Beutels and Kari Auranen and Rafael Mikolajczyk and Marco Massari and Stefania Salmaso and Gianpaolo Scalia Tomba and Jacco Wallinga and Janneke Heijne and Malgorzata Sadkowska-Todys and Magdalena Rosinska and W. John Edmunds},
#>     doi = {https://doi.org/10.5281/zenodo.1157934},
#>     note = {Version 1.1},
#>     year = {2017},
#>   }

Bootstrapping

To get an idea of the uncertainty in the contact matrices, a bootstrap can be used. If an argument n greater than 1 is passed to contact_matrix, multiple samples of contact matrices are generated. For each sample, participants are sampled with replacement (to match the number of participants in the original study), and contacts are sampled, again with replacement, from the set of all contacts of all participants. All resulting contact matrices are returned as the matrices field of the returned list. From these, derived quantities can be obtained, for example the mean

m <- contact_matrix(polymod, countries = "United Kingdom", age.limits = c(0, 1, 5, 15), n=5)
#> Using POLYMOD social contact data. To cite this in a publication, use the 'cite' function
length(m$matrices)
#> [1] 5
mr <- Reduce("+", lapply(m$matrices, function(x) {x$matrix})) / length(m$matrices)
mr
#>          contact.age.group
#> age.group      [0,1)     [1,5)   [5,15)      15+
#>    [0,1)  0.59012282 0.7681383 1.283456 6.092954
#>    [1,5)  0.09956610 1.6573981 1.547400 5.443418
#>    [5,15) 0.02067134 0.4688799 7.754331 6.257101
#>    15+    0.03413030 0.3305091 1.246192 9.523608
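
Other summaries can be derived in the same way. The sketch below computes element-wise means and 95% intervals in base R. It assumes a list shaped like m$matrices (each element holding a matrix entry); the list used here is simulated for illustration and does not come from the survey.

```r
set.seed(1)

# Stand-in for m$matrices: a list whose elements each carry a $matrix entry.
# The values are simulated purely for illustration.
groups <- c("[0,5)", "[5,15)", "15+")
base <- matrix(c(2.0, 1, 4,
                 1.0, 8, 6,
                 0.5, 1, 10),
               nrow = 3, byrow = TRUE,
               dimnames = list(age.group = groups, contact.age.group = groups))
samples <- lapply(1:200, function(i) {
  list(matrix = base * exp(rnorm(length(base), sd = 0.1)))
})

# Stack the matrices into a three-dimensional array (row, column, sample).
arr <- simplify2array(lapply(samples, function(x) x$matrix))

# Element-wise summaries across the third (sample) dimension.
mean_matrix <- apply(arr, c(1, 2), mean)
lower <- apply(arr, c(1, 2), quantile, probs = 0.025)
upper <- apply(arr, c(1, 2), quantile, probs = 0.975)

print(round(mean_matrix, 2))
print(round(lower, 2))
print(round(upper, 2))
```

With only a handful of bootstrap samples (such as n=5 above) the quantiles are very rough; a larger n gives more stable intervals.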

Demography

Obtaining symmetric contact matrices or splitting out their components (see below) requires information about the underlying demographic composition of the survey population. This can be passed to contact_matrix as the survey.pop argument, a data.frame with two columns, lower.age.limit (denoting the lower end of the age groups) and population (denoting the number of people in each age group). If survey.pop is not given, contact_matrix will try to obtain the age structure of the population (as per the countries argument) from the World Population Prospects of the United Nations, using estimates from the year that most closely matches the year in which the contact survey was conducted.

If demographic information is used, this is returned by contact_matrix as the demography field in the results list.
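
As a minimal sketch of the expected layout, a population table with the two columns described above could be built as follows. The population figures are copied from the 2005 United Kingdom demography printed in the examples further down; the proportion column is added here only to show how the shares are derived.

```r
# Population by age group, with columns named as described above.
# Figures copied from the 2005 UK demography output shown in this document.
survey.pop <- data.frame(
  lower.age.limit = c(0, 1, 5, 15),
  population = c(690312, 2761248, 7380235, 49378217)
)

# Share of the total population in each age group.
survey.pop$proportion <- survey.pop$population / sum(survey.pop$population)
print(survey.pop)
```

A table of this shape is what the survey.pop argument is described as accepting.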

Symmetric contact matrices

Conceptually, contact matrices should be symmetric in terms of total contacts: the total number of contacts made by members of one age group with those of another should equal the total made in the opposite direction. Mathematically, if \(c_{ij}\) is the mean number of contacts made by members of age group \(i\) with members of age group \(j\), and the total number of people in age group \(i\) is \(N_i\), then

\[c_{ij} N_i = c_{ji}N_j\]

Because of variation in the sample from which the contact matrix is obtained, this relationship is usually not fulfilled exactly. In order to obtain a symmetric contact matrix that fulfills it, one can use

\[c'_{ij} = \frac{1}{2N_i} (c_{ij} N_i + c_{ji} N_j)\]

To get this version of the contact matrix, use symmetric = TRUE when calling the contact_matrix function.

contact_matrix(polymod, countries = "United Kingdom", age.limits = c(0, 1, 5, 15), symmetric = TRUE)
#> Using POLYMOD social contact data. To cite this in a publication, use the 'cite' function
#> Warning in pop_age(survey.pop, age.limits, ...): Not all age groups represented in population data (5-year age band).
#>   Linearly estimating age group sizes from the 5-year bands.
#> $matrix
#>          contact.age.group
#> age.group      [0,1)     [1,5)    [5,15)      15+
#>    [0,1)  0.40000000 0.6250000 0.7643524 4.122001
#>    [1,5)  0.15625000 1.9375000 1.4059984 5.927286
#>    [5,15) 0.07149388 0.5260415 7.9460784 7.425725
#>    15+    0.05762596 0.3314560 1.1098739 9.594101
#> 
#> $demography
#>    lower.age.limit population proportion year
#> 1:               0     690312 0.01146507 2005
#> 2:               1    2761248 0.04586028 2005
#> 3:               5    7380235 0.12257488 2005
#> 4:              15   49378217 0.82009977 2005
#> 
#> $participants
#>    lower.age.limit participants proportion
#> 1:           [0,1)           15 0.01483680
#> 2:           [1,5)           80 0.07912957
#> 3:          [5,15)          204 0.20178042
#> 4:             15+          712 0.70425321
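
The symmetrisation can be checked by hand in base R. The sketch below applies the formula for \(c'_{ij}\) to the unsymmetrised United Kingdom matrix from the Usage section, using the population sizes printed in the demography field above. It illustrates the formula and is not the package's internal code.

```r
groups <- c("[0,1)", "[1,5)", "[5,15)", "15+")

# Unsymmetrised mean contacts c_ij, copied from the first example.
C <- matrix(c(0.40000000, 0.8000000, 1.266667, 5.933333,
              0.11250000, 1.9375000, 1.462500, 5.450000,
              0.02450980, 0.5049020, 7.946078, 6.215686,
              0.03230337, 0.3581461, 1.290730, 9.594101),
            nrow = 4, byrow = TRUE,
            dimnames = list(age.group = groups, contact.age.group = groups))

# Population sizes N_i, copied from the demography field.
N <- c(690312, 2761248, 7380235, 49378217)

# Total contacts from group i to group j: c_ij * N_i.
# R recycles N down the rows, so row i is multiplied by N[i].
total <- C * N

# Average the two directions, then convert back to per-capita values.
C_sym <- (total + t(total)) / (2 * N)
print(round(C_sym, 4))

# Reciprocity check: c'_ij * N_i should equal c'_ji * N_j.
print(isSymmetric(unname(C_sym * N)))
```

Up to rounding of the copied inputs, the result agrees with the matrix printed above; for instance the entry for [0,1) with [1,5) is 0.4 + 0.1125 * 2761248 / (2 * 690312) = 0.625.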

Splitting contact matrices

The contact_matrix function contains a simple model for the elements of the contact matrix, by which it is split into a global component, as well as three components representing contacts, assortativity and demography. In other words, the elements \(c_{ij}\) of the contact matrix are modelled as

\[ c_{ij} = q d_i a_{ij} n_j \]

where \(q d_i\) is the number of contacts that a member of group \(i\) makes across age groups, \(a_{ij}\) captures the assortativity of contacts between groups \(i\) and \(j\), and \(n_j\) is the proportion of the surveyed population in age group \(j\). The constant \(q\) is set to the value of the largest eigenvalue of the matrix with elements \(c_{ij}\); if used in an infectious disease model, it can be replaced by the basic reproduction number \(R_0\).

To model the contact matrix in this way with the contact_matrix function, set split = TRUE. The components of the matrix are returned as elements normalisation (\(q\)), contacts (\(d_i\)), matrix (\(a_{ij}\)) and demography (\(n_j\)) of the resulting list.

contact_matrix(polymod, countries = "United Kingdom", age.limits = c(0, 1, 5, 15), split = TRUE)
#> Using POLYMOD social contact data. To cite this in a publication, use the 'cite' function
#> Warning in pop_age(survey.pop, age.limits, ...): Not all age groups represented in population data (5-year age band).
#>   Linearly estimating age group sizes from the 5-year bands.
#> $mean.contacts
#> [1] 11.73887
#> 
#> $normalisation
#> [1] 1.022869
#> 
#> $contacts
#> [1] 0.6995727 0.7464190 1.2235173 0.9390331
#> 
#> $matrix
#>          contact.age.group
#> age.group     [0,1)     [1,5)    [5,15)       15+
#>    [0,1)  4.1534023 2.0767011 1.2302166 0.8612967
#>    [1,5)  1.0948299 4.7138509 1.3312672 0.7414821
#>    [5,15) 0.1455146 0.7494002 4.4126024 0.5159003
#>    15+    0.2498871 0.6926217 0.9339135 1.0375529
#> 
#> $demography
#>    lower.age.limit population proportion year
#> 1:               0     690312 0.01146507 2005
#> 2:               1    2761248 0.04586028 2005
#> 3:               5    7380235 0.12257488 2005
#> 4:              15   49378217 0.82009977 2005
#> 
#> $participants
#>    lower.age.limit participants proportion
#> 1:           [0,1)           15 0.01483680
#> 2:           [1,5)           80 0.07912957
#> 3:          [5,15)          204 0.20178042
#> 4:             15+          712 0.70425321
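
As a consistency check, the components printed above can be multiplied back together in base R. One assumption here is inferred from the printed numbers rather than from the package documentation: the original matrix is recovered when the global factor \(q\) is taken as the product of mean.contacts and normalisation.

```r
groups <- c("[0,1)", "[1,5)", "[5,15)", "15+")

# Components copied from the split = TRUE output above.
mean.contacts <- 11.73887
normalisation <- 1.022869
d <- c(0.6995727, 0.7464190, 1.2235173, 0.9390331)            # contacts
a <- matrix(c(4.1534023, 2.0767011, 1.2302166, 0.8612967,
              1.0948299, 4.7138509, 1.3312672, 0.7414821,
              0.1455146, 0.7494002, 4.4126024, 0.5159003,
              0.2498871, 0.6926217, 0.9339135, 1.0375529),
            nrow = 4, byrow = TRUE,
            dimnames = list(age.group = groups, contact.age.group = groups))
n <- c(0.01146507, 0.04586028, 0.12257488, 0.82009977)        # demography proportions

# Assumption inferred from the numbers: q = mean.contacts * normalisation.
q <- mean.contacts * normalisation

# c_ij = q * d_i * a_ij * n_j; outer(d, n) holds the products d_i * n_j.
C <- q * outer(d, n) * a
print(round(C, 4))
```

Under this reading the product reproduces the unsplit United Kingdom matrix from the Usage section to within rounding, for example 0.4 for contacts within the [0,1) group.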

Plotting

The contact matrices can be plotted, for example, using the geom_tile function of the ggplot2 package.

library("reshape2")
library("ggplot2")
df <- melt(mr, varnames = c("age1", "age2"), value.name = "contacts")
ggplot(df, aes(x = age2, y = age1, fill = contacts)) + theme(legend.position = "bottom") + 
    geom_tile()