Genomic surveillance of SARS-CoV-2


Volume 48-4, April 2022: First Nations Health


The need for linked genomic surveillance of SARS-CoV-2

Caroline Colijn1, David JD Earn2, Jonathan Dushoff3, Nicholas H Ogden4, Michael Li5, Natalie Knox6, Gary Van Domselaar6, Kristyn Franklin7, Gordon Jolly8, Sarah P Otto9


1 Department of Mathematics, Simon Fraser University, Burnaby, BC

2 Department of Mathematics & Statistics and M. G. DeGroote Institute for Infectious Disease Research, McMaster University, Hamilton, ON

3 Department of Biology and M. G. DeGroote Institute for Infectious Disease Research, McMaster University, Hamilton, ON

4 Public Health Risk Sciences Division, National Microbiology Laboratory, Public Health Agency of Canada, St.-Hyacinthe, QC

5 Public Health Risk Sciences Division, National Microbiology Laboratory, Public Health Agency of Canada, Guelph, ON

6 National Microbiology Laboratory, Public Health Agency of Canada and Department of Medical Microbiology & Infectious Diseases, University of Manitoba, Winnipeg, MB

7 Centre for Immunization and Respiratory Infectious Diseases, Public Health Agency of Canada, Calgary, AB

8 Public Health Genomics, Public Health Agency of Canada

9 Department of Zoology & Biodiversity Research Centre, University of British Columbia, Vancouver, BC


Suggested citation

Colijn C, Earn DJD, Dushoff J, Ogden NH, Li M, Knox N, Van Domselaar G, Franklin K, Jolly GW, Otto SP. The need for linked genomic surveillance of SARS-CoV-2. Can Commun Dis Rep 2022;48(4):131–9.

Keywords: genomic surveillance, SARS-CoV-2, viral variants, COVID-19, epidemiology, public health, data sharing


Genomic surveillance during the coronavirus disease 2019 (COVID-19) pandemic has been key to the timely identification of virus variants with important public health consequences, such as variants that can transmit among and cause severe disease in vaccinated or recovered individuals. The rapid emergence of the Omicron variant highlighted the speed with which the extent of a threat must be assessed. Rapid sequencing and public health institutions’ openness to sharing sequence data internationally give an unprecedented opportunity to do this; however, assessing the epidemiological and clinical properties of any new variant remains challenging. Here we highlight a “band of four” key data sources that can help to detect viral variants that threaten COVID-19 management: 1) genetic (virus sequence) data; 2) epidemiological and geographic data; 3) clinical and demographic data; and 4) immunization data. We emphasize the benefits that can be achieved by linking data from these sources, and in particular by combining epidemiological, clinical and immunization data with virus sequence data. The considerable challenges of making genomic data available and linked with virus and patient attributes must be balanced against the major consequences of not doing so, especially if new variants of concern emerge and spread without timely detection and action.


Since the start of the pandemic, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has evolved in multiple ways that increase its public health threat, with higher transmissibility (Alpha, Delta, Omicron variants)Footnote 1Footnote 2Footnote 3Footnote 4, partial immune escape (Beta, Omicron variants)Footnote 5Footnote 6 and greater severity (Alpha, Delta variants)Footnote 7Footnote 8Footnote 9. The continued emergence and spread of new variants of interest and variants of concern (VOC) have the potential to undermine our ability to manage the coronavirus disease 2019 (COVID-19) pandemic, with costly consequences to health, healthcare systems and economies. The SARS-CoV-2 virus faces heterogeneous selection: highly vaccinated communities and those with substantial immunity from previous infection are partially protected, while unvaccinated communities and those with waning immune protection are susceptible. With rising immunity levels, selection is expected to favour variants that better escape vaccine or infection-induced immunityFootnote 10. It is particularly crucial to know if a new virus variant emerges with mutations that increase 1) the ability to infect vaccinated or recovered individuals, 2) the transmissibility of the virus and/or 3) the severity of the disease. The rapid spread of the Omicron variant has led to the highest demand yet on hospitals in many areas, despite the disease being less severe on averageFootnote 11, highlighting the urgency of developing the methods and data processes to answer these questions in time to take appropriate preventive action.

It is to be hoped that SARS-CoV-2 will not evolve higher transmissibility simultaneously with higher severity among vaccinated or recovered individuals. The cellular immune response is strong and complexFootnote 12Footnote 13Footnote 14, and breakthrough infections have had reduced severity compared to infections in unvaccinated individualsFootnote 15. Before Omicron emerged, vaccine-induced antibody responses remained strong across a variety of VOCsFootnote 16Footnote 17, but Omicron is a stark reminder that variants can emerge that substantially evade our immune responsesFootnote 1Footnote 2Footnote 3Footnote 18, at least in terms of neutralizing antibodiesFootnote 14Footnote 18Footnote 19Footnote 20, dramatically reducing vaccine-induced protection against infectionFootnote 21. There is no guarantee that future variants will follow Omicron’s path in terms of severity.

Virus sequencing initiatives and related genomic surveillance systems give a high-resolution and near-real-time view of how SARS-CoV-2 is evolving and spreading and of the mutations that are rising in frequencyFootnote 22. Establishing surveillance systems that can detect evolving viral characteristics that impact clinical outcomes and effectiveness of control measures is a key aim of viral sequencing effortsFootnote 23. For a newly emerging variant with uncertain impact, rapidly assessing the degree of risk to control efforts is paramount and requires multiple sources of data.

Data and linkages that are required

While genomic data alone allow certain inferences (e.g. identifying which cases are related, and identifying which mutations occur in a new variant), substantially greater value can be obtained by combining a “band of four” key data sources: genetic data; epidemiological and geographic data; clinical and demographic data; and immunization (or recovery) data.

Genetic data refer to attributes of the virus. Here we focus on SARS-CoV-2 whole genome sequence data, but note that polymerase chain reaction testing can identify specific mutations or deletions without fully sequencing the virus genome and so can provide rapid VOC detection.

Epidemiological and geographic data refer to information about the transmission context, including the geographic location and the reason for testing or sequencing (e.g. whether the individual was part of a known outbreak, was a traveller, was randomly sampled, was a vaccine breakthrough infection, was someone previously infected or was tested for other reasons). Epidemiological data also include information about the source and location of exposure: workplace outbreak; household; travel; community exposure; animal exposure; and health care worker, as well as any other contact investigation information (e.g. indoors vs outdoors, ventilation, community setting).

Clinical and demographic data refer to attributes of individuals infected with SARS-CoV-2, including treatments provided, outcomes (e.g. symptoms, severity) and demographic aspects (e.g. age, comorbidities, exposure risks).

Immunization (or recovery) data refer to attributes of past COVID-19 infection or vaccination, including vaccine type(s), number of doses and dates of doses.

These data are typically gathered by different parts of a health system at different times and are used for a variety of purposes, creating challenges for data linkage. Medical facilities manage the clinical course of disease; contact tracing and other case data are gathered by epidemiological teams in public health; vaccination status may be in medical records or known only to the individual; and sequence information is often collected at specialized sequencing centres. Along the way, information may be lost or remain disconnected. Jurisdictions differ in the extent to which linkages among these data can be made; however, linking these four data sources is the most promising way to rapidly detect variants that have the potential to break through pandemic containment measures.
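The linkage described above can be sketched as a join on a shared, de-identified case identifier. This is a minimal illustration only: the source names, the `case_id` keys and the field names are hypothetical, and real systems would also need governance, consent and record-matching steps that are not shown here.

```python
# Hypothetical names for the four data sources discussed in the text.
SOURCES = ("genetic", "epidemiological", "clinical", "immunization")


def link_records(sources):
    """Merge per-source records that share a de-identified case ID.

    `sources` maps a source name to {case_id: {field: value}}.
    Returns {case_id: {"data": merged_fields, "missing": [source names]}}.
    """
    all_ids = set()
    for records in sources.values():
        all_ids.update(records)

    linked = {}
    for case_id in sorted(all_ids):
        merged, missing = {}, []
        for name in SOURCES:
            record = sources.get(name, {}).get(case_id)
            if record is None:
                missing.append(name)  # this source has nothing for the case
            else:
                merged.update(record)
        linked[case_id] = {"data": merged, "missing": missing}
    return linked


# Illustrative records only; none of these values come from real data.
example = {
    "genetic": {"c1": {"lineage": "BA.1"}, "c2": {"lineage": "BA.2"}},
    "epidemiological": {"c1": {"reason_for_sequencing": "outbreak"}},
    "clinical": {"c1": {"hospitalized": False}, "c2": {"hospitalized": True}},
    "immunization": {"c1": {"doses": 2}},
}

linked = link_records(example)
fully_linked = [cid for cid, rec in linked.items() if not rec["missing"]]
```

In this sketch only case `c1` has all four sources; `c2` illustrates the situation described above, in which information is lost or remains disconnected along the way.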

Opportunities with partial data

It is essential to understand vaccine effectiveness against a variety of outcomes (infection, symptoms, hospitalization, death), as well as intrinsic transmissibility and severity in vaccinated and unvaccinated individuals. These can change rapidly as new variants arise and spread. Links to genetic data can attribute transmissibility, severity and vaccine effectiveness to viral types, and thereby provide a better basis for projecting infections and healthcare burden in the context of vaccination. Viral evolution also causes a continual turn-over in how we classify a virus, as names are given only when a variant has spread and become sufficiently distinct (e.g. by Phylogenetic Assignment of Named Global Outbreak Lineages)Footnote 24. Consequently, case data with linked lineage information need to be updated as our classification system changes, and this is only possible if links to sequence data, as opposed to lineage names, are maintained.

With only viral sequences and sample dates, it is possible to identify unusual new variants, bursts of mutations, “mutator” lineages that evolve faster than predictedFootnote 25Footnote 26 or genetic changes that spread more rapidly than expected; however, rapid growth is difficult to interpret. Rapid growth could be due to viral characteristics, epidemiological fluctuations, travel-associated introductions or sampling artifactsFootnote 26. For example, the mutational profile of the Omicron variant was a cause for concern as it includes both new mutations and a number of mutations already seen in other VOC—including mutations known to enable the virus to evade neutralizing antibodiesFootnote 27. Because of their genetic surveillance system, the Department of Health in South Africa sounded the alarm about Omicron (B.1.1.529; November 25, 2021) after detecting the new subvariant and witnessing its rapid spread in a matter of weeks (first collected on November 11, 2021). The researchers noted key outstanding questions about the effect of Omicron on transmissibility, effectiveness of vaccines and disease severity, which cannot be determined from data on the number of detected Omicron sequences aloneFootnote 28.
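As an illustration of what sequences and sample dates alone can provide, the sketch below computes how quickly a variant's share of sequenced samples changes on the log-odds scale. The function name and the frequencies are hypothetical and not taken from any surveillance dataset. As the paragraph above stresses, a steep slope does not by itself distinguish a viral advantage from epidemiological fluctuations, introductions or sampling artifacts.

```python
import math


def logit(p):
    """Log-odds of a proportion strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def relative_growth_rate(freq_start, freq_end, days):
    """Per-day change in the log-odds of a variant's frequency among sequences.

    This is the slope of a straight line through two points on the logit
    scale; it measures growth relative to the co-circulating lineages.
    """
    if not (0.0 < freq_start < 1.0 and 0.0 < freq_end < 1.0):
        raise ValueError("frequencies must lie strictly between 0 and 1")
    if days <= 0:
        raise ValueError("days must be positive")
    return (logit(freq_end) - logit(freq_start)) / days


# Hypothetical example: a variant rising from 1% to 50% of sequences in 14 days.
slope = relative_growth_rate(0.01, 0.50, 14)
```

With more than two time points, the same quantity is usually estimated by fitting a logistic regression to variant counts over time rather than by connecting two points.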

The fields of phylogeography and phylodynamics have enabled the use of virus sequence data to infer the geographic movements of virusesFootnote 24Footnote 25, identify factors driving transmission across geographic regionsFootnote 29, estimate the effective reproduction number over timeFootnote 30Footnote 31 and link virus sequences to epidemiological models for a range of applicationsFootnote 32Footnote 33; however, there are limitations. Phylogeographic analyses are affected by geographic differences in both sampling rates and strategies. Phylodynamic estimates of reproduction numbers over time tend to be retrospective, apply to large virus populations at the national or international scale, have high degrees of uncertainty and are often not immediately actionable at smaller locations—where public health units need to act. Combining sequence data with the other three bands of data offers more opportunities to use virus sequences to understand transmission, severity and immunity. This combination does not necessarily require individual-level linked data; much could be done with data that are de-identified and even data reported for small groups rather than individuals. Even disaggregating outcomes by VOC status would have very high value, as noted recently for OmicronFootnote 34.

If the epidemiological context is known, it is possible to distinguish the emergence of a variant with a high growth rate from growth driven by chance “founder effects” (e.g. superspreader events, social gatherings among unvaccinated individuals, introductions vs transmission in care settings or increased sampling due to a particular outbreak)Footnote 35Footnote 36. Making this distinction increases the reliability of the inference and the value for both research and public healthFootnote 35Footnote 36. For example, Volz et al. combined sequencing and polymerase chain reaction testing data with reason for sequencing (community samples) and geography in estimating transmissibility of the Alpha variant B.1.1.7Footnote 1. Virus sequences can also be linked to travel history to monitor the spread of emerging variants and to inform public health measures aiming to limit importationFootnote 24Footnote 38Footnote 39.

In densely sampled outbreaks, linking virus sequences to epidemiology can offer information of immediate relevance to infection prevention, especially when analysis can be done in real time. Lucey et al. used whole genome sequence data to identify previously undetected transmission events in hospital-acquired infections, finding evidence that transmission occurred from both symptomatic and asymptomatic healthcare workers, and occurred disproportionately in patients who required high levels of nursing care, informing better prevention toolsFootnote 40. In a real-time genomic epidemiology study in Australia, sequencing linked to epidemiological data indicated the probable source of infection and identified previously unknown connections between institutionsFootnote 37Footnote 41. Linking virus sequences to additional host and epidemiological data, such as the location of exposure, would also make it possible to detect mutations that give the virus a context-specific advantage, such as transmitting more efficiently outdoors or among specific age groups.

Linking viral sequence data with host data on age, sex, race, occupation, dwelling type, comorbidities and other clinical/demographic data permits virus and host factors contributing to severe disease to be identified. For example, Bager et al. used linked data for virus sequences, hospitalization outcome and a large number of host covariates to demonstrate a higher adjusted risk ratio of hospitalization for the Alpha variantFootnote 42. Similarly, Fisman and Tuite estimated the increase in risk of hospitalization, intensive care unit admission and death from N501Y-containing variants and the Delta variantFootnote 43. Further resolution could be achieved with whole genome sequence in place of VOC screening data.

Linked immunization and sequence data are essential to determine whether newly emerging types and/or variants reduce vaccine effectiveness and to what extent. For example, Skowronski et al. linked VOC typing with vaccine status and testing information to show that a single dose of messenger ribonucleic acid (mRNA) vaccines was similarly effective against the Alpha and Gamma variants and non-VOC SARS-CoV-2Footnote 44. Examining clusters or sets of closely related virus sequences together with immunization status informs us about potential transmission. If a cluster consists mainly of vaccinated individuals, this suggests considerable transmission among these individuals; however, if breakthrough infections are preferentially sequenced, an apparent cluster of breakthrough cases could be missing many unvaccinated individuals who accounted for most of the transmission. Distinguishing between these two scenarios requires linking sequences, vaccination status and reason for sequencing, which may include contact tracing or household information.
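The sampling bias described above can be made concrete with a small expected-value calculation. All numbers and names here are hypothetical; the point is only that unequal sequencing probabilities can make a cluster look dominated by vaccinated cases when it is not.

```python
def observed_vaccinated_fraction(n_vaccinated, n_unvaccinated,
                                 p_seq_vaccinated, p_seq_unvaccinated):
    """Expected share of vaccinated cases among the sequenced cases in a cluster.

    Each group is sequenced with its own probability, so the share seen in
    the sequence data can differ from the true share in the cluster.
    """
    expected_vaccinated = n_vaccinated * p_seq_vaccinated
    expected_unvaccinated = n_unvaccinated * p_seq_unvaccinated
    total = expected_vaccinated + expected_unvaccinated
    if total == 0:
        raise ValueError("no cases are expected to be sequenced")
    return expected_vaccinated / total


# Hypothetical cluster: 20 vaccinated and 80 unvaccinated cases (true share 0.20).
true_share = 20 / (20 + 80)

# Breakthrough infections prioritized for sequencing (90% versus 10%).
biased_share = observed_vaccinated_fraction(20, 80, 0.9, 0.1)

# Equal sequencing probabilities recover the true share.
unbiased_share = observed_vaccinated_fraction(20, 80, 0.5, 0.5)
```

Under these assumed probabilities, roughly 69% of the sequenced cluster would be vaccinated even though only 20% of the actual cases were, which is why the reason for sequencing needs to be linked to each sequence.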

The entire band of four is needed to determine whether a virus variant can be transmitted by vaccinated individuals and cause severe disease among them: sequence data can tell us whether this is a new variant; epidemiological data and vaccination data can tell us whether it is being transmitted among vaccinated individuals and clinical data will indicate whether the variant is causing severe disease. Without these four linked pieces—shared sufficiently rapidly and over a large enough area to have strong statistical power—there will be gaps that substantially weaken our ability to monitor the virus’ changing phenotype. Small-scale but aggregated and de-identified data may be sufficient for early warnings and help to avert concerns over privacy.

Data sharing and statistical power

Many jurisdictions may gather virus sequences and clinical, epidemiological and immunization data, but may not permit linkage among them due to structural or other barriers. Even where timely joint analysis of these data is possible, however, there is an additional challenge that an emerging variant or type is necessarily rare when it is first emerging. Sharing data across jurisdictions results in greatly improved statistical power by increasing the total amount of data available. Data delays are an additional problem. Even for countries sharing virus genomic data through the Global Initiative on Sharing All Influenza Data database, lags can span monthsFootnote 45. These extensive time lags hamper international efforts to track variants and their mutations, determine which are rising in frequency and where, track variants’ epidemiological and biological consequences and develop effective public health policyFootnote 45. Furthermore, even where sequences are shared in a timely manner to the Global Initiative on Sharing All Influenza Data database, they are typically not shared alongside epidemiological, clinical/demographic and immunization data. Indeed, the barriers to public health data sharing are extensive: van Panhuis et al. described technical, motivational, economic, political, legal and ethical barriersFootnote 46. Many of these are of daily relevance in the COVID-19 pandemic.
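The gain from pooling can be illustrated with a simple detection probability. The sketch assumes, for illustration only, that sequenced samples are drawn at random and that the variant has the same low frequency in every jurisdiction; the frequency and sample sizes are hypothetical.

```python
def prob_at_least_one(frequency, n_sequences):
    """Probability that a random sample of sequences contains the variant.

    Assumes independent sampling with a constant variant frequency,
    so P(at least one) = 1 - (1 - frequency) ** n_sequences.
    """
    if not 0.0 <= frequency <= 1.0:
        raise ValueError("frequency must be between 0 and 1")
    if n_sequences < 0:
        raise ValueError("n_sequences must be non-negative")
    return 1.0 - (1.0 - frequency) ** n_sequences


# Hypothetical: a variant present in 1 of every 1,000 infections.
single = prob_at_least_one(0.001, 500)       # one jurisdiction, 500 sequences
pooled = prob_at_least_one(0.001, 10 * 500)  # ten such jurisdictions pooled
```

Under these assumptions a single jurisdiction has roughly a 39% chance of sampling the variant at all, while the pooled data exceed 99%. Detecting a variant is only the first step; estimating its properties requires many more observations, which strengthens the case for sharing.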

Timeliness matters

To make an immediate practical difference, these data linkages and analyses need to be conducted with as little delay as possible. The sooner a new VOC can be characterized, the more warning decision-makers have about the risk. Identifying the spread of a VOC requires strong real-time genomic surveillance with sampling that reflects community transmission, and it requires regular reporting on the makeup of the virus population.

There are significant challenges to developing timely surveillance for emerging VOC, and these challenges differ according to whether the concern is an increase in severity, immune escape, transmissibility or a combination. Changes in severity will shape the impact on the healthcare burden, yet it takes many infections before we can estimate a difference in severity: only a minority of individuals experience severe disease, and there are inherent delays between infection and eventual outcomes. By the time the risks of hospital and acute care needs can be estimated, many hundreds or thousands of infections will have occurred. Stratifying severity estimates by viral factors requires even more hospital records and therefore more infections (potentially thousands). This can be ameliorated slightly by focusing on measures with minimal time lags (e.g. hospital admissions rather than occupancy) and by timely reporting.
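A rough back-of-the-envelope calculation shows why severity estimates lag. The rates, the target number of outcomes and the variant share below are hypothetical placeholders, not estimates for any real variant, and the sketch ignores the delay between infection and outcome.

```python
import math


def infections_needed(target_severe_outcomes, severe_per_thousand,
                      variant_share_percent=100):
    """Expected number of infections before a target count of severe outcomes
    attributable to one variant has accumulated.

    severe_per_thousand: assumed severe outcomes per 1,000 infections.
    variant_share_percent: assumed percentage of infections due to the variant.
    """
    if severe_per_thousand <= 0 or variant_share_percent <= 0:
        raise ValueError("rates and shares must be positive")
    numerator = target_severe_outcomes * 1000 * 100
    denominator = severe_per_thousand * variant_share_percent
    return math.ceil(numerator / denominator)


# Hypothetical: 50 hospitalizations wanted, 20 hospitalizations per 1,000 infections.
all_infections = infections_needed(50, 20)

# Same target, but the variant of interest accounts for only 10% of infections.
stratified = infections_needed(50, 20, variant_share_percent=10)
```

With these assumed inputs, about 2,500 infections are needed overall, and about 25,000 total infections must occur before 50 hospitalizations can be attributed to a variant that makes up a tenth of cases.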

Differences in transmissibility are likely to be apparent earlier than differences in severity, because transmission occurs for all infections (whereas severe outcomes occur for a small minority). Indeed, with both the Alpha and Delta variants, increases in transmissibility were detected well ahead of increases in severityFootnote 1Footnote 7. Differences in immune evasion may or may not be apparent soon after the relevant variants arise, depending on the genomic surveillance system (e.g. prioritization of breakthrough infections, extent of surveillance) and whether the new type causes severe disease among vaccinated individuals.

An effective surveillance system also requires linking timely detection with timely action. Public health and policy makers need to assess when to take action in the face of the uncertainty that is inherent in early assessments of variants that might increase transmission, severity or immune escape. Early localized actions that prevent a VOC from spreading widely, while costly in the short-term, reduce the risk of prolonged and global challenges to effective COVID-19 control.


Timely and accurate surveillance requires a range of expertise spanning infectious disease epidemiology, statistics, virus evolution, genomics and public health. Benefits are gained not just from combining data but from conducting joint analyses, bringing together a sufficient range of expertise to increase the chance of early detection of an emerging threat. Many standard approaches used to estimate transmissibility, vaccine effectiveness and severity (e.g. attack rates, test negative study designs) are only possible after community transmission is well established. Designing systems to warn of possible elevated transmission, immune evasion and severity when there are still few cases requires integrating many sources of information and expertise and developing and using analytical methods designed to combine these data streams. Furthermore, progress in establishing linked surveillance for SARS-CoV-2 is likely to benefit surveillance for other respiratory pathogens, including newly emerging zoonotic viruses and high-burden pathogens such as influenza and respiratory syncytial virus. Improvements in sequencing technology also allow sequencing multiple viral pathogens sampled from patients or the environment, improving the ability to respond rapidly to any newly emerging virusFootnote 47.

There are precedents for strong genomic-based surveillance systems with linkage to clinical and epidemiological data. PulseNet CanadaFootnote 48 is a virtual electronic network that delivers systemic surveillance for enteric disease and ensures that genomes of causal bacteria are rapidly sequenced. The presence of clusters of cases triggers coordinated outbreak investigations in which data are collected and linked to sequences to assess the full extent of the outbreak and identify the source. For SARS-CoV-2 surveillance, the Canadian COVID-19 Genomics NetworkFootnote 16 aims to establish large-scale virus and host sequencing at a national scale to inform decision-making and track the evolution and spread of the virus. Such national platforms can enable data linkage, either with public access or with privileged access given to approved researchers. To date, however, such goals have been hampered in Canada, in part by limited or delayed access to virus sequences and limited linkage.

Throughout the SARS-CoV-2 pandemic, the United Kingdom has led the world in data linking, analyses and public communication in its efforts to understand SARS-CoV-2 evolution and impact on public health. The COVID-19 UK Genomics ConsortiumFootnote 49 performs and coordinates sequencing, with over 1.5 million publicly available viral genomes as of February 17, 2022Footnote 50. Sequences are linked with clinical and epidemiological information and are stored securely. Public health agencies use genomic data linked to clinical, demographic and epidemiological data in the public health response and can provide de-identified COVID-19 patient information to the Cloud Infrastructure for Microbial Bioinformatics (CLIMB-COVID-19)Footnote 51 database. There are systems in place for researchers to access the data.

A recent briefing (SARS-CoV-2 VOC and variants under investigation in England: technical briefing 36) from the UK Health Security AgencyFootnote 21 provides an excellent example of the impact of research enabled by data linkage in the United Kingdom. This report summarizes research linking Phylogenetic Assignment of Named Global Outbreak Lineages (Pango) lineage information to contact tracing data, permitting the discovery that the BA.2 sublineage of Omicron has shorter serial intervals than the BA.1 sublineage, which in turn impacts the interpretation of selection (the higher rate of spread is in part due to faster transmission rather than more overall transmission). Linking to vaccination data, age profiles and severity permitted estimates of protection against severe disease and the likely health care burden of BA.2. Sequence and screen-based characterization of the rise of BA.2 allowed estimates of its rate of spread, which is needed to project the future burden of infection and disease. The report is a collaboration of teams that combine expertise in genomics, outbreak surveillance, contact tracing, epidemiology and data analytics, linking and analyzing emerging data with very rapid turn-around and thereby benefitting the global community.
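The point about serial intervals can be illustrated with the textbook relation between exponential growth rate and reproduction number. The sketch makes a strong simplifying assumption, namely a fixed generation interval approximated by the serial interval, under which R = exp(r × T). The numbers are hypothetical and are not taken from the briefing.

```python
import math


def growth_rate(reproduction_number, generation_interval_days):
    """Exponential growth rate per day, assuming a fixed generation interval."""
    if reproduction_number <= 0 or generation_interval_days <= 0:
        raise ValueError("inputs must be positive")
    return math.log(reproduction_number) / generation_interval_days


def reproduction_number(growth_rate_per_day, generation_interval_days):
    """Reproduction number implied by a growth rate, under the same assumption."""
    return math.exp(growth_rate_per_day * generation_interval_days)


# Two hypothetical lineages with the same reproduction number (1.5)
# but different generation intervals.
r_longer = growth_rate(1.5, 3.5)   # longer interval between infections
r_shorter = growth_rate(1.5, 3.0)  # shorter interval between infections

# Reading the faster growth with the longer interval overstates R.
r_misread = reproduction_number(r_shorter, 3.5)
```

Here the lineage with the shorter interval grows faster per day even though each infection causes the same number of onward infections. Interpreting its growth rate with the longer interval would overstate its reproduction number, which is why linked contact tracing data matter for interpreting selection.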

Beyond national-level analyses, linking data at a local level can provide important insight into transmission routes and outbreak risks; for example, genomic epidemiology tools have been used to examine transmission at the scale of outbreaksFootnote 52Footnote 53Footnote 54Footnote 55Footnote 56. By linking sequences, clinical outcome, epidemiological data and vaccination status, such local analyses can alert public health to the emergence of a concerning cluster. If there were a growing cluster with transmission among vaccinated individuals and high severity, this could be detected early. Both national and local-scale analyses require linkage among disparate data systems through unique identifiers, collaboration across multiple disciplines, and a process by which researchers can access linked data to develop and validate methods.


The SARS-CoV-2 virus will continue to evolve. We cannot predict where new variants of concern will arise, nor rely on them being detected early in locations that have strong genomic surveillance. The more we build strong surveillance systems worldwide, with high-quality data and linkages, the earlier we will be able to detect new variants and act accordingly. Many wealthy countries have high rates of vaccination, which leads to selection of variants with the ability to transmit among vaccinated individuals. With extensive international travel, emerging variants will be able to rapidly migrate around the world, and any that evade immunity will not be as impacted by vaccination requirements. In the worst case, viral evolution could undermine the potential for vaccination to mitigate the pandemic, even in countries that have not yet reached high vaccination rates. Countries with the resources to conduct high volumes of sequencing and to develop strongly linked surveillance programs are also the ones that have most benefited from early and extensive vaccination programs. Developing and supporting strong genomic surveillance that enables monitoring the virus’ phenotypes is important to help ensure that the vaccines remain effective for the rest of the world.

Authors’ statement

  • CC — Conceived the project, led discussion with all authors, wrote the first draft
  • SO — Literature overview
  • NO — Literature overview
  • GJ — Literature overview
  • GvD — Literature overview

All authors performed writing-review and editing. All authors contributed text and approved the final manuscript.

Competing interests





We would like to acknowledge funding support from the Natural Sciences and Engineering Research Council (CANMOD;RGPIN/06624-2019). DJDE and JD are grateful for support from The Michael G. DeGroote Institute for Infectious Disease Research (IIDR) at McMaster University. SO is supported by NSERC RGPIN-2016-03711. The sponsors of the funding sources were not involved in this work.
