Big data and GPHIN


Volume 41-9, September 3, 2015: Big Data


Big Data and the Global Public Health Intelligence Network (GPHIN)

Dion M1, AbdelMalik P2, Mawudeku A2*


1 Schulich School of Family Medicine and Dentistry, University of Western Ontario, London, ON

2 Centre for Emergency Preparedness and Response, Public Health Agency of Canada, Ottawa, ON




Background: Globalization and the potential for rapid spread of emerging infectious diseases have heightened the need for ongoing surveillance and early detection. The Global Public Health Intelligence Network (GPHIN) was established to increase situational awareness and capacity for the early detection of emerging public health events.

Objective: To describe how the GPHIN has used Big Data as an effective early detection technique for infectious disease outbreaks worldwide and to identify potential future directions for the GPHIN.

Findings: Every day the GPHIN analyzes over more than 20,000 online news reports (over 30,000 sources) in nine languages worldwide. A web-based program aggregates data based on an algorithm that provides potential signals of emerging public health events which are then reviewed by a multilingual, multidisciplinary team. An alert is sent out if a potential risk is identified. This process proved useful during the Severe Acute Respiratory Syndrome (SARS) outbreak and was adopted shortly after by a number of countries to meet new International Health Regulations that require each country to have the capacity for early detection and reporting. The GPHIN identified the early SARS outbreak in China, was credited with the first detection of MERS-CoV and has played a significant role in the monitoring of the Ebola outbreak in West Africa. Future developments are being considered to advance the GPHIN's capacity in light of other Big Data sources such as social media and its analytical capacity in terms of algorithm development.

Conclusion: The GPHIN's early adoption of Big Data has increased global capacity to detect international infectious disease outbreaks and other public health events. Integration of additional Big Data sources and advances in analytical capacity could further strengthen the GPHIN's capability for timely detection and early warning.


As globalization increases, so does the rapid spread of communicable diseases and emerging public health events. As a result, ongoing surveillance and early detection are even more important to prevent or mitigate the international spread of infectious diseases and to provide countries adequate time to prepare and respond. Big Data refers to the extremely large datasets provided by sources such as social media or newspapers which require powerful computational methods to reveal trends, patterns or the predictive likelihood of an eventFootnote 1Footnote 2. Big Data has been used to optimize sales and business processes, inform trades among sports teams and to improve city planning. It is quickly becoming integral to a variety of aspects of health ranging from health care administration to Google Flu and pharmacosurveillanceFootnote 3.

Canada was an early adopter of Big Data for the initial identification of emerging infections beginning in 1997 through the development of the Global Public Health Intelligence Network (GPHIN), a cooperative effort between (at the time) Health Canada and the World Health Organization (WHO)Footnote 4Footnote 5. The GPHIN continues to be maintained by the Public Health Agency of Canada (the Agency) and links a global network of public health professionals and organizations (e.g., Ministries of Health) for situational awareness and early detection of emerging public health events. The GPHIN relies on an automated web-based system that scans newspapers and other communications worldwide for potential indicators of outbreaks (or "signals") that are analyzed and rapidly assessed by a multilingual, multidisciplinary team at the Agency. When a risk is identified, analysts disseminate relevant information and alerts to senior officials and stakeholders for decision-making. While initially devised to identify communicable disease outbreaks, the system has also been used to monitor potential chemical and radio nuclear hazardsFootnote 4Footnote 6.

The objective of this article is to identify how the GPHIN functions within the context of Big Data, to provide recent examples of the GPHIN in action and to explore potential future directions.

The GPHIN and Big Data

Big Data has been defined by three V's: volume, velocity and varietyFootnote 7Footnote 8. Volume describes the quantity of data that is collected, velocity is the speed at which the data is collected and disseminated and variety refers to the multiplicity of sources that are used to compile the dataFootnote 7.

The GPHIN's volume and variety are exemplified through the use of search functions and news aggregators (companies that provide access to thousands of news sources whose content is automatically indexed) that gather large quantities of data sets from multiple different sources. A web-based application in the GPHIN system continuously scans and mines acquired news sources worldwide in nine languages (Arabic, English, Farsi, French, Portuguese, Russian, simplified Chinese, Spanish and traditional Chinese)Footnote 4. The quantity of data generated is dependent on the criteria, variables and algorithms outlined for the aggregatorsFootnote 6. These algorithms identify potential signals of emerging public health events and filter out irrelevant data considered as "noise" (Figure 1)Footnote 7. Every day, on average, the GPHIN processes 3,000 news reportsFootnote 9. Volume increases when news sources expand coverage on emerging public health events such as the recent Ebola outbreak in West Africa.

Figure 1: The flow of information for the GPHIN process

Figure 1: The flow of information for the GPHIN process

Text description: Figure 1

Figure 1: The flow of information for the GPHIN process

This is a diagram that describes the flow of information for the GPHIN process. There are 3 vertical boxes on the left side of the diagram. The first box is marked data collection. On the right side of the box is an image with many words, such as "breaking news, media and headlines" that are going into the top of a funnel. The second box is marked "Data processing an analyst assessment". On the right side of this box, and below the many words, is a large funnel which states on the outside of it: "Reduplication, metadata, categorization, translation and relevancy". The third box is marked "Intelligence turned into alert notifications". On the right side of the box, and below the funnel, are three piles of papers.

The GPHIN has an abundant variety of data sources. The GPHIN's news aggregators rely on a large variety of national and local newspapers and select newsletters from around the worldFootnote 4Footnote 6. Local newspapers and newsletters are scanned because emerging events can be a localized phenomenon and are often reported in community newspapers and newsletters. Various sections of news publications (sports, travel and finance) are also monitored as they may signal an emerging public health event. Scanning across various languages is done in order capture public health events that are not reported in English newsFootnote 10.

After further application of algorithms within the GPHIN system, approximately 60% (1,800 news reports) of the data are deemed as relevant public health events for assessment. GPHIN analysts sift through these news reports to identify and provide alerts about events with potential implications for decision-making by stakeholders. Access to the GPHIN system is provided to entities that have the responsibility to monitor, respond to and or mitigate emerging public health threats. The GPHIN includes ministries of health, other governmental departments and agencies, international and non-governmental organizations and private companies.

The capacity for velocity in the GPHIN is impressive. It operates on a near real-time, 24/7 basisFootnote 4. The GPHIN system retrieves relevant data from the news aggregators every 15 minutes and is able to complete the processing (including translation) of the data in less than one minuteFootnote 9.

The GPHIN in action

Early detection

The GPHIN has proven to be an effective early detection resource for infectious disease outbreaks. Its utility was initially demonstrated during the Severe Acute Respiratory Syndrome (SARS) outbreak in 2003 when early alerts were provided in reports from Chinese newspapers. The first English report about an atypical outbreak in China was noted by a pharmaceutical company in the financial section of a newspaper that had reported increased sales of its antiviral drugsFootnote 11. This not only flagged the emergence of the outbreak but provided additional information about the local use of antiviral drugs to contain the spread of the virus.

Following the SARS outbreak, the significance of using news media to complement more traditional national public health surveillance systems was recognized by the WHO and its member statesFootnote 12Footnote 13. The SARS outbreak lead to revisions of the International Health Regulations (IHRs)Footnote 14 that required countries to report and control outbreaks of potential international concern in order to strengthen global public health security. The IHRs note that the WHO may include reports from sources other than official notifications or consultations in their assessment of a potential emerging public health eventFootnote 14. After the SARS outbreak, the GPHIN outputs have been used by multiple countries to expand their surveillance capacityFootnote 4Footnote 15.

Over the years, the GPHIN has continued to detect early signals of outbreaks of international concern such as the pandemic influenza H1N1 in 2009Footnote 16. Initial Spanish language reports about the outbreak noted an unusual respiratory outbreak in the state of Veracruz, Mexico that had claimed two lives.

In April 2012, the GPHIN identified eight cases of an unknown respiratory illness and one death in Jordan. GPHIN issued an alert notifying stakeholders, including the WHO, about these cases. Following further investigation and the results of a retrospective laboratory analysis, an outbreak of Middle East Respiratory Syndrome Coronavirus (now known as MERS-CoV) was confirmed. An International Health Regulations (IHR) Notification was posted in November 2012. The GPHIN was credited with being the first to issue an alert about this new emerging illness.

Ongoing monitoring

The GPHIN has proven to be useful for both early detection and continuous monitoring. Ongoing monitoring of events is critical for situational awareness regarding the evolution of an outbreak and the response and mitigation strategies being implemented by the local, national and international communities. Examples of situational awareness of mitigation strategies include the GPHIN's ability to scan for cancellation of flights or cruises, new travel advisories, health screening procedures at border crossings or trade bans. This process has been much more efficient than individually contacting commercial transportation companies, travel agencies and airports.

For example, during pandemic Influenza H1N1, the GPHIN was used as an intelligence source by the World Trade Organization to monitor the extent and the effect of trade bansFootnote 17. Similarly, during the recent response to the Ebola outbreak in West Africa, the GPHIN provided situational awareness about the cancellation of flights, travel advisories and health screening procedures at border crossings.

Next steps

Potential new data sources

Internet, email, smart phones and social media have developed rapidly since the GPHIN was first developed. As a result, potential new sources of Big Data have emerged that can be analyzed to detect signals of early infectious disease outbreaks. Social media tools (such as Twitter and Facebook) have witnessed exponential growth over the last 10 years and these platforms create huge amounts of user-generated content and dataFootnote 18.

These various social media represent potential new data sources for the GPHIN. In addition, other organizations have started to mine social media resources to improve disease surveillanceFootnote 18. For example, Google Flu Trends monitors online search behaviour for early warning signs of influenzaFootnote 19; researchers have used Facebook to help predict health outcomes at the local population health levelFootnote 20; Twitter has been used as a large source of data to monitor health trends during an avian influenza outbreakFootnote 21; and mobile phones have been used to measure human mobility patterns in the context of malaria transmission in the developing worldFootnote 22.

Social media has improved emergency response by providing real-time data capture about the health of communitiesFootnote 23 and the public response to an eventFootnote 24. For example, the use of smartphones and Twitter in Nigeria during the Ebola outbreak in West Africa helped to identify an outbreak in a new area three days before a WHO announcementFootnote 25.

Other novel applications include crowdsourcing systems that capture voluntarily submitted symptoms from the general public through the Internet or mobile phone networks and rapidly aggregate and provide feedback about data in near real-time. This has been used by participatory infectious disease surveillance applications such as Flu Near YouFootnote 26 and DoctorMeFootnote 27.

However, there are some inherent challenges in the use of social media data sources. One of the primary challenges of Big Data in general and social media content in particular, is the "signal-to-noise" ratio which can significantly increase the potential for false positives and false negatives. With the influx of discussions and tweets surrounding the Ebola outbreak in West Africa, for example, it was difficult to distinguish between actual signals of concern and the plethora of messages that would otherwise be expected during such an event. In addition, some social media, such as tweets that are limited to 140 characters, may not have enough contextual information to help discern a reliable signalFootnote 28.

Another challenge when using social media is representativeness. Not everyone has access to a smart phone and therefore data from social media platforms can only reflect the portion of the population that uses themFootnote 28. Mobile technology is expanding significantly so this may help address concerns about representativenessFootnote 29.

Finally, the use of social media poses ethical considerations associated with the rights of individuals, including privacy issuesFootnote 2.

Improving data analysis

Not only might the GPHIN expand its data sources, it could also advance its data analysis capacities. Advanced computational and verification methods to improve the sensitivity and specificity of signals that are detected are being consideredFootnote 30. Also up for consideration is whether better data processing could reduce reliance on a multilingual, multidisciplinary team. The GPHIN is continuously assessing and honing the aggregators and algorithms used which could potentially result in more advanced forms of artificial intelligence. Continuing to advance the GPHIN's analytical capacity will enable the robust management, integration, analysis and interpretation of increasingly large and complex volumes of dataFootnote 31.


Canada's Global Public Health Intelligence Network was an early adopter of Big Data and as an ongoing global resource, helps countries meet event-based surveillance capacity requirements for early detection and reporting of infectious disease outbreaks and other events of international concern. Ongoing advances in Big Data including the use of social media and smart phones, as well as advances in analytical capacity provide opportunities for the further enhancement of the GPHIN. Overall, Big Data approaches have become a vital component of local, national and international public health efforts to detect, report, and control emerging outbreaks.


We would like to acknowledge the entire GPHIN team. The success of the GPHIN is attributed to their hard work, dedication and consistent collaboration. Their support and guidance was much appreciated and contributed greatly to the development of this paper.

Conflict of interest



The GPHIN is funded by the Public Health Agency of Canada.

Page details

Date modified: