Volume 46–6, June 4, 2020: Artificial intelligence in public health


A call for an ethical framework when using social media data for artificial intelligence applications in public health research

Jean-Philippe Gilbert1, Victoria Ng2, Jingcheng Niu3, Erin E Rees2


1 Université Laval, Québec, QC

2 Public Health Agency of Canada, Ottawa, ON

3 University of Toronto, Toronto, ON


Suggested citation

Gilbert J-P, Ng V, Niu J, Rees EE. A call for an ethical framework when using social media data for artificial intelligence applications in public health research. Can Commun Dis Rep 2020;46(6):169–73.

Keywords: ethics, ethical research, social media, artificial intelligence


Abstract

Advancements in artificial intelligence (AI), more precisely the subfield of machine learning, and their applications to open-source internet data, such as social media, are growing faster than the management of the ethical issues raised by their use in society. An ethical framework helps scientists and policy makers consider ethics in their fields of practice, legitimize their work and protect members of the data-generating public. A central question for advancing the ethical framework is whether Tweets, Facebook posts and other open-source social media data generated by the public represent a human. The objective of this paper is to highlight ethical issues that the public health sector is already confronting, or will confront, when using social media data in practice. The issues include informed consent, privacy and anonymization, and how to balance these against the benefits of using social media data for the common good. Current ethical frameworks need to provide guidance for addressing issues arising from the use of social media data in the public health sector. Discussions in this area should occur while the application of open-source data is still relatively new, and they should keep pace as other problems arise from ongoing technological change.


Introduction

Rapid technological advancements in artificial intelligence (AI), and more specifically in natural language processing (NLP) using machine learning techniques, are enabling easy access to, and use of, open-source big data. NLP allows computers to analyze datasets of natural language discourse (i.e. text not structured for quantitative analysis).

In public health, digital epidemiology has emerged as a new field that focuses on using data from outside the public health sector, such as open-source internet data (e.g. Google Trends, news media) and social media data (e.g. Twitter and Facebook posts). Traditional epidemiology, by contrast, uses data collected for the purposes of health care, such as reports of notifiable diseases that healthcare professionals submit for the surveillance of disease cases.

Researchers and policy makers recognize the potential of digital epidemiology data for advancing early warning of public health threatsFootnote 1Footnote 2Footnote 3. Odlum & YoonFootnote 4 used NLP to assess Twitter data and reported that Tweets related to Ebola increased in the days leading up to the official alert of the 2014 Ebola outbreak in Africa. Yousefinaghani et al.Footnote 5 showed that 75% of real-time outbreak notifications of avian influenza were identifiable from Twitter; one-third of outbreak notifications were reported on Twitter earlier than in official reports. These observations support using Twitter volumes to predict the occurrence of outbreaks, and even to forecast expected case counts, as has also been shown with Google Trends dataFootnote 1Footnote 6. Furthermore, refining social media data into disease-relevant categories, for example by using NLP to classify Tweets into symptom types (e.g. fever, vomit) or by focusing the analysis on specific search terms from Google Trends, helps increase the accuracy of outbreak predictions and forecast estimates.

Research that uses data from human participants requires ethical approval. A review process by a government body or university committee independent of the researchers assesses if use of these data ensures the safety, dignity and rights of the participants. Researchers need to demonstrate to the research ethics board (REB) that their study minimizes harm to participants and respects their autonomy, generates and maximizes benefit (e.g. to society, science, participants) and acts with integrity, fairness and transparency to all stakeholders (e.g. participants, beneficiaries of the research). However, in a systematic review of the utilization of Twitter for health research, only 32% of the studies acquired ethical approvalFootnote 7.

This is an example of technology moving faster than policy: the availability of newer data sources, such as social media, has outpaced the assessment of the ethics of their use. This has led to studies with questionable ethical practices, which cast a shadow on all fields that use big data. An example is the “Tastes, Ties, and Time” study in 2007, where the researchers published an anonymized dataset of a group of university students and a codebook with information about the dataset; the dataset was identifiable from the codebookFootnote 8. Similarly, in 2012, evidence of online emotional contagion was sought, without prior consent, by manipulating the Facebook news feed of thousands of people to see whether doing so changed the sentiment of individuals’ postsFootnote 9.

In this article, we explore issues to do with traditional ethical frameworks in relation to research based on AI, particularly in the field of public health and digital epidemiology. We then present ethical frameworks that allow scientists and policy makers to use data from social media and their applications.

Contemporary ethics

In contemporary science, researchers need ethical approval for the use of human data. This very criterion is the main problem in big data–based research. It raises a seemingly simple question: Does a post or a Tweet represent human data or text data?Footnote 10 Several issues and points of view arise from this question, and the debate is necessary because social media data are increasingly used in several scientific fields, including digital epidemiology.

Currently, studies that use social media data are usually perceived as outside the scope of ethics committees’ evaluation because these data are commonly not considered to be human dataFootnote 11Footnote 12. Many researchers, policy makers and practitioners assume that they can use open-source data, for example, Tweets, public posts on Facebook, public photos on Instagram and Google Trends queries, which do not require passwords to accessFootnote 8Footnote 13. However, for many users of social media, posting publicly does not equate with giving their consent for the post to be used for researchFootnote 8Footnote 11Footnote 12. This issue is not covered by existing ethical review mechanismsFootnote 14.

Furthermore, the ease of access to social media data (in the absence of ethical regulations and using rapid data capture via AI) means that the number of data points is often much larger than in traditional epidemiological datasets. Therefore, decisions about the use and implications of social media data can potentially affect more peopleFootnote 14. For example, the number of people accidentally or maliciously reidentified in a Twitter database is limited only by the resources used to compile and analyse the database, and those resources are far smaller than the resources required by traditional surveillance systemsFootnote 14.

Informed consent

Informed consent in the way it exists in contemporary ethics fits poorly with social media data. Firstly, it is almost impossible to obtain the informed consent of people whose data contribute to digital epidemiology because there are often insufficient resources to contact such high numbers of people who can be living anywhereFootnote 15.

Secondly, to obtain informed consent, scientists need to confirm the identity of the social media usersFootnote 16. There is no way to ensure that the person behind the social media profile is who they claim to be, or to confirm that the social media post was not generated by a bot (i.e. a “robot” responsible for computer-generated social media posts). Because of this complication, some researchers consider consent to the terms of service of a social media platform, which users must give to use the platform, to be a surrogate for informed consentFootnote 16. However, users often do not read the terms of service or do not understand them wellFootnote 17Footnote 18Footnote 19; nor do these terms stipulate the conditions under which the data will be used for research. This calls into question the legitimacy and integrity of using terms of service as a surrogate for informed consent. Many “participants” in digital epidemiology are not aware that their data were collected or usedFootnote 20.

Privacy and anonymization issues

We are becoming increasingly reliant on technology to structure and analyze the data proliferating in our digital societies. Data mining helps researchers find complex and unintuitive data patterns. However, data mining methods can also reveal confidential information from seemingly harmless social media data, for example, political affiliationsFootnote 12Footnote 21. In addition, Wang et al.Footnote 22 reported being able to identify people’s sexual orientation by processing pictures of people from a dating website.

An anonymized dataset is the minimal requirement to protect the identity of subjects in social scienceFootnote 23 or in traditional epidemiologyFootnote 20. According to the Common Rule, also known as 45 CFR 46 Subpart A, the principal regulation for human research from the Department of Health and Human Services of the United StatesFootnote 24, 17 identifiers need to be removed to consider a dataset anonymized. These include, among others, name, location of residence, all dates except the year and biometric identifiersFootnote 25. The Canadian Institutes of Health Research (CIHR), the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Social Sciences and Humanities Research Council (SSHRC) list similar identifiersFootnote 26. However, removing the 17 Common Rule identifiers is often not enough to ensure a dataset is anonymized. This is because social media data are highly complex (i.e. have high dimensionality). Many non-traditional attributes can enable reidentification, for example the structure of a person's social networks (i.e. human connections) across multiple social media platformsFootnote 15Footnote 27. The advancements in AI algorithms and in the computational power available to extract information and assess patterns mean it is no longer possible to have anonymous databasesFootnote 28Footnote 29. Many examples in the scientific literature demonstrate this issue by reidentifying an anonymized and subsequently published datasetFootnote 12Footnote 21.
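The point that removing direct identifiers does not guarantee anonymity can be illustrated with a small sketch. It is not the method of any study cited here; it borrows the standard notion of k-anonymity (the size of the smallest group of records that share the same combination of attribute values) as a stand-in. The records, the field names (`city`, `birth_year`, `language`) and the function `smallest_group_size` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def smallest_group_size(records, quasi_identifiers):
    """Return k, the size of the smallest group of records that share the
    same values for the given quasi-identifier fields.

    k == 1 means at least one record is unique on those fields and could,
    in principle, be singled out by anyone who knows those values.
    """
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers)
        for record in records
    )
    return min(groups.values())


# Hypothetical toy dataset: names and account handles are already removed.
records = [
    {"city": "Ottawa", "birth_year": 1990, "language": "fr"},
    {"city": "Ottawa", "birth_year": 1990, "language": "fr"},
    {"city": "Ottawa", "birth_year": 1990, "language": "en"},
    {"city": "Toronto", "birth_year": 1985, "language": "en"},
    {"city": "Toronto", "birth_year": 1985, "language": "en"},
]

# With one attribute, every record hides in a group of at least two.
print(smallest_group_size(records, ["city"]))  # 2

# With three attributes combined, one record becomes unique.
print(smallest_group_size(records, ["city", "birth_year", "language"]))  # 1
```

Each added attribute can only split groups further, so k can only stay the same or shrink. That is the sense in which high-dimensional social media data are hard to anonymize, although real reidentification attacks (e.g. those based on network structure) are considerably more involved than this sketch.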

The common good

The common good has its roots in the utilitarian vision of ethics. In this vision, the good that research can do is weighed against the potential harm to individuals. A certain level of harm can be tolerated if the result is “positive morality.” In the context of social media, the harm is mostly an invasion of privacyFootnote 30. People are more willing to sacrifice their privacy if they perceive that usage of their data will benefit the common goodFootnote 31Footnote 32. For the most enthusiastic social media users in the Mikal et al. studyFootnote 31, “it’s cool when it’s stuff […] like the flu, because then that’s how [public health decision-makers] know to get the vaccines to a place”. Similarly, for the social media users in the Golder et al. studyFootnote 32, it “could give a voice to patients and others groups, uncover true prevailing issues, and improve patient care.” Factors that influence people’s willingness to share their data for the common good include the type of research and the researchers’ affiliations (i.e. university, company, government)Footnote 32Footnote 33Footnote 34.

Ultimately, while the majority of people agree with the concept of the common good, there is no agreed-upon threshold at which an invasion of privacy can, and should, be tolerated for public health research.

New ethical frameworks

New frameworks that respond to new ethical challenges regarding the use of AI for research have been proposed by the Association of Internet Researchers (AoIR)Footnote 35 and Zook et al.Footnote 36 (Table 1).

Table 1: Proposed ethical frameworks

Authors: AoIRFootnote 35
Guidelines:
1) Protect vulnerable populations
2) Assess potential harm from research studies on a case-by-case basis
3) Consider data from humans to be human
4) Balance the rights of all involved parties (i.e. the right of privacy for the subject and the right to do research for the scientist)
5) The temporal variability of ethical considerations must be resolved when it occurs
6) Discuss ethical problems with qualified professionals when these arise

Authors: Zook et al.Footnote 36
Guidelines:
1) Acknowledge that data are people and can do harm
2) Recognize that privacy is more than a binary value
3) Guard against the reidentification of your data
4) Practice ethical data sharing
5) Consider the strengths and limitations of your data; big does not automatically mean better
6) Debate the tough, ethical choices
7) Develop a code of conduct for your organization, research community or industry
8) Design your data and systems for auditability
9) Engage with the broader consequences of data and analysis practices
10) Know when to break these rules

Following a framework can help to legitimize research in the eyes of the populationFootnote 37. Since the AoIR frameworkFootnote 35 is accepted in the scientific literature, with the Association being one of the most cited organizations in terms of ethics and big data, scientists may want to use this framework rather than the lesser-known Zook et al. framework. However, the Zook et al.Footnote 36 framework is less restrictive and easier to follow.

Many points in these guidelines are already considerations that public health scientists have to address (e.g. protection of vulnerable populations, the potential harms of the study, the anonymization process). Public health scientists already frequently use highly confidential data. The main differences between social media data and traditional data are the way the data are accessed, the original intent for which the data are produced and the limited ability of social media users to provide informed consent. The data still represent humans, and their use can have unintended consequences, such as identifying the individual behind social media content. Public health scientists have an obligation to protect the individuals behind their data while balancing this with the common good; where that balance lies is a subjective decision on which agreement is extremely difficult.


Discussion

As technology advances rapidly and more research is done with AI and social media data, an established ethical framework is essential to prevent improper use of social media data in public health applications. Researchers in public health, computer science and ethics need to come together to develop a framework that will help scientists conduct responsible research. In general, existing frameworks have been developed for use in every scientific field. However, public health-related decisions can have an important impact on the population, going as far as restricting the freedom of movement of persons in the case of a highly infectious disease, for exampleFootnote 20.

The REB is an important part of the process to ensure the research is within the ethical framework. Inherent in using open-source social media data is that people do not know about, or do not have the opportunity to consent to, their data being used. Thus, the REB provides the means to defend the safety, dignity and rights of the participants as stipulated through the ethical framework.

The REB and ethical framework are also needed to address the limitations of social media data. Many social media platforms are available, and which ones predominate can differ by location. For example, Twitter and Facebook are used extensively in Western countries but banned in the People’s Republic of China; the Chinese government authorizes the use of Sina Weibo and WeChat as the respective Twitter and Facebook equivalents. Furthermore, the demographics of use can vary among applications. Older generations tend to use Twitter and Facebook, while younger generations tend to use Snapchat, Instagram and TikTok. This is known as the digital divideFootnote 38. Some profiles may be underrepresented (e.g. children and the elderly), depending on the social media platform.


Conclusion

The ethical issues to do with using social media data for AI applications in public health research centre around whether these data are considered human. Current ethical frameworks are inadequate for public health research. To prevent further misuse of social media data, we argue that considering social media data to be human would facilitate an REB process that ensures the safety, dignity and rights of social media data providers. We further propose that more consideration be given to the balance between the common good and the intrusion of privacy. Collaboration between ethics researchers and digital epidemiologists is needed to develop ethics committees and guidelines and to oversee research in the field.

Authors’ statement

  • JPG: Writing–original draft, project administration, conceptualization
  • VN: Writing–reviewing & editing, conceptualization, supervision
  • JN: Writing–reviewing & editing, conceptualization, supervision
  • EER: Writing–reviewing & editing, conceptualization, supervision

Conflict of interest



Acknowledgements

The authors would like to acknowledge S de Montigny, N Barrette and P Gachon for their comments.


Funding

This work is supported by the Public Health Agency of Canada.
