Paper
Policymaking to preserve privacy in disclosure of public health data: a suggested framework
Mehrdad A Mizani, Nazife Baykal
Department of Medical Informatics, Middle East Technical University, Informatics Institute, Ankara, Turkey
Correspondence to Mehrdad A Mizani, Department of Medical Informatics, Middle East Technical University, Informatics Institute, Ankara 6531, Turkey; mehrdadmizani{at}gmail.com, mehrdadam{at}gmail.com

Abstract

Health organisations in Turkey gather a vast amount of valuable individual data that can be used for public health purposes. The organisations use rigid methods to remove some useful details from the data while publishing the rest of the data in a highly aggregated form, mostly because of privacy concerns and lack of standardised policies. This action leads to information loss and bias affecting public health research. Hence, organisations need dynamic policies and well-defined procedures rather than a specific algorithm to protect the privacy of individual data. To address this need, we developed a framework for the systematic application of anonymity methods while reducing and objectively reporting the information loss without leaking confidentiality. This framework acts as a roadmap for policymaking by providing high-level pseudo-policies with semitechnical guidelines in addition to some sample scenarios suitable for policymakers, public health programme managers and legislators.

  • Confidentiality/Privacy
  • Information Technology
  • Policy Guidelines/Inst. Review Boards/Review Cttes.
  • Public Health Ethics

Introduction

In recent years, the number of organisations that gather public health data has increased considerably. This facilitates data analysis and sharing among organisations and scientific disciplines.1 Scattered among organisations from different disciplines, health data are valuable sources for epidemiological research, population health evaluation, disease surveillance and healthcare services improvement.

The building block of public health research is individual or person-specific data gathered by healthcare providers, public health agencies, registries and census bureaus. However, sharing person-specific data poses a threat to the privacy of individuals. Health-related datasets include confidential information that, without required precautions, may lead to undesirable consequences, including discrimination in employment, insurance and government services.2 Privacy concerns have led many health organisations to publish data in aggregated form instead of individual form. Although aggregated data provide overall information for ecological studies, they diminish data integrity and affect the generalisability of the results.

The available solutions proposed to protect the privacy of individual data are mostly theoretical and technical. Public health organisations are usually reluctant to use those techniques largely because of the complex nature of privacy protection. In fact, privacy protection of individual data requires more than implementation of a certain algorithm because it requires robust policies, dynamic procedures, collaborative interorganisational effort and multidisciplinary expert intervention. Structured procedures are essential to answer many questions that cannot be addressed by technicians only.3 Policies and regulations are also essential to balance confidentiality measures and the ability to improve public health.4

Public health professionals in Turkey admit that a valuable amount of data could be exploited to improve health policies. The available data, however, are only accessible in highly aggregated form. The reluctance of organisations to publish non-aggregated data is due to uncertainty about the risks, the complexity of choosing the best solutions, legal and ethical concerns, and the lack of national guidelines and standards. Thus, this paper proposes a general framework for policymaking for the preservation of privacy in disclosure of individual health data.

Technical and policy aspects of protecting the privacy of individual data

Individual data contain attributes that can lead to subject identification. Examples of such directly identifiable attributes are one's name and social security number. A necessary, although insufficient, measure to protect privacy is to remove directly identifiable attributes through a process called de-identification. However, it has been shown that 87% of the population of the USA can be uniquely re-identified by using a combination of gender, five-digit zip code and date of birth in de-identified data.5 One of the methods used to reduce the risk of re-identification is called k-anonymity,5–7 which guarantees that a person's record in a de-identified dataset is indistinguishable from those of at least k−1 other persons. In k-anonymity, the attributes that can be combined to infer individual identity are called quasi-identifiers. To achieve k-anonymity, the values of quasi-identifiers are generalised or suppressed. Generalisation replaces the value of a quasi-identifier with a less specific value, while suppression conceals it completely.8 Generalisation and suppression are applied to guarantee that every combination of quasi-identifier values present in the published data appears at least k times. Table 1 shows an example of a 2-anonymous dataset with age, gender and zip code as the quasi-identifiers.

Table 1

The 2-anonymous dataset with age, gender and zip code as the quasi-identifiers
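The k-anonymity condition described above can be checked mechanically. The following is a minimal sketch, assuming hypothetical records and hypothetical generalisation rules (10-year age bands and three-digit zip prefixes); it does not reproduce the contents of table 1 or the authors' simulation environment.

```python
from collections import Counter

def generalise(record):
    """Coarsen the quasi-identifiers of one record (hypothetical rules)."""
    age, gender, zip_code = record
    decade = (age // 10) * 10
    age_band = f"{decade}-{decade + 9}"   # e.g. 34 -> "30-39"
    zip_prefix = zip_code[:3] + "**"      # e.g. "06531" -> "065**"
    return (age_band, gender, zip_prefix)

def is_k_anonymous(records, k):
    """True if every quasi-identifier combination occurs at least k times."""
    counts = Counter(records)
    return all(n >= k for n in counts.values())

# Hypothetical records: (age, gender, zip code). Not taken from the paper.
raw = [
    (34, "F", "06531"),
    (37, "F", "06520"),
    (52, "M", "06110"),
    (58, "M", "06145"),
]

generalised = [generalise(r) for r in raw]
print(is_k_anonymous(raw, 2))          # every raw record is unique
print(is_k_anonymous(generalised, 2))  # each generalised combination appears twice
```

In this sketch the raw records fail the check for k=2, while the generalised records pass it at the cost of less precise age and zip code values.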

While generalisation decreases the chance of re-identification, it diminishes the information content, making the data less useful for public health research. This information loss affects the accuracy of research and the interpretation of the results. For example, complete suppression of the age attribute, shown as ‘***’ in the sixth and seventh rows of table 1, makes it impossible to infer any causal relationship between age and health condition. The desirable k-anonymity outcome optimally balances privacy protection and information content. However, the requirements of real data publishing do not always coincide with the behaviour of the optimal algorithm. Privacy or information content requirements may vary based on the nature of the data, publicly accessible related data and state regulations. Hence, the value of various parameters of the algorithm and the degree of privacy risk an organisation is willing to take depend heavily on the strategies, policies, legislation, context and nature of the project.

In clinical uses of health data, confidentiality is enforced by restricting access to authorised people. In public health, however, data might be disclosed publicly without prior information about who would have access to them. As a result, the definition and protection of privacy in public health depend on the context of data disclosure. Therefore, intervention by experts from different disciplines and dynamic procedures are required to set the contextual parameters of the technical countermeasures and to reduce the risk of re-identification.

Joining data also requires coordination and commonly accepted data standards established among data-holders. Data sharing across organisational or national borders becomes complicated when data-holders assert different privacy requirements. Interoperable data sharing requires adherence to common standards and policies.9 Thus, a roadmap specifying the most applicable methods and their details needs to be determined by a group of experts and policymakers. With all of the above-mentioned limitations and contextual requirements of data disclosure, a national framework with robust policies, guidelines and best practices would facilitate privacy protection in secondary uses of health data.10

Problems of public health data disclosure in Turkey

Although patient rights are accepted as universal values, practical implementation varies from one country to another.11 In Turkey, patient privacy is mentioned in the statute of patient rights.11,12 Statements about privacy in this regulation are generally related to clinical uses of data. The only statement related to the secondary uses of data indicates that the identity of patients cannot be disclosed without patient consent for research purposes. Privacy, from an ethical point of view, is mentioned in the deontology of medical practice in Turkey.11,13 It contains a general statement indicating that patient identity should not be disclosed for reporting and publishing purposes. The only specific example in this document is the requirement to remove a patient's name from abortion reports.

Existing regulations in Turkey mostly focus on clinical practices and contain only a general requirement to remove patient identity from disclosed data. Aware that removing directly identifying information is inadequate to protect privacy, and because of legal and ethical concerns, data-holders refuse to disclose individual data. Data-holders that gather public health data in Turkey include the Turkish Statistical Institute (TurkStat), the Family Medicine Information System and the Community Health Centres Information System of the Ministry of Health. The policies of these organisations limit the disclosure of public health data to aggregated or distorted datasets to prevent privacy leaks. TurkStat, which acts as a warehouse for all census data, grants access to certain categories of individual data at its Data Research Centre.14 TurkStat provides publicly accessible data, including health-related categories, only in highly aggregated form without any detail about information loss caused by aggregation.

Proposed framework for effective policymaking to preserve privacy in secondary uses of health data

We developed a framework for policymaking to preserve privacy in disclosing public health data, preferably in non-aggregated form. This framework is part of an academic research project and acts as a means to examine the privacy protection methods and procedures in public health. The target users are health informaticists, public health specialists, project managers and policymakers. The framework is based on a generic and adaptable approach to highlight the unique requirements of privacy protection and policymaking in secondary uses of health data. We prepared some example scenarios, including data aggregation, tailoring the data based on researcher requirements, temporal disclosures and expert intervention. These examples are intended to highlight the most common issues by suggesting appropriate solutions that depend on the unique context of data disclosure.

Figure 1 depicts the framework at its highest level of abstraction. To divide the tasks into more manageable steps, we developed a modular framework with each module providing instructions for policymaking from design to evaluation. These instructions are general and provide pseudo-policies together with the guidelines and procedures they contain. Pseudo-policies in our framework refer to high-level and context-independent policy statements and information flows that outline the general objective instead of detailed procedures. Figure 2 depicts the major modules of the framework. This framework establishes a foundation to address the requisites of policymaking to preserve privacy in Turkey or similar developing countries and to evaluate policies in developed countries.

Figure 1

High-level abstraction of the framework.

Figure 2

Levels and modules of the framework.

General and dynamic policymaking roadmap

The pseudo-policies provided by this framework are high level and semitechnical. Therefore, these policies are expandable to and adaptable in different contexts. The framework, as shown in figure 2, has two levels. The first level covers the issues of designing high-level policies based on strategic privacy goals that change over time or from one organisation to another. The second level contains guidelines to implement procedures to meet the dynamic goals of the first level. We simulated a public health data publishing and sharing environment and applied it to the UCI adult dataset15 to emphasise the technical aspects of anonymisation, as it is the de facto dataset for testing algorithms based on k-anonymity. We also used data from the Global Adult Tobacco Survey of Turkey in 200816 to provide examples based on actual health-related data. These examples include extreme cases demonstrating the effects of data characteristics and parameters on the results, all of which emphasise that the outcome is dependent on the context and nature of the data and strategic decisions.

Privacy as not solely a technical issue: multidisciplinary expert review

Our framework emphasises expert review as an essential aspect of policymaking. The reason for employing expert review is that the privacy risk mitigation measures are usually context-dependent and do not necessarily follow the optimised solutions provided by technical algorithms. Additionally, privacy and policy issues are multidisciplinary, requiring teamwork of experts from different fields, including informatics, public health, social sciences, ethics and epidemiology. This expert intervention is applied in the policy cycle by either defining high-level strategic priorities or by determining detailed parameters affecting the behaviour of the preferred anonymisation method. It is also an essential aspect of multiorganisational data publishing where collaborative anonymisation and joining procedures are applied.

Standardisation and interoperability: data preparation module

The data preparation module provides guidelines to determine data specifications, including the syntax, semantics, content and actions required for anonymisation. Classification of attributes based on their identifiability is performed in this module. Regardless of the preferred data formats used internally by each data-holder, this module provides guidelines to transform data to common sets of standards for interoperability.

Contextual solutions: anonymisation and parameters modules

The anonymisation module is the core of the framework focusing on the methods based on k-anonymity. These methods facilitate the disclosure of individual data while keeping the privacy risks within an acceptable margin. The choice of the anonymisation methods and their outcome heavily depend on the contextual parameters specified by the data-holder. To methodically distinguish the different choices and be able to manage these using pre-established guidelines, we elicited different parameters affecting the outcome. The values of these parameters, handled in the parameters module, are determined by procedures or experts guided by policies, priorities, regulations and the context of data publishing. These parameters specify the details of anonymisation, including the temporality, degree of centrality of various tasks, sensitivity of data and levels of generalisation.

Temporal and incremental data disclosures for long-term projects: audit module

Today, with increasing demand for up-to-date data, datasets are constantly growing and are released repeatedly, which makes it possible to link successive releases over time and infer the identity of individuals.6,17,18 To prevent this problem, it is necessary to audit the specifications of each release along with the disclosed data. In temporal disclosures of growing and changing datasets, data-holders must base their future data release on the previously audited disclosures. The ‘audit module’ provides guidelines to audit the details about previous disclosures and to resolve any identified problem.
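One way to picture such an audit is a conservative check that compares a planned release with an earlier one. The sketch below is an illustration under stated assumptions, not the audit module itself: it assumes the data-holder keeps a hypothetical internal person id for each record (never published) and flags people whose set of indistinguishable peers shrinks below k once the two releases are combined.

```python
from collections import defaultdict

def groups_by_person(release):
    """Map each person id to the set of ids sharing their quasi-identifiers.

    `release` maps an internal person id (kept by the data-holder, never
    published) to the generalised quasi-identifier tuple that was released.
    """
    members = defaultdict(set)
    for person, quasi in release.items():
        members[quasi].add(person)
    return {person: members[quasi] for person, quasi in release.items()}

def audit_releases(previous, current, k):
    """Return ids whose candidate set drops below k when releases are linked."""
    prev_groups = groups_by_person(previous)
    curr_groups = groups_by_person(current)
    at_risk = []
    for person in previous.keys() & current.keys():
        candidates = prev_groups[person] & curr_groups[person]
        if len(candidates) < k:
            at_risk.append(person)
    return sorted(at_risk)

# Hypothetical releases: each is 2-anonymous on its own.
release_1 = {"a": ("30-39", "*"), "b": ("30-39", "*"),
             "c": ("40-49", "*"), "d": ("40-49", "*")}
release_2 = {"a": ("*", "F"), "c": ("*", "F"),
             "b": ("*", "M"), "d": ("*", "M")}

print(audit_releases(release_1, release_2, k=2))  # combined, everyone is unique
print(audit_releases(release_1, release_1, k=2))  # same grouping, nobody flagged
```

In this example the first release hides gender and the second hides age, so each is safe alone but the combination isolates every person. A flagged release would be revised, for instance by reusing the earlier generalisation, before disclosure.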

Whole picture by linking scattered data: joining module

In Turkey, diverse organisations gather public health data on similar topics or from the same group of people. Joining these scattered data provides a thorough picture of the situation under study. However, there are concerns about privacy issues in such cross-organisational data joining. The already de-identified public health data do not include any clues about the unique identification of individuals, which leads to duplicated records when datasets are combined. Moreover, it is extremely difficult, if not impossible, to join data from independent data-holders that gather data in different formats and levels of detail. The joining module provides guidelines to link data based on records belonging to the same individuals.19–21 To provide the most applicable methods for data joining, we defined three scenarios, namely, central, distributed and hybrid models. These scenarios differ based on the role and degree of involvement of a central organisation or expert team as a liaison to join the data from independent data-holders. Based on the scenario and priorities, the joining of the data can be accomplished before or after anonymisation as depicted in figure 2.
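As an illustration of linking records that belong to the same individual without exchanging identifiers, the sketch below uses keyed hashing (HMAC-SHA256). This is a commonly used technique swapped in for illustration and is not necessarily the method the joining module recommends. It assumes that each data-holder computes the token before de-identification and that all parties share a secret key; the key, field names and records are hypothetical.

```python
import hashlib
import hmac

def link_token(secret_key, national_id, birth_date):
    """Derive a pseudonymous linkage token from identifying fields."""
    message = f"{national_id.strip()}|{birth_date.strip()}".encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()

def join_by_token(left, right):
    """Inner-join two lists of (token, payload) pairs on the token."""
    right_index = {token: payload for token, payload in right}
    return [
        (token, payload, right_index[token])
        for token, payload in left
        if token in right_index
    ]

# Hypothetical key agreed between data-holders (assumption for this sketch).
SHARED_KEY = b"example-key-agreed-between-data-holders"

registry = [
    (link_token(SHARED_KEY, "12345678901", "1980-04-02"), {"smoker": True}),
]
survey = [
    (link_token(SHARED_KEY, "12345678901 ", "1980-04-02"), {"region": "Ankara"}),
    (link_token(SHARED_KEY, "98765432109", "1975-11-30"), {"region": "Izmir"}),
]

joined = join_by_token(registry, survey)
print(len(joined))  # only the person present in both sources is linked
```

A keyed hash is used rather than a plain hash because identifiers such as national numbers have a small, guessable value space; without the secret key, anyone could recompute tokens for candidate identifiers. Who holds the key would differ between the central, distributed and hybrid scenarios.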

Exporting data accompanied with information loss: export and measurement modules

In practice, anonymised data do not contain details about information loss, which leads to uncertainties in interpreting the results of the study. This uncertainty affects public policies and programmes based on the results of the research. Along with the actual anonymised data, we also worked on methods to report the details of anonymisation and information loss for whole datasets or groups of people. For example, the information loss metrics for groups of people in table 1 might be preferable to those of the whole dataset, as the higher generalisation of the sixth and seventh records would affect the overall information loss. Reporting the amount and details of information loss enables researchers to include this information in their models through objective measurement of the bias and to measure the applicability of the dataset for inference studies and generalising the results to the population.
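To make the difference between per-group and whole-dataset reporting concrete, the following sketch uses a simple range-based loss measure chosen only for illustration; the text does not state which metrics the framework uses. The age domain of 0 to 99, the group names and the records are all hypothetical.

```python
def age_loss(age_range, domain=(0, 99)):
    """Loss in [0, 1] for one generalised age value.

    `age_range` is (low, high), or None when the value is suppressed.
    0 means the exact age was kept; 1 means nothing about age remains.
    """
    if age_range is None:
        return 1.0
    low, high = age_range
    return (high - low) / (domain[1] - domain[0])

def mean_loss(ranges):
    """Average loss over a list of generalised age values."""
    return sum(age_loss(r) for r in ranges) / len(ranges)

# Hypothetical anonymised groups: one lightly generalised, one suppressed.
groups = {
    "group A": [(30, 39), (30, 39)],
    "group B": [None, None],
}

per_group = {name: mean_loss(values) for name, values in groups.items()}
overall = mean_loss([v for values in groups.values() for v in values])

for name, loss in per_group.items():
    print(name, round(loss, 3))
print("whole dataset", round(overall, 3))
```

Here the whole-dataset figure sits between the two groups and hides the fact that one group kept most of its age information while the other kept none, which is the kind of detail a per-group report would expose to researchers.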

Whole policy cycle: shifting from designing to evaluation

The policies derived by using this framework are flexible in that they can be evaluated both technically and procedurally. The audit module provides mechanisms to evaluate the policies from a technical viewpoint. The parameters module enables evaluating and refining both technical and procedural parameters while redesigning the policy to exhibit the desired behaviour or to satisfy the dynamic strategies of the organisation.

Conclusion

The lack of dynamic policies encompassing uniform standards and guidelines to implement procedures for data disclosure and joining is a salient problem in Turkey. This problem is due to a lack of efficient national regulations and guidelines for developing in-depth privacy preserving policies. As a result, health organisations either withhold datasets or disclose aggregated data with high levels of information loss.

To facilitate policymaking in a way that renders health data accessible for public health purposes, we designed a framework that can be used in privacy preserving data disclosure and joining. The framework provides a roadmap to design, implement and evaluate policies for privacy protection in the public health domain without forcing the organisation to adopt a certain policy model or technical solution. Therefore, our framework provides pseudo-policies with high-level semitechnical and procedural guidelines to protect privacy and report the inevitable semantic information loss. The main focus of the framework is on publishing and joining person-specific datasets. To realise this goal, we developed guidelines to choose the most appropriate methods based on k-anonymity in accordance with the nature of the data and the context of data disclosure.

A potential advantage for this study is that few health-related individual data are publicly accessible in Turkey. This matters because the outcome of anonymisation is considerably affected by publicly accessible data disclosed in the past. With no or very few person-specific data disclosed, it is possible to regulate data disclosure initiatives from scratch, which results in higher information content and fewer privacy leaks. Another contribution of this framework is its emphasis on regulatory, policy, ethics and expert intervention issues. The main focus of medical informatics studies in Turkey is on technical issues or clinical procedures. There is a gap in the areas of ethics, social sciences and patient rights, in that the social aspects surrounding public health informatics are usually overlooked.

With an apparent need for wider access to public health data in Turkey, we believe that our framework provides a general roadmap for policymaking based on guidelines for developing, implementing, evaluating and improving procedures for privacy preserving data disclosure and joining. Such a framework contributes to public health research by facilitating the usage of individual health data for secondary purposes while providing systematic privacy protection.

Acknowledgments

We wish to thank Banu Cakir, Erhan Eren, Arda Arikan and Cigdem Toskay for their valuable comments and feedback.

References

Footnotes

  • Contributors MAM proposed and designed the framework, coded the simulation environment in Java, applied the simulation to two different datasets, prepared pseudo-policies, compared methods and information content loss metrics, prepared e-survey questions, pretested the questionnaire, gathered data from internet participants, analysed the results, drafted the paper, revised the manuscript, and responded to the comments of the reviewers. NB monitored the whole study, contributed to the design and improvement of the framework, to pretesting the questionnaire, to the official submission of the questionnaire to mailing lists and to monitoring the survey process, revised the draft paper, and contributed to the analysis of survey results.

  • Competing interests None.

  • Ethics approval Applied Ethics Research Center, Middle East Technical University, Ankara. http://www.ueam.metu.edu.tr/.

  • Provenance and peer review Not commissioned; externally peer reviewed.