11 February 2021

Privacy of public health datasets under scrutiny

Public Health Technology

But a tool used to assess whether data is secure is itself lacking transparency, critics say.


A new data privacy tool, designed to ensure publicly released anonymised datasets remain secure and private, has cybersecurity experts asking questions about its veiled design.

An early version of the Personal Information Factor (PIF) tool, developed by the NSW government and the Cyber Security Cooperative Research Centre in collaboration with CSIRO’s Data61, has already been put to use in the pandemic, analysing privacy risks of sharing deidentified datasets tracking COVID-19 cases across the state.

With the pandemic still in full swing, more personal data is flying around than ever before. Deidentifying data involves removing any personal identifiers and using other software safeguards to prevent reidentification.

In the context of the pandemic, making such information available to policy makers, health experts and researchers has been critical to informing COVID-19 outbreak response.

Nevertheless, whenever anonymised data of any kind is shared publicly online, individuals’ privacy is at stake.

PIF is designed to ensure data shared on open platforms for research purposes and policy making remains secure and preserves individuals’ privacy.

The program uses an algorithm to compute a risk score that tells data custodians whether deidentified data is safe to release, or whether additional data protection measures are required to mitigate the risk of data being matched to individuals.

It works by assessing the type of information contained in an anonymised dataset, how sensitive that data is, and how it might be compromised or possibly reidentified.

“PIF takes a tailored approach to each dataset by considering various attack scenarios used to deidentify information,” said senior research scientist and project lead Dr Sushmita Ruj, from CSIRO’s Data61.

If the program identifies any privacy risks, it makes recommendations on how to transform and secure the data using various techniques, such as data aggregation, Dr Ruj said.

“This process goes on for a couple of iterations to be sure once the data is released, it protects the privacy of the individuals,” she said.
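PIF's own transformations have not been published, so the following is only a rough sketch of the general idea behind data aggregation, not the tool's method. It assumes made-up records holding a postcode and a notification date, and the function names `smallest_group` and `aggregate` are hypothetical. Coarsening the postcode to a three-digit prefix and the date to an ISO week means more records share the same values, so fewer stand alone.

```python
from collections import Counter
from datetime import date

# Made-up records: (postcode, notification date). Not real case data.
records = [
    ("2000", date(2020, 7, 6)),
    ("2000", date(2020, 7, 8)),
    ("2010", date(2020, 7, 7)),
    ("2011", date(2020, 7, 9)),
]

def smallest_group(rows):
    """Size of the smallest group of rows that share identical values."""
    return min(Counter(rows).values())

def aggregate(rows):
    """Coarsen each row to a 3-digit postcode prefix and an ISO week number."""
    return [(postcode[:3], day.isocalendar()[1]) for postcode, day in rows]

print(smallest_group(records))             # exact values: every row is unique
print(smallest_group(aggregate(records)))  # coarsened values: rows share groups
```

The trade-off is the one the article goes on to describe: the coarser the data, the safer it is to release, but the less it says about exactly when and where cases occurred.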

But cybersecurity experts have expressed concerns that there has been no open, rigorous evaluation of the PIF tool, even though the NSW government used it throughout 2020 to analyse the security and privacy risks of releasing datasets about COVID-19 cases and testing rates in NSW.

“This kind of evaluation needs to happen before such a tool is used for real-life data,” said Carsten Rudolph, an associate professor of cyber security at Monash University in Melbourne.

“It’s the responsibility of the people taking this decision [to publish data] to ensure that there was a rigorous analysis of the tool.”

Though little information is currently available, the completed tool is expected to be made available for wider public use in June 2022, at which point a detailed specification and evaluation will also be released, Dr Ruj said.

In the meantime, the NSW government is balancing protecting people’s privacy with the need for detailed public health information about the fast-moving pandemic, using an early version of the PIF tool to minimise the reidentification risk before releasing COVID-19 data to the public.

“We needed to release critical and timely information at a fine-grained level detailing when and where COVID-19 cases were identified,” Dr Ian Oppermann, the NSW Chief Data Scientist, said in a statement about the PIF tool.

“But we also needed to protect the privacy and identity of the individuals associated with those datasets.”

Other cybersecurity experts see it differently and urge caution, especially if the PIF tool is being used to examine other datasets before public release, such as “domestic violence data collected during the COVID-19 lockdown and public transport usage”, according to CSIRO Data61’s press release.

It is “completely irresponsible” to release other people’s data on the basis of a privacy vetting tool without previously having released detailed technical specifications for the tool, said Vanessa Teague, professor of cryptography and CEO of Thinking Cybersecurity.

“If the tool works, then showing us exactly how it works should improve transparency and public trust,” Professor Teague told The Medical Republic.

Professor Teague added that COVID-19 case datasets made public by the NSW government are “obviously, easily, identifiable – if you know an infected person’s postcode and notification date”.

Some other attributes can be easily inferred by linking across datasets, Professor Teague said, before adding that “the defence is that the dataset doesn’t contain very much sensitive information” – which presumably the PIF tool took into account.
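Professor Teague's point about postcode and notification date can be illustrated with a simple uniqueness check. This is a minimal sketch on invented rows, not the PIF algorithm, whose specification has not been released. The field names (`postcode`, `notified`, `source`) and the helper `unique_rows` are assumptions made for the example. Any row that is the only one with its postcode and date combination can be singled out by someone who already knows those two facts about a person.

```python
from collections import Counter

# Made-up released rows; field names are illustrative assumptions.
released = [
    {"postcode": "2000", "notified": "2020-07-06", "source": "overseas"},
    {"postcode": "2000", "notified": "2020-07-06", "source": "local"},
    {"postcode": "2010", "notified": "2020-07-07", "source": "local"},
    {"postcode": "2011", "notified": "2020-07-09", "source": "overseas"},
]

# Attributes an outsider might already know about a person.
QUASI_IDENTIFIERS = ("postcode", "notified")

def unique_rows(rows, keys=QUASI_IDENTIFIERS):
    """Return the rows whose combination of `keys` appears exactly once."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return [row for row in rows if counts[tuple(row[k] for k in keys)] == 1]

singled_out = unique_rows(released)
print(len(singled_out), "of", len(released), "rows are unique on", QUASI_IDENTIFIERS)
```

In this toy data, the two rows sharing postcode 2000 and the same date hide each other, while the other two are exposed along with whatever else the row contains.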

It’s not the first time Professor Teague has shown how people’s personal details can be reidentified in health datasets made public for research and policy development purposes.

In 2016, she alerted the federal Department of Health to a security issue where patients could be reidentified, by name, in a public health dataset released by the federal government under its policy on accessible public data.

By matching unencrypted parts of an open dataset to publicly available information about the listed individuals, Dr Teague and her colleagues at the University of Melbourne showed how “a few mundane facts often suffice to isolate an individual”.

“While the ambition of making more data more easily available to facilitate research, innovation and sound public policy is a good one,” Dr Teague wrote in 2017, after the dataset was taken offline, “there is an important technical and procedural problem to solve: there is no good solution for publishing sensitive complex individual records that protects privacy without substantially degrading the usefulness of the data”.

That still rings true today. It seems a lot more work will be required to validate Data61’s PIF tool and test its potential uses – in full transparency – before cybersecurity experts are comfortable seeing it applied to other, more sensitive public health datasets.