
An IFCA team creates pyCANON: An online tool that informs about the privacy of our data

It is a Python library, now available for public use

7 February 2023

Judith Sáinz-Pardo and Álvaro López, researchers from the Advanced Computing and e-Science group at the Institute of Physics of Cantabria (IFCA, CSIC-UC), have developed pyCANON, a Python library that checks a dataset against established anonymisation techniques to tell users how anonymous their data is and, therefore, what privacy risks it may be exposed to. The work has been published in the journal Nature Scientific Data.

Technologies that handle large amounts of data, known as big data, have grown rapidly in recent years, thanks above all to how easy it has become to process such volumes. Artificial intelligence, machine learning and deep learning enable applications ranging from machine vision to natural language processing and speech recognition. However, building these systems requires large amounts of data in order to train models with a good level of accuracy.

For this reason, IFCA researchers have created pyCANON, a tool that anyone can use without extensive knowledge of privacy techniques or programming, and which verifies the level of privacy of the data so that it can be handled securely.

Judith Sáinz-Pardo explains that "pyCANON was created to provide the research team, and in general anyone who wants to publish a dataset in open access or share it with others, with knowledge of the level of anonymisation of their data, i.e. how anonymous their data is". "The tool provides information about the possible risks to which this information would be exposed, and its resistance to different attacks", she says.

Judith Sáinz-Pardo is one of the creators of pyCANON.

For example, in the case of a database with patients' clinical information, "we would have a very large set of data, and among them would be the quasi-identifiers, which are, for example, the patient's place of residence, age, gender, etc. Then there are the sensitive attributes, i.e. information that we should not know about the patient, and which an attacker should therefore not be able to access", says the researcher. "What pyCANON would do in this case is compare the distribution of these two groups of data in the database to find out how anonymous they are, according to nine different and very useful techniques, each of which guards against a different type of attack", she concludes.
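The idea of comparing quasi-identifiers with sensitive attributes can be illustrated with a minimal sketch. It is not pyCANON's own code or API. The article does not name the nine techniques, so the choice of k-anonymity and l-diversity here is an assumption, and the patient records, column names and function names are all hypothetical. Records that share the same quasi-identifier values form a group: k is the size of the smallest group, and l is the smallest number of distinct sensitive values found in any group.

```python
from collections import defaultdict

# Hypothetical toy records; column names and values are illustrative only.
records = [
    {"age": "30-39", "gender": "F", "city": "Santander", "diagnosis": "flu"},
    {"age": "30-39", "gender": "F", "city": "Santander", "diagnosis": "asthma"},
    {"age": "40-49", "gender": "M", "city": "Torrelavega", "diagnosis": "flu"},
    {"age": "40-49", "gender": "M", "city": "Torrelavega", "diagnosis": "flu"},
    {"age": "40-49", "gender": "M", "city": "Torrelavega", "diagnosis": "diabetes"},
]

QUASI_IDENTIFIERS = ["age", "gender", "city"]
SENSITIVE_ATTRIBUTE = "diagnosis"


def equivalence_classes(rows, quasi_identifiers):
    """Group rows that share the same value for every quasi-identifier."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        groups[key].append(row)
    return groups


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest equivalence class."""
    groups = equivalence_classes(rows, quasi_identifiers)
    return min(len(group) for group in groups.values())


def l_diversity(rows, quasi_identifiers, sensitive):
    """Return the smallest number of distinct sensitive values in any class."""
    groups = equivalence_classes(rows, quasi_identifiers)
    return min(len({row[sensitive] for row in group}) for group in groups.values())


k = k_anonymity(records, QUASI_IDENTIFIERS)
l = l_diversity(records, QUASI_IDENTIFIERS, SENSITIVE_ATTRIBUTE)
print(f"k-anonymity: k = {k}")
print(f"l-diversity: l = {l}")
```

With this toy data, every combination of age range, gender and city is shared by at least two patients, and each group contains at least two different diagnoses. Higher values of k and l make it harder for an attacker to single out a patient or infer a diagnosis.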

Schema to obtain the data anonymisation report, list of quasi-identifiers and sensitive attributes / Nature Scientific Data

The library, which is already available online, has been created so that its use is "very intuitive and very simple", explains the IFCA researcher. "We have prepared it so that the document showing the level of anonymisation of the user's data can be viewed in PDF format, with which everyone is familiar".

pyCANON: Balancing privacy and information 

The main challenge in handling large amounts of data is to maintain the balance between privacy and preserving as much information as possible. A study conducted on the US census revealed that in 81% of cases, three pieces of personal information such as a person's postcode, gender and date of birth are sufficient to identify someone in a database, and would allow certain sensitive information such as salary class or education level to be extracted. 
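As a rough illustration of how such a re-identification risk can be measured, the sketch below counts the share of records whose combination of postcode, gender and date of birth appears only once in a dataset. The records and the function name are hypothetical and bear no relation to the census study.

```python
from collections import Counter

# Hypothetical records: (postcode, gender, date_of_birth). All values are made up.
people = [
    ("39005", "F", "1990-04-12"),
    ("39005", "F", "1990-04-12"),
    ("39005", "M", "1985-11-02"),
    ("39300", "F", "1972-06-30"),
    ("39300", "M", "1972-06-30"),
]


def unique_fraction(rows):
    """Fraction of rows whose combination of values appears exactly once."""
    counts = Counter(rows)
    unique = sum(1 for row in rows if counts[row] == 1)
    return unique / len(rows)


fraction = unique_fraction(people)
print(f"{fraction:.0%} of records are unique on these three attributes")
```

In this toy dataset, three of the five records are unique on the three attributes, so anyone who knows those values for a given person could single out that person's record.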

The problem is compounded in medical environments when dealing with databases containing clinical patient data, where a balance has to be struck between preserving patient privacy and keeping as much information as possible to develop models. 

"We believe it is important to be able to publish data in the open, or share data between entities, knowing the security guarantees you have", says Sáinz-Pardo. 

Rebeca García / IFCA Communication