Ethical Tech Tips: Anonymisation

An introductory series on the technical aspects of data ethics.

Data ethics is all about furthering our evolution as a society, raising the bar on freedom and making in-depth considerations for a technologically driven future.

Valleys of Gold

Data is valuable. It’s a historic map of how your business got to where it is now. It identifies your customers, how they found you, why they’re here and the pathways they took through your processes. It’s super-helpful for marketing and systems development right now but even more important for analysis and training of AI in the coming years. The future doesn’t always follow trends from the past but there are truths to be found and, crucially, you need history for context and to spot change.

But it seems that many business owners do not appreciate the value of this historic data. On my travels giving advice on data ethics, I found unnecessary deletions made as people rushed in a panic to meet GDPR deadlines. The intention of this article is to raise awareness of the importance of your data whilst also considering your duty to the real people behind it.

Respect

When I talk about anonymisation, it’s led by respect for individual rights to privacy and the freedoms we all deserve from permanent records or exploitation of the data we leave behind.

There should be a genuine attitude of not wanting to know who your data relates to. It’s the trends and the patterns found in groups of people that matter, not knowledge of an individual.

So we anonymise…

Anony-what?

In general, anonymisation is the removal of identifying information, such as name or address, from the rest of the data you have on a person. The idea is that you keep some value in what remains but remove any interest in the person who created the data.

It sounds simple, but anonymisation has an arch-enemy: re-identification, the linking of processed data back to an individual, often in surprising ways. Cleaned data can retain too much uniqueness, allowing obvious identification. More commonly, a personal link can be remade by combining other data sources with the data you retained.

A famous example of this occurred when New York City released taxi cab journey information into the public domain. A bright spark matched the location and time of rides to photos on social media of people getting into cabs, thus linking real people to recorded journeys.

It is worth noting that scholars define anonymisation as a thorough and irreversible process where re-identification is not possible by any means. They would say that true anonymisation is devoid of all personal data making it useless for analytics. This leaves us with a conundrum:

"Data can be either useful or perfectly anonymous but never both." (Paul Ohm)

What most schemes achieve instead is de-identification: the removal of identifying info plus the use of other techniques that reduce the uniqueness of the remaining data so that re-identification becomes impractical.

Re-identification should be at the forefront of any anonymisation decisions you make. Finding the balance between removing everything and retaining enough value is unique to each situation.

Keep it Secret, Keep it Safe

The destination of your data will influence your decisions about when and how to anonymise.

Data shared outside of your organisation is at the greatest risk. It’s wise to only ever share de-identified data, and then only the minimum amount required for any third party to achieve their aims. In some cases, aggregating data into broad groups led by purpose is best. I mention some techniques to ensure privacy of shared data in the next section.

Access control is emerging as a good solution too. A dynamic abstraction layer sits atop your data source, controlling what can be retrieved and by whom, based on various rules and conformance with privacy law. I think that these layers will be commonplace in time as they address many privacy concerns whilst still allowing data to be freely available for social good.
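To make the idea of such a layer a little more concrete, here is a minimal sketch in Python. The role names, field names and the FIELD_RULES table are all hypothetical, invented for illustration; a real layer would also handle authentication, auditing and legal rules.

```python
# Hypothetical roles and field names, purely for illustration.
FIELD_RULES = {
    "analyst": {"birth_years", "area", "plan"},
    "support": {"name", "email", "plan"},
}

def fetch_for_role(records, role, rules=FIELD_RULES):
    """Return copies of the records containing only the fields the role may see."""
    allowed = rules.get(role, set())  # unknown roles see nothing
    return [
        {field: value for field, value in record.items() if field in allowed}
        for record in records
    ]

records = [
    {"name": "Jane Example", "email": "jane@example.com",
     "birth_years": "1985-1989", "area": "SW1A", "plan": "premium"},
]

print(fetch_for_role(records, "analyst"))  # no name or email
print(fetch_for_role(records, "visitor"))  # [{}]
```

The point of the design is that callers never touch the raw records; they only ever receive the view their role allows.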

Even when data is staying within your own walls you still have obligations to the owner of the data. Privacy laws such as the GDPR require that personal data is not kept for longer than the original purpose of collection needs. This is a good attitude to take in general. De-identify or delete as soon as possible, but note that some industries require retention of original data for a period of time. Check with your Data Protection Officer.

Above & Beyond

After deciding when to anonymise, how do we go about it? There are many techniques for removing the person from your personal data. I will touch briefly on the beginnings of such a process.

Step one is to remove direct identifiers such as name, date of birth, address, government IDs, phone numbers and email addresses. It is common practice to replace those fields with the text REDACTED (in caps so that it stands out from active data).
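As a rough illustration of step one, the sketch below replaces a set of direct identifier fields with the text REDACTED. The field names and the sample record are assumptions made up for the example, so adjust them to your own schema.

```python
# Hypothetical field names; adjust these to match your own schema.
DIRECT_IDENTIFIERS = {
    "name", "date_of_birth", "address", "government_id", "phone", "email",
}

def redact_direct_identifiers(record, fields=DIRECT_IDENTIFIERS):
    """Return a copy of the record with direct identifiers replaced by 'REDACTED'."""
    return {
        key: "REDACTED" if key in fields else value
        for key, value in record.items()
    }

customer = {
    "name": "Jane Example",
    "email": "jane@example.com",
    "phone": "01234 567890",
    "plan": "premium",
    "signup_channel": "referral",
}

redacted = redact_direct_identifiers(customer)
print(redacted)
```

The function returns a new dictionary rather than changing the original, which makes it easier to check the result before anything is overwritten.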

Date of birth and location are commonly required in a de-identified dataset, so some elements are retained. Your own data requirements will dictate exactly how you modify dates but, as an example, the month and day could be reset. This is generalisation: taking something specific to a few people and transforming it to be more general to a larger group, thereby making it harder to re-identify. In addition, randomising the birth year by a few years (adding noise) or changing it to a bracket of values (intervals) is good practice.
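The three date treatments mentioned here (resetting month and day, adding noise, and using intervals) could be sketched as follows. The function names, the two-year shift and the five-year bracket width are illustrative assumptions, not recommendations.

```python
import random
from datetime import date

def reset_month_and_day(dob):
    """Generalise a date of birth by keeping the year and resetting to 1 January."""
    return date(dob.year, 1, 1)

def add_year_noise(year, max_shift=2, rng=random):
    """Shift a birth year by a random whole number of years in [-max_shift, max_shift]."""
    return year + rng.randint(-max_shift, max_shift)

def year_bracket(year, width=5):
    """Replace a birth year with the interval it falls in, e.g. 1987 -> '1985-1989'."""
    start = year - (year % width)
    return f"{start}-{start + width - 1}"

dob = date(1987, 6, 23)
print(reset_month_and_day(dob))   # 1987-01-01
print(year_bracket(dob.year))     # 1985-1989
print(add_year_noise(dob.year))   # somewhere between 1985 and 1989
```

Which treatment suits you depends on what the analysis needs: intervals keep age bands intact, whereas noise keeps a usable year while making any single value unreliable.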

Postal codes can be translated to GPS coordinates and adjusted to the nearest town rather than street level - generalisation again. Alternatively, the outward part of a postal code (the characters before the space in a UK postcode) can be retained and the inward part dropped.
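Here is a small sketch of both location ideas. It assumes UK-style postcodes, where the inward part is always the last three characters and the outward (outbound) part is whatever comes before it. Rounding coordinates is a crude stand-in for snapping to the nearest town; one decimal place of latitude is roughly 11 km.

```python
def outward_code(postcode):
    """Keep the outward part of a UK-style postcode and drop the inward part.

    Assumes the inward part is always the last three characters, which holds
    for standard UK postcodes; other countries need their own rule.
    """
    compact = postcode.replace(" ", "").upper()
    if len(compact) < 5:
        raise ValueError(f"Postcode too short to split: {postcode!r}")
    return compact[:-3]

def coarsen_coordinates(latitude, longitude, decimals=1):
    """Round coordinates so they point at a broad area rather than a street."""
    return round(latitude, decimals), round(longitude, decimals)

print(outward_code("SW1A 1AA"))               # SW1A
print(outward_code("m1 1ae"))                 # M1
print(coarsen_coordinates(51.5014, -0.1419))  # (51.5, -0.1)
```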

In both of these examples consider the cardinality of the data after generalisation. It may be that the modified data still only relates to a few people, or even just one, making it possible to infer identity given the right kind of secondary dataset. A technique to measure this is 𝒌-anonymity, where the aim is for each person’s generalised values to be shared by at least 𝒌-1 other individuals, so that every group contains at least 𝒌 people. An extension of this technique is 𝒍-diversity, which ensures that each of those groups contains at least 𝒍 differing values of any sensitive field.
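A simple way to check these measures is to group records by their generalised fields and look at the smallest group. The sketch below does that for a made-up dataset; the field names and values are invented, and the diversity check is the simplest variant (counting distinct sensitive values per group).

```python
from collections import defaultdict

def group_by_quasi_identifiers(records, quasi_identifiers):
    """Group records that share the same values for the generalised fields."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[field] for field in quasi_identifiers)
        groups[key].append(record)
    return groups

def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group: the k that the dataset actually achieves."""
    groups = group_by_quasi_identifiers(records, quasi_identifiers)
    return min(len(group) for group in groups.values())

def l_diversity(records, quasi_identifiers, sensitive_field):
    """Smallest number of distinct sensitive values found in any one group."""
    groups = group_by_quasi_identifiers(records, quasi_identifiers)
    return min(
        len({record[sensitive_field] for record in group})
        for group in groups.values()
    )

# Invented sample data for illustration only.
records = [
    {"birth_years": "1985-1989", "area": "SW1A", "condition": "asthma"},
    {"birth_years": "1985-1989", "area": "SW1A", "condition": "diabetes"},
    {"birth_years": "1985-1989", "area": "SW1A", "condition": "asthma"},
    {"birth_years": "1990-1994", "area": "M1", "condition": "asthma"},
    {"birth_years": "1990-1994", "area": "M1", "condition": "asthma"},
]
quasi = ["birth_years", "area"]
print(k_anonymity(records, quasi))               # 2
print(l_diversity(records, quasi, "condition"))  # 1
```

In this sample every person hides in a group of at least two, yet the second group has only one condition in it, so knowing that someone is in that group reveals their condition. That is the gap 𝒍-diversity is meant to expose.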

Other data categories to consider are layers of family information such as mother and father names and children’s dates of birth. Free-text notes could contain anything, so sanitise them only if you’re confident of identifying all personal info within; otherwise remove them. The same goes for stored copies of documents. There may also be IP addresses in access logs and exported data in files or reports. Look far and wide in your search for personal information.
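Pattern matching can catch some of the more regular items, such as email addresses and IPv4 addresses in notes or logs. The sketch below uses deliberately simple patterns that are assumptions for illustration; they will miss names, nicknames and anything irregular, which is exactly why free text is risky to keep.

```python
import re

# Simplified patterns for illustration; neither covers every valid form.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def scrub_text(text):
    """Replace email addresses and IPv4 addresses with the text REDACTED."""
    text = EMAIL_PATTERN.sub("REDACTED", text)
    return IPV4_PATTERN.sub("REDACTED", text)

note = "Customer emailed from jane@example.com, last login from 203.0.113.42."
print(scrub_text(note))
# Customer emailed from REDACTED, last login from REDACTED.
```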

When deciding what to de-identify and how far to take it, I find it best to err on the side of caution. Re-identification is a creative endeavour that is often achieved in unexpected ways, so go above and beyond the call to protect privacy.

It’s important to remember that retaining any kind of personal data makes this a legal issue so be aware of your obligations when approaching anonymisation, perhaps taking advice along the way. If you have a Data Protection Officer then ask their advice - they’ll be happy to see you.

As with any data privacy action, documenting the decisions made and the reasoning behind them is helpful. I would also include any known risks and possibilities of re-identification.

Note that there are various anonymisation terms used interchangeably from country to country and between legal and tech fields so make clear definitions in your documents.

Also be aware of whether you’re merely pseudonymising. As the name implies, it is anonymisation in disguise, not the real thing. Two examples are replacing identifying fields with other unique identifiers and masking data with encryption. In the eyes of the GDPR, pseudonymised data is still classified as personal data and within scope of the law.
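To show why a replacement identifier is not the same as anonymity, here is a sketch that swaps an email address for a keyed hash using Python's standard hmac module. The key shown is a placeholder assumption; a real key would be generated randomly and stored securely.

```python
import hashlib
import hmac

def pseudonymise(value, secret_key):
    """Replace an identifier with a keyed hash (HMAC-SHA256) of itself."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()

key = b"example-secret-key"  # placeholder only; never hard-code a real key

token_a = pseudonymise("jane@example.com", key)
token_b = pseudonymise("jane@example.com", key)
print(token_a == token_b)  # True
```

The same person always receives the same token, so their records can still be linked together, and anyone holding the key can recompute the token for a known email address. The link to the individual has been hidden, not removed.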

Let’s Learn

If this introduction has piqued your interest then I would recommend reading these sound opinion pieces:

Opinion 05/2014 on Anonymisation Techniques by the Article 29 Data Protection Working Party

Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization by Paul Ohm

I can’t finish without mentioning differential privacy, a valuable approach that is developing well. Apple have a good introductory explanation of their usage.
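For a flavour of the idea, here is a minimal sketch of one textbook building block, the Laplace mechanism, applied to a simple count. It is not a description of Apple's system; the function names and the epsilon value are assumptions chosen for illustration.

```python
import random

def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) noise.

    The difference of two independent Exponential(1) draws follows a
    Laplace(0, 1) distribution, which is then stretched by scale.
    """
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def private_count(true_count, epsilon, rng=random):
    """Release a count with noise added.

    Adding or removing one person changes a count by at most 1, so the
    noise scale is 1 / epsilon. Smaller epsilon means more noise.
    """
    return true_count + laplace_noise(1.0 / epsilon, rng)

print(private_count(100, epsilon=0.5))  # a noisy value near 100
```

Each individual answer is slightly wrong, but averages over many people stay useful, which fits the earlier point that it is the patterns in groups that matter.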

A Heart-led Future

Compassionate keepers of data are part of a growing movement and shift in attitudes towards greater freedom. The data in our control has great potential for social good if presented in the right way. Removing the individual from that data shows respect for humankind and uplifts us all. A future of freedom for all is the greatest gift we have.

Hi.

I’m Simon.

If you need a fresh perspective on any data protection or data ethics issues you’re facing I’d love to hear from you: hi@honeychurch.tech
