Joakim Wahlqvist blogpost

“Synthetic data – the only way to secure application quality in a GDPR world”

Regulations regarding personal data are becoming stricter, and that is a good thing. Nevertheless, tightening these rules creates challenges: for many companies and organizations, data is essential for responding to the needs of the market and of citizens. What is synthetic data, and how can it help? Learn more in this blogpost from Joakim Wahlqvist, Analytics & Cognitive Lead at Sogeti Sweden.

Data security and usage have recently become an important consideration, not only for companies but also for the individuals whose data is being protected. Every day, we produce around 2.5 quintillion bytes of data (Forbes, 2018), be it in the form of social media posts, tweets, transactions, likes, web searches, and so on. All of this data is invaluable to companies, which use it to build and understand their customer profiles, look for trends, identify opportunities, tailor better services and products, and even anticipate events to capitalise on. However, this data can also be used to exploit, influence and abuse. This is why we need regulations like GDPR that govern how data is gathered and used, and that hold companies and individuals accountable for it.

Protecting our data

GDPR evolved from a directive into a regulation, the first of its kind in the European Union. Under this regulation, personal data, or PII (personally identifiable information), is protected by restricting how the data may be processed and used. The regulation protects end consumers and empowers them to choose what happens with their data and to understand how companies are using it and for what purpose; this is known as the ‘right of access’. Individuals can choose whether or not companies may use their data for different purposes. If an individual withdraws their consent, the company generally has to delete the data it holds about that individual.

Anonymised data not enough

Another feature of GDPR focuses on the usage of data and prohibits companies from using data for anything other than specific purposes that are inherent to their business models. Companies need to be able to state what data they collect and for what purpose. So, if a company is using production data for testing, this could amount to unlawful processing, especially if that use was not explicitly stated when consent was obtained from the individual. There are of course ways to avoid incurring high fines, and one of them is to use pseudonymised (masked) data. The rules for pseudonymised data are more relaxed under GDPR; however, there is still a risk of a data breach. Even better is the use of anonymised data, which is not regulated by GDPR, although this comes with risks as well. Anonymised data is meant to be data that cannot be traced back to a certain individual, but recent studies have shown that supposedly anonymised data can still be used to re-identify the underlying individuals, which leaves this strategy susceptible to adversarial attacks (Nature Communications, 2019).
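To make the re-identification risk concrete, here is a minimal sketch in Python. It is an illustration only: the records are made up, `SECRET_KEY` and `pseudonymise` are hypothetical names, and a keyed hash is just one possible masking technique, not necessarily what any particular organization uses. The sketch masks the names and then counts how many records are still unique on the remaining attributes (postcode, birth year, gender).

```python
import hashlib
import hmac
from collections import Counter

# Hypothetical secret key (an assumption for this sketch); in practice it
# would be stored separately from the masked data.
SECRET_KEY = b"example-key-not-for-production"


def pseudonymise(value: str) -> str:
    """Replace a direct identifier with a truncated keyed hash (HMAC-SHA256)."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


# Made-up records: (name, postcode, birth year, gender)
records = [
    ("Anna Berg", "11122", 1980, "F"),
    ("Erik Lind", "11122", 1980, "M"),
    ("Sara Holm", "41301", 1975, "F"),
    ("Omar Aziz", "41301", 1975, "F"),
]

# Mask only the direct identifier; the other columns are left as they are.
masked = [
    (pseudonymise(name), postcode, year, gender)
    for name, postcode, year, gender in records
]

# Count how many records share each combination of the remaining attributes.
combo_counts = Counter(row[1:] for row in masked)

# Rows whose combination is unique could be linked to a person by anyone
# who knows that person's postcode, birth year and gender.
unique_rows = [row for row in masked if combo_counts[row[1:]] == 1]

print(f"{len(unique_rows)} of {len(masked)} masked records are still unique")
```

In this toy dataset, two of the four masked records remain unique on the attributes that were left untouched, which is the kind of gap that linkage attacks exploit.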

The ground-shaking power of synthetic data

This is where the power of synthetic data shines. Synthetic data looks and feels just like the real data, holding all the characteristics and relationships present in the real data. One of my passions right now as an expert in AI is diving into how to create synthetic data with AI, using advanced neural networks to generate synthetic data that can then be used in place of real data, because in these GDPR times we still desperately need to test and assure the quality of our systems and applications.

And AI can really be useful here. Of course, the AI solution needs to be trained on real data. Does that mean we are no longer GDPR compliant? Well, there is a way to solve that: first extract a dataset used in an application, environment or report, then generate synthetic data from it and push that back into your databases. The advantages of using synthetic over real data are two-fold. First, an entire dataset that looks and feels like your real data, but without the security risk of a data breach, is valuable for companies that operate in highly regulated industries. Second, the solution is scalable, meaning that we can create endless amounts of data based on a small sample of the real data. That way we can create enough data for testing, and that data is, once again, GDPR compliant because it is purely synthetic.
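The idea of learning from a small real sample and then generating as much synthetic data as needed can be sketched in a few lines of Python. This is a deliberately simple stand-in, not the neural-network approach described above: it fits a two-column Gaussian model (means, spreads and correlation) to a made-up sample of ages and incomes and then draws new rows from that model. The function names `fit_model` and `generate`, the column choice and the sample values are all assumptions made for the illustration.

```python
import random
import statistics

# Made-up "real" sample: (age, yearly income). In practice this would be
# the dataset extracted from an application, environment or report.
real_sample = [
    (25, 28000), (32, 35000), (41, 42000),
    (47, 51000), (53, 56000), (60, 58000),
]


def fit_model(rows):
    """Estimate means, standard deviations and correlation of two columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (len(rows) - 1)
    rho = cov / (sd_x * sd_y)
    return mean_x, mean_y, sd_x, sd_y, rho


def generate(model, n, seed=0):
    """Draw n synthetic rows that follow the fitted means and correlation."""
    mean_x, mean_y, sd_x, sd_y, rho = model
    rng = random.Random(seed)
    # Guard against tiny negative values caused by floating-point rounding.
    residual = max(0.0, 1.0 - rho ** 2) ** 0.5
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mean_x + sd_x * z1
        # Mixing z1 into y reproduces the correlation between the columns.
        y = mean_y + sd_y * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows


params = fit_model(real_sample)
synthetic = generate(params, n=1000)          # far more rows than the sample
synthetic_params = fit_model(synthetic)       # check the structure survived

print("real correlation:     ", round(params[4], 3))
print("synthetic correlation:", round(synthetic_params[4], 3))
```

None of the generated rows is a copy of a real record, yet the relationship between age and income is preserved. A simple model like this only captures linear relationships between numeric columns; richer generators are needed for realistic application data.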

It is simply a combination of AI and testing on steroids that will surely help organizations, especially the public sector, which is deeply affected by GDPR, to keep the quality of their systems and applications high. Love it! Feel free to connect with me on this topic, night and day!

  • Joakim Wahlqvist
    National Data & AI Lead
    +46 703 451 801
