GDPR and Data Engineering

November 11, 2021

Three years after coming into force, the General Data Protection Regulation (GDPR) still raises hackles and fears. When the European Union adopted it in April 2016, the first problem was defining the scope of an enterprise’s GDPR project, and then working out how to implement it. After 2018, as fines in the millions of euros began to be levied against non-compliant companies, concerns about being caught violating GDPR became headaches, then nightmares. Those nightmares became reality in July 2021, when Amazon was hit with a 746 million euro (approx. $886 million) fine for violating GDPR in its processing of personal data. In August 2021, Deliveroo and Foodinho in Italy were each fined over two million euros under GDPR for privacy violations after illegally collecting data on their food delivery drivers. Companies outside Europe were put on notice, too, because under some conditions the regulation has global reach. In every case, the law also allows regulators to forbid a company from further using the data in question, so even firms that might find the insight gathered worth the risk of a monetary fine might think twice before proceeding.

The threat of running afoul of the regulators means that experience with GDPR can make a data privacy engineer or partner company a very important part of an enterprise. To the general public, GDPR may seem to be little more than a click or two about information sharing on websites. But parts of the regulation that are rarely in the public eye, such as data security and how much identifying detail must be removed from a subject’s data and under what conditions, require an understanding of both law and data engineering. Compliance-related tech is a young, though rapidly growing, field, and ignoring it can severely impact a company’s data science process.

The compliance market existed before GDPR, but the EU regulation set off a growth stage. The market is expected to grow from an estimated $1.4 billion in 2020 to $4.4 billion by 2027, according to Research and Markets. However, implementing GDPR has been easier said than done. Working out what needs to be put in place, not only in terms of subjects accepting the collection of their data but also in terms of how companies analyze and secure it, took years. Small and medium-sized firms in particular found it difficult, according to an official report in 2020.

Image by Proxet. Simple Data Pipeline for GDPR

To give an example of the scope of the issue from a practical standpoint: at Proxet, we’ve successfully built systems for medical research. It’s not news that the days of the lone doctor with a microscope are far in the past, but the size of such projects spirals quickly.

“Our work with medical researchers included a global portal for over 10,000 users focused on new cures for diseases. With so many users across a variety of geographies, including EU citizens within the EU and abroad, we simply had to put GDPR and national privacy regulations at the center of every module touching data from collection to analytics to output.”

Vlad Medvedovsky, CEO at Proxet (ex-Rails Reactor), a company providing software development services

What is GDPR and What Does it Cover?

GDPR is the European Union’s regulation on data protection and privacy. The regulation centers on the individual’s rights and control over their personal data. GDPR also focuses on what data businesses, especially international ones, can work with, as well as how they can work with it. The latter point encompasses identity-related issues such as pseudonymisation and anonymisation. GDPR has become the starting point for data protection regulation by governments around the world.

The regulation rests upon a set of principles that govern the conditions under which data may be collected from a data subject and used by a data controller. The grounds for processing include consent, legal and contractual obligations, public or official interest, and, where none of those applies, the legitimate interests of the data controller. A data subject’s rights include transparency, access, correction and erasure of data, and the right to object to personal data being processed. The data controller has responsibilities with regard to the proper level of pseudonymisation, record keeping of data processing activities, and security of subject data. GDPR stipulates that in some cases the data controller must appoint a data protection officer. All of this needs to be taken into account by the data privacy engineer.
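
To make the pseudonymisation responsibility more concrete, here is a minimal Python sketch that replaces direct identifiers with a keyed hash (HMAC-SHA256). The function names, field names, and key handling are assumptions made for illustration only; they are not taken from any GDPR tooling. Note that pseudonymised data is still personal data under GDPR, because anyone holding the key can re-link it to a person, so this is not anonymisation.

```python
import hashlib
import hmac


def pseudonymise(value: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256), hex encoded."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def pseudonymise_record(record: dict, id_fields: set, secret_key: bytes) -> dict:
    """Return a copy of the record with the listed identifier fields hashed.

    `id_fields` is a hypothetical, hand-maintained set of column names;
    all other fields are passed through unchanged.
    """
    return {
        key: pseudonymise(str(value), secret_key) if key in id_fields else value
        for key, value in record.items()
    }


if __name__ == "__main__":
    # Assumption: in a real system the key would come from a secrets store
    # and be kept apart from the pseudonymised data set.
    demo_key = b"demo-key-not-for-production"
    patient = {"email": "alice@example.com", "country": "DE", "age_band": "30-39"}
    print(pseudonymise_record(patient, {"email"}, demo_key))
```

The same input and key always yield the same token, so records can still be joined for analytics without exposing the original identifier.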

There is significant confusion on what the principles or pillars of GDPR actually are. A simple web search returns websites showing anywhere from three to seven pillars, or up to eight principles. The European Commission itself lists five principles, illustrated in the following questions:

  • What data can we process and under which conditions?
  • What is the purpose of the data processing?
  • How much data can be collected?
  • For how long can data be kept and is it necessary to update it?
  • What information must be given to individuals whose data is collected?

Though the web page was probably intended to make GDPR clearer to the public, it contains mistakes and isn’t very well written. It comes as no surprise that GDPR took years to be clarified.
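
For a data engineer, several of the questions listed above (purpose, how much data, how long) translate fairly directly into checks in a pipeline. The following Python sketch shows one possible shape for that. The `POLICY` table, the purposes, the field names, and the retention periods are all hypothetical values chosen for illustration; they are not legal guidance and do not come from the regulation.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical policy: which fields each purpose may use, and for how long.
POLICY = {
    "newsletter": {"fields": {"email"}, "retention": timedelta(days=365)},
    "order_fulfilment": {
        "fields": {"email", "name", "address"},
        "retention": timedelta(days=730),
    },
}


def minimise(record: dict, purpose: str) -> dict:
    """Keep only the fields declared for the given purpose."""
    allowed = POLICY[purpose]["fields"]
    return {key: value for key, value in record.items() if key in allowed}


def is_expired(collected_at: datetime, purpose: str, now: datetime = None) -> bool:
    """Return True once the record has outlived the purpose's retention period."""
    now = now or datetime.now(timezone.utc)
    return now - collected_at > POLICY[purpose]["retention"]


if __name__ == "__main__":
    raw = {"email": "bob@example.com", "name": "Bob", "ip": "203.0.113.7"}
    print(minimise(raw, "newsletter"))
    collected = datetime(2020, 1, 1, tzinfo=timezone.utc)
    print(is_expired(collected, "newsletter"))
```

Keeping the policy in one table, separate from the pipeline code, means a change in legal advice becomes a data change and not a code change.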

GDPR and Data Engineering Impact Factor

Just how hard is it to create an IT presence that fits GDPR, especially in terms of data protection? Consider this: In April 2020, Threatpost noted that an EU-funded GDPR compliance advice website got hacked and passwords for the system were exposed in GitHub.

Even though the EU gave the business community two years to prepare for GDPR, and invested millions of euros in PDP4E, the meeting of GDPR and data science was minimal. Researchers found in 2020 that few software engineers had the requisite understanding of GDPR, much less its six principles.

“Our study revealed that the GDPR law is not well known to the software developers and those familiar with it, did not understand all the principles...Our study found this to be the main problem as none can implement something that s/he is not familiar with.”

Alhazmi and Arachchilage, Jazan University and UNSW

Data Engineering Community and Experience with GDPR

Though the implementation of GDPR has had a great impact on how enterprises handle data, the EU has provided help (and funding). The support is important because legal requirements are difficult to transform into a codable form. One result of this help was the Privacy and Data Protection 4 Engineering (PDP4E) project. PDP4E ran from May 2018 through April 2021 and united legal and engineering teams to work on more than just the code or modules that implement the aspects of GDPR going beyond accepting data collection on a web page. The project also tried to make the less readily understandable elements of the regulation clearer, both to the citizens the regulation was supposed to protect and to the broader developer community that needed to use the methods and tools PDP4E developed.

PDP4E was an EU-funded project, and it worked closely with the European Data Protection Supervisor (EDPS). The authority’s remit extends far beyond GDPR, and in 2014 it created the IPEN initiative to support engineers in their privacy engineering efforts. Because COVID-related efforts, from vaccinations to clinical trials to medical records and even video-based mask usage calculation, all create data, the response to the pandemic has produced GDPR-related issues that required coordination between government bodies and data privacy engineers.

Tools for GDPR Compliance

A quick Google search for “Tools for GDPR Compliance” returns a set of solutions for the everyday web page consent issue. This is a problem: Amazon isn’t paying an $800 million fine because it didn’t ask someone if they’d accept cookies. Fortunately, PDP4E created open source software tools that companies can use to become GDPR compliant. These tools cover not only risk management and assurance but model-driven design as well.

Moreover, PDP4E took a model-driven engineering (MDE) direction because of the focus and complexity of the task at hand. The approach allowed the project to join three very disparate areas, as they saw it, namely systems and software engineering, privacy and data protection, and the legal field. In particular:

“Privacy and Data Protection should be addressed “by design”, that is since the onset of a system or software project rather than as an afterthought.”

Gabriel Pedroza, Research & Development Engineer & Project Manager at CEA

Model-driven engineering has advantages when the software requirements are very domain expertise-heavy. As Johan den Haan, CTO at Mendix noted in his blog, in model-driven approaches, “IT (as a software application) is defined on a much higher level. The models are as much as possible declarative and defined in domain concepts.”

Another advantage of MDE for GDPR compliance in data science, as den Haan points out, is that it minimizes the risk of errors. This is important in a field where an error can lead to leakage of sensitive data and possibly millions of euros in fines.
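
As a rough illustration of what “declarative and defined in domain concepts” can look like for privacy, the Python sketch below declares a data model in terms of privacy concepts and derives a check from it. This is not PDP4E’s tooling or Mendix’s approach; the `FieldSpec` class and the example model are hypothetical. The set of legal bases mirrors the six lawful bases for processing in GDPR Article 6.

```python
from dataclasses import dataclass
from typing import List, Optional

# The six lawful bases for processing under GDPR Article 6.
LEGAL_BASES = {
    "consent",
    "contract",
    "legal_obligation",
    "vital_interests",
    "public_task",
    "legitimate_interests",
}


@dataclass(frozen=True)
class FieldSpec:
    """Declarative description of one field in a data set (hypothetical model)."""
    name: str
    personal: bool
    legal_basis: Optional[str] = None


def find_violations(model: List[FieldSpec]) -> List[str]:
    """List personal-data fields that declare no recognised legal basis."""
    return [
        f"{spec.name}: personal data without a recognised legal basis"
        for spec in model
        if spec.personal and spec.legal_basis not in LEGAL_BASES
    ]


if __name__ == "__main__":
    trial_model = [
        FieldSpec("email", personal=True, legal_basis="consent"),
        FieldSpec("ip_address", personal=True),  # basis missing on purpose
        FieldSpec("trial_site", personal=False),
    ]
    for problem in find_violations(trial_model):
        print(problem)
```

Because the model is plain data, a legal reviewer can read it without reading pipeline code, and the same declaration can drive validation, documentation, and record keeping.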

At Proxet, our experience with GDPR in verticals that handle personal information, ranging from medical research to the travel industry, has given us a grasp of the intersection of GDPR and data science. Got a question about it regarding your project? Feel free to contact us here.
