Let’s start with the fact that the GDPR, the flagship EU act on personal data protection, which regulates this matter comprehensively and to which other regulations refer, is not the only regulation we should pay attention to when it comes to the security of personal data processing with AI.
Our regulatory landscape also includes several other legal acts, among them the AI Act (with its rules on transparency, informing people about interactions with AI, and data quality control), the Digital Services Act (known for its content moderation requirements), and the Data Act (establishing rules for access to data generated by IoT devices). So it turns out quite a few requirements govern gathering data from the market, public authorities, or big tech companies.
And that’s not all. Under Polish law, we have the Labor Code, which establishes what employers may do with the personal data of employees and job candidates. If you’re thinking about using AI tools to assess CVs, that’s where you should start looking for information on what’s allowed and what’s not (and what’s not worth doing).
As you can see, the landscape of regulations governing the personal data that artificial intelligence can be fed is very complex. To ensure compliance, it’s important to identify the key challenges in personal data protection.
The role of entities in the processing chain
Determining your role in the processing chain is essential, as it defines your scope of responsibility. If you’re a processor, don’t make decisions that could lead to you being regarded as a controller.
GDPR defines three basic roles:
- Controller
- Joint controller
- Processor
Let’s take a closer look at them. In the genAI context, a controller is an entity that, alone or jointly with others, determines the purposes and means of processing personal data (Article 4(7) of the GDPR). In practice, it could be an organization that implements or develops an AI model for its own purposes.
What matters is who really decides how and why data is used. When it comes to AI, the choices of model architecture, training data, and how the generated content is used are crucial.

When the division of competences becomes unclear, joint control comes into play (regulated by Article 26 of the GDPR). Joint controllers must clearly define their areas of responsibility (and, importantly, share this information with the individuals whose data they process), which in turn requires detailed agreements on issues such as:
- Transparency rules for individuals whose data is concerned
- Assignment of responsibilities in the exercise of data subjects’ rights
- Responsibility for data security
- Incident management procedures
When might we encounter this arrangement? For example, when a technology company and a scientific institute collaborate on a research project. The tech company provides infrastructure and some training data, while the institute is responsible for validating the model and for specialized industry data. Both entities jointly define the model’s parameters and the purposes of personal data processing.
A processor, on the other hand, can perform a wide range of functions, from cloud infrastructure provider, through consultant, to analytics service company. The agreement needs to describe the rights and duties of the parties in detail, taking the specifics of AI into account. All elements required by Article 28(3) of the GDPR must be included, as EU institutions strictly adhere to them.
Just to make things a bit trickier, remember that one company can perform more than one role at the same time (for example, it can be a joint controller of data in one process and a processor in another). We can encounter such situations with providers of pre-trained models, providers of training data, providers of RAG components, or entities validating models on demand.
In practice, this may require complex legal structures like multi-level agreements for personal data processing or extensive joint control agreements.
GenAI versus GDPR definitions
The definition of personal data processing contained in Article 4(2) of the GDPR covers “an operation or set of operations which is performed on personal data.” In the context of AI, personal data processing can be:
- Model training, which is a comprehensive operation of processing personal data – this process involves collecting and preparing training data, transforming data during model learning, as well as validating and testing the model itself.
- New content generation by a model, which can, on the one hand, lead to personal data “emerging” in the output (e.g. data about an identifiable person) and, on the other hand, involve indirect processing of personal data through the use of learned patterns, the reproduction of training data elements in generated content, and real-time adaptation to new data.
- Storing and using a model, which from the perspective of personal data protection regulations may be classified as continuous processing of personal data, even if that data is not directly visible in the model’s outputs.
What does this mean in practical terms (as we want to comply with GDPR)?
- Don’t skip any stage of the genAI model’s “life” when it comes to GDPR compliance: from the very beginning, analyze your business model through the lens of personal data protection regulations.
- Implement security measures that take into account the specifics of your operations, processes and business model.
- Create information policies (e.g. ones that take the issues above into account) to prepare for “day zero”, when you’ll have to answer the first request from a data subject and, for instance, explain that due to the nature of the technology you can’t simply delete their data (or that you can).
- Perform and document analyses of the above issues. This is your insurance for the future: keep all evidence and traces of the specific criteria you used when making particular business decisions. Why is this important? For one thing, because security keeps changing, and in case of an incident you’ll be able to show, with chronologically arranged documentation, that the security measures you applied were always in line with the state of the art and market practices.

The problem with algorithm transparency
Article 5(1)(a) of the GDPR clearly requires data processing to be transparent in relation to the data subject. This requirement is particularly tricky when it comes to advanced AI models, for both technical and legal reasons.
One of the key challenges for transparency is the so-called black box problem. The complexity of deep learning models, which are based on multilayer neural networks, makes it difficult, even for experts, to understand how the decision-making process works. Models often operate on millions or even billions of parameters whose interconnections are practically impossible to trace.
It is precisely this inability to fully explain specific model decisions that poses a significant challenge from the perspective of GDPR compliance. In practice, it means a controller might not be able to clearly explain why the model made a specific decision or generated a particular response. This is especially important for AI systems that make decisions with a significant impact on the rights and freedoms of the individuals whose data is involved.
Additionally, in AI models, it’s exceptionally difficult to identify sources of potential errors or biases.
Can we somehow minimize the risk arising from the black box problem? It’s always worth trying, and the best results will come from combining technical and organizational solutions.
There are technical measures you can implement, such as XAI (explainable AI) methods based on locally interpretable models, SHAP (SHapley Additive exPlanations) value analysis, model visualization techniques, activation maps for neural networks, modular construction of AI systems, intermediate layers that expose knowledge about the decision-making process, and knowledge distillation techniques. Additionally, keep an eye on data quality: for example, implement systems that monitor input data quality, audit training sets, document data characteristics, and implement validation mechanisms. A sketch of the explainability approach follows below.
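To make this less abstract, here is a minimal sketch of per-decision explanations using the open-source shap library. The library itself is real; the toy model, data, and placeholder feature names are assumptions invented for the example, so treat it as an illustration of the technique rather than a compliance recipe.

```python
# A minimal sketch of per-decision explainability with SHAP.
# Assumptions: a toy tabular classifier and placeholder feature names;
# in practice you would plug in your production model and real schema.
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Model-agnostic explainer: attributes a prediction to input features.
explainer = shap.Explainer(model.predict, X)
explanation = explainer(X[:1])  # explain a single decision

# Log per-feature contributions so the decision can be reconstructed
# later, e.g. when answering a data subject's request for an explanation.
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholders
for name, contribution in zip(feature_names, explanation.values[0]):
    print(f"{name}: {contribution:+.4f}")
```

Stored together with the model version and a timestamp, attribution logs like these become part of the documentation trail discussed earlier.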
The queens of organizational solutions are, of course, procedures. Which procedures? Ones for documenting the learning process, testing and validation standards for models, regular reviews of system operations, anomaly reporting, risk management, mitigation plans, and continuous monitoring of the effectiveness of corrective actions.
Incidentally, these procedures will also help you implement a number of legal requirements arising from the AI Act or the GDPR.
How to train genAI models legally
Training genAI models requires processing significant amounts of data, which often contain personal data.
According to Articles 5 and 6 of the GDPR, every instance of personal data processing requires a proper legal basis. In the context of genAI, choosing the right legal basis poses a particular challenge due to the specifics of the technology. Let’s take a look at the two options we have on the table.
Consent (Article 6(1)(a) of the GDPR) is theoretically the most intuitive legal basis (many people still think it’s the only one), but it has several significant limitations: the difficulty of reaching all individuals whose data is used, ensuring the consent is specific enough for the model’s various applications, the problem of the consent’s timeliness and validity period, and the difficulty of honoring a withdrawal of consent (for strictly technical reasons, e.g. the need to retrain the model).
Legitimate interest (Article 6(1)(f) of the GDPR) is the most commonly used legal basis for training genAI models. To rely on it, you need to carry out a balancing test, during which:
- You will identify the legitimate interest that the planned data processing is to serve.
- You will assess the need to process personal data (in general or to a specific extent).
- You will analyze the impact of your data processing on the rights and freedoms of the data subjects.
Of course, you’ll need to document this test. If its outcome comes out in your favor, plan to implement specific risk-minimizing measures, such as pseudonymization (a sketch follows below). Also bear in mind that data protection authorities are quite strict about the results of proportionality tests, so it’s better to have really solid, concrete arguments.
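For illustration, here is a minimal sketch of keyed pseudonymization applied before data enters a training pipeline, using only the Python standard library. The field names and the inline key are assumptions made for the example; in production the key would come from a secrets manager and be subject to its own rotation and access procedures.

```python
# A minimal pseudonymization sketch using a keyed HMAC: the same person
# always maps to the same token, but the token cannot be reversed
# without the secret key, which must be stored separately and securely.
import hashlib
import hmac

SECRET_KEY = b"load-me-from-a-secrets-manager"  # placeholder (assumption)

def pseudonymize(value: str) -> str:
    """Return a stable, non-reversible token for a direct identifier."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

record = {"name": "Jan Kowalski", "email": "jan@example.com", "role": "analyst"}

# Replace direct identifiers; keep the rest of the record for training.
training_record = {
    **record,
    "name": pseudonymize(record["name"]),
    "email": pseudonymize(record["email"]),
}
print(training_record)
```

Remember that pseudonymized data is still personal data under the GDPR; the measure reduces risk but does not take the processing outside the regulation’s scope.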
European Data Protection Board Guidelines
In December 2024, the European Data Protection Board (EDPB) issued an important opinion that could significantly impact how companies develop and implement AI models in Europe. The document, which was created in response to a request from the Irish data protection authority, provides the first comprehensive guidelines on applying GDPR in the context of artificial intelligence in the EU.
The opinion focuses on three key areas that have long raised doubts in the AI industry: the possibility of recognizing AI models as anonymous, using legitimate interest as the legal basis for data processing, and the impact of illegal data processing during training on the subsequent use of the model.
1. Is it possible to create or use AI models that process personal data without asking the data subjects for permission?
It depends on the situation. The EDPB gives two examples: an AI assistant for users and AI used for cybersecurity protection. Both can rely on legitimate interest if the processing is genuinely necessary and the rights of all parties are appropriately balanced. In these cases, meeting those conditions seems possible.
When do we need consent instead of legitimate interest? When:
- The processing purpose is unclear or unattainable.
- There’s a high risk of rights violations or lack of proper security measures.
- You want to process sensitive data.
- You want to use data in a way that’s unexpected for the individuals concerned (e.g. data from the European Business Register used to personalize ads).
The opinion can also help supervisory authorities check whether the use of personal data is in line with the expectations of the people it concerns. Certain criteria need to be taken into account:
- Whether the data is publicly accessible
- The nature of the relationship between the individual and the controller
- The nature of the service and the context in which the personal data was collected
- The source of the data
- Potential further applications of the model
2. How should we treat a model that was developed using unlawfully processed personal data?
According to the EDPB’s opinion, we can distinguish three cases:
- The data stays in the model and is processed by the same controller when deploying the model – in this case, whether the subsequent processing complies with the GDPR depends on whether the development and deployment phases can be considered separate processing purposes.
- The data stays in the model, but a different controller deploys it – the new controller must verify that the model was developed lawfully and assess the associated risk.
- The data was obtained unlawfully but subsequently anonymized – the model can be used, as long as effective anonymization is demonstrated.
3. When can an AI model be considered “anonymous”?
Bad news to start with: simply stating that “the model is anonymous” (e.g. in a contract or privacy policy) isn’t enough. You need to present specific evidence that you don’t collect personal data, or technical documentation confirming that the data has been irreversibly anonymized. That means, for example, analyses of the model’s resistance to attacks, test results, or an assessment of its vulnerability to data regurgitation (a sketch of such a probe follows below).
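As a rough illustration of what gathering such evidence might look like, the sketch below probes a model for verbatim regurgitation of training records. The generate function is a hypothetical stand-in for any text-generation call, and the prefix length and overlap threshold are arbitrary assumptions; a real assessment would also cover membership-inference and model-inversion attacks.

```python
# A minimal sketch of a training-data regurgitation probe.
def generate(prompt: str) -> str:
    """Hypothetical stand-in: replace with your model's inference call."""
    return ""

def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest substring shared by a and b."""
    best = 0
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
                best = max(best, table[i][j])
    return best

def regurgitation_probe(training_samples: list[str],
                        prefix_len: int = 40,
                        threshold: int = 50) -> list[str]:
    """Flag records whose continuation the model reproduces near-verbatim
    when prompted with a prefix of the original training record."""
    flagged = []
    for sample in training_samples:
        completion = generate(sample[:prefix_len])
        if longest_common_substring(completion, sample) >= threshold:
            flagged.append(sample)
    return flagged
```

Dated probe results, kept alongside model versions, are the kind of concrete, technical evidence the EDPB’s opinion calls for.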
Summary
In the era of generative artificial intelligence, processing personal data poses a complex legal challenge, governed not only by the GDPR but also by an entire ecosystem of EU and national regulations. As the analysis above shows, it’s essential to properly define the roles (controller, joint controller, processor) in data processing, choose the appropriate legal basis, and implement adequate technical and organizational security measures. The black box problem remains a particular challenge.
In the future, we can expect further development of regulation in this area, especially as the AI Act is implemented and interacts with existing rules. More detailed technical guidance for AI systems that use personal data should appear soon. Over time, courts will clarify when it’s permissible to rely on legitimate interest (because consent often isn’t a realistic option), and regulators will likely establish clearer rules for checking whether AI models meet data protection requirements. We can also expect increased emphasis on technologies that enable better transparency and explainability of AI models, as a direct response to regulatory requirements and to calls for ethical, trustworthy artificial intelligence.