Why You Shouldn’t Finetune Models Directly on Raw Data

Siddansh Chawla · Published in Towards AI · Apr 30, 2024 · 5 min read


In today’s world, it is often easier to build on top of existing products than to create your own from scratch. In the evolving landscape of Artificial Intelligence (AI), language models like those developed by OpenAI have become an excellent foundation for businesses to incorporate into their operations, making internal workflows smoother and more efficient. One widely used approach is finetuning: training an existing model further on your own data for a specific purpose. Finetuning a pre-trained model has emerged as a cost-effective strategy compared to building a model from scratch. However, it comes with an often-overlooked red flag: sharing proprietary and highly sensitive information, which raises significant privacy concerns around personally identifiable information (PII).

The Economics and Efficiency of Finetuning

Finetuning pre-trained models, such as those hosted by OpenAI, Anthropic or Mistral, has become standard practice across industries because it is far more cost-effective than training a model from scratch, which demands immense resources and time. Finetuning lets a user start with a model that already understands the general structure of language, so they can skip teaching it the basics and focus on getting it to learn the patterns in their own data, or to answer in the style they want. This allows even smaller businesses, such as startups, to integrate state-of-the-art AI into their existing products and give their customers better results.
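To make the workflow concrete, here is a minimal sketch of what submitting a finetuning job typically looks like with the OpenAI Python SDK (v1.x); the file name and base model are placeholders, not a recommendation. Notice that the uploaded training file is exactly where raw, unscrubbed data leaves your control.

```python
# A minimal sketch (assumes openai>=1.0 and OPENAI_API_KEY set in the environment).
# "support_tickets.jsonl" and the base model name are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()

# Upload the training corpus: if this file contains raw PII, it has now left your control.
training_file = client.files.create(
    file=open("support_tickets.jsonl", "rb"),
    purpose="fine-tune",
)

# Kick off a finetuning job on top of a hosted base model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)
```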

Potential Privacy Risks of Finetuning with Proprietary Raw Data

We now understand why companies are keen to integrate state-of-the-art AI into their workflows, but it is just as important to understand the drawbacks of doing so in an unsafe manner. For a model to learn a business process, it must be finetuned on a dataset that captures the overall workflow and use case of the business, which usually means organisations have to use their proprietary data. The primary drawback of using proprietary company data for finetuning is the potential exposure of Personally Identifiable Information. Let’s walk through the key risks one by one.

1. Data breaches: The data used for finetuning can be exposed through unauthorised access during the process or simply through an insecure environment. If that proprietary information leaks, the resulting business loss can be immense.

2. Model memorisation: Pre-training or finetuning a model, or even a basic neural network, carries a risk of memorisation. Models tend to memorise specific details from the training corpus and can later output text that closely mirrors the input data, making them susceptible to leaking sensitive details. This behaviour is closely related to overfitting. Studying the phases of training helps you decide when to stop, based on indicators such as the validation perplexity score.
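As a toy illustration of that stopping signal, the sketch below converts validation cross-entropy into perplexity and flags the point where it stops improving; the loss values are made up for demonstration.

```python
import math

def perplexity(avg_cross_entropy: float) -> float:
    # Perplexity is simply the exponential of the mean token-level cross-entropy.
    return math.exp(avg_cross_entropy)

# Hypothetical per-epoch validation losses from a finetuning run.
val_losses = [2.31, 1.87, 1.62, 1.59, 1.61, 1.66]

best = float("inf")
for epoch, loss in enumerate(val_losses, start=1):
    ppl = perplexity(loss)
    print(f"epoch {epoch}: validation perplexity = {ppl:.2f}")
    if ppl >= best:
        # Validation perplexity has stopped improving: further epochs mostly
        # memorise the training data rather than generalise from it.
        print(f"-> consider stopping after epoch {epoch - 1}")
        break
    best = ppl
```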

Furthermore, in a recent study (arXiv) by researchers at DynamoAI, the authors found that models like GPT-3, when finetuned on a custom dataset, are susceptible to leaking sensitive information (PII). That information can be extracted simply by querying the model with the two prompting methods described in the paper: classification and autocomplete.

“With just 1800 generations of text (around 250k tokens generated), we were able to recover 256 unique PIIs from our fine-tuning dataset.”

The results demonstrate how effective these attacks are when simply prompting the finetuned model over API calls. An employee with minimal access to company data could retrieve sensitive information (PII) just by making a few hundred API calls. The study highlights why one shouldn’t finetune models on raw data without putting some safeguards in place beforehand.
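To make that attack surface concrete, here is a hedged sketch of what an autocomplete-style probe against your own finetuned endpoint might look like; it is not the paper’s exact procedure, and the model id, prompts and regex are illustrative placeholders. Running something like this against your own model before release is a cheap sanity check.

```python
# A hedged illustration, not the paper's method: prompt your own finetuned
# model with autocomplete-style prefixes and scan the generations for
# email-shaped strings. The model id and prompts are hypothetical.
import re
from openai import OpenAI

client = OpenAI()
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

prompts = [
    "Customer contact details:\nName:",
    "Please forward the signed contract to",
]

leaked = set()
for prompt in prompts:
    resp = client.chat.completions.create(
        model="ft:gpt-3.5-turbo:acme-corp::abc123",  # hypothetical finetuned model id
        messages=[{"role": "user", "content": prompt}],
        n=5,              # several samples per prompt
        temperature=1.0,  # higher temperature surfaces more varied completions
    )
    for choice in resp.choices:
        leaked.update(EMAIL_RE.findall(choice.message.content or ""))

print(f"Recovered {len(leaked)} unique email-like strings")
```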

3. Compliance issues: Using sensitive information, including personal data, for model training must comply with data protection laws such as the General Data Protection Regulation (GDPR), the CCPA, and HIPAA. These regulations impose strict guidelines on how data collected from customers may be used, stored, and processed. Failure to comply with even a single requirement can lead to significant sanctions and penalties.

The Role of PII Identifying Tools in Mitigating Risks

If the privacy risks are clear and you want to know what to do next, you are in the right place. You might be wondering whether there is a way to identify the so-called “sensitive” information that could cause problems. Being able to locate it directly would save a lot of time and manual labour, letting you move straight to replacing it or finding an alternative strategy for using it.

In data science projects, we usually start by visualising the data to understand what it represents, because that reveals the trends it contains. The same applies to finetuning: we need to perform exploratory data analysis, not to identify trends, but to identify the sensitive information present in the corpus. Identifying and classifying that sensitive information is an essential step in mitigating the privacy risks associated with finetuning. The following tools can help you do this with existing solutions.

1. Microsoft Presidio: A Python framework for detecting PII entities in text. Presidio combines spaCy NER (Named Entity Recognition) with regular expressions and contextual cues to recognise emails and other formatted values such as dates and checksummed identifiers, in multiple languages, using a set of predefined pattern recognisers. A short usage sketch appears further below.

2. ydata-profiling: A Python package that performs EDA and surfaces insights about your data with essentially a single line of code. It generates an interactive report for visualising the different features, and that profiling view makes it easy to spot and manage columns that may contain PII, which can be a quick and efficient first pass.
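As a quick illustration of that one-liner workflow, here is a minimal sketch assuming the corpus can be loaded into a pandas DataFrame; the CSV file name is a placeholder.

```python
# A minimal sketch (assumes `pip install ydata-profiling pandas`).
# "finetuning_corpus.csv" is a hypothetical placeholder for your own data.
import pandas as pd
from ydata_profiling import ProfileReport

df = pd.read_csv("finetuning_corpus.csv")
profile = ProfileReport(df, title="Finetuning corpus profile")
profile.to_file("corpus_profile.html")  # open in a browser to explore the features
```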

These tools can identify the sensitive information present in the data corpus. From there, you can minimise the damage by removing the sensitive values, replacing them with synthetic or placeholder data, or scrambling and randomising them.
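As a rough sketch of what that looks like with Presidio (assuming presidio-analyzer, presidio-anonymizer and the spaCy English model they rely on are installed), the snippet below detects PII entities in a sample sentence and replaces them with typed placeholders; the sample text is entirely made up.

```python
# A minimal sketch; requires presidio-analyzer, presidio-anonymizer and a spaCy
# English model (e.g. en_core_web_lg). The sample text is fictional.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "Contact Jane Doe at jane.doe@example.com or +1-202-555-0164."

analyzer = AnalyzerEngine()
results = analyzer.analyze(text=text, language="en")  # detect PII entities
for r in results:
    print(r.entity_type, repr(text[r.start:r.end]), f"score={r.score:.2f}")

anonymizer = AnonymizerEngine()
redacted = anonymizer.anonymize(text=text, analyzer_results=results)
# By default, detected spans are replaced with typed placeholders such as <PERSON>.
print(redacted.text)
```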

Conclusion

The practice of finetuning language models on sensitive data presents considerable privacy risks that must be carefully managed. Organisations need to be particularly cautious about how they handle such data: raw data is susceptible to leakage, even from state-of-the-art language models. Where possible, they should employ robust data identification and protection strategies and consider alternative training methods. We now clearly understand the potential risks of using raw data and how to identify the sensitive information in a corpus in order to mitigate them. As Generative AI becomes more integral to company operations across domains, protecting sensitive information (PII) must remain a top priority. Ensuring privacy compliance not only protects individuals but also builds trust in AI technologies, fostering more responsible use and development.
