OpenAI Is Adding a Watermark to GPT: No More Plagiarizing

Mandar Karhade, MD, PhD
Published in Towards AI · Dec 11, 2022 · 8 min read

For the last few weeks, GPT-3, ChatGPT, and InstructGPT have taken the internet by storm. Large language models (LLMs) and transformer language models (TLMs) have made great progress in "creating" AI-generated code (OpenAI Codex), AI-generated text (OpenAI GPT, ChatGPT, InstructGPT), AI-generated images (OpenAI DALL-E), and even AI-generated AI models (not yet, but wait for it). These models generate human-like output that is often indistinguishable from what a real human would produce.


Issues with the current implementation of GPT-3 / ChatGPT

GPT (or any other LLM) can be used to create human-like bots on social networks. These bots can be tuned to generate biased text that elicits the desired responses from real humans. The next logical nefarious step is models tuned to generate fake news articles and disseminate false information at mass scale. Other emerging issues include plagiarism, fabricated research data for qualitative studies, auto-generated essays for school homework, and so on.

"Using a combination of deep fakes, voice synthesizers, and advanced AI for text generation, political actors could try to discredit their opponents." (Source)

“It is mine”: Proving the IP

Machine learning models are becoming increasingly complex. They are computationally intensive to train and consume enormous amounts of data. Such models are considered the intellectual property of those who trained them. To deter the misuse of large models, we need a mechanism for proving the origin of generated text, art, news, or any other output. The need to identify unauthorized use opens the door to a concept known as "digital watermarking".

Digital Watermarking

Digital watermarking is a method of embedding information into a digital signal in a way that is difficult to remove, but can be detected. This information can be used to identify the source of the digital signal, or to prevent unauthorized copying or tampering. Digital watermarks are often used to protect copyrights in digital media, such as images, audio, or video. — Source ChatGPT
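
To make the definition concrete, here is a toy sketch of my own (not how any production system does it) that hides a short bit string in the least-significant bits of an image array and reads it back. Real watermarks are designed to survive compression, cropping, and deliberate removal attempts, which this naive scheme does not.

```python
import numpy as np

def embed_lsb_watermark(image: np.ndarray, bits: list[int]) -> np.ndarray:
    """Hide a bit string in the least-significant bits of the first len(bits) pixels."""
    marked = image.astype(np.uint8).copy()
    flat = marked.reshape(-1)                    # a view into `marked`
    for i, bit in enumerate(bits):
        flat[i] = (flat[i] & 0b11111110) | bit   # overwrite only the lowest bit
    return marked

def extract_lsb_watermark(image: np.ndarray, n_bits: int) -> list[int]:
    """Read the hidden bits back out of a marked image."""
    flat = image.reshape(-1)
    return [int(flat[i] & 1) for i in range(n_bits)]

# Example: mark a random 64x64 grayscale "image" with an 8-bit pattern.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
watermark = [1, 0, 1, 1, 0, 0, 1, 0]
marked = embed_lsb_watermark(image, watermark)
assert extract_lsb_watermark(marked, len(watermark)) == watermark
```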

What are the vectors for unauthorized use of an ML model?

A. The model and its weights are directly accessible to the attacker.

B. The model endpoint (API) is hacked or abused to query the model without limits.

C. Model extraction, in which the attacker uses the model to label unlabelled data and trains a surrogate model.

Source: Research Paper

Access scenarios for ML models. (A) A white-box setting allows the attacker full access to the model and all of its parameters but not (necessarily) to the model’s training data. (B) In a black-box scenario, the attacker has no direct access to the model but instead interacts with it over an application programming interface (API).

Source: Research Paper

Process of a model extraction attack. The attacker holds auxiliary data from a similar distribution as the target model’s training data. Through query access, the attacker obtains corresponding labels for the auxiliary data. Based on that data and the labels, a surrogate model can be trained that exhibits a similar functionality to the original model.
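
To make this concrete, below is a minimal sketch (in Python with scikit-learn) of the extraction loop described above. The `query_target_api` function is a hypothetical stand-in for a victim's prediction endpoint, simulated here with a locally trained model so the example is self-contained.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# The victim's model, unknown to the attacker in a real scenario
# (trained locally here only to make the sketch runnable).
rng = np.random.default_rng(42)
X_private = rng.normal(size=(2000, 10))
y_private = (X_private[:, 0] + X_private[:, 1] ** 2 > 1).astype(int)
target_model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(X_private, y_private)

def query_target_api(x: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the victim's black-box prediction API."""
    return target_model.predict(x)

# Attacker: sample auxiliary data from a similar distribution,
# label it by querying the API, and train a surrogate on the result.
X_aux = rng.normal(size=(2000, 10))
y_aux = query_target_api(X_aux)                           # query access yields labels
surrogate = RandomForestClassifier().fit(X_aux, y_aux)    # surrogate mimics the target

# The surrogate now behaves much like the target on fresh inputs.
X_test = rng.normal(size=(500, 10))
agreement = (surrogate.predict(X_test) == query_target_api(X_test)).mean()
print(f"Surrogate agrees with the target on {agreement:.0%} of test queries")
```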

Requirements for Watermarking techniques

Over the last few years, several such requirements have been formulated by different parties (Uchida et al., 2017; Adi, 2018; Chen et al., 2019a; Li H. et al., 2019). The systematic review referenced above summarizes them in a table; broadly, a watermark should not degrade model performance (fidelity), should be robust against removal or overwriting, and should allow reliable verification by the owner.

Categories of Watermarking techniques

  1. Embedding Watermarks into model parameters: Song et al. (2017) proposed including information about the training data in the model parameters; Uchida et al. (2017) extended this by embedding an explicit T-bit watermark string; Wang et al. (2020) developed an alternative for the embedding parameter X; Wang and Kerschbaum (2019) proposed a strategy to create undetectable watermarks in a white-box setting based on generative adversarial networks (GANs); and Fan et al. (2019) suggested embedding passport layers with digital signatures into NNs for ownership verification. More recently, Goldwasser, Kim, Vaikuntanathan, and Zamir proposed cryptographically undetectable backdoors. A minimal sketch of the parameter-embedding idea appears right after this list.
  2. Using Pre-Defined Inputs as Triggers: Le Merrer et al. (2020) proposed marking the model's behavior itself by slightly moving the decision boundary through adversarial retraining, such that specific queries can exploit it; Adi (2018) and Zhang et al. (2018) considered watermarking from a cryptographic point of view; Rouhani et al. (2018a) developed an approach that includes the watermark as a T-bit string in the probability density function (pdf) of the data abstractions obtained in different network layers; and Chen et al. (2019b) proposed taking the model owner's binary signature as the watermark for an NN.
  3. Trigger Dataset Creation Based on Original Training Data: Guo and Potkonjak (2018) and Zhang et al. (2018) described algorithms for watermarking NNs for image classification with remote black-box verification mechanisms. Sakazawa et al. (2019) proposed cumulative and visual decoding of watermarks in NNs, such that patterns embedded into the training data become visible for authentication by a third party.
  4. Robust watermarking: Robust watermarking does not rely on a single method to watermark. These schemes aim to withstand distillation attacks, as discussed by Papernot et al. (2016) and Yang et al. (2019). Jia et al. (2021) proposed a similar idea that relies on "entangled watermark embeddings": the entanglement makes the model extract common features from the data that represents the original task and the data that encodes the watermark and stems from a different distribution. Namba and Sakuma (2019) described a method called "exponential weighting," which generates a watermark trigger by randomly sampling from the training distribution and assigning wrong labels to that sample during training. Li H. et al. (2019) developed a "null embedding" for including watermarks in the model's initial training, such that attackers cannot remove them or add their own watermarks on top.
  5. Unique Watermarking: Chen et al. (2019a) proposed an end-to-end collusion-secure watermarking framework for white-box settings. Xu et al. (2019) embedded a serial number in NNs for model ownership identification.
  6. Fingerprinting: Zhao et al. (2020) used adversarial examples as fingerprints for NNs that block transferability. Lukas et al. (2019) also exploited the transferability of adversarial examples in order to verify the ownership of ML models.
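
As an illustration of category 1, here is a minimal PyTorch sketch of Uchida-style parameter embedding: a secret projection matrix X maps one layer's weights to T logits, a binary cross-entropy regularizer pushes those logits toward the owner's T-bit string during training, and verification later checks how many bits can be recovered from the weights. The layer, matrix size, and hyperparameters are illustrative assumptions, not the exact construction from any of the papers above.

```python
import torch
import torch.nn.functional as F

T = 32                                     # length of the watermark bit string
b = torch.randint(0, 2, (T,)).float()      # the owner's secret T-bit watermark

# A layer whose weights will carry the watermark, plus a secret projection
# matrix X (the "embedding parameter") mapping those weights to T logits.
layer = torch.nn.Linear(64, 64)
X = torch.randn(T, layer.weight.numel())

def watermark_loss() -> torch.Tensor:
    """Binary cross-entropy pushing sigmoid(X @ w) toward the bit string b."""
    w = layer.weight.flatten()
    return F.binary_cross_entropy_with_logits(X @ w, b)

def verify_watermark() -> float:
    """Fraction of watermark bits recoverable from the (possibly stolen) weights."""
    decoded = (torch.sigmoid(X @ layer.weight.flatten()) > 0.5).float()
    return (decoded == b).float().mean().item()

# In a real setup this regularizer is added to the normal task loss during
# training; here it is optimized alone just to show the bits end up in the weights.
optimizer = torch.optim.Adam(layer.parameters(), lr=1e-2)
for _ in range(300):
    optimizer.zero_grad()
    watermark_loss().backward()
    optimizer.step()

print(f"Watermark bit recovery rate: {verify_watermark():.0%}")
```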

Is OpenAI implementing any methods to protect the model? YES

Although OpenAI has been fairly quiet about security features, it is working on safeguarding its billion-dollar IP. Scott Aaronson has confirmed (ref 2) that OpenAI is working on several implementations. For day-to-day users, what matters is that these systems will be able to distinguish GPT output from text produced by other systems or by humans. Although GPT-3 has 175B parameters, it works over a vocabulary of roughly 100,000 tokens, and biasing how those tokens are selected creates a signature strong enough to identify the origin of the text.

To a regular user, the difference between random and pseudorandom token selection is not perceptible. However, by aggregating the signature over a span of n-grams, a detector that knows the key can pick up patterns that tie the text back to the model that generated it. OpenAI already has working prototypes, and the scheme will likely be deployed in GPT-4. Given the rumored release timeline of sometime in February 2023, an intermediate GPT-3.6 release before GPT-4 looks unlikely.
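
Aaronson has only described the scheme at a high level, so the following is an assumption-heavy toy reconstruction rather than OpenAI's actual implementation: a keyed pseudorandom function scores each candidate token, sampling picks the token maximizing score^(1/p) (a trick that preserves the model's output distribution on average), and a detector holding the key flags text whose chosen tokens have suspiciously high scores. All function names and the tiny vocabulary below are made up for the example.

```python
import hashlib
import math
import random

SECRET_KEY = b"model-owner-secret"    # known only to the model owner / detector

def prf_score(key: bytes, context: str, token: str) -> float:
    """Keyed pseudorandom score in (0, 1) for a candidate token given the context."""
    digest = hashlib.sha256(key + context.encode() + token.encode()).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / (2**64 + 2)

def watermarked_sample(context: str, probs: dict[str, float]) -> str:
    """Pick the token maximizing score ** (1 / p). On average this reproduces the
    model's distribution, but tokens with high PRF scores are favored."""
    return max(probs, key=lambda tok: prf_score(SECRET_KEY, context, tok) ** (1.0 / probs[tok]))

def detection_statistic(tokens: list[str]) -> float:
    """Average -log(1 - score) of the emitted tokens; watermarked text scores high."""
    total = 0.0
    for i, tok in enumerate(tokens):
        total += -math.log(1.0 - prf_score(SECRET_KEY, " ".join(tokens[:i]), tok))
    return total / len(tokens)

# Toy "language model": a tiny vocabulary with fixed next-token probabilities.
vocab_probs = {"the": 0.3, "cat": 0.2, "sat": 0.2, "mat": 0.15, "dog": 0.15}

watermarked = []
for _ in range(200):
    watermarked.append(watermarked_sample(" ".join(watermarked), vocab_probs))

unmarked = [
    random.choices(list(vocab_probs), weights=list(vocab_probs.values()))[0]
    for _ in range(200)
]

# Only someone holding SECRET_KEY can compute these statistics.
print("watermarked text:", round(detection_statistic(watermarked), 2))  # noticeably above 1.0
print("ordinary text:   ", round(detection_statistic(unmarked), 2))     # close to 1.0
```

To anyone without the key, the choices look like ordinary sampling, which is why the watermark is imperceptible to regular users yet detectable by the owner.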

Is the watermark foolproof?

Not really, but it is pretty good. One could generate output with a GPT model and then use another model to reword it. Merely replacing a few words, however, is still likely to preserve the signature in text generated by GPT, ChatGPT, or InstructGPT. A similar implementation could be deployed for the DALL-E models, which operate on CLIP representations; however, because DALL-E's output lives in pixel space, watermarking it is more complex and comes with different end-user considerations. More broadly, watermarking is a generic concept that could be applied to any deep neural network. The shallower the network, the easier it is to remove or evade the watermark, and watermarking techniques that rely on a separate set of nodes for "tagging" are also relatively easy to strip out.

Most likely, proving misuse or theft of IP will remain difficult and will require a lot of output generated by the thief to show that the IP was stolen and replicated. In any case, it will be extremely hard for schools to prove that a student has plagiarized a given piece of text. In the near term, I expect watermarking to be most prevalent in protecting large-scale deployments, legal documents, scientific journals, and other big commercial use cases of deep neural networks.

Closing thoughts

We hear a lot about ethics in AI, which remains a fuzzy concept. IP, by contrast, is concrete. Regulating IP in AI is as important as the AI itself, and it needs a concrete framework. The research by experts in cryptography, AI, and IP protection is invaluable for protecting a potentially trillion-dollar industry.

Disclaimer: I am still learning about this space; this article reflects what I picked up while researching and writing it.

References Consolidated

  1. A few potential uses (and misuses) for GPT-3: https://xnv.io/potential-uses-and-misuses-for-gpt-3/
  2. Scott Aaronson's blog (November 2022): https://scottaaronson.blog/?m=202211
  3. Twitter thread by Adrian Krebs
  4. Adi (2018)
  5. Liu et al. (2018)
  6. Song et al. (2017): embedding watermarks into model parameters
  7. Uchida et al. (2017): first explicit watermarking scheme for NNs
  8. Wang et al. (2020): an alternative for the embedding parameter X
  9. Wang and Kerschbaum (2018): showed that none of these approaches satisfy the secrecy requirements
  10. Wang and Kerschbaum (2019): strategy to create undetectable watermarks in a white-box setting based on generative adversarial networks (GANs)
  11. Fan et al. (2019): embedding passport layers with digital signatures into NNs for ownership verification
  12. Le Merrer et al. (2020): marking the model's behavior itself by slightly moving the decision boundary through adversarial retraining such that specific queries can exploit it
