Lost in Overlap: Exploring Logit-based Watermark Collision in LLMs

1Nanyang Technological University, 2Tsinghua University, 3University of Science and Technology of China
NAACL '25 Findings
*Equal Contribution

Abstract

The proliferation of large language models (LLMs) in generating content raises concerns about text copyright. Watermarking methods, particularly logit-based approaches, embed imperceptible identifiers into text to address these challenges. However, the widespread usage of watermarking across diverse LLMs has led to an inevitable issue known as watermark collision during common tasks, such as paraphrasing or translation. In this paper, we introduce watermark collision as a novel and general philosophy for watermark attacks, aimed at enhancing attack performance on top of existing attack methods. We also provide a comprehensive demonstration that watermark collision poses a threat to all logit-based watermark algorithms, impacting not only specific attack scenarios but also downstream applications.

Summary

Overview of Watermark Collisions.

In summary, this paper proposes a new watermark attack philosophy for all logit-based watermarks in LLMs. Our contributions are as follows:

  • We propose a novel philosophy for watermark attacks that can effectively remove existing watermarks from text. This approach can be integrated with various traditional attack methods to enhance their performance.
  • We find that the strength of overlapping watermarks impacts detection performance. Upstream and downstream watermarks generally compete for detection accuracy, with one being stronger and the other weaker.
  • We discuss the vulnerability of watermarking techniques caused by watermark collisions.

Methodology

Overview of the Watermark Collision pipeline.

To demonstrate the existence of watermark collisions, we design a pipeline with three main components: a watermarker, colliders, and detectors:

  • Watermarker $W$ generates watermarked texts $T_W$ by using a language model (LM) to create content conditioned on a specific corpus as context. As illustrated in the pipeline, we first produce the watermarked text data $T_W$ with Watermarker $W$. We also generate unwatermarked text $T_{W'}$ using the same context and prompt as $T_W$ for further comparison.
  • Colliders $C$ attack the watermark created by the watermarker using collision techniques. We use three distinct colliders that apply collision attacks through traditional attack methods: a paraphraser, a back-translator, and a mask-and-filler.
    • Paraphraser $P$ rephrases the watermarked texts $T_W$ while embedding a different watermark, i.e., one generated by a different method or key, to produce paraphrased texts $T_P$ that are intended to carry dual watermarks simultaneously. We also generate texts $T_{P'}$ using the same paraphraser without a watermark, denoted $P'$, for comparison.
    • Translator $R$ translates the watermarked texts $T_W$ into another language and then back-translates them into the original language while embedding a second watermark.
    • Mask-and-filler (MnF) $M$ performs mask-and-fill attacks, which are commonly carried out with masked language models, e.g., BERT-based models.
  • Detectors identify the watermarks at each stage: $D_P$ targets the watermark embedded by the paraphraser, $D_R$ the one embedded by the translator, and $D_M$ the one embedded in the MnF process, while $D_W$ identifies the original watermark embedded by the watermarker. By comparing the results of these detectors, we can assess the effectiveness of the attacks with and without additional watermarks.
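The pipeline above can be sketched with a toy simulation. The code below is a minimal, hypothetical illustration, not the paper's actual implementation: it assumes a KGW-style logit watermark (a pseudo-random green/red vocabulary split seeded by the previous token and a secret key), stands in for logit boosting with biased sampling over a toy vocabulary, and uses made-up keys and parameters (`KEY_W`, `KEY_P`, `VOCAB`, the replacement rate). It shows how a watermarked paraphrase yields dual-watermarked text on which the original detector weakens while the paraphraser's detector fires.

```python
import hashlib
import math
import random

GAMMA = 0.5   # expected green-token rate in unwatermarked text
VOCAB = 1000  # toy vocabulary size (illustrative assumption)

def is_green(prev_tok: int, tok: int, key: int) -> bool:
    """KGW-style pseudo-random green/red split of the vocabulary,
    seeded by the previous token and a secret key."""
    h = hashlib.sha256(f"{key}:{prev_tok}:{tok}".encode()).digest()
    return h[0] < 256 * GAMMA

def z_score(tokens: list[int], key: int) -> float:
    """Detector: z-score of the observed green-token count against
    the GAMMA rate expected for unwatermarked text."""
    n = len(tokens) - 1
    greens = sum(is_green(a, b, key) for a, b in zip(tokens, tokens[1:]))
    return (greens - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))

def generate(key: int, length: int = 200, seed: int = 0) -> list[int]:
    """Toy watermarker W: stands in for logit boosting by sampling
    only tokens that are green under `key` (a maximally strong bias)."""
    rng = random.Random(seed)
    toks = [0]
    for _ in range(length):
        cand = rng.randrange(VOCAB)
        while not is_green(toks[-1], cand, key):
            cand = rng.randrange(VOCAB)
        toks.append(cand)
    return toks

def paraphrase(tokens: list[int], key: int, rate: float = 0.6,
               seed: int = 1) -> list[int]:
    """Toy watermarked paraphraser P: rewrites a fraction of tokens,
    preferring tokens green under its own key -- the second watermark
    that collides with the first."""
    rng = random.Random(seed)
    out = list(tokens)
    for i in range(1, len(out)):
        if rng.random() < rate:
            cand = rng.randrange(VOCAB)
            while not is_green(out[i - 1], cand, key):
                cand = rng.randrange(VOCAB)
            out[i] = cand
    return out

KEY_W, KEY_P = 42, 1337          # hypothetical secret keys
t_w = generate(KEY_W)            # watermarked text T_W
t_p = paraphrase(t_w, KEY_P)     # dual-watermarked text T_P

print(f"D_W on T_W: {z_score(t_w, KEY_W):.1f}")  # strong
print(f"D_W on T_P: {z_score(t_p, KEY_W):.1f}")  # weakened by collision
print(f"D_P on T_P: {z_score(t_p, KEY_P):.1f}")  # second watermark present
```

Note how the collision degrades $D_W$ twice over: rewritten tokens lose the original green bias directly, and each rewrite also changes the seeding context for the following, untouched token.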

Experiments

Table 1
TPR of the paraphrased text $T_P$ with dual watermarks when $\text{FPR}=1\%$

$W$ and $P$ denote the watermarker and paraphraser, respectively; $D_W$ and $D_P$ denote their detectors. $\varnothing$ indicates that no paraphrasing is applied to the text; its column reports the result of using $D_W$ to detect watermark $W$ in $T_W$. $P'$ denotes paraphrasing $T_W$ without a watermark.
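The TPR-at-fixed-FPR metric used in the tables can be computed as follows. This is a generic sketch, not the paper's evaluation code: the detection threshold is calibrated on unwatermarked (negative) scores so that 1% of them exceed it, and TPR is the fraction of watermarked (positive) scores above that threshold. The Gaussian score distributions below are synthetic placeholders for real detector z-scores.

```python
import numpy as np

def tpr_at_fpr(pos_scores, neg_scores, target_fpr: float = 0.01) -> float:
    """TPR at a fixed FPR: pick the detection threshold from the
    unwatermarked (negative) score distribution, then measure the
    fraction of watermarked (positive) scores above it."""
    thresh = np.quantile(neg_scores, 1.0 - target_fpr)
    return float(np.mean(np.asarray(pos_scores) > thresh))

# Synthetic z-scores standing in for real detector outputs.
rng = np.random.default_rng(0)
neg = rng.normal(0.0, 1.0, 10_000)   # unwatermarked text
pos = rng.normal(4.0, 1.0, 10_000)   # watermarked text
print(f"TPR@1%FPR = {tpr_at_fpr(pos, neg):.3f}")
```

Calibrating the threshold on negatives (rather than fixing an absolute z cutoff) is what makes TPR values comparable across watermarking schemes with differently scaled scores.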


Table 2
TPR of the back-translated text $T_R$ with dual watermarks when $\text{FPR}=1\%$.

Table 3 & 4
Text Quality

Figure 3
Multi-round Collisions

BibTeX


@article{luo2024lost,
  title={Lost in Overlap: Exploring Watermark Collision in LLMs},
  author={Luo, Yiyang and Lin, Ke and Gu, Chao and Hou, Jiahui and Wen, Lijie and Luo, Ping},
  journal={arXiv preprint arXiv:2403.10020},
  year={2024}
}