Can Artificial Intelligence Plagiarize?

UM professor collaborates with Penn State to study copying, paraphrasing in bots such as ChatGPT

OXFORD, Miss. – Since its launch in November 2022, ChatGPT has gained a record-breaking 100 million monthly active users. The tool's underlying technology, which generates text automatically from user prompts, is highly sophisticated. But does it raise ethical concerns?

A University of Mississippi professor has co-authored a paper, led by collaborators at Penn State University, showing that artificial intelligence-driven language models, possibly including ChatGPT, plagiarize in more ways than one.

“My co-authors and I started to think, if people use this technology to write essays, grant proposals, patent applications, we need to care about possibilities for plagiarism,” said Thai Le, assistant professor of computer and information science in the School of Engineering. “We decided to investigate whether these models display plagiarism behaviors.”


The study, the first of its kind, evaluated OpenAI's GPT-2, a precursor to the technology behind ChatGPT. The researchers tested for three kinds of plagiarism: direct copying of content, paraphrasing, and reuse of ideas from a text without proper attribution.

To do this, they built a method to automatically detect plagiarism and tested it against GPT-2's training data, portions of which the model "memorizes" and reproduces in its output. Much of this data is scraped from publicly available web pages without informing the content owners.
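For the direct-copying case, a simple way to see the idea is to check how many of a generated passage's word n-grams reappear verbatim in a training document. The Python sketch below is a minimal illustration of that principle; the function names, whitespace tokenization and n-gram length are hypothetical stand-ins, not the detection pipeline the researchers built.

def ngrams(tokens, n):
    # Set of all length-n word windows in a token sequence.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def verbatim_overlap(generated, training_doc, n=5):
    # Fraction of the generated text's n-grams that also occur,
    # word for word, in the training document; a high score is a
    # signal of memorized copying. Whitespace tokenization and
    # n=5 are illustrative choices, not the paper's settings.
    gen = ngrams(generated.lower().split(), n)
    train = ngrams(training_doc.lower().split(), n)
    return len(gen & train) / len(gen) if gen else 0.0

training = "the quick brown fox jumps over the lazy dog every single morning"
generated = "as we noted the quick brown fox jumps over the lazy dog here"
print(verbatim_overlap(generated, training))  # > 0: shared five-word spans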

By comparing 210,000 generated texts to the 8 million GPT-2 pre-training documents, the team found evidence of all three types of plagiarism in the language models they tested. Their paper explains that GPT-2 can “exploit and reuse words, sentences and even core ideas in the generated texts.”
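Paraphrase and idea reuse are harder to catch with exact matching, since the wording changes while the meaning survives. One common stand-in, offered here only as a hypothetical illustration rather than the team's actual detector, is to embed sentences with a pretrained encoder and flag pairs whose cosine similarity exceeds a tuned threshold:

# Requires the sentence-transformers package; the model name and
# the 0.8 threshold are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose encoder

generated = "The speedy brown fox leapt over the sleeping dog."
training = "The quick brown fox jumps over the lazy dog."

# Cosine similarity of the two sentence embeddings; close paraphrases
# score high even when few words overlap exactly.
emb = model.encode([generated, training], convert_to_tensor=True)
score = util.cos_sim(emb[0], emb[1]).item()
print(f"similarity = {score:.2f}", "-> possible paraphrase" if score > 0.8 else "")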

Furthermore, the team hypothesizes that the larger the model and its training data, the more likely plagiarism becomes.

“People pursue large language models because the larger the model gets, generation abilities increase,” said Jooyoung Lee, first author and an information sciences and technology doctoral student at Penn State. “At the same time, they are jeopardizing the originality and creativity of the content within the training corpus. This is an important finding.”

The scientists believe that this automatic plagiarism detection method could be applied to later versions of OpenAI technology, such as those used by ChatGPT.

The research team will present their findings at the 2023 ACM Web Conference, set for April 30-May 4 in Austin, Texas.

Robert Cummings, associate professor of writing and rhetoric at Ole Miss, has advised higher education professionals on ChatGPT's implications for the classroom. A collaborator with Le on other AI-related research, Cummings suggests that users be pragmatic when citing material drawn from language models.

“We have to be careful about what ideas are ours and what are borrowed,” Cummings said. “Pre-ChatGPT, I’d Google something as part of my research, and it would be sourced. If I was looking for general knowledge, I’d consult Wikipedia.

“Now, it’s important to designate what came from ChatGPT and put it off to the side as unsourced ideas.”

Le acknowledges the importance of finding solutions to these ethical issues, whether they come from users themselves or from advances in the science.

“There are many important philosophical questions related to this technology,” he said. “Computer science researchers will continue to think of ways to improve these language models to change the way they generate text in such a way that they would not plagiarize.”

This material is based upon work supported by the National Science Foundation under Grant Nos. 1934782 and 2114824.