FAQ U: Plagiarism, Part II
There’s a professor at my partner’s graduate school who’s very eager to catch students in the act of academic dishonesty. The class is split up into small groups to write and present a review article of publications in the field. Like clockwork, every year, the students who write the actual data summary for their respective groups’ projects get called to his office to beg for academic probation (as opposed to expulsion). Why? Because their portions of the project have undeniable similarities to published works as detected by the professor’s anti-plagiarism software.
You might notice the problem here. Those students wrote the data summary section of their review articles—the section where you cite, quote, and summarize existing literature to contextualize the more analytical sections of the review. Every year, a handful of poor students get slapped on the wrist by the professor but then are quietly freed by the dean, who has seen this happen every year and has a smarter head on her shoulders. Although the professor’s software found content similarities between the students’ work and whatever published works were included in its comparison dataset, the professor did not take the next step of considering why those sections were flagged as similar. Believe you me, I have it on good word that he’s not the thinking type in general. Let me paint you the picture: male Mormon gynecologist. He’s white too, if that sweetens the pot.
I explained previously what plagiarism is: misrepresenting someone else’s work as your own. Did the students who wrote the data summary sections of their projects plagiarize the works they summarized? No. Again: they were citing, quoting, and summarizing those existing works because that’s the entire point of (that section of) a review article. Each of those acts matters. If they didn’t summarize the data, they wouldn’t contribute anything to the project. If they didn’t quote the original publications, the summary would not necessarily be reliable in the eyes of a reader. If they didn’t cite the original works, that by itself would just be negligent, since the reader would probably still understand from context that they’re reading a summary. But what if they didn’t cite or quote any of the original literature and simply reworded it to give the impression that the data or insights were their own? Only then could content similarities potentially be attributed to dishonesty.
So, setting aside that professor’s failure to interpret his own data, how does a content similarity checker work? Let’s say you have a work by Alice and a work by Bob. When you put the two texts next to each other, maybe you notice that Alice’s work (for one reason or another) has striking similarities to Bob’s work. At that point you can interrogate why those similarities exist and maybe ask Alice why she copied Bob if, e.g., she wasn’t citing and quoting Bob throughout the paper. But what we call anti-plagiarism software doesn’t just compare one work to another. Rather, it has a collection of works published by Bob, Carol, David, Eve, Frank, and so on, against which it compares Alice’s article. Sometimes the software is a simple text comparison, though more recently there are some expensive (computationally and otherwise) machine-learning approaches. Regardless, red flags tend to pop up if Alice’s work is more than 10% similar to works in the software’s dataset, and then it’s the reviewer’s job to determine why.
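To make that concrete, here is a minimal sketch of the simple-text-comparison flavor, in Python. Everything in it is an assumption for illustration: the five-word shingle size, the toy corpus of titled texts, and the 10% flag threshold all stand in for whatever a real product actually tunes.

    # A toy similarity checker: shingle each text into overlapping word
    # n-grams, then measure how much of the submission overlaps each known work.

    def shingles(text: str, n: int = 5) -> set:
        """Lowercase, strip punctuation, and return the set of word n-grams."""
        words = [w.strip('.,;:!?"\'()[]') for w in text.lower().split()]
        words = [w for w in words if w]
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def similarity(a: str, b: str, n: int = 5) -> float:
        """Fraction of a's shingles that also appear in b (0.0 to 1.0)."""
        sa, sb = shingles(a, n), shingles(b, n)
        return len(sa & sb) / len(sa) if sa else 0.0

    def check(submission: str, corpus: dict, threshold: float = 0.10) -> dict:
        """Flag every known work whose overlap exceeds the threshold.
        The reviewer still has to ask *why* each hit was flagged."""
        return {title: score for title, text in corpus.items()
                if (score := similarity(submission, text)) > threshold}

Note what the machine hands back: a number per known work, nothing more. A data summary section will light this up for the healthiest possible reasons.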
Those poor students from earlier, since they wrote the data summary sections, probably had a stupidly high percent similarity (maybe 30% to 50%) to published works, by the very nature of writing the section that aggregates and contextualizes existing data. That would be extremely alarming if they had written a more analytical section of their group’s article, but it makes sense for the section where you cite, quote, and summarize already published work. I wouldn’t be surprised if the overall articles had similarity rates of 10–20% (the norm for review articles as a genre), with most content similarities concentrated in those poor data summary sections. Human interpretation of the comparison data is how we distinguish true positives from false positives, because all the machine can give us is a percentage of how much of a text is (superficially!) similar to texts it knows.
How about false negatives? Let’s say Alice is smart and bilingual, or maybe she just has some more-or-less shitty translation software. There’s a paper written in Spanish by José, about the exact topic Alice has to research and report on. Even if the professor’s content similarity checker has the paper in its dataset (unlikely to begin with, depending on whether the software developer has access to non-English journals), José’s original paper was written and published in Spanish. All Alice has to do—and this is a real strategy I have seen mentioned by others—is translate José’s article into English and pass it off as her own work. Neither the professor nor the machine is any the wiser, because, textually speaking, there are no detectable similarities between Alice’s work and any published work.
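Under the toy checker sketched above, the loophole looks like this (both sentences are my own hypothetical stand-ins, not quotes from anyone’s actual paper):

    # A direct translation shares no five-word shingles with its source,
    # so the naive checker scores a perfect false negative.
    jose = "Los animales se aparean sin distinción alguna entre ellos."
    alice = "Animals mate without any distinction at all among themselves."
    print(similarity(jose, alice))  # 0.0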
If positive cases warrant interpretation, negative cases warrant investigation. Software is just a first layer of checks for the most obvious clues of plagiarism. Are the citations real, and are they cited consistently? If Alice translated the entirety of José’s paper, the professor might find that none of the cited sources actually exist (except, perhaps, in Spanish); although, if Alice had not translated the citations, they would have registered as a textual similarity in the anti-plagiarism software (if its dataset included José’s article). Does the work read like Alice’s other works? Are there instances of awkward language or unfamiliar idioms? Does the raw document lack a version history? None of these is an indicator of plagiarism by itself. Again, evidence must be interpreted, and just because something warrants investigation doesn’t mean it proves plagiarism. You need the proverbial smoking gun: the original text that was passed off. The best the machine can do is compare a text against other texts it happens to know, which is on the one hand why it’s a good place to start, and on the other hand why it’s not omniscient.
Here’s a practical example of the case above. Take the following quotation from Ovid:
[…] quo mente feror? quid molior?" inquit
"di, precor, et pietas sacrataque iura parentum,
hoc prohibete nefas scelerique resistite nostro,
si tamen hoc scelus est. sed enim damnare negatur
hanc venerem pietas: coeunt animalia nullo
cetera dilectu, nec habetur turpe iuvencae
ferre patrem tergo, fit equo sua filia coniunx,
quasque creavit init pecudes caper, ipsaque, cuius
semine concepta est, ex illo concipit ales.
felices, quibus ista licent! humana malignas
cura dedit leges, et quod natura remittit,
invida iura negant. […]
Ovid, Metamorphoses X.320–31
That registers as 100% similar to some other works in one plagiarism checker’s database—obviously, it’s Ovid. Below is a translation by A.S. Kline:
[… and she] says to herself: “Where is my thought leading? What am I creating? You gods, I pray, and the duty and sacred laws respecting parents, prevent this wickedness, and oppose my sin, indeed, if sin it is. But it can be said that duty declines to condemn such love. Other creatures mate indiscriminately: it is no disgrace for a heifer to have her sire mount her, for his filly to be a stallion’s mate: the goat goes with the flocks he has made, and the birds themselves conceive, by him whose seed conceived them. Happy the creatures who are allowed to do so! Human concern has made malign laws, and what nature allows, jealous duty forbids."
That one was flagged at 97% similar (I can’t imagine what the 3% difference is; maybe formatting). Finally, below is my original translation:
Herself she asked, "Where is my mind leading? What am I arousing?
To the gods, I pray, and to the duty and sacred laws of parents:
forbid this transgression and oppose our wickedness,
if indeed it is wickedness—but truly duty deigns to condemn this loveliness.
Other animals come together indiscriminately
and it is not unseemly for a heifer to be taken by her father from behind,
for a stallion to make his filly his wife. A goat goes into the flock he made,
and birds themselves are impregnated by him whose seed had conceived them.
Happy are those who are allowed! Human anxiety imparts malignant laws,
and what nature permits, envious laws deny.
What shocked me is that my translation registered as 0% similar. I must be a shitty translator if it’s not like anything ever published. Or maybe no one likes that passage. I wonder why. Now imagine if I presented that piece of poetry as my original work. Who else is going to admit to being familiar with Orpheus’ misogynistic incest screed in Metamorphoses X? Just joking, but you understand the point: if a text has been translated, or sufficiently reworded, or is simply sufficiently obscure, there is no machine that could tell you whether someone has plagiarized it. These tools are not that knowledgeable, no matter how much data they scrape, and even with the knowledge, the best they can do is a comparison on the level of content rather than of meaning (whether it’s based on primitive string comparison or machine learning). One thus needs to demonstrate a clear line of ‘descent’ from one text to another, with or without the machine’s help.
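You can even put a toy number on “content rather than of meaning” with the checker sketched earlier: take the final line of Kline’s rendering and of mine. They say the same thing, yet they share not a single five-word shingle.

    # Two translations of the same Ovid line: identical meaning,
    # zero surface overlap at the default shingle size.
    kline = "Human concern has made malign laws, and what nature allows, jealous duty forbids."
    mine = "Human anxiety imparts malignant laws, and what nature permits, envious laws deny."
    print(similarity(kline, mine))  # 0.0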
Anyway. Brush your teeth. Wash your butt. Eat your vegetables. Don't hash out what you can't take. Blend your foundation.