‘I’m vulnerable. I’m vulnerable. I am not a robot’. So sang Marina Diamandis back in 2009. Little did she know that her words would describe the status quo for many human copywriters trying to make a living today. In a world where AI copywriting has become part of the furniture, copy agencies and their clients need to know that when they pay for human-written copy, that’s what they’re getting.
And, let’s face it, who has the time to rigorously proofread every piece of content submitted to check for the telltale signs of AI?
In this context, it’s easy to see why agencies and their clients might turn to automated copy checkers to ensure that the content they pay for comes from a human writer. However, our experiences demonstrate that these tools may not be as reliable as users would like them to be. When the copy checker we were using began to flag content we had written as AI, we thought it prudent to investigate and share our findings with other agencies, clients and freelance copywriters.
For some time, the team at Write Arm has been experimenting with the online copy checker Originality.ai to check the provenance of incoming content. When we submitted some copy from an in-house copywriter, we got some unexpected results.
The copywriter found that their 100% human-generated content received an 82% AI score – meaning that the tool was 82% sure that the copy was written by AI. Naturally, we carried out a few Voight-Kampff tests, drew some blood from the copywriter and asked them to identify several traffic lights to determine that they were, indeed, human.
While Originality.ai provided a detailed breakdown of how many sessions the document had been edited within and the characters written over time, it shared very little information on how it came to this (erroneous) conclusion.
We thought perhaps that exploring how these AI-detecting tools work could help us to better understand their capabilities, their limitations, and how anyone with a stake in copywriting can best utilise them.
While each copy checking tool uses a slightly different methodology, the principle is the same. In fact, they work in a very similar way to AI copywriting platforms like ChatGPT. They are trained on vast datasets from a range of sources, analysing the structure of written copy and looking for telltale hints. These may include repetitive words or phrases, or two lesser-known literary concepts – perplexity and burstiness.
Perplexity and burstiness, and the relationship between them, are two key metrics that copy checkers use to detect AI-generated content.
Perplexity measures how unpredictable each word in a sentence is, given the words that precede it. When, for instance, a person sends a text message to their spouse, the predictive text model on their mobile phone will try to preempt the next word in the sentence, or even finish the sentence on the writer’s behalf. If the model is able to complete the sentence before the writer can, the sentence is considered to have low perplexity.
But language is an unpredictable creature, and human writers will often make unorthodox language choices in order to spice up the copy or make it fit better with the brand’s tone of voice. They may opt for spicier words like ‘savour’ instead of ‘enjoy’ or ‘bounteous’ instead of ‘various’ that defy procedural language generation. AI checkers therefore tend to look for the lower-perplexity sentence structures that make AI content grammatically correct but ultimately bland.
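To make the idea concrete, here is a minimal sketch of a perplexity calculation. It is our own illustration and not the method used by any of the tools discussed here: it assumes a toy bigram model with add-one smoothing, trained on a tiny made-up corpus, and the names `train_bigram` and `perplexity` are hypothetical.

```python
import math
from collections import Counter


def train_bigram(tokens):
    """Count single words and adjacent word pairs in a training corpus."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams, len(unigrams)


def perplexity(tokens, model):
    """Exponential of the average negative log-probability per predicted word."""
    unigrams, bigrams, vocab_size = model
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        raise ValueError("need at least two tokens")
    log_prob = 0.0
    for prev, word in pairs:
        # Add-one smoothing (an assumption) so unseen pairs never get zero probability.
        p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(pairs))


# Tiny illustrative corpus, invented for this example.
corpus = (
    "we hope you enjoy the meal . "
    "we hope you enjoy the show . "
    "we hope you enjoy the day ."
).split()
model = train_bigram(corpus)

predictable = "we hope you enjoy the meal".split()
surprising = "we hope you savour the meal".split()

print(perplexity(predictable, model))
print(perplexity(surprising, model))  # higher: 'savour' was never seen in training
```

Swapping ‘enjoy’ for ‘savour’ raises the score because the model has never seen the word, which is the effect described above. Real checkers rely on far larger language models, but the principle is the same.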
Burstiness, on the other hand, refers to the tendency of writers to use certain words or expressions in bursts depending on the context. If, for instance, a copywriter is writing a listicle of the ‘Five best post-workout foods’, they may expect to use words like ‘fat’, ‘protein’ or ‘carbohydrate’ liberally throughout the piece, but might use a more specific word like ‘broccoli’ a handful of times in a single paragraph. This burstiness is characteristic of human copywriting. It also influences perplexity as it tends to make sentences less predictable.
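One way to put a number on this is sketched below. It is an assumption on our part rather than a description of any particular checker: it uses a general-purpose burstiness coefficient, B = (σ − μ) / (σ + μ), calculated from the gaps between successive appearances of a word. A word that appears at perfectly regular intervals scores −1, and the score rises as its appearances become more clustered. The function name `burstiness` and the sample token lists are hypothetical.

```python
import statistics


def burstiness(tokens, word):
    """Burstiness coefficient of one word, from the gaps between its occurrences."""
    positions = [i for i, token in enumerate(tokens) if token == word]
    gaps = [later - earlier for earlier, later in zip(positions, positions[1:])]
    if len(gaps) < 2:
        raise ValueError("the word must appear at least three times")
    mu = statistics.mean(gaps)
    sigma = statistics.pstdev(gaps)
    return (sigma - mu) / (sigma + mu)


# 'protein' appears at regular intervals throughout a 20-word piece.
even = ["protein" if i % 5 == 0 else "x" for i in range(20)]

# 'broccoli' appears three times in quick succession, then once much later.
clustered = ["broccoli" if i in (2, 3, 4, 19) else "x" for i in range(20)]

print(burstiness(even, "protein"))        # -1.0, perfectly regular
print(burstiness(clustered, "broccoli"))  # noticeably higher, the word comes in a burst
```

Some detectors describe burstiness differently, for example as the variation in perplexity from one sentence to the next, so treat this as one possible reading of the concept.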
AI copy checkers use sophisticated mathematical formulae to calculate perplexity and burstiness. But, while impressive, the algorithms employed are nonetheless fallible. Human writing can also score low on both perplexity and burstiness, resulting in false flags. While low scores may indicate that the copy needs enlivening somewhat, they should not be taken as proof that a freelance copywriter has subcontracted their work to ChatGPT.
To get a better feel for the state of AI copy detection as a whole, we decided to cast our net wider than Originality.ai. We had originally chosen Originality.ai based on its strong reputation within the industry. Indeed, the company’s website states that its architecture (built on a modified version of Google’s BERT model) enables it to ‘outperform any alternative’.
Assuming that we were working with a market-leading platform, we wanted to see how Originality.ai’s peers would classify the same 1,500-word piece of copy.
Copyleaks promises to ‘fight fire with fire’, utilising AI to detect AI and claiming to do so with 99.12% accuracy. Copyleaks AI Content detector detects source code from ChatGPT, GitHub Copilot, Bard and more. It features support for multiple languages and offers support for GPT-4. It has a refreshingly uncomplicated interface and, best of all, is free to use.
Using Copyleaks’ free basic detector, we found that our copy was 78.4% likely to be written by a human.
Winston AI claims to be the most trusted AI detector, capable of identifying content generated by a wide range of large language models including Bard, Bing Chat, Claude and, of course, GPT-3 and GPT-4.
It is free to try with no need to submit credit card details. The free version detects both AI copy and readability but users will need to upgrade for plagiarism detection.
Winston AI found that only 30% of our 1,500-word piece was generated by a human, stating that ‘Our assessment is that an AI tool was likely used to generate all or a good part of the text’. It presented us with an AI predictions map highlighting which passages it believed to be procedurally generated, but did not furnish us with any data on why it had reached this conclusion.
GPT Radar is, as the name suggests, built upon GPT-3. It determines the likelihood that text is procedurally generated by measuring perplexity and token probability distribution. The latter is a measure of how likely a language model is to generate a given word based on its context.
Its simple interface allowed us to copy and paste our text into the page and we were off to the races in a single click with no sign-up necessary.
Almost instantly, we were taken to a page that identified the text as 70% likely to be human-generated with a perplexity score of 29. It also identified the most likely passages to be written by humans and the most likely passages to be written by AI.
Given the rationale behind leveraging AI copy, the name Content at Scale is an SEO masterstroke. But the name is not the only thing that’s clever about this tool. It has been trained on billions of pages of web copy and uses a proprietary content generator that identifies numerous components of AI.
Content At Scale AI Content Detector claims to identify copy generated by GPT-4, Bard and more with 98% accuracy. It has a simple interface with no sign-up required. A quick copy and paste revealed some highly inconclusive results. Content At Scale gave us a 58% probability that our copy was human-generated, highlighting certain passages in green, yellow or red depending on the likelihood that they were procedurally generated.
Which, we know with 100% certainty, they were not.
As we can see, running the exact same content through multiple AI copy checkers can yield wildly different results.
GPT Radar and Copyleaks correctly identified that our copy was likely generated by a human, while Content At Scale was only 58% sure that our content came from a human writer.
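To see the size of the disagreement at a glance, the scores reported above can be put on a single ‘likelihood of human authorship’ scale. This is a rough sketch only: converting Originality.ai’s 82% AI score into an 18% human score is our own assumption, as is the 50% cut-off, and the tools may not define their percentages in directly comparable ways.

```python
import statistics

# Percentage likelihood that the copy was human-written, as reported by each tool.
human_scores = {
    "Originality.ai": 100 - 82,  # assumption: an 82% AI score read as 18% human
    "Copyleaks": 78.4,
    "Winston AI": 30,
    "GPT Radar": 70,
    "Content at Scale": 58,
}

lowest = min(human_scores.values())
highest = max(human_scores.values())
spread = highest - lowest
average = statistics.mean(human_scores.values())

# Tools that rated the copy as more likely AI than human (below 50%, our threshold).
flagged = [name for name, score in human_scores.items() if score < 50]

print(f"Spread between tools: {spread:.1f} percentage points")
print(f"Average human score: {average:.1f}%")
print("Flagged as likely AI:", ", ".join(flagged))
```

On these figures, the gap between the most and least confident tool is around 60 percentage points, for one unchanged piece of copy.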
This raises the question: why did Originality.ai and Winston AI flag our copy as AI-generated?
Given what we now know about how copy checkers work, especially in terms of perplexity, we can reasonably deduce that the nature of the copy may have caused a false flag. The piece in question was very factually dense, relying heavily on studies from a variety of sources. The client had requested a tone that, while friendly and engaging, was authoritative and factual when it needed to be. And, with the best will in the world, there are only so many ways in which one can say that X study yielded Y results. In other words, the content necessitated low perplexity, which in turn was a flag for copy checkers. By contrast, the post that you are reading, which has a more lively and engaging tone, registered as 98% likely to be human-generated by Originality.ai.
Does our experience mean that AI copy checkers are useless?
Absolutely not.
We continue to use Originality.ai both to check our incoming copy and to keep track of the platform’s capabilities. We also, however, strongly recommend that AI copy checking is supplemented with human proofreading.
As well as instinctively rooting out copy that feels procedurally generated, human proofreaders can subtly help to better align copy with a client’s branding or ToV guidelines. AI copy checkers are always evolving, and we can expect their capabilities to improve at the same incremental rate as generative AI. However, as advanced as they get, they are unlikely to be able to cross-reference the copy they check against a client’s very specific vocabulary, keyword or ToV mandate.
This is why we work closely with a select group of talented proofreaders to ensure that our copy meets the high standards that our clients expect from us, as well as offering comprehensive copy editing and proofreading services.