Hello fellow Babiato members,
I have been shopping around for an AI writer and was hoping members of this community could come together and map out which AI writers can pass, or consistently pass, the GPT-2/GPT-3 output detectors.
We used the HuggingFace and ContentScale detectors for these tests!
AI Detector Links:
- HuggingFace - GPT-2 - https://huggingface.co/openai-detector (a rough local-scoring sketch follows this list)
- ContentScale - NLP - https://contentatscale.ai/ai-content-detector/
- Writer.com - GPT-3 - https://writer.com/ai-content-detector/
- Originality.ai - (PAID) - https://originality.ai/
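For anyone who wants to score text locally instead of pasting it into the web demo, here is a minimal sketch. It assumes the public "roberta-base-openai-detector" checkpoint (which, to my understanding, is the model behind the HuggingFace demo linked above) and the transformers library; this is not the exact setup used for the numbers in this thread.

```python
# Rough local check against a GPT-2 output detector checkpoint.
# Assumption: "roberta-base-openai-detector" is the public model behind the
# HuggingFace demo linked above; scores will not match the web tools exactly.
from transformers import pipeline

detector = pipeline("text-classification", model="roberta-base-openai-detector")

sample = "Paste an AI writer output of at least 50 words here."
print(detector(sample))
# Prints something like [{'label': ..., 'score': ...}]; whether the label reads
# "Real"/"Fake" or LABEL_0/LABEL_1 depends on the checkpoint's config.
```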
All AI Writers Tested
HelloScribe
Outputs | 20 |
Human Score | 87% |
20 Failed Outputs used in "Rewrite This" | 99.8% |
Content quality is very good, and the Human Score is the new highest in this case study. The "Rewrite This" tool does an excellent job of improving detection results, taking failed outputs from a ~0% Human Score to an average of 99.8% across 20 rewrites!
Google Sheet with over 320 samples
OpenAI - GPT-3 Davinci-003 (Default Settings)
Outputs | 20 |
Human Score | 0.01% |
20 Failed Outputs used in "Content Rewrite" | 6.9% |
Content quality is good, but with default settings the Human Score is the worst in this case study. Telling the AI to rewrite the failed outputs did improve the score slightly.
OpenAI - GPT-3 Davinci-003 (Custom Settings)
Outputs | 20 |
Human Score | 49% |
20 Failed Outputs used in "Content Rewrite" | 80% |
Content quality is good, and the Human Score comes in at about half.
Telling the AI to rewrite failed outputs in a "funny" tone made a drastic difference: the AI did an excellent job of not being so robotic with the responses while still providing helpful information.
Writesonic
Outputs | 20 |
Human Score | 66% |
20 Failed Outputs used in "Content Rewrite" | 8.6% |
It seems like Writesonic uses GPT-3 for the initial output, then runs another algorithm on top of it before sending the results to the user. Those initial outputs do fairly well at passing detection.
The Content Rephraser, on the other hand, had trouble improving the 20 failed outputs. I used 3-4 of the failed outputs for the rewrites.
TextWizard
Outputs | 20 |
Human Score | 39% |
Frase
Outputs | 20 |
Human Score | 12.9% |
Rytr
Outputs | 20 |
Human Score | 7.5% |
Creator.ai
Outputs | 20 |
Human Score | 7.4% |
20 Failed Outputs used in "Content Rewrite" | 8.4% |
The output quality is really good, but the detection rate is high. Using the Content Rewriter on 20 failed outputs barely moved the score (7.4% to 8.4%).
WordHero
Outputs | 20 |
Human Score | 7.15% |
20 Failed Outputs used in "Content Rewriter V2" | 53.3% |
The output quality is pretty good, but the detection rate is high. We ran 20 failed outputs through their "Content Rewriter V2" and reduced the detection rate considerably.
Jasper
Outputs | 20 |
Human Score | 35% |
20 Failed Outputs used in "Content Improver" | 90.3% |
The output quality is the best of all the writers, but the detection rate is high. The Content Improver they provide works exceptionally well to decrease the detection rate.
NeuronWriter - Davinci 003 Update
Outputs | 20 |
Human Score | 49.2% |
20 Failed Outputs used in "Rephrase Text" | 78.85% |
The original output quality is good, and the Human Score is decent. Rephrase Text does a good job of producing a high Human Score, though the rewritten output could be better.
Google Sheet Examples
ClosersCopy
Outputs | 20 |
Human Score | 0.02% |
Bramework
Outputs | 20 |
Human Score | 59% |
20 Failed Outputs used in "Content Improver" | 66% |
The quality of the output is very good, and the detection score is among the best in this case study. I took 10 of the failed prompts (0.02% to <9% real) and used their "Rephrase" option.
TLDR: These detectors look for obvious signs of AI generation based on their knowledge of how GPT-3 creates content from each prompt. Unreliable... but interesting to test.
- HelloScribe - Best overall for Human Score
- Jasper - Second Best Rewriter
- Bramework - Well Rounded Long Form
- WordHero - Decent Rewriter
Methodology
The scores shown are the average detection rate, with a minimum of 50 words per output.
Some outputs are usable, at 87% real and 13% fake, while others are easily detected at 0.02% real and 99.98% fake and should not be used.
If the AI writer has a tool like "Content Improver" or "Content Rewriter", it is used on failed outputs from the same writer.
Failed outputs are obvious detections: 0.02%-10% real.
The Human Score is based on the contentscale.ai tool.
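A minimal sketch of how I tally these numbers, assuming the per-output "real" percentages are collected by hand into a list (the function name and thresholds below are mine, not from any of the tools):

```python
# Hypothetical helper that summarizes detector results the way this thread does:
# average the per-output Human Scores and flag "failed" outputs (0.02%-10% real).

def summarize(human_scores):
    """human_scores: per-output 'real' percentages, e.g. [87.0, 12.9, 0.02]."""
    average = sum(human_scores) / len(human_scores)
    failed = [s for s in human_scores if s <= 10.0]  # obvious detections
    return average, failed

scores = [87.0, 49.0, 7.5, 0.02]  # example values only, not real test data
avg, failed = summarize(scores)
print(f"Average Human Score: {avg:.1f}% | Failed outputs: {len(failed)}")
```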
Special Thanks
@sundar50000 gave me access to some tools; thank you
@cesareborgia did as well, thanks
Will add more writers and data as they are discovered.