Distinguishing Human-Generated and AI-Generated Academic Writing: A Machine Learning Benchmark Study
DOI: https://doi.org/10.21015/vtse.v14i1.2274

Abstract
The rapid adoption of large language models (LLMs) such as ChatGPT has raised critical questions about authorship, originality, and integrity in academic writing. Unlike text caught by conventional plagiarism detection tools, AI-generated or AI-rephrased text can preserve the original meaning and context while modifying the writing style, making it difficult to detect with standard similarity checks. This study addresses the challenge by creating a domain-specific corpus of postgraduate-level academic texts. The corpus contains 22,520 samples, equally divided between human-written and AI-rephrased text. All samples were preprocessed and represented using two common techniques: TF-IDF and Word2Vec. The dataset was evaluated with well-known machine learning and deep learning models, including Logistic Regression, Support Vector Machines, Recurrent Neural Networks, and the transformer-based models BERT and T5. The results show that linear and sequential models provide low baseline performance, with accuracy between 50% and 54%, whereas BERT significantly outperforms the other models, achieving 83% precision along with high recall. Confusion matrix analysis further shows that traditional models tend to overpredict AI authorship, whereas BERT distinguishes reliably between human-written and AI-generated text. These results indicate that transformer-based models are more effective for authorship verification in academic settings, and they highlight the trade-offs among interpretability, computational cost, and predictive performance. Overall, the study offers recommendations for building credible, transparent, and domain-sensitive AI detectors for academia.
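The linear baseline described above can be sketched as follows. This is an illustrative example only, not the authors' actual pipeline: the two-sentence "corpus" is a hypothetical stand-in for the 22,520-sample dataset, and the abstract reports that real-world accuracy for this class of model was only 50-54%.

```python
# Sketch of a TF-IDF + Logistic Regression baseline of the kind the
# study benchmarks. Texts and labels here are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "The results indicate a statistically significant effect.",   # human-written (label 0)
    "The findings demonstrate a notable and meaningful impact.",  # AI-rephrased (label 1)
] * 10  # repeated so the toy classifier has enough samples to fit
labels = [0, 1] * 10

# Vectorize with TF-IDF, then fit a linear classifier on the sparse features.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)

pred = clf.predict(["The results indicate a statistically significant effect."])
print(pred[0])  # 0 (classified as human-written)
```

On a realistic corpus this pipeline would be evaluated with a held-out test split and a confusion matrix rather than predictions on training sentences.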
License
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License (CC BY) that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).
This work is licensed under a Creative Commons Attribution License CC BY