Advanced Cross-Validation Framework for Mental Health AI: BERT and Neural Networks Achieve High Accuracy on MentalChat16K


Irfan Ali

Abstract

Conversational AI is becoming an essential tool for supporting mental health, yet robust evaluation frameworks for large-scale therapeutic dialogue datasets remain scarce. This study presents a comprehensive analysis of the MentalChat16K dataset, which contains 16,084 mental health conversation pairs (6,338 real clinical interviews and 9,746 synthetic dialogues), using modern deep learning architectures. We develop and evaluate BERT-based text classification models and feature-engineered neural networks for mental health conversation analysis. Our BERT classifier achieves 86.7% accuracy and 86.1% F1-score for sentiment-based mental health state classification. A feature-based neural network achieves 86.7% accuracy and 83.5% F1-score for therapeutic response type prediction. In addition, five-fold cross-validation with a Random Forest classifier on engineered features yields 99.99% ± 0.02% accuracy. We show that this very high performance is driven by effective feature engineering on a more straightforward classification task, distinct from the primary BERT and neural network models. We further perform statistical significance testing using McNemar’s test and bootstrap confidence intervals, confirming that model performance differences are statistically significant (p < 0.05). Performance on real versus synthetic data is comparable (100.0% vs. 99.95%), suggesting robustness across data sources. The dataset consists of 39.4% real clinical interviews and 60.6% GPT-3.5-generated conversations; a demographic analysis highlights the lack of explicit demographic labels and the resulting limitations. Our methodology includes domain-optimised BERT architectures, thorough hyperparameter documentation, and a stratified cross-validation framework. GPU-accelerated experiments provide practical insights for deploying such models in workplace mental health systems. Overall, this study establishes performance benchmarks for conversational mental health AI with promising accuracy levels for research and development, while emphasising the need for independent clinical validation before any real-world use. This work contributes to the growing field of AI-powered mental health support technologies.
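As a hedged illustration of the evaluation framework summarised above, the sketch below shows how stratified five-fold cross-validation of a Random Forest on engineered features, a paired McNemar's test between two classifiers, and a percentile bootstrap confidence interval for accuracy could be implemented. The feature matrix `X`, labels `y`, prediction arrays, and all hyperparameters are hypothetical placeholders, not the authors' exact configuration.

```python
# Minimal sketch of the evaluation framework described in the abstract.
# Assumes NumPy arrays for features, labels, and predictions; settings are illustrative.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from statsmodels.stats.contingency_tables import mcnemar


def cross_validate_rf(X, y, n_splits=5, seed=42):
    """Stratified k-fold cross-validation of a Random Forest on engineered features."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    accs, f1s = [], []
    for train_idx, test_idx in skf.split(X, y):
        clf = RandomForestClassifier(n_estimators=200, random_state=seed)
        clf.fit(X[train_idx], y[train_idx])
        preds = clf.predict(X[test_idx])
        accs.append(accuracy_score(y[test_idx], preds))
        f1s.append(f1_score(y[test_idx], preds, average="weighted"))
    return np.mean(accs), np.std(accs), np.mean(f1s)


def mcnemar_pvalue(y_true, preds_a, preds_b):
    """Paired McNemar's test comparing two classifiers on the same test set."""
    a_correct = preds_a == y_true
    b_correct = preds_b == y_true
    # 2x2 contingency table of agreement/disagreement between the two models
    table = [[np.sum(a_correct & b_correct), np.sum(a_correct & ~b_correct)],
             [np.sum(~a_correct & b_correct), np.sum(~a_correct & ~b_correct)]]
    return mcnemar(table, exact=False, correction=True).pvalue


def bootstrap_accuracy_ci(y_true, preds, n_boot=1000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for accuracy."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = [accuracy_score(y_true[idx], preds[idx])
              for idx in (rng.integers(0, n, n) for _ in range(n_boot))]
    return np.quantile(scores, [alpha / 2, 1 - alpha / 2])
```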

Article Details

How to Cite
[1]
Irfan Ali, “Advanced Cross-Validation Framework for Mental Health AI: BERT and Neural Networks Achieve High Accuracy on MentalChat16K”, IJAINN, vol. 6, no. 1, pp. 10–17, Dec. 2025, doi: 10.54105/ijainn.A1112.06011225.
Section
Articles

References

World Health Organization. Mental Health and Substance Use Disorders. WHO Global Health Observatory, 2022. https://www.who.int/data/gho/data/themes/mental-health

Employee Assistance Professional Association. Global EAP Utilisation Patterns and Effectiveness Meta-Analysis. EAPA Research Quarterly, 2023. https://www.eapassn.org/Resources/Research

Chen, L., et al. AI-Powered Mental Health Interventions in Workplace Settings: A Systematic Review. Journal of Occupational Health Psychology, 28(4): 245-260, 2023. DOI: https://doi.org/10.1037/ocp0000362

Abd-Alrazaq, A., et al. Conversational AI for Mental Health: A Systematic Review of Applications, Challenges, and Future Directions. Journal of Medical Internet Research, 25: e51560, 2023. https://medinform.jmir.org/2024/1/e51560

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-HLT, 2019. DOI: https://doi.org/10.18653/v1/N19-1423

Xu, Jia, et al. MentalChat16K: A Benchmark Dataset for Conversational Mental Health Assistance. arXiv preprint arXiv:2503.13509, 2025. https://arxiv.org/abs/2503.13509

Shaoxiong Ji, Tianlin Zhang, Luna Ansari, Jie Fu, Prayag Tiwari, and Erik Cambria. MentalBERT: Publicly Available Pretrained Language Models for Mental Healthcare. Proceedings of LREC, 2022. https://aclanthology.org/2022.lrec-1.403/

Matthew Matero, et al. Suicide Risk Assessment with Multi-level Dual-context Language and BERT. Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology, 2019. https://aclanthology.org/W19-3015/

Wang, Y., et al. Recent Advances in Transformer Models for Clinical Text Analysis: A Survey. Artificial Intelligence in Medicine, 142: 102567, 2023. DOI: https://doi.org/10.1016/j.artmed.2023.102567

Coppersmith, G., Dredze, M., and Harman, C. Quantifying Mental Health Signals in Twitter. Proceedings of the Workshop on Computational Linguistics and Clinical Psychology, 2015. https://aclanthology.org/W15-1201/

Turcan, E., and McKeown, K. Dreaddit: A Reddit Dataset for Stress Analysis in Social Media. Proceedings of the 12th Language Resources and Evaluation Conference, 2021. https://aclanthology.org/2021.lrec-1.265/

Taylor, J. M., et al. Development and Validation of Machine Learning Models for Stress Assessment from Text. Journal of Medical Internet Research, 22(10): e22145, 2020. DOI: https://doi.org/10.2196/22145

Vaswani, A., et al. Attention Is All You Need. Advances in Neural Information Processing Systems, 30, 2017. https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

Rajkomar, A., et al. Scalable and Accurate Deep Learning with Electronic Health Records. NPJ Digital Medicine, 1: 18, 2018. DOI: https://doi.org/10.1038/s41746-018-0029-1

Serrano, S., and Smith, N. A. Is Attention Interpretable? Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019. DOI: https://doi.org/10.18653/v1/P19-1282

Patel, N., et al. Bias Detection and Mitigation in Mental Health AI Systems: A Systematic Review. Journal of Medical Ethics, 49(8): 567-578, 2023. DOI: https://doi.org/10.1136/jme-2022-108847

Lin, T. Y., Goyal, P., Girshick, R., He, K., and Dollár, P. Focal Loss for Dense Object Detection. Proceedings of the IEEE International Conference on Computer Vision, 2017. DOI: https://doi.org/10.1109/ICCV.2017.324

Loshchilov, I., and Hutter, F. SGDR: Stochastic Gradient Descent with Warm Restarts. arXiv preprint arXiv:1608.03983, 2016. https://arxiv.org/abs/1608.03983

Char, D. S., Shah, N. H., and Magnus, D. Implementing Machine Learning in Health Care: Addressing Ethical Challenges. New England Journal of Medicine, 378(11): 981-983, 2018. DOI: https://doi.org/10.1056/NEJMp1714229

Obermeyer, Z., Powers, B., Vogeli, C., and Mullainathan, S. Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations. Science, 366(6464): 447-453, 2019. DOI: https://doi.org/10.1126/science.aax2342