Advanced Cross-Validation Framework for Mental Health AI: BERT and Neural Networks Achieve High Accuracy on Mental Chat16K
Abstract
Conversational AI is becoming an essential tool for supporting mental health, yet robust evaluation frameworks for large-scale therapeutic dialogue datasets remain scarce. This study presents a comprehensive analysis of the MentalChat16K dataset, which contains 16,084 mental health conversation pairs (6,338 real clinical interviews and 9,746 synthetic dialogues), using modern deep learning architectures. We develop and evaluate BERT-based text classification models and feature-engineered neural networks for mental health conversation analysis. Our BERT classifier achieves 86.7% accuracy and an 86.1% F1-score for sentiment-based mental health state classification. A feature-based neural network achieves 86.7% accuracy and an 83.5% F1-score for therapeutic response type prediction. In addition, five-fold cross-validation with a Random Forest classifier on engineered features yields 99.99% ± 0.02% accuracy. We show that this near-perfect performance is driven by practical feature engineering on a more straightforward classification task, distinct from the primary BERT and neural network models. We further perform statistical significance testing using McNemar's test and bootstrap confidence intervals, confirming that the performance differences between models are statistically significant (p < 0.05). Performance on real versus synthetic data is comparable (100.0% vs 99.95%), suggesting robustness across data sources. The dataset comprises 39.4% real clinical interviews and 60.6% GPT-3.5-generated conversations; a demographic analysis highlights the lack of explicit demographic labels and the resulting limitations. Our methodology includes domain-optimised BERT architectures, thorough hyperparameter documentation, and a stratified cross-validation framework. GPU-accelerated experiments provide practical insights for deploying such models in workplace mental health systems.
Overall, this study establishes performance benchmarks for conversational mental health AI with promising accuracy levels for research and development, while emphasising the need for independent clinical validation before any real-world use. This work contributes to the growing field of AI-powered mental health support technologies.
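The evaluation procedure summarised above (stratified five-fold cross-validation, McNemar's test, and percentile-bootstrap confidence intervals) can be sketched in plain Python. This is a minimal illustration of the statistical machinery, not the study's implementation; the fold counts, disagreement counts, and outcome vectors below are illustrative placeholders, not values from the experiments.

```python
import math
import random
from collections import defaultdict

def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) splits that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):           # deal each class round-robin
            folds[j % k].append(i)
    for f in range(k):
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, test

def mcnemar(b, c):
    """McNemar's test with continuity correction on paired predictions.
    b = examples only model A classified correctly,
    c = examples only model B classified correctly."""
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # chi-square survival function with 1 degree of freedom
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p

def bootstrap_ci(correct, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy from per-example 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(correct)
    accs = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_boot))
    return accs[int(alpha / 2 * n_boot)], accs[int((1 - alpha / 2) * n_boot) - 1]

# Illustrative usage with placeholder numbers:
stat, p = mcnemar(b=60, c=30)          # models disagree on 90 examples
ci = bootstrap_ci([1] * 867 + [0] * 133)  # 86.7% accuracy on 1,000 examples
```

In production pipelines the same steps are typically delegated to `sklearn.model_selection.StratifiedKFold` and `statsmodels.stats.contingency_tables.mcnemar`; the hand-rolled versions here exist only to make the procedure explicit.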

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.