Machine learning approaches to analyzing public speaking and vocal delivery

Ali Mohammed; Mehdi Mir; Ryan Gill

doi:10.31039/ljss.2023.6.106

Authors

Ali Mohammed https://orcid.org/0009-0005-8798-6128
Mehdi Mir
Ryan Gill

Keywords:

Machine learning, Public speaking, Speech analysis, Vocal delivery, SVM, CNN, LSTM, PAAN

Abstract

Technological advances in the 21st century, notably in machine learning, have significant implications for the analysis of public speaking and vocal delivery. This literature review examines how machine learning techniques are used to evaluate and improve public speaking skills, which are central to effective communication in many professions and in everyday life.

The review first examines machine learning models such as Support Vector Machines (SVMs), Convolutional Neural Networks (CNNs), and Long Short-Term Memory (LSTM) networks. Applied to non-verbal speech features, emotion detection, and performance evaluation, these models promise analysis that is more objective, scalable, and efficient than traditional, often subjective, methods.

The discussion then turns to real-world applications of these techniques: public speaking skill analysis, evaluation of teachers' vocal delivery, and assessment of public speaking anxiety. Several machine learning frameworks are presented, with emphasis on their effectiveness in producing objective evaluation results at scale.

However, the review also acknowledges the challenges and limitations of these technologies, including data privacy concerns, potential over-reliance on technology, and the need for diverse and extensive datasets. These drawbacks point to the need for further research.

Despite these challenges, the review highlights the successes of numerous machine learning applications in this field and their potential for future advancement. By examining past successes and failures, it aims to guide more effective deployment of these technologies and to contribute to ongoing efforts to transform the analysis of public speaking and vocal delivery.
