This study introduces a Vision Transformer (ViT) model that automates the classification of videofluoroscopic swallowing studies (VFSS) for dysphagia and outperforms conventional CNN baselines at detecting abnormalities.
Friday, December 12, 2025

Hesam Abdolmotalleby is the main researcher and lead author of the paper "Detecting Airway Invasion in Variable-Length Videofluoroscopic Swallowing Studies: A Vision Transformer Approach for Oropharyngeal Dysphagia". To reduce the burden of manually interpreting videofluoroscopic swallowing studies (VFSS) when diagnosing dysphagia, the research team developed a novel Vision Transformer (ViT) model. The ViT uses a temporal sliding window and 3D patch tokenization to capture spatio-temporal dependencies within variable-length VFSS sequences. Evaluated on 1,154 VFSS sequences, the ViT achieved 84.37% accuracy, 90.81% sensitivity, and 79.49% specificity, significantly outperforming several conventional Convolutional Neural Network (CNN) baselines such as VGG-16 and ResNet-50. These results show that the ViT is well suited to automated VFSS classification and provide a promising foundation for the clinical deployment of AI-driven tools that streamline dysphagia screening and support timely detection of abnormalities.
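To make the two preprocessing ideas concrete, here is a minimal NumPy sketch of a temporal sliding window followed by 3D patch tokenization. It is an illustration under stated assumptions, not the authors' implementation: the function names, the window length, stride, and patch size, the grayscale frame layout, and the choice to pad short sequences by repeating the last frame are all hypothetical, since this summary does not report those details.

```python
import numpy as np


def sliding_windows(video, window, stride):
    """Split a (T, H, W) video into fixed-length clips of `window` frames.

    Assumption: sequences shorter than `window` are padded by repeating
    the last frame; the paper may handle this differently.
    """
    t = video.shape[0]
    if t < window:
        pad = np.repeat(video[-1:], window - t, axis=0)
        video = np.concatenate([video, pad], axis=0)
        t = window
    starts = list(range(0, t - window + 1, stride))
    # Add one final window so the last frames are always covered.
    if starts[-1] != t - window:
        starts.append(t - window)
    return np.stack([video[s:s + window] for s in starts])


def tokenize_3d(clip, pt, ph, pw):
    """Cut a (T, H, W) clip into non-overlapping pt x ph x pw patches.

    Returns an array of shape (num_tokens, pt * ph * pw), one flattened
    spatio-temporal patch per row.
    """
    t, h, w = clip.shape
    if t % pt or h % ph or w % pw:
        raise ValueError("clip dimensions must be divisible by the patch size")
    x = clip.reshape(t // pt, pt, h // ph, ph, w // pw, pw)
    # Bring the three patch-grid axes to the front, patch contents to the back.
    x = x.transpose(0, 2, 4, 1, 3, 5)
    return x.reshape(-1, pt * ph * pw)


# Hypothetical example: a 50-frame, 64x64 grayscale sequence.
video = np.zeros((50, 64, 64), dtype=np.float32)
clips = sliding_windows(video, window=16, stride=8)
tokens = tokenize_3d(clips[0], pt=2, ph=16, pw=16)
print(clips.shape, tokens.shape)
```

In a full model, each token row would typically be passed through a learned linear projection and combined with positional embeddings before entering the transformer encoder, and the per-window predictions would be aggregated into a single decision for the sequence. Those steps are omitted here because the summary does not describe how the paper performs them.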