Leveraging EuPathDB Genomic Datasets with AI for Advancements in Molecular Parasitology: A Path to Data-Driven Discoveries
Abstract
This study applies two deep learning architectures, Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks, to genomic and gene expression data from EuPathDB for molecular parasitology applications. The CNN model detected pathogenic motifs within genomic sequences with an accuracy of 86% and an F1-score of 0.84, indicating strong potential for identifying pathogenic features in parasitic genomes. The LSTM model reached a more moderate test accuracy of 79%; it captured temporal patterns in gene expression relevant to infection stages, but its limited sensitivity points to avenues for further refinement. Confusion matrices and ROC curves were used to examine the classification accuracy and sensitivity of both models, and the results suggest that the models generalize across parasite species. These findings highlight the potential of deep learning to advance data-driven parasitology research, with practical applications in genomic analysis, diagnostic support, and therapeutic target discovery. Future work should explore hybrid architectures and data augmentation techniques to improve model robustness and accuracy.
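For readers less familiar with the metrics reported above, the following minimal sketch shows how accuracy, sensitivity (recall), and the F1-score are derived from the four cells of a binary confusion matrix. The function name `binary_metrics` and the counts passed to it are hypothetical and purely illustrative; they are not the study's data or evaluation code.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.

    tp, fp, fn, tn are the true-positive, false-positive, false-negative,
    and true-negative counts of a binary classifier.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall is the sensitivity: the share of true positives that were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts chosen only to illustrate the arithmetic.
example = binary_metrics(tp=40, fp=10, fn=10, tn=40)
print(example)
```

Because F1 depends only on precision and recall, a model can score well on accuracy while still missing many positives, which is why sensitivity is worth reporting alongside accuracy.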