Poster in Workshop: CODEML: Championing Open-source DEvelopment in Machine Learning
DeepChem-Variant: A Modular Open Source Framework for Genomic Variant Calling
Ankita Bisoi · Shreyas Vinaya Sathyanarayana · Jose Siguenza · Bharath Ramsundar
Variant calling is a fundamental task in genomic research: it detects genetic variations such as single nucleotide polymorphisms (SNPs) and insertions or deletions (indels). This paper extends DeepChem, a widely used open-source drug discovery framework, by integrating DeepVariant. We introduce DeepChem-Variant, a variant calling pipeline that leverages DeepVariant's convolutional neural network (CNN) architecture to improve variant detection accuracy and reliability. The pipeline proceeds through realignment of sequencing reads, candidate variant detection, and pileup image generation, followed by variant classification using either the original modified Inception V3 model or our novel MobileNetV2 implementation. We performed three case studies to validate our approach. Our work also contributes optimized utility functions for genomic data formats, including enhanced DataLoaders for BAM, SAM, and CRAM files and an optimized FASTALoader. Together, these implementations provide a modular and extensible variant calling framework within DeepChem, enabling tighter integration of DeepChem's drug discovery infrastructure with bioinformatics pipelines in future research.
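To make the candidate variant detection stage concrete, the following is a minimal toy sketch in plain Python. It is not the DeepChem-Variant or DeepVariant implementation: the function name `find_candidates`, its thresholds, and the input layout (gapless reads given as a 0-based start position plus a base string) are assumptions chosen for illustration, and the sketch handles SNPs only, with no realignment, indels, or base qualities.

```python
from collections import Counter


def find_candidates(reference, reads, min_fraction=0.2, min_count=2):
    """Flag positions where a non-reference base is well supported.

    Hypothetical helper for illustration only.
    reference: reference sequence as a string.
    reads: iterable of (start, sequence) pairs, aligned without gaps.
    Returns a list of (position, ref_base, alt_base, alt_count, depth).
    """
    # One base counter per reference position (a simple pileup).
    counts = [Counter() for _ in reference]
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                counts[pos][base] += 1

    candidates = []
    for pos, ref_base in enumerate(reference):
        depth = sum(counts[pos].values())
        if depth == 0:
            continue  # no read covers this position
        for base, n in sorted(counts[pos].items()):
            # Keep alternate alleles with enough supporting reads.
            if base != ref_base and n >= min_count and n / depth >= min_fraction:
                candidates.append((pos, ref_base, base, n, depth))
    return candidates


reference = "ACGTACGTAC"
reads = [
    (0, "ACGTAC"),
    (2, "GTTCGT"),
    (3, "TTCGTA"),
    (4, "TCGTAC"),
]
candidates = find_candidates(reference, reads)
print(candidates)  # [(4, 'A', 'T', 3, 4)]
```

In a pipeline of the kind described above, each candidate would then be rendered as a pileup image and passed to the CNN classifier, which decides whether it is a true variant.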