
Indiana University

Whole Genome Sequencing

PI: Andrew J. Saykin

High Performance File Systems, Research Storage, Scientific Applications and Performance Tuning groups, Systems Group, UITS Research Technologies 

Research made possible by Big Red II, Data Capacitor II, and the Scholarly Data Archive (SDA)

Figure 1. Whole genome sequencing pipeline

It would be unrealistic to conduct this project on regular workstations, given the size of the input files, the computational resources required, and the large amount of post-processing storage. With the support of the Scientific Applications and Performance Tuning (SciAPT) group, the High Performance File System (HPFS) group, and the Research Storage group, it was possible to process the current ADNI genome dataset of over 100 TB on Big Red II. All results are backed up on the Scholarly Data Archive (SDA) for future reference.

Data analysis and storage have been bottlenecks to using whole genome data to interrogate the entire coding region of a genome and analyze all 3 billion base pairs. The computational capabilities of Big Red II, used in concert with a large, high-speed, parallel file system and reliable high-speed archive storage, have permitted a breakthrough for whole genome research. This project will provide a path for recognizing disease-causing mutations as well as evaluating non-coding variations. In the long term, it will help make personalized medicine possible.

This project posed challenges in all respects. The HPFS team supported the 3.5 PB high-speed Data Capacitor II (DC2) file system. Dr. Saykin’s current ADNI input dataset includes 818 subjects, each averaging about 100 GB, totaling over 100 TB in size. The whole genome sequencing pipeline takes about 7 to 10 days to process each subject and creates interim files up to 600 GB in size. Thus, the working space on DC2 had to accommodate as much as 1 PB. Without DC2 in conjunction with the processing power of Big Red II, it would have been almost impossible to complete such processing in a reasonable timeframe.
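The storage figures above can be sanity-checked with a rough back-of-the-envelope calculation. The sketch below is illustrative only: the function name, the decimal unit conversion, and the assumption that every subject's interim files exist on disk at the same time are ours, not part of the project's actual planning. Because the per-subject sizes quoted above are rounded averages and upper bounds, the result is an order-of-magnitude estimate and will not reproduce the quoted totals exactly.

```python
# Rough working-space estimate from the figures quoted in the text.
# Assumptions (not from the source): decimal units (1 TB = 1000 GB) and
# all subjects' interim files present on disk simultaneously.

GB_PER_TB = 1000
TB_PER_PB = 1000


def estimate_storage_tb(subjects, input_gb_per_subject, interim_gb_per_subject):
    """Return estimated input, interim, and combined storage in TB."""
    input_tb = subjects * input_gb_per_subject / GB_PER_TB
    interim_tb = subjects * interim_gb_per_subject / GB_PER_TB
    return {
        "input_tb": input_tb,
        "interim_tb": interim_tb,
        "total_tb": input_tb + interim_tb,
    }


# 818 subjects, about 100 GB of input each, up to 600 GB of interim files each.
estimate = estimate_storage_tb(818, 100, 600)

# Compare the combined estimate against the 3.5 PB capacity of DC2.
dc2_capacity_tb = 3.5 * TB_PER_PB
fits_on_dc2 = estimate["total_tb"] <= dc2_capacity_tb

for name, value in estimate.items():
    print(f"{name}: {value:.1f} TB")
print(f"fits within DC2 capacity: {fits_on_dc2}")
```

Even under these simplified assumptions, the estimate lands in the hundreds of terabytes, which is consistent with the need for a petabyte-scale parallel file system rather than workstation storage.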

The Scientific Applications and Performance Tuning (SciAPT) group delivers and supports software tools that promote effective and efficient use of IU’s advanced cyberinfrastructure, which, in turn, improves research and enables discoveries.

The High Performance File System group (HPFS) provides high-speed, disk-based storage of data for IU researchers. HPFS operates the Data Capacitor II and the Data Capacitor Wide Area Network (DC-WAN) file systems.

The Research Storage group enables the IU community to store data reliably, in large quantities, over long periods of time. The Research Storage team manages and supports the Scholarly Data Archive (SDA) and the Research File System (RFS).

NSF GSS Codes:

Primary Field: Genetics (610) - Genome Sciences/Genomics

Secondary Field: Computer Science (401) - Computer Systems Analysis