Scientific Challenge. The Sequence Read Archive (SRA) is a repository for the world's public unassembled genomics data, currently ~40 petabytes. There is currently no method for finding samples in SRA that contain reads similar to a search sequence. SRA search would enable us to monitor sequences for pathogen emergence, find novel gene homologs, and more. HyperLogLog sketching is a method of locality-sensitive hashing that could potentially be used to index SRA data and allow for fast search. A current challenge with HyperLogLog search, however, is that precision and recall vary greatly by sample. This work will develop a statistical model to index sequences in a way that provides balanced precision and recall across all SRA samples. We have the infrastructure in place to do large-scale indexing on the Google Cloud Platform and experience in engineering systems to set up and host a high-profile search tool.

Position. This postdoc is with the U.S. Department of Agriculture (USDA), Agricultural Research Service (ARS), Genomics and Bioinformatics Research Unit in Gainesville, Florida. Other work locations are also possible. It is part of the SCINet/Big Data Fellows Program of the USDA ARS, which offers research opportunities to motivated postdoctoral fellows interested in working on agriculture-related problems at a range of spatial and temporal scales, from the genome to the continent, and from sub-daily to evolutionary time scales. One of the goals of the SCINet Initiative is to develop and apply new technologies, including AI and machine learning, to help solve complex agricultural problems that also depend on collaboration across scientific disciplines and geographic locations. In addition, many of these technologies rely on the synthesis, integration, and analysis of large, diverse datasets that benefit from high-performance computing (HPC) clusters. The objective of this fellowship program is to facilitate cross-disciplinary, cross-location research through collaborative work on problems that are of interest to each applicant and that are amenable to or require the HPC environment. Training will be provided in the specific AI, machine learning, deep learning, and statistical software a fellow needs to use the HPC environment to search and analyze large metagenomics datasets.

USDA-ARS Contact: If you have questions about the nature of the research, please contact Dr. Adam Rivers, Lab web site  

Anticipated Appointment Start Date: Start date is flexible.

Appointment Length: The appointment will initially be for one year, but will be renewed upon recommendation of the mentor and ARS.

Participant Stipend. The participant(s) will receive a monthly stipend commensurate with their educational level and experience. The stipend is approximately $90,000 per year, plus a stipend for health insurance through ORISE.

ORISE Information. This program, administered by ORAU through its contract with the U.S. Department of Energy (DOE) to manage the Oak Ridge Institute for Science and Education (ORISE), was established through an interagency agreement between DOE and ARS. Participants do not become employees of USDA, ARS, DOE or the program administrator, and there are no employment-related benefits. Proof of health insurance is required for participation in this program. Health insurance can be obtained through ORISE.

Preferred skills:

*    Proficiency in Linux and Bash scripting
*    Experience in Python or other languages
*    Experience with GitHub and workflow managers like Nextflow
*    Some experience with statistical modeling
*    An interest in biological applications

We recognize that everyone has a unique mix of skills and welcome applications from anyone who has an established track record of productivity in genomics or AI/ML research.

Eligibility Requirements
*    Degree: Doctoral Degree.

Interested? Email me today or apply here:


Application deadline:
Start date: Flexible
Location: USDA Agricultural Research Service, Gainesville, FL