Research Project Opportunities Available

Dear Undergrads,
My lab has a number of projects available in computational biology and machine learning that may be of interest. These projects are available immediately, and can be carried out for research credit or for pay. Interested students should send a cover letter, transcript, and resume to Dr. Ritambhara Singh <rsingh7@uw.edu>. Descriptions of the projects are below:

Hi-C Projects

Data Description: Studying the three-dimensional (3D) organization of the human genome is vital for understanding cellular functions. The spatial organization of the genome can directly or indirectly affect the regulation of genes that, in turn, can decide the fate of the cell. Various high-throughput experimental techniques, such as Hi-C, are used to study higher-order chromatin structure at different scales. The Hi-C assay uses high-throughput sequencing to measure 3D genome structure, where each read pair corresponds to an observed 3D contact between two genomic loci. Data from a Hi-C assay is typically coalesced into a matrix in which rows and columns correspond to fixed-width windows (“bins”) tiled along the genomic axis, and values in the matrix are counts of read pairs that fall into the corresponding bins.
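The binning step described above can be illustrated with a minimal sketch. It assumes a single chromosome, read pairs given as 0-based coordinate tuples, and a hypothetical helper named bin_contacts; none of these details come from an actual Hi-C pipeline, and real tools also handle multiple chromosomes, filtering, and normalization.

```python
from collections import Counter


def bin_contacts(read_pairs, bin_size, chrom_length):
    """Count read pairs per (row_bin, col_bin) for one chromosome.

    read_pairs: iterable of (pos_a, pos_b) 0-based genomic coordinates.
    Returns a dense symmetric matrix as a list of lists.
    (Hypothetical helper for illustration only.)
    """
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    counts = Counter()
    for pos_a, pos_b in read_pairs:
        i, j = pos_a // bin_size, pos_b // bin_size
        counts[(min(i, j), max(i, j))] += 1  # accumulate in the upper triangle
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for (i, j), c in counts.items():
        matrix[i][j] = c
        matrix[j][i] = c  # a contact between i and j is also one between j and i
    return matrix


# Four made-up read pairs on a 150 kb chromosome, binned at 40 kb.
pairs = [(5_000, 95_000), (12_000, 91_000), (45_000, 47_000), (130_000, 20_000)]
contact_matrix = bin_contacts(pairs, bin_size=40_000, chrom_length=150_000)
```

Here the first two read pairs both fall into bins 0 and 2, so that cell of the matrix holds a count of 2.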

Project #1: Automatic Resolution Selection for Hi-C Data

  • Project Description: Typical Hi-C analysis is done using bin sizes of 40 kb or 100 kb, which might not be the optimal bin size to gain useful biological insights from the data. The goal of this project is to produce a lightweight Python package that automatically selects an appropriate fixed-width bin size for a given set of Hi-C reads. The current implementation splits the Hi-C reads from a dataset into “train” and “test” sets and then varies the train set bin size, using the test set to evaluate how similar the two sets are. The bin size that gives the highest similarity score is recommended as the optimal resolution for that dataset. The project requires this implementation to be converted into an easy-to-use, fast and efficient package, with proper documentation, that will be widely used by the research community.

  • Recommended reading:

[1] Cameron, Christopher JF, Josee Dostie, and Mathieu Blanchette. “Estimating DNA-DNA interaction frequency from Hi-C data at restriction-fragment resolution.” bioRxiv (2018): 377523.
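The train/test idea in Project #1 can be sketched as follows. The description above does not say which similarity score the current implementation uses, so this sketch swaps in an assumed one: the average held-out log-likelihood of the test read pairs under a smoothed histogram built from the train read pairs. The function names, the smoothing constant alpha, and the single-chromosome setup are all hypothetical.

```python
import math
import random
from collections import Counter


def heldout_log_likelihood(train, test, bin_size, chrom_length, alpha=1.0):
    """Average log-density of test read pairs under a train-set histogram.

    Each read pair is treated as a point in the plane; alpha is an assumed
    add-alpha smoothing constant so that empty cells get nonzero density.
    """
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    n_cells = n_bins * n_bins
    counts = Counter((a // bin_size, b // bin_size) for a, b in train)
    norm = (len(train) + alpha * n_cells) * bin_size ** 2
    total = sum(
        math.log((counts[(a // bin_size, b // bin_size)] + alpha) / norm)
        for a, b in test
    )
    return total / len(test)


def select_bin_size(read_pairs, candidates, chrom_length, test_fraction=0.5, seed=0):
    """Split reads into train/test and return the best-scoring candidate bin size."""
    shuffled = list(read_pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]
    scores = {
        size: heldout_log_likelihood(train, test, size, chrom_length)
        for size in candidates
    }
    return max(scores, key=scores.get), scores


# Synthetic read pairs concentrated near the diagonal of a 1 Mb chromosome.
rng = random.Random(1)
length = 1_000_000
pairs = []
for _ in range(2000):
    a = rng.randrange(length)
    b = min(length - 1, max(0, a + int(rng.gauss(0, 50_000))))
    pairs.append((a, b))

best, scores = select_bin_size(pairs, [10_000, 40_000, 100_000, 500_000], length)
```

Because the histogram is normalized by cell area, scores are comparable across bin sizes: very small bins leave most cells empty, and very large bins blur the signal. Any other train/test similarity measure could be dropped in by replacing heldout_log_likelihood.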

Project #2: Improving Hi-C Resolution using Graph Convolutional Networks (GCNs)

  • Project Description: Due to high sequencing costs, Hi-C experiments can result in low read coverage and high data sparsity. To analyze such datasets, researchers use large (100 kb) fixed-width bin sizes to reduce noise in the data. While this approach may give insights into the global interactions within and among chromosomes, it makes it hard to locate finer interactions among regulatory elements of the DNA. In order to improve the Hi-C resolution for better downstream analysis, this project proposes the use of deep neural networks to predict high-resolution Hi-C maps from low-resolution ones. Specifically, we will use graph convolutional networks (GCNs), in which we treat the Hi-C map as an undirected graph G, with nodes V being the different genomic loci and the number of contacts between them represented as weighted edges E. Thus, the Hi-C resolution improvement task can be viewed as link prediction for missing links (or edges) in the graph G. The project requires implementation of novel model architectures involving GCNs and extensive hyperparameter tuning to achieve state-of-the-art performance on this prediction task.

  • Recommended reading:

[1] Zhang, Yan, et al. “Enhancing Hi-C data resolution with deep convolutional neural network HiCPlus.” Nature Communications 9.1 (2018): 750.

[2] Kipf, Thomas N., and Max Welling. “Semi-supervised classification with graph convolutional networks.” arXiv preprint arXiv:1609.02907 (2016).
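To make the graph view in Project #2 concrete, here is a minimal NumPy sketch of a single graph convolution in the style of reference [2], applied to a toy Hi-C contact matrix used as a weighted adjacency matrix. The toy counts, the one-hot node features, the random weights, and the inner-product edge scoring are assumptions for illustration; this is not the architecture the project will use.

```python
import numpy as np


def normalize_adjacency(contacts):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} from Kipf and Welling."""
    a_tilde = contacts + np.eye(contacts.shape[0])  # add self-loops
    deg = a_tilde.sum(axis=1)                       # weighted degree, always >= 1
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(a_hat, features, weights):
    """One propagation step: ReLU(A_hat @ H @ W)."""
    return np.maximum(a_hat @ features @ weights, 0.0)


# Toy 4-bin contact matrix (made-up counts), treated as weighted edges E.
contacts = np.array(
    [[0, 3, 1, 0],
     [3, 0, 2, 0],
     [1, 2, 0, 4],
     [0, 0, 4, 0]],
    dtype=float,
)
a_hat = normalize_adjacency(contacts)

rng = np.random.default_rng(0)
features = np.eye(4)                 # assumed one-hot node features
weights = rng.normal(size=(4, 2))    # untrained weights, illustration only
embeddings = gcn_layer(a_hat, features, weights)

# Assumed decoder: score every candidate edge by an inner product of embeddings.
edge_scores = embeddings @ embeddings.T
```

In a trained model the weights would be learned so that edge_scores for held-out bin pairs approximate the contacts observed in a high-resolution map; the sketch only shows how information flows along weighted edges.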

Manifold Alignment Project

Project #3: Learning Manifold Alignment for Two Distinct Datasets

  • Project Description: This project involves learning correspondences across datasets from different domains. For example, for a given population of cells, one can obtain two different sets of measurements from different experiments. If these experiments are performed on disjoint but similar subsets of cells, then it may be necessary to embed the two populations into a latent space in such a way that the two populations are distributed similarly. Recently, a Gromov-Wasserstein distance-based framework has been shown to successfully learn cross-domain correspondences among languages, and the goal of this project is to use this framework in the biological setting described above. Therefore, the project requires an efficient implementation of a Gromov-Wasserstein distance-based framework that learns correspondences between features from different biological experiments and aligns them in a common latent space.

  • Recommended reading:

[1] Alvarez-Melis, David, and Tommi S. Jaakkola. “Gromov-Wasserstein Alignment of Word Embedding Spaces.” arXiv preprint arXiv:1809.00013 (2018).
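As a rough illustration of the quantity behind Project #3, the sketch below evaluates the squared-loss Gromov-Wasserstein objective for a given coupling between two point sets. It only scores a coupling and does not search for one, which is the hard part a full implementation would need to do efficiently. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def pairwise_sq_dists(x):
    """Squared Euclidean distances between all rows of x."""
    diff = x[:, None, :] - x[None, :, :]
    return (diff ** 2).sum(axis=-1)


def gw_cost(c1, c2, coupling):
    """Squared-loss GW objective: sum (c1[i,k] - c2[j,l])^2 T[i,j] T[k,l].

    Expanded into three matrix terms to avoid a four-way loop.
    """
    p = coupling.sum(axis=1)  # marginal over the first dataset
    q = coupling.sum(axis=0)  # marginal over the second dataset
    term1 = p @ (c1 ** 2) @ p
    term2 = q @ (c2 ** 2) @ q
    cross = np.sum((c1 @ coupling @ c2.T) * coupling)
    return term1 + term2 - 2.0 * cross


# Second dataset = rotated and shuffled copy of the first (synthetic).
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
rotation, _ = np.linalg.qr(rng.normal(size=(3, 3)))
perm = np.array([2, 0, 4, 1, 3])
y = (x @ rotation)[perm]          # y[j] is the rotated x[perm[j]]

c1, c2 = pairwise_sq_dists(x), pairwise_sq_dists(y)

true_coupling = np.zeros((5, 5))
true_coupling[perm, np.arange(5)] = 1 / 5   # mass on the correct matches
uniform_coupling = np.full((5, 5), 1 / 25)  # uninformative coupling

cost_true = gw_cost(c1, c2, true_coupling)
cost_uniform = gw_cost(c1, c2, uniform_coupling)
```

Because the objective compares only within-dataset distances, the two datasets never need to share a coordinate system: the coupling that matches each point to its rotated copy drives the cost to (numerically) zero, while the uniform coupling does not.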

Bill Noble
Professor, Genome Sciences
Adjunct Professor, Computer Science
December 18, 2018