Hegemon (Rohan Subramanian)

Overview

I'm Rohan, a high school student in Bangalore, India. I have been a member of the Boolean Lab since June 2020.

Starting in summer 2020, I began a research project at the Boolean Lab under the guidance of Dr Debashis Sahoo.

We improved the prediction accuracy of in silico gene reporting methods for retinal cell types by augmenting correlational methods using boolean analysis.

A presentation of my work can be viewed here.

Click here to view annotated retina datasets in the Hegemon online tool.

Click here to view all annotated human and murine eye datasets in the Hegemon online tool.

Preprints and Publications

Rohan Subramanian, Debashis Sahoo. "Boolean Implication Analysis Improves Prediction Accuracy of In Silico Gene Reporting of Retinal Cell Types." bioRxiv 2020.09.28.317313;

doi: https://doi.org/10.1101/2020.09.28.317313

Daily Log

June 16th

I began a project on stem cells by reading a paper by Phillips et al. 2018, which can be viewed at this link.

The paper uses correlational methods and bait genes to find genes associated with differentiating retinal cell types such as photoreceptors.

I met with Dr. Debashis Sahoo, and learned about the power of a boolean approach, compared with a correlational approach, to capture asymmetric relationships.

Boolean analysis of datasets such as the single-cell RNA-seq data from differentiating hPSCs can help identify boolean relationships between genes.

These formal methods can help find invariants in biological systems.

June 17th

I downloaded and organized the expression data from the experiment, available at this link.

I compressed the range of the gene expression values using a log2(v+1) transformation.
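As a minimal sketch of the log2(v+1) transformation, assuming the raw values for one gene are available as a list of non-negative numbers (the function name and the sample values are hypothetical, not taken from the dataset):

```python
import math

def log2_plus_one(values):
    """Apply log2(v + 1) to each raw expression value.

    Adding 1 keeps zero counts defined (log2(1) = 0).
    """
    return [math.log2(v + 1) for v in values]

# Hypothetical raw values for a single gene across four cells.
raw_counts = [0, 1, 3, 1023]
print(log2_plus_one(raw_counts))  # [0.0, 1.0, 2.0, 10.0]
```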

June 18th

I began working on the Hegemon server, and organizing the data there. I uploaded the expr file (see below) to Hegemon/Data.

Four files are required to analyze the data using the online Hegemon tools:

-expr.txt : A tab delimited file with the first two columns containing the transcript identifier and the gene name. The remaining columns describe the normalized expression values for each cell.
-ih.txt : A tab delimited file with columns ArrayID, ArrayHeader and ClinicalhHeader.
-idx.txt : A tab delimited file with columns transcript identifier, file pointer, gene name and gene description.
-survival.txt : A tab delimited file with columns ArrayID, time, status, followed by characteristics of the sample (denoted with c characteristic.)

June 19th

I created and uploaded the -ih.txt and -survival.txt files to Hegemon/Data.

June 20th

I wrote a Python program to find the byte address of each gene in the -expr.txt file, and stored these as file pointers in the -idx.txt file, enabling each gene to be found efficiently in the file.

June 21st

I improved the algorithm to build the idx file from two-pass to one-pass. I imported all four files into the Hegemon online tool.
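The sketch below shows one way a one-pass index could be built. It is not the program described above. The function names, the idx header names, the demo file names, and the empty description column are all assumptions made for illustration. The file is opened in binary mode so that tell() returns true byte offsets.

```python
import os
import tempfile

def build_index(expr_path, idx_path):
    """Single pass over the expr file: record the byte offset of each row."""
    with open(expr_path, "rb") as expr, open(idx_path, "w", newline="\n") as idx:
        # Header names are an assumption for this sketch.
        idx.write("ProbeID\tPtr\tName\tDescription\n")
        expr.readline()  # skip the header row of the expr file
        while True:
            offset = expr.tell()  # byte address where the next row starts
            line = expr.readline()
            if not line:
                break
            fields = line.decode("utf-8").rstrip("\r\n").split("\t")
            if len(fields) < 2:
                continue  # ignore blank or malformed rows
            # The description column is left empty in this sketch.
            idx.write(f"{fields[0]}\t{offset}\t{fields[1]}\t\n")

def load_pointers(idx_path):
    """Map gene name -> byte offset, read from the idx file."""
    pointers = {}
    with open(idx_path, newline="\n") as idx:
        next(idx)  # skip header
        for line in idx:
            _probe, ptr, name, _desc = line.rstrip("\n").split("\t")
            pointers[name] = int(ptr)
    return pointers

def read_gene(expr_path, pointers, gene):
    """Jump straight to a gene's row and return its expression values."""
    with open(expr_path, "rb") as expr:
        expr.seek(pointers[gene])
        fields = expr.readline().decode("utf-8").rstrip("\r\n").split("\t")
    return [float(v) for v in fields[2:]]

# Tiny hypothetical expr file with two genes and two cells.
workdir = tempfile.mkdtemp()
expr_path = os.path.join(workdir, "demo-expr.txt")
idx_path = os.path.join(workdir, "demo-idx.txt")
with open(expr_path, "w", newline="\n") as f:
    f.write("ProbeID\tName\tcell_1\tcell_2\n")
    f.write("T1\tCRX\t1.5\t0.0\n")
    f.write("T2\tGNAT2\t0.0\t3.25\n")

build_index(expr_path, idx_path)
pointers = load_pointers(idx_path)
print(read_gene(expr_path, pointers, "GNAT2"))  # [0.0, 3.25]
```

With the offsets stored, looking up a gene costs one seek and one line read, rather than a scan of the whole expression file.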

June 22nd

I fixed a few formatting errors in the expr file, and the Phillips 2018 retina dataset can now be analyzed online with Hegemon here.

The next step of the investigation is to reproduce the results of the paper, and confirm their results derived using correlational methods. I created a Python notebook to begin this analysis on Hegemon.

June 23rd

I familiarized myself with the HegemonUtil Python tool, and wrote a program to find the correlation between two genes, and tested it.
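A correlation between two genes can be sketched without HegemonUtil, whose API is not shown here. The example below assumes each gene's expression values are already loaded as equal-length lists, and computes both Pearson and Spearman (rank-based) coefficients in plain Python. The function names and sample values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical expression values for two genes across five cells.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [1.0, 4.0, 9.0, 16.0, 25.0]
print(pearson(gene_a, gene_b), spearman(gene_a, gene_b))
```

The two coefficients differ whenever the relationship is monotonic but not linear, as in the sample values above.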

June 24th

I attempted to reproduce the paper's results for the top 200 correlated PR genes using CRX and PRDM1, with mixed results. I decided to investigate whether using raw data or other methods of normalization might impact the results.

June 25th

I created two new datasets, one with raw data and one using median ratio normalization (MRN), the method used by the paper. I continued correlational analysis on these two datasets. The MRN normalization using the EBSeq package, as used in the paper, appeared not to work fully.

June 26th - July 1st

Analysing the top 200 SRCCA genes yielded a reproduction accuracy of 75-87%.

When comparing the genes obtained within the thresholds detailed in the paper, the results could be reproduced accurately. Almost all of the genes identified by Phillips et al. 2018 were reproduced, in addition to some others.

Hence, the log2(v+1) data can be considered reliable, as it successfully reproduced the results from Fig. 7 of the paper, which use two bait genes to identify genes correlated with four retinal cell types.

July 2nd and 3rd

The next step of the investigation is to carry out boolean analysis of the dataset to see if it can improve prediction accuracy.

The types of boolean relations are:

0: no relation
1: low implies high
2: low implies low
3: high implies high
4: high implies low
5: equivalent
6: opposite

We need to look for genes with relationships 2, 3 and 5 to improve upon correlational methods.
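The relation codes above can be illustrated with a simplified quadrant test. This is a sketch in the style of the BooleanNet sparsity test, not Hegemon's implementation. The thresholds are passed in by hand, values near a threshold are not excluded, and the cutoffs 3.0 and 0.1 are assumed defaults. Gene A is on the x-axis and gene B on the y-axis, so code 3 reads "A high implies B high".

```python
import math

def quadrant_counts(a, b, thr_a, thr_b):
    """Count samples in each (A state, B state) quadrant; 0 = low, 1 = high."""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for va, vb in zip(a, b):
        counts[(int(va > thr_a), int(vb > thr_b))] += 1
    return counts

def is_sparse(counts, quad, stat_thr=3.0, err_thr=0.1):
    """Quadrant is sparse if it holds far fewer samples than expected by chance."""
    qa, qb = quad
    total = sum(counts.values())
    row = counts[(qa, 0)] + counts[(qa, 1)]  # samples with A in state qa
    col = counts[(0, qb)] + counts[(1, qb)]  # samples with B in state qb
    if total == 0 or row == 0 or col == 0:
        return False
    observed = counts[quad]
    expected = row * col / total
    statistic = (expected - observed) / math.sqrt(expected)
    error_rate = 0.5 * (observed / row + observed / col)
    return statistic > stat_thr and error_rate < err_thr

def boolean_relation(a, b, thr_a, thr_b):
    """Return the relation code (0-6) between gene A and gene B."""
    counts = quadrant_counts(a, b, thr_a, thr_b)
    sparse = {q for q in counts if is_sparse(counts, q)}
    if sparse == {(0, 1), (1, 0)}:
        return 5  # equivalent
    if sparse == {(0, 0), (1, 1)}:
        return 6  # opposite
    if sparse == {(0, 0)}:
        return 1  # A low implies B high
    if sparse == {(0, 1)}:
        return 2  # A low implies B low
    if sparse == {(1, 0)}:
        return 3  # A high implies B high
    if sparse == {(1, 1)}:
        return 4  # A high implies B low
    return 0  # no relation

# Hypothetical data: 40 low/low, 20 low/high, 40 high/high, no high/low samples.
gene_a = [0.0] * 60 + [2.0] * 40
gene_b = [0.0] * 40 + [2.0] * 60
print(boolean_relation(gene_a, gene_b, thr_a=1.0, thr_b=1.0))  # 3
```

In the sample data the only empty quadrant is "A high, B low", which is why the asymmetric relation 3 is reported even though the two genes are not equivalent.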

To do this, we need to search for bait genes that lead to a smaller set of genes which still contains all of the "gold standard" genes, i.e. those already known to be associated with that cell type.
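One way to compare candidate bait genes is sketched below, under the assumption that the relation code between each bait gene and every other gene has already been computed. The data layout, function names, and gene labels are hypothetical placeholders. The idea is to keep the genes that have an allowed relation (2, 3 or 5) with every bait gene, then report how many gold standard genes survive and how long the list is.

```python
def candidate_genes(relations, bait_genes, allowed=(2, 3, 5)):
    """Genes whose relation code with every bait gene is in `allowed`.

    `relations[bait][gene]` holds a relation code from 0 to 6.
    """
    per_bait = [
        {gene for gene, code in relations[bait].items() if code in allowed}
        for bait in bait_genes
    ]
    return set.intersection(*per_bait) if per_bait else set()

def score_bait_genes(candidates, gold_standard):
    """Return (fraction of gold standard genes recovered, candidate list size)."""
    recovered = candidates & gold_standard
    return len(recovered) / len(gold_standard), len(candidates)

# Hypothetical relation codes for two bait genes and four other genes.
relations = {
    "BAIT_1": {"G1": 3, "G2": 5, "G3": 0, "G4": 2},
    "BAIT_2": {"G1": 5, "G2": 3, "G3": 3, "G4": 4},
}
gold_standard = {"G1", "G2"}

candidates = candidate_genes(relations, ["BAIT_1", "BAIT_2"])
print(sorted(candidates), score_bait_genes(candidates, gold_standard))
```

A good bait pair under this scoring recovers a high fraction of the gold standard while keeping the candidate list short.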

July 4th - 6th

Cone Cells (Fig. 6)

I worked on identifying bait genes for the boolean analysis of Cone cells (Fig. 6). I shortlisted PDE6H and GNAT2, and GNAT2 and ARR3, as they identified most gold standard genes and produced a smaller list. By analysing Hegemon scatter plots, I could hypothesize that PDE6H, GNAT2 and ARR3 were expressed successively as the stem cells continued to differentiate. Hence, GNAT2 and ARR3 were the best choice as bait genes instrumental in determining cell fate for cones.

Boolean analysis using GNAT2 and ARR3 as bait genes led to a much smaller list of genes, which did not include several high-confidence candidates put forward by Phillips et al., such as AKAP9 and MEGF9. Investigating these genes in RGC datasets such as Sajgo 2018 (GSE87647) showed that AKAP9 was also expressed in other retinal cell types. Hence, these genes identified by correlational analysis are likely to be non-cone-specific.

July 7th - 12th

I identified gold standard genes for retinal ganglion cells and retinal pigment epithelial cells. I verified them across several annotated datasets in Hegemon, and used them as a starting point to find bait genes.

I found bait genes for RGCs that led to a smaller list of genes with more gold standard genes present. For RPEs, I was unable to find genes with many low-low boolean relations.

July 13th - 17th

I attempted to extend the process to multiple bait genes rather than just two, which led to identification of RPE genes with both high-high and low-low boolean relations. I also identified genes associated with retinal progenitor cells, a cell type that gives rise to all specialized retinal cell types, and identified multiple bait genes.

July 17th - 21st

I improved the analysis of RPCs, which was capturing many cell cycle related genes as opposed to retina-specific genes.

Now that I had lists of genes for each cell type, derived from Boolean implication analysis, the next step was to determine whether there is a statistically significant difference in the quality of genes from this method as compared to correlational methods.

One way to do this is to quantify how retina-specific the genes are. I will attempt to do this by performing a one-tailed t-test between the expression values in the whole eye vs. the expression values in the retina alone in the Mustafi 2016 bulk RNA-seq dataset, and store the p-value as a rigorous measure of the likelihood that the gene is overexpressed in the retina.
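The test statistic for such a comparison can be sketched as follows. The log does not say which t-test variant was intended, so Welch's unequal-variance form is an assumption here, and the function name and sample values are hypothetical. The sketch returns the t statistic and degrees of freedom only. The one-tailed p-value is the upper tail of the t distribution at that point, which a statistics library can supply (for example SciPy's ttest_ind with equal_var=False and alternative="greater").

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Welch's t statistic and degrees of freedom for H1: mean(a) > mean(b)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard errors of the two sample means.
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df

# Hypothetical log2 expression of one gene in retina vs. whole eye samples.
retina = [8.0, 9.0, 10.0]
whole_eye = [4.0, 5.0, 6.0]
t_stat, dof = welch_t(retina, whole_eye)
print(t_stat, dof)
```

A large positive t supports overexpression in the retina. A negative t means the gene is, if anything, lower in the retina, so the one-tailed p-value would be above 0.5.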

July 22nd - 27th

I attempted to quantify the retina specificity of the genes obtained, but the datasets were not comprehensive enough to achieve meaningful results.

I began the next method of quantification, which was to check the reproducibility of the results in a similar, larger dataset.

Hence, I downloaded and processed the Voigt 2020 dataset (GSE130636 and GSE142449), which is a large human retina scRNA-seq dataset.

July 27th - 31st

I showed that there was a statistically significant improvement in the proportion of genes that could be reproduced by repeating the analysis in both the Phillips and Voigt datasets using the same bait genes for cones: CRX, GNAT2 and GNB3.

August 1st - 3rd

Dr Sahoo and I annotated several human and mouse datasets and uploaded them into Hegemon under the key rt:. Some of these include purified rods and cones, which can allow for a robust quantification of genes through differential expression.

Using the Hartl 2017 and Sarin 2018 purified cell type mouse datasets, I was able to show a statistically significant improvement in cell type-specificity and cell class-specificity of the photoreceptor genes, more so in the rod genes than in the cone genes.

With a comprehensive quantification of the results complete, I presented the results during the Boolean Lab meeting on 3rd August, and will work on writing a manuscript in the coming days.

August 4th - 24th

I identified novel cone- and rod-specific genes from the gene lists from Boolean analysis, and validated them in the Hegemon datasets.

I created the main figures and wrote the first draft of the manuscript for the paper.

August 24th - September 3rd

I created the supplementary figures. I took feedback from the lab about the figures and implemented it. I rectified some errors in the p-values.

September 3rd - 11th

I wrote a covering letter for the manuscript. Dr Sahoo and I adjusted it to the format of a presubmission inquiry and sent it to Stem Cells (where the Phillips paper had been published). They responded positively.

September 11th - 25th

To improve the background section, I performed a literature review of analysis of retinal cell types. I cited all the retina datasets in Hegemon and classified all the papers by the method used in vivo (knockout, reporter line, single cell etc.) and the bioinformatics methods developed. I added the mini-review to the background section.

I sent the latest draft of the manuscript to the whole lab for review before submission.

September 25th - October 1st

I wrote a significance statement for the manuscript.

Dr Sahoo edited the manuscript. I received feedback from the whole lab on the figures and implemented it. Dr Sahoo created a graphical abstract.

We submitted the manuscript to STEM CELLS, and posted the preprint on bioRxiv.

October 1st - 3rd

The preprint was made available on bioRxiv. STEM CELLS placed the paper under peer review.

Meanwhile, we will attempt to find more methods of quantification and visualization using the other datasets.

October 9th - October 16th

While waiting for the reviews, I analyzed some more datasets.

Idea 1: Use the Daum cone single-cell dataset for cones and the Kim 2016 mouse rod dataset for rods to show that the expression of genes over time fits the model of Boolean implication. Read the -thr file to get StepMiner thresholds in HegemonUtil. Make line graphs of gene expression over time.

Idea 2: Repeat the analysis in the new Kim 2019 cone-enriched organoid single-cell dataset. Test with the same bait genes and check reproducibility.
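Idea 1 relies on StepMiner thresholds to call a gene low or high at each time point. As a rough illustration of what such a threshold represents, the sketch below fits a single step to the sorted expression values by minimizing the squared error, then takes the midpoint between the two segment means. This is a simplified reading of StepMiner, not the code behind the -thr file, and the function name and sample values are hypothetical.

```python
def step_threshold(values):
    """Fit one step to the sorted values and return the midpoint of the two levels."""
    xs = sorted(values)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two values to fit a step")
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    left_sum = left_sq = 0.0
    best_sse = None
    best_thr = None
    for i in range(1, n):  # i = number of values placed on the low side
        left_sum += xs[i - 1]
        left_sq += xs[i - 1] ** 2
        mean_low = left_sum / i
        mean_high = (total - left_sum) / (n - i)
        # Sum of squared errors of each side around its own mean.
        sse = (left_sq - i * mean_low ** 2) + (
            (total_sq - left_sq) - (n - i) * mean_high ** 2
        )
        if best_sse is None or sse < best_sse:
            best_sse = sse
            best_thr = (mean_low + mean_high) / 2
    return best_thr

# Hypothetical expression of one gene over a time course, switching from low to high.
time_course = [1.0, 1.0, 1.0, 9.0, 9.0, 9.0]
thr = step_threshold(time_course)
states = ["high" if v > thr else "low" for v in time_course]
print(thr, states)
```

Plotting the time course with the threshold as a horizontal line would show the point at which the gene switches state, which is what the line graphs in Idea 1 are meant to reveal.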

I checked the time course expression of WWC1 in the Daum dataset and PPEF2 in the Kim dataset in the online tool. Both follow the expected trend, especially PPEF2, so I will make the plots with matplotlib soon.

October 22nd - November 1st (Dussehra Break)

I made the plots for WWC1 and PPEF2, which fit the trend. I made a figure, presented it and received feedback during the 28th Oct meeting.

I paused work on the research during November due to college applications.

December 15th - December 25th

We decided to resubmit the paper to a computational journal, after addressing a few of the reviewers' comments.

We decided to include more single-cell validation from newer and more comprehensive datasets.

I was able to demonstrate the good performance of Boolean methods in newer and much larger single cell datasets.

I generated violin plots from several bulk and single-cell datasets.

January 15th - 31st

I generated more violin plots for time course datasets as well. I ran scripts to generate PDFs of the plots, and plan to compile them into a figure to add to the manuscript.