Rohan Subramanian, Debashis Sahoo. "Boolean Implication Analysis Improves Prediction Accuracy of In Silico Gene Reporting of Retinal Cell Types." bioRxiv 2020.09.28.317313;
doi: https://doi.org/10.1101/2020.09.28.317313
I began a project on stem cells by reading a paper by Phillips et al. 2018, which can be viewed at this link.
The paper uses correlational methods and bait genes to find genes associated with differentiating retinal cell types such as photoreceptors.
I met with Dr. Debashis Sahoo and learned how a Boolean approach, unlike a correlational one, can capture asymmetric relationships between genes.
Boolean analysis of datasets such as the single-cell RNA-seq data from differentiating hPSCs can help identify boolean relationships between genes.
These formal methods can help find invariants in biological systems.
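The asymmetry mentioned above can be illustrated with a small sketch. It splits two genes into low and high at fixed thresholds, counts the samples in each of the four quadrants, and reports which quadrants are nearly empty. The function names, the thresholds, and the 5% sparsity cutoff are assumptions made for illustration; this is not the statistical test used by Hegemon.

```python
def quadrant_counts(x, y, x_thr, y_thr):
    """Count samples per quadrant; keys read '<x level>-<y level>'."""
    counts = {"low-low": 0, "low-high": 0, "high-low": 0, "high-high": 0}
    for xv, yv in zip(x, y):
        x_level = "high" if xv > x_thr else "low"
        y_level = "high" if yv > y_thr else "low"
        counts[x_level + "-" + y_level] += 1
    return counts


def sparse_quadrants(counts, max_fraction=0.05):
    """Return quadrants holding at most max_fraction of all samples (assumed cutoff)."""
    total = sum(counts.values())
    return sorted(q for q, n in counts.items() if n <= max_fraction * total)


# Toy data (hypothetical): gene B is high only when gene A is high,
# but gene A can be high while gene B is still low.
gene_a = [0.1, 0.2, 5.0, 6.0, 5.5, 0.3, 6.2, 5.8]
gene_b = [0.0, 0.1, 4.0, 0.2, 4.5, 0.1, 0.3, 5.0]

counts = quadrant_counts(gene_a, gene_b, x_thr=2.5, y_thr=2.0)
print(counts)
print(sparse_quadrants(counts))
```

In this toy example only the quadrant where A is low and B is high is empty, which corresponds to the one-directional statement "B high implies A high". A correlation coefficient would summarize the same data as a moderate positive association and lose the direction.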
I downloaded and organized the expression data from the experiment, available at this link.
I compressed the range of the gene expression values using a log2(v+1) transformation.
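As a minimal sketch of this transformation, assuming the expression values are non-negative numbers held in a plain Python list (the function name is hypothetical):

```python
import math


def log2_plus_one(values):
    """Apply log2(v + 1) to each value; the +1 keeps zero counts at zero."""
    return [math.log2(v + 1) for v in values]


raw = [0, 1, 3, 1023]
print(log2_plus_one(raw))  # [0.0, 1.0, 2.0, 10.0]
```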
I began working on the Hegemon server, and organizing the data there. I uploaded the expr file (see below) to Hegemon/Data.
Four files are required to analyze the data using the online Hegemon tools: the -expr.txt, -idx.txt, -ih.txt and -survival.txt files.
I created and uploaded the -ih.txt and -survival.txt files to Hegemon/Data.
I wrote a Python program to find the byte address of each gene in the -expr.txt file, and stored these as file pointers in the -idx.txt file, enabling each gene to be found efficiently in the file.
I improved the algorithm to build the idx file from two-pass to one-pass. I imported all four files into the Hegemon online tool.
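A one-pass index of this kind might look like the sketch below. It assumes a tab-separated expression file with one header line and the gene identifier in the first column; the exact column layout of the Hegemon files and the function names are assumptions, and the index is kept in a dictionary rather than written out in the -idx.txt format.

```python
import os
import tempfile


def build_index(expr_path):
    """Map each gene ID to the byte offset of its row, reading the file once."""
    index = {}
    offset = 0
    with open(expr_path, "rb") as f:  # binary mode so offsets are true byte positions
        for line_no, line in enumerate(f):
            if line_no > 0:  # skip the header row
                gene = line.split(b"\t", 1)[0].decode()
                index[gene] = offset
            offset += len(line)
    return index


def read_gene(expr_path, offset):
    """Jump straight to a stored offset and return that row as a list of fields."""
    with open(expr_path, "rb") as f:
        f.seek(offset)
        return f.readline().decode().rstrip("\n").split("\t")


# Small demonstration file (hypothetical contents).
with tempfile.NamedTemporaryFile("w", suffix="-expr.txt", delete=False, newline="\n") as tmp:
    tmp.write("ProbeID\tName\tS1\tS2\n")
    tmp.write("CRX\tCRX\t3.1\t0.0\n")
    tmp.write("GNAT2\tGNAT2\t0.5\t4.2\n")
    demo_path = tmp.name

demo_index = build_index(demo_path)
demo_row = read_gene(demo_path, demo_index["GNAT2"])
print(demo_index)
print(demo_row)
os.remove(demo_path)
```

Tracking the running offset while reading means the gene names and their positions are collected in the same pass, and a later lookup needs only one seek instead of a scan through the whole file.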
I fixed a few formatting errors in the expr file, and the Phillips 2018 retina dataset can now be analyzed online with Hegemon here.
The next step of the investigation is to reproduce the results that the paper derived using correlational methods. I created a Python notebook to begin this analysis on Hegemon.
I familiarized myself with the HegemonUtil Python tool, and wrote a program to find the correlation between two genes, and tested it.
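The correlation between two genes can be sketched without any Hegemon-specific calls. The example below assumes a Spearman rank correlation (the choice of coefficient is an assumption here) and takes the two expression vectors as plain lists; it does not use the HegemonUtil API, and the sample values are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical expression values for two genes across five cells.
crx = [0.0, 1.2, 3.4, 5.6, 7.8]
prdm1 = [0.1, 0.9, 2.0, 9.0, 30.0]
print(spearman(crx, prdm1))
```

Because the second vector rises whenever the first does, the rank correlation here is 1 up to floating-point rounding, even though the relationship is far from linear.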
I attempted to reproduce the paper's results for the top 200 correlated PR genes using CRX and PRDM1, with mixed results. I decided to investigate whether using raw data or other methods of normalization might impact the results.
I created two new datasets, one with raw data and one using Median To Ratio normalization (used by the paper). I continued correlational analysis on these two datasets. The MRN normalization using the EBSeq package as used in the paper appeared not to work fully.
Analysing the top 200 SRCCA genes yielded a reproduction accuracy of 75-87%.
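One way to compute such a reproduction accuracy is as the fraction of the published gene list that also appears in the reproduced list. The sketch below assumes that definition, which the entry above does not spell out; the gene names are placeholders.

```python
def reproduction_accuracy(published, reproduced):
    """Fraction of published genes that also appear in the reproduced list."""
    published_set = set(published)
    if not published_set:
        raise ValueError("published gene list is empty")
    return len(published_set & set(reproduced)) / len(published_set)


# Placeholder gene names, not results from the dataset.
published_top = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
reproduced_top = ["GENE_B", "GENE_C", "GENE_D", "GENE_E"]
print(reproduction_accuracy(published_top, reproduced_top))  # 0.75
```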
When comparing the genes obtained within the thresholds detailed in the paper, the results could be reproduced accurately. Almost all of the genes identified by Phillips et al. 2018 were reproduced, in addition to some others.
Hence, the log2(v+1) data can be considered reliable, as it successfully reproduced the results from Fig. 7 of the paper, which uses two bait genes to identify genes correlated with four retinal cell types.
The next step of the investigation is to carry out boolean analysis of the dataset to see if it can improve prediction accuracy.
The types of boolean relations are:
We need to look for genes with relationships 2, 3 and 5 to improve upon correlational methods.
To do this, we need to search for bait genes that yield a smaller set of genes while still containing all of the "gold standard" genes, i.e. the genes already known to be markers for that cell type.
I worked on identifying bait genes for the boolean analysis of Cone cells (Fig. 6). I shortlisted PDE6H and GNAT2, and GNAT2 and ARR3, as they identified most gold standard genes and produced a smaller list. By analysing Hegemon scatter plots, I could hypothesize that PDE6H, GNAT2 and ARR3 were expressed successively as the stem cells continued to differentiate. Hence, GNAT2 and ARR3 were the best choice as bait genes instrumental in determining cell fate for cones.
Boolean analysis using GNAT2 and ARR3 as bait genes led to a much smaller list of genes, which did not include several high-confidence candidates put forward by Phillips et al., such as AKAP9 and MEGF9. Investigating these genes in RGC datasets such as Sajgo 2018 (GSE87647) showed that AKAP9 was also expressed in other retinal cell types. Hence, these genes identified by correlational analysis are likely to be non-cone-specific.
I identified gold standard genes for retinal ganglion cells and retinal pigment epithelial cells. I verified them across several annotated datasets in Hegemon, and used them as a starting point to find bait genes.
I found bait genes for RGCs that led to a smaller list of genes with more gold standard genes present. For RPEs, I was unable to find genes with many low-low Boolean relations.
I attempted to extend the process to multiple bait genes rather than just two, which led to identification of RPE genes with both high-high and low-low boolean relations. I also identified genes associated with retinal progenitor cells, a cell type that gives rise to all specialized retinal cell types, and identified multiple bait genes.
I improved the analysis of RPCs, which was capturing many cell cycle related genes as opposed to retina-specific genes.
Now that I have lists of genes for each cell type, derived from Boolean implication analysis, the next step is to determine whether there is a statistically significant difference in the quality of genes from this method as compared to correlational methods.
One way to do this is to quantify how retina-specific the genes are. I will attempt to do this by performing a one-tailed t-test between the expression values in the whole eye vs. the expression values in the retina alone in the Mustafi 2016 bulk RNA-seq dataset, and use the p-value as a measure of the evidence that the gene is overexpressed in the retina.
I attempted to quantify the retina specificity of the genes obtained, but the datasets were not comprehensive enough to yield meaningful results.
I began the next method of quantification, which was to check the reproducibility of the results in a similar, larger dataset.
Hence, I downloaded and processed the Voigt 2020 dataset (GSE130636 and GSE142449), which is a large human retina scRNA-seq dataset.
I showed that there was a statistically significant improvement in the proportion of genes that could be reproduced by repeating the analysis in both the Phillips and Voigt datasets using the same bait genes for cones: CRX, GNAT2 and GNB3.
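A comparison of two reproduced proportions can be sketched as a one-sided two-proportion z-test. The entry above does not say which test was used, so the choice of test is an assumption, as are the function name and the counts, which are invented purely to show the call.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_two_proportion_z(hits_a, n_a, hits_b, n_b):
    """Test whether proportion A exceeds proportion B; returns (z, p_value)."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    p_value = 1 - NormalDist().cdf(z)  # upper tail: A greater than B
    return z, p_value


# Invented counts: 45 of 50 genes reproduced by method A, 120 of 200 by method B.
z_score, p_value = one_sided_two_proportion_z(45, 50, 120, 200)
print(z_score, p_value)
```

The normal approximation is only reasonable when both lists are fairly large; for short gene lists an exact test such as Fisher's would be the more cautious choice.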
Dr Sahoo and I annotated several human and mouse datasets and uploaded them into Hegemon under the key rt:. Some of these include purified rods and cones, which can allow for a robust quantification of genes through differential expression.
Using the Hartl 2017 and Sarin 2018 purified cell type mouse datasets, I was able to show a statistically significant improvement in cell type-specificity and cell class-specificity of the photoreceptor genes, more so for the rod genes than for the cone genes.
With a comprehensive quantification of the results complete, I presented the results during the Boolean Lab meeting on 3rd August, and will work on writing a manuscript in the coming days.
I identified novel cone- and rod-specific genes from the gene lists from Boolean analysis, and validated them in the Hegemon datasets.
I created the main figures and wrote the first draft of the manuscript for the paper.
I created the supplementary figures. I took feedback from the lab about the figures and implemented it. I rectified some errors in the p-values.
I wrote a covering letter for the manuscript. Dr Sahoo and I adjusted it to the format of a presubmission inquiry and sent it to Stem Cells (where the Phillips paper had been published). They responded positively.
To improve the background section, I performed a literature review of analysis of retinal cell types. I cited all the retina datasets in Hegemon and classified all the papers by the method used in vivo (knockout, reporter line, single cell etc.) and the bioinformatics methods developed. I added the mini-review to the background section.
I sent the latest draft of the manuscript to the whole lab for review before submission.
I wrote a significance statement for the manuscript.
Dr Sahoo edited the manuscript. I received feedback from the whole lab on the figures and implemented it. Dr Sahoo created a graphical abstract.
We submitted the manuscript to STEM CELLS, and put the preprint in bioRxiv.
I checked the time course expression of WWC1 in the Daum dataset and PPEF2 in the Kim dataset in the online tool. Both follow the expected trend, especially PPEF2, so I will make the plots with matplotlib soon.