Kladi Tutorial

This tutorial will cover the main steps of the analysis from my presentation:

  1. Data loading and QC
  2. scIPM modeling
  3. Pseudotime inference
  4. Module analysis
  5. RP modeling
  6. Driver TF Analysis (covISD)

Right now, I'm calling it Kladi, which means "branch" in Greek. I want to rename scIPM too but haven't come up with anything yet.

0. Data loading and QC

I've already run QC and joined cells from the SHARE-seq dataset but, as we discussed, the best data format is two anndata objects with expression and accessibility, and they must have identical cell axes by the time you create your joint representation.
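For example, intersecting barcodes and slicing both matrices into a shared order might look like the plain-numpy sketch below (dummy data and names; with actual AnnData objects you would instead slice each object by the shared barcode list):

```python
import numpy as np

# Hypothetical barcodes for each modality (illustrative, not real data)
rna_barcodes = ["cellA", "cellB", "cellC", "cellD"]
atac_barcodes = ["cellB", "cellD", "cellA"]

# Keep only cells present in both views, in a single canonical order
shared = [bc for bc in rna_barcodes if bc in set(atac_barcodes)]
rna_idx = [rna_barcodes.index(bc) for bc in shared]
atac_idx = [atac_barcodes.index(bc) for bc in shared]

rna_counts = np.arange(4 * 3).reshape(4, 3)    # 4 cells x 3 genes (dummy)
atac_counts = np.arange(3 * 5).reshape(3, 5)   # 3 cells x 5 peaks (dummy)

# Both matrices now share an identical cell axis
rna_aligned = rna_counts[rna_idx]
atac_aligned = atac_counts[atac_idx]
```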

1a. scIPM expression modeling

First, we start by learning a latent representation of our data using scIPM, and we need to tune its hyperparameters. There are three major parameters that may be changed, listed from least to most important:

  1. dropout: regularization parameter for encoder and decoder. Default of 0.2 appears to work well enough. If you experience "Node collapse", where multiple topics start to look the same or don't seem to describe a cohesive set of cells, increase this parameter.

  2. initial_counts: related to the Dirichlet prior. The default of 10 leads to the discovery of really sharp, sparse latent representations. The genes most highly activated for these types of topics will be those whose expression changes rapidly with the rise of new modules; these are usually the most interesting. Increasing to 20+ will lead to the discovery of more gradual trends.

  3. num_modules: number of modules to extract from the data. This should be chosen carefully: too few and your imputations will not make sense and you'll miss interesting trends; too many and the model eventually stops finding new useful topics, though it will still discover the major ones. Too many is probably better than too few.

Expr model only: which genes to use for (1) latent-variable features and (2) imputation.

When imputing data for later use in covISD and other downstream analysis, it's nice to know expression trends for genes that are not considered "highly variable", since basally-expressed TFs can still have interesting and cell-type-specific effects. The first parameter of the expression model, "genes", should simply be a list of all genes you want to impute, chosen by a tolerant mean-expression threshold.

The optional "highly_variable" parameter takes a boolean mask of the same length as your genelist. Genes marked False will not be used as features when learning the latent variables for each cell. Excluding basally-expressed genes from the encoder features may help the model learn modules that are more faithful to the variability in the system rather than to slow, basal trends.

Next, we can set a higher dispersion threshold for encoder features so we capture latent variables that track with dispersed genes, and avoid training a high-dimensional encoder.
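Putting the two thresholds together, here is a sketch of how the genes list and highly_variable mask could be constructed; the thresholds and simulated counts are illustrative, not Kladi defaults:

```python
import numpy as np

rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(500, 2000)).astype(float)  # cells x genes (simulated)

mean_expr = counts.mean(axis=0)
dispersion = counts.var(axis=0) / np.maximum(mean_expr, 1e-12)

# Tolerant mean-expression threshold: everything we want imputed downstream
impute_mask = mean_expr > 0.05
genes_to_impute = np.arange(counts.shape[1])[impute_mask]

# Stricter dispersion threshold: only dispersed genes feed the encoder
highly_variable = dispersion[impute_mask] > 1.1
```

The model would then receive `genes_to_impute` as its "genes" argument and `highly_variable` as the boolean mask of the same length.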

Next, we need to optimize the num_modules parameter. For this we use the ExpressionModel's param_search function. This function takes args to specify a model, plus an array of module numbers to try in the num_modules argument.

It returns the test loss for each modeling condition. I may integrate this with the sklearn model selection construct in the future.
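The selection itself reduces to picking the module count with the lowest test loss; a minimal sketch, assuming you've collected the per-condition losses into a dict (the numbers below are made up):

```python
# Hypothetical param_search-style output: num_modules -> test loss
test_losses = {16: 1520.3, 24: 1498.7, 32: 1501.2, 48: 1510.9}

# Choose the condition with the lowest held-out loss
best_num_modules = min(test_losses, key=test_losses.get)
```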

Use a GPU to train scIPM, otherwise it'll take forever. If you're successfully using a GPU, the "Using CUDA" log output will be True.

Make sure that the "counts" matrix columns and the "genes" list are lined up correctly: 4th column is expression of 4th gene, etc.
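If your matrix columns came from a different gene ordering, a quick reorder-and-check sketch (all names and values here are illustrative, not Kladi API):

```python
import numpy as np

matrix_gene_order = ["Lef1", "Wnt3", "Notch1"]   # order of columns in the matrix
genes = ["Wnt3", "Lef1", "Notch1"]               # order we will pass to the model
counts = np.array([[1., 2., 3.],
                   [4., 5., 6.]])                # cells x genes, in matrix order

# Reorder columns so column j of `counts` is the expression of genes[j]
order = [matrix_gene_order.index(g) for g in genes]
counts = counts[:, order]

assert counts.shape[1] == len(genes)
```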

Once we've found the best number of modules, we can train the "official" model of our expression data.

At this point, saving the model is probably a good idea.

To reload the model, instantiate a scIPM object using the same parameters:

expr_model = ExpressionModel(gex_data.var_names, 
                             highly_variable=gex_data.var.highly_variable.values, 
                             num_modules=24)
expr_model.load('expr_model.pth')

After training the model, we can get our topics and imputations.

And we're done with expression modeling for now. Moving on to accessibility modeling.

1b. Accessibility Modeling

Training the accessibility model takes much much longer than the expression model, so it's best to use the same parameters optimized in the expression model.

Usually, we don't need to filter out any peaks from ATAC-seq peakcount matrices, since peaks are defined by having met a certain threshold of cells/fragments to be recognized in the first place. Filtering rare peaks makes it more difficult to learn modules for small populations.

The API is the same as the expression model's, except that highly_variable is not available, and instead of genes, we pass peaks (used for TF enrichment). The format for peaks is [[chr, start, end], ... ]:

[['chr9', 123461850, 123462150],
 ['chr1', 56782095, 56782395],
 ...
 ['chr16', 18533123, 18533423]]

They don't have to be sorted.
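If your peaks are stored as interval strings rather than triples, a small conversion sketch (the "chr:start-end" input format here is an assumption about your data, not something Kladi requires):

```python
# Hypothetical interval-string peak IDs
peak_ids = ["chr9:123461850-123462150", "chr1:56782095-56782395"]

# Convert to the [[chr, start, end], ...] format
peaks = []
for pid in peak_ids:
    chrom, span = pid.split(":")
    start, end = span.split("-")
    peaks.append([chrom, int(start), int(end)])
```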

To load:

atac_model = AccessibilityModel(atac_data.var[['chr','start','end']].values.tolist(), num_modules=24)
atac_model.load('atac_model.pth')

We can get our atac latent variables, but we won't impute peaks since this will create a #cells x #peaks dense matrix which we really don't need in-memory for the analysis.

So that concludes the construction of our latent variable understanding of gene and peak modules in the data. Next step is to use these modules to make a joint low-dimensional representation of the data, then identify our interesting differentiation system.

2a. Joint representation + system identification

To make the joint cell representation, we paste together the ILR-transformed latent compositions from the expression and accessibility models. This will have a scanpy-style function in the future, but for now, manually use numpy and make sure your cell axes are aligned between the two views!
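Until that scanpy-style helper exists, the manual step looks roughly like the sketch below. The ILR implementation here uses standard pivot coordinates, which may differ from Kladi's exact basis, and the topic compositions are simulated with Dirichlet draws:

```python
import numpy as np

def ilr(compositions):
    """Isometric log-ratio transform (pivot coordinates) of rows that each
    sum to 1. Returns an (n_cells, n_topics - 1) array."""
    logx = np.log(compositions)
    D = logx.shape[1]
    coords = []
    for i in range(1, D):
        # sqrt(i/(i+1)) * ln(geometric mean of first i parts / part i+1)
        gmean_log = logx[:, :i].mean(axis=1)
        coords.append(np.sqrt(i / (i + 1)) * (gmean_log - logx[:, i]))
    return np.stack(coords, axis=1)

rng = np.random.default_rng(0)
expr_topics = rng.dirichlet(np.ones(24), size=100)   # cells x expression topics
atac_topics = rng.dirichlet(np.ones(24), size=100)   # cells x accessibility topics

# Cell axes must already be aligned between the two views!
joint = np.hstack([ilr(expr_topics), ilr(atac_topics)])
```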

Now we can make a UMAP view of the data and isolate a differentiation system to study.

I use 20 neighbors instead of the default of 30 to show more local structures.

Using the marker genes above, we can see that clusters 9, 4, and 11 contain the hair follicle differentiation system. Below, I use a previously-known set of cells to isolate our system-of-interest.

There we go! Hair follicle system isolated! Next, we do trajectory inference using Palantir.

2b. Pseudotime Trajectory Inference

To instantiate an inference object, you just need to pass the features that you want to define the diffusion map as the cell features, which in our case is the joint representation.

n_neighbors is the number of nearest neighbors used to compute various functions. 30 works fine.

Second, we want to make a UMAP representation of the data for plotting. The parameters of this function are just passed to UMAP, except continuity, which is 1/negative sampling rate.

This parameter isn't talked about much, but it directly controls how "continuous" the data looks, and I've changed the default in this function to build a more continuous representation.

Third, we need to choose start and terminal states in the system. To do that, I've added an interactive plotting function that allows you to pick and choose cells. Below, I plot the UMAP representation, colored by WNT3 expression since I know that WNT3 is highly expressed at the ends of the Cortex and Medulla lineages.

Setting projection = '3d' uses plotly to generate the UMAP instead of matplotlib. Hovering over cells shows their position and their cell# at the bottom of the box. That cell number is used in the next step to calculate pseudotime.

The PalantirTrajectoryInference.get_pseudotime function below takes the cell# of the start cell as the first argument, followed by some options that will work in almost all situations. The user must also specify terminal states, which are passed as keyword arguments. The name of the argument is the name of that terminal state, and the value is the cell#.

Next, we can extract lineages from this data. The get_lineages function takes two parameters, shift and stretch.

If cells that appear unlikely to reach the end of a lineage are nonetheless included in one, try adjusting shift to be earlier (0.7) and stretch to be more gradual (10). Usually, the defaults work fine.

From the lineages above, we can then solve for the tree structure of the data. The get_cell_tree_states function takes one parameter, earliness_shift, which allows you to change the locations of the branch sites. The default of 0 usually works best, but you can set it to values from -1 to 1, exclusive. Negative values will shift the branch locations to be later in pseudotime, while positive values will shift them to be earlier.

The plot_states function colors the cells based on which terminal lineages they may still differentiate into. Changes in color indicate the presence of a branch site at those cells. We can show how Kladi breaks down the lineage tree structure using the get_graphviz_tree function.

Following the steps above (start and terminal state selection --> pseudotime --> lineages --> branch sites), we can start to analyze pseudotemporal patterns. There are two main plotting functions: plot_swarm_tree and plot_feature_stream.

The former shows every cell in the sample arranged along the tree structure of the data, and is useful for plotting discrete or qualitative data for each cell, such as read depth/QC statistics, cluster identities, or raw read counts.

Meanwhile, plot_feature_stream is best for plotting continuous values over the course of the differentiation, like imputed gene expression, gene accessibility, etc. The features argument takes an (N x d) numpy array with a value for each cell (N) for each feature (d). Passing just one feature makes easy-to-read streams for small multiples. The streams below show three lineage-specific genes in this manner:
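Assembling that (N x d) array from a few imputed gene vectors is straightforward with numpy; the gene names and values below are placeholders:

```python
import numpy as np

n_cells = 200
rng = np.random.default_rng(0)

# Imputed expression vectors for three hypothetical lineage-specific genes
wnt3, lef1, krt71 = (rng.random(n_cells) for _ in range(3))

# Stack into the (N x d) features array expected by the plotting function
features = np.column_stack([wnt3, lef1, krt71])
```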

But you can also overlay features for comparative analysis to find antagonistic and lineage-specific relationships:

Important parameters for making attractive streamplots are:

Notably, plot_feature_stream can be used to plot module compositions of our cells during the differentiation process:

We can see that the branch between Cortex and Medulla cells is driven by the emergence of topic 14 in the Cortex and topic 5 in the Medulla. Later, we see that topic 16 is more active in Cortex cells but is also present in the Medulla. The IRS lineage is dominated by topic 21 expression before topic 14 emerges. Let's see what enrichment analysis can say about the functional identity of each topic.

Using the expr_model object, we may query Enrichr using top genes from each module.

Great, we can see that topic 14, which is shared between late-stage IRS cells and drives the branch between Cortex and Medulla cells, is enriched for WNT and delta-Notch signaling! If we want to see the expression of genes from a certain enrichment term, we can fetch gene names from the results of the get_enrichments method. We see below that the inferred expression patterns for delta-Notch genes generally show increased expression in both Cortex and Medulla lineages.

So if Cortex and IRS expression is governed by Notch signaling, what controls Medulla expression? Using enrichment of Medulla-specific topic 5, we see that beta-catenin and TGF-beta signalling control Medulla expression identity.

To further analyze our topics, we can see which topics activate a given gene's expression using ExpressionModel.rank_modules. We can get the top genes from a topic for analysis with other tools using ExpressionModel.get_top_genes.

Just as we can find gene-set enrichments for RNA topics, we can find TF-motif enrichments in accessibility topics. First, let's make a streamplot for accessibility topics to see which topics define cellular identities.

Now we'd like to know which TFs are influential with the rise of certain topics. Using the AccessibilityModel.get_motif_hits_in_peaks function, Kladi will scan the sequence of each peak for motif hits using MOODS. Kladi will automatically download up-to-date JASPAR position frequency matrices for all available factors.

Motif scanning can take a while, so after this it's good to save the accessibility model again.

An accessibility module/topic describes an activation for each peak under a certain condition. By finding which motifs are preferentially found in the most-activated peaks of a given module, we can see which factors are most influential in those conditions. The enrich_TFs function takes a module number and which quantile of peaks to consider as "activated". The default quantile of 0.2 finds motifs that are enriched in the top 20% of peaks relative to all others. Results are a table of [(motif_id, factor, pvalue, fisher-exact test statistic), ... ], sorted by pvalue.
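To make that logic concrete, here is a self-contained sketch of the test on simulated data. The hypergeometric-tail implementation mirrors a one-sided Fisher exact test but is not Kladi's actual code, and all the peak/motif data below is made up:

```python
import math
import numpy as np

def fisher_greater(a, b, c, d):
    """One-sided Fisher exact p-value (over-representation) for the
    2x2 table [[a, b], [c, d]], computed as a hypergeometric tail."""
    N, K, n = a + b + c + d, a + c, a + b
    denom = math.comb(N, n)
    return sum(math.comb(K, k) * math.comb(N - K, n - k)
               for k in range(a, min(K, n) + 1)) / denom

rng = np.random.default_rng(0)
activation = rng.random(1000)          # per-peak module activations (simulated)
motif_hit = rng.random(1000) < 0.1     # peaks containing the motif (simulated)

# Top 20% most-activated peaks, matching the default quantile of 0.2
top = activation >= np.quantile(activation, 0.8)

# Build the 2x2 contingency table: top/background x hit/no-hit
a = int((top & motif_hit).sum())
b = int((top & ~motif_hit).sum())
c = int((~top & motif_hit).sum())
d = int((~top & ~motif_hit).sum())
pvalue = fisher_greater(a, b, c, d)
```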

Topic 21 is enriched for SMAD2/3, FOS-JUN, and HOX13 activity! Topic 10 is enriched for DLL, HOXA, MEOX, and VSX activity, among others. These latter factors are related to NOTCH signaling.

We can also give a motif score for each cell based on the probability of sampling a motif's binding site given that cell's latent composition. The get_motif_score function takes the latent compositions of cells and returns a normalized score for each motif. This value can be plotted on stream trees to show when accessibility of a TF's binding sites increased during differentiation.

These accessibility topics contain more information than just TF modules, since they also encode patterns of accessibility around genes. We can compare accessibility around a gene to its expression using RP models.

3. RP Modeling

RP modeling is used to connect our latent understanding of accessibility to expression through proximal peak activity around each gene. It takes ~2 seconds to train each RP model, so training 1000s of genes can take quite a while. Instead of training a model for each gene we imputed, we'll train RP models for genes whose expression shows interesting variance with respect to our expression topics. Eventually, we will seek to understand the transcription factors that regulate expression of our gene modules using these RP functions, so it is important that the most-activated genes for important modules are modeled.

Below, I simply take the top 250 genes from each topic that is highly activated in the hair follicle system.

The RP modeler takes the species, then the accessibility topic model and expression topic model, respectively. These two models work together to learn the best RP function connecting accessibility to gene expression.

3a. Training

To train RP models, one must provide the raw expression matrix used to train the expression topic model, as well as either the raw accessibility matrix used to train the accessibility topic model, or the accessibility latent compositions for each cell (the latter saves a little time on the setup computations).

Once instantiated, the RP modeler object has two methods: train and get_naive_models.

Training RP models can take a while. If you don't have paired expression data or don't wish to train RP models, you can use get_naive_models to get base models for each gene, where upstream, downstream, and promoter peaks are weighted equally, and the influence of a peak on a gene decays by 1/2 every 15 kilobases.
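The naive weighting scheme can be written out directly; the distances and accessibility values below are made up for illustration:

```python
import numpy as np

# Influence of a peak on a gene halves every 15 kilobases from the TSS
decay_halflife = 15_000  # base pairs

peak_distances = np.array([0, 7_500, 15_000, 30_000, 60_000])  # bp from TSS
weights = 0.5 ** (peak_distances / decay_halflife)

# A naive RP score: decay-weighted sum of proximal peak accessibility
peak_accessibility = np.array([1.0, 0.5, 2.0, 1.0, 4.0])
rp_score = float(weights @ peak_accessibility)
```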

rp_models = rp_modeler.train(rp_genes, gex_data.layers['counts'], accessibility_matrix=atac_data.X, iters = 200)

3b. Predicted expression

Using the RP models, we can predict the expression of genes. Below, the top UMAPs show expression predictions using the RP models, while the bottom UMAPs show true expression.

We can use these predictions to see how proximal accessibility relates to expression over pseudotime:

From the plots above, we can see WNT3 and LEF1 accessibility increases before expression, with the greatest difference being at the Cortex-Medulla branch, while for EDNRA they track more closely.

3c. Relating gene accessibility to topics

Another thing that we can do with these RP models is to find which genes' proximal accessibility is controlled by each accessibility topic using the AccessibilityModel.get_most_influenced_genes function. This function works similarly to the enrich_TFs function, where genes are ranked according to how many influential nearby peaks are activated by the topic.

For accessibility topic 21, which is most influential in Cortex cells, we can see the genes whose proximal peaks are most controlled by the topic include NOTCH1, DSG4, and FOXO3, among others. This is very useful for determining which genes are seeing focused regulation in terms of accessibility changes at each stage in the differentiation.

Finally, the third major function of RP models is to measure potential TF influence on genes using covariance ISD.

4. Covariance In-Silico Deletion (CovISD)

CovISD measures which transcription factors drive sets of genes based on how expression appears to change with respect to each TF's occupancy in the proximal chromatin. To instantiate a CovISD object, you must pass an AccessibilityModel, an ExpressionModel, and a PalantirTrajectoryInference model.

The predict function takes a list of gene RP models as gene_models, either expression latent compositions or a raw expression matrix (and the covISD object will calculate the latent compositions itself), and either accessibility latent compositions or raw accessibility / peak-count matrix.

Using the covISD object, we can investigate driver TF-gene relationships. For instance, if we wanted to know the predicted drivers of WNT3 expression, we can use CovISD.rank_factor_influence. We see below that the top factors are RORC, TCF7, LEF1, and FOS/JUN, which are all factors that mediate the $\beta$-catenin signaling pathway and appear to participate in positive feedback by expressing the WNT3 ligand.

Gene-by-gene analysis is fine, but it's more interesting to know which factors or signals coordinate to cause major functional expression and identity changes during differentiation. Since the assumption behind gene modules is that covarying expression of genes implies shared regulation, we can use genesets derived from our expression modules to identify the driver TFs of major expression events.

Using the ExpressionModel.get_top_genes function to extract a genelist, we pop that list into the CovISD.get_driver_TFs function to rank each factor based on how specifically it interacts with the input genes versus the background (all other genes). This answers the question: "which TFs are mediating the expression of topic 5 genes?"

The resulting list is in the format [(factor name, pvalue, test-statistic), ...], sorted by pvalue. We can see that FOS/JUN, LEF1, TCF7, ETV4, and BACH2 drive the expression of these genes. Since topic 5 is highly activated in Medulla cells, we know these factors are influential to at least one component of Medulla expression.
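Since the results arrive pre-sorted by p-value, downstream filtering is simple; the tuples below are placeholders, not real output:

```python
# Hypothetical get_driver_TFs-style output: (factor, pvalue, test-statistic)
results = [("LEF1", 1e-8, 42.0), ("FOS", 3e-6, 31.5), ("BACH2", 0.04, 5.2)]

# Keep factors passing a significance threshold
significant = [name for name, pval, stat in results if pval < 0.01]
```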

Often, it is useful to compare the driver TFs between modules to find specific regulators for each, as well as shared influential factors. This is facilitated by the CovISD.plot_compare_genelists and CovISD.plot_compare_gene_modules functions.

plot_compare_genelists takes two genesets as arguments, then displays a contrastive plot like the one below. The user may also pass a hue for each transcription factor (factor names can be found with the .factor_names attribute). Modules themselves are already interesting genesets, so a shortcut for comparative analysis of TF drivers between two modules is plot_compare_gene_modules. The user passes the numbers of two modules, and the factors will be colored by their relative expression given those modules. Often, the most influential factors for a module are also more highly expressed.

For attractive plots, one may tune the pval_threshold parameter, which controls the threshold at which TFs are labeled, and the label_closeness parameter. Higher values move the label closer to the datapoint, while lower values enforce larger distances between labels. This may be better for readability of densely-labeled plots.

From the plot above, we can see that EGR3, ID1, ETV, and BACH2 are specific for module 5 genes (among others), while HOXC13, LHX5, and PLAG1 are specific for module 16 genes. FOS, LEF1, and RUNX1 are influential to the expression of both genesets.