immuneREF feature layers

The feature layers of immuneREF (Overview)

This section provides an overview over the six immuneREF feature layers:

Diversity

The diversity profiles were calculated as previously described [1]. Briefly, we calculated the Hill-diversity for alphas 0–10 in steps of 0.1 with alpha = 1 being defined as the Shannon entropy. Each entry in the profile varies between ≈0 and 1, where higher values indicate an increasingly uniform frequency distribution.

{}^{\alpha}D = (\sum_{i=1}^{n} f^{\alpha})^\frac{1}{1-\alpha}

Germline gene usage

The frequency of occurrence of IMGT germline genes [2] in each repertoire was calculated for each repertoire depending on species and immune receptor class (Ig, TCR). The frequency of germline genes (defined by the ImmunoGenetics Database, IMGT) [2] across clones in each repertoire was calculated for each repertoire depending on species and immune receptor class (Ig, TR). The germline gene usage allows insight into deviations from a baseline recombinational likelihood and thereby captures the potential impact of disease, vaccine or other events on the immune state. To determine germline gene usage similarities, germline gene frequencies across clones are extracted for each individual.

Positional amino acid frequency

The positional amino acid frequencies were calculated separately for each CDR3 sequence length present in the repertoire. To decrease the chance of bias by small sets of short and long CDR3 sequences, we limit this analysis to a range of most common lengths (i.e., between 8–20 amino acids) [3]. This range can be adapted via the calc_characteristics() parameter aa_range which takes in a vector containing the range of lengths (e.g. aa_range = c(10:22) to consider AA frequenceis of lengths 10-22)

Sequential occurrence of gapped k-mers

For a given (nucleotide) k-mer size k and maximal gap length m , optimized to maximize information via parameter search, the gapped-pair-k-mer occurrences were counted for all gap sizes <= m. The counts were normalized by the total number of found patterns across all gap sizes such that short range patterns were weighted higher than large gap sizes. Users are free to introduce a new k-mer dictionary to analyze different k-mer lengths and gap sizes using the function make_kmer_dictionary().

# Constructs a gapped kmer dictionary used to count k-mers
# The arguments determine the k_nt and gap_size_nt determine the pattern length and gap size
dictionary_counts <- make_kmer_dictionary(
            k_nt = 3,
            gap_size_nt = c(0,1,2,3))

Network architecture

immuneREF constructs a similarity network for each repertoire [4] where nodes represent CDR3 sequences connected by Levenshtein Distance LD=1 similarity-edges using the igraph package [5]. The resulting networks were analyzed as previously described with respect to four measures: (i) cumulative degree distribution, (ii) mean hub score (Kleinberg hub centrality score), (iii) fraction of unconnected clusters and nodes and (iv) percent of sequences in the largest connected component.

Convergence (Repertoire overlap, Immunosignatures)

To compare repertoire similarity with respect to the presence of immunosignatures, immuneREF provides two options.

The first (and standard) option is to measure the convergence of repertoires by calculating a simple overlap measure according to:

\mathrm{overlap}=  \frac{\mathrm{A} \cap \mathrm{B}}{\min(|\mathrm{A}|, |\mathrm{B}|)}

For this, we provide the repertoire_overlap() function which includes the basis option that determines based on which sequence the overlap is calculated. The options are: “CDR3_aa”,”CDR3_nt”,”VDJ_aa”,”VDJ_nt”,”V_CDR3_J_aa” and “V_CDR3_J_nt”

# Calculate the overlap across all repertoire pairs. Default basis is CDR3_aa
overlap_layer <- repertoire_overlap(list_simulated_repertoires, basis = "CDR3_aa")

The second option integrated in immuneREF is to evaluate the repertoires using pre-trained machine learning models. In this approach, the models are used to predict presence of sequences with given characteristics (public clones, antigen-specificity). For each repertoire, the percentage of positively predicted sequences is then taken as the repertoire immunosignature feature. The base immuneREF package provides an SVM model trained to classify public and private CDR3 sequences (murine Ig) [6] using the KeBABS R-package [7].

Once a larger number of robust pretrained ML models are available future versions of immuneREF will enable their rapid application for immune repertoire diagnostics.

The calc_characteristics already calculates a basic immunosignature layer based on the contained SVM model (predicting similarity with respect to presence of of murine Ig public clones). The associated similarity layer can be calculated using:

immunosignature_layer <- make_cormat(repertoires_analyzed, weights_overall = c(0,0,0,1,0,0))

Additionally, the calc_characterstics() function has an additional parameter models which allows the user to read-in a list of ML models compatible with the kebabs function``predict()`` for the calculation of the immunosignature feature.

Gene Expression

immuneREF allows integration of immune repertoires and gene expression, which is of high interest to experiments that include both receptor and global transcript sequences (i.e. RNA-seq or repertoire experiments paired with transcriptomics). The analysis of omics data is handled by calculate_omics_layer(), which takes a preprocessed and normalized gene expression matrix and sample info data as input.

Firstly, there is a low variation filter to keep the most informative genes from the thousand genes obtained in transcriptomics. Standard deviation (SD) is calculated per gene across samples and all the genes over a certain threshold (default, SD>1) are preserved for subsequent analysis.

Second, users can choose between three correlation-based methods to construct the immuneREF gene expression layer: (1) Pairwise Pearson correlation between samples (method=”PC”); (2) Mutual Rank defined as

\sqrt{Rank_{AB}\times Rank_{BA}}

which is the geometric average between the rank of sample A from sample B and the rank of sample B from sample A after pairwise Pearson correlation (method="MR"); and (3) Principal Component Analysis (PCA) is applied to the gene expression matrix, the PCA scores matrix that explains enough variability (default, 80%) is retained and pairwise Pearson correlation between samples is applied (method="PCA").

Running the feature layer analysis

immuneREF analyzes the similarity between repertoires across six major features. Before the similarity calculation, each repertoire is analyzed with respect to all of these features. This analysis can be performed all in one or in separate steps per single layer (especially useful for large datasets). Below is an example of the analysis across all features:

# Feature calculation

# Extract repertoire sizes and names
repertoire_lengths <- sapply(1:length(tutorial_repertoires),function(x) nrow(tutorial_repertoires[[x]]))
repertoire_names <- sapply(1:length(tutorial_repertoires),function(x) as.character(unique(tutorial_repertoires[[x]]$name_repertoire)))


# Calculate all features for each repertoire
repertoires_analyzed <- list()
for(i in 1:length(tutorial_repertoires)){
  repertoires_analyzed[[repertoire_names[i]]] <- calc_characteristics(
        repertoire_df = tutorial_repertoires[[i]],
        species = strsplit(repertoire_names[i],"_")[[1]][2],
        receptor = strsplit(repertoire_names[i],"_")[[1]][3],
        chain = strsplit(repertoire_names[i],"_")[[1]][4],
        identifier_rep = repertoire_names[i])
}

save(repertoires_analyzed,file = "repertoires_analyzed")

Running immuneREF on species not yet included

immuneREF is currently limited to the analysis of human (“hs”) and mouse (“mm”) repertoires. For the analysis of additional species the extension of list_germline_genes needs to be extend to include germline genes of these species. Once the list_germline_genes has been extended by information on the species, the species can be specified in the calc_characteristics() function via the species parameter.

#Example process to include new VDJ information for the analysis of IgH repertoires of speciesX

#Dataframe containing V-gene information on new species
df_species_VDJ<-list()
df_species_VDJ[["V"]] <- data.frame(gene=c("IGHV1","IGHV2"),allele="01",sequence=c("gattaca","acattag"),species="speciesX",frequency_uniform=c(0.5,0.5), frequency=c(0.6,0.4))
df_species_VDJ[["D"]] <- data.frame(gene=c("IGHD1","IGHD2"),allele="01",sequence=c("gat","tac"),species="speciesX",frequency_uniform=c(0.5,0.5), frequency=c(0.6,0.4))
df_species_VDJ[["J"]] <- data.frame(gene=c("IGHJ1","IGHJ2"),allele="01",sequence=c("aca","tta"),species="speciesX",frequency_uniform=c(0.5,0.5), frequency=c(0.6,0.4))

# Extending list_germline_genes
list_germline_genes[["speciesX"]][["ig"]][["h"]] <- df_species_VDJ

References

[1]A bioinformatic framework for immune repertoire diversity profiling enables detection of immunological status, Greiff et al., Genome Medicine, 2015, https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-015-0169-8
[2](1, 2) IMGT/JunctionAnalysis: IMGT Standardized Analysis of the V-J and V-D-J Junctions of the Rearranged Immunoglobulins (IG) and T Cell Receptors (TR), Giudicelli et al., Cold Spring Harbor Protocols, 2011, http://cshprotocols.cshlp.org/content/2011/6/pdb.prot5634
[3]Systems Analysis Reveals High Genetic and Antigen-Driven Predetermination of Antibody Repertoires throughout B Cell Development, Greiff et al., Cell Reports, 19(7), 2017, https://www.sciencedirect.com/science/article/pii/S221112471730565X
[4]Large-scale network analysis reveals the sequence space architecture of antibody repertoires, Miho et al., Nature Communications, 2019, https://www.nature.com/articles/s41467-019-09278-8
[5]The igraph software package for complex network research, Csardi G, Nepusz T, InterJournal, Complex Systems 1695, 2006, http://igraph.org
[6]Learning the High-Dimensional Immunogenomic Features That Predict Public and Private Antibody Repertoires, Greiff et al., Journal of Immunology, 99(8), 2017, http://www.jimmunol.org/content/199/8/2985
[7]KeBABS: an R package for kernel-based analysis of biological sequences. Palme et al., Bioinformatics, 31, 2015, https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btv176