Version 1.0
SMARTIV Overview
SMARTIV is a web-accessible computational tool for discovering combined sequence and structure binding motifs for RNA Binding Proteins (RBPs) from in vivo binding data. The algorithm relies on the sequences of the target sites, the ranking of their binding scores and their predicted secondary structure.
The motifs are presented as graphical logos, in which the combined sequence and structural information is represented in an eight-letter alphabet (A, C, G, U, a, c, g, u; upper case for unpaired and lower case for paired nucleotides). This makes sequence and structure preferences readable from a single logo.
SMARTIV methodology
Pre-processing of the data: SMARTIV takes as input a list of processed sequences, either from a given CLIP-based experiment or downloaded from a CLIP database. The list should be sorted by binding score in descending order (the strongest binding scores are at the top of the list). For computational efficiency, SMARTIV selects a total of 10,000 sequences from the ranked list: 1,000 sequences from the top of the list and 9,000 sequences from the bottom of the list.
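The selection step can be sketched in a few lines of Python. This is a minimal illustration and not SMARTIV's own code: the function name select_ranked_subset is hypothetical, and returning a short list unchanged (when it has no more sequences than would be selected) is an assumption, since the text does not say how such inputs are handled.

```python
def select_ranked_subset(ranked, n_top=1000, n_bottom=9000):
    """Keep the first n_top and the last n_bottom entries of a ranked list.

    `ranked` is assumed to be sorted by binding score, strongest first.
    The relative order of the kept entries is preserved.
    """
    # Assumption: inputs too short to split are returned as they are.
    if len(ranked) <= n_top + n_bottom:
        return list(ranked)
    top = list(ranked[:n_top])
    bottom = list(ranked[len(ranked) - n_bottom:])
    return top + bottom
```

For a ranked list of 20,000 sequences this keeps entries 1 to 1,000 and 11,001 to 20,000, in their original order.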
RNA secondary structure prediction: SMARTIV uses the RNAsubopt algorithm, from the ViennaRNA package, to calculate suboptimal RNA structures, and from these defines each nucleotide in the sequence as either paired or unpaired.
Translating the sequences to a combined sequence and structure alphabet: SMARTIV considers the original length of the CLIP sequences and translates the sequences to an eight-letter alphabet (A, G, C, U, a, g, c, u), where each position in the sequence holds the information for both the nucleotide identity and its predicted secondary structure (paired/unpaired). The capital letters stand for unpaired nucleotides and the lower case letters stand for the paired nucleotides.
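As an illustration of the translation step, the sketch below combines a sequence with a per-nucleotide paired/unpaired annotation. It assumes that the annotation is supplied as a dot-bracket string in which "." marks an unpaired position, and that DNA-style "T" characters should be read as "U". Both are assumptions of the example, and the function name is hypothetical.

```python
def to_structure_alphabet(seq, dot_bracket):
    """Encode a sequence in the eight-letter alphabet.

    Upper case marks unpaired nucleotides, lower case marks paired ones.
    """
    if len(seq) != len(dot_bracket):
        raise ValueError("sequence and structure must have the same length")
    # Assumption: normalise case and read T as U before encoding.
    normalised = seq.upper().replace("T", "U")
    encoded = []
    for nucleotide, state in zip(normalised, dot_bracket):
        # "." is unpaired; any other symbol (e.g. brackets) counts as paired.
        encoded.append(nucleotide if state == "." else nucleotide.lower())
    return "".join(encoded)
```

For example, the hairpin GGGAAACCC with structure (((...))) is encoded as gggAAAccc.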
The following stages are performed separately for the combined sequence and structure list (eight-letter alphabet) and the original sequence list (four-letter alphabet):
Extracting enriched k-mers from the ranked CLIP data: The SMARTIV algorithm is based on the assumption that binding motifs are derived from overrepresented sub-sequences of length k (k-mers) that occur more frequently in the bound sequences (as defined by the experimental assay). To extract enriched k-mers, SMARTIV employs the DRIMUST de novo motif search algorithm, a rank-based approach for detecting motifs enriched at the top of a ranked list (Leibovich et al., 2013; Eden et al., 2007). The main advantage of DRIMUST over other algorithms for extracting overrepresented k-mers is that it searches for k-mers enriched at the top of the input sequence list, where the extent of the top of the list is determined dynamically by the mHG statistic, without requiring a predefined split into bound and unbound sequences. For each k-mer, DRIMUST assigns a statistical significance value using an mHG score, corrected for multiple testing, which is a tight upper bound on the p-value (p-value ≤ corrected mHG score). SMARTIV uses and presents the k-mers that pass a threshold of 10^-5 for the combined sequence and structure data or 10^-8 for the sequence-only data.
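To make the idea of a dynamically determined top of the list concrete, the sketch below computes an uncorrected minimum-hypergeometric (mHG) score for a single k-mer, following the general definition in Eden et al., 2007: the smallest hypergeometric tail probability over all possible cutoffs. It assumes a binary vector marking which ranked sequences contain the k-mer. It does not reproduce DRIMUST's implementation or its multiple-testing correction, and the function names are hypothetical.

```python
from math import comb


def hypergeom_tail(N, B, n, b):
    """P(X >= b) for X ~ Hypergeometric(N items, B successes, n draws)."""
    upper = min(n, B)
    favourable = sum(comb(B, i) * comb(N - B, n - i) for i in range(b, upper + 1))
    return favourable / comb(N, n)


def mhg_score(occurrence):
    """Return (score, cutoff) for a ranked 0/1 occurrence vector.

    occurrence[i] is 1 if the k-mer occurs in the sequence at rank i.
    The score is the minimum tail probability over all cutoffs n, and
    cutoff is the n at which that minimum is reached (0 if no cutoff
    gives a tail probability below 1).
    """
    N = len(occurrence)
    B = sum(occurrence)
    best_score, best_cutoff = 1.0, 0
    hits_so_far = 0
    for n, hit in enumerate(occurrence, start=1):
        hits_so_far += hit
        p = hypergeom_tail(N, B, n, hits_so_far)
        if p < best_score:
            best_score, best_cutoff = p, n
    return best_score, best_cutoff
```

For a list of ten sequences in which only the top three contain the k-mer, the minimum is reached at a cutoff of three sequences. This direct version recomputes the tail at every cutoff and is meant for small examples only.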
Clustering and aligning the k-mers: The clustering of the k-mers is performed for each length k separately using VSEARCH, a greedy centroid-based algorithm with an adjustable k-mer similarity function. Prior to the clustering, SMARTIV sorts the enriched k-mers based on their p-values, obtained by the DRIMUST algorithm. Briefly, the clustering process starts by selecting the k-mer with the lowest p-value, which is then used as the cluster centroid. Subsequently, k-mers are added to the cluster if their similarity to the centroid is above a certain threshold. The process continues until all k-mers are assigned to some cluster. Further, VSEARCH aligns the k-mers in each cluster, prohibiting internal gaps.
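The greedy centroid scheme described above can be sketched as follows. This is a simplified stand-in and not VSEARCH itself: similarity is taken here to be the fraction of identical positions between two k-mers of equal length, with no shifts or gaps, and the threshold value of 0.75 is arbitrary. The function names are hypothetical.

```python
def identity(a, b):
    """Fraction of identical positions between two equal-length k-mers."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_centroid_clusters(kmers_with_p, threshold=0.75):
    """Cluster (kmer, p_value) pairs around greedily chosen centroids.

    K-mers are visited from the lowest to the highest p-value. Each k-mer
    joins the first cluster whose centroid is similar enough; otherwise it
    starts a new cluster and becomes its centroid.
    """
    ordered = sorted(kmers_with_p, key=lambda pair: pair[1])
    clusters = []  # list of (centroid, members)
    for kmer, _p_value in ordered:
        for centroid, members in clusters:
            if identity(kmer, centroid) >= threshold:
                members.append(kmer)
                break
        else:
            # No existing centroid was similar enough.
            clusters.append((kmer, [kmer]))
    return clusters
```

Because the k-mers are visited in order of significance, every centroid is the most significant member of its cluster.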
Building Position Weight Matrices (PWMs): To generate a PWM from a given cluster, SMARTIV weights each k-mer in the aligned cluster by the number of times the k-mer was found at the top of the list, as defined by the DRIMUST algorithm parameter b. For the graphical representation, SMARTIV uses a modified version of the WebLogo algorithm, adjusted to present the PWMs for both the eight-letter and the four-letter alphabets.
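A count-weighted PWM of this kind can be sketched as below. The example assumes that the aligned k-mers have equal length, that any character outside the alphabet (for instance a terminal gap symbol) is simply skipped, and that an optional pseudocount may be added. None of these details is specified in the text, and the function name is hypothetical.

```python
ALPHABET8 = "ACGUacgu"


def build_pwm(aligned_kmers, counts, alphabet=ALPHABET8, pseudocount=0.0):
    """Build a column-normalised PWM from aligned k-mers and their weights.

    counts[i] is the number of times aligned_kmers[i] was found at the
    top of the ranked list. Returns one {letter: frequency} dict per column.
    """
    length = len(aligned_kmers[0])
    pwm = [{letter: pseudocount for letter in alphabet} for _ in range(length)]
    for kmer, weight in zip(aligned_kmers, counts):
        for position, letter in enumerate(kmer):
            if letter in pwm[position]:  # skip gaps or unknown symbols
                pwm[position][letter] += weight
    for column in pwm:
        total = sum(column.values())
        if total > 0:
            for letter in column:
                column[letter] /= total
    return pwm
```

With two aligned k-mers UGCa and UGCg found 3 times and 1 time respectively, the last column holds a frequency of 0.75 for "a" and 0.25 for "g".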
Assigning occurrence scores and p-values to the PWMs: To select the best motifs for a given RBP, each cluster is assigned an occurrence score, defined as the total number of occurrences of the k-mers within the cluster at the top of the list of sorted CLIP sequences. While the occurrence score can be used for ranking the clusters derived from a set of k-mers of a given length k, it is not applicable for comparing clusters of k-mers of different lengths, and it cannot be used for comparing PWMs generated from different CLIP datasets.
For that purpose, SMARTIV assigns a p-value to each PWM, based on the correspondence between ranking the sequences according to their match to the PWM and ranking them by the original binding scores derived from the CLIP data. The p-value is calculated using the mmHG statistic, which evaluates the association between two ranked lists (Steinfeld et al., 2013).
Ranking the sequences by their match to the PWM is done by scanning each sequence against the PWM and calculating a log-odds score for each sub-sequence, where the background probabilities are defined as 0.125 and 0.25 for the eight-letter and four-letter alphabets, respectively (a uniform background over the alphabet). The final score used for ranking a sequence is the maximal score over all of its sub-sequences of the motif length.
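The scanning step can be sketched as follows, with a uniform background of one over the alphabet size. The base of the logarithm and the handling of zero probabilities are not specified in the text, so base 2 and a small probability floor are assumptions of this example, and the function name is hypothetical.

```python
from math import log2


def best_log_odds(seq, pwm, background, floor=1e-6):
    """Return the maximal log-odds score over all windows of motif length.

    pwm is a list of {letter: probability} dicts, one per motif position.
    background is the uniform per-letter probability (1 / alphabet size).
    Returns None if the sequence is shorter than the motif.
    """
    k = len(pwm)
    if len(seq) < k:
        return None
    best = float("-inf")
    for start in range(len(seq) - k + 1):
        score = 0.0
        for position in range(k):
            letter = seq[start + position]
            # Assumption: floor zero probabilities to keep the log finite.
            probability = max(pwm[position].get(letter, 0.0), floor)
            score += log2(probability / background)
        best = max(best, score)
    return best
```

Sorting the sequences by this value, highest first, gives the PWM-based ranking that is then compared with the ranking by binding score.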
Selecting the best PWMs to present: SMARTIV enables the user to search for motifs within a range of lengths. For each requested length, the PWM with the highest occurrence score is presented together with its assigned p-value.