Motifs Module
This module provides algorithms and utilities for identifying conserved DNA motifs across multiple sequences. It includes both deterministic and stochastic methods for motif detection, scoring, and probabilistic modeling.
Basic Functions
- GenomeVisualizer.motifs.Count(Motifs: list[str]) -> dict[str, list[int]]
Counts the occurrences of each nucleotide at every position in a list of motifs.
This function constructs a count matrix from a list of equal-length DNA strings (motifs), where each entry count[symbol][j] represents the number of times nucleotide symbol (A, C, G, or T) appears in position j across all motifs.
- Parameters:
Motifs (list[str]) – A list of DNA strings (motifs) of equal length.
- Returns:
A dictionary with keys ‘A’, ‘C’, ‘G’, ‘T’, and values as lists of counts for each position.
- Return type:
dict[str, list[int]]
Example
>>> Count(["ATG", "ACG", "AAG", "AGG", "ATG"])
{'A': [5, 1, 0], 'C': [0, 1, 0], 'G': [0, 1, 5], 'T': [0, 2, 0]}
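The counting rule described above can be sketched in a few lines. This is a minimal illustration and not the module's own source; the name count_matrix is hypothetical, and the sketch assumes every motif has the same length and contains only A, C, G, and T.

```python
def count_matrix(motifs: list[str]) -> dict[str, list[int]]:
    """Count each nucleotide at every position of equal-length motifs."""
    k = len(motifs[0])
    # One row of k zeros per nucleotide.
    counts = {symbol: [0] * k for symbol in "ACGT"}
    for motif in motifs:
        for j, symbol in enumerate(motif):
            counts[symbol][j] += 1
    return counts


motifs = ["ATG", "ACG", "AAG", "AGG", "ATG"]
counts = count_matrix(motifs)
# Sanity check: every column must sum to the number of motifs.
column_sums = [sum(counts[s][j] for s in "ACGT") for j in range(len(motifs[0]))]
```

Checking the column sums is a cheap way to catch a miscounted matrix, since each motif contributes exactly one symbol per position.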
- GenomeVisualizer.motifs.Profile(Motifs: list[str]) -> dict[str, list[float]]
Computes the profile matrix of a list of motifs.
The profile matrix is a normalized version of the count matrix, where each entry profile[symbol][j] is the frequency of nucleotide symbol at position j across all motifs. Each value is a relative frequency between 0 and 1, so the four values in any column sum to 1.
- Parameters:
Motifs (list[str]) – A list of DNA strings (motifs) of equal length.
- Returns:
A dictionary with keys ‘A’, ‘C’, ‘G’, ‘T’, and values as lists of nucleotide frequencies at each position.
- Return type:
dict[str, list[float]]
Example
>>> Profile(["ATG", "ACG", "AAG", "AGG", "ATG"])
{'A': [1.0, 0.2, 0.0], 'C': [0.0, 0.2, 0.0], 'G': [0.0, 0.2, 1.0], 'T': [0.0, 0.4, 0.0]}
- GenomeVisualizer.motifs.Consensus(Motifs: list[str]) -> str
Determines the consensus string from a list of motifs.
The consensus string is formed by selecting the most frequent nucleotide at each position across all motifs. It summarizes the most likely motif pattern present in the input sequences.
- Parameters:
Motifs (list[str]) – A list of DNA strings (motifs) of equal length.
- Returns:
The consensus DNA string formed from the most common nucleotides at each position.
- Return type:
str
Example
>>> Consensus(["ATG", "ACG", "AAG", "AGG", "ATG"])
'ATG'
- GenomeVisualizer.motifs.Score(Motifs: list[str]) -> int
Calculates the total score of a set of motifs based on their similarity to the consensus.
The score is defined as the total number of mismatches between each motif and the consensus string. A lower score indicates a more conserved motif set, while a higher score indicates greater variability.
- Parameters:
Motifs (list[str]) – A list of DNA strings (motifs) of equal length.
- Returns:
The total number of mismatches compared to the consensus across all positions and motifs.
- Return type:
int
Example
>>> Score(["ATG", "ACG", "AAG", "AGG", "ATG"])
3
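A short sketch of how the consensus string and the mismatch score relate. The function names are hypothetical, and the tie-breaking (the nucleotide seen first in a column wins) is an assumption of this sketch, not documented behavior of Consensus().

```python
from collections import Counter


def consensus_string(motifs: list[str]) -> str:
    """Pick the most frequent nucleotide in each column."""
    # zip(*motifs) walks the motifs column by column.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*motifs)
    )


def mismatch_score(motifs: list[str]) -> int:
    """Count positions where a motif differs from the consensus."""
    consensus = consensus_string(motifs)
    return sum(
        a != b for motif in motifs for a, b in zip(motif, consensus)
    )


motifs = ["ATG", "ACG", "AAG", "AGG", "ATG"]
consensus = consensus_string(motifs)
score = mismatch_score(motifs)
```

The score does not depend on how ties are broken, because any most-frequent nucleotide in a column leaves the same number of mismatches.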
Profile Evaluation
- GenomeVisualizer.motifs.Pr(Text: str, Profile: dict[str, list[float]]) -> float
Computes the probability of a DNA string given a profile matrix.
This function calculates the probability that the given DNA string Text was generated by the given profile matrix. It multiplies the corresponding probabilities from the profile for each nucleotide at each position.
- Parameters:
Text (str) – A DNA string (motif) of length k.
Profile (dict[str, list[float]]) – A profile matrix containing nucleotide probabilities at each position, with keys ‘A’, ‘C’, ‘G’, ‘T’.
- Returns:
The probability of the motif according to the profile.
- Return type:
float
Example
>>> profile = {
...     'A': [0.2, 0.2, 0.3],
...     'C': [0.4, 0.3, 0.1],
...     'G': [0.3, 0.3, 0.4],
...     'T': [0.1, 0.2, 0.2]
... }
>>> Pr("ACG", profile)  # 0.2 * 0.3 * 0.4
0.024
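The product rule above amounts to one lookup per position. A minimal sketch, assuming the profile is a dict of four equal-length lists; kmer_probability is a hypothetical name used only for illustration.

```python
def kmer_probability(text: str, profile: dict[str, list[float]]) -> float:
    """Multiply the profile entry of each symbol at its own position."""
    probability = 1.0
    for j, symbol in enumerate(text):
        probability *= profile[symbol][j]
    return probability


profile = {
    "A": [0.2, 0.2, 0.3],
    "C": [0.4, 0.3, 0.1],
    "G": [0.3, 0.3, 0.4],
    "T": [0.1, 0.2, 0.2],
}
p = kmer_probability("ACG", profile)
```

Because the result is a product of floats, compare it with a tolerance instead of testing for exact equality.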
- GenomeVisualizer.motifs.ProfileMostProbableKmer(text: str, k: int, profile: dict[str, list[float]]) -> str
Finds the most probable k-mer in a DNA sequence based on a given profile matrix.
Given a DNA string text, an integer k, and a 4 x k profile matrix, this function finds the k-mer in text that is most likely to have been generated by the profile. In the case of a tie (multiple k-mers with equal maximum probability), the function returns the first one that occurs in text.
- Parameters:
text (str) – The DNA sequence to search within.
k (int) – The length of the k-mers.
profile (dict[str, list[float]]) – A profile matrix containing nucleotide probabilities at each position (keys: ‘A’, ‘C’, ‘G’, ‘T’).
- Returns:
The k-mer from the input text that has the highest probability based on the profile.
- Return type:
str
Example
>>> profile = {
...     'A': [0.2, 0.2, 0.3, 0.2, 0.3],
...     'C': [0.4, 0.3, 0.1, 0.5, 0.1],
...     'G': [0.3, 0.3, 0.5, 0.2, 0.4],
...     'T': [0.1, 0.2, 0.1, 0.1, 0.2]
... }
>>> text = "ACCTGTTTATTGCCTAAGTTCCGAACAAACCCAATATAGCCCGAGGGCCT"
>>> ProfileMostProbableKmer(text, 5, profile)
'CCGAG'
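The sliding-window search and its leftmost-wins tie rule can be sketched as follows. The name most_probable_kmer is hypothetical, and the sketch assumes len(text) >= k.

```python
def most_probable_kmer(text: str, k: int, profile: dict[str, list[float]]) -> str:
    """Return the leftmost k-mer of text with the highest profile probability."""
    best_kmer, best_probability = text[:k], -1.0
    for i in range(len(text) - k + 1):
        kmer = text[i:i + k]
        probability = 1.0
        for j, symbol in enumerate(kmer):
            probability *= profile[symbol][j]
        # Strict '>' keeps the earlier k-mer when probabilities tie.
        if probability > best_probability:
            best_kmer, best_probability = kmer, probability
    return best_kmer


profile = {
    "A": [0.2, 0.2, 0.3, 0.2, 0.3],
    "C": [0.4, 0.3, 0.1, 0.5, 0.1],
    "G": [0.3, 0.3, 0.5, 0.2, 0.4],
    "T": [0.1, 0.2, 0.1, 0.1, 0.2],
}
text = "ACCTGTTTATTGCCTAAGTTCCGAACAAACCCAATATAGCCCGAGGGCCT"
kmer = most_probable_kmer(text, 5, profile)
```

Starting the running maximum at -1.0 means the first k-mer is still returned when every k-mer has probability 0, which matters for profiles built without pseudocounts.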
- GenomeVisualizer.motifs.Motifs(Profile: dict[str, list[float]], Dna: list[str]) -> list[str]
Identifies the profile-most probable motif (k-mer) in each DNA string from a given profile matrix.
For each string in the input list Dna, this function finds the k-mer that is most likely to have been generated by the given profile matrix. The result is a list of motifs, one from each sequence, that best match the provided profile.
- Parameters:
Profile (dict[str, list[float]]) – A profile matrix represented as a dictionary mapping nucleotides (‘A’, ‘C’, ‘G’, ‘T’) to lists of positional probabilities.
Dna (list[str]) – A list of t DNA strings (assumed to be of equal or similar length).
- Returns:
A list of k-mers (motifs), one from each input string, representing the most probable subsequences under the profile model.
- Return type:
list[str]
Example
>>> profile = {
...     'A': [0.8, 0.0, 0.0, 0.2],
...     'C': [0.0, 0.6, 0.2, 0.0],
...     'G': [0.2, 0.2, 0.8, 0.0],
...     'T': [0.0, 0.2, 0.0, 0.8]
... }
>>> Dna = ["TTACCTTAAC", "GATGTCTGTC", "ACGGCGTTAG", "CCCTAACGAG", "CGTCAGAGGT"]
>>> Motifs(profile, Dna)
['ACCT', 'ATGT', 'GCGT', 'ACGA', 'AGGT']
Motif Search Algorithms
Greedy Search:
- GenomeVisualizer.motifs.GreedyMotifSearch(Dna: list[str], k: int, t: int) -> list[str]
Finds the best-scoring collection of motifs across multiple DNA strings using the greedy motif search algorithm.
This function implements the classic GreedyMotifSearch algorithm, which iteratively selects the most probable k-mer in each string based on a profile matrix built from previously selected motifs. The search is initialized by trying every possible k-mer from the first DNA string.
At each iteration, a new profile is built from the existing motifs, and the next string contributes its most probable k-mer according to that profile. The score of the resulting motif set is compared to the current best, and updated if an improvement is found.
- Parameters:
Dna (list[str]) – A list of t DNA strings (all of equal length).
k (int) – The length of the motif to search for.
t (int) – The number of DNA strings.
- Returns:
A list of t k-mers (one from each string), representing the best motif set found.
- Return type:
list[str]
Notes
In case of ties in the profile-most probable k-mer selection, the leftmost occurrence is chosen.
This basic version does not include pseudocounts; therefore, the presence of zeroes in the profile matrix can suppress potential motifs. A pseudocount-enhanced version is more robust.
Example
>>> k = 3
>>> t = 5
>>> Dna = [
...     "GGCGTTCAGGCA",
...     "AAGAATCAGTCA",
...     "CAAGGAGTTCGC",
...     "CACGTCAATCAC",
...     "CAATAATATTCG"
... ]
>>> GreedyMotifSearch(Dna, k, t)
['CAG', 'CAG', 'CAA', 'CAA', 'CAA']
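The greedy procedure described above can be sketched end to end. This is an illustrative reimplementation under stated assumptions, not the module's source: the helper names are hypothetical, t is taken from len(dna) rather than passed in, and the profile is built without pseudocounts.

```python
def profile_matrix(motifs: list[str]) -> dict[str, list[float]]:
    """Relative nucleotide frequencies per column, without pseudocounts."""
    t, k = len(motifs), len(motifs[0])
    return {
        s: [sum(m[j] == s for m in motifs) / t for j in range(k)]
        for s in "ACGT"
    }


def most_probable_kmer(text: str, k: int, profile: dict[str, list[float]]) -> str:
    """Leftmost k-mer of text with the highest probability under profile."""
    best_kmer, best_p = text[:k], -1.0
    for i in range(len(text) - k + 1):
        p = 1.0
        for j, s in enumerate(text[i:i + k]):
            p *= profile[s][j]
        if p > best_p:  # strict comparison keeps the leftmost k-mer on ties
            best_kmer, best_p = text[i:i + k], p
    return best_kmer


def score(motifs: list[str]) -> int:
    """Mismatches against the consensus: per column, t minus the top count."""
    return sum(
        len(col) - max(col.count(s) for s in "ACGT") for col in zip(*motifs)
    )


def greedy_motif_search(dna: list[str], k: int) -> list[str]:
    # Baseline: the first k-mer of every string.
    best = [text[:k] for text in dna]
    first = dna[0]
    for i in range(len(first) - k + 1):
        # Seed with one k-mer of the first string, then extend greedily.
        motifs = [first[i:i + k]]
        for text in dna[1:]:
            motifs.append(most_probable_kmer(text, k, profile_matrix(motifs)))
        if score(motifs) < score(best):
            best = motifs
    return best


dna = ["GGCGTTCAGGCA", "AAGAATCAGTCA", "CAAGGAGTTCGC", "CACGTCAATCAC", "CAATAATATTCG"]
result = greedy_motif_search(dna, 3)
```

When a profile assigns probability 0 to every k-mer of a string, the sketch falls back to that string's first k-mer. That is the zero-suppression effect the notes above attribute to the version without pseudocounts.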
- GenomeVisualizer.motifs.GreedyMotifSearchWithPseudocounts(Dna: list[str], k: int, t: int) -> list[str]
Executes the greedy motif search algorithm using a pseudocount-corrected profile matrix.
This enhanced version of GreedyMotifSearch prevents zero probabilities in profile matrices by applying Laplace correction (pseudocounts). It initializes with the first k-mer from each string, then explores all possible k-mers in the first DNA string, building a motif matrix by iteratively adding the profile-most probable k-mer from the remaining strings.
- Parameters:
Dna (list[str]) – A list of t DNA strings (assumed to be of equal or similar length).
k (int) – Length of the motif to identify.
t (int) – Number of DNA strings in the input list.
- Returns:
A list of t k-mers (one from each string) representing the highest scoring motifs.
- Return type:
list[str]
Notes
Profile construction uses ProfileWithPseudocounts() to avoid zero-probability values.
This version is more stable for motif detection than the version without pseudocounts.
In case of ties in most probable k-mer selection, the first occurrence is returned.
Example
>>> Dna = ["GGCGTTCAGGCA", "AAGAATCAGTCA", "CAAGGAGTTCGC", "CACGTCAATCAC", "CAATAATATTCG"]
>>> GreedyMotifSearchWithPseudocounts(Dna, 3, 5)
['TTC', 'ATC', 'TTC', 'ATC', 'TTC']
Randomized Algorithms:
- GenomeVisualizer.motifs.RandomizedMotifSearch(Dna: list[str], k: int, t: int) -> list[str]
Performs the Randomized Motif Search algorithm to identify conserved k-mers across DNA sequences.
This stochastic algorithm begins with a random selection of k-mers from each string in Dna (using RandomMotifs()), then iteratively refines them based on a profile matrix with pseudocounts. At each step, it constructs a profile matrix from the current motifs, selects the most probable k-mers in all sequences, and updates the best motifs if the score improves. The algorithm stops when no improvement is made.
This method is particularly useful for discovering weak signals (subtle motifs) in genomic sequences and often outperforms deterministic approaches like greedy motif search, especially when repeated many times from different random initializations.
- Parameters:
Dna (list[str]) – A list of t DNA strings.
k (int) – Length of the motifs to find.
t (int) – Number of DNA strings to process.
- Returns:
A list of t k-mers representing the best-scoring motifs found.
- Return type:
list[str]
Example
>>> Dna = [
...     "CGCCCCTCTCGGGGGTGTTCAGTAAACGGCCA",
...     "GGGCGAGGTATGTGTAAGTGCCAAGGTGCCAG",
...     "TAGTACCGAGACCGAAAGAAGTATACAGGCGT",
...     "TAGATCAAGTTTCAGGTGCACGTCGGTGAACC",
...     "AATCCACCAGCTCCACGTGCAATGTTGGCCTA"
... ]
>>> RandomizedMotifSearch(Dna, 8, 5)
['TCTCGGGG', 'CCAAGGTG', 'TACAGGCG', 'TTCAGGTG', 'TCCACGTG']
Notes
The algorithm may return different results on each run due to its randomized nature.
For more reliable results, run the function multiple times and retain the best output.
Uses pseudocounts in profile construction to avoid zero probabilities.
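The advice to run the search several times and keep the best result can be sketched as a small wrapper. Everything here is illustrative: the helper names, the seed parameter, and the restart count are assumptions of this sketch, not part of the documented API.

```python
import random


def profile_with_pseudocounts(motifs: list[str]) -> dict[str, list[float]]:
    """Column frequencies after adding 1 to every count."""
    t, k = len(motifs), len(motifs[0])
    return {
        s: [(sum(m[j] == s for m in motifs) + 1) / (t + 4) for j in range(k)]
        for s in "ACGT"
    }


def most_probable_kmer(text: str, k: int, profile: dict[str, list[float]]) -> str:
    best_kmer, best_p = text[:k], -1.0
    for i in range(len(text) - k + 1):
        p = 1.0
        for j, s in enumerate(text[i:i + k]):
            p *= profile[s][j]
        if p > best_p:
            best_kmer, best_p = text[i:i + k], p
    return best_kmer


def score(motifs: list[str]) -> int:
    return sum(
        len(col) - max(col.count(s) for s in "ACGT") for col in zip(*motifs)
    )


def randomized_search_once(dna: list[str], k: int, rng: random.Random) -> list[str]:
    """One run: random start, then refine until the score stops improving."""
    motifs = []
    for text in dna:
        start = rng.randrange(len(text) - k + 1)
        motifs.append(text[start:start + k])
    best = motifs
    while True:
        profile = profile_with_pseudocounts(motifs)
        motifs = [most_probable_kmer(text, k, profile) for text in dna]
        if score(motifs) < score(best):
            best = motifs
        else:
            return best


def best_of_restarts(dna: list[str], k: int, restarts: int, seed: int = 0) -> list[str]:
    """Repeat the randomized run and keep the lowest-scoring motif set."""
    rng = random.Random(seed)
    runs = (randomized_search_once(dna, k, rng) for _ in range(restarts))
    return min(runs, key=score)


dna = [
    "CGCCCCTCTCGGGGGTGTTCAGTAAACGGCCA",
    "GGGCGAGGTATGTGTAAGTGCCAAGGTGCCAG",
    "TAGTACCGAGACCGAAAGAAGTATACAGGCGT",
    "TAGATCAAGTTTCAGGTGCACGTCGGTGAACC",
    "AATCCACCAGCTCCACGTGCAATGTTGGCCTA",
]
result = best_of_restarts(dna, 8, restarts=20, seed=0)
```

Each run terminates because the score is a non-negative integer that must strictly decrease for the loop to continue. Fixing the seed makes a run reproducible, which helps when comparing restart counts.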
- GenomeVisualizer.motifs.GibbsSampler(Dna: list[str], k: int, t: int, N: int) -> list[str]
Implements the Gibbs Sampling algorithm for motif discovery in a set of DNA sequences.
Gibbs sampling is a stochastic optimization technique that iteratively updates one motif at a time by probabilistically sampling from a profile built on the remaining motifs. This allows the algorithm to escape local optima and potentially find better solutions than greedy or deterministic methods.
- Parameters:
Dna (list[str]) – A list of DNA strings.
k (int) – The length of the motif to search for.
t (int) – The number of DNA strings (should be equal to len(Dna)).
N (int) – Number of iterations for the Gibbs sampling process.
- Returns:
A list of t k-mers (one from each DNA string) representing the best motif set found.
- Return type:
list[str]
Example
>>> Dna = [
...     "CGCCCCTCTCGGGGGTGTTCAGTAAACGGCCA",
...     "GGGCGAGGTATGTGTAAGTGCCAAGGTGCCAG",
...     "TAGTACCGAGACCGAAAGAAGTATACAGGCGT",
...     "TAGATCAAGTTTCAGGTGCACGTCGGTGAACC",
...     "AATCCACCAGCTCCACGTGCAATGTTGGCCTA"
... ]
>>> GibbsSampler(Dna, 8, 5, 100)  # one possible output; results vary between runs
['TCTCGGGG', 'CCAAGGTG', 'TACAGGCG', 'TTCAGGTG', 'TCCACGTG']
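A compact sketch of the update loop described above: drop one motif, build a pseudocount profile from the rest, and resample the dropped motif in proportion to its probability. The function names and the seed parameter are hypothetical, and the use of pseudocounts is an assumption of this sketch.

```python
import random


def profile_with_pseudocounts(motifs: list[str]) -> dict[str, list[float]]:
    t, k = len(motifs), len(motifs[0])
    return {
        s: [(sum(m[j] == s for m in motifs) + 1) / (t + 4) for j in range(k)]
        for s in "ACGT"
    }


def kmer_probability(kmer: str, profile: dict[str, list[float]]) -> float:
    p = 1.0
    for j, s in enumerate(kmer):
        p *= profile[s][j]
    return p


def score(motifs: list[str]) -> int:
    return sum(
        len(col) - max(col.count(s) for s in "ACGT") for col in zip(*motifs)
    )


def gibbs_sampler(dna: list[str], k: int, n_iterations: int, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    t = len(dna)
    # Random starting k-mer in every string.
    motifs = []
    for text in dna:
        start = rng.randrange(len(text) - k + 1)
        motifs.append(text[start:start + k])
    best = list(motifs)
    for _ in range(n_iterations):
        i = rng.randrange(t)  # string whose motif is resampled
        profile = profile_with_pseudocounts(motifs[:i] + motifs[i + 1:])
        text = dna[i]
        kmers = [text[j:j + k] for j in range(len(text) - k + 1)]
        weights = [kmer_probability(kmer, profile) for kmer in kmers]
        # Weighted draw; pseudocounts keep every weight positive.
        motifs[i] = rng.choices(kmers, weights=weights, k=1)[0]
        if score(motifs) < score(best):
            best = list(motifs)
    return best


dna = [
    "CGCCCCTCTCGGGGGTGTTCAGTAAACGGCCA",
    "GGGCGAGGTATGTGTAAGTGCCAAGGTGCCAG",
    "TAGTACCGAGACCGAAAGAAGTATACAGGCGT",
    "TAGATCAAGTTTCAGGTGCACGTCGGTGAACC",
    "AATCCACCAGCTCCACGTGCAATGTTGGCCTA",
]
result = gibbs_sampler(dna, 8, 100, seed=1)
```

Unlike the randomized search, a sampling step may make the current motif set worse, so the sketch tracks the best set seen so far separately from the current one.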
Pseudocount Utilities
- GenomeVisualizer.motifs.CountWithPseudocounts(Motifs: list[str]) -> dict[str, list[int]]
Computes the count matrix of motifs with pseudocounts (Laplace’s Rule of Succession).
This function is a modified version of Count(), where each position in the count matrix is initialized with 1 instead of 0. This prevents zero values in subsequent profile computations and is especially useful in motif discovery algorithms to avoid assigning zero probability to unseen symbols.
- Parameters:
Motifs (list[str]) – A list of DNA strings (motifs) of equal length.
- Returns:
A dictionary mapping each nucleotide (‘A’, ‘C’, ‘G’, ‘T’) to a list of integer counts per position, each initialized with 1 (pseudocount).
- Return type:
dict[str, list[int]]
Example
>>> motifs = ["AACGTA", "CCCGTT", "CACCTT", "GGATTA", "TTCCGG"]
>>> CountWithPseudocounts(motifs)
{
    'A': [2, 3, 2, 1, 1, 3],
    'C': [3, 2, 5, 3, 1, 1],
    'G': [2, 2, 1, 3, 2, 2],
    'T': [2, 2, 1, 2, 5, 3]
}
- GenomeVisualizer.motifs.ProfileWithPseudocounts(Motifs: list[str]) -> dict[str, list[float]]
Computes the nucleotide profile matrix for a list of motifs using pseudocounts.
Unlike the basic profile computation, this version adds 1 to each nucleotide count before normalization, preventing zero probabilities and improving the robustness of the profile in greedy or probabilistic motif search algorithms.
- Parameters:
Motifs (list[str]) – A list of DNA strings (motifs) of equal length.
- Returns:
A dictionary mapping each nucleotide (‘A’, ‘C’, ‘G’, ‘T’) to a list of probabilities for each position.
- Return type:
dict[str, list[float]]
Notes
This function avoids zero-probability pitfalls in probabilistic models.
Each column is divided by t + 4, where t is the number of motifs, to account for the four pseudocounts (one per nucleotide).
Example
>>> motifs = ["AACGTA", "CCCGTT", "CACCTT", "GGATTA", "TTCCGG"]
>>> ProfileWithPseudocounts(motifs)  # values shown rounded to three decimal places
{
    'A': [0.222, 0.333, 0.222, 0.111, 0.111, 0.333],
    'C': [0.333, 0.222, 0.556, 0.333, 0.111, 0.111],
    'G': [0.222, 0.222, 0.111, 0.333, 0.222, 0.222],
    'T': [0.222, 0.222, 0.111, 0.222, 0.556, 0.333]
}
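The pseudocount correction is a small change to the plain profile: start every count at 1 and divide by t + 4 instead of t. A minimal sketch with a hypothetical function name:

```python
def profile_with_pseudocounts(motifs: list[str]) -> dict[str, list[float]]:
    """Profile matrix with Laplace (add-one) smoothing."""
    t, k = len(motifs), len(motifs[0])
    # Start each count at 1 instead of 0.
    counts = {s: [1] * k for s in "ACGT"}
    for motif in motifs:
        for j, s in enumerate(motif):
            counts[s][j] += 1
    # Each column now totals t + 4 (t observed symbols plus 4 pseudocounts).
    return {s: [c / (t + 4) for c in row] for s, row in counts.items()}


motifs = ["AACGTA", "CCCGTT", "CACCTT", "GGATTA", "TTCCGG"]
profile = profile_with_pseudocounts(motifs)
column_sums = [sum(profile[s][j] for s in "ACGT") for j in range(6)]
```

With five motifs the divisor is 9, so the smallest possible entry is 1/9 and no k-mer can receive probability 0.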
Probabilistic Sampling Tools
- GenomeVisualizer.motifs.RandomMotifs(Dna: list[str], k: int, t: int) -> list[str]
Randomly selects one k-mer motif from each DNA string in the input list.
This function is typically used as an initialization step in randomized motif search algorithms such as Gibbs sampling or Randomized Motif Search. It selects a random starting index in each DNA string and extracts a k-mer from that position.
- Parameters:
Dna (list[str]) – A list of t DNA strings (assumed to be of equal or similar length).
k (int) – Length of the motif to select.
t (int) – Number of DNA strings to process (usually len(Dna)).
- Returns:
A list of t randomly chosen k-mers (one from each DNA string).
- Return type:
list[str]
Example
>>> Dna = ["TTACCTTAAC", "GATGTCTGTC", "ACGGCGTTAG", "CCCTAACGAG", "CGTCAGAGGT"]
>>> RandomMotifs(Dna, 3, 5)  # one possible output; results vary between runs
['ACC', 'GAT', 'TAG', 'TAA', 'AGA']
- GenomeVisualizer.motifs.Normalize(Probabilities: dict[str, float]) -> dict[str, float]
Normalizes a dictionary of probabilities so that the values sum to 1.
This function takes a dictionary of raw scores or unnormalized probabilities and scales them proportionally so that their total equals 1. This is a common preprocessing step in probabilistic models, such as those used in Gibbs sampling or probabilistic motif selection.
- Parameters:
Probabilities (dict[str, float]) – A dictionary mapping nucleotide symbols (‘A’, ‘C’, ‘G’, ‘T’) to non-negative float values.
- Returns:
A new dictionary where the values are normalized to sum to 1.
- Return type:
dict[str, float]
Example
>>> Normalize({'A': 0.1, 'C': 0.1, 'G': 0.1, 'T': 0.1})
{'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
- GenomeVisualizer.motifs.WeightedDie(Probabilities: dict[str, float]) -> str
Randomly selects a key (a k-mer or nucleotide) according to weighted probabilities.
This function performs weighted random sampling from a probability distribution represented as a dictionary. Each key (e.g., a k-mer or nucleotide) is associated with a probability value, and the function returns one key randomly, proportionally to its assigned probability.
- Parameters:
Probabilities (dict[str, float]) – A dictionary where keys represent k-mers or nucleotides, and values are probabilities that sum to 1 (use Normalize() if needed).
- Returns:
A randomly selected key based on the weights.
- Return type:
str
Example
>>> WeightedDie({'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25})  # one possible output
'A'
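Normalization followed by a weighted draw can be sketched with a cumulative-sum walk. This is one common way to implement such a die and is not necessarily how WeightedDie() works internally; the function names and the rng parameter are hypothetical.

```python
import random


def normalize(weights: dict[str, float]) -> dict[str, float]:
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights.values())
    return {key: value / total for key, value in weights.items()}


def weighted_die(probabilities: dict[str, float], rng: random.Random) -> str:
    """Return a key with probability proportional to its value."""
    threshold = rng.random()  # uniform in [0, 1)
    cumulative = 0.0
    for key, p in probabilities.items():
        cumulative += p
        if threshold < cumulative:
            return key
    # Floating-point rounding can leave the total just below 1.
    return key


rng = random.Random(42)
probabilities = normalize({"AA": 2.0, "AC": 1.0, "CC": 1.0})
draws = [weighted_die(probabilities, rng) for _ in range(1000)]
```

Passing the random number generator in explicitly keeps the draw reproducible under a fixed seed.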
- GenomeVisualizer.motifs.ProfileGeneratedString(Text: str, profile: dict[str, list[float]], k: int) -> str
Selects a k-mer from the input string according to its probability based on a given profile.
This function computes the probability of each k-mer of length k in Text based on a nucleotide position-specific profile matrix. It then normalizes these probabilities and randomly selects one k-mer using a weighted die (probabilistic sampling).
- Parameters:
Text (str) – The DNA string from which to extract the k-mer.
profile (dict[str, list[float]]) – Profile matrix as a dictionary mapping nucleotides to lists of position-specific probabilities.
k (int) – Length of the k-mers to evaluate.
- Returns:
A k-mer selected probabilistically based on the profile matrix.
- Return type:
str
Example
>>> profile = {'A': [0.5, 0.1], 'C': [0.3, 0.2], 'G': [0.2, 0.4], 'T': [0.0, 0.3]}
>>> ProfileGeneratedString("AAACCCAAACCC", profile, 2)  # one possible output
'AA'
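The three steps above (score every k-mer, normalize, draw) can be sketched with the standard library. This sketch swaps in random.Random.choices, which normalizes the weights internally, in place of a separate normalize-then-roll pair; sample_kmer and the rng parameter are hypothetical names.

```python
import random


def sample_kmer(text: str, profile: dict[str, list[float]], k: int,
                rng: random.Random) -> str:
    """Draw one k-mer of text with probability proportional to its profile score."""
    kmers = [text[i:i + k] for i in range(len(text) - k + 1)]
    weights = []
    for kmer in kmers:
        p = 1.0
        for j, symbol in enumerate(kmer):
            p *= profile[symbol][j]
        weights.append(p)
    # choices() rescales the weights, so they need not sum to 1.
    # It raises ValueError if every weight is zero.
    return rng.choices(kmers, weights=weights, k=1)[0]


profile = {"A": [0.5, 0.1], "C": [0.3, 0.2], "G": [0.2, 0.4], "T": [0.0, 0.3]}
kmer = sample_kmer("AAACCCAAACCC", profile, 2, random.Random(7))
```

A k-mer that occurs several times in the text appears several times in the candidate list, so it is proportionally more likely to be drawn.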