Replication Module

This module provides functions for analyzing DNA sequences to detect replication origins (ori), based on GC-skew, pattern matching, and approximate motif searches.

Pattern-based Functions

GenomeVisualizer.replication.PatternCount(Text: str, Pattern: str) → int

Counts the number of exact occurrences of a pattern in a given DNA sequence.

This function slides a window of length len(Pattern) across Text and counts every window that equals Pattern. Overlapping occurrences are counted separately.

Parameters:

Text (str) – DNA sequence in which the pattern is searched.
Pattern (str) – DNA pattern to find within the sequence.

Returns:

Number of times the pattern occurs exactly in the sequence.

Return type:

int

Example

>>> PatternCount("ATATAT", "ATA")
2
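
The sliding-window count described above can be sketched as follows. The name pattern_count_sketch is hypothetical and the code is an illustration under that assumption, not the module's actual source.

```python
def pattern_count_sketch(text: str, pattern: str) -> int:
    """Count exact, possibly overlapping, occurrences of pattern in text."""
    k = len(pattern)
    count = 0
    # Visit every start position at which a full-length window still fits.
    for i in range(len(text) - k + 1):
        if text[i:i + k] == pattern:
            count += 1
    return count


print(pattern_count_sketch("ATATAT", "ATA"))  # 2
print("ATATAT".count("ATA"))                  # 1
```

The built-in str.count only counts non-overlapping matches, which is why the sketch uses an explicit window loop instead.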

GenomeVisualizer.replication.PatternMatching(Pattern: str, Genome: str) → list[int]

Finds all starting positions where a given pattern appears exactly in a genome.

This function searches the genome string for exact matches of the input pattern and returns a list of the 0-based starting indices of every match.

Parameters:

Pattern (str) – DNA pattern to search for.
Genome (str) – DNA sequence in which to search for the pattern.

Returns:

List of starting positions where the pattern occurs.

Return type:

list[int]

Example

>>> PatternMatching("ATG", "ATGCATGATG")
[0, 4, 7]
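
One possible way to collect the match positions is repeated use of str.find, resuming one character after each hit. This is a sketch with a hypothetical name (pattern_matching_sketch); the module may well use a plain window loop instead.

```python
def pattern_matching_sketch(pattern: str, genome: str) -> list[int]:
    """Return the 0-based start index of every exact match of pattern."""
    positions = []
    start = genome.find(pattern)
    while start != -1:
        positions.append(start)
        # Resume one character later so overlapping matches are not skipped.
        start = genome.find(pattern, start + 1)
    return positions


print(pattern_matching_sketch("ATA", "ATATAT"))  # [0, 2]
```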

GenomeVisualizer.replication.Reverse(Pattern: str) → str

Reverses the given DNA pattern.

This function returns the reverse of the input DNA sequence by reversing the order of its characters.

Parameters: Pattern (str) – DNA sequence to be reversed.
Returns: The reversed DNA sequence.
Return type: str

Example

>>> Reverse("ATCG")
"GCTA"

GenomeVisualizer.replication.Complement(Pattern: str) → str

Returns the complementary DNA strand of the given pattern.

This function substitutes each nucleotide in the input DNA sequence with its Watson-Crick complement: A ↔ T, C ↔ G.

Parameters: Pattern (str) – DNA sequence consisting of characters ‘A’, ‘T’, ‘C’, and ‘G’.
Returns: The complementary DNA sequence.
Return type: str

Example

>>> Complement("ATCG")
"TAGC"

GenomeVisualizer.replication.ReverseComplement(Pattern: str) → str

Computes the reverse complement of a DNA sequence.

This function first reverses the input DNA sequence, then replaces each nucleotide with its Watson-Crick complement: A ↔ T, C ↔ G.

Parameters: Pattern (str) – DNA sequence to be reverse-complemented.
Returns: The reverse complement of the input sequence.
Return type: str
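
The reverse-then-complement steps can be sketched with a slice and a translation table. The names below are hypothetical, and passing characters other than A, C, G, T through unchanged is an assumption of this sketch rather than documented behavior of the module.

```python
# Translation table for the Watson-Crick pairs A <-> T and C <-> G.
_COMPLEMENT_TABLE = str.maketrans("ACGT", "TGCA")


def reverse_complement_sketch(pattern: str) -> str:
    """Reverse the sequence, then complement each nucleotide."""
    reversed_pattern = pattern[::-1]
    # Characters missing from the table are left unchanged (sketch assumption).
    return reversed_pattern.translate(_COMPLEMENT_TABLE)


print(reverse_complement_sketch("AAAACCCGGT"))  # ACCGGGTTTT
```

Because complementing acts on each character independently, reversing first or complementing first gives the same result.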

Example

>>> ReverseComplement("ATCG")
"CGAT"

GC-Skew and Symbol Analysis

GenomeVisualizer.replication.SkewArray(Genome: str) → list[int]

Computes the skew array of a DNA genome.

The skew array records the difference between the cumulative counts of ‘G’ and ‘C’ nucleotides at each position in the genome. It starts at 0 and increments by +1 for every ‘G’, -1 for every ‘C’, and remains unchanged for other nucleotides (e.g., ‘A’ or ‘T’).

This array is particularly useful for locating the origin of replication: in many bacterial genomes the skew decreases up to the ori and increases after it, so the minimum typically lies near the ori.

Parameters: Genome (str) – The DNA sequence to analyze.
Returns: A list of len(Genome) + 1 skew values, where entry i is the skew after the first i nucleotides (entry 0 is always 0).
Return type: list[int]

Example

>>> SkewArray("CAGTGC")
[0, -1, -1, 0, 0, 1, 0]
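
The running G minus C count can be sketched as below. The function name and the step lookup table are hypothetical choices made for this illustration, not the module's implementation.

```python
def skew_array_sketch(genome: str) -> list[int]:
    """Return the running G-minus-C count, starting from 0."""
    step = {"G": 1, "C": -1}  # every other symbol contributes 0
    skew = [0]
    for nucleotide in genome:
        skew.append(skew[-1] + step.get(nucleotide, 0))
    return skew


print(skew_array_sketch("GGCAC"))  # [0, 1, 2, 1, 1, 0]
```

The result always has one more entry than the genome has characters, because the leading 0 represents the empty prefix.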

GenomeVisualizer.replication.MinimumSkew(Genome: str) → list[int]

Identifies all positions in the genome where the skew array reaches its minimum value.

This function computes the skew array of the genome and returns all indices where the skew is minimal. These positions are biologically significant, as the origin of replication (ori) often occurs near the minimum skew point.

Parameters: Genome (str) – The DNA sequence to analyze.
Returns: A list of genome positions where the skew is minimal.
Return type: list[int]

Example

>>> MinimumSkew("TAAAGACTGCCGAGAGGCCAACACGAGTGCTAGAACGAGGGGCGTAAACGCGGGTCCGAT")
[11, 24]
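
A compact sketch of the two steps (build the skew array, then collect every index holding its minimum). It uses itertools.accumulate with the initial argument, which requires Python 3.8 or later; the function name is hypothetical.

```python
from itertools import accumulate


def minimum_skew_sketch(genome: str) -> list[int]:
    """Return every index at which the skew array attains its minimum."""
    step = {"G": 1, "C": -1}  # every other symbol contributes 0
    # initial=0 prepends the skew of the empty prefix.
    skew = list(accumulate((step.get(n, 0) for n in genome), initial=0))
    lowest = min(skew)
    return [i for i, value in enumerate(skew) if value == lowest]


print(minimum_skew_sketch("CCGGCC"))  # [2, 6]
```

Returned indices refer to the skew array, so index i means "after the first i nucleotides".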

GenomeVisualizer.replication.FasterSymbolArray(Genome: str, symbol: str) → dict[int, int]

Efficiently computes the symbol array: for each starting position, the number of occurrences of a symbol in a window of length n/2, where n is the length of Genome.

This optimized version of SymbolArray avoids redundant computations by using a sliding window approach. Instead of recomputing the number of occurrences of the symbol from scratch for each window, it updates the count by considering only the symbol that exits the window and the one that enters it. This reduces the time complexity from O(n^2) to O(n), making it suitable for long genomes.

Parameters:

Genome (str) – The DNA sequence to analyze.
symbol (str) – The nucleotide symbol (‘A’, ‘C’, ‘G’, or ‘T’) to count.

Returns:

A dictionary where keys are starting positions and values are the counts of the symbol in the corresponding window.

Return type:

dict[int, int]

Example

>>> FasterSymbolArray("AAAAGGGG", "A")
{0: 4, 1: 3, 2: 2, 3: 1, 4: 0, 5: 1, 6: 2, 7: 3}

Notes

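The sliding window is of length n/2, where n is the length of Genome.
The genome is treated as circular: it is virtually extended by its first n/2 characters so that windows starting near the end wrap around to the beginning.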
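
The incremental update described above can be sketched as follows. The function name is hypothetical, and the sketch assumes the window length is n // 2 (integer division) and that Genome is non-empty.

```python
def faster_symbol_array_sketch(genome: str, symbol: str) -> dict[int, int]:
    """Count symbol in each circular window of length n // 2 in O(n) time."""
    n = len(genome)
    half = n // 2  # assumed window length
    extended = genome + genome[:half]  # lets late windows wrap around
    # Only the first window is counted from scratch.
    counts = {0: extended[:half].count(symbol)}
    for i in range(1, n):
        counts[i] = counts[i - 1]
        if extended[i - 1] == symbol:         # symbol leaving the window
            counts[i] -= 1
        if extended[i + half - 1] == symbol:  # symbol entering the window
            counts[i] += 1
    return counts


print(faster_symbol_array_sketch("ACGA", "A"))  # {0: 1, 1: 0, 2: 1, 3: 2}
```

Each step inspects only two characters, which is where the linear running time comes from.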

Distance and Approximate Matching

GenomeVisualizer.replication.HammingDistance(p: str, q: str) → int

Computes the Hamming distance between two DNA strings.

The Hamming distance is the number of positions at which the corresponding symbols differ. It is commonly used to measure how dissimilar two sequences of equal length are: a distance of 0 means the sequences are identical.

Parameters:

p (str) – First DNA string.
q (str) – Second DNA string, must be the same length as p.

Returns:

The number of differing positions between the two strings.

Return type:

int

Raises:

ValueError – If the input strings are not of equal length.

Example

>>> HammingDistance("GGGCCGTTGGT", "GGACCGTTGAC")
3
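
A minimal sketch of the definition, including the documented length check. The function name and the error message text are hypothetical.

```python
def hamming_distance_sketch(p: str, q: str) -> int:
    """Count the positions at which p and q differ."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    # Each mismatching pair contributes True, which sums as 1.
    return sum(a != b for a, b in zip(p, q))


print(hamming_distance_sketch("ATCG", "ATGG"))  # 1
```

The explicit length check matters because zip silently stops at the shorter input.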

GenomeVisualizer.replication.ApproximatePatternMatching(Text: str, Pattern: str, d: int) → list[int]

Finds all starting positions where a pattern appears in a text with at most d mismatches.

This function performs approximate pattern matching by sliding the pattern over the text and computing the Hamming distance at each position. All positions where the distance is less than or equal to d are returned.

Parameters:

Text (str) – The DNA sequence in which to search for the pattern.
Pattern (str) – The DNA pattern to search for.
d (int) – Maximum number of allowed mismatches (Hamming distance threshold).

Returns:

A list of starting positions where the pattern appears with ≤ d mismatches.

Return type:

list[int]

Example

>>> ApproximatePatternMatching("CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT", "ATTCTGGA", 3)
[6, 7, 26, 27]
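
The combination of a sliding window with a Hamming-distance threshold can be sketched as below. The function name is hypothetical and the mismatch count is inlined so the block is self-contained; the module itself may delegate to HammingDistance.

```python
def approximate_pattern_matching_sketch(text: str, pattern: str, d: int) -> list[int]:
    """Return start indices where pattern matches text with at most d mismatches."""
    k = len(pattern)
    positions = []
    for i in range(len(text) - k + 1):
        window = text[i:i + k]
        # Hamming distance between the current window and the pattern.
        mismatches = sum(a != b for a, b in zip(window, pattern))
        if mismatches <= d:
            positions.append(i)
    return positions


print(approximate_pattern_matching_sketch("TTTAGAGCCTTCAGAGG", "GAGG", 2))
# [2, 4, 11, 13]
```

The length of the returned list is the number of approximate occurrences, and with d = 0 the sketch reduces to exact matching.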

GenomeVisualizer.replication.ApproximatePatternCount(Pattern: str, Text: str, d: int) → int

Counts the number of times a pattern appears in a text with at most d mismatches.

This function scans the text for substrings that approximately match the pattern, allowing for up to d mismatches based on Hamming distance, and returns how many such matches exist. Note that Pattern is the first argument here, whereas ApproximatePatternMatching takes Text first.

Parameters:

Pattern (str) – The DNA pattern to search for.
Text (str) – The DNA sequence in which to search.
d (int) – Maximum number of allowed mismatches.

Returns:

The total number of approximate occurrences of the pattern.

Return type:

int

Example

>>> ApproximatePatternCount("GAGG", "TTTAGAGCCTTCAGAGG", 2)
4