Skip to content

API reference

pydustmasker.DustMasker

Identify and mask low-complexity regions in nucleotide sequences using the SDUST algorithm from DustMasker.

Parameters:

Name Type Description Default

sequence

str

The nucleotide sequence to be processed. Characters other than 'A', 'C', 'G', 'T', 'a', 'c', 'g', 't' will be considered ambiguous bases. The minimum allowed sequence length is 4 bases.

required

window_size

int

The length of the window used by symmetric DUST algorithm. The minimum allowed value is 4.

64

score_threshold

int

Score threshold for identifying low-complexity regions. Higher values result in fewer regions being masked.

20

Attributes:

Name Type Description
sequence str

The nucleotide sequence that was provided as input.

window_size int

The length of the window used by symmetric DUST algorithm.

score_threshold int

Score threshold for identifying low-complexity regions.

intervals tuple of tuples

A tuple of tuples representing the start and end positions of the low-complexity regions identified in the sequence.

n_masked_bases int

The total number of bases that were masked.

Methods:

Name Description
mask

Returns the sequence with low-complexity regions masked.

Raises:

Type Description
ValueError

If the input parameters violate the following constraints:

  • sequence length < 4
  • window_size < 4
TypeError

If the input parameters are not of the expected type.

OverflowError

If a negative integer is passed to window_size or score_threshold.

Functions

mask

mask(hard: bool = False) -> str

Returns the sequence with low-complexity regions masked.

Parameters:

Name Type Description Default
hard
bool

If True, low-complexity regions will be masked with 'N' characters. By default, bases within low-complexity regions are converted to lowercase (i.e., soft-masking).

False

pydustmasker.LongdustMasker

Identify and mask low-complexity regions in nucleotide sequences using the Longdust algorithm.

Parameters:

Name Type Description Default

sequence

str

A string representing the nucleotide sequence to be processed. Characters other than 'A', 'C', 'G', 'T', 'a', 'c', 'g', 't' will be considered ambiguous bases. The minimum allowed sequence length is 4 bases.

required

window_size

int

Maximum size of the sliding window used to scan for low-complexity regions. Larger windows can detect longer repeats but increase memory usage. For optimal performance, keep window_size < 4^kmer.

5000

score_threshold

float

Score threshold for identifying low-complexity regions. Higher values result in fewer regions being masked.

0.6

kmer

int

The k-mer length used by the Longdust algorithm. Must be at least 1.

7

gc

float | 'auto' | None

GC content for bias correction. If None (default), assume a uniform base composition. If 'auto', compute GC from the input sequence. If a float between 0.0 and 1.0, use that value.

None

xdrop

int | None

Maximum allowable score drop for X-drop extension termination. During backward scanning, extension continues as long as (max_score - current_score) remains below (score_threshold * xdrop). Once the score drops by more than this amount from the peak score observed during the scan, extension stops immediately. Lower values enforce stricter extensions and tighter boundaries, potentially truncating part of the low-complexity region, whereas higher values allow more permissive extensions and looser boundaries, which may include non-low-complexity regions. If set to None, X-drop is disabled.

50

min_start_cnt

int

Minimum k-mer frequency in the window to trigger a backward scan. Only when a k-mer appears at least this many times does the algorithm attempt to identify a low-complexity region starting at that position. Must be at least 2. Lower values are more sensitive but slower, while higher values will result in faster processing but may miss shorter repeats.

3

approx

bool

If True, use approximate mode for guaranteed O(L*w) time complexity. In this mode, only the first candidate starting position is examined during backward scanning, rather than checking all candidates to find the optimal one.

False

forward_only

bool

If True, only process the forward strand. By default, both strands are processed.

False

Attributes:

Name Type Description
sequence str

The nucleotide sequence that was provided as input.

window_size int

The size of the sliding window used to scan for low-complexity regions.

score_threshold int

Score threshold for determining low-complexity regions.

kmer int

k-mer length.

gc float | 'auto' | None

Option used for GC bias correction. Can be None (a uniform base composition was assumed), 'auto' (GC was computed from the input sequence), or a float between 0.0 and 1.0 (provided by the user).

xdrop int | None

Extension X-drop length.

min_start_cnt int

Minimum k-mer frequency to trigger backward scan.

approx bool

Whether approximate mode was enabled.

forward_only bool

Whether only the forward strand was processed.

intervals tuple of tuples

A tuple of tuples representing the start and end positions of the low-complexity regions identified in the sequence.

n_masked_bases int

The total number of bases that were masked.

Methods:

Name Description
mask

Returns the sequence with low-complexity regions masked.

Raises:

Type Description
ValueError

If the input parameters violate the following constraints:

  • sequence length < kmer + 1
  • window_size < kmer + 1
  • kmer is 0
  • score_threshold <= 0.0
  • min_start_cnt < 2
  • xdrop is 0
  • gc is invalid (not 'auto', None, or float between 0.0 and 1.0)
TypeError

If the input parameters are not of the expected type.

OverflowError

If a negative integer is passed to window_size, kmer, xdrop, or min_start_cnt.

Functions

mask

mask(hard: bool = False) -> str

Returns the sequence with low-complexity regions masked.

Parameters:

Name Type Description Default
hard
bool

If True, low-complexity regions will be masked with 'N' characters. By default, bases within low-complexity regions are converted to lowercase (i.e., soft-masking).

False