API reference

pydustmasker.DustMasker

Identify and mask low-complexity regions in nucleotide sequences using the symmetric DUST (SDUST) algorithm from DustMasker.

Parameters:

Name	Type	Description	Default
`sequence`	`str`	The nucleotide sequence to be processed. Characters other than 'A', 'C', 'G', 'T', 'a', 'c', 'g', 't' will be considered ambiguous bases. The minimum allowed sequence length is 4 bases.	required
`window_size`	`int`	The length of the window used by the symmetric DUST algorithm. The minimum allowed value is 4.	`64`
`score_threshold`	`int`	Score threshold for identifying low-complexity regions. Higher values result in fewer regions being masked.	`20`

Attributes:

Name	Type	Description
`sequence`	`str`	The nucleotide sequence that was provided as input.
`window_size`	`int`	The length of the window used by the symmetric DUST algorithm.
`score_threshold`	`int`	Score threshold for identifying low-complexity regions.
`intervals`	`tuple of tuples`	A tuple of tuples representing the start and end positions of the low-complexity regions identified in the sequence.
`n_masked_bases`	`int`	The total number of bases that were masked.
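
The relationship between `intervals` and `n_masked_bases` can be sketched in plain Python. This is only an illustration: it assumes the intervals are 0-based, half-open `(start, end)` pairs that do not overlap, which the table above does not state, and the helper `count_masked_bases` is hypothetical rather than part of pydustmasker.

```python
def count_masked_bases(intervals):
    """Sum the lengths of (start, end) intervals.

    Assumption: intervals are 0-based, half-open, and non-overlapping.
    """
    return sum(end - start for start, end in intervals)


# Hypothetical value shaped like the `intervals` attribute.
example_intervals = ((2, 10), (15, 20))
total = count_masked_bases(example_intervals)
```

Under those assumptions, the total should match what `n_masked_bases` reports for the same intervals.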

Methods:

Name	Description
`mask`	Returns the sequence with low-complexity regions masked.

Raises:

Type	Description
`ValueError`	If the input parameters violate any of the following constraints: sequence length < 4; `window_size` < 4.
`TypeError`	If the input parameters are not of the expected type.
`OverflowError`	If a negative integer is passed to `window_size` or `score_threshold`.

Functions

mask

mask(hard: bool = False) -> str

Returns the sequence with low-complexity regions masked.

Parameters:

Name	Type	Description	Default
`hard`	`bool`	If True, low-complexity regions will be masked with 'N' characters. By default, bases within low-complexity regions are converted to lowercase (i.e., soft-masking).	`False`
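
The difference between soft and hard masking can be sketched without the library. The helper `apply_mask` below is hypothetical and only mimics the documented behavior of `mask`; it assumes 0-based, half-open `(start, end)` intervals, and the interval used in the example is made up for illustration.

```python
def apply_mask(sequence, intervals, hard=False):
    """Mask the given intervals of a sequence.

    Soft masking lowercases the bases; hard masking replaces them with 'N'.
    Assumption: intervals are 0-based, half-open (start, end) pairs.
    """
    chars = list(sequence)
    for start, end in intervals:
        for i in range(start, end):
            chars[i] = "N" if hard else chars[i].lower()
    return "".join(chars)


seq = "ACGT" + "A" * 8 + "CGT"
regions = ((4, 12),)  # hypothetical low-complexity interval

soft_masked = apply_mask(seq, regions)             # lowercase inside the interval
hard_masked = apply_mask(seq, regions, hard=True)  # 'N' inside the interval
```

Soft masking keeps the original bases recoverable, while hard masking discards them, so hard masking suits tools that ignore case.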

pydustmasker.LongdustMasker

Identify and mask low-complexity regions in nucleotide sequences using the Longdust algorithm.

Parameters:

Name	Type	Description	Default
`sequence`	`str`	A string representing the nucleotide sequence to be processed. Characters other than 'A', 'C', 'G', 'T', 'a', 'c', 'g', 't' will be considered ambiguous bases. The minimum allowed sequence length is 4 bases.	required
`window_size`	`int`	Maximum size of the sliding window used to scan for low-complexity regions. Larger windows can detect longer repeats but increase memory usage. For optimal performance, keep window_size < 4^kmer.	`5000`
`score_threshold`	`float`	Score threshold for identifying low-complexity regions. Higher values result in fewer regions being masked.	`0.6`
`kmer`	`int`	The k-mer length used by the Longdust algorithm. Must be at least 1.	`7`
`gc`	`float \| 'auto' \| None`	GC content for bias correction. If None (default), assume a uniform base composition. If 'auto', compute GC from the input sequence. If a float between 0.0 and 1.0, use that value.	`None`
`xdrop`	`int \| None`	Maximum allowable score drop for X-drop extension termination. During backward scanning, extension continues as long as (max_score - current_score) remains below (score_threshold * xdrop). Once the score drops by more than this amount from the peak score observed during the scan, extension stops immediately. Lower values enforce stricter extensions and tighter boundaries, potentially truncating part of the low-complexity region, whereas higher values allow more permissive extensions and looser boundaries, which may include non-low-complexity regions. If set to None, X-drop is disabled.	`50`
`min_start_cnt`	`int`	Minimum k-mer frequency in the window to trigger a backward scan. Only when a k-mer appears at least this many times does the algorithm attempt to identify a low-complexity region starting at that position. Must be at least 2. Lower values are more sensitive but slower, while higher values will result in faster processing but may miss shorter repeats.	`3`
`approx`	`bool`	If True, use approximate mode for guaranteed O(L*w) time complexity. In this mode, only the first candidate starting position is examined during backward scanning, rather than checking all candidates to find the optimal one.	`False`
`forward_only`	`bool`	If True, only process the forward strand. By default, both strands are processed.	`False`
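
Two of the parameter descriptions above involve simple arithmetic: the X-drop stopping rule and the `window_size < 4^kmer` guideline. The sketch below restates both in plain Python. The function names are hypothetical, the scores are placeholders, and the behavior when the drop exactly equals `score_threshold * xdrop` is an assumption, since the description does not pin it down.

```python
def xdrop_should_stop(max_score, current_score, score_threshold=0.6, xdrop=50):
    """Return True when the described X-drop rule would end the extension.

    Extension continues while (max_score - current_score) stays below
    score_threshold * xdrop. Passing xdrop=None disables the check.
    Assumption: a drop exactly equal to the limit also stops the extension.
    """
    if xdrop is None:
        return False
    return (max_score - current_score) >= score_threshold * xdrop


def window_fits_kmer_space(window_size=5000, kmer=7):
    """Check the documented guideline window_size < 4^kmer."""
    return window_size < 4 ** kmer


# With the defaults, the allowed drop is 0.6 * 50 = 30 score units.
keeps_going = not xdrop_should_stop(max_score=100.0, current_score=90.0)
stops = xdrop_should_stop(max_score=100.0, current_score=60.0)
```

With the default `kmer` of 7 there are 4^7 = 16384 possible k-mers, so the default `window_size` of 5000 satisfies the guideline; with `kmer=6` (4096 k-mers) it would not.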

Attributes:

Name	Type	Description
`sequence`	`str`	The nucleotide sequence that was provided as input.
`window_size`	`int`	The size of the sliding window used to scan for low-complexity regions.
`score_threshold`	`float`	Score threshold for identifying low-complexity regions.
`kmer`	`int`	k-mer length.
`gc`	`float \| 'auto' \| None`	Option used for GC bias correction. Can be None (a uniform base composition was assumed), 'auto' (GC was computed from the input sequence), or a float between 0.0 and 1.0 (provided by the user).
`xdrop`	`int \| None`	X-drop value used to terminate extension, or None if X-drop was disabled.
`min_start_cnt`	`int`	Minimum k-mer frequency to trigger backward scan.
`approx`	`bool`	Whether approximate mode was enabled.
`forward_only`	`bool`	Whether only the forward strand was processed.
`intervals`	`tuple of tuples`	A tuple of tuples representing the start and end positions of the low-complexity regions identified in the sequence.
`n_masked_bases`	`int`	The total number of bases that were masked.
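
The three accepted forms of `gc` can be illustrated with a small resolver. This is a sketch, not the library's implementation: `resolve_gc` is a hypothetical helper, and ignoring ambiguous bases when estimating GC in `'auto'` mode is an assumption that the reference does not confirm.

```python
def resolve_gc(sequence, gc=None):
    """Turn a gc option into a GC fraction, or None for uniform composition."""
    if gc is None:
        return None  # uniform base composition is assumed
    if gc == "auto":
        # Assumption: only unambiguous bases contribute to the estimate.
        bases = [b for b in sequence.upper() if b in "ACGT"]
        if not bases:
            return None  # assumption: nothing to estimate from
        return sum(b in "GC" for b in bases) / len(bases)
    if isinstance(gc, float) and 0.0 <= gc <= 1.0:
        return gc
    raise ValueError("gc must be 'auto', None, or a float between 0.0 and 1.0")


auto_gc = resolve_gc("ACGTNN", gc="auto")  # G and C out of A, C, G, T
fixed_gc = resolve_gc("ACGTNN", gc=0.4)
```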

Methods:

Name	Description
`mask`	Returns the sequence with low-complexity regions masked.

Raises:

Type	Description
`ValueError`	If the input parameters violate any of the following constraints: sequence length < `kmer` + 1; `window_size` < `kmer` + 1; `kmer` is 0; `score_threshold` <= 0.0; `min_start_cnt` < 2; `xdrop` is 0; `gc` is invalid (not 'auto', None, or a float between 0.0 and 1.0).
`TypeError`	If the input parameters are not of the expected type.
`OverflowError`	If a negative integer is passed to `window_size`, `kmer`, `xdrop`, or `min_start_cnt`.
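
The `ValueError` constraints listed above can be checked ahead of time with a few comparisons. The helper below is a hypothetical pre-flight check written in plain Python; it returns the violated constraints instead of raising, does not model the `TypeError` or `OverflowError` cases, and leaves out the `gc` check for brevity.

```python
def longdust_violations(sequence, window_size=5000, score_threshold=0.6,
                        kmer=7, xdrop=50, min_start_cnt=3):
    """List the documented ValueError constraints that the arguments violate."""
    violations = []
    if len(sequence) < kmer + 1:
        violations.append("sequence length < kmer + 1")
    if window_size < kmer + 1:
        violations.append("window_size < kmer + 1")
    if kmer == 0:
        violations.append("kmer is 0")
    if score_threshold <= 0.0:
        violations.append("score_threshold <= 0.0")
    if min_start_cnt < 2:
        violations.append("min_start_cnt < 2")
    if xdrop == 0:  # xdrop=None is allowed and disables X-drop
        violations.append("xdrop is 0")
    return violations


ok = longdust_violations("ACGTACGTAC")  # 10 bases, enough for kmer=7
too_short = longdust_violations("ACGT")  # shorter than kmer + 1
```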

Functions

mask

mask(hard: bool = False) -> str

Returns the sequence with low-complexity regions masked.

Parameters:

Name	Type	Description	Default
`hard`	`bool`	If True, low-complexity regions will be masked with 'N' characters. By default, bases within low-complexity regions are converted to lowercase (i.e., soft-masking).	`False`
