API reference
pydustmasker.DustMasker
Identify and mask low-complexity regions in nucleotide sequences using the SDUST algorithm from DustMasker.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sequence | str | The nucleotide sequence to be processed. Characters other than 'A', 'C', 'G', 'T', 'a', 'c', 'g', 't' are treated as ambiguous bases. The minimum allowed sequence length is 4 bases. | required |
| window_size | int | The length of the window used by the symmetric DUST algorithm. The minimum allowed value is 4. | 64 |
| score_threshold | int | Score threshold for identifying low-complexity regions. Higher values result in fewer regions being masked. | 20 |
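The length limits and the definition of ambiguous bases given above can be checked before constructing a masker. The following is a minimal sketch in plain Python. The helper names `count_ambiguous` and `check_dust_inputs` are hypothetical and are not part of pydustmasker, and the sketch only reports problems rather than reproducing the exceptions the library itself raises.

```python
from typing import List

# Characters treated as unambiguous bases, per the parameter description.
VALID_BASES = frozenset("ACGTacgt")


def count_ambiguous(sequence: str) -> int:
    """Count characters that fall outside A/C/G/T in either case."""
    return sum(1 for base in sequence if base not in VALID_BASES)


def check_dust_inputs(sequence: str, window_size: int = 64) -> List[str]:
    """Return human-readable problems; an empty list means the inputs look fine.

    Hypothetical helper: it mirrors only the two documented minimums.
    """
    problems = []
    if len(sequence) < 4:
        problems.append("sequence is shorter than 4 bases")
    if window_size < 4:
        problems.append("window_size is below 4")
    return problems
```

Returning a list instead of raising keeps the helper usable for batch validation, where one bad record should not stop the run.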
Attributes:
| Name | Type | Description |
|---|---|---|
| sequence | str | The nucleotide sequence that was provided as input. |
| window_size | int | The length of the window used by the symmetric DUST algorithm. |
| score_threshold | int | Score threshold for identifying low-complexity regions. |
| intervals | tuple of tuples | A tuple of tuples representing the start and end positions of the low-complexity regions identified in the sequence. |
| n_masked_bases | int | The total number of bases that were masked. |
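The relationship between `intervals` and `n_masked_bases` can be illustrated without the library. This sketch assumes that each interval is a 0-based, half-open `(start, end)` pair and that intervals do not overlap. The reference above does not state that convention, so check it against real output before relying on it. The function names are hypothetical.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]


def masked_base_count(intervals: Iterable[Interval]) -> int:
    """Sum interval lengths, assuming half-open, non-overlapping (start, end) pairs."""
    return sum(end - start for start, end in intervals)


def extract_regions(sequence: str, intervals: Iterable[Interval]) -> List[str]:
    """Slice out the subsequence covered by each interval."""
    return [sequence[start:end] for start, end in intervals]


# Hand-written example data, not output from pydustmasker.
sequence = "ACGTTTTTTACG"
intervals = ((3, 9),)
print(masked_base_count(intervals))
print(extract_regions(sequence, intervals))
```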
Methods:
| Name | Description |
|---|---|
| mask | Returns the sequence with low-complexity regions masked. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the input parameters violate the following constraints: |
| TypeError | If the input parameters are not of the expected type. |
| OverflowError | If a negative integer is passed to |
Functions
mask
mask(hard: bool = False) -> str
Returns the sequence with low-complexity regions masked.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| hard | bool | If True, low-complexity regions are masked with 'N' characters. By default, bases within low-complexity regions are converted to lowercase (i.e., soft-masking). | False |
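The difference between soft and hard masking can be sketched in a few lines of plain Python. This illustrates the behavior described above and is not the library's implementation. The function `apply_mask` is hypothetical, and it assumes 0-based, half-open `(start, end)` intervals.

```python
from typing import Iterable, Tuple


def apply_mask(sequence: str, intervals: Iterable[Tuple[int, int]], hard: bool = False) -> str:
    """Mask the given intervals: lowercase when hard is False, 'N' when hard is True."""
    chars = list(sequence)
    for start, end in intervals:
        for i in range(start, end):
            chars[i] = "N" if hard else chars[i].lower()
    return "".join(chars)


# Hand-written interval covering the run of T bases.
print(apply_mask("ACGTTTTTTACG", ((3, 9),)))             # ACGttttttACG
print(apply_mask("ACGTTTTTTACG", ((3, 9),), hard=True))  # ACGNNNNNNACG
```

Soft masking preserves the original bases, so downstream tools can still read them. Hard masking discards them.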
pydustmasker.LongdustMasker
Identify and mask low-complexity regions in nucleotide sequences using the Longdust algorithm.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sequence | str | A string representing the nucleotide sequence to be processed. Characters other than 'A', 'C', 'G', 'T', 'a', 'c', 'g', 't' are treated as ambiguous bases. The minimum allowed sequence length is 4 bases. | required |
| window_size | int | Maximum size of the sliding window used to scan for low-complexity regions. Larger windows can detect longer repeats but increase memory usage. For optimal performance, keep window_size < 4^kmer. | 5000 |
| score_threshold | float | Score threshold for identifying low-complexity regions. Higher values result in fewer regions being masked. | 0.6 |
| kmer | int | The k-mer length used by the Longdust algorithm. Must be at least 1. | 7 |
| gc | float \| 'auto' \| None | GC content for bias correction. If None (default), assume a uniform base composition. If 'auto', compute GC from the input sequence. If a float between 0.0 and 1.0, use that value. | None |
| xdrop | int \| None | Maximum allowable score drop for X-drop extension termination. During backward scanning, extension continues as long as (max_score - current_score) remains below (score_threshold * xdrop). Once the score drops by more than this amount from the peak score observed during the scan, extension stops immediately. Lower values enforce stricter extensions and tighter boundaries, potentially truncating part of the low-complexity region, whereas higher values allow more permissive extensions and looser boundaries, which may include non-low-complexity regions. If set to None, X-drop is disabled. | 50 |
| min_start_cnt | int | Minimum k-mer frequency in the window to trigger a backward scan. The algorithm attempts to identify a low-complexity region starting at a position only when a k-mer appears at least this many times. Must be at least 2. Lower values are more sensitive but slower, while higher values are faster but may miss shorter repeats. | 3 |
| approx | bool | If True, use approximate mode for guaranteed O(L*w) time complexity. In this mode, only the first candidate starting position is examined during backward scanning, rather than checking all candidates to find the optimal one. | False |
| forward_only | bool | If True, only process the forward strand. By default, both strands are processed. | False |
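Two of the rules stated in the table are simple arithmetic and can be sketched directly: the X-drop stopping condition and the `window_size < 4^kmer` guideline. The code below is a simplified illustration over a precomputed list of scores, not the Longdust implementation. The function names are hypothetical, and how the library treats a drop exactly equal to the limit is not specified above.

```python
from typing import Optional, Sequence


def xdrop_stop_index(
    scores: Sequence[float],
    score_threshold: float = 0.6,
    xdrop: Optional[int] = 50,
) -> int:
    """Return the index at which extension stops, or len(scores) if it never stops.

    Extension stops once (max_score - current_score) exceeds score_threshold * xdrop.
    """
    if xdrop is None:  # X-drop disabled: scan everything
        return len(scores)
    limit = score_threshold * xdrop
    max_score = float("-inf")
    for i, score in enumerate(scores):
        max_score = max(max_score, score)
        if max_score - score > limit:
            return i
    return len(scores)


def window_within_kmer_space(window_size: int = 5000, kmer: int = 7) -> bool:
    """Check the documented guideline window_size < 4^kmer."""
    return window_size < 4 ** kmer
```

With the defaults, the drop limit is 0.6 * 50 = 30 and 4^7 = 16384, so the default window of 5000 satisfies the guideline.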
Attributes:
| Name | Type | Description |
|---|---|---|
| sequence | str | The nucleotide sequence that was provided as input. |
| window_size | int | The size of the sliding window used to scan for low-complexity regions. |
| score_threshold | float | Score threshold for determining low-complexity regions. |
| kmer | int | k-mer length. |
| gc | float \| 'auto' \| None | Option used for GC bias correction. Can be None (a uniform base composition was assumed), 'auto' (GC was computed from the input sequence), or a float between 0.0 and 1.0 (provided by the user). |
| xdrop | int \| None | Extension X-drop length. |
| min_start_cnt | int | Minimum k-mer frequency to trigger a backward scan. |
| approx | bool | Whether approximate mode was enabled. |
| forward_only | bool | Whether only the forward strand was processed. |
| intervals | tuple of tuples | A tuple of tuples representing the start and end positions of the low-complexity regions identified in the sequence. |
| n_masked_bases | int | The total number of bases that were masked. |
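The three forms of the `gc` option can be read as a small resolution step that turns the option into a GC fraction. The sketch below is an assumption-laden illustration, not the library's code. The function `resolve_gc` is hypothetical, a uniform base composition is taken to mean a GC fraction of 0.5, and the 'auto' mode here ignores ambiguous bases, which may differ from what pydustmasker does.

```python
from typing import Union


def resolve_gc(sequence: str, gc: Union[float, str, None] = None) -> float:
    """Turn the gc option into a GC fraction between 0.0 and 1.0."""
    if gc is None:
        # Uniform composition: A, C, G, T at 0.25 each, so G + C = 0.5.
        return 0.5
    if gc == "auto":
        # Assumption: only unambiguous bases contribute to the estimate.
        bases = [b for b in sequence.upper() if b in "ACGT"]
        if not bases:
            return 0.5
        return sum(b in "GC" for b in bases) / len(bases)
    if not 0.0 <= gc <= 1.0:
        raise ValueError("gc must be between 0.0 and 1.0")
    return float(gc)
```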
Methods:
| Name | Description |
|---|---|
| mask | Returns the sequence with low-complexity regions masked. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the input parameters violate the following constraints: |
| TypeError | If the input parameters are not of the expected type. |
| OverflowError | If a negative integer is passed to |
Functions
mask
mask(hard: bool = False) -> str
Returns the sequence with low-complexity regions masked.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| hard | bool | If True, low-complexity regions are masked with 'N' characters. By default, bases within low-complexity regions are converted to lowercase (i.e., soft-masking). | False |