Quick start
pydustmasker is a Python library for efficient identification and masking of low-complexity regions in nucleotide sequences. Below, we describe the basic steps required to install and use the library. For a complete description of the available functionality, refer to the API reference.
Installation
pydustmasker can be installed from PyPI with pip or uv, or from Bioconda with pixi, conda, or mamba.
pip install pydustmasker
uv init my_project && cd my_project # (1)!
uv add pydustmasker # (2)!
- Create a new Python project with uv and navigate into the project directory.
- Add pydustmasker as a dependency to your project.
pixi init my_workspace && cd my_workspace # (1)!
pixi workspace channel add bioconda # (2)!
pixi add pydustmasker # (3)!
- Create a new Pixi workspace and navigate into the workspace directory.
- Add Bioconda to the list of channels of your Pixi workspace.
- Add pydustmasker to your Pixi workspace.
conda create -n my_environment -c conda-forge -c bioconda pydustmasker # (1)!
conda activate my_environment # (2)!
- Create a new Conda environment with pydustmasker.
- Activate the environment.
mamba create -n my_environment -c conda-forge -c bioconda pydustmasker # (1)!
mamba activate my_environment # (2)!
- Create a new Mamba environment with pydustmasker.
- Activate the environment.
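Whichever installer you use, you may want to confirm that the package is visible to the interpreter you intend to run. The sketch below uses only the standard library. The helper name installed_version is hypothetical (it is not part of pydustmasker), and it assumes the package was installed with standard distribution metadata under the name pydustmasker.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution="pydustmasker"):
    """Return the installed version string, or None if the package is absent.

    Hypothetical helper for illustration; not part of pydustmasker.
    """
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


found = installed_version()
if found is None:
    print("pydustmasker is not installed in this environment")
else:
    print(f"pydustmasker {found} is installed")
```

Run it with the Python interpreter of the environment you just created, since each environment keeps its own set of installed packages.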
Usage
To identify and mask low-complexity regions in a nucleotide sequence, create an instance of a masker class and provide your sequence to it. A masker class implements a specific low-complexity detection algorithm and provides methods to retrieve the detected regions and to generate a masked version of the sequence. pydustmasker provides two such classes, corresponding to different detection algorithms: SDUST and Longdust. The SDUST algorithm is implemented in the DustMasker class, while the Longdust algorithm is implemented in the LongdustMasker class.
>>> import pydustmasker
>>> seq = "CGTATATATATAGTATGCGTACTGGGGGGGCT"
>>> masker = pydustmasker.DustMasker(seq)
>>> len(masker) # (1)!
1
>>> masker.n_masked_bases # (2)!
7
>>> masker.intervals # (3)!
((23, 30),)
>>> for start, end in masker: # (4)!
...     print(f"{start}-{end}: {seq[start:end]}")
23-30: GGGGGGG
- The len() function returns the number of low-complexity regions detected in the sequence.
- The n_masked_bases attribute returns the total number of bases within low-complexity regions.
- The intervals attribute returns a tuple of low-complexity regions detected in the sequence, represented as (start, end) index pairs.
- The masker object is iterable, yielding (start, end) index pairs for each low-complexity region.
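Because the intervals are plain (start, end) tuples that work directly as Python slice bounds, they are easy to post-process without the library. The sketch below computes the masked fraction of a sequence and the complementary unmasked intervals. Both helper names are hypothetical, and the code assumes the intervals are sorted and non-overlapping, which holds for the example above but is not stated here as a guarantee.

```python
# Values copied from the example above; no pydustmasker import is needed.
seq = "CGTATATATATAGTATGCGTACTGGGGGGGCT"
intervals = ((23, 30),)


def masked_fraction(intervals, seq_length):
    """Fraction of the sequence covered by half-open (start, end) intervals."""
    if seq_length == 0:
        return 0.0
    return sum(end - start for start, end in intervals) / seq_length


def unmasked_intervals(intervals, seq_length):
    """Return the gaps between intervals (assumed sorted and non-overlapping)."""
    gaps = []
    position = 0
    for start, end in intervals:
        if start > position:
            gaps.append((position, start))
        position = end
    if position < seq_length:
        gaps.append((position, seq_length))
    return tuple(gaps)


print(masked_fraction(intervals, len(seq)))     # 0.21875
print(unmasked_intervals(intervals, len(seq)))  # ((0, 23), (30, 32))
```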
You can generate a masked version of the sequence using the mask() method. By default, low-complexity regions are soft-masked by converting bases to lowercase. Setting the hard parameter to True enables hard-masking, in which affected bases are replaced with the ambiguous nucleotide N.
>>> masker.mask()
'CGTATATATATAGTATGCGTACTgggggggCT'
>>> masker.mask(hard=True)
'CGTATATATATAGTATGCGTACTNNNNNNNCT'
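To make the difference between the two modes concrete, the following pure-Python sketch applies soft- or hard-masking to a sequence given a tuple of (start, end) intervals. It only illustrates what the output means and is not how the library implements mask(). The function name apply_mask is hypothetical, and the intervals are assumed to be sorted and non-overlapping.

```python
def apply_mask(seq, intervals, hard=False):
    """Mask half-open (start, end) intervals of seq.

    Soft-masking lowercases the bases; hard-masking replaces them with N.
    Hypothetical helper for illustration; intervals are assumed sorted
    and non-overlapping.
    """
    pieces = []
    position = 0
    for start, end in intervals:
        pieces.append(seq[position:start])  # untouched stretch before the region
        region = seq[start:end]
        pieces.append("N" * len(region) if hard else region.lower())
        position = end
    pieces.append(seq[position:])  # untouched tail after the last region
    return "".join(pieces)


seq = "CGTATATATATAGTATGCGTACTGGGGGGGCT"
intervals = ((23, 30),)  # the region reported in the example above

print(apply_mask(seq, intervals))             # CGTATATATATAGTATGCGTACTgggggggCT
print(apply_mask(seq, intervals, hard=True))  # CGTATATATATAGTATGCGTACTNNNNNNNCT
```

Soft-masking keeps the original bases recoverable, whereas hard-masking discards them, so downstream tools that ignore case need the hard-masked form to skip the regions.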
The identification of low-complexity regions can be tuned via algorithm-specific parameters. Both DustMasker and LongdustMasker provide multiple options, documented in the API reference, that control how low-complexity regions are determined. One shared parameter is score_threshold, which controls detection stringency: lowering this threshold results in more regions being classified as low-complexity, whereas increasing it restricts detection to the most clearly low-complexity regions.
>>> masker = pydustmasker.DustMasker(seq, score_threshold=10) # (1)!
>>> len(masker)
2
>>> masker.intervals
((2, 12), (23, 30))
>>> masker.mask()
'CGtatatatataGTATGCGTACTgggggggCT'
- The default score_threshold for DustMasker is 20. Setting it to 10 results in more low-complexity regions being detected.
Processing sequences in parallel
When working with large numbers of sequences, you can run pydustmasker in parallel to process multiple sequences at the same time. This can substantially reduce the total time needed to process all sequences.
The example below uses Biopython to parse a FASTA file containing multiple sequences, which are then processed in parallel using a pool of worker processes from the multiprocessing module. Each sequence record is submitted to the worker pool via imap and processed with LongdustMasker to identify low-complexity regions using the Longdust algorithm. The resulting intervals are written to the output file as they become available.
#!/usr/bin/env python
import multiprocessing.pool

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

import pydustmasker

INPUT_FILE = "sequences.fna"
OUTPUT_FILE = "lc_intervals.tsv"

Intervals = tuple[tuple[int, int], ...]


def process_record(record: SeqRecord) -> tuple[str, Intervals]: # (1)!
    masker = pydustmasker.LongdustMasker(str(record.seq))
    return str(record.id), masker.intervals


if __name__ == "__main__":
    with open(OUTPUT_FILE, "w") as f, multiprocessing.pool.Pool() as pool: # (2)!
        records = SeqIO.parse(INPUT_FILE, "fasta")
        for name, intervals in pool.imap(process_record, records): # (3)!
            for start, end in intervals:
                f.write(f"{name}\t{start}\t{end}\n")
- The process_record() function encapsulates the computation performed on a single sequence record. A wrapper like this is necessary because process pools distribute work by invoking a single function for each item.
- A process pool is created using multiprocessing.pool.Pool(), which automatically manages a set of worker processes. If you are using a free-threaded version of Python, you can also use multiprocessing.pool.ThreadPool() to create a pool of threads instead of processes.
- SeqIO.parse() is used to lazily read sequence records from the input FASTA file, avoiding the need to load all sequences into memory at once. Each record is submitted to the process pool via pool.imap(), which returns an iterator that yields results in the order they were submitted.
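If the order of the output does not matter, the standard library's imap_unordered() is an alternative to imap(): it yields each result as soon as it is ready rather than in submission order. The sketch below shows the pattern with a thread pool so that it runs as a plain script. To stay self-contained it does not call pydustmasker: find_homopolymer_runs is a hypothetical stand-in for the masker call, and it only reports runs of a single repeated base, which is far simpler than SDUST or Longdust.

```python
from multiprocessing.pool import ThreadPool


def find_homopolymer_runs(item, min_length=5):
    """Hypothetical stand-in for the masker: find runs of one repeated base.

    Takes a (name, sequence) pair and returns (name, intervals), where
    intervals are half-open (start, end) pairs.
    """
    name, seq = item
    runs = []
    start = 0
    for i in range(1, len(seq) + 1):
        # A run ends at the end of the sequence or when the base changes.
        if i == len(seq) or seq[i] != seq[start]:
            if i - start >= min_length:
                runs.append((start, i))
            start = i
    return name, tuple(runs)


sequences = [
    ("seq1", "CGTATATATATAGTATGCGTACTGGGGGGGCT"),
    ("seq2", "ACGTACGTACGT"),
    ("seq3", "AAAAAACGTTTTTT"),
]

with ThreadPool(2) as pool:
    # Results may arrive in any order, so collect them by name.
    results = dict(pool.imap_unordered(find_homopolymer_runs, sequences))

for name in sorted(results):
    print(name, results[name])
```

The same imap_unordered() call is available on multiprocessing.pool.Pool. Because results can arrive in any order, each one must carry its own identifier, as the returned name does here.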