Quick start

pydustmasker is a Python library for efficient identification and masking of low-complexity regions in nucleotide sequences. Below, we describe the basic steps required to install and use the library. For a complete description of the available functionality, refer to the API reference.

Installation

pydustmasker can be installed from PyPI with pip or uv, or from Bioconda with pixi, conda, or mamba.

pip

pip install pydustmasker

uv

uv init my_project && cd my_project # (1)!
uv add pydustmasker # (2)!

(1) Create a new Python project with uv and navigate into the project directory.
(2) Add pydustmasker as a dependency to your project.

Pixi

pixi init my_workspace && cd my_workspace # (1)!
pixi workspace channel add bioconda # (2)!
pixi add pydustmasker # (3)!

(1) Create a new Pixi workspace and navigate into the workspace directory.
(2) Add Bioconda to the list of channels of your Pixi workspace.
(3) Add pydustmasker to your Pixi workspace.

Conda

conda create -n my_environment -c conda-forge -c bioconda pydustmasker # (1)!
conda activate my_environment # (2)!

(1) Create a new Conda environment with pydustmasker.
(2) Activate the environment.

Mamba

mamba create -n my_environment -c conda-forge -c bioconda pydustmasker # (1)!
mamba activate my_environment # (2)!

(1) Create a new Mamba environment with pydustmasker.
(2) Activate the environment.

Usage

To identify and mask low-complexity regions in a nucleotide sequence, create an instance of a masker class and provide your sequence to it. A masker class implements a specific low-complexity detection algorithm and provides methods to retrieve the detected regions and to generate a masked version of the sequence. pydustmasker provides two such classes, corresponding to different detection algorithms: SDUST and Longdust. The SDUST algorithm is implemented in the DustMasker class, while the Longdust algorithm is implemented in the LongdustMasker class.

>>> import pydustmasker
>>> seq = "CGTATATATATAGTATGCGTACTGGGGGGGCT"
>>> masker = pydustmasker.DustMasker(seq)
>>> len(masker) # (1)!
1
>>> masker.n_masked_bases # (2)!
7
>>> masker.intervals # (3)!
((23, 30),)
>>> for start, end in masker: # (4)!
...     print(f"{start}-{end}: {seq[start:end]}")
23-30: GGGGGGG

(1) The len() function returns the number of low-complexity regions detected in the sequence.
(2) The n_masked_bases attribute returns the total number of bases within low-complexity regions.
(3) The intervals attribute returns a tuple of low-complexity regions detected in the sequence, represented as (start, end) index pairs.
(4) The masker object is iterable, yielding (start, end) index pairs for each low-complexity region.
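
Because each (start, end) pair works directly as slice bounds (seq[23:30] yields the seven G bases in the example above), the end index is exclusive and the length of a region is end - start. The sketch below uses a plain tuple in the same shape as masker.intervals to compute the fraction of a sequence that falls in low-complexity regions. The masked_fraction() helper is hypothetical and is not part of pydustmasker.

```python
def masked_fraction(seq_length: int, intervals) -> float:
    """Return the fraction of bases covered by half-open (start, end) intervals.

    Hypothetical helper for illustration; assumes the intervals do not overlap.
    """
    if seq_length == 0:
        return 0.0
    covered = sum(end - start for start, end in intervals)
    return covered / seq_length


# Values taken from the example above: one region (23, 30) in a 32-base sequence.
fraction = masked_fraction(32, ((23, 30),))
print(f"{fraction:.1%} of the sequence is low-complexity")
```

With a real masker, the same value could be obtained as masker.n_masked_bases / len(seq).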

You can generate a masked version of the sequence using the mask() method. By default, low-complexity regions are soft-masked by converting bases to lowercase. Setting the hard parameter to True enables hard-masking, in which affected bases are replaced with the ambiguous nucleotide N.

>>> masker.mask()
'CGTATATATATAGTATGCGTACTgggggggCT'
>>> masker.mask(hard=True)
'CGTATATATATAGTATGCGTACTNNNNNNNCT'
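
To make the difference between the two masking modes concrete, the following sketch applies a set of (start, end) intervals to a sequence by hand. It only illustrates what soft- and hard-masking mean and is not how pydustmasker implements mask(). The apply_mask() function is hypothetical and assumes the intervals are sorted and non-overlapping.

```python
def apply_mask(seq: str, intervals, hard: bool = False) -> str:
    """Mask half-open (start, end) intervals in seq.

    Illustrative only; assumes sorted, non-overlapping intervals.
    """
    parts = []
    prev_end = 0
    for start, end in intervals:
        # Copy the unmasked stretch before this region unchanged.
        parts.append(seq[prev_end:start])
        region = seq[start:end]
        # Hard-masking replaces bases with N; soft-masking lowercases them.
        parts.append("N" * len(region) if hard else region.lower())
        prev_end = end
    # Copy whatever follows the last region.
    parts.append(seq[prev_end:])
    return "".join(parts)


seq = "CGTATATATATAGTATGCGTACTGGGGGGGCT"
print(apply_mask(seq, ((23, 30),)))
print(apply_mask(seq, ((23, 30),), hard=True))
```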

The identification of low-complexity regions can be tuned via algorithm-specific parameters. Both DustMasker and LongdustMasker provide multiple options, documented in the API reference, that control how low-complexity regions are determined. One shared parameter is score_threshold, which controls detection stringency: lowering this threshold results in more regions being classified as low-complexity, whereas increasing it restricts detection to the most clearly low-complexity regions.

>>> masker = pydustmasker.DustMasker(seq, score_threshold=10) # (1)!
>>> len(masker)
2
>>> masker.intervals
((2, 12), (23, 30))
>>> masker.mask()
'CGtatatatataGTATGCGTACTgggggggCT'

(1) The default score_threshold for DustMasker is 20. Setting it to 10 results in more low-complexity regions being detected.
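
For intuition about why the TA repeat is only reported at the lower threshold, the sketch below computes a simplified triplet-based complexity score of the kind described for the SDUST algorithm: count how often each 3-mer occurs, sum c * (c - 1) / 2 over the counts, and divide by the number of triplets minus one. The factor of 10, which puts the score on the same scale as a threshold of 20, is an assumption based on common SDUST implementations. The triplet_score() function is hypothetical, and the actual algorithm additionally searches windows for the best-scoring intervals, so treat this as an approximation and not as pydustmasker's implementation.

```python
from collections import Counter


def triplet_score(subseq: str) -> float:
    """Simplified SDUST-style score of a subsequence (illustrative only)."""
    triplets = [subseq[i:i + 3] for i in range(len(subseq) - 2)]
    if len(triplets) < 2:
        return 0.0
    counts = Counter(triplets)
    # Each pair of identical triplets contributes one unit to the score.
    repeats = sum(c * (c - 1) // 2 for c in counts.values())
    # Assumed scaling by 10 so the value is comparable to score_threshold.
    return 10 * repeats / (len(triplets) - 1)


print(triplet_score("GGGGGGG"))     # the (23, 30) region
print(triplet_score("TATATATATA"))  # the (2, 12) region
```

Under these assumptions, the G homopolymer scores above 20, while the TA repeat scores between 10 and 20, which is consistent with the intervals reported in the two examples above.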

Processing sequences in parallel

When working with large numbers of sequences, you can run pydustmasker in parallel to process multiple sequences at the same time. This can substantially reduce the total time needed to process all sequences.

The example below uses Biopython to parse a FASTA file containing multiple sequences, which are then processed in parallel using a pool of worker processes from the multiprocessing module. Each sequence record is submitted to the worker pool via imap and processed with LongdustMasker to identify low-complexity regions using the Longdust algorithm. Results are yielded in input order, and the intervals of each record are written to the output file as soon as its result is yielded.

#!/usr/bin/env python

import multiprocessing.pool

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

import pydustmasker

INPUT_FILE = "sequences.fna"
OUTPUT_FILE = "lc_intervals.tsv"

Intervals = tuple[tuple[int, int], ...]


def process_record(record: SeqRecord) -> tuple[str, Intervals]: # (1)!
    masker = pydustmasker.LongdustMasker(str(record.seq))
    return str(record.id), masker.intervals


if __name__ == "__main__":
    with open(OUTPUT_FILE, "w") as f, multiprocessing.pool.Pool() as pool: # (2)!
        records = SeqIO.parse(INPUT_FILE, "fasta")
        for name, intervals in pool.imap(process_record, records): # (3)!
            for start, end in intervals:
                f.write(f"{name}\t{start}\t{end}\n")

(1) The process_record() function encapsulates the computation performed on a single sequence record. A wrapper like this is necessary because process pools distribute work by invoking a single function for each item.
(2) A process pool is created using multiprocessing.pool.Pool(), which automatically manages a set of worker processes. If you are using a free-threaded version of Python, you can also use multiprocessing.pool.ThreadPool() to create a pool of threads instead of processes.
(3) SeqIO.parse() is used to lazily read sequence records from the input FASTA file, avoiding the need to load all sequences into memory at once. Each record is submitted to the process pool via pool.imap(), which returns an iterator that yields results in the order they were submitted.
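
The ordering behavior of imap() and the ThreadPool alternative mentioned above can be tried without any input files. The sketch below is a dependency-free variation of the script: find_runs() is a hypothetical stand-in for LongdustMasker that simply reports homopolymer runs, and records are plain (name, sequence) tuples in place of Biopython SeqRecord objects. The chunksize argument of imap() is a standard multiprocessing option that batches several items per task, which may reduce overhead when there are many short sequences.

```python
from multiprocessing.pool import ThreadPool

Intervals = tuple[tuple[int, int], ...]


def find_runs(seq: str, min_len: int = 5) -> Intervals:
    """Hypothetical stand-in for a masker: (start, end) spans of homopolymer runs."""
    intervals = []
    start = 0
    for i in range(1, len(seq) + 1):
        # A run ends at the end of the sequence or when the base changes.
        if i == len(seq) or seq[i] != seq[start]:
            if i - start >= min_len:
                intervals.append((start, i))
            start = i
    return tuple(intervals)


def process_record(record: tuple[str, str]) -> tuple[str, Intervals]:
    name, seq = record
    return name, find_runs(seq)


def run(records, n_workers: int = 2, chunksize: int = 1):
    # imap() yields results in submission order, regardless of which
    # worker finishes first.
    with ThreadPool(n_workers) as pool:
        return list(pool.imap(process_record, records, chunksize=chunksize))


records = [("a", "ACGTTTTTTA"), ("b", "ACGT"), ("c", "GGGGGG")]
for name, intervals in run(records):
    print(name, intervals)
```

If output order does not matter, imap_unordered() is a drop-in alternative that yields each result as soon as it is ready.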