# Using MAGpurify2
# Generating bins
We recommend generating bins with one or more binners and refining them with DAS Tool.
# Quick start
magpurify2 composition test_data/genomes/* output
magpurify2 coverage test_data/genomes/* output --bam_files test_data/bam_files/*
magpurify2 taxonomy test_data/genomes/* output magpurify2DB
magpurify2 filter test_data/genomes/* output filtered_genomes
# The `composition` module
`magpurify2 composition` is the command used to identify putative contaminants using tetranucleotide frequencies.
usage: magpurify2 composition [-h] [-s STRICTNESS] [-t THREADS] [--quiet] genomes [genomes ...] output_directory
positional arguments:
genomes Input genomes in the FASTA format.
output_directory Directory to write the output files to.
optional arguments:
-h, --help show this help message and exit
-s STRICTNESS, --strictness STRICTNESS
Strictness of the contaminant detection algorithm. Must be a number between 0 (less strict) and 1 (more strict). (default: 0.5)
-t THREADS, --threads THREADS
Number of threads to use. All by default. (default: 4)
--quiet Suppress the logger output (default: False)
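To make the idea of a tetranucleotide frequency profile concrete, the following is a minimal Python sketch that computes one for a single sequence. It is only an illustration of the concept and not MAGpurify2's implementation: the function name `tetranucleotide_frequencies` is hypothetical, windows containing ambiguous bases are simply skipped, and reverse-complement 4-mers are not merged, which the tool may or may not do.

```python
from collections import Counter
from itertools import product


def tetranucleotide_frequencies(sequence: str) -> dict:
    """Return the relative frequency of each of the 256 possible 4-mers.

    Hypothetical helper for illustration; not part of MAGpurify2.
    """
    sequence = sequence.upper()
    # All 4-mers over the unambiguous DNA alphabet (4**4 = 256 of them).
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    # Count every overlapping window of length 4 in the sequence.
    counts = Counter(sequence[i:i + 4] for i in range(len(sequence) - 3))
    # Only windows made of A, C, G and T contribute to the total,
    # so windows containing N or other IUPAC codes are ignored.
    total = sum(counts[k] for k in kmers)
    if total == 0:
        return {k: 0.0 for k in kmers}
    return {k: counts[k] / total for k in kmers}


profile = tetranucleotide_frequencies("ACGTACGTAC")
print(len(profile))
print(profile["ACGT"])
```

Contigs from the same genome tend to have similar profiles, which is why a contig whose profile differs markedly from the rest of its bin can be flagged as a putative contaminant.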
# The `coverage` module
## How to generate the BAM files
BAM files store the alignments of reads to the target metagenome (or MAG) and are used by MAGpurify2 to estimate the coverage of each contig. To generate the BAM inputs for MAGpurify2, first map your reads to the complete metagenome using a read mapper (such as Bowtie 2, minimap2 or BWA-MEM2) and then sort the output using samtools. For example:
# Create a Bowtie 2 index for your metagenome inside the 'bt2' directory:
$ mkdir bt2
$ bowtie2-build --threads 4 metagenome.fna bt2/metagenome
# Map the reads, sort the output and write it to 'sample1.bam':
$ bowtie2 --threads 4 -x bt2/metagenome \
-1 sample1_R1.fastq.gz -2 sample1_R2.fastq.gz \
| samtools sort -@ 4 -o sample1.bam -
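When several samples were sequenced, the mapping step has to be repeated for each one. The Python sketch below only builds the command line for each sample, reusing the exact Bowtie 2 and samtools options from the example above, so the commands can be inspected before being run. The function name `mapping_command`, the sample names `sample2` and `sample3`, and the `<sample>_R1.fastq.gz` / `<sample>_R2.fastq.gz` naming convention are assumptions made for illustration.

```python
import shlex


def mapping_command(sample: str, index: str = "bt2/metagenome", threads: int = 4) -> str:
    """Build the shell pipeline that maps one paired-end sample and sorts the output.

    Hypothetical helper: it assumes reads are named <sample>_R1.fastq.gz and
    <sample>_R2.fastq.gz, as in the example above.
    """
    bowtie2 = [
        "bowtie2", "--threads", str(threads), "-x", index,
        "-1", f"{sample}_R1.fastq.gz", "-2", f"{sample}_R2.fastq.gz",
    ]
    samtools = ["samtools", "sort", "-@", str(threads), "-o", f"{sample}.bam", "-"]
    # shlex.join quotes each argument so the result is safe to paste into a shell.
    return f"{shlex.join(bowtie2)} | {shlex.join(samtools)}"


# Hypothetical sample names; replace them with your own.
for sample in ["sample1", "sample2", "sample3"]:
    print(mapping_command(sample))
```

Each printed line can be executed in a shell, producing one sorted BAM file per sample that can then be passed to `--bam_files`.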
We recommend mapping the reads to the metagenome (superset) and not directly to the MAGs retrieved from it (subsets). There are two main reasons for that:
- When you map the reads to the MAG, there's a chance that reads originating from a closely related genome will be erroneously aligned to the MAG (cross-mapping), biasing the coverage estimate.
- Metagenome-wide mappings can be used to estimate the coverage of all the contigs in the metagenome, thus allowing MAGpurify2 to process multiple MAGs in a single execution.
If the target MAGs are derived from multiple source metagenomes, you need input BAM files containing read mappings to each one of them.
If you don't have access to the raw sequencing data or to previously generated BAM files, you can input coverage data stored in a tab-separated values (TSV) file. To do so, use the `--coverage_file` argument:
magpurify2 coverage genomes/* output --coverage_file contig_coverages.tsv --threads 4
The first column of the coverage file must store the contig names. The remaining columns should contain the coverage of each contig across multiple samples, as shown in the example below:
contig_1 15.744 12.605 25.148 3.728 0.000
contig_2 34.466 48.019 18.222 3.707 4.195
contig_3 0.000 22.356 21.944 4.479 4.463
contig_4 14.201 9.993 0.000 0.925 4.608
contig_5 17.179 12.280 56.643 3.586 4.226
contig_6 5.239 8.430 5.207 3.988 0.000
contig_7 17.737 16.005 29.692 4.039 4.190
contig_8 0.000 15.866 13.663 0.877 2.269
contig_9 19.129 15.145 21.249 0.000 2.342
contig_10 11.074 9.574 19.673 0.934 4.732
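Before passing a coverage file to `--coverage_file`, it can be useful to check that it is well formed. The following Python sketch reads a table in the layout described above and verifies that every contig has the same number of coverage values. It assumes a tab-separated file without a header row, as in the example; the function name `read_coverage_table` is hypothetical and the checks are not necessarily the ones MAGpurify2 performs.

```python
import csv
import io


def read_coverage_table(handle) -> dict:
    """Parse a coverage TSV into {contig_name: [coverage_per_sample, ...]}.

    Hypothetical helper: assumes no header row and one contig per line.
    """
    coverages = {}
    n_samples = None
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        contig = row[0]
        values = [float(v) for v in row[1:]]
        if n_samples is None:
            n_samples = len(values)  # the first row sets the expected width
        elif len(values) != n_samples:
            raise ValueError(
                f"{contig}: expected {n_samples} coverage values, got {len(values)}"
            )
        if contig in coverages:
            raise ValueError(f"duplicated contig name: {contig}")
        coverages[contig] = values
    return coverages


# Two rows taken from the example table, joined with tabs.
example = io.StringIO(
    "contig_1\t15.744\t12.605\t25.148\t3.728\t0.000\n"
    "contig_2\t34.466\t48.019\t18.222\t3.707\t4.195\n"
)
table = read_coverage_table(example)
print(table["contig_2"])
```

To check a real file, open it with `open("contig_coverages.tsv", newline="")` and pass the handle to the function.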