Reference

taxopy.TaxDb

Create an object of the TaxDb class.

Parameters:

Name Type Description Default
taxdb_dir str

A directory to download NCBI's taxonomy database files to. If the directory does not exist it will be created. If omitted, the current working directory is used.

None
nodes_dmp str

The path for a pre-downloaded nodes.dmp file. If both nodes.dmp and names.dmp are supplied NCBI's taxonomy database won't be downloaded.

None
names_dmp str

The path for a pre-downloaded names.dmp file. If both names.dmp and nodes.dmp are supplied NCBI's taxonomy database won't be downloaded.

None
merged_dmp str

The path for a pre-downloaded merged.dmp file.

None
taxdump_url str

The URL of the taxdump file (in .tar.gz) to be downloaded. By default, the latest version of NCBI's taxdump will be fetched.

None
keep_files bool

Keep the nodes.dmp, names.dmp, and merged.dmp files after the TaxDb object is created. By default the files are deleted, unless nodes_dmp and names_dmp were manually supplied. When files are deleted, the taxdb_dir directory is also removed if it is left empty and is not the current working directory.

False

Attributes:

Name Type Description
taxid2name dict

A dictionary where the keys are taxonomic identifiers and the values are their corresponding names.

taxid2parent dict

A dictionary where the keys are taxonomic identifiers and the values are the taxonomic identifiers of their corresponding parent taxon.

taxid2rank dict

A dictionary where the keys are taxonomic identifiers and the values are their corresponding ranks.

oldtaxid2newtaxid dict

A dictionary where the keys are legacy taxonomic identifiers and the values are their corresponding new identifiers. If pre-downloaded nodes.dmp and names.dmp files were provided but the merged.dmp file was not supplied, this attribute will be None.

Raises:

Type Description
DownloadError

If the download of the taxonomy database fails.

ExtractionError

If the decompression of the taxonomy database fails.
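
The parameter descriptions above imply three possible sources for the taxonomy files: explicitly supplied paths, copies already present in the target directory, or a fresh download. The sketch below restates that decision as a standalone function so the precedence is easy to see. `resolve_source` is a hypothetical helper written for this illustration and is not part of taxopy; it only mirrors the checks visible in the constructor listed in the source code.

```python
import os
import tempfile
from typing import Optional


def resolve_source(
    taxdb_dir: Optional[str] = None,
    nodes_dmp: Optional[str] = None,
    names_dmp: Optional[str] = None,
) -> str:
    """Hypothetical helper: report where the taxonomy files would come from."""
    # Both paths supplied explicitly: they are used as-is and nothing is downloaded.
    if nodes_dmp and names_dmp:
        return "explicit"
    # Otherwise look for existing copies in taxdb_dir (or the working directory).
    base = taxdb_dir or os.getcwd()
    cached = all(
        os.path.isfile(os.path.join(base, name))
        for name in ("nodes.dmp", "names.dmp")
    )
    return "cached" if cached else "download"


with tempfile.TemporaryDirectory() as tmp:
    # An empty directory would trigger a download.
    empty_dir_result = resolve_source(taxdb_dir=tmp)
    # Create empty placeholder files to stand in for a previous download.
    for name in ("nodes.dmp", "names.dmp"):
        open(os.path.join(tmp, name), "w").close()
    cached_result = resolve_source(taxdb_dir=tmp)
    # Supplying only one of the two paths does not count as "explicit".
    partial_result = resolve_source(taxdb_dir=tmp, nodes_dmp="a/nodes.dmp")

explicit_result = resolve_source(nodes_dmp="a/nodes.dmp", names_dmp="a/names.dmp")
```

Note the last case inside the `with` block: in this sketch, as in the constructor, passing only `nodes_dmp` without `names_dmp` falls through to the directory lookup.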

Source code in taxopy/core.py
class TaxDb:
    """
    Create an object of the TaxDb class.

    Parameters
    ----------
    taxdb_dir : str, optional
        A directory to download NCBI's taxonomy database files to. If the
        directory does not exist it will be created. If omitted, the current
        working directory is used.
    nodes_dmp : str, optional
        The path for a pre-downloaded `nodes.dmp` file. If both `nodes.dmp` and
        `names.dmp` are supplied NCBI's taxonomy database won't be downloaded.
    names_dmp : str, optional
        The path for a pre-downloaded `names.dmp` file. If both `names.dmp` and
        `nodes.dmp` are supplied NCBI's taxonomy database won't be downloaded.
    merged_dmp : str, optional
        The path for a pre-downloaded `merged.dmp` file.
    taxdump_url : str, optional
        The URL of the taxdump file (in .tar.gz) to be downloaded. By default,
        the latest version of NCBI's taxdump will be fetched.
    keep_files : bool, default False
        Keep the `nodes.dmp`, `names.dmp`, and `merged.dmp` files after the
        TaxDb object is created. By default the files are deleted, unless
        `nodes_dmp` and `names_dmp` were manually supplied. When files are
        deleted, the `taxdb_dir` directory is also removed if it is left empty
        and is not the current working directory.

    Attributes
    ----------
    taxid2name : dict
        A dictionary where the keys are taxonomic identifiers and the values are
        their corresponding names.
    taxid2parent: dict
        A dictionary where the keys are taxonomic identifiers and the values are
        the taxonomic identifiers of their corresponding parent taxon.
    taxid2rank: dict
        A dictionary where the keys are taxonomic identifiers and the values are
        their corresponding ranks.
    oldtaxid2newtaxid: dict
        A dictionary where the keys are legacy taxonomic identifiers and the
        values are their corresponding new identifiers. If pre-downloaded
        `nodes.dmp` and `names.dmp` files were provided but the `merged.dmp`
        file was not supplied, this attribute will be `None`.

    Raises
    ------
    DownloadError
        If the download of the taxonomy database fails.
    ExtractionError
        If the decompression of the taxonomy database fails.
    """

    def __init__(
        self,
        *,
        taxdb_dir: str = None,
        taxdump_url: str = None,
        nodes_dmp: str = None,
        names_dmp: str = None,
        merged_dmp: str = None,
        keep_files: bool = False,
    ):
        if not taxdb_dir:
            self._taxdb_dir = os.getcwd()
        elif not os.path.isdir(taxdb_dir):
            os.makedirs(taxdb_dir)
            self._taxdb_dir = taxdb_dir
        else:
            self._taxdb_dir = taxdb_dir
        # If `nodes_dmp` and `names_dmp` were not provided:
        if not nodes_dmp or not names_dmp:
            nodes_dmp_path = os.path.join(self._taxdb_dir, "nodes.dmp")
            names_dmp_path = os.path.join(self._taxdb_dir, "names.dmp")
            merged_dmp_path = os.path.join(self._taxdb_dir, "merged.dmp")
            # If the `nodes.dmp` and `names.dmp` files are not in the `taxdb_dir` directory,
            # download the taxonomy from NCBI:
            if not os.path.isfile(nodes_dmp_path) or not os.path.isfile(names_dmp_path):
                (
                    self._nodes_dmp,
                    self._names_dmp,
                    self._merged_dmp,
                ) = self._download_taxonomy(taxdump_url)
            else:
                self._nodes_dmp, self._names_dmp = nodes_dmp_path, names_dmp_path
                # If `merged.dmp` is not in the `taxdb_dir` directory, set the `_merged_dmp`
                # attribute to `None`:
                self._merged_dmp = (
                    merged_dmp_path if os.path.isfile(merged_dmp_path) else None
                )
        else:
            self._nodes_dmp, self._names_dmp = nodes_dmp, names_dmp
            # If `merged_dmp` was not provided, set the `_merged_dmp` attribute to `None`:
            self._merged_dmp = merged_dmp or None
        # If a `merged.dmp` file was provided or downloaded, create the oldtaxid2newtaxid
        # dictionary:
        self._oldtaxid2newtaxid = self._import_merged() if self._merged_dmp else None
        # Create the taxid2parent, taxid2rank, and taxid2name dictionaries:
        self._taxid2parent, self._taxid2rank = self._import_nodes()
        self._taxid2name = self._import_names()
        # Delete temporary files if `keep_files` is set to `False`, unless
        # `nodes_dmp` and `names_dmp` were manually supplied:
        if not keep_files and (not nodes_dmp or not names_dmp):
            self._delete_files()

    @property
    def taxid2name(self) -> Dict[int, str]:
        return self._taxid2name

    @property
    def taxid2parent(self) -> Dict[int, int]:
        return self._taxid2parent

    @property
    def taxid2rank(self) -> Dict[int, str]:
        return self._taxid2rank

    @property
    def oldtaxid2newtaxid(self) -> Dict[int, int]:
        return self._oldtaxid2newtaxid

    def _download_taxonomy(self, url: str = None):
        if not url:
            url = "ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz"
        tmp_taxonomy_file = os.path.join(self._taxdb_dir, "taxdump.tar.gz")
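        try:
            urllib.request.urlretrieve(url, tmp_taxonomy_file)
        except Exception as error:
            # Catch `Exception` rather than using a bare `except`, so that
            # KeyboardInterrupt and SystemExit are not swallowed.
            raise DownloadError(
                "Download of taxonomy files failed. The server may be offline."
            ) from error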
        try:
            nodes_dmp, names_dmp, merged_dmp = None, None, None
            with tarfile.open(tmp_taxonomy_file) as tf:
                for member in tf.getmembers():
                    if os.path.basename(member.name) == "nodes.dmp" and member.isfile():
                        nodes_dmp = ("nodes.dmp", tf.extractfile(member))
                    elif os.path.basename(member.name) == "names.dmp" and member.isfile():
                        names_dmp = ("names.dmp", tf.extractfile(member))
                    elif os.path.basename(member.name) == "merged.dmp" and member.isfile():
                        merged_dmp = ("merged.dmp", tf.extractfile(member))
                for p, m in [nodes_dmp, names_dmp, merged_dmp]:
                    with open(os.path.join(self._taxdb_dir, p), "wb") as fo:
                        while True:
                            chunk = m.read(1024)
                            if not chunk:
                                break
                            fo.write(chunk)
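        except Exception as error:
            # Chain the original exception so the underlying cause stays visible.
            raise ExtractionError(
                "Something went wrong while extracting the taxonomy files."
            ) from error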
        os.remove(tmp_taxonomy_file)
        return (
            os.path.join(self._taxdb_dir, "nodes.dmp"),
            os.path.join(self._taxdb_dir, "names.dmp"),
            os.path.join(self._taxdb_dir, "merged.dmp"),
        )

    def _import_merged(self):
        oldtaxid2newtaxid = {}
        with open(self._merged_dmp, "r") as f:
            for line in f:
                line = line.split("\t")
                taxid = int(line[0])
                merged = int(line[2])
                oldtaxid2newtaxid[taxid] = merged
        return oldtaxid2newtaxid

    def _import_nodes(self):
        taxid2parent = {}
        taxid2rank = {}
        with open(self._nodes_dmp, "r") as f:
            for line in f:
                line = line.split("\t")
                taxid = int(line[0])
                parent = int(line[2])
                rank = line[4].strip()
                taxid2parent[taxid] = parent
                taxid2rank[taxid] = rank
        if self._merged_dmp:
            for oldtaxid, newtaxid in self._oldtaxid2newtaxid.items():
                taxid2rank[oldtaxid] = taxid2rank[newtaxid]
                taxid2parent[oldtaxid] = taxid2parent[newtaxid]
        return taxid2parent, taxid2rank

    def _import_names(self):
        taxid2name = {}
        with open(self._names_dmp, "r") as f:
            for line in f:
                line = line.split("\t")
                if line[6] == "scientific name":
                    taxid = int(line[0])
                    name = line[2].strip()
                    taxid2name[taxid] = name
        if self._merged_dmp:
            for oldtaxid, newtaxid in self._oldtaxid2newtaxid.items():
                taxid2name[oldtaxid] = taxid2name[newtaxid]
        return taxid2name

    def _delete_files(self):
        os.remove(self._nodes_dmp)
        os.remove(self._names_dmp)
        if self._merged_dmp:
            os.remove(self._merged_dmp)
        if not os.listdir(self._taxdb_dir) and self._taxdb_dir != os.getcwd():
            os.rmdir(self._taxdb_dir)
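
The `_import_*` methods above rely on the layout of the `.dmp` files: splitting a line on tab characters leaves the `|` separators at odd positions, so the useful fields sit at indices 0, 2, 4, and 6. The following standalone sketch reproduces that parsing on a few toy lines. The data is heavily truncated, the legacy identifier 999999 is invented for illustration, and the `parse_*` function names are hypothetical stand-ins rather than taxopy functions.

```python
import io

# Toy excerpts in the tab-and-pipe layout; real files have more columns and rows.
NODES = (
    "1\t|\t1\t|\tno rank\t|\n"
    "9605\t|\t1\t|\tgenus\t|\n"
    "9606\t|\t9605\t|\tspecies\t|\n"
)
NAMES = (
    "1\t|\troot\t|\t\t|\tscientific name\t|\n"
    "9605\t|\tHomo\t|\t\t|\tscientific name\t|\n"
    "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n"
    "9606\t|\thuman\t|\t\t|\tgenbank common name\t|\n"
)
MERGED = "999999\t|\t9606\t|\n"  # hypothetical legacy ID merged into 9606


def parse_nodes(handle):
    taxid2parent, taxid2rank = {}, {}
    for line in handle:
        fields = line.split("\t")  # e.g. ['9606', '|', '9605', '|', 'species', '|\n']
        taxid = int(fields[0])
        taxid2parent[taxid] = int(fields[2])
        taxid2rank[taxid] = fields[4].strip()
    return taxid2parent, taxid2rank


def parse_names(handle):
    taxid2name = {}
    for line in handle:
        fields = line.split("\t")
        # Keep only the scientific name; synonyms and common names are skipped.
        if fields[6] == "scientific name":
            taxid2name[int(fields[0])] = fields[2].strip()
    return taxid2name


def parse_merged(handle):
    return {int(f[0]): int(f[2]) for f in (line.split("\t") for line in handle)}


taxid2parent, taxid2rank = parse_nodes(io.StringIO(NODES))
taxid2name = parse_names(io.StringIO(NAMES))
old2new = parse_merged(io.StringIO(MERGED))

# Legacy identifiers become aliases of the identifiers they were merged into.
for old, new in old2new.items():
    taxid2parent[old] = taxid2parent[new]
    taxid2rank[old] = taxid2rank[new]
    taxid2name[old] = taxid2name[new]
```

After the aliasing loop, a lookup with the legacy identifier returns the same name, rank, and parent as the current one, which is why a `Taxon` can be built from a merged identifier.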

taxopy.Taxon

Create an object of the Taxon class.

Parameters:

Name Type Description Default
taxid int

An NCBI taxonomic identifier.

required
taxdb TaxDb

A TaxDb object.

required

Attributes:

Name Type Description
taxid int

The NCBI taxonomic identifier the object represents (e.g., 9606).

name str

The name of the taxon (e.g., 'Homo sapiens').

rank str

The rank of the taxon (e.g., 'species').

legacy_taxid bool

A boolean that represents whether the NCBI taxonomic identifier was merged to another identifier (True) or not (False). If pre-downloaded nodes.dmp and names.dmp files were provided to build taxdb but the merged.dmp file was not supplied, this attribute will be None.

taxid_lineage list

An ordered list containing the taxonomic identifiers of the whole lineage of the taxon, from the most specific to the most general.

name_lineage list

An ordered list containing the names of the whole lineage of the taxon, from the most specific to the most general.

rank_lineage list

An ordered list containing the rank names of the whole lineage of the taxon, from the most specific to the most general.

ranked_name_lineage list

An ordered list of tuples, where each tuple represents a rank in the lineage, with the first element denoting the rank name and the second indicating the taxon's name.

ranked_taxid_lineage list

An ordered list of tuples, where each tuple represents a rank in the lineage, with the first element denoting the rank name and the second indicating the taxon's taxonomic identifier.

rank_taxid_dictionary dict

A dictionary where the keys are named ranks and the values are the taxids of the taxa that correspond to each of the named ranks in the lineage.

rank_name_dictionary dict

A dictionary where the keys are named ranks and the values are the names of the taxa that correspond to each of the named ranks in the lineage.

Methods:

Name Description
parent

Returns a Taxon object of the parent node.

Raises:

Type Description
TaxidError

If the input integer is not a valid NCBI taxonomic identifier.

Source code in taxopy/core.py
class Taxon:
    """
    Create an object of the Taxon class.

    Parameters
    ----------
    taxid : int
        An NCBI taxonomic identifier.
    taxdb : TaxDb
        A TaxDb object.

    Attributes
    ----------
    taxid : int
        The NCBI taxonomic identifier the object represents (e.g., 9606).
    name: str
        The name of the taxon (e.g., 'Homo sapiens').
    rank: str
        The rank of the taxon (e.g., 'species').
    legacy_taxid: bool
        A boolean that represents whether the NCBI taxonomic identifier was
        merged to another identifier (`True`) or not (`False`). If pre-downloaded
        `nodes.dmp` and `names.dmp` files were provided to build `taxdb` but the
        `merged.dmp` file was not supplied, this attribute will be `None`.
    taxid_lineage: list
        An ordered list containing the taxonomic identifiers of the whole lineage
        of the taxon, from the most specific to the most general.
    name_lineage: list
        An ordered list containing the names of the whole lineage of the taxon,
        from the most specific to the most general.
    rank_lineage: list
        An ordered list containing the rank names of the whole lineage of the
        taxon, from the most specific to the most general.
    ranked_name_lineage : list
        An ordered list of tuples, where each tuple represents a rank in the
        lineage, with the first element denoting the rank name and the second
        indicating the taxon's name.
    ranked_taxid_lineage : list
        An ordered list of tuples, where each tuple represents a rank in the
        lineage, with the first element denoting the rank name and the second
        indicating the taxon's taxonomic identifier.
    rank_taxid_dictionary: dict
        A dictionary where the keys are named ranks and the values are the taxids
        of the taxa that correspond to each of the named ranks in the lineage.
    rank_name_dictionary: dict
        A dictionary where the keys are named ranks and the values are the names
        of the taxa that correspond to each of the named ranks in the lineage.

    Methods
    -------
    parent(taxdb)
        Returns a Taxon object of the parent node.

    Raises
    ------
    TaxidError
        If the input integer is not a valid NCBI taxonomic identifier.
    """

    def __init__(self, taxid: int, taxdb: TaxDb):
        self._taxid = taxid
        if self.taxid not in taxdb.taxid2name:
            raise TaxidError(
                "The input integer is not a valid NCBI taxonomic identifier."
            )
        self._name = taxdb.taxid2name[self.taxid]
        self._rank = taxdb.taxid2rank[self.taxid]
        if taxdb.oldtaxid2newtaxid:
            self._legacy_taxid = self.taxid in taxdb.oldtaxid2newtaxid
        else:
            self._legacy_taxid = None
        self._taxid_lineage = self._find_lineage(taxdb.taxid2parent)
        self._name_lineage = self._convert_to_names(taxdb.taxid2name)
        self._rank_lineage = [taxdb.taxid2rank[taxid] for taxid in self.taxid_lineage]
        (
            self._rank_taxid_dictionary,
            self._rank_name_dictionary,
        ) = self._convert_to_rank_dictionary(taxdb.taxid2rank, taxdb.taxid2name)

    @property
    def taxid(self) -> int:
        return self._taxid

    @property
    def name(self) -> str:
        return self._name

    @property
    def rank(self) -> str:
        return self._rank

    @property
    def legacy_taxid(self) -> bool:
        return self._legacy_taxid

    @property
    def taxid_lineage(self) -> List[int]:
        return self._taxid_lineage

    @property
    def name_lineage(self) -> List[str]:
        return self._name_lineage

    @property
    def rank_lineage(self) -> List[str]:
        return self._rank_lineage

    @property
    def ranked_taxid_lineage(self) -> List[Tuple[str, int]]:
        return list(zip(self.rank_lineage, self.taxid_lineage))

    @property
    def ranked_name_lineage(self) -> List[Tuple[str, str]]:
        return list(zip(self.rank_lineage, self.name_lineage))

    @property
    def rank_taxid_dictionary(self) -> Dict[str, int]:
        return self._rank_taxid_dictionary

    @property
    def rank_name_dictionary(self) -> Dict[str, str]:
        return self._rank_name_dictionary

    def parent(self, taxdb) -> Taxon:
        """
        Returns the parent node of the taxon.

        Returns
        -------
        Taxon
            The Taxon object of the parent node.
        """
        parent_taxid = taxdb.taxid2parent[self.taxid]
        return Taxon(parent_taxid, taxdb)

    def _find_lineage(self, taxid2parent):
        current_taxid = self.taxid
        lineage = [current_taxid]
        while taxid2parent[current_taxid] != current_taxid:
            current_taxid = taxid2parent[current_taxid]
            lineage.append(current_taxid)
        return lineage

    def _convert_to_names(self, taxid2name):
        return [taxid2name[taxid] for taxid in self.taxid_lineage]

    def _convert_to_rank_dictionary(self, taxid2rank, taxid2name):
        rank_taxid_dictionary = OrderedDict()
        rank_name_dictionary = OrderedDict()
        for taxid in self.taxid_lineage:
            rank = taxid2rank[taxid]
            if rank != "no rank":
                rank_taxid_dictionary[rank] = taxid
                rank_name_dictionary[rank] = taxid2name[taxid]
        return rank_taxid_dictionary, rank_name_dictionary

    def __str__(self) -> str:
        lineage = [
            f"{rank[0]}__{name}" for rank, name in self.rank_name_dictionary.items()
        ]
        return ";".join(reversed(lineage))

    def __repr__(self) -> str:
        return str(self)

    def __eq__(self, other: object) -> bool:
        if other.__class__ is not self.__class__:
            return NotImplemented
        return self.taxid_lineage == other.taxid_lineage

    def __hash__(self) -> int:
        return hash(self.taxid)
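
To make the lineage attributes concrete, here is a standalone sketch of the three steps the class performs: walking parent links until the root (whose parent is itself), dropping "no rank" entries to build the rank dictionary, and formatting the result the way `__str__` does. The dictionaries are a toy, truncated taxonomy assumed for illustration, and `find_lineage` is a free function mirroring `_find_lineage`, not a taxopy API.

```python
from collections import OrderedDict

# Toy taxonomy: parent links and ranks are simplified for illustration.
TAXID2PARENT = {1: 1, 131567: 1, 33208: 131567, 9605: 33208, 9606: 9605}
TAXID2RANK = {
    1: "no rank",
    131567: "no rank",
    33208: "kingdom",
    9605: "genus",
    9606: "species",
}
TAXID2NAME = {
    1: "root",
    131567: "cellular organisms",
    33208: "Metazoa",
    9605: "Homo",
    9606: "Homo sapiens",
}


def find_lineage(taxid, taxid2parent):
    """Follow parent links until reaching the node that is its own parent."""
    lineage = [taxid]
    while taxid2parent[taxid] != taxid:
        taxid = taxid2parent[taxid]
        lineage.append(taxid)
    return lineage


# Most specific first, root last.
lineage = find_lineage(9606, TAXID2PARENT)

# Only named ranks are kept, in lineage order.
rank_names = OrderedDict(
    (TAXID2RANK[t], TAXID2NAME[t]) for t in lineage if TAXID2RANK[t] != "no rank"
)

# Same format as __str__: first letter of the rank, general to specific.
label = ";".join(
    f"{rank[0]}__{name}" for rank, name in reversed(list(rank_names.items()))
)
```

Because the label uses only the first letter of each rank, two ranks that start with the same letter produce the same prefix in the string.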

parent

parent(taxdb) -> Taxon

Returns the parent node of the taxon.

Returns:

Type Description
Taxon

The Taxon object of the parent node.

Source code in taxopy/core.py
def parent(self, taxdb) -> Taxon:
    """
    Returns the parent node of the taxon.

    Returns
    -------
    Taxon
        The Taxon object of the parent node.
    """
    parent_taxid = taxdb.taxid2parent[self.taxid]
    return Taxon(parent_taxid, taxdb)

taxopy.find_lca

find_lca(taxon_list: List[Taxon], taxdb: TaxDb) -> Taxon

Takes a list of multiple Taxon objects and returns their lowest common ancestor (LCA).

Parameters:

Name Type Description Default
taxon_list list

A list containing at least two Taxon objects.

required
taxdb TaxDb

A TaxDb object.

required

Returns:

Type Description
_AggregatedTaxon

The _AggregatedTaxon object of the lowest common ancestor (LCA) of the inputs.

Raises:

Type Description
LCAError

If the input list has fewer than two Taxon objects.

Source code in taxopy/utilities.py
def find_lca(taxon_list: List[Taxon], taxdb: TaxDb) -> Taxon:
    """
    Takes a list of multiple Taxon objects and returns their lowest common
    ancestor (LCA).

    Parameters
    ----------
    taxon_list : list
        A list containing at least two Taxon objects.
    taxdb : TaxDb
        A TaxDb object.

    Returns
    -------
    _AggregatedTaxon
        The _AggregatedTaxon object of the lowest common ancestor (LCA) of the inputs.

    Raises
    ------
    LCAError
        If the input list has fewer than two Taxon objects.
    """
    if len(taxon_list) < 2:
        raise LCAError("The input list must contain at least two Taxon objects.")
    lineage_list = [taxon.taxid_lineage for taxon in taxon_list]
    overlap = set.intersection(*map(set, lineage_list))
    for taxid in lineage_list[0]:
        if taxid in overlap:
            aggregated_taxa = [taxon.taxid for taxon in taxon_list]
            return _AggregatedTaxon(taxid, taxdb, 1.0, aggregated_taxa)
    return _AggregatedTaxon(1, taxdb, 1.0, [])
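
The core of `find_lca` is a set intersection over lineages followed by a scan of the first lineage from most specific to most general. The sketch below isolates that idea on plain lists of identifiers. The lineages are truncated for illustration, and `lowest_common_ancestor` is a hypothetical helper that returns a bare taxid rather than an `_AggregatedTaxon`.

```python
from typing import List

# Truncated toy lineages, most specific first, root (1) last.
HUMAN = [9606, 9605, 9604, 1]
CHIMP = [9598, 9596, 9604, 1]
MOUSE = [10090, 10088, 1]


def lowest_common_ancestor(lineages: List[List[int]]) -> int:
    if len(lineages) < 2:
        raise ValueError("At least two lineages are required.")
    # Identifiers present in every lineage.
    shared = set.intersection(*map(set, lineages))
    # The first shared identifier met while walking up is the most specific one.
    for taxid in lineages[0]:
        if taxid in shared:
            return taxid
    return 1  # fall back to the root


lca_apes = lowest_common_ancestor([HUMAN, CHIMP])
lca_all = lowest_common_ancestor([HUMAN, CHIMP, MOUSE])
```

Scanning only the first lineage is enough: every shared identifier appears in it by definition, and its ordering already runs from specific to general.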

taxopy.find_majority_vote

find_majority_vote(taxon_list: List[Taxon], taxdb: TaxDb, fraction: float = 0.5, weights: Optional[List[float]] = None) -> Taxon

Takes a list of multiple Taxon objects and returns the most specific taxon that is shared by more than the chosen fraction of the input lineages.

Parameters:

Name Type Description Default
taxon_list list

A list containing at least two Taxon objects.

required
taxdb TaxDb

A TaxDb object.

required
fraction float

The returned taxon will be shared by more than fraction of the input taxa lineages. This value must be greater than 0.0 and less than 1.0.

0.5
weights Optional[List[float]]

A list of weights associated with the taxa lineages in taxon_list. These values are used to weight the votes of their associated lineages.

None

Returns:

Type Description
_AggregatedTaxon

The _AggregatedTaxon object of the most specific taxon that is shared by more than a specified fraction of the input lineages.

Raises:

Type Description
MajorityVoteError

If the input taxon list has fewer than two Taxon objects, if the fraction parameter is less than or equal to 0.0 or greater than or equal to 1.0, or if weights is supplied and its length differs from that of taxon_list.

Source code in taxopy/utilities.py
def find_majority_vote(
    taxon_list: List[Taxon],
    taxdb: TaxDb,
    fraction: float = 0.5,
    weights: Optional[List[float]] = None,
) -> Taxon:
    """
    Takes a list of multiple Taxon objects and returns the most specific taxon
    that is shared by more than the chosen fraction of the input lineages.

    Parameters
    ----------
    taxon_list : list
        A list containing at least two Taxon objects.
    taxdb : TaxDb
        A TaxDb object.
    fraction: float, default 0.5
        The returned taxon will be shared by more than `fraction` of the input
        taxa lineages. This value must be greater than 0.0 and less than 1.0.
    weights: list, optional
        A list of weights associated with the taxa lineages in `taxon_list`.
        These values are used to weight the votes of their associated lineages.

    Returns
    -------
    _AggregatedTaxon
        The _AggregatedTaxon object of the most specific taxon that is shared by
        more than a specified fraction of the input lineages.

    Raises
    ------
    MajorityVoteError
        If the input taxon list has fewer than two Taxon objects, if the
        `fraction` parameter is less than or equal to 0.0 or greater than or
        equal to 1.0, or if `weights` is supplied and its length differs from
        that of `taxon_list`.
    """
    if fraction <= 0.0 or fraction >= 1:
        raise MajorityVoteError(
            "The `fraction` parameter must be greater than 0.0 and less than 1."
        )
    if len(taxon_list) < 2:
        raise MajorityVoteError(
            "The input taxon list must contain at least two Taxon objects."
        )
    if weights and len(taxon_list) != len(weights):
        raise MajorityVoteError(
            "The input taxon and weights lists must have the same length."
        )
    if weights:
        return _weighted_majority_vote(taxon_list, taxdb, fraction, weights)
    return _unweighted_majority_vote(taxon_list, taxdb, fraction)
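
The helper functions that carry out the vote are not shown in this listing, so the following is only one plausible reading of the documented behaviour, not the library's implementation. It accumulates the (optionally weighted) support for every identifier across the input lineages and returns the deepest identifier whose share of the total exceeds `fraction`. The function name `majority_vote`, the toy lineages, and the depth-based tie-breaking are all assumptions made for this sketch.

```python
from collections import defaultdict
from typing import Dict, List, Optional

# Truncated toy lineages, most specific first, root (1) last.
HUMAN = [9606, 9605, 9604, 1]
CHIMP = [9598, 9596, 9604, 1]
MOUSE = [10090, 10088, 1]


def majority_vote(
    lineages: List[List[int]],
    fraction: float = 0.5,
    weights: Optional[List[float]] = None,
) -> int:
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be greater than 0.0 and less than 1.0")
    if len(lineages) < 2:
        raise ValueError("at least two lineages are required")
    if weights is None:
        weights = [1.0] * len(lineages)  # unweighted: every lineage counts once
    if len(weights) != len(lineages):
        raise ValueError("lineages and weights must have the same length")

    total = sum(weights)
    support: Dict[int, float] = defaultdict(float)
    depth: Dict[int, int] = {}
    for lineage, weight in zip(lineages, weights):
        # Walk from the root down so the index is the distance from the root.
        for distance, taxid in enumerate(reversed(lineage)):
            support[taxid] += weight
            depth[taxid] = max(depth.get(taxid, 0), distance)

    # Keep identifiers backed by strictly more than `fraction` of the total weight.
    winners = [t for t, s in support.items() if s / total > fraction]
    # The deepest surviving identifier is the most specific one.
    return max(winners, key=depth.__getitem__) if winners else 1


unweighted = majority_vote([HUMAN, CHIMP, MOUSE])
weighted = majority_vote([HUMAN, CHIMP, MOUSE], weights=[1.0, 1.0, 3.0])
strict = majority_vote([HUMAN, CHIMP, MOUSE], fraction=0.9)
```

With `fraction` of 0.5 or more, any two surviving identifiers must share at least one lineage, so they lie on a single path and "deepest" is unambiguous. Below 0.5 several branches can qualify at once, and this sketch makes no claim about how the library resolves that case.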

taxopy.taxid_from_name

taxid_from_name(names: Union[str, List[str]], taxdb: TaxDb, fuzzy: bool = False, score_cutoff: float = 0.9) -> Union[List[int], List[List[int]]]

Takes one or more taxon names and returns a list (or a list of lists) containing the taxonomic identifiers associated with each name.

Parameters:

Name Type Description Default
names str or list of str

The name of the taxon whose taxonomic identifier will be returned. A list of names can also be provided.

required
taxdb TaxDb

A TaxDb object.

required
fuzzy bool

If True, the input name will be matched to the taxa names in the database using fuzzy string matching.

False
score_cutoff float

The minimum score required for a match to be considered valid when fuzzy string matching is used. This value must be between 0.0 and 1.0.

0.9

Returns:

Type Description
list or list of list

A list of all the taxonomic identifiers associated with the input taxon name. If a list of names is provided, a list of lists is returned.

Source code in taxopy/utilities.py
def taxid_from_name(
    names: Union[str, List[str]],
    taxdb: TaxDb,
    fuzzy: bool = False,
    score_cutoff: float = 0.9,
) -> Union[List[int], List[List[int]]]:
    """
    Takes one or more taxon names and returns a list (or a list of lists)
    containing the taxonomic identifiers associated with each name.

    Parameters
    ----------
    names : str or list of str
        The name of the taxon whose taxonomic identifier will be returned. A
        list of names can also be provided.
    taxdb : TaxDb
        A TaxDb object.
    fuzzy : bool, default False
        If True, the input name will be matched to the taxa names in the
        database using fuzzy string matching.
    score_cutoff : float, default 0.9
        The minimum score required for a match to be considered valid when
        fuzzy string matching is used. This value must be between 0.0 and 1.0.

    Returns
    -------
    list or list of list
        A list of all the taxonomic identifiers associated with the input taxon
        name. If a list of names is provided, a list of lists is returned.
    """
    score_cutoff = score_cutoff * 100
    if isinstance(names, list):
        taxid_list = _get_taxid_from_multiple_names(names, taxdb, fuzzy, score_cutoff)
        if not all(len(taxids) for taxids in taxid_list):
            warnings.warn(
                "At least one of the input names was not found in the taxonomy database.",
                Warning,
            )
    else:
        taxid_list = _get_taxid_from_single_name(names, taxdb, fuzzy, score_cutoff)
        if not taxid_list:
            warnings.warn(
                "The input name was not found in the taxonomy database.", Warning
            )
    return taxid_list
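
The function returns a list because a name is not guaranteed to be unique in the taxonomy. The sketch below illustrates exact and fuzzy lookup against a toy `taxid2name` dictionary. The fuzzy matcher used by the library is not visible in this listing, so the standard library's `difflib.SequenceMatcher` stands in for it here, and its scores will not necessarily agree with the library's. `taxids_for_name` is a hypothetical helper, and the identifier 999999 is invented to show a duplicated name.

```python
import difflib
from typing import Dict, List

# Toy name table; 999999 is an invented entry that shares a name with 1386.
TAXID2NAME: Dict[int, str] = {
    9605: "Homo",
    9606: "Homo sapiens",
    10090: "Mus musculus",
    1386: "Bacillus",
    999999: "Bacillus",
}


def taxids_for_name(
    name: str,
    taxid2name: Dict[int, str],
    fuzzy: bool = False,
    score_cutoff: float = 0.9,
) -> List[int]:
    if not fuzzy:
        # Exact matching: every identifier whose name equals the query.
        return [taxid for taxid, n in taxid2name.items() if n == name]
    # Fuzzy matching (stand-in): keep names whose similarity ratio, a value
    # between 0.0 and 1.0, reaches the cutoff.
    return [
        taxid
        for taxid, n in taxid2name.items()
        if difflib.SequenceMatcher(None, name, n).ratio() >= score_cutoff
    ]


exact_hit = taxids_for_name("Homo sapiens", TAXID2NAME)
exact_miss = taxids_for_name("Homo sapien", TAXID2NAME)
fuzzy_hit = taxids_for_name("Homo sapien", TAXID2NAME, fuzzy=True)
ambiguous = taxids_for_name("Bacillus", TAXID2NAME)
```

An empty list signals that nothing matched; the real function additionally emits a warning in that case.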