ppanggolin.cluster package

Submodules

ppanggolin.cluster.cluster module

ppanggolin.cluster.cluster.align_rep(faa_file: Path, tmpdir: Path, cpu: int = 1, coverage: float = 0.8, identity: float = 0.8) Path

Align the representative sequences

Parameters:
  • faa_file – Path to the file of representative family sequences

  • tmpdir – Temporary directory

  • cpu – number of CPU cores to use

  • coverage – minimal coverage threshold for the alignment

  • identity – minimal identity threshold for the alignment

Returns:

Result of alignment

ppanggolin.cluster.cluster.check_pangenome_for_clustering(pangenome: Pangenome, sequences: Path, force: bool = False, disable_bar: bool = False)

Check the pangenome status and write the gene sequences to the provided file, whether they are stored in the .h5 file or currently held in memory.

Parameters:
  • pangenome – Annotated Pangenome

  • sequences – Path to write the sequences

  • force – Force overwriting of existing pangenome information

  • disable_bar – Disable the progress bar

ppanggolin.cluster.cluster.check_pangenome_former_clustering(pangenome: Pangenome, force: bool = False)

Check the pangenome status and .h5 file for former clusterings, then delete them if allowed or raise an error.

Parameters:
  • pangenome – Annotated Pangenome

  • force – Force overwriting of existing pangenome information

ppanggolin.cluster.cluster.clustering(pangenome: Pangenome, tmpdir: Path, cpu: int = 1, defrag: bool = True, code: int = 11, coverage: float = 0.8, identity: float = 0.8, mode: int = 1, force: bool = False, disable_bar: bool = False, keep_tmp_files: bool = True)

Cluster gene sequences from an annotated pangenome into families.

Parameters:
  • pangenome – Annotated Pangenome object.

  • tmpdir – Path to a temporary directory for intermediate files.

  • cpu – Number of CPU cores to use for clustering.

  • defrag – Allow removal of fragmented sequences during clustering.

  • code – Genetic code used for sequence translation.

  • coverage – Minimum coverage threshold for sequence alignment during clustering.

  • identity – Minimum identity threshold for sequence alignment during clustering.

  • mode – Clustering mode (MMseqs2 mode).

  • force – Force writing clustering results back to the pangenome.

  • disable_bar – Disable the progress bar during clustering.

  • keep_tmp_files – Keep temporary files (useful for debugging).

ppanggolin.cluster.cluster.first_clustering(sequences: Path, tmpdir: Path, cpu: int = 1, code: int = 11, coverage: float = 0.8, identity: float = 0.8, mode: int = 1) Tuple[Path, Path]

Make a first clustering of all sequences in pangenome

Parameters:
  • sequences – Sequence from pangenome

  • tmpdir – Temporary directory

  • cpu – number of CPU cores to use

  • code – Genetic code used

  • coverage – minimal coverage threshold for the alignment

  • identity – minimal identity threshold for the alignment

  • mode – MMseqs2 clustering mode

Returns:

path to representative sequence file and path to tsv clustering result

ppanggolin.cluster.cluster.get_family_representative_sequences(pangenome: Pangenome, code: int = 11, cpu: int = 1, tmpdir: Path | None = None, keep_tmp: bool = False)

ppanggolin.cluster.cluster.infer_singletons(pangenome: Pangenome)

Creates a new family for each gene with no associated family.

Parameters:

pangenome – Input pangenome object
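The singleton idea can be sketched with plain dictionaries instead of a Pangenome object. The helper name `infer_singletons_sketch`, its data structures, and the choice to name each new family after its gene are assumptions made for illustration, not the package's implementation.

```python
from typing import Dict, Iterable


def infer_singletons_sketch(genes: Iterable[str], gene_to_family: Dict[str, str]) -> int:
    """Assign every gene without a family to a new single-gene family.

    Hypothetical helper: each new family is named after the gene it holds.
    Returns the number of singleton families created.
    """
    created = 0
    for gene in genes:
        if gene not in gene_to_family:
            gene_to_family[gene] = gene  # new family containing only this gene
            created += 1
    return created


genes = ["g1", "g2", "g3"]
gene_to_family = {"g1": "famA"}
n_created = infer_singletons_sketch(genes, gene_to_family)
print(n_created, gene_to_family)
```

Genes that already belong to a family are left untouched, so running the helper twice creates no further families.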

ppanggolin.cluster.cluster.launch(args: Namespace)

Command launcher

Parameters:

args – All arguments provided by the user

ppanggolin.cluster.cluster.mk_local_to_gene(pangenome: Pangenome) dict

Creates a dictionary that stores local identifiers, if all local identifiers are unique (and if they exist)

Parameters:

pangenome – Input Pangenome

Returns:

Dictionary with local identifiers
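A minimal sketch of the uniqueness check described above, using (gene ID, local identifier) pairs in place of real gene objects. The helper name `local_id_index`, the input shape, and the empty-dictionary fallback on missing or duplicated identifiers are assumptions for illustration only.

```python
from typing import Dict, Iterable, Optional, Tuple


def local_id_index(genes: Iterable[Tuple[str, Optional[str]]]) -> Dict[str, str]:
    """Map local identifier -> gene ID, or return {} if the mapping is unusable.

    `genes` yields (gene_id, local_identifier) pairs. Both this input shape and
    the empty-dict fallback are assumptions made for this sketch.
    """
    index: Dict[str, str] = {}
    for gene_id, local_id in genes:
        if not local_id or local_id in index:
            return {}  # missing or duplicated local identifier
        index[local_id] = gene_id
    return index


unique = local_id_index([("G_1", "locus_a"), ("G_2", "locus_b")])
clashing = local_id_index([("G_1", "locus_a"), ("G_2", "locus_a")])
print(unique, clashing)
```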

ppanggolin.cluster.cluster.parser_clust(parser: ArgumentParser)

Parser for the arguments specific to the cluster command

Parameters:

parser – parser for the cluster arguments

ppanggolin.cluster.cluster.read_clustering(pangenome: Pangenome, families_tsv_path: Path, infer_singleton: bool = False, code: int = 11, cpu: int = 1, tmpdir: Path | None = None, keep_tmp: bool = False, force: bool = False, disable_bar: bool = False)

Get the pangenome information, then read a gene families TSV file from MMseqs2 output and add the gene families and their associated genes to the pangenome.

Parameters:
  • pangenome – Input Pangenome

  • families_tsv_path – Clustering results path

  • infer_singleton – creates a new family for each gene with no associated family

  • code – Genetic code used for sequence translation.

  • cpu – Number of CPU cores to use for clustering.

  • tmpdir – Path to a temporary directory for intermediate files.

  • keep_tmp – Keep temporary files (useful for debugging).

  • force – Force writing to the pangenome

  • disable_bar – Disable the progress bar

ppanggolin.cluster.cluster.read_clustering_file(families_tsv_path: Path) Tuple[DataFrame, bool]

Read and process a gene families clustering file.

This function reads a tab-separated gene families file and processes it into a DataFrame with appropriate columns. It handles different formats of the input file based on the number of columns.

The function expects the file to have 2, 3, or 4 columns:
  • 2 columns: [“family”, “gene”]

  • 3 columns: [“family”, “gene”, “is_frag”] or [“family”, “gene”, “representative”]

  • 4 columns: [“family”, “representative”, “gene”, “is_frag”]

Parameters:

families_tsv_path – The path to the gene families file, which can be compressed or uncompressed.

Raises:
  • ValueError – If the file has only one column or an unexpected number of columns.

  • Exception – If there are duplicated gene IDs in the clustering.

Returns:

The processed DataFrame and a boolean indicating if any gene is marked as fragmented.
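The column handling described above can be sketched with the standard library alone. This is not the package's implementation (which returns a DataFrame); the helper name `parse_families_sketch`, the use of plain dictionaries, and the rule for telling an is_frag column from a representative column in the 3-column case (only "F" or empty values) are assumptions made for this example.

```python
import csv
import io
from typing import Dict, List, TextIO, Tuple


def parse_families_sketch(handle: TextIO) -> Tuple[List[Dict[str, object]], bool]:
    """Parse a 2-, 3-, or 4-column families TSV into records plus a fragment flag."""
    rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    if not rows:
        raise ValueError("The clustering file is empty.")
    n_cols = len(rows[0])
    if any(len(row) != n_cols for row in rows):
        raise ValueError("Rows do not all have the same number of columns.")

    if n_cols == 2:
        columns = ["family", "gene"]
    elif n_cols == 3:
        # Assumption: a third column holding only "F" or "" is a fragment flag;
        # anything else is treated as a representative ID.
        if {row[2] for row in rows} <= {"F", ""}:
            columns = ["family", "gene", "is_frag"]
        else:
            columns = ["family", "gene", "representative"]
    elif n_cols == 4:
        columns = ["family", "representative", "gene", "is_frag"]
    else:
        raise ValueError(f"Expected 2, 3, or 4 columns, found {n_cols}.")

    records: List[Dict[str, object]] = []
    seen = set()
    for row in rows:
        record: Dict[str, object] = dict(zip(columns, row))
        if record["gene"] in seen:
            # Mirrors the documented "Raises" entry for duplicated gene IDs.
            raise Exception(f"Duplicated gene ID in clustering: {record['gene']}")
        seen.add(record["gene"])
        record["is_frag"] = record.get("is_frag", "") == "F"
        records.append(record)
    return records, any(r["is_frag"] for r in records)


example = io.StringIO("famA\tg1\tF\nfamA\tg2\t\nfamB\tg3\t\n")
records, has_fragments = parse_families_sketch(example)
print(records, has_fragments)
```

Normalising every record to carry an is_frag boolean keeps downstream code independent of which of the accepted layouts the file used.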

ppanggolin.cluster.cluster.read_faa(faa_file_name: Path) Dict[str, str]

Read a faa file to link pangenome families to sequences.

Parameters:

faa_file_name – path to the faa file

Returns:

dictionary with families ID as key and sequence as value
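As an illustration of the returned mapping, here is a minimal FASTA reader built on the standard library. The helper name `read_fasta_sketch` and the choice to take the first whitespace-delimited token of each header as the family ID are assumptions for this sketch, not the package's code.

```python
import io
from typing import Dict, Optional, TextIO


def read_fasta_sketch(handle: TextIO) -> Dict[str, str]:
    """Return {family ID: sequence} from FASTA-formatted text.

    Assumption: the family ID is the first whitespace-delimited token
    after '>' on each header line.
    """
    sequences: Dict[str, str] = {}
    current: Optional[str] = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            if not tokens:
                raise ValueError("FASTA header without an identifier.")
            current = tokens[0]
            sequences[current] = ""
        elif current is not None:
            sequences[current] += line  # sequences may span several lines
    return sequences


faa_text = io.StringIO(">famA some description\nMKT\nLLV\n>famB\nMSS\n")
fam_to_seq = read_fasta_sketch(faa_text)
print(fam_to_seq)
```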

ppanggolin.cluster.cluster.read_fam2seq(pangenome: Pangenome, fam_to_seq: Dict[str, str])

Add gene families to the pangenome and sequences to the gene families

Parameters:
  • pangenome – Annotated pangenome

  • fam_to_seq – Dictionary that links families to sequences

ppanggolin.cluster.cluster.read_gene2fam(pangenome: Pangenome, gene_to_fam: dict, disable_bar: bool = False)

Add genes to the pangenome families

Parameters:
  • pangenome – Annotated Pangenome

  • gene_to_fam – Dictionary that links genes to families

  • disable_bar – Disable the progress bar

ppanggolin.cluster.cluster.read_tsv(tsv_file_name: Path) Tuple[Dict[str, Tuple[str, bool]], Dict[str, Set[str]]]

Read the clustering tsv file

Parameters:

tsv_file_name – path to the tsv

Returns:

two dictionaries which link genes and families
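The return annotation suggests one dictionary from gene to a (family, flag) pair and one from family to its set of genes. A minimal sketch under that reading follows; the helper name `read_cluster_tsv_sketch`, the interpretation of the boolean as a fragment flag that starts out False, and the use of an in-memory handle are assumptions. The two-column layout (representative first, member second) is the one MMseqs2 writes for cluster TSV output.

```python
import io
from collections import defaultdict
from typing import Dict, Set, TextIO, Tuple


def read_cluster_tsv_sketch(
    handle: TextIO,
) -> Tuple[Dict[str, Tuple[str, bool]], Dict[str, Set[str]]]:
    """Build gene -> (family, is_fragment) and family -> {genes} mappings."""
    gene_to_fam: Dict[str, Tuple[str, bool]] = {}
    fam_to_genes: Dict[str, Set[str]] = defaultdict(set)
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        family, gene = fields[0], fields[1]
        # Assumption: no gene is flagged as a fragment at this stage.
        gene_to_fam[gene] = (family, False)
        fam_to_genes[family].add(gene)
    return gene_to_fam, dict(fam_to_genes)


tsv_text = io.StringIO("rep1\trep1\nrep1\tg2\nrep2\trep2\n")
gene_to_fam, fam_to_genes = read_cluster_tsv_sketch(tsv_text)
print(gene_to_fam, fam_to_genes)
```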

ppanggolin.cluster.cluster.refine_clustering(tsv: Path, aln_file: Path, fam_to_seq: dict) Tuple[Dict[str, Tuple[str, bool]], Dict[str, str]]

Refine clustering by removing fragments

Parameters:
  • tsv – First clustering result

  • aln_file – Representative alignment result

  • fam_to_seq – Dictionary that links families to sequences

Returns:

Two dictionaries that link genes and families

ppanggolin.cluster.cluster.subparser(sub_parser: _SubParsersAction) ArgumentParser

Subparser to launch PPanGGOLiN from the command line

Parameters:

sub_parser – sub_parser for the cluster command

Returns:

parser arguments for the cluster command
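To show how a function with this signature typically plugs into argparse, here is a small self-contained sketch. The helper name `cluster_subparser_sketch` and the two flags are illustrative assumptions (their defaults simply mirror the 0.8 values in the function signatures above); they are not a listing of the real command-line options.

```python
import argparse


def cluster_subparser_sketch(sub_parser) -> argparse.ArgumentParser:
    """Register a 'cluster' sub-command and return its parser (illustrative only)."""
    parser = sub_parser.add_parser(
        "cluster", formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )
    # Hypothetical flags; defaults mirror the documented function signatures.
    parser.add_argument("--identity", type=float, default=0.8,
                        help="Minimal identity threshold for the alignment")
    parser.add_argument("--coverage", type=float, default=0.8,
                        help="Minimal coverage threshold for the alignment")
    return parser


main_parser = argparse.ArgumentParser(prog="demo")
subparsers = main_parser.add_subparsers(dest="command")
cluster_subparser_sketch(subparsers)

args = main_parser.parse_args(["cluster", "--identity", "0.9"])
print(args.command, args.identity, args.coverage)
```

Returning the sub-command's parser lets the caller attach further arguments or defaults after registration.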

Module contents