ppanggolin.cluster package

Submodules

ppanggolin.cluster.cluster module

ppanggolin.cluster.cluster.align_rep(faa_file: Path, tmpdir: Path, cpu: int = 1, coverage: float = 0.8, identity: float = 0.8) → Path

Align the representative sequences of the gene families.

Parameters:

faa_file – Path to the file of family representative sequences
tmpdir – Temporary directory
cpu – number of CPU cores to use
coverage – minimal coverage threshold for the alignment
identity – minimal identity threshold for the alignment

Returns:

Path to the alignment result

ppanggolin.cluster.cluster.check_pangenome_for_clustering(pangenome: Pangenome, sequences: Path, force: bool = False, disable_bar: bool = False)

Check the pangenome statuses and write the gene sequences to the provided sequences file, whether they are stored in the .h5 file or currently in memory.

Parameters:

pangenome – Annotated Pangenome
sequences – Path to write the sequences
force – Overwrite existing pangenome information
disable_bar – Disable the progress bar

ppanggolin.cluster.cluster.check_pangenome_former_clustering(pangenome: Pangenome, force: bool = False)

Check the pangenome status and the .h5 file for former clusterings; delete them if allowed, otherwise raise an error.

Parameters:

pangenome – Annotated Pangenome
force – Overwrite existing pangenome information

ppanggolin.cluster.cluster.clustering(pangenome: Pangenome, tmpdir: Path, cpu: int = 1, defrag: bool = True, code: int = 11, coverage: float = 0.8, identity: float = 0.8, mode: int = 1, force: bool = False, disable_bar: bool = False, keep_tmp_files: bool = True)

Cluster gene sequences from an annotated pangenome into families.

Parameters:

pangenome – Annotated Pangenome object.
tmpdir – Path to a temporary directory for intermediate files.
cpu – Number of CPU cores to use for clustering.
defrag – Allow removal of fragmented sequences during clustering.
code – Genetic code used for sequence translation.
coverage – Minimum coverage threshold for sequence alignment during clustering.
identity – Minimum identity threshold for sequence alignment during clustering.
mode – Clustering mode (MMseqs2 mode).
force – Force writing clustering results back to the pangenome.
disable_bar – Disable the progress bar during clustering.
keep_tmp_files – Keep temporary files (useful for debugging).

ppanggolin.cluster.cluster.first_clustering(sequences: Path, tmpdir: Path, cpu: int = 1, code: int = 11, coverage: float = 0.8, identity: float = 0.8, mode: int = 1) → Tuple[Path, Path]

Perform a first clustering of all sequences in the pangenome.

Parameters:

sequences – Path to the sequences from the pangenome
tmpdir – Temporary directory
cpu – number of CPU cores to use
code – Genetic code used
coverage – minimal coverage threshold for the alignment
identity – minimal identity threshold for the alignment
mode – MMseqs2 clustering mode

Returns:

Path to the representative sequence file and path to the TSV clustering result

ppanggolin.cluster.cluster.get_family_representative_sequences(pangenome: Pangenome, code: int = 11, cpu: int = 1, tmpdir: Path | None = None, keep_tmp: bool = False)

ppanggolin.cluster.cluster.infer_singletons(pangenome: Pangenome)

Creates a new family for each gene with no associated family.

Parameters:

pangenome – Input pangenome object
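
The idea behind singleton inference can be sketched with plain dictionaries. This is an illustration only, not PPanGGOLiN's implementation: `infer_singletons_sketch`, `gene_ids`, and `gene_to_family` are hypothetical names, and naming each new family after its gene is an assumption.

```python
from typing import Dict, Iterable


def infer_singletons_sketch(gene_ids: Iterable[str],
                            gene_to_family: Dict[str, str]) -> int:
    """Give every gene without a family its own single-member family.

    Hypothetical helper: the new family is named after the gene (an assumption).
    Returns the number of singleton families created.
    """
    created = 0
    for gene_id in gene_ids:
        if gene_id not in gene_to_family:
            gene_to_family[gene_id] = gene_id  # new family holding only this gene
            created += 1
    return created


genes = ["g1", "g2", "g3"]
assignments = {"g1": "famA"}  # only g1 was clustered
n_new = infer_singletons_sketch(genes, assignments)
```

After the call, `assignments` also contains one family each for `g2` and `g3`, and `n_new` counts the families that were added.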

ppanggolin.cluster.cluster.launch(args: Namespace)

Command launcher

Parameters:

args – All arguments provided by the user

ppanggolin.cluster.cluster.mk_local_to_gene(pangenome: Pangenome) → dict

Creates a dictionary that stores local identifiers, if all local identifiers are unique (and if they exist)

Parameters:

pangenome – Input Pangenome

Returns:

Dictionary with local identifiers
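
A minimal sketch of the "only if unique and present" rule, using plain tuples instead of Pangenome objects. `local_to_gene_sketch` is a hypothetical name, and returning an empty dictionary when the condition fails is an assumption about the behavior, not something the description above states.

```python
from typing import Dict, Iterable, Optional, Tuple


def local_to_gene_sketch(genes: Iterable[Tuple[str, Optional[str]]]) -> Dict[str, str]:
    """Map local identifier -> gene ID from (gene_id, local_id) pairs.

    Assumption: if any local identifier is missing or duplicated, the
    mapping is unusable and an empty dictionary is returned.
    """
    mapping: Dict[str, str] = {}
    for gene_id, local_id in genes:
        if not local_id or local_id in mapping:
            return {}  # missing or non-unique local identifier
        mapping[local_id] = gene_id
    return mapping


unique = local_to_gene_sketch([("G1", "locA"), ("G2", "locB")])
duplicated = local_to_gene_sketch([("G1", "locA"), ("G2", "locA")])
```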

ppanggolin.cluster.cluster.parser_clust(parser: ArgumentParser)

Parser for the arguments specific to the cluster command

Parameters:

parser – Argument parser for the cluster command

ppanggolin.cluster.cluster.read_clustering(pangenome: Pangenome, families_tsv_path: Path, infer_singleton: bool = False, code: int = 11, cpu: int = 1, tmpdir: Path | None = None, keep_tmp: bool = False, force: bool = False, disable_bar: bool = False)

Get the pangenome information, the gene families, and the genes with an associated gene family. Reads a gene families TSV file from MMseqs2 output and adds the gene families and their genes to the pangenome.

Parameters:

pangenome – Input Pangenome
families_tsv_path – Clustering results path
infer_singleton – Create a new family for each gene with no associated family
code – Genetic code used for sequence translation.
cpu – Number of CPU cores to use for clustering.
tmpdir – Path to a temporary directory for intermediate files.
keep_tmp – Keep temporary files (useful for debugging).
force – Overwrite existing pangenome information
disable_bar – Disable the progress bar

ppanggolin.cluster.cluster.read_clustering_file(families_tsv_path: Path) → Tuple[DataFrame, bool]

Read and process a gene families clustering file.

This function reads a tab-separated gene families file and processes it into a DataFrame with appropriate columns. It handles different formats of the input file based on the number of columns.

The function expects the file to have 2, 3, or 4 columns:

- 2 columns: [“family”, “gene”]
- 3 columns: [“family”, “gene”, “is_frag”] or [“family”, “gene”, “representative”]
- 4 columns: [“family”, “representative”, “gene”, “is_frag”]

Parameters:

families_tsv_path – The path to the gene families file, which can be compressed or uncompressed.

Raises:

ValueError – If the file has only one column or an unexpected number of columns.
Exception – If there are duplicated gene IDs in the clustering.

Returns:

The processed DataFrame and a boolean indicating if any gene is marked as fragmented.
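
The column-count dispatch described above can be sketched with the standard library alone. This is not the function's actual code: it returns a list of dictionaries rather than a DataFrame, `read_families_sketch` and `FRAG_TOKENS` are hypothetical names, and the rule used to tell an is_frag column from a representative column in the 3-column case (values limited to "F" or empty) is an assumption.

```python
import csv
import io
from typing import Dict, List, Tuple

# Assumption: a fragment column holds "F" for fragments and is empty otherwise.
FRAG_TOKENS = {"", "F"}


def read_families_sketch(handle) -> Tuple[List[Dict[str, object]], bool]:
    """Parse a tab-separated families file into records plus an any-fragment flag."""
    rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    if not rows:
        raise ValueError("clustering file is empty")
    n_cols = len(rows[0])
    if any(len(row) != n_cols for row in rows):
        raise ValueError("rows have inconsistent column counts")

    if n_cols == 2:
        columns = ["family", "gene"]
    elif n_cols == 3:
        third_is_frag = {row[2] for row in rows} <= FRAG_TOKENS
        columns = ["family", "gene", "is_frag" if third_is_frag else "representative"]
    elif n_cols == 4:
        columns = ["family", "representative", "gene", "is_frag"]
    else:  # one column, or more than four
        raise ValueError(f"unexpected number of columns: {n_cols}")

    records: List[Dict[str, object]] = [dict(zip(columns, row)) for row in rows]
    genes = [record["gene"] for record in records]
    if len(genes) != len(set(genes)):
        raise Exception("duplicated gene IDs in the clustering")

    # Normalise the fragment flag to a boolean; absent column means not fragmented.
    for record in records:
        record["is_frag"] = record.get("is_frag") == "F"
    return records, any(record["is_frag"] for record in records)


example = io.StringIO("famA\tg1\t\nfamA\tg2\tF\nfamB\tg3\t\n")
records, has_fragments = read_families_sketch(example)
```

Here the third column contains only "F" or empty values, so the sketch treats it as is_frag and reports that at least one gene is fragmented.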

ppanggolin.cluster.cluster.read_faa(faa_file_name: Path) → Dict[str, str]

Read a faa file to link pangenome families to sequences.

Parameters:

faa_file_name – Path to the faa file

Returns:

Dictionary with family IDs as keys and sequences as values
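
As an illustration of the returned mapping, here is a minimal FASTA reader. It is a sketch, not the library's code: `read_faa_sketch` is a hypothetical name, it takes an iterable of lines rather than a path, and using the first whitespace-delimited token of each header as the family ID is an assumption.

```python
from typing import Dict, Iterable, Optional


def read_faa_sketch(lines: Iterable[str]) -> Dict[str, str]:
    """Return {family_id: sequence} from FASTA-formatted lines."""
    fam_to_seq: Dict[str, str] = {}
    current_id: Optional[str] = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            # Assumption: the ID is the first token of the header line.
            current_id = header[0] if header else ""
            fam_to_seq[current_id] = ""
        elif current_id is not None:
            fam_to_seq[current_id] += line  # sequences may span several lines
    return fam_to_seq


fam_to_seq = read_faa_sketch([">famA some description", "MKT", "LLV", ">famB", "MA"])
```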

ppanggolin.cluster.cluster.read_fam2seq(pangenome: Pangenome, fam_to_seq: Dict[str, str])

Add gene families to the pangenome and sequences to the gene families

Parameters:

pangenome – Annotated pangenome
fam_to_seq – Dictionary which links families to sequences

ppanggolin.cluster.cluster.read_gene2fam(pangenome: Pangenome, gene_to_fam: dict, disable_bar: bool = False)

Add genes to the pangenome families

Parameters:

pangenome – Annotated Pangenome
gene_to_fam – Dictionary which links genes to families
disable_bar – Disable the progress bar

ppanggolin.cluster.cluster.read_tsv(tsv_file_name: Path) → Tuple[Dict[str, Tuple[str, bool]], Dict[str, Set[str]]]

Read a clustering TSV file

Parameters:

tsv_file_name – Path to the TSV file

Returns:

Two dictionaries which link genes and families
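
The two return types in the signature suggest a gene-to-family mapping and a family-to-genes mapping. The sketch below builds both from a two-column file. Everything beyond the type signature is an assumption: the column order (family representative first, gene second), the meaning of the boolean (a fragment flag, initially False), and the name `read_tsv_sketch`.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def read_tsv_sketch(
    lines: Iterable[str],
) -> Tuple[Dict[str, Tuple[str, bool]], Dict[str, Set[str]]]:
    """Build gene -> (family, is_fragment) and family -> {genes} mappings."""
    gene_to_fam: Dict[str, Tuple[str, bool]] = {}
    fam_to_genes: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        family, gene = fields[0], fields[1]
        gene_to_fam[gene] = (family, False)  # assumed: not a fragment until refined
        fam_to_genes[family].add(gene)
    return gene_to_fam, dict(fam_to_genes)


gene_to_fam, fam_to_genes = read_tsv_sketch(["g1\tg1\n", "g1\tg2\n", "g3\tg3\n"])
```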

ppanggolin.cluster.cluster.refine_clustering(tsv: Path, aln_file: Path, fam_to_seq: dict) → Tuple[Dict[str, Tuple[str, bool]], Dict[str, str]]

Refine the clustering by removing fragments

Parameters:

tsv – First clustering result
aln_file – Representative alignment result
fam_to_seq – Dictionary which links families to sequences

Returns:

Two dictionaries which link genes and families

ppanggolin.cluster.cluster.subparser(sub_parser: _SubParsersAction) → ArgumentParser

Subparser to launch PPanGGOLiN from the command line

Parameters:

sub_parser – Subparser for the cluster command

Returns:

Parser arguments for the cluster command
