ppanggolin.cluster package
Submodules
ppanggolin.cluster.cluster module
- ppanggolin.cluster.cluster.align_rep(faa_file: Path, tmpdir: Path, cpu: int = 1, coverage: float = 0.8, identity: float = 0.8) Path
Align the representative sequences of the gene families.
- Parameters:
faa_file – Path to the FASTA file holding the representative sequence of each family
tmpdir – Temporary directory
cpu – number of CPU cores to use
coverage – minimal coverage threshold for the alignment
identity – minimal identity threshold for the alignment
- Returns:
Path to the alignment result file
- ppanggolin.cluster.cluster.check_pangenome_for_clustering(pangenome: Pangenome, sequences: Path, force: bool = False, disable_bar: bool = False)
Check the pangenome statuses and write the gene sequences to the provided file, whether they are stored in the .h5 file or currently held in memory.
- Parameters:
pangenome – Annotated Pangenome
sequences – Path to write the sequences
force – Force overwriting existing pangenome information
disable_bar – Disable the progress bar
- ppanggolin.cluster.cluster.check_pangenome_former_clustering(pangenome: Pangenome, force: bool = False)
Check the pangenome status and the .h5 file for a former clustering; delete it if allowed, otherwise raise an error.
- Parameters:
pangenome – Annotated Pangenome
force – Force overwriting existing pangenome information
- ppanggolin.cluster.cluster.clustering(pangenome: Pangenome, tmpdir: Path, cpu: int = 1, defrag: bool = True, code: int = 11, coverage: float = 0.8, identity: float = 0.8, mode: int = 1, force: bool = False, disable_bar: bool = False, keep_tmp_files: bool = True)
Cluster gene sequences from an annotated pangenome into families.
- Parameters:
pangenome – Annotated Pangenome object.
tmpdir – Path to a temporary directory for intermediate files.
cpu – Number of CPU cores to use for clustering.
defrag – Allow removal of fragmented sequences during clustering.
code – Genetic code used for sequence translation.
coverage – Minimum coverage threshold for sequence alignment during clustering.
identity – Minimum identity threshold for sequence alignment during clustering.
mode – Clustering mode (MMseqs2 mode).
force – Force writing clustering results back to the pangenome.
disable_bar – Disable the progress bar during clustering.
keep_tmp_files – Keep temporary files (useful for debugging).
- ppanggolin.cluster.cluster.first_clustering(sequences: Path, tmpdir: Path, cpu: int = 1, code: int = 11, coverage: float = 0.8, identity: float = 0.8, mode: int = 1) Tuple[Path, Path]
Perform a first clustering of all sequences in the pangenome.
- Parameters:
sequences – Path to the sequences from the pangenome
tmpdir – Temporary directory
cpu – number of CPU cores to use
code – Genetic code used
coverage – minimal coverage threshold for the alignment
identity – minimal identity threshold for the alignment
mode – MMseqs2 clustering mode
- Returns:
path to representative sequence file and path to tsv clustering result
- ppanggolin.cluster.cluster.get_family_representative_sequences(pangenome: Pangenome, code: int = 11, cpu: int = 1, tmpdir: Path | None = None, keep_tmp: bool = False)
- ppanggolin.cluster.cluster.infer_singletons(pangenome: Pangenome)
Creates a new family for each gene with no associated family.
- Parameters:
pangenome – Input pangenome object
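The singleton rule described for infer_singletons can be illustrated with a minimal sketch on a plain dictionary. This is not the PPanGGOLiN implementation: the function name `infer_singletons_sketch`, the gene-to-family dictionary, and the choice to reuse the gene ID as the new family name are all assumptions made for illustration.

```python
from typing import Dict, Optional


def infer_singletons_sketch(gene_to_family: Dict[str, Optional[str]]) -> int:
    """Give every gene without a family its own single-member family.

    The dictionary is updated in place; the number of families created
    is returned.
    """
    created = 0
    for gene_id, family in gene_to_family.items():
        if family is None:
            # Assumption: the new family is named after its only gene.
            gene_to_family[gene_id] = gene_id
            created += 1
    return created


# Hypothetical data: g2 and g4 were not assigned to any family.
genes = {"g1": "famA", "g2": None, "g3": "famA", "g4": None}
n_new = infer_singletons_sketch(genes)
```

After the call, every gene in `genes` has a family, and `n_new` counts the singleton families that were added.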
- ppanggolin.cluster.cluster.launch(args: Namespace)
Command launcher
- Parameters:
args – All arguments provided by the user
- ppanggolin.cluster.cluster.mk_local_to_gene(pangenome: Pangenome) dict
Creates a dictionary that stores local identifiers, if all local identifiers are unique (and if they exist)
- Parameters:
pangenome – Input Pangenome
- Returns:
Dictionary with local identifiers
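The "unique and present" condition stated for mk_local_to_gene could be sketched as follows. The function name, the (gene ID, local identifier) input pairs, and the choice to return an empty dictionary when the condition fails are assumptions; the real function works on a Pangenome object.

```python
from typing import Dict, Iterable, Tuple


def mk_local_to_gene_sketch(genes: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map local identifiers to gene IDs.

    `genes` yields (gene_id, local_identifier) pairs. If any local
    identifier is missing or duplicated, the mapping is not usable, so an
    empty dictionary is returned (assumed behaviour).
    """
    local_to_gene: Dict[str, str] = {}
    for gene_id, local_id in genes:
        if not local_id or local_id in local_to_gene:
            return {}
        local_to_gene[local_id] = gene_id
    return local_to_gene


# Hypothetical identifiers.
unique = mk_local_to_gene_sketch([("G1", "locA"), ("G2", "locB")])
duplicated = mk_local_to_gene_sketch([("G1", "locA"), ("G2", "locA")])
```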
- ppanggolin.cluster.cluster.parser_clust(parser: ArgumentParser)
Parser for the arguments specific to the cluster command
- Parameters:
parser – parser for the cluster command arguments
- ppanggolin.cluster.cluster.read_clustering(pangenome: Pangenome, families_tsv_path: Path, infer_singleton: bool = False, code: int = 11, cpu: int = 1, tmpdir: Path | None = None, keep_tmp: bool = False, force: bool = False, disable_bar: bool = False)
Get the pangenome information, the gene families and the genes with an associated gene family. Reads a gene families TSV file from MMseqs2 output and adds the gene families and the genes to the pangenome.
- Parameters:
pangenome – Input Pangenome
families_tsv_path – Clustering results path
infer_singleton – creates a new family for each gene with no associated family
code – Genetic code used for sequence translation.
cpu – Number of CPU cores to use for clustering.
tmpdir – Path to a temporary directory for intermediate files.
keep_tmp – Keep temporary files (useful for debugging).
force – Force writing to the pangenome
disable_bar – Disable the progress bar
- ppanggolin.cluster.cluster.read_clustering_file(families_tsv_path: Path) Tuple[DataFrame, bool]
Read and process a gene families clustering file.
This function reads a tab-separated gene families file and processes it into a DataFrame with appropriate columns. It handles different formats of the input file based on the number of columns.
The function expects the file to have 2, 3, or 4 columns:
- 2 columns: ["family", "gene"]
- 3 columns: ["family", "gene", "is_frag"] or ["family", "gene", "representative"]
- 4 columns: ["family", "representative", "gene", "is_frag"]
- Parameters:
families_tsv_path – The path to the gene families file, which can be compressed or uncompressed.
- Raises:
ValueError – If the file has only one column or an unexpected number of columns.
Exception – If there are duplicated gene IDs in the clustering.
- Returns:
The processed DataFrame and a boolean indicating if any gene is marked as fragmented.
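The column-count handling described for read_clustering_file can be sketched with the standard library alone. This is an illustration, not the library code: the real function returns a pandas DataFrame, while this sketch returns a list of dictionaries. The names `name_columns` and `read_clustering_sketch`, and the assumption that a fragment is flagged with the value "F" (used here to tell an `is_frag` column from a `representative` column), are hypothetical.

```python
import csv
import io
from typing import Dict, List, Tuple


def name_columns(rows: List[List[str]]) -> List[str]:
    """Pick column names from the number of columns in the file."""
    n_cols = len(rows[0])
    if n_cols == 2:
        return ["family", "gene"]
    if n_cols == 3:
        # Assumption: a fragment flag column only holds "F" or empty values.
        if {row[2] for row in rows} <= {"F", ""}:
            return ["family", "gene", "is_frag"]
        return ["family", "gene", "representative"]
    if n_cols == 4:
        return ["family", "representative", "gene", "is_frag"]
    raise ValueError(f"Unexpected number of columns: {n_cols}")


def read_clustering_sketch(text: str) -> Tuple[List[Dict[str, str]], bool]:
    """Parse tab-separated clustering text into records plus a fragment flag."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]
    if not rows or len({len(row) for row in rows}) != 1:
        raise ValueError("Empty file or inconsistent number of columns")
    columns = name_columns(rows)
    records = [dict(zip(columns, row)) for row in rows]
    gene_ids = [record["gene"] for record in records]
    if len(gene_ids) != len(set(gene_ids)):
        raise Exception("Duplicated gene IDs in the clustering")
    has_fragment = any(record.get("is_frag") == "F" for record in records)
    return records, has_fragment


# Hypothetical three-column input where g1 is marked as a fragment.
records, has_fragment = read_clustering_sketch("famA\tg1\tF\nfamA\tg2\t\n")
```

A single-column file raises ValueError from `name_columns`, and a repeated gene ID raises Exception, mirroring the two error cases listed above.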
- ppanggolin.cluster.cluster.read_faa(faa_file_name: Path) Dict[str, str]
Read a faa file to link pangenome families to sequences.
- Parameters:
faa_file_name – path to the faa file
- Returns:
dictionary with families ID as key and sequence as value
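A minimal sketch of the mapping read_faa returns, assuming a standard FASTA layout in which each header line starts with ">" and the first word of the header is the family ID. The function name `read_faa_sketch` and the line-based input are assumptions; the real function takes a file path.

```python
from typing import Dict, Iterable, Optional


def read_faa_sketch(lines: Iterable[str]) -> Dict[str, str]:
    """Build a {family ID: sequence} dictionary from FASTA lines."""
    fam_to_seq: Dict[str, str] = {}
    current_id: Optional[str] = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("FASTA header without an identifier")
            # Assumption: the ID is the first word after ">".
            current_id = header[0]
            fam_to_seq[current_id] = ""
        elif current_id is not None:
            # Sequences may span several lines; concatenate them.
            fam_to_seq[current_id] += line
    return fam_to_seq


# Hypothetical two-family input.
fam_to_seq = read_faa_sketch([">famA some description", "MKT", "LLV", ">famB", "MAA"])
```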
- ppanggolin.cluster.cluster.read_fam2seq(pangenome: Pangenome, fam_to_seq: Dict[str, str])
Add gene families to the pangenome and sequences to the gene families
- Parameters:
pangenome – Annotated pangenome
fam_to_seq – Dictionary linking families to sequences
- ppanggolin.cluster.cluster.read_gene2fam(pangenome: Pangenome, gene_to_fam: dict, disable_bar: bool = False)
Add genes to the pangenome families
- Parameters:
pangenome – Annotated Pangenome
gene_to_fam – Dictionary linking genes to families
disable_bar – Disable the progress bar
- ppanggolin.cluster.cluster.read_tsv(tsv_file_name: Path) Tuple[Dict[str, Tuple[str, bool]], Dict[str, Set[str]]]
Read the TSV clustering file
- Parameters:
tsv_file_name – path to the tsv
- Returns:
two dictionaries which link genes and families
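The two dictionaries in the return type of read_tsv can be sketched as below. The assumptions here are labeled loudly: the input is taken to be a two-column, tab-separated file with the family (cluster representative) first and the gene second, and the boolean in the first dictionary is assumed to be a fragment flag that starts as False. The function name `read_tsv_sketch` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def read_tsv_sketch(
    lines: Iterable[str],
) -> Tuple[Dict[str, Tuple[str, bool]], Dict[str, Set[str]]]:
    """Return (gene -> (family, is_fragment), family -> set of genes)."""
    gene_to_fam: Dict[str, Tuple[str, bool]] = {}
    fam_to_genes: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        family, gene = fields[0], fields[1]
        # Assumption: no gene is known to be a fragment at this stage.
        gene_to_fam[gene] = (family, False)
        fam_to_genes[family].add(gene)
    return gene_to_fam, dict(fam_to_genes)


# Hypothetical clustering with two families.
gene_to_fam, fam_to_genes = read_tsv_sketch(["g1\tg1\n", "g1\tg2\n", "g3\tg3\n"])
```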
- ppanggolin.cluster.cluster.refine_clustering(tsv: Path, aln_file: Path, fam_to_seq: dict) Tuple[Dict[str, Tuple[str, bool]], Dict[str, str]]
Refine the clustering by removing fragments
- Parameters:
tsv – First clustering result
aln_file – Representative alignment result
fam_to_seq – Dictionary linking families to sequences
- Returns:
Two dictionaries linking genes and families
- ppanggolin.cluster.cluster.subparser(sub_parser: _SubParsersAction) ArgumentParser
Subparser to launch PPanGGOLiN from the command line
- Parameters:
sub_parser – sub_parser for the cluster command
- Returns:
parser arguments for the cluster command