ppanggolin.align package

Submodules

ppanggolin.align.alignOnPang module

ppanggolin.align.alignOnPang.align(pangenome: Pangenome, sequence_file: Path, output: Path, identity: float = 0.8, coverage: float = 0.8, no_defrag: bool = False, cpu: int = 1, getinfo: bool = False, use_representatives: bool = False, draw_related: bool = False, translation_table: int = 11, tmpdir: Path | None = None, disable_bar: bool = False, keep_tmp=False)

Aligns pangenome sequences with sequences in a FASTA file using MMseqs2.

Parameters:

pangenome – Pangenome object containing gene families to align with the input sequences.
sequence_file – Path to a FASTA file containing sequences to align with the pangenome.
output – Path to the output directory.
identity – Minimum identity threshold for the alignment.
coverage – Minimum coverage threshold for the alignment.
no_defrag – If True, the defrag workflow will not be used.
cpu – Number of CPU cores to use.
getinfo – If True, extract information related to the best hit of each query, such as the RGP it is in or the spots.
use_representatives – If True, use representative sequences of gene families rather than all sequences to align input genes.
draw_related – If True, draw figures and graphs in a gexf format of spots associated with the input sequences.
translation_table – Translation table ID for nucleotide sequences.
tmpdir – Temporary directory for intermediate files.
disable_bar – If True, disable the progress bar.
keep_tmp – If True, keep temporary files.

ppanggolin.align.alignOnPang.align_seq_to_pang(target_seq_file: Path | Iterable[Path], query_seq_files: Path | Iterable[Path], tmpdir: Path, cpu: int = 1, no_defrag: bool = False, identity: float = 0.8, coverage: float = 0.8, query_type: str = 'unknow', is_query_slf: bool = False, target_type: str = 'unknow', is_target_slf: bool = False, translation_table: int | None = None) → Path

Align FASTA sequences to pangenome sequences.

Parameters:

target_seq_file – File or iterable of files containing the pangenome sequences (target).
query_seq_files – File or iterable of files containing the input sequences (query).
tmpdir – Temporary directory in which the alignment is run.
cpu – Number of CPU cores available.
no_defrag – If True, do not apply defragmentation.
identity – Minimum identity threshold for the alignment.
coverage – Minimum coverage threshold for the alignment.
query_type – Sequence type of the query file. [nucleotide, protein, unknow]
is_query_slf – Whether the query file is a single-line FASTA. If True, the MMseqs2 database is created with a soft link.
target_type – Sequence type of the pangenome (target). [nucleotide, aminoacid, protein]
is_target_slf – Whether the pangenome (target) file is a single-line FASTA. If True, the MMseqs2 database is created with a soft link.
translation_table – Translation table to use if the sequences are nucleotide sequences and need to be translated.

Returns:

Alignment result file
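
The identity, coverage, and cpu parameters correspond naturally to options of an MMseqs2 search. The following is a minimal sketch of how such a command line could be assembled, without running it. The helper build_search_command and the database paths are hypothetical, and the options PPanGGOLiN actually passes to MMseqs2 may differ.

```python
from pathlib import Path
from typing import List


def build_search_command(query_db: Path, target_db: Path, result_db: Path,
                         tmpdir: Path, identity: float = 0.8,
                         coverage: float = 0.8, cpu: int = 1) -> List[str]:
    """Hypothetical helper: build (but do not run) an `mmseqs search` command."""
    # Both thresholds are fractions, so reject values outside [0, 1].
    for name, value in (("identity", identity), ("coverage", coverage)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return [
        "mmseqs", "search",
        str(query_db), str(target_db), str(result_db), str(tmpdir),
        "--min-seq-id", str(identity),  # minimum sequence identity
        "-c", str(coverage),            # minimum alignment coverage
        "--threads", str(cpu),          # number of CPU threads
    ]


# Placeholder database names, used only for illustration.
cmd = build_search_command(Path("query_db"), Path("target_db"),
                           Path("aln_db"), Path("tmp"))
print(" ".join(cmd))
```

Returning the command as a list rather than a string lets it be passed directly to subprocess.run without shell quoting concerns.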

ppanggolin.align.alignOnPang.draw_spot_gexf(spots: set, output: Path, multigenics: set, fam_to_mod: dict, set_size: int = 3)

Draw a GEXF graph of the spots.

Parameters:

spots – Spots found in the alignment between the pangenome and the input sequences.
output – Path to the output directory.
multigenics – Multigenic families.
fam_to_mod – Dictionary linking families to modules.
set_size –

ppanggolin.align.alignOnPang.get_fam_to_rgp(pangenome, multigenics: set) → dict

Associate families with the RGPs they belong to and with those they border.

Parameters:

pangenome – Input pangenome.
multigenics – Multigenic families.

Returns:

Dictionary linking families to RGPs.

ppanggolin.align.alignOnPang.get_fam_to_spot(pangenome: Pangenome, multigenics: Set[GeneFamily]) → Tuple[Dict[str, List[Spot]], Dict[str, List[Spot]]]

Reads a pangenome object to link families and spots and indicate where each family is.

Parameters:

pangenome – Input pangenome
multigenics – Multigenic families.

Returns:

A tuple of two dictionaries, each mapping a family to a list of spots.
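
Building family-to-spot lookups of this kind amounts to inverting a spot-to-families relation. The sketch below illustrates the idea with plain strings. The function invert_spot_families and its input layout are hypothetical stand-ins rather than the PPanGGOLiN objects, and the handling of multigenic families is omitted.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def invert_spot_families(
    spots: Dict[str, Tuple[Set[str], Set[str]]]
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Hypothetical helper: map each family to the spots containing or bordered by it.

    `spots` maps a spot name to (families inside the spot, families at its border).
    """
    fam_to_spot: Dict[str, List[str]] = defaultdict(list)
    fam_to_border: Dict[str, List[str]] = defaultdict(list)
    # Iterate in sorted order so the resulting lists are deterministic.
    for spot_name, (inside, bordering) in sorted(spots.items()):
        for fam in sorted(inside):
            fam_to_spot[fam].append(spot_name)
        for fam in sorted(bordering):
            fam_to_border[fam].append(spot_name)
    return dict(fam_to_spot), dict(fam_to_border)


# Made-up example data.
example_spots = {
    "spot_1": ({"famA", "famB"}, {"famC"}),
    "spot_2": ({"famB"}, {"famA"}),
}
fam_to_spot, fam_to_border = invert_spot_families(example_spots)
print(fam_to_spot)
print(fam_to_border)
```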

ppanggolin.align.alignOnPang.get_input_seq_to_family_with_all(pangenome: Pangenome, sequence_files: Path | Iterable[Path], output: Path, tmpdir: Path, input_type: str = 'unknow', is_input_slf: bool = False, cpu: int = 1, no_defrag: bool = False, identity: float = 0.8, coverage: float = 0.8, translation_table: int = 11, disable_bar: bool = False) → Tuple[Path, Dict[str, GeneFamily]]

Assign gene families from a pangenome to input sequences.

This function aligns input sequences to all genes of the pangenome using MMseqs2 and assigns them to gene families based on the alignment results.

Parameters:

pangenome – Annotated pangenome containing genes.
sequence_files – Path or iterable of paths to FASTA files containing the input sequences to align.
output – Path to the output directory where alignment results will be stored.
tmpdir – Temporary directory for intermediate files.
input_type – Sequence type of the input (query) file. [nucleotide, protein, unknow]
is_input_slf – Whether the input file is a single-line FASTA. If True, the MMseqs2 database is created with a soft link.
cpu – Number of CPU cores to use for the alignment (default: 1).
no_defrag – If True, the defragmentation workflow is skipped (default: False).
identity – Minimum identity threshold for the alignment (default: 0.8).
coverage – Minimum coverage threshold for the alignment (default: 0.8).
translation_table – Translation table to use if sequences need to be translated (default: 11).
disable_bar – If True, disable the progress bar.

Returns:

A tuple containing the path to the alignment result file, and a dictionary mapping input sequences to gene families.

ppanggolin.align.alignOnPang.get_input_seq_to_family_with_rep(pangenome: Pangenome, sequence_files: Path | Iterable[Path], output: Path, tmpdir: Path, input_type: str = 'unknow', is_input_slf: bool = False, cpu: int = 1, no_defrag: bool = False, identity: float = 0.8, coverage: float = 0.8, translation_table: int = 11, disable_bar: bool = False) → Tuple[Path, Dict[str, GeneFamily]]

Assign gene families from a pangenome to input sequences.

This function aligns input sequences to gene families in a pangenome using MMseqs2 and assigns them to appropriate gene families based on alignment results.

Parameters:

pangenome – Annotated pangenome containing gene families.
sequence_files – Path or iterable of paths to FASTA files containing the input sequences to align.
output – Path to the output directory where alignment results will be stored.
tmpdir – Temporary directory for intermediate files.
input_type – Sequence type of the input file. [nucleotide, aminoacid, unknow]
is_input_slf – Whether the input file is a single-line FASTA. If True, the MMseqs2 database is created with a soft link.
cpu – Number of CPU cores to use for the alignment (default: 1).
no_defrag – If True, the defragmentation workflow is skipped (default: False).
identity – Minimum identity threshold for the alignment (default: 0.8).
coverage – Minimum coverage threshold for the alignment (default: 0.8).
translation_table – Translation table to use if sequences need to be translated (default: 11).
disable_bar – If True, disable the progress bar.

Returns:

A tuple containing the path to the alignment result file, and a dictionary mapping input sequences to gene families.

ppanggolin.align.alignOnPang.get_seq_ids(seq_file: TextIOWrapper) → Tuple[Set[str], bool, bool]

Get sequence IDs from a sequence input file in FASTA format and guess the sequence type based on the first sequences.

Parameters:

seq_file – A file object containing sequences in FASTA format.

Returns:

A tuple containing a set of sequence IDs, a boolean indicating whether the sequences are nucleotide sequences, and a boolean indicating whether the file is a single-line FASTA.
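
Collecting IDs and guessing the sequence type from the first records can be sketched as follows. This is a simplified illustration, not the PPanGGOLiN implementation: read_fasta_ids, the sample_size parameter, and the nucleotide alphabet used for the guess are all assumptions, and the single-line FASTA check is left out.

```python
import io
from typing import Set, TextIO, Tuple

# Assumed alphabet for the nucleotide guess (no IUPAC ambiguity codes except N).
NUCLEOTIDE_ALPHABET = set("ACGTNacgtn")


def read_fasta_ids(seq_file: TextIO, sample_size: int = 5) -> Tuple[Set[str], bool]:
    """Hypothetical helper: collect sequence IDs and guess if they are nucleotides.

    Only residues from the first `sample_size` records are used for the guess.
    """
    seq_ids: Set[str] = set()
    sampled_residues: Set[str] = set()
    records_seen = 0
    for line in seq_file:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records_seen += 1
            header = line[1:].split()
            if header:
                # The ID is the first whitespace-delimited token after '>'.
                seq_ids.add(header[0])
        elif records_seen <= sample_size:
            sampled_residues.update(line)
    # Nucleotide if every sampled residue belongs to the assumed alphabet.
    is_nucleotide = bool(sampled_residues) and sampled_residues <= NUCLEOTIDE_ALPHABET
    return seq_ids, is_nucleotide


example = io.StringIO(">seq1 description\nATGCGT\nNNAC\n>seq2\nATGAAA\n")
ids, is_nt = read_fasta_ids(example)
print(sorted(ids), is_nt)
```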

ppanggolin.align.alignOnPang.get_seq_info(seq_to_pang: dict, pangenome: Pangenome, output: Path, draw_related: bool = False, disable_bar=False)

Get information about the input sequences after alignment.

Parameters:

seq_to_pang – Alignment result.
pangenome – Pangenome that contains the information.
output – Path to the output directory.
draw_related – If True, draw figures and graphs in GEXF format of the spots associated with the input sequences.
disable_bar – If True, disable the progress bar.

ppanggolin.align.alignOnPang.launch(args: Namespace)

Command launcher.

Parameters:

args – All arguments provided by the user.

ppanggolin.align.alignOnPang.map_input_gene_to_family_all_aln(aln_res: Path, outdir: Path, pangenome: Pangenome) → Tuple[Dict[str, GeneFamily], Path]

Read the alignment results to link input sequences to pangenome gene families. The alignment was performed against all genes of the pangenome.

Parameters:

aln_res – Alignment result file.
outdir – Output directory.
pangenome – Input pangenome.

Returns:

A tuple containing a dictionary linking input sequences to pangenome gene families, and the path to the cleaned alignment file.

ppanggolin.align.alignOnPang.map_input_gene_to_family_rep_aln(aln_res: Path, outdir: Path, pangenome: Pangenome) → Tuple[Dict[Any, GeneFamily], Path]

Read the alignment results to link input sequences to pangenome gene families. The alignment was performed against the representative sequences of the pangenome gene families.

Parameters:

aln_res – Alignment result file.
outdir – Output directory.
pangenome – Input pangenome.

Returns:

A tuple containing a dictionary linking input sequences to pangenome gene families, and the path to the cleaned alignment file.
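
Linking each input sequence to a family from tabular alignment results generally means keeping one hit per query. The sketch below assumes a tab-separated file whose first two columns are the query and target names and whose last column is the bit score, as in the default MMseqs2 tabular output. The column layout PPanGGOLiN actually uses is not stated here, and best_hit_per_query is a hypothetical helper.

```python
import csv
import io
from typing import Dict, TextIO, Tuple


def best_hit_per_query(aln_file: TextIO) -> Dict[str, Tuple[str, float]]:
    """Hypothetical helper: keep the highest-scoring target for each query.

    Assumes column 1 = query, column 2 = target, last column = bit score.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for row in csv.reader(aln_file, delimiter="\t"):
        if not row:
            continue
        query, target, bitscore = row[0], row[1], float(row[-1])
        # Replace the stored hit only when this one scores strictly higher.
        if query not in best or bitscore > best[query][1]:
            best[query] = (target, bitscore)
    return best


# Made-up alignment rows; here the targets are already family names.
aln = io.StringIO(
    "q1\tfamA\t0.95\t100\t5\t0\t1\t100\t1\t100\t1e-50\t180\n"
    "q1\tfamB\t0.85\t100\t15\t0\t1\t100\t1\t100\t1e-40\t150\n"
    "q2\tfamB\t0.90\t80\t8\t0\t1\t80\t1\t80\t1e-30\t120\n"
)
hits = best_hit_per_query(aln)
seq_to_family = {query: target for query, (target, _) in hits.items()}
print(seq_to_family)
```

When the alignment is run against all genes rather than family representatives, the target would be a gene name, and an extra gene-to-family lookup would be needed before building the final dictionary.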

ppanggolin.align.alignOnPang.parser_align(parser: ArgumentParser)

Parser for the arguments specific to the align command.

Parameters:

parser – Parser for the align arguments.

ppanggolin.align.alignOnPang.project_and_write_partition(seqid_to_gene_family: Dict[str, GeneFamily], seq_set: Set[str], output: Path) → Path

Project the partition onto each sequence of the input file and write the result to a file.

Parameters:

seqid_to_gene_family – Dictionary linking input sequences to pangenome gene families.
seq_set – Input sequences.
output – Path to the output directory.

Returns:

Path to the file containing the partition projection.
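
The projection step can be pictured as writing one line per input sequence with the partition of the family it was assigned to. The sketch below is only an illustration under stated assumptions: FamilyStub is a hypothetical stand-in for a gene family, and the output file name and the label used for sequences without a family are invented for the example.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, Set


@dataclass
class FamilyStub:
    """Hypothetical stand-in for a gene family carrying a partition label."""
    name: str
    partition: str


def write_partition_projection(seqid_to_family: Dict[str, FamilyStub],
                               seq_set: Set[str], output: Path) -> Path:
    """Hypothetical helper: write `sequence<TAB>partition` lines to a TSV file."""
    out_file = output / "partition_projection.tsv"  # assumed file name
    with open(out_file, "w") as handle:
        for seq_id in sorted(seqid_to_family):
            handle.write(f"{seq_id}\t{seqid_to_family[seq_id].partition}\n")
        # Sequences with no assigned family get a placeholder label (assumption).
        for seq_id in sorted(seq_set - set(seqid_to_family)):
            handle.write(f"{seq_id}\tunassigned\n")
    return out_file


with tempfile.TemporaryDirectory() as tmp:
    path = write_partition_projection(
        {"seq1": FamilyStub("famA", "persistent")}, {"seq1", "seq2"}, Path(tmp))
    print(path.read_text())
```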

ppanggolin.align.alignOnPang.subparser(sub_parser: _SubParsersAction) → ArgumentParser

Subparser to launch PPanGGOLiN from the command line.

Parameters:

sub_parser – Subparser for the align command.

Returns:

Parser arguments for the align command.

ppanggolin.align.alignOnPang.write_all_gene_sequences(pangenome: Pangenome, output: Path, add: str = '', disable_bar: bool = False)

Export the sequences of the pangenome genes.

Parameters:

pangenome – Pangenome containing genes.
output – Path to the file where sequences will be written.
add – Prefix to add to each sequence name.
disable_bar – If True, disable the progress bar.
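
The effect of the add prefix on sequence names can be illustrated with a small sketch. The function write_fasta and its dictionary input are hypothetical and do not use the PPanGGOLiN objects. Each sequence is written on a single line.

```python
import io
from typing import Dict, TextIO


def write_fasta(sequences: Dict[str, str], handle: TextIO, add: str = "") -> None:
    """Hypothetical helper: write sequences as single-line FASTA records.

    `add` is prepended to every sequence name in the header line.
    """
    for name, seq in sequences.items():
        handle.write(f">{add}{name}\n{seq}\n")


# Made-up gene names and sequences.
buffer = io.StringIO()
write_fasta({"gene_1": "ATGAAATAG", "gene_2": "ATGCCCTGA"}, buffer, add="ppanggolin_")
print(buffer.getvalue())
```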

ppanggolin.align.alignOnPang.write_gene_fam_sequences(pangenome: Pangenome, output: Path, add: str = '', disable_bar: bool = False)

Export the sequences of the gene families.

Parameters:

pangenome – Pangenome containing gene families.
output – Path to the file where sequences will be written.
add – Prefix to add to each sequence name.
disable_bar – If True, disable the progress bar.

ppanggolin.align.alignOnPang.write_gene_to_gene_family(seqid_to_gene_family: Dict[str, GeneFamily], seq_set: Set[str], output: Path) → Path

Write the mapping from input genes to pangenome gene families.

Parameters:

seqid_to_gene_family – Dictionary linking input sequences to pangenome gene families.
seq_set – Input sequences.
output – Path to the output directory.

Returns:

Path to the file which contains gene to gene family projection results
