ppanggolin.align package

Submodules

ppanggolin.align.alignOnPang module

ppanggolin.align.alignOnPang.align(pangenome: Pangenome, sequence_file: Path, output: Path, identity: float = 0.8, coverage: float = 0.8, no_defrag: bool = False, cpu: int = 1, getinfo: bool = False, use_representatives: bool = False, draw_related: bool = False, translation_table: int = 11, tmpdir: Path | None = None, disable_bar: bool = False, keep_tmp=False)

Aligns pangenome sequences with sequences in a FASTA file using MMseqs2.

Parameters:
  • pangenome – Pangenome object containing gene families to align with the input sequences.

  • sequence_file – Path to a FASTA file containing sequences to align with the pangenome.

  • output – Path to the output directory.

  • identity – Minimum identity threshold for the alignment.

  • coverage – Minimum coverage threshold for the alignment.

  • no_defrag – If True, the defrag workflow will not be used.

  • cpu – Number of CPU cores to use.

  • getinfo – If True, extract information related to the best hit of each query, such as the RGP it is in or the spots.

  • use_representatives – If True, use representative sequences of gene families rather than all sequences to align input genes.

  • draw_related – If True, draw figures and GEXF graphs of the spots associated with the input sequences.

  • translation_table – Translation table ID for nucleotide sequences.

  • tmpdir – Temporary directory for intermediate files.

  • disable_bar – If True, disable the progress bar.

  • keep_tmp – If True, keep temporary files.

ppanggolin.align.alignOnPang.align_seq_to_pang(target_seq_file: Path | Iterable[Path], query_seq_files: Path | Iterable[Path], tmpdir: Path, cpu: int = 1, no_defrag: bool = False, identity: float = 0.8, coverage: float = 0.8, query_type: str = 'unknow', is_query_slf: bool = False, target_type: str = 'unknow', is_target_slf: bool = False, translation_table: int | None = None) → Path

Align FASTA sequences to pangenome sequences.

Parameters:
  • target_seq_file – File, or iterable of files, containing the pangenome sequences (target).

  • query_seq_files – File, or iterable of files, containing the input sequences (query).

  • tmpdir – Temporary directory in which the sequences are aligned.

  • cpu – Number of CPU cores to use.

  • no_defrag – If True, do not apply defragmentation.

  • identity – Minimum identity threshold for the alignment.

  • coverage – Minimum coverage threshold for the alignment.

  • query_type – Sequence type of the query file. [nucleotide, protein, unknow]

  • is_query_slf – Whether the query sequence file is in single-line FASTA format. If True, the MMseqs2 database is created with a soft link.

  • target_type – Sequence type of the pangenome (target). [nucleotide, aminoacid, protein]

  • is_target_slf – Whether the pangenome (target) sequence file is in single-line FASTA format. If True, the MMseqs2 database is created with a soft link.

  • translation_table – Translation table to use if the sequences are nucleotide and need to be translated.

Returns:

Alignment result file
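To illustrate how the identity and coverage thresholds act on alignment hits, here is a minimal sketch in plain Python. It does not call MMseqs2 or PPanGGOLiN. The Hit tuple, its field names, the filter_hits helper and the coverage definition (aligned length divided by target length) are all assumptions made for illustration.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """Hypothetical alignment hit; not a PPanGGOLiN or MMseqs2 type."""
    query: str
    target: str
    identity: float      # fraction of identical positions, between 0 and 1
    aln_length: int      # number of aligned positions
    target_length: int   # full length of the target sequence


def filter_hits(hits: List[Hit], identity: float = 0.8, coverage: float = 0.8) -> List[Hit]:
    """Keep hits that meet both thresholds (coverage assumed relative to the target)."""
    kept = []
    for hit in hits:
        hit_coverage = hit.aln_length / hit.target_length
        if hit.identity >= identity and hit_coverage >= coverage:
            kept.append(hit)
    return kept


hits = [
    Hit("q1", "geneA", 0.95, 90, 100),   # passes both thresholds
    Hit("q2", "geneB", 0.70, 100, 100),  # identity too low
    Hit("q3", "geneC", 0.99, 50, 100),   # coverage too low
]
print([h.query for h in filter_hits(hits)])
```

With the default thresholds of 0.8, only the first hit is kept; lowering either threshold admits more hits.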

ppanggolin.align.alignOnPang.draw_spot_gexf(spots: set, output: Path, multigenics: set, fam_to_mod: dict, set_size: int = 3)

Draw a GEXF graph of the spots.

Parameters:
  • spots – Spots found in the alignment between the pangenome and the input sequences.

  • output – Path to the output directory.

  • multigenics – Multigenic families.

  • fam_to_mod – Dictionary linking families to modules.

  • set_size

ppanggolin.align.alignOnPang.get_fam_to_rgp(pangenome, multigenics: set) → dict

Associate families with the RGPs they belong to and with those they border.

Parameters:
  • pangenome – Input pangenome

  • multigenics – Multigenic families.

Returns:

Dictionary linking families to RGPs
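The idea of linking a family both to the RGPs that contain it and to the RGPs it borders can be sketched with plain dictionaries. This is not PPanGGOLiN's implementation: the input layout, the families_to_rgps helper, and the choice to skip multigenic families when recording borders are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical input: RGP name -> (families inside the RGP, families bordering it).
rgps: Dict[str, Tuple[List[str], List[str]]] = {
    "RGP_1": (["famA", "famB"], ["famX", "famY"]),
    "RGP_2": (["famB", "famC"], ["famY", "famZ"]),
}
multigenics: Set[str] = {"famY"}


def families_to_rgps(
    rgps: Dict[str, Tuple[List[str], List[str]]], multigenics: Set[str]
) -> Dict[str, List[str]]:
    """Link each family to the RGPs it belongs to or borders."""
    fam_to_rgp: Dict[str, List[str]] = defaultdict(list)
    for name, (members, borders) in rgps.items():
        for fam in members:
            fam_to_rgp[fam].append(name)
        for fam in borders:
            # Assumption: multigenic families are not used as border markers.
            if fam not in multigenics:
                fam_to_rgp[fam].append(name)
    return dict(fam_to_rgp)


fam_to_rgp = families_to_rgps(rgps, multigenics)
print(fam_to_rgp)
```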

ppanggolin.align.alignOnPang.get_fam_to_spot(pangenome: Pangenome, multigenics: Set[GeneFamily]) → Tuple[Dict[str, List[Spot]], Dict[str, List[Spot]]]

Reads a pangenome object to link families and spots and indicate where each family is.

Parameters:
  • pangenome – Input pangenome

  • multigenics – Multigenic families.

Returns:

A tuple of two dictionaries, each mapping a family to a list of spots

ppanggolin.align.alignOnPang.get_input_seq_to_family_with_all(pangenome: Pangenome, sequence_files: Path | Iterable[Path], output: Path, tmpdir: Path, input_type: str = 'unknow', is_input_slf: bool = False, cpu: int = 1, no_defrag: bool = False, identity: float = 0.8, coverage: float = 0.8, translation_table: int = 11, disable_bar: bool = False) → Tuple[Path, Dict[str, GeneFamily]]

Assign gene families from a pangenome to input sequences.

This function aligns input sequences to all genes of the pangenome using MMseqs2 and assigns them to gene families based on the alignment results.

Parameters:
  • pangenome – Annotated pangenome containing genes.

  • sequence_files – Path, or iterable of paths, to the FASTA files containing the input sequences to align.

  • output – Path to the output directory where alignment results will be stored.

  • tmpdir – Temporary directory for intermediate files.

  • input_type – Sequence type of the input file (query). [nucleotide, protein, unknow]

  • is_input_slf – Whether the sequence file is in single-line FASTA format. If True, the MMseqs2 database is created with a soft link.

  • cpu – Number of CPU cores to use for the alignment (default: 1).

  • no_defrag – If True, the defragmentation workflow is skipped (default: False).

  • identity – Minimum identity threshold for the alignment (default: 0.8).

  • coverage – Minimum coverage threshold for the alignment (default: 0.8).

  • translation_table – Translation table to use if sequences need to be translated (default: 11).

  • disable_bar – If True, disable the progress bar.

Returns:

A tuple containing the path to the alignment result file, and a dictionary mapping input sequences to gene families.

ppanggolin.align.alignOnPang.get_input_seq_to_family_with_rep(pangenome: Pangenome, sequence_files: Path | Iterable[Path], output: Path, tmpdir: Path, input_type: str = 'unknow', is_input_slf: bool = False, cpu: int = 1, no_defrag: bool = False, identity: float = 0.8, coverage: float = 0.8, translation_table: int = 11, disable_bar: bool = False) → Tuple[Path, Dict[str, GeneFamily]]

Assign gene families from a pangenome to input sequences.

This function aligns input sequences to gene families in a pangenome using MMseqs2 and assigns them to appropriate gene families based on alignment results.

Parameters:
  • pangenome – Annotated pangenome containing gene families.

  • sequence_files – Path, or iterable of paths, to the FASTA files containing the input sequences to align.

  • output – Path to the output directory where alignment results will be stored.

  • tmpdir – Temporary directory for intermediate files.

  • input_type – Type of input sequence file. [nucleotide, aminoacid, unknow]

  • is_input_slf – Whether the sequence file is in single-line FASTA format. If True, the MMseqs2 database is created with a soft link.

  • cpu – Number of CPU cores to use for the alignment (default: 1).

  • no_defrag – If True, the defragmentation workflow is skipped (default: False).

  • identity – Minimum identity threshold for the alignment (default: 0.8).

  • coverage – Minimum coverage threshold for the alignment (default: 0.8).

  • translation_table – Translation table to use if sequences need to be translated (default: 11).

  • disable_bar – If True, disable the progress bar.

Returns:

A tuple containing the path to the alignment result file, and a dictionary mapping input sequences to gene families.

ppanggolin.align.alignOnPang.get_seq_ids(seq_file: TextIOWrapper) → Tuple[Set[str], bool, bool]

Get sequence IDs from a sequence input file in FASTA format and guess the sequence type based on the first sequences.

Parameters:

seq_file – A file object containing sequences in FASTA format.

Returns:

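A tuple containing a set of sequence IDs, a boolean indicating whether the sequences are nucleotide sequences, and a second boolean, presumably indicating whether the file is in single-line FASTA format (compare the is_input_slf parameters in this module).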
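As a rough illustration of collecting FASTA IDs while guessing the sequence type, the sketch below uses only the standard library. It is not the PPanGGOLiN implementation: the read_seq_ids name, the n_check parameter, the DNA alphabet test, and the reading of the third return value as a single-line FASTA flag are assumptions.

```python
from io import StringIO
from typing import Set, TextIO, Tuple

DNA_ALPHABET = set("ACGTNacgtn")


def read_seq_ids(seq_file: TextIO, n_check: int = 20) -> Tuple[Set[str], bool, bool]:
    """Return (IDs, looks_nucleotide, single_line_fasta) for a FASTA handle."""
    seq_ids: Set[str] = set()
    looks_nucleotide = True
    single_line = True
    lines_in_record = 0
    for line in seq_file:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                # The ID is the first whitespace-delimited token of the header.
                seq_ids.add(fields[0])
            lines_in_record = 0
        else:
            lines_in_record += 1
            if lines_in_record > 1:
                single_line = False
            # Only the first n_check records are used to guess the type.
            if len(seq_ids) <= n_check and not set(line) <= DNA_ALPHABET:
                looks_nucleotide = False
    return seq_ids, looks_nucleotide, single_line


dna = StringIO(">seq1 some description\nACGT\nACGT\n>seq2\nTTGA\n")
protein = StringIO(">p1\nMKVL\n")
print(read_seq_ids(dna))
print(read_seq_ids(protein))
```

The first file spreads seq1 over two lines, so it is reported as nucleotide but not single-line; the second is reported as non-nucleotide and single-line.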

ppanggolin.align.alignOnPang.get_seq_info(seq_to_pang: dict, pangenome: Pangenome, output: Path, draw_related: bool = False, disable_bar=False)

Get sequences information after alignment

Parameters:
  • seq_to_pang – Alignment result.

  • pangenome – Pangenome containing the information.

  • output – Path to the output directory.

  • draw_related – If True, draw figures and GEXF graphs of the spots associated with the input sequences.

  • disable_bar – If True, disable the progress bar.

ppanggolin.align.alignOnPang.launch(args: Namespace)

Command launcher

Parameters:

args – All arguments provided by the user

ppanggolin.align.alignOnPang.map_input_gene_to_family_all_aln(aln_res: Path, outdir: Path, pangenome: Pangenome) → Tuple[Dict[str, GeneFamily], Path]

Read the alignment result to link input sequences to pangenome gene families. The alignment was made against all genes of the pangenome.

Parameters:
  • aln_res – Alignment result file

  • outdir – Output directory

  • pangenome – Input pangenome

Returns:

A dictionary linking input sequences to pangenome gene families, and the path to the cleaned alignment file

ppanggolin.align.alignOnPang.map_input_gene_to_family_rep_aln(aln_res: Path, outdir: Path, pangenome: Pangenome) → Tuple[Dict[Any, GeneFamily], Path]

Read the alignment result to link input sequences to pangenome gene families. The alignment was made against the representative sequences of the gene families of the pangenome.

Parameters:
  • aln_res – Alignment result file

  • outdir – Output directory

  • pangenome – Input pangenome

Returns:

A dictionary linking input sequences to pangenome gene families, and the path to the cleaned alignment file
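Both mapping functions reduce a tabular alignment result to one gene family per input sequence. A minimal sketch of that reduction, keeping the highest-scoring hit per query, is shown below. The column layout (query first, target second, bit score last), the best_hit_families helper and the target_to_family lookup are assumptions for illustration; they are not taken from the PPanGGOLiN code.

```python
import csv
from io import StringIO
from typing import Dict, TextIO


def best_hit_families(aln: TextIO, target_to_family: Dict[str, str]) -> Dict[str, str]:
    """Map each query to the family of its highest-scoring target."""
    best_score: Dict[str, float] = {}
    seq_to_family: Dict[str, str] = {}
    for row in csv.reader(aln, delimiter="\t"):
        if not row:
            continue
        # Assumed layout: query first, target second, bit score last.
        query, target, bits = row[0], row[1], float(row[-1])
        if query not in best_score or bits > best_score[query]:
            best_score[query] = bits
            seq_to_family[query] = target_to_family[target]
    return seq_to_family


# Hypothetical alignment rows in a 12-column tab-separated layout.
alignment = StringIO(
    "q1\tg1\t0.95\t100\t5\t0\t1\t100\t1\t100\t1e-50\t180\n"
    "q1\tg2\t0.99\t100\t1\t0\t1\t100\t1\t100\t1e-60\t200\n"
    "q2\tg3\t0.85\t80\t12\t0\t1\t80\t1\t80\t1e-30\t120\n"
)
target_to_family = {"g1": "famA", "g2": "famB", "g3": "famA"}
result = best_hit_families(alignment, target_to_family)
print(result)
```

When aligning against all genes, the lookup goes from a gene to its family; when aligning against representatives, each target already stands for one family, so the lookup is a direct name match.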

ppanggolin.align.alignOnPang.parser_align(parser: ArgumentParser)

Parser for the arguments specific to the align command

Parameters:

parser – Parser for the align arguments

ppanggolin.align.alignOnPang.project_and_write_partition(seqid_to_gene_family: Dict[str, GeneFamily], seq_set: Set[str], output: Path) → Path

Project the partition of each sequence from the input file and write the results to a file.

Parameters:
  • seqid_to_gene_family – Dictionary linking each input sequence to its pangenome gene family.

  • seq_set – Input sequences.

  • output – Path to the output directory.

Returns:

Path to the file containing the partition projection
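Projecting partitions amounts to writing, for each input sequence that was assigned a family, the partition of that family. The sketch below shows the idea with plain strings. The write_partition_projection helper, the output file name, the two-column layout and the decision to skip unassigned sequences are assumptions, not the behaviour of PPanGGOLiN.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, Set


def write_partition_projection(
    seq_to_partition: Dict[str, str], seq_set: Set[str], output: Path
) -> Path:
    """Write one 'sequence<TAB>partition' line per assigned input sequence."""
    out_file = output / "partition_projection_sketch.tsv"  # hypothetical file name
    with open(out_file, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for seq_id in sorted(seq_set):
            # Assumption: sequences without an assigned family are skipped.
            if seq_id in seq_to_partition:
                writer.writerow([seq_id, seq_to_partition[seq_id]])
    return out_file


with tempfile.TemporaryDirectory() as tmp:
    path = write_partition_projection(
        {"q1": "persistent", "q2": "cloud"}, {"q1", "q2", "q3"}, Path(tmp)
    )
    content = path.read_text()
print(content)
```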

ppanggolin.align.alignOnPang.subparser(sub_parser: _SubParsersAction) → ArgumentParser

Subparser to launch PPanGGOLiN from the command line

Parameters:

sub_parser – Subparser for the align command

Returns:

Parser arguments for the align command

ppanggolin.align.alignOnPang.write_all_gene_sequences(pangenome: Pangenome, output: Path, add: str = '', disable_bar: bool = False)

Export the sequences of pangenome genes

Parameters:
  • pangenome – Pangenome containing genes

  • output – Path to the file where sequences will be written

  • add – Prefix to add to each sequence name

  • disable_bar – If True, disable the progress bar.

ppanggolin.align.alignOnPang.write_gene_fam_sequences(pangenome: Pangenome, output: Path, add: str = '', disable_bar: bool = False)

Export the sequences of gene families

Parameters:
  • pangenome – Pangenome containing families

  • output – Path to the file where sequences will be written

  • add – Prefix to add to each sequence name

  • disable_bar – If True, disable the progress bar.
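The effect of the add prefix on exported sequence names can be sketched as follows. This is a stand-alone illustration, not the PPanGGOLiN writer: the write_fasta helper, the record dictionary and the prefix value are hypothetical.

```python
from io import StringIO
from typing import Dict, TextIO


def write_fasta(records: Dict[str, str], handle: TextIO, add: str = "") -> None:
    """Write single-line FASTA records, prepending `add` to each name."""
    for name, sequence in records.items():
        handle.write(f">{add}{name}\n{sequence}\n")


buffer = StringIO()
# The prefix below is an arbitrary example value.
write_fasta({"geneA": "ATGAAATAA", "geneB": "ATGCCCTAG"}, buffer, add="pang_")
fasta_text = buffer.getvalue()
print(fasta_text)
```

A prefix of this kind keeps pangenome sequence names distinct from input sequence names when both end up in the same alignment.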

ppanggolin.align.alignOnPang.write_gene_to_gene_family(seqid_to_gene_family: Dict[str, GeneFamily], seq_set: Set[str], output: Path) → Path

Write input gene to pangenome gene family.

Parameters:
  • seqid_to_gene_family – dictionary which links input sequence and pangenome gene family

  • seq_set – input sequences

  • output – Path of the output directory

Returns:

Path to the file which contains gene to gene family projection results

Module contents