ppanggolin.align package
Submodules
ppanggolin.align.alignOnPang module
- ppanggolin.align.alignOnPang.align(pangenome: Pangenome, sequence_file: Path, output: Path, identity: float = 0.8, coverage: float = 0.8, no_defrag: bool = False, cpu: int = 1, getinfo: bool = False, use_representatives: bool = False, draw_related: bool = False, translation_table: int = 11, tmpdir: Path | None = None, disable_bar: bool = False, keep_tmp=False)
Aligns pangenome sequences with sequences in a FASTA file using MMseqs2.
- Parameters:
pangenome – Pangenome object containing gene families to align with the input sequences.
sequence_file – Path to a FASTA file containing sequences to align with the pangenome.
output – Path to the output directory.
identity – Minimum identity threshold for the alignment.
coverage – Minimum coverage threshold for the alignment.
no_defrag – If True, the defrag workflow will not be used.
cpu – Number of CPU cores to use.
getinfo – If True, extract information related to the best hit of each query, such as the RGP it is in or the spots.
use_representatives – If True, use representative sequences of gene families rather than all sequences to align input genes.
draw_related – If True, draw figures and graphs in a gexf format of spots associated with the input sequences.
translation_table – Translation table ID for nucleotide sequences.
tmpdir – Temporary directory for intermediate files.
disable_bar – If True, disable the progress bar.
keep_tmp – If True, keep temporary files.
- ppanggolin.align.alignOnPang.align_seq_to_pang(target_seq_file: Path | Iterable[Path], query_seq_files: Path | Iterable[Path], tmpdir: Path, cpu: int = 1, no_defrag: bool = False, identity: float = 0.8, coverage: float = 0.8, query_type: str = 'unknow', is_query_slf: bool = False, target_type: str = 'unknow', is_target_slf: bool = False, translation_table: int | None = None) Path
Align FASTA sequences to pangenome sequences.
- Parameters:
target_seq_file – File or files containing the pangenome sequences (target).
query_seq_files – File or files containing the input sequences (query).
tmpdir – Temporary directory in which the alignment is run.
cpu – Number of available CPU cores.
no_defrag – If True, do not apply defragmentation.
identity – Minimum identity threshold for the alignment.
coverage – Minimum coverage threshold for the alignment.
query_type – Sequence type of the query file. [nucleotide, protein, unknow]
is_query_slf – True if the query file is a single-line FASTA. If True, the MMseqs2 database is created with a soft link.
target_type – Sequence type of the pangenome (target). [nucleotide, aminoacid, protein]
is_target_slf – True if the pangenome (target) sequence file is a single-line FASTA. If True, the MMseqs2 database is created with a soft link.
translation_table – Translation table to use if the sequences are nucleotide and need to be translated.
- Returns:
Alignment result file
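The way cpu, identity, and coverage could reach MMseqs2 can be sketched as follows. This is a minimal illustration and not PPanGGOLiN's actual code: `build_search_command` is a hypothetical helper, the database paths are placeholders, and the sketch assumes the thresholds map onto the `mmseqs search` options `--min-seq-id`, `-c`, and `--threads`. The effect of no_defrag and translation_table is not modeled.

```python
from pathlib import Path
from typing import List


def build_search_command(query_db: Path, target_db: Path, result_db: Path,
                         tmpdir: Path, cpu: int = 1, identity: float = 0.8,
                         coverage: float = 0.8) -> List[str]:
    """Hypothetical helper: assemble an `mmseqs search` call as an argument list."""
    if not 0.0 <= identity <= 1.0 or not 0.0 <= coverage <= 1.0:
        raise ValueError("identity and coverage must be fractions in [0, 1]")
    return [
        "mmseqs", "search",
        str(query_db), str(target_db), str(result_db), str(tmpdir),
        "--min-seq-id", str(identity),  # minimum sequence identity
        "-c", str(coverage),            # minimum alignment coverage
        "--threads", str(cpu),          # number of CPU threads
    ]


# Placeholder database names; nothing is executed here.
cmd = build_search_command(Path("query_db"), Path("target_db"),
                           Path("aln_db"), Path("tmp"), cpu=4)
print(" ".join(cmd))
```

Returning a list rather than a single string lets the command be passed to `subprocess.run` without shell quoting issues.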
- ppanggolin.align.alignOnPang.draw_spot_gexf(spots: set, output: Path, multigenics: set, fam_to_mod: dict, set_size: int = 3)
Draw a gexf graph of the spots.
- Parameters:
spots – Spots found in the alignment between the pangenome and the input sequences.
output – Path to the output directory.
multigenics – Multigenic families.
fam_to_mod – Dictionary linking families to modules.
set_size –
- ppanggolin.align.alignOnPang.get_fam_to_rgp(pangenome, multigenics: set) dict
Associate families with the RGPs they belong to and with those they border.
- Parameters:
pangenome – Input pangenome.
multigenics – Multigenic families.
- Returns:
Dictionary linking families to RGPs.
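The inversion from RGPs to a family-keyed dictionary can be sketched with plain data structures. This is an illustration under stated assumptions, not the library's implementation: `link_families_to_rgps` is a hypothetical name, families and RGPs are represented by strings instead of `GeneFamily` and `Region` objects, and multigenic families are assumed to be skipped when collecting bordering families.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def link_families_to_rgps(rgps: Dict[str, Tuple[Set[str], Set[str]]],
                          multigenics: Set[str]) -> Dict[str, List[str]]:
    """Map each family name to the RGPs it is inside of or bordering.

    `rgps` maps an RGP name to (families inside, families on the border).
    """
    fam_to_rgp = defaultdict(list)
    for rgp_name in sorted(rgps):
        inside, bordering = rgps[rgp_name]
        # Assumption: multigenic families are not used as borders.
        for family in sorted(inside | (bordering - multigenics)):
            fam_to_rgp[family].append(rgp_name)
    return dict(fam_to_rgp)


example_rgps = {
    "RGP_1": ({"famA", "famB"}, {"famC", "famM"}),
    "RGP_2": ({"famB"}, {"famD"}),
}
fam_to_rgp = link_families_to_rgps(example_rgps, multigenics={"famM"})
print(fam_to_rgp)
```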
- ppanggolin.align.alignOnPang.get_fam_to_spot(pangenome: Pangenome, multigenics: Set[GeneFamily]) Tuple[Dict[str, List[Spot]], Dict[str, List[Spot]]]
Reads a pangenome object to link families and spots and indicate where each family is.
- Parameters:
pangenome – Input pangenome
multigenics – Multigenic families.
- Returns:
Dictionary of family to RGP and family to spot
- ppanggolin.align.alignOnPang.get_input_seq_to_family_with_all(pangenome: Pangenome, sequence_files: Path | Iterable[Path], output: Path, tmpdir: Path, input_type: str = 'unknow', is_input_slf: bool = False, cpu: int = 1, no_defrag: bool = False, identity: float = 0.8, coverage: float = 0.8, translation_table: int = 11, disable_bar: bool = False) Tuple[Path, Dict[str, GeneFamily]]
Assign gene families from a pangenome to input sequences.
This function aligns input sequences to all genes of the pangenome using MMseqs2 and assigns them to gene families based on the alignment results.
- Parameters:
pangenome – Annotated pangenome containing genes.
sequence_files – Iterable of paths of FASTA files containing input sequences to align.
output – Path to the output directory where alignment results will be stored.
tmpdir – Temporary directory for intermediate files.
input_type – Sequence type of the input file (query). [nucleotide, protein, unknow]
is_input_slf – True if the input file is a single-line FASTA. If True, the MMseqs2 database is created with a soft link.
cpu – Number of CPU cores to use for the alignment (default: 1).
no_defrag – If True, the defragmentation workflow is skipped (default: False).
identity – Minimum identity threshold for the alignment (default: 0.8).
coverage – Minimum coverage threshold for the alignment (default: 0.8).
translation_table – Translation table to use if sequences need to be translated (default: 11).
disable_bar – If True, disable the progress bar.
- Returns:
A tuple containing the path to the alignment result file, and a dictionary mapping input sequences to gene families.
- ppanggolin.align.alignOnPang.get_input_seq_to_family_with_rep(pangenome: Pangenome, sequence_files: Path | Iterable[Path], output: Path, tmpdir: Path, input_type: str = 'unknow', is_input_slf: bool = False, cpu: int = 1, no_defrag: bool = False, identity: float = 0.8, coverage: float = 0.8, translation_table: int = 11, disable_bar: bool = False) Tuple[Path, Dict[str, GeneFamily]]
Assign gene families from a pangenome to input sequences.
This function aligns input sequences to gene families in a pangenome using MMseqs2 and assigns them to appropriate gene families based on alignment results.
- Parameters:
pangenome – Annotated pangenome containing gene families.
sequence_files – Iterable of paths of FASTA files containing input sequences to align.
output – Path to the output directory where alignment results will be stored.
tmpdir – Temporary directory for intermediate files.
input_type – Type of input sequence file. [nucleotide, aminoacid, unknow]
is_input_slf – True if the input file is a single-line FASTA. If True, the MMseqs2 database is created with a soft link.
cpu – Number of CPU cores to use for the alignment (default: 1).
no_defrag – If True, the defragmentation workflow is skipped (default: False).
identity – Minimum identity threshold for the alignment (default: 0.8).
coverage – Minimum coverage threshold for the alignment (default: 0.8).
translation_table – Translation table to use if sequences need to be translated (default: 11).
disable_bar – If True, disable the progress bar.
- Returns:
A tuple containing the path to the alignment result file, and a dictionary mapping input sequences to gene families.
- ppanggolin.align.alignOnPang.get_seq_ids(seq_file: TextIOWrapper) Tuple[Set[str], bool, bool]
Get sequence IDs from a sequence input file in FASTA format and guess the sequence type based on the first sequences.
- Parameters:
seq_file – A file object containing sequences in FASTA format.
- Returns:
A tuple containing a set of sequence IDs, a boolean indicating whether the sequences are nucleotide sequences, and a boolean indicating whether the file is a single-line FASTA.
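One way to collect IDs and guess the sequence type from the first records is sketched below. It is not the library's code: `read_ids_and_guess_type`, the `max_seqs` parameter, and the nucleotide alphabet are assumptions made for illustration, and the third returned value (single-line FASTA) is inferred from the three-element return type in the signature above.

```python
import io
from typing import Set, TextIO, Tuple

NUCLEOTIDES = set("ACGTNacgtn")  # assumed alphabet for the guess


def read_ids_and_guess_type(seq_file: TextIO,
                            max_seqs: int = 5) -> Tuple[Set[str], bool, bool]:
    """Return (sequence IDs, looks like nucleotide, is single-line FASTA)."""
    seq_ids = set()
    sampled = []           # first `max_seqs` sequences, used for the guess
    lines_per_record = []  # number of sequence lines under each header
    for raw_line in seq_file:
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                seq_ids.add(fields[0])  # ID is the first word of the header
            lines_per_record.append(0)
            if len(sampled) < max_seqs:
                sampled.append("")
        elif lines_per_record:
            lines_per_record[-1] += 1
            if len(lines_per_record) <= max_seqs:
                sampled[-1] += line
    is_nucleotide = bool(sampled) and all(set(s) <= NUCLEOTIDES for s in sampled)
    single_line = all(count <= 1 for count in lines_per_record)
    return seq_ids, is_nucleotide, single_line


fasta = ">seq1 some description\nATGC\nGGTA\n>seq2\nATGAAA\n"
ids, is_nucleotide, single_line = read_ids_and_guess_type(io.StringIO(fasta))
protein_result = read_ids_and_guess_type(io.StringIO(">p1\nMKVLAA\n"))
```

A guess based on the alphabet alone can misclassify a short protein made only of A, C, G, T, and N, which is why an explicit input type is worth passing when it is known.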
- ppanggolin.align.alignOnPang.get_seq_info(seq_to_pang: dict, pangenome: Pangenome, output: Path, draw_related: bool = False, disable_bar=False)
Get information on the input sequences after alignment.
- Parameters:
seq_to_pang – Alignment result.
pangenome – Pangenome containing the information.
output – Path to the output directory.
draw_related – If True, draw figures and graphs in a gexf format of spots associated with the input sequences.
disable_bar – If True, disable the progress bar.
- Returns:
- ppanggolin.align.alignOnPang.launch(args: Namespace)
Command launcher
- Parameters:
args – All arguments provided by the user.
- ppanggolin.align.alignOnPang.map_input_gene_to_family_all_aln(aln_res: Path, outdir: Path, pangenome: Pangenome) Tuple[Dict[str, GeneFamily], Path]
Read the alignment result to link input sequences to pangenome gene families. The alignment has been made against all genes of the pangenome.
- Parameters:
aln_res – Alignment result file
outdir – Output directory
pangenome – Input pangenome
- Returns:
Dictionary linking input sequences to pangenome gene families, and the path to the cleaned alignment file.
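Linking each input sequence to a family through its best hit can be sketched as below. This is an assumption-laden illustration rather than the function's real logic: `best_hit_families` is a hypothetical name, the alignment file is assumed to use the 12-column BLAST-like tabular layout that MMseqs2 `convertalis` writes by default (bit score in the last column), the best hit is assumed to be the one with the highest bit score, and families are plain strings instead of `GeneFamily` objects.

```python
import csv
import io
from typing import Dict, TextIO


def best_hit_families(aln_file: TextIO,
                      gene_to_family: Dict[str, str]) -> Dict[str, str]:
    """Keep, for each query, the family of the target with the highest bit score."""
    best_score: Dict[str, float] = {}
    seq_to_family: Dict[str, str] = {}
    for row in csv.reader(aln_file, delimiter="\t"):
        if not row:
            continue
        query, target, bitscore = row[0], row[1], float(row[11])
        if query not in best_score or bitscore > best_score[query]:
            best_score[query] = bitscore
            seq_to_family[query] = gene_to_family[target]
    return seq_to_family


alignment = (
    "q1\tgeneA\t0.95\t100\t5\t0\t1\t100\t1\t100\t1e-50\t180\n"
    "q1\tgeneB\t0.99\t100\t1\t0\t1\t100\t1\t100\t1e-55\t195\n"
    "q2\tgeneC\t0.85\t90\t13\t0\t1\t90\t1\t90\t1e-30\t120\n"
)
gene_to_family = {"geneA": "fam1", "geneB": "fam2", "geneC": "fam1"}
seq_to_family = best_hit_families(io.StringIO(alignment), gene_to_family)
print(seq_to_family)
```

When the alignment is run against representative sequences only, the target name already identifies the family, so the `gene_to_family` lookup would not be needed.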
- ppanggolin.align.alignOnPang.map_input_gene_to_family_rep_aln(aln_res: Path, outdir: Path, pangenome: Pangenome) Tuple[Dict[Any, GeneFamily], Path]
Read the alignment result to link input sequences to pangenome gene families. The alignment has been made against the representative sequences of the pangenome's gene families.
- Parameters:
aln_res – Alignment result file
outdir – Output directory
pangenome – Input pangenome
- Returns:
Dictionary linking input sequences to pangenome gene families, and the path to the cleaned alignment file.
- ppanggolin.align.alignOnPang.parser_align(parser: ArgumentParser)
Parser for the arguments specific to the align command.
- Parameters:
parser – Parser for the align arguments.
- ppanggolin.align.alignOnPang.project_and_write_partition(seqid_to_gene_family: Dict[str, GeneFamily], seq_set: Set[str], output: Path) Path
Project the partition of each input sequence and write the result to a file.
- Parameters:
seqid_to_gene_family – Dictionary linking sequences to pangenome gene families.
seq_set – Input sequences.
output – Path to the output directory.
- Returns:
Path to the file containing the partition projection.
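A partition projection of this kind can be sketched as a two-column TSV. The details here are assumptions for illustration only: `write_partition_projection` and the output file name are hypothetical, partitions are passed as strings rather than read from `GeneFamily` objects, and sequences without a matching family are assumed to be reported as "cloud".

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, Set


def write_partition_projection(seq_to_partition: Dict[str, str],
                               seq_set: Set[str], output: Path) -> Path:
    """Write one `sequence<TAB>partition` line per input sequence."""
    out_file = output / "partition_projection.tsv"  # hypothetical file name
    with open(out_file, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for seq_id in sorted(seq_set):
            # Assumption: a sequence with no hit is treated as cloud.
            writer.writerow([seq_id, seq_to_partition.get(seq_id, "cloud")])
    return out_file


with tempfile.TemporaryDirectory() as tmp:
    path = write_partition_projection({"seq1": "persistent"},
                                      {"seq1", "seq2"}, Path(tmp))
    projection_text = path.read_text()
print(projection_text)
```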
- ppanggolin.align.alignOnPang.subparser(sub_parser: _SubParsersAction) ArgumentParser
Subparser to launch PPanGGOLiN from the command line.
- Parameters:
sub_parser – Subparser for the align command.
- Returns:
Parser arguments for the align command.
- ppanggolin.align.alignOnPang.write_all_gene_sequences(pangenome: Pangenome, output: Path, add: str = '', disable_bar: bool = False)
Export the sequences of the pangenome genes.
- Parameters:
pangenome – Pangenome containing genes.
output – Path to the file where sequences will be written.
add – Prefix to add to each sequence name.
disable_bar – If True, disable the progress bar.
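The effect of the add prefix on exported sequence names can be sketched as follows. This is a simplified stand-in, not the library's writer: `write_prefixed_fasta` is a hypothetical helper that takes `(name, sequence)` pairs instead of a `Pangenome`, and the prefix value is made up for the example.

```python
import io
from typing import Iterable, TextIO, Tuple


def write_prefixed_fasta(records: Iterable[Tuple[str, str]],
                         handle: TextIO, add: str = "") -> None:
    """Write records as single-line FASTA, prepending `add` to each name."""
    for seq_id, sequence in records:
        handle.write(f">{add}{seq_id}\n{sequence}\n")


buffer = io.StringIO()
write_prefixed_fasta([("gene_1", "ATGAAATAG"), ("gene_2", "ATGCCCTGA")],
                     buffer, add="ppanggolin_")
fasta_text = buffer.getvalue()
print(fasta_text)
```

Writing each sequence on one line yields a single-line FASTA, the layout that the is_target_slf and is_input_slf flags elsewhere in this module refer to.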
- ppanggolin.align.alignOnPang.write_gene_fam_sequences(pangenome: Pangenome, output: Path, add: str = '', disable_bar: bool = False)
Export the sequences of the gene families.
- Parameters:
pangenome – Pangenome containing families.
output – Path to the file where sequences will be written.
add – Prefix to add to each sequence name.
disable_bar – If True, disable the progress bar.
- ppanggolin.align.alignOnPang.write_gene_to_gene_family(seqid_to_gene_family: Dict[str, GeneFamily], seq_set: Set[str], output: Path) Path
Write input gene to pangenome gene family.
- Parameters:
seqid_to_gene_family – dictionary which links input sequence and pangenome gene family
seq_set – input sequences
output – Path of the output directory
- Returns:
Path to the file which contains gene to gene family projection results