ppanggolin.annotate package
Submodules
ppanggolin.annotate.annotate module
- ppanggolin.annotate.annotate.add_metadata_from_gff_file(contig_name_to_region_info: Dict[str, Dict[str, str]], org: Organism, gff_file_path: Path)
Add metadata to the organism object from a GFF file.
- Parameters:
contig_name_to_region_info – A dictionary mapping contig names to their corresponding region information.
org – The organism object to which metadata will be added.
gff_file_path – The path to the GFF file.
- ppanggolin.annotate.annotate.annotate_pangenome(pangenome: Pangenome, fasta_list: Path, tmpdir: str, cpu: int = 1, translation_table: int = 11, kingdom: str = 'bacteria', norna: bool = False, allow_overlap: bool = False, procedure: str | None = None, disable_bar: bool = False)
Main function to annotate a pangenome
- Parameters:
pangenome – Pangenome with gene families to align with the given input sequences
fasta_list – File listing the FASTA files whose sequences will form the basis of the pangenome
tmpdir – Path to temporary directory
cpu – number of CPU cores to use
translation_table – Translation table (genetic code) to use.
kingdom – Kingdom to which the prokaryotes belong, used to select the models for rRNA annotation.
norna – If True, RNA features are not annotated.
allow_overlap – If True, genes overlapping RNA features are not removed.
procedure – Prodigal procedure to use
disable_bar – Disable the progress bar
- ppanggolin.annotate.annotate.check_and_add_extra_gene_part(gene: Gene, new_gene_info: Dict, max_separation: int = 10)
Checks and potentially adds extra gene parts based on new gene information. This is done before checking for potential overlapping edge genes. Gene coordinates are expected to be in ascending order, and no circularity is taken into account here.
- Parameters:
gene – Gene object to be compared and potentially merged with new_gene_info.
new_gene_info – Dictionary containing information about the new gene.
max_separation – Maximum allowed separation between gene coordinates for merging. Default is 10.
- Raises:
AssertionError – If the start position is greater than the stop position in new_gene_info.
ValueError – If the coordinates of genes are too far apart to merge, or if the gene attributes do not match.
- ppanggolin.annotate.annotate.check_annotate_args(args: Namespace)
Check that the given arguments are usable
- Parameters:
args – All arguments provided by the user
- Raises:
Exception –
- ppanggolin.annotate.annotate.chose_gene_identifiers(pangenome: Pangenome) bool
Parses the pangenome genes to decide whether to use local_identifiers or ppanggolin generated gene identifiers. If the local identifiers are unique within the pangenome they are picked, otherwise ppanggolin ones are used.
- Parameters:
pangenome – input pangenome
- Returns:
Boolean stating True if local identifiers are used, and False otherwise
- ppanggolin.annotate.annotate.combine_contigs_metadata(contig_to_metadata: Dict[Contig, Dict[str, str]]) Tuple[Dict[str, str], Dict[Contig, Dict[str, str]]]
Combine contig metadata to identify shared and unique metadata tags and values.
- Parameters:
contig_to_metadata – A dictionary mapping each contig to its associated metadata.
- Returns:
A tuple containing: - A dictionary of shared metadata tags and values present in all contigs. - A dictionary mapping each contig to its unique metadata tags and values.
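The split between shared and contig-specific metadata can be illustrated with a small sketch. This is not the PPanGGOLiN implementation: the function name `combine_metadata_sketch` is hypothetical, plain strings stand in for `Contig` objects, and it assumes that a tag counts as shared only when every contig carries it with the same value.

```python
from typing import Dict, Hashable, Tuple


def combine_metadata_sketch(
    contig_to_metadata: Dict[Hashable, Dict[str, str]],
) -> Tuple[Dict[str, str], Dict[Hashable, Dict[str, str]]]:
    """Split metadata into pairs shared by all contigs and contig-specific pairs."""
    if not contig_to_metadata:
        return {}, {}
    # Intersect the (tag, value) pairs of every contig to find the shared ones.
    item_sets = [set(meta.items()) for meta in contig_to_metadata.values()]
    shared_items = set.intersection(*item_sets)
    shared = dict(shared_items)
    # Whatever is not shared stays attached to its own contig.
    unique = {
        contig: {tag: value for tag, value in meta.items() if (tag, value) not in shared_items}
        for contig, meta in contig_to_metadata.items()
    }
    return shared, unique
```

In this sketch a contig with no specific metadata keeps an empty dictionary; whether the real function keeps or drops such entries is not stated in the documentation above.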
- ppanggolin.annotate.annotate.correct_putative_overlaps(contigs: Iterable[Contig])
Corrects putative overlaps in gene coordinates for circular contigs.
- Parameters:
contigs – Iterable of Contig objects representing circular contigs.
- Raises:
ValueError – If a gene start position is higher than the length of the contig.
- ppanggolin.annotate.annotate.create_gene(org: Organism, contig: Contig, gene_counter: int, rna_counter: int, gene_id: str, dbxrefs: Set[str], coordinates: List[Tuple[int, int]], strand: str, gene_type: str, position: int | None = None, gene_name: str = '', product: str = '', genetic_code: int = 11, protein_id: str = '') Gene
Create a Gene object and associate it with its contig and organism
- Parameters:
org – Organism to which the gene is added
contig – Contig to which the gene is added
gene_counter – Gene counter to name gene
rna_counter – RNA counter to name RNA
gene_id – local identifier
dbxrefs – cross-reference to external DB
coordinates – Gene start and stop positions
strand – gene strand association
gene_type – gene type
position – position in contig
gene_name – Gene name
product – Function of gene
genetic_code – Genetic code used
protein_id – Protein identifier
- ppanggolin.annotate.annotate.determine_genetic_code_from_annotation_files(pangenome)
Determine the genetic code from the pangenome based on gene annotations.
This function counts the occurrence of each genetic code across all genes in the pangenome (excluding genes with genetic_code == 0) and selects the most common one. A pangenome is expected to have a unique genetic code, so a warning is issued if multiple genetic codes are detected, including examples of genes and genomes with different codes.
- Parameters:
pangenome – A Pangenome object containing genes with genetic_code attributes
- Returns:
The most common genetic code (int) found in the pangenome, or None if no genetic code information is available in the annotations (all genes have genetic_code == 0)
- Warning:
Issues a warning if multiple genetic codes are detected in the pangenome
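The majority vote described above can be sketched as follows. This is an illustration under assumptions, not the library code: the hypothetical `most_common_genetic_code_sketch` takes a plain iterable of integer codes rather than a Pangenome, and it logs the warning through the standard `logging` module.

```python
import logging
from collections import Counter
from typing import Iterable, Optional


def most_common_genetic_code_sketch(genetic_codes: Iterable[int]) -> Optional[int]:
    """Return the most frequent non-zero genetic code, or None if there is none."""
    # A code of 0 means "no information" and is ignored, as described above.
    counts = Counter(code for code in genetic_codes if code != 0)
    if not counts:
        return None
    if len(counts) > 1:
        logging.getLogger(__name__).warning(
            "Multiple genetic codes detected: %s", dict(counts)
        )
    return counts.most_common(1)[0][0]
```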
- ppanggolin.annotate.annotate.determine_genetic_code_to_use(user_translation_table: int, is_user_specified: bool, genetic_code_from_annotation: int | None = None) int
Determine the genetic code to use for the pangenome.
This function implements the following priority logic:
1. Extract the genetic code from the annotation files.
2. If the user explicitly specified a translation table and it differs from the annotation, issue a warning and use the user-specified value.
3. If no genetic code is found in the annotations, use the user-specified/default table.
- Parameters:
user_translation_table – The translation table value provided by the user (default or explicitly specified)
is_user_specified – Whether the translation table was explicitly specified by the user
genetic_code_from_annotation – Genetic code value inferred from annotation files, or None if no genetic code was found
- Returns:
The genetic code to use for the pangenome
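The priority logic can be written as a short decision function. The sketch below is a reading of the documented rules, not the actual implementation; the name `choose_genetic_code_sketch` is hypothetical, and it assumes that when the user did not specify a table, the code found in the annotations wins over the default.

```python
import logging
from typing import Optional


def choose_genetic_code_sketch(
    user_translation_table: int,
    is_user_specified: bool,
    genetic_code_from_annotation: Optional[int] = None,
) -> int:
    """Pick the genetic code following the documented priority order."""
    # Nothing found in the annotations: fall back to the user/default table.
    if genetic_code_from_annotation is None:
        return user_translation_table
    # An explicit user choice overrides the annotations, with a warning.
    if is_user_specified and user_translation_table != genetic_code_from_annotation:
        logging.getLogger(__name__).warning(
            "User-specified translation table %d differs from annotation (%d); "
            "using the user-specified value.",
            user_translation_table,
            genetic_code_from_annotation,
        )
        return user_translation_table
    # Assumption: otherwise the annotation-derived code is used.
    return genetic_code_from_annotation
```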
- ppanggolin.annotate.annotate.extract_positions(string: str) Tuple[List[Tuple[int, int]], bool, bool, bool]
Extracts start and stop positions from a location string and determines whether the feature is on the complement strand and whether it is partial at its start or end.
Example of strings that the function is able to process:
“join(190..7695,7695..12071)”, “complement(join(4359800..4360707,4360707..4360962))”, “join(6835405..6835731,1..1218)”, “join(1375484..1375555,1375557..1376579)”, “complement(6815492..6816265)”, “6811501..6812109”, “complement(6792573..>6795461)”, “join(1038313,1..1016)”
- Parameters:
string – The input string containing position information.
- Returns:
A tuple containing a list of tuples representing start and stop positions, a boolean indicating whether it is complement, a boolean indicating whether it is a partial gene at start position and a boolean indicating whether it is a partial gene at end position.
- Raises:
ValueError – If the string is not formatted as expected or if positions cannot be parsed as integers.
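A regular-expression sketch shows how location strings like the examples above might be parsed. It is not the PPanGGOLiN parser: the name `extract_positions_sketch` is hypothetical, a single position such as `1038313` is assumed to become `(1038313, 1038313)`, and `<` / `>` are assumed to mark a partial start and a partial end respectively.

```python
import re
from typing import List, Tuple

# One segment is either "start..stop" or a single position, with optional < or >.
_SEGMENT = re.compile(r"([<>]?)(\d+)(?:\.\.([<>]?)(\d+))?")


def extract_positions_sketch(string: str) -> Tuple[List[Tuple[int, int]], bool, bool, bool]:
    """Parse a GenBank-style location string into coordinates and flags."""
    is_complement = string.startswith("complement(")
    segments = _SEGMENT.findall(string)
    if not segments:
        raise ValueError(f"No positions found in {string!r}")
    positions: List[Tuple[int, int]] = []
    has_partial_start = False
    has_partial_end = False
    for start_flag, start, stop_flag, stop in segments:
        # A lone position is treated as a one-base interval (assumption).
        positions.append((int(start), int(stop or start)))
        has_partial_start = has_partial_start or "<" in (start_flag, stop_flag)
        has_partial_end = has_partial_end or ">" in (start_flag, stop_flag)
    return positions, is_complement, has_partial_start, has_partial_end
```

Unlike the documented function, this sketch does not validate the overall structure of the string (balanced parentheses, known keywords); it only raises `ValueError` when no position is found.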
- ppanggolin.annotate.annotate.fix_partial_gene_coordinates(coordinates: List[Tuple[int, int]], is_complement: bool, start_shift: int, ensure_codon_multiple: bool = True) List[Tuple[int, int]]
Adjusts gene coordinates if they have partial starts or ends, ensuring the gene length is a multiple of 3.
If the gene is on the complement strand, the adjustments will be reversed (i.e., applied to the opposite ends).
- Parameters:
coordinates – List of coordinate tuples (start, stop) for the gene.
is_complement – Flag indicating if the gene is on the complement strand.
start_shift – The value by which the start coordinate should be shifted.
ensure_codon_multiple – Flag to check that gene length is a multiple of 3.
- Returns:
A new list of adjusted coordinate tuples.
- ppanggolin.annotate.annotate.get_gene_sequences_from_fastas(pangenome: Pangenome, fasta_files: Path, disable_bar: bool = False)
Get gene sequences from FASTA files
- Parameters:
pangenome – Input pangenome
fasta_files – File listing the FASTA files
disable_bar – Flag to disable progress bar
- ppanggolin.annotate.annotate.launch(args: Namespace)
Command launcher
- Parameters:
args – All arguments provided by the user
- ppanggolin.annotate.annotate.local_identifiers_are_unique(genes: Iterable[Gene]) bool
Check whether the local_identifiers of genes are unique, in order to decide if they should be used as gene IDs.
- Parameters:
genes – Iterable of gene objects
- Returns:
Boolean stating True if local identifiers are unique, and False otherwise
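A uniqueness check of this kind can be sketched with a set. The code below is illustrative only: `GeneStub` is a hypothetical stand-in for the real `Gene` class, assumed to expose a `local_identifier` attribute, and `local_identifiers_are_unique_sketch` is not the library function.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class GeneStub:
    """Hypothetical stand-in exposing only the attribute used here."""
    local_identifier: str


def local_identifiers_are_unique_sketch(genes: Iterable[GeneStub]) -> bool:
    """Return True if no local identifier appears twice."""
    seen = set()
    for gene in genes:
        if gene.local_identifier in seen:
            # Stop at the first duplicate; no need to scan the rest.
            return False
        seen.add(gene.local_identifier)
    return True
```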
- ppanggolin.annotate.annotate.parse_contig_header_lines(header_lines: List[str]) Dict[str, str]
Parse required information from header lines of a contig from a GBFF file.
- Parameters:
header_lines – List of strings representing header lines of a contig from a GBFF file.
- Returns:
A dict with keys representing different fields and values representing their corresponding values joined by new line.
- ppanggolin.annotate.annotate.parse_db_xref_metadata(db_xref_values: List[str], annot_file_path: Path = '') Dict[str, str]
Parses a list of db_xref values and returns a dictionary with formatted keys and identifiers.
- Parameters:
db_xref_values – List of db_xref strings in the format <database>:<identifier>.
annot_file_path – Path to the annotation file being processed.
- Returns:
Dictionary with keys formatted as ‘db_xref_<database>’ and their corresponding identifiers.
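The key formatting can be illustrated with a minimal sketch. The name `parse_db_xref_sketch` is hypothetical, and two behaviours are assumptions rather than documented facts: values without a `:` separator are skipped, and a repeated database keeps the last identifier seen.

```python
from typing import Dict, List


def parse_db_xref_sketch(db_xref_values: List[str]) -> Dict[str, str]:
    """Turn '<database>:<identifier>' strings into 'db_xref_<database>' keys."""
    metadata: Dict[str, str] = {}
    for value in db_xref_values:
        # Split on the first ':' only, since identifiers may contain colons.
        database, separator, identifier = value.partition(":")
        if not separator:
            continue  # Assumption: malformed entries are ignored.
        metadata[f"db_xref_{database}"] = identifier
    return metadata
```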
- ppanggolin.annotate.annotate.parse_dna_seq_lines(sequence_lines: List[str]) str
Parse sequence_lines from a GBFF file and return the DNA sequence
- Parameters:
sequence_lines – List of strings representing sequence lines from a GBFF file.
- Returns:
The cleaned DNA sequence as an upper-case string
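For reference, sequence lines in a GBFF file carry a leading position number and space-separated blocks of bases. A minimal sketch of the cleaning step, assuming that "cleaned" means dropping every non-letter character (the name `parse_dna_seq_lines_sketch` is hypothetical):

```python
from typing import List


def parse_dna_seq_lines_sketch(sequence_lines: List[str]) -> str:
    """Concatenate GBFF sequence lines into one upper-case DNA string."""
    # Keep letters only, which removes position numbers, spaces and newlines.
    return "".join(
        char for line in sequence_lines for char in line if char.isalpha()
    ).upper()
```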
- ppanggolin.annotate.annotate.parse_feature_lines(feature_lines: List[str]) Generator[Dict[str, str | Set[str]], None, None]
Parse feature lines from a GBFF file and yield dictionaries representing each feature.
- Parameters:
feature_lines – List of strings representing feature lines from a GBFF file.
- Returns:
A generator that yields dictionaries, each representing a feature with its type, location, and qualifiers.
- ppanggolin.annotate.annotate.parse_gbff_by_contig(gbff_file_path: Path) Generator[Tuple[Dict[str, str], Generator[Dict[str, str | Set[str]], None, None], str], None, None]
Parse a GBFF file by contig and yield tuples containing header, feature, and sequence info for each contig.
- Parameters:
gbff_file_path – Path to the GBFF file.
- Returns:
A generator that yields tuples containing header lines, feature lines, and sequence info for each contig.
- ppanggolin.annotate.annotate.parser_annot(parser: ArgumentParser)
Add the arguments specific to the annotate command to the parser
- Parameters:
parser – Parser for the annotate arguments
- ppanggolin.annotate.annotate.read_anno_file(organism_name: str, filename: Path, circular_contigs: list, pseudo: bool = False) Tuple[Organism, bool]
Read a GBFF file for one organism
- Parameters:
organism_name – Name of the organism
filename – Path to the corresponding file
circular_contigs – List of circular contigs
pseudo – If True, pseudogenes are read
- Returns:
The annotated organism and a boolean that is True if the file contains sequences
- ppanggolin.annotate.annotate.read_annotations(pangenome: Pangenome, organisms_file: Path, cpu: int = 1, pseudo: bool = False, translation_table: int = 11, is_translation_table_specified: bool = False, disable_bar: bool = False)
Read the annotation from GBFF file
- Parameters:
pangenome – pangenome object
organisms_file – List of GBFF files for each organism
cpu – number of CPU cores to use
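pseudo – If True, pseudogenes are read
translation_table – Translation table (genetic code) to use when /transl_table is missing from CDS tags.
is_translation_table_specified – Whether the translation table was explicitly specified by the user
disable_bar – Disable the progress bar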
- ppanggolin.annotate.annotate.read_org_gbff(organism_name: str, gbff_file_path: Path, circular_contigs: List[str], use_pseudogenes: bool = False) Tuple[Organism, bool]
Read a GBFF file and fills Organism, Contig and Genes objects based on information contained in this file
- Parameters:
organism_name – Organism name
gbff_file_path – Path to corresponding GBFF file
circular_contigs – List of circular contigs
use_pseudogenes – If True, pseudogenes are read
- Returns:
The completed organism and a boolean that is True if the file contains sequences
- ppanggolin.annotate.annotate.read_org_gff(organism: str, gff_file_path: Path, circular_contigs: List[str], pseudo: bool = False) Tuple[Organism, bool]
Read annotation from GFF file
- Parameters:
organism – Organism name
gff_file_path – Path corresponding to GFF file
circular_contigs – List of circular contigs
pseudo – If True, pseudogenes are read
- Returns:
The organism object and a boolean stating whether sequences are associated with it
- ppanggolin.annotate.annotate.reverse_complement_coordinates(coordinates: List[Tuple[int, int]]) List[Tuple[int, int]]
Reverses and inverts the given list of coordinates. Each coordinate pair (start, end) is transformed into (-end, -start) and the order of the coordinates is reversed.
- Parameters:
coordinates – A list of tuples representing the coordinates to be reversed and inverted.
- Returns:
A list of reversed and inverted coordinates.
- ppanggolin.annotate.annotate.shift_end_coordinates(coordinates: List[Tuple[int, int]], shift: int) List[Tuple[int, int]]
Shifts the end of a set of coordinates by a specified amount and then returns the final shifted coordinates. This involves reversing the coordinates twice, shifting the start, and then returning the original orientation.
- Parameters:
coordinates – A list of tuples representing the original coordinates.
shift – The amount by which the end coordinate should be shifted.
- Returns:
The coordinates after the end shift and reverse complement transformations.
- ppanggolin.annotate.annotate.shift_start_coordinates(coordinates: List[Tuple[int, int]], shift: int) List[Tuple[int, int]]
Shifts the start of the first coordinate in the list by the specified amount. If the shift results in a negative or zero-length interval for the first coordinate, this interval is removed, and the shift is propagated to the next coordinate if necessary.
- Parameters:
coordinates – A list of tuples representing the coordinates.
shift – The amount by which the start coordinate should be shifted.
- Returns:
A new list of coordinates with the shifted start.
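The three coordinate helpers described above (reverse_complement_coordinates, shift_end_coordinates and shift_start_coordinates) fit together as shown in the sketch below. It is not the library code: the `_sketch` names are hypothetical, coordinates are assumed to be inclusive (so an interval `(1, 10)` spans 10 positions), and the shift is assumed to be non-negative.

```python
from typing import List, Tuple

Coordinates = List[Tuple[int, int]]


def reverse_complement_coordinates_sketch(coordinates: Coordinates) -> Coordinates:
    """Map each (start, end) to (-end, -start) and reverse the order."""
    return [(-end, -start) for start, end in reversed(coordinates)]


def shift_start_coordinates_sketch(coordinates: Coordinates, shift: int) -> Coordinates:
    """Shift the start; drop intervals consumed by the shift and carry the rest over."""
    remaining = list(coordinates)
    while remaining and shift > 0:
        start, stop = remaining[0]
        length = stop - start + 1  # Inclusive coordinates (assumption).
        if shift < length:
            remaining[0] = (start + shift, stop)
            shift = 0
        else:
            # The whole interval is consumed; propagate what is left of the shift.
            remaining.pop(0)
            shift -= length
    return remaining


def shift_end_coordinates_sketch(coordinates: Coordinates, shift: int) -> Coordinates:
    """Shift the end by flipping the coordinates, shifting the start, and flipping back."""
    flipped = reverse_complement_coordinates_sketch(coordinates)
    shifted = shift_start_coordinates_sketch(flipped, shift)
    return reverse_complement_coordinates_sketch(shifted)
```

Flipping the coordinates turns "shorten the end" into "shorten the start", so a single routine handles interval removal and shift propagation for both ends.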
- ppanggolin.annotate.annotate.subparser(sub_parser: _SubParsersAction) ArgumentParser
Subparser to launch PPanGGOLiN in Command line
- Parameters:
sub_parser – sub_parser for the annotate command
- Returns:
Parser arguments for the annotate command
ppanggolin.annotate.synta module
- ppanggolin.annotate.synta.annotate_organism(org_name: str, file_name: Path, circular_contigs: List[str], tmpdir: str, code: int = 11, norna: bool = False, kingdom: str = 'bacteria', allow_overlap: bool = False, procedure: str | None = None) Organism
Function to annotate a single organism
- Parameters:
org_name – Name of the organism / genome
file_name – Path to the fasta file containing organism sequences
circular_contigs – List of circular contigs
code – Translation table (genetic code) to use.
kingdom – Kingdom to which the prokaryotes belong, used to select the models for rRNA annotation.
norna – If True, RNA features are not annotated.
tmpdir – Path to temporary directory
allow_overlap – If True, genes overlapping RNA features are not removed.
procedure – Prodigal procedure to use
- Returns:
Complete organism object for pangenome
- ppanggolin.annotate.synta.check_sequence_tuple(name: str, sequence: str)
Checks and validates a sequence name and its corresponding sequence.
- Parameters:
name – The name (header) of the sequence, typically extracted from the FASTA file header.
sequence – The sequence string corresponding to the name, containing the nucleotide or protein sequence.
- Returns:
A tuple containing the validated name and sequence.
- Raises:
ValueError –
If the sequence is empty, a ValueError is raised with a message containing the header name.
If the name is empty, a ValueError is raised with a message containing a preview of the sequence.
- ppanggolin.annotate.synta.get_contigs_from_fasta_file(org: Organism, fna_file: TextIOWrapper | list) Dict[str, str]
Processes contigs from a parsed FASTA generator and stores them in a dictionary.
- Parameters:
org – Organism instance to update with contig info.
fna_file – Input FASTA file or list of lines as sequences.
- Returns:
Dictionary with contig names as keys and sequences as values.
- ppanggolin.annotate.synta.get_dna_sequence(contig_seq: str, gene: Gene | RNA) str
Return the gene sequence
- Parameters:
contig_seq – Contig sequence
gene – Gene
- Returns:
The DNA sequence of the gene
- ppanggolin.annotate.synta.init_contig_counter(value: Value)
Initialize the contig counter for later use
- ppanggolin.annotate.synta.launch_aragorn(fna_file: str, org: Organism, contig_to_length: Dict[str, int]) defaultdict
Launches Aragorn to annotate tRNAs.
- Parameters:
fna_file – file-like object containing the uncompressed fasta sequences
org – Organism which will be annotated
- Returns:
Annotated genes in a list of gene objects
- ppanggolin.annotate.synta.launch_infernal(fna_file: str, org: Organism, tmpdir: str, kingdom: str = 'bacteria') defaultdict
Launches Infernal in hmmer-only mode to annotate rRNAs.
- Parameters:
fna_file – file-like object containing the uncompressed fasta sequences
org – Organism which will be annotated
kingdom – Kingdom to which the prokaryotes belong, used to select the models for rRNA annotation.
tmpdir – Path to temporary directory
- Returns:
Annotated genes in a list of gene objects.
- ppanggolin.annotate.synta.launch_prodigal(contig_sequences: Dict[str, str], org: Organism, code: int = 11, use_meta: bool = False) defaultdict
Launches Prodigal to annotate CDS. Takes a fna file name and a locustag to give an ID to the pred genes.
- Parameters:
contig_sequences – Dict containing contig sequences for pyrodigal
org – Organism which will be annotated
code – Translation table (genetic code) to use.
use_meta – use meta procedure in Prodigal
- Returns:
Annotated genes in a list of gene objects
- ppanggolin.annotate.synta.overlap_filter(all_genes: defaultdict, allow_overlap: bool = False) defaultdict
Removes the CDS that overlap with RNA genes.
- Parameters:
all_genes – Dictionary with complete list of genes
allow_overlap – If True, genes overlapping RNA features are not removed
- Returns:
Dictionary with genes filtered
- ppanggolin.annotate.synta.parse_fasta(fna_file: TextIOWrapper | list) Generator[Tuple[str, str], None, None]
Yields each sequence name and sequence from a FASTA file or stream as a tuple.
- Parameters:
fna_file – Input FASTA file or list of lines as sequences.
- Yield:
Tuple with contig header (without ‘>’) and sequence.
- Raises:
ValueError – If the file does not contain valid FASTA format.
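A generator of this shape can be sketched in a few lines. The version below is an illustration, not the PPanGGOLiN parser: `parse_fasta_sketch` is a hypothetical name, the header is yielded in full (everything after `>`), and the only format error it detects is sequence data appearing before the first header.

```python
from typing import Generator, Iterable, List, Optional, Tuple


def parse_fasta_sketch(fna_file: Iterable[str]) -> Generator[Tuple[str, str], None, None]:
    """Yield (header, sequence) tuples from FASTA lines."""
    name: Optional[str] = None
    chunks: List[str] = []
    for line in fna_file:
        line = line.strip()
        if not line:
            continue  # Skip blank lines.
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                yield name, "".join(chunks)
            name = line[1:]
            chunks = []
        elif name is None:
            raise ValueError("Sequence data found before any '>' header line")
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)
```

Because it only iterates over lines, the same function accepts an open text file or a list of strings, matching the two input types in the signature above.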
- ppanggolin.annotate.synta.reverse_complement(seq: str)
Reverse complement the given DNA sequence
- Parameters:
seq – Sequence to reverse complement
- Returns:
The reverse-complemented sequence
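Reverse complementation can be sketched with a translation table. This is a minimal illustration, not the library function: `reverse_complement_sketch` is a hypothetical name, and only A, C, G and T (in either case) are complemented, while any other character, such as N, is left unchanged.

```python
# Translation table mapping each base to its complement, preserving case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement_sketch(seq: str) -> str:
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]
```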
- ppanggolin.annotate.synta.syntaxic_annotation(org: Organism, fasta_file: TextIOWrapper, contig_sequences: Dict[str, str], tmpdir: str, norna: bool = False, kingdom: str = 'bacteria', code: int = 11, use_meta: bool = False) defaultdict
Runs the different software tools used for the syntactic annotation.
- Parameters:
org – Organism which will be annotated
fasta_file – file-like object containing the uncompressed fasta sequences
contig_sequences – Dict containing contig sequences for pyrodigal
tmpdir – Path to temporary directory
norna – If True, RNA features are not annotated.
kingdom – Kingdom to which the prokaryotes belong, used to select the models for rRNA annotation.
code – Translation table (genetic code) to use.
use_meta – Use meta prodigal procedure
- Returns:
list of genes in the organism
- ppanggolin.annotate.synta.write_tmp_fasta(contigs: dict, tmpdir: str) _TemporaryFileWrapper
Writes a temporary fna formatted file and returns the file-like object. Useful in case of compressed input file. The file will be deleted when close() is called.
- Parameters:
contigs – Dictionary of contig sequences
tmpdir – Path to temporary directory
- Returns:
The temporary FASTA file object