ppanggolin.annotate package

Submodules

ppanggolin.annotate.annotate module

ppanggolin.annotate.annotate.add_metadata_from_gff_file(contig_name_to_region_info: Dict[str, Dict[str, str]], org: Organism, gff_file_path: Path)

Add metadata to the organism object from a GFF file.

Parameters:

contig_name_to_region_info – A dictionary mapping contig names to their corresponding region information.
org – The organism object to which metadata will be added.
gff_file_path – The path to the GFF file.

ppanggolin.annotate.annotate.annotate_pangenome(pangenome: Pangenome, fasta_list: Path, tmpdir: str, cpu: int = 1, translation_table: int = 11, kingdom: str = 'bacteria', norna: bool = False, allow_overlap: bool = False, procedure: str | None = None, disable_bar: bool = False)

Main function to annotate a pangenome

Parameters:

pangenome – Pangenome object to annotate
fasta_list – Path to the list of FASTA files containing the sequences on which the pangenome is based
tmpdir – Path to temporary directory
cpu – number of CPU cores to use
translation_table – Translation table (genetic code) to use.
kingdom – Kingdom to which the prokaryote belongs, used to select the models for rRNA annotation.
norna – Use to avoid annotating RNA features.
allow_overlap – If set, genes overlapping with RNA features are not removed
procedure – Prodigal procedure to use
disable_bar – Disable the progress bar

ppanggolin.annotate.annotate.check_and_add_extra_gene_part(gene: Gene, new_gene_info: Dict, max_separation: int = 10)

Checks and potentially adds extra gene parts based on new gene information. This is done before checking for potential overlapping edge genes. Gene coordinates are expected to be in ascending order, and no circularity is taken into account here.

Parameters:

gene – Gene object to be compared and potentially merged with new_gene_info.
new_gene_info – Dictionary containing information about the new gene.
max_separation – Maximum allowed separation between gene coordinates for merging. Default is 10.

Raises:

AssertionError – If the start position is greater than the stop position in new_gene_info.
ValueError – If the coordinates of genes are too far apart to merge, or if the gene attributes do not match.

ppanggolin.annotate.annotate.check_annotate_args(args: Namespace)

Check that the given arguments are usable

Parameters:: args – All arguments provided by the user
Raises:: Exception – If the given arguments are not usable

ppanggolin.annotate.annotate.chose_gene_identifiers(pangenome: Pangenome) → bool

Parses the pangenome genes to decide whether to use local identifiers or PPanGGOLiN-generated gene identifiers. If the local identifiers are unique within the pangenome, they are used; otherwise the PPanGGOLiN-generated ones are used.

Parameters:: pangenome – input pangenome
Returns:: Boolean stating True if local identifiers are used, and False otherwise

ppanggolin.annotate.annotate.combine_contigs_metadata(contig_to_metadata: Dict[Contig, Dict[str, str]]) → Tuple[Dict[str, str], Dict[Contig, Dict[str, str]]]

Combine contig metadata to identify shared and unique metadata tags and values.

Parameters:: contig_to_metadata – A dictionary mapping each contig to its associated metadata.
Returns:: A tuple containing: - A dictionary of shared metadata tags and values present in all contigs. - A dictionary mapping each contig to its unique metadata tags and values.
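
The split into shared and contig-specific metadata can be sketched as follows. This is an illustration only, not PPanGGOLiN's implementation: the function name split_shared_metadata is hypothetical, plain strings stand in for Contig objects, and keeping contigs with no unique metadata in the second dictionary is an assumption.

```python
from typing import Dict, Hashable, Tuple


def split_shared_metadata(
    contig_to_metadata: Dict[Hashable, Dict[str, str]],
) -> Tuple[Dict[str, str], Dict[Hashable, Dict[str, str]]]:
    """Hypothetical helper: separate metadata shared by all contigs from the rest."""
    if not contig_to_metadata:
        return {}, {}

    metadata_dicts = list(contig_to_metadata.values())
    # A (tag, value) pair is shared only if every contig carries exactly that pair.
    shared_items = set(metadata_dicts[0].items())
    for metadata in metadata_dicts[1:]:
        shared_items &= set(metadata.items())

    shared = dict(shared_items)
    # Assumption: contigs are kept in the result even if nothing unique remains.
    unique = {
        contig: {tag: value for tag, value in metadata.items() if (tag, value) not in shared_items}
        for contig, metadata in contig_to_metadata.items()
    }
    return shared, unique


shared, unique = split_shared_metadata(
    {
        "contig_1": {"strain": "K-12", "topology": "circular"},
        "contig_2": {"strain": "K-12", "topology": "linear"},
    }
)
```

Here the strain tag ends up in the shared dictionary, while each contig keeps its own topology value.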

ppanggolin.annotate.annotate.correct_putative_overlaps(contigs: Iterable[Contig])

Corrects putative overlaps in gene coordinates for circular contigs.

Parameters:: contigs – Iterable of Contig objects representing circular contigs.
Raises:: ValueError – If a gene start position is higher than the length of the contig.

ppanggolin.annotate.annotate.create_gene(org: Organism, contig: Contig, gene_counter: int, rna_counter: int, gene_id: str, dbxrefs: Set[str], coordinates: List[Tuple[int, int]], strand: str, gene_type: str, position: int | None = None, gene_name: str = '', product: str = '', genetic_code: int = 11, protein_id: str = '') → Gene

Create a Gene object and associate it with its contig and organism

Parameters:

org – Organism to add gene
contig – Contig to add gene
gene_counter – Gene counter to name gene
rna_counter – RNA counter to name RNA
gene_id – local identifier
dbxrefs – cross-reference to external DB
coordinates – Gene start and stop positions
strand – gene strand association
gene_type – gene type
position – position in contig
gene_name – Gene name
product – Function of gene
genetic_code – Genetic code used
protein_id – Protein identifier

ppanggolin.annotate.annotate.determine_genetic_code_from_annotation_files(pangenome)

Determine the genetic code from the pangenome based on gene annotations.

This function counts the occurrence of each genetic code across all genes in the pangenome (excluding genes with genetic_code == 0) and selects the most common one. A pangenome is expected to have a unique genetic code, so a warning is issued if multiple genetic codes are detected, including examples of genes and genomes with different codes.

Parameters:: pangenome – A Pangenome object containing genes with genetic_code attributes
Returns:: The most common genetic code (int) found in the pangenome, or None if no genetic code information is available in the annotations (all genes have genetic_code == 0)
Warning:: Issues a warning if multiple genetic codes are detected in the pangenome
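
The counting step described above can be sketched with collections.Counter. This is a simplified illustration, not the library's code: the hypothetical most_common_genetic_code takes a plain iterable of integer codes instead of a Pangenome object, and the warning text is invented for the example.

```python
import logging
from collections import Counter
from typing import Iterable, Optional


def most_common_genetic_code(genetic_codes: Iterable[int]) -> Optional[int]:
    """Hypothetical helper: pick the most frequent non-zero genetic code."""
    # A code of 0 means "no genetic code information" and is ignored.
    counts = Counter(code for code in genetic_codes if code != 0)
    if not counts:
        return None
    if len(counts) > 1:
        # A pangenome is expected to use a single genetic code.
        logging.getLogger(__name__).warning(
            "Multiple genetic codes detected: %s", dict(counts)
        )
    return counts.most_common(1)[0][0]


code = most_common_genetic_code([11, 11, 4, 0])
```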

ppanggolin.annotate.annotate.determine_genetic_code_to_use(user_translation_table: int, is_user_specified: bool, genetic_code_from_annotation: int | None = None) → int

Determine the genetic code to use for the pangenome.

This function implements the following priority logic:

1. Extract genetic code from annotation files
2. If user explicitly specified a translation table and it differs from annotation, issue a warning and use the user-specified value
3. If no genetic code found in annotations, use the user-specified/default table

Parameters:

user_translation_table – The translation table value provided by the user (default or explicitly specified)
is_user_specified – Whether the translation table was explicitly specified by the user
genetic_code_from_annotation – Genetic code value inferred from annotation files, or None if no genetic code was found

Returns:

The genetic code to use for the pangenome
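
A minimal sketch of this priority logic is given below. The function name choose_genetic_code and the warning message are hypothetical, and the sketch assumes that the annotation value is used whenever the user did not explicitly specify a table.

```python
import logging
from typing import Optional


def choose_genetic_code(
    user_translation_table: int,
    is_user_specified: bool,
    genetic_code_from_annotation: Optional[int] = None,
) -> int:
    """Hypothetical helper mirroring the documented priority rules."""
    if genetic_code_from_annotation is None:
        # Nothing found in the annotations: use the user-specified or default table.
        return user_translation_table
    if is_user_specified:
        if user_translation_table != genetic_code_from_annotation:
            logging.getLogger(__name__).warning(
                "User-specified table %d differs from annotation table %d",
                user_translation_table,
                genetic_code_from_annotation,
            )
        return user_translation_table
    # Assumption: without an explicit user choice, the annotation value wins.
    return genetic_code_from_annotation
```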

ppanggolin.annotate.annotate.extract_positions(string: str) → Tuple[List[Tuple[int, int]], bool, bool, bool]

Extracts start and stop positions from a location string and determines whether the feature is on the complement strand and whether it is partial at its start or end.

Example of strings that the function is able to process:

“join(190..7695,7695..12071)”, “complement(join(4359800..4360707,4360707..4360962))”, “join(6835405..6835731,1..1218)”, “join(1375484..1375555,1375557..1376579)”, “complement(6815492..6816265)”, “6811501..6812109”, “complement(6792573..>6795461)”, “join(1038313,1..1016)”

Parameters:: string – The input string containing position information.
Returns:: A tuple containing a list of tuples representing start and stop positions, a boolean indicating whether it is complement, a boolean indicating whether it is a partial gene at start position and a boolean indicating whether it is a partial gene at end position.
Raises:: ValueError – If the string is not formatted as expected or if positions cannot be parsed as integers.
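
The kind of parsing involved can be sketched with a regular expression. This is an illustration under stated assumptions, not the library's parser: the name parse_location is hypothetical, a single position such as 1038313 is treated as a one-base interval, and the partial flags simply report the presence of "<" and ">" in the string.

```python
import re
from typing import List, Tuple

# One location part: "190..7695", "<1..200", "6792573..>6795461" or a single "1038313".
_PART = re.compile(r"^[<>]?(\d+)(?:\.\.[<>]?(\d+))?$")


def parse_location(location: str) -> Tuple[List[Tuple[int, int]], bool, bool, bool]:
    """Hypothetical helper returning (positions, is_complement, partial_start, partial_end)."""
    is_complement = location.startswith("complement(") and location.endswith(")")
    inner = location[len("complement("):-1] if is_complement else location
    if inner.startswith("join(") and inner.endswith(")"):
        inner = inner[len("join("):-1]

    positions = []
    for part in inner.split(","):
        match = _PART.match(part.strip())
        if match is None:
            raise ValueError(f"Unexpected location format: {location!r}")
        start = int(match.group(1))
        # Assumption: a lone position describes a single-base interval.
        stop = int(match.group(2)) if match.group(2) else start
        positions.append((start, stop))

    # Assumption: "<" marks a partial start and ">" a partial end, as written.
    return positions, is_complement, "<" in location, ">" in location
```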

ppanggolin.annotate.annotate.fix_partial_gene_coordinates(coordinates: List[Tuple[int, int]], is_complement: bool, start_shift: int, ensure_codon_multiple: bool = True) → List[Tuple[int, int]]

Adjusts gene coordinates if they have partial starts or ends, ensuring the gene length is a multiple of 3.

If the gene is on the complement strand, the adjustments will be reversed (i.e., applied to the opposite ends).

Parameters:

coordinates – List of coordinate tuples (start, stop) for the gene.
is_complement – Flag indicating if the gene is on the complement strand.
start_shift – The value by which the start coordinate should be shifted.
ensure_codon_multiple – Flag to check that gene length is a multiple of 3.

Returns:

A new list of adjusted coordinate tuples.

ppanggolin.annotate.annotate.get_gene_sequences_from_fastas(pangenome: Pangenome, fasta_files: Path, disable_bar: bool = False)

Get gene sequences from FASTA files

Parameters:

pangenome – Input pangenome
fasta_files – List of FASTA files
disable_bar – Flag to disable the progress bar

ppanggolin.annotate.annotate.launch(args: Namespace)

Command launcher

Parameters:: args – All arguments provided by the user

ppanggolin.annotate.annotate.local_identifiers_are_unique(genes: Iterable[Gene]) → bool

Check whether the local_identifiers of genes are unique, in order to decide whether they should be used as gene IDs.

Parameters:: genes – Iterable of gene objects
Returns:: Boolean that is True if the local identifiers are unique, and False otherwise
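
A uniqueness check of this kind can be sketched with a set. The example works on plain identifier strings rather than Gene objects, the name identifiers_are_unique is hypothetical, and how the real function treats empty identifiers is not covered here.

```python
from typing import Iterable


def identifiers_are_unique(local_identifiers: Iterable[str]) -> bool:
    """Hypothetical helper: return False as soon as a duplicate identifier is seen."""
    seen = set()
    for identifier in local_identifiers:
        if identifier in seen:
            return False
        seen.add(identifier)
    return True
```

Stopping at the first duplicate avoids scanning the rest of the genes once the answer is known.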

ppanggolin.annotate.annotate.parse_contig_header_lines(header_lines: List[str]) → Dict[str, str]

Parse required information from header lines of a contig from a GBFF file.

Parameters:: header_lines – List of strings representing header lines of a contig from a GBFF file.
Returns:: A dict with keys representing different fields and values representing their corresponding values joined by new line.

ppanggolin.annotate.annotate.parse_db_xref_metadata(db_xref_values: List[str], annot_file_path: Path = '') → Dict[str, str]

Parses a list of db_xref values and returns a dictionary with formatted keys and identifiers.

Parameters:

db_xref_values – List of db_xref strings in the format <database>:<identifier>.
annot_file_path – Path to the annotation file being processed.

Returns:

Dictionary with keys formatted as ‘db_xref_<database>’ and their corresponding identifiers.
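
The key formatting can be sketched as below. This is an assumption-laden illustration: db_xref_to_metadata is a hypothetical name, and silently skipping malformed values is a choice made for the example (the real function may warn or behave differently).

```python
from typing import Dict, List


def db_xref_to_metadata(db_xref_values: List[str]) -> Dict[str, str]:
    """Hypothetical helper: turn '<database>:<identifier>' strings into metadata."""
    metadata = {}
    for value in db_xref_values:
        # Split on the first colon only, so identifiers may contain colons.
        database, separator, identifier = value.partition(":")
        if not separator or not database or not identifier:
            continue  # Assumption: malformed entries are skipped.
        metadata[f"db_xref_{database}"] = identifier
    return metadata


metadata = db_xref_to_metadata(["GeneID:12345", "taxon:562"])
```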

ppanggolin.annotate.annotate.parse_dna_seq_lines(sequence_lines: List[str]) → str

Parse sequence lines from a GBFF file and return the DNA sequence

Parameters:: sequence_lines – List of strings representing sequence lines from a GBFF file.
Returns:: The cleaned DNA sequence as an upper-case string
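
Sequence lines in a GBFF file carry a position counter and space-separated blocks of bases. A minimal cleaning sketch, assuming the lines passed in contain only such counters, spaces and bases (the name clean_sequence_lines is hypothetical):

```python
from typing import List


def clean_sequence_lines(sequence_lines: List[str]) -> str:
    """Hypothetical helper: drop counters and whitespace, return upper-case DNA."""
    # Keep alphabetic characters only; digits and spaces are layout, not sequence.
    return "".join(
        char for line in sequence_lines for char in line if char.isalpha()
    ).upper()


dna = clean_sequence_lines(
    ["        1 gatcctccat atacaacggt", "       21 acgt"]
)
```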

ppanggolin.annotate.annotate.parse_feature_lines(feature_lines: List[str]) → Generator[Dict[str, str | Set[str]], None, None]

Parse feature lines from a GBFF file and yield dictionaries representing each feature.

Parameters:: feature_lines – List of strings representing feature lines from a GBFF file.
Returns:: A generator that yields dictionaries, each representing a feature with its type, location, and qualifiers.

ppanggolin.annotate.annotate.parse_gbff_by_contig(gbff_file_path: Path) → Generator[Tuple[Dict[str, str], Generator[Dict[str, str | Set[str]], None, None], str], None, None]

Parse a GBFF file by contig and yield tuples containing header, feature, and sequence info for each contig.

Parameters:: gbff_file_path – Path to the GBFF file.
Returns:: A generator that yields tuples containing header lines, feature lines, and sequence info for each contig.

ppanggolin.annotate.annotate.parser_annot(parser: ArgumentParser)

Parser for the arguments specific to the annotate command

Parameters:: parser – Parser to which the annotate arguments are added

ppanggolin.annotate.annotate.read_anno_file(organism_name: str, filename: Path, circular_contigs: list, pseudo: bool = False) → Tuple[Organism, bool]

Read a GBFF file for one organism

Parameters:

organism_name – Name of the organism
filename – Path to the corresponding file
circular_contigs – List of circular contigs
pseudo – Allow pseudogenes to be read

Returns:

The annotated organism and a boolean that is True if the file contains sequences

ppanggolin.annotate.annotate.read_annotations(pangenome: Pangenome, organisms_file: Path, cpu: int = 1, pseudo: bool = False, translation_table: int = 11, is_translation_table_specified: bool = False, disable_bar: bool = False)

Read the annotation from GBFF file

Parameters:

pangenome – pangenome object
organisms_file – List of GBFF files for each organism
cpu – number of CPU cores to use
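pseudo – Allow pseudogenes to be read
translation_table – Translation table (genetic code) to use when /transl_table is missing from CDS tags.
is_translation_table_specified – Whether the translation table was explicitly specified by the user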
disable_bar – Disable the progress bar

ppanggolin.annotate.annotate.read_org_gbff(organism_name: str, gbff_file_path: Path, circular_contigs: List[str], use_pseudogenes: bool = False) → Tuple[Organism, bool]

Reads a GBFF file and fills the Organism, Contig and Gene objects based on the information contained in this file

Parameters:

organism_name – Organism name
gbff_file_path – Path to corresponding GBFF file
circular_contigs – List of circular contigs
use_pseudogenes – Allow pseudogenes to be read

Returns:

The completed organism and a boolean that is True if the file contains sequences

ppanggolin.annotate.annotate.read_org_gff(organism: str, gff_file_path: Path, circular_contigs: List[str], pseudo: bool = False) → Tuple[Organism, bool]

Read annotation from GFF file

Parameters:

organism – Organism name
gff_file_path – Path to the corresponding GFF file
circular_contigs – List of circular contigs
pseudo – Allow pseudogenes to be read

Returns:

The Organism object and a boolean indicating whether sequences are associated with it

ppanggolin.annotate.annotate.reverse_complement_coordinates(coordinates: List[Tuple[int, int]]) → List[Tuple[int, int]]

Reverses and inverts the given list of coordinates. Each coordinate pair (start, end) is transformed into (-end, -start) and the order of the coordinates is reversed.

Parameters:: coordinates – A list of tuples representing the coordinates to be reversed and inverted.
Returns:: A list of reversed and inverted coordinates.
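
The transformation described above fits in a single comprehension. The following sketch uses the hypothetical name sketch_reverse_complement_coordinates and is written from the description, not copied from the library.

```python
from typing import List, Tuple


def sketch_reverse_complement_coordinates(
    coordinates: List[Tuple[int, int]],
) -> List[Tuple[int, int]]:
    """Hypothetical helper: map each (start, end) to (-end, -start), in reverse order."""
    return [(-end, -start) for start, end in reversed(coordinates)]


mirrored = sketch_reverse_complement_coordinates([(1, 10), (20, 30)])
```

Applying the transformation twice gives back the original list, which is what makes it usable as a temporary change of orientation.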

ppanggolin.annotate.annotate.shift_end_coordinates(coordinates: List[Tuple[int, int]], shift: int) → List[Tuple[int, int]]

Shifts the end of a set of coordinates by a specified amount and then returns the final shifted coordinates. This involves reversing the coordinates twice, shifting the start, and then returning the original orientation.

Parameters:

coordinates – A list of tuples representing the original coordinates.
shift – The amount by which the end coordinate should be shifted.

Returns:

The coordinates after the end shift and reverse complement transformations.

ppanggolin.annotate.annotate.shift_start_coordinates(coordinates: List[Tuple[int, int]], shift: int) → List[Tuple[int, int]]

Shifts the start of the first coordinate in the list by the specified amount. If the shift results in a negative or zero-length interval for the first coordinate, this interval is removed, and the shift is propagated to the next coordinate if necessary.

Parameters:

coordinates – A list of tuples representing the coordinates.
shift – The amount by which the start coordinate should be shifted.

Returns:

A new list of coordinates with the shifted start.
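
The start shift, and the end shift built on top of it as described for shift_end_coordinates, can be sketched as follows. The function names are hypothetical and the sketch assumes inclusive coordinates, so an interval is dropped once the shifted start moves past its end.

```python
from typing import List, Tuple

Coordinates = List[Tuple[int, int]]


def sketch_shift_start(coordinates: Coordinates, shift: int) -> Coordinates:
    """Hypothetical helper: move the start forward, dropping exhausted intervals."""
    remaining = list(coordinates)
    while remaining:
        start, end = remaining[0]
        new_start = start + shift
        if new_start <= end:
            return [(new_start, end)] + remaining[1:]
        # The whole interval is consumed: carry the leftover shift to the next one.
        shift = new_start - end - 1
        remaining.pop(0)
    return []


def sketch_reverse_complement(coordinates: Coordinates) -> Coordinates:
    """Mirror the coordinates: (start, end) becomes (-end, -start), order reversed."""
    return [(-end, -start) for start, end in reversed(coordinates)]


def sketch_shift_end(coordinates: Coordinates, shift: int) -> Coordinates:
    """Shift the end by mirroring, shifting the start, and mirroring back."""
    mirrored = sketch_reverse_complement(coordinates)
    shifted = sketch_shift_start(mirrored, shift)
    return sketch_reverse_complement(shifted)
```

With inclusive coordinates, shifting the start of [(1, 3), (10, 20)] by 4 removes the three bases of the first interval and one base of the second.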

ppanggolin.annotate.annotate.subparser(sub_parser: _SubParsersAction) → ArgumentParser

Subparser to launch PPanGGOLiN from the command line

Parameters:: sub_parser – Subparser for the annotate command
Returns:: Parser arguments for the annotate command

ppanggolin.annotate.synta module

ppanggolin.annotate.synta.annotate_organism(org_name: str, file_name: Path, circular_contigs: List[str], tmpdir: str, code: int = 11, norna: bool = False, kingdom: str = 'bacteria', allow_overlap: bool = False, procedure: str | None = None) → Organism

Function to annotate a single organism

Parameters:

org_name – Name of the organism / genome
file_name – Path to the fasta file containing organism sequences
circular_contigs – List of circular contigs
code – Translation table (genetic code) to use.
kingdom – Kingdom to which the prokaryote belongs, used to select the models for rRNA annotation.
norna – Use to avoid annotating RNA features.
tmpdir – Path to temporary directory
allow_overlap – If set, genes overlapping with RNA features are not removed
procedure – Prodigal procedure to use

Returns:

Complete organism object for pangenome

ppanggolin.annotate.synta.check_sequence_tuple(name: str, sequence: str)

Checks and validates a sequence name and its corresponding sequence.

Parameters:

name – The name (header) of the sequence, typically extracted from the FASTA file header.
sequence – The sequence string corresponding to the name, containing the nucleotide or protein sequence.

Returns:

A tuple containing the validated name and sequence.

Raises:

ValueError – If the sequence is empty (the message contains the header name), or if the name is empty (the message contains a preview of the sequence).
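
A validation step of this kind can be sketched as below. The name validate_record, the exact messages and the 30-character preview length are assumptions made for the example.

```python
from typing import Tuple


def validate_record(name: str, sequence: str) -> Tuple[str, str]:
    """Hypothetical helper: reject FASTA records with an empty name or sequence."""
    if not sequence:
        raise ValueError(f"Found an empty sequence with header '{name}'")
    if not name:
        # Assumption: a short preview is enough to locate the record.
        raise ValueError(
            f"Found a sequence with an empty name (sequence starts with '{sequence[:30]}')"
        )
    return name, sequence
```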

ppanggolin.annotate.synta.get_contigs_from_fasta_file(org: Organism, fna_file: TextIOWrapper | list) → Dict[str, str]

Processes contigs from a parsed FASTA generator and stores them in a dictionary.

Parameters:

org – Organism instance to update with contig info.
fna_file – Input FASTA file or list of lines as sequences.

Returns:

Dictionary with contig names as keys and sequences as values.

ppanggolin.annotate.synta.get_dna_sequence(contig_seq: str, gene: Gene | RNA) → str

Return the gene sequence

Parameters:

contig_seq – Contig sequence
gene – Gene or RNA whose sequence is extracted

Returns:

The DNA sequence of the gene, as a string

ppanggolin.annotate.synta.init_contig_counter(value: Value)

Initialize the contig counter for later use

ppanggolin.annotate.synta.launch_aragorn(fna_file: str, org: Organism, contig_to_length: Dict[str, int]) → defaultdict

Launches Aragorn to annotate tRNAs.

Parameters:

fna_file – file-like object containing the uncompressed fasta sequences
org – Organism which will be annotated
contig_to_length – Dictionary mapping contig names to their lengths

Returns:

Annotated genes in a list of gene objects

ppanggolin.annotate.synta.launch_infernal(fna_file: str, org: Organism, tmpdir: str, kingdom: str = 'bacteria') → defaultdict

Launches Infernal in hmmer-only mode to annotate rRNAs.

Parameters:

fna_file – file-like object containing the uncompressed fasta sequences
org – Organism which will be annotated
kingdom – Kingdom to which the prokaryote belongs, used to select the models for rRNA annotation.
tmpdir – Path to temporary directory

Returns:

Annotated genes in a list of gene objects.

ppanggolin.annotate.synta.launch_prodigal(contig_sequences: Dict[str, str], org: Organism, code: int = 11, use_meta: bool = False) → defaultdict

Launches Prodigal to annotate CDS. Takes a fna file name and a locustag to give an ID to the predicted genes.

Parameters:

contig_sequences – Dict containing contig sequences for pyrodigal
org – Organism which will be annotated
code – Translation table (genetic code) to use.
use_meta – use meta procedure in Prodigal

Returns:

Annotated genes in a list of gene objects

ppanggolin.annotate.synta.overlap_filter(all_genes: defaultdict, allow_overlap: bool = False) → defaultdict

Removes the CDS that overlap with RNA genes.

Parameters:

all_genes – Dictionary with complete list of genes
allow_overlap – If set, genes overlapping with RNA features are not removed

Returns:

Dictionary with genes filtered

ppanggolin.annotate.synta.parse_fasta(fna_file: TextIOWrapper | list) → Generator[Tuple[str, str], None, None]

Yields each sequence name and sequence from a FASTA file or stream as a tuple.

Parameters:: fna_file – Input FASTA file or list of lines as sequences.
Yield:: Tuple with contig header (without ‘>’) and sequence.
Raises:: ValueError – If the file does not contain valid FASTA format.
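
A generator of this shape can be sketched as follows. The name iter_fasta is hypothetical; the sketch assumes the full header line is kept (minus the leading ">") and that sequence data appearing before any header makes the input invalid.

```python
from typing import Generator, Iterable, List, Optional, Tuple


def iter_fasta(lines: Iterable[str]) -> Generator[Tuple[str, str], None, None]:
    """Hypothetical helper: yield (header, sequence) pairs from FASTA lines."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is None:
            raise ValueError("Sequence data found before the first FASTA header")
        else:
            chunks.append(line)
    if header is None:
        raise ValueError("No FASTA header found")
    yield header, "".join(chunks)


records = list(iter_fasta([">contig_1 demo\n", "ACGT\n", "TT\n", ">contig_2\n", "GG\n"]))
```

Because it accepts any iterable of lines, the same function works on an open file handle or on a list of strings, as in the usage line above.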

ppanggolin.annotate.synta.reverse_complement(seq: str)

Reverse-complement the given DNA sequence

Parameters:: seq – Sequence to be reverse-complemented
Returns:: The reverse-complemented sequence
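
Reverse complementation can be sketched with a translation table. The name sketch_reverse_complement is hypothetical; characters outside A, C, G and T (such as N) are left unchanged in this sketch, which may differ from the library's handling of ambiguous bases.

```python
# Translation table mapping each base to its complement, in both cases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def sketch_reverse_complement(seq: str) -> str:
    """Hypothetical helper: complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


result = sketch_reverse_complement("ATGC")
```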

ppanggolin.annotate.synta.syntaxic_annotation(org: Organism, fasta_file: TextIOWrapper, contig_sequences: Dict[str, str], tmpdir: str, norna: bool = False, kingdom: str = 'bacteria', code: int = 11, use_meta: bool = False) → defaultdict

Runs the different tools for the syntactic annotation.

Parameters:

org – Organism which will be annotated
fasta_file – file-like object containing the uncompressed fasta sequences
contig_sequences – Dict containing contig sequences for pyrodigal
tmpdir – Path to temporary directory
norna – Use to avoid annotating RNA features.
kingdom – Kingdom to which the prokaryote belongs, used to select the models for rRNA annotation.
code – Translation table (genetic code) to use.
use_meta – Use meta prodigal procedure

Returns:

list of genes in the organism

ppanggolin.annotate.synta.write_tmp_fasta(contigs: dict, tmpdir: str) → _TemporaryFileWrapper

Writes a temporary fna formatted file and returns the file-like object. Useful in case of compressed input file. The file will be deleted when close() is called.

Parameters:

contigs – Sequences of each contig
tmpdir – Path to temporary directory

Returns:

The temporary FASTA file object
