ppanggolin.annotate package

Submodules

ppanggolin.annotate.annotate module

ppanggolin.annotate.annotate.add_metadata_from_gff_file(contig_name_to_region_info: Dict[str, Dict[str, str]], org: Organism, gff_file_path: Path)

Add metadata to the organism object from a GFF file.

Parameters:
  • contig_name_to_region_info – A dictionary mapping contig names to their corresponding region information.

  • org – The organism object to which metadata will be added.

  • gff_file_path – The path to the GFF file.

ppanggolin.annotate.annotate.annotate_pangenome(pangenome: Pangenome, fasta_list: Path, tmpdir: str, cpu: int = 1, translation_table: int = 11, kingdom: str = 'bacteria', norna: bool = False, allow_overlap: bool = False, procedure: str | None = None, disable_bar: bool = False)

Main function to annotate a pangenome

Parameters:
  • pangenome – Pangenome with gene families to align with the given input sequences

  • fasta_list – Path to the file listing the FASTA files whose sequences form the basis of the pangenome

  • tmpdir – Path to temporary directory

  • cpu – number of CPU cores to use

  • translation_table – Translation table (genetic code) to use.

  • kingdom – Kingdom to which the prokaryote belongs, used to select the models for rRNA annotation.

  • norna – Use to avoid annotating RNA features.

  • allow_overlap – Use to not remove genes overlapping with RNA features

  • procedure – prodigal procedure used

  • disable_bar – Disable the progress bar

ppanggolin.annotate.annotate.check_and_add_extra_gene_part(gene: Gene, new_gene_info: Dict, max_separation: int = 10)

Checks and potentially adds extra gene parts based on new gene information. This is done before checking for potential overlapping edge genes. Gene coordinates are expected to be in ascending order, and no circularity is taken into account here.

Parameters:
  • gene – Gene object to be compared and potentially merged with new_gene_info.

  • new_gene_info – Dictionary containing information about the new gene.

  • max_separation – Maximum allowed separation between gene coordinates for merging. Default is 10.

Raises:
  • AssertionError – If the start position is greater than the stop position in new_gene_info.

  • ValueError – If the coordinates of genes are too far apart to merge, or if the gene attributes do not match.

ppanggolin.annotate.annotate.check_annotate_args(args: Namespace)

Check that the given arguments are usable

Parameters:

args – All arguments provided by the user

Raises:

Exception

ppanggolin.annotate.annotate.chose_gene_identifiers(pangenome: Pangenome) → bool

Parses the pangenome genes to decide whether to use local_identifiers or ppanggolin generated gene identifiers. If the local identifiers are unique within the pangenome they are picked, otherwise ppanggolin ones are used.

Parameters:

pangenome – input pangenome

Returns:

Boolean stating True if local identifiers are used, and False otherwise

ppanggolin.annotate.annotate.combine_contigs_metadata(contig_to_metadata: Dict[Contig, Dict[str, str]]) → Tuple[Dict[str, str], Dict[Contig, Dict[str, str]]]

Combine contig metadata to identify shared and unique metadata tags and values.

Parameters:

contig_to_metadata – A dictionary mapping each contig to its associated metadata.

Returns:

A tuple containing:

  • A dictionary of shared metadata tags and values present in all contigs.

  • A dictionary mapping each contig to its unique metadata tags and values.
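
The split between shared and unique metadata can be sketched as follows. This is a minimal illustration, not the PPanGGOLiN implementation: the function name split_shared_metadata is hypothetical, contigs are stood in for by plain strings, and it is an assumption that "shared" means an identical tag/value pair in every contig.

```python
from typing import Dict, Hashable, Tuple

Metadata = Dict[str, str]


def split_shared_metadata(
    contig_to_metadata: Dict[Hashable, Metadata],
) -> Tuple[Metadata, Dict[Hashable, Metadata]]:
    """Hypothetical helper: separate metadata common to all contigs from the rest."""
    if not contig_to_metadata:
        return {}, {}
    # A (tag, value) pair is shared only if every contig carries exactly that pair.
    item_sets = [set(meta.items()) for meta in contig_to_metadata.values()]
    shared_items = set.intersection(*item_sets)
    # Whatever is not shared stays attached to its own contig.
    unique = {
        contig: {k: v for k, v in meta.items() if (k, v) not in shared_items}
        for contig, meta in contig_to_metadata.items()
    }
    return dict(shared_items), unique


shared, unique = split_shared_metadata(
    {
        "contig_1": {"strain": "K12", "topology": "circular"},
        "contig_2": {"strain": "K12", "topology": "linear"},
    }
)
```

Here shared holds the strain tag, while each contig keeps only its own topology value.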

ppanggolin.annotate.annotate.correct_putative_overlaps(contigs: Iterable[Contig])

Corrects putative overlaps in gene coordinates for circular contigs.

Parameters:

contigs – Iterable of Contig objects representing circular contigs.

Raises:

ValueError – If a gene start position is higher than the length of the contig.

ppanggolin.annotate.annotate.create_gene(org: Organism, contig: Contig, gene_counter: int, rna_counter: int, gene_id: str, dbxrefs: Set[str], coordinates: List[Tuple[int, int]], strand: str, gene_type: str, position: int | None = None, gene_name: str = '', product: str = '', genetic_code: int = 11, protein_id: str = '') → Gene

Create a Gene object and associate it with a contig and an organism

Parameters:
  • org – Organism to which the gene is added

  • contig – Contig to which the gene is added

  • gene_counter – Gene counter to name gene

  • rna_counter – RNA counter to name RNA

  • gene_id – local identifier

  • dbxrefs – cross-reference to external DB

  • coordinates – Gene start and stop positions

  • strand – gene strand association

  • gene_type – gene type

  • position – position in contig

  • gene_name – Gene name

  • product – Function of gene

  • genetic_code – Genetic code used

  • protein_id – Protein identifier

ppanggolin.annotate.annotate.determine_genetic_code_from_annotation_files(pangenome)

Determine the genetic code from the pangenome based on gene annotations.

This function counts the occurrence of each genetic code across all genes in the pangenome (excluding genes with genetic_code == 0) and selects the most common one. A pangenome is expected to have a unique genetic code, so a warning is issued if multiple genetic codes are detected, including examples of genes and genomes with different codes.

Parameters:

pangenome – A Pangenome object containing genes with genetic_code attributes

Returns:

The most common genetic code (int) found in the pangenome, or None if no genetic code information is available in the annotations (all genes have genetic_code == 0)

Warning:

Issues a warning if multiple genetic codes are detected in the pangenome
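
The counting step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name most_common_genetic_code is hypothetical, and it takes a plain iterable of integers rather than a Pangenome object.

```python
import logging
from collections import Counter
from typing import Iterable, Optional


def most_common_genetic_code(codes: Iterable[int]) -> Optional[int]:
    """Hypothetical helper: pick the most frequent non-zero genetic code."""
    # A code of 0 means "no information", so it is excluded from the count.
    counts = Counter(code for code in codes if code != 0)
    if not counts:
        return None
    if len(counts) > 1:
        # A pangenome is expected to use a single genetic code.
        logging.getLogger(__name__).warning(
            "Multiple genetic codes detected: %s", dict(counts)
        )
    return counts.most_common(1)[0][0]
```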

ppanggolin.annotate.annotate.determine_genetic_code_to_use(user_translation_table: int, is_user_specified: bool, genetic_code_from_annotation: int | None = None) → int

Determine the genetic code to use for the pangenome.

This function implements the following priority logic:

  1. Extract genetic code from annotation files

  2. If user explicitly specified a translation table and it differs from annotation, issue a warning and use the user-specified value

  3. If no genetic code found in annotations, use the user-specified/default table

Parameters:
  • user_translation_table – The translation table value provided by the user (default or explicitly specified)

  • is_user_specified – Whether the translation table was explicitly specified by the user

  • genetic_code_from_annotation – Genetic code value inferred from annotation files, or None if no genetic code was found

Returns:

The genetic code to use for the pangenome
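
The priority logic can be sketched as a small decision function. This is an illustration only: the name pick_genetic_code is hypothetical, and the case where the annotation provides a code and the user did not explicitly specify one is assumed to resolve to the annotation value.

```python
import logging
from typing import Optional


def pick_genetic_code(
    user_table: int,
    is_user_specified: bool,
    from_annotation: Optional[int] = None,
) -> int:
    """Hypothetical helper mirroring the documented priority order."""
    if from_annotation is None:
        # Nothing found in the annotations: fall back to the user/default table.
        return user_table
    if is_user_specified and user_table != from_annotation:
        # An explicit user choice wins over the annotation, with a warning.
        logging.getLogger(__name__).warning(
            "Translation table %d differs from annotation value %d; using %d.",
            user_table, from_annotation, user_table,
        )
        return user_table
    # Assumption: otherwise the annotation value is used.
    return from_annotation
```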

ppanggolin.annotate.annotate.extract_positions(string: str) → Tuple[List[Tuple[int, int]], bool, bool, bool]

Extracts start and stop positions from a string and determines whether the feature is on the complement strand and whether it is partial at its start or end.

Example of strings that the function is able to process:

“join(190..7695,7695..12071)”, “complement(join(4359800..4360707,4360707..4360962))”, “join(6835405..6835731,1..1218)”, “join(1375484..1375555,1375557..1376579)”, “complement(6815492..6816265)”, “6811501..6812109”, “complement(6792573..>6795461)”, “join(1038313,1..1016)”

Parameters:

string – The input string containing position information.

Returns:

A tuple containing a list of tuples representing start and stop positions, a boolean indicating whether it is complement, a boolean indicating whether it is a partial gene at start position and a boolean indicating whether it is a partial gene at end position.

Raises:

ValueError – If the string is not formatted as expected or if positions cannot be parsed as integers.
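
A parser for location strings like the examples above can be sketched as follows. This is a minimal sketch, not the library code: the name parse_location is hypothetical, and it is an assumption that a "<" in the first range marks a partial start and a ">" in the last range marks a partial end.

```python
import re
from typing import List, Tuple


def parse_location(location: str) -> Tuple[List[Tuple[int, int]], bool, bool, bool]:
    """Hypothetical helper: parse a GenBank-style feature location."""
    is_complement = "complement" in location
    # Keep only the innermost parenthesised content, e.g. "190..7695,7695..12071".
    inner = re.findall(r"\(([^()]+)\)", location)
    if "(" in location and not inner:
        raise ValueError(f"Location {location!r} is not formatted as expected.")
    positions = inner[-1] if inner else location.strip()
    parts = [part.strip() for part in positions.split(",")]
    partial_start = "<" in parts[0]
    partial_end = ">" in parts[-1]
    coordinates = []
    for part in parts:
        bounds = part.replace("<", "").replace(">", "").split("..")
        if len(bounds) == 1:
            # A single position describes a one-nucleotide range.
            bounds = bounds * 2
        if len(bounds) != 2:
            raise ValueError(f"Location {location!r} is not formatted as expected.")
        try:
            start, stop = int(bounds[0]), int(bounds[1])
        except ValueError as err:
            raise ValueError(f"Cannot parse positions in {location!r}.") from err
        coordinates.append((start, stop))
    return coordinates, is_complement, partial_start, partial_end
```

For instance, parse_location("complement(6792573..>6795461)") yields one range on the complement strand with a partial end.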

ppanggolin.annotate.annotate.fix_partial_gene_coordinates(coordinates: List[Tuple[int, int]], is_complement: bool, start_shift: int, ensure_codon_multiple: bool = True) → List[Tuple[int, int]]

Adjusts gene coordinates if they have partial starts or ends, ensuring the gene length is a multiple of 3.

If the gene is on the complement strand, the adjustments will be reversed (i.e., applied to the opposite ends).

Parameters:
  • coordinates – List of coordinate tuples (start, stop) for the gene.

  • is_complement – Flag indicating if the gene is on the complement strand.

  • start_shift – The value by which the start coordinate should be shifted.

  • ensure_codon_multiple – Flag to check that gene length is a multiple of 3.

Returns:

A new list of adjusted coordinate tuples.

ppanggolin.annotate.annotate.get_gene_sequences_from_fastas(pangenome: Pangenome, fasta_files: Path, disable_bar: bool = False)

Get gene sequences from FASTA files

Parameters:
  • pangenome – Input pangenome

  • fasta_files – List of FASTA files

  • disable_bar – Flag to disable progress bar

ppanggolin.annotate.annotate.launch(args: Namespace)

Command launcher

Parameters:

args – All arguments provided by the user

ppanggolin.annotate.annotate.local_identifiers_are_unique(genes: Iterable[Gene]) → bool

Check if local_identifiers of genes are unique in order to decide if they should be used as gene id.

Parameters:

genes – Iterable of gene objects

Returns:

Boolean stating True if local identifiers are unique, and False otherwise
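
The uniqueness check can be sketched as follows. This is a minimal sketch that works on plain identifier strings instead of Gene objects; the name identifiers_are_unique is hypothetical.

```python
from typing import Iterable


def identifiers_are_unique(local_ids: Iterable[str]) -> bool:
    """Hypothetical helper: True if no identifier occurs more than once."""
    seen = set()
    for local_id in local_ids:
        if local_id in seen:
            # A single duplicate is enough to reject local identifiers.
            return False
        seen.add(local_id)
    return True
```

Stopping at the first duplicate avoids scanning every gene when the identifiers clash early.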

ppanggolin.annotate.annotate.parse_contig_header_lines(header_lines: List[str]) → Dict[str, str]

Parse required information from header lines of a contig from a GBFF file.

Parameters:

header_lines – List of strings representing header lines of a contig from a GBFF file.

Returns:

A dict with keys representing different fields and values representing their corresponding values joined by new line.

ppanggolin.annotate.annotate.parse_db_xref_metadata(db_xref_values: List[str], annot_file_path: Path = '') → Dict[str, str]

Parses a list of db_xref values and returns a dictionary with formatted keys and identifiers.

Parameters:
  • db_xref_values – List of db_xref strings in the format <database>:<identifier>.

  • annot_file_path – Path to the annotation file being processed.

Returns:

Dictionary with keys formatted as ‘db_xref_<database>’ and their corresponding identifiers.
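
The key formatting can be illustrated with a short sketch. The name db_xrefs_to_metadata is hypothetical, and silently skipping malformed values is an assumption; the actual function may warn or raise instead.

```python
from typing import Dict, List


def db_xrefs_to_metadata(db_xref_values: List[str]) -> Dict[str, str]:
    """Hypothetical helper: turn '<database>:<identifier>' strings into a dict."""
    metadata = {}
    for value in db_xref_values:
        # Split on the first colon only, so identifiers may contain colons.
        database, separator, identifier = value.partition(":")
        if not separator or not identifier:
            continue  # assumption: malformed entries are ignored
        metadata[f"db_xref_{database}"] = identifier
    return metadata
```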

ppanggolin.annotate.annotate.parse_dna_seq_lines(sequence_lines: List[str]) → str

Parse sequence_lines from a GBFF file and return dna sequence

Parameters:

sequence_lines – List of strings representing sequence lines from a GBFF file.

Returns:

The cleaned DNA sequence as an upper-case string
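
Cleaning GBFF sequence lines can be sketched as follows, assuming that "cleaned" means dropping the position numbers and whitespace found on each line of the ORIGIN block. The name clean_origin_lines is hypothetical.

```python
from typing import List


def clean_origin_lines(sequence_lines: List[str]) -> str:
    """Hypothetical helper: join GBFF sequence lines into one upper-case string."""
    chunks = []
    for line in sequence_lines:
        # Keep letters only, which removes position numbers, spaces and newlines.
        chunks.append("".join(char for char in line if char.isalpha()))
    return "".join(chunks).upper()
```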

ppanggolin.annotate.annotate.parse_feature_lines(feature_lines: List[str]) → Generator[Dict[str, str | Set[str]], None, None]

Parse feature lines from a GBFF file and yield dictionaries representing each feature.

Parameters:

feature_lines – List of strings representing feature lines from a GBFF file.

Returns:

A generator that yields dictionaries, each representing a feature with its type, location, and qualifiers.

ppanggolin.annotate.annotate.parse_gbff_by_contig(gbff_file_path: Path) → Generator[Tuple[Dict[str, str], Generator[Dict[str, str | Set[str]], None, None], str], None, None]

Parse a GBFF file by contig and yield tuples containing header, feature, and sequence info for each contig.

Parameters:

gbff_file_path – Path to the GBFF file.

Returns:

A generator that yields tuples containing header lines, feature lines, and sequence info for each contig.

ppanggolin.annotate.annotate.parser_annot(parser: ArgumentParser)

Parser for the arguments specific to the annotate command

Parameters:

parser – Parser for the annotate arguments

ppanggolin.annotate.annotate.read_anno_file(organism_name: str, filename: Path, circular_contigs: list, pseudo: bool = False) → Tuple[Organism, bool]

Read a GBFF file for one organism

Parameters:
  • organism_name – Name of the organism

  • filename – Path to the corresponding file

  • circular_contigs – List of circular contigs

  • pseudo – Allow reading pseudogenes

Returns:

The annotated organism and a boolean that is True if the file contains sequences

ppanggolin.annotate.annotate.read_annotations(pangenome: Pangenome, organisms_file: Path, cpu: int = 1, pseudo: bool = False, translation_table: int = 11, is_translation_table_specified: bool = False, disable_bar: bool = False)

Read the annotation from GBFF file

Parameters:
  • pangenome – pangenome object

  • organisms_file – List of GBFF files for each organism

  • cpu – number of CPU cores to use

  • pseudo – allow to read pseudogene

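  • translation_table – Translation table (genetic code) to use when /transl_table is missing from CDS tags.

  • is_translation_table_specified – Whether the translation table was explicitly specified by the user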

  • disable_bar – Disable the progress bar

ppanggolin.annotate.annotate.read_org_gbff(organism_name: str, gbff_file_path: Path, circular_contigs: List[str], use_pseudogenes: bool = False) → Tuple[Organism, bool]

Read a GBFF file and fills Organism, Contig and Genes objects based on information contained in this file

Parameters:
  • organism_name – Organism name

  • gbff_file_path – Path to corresponding GBFF file

  • circular_contigs – List of circular contigs

  • use_pseudogenes – Allow reading pseudogenes

Returns:

The completed organism and a boolean that is True if the file contains sequences

ppanggolin.annotate.annotate.read_org_gff(organism: str, gff_file_path: Path, circular_contigs: List[str], pseudo: bool = False) → Tuple[Organism, bool]

Read annotation from GFF file

Parameters:
  • organism – Organism name

  • gff_file_path – Path corresponding to GFF file

  • circular_contigs – List of circular contigs

  • pseudo – Allow to read pseudogene

Returns:

Organism object and a boolean indicating whether sequences are associated with it

ppanggolin.annotate.annotate.reverse_complement_coordinates(coordinates: List[Tuple[int, int]]) → List[Tuple[int, int]]

Reverses and inverts the given list of coordinates. Each coordinate pair (start, end) is transformed into (-end, -start) and the order of the coordinates is reversed.

Parameters:

coordinates – A list of tuples representing the coordinates to be reversed and inverted.

Returns:

A list of reversed and inverted coordinates.
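
The transformation described above fits in one line. The sketch below uses the hypothetical name flip_coordinates to make clear it is an illustration rather than the library function.

```python
from typing import List, Tuple


def flip_coordinates(coordinates: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Hypothetical helper: map each (start, end) to (-end, -start), in reverse order."""
    return [(-end, -start) for start, end in reversed(coordinates)]


flipped = flip_coordinates([(1, 10), (20, 30)])
```

Applying the transformation twice gives back the original list, which is what lets an end shift be expressed as a start shift on the flipped coordinates.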

ppanggolin.annotate.annotate.shift_end_coordinates(coordinates: List[Tuple[int, int]], shift: int) → List[Tuple[int, int]]

Shifts the end of a set of coordinates by a specified amount and then returns the final shifted coordinates. This involves reversing the coordinates twice, shifting the start, and then returning the original orientation.

Parameters:
  • coordinates – A list of tuples representing the original coordinates.

  • shift – The amount by which the end coordinate should be shifted.

Returns:

The coordinates after the end shift and reverse complement transformations.

ppanggolin.annotate.annotate.shift_start_coordinates(coordinates: List[Tuple[int, int]], shift: int) → List[Tuple[int, int]]

Shifts the start of the first coordinate in the list by the specified amount. If the shift results in a negative or zero-length interval for the first coordinate, this interval is removed, and the shift is propagated to the next coordinate if necessary.

Parameters:
  • coordinates – A list of tuples representing the coordinates.

  • shift – The amount by which the start coordinate should be shifted.

Returns:

A new list of coordinates with the shifted start.
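
The start shift, and the end shift built on top of it, can be sketched as follows. This is a minimal illustration with hypothetical names (flip, shift_start, shift_end). It assumes inclusive coordinates, so an interval (start, stop) covers stop - start + 1 positions, and it assumes that a shift consuming the whole feature is an error.

```python
from typing import List, Tuple

Coordinates = List[Tuple[int, int]]


def flip(coordinates: Coordinates) -> Coordinates:
    """Map each (start, stop) to (-stop, -start) and reverse the order."""
    return [(-stop, -start) for start, stop in reversed(coordinates)]


def shift_start(coordinates: Coordinates, shift: int) -> Coordinates:
    """Shift the first start, dropping intervals that the shift consumes."""
    if not coordinates:
        raise ValueError("No coordinates to shift.")
    (start, stop), rest = coordinates[0], list(coordinates[1:])
    new_start = start + shift
    remaining = stop - new_start + 1  # length left in the first interval
    if remaining > 0:
        return [(new_start, stop)] + rest
    if not rest:
        # Assumption: consuming the entire feature is treated as an error.
        raise ValueError("The shift consumes the whole feature.")
    # A negative length means the shift overshoots into the next interval.
    return shift_start(rest, -remaining) if remaining < 0 else rest


def shift_end(coordinates: Coordinates, shift: int) -> Coordinates:
    """Shift the end by flipping, shifting the start, then flipping back."""
    return flip(shift_start(flip(coordinates), shift))
```

With intervals [(1, 3), (10, 20)] and a shift of 5, the first interval (3 positions) is removed and the remaining 2 positions are taken from the second one.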

ppanggolin.annotate.annotate.subparser(sub_parser: _SubParsersAction) → ArgumentParser

Subparser to launch PPanGGOLiN from the command line

Parameters:

sub_parser – sub_parser for the annotate command

Returns:

Parser arguments for the annotate command

ppanggolin.annotate.synta module

ppanggolin.annotate.synta.annotate_organism(org_name: str, file_name: Path, circular_contigs: List[str], tmpdir: str, code: int = 11, norna: bool = False, kingdom: str = 'bacteria', allow_overlap: bool = False, procedure: str | None = None) → Organism

Function to annotate a single organism

Parameters:
  • org_name – Name of the organism / genome

  • file_name – Path to the fasta file containing organism sequences

  • circular_contigs – List of circular contigs

  • code – Translation table (genetic code) to use.

  • kingdom – Kingdom to which the prokaryote belongs, used to select the models for rRNA annotation.

  • norna – Use to avoid annotating RNA features.

  • tmpdir – Path to temporary directory

  • allow_overlap – Use to not remove genes overlapping with RNA features

  • procedure – prodigal procedure used

Returns:

Complete organism object for pangenome

ppanggolin.annotate.synta.check_sequence_tuple(name: str, sequence: str)

Checks and validates a sequence name and its corresponding sequence.

Parameters:
  • name – The name (header) of the sequence, typically extracted from the FASTA file header.

  • sequence – The sequence string corresponding to the name, containing the nucleotide or protein sequence.

Returns:

A tuple containing the validated name and sequence.

Raises:

ValueError

  • If the sequence is empty, a ValueError is raised with a message containing the header name.

  • If the name is empty, a ValueError is raised with a message containing a preview of the sequence.
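
The two checks can be sketched as follows. The name validate_record is hypothetical, and the exact error messages and the 60-character preview length are assumptions.

```python
from typing import Tuple


def validate_record(name: str, sequence: str) -> Tuple[str, str]:
    """Hypothetical helper: reject FASTA records with an empty name or sequence."""
    if not sequence:
        raise ValueError(f"Found an empty sequence with header '{name}'.")
    if not name:
        # Show the beginning of the sequence to help locate the faulty record.
        raise ValueError(f"Found a sequence with an empty header: '{sequence[:60]}'.")
    return name, sequence
```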

ppanggolin.annotate.synta.get_contigs_from_fasta_file(org: Organism, fna_file: TextIOWrapper | list) → Dict[str, str]

Processes contigs from a parsed FASTA generator and stores in a dictionary.

Parameters:
  • org – Organism instance to update with contig info.

  • fna_file – Input FASTA file or list of lines as sequences.

Returns:

Dictionary with contig names as keys and sequences as values.

ppanggolin.annotate.synta.get_dna_sequence(contig_seq: str, gene: Gene | RNA) → str

Return the gene sequence

Parameters:
  • contig_seq – Contig sequence

  • gene – Gene

Returns:

str

ppanggolin.annotate.synta.init_contig_counter(value: Value)

Initialize the contig counter for later use

ppanggolin.annotate.synta.launch_aragorn(fna_file: str, org: Organism, contig_to_length: Dict[str, int]) → defaultdict

Launches Aragorn to annotate tRNAs.

Parameters:
  • fna_file – file-like object containing the uncompressed fasta sequences

  • org – Organism which will be annotated

Returns:

Annotated genes in a list of gene objects

ppanggolin.annotate.synta.launch_infernal(fna_file: str, org: Organism, tmpdir: str, kingdom: str = 'bacteria') → defaultdict

Launches Infernal in hmmer-only mode to annotate rRNAs.

Parameters:
  • fna_file – file-like object containing the uncompressed fasta sequences

  • org – Organism which will be annotated

  • kingdom – Kingdom to which the prokaryote belongs, used to select the models for rRNA annotation.

  • tmpdir – Path to temporary directory

Returns:

Annotated genes in a list of gene objects.

ppanggolin.annotate.synta.launch_prodigal(contig_sequences: Dict[str, str], org: Organism, code: int = 11, use_meta: bool = False) → defaultdict

Launches Prodigal to annotate CDS. Takes a fna file name and a locustag to give an ID to the predicted genes.

Parameters:
  • contig_sequences – Dict containing contig sequences for pyrodigal

  • org – Organism which will be annotated

  • code – Translation table (genetic code) to use.

  • use_meta – use meta procedure in Prodigal

Returns:

Annotated genes in a list of gene objects

ppanggolin.annotate.synta.overlap_filter(all_genes: defaultdict, allow_overlap: bool = False) → defaultdict

Removes the CDS that overlap with RNA genes.

Parameters:
  • all_genes – Dictionary with complete list of genes

  • allow_overlap – Use to not remove genes overlapping with RNA features

Returns:

Dictionary with genes filtered
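
The filtering idea can be sketched on a flat list of features. This is a simplified illustration, not the library code: the Feature tuple and the function names are hypothetical, the real function works on a defaultdict of gene objects, and circular contigs are not considered here.

```python
from typing import List, NamedTuple


class Feature(NamedTuple):
    """Hypothetical stand-in for a gene object."""
    kind: str  # "CDS" or "RNA"
    start: int
    stop: int


def overlaps(first: Feature, second: Feature) -> bool:
    """True if the two inclusive intervals share at least one position."""
    return first.start <= second.stop and second.start <= first.stop


def drop_cds_overlapping_rna(
    features: List[Feature], allow_overlap: bool = False
) -> List[Feature]:
    """Keep RNAs, and keep a CDS only if it overlaps no RNA."""
    if allow_overlap:
        return list(features)
    rnas = [feature for feature in features if feature.kind == "RNA"]
    return [
        feature
        for feature in features
        if feature.kind == "RNA" or not any(overlaps(feature, rna) for rna in rnas)
    ]
```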

ppanggolin.annotate.synta.parse_fasta(fna_file: TextIOWrapper | list) → Generator[Tuple[str, str], None, None]

Yields each sequence name and sequence from a FASTA file or stream as a tuple.

Parameters:

fna_file – Input FASTA file or list of lines as sequences.

Yield:

Tuple with contig header (without ‘>’) and sequence.

Raises:

ValueError – If the file does not contain valid FASTA format.
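
A generator of this kind can be sketched as follows. The name iter_fasta is hypothetical, and two behaviours are assumptions: the whole header line (minus ">") is used as the name, and input without any record raises ValueError.

```python
from typing import Generator, Iterable, List, Optional, Tuple


def iter_fasta(lines: Iterable[str]) -> Generator[Tuple[str, str], None, None]:
    """Hypothetical helper: yield (header, sequence) pairs from FASTA lines."""
    name: Optional[str] = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines are ignored
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)  # emit the previous record
            name, chunks = line[1:].strip(), []
        elif name is None:
            raise ValueError("Sequence data found before any FASTA header.")
        else:
            chunks.append(line)
    if name is None:
        raise ValueError("No FASTA record found.")
    yield name, "".join(chunks)


records = list(iter_fasta([">contig_1\n", "ACGT\n", "TT\n", ">contig_2\n", "GG\n"]))
```

Because the function accepts any iterable of lines, it works the same way on an open file handle and on a list of strings.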

ppanggolin.annotate.synta.reverse_complement(seq: str)

Reverse complement the given DNA sequence

Parameters:

seq – Sequence to reverse complement

Returns:

The reverse-complemented sequence
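
Reverse complementation can be sketched with a translation table. The name revcomp is hypothetical, and handling only A, C, G, T and N (in both cases) is an assumption; the library may support further ambiguity codes.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def revcomp(seq: str) -> str:
    """Hypothetical helper: complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]
```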

ppanggolin.annotate.synta.syntaxic_annotation(org: Organism, fasta_file: TextIOWrapper, contig_sequences: Dict[str, str], tmpdir: str, norna: bool = False, kingdom: str = 'bacteria', code: int = 11, use_meta: bool = False) → defaultdict

Runs the different tools for the syntactic annotation.

Parameters:
  • org – Organism which will be annotated

  • fasta_file – file-like object containing the uncompressed fasta sequences

  • contig_sequences – Dict containing contig sequences for pyrodigal

  • tmpdir – Path to temporary directory

  • norna – Use to avoid annotating RNA features.

  • kingdom – Kingdom to which the prokaryote belongs, used to select the models for rRNA annotation.

  • code – Translation table (genetic code) to use.

  • use_meta – Use meta prodigal procedure

Returns:

list of genes in the organism

ppanggolin.annotate.synta.write_tmp_fasta(contigs: dict, tmpdir: str) → _TemporaryFileWrapper

Writes a temporary fna formatted file and returns the file-like object. Useful in case of compressed input file. The file will be deleted when close() is called.

Parameters:
  • contigs – Sequences of each contig

  • tmpdir – path to temporary directory

Returns:

fasta file
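
The delete-on-close behaviour described above matches what tempfile.NamedTemporaryFile provides by default. The sketch below uses the hypothetical name write_fasta_to_tmp and writes each sequence on a single line, which is an assumption; the library may wrap sequence lines.

```python
import tempfile
from typing import Dict


def write_fasta_to_tmp(contigs: Dict[str, str], tmpdir: str):
    """Hypothetical helper: write contigs to a temporary FASTA file."""
    # With the default delete=True, the file is removed when close() is called.
    handle = tempfile.NamedTemporaryFile(mode="w", dir=tmpdir, suffix=".fna")
    for name, sequence in contigs.items():
        handle.write(f">{name}\n{sequence}\n")
    # Flush so that external tools reading handle.name see the full content.
    handle.flush()
    return handle
```

The caller keeps the returned object alive for as long as the file is needed and calls close() afterwards.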

Module contents