ppanggolin.formats package

Submodules

ppanggolin.formats.readBinaries module

class ppanggolin.formats.readBinaries.Genedata(start: int, stop: int, strand: str, gene_type: str, position: int, name: str, product: str, genetic_code: int, coordinates: List[Tuple[int]] | None = None)

Bases: object

This is a general class storing unique gene-related data to be written in a specific genedata table
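To illustrate why gene-related data is stored once per unique combination, here is a minimal sketch. It assumes that records can be compared and hashed by value. `GenedataSketch` and `genedata_id` are hypothetical names invented for this example and are not part of ppanggolin.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class GenedataSketch:
    """Hypothetical stand-in mirroring some fields of the Genedata signature."""

    start: int
    stop: int
    strand: str
    gene_type: str
    position: int
    name: str
    product: str
    genetic_code: int


def genedata_id(table: Dict[GenedataSketch, int], record: GenedataSketch) -> int:
    """Return the identifier of `record`, registering it if it is new."""
    # len(table) is evaluated before insertion, so new records get 0, 1, 2, ...
    return table.setdefault(record, len(table))


table: Dict[GenedataSketch, int] = {}
first = GenedataSketch(1, 900, "+", "CDS", 0, "dnaA", "initiator protein", 11)
same = GenedataSketch(1, 900, "+", "CDS", 0, "dnaA", "initiator protein", 11)
other = GenedataSketch(950, 2050, "-", "CDS", 1, "dnaN", "sliding clamp", 11)
ids = [genedata_id(table, record) for record in (first, same, other)]
```

Two features with identical data share one identifier, so only two rows would need to be stored for the three records above.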

ppanggolin.formats.readBinaries.check_pangenome_info(pangenome, need_annotations: bool = False, need_families: bool = False, need_graph: bool = False, need_partitions: bool = False, need_rgp: bool = False, need_spots: bool = False, need_gene_sequences: bool = False, need_modules: bool = False, need_metadata: bool = False, metatypes: Set[str] | None = None, sources: Set[str] | None = None, disable_bar: bool = False)

Defines what needs to be read depending on what is needed, and automatically checks if the required elements have been computed with regard to the pangenome.status

Parameters:
  • pangenome – Pangenome object without some information

  • need_annotations – get annotation

  • need_families – get gene families

  • need_graph – get graph

  • need_partitions – get partition

  • need_rgp – get RGP

  • need_spots – get hotspot

  • need_gene_sequences – get gene sequences

  • need_modules – get modules

  • need_metadata – get metadata

  • metatypes – metatypes of the metadata to get (None means all types with metadata)

  • sources – sources of the metadata to get (None means all possible sources)

  • disable_bar – Disable the progress bar

ppanggolin.formats.readBinaries.create_info_dict(info_group: Group)

Read the pangenome content

Parameters:

info_group – group in pangenome HDF5 file containing information about pangenome

ppanggolin.formats.readBinaries.get_families_from_genes(h5f: File, genes: Set[bytes]) Set[bytes]

Retrieves gene families associated with a specified set of genes from the pangenome file.

Parameters:
  • h5f – The open HDF5 pangenome file containing gene family data.

  • genes – A set of gene names (as bytes) for which to retrieve the associated families.

Returns:

A set of gene family names (as bytes) associated with the specified genes.

ppanggolin.formats.readBinaries.get_families_matching_partition(h5f: File, partition: str) Set[bytes]

Retrieves gene families that match the specified partition.

Parameters:
  • h5f – The open HDF5 pangenome file containing gene family information.

  • partition – The partition name (as a string). If “all”, all gene families are included. Otherwise, it filters by the first letter of the partition.

Returns:

A set of gene family names (as bytes) that match the partition criteria.
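The filtering rule described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: the in-memory `family_to_partition` dictionary and the helper `families_matching_partition` are hypothetical stand-ins for the HDF5 table, and the case-insensitive comparison is an assumption of this sketch.

```python
from typing import Dict, Set


def families_matching_partition(
    family_to_partition: Dict[bytes, str], partition: str
) -> Set[bytes]:
    """Return the families whose partition matches `partition`.

    Hypothetical helper: "all" keeps every family; any other value is
    compared through its first letter only.
    """
    if partition == "all":
        return set(family_to_partition)
    wanted = partition[0].upper()
    return {
        family
        for family, stored in family_to_partition.items()
        if stored[:1].upper() == wanted
    }


# Toy data standing in for the gene family table of a pangenome file.
toy_families = {b"famA": "P", b"famB": "S1", b"famC": "C", b"famD": "S2"}
shell_families = families_matching_partition(toy_families, "shell")
every_family = families_matching_partition(toy_families, "all")
```

Matching on the first letter means that a request for "shell" would also pick up sub-partitions labelled S1 or S2 in this toy table.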

ppanggolin.formats.readBinaries.get_family_to_genome_count(h5f: File) Dict[bytes, int]

Computes the number of unique genomes associated with each gene family.

Parameters:

h5f – The open HDF5 pangenome file containing contig, gene, and gene family data.

Returns:

A dictionary mapping gene family names (as bytes) to the count of unique genomes.
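The counting step can be sketched as follows, assuming the gene-to-family and gene-to-genome relations have already been read into memory. The function name and the toy data are hypothetical; the real function reads these relations from the HDF5 file.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def family_to_genome_count(
    gene_to_family: Iterable[Tuple[bytes, bytes]],
    gene_to_genome: Dict[bytes, bytes],
) -> Dict[bytes, int]:
    """Count the distinct genomes contributing at least one gene per family."""
    genomes_per_family: Dict[bytes, Set[bytes]] = defaultdict(set)
    for gene, family in gene_to_family:
        # A set ignores repeated genomes, so paralogs are counted once.
        genomes_per_family[family].add(gene_to_genome[gene])
    return {family: len(genomes) for family, genomes in genomes_per_family.items()}


toy_gene_to_family = [
    (b"g1", b"fam1"),
    (b"g2", b"fam1"),
    (b"g3", b"fam1"),
    (b"g4", b"fam2"),
]
toy_gene_to_genome = {b"g1": b"A", b"g2": b"A", b"g3": b"B", b"g4": b"B"}
counts = family_to_genome_count(toy_gene_to_family, toy_gene_to_genome)
```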

ppanggolin.formats.readBinaries.get_gene_to_genome(h5f: File) Dict[bytes, bytes]

Generates a mapping between gene IDs and their corresponding genome.

Parameters:

h5f – The open HDF5 pangenome file containing contig and gene annotations.

Returns:

A dictionary mapping gene IDs to genome names.

ppanggolin.formats.readBinaries.get_genes_from_families(h5f: File, families: List[bytes]) Set[bytes]

Retrieves a set of genes that belong to the specified families.

This function reads the gene family data from an HDF5 pangenome file and returns a set of genes that are part of the given list of gene families.

Parameters:
  • h5f – The open HDF5 pangenome file containing gene family data.

  • families – A list of gene families (as bytes) to filter genes by.

Returns:

A set of genes (as bytes) that belong to the specified families.

ppanggolin.formats.readBinaries.get_need_info(pangenome, need_annotations: bool = False, need_families: bool = False, need_graph: bool = False, need_partitions: bool = False, need_rgp: bool = False, need_spots: bool = False, need_gene_sequences: bool = False, need_modules: bool = False, need_metadata: bool = False, metatypes: Set[str] | None = None, sources: Set[str] | None = None)

ppanggolin.formats.readBinaries.get_non_redundant_gene_sequences_from_file(pangenome_filename: str, output: Path, add: str = '', disable_bar: bool = False)

Writes the non-redundant CDS sequences of the pangenome to the output file, prepending the string ‘add’ to each identifier if one is given. The sequences are loaded from a .h5 pangenome file.

Parameters:
  • pangenome_filename – Name of the pangenome file

  • output – Path to the output file

  • add – Add a prefix to sequence header

  • disable_bar – disable progress bar

ppanggolin.formats.readBinaries.get_number_of_organisms(pangenome: Pangenome) int

Standalone function to get the number of organisms in a pangenome

Parameters:

pangenome – Annotated pangenome

Returns:

Number of organisms in the pangenome

ppanggolin.formats.readBinaries.get_pangenome_parameters(h5f: File) Dict[str, Dict[str, Any]]

Read and return the pangenome parameters.

Parameters:

h5f – Pangenome HDF5 file

Returns:

A dictionary containing the name of the ppanggolin step as the key, and a dictionary of parameter names and their corresponding values used for that step.

ppanggolin.formats.readBinaries.get_seqid_to_genes(h5f: File, genes: Set[bytes], get_all_genes: bool = False, disable_bar: bool = False) Dict[int, List[str]]

Creates a mapping of sequence IDs to gene names.

Parameters:
  • h5f – The open HDF5 pangenome file containing gene sequence data.

  • genes – A set of gene names (as bytes) to include in the mapping (used only if get_all_genes is False).

  • get_all_genes – Boolean flag to indicate if all genes should be included in the mapping. If set to True, all genes will be added regardless of the genes parameter.

  • disable_bar – Boolean flag to disable the progress bar if set to True.

Returns:

A dictionary mapping sequence IDs (integers) to lists of gene names (strings).
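A minimal sketch of how such a mapping could be built, assuming the gene sequence table has been reduced to (gene name, sequence ID) pairs. The helper name and the toy rows are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def seqid_to_genes(
    rows: Iterable[Tuple[bytes, int]],
    genes: Set[bytes],
    get_all_genes: bool = False,
) -> Dict[int, List[str]]:
    """Group gene names by the ID of the sequence they share."""
    mapping: Dict[int, List[str]] = defaultdict(list)
    for gene, seq_id in rows:
        # With get_all_genes set, the `genes` filter is ignored entirely.
        if get_all_genes or gene in genes:
            mapping[seq_id].append(gene.decode())
    return dict(mapping)


toy_rows = [(b"g1", 0), (b"g2", 0), (b"g3", 1)]
selected = seqid_to_genes(toy_rows, genes={b"g1", b"g3"})
everything = seqid_to_genes(toy_rows, genes=set(), get_all_genes=True)
```

Several genes can point to the same sequence ID because identical sequences are stored only once.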

ppanggolin.formats.readBinaries.get_soft_core_families(h5f: File, soft_core: float) Set[bytes]

Identifies gene families that are present in at least a specified proportion of genomes.

Parameters:
  • h5f – The open HDF5 pangenome file containing gene family and genome data.

  • soft_core – The proportion of genomes (between 0 and 1) that a gene family must be present in to be considered a soft core family.

Returns:

A set of gene family names (as bytes) that are classified as soft core.
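The soft core criterion can be sketched as a simple threshold on genome counts. This sketch assumes the per-family genome counts are already available; the helper name is hypothetical, and the inclusive comparison follows the "at least" wording above.

```python
from typing import Dict, Set


def soft_core_families(
    family_to_genome_count: Dict[bytes, int], genome_count: int, soft_core: float
) -> Set[bytes]:
    """Return the families present in at least `soft_core` of the genomes."""
    if not 0 <= soft_core <= 1:
        raise ValueError("soft_core must be a proportion between 0 and 1")
    threshold = genome_count * soft_core
    return {
        family
        for family, count in family_to_genome_count.items()
        if count >= threshold
    }


toy_counts = {b"famA": 10, b"famB": 9, b"famC": 3}
strict = soft_core_families(toy_counts, genome_count=10, soft_core=0.95)
loose = soft_core_families(toy_counts, genome_count=10, soft_core=0.5)
```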

ppanggolin.formats.readBinaries.get_status(pangenome: Pangenome, pangenome_file: Path)

Checks which elements are already present in the file.

Parameters:
  • pangenome – Blank pangenome

  • pangenome_file – path to the pangenome file

ppanggolin.formats.readBinaries.read_annotation(pangenome: Pangenome, h5f: File, load_organisms: bool = True, load_contigs: bool = True, load_genes: bool = True, load_rnas: bool = True, chunk_size: int = 20000, disable_bar: bool = False)

Read annotation in pangenome hdf5 file to add in pangenome object

Parameters:
  • pangenome – Pangenome object without annotation

  • h5f – Pangenome HDF5 file with annotation

  • load_organisms – Flag to load organisms

  • load_contigs – Flag to load contigs

  • load_genes – Flag to load genes

  • load_rnas – Flag to load RNAs

  • chunk_size – Number of rows read per chunk

  • disable_bar – Disable the progress bar

ppanggolin.formats.readBinaries.read_chunks(table: Table, column: str | None = None, chunk: int = 10000)

Reads the entire table (or a single column, if specified) chunk by chunk to limit RAM usage.

Parameters:
  • table – Table to read

  • column – Name of the column to read (None means all columns)

  • chunk – Number of rows read per chunk
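The chunked reading pattern can be sketched with a generator. This is a minimal illustration that assumes rows are yielded one at a time; a list of dictionaries stands in for the table, and `read_in_chunks` is a hypothetical name.

```python
from typing import Any, Dict, Iterator, List, Optional


def read_in_chunks(
    rows: List[Dict[str, Any]], column: Optional[str] = None, chunk: int = 10000
) -> Iterator[Any]:
    """Yield rows (or one column of each row) while slicing `chunk` rows at a time."""
    if chunk <= 0:
        raise ValueError("chunk must be a positive integer")
    for start in range(0, len(rows), chunk):
        # Only this slice is materialised at any one time.
        for row in rows[start:start + chunk]:
            yield row if column is None else row[column]


toy_rows = [{"ID": i, "name": f"gene_{i}"} for i in range(5)]
names = list(read_in_chunks(toy_rows, column="name", chunk=2))
```

With a table stored on disk, the slice would be a partial read, so peak memory is bounded by the chunk size rather than by the table size.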

ppanggolin.formats.readBinaries.read_contigs(pangenome: Pangenome, table: Table, chunk_size: int = 20000, disable_bar: bool = False)

Read contig table in pangenome file to add them to the pangenome object

Parameters:
  • pangenome – Pangenome object

  • table – Contig table

  • chunk_size – Size of the chunk reading

  • disable_bar – Disable progress bar

ppanggolin.formats.readBinaries.read_gene_families(pangenome: Pangenome, h5f: File, disable_bar: bool = False)

Read gene families in pangenome hdf5 file to add in pangenome object

Parameters:
  • pangenome – Pangenome object without gene families

  • h5f – Pangenome HDF5 file with gene families information

  • disable_bar – Disable the progress bar

ppanggolin.formats.readBinaries.read_gene_families_info(pangenome: Pangenome, h5f: File, disable_bar: bool = False)

Read information about gene families in pangenome hdf5 file to add in pangenome object

Parameters:
  • pangenome – Pangenome object without gene families information

  • h5f – Pangenome HDF5 file with gene families information

  • disable_bar – Disable the progress bar

ppanggolin.formats.readBinaries.read_gene_sequences(pangenome: Pangenome, h5f: File, disable_bar: bool = False)

Read gene sequences in pangenome hdf5 file to add in pangenome object

Parameters:
  • pangenome – Pangenome object without gene sequences associated with genes

  • h5f – Pangenome HDF5 file with gene sequences associated with genes

  • disable_bar – Disable the progress bar

ppanggolin.formats.readBinaries.read_genedata(h5f: File) Dict[int, Genedata]

Reads the genedata table and returns a genedata_id2genedata dictionary

Parameters:

h5f – the hdf5 file handler

Returns:

dictionary linking genedata to the genedata identifier

Raises:

KeyError – If a Genedata entry with joined coordinates is not found in the annotations.joinCoordinates table.

ppanggolin.formats.readBinaries.read_genes(pangenome: Pangenome, table: Table, genedata_dict: Dict[int, Genedata], link: bool = True, chunk_size: int = 20000, disable_bar: bool = False)

Read genes in pangenome file to add them to the pangenome object

Parameters:
  • pangenome – Pangenome object

  • table – Genes table

  • genedata_dict – Dictionary to link genedata with gene

  • link – Whether to link genes to their organism and contig

  • chunk_size – Size of the chunk reading

  • disable_bar – Disable progress bar

ppanggolin.formats.readBinaries.read_graph(pangenome: Pangenome, h5f: File, disable_bar: bool = False)

Read information about graph in pangenome hdf5 file to add in pangenome object

Parameters:
  • pangenome – Pangenome object without graph information

  • h5f – Pangenome HDF5 file with graph information

  • disable_bar – Disable the progress bar

ppanggolin.formats.readBinaries.read_info(h5f)

Read the pangenome content

Parameters:

h5f – Pangenome HDF5 file

ppanggolin.formats.readBinaries.read_join_coordinates(h5f: File) Dict[str, List[Tuple[int, int]]]

Read join coordinates from a HDF5 file and return a dictionary mapping genedata_id to coordinates.

Parameters:

h5f – An HDF5 file object.

Returns:

A dictionary mapping genedata_id to a list of tuples representing start and stop coordinates.

ppanggolin.formats.readBinaries.read_metadata(pangenome: Pangenome, h5f: File, metatype: str, sources: Set[str] | None = None, disable_bar: bool = False)

Read metadata to add them to the pangenome object

Parameters:
  • pangenome – Pangenome object

  • h5f – Pangenome file

  • metatype – Object type to associate metadata

  • sources – Source name of metadata

  • disable_bar – Disable progress bar

ppanggolin.formats.readBinaries.read_module_families_from_pangenome_file(h5f: File, module_name: str) Set[bytes]

Retrieves gene families associated with a specified module from the pangenome file.

Parameters:
  • h5f – The open HDF5 pangenome file containing module data.

  • module_name – The name of the module (as a string). The module ID is extracted from the name by removing the “module_” prefix.

Returns:

A set of gene family names (as bytes) associated with the specified module.
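The prefix handling mentioned for module_name can be sketched as below. The helper name is hypothetical, and converting the remainder to an integer is an assumption of this sketch.

```python
def module_id_from_name(module_name: str) -> int:
    """Extract the numeric ID from a name such as "module_12"."""
    prefix = "module_"
    if not module_name.startswith(prefix):
        raise ValueError(f"expected a name starting with {prefix!r}")
    # Assumption: what follows the prefix is an integer identifier.
    return int(module_name[len(prefix):])


module_id = module_id_from_name("module_12")
```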

ppanggolin.formats.readBinaries.read_modules(pangenome: Pangenome, h5f: File, disable_bar: bool = False)

Read modules in pangenome hdf5 file to add in pangenome object

Parameters:
  • pangenome – Pangenome object without modules

  • h5f – Pangenome HDF5 file with modules computed

  • disable_bar – Disable the progress bar

ppanggolin.formats.readBinaries.read_organisms(pangenome: Pangenome, table: Table, chunk_size: int = 20000, disable_bar: bool = False)

Read organism table in pangenome file to add them to the pangenome object

Parameters:
  • pangenome – Pangenome object

  • table – Organism table

  • chunk_size – Size of the chunk reading

  • disable_bar – Disable progress bar

ppanggolin.formats.readBinaries.read_pangenome(pangenome, annotation: bool = False, gene_families: bool = False, graph: bool = False, rgp: bool = False, spots: bool = False, gene_sequences: bool = False, modules: bool = False, metadata: bool = False, metatypes: Set[str] | None = None, sources: Set[str] | None = None, disable_bar: bool = False)

Reads a previously written pangenome, with all of its parts, depending on what is asked, with regard to what is filled in the ‘status’ field of the hdf5 file.

Parameters:
  • pangenome – Pangenome object without some information

  • annotation – get annotation

  • gene_families – get gene families

  • graph – get graph

  • rgp – get RGP

  • spots – get hotspot

  • gene_sequences – get gene sequences

  • modules – get modules

  • metadata – get metadata

  • metatypes – metatypes of the metadata to get

  • sources – sources of the metadata to get (None means all sources)

  • disable_bar – Disable the progress bar

ppanggolin.formats.readBinaries.read_parameters(h5f: File)

Read pangenome parameters

Parameters:

h5f – Pangenome HDF5 file

ppanggolin.formats.readBinaries.read_rgp(pangenome: Pangenome, h5f: File, disable_bar: bool = False)

Read region of genomic plasticity in pangenome hdf5 file to add in pangenome object

Parameters:
  • pangenome – Pangenome object without RGP

  • h5f – Pangenome HDF5 file with RGP computed

  • disable_bar – Disable the progress bar

ppanggolin.formats.readBinaries.read_rgp_genes_from_pangenome_file(h5f: File) Set[bytes]

Retrieves the set of RGP genes from the pangenome file.

Parameters:

h5f – The open HDF5 pangenome file containing RGP gene data.

Returns:

A set of gene names (as bytes) from the RGP.

ppanggolin.formats.readBinaries.read_rnas(pangenome: Pangenome, table: Table, genedata_dict: Dict[int, Genedata], link: bool = True, chunk_size: int = 20000, disable_bar: bool = False)

Read RNAs in pangenome file to add them to the pangenome object

Parameters:
  • pangenome – Pangenome object

  • table – RNAs table

  • genedata_dict – Dictionary to link genedata with RNA

  • link – Whether to link RNAs to their organism and contig

  • chunk_size – Size of the chunk reading

  • disable_bar – Disable progress bar

ppanggolin.formats.readBinaries.read_sequences(h5f: File) dict

Reads the sequences table and returns a sequence id to sequence dictionary

Parameters:

h5f – the hdf5 file handler

Returns:

dictionary linking sequences to the seq identifier

ppanggolin.formats.readBinaries.read_spots(pangenome: Pangenome, h5f: File, disable_bar: bool = False)

Read hotspots in the pangenome HDF5 file and add them to the pangenome object.

Parameters:
  • pangenome – Pangenome object without spot

  • h5f – Pangenome HDF5 file with spot computed

  • disable_bar – Disable the progress bar

ppanggolin.formats.readBinaries.write_fasta_gene_fam_from_pangenome_file(pangenome_filename: str, output: Path, family_filter: str, soft_core: float = 0.95, compress: bool = False, disable_bar=False)

Write representative nucleotide sequences of gene families

Parameters:
  • pangenome_filename – Name of the pangenome file with gene family sequences

  • output – Path to output directory

  • family_filter – Selected partition of gene families

  • soft_core – Soft core threshold to use

  • compress – Compress the file in .gz

  • disable_bar – Disable progress bar

ppanggolin.formats.readBinaries.write_fasta_prot_fam_from_pangenome_file(pangenome_filename: str, output: Path, family_filter: str, soft_core: float = 0.95, compress: bool = False, disable_bar=False)

Write representative amino acid sequences of gene families.

Parameters:
  • pangenome_filename – Name of the pangenome file with gene family sequences

  • output – Path to output directory

  • family_filter – Selected partition of protein families

  • soft_core – Soft core threshold to use

  • compress – Compress the file in .gz

  • disable_bar – Disable progress bar

ppanggolin.formats.readBinaries.write_gene_sequences_from_pangenome_file(pangenome_filename: str, output: Path, list_cds: Iterator | None = None, add: str = '', compress: bool = False, disable_bar: bool = False)

Writes the CDS sequences of the pangenome to the output file, optionally restricted to the CDS in list_cds, and prepends the string ‘add’ to each identifier if one is given. The sequences are loaded from a .h5 pangenome file.

Parameters:
  • pangenome_filename – Name of the pangenome file

  • output – Path to the sequences file

  • list_cds – An iterable object of CDS

  • add – Add a prefix to sequence header

  • compress – Compress the output file

  • disable_bar – Disable the progress bar

ppanggolin.formats.readBinaries.write_genes_from_pangenome_file(pangenome_filename: str, output: Path, gene_filter: str, soft_core: float = 0.95, compress: bool = False, disable_bar=False)

Write nucleotide sequences of the genes selected by gene_filter

Parameters:
  • pangenome_filename – Name of the pangenome file with gene sequences

  • output – Path to output directory

  • gene_filter – Selected partition of gene families

  • soft_core – Soft core threshold to use

  • compress – Compress the file in .gz

  • disable_bar – Disable progress bar

ppanggolin.formats.readBinaries.write_genes_seq_from_pangenome_file(h5f: File, outpath: Path, compress: bool, seq_id_to_genes: Dict[int, List[str]], disable_bar: bool)

Writes gene sequences from the pangenome file to an output file.

Only sequences whose IDs match the ones in seq_id_to_genes will be written.

Parameters:
  • h5f – The open HDF5 pangenome file containing sequence data.

  • outpath – The path to the output file where sequences will be written.

  • compress – Boolean flag to indicate whether output should be compressed.

  • seq_id_to_genes – A dictionary mapping sequence IDs to lists of gene names.

  • disable_bar – Boolean flag to disable the progress bar if set to True.
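The selection and optional compression described above can be sketched with the standard library. This assumes sequences arrive as (sequence ID, sequence) pairs and that output is FASTA; `write_selected_sequences` is a hypothetical name and the output format is an assumption of this sketch.

```python
import gzip
import tempfile
from pathlib import Path
from typing import Dict, Iterable, List, Tuple


def write_selected_sequences(
    seq_rows: Iterable[Tuple[int, str]],
    outpath: Path,
    compress: bool,
    seq_id_to_genes: Dict[int, List[str]],
) -> None:
    """Write one FASTA record per gene whose sequence ID was requested."""
    opener = gzip.open if compress else open
    with opener(outpath, "wt") as handle:
        for seq_id, dna in seq_rows:
            # Sequence IDs absent from the mapping are skipped.
            for gene in seq_id_to_genes.get(seq_id, []):
                handle.write(f">{gene}\n{dna}\n")


toy_sequences = [(0, "ATG"), (1, "TTT")]
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "genes.fna.gz"
    write_selected_sequences(toy_sequences, out, True, {0: ["g1", "g2"]})
    with gzip.open(out, "rt") as handle:
        written = handle.read()
```

Because genes sharing a sequence map to the same ID, one stored sequence can produce several records.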

ppanggolin.formats.writeAnnotations module

ppanggolin.formats.writeAnnotations.contig_desc(contig_len: int, org_len: int) Dict[str, NewCol]

Table description to save contig-related information

Parameters:
  • contig_len – Maximum size of contig name

  • org_len – Maximum size of organism name.

Returns:

Formatted table

ppanggolin.formats.writeAnnotations.gene_desc(id_len: int, max_local_id: int) Dict[str, NewCol]

Table description to save gene-related information

Parameters:
  • id_len – Maximum size of gene name

  • max_local_id – Maximum size of gene local identifier

Returns:

Formatted table

ppanggolin.formats.writeAnnotations.gene_joined_coordinates_desc() Dict[str, NewCol]

Table description to save the joined coordinates of genes

Returns:

Formatted table for joined gene coordinates

ppanggolin.formats.writeAnnotations.gene_sequences_desc(gene_id_len: int, gene_type_len: int) Dict[str, NewCol]

Create table to save gene sequences

Parameters:
  • gene_id_len – Maximum size of gene sequence identifier

  • gene_type_len – Maximum size of gene type

Returns:

Formatted table

ppanggolin.formats.writeAnnotations.genedata_desc(type_len: int, name_len: int, product_len: int) Dict[str, NewCol]

Creates a table for gene-related data

Parameters:
  • type_len – Maximum size of gene Type.

  • name_len – Maximum size of gene name

  • product_len – Maximum size of gene product

Returns:

Formatted table for gene metadata

ppanggolin.formats.writeAnnotations.get_gene_sequences_len(pangenome: Pangenome) Tuple[int, int]

Get the maximum size of gene sequences to optimize disk space

Parameters:

pangenome – Annotated pangenome

Returns:

Maximum size of each annotation

ppanggolin.formats.writeAnnotations.get_genedata(feature: Gene | RNA) Genedata

Gets the genedata type of Feature

Parameters:

feature – Gene or RNA object

Returns:

Genedata object holding the data associated with the feature

ppanggolin.formats.writeAnnotations.get_max_len_annotations(pangenome: Pangenome) Tuple[int, int, int, int, int]

Get the maximum size of each annotation information to optimize disk space

Parameters:

pangenome – Annotated pangenome

Returns:

Maximum size of each annotation
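Fixed-width string columns need a declared size, which is why the longest value is computed before writing. A minimal sketch of that computation, assuming names are plain strings; `max_len` and the toy names are hypothetical.

```python
from typing import Iterable


def max_len(values: Iterable[str]) -> int:
    """Length of the longest string, or 0 for an empty collection."""
    return max((len(value) for value in values), default=0)


toy_organisms = ["E.coli_K12", "S.enterica_LT2"]
toy_contigs = ["contig_1", "contig_10", "plasmid_A"]
org_len = max_len(toy_organisms)
contig_len = max_len(toy_contigs)
```

Sizing each column to the longest observed value avoids both truncation and wasted padding.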

ppanggolin.formats.writeAnnotations.get_max_len_genedata(pangenome: Pangenome) Tuple[int, int, int]

Get the maximum size of each gene data information to optimize disk space

Parameters:

pangenome – Annotated pangenome

Returns:

maximum size of each annotation

ppanggolin.formats.writeAnnotations.get_sequence_len(pangenome: Pangenome) int

Get the maximum size of gene sequences to optimize disk space

Parameters:

pangenome – Annotated pangenome

Returns:

Maximum size of each annotation

ppanggolin.formats.writeAnnotations.organism_desc(org_len: int) Dict[str, NewCol]

Table description to save organism-related information

Parameters:

org_len – Maximum size of organism name.

Returns:

Formatted table

ppanggolin.formats.writeAnnotations.rna_desc(id_len: int) Dict[str, NewCol]

Table description to save rna-related information

Parameters:

id_len – Maximum size of RNA identifier

Returns:

Formatted table

ppanggolin.formats.writeAnnotations.sequence_desc(max_seq_len: int) Dict[str, NewCol]

Table description to save sequences

Parameters:

max_seq_len – Maximum size of a sequence

Returns:

Formatted table

ppanggolin.formats.writeAnnotations.write_annotations(pangenome: Pangenome, h5f: File, rec_organisms: bool = True, rec_contigs: bool = True, rec_genes: bool = True, rec_rnas: bool = True, disable_bar: bool = False)

Function writing all the pangenome annotations

Parameters:
  • pangenome – Annotated pangenome

  • h5f – Pangenome HDF5 file

  • rec_organisms – Allow writing organisms in pangenomes

  • rec_contigs – Allow writing contigs in pangenomes

  • rec_genes – Allow writing genes in pangenomes

  • rec_rnas – Allow writing RNAs in pangenomes

  • disable_bar – Disable the progress bar

ppanggolin.formats.writeAnnotations.write_contigs(pangenome: Pangenome, h5f: File, annotation: Group, contig_desc: Dict[str, NewCol], disable_bar=False)

Write contigs information in the pangenome file

Parameters:
  • pangenome – Annotated pangenome object

  • h5f – Pangenome file

  • annotation – Annotation table group

  • contig_desc – Contigs table description

  • disable_bar – Allow disabling progress bar

ppanggolin.formats.writeAnnotations.write_gene_joined_coordinates(h5f, annotation, genes_with_joined_coordinates_2_id, disable_bar)

Writes the joined coordinates of genes in the pangenome file

Parameters:
  • h5f – Pangenome file

  • annotation – Annotation group in Table

  • genes_with_joined_coordinates_2_id – Dictionary linking genes with joined coordinates to their identifier.

  • disable_bar – Allow disabling progress bar

ppanggolin.formats.writeAnnotations.write_gene_sequences(pangenome: Pangenome, h5f: File, disable_bar: bool = False)

Function writing all the pangenome gene sequences

Parameters:
  • pangenome – Pangenome with gene sequences

  • h5f – Pangenome HDF5 file without sequences

  • disable_bar – Disable progress bar

ppanggolin.formats.writeAnnotations.write_genedata(pangenome: Pangenome, h5f: File, annotation: Group, genedata2gene: Dict[Genedata, int], disable_bar=False)

Writing genedata information in pangenome file

Parameters:
  • pangenome – Pangenome object filled with annotation.

  • h5f – Pangenome file

  • annotation – Annotation group in Table

  • genedata2gene – Dictionary linking genedata to gene identifier.

  • disable_bar – Allow disabling progress bar

ppanggolin.formats.writeAnnotations.write_genes(pangenome: Pangenome, h5f: File, annotation: Group, gene_desc: Dict[str, NewCol], disable_bar=False) Dict[Genedata, int]

Write genes information in the pangenome file

Parameters:
  • pangenome – Annotated pangenome object

  • h5f – Pangenome file

  • annotation – Annotation table group

  • gene_desc – Genes table description

  • disable_bar – Disable the progress bar

Returns:

Dictionary linking genedata to gene identifier

ppanggolin.formats.writeAnnotations.write_organisms(pangenome: Pangenome, h5f: File, annotation: Group, organism_desc: Dict[str, NewCol], disable_bar=False)

Write organisms information in the pangenome file

Parameters:
  • pangenome – Annotated pangenome object

  • h5f – Pangenome file

  • annotation – Annotation table group

  • organism_desc – Organisms table description.

  • disable_bar – Allow disabling progress bar

ppanggolin.formats.writeAnnotations.write_rnas(pangenome: Pangenome, h5f: File, annotation: Group, rna_desc: Dict[str, NewCol], disable_bar=False) Dict[Genedata, int]

Write RNAs information in the pangenome file

Parameters:
  • pangenome – Annotated pangenome object

  • h5f – Pangenome file

  • annotation – Annotation table group

  • rna_desc – RNAs table description

  • disable_bar – Disable the progress bar

Returns:

Dictionary linking genedata to RNA identifier

ppanggolin.formats.writeBinaries module

ppanggolin.formats.writeBinaries.erase_pangenome(pangenome: Pangenome, graph: bool = False, gene_families: bool = False, partition: bool = False, rgp: bool = False, spots: bool = False, modules: bool = False, metadata: bool = False, metatype: str | None = None, source: str | None = None)

Erases tables from a pangenome .h5 file

Parameters:
  • pangenome – Pangenome

  • graph – remove graph information

  • gene_families – remove gene families information

  • partition – remove partition information

  • rgp – remove rgp information

  • spots – remove spots information

  • modules – remove modules information

  • metadata – remove metadata information

  • metatype – object type whose metadata should be removed

  • source – source of the metadata to remove

ppanggolin.formats.writeBinaries.gene_fam_desc(max_name_len: int, max_sequence_length: int, max_part_len: int) dict

Create a formatted table for gene families description

Parameters:
  • max_name_len – Maximum size of gene family name

  • max_sequence_length – Maximum size of gene family representing gene sequences

  • max_part_len – Maximum size of gene family partition

Returns:

Formatted table

ppanggolin.formats.writeBinaries.gene_to_fam_desc(gene_fam_name_len: int, gene_id_len: int) dict

Create a formatted table for gene in gene families information

Parameters:
  • gene_fam_name_len – Maximum size of gene family names

  • gene_id_len – Maximum size of gene identifier

Returns:

formatted table

ppanggolin.formats.writeBinaries.get_gene_fam_len(pangenome: Pangenome) Tuple[int, int, int]

Get maximum size of gene families information

Parameters:

pangenome – Pangenome with gene families computed

Returns:

Maximum size of each element

ppanggolin.formats.writeBinaries.get_gene_id_len(pangenome: Pangenome) int

Get maximum size of gene id in pangenome graph

Parameters:

pangenome – Pangenome with graph computed

Returns:

Maximum size of gene id

ppanggolin.formats.writeBinaries.get_gene_to_fam_len(pangenome: Pangenome)

Get maximum size of gene in gene families information

Parameters:

pangenome – Pangenome with gene families computed

Returns:

Maximum size of each element

ppanggolin.formats.writeBinaries.get_mod_desc(pangenome: Pangenome) int

Get maximum size of gene families name in modules

Parameters:

pangenome – Pangenome with modules computed

Returns:

Maximum size of each element

ppanggolin.formats.writeBinaries.get_rgp_len(pangenome: Pangenome) Tuple[int, int]

Get maximum size of region of genomic plasticity and gene

Parameters:

pangenome – Pangenome with gene families computed

Returns:

Maximum size of each element

ppanggolin.formats.writeBinaries.get_spot_desc(pangenome: Pangenome) int

Get maximum size of region of genomic plasticity in hotspot

Parameters:

pangenome – Pangenome with gene families computed

Returns:

Maximum size of each element

ppanggolin.formats.writeBinaries.getmax(arg: iter) float

Get the maximum of the arguments if any exist, else 0

Parameters:

arg – list of values

Returns:

return the maximum

ppanggolin.formats.writeBinaries.getmean(arg: iter) float

Compute the mean of the arguments if any exist, else 0

Parameters:

arg – list of values

Returns:

return the mean

ppanggolin.formats.writeBinaries.getmin(arg: iter) float

Get the minimum of the arguments if any exist, else 0

Parameters:

arg – list of values

Returns:

return the minimum

ppanggolin.formats.writeBinaries.getstdev(arg: iter) float

Compute the standard deviation of the arguments if any exist, else 0

Parameters:

arg – list of values

Returns:

return the standard deviation
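The "value if any exist, else 0" behaviour of getmax, getmean, getmin and getstdev can be sketched with the standard library. The `safe_*` names are hypothetical, and returning 0 for a single value in the standard deviation case is an assumption of this sketch (a sample standard deviation needs at least two values).

```python
import statistics
from typing import Iterable


def safe_max(values: Iterable[float]) -> float:
    items = list(values)
    return max(items) if items else 0


def safe_min(values: Iterable[float]) -> float:
    items = list(values)
    return min(items) if items else 0


def safe_mean(values: Iterable[float]) -> float:
    items = list(values)
    return statistics.mean(items) if items else 0


def safe_stdev(values: Iterable[float]) -> float:
    items = list(values)
    # statistics.stdev raises StatisticsError with fewer than two values.
    return statistics.stdev(items) if len(items) > 1 else 0


empty_summary = (safe_max([]), safe_min([]), safe_mean([]), safe_stdev([]))
toy_summary = (safe_max([1, 3]), safe_min([1, 3]), safe_mean([1, 3]))
```

Falling back to 0 keeps summary tables well-formed for pangenomes in which a given element, such as modules or spots, has not been computed.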

ppanggolin.formats.writeBinaries.graph_desc(max_gene_id_len)

Create a formatted table for pangenome graph

Parameters:

max_gene_id_len – Maximum size of gene id

Returns:

formatted table

ppanggolin.formats.writeBinaries.mod_desc(gene_fam_name_len)

Create a formatted table for modules

Parameters:

gene_fam_name_len – Maximum size of gene families name

Returns:

formatted table

ppanggolin.formats.writeBinaries.rgp_desc(max_rgp_len, max_gene_len)

Create a formatted table for region of genomic plasticity

Parameters:
  • max_rgp_len – Maximum size of RGP

  • max_gene_len – Maximum size of gene

Returns:

formatted table

ppanggolin.formats.writeBinaries.spot_desc(max_rgp_len)

Create a formatted table for hotspot

Parameters:

max_rgp_len – Maximum size of RGP

Returns:

formatted table

ppanggolin.formats.writeBinaries.update_gene_fam_partition(pangenome: Pangenome, h5f: File, disable_bar: bool = False)

Update the gene families table with partition information

Parameters:
  • pangenome – Partitioned pangenome

  • h5f – HDF5 file with gene families

  • disable_bar – Disable the progress bar

ppanggolin.formats.writeBinaries.update_gene_fragments(pangenome: Pangenome, h5f: File, disable_bar: bool = False)

Updates the annotation table with the fragmentation information from the defrag pipeline

Parameters:
  • pangenome – Annotated pangenome

  • h5f – HDF5 pangenome file

  • disable_bar – Disable the progress bar

ppanggolin.formats.writeBinaries.write_gene_fam_info(pangenome: Pangenome, h5f: File, force: bool = False, disable_bar: bool = False)

Writing a table containing the protein sequences of each family

Parameters:
  • pangenome – Pangenome with gene families computed

  • h5f – HDF5 file to write gene families

  • force – Force writing even if previous information exists

  • disable_bar – Disable progress bar

ppanggolin.formats.writeBinaries.write_gene_families(pangenome: Pangenome, h5f: File, force: bool = False, disable_bar: bool = False)

Function writing all the pangenome gene families

Parameters:
  • pangenome – pangenome with gene families computed

  • h5f – HDF5 file to save pangenome with gene families

  • force – Force writing gene families to the HDF5 file even if gene families are already present

  • disable_bar – Disable progress bar

ppanggolin.formats.writeBinaries.write_graph(pangenome: Pangenome, h5f: File, force: bool = False, disable_bar: bool = False)

Function writing the pangenome graph

Parameters:
  • pangenome – pangenome with graph computed

  • h5f – HDF5 file to save pangenome graph

  • force – Force writing the graph to the HDF5 file even if one is already present

  • disable_bar – Disable progress bar

ppanggolin.formats.writeBinaries.write_info(pangenome: Pangenome, h5f: File)

Writes information and counts that can later be retrieved with the ‘info’ submodule

Parameters:
  • pangenome – Pangenome object with some information computed

  • h5f – Pangenome file to save information

ppanggolin.formats.writeBinaries.write_info_modules(pangenome: Pangenome, h5f: File)

Writes information about modules

Parameters:
  • pangenome – Pangenome object with some information computed

  • h5f – Pangenome file to save information

ppanggolin.formats.writeBinaries.write_modules(pangenome: Pangenome, h5f: File, force: bool = False, disable_bar: bool = False)

Function writing all the pangenome modules

Parameters:
  • pangenome – pangenome with modules computed

  • h5f – HDF5 file to save pangenome with modules

  • force – Force writing modules to the HDF5 file even if modules are already present

  • disable_bar – Disable progress bar

ppanggolin.formats.writeBinaries.write_pangenome(pangenome: Pangenome, filename, force: bool = False, disable_bar: bool = False)

Writes or updates a pangenome file

Parameters:
  • pangenome – pangenome object

  • filename – HDF5 file to save pangenome

  • force – Force writing to the pangenome file even if information already exists

  • disable_bar – Disable the progress bar

ppanggolin.formats.writeBinaries.write_rgp(pangenome: Pangenome, h5f: File, force: bool = False, disable_bar: bool = False)

Function writing all the region of genomic plasticity in pangenome

Parameters:
  • pangenome – pangenome with RGP computed

  • h5f – HDF5 file to save pangenome with RGP

  • force – Force writing RGP to the HDF5 file even if RGP are already present

  • disable_bar – Disable progress bar

ppanggolin.formats.writeBinaries.write_spots(pangenome: Pangenome, h5f: File, force: bool = False, disable_bar: bool = False)

Function writing all the pangenome hotspot

Parameters:
  • pangenome – pangenome with spot computed

  • h5f – HDF5 file to save pangenome with spot

  • force – Force writing spots to the HDF5 file even if spots are already present

  • disable_bar – Disable progress bar

ppanggolin.formats.writeBinaries.write_status(pangenome: Pangenome, h5f: File)

Write pangenome status in HDF5 file

Parameters:
  • pangenome – Pangenome object

  • h5f – Pangenome file

ppanggolin.formats.writeFlatGenomes module

ppanggolin.formats.writeFlatGenomes.convert_overlapping_coordinates_for_gff(coordinates: List[Tuple[int, int]], contig_length: int)

Converts overlapping gene coordinates in GFF format for circular contigs.

Parameters:
  • coordinates – List of tuples representing gene coordinates.

  • contig_length – Length of the circular contig.
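
To make the idea of overlapping coordinates on a circular contig concrete, here is a minimal, self-contained sketch. It is not PPanGGOLiN's implementation: the function name `unwrap_circular_coordinates` is hypothetical, and the behaviour (parts that wrap past the origin are shifted by the contig length, then contiguous parts are merged) is an assumption based on the GFF3 convention for circular features. It expects a non-empty list of 1-based, inclusive coordinates.

```python
from typing import List, Tuple


def unwrap_circular_coordinates(
    coordinates: List[Tuple[int, int]], contig_length: int
) -> List[Tuple[int, int]]:
    """Shift the parts that wrap past the contig end, then merge adjacent parts."""
    shifted: List[Tuple[int, int]] = []
    offset = 0
    previous_start = None
    for start, stop in coordinates:
        # A part starting before the previous one means the origin was crossed.
        if previous_start is not None and start < previous_start:
            offset += contig_length
        previous_start = start
        shifted.append((start + offset, stop + offset))

    merged = [shifted[0]]
    for start, stop in shifted[1:]:
        last_start, last_stop = merged[-1]
        if start == last_stop + 1:  # contiguous with the previous part
            merged[-1] = (last_start, stop)
        else:
            merged.append((start, stop))
    return merged


# A gene spanning the origin of a 5000 bp circular contig.
print(unwrap_circular_coordinates([(4990, 5000), (1, 20)], contig_length=5000))
```

In this sketch the end coordinate may exceed the contig length, which is how a feature crossing the origin of a circular landmark is usually represented in GFF3.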

ppanggolin.formats.writeFlatGenomes.count_neighbors_partitions(gene_family: GeneFamily)

Count the partitions of neighboring families.

Parameters:

gene_family – Gene family for which we count neighbors

ppanggolin.formats.writeFlatGenomes.encode_attribute_val(product: str) str

Encode special characters forbidden in column 9 of the GFF3 format.

Parameters:

product – The input string to encode.

Returns:

The encoded string with special characters replaced.

Reference:

  • GFF3 format requirement: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

  • Code source taken from Bakta: https://github.com/oschwengers/bakta

ppanggolin.formats.writeFlatGenomes.encode_attributes(attributes: List[Tuple]) str

Encode a list of attributes in GFF3 format.

Parameters:

attributes – A list of attribute key-value pairs represented as tuples.

Returns:

The encoded attributes as a semicolon-separated string.
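
The two encoding steps described above (escaping reserved characters in a value, then joining key-value pairs) can be sketched as follows. This is an illustrative approximation rather than the library's code: the names `escape_gff3_value`, `join_gff3_attributes` and `GFF3_ESCAPES` are hypothetical, and the set of escaped characters is the minimal one the GFF3 specification reserves in column 9.

```python
from typing import List, Tuple

# Characters with a reserved meaning in GFF3 column 9 and their percent-encoded form.
# "%" comes first so that escapes inserted later are not encoded a second time.
GFF3_ESCAPES = [("%", "%25"), (";", "%3B"), ("=", "%3D"), ("&", "%26"), (",", "%2C")]


def escape_gff3_value(value: str) -> str:
    """Percent-encode the characters reserved in a GFF3 attribute value."""
    for char, escaped in GFF3_ESCAPES:
        value = value.replace(char, escaped)
    return value


def join_gff3_attributes(attributes: List[Tuple[str, object]]) -> str:
    """Build the column 9 string: key=value pairs separated by semicolons."""
    return ";".join(f"{key}={escape_gff3_value(str(value))}" for key, value in attributes)


print(join_gff3_attributes([("ID", "gene_1"), ("product", "kinase, putative")]))
```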

ppanggolin.formats.writeFlatGenomes.get_organism_list(organisms_filt: str, pangenome: Pangenome) Set[Organism]

Get a list of organisms to include in the output.

Parameters:
  • organisms_filt – Filter for selecting organisms. It can be a file path with one organism name per line or a comma-separated list of organism names.

  • pangenome – The pangenome from which organisms will be selected.

Returns:

A set of selected Organism objects.
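
The filter accepts either a file path or a comma-separated list. A minimal sketch of that dual parsing is shown below; the helper `parse_name_filter` is hypothetical and only returns names, whereas the documented function resolves them to Organism objects of the pangenome.

```python
from pathlib import Path
from typing import Set


def parse_name_filter(organisms_filt: str) -> Set[str]:
    """Return organism names from a file (one per line) or a comma-separated string."""
    path = Path(organisms_filt)
    if path.is_file():
        # One organism name per line; blank lines are ignored.
        with path.open() as handle:
            return {line.strip() for line in handle if line.strip()}
    # Otherwise treat the string as a comma-separated list of names.
    return {name.strip() for name in organisms_filt.split(",") if name.strip()}


print(sorted(parse_name_filter("orgA, orgB,orgC")))
```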

ppanggolin.formats.writeFlatGenomes.launch(args: Namespace)

Command launcher

Parameters:

args – All arguments provided by the user

ppanggolin.formats.writeFlatGenomes.manage_module_colors(modules: Set[Module], window_size: int = 100) Dict[Module, str]

Manages colors for a list of modules based on gene positions and a specified window size.

Parameters:
  • modules – A set of module objects for which you want to determine colors.

  • window_size – Minimum number of genes between two modules to color them with the same color. A higher value results in more module colors.

Returns:

A dictionary that maps each module to its assigned color.

ppanggolin.formats.writeFlatGenomes.mp_write_genomes_file(organism: Organism, output: Path, genome_file: Path | None = None, proksee: bool = False, gff: bool = False, table: bool = False, **kwargs) str

Wrapper for the write_genomes_file function that allows it to be used in multiprocessing.

Parameters:
  • organism – Specify the organism to be written

  • output – Specify the path to the output directory

  • genome_file – Read the genome sequences from a file

  • proksee – Write a proksee file for the organism

  • gff – Write the gff file for the organism

  • table – Write the table file for the organism

  • kwargs – Pass any number of keyword arguments to the function

Returns:

The organism name

ppanggolin.formats.writeFlatGenomes.palette(nb_colors: int) List[str]

Generates a palette of colors for visual representation.

Parameters:

nb_colors – The number of colors needed in the palette.

Returns:

A list of color codes in hexadecimal format.
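
One simple way to obtain a requested number of distinct hexadecimal colors is to space hues evenly around the color wheel. The sketch below illustrates that approach with the standard `colorsys` module; it is an alternative technique chosen for illustration, not a description of how `palette` itself picks colors, and `evenly_spaced_palette` is a hypothetical name.

```python
import colorsys
from typing import List


def evenly_spaced_palette(nb_colors: int) -> List[str]:
    """Return nb_colors hexadecimal color codes with evenly spaced hues."""
    colors = []
    for index in range(nb_colors):
        hue = index / nb_colors  # position on the color wheel, in [0, 1)
        red, green, blue = colorsys.hsv_to_rgb(hue, 0.75, 0.9)
        colors.append(
            "#{:02x}{:02x}{:02x}".format(round(red * 255), round(green * 255), round(blue * 255))
        )
    return colors


print(evenly_spaced_palette(5))
```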

ppanggolin.formats.writeFlatGenomes.parser_flat(parser: ArgumentParser)

Parser for specific argument of write command

Parameters:

parser – Argument parser to which the arguments of this command are added

ppanggolin.formats.writeFlatGenomes.subparser(sub_parser: _SubParsersAction) ArgumentParser

Subparser to launch PPanGGOLiN in Command line

Parameters:

sub_parser – Sub-parser action to which this command is added

Returns:

Argument parser for this command

ppanggolin.formats.writeFlatGenomes.write_flat_genome_files(pangenome: Pangenome, output: Path, table: bool = False, gff: bool = False, proksee: bool = False, compress: bool = False, fasta: Path | None = None, anno: Path | None = None, organisms_filt: str = 'all', add_metadata: bool = False, metadata_sep: str = '|', metadata_sources: List[str] | None = None, cpu: int = 1, disable_bar: bool = False)

Main function to write flat files from pangenome

Parameters:
  • pangenome – Pangenome object

  • output – Path to output directory

  • cpu – Number of available cores

  • table – write table with pangenome annotation for each genome

  • gff – write a gff file with pangenome annotation for each organism

  • proksee – write a proksee file with pangenome annotation for each organism

  • compress – Compress the file in .gz

  • disable_bar – Disable progress bar

  • fasta – File containing the list of FASTA files for each organism

  • anno – File containing the list of GBFF/GFF files for each organism

  • organisms_filt – String used to specify which organisms to write. If 'all', all organisms are written.

  • add_metadata – Add metadata to GFF files

  • metadata_sep – The separator used to join multiple metadata values

  • metadata_sources – Sources of the metadata to use and write in the outputs. None means all sources are used.

ppanggolin.formats.writeFlatGenomes.write_gff_file(organism: Organism, outdir: Path, annotation_sources: Dict[str, str], genome_sequences: Dict[str, str], metadata_sep: str = '|', compress: bool = False)

Write the GFF file of the provided organism.

Parameters:
  • organism – Organism object for which the GFF file is being written.

  • outdir – Path to the output directory where the GFF file will be written.

  • metadata_sep – The separator used to join multiple metadata values

  • compress – If True, compress the output GFF file using .gz format.

  • annotation_sources – A dictionary that maps types of features to their source information.

  • genome_sequences – A dictionary mapping contig names to their DNA sequences.

ppanggolin.formats.writeFlatGenomes.write_tsv_genome_file(organism: Organism, output: Path, compress: bool = False, metadata_sep: str = '|', need_regions: bool = False, need_spots: bool = False, need_modules: bool = False)

Write the table of genes with pangenome annotation for one organism in tsv

Parameters:
  • organism – An organism

  • output – Path to output directory

  • compress – Compress the file in .gz

  • need_regions – Write information about regions

  • need_spots – Write information about spots

  • need_modules – Write information about modules
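
Many writers in this package take an output directory and a `compress` flag. The following sketch shows one common way to write a TSV table that is optionally gzip-compressed. The helper `write_tsv` and its `name` and `fieldnames` parameters are assumptions made for the example, not part of PPanGGOLiN.

```python
import csv
import gzip
import tempfile
from pathlib import Path
from typing import Dict, Iterable, List


def write_tsv(
    rows: Iterable[Dict[str, object]],
    fieldnames: List[str],
    output: Path,
    name: str,
    compress: bool = False,
) -> Path:
    """Write rows as a tab-separated table in `output`, gzip-compressed if requested."""
    outfile = output / (f"{name}.tsv.gz" if compress else f"{name}.tsv")
    opener = gzip.open if compress else open  # both accept text mode "wt"
    with opener(outfile, "wt", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)
    return outfile


# Usage: write a compressed table into a throwaway directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = write_tsv(
        [{"gene": "g1", "family": "famA"}], ["gene", "family"], Path(tmp), "genomeA", compress=True
    )
    with gzip.open(path, "rt") as handle:
        print(handle.read())
```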

ppanggolin.formats.writeFlatPangenome module

ppanggolin.formats.writeFlatPangenome.launch(args: Namespace)

Command launcher

Parameters:

args – All arguments provided by the user

ppanggolin.formats.writeFlatPangenome.parser_flat(parser: ArgumentParser)

Parser for specific argument of write command

Parameters:

parser – Argument parser to which the arguments of this command are added

ppanggolin.formats.writeFlatPangenome.spot2rgp(spots: set, output: Path, compress: bool = False)

Write a tsv file providing association between spot and rgp

Parameters:
  • spots – set of spots in pangenome

  • output – Path to output directory

  • compress – Compress the file in .gz

ppanggolin.formats.writeFlatPangenome.subparser(sub_parser: _SubParsersAction) ArgumentParser

Subparser to launch PPanGGOLiN in Command line

Parameters:

sub_parser – Sub-parser action to which this command is added

Returns:

Argument parser for this command

ppanggolin.formats.writeFlatPangenome.summarize_genome(organism: Organism, pangenome_persistent_count: int, pangenome_persistent_single_copy_families: Set[GeneFamily], soft_core_families: Set[GeneFamily], exact_core_families: Set[GeneFamily], rgp_count: int, spot_count: int, module_count: int) Dict[str, any]

Summarizes genomic information of an organism.

Parameters:
  • organism – The organism for which the genome is being summarized.

  • pangenome_persistent_count – Count of persistent genes in the pangenome.

  • pangenome_persistent_single_copy_families – Set of gene families considered as persistent single-copy in the pangenome.

  • soft_core_families – soft core families of the pangenome

  • exact_core_families – exact core families of the pangenome

  • rgp_count – Number of regions of genomic plasticity in the organism. None if not computed.

  • spot_count – Number of spots in the organism. None if not computed.

  • module_count – Number of modules in the organism. None if not computed.

Returns:

A dictionary containing various summary information about the genome.

ppanggolin.formats.writeFlatPangenome.summarize_spots(spots: set, output: Path, compress: bool = False, file_name='summarize_spots.tsv')

Write a file providing summary information about hotspots

Parameters:
  • spots – set of spots in pangenome

  • output – Path to output directory

  • compress – Compress the file in .gz

  • file_name – Name of the output file

ppanggolin.formats.writeFlatPangenome.write_borders(output: Path, dup_margin: float = 0.05, compress: bool = False)

Write all gene families bordering each spot

Parameters:
  • output – Path to output directory

  • compress – Compress the file in .gz

  • dup_margin – minimum ratio of organisms in which family must have multiple genes to be considered duplicated
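
The `dup_margin` threshold can be illustrated with a short sketch. It assumes that the ratio is taken over the organisms in which the family is present and that a family counts as duplicated when the ratio reaches the threshold; both points, as well as the name `is_duplicated`, are assumptions for illustration and may differ from the library's exact rule.

```python
from typing import Dict


def is_duplicated(gene_counts: Dict[str, int], dup_margin: float = 0.05) -> bool:
    """Tell whether a family is duplicated.

    gene_counts maps an organism name to the number of genes of the family it carries.
    """
    if not gene_counts:
        return False
    # Organisms carrying more than one gene of the family.
    multi_copy = sum(1 for count in gene_counts.values() if count > 1)
    return multi_copy / len(gene_counts) >= dup_margin


print(is_duplicated({"orgA": 1, "orgB": 2, "orgC": 1}))
```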

ppanggolin.formats.writeFlatPangenome.write_gene_families_tsv(output: Path, compress: bool = False, disable_bar: bool = False)

Write the file providing the association between genes and gene families

Parameters:
  • output – Path to output directory

  • compress – Compress the file in .gz

  • disable_bar – Flag to disable progress bar

ppanggolin.formats.writeFlatPangenome.write_gene_presence_absence(output: Path, compress: bool = False)

Write the gene presence absence matrix

Parameters:
  • output – Path to output directory

  • compress – Compress the file in .gz
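
A presence/absence matrix has one row per gene family and one column per genome. The sketch below builds such rows from a plain dictionary; the input structure, the "Gene" header label and the function name `presence_absence_rows` are assumptions for the example and do not reflect the exact file layout written by PPanGGOLiN.

```python
from typing import Dict, List, Set


def presence_absence_rows(
    family_to_genomes: Dict[str, Set[str]], genomes: List[str]
) -> List[List[str]]:
    """Return a header row followed by one 0/1 row per gene family."""
    rows = [["Gene"] + genomes]
    for family in sorted(family_to_genomes):
        present = family_to_genomes[family]
        rows.append([family] + ["1" if genome in present else "0" for genome in genomes])
    return rows


matrix = presence_absence_rows(
    {"famB": {"orgA"}, "famA": {"orgA", "orgB"}}, genomes=["orgA", "orgB"]
)
for row in matrix:
    print("\t".join(row))
```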

ppanggolin.formats.writeFlatPangenome.write_gexf(output: Path, light: bool = True, compress: bool = False)

Write the pangenome graph in a gexf file

Parameters:
  • output – Path to output directory

  • light – save the light version of the pangenome graph

  • compress – Compress the file in .gz

ppanggolin.formats.writeFlatPangenome.write_gexf_edges(gexf: TextIO, light: bool = True)

Write the edge of pangenome graph in gexf file

Parameters:
  • gexf – file-like object, compressed or not

  • light – save the light version of the pangenome graph

ppanggolin.formats.writeFlatPangenome.write_gexf_end(gexf: TextIO)

Write the end of gexf file to save pangenome

Parameters:

gexf – file-like object, compressed or not

ppanggolin.formats.writeFlatPangenome.write_gexf_header(gexf: TextIO, light: bool = True)

Write the header of gexf file to save graph

Parameters:
  • gexf – file-like object, compressed or not

  • light – save the light version of the pangenome graph

ppanggolin.formats.writeFlatPangenome.write_gexf_nodes(gexf: TextIO, light: bool = True, soft_core: False = 0.95)

Write the node of pangenome graph in gexf file

Parameters:
  • gexf – file-like object, compressed or not

  • light – save the light version of the pangenome graph

  • soft_core – Soft core threshold to use

ppanggolin.formats.writeFlatPangenome.write_json(output: Path, compress: bool = False)

Writes the graph in a json file format

Parameters:
  • output – Path to output directory

  • compress – Compress the file in .gz

ppanggolin.formats.writeFlatPangenome.write_json_edge(edge: Edge, json: TextIO)

Write the edge graph in json file

Parameters:
  • edge – Edge of the pangenome graph to write

  • json – file-like object, compressed or not

ppanggolin.formats.writeFlatPangenome.write_json_edges(json)

Write the edge graph in json file

Parameters:

json – file-like object, compressed or not

ppanggolin.formats.writeFlatPangenome.write_json_gene_fam(gene_fam: GeneFamily, json: TextIO)

Write the gene families corresponding to node graph in json file

Parameters:
  • gene_fam – Gene family to write

  • json – file-like object, compressed or not

ppanggolin.formats.writeFlatPangenome.write_json_header(json: TextIO)

Write the header of json file to save graph

Parameters:

json – file-like object, compressed or not

ppanggolin.formats.writeFlatPangenome.write_json_nodes(json: TextIO)

Write the node graph in json file

Parameters:

json – file-like object, compressed or not

ppanggolin.formats.writeFlatPangenome.write_matrix(output: Path, sep: str = ',', ext: str = 'csv', compress: bool = False, gene_names: bool = False)

Write a csv file format as used by Roary, among others. The alternative gene ID will be the partition, if there is one

Parameters:
  • sep – Column field separator

  • ext – file extension

  • output – Path to output directory

  • compress – Compress the file in .gz

  • gene_names – write the gene names if they are saved in the pangenome

ppanggolin.formats.writeFlatPangenome.write_module_summary(output: Path, compress: bool = False)

Write a file providing summary information about modules

Parameters:
  • output – Path to output directory

  • compress – Compress the file in .gz

ppanggolin.formats.writeFlatPangenome.write_modules(output: Path, compress: bool = False)

Write a tsv file providing association between modules and gene families

Parameters:
  • output – Path to output directory

  • compress – Compress the file in .gz

ppanggolin.formats.writeFlatPangenome.write_org_modules(output: Path, compress: bool = False)

Write a tsv file providing association between modules and organisms

Parameters:
  • output – Path to output directory

  • compress – Compress the file in .gz

ppanggolin.formats.writeFlatPangenome.write_pangenome_flat_files(pangenome: Pangenome, output: Path, cpu: int = 1, soft_core: float = 0.95, dup_margin: float = 0.05, csv: bool = False, gene_pa: bool = False, gexf: bool = False, light_gexf: bool = False, stats: bool = False, json: bool = False, partitions: bool = False, families_tsv: bool = False, regions: bool = False, regions_families: bool = False, spots: bool = False, borders: bool = False, modules: bool = False, spot_modules: bool = False, compress: bool = False, disable_bar: bool = False)

Main function to write flat files from pangenome

Parameters:
  • pangenome – Pangenome object

  • output – Path to output directory

  • cpu – Number of available cores

  • soft_core – Soft core threshold to use

  • dup_margin – minimum ratio of organisms in which family must have multiple genes to be considered duplicated

  • csv – write csv file format as used by Roary

  • gene_pa – write gene presence absence matrix

  • gexf – write pangenome graph in gexf format

  • light_gexf – write pangenome graph with only gene families

  • stats – write statistics about pangenome

  • json – write pangenome graph in json file

  • partitions – write the gene families for each partition

  • families_tsv – write gene families information

  • regions – write RGP information

  • regions_families – write the association between RGP and gene families

  • spots – write information on spots

  • borders – write gene families bordering spots

  • modules – write information about modules

  • spot_modules – write association between modules and RGP and modules and spots

  • compress – Compress the file in .gz

  • disable_bar – Disable progress bar

ppanggolin.formats.writeFlatPangenome.write_partitions(output: Path, soft_core: float = 0.95)

Write the list of gene families for each partition

Parameters:
  • output – Path to output directory

  • soft_core – Soft core threshold to use
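
The soft core threshold can be read as a minimum fraction of genomes in which a family must be present. The sketch below applies that reading to a plain dictionary; the data structure, the function name `soft_core_families` and the use of "greater than or equal" are assumptions for illustration.

```python
from typing import Dict, Set


def soft_core_families(
    family_to_genomes: Dict[str, Set[str]], nb_genomes: int, soft_core: float = 0.95
) -> Set[str]:
    """Return the families present in at least `soft_core` of the genomes."""
    return {
        family
        for family, genomes in family_to_genomes.items()
        if len(genomes) / nb_genomes >= soft_core
    }


genomes = [f"org{i}" for i in range(20)]
families = {"famA": set(genomes), "famB": set(genomes[:18])}
print(soft_core_families(families, nb_genomes=len(genomes)))
```

With a threshold of 1.0 the same function returns the exact core, that is, the families present in every genome.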

ppanggolin.formats.writeFlatPangenome.write_persistent_duplication_statistics(pangenome: Pangenome, output: Path, dup_margin: float, compress: bool) Set[GeneFamily]

Writes statistics on persistent duplications in gene families to a specified output file.

Parameters:
  • pangenome – The Pangenome object containing gene families.

  • output – The Path specifying the output file location.

  • dup_margin – The duplication margin used for determining single copy markers.

  • compress – A boolean indicating whether to compress the output file.

Returns:

A set of GeneFamily objects

ppanggolin.formats.writeFlatPangenome.write_regions(output: Path, compress: bool = False)

Write the file providing information about RGP content

Parameters:
  • output – Path to output directory

  • compress – Compress the file in .gz

ppanggolin.formats.writeFlatPangenome.write_regions_families(output: Path, compress: bool = False)

Write the file providing the association between regions of genomic plasticity and gene families.

Parameters:
  • output – Path to output directory

  • compress – Compress the file in .gz

ppanggolin.formats.writeFlatPangenome.write_rgp_modules(output: Path, compress: bool = False)

Write a tsv file providing association between modules and RGP

Parameters:
  • output – Path to output directory

  • compress – Compress the file in .gz

ppanggolin.formats.writeFlatPangenome.write_rgp_table(regions: Set[Region], output: Path, compress: bool = False)

Write the file providing information about regions of genomic plasticity.

Parameters:
  • regions – Set of Region objects representing regions.

  • output – Path to the output directory.

  • compress – Whether to compress the file in .gz format.

ppanggolin.formats.writeFlatPangenome.write_spot_modules(output: Path, compress: bool = False)

Write a tsv file providing association between modules and spots

Parameters:
  • output – Path to output directory

  • compress – Compress the file in .gz

ppanggolin.formats.writeFlatPangenome.write_spots(output: Path, compress: bool = False)

Write tsv files providing spots information and association with RGP

Parameters:
  • output – Path to output directory

  • compress – Compress the file in .gz

ppanggolin.formats.writeFlatPangenome.write_stats(output: Path, soft_core: float = 0.95, dup_margin: float = 0.05, compress: bool = False)

Write pangenome statistics for each genome

Parameters:
  • output – Path to output directory

  • soft_core – Soft core threshold to use

  • dup_margin – minimum ratio of organisms in which family must have multiple genes to be considered duplicated

  • compress – Compress the file in .gz

ppanggolin.formats.writeFlatPangenome.write_summaries_in_tsv(summaries: List[Dict[str, Any]], output_file: Path, dup_margin: float, soft_core: float, compress: bool = False)

Writes summaries of organisms stored in a dictionary into a Tab-Separated Values (TSV) file.

Parameters:
  • summaries – A list containing organism summaries.

  • output_file – The Path specifying the output TSV file location.

  • soft_core – Soft core threshold used

  • dup_margin – minimum ratio of organisms in which family must have multiple genes to be considered duplicated

  • compress – Compress the file in .gz

ppanggolin.formats.writeMSA module

ppanggolin.formats.writeMSA.compute_msa(families: Set[GeneFamily], output: Path, tmpdir: Path, cpu: int = 1, source: str = 'protein', use_gene_id: bool = False, code: str = '11', disable_bar: bool = False)

Compute MSA between pangenome gene families

Parameters:
  • families – Set of families specific to given partition

  • output – output directory name for families alignment

  • cpu – number of available cores

  • tmpdir – path to temporary directory

  • source – indicates whether to use protein or dna sequences to compute the msa

  • use_gene_id – Use gene identifiers rather than organism names for sequences in the family MSA

  • code – Genetic code to use

  • disable_bar – Disable progress bar

ppanggolin.formats.writeMSA.get_families_to_write(pangenome: Pangenome, partition_filter: str = 'core', soft_core: float = 0.95, dup_margin: float = 0.95, single_copy: bool = True) Set[GeneFamily]

Get families corresponding to the given partition

Parameters:
  • pangenome – Partitioned pangenome

  • partition_filter – choice of partition to compute Multiple Sequence Alignment of the gene families

  • soft_core – Soft core threshold to use

  • dup_margin – maximal number of genomes in which the gene family can have multiple members and still be considered a ‘single copy’ gene family

  • single_copy – Use “single copy” (defined by dup_margin) gene families only

Returns:

set of families unique to one partition

ppanggolin.formats.writeMSA.launch(args: Namespace)

Command launcher

Parameters:

args – All arguments provided by the user

ppanggolin.formats.writeMSA.launch_mafft(fname: Path, output: Path, fam_name: str)

Compute the MSA with mafft

Parameters:
  • fname – family gene sequence in fasta

  • output – directory to save alignment

  • fam_name – Name of the gene family

ppanggolin.formats.writeMSA.launch_multi_mafft(args: List[Tuple[Path, Path, str]])

Launch mafft in multiprocessing

Parameters:

args – Packed arguments for launch_mafft

Returns:

Organism object for pangenome

ppanggolin.formats.writeMSA.parser_msa(parser: ArgumentParser)

Parser for specific argument of msa command

Parameters:

parser – Argument parser to which the arguments of this command are added

ppanggolin.formats.writeMSA.subparser(sub_parser: _SubParsersAction) ArgumentParser

Subparser to launch PPanGGOLiN in Command line

Parameters:

sub_parser – Sub-parser action to which this command is added

Returns:

Argument parser for this command

ppanggolin.formats.writeMSA.translate(gene: Gene, code: Dict[str, Dict[str, str]]) Tuple[str, bool]

Translates the given DNA sequence with the given translation table

Parameters:
  • gene – given gene

  • code – translation table corresponding to genetic code to use

Returns:

protein sequence
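
As an illustration of codon-by-codon translation, the sketch below builds the codon table shared by NCBI genetic codes 1 and 11 and translates a DNA string. It is a simplified stand-in: it takes a plain string rather than a Gene, ignores alternative start codons, and returns only the protein, whereas the documented function also returns a boolean. The names `CODON_TABLE` and `translate_dna` are hypothetical.

```python
from itertools import product

# NCBI ordering: bases T, C, A, G for each codon position, with the amino acid
# string shared by translation tables 1 and 11 ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_dna(dna: str) -> str:
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    dna = dna.upper()
    usable_length = len(dna) - len(dna) % 3  # drop a trailing incomplete codon
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable_length, 3)
    )


print(translate_dna("ATGGCCTAA"))
```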

ppanggolin.formats.writeMSA.write_fasta_families(family: GeneFamily, tmpdir: TemporaryDirectory, code_table: Dict[str, Dict[str, str]], source: str = 'protein', use_gene_id: bool = False) Tuple[Path, bool]

Write fasta files for each gene family

Parameters:
  • family – gene family to write

  • tmpdir – path to temporary directory

  • source – indicates whether to use protein or dna sequences to compute the msa

  • use_gene_id – Use gene identifiers rather than organism names for sequences in the family MSA

  • code_table – Genetic code to use

Returns:

path to fasta file

ppanggolin.formats.writeMSA.write_msa_files(pangenome: Pangenome, output: Path, cpu: int = 1, partition: str = 'core', tmpdir: Path | None = None, source: str = 'protein', soft_core: float = 0.95, phylo: bool = False, use_gene_id: bool = False, translation_table: str = '11', dup_margin: float = 0.95, single_copy: bool = True, force: bool = False, disable_bar: bool = False)

Main function to write MSA files

Parameters:
  • pangenome – Pangenome object with partition

  • output – Path to output directory

  • cpu – number of available cores

  • partition – choice of partition to compute Multiple Sequence Alignment of the gene families

  • tmpdir – path to temporary directory

  • source – indicates whether to use protein or dna sequences to compute the msa

  • soft_core – Soft core threshold to use

  • phylo – Writes a whole genome msa file for additional phylogenetic analysis

  • use_gene_id – Use gene identifiers rather than organism names for sequences in the family MSA

  • translation_table – Translation table (genetic code) to use.

  • dup_margin – maximal number of genomes in which the gene family can have multiple members and still be considered a ‘single copy’ gene family

  • single_copy – Use “single copy” (defined by dup_margin) gene families only

  • force – force to write in the directory

  • disable_bar – Disable progress bar

ppanggolin.formats.writeMSA.write_whole_genome_msa(pangenome: Pangenome, families: set, phylo_name: Path, outdir: Path, use_gene_id: bool = False)

Writes a whole genome msa file for additional phylogenetic analysis

Parameters:
  • pangenome – Pangenome object

  • families – Set of families specific to given partition

  • phylo_name – output file name for phylo alignment

  • outdir – output directory name for families alignment

  • use_gene_id – Use gene identifiers rather than organism names for sequences in the family MSA

ppanggolin.formats.writeMetadata module

ppanggolin.formats.writeMetadata.desc_metadata(max_len_dict: Dict[str, int], type_dict: Dict[str, Col]) dict

Create a formatted table for metadata description

Returns:

Formatted table

ppanggolin.formats.writeMetadata.erase_metadata(pangenome: Pangenome, h5f: File, status_group: Group, metatype: str | None = None, source: str | None = None)

Erase metadata in pangenome

Parameters:
  • pangenome – Pangenome with metadata to erase

  • h5f – HDF5 file with pangenome metadata

  • status_group – pangenome status in HDF5

  • metatype – select to which pangenome element metadata should be erased

  • source – name of the metadata source

ppanggolin.formats.writeMetadata.get_metadata_contig_len(select_ctg: List[Contig], source: str) Tuple[Dict[str, int], Dict[str, Col], int]

Get maximum size of contig metadata information

Parameters:
  • select_ctg – selected elements from source

  • source – Name of the metadata source

Returns:

Maximum type and size of each element

ppanggolin.formats.writeMetadata.get_metadata_len(select_elem: List[Gene] | List[Organism] | List[GeneFamily] | List[Region] | List[Spot] | List[Module], source: str) Tuple[Dict[str, int], Dict[str, Col], int]

Get maximum size of metadata information

Parameters:
  • select_elem – selected elements from source

  • source – Name of the metadata source

Returns:

Maximum type and size of each element
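
Fixed-width table columns need the longest value of each field to be known before writing. The sketch below computes those maxima for a list of plain dictionaries; it is a simplified illustration that ignores column types, and `max_field_lengths` and the record layout are assumptions for the example.

```python
from typing import Dict, List


def max_field_lengths(records: List[Dict[str, object]]) -> Dict[str, int]:
    """Return, for each field, the length of its longest value once converted to text."""
    max_len: Dict[str, int] = {}
    for record in records:
        for field, value in record.items():
            max_len[field] = max(max_len.get(field, 0), len(str(value)))
    return max_len


records = [
    {"name": "abc", "score": 1.5},
    {"name": "abcde", "score": 12.5},
]
print(max_field_lengths(records))
```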

ppanggolin.formats.writeMetadata.write_metadata(pangenome: Pangenome, h5f: File, disable_bar: bool = False)

Write metadata in pangenome

Parameters:
  • pangenome – Pangenome in which metadata should be written

  • h5f – HDF5 file with pangenome

  • disable_bar – Disable progress bar

ppanggolin.formats.writeMetadata.write_metadata_contig(h5f: File, source: str, select_contigs: List[Contig], disable_bar: bool = False)

Writing a table containing the metadata associated to contig

Parameters:
  • h5f – HDF5 file to write metadata

  • source – name of the metadata source

  • select_contigs – List of contigs with metadata

  • disable_bar – Disable progress bar

ppanggolin.formats.writeMetadata.write_metadata_group(h5f: File, metatype: str) Group

Check and write the group in HDF5 file to organize metadata

Parameters:
  • h5f – HDF5 file with pangenome

  • metatype – select to which pangenome element metadata should be written

Returns:

Metadata group of the corresponding metatype

ppanggolin.formats.writeMetadata.write_metadata_metatype(h5f: File, source: str, metatype: str, select_elements: List[Gene] | List[Organism] | List[GeneFamily] | List[Region] | List[Spot] | List[Module], disable_bar: bool = False)

Writing a table containing the metadata associated to element from the metatype

Parameters:
  • h5f – HDF5 file to write metadata

  • source – name of the metadata source

  • metatype – select to which pangenome element metadata should be written

  • select_elements – Elements selected to write metadata

  • disable_bar – Disable progress bar

ppanggolin.formats.writeMetadata.write_metadata_status(pangenome: Pangenome, h5f: File, status_group: Group) bool

Write status of metadata in pangenome file

Parameters:
  • pangenome – pangenome with metadata

  • h5f – HDF5 file with pangenome

  • status_group – Pangenome status information group

Returns:

ppanggolin.formats.writeSequences module

ppanggolin.formats.writeSequences.check_write_sequences_args(args: Namespace) None

Check arguments compatibility in CLI

Parameters:

args – argparse namespace arguments from CLI

Raises:

argparse.ArgumentTypeError – if region is given but neither fasta nor anno is given
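
The compatibility rule stated above can be sketched as a small validation function. The attribute names `regions`, `fasta` and `anno` on the namespace are assumptions made for this example, as is the function name `check_region_inputs`; the real CLI may name or combine its options differently.

```python
import argparse


def check_region_inputs(args: argparse.Namespace) -> None:
    """Raise if region sequences are requested without any genome sequence source."""
    if args.regions is not None and args.fasta is None and args.anno is None:
        raise argparse.ArgumentTypeError(
            "Writing region sequences requires either a fasta or an annotation input."
        )


# Passes silently: an annotation file list is supplied alongside the region request.
check_region_inputs(argparse.Namespace(regions="all", fasta=None, anno="genomes.list"))
```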

ppanggolin.formats.writeSequences.create_mmseqs_db(sequences: Iterable[Path], db_name: str, tmpdir: Path, db_mode: int = 0, db_type: int = 0) Path

Create a MMseqs2 database from a sequences file.

Parameters:
  • sequences – File with the sequences

  • db_name – name of the database

  • tmpdir – Temporary directory to save the MMSeqs2 files

  • db_mode – Createdb mode 0: copy data, 1: soft link data and write new index (works only with single line fasta/q)

  • db_type – Database type 0: auto, 1: amino acid, 2: nucleotides

Returns:

Path to the MMSeqs2 database

ppanggolin.formats.writeSequences.filter_values(arg_value: str)

Check filter value to ensure they are in the expected format.

Parameters:

arg_value – Argument value that is being tested.

Returns:

The same argument if it is valid.

Raises:

argparse.ArgumentTypeError – If the argument value is not in the expected format.

ppanggolin.formats.writeSequences.launch(args: Namespace)

Command launcher

Parameters:

args – All arguments provided by the user

ppanggolin.formats.writeSequences.parser_seq(parser: ArgumentParser)

Parser for specific argument of fasta command

Parameters:

parser – Argument parser to which the arguments of this command are added

ppanggolin.formats.writeSequences.read_fasta_gbk(file_path: Path) Dict[str, str]

Read the genome file in gbk format

Parameters:

file_path – Path to genome file

Returns:

Dictionary with all sequences associated to contig

ppanggolin.formats.writeSequences.read_fasta_or_gff(file_path: Path) Dict[str, str]

Read the genome file in fasta or gbff format

Parameters:

file_path – Path to genome file

Returns:

Dictionary with all sequences associated to contig

ppanggolin.formats.writeSequences.read_genome_file(genome_file: Path, organism: Organism) Dict[str, str]

Read the genome file associated with the organism to extract its sequences

Parameters:
  • genome_file – Path to a fasta file or gbff/gff file

  • organism – organism object

Returns:

Dictionary mapping each contig to its sequence

Raises:
  • TypeError – If the file containing sequences is not recognized

  • KeyError – If there is an inconsistency between the pangenome contigs and the given contigs
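To illustrate the returned dictionary and the consistency check, here is a minimal sketch for the FASTA case. parse_fasta and check_contigs are hypothetical helpers written for this example; they assume well-formed input where every header line carries an identifier, and they are not the library's implementation.

```python
from typing import Dict, Iterable, List, Optional, Set


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Hypothetical minimal FASTA reader: contig name -> sequence."""
    contigs: Dict[str, List[str]] = {}
    name: Optional[str] = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # keep the ID, drop the description
            contigs[name] = []
        elif name is not None:
            contigs[name].append(line)  # sequence may span several lines
    return {contig: "".join(parts) for contig, parts in contigs.items()}


def check_contigs(sequences: Dict[str, str], expected: Set[str]) -> None:
    """Raise KeyError when the file and the expected contig names differ."""
    if set(sequences) != expected:
        raise KeyError("Inconsistent contigs between the pangenome and the genome file")


sequences = parse_fasta([">contig_1 example", "ACGT", "AC", ">contig_2", "GG"])
check_contigs(sequences, {"contig_1", "contig_2"})
```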

ppanggolin.formats.writeSequences.subparser(sub_parser: _SubParsersAction) ArgumentParser

Subparser to launch PPanGGOLiN in the command line

Parameters:

sub_parser – sub_parser for the fasta command

Returns:

parser arguments for the fasta command

ppanggolin.formats.writeSequences.translate_genes(sequences: Path | Iterable[Path], tmpdir: Path, cpu: int = 1, is_single_line_fasta: bool = False, code: int = 11) Path

Translate nucleotide sequences into MMSeqs2 amino acid sequences database

Parameters:
  • sequences – File with the nucleotide sequences

  • tmpdir – Temporary directory to save the MMSeqs2 files

  • cpu – Number of available threads to use

  • is_single_line_fasta – Allow the use of soft links in the MMSeqs2 database

  • code – Translation code to use

Returns:

Path to the MMSeqs2 database

ppanggolin.formats.writeSequences.write_gene_protein_sequences(pangenome_filename: str, output: Path, gene_filter: str, soft_core: float = 0.95, compress: bool = False, keep_tmp: bool = False, tmp: Path | None = None, cpu: int = 1, code: int = 11, disable_bar: bool = False)

Write all amino acid sequences from given genes in pangenome

Parameters:
  • pangenome_filename – Name of the pangenome file

  • output – Path to output directory

  • gene_filter – Selected partition of genes

  • soft_core – Soft core threshold to use

  • compress – Compress the file in .gz

  • keep_tmp – Keep temporary directory

  • tmp – Path to temporary directory

  • cpu – Number of threads available

  • code – Genetic code used to translate nucleotide sequences to protein sequences

  • disable_bar – Disable progress bar

ppanggolin.formats.writeSequences.write_gene_sequences_from_annotations(genes_to_write: Iterable[Gene], output: Path, add: str = '', compress: bool = False, disable_bar: bool = False)

Writes the CDS sequences to the output file, prefixing each gene ID with the string provided through add. Sequences are loaded from previously computed or loaded annotations.

Parameters:
  • genes_to_write – Genes to write.

  • output – Path to output file to write sequences.

  • add – Add prefix to gene ID.

  • compress – Compress the file in .gz

  • disable_bar – Disable progress bar.

ppanggolin.formats.writeSequences.write_regions_sequences(pangenome: Pangenome, output: Path, regions: str, fasta: Path | None = None, anno: Path | None = None, compress: bool = False, disable_bar: bool = False)

Write the nucleotide sequences of the RGPs (regions of genomic plasticity).

Parameters:
  • pangenome – Pangenome object with gene families sequences

  • output – Path to output directory

  • regions – Write the RGP nucleotide sequences

  • fasta – A tab-separated file listing the organism names and the fasta file path of their genomic sequences

  • anno – A tab-separated file listing the organism names, and the gff/gbff filepath of its annotations

  • compress – Compress the file in .gz

  • disable_bar – Disable progress bar

Raises:

SyntaxError – if no tab character is found in the genomes list file
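The fasta and anno arguments both point to a tab-separated list of organism names and file paths. A minimal sketch of how such a list could be parsed, raising SyntaxError on a line without a tab, is given below; parse_genome_list is a hypothetical helper and the library's own parsing may accept additional columns.

```python
from pathlib import Path
from typing import Dict, Iterable


def parse_genome_list(lines: Iterable[str]) -> Dict[str, Path]:
    """Hypothetical parser: organism name -> path of its genome file."""
    genomes: Dict[str, Path] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore empty lines
        fields = line.split("\t")
        if len(fields) < 2:
            raise SyntaxError(f"No tabulation found in line: {line!r}")
        genomes[fields[0]] = Path(fields[1])
    return genomes


genomes = parse_genome_list(["orgA\tgenomes/orgA.fasta\n", "orgB\tgenomes/orgB.fasta\n"])
```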

ppanggolin.formats.writeSequences.write_sequence_files(pangenome: Pangenome, output: Path, fasta: Path | None = None, anno: Path | None = None, soft_core: float = 0.95, regions: str | None = None, genes: str | None = None, proteins: str | None = None, gene_families: str | None = None, prot_families: str | None = None, compress: bool = False, disable_bar: bool = False, **translate_kwgs)

Main function to write sequence file from pangenome

Parameters:
  • pangenome – Pangenome object containing sequences

  • output – Path to output directory

  • fasta – A tab-separated file listing the organism names and the fasta file path of their genomic sequences

  • anno – A tab-separated file listing the organism names, and the gff/gbff filepath of its annotations

  • soft_core – Soft core threshold to use

  • regions – Write the RGP nucleotide sequences

  • genes – Write all nucleotide CDS sequences

  • proteins – Write amino acid CDS sequences.

  • gene_families – Write representative nucleotide sequences of gene families.

  • prot_families – Write representative amino acid sequences of gene families.

  • compress – Compress the file in .gz

  • disable_bar – Disable progress bar

ppanggolin.formats.writeSequences.write_spaced_fasta(sequence: str, space: int = 60) str

Split a sequence so that each line holds at most space characters

Parameters:
  • sequence – sequence to write

  • space – maximum of size for one line

Returns:

The sequence split into lines of at most space characters
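The line-wrapping behaviour can be sketched in a few lines. wrap_sequence is a hypothetical re-implementation for illustration; whether the library's function appends a trailing newline is not specified here, and this sketch does not.

```python
def wrap_sequence(sequence: str, space: int = 60) -> str:
    """Hypothetical sketch: split a sequence into lines of at most `space` characters."""
    return "\n".join(sequence[i:i + space] for i in range(0, len(sequence), space))


wrapped = wrap_sequence("ACGTACGTAC", space=4)
```

With space=4, the ten-character sequence is laid out over three lines, the last one holding the two remaining characters.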

ppanggolin.formats.write_proksee module

ppanggolin.formats.write_proksee.initiate_proksee_data(features: List[str], organism: Organism, module_to_color: Dict[Module, str] | None = None)

Initializes ProkSee data structure with legends, tracks, and captions.

Parameters:
  • features – A list of features to include in the ProkSee data.

  • organism – The organism for which the ProkSee data is being generated.

  • module_to_color – A dictionary mapping modules to their assigned colors.

Returns:

ProkSee data structure containing legends, tracks, and captions.

ppanggolin.formats.write_proksee.write_contig(organism: Organism, genome_sequences: Dict[str, str] | None = None) List[Dict]

Writes contig data for a given organism in proksee format.

Parameters:
  • organism – The organism for which contig data will be written.

  • genome_sequences – A dictionary mapping contig names to their DNA sequences (default: None).

Returns:

A list of contig data in a structured format.

ppanggolin.formats.write_proksee.write_genes(organism: Organism, multigenics: Set[GeneFamily], disable_bar: bool = True) Tuple[List[Dict], Dict[str, List[Gene]]]

Writes gene data for a given organism, including both protein-coding genes and RNA genes.

Parameters:
  • organism – The organism for which gene data will be written.

  • disable_bar – A flag to disable the progress bar when processing genes (default: True).

Returns:

List of gene data in a structured format and a dictionary mapping gene families to genes.

ppanggolin.formats.write_proksee.write_legend_items(features: List[str], module_to_color: Dict[Module, str] | None = None)

Generates legend items based on the selected features and module-to-color mapping.

Parameters:
  • features – A list of features to include in the legend.

  • module_to_color – A dictionary mapping modules to their assigned colors.

Returns:

A data structure containing legend items based on the selected features and module colors.

ppanggolin.formats.write_proksee.write_modules(organism: Organism, gf2genes: Dict[str, List[Gene]])

Writes module data in proksee format for a list of modules associated with a given organism.

Parameters:
  • organism – The organism to which the modules are associated.

  • gf2genes – A dictionary that maps gene families to the genes they contain.

Returns:

A list of module data in a structured format.

ppanggolin.formats.write_proksee.write_proksee_organism(organism: Organism, output_file: Path, features: List[str] | None = None, module_to_colors: Dict[Module, str] | None = None, genome_sequences: Dict[str, str] | None = None, multigenics: Set[GeneFamily] = [], compress: bool = False)

Writes ProkSee data for a given organism, including contig information, genes colored by partition, RGPs, and modules. The resulting data is saved as a JSON file in the specified output file.

Parameters:
  • organism – The organism for which ProkSee data will be written.

  • output_file – The output file where ProkSee data will be written.

  • features – A list of features to include in the ProkSee data, e.g., ["rgp", "modules", "all"].

  • module_to_colors – A dictionary mapping modules to their assigned colors.

  • genome_sequences – The genome sequences for the organism.

  • compress – Compress the output file
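Saving the data as JSON with optional compression can be sketched as follows. write_json is a hypothetical helper, the payload is a placeholder rather than the real ProkSee structure, and appending a .gz suffix to the file name is an assumption of this sketch.

```python
import gzip
import json
from pathlib import Path
from typing import Any, Dict


def write_json(data: Dict[str, Any], output_file: Path, compress: bool = False) -> Path:
    """Hypothetical helper: dump `data` as JSON, gzip-compressed on request."""
    if compress:
        # Assumption: the compressed file gets an extra .gz suffix.
        output_file = output_file.with_name(output_file.name + ".gz")
        with gzip.open(output_file, "wt", encoding="utf-8") as handle:
            json.dump(data, handle, indent=2)
    else:
        with open(output_file, "w", encoding="utf-8") as handle:
            json.dump(data, handle, indent=2)
    return output_file  # path of the file actually written
```

A call such as write_json({"name": "orgA"}, Path("orgA.json"), compress=True) would write orgA.json.gz in the current directory.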

ppanggolin.formats.write_proksee.write_rgp(organism: Organism)

Writes RGP (Region of Genomic Plasticity) data for a given organism in proksee format.

Parameters:

organism – The specific organism for which RGP data will be written.

Returns:

A list of RGP data in a structured format.

ppanggolin.formats.write_proksee.write_tracks(features: List[str])

Generates track information based on the selected features.

Parameters:

features – A list of features to include in the ProkSee data.

Returns:

A list of track configurations based on the selected features.

Module contents