ppanggolin.formats package

Submodules

ppanggolin.formats.readBinaries module

class ppanggolin.formats.readBinaries.Genedata(start: int, stop: int, strand: str, gene_type: str, position: int, name: str, product: str, genetic_code: int, coordinates: List[Tuple[int]] | None = None)

Bases: object

This is a general class storing unique gene-related data to be written in a specific genedata table
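As an illustration of why unique gene-related data is stored separately, the following minimal sketch deduplicates records into a lookup table. It is not the library's implementation: `GenedataSketch` and `index_genedata` are hypothetical names, and only a few of the fields from the signature above are mirrored.

```python
from typing import Dict, List, NamedTuple


class GenedataSketch(NamedTuple):
    """Hypothetical, reduced stand-in for Genedata (hashable, so usable as a dict key)."""
    start: int
    stop: int
    strand: str
    gene_type: str
    product: str


def index_genedata(records: List[GenedataSketch]) -> Dict[GenedataSketch, int]:
    """Assign one integer identifier per distinct record, in first-seen order."""
    genedata2id: Dict[GenedataSketch, int] = {}
    for record in records:
        if record not in genedata2id:
            genedata2id[record] = len(genedata2id)
    return genedata2id


records = [
    GenedataSketch(1, 300, "+", "CDS", "hypothetical protein"),
    GenedataSketch(1, 300, "+", "CDS", "hypothetical protein"),  # duplicate record
    GenedataSketch(500, 900, "-", "CDS", "transposase"),
]
genedata2id = index_genedata(records)
```

Identical records share one identifier, so each distinct combination of values only needs to be written once.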

ppanggolin.formats.readBinaries.check_pangenome_info(pangenome, need_annotations: bool = False, need_families: bool = False, need_graph: bool = False, need_partitions: bool = False, need_rgp: bool = False, need_spots: bool = False, need_gene_sequences: bool = False, need_modules: bool = False, need_metadata: bool = False, metatypes: Set[str] | None = None, sources: Set[str] | None = None, disable_bar: bool = False)

Defines what needs to be read depending on what is needed, and automatically checks if the required elements have been computed with regard to the pangenome.status

Parameters:

pangenome – Pangenome object without some information
need_annotations – get annotation
need_families – get gene families
need_graph – get graph
need_partitions – get partition
need_rgp – get RGP
need_spots – get hotspot
need_gene_sequences – get gene sequences
need_modules – get modules
need_metadata – get metadata
metatypes – metatypes of the metadata to get (None means all types with metadata)
sources – sources of the metadata to get (None means all possible sources)
disable_bar – Disable the progress bar
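A possible way to call this function is sketched below. The keyword arguments come from the signature above; `load_annotated_pangenome` is a hypothetical helper, and the use of `Pangenome()` and `add_file()` to attach the HDF5 file is an assumption that is not documented in this section.

```python
from pathlib import Path


def load_annotated_pangenome(pangenome_file: Path):
    """Hypothetical helper: load annotations and gene families from a pangenome file."""
    # Imports are kept local so the sketch can be imported without PPanGGOLiN installed.
    from ppanggolin.pangenome import Pangenome
    from ppanggolin.formats.readBinaries import check_pangenome_info

    pangenome = Pangenome()
    # Assumption: add_file() attaches the HDF5 file to the pangenome object.
    pangenome.add_file(pangenome_file)
    check_pangenome_info(
        pangenome,
        need_annotations=True,
        need_families=True,
        disable_bar=True,
    )
    return pangenome
```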

ppanggolin.formats.readBinaries.create_info_dict(info_group: Group)

Read the pangenome content

Parameters:: info_group – group in pangenome HDF5 file containing information about pangenome

ppanggolin.formats.readBinaries.get_families_from_genes(h5f: File, genes: Set[bytes]) → Set[bytes]

Retrieves gene families associated with a specified set of genes from the pangenome file.

Parameters:

h5f – The open HDF5 pangenome file containing gene family data.
genes – A set of gene names (as bytes) for which to retrieve the associated families.

Returns:

A set of gene family names (as bytes) associated with the specified genes.

ppanggolin.formats.readBinaries.get_families_matching_partition(h5f: File, partition: str) → Set[bytes]

Retrieves gene families that match the specified partition.

Parameters:

h5f – The open HDF5 pangenome file containing gene family information.
partition – The partition name (as a string). If “all”, all gene families are included. Otherwise, it filters by the first letter of the partition.

Returns:

A set of gene family names (as bytes) that match the partition criteria.
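The first-letter filtering described above can be sketched on an in-memory mapping. This is an illustration under stated assumptions, not the library's code: `families_matching_partition` is a hypothetical name, the input is a plain dictionary rather than an HDF5 file, and partition labels such as `P`, `S1` and `C` are assumed.

```python
from typing import Dict, Set


def families_matching_partition(
    family_to_partition: Dict[bytes, bytes], partition: str
) -> Set[bytes]:
    """Return families whose partition label starts with the first letter of `partition`."""
    if partition == "all":
        return set(family_to_partition)
    first_letter = partition[0].upper().encode()
    return {
        family
        for family, label in family_to_partition.items()
        if label[:1].upper() == first_letter
    }


# Assumed example labels: P (persistent), S1 (a shell subpartition), C (cloud).
example_partitions = {b"famA": b"P", b"famB": b"S1", b"famC": b"C"}
shell_families = families_matching_partition(example_partitions, "shell")
all_families = families_matching_partition(example_partitions, "all")
```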

ppanggolin.formats.readBinaries.get_family_to_genome_count(h5f: File) → Dict[bytes, int]

Computes the number of unique genomes associated with each gene family.

Parameters:: h5f – The open HDF5 pangenome file containing contig, gene, and gene family data.
Returns:: A dictionary mapping gene family names (as bytes) to the count of unique genomes.
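The counting step can be illustrated with plain dictionaries. This is a minimal sketch, assuming the gene-to-genome and gene-to-family relations are already in memory; `count_genomes_per_family` is a hypothetical name and no HDF5 access is shown.

```python
from collections import defaultdict
from typing import Dict, Set


def count_genomes_per_family(
    gene_to_genome: Dict[bytes, bytes], gene_to_family: Dict[bytes, bytes]
) -> Dict[bytes, int]:
    """Count the distinct genomes that contribute at least one gene to each family."""
    family_to_genomes: Dict[bytes, Set[bytes]] = defaultdict(set)
    for gene, family in gene_to_family.items():
        family_to_genomes[family].add(gene_to_genome[gene])
    return {family: len(genomes) for family, genomes in family_to_genomes.items()}


gene_to_genome = {b"g1": b"genomeA", b"g2": b"genomeA", b"g3": b"genomeB"}
gene_to_family = {b"g1": b"fam1", b"g2": b"fam1", b"g3": b"fam1"}
family_counts = count_genomes_per_family(gene_to_genome, gene_to_family)
```

Two genes from the same genome count once, which is why a set of genomes is kept per family.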

ppanggolin.formats.readBinaries.get_gene_to_genome(h5f: File) → Dict[bytes, bytes]

Generates a mapping between gene IDs and their corresponding genome.

Parameters:: h5f – The open HDF5 pangenome file containing contig and gene annotations.
Returns:: A dictionary mapping gene IDs to genome names.

ppanggolin.formats.readBinaries.get_genes_from_families(h5f: File, families: List[bytes]) → Set[bytes]

Retrieves a set of genes that belong to the specified families.

This function reads the gene family data from an HDF5 pangenome file and returns a set of genes that are part of the given list of gene families.

Parameters:

h5f – The open HDF5 pangenome file containing gene family data.
families – A list of gene families (as bytes) to filter genes by.

Returns:

A set of genes (as bytes) that belong to the specified families.
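The two lookups between genes and families can be sketched over (family, gene) pairs. This is only an illustration: the function names carry a `_sketch` suffix to mark them as hypothetical, and the rows are an in-memory list rather than a table read from the pangenome file.

```python
from typing import List, Set, Tuple

Row = Tuple[bytes, bytes]  # (family name, gene name)


def genes_from_families_sketch(rows: List[Row], families: List[bytes]) -> Set[bytes]:
    """Collect every gene whose family is in `families`."""
    wanted = set(families)
    return {gene for family, gene in rows if family in wanted}


def families_from_genes_sketch(rows: List[Row], genes: Set[bytes]) -> Set[bytes]:
    """Collect every family that contains at least one gene from `genes`."""
    return {family for family, gene in rows if gene in genes}


rows = [(b"fam1", b"g1"), (b"fam1", b"g2"), (b"fam2", b"g3")]
genes_of_fam1 = genes_from_families_sketch(rows, [b"fam1"])
families_of_g3 = families_from_genes_sketch(rows, {b"g3"})
```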

ppanggolin.formats.readBinaries.get_need_info(pangenome, need_annotations: bool = False, need_families: bool = False, need_graph: bool = False, need_partitions: bool = False, need_rgp: bool = False, need_spots: bool = False, need_gene_sequences: bool = False, need_modules: bool = False, need_metadata: bool = False, metatypes: Set[str] | None = None, sources: Set[str] | None = None)

ppanggolin.formats.readBinaries.get_non_redundant_gene_sequences_from_file(pangenome_filename: str, output: Path, add: str = '', disable_bar: bool = False)

Writes the non-redundant CDS sequences of the pangenome to the output file, optionally adding the string ‘add’ as a prefix to the sequence identifiers. The sequences are loaded from a .h5 pangenome file.

Parameters:

pangenome_filename – Name of the pangenome file
output – Path to the output file
add – Add a prefix to sequence header
disable_bar – disable progress bar

ppanggolin.formats.readBinaries.get_number_of_organisms(pangenome: Pangenome) → int

Standalone function to get the number of organisms in a pangenome

Parameters:: pangenome – Annotated pangenome
Returns:: Number of organisms in the pangenome

ppanggolin.formats.readBinaries.get_pangenome_parameters(h5f: File) → Dict[str, Dict[str, Any]]

Read and return the pangenome parameters.

Parameters:: h5f – Pangenome HDF5 file
Returns:: A dictionary containing the name of the ppanggolin step as the key, and a dictionary of parameter names and their corresponding values used for that step.

ppanggolin.formats.readBinaries.get_seqid_to_genes(h5f: File, genes: Set[bytes], get_all_genes: bool = False, disable_bar: bool = False) → Dict[int, List[str]]

Creates a mapping of sequence IDs to gene names.

Parameters:

h5f – The open HDF5 pangenome file containing gene sequence data.
genes – A set of gene names (as bytes) to include in the mapping (used if get_all_genes is False).
get_all_genes – Boolean flag to indicate if all genes should be included in the mapping. If set to True, all genes will be added regardless of the genes parameter.
disable_bar – Boolean flag to disable the progress bar if set to True.

Returns:

A dictionary mapping sequence IDs (integers) to lists of gene names (strings).
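The shape of the returned mapping can be sketched as follows. This is a simplified illustration, assuming the rows are available as (gene name, sequence ID) pairs in memory; `seqid_to_genes_sketch` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def seqid_to_genes_sketch(
    rows: Iterable[Tuple[bytes, int]], genes: Set[bytes], get_all_genes: bool = False
) -> Dict[int, List[str]]:
    """Group gene names by sequence ID, keeping only selected genes unless all are requested."""
    mapping: Dict[int, List[str]] = defaultdict(list)
    for gene, seq_id in rows:
        if get_all_genes or gene in genes:
            mapping[seq_id].append(gene.decode())
    return dict(mapping)


gene_rows = [(b"g1", 0), (b"g2", 0), (b"g3", 1)]
selected = seqid_to_genes_sketch(gene_rows, {b"g1", b"g3"})
everything = seqid_to_genes_sketch(gene_rows, set(), get_all_genes=True)
```

Several genes can share one sequence ID, which is why each value is a list.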

ppanggolin.formats.readBinaries.get_soft_core_families(h5f: File, soft_core: float) → Set[bytes]

Identifies gene families that are present in at least a specified proportion of genomes.

Parameters:

h5f – The open HDF5 pangenome file containing gene family and genome data.
soft_core – The proportion of genomes (between 0 and 1) that a gene family must be present in to be considered a soft core family.

Returns:

A set of gene family names (as bytes) that are classified as soft core.
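The soft core criterion can be written as a simple proportion test. The sketch below assumes per-family genome counts are already known and that the comparison is inclusive, as "at least" suggests; `soft_core_families_sketch` is a hypothetical name.

```python
from typing import Dict, Set


def soft_core_families_sketch(
    family_to_genome_count: Dict[bytes, int], genome_count: int, soft_core: float
) -> Set[bytes]:
    """Keep families present in at least `soft_core` (a proportion) of the genomes."""
    return {
        family
        for family, count in family_to_genome_count.items()
        if count / genome_count >= soft_core
    }


counts = {b"famA": 20, b"famB": 19, b"famC": 18}
soft_core_set = soft_core_families_sketch(counts, genome_count=20, soft_core=0.95)
```

With 20 genomes and a threshold of 0.95, a family must be present in at least 19 genomes.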

ppanggolin.formats.readBinaries.get_status(pangenome: Pangenome, pangenome_file: Path)

Checks which elements are already present in the file.

Parameters:

pangenome – Blank pangenome
pangenome_file – path to the pangenome file

ppanggolin.formats.readBinaries.read_annotation(pangenome: Pangenome, h5f: File, load_organisms: bool = True, load_contigs: bool = True, load_genes: bool = True, load_rnas: bool = True, chunk_size: int = 20000, disable_bar: bool = False)

Reads the annotations from the pangenome HDF5 file and adds them to the pangenome object

Parameters:

pangenome – Pangenome object without annotation
h5f – Pangenome HDF5 file with annotation
load_organisms – Flag to load organisms
load_contigs – Flag to load contigs
load_genes – Flag to load genes
load_rnas – Flag to load RNAs
chunk_size – Number of rows read per chunk
disable_bar – Disable the progress bar

ppanggolin.formats.readBinaries.read_chunks(table: Table, column: str | None = None, chunk: int = 10000)

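Reads the provided table in full (or a single column, if specified), chunk by chunk, to limit RAM usage.

Parameters:

table – Table to read
column – Name of the column to read (None means whole rows are read)
chunk – Number of rows read per chunk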
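The chunked-reading idea can be sketched with a generator over any sliceable sequence. This is an illustration only: `iter_chunks` is a hypothetical name, a Python list stands in for the table, and whether the library yields rows one by one (as here) is an assumption.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_chunks(rows: Sequence[T], chunk: int = 10000) -> Iterator[T]:
    """Yield every row, loading at most `chunk` rows at a time."""
    for start in range(0, len(rows), chunk):
        # Only this slice is materialized, which bounds memory usage.
        for row in rows[start:start + chunk]:
            yield row


all_rows = list(iter_chunks(list(range(7)), chunk=3))
```

The caller iterates as if over the whole table, while only one slice is held in memory at any time.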

ppanggolin.formats.readBinaries.read_contigs(pangenome: Pangenome, table: Table, chunk_size: int = 20000, disable_bar: bool = False)

Read contig table in pangenome file to add them to the pangenome object

Parameters:

pangenome – Pangenome object
table – Contig table
chunk_size – Size of the chunk reading
disable_bar – Disable progress bar

ppanggolin.formats.readBinaries.read_gene_families(pangenome: Pangenome, h5f: File, disable_bar: bool = False)

Reads the gene families from the pangenome HDF5 file and adds them to the pangenome object

Parameters:

pangenome – Pangenome object without gene families
h5f – Pangenome HDF5 file with gene families information
disable_bar – Disable the progress bar

ppanggolin.formats.readBinaries.read_gene_families_info(pangenome: Pangenome, h5f: File, disable_bar: bool = False)

Reads information about gene families from the pangenome HDF5 file and adds it to the pangenome object

Parameters:

pangenome – Pangenome object without gene families information
h5f – Pangenome HDF5 file with gene families information
disable_bar – Disable the progress bar

ppanggolin.formats.readBinaries.read_gene_sequences(pangenome: Pangenome, h5f: File, disable_bar: bool = False)

Reads the gene sequences from the pangenome HDF5 file and adds them to the pangenome object

Parameters:

pangenome – Pangenome object without gene sequences associated with genes
h5f – Pangenome HDF5 file with gene sequences associated with genes
disable_bar – Disable the progress bar

ppanggolin.formats.readBinaries.read_genedata(h5f: File) → Dict[int, Genedata]

Reads the genedata table and returns a genedata_id2genedata dictionary

Parameters:: h5f – the hdf5 file handler
Returns:: dictionary linking genedata to the genedata identifier
Raises:: KeyError – If a Genedata entry with joined coordinates is not found in the annotations.joinCoordinates table.

ppanggolin.formats.readBinaries.read_genes(pangenome: Pangenome, table: Table, genedata_dict: Dict[int, Genedata], link: bool = True, chunk_size: int = 20000, disable_bar: bool = False)

Read genes in pangenome file to add them to the pangenome object

Parameters:

pangenome – Pangenome object
table – Genes table
genedata_dict – Dictionary to link genedata with gene
link – Link genes to their organism and contig
chunk_size – Size of the chunk reading
disable_bar – Disable progress bar

ppanggolin.formats.readBinaries.read_graph(pangenome: Pangenome, h5f: File, disable_bar: bool = False)

Reads the graph information from the pangenome HDF5 file and adds it to the pangenome object

Parameters:

pangenome – Pangenome object without graph information
h5f – Pangenome HDF5 file with graph information
disable_bar – Disable the progress bar

ppanggolin.formats.readBinaries.read_info(h5f)

Read the pangenome content

Parameters:: h5f – Pangenome HDF5 file

ppanggolin.formats.readBinaries.read_join_coordinates(h5f: File) → Dict[str, List[Tuple[int, int]]]

Read join coordinates from a HDF5 file and return a dictionary mapping genedata_id to coordinates.

Parameters:: h5f – An HDF5 file object.
Returns:: A dictionary mapping genedata_id to a list of tuples representing start and stop coordinates.
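The returned structure can be illustrated by grouping coordinate rows per identifier. This is a minimal sketch, assuming rows are (genedata_id, start, stop) tuples that are already in the right order; `group_join_coordinates` is a hypothetical name and no HDF5 access is shown.

```python
from typing import Dict, Iterable, List, Tuple


def group_join_coordinates(
    rows: Iterable[Tuple[str, int, int]]
) -> Dict[str, List[Tuple[int, int]]]:
    """Group (start, stop) pairs by genedata identifier, preserving row order."""
    coordinates: Dict[str, List[Tuple[int, int]]] = {}
    for genedata_id, start, stop in rows:
        coordinates.setdefault(genedata_id, []).append((start, stop))
    return coordinates


coordinate_rows = [("7", 10, 50), ("7", 80, 120), ("9", 1, 30)]
joined = group_join_coordinates(coordinate_rows)
```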

ppanggolin.formats.readBinaries.read_metadata(pangenome: Pangenome, h5f: File, metatype: str, sources: Set[str] | None = None, disable_bar: bool = False)

Read metadata to add them to the pangenome object

Parameters:

pangenome – Pangenome object
h5f – Pangenome file
metatype – Object type to associate metadata
sources – Source name of metadata
disable_bar – Disable progress bar

ppanggolin.formats.readBinaries.read_module_families_from_pangenome_file(h5f: File, module_name: str) → Set[bytes]

Retrieves gene families associated with a specified module from the pangenome file.

Parameters:

h5f – The open HDF5 pangenome file containing module data.
module_name – The name of the module (as a string). The module ID is extracted from the name by removing the “module_” prefix.

Returns:

A set of gene family names (as bytes) associated with the specified module.
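The prefix handling described for module_name can be sketched as below. `module_id_from_name` is a hypothetical helper, and converting the remainder to an integer is an assumption.

```python
def module_id_from_name(module_name: str) -> int:
    """Extract the module ID from a name such as 'module_12'."""
    prefix = "module_"
    if not module_name.startswith(prefix):
        raise ValueError(f"Expected a name starting with {prefix!r}, got {module_name!r}")
    # Assumption: what follows the prefix is an integer identifier.
    return int(module_name[len(prefix):])


example_module_id = module_id_from_name("module_12")
```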

ppanggolin.formats.readBinaries.read_modules(pangenome: Pangenome, h5f: File, disable_bar: bool = False)

Reads the modules from the pangenome HDF5 file and adds them to the pangenome object

Parameters:

pangenome – Pangenome object without modules
h5f – Pangenome HDF5 file with modules computed
disable_bar – Disable the progress bar

ppanggolin.formats.readBinaries.read_organisms(pangenome: Pangenome, table: Table, chunk_size: int = 20000, disable_bar: bool = False)

Read organism table in pangenome file to add them to the pangenome object

Parameters:

pangenome – Pangenome object
table – Organism table
chunk_size – Size of the chunk reading
disable_bar – Disable progress bar

ppanggolin.formats.readBinaries.read_pangenome(pangenome, annotation: bool = False, gene_families: bool = False, graph: bool = False, rgp: bool = False, spots: bool = False, gene_sequences: bool = False, modules: bool = False, metadata: bool = False, metatypes: Set[str] | None = None, sources: Set[str] | None = None, disable_bar: bool = False)

Reads a previously written pangenome, with all of its parts, depending on what is asked, with regard to what is filled in the ‘status’ field of the hdf5 file.

Parameters:

pangenome – Pangenome object without some information
annotation – get annotation
gene_families – get gene families
graph – get graph
rgp – get RGP
spots – get hotspot
gene_sequences – get gene sequences
modules – get modules
metadata – get metadata
metatypes – metatypes of the metadata to get
sources – sources of the metadata to get (None means all sources)
disable_bar – Disable the progress bar

ppanggolin.formats.readBinaries.read_parameters(h5f: File)

Read pangenome parameters

Parameters:: h5f – Pangenome HDF5 file

ppanggolin.formats.readBinaries.read_rgp(pangenome: Pangenome, h5f: File, disable_bar: bool = False)

Reads the regions of genomic plasticity (RGP) from the pangenome HDF5 file and adds them to the pangenome object

Parameters:

pangenome – Pangenome object without RGP
h5f – Pangenome HDF5 file with RGP computed
disable_bar – Disable the progress bar

ppanggolin.formats.readBinaries.read_rgp_genes_from_pangenome_file(h5f: File) → Set[bytes]

Retrieves the set of RGP genes from the pangenome file.

Parameters:: h5f – The open HDF5 pangenome file containing RGP gene data.
Returns:: A set of gene names (as bytes) from the RGP.

ppanggolin.formats.readBinaries.read_rnas(pangenome: Pangenome, table: Table, genedata_dict: Dict[int, Genedata], link: bool = True, chunk_size: int = 20000, disable_bar: bool = False)

Read RNAs in pangenome file to add them to the pangenome object

Parameters:

pangenome – Pangenome object
table – RNAs table
genedata_dict – Dictionary to link genedata with gene
link – Allow to link gene to organism and contig
chunk_size – Size of the chunk reading
disable_bar – Disable progress bar

ppanggolin.formats.readBinaries.read_sequences(h5f: File) → dict

Reads the sequences table and returns a sequence id to sequence dictionary

Parameters:: h5f – the hdf5 file handler
Returns:: dictionary linking sequences to the seq identifier

ppanggolin.formats.readBinaries.read_spots(pangenome: Pangenome, h5f: File, disable_bar: bool = False)

Read hotspots in the pangenome HDF5 file and add them to the pangenome object.

Parameters:

pangenome – Pangenome object without spot
h5f – Pangenome HDF5 file with spot computed
disable_bar – Disable the progress bar

ppanggolin.formats.readBinaries.write_fasta_gene_fam_from_pangenome_file(pangenome_filename: str, output: Path, family_filter: str, soft_core: float = 0.95, compress: bool = False, disable_bar=False)

Write representative nucleotide sequences of gene families

Parameters:

pangenome_filename – Name of the pangenome file with gene families sequences
output – Path to output directory
family_filter – Selected partition of gene families
soft_core – Soft core threshold to use
compress – Compress the file in .gz
disable_bar – Disable progress bar

ppanggolin.formats.readBinaries.write_fasta_prot_fam_from_pangenome_file(pangenome_filename: str, output: Path, family_filter: str, soft_core: float = 0.95, compress: bool = False, disable_bar=False)

Write representative amino acid sequences of gene families.

Parameters:

pangenome_filename – Name of the pangenome file with gene families sequences
output – Path to output directory
family_filter – Selected partition of protein families
soft_core – Soft core threshold to use
compress – Compress the file in .gz
disable_bar – Disable progress bar

ppanggolin.formats.readBinaries.write_gene_sequences_from_pangenome_file(pangenome_filename: str, output: Path, list_cds: Iterator | None = None, add: str = '', compress: bool = False, disable_bar: bool = False)

Writes the CDS sequences of the pangenome to the output file, optionally restricted to a list of CDS, and optionally adding the string ‘add’ as a prefix to the sequence identifiers. The sequences are loaded from a .h5 pangenome file.

Parameters:

pangenome_filename – Name of the pangenome file
output – Path to the sequences file
list_cds – An iterable object of CDS
add – Add a prefix to sequence header
compress – Compress the output file
disable_bar – Disable progress bar

ppanggolin.formats.readBinaries.write_genes_from_pangenome_file(pangenome_filename: str, output: Path, gene_filter: str, soft_core: float = 0.95, compress: bool = False, disable_bar=False)

Write representative nucleotide sequences of gene families

Parameters:

pangenome_filename – Name of the pangenome file with gene families sequences
output – Path to output directory
gene_filter – Selected partition of gene families
soft_core – Soft core threshold to use
compress – Compress the file in .gz
disable_bar – Disable progress bar

ppanggolin.formats.readBinaries.write_genes_seq_from_pangenome_file(h5f: File, outpath: Path, compress: bool, seq_id_to_genes: Dict[int, List[str]], disable_bar: bool)

Writes gene sequences from the pangenome file to an output file.

Only sequences whose IDs match the ones in seq_id_to_genes will be written.

Parameters:

h5f – The open HDF5 pangenome file containing sequence data.
outpath – The path to the output file where sequences will be written.
compress – Boolean flag to indicate whether output should be compressed.
seq_id_to_genes – A dictionary mapping sequence IDs to lists of gene names.
disable_bar – Boolean flag to disable the progress bar if set to True.
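The filtering and optional compression can be sketched without an HDF5 file. In this illustration, which is not the library's code, sequences are given as (sequence ID, DNA string) pairs, `write_selected_sequences` is a hypothetical name, and the FASTA layout (one header per gene name) is an assumption.

```python
import gzip
import tempfile
from pathlib import Path
from typing import Dict, Iterable, List, Tuple


def write_selected_sequences(
    sequences: Iterable[Tuple[int, str]],
    outpath: Path,
    compress: bool,
    seq_id_to_genes: Dict[int, List[str]],
) -> int:
    """Write one FASTA record per gene whose sequence ID is in `seq_id_to_genes`."""
    opener = gzip.open if compress else open
    written = 0
    with opener(outpath, "wt") as handle:
        for seq_id, dna in sequences:
            # Sequence IDs absent from the mapping are skipped entirely.
            for gene in seq_id_to_genes.get(seq_id, []):
                handle.write(f">{gene}\n{dna}\n")
                written += 1
    return written


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "genes.fna"
    n_written = write_selected_sequences(
        [(0, "ATGAAATAG"), (1, "ATGCCCTGA")], out, False, {0: ["geneA", "geneB"]}
    )
    fasta_text = out.read_text()
```

Because several genes can share one stored sequence, the same DNA string may be written under more than one header.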

ppanggolin.formats.writeAnnotations module

ppanggolin.formats.writeAnnotations.contig_desc(contig_len: int, org_len: int) → Dict[str, NewCol]

Table description to save contig-related information

Parameters:

contig_len – Maximum size of contig name
org_len – Maximum size of organism name.

Returns:

Formatted table

ppanggolin.formats.writeAnnotations.gene_desc(id_len: int, max_local_id: int) → Dict[str, NewCol]

Table description to save gene-related information

Parameters:

id_len – Maximum size of gene name
max_local_id – Maximum size of gene local identifier

Returns:

Formatted table

ppanggolin.formats.writeAnnotations.gene_joined_coordinates_desc() → Dict[str, NewCol]

Creates a table description for the joined coordinates of genes. This function takes no parameters.

Returns:

Formatted table

ppanggolin.formats.writeAnnotations.gene_sequences_desc(gene_id_len: int, gene_type_len: int) → Dict[str, NewCol]

Create table to save gene sequences

Parameters:

gene_id_len – Maximum size of gene sequence identifier
gene_type_len – Maximum size of gene type

Returns:

Formatted table

ppanggolin.formats.writeAnnotations.genedata_desc(type_len: int, name_len: int, product_len: int) → Dict[str, NewCol]

Creates a table for gene-related data

Parameters:

type_len – Maximum size of gene Type.
name_len – Maximum size of gene name
product_len – Maximum size of gene product

Returns:

Formatted table for gene metadata

ppanggolin.formats.writeAnnotations.get_gene_sequences_len(pangenome: Pangenome) → Tuple[int, int]

Get the maximum size of gene sequences to optimize disk space

Parameters:: pangenome – Annotated pangenome
Returns:: maximum size of each annotation

ppanggolin.formats.writeAnnotations.get_genedata(feature: Gene | RNA) → Genedata

Gets the genedata of a feature

Parameters:: feature – Gene or RNA object
Returns:: Genedata object holding the data associated with the feature

ppanggolin.formats.writeAnnotations.get_max_len_annotations(pangenome: Pangenome) → Tuple[int, int, int, int, int]

Get the maximum size of each annotation information to optimize disk space

Parameters:: pangenome – Annotated pangenome
Returns:: Maximum size of each annotation
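The idea of measuring maximum sizes before creating fixed-width columns can be sketched as follows. `max_encoded_len` is a hypothetical helper; measuring UTF-8 byte length and using a minimum width of 1 are assumptions made for this illustration.

```python
from typing import Iterable


def max_encoded_len(values: Iterable[str], minimum: int = 1) -> int:
    """Return the longest UTF-8 encoded length among `values`, never below `minimum`."""
    longest = max((len(value.encode("utf-8")) for value in values), default=0)
    return max(minimum, longest)


contig_name_len = max_encoded_len(["contig_1", "c2"])
empty_len = max_encoded_len([])
```

Sizing each string column to the longest value actually present avoids reserving space that no record needs.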

ppanggolin.formats.writeAnnotations.get_max_len_genedata(pangenome: Pangenome) → Tuple[int, int, int]

Get the maximum size of each gene data information to optimize disk space

Parameters:: pangenome – Annotated pangenome
Returns:: maximum size of each annotation

ppanggolin.formats.writeAnnotations.get_sequence_len(pangenome: Pangenome) → int

Get the maximum size of gene sequences to optimize disk space

Parameters:: pangenome – Annotated pangenome
Returns:: maximum size of each annotation

ppanggolin.formats.writeAnnotations.organism_desc(org_len: int) → Dict[str, NewCol]

Table description to save organism-related information

Parameters:: org_len – Maximum size of organism name.
Returns:: Formatted table

ppanggolin.formats.writeAnnotations.rna_desc(id_len: int) → Dict[str, NewCol]

Table description to save RNA-related information

Parameters:

id_len – Maximum size of RNA identifier

Returns:

Formatted table

ppanggolin.formats.writeAnnotations.sequence_desc(max_seq_len: int) → Dict[str, NewCol]

Table description to save sequences

Parameters:: max_seq_len – Maximum size of sequences
Returns:: Formatted table

ppanggolin.formats.writeAnnotations.write_annotations(pangenome: Pangenome, h5f: File, rec_organisms: bool = True, rec_contigs: bool = True, rec_genes: bool = True, rec_rnas: bool = True, disable_bar: bool = False)

Function writing all the pangenome annotations

Parameters:

pangenome – Annotated pangenome
h5f – Pangenome HDF5 file
rec_organisms – Allow writing organisms in pangenomes
rec_contigs – Allow writing contigs in pangenomes
rec_genes – Allow writing genes in pangenomes
rec_rnas – Allow writing RNAs in pangenomes
disable_bar – Disable progress bar

ppanggolin.formats.writeAnnotations.write_contigs(pangenome: Pangenome, h5f: File, annotation: Group, contig_desc: Dict[str, NewCol], disable_bar=False)

Write contigs information in the pangenome file

Parameters:

pangenome – Annotated pangenome object
h5f – Pangenome file
annotation – Annotation table group
contig_desc – Contigs table description
disable_bar – Allow disabling progress bar

ppanggolin.formats.writeAnnotations.write_gene_joined_coordinates(h5f, annotation, genes_with_joined_coordinates_2_id, disable_bar)

Writes the joined coordinates of genes in the pangenome file

Parameters:

h5f – Pangenome file
annotation – Annotation group in Table
genes_with_joined_coordinates_2_id – Dictionary linking genes with joined coordinates to their identifier.
disable_bar – Allow disabling progress bar

ppanggolin.formats.writeAnnotations.write_gene_sequences(pangenome: Pangenome, h5f: File, disable_bar: bool = False)

Function writing all the pangenome gene sequences

Parameters:

pangenome – Pangenome with gene sequences
h5f – Pangenome HDF5 file without sequences
disable_bar – Disable progress bar

ppanggolin.formats.writeAnnotations.write_genedata(pangenome: Pangenome, h5f: File, annotation: Group, genedata2gene: Dict[Genedata, int], disable_bar=False)

Writing genedata information in pangenome file

Parameters:

pangenome – Pangenome object filled with annotation.
h5f – Pangenome file
annotation – Annotation group in Table
genedata2gene – Dictionary linking genedata to gene identifier.
disable_bar – Allow disabling progress bar

ppanggolin.formats.writeAnnotations.write_genes(pangenome: Pangenome, h5f: File, annotation: Group, gene_desc: Dict[str, NewCol], disable_bar=False) → Dict[Genedata, int]

Write genes information in the pangenome file

Parameters:

pangenome – Annotated pangenome object
h5f – Pangenome file
annotation – Annotation table group
gene_desc – Genes table description
disable_bar – Disable progress bar

Returns:

Dictionary linking genedata to gene identifier

ppanggolin.formats.writeAnnotations.write_organisms(pangenome: Pangenome, h5f: File, annotation: Group, organism_desc: Dict[str, NewCol], disable_bar=False)

Write organisms information in the pangenome file

Parameters:

pangenome – Annotated pangenome object
h5f – Pangenome file
annotation – Annotation table group
organism_desc – Organisms table description.
disable_bar – Allow disabling progress bar

ppanggolin.formats.writeAnnotations.write_rnas(pangenome: Pangenome, h5f: File, annotation: Group, rna_desc: Dict[str, NewCol], disable_bar=False) → Dict[Genedata, int]

Write RNAs information in the pangenome file

Parameters:

pangenome – Annotated pangenome object
h5f – Pangenome file
annotation – Annotation table group
rna_desc – RNAs table description
disable_bar – Disable progress bar

Returns:

Dictionary linking genedata to RNA identifier

ppanggolin.formats.writeBinaries module

ppanggolin.formats.writeBinaries.erase_pangenome(pangenome: Pangenome, graph: bool = False, gene_families: bool = False, partition: bool = False, rgp: bool = False, spots: bool = False, modules: bool = False, metadata: bool = False, metatype: str | None = None, source: str | None = None)

Erases tables from a pangenome .h5 file

Parameters:

pangenome – Pangenome
graph – remove graph information
gene_families – remove gene families information
partition – remove partition information
rgp – remove rgp information
spots – remove spots information
modules – remove modules information
metadata – remove metadata information
metatype – Object type of the metadata to remove
source – Source name of the metadata to remove

ppanggolin.formats.writeBinaries.gene_fam_desc(max_name_len: int, max_sequence_length: int, max_part_len: int) → dict

Create a formatted table for gene families description

Parameters:

max_name_len – Maximum size of gene family name
max_sequence_length – Maximum size of gene family representing gene sequences
max_part_len – Maximum size of gene family partition

Returns:

Formatted table

ppanggolin.formats.writeBinaries.gene_to_fam_desc(gene_fam_name_len: int, gene_id_len: int) → dict

Create a formatted table for gene in gene families information

Parameters:

gene_fam_name_len – Maximum size of gene family names
gene_id_len – Maximum size of gene identifier

Returns:

formatted table

ppanggolin.formats.writeBinaries.get_gene_fam_len(pangenome: Pangenome) → Tuple[int, int, int]

Get maximum size of gene families information

Parameters:: pangenome – Pangenome with gene families computed
Returns:: Maximum size of each element

ppanggolin.formats.writeBinaries.get_gene_id_len(pangenome: Pangenome) → int

Get maximum size of gene id in pangenome graph

Parameters:: pangenome – Pangenome with graph computed
Returns:: Maximum size of gene id

ppanggolin.formats.writeBinaries.get_gene_to_fam_len(pangenome: Pangenome)

Get maximum size of gene in gene families information

Parameters:: pangenome – Pangenome with gene families computed
Returns:: Maximum size of each element

ppanggolin.formats.writeBinaries.get_mod_desc(pangenome: Pangenome) → int

Get maximum size of gene families name in modules

Parameters:: pangenome – Pangenome with modules computed
Returns:: Maximum size of each element

ppanggolin.formats.writeBinaries.get_rgp_len(pangenome: Pangenome) → Tuple[int, int]

Get maximum size of region of genomic plasticity and gene

Parameters:: pangenome – Pangenome with gene families computed
Returns:: Maximum size of each element

ppanggolin.formats.writeBinaries.get_spot_desc(pangenome: Pangenome) → int

Get maximum size of region of genomic plasticity in hotspot

Parameters:: pangenome – Pangenome with gene families computed
Returns:: Maximum size of each element

ppanggolin.formats.writeBinaries.getmax(arg: iter) → float

Get the maximum of the values if any exist, otherwise 0

Parameters:: arg – list of values
Returns:: return the maximum

ppanggolin.formats.writeBinaries.getmean(arg: iter) → float

Compute the mean of the values if any exist, otherwise 0

Parameters:: arg – list of values
Returns:: return the mean

ppanggolin.formats.writeBinaries.getmin(arg: iter) → float

Get the minimum of the values if any exist, otherwise 0

Parameters:: arg – list of values
Returns:: return the minimum

ppanggolin.formats.writeBinaries.getstdev(arg: iter) → float

Compute the standard deviation of the values if any exist, otherwise 0

Parameters:: arg – list of values
Returns:: return the standard deviation
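The "value if any exist, otherwise 0" behaviour of these four helpers can be sketched as below. The `_sketch` names are hypothetical. Using the sample standard deviation and returning 0 for fewer than two values are assumptions, not documented behaviour.

```python
from statistics import mean, stdev
from typing import Iterable


def getmax_sketch(values: Iterable[float]) -> float:
    """Maximum of the values, or 0 when there are none."""
    return max(values, default=0)


def getmin_sketch(values: Iterable[float]) -> float:
    """Minimum of the values, or 0 when there are none."""
    return min(values, default=0)


def getmean_sketch(values: Iterable[float]) -> float:
    """Mean of the values, or 0 when there are none."""
    items = list(values)
    return mean(items) if items else 0


def getstdev_sketch(values: Iterable[float]) -> float:
    """Sample standard deviation, or 0 when fewer than two values are given (assumption)."""
    items = list(values)
    return stdev(items) if len(items) > 1 else 0
```

Falling back to 0 keeps summary statistics well defined for a pangenome in which a given element type has not been computed or is empty.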

ppanggolin.formats.writeBinaries.graph_desc(max_gene_id_len)

Create a formatted table for pangenome graph

Parameters:: max_gene_id_len – Maximum size of gene id
Returns:: formatted table

ppanggolin.formats.writeBinaries.mod_desc(gene_fam_name_len)

Create a formatted table for modules

Parameters:: gene_fam_name_len – Maximum size of gene families name
Returns:: formatted table

ppanggolin.formats.writeBinaries.rgp_desc(max_rgp_len, max_gene_len)

Create a formatted table for region of genomic plasticity

Parameters:

max_rgp_len – Maximum size of RGP
max_gene_len – Maximum size of gene

Returns:

formatted table

ppanggolin.formats.writeBinaries.spot_desc(max_rgp_len)

Create a formatted table for hotspot

Parameters:: max_rgp_len – Maximum size of RGP
Returns:: formatted table

ppanggolin.formats.writeBinaries.update_gene_fam_partition(pangenome: Pangenome, h5f: File, disable_bar: bool = False)

Update the gene families table with partition information

Parameters:

pangenome – Partitioned pangenome
h5f – HDF5 file with gene families
disable_bar – Disable progress bar

ppanggolin.formats.writeBinaries.update_gene_fragments(pangenome: Pangenome, h5f: File, disable_bar: bool = False)

Updates the annotation table with the fragmentation information from the defrag pipeline

Parameters:

pangenome – Annotated pangenome
h5f – HDF5 pangenome file
disable_bar – Disable progress bar

ppanggolin.formats.writeBinaries.write_gene_fam_info(pangenome: Pangenome, h5f: File, force: bool = False, disable_bar: bool = False)

Writing a table containing the protein sequences of each family

Parameters:

pangenome – Pangenome with gene families computed
h5f – HDF5 file to write gene families
force – Force writing the information if previous information exists
disable_bar – Disable progress bar

ppanggolin.formats.writeBinaries.write_gene_families(pangenome: Pangenome, h5f: File, force: bool = False, disable_bar: bool = False)

Function writing all the pangenome gene families

Parameters:

pangenome – pangenome with gene families computed
h5f – HDF5 file to save pangenome with gene families
force – Force writing gene families to the HDF5 file if gene families are already present
disable_bar – Disable progress bar

ppanggolin.formats.writeBinaries.write_graph(pangenome: Pangenome, h5f: File, force: bool = False, disable_bar: bool = False)

Function writing the pangenome graph

Parameters:

pangenome – pangenome with graph computed
h5f – HDF5 file to save pangenome graph
force – Force writing the graph to the HDF5 file if one is already present
disable_bar – Disable progress bar

ppanggolin.formats.writeBinaries.write_info(pangenome: Pangenome, h5f: File)

Writes information and numbers that can later be retrieved with the ‘info’ submodule

Parameters:

pangenome – Pangenome object with some information computed
h5f – Pangenome file to save information

ppanggolin.formats.writeBinaries.write_info_modules(pangenome: Pangenome, h5f: File)

Writes information about modules

Parameters:

pangenome – Pangenome object with some information computed
h5f – Pangenome file to save information

ppanggolin.formats.writeBinaries.write_modules(pangenome: Pangenome, h5f: File, force: bool = False, disable_bar: bool = False)

Function writing all the pangenome modules

Parameters:

pangenome – pangenome with modules computed
h5f – HDF5 file to save pangenome with modules
force – Force writing modules to the HDF5 file even if modules are already present
disable_bar – Disable progress bar

ppanggolin.formats.writeBinaries.write_pangenome(pangenome: Pangenome, filename, force: bool = False, disable_bar: bool = False)

Writes or updates a pangenome file

Parameters:

pangenome – pangenome object
filename – HDF5 file to save pangenome
force – Force writing to the pangenome file even if information already exists
disable_bar – Allow to disable progress bar

ppanggolin.formats.writeBinaries.write_rgp(pangenome: Pangenome, h5f: File, force: bool = False, disable_bar: bool = False)

Function writing all the regions of genomic plasticity (RGP) in the pangenome

Parameters:

pangenome – pangenome with RGP computed
h5f – HDF5 file to save pangenome with RGP
force – Force writing RGP to the HDF5 file even if RGP are already present
disable_bar – Disable progress bar

ppanggolin.formats.writeBinaries.write_spots(pangenome: Pangenome, h5f: File, force: bool = False, disable_bar: bool = False)

Function writing all the pangenome hotspots

Parameters:

pangenome – pangenome with spots computed
h5f – HDF5 file to save pangenome with spots
force – Force writing spots to the HDF5 file even if spots are already present
disable_bar – Disable progress bar

ppanggolin.formats.writeBinaries.write_status(pangenome: Pangenome, h5f: File)

Write pangenome status in HDF5 file

Parameters:

pangenome – Pangenome object
h5f – Pangenome file

ppanggolin.formats.writeFlatGenomes module

ppanggolin.formats.writeFlatGenomes.convert_overlapping_coordinates_for_gff(coordinates: List[Tuple[int, int]], contig_length: int)

Converts overlapping gene coordinates in GFF format for circular contigs.

Parameters:

coordinates – List of tuples representing gene coordinates.
contig_length – Length of the circular contig.
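The idea can be illustrated with a minimal sketch. This is not PPanGGOLiN's implementation: the helper name `unwrap_circular_coordinates` is hypothetical, and the sketch assumes that a gene spanning the origin of a circular contig is stored as several (start, stop) parts, with the part located after the origin restarting near position 1. Such parts are shifted by the contig length and merged with the previous part when contiguous, so that the end coordinate may exceed the contig length, as GFF3 allows for circular features.

```python
from typing import List, Tuple


def unwrap_circular_coordinates(
    coordinates: List[Tuple[int, int]], contig_length: int
) -> List[Tuple[int, int]]:
    """Hypothetical helper: express parts lying past the origin beyond contig_length."""
    unwrapped = [coordinates[0]]
    for start, stop in coordinates[1:]:
        prev_start, prev_stop = unwrapped[-1]
        if start < prev_start:
            # Assumption: a part starting before the previous one lies past the origin.
            start += contig_length
            stop += contig_length
        if start == prev_stop + 1:
            # Contiguous with the previous part: merge both into one interval.
            unwrapped[-1] = (prev_start, stop)
        else:
            unwrapped.append((start, stop))
    return unwrapped


# A gene covering the last 11 bases and the first 20 bases of a 5000 bp contig.
print(unwrap_circular_coordinates([(4990, 5000), (1, 20)], contig_length=5000))
```

Parts that do not cross the origin are returned unchanged.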

ppanggolin.formats.writeFlatGenomes.count_neighbors_partitions(gene_family: GeneFamily)

Count partition of neighbors families.

Parameters:: gene_family – Gene family for which we count neighbors

ppanggolin.formats.writeFlatGenomes.encode_attribute_val(product: str) → str

Encode special characters forbidden in column 9 of the GFF3 format.

Parameters:: product – The input string to encode.
Returns:: The encoded string with special characters replaced.

References:

GFF3 format requirement: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
Code source taken from Bakta: https://github.com/oschwengers/bakta
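As a rough sketch of what such an encoding involves, the GFF3 specification reserves `;`, `=`, `&` and `,` in column 9 and requires `%`, tabs and newlines to be percent-encoded. The helper below, `encode_gff3_value`, is a hypothetical stand-in written for illustration; the exact set of characters handled by PPanGGOLiN may differ.

```python
def encode_gff3_value(value: str) -> str:
    """Hypothetical helper: percent-encode characters reserved in GFF3 column 9."""
    # '%' is replaced first so that the escapes produced below are not re-encoded.
    replacements = [
        ("%", "%25"),
        (";", "%3B"),
        ("=", "%3D"),
        ("&", "%26"),
        (",", "%2C"),
        ("\t", "%09"),
        ("\n", "%0A"),
    ]
    for char, escape in replacements:
        value = value.replace(char, escape)
    return value


print(encode_gff3_value("50% identity; ABC=1"))
```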

ppanggolin.formats.writeFlatGenomes.encode_attributes(attributes: List[Tuple]) → str

Encode a list of attributes in GFF3 format.

Parameters:: attributes – A list of attribute key-value pairs represented as tuples.
Returns:: The encoded attributes as a semicolon-separated string.
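The output format can be sketched as follows. The helper `join_gff3_attributes` is hypothetical and assumes the values have already been percent-encoded; it only shows how key-value tuples map to the `key=value;key=value` layout of GFF3 column 9.

```python
from typing import List, Tuple


def join_gff3_attributes(attributes: List[Tuple[str, str]]) -> str:
    """Hypothetical helper: join (key, value) pairs into a GFF3 attribute string."""
    # Assumption: values were already percent-encoded by the caller.
    return ";".join(f"{key}={value}" for key, value in attributes)


print(join_gff3_attributes([("ID", "gene_1"), ("Name", "dnaA")]))
```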

ppanggolin.formats.writeFlatGenomes.get_organism_list(organisms_filt: str, pangenome: Pangenome) → Set[Organism]

Get a list of organisms to include in the output.

Parameters:

organisms_filt – Filter for selecting organisms. It can be a file path with one organism name per line or a comma-separated list of organism names.
pangenome – The pangenome from which organisms will be selected.

Returns:

A set of selected Organism objects.
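The two accepted forms of the filter can be illustrated with a small sketch. The helper `parse_organism_names` is hypothetical and only resolves the filter to a set of names; matching those names against the organisms of a pangenome is left out.

```python
from pathlib import Path
from typing import Set


def parse_organism_names(organisms_filt: str) -> Set[str]:
    """Hypothetical helper: read names from a file if the filter is a path, else split on commas."""
    path = Path(organisms_filt)
    if path.is_file():
        # One organism name per line; blank lines are ignored.
        lines = path.read_text().splitlines()
        return {line.strip() for line in lines if line.strip()}
    return {name.strip() for name in organisms_filt.split(",") if name.strip()}


print(sorted(parse_organism_names("GenomeA, GenomeB,GenomeA")))
```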

ppanggolin.formats.writeFlatGenomes.launch(args: Namespace)

Command launcher

Parameters:: args – All arguments provided by the user

ppanggolin.formats.writeFlatGenomes.manage_module_colors(modules: Set[Module], window_size: int = 100) → Dict[Module, str]

Manages colors for a list of modules based on gene positions and a specified window size.

Parameters:

modules – A set of module objects for which you want to determine colors.
window_size – Minimum number of genes between two modules to color them with the same color. A higher value results in more module colors.

Returns:

A dictionary that maps each module to its assigned color.

ppanggolin.formats.writeFlatGenomes.mp_write_genomes_file(organism: Organism, output: Path, genome_file: Path | None = None, proksee: bool = False, gff: bool = False, table: bool = False, **kwargs) → str

Wrapper for the write_genomes_file function that allows it to be used in multiprocessing.

Parameters:

organism – Specify the organism to be written
output – Specify the path to the output directory
genome_file – Read the genome sequences from a file
proksee – Write a proksee file for the organism
gff – Write the gff file for the organism
table – Write the organism file for the organism
kwargs – Pass any number of keyword arguments to the function

Returns:

The organism name

ppanggolin.formats.writeFlatGenomes.palette(nb_colors: int) → List[str]

Generates a palette of colors for visual representation.

Parameters:: nb_colors – The number of colors needed in the palette.
Returns:: A list of color codes in hexadecimal format.
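One possible way to build such a palette is to spread hues evenly around the color wheel. The sketch below uses the standard `colorsys` module; the function name `hex_palette` and the saturation and brightness values are arbitrary choices for illustration, not the colors PPanGGOLiN uses.

```python
import colorsys
from typing import List


def hex_palette(nb_colors: int) -> List[str]:
    """Hypothetical helper: nb_colors hexadecimal color codes with evenly spaced hues."""
    colors = []
    for index in range(nb_colors):
        hue = index / nb_colors
        # Fixed saturation and value (arbitrary choices) keep the colors readable.
        red, green, blue = colorsys.hsv_to_rgb(hue, 0.8, 0.9)
        colors.append(
            "#{:02x}{:02x}{:02x}".format(int(red * 255), int(green * 255), int(blue * 255))
        )
    return colors


print(hex_palette(4))
```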

ppanggolin.formats.writeFlatGenomes.parser_flat(parser: ArgumentParser)

Parser for arguments specific to the write command

Parameters:: parser – parser for the write command arguments

ppanggolin.formats.writeFlatGenomes.subparser(sub_parser: _SubParsersAction) → ArgumentParser

Subparser to launch PPanGGOLiN from the command line

Parameters:: sub_parser – sub_parser for the write command
Returns:: parser arguments for the write command

ppanggolin.formats.writeFlatGenomes.write_flat_genome_files(pangenome: Pangenome, output: Path, table: bool = False, gff: bool = False, proksee: bool = False, compress: bool = False, fasta: Path | None = None, anno: Path | None = None, organisms_filt: str = 'all', add_metadata: bool = False, metadata_sep: str = '|', metadata_sources: List[str] | None = None, cpu: int = 1, disable_bar: bool = False)

Main function to write flat files from pangenome

Parameters:

pangenome – Pangenome object
output – Path to output directory
cpu – Number of available cores
table – write table with pangenome annotation for each genome
gff – write a gff file with pangenome annotation for each organism
proksee – write a proksee file with pangenome annotation for each organism
compress – Compress the file in .gz
disable_bar – Disable progress bar
fasta – File containing the list of FASTA files for each organism
anno – File containing the list of GBFF/GFF files for each organism
organisms_filt – String used to specify which organisms to write. If 'all', all organisms are written.
add_metadata – Add metadata to GFF files
metadata_sep – The separator used to join multiple metadata values
metadata_sources – Sources of the metadata to use and write in the outputs. None means all sources are used.

ppanggolin.formats.writeFlatGenomes.write_gff_file(organism: Organism, outdir: Path, annotation_sources: Dict[str, str], genome_sequences: Dict[str, str], metadata_sep: str = '|', compress: bool = False)

Write the GFF file of the provided organism.

Parameters:

organism – Organism object for which the GFF file is being written.
outdir – Path to the output directory where the GFF file will be written.
metadata_sep – The separator used to join multiple metadata values
compress – If True, compress the output GFF file using .gz format.
annotation_sources – A dictionary that maps types of features to their source information.
genome_sequences – A dictionary mapping contig names to their DNA sequences.

ppanggolin.formats.writeFlatGenomes.write_tsv_genome_file(organism: Organism, output: Path, compress: bool = False, metadata_sep: str = '|', need_regions: bool = False, need_spots: bool = False, need_modules: bool = False)

Write the table of genes with pangenome annotation for one organism in tsv

Parameters:

organism – An organism
output – Path to output directory
compress – Compress the file in .gz
need_regions – Write information about regions
need_spots – Write information about spots
need_modules – Write information about modules

ppanggolin.formats.writeFlatPangenome module

ppanggolin.formats.writeFlatPangenome.launch(args: Namespace)

Command launcher

Parameters:: args – All arguments provided by the user

ppanggolin.formats.writeFlatPangenome.parser_flat(parser: ArgumentParser)

Parser for arguments specific to the write command

Parameters:: parser – parser for the write command arguments

ppanggolin.formats.writeFlatPangenome.spot2rgp(spots: set, output: Path, compress: bool = False)

Write a tsv file providing association between spot and rgp

Parameters:

spots – set of spots in pangenome
output – Path to output directory
compress – Compress the file in .gz

ppanggolin.formats.writeFlatPangenome.subparser(sub_parser: _SubParsersAction) → ArgumentParser

Subparser to launch PPanGGOLiN from the command line

Parameters:: sub_parser – sub_parser for the write command
Returns:: parser arguments for the write command

ppanggolin.formats.writeFlatPangenome.summarize_genome(organism: Organism, pangenome_persistent_count: int, pangenome_persistent_single_copy_families: Set[GeneFamily], soft_core_families: Set[GeneFamily], exact_core_families: Set[GeneFamily], rgp_count: int, spot_count: int, module_count: int) → Dict[str, any]

Summarizes genomic information of an organism.

Parameters:

organism – The organism for which the genome is being summarized.
pangenome_persistent_count – Count of persistent genes in the pangenome.
pangenome_persistent_single_copy_families – Set of gene families considered as persistent single-copy in the pangenome.
soft_core_families – soft core families of the pangenome
exact_core_families – exact core families of the pangenome
rgp_count – Number of regions of genomic plasticity in the organism. None if not computed.
spot_count – Number of spots in the organism. None if not computed.
module_count – Number of modules in the organism. None if not computed.

Returns:

A dictionary containing various summary information about the genome.

ppanggolin.formats.writeFlatPangenome.summarize_spots(spots: set, output: Path, compress: bool = False, file_name='summarize_spots.tsv')

Write a file providing summarized information about hotspots

Parameters:

spots – set of spots in pangenome
output – Path to output directory
compress – Compress the file in .gz
file_name – Name of the output file

ppanggolin.formats.writeFlatPangenome.write_borders(output: Path, dup_margin: float = 0.05, compress: bool = False)

Write all gene families bordering each spot

Parameters:

output – Path to output directory
compress – Compress the file in .gz
dup_margin – minimum ratio of organisms in which family must have multiple genes to be considered duplicated

ppanggolin.formats.writeFlatPangenome.write_gene_families_tsv(output: Path, compress: bool = False, disable_bar: bool = False)

Write the file providing the association between genes and gene families

Parameters:

output – Path to output directory
compress – Compress the file in .gz
disable_bar – Flag to disable progress bar

ppanggolin.formats.writeFlatPangenome.write_gene_presence_absence(output: Path, compress: bool = False)

Write the gene presence absence matrix

Parameters:

output – Path to output directory
compress – Compress the file in .gz
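A presence/absence matrix of this kind can be sketched with the standard `csv` module. Everything in the example is assumed for illustration: the input is a plain dictionary mapping family names to the genomes carrying them, the helper name `write_presence_absence` is hypothetical, and the `Gene` header label is an arbitrary choice rather than the layout PPanGGOLiN writes.

```python
import csv
import io
from typing import Dict, Set, TextIO


def write_presence_absence(families: Dict[str, Set[str]], handle: TextIO) -> None:
    """Hypothetical helper: one row per family, one column per genome, 1 if present else 0."""
    genomes = sorted(set().union(*families.values()))
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["Gene"] + genomes)  # header label is an assumption
    for family in sorted(families):
        row = [1 if genome in families[family] else 0 for genome in genomes]
        writer.writerow([family] + row)


buffer = io.StringIO()
write_presence_absence({"famA": {"g1", "g2"}, "famB": {"g2"}}, buffer)
print(buffer.getvalue())
```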

ppanggolin.formats.writeFlatPangenome.write_gexf(output: Path, light: bool = True, compress: bool = False)

Write the node of pangenome in gexf file

Parameters:

output – Path to output directory
light – save the light version of the pangenome graph
compress – Compress the file in .gz

ppanggolin.formats.writeFlatPangenome.write_gexf_edges(gexf: TextIO, light: bool = True)

Write the edges of the pangenome graph in the gexf file

Parameters:

gexf – file-like object, compressed or not
light – save the light version of the pangenome graph
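A file-like object that is "compressed or not" can be obtained with the standard `gzip` module, which returns a text handle when opened in `"wt"` mode. The helper `open_text_output` below is a hypothetical illustration of that pattern, including the assumption that `.gz` is appended to the file name; it is not the function PPanGGOLiN uses.

```python
import gzip
from pathlib import Path
from typing import TextIO


def open_text_output(path: Path, compress: bool = False) -> TextIO:
    """Hypothetical helper: open a text file for writing, gzip-compressed if requested."""
    if compress:
        # Assumption: the compressed file takes the same name plus a '.gz' suffix.
        return gzip.open(str(path) + ".gz", mode="wt")
    return open(path, mode="w")
```

Code that receives the returned handle can call `write()` on it in the same way in both cases.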

ppanggolin.formats.writeFlatPangenome.write_gexf_end(gexf: TextIO)

Write the end of gexf file to save pangenome

Parameters:: gexf – file-like object, compressed or not

ppanggolin.formats.writeFlatPangenome.write_gexf_header(gexf: TextIO, light: bool = True)

Write the header of gexf file to save graph

Parameters:

gexf – file-like object, compressed or not
light – save the light version of the pangenome graph

ppanggolin.formats.writeFlatPangenome.write_gexf_nodes(gexf: TextIO, light: bool = True, soft_core: False = 0.95)

Write the nodes of the pangenome graph in the gexf file

Parameters:

gexf – file-like object, compressed or not
light – save the light version of the pangenome graph
soft_core – Soft core threshold to use

ppanggolin.formats.writeFlatPangenome.write_json(output: Path, compress: bool = False)

Writes the graph in a json file format

Parameters:

output – Path to output directory
compress – Compress the file in .gz

ppanggolin.formats.writeFlatPangenome.write_json_edge(edge: Edge, json: TextIO)

Write the edge graph in json file

Parameters:

edge – Pangenome graph edge to write
json – file-like object, compressed or not

ppanggolin.formats.writeFlatPangenome.write_json_edges(json)

Write the edge graph in json file

Parameters:: json – file-like object, compressed or not

ppanggolin.formats.writeFlatPangenome.write_json_gene_fam(gene_fam: GeneFamily, json: TextIO)

Write the gene families corresponding to node graph in json file

Parameters:

gene_fam – Gene family to write
json – file-like object, compressed or not

ppanggolin.formats.writeFlatPangenome.write_json_header(json: TextIO)

Write the header of json file to save graph

Parameters:: json – file-like object, compressed or not

ppanggolin.formats.writeFlatPangenome.write_json_nodes(json: TextIO)

Write the node graph in json file

Parameters:: json – file-like object, compressed or not

ppanggolin.formats.writeFlatPangenome.write_matrix(output: Path, sep: str = ',', ext: str = 'csv', compress: bool = False, gene_names: bool = False)

Write a csv file format as used by Roary, among others. The alternative gene ID will be the partition, if there is one

Parameters:

sep – Column field separator
ext – file extension
output – Path to output directory
compress – Compress the file in .gz
gene_names – write the gene names if they are saved in the pangenome

ppanggolin.formats.writeFlatPangenome.write_module_summary(output: Path, compress: bool = False)

Write a file providing summarized information about modules

Parameters:

output – Path to output directory
compress – Compress the file in .gz

ppanggolin.formats.writeFlatPangenome.write_modules(output: Path, compress: bool = False)

Write a tsv file providing association between modules and gene families

Parameters:

output – Path to output directory
compress – Compress the file in .gz

ppanggolin.formats.writeFlatPangenome.write_org_modules(output: Path, compress: bool = False)

Write a tsv file providing association between modules and organisms

Parameters:

output – Path to output directory
compress – Compress the file in .gz

ppanggolin.formats.writeFlatPangenome.write_pangenome_flat_files(pangenome: Pangenome, output: Path, cpu: int = 1, soft_core: float = 0.95, dup_margin: float = 0.05, csv: bool = False, gene_pa: bool = False, gexf: bool = False, light_gexf: bool = False, stats: bool = False, json: bool = False, partitions: bool = False, families_tsv: bool = False, regions: bool = False, regions_families: bool = False, spots: bool = False, borders: bool = False, modules: bool = False, spot_modules: bool = False, compress: bool = False, disable_bar: bool = False)

Main function to write flat files from pangenome

Parameters:

pangenome – Pangenome object
output – Path to output directory
cpu – Number of available cores
soft_core – Soft core threshold to use
dup_margin – minimum ratio of organisms in which family must have multiple genes to be considered duplicated
csv – write csv file format as used by Roary
gene_pa – write gene presence absence matrix
gexf – write pangenome graph in gexf format
light_gexf – write pangenome graph with only gene families
stats – write statistics about pangenome
json – write pangenome graph in json file
partitions – write the gene families for each partition
families_tsv – write gene families information
regions – write RGP information
regions_families – write the association between RGP and gene families
spots – write information on spots
borders – write gene families bordering spots
modules – write information about modules
spot_modules – write association between modules and RGP and modules and spots
compress – Compress the file in .gz
disable_bar – Disable progress bar

ppanggolin.formats.writeFlatPangenome.write_partitions(output: Path, soft_core: float = 0.95)

Write the list of gene families for each partition

Parameters:

output – Path to output directory
soft_core – Soft core threshold to use
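The role of the threshold can be shown with a short sketch. It assumes that a family belongs to the soft core when it is present in at least `soft_core` times the number of genomes; both this criterion and the helper name `soft_core_families` are stated here as assumptions, and the input is a plain dictionary rather than PPanGGOLiN objects.

```python
from typing import Dict, Set


def soft_core_families(
    families: Dict[str, Set[str]], nb_genomes: int, soft_core: float = 0.95
) -> Set[str]:
    """Hypothetical helper: families present in at least soft_core * nb_genomes genomes."""
    threshold = nb_genomes * soft_core
    return {name for name, genomes in families.items() if len(genomes) >= threshold}


genomes = {f"g{i}" for i in range(10)}
families = {"famA": genomes, "famB": genomes - {"g0"}}
print(soft_core_families(families, nb_genomes=10))
```

With 10 genomes and the default threshold, a family must be present in all 10 genomes, since 9 is below 9.5.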

ppanggolin.formats.writeFlatPangenome.write_persistent_duplication_statistics(pangenome: Pangenome, output: Path, dup_margin: float, compress: bool) → Set[GeneFamily]

Writes statistics on persistent duplications in gene families to a specified output file.

Parameters:

pangenome – The Pangenome object containing gene families.
output – The Path specifying the output file location.
dup_margin – The duplication margin used for determining single copy markers.
compress – A boolean indicating whether to compress the output file.

Returns:

ppanggolin.formats.writeFlatPangenome.write_regions(output: Path, compress: bool = False)

Write the file providing information about RGP content

Parameters:

output – Path to output directory
compress – Compress the file in .gz

ppanggolin.formats.writeFlatPangenome.write_regions_families(output: Path, compress: bool = False)

Write the file providing the association between regions of genomic plasticity and gene families.

Parameters:

output – Path to output directory
compress – Compress the file in .gz

ppanggolin.formats.writeFlatPangenome.write_rgp_modules(output: Path, compress: bool = False)

Write a tsv file providing association between modules and RGP

Parameters:

output – Path to output directory
compress – Compress the file in .gz

ppanggolin.formats.writeFlatPangenome.write_rgp_table(regions: Set[Region], output: Path, compress: bool = False)

Write the file providing information about regions of genomic plasticity.

Parameters:

regions – Set of Region objects representing regions.
output – Path to the output directory.
compress – Whether to compress the file in .gz format.

ppanggolin.formats.writeFlatPangenome.write_spot_modules(output: Path, compress: bool = False)

Write a tsv file providing association between modules and spots

Parameters:

output – Path to output directory
compress – Compress the file in .gz

ppanggolin.formats.writeFlatPangenome.write_spots(output: Path, compress: bool = False)

Write tsv files providing spots information and association with RGP

Parameters:

output – Path to output directory
compress – Compress the file in .gz

ppanggolin.formats.writeFlatPangenome.write_stats(output: Path, soft_core: float = 0.95, dup_margin: float = 0.05, compress: bool = False)

Write pangenome statistics for each genome

Parameters:

output – Path to output directory
soft_core – Soft core threshold to use
dup_margin – minimum ratio of organisms in which family must have multiple genes to be considered duplicated
compress – Compress the file in .gz
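The meaning of dup_margin can be illustrated as follows. The sketch takes, for one family, the number of genes found in each genome that carries it, and reports the family as duplicated when the share of genomes with more than one gene reaches dup_margin. The helper name `is_duplicated` and the choice of denominator (genomes carrying the family) are assumptions made for the example.

```python
from typing import Dict


def is_duplicated(gene_counts: Dict[str, int], dup_margin: float = 0.05) -> bool:
    """Hypothetical helper: gene_counts maps genome name to gene count for one family."""
    if not gene_counts:
        return False
    multi_copy = sum(1 for count in gene_counts.values() if count > 1)
    # Assumption: the ratio is computed over the genomes that carry the family.
    return multi_copy / len(gene_counts) >= dup_margin


print(is_duplicated({"g1": 2, "g2": 1, "g3": 1, "g4": 1}))
```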

ppanggolin.formats.writeFlatPangenome.write_summaries_in_tsv(summaries: List[Dict[str, Any]], output_file: Path, dup_margin: float, soft_core: float, compress: bool = False)

Writes summaries of organisms stored in a dictionary into a Tab-Separated Values (TSV) file.

Parameters:

summaries – A list containing organism summaries.
output_file – The Path specifying the output TSV file location.
soft_core – Soft core threshold used
dup_margin – minimum ratio of organisms in which family must have multiple genes to be considered duplicated
compress – Compress the file in .gz

ppanggolin.formats.writeMSA module

ppanggolin.formats.writeMSA.compute_msa(families: Set[GeneFamily], output: Path, tmpdir: Path, cpu: int = 1, source: str = 'protein', use_gene_id: bool = False, code: str = '11', disable_bar: bool = False)

Compute MSA between pangenome gene families

Parameters:

families – Set of families specific to given partition
output – output directory name for families alignment
cpu – number of available cores
tmpdir – path to temporary directory
source – indicates whether to use protein or dna sequences to compute the msa
use_gene_id – Use gene identifiers rather than organism names for sequences in the family MSA
code – Genetic code to use
disable_bar – Disable progress bar

ppanggolin.formats.writeMSA.get_families_to_write(pangenome: Pangenome, partition_filter: str = 'core', soft_core: float = 0.95, dup_margin: float = 0.95, single_copy: bool = True) → Set[GeneFamily]

Get families corresponding to the given partition

Parameters:

pangenome – Partitioned pangenome
partition_filter – choice of partition to compute Multiple Sequence Alignment of the gene families
soft_core – Soft core threshold to use
dup_margin – maximal number of genomes in which the gene family can have multiple members and still be considered a ‘single copy’ gene family
single_copy – Use “single copy” (defined by dup_margin) gene families only

Returns:

set of families unique to one partition

ppanggolin.formats.writeMSA.launch(args: Namespace)

Command launcher

Parameters:: args – All arguments provided by the user

ppanggolin.formats.writeMSA.launch_mafft(fname: Path, output: Path, fam_name: str)

Compute the MSA with mafft

Parameters:

fname – family gene sequence in fasta
output – directory to save alignment
fam_name – Name of the gene family

ppanggolin.formats.writeMSA.launch_multi_mafft(args: List[Tuple[Path, Path, str]])

Allow to launch mafft in multiprocessing

Parameters:: args – Pack of argument for launch_mafft
Returns:: Organism object for pangenome

ppanggolin.formats.writeMSA.parser_msa(parser: ArgumentParser)

Parser for arguments specific to the msa command

Parameters:: parser – parser for the msa command arguments

ppanggolin.formats.writeMSA.subparser(sub_parser: _SubParsersAction) → ArgumentParser

Subparser to launch PPanGGOLiN from the command line

Parameters:: sub_parser – sub_parser for the msa command
Returns:: parser arguments for the msa command

ppanggolin.formats.writeMSA.translate(gene: Gene, code: Dict[str, Dict[str, str]]) → Tuple[str, bool]

Translates the given DNA sequence with the given translation table

Parameters:

gene – given gene
code – translation table corresponding to genetic code to use

Returns:

protein sequence
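Codon-by-codon translation can be sketched as below. The example is deliberately simplified and differs from the documented function: it takes a plain DNA string instead of a Gene, uses a flat codon table built from the standard NCBI amino acid ordering (shared by genetic codes 1 and 11) instead of the nested dictionary, and returns a boolean whose meaning (sequence length not a multiple of three) is an assumption chosen for illustration.

```python
from itertools import product
from typing import Dict, Tuple

BASES = "TCAG"
# Amino acids in NCBI order for codons TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE: Dict[str, str] = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_dna(dna: str, table: Dict[str, str] = CODON_TABLE) -> Tuple[str, bool]:
    """Hypothetical helper: return the protein and whether trailing bases were left over."""
    dna = dna.upper()
    partial = len(dna) % 3 != 0
    usable = len(dna) - len(dna) % 3
    # Unknown codons (for example containing N) are translated as 'X'.
    protein = "".join(table.get(dna[i:i + 3], "X") for i in range(0, usable, 3))
    return protein, partial


print(translate_dna("ATGGCTTAA"))
```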

ppanggolin.formats.writeMSA.write_fasta_families(family: GeneFamily, tmpdir: TemporaryDirectory, code_table: Dict[str, Dict[str, str]], source: str = 'protein', use_gene_id: bool = False) → Tuple[Path, bool]

Write fasta files for each gene family

Parameters:

family – gene family to write
tmpdir – path to temporary directory
source – indicates whether to use protein or dna sequences to compute the msa
use_gene_id – Use gene identifiers rather than organism names for sequences in the family MSA
code_table – Genetic code to use

Returns:

path to fasta file

ppanggolin.formats.writeMSA.write_msa_files(pangenome: Pangenome, output: Path, cpu: int = 1, partition: str = 'core', tmpdir: Path | None = None, source: str = 'protein', soft_core: float = 0.95, phylo: bool = False, use_gene_id: bool = False, translation_table: str = '11', dup_margin: float = 0.95, single_copy: bool = True, force: bool = False, disable_bar: bool = False)

Main function to write MSA files

Parameters:

pangenome – Pangenome object with partition
output – Path to output directory
cpu – number of available cores
partition – choice of partition to compute Multiple Sequence Alignment of the gene families
tmpdir – path to temporary directory
source – indicates whether to use protein or dna sequences to compute the msa
soft_core – Soft core threshold to use
phylo – Writes a whole genome msa file for additional phylogenetic analysis
use_gene_id – Use gene identifiers rather than organism names for sequences in the family MSA
translation_table – Translation table (genetic code) to use.
dup_margin – maximal number of genomes in which the gene family can have multiple members and still be considered a ‘single copy’ gene family
single_copy – Use “single copy” (defined by dup_margin) gene families only
force – force to write in the directory
disable_bar – Disable progress bar

ppanggolin.formats.writeMSA.write_whole_genome_msa(pangenome: Pangenome, families: set, phylo_name: Path, outdir: Path, use_gene_id: bool = False)

Writes a whole genome msa file for additional phylogenetic analysis

Parameters:

pangenome – Pangenome object
families – Set of families specific to given partition
phylo_name – output file name for phylo alignment
outdir – output directory name for families alignment
use_gene_id – Use gene identifiers rather than organism names for sequences in the family MSA
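A whole-genome alignment of this kind is typically built by concatenating the per-family alignments genome by genome. The sketch below shows that general technique under stated assumptions: each family alignment is a dictionary mapping genome names to aligned sequences of equal length, and a genome missing from a family is padded with gaps. The helper name `concatenate_alignments` is hypothetical and the code does not reproduce PPanGGOLiN's file handling.

```python
from typing import Dict, List


def concatenate_alignments(
    alignments: List[Dict[str, str]], genomes: List[str]
) -> Dict[str, str]:
    """Hypothetical helper: concatenate family alignments, padding absent genomes with gaps."""
    concatenated = {genome: "" for genome in genomes}
    for alignment in alignments:
        if not alignment:
            continue
        # All sequences of one family alignment share the same length.
        length = len(next(iter(alignment.values())))
        for genome in genomes:
            concatenated[genome] += alignment.get(genome, "-" * length)
    return concatenated


family_alignments = [{"g1": "AC-", "g2": "ACG"}, {"g1": "TT"}]
print(concatenate_alignments(family_alignments, ["g1", "g2"]))
```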

ppanggolin.formats.writeMetadata module

ppanggolin.formats.writeMetadata.desc_metadata(max_len_dict: Dict[str, int], type_dict: Dict[str, Col]) → dict

Create a formatted table for metadata description

Returns:: Formatted table

ppanggolin.formats.writeMetadata.erase_metadata(pangenome: Pangenome, h5f: File, status_group: Group, metatype: str | None = None, source: str | None = None)

Erase metadata in pangenome

Parameters:

pangenome – Pangenome with metadata to erase
h5f – HDF5 file with pangenome metadata
status_group – pangenome status in HDF5
metatype – select to which pangenome element metadata should be erased
source – name of the metadata source

ppanggolin.formats.writeMetadata.get_metadata_contig_len(select_ctg: List[Contig], source: str) → Tuple[Dict[str, int], Dict[str, Col], int]

Get maximum size of contig metadata information

Parameters:

select_ctg – selected elements from source
source – Name of the metadata source

Returns:

Maximum type and size of each element

Get maximum size of metadata information

Parameters:

select_elem – selected elements from source
source – Name of the metadata source

Returns:

Maximum type and size of each element

ppanggolin.formats.writeMetadata.write_metadata(pangenome: Pangenome, h5f: File, disable_bar: bool = False)

Write metadata in pangenome

Parameters:

pangenome – Pangenome in which metadata should be written
h5f – HDF5 file with pangenome
disable_bar – Disable progress bar

ppanggolin.formats.writeMetadata.write_metadata_contig(h5f: File, source: str, select_contigs: List[Contig], disable_bar: bool = False)

Writing a table containing the metadata associated to contig

Parameters:

h5f – HDF5 file to write metadata
source – name of the metadata source
select_contigs – List of contigs with metadata
disable_bar – Disable progress bar

ppanggolin.formats.writeMetadata.write_metadata_group(h5f: File, metatype: str) → Group

Check and write the group in HDF5 file to organize metadata

Parameters:

h5f – HDF5 file with pangenome
metatype – select to which pangenome element metadata should be written

Returns:

Metadata group of the corresponding metatype

Writing a table containing the metadata associated to element from the metatype

Parameters:

h5f – HDF5 file to write metadata
source – name of the metadata source
metatype – select to which pangenome element metadata should be written
select_elements – Elements selected to write metadata
disable_bar – Disable progress bar

ppanggolin.formats.writeMetadata.write_metadata_status(pangenome: Pangenome, h5f: File, status_group: Group) → bool

Write status of metadata in pangenome file

Parameters:

pangenome – pangenome with metadata
h5f – HDF5 file with pangenome
status_group – Pangenome status information group

Returns:

ppanggolin.formats.writeSequences module

ppanggolin.formats.writeSequences.check_write_sequences_args(args: Namespace) → None

Check arguments compatibility in CLI

Parameters:: args – argparse namespace arguments from CLI
Raises:: argparse.ArgumentTypeError – if region is given but neither fasta nor anno is given

ppanggolin.formats.writeSequences.create_mmseqs_db(sequences: Iterable[Path], db_name: str, tmpdir: Path, db_mode: int = 0, db_type: int = 0) → Path

Create an MMseqs2 database from sequence files.

Parameters:

sequences – Files with the sequences
db_name – name of the database
tmpdir – Temporary directory to save the MMseqs2 files
db_mode – Createdb mode 0: copy data, 1: soft link data and write new index (works only with single line fasta/q)
db_type – Database type 0: auto, 1: amino acid 2: nucleotides

Returns:

Path to the MMseqs2 database
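The db_mode and db_type descriptions mirror the `--createdb-mode` and `--dbtype` options of `mmseqs createdb`. As a sketch, the hypothetical helper below assembles such a command line without running it; how PPanGGOLiN actually builds and launches the command is not shown in this reference, so the argument order is an assumption that should be checked against `mmseqs createdb -h`.

```python
from pathlib import Path
from typing import Iterable, List


def build_createdb_command(
    sequences: Iterable[Path], db_path: Path, db_mode: int = 0, db_type: int = 0
) -> List[str]:
    """Hypothetical helper: assemble (but do not run) an 'mmseqs createdb' command."""
    cmd = ["mmseqs", "createdb"]
    cmd += [Path(seq).as_posix() for seq in sequences]
    cmd += [
        db_path.as_posix(),
        "--createdb-mode", str(db_mode),  # 0: copy data, 1: soft link data
        "--dbtype", str(db_type),  # 0: auto, 1: amino acid, 2: nucleotides
    ]
    return cmd


print(" ".join(build_createdb_command([Path("genes.fna")], Path("tmp/genes_db"), db_type=2)))
```

The resulting list could be passed to `subprocess.run` on a machine where MMseqs2 is installed.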

ppanggolin.formats.writeSequences.filter_values(arg_value: str)

Check filter value to ensure they are in the expected format.

Parameters:: arg_value – Argument value that is being tested.
Returns:: The same argument if it is valid.
Raises:: argparse.ArgumentTypeError – If the argument value is not in the expected format.
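Validators of this shape are usually passed to argparse through the `type` argument. The sketch below shows the pattern with a hypothetical set of accepted values; the values PPanGGOLiN actually accepts are not listed in this reference, so `ALLOWED_FILTERS` and the helper name `check_filter` are assumptions for illustration only.

```python
import argparse

# Hypothetical set of accepted values, chosen for the example.
ALLOWED_FILTERS = {"all", "persistent", "shell", "cloud"}


def check_filter(arg_value: str) -> str:
    """Hypothetical validator: return the value unchanged if it is accepted."""
    if arg_value not in ALLOWED_FILTERS:
        expected = ", ".join(sorted(ALLOWED_FILTERS))
        raise argparse.ArgumentTypeError(
            f"Invalid filter '{arg_value}'. Expected one of: {expected}"
        )
    return arg_value


parser = argparse.ArgumentParser()
parser.add_argument("--genes", type=check_filter)
print(parser.parse_args(["--genes", "shell"]).genes)
```

When the validator raises `argparse.ArgumentTypeError`, argparse reports the message as a usage error instead of showing a traceback.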

ppanggolin.formats.writeSequences.launch(args: Namespace)

Command launcher

Parameters:: args – All arguments provided by the user

ppanggolin.formats.writeSequences.parser_seq(parser: ArgumentParser)

Parser for arguments specific to the fasta command

Parameters:: parser – parser for the fasta command arguments

ppanggolin.formats.writeSequences.read_fasta_gbk(file_path: Path) → Dict[str, str]

Read the genome file in gbk format

Parameters:: file_path – Path to genome file
Returns:: Dictionary with all sequences associated to contig

ppanggolin.formats.writeSequences.read_fasta_or_gff(file_path: Path) → Dict[str, str]

Read the genome file in fasta or gff format

Parameters:: file_path – Path to genome file
Returns:: Dictionary with all sequences associated to contig

ppanggolin.formats.writeSequences.read_genome_file(genome_file: Path, organism: Organism) → Dict[str, str]

Read the genome file associated to organism to extract sequences

Parameters:

genome_file – Path to a fasta file or gbff/gff file
organism – organism object

Returns:

Dictionary with all sequences associated to contig

Raises:

TypeError – If the file containing sequences is not recognized
KeyError – If there is an inconsistency between pangenome contigs and the given contigs

ppanggolin.formats.writeSequences.subparser(sub_parser: _SubParsersAction) → ArgumentParser

Subparser to launch PPanGGOLiN from the command line

Parameters:: sub_parser – sub_parser for the fasta command
Returns:: parser arguments for the fasta command

ppanggolin.formats.writeSequences.translate_genes(sequences: Path | Iterable[Path], tmpdir: Path, cpu: int = 1, is_single_line_fasta: bool = False, code: int = 11) → Path

Translate nucleotide sequences into an MMseqs2 amino acid sequence database

Parameters:

sequences – File(s) with the nucleotide sequences
tmpdir – Temporary directory to save the MMseqs2 files
cpu – Number of available threads to use
is_single_line_fasta – Allow the use of soft links in the MMseqs2 database
code – Translation code to use

Returns:

Path to the MMseqs2 database

ppanggolin.formats.writeSequences.write_gene_protein_sequences(pangenome_filename: str, output: Path, gene_filter: str, soft_core: float = 0.95, compress: bool = False, keep_tmp: bool = False, tmp: Path | None = None, cpu: int = 1, code: int = 11, disable_bar: bool = False)

Write all amino acid sequences from given genes in pangenome

Parameters:

pangenome_filename – Name of the pangenome file containing gene family sequences
output – Path to output directory
gene_filter – Selected partition of genes
soft_core – Soft core threshold to use
compress – Compress the file in .gz
keep_tmp – Keep temporary directory
tmp – Path to temporary directory
cpu – Number of threads available
code – Genetic code used to translate nucleotide sequences to protein sequences
disable_bar – Disable progress bar

ppanggolin.formats.writeSequences.write_gene_sequences_from_annotations(genes_to_write: Iterable[Gene], output: Path, add: str = '', compress: bool = False, disable_bar: bool = False)

Writes the CDS sequences to the output file, prefixing each gene ID with the string provided through add. The sequences are loaded from previously computed or loaded annotations.

Parameters:

genes_to_write – Genes to write.
output – Path to output file to write sequences.
add – Add prefix to gene ID.
compress – Compress the file in .gz
disable_bar – Disable progress bar.

ppanggolin.formats.writeSequences.write_regions_sequences(pangenome: Pangenome, output: Path, regions: str, fasta: Path | None = None, anno: Path | None = None, compress: bool = False, disable_bar: bool = False)

Write the nucleotide sequences of the RGPs (Regions of Genomic Plasticity).

Parameters:

pangenome – Pangenome object with gene families sequences
output – Path to output directory
regions – Write the RGP nucleotide sequences
fasta – A tab-separated file listing the organism names and the fasta filepath of their genomic sequences
anno – A tab-separated file listing the organism names and the gff/gbff filepath of their annotations
compress – Compress the file in .gz
disable_bar – Disable progress bar

Raises:

SyntaxError – If no tab character is found in the genome list file
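
The tab-separated list format expected by fasta and anno can be illustrated with a small reader. This is a minimal sketch, not PPanGGOLiN's own parser: the function name read_genome_list is hypothetical, and it assumes the organism name is in the first column and the file path in the second.

```python
from pathlib import Path
from typing import Dict


def read_genome_list(list_file: Path) -> Dict[str, Path]:
    """Map each organism name (column 1) to its file path (column 2).

    Hypothetical helper for illustration only; it is not part of PPanGGOLiN.
    """
    genome_to_path: Dict[str, Path] = {}
    with open(list_file) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line.strip():
                continue  # skip empty lines
            fields = line.split("\t")
            if len(fields) < 2:
                # Mirrors the documented behaviour: a line without a tab is an error
                raise SyntaxError(f"No tabulation found in line: {line!r}")
            genome_to_path[fields[0]] = Path(fields[1])
    return genome_to_path
```

A file containing the line "orgA<TAB>/data/orgA.fasta" would yield a dictionary with the key "orgA"; a line separated by spaces instead of a tab raises SyntaxError.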

ppanggolin.formats.writeSequences.write_sequence_files(pangenome: Pangenome, output: Path, fasta: Path | None = None, anno: Path | None = None, soft_core: float = 0.95, regions: str | None = None, genes: str | None = None, proteins: str | None = None, gene_families: str | None = None, prot_families: str | None = None, compress: bool = False, disable_bar: bool = False, **translate_kwgs)

Main function to write sequence files from a pangenome

Parameters:

pangenome – Pangenome object containing sequences
output – Path to output directory
fasta – A tab-separated file listing the organism names and the fasta filepath of their genomic sequences
anno – A tab-separated file listing the organism names and the gff/gbff filepath of their annotations
soft_core – Soft core threshold to use
regions – Write the RGP nucleotide sequences
genes – Write all nucleotide CDS sequences
proteins – Write amino acid CDS sequences.
gene_families – Write representative nucleotide sequences of gene families.
prot_families – Write representative amino acid sequences of gene families.
compress – Compress the file in .gz
disable_bar – Disable progress bar

ppanggolin.formats.writeSequences.write_spaced_fasta(sequence: str, space: int = 60) → str

Split a sequence so that each line holds at most space characters

Parameters:

sequence – sequence to write
space – maximum length of one line

Returns:

The sequence split into lines of at most space characters
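
The line-wrapping behaviour described above can be sketched as follows. This is an illustrative reimplementation under assumptions, not the library's code: the name wrap_sequence is hypothetical, and whether the real function appends a trailing newline is not stated in this documentation.

```python
def wrap_sequence(sequence: str, space: int = 60) -> str:
    """Return the sequence split into lines of at most `space` characters.

    Hypothetical stand-in for illustration; no trailing newline is added.
    """
    # Slice the sequence into consecutive chunks of length `space`
    chunks = (sequence[i:i + space] for i in range(0, len(sequence), space))
    return "\n".join(chunks)
```

For example, wrap_sequence("ACGTACGTAC", 4) returns "ACGT\nACGT\nAC".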

ppanggolin.formats.write_proksee module

ppanggolin.formats.write_proksee.initiate_proksee_data(features: List[str], organism: Organism, module_to_color: Dict[Module, str] | None = None)

Initializes ProkSee data structure with legends, tracks, and captions.

Parameters:

features – A list of features to include in the ProkSee data.
organism – The organism for which the ProkSee data is being generated.
module_to_color – A dictionary mapping modules to their assigned colors.

Returns:

ProkSee data structure containing legends, tracks, and captions.

ppanggolin.formats.write_proksee.write_contig(organism: Organism, genome_sequences: Dict[str, str] | None = None) → List[Dict]

Writes contig data for a given organism in proksee format.

Parameters:

organism – The organism for which contig data will be written.
genome_sequences – A dictionary mapping contig names to their DNA sequences (default: None).

Returns:

A list of contig data in a structured format.

ppanggolin.formats.write_proksee.write_genes(organism: Organism, multigenics: Set[GeneFamily], disable_bar: bool = True) → Tuple[List[Dict], Dict[str, List[Gene]]]

Writes gene data for a given organism, including both protein-coding genes and RNA genes.

Parameters:

organism – The organism for which gene data will be written.
multigenics – The set of multigenic gene families.
disable_bar – A flag to disable the progress bar when processing genes (default: True).

Returns:

List of gene data in a structured format and a dictionary mapping gene families to genes.

ppanggolin.formats.write_proksee.write_legend_items(features: List[str], module_to_color: Dict[Module, str] | None = None)

Generates legend items based on the selected features and module-to-color mapping.

Parameters:

features – A list of features to include in the legend.
module_to_color – A dictionary mapping modules to their assigned colors.

Returns:

A data structure containing legend items based on the selected features and module colors.

ppanggolin.formats.write_proksee.write_modules(organism: Organism, gf2genes: Dict[str, List[Gene]])

Writes module data in proksee format for a list of modules associated with a given organism.

Parameters:

organism – The organism to which the modules are associated.
gf2genes – A dictionary that maps gene families to the genes they contain.

Returns:

A list of module data in a structured format.

ppanggolin.formats.write_proksee.write_proksee_organism(organism: Organism, output_file: Path, features: List[str] | None = None, module_to_colors: Dict[Module, str] | None = None, genome_sequences: Dict[str, str] | None = None, multigenics: Set[GeneFamily] = [], compress: bool = False)

Writes ProkSee data for a given organism, including contig information, genes colored by partition, RGPs, and modules. The resulting data is saved as a JSON file in the specified output file.

Parameters:

organism – The organism for which ProkSee data will be written.
output_file – The output file where ProkSee data will be written.
features – A list of features to include in the ProkSee data, e.g., [“rgp”, “modules”, “all”].
module_to_colors – A dictionary mapping modules to their assigned colors.
genome_sequences – The genome sequences for the organism.
multigenics – The set of multigenic gene families.
compress – Compress the output file
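
Saving a data structure as JSON with an optional compress flag can be sketched as below. This is a generic illustration, not the code used by write_proksee_organism: the helper name write_json and the ".gz" suffix handling are assumptions, and the dictionary passed in is a placeholder rather than the actual ProkSee schema.

```python
import gzip
import json
from pathlib import Path
from typing import Any, Dict


def write_json(data: Dict[str, Any], output_file: Path, compress: bool = False) -> Path:
    """Write `data` as JSON, gzip-compressed when `compress` is True.

    Hypothetical helper; returns the path that was actually written.
    """
    if compress:
        # Assumption: compressed output gets a ".gz" suffix appended
        output_file = Path(str(output_file) + ".gz")
        with gzip.open(output_file, "wt") as handle:
            json.dump(data, handle, indent=2)
    else:
        with open(output_file, "w") as handle:
            json.dump(data, handle, indent=2)
    return output_file
```

Calling write_json({"contigs": []}, Path("genome.json"), compress=True) writes genome.json.gz, which can be read back with gzip.open in text mode.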

ppanggolin.formats.write_proksee.write_rgp(organism: Organism)

Writes RGP (Region of Genomic Plasticity) data for a given organism in proksee format.

Parameters:

organism – The specific organism for which RGP data will be written.

Returns:

A list of RGP data in a structured format.

ppanggolin.formats.write_proksee.write_tracks(features: List[str])

Generates track information based on the selected features.

Parameters:

features – A list of features to include in the ProkSee data.

Returns:

A list of track configurations based on the selected features.
