ppanggolin.formats package
Submodules
ppanggolin.formats.readBinaries module
- class ppanggolin.formats.readBinaries.Genedata(start: int, stop: int, strand: str, gene_type: str, position: int, name: str, product: str, genetic_code: int, coordinates: List[Tuple[int]] | None = None)
Bases: object
This is a general class storing unique gene-related data to be written in a specific genedata table.
- ppanggolin.formats.readBinaries.check_pangenome_info(pangenome, need_annotations: bool = False, need_families: bool = False, need_graph: bool = False, need_partitions: bool = False, need_rgp: bool = False, need_spots: bool = False, need_gene_sequences: bool = False, need_modules: bool = False, need_metadata: bool = False, metatypes: Set[str] | None = None, sources: Set[str] | None = None, disable_bar: bool = False)
Defines what needs to be read depending on what is needed, and automatically checks if the required elements have been computed with regard to the pangenome.status
- Parameters:
pangenome – Pangenome object without some information
need_annotations – get annotation
need_families – get gene families
need_graph – get graph
need_partitions – get partition
need_rgp – get RGP
need_spots – get hotspot
need_gene_sequences – get gene sequences
need_modules – get modules
need_metadata – get metadata
metatypes – metatypes of the metadata to get (None means all types with metadata)
sources – sources of the metadata to get (None means all possible sources)
disable_bar – Disable the progress bar
- ppanggolin.formats.readBinaries.create_info_dict(info_group: Group)
Read the pangenome content
- Parameters:
info_group – group in pangenome HDF5 file containing information about pangenome
- ppanggolin.formats.readBinaries.get_families_from_genes(h5f: File, genes: Set[bytes]) Set[bytes]
Retrieves gene families associated with a specified set of genes from the pangenome file.
- Parameters:
h5f – The open HDF5 pangenome file containing gene family data.
genes – A set of gene names (as bytes) for which to retrieve the associated families.
- Returns:
A set of gene family names (as bytes) associated with the specified genes.
- ppanggolin.formats.readBinaries.get_families_matching_partition(h5f: File, partition: str) Set[bytes]
Retrieves gene families that match the specified partition.
- Parameters:
h5f – The open HDF5 pangenome file containing gene family information.
partition – The partition name (as a string). If “all”, all gene families are included. Otherwise, it filters by the first letter of the partition.
- Returns:
A set of gene family names (as bytes) that match the partition criteria.
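The filtering rule can be sketched in plain Python. This is an illustration only, not ppanggolin's implementation: `families_matching_partition` and the in-memory `(family, partition)` rows are hypothetical stand-ins for the HDF5 table, and comparing the upper-cased first letter of the requested name with the first letter of the stored label is an assumption about how the "first letter" rule works.

```python
from typing import Iterable, Set, Tuple


def families_matching_partition(
    rows: Iterable[Tuple[bytes, bytes]], partition: str
) -> Set[bytes]:
    """Return family names whose stored partition label matches `partition`.

    `rows` is a hypothetical stand-in for the gene family table:
    each item is (family_name, partition_label).
    """
    if partition == "all":
        return {name for name, _ in rows}
    # Assumption: stored labels start with an upper-case letter
    # (for example b"P", b"S1", b"C") and only that letter is compared.
    wanted = partition[0].upper().encode()
    return {name for name, label in rows if label[:1] == wanted}


rows = [(b"famA", b"P"), (b"famB", b"S1"), (b"famC", b"C")]
print(sorted(families_matching_partition(rows, "shell")))
```

Under these assumptions, a request for "shell" keeps every family whose label starts with "S", whatever follows that letter.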
- ppanggolin.formats.readBinaries.get_family_to_genome_count(h5f: File) Dict[bytes, int]
Computes the number of unique genomes associated with each gene family.
- Parameters:
h5f – The open HDF5 pangenome file containing contig, gene, and gene family data.
- Returns:
A dictionary mapping gene family names (as bytes) to the count of unique genomes.
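A minimal sketch of the counting step, assuming the gene-to-genome and gene-to-family relations are already available as dictionaries. The function name and both input dictionaries are hypothetical; the real function reads these relations from the HDF5 file.

```python
from collections import defaultdict
from typing import Dict, Set


def family_to_genome_count(
    gene_to_genome: Dict[bytes, bytes], gene_to_family: Dict[bytes, bytes]
) -> Dict[bytes, int]:
    """Count, for each family, the distinct genomes carrying one of its genes."""
    genomes_per_family: Dict[bytes, Set[bytes]] = defaultdict(set)
    for gene, family in gene_to_family.items():
        # A set keeps each genome once, even if it holds several genes
        # of the same family.
        genomes_per_family[family].add(gene_to_genome[gene])
    return {fam: len(genomes) for fam, genomes in genomes_per_family.items()}


gene_to_genome = {b"g1": b"A", b"g2": b"A", b"g3": b"B", b"g4": b"B"}
gene_to_family = {b"g1": b"fam1", b"g2": b"fam1", b"g3": b"fam1", b"g4": b"fam2"}
counts = family_to_genome_count(gene_to_genome, gene_to_family)
print(counts)
```

Here fam1 has three genes but only two genomes, so its count is 2.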
- ppanggolin.formats.readBinaries.get_gene_to_genome(h5f: File) Dict[bytes, bytes]
Generates a mapping between gene IDs and their corresponding genome.
- Parameters:
h5f – The open HDF5 pangenome file containing contig and gene annotations.
- Returns:
A dictionary mapping gene IDs to genome names.
- ppanggolin.formats.readBinaries.get_genes_from_families(h5f: File, families: List[bytes]) Set[bytes]
Retrieves a set of genes that belong to the specified families.
This function reads the gene family data from an HDF5 pangenome file and returns a set of genes that are part of the given list of gene families.
- Parameters:
h5f – The open HDF5 pangenome file containing gene family data.
families – A list of gene families (as bytes) to filter genes by.
- Returns:
A set of genes (as bytes) that belong to the specified families.
- ppanggolin.formats.readBinaries.get_need_info(pangenome, need_annotations: bool = False, need_families: bool = False, need_graph: bool = False, need_partitions: bool = False, need_rgp: bool = False, need_spots: bool = False, need_gene_sequences: bool = False, need_modules: bool = False, need_metadata: bool = False, metatypes: Set[str] | None = None, sources: Set[str] | None = None)
- ppanggolin.formats.readBinaries.get_non_redundant_gene_sequences_from_file(pangenome_filename: str, output: Path, add: str = '', disable_bar: bool = False)
Writes the non-redundant CDS sequences stored in a .h5 pangenome file to the output file, optionally prepending the string ‘add’ to each sequence identifier.
- Parameters:
pangenome_filename – Name of the pangenome file
output – Path to the output file
add – Add a prefix to sequence header
disable_bar – disable progress bar
- ppanggolin.formats.readBinaries.get_number_of_organisms(pangenome: Pangenome) int
Standalone function to get the number of organisms in a pangenome
- Parameters:
pangenome – Annotated pangenome
- Returns:
Number of organisms in the pangenome
- ppanggolin.formats.readBinaries.get_pangenome_parameters(h5f: File) Dict[str, Dict[str, Any]]
Read and return the pangenome parameters.
- Parameters:
h5f – Pangenome HDF5 file
- Returns:
A dictionary containing the name of the ppanggolin step as the key, and a dictionary of parameter names and their corresponding values used for that step.
- ppanggolin.formats.readBinaries.get_seqid_to_genes(h5f: File, genes: Set[bytes], get_all_genes: bool = False, disable_bar: bool = False) Dict[int, List[str]]
Creates a mapping of sequence IDs to gene names.
- Parameters:
h5f – The open HDF5 pangenome file containing gene sequence data.
genes – A set of gene names (as bytes) to include in the mapping (used only if get_all_genes is False).
get_all_genes – Boolean flag to indicate if all genes should be included in the mapping. If set to True, all genes will be added regardless of the genes parameter.
disable_bar – Boolean flag to disable the progress bar if set to True.
- Returns:
A dictionary mapping sequence IDs (integers) to lists of gene names (strings).
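The mapping can be sketched as follows. The function name and the `(gene_id, seq_id)` rows are hypothetical stand-ins for the gene sequence table. Decoding gene names from bytes to str is an assumption based on the documented return type.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def seqid_to_genes(
    rows: Iterable[Tuple[bytes, int]],
    genes: Set[bytes],
    get_all_genes: bool = False,
) -> Dict[int, List[str]]:
    """Group gene names by the ID of the sequence they point to."""
    mapping: Dict[int, List[str]] = defaultdict(list)
    for gene_id, seq_id in rows:
        # When get_all_genes is True the `genes` filter is ignored.
        if get_all_genes or gene_id in genes:
            mapping[seq_id].append(gene_id.decode())
    return dict(mapping)


rows = [(b"g1", 0), (b"g2", 0), (b"g3", 1)]
print(seqid_to_genes(rows, {b"g1", b"g3"}))
print(seqid_to_genes(rows, set(), get_all_genes=True))
```

Several genes may share one sequence ID, which is why the values are lists rather than single names.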
- ppanggolin.formats.readBinaries.get_soft_core_families(h5f: File, soft_core: float) Set[bytes]
Identifies gene families that are present in at least a specified proportion of genomes.
- Parameters:
h5f – The open HDF5 pangenome file containing gene family and genome data.
soft_core – The proportion of genomes (between 0 and 1) that a gene family must be present in to be considered a soft core family.
- Returns:
A set of gene family names (as bytes) that are classified as soft core.
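A sketch of the soft core rule, assuming the per-family genome counts are already computed. The function name and its inputs are hypothetical. The comparison uses "at least", following the description above.

```python
from typing import Dict, Set


def soft_core_families(
    family_to_genome_count: Dict[bytes, int], genome_count: int, soft_core: float
) -> Set[bytes]:
    """Keep families present in at least `soft_core` of all genomes."""
    threshold = genome_count * soft_core
    return {
        family
        for family, count in family_to_genome_count.items()
        if count >= threshold
    }


counts = {b"famA": 10, b"famB": 9, b"famC": 5}
print(sorted(soft_core_families(counts, genome_count=10, soft_core=0.95)))
```

With 10 genomes and a threshold of 0.95, a family needs at least 9.5 genomes, so only a family found in all 10 qualifies.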
- ppanggolin.formats.readBinaries.get_status(pangenome: Pangenome, pangenome_file: Path)
Checks which elements are already present in the file.
- Parameters:
pangenome – Blank pangenome
pangenome_file – path to the pangenome file
- ppanggolin.formats.readBinaries.read_annotation(pangenome: Pangenome, h5f: File, load_organisms: bool = True, load_contigs: bool = True, load_genes: bool = True, load_rnas: bool = True, chunk_size: int = 20000, disable_bar: bool = False)
Read annotation in pangenome hdf5 file to add in pangenome object
- Parameters:
pangenome – Pangenome object without annotation
h5f – Pangenome HDF5 file with annotation
load_organisms – Flag to load organisms
load_contigs – Flag to load contigs
load_genes – Flag to load genes
load_rnas – Flag to load RNAs
chunk_size – Size of chunks reading
disable_bar – Disable the progress bar
- ppanggolin.formats.readBinaries.read_chunks(table: Table, column: str | None = None, chunk: int = 10000)
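Reads the entire table (or a single column, if specified) chunk by chunk to limit RAM usage.
- Parameters:
table – Table to read
column – Name of the column to read (None means whole rows are read)
chunk – Number of rows read per chunk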
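The chunked-reading idea can be sketched with a generator. This is not the library's code: `read_in_chunks` is a hypothetical name and a plain Python sequence stands in for the PyTables table.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def read_in_chunks(rows: Sequence[T], chunk: int = 10000) -> Iterator[T]:
    """Yield every row while holding at most `chunk` rows in memory."""
    for start in range(0, len(rows), chunk):
        # Only this slice is materialised; with an on-disk table the
        # slice would be the only part loaded into RAM.
        for row in rows[start:start + chunk]:
            yield row


values = list(read_in_chunks(list(range(7)), chunk=3))
print(values)
```

The caller iterates over rows as usual. The chunk size only trades memory for the number of reads.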
- ppanggolin.formats.readBinaries.read_contigs(pangenome: Pangenome, table: Table, chunk_size: int = 20000, disable_bar: bool = False)
Read contig table in pangenome file to add them to the pangenome object
- Parameters:
pangenome – Pangenome object
table – Contig table
chunk_size – Size of the chunk reading
disable_bar – Disable progress bar
- ppanggolin.formats.readBinaries.read_gene_families(pangenome: Pangenome, h5f: File, disable_bar: bool = False)
Read gene families in pangenome hdf5 file to add in pangenome object
- Parameters:
pangenome – Pangenome object without gene families
h5f – Pangenome HDF5 file with gene families information
disable_bar – Disable the progress bar
- ppanggolin.formats.readBinaries.read_gene_families_info(pangenome: Pangenome, h5f: File, disable_bar: bool = False)
Read information about gene families in pangenome hdf5 file to add in pangenome object
- Parameters:
pangenome – Pangenome object without gene families information
h5f – Pangenome HDF5 file with gene families information
disable_bar – Disable the progress bar
- ppanggolin.formats.readBinaries.read_gene_sequences(pangenome: Pangenome, h5f: File, disable_bar: bool = False)
Read gene sequences in pangenome hdf5 file to add in pangenome object
- Parameters:
pangenome – Pangenome object without gene sequences associated with genes
h5f – Pangenome HDF5 file with gene sequences associated with genes
disable_bar – Disable the progress bar
- ppanggolin.formats.readBinaries.read_genedata(h5f: File) Dict[int, Genedata]
Reads the genedata table and returns a genedata_id2genedata dictionary
- Parameters:
h5f – the hdf5 file handler
- Returns:
dictionary linking genedata to the genedata identifier
- Raises:
KeyError – If a Genedata entry with joined coordinates is not found in the annotations.joinCoordinates table.
- ppanggolin.formats.readBinaries.read_genes(pangenome: Pangenome, table: Table, genedata_dict: Dict[int, Genedata], link: bool = True, chunk_size: int = 20000, disable_bar: bool = False)
Read genes in pangenome file to add them to the pangenome object
- Parameters:
pangenome – Pangenome object
table – Genes table
genedata_dict – Dictionary to link genedata with gene
link – Allow to link gene to organism and contig
chunk_size – Size of the chunk reading
disable_bar – Disable progress bar
- ppanggolin.formats.readBinaries.read_graph(pangenome: Pangenome, h5f: File, disable_bar: bool = False)
Read information about graph in pangenome hdf5 file to add in pangenome object
- Parameters:
pangenome – Pangenome object without graph information
h5f – Pangenome HDF5 file with graph information
disable_bar – Disable the progress bar
- ppanggolin.formats.readBinaries.read_info(h5f)
Read the pangenome content
- Parameters:
h5f – Pangenome HDF5 file
- ppanggolin.formats.readBinaries.read_join_coordinates(h5f: File) Dict[str, List[Tuple[int, int]]]
Read join coordinates from a HDF5 file and return a dictionary mapping genedata_id to coordinates.
- Parameters:
h5f – An HDF5 file object.
- Returns:
A dictionary mapping genedata_id to a list of tuples representing start and stop coordinates.
- ppanggolin.formats.readBinaries.read_metadata(pangenome: Pangenome, h5f: File, metatype: str, sources: Set[str] | None = None, disable_bar: bool = False)
Read metadata to add them to the pangenome object
- Parameters:
pangenome – Pangenome object
h5f – Pangenome file
metatype – Object type to associate metadata
sources – Source name of metadata
disable_bar – Disable progress bar
- ppanggolin.formats.readBinaries.read_module_families_from_pangenome_file(h5f: File, module_name: str) Set[bytes]
Retrieves gene families associated with a specified module from the pangenome file.
- Parameters:
h5f – The open HDF5 pangenome file containing module data.
module_name – The name of the module (as a string). The module ID is extracted from the name by removing the “module_” prefix.
- Returns:
A set of gene family names (as bytes) associated with the specified module.
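The prefix handling can be sketched as below. `module_id_from_name` is a hypothetical helper. Treating the remainder of the name as an integer ID and rejecting names without the prefix are both assumptions.

```python
def module_id_from_name(module_name: str) -> int:
    """Extract the numeric ID from a name such as "module_12"."""
    prefix = "module_"
    if not module_name.startswith(prefix):
        raise ValueError(f"Expected a name starting with {prefix!r}: {module_name!r}")
    # Assumption: what follows the prefix is an integer identifier.
    return int(module_name[len(prefix):])


print(module_id_from_name("module_12"))
```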
- ppanggolin.formats.readBinaries.read_modules(pangenome: Pangenome, h5f: File, disable_bar: bool = False)
Read modules in pangenome hdf5 file to add in pangenome object
- Parameters:
pangenome – Pangenome object without modules
h5f – Pangenome HDF5 file with modules computed
disable_bar – Disable the progress bar
- ppanggolin.formats.readBinaries.read_organisms(pangenome: Pangenome, table: Table, chunk_size: int = 20000, disable_bar: bool = False)
Read organism table in pangenome file to add them to the pangenome object
- Parameters:
pangenome – Pangenome object
table – Organism table
chunk_size – Size of the chunk reading
disable_bar – Disable progress bar
- ppanggolin.formats.readBinaries.read_pangenome(pangenome, annotation: bool = False, gene_families: bool = False, graph: bool = False, rgp: bool = False, spots: bool = False, gene_sequences: bool = False, modules: bool = False, metadata: bool = False, metatypes: Set[str] | None = None, sources: Set[str] | None = None, disable_bar: bool = False)
Reads a previously written pangenome, with all of its parts, depending on what is asked, with regard to what is filled in the ‘status’ field of the hdf5 file.
- Parameters:
pangenome – Pangenome object without some information
annotation – get annotation
gene_families – get gene families
graph – get graph
rgp – get RGP
spots – get hotspot
gene_sequences – get gene sequences
modules – get modules
metadata – get metadata
metatypes – metatypes of the metadata to get
sources – sources of the metadata to get (None means all sources)
disable_bar – Disable the progress bar
- ppanggolin.formats.readBinaries.read_parameters(h5f: File)
Read pangenome parameters
- Parameters:
h5f – Pangenome HDF5 file
- ppanggolin.formats.readBinaries.read_rgp(pangenome: Pangenome, h5f: File, disable_bar: bool = False)
Read region of genomic plasticity in pangenome hdf5 file to add in pangenome object
- Parameters:
pangenome – Pangenome object without RGP
h5f – Pangenome HDF5 file with RGP computed
disable_bar – Disable the progress bar
- ppanggolin.formats.readBinaries.read_rgp_genes_from_pangenome_file(h5f: File) Set[bytes]
Retrieves the set of RGP genes from the pangenome file.
- Parameters:
h5f – The open HDF5 pangenome file containing RGP gene data.
- Returns:
A set of gene names (as bytes) from the RGP.
- ppanggolin.formats.readBinaries.read_rnas(pangenome: Pangenome, table: Table, genedata_dict: Dict[int, Genedata], link: bool = True, chunk_size: int = 20000, disable_bar: bool = False)
Read RNAs in pangenome file to add them to the pangenome object
- Parameters:
pangenome – Pangenome object
table – RNAs table
genedata_dict – Dictionary to link genedata with gene
link – Allow to link gene to organism and contig
chunk_size – Size of the chunk reading
disable_bar – Disable progress bar
- ppanggolin.formats.readBinaries.read_sequences(h5f: File) dict
Reads the sequences table and returns a sequence id to sequence dictionary
- Parameters:
h5f – the hdf5 file handler
- Returns:
dictionary linking sequences to the seq identifier
- ppanggolin.formats.readBinaries.read_spots(pangenome: Pangenome, h5f: File, disable_bar: bool = False)
Read hotspots in the pangenome HDF5 file and add them to the pangenome object.
- Parameters:
pangenome – Pangenome object without spot
h5f – Pangenome HDF5 file with spot computed
disable_bar – Disable the progress bar
- ppanggolin.formats.readBinaries.write_fasta_gene_fam_from_pangenome_file(pangenome_filename: str, output: Path, family_filter: str, soft_core: float = 0.95, compress: bool = False, disable_bar=False)
Write representative nucleotide sequences of gene families
- Parameters:
pangenome_filename – Name of the pangenome file
output – Path to output directory
family_filter – Selected partition of gene families
soft_core – Soft core threshold to use
compress – Compress the file in .gz
disable_bar – Disable progress bar
- ppanggolin.formats.readBinaries.write_fasta_prot_fam_from_pangenome_file(pangenome_filename: str, output: Path, family_filter: str, soft_core: float = 0.95, compress: bool = False, disable_bar=False)
Write representative amino acid sequences of gene families.
- Parameters:
pangenome_filename – Name of the pangenome file
output – Path to output directory
family_filter – Selected partition of protein families
soft_core – Soft core threshold to use
compress – Compress the file in .gz
disable_bar – Disable progress bar
- ppanggolin.formats.readBinaries.write_gene_sequences_from_pangenome_file(pangenome_filename: str, output: Path, list_cds: Iterator | None = None, add: str = '', compress: bool = False, disable_bar: bool = False)
Writes the CDS sequences stored in a .h5 pangenome file to the output file, optionally restricted to a list of CDS, and prepends the string ‘add’, if given, to each sequence identifier.
- Parameters:
pangenome_filename – Name of the pangenome file
output – Path to the sequences file
list_cds – An iterable object of CDS
add – Add a prefix to sequence header
compress – Compress the output file
disable_bar – Disable progress bar
- ppanggolin.formats.readBinaries.write_genes_from_pangenome_file(pangenome_filename: str, output: Path, gene_filter: str, soft_core: float = 0.95, compress: bool = False, disable_bar=False)
Write representative nucleotide sequences of gene families
- Parameters:
pangenome_filename – Name of the pangenome file
output – Path to output directory
gene_filter – Selected partition of gene families
soft_core – Soft core threshold to use
compress – Compress the file in .gz
disable_bar – Disable progress bar
- ppanggolin.formats.readBinaries.write_genes_seq_from_pangenome_file(h5f: File, outpath: Path, compress: bool, seq_id_to_genes: Dict[int, List[str]], disable_bar: bool)
Writes gene sequences from the pangenome file to an output file.
Only sequences whose IDs match the ones in seq_id_to_genes will be written.
- Parameters:
h5f – The open HDF5 pangenome file containing sequence data.
outpath – The path to the output file where sequences will be written.
compress – Boolean flag to indicate whether output should be compressed.
seq_id_to_genes – A dictionary mapping sequence IDs to lists of gene names.
disable_bar – Boolean flag to disable the progress bar if set to True.
ppanggolin.formats.writeAnnotations module
- ppanggolin.formats.writeAnnotations.contig_desc(contig_len: int, org_len: int) Dict[str, NewCol]
Table description to save contig-related information
- Parameters:
contig_len – Maximum size of contig name
org_len – Maximum size of organism name.
- Returns:
Formatted table
- ppanggolin.formats.writeAnnotations.gene_desc(id_len: int, max_local_id: int) Dict[str, NewCol]
Table description to save gene-related information
- Parameters:
id_len – Maximum size of gene name
max_local_id – Maximum size of gene local identifier
- Returns:
Formatted table
- ppanggolin.formats.writeAnnotations.gene_joined_coordinates_desc() Dict[str, NewCol]
Creates a table description for the joined coordinates of genes. The function takes no parameters.
- Returns:
Formatted table for joined gene coordinates
- ppanggolin.formats.writeAnnotations.gene_sequences_desc(gene_id_len: int, gene_type_len: int) Dict[str, NewCol]
Create table to save gene sequences
- Parameters:
gene_id_len – Maximum size of gene sequence identifier
gene_type_len – Maximum size of gene type
- Returns:
Formatted table
- ppanggolin.formats.writeAnnotations.genedata_desc(type_len: int, name_len: int, product_len: int) Dict[str, NewCol]
Creates a table for gene-related data
- Parameters:
type_len – Maximum size of gene Type.
name_len – Maximum size of gene name
product_len – Maximum size of gene product
- Returns:
Formatted table for gene metadata
- ppanggolin.formats.writeAnnotations.get_gene_sequences_len(pangenome: Pangenome) Tuple[int, int]
Get the maximum size of gene sequences to optimize disk space
- Parameters:
pangenome – Annotated pangenome
- Returns:
maximum size of each annotation
- ppanggolin.formats.writeAnnotations.get_genedata(feature: Gene | RNA) Genedata
Gets the genedata type of Feature
- Parameters:
feature – Gene or RNA object
- Returns:
Genedata object holding the data associated with the feature
- ppanggolin.formats.writeAnnotations.get_max_len_annotations(pangenome: Pangenome) Tuple[int, int, int, int, int]
Get the maximum size of each annotation information to optimize disk space
- Parameters:
pangenome – Annotated pangenome
- Returns:
Maximum size of each annotation
- ppanggolin.formats.writeAnnotations.get_max_len_genedata(pangenome: Pangenome) Tuple[int, int, int]
Get the maximum size of each gene data information to optimize disk space
- Parameters:
pangenome – Annotated pangenome
- Returns:
maximum size of each annotation
- ppanggolin.formats.writeAnnotations.get_sequence_len(pangenome: Pangenome) int
Get the maximum size of gene sequences to optimize disk space
- Parameters:
pangenome – Annotated pangenome
- Returns:
maximum size of each annotation
- ppanggolin.formats.writeAnnotations.organism_desc(org_len: int) Dict[str, NewCol]
Table description to save organism-related information
- Parameters:
org_len – Maximum size of organism name.
- Returns:
Formatted table
- ppanggolin.formats.writeAnnotations.rna_desc(id_len: int) Dict[str, NewCol]
Table description to save RNA-related information
- Parameters:
id_len – Maximum size of RNA identifier
- Returns:
Formatted table
- ppanggolin.formats.writeAnnotations.sequence_desc(max_seq_len: int) Dict[str, NewCol]
Table description to save sequences
- Parameters:
max_seq_len – Maximum size of a sequence
- Returns:
Formatted table
- ppanggolin.formats.writeAnnotations.write_annotations(pangenome: Pangenome, h5f: File, rec_organisms: bool = True, rec_contigs: bool = True, rec_genes: bool = True, rec_rnas: bool = True, disable_bar: bool = False)
Function writing all the pangenome annotations
- Parameters:
pangenome – Annotated pangenome
h5f – Pangenome HDF5 file
rec_organisms – Allow writing organisms in pangenomes
rec_contigs – Allow writing contigs in pangenomes
rec_genes – Allow writing genes in pangenomes
rec_rnas – Allow writing RNAs in pangenomes
disable_bar – Disable progress bar
- ppanggolin.formats.writeAnnotations.write_contigs(pangenome: Pangenome, h5f: File, annotation: Group, contig_desc: Dict[str, NewCol], disable_bar=False)
Write contigs information in the pangenome file
- Parameters:
pangenome – Annotated pangenome object
h5f – Pangenome file
annotation – Annotation table group
contig_desc – Contigs table description
disable_bar – Allow disabling progress bar
- ppanggolin.formats.writeAnnotations.write_gene_joined_coordinates(h5f, annotation, genes_with_joined_coordinates_2_id, disable_bar)
Writes the joined coordinates of genes in the pangenome file
- Parameters:
h5f – Pangenome file
annotation – Annotation group in Table
genes_with_joined_coordinates_2_id – Dictionary linking genes with joined coordinates to their identifier.
disable_bar – Allow disabling progress bar
- ppanggolin.formats.writeAnnotations.write_gene_sequences(pangenome: Pangenome, h5f: File, disable_bar: bool = False)
Function writing all the pangenome gene sequences
- Parameters:
pangenome – Pangenome with gene sequences
h5f – Pangenome HDF5 file without sequences
disable_bar – Disable progress bar
- ppanggolin.formats.writeAnnotations.write_genedata(pangenome: Pangenome, h5f: File, annotation: Group, genedata2gene: Dict[Genedata, int], disable_bar=False)
Writing genedata information in pangenome file
- Parameters:
pangenome – Pangenome object filled with annotation.
h5f – Pangenome file
annotation – Annotation group in Table
genedata2gene – Dictionary linking genedata to gene identifier.
disable_bar – Allow disabling progress bar
- ppanggolin.formats.writeAnnotations.write_genes(pangenome: Pangenome, h5f: File, annotation: Group, gene_desc: Dict[str, NewCol], disable_bar=False) Dict[Genedata, int]
Write genes information in the pangenome file
- Parameters:
pangenome – Annotated pangenome object
h5f – Pangenome file
annotation – Annotation table group
gene_desc – Genes table description
disable_bar – Disable progress bar
- Returns:
Dictionary linking genedata to gene identifier
- ppanggolin.formats.writeAnnotations.write_organisms(pangenome: Pangenome, h5f: File, annotation: Group, organism_desc: Dict[str, NewCol], disable_bar=False)
Write organisms information in the pangenome file
- Parameters:
pangenome – Annotated pangenome object
h5f – Pangenome file
annotation – Annotation table group
organism_desc – Organisms table description.
disable_bar – Allow disabling progress bar
- ppanggolin.formats.writeAnnotations.write_rnas(pangenome: Pangenome, h5f: File, annotation: Group, rna_desc: Dict[str, NewCol], disable_bar=False) Dict[Genedata, int]
Write RNAs information in the pangenome file
- Parameters:
pangenome – Annotated pangenome object
h5f – Pangenome file
annotation – Annotation table group
rna_desc – RNAs table description
disable_bar – Disable progress bar
- Returns:
Dictionary linking genedata to RNA identifier
ppanggolin.formats.writeBinaries module
- ppanggolin.formats.writeBinaries.erase_pangenome(pangenome: Pangenome, graph: bool = False, gene_families: bool = False, partition: bool = False, rgp: bool = False, spots: bool = False, modules: bool = False, metadata: bool = False, metatype: str | None = None, source: str | None = None)
Erases tables from a pangenome .h5 file
- Parameters:
pangenome – Pangenome
graph – remove graph information
gene_families – remove gene families information
partition – remove partition information
rgp – remove rgp information
spots – remove spots information
modules – remove modules information
metadata – remove metadata information
metatype – Object type whose metadata is removed
source – Source name of the metadata to remove
- ppanggolin.formats.writeBinaries.gene_fam_desc(max_name_len: int, max_sequence_length: int, max_part_len: int) dict
Create a formatted table for gene families description
- Parameters:
max_name_len – Maximum size of gene family name
max_sequence_length – Maximum size of gene family representing gene sequences
max_part_len – Maximum size of gene family partition
- Returns:
Formatted table
- ppanggolin.formats.writeBinaries.gene_to_fam_desc(gene_fam_name_len: int, gene_id_len: int) dict
Create a formatted table for gene in gene families information
- Parameters:
gene_fam_name_len – Maximum size of gene family names
gene_id_len – Maximum size of gene identifier
- Returns:
formatted table
- ppanggolin.formats.writeBinaries.get_gene_fam_len(pangenome: Pangenome) Tuple[int, int, int]
Get maximum size of gene families information
- Parameters:
pangenome – Pangenome with gene families computed
- Returns:
Maximum size of each element
- ppanggolin.formats.writeBinaries.get_gene_id_len(pangenome: Pangenome) int
Get maximum size of gene id in pangenome graph
- Parameters:
pangenome – Pangenome with graph computed
- Returns:
Maximum size of gene id
- ppanggolin.formats.writeBinaries.get_gene_to_fam_len(pangenome: Pangenome)
Get maximum size of gene in gene families information
- Parameters:
pangenome – Pangenome with gene families computed
- Returns:
Maximum size of each element
- ppanggolin.formats.writeBinaries.get_mod_desc(pangenome: Pangenome) int
Get maximum size of gene families name in modules
- Parameters:
pangenome – Pangenome with modules computed
- Returns:
Maximum size of each element
- ppanggolin.formats.writeBinaries.get_rgp_len(pangenome: Pangenome) Tuple[int, int]
Get maximum size of region of genomic plasticity and gene
- Parameters:
pangenome – Pangenome with gene families computed
- Returns:
Maximum size of each element
- ppanggolin.formats.writeBinaries.get_spot_desc(pangenome: Pangenome) int
Get maximum size of region of genomic plasticity in hotspot
- Parameters:
pangenome – Pangenome with gene families computed
- Returns:
Maximum size of each element
- ppanggolin.formats.writeBinaries.getmax(arg: iter) float
Get the maximum of the arguments if any exist, else 0
- Parameters:
arg – list of values
- Returns:
return the maximum
- ppanggolin.formats.writeBinaries.getmean(arg: iter) float
Compute the mean of the arguments if any exist, else 0
- Parameters:
arg – list of values
- Returns:
return the mean
- ppanggolin.formats.writeBinaries.getmin(arg: iter) float
Get the minimum of the arguments if any exist, else 0
- Parameters:
arg – list of values
- Returns:
return the minimum
- ppanggolin.formats.writeBinaries.getstdev(arg: iter) float
Compute the standard deviation of the arguments if any exist, else 0
- Parameters:
arg – list of values
- Returns:
return the sd
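The "0 if nothing to compute" convention shared by these four helpers can be sketched with the standard library. The `safe_*` names are hypothetical and this is not ppanggolin's code. Returning 0 for the standard deviation of a single value is an assumption, made because `statistics.stdev` needs at least two data points.

```python
import statistics
from typing import Sequence


def safe_max(values: Sequence[float]) -> float:
    """Maximum of `values`, or 0 when the sequence is empty."""
    return max(values) if values else 0


def safe_min(values: Sequence[float]) -> float:
    """Minimum of `values`, or 0 when the sequence is empty."""
    return min(values) if values else 0


def safe_mean(values: Sequence[float]) -> float:
    """Arithmetic mean of `values`, or 0 when the sequence is empty."""
    return statistics.mean(values) if values else 0


def safe_stdev(values: Sequence[float]) -> float:
    """Sample standard deviation, or 0 with fewer than two values."""
    return statistics.stdev(values) if len(values) > 1 else 0


print(safe_max([]), safe_mean([1, 2, 3]), safe_stdev([1, 2, 3]))
```

Falling back to 0 lets summary statistics be written even when a pangenome has no element of a given kind, at the cost of not distinguishing "no data" from a true zero.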
- ppanggolin.formats.writeBinaries.graph_desc(max_gene_id_len)
Create a formatted table for pangenome graph
- Parameters:
max_gene_id_len – Maximum size of gene id
- Returns:
formatted table
- ppanggolin.formats.writeBinaries.mod_desc(gene_fam_name_len)
Create a formatted table for modules
- Parameters:
gene_fam_name_len – Maximum size of gene families name
- Returns:
formatted table
- ppanggolin.formats.writeBinaries.rgp_desc(max_rgp_len, max_gene_len)
Create a formatted table for region of genomic plasticity
- Parameters:
max_rgp_len – Maximum size of RGP
max_gene_len – Maximum size of gene
- Returns:
formatted table
- ppanggolin.formats.writeBinaries.spot_desc(max_rgp_len)
Create a formatted table for hotspot
- Parameters:
max_rgp_len – Maximum size of RGP
- Returns:
formatted table
- ppanggolin.formats.writeBinaries.update_gene_fam_partition(pangenome: Pangenome, h5f: File, disable_bar: bool = False)
Update the gene families table with partition information
- Parameters:
pangenome – Partitioned pangenome
h5f – HDF5 file with gene families
disable_bar – Disable progress bar
- ppanggolin.formats.writeBinaries.update_gene_fragments(pangenome: Pangenome, h5f: File, disable_bar: bool = False)
Updates the annotation table with the fragmentation information from the defrag pipeline
- Parameters:
pangenome – Annotated pangenome
h5f – HDF5 pangenome file
disable_bar – Disable progress bar
- ppanggolin.formats.writeBinaries.write_gene_fam_info(pangenome: Pangenome, h5f: File, force: bool = False, disable_bar: bool = False)
Writing a table containing the protein sequences of each family
- Parameters:
pangenome – Pangenome with gene families computed
h5f – HDF5 file to write gene families
force – Force writing if previous information already exists
disable_bar – Disable progress bar
- ppanggolin.formats.writeBinaries.write_gene_families(pangenome: Pangenome, h5f: File, force: bool = False, disable_bar: bool = False)
Function writing all the pangenome gene families
- Parameters:
pangenome – pangenome with gene families computed
h5f – HDF5 file to save pangenome with gene families
force – Force to write gene families in hdf5 file if there is already gene families
disable_bar – Disable progress bar
- ppanggolin.formats.writeBinaries.write_graph(pangenome: Pangenome, h5f: File, force: bool = False, disable_bar: bool = False)
Function writing the pangenome graph
- Parameters:
pangenome – pangenome with graph computed
h5f – HDF5 file to save pangenome graph
force – Force to write graph in hdf5 file if there is already one
disable_bar – Disable progress bar
- ppanggolin.formats.writeBinaries.write_info(pangenome: Pangenome, h5f: File)
Writes information and numbers to be eventually called with the ‘info’ submodule
- Parameters:
pangenome – Pangenome object with some information computed
h5f – Pangenome file to save information
- ppanggolin.formats.writeBinaries.write_info_modules(pangenome: Pangenome, h5f: File)
Writes information about modules
- Parameters:
pangenome – Pangenome object with some information computed
h5f – Pangenome file to save information
- ppanggolin.formats.writeBinaries.write_modules(pangenome: Pangenome, h5f: File, force: bool = False, disable_bar: bool = False)
Function writing all the pangenome modules
- Parameters:
pangenome – pangenome with modules computed
h5f – HDF5 file to save pangenome with modules
force – Force writing modules to the HDF5 file even if modules are already present
disable_bar – Disable progress bar
- ppanggolin.formats.writeBinaries.write_pangenome(pangenome: Pangenome, filename, force: bool = False, disable_bar: bool = False)
Writes or updates a pangenome file
- Parameters:
pangenome – pangenome object
filename – HDF5 file to save pangenome
force – force to write on pangenome if information already exist
disable_bar – Disable the progress bar
- ppanggolin.formats.writeBinaries.write_rgp(pangenome: Pangenome, h5f: File, force: bool = False, disable_bar: bool = False)
Function writing all the regions of genomic plasticity (RGP) of the pangenome
- Parameters:
pangenome – pangenome with RGP computed
h5f – HDF5 file to save pangenome with RGP
force – Force writing RGP to the HDF5 file even if RGP are already present
disable_bar – Disable progress bar
- ppanggolin.formats.writeBinaries.write_spots(pangenome: Pangenome, h5f: File, force: bool = False, disable_bar: bool = False)
Function writing all the pangenome hotspots
- Parameters:
pangenome – pangenome with spot computed
h5f – HDF5 file to save pangenome with spot
force – Force writing spots to the HDF5 file even if spots are already present
disable_bar – Disable progress bar
ppanggolin.formats.writeFlatGenomes module
- ppanggolin.formats.writeFlatGenomes.convert_overlapping_coordinates_for_gff(coordinates: List[Tuple[int, int]], contig_length: int)
Converts overlapping gene coordinates in GFF format for circular contigs.
- Parameters:
coordinates – List of tuples representing gene coordinates.
contig_length – Length of the circular contig.
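To illustrate the idea, the sketch below unwraps the parts of a feature that crosses the origin of a circular contig, so that the end coordinate exceeds the contig length, as the GFF3 specification permits for circular landmarks. The name `unwrap_circular_coordinates` and the merging rule are assumptions made for this example; this is not the PPanGGOLiN implementation.

```python
from typing import List, Tuple


def unwrap_circular_coordinates(
    coordinates: List[Tuple[int, int]], contig_length: int
) -> List[Tuple[int, int]]:
    """Shift parts located after the origin by the contig length (hypothetical helper)."""
    if not coordinates:
        return []

    shifted = []
    offset = 0
    previous_start = None
    for start, stop in coordinates:
        # A part that starts before the previous one means the origin was crossed.
        if previous_start is not None and start + offset < previous_start:
            offset += contig_length
        shifted.append((start + offset, stop + offset))
        previous_start = start + offset

    # Merge parts that are now directly adjacent into a single interval.
    merged = [shifted[0]]
    for start, stop in shifted[1:]:
        last_start, last_stop = merged[-1]
        if start == last_stop + 1:
            merged[-1] = (last_start, stop)
        else:
            merged.append((start, stop))
    return merged


wrapped = unwrap_circular_coordinates([(4990, 5000), (1, 10)], contig_length=5000)
```

With a contig of length 5000, a gene described by the parts (4990, 5000) and (1, 10) becomes the single interval (4990, 5010); coordinates that do not cross the origin are returned unchanged.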
- ppanggolin.formats.writeFlatGenomes.count_neighbors_partitions(gene_family: GeneFamily)
Count partition of neighbors families.
- Parameters:
gene_family – Gene family for which we count neighbors
- ppanggolin.formats.writeFlatGenomes.encode_attribute_val(product: str) str
Encode special characters forbidden in column 9 of the GFF3 format.
- Parameters:
product – The input string to encode.
- Returns:
The encoded string with special characters replaced.
Reference:
GFF3 format requirement: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
Code source taken from Bakta: https://github.com/oschwengers/bakta
- ppanggolin.formats.writeFlatGenomes.encode_attributes(attributes: List[Tuple]) str
Encode a list of attributes in GFF3 format.
- Parameters:
attributes – A list of attribute key-value pairs represented as tuples.
- Returns:
The encoded attributes as a semicolon-separated string.
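The two encoding helpers above can be pictured with the following minimal sketch. It percent-encodes the characters that the GFF3 specification reserves in column 9 and joins key-value pairs with semicolons. The names `encode_value` and `join_attributes` are hypothetical, and whether PPanGGOLiN encodes exactly this character set is an assumption.

```python
from typing import List, Tuple

# Characters with a reserved meaning in GFF3 column 9, mapped to their
# percent-encoded form (assumed set, based on the GFF3 specification).
_GFF3_RESERVED = str.maketrans(
    {
        "%": "%25",
        ";": "%3B",
        "=": "%3D",
        "&": "%26",
        ",": "%2C",
        "\t": "%09",
        "\n": "%0A",
        "\r": "%0D",
    }
)


def encode_value(value: str) -> str:
    """Percent-encode reserved characters in one attribute value."""
    return str(value).translate(_GFF3_RESERVED)


def join_attributes(attributes: List[Tuple[str, str]]) -> str:
    """Build the column 9 string from (key, value) pairs."""
    return ";".join(f"{key}={encode_value(value)}" for key, value in attributes)


column_9 = join_attributes([("ID", "gene_1"), ("product", "kinase; ATP=binding")])
```

Using `str.translate` handles every character in a single pass, so a literal `%` is encoded once and is never re-encoded by a later replacement.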
- ppanggolin.formats.writeFlatGenomes.get_organism_list(organisms_filt: str, pangenome: Pangenome) Set[Organism]
Get a list of organisms to include in the output.
- Parameters:
organisms_filt – Filter for selecting organisms. It can be a file path with one organism name per line or a comma-separated list of organism names.
pangenome – The pangenome from which organisms will be selected.
- Returns:
A set of selected Organism objects.
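A filter of this kind can be resolved roughly as sketched below. The function works on plain names rather than Organism objects, and the name `parse_name_filter`, the handling of the keyword `all`, and the `ValueError` raised for unknown names are assumptions made for illustration.

```python
from pathlib import Path
from typing import Set


def parse_name_filter(names_filter: str, known_names: Set[str]) -> Set[str]:
    """Resolve a filter string into a set of names (hypothetical helper)."""
    if names_filter == "all":
        return set(known_names)

    candidate = Path(names_filter)
    if candidate.is_file():
        # One name per line; blank lines are ignored.
        with candidate.open() as handle:
            requested = {line.strip() for line in handle if line.strip()}
    else:
        # Otherwise treat the string as a comma-separated list.
        requested = {name.strip() for name in names_filter.split(",") if name.strip()}

    unknown = requested - known_names
    if unknown:
        raise ValueError(f"Unknown names: {sorted(unknown)}")
    return requested


selected = parse_name_filter("genomeA, genomeB", {"genomeA", "genomeB", "genomeC"})
```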
- ppanggolin.formats.writeFlatGenomes.launch(args: Namespace)
Command launcher
- Parameters:
args – All arguments provided by the user
- ppanggolin.formats.writeFlatGenomes.manage_module_colors(modules: Set[Module], window_size: int = 100) Dict[Module, str]
Manages colors for a list of modules based on gene positions and a specified window size.
- Parameters:
modules – A set of module objects for which you want to determine colors.
window_size – Minimum number of genes between two modules to color them with the same color. A higher value results in more module colors.
- Returns:
A dictionary that maps each module to its assigned color.
- ppanggolin.formats.writeFlatGenomes.mp_write_genomes_file(organism: Organism, output: Path, genome_file: Path | None = None, proksee: bool = False, gff: bool = False, table: bool = False, **kwargs) str
Wrapper for the write_genomes_file function that allows it to be used in multiprocessing.
- Parameters:
organism – Specify the organism to be written
output – Specify the path to the output directory
genome_file – Read the genome sequences from a file
proksee – Write a proksee file for the organism
gff – Write the gff file for the organism
table – Write the table file for the organism
kwargs – Pass any number of keyword arguments to the function
- Returns:
The organism name
- ppanggolin.formats.writeFlatGenomes.palette(nb_colors: int) List[str]
Generates a palette of colors for visual representation.
- Parameters:
nb_colors – The number of colors needed in the palette.
- Returns:
A list of color codes in hexadecimal format.
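One common way to obtain such a palette is to spread hues evenly around the color wheel. The sketch below does that with the standard `colorsys` module; the name `evenly_spaced_palette` and the fixed saturation and brightness are assumptions, and the actual function may pick colors differently.

```python
import colorsys
from typing import List


def evenly_spaced_palette(nb_colors: int) -> List[str]:
    """Return nb_colors hexadecimal color codes with evenly spaced hues."""
    colors = []
    for index in range(nb_colors):
        hue = index / nb_colors
        red, green, blue = colorsys.hsv_to_rgb(hue, 0.7, 0.9)
        colors.append(
            "#{:02x}{:02x}{:02x}".format(
                round(red * 255), round(green * 255), round(blue * 255)
            )
        )
    return colors


five_colors = evenly_spaced_palette(5)
```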
- ppanggolin.formats.writeFlatGenomes.parser_flat(parser: ArgumentParser)
Parser for specific argument of write command
- Parameters:
parser – Parser to which the arguments of this command are added
- ppanggolin.formats.writeFlatGenomes.subparser(sub_parser: _SubParsersAction) ArgumentParser
Subparser to launch PPanGGOLiN in Command line
- Parameters:
sub_parser – Sub-parser object to which this command is added
- Returns:
Parser holding the arguments of this command
- ppanggolin.formats.writeFlatGenomes.write_flat_genome_files(pangenome: Pangenome, output: Path, table: bool = False, gff: bool = False, proksee: bool = False, compress: bool = False, fasta: Path | None = None, anno: Path | None = None, organisms_filt: str = 'all', add_metadata: bool = False, metadata_sep: str = '|', metadata_sources: List[str] | None = None, cpu: int = 1, disable_bar: bool = False)
Main function to write flat files from pangenome
- Parameters:
pangenome – Pangenome object
output – Path to output directory
cpu – Number of available cores
table – write table with pangenome annotation for each genome
gff – write a gff file with pangenome annotation for each organism
proksee – write a proksee file with pangenome annotation for each organism
compress – Compress the file in .gz
disable_bar – Disable progress bar
fasta – File containing the list of FASTA files for each organism
anno – File containing the list of GBFF/GFF files for each organism
organisms_filt – String used to specify which organisms to write. If 'all', all organisms are written.
add_metadata – Add metadata to GFF files
metadata_sep – The separator used to join multiple metadata values
metadata_sources – Sources of the metadata to use and write in the outputs. None means all sources are used.
- ppanggolin.formats.writeFlatGenomes.write_gff_file(organism: Organism, outdir: Path, annotation_sources: Dict[str, str], genome_sequences: Dict[str, str], metadata_sep: str = '|', compress: bool = False)
Write the GFF file of the provided organism.
- Parameters:
organism – Organism object for which the GFF file is being written.
outdir – Path to the output directory where the GFF file will be written.
metadata_sep – The separator used to join multiple metadata values
compress – If True, compress the output GFF file using .gz format.
annotation_sources – A dictionary that maps types of features to their source information.
genome_sequences – A dictionary mapping contig names to their DNA sequences.
- ppanggolin.formats.writeFlatGenomes.write_tsv_genome_file(organism: Organism, output: Path, compress: bool = False, metadata_sep: str = '|', need_regions: bool = False, need_spots: bool = False, need_modules: bool = False)
Write the table of genes with pangenome annotation for one organism in tsv
- Parameters:
organism – An organism
output – Path to output directory
compress – Compress the file in .gz
need_regions – Write information about regions
need_spots – Write information about spots
need_modules – Write information about modules
ppanggolin.formats.writeFlatPangenome module
- ppanggolin.formats.writeFlatPangenome.launch(args: Namespace)
Command launcher
- Parameters:
args – All arguments provided by the user
- ppanggolin.formats.writeFlatPangenome.parser_flat(parser: ArgumentParser)
Parser for specific argument of write command
- Parameters:
parser – Parser to which the arguments of this command are added
- ppanggolin.formats.writeFlatPangenome.spot2rgp(spots: set, output: Path, compress: bool = False)
Write a tsv file providing association between spot and rgp
- Parameters:
spots – set of spots in pangenome
output – Path to output directory
compress – Compress the file in .gz
- ppanggolin.formats.writeFlatPangenome.subparser(sub_parser: _SubParsersAction) ArgumentParser
Subparser to launch PPanGGOLiN in Command line
- Parameters:
sub_parser – Sub-parser object to which this command is added
- Returns:
Parser holding the arguments of this command
- ppanggolin.formats.writeFlatPangenome.summarize_genome(organism: Organism, pangenome_persistent_count: int, pangenome_persistent_single_copy_families: Set[GeneFamily], soft_core_families: Set[GeneFamily], exact_core_families: Set[GeneFamily], rgp_count: int, spot_count: int, module_count: int) Dict[str, any]
Summarizes genomic information of an organism.
- Parameters:
organism – The organism for which the genome is being summarized.
pangenome_persistent_count – Count of persistent genes in the pangenome.
pangenome_persistent_single_copy_families – Set of gene families considered as persistent single-copy in the pangenome.
soft_core_families – soft core families of the pangenome
exact_core_families – exact core families of the pangenome
rgp_count – Number of regions of genomic plasticity in the organism. None if not computed.
spot_count – Number of spots in the organism. None if not computed.
module_count – Number of modules in the organism. None if not computed.
- Returns:
A dictionary containing various summary information about the genome.
- ppanggolin.formats.writeFlatPangenome.summarize_spots(spots: set, output: Path, compress: bool = False, file_name='summarize_spots.tsv')
Write a file providing summary information about hotspots
- Parameters:
spots – set of spots in pangenome
output – Path to output directory
compress – Compress the file in .gz
file_name – Name of the output file
- ppanggolin.formats.writeFlatPangenome.write_borders(output: Path, dup_margin: float = 0.05, compress: bool = False)
Write all gene families bordering each spot
- Parameters:
output – Path to output directory
compress – Compress the file in .gz
dup_margin – minimum ratio of organisms in which family must have multiple genes to be considered duplicated
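The role of `dup_margin` can be sketched as follows: a family counts as duplicated when the share of genomes carrying more than one of its genes reaches the margin. The helper name `is_duplicated`, the use of `>=`, and computing the ratio over the genomes that carry the family are all assumptions made for this example.

```python
from typing import Dict


def is_duplicated(genes_per_genome: Dict[str, int], dup_margin: float = 0.05) -> bool:
    """Tell whether a family is duplicated, given its gene count in each genome."""
    if not genes_per_genome:
        return False
    multi_copy = sum(1 for count in genes_per_genome.values() if count > 1)
    return multi_copy / len(genes_per_genome) >= dup_margin


family_counts = {"genomeA": 2, "genomeB": 1, "genomeC": 1, "genomeD": 1}
duplicated = is_duplicated(family_counts)  # 1 genome out of 4 has several copies
```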
- ppanggolin.formats.writeFlatPangenome.write_gene_families_tsv(output: Path, compress: bool = False, disable_bar: bool = False)
Write the file providing the association between genes and gene families
- Parameters:
output – Path to output directory
compress – Compress the file in .gz
disable_bar – Flag to disable progress bar
- ppanggolin.formats.writeFlatPangenome.write_gene_presence_absence(output: Path, compress: bool = False)
Write the gene presence absence matrix
- Parameters:
output – Path to output directory
compress – Compress the file in .gz
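A presence/absence matrix has one row per gene family and one column per genome, with 1 marking presence and 0 marking absence. The sketch below builds such rows from a plain dictionary; the function name, the `Gene` header label, and the input structure are hypothetical, and the file written by PPanGGOLiN may be laid out differently.

```python
from typing import Dict, List, Set


def presence_absence_rows(
    families: Dict[str, Set[str]], genomes: List[str]
) -> List[List[str]]:
    """Build a header row followed by one 0/1 row per gene family."""
    rows = [["Gene"] + genomes]
    for family in sorted(families):
        present_in = families[family]
        rows.append([family] + ["1" if genome in present_in else "0" for genome in genomes])
    return rows


matrix = presence_absence_rows(
    {"famB": {"g1"}, "famA": {"g1", "g2"}},
    genomes=["g1", "g2"],
)
```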
- ppanggolin.formats.writeFlatPangenome.write_gexf(output: Path, light: bool = True, compress: bool = False)
Write the pangenome graph in a gexf file
- Parameters:
output – Path to output directory
light – save the light version of the pangenome graph
compress – Compress the file in .gz
- ppanggolin.formats.writeFlatPangenome.write_gexf_edges(gexf: TextIO, light: bool = True)
Write the edges of the pangenome graph in the gexf file
- Parameters:
gexf – file-like object, compressed or not
light – save the light version of the pangenome graph
- ppanggolin.formats.writeFlatPangenome.write_gexf_end(gexf: TextIO)
Write the end of gexf file to save pangenome
- Parameters:
gexf – file-like object, compressed or not
- ppanggolin.formats.writeFlatPangenome.write_gexf_header(gexf: TextIO, light: bool = True)
Write the header of gexf file to save graph
- Parameters:
gexf – file-like object, compressed or not
light – save the light version of the pangenome graph
- ppanggolin.formats.writeFlatPangenome.write_gexf_nodes(gexf: TextIO, light: bool = True, soft_core: False = 0.95)
Write the nodes of the pangenome graph in the gexf file
- Parameters:
gexf – file-like object, compressed or not
light – save the light version of the pangenome graph
soft_core – Soft core threshold to use
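The gexf helpers above all receive a "file-like object, compressed or not". The sketch below shows one way to set that up: a single function opens a plain or gzip-compressed text handle, and the header, body, and end are written by separate functions that only depend on the handle. All function names are hypothetical and the markup is a placeholder, not the GEXF schema.

```python
import gzip
import io
from pathlib import Path
from typing import Iterable, TextIO


def open_text_output(path: Path, compress: bool = False) -> TextIO:
    """Open a text handle, gzip-compressed when requested (hypothetical helper)."""
    if compress:
        return gzip.open(f"{path}.gz", mode="wt", encoding="utf-8")
    return open(path, mode="w", encoding="utf-8")


def write_header(handle: TextIO) -> None:
    handle.write("<graph>\n")  # placeholder markup


def write_nodes(handle: TextIO, names: Iterable[str]) -> None:
    for name in names:
        handle.write(f'  <node id="{name}"/>\n')


def write_end(handle: TextIO) -> None:
    handle.write("</graph>\n")


# The writers work with any text handle, here an in-memory buffer.
buffer = io.StringIO()
write_header(buffer)
write_nodes(buffer, ["famA", "famB"])
write_end(buffer)
```

Because the writers never open the file themselves, the same code path serves compressed and uncompressed output.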
- ppanggolin.formats.writeFlatPangenome.write_json(output: Path, compress: bool = False)
Writes the graph in a json file format
- Parameters:
output – Path to output directory
compress – Compress the file in .gz
- ppanggolin.formats.writeFlatPangenome.write_json_edge(edge: Edge, json: TextIO)
Write an edge of the graph in the json file
- Parameters:
edge – Edge to write
json – file-like object, compressed or not
- ppanggolin.formats.writeFlatPangenome.write_json_edges(json)
Write the edges of the graph in the json file
- Parameters:
json – file-like object, compressed or not
- ppanggolin.formats.writeFlatPangenome.write_json_gene_fam(gene_fam: GeneFamily, json: TextIO)
Write the gene family corresponding to a graph node in the json file
- Parameters:
gene_fam – Gene family to write
json – file-like object, compressed or not
- ppanggolin.formats.writeFlatPangenome.write_json_header(json: TextIO)
Write the header of json file to save graph
- Parameters:
json – file-like object, compressed or not
- ppanggolin.formats.writeFlatPangenome.write_json_nodes(json: TextIO)
Write the nodes of the graph in the json file
- Parameters:
json – file-like object, compressed or not
- ppanggolin.formats.writeFlatPangenome.write_matrix(output: Path, sep: str = ',', ext: str = 'csv', compress: bool = False, gene_names: bool = False)
Write a csv file format as used by Roary, among others. The alternative gene ID will be the partition, if there is one
- Parameters:
sep – Column field separator
ext – file extension
output – Path to output directory
compress – Compress the file in .gz
gene_names – write the gene names if they are saved in the pangenome
- ppanggolin.formats.writeFlatPangenome.write_module_summary(output: Path, compress: bool = False)
Write a file providing summary information about modules
- Parameters:
output – Path to output directory
compress – Compress the file in .gz
- ppanggolin.formats.writeFlatPangenome.write_modules(output: Path, compress: bool = False)
Write a tsv file providing association between modules and gene families
- Parameters:
output – Path to output directory
compress – Compress the file in .gz
- ppanggolin.formats.writeFlatPangenome.write_org_modules(output: Path, compress: bool = False)
Write a tsv file providing association between modules and organisms
- Parameters:
output – Path to output directory
compress – Compress the file in .gz
- ppanggolin.formats.writeFlatPangenome.write_pangenome_flat_files(pangenome: Pangenome, output: Path, cpu: int = 1, soft_core: float = 0.95, dup_margin: float = 0.05, csv: bool = False, gene_pa: bool = False, gexf: bool = False, light_gexf: bool = False, stats: bool = False, json: bool = False, partitions: bool = False, families_tsv: bool = False, regions: bool = False, regions_families: bool = False, spots: bool = False, borders: bool = False, modules: bool = False, spot_modules: bool = False, compress: bool = False, disable_bar: bool = False)
Main function to write flat files from pangenome
- Parameters:
pangenome – Pangenome object
output – Path to output directory
cpu – Number of available cores
soft_core – Soft core threshold to use
dup_margin – minimum ratio of organisms in which family must have multiple genes to be considered duplicated
csv – write csv file format as used by Roary
gene_pa – write gene presence absence matrix
gexf – write pangenome graph in gexf format
light_gexf – write pangenome graph with only gene families
stats – write statistics about pangenome
json – write pangenome graph in json file
partitions – write the gene families for each partition
families_tsv – write gene families information
regions – write RGP information
spots – write information on spots
borders – write gene families bordering spots
modules – write information about modules
spot_modules – write association between modules and RGP and modules and spots
compress – Compress the file in .gz
disable_bar – Disable progress bar
- ppanggolin.formats.writeFlatPangenome.write_partitions(output: Path, soft_core: float = 0.95)
Write the list of gene families for each partition
- Parameters:
output – Path to output directory
soft_core – Soft core threshold to use
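The `soft_core` threshold can be read as the minimum fraction of genomes in which a family must be present to belong to the soft core, while the exact core requires presence in every genome. The sketch below assumes that reading, including the `>=` comparison; the function name and input structure are hypothetical.

```python
from typing import Dict, Set, Tuple


def core_sets(
    family_genomes: Dict[str, Set[str]], nb_genomes: int, soft_core: float = 0.95
) -> Tuple[Set[str], Set[str]]:
    """Return the (exact core, soft core) family names."""
    exact_core = {
        family for family, genomes in family_genomes.items() if len(genomes) == nb_genomes
    }
    soft = {
        family
        for family, genomes in family_genomes.items()
        if len(genomes) >= nb_genomes * soft_core
    }
    return exact_core, soft


all_genomes = {f"g{i}" for i in range(10)}
families = {
    "famA": set(all_genomes),            # present in 10 of 10 genomes
    "famB": all_genomes - {"g0"},        # present in 9 of 10 genomes
    "famC": {"g1", "g2"},                # present in 2 of 10 genomes
}
exact, soft = core_sets(families, nb_genomes=10, soft_core=0.8)
```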
- ppanggolin.formats.writeFlatPangenome.write_persistent_duplication_statistics(pangenome: Pangenome, output: Path, dup_margin: float, compress: bool) Set[GeneFamily]
Writes statistics on persistent duplications in gene families to a specified output file.
- Parameters:
pangenome – The Pangenome object containing gene families.
output – The Path specifying the output file location.
dup_margin – The duplication margin used for determining single copy markers.
compress – A boolean indicating whether to compress the output file.
- ppanggolin.formats.writeFlatPangenome.write_regions(output: Path, compress: bool = False)
Write the file providing information about RGP content
- Parameters:
output – Path to output directory
compress – Compress the file in .gz
- ppanggolin.formats.writeFlatPangenome.write_regions_families(output: Path, compress: bool = False)
Write the file providing the association between regions of genomic plasticity and gene families.
- Parameters:
output – Path to output directory
compress – Compress the file in .gz
- ppanggolin.formats.writeFlatPangenome.write_rgp_modules(output: Path, compress: bool = False)
Write a tsv file providing association between modules and RGP
- Parameters:
output – Path to output directory
compress – Compress the file in .gz
- ppanggolin.formats.writeFlatPangenome.write_rgp_table(regions: Set[Region], output: Path, compress: bool = False)
Write the file providing information about regions of genomic plasticity.
- Parameters:
regions – Set of Region objects representing regions.
output – Path to the output directory.
compress – Whether to compress the file in .gz format.
- ppanggolin.formats.writeFlatPangenome.write_spot_modules(output: Path, compress: bool = False)
Write a tsv file providing association between modules and spots
- Parameters:
output – Path to output directory
compress – Compress the file in .gz
- ppanggolin.formats.writeFlatPangenome.write_spots(output: Path, compress: bool = False)
Write tsv files providing spots information and association with RGP
- Parameters:
output – Path to output directory
compress – Compress the file in .gz
- ppanggolin.formats.writeFlatPangenome.write_stats(output: Path, soft_core: float = 0.95, dup_margin: float = 0.05, compress: bool = False)
Write pangenome statistics for each genome
- Parameters:
output – Path to output directory
soft_core – Soft core threshold to use
dup_margin – minimum ratio of organisms in which family must have multiple genes to be considered duplicated
compress – Compress the file in .gz
- ppanggolin.formats.writeFlatPangenome.write_summaries_in_tsv(summaries: List[Dict[str, Any]], output_file: Path, dup_margin: float, soft_core: float, compress: bool = False)
Writes summaries of organisms stored in a dictionary into a Tab-Separated Values (TSV) file.
- Parameters:
summaries – A list containing organism summaries.
output_file – The Path specifying the output TSV file location.
soft_core – Soft core threshold used
dup_margin – minimum ratio of organisms in which family must have multiple genes to be considered duplicated
compress – Compress the file in .gz
ppanggolin.formats.writeMSA module
- ppanggolin.formats.writeMSA.compute_msa(families: Set[GeneFamily], output: Path, tmpdir: Path, cpu: int = 1, source: str = 'protein', use_gene_id: bool = False, code: str = '11', disable_bar: bool = False)
Compute MSA between pangenome gene families
- Parameters:
families – Set of families specific to given partition
output – output directory name for families alignment
cpu – number of available cores
tmpdir – path to temporary directory
source – indicates whether to use protein or dna sequences to compute the msa
use_gene_id – Use gene identifiers rather than organism names for sequences in the family MSA
code – Genetic code to use
disable_bar – Disable progress bar
- ppanggolin.formats.writeMSA.get_families_to_write(pangenome: Pangenome, partition_filter: str = 'core', soft_core: float = 0.95, dup_margin: float = 0.95, single_copy: bool = True) Set[GeneFamily]
Get families corresponding to the given partition
- Parameters:
pangenome – Partitioned pangenome
partition_filter – choice of partition to compute Multiple Sequence Alignment of the gene families
soft_core – Soft core threshold to use
dup_margin – maximal number of genomes in which the gene family can have multiple members and still be considered a ‘single copy’ gene family
single_copy – Use “single copy” (defined by dup_margin) gene families only
- Returns:
set of families unique to one partition
- ppanggolin.formats.writeMSA.launch(args: Namespace)
Command launcher
- Parameters:
args – All arguments provided by the user
- ppanggolin.formats.writeMSA.launch_mafft(fname: Path, output: Path, fam_name: str)
Compute the MSA with mafft
- Parameters:
fname – family gene sequence in fasta
output – directory to save alignment
fam_name – Name of the gene family
- ppanggolin.formats.writeMSA.launch_multi_mafft(args: List[Tuple[Path, Path, str]])
Allows launching mafft with multiprocessing
- Parameters:
args – Packed arguments for launch_mafft
- Returns:
Organism object for pangenome
- ppanggolin.formats.writeMSA.parser_msa(parser: ArgumentParser)
Parser for specific argument of msa command
- Parameters:
parser – Parser to which the arguments of this command are added
- ppanggolin.formats.writeMSA.subparser(sub_parser: _SubParsersAction) ArgumentParser
Subparser to launch PPanGGOLiN in Command line
- Parameters:
sub_parser – Sub-parser object to which this command is added
- Returns:
Parser holding the arguments of this command
- ppanggolin.formats.writeMSA.translate(gene: Gene, code: Dict[str, Dict[str, str]]) Tuple[str, bool]
Translates the given DNA sequence with the given translation table
- Parameters:
gene – given gene
code – translation table corresponding to genetic code to use
- Returns:
protein sequence
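Codon-by-codon translation can be sketched as below. The example uses a flat codon dictionary containing only a handful of codons, which is a deliberate simplification: the nested table structure expected by `translate` and the meaning of its boolean return value are not described here, so the sketch returns the protein string only. The names `translate_dna` and `PARTIAL_CODON_TABLE` are hypothetical.

```python
from typing import Dict

# A deliberately partial codon table, for illustration only.
PARTIAL_CODON_TABLE: Dict[str, str] = {
    "ATG": "M",
    "GCC": "A",
    "AAA": "K",
    "TAA": "*",
    "TAG": "*",
    "TGA": "*",
}


def translate_dna(dna: str, table: Dict[str, str], unknown: str = "X") -> str:
    """Translate complete codons; codons missing from the table become `unknown`."""
    dna = dna.upper()
    usable_length = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(
        table.get(dna[index:index + 3], unknown) for index in range(0, usable_length, 3)
    )


protein = translate_dna("atggccaaataa", PARTIAL_CODON_TABLE)
```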
- ppanggolin.formats.writeMSA.write_fasta_families(family: GeneFamily, tmpdir: TemporaryDirectory, code_table: Dict[str, Dict[str, str]], source: str = 'protein', use_gene_id: bool = False) Tuple[Path, bool]
Write fasta files for each gene family
- Parameters:
family – gene family to write
tmpdir – path to temporary directory
source – indicates whether to use protein or dna sequences to compute the msa
use_gene_id – Use gene identifiers rather than organism names for sequences in the family MSA
code_table – Genetic code to use
- Returns:
path to fasta file
- ppanggolin.formats.writeMSA.write_msa_files(pangenome: Pangenome, output: Path, cpu: int = 1, partition: str = 'core', tmpdir: Path | None = None, source: str = 'protein', soft_core: float = 0.95, phylo: bool = False, use_gene_id: bool = False, translation_table: str = '11', dup_margin: float = 0.95, single_copy: bool = True, force: bool = False, disable_bar: bool = False)
Main function to write MSA files
- Parameters:
pangenome – Pangenome object with partition
output – Path to output directory
cpu – number of available cores
partition – choice of partition to compute Multiple Sequence Alignment of the gene families
tmpdir – path to temporary directory
source – indicates whether to use protein or dna sequences to compute the msa
soft_core – Soft core threshold to use
phylo – Writes a whole genome msa file for additional phylogenetic analysis
use_gene_id – Use gene identifiers rather than organism names for sequences in the family MSA
translation_table – Translation table (genetic code) to use.
dup_margin – maximal number of genomes in which the gene family can have multiple members and still be considered a ‘single copy’ gene family
single_copy – Use “single copy” (defined by dup_margin) gene families only
force – force to write in the directory
disable_bar – Disable progress bar
- ppanggolin.formats.writeMSA.write_whole_genome_msa(pangenome: Pangenome, families: set, phylo_name: Path, outdir: Path, use_gene_id: bool = False)
Writes a whole genome msa file for additional phylogenetic analysis
- Parameters:
pangenome – Pangenome object
families – Set of families specific to given partition
phylo_name – output file name for phylo alignment
outdir – output directory name for families alignment
use_gene_id – Use gene identifiers rather than organism names for sequences in the family MSA
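A whole-genome alignment for phylogenetic analysis is typically built by concatenating the per-family alignments genome by genome. The sketch below assumes that approach and pads genomes missing from a family with gaps; the function name, the input structure, and the gap-filling rule are assumptions, not a description of what PPanGGOLiN does internally.

```python
from typing import Dict, List


def concatenate_alignments(
    alignments: List[Dict[str, str]], genomes: List[str]
) -> Dict[str, str]:
    """Concatenate aligned sequences per genome, padding absent genomes with gaps."""
    parts: Dict[str, List[str]] = {genome: [] for genome in genomes}
    for alignment in alignments:
        if not alignment:
            continue
        # All sequences of one alignment share the same width.
        width = len(next(iter(alignment.values())))
        for genome in genomes:
            parts[genome].append(alignment.get(genome, "-" * width))
    return {genome: "".join(chunks) for genome, chunks in parts.items()}


supermatrix = concatenate_alignments(
    [{"g1": "AC-", "g2": "ACG"}, {"g1": "TT"}],
    genomes=["g1", "g2"],
)
```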
ppanggolin.formats.writeMetadata module
- ppanggolin.formats.writeMetadata.desc_metadata(max_len_dict: Dict[str, int], type_dict: Dict[str, Col]) dict
Create a formatted table for metadata description
- Returns:
Formatted table
- ppanggolin.formats.writeMetadata.erase_metadata(pangenome: Pangenome, h5f: File, status_group: Group, metatype: str | None = None, source: str | None = None)
Erase metadata in pangenome
- Parameters:
pangenome – Pangenome with metadata to erase
h5f – HDF5 file with pangenome metadata
status_group – pangenome status in HDF5
metatype – select to which pangenome element metadata should be erased
source – name of the metadata source
- ppanggolin.formats.writeMetadata.get_metadata_contig_len(select_ctg: List[Contig], source: str) Tuple[Dict[str, int], Dict[str, Col], int]
Get maximum size of contig metadata information
- Parameters:
select_ctg – selected elements from source
source – Name of the metadata source
- Returns:
Maximum type and size of each element
- ppanggolin.formats.writeMetadata.get_metadata_len(select_elem: List[Gene] | List[Organism] | List[GeneFamily] | List[Region] | List[Spot] | List[Module], source: str) Tuple[Dict[str, int], Dict[str, Col], int]
Get maximum size of metadata information
- Parameters:
select_elem – selected elements from source
source – Name of the metadata source
- Returns:
Maximum type and size of each element
- ppanggolin.formats.writeMetadata.write_metadata(pangenome: Pangenome, h5f: File, disable_bar: bool = False)
Write metadata in pangenome
- Parameters:
pangenome – Pangenome to which metadata should be written
h5f – HDF5 file with pangenome
disable_bar – Disable progress bar
- ppanggolin.formats.writeMetadata.write_metadata_contig(h5f: File, source: str, select_contigs: List[Contig], disable_bar: bool = False)
Writing a table containing the metadata associated with contigs
- Parameters:
h5f – HDF5 file to write metadata
source – name of the metadata source
select_contigs – List of contigs with metadata
disable_bar – Disable progress bar
- ppanggolin.formats.writeMetadata.write_metadata_group(h5f: File, metatype: str) Group
Check and write the group in HDF5 file to organize metadata
- Parameters:
h5f – HDF5 file with pangenome
metatype – select to which pangenome element metadata should be written
- Returns:
Metadata group of the corresponding metatype
- ppanggolin.formats.writeMetadata.write_metadata_metatype(h5f: File, source: str, metatype: str, select_elements: List[Gene] | List[Organism] | List[GeneFamily] | List[Region] | List[Spot] | List[Module], disable_bar: bool = False)
Writing a table containing the metadata associated with elements of the metatype
- Parameters:
h5f – HDF5 file to write metadata
source – name of the metadata source
metatype – select to which pangenome element metadata should be written
select_elements – Elements selected to write metadata
disable_bar – Disable progress bar
ppanggolin.formats.writeSequences module
- ppanggolin.formats.writeSequences.check_write_sequences_args(args: Namespace) None
Check arguments compatibility in CLI
- Parameters:
args – argparse namespace arguments from CLI
- Raises:
argparse.ArgumentTypeError – if region is given but neither fasta nor anno is given
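The compatibility rule stated above can be sketched as a small check on an `argparse.Namespace`. The attribute names `regions`, `fasta`, and `anno` and the function name `check_region_inputs` are assumptions made for this example.

```python
import argparse


def check_region_inputs(args: argparse.Namespace) -> None:
    """Raise if regions are requested without any source of genome sequences."""
    regions = getattr(args, "regions", None)
    fasta = getattr(args, "fasta", None)
    anno = getattr(args, "anno", None)
    if regions is not None and fasta is None and anno is None:
        raise argparse.ArgumentTypeError(
            "Writing region sequences requires either a fasta list or an annotation list."
        )


# Passes silently: a fasta list is supplied alongside the regions option.
check_region_inputs(argparse.Namespace(regions="all", fasta="genomes.tsv", anno=None))
```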
- ppanggolin.formats.writeSequences.create_mmseqs_db(sequences: Iterable[Path], db_name: str, tmpdir: Path, db_mode: int = 0, db_type: int = 0) Path
Create a MMseqs2 database from a sequences file.
- Parameters:
sequences – File with the sequences
db_name – name of the database
tmpdir – Temporary directory to save the MMSeqs2 files
db_mode – Createdb mode 0: copy data, 1: soft link data and write new index (works only with single line fasta/q)
db_type – Database type 0: auto, 1: amino acid 2: nucleotides
- Returns:
Path to the MMSeqs2 database
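The parameters above map naturally onto the `mmseqs createdb` command line. The sketch below only assembles the argument list and does not run MMseqs2; the function name is hypothetical, and the exact command PPanGGOLiN builds may differ.

```python
from pathlib import Path
from typing import Iterable, List


def build_createdb_command(
    sequences: Iterable[Path], database: Path, db_mode: int = 0, db_type: int = 0
) -> List[str]:
    """Assemble an `mmseqs createdb` call as a list suitable for subprocess.run."""
    return [
        "mmseqs",
        "createdb",
        *(str(path) for path in sequences),
        str(database),
        "--createdb-mode",
        str(db_mode),  # 0: copy data, 1: soft link data and write a new index
        "--dbtype",
        str(db_type),  # 0: auto, 1: amino acid, 2: nucleotides
    ]


command = build_createdb_command([Path("genes.fna")], Path("genes_db"), db_mode=1, db_type=2)
```

Passing the command as a list rather than a single string avoids shell quoting issues when file names contain spaces.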
- ppanggolin.formats.writeSequences.filter_values(arg_value: str)
Check filter value to ensure they are in the expected format.
- Parameters:
arg_value – Argument value that is being tested.
- Returns:
The same argument if it is valid.
- Raises:
argparse.ArgumentTypeError – If the argument value is not in the expected format.
- ppanggolin.formats.writeSequences.launch(args: Namespace)
Command launcher
- Parameters:
args – All arguments provided by the user
- ppanggolin.formats.writeSequences.parser_seq(parser: ArgumentParser)
Parser for specific argument of fasta command
- Parameters:
parser – Parser to which the arguments of this command are added
- ppanggolin.formats.writeSequences.read_fasta_gbk(file_path: Path) Dict[str, str]
Read the genome file in gbk format
- Parameters:
file_path – Path to genome file
- Returns:
Dictionary with all sequences associated to contig
- ppanggolin.formats.writeSequences.read_fasta_or_gff(file_path: Path) Dict[str, str]
Read the genome file in fasta or gbff format
- Parameters:
file_path – Path to genome file
- Returns:
Dictionary with all sequences associated to contig
- ppanggolin.formats.writeSequences.read_genome_file(genome_file: Path, organism: Organism) Dict[str, str]
Read the genome file associated to organism to extract sequences
- Parameters:
genome_file – Path to a fasta file or gbff/gff file
organism – organism object
- Returns:
Dictionary with all sequences associated to contig
- Raises:
TypeError – If the file containing sequences is not recognized
KeyError – If there is an inconsistency between pangenome contigs and the given contigs
- ppanggolin.formats.writeSequences.subparser(sub_parser: _SubParsersAction) ArgumentParser
Subparser to launch PPanGGOLiN in Command line
- Parameters:
sub_parser – Sub-parser object to which this command is added
- Returns:
Parser holding the arguments of this command
- ppanggolin.formats.writeSequences.translate_genes(sequences: Path | Iterable[Path], tmpdir: Path, cpu: int = 1, is_single_line_fasta: bool = False, code: int = 11) Path
Translate nucleotide sequences into MMSeqs2 amino acid sequences database
- Parameters:
sequences – File with the nucleotide sequences
tmpdir – Temporary directory to save the MMSeqs2 files
cpu – Number of available threads to use
is_single_line_fasta – Allow to use soft link in MMSeqs2 database
code – Translation code to use
- Returns:
Path to the MMSeqs2 database
- ppanggolin.formats.writeSequences.write_gene_protein_sequences(pangenome_filename: str, output: Path, gene_filter: str, soft_core: float = 0.95, compress: bool = False, keep_tmp: bool = False, tmp: Path | None = None, cpu: int = 1, code: int = 11, disable_bar: bool = False)
Write all amino acid sequences from given genes in pangenome
- Parameters:
pangenome_filename – Name of the pangenome file
output – Path to output directory
gene_filter – Selected partition of genes
soft_core – Soft core threshold to use
compress – Compress the file in .gz
keep_tmp – Keep temporary directory
tmp – Path to temporary directory
cpu – Number of threads available
code – Genetic code used to translate nucleotide sequences to protein sequences
disable_bar – Disable progress bar
- ppanggolin.formats.writeSequences.write_gene_sequences_from_annotations(genes_to_write: Iterable[Gene], output: Path, add: str = '', compress: bool = False, disable_bar: bool = False)
Writes the CDS sequences to a File object, and adds the string provided through add in front of it. Loads the sequences from previously computed or loaded annotations.
- Parameters:
genes_to_write – Genes to write.
output – Path to output file to write sequences.
add – Add prefix to gene ID.
compress – Compress the file in .gz
disable_bar – Disable progress bar.
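The effect of the add prefix can be sketched as follows. This is an illustrative stand-in, not the library code: the helper `write_prefixed_fasta` is hypothetical and takes plain (gene ID, DNA) pairs rather than Gene objects.

```python
import io
from typing import Iterable, TextIO, Tuple


def write_prefixed_fasta(
    records: Iterable[Tuple[str, str]], handle: TextIO, add: str = ""
) -> None:
    """Write (gene_id, dna) pairs as FASTA, prefixing each ID with `add`."""
    for gene_id, dna in records:
        handle.write(f">{add}{gene_id}\n{dna}\n")


buffer = io.StringIO()
write_prefixed_fasta([("gene_1", "ATGAAATAG")], buffer, add="ppanggolin_")
print(buffer.getvalue(), end="")
```

Because the helper writes to any text handle, a compressed file could be produced by passing a handle from `gzip.open(path, "wt")` instead of an in-memory buffer.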
- ppanggolin.formats.writeSequences.write_regions_sequences(pangenome: Pangenome, output: Path, regions: str, fasta: Path | None = None, anno: Path | None = None, compress: bool = False, disable_bar: bool = False)
Write the nucleotide sequences of the RGPs (regions of genomic plasticity).
- Parameters:
pangenome – Pangenome object with gene families sequences
output – Path to output directory
regions – Write the RGP nucleotide sequences
fasta – A tab-separated file listing the organism names and the FASTA file paths of their genomic sequences
anno – A tab-separated file listing the organism names and the GFF/GBFF file paths of their annotations
compress – Compress the file in .gz
disable_bar – Disable progress bar
- Raises:
SyntaxError – If no tab character is found in the genome list file
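The tab-separated list format and the error condition described above can be sketched like this. The helper `parse_genome_list` is hypothetical. It only assumes one organism name and one file path per line, separated by a tab.

```python
from pathlib import Path
from typing import Dict, Iterable


def parse_genome_list(lines: Iterable[str]) -> Dict[str, Path]:
    """Map organism name to file path from tab-separated lines (hypothetical helper)."""
    genome_to_path: Dict[str, Path] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 2:
            raise SyntaxError(f"No tabulation found in genome list line: {line!r}")
        genome_to_path[fields[0]] = Path(fields[1])
    return genome_to_path


listing = parse_genome_list(["orgA\tgenomes/orgA.fasta\n", "orgB\tgenomes/orgB.fasta\n"])
```

A line that separates its fields with spaces rather than a tab raises SyntaxError, mirroring the documented behaviour.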
- ppanggolin.formats.writeSequences.write_sequence_files(pangenome: Pangenome, output: Path, fasta: Path | None = None, anno: Path | None = None, soft_core: float = 0.95, regions: str | None = None, genes: str | None = None, proteins: str | None = None, gene_families: str | None = None, prot_families: str | None = None, compress: bool = False, disable_bar: bool = False, **translate_kwgs)
Main function to write sequence files from a pangenome
- Parameters:
pangenome – Pangenome object containing sequences
output – Path to output directory
fasta – A tab-separated file listing the organism names and the FASTA file paths of their genomic sequences
anno – A tab-separated file listing the organism names and the GFF/GBFF file paths of their annotations
soft_core – Soft core threshold to use
regions – Write the RGP nucleotide sequences
genes – Write all nucleotide CDS sequences
proteins – Write amino acid CDS sequences.
gene_families – Write representative nucleotide sequences of gene families.
prot_families – Write representative amino acid sequences of gene families.
compress – Compress the file in .gz
disable_bar – Disable progress bar
- ppanggolin.formats.writeSequences.write_spaced_fasta(sequence: str, space: int = 60) str
Split a sequence so that each line holds at most space characters
- Parameters:
sequence – sequence to write
space – maximum number of characters per line
- Returns:
the sequence split into lines of at most space characters
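A minimal sketch of this kind of line wrapping is shown below. The helper name `wrap_sequence` is hypothetical, and ending every line (including the last) with a newline is an assumption rather than documented behaviour.

```python
def wrap_sequence(sequence: str, space: int = 60) -> str:
    """Split `sequence` into newline-terminated lines of at most `space` characters."""
    return "".join(
        sequence[start : start + space] + "\n"
        for start in range(0, len(sequence), space)
    )


wrapped = wrap_sequence("ACGTACGTAC", space=4)
print(wrapped, end="")
```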
ppanggolin.formats.write_proksee module
- ppanggolin.formats.write_proksee.initiate_proksee_data(features: List[str], organism: Organism, module_to_color: Dict[Module, str] | None = None)
Initializes ProkSee data structure with legends, tracks, and captions.
- Parameters:
features – A list of features to include in the ProkSee data.
organism – The organism for which the ProkSee data is being generated.
module_to_color – A dictionary mapping modules to their assigned colors.
- Returns:
ProkSee data structure containing legends, tracks, and captions.
- ppanggolin.formats.write_proksee.write_contig(organism: Organism, genome_sequences: Dict[str, str] | None = None) List[Dict]
Writes contig data for a given organism in proksee format.
- Parameters:
organism – The organism for which contig data will be written.
genome_sequences – A dictionary mapping contig names to their DNA sequences (default: None).
- Returns:
A list of contig data in a structured format.
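For orientation, a contig list of this kind might look like the sketch below. The helper `contigs_to_records` is hypothetical, and the dictionary keys (name, length, orientation, seq) are assumed from the CGView JSON format that ProkSee reads; they may differ from what PPanGGOLiN emits.

```python
from typing import Dict, List, Optional


def contigs_to_records(
    contig_lengths: Dict[str, int],
    genome_sequences: Optional[Dict[str, str]] = None,
) -> List[Dict]:
    """One dict per contig; the DNA is attached only when sequences are given."""
    records = []
    for name, length in contig_lengths.items():
        record = {"name": name, "length": length, "orientation": "+"}
        if genome_sequences is not None:
            record["seq"] = genome_sequences[name]
        records.append(record)
    return records


with_seq = contigs_to_records({"contig_1": 8}, {"contig_1": "ACGTACGT"})
without_seq = contigs_to_records({"contig_1": 8})
```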
- ppanggolin.formats.write_proksee.write_genes(organism: Organism, multigenics: Set[GeneFamily], disable_bar: bool = True) Tuple[List[Dict], Dict[str, List[Gene]]]
Writes gene data for a given organism, including both protein-coding genes and RNA genes.
- Parameters:
organism – The organism for which gene data will be written.
multigenics – The set of multigenic gene families.
disable_bar – A flag to disable the progress bar when processing genes (default: True).
- Returns:
List of gene data in a structured format and a dictionary mapping gene families to genes.
- ppanggolin.formats.write_proksee.write_legend_items(features: List[str], module_to_color: Dict[Module, str] | None = None)
Generates legend items based on the selected features and module-to-color mapping.
- Parameters:
features – A list of features to include in the legend.
module_to_color – A dictionary mapping modules to their assigned colors.
- Returns:
A data structure containing legend items based on the selected features and module colors.
- ppanggolin.formats.write_proksee.write_modules(organism: Organism, gf2genes: Dict[str, List[Gene]])
Writes module data in proksee format for a list of modules associated with a given organism.
- Parameters:
organism – The organism to which the modules are associated.
gf2genes – A dictionary that maps gene families to the genes they contain.
- Returns:
A list of module data in a structured format.
- ppanggolin.formats.write_proksee.write_proksee_organism(organism: Organism, output_file: Path, features: List[str] | None = None, module_to_colors: Dict[Module, str] | None = None, genome_sequences: Dict[str, str] | None = None, multigenics: Set[GeneFamily] = [], compress: bool = False)
Writes ProkSee data for a given organism, including contig information, genes colored by partition, RGPs, and modules. The resulting data is saved as a JSON file in the specified output file.
- Parameters:
organism – The organism for which ProkSee data will be written.
output_file – The output file where ProkSee data will be written.
features – A list of features to include in the ProkSee data, e.g., ["rgp", "modules", "all"].
module_to_colors – A dictionary mapping modules to their assigned colors.
genome_sequences – The genome sequences for the organism.
multigenics – The set of multigenic gene families.
compress – Compress the output file
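The final step, saving the assembled data as JSON with optional compression, can be sketched as follows. The helper `write_proksee_json` is hypothetical. The top-level "cgview" layout is assumed from the CGView JSON format, and appending ".gz" to the file name when compressing is likewise an assumption.

```python
import gzip
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def write_proksee_json(
    name: str,
    contigs: List[Dict],
    features: List[Dict],
    output_file: Path,
    compress: bool = False,
) -> Path:
    """Dump a CGView-style document to JSON, gzipped on request."""
    data = {
        "cgview": {
            "name": name,
            "sequence": {"contigs": contigs},
            "features": features,
        }
    }
    target = output_file.with_name(output_file.name + ".gz") if compress else output_file
    opener = gzip.open if compress else open
    with opener(target, "wt") as handle:
        json.dump(data, handle, indent=2)
    return target


with tempfile.TemporaryDirectory() as tmp:
    written = write_proksee_json(
        "orgA", [{"name": "contig_1", "length": 8}], [], Path(tmp) / "orgA.json", compress=True
    )
    with gzip.open(written, "rt") as handle:
        reloaded = json.load(handle)
```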
- ppanggolin.formats.write_proksee.write_rgp(organism: Organism)
Writes RGP (Region of Genomic Plasticity) data for a given organism in proksee format.
- Parameters:
organism – The specific organism for which RGP data will be written.
- Returns:
A list of RGP data in a structured format.
- ppanggolin.formats.write_proksee.write_tracks(features: List[str])
Generates track information based on the selected features.
- Parameters:
features – A list of features to include in the ProkSee data.
- Returns:
A list of track configurations based on the selected features.
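How the selected features could drive the track list is sketched below. The helper `tracks_for` is hypothetical. The feature names "rgp", "modules" and "all" come from the write_proksee_organism documentation, while the track keys (name, position, dataType, dataMethod, dataKeys) are assumed from the CGView JSON format and are not confirmed against PPanGGOLiN.

```python
from typing import Dict, List


def tracks_for(features: List[str]) -> List[Dict]:
    """Always emit a gene track; add optional tracks for the requested features."""

    def track(name: str, key: str) -> Dict:
        return {
            "name": name,
            "position": "outside",
            "dataType": "feature",
            "dataMethod": "source",
            "dataKeys": key,
        }

    wanted = set(features)
    tracks = [track("Genes", "Gene")]
    if wanted & {"rgp", "all"}:
        tracks.append(track("RGP", "RGP"))
    if wanted & {"modules", "all"}:
        tracks.append(track("Module", "Module"))
    return tracks


rgp_only = [t["name"] for t in tracks_for(["rgp"])]
everything = [t["name"] for t in tracks_for(["all"])]
```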