ppanggolin package

Subpackages

Submodules

ppanggolin.edge module

class ppanggolin.edge.Edge(source_gene: Gene, target_gene: Gene)

Bases: object

The Edge class represents an edge between two gene families in the pangenome graph. It is associated with all the organisms in which the neighborship is found, and all the involved genes as well.

Methods:
  • get_organisms_dict: Returns a dictionary with organisms as keys and the list of gene pairs found in each organism as values.

  • gene_pairs: Returns a list of all the gene pairs in the Edge.

  • add_genes: Adds genes to the edge. They are supposed to be in the same organism.

Fields:
  • source: A GeneFamily object representing the source gene family of the edge.

  • target: A GeneFamily object representing the target gene family of the edge.

  • organisms: A defaultdict object representing the organisms in which the edge is found and the pairs of genes involved.

add_genes(source_gene: Gene, target_gene: Gene)

Adds genes to the edge. They are supposed to be in the same organism.

Parameters:
  • source_gene – Gene corresponding to the source of the edge

  • target_gene – Gene corresponding to the target of the edge

Raises:
  • TypeError – If either gene is not a Gene instance

  • ValueError – If genes are not associated with an organism

  • Exception – If the genes are not in the same organism.
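The checks listed above can be illustrated with a small stand-alone model. This is only a sketch under stated assumptions, not PPanGGOLiN's implementation: ToyGene and ToyEdge are hypothetical stand-ins, and organisms are represented by plain strings.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ToyGene:
    """Hypothetical stand-in for a gene; not ppanggolin.genome.Gene."""
    ID: str
    organism: Optional[str] = None


class ToyEdge:
    """Hypothetical stand-in for an edge; stores gene pairs per organism."""

    def __init__(self) -> None:
        self._organisms: Dict[str, List[Tuple[ToyGene, ToyGene]]] = defaultdict(list)

    def add_genes(self, source_gene: ToyGene, target_gene: ToyGene) -> None:
        # Mirror the three documented failure modes, in the documented order.
        if not isinstance(source_gene, ToyGene) or not isinstance(target_gene, ToyGene):
            raise TypeError("Both genes must be ToyGene instances")
        if source_gene.organism is None or target_gene.organism is None:
            raise ValueError("Both genes must be associated with an organism")
        if source_gene.organism != target_gene.organism:
            raise Exception("Both genes must belong to the same organism")
        self._organisms[source_gene.organism].append((source_gene, target_gene))

    @property
    def gene_pairs(self) -> List[Tuple[ToyGene, ToyGene]]:
        # Flatten the per-organism lists into one list of pairs.
        return [pair for pairs in self._organisms.values() for pair in pairs]

    @property
    def number_of_organisms(self) -> int:
        return len(self._organisms)


edge = ToyEdge()
edge.add_genes(ToyGene("g1", "orgA"), ToyGene("g2", "orgA"))
edge.add_genes(ToyGene("g3", "orgB"), ToyGene("g4", "orgB"))
```

Keeping the pairs grouped by organism makes both the per-organism lookup and the organism count cheap, which is presumably why the class exposes them side by side.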

property gene_pairs: List[Tuple[Gene, Gene]]

Get the list of all the gene pairs in the Edge

Returns:

A list of all the gene pairs in the Edge

get_organism_genes_pairs(organism: Organism) List[Tuple[Gene, Gene]]

Get the gene pairs of the edge found in the given organism

Parameters:

organism – Organism to look up

Returns:

List of gene pairs in the edge for the given organism

get_organisms_dict() Dict[Organism, List[Tuple[Gene, Gene]]]

Get all the organisms of the edge with their corresponding gene pairs

Returns:

Dictionary with organisms as keys and lists of gene pairs as values

property number_of_organisms: int

Get the number of organisms in the edge

Returns:

Number of organisms

property organisms: Generator[Organism, None, None]

Get all the organisms belonging to the edge

Returns:

Generator of the organisms in which the edge is found

ppanggolin.geneFamily module

class ppanggolin.geneFamily.GeneFamily(family_id: int, name: str)

Bases: MetaFeatures

This represents a single gene family. It will be a node in the pangenome graph, and be aware of its genes and edges.

Methods:
  • named_partition: returns a meaningful name for the partition associated with the family.

  • neighbors: returns all the GeneFamilies that are linked with an edge.

  • edges: returns all Edges that are linked to this gene family.

  • genes: returns all the genes associated with the family.

  • organisms: returns all the Organisms that have this gene family.

  • spots: returns all the spots associated with the family.

  • modules: returns all the modules associated with the family.

  • number_of_neighbors: returns the number of neighbor GeneFamilies.

  • number_of_edges: returns the number of edges.

  • number_of_genes: returns the number of genes.

  • number_of_organisms: returns the number of organisms.

  • number_of_spots: returns the number of spots.

  • set_edge: sets an edge between the current family and a target family.

  • add_sequence: assigns a protein sequence to the gene family.

  • add: adds a gene to the gene family and sets the gene’s family accordingly.

  • add_spot: adds a spot to the gene family.

  • set_module: adds a module to the gene family.

  • mk_bitarray: produces a bitarray representing the presence/absence of the family in the pangenome using the provided index.

  • get_org_dict: returns a dictionary of organisms as keys and sets of genes as values.

  • get_genes_per_org: returns the genes belonging to the gene family in the given organism.

Fields:
  • name: the name of the gene family.

  • ID: the internal identifier of the gene family.

  • removed: a boolean indicating whether the family has been removed from the main graph.

  • sequence: the protein sequence associated with the family.

  • partition: the partition associated with the family.

add(gene: Gene)

Adds a gene to the gene family and sets the gene’s family attribute accordingly.

Parameters:

gene – The gene to add

Raises:

TypeError – If the provided gene is of the wrong type

add_sequence(seq: str)

Assigns a protein sequence to the gene family.

Parameters:

seq – The sequence to add to the gene family

add_spot(spot: Spot)

Add the given spot to the family

Parameters:

spot – Spot belonging to the family

contains_gene_id(identifier)

Check whether the family already contains a gene with the given ID

Parameters:

identifier – ID of the gene

Returns:

True if the family contains the gene ID, False otherwise

Raises:

TypeError – If the identifier is not a string

duplication_ratio(exclude_fragment: bool) bool

Checks if the gene family is considered single copy based on the provided criteria.

Parameters:

exclude_fragment – A boolean indicating whether to exclude fragments when determining single copy families.

Returns:

A boolean indicating whether the gene family is single copy.

property edges: Generator[Edge, None, None]

Returns all Edges that are linked to this gene family

Returns:

Edges of the gene family

property genes

Return all the genes belonging to the family

Returns:

Generator of genes

get(identifier: str) Gene

Get a gene by its identifier

Parameters:

identifier – ID of the gene

Returns:

Wanted gene

Raises:

TypeError – If the identifier is not a string

get_edge(target: GeneFamily) Edge

Get the edge connecting this gene family to the given neighboring gene family

get_genes_per_org(org: Organism) Generator[Gene, None, None]

Returns the genes belonging to the gene family in the given Organism

Parameters:

org – Organism to look for

Returns:

Generator of the genes of the family found in the given organism

get_org_dict() Dict[Organism, Set[Gene]]

Returns the organisms and the genes belonging to the gene family

Returns:

A dictionary with organisms as keys and sets of genes as values

property has_module: bool

Check if the family is in a module

Returns:

True if the family belongs to a module, False otherwise

is_single_copy(dup_margin: float, exclude_fragment: bool) bool

Checks if the gene family is considered single copy based on the provided criteria.

Parameters:
  • dup_margin – The maximum allowed duplication margin for a gene family to be considered single copy.

  • exclude_fragment – A boolean indicating whether to exclude fragments when determining single copy families.

Returns:

A boolean indicating whether the gene family is single copy.

mk_bitarray(index: Dict[Organism, int], partition: str = 'all')

Produces a bitarray representing the presence/absence of the family in the pangenome using the provided index. The bitarray is stored in the bitarray attribute and is of type gmpy2.xmpz.

Parameters:
  • index – The index computed by ppanggolin.pangenome.Pangenome.get_org_index()

  • partition – partition used to compute bitarray
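To make the presence/absence encoding concrete, here is a minimal sketch that uses a plain Python int in place of gmpy2.xmpz. The function toy_presence_bits and the organism names are hypothetical; the real method stores its result on the object and supports partition filtering, which this sketch leaves out.

```python
from typing import Dict, Iterable


def toy_presence_bits(index: Dict[str, int], organisms_with_family: Iterable[str]) -> int:
    """Set bit i for every organism with index i that carries the family."""
    bits = 0
    for organism in organisms_with_family:
        bits |= 1 << index[organism]
    return bits


# Hypothetical index: each organism name is assigned an integer position.
index = {"orgA": 0, "orgB": 1, "orgC": 2}
bits = toy_presence_bits(index, ["orgA", "orgC"])

# Testing membership is a shift and a mask.
family_in_org_b = bool((bits >> index["orgB"]) & 1)
```

With this layout, comparing two families across all organisms reduces to bitwise operations on two integers.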

property module: Module

Return the module that the family belongs to

Returns:

Module of the family

property named_partition: str

Reads the partition attribute and returns a meaningful name

Returns:

The partition name of the gene family

Raises:

ValueError – If the gene family has no partition assigned

property neighbors: Generator[GeneFamily, None, None]

Returns all the GeneFamilies that are linked with an edge

Returns:

Neighbors

property number_of_edges: int

Get the number of edges for the current gene family

property number_of_genes: int

Get the number of genes for the current gene family

property number_of_neighbors: int

Get the number of neighbors for the current gene family

property number_of_organisms: int

Get the number of organisms for the current gene family

property number_of_spots: int

Get the number of spots for the current gene family

property organisms: Generator[Organism, None, None]

Returns all the Organisms that have this gene family

Returns:

Organisms that have this gene family

property partition

remove(identifier)

Remove a gene by its name

Parameters:

identifier – Name of the gene

Returns:

Wanted gene

Raises:

TypeError – If the identifier is not a string

property representative: Gene

Get the representative gene of the family

Returns:

The representative gene of the family

set_edge(target: GeneFamily, edge: Edge)

Set the edge between the gene family and another one

Parameters:
  • target – Neighbor family

  • edge – Edge connecting families

set_module(module: Module)

Add the given module to the family

Parameters:

module – Module belonging to the family

property spots: Generator[Spot, None, None]

Return all the spots belonging to the family

Returns:

Generator of spots

ppanggolin.genetic_codes module

ppanggolin.genetic_codes.genetic_codes(code)

ppanggolin.genome module

class ppanggolin.genome.Contig(identifier: int, name: str, is_circular: bool = False)

Bases: MetaFeatures

Describes the content of a contig along with some information about it.

Methods:
  • genes: Returns a list of gene objects present in the contig.

  • add_rna: Adds an RNA object to the contig.

  • add: Adds a gene object to the contig.

Fields:
  • name: Name of the contig.

  • is_circular: Boolean value indicating whether the contig is circular or not.

  • RNAs: Set of RNA annotations present in the contig.

TODO: The gene getter should be based on the gene ID, and two other attributes should exist to get genes by start or by position.

Also, when a new gene is set in the contig, its start, stop and strand should be checked for differences; this may call for an __eq__ method in the Gene class.

property RNAs: Generator[RNA, None, None]

Return all the RNA in the contig

Returns:

Generator of RNA

add(gene: Gene)

Add a gene to the contig

Parameters:

gene – Gene to add

Raises:

TypeError – If the gene is not a Gene instance

add_contig_length(contig_length: int)

Add contig length to Contig object.

Parameters:

contig_length – Length of the contig.

Raises:

ValueError – If trying to define a contig length different from previously defined.

add_rna(rna: RNA)

Add RNA to contig

Parameters:

rna – RNA object to add

Raises:
  • TypeError – If rna is not an RNA instance

  • KeyError – Another RNA with the same ID already exists in the contig

property families

Get the families belonging to this contig

Returns:

families in the contig

Return type:

Generator[GeneFamily, None, None]

property genes: Generator[Gene, None, None]

Give the gene content of the contig

Returns:

Generator of genes in contig

get_by_coordinate(coordinate: Tuple[int, int, str]) Gene

Get a gene by its coordinate

Parameters:

coordinate – Tuple containing start, stop and strand of the gene

Returns:

The gene with the specified coordinate.

Raises:

TypeError – Position is not an integer

get_genes(begin: int = 0, end: int | None = None, outrange_ok: bool = False) List[Gene]

Gets a list of genes within a range of gene positions. If no arguments are given, it returns all genes.

Parameters:
  • begin – Position of the first gene to retrieve

  • end – Position just past the last gene to retrieve; the gene at this position is not returned

  • outrange_ok – If True, return all the genes from begin to the last position even if end is out of range

Returns:

List of genes between begin and end position

Raises:
  • TypeError – If begin or end is not an integer

  • ValueError – If begin position is greater than end position

  • IndexError – If end position is greater than last gene position in contig
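The range semantics described above can be sketched on a plain list. This is an illustration under assumptions, not the library's code: toy_get_genes is a hypothetical helper, genes are strings, and "out of range" is taken to mean an end beyond the number of genes.

```python
from typing import List, Optional


def toy_get_genes(genes: List[str], begin: int = 0, end: Optional[int] = None,
                  outrange_ok: bool = False) -> List[str]:
    """Return genes[begin:end], with `end` exclusive, after validating the range."""
    if end is None:
        end = len(genes)  # no upper bound given: go up to the last gene
    if not isinstance(begin, int) or not isinstance(end, int):
        raise TypeError("begin and end must be integers")
    if begin > end:
        raise ValueError("begin must not be greater than end")
    if end > len(genes):
        if not outrange_ok:
            raise IndexError("end is beyond the last gene of the contig")
        end = len(genes)  # clamp to the last gene instead of failing
    return genes[begin:end]


contig_genes = ["g0", "g1", "g2", "g3"]
middle = toy_get_genes(contig_genes, 1, 3)
tail = toy_get_genes(contig_genes, 2, 10, outrange_ok=True)
```

Here middle holds the genes at positions 1 and 2, and tail holds everything from position 2 onward because the oversized end is clamped.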

get_ordered_consecutive_genes(genes: Iterable[Gene]) List[List[Gene]]

Order the given genes considering the circularity of the contig.

Parameters:

genes – An iterable containing genes supposed to be consecutive along the contig.

Returns:

A list of lists containing ordered consecutive genes considering circularity.
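One way to picture the effect of circularity is to group gene positions into consecutive runs and join the runs that meet at the contig edge. The sketch below is an assumption about the general idea, not the method's actual algorithm; toy_order_consecutive is hypothetical and works on integer positions rather than Gene objects.

```python
from typing import Iterable, List


def toy_order_consecutive(positions: Iterable[int], n_genes: int,
                          is_circular: bool) -> List[List[int]]:
    """Group gene positions into runs of consecutive positions.

    On a circular contig, a run ending at the last position is joined with
    a run starting at position 0, so that it wraps around the contig edge.
    """
    ordered = sorted(set(positions))
    runs: List[List[int]] = []
    for pos in ordered:
        if runs and pos == runs[-1][-1] + 1:
            runs[-1].append(pos)  # extends the current run
        else:
            runs.append([pos])    # starts a new run
    if (is_circular and len(runs) > 1
            and runs[0][0] == 0 and runs[-1][-1] == n_genes - 1):
        # The last run continues into the first one across the contig edge.
        runs[0] = runs.pop() + runs[0]
    return runs


circular_runs = toy_order_consecutive([0, 1, 8, 9, 4], n_genes=10, is_circular=True)
linear_runs = toy_order_consecutive([0, 1, 8, 9, 4], n_genes=10, is_circular=False)
```

On the circular contig, positions 8, 9, 0 and 1 form a single run; on the linear one they stay in two separate runs.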

property length: int | None

Get the length of the contig

property modules

Get the modules belonging to this contig

Returns:

Modules belonging to this contig

Return type:

Generator[Module, None, None]

property number_of_genes: int

Get the number of genes in the contig

Returns:

the number of genes in the contig

property number_of_rnas: int

Get the number of RNA in the contig

property organism: Organism

Return the organism that the contig belongs to.

Returns:

Organism of the contig

property regions

Get the regions belonging to this contig

Returns:

RGP in the contig

Return type:

Generator[Region, None, None]

remove(position)

Remove a gene by its position

Parameters:

position – Position of the gene in the contig

Raises:

TypeError – Position is not an integer

property spots

Get the spots belonging to this contig

Returns:

Spots in the contig

Return type:

Generator[Spot, None, None]

class ppanggolin.genome.Feature(identifier: str)

Bases: MetaFeatures

This is a general class representing the features Gene and RNA.

Methods:
  • fill_annotations: fills general annotation for child classes.

  • fill_parents: associates the object to an organism and a contig.

  • add_sequence: adds a sequence to the feature.

Fields:
  • ID: Identifier of the feature given by PPanGGOLiN.

  • is_fragment: Boolean value indicating whether the feature is a fragment or not.

  • type: Type of the feature.

  • start: Start position of the feature.

  • stop: Stop position of the feature.

  • strand: Strand associated with the feature.

  • product: Associated product of the feature.

  • name: Name of the feature.

  • local_identifier: Identifier provided by the original file.

  • organism: Parent organism of the feature.

  • contig: Parent contig of the feature.

  • dna: DNA sequence of the feature.

add_sequence(sequence)

Add a sequence to feature

Parameters:

sequence – Sequence corresponding to the feature

Raises:

AssertionError – Sequence must be a string

property contig: Contig

Return contig that Feature belongs to.

Returns:

Contig of the feature

fill_annotations(start: int, stop: int, strand: str, gene_type: str = '', name: str = '', product: str = '', local_identifier: str = '', coordinates: List[Tuple[int, int]] | None = None)

Fill general annotation for child classes

Parameters:
  • start – Start position

  • stop – Stop position

  • coordinates – Start and stop positions as a list of tuples; there can be several tuples for a joined gene

  • strand – associated strand

  • gene_type – Type of gene

  • name – Name of the feature

  • product – Associated product

  • local_identifier – Identifier provided by the original file

Raises:
  • TypeError – If attribute value does not correspond to the expected type

  • ValueError – If strand is not ‘+’ or ‘-’

fill_parents(organism: Organism | None = None, contig: Contig | None = None)

Associate object to an organism and a contig

Parameters:
  • organism – Parent organism

  • contig – Parent contig

property has_joined_coordinates: bool

Whether or not the feature has joined coordinates.

property organism: Organism

Return organism that Feature belongs to.

Returns:

Organism of the feature

property overlaps_contig_edge: bool

Check, based on the coordinates of the feature, whether it appears to overlap the contig edge.

start_relative_to(gene)

stop_relative_to(gene)

string_coordinates() str

Return a string representation of the coordinates

class ppanggolin.genome.Gene(gene_id: str)

Bases: Feature

Stores a gene from the genome as an object, with the information needed for the pangenome.

Methods:
  • fill_annotations: fills general annotation for the gene object and adds additional attributes such as position and genetic code.

  • add_protein: adds the protein sequence corresponding to the translated gene to the object.

Fields:
  • position: the position of the gene in the genome.

  • family: the family that the gene belongs to.

  • RGP: A putative Region of Plasticity that contains the gene.

  • genetic_code: the genetic code associated with the gene.

  • protein: the protein sequence corresponding to the translated gene.

property RGP

Return the RGP that gene belongs to

Returns:

RGP of the Gene

Return type:

Region

add_protein(protein: str)

Add a protein sequence corresponding to translated gene

Parameters:

protein – Protein sequence

Raises:

TypeError – Protein sequence must be a string

property family

Return GeneFamily that Gene belongs to.

Returns:

Gene family of the gene

Return type:

GeneFamily

fill_annotations(position: int | None = None, genetic_code: int = 0, is_partial: bool = False, frame: int = 0, **kwargs)

Fill the gene annotation provided by PPanGGOLiN dependencies

Parameters:
  • position – Gene localization in genome

  • genetic_code – Genetic code associated to gene

  • is_partial – Whether the gene is partial

  • frame – One of ‘0’, ‘1’ or ‘2’. ‘0’ indicates that the first base of the feature is the first base of a codon, ‘1’ that the second base is the first base of a codon, and so on.

  • kwargs – See the Feature.fill_annotations method

Raises:

TypeError – If position or genetic code value is not an integer
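The meaning of the frame parameter can be shown with a short sketch that splits a DNA string into codons. The helper toy_codons is hypothetical and only illustrates the offset convention described above; it is not how PPanGGOLiN translates genes.

```python
from typing import List


def toy_codons(dna: str, frame: int = 0) -> List[str]:
    """Split a DNA string into complete codons, skipping `frame` leading bases."""
    if frame not in (0, 1, 2):
        raise ValueError("frame must be 0, 1 or 2")
    usable = dna[frame:]
    # Ignore a trailing partial codon, if any.
    last_full = len(usable) - len(usable) % 3
    return [usable[i:i + 3] for i in range(0, last_full, 3)]


in_frame = toy_codons("ATGGCCTAA", frame=0)
shifted = toy_codons("CATGGCCTAA", frame=1)
```

Both calls yield the same three codons, because the second sequence carries one extra leading base that frame 1 skips.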

property frame: int

Get the frame of the gene

property module

Get the module that the gene belongs to

Returns:

The module linked to the gene

Return type:

Module

property spot

Get the spot that the gene belongs to

Returns:

the spot linked to the gene

Return type:

Spot

class ppanggolin.genome.Organism(name: str)

Bases: MetaFeatures

Describes the content of a genome along with some information about it.

Methods:

  • families: Returns a set of gene families present in the organism.

  • genes: Returns a generator to get genes in the organism.

  • number_of_genes: Returns the number of genes in the organism.

  • contigs: Returns the values in the contig dictionary from the organism.

  • get_contig: Gets the contig with the given identifier in the organism, adding it if it does not exist.

  • _create_contig: Creates a new contig object and adds it to the contig dictionary.

  • mk_bitarray: Produces a bitarray representing the presence/absence of gene families in the organism using the provided index.

Fields:

  • name: Name of the organism.

  • bitarray: Bitarray representing the presence/absence of gene families in the organism.

add(contig: Contig)

Add a contig to organism

Parameters:

contig – Contig to add to the organism

Raises:

KeyError – If a contig with the given name already exists in the organism

property contigs: Generator[Contig, None, None]

Generator of contigs in the organism

Returns:

Values in contig dictionary from organism

property families

Return the gene families present in the organism

Returns:

Generator of gene families

Return type:

Generator[GeneFamily, None, None]

property genes: Generator[Gene, None, None]

Generator to get genes in the organism

Returns:

Generator of genes

get(name: str) Contig

Get contig with the given identifier in the organism

Parameters:

name – Contig identifier

Returns:

The contig with the given identifier

group_genes_by_partition() Dict[str, Set]

Groups genes based on their family’s named partition and returns a dictionary mapping partition names to sets of genes belonging to each partition.

Returns:

A dictionary containing sets of genes grouped by their family’s named partition.
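The grouping described here amounts to building a dictionary of sets. The following sketch assumes genes are given as (gene ID, partition name) tuples; toy_group_by_partition and the sample partition names are illustrative only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def toy_group_by_partition(genes: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each partition name to the set of gene IDs whose family has that partition."""
    grouped: Dict[str, Set[str]] = defaultdict(set)
    for gene_id, partition_name in genes:
        grouped[partition_name].add(gene_id)
    return dict(grouped)  # return a plain dict so missing keys raise KeyError


grouped = toy_group_by_partition([
    ("g1", "persistent"),
    ("g2", "shell"),
    ("g3", "persistent"),
])
```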

mk_bitarray(index: Dict[Organism, int], partition: str = 'all')

Produces a bitarray representing the presence/absence of families in the organism using the provided index. The bitarray is stored in the bitarray attribute and is of type gmpy2.xmpz.

Parameters:
  • partition – Filters partition

  • index – The index computed by ppanggolin.pangenome.Pangenome.get_fam_index()

Raises:

Exception – Partition is not recognized

property modules

Get all the modules belonging to this genome

Returns:

Generator of modules

Return type:

Generator[Module, None, None]

property number_of_contigs: int

Get number of contigs in organism

Returns:

Number of contigs in organism

number_of_families() int

Get the number of gene families in the organism

Returns:

Number of gene families

number_of_genes() int

Get number of genes in the organism

Returns:

Number of genes

property number_of_modules: int

Get number of modules in organism

Returns:

Number of modules in organism

property number_of_regions: int

Get number of RGP in organism

Returns:

Number of RGP in organism

number_of_rnas() int

Get number of RNAs in the organism

Returns:

Number of RNAs

property number_of_spots: int

Get number of spots in organism

Returns:

Number of spots in organism

property regions

Get all RGPs belonging to this genome

Returns:

Generator of RGPs

Return type:

Generator[Region, None, None]

remove(name: str)

Remove a contig with the given identifier in the organism

Parameters:

name – Contig identifier

property rna_genes: Generator[RNA, None, None]

Generator to get RNA genes in the organism

Returns:

Generator of RNA genes

property spots

Get all spots belonging to this genome

Returns:

Generator of spots

Return type:

Generator[Spot, None, None]

class ppanggolin.genome.RNA(rna_id: str)

Bases: Feature

Save RNA from genome as an Object with some information for Pangenome

Parameters:

rna_id – Identifier of the rna

ppanggolin.main module

ppanggolin.main.cmd_line() Namespace

Manage the command-line arguments given by the user

Returns:

arguments given and readable by PPanGGOLiN

ppanggolin.main.main()

Run the command given by the user, performing some setup and checks along the way.

ppanggolin.metadata module

class ppanggolin.metadata.MetaFeatures

Bases: object

The MetaFeatures class provides methods to access and manipulate metadata in all ppanggolin classes.

Methods:
  • metadata: Generate all metadata from all sources.

  • sources: Generate all metadata sources.

  • get_metadata: Get metadata based on attribute values.

  • max_metadata_by_source: Gets the source with the maximum number of metadata and the corresponding count.

add_metadata(metadata: Metadata, metadata_id: int | None = None) None

Add metadata to metadata getter

Parameters:
  • metadata – metadata value to add for the source

  • metadata_id – metadata identifier

Raises:

AssertionError – If the source or metadata is not of the correct type

del_metadata_by_attribute(**kwargs)

Remove the metadata matching the given attribute values from the feature

del_metadata_by_source(source: str)

Remove a source from the feature

Parameters:

source – Name of the source to delete

Raises:
  • AssertionError – If the source is not of the correct type

  • KeyError – Source does not belong in the MetaFeature

formatted_metadata_dict() Dict[str, List[str]]

Format metadata by combining source and field values.

Given an object with metadata, this function creates a new dictionary where the keys are formatted as ‘source_field’.

Returns:

A dictionary with formatted metadata.

formatted_metadata_dict_to_string(separator: str = '|') Dict[str, str]

Format metadata by combining source and field values.

Given an object with metadata, this function creates a new dictionary where the keys are formatted as ‘source_field’. In some cases, it is possible to have multiple values for the same field, in this situation, values are concatenated with the specified separator.

Parameters:

separator – The separator used to join multiple values for the same field (default is ‘|’).

Returns:

A dictionary with formatted metadata.
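The key format and the concatenation of repeated fields can be sketched as follows. This assumes metadata arrive as (source, fields) pairs; toy_format_metadata and the sample values are hypothetical and do not reproduce the library's internals.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def toy_format_metadata(records: Iterable[Tuple[str, Dict[str, str]]],
                        separator: str = "|") -> Dict[str, str]:
    """Build 'source_field' keys and join repeated values with the separator."""
    collected: Dict[str, List[str]] = defaultdict(list)
    for source, fields in records:
        for field, value in fields.items():
            collected[f"{source}_{field}"].append(str(value))
    return {key: separator.join(values) for key, values in collected.items()}


formatted = toy_format_metadata([
    ("db", {"function": "kinase"}),
    ("db", {"function": "transferase"}),
    ("lab", {"note": "checked"}),
])
```

The two values recorded for the same source and field end up in a single string joined by the separator, while the other key keeps its single value.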

get_metadata(source: str, metadata_id: int | None = None) Metadata

Get metadata from metadata getter by its source and identifier

Parameters:
  • source – source of the metadata

  • metadata_id – metadata identifier

Raises:

KeyError – No metadata with ID or source is found

get_metadata_by_attribute(**kwargs) Generator[Metadata, None, None]

Get metadata by one or more attribute

Returns:

Metadata searched

get_metadata_by_source(source: str) Dict[int, Metadata] | None

Get all the metadata corresponding to the source

Parameters:

source – Name of the source to get

Returns:

Dictionary of the metadata corresponding to the source, keyed by metadata identifier

Raises:

AssertionError – If the source is not of the correct type

has_metadata() bool

Check whether the feature has any metadata associated.

Returns:

True if it has metadata else False

has_source(source: str) bool

Check if the source is in the metadata feature

Parameters:

source – name of the source

Returns:

True if the source is in the metadata feature else False

max_metadata_by_source() Tuple[str, int]

Get the maximum number of metadata for one source

Returns:

Name of the source with the most metadata and the corresponding number of metadata

property metadata: Generator[Metadata, None, None]

Generate metadata in gene families

Returns:

Metadata from all sources

property number_of_metadata: int

Get the number of metadata associated to feature

property sources: Generator[str, None, None]

Get all the metadata sources of the feature

Returns:

Metadata sources

class ppanggolin.metadata.Metadata(source: str, **kwargs)

Bases: object

The Metadata class represents metadata linked to genes, gene families, organisms, regions, spots or modules.

Methods:
  • number_of_attribute: Returns the number of attributes in the Metadata object.

  • get: Returns the value of a specific attribute, or None if the attribute does not exist.

  • fields: Returns a list of all the attributes in the Metadata object.

Fields:
  • source: A string representing the source of the metadata.

  • kwargs: A dictionary of attributes and values representing the metadata. The attributes can be any string, and the values can be any type except None or NaN.

property fields: List[str]

Get all the fields of the metadata

Returns:

List of the fields in the metadata

to_dict() Dict[str, Any]

Get metadata in dict format.
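The behavior described for the Metadata class can be approximated with a small stand-alone model. ToyMetadata is hypothetical, and one point is an assumption rather than documented behavior: here, None and NaN values are silently skipped instead of being stored.

```python
import math
from typing import Any, Dict, List


def _is_missing(value: Any) -> bool:
    """True for None and for float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


class ToyMetadata:
    """Hypothetical stand-in; not ppanggolin.metadata.Metadata."""

    def __init__(self, source: str, **kwargs: Any) -> None:
        self.source = source
        # Assumption: missing values are dropped rather than stored.
        self._fields: Dict[str, Any] = {
            name: value for name, value in kwargs.items() if not _is_missing(value)
        }

    @property
    def fields(self) -> List[str]:
        return list(self._fields)

    def get(self, name: str) -> Any:
        # Returns None when the attribute does not exist.
        return self._fields.get(name)

    def to_dict(self) -> Dict[str, Any]:
        return dict(self._fields)


meta = ToyMetadata("db", function="kinase", score=0.9, comment=None, evalue=float("nan"))
```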

ppanggolin.pangenome module

class ppanggolin.pangenome.Pangenome

Bases: object

This class represents your pangenome. It is the basic unit used by all analyses to access the different elements of your pangenome, such as organisms, contigs, genes or gene families. It has setter and getter methods for most elements in your pangenome; you can use them to add new elements, or to get objects by their identifier and manipulate them directly.

property RNAs: Generator[Gene, None, None]

Generator of RNAs in the pangenome.

Returns:

RNA generator

add_edge(gene1: Gene, gene2: Gene) Edge

Adds an edge between the two gene families that the two given genes belong to.

Parameters:
  • gene1 – The first gene

  • gene2 – The second gene

Returns:

The created Edge

Raises:
  • AssertionError – If the given objects are not Gene instances

  • AttributeError – If the genes are not associated with any family

add_file(pangenome_file: Path, check_version: bool = True)

Links an HDF5 file to the pangenome. If needed elements will be loaded from this file, and anything that is computed will be saved to this file when ppanggolin.formats.writeBinaries.writePangenome() is called.

Parameters:
  • pangenome_file – Path to the HDF5 pangenome file to be either used or created

  • check_version – Check that the PPanGGOLiN version of the pangenome file is compatible with the version of PPanGGOLiN currently in use.

Raises:

AssertionError – If the pangenome_file is not an instance of the Path class

add_gene_family(family: GeneFamily)

Adds the given gene family to the pangenome. If a family with the same name already exists, raises a KeyError.

Parameters:

family – The gene family to add to the pangenome

Raises:
  • KeyError – If a family with the same name already exists

  • Exception – For any unexpected exceptions

add_module(module: Module)

Add the given module to the pangenome

Parameters:

module – Module to add in pangenome

Raises:
  • AssertionError – If module is not a Module object

  • KeyError – If another module with the same name exists in the pangenome

add_organism(organism: Organism)

Adds an organism that did not exist previously in the pangenome if an Organism object is provided. If an organism with the same name exists it will raise an error. If a str object is provided, will return the corresponding organism that has this name OR create a new one if it does not exist.

Parameters:

organism – Organism to add to the pangenome

Raises:
  • AssertionError – If the organism name is not a string

  • KeyError – if the provided organism is already in pangenome

add_region(region: Region)

Add a region to the pangenome

Parameters:

region – Region to add in pangenome

Raises:
  • AssertionError – If region is not a Region object

  • KeyError – If another Region with the same name exists in the pangenome

add_spot(spot: Spot)

Adds the given spot to the pangenome.

Parameters:

spot – Spot which should be added

Raises:
  • AssertionError – If spot is not a Spot object

  • KeyError – If another Spot with the same identifier exists in the pangenome

compute_family_bitarrays(part: str = 'all') Dict[Organism, int]

Based on the index generated by get_org_index, generate a bitarray for each gene family. If the family j is present in the organism with the index i, the bit at position i will be 1. If it is not, the bit will be 0. The bitarrays are gmpy2.xmpz objects.

Parameters:

part – Filter the organism in function of the given partition

Returns:

The index of organisms in pangenome

compute_mod_bitarrays(part: str = 'all') Dict[GeneFamily, int]

Based on the index generated by get_fam_index, generate a bitarray for each gene family present in modules. If the family j is present in the module with the index i, the bit at position i will be 1. If it is not, the bit will be 0. The bitarrays are gmpy2.xmpz objects.

Parameters:

part – Filter the organism in function of the given partition

Returns:

A dictionary with GeneFamily as key and int as value.

compute_org_bitarrays(part='all') Dict[GeneFamily, int]

Based on the index generated by get_fam_index, generate a bitarray for each organism. If the family with the index j is present in the organism, the bit at position j will be 1. If it is not, the bit will be 0. The bitarrays are gmpy2.xmpz objects.

Parameters:

part – Filter the organism in function of the given partition

Returns:

The index of gene families in pangenome

contig_lengths_unavailable() bool

Check if the pangenome has contig lengths unavailable

Returns:

True if contig lengths are unavailable, False otherwise

property contigs: Generator[Contig, None, None]

property edges: Generator[Edge, None, None]

Returns all the edges in the pangenome graph

Returns:

Generator of edges

exact_core_families() Set[GeneFamily]

Retrieves gene families considered as the exact core (present in all organisms).

Returns:

A set containing gene families identified as the exact core.
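The exact core definition, presence in all organisms, can be written as a set comparison. The sketch below uses plain strings for families and organisms; toy_exact_core is a hypothetical helper, not the library's method.

```python
from typing import Dict, Set


def toy_exact_core(family_to_organisms: Dict[str, Set[str]],
                   all_organisms: Set[str]) -> Set[str]:
    """Keep the families whose organism set covers every organism."""
    return {
        family
        for family, organisms in family_to_organisms.items()
        if organisms >= all_organisms  # superset test
    }


core = toy_exact_core(
    {
        "famA": {"o1", "o2", "o3"},
        "famB": {"o1"},
        "famC": {"o1", "o2", "o3"},
    },
    all_organisms={"o1", "o2", "o3"},
)
```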

property gene_families: Generator[GeneFamily, None, None]

Returns all the gene families in the pangenome

Returns:

Generator of gene families

property genes: Generator[Gene, None, None]

Generator of genes in the pangenome.

Returns:

gene generator

get_contig(identifier: int | None = None, name: str | None = None, organism_name: str | None = None) Contig

Returns the contig by its identifier or by its name. If the name is given, the organism name is also needed.

Parameters:
  • identifier – ID of the contig to look for

  • name – The name of the contig to look for

  • organism_name – Name of the organism to which the contig belongs

Returns:

Returns the wanted contig

Raises:
  • AssertionError – If the contig_id is not an integer

  • KeyError – If the contig is not in the pangenome

get_elem_by_metadata(metatype: str, **kwargs) Generator[GeneFamily | Gene | Organism | Region | Spot | Module, None, None]

Get the pangenome elements whose metadata have the expected attributes

Parameters:
  • metatype – Type of pangenome element whose metadata is searched

  • kwargs – attributes to identify metadata

Returns:

Elements with matching metadata

get_elem_by_source(source: str, metatype: str) Generator[GeneFamily | Gene | Contig | Organism | Region | Spot | Module, None, None]

Get the elements of the pangenome that have metadata from a specific source

Parameters:
  • source – Name of the source

  • metatype – Type of pangenome element whose metadata is searched

Returns:

Elements with the source

get_fam_index() Dict[GeneFamily, int]

Creates an index for gene families (each family is assigned an integer).

Returns:

The index of families in pangenome

get_gene(gene_id: str) Gene

Returns the gene that has the given gene ID

Parameters:

gene_id – The gene ID to look for

Returns:

Returns the gene that has the ID gene_id

Raises:
  • AssertionError – If the gene_id is not a string

  • KeyError – If the gene_id is not in the pangenome

get_gene_family(name: str) GeneFamily

Returns the gene family that has the given name

Parameters:

name – The gene family name to look for

Returns:

Returns the gene family that has the name name

Raises:
  • AssertionError – If the name is not a string

  • KeyError – If the name is not corresponding to any family in the pangenome

get_module(module_id: int | str) Module

Returns the module that has the given module ID.

Parameters:

module_id – The module ID to look for. It can be an integer or a string in the format ‘module_<integer>’.

Returns:

The module with the specified ID.

Raises:
  • KeyError – If the module ID does not exist in the pangenome.

  • ValueError – If the provided module ID does not have the expected format.
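The two accepted identifier forms (an integer, or a string in the format 'module_<integer>') can be normalized as in the minimal sketch below. The helper parse_prefixed_id is hypothetical and only illustrates the format check; it is not the library's code.

```python
def parse_prefixed_id(identifier, prefix: str) -> int:
    """Accept an int or a string such as '<prefix>_<integer>' and return the integer part."""
    if isinstance(identifier, int):
        return identifier
    head, sep, tail = str(identifier).rpartition("_")
    if sep and head == prefix and tail.isdigit():
        return int(tail)
    raise ValueError(
        f"'{identifier}' is not in the expected format '{prefix}_<integer>'"
    )


module_number = parse_prefixed_id("module_12", "module")
print(module_number)  # 12
```

The same check applies to spot identifiers with the prefix 'spot'.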

get_multigenics(dup_margin: float, persistent: bool = True) Set[GeneFamily]

Returns the multigenic persistent families of the pangenome graph. A family will be considered multigenic if it is duplicated in more than dup_margin of the genomes where it is present.

Parameters:
  • dup_margin – The ratio of presence in multicopy above which a gene family is considered multigenic

  • persistent – Whether to consider only the persistent families

Returns:

Set of gene families considered multigenic
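The dup_margin criterion can be illustrated with the sketch below. It assumes a hypothetical input that maps each family name to the number of gene copies per genome, and a strict comparison ("more than dup_margin"); the actual implementation works on GeneFamily objects and may differ in detail.

```python
from typing import Dict, Set


def multigenic_families(copies: Dict[str, Dict[str, int]], dup_margin: float) -> Set[str]:
    """copies maps family -> {genome: number of gene copies in that genome}."""
    result = set()
    for family, per_genome in copies.items():
        # Only genomes where the family is present count in the ratio.
        present = [n for n in per_genome.values() if n > 0]
        duplicated = sum(1 for n in present if n > 1)
        if present and duplicated / len(present) > dup_margin:
            result.add(family)
    return result


copies = {
    "famA": {"g1": 2, "g2": 2, "g3": 1},  # duplicated in 2 of 3 genomes
    "famB": {"g1": 1, "g2": 1, "g3": 2},  # duplicated in 1 of 3 genomes
}
multigenics = multigenic_families(copies, dup_margin=0.5)
```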

get_org_index() Dict[Organism, int]

Creates an index for organisms (each organism is assigned an integer).

Returns:

The index of organisms in pangenome

get_organism(name: str) Organism

Get an organism that is expected to be in the pangenome using its name, which is supposedly unique. Raises an error if the organism does not exist.

Parameters:

name – Name of the Organism to get

Returns:

The related Organism object

Raises:
  • AssertionError – If the organism name is not a string

  • KeyError – If the provided name is not an organism in the pangenome

get_region(name: str) Region

Returns the region with the given name.

Parameters:

name – The name of the region to return

Returns:

The region

Raises:
  • AssertionError – If the RGP name is not a string

  • KeyError – If the provided name is not a RGP in the pangenome

get_single_copy_persistent_families(dup_margin: float, exclude_fragments: bool) Set[GeneFamily]

Retrieves gene families that are both persistent and single copy based on the provided criteria.

Parameters:
  • dup_margin – The maximum allowed duplication margin for a gene family to be considered single copy.

  • exclude_fragments – A boolean indicating whether to exclude fragments when determining single copy families.

Returns:

A set containing gene families that are both persistent and single copy.

get_spot(spot_id: int | str) Spot

Returns the spot that has the given spot ID.

Parameters:

spot_id – The spot ID to look for. It can be an integer or a string in the format ‘spot_<integer>’.

Returns:

The spot with the specified ID.

Raises:
  • KeyError – If the spot ID does not exist in the pangenome.

  • ValueError – If the provided spot ID does not have the expected format.

has_metadata() bool

Whether or not the pangenome has metadata associated with any of its elements.

property max_fam_id

Get the last family identifier

metadata(metatype: str) Generator[Metadata, None, None]

Create a generator of all the metadata in the pangenome

Parameters:

metatype – Type of pangenome element for which metadata should be generated

Returns:

Generator of metadata

metadata_sources(metatype: str) Set[str]

Returns all the metadata sources in the pangenome

Parameters:

metatype – Type of pangenome element whose metadata should be searched

Returns:

Set of metadata sources

Raises:

AssertionError – If metatype is not a string

property modules: Generator[Module, None, None]

Generate modules in the pangenome

property number_of_contigs: int

Returns the number of contigs present in the pangenome

Returns:

The number of contigs

property number_of_edges: int

Returns the number of edges present in the pangenome

Returns:

The number of edges

property number_of_gene_families: int

Returns the number of gene families present in the pangenome

Returns:

The number of gene families

property number_of_genes: int

Returns the number of genes present in the pangenome

Returns:

The number of genes

property number_of_modules: int

Returns the number of modules present in the pangenome

Returns:

The number of modules

property number_of_organisms: int

Returns the number of organisms present in the pangenome

Returns:

The number of organisms

property number_of_rgp: int

Returns the number of regions of genomic plasticity (RGPs) present in the pangenome

Returns:

The number of RGPs

property number_of_rnas: int

Returns the number of RNAs present in the pangenome

Returns:

The number of RNAs

property number_of_spots: int

Returns the number of spots present in the pangenome

Returns:

The number of spots

property organisms: Generator[Organism, None, None]

Returns all the organisms in the pangenome

Returns:

Generator of ppanggolin.genome.Organism objects

property regions: Generator[Region, None, None]

Returns all the regions (RGPs) in the pangenome

Returns:

Generator of RGPs

select_elem(metatype: str)

Get all the elements for the given metatype

Parameters:

metatype – Name of the pangenome component to retrieve

Returns:

All elements from pangenome for the metatype

Raises:
  • AssertionError – If metatype is not a string

  • KeyError – If metatype is not recognized

soft_core_families(soft_core_threshold: float) Set[GeneFamily]

Retrieves gene families considered part of the soft core based on the provided threshold.

Parameters:

soft_core_threshold – The threshold to determine the minimum fraction of organisms required for a gene family to be considered part of the soft core.

Returns:

A set containing gene families identified as part of the soft core.
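As an illustration of the threshold, the sketch below selects families from a hypothetical mapping of family name to the set of organisms containing it. It assumes an inclusive comparison (present in at least the given fraction of organisms); with a threshold of 1.0 it reduces to the exact core.

```python
from typing import Dict, Set


def soft_core(presence: Dict[str, Set[str]], n_organisms: int, threshold: float) -> Set[str]:
    """Return the families found in at least `threshold` (a fraction) of the organisms."""
    return {
        family
        for family, organisms in presence.items()
        if len(organisms) / n_organisms >= threshold
    }


presence = {
    "f1": {"o1", "o2", "o3", "o4"},
    "f2": {"o1", "o2", "o3"},
    "f3": {"o1"},
}
soft = soft_core(presence, n_organisms=4, threshold=0.75)
exact = soft_core(presence, n_organisms=4, threshold=1.0)
```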

property spots: Generator[Spot, None, None]

Generate spots in the pangenome

Returns:

Spot generator

ppanggolin.region module

class ppanggolin.region.GeneContext(gc_id: int, families: Set[GeneFamily] | None = None, families_of_interest: Set[GeneFamily] | None = None)

Bases: object

Represent a gene context which is a collection of gene families related to a specific genomic context.

Methods:
  • families: Generator that yields all the gene families in the gene context.

  • add_context_graph: Add a context graph corresponding to the gene context.

  • add_family: Add a gene family to the gene context.

Fields:
  • gc_id: The identifier of the gene context.

  • graph: Context graph corresponding to the gene context.

add_family(family: GeneFamily)

Add a gene family to the gene context.

Parameters:

family – The gene family to add.

property families: Generator[GeneFamily, None, None]

Generator of the families in the context

Returns:

Gene families belonging to the context

property graph
class ppanggolin.region.Module(module_id: int, families: set | None = None)

Bases: MetaFeatures

The Module class represents a module in a pangenome analysis.

Attributes:
  • ID: An integer identifier for the module.

  • bitarray: A bitarray representing the presence/absence of the gene families in an organism.

Methods:
  • families: Returns a generator that yields the gene families in the module.

  • mk_bitarray: Generates a bitarray representing the presence/absence of the gene families in an organism using the provided index.

add(family: GeneFamily)

Add a family to the module. More readable alias for __setitem__

Parameters:

family – Family to add to the module

Raises:

TypeError – If family is not an instance of GeneFamily

property families: Generator[GeneFamily, None, None]

Generator of the families in the module

Returns:

Families belonging to the module

get(name: str) GeneFamily

Get a family by its name. More readable alias for __getitem__

Parameters:

name – Name of the family

Returns:

Wanted family

mk_bitarray(index: Dict[GeneFamily, int], partition: str = 'all')

Produces a bitarray representing the presence/absence of families in the organism using the provided index. The bitarray is stored in the bitarray attribute and is a gmpy2.xmpz type.

Parameters:
  • partition – Filter module by partition

  • index – The index computed by ppanggolin.pangenome.Pangenome.get_fam_index()

property organisms: Generator[Organism, None, None]

Returns all the Organisms that have this module

Returns:

Organisms that have this module

remove(name: str)

Remove a family by its name. More readable alias for __delitem__

Parameters:

name – Name of the family

class ppanggolin.region.Region(name: str)

Bases: MetaFeatures

The ‘Region’ class represents a region of genomic plasticity.

Methods:
  • ‘genes’: the property that generates the genes in the region as they are ordered in contigs.

  • ‘families’: the property that generates the gene families in the region.

  • ‘length’: the property that gets the length of the region.

  • ‘organism’: the property that gets the organism linked to the region.

  • ‘contig’: the property that gets the starter contig linked to the region.

  • ‘is_whole_contig’: the property that indicates if the region is an entire contig.

  • ‘is_contig_border’: the property that indicates if the region is bordering a contig.

  • ‘get_rnas’: the method that gets the RNA in the region.

  • ‘get_bordering_genes’: the method that gets the genes bordering the region.

Fields:
  • ‘name’: the name of the region.

  • ‘score’: the score of the region.

  • ‘starter’: the first gene in the region.

  • ‘stopper’: the last gene in the region.

add(gene: Gene)

Add a gene to the region

Parameters:

gene – Gene to add

property contig: Contig

Get the starter contig linked to the RGP

Returns:

Contig corresponding to the region

property coordinates: List[Tuple[int]]

Return the coordinates of the region

Returns:

Coordinates of the region

property families: Generator[GeneFamily, None, None]

Get the gene families in the RGP

Returns:

Gene families

property genes: Generator[Gene, None, None]

Generate the genes as they are ordered in contigs

Returns:

Genes in the region

get(position: int) Gene

Get a gene by its position

Parameters:

position – Position of the gene in the contig

Returns:

Wanted gene

Raises:

TypeError – Position is not an integer

get_bordering_genes(n: int, multigenics: Set[GeneFamily], return_only_persistents: bool = True) List[List[Gene], List[Gene]]

Get the genes bordering the region. Find the n persistent and single-copy genes bordering the region. If return_only_persistents is False, the method returns all genes included between the n single-copy persistent genes.

Parameters:
  • n – Number of genes to get

  • multigenics – pangenome graph multigenic persistent families

  • return_only_persistents – Return only the non-multigenic persistent genes bordering the region. If False, return all genes included between the borders made of n persistent and single-copy genes around the region.

Returns:

A list of bordering genes in start and stop position

get_ordered_genes() List[Gene]

Get ordered genes of the region, taking into account the circularity of contigs.

Returns:

A list of genes ordered by their positions in the region.

id_counter = 0
identify_rgp_last_and_first_genes()

Identify first and last genes of the rgp by taking into account the circularity of contigs.

Set the attributes _starter: first gene of the region and _stopper: last gene of the region and _coordinates

property is_contig_border: bool

Indicates if the region is bordering a contig

Returns:

True if bordering else False

Raises:

AssertionError – If the region contains no genes, which is not expected

property is_whole_contig: bool

Indicates if the region is an entire contig

Returns:

True if whole contig else False

property length

Get the length of the region

Returns:

Size of the region

property modules: Set[Module]

Get the modules of gene families in the RGP

Returns:

Modules found in families of the RGP

property number_of_families: int

Get the number of different gene families in the region

Returns:

Number of families

property organism: Organism

Get the organism linked to the RGP

Returns:

Organism corresponding to the region

property overlaps_contig_edge: bool
remove(position)

Remove a gene by its position

Parameters:

position – Position of the gene in the contig

Raises:

TypeError – Position is not an integer

property spot: Spot | None
property start: int

Get the start position of the starter gene of the RGP

Returns:

start position in the contig of the first gene of the RGP

property starter: Gene

Return the first gene of the region. If this gene is not identified, it is identified first.

Returns:

First gene of the region

property stop: int

Get the stop position of the stopper gene of the RGP

Returns:

stop position in the contig of the last gene of the RGP

property stopper: Gene

Return the last gene of the region. If this gene is not identified, it is identified first.

Returns:

Last gene of the region

string_coordinates() str

Return a string representation of the coordinates

class ppanggolin.region.Spot(spot_id: int)

Bases: MetaFeatures

The ‘Spot’ class represents a spot of insertion, which groups regions of genomic plasticity (RGPs).

Methods:
  • ‘regions’: the property that generates the regions in the spot.

  • ‘families’: the property that generates the gene families in the spot.

  • ‘spot_2_families’: add to Gene Families a link to spot.

  • ‘borders’: Extracts all the borders of all RGPs belonging to the spot

  • ‘get_uniq_to_rgp’: Get dictionary with a representative RGP as key, and all identical RGPs as value

  • ‘get_uniq_ordered_set’: Get an Iterable of all the unique syntenies in the spot

  • ‘get_uniq_content’: Get an Iterable of all the unique rgp (in terms of gene family content) in the spot

  • ‘count_uniq_content’: Get a counter of uniq RGP and number of identical RGP (in terms of gene family content)

  • ‘count_uniq_ordered_set’: Get a counter of uniq RGP and number of identical RGP (in terms of synteny content)

Fields:
  • ‘ID’: Identifier of the spot

add(region: Region)

Add a region to the spot. More readable alias for __setitem__

Parameters:

region – Region to add to the spot

Raises:

TypeError – If region is not an instance of Region

borders(set_size: int, multigenics) List[List[int, List[GeneFamily], List[GeneFamily]]]

Extracts all the borders of all RGPs belonging to the spot

Parameters:
  • set_size – Number of genes to get

  • multigenics – pangenome graph multigenic persistent families

Returns:

Families bordering the spot

count_uniq_content() dict

Get a counter of uniq RGP and number of identical RGP (in terms of gene family content)

Returns:

Dictionary with a representative rgp as the key and number of identical rgp as value

count_uniq_ordered_set()

Get a counter of uniq RGP and number of identical RGP (in terms of synteny content)

Returns:

Dictionary with a representative rgp as the key and number of identical rgp as value
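The difference between the two counters can be sketched as below, with RGPs reduced to hypothetical lists of family names. Content identity is modelled with a frozenset and synteny with a tuple; note that a tuple treats reversed orders as different, which may not match how the library compares syntenies.

```python
from typing import Dict, Hashable


def count_identical(rgp_to_key: Dict[str, Hashable]) -> Dict[str, int]:
    """Map the first RGP seen for each key to the number of RGPs sharing that key."""
    representative: Dict[Hashable, str] = {}
    counts: Dict[str, int] = {}
    for rgp, key in rgp_to_key.items():
        # The first RGP seen with a given key becomes its representative.
        rep = representative.setdefault(key, rgp)
        counts[rep] = counts.get(rep, 0) + 1
    return counts


families = {"rgp1": ["A", "B"], "rgp2": ["B", "A"], "rgp3": ["C"]}
by_content = count_identical({r: frozenset(f) for r, f in families.items()})
by_order = count_identical({r: tuple(f) for r, f in families.items()})
```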

property families: Generator[GeneFamily, None, None]

Get the gene families in the spot

Returns:

Families in the spot

get(name: str) Region

Get a region by its name. More readable alias for __getitem__

Parameters:

name – Name of the region

Returns:

Wanted region

get_uniq_content() Set[Region]

Get an Iterable of all the unique rgp (in terms of gene family content) in the spot

Returns:

Iterable of all the unique rgp (in terms of gene family content) in the spot

get_uniq_ordered_set() Set[Region]

Get an Iterable of all the unique syntenies in the spot

Returns:

Iterable of all the unique syntenies in the spot

get_uniq_to_rgp() Dict[Region, Set[Region]]

Get a dictionary with a representative RGP as the key, and all identical RGPs as value

Returns:

Dictionary with a representative RGP as the key, and set of identical RGPs as value

property number_of_families: int

Get the number of different families in the spot

Returns:

Number of families

property regions: Generator[Region, None, None]

Generates the regions in the spot

Returns:

Regions in the spot

remove(name: str)

Remove a region by its name. More readable alias for __delitem__

Parameters:

name – Name of the region

spot_2_families()

Add to Gene Families a link to spot

ppanggolin.utils module

ppanggolin.utils.add_common_arguments(subparser: ArgumentParser)

Add common arguments to the input subparser.

Parameters:

subparser – A subparser object from any subcommand.

ppanggolin.utils.add_gene(obj, gene, fam_split: bool = True)
Parameters:
  • obj

  • gene

  • fam_split

ppanggolin.utils.check_config_consistency(config: dict, workflow_steps: list)

Check that a parameter used in several subcommands of a workflow has the same value in each of them.

If not, the function emits a warning through logging.getLogger(“PPanGGOLiN”).

Parameters:
  • config – config dict with as key the section of the config file and as value another dict pairing name and value of parameters.

  • workflow_steps – list of subcommand names used in the workflow execution.

ppanggolin.utils.check_input_files(file: Path, check_tsv: bool = False)

Checks if the provided input files exist and are of the proper format

Parameters:
  • file – Path to the file

  • check_tsv – Allow checking tsv file for annotation or fasta list

ppanggolin.utils.check_log(log_file: str) TextIO

Check if the output log is writable

Parameters:

log_file – Path to the log output

Returns:

output for log

ppanggolin.utils.check_option_workflow(args)

Check if the given argument to a workflow command is usable

Parameters:

args – list of arguments

ppanggolin.utils.check_tools_availability(tool_to_description: Dict[str, str] | List[str]) dict[str, bool]

Check if the given command-line tools are available in the system’s PATH.

Parameters:

tool_to_description – A dictionary where keys are tool names and values are descriptions of their purpose, or a list of tool names.

Returns:

A dictionary with tool names as keys and boolean values indicating availability.
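A PATH lookup of this kind can be done with the standard library's shutil.which. The sketch below is an assumption about the approach, not the library's code, and the function name tools_available is hypothetical.

```python
import shutil
from typing import Dict, Iterable


def tools_available(tools: Iterable[str]) -> Dict[str, bool]:
    """Return {tool name: True if an executable with that name is found in PATH}."""
    # Iterating over a dict yields its keys, so a {tool: description} dict also works.
    return {tool: shutil.which(tool) is not None for tool in tools}


status = tools_available(["no-such-tool-xyz123"])
print(status)
```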

ppanggolin.utils.check_translation_table_to_use(pangenome: Pangenome, is_user_specified: bool, user_translation_table: int) int

Determine the translation table to use based on what has been used previously and user input.

This function implements the following priority logic:

  1. If the user explicitly specified a translation table, it takes precedence (with a warning if it differs from previous usage)

  2. If a translation table was used previously in the pangenome, use it

  3. Otherwise, use the default/user-provided translation table

Parameters:
  • pangenome – Pangenome object containing information from previous analysis steps, including the translation table used if available

  • is_user_specified – Whether the translation table was explicitly specified by the user

  • user_translation_table – The translation table value provided by the user (default or explicitly specified)

Returns:

The translation table to use for translation
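The priority logic can be sketched as follows. The previously used table is passed as a plain optional integer here (a simplification of the Pangenome argument), and the function and logger names are hypothetical.

```python
import logging
from typing import Optional


def choose_translation_table(previous: Optional[int], is_user_specified: bool, user_table: int) -> int:
    """Return the table to use: explicit user choice, then previous usage, then default."""
    if is_user_specified:
        if previous is not None and previous != user_table:
            logging.getLogger("sketch").warning(
                "Translation table %s differs from the one used previously (%s)",
                user_table, previous,
            )
        return user_table
    if previous is not None:
        return previous
    # Nothing was used before: user_table holds the default value.
    return user_table


explicit = choose_translation_table(previous=4, is_user_specified=True, user_table=11)
inherited = choose_translation_table(previous=4, is_user_specified=False, user_table=11)
default = choose_translation_table(previous=None, is_user_specified=False, user_table=11)
```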

ppanggolin.utils.check_tsv_sanity(tsv_file: Path)

Check if the given TSV file is readable for the next PPanGGOLiN step.

Parameters:

tsv_file – Path to the TSV containing organism information.

Raises:

ValueError – If the file format is incorrect or contains invalid genome names.

ppanggolin.utils.check_version_compatibility(file_version: str) None

Checks the compatibility of the provided pangenome file version with the current PPanGGOLiN version.

Parameters:

file_version – A string representing the version of the pangenome file.

ppanggolin.utils.combine_args(args: Namespace, another_args: Namespace)

Combine two args object.

Parameters:
  • args – initial arguments.

  • another_args – another args

Returns:

object with combined arguments

ppanggolin.utils.connected_components(g: Graph, removed: set, weight: float)

Yields subgraphs of each connected component you get when filtering edges based on the given weight.

Parameters:
  • g – Subgraph

  • removed – removed node

  • weight – threshold to remove node or not

ppanggolin.utils.create_tmpdir(main_dir, basename='tmpdir', keep_tmp=False)
ppanggolin.utils.delete_unspecified_args(args: Namespace)

Delete the arguments with None values from the given argparse.Namespace.

Parameters:

args – arguments to filter.

ppanggolin.utils.detect_filetype(filename: Path) str

Detects whether the current file is gff3, gbk/gbff, fasta, tsv or unknown. If unknown, it will raise an error

Parameters:

filename – path to file

Returns:

current file type

ppanggolin.utils.erase_default_value(parser: ArgumentParser)

Remove the default values from the actions of the given argument parser.

This is done to distinguish the arguments that were explicitly specified.

Parameters:

parser – An argparse.ArgumentParser object with default values to erase.

ppanggolin.utils.extract_contig_window(contig_size: int, positions_of_interest: Iterable[int], window_size: int, is_circular: bool = False)

Extracts contiguous windows around positions of interest within a contig.

Parameters:
  • contig_size – Number of genes in contig.

  • positions_of_interest – An iterable containing the positions of interest.

  • window_size – The size of the window to extract around each position of interest.

  • is_circular – Indicates if the contig is circular.

Returns:

Yields tuples representing the start and end positions of each contiguous window.

ppanggolin.utils.find_consecutive_sequences(sequence: List[int]) List[List[int]]

Find consecutive sequences in a list of integers.

Parameters:

sequence – The input list of integers.

Returns:

A list of lists containing consecutive sequences of integers.
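Grouping integers into consecutive runs can be sketched as below. The sketch sorts its input first, which is an assumption; the library function may expect an already ordered list.

```python
from typing import List


def consecutive_runs(sequence: List[int]) -> List[List[int]]:
    """Split integers into runs in which each value follows the previous one by 1."""
    runs: List[List[int]] = []
    for value in sorted(sequence):
        if runs and value == runs[-1][-1] + 1:
            runs[-1].append(value)  # extend the current run
        else:
            runs.append([value])  # start a new run
    return runs


runs = consecutive_runs([1, 2, 3, 7, 8, 10])
print(runs)  # [[1, 2, 3], [7, 8], [10]]
```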

ppanggolin.utils.find_region_border_position(region_positions: List[int], contig_gene_count: int) Tuple[int, int]

Find the start and stop integers of the region considering circularity of the contig.

Parameters:
  • region_positions – List of positions that belong to the region.

  • contig_gene_count – Number of genes in the contig. The contig is considered circular.

Returns:

A tuple containing the start and stop integers of the region.
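On a circular contig, a region may wrap around the origin, so its start can be larger than its stop. A minimal sketch of one way to find the borders is given below; the function name, the error handling and the whole-contig convention are assumptions made for the illustration.

```python
from typing import List, Tuple


def region_border(region_positions: List[int], contig_gene_count: int) -> Tuple[int, int]:
    """Return (start, stop) of a contiguous region on a circular contig."""
    members = set(region_positions)
    if len(members) == contig_gene_count:
        # Assumption: a region covering the whole contig spans 0 to the last position.
        return 0, contig_gene_count - 1
    ordered = sorted(members)
    # The start is the position whose circular predecessor is outside the region.
    starts = [p for p in ordered if (p - 1) % contig_gene_count not in members]
    # The stop is the position whose circular successor is outside the region.
    stops = [p for p in ordered if (p + 1) % contig_gene_count not in members]
    if len(starts) != 1 or len(stops) != 1:
        raise ValueError("positions do not form a single contiguous region")
    return starts[0], stops[0]


wrapped = region_border([0, 1, 8, 9], contig_gene_count=10)
linear = region_border([3, 4, 5], contig_gene_count=10)
```

Here the region made of positions 8, 9, 0 and 1 starts at 8 and stops at 1.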

ppanggolin.utils.flatten_nested_dict(nested_dict: Dict[str, Dict | int | str | float]) Dict[str, int | str | float]

Flattens a nested dictionary into a flat dictionary by concatenating keys at different levels.

Parameters:

nested_dict – The nested dictionary to be flattened.

Returns:

A flat dictionary with concatenated keys.
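Key concatenation can be sketched with a short recursive function. The separator used by the library is not stated here, so the "." default below is an assumption, as are the function and parameter names.

```python
from typing import Any, Dict


def flatten(nested: Dict[str, Any], parent_key: str = "", sep: str = ".") -> Dict[str, Any]:
    """Flatten nested dicts, joining the keys of each level with `sep`."""
    flat: Dict[str, Any] = {}
    for key, value in nested.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))  # recurse into sub-dicts
        else:
            flat[full_key] = value
    return flat


flat = flatten({"a": {"b": 1, "c": {"d": "x"}}, "e": 2.5})
print(flat)  # {'a.b': 1, 'a.c.d': 'x', 'e': 2.5}
```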

ppanggolin.utils.get_arg_name(arg_val: str | TextIOWrapper) str | TextIOWrapper

Returns the name of a file if the argument is a TextIOWrapper object, otherwise returns the argument value.

Parameters:

arg_val – Either a string or a TextIOWrapper object.

Returns:

Either a string or a TextIOWrapper object, depending on the type of the input argument.

ppanggolin.utils.get_arg_names_from_namespace(args_namespace: Namespace) List[str]

Extract all argument names from a Namespace object, excluding private attributes.

Parameters:

args_namespace – An argparse Namespace object

Returns:

List of argument names (excluding those starting with ‘_’)

ppanggolin.utils.get_args_differing_from_default(default_args: Namespace, final_args: Namespace, param_to_ignore: List[str] | Set[str] | None = None) dict

Get the parameters whose values differ from the default values.

Parameters:
  • default_args – default arguments

  • final_args – final arguments to compare with default

  • param_to_ignore – list of params to ignore.

Returns:

A dict with the params that differ from default as keys and the final value of each param as value

ppanggolin.utils.get_cli_args(subparser_fct: Callable) Namespace

Parse command line arguments using the specified parsing function.

Parameters:

subparser_fct – Subparser function to use. This subparser gives the expected arguments for the subcommand.

ppanggolin.utils.get_config_args(subcommand: str, subparser_fct: Callable, config_dict: dict, config_section: str, expected_params: List[str] | Set[str], strict_config_check: bool) Namespace

Parse the parameters of a specific section of the config file.

Parameters that are not specified in the config are not added to the args object.

Parameters:
  • subcommand – Name of the ppanggolin subcommand.

  • subparser_fct – Subparser function to use. This subparser gives the expected arguments for the subcommand.

  • config_dict – config dict with as key the section of the config file and as value another dict pairing name and value of parameters.

  • config_section – Which section to parse in config file.

  • expected_params – List of arguments to expect in the parser. If the parser has other arguments, these arguments are filtered out.

  • strict_config_check – If set to True, an error is raised when a parameter found in the config is not in the expected_params list.

Returns:

Arguments parsed from the config

ppanggolin.utils.get_consecutive_region_positions(region_positions: List[int], contig_gene_count: int) List[List[int]]

Order the integer positions of the region considering the circularity of the contig.

Parameters:
  • region_positions – List of positions that belong to the region.

  • contig_gene_count – Number of genes in the contig. The contig is considered circular.

Returns:

An ordered list of integers of the region.

Raises:

ValueError – If unexpected conditions are encountered.

ppanggolin.utils.get_default_args(subcommand: str, subparser_fct: Callable, unwanted_args: list | None = None) Namespace

Get the default values of the arguments for the given subparser function.

Parameters:
  • subcommand – Name of the ppanggolin subcommand.

  • subparser_fct – Subparser function to use. This subparser gives the expected arguments for the subcommand.

  • unwanted_args – List of arguments to filter out.

Returns:

arguments with default values.

ppanggolin.utils.get_major_version(version: str) int

Extracts the major version number from a version string.

Parameters:

version – A string representing the version number.

Returns:

The major version extracted from the input version string.

Raises:

ValueError – If the input version does not have the expected format.
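Extracting the major version amounts to reading the first dot-separated field. The sketch below assumes a version string such as "2.0.5"; the exact format check performed by the library is not specified here, and the function name is hypothetical.

```python
def major_version(version: str) -> int:
    """Return the integer before the first '.' of a version string."""
    head = version.split(".")[0]
    try:
        return int(head)
    except ValueError as err:
        raise ValueError(f"Version '{version}' does not have the expected format") from err


major = major_version("2.0.5")
print(major)  # 2
```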

ppanggolin.utils.get_subcommand_parser(subparser_fct: Callable, name: str = '') Tuple[_SubParsersAction, ArgumentParser]

Get subcommand parser object using the given subparser function.

Common arguments are also added to the parser object.

Parameters:
  • subparser_fct

  • name – Name of section to add more info in the parser in case of error.

Returns:

The parser and subparser objects

ppanggolin.utils.has_non_ascii(string_to_test: str | Collection[str]) bool

Check if a string or a collection of strings contains any non-ASCII characters.

Parameters:

string_to_test – A single string or a collection of strings to check.

Returns:

True if any string contains non-ASCII characters, False otherwise.
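Such a check can be written with str.isascii, available since Python 3.7. The sketch below handles both a single string and a collection of strings; the function name is hypothetical.

```python
from typing import Collection, Union


def contains_non_ascii(value: Union[str, Collection[str]]) -> bool:
    """Return True if the string, or any string of the collection, has a non-ASCII character."""
    strings = [value] if isinstance(value, str) else value
    return any(not s.isascii() for s in strings)


plain = contains_non_ascii("genome_1")
accented = contains_non_ascii(["ok", "G\u00e9nome"])  # "\u00e9" is an e with an acute accent
```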

ppanggolin.utils.is_compressed(file_or_file_path: Path | BinaryIO | TextIOWrapper | TextIO) Tuple[bool, str | None]

Detects if a file is compressed based on its file signature.

Parameters:

file_or_file_path – The file to check.

Returns:

A tuple whose first element is True if the file is in a recognized compressed format, and whose second element is the name of that format; (False, None) otherwise.

Raises:

TypeError – If the file type is not supported.
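Detection by file signature compares the first bytes of a file with known magic numbers. The sketch below works on a bytes header and covers gzip, bzip2 and zip; which formats the library actually recognizes is an assumption, and the names are hypothetical.

```python
import bz2
import gzip
from typing import Optional, Tuple

# Magic numbers found at the start of common compressed formats.
SIGNATURES = {
    b"\x1f\x8b": "gzip",
    b"BZh": "bz2",
    b"PK\x03\x04": "zip",
}


def sniff_compression(header: bytes) -> Tuple[bool, Optional[str]]:
    """Return (True, format name) if `header` starts with a known signature."""
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return True, name
    return False, None


gz_result = sniff_compression(gzip.compress(b"ACGT"))
bz_result = sniff_compression(bz2.compress(b"ACGT"))
text_result = sniff_compression(b">seq1\nACGT\n")
```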

ppanggolin.utils.jaccard_similarities(mat: csc_matrix, jaccard_similarity_th) csc_matrix

Compute the jaccard similarities

Parameters:
  • mat

  • jaccard_similarity_th – threshold

Returns:

ppanggolin.utils.manage_cli_and_config_args(subcommand: str, config_file: str, subcommand_to_subparser: dict) Namespace

Manage command line and config arguments for the given subcommand.

This function parses arguments from the command line and the config file and applies the following priority: cli > config > default. When the subcommand is a workflow, the subcommands used in the workflow are also parsed in the config.

Parameters:
  • subcommand – Name of the subcommand.

  • config_file – Path to the config file given in argument. If None, only default and cli argument values are used.

  • subcommand_to_subparser – Dict with subcommand name as key and the corresponding subparser function as value.

ppanggolin.utils.min_one(x) int

Check that the given int is at least one

Parameters:

x – int given by the user

Returns:

given int if it is acceptable

Raises:

argparse.ArgumentTypeError – The int is not acceptable

ppanggolin.utils.mk_file_name(basename: str, output: Path, force: bool = False) Path

Returns a usable filename for a ppanggolin output file, or raises an error.

Parameters:
  • basename – basename for the file

  • output – Path to save the file

  • force – Force to write the file

Returns:

Path to the file

ppanggolin.utils.mk_outdir(output: Path, force: bool = False, exist_ok: bool = False)

Create a directory at the given output if it doesn’t exist already

Parameters:
  • output – Path where to create directory

  • force – Force to write in the directory

  • exist_ok – Does not give an error if the directory already exists.

Raises:

FileExistsError – The path already exists and force is False

ppanggolin.utils.overwrite_args(default_args: Namespace, config_args: Namespace, cli_args: Namespace)

Overwrite args objects.

When arguments are given in CLI, their values are used instead of the ones found in the config file. When arguments are specified in the config file, they overwrite default values.

Parameters:
  • default_args – default arguments

  • config_args – arguments parsed from the config file

  • cli_args – arguments parsed from the command line

Returns:

final arguments
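The priority order can be sketched with successive dictionary updates. The sketch assumes that the config and CLI namespaces only contain the parameters that were explicitly specified; the function name merge_args is hypothetical.

```python
from argparse import Namespace


def merge_args(default_args: Namespace, config_args: Namespace, cli_args: Namespace) -> Namespace:
    """Combine namespaces so that cli values override config values, which override defaults."""
    merged = dict(vars(default_args))
    merged.update(vars(config_args))  # config overrides defaults
    merged.update(vars(cli_args))  # cli overrides config
    return Namespace(**merged)


final = merge_args(
    Namespace(cpu=1, output="out", identity=0.8),
    Namespace(cpu=4, identity=0.9),
    Namespace(cpu=8),
)
print(final)
```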

ppanggolin.utils.parse_config_file(yaml_config_file: str) dict

Parse yaml config file.

Parameters:

yaml_config_file – config file in yaml

Returns:

dict of the config, with the command as key and, as value, another dict pairing parameter names and values.

ppanggolin.utils.parse_input_paths_file(path_list_file: Path) Dict[str, Dict[str, Path | List[str]]]

Parse an input paths file to extract genome information.

This function reads an input paths file, which is in TSV format, and extracts genome information including file paths and putative circular contigs.

Parameters:

path_list_file – The path to the input paths file.

Returns:

A dictionary where keys are genome names and values are dictionaries containing path information and putative circular contigs.

Raises:
  • FileNotFoundError – If a specified genome file path does not exist.

  • Exception – If there are no genomes in the provided file.

ppanggolin.utils.read_compressed_or_not(file_or_file_path: Path | BinaryIO | TextIOWrapper | TextIO) TextIOWrapper | BinaryIO | TextIO

Opens and reads a file, decompressing it if necessary.

Parameters:

file_or_file_path – The file to read. It can be a pathlib.Path, an io.BytesIO, an io.TextIOWrapper, or an io.TextIOBase object.

Returns:

The contents of the file, decompressed if it was in a recognized compressed format.

Raises:

TypeError – If the file type is not supported.

ppanggolin.utils.replace_non_ascii(string_with_ascii: str | Collection[str], replacement_string: str = '_') str | Collection[str]

Replace all non-ASCII characters in a string or a collection of strings with a specified replacement string.

Parameters:
  • string_with_ascii – A string or collection of strings potentially containing non-ASCII characters.

  • replacement_string – The string to replace non-ASCII characters with (default is ‘_’).

Returns:

A new string or collection where all non-ASCII characters have been replaced.
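For a single string, the replacement can be sketched with a regular expression that matches any character outside the ASCII range. The function name is hypothetical and the collection case is left out.

```python
import re

# Matches any character whose code point is above 0x7F.
_NON_ASCII = re.compile(r"[^\x00-\x7F]")


def ascii_only(text: str, replacement: str = "_") -> str:
    """Replace every non-ASCII character of `text` with `replacement`."""
    return _NON_ASCII.sub(replacement, text)


cleaned = ascii_only("G\u00e9nome \u03b1")  # e with acute accent, Greek alpha
print(cleaned)  # G_nome _
```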

ppanggolin.utils.restricted_float(x: int | float) float

Restrict the float values accepted by argparse

Parameters:

x – given float by user

Returns:

given float if it is acceptable

Raises:

argparse.ArgumentTypeError – The float is not acceptable
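A validator of this kind is passed to argparse through the type argument of add_argument. The sketch below assumes that the accepted range is [0, 1], which is not stated above; the function and option names are hypothetical.

```python
import argparse


def unit_interval_float(x) -> float:
    """Convert x to float and reject values outside the assumed range [0, 1]."""
    try:
        value = float(x)
    except (TypeError, ValueError):
        raise argparse.ArgumentTypeError(f"{x!r} is not a floating-point literal")
    if not 0.0 <= value <= 1.0:
        raise argparse.ArgumentTypeError(f"{value!r} is not in range [0.0, 1.0]")
    return value


parser = argparse.ArgumentParser()
parser.add_argument("--threshold", type=unit_interval_float, default=0.95)
args = parser.parse_args(["--threshold", "0.8"])
```

When the validator raises argparse.ArgumentTypeError during parsing, argparse reports the message as a usage error for that option.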

ppanggolin.utils.run_subprocess(cmd: List[str], output: Path | None = None, msg: str = 'Subprocess failed with the following error:\n')

Run a subprocess command and write the output to the given path.

Parameters:
  • cmd – List of program arguments.

  • output – Path to write the subprocess output (optional).

  • msg – Message to print if the subprocess fails.

Raises:
  • FileNotFoundError – If the command’s executable is not found.

  • subprocess.CalledProcessError – If the subprocess returns a non-zero exit code.

ppanggolin.utils.set_up_config_param_to_parser(config_param_val: dict) list

Take dict pairing parameters and values and format the corresponding list of arguments to feed a parser.

When the parameter value is False, the parameter is a flag and thus is not added to the list.

Parameters:

config_param_val – Dict with parameter name as key and parameter value as value.

Returns:

list of argument strings formatted for an argparse.ArgumentParser object.
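The conversion can be sketched as below. It assumes long options of the form "--name", that a True value yields a bare flag, and that every other value is converted to a string; these details and the function name are assumptions made for the illustration.

```python
from typing import Any, Dict, List


def config_to_argv(params: Dict[str, Any]) -> List[str]:
    """Turn {name: value} pairs into a list of command-line style arguments."""
    argv: List[str] = []
    for name, value in params.items():
        if value is False:
            continue  # a flag set to False is simply left out
        argv.append(f"--{name}")
        if value is not True:
            argv.append(str(value))  # a True value is a bare flag with no argument
    return argv


argv = config_to_argv({"cpu": 4, "force": True, "disable_bar": False, "output": "out"})
print(argv)  # ['--cpu', '4', '--force', '--output', 'out']
```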

ppanggolin.utils.set_verbosity_level(args)

Set the verbosity level

Parameters:

args – arguments passed on the command line

ppanggolin.utils.write_compressed_or_not(file_path: Path, compress: bool = False) GzipFile | TextIOWrapper

Create a file-like object, compressed or not.

Parameters:
  • file_path – Path to the file

  • compress – Compress the file in .gz

Returns:

file-like object, compressed or not
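Choosing between a plain and a gzip-compressed text handle can be sketched with gzip.open. Appending ".gz" to the requested name is an assumption of this sketch, and the function name is hypothetical.

```python
import gzip
import tempfile
from pathlib import Path


def open_for_writing(file_path: Path, compress: bool = False):
    """Return a text handle; with compress=True the data goes to '<file_path>.gz'."""
    if compress:
        # Assumption: the '.gz' suffix is appended to the requested name.
        return gzip.open(f"{file_path}.gz", mode="wt")
    return open(file_path, mode="w")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "table.tsv"
    with open_for_writing(target, compress=True) as handle:
        handle.write("a\tb\n")
    # Read the gzip file back to check the round trip.
    with gzip.open(f"{target}.gz", mode="rt") as handle:
        content = handle.read()
```

Both branches return an object with the same write interface, so the calling code does not need to know whether compression is enabled.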

Module contents