ppanggolin.RGP package

Submodules

ppanggolin.RGP.genomicIsland module

class ppanggolin.RGP.genomicIsland.MatriceNode(state, score, prev, gene)

Bases: object

changes(score)

ppanggolin.RGP.genomicIsland.check_pangenome_former_rgp(pangenome: Pangenome, force: bool = False)

Checks the pangenome status and .h5 files for former RGPs; deletes them if allowed, otherwise raises an error.

Parameters:

pangenome – Pangenome object
force – Allow forced writing to the Pangenome file

ppanggolin.RGP.genomicIsland.compute_org_rgp(organism: Organism, multigenics: set, persistent_penalty: int = 3, variable_gain: int = 1, min_length: int = 3000, min_score: int = 4, naming: str = 'contig', disable_bar: bool = True) → set

Compute regions of genomic plasticity (RGP) on the given organism based on the provided parameters.

Parameters:

organism – The Organism object representing the organism.
multigenics – A set of multigenic persistent families of the pangenome graph.
persistent_penalty – Penalty score to apply to persistent multigenic families (default: 3).
variable_gain – Gain score to apply to variable multigenic families (default: 1).
min_length – Minimum length threshold (in base pairs) for the regions to be considered RGP (default: 3000).
min_score – Minimum score threshold for considering a region as RGP (default: 4).
naming – Naming scheme for the regions, either “contig” or “organism” (default: “contig”).
disable_bar – Whether to disable the progress bar. It is recommended to disable it when calling this function in a loop on multiple organisms (default: True).

Returns:

A set of RGPs of the provided organism.

ppanggolin.RGP.genomicIsland.extract_rgp(contig, node, rgp_id, naming) → Region

Extract the region from the given starting node

ppanggolin.RGP.genomicIsland.init_matrices(contig: Contig, multi: set, persistent_penalty: int = 3, variable_gain: int = 1) → list

Initialize the vector of score/state nodes

Parameters:

contig – Current contig from one organism
persistent_penalty – Penalty score to apply to persistent genes
variable_gain – Gain score to apply to variable genes
multi – multigenic persistent families of the pangenome graph.

Returns:

Initialized matrix
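
The role of the penalty and gain parameters can be illustrated with a minimal sketch. It assumes, and this is not stated above, that a running score is floored at zero, rises by variable_gain on a variable gene, and drops by persistent_penalty on a persistent gene. The function name score_genes and the boolean input are hypothetical stand-ins for the contig's genes and the MatriceNode objects.

```python
from typing import List


def score_genes(is_persistent: List[bool],
                persistent_penalty: int = 3,
                variable_gain: int = 1) -> List[int]:
    """Return one running score per gene (illustrative, not the library code)."""
    scores = []
    current = 0
    for persistent in is_persistent:
        if persistent:
            # Assumption: a persistent gene lowers the score, never below zero.
            current = max(current - persistent_penalty, 0)
        else:
            # Assumption: a variable gene raises the score.
            current += variable_gain
        scores.append(current)
    return scores


genes = [True, False, False, False, False, False, True, False]
print(score_genes(genes))  # [0, 1, 2, 3, 4, 5, 2, 3]
```

With the default values, a single persistent gene inside a long variable stretch lowers the score without resetting it to zero.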

ppanggolin.RGP.genomicIsland.launch(args: Namespace)

Command launcher

Parameters:: args – All arguments provided by the user

ppanggolin.RGP.genomicIsland.mk_regions(contig: Contig, matrix: list, multi: set, min_length: int = 3000, min_score: int = 4, persistent: int = 3, continuity: int = 1, naming: str = 'contig') → Set[Region]

Process the matrix, ‘emptying’ it to get the regions.

Parameters:

contig – Current contig from one organism
matrix – Initialized matrix
multi – multigenic persistent families of the pangenome graph.
min_length – Minimum length (bp) of a region to be considered RGP
min_score – Minimal score wanted for considering a region as being RGP
persistent – Penalty score to apply to persistent genes
continuity – Gain score to apply to variable genes
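naming – Naming scheme for the regions, either “contig” or “organism”

Returns:

Set of regions (RGPs) found on the contig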
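
One way to picture the ‘emptying’ step is to take the highest-scoring position and walk back to where the scoring run started. The sketch below works on a plain list of scores rather than the real matrix, returns a single candidate region, and leaves out min_length and the matrix rewrite. The helper best_region is hypothetical.

```python
from typing import List, Optional, Tuple


def best_region(scores: List[int], min_score: int = 4) -> Optional[Tuple[int, int]]:
    """Return (start, end) indices of the best-scoring run, or None."""
    if not scores:
        return None
    # Position of the (first) maximum score marks the end of the candidate region.
    end = max(range(len(scores)), key=scores.__getitem__)
    if scores[end] < min_score:
        return None
    # Walk back while the preceding score is still positive.
    start = end
    while start > 0 and scores[start - 1] > 0:
        start -= 1
    return start, end


print(best_region([0, 1, 2, 3, 4, 5, 2, 3]))  # (1, 5)
```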

ppanggolin.RGP.genomicIsland.naming_scheme(organisms: Iterable[Organism]) → str

Determine the naming scheme for the contigs in the pangenome.

Parameters:: organisms – Iterable of Organism objects
Returns:: Naming scheme for the contigs (“contig” or “organism”).
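
The criterion used to pick the scheme is not given above. A plausible sketch, assuming that “contig” is chosen only when contig names are unique across all organisms, is shown below. The dictionary input and the function name are hypothetical.

```python
from typing import Dict, List


def pick_naming_scheme(contigs_by_organism: Dict[str, List[str]]) -> str:
    """Return "contig" if every contig name is unique, else "organism"."""
    seen = set()
    for contig_names in contigs_by_organism.values():
        for name in contig_names:
            if name in seen:
                # Assumption: a repeated contig name makes contig-based names ambiguous.
                return "organism"
            seen.add(name)
    return "contig"


print(pick_naming_scheme({"orgA": ["c1", "c2"], "orgB": ["c3"]}))
print(pick_naming_scheme({"orgA": ["contig_1"], "orgB": ["contig_1"]}))
```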

ppanggolin.RGP.genomicIsland.parser_rgp(parser: ArgumentParser)

Parser for the specific arguments of the rgp command

Parameters:: parser – parser for rgp argument

ppanggolin.RGP.genomicIsland.predict_rgp(pangenome: Pangenome, persistent_penalty: int = 3, variable_gain: int = 1, min_length: int = 3000, min_score: int = 4, dup_margin: float = 0.05, force: bool = False, disable_bar: bool = False)

Main function to predict regions of genomic plasticity

Parameters:

pangenome – blank pangenome object
persistent_penalty – Penalty score to apply to persistent genes
variable_gain – Gain score to apply to variable genes
min_length – Minimum length (bp) of a region to be considered RGP
min_score – Minimal score wanted for considering a region as being RGP
dup_margin – Minimum ratio of organisms in which a family must have multiple genes to be considered duplicated
force – Allow forced writing to the Pangenome file
disable_bar – Disable progress bar
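
The dup_margin threshold can be illustrated as follows. This sketch assumes the ratio is computed over the organisms that carry the family, which the description above does not specify. The nested-dictionary input and the function name are hypothetical.

```python
from typing import Dict, Set


def duplicated_families(gene_counts: Dict[str, Dict[str, int]],
                        dup_margin: float = 0.05) -> Set[str]:
    """Return families that are multi-copy in at least dup_margin of their organisms."""
    duplicated = set()
    for family, counts in gene_counts.items():
        if not counts:
            continue
        # Number of organisms holding more than one gene of this family.
        multi = sum(1 for n in counts.values() if n > 1)
        if multi / len(counts) >= dup_margin:
            duplicated.add(family)
    return duplicated


counts = {
    "famA": {"org1": 1, "org2": 2, "org3": 1, "org4": 1},
    "famB": {"org1": 1, "org2": 1},
}
print(duplicated_families(counts))  # {'famA'}
```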

ppanggolin.RGP.genomicIsland.rewrite_matrix(contig, matrix, index, persistent, continuity, multi)

Rewrite the matrix from the given index of the node that started a region.

ppanggolin.RGP.genomicIsland.subparser(sub_parser: _SubParsersAction) → ArgumentParser

Subparser to launch PPanGGOLiN from the command line

Parameters:: sub_parser – sub_parser for rgp command
Returns:: parser arguments for rgp command

ppanggolin.RGP.rgp_cluster module

class ppanggolin.RGP.rgp_cluster.IdenticalRegions(name: str, identical_rgps: Set[Region], families: Set[GeneFamily], is_contig_border: bool)

Bases: object

Represents a group of Identical Regions within a pangenome.

Parameters:

name – The name of the identical region group.
identical_rgps – A set of Region objects representing the identical regions.
families – A set of GeneFamily objects associated with the identical regions.
is_contig_border – A boolean indicating if the identical regions span across contig borders.

property genes: Return iterable of genes from all RGPs that are identical in families

property modules: Set[Module]: Return modules from all RGPs that are identical in families

property spots: Set[Spot]: Return spots from all RGPs that are identical in families

ppanggolin.RGP.rgp_cluster.add_edges_to_identical_rgps(rgp_graph: Graph, identical_rgps_objects: List[IdenticalRegions])

Replace each IdenticalRegions object by all the identical RGPs it contains.

Parameters:

rgp_graph – The RGP graph to add edges to.
identical_rgps_objects – A list of IdenticalRegions objects.

ppanggolin.RGP.rgp_cluster.add_info_to_identical_rgps(rgp_graph: Graph, identical_rgps_objects: List[IdenticalRegions], rgp_to_spot: Dict[Region, int])

Add identical RGPs info to the graph as node attributes.

Parameters:

rgp_graph – Graph with RGP IDs as nodes and GRR values as edges
identical_rgps_objects – A list of IdenticalRegions objects
rgp_to_spot – A dictionary mapping an RGP to its spot

ppanggolin.RGP.rgp_cluster.add_info_to_rgp_nodes(graph, regions: List[Region], region_to_spot: dict)

Format RGP information into a dictionary for adding to the graph.

This function takes a list of RGPs and a dictionary mapping each RGP to its corresponding spot ID, and formats the RGP information into a dictionary for further processing or addition to a graph.

Parameters:

graph – RGPs graph
regions – A list of RGPs.
region_to_spot – A dictionary mapping each RGP to its corresponding spot ID.

Returns:

A dictionary with RGP id as the key and a dictionary containing information on the corresponding RGP as value.

ppanggolin.RGP.rgp_cluster.add_rgp_metadata_to_graph(graph: Graph, rgps: List[Region | IdenticalRegions]) → None

Add metadata from Region or IdenticalRegions objects to the graph.

Parameters:

graph – The graph to which the metadata will be added.
rgps – A list of Region or IdenticalRegions objects containing the metadata to be added.

ppanggolin.RGP.rgp_cluster.cluster_rgp(pangenome, grr_cutoff: float, output: str, basename: str, ignore_incomplete_rgp: bool, unmerge_identical_rgps: bool, grr_metric: str, disable_bar: bool, graph_formats: Set[str], add_metadata: bool = False, metadata_sep: str = '|', metadata_sources: List[str] | None = None)

Main function to cluster regions of genomic plasticity based on their GRR

Parameters:

pangenome – pangenome object
grr_cutoff – GRR cutoff value for clustering
output – Directory where the output files will be saved
basename – Basename for the output files
ignore_incomplete_rgp – Whether to ignore incomplete RGPs located at a contig border
unmerge_identical_rgps – Whether to unmerge identical RGPs into separate nodes in the graph
grr_metric – GRR metric to use for clustering
disable_bar – Whether to disable the progress bar
graph_formats – Set of graph file formats to save the output
add_metadata – Add metadata to cluster files
metadata_sep – The separator used to join multiple metadata values
metadata_sources – Sources of the metadata to use and write in the outputs. None means all sources are used.

ppanggolin.RGP.rgp_cluster.cluster_rgp_on_grr(graph: Graph, clustering_attribute: str = 'grr')

Cluster RGPs based on GRR using Louvain community clustering.

Parameters:

graph – NetworkX graph object representing the RGPs and their relationship
clustering_attribute – Attribute of the graph to use for clustering (default is “grr”)

ppanggolin.RGP.rgp_cluster.compute_grr(rgp_a_families: Set[GeneFamily], rgp_b_families: Set[GeneFamily], mode: Callable) → float

Compute the gene repertoire relatedness (GRR) between two RGPs. The mode can be the function min, to compute the min GRR, or max, to compute the max GRR.

Parameters:

rgp_a_families – Gene families of RGP A
rgp_b_families – Gene families of RGP B
mode – min or max function

Returns:

GRR value between 0 and 1
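
As a worked illustration, GRR is commonly defined as the number of shared families divided by the size of the smaller (min) or larger (max) repertoire. The sketch below assumes that definition, assumes both sets are non-empty, and uses plain strings in place of GeneFamily objects. The name grr_sketch is hypothetical.

```python
from typing import Callable, Set


def grr_sketch(families_a: Set[str], families_b: Set[str],
               mode: Callable[..., int]) -> float:
    """Shared families divided by min or max repertoire size (assumed definition)."""
    shared = len(families_a & families_b)
    return shared / mode(len(families_a), len(families_b))


a = {"f1", "f2", "f3", "f4"}
b = {"f3", "f4"}
print(grr_sketch(a, b, min))  # 1.0
print(grr_sketch(a, b, max))  # 0.5
```

Under this definition, the min GRR reaches 1 whenever one RGP's families are fully contained in the other's, while the max GRR reaches 1 only when both sets are equal.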

ppanggolin.RGP.rgp_cluster.compute_jaccard_index(rgp_a_families: set, rgp_b_families: set) → float

Compute jaccard index between two rgp based on their families.

Parameters:

rgp_a_families – Rgp A
rgp_b_families – rgp B

Returns:

Jaccard index
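
The Jaccard index is the size of the intersection divided by the size of the union. A minimal sketch on plain string sets follows; the function name and the choice to return 0.0 for two empty sets are assumptions made for illustration.

```python
from typing import Set


def jaccard_sketch(families_a: Set[str], families_b: Set[str]) -> float:
    """Return |A ∩ B| / |A ∪ B|, or 0.0 when both sets are empty."""
    union = families_a | families_b
    if not union:
        return 0.0
    return len(families_a & families_b) / len(union)


print(jaccard_sketch({"f1", "f2", "f3"}, {"f2", "f3", "f4"}))  # 0.5
```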

ppanggolin.RGP.rgp_cluster.compute_rgp_metric(rgp_a: Region, rgp_b: Region, grr_cutoff: float, grr_metric: str) → Tuple[int, int, dict] | None

Compute GRR metric between two RGPs.

Parameters:

rgp_a – An RGP
rgp_b – Another RGP
grr_cutoff – Cutoff filter
grr_metric – GRR mode, one of min_grr, max_grr or incomplete_aware_grr

Returns:

Tuple containing the IDs of the two RGPs and the computed metrics as a dictionary

ppanggolin.RGP.rgp_cluster.dereplicate_rgp(rgps: Set[Region | IdenticalRegions], disable_bar: bool = False) → List[Region | IdenticalRegions]

Dereplicate RGPs that have the same families.

Given a set of Region or IdenticalRegions objects representing RGPs, this function groups together RGPs with the same families into IdenticalRegions objects and returns a list of dereplicated RGPs.

Parameters:

rgps – A set of Region or IdenticalRegions objects representing the RGPs to be dereplicated.
disable_bar – If True, disable the progress bar.

Returns:

A list of dereplicated RGPs (Region or IdenticalRegions objects). For RGPs with the same families, they will be grouped together in IdenticalRegions objects.
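
Grouping by identical family content can be sketched with frozenset keys. The mapping from RGP names to family names below is a hypothetical stand-in for Region objects, and the output is plain lists rather than IdenticalRegions instances.

```python
from collections import defaultdict
from typing import Dict, List, Set


def group_identical(rgp_families: Dict[str, Set[str]]) -> List[List[str]]:
    """Group RGP names whose family sets are exactly equal."""
    groups = defaultdict(list)
    for rgp_name, families in rgp_families.items():
        # A frozenset is hashable, so equal family sets share one dictionary key.
        groups[frozenset(families)].append(rgp_name)
    return [sorted(names) for names in groups.values()]


rgps = {"rgp1": {"a", "b"}, "rgp2": {"b", "a"}, "rgp3": {"c"}}
print(sorted(group_identical(rgps)))  # [['rgp1', 'rgp2'], ['rgp3']]
```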

ppanggolin.RGP.rgp_cluster.format_rgp_metadata(rgp: Region) → Dict[str, str]

Format RGP metadata by combining source and field values.

Given an RGP object with metadata, this function creates a new dictionary where the keys are formatted as ‘source_field’ and the values are concatenated with ‘|’ as the delimiter.

Parameters:: rgp – The RGP object with metadata.
Returns:: A dictionary with formatted metadata.
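
The key and value formatting can be sketched as below. The input shape, a list of (source, field, value) tuples, is an assumption made for illustration and is not how Region objects store metadata.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def format_metadata_sketch(records: List[Tuple[str, str, str]]) -> Dict[str, str]:
    """Build 'source_field' keys and join repeated values with '|'."""
    values = defaultdict(list)
    for source, field, value in records:
        values[f"{source}_{field}"].append(str(value))
    return {key: "|".join(vals) for key, vals in values.items()}


records = [("db", "type", "phage"), ("db", "type", "plasmid"), ("lab", "note", "x")]
print(format_metadata_sketch(records))
# {'db_type': 'phage|plasmid', 'lab_note': 'x'}
```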

ppanggolin.RGP.rgp_cluster.get_spot_id(rgp: Region, rgp_to_spot: Dict[Region, int]) → str

Return the spot ID associated with an RGP, with the prefix “spot” added to it. When no spot is associated with the RGP, the string “No spot” is returned.

Parameters:

rgp – The RGP (Region object)
rgp_to_spot – A dictionary mapping an RGP to its spot.

Returns:

Spot ID of the given RGP with the prefix spot or “No spot”.
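
A minimal sketch of this lookup, using strings as hypothetical stand-ins for Region keys. The underscore between the prefix and the ID is an assumption; the description above only says that the prefix “spot” is added.

```python
from typing import Dict, Hashable


def spot_label(rgp: Hashable, rgp_to_spot: Dict[Hashable, int]) -> str:
    """Return a prefixed spot ID, or "No spot" when the RGP has none."""
    spot = rgp_to_spot.get(rgp)
    # Compare against None explicitly: a spot ID of 0 is still a valid spot.
    if spot is None:
        return "No spot"
    return f"spot_{spot}"


mapping = {"rgpA": 0, "rgpB": 12}
print(spot_label("rgpA", mapping))  # spot_0
print(spot_label("rgpZ", mapping))  # No spot
```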

ppanggolin.RGP.rgp_cluster.join_dicts(dicts: List[Dict[str, Any]], delimiter: str = ';') → Dict[str, Any]

Join dictionaries by concatenating the values with a custom delimiter for common keys.

Given a list of dictionaries, this function creates a new dictionary where the values for common keys are concatenated with the specified delimiter.

Parameters:

dicts – A list of dictionaries to be joined.
delimiter – The delimiter to use for joining values. Default is ‘;’.

Returns:

A dictionary with joined values for common keys.
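
The joining behaviour described above can be sketched as follows. This is an illustration under the assumption that every value is converted to a string before joining; join_dicts_sketch is a hypothetical name.

```python
from collections import defaultdict
from typing import Any, Dict, List


def join_dicts_sketch(dicts: List[Dict[str, Any]], delimiter: str = ";") -> Dict[str, str]:
    """Merge dictionaries, concatenating the values found under the same key."""
    collected = defaultdict(list)
    for d in dicts:
        for key, value in d.items():
            collected[key].append(str(value))
    return {key: delimiter.join(values) for key, values in collected.items()}


print(join_dicts_sketch([{"a": 1, "b": "x"}, {"a": 2}]))  # {'a': '1;2', 'b': 'x'}
```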

ppanggolin.RGP.rgp_cluster.launch(args: Namespace)

Command launcher

Parameters:: args – All arguments provided by user

ppanggolin.RGP.rgp_cluster.parser_cluster_rgp(parser: ArgumentParser)

Parser for the specific arguments of the cluster_rgp command

Parameters:: parser – Parser for cluster_rgp argument

ppanggolin.RGP.rgp_cluster.subparser(sub_parser: _SubParsersAction) → ArgumentParser

Subparser to launch PPanGGOLiN from the command line

Parameters:: sub_parser – Sub_parser for cluster_rgp command
Returns:: Parser arguments for cluster_rgp command

ppanggolin.RGP.rgp_cluster.write_rgp_cluster_table(outfile: str, grr_graph: Graph, rgps_in_graph: List[Region | IdenticalRegions], grr_metric: str, rgp_to_spot: Dict[Region, int]) → None

Writes RGP cluster info to a TSV file using pandas.

Parameters:

outfile – Name of the tsv file
grr_graph – The GRR graph.
rgps_in_graph – A list of Region or IdenticalRegions objects present in the graph.
grr_metric – The GRR metric used for clustering.
rgp_to_spot – A dictionary mapping an RGP to its spot.

Returns:

None

ppanggolin.RGP.spot module

ppanggolin.RGP.spot.add_new_node_in_spot_graph(g: Graph, region: Region, borders: list) → str

Add bordering region as node to graph

Parameters:

g – spot graph
region – region in spot
borders – bordering families in spot

Returns:

name of the node that has been added

ppanggolin.RGP.spot.check_pangenome_former_spots(pangenome: Pangenome, force: bool = False)

Checks the pangenome status and .h5 files for former spots; deletes them if allowed, otherwise raises an error.

Parameters:

pangenome – Pangenome object
force – Allow forced writing to the Pangenome file

ppanggolin.RGP.spot.check_sim(pair_border1: list, pair_border2: list, overlapping_match: int = 2, set_size: int = 3, exact_match: int = 1) → bool

Checks whether the first exact_match gene families of the two pairs are identical or, failing that, whether they overlap in an ordered way by at least ‘overlapping_match’ families.

Parameters:

pair_border1 – First flanking gene families pair
pair_border2 – Second flanking gene families pair
overlapping_match – Number of missing persistent genes allowed when comparing flanking genes
set_size – Number of single copy markers to use as flanking genes for RGP during hotspot computation
exact_match – Number of perfectly matching flanking single copy markers required to associate RGPs

Returns:

Whether the gene families are identical or not
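
One possible reading of this comparison is sketched below. It is an interpretation of the description, not the library code: borders_match and pairs_match are hypothetical names, each border is taken to be an ordered list of family identifiers, and the assumption that the two borders of a pair may match in either orientation is not stated above.

```python
from typing import Sequence


def borders_match(border1: Sequence[str], border2: Sequence[str],
                  overlapping_match: int = 2, set_size: int = 3,
                  exact_match: int = 1) -> bool:
    """Compare two borders, first exactly, then allowing an ordered shift."""
    # Case 1: the first `exact_match` families are identical.
    if list(border1[:exact_match]) == list(border2[:exact_match]):
        return True
    # Case 2: both borders are full-size and one is a shifted copy of the other,
    # sharing at least `overlapping_match` families in the same order.
    if len(border1) == set_size and len(border2) == set_size:
        for shift in range(1, set_size - overlapping_match + 1):
            keep = set_size - shift
            if list(border1[:keep]) == list(border2[shift:]):
                return True
            if list(border1[shift:]) == list(border2[:keep]):
                return True
    return False


def pairs_match(pair1, pair2, **kwargs) -> bool:
    """Assumption: a pair matches in the same or in the flipped orientation."""
    same = (borders_match(pair1[0], pair2[0], **kwargs)
            and borders_match(pair1[1], pair2[1], **kwargs))
    flipped = (borders_match(pair1[0], pair2[1], **kwargs)
               and borders_match(pair1[1], pair2[0], **kwargs))
    return same or flipped


print(borders_match(["a", "b", "c"], ["b", "c", "d"]))  # True
print(borders_match(["a", "b", "c"], ["x", "y", "z"]))  # False
```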

ppanggolin.RGP.spot.comp_border(border1: list, border2: list, overlapping_match: int = 2, set_size: int = 3, exact_match: int = 1) → bool

Compare two borders

Parameters:

border1 –
border2 –
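overlapping_match – Number of missing persistent genes allowed when comparing flanking genes
set_size – Number of single copy markers to use as flanking genes for RGP during hotspot computation
exact_match – Number of perfectly matching flanking single copy markers required to associate RGPs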

Returns:

ppanggolin.RGP.spot.launch(args: Namespace)

Command launcher

Parameters:: args – All arguments provided by the user

ppanggolin.RGP.spot.make_spot_graph(rgps: list, multigenics: set, overlapping_match: int = 2, set_size: int = 3, exact_match: int = 1) → Graph

Create a spot graph from pangenome RGPs

Parameters:

rgps – list of pangenome RGPs
multigenics – pangenome graph multigenic persistent families
overlapping_match – Number of missing persistent genes allowed when comparing flanking genes
set_size – Number of single copy markers to use as flanking genes for RGP during hotspot computation
exact_match – Number of perfectly matching flanking single copy markers required to associate RGPs

Returns:

spot graph

ppanggolin.RGP.spot.parser_spot(parser: ArgumentParser)

Parser for the specific arguments of the spot command

Parameters:: parser – parser for spot argument

ppanggolin.RGP.spot.predict_hotspots(pangenome: Pangenome, output: Path, spot_graph: bool = False, graph_formats: List[str] = ['gexf'], overlapping_match: int = 2, set_size: int = 3, exact_match: int = 1, force: bool = False, disable_bar: bool = False)

Main function to predict hotspots

Parameters:

pangenome – Blank pangenome object
output – Output directory to save the spot graph
spot_graph – Writes graph of pairs of blocks of single copy markers flanking RGPs from same hotspot
graph_formats – List of graph file formats to save the output
overlapping_match – Number of missing persistent genes allowed when comparing flanking genes
set_size – Number of single copy markers to use as flanking genes for RGP during hotspot computation
exact_match – Number of perfectly matching flanking single copy markers required to associate RGPs
force – Allow forced writing to the Pangenome file
disable_bar – Disable progress bar

ppanggolin.RGP.spot.subparser(sub_parser: _SubParsersAction) → ArgumentParser

Subparser to launch PPanGGOLiN from the command line

Parameters:: sub_parser – sub_parser for spot command
Returns:: parser arguments for spot command

ppanggolin.RGP.spot.write_spot_graph(graph_spot, outdir, graph_formats, file_basename='spotGraph')
