ppanggolin.RGP package

Submodules

ppanggolin.RGP.genomicIsland module

class ppanggolin.RGP.genomicIsland.MatriceNode(state, score, prev, gene)

Bases: object

changes(score)

ppanggolin.RGP.genomicIsland.check_pangenome_former_rgp(pangenome: Pangenome, force: bool = False)

Check the pangenome status and .h5 file for previously computed RGPs; delete them if allowed, otherwise raise an error.

Parameters:
  • pangenome – Pangenome object

  • force – Allow overwriting existing data in the pangenome file

ppanggolin.RGP.genomicIsland.compute_org_rgp(organism: Organism, multigenics: set, persistent_penalty: int = 3, variable_gain: int = 1, min_length: int = 3000, min_score: int = 4, naming: str = 'contig', disable_bar: bool = True) set

Compute regions of genomic plasticity (RGP) on the given organism based on the provided parameters.

Parameters:
  • organism – The Organism object representing the organism.

  • multigenics – A set of multigenic persistent families of the pangenome graph.

  • persistent_penalty – Penalty score to apply to persistent multigenic families (default: 3).

  • variable_gain – Gain score to apply to variable multigenic families (default: 1).

  • min_length – Minimum length threshold (in base pairs) for the regions to be considered RGP (default: 3000).

  • min_score – Minimum score threshold for considering a region as RGP (default: 4).

  • naming – Naming scheme for the regions, either “contig” or “organism” (default: “contig”).

  • disable_bar – Whether to disable the progress bar. It is recommended to disable it when calling this function in a loop on multiple organisms (default: True).

Returns:

A set of RGPs of the provided organism.
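
How persistent_penalty, variable_gain and min_score interact can be illustrated with a small, self-contained sketch. It does not call PPanGGOLiN: the function name score_genes and the "P"/"V" gene labels are hypothetical, and the running-score rule used here (add the gain for a variable gene, subtract the penalty for a persistent one, never drop below zero) is an assumption made for illustration, not the library's implementation.

```python
from typing import List


def score_genes(labels: List[str], persistent_penalty: int = 3,
                variable_gain: int = 1) -> List[int]:
    """Return a running score along a contig.

    Hypothetical labels: "P" marks a persistent gene, "V" a variable gene.
    """
    scores = []
    current = 0
    for label in labels:
        if label == "V":
            # Variable genes raise the score.
            current += variable_gain
        else:
            # Persistent genes lower it, floored at zero (assumption).
            current = max(0, current - persistent_penalty)
        scores.append(current)
    return scores


labels = ["P", "V", "V", "V", "V", "V", "P", "V"]
scores = score_genes(labels)
print(scores)  # [0, 1, 2, 3, 4, 5, 2, 3]
```

Under these assumptions, the run of five variable genes reaches the default min_score of 4; whether such a run would be reported also depends on min_length.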

ppanggolin.RGP.genomicIsland.extract_rgp(contig, node, rgp_id, naming) Region

Extract the region from the given starting node

ppanggolin.RGP.genomicIsland.init_matrices(contig: Contig, multi: set, persistent_penalty: int = 3, variable_gain: int = 1) list

Initialize the vector of score/state nodes

Parameters:
  • contig – Current contig from one organism

  • persistent_penalty – Penalty score to apply to persistent genes

  • variable_gain – Gain score to apply to variable genes

  • multi – multigenic persistent families of the pangenome graph.

Returns:

Initialized matrix

ppanggolin.RGP.genomicIsland.launch(args: Namespace)

Command launcher

Parameters:

args – All arguments provided by the user

ppanggolin.RGP.genomicIsland.mk_regions(contig: Contig, matrix: list, multi: set, min_length: int = 3000, min_score: int = 4, persistent: int = 3, continuity: int = 1, naming: str = 'contig') Set[Region]

Process the matrix, ‘emptying’ it to extract the regions.

Parameters:
  • contig – Current contig from one organism

  • matrix – Initialized matrix

  • multi – multigenic persistent families of the pangenome graph.

  • min_length – Minimum length (bp) of a region to be considered RGP

  • min_score – Minimal score wanted for considering a region as being RGP

  • persistent – Penalty score to apply to persistent genes

  • continuity – Gain score to apply to variable genes

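  • naming – Naming scheme for the regions, either “contig” or “organism”

Returns:

A set of the regions (RGPs) extracted from the contig.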

ppanggolin.RGP.genomicIsland.naming_scheme(organisms: Iterable[Organism]) str

Determine the naming scheme for the contigs in the pangenome.

Parameters:

organisms – Iterable of Organism objects

Returns:

Naming scheme for the contigs (“contig” or “organism”).
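
One plausible way to choose between the two schemes is sketched below. This is an assumption about the intent (use “contig” when contig names are unique across all organisms, fall back to “organism” otherwise), not the library's code; pick_naming_scheme and its dictionary input are hypothetical stand-ins for the Organism objects.

```python
from typing import Dict, List


def pick_naming_scheme(contigs_by_organism: Dict[str, List[str]]) -> str:
    """Return "contig" if every contig name is unique, else "organism"."""
    seen = set()
    for contig_names in contigs_by_organism.values():
        for name in contig_names:
            if name in seen:
                # A contig name is reused, so it cannot identify a region
                # unambiguously across the pangenome (assumed rationale).
                return "organism"
            seen.add(name)
    return "contig"


unique_names = pick_naming_scheme({"orgA": ["c1", "c2"], "orgB": ["c3"]})
clashing_names = pick_naming_scheme({"orgA": ["c1"], "orgB": ["c1"]})
```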

ppanggolin.RGP.genomicIsland.parser_rgp(parser: ArgumentParser)

Parser for the specific arguments of the rgp command

Parameters:

parser – Parser for the rgp arguments

ppanggolin.RGP.genomicIsland.predict_rgp(pangenome: Pangenome, persistent_penalty: int = 3, variable_gain: int = 1, min_length: int = 3000, min_score: int = 4, dup_margin: float = 0.05, force: bool = False, disable_bar: bool = False)

Main function to predict regions of genomic plasticity (RGPs)

Parameters:
  • pangenome – blank pangenome object

  • persistent_penalty – Penalty score to apply to persistent genes

  • variable_gain – Gain score to apply to variable genes

  • min_length – Minimum length (bp) of a region to be considered RGP

  • min_score – Minimal score wanted for considering a region as being RGP

  • dup_margin – Minimum ratio of organisms in which a family must have multiple genes for it to be considered duplicated

  • force – Allow overwriting existing data in the pangenome file

  • disable_bar – Disable progress bar

ppanggolin.RGP.genomicIsland.rewrite_matrix(contig, matrix, index, persistent, continuity, multi)

Rewrite the matrix from the given index of the node that started a region.

ppanggolin.RGP.genomicIsland.subparser(sub_parser: _SubParsersAction) ArgumentParser

Subparser to launch PPanGGOLiN from the command line

Parameters:

sub_parser – Subparser for the rgp command

Returns:

Parser arguments for the rgp command

ppanggolin.RGP.rgp_cluster module

class ppanggolin.RGP.rgp_cluster.IdenticalRegions(name: str, identical_rgps: Set[Region], families: Set[GeneFamily], is_contig_border: bool)

Bases: object

Represents a group of Identical Regions within a pangenome.

Parameters:
  • name – The name of the identical region group.

  • identical_rgps – A set of Region objects representing the identical regions.

  • families – A set of GeneFamily objects associated with the identical regions.

  • is_contig_border – A boolean indicating if the identical regions span across contig borders.

property genes

Return iterable of genes from all RGPs that are identical in families

property modules: Set[Module]

Return modules from all RGPs that are identical in families

property spots: Set[Spot]

Return spots from all RGPs that are identical in families

ppanggolin.RGP.rgp_cluster.add_edges_to_identical_rgps(rgp_graph: Graph, identical_rgps_objects: List[IdenticalRegions])

Replace identical RGP objects with all the identical RGPs they contain.

Parameters:
  • rgp_graph – The RGP graph to add edges to.

  • identical_rgps_objects – A list of IdenticalRegions objects, each grouping a set of identical RGPs.

ppanggolin.RGP.rgp_cluster.add_info_to_identical_rgps(rgp_graph: Graph, identical_rgps_objects: List[IdenticalRegions], rgp_to_spot: Dict[Region, int])

Add information on identical RGPs to the graph as node attributes.

Parameters:
  • rgp_graph – Graph with RGP IDs as nodes and GRR values as edges

  • identical_rgps_objects – IdenticalRegions objects, each grouping a set of identical RGPs

  • rgp_to_spot – A dictionary mapping an RGP to its spot

ppanggolin.RGP.rgp_cluster.add_info_to_rgp_nodes(graph, regions: List[Region], region_to_spot: dict)

Format RGP information into a dictionary for adding to the graph.

This function takes a list of RGPs and a dictionary mapping each RGP to its corresponding spot ID, and formats the RGP information into a dictionary for further processing or addition to a graph.

Parameters:
  • graph – RGPs graph

  • regions – A list of RGPs.

  • region_to_spot – A dictionary mapping each RGP to its corresponding spot ID.

Returns:

A dictionary with RGP id as the key and a dictionary containing information on the corresponding RGP as value.

ppanggolin.RGP.rgp_cluster.add_rgp_metadata_to_graph(graph: Graph, rgps: List[Region | IdenticalRegions]) None

Add metadata from Region or IdenticalRegions objects to the graph.

Parameters:
  • graph – The graph to which the metadata will be added.

  • rgps – A list of Region or IdenticalRegions objects containing the metadata to be added.

ppanggolin.RGP.rgp_cluster.cluster_rgp(pangenome, grr_cutoff: float, output: str, basename: str, ignore_incomplete_rgp: bool, unmerge_identical_rgps: bool, grr_metric: str, disable_bar: bool, graph_formats: Set[str], add_metadata: bool = False, metadata_sep: str = '|', metadata_sources: List[str] | None = None)

Main function to cluster regions of genomic plasticity based on their GRR

Parameters:
  • pangenome – pangenome object

  • grr_cutoff – GRR cutoff value for clustering

  • output – Directory where the output files will be saved

  • basename – Basename for the output files

  • ignore_incomplete_rgp – Whether to ignore incomplete RGPs located at a contig border

  • unmerge_identical_rgps – Whether to unmerge identical RGPs into separate nodes in the graph

  • grr_metric – GRR metric to use for clustering

  • disable_bar – Whether to disable the progress bar

  • graph_formats – Set of graph file formats to save the output

  • add_metadata – Add metadata to cluster files

  • metadata_sep – The separator used to join multiple metadata values

  • metadata_sources – Sources of the metadata to use and write in the outputs. None means all sources are used.

ppanggolin.RGP.rgp_cluster.cluster_rgp_on_grr(graph: Graph, clustering_attribute: str = 'grr')

Cluster RGPs based on GRR using Louvain community detection.

Parameters:
  • graph – NetworkX graph object representing the RGPs and their relationship

  • clustering_attribute – Attribute of the graph to use for clustering (default is “grr”)

ppanggolin.RGP.rgp_cluster.compute_grr(rgp_a_families: Set[GeneFamily], rgp_b_families: Set[GeneFamily], mode: Callable) float

Compute the gene repertoire relatedness (GRR) between two RGPs. mode can be the function min to compute the min GRR or the function max to compute the max GRR.

Parameters:
  • rgp_a_families – Gene families of RGP A

  • rgp_b_families – Gene families of RGP B

  • mode – min or max function

Returns:

GRR value between 0 and 1
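
The role of the mode argument can be sketched as follows. The formula used here (number of shared families divided by mode applied to the two family counts) is assumed from the usual definition of GRR; the function name grr is hypothetical, and plain strings stand in for GeneFamily objects.

```python
from typing import Callable, Set


def grr(families_a: Set[str], families_b: Set[str], mode: Callable) -> float:
    """Shared families divided by mode(|A|, |B|); mode is min or max."""
    shared = len(families_a & families_b)
    return shared / mode(len(families_a), len(families_b))


rgp_a = {"fam1", "fam2", "fam3", "fam4"}
rgp_b = {"fam3", "fam4"}

min_grr = grr(rgp_a, rgp_b, min)  # 2 shared / 2 = 1.0
max_grr = grr(rgp_a, rgp_b, max)  # 2 shared / 4 = 0.5
```

With min, a small RGP fully contained in a larger one scores 1; with max, the size difference lowers the score.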

ppanggolin.RGP.rgp_cluster.compute_jaccard_index(rgp_a_families: set, rgp_b_families: set) float

Compute the Jaccard index between two RGPs based on their families.

Parameters:
  • rgp_a_families – Gene families of RGP A

  • rgp_b_families – Gene families of RGP B

Returns:

Jaccard index
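
For reference, the Jaccard index of two sets is the size of their intersection divided by the size of their union. A minimal sketch, with a hypothetical function name and integers standing in for gene families; returning 0.0 for two empty sets is an assumption of this sketch:

```python
def jaccard_index(families_a: set, families_b: set) -> float:
    """Return |A ∩ B| / |A ∪ B|."""
    union = families_a | families_b
    if not union:
        # Assumption: two empty sets are given an index of 0.0 here.
        return 0.0
    return len(families_a & families_b) / len(union)


index = jaccard_index({1, 2, 3}, {2, 3, 4})  # 2 shared / 4 total = 0.5
```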

ppanggolin.RGP.rgp_cluster.compute_rgp_metric(rgp_a: Region, rgp_b: Region, grr_cutoff: float, grr_metric: str) Tuple[int, int, dict] | None

Compute GRR metric between two RGPs.

Parameters:
  • rgp_a – An RGP

  • rgp_b – Another RGP

  • grr_cutoff – Cutoff filter

  • grr_metric – GRR metric to use: min_grr, max_grr or incomplete_aware_grr

Returns:

Tuple containing the IDs of the two RGPs and the computed metrics as a dictionary

ppanggolin.RGP.rgp_cluster.dereplicate_rgp(rgps: Set[Region | IdenticalRegions], disable_bar: bool = False) List[Region | IdenticalRegions]

Dereplicate RGPs that have the same families.

Given a set of Region or IdenticalRegions objects representing RGPs, this function groups together RGPs with the same families into IdenticalRegions objects and returns a list of dereplicated RGPs.

Parameters:
  • rgps – A set of Region or IdenticalRegions objects representing the RGPs to be dereplicated.

  • disable_bar – If True, disable the progress bar.

Returns:

A list of dereplicated RGPs (Region or IdenticalRegions objects). For RGPs with the same families, they will be grouped together in IdenticalRegions objects.
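
The grouping step can be sketched with plain Python containers. This is an illustration under stated assumptions, not the library's code: group_by_families is a hypothetical name, RGPs are represented by their names, and families by strings.

```python
from collections import defaultdict
from typing import Dict, List, Set


def group_by_families(rgp_families: Dict[str, Set[str]]) -> List[List[str]]:
    """Group RGP names that share exactly the same set of families."""
    groups = defaultdict(list)
    for rgp_name, families in rgp_families.items():
        # A frozenset is hashable, so it can serve as the grouping key.
        groups[frozenset(families)].append(rgp_name)
    return [sorted(names) for names in groups.values()]


rgps = {
    "rgp_1": {"famA", "famB"},
    "rgp_2": {"famB", "famA"},
    "rgp_3": {"famC"},
}
groups = group_by_families(rgps)
```

Here rgp_1 and rgp_2 end up in the same group, which is the situation an IdenticalRegions object is described as representing.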

ppanggolin.RGP.rgp_cluster.format_rgp_metadata(rgp: Region) Dict[str, str]

Format RGP metadata by combining source and field values.

Given an RGP object with metadata, this function creates a new dictionary where the keys are formatted as ‘source_field’ and the values are concatenated with ‘|’ as the delimiter.

Parameters:

rgp – The RGP object with metadata.

Returns:

A dictionary with formatted metadata.
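
The key and value format described above can be sketched as follows. The input shape (a list of source, field, value tuples), the function name format_metadata and the example source names are all hypothetical; only the ‘source_field’ key format and the ‘|’ delimiter come from the description.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def format_metadata(records: List[Tuple[str, str, str]],
                    sep: str = "|") -> Dict[str, str]:
    """Build 'source_field' keys and join repeated values with sep."""
    values = defaultdict(list)
    for source, field, value in records:
        values[f"{source}_{field}"].append(str(value))
    return {key: sep.join(vals) for key, vals in values.items()}


records = [
    ("src1", "type", "prophage"),
    ("src1", "type", "intact"),
    ("src2", "note", "checked"),
]
formatted = format_metadata(records)
```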

ppanggolin.RGP.rgp_cluster.get_spot_id(rgp: Region, rgp_to_spot: Dict[Region, int]) str

Return the spot ID associated with an RGP, prefixed with “spot”. When no spot is associated with the RGP, the string “No spot” is returned.

Parameters:
  • rgp – The RGP (Region object)

  • rgp_to_spot – A dictionary mapping an RGP to its spot.

Returns:

Spot ID of the given RGP with the prefix spot or “No spot”.
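
A minimal sketch of this lookup, using RGP names as dictionary keys instead of Region objects. The function name spot_label is hypothetical, and the underscore between the prefix and the ID is an assumption, since the description only states that the prefix “spot” is added.

```python
from typing import Dict


def spot_label(rgp: str, rgp_to_spot: Dict[str, int]) -> str:
    """Return the prefixed spot ID of an RGP, or "No spot" if it has none."""
    if rgp in rgp_to_spot:
        # Assumed format: prefix and ID joined by an underscore.
        return f"spot_{rgp_to_spot[rgp]}"
    return "No spot"


mapping = {"rgp_1": 12}
with_spot = spot_label("rgp_1", mapping)
without_spot = spot_label("rgp_2", mapping)
```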

ppanggolin.RGP.rgp_cluster.join_dicts(dicts: List[Dict[str, Any]], delimiter: str = ';') Dict[str, Any]

Join dictionaries by concatenating the values with a custom delimiter for common keys.

Given a list of dictionaries, this function creates a new dictionary where the values for common keys are concatenated with the specified delimiter.

Parameters:
  • dicts – A list of dictionaries to be joined.

  • delimiter – The delimiter to use for joining values. Default is ‘;’.

Returns:

A dictionary with joined values for common keys.
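
The behaviour described above can be sketched as follows. This is an illustration, not the library's code: join_dicts_sketch is a hypothetical name, and converting values to strings and keeping duplicates are assumptions of this sketch.

```python
from collections import defaultdict
from typing import Any, Dict, List


def join_dicts_sketch(dicts: List[Dict[str, Any]],
                      delimiter: str = ";") -> Dict[str, str]:
    """Concatenate the values found under each key across all dictionaries."""
    collected = defaultdict(list)
    for current in dicts:
        for key, value in current.items():
            # Values are stringified so that they can be joined (assumption).
            collected[key].append(str(value))
    return {key: delimiter.join(values) for key, values in collected.items()}


merged = join_dicts_sketch([{"spot": "spot_1", "org": "A"}, {"spot": "spot_2"}])
```

Keys present in a single dictionary keep their lone value; keys shared by several dictionaries receive the delimiter-joined values.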

ppanggolin.RGP.rgp_cluster.launch(args: Namespace)

Command launcher

Parameters:

args – All arguments provided by user

ppanggolin.RGP.rgp_cluster.parser_cluster_rgp(parser: ArgumentParser)

Parser for the specific arguments of the cluster_rgp command

Parameters:

parser – Parser for cluster_rgp argument

ppanggolin.RGP.rgp_cluster.subparser(sub_parser: _SubParsersAction) ArgumentParser

Subparser to launch PPanGGOLiN from the command line

Parameters:

sub_parser – Subparser for the cluster_rgp command

Returns:

Parser arguments for the cluster_rgp command

ppanggolin.RGP.rgp_cluster.write_rgp_cluster_table(outfile: str, grr_graph: Graph, rgps_in_graph: List[Region | IdenticalRegions], grr_metric: str, rgp_to_spot: Dict[Region, int]) None

Writes RGP cluster info to a TSV file using pandas.

Parameters:
  • outfile – Name of the tsv file

  • grr_graph – The GRR graph.

  • rgps_in_graph – A list of the Region or IdenticalRegions objects present in the graph.

  • grr_metric – The GRR metric used for clustering.

  • rgp_to_spot – A dictionary mapping an RGP to its spot.

Returns:

None

ppanggolin.RGP.spot module

ppanggolin.RGP.spot.add_new_node_in_spot_graph(g: Graph, region: Region, borders: list) str

Add bordering region as node to graph

Parameters:
  • g – spot graph

  • region – region in spot

  • borders – bordering families in spot

Returns:

name of the node that has been added

ppanggolin.RGP.spot.check_pangenome_former_spots(pangenome: Pangenome, force: bool = False)

Check the pangenome status and .h5 file for previously computed spots; delete them if allowed, otherwise raise an error.

Parameters:
  • pangenome – Pangenome object

  • force – Allow overwriting existing data in the pangenome file

ppanggolin.RGP.spot.check_sim(pair_border1: list, pair_border2: list, overlapping_match: int = 2, set_size: int = 3, exact_match: int = 1) bool

Check whether the first exact_match gene families of the two pairs of borders are identical or, failing that, whether the borders overlap in an ordered way by at least ‘overlapping_match’ families.

Parameters:
  • pair_border1 – First flanking gene families pair

  • pair_border2 – Second flanking gene families pair

  • overlapping_match – Number of missing persistent genes allowed when comparing flanking genes

  • set_size – Number of single copy markers to use as flanking genes for RGP during hotspot computation

  • exact_match – Number of perfectly matching flanking single copy markers required to associate RGPs

Returns:

Whether the two pairs of flanking gene families are considered identical
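
The exact-match part of this comparison can be sketched as follows. This is a simplified illustration under stated assumptions: it covers only the exact_match test (the ordered-overlap fallback is left out), it assumes the two pairs may be compared in either orientation, and the function names borders_match and pairs_match are hypothetical.

```python
from typing import List, Sequence


def borders_match(border1: List[str], border2: List[str],
                  exact_match: int = 1) -> bool:
    """True if the first exact_match families of both borders are equal."""
    return list(border1[:exact_match]) == list(border2[:exact_match])


def pairs_match(pair1: Sequence[List[str]], pair2: Sequence[List[str]],
                exact_match: int = 1) -> bool:
    """Compare two (left, right) border pairs in both orientations."""
    (left1, right1), (left2, right2) = pair1, pair2
    same = (borders_match(left1, left2, exact_match)
            and borders_match(right1, right2, exact_match))
    # Assumption: a region read on the opposite strand swaps its borders.
    flipped = (borders_match(left1, right2, exact_match)
               and borders_match(right1, left2, exact_match))
    return same or flipped


pair_a = (["f1", "f2", "f3"], ["f7", "f8", "f9"])
pair_b = (["f7", "f5", "f6"], ["f1", "f4", "f0"])
pair_c = (["f2", "f3", "f4"], ["f7", "f8", "f9"])

matches_flipped = pairs_match(pair_a, pair_b)
no_match = pairs_match(pair_a, pair_c)
```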

ppanggolin.RGP.spot.comp_border(border1: list, border2: list, overlapping_match: int = 2, set_size: int = 3, exact_match: int = 1) bool

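Compare two borders

Parameters:
  • border1 – First border to compare

  • border2 – Second border to compare

  • overlapping_match – Number of missing persistent genes allowed when comparing flanking genes

  • set_size – Number of single copy markers to use as flanking genes for RGP during hotspot computation

  • exact_match – Number of perfectly matching flanking single copy markers required to associate RGPs

Returns:

Whether the two borders match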

ppanggolin.RGP.spot.launch(args: Namespace)

Command launcher

Parameters:

args – All arguments provided by the user

ppanggolin.RGP.spot.make_spot_graph(rgps: list, multigenics: set, overlapping_match: int = 2, set_size: int = 3, exact_match: int = 1) Graph

Create a spot graph from the pangenome RGPs

Parameters:
  • rgps – List of pangenome RGPs

  • multigenics – pangenome graph multigenic persistent families

  • overlapping_match – Number of missing persistent genes allowed when comparing flanking genes

  • set_size – Number of single copy markers to use as flanking genes for RGP during hotspot computation

  • exact_match – Number of perfectly matching flanking single copy markers required to associate RGPs

Returns:

spot graph

ppanggolin.RGP.spot.parser_spot(parser: ArgumentParser)

Parser for the specific arguments of the spot command

Parameters:

parser – Parser for the spot arguments

ppanggolin.RGP.spot.predict_hotspots(pangenome: Pangenome, output: Path, spot_graph: bool = False, graph_formats: List[str] = ['gexf'], overlapping_match: int = 2, set_size: int = 3, exact_match: int = 1, force: bool = False, disable_bar: bool = False)

Main function to predict hotspots

Parameters:
  • pangenome – Blank pangenome object

  • output – Output directory to save the spot graph

  • spot_graph – Writes graph of pairs of blocks of single copy markers flanking RGPs from same hotspot

  • graph_formats – List of graph file formats in which to save the output

  • overlapping_match – Number of missing persistent genes allowed when comparing flanking genes

  • set_size – Number of single copy markers to use as flanking genes for RGP during hotspot computation

  • exact_match – Number of perfectly matching flanking single copy markers required to associate RGPs

  • force – Allow overwriting existing data in the pangenome file

  • disable_bar – Disable progress bar

ppanggolin.RGP.spot.subparser(sub_parser: _SubParsersAction) ArgumentParser

Subparser to launch PPanGGOLiN from the command line

Parameters:

sub_parser – Subparser for the spot command

Returns:

Parser arguments for the spot command

ppanggolin.RGP.spot.write_spot_graph(graph_spot, outdir, graph_formats, file_basename='spotGraph')

Module contents