ppanggolin.RGP package
Submodules
ppanggolin.RGP.genomicIsland module
- class ppanggolin.RGP.genomicIsland.MatriceNode(state, score, prev, gene)
Bases: object
- changes(score)
- ppanggolin.RGP.genomicIsland.check_pangenome_former_rgp(pangenome: Pangenome, force: bool = False)
Checks the pangenome status and the .h5 file for previously computed RGPs; deletes them if allowed, otherwise raises an error.
- Parameters:
pangenome – Pangenome object
force – Allow forced writing to the pangenome file
- ppanggolin.RGP.genomicIsland.compute_org_rgp(organism: Organism, multigenics: set, persistent_penalty: int = 3, variable_gain: int = 1, min_length: int = 3000, min_score: int = 4, naming: str = 'contig', disable_bar: bool = True) set
Compute regions of genomic plasticity (RGP) on the given organism based on the provided parameters.
- Parameters:
organism – The Organism object representing the organism.
multigenics – A set of multigenic persistent families of the pangenome graph.
persistent_penalty – Penalty score to apply to persistent multigenic families (default: 3).
variable_gain – Gain score to apply to variable multigenic families (default: 1).
min_length – Minimum length threshold (in base pairs) for the regions to be considered RGP (default: 3000).
min_score – Minimum score threshold for considering a region as RGP (default: 4).
naming – Naming scheme for the regions, either “contig” or “organism” (default: “contig”).
disable_bar – Whether to disable the progress bar. It is recommended to disable it when calling this function in a loop on multiple organisms (default: True).
- Returns:
A set of RGPs of the provided organism.
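As a rough illustration of how the min_length and min_score thresholds act together, the sketch below filters a few hypothetical candidate regions. CandidateRegion and passes_thresholds are names invented for this example and are not part of ppanggolin; treating both thresholds as inclusive is also an assumption.

```python
from typing import NamedTuple


class CandidateRegion(NamedTuple):
    """Hypothetical stand-in for a candidate region (not a ppanggolin class)."""
    start: int  # first base of the region (1-based, inclusive)
    stop: int   # last base of the region (inclusive)
    score: int  # score accumulated over the region


def passes_thresholds(region: CandidateRegion,
                      min_length: int = 3000,
                      min_score: int = 4) -> bool:
    """Keep a region only if it is long enough and scores high enough."""
    length = region.stop - region.start + 1
    return length >= min_length and region.score >= min_score


candidates = [
    CandidateRegion(1, 5000, 6),    # long enough, score high enough
    CandidateRegion(100, 1500, 9),  # too short
    CandidateRegion(1, 4000, 2),    # score too low
]
kept = [region for region in candidates if passes_thresholds(region)]
```

With the default thresholds, only the first candidate is kept.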
- ppanggolin.RGP.genomicIsland.extract_rgp(contig, node, rgp_id, naming) Region
Extract the region from the given starting node
- ppanggolin.RGP.genomicIsland.init_matrices(contig: Contig, multi: set, persistent_penalty: int = 3, variable_gain: int = 1) list
Initialize the vector of score/state nodes
- Parameters:
contig – Current contig from one organism
persistent_penalty – Penalty score to apply to persistent genes
variable_gain – Gain score to apply to variable genes
multi – multigenic persistent families of the pangenome graph.
- Returns:
Initialized matrix
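The interplay of persistent_penalty and variable_gain can be pictured as a running score along a contig. The sketch below is a simplified reading of the parameter descriptions, not the library's implementation: it assumes the score never drops below zero, ignores multigenic families, and uses a hypothetical running_scores helper with a plain list of booleans in place of Contig and gene objects.

```python
from typing import List


def running_scores(is_persistent: List[bool],
                   persistent_penalty: int = 3,
                   variable_gain: int = 1) -> List[int]:
    """Return one score per gene, scanning the contig from left to right.

    Assumption: a persistent gene lowers the score (floored at 0),
    any other gene raises it.
    """
    scores = []
    previous = 0
    for persistent in is_persistent:
        if persistent:
            current = max(previous - persistent_penalty, 0)
        else:
            current = previous + variable_gain
        scores.append(current)
        previous = current
    return scores


# Two variable genes, one persistent gene, then three variable genes.
example = running_scores([False, False, True, False, False, False])
```

Under these assumptions a stretch of variable genes builds up the score, and a single persistent gene is enough to reset a short stretch to zero.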
- ppanggolin.RGP.genomicIsland.launch(args: Namespace)
Command launcher
- Parameters:
args – All arguments provided by the user
- ppanggolin.RGP.genomicIsland.mk_regions(contig: Contig, matrix: list, multi: set, min_length: int = 3000, min_score: int = 4, persistent: int = 3, continuity: int = 1, naming: str = 'contig') Set[Region]
Process the matrix, ‘emptying’ it to extract the regions.
- Parameters:
contig – Current contig from one organism
matrix – Initialized matrix
multi – multigenic persistent families of the pangenome graph.
min_length – Minimum length (bp) of a region to be considered RGP
min_score – Minimal score wanted for considering a region as being RGP
persistent – Penalty score to apply to persistent genes
continuity – Gain score to apply to variable genes
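naming – Naming scheme for the regions, either “contig” or “organism”
- Returns:
A set of the regions extracted from the contig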
- ppanggolin.RGP.genomicIsland.naming_scheme(organisms: Iterable[Organism]) str
Determine the naming scheme for the contigs in the pangenome.
- Parameters:
organisms – Iterable of Organism objects
- Returns:
Naming scheme for the contigs (“contig” or “organism”).
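One plausible rule for choosing between the two schemes is to use “contig” only when contig names are unique across all organisms. The sketch below illustrates that rule; the rule itself, the pick_naming_scheme helper, and the dictionary input are assumptions made for this example, not the documented behaviour.

```python
from typing import Dict, List


def pick_naming_scheme(contigs_by_organism: Dict[str, List[str]]) -> str:
    """Return "contig" if every contig name is unique, else "organism".

    contigs_by_organism maps an organism name to its contig names
    (a hypothetical stand-in for Organism objects).
    """
    seen = set()
    for contig_names in contigs_by_organism.values():
        for name in contig_names:
            if name in seen:
                # A repeated contig name would be ambiguous in region names.
                return "organism"
            seen.add(name)
    return "contig"


unique = pick_naming_scheme({"orgA": ["NC_1", "NC_2"], "orgB": ["NC_3"]})
clashing = pick_naming_scheme({"orgA": ["contig_1"], "orgB": ["contig_1"]})
```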
- ppanggolin.RGP.genomicIsland.parser_rgp(parser: ArgumentParser)
Parser for specific argument of rgp command
- Parameters:
parser – Parser for the rgp command arguments
- ppanggolin.RGP.genomicIsland.predict_rgp(pangenome: Pangenome, persistent_penalty: int = 3, variable_gain: int = 1, min_length: int = 3000, min_score: int = 4, dup_margin: float = 0.05, force: bool = False, disable_bar: bool = False)
Main function to predict regions of genomic plasticity
- Parameters:
pangenome – blank pangenome object
persistent_penalty – Penalty score to apply to persistent genes
variable_gain – Gain score to apply to variable genes
min_length – Minimum length (bp) of a region to be considered RGP
min_score – Minimal score wanted for considering a region as being RGP
dup_margin – Minimum ratio of organisms in which a family must have multiple genes to be considered duplicated
force – Allow forced writing to the pangenome file
disable_bar – Disable progress bar
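The dup_margin description can be read as a simple ratio test on each gene family. The sketch below shows that reading; the is_duplicated helper, its dictionary input, and the inclusive comparison are assumptions for illustration only.

```python
from typing import Dict


def is_duplicated(gene_counts: Dict[str, int],
                  total_organisms: int,
                  dup_margin: float = 0.05) -> bool:
    """Decide whether a family counts as duplicated.

    gene_counts maps an organism name to the number of genes of the
    family found in that organism (hypothetical input format).
    """
    multi_copy = sum(1 for count in gene_counts.values() if count > 1)
    return multi_copy / total_organisms >= dup_margin


# Two of ten organisms carry several copies: ratio 0.2.
family_counts = {"orgA": 2, "orgB": 3, "orgC": 1}
duplicated_default = is_duplicated(family_counts, total_organisms=10)
duplicated_strict = is_duplicated(family_counts, total_organisms=10, dup_margin=0.5)
```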
- ppanggolin.RGP.genomicIsland.rewrite_matrix(contig, matrix, index, persistent, continuity, multi)
Rewrite the matrix from the given index of the node that started a region.
- ppanggolin.RGP.genomicIsland.subparser(sub_parser: _SubParsersAction) ArgumentParser
Subparser to launch PPanGGOLiN in Command line
- Parameters:
sub_parser – Subparser for the rgp command
- Returns:
Parser arguments for the rgp command
ppanggolin.RGP.rgp_cluster module
- class ppanggolin.RGP.rgp_cluster.IdenticalRegions(name: str, identical_rgps: Set[Region], families: Set[GeneFamily], is_contig_border: bool)
Bases: object
Represents a group of identical regions within a pangenome.
- Parameters:
name – The name of the identical region group.
identical_rgps – A set of Region objects representing the identical regions.
families – A set of GeneFamily objects associated with the identical regions.
is_contig_border – A boolean indicating if the identical regions span across contig borders.
- property genes
Return iterable of genes from all RGPs that are identical in families
- ppanggolin.RGP.rgp_cluster.add_edges_to_identical_rgps(rgp_graph: Graph, identical_rgps_objects: List[IdenticalRegions])
Replace each IdenticalRegions object with all the identical RGPs it contains.
- Parameters:
rgp_graph – The RGP graph to add edges to.
identical_rgps_objects – A list of IdenticalRegions objects, each grouping RGPs with identical families.
- ppanggolin.RGP.rgp_cluster.add_info_to_identical_rgps(rgp_graph: Graph, identical_rgps_objects: List[IdenticalRegions], rgp_to_spot: Dict[Region, int])
Add identical rgps info in the graph as node attributes.
- Parameters:
rgp_graph – Graph with RGP IDs as nodes and GRR values as edges
identical_rgps_objects – A list of IdenticalRegions objects, each grouping RGPs with identical families
rgp_to_spot – A dictionary mapping an RGP to its spot
- ppanggolin.RGP.rgp_cluster.add_info_to_rgp_nodes(graph, regions: List[Region], region_to_spot: dict)
Format RGP information into a dictionary for adding to the graph.
This function takes a list of RGPs and a dictionary mapping each RGP to its corresponding spot ID, and formats the RGP information into a dictionary for further processing or addition to a graph.
- Parameters:
graph – RGPs graph
regions – A list of RGPs.
region_to_spot – A dictionary mapping each RGP to its corresponding spot ID.
- Returns:
A dictionary with RGP id as the key and a dictionary containing information on the corresponding RGP as value.
- ppanggolin.RGP.rgp_cluster.add_rgp_metadata_to_graph(graph: Graph, rgps: List[Region | IdenticalRegions]) None
Add metadata from Region or IdenticalRegions objects to the graph.
- Parameters:
graph – The graph to which the metadata will be added.
rgps – A list of Region or IdenticalRegions objects containing the metadata to be added.
- ppanggolin.RGP.rgp_cluster.cluster_rgp(pangenome, grr_cutoff: float, output: str, basename: str, ignore_incomplete_rgp: bool, unmerge_identical_rgps: bool, grr_metric: str, disable_bar: bool, graph_formats: Set[str], add_metadata: bool = False, metadata_sep: str = '|', metadata_sources: List[str] | None = None)
Main function to cluster regions of genomic plasticity based on their GRR
- Parameters:
pangenome – pangenome object
grr_cutoff – GRR cutoff value for clustering
output – Directory where the output files will be saved
basename – Basename for the output files
ignore_incomplete_rgp – Whether to ignore incomplete RGPs located at a contig border
unmerge_identical_rgps – Whether to unmerge identical RGPs into separate nodes in the graph
grr_metric – GRR metric to use for clustering
disable_bar – Whether to disable the progress bar
graph_formats – Set of graph file formats to save the output
add_metadata – Add metadata to cluster files
metadata_sep – The separator used to join multiple metadata values
metadata_sources – Sources of the metadata to use and write in the outputs. None means all sources are used.
- ppanggolin.RGP.rgp_cluster.cluster_rgp_on_grr(graph: Graph, clustering_attribute: str = 'grr')
Cluster RGPs based on their GRR using Louvain community detection.
- Parameters:
graph – NetworkX graph object representing the RGPs and their relationship
clustering_attribute – Attribute of the graph to use for clustering (default is “grr”)
- ppanggolin.RGP.rgp_cluster.compute_grr(rgp_a_families: Set[GeneFamily], rgp_b_families: Set[GeneFamily], mode: Callable) float
Compute the gene repertoire relatedness (GRR) between two RGPs. Mode can be the function min, to compute the min GRR, or the function max, to compute the max GRR.
- Parameters:
rgp_a_families – Gene families of RGP A
rgp_b_families – Gene families of RGP B
mode – The min or max function
- Returns:
GRR value between 0 and 1
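A minimal sketch of the computation, assuming GRR is the number of shared families divided by mode applied to the two repertoire sizes. The grr helper and the use of strings instead of GeneFamily objects are choices made for this example.

```python
from typing import Callable, Set


def grr(families_a: Set[str], families_b: Set[str],
        mode: Callable[[int, int], int] = min) -> float:
    """Shared families divided by the smaller (min) or larger (max) repertoire."""
    shared = len(families_a & families_b)
    return shared / mode(len(families_a), len(families_b))


rgp_a = {"fam1", "fam2", "fam3"}
rgp_b = {"fam2", "fam3", "fam4", "fam5"}

min_grr = grr(rgp_a, rgp_b, min)  # 2 shared / 3 families in the smaller RGP
max_grr = grr(rgp_a, rgp_b, max)  # 2 shared / 4 families in the larger RGP
```

The min variant is the more permissive of the two: a small RGP fully contained in a larger one gets a min GRR of 1, while its max GRR stays below 1.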
- ppanggolin.RGP.rgp_cluster.compute_jaccard_index(rgp_a_families: set, rgp_b_families: set) float
Compute the Jaccard index between two RGPs based on their gene families.
- Parameters:
rgp_a_families – Gene families of RGP A
rgp_b_families – Gene families of RGP B
- Returns:
Jaccard index
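For reference, the Jaccard index of two sets is the size of their intersection divided by the size of their union. The sketch below applies that standard definition to family sets; the jaccard_index helper and the string family names are illustrative, and both sets are assumed to be non-empty.

```python
from typing import Set


def jaccard_index(families_a: Set[str], families_b: Set[str]) -> float:
    """Standard Jaccard index: |A & B| / |A | B| (sets assumed non-empty)."""
    return len(families_a & families_b) / len(families_a | families_b)


rgp_a = {"fam1", "fam2", "fam3"}
rgp_b = {"fam2", "fam3", "fam4", "fam5"}
similarity = jaccard_index(rgp_a, rgp_b)  # 2 shared out of 5 distinct families
```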
- ppanggolin.RGP.rgp_cluster.compute_rgp_metric(rgp_a: Region, rgp_b: Region, grr_cutoff: float, grr_metric: str) Tuple[int, int, dict] | None
Compute GRR metric between two RGPs.
- Parameters:
rgp_a – An RGP
rgp_b – Another RGP
grr_cutoff – Cutoff filter
grr_metric – GRR metric to use: min_grr, max_grr or incomplete_aware_grr
- Returns:
Tuple containing the IDs of the two RGPs and the computed metrics as a dictionary
- ppanggolin.RGP.rgp_cluster.dereplicate_rgp(rgps: Set[Region | IdenticalRegions], disable_bar: bool = False) List[Region | IdenticalRegions]
Dereplicate RGPs that have the same families.
Given a set of Region or IdenticalRegions objects representing RGPs, this function groups together RGPs with the same families into IdenticalRegions objects and returns a list of dereplicated RGPs.
- Parameters:
rgps – A set of Region or IdenticalRegions objects representing the RGPs to be dereplicated.
disable_bar – If True, disable the progress bar.
- Returns:
A list of dereplicated RGPs (Region or IdenticalRegions objects). For RGPs with the same families, they will be grouped together in IdenticalRegions objects.
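The grouping step can be sketched by keying each RGP on a frozenset of its families. This is only an illustration of the idea: group_by_families is a hypothetical helper, and plain names and strings stand in for Region, IdenticalRegions and GeneFamily objects.

```python
from collections import defaultdict
from typing import Dict, List, Set


def group_by_families(rgp_families: Dict[str, Set[str]]) -> List[List[str]]:
    """Group RGP names whose family sets are exactly equal."""
    groups = defaultdict(list)
    for rgp_name, families in rgp_families.items():
        # frozenset is hashable, so equal family sets share one key.
        groups[frozenset(families)].append(rgp_name)
    return [sorted(names) for names in groups.values()]


rgps = {
    "rgp_1": {"famA", "famB"},
    "rgp_2": {"famB", "famA"},  # same families as rgp_1
    "rgp_3": {"famC"},
}
grouped = group_by_families(rgps)
```

In this example rgp_1 and rgp_2 end up in one group, which would correspond to a single IdenticalRegions object, while rgp_3 stays on its own.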
- ppanggolin.RGP.rgp_cluster.format_rgp_metadata(rgp: Region) Dict[str, str]
Format RGP metadata by combining source and field values.
Given an RGP object with metadata, this function creates a new dictionary where the keys are formatted as ‘source_field’ and the values are concatenated with ‘|’ as the delimiter.
- Parameters:
rgp – The RGP object with metadata.
- Returns:
A dictionary with formatted metadata.
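The key and value formatting described above can be sketched as follows. The input format, a list of (source, field, value) tuples, and the format_metadata helper are assumptions for this example; the real function reads the metadata from the Region object.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def format_metadata(records: List[Tuple[str, str, str]],
                    sep: str = "|") -> Dict[str, str]:
    """Build 'source_field' keys and join repeated values with sep."""
    values = defaultdict(list)
    for source, field, value in records:
        values[f"{source}_{field}"].append(str(value))
    return {key: sep.join(vals) for key, vals in values.items()}


records = [
    ("annot", "type", "phage"),
    ("annot", "type", "plasmid"),
    ("db", "score", "0.9"),
]
formatted = format_metadata(records)
```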
- ppanggolin.RGP.rgp_cluster.get_spot_id(rgp: Region, rgp_to_spot: Dict[Region, int]) str
Return the spot ID associated with an RGP. It adds the prefix “spot” to the spot ID. When no spot is associated with the RGP, the string “No spot” is returned.
- Parameters:
rgp – The RGP (Region object) to look up
rgp_to_spot – A dictionary mapping an RGP to its spot.
- Returns:
Spot ID of the given RGP with the prefix spot or “No spot”.
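A minimal sketch of the lookup, using a hypothetical spot_label helper and strings in place of Region objects. The underscore between the prefix and the ID is an assumption; the description above only states that the prefix “spot” is added.

```python
from typing import Dict, Hashable


def spot_label(rgp: Hashable, rgp_to_spot: Dict[Hashable, int]) -> str:
    """Return a prefixed spot ID, or "No spot" when the RGP has none."""
    if rgp in rgp_to_spot:
        return f"spot_{rgp_to_spot[rgp]}"  # separator is an assumption
    return "No spot"


rgp_to_spot = {"rgp_1": 12}
with_spot = spot_label("rgp_1", rgp_to_spot)
without_spot = spot_label("rgp_2", rgp_to_spot)
```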
- ppanggolin.RGP.rgp_cluster.join_dicts(dicts: List[Dict[str, Any]], delimiter: str = ';') Dict[str, Any]
Join dictionaries by concatenating the values with a custom delimiter for common keys.
Given a list of dictionaries, this function creates a new dictionary where the values for common keys are concatenated with the specified delimiter.
- Parameters:
dicts – A list of dictionaries to be joined.
delimiter – The delimiter to use for joining values. Default is ‘;’.
- Returns:
A dictionary with joined values for common keys.
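The joining behaviour can be sketched as below. join_dicts_sketch is an illustrative re-implementation, not the library code; it keeps values in order of appearance and does not remove duplicates, neither of which is specified by the description above.

```python
from collections import defaultdict
from typing import Any, Dict, List


def join_dicts_sketch(dicts: List[Dict[str, Any]],
                      delimiter: str = ";") -> Dict[str, str]:
    """Concatenate the values found under each key across all dictionaries."""
    collected = defaultdict(list)
    for current in dicts:
        for key, value in current.items():
            collected[key].append(str(value))
    return {key: delimiter.join(values) for key, values in collected.items()}


joined = join_dicts_sketch([{"a": "1", "b": "x"}, {"a": "2"}])
```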
- ppanggolin.RGP.rgp_cluster.launch(args: Namespace)
Command launcher
- Parameters:
args – All arguments provided by user
- ppanggolin.RGP.rgp_cluster.parser_cluster_rgp(parser: ArgumentParser)
Parser for specific arguments of the cluster_rgp command
- Parameters:
parser – Parser for cluster_rgp argument
- ppanggolin.RGP.rgp_cluster.subparser(sub_parser: _SubParsersAction) ArgumentParser
Subparser to launch PPanGGOLiN in Command line
- Parameters:
sub_parser – Subparser for the cluster_rgp command
- Returns:
Parser arguments for the cluster_rgp command
- ppanggolin.RGP.rgp_cluster.write_rgp_cluster_table(outfile: str, grr_graph: Graph, rgps_in_graph: List[Region | IdenticalRegions], grr_metric: str, rgp_to_spot: Dict[Region, int]) None
Writes RGP cluster info to a TSV file using pandas.
- Parameters:
outfile – Name of the tsv file
grr_graph – The GRR graph.
rgps_in_graph – A list of the Region or IdenticalRegions objects present in the graph.
grr_metric – The GRR metric used for clustering.
rgp_to_spot – A dictionary mapping an RGP to its spot.
- Returns:
None
ppanggolin.RGP.spot module
- ppanggolin.RGP.spot.add_new_node_in_spot_graph(g: Graph, region: Region, borders: list) str
Add bordering region as node to graph
- Parameters:
g – spot graph
region – region in spot
borders – bordering families in spot
- Returns:
name of the node that has been added
- ppanggolin.RGP.spot.check_pangenome_former_spots(pangenome: Pangenome, force: bool = False)
Checks the pangenome status and the .h5 file for previously computed spots; deletes them if allowed, otherwise raises an error.
- Parameters:
pangenome – Pangenome object
force – Allow forced writing to the pangenome file
- ppanggolin.RGP.spot.check_sim(pair_border1: list, pair_border2: list, overlapping_match: int = 2, set_size: int = 3, exact_match: int = 1) bool
Checks whether the first exact_match gene families of the two border pairs are identical or, failing that, whether they overlap in an ordered way over at least overlapping_match families.
- Parameters:
pair_border1 – First flanking gene families pair
pair_border2 – Second flanking gene families pair
overlapping_match – Number of missing persistent genes allowed when comparing flanking genes
set_size – Number of single copy markers to use as flanking genes for RGP during hotspot computation
exact_match – Number of perfectly matching flanking single copy markers required to associate RGPs
- Returns:
Whether the flanking gene families are considered identical
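The exact-match part of this comparison can be sketched as follows. This is a deliberately simplified illustration: it omits the ordered-overlap fallback controlled by overlapping_match and set_size, and it assumes that two pairs may also match when their left and right borders are swapped. The same_start and borders_match helpers and the string family names are invented for the example.

```python
from typing import Sequence, Tuple

Border = Sequence[str]  # ordered flanking gene families, closest to the RGP first


def same_start(border1: Border, border2: Border, exact_match: int = 1) -> bool:
    """True if the first exact_match families of both borders are identical."""
    return list(border1[:exact_match]) == list(border2[:exact_match])


def borders_match(pair1: Tuple[Border, Border],
                  pair2: Tuple[Border, Border],
                  exact_match: int = 1) -> bool:
    """Compare two border pairs in direct and swapped orientation."""
    direct = (same_start(pair1[0], pair2[0], exact_match)
              and same_start(pair1[1], pair2[1], exact_match))
    swapped = (same_start(pair1[0], pair2[1], exact_match)
               and same_start(pair1[1], pair2[0], exact_match))
    return direct or swapped


pair_a = (["f1", "f2", "f3"], ["f7", "f8", "f9"])
pair_b = (["f7", "f5", "f6"], ["f1", "f4", "f2"])  # same first families, swapped
pair_c = (["f2", "f1", "f3"], ["f7", "f8", "f9"])  # left border starts differently

swapped_match = borders_match(pair_a, pair_b)
no_match = borders_match(pair_a, pair_c)
```

Raising exact_match makes the test stricter: with exact_match=2, pair_a and pair_b no longer match because their second flanking families differ.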
- ppanggolin.RGP.spot.comp_border(border1: list, border2: list, overlapping_match: int = 2, set_size: int = 3, exact_match: int = 1) bool
Compare two borders
- Parameters:
border1 –
border2 –
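overlapping_match – Number of missing persistent genes allowed when comparing flanking genes
set_size – Number of single copy markers to use as flanking genes for RGP during hotspot computation
exact_match – Number of perfectly matching flanking single copy markers required to associate RGPs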
- Returns:
- ppanggolin.RGP.spot.launch(args: Namespace)
Command launcher
- Parameters:
args – All arguments provided by the user
- ppanggolin.RGP.spot.make_spot_graph(rgps: list, multigenics: set, overlapping_match: int = 2, set_size: int = 3, exact_match: int = 1) Graph
Create a spot graph from the pangenome RGPs
- Parameters:
rgps – List of pangenome RGPs
multigenics – pangenome graph multigenic persistent families
overlapping_match – Number of missing persistent genes allowed when comparing flanking genes
set_size – Number of single copy markers to use as flanking genes for RGP during hotspot computation
exact_match – Number of perfectly matching flanking single copy markers required to associate RGPs
- Returns:
spot graph
- ppanggolin.RGP.spot.parser_spot(parser: ArgumentParser)
Parser for specific argument of spot command
- Parameters:
parser – Parser for the spot command arguments
- ppanggolin.RGP.spot.predict_hotspots(pangenome: Pangenome, output: Path, spot_graph: bool = False, graph_formats: List[str] = ['gexf'], overlapping_match: int = 2, set_size: int = 3, exact_match: int = 1, force: bool = False, disable_bar: bool = False)
Main function to predict hotspots
- Parameters:
pangenome – Blank pangenome object
output – Output directory to save the spot graph
spot_graph – Writes graph of pairs of blocks of single copy markers flanking RGPs from same hotspot
graph_formats – List of graph file formats to save the output
overlapping_match – Number of missing persistent genes allowed when comparing flanking genes
set_size – Number of single copy markers to use as flanking genes for RGP during hotspot computation
exact_match – Number of perfectly matching flanking single copy markers required to associate RGPs
force – Allow forced writing to the pangenome file
disable_bar – Disable progress bar
- ppanggolin.RGP.spot.subparser(sub_parser: _SubParsersAction) ArgumentParser
Subparser to launch PPanGGOLiN in Command line
- Parameters:
sub_parser – Subparser for the spot command
- Returns:
Parser arguments for the spot command
- ppanggolin.RGP.spot.write_spot_graph(graph_spot, outdir, graph_formats, file_basename='spotGraph')