ppanggolin.context package
Submodules
ppanggolin.context.searchGeneContext module
- ppanggolin.context.searchGeneContext.add_edges_to_context_graph(context_graph: Graph, contig: Contig, contig_windows: List[Tuple[int, int]], transitivity: int) Graph
Add edges to the context graph based on contig genes and windows.
- Parameters:
context_graph – The context graph to which edges will be added.
contig – The contig whose genes are used to add edges.
contig_windows – A list of tuples representing the start and end positions of contig windows.
transitivity – The number of next genes to consider when adding edges.
- Returns:
A context graph specific to the contig of interest with edges added
- ppanggolin.context.searchGeneContext.add_val_to_dict_attribute(attr_dict: dict, attribute_key, attribute_value)
Add an attribute value to an edge or node dictionary set.
- Parameters:
attr_dict – The dictionary containing the edge/node attributes.
attribute_key – The key of the attribute.
attribute_value – The value of the attribute to be added.
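As an illustration of the "dictionary set" idea described above, here is a minimal sketch. It assumes each attribute key maps to a Python set that is created on first use; the helper name add_value_to_set_attribute and the sample keys are hypothetical and are not part of PPanGGOLiN.

```python
def add_value_to_set_attribute(attr_dict: dict, attribute_key, attribute_value) -> None:
    """Add a value to the set stored under attribute_key, creating the set if needed."""
    # Assumption: attributes are stored as sets, so duplicates collapse.
    attr_dict.setdefault(attribute_key, set()).add(attribute_value)


edge_attributes = {}
add_value_to_set_attribute(edge_attributes, "genomes", "genome_A")
add_value_to_set_attribute(edge_attributes, "genomes", "genome_B")
add_value_to_set_attribute(edge_attributes, "genomes", "genome_A")  # already present
```

Storing sets rather than lists means the same value can be recorded many times without growing the attribute.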
- ppanggolin.context.searchGeneContext.align_sequences_to_families(pangenome: Pangenome, output: Path, sequence_file: Path | None = None, identity: float = 0.5, coverage: float = 0.8, use_representatives: bool = False, no_defrag: bool = False, cpu: int = 1, translation_table: int = 11, tmpdir: Path | None = None, keep_tmp: bool = False, disable_bar=True) Tuple[Set[GeneFamily], Dict[GeneFamily, Set[str]]]
Align sequences to pangenome gene families to get families of interest
- Parameters:
pangenome – Pangenome containing GeneFamilies to align with sequence set
sequence_file – Path to file containing the sequences
output – Path to output directory
tmpdir – Path to temporary directory
identity – minimum identity threshold between sequences and gene families for the alignment
coverage – minimum coverage threshold between sequences and gene families for the alignment
use_representatives – Use representative sequences of families rather than all sequences to align input genes
no_defrag – do not use the defragmentation workflow if true
cpu – Number of CPU cores to use.
disable_bar – If True, disable the progress bar.
translation_table – The translation table to use when the input sequences are nucleotide sequences.
keep_tmp – If True, keep temporary files.
- Returns:
The set of gene families of interest, and a dictionary that links each gene family to the IDs of the sequences aligned to it.
- ppanggolin.context.searchGeneContext.check_pangenome_for_context_search(pangenome: Pangenome, sequences: bool = False)
Check pangenome status and information to search context
- Parameters:
pangenome – The pangenome object
sequences – True if contexts are searched using input sequences.
- ppanggolin.context.searchGeneContext.compute_edge_metrics(context_graph: Graph, gene_proportion_cutoff: float) None
Compute various metrics on the edges of the context graph.
- Parameters:
context_graph – The context graph.
gene_proportion_cutoff – The minimum proportion of shared genes between two features for their edge to be considered significant.
- ppanggolin.context.searchGeneContext.compute_gene_context_graph(families: Iterable[GeneFamily], transitive: int = 4, window_size: int = 0, disable_bar: bool = False) Tuple[Graph, Dict[FrozenSet[GeneFamily], Set[Organism]]]
Construct the graph of gene contexts between families of the pangenome.
- Parameters:
families – An iterable of gene families.
transitive – Size of the transitive closure used to build the graph.
window_size – Size of the window for extracting gene contexts (default: 0).
disable_bar – Flag to disable the progress bar (default: False).
- Returns:
The constructed gene context graph, and a dictionary mapping each combination of gene families that occurs in at least one genome to the genomes in which it occurs.
- ppanggolin.context.searchGeneContext.export_context_to_dataframe(gene_contexts: set, fam2seq: Dict[GeneFamily, Set[str]], families_of_interest: Set[GeneFamily], output: Path)
Export the results to a DataFrame.
- Parameters:
gene_contexts – connected components found in the pangenome
fam2seq – Dictionary with gene families as keys and set of sequence ids as values
families_of_interest – families of interest that are at the origin of the context.
output – output path
- ppanggolin.context.searchGeneContext.fam_to_seq(seq_to_pan: dict) dict
Create a dictionary with gene families as keys and lists of sequence IDs as values.
- Parameters:
seq_to_pan – Dictionary storing the sequence ids as keys and the gene families to which they are assigned as values
- Returns:
The reversed dictionary, mapping gene families to sequence IDs.
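The reversal described above can be sketched as follows. This is an illustration only: it uses plain strings as stand-ins for gene families, and the function name reverse_seq_to_family is hypothetical rather than the library's own code.

```python
from typing import Dict, Hashable, List


def reverse_seq_to_family(seq_to_family: Dict[str, Hashable]) -> Dict[Hashable, List[str]]:
    """Map each family to the list of sequence IDs assigned to it."""
    family_to_seqs: Dict[Hashable, List[str]] = {}
    for seq_id, family in seq_to_family.items():
        # Several sequences may be assigned to the same family.
        family_to_seqs.setdefault(family, []).append(seq_id)
    return family_to_seqs


seq_to_family = {"seq1": "famA", "seq2": "famB", "seq3": "famA"}
family_to_seqs = reverse_seq_to_family(seq_to_family)
```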
- ppanggolin.context.searchGeneContext.get_contig_to_genes(gene_families: Iterable[GeneFamily]) Dict[Contig, Set[Gene]]
Group genes from specified gene families by contig.
- Parameters:
gene_families – An iterable of gene family objects.
- Returns:
A dictionary mapping contigs to sets of genes.
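A minimal sketch of this kind of grouping is shown below. It does not use PPanGGOLiN classes: ToyGene is a hypothetical stand-in that carries only a gene ID and a contig name, and each family is assumed to be a plain iterable of such genes.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set


class ToyGene(NamedTuple):
    """Hypothetical stand-in for a gene: an ID and the name of its contig."""
    gene_id: str
    contig: str


def group_genes_by_contig(families: Iterable[Iterable[ToyGene]]) -> Dict[str, Set[ToyGene]]:
    """Collect the genes of all given families, keyed by contig."""
    contig_to_genes: Dict[str, Set[ToyGene]] = defaultdict(set)
    for family in families:
        for gene in family:
            contig_to_genes[gene.contig].add(gene)
    return dict(contig_to_genes)


family_a = [ToyGene("g1", "contig_1"), ToyGene("g2", "contig_2")]
family_b = [ToyGene("g3", "contig_1")]
grouped = group_genes_by_contig([family_a, family_b])
```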
- ppanggolin.context.searchGeneContext.get_gene_contexts(context_graph: Graph, families_of_interest: Set[GeneFamily]) Set[GeneContext]
Extract gene contexts from a context graph based on the provided set of gene families of interest.
Gene contexts are extracted from a context graph by identifying connected components. The function then discards singleton components (those with only one gene family) and components that do not contain any gene family of interest.
For each remaining connected component, a GeneContext object is created.
- Parameters:
context_graph – The context graph from which to extract gene contexts.
families_of_interest – Set of gene families of interest.
- Returns:
Set of GeneContext objects representing the extracted gene contexts.
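The filtering of connected components can be sketched in pure Python as follows. This is a simplified illustration under stated assumptions: the graph is a plain adjacency dictionary of strings (not the Graph type used by the module), the helper names are hypothetical, and no GeneContext objects are built.

```python
from typing import Dict, List, Set


def connected_components(adjacency: Dict[str, Set[str]]) -> List[Set[str]]:
    """Return the connected components of an undirected graph."""
    seen: Set[str] = set()
    components: List[Set[str]] = []
    for start in adjacency:
        if start in seen:
            continue
        component: Set[str] = set()
        stack = [start]
        while stack:  # iterative depth-first traversal
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


def filter_components(adjacency: Dict[str, Set[str]], of_interest: Set[str]) -> List[Set[str]]:
    """Keep components with more than one node and at least one node of interest."""
    return [c for c in connected_components(adjacency) if len(c) > 1 and c & of_interest]


adjacency = {
    "famA": {"famB"},
    "famB": {"famA", "famC"},
    "famC": {"famB"},
    "famD": set(),      # singleton, dropped even though it is of interest
    "famE": {"famF"},
    "famF": {"famE"},   # no family of interest, dropped
}
kept = filter_components(adjacency, {"famA", "famD"})
```

Only the component containing famA, famB and famC passes both filters in this toy graph.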
- ppanggolin.context.searchGeneContext.get_n_next_genes_index(current_index: int, next_genes_count: int, contig_size: int, is_circular: bool = False) Iterator[int]
Generate the indices of the next genes based on the current index and contig properties.
- Parameters:
current_index – The index of the current gene.
next_genes_count – The number of next genes to consider.
contig_size – The total number of genes in the contig.
is_circular – Flag indicating whether the contig is circular (default: False).
- Returns:
An iterator yielding the indices of the next genes.
- Raises:
IndexError – If the current index is out of range for the given contig size.
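One plausible way to generate such indices is sketched below. The function name next_gene_indices is hypothetical, and one behaviour is an assumption not confirmed by this page: on a circular contig, iteration stops once it would wrap back to the current gene.

```python
from typing import Iterator


def next_gene_indices(current_index: int, next_genes_count: int,
                      contig_size: int, is_circular: bool = False) -> Iterator[int]:
    """Yield the indices of up to next_genes_count genes following current_index."""
    # Because this is a generator, the check runs when iteration starts.
    if not 0 <= current_index < contig_size:
        raise IndexError(f"Index {current_index} is out of range for a contig of {contig_size} genes.")
    for offset in range(1, next_genes_count + 1):
        next_index = current_index + offset
        if is_circular:
            next_index %= contig_size
            if next_index == current_index:
                break  # assumption: do not loop back onto the current gene
        elif next_index >= contig_size:
            break  # linear contig: stop at the last gene
        yield next_index


linear = list(next_gene_indices(8, 3, contig_size=10))
circular = list(next_gene_indices(8, 3, contig_size=10, is_circular=True))
```

On the linear contig only index 9 remains after gene 8, while the circular contig wraps around to indices 0 and 1.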
- ppanggolin.context.searchGeneContext.increment_attribute_counter(edge_dict: dict, key: Hashable)
Increment the counter for an edge/node attribute in the edge/node dictionary.
- Parameters:
edge_dict – The dictionary containing the attributes.
key – The key of the attribute.
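A counter of this kind can be sketched in one line. The sketch assumes a missing key starts at zero; the helper name increment_counter and the key "adjacent_count" are hypothetical.

```python
from typing import Hashable


def increment_counter(attr_dict: dict, key: Hashable) -> None:
    """Increase the counter stored under key by one, starting from zero if absent."""
    attr_dict[key] = attr_dict.get(key, 0) + 1


edge_counts = {}
for _ in range(3):
    increment_counter(edge_counts, "adjacent_count")
```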
- ppanggolin.context.searchGeneContext.launch(args: Namespace)
Command launcher
- Parameters:
args – All arguments provided by the user
- ppanggolin.context.searchGeneContext.make_graph_writable(context_graph)
The original context graph contains ppanggolin objects as nodes and lists and dictionaries in edge attributes. Since these objects cannot be written to the output graph, this function creates a new graph that contains only writable objects.
- Parameters:
context_graph – The context graph to convert into a writable graph.
- ppanggolin.context.searchGeneContext.parser_context(parser: ArgumentParser)
Parser for the arguments specific to the context command
- Parameters:
parser – Parser for the context command arguments
- ppanggolin.context.searchGeneContext.search_gene_context_in_pangenome(pangenome: Pangenome, output: Path, sequence_file: Path | None = None, families: Path | None = None, transitive: int = 4, jaccard_threshold: float = 0.85, window_size: int = 1, graph_format: str = 'graphml', disable_bar=True, **kwargs)
Main function to search common gene contexts between sequence set and pangenome families
- Parameters:
pangenome – Pangenome containing GeneFamilies to align with sequence set
sequence_file – Path to file containing the sequences
families – Path to file containing family names
output – Path to output directory
transitive – number of genes to check on both sides of a family aligned with an input sequence
jaccard_threshold – Jaccard index threshold to filter edges in graph
window_size – Number of genes to consider in the gene context.
graph_format – Write format of the context graph. Can be graphml or gexf
disable_bar – If True, disable the progress bar.
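For readers unfamiliar with the Jaccard index used by jaccard_threshold, the sketch below computes it for two sets and compares it to the default threshold of 0.85. Which sets are compared for each edge is not stated on this page, so the genome sets here are purely illustrative.

```python
def jaccard_index(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union (0.0 for two empty sets)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Illustrative sets only: genomes in which two hypothetical families occur.
genomes_fam_a = {"g1", "g2", "g3", "g4"}
genomes_fam_b = {"g2", "g3", "g4", "g5"}

score = jaccard_index(genomes_fam_a, genomes_fam_b)  # 3 shared out of 5 in total
passes_default_threshold = score >= 0.85
```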
- ppanggolin.context.searchGeneContext.subparser(sub_parser: _SubParsersAction) ArgumentParser
Subparser to launch PPanGGOLiN from the command line
- Parameters:
sub_parser – Sub-parser for the context command
- Returns:
Parser with the arguments for the context command
- ppanggolin.context.searchGeneContext.write_graph(graph: Graph, output_dir: Path, graph_format: str)
Write a graph to file in GraphML and/or GEXF format.
- Parameters:
graph – Graph to write
output_dir – The output directory where the graph file will be written.
graph_format – Formats of the output graph. Can be graphml or gexf