ppanggolin.context package

Submodules

ppanggolin.context.searchGeneContext module

ppanggolin.context.searchGeneContext.add_edges_to_context_graph(context_graph: Graph, contig: Contig, contig_windows: List[Tuple[int, int]], transitivity: int) → Graph

Add edges to the context graph based on contig genes and windows.

Parameters:
  • context_graph – The context graph to which edges will be added.

  • contig – The contig whose genes are used to add edges.

  • contig_windows – A list of tuples representing the start and end positions of contig windows.

  • transitivity – The number of next genes to consider when adding edges.

Returns:

A context graph specific to the contig of interest with edges added

ppanggolin.context.searchGeneContext.add_val_to_dict_attribute(attr_dict: dict, attribute_key, attribute_value)

Add an attribute value to an edge or node dictionary set.

Parameters:
  • attr_dict – The dictionary containing the edge/node attributes.

  • attribute_key – The key of the attribute.

  • attribute_value – The value of the attribute to be added.
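The behaviour described above can be pictured with a small sketch. It assumes that each attribute is stored as a set under its key, which the description suggests but does not spell out; the helper name add_val_to_set_attribute is hypothetical and is not part of PPanGGOLiN.

```python
def add_val_to_set_attribute(attr_dict: dict, attribute_key, attribute_value) -> None:
    """Add a value to the set stored under attribute_key (hypothetical helper).

    Assumption: attribute values are kept in a set, created on first use.
    """
    attr_dict.setdefault(attribute_key, set()).add(attribute_value)


edge_attrs = {}
add_val_to_set_attribute(edge_attrs, "genomes", "genome_A")
add_val_to_set_attribute(edge_attrs, "genomes", "genome_B")
# Adding the same value twice has no effect because the container is a set.
add_val_to_set_attribute(edge_attrs, "genomes", "genome_A")
```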

ppanggolin.context.searchGeneContext.align_sequences_to_families(pangenome: Pangenome, output: Path, sequence_file: Path | None = None, identity: float = 0.5, coverage: float = 0.8, use_representatives: bool = False, no_defrag: bool = False, cpu: int = 1, translation_table: int = 11, tmpdir: Path | None = None, keep_tmp: bool = False, disable_bar=True) → Tuple[Set[GeneFamily], Dict[GeneFamily, Set[str]]]

Align sequences to pangenome gene families to get families of interest

Parameters:
  • pangenome – Pangenome containing GeneFamilies to align with sequence set

  • sequence_file – Path to file containing the sequences

  • output – Path to output directory

  • tmpdir – Path to temporary directory

  • identity – minimum identity threshold between sequences and gene families for the alignment

  • coverage – minimum coverage threshold between sequences and gene families for the alignment

  • use_representatives – Use representative sequences of families rather than all sequences to align input genes

  • no_defrag – do not use the defragmentation workflow if true

  • cpu – Number of CPU cores to use.

  • disable_bar – If True, disable the progress bar.

  • translation_table – The translation table to use when the input sequences are nucleotide sequences.

  • keep_tmp – If True, keep temporary files.

Returns:

The set of gene families of interest and a dict linking each gene family to its sequence IDs.

Check pangenome status and information to search context

Parameters:
  • pangenome – The pangenome object

  • sequences – True if contexts are searched using sequences.

ppanggolin.context.searchGeneContext.compute_edge_metrics(context_graph: Graph, gene_proportion_cutoff: float) → None

Compute various metrics on the edges of the context graph.

Parameters:
  • context_graph – The context graph.

  • gene_proportion_cutoff – The minimum proportion of shared genes between two features for their edge to be considered significant.

ppanggolin.context.searchGeneContext.compute_gene_context_graph(families: Iterable[GeneFamily], transitive: int = 4, window_size: int = 0, disable_bar: bool = False) → Tuple[Graph, Dict[FrozenSet[GeneFamily], Set[Organism]]]

Construct the graph of gene contexts between families of the pangenome.

Parameters:
  • families – An iterable of gene families.

  • transitive – Size of the transitive closure used to build the graph.

  • window_size – Size of the window for extracting gene contexts (default: 0).

  • disable_bar – Flag to disable the progress bar (default: False).

Returns:

The constructed gene context graph and the combination of gene families corresponding to the context that exist in at least one genome

ppanggolin.context.searchGeneContext.export_context_to_dataframe(gene_contexts: set, fam2seq: Dict[GeneFamily, Set[str]], families_of_interest: Set[GeneFamily], output: Path)

Export the results to a DataFrame.

Parameters:
  • gene_contexts – connected components found in the pangenome

  • fam2seq – Dictionary with gene families as keys and set of sequence ids as values

  • families_of_interest – families of interest that are at the origin of the context.

  • output – output path

ppanggolin.context.searchGeneContext.fam_to_seq(seq_to_pan: dict) → dict

Create a dictionary with gene families as keys and lists of sequence IDs as values.

Parameters:

seq_to_pan – Dictionary storing the sequence ids as keys and the gene families to which they are assigned as values

Returns:

The reversed dictionary.
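As an illustration of the reversal, here is a minimal sketch. It assumes each sequence ID maps to a single family and collects the IDs in sets (as the fam2seq type elsewhere in this module suggests); plain strings stand in for GeneFamily objects, and reverse_mapping is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, Hashable, Set


def reverse_mapping(seq_to_fam: Dict[str, Hashable]) -> Dict[Hashable, Set[str]]:
    """Group sequence IDs by the family they were assigned to (hypothetical helper)."""
    fam_to_seqs = defaultdict(set)
    for seq_id, family in seq_to_fam.items():
        fam_to_seqs[family].add(seq_id)
    return dict(fam_to_seqs)


# Strings stand in for GeneFamily objects in this toy example.
seq_to_fam = {"seq1": "famA", "seq2": "famA", "seq3": "famB"}
fam_to_seqs = reverse_mapping(seq_to_fam)
```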

ppanggolin.context.searchGeneContext.get_contig_to_genes(gene_families: Iterable[GeneFamily]) → Dict[Contig, Set[Gene]]

Group genes from specified gene families by contig.

Parameters:

gene_families – An iterable of gene family objects.

Returns:

A dictionary mapping contigs to sets of genes.
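The grouping can be sketched as follows. ToyGene is a hypothetical stand-in for the real Gene class, a family is modelled as a plain iterable of genes, and contigs are represented by their names; none of these simplifications reflect the actual PPanGGOLiN classes.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set


class ToyGene(NamedTuple):
    """Hypothetical stand-in for a gene: a name and the contig it lies on."""
    name: str
    contig: str


def group_genes_by_contig(families: Iterable[Iterable[ToyGene]]) -> Dict[str, Set[ToyGene]]:
    """Collect the genes of all given families, keyed by contig."""
    contig_to_genes = defaultdict(set)
    for family in families:
        for gene in family:
            contig_to_genes[gene.contig].add(gene)
    return dict(contig_to_genes)


g1 = ToyGene("g1", "contig1")
g2 = ToyGene("g2", "contig1")
g3 = ToyGene("g3", "contig2")
grouped = group_genes_by_contig([[g1, g3], [g2]])
```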

ppanggolin.context.searchGeneContext.get_gene_contexts(context_graph: Graph, families_of_interest: Set[GeneFamily]) → Set[GeneContext]

Extract gene contexts from a context graph based on the provided set of gene families of interest.

Gene contexts are extracted from a context graph by identifying connected components. The function filters the connected components based on the following criteria:
  • Remove singleton families (components with only one gene family).

  • Remove components that do not contain any gene families of interest.

For each remaining connected component, a GeneContext object is created.

Parameters:
  • context_graph – The context graph from which to extract gene contexts.

  • families_of_interest – Set of gene families of interest.

Returns:

Set of GeneContext objects representing the extracted gene contexts.
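The two filtering criteria can be illustrated with a self-contained sketch. To avoid external dependencies it represents the graph as an adjacency dictionary of family names rather than a real Graph of GeneFamily objects, and it returns plain sets instead of GeneContext objects; both function names are hypothetical.

```python
from typing import Dict, List, Set


def connected_components(adjacency: Dict[str, Set[str]]) -> List[Set[str]]:
    """Return the connected components of an undirected graph (depth-first search)."""
    seen: Set[str] = set()
    components: List[Set[str]] = []
    for start in adjacency:
        if start in seen:
            continue
        component: Set[str] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


def filter_components(adjacency: Dict[str, Set[str]], families_of_interest: Set[str]) -> List[Set[str]]:
    """Keep components with more than one family and at least one family of interest."""
    return [
        component
        for component in connected_components(adjacency)
        if len(component) > 1 and component & families_of_interest
    ]


adjacency = {
    "famA": {"famB"},
    "famB": {"famA", "famC"},
    "famC": {"famB"},
    "famD": {"famE"},  # component without any family of interest
    "famE": {"famD"},
    "famF": set(),     # singleton, removed even though it is of interest
}
contexts = filter_components(adjacency, {"famA", "famF"})
```

In this toy graph only the component containing famA, famB and famC survives both filters.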

ppanggolin.context.searchGeneContext.get_n_next_genes_index(current_index: int, next_genes_count: int, contig_size: int, is_circular: bool = False) → Iterator[int]

Generate the indices of the next genes based on the current index and contig properties.

Parameters:
  • current_index – The index of the current gene.

  • next_genes_count – The number of next genes to consider.

  • contig_size – The total number of genes in the contig.

  • is_circular – Flag indicating whether the contig is circular (default: False).

Returns:

An iterator yielding the indices of the next genes.

Raises:

IndexError – If the current index is out of range for the given contig size.
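One way to picture the documented behaviour is the sketch below. It is not the PPanGGOLiN implementation: the name next_gene_indices is hypothetical, and the choice to cap the count on circular contigs so that the current gene is never yielded again is an assumption.

```python
from typing import Iterator


def next_gene_indices(current_index: int, next_genes_count: int,
                      contig_size: int, is_circular: bool = False) -> Iterator[int]:
    """Yield the indices of the genes following current_index (illustrative sketch)."""
    if not 0 <= current_index < contig_size:
        raise IndexError(
            f"current_index {current_index} is out of range for a contig of {contig_size} genes"
        )
    if is_circular:
        # Assumption: never wrap all the way around to the current gene.
        next_genes_count = min(next_genes_count, contig_size - 1)
    for offset in range(1, next_genes_count + 1):
        next_index = current_index + offset
        if is_circular:
            yield next_index % contig_size  # wrap around the contig end
        elif next_index < contig_size:
            yield next_index
        else:
            break  # a linear contig has no genes past its last index


linear = list(next_gene_indices(3, 2, 5))                      # stops at the contig end
circular = list(next_gene_indices(3, 3, 5, is_circular=True))  # wraps to the start
```

Because this sketch is a generator, the IndexError is raised when iteration starts rather than when the function is called.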

ppanggolin.context.searchGeneContext.increment_attribute_counter(edge_dict: dict, key: Hashable)

Increment the counter for an edge/node attribute in the edge/node dictionary.

Parameters:
  • edge_dict – The dictionary containing the attributes.

  • key – The key of the attribute.
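A counter of this kind can be sketched in a few lines. The sketch assumes that a missing key starts at zero, which the description does not state; increment_counter is a hypothetical name.

```python
from typing import Hashable


def increment_counter(attr_dict: dict, key: Hashable) -> None:
    """Increase the count stored under key by one, starting from zero if absent."""
    attr_dict[key] = attr_dict.get(key, 0) + 1


node_attrs = {}
increment_counter(node_attrs, "genes_count")
increment_counter(node_attrs, "genes_count")
```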

ppanggolin.context.searchGeneContext.launch(args: Namespace)

Command launcher

Parameters:

args – All arguments provided by the user.

ppanggolin.context.searchGeneContext.make_graph_writable(context_graph)

The original context graph contains ppanggolin objects as nodes and lists and dictionaries in edge attributes. Since these objects cannot be written to the output graph, this function creates a new graph that contains only writable objects.

Parameters:

context_graph – The gene context graph to convert.
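The kind of conversion involved can be sketched for a single attribute dictionary. GraphML attributes must be scalar values, so collections and arbitrary objects need a string form. The rules below (joining collections with ";", calling str() on other objects) are assumptions chosen for illustration, and to_writable_attributes is a hypothetical name.

```python
def to_writable_attributes(attributes: dict) -> dict:
    """Convert attribute values to scalars that graph file formats can store."""
    writable = {}
    for key, value in attributes.items():
        if isinstance(value, (str, int, float, bool)):
            writable[key] = value  # already writable
        elif isinstance(value, (set, frozenset, list, tuple)):
            # Assumption: collections are flattened to a sorted, ';'-separated string.
            writable[key] = ";".join(sorted(str(item) for item in value))
        else:
            writable[key] = str(value)  # fall back to the object's string form
    return writable


converted = to_writable_attributes({"genomes": {"g2", "g1"}, "weight": 0.5, "name": "famA"})
```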

ppanggolin.context.searchGeneContext.parser_context(parser: ArgumentParser)

Parser for the arguments specific to the context command.

Parameters:

parser – Parser for the context command arguments.

ppanggolin.context.searchGeneContext.search_gene_context_in_pangenome(pangenome: Pangenome, output: Path, sequence_file: Path | None = None, families: Path | None = None, transitive: int = 4, jaccard_threshold: float = 0.85, window_size: int = 1, graph_format: str = 'graphml', disable_bar=True, **kwargs)

Main function to search common gene contexts between sequence set and pangenome families

Parameters:
  • pangenome – Pangenome containing GeneFamilies to align with sequence set

  • sequence_file – Path to file containing the sequences

  • families – Path to a file containing family names.

  • output – Path to output directory

  • transitive – number of genes to check on both sides of a family aligned with an input sequence

  • jaccard_threshold – Jaccard index threshold to filter edges in graph

  • window_size – Number of genes to consider in the gene context.

  • graph_format – Output format of the context graph. Can be graphml or gexf.

  • disable_bar – If True, disable the progress bar.
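For readers unfamiliar with the Jaccard index used by jaccard_threshold, the sketch below shows the standard definition (size of the intersection divided by size of the union) and a threshold test. Which sets PPanGGOLiN actually compares for each edge is not specified here, so the toy sets and both function names are assumptions.

```python
def jaccard_index(first: set, second: set) -> float:
    """Return |first ∩ second| / |first ∪ second|, or 0.0 when both sets are empty."""
    union = first | second
    return len(first & second) / len(union) if union else 0.0


def passes_threshold(first: set, second: set, jaccard_threshold: float = 0.85) -> bool:
    """Decide whether an edge would be kept under the given threshold."""
    return jaccard_index(first, second) >= jaccard_threshold


# Two of four distinct elements are shared, so the index is 0.5.
score = jaccard_index({"g1", "g2", "g3"}, {"g2", "g3", "g4"})
kept = passes_threshold({"g1", "g2", "g3"}, {"g2", "g3", "g4"})
```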

ppanggolin.context.searchGeneContext.subparser(sub_parser: _SubParsersAction) → ArgumentParser

Subparser to launch PPanGGOLiN from the command line.

Parameters:

sub_parser – Subparser for the context command.

Returns:

Parser with the arguments for the context command.

ppanggolin.context.searchGeneContext.write_graph(graph: Graph, output_dir: Path, graph_format: str)

Write a graph to a file in GraphML and/or GEXF format.

Parameters:
  • graph – Graph to write

  • output_dir – The output directory where the graph file will be written.

  • graph_format – Formats of the output graph. Can be graphml or gexf

Module contents