ppanggolin.projection package

Submodules

ppanggolin.projection.projection module

class ppanggolin.projection.projection.NewSpot(spot_id: int)

Bases: Spot

This class represents a hotspot specifically created for the projected genome.

ppanggolin.projection.projection.annotate_fasta_files(genome_name_to_fasta_path: Dict[str, dict], tmpdir: str, cpu: int = 1, translation_table: int = 11, kingdom: str = 'bacteria', norna: bool = False, allow_overlap: bool = False, procedure: str | None = None, disable_bar: bool = False)

Main function to annotate a pangenome

Parameters:
  • genome_name_to_fasta_path – Dictionary mapping each genome name to information about its FASTA file, whose sequences will form the basis of the pangenome

  • tmpdir – Path to temporary directory

  • cpu – number of CPU cores to use

  • translation_table – Translation table (genetic code) to use.

  • kingdom – Kingdom to which the prokaryote belongs, used to select the models for rRNA annotation.

  • norna – If True, do not annotate RNA features.

  • allow_overlap – If True, keep genes that overlap with RNA features.

  • procedure – Prodigal procedure to use.

  • disable_bar – Disable the progress bar.

ppanggolin.projection.projection.annotate_input_genes_with_pangenome_families(pangenome: Pangenome, input_organisms: Iterable[Organism], output: Path, cpu: int, use_representatives: bool, no_defrag: bool, identity: float, coverage: float, tmpdir: Path, translation_table: int, keep_tmp: bool = False, disable_bar: bool = False)

Annotate input genes with pangenome gene families by associating them to a cluster.

Parameters:
  • pangenome – Pangenome object.

  • input_organisms – Iterable of input organism objects.

  • output – Output directory for generated files.

  • cpu – Number of CPU cores to use.

  • no_defrag – If True, do not use defragmentation.

  • use_representatives – Use representative sequences of gene families rather than all sequences when aligning input genes.

  • identity – Minimum identity threshold for gene clustering.

  • coverage – Minimum coverage threshold for gene clustering.

  • tmpdir – Temporary directory for intermediate files.

  • translation_table – Translation table ID for nucleotide sequences.

  • keep_tmp – If True, keep temporary files.

  • disable_bar – Whether to disable progress bar.

Returns:

Number of genes that do not cluster with any of the gene families of the pangenome.
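
The role of the identity and coverage thresholds, and of the returned count, can be illustrated with a small self-contained sketch. It is not PPanGGOLiN's implementation: the `Hit` type, the `assign_families` helper, and the best-hit rule are assumptions made for illustration only.

```python
from typing import Dict, List, NamedTuple, Tuple


class Hit(NamedTuple):
    """Hypothetical alignment hit of an input gene against a gene family."""

    family: str
    identity: float  # fraction of identical positions, between 0 and 1
    coverage: float  # fraction of the sequence covered, between 0 and 1


def assign_families(
    hits_by_gene: Dict[str, List[Hit]], identity: float, coverage: float
) -> Tuple[Dict[str, str], int]:
    """Assign each gene to a family and count the genes left unassigned.

    A hit is kept only if it reaches both thresholds. Picking the hit with
    the highest identity (then coverage) is an assumption of this sketch.
    """
    assigned: Dict[str, str] = {}
    unassigned = 0
    for gene, hits in hits_by_gene.items():
        passing = [h for h in hits if h.identity >= identity and h.coverage >= coverage]
        if passing:
            best = max(passing, key=lambda h: (h.identity, h.coverage))
            assigned[gene] = best.family
        else:
            unassigned += 1
    return assigned, unassigned


hits = {
    "gene_1": [Hit("fam_A", 0.95, 0.90), Hit("fam_B", 0.85, 0.99)],
    "gene_2": [Hit("fam_A", 0.50, 0.90)],  # identity below the threshold
    "gene_3": [],  # no hit at all
}
assigned, lonely_count = assign_families(hits, identity=0.8, coverage=0.8)
```

Here `lonely_count` plays the role of the documented return value: the number of input genes that did not cluster with any pangenome family.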

ppanggolin.projection.projection.check_input_names(pangenome, input_names)

Check if input organism names already exist in the pangenome.

Parameters:
  • pangenome – The pangenome object.

  • input_names – List of input organism names to check.

Raises:

NameError – If duplicate organism names are found in the pangenome.
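
The check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's code: the helper name `check_names_are_new` is hypothetical, and a plain set of names stands in for the Pangenome object.

```python
from typing import Iterable, Set


def check_names_are_new(existing_names: Set[str], input_names: Iterable[str]) -> None:
    """Raise NameError if any input name is already used in the pangenome.

    `existing_names` is a hypothetical stand-in for the organism names
    stored in a Pangenome object.
    """
    duplicates = sorted(set(input_names) & existing_names)
    if duplicates:
        raise NameError(
            f"{len(duplicates)} input organism name(s) already exist in the pangenome: "
            + ", ".join(duplicates)
        )


pangenome_names = {"genome_A", "genome_B"}

# New names pass silently.
check_names_are_new(pangenome_names, ["genome_C"])

# A name already present in the pangenome triggers the error.
try:
    check_names_are_new(pangenome_names, ["genome_A", "genome_D"])
except NameError as err:
    message = str(err)
```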

ppanggolin.projection.projection.check_pangenome_for_projection(pangenome: Pangenome, fast_aln: bool)

Check the status of a pangenome and determine whether projection is possible.

Parameters:
  • pangenome – The pangenome to be checked.

  • fast_aln – Whether to use the fast alignment option for gene projection.

This function checks various attributes of a pangenome to determine whether it is suitable for projecting features into a provided genome.

Returns:

A tuple indicating whether RGP prediction, spot projection, and module projection are possible (True) or not (False) based on the pangenome’s status.

Raises:

  • NameError – If the pangenome has not been partitioned.

  • Exception – If the pangenome lacks gene sequences or gene family sequences, and fast alignment is not enabled.

ppanggolin.projection.projection.check_projection_arguments(args: Namespace, parser: ArgumentParser) str

Check the arguments provided for genome projection and raise errors if they are incompatible or missing.

Parameters:
  • args – An argparse.Namespace object containing parsed command-line arguments.

  • parser – The argument parser of the command.

Returns:

A string indicating the input mode (‘single’ or ‘multiple’).

ppanggolin.projection.projection.check_spots_congruency(graph_spot: Graph, spots: List[Spot]) None

Check congruency of spots in the spot graph with the original spots.

Parameters:
  • graph_spot – The spot graph containing the connected components representing the spots.

  • spots – List of original spots in the pangenome.

Returns:

None.

ppanggolin.projection.projection.get_gene_sequences_from_fasta_files(organisms, genome_name_to_annot_path)

Get gene sequences from FASTA files.

Parameters:
  • organisms – Input organisms.

  • genome_name_to_annot_path – Mapping from genome names to the paths of their FASTA files.

ppanggolin.projection.projection.infer_input_mode(input_file: Path, expected_types: List[str], parser: ArgumentParser) str

Determine the input mode based on the provided input file and expected file types.

Parameters:
  • input_file – A Path object representing the input file.

  • expected_types – A list of expected file types (e.g., [‘fasta’, ‘gff’, ‘gbff’, ‘tsv’]).

  • parser – The argument parser of the command.

Returns:

A string indicating the input mode (‘single’ or ‘multiple’).
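
One way such a mode could be inferred is sketched below. The rule used here is an assumption, not something this page states: a TSV file is taken to list several genomes (‘multiple’), while any other expected type describes one genome (‘single’). The helper name, the suffix table, and the use of ValueError in place of the parser are all hypothetical.

```python
from pathlib import Path
from typing import List

# Assumed mapping from file suffix to file type. The real function may
# inspect the file content rather than its name.
SUFFIX_TO_TYPE = {
    ".fasta": "fasta",
    ".fna": "fasta",
    ".gff": "gff",
    ".gbff": "gbff",
    ".tsv": "tsv",
}


def infer_mode_from_suffix(input_file: Path, expected_types: List[str]) -> str:
    """Return 'multiple' for a TSV list of genomes and 'single' otherwise."""
    filetype = SUFFIX_TO_TYPE.get(input_file.suffix.lower())
    if filetype is None or filetype not in expected_types:
        raise ValueError(
            f"Unexpected file type for {input_file}; expected one of {expected_types}"
        )
    # Assumption of this sketch: a TSV file lists one genome per line.
    return "multiple" if filetype == "tsv" else "single"


mode_for_list = infer_mode_from_suffix(Path("genomes.tsv"), ["fasta", "tsv"])
mode_for_fasta = infer_mode_from_suffix(Path("genome.fna"), ["fasta", "tsv"])
```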

ppanggolin.projection.projection.launch(args: Namespace)

Command launcher

Parameters:

args – All arguments provided by the user

ppanggolin.projection.projection.manage_annotate_param(annotate_param_names: List[str], pangenome_args: Namespace, config_file: str | None) Namespace

Manage annotate parameters by collecting them from different sources and merging them.

Parameters:
  • annotate_param_names – List of annotate parameter names to be managed.

  • pangenome_args – Annotate arguments parsed from the pangenome's parameters.

  • config_file – Path to the config file, can be None if not provided.

Returns:

An argparse.Namespace containing the merged annotate parameters with their values.
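
Merging parameters from several sources into one Namespace might look like the sketch below. The precedence order (pangenome parameters first, then the config file, then defaults) is an assumption, as are the helper name and the use of plain dictionaries in place of a parsed config file.

```python
from argparse import Namespace
from typing import Any, Dict, List


def merge_annotate_params(
    param_names: List[str],
    pangenome_args: Namespace,
    config_values: Dict[str, Any],
    defaults: Dict[str, Any],
) -> Namespace:
    """Collect each parameter from the first source that defines it.

    Assumed precedence: pangenome parameters, then config values, then
    defaults. `config_values` stands in for an already parsed config file.
    """
    merged = Namespace()
    for name in param_names:
        if hasattr(pangenome_args, name):
            value = getattr(pangenome_args, name)
        elif name in config_values:
            value = config_values[name]
        else:
            value = defaults.get(name)
        setattr(merged, name, value)
    return merged


merged = merge_annotate_params(
    ["kingdom", "norna", "translation_table"],
    pangenome_args=Namespace(kingdom="bacteria"),
    config_values={"kingdom": "archaea", "norna": True},
    defaults={"norna": False, "translation_table": 11},
)
```

In this example `kingdom` comes from the pangenome parameters, `norna` from the config values, and `translation_table` from the defaults.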

ppanggolin.projection.projection.manage_input_genomes_annotation(pangenome, input_mode: str, anno: str, fasta: str, organism_name: str, circular_contigs: list, pangenome_params, cpu: int, use_pseudo: bool, disable_bar: bool, tmpdir: str, config: dict, translation_table: int)

Manage the input genomes annotation based on the provided mode and parameters.

Parameters:
  • pangenome – The pangenome object.

  • input_mode – The input mode, either ‘multiple’ or ‘single’.

  • anno – The annotation file path or None.

  • fasta – The FASTA file path or None.

  • organism_name – The name of the organism.

  • circular_contigs – List of circular contigs.

  • pangenome_params – Parameters for pangenome processing.

  • cpu – Number of CPUs to use.

  • use_pseudo – Flag to use pseudo annotation.

  • disable_bar – Flag to disable progress bar.

  • tmpdir – Temporary directory path.

  • config – Configuration dictionary.

  • translation_table – Translation table (genetic code) to use.

Returns:

A tuple of organisms, genome_name_to_path, and input_type.

ppanggolin.projection.projection.parser_projection(parser: ArgumentParser)

Parser for the arguments specific to the projection command

Parameters:

parser – Parser for the projection arguments

ppanggolin.projection.projection.predict_RGP(pangenome: Pangenome, input_organisms: List[Organism], persistent_penalty: int, variable_gain: int, min_length: int, min_score: int, multigenics: Set[GeneFamily], output_dir: Path, disable_bar: bool, compress: bool) Dict[Organism, Set[Region]]

Compute Regions of Genomic Plasticity (RGP) for the given input organisms.

Parameters:
  • pangenome – The pangenome object.

  • input_organisms – List of the input organisms for which to compute RGPs.

  • persistent_penalty – Penalty score to apply to persistent genes.

  • variable_gain – Gain score to apply to variable genes.

  • min_length – Minimum length (bp) of a region to be considered as RGP.

  • min_score – Minimal score required for considering a region as RGP.

  • multigenics – Set of multigenic families.

  • output_dir – Output directory where the predicted RGPs will be written.

  • disable_bar – Flag to disable the progress bar.

  • compress – Flag to compress the RGP table with gzip.

Returns:

Dictionary mapping each organism to its set of predicted regions.

ppanggolin.projection.projection.predict_spot_in_one_organism(graph_spot: Graph, input_org_rgps: List[Region], original_nodes: Set[int], new_spot_id_counter: int, multigenics: Set[GeneFamily], organism_name: str, output: Path, write_graph_flag: bool = False, graph_formats: List[str] = ['gexf'], overlapping_match: int = 2, set_size: int = 3, exact_match: int = 1, compress: bool = False) Set[Spot]

Predict spots for input organism RGPs.

Parameters:
  • graph_spot – The spot graph from the pangenome.

  • input_org_rgps – List of RGPs from the input organism to be associated with spots.

  • original_nodes – Set of original nodes in the spot graph.

  • new_spot_id_counter – Counter for new spot IDs.

  • multigenics – Set of pangenome graph multigenic persistent families.

  • organism_name – Name of the input organism.

  • output – Output directory to save the spot graph.

  • write_graph_flag – If True, writes the spot graph in the specified formats. Default is False.

  • graph_formats – List of graph formats to write (default is [‘gexf’]).

  • overlapping_match – Number of missing persistent genes allowed when comparing flanking genes. Default is 2.

  • set_size – Number of single copy markers to use as flanking genes for RGP during hotspot computation. Default is 3.

  • exact_match – Number of perfectly matching flanking single copy markers required to associate RGPs. Default is 1.

  • compress – Flag to compress output files

Returns:

Set[Spot]: The predicted spots for the input organism RGPs.

ppanggolin.projection.projection.predict_spots_in_input_organisms(initial_spots: List[Spot], initial_regions: List[Region], input_org_2_rgps: Dict[Organism, Set[Region]], multigenics: Set[GeneFamily], output: Path, write_graph_flag: bool = False, graph_formats: List[str] = ['gexf'], overlapping_match: int = 2, set_size: int = 3, exact_match: int = 1, compress: bool = False) Dict[Organism, Set[Spot]]

Create a spot graph from pangenome RGP and predict spots for input organism RGPs.

Parameters:
  • initial_spots – List of original spots in the pangenome.

  • initial_regions – List of original regions in the pangenome.

  • input_org_2_rgps – Dictionary mapping input organisms to their RGPs.

  • multigenics – Set of pangenome graph multigenic persistent families.

  • output – Output directory to save the spot graph.

  • write_graph_flag – If True, writes the spot graph in the specified formats. Default is False.

  • graph_formats – List of graph formats to write (default is [‘gexf’]).

  • overlapping_match – Number of missing persistent genes allowed when comparing flanking genes. Default is 2.

  • set_size – Number of single copy markers to use as flanking genes for RGP during hotspot computation. Default is 3.

  • exact_match – Number of perfectly matching flanking single copy markers required to associate RGPs. Default is 1.

  • compress – Flag to compress output files

Returns:

A dictionary mapping input organisms to their predicted spots.

ppanggolin.projection.projection.project_and_write_modules(pangenome: Pangenome, input_organisms: Iterable[Organism], output: Path, compress: bool = False)

Write a TSV file associating modules with the input organisms

Parameters:
  • pangenome – Pangenome object

  • input_organisms – Iterable of the organisms being annotated

  • output – Path to output directory

  • compress – Compress the file with gzip (.gz)

ppanggolin.projection.projection.read_annotation_files(genome_name_to_annot_path: Dict[str, dict], cpu: int = 1, pseudo: bool = False, disable_bar: bool = False) Tuple[List[Organism], Dict[Organism, bool]]

Read the annotations from GBFF files

Parameters:
  • genome_name_to_annot_path – Dictionary mapping each genome name to information about its annotation (GBFF) file

  • cpu – number of CPU cores to use

  • pseudo – Allow reading pseudogenes

  • disable_bar – Disable the progress bar

ppanggolin.projection.projection.retrieve_gene_sequences_from_fasta_file(input_organism, fasta_file)

Get gene sequences from a FASTA file

Parameters:
  • input_organism – Input organism

  • fasta_file – Path to the FASTA file

ppanggolin.projection.projection.subparser(sub_parser: _SubParsersAction) ArgumentParser

Subparser to launch PPanGGOLiN from the command line

Parameters:

sub_parser – Sub-parser for the projection command

Returns:

Parser arguments for the projection command

ppanggolin.projection.projection.summarize_projected_genome(organism: Organism, pangenome_persistent_count: int, pangenome_persistent_single_copy_families: Set[GeneFamily], soft_core_families: Set[GeneFamily], exact_core_families: Set[GeneFamily], input_org_rgps: List[Region], input_org_spots: List[Spot], input_org_modules: List[Module], pangenome_file: str, singleton_gene_count: int) Dict[str, any]

Summarizes the projected genome and generates an organism summary.

Parameters:
  • organism – The Organism object for which the summary is generated.

  • pangenome_persistent_count – The count of persistent genes in the pangenome.

  • pangenome_persistent_single_copy_families – Set of single-copy persistent gene families.

  • soft_core_families – Set of soft core families in the pangenome.

  • exact_core_families – Set of exact core families in the pangenome.

  • input_org_rgps – List of Region objects for the input organism.

  • input_org_spots – List of Spot objects for the input organism.

  • input_org_modules – List of Module objects for the input organism.

  • pangenome_file – Filepath to the pangenome file.

  • singleton_gene_count – Count of singleton genes in the organism.

Returns:

A dictionary containing summarized information about the organism.

ppanggolin.projection.projection.write_projection_results(pangenome: Pangenome, organisms: Set[Organism], input_org_2_rgps: Dict[Organism, Set[Region]], input_org_to_spots: Dict[Organism, Set[Spot]], input_orgs_to_modules: Dict[Organism, Set[Module]], input_org_to_lonely_genes_count: Dict[Organism, int], write_proksee: bool, write_gff: bool, write_table: bool, add_sequences: bool, genome_name_to_path: Dict[str, dict], input_type: str, output_dir: Path, dup_margin: float, soft_core: float, metadata_sep: str, compress: bool, need_regions: bool, need_spots: bool, need_modules: bool)

Write the results of the projection of the pangenome onto input genomes.

Parameters:
  • pangenome – The pangenome onto which the projection is performed.

  • organisms – A set of input organisms for projection.

  • input_org_2_rgps – A dictionary mapping input organisms to sets of regions of genomic plasticity (RGPs).

  • input_org_to_spots – A dictionary mapping input organisms to sets of spots.

  • input_orgs_to_modules – A dictionary mapping input organisms to sets of modules.

  • input_org_to_lonely_genes_count – A dictionary mapping input organisms to the count of lonely genes.

  • write_proksee – Whether to write ProkSee JSON files.

  • write_gff – Whether to write GFF files.

  • add_sequences – Whether to add sequences to the output files.

  • genome_name_to_path – A dictionary mapping genome names to file paths.

  • input_type – The type of input data (e.g., “annotation”).

  • output_dir – The directory where the output files will be written.

  • dup_margin – The duplication margin used to compute completeness.

  • soft_core – Soft core threshold

  • write_table – Whether to write table files.

Note:

  • If write_proksee is True and input organisms have modules, module colors for ProkSee are obtained.

  • The function calls other functions such as summarize_projection, read_genome_file, write_proksee_organism, write_gff_file, and write_summaries to generate various output files and summaries.

ppanggolin.projection.projection.write_rgp_to_spot_table(rgp_to_spots: Dict[Region, Set[str]], output: Path, filename: str, compress: bool = False)

Write a table mapping RGPs to corresponding spot IDs.

Parameters:
  • rgp_to_spots – A dictionary mapping RGPs to spot IDs.

  • output – Path to the output directory.

  • filename – Name of the file to write.

  • compress – Whether to compress the file.
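
Writing such a table with optional gzip compression can be sketched with the standard library alone. The column names, the `.gz` suffix handling, the helper name, and the use of plain strings in place of Region objects are assumptions for illustration, not the library's actual output format.

```python
import csv
import gzip
import tempfile
from pathlib import Path
from typing import Dict, Set


def write_rgp_spot_table(
    rgp_to_spots: Dict[str, Set[str]], output: Path, filename: str, compress: bool = False
) -> Path:
    """Write one tab-separated line per (RGP, spot) pair and return the file path."""
    output.mkdir(parents=True, exist_ok=True)
    path = output / (filename + ".gz" if compress else filename)
    # gzip.open and open both accept text mode, so one code path serves both cases.
    opener = gzip.open if compress else open
    with opener(path, "wt", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["region", "spot_id"])  # assumed column names
        for rgp in sorted(rgp_to_spots):
            for spot in sorted(rgp_to_spots[rgp]):
                writer.writerow([rgp, spot])
    return path


with tempfile.TemporaryDirectory() as tmp:
    table = write_rgp_spot_table(
        {"rgp_1": {"spot_2", "spot_1"}}, Path(tmp), "rgp_to_spot.tsv", compress=True
    )
    table_name = table.name
    with gzip.open(table, "rt") as handle:
        lines = handle.read().splitlines()
```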

ppanggolin.projection.projection.write_summary_in_yaml(summary_info: Dict[str, Any], output_file: Path)

Write summary information to a YAML file.

This function takes a dictionary containing summary information. It writes this information to a YAML file.

Parameters:
  • summary_info – A dictionary containing summary information.

  • output_file – The file where the summary will be written.

Module contents