ppanggolin.projection package
Submodules
ppanggolin.projection.projection module
- class ppanggolin.projection.projection.NewSpot(spot_id: int)
Bases: Spot
This class represents a hotspot specifically created for the projected genome.
- ppanggolin.projection.projection.annotate_fasta_files(genome_name_to_fasta_path: Dict[str, dict], tmpdir: str, cpu: int = 1, translation_table: int = 11, kingdom: str = 'bacteria', norna: bool = False, allow_overlap: bool = False, procedure: str | None = None, disable_bar: bool = False)
Main function to annotate a pangenome
- Parameters:
genome_name_to_fasta_path – Dictionary mapping each genome name to a dictionary describing its FASTA file
fasta_list – List of FASTA files containing the sequences that will form the basis of the pangenome
tmpdir – Path to temporary directory
cpu – number of CPU cores to use
translation_table – Translation table (genetic code) to use.
kingdom – Kingdom to which the prokaryote belongs, used to select the models for rRNA annotation.
norna – Use to avoid annotating RNA features.
allow_overlap – Use to keep genes that overlap with RNA features.
procedure – Prodigal procedure to use.
disable_bar – Disable the progress bar.
- ppanggolin.projection.projection.annotate_input_genes_with_pangenome_families(pangenome: Pangenome, input_organisms: Iterable[Organism], output: Path, cpu: int, use_representatives: bool, no_defrag: bool, identity: float, coverage: float, tmpdir: Path, translation_table: int, keep_tmp: bool = False, disable_bar: bool = False)
Annotate input genes with pangenome gene families by associating them to a cluster.
- Parameters:
pangenome – Pangenome object.
input_organisms – Iterable of input organism objects.
output – Output directory for generated files.
cpu – Number of CPU cores to use.
no_defrag – If True, do not use defragmentation.
use_representatives – Use the representative sequences of gene families rather than all sequences to align input genes.
identity – Minimum identity threshold for gene clustering.
coverage – Minimum coverage threshold for gene clustering.
tmpdir – Temporary directory for intermediate files.
translation_table – Translation table ID for nucleotide sequences.
keep_tmp – If True, keep temporary files.
disable_bar – Whether to disable progress bar.
- Returns:
Number of genes that do not cluster with any of the gene families of the pangenome.
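The relationship between the identity and coverage thresholds and the returned count can be illustrated with a small, self-contained sketch. The `Hit` tuple, the `assign_families` helper, and the rule of keeping the highest-identity hit per gene are assumptions made for illustration; they are not taken from PPanGGOLiN's implementation.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class Hit(NamedTuple):
    """Hypothetical alignment hit between an input gene and a gene family."""
    gene: str
    family: str
    identity: float  # fraction in [0, 1]
    coverage: float  # fraction in [0, 1]


def assign_families(genes: Iterable[str], hits: Iterable[Hit],
                    identity: float, coverage: float) -> Tuple[Dict[str, str], int]:
    """Return a gene -> family mapping for hits passing both thresholds,
    plus the number of genes left without any family."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.identity < identity or hit.coverage < coverage:
            continue  # the hit is below at least one threshold
        current = best.get(hit.gene)
        # Assumption: keep the hit with the highest identity for each gene.
        if current is None or hit.identity > current.identity:
            best[hit.gene] = hit
    gene_to_family = {gene: hit.family for gene, hit in best.items()}
    unassigned = sum(1 for gene in genes if gene not in gene_to_family)
    return gene_to_family, unassigned


genes = ["g1", "g2", "g3"]
hits = [
    Hit("g1", "famA", 0.95, 0.90),
    Hit("g1", "famB", 0.85, 0.99),
    Hit("g2", "famC", 0.70, 0.95),  # fails the identity threshold
]
gene_to_family, unassigned = assign_families(genes, hits, identity=0.8, coverage=0.8)
```

In this toy input, `g1` is assigned to `famA`, while `g2` and `g3` remain unassigned and are counted in the second return value.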
- ppanggolin.projection.projection.check_input_names(pangenome, input_names)
Check if input organism names already exist in the pangenome.
- Parameters:
pangenome – The pangenome object.
input_names – List of input organism names to check.
- Raises:
NameError – If duplicate organism names are found in the pangenome.
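A minimal sketch of such a check, assuming the pangenome's organism names are available as a set of strings. The `check_names` helper and the error message wording are hypothetical; only the use of `NameError` comes from the description above.

```python
from typing import Iterable, Set


def find_duplicate_names(existing_names: Set[str], input_names: Iterable[str]) -> Set[str]:
    """Return the input names that are already present in existing_names."""
    return existing_names & set(input_names)


def check_names(existing_names: Set[str], input_names: Iterable[str]) -> None:
    """Raise NameError if any input name is already used (hypothetical helper)."""
    duplicated = find_duplicate_names(existing_names, input_names)
    if duplicated:
        # Sort the names so the error message is deterministic.
        raise NameError(
            "Input names already exist in the pangenome: "
            + ", ".join(sorted(duplicated))
        )
```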
- ppanggolin.projection.projection.check_pangenome_for_projection(pangenome: Pangenome, fast_aln: bool)
Check the status of a pangenome and determine whether projection is possible.
- Parameters:
pangenome – The pangenome to be checked.
fast_aln – Whether to use the fast alignment option for gene projection.
This function checks various attributes of a pangenome to determine whether it is suitable for projecting features into a provided genome.
- Returns:
A tuple indicating whether RGP prediction, spot projection, and module projection are possible (True) or not (False) based on the pangenome’s status.
- Raises:
NameError – If the pangenome has not been partitioned.
Exception – If the pangenome lacks gene sequences or gene family sequences, and fast alignment is not enabled.
- ppanggolin.projection.projection.check_projection_arguments(args: Namespace, parser: ArgumentParser) → str
Check the arguments provided for genome projection and raise errors if they are incompatible or missing.
- Parameters:
args – An argparse.Namespace object containing parsed command-line arguments.
parser – The argument parser of the command.
- Returns:
A string indicating the input mode (‘single’ or ‘multiple’).
- ppanggolin.projection.projection.check_spots_congruency(graph_spot: Graph, spots: List[Spot]) → None
Check congruency of spots in the spot graph with the original spots.
- Parameters:
graph_spot – The spot graph containing the connected components representing the spots.
spots – List of original spots in the pangenome.
- Returns:
None.
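To make the idea of "connected components representing the spots" concrete, here is a small sketch that works on a plain adjacency dictionary rather than a graph library object. Treating congruency as "no connected component contains nodes from more than one original spot" is an assumption for illustration, and `node_to_spot` is a hypothetical mapping.

```python
from collections import deque
from typing import Dict, Hashable, List, Set


def connected_components(adjacency: Dict[Hashable, Set[Hashable]]) -> List[Set[Hashable]]:
    """Return the connected components of an undirected graph given as a
    symmetric adjacency dictionary (breadth-first search)."""
    seen: Set[Hashable] = set()
    components: List[Set[Hashable]] = []
    for start in adjacency:
        if start in seen:
            continue
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency.get(node, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


def components_match_spots(components: List[Set[Hashable]],
                           node_to_spot: Dict[Hashable, str]) -> bool:
    """Assumed congruency rule: no component mixes nodes of different spots."""
    for component in components:
        spot_ids = {node_to_spot[node] for node in component if node in node_to_spot}
        if len(spot_ids) > 1:
            return False
    return True


adjacency = {1: {2}, 2: {1, 3}, 3: {2}, 4: {5}, 5: {4}, 6: set()}
components = connected_components(adjacency)
```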
- ppanggolin.projection.projection.get_gene_sequences_from_fasta_files(organisms, genome_name_to_annot_path)
Get gene sequences from FASTA files.
- Parameters:
organisms – The input organisms.
fasta_file – List of FASTA files.
- ppanggolin.projection.projection.infer_input_mode(input_file: Path, expected_types: List[str], parser: ArgumentParser) → str
Determine the input mode based on the provided input file and expected file types.
- Parameters:
input_file – A Path object representing the input file.
expected_types – A list of expected file types (e.g., [‘fasta’, ‘gff’, ‘gbff’, ‘tsv’]).
- Returns:
A string indicating the input mode (‘single’ or ‘multiple’).
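One way to picture this decision is sketched below. Several points are assumptions made for illustration: the file type is guessed from the extension only, a `tsv` file is treated as a list of genomes (mode 'multiple'), and a `ValueError` is raised where the real function would presumably report the problem through the parser. The `EXTENSION_TO_TYPE` table and `guess_input_mode` are hypothetical names.

```python
from pathlib import Path
from typing import List

# Hypothetical extension-to-type table; real detection may inspect file content.
EXTENSION_TO_TYPE = {
    ".fasta": "fasta",
    ".fa": "fasta",
    ".fna": "fasta",
    ".gff": "gff",
    ".gbff": "gbff",
    ".tsv": "tsv",
}


def guess_input_mode(input_file: Path, expected_types: List[str]) -> str:
    """Return 'multiple' for a tsv list of genomes, 'single' for one genome file."""
    filetype = EXTENSION_TO_TYPE.get(input_file.suffix.lower())
    if filetype is None or filetype not in expected_types:
        raise ValueError(
            f"Unexpected file type for {input_file}; expected one of {expected_types}"
        )
    # Assumption: a tsv file lists several genomes, anything else is one genome.
    return "multiple" if filetype == "tsv" else "single"
```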
- ppanggolin.projection.projection.launch(args: Namespace)
Command launcher
- Parameters:
args – All arguments provided by the user
- ppanggolin.projection.projection.manage_annotate_param(annotate_param_names: List[str], pangenome_args: Namespace, config_file: str | None) → Namespace
Manage annotate parameters by collecting them from different sources and merging them.
- Parameters:
annotate_param_names – List of annotate parameter names to be managed.
pangenome_args – Annotate arguments parsed from the pangenome parameters.
config_file – Path to the config file, can be None if not provided.
- Returns:
An argparse.Namespace containing the merged annotate parameters with their values.
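The merging step can be sketched with plain dictionaries and `argparse.Namespace`. The precedence used here (config values override pangenome values) and the `merge_annotate_params` helper are assumptions for illustration; the function's actual precedence rules are not stated above.

```python
import argparse
from typing import Any, Dict, List, Optional


def merge_annotate_params(param_names: List[str],
                          pangenome_params: Dict[str, Any],
                          config_params: Optional[Dict[str, Any]]) -> argparse.Namespace:
    """Collect the requested parameters into a Namespace.

    Assumed precedence: config file value first, then pangenome value.
    Parameters found in neither source are left unset.
    """
    merged = argparse.Namespace()
    for name in param_names:
        if config_params and name in config_params:
            setattr(merged, name, config_params[name])
        elif name in pangenome_params:
            setattr(merged, name, pangenome_params[name])
    return merged


merged = merge_annotate_params(
    ["norna", "kingdom", "translation_table"],
    pangenome_params={"norna": False, "kingdom": "bacteria"},
    config_params={"kingdom": "archaea"},
)
```

Here `merged.kingdom` comes from the config dictionary, `merged.norna` from the pangenome parameters, and `translation_table` stays unset because neither source provides it.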
- ppanggolin.projection.projection.manage_input_genomes_annotation(pangenome, input_mode: str, anno: str, fasta: str, organism_name: str, circular_contigs: list, pangenome_params, cpu: int, use_pseudo: bool, disable_bar: bool, tmpdir: str, config: dict, translation_table: int)
Manage the input genomes annotation based on the provided mode and parameters.
- Parameters:
pangenome – The pangenome object.
input_mode – The input mode, either ‘multiple’ or ‘single’.
anno – The annotation file path or None.
fasta – The FASTA file path or None.
organism_name – The name of the organism.
circular_contigs – List of circular contigs.
pangenome_params – Parameters for pangenome processing.
cpu – Number of CPUs to use.
use_pseudo – Flag to use pseudo annotation.
disable_bar – Flag to disable progress bar.
tmpdir – Temporary directory path.
config – Configuration dictionary.
translation_table – Translation table (genetic code) to use.
- Returns:
A tuple of organisms, genome_name_to_path, and input_type.
- ppanggolin.projection.projection.parser_projection(parser: ArgumentParser)
Parser for the specific arguments of the projection command
- Parameters:
parser – Parser for the projection arguments
- ppanggolin.projection.projection.predict_RGP(pangenome: Pangenome, input_organisms: List[Organism], persistent_penalty: int, variable_gain: int, min_length: int, min_score: int, multigenics: Set[GeneFamily], output_dir: Path, disable_bar: bool, compress: bool) → Dict[Organism, Set[Region]]
Compute Regions of Genomic Plasticity (RGP) for the given input organisms.
- Parameters:
pangenome – The pangenome object.
input_organisms – List of the input organisms for which to compute RGPs.
persistent_penalty – Penalty score to apply to persistent genes.
variable_gain – Gain score to apply to variable genes.
min_length – Minimum length (bp) of a region to be considered as RGP.
min_score – Minimal score required for considering a region as RGP.
multigenics – Multigenic families.
output_dir – Output directory where the predicted RGPs will be written.
disable_bar – Flag to disable the progress bar.
compress – Flag to compress the RGP table with gzip.
- Returns:
Dictionary mapping each organism to its set of predicted regions.
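The roles of persistent_penalty, variable_gain, and min_score can be illustrated with a deliberately simplified running-score sketch along one contig. This is not PPanGGOLiN's RGP algorithm: flooring the score at zero, cutting regions wherever the score returns to zero, and ignoring min_length are all assumptions made to keep the example short.

```python
from typing import List, Tuple


def running_scores(is_persistent: List[bool], persistent_penalty: int,
                   variable_gain: int) -> List[int]:
    """Score each gene position: variable genes add variable_gain,
    persistent genes subtract persistent_penalty (never below zero)."""
    scores = []
    score = 0
    for persistent in is_persistent:
        if persistent:
            score = max(0, score - persistent_penalty)
        else:
            score += variable_gain
        scores.append(score)
    return scores


def candidate_regions(scores: List[int], min_score: int) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs of positive-score runs whose
    maximum score reaches min_score."""
    regions = []
    start = None
    for index, score in enumerate(scores):
        if score > 0 and start is None:
            start = index  # a new positive run begins
        elif score == 0 and start is not None:
            if max(scores[start:index]) >= min_score:
                regions.append((start, index - 1))
            start = None
    if start is not None and max(scores[start:]) >= min_score:
        regions.append((start, len(scores) - 1))
    return regions


# True marks a persistent gene, False a variable (shell or cloud) gene.
contig = [True, False, False, False, True, False, True]
scores = running_scores(contig, persistent_penalty=3, variable_gain=1)
regions = candidate_regions(scores, min_score=3)
```

With these toy values, the three consecutive variable genes form one candidate region, while the isolated variable gene near the end never reaches min_score.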
- ppanggolin.projection.projection.predict_spot_in_one_organism(graph_spot: Graph, input_org_rgps: List[Region], original_nodes: Set[int], new_spot_id_counter: int, multigenics: Set[GeneFamily], organism_name: str, output: Path, write_graph_flag: bool = False, graph_formats: List[str] = ['gexf'], overlapping_match: int = 2, set_size: int = 3, exact_match: int = 1, compress: bool = False) → Set[Spot]
Predict spots for input organism RGPs.
- Parameters:
graph_spot – The spot graph from the pangenome.
input_org_rgps – List of RGPs from the input organism to be associated with spots.
original_nodes – Set of original nodes in the spot graph.
new_spot_id_counter – Counter for new spot IDs.
multigenics – Set of pangenome graph multigenic persistent families.
organism_name – Name of the input organism.
output – Output directory to save the spot graph.
write_graph_flag – If True, writes the spot graph in the specified formats. Default is False.
graph_formats – List of graph formats to write (default is [‘gexf’]).
overlapping_match – Number of missing persistent genes allowed when comparing flanking genes. Default is 2.
set_size – Number of single copy markers to use as flanking genes for RGP during hotspot computation. Default is 3.
exact_match – Number of perfectly matching flanking single copy markers required to associate RGPs. Default is 1.
compress – Flag to compress output files
- Returns:
Set[Spot]: The predicted spots for the input organism RGPs.
- ppanggolin.projection.projection.predict_spots_in_input_organisms(initial_spots: List[Spot], initial_regions: List[Region], input_org_2_rgps: Dict[Organism, Set[Region]], multigenics: Set[GeneFamily], output: Path, write_graph_flag: bool = False, graph_formats: List[str] = ['gexf'], overlapping_match: int = 2, set_size: int = 3, exact_match: int = 1, compress: bool = False) → Dict[Organism, Set[Spot]]
Create a spot graph from pangenome RGP and predict spots for input organism RGPs.
- Parameters:
initial_spots – List of original spots in the pangenome.
initial_regions – List of original regions in the pangenome.
input_org_2_rgps – Dictionary mapping input organisms to their RGPs.
multigenics – Set of pangenome graph multigenic persistent families.
output – Output directory to save the spot graph.
write_graph_flag – If True, writes the spot graph in the specified formats. Default is False.
graph_formats – List of graph formats to write (default is [‘gexf’]).
overlapping_match – Number of missing persistent genes allowed when comparing flanking genes. Default is 2.
set_size – Number of single copy markers to use as flanking genes for RGP during hotspot computation. Default is 3.
exact_match – Number of perfectly matching flanking single copy markers required to associate RGPs. Default is 1.
compress – Flag to compress output files
- Returns:
A dictionary mapping input organisms to their predicted spots.
- ppanggolin.projection.projection.project_and_write_modules(pangenome: Pangenome, input_organisms: Iterable[Organism], output: Path, compress: bool = False)
Write a TSV file providing the association between modules and the input organisms
- Parameters:
pangenome – Pangenome object
input_organisms – Iterable of the organisms being annotated
output – Path to the output directory
compress – Compress the file in .gz
- ppanggolin.projection.projection.read_annotation_files(genome_name_to_annot_path: Dict[str, dict], cpu: int = 1, pseudo: bool = False, disable_bar: bool = False) → Tuple[List[Organism], Dict[Organism, bool]]
Read the annotation from GBFF file
- Parameters:
pangenome – pangenome object
organisms_file – List of GBFF files for each organism
cpu – number of CPU cores to use
pseudo – Allow reading pseudogenes
disable_bar – Disable the progress bar
- ppanggolin.projection.projection.retrieve_gene_sequences_from_fasta_file(input_organism, fasta_file)
Get gene sequences from FASTA files
- Parameters:
pangenome – input pangenome
fasta_file – List of FASTA files
- ppanggolin.projection.projection.subparser(sub_parser: _SubParsersAction) → ArgumentParser
Subparser to launch PPanGGOLiN from the command line
- Parameters:
sub_parser – Subparser for the projection command
- Returns:
Parser arguments for the projection command
- ppanggolin.projection.projection.summarize_projected_genome(organism: Organism, pangenome_persistent_count: int, pangenome_persistent_single_copy_families: Set[GeneFamily], soft_core_families: Set[GeneFamily], exact_core_families: Set[GeneFamily], input_org_rgps: List[Region], input_org_spots: List[Spot], input_org_modules: List[Module], pangenome_file: str, singleton_gene_count: int) → Dict[str, any]
Summarizes the projected genome and generates an organism summary.
- Parameters:
organism – The Organism object for which the summary is generated.
pangenome_persistent_count – The count of persistent genes in the pangenome.
pangenome_persistent_single_copy_families – Set of single-copy persistent gene families.
soft_core_families – Set of soft core families in the pangenome.
exact_core_families – Set of exact core families in the pangenome.
input_org_rgps – List of Region objects for the input organism.
input_org_spots – List of Spot objects for the input organism.
input_org_modules – List of Module objects for the input organism.
pangenome_file – Filepath to the pangenome file.
singleton_gene_count – Count of singleton genes in the organism.
- Returns:
A dictionary containing summarized information about the organism.
- ppanggolin.projection.projection.write_projection_results(pangenome: Pangenome, organisms: Set[Organism], input_org_2_rgps: Dict[Organism, Set[Region]], input_org_to_spots: Dict[Organism, Set[Spot]], input_orgs_to_modules: Dict[Organism, Set[Module]], input_org_to_lonely_genes_count: Dict[Organism, int], write_proksee: bool, write_gff: bool, write_table: bool, add_sequences: bool, genome_name_to_path: Dict[str, dict], input_type: str, output_dir: Path, dup_margin: float, soft_core: float, metadata_sep: str, compress: bool, need_regions: bool, need_spots: bool, need_modules: bool)
Write the results of the projection of the pangenome onto the input genomes.
- Parameters:
pangenome – The pangenome onto which the projection is performed.
organisms – A set of input organisms for projection.
input_org_2_rgps – A dictionary mapping input organisms to sets of regions of genomic plasticity (RGPs).
input_org_to_spots – A dictionary mapping input organisms to sets of spots.
input_orgs_to_modules – A dictionary mapping input organisms to sets of modules.
input_org_to_lonely_genes_count – A dictionary mapping input organisms to the count of lonely genes.
write_proksee – Whether to write ProkSee JSON files.
write_gff – Whether to write GFF files.
add_sequences – Whether to add sequences to the output files.
genome_name_to_path – A dictionary mapping genome names to file paths.
input_type – The type of input data (e.g., “annotation”).
output_dir – The directory where the output files will be written.
dup_margin – The duplication margin used to compute completeness.
soft_core – Soft core threshold.
write_table – Whether to write table files.
Note:
If write_proksee is True and input organisms have modules, module colors for ProkSee are obtained.
The function calls other functions such as summarize_projection, read_genome_file, write_proksee_organism, write_gff_file, and write_summaries to generate various output files and summaries.
- ppanggolin.projection.projection.write_rgp_to_spot_table(rgp_to_spots: Dict[Region, Set[str]], output: Path, filename: str, compress: bool = False)
Write a table mapping RGPs to corresponding spot IDs.
- Parameters:
rgp_to_spots – A dictionary mapping RGPs to spot IDs.
output – Path to the output directory.
filename – Name of the file to write.
compress – Whether to compress the file.
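A table of this kind can be written with the standard library alone. The sketch below uses strings in place of Region objects, and the column names (`region`, `spot_id`), the `.gz` suffix handling, and the `write_mapping_table` name are assumptions for illustration rather than the format PPanGGOLiN produces.

```python
import csv
import gzip
from pathlib import Path
from typing import Dict, Set


def write_mapping_table(rgp_to_spots: Dict[str, Set[str]], output: Path,
                        filename: str, compress: bool = False) -> Path:
    """Write one tab-separated line per (RGP, spot ID) pair and return the path."""
    output.mkdir(parents=True, exist_ok=True)
    path = output / (filename + ".gz" if compress else filename)
    # gzip.open and open accept the same text-mode arguments used here.
    opener = gzip.open if compress else open
    with opener(path, "wt", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["region", "spot_id"])  # hypothetical column names
        for rgp in sorted(rgp_to_spots):
            for spot_id in sorted(rgp_to_spots[rgp]):
                writer.writerow([rgp, spot_id])
    return path
```

Sorting the keys and the spot IDs keeps the output deterministic, which makes the table easier to compare between runs.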
- ppanggolin.projection.projection.write_summary_in_yaml(summary_info: Dict[str, Any], output_file: Path)
Write summary information to a YAML file.
This function takes a dictionary containing summary information. It writes this information to a YAML file.
- Parameters:
summary_info – A dictionary containing summary information.
output_file – The file where the summary will be written.