Conserved module prediction
PPanGGOLiN can predict and work with conserved modules, which are groups of genes that are part of the variable genome, and often found together across the genomes of the pangenome. These conserved modules may also be potential functional modules.
Further details can be found in the panModule preprint.
The panModule workflow
The panModule workflow builds a pangenome with predicted conserved modules from a specified set of genomes. It extends the workflow command by adding conserved module detection. It also generates descriptive tsv files for the predicted modules, whose format is detailed here.
To execute the panModule workflow, use the following command:
ppanggolin panmodule --fasta GENOME_LIST_FILE
Replace GENOME_LIST_FILE with a tab-separated file listing the genome names, and the fasta file path of their genomic sequences as described here. Alternatively, you can provide a list of GFF/GBFF files as input by using the --anno parameter, similar to how it is used in the workflow and annotate commands.
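A minimal Python sketch of one way to build such a list file from a directory of FASTA files. The helper name `write_genome_list`, the accepted file extensions, and the use of the file stem as the genome name are assumptions made for illustration and are not part of PPanGGOLiN; only the two tab-separated columns (genome name, FASTA path) come from the description above.

```python
import csv
from pathlib import Path


def write_genome_list(fasta_dir, list_path, suffixes=(".fasta", ".fa", ".fna")):
    """Write one 'genome_name<TAB>fasta_path' line per FASTA file in fasta_dir.

    Returns the number of genomes written.
    """
    fasta_dir = Path(fasta_dir)
    fasta_files = sorted(
        p for p in fasta_dir.iterdir() if p.is_file() and p.suffix in suffixes
    )
    with open(list_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for fasta in fasta_files:
            # Assumption: the file name without its extension is the genome name.
            writer.writerow([fasta.stem, str(fasta.resolve())])
    return len(fasta_files)
```

For example, `write_genome_list("genomes/", "genomes.list")` (both paths hypothetical) would produce a file that can be passed in place of GENOME_LIST_FILE.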
The panmodule workflow predicts modules using default parameters. To fine-tune the detection, you can either run the module command on a partitioned pangenome (obtained with the workflow command, for example) or use a configuration file, as described here.
Predict conserved modules
The module command predicts conserved modules on a partitioned pangenome. The command has several options for tuning the prediction. Details about each parameter are available in the related preprint.
In its simplest form, the command is:
ppanggolin module -p pangenome.h5
This will predict modules and store the results in the HDF5 pangenome file. If you wish to have descriptive tsv files, whose format is detailed here, you can use the write_pangenome command with the flag --modules:
ppanggolin write_pangenome -p pangenome.h5 --modules --output MYOUTPUTDIR
If spots of insertion have been predicted in your pangenome using the spot command (or inside the panrgp or all workflow commands), you can also list the associations between the predicted spots and the predicted modules as follows:
ppanggolin write_pangenome -p pangenome.h5 --spot_modules --output MYOUTPUTDIR
The format of each file is given here.
Module outputs
Descriptive Tables for Predicted Modules
Several files can be generated to describe the predicted modules, each covering different characteristics of these modules.
To generate these tables, use the write_pangenome command with the --modules flag:
ppanggolin write_pangenome -p pangenome.h5 --modules -o my_output_dir
This command generates three tables, described below: functional_modules.tsv, modules_in_genomes.tsv, and modules_summary.tsv.
1. Gene Family to Module Mapping Table
The functional_modules.tsv file lists modules with their corresponding gene families. Each line establishes a mapping between a gene family and its respective module.
It has the following format:

| Column | Description |
|---|---|
| module_id | Identifier for the module |
| family_id | Identifier for the family |
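Since each line maps one gene family to one module, the table can be regrouped into one set of families per module. The following is a minimal Python sketch, assuming the file starts with a header line carrying the column names shown above; the function name and the example identifiers are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def families_by_module(handle):
    """Group family identifiers by module from a functional_modules.tsv handle."""
    modules = defaultdict(set)
    # Assumption: the first line is a header naming the two columns.
    for row in csv.DictReader(handle, delimiter="\t"):
        modules[row["module_id"]].add(row["family_id"])
    return dict(modules)


# Made-up rows standing in for a real functional_modules.tsv.
example = io.StringIO(
    "module_id\tfamily_id\n"
    "module_0\tfamA\n"
    "module_0\tfamB\n"
    "module_0\tfamC\n"
    "module_1\tfamD\n"
    "module_1\tfamE\n"
    "module_1\tfamF\n"
)
modules = families_by_module(example)
for module_id, families in sorted(modules.items()):
    print(module_id, len(families), sorted(families))
```

With a real file, the `io.StringIO` object would be replaced by `open("my_output_dir/functional_modules.tsv")`, where the directory name is whatever was passed to the output option.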
2. Genome-wise Module Composition
The modules_in_genomes.tsv file lists the modules present in each genome, along with their completeness. Because module predictions can vary between genomes, a module may be only partially complete in some of the genomes where it is detected.
The structure of the modules_in_genomes.tsv file is outlined as follows:
| Column | Description |
|---|---|
| module_id | Identifier for the module |
| genome | Genome in which the indicated module is found |
| completion | Indicates the level of completeness (0.0 to 1.0) of the module in the genome |
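The completion column makes it possible to keep only the modules that are complete, or nearly complete, in each genome. Below is a minimal Python sketch, assuming a header line with the column names listed above; the function name, the threshold parameter, and the example rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def modules_by_genome(handle, min_completion=1.0):
    """Return {genome: [module_id, ...]} for modules reaching min_completion."""
    kept = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        # completion is a fraction between 0.0 and 1.0
        if float(row["completion"]) >= min_completion:
            kept[row["genome"]].append(row["module_id"])
    return dict(kept)


# Made-up rows standing in for a real modules_in_genomes.tsv.
rows = (
    "module_id\tgenome\tcompletion\n"
    "module_0\tgenomeA\t1.0\n"
    "module_0\tgenomeB\t0.67\n"
    "module_1\tgenomeA\t0.5\n"
    "module_1\tgenomeB\t1.0\n"
)
complete = modules_by_genome(io.StringIO(rows))
mostly_complete = modules_by_genome(io.StringIO(rows), min_completion=0.6)
print(complete)
print(mostly_complete)
```

Lowering `min_completion` trades strictness for sensitivity: partially complete modules are then reported as present in a genome.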
3. Module Summary
The modules_summary.tsv file lists characteristics for each detected module, with one line for each module.
The format is as follows:
| Column | Description |
|---|---|
| module_id | The module identifier |
| nb_families | The number of families included in the module. The families are listed in the functional_modules.tsv file. |
| nb_genomes | The number of genomes in which the module is found. Those genomes are listed in the modules_in_genomes.tsv file. |
| partition | The average partition of the families in the module |
| mean_number_of_occurrence | The mean number of times the module is present in each genome |
Mapping Modules with Spots and Regions of Genomic Plasticity (RGPs)
Predicted modules can be associated with Spots of insertion and Regions of Genomic Plasticity (RGPs) using the write_pangenome command with the --spot_modules flag as follows:
ppanggolin write_pangenome -p pangenome.h5 --spot_modules -o my_output_dir
This command generates two tables: modules_spots.tsv and modules_RGP_lists.tsv, described below.
Note
These outputs are available only if modules, spots, and RGPs have been computed in your pangenome (see the command all or the commands spot, rgp, and module for that).
Moreover, this information can be visualized through figures using the command ppanggolin draw --spots (refer to Spot plots, which can display modules).
1. Associating Modules and Spots
The modules_spots.tsv file indicates which modules are present in each spot.
Its format is as follows:
| Column | Description |
|---|---|
| module_id | Module identifier |
| spot_id | Spot identifier |
2. Associating Modules and RGPs
The modules_RGP_lists.tsv file lists RGPs that contain the same modules. These RGPs may have different gene families, but they will not include any other modules apart from those indicated. The format of modules_RGP_lists.tsv is as follows:
| Column | Description |
|---|---|
| representative_RGP | An RGP chosen at random to represent the group |
| nb_spots | The number of spots where the RGPs containing the listed modules are observed |
| mod_list | A list of modules present in the indicated RGPs |
| RGP_list | A list of RGPs that specifically include the previously listed modules |
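Because mod_list and RGP_list hold several values in a single cell, they need to be split after reading the table. The sketch below is a hedged Python example: the separator used inside these cells is not stated here, so a comma is assumed and exposed as a parameter, and the header line, function name, and example identifiers are likewise assumptions.

```python
import csv
import io


def read_rgp_groups(handle, list_sep=","):
    """Parse modules_RGP_lists.tsv rows into dictionaries with real lists."""
    groups = []
    for row in csv.DictReader(handle, delimiter="\t"):
        groups.append(
            {
                "representative_RGP": row["representative_RGP"],
                "nb_spots": int(row["nb_spots"]),
                # Assumption: items inside a cell are separated by list_sep.
                "modules": [m for m in row["mod_list"].split(list_sep) if m],
                "rgps": [r for r in row["RGP_list"].split(list_sep) if r],
            }
        )
    return groups


# Made-up rows standing in for a real modules_RGP_lists.tsv.
example = io.StringIO(
    "representative_RGP\tnb_spots\tmod_list\tRGP_list\n"
    "contig1_RGP_0\t1\tmodule_0\tcontig1_RGP_0,contig7_RGP_2\n"
    "contig3_RGP_1\t2\tmodule_1,module_4\tcontig3_RGP_1,contig5_RGP_0,contig9_RGP_3\n"
)
groups = read_rgp_groups(example)
for group in groups:
    print(group["representative_RGP"], len(group["modules"]), len(group["rgps"]))
```

If the real file uses a different separator, only the `list_sep` argument needs to change.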
Module Information
Once modules have been predicted, the info command can present overall statistics about them, including the number of families found within the modules and their distribution across the partitions.
ppanggolin info -p pangenome.h5 --content
The command output provides the following details about modules:
[...]
Modules:
    Number_of_modules: 380
    Families_in_Modules: 2242
    Partition_composition:
        Persistent: 0.27
        Shell: 37.69
        Cloud: 62.04
    Number_of_Families_per_Modules:
        min: 3
        max: 65
        sd: 5.84
        mean: 5.9
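The figures in this example output can be cross-checked with a little arithmetic. The Python sketch below assumes that the Partition_composition values are percentages of the families found in modules (they sum to 100 here) and that the mean is Families_in_Modules divided by Number_of_modules; both readings are inferred from the numbers and not confirmed by the text above.

```python
# Values copied from the example output above.
number_of_modules = 380
families_in_modules = 2242
partition_percent = {"Persistent": 0.27, "Shell": 37.69, "Cloud": 62.04}

# Assumption: the partition values are percentages of the families in modules.
total_percent = round(sum(partition_percent.values()), 2)

# Approximate number of module families per partition under that assumption.
approx_counts = {
    name: round(families_in_modules * percent / 100)
    for name, percent in partition_percent.items()
}

# Mean number of families per module, to compare with the reported "mean: 5.9".
mean_families = round(families_in_modules / number_of_modules, 1)

print(total_percent)   # 100.0
print(approx_counts)   # {'Persistent': 6, 'Shell': 845, 'Cloud': 1391}
print(mean_families)   # 5.9
```

Under these assumptions the approximate counts add up to the 2242 families reported, which suggests that in this example almost all module families belong to the shell and cloud partitions.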