Metadata and Pangenome
Associating Metadata to Pangenome Elements
The metadata command allows the addition of metadata linked to various pangenome elements. Metadata can be associated with genes, genomes, families, RGPs, spots, and modules using a simple TSV file.
To add metadata to your pangenome, execute the command as shown below:
ppanggolin metadata -p PANGENOME --metadata METADATA.TSV --source SOURCE --assign ASSIGN
The
--sourceargument corresponds to the metadata’s origin and will serve as the storage key in the pangenome.--assignallows you to specify the pangenome elements to which you want to add metadata from the following list: {families, genomes, genes, RGPs, spots, modules}.
The associated metadata can then be exported in various output files of PPanGGOLiN such as GFF, PROKSEE JSON Map and Table output for genomes (see here for more details) and in the gexf graph file of the pangenome as well as in the graph resulting in the RGP clustering.
The metadata linked to pangenome elements can be exported to various output file formats within PPanGGOLiN, including GFF, PROKSEE JSON Map, and Table outputs of the write_genomes command (see here for more details). Additionally, the metadata can also be included in the gexf graph file representing the pangenome and in the RGP clustering graph.
Note
Certain information (such as species, strain, and dbx_ref) is extracted from genome annotation files (GBFF, GFF) during the annotation step and added to the pangenome as metadata under the source ‘annotation_files’. These metadata can be extracted using the write_metadata command.
Metadata Format
PPanGGOLiN offers a highly flexible metadata file format. Only one column name is mandatory, and it aligns with the assignment argument chosen by the user (ie. families, RGPS…).
For instance, the TSV file used to assign metadata to gene families for functional annotation might resemble the following:
families |
Accession |
Function |
Description |
|---|---|---|---|
GF_1 |
Acc_1 |
Fn_1 |
Desc_1 |
GF_2 |
Acc_2 |
Fn_2 |
Desc_2 |
GF_2 |
Acc_3 |
Fn_3 |
Desc_3 |
… |
… |
… |
… |
GF_n |
Acc_n |
Fn_n |
Desc_n |
Note
As you can see in the above table, one element (here GF_2) can be associated with with multiple metadata entries.
Command Specific Option Details
--metadata
PPanGGOLiN enables to give one TSV at a time to add metadata.
--source
The source serves as the key for accessing metadata within the pangenome. If the source name already exists in the pangenome, it can be overwritten using the --force option. This system facilitates the utilization of multiple metadata sources, accessible and usable within PPanGGOLiN. In the context of annotation, the source typically represents the name of the annotation database used during the annotation process.
--assign
PPanGGOLiN enables the addition of metadata to various pangenome elements, including families, genomes, genes, RGPs, spots, and modules. However, the user can provide only one metadata file at a time, thereby specifying a single source and one type of pangenome element.
--omit
This option allows you to bypass errors resulting from an unfound ID in the pangenome. It can be beneficial when utilizing a general TSV with elements not present in the pangenome. This argument should be used carefully.
Checking Metadata Associated with the Pangenome
You can inspect which pangenome elements have associated metadata and their respective sources using the ppanggolin info command. For more details on this command, refer to the info command documentation.
To check the metadata, execute the following command:
ppanggolin info -p PANGENOME --metadata
Exporting Metadata to TSV Files
The write_metadata command allows you to export metadata to TSV files. It creates one file per source of pangenome element metadata.
For example, if families have metadata from sources hmmer and dbcan, and genomes have metadata from the source gtdb, the command will create three files:
families_metadata_from_hmmer.tsvfamilies_metadata_from_dbcan.tsvgenomes_metadata_from_gtdb.tsv
To export metadata from your pangenome, execute the following command:
ppanggolin write_metadata -p PANGENOME --output metadata_tsv_files