Write pangenome sequences

The fasta command can be used to write sequences of the pangenome or specific parts of the pangenome in FASTA format.

Most options take a partition as their argument, which selects the part of the pangenome to write.

Available partitions are:

  • all for the entire pangenome

  • persistent for persistent genes or families

  • shell for shell genes or families

  • cloud for cloud genes or families

  • rgp for genes or families found in RGPs

  • core for core genes or families

  • softcore for softcore genes or families

When using the softcore partition, the --soft_core option sets the threshold used to decide what is part of the softcore. The default is 0.95.
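As a rough illustration of what the threshold means, the sketch below computes how many genomes a family must be found in to count as softcore. It assumes the threshold is the minimum fraction of genomes that must contain the family, which is not stated above. min_genomes is a hypothetical helper, not part of PPanGGOLiN, and the genome count of 200 is a made-up value.

```shell
# Assumption: a family is softcore when it is present in at least
# (threshold x number of genomes) genomes, rounded up.
min_genomes() {
    # $1 = number of genomes, $2 = softcore threshold (fraction between 0 and 1)
    awk -v n="$1" -v t="$2" 'BEGIN { x = n * t; c = int(x); if (c < x) c++; print c }'
}

min_genomes 200 0.95   # default threshold, prints 190
min_genomes 200 0.90   # relaxed threshold, prints 180
```

A different threshold, such as 0.90, is then passed on the command line together with the softcore partition:

ppanggolin fasta -p pangenome.h5 --output MY_GENES --genes softcore --soft_core 0.90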

Genes

Nucleotide sequences

With the --genes partition option, PPanGGOLiN writes the nucleotide CDS sequences for the given partition. For example, to write all the genes of the pangenome:

ppanggolin fasta -p pangenome.h5 --output MY_GENES --genes all

Or to write only the persistent genes:

ppanggolin fasta -p pangenome.h5 --output MY_GENES --genes persistent
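To export several partitions in one go, the same command can be generated in a loop. The sketch below only prints the commands so they can be reviewed first. genes_commands is a hypothetical helper, and the MY_GENES_<partition> directory names are illustrative choices rather than a PPanGGOLiN convention.

```shell
# Print one `ppanggolin fasta` command per partition,
# each writing to its own output directory.
genes_commands() {
    # $1 = path to the pangenome file
    pangenome="$1"
    for partition in persistent shell cloud; do
        echo "ppanggolin fasta -p $pangenome --output MY_GENES_$partition --genes $partition"
    done
}

genes_commands pangenome.h5
```

Once the printed commands look right, piping them to a shell (genes_commands pangenome.h5 | sh) runs them one after the other.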

Protein sequences

With the --genes_prot partition option, PPanGGOLiN writes the protein sequences of the CDS for the given partition. For example, to write the proteins of all the genes of the pangenome:

ppanggolin fasta -p pangenome.h5 --output MY_GENES --genes_prot all

Or to write only the cloud genes:

ppanggolin fasta -p pangenome.h5 --output MY_GENES --genes_prot cloud

To translate the gene sequences, PPanGGOLiN uses the MMseqs2 translatenucs command, so this option can run on multiple threads with --cpu. You can also specify the translation table with --translate_table. The temporary directory used to store the MMseqs2 database and other files can be set with --tmpdir. Temporary files are deleted at the end of the execution; use the --keep_tmp option to keep them.
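The options above can be combined in a single call. The following is a minimal sketch that assembles such a command and prints it for review. The thread count, the ./tmp_ppanggolin path and the choice of translation table 11 (the NCBI bacterial, archaeal and plant plastid code) are example values, not recommendations, and --keep_tmp is assumed to be a flag that takes no value.

```shell
CPU=4                          # number of threads (example value)
TRANSLATE_TABLE=11             # NCBI genetic code 11 (example value)
TMP_PATH=./tmp_ppanggolin      # hypothetical location for temporary files

# Assemble the command as positional parameters so each argument stays one word.
set -- ppanggolin fasta -p pangenome.h5 --output MY_GENES --genes_prot all \
    --cpu "$CPU" --translate_table "$TRANSLATE_TABLE" \
    --tmpdir "$TMP_PATH" --keep_tmp

echo "$@"    # prints the command for review; run "$@" instead to execute it
```

With --keep_tmp set, the intermediate files should remain in the directory given to --tmpdir after the run, which can help when troubleshooting the translation step.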

Gene families

Protein sequences

With the --prot_families partition option, PPanGGOLiN writes the protein sequence of the representative gene of each family in the given partition. For example, for all families:

ppanggolin fasta -p pangenome.h5 --output MY_PROT --prot_families all

Or for all the shell families for example:

ppanggolin fasta -p pangenome.h5 --output MY_PROT --prot_families shell

Nucleotide sequences

With the --gene_families partition option, PPanGGOLiN writes the nucleotide sequence of the representative gene of each family in the given partition. For example, for all families:

ppanggolin fasta -p pangenome.h5 --output MY_GENES_FAMILIES --gene_families all

Or for the core families for example:

ppanggolin fasta -p pangenome.h5 --output MY_GENES_FAMILIES --gene_families core

Modules

All of the preceding commands also accept a module as a partition.

So you can write the protein sequences of the families in module_X as such:

ppanggolin fasta -p pangenome.h5 --output MY_REGIONS --prot_families module_X

Or the nucleotide sequences of all genes in module_X:

ppanggolin fasta -p pangenome.h5 --output MY_REGIONS --genes module_X

Regions

The --regions option can be used to write the nucleotide sequences of the detected RGPs. It requires the FASTA sequences of the genomes, as originally provided when you computed the pangenome; they are given with the --fasta option.

This option accepts only two filters:

  • all, for all regions

  • complete, for only the complete regions, that is, those not on a contig border

It can be used as such:

ppanggolin fasta -p pangenome.h5 --output MY_REGIONS --regions all --fasta genomes.fasta.list
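After any of the commands on this page, a quick sanity check is to count the sequences that were written. The sketch below counts FASTA header lines in every file of an output directory. count_fasta_records is a hypothetical helper, and it assumes the output files are plain, uncompressed text; the names of the files that PPanGGOLiN writes are not assumed.

```shell
# List each file in an output directory with its number of FASTA records
# (lines starting with '>'), separated by a tab.
count_fasta_records() {
    # $1 = directory that was passed to --output
    find "$1" -type f | while IFS= read -r f; do
        printf '%s\t%s\n' "$f" "$(grep -c '^>' "$f")"
    done
}

# Only report when the directory exists, for example after the regions command.
if [ -d MY_REGIONS ]; then
    count_fasta_records MY_REGIONS
fi
```

A count of 0 for a partition is not necessarily an error, since a pangenome may simply contain nothing in that partition.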