ppanggolin package
Subpackages
- ppanggolin.RGP package
- Submodules
- ppanggolin.RGP.genomicIsland module
- ppanggolin.RGP.rgp_cluster module
IdenticalRegions, add_edges_to_identical_rgps(), add_info_to_identical_rgps(), add_info_to_rgp_nodes(), add_rgp_metadata_to_graph(), cluster_rgp(), cluster_rgp_on_grr(), compute_grr(), compute_jaccard_index(), compute_rgp_metric(), dereplicate_rgp(), format_rgp_metadata(), get_spot_id(), join_dicts(), launch(), parser_cluster_rgp(), subparser(), write_rgp_cluster_table()
- ppanggolin.RGP.spot module
- Module contents
- ppanggolin.align package
- Submodules
- ppanggolin.align.alignOnPang module
align(), align_seq_to_pang(), draw_spot_gexf(), get_fam_to_rgp(), get_fam_to_spot(), get_input_seq_to_family_with_all(), get_input_seq_to_family_with_rep(), get_seq_ids(), get_seq_info(), launch(), map_input_gene_to_family_all_aln(), map_input_gene_to_family_rep_aln(), parser_align(), project_and_write_partition(), subparser(), write_all_gene_sequences(), write_gene_fam_sequences(), write_gene_to_gene_family()
- Module contents
- ppanggolin.annotate package
- Submodules
- ppanggolin.annotate.annotate module
add_metadata_from_gff_file(), annotate_pangenome(), check_and_add_extra_gene_part(), check_annotate_args(), chose_gene_identifiers(), combine_contigs_metadata(), correct_putative_overlaps(), create_gene(), determine_genetic_code_from_annotation_files(), determine_genetic_code_to_use(), extract_positions(), fix_partial_gene_coordinates(), get_gene_sequences_from_fastas(), launch(), local_identifiers_are_unique(), parse_contig_header_lines(), parse_db_xref_metadata(), parse_dna_seq_lines(), parse_feature_lines(), parse_gbff_by_contig(), parser_annot(), read_anno_file(), read_annotations(), read_org_gbff(), read_org_gff(), reverse_complement_coordinates(), shift_end_coordinates(), shift_start_coordinates(), subparser()
- ppanggolin.annotate.synta module
- Module contents
- ppanggolin.cluster package
- Submodules
- ppanggolin.cluster.cluster module
align_rep(), check_pangenome_for_clustering(), check_pangenome_former_clustering(), clustering(), first_clustering(), get_family_representative_sequences(), infer_singletons(), launch(), mk_local_to_gene(), parser_clust(), read_clustering(), read_clustering_file(), read_faa(), read_fam2seq(), read_gene2fam(), read_tsv(), refine_clustering(), subparser()
- Module contents
- ppanggolin.context package
- Submodules
- ppanggolin.context.searchGeneContext module
add_edges_to_context_graph(), add_val_to_dict_attribute(), align_sequences_to_families(), check_pangenome_for_context_search(), compute_edge_metrics(), compute_gene_context_graph(), export_context_to_dataframe(), fam_to_seq(), get_contig_to_genes(), get_gene_contexts(), get_n_next_genes_index(), increment_attribute_counter(), launch(), make_graph_writable(), parser_context(), search_gene_context_in_pangenome(), subparser(), write_graph()
- Module contents
- ppanggolin.figures package
- ppanggolin.formats package
- Submodules
- ppanggolin.formats.readBinaries module
Genedata, check_pangenome_info(), create_info_dict(), get_families_from_genes(), get_families_matching_partition(), get_family_to_genome_count(), get_gene_to_genome(), get_genes_from_families(), get_need_info(), get_non_redundant_gene_sequences_from_file(), get_number_of_organisms(), get_pangenome_parameters(), get_seqid_to_genes(), get_soft_core_families(), get_status(), read_annotation(), read_chunks(), read_contigs(), read_gene_families(), read_gene_families_info(), read_gene_sequences(), read_genedata(), read_genes(), read_graph(), read_info(), read_join_coordinates(), read_metadata(), read_module_families_from_pangenome_file(), read_modules(), read_organisms(), read_pangenome(), read_parameters(), read_rgp(), read_rgp_genes_from_pangenome_file(), read_rnas(), read_sequences(), read_spots(), write_fasta_gene_fam_from_pangenome_file(), write_fasta_prot_fam_from_pangenome_file(), write_gene_sequences_from_pangenome_file(), write_genes_from_pangenome_file(), write_genes_seq_from_pangenome_file()
- ppanggolin.formats.writeAnnotations module
contig_desc(), gene_desc(), gene_joined_coordinates_desc(), gene_sequences_desc(), genedata_desc(), get_gene_sequences_len(), get_genedata(), get_max_len_annotations(), get_max_len_genedata(), get_sequence_len(), organism_desc(), rna_desc(), sequence_desc(), write_annotations(), write_contigs(), write_gene_joined_coordinates(), write_gene_sequences(), write_genedata(), write_genes(), write_organisms(), write_rnas()
- ppanggolin.formats.writeBinaries module
erase_pangenome(), gene_fam_desc(), gene_to_fam_desc(), get_gene_fam_len(), get_gene_id_len(), get_gene_to_fam_len(), get_mod_desc(), get_rgp_len(), get_spot_desc(), getmax(), getmean(), getmin(), getstdev(), graph_desc(), mod_desc(), rgp_desc(), spot_desc(), update_gene_fam_partition(), update_gene_fragments(), write_gene_fam_info(), write_gene_families(), write_graph(), write_info(), write_info_modules(), write_modules(), write_pangenome(), write_rgp(), write_spots(), write_status()
- ppanggolin.formats.writeFlatGenomes module
- ppanggolin.formats.writeFlatPangenome module
launch(), parser_flat(), spot2rgp(), subparser(), summarize_genome(), summarize_spots(), write_borders(), write_gene_families_tsv(), write_gene_presence_absence(), write_gexf(), write_gexf_edges(), write_gexf_end(), write_gexf_header(), write_gexf_nodes(), write_json(), write_json_edge(), write_json_edges(), write_json_gene_fam(), write_json_header(), write_json_nodes(), write_matrix(), write_module_summary(), write_modules(), write_org_modules(), write_pangenome_flat_files(), write_partitions(), write_persistent_duplication_statistics(), write_regions(), write_regions_families(), write_rgp_modules(), write_rgp_table(), write_spot_modules(), write_spots(), write_stats(), write_summaries_in_tsv()
- ppanggolin.formats.writeMSA module
- ppanggolin.formats.writeMetadata module
- ppanggolin.formats.writeSequences module
check_write_sequences_args(), create_mmseqs_db(), filter_values(), launch(), parser_seq(), read_fasta_gbk(), read_fasta_or_gff(), read_genome_file(), subparser(), translate_genes(), write_gene_protein_sequences(), write_gene_sequences_from_annotations(), write_regions_sequences(), write_sequence_files(), write_spaced_fasta()
- ppanggolin.formats.write_proksee module
- Module contents
- ppanggolin.graph package
- ppanggolin.info package
- ppanggolin.meta package
- ppanggolin.metrics package
- ppanggolin.mod package
- ppanggolin.nem package
- ppanggolin.projection package
- Submodules
- ppanggolin.projection.projection module
NewSpot, annotate_fasta_files(), annotate_input_genes_with_pangenome_families(), check_input_names(), check_pangenome_for_projection(), check_projection_arguments(), check_spots_congruency(), get_gene_sequences_from_fasta_files(), infer_input_mode(), launch(), manage_annotate_param(), manage_input_genomes_annotation(), parser_projection(), predict_RGP(), predict_spot_in_one_organism(), predict_spots_in_input_organisms(), project_and_write_modules(), read_annotation_files(), retrieve_gene_sequences_from_fasta_file(), subparser(), summarize_projected_genome(), write_projection_results(), write_rgp_to_spot_table(), write_summary_in_yaml()
- Module contents
- ppanggolin.utility package
- ppanggolin.workflow package
Submodules
ppanggolin.edge module
- class ppanggolin.edge.Edge(source_gene: Gene, target_gene: Gene)
Bases: object
The Edge class represents an edge between two gene families in the pangenome graph. It is associated with all the organisms in which the neighborship is found, and all the involved genes as well.
- Methods:
get_organisms_dict: Returns a dictionary with organisms as keys and the list of gene pairs found in each organism as values.
gene_pairs: Returns a list of all the gene pairs in the Edge.
add_genes: Adds genes to the edge. They are supposed to be in the same organism.
- Fields:
source: A GeneFamily object representing the source gene family of the edge.
target: A GeneFamily object representing the target gene family of the edge.
organisms: A defaultdict object representing the organisms in which the edge is found and the pairs of genes involved.
- add_genes(source_gene: Gene, target_gene: Gene)
Adds genes to the edge. They are supposed to be in the same organism.
- Parameters:
source_gene – Gene corresponding to the source of the edge
target_gene – Gene corresponding to the target of the edge
- Raises:
TypeError – If the genes are not of type Gene
ValueError – If genes are not associated with an organism
Exception – If the genes are not in the same organism.
- property gene_pairs: List[Tuple[Gene, Gene]]
Get the list of all the gene pairs in the Edge
- Returns:
A list of all the gene pairs in the Edge
- get_organism_genes_pairs(organism: Organism) List[Tuple[Gene, Gene]]
Get the gene pair corresponding to the given organism
- Parameters:
organism – Wanted organism
- Returns:
Pair of genes in the edge corresponding to the given organism
- get_organisms_dict() Dict[Organism, List[Tuple[Gene, Gene]]]
Get all the organisms with their corresponding pair of genes in the edge
- Returns:
Dictionary with the organism as the key and list of gene pairs as value
- property number_of_organisms: int
Get the number of organisms in the edge
- Returns:
Number of organisms
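The organisms field described above can be pictured as a defaultdict keyed by organism. The following is a minimal sketch of that bookkeeping, not the ppanggolin implementation: EdgeSketch and GeneStub are hypothetical stand-ins, a gene is reduced to a (gene_id, organism_name) tuple, and ValueError is used here for the different-organism case that the reference above lists as a generic Exception.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical stand-in for a Gene: (gene_id, organism_name).
GeneStub = Tuple[str, str]


class EdgeSketch:
    """Illustrative only; not the real ppanggolin.edge.Edge."""

    def __init__(self) -> None:
        # organism name -> list of (source_gene, target_gene) pairs
        self.organisms: Dict[str, List[Tuple[GeneStub, GeneStub]]] = defaultdict(list)

    def add_genes(self, source_gene: GeneStub, target_gene: GeneStub) -> None:
        # Both genes are expected to belong to the same organism.
        if source_gene[1] != target_gene[1]:
            raise ValueError("genes are not in the same organism")
        self.organisms[source_gene[1]].append((source_gene, target_gene))

    @property
    def gene_pairs(self) -> List[Tuple[GeneStub, GeneStub]]:
        # Flatten the per-organism lists into one list of pairs.
        return [pair for pairs in self.organisms.values() for pair in pairs]

    @property
    def number_of_organisms(self) -> int:
        return len(self.organisms)


edge = EdgeSketch()
edge.add_genes(("g1", "orgA"), ("g2", "orgA"))
edge.add_genes(("g7", "orgB"), ("g8", "orgB"))
```

Keying the dictionary by organism makes both the per-organism lookup and the organism count direct dictionary operations.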
ppanggolin.geneFamily module
- class ppanggolin.geneFamily.GeneFamily(family_id: int, name: str)
Bases: MetaFeatures
This represents a single gene family. It will be a node in the pangenome graph, and be aware of its genes and edges.
- Methods:
named_partition: returns a meaningful name for the partition associated with the family.
neighbors: returns all the GeneFamilies that are linked with an edge.
edges: returns all Edges that are linked to this gene family.
genes: returns all the genes associated with the family.
organisms: returns all the Organisms that have this gene family.
spots: returns all the spots associated with the family.
modules: returns all the modules associated with the family.
number_of_neighbors: returns the number of neighbor GeneFamilies.
number_of_edges: returns the number of edges.
number_of_genes: returns the number of genes.
number_of_organisms: returns the number of organisms.
number_of_spots: returns the number of spots.
set_edge: sets an edge between the current family and a target family.
add_sequence: assigns a protein sequence to the gene family.
add_gene: adds a gene to the gene family and sets the gene’s family accordingly.
add_spot: adds a spot to the gene family.
add_module: adds a module to the gene family.
mk_bitarray: produces a bitarray representing the presence/absence of the family in the pangenome using the provided index.
get_org_dict: returns a dictionary of organisms as keys and sets of genes as values.
get_genes_per_org: returns the genes belonging to the gene family in the given organism.
- Fields:
name: the name of the gene family.
ID: the internal identifier of the gene family.
removed: a boolean indicating whether the family has been removed from the main graph.
sequence: the protein sequence associated with the family.
partition: the partition associated with the family.
- add(gene: Gene)
Adds a gene to the gene family and sets the gene’s family attribute accordingly.
- Parameters:
gene – The gene to add
- Raises:
TypeError – If the provided gene is of the wrong type
- add_sequence(seq: str)
Assigns a protein sequence to the gene family.
- Parameters:
seq – The sequence to add to the gene family
- add_spot(spot: Spot)
Add the given spot to the family
- Parameters:
spot – Spot belonging to the family
- contains_gene_id(identifier)
Check whether the family already contains a gene with the given ID
- Parameters:
identifier – ID of the gene
- Returns:
True if the family contains the gene ID, False otherwise
- Raises:
TypeError – If the identifier is not a string
- duplication_ratio(exclude_fragment: bool) bool
Checks if the gene family is considered single copy based on the provided criteria.
- Parameters:
exclude_fragment – A boolean indicating whether to exclude fragments when determining single copy families.
- Returns:
A boolean indicating whether the gene family is single copy.
- property edges: Generator[Edge, None, None]
Returns all Edges that are linked to this gene family
- Returns:
Edges of the gene family
- property genes
Return all the genes belonging to the family
- Returns:
Generator of genes
- get(identifier: str) Gene
Get a gene by its name
- Parameters:
identifier – ID of the gene
- Returns:
Wanted gene
- Raises:
TypeError – If the identifier is not a string
- get_edge(target: GeneFamily) Edge
Get the edge connecting this gene family to the given neighboring gene family
- get_genes_per_org(org: Organism) Generator[Gene, None, None]
Returns the genes belonging to the gene family in the given Organism
- Parameters:
org – Organism to look for
- Returns:
Generator of the genes of the family found in the given organism
- get_org_dict() Dict[Organism, Set[Gene]]
Returns the organisms and the genes belonging to the gene family
- Returns:
A dictionary of organism as key and set of genes as values
- property has_module: bool
Check if the family is in a module
- Returns:
True if it has a module else False
- is_single_copy(dup_margin: float, exclude_fragment: bool) bool
Checks if the gene family is considered single copy based on the provided criteria.
- Parameters:
dup_margin – The maximum allowed duplication margin for a gene family to be considered single copy.
exclude_fragment – A boolean indicating whether to exclude fragments when determining single copy families.
- Returns:
A boolean indicating whether the gene family is single copy.
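One plausible reading of the criterion above, sketched on plain data: the family counts as single copy when the fraction of organisms carrying more than one copy stays below dup_margin. This rule is an assumption made for illustration and is not confirmed by the reference text; is_single_copy_sketch and its genes_per_org argument are hypothetical.

```python
from typing import Dict, List


def is_single_copy_sketch(genes_per_org: Dict[str, List[bool]],
                          dup_margin: float,
                          exclude_fragment: bool) -> bool:
    """genes_per_org maps an organism name to one is_fragment flag per gene."""
    multicopy_orgs = 0
    for fragment_flags in genes_per_org.values():
        if exclude_fragment:
            # Count only the genes that are not flagged as fragments.
            copies = sum(1 for is_fragment in fragment_flags if not is_fragment)
        else:
            copies = len(fragment_flags)
        if copies > 1:
            multicopy_orgs += 1
    # Assumed rule: single copy when the duplicated fraction is below the margin.
    return multicopy_orgs / len(genes_per_org) < dup_margin


# orgB holds two genes of the family, one of which is a fragment.
family = {"orgA": [False], "orgB": [False, True], "orgC": [False]}
strict = is_single_copy_sketch(family, dup_margin=0.05, exclude_fragment=False)
lenient = is_single_copy_sketch(family, dup_margin=0.05, exclude_fragment=True)
```

With fragments counted, orgB carries two copies and the family fails a 0.05 margin; with fragments excluded, every organism has one copy and the family passes.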
- mk_bitarray(index: Dict[Organism, int], partition: str = 'all')
Produces a bitarray representing the presence/absence of the family in the pangenome using the provided index. The bitarray is stored in the bitarray attribute and is a gmpy2.xmpz type.
- Parameters:
index – The index computed by ppanggolin.pangenome.Pangenome.getIndex()
partition – partition used to compute bitarray
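The presence/absence encoding can be illustrated with a plain Python integer standing in for the gmpy2.xmpz object. This is a sketch under stated assumptions: presence_bits is a hypothetical helper, and the index maps organism names rather than Organism objects.

```python
from typing import Dict, Iterable


def presence_bits(index: Dict[str, int], present_in: Iterable[str]) -> int:
    """Set bit i for every organism whose index value is i."""
    bits = 0
    for organism in present_in:
        bits |= 1 << index[organism]
    return bits


index = {"orgA": 0, "orgB": 1, "orgC": 2}
family_bits = presence_bits(index, ["orgA", "orgC"])
# A bitwise AND keeps only the organisms shared by two bitarrays.
shared = family_bits & presence_bits(index, ["orgC"])
# Counting the set bits gives the number of organisms with the family.
organism_count = bin(family_bits).count("1")
```

Encoding presence as bits lets set operations such as intersection and counting run as single integer operations.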
- property module: Module
Return all the modules belonging to the family
- Returns:
Generator of modules
- property named_partition: str
Reads the partition attribute and returns a meaningful name
- Returns:
The partition name of the gene family
- Raises:
ValueError – If the gene family has no partition assigned
- property neighbors: Generator[GeneFamily, None, None]
Returns all the GeneFamilies that are linked with an edge
- Returns:
Neighbors
- property number_of_edges: int
Get the number of edges for the current gene family
- property number_of_genes: int
Get the number of genes for the current gene family
- property number_of_neighbors: int
Get the number of neighbors for the current gene family
- property number_of_organisms: int
Get the number of organisms for the current gene family
- property number_of_spots: int
Get the number of spots for the current gene family
- property organisms: Generator[Organism, None, None]
Returns all the Organisms that have this gene family
- Returns:
Organisms that have this gene family
- property partition
- remove(identifier)
Remove a gene by its name
- Parameters:
identifier – Name of the gene
- Returns:
Wanted gene
- Raises:
TypeError – If the identifier is not a string
- property representative: Gene
Get the representative gene of the family
- Returns:
The representative gene of the family
- set_edge(target: GeneFamily, edge: Edge)
Set the edge between the gene family and another one
- Parameters:
target – Neighbor family
edge – Edge connecting families
ppanggolin.genetic_codes module
- ppanggolin.genetic_codes.genetic_codes(code)
ppanggolin.genome module
- class ppanggolin.genome.Contig(identifier: int, name: str, is_circular: bool = False)
Bases: MetaFeatures
Describes the contig content and some associated information.
- Methods:
genes: Returns a list of gene objects present in the contig.
add_rna: Adds an RNA object to the contig.
add_gene: Adds a gene object to the contig.
- Fields:
name: Name of the contig.
is_circular: Boolean value indicating whether the contig is circular or not.
RNAs: Set of RNA annotations present in the contig.
- TODO:
The gene getter should be based on the gene ID, and two other attributes should exist to get genes by start or position. Also, when setting a new gene in a contig, start, stop and strand should be checked for differences; maybe define an __eq__ method in the Gene class.
- property RNAs: Generator[RNA, None, None]
Return all the RNA in the contig
- Returns:
Generator of RNA
- add(gene: Gene)
Add a gene to the contig
- Parameters:
gene – Gene to add
- Raises:
TypeError – If the gene is not an instance of Gene
- add_contig_length(contig_length: int)
Add contig length to Contig object.
- Parameters:
contig_length – Length of the contig.
- Raises:
ValueError – If trying to define a contig length different from previously defined.
- add_rna(rna: RNA)
Add RNA to contig
- Parameters:
rna – RNA object to add
- Raises:
TypeError – If rna is not an instance of RNA
KeyError – Another RNA with the same ID already exists in the contig
- property families
Get the families belonging to this contig
- Returns:
families in the contig
- Return type:
Generator[GeneFamily, None, None]
- property genes: Generator[Gene, None, None]
Give the gene content of the contig
- Returns:
Generator of genes in contig
- get_by_coordinate(coordinate: Tuple[int, int, str]) Gene
Get a gene by its coordinate
- Parameters:
coordinate – Tuple containing start, stop and strand of the gene
- Returns:
The gene with the specified coordinate.
- Raises:
TypeError – If the coordinate is not of the expected type (a tuple of start, stop and strand)
- get_genes(begin: int = 0, end: int | None = None, outrange_ok: bool = False) List[Gene]
Gets a list of genes within a range of gene positions. If no arguments are given, it returns all genes.
- Parameters:
begin – Position of the first gene to retrieve
end – Position of the first gene not to retrieve (the range excludes it)
outrange_ok – If True, an end position beyond the last gene does not raise an error; all the genes from begin to the last position are returned
- Returns:
List of genes between begin and end position
- Raises:
TypeError – If begin or end is not an integer
ValueError – If begin position is greater than end position
IndexError – If end position is greater than last gene position in contig
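The range semantics and the three error cases above can be sketched with ordinary list slicing. get_genes_sketch is a hypothetical helper working on a list of gene names ordered by position; it assumes end is exclusive and is not the ppanggolin implementation.

```python
from typing import List, Optional


def get_genes_sketch(genes: List[str], begin: int = 0,
                     end: Optional[int] = None,
                     outrange_ok: bool = False) -> List[str]:
    if end is None:
        # No end given: go up to the last gene.
        end = len(genes)
    if not isinstance(begin, int) or not isinstance(end, int):
        raise TypeError("begin and end must be integers")
    if begin > end:
        raise ValueError("begin position is greater than end position")
    if end > len(genes):
        if not outrange_ok:
            raise IndexError("end position is beyond the last gene")
        # Tolerated overshoot: clamp to the last gene.
        end = len(genes)
    return genes[begin:end]


contig_genes = ["g0", "g1", "g2", "g3"]
middle = get_genes_sketch(contig_genes, 1, 3)
tail = get_genes_sketch(contig_genes, 2, 10, outrange_ok=True)
```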
- get_ordered_consecutive_genes(genes: Iterable[Gene]) List[List[Gene]]
Order the given genes considering the circularity of the contig.
- Parameters:
genes – An iterable containing genes supposed to be consecutive along the contig.
- Returns:
A list of lists containing ordered consecutive genes considering circularity.
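A minimal sketch of what ordering with circularity can mean, working on integer gene positions instead of Gene objects. The function name, the n_genes argument and the merging rule are assumptions made for illustration: positions are grouped into runs of consecutive values, and on a circular contig a run ending at the last position is joined to a run starting at position 0.

```python
from typing import List


def order_consecutive(positions: List[int], n_genes: int,
                      is_circular: bool) -> List[List[int]]:
    """Group gene positions into runs of consecutive positions."""
    runs: List[List[int]] = []
    for pos in sorted(set(positions)):
        if runs and pos == runs[-1][-1] + 1:
            runs[-1].append(pos)
        else:
            runs.append([pos])
    # On a circular contig the last gene is adjacent to the first one,
    # so a run ending at the last position continues into a run starting at 0.
    if (is_circular and len(runs) > 1
            and runs[0][0] == 0 and runs[-1][-1] == n_genes - 1):
        runs[0] = runs.pop() + runs[0]
    return runs


wrapped = order_consecutive([0, 1, 8, 9], n_genes=10, is_circular=True)
linear = order_consecutive([0, 1, 8, 9], n_genes=10, is_circular=False)
```

On the circular contig the four genes form one run that wraps around the origin; on a linear contig they stay as two separate runs.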
- property length: int | None
Get the length of the contig
- property modules
Get the modules belonging to this contig
- Returns:
Modules belonging to this contig
- Return type:
Generator[Module, None, None]
- property number_of_genes: int
Get the number of genes in the contig
- Returns:
the number of genes in the contig
- property number_of_rnas: int
Get the number of RNA in the contig
- property organism: Organism
Return organism that Feature belongs to.
- Returns:
Organism of the feature
- property regions
Get the regions belonging to this contig
- Returns:
RGP in the contig
- Return type:
Generator[Region, None, None]
- remove(position)
Remove a gene by its position
- Parameters:
position – Position of the gene in the contig
- Raises:
TypeError – Position is not an integer
- class ppanggolin.genome.Feature(identifier: str)
Bases: MetaFeatures
This is a general class representation of Gene and RNA.
- Methods:
fill_annotations: fills general annotation for child classes.
fill_parents: associates the object to an organism and a contig.
add_sequence: adds a sequence to the feature.
- Fields:
ID: Identifier of the feature given by PPanGGOLiN.
is_fragment: Boolean value indicating whether the feature is a fragment or not.
type: Type of the feature.
start: Start position of the feature.
stop: Stop position of the feature.
strand: Strand associated with the feature.
product: Associated product of the feature.
name: Name of the feature.
local_identifier: Identifier provided by the original file.
organism: Parent organism of the feature.
contig: Parent contig of the feature.
dna: DNA sequence of the feature.
- add_sequence(sequence)
Add a sequence to feature
- Parameters:
sequence – Sequence corresponding to the feature
- Raises:
AssertionError – Sequence must be a string
- fill_annotations(start: int, stop: int, strand: str, gene_type: str = '', name: str = '', product: str = '', local_identifier: str = '', coordinates: List[Tuple[int, int]] | None = None)
Fill general annotation for child classes
- Parameters:
start – Start position
stop – Stop position
coordinates – Start and stop positions as a list of tuples. The list can contain several tuples in the case of a joined gene
strand – associated strand
gene_type – Type of gene
name – Name of the feature
product – Associated product
local_identifier – Identifier provided by the original file
- Raises:
TypeError – If attribute value does not correspond to the expected type
ValueError – If strand is not ‘+’ or ‘-’
- fill_parents(organism: Organism | None = None, contig: Contig | None = None)
Associate object to an organism and a contig
- Parameters:
organism – Parent organism
contig – Parent contig
- property has_joined_coordinates: bool
Whether or not the feature has joined coordinates.
- property organism: Organism
Return organism that Feature belongs to.
- Returns:
Organism of the feature
- property overlaps_contig_edge: bool
Check, based on the coordinates of the feature, whether it appears to overlap the contig edge.
- start_relative_to(gene)
- stop_relative_to(gene)
- string_coordinates() str
Return a string representation of the coordinates
- class ppanggolin.genome.Gene(gene_id: str)
Bases: Feature
Saves a gene from the genome as an object with some information for the pangenome.
- Methods:
fill_annotations: fills general annotation for the gene object and adds additional attributes such as position and genetic code.
add_protein: adds the protein sequence corresponding to the translated gene to the object.
- Fields:
position: the position of the gene in the genome.
family: the family that the gene belongs to.
RGP: a putative Region of Genomic Plasticity that contains the gene.
genetic_code: the genetic code associated with the gene.
protein: the protein sequence corresponding to the translated gene.
- add_protein(protein: str)
Add a protein sequence corresponding to translated gene
- Parameters:
protein – Protein sequence
- Raises:
TypeError – Protein sequence must be a string
- property family
Return GeneFamily that Gene belongs to.
- Returns:
Gene family of the gene
- Return type:
GeneFamily
- fill_annotations(position: int | None = None, genetic_code: int = 0, is_partial: bool = False, frame: int = 0, **kwargs)
Fill the gene annotation provided by PPanGGOLiN dependencies
- Parameters:
position – Gene localization in genome
genetic_code – Genetic code associated to gene
is_partial – is the gene a partial gene
frame – One of ‘0’, ‘1’ or ‘2’. ‘0’ indicates that the first base of the feature is the first base of a codon, ‘1’ that the second base is the first base of a codon, and so on.
kwargs – See the parameters of Feature.fill_annotations
- Raises:
TypeError – If position or genetic code is not an integer
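The meaning of the frame parameter can be shown on a plain DNA string: the frame is the number of leading bases to skip before the first complete codon. codons_for_frame is a hypothetical helper written for this illustration and is not part of ppanggolin.

```python
from typing import List


def codons_for_frame(dna: str, frame: int) -> List[str]:
    """Split dna into codons after skipping `frame` leading bases."""
    if frame not in (0, 1, 2):
        raise ValueError("frame must be 0, 1 or 2")
    coding = dna[frame:]
    # Drop trailing bases that do not form a complete codon.
    usable = len(coding) - len(coding) % 3
    return [coding[i:i + 3] for i in range(0, usable, 3)]


frame0 = codons_for_frame("ATGAAATAG", 0)
frame1 = codons_for_frame("CATGAAATAG", 1)
```

Both calls yield the same three codons, because the extra leading base in the second sequence is skipped by frame 1. This matters mostly for partial genes, whose first base is not necessarily the start of a codon.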
- property frame: int
Get the frame of the gene
- property module
Get the modules belonging to the gene
- Returns:
get the modules linked to the gene
- Return type:
- class ppanggolin.genome.Organism(name: str)
Bases: MetaFeatures
Describes the genome content and some associated information.
Methods:
families: Returns a set of gene families present in the organism.
genes: Returns a generator to get genes in the organism.
number_of_genes: Returns the number of genes in the organism.
contigs: Returns the values in the contig dictionary from the organism.
get_contig: Gets the contig with the given identifier in the organism, adding it if it does not exist.
_create_contig: Creates a new contig object and adds it to the contig dictionary.
mk_bitarray: Produces a bitarray representing the presence/absence of gene families in the organism using the provided index.
Fields:
name: Name of the organism.
bitarray: Bitarray representing the presence/absence of gene families in the organism.
- add(contig: Contig)
Add a contig to organism
- Parameters:
contig – Contig to add to the organism
- Raises:
KeyError – If a contig with the given name already exists in the organism
- property contigs: Generator[Contig, None, None]
Generator of contigs in the organism
- Returns:
Values in contig dictionary from organism
- property families
Return the gene families present in the organism
- Returns:
Generator of gene families
- Return type:
Generator[GeneFamily, None, None]
- property genes: Generator[Gene, None, None]
Generator to get genes in the organism
- Returns:
Generator of genes
- get(name: str) Contig
Get contig with the given identifier in the organism
- Parameters:
name – Contig identifier
- Returns:
The contig with the given identifier
- group_genes_by_partition() Dict[str, Set]
Groups genes based on their family’s named partition and returns a dictionary mapping partition names to sets of genes belonging to each partition.
- Returns:
A dictionary containing sets of genes grouped by their family’s named partition.
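The grouping described above follows the usual defaultdict(set) pattern. In this sketch, group_by_partition is a hypothetical helper and each gene is reduced to a (gene_id, partition_name) pair; the partition names are sample values chosen for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def group_by_partition(genes: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """genes yields (gene_id, named partition of the gene's family) pairs."""
    groups: Dict[str, Set[str]] = defaultdict(set)
    for gene_id, partition in genes:
        groups[partition].add(gene_id)
    # Return a plain dict so that missing partitions raise KeyError.
    return dict(groups)


groups = group_by_partition([
    ("g1", "persistent"),
    ("g2", "shell"),
    ("g3", "persistent"),
    ("g4", "cloud"),
])
```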
- mk_bitarray(index: Dict[Organism, int], partition: str = 'all')
Produces a bitarray representing the presence/absence of families in the organism using the provided index. The bitarray is stored in the bitarray attribute and is a gmpy2.xmpz type.
- Parameters:
partition – Filters partition
index – The index computed by ppanggolin.pangenome.Pangenome.getIndex()
- Raises:
Exception – Partition is not recognized
- property modules
Get all the modules belonging to this genome
- Returns:
Generator of modules
- Return type:
Generator[Module, None, None]
- property number_of_contigs: int
Get number of contigs in organism
- Returns:
Number of contigs in organism
- number_of_families() int
Get the number of gene families in the organism
- Returns:
Number of gene families
- number_of_genes() int
Get number of genes in the organism
- Returns:
Number of genes
- property number_of_modules: int
Get number of modules in organism
- Returns:
Number of modules in organism
- property number_of_regions: int
Get number of RGP in organism
- Returns:
Number of RGP in organism
- number_of_rnas() int
Get number of RNAs in the organism
- Returns:
Number of RNAs
- property number_of_spots: int
Get number of spots in organism
- Returns:
Number of spots in organism
- property regions
Get all RGPs belonging to this genome
- Returns:
Generator of RGPs
- Return type:
Generator[Region, None, None]
- remove(name: str)
Remove a contig with the given identifier in the organism
- Parameters:
name – Contig identifier
ppanggolin.main module
- ppanggolin.main.cmd_line() Namespace
Manage the command-line arguments given by the user
- Returns:
arguments given and readable by PPanGGOLiN
- ppanggolin.main.main()
Run the command given by user and set / check some things.
- Returns:
ppanggolin.metadata module
- class ppanggolin.metadata.MetaFeatures
Bases: object
The MetaFeatures class provides methods to access and manipulate metadata in all ppanggolin classes.
- Methods:
metadata: Generate all metadata from all sources.
sources: Generate all metadata sources.
get_metadata: Get metadata based on attribute values.
max_metadata_by_source: Gets the source with the maximum number of metadata and the corresponding count.
- add_metadata(metadata: Metadata, metadata_id: int | None = None) None
Add metadata to metadata getter
- Parameters:
metadata – metadata value to add for the source
metadata_id – metadata identifier
- Raises:
AssertionError – If the source or the metadata is not of the correct type
- del_metadata_by_attribute(**kwargs)
Remove metadata from the feature based on one or more attribute values
- del_metadata_by_source(source: str)
Remove a source from the feature
- Parameters:
source – Name of the source to delete
- Raises:
AssertionError – If the source is not of the correct type
KeyError – If the source does not belong to the MetaFeatures object
- formatted_metadata_dict() Dict[str, List[str]]
Format metadata by combining source and field values.
Given an object with metadata, this function creates a new dictionary where the keys are formatted as ‘source_field’.
- Returns:
A dictionary with formatted metadata.
- formatted_metadata_dict_to_string(separator: str = '|') Dict[str, str]
Format metadata by combining source and field values.
Given an object with metadata, this function creates a new dictionary where the keys are formatted as ‘source_field’. In some cases, it is possible to have multiple values for the same field, in this situation, values are concatenated with the specified separator.
- Parameters:
separator – The separator used to join multiple values for the same field (default is ‘|’).
- Returns:
A dictionary with formatted metadata.
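The ‘source_field’ key format and the joining of repeated values can be sketched on plain dictionaries. format_metadata is a hypothetical helper, and the source name "db" and its fields are made-up sample data; Metadata objects are replaced by (source, fields) pairs.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def format_metadata(entries: List[Tuple[str, Dict[str, str]]],
                    separator: str = "|") -> Dict[str, str]:
    """entries is a list of (source, {field: value}) pairs."""
    collected: Dict[str, List[str]] = defaultdict(list)
    for source, fields in entries:
        for field, value in fields.items():
            # Keys combine the source and the field name.
            collected[f"{source}_{field}"].append(str(value))
    # Several values for the same key are joined with the separator.
    return {key: separator.join(values) for key, values in collected.items()}


formatted = format_metadata([
    ("db", {"name": "abc", "score": "1"}),
    ("db", {"name": "xyz"}),
])
```

Here the field name appears twice for the same source, so its two values end up joined in a single string, while score keeps its single value.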
- get_metadata(source: str, metadata_id: int | None = None) Metadata
Get metadata from metadata getter by its source and identifier
- Parameters:
source – source of the metadata
metadata_id – metadata identifier
- Raises:
KeyError – No metadata with ID or source is found
- get_metadata_by_attribute(**kwargs) Generator[Metadata, None, None]
Get metadata by one or more attribute
- Returns:
Metadata searched
- get_metadata_by_source(source: str) Dict[int, Metadata] | None
Get all the metadata feature corresponding to the source
- Parameters:
source – Name of the source to get
- Returns:
Dictionary mapping metadata identifiers to the metadata of the source, or None if there is none
- Raises:
AssertionError – If the source is not of the correct type
- has_metadata() bool
Whether the feature has any associated metadata.
- Returns:
True if it has metadata else False
- has_source(source: str) bool
Check if the source is in the metadata feature
- Parameters:
source – name of the source
- Returns:
True if the source is in the metadata feature else False
- max_metadata_by_source() Tuple[str, int]
Get the maximum number of metadata for one source
- Returns:
Name of the source with the maximum annotation and the number of metadata corresponding
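A short sketch of the selection described above, assuming metadata are stored per source in a dictionary of lists. Both the storage layout and the helper name max_by_source are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def max_by_source(metadata_by_source: Dict[str, List[str]]) -> Tuple[str, int]:
    """Return the source holding the most metadata and that count."""
    source, entries = max(metadata_by_source.items(),
                          key=lambda item: len(item[1]))
    return source, len(entries)


best_source, best_count = max_by_source({
    "source_a": ["m1"],
    "source_b": ["m1", "m2", "m3"],
})
```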
- property metadata: Generator[Metadata, None, None]
Generate metadata in gene families
- Returns:
Metadata from all sources
- property number_of_metadata: int
Get the number of metadata associated to feature
- property sources: Generator[str, None, None]
Get all metadata source in gene family
- Returns:
Metadata source
- class ppanggolin.metadata.Metadata(source: str, **kwargs)
Bases: object
The Metadata class represents metadata linked to genes, gene families, organisms, regions, spots or modules.
- Methods:
number_of_attribute: Returns the number of attributes in the Metadata object.
get: Returns the value of a specific attribute, or None if the attribute does not exist.
fields: Returns a list of all the attributes in the Metadata object.
- Fields:
source: A string representing the source of the metadata.
kwargs: A dictionary of attributes and values representing the metadata. The attributes can be any string, and the values can be any type except None or NaN.
- property fields: List[str]
Get all the fields of the metadata
- Returns:
List of the fields in the metadata
- to_dict() Dict[str, Any]
Get metadata in dict format.
ppanggolin.pangenome module
- class ppanggolin.pangenome.Pangenome
Bases: object
This is a class representing your pangenome. It is used as a basic unit for all the analyses to access the different elements of your pangenome, such as organisms, contigs, genes or gene families. It has setter and getter methods for most elements in your pangenome, and you can use those to add new elements to it, or get objects that have a specific identifier to manipulate them directly.
- property RNAs: Generator[Gene, None, None]
Generator of RNAs in the pangenome.
- Returns:
RNA generator
- add_edge(gene1: Gene, gene2: Gene) Edge
Adds an edge between the two gene families that the two given genes belong to.
- Parameters:
gene1 – The first gene
gene2 – The second gene
- Returns:
The created Edge
- Raises:
AssertionError – Genes object are expected
AttributeError – Genes are not associated to any families
- add_file(pangenome_file: Path, check_version: bool = True)
Links an HDF5 file to the pangenome. If needed elements will be loaded from this file, and anything that is computed will be saved to this file when
ppanggolin.formats.writeBinaries.writePangenome()is called.- Parameters:
pangenome_file – A string representing filepath to hdf5 pangenome file to be either used or created
check_version – Check that the ppanggolin version of the pangenome file is compatible with the version of ppanggolin currently in use.
- Raises:
AssertionError – If the pangenome_file is not an instance of the Path class
- add_gene_family(family: GeneFamily)
Adds the given gene family to the pangenome. If a family with the same name already exists, raises a KeyError.
- Parameters:
family – The gene family to add to the pangenome
- Raises:
KeyError – If a family with the same name already exists
Exception – For any unexpected exceptions
- add_module(module: Module)
Add the given module to the pangenome
- Parameters:
module – Module to add in pangenome
- Raises:
AssertionError – Error if module is not a Module object
KeyError – Error if another module with the same name exists in the pangenome
- add_organism(organism: Organism)
Adds an organism that did not exist previously in the pangenome if an Organism object is provided. If an organism with the same name exists it will raise an error. If a str object is provided, will return the corresponding organism that has this name OR create a new one if it does not exist.
- Parameters:
organism – Organism to add to the pangenome
- Raises:
AssertionError – If the organism name is not a string
KeyError – if the provided organism is already in pangenome
- add_region(region: Region)
Add a region to the pangenome
- Parameters:
region – Region to add in pangenome
- Raises:
AssertionError – Error if region is not a Region object
KeyError – Error if another Region with the same name exists in the pangenome
- add_spot(spot: Spot)
Adds the given spot to the pangenome.
- Parameters:
spot – Spot which should be added
- Raises:
AssertionError – Error if spot is not a Spot object
KeyError – Error if another Spot with the same identifier exists in the pangenome
- compute_family_bitarrays(part: str = 'all') Dict[Organism, int]
Based on the index generated by get_org_index, generate a bitarray for each gene family. If the family j is present in the organism with the index i, the bit at position i will be 1. If it is not, the bit will be 0. The bitarrays are gmpy2.xmpz object.
- Parameters:
part – Filter the organism in function of the given partition
- Returns:
The index of organisms in pangenome
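The bit layout described above can be sketched with plain Python integers. This is only an illustration: it uses int bitmasks rather than gmpy2.xmpz, and the names family_bitarrays, org_index and family_to_orgs are hypothetical, not part of the ppanggolin API.

```python
from typing import Dict, Set


def family_bitarrays(
    org_index: Dict[str, int], family_to_orgs: Dict[str, Set[str]]
) -> Dict[str, int]:
    """Build one integer bitmask per family (hypothetical helper).

    Bit i is set to 1 when the family is present in the organism
    whose index is i, and left at 0 otherwise.
    """
    bitarrays = {}
    for family, organisms in family_to_orgs.items():
        bits = 0
        for organism in organisms:
            bits |= 1 << org_index[organism]
        bitarrays[family] = bits
    return bitarrays


# Toy data: three organisms, two families.
org_index = {"orgA": 0, "orgB": 1, "orgC": 2}
presence = {"fam1": {"orgA", "orgC"}, "fam2": {"orgB"}}
arrays = family_bitarrays(org_index, presence)
print({family: bin(bits) for family, bits in arrays.items()})
```

With such a representation, the organisms shared by two families can be found with a bitwise AND of their bitmasks.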
- compute_mod_bitarrays(part: str = 'all') Dict[GeneFamily, int]
Based on the index generated by get_fam_index, generate a bitarray for each gene family present in modules. If the family j is present in the module with the index i, the bit at position i will be 1. If it is not, the bit will be 0. The bitarrays are gmpy2.xmpz objects.
- Parameters:
part – Filter the organism in function of the given partition
- Returns:
A dictionary with GeneFamily as key and int as value.
- compute_org_bitarrays(part='all') Dict[GeneFamily, int]
Based on the index generated by get_fam_index, generate a bitarray for each organism. If the family with the index j is present in the organism, the bit at position j will be 1. If it is not, the bit will be 0. The bitarrays are gmpy2.xmpz objects.
- Parameters:
part – Filter the organism in function of the given partition
- Returns:
The index of gene families in pangenome
Check if the pangenome has contig lengths unavailable
- Returns:
True if contig lengths are unavailable, False otherwise
- property edges: Generator[Edge, None, None]
Returns all the edges in the pangenome graph
- Returns:
Generator of edge
- exact_core_families() Set[GeneFamily]
Retrieves gene families considered as the exact core (present in all organisms).
- Returns:
A set containing gene families identified as the exact core.
- property gene_families: Generator[GeneFamily, None, None]
Returns all the gene families in the pangenome
- Returns:
Generator of gene families
- property genes: Generator[Gene, None, None]
Generator of genes in the pangenome.
- Returns:
gene generator
- get_contig(identifier: int | None = None, name: str | None = None, organism_name: str | None = None) Contig
Returns the contig by its identifier or by its name. If the name is given, the organism name is also needed
- Parameters:
identifier – ID of the contig to look for
name – The name of the contig to look for
organism_name – Name of the organism to which the contig belong
- Returns:
Returns the wanted contig
- Raises:
AssertionError – If the contig_id is not an integer
KeyError – If the contig is not in the pangenome
- get_elem_by_metadata(metatype: str, **kwargs) Generator[GeneFamily | Gene | Organism | Region | Spot | Module, None, None]
Get the pangenome elements whose metadata match the expected attributes
- Parameters:
metatype – The type of pangenome element whose metadata is searched
kwargs – attributes to identify metadata
- Returns:
Elements with matching metadata
- get_elem_by_source(source: str, metatype: str) Generator[GeneFamily | Gene | Contig | Organism | Region | Spot | Module, None, None]
Get the pangenome elements that have metadata from a specific source
- Parameters:
source – Name of the source
metatype – The type of pangenome element whose metadata is searched
- Returns:
Elements with metadata from the source
- get_fam_index() Dict[GeneFamily, int]
Creates an index for gene families (each family is assigned an Integer).
- Returns:
The index of families in pangenome
- get_gene(gene_id: str) Gene
Returns the gene that has the given gene ID
- Parameters:
gene_id – The gene ID to look for
- Returns:
Returns the gene that has the ID gene_id
- Raises:
AssertionError – If the gene_id is not a string
KeyError – If the gene_id is not in the pangenome
- get_gene_family(name: str) GeneFamily
Returns the gene family that has the given name
- Parameters:
name – The gene family name to look for
- Returns:
Returns the gene family that has the name name
- Raises:
AssertionError – If the name is not a string
KeyError – If the name is not corresponding to any family in the pangenome
- get_module(module_id: int | str) Module
Returns the module that has the given module ID.
- Parameters:
module_id – The module ID to look for. It can be an integer or a string in the format ‘module_<integer>’.
- Returns:
The module with the specified ID.
- Raises:
KeyError – If the module ID does not exist in the pangenome.
ValueError – If the provided module ID does not have the expected format.
- get_multigenics(dup_margin: float, persistent: bool = True) Set[GeneFamily]
Returns the multigenic persistent families of the pangenome graph. A family will be considered multigenic if it is duplicated in more than dup_margin of the genomes where it is present.
- Parameters:
dup_margin – The ratio of presence in multicopy above which a gene family is considered multigenic
persistent – if we consider only the persistent genes
- Returns:
Set of gene families considered multigenic
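The multigenic criterion stated above can be sketched as follows. This is a minimal sketch based only on the description, not the library's implementation; is_multigenic and copies_per_genome are hypothetical names, and the strict "more than" comparison follows the wording of the description.

```python
from typing import Dict


def is_multigenic(copies_per_genome: Dict[str, int], dup_margin: float) -> bool:
    """Return True when a family is duplicated in more than `dup_margin`
    of the genomes where it is present (hypothetical helper)."""
    present = [count for count in copies_per_genome.values() if count > 0]
    if not present:
        return False
    duplicated = sum(1 for count in present if count > 1)
    return duplicated / len(present) > dup_margin


# The family is present in three genomes and duplicated in one of them.
copies = {"orgA": 2, "orgB": 1, "orgC": 1, "orgD": 0}
print(is_multigenic(copies, dup_margin=0.05))
print(is_multigenic(copies, dup_margin=0.5))
```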
- get_org_index() Dict[Organism, int]
Creates an index for Organisms (each organism is assigned an Integer).
- Returns:
The index of organisms in pangenome
- get_organism(name: str) Organism
Get an organism that is expected to be in the pangenome using its name, which is supposedly unique. Raises an error if the organism does not exist.
- Parameters:
name – Name of the Organism to get
- Returns:
The related Organism object
- Raises:
AssertionError – If the organism name is not a string
KeyError – If the provided name is not an organism in the pangenome
- get_region(name: str) Region
Returns the region with the given name.
- Parameters:
name – The name of the region to return
- Returns:
The region
- Raises:
AssertionError – If the RGP name is not a string
KeyError – If the provided name is not a RGP in the pangenome
- get_single_copy_persistent_families(dup_margin: float, exclude_fragments: bool) Set[GeneFamily]
Retrieves gene families that are both persistent and single copy based on the provided criteria.
- Parameters:
dup_margin – The maximum allowed duplication margin for a gene family to be considered single copy.
exclude_fragments – A boolean indicating whether to exclude fragments when determining single copy families.
- Returns:
A set containing gene families that are both persistent and single copy.
- get_spot(spot_id: int | str) Spot
Returns the spot that has the given spot ID.
- Parameters:
spot_id – The spot ID to look for. It can be an integer or a string in the format ‘spot_<integer>’.
- Returns:
The spot with the specified ID.
- Raises:
KeyError – If the spot ID does not exist in the pangenome.
ValueError – If the provided spot ID does not have the expected format.
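The two accepted identifier forms (an integer, or a string such as 'spot_<integer>') could be normalised along these lines. This is a sketch under assumptions: parse_prefixed_id is a hypothetical helper, and whether the library accepts other string forms is not stated here.

```python
from typing import Union


def parse_prefixed_id(identifier: Union[int, str], prefix: str) -> int:
    """Convert 12 or '<prefix>_12' into the integer 12 (hypothetical helper).

    Raises ValueError when the string does not follow '<prefix>_<integer>'.
    """
    if isinstance(identifier, int):
        return identifier
    marker = prefix + "_"
    if identifier.startswith(marker):
        suffix = identifier[len(marker):]
        if suffix.isdecimal():
            return int(suffix)
    raise ValueError(
        f"{identifier!r} does not have the expected format '{prefix}_<integer>'"
    )


print(parse_prefixed_id("spot_12", "spot"))
print(parse_prefixed_id(7, "spot"))
```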
- has_metadata() bool
Whether or not the pangenome has metadata associated with any of its elements.
- property max_fam_id
Get the last family identifier
- metadata(metatype: str) Generator[Metadata, None, None]
Create a generator with all the metadata in the pangenome
- Parameters:
metatype – Select the pangenome element for which metadata should be generated
- Returns:
Generator of metadata
- metadata_sources(metatype: str) Set[str]
Returns all the metadata sources in the pangenome
- Parameters:
metatype – Select to which pangenome element metadata should be searched
- Returns:
Set of metadata source
- Raises:
AssertionError – Error if metatype is not a string
- property number_of_contigs: int
Returns the number of contigs present in the pangenome
- Returns:
The number of contigs
- property number_of_edges: int
Returns the number of edges present in the pangenome
- Returns:
The number of edges
- property number_of_gene_families: int
Returns the number of gene families present in the pangenome
- Returns:
The number of gene families
- property number_of_genes: int
Returns the number of genes present in the pangenome
- Returns:
The number of genes
- property number_of_modules: int
Returns the number of modules present in the pangenome
- Returns:
The number of modules
- property number_of_organisms: int
Returns the number of organisms present in the pangenome
- Returns:
The number of organisms
- property number_of_rgp: int
Returns the number of regions of genomic plasticity (RGP) present in the pangenome
- Returns:
The number of RGP
- property number_of_rnas: int
Returns the number of RNAs present in the pangenome
- Returns:
The number of RNAs
- property number_of_spots: int
Returns the number of spots present in the pangenome
- Returns:
The number of spots
- property organisms: Generator[Organism, None, None]
Returns all the organisms in the pangenome
- Returns:
Generator of ppanggolin.genome.Organism
- property regions: Generator[Region, None, None]
Returns all the regions (RGP) in the pangenome
- Returns:
Generator of RGP
- select_elem(metatype: str)
Get all the elements for the given metatype
- Parameters:
metatype – Name of the pangenome component to retrieve
- Returns:
All elements from pangenome for the metatype
- Raises:
AssertionError – Error if metatype is not a string
KeyError – Error if metatype is not recognized
- soft_core_families(soft_core_threshold: float) Set[GeneFamily]
Retrieves gene families considered part of the soft core based on the provided threshold.
- Parameters:
soft_core_threshold – The threshold to determine the minimum fraction of organisms required for a gene family to be considered part of the soft core.
- Returns:
A set containing gene families identified as part of the soft core.
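The threshold logic can be illustrated with a short sketch. It assumes that a family belongs to the soft core when the fraction of organisms containing it is at least the threshold; soft_core and family_to_orgs are hypothetical names, not the library's API. A threshold of 1.0 then yields the exact core.

```python
from typing import Dict, Set


def soft_core(
    family_to_orgs: Dict[str, Set[str]], n_organisms: int, threshold: float
) -> Set[str]:
    """Families present in at least `threshold` of the organisms
    (hypothetical helper)."""
    return {
        family
        for family, organisms in family_to_orgs.items()
        if len(organisms) / n_organisms >= threshold
    }


# Toy pangenome with four organisms.
families = {
    "fam1": {"A", "B", "C", "D"},
    "fam2": {"A", "B", "C"},
    "fam3": {"D"},
}
print(sorted(soft_core(families, n_organisms=4, threshold=0.75)))
print(sorted(soft_core(families, n_organisms=4, threshold=1.0)))
```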
ppanggolin.region module
- class ppanggolin.region.GeneContext(gc_id: int, families: Set[GeneFamily] | None = None, families_of_interest: Set[GeneFamily] | None = None)
Bases:
object
Represents a gene context, which is a collection of gene families related to a specific genomic context.
- Methods:
families: Generator that yields all the gene families in the gene context.
add_context_graph: Add a context graph corresponding to the gene context.
add_family: Add a gene family to the gene context.
- Fields:
gc_id: The identifier of the gene context.
graph: The context graph corresponding to the gene context.
- add_family(family: GeneFamily)
Add a gene family to the gene context.
- Parameters:
family – The gene family to add.
- property families: Generator[GeneFamily, None, None]
Generator of the family in the context
- Returns:
Gene families belonging to the context
- property graph
- class ppanggolin.region.Module(module_id: int, families: set | None = None)
Bases:
MetaFeatures
The Module class represents a module in a pangenome analysis.
- Fields:
ID: An integer identifier for the module.
bitarray: A bitarray representing the presence/absence of the gene families in an organism.
- Methods:
families: Returns a generator that yields the gene families in the module.
mk_bitarray: Generates a bitarray representing the presence/absence of the gene families in an organism using the provided index.
- add(family: GeneFamily)
Add a family to the module. A more readable alias for setitem
- Parameters:
family – Family to add to the module
- Raises:
TypeError – Family is not an instance of GeneFamily
- property families: Generator[GeneFamily, None, None]
Generator of the family in the module
- Returns:
Families belonging to the module
- get(name: str) GeneFamily
Get a family by its name. Alias more readable for getitem
- Parameters:
name – Name of the family
- Returns:
Wanted family
- mk_bitarray(index: Dict[GeneFamily, int], partition: str = 'all')
Produces a bitarray representing the presence / absence of families in the organism using the provided index. The bitarray is stored in the bitarray attribute and is a gmpy2.xmpz type.
- Parameters:
partition – filter module by partition
index – The index computed by ppanggolin.pangenome.Pangenome.get_fam_index()
- property organisms: Generator[Organism, None, None]
Returns all the Organisms that have this module
- Returns:
Organisms that have this module
- remove(name: str)
Remove a family by its name. Alias more readable for delitem
- Parameters:
name – Name of the family
- class ppanggolin.region.Region(name: str)
Bases:
MetaFeatures
The ‘Region’ class represents a region of genomic plasticity.
- Methods:
‘genes’: the property that generates the genes in the region as they are ordered in contigs.
‘families’: the property that generates the gene families in the region.
‘length’: the property that gets the length of the region.
‘organism’: the property that gets the organism linked to the region.
‘contig’: the property that gets the starter contig linked to the region.
‘is_whole_contig’: the property that indicates if the region is an entire contig.
‘is_contig_border’: the property that indicates if the region is bordering a contig.
‘get_rnas’: the method that gets the RNA in the region.
‘get_bordering_genes’: the method that gets the genes bordering the region.
- Fields:
‘name’: the name of the region.
‘score’: the score of the region.
‘starter’: the first gene in the region.
‘stopper’: the last gene in the region.
- property contig: Contig
Get the starter contig linked to the RGP
- Returns:
Contig corresponding to the region
- property coordinates: List[Tuple[int]]
Return the coordinates of the region
- Returns:
coordinates of the region
- property families: Generator[GeneFamily, None, None]
Get the gene families in the RGP
- Returns:
Gene families
- property genes: Generator[Gene, None, None]
Generate the genes as they are ordered in contigs
- Returns:
Genes in the region
- get(position: int) Gene
Get a gene by its position
- Parameters:
position – Position of the gene in the contig
- Returns:
Wanted gene
- Raises:
TypeError – Position is not an integer
- get_bordering_genes(n: int, multigenics: Set[GeneFamily], return_only_persistents: bool = True) List[List[Gene], List[Gene]]
Get the genes bordering the region. Find the n persistent and single copy genes bordering the region. If return_only_persistents is False, the method returns all genes included between the n single copy and persistent genes.
- Parameters:
n – Number of genes to get
multigenics – pangenome graph multigenic persistent families
return_only_persistents – If True, return only the non-multigenic persistent genes bordering the region. If False, return all genes included between the borders made of n persistent and single copy genes around the region.
- Returns:
A list of bordering genes in start and stop position
- get_ordered_genes() List[Gene]
Get ordered genes of the region, taking into account the circularity of contigs.
- Returns:
A list of genes ordered by their positions in the region.
- id_counter = 0
- identify_rgp_last_and_first_genes()
Identify first and last genes of the rgp by taking into account the circularity of contigs.
Set the attributes _starter: first gene of the region and _stopper: last gene of the region and _coordinates
- property is_contig_border: bool
Indicates if the region is bordering a contig
- Returns:
True if bordering else False
- Raises:
AssertionError – No genes in the regions, it’s not expected
- property is_whole_contig: bool
Indicates if the region is an entire contig
- Returns:
True if whole contig else False
- property length
Get the length of the region
- Returns:
Size of the region
- property modules: Set[Module]
Get the modules of gene families in the RGP
- Returns:
Modules found in families of the RGP
- property number_of_families: int
Get the number of different gene families in the region
- Returns:
Number of families
- property organism: Organism
Get the Organism linked to the RGP
- Returns:
Organism corresponding to the region
- property overlaps_contig_edge: bool
- remove(position)
Remove a gene by its position
- Parameters:
position – Position of the gene in the contig
- Raises:
TypeError – Position is not an integer
- property start: int
Get the start position of the starter gene of the RGP
- Returns:
start position in the contig of the first gene of the RGP
- property starter: Gene
Return the first gene of the region. If this gene has not been identified yet, it is identified first.
- Returns:
first gene of the region
- property stop: int
Get the stop position of the stopper gene of the RGP
- Returns:
stop position in the contig of the last gene of the RGP
- property stopper: Gene
Return the last gene of the region. If this gene has not been identified yet, it is identified first.
- Returns:
last gene of the region
- string_coordinates() str
Return a string representation of the coordinates
- class ppanggolin.region.Spot(spot_id: int)
Bases:
MetaFeatures
The ‘Spot’ class represents a spot of insertion, which groups regions of genomic plasticity (RGP).
- Methods:
‘regions’: the property that generates the regions in the spot.
‘families’: the property that generates the gene families in the spot.
‘spot_2_families’: add to Gene Families a link to spot.
‘borders’: Extracts all the borders of all RGPs belonging to the spot
‘get_uniq_to_rgp’: Get dictionary with a representative RGP as key, and all identical RGPs as value
‘get_uniq_ordered_set’: Get an Iterable of all the unique syntenies in the spot
‘get_uniq_content’: Get an Iterable of all the unique rgp (in terms of gene family content) in the spot
‘count_uniq_content’: Get a counter of uniq RGP and number of identical RGP (in terms of gene family content)
‘count_uniq_ordered_set’: Get a counter of uniq RGP and number of identical RGP (in terms of synteny content)
- Fields:
‘ID’: Identifier of the spot
- add(region: Region)
Add a region to the spot. Alias more readable for setitem
- Parameters:
region – Region to add in the spot
- Raises:
TypeError – Region is not an instance Region
- borders(set_size: int, multigenics) List[List[int, List[GeneFamily], List[GeneFamily]]]
Extracts all the borders of all RGPs belonging to the spot
- Parameters:
set_size – Number of genes to get
multigenics – pangenome graph multigenic persistent families
- Returns:
Families bordering the spot
- count_uniq_content() dict
Get a counter of uniq RGP and number of identical RGP (in terms of gene family content)
- Returns:
Dictionary with a representative rgp as the key and number of identical rgp as value
- count_uniq_ordered_set()
Get a counter of uniq RGP and number of identical RGP (in terms of synteny content)
- Returns:
Dictionary with a representative rgp as the key and number of identical rgp as value
- property families: Generator[GeneFamily, None, None]
Get the gene families in the spot
- Returns:
Families in the spot
- get(name: str) Region
Get a region by its name. Alias more readable for getitem
- Parameters:
name – Name of the region
- Returns:
Wanted region
- get_uniq_content() Set[Region]
Get an Iterable of all the unique rgp (in terms of gene family content) in the spot
- Returns:
Iterable of all the unique rgp (in terms of gene family content) in the spot
- get_uniq_ordered_set() Set[Region]
Get an Iterable of all the unique syntenies in the spot
- Returns:
Iterable of all the unique syntenies in the spot
- get_uniq_to_rgp() Dict[Region, Set[Region]]
Get dictionary with a representative RGP as the key, and all identical RGPs as value
- Returns:
Dictionary with a representative RGP as the key, and set of identical RGPs as value
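The idea of grouping identical RGPs behind a representative can be sketched as below. This is an illustration on plain names and lists, not the library's implementation: group_identical and rgp_to_families are hypothetical, the first RGP encountered is arbitrarily taken as the representative, and the representative is included in its own group by assumption. Comparing by content ignores gene order, while comparing by synteny keeps it.

```python
from collections import defaultdict
from typing import Dict, List, Set


def group_identical(
    rgp_to_families: Dict[str, List[str]], ordered: bool = False
) -> Dict[str, Set[str]]:
    """Group RGP names sharing the same families (hypothetical helper).

    ordered=False compares family content only; ordered=True also
    requires the same family order (synteny).
    """
    groups = defaultdict(list)
    for name, families in rgp_to_families.items():
        key = tuple(families) if ordered else frozenset(families)
        groups[key].append(name)
    # The first RGP seen in each group acts as the representative.
    return {names[0]: set(names) for names in groups.values()}


rgps = {"rgp1": ["A", "B"], "rgp2": ["B", "A"], "rgp3": ["A", "C"]}
by_content = group_identical(rgps)
by_synteny = group_identical(rgps, ordered=True)
print(by_content)
print(by_synteny)
```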
- property number_of_families: int
Get the number of different families in the spot
- Returns:
Number of families
- property regions: Generator[Region, None, None]
Generates the regions in the spot
- Returns:
Regions in the spot
- remove(name: str)
Remove a region by its name. Alias more readable for delitem
- Parameters:
name – Name of the region
- spot_2_families()
Add to Gene Families a link to spot
ppanggolin.utils module
- ppanggolin.utils.add_common_arguments(subparser: ArgumentParser)
Add common argument to the input subparser.
- Parameters:
subparser – A subparser object from any subcommand.
- ppanggolin.utils.add_gene(obj, gene, fam_split: bool = True)
- Parameters:
obj –
gene –
fam_split –
- ppanggolin.utils.check_config_consistency(config: dict, workflow_steps: list)
Check that the same parameter used in different subcommand inside a workflow has the same value.
If not, the function emits a warning through logging.getLogger(“PPanGGOLiN”).
- Params config_dict:
config dict with as key the section of the config file and as value another dict pairing name and value of parameters.
- Params workflow_steps:
list of subcommand names used in the workflow execution.
- ppanggolin.utils.check_input_files(file: Path, check_tsv: bool = False)
Checks if the provided input files exist and are of the proper format
- Parameters:
file – Path to the file
check_tsv – Allow checking tsv file for annotation or fasta list
- ppanggolin.utils.check_log(log_file: str) TextIO
Check if the output log is writable
- Parameters:
log_file – Path to the log output
- Returns:
output for log
- ppanggolin.utils.check_option_workflow(args)
Check if the given argument to a workflow command is usable
- Parameters:
args – list of arguments
- ppanggolin.utils.check_tools_availability(tool_to_description: Dict[str, str] | List[str]) dict[str, bool]
Check if the given command-line tools are available in the system’s PATH.
- Parameters:
tool_to_description – A dictionary where keys are tool names and values are descriptions of their purpose, or a list of tool names.
- Returns:
A dictionary with tool names as keys and boolean values indicating availability.
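A check of this kind can be written with the standard library. The sketch below is an assumption about how such a lookup might be done, not the library's code; tools_available is a hypothetical name, and the tool name in the example is deliberately one that should not exist.

```python
import shutil
from typing import Dict, Iterable


def tools_available(tools: Iterable[str]) -> Dict[str, bool]:
    """Map each tool name to True when an executable with that name
    is found on the PATH (hypothetical helper).

    Iterating over a dict yields its keys, so both a list of names and
    a name-to-description dict are accepted.
    """
    return {tool: shutil.which(tool) is not None for tool in tools}


status = tools_available({"definitely-not-a-real-tool-xyz": "placeholder tool"})
print(status)
```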
- ppanggolin.utils.check_translation_table_to_use(pangenome: Pangenome, is_user_specified: bool, user_translation_table: int) int
Determine the translation table to use based on what has been used previously and user input.
This function implements the following priority logic:
1. If the user explicitly specified a translation table, it takes precedence (with a warning if it differs from previous usage)
2. If a translation table was used previously in the pangenome, use it
3. Otherwise, use the default/user-provided translation table
- Parameters:
pangenome – Pangenome object containing information from previous analysis steps, including the translation table used if available
is_user_specified – Whether the translation table was explicitly specified by the user
user_translation_table – The translation table value provided by the user (default or explicitly specified)
- Returns:
The translation table to use for translation
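The priority logic can be sketched as a small pure function. This is only a sketch: choose_translation_table is a hypothetical name, and the previously used table is passed directly as previous_table (None when unavailable) instead of being read from a Pangenome object.

```python
import logging
from typing import Optional


def choose_translation_table(
    previous_table: Optional[int], is_user_specified: bool, user_table: int
) -> int:
    """Apply the priority: explicit user choice, then previous usage,
    then the default value (hypothetical helper)."""
    if is_user_specified:
        if previous_table is not None and previous_table != user_table:
            logging.getLogger("PPanGGOLiN").warning(
                "Translation table %s differs from the one used previously (%s).",
                user_table,
                previous_table,
            )
        return user_table
    if previous_table is not None:
        return previous_table
    return user_table


print(choose_translation_table(None, False, 11))
print(choose_translation_table(4, False, 11))
print(choose_translation_table(4, True, 11))
```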
- ppanggolin.utils.check_tsv_sanity(tsv_file: Path)
Check if the given TSV file is readable for the next PPanGGOLiN step.
- Parameters:
tsv_file – Path to the TSV containing organism information.
- Raises:
ValueError – If the file format is incorrect or contains invalid genome names.
- ppanggolin.utils.check_version_compatibility(file_version: str) None
Checks the compatibility of the provided pangenome file version with the current PPanGGOLiN version.
- Parameters:
file_version – A string representing the version of the pangenome file.
- ppanggolin.utils.combine_args(args: Namespace, another_args: Namespace)
Combine two args object.
- Parameters:
args – initial arguments.
another_args – another args
- Returns:
object with combined arguments
- ppanggolin.utils.connected_components(g: Graph, removed: set, weight: float)
Yields subgraphs of each connected component you get when filtering edges based on the given weight.
- Parameters:
g – Subgraph
removed – removed node
weight – threshold to remove node or not
- ppanggolin.utils.create_tmpdir(main_dir, basename='tmpdir', keep_tmp=False)
- ppanggolin.utils.delete_unspecified_args(args: Namespace)
Delete argument from the given argparse.Namespace with None values.
- Parameters:
args – arguments to filter.
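Removing unset arguments from an argparse.Namespace might look like the sketch below. It assumes in-place modification; drop_none_values is a hypothetical name and not the library's function.

```python
from argparse import Namespace


def drop_none_values(args: Namespace) -> Namespace:
    """Delete every attribute whose value is None (hypothetical helper)."""
    # Copy the items first, because attributes are deleted while looping.
    for name, value in list(vars(args).items()):
        if value is None:
            delattr(args, name)
    return args


args = drop_none_values(Namespace(cpu=4, output=None, force=False))
print(args)
```

Note that False and 0 are kept, since only None marks an unspecified argument in this sketch.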
- ppanggolin.utils.detect_filetype(filename: Path) str
Detects whether the current file is gff3, gbk/gbff, fasta, tsv or unknown. If unknown, it will raise an error
- Parameters:
filename – path to file
- Returns:
current file type
- ppanggolin.utils.erase_default_value(parser: ArgumentParser)
Remove default action in the given list of argument parser actions.
This is done to distinguish specified arguments.
- Params parser:
An argparse.ArgumentParser object with default values to erase.
- ppanggolin.utils.extract_contig_window(contig_size: int, positions_of_interest: Iterable[int], window_size: int, is_circular: bool = False)
Extracts contiguous windows around positions of interest within a contig.
- Parameters:
contig_size – Number of genes in contig.
positions_of_interest – An iterable containing the positions of interest.
window_size – The size of the window to extract around each position of interest.
is_circular – Indicates if the contig is circular.
- Returns:
Yields tuples representing the start and end positions of each contiguous window.
- ppanggolin.utils.find_consecutive_sequences(sequence: List[int]) List[List[int]]
Find consecutive sequences in a list of integers.
- Parameters:
sequence – The input list of integers.
- Returns:
A list of lists containing consecutive sequences of integers.
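A possible way to split integers into consecutive runs is sketched below. consecutive_runs is a hypothetical name, and sorting the input first is an assumption; the library function may expect already sorted input.

```python
from typing import List


def consecutive_runs(sequence: List[int]) -> List[List[int]]:
    """Split integers into runs where each value follows the previous
    one by exactly 1 (hypothetical helper)."""
    runs: List[List[int]] = []
    for value in sorted(sequence):
        if runs and value == runs[-1][-1] + 1:
            runs[-1].append(value)
        else:
            runs.append([value])
    return runs


print(consecutive_runs([1, 2, 3, 7, 8, 10]))
```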
- ppanggolin.utils.find_region_border_position(region_positions: List[int], contig_gene_count: int) Tuple[int, int]
Find the start and stop integers of the region considering circularity of the contig.
- Parameters:
region_positions – List of positions that belong to the region.
contig_gene_count – Number of genes in the contig. The contig is considered circular.
- Returns:
A tuple containing the start and stop integers of the region.
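The effect of circularity on the borders can be illustrated as follows. This is a sketch under assumptions, not the library's algorithm: region_borders is a hypothetical name, positions are assumed to be 0-based gene indices, and a region may wrap around the contig end at most once.

```python
from typing import List, Tuple


def region_borders(region_positions: List[int], contig_gene_count: int) -> Tuple[int, int]:
    """Return (start, stop) of a region on a circular contig
    (hypothetical helper)."""
    positions = sorted(set(region_positions))
    if not positions:
        raise ValueError("The region has no position.")
    # Indices where the sorted positions stop being consecutive.
    gaps = [
        i for i in range(1, len(positions)) if positions[i] != positions[i - 1] + 1
    ]
    if not gaps:
        return positions[0], positions[-1]
    wraps = positions[0] == 0 and positions[-1] == contig_gene_count - 1
    if len(gaps) == 1 and wraps:
        # The region starts after the gap and ends just before it.
        return positions[gaps[0]], positions[gaps[0] - 1]
    raise ValueError("Positions do not form a single consecutive region.")


print(region_borders([3, 4, 5], contig_gene_count=10))
print(region_borders([0, 1, 8, 9], contig_gene_count=10))
```

In the second call the region overlaps the contig edge, so it starts at position 8 and stops at position 1.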
- ppanggolin.utils.flatten_nested_dict(nested_dict: Dict[str, Dict | int | str | float]) Dict[str, int | str | float]
Flattens a nested dictionary into a flat dictionary by concatenating keys at different levels.
- Parameters:
nested_dict – The nested dictionary to be flattened.
- Returns:
A flat dictionary with concatenated keys.
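Flattening by key concatenation can be sketched recursively. The function name flatten and the "_" separator are assumptions made for this illustration; the separator used by the library is not stated here.

```python
from typing import Any, Dict


def flatten(nested: Dict[str, Any], parent_key: str = "", sep: str = "_") -> Dict[str, Any]:
    """Flatten nested dictionaries by joining keys with `sep`
    (hypothetical helper)."""
    flat: Dict[str, Any] = {}
    for key, value in nested.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


config = {"cluster": {"identity": 0.8, "mode": {"fast": 1}}, "cpu": 4}
flat_config = flatten(config)
print(flat_config)
```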
- ppanggolin.utils.get_arg_name(arg_val: str | TextIOWrapper) str | TextIOWrapper
Returns the name of a file if the argument is a TextIOWrapper object, otherwise returns the argument value.
- Parameters:
arg_val – Either a string or a TextIOWrapper object.
- Returns:
Either a string or a TextIOWrapper object, depending on the type of the input argument.
- ppanggolin.utils.get_arg_names_from_namespace(args_namespace: Namespace) List[str]
Extract all argument names from a Namespace object, excluding private attributes.
- Parameters:
args_namespace – An argparse Namespace object
- Returns:
List of argument names (excluding those starting with ‘_’)
- ppanggolin.utils.get_args_differing_from_default(default_args: Namespace, final_args: Namespace, param_to_ignore: List[str] | Set[str] | None = None) dict
Get the parameters that have different value than default values.
- Params default_args:
default arguments
- Params final_args:
final arguments to compare with default
- Params param_to_ignore:
list of params to ignore.
- Returns:
A dict with param that differ from default as key and the final value of the param as value
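The comparison with default values might be sketched like this. differing_args is a hypothetical name, and treating a parameter that is absent from the defaults as differing is an assumption of the sketch.

```python
from argparse import Namespace
from typing import Any, Dict, Iterable, Optional

_MISSING = object()  # marks parameters that have no default value


def differing_args(
    default_args: Namespace,
    final_args: Namespace,
    param_to_ignore: Optional[Iterable[str]] = None,
) -> Dict[str, Any]:
    """Return the final values that differ from their defaults
    (hypothetical helper)."""
    ignored = set(param_to_ignore or [])
    defaults = vars(default_args)
    return {
        name: value
        for name, value in vars(final_args).items()
        if name not in ignored and defaults.get(name, _MISSING) != value
    }


defaults = Namespace(cpu=1, output="out", verbose=1)
final = Namespace(cpu=8, output="out", verbose=2)
changed = differing_args(defaults, final, param_to_ignore=["verbose"])
print(changed)
```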
- ppanggolin.utils.get_cli_args(subparser_fct: Callable) Namespace
Parse command line arguments using the specified parsing function.
- Params subparser_fct:
Subparser function to use. This subparser gives the expected arguments for the subcommand.
- ppanggolin.utils.get_config_args(subcommand: str, subparser_fct: Callable, config_dict: dict, config_section: str, expected_params: List[str] | Set[str], strict_config_check: bool) Namespace
Parsing parameters of a specific section of the config file.
If some parameters are not specified in the config, they are not added to the args object.
- Params subcommand:
Name of the ppanggolin subcommand.
- Params subparser_fct:
Subparser function to use. This subparser gives the expected arguments for the subcommand.
- Params config_dict:
config dict with as key the section of the config file and as value another dict pairing name and value of parameters.
- Params config_section:
Which section to parse in config file.
- Params expected_params:
List of arguments to expect in the parser. If the parser has other arguments, these arguments are filtered out.
- Params strict_config_check:
If set to true, an error is raised when a parameter found in the config is not in the expected_params list.
- Return args:
Arguments parsed from the config
- ppanggolin.utils.get_consecutive_region_positions(region_positions: List[int], contig_gene_count: int) List[List[int]]
Order integers position of the region considering circularity of the contig.
- Parameters:
region_positions – List of positions that belong to the region.
contig_gene_count – Number of genes in the contig. The contig is considered circular.
- Returns:
An ordered list of integers of the region.
- Raises:
ValueError – If unexpected conditions are encountered.
- ppanggolin.utils.get_default_args(subcommand: str, subparser_fct: Callable, unwanted_args: list | None = None) Namespace
Get default value for the arguments for the given subparser function.
- Params subcommand:
Name of the ppanggolin subcommand.
- Params subparser_fct:
Subparser function to use. This subparser gives the expected arguments for the subcommand.
- Params unwanted_args:
List of arguments to filter out.
- Return args:
arguments with default values.
- ppanggolin.utils.get_major_version(version: str) int
Extracts the major version number from a version string.
- Parameters:
version – A string representing the version number.
- Returns:
The major version extracted from the input version string.
- Raises:
ValueError – If the input version does not have the expected format.
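Extracting a major version number can be sketched as below, assuming a dot-separated version string such as "2.0.5". major_version is a hypothetical name, and the exact format accepted by the library is not specified here.

```python
def major_version(version: str) -> int:
    """Return the integer before the first dot (hypothetical helper).

    Raises ValueError when that part is not an integer.
    """
    try:
        return int(version.split(".")[0])
    except ValueError:
        raise ValueError(
            f"Version {version!r} does not have the expected format."
        ) from None


print(major_version("2.0.5"))
```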
- ppanggolin.utils.get_subcommand_parser(subparser_fct: Callable, name: str = '') Tuple[_SubParsersAction, ArgumentParser]
Get subcommand parser object using the given subparser function.
Common arguments are also added to the parser object.
- Params subparser_fct:
- Name:
Name of section to add more info in the parser in case of error.
- Returns:
The parser and subparser objects
- ppanggolin.utils.has_non_ascii(string_to_test: str | Collection[str]) bool
Check if a string or a collection of strings contains any non-ASCII characters.
- Parameters:
string_to_test – A single string or a collection of strings to check.
- Returns:
True if any string contains non-ASCII characters, False otherwise.
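Such a check can rely on str.isascii(), available since Python 3.7. The sketch below is one possible approach, with contains_non_ascii as a hypothetical name.

```python
from typing import Collection, Union


def contains_non_ascii(value: Union[str, Collection[str]]) -> bool:
    """Return True if the string, or any string of the collection,
    holds a non-ASCII character (hypothetical helper)."""
    strings = [value] if isinstance(value, str) else value
    return any(not text.isascii() for text in strings)


print(contains_non_ascii("genome_1"))
print(contains_non_ascii(["genome_1", "génome_2"]))
```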
- ppanggolin.utils.is_compressed(file_or_file_path: Path | BinaryIO | TextIOWrapper | TextIO) Tuple[bool, str | None]
Detects if a file is compressed based on its file signature.
- Parameters:
file_or_file_path – The file to check.
- Returns:
True if the file is a recognized compressed format with the format name, False otherwise.
- Raises:
TypeError – If the file type is not supported.
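Detection by file signature compares the first bytes of a file with known magic numbers. The sketch below works on a bytes header only; sniff_compression, the SIGNATURES table and the returned format names are assumptions of this illustration, and the set of formats recognised by the library is not listed here.

```python
import bz2
import gzip
from typing import Optional, Tuple

# Well-known magic numbers found at the start of compressed files.
SIGNATURES = {
    b"\x1f\x8b": "gzip",
    b"BZh": "bz2",
    b"PK\x03\x04": "zip",
}


def sniff_compression(header: bytes) -> Tuple[bool, Optional[str]]:
    """Return (True, format name) when the header starts with a known
    signature, else (False, None) (hypothetical helper)."""
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return True, name
    return False, None


print(sniff_compression(gzip.compress(b"ACGT")))
print(sniff_compression(bz2.compress(b"ACGT")))
print(sniff_compression(b">seq1\nACGT\n"))
```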
- ppanggolin.utils.jaccard_similarities(mat: csc_matrix, jaccard_similarity_th) csc_matrix
Compute the Jaccard similarities.
- Parameters:
mat –
jaccard_similarity_th – threshold
- Returns:
- ppanggolin.utils.manage_cli_and_config_args(subcommand: str, config_file: str, subcommand_to_subparser: dict) Namespace
Manage command line and config arguments for the given subcommand.
This function parses arguments from the command line and the config file and applies the following priority: cli > config > default. When the subcommand is a workflow, the subcommands used in the workflow are also parsed in the config.
- Parameters:
subcommand – Name of the subcommand.
config_file – Path to the config file given in argument. If None, only default and cli argument values are used.
subcommand_to_subparser – Dict with subcommand name as key and the corresponding subparser function as value.
- ppanggolin.utils.min_one(x) int
Check that the given integer is greater than or equal to one.
- Parameters:
x – Integer given by the user.
- Returns:
The given integer if it is acceptable.
- Raises:
argparse.ArgumentTypeError – The integer is not acceptable.
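Validators of this kind are typically passed to argparse through the `type` argument. The following sketch assumes that values of 1 and above are accepted; the function name `min_one_sketch` and the `--cpu` option are hypothetical.

```python
import argparse


def min_one_sketch(x) -> int:
    """Hypothetical validator: accept integers that are at least 1 (assumed bound)."""
    value = int(x)
    if value < 1:
        raise argparse.ArgumentTypeError(f"{x!r} is lower than 1")
    return value


parser = argparse.ArgumentParser()
# argparse calls the validator on the raw string given on the command line.
parser.add_argument("--cpu", type=min_one_sketch, default=1)
args = parser.parse_args(["--cpu", "4"])
print(args.cpu)
```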
- ppanggolin.utils.mk_file_name(basename: str, output: Path, force: bool = False) Path
Return a usable filename for a ppanggolin output file, or raise an error.
- Parameters:
basename – basename for the file
output – Path to save the file
force – Force to write the file
- Returns:
Path to the file
- ppanggolin.utils.mk_outdir(output: Path, force: bool = False, exist_ok: bool = False)
Create a directory at the given output if it doesn’t exist already
- Parameters:
output – Path where to create directory
force – Force to write in the directory
exist_ok – Does not give an error if the directory already exists.
- Raises:
FileExistsError – The given path already exists and force is False.
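The documented interplay of `force` and `exist_ok` can be sketched as below. This follows the parameter descriptions only: treating either option as sufficient to accept an existing directory and creating parent directories are both assumptions, and `make_outdir_sketch` is a hypothetical name.

```python
from pathlib import Path


def make_outdir_sketch(output: Path, force: bool = False, exist_ok: bool = False) -> None:
    """Hypothetical helper: create `output` unless it exists and is not allowed to."""
    if output.is_dir():
        # Assumption: an existing directory is accepted with force or exist_ok.
        if not (force or exist_ok):
            raise FileExistsError(f"{output} already exists. Use force to write in it.")
    else:
        # Assumption: missing parent directories are created as well.
        output.mkdir(parents=True)
```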
- ppanggolin.utils.overwrite_args(default_args: Namespace, config_args: Namespace, cli_args: Namespace)
Overwrite args objects.
When arguments are given in CLI, their values are used instead of the ones found in the config file. When arguments are specified in the config file, they overwrite default values.
- Parameters:
default_args – default arguments
config_args – arguments parsed from the config file
cli_args – arguments parsed from the command line
- Returns:
final arguments
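The priority order cli > config > default can be pictured with three `argparse.Namespace` objects. This is a minimal sketch that assumes the config and CLI namespaces contain only the arguments that were explicitly set; `merge_args_sketch` and the argument names are hypothetical.

```python
from argparse import Namespace


def merge_args_sketch(default_args: Namespace, config_args: Namespace, cli_args: Namespace) -> Namespace:
    """Hypothetical helper: merge namespaces with priority cli > config > default."""
    merged = Namespace(**vars(default_args))
    # Later sources overwrite earlier ones: config over default, then CLI over config.
    for source in (config_args, cli_args):
        for name, value in vars(source).items():
            setattr(merged, name, value)
    return merged


defaults = Namespace(cpu=1, identity=0.8, output="out")
config = Namespace(cpu=4, identity=0.9)  # values read from the config file
cli = Namespace(cpu=8)                   # values typed on the command line
final = merge_args_sketch(defaults, config, cli)
print(final)
```

Here `cpu` comes from the command line, `identity` from the config file, and `output` keeps its default.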
- ppanggolin.utils.parse_config_file(yaml_config_file: str) dict
Parse yaml config file.
- Parameters:
yaml_config_file – config file in yaml
- Returns:
Dict of the config, with each command as key and, as value, another dict mapping parameter names to their values.
- ppanggolin.utils.parse_input_paths_file(path_list_file: Path) Dict[str, Dict[str, Path | List[str]]]
Parse an input paths file to extract genome information.
This function reads an input paths file, which is in TSV format, and extracts genome information including file paths and putative circular contigs.
- Parameters:
path_list_file – The path to the input paths file.
- Returns:
A dictionary where keys are genome names and values are dictionaries containing path information and putative circular contigs.
- Raises:
FileNotFoundError – If a specified genome file path does not exist.
Exception – If there are no genomes in the provided file.
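A parser of this kind could look like the sketch below. The column layout (genome name, file path, then any circular contig names), the dictionary keys `path` and `circular_contigs`, and the name `parse_paths_sketch` are assumptions for illustration; relative paths are used as given rather than resolved.

```python
import csv
from pathlib import Path
from typing import Dict, List, Union


def parse_paths_sketch(path_list_file: Path) -> Dict[str, Dict[str, Union[Path, List[str]]]]:
    """Hypothetical parser for a TSV of genome name, path and circular contigs."""
    genomes = {}
    with open(path_list_file, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row:
                continue  # skip blank lines
            # Assumed layout: name, path, then zero or more circular contig names.
            # A row with fewer than two fields raises ValueError in this sketch.
            name, genome_path, *circular = row
            genome_path = Path(genome_path)
            if not genome_path.exists():
                raise FileNotFoundError(f"Genome file {genome_path} does not exist.")
            genomes[name] = {"path": genome_path, "circular_contigs": circular}
    if not genomes:
        raise Exception(f"No genomes found in {path_list_file}.")
    return genomes
```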
- ppanggolin.utils.read_compressed_or_not(file_or_file_path: Path | BinaryIO | TextIOWrapper | TextIO) TextIOWrapper | BinaryIO | TextIO
Opens and reads a file, decompressing it if necessary.
- Parameters:
file_or_file_path – The file to read. It can be a Path object from the pathlib module, a BytesIO object, a TextIOWrapper, or a TextIOBase object.
- Returns:
A file-like object giving access to the contents of the file, decompressed if it was in a recognized compressed format.
- Raises:
TypeError – If the file type is not supported.
- ppanggolin.utils.replace_non_ascii(string_with_ascii: str | Collection[str], replacement_string: str = '_') str | Collection[str]
Replace all non-ASCII characters in a string or a collection of strings with a specified replacement string.
- Parameters:
string_with_ascii – A string or collection of strings potentially containing non-ASCII characters.
replacement_string – The string to replace non-ASCII characters with (default is ‘_’).
- Returns:
A new string or collection where all non-ASCII characters have been replaced.
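One way to obtain this behaviour is a regular expression that matches every character outside the ASCII range. The sketch below assumes the collection type can be rebuilt from an iterable (as list, tuple and set can); `replace_non_ascii_sketch` is a hypothetical name.

```python
import re

# Matches any single character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def replace_non_ascii_sketch(text, replacement="_"):
    """Hypothetical helper: replace non-ASCII characters in a string or collection."""
    def clean(s: str) -> str:
        # A function is used so the replacement is inserted literally.
        return NON_ASCII.sub(lambda _match: replacement, s)

    if isinstance(text, str):
        return clean(text)
    # Assumption: the collection type accepts an iterable in its constructor.
    return type(text)(clean(s) for s in text)


print(replace_non_ascii_sketch("caf\u00e9"))
```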
- ppanggolin.utils.restricted_float(x: int | float) float
Restrict the range of float values accepted by an argparse argument.
- Parameters:
x – given float by user
- Returns:
given float if it is acceptable
- Raises:
argparse.ArgumentTypeError – The float is not acceptable
- ppanggolin.utils.run_subprocess(cmd: List[str], output: Path | None = None, msg: str = 'Subprocess failed with the following error:\n')
Run a subprocess command and write the output to the given path.
- Parameters:
cmd – List of program arguments.
output – Path to write the subprocess output (optional).
msg – Message to print if the subprocess fails.
- Raises:
FileNotFoundError – If the command’s executable is not found.
subprocess.CalledProcessError – If the subprocess returns a non-zero exit code.
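The documented behaviour maps closely onto the standard `subprocess.run` call with `check=True`. The following is a sketch under that assumption, with the hypothetical name `run_subprocess_sketch`; how ppanggolin actually logs or reports the failure is not shown here.

```python
import subprocess
import sys
from pathlib import Path
from typing import List, Optional


def run_subprocess_sketch(cmd: List[str], output: Optional[Path] = None,
                          msg: str = "Subprocess failed with the following error:\n") -> None:
    """Hypothetical helper: run `cmd` and optionally save its standard output."""
    try:
        # check=True raises CalledProcessError on a non-zero exit code;
        # a missing executable raises FileNotFoundError before the process starts.
        result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as error:
        print(msg + error.stderr, file=sys.stderr)
        raise
    if output is not None:
        output.write_text(result.stdout)
```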
- ppanggolin.utils.set_up_config_param_to_parser(config_param_val: dict) list
Take dict pairing parameters and values and format the corresponding list of arguments to feed a parser.
When the parameter value is False, the parameter is a flag and thus is not added to the list.
- Params config_param_val:
Dict with parameter name as key and parameter value as value.
- Returns:
list of argument strings formatted for an argparse.ArgumentParser object.
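The conversion from a parameter dict to an argument list can be sketched as follows. Only the handling of False comes from the description above; treating True as a flag without a value, prefixing names with `--`, and the name `config_to_arguments_sketch` are assumptions, and list values are not handled.

```python
def config_to_arguments_sketch(config_param_val: dict) -> list:
    """Hypothetical helper: turn {param: value} pairs into argparse-style strings."""
    arguments = []
    for param, value in config_param_val.items():
        if value is False:
            continue  # a disabled flag is simply left out
        arguments.append(f"--{param}")
        if value is not True:
            # Assumption: True marks an enabled flag, which takes no value.
            arguments.append(str(value))
    return arguments


print(config_to_arguments_sketch({"cpu": 4, "force": True, "verbose": False}))
```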
- ppanggolin.utils.set_verbosity_level(args)
Set the verbosity level
- Parameters:
args – arguments passed on the command line
- ppanggolin.utils.write_compressed_or_not(file_path: Path, compress: bool = False) GzipFile | TextIOWrapper
Create a file-like object, compressed or not.
- Parameters:
file_path – Path to the file
compress – Whether to compress the file in gzip (.gz) format
- Returns:
file-like object, compressed or not
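Choosing between a gzip and a plain text handle can be sketched with the standard `gzip` module. Appending a `.gz` suffix to the requested path is an assumption, and `open_output_sketch` is a hypothetical name rather than the ppanggolin function.

```python
import gzip
from pathlib import Path


def open_output_sketch(file_path: Path, compress: bool = False):
    """Hypothetical helper: return a writable text handle, gzip-compressed or not."""
    if compress:
        # Assumption: ".gz" is appended to the requested file name.
        return gzip.open(file_path.with_name(file_path.name + ".gz"), "wt")
    return open(file_path, "w")
```

Both branches return an object usable in a `with` block, so calling code can write lines the same way in either case.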