ppanggolin.nem package

Submodules

ppanggolin.nem.partition module

ppanggolin.nem.partition.check_pangenome_former_partition(pangenome: Pangenome, force: bool = False)

Checks the pangenome status and the .h5 file for former partitions; deletes them if allowed, otherwise raises an error

Parameters:
  • pangenome – Pangenome object

  • force – Allow overwriting the Pangenome file
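
The check-then-overwrite behaviour described above can be sketched as follows. This is a minimal illustration and not PPanGGOLiN's implementation: the `status` dictionary, its "partitioned" key, the "inFile" and "No" values and the helper name `check_former_step` are all assumptions made for the example.

```python
def check_former_step(status: dict, force: bool = False) -> dict:
    """Refuse to overwrite an earlier result unless force is set.

    `status` is a hypothetical stand-in for the pangenome status; the
    documented function also deletes the former partitions from the .h5 file.
    """
    if status.get("partitioned") == "inFile":
        if not force:
            raise RuntimeError(
                "Former partitions found; rerun with force=True to overwrite them."
            )
        # Overwriting is allowed: forget the former result.
        status = {**status, "partitioned": "No"}
    return status
```

With force left at False, a pangenome that already holds partitions stops the run early instead of silently losing data.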

ppanggolin.nem.partition.evaluate_nb_partitions(organisms: set, output: Path | None = None, sm_degree: int = 10, free_dispersion: bool = False, chunk_size: int = 500, krange: list | None = None, icl_margin: float = 0.05, draw_icl: bool = False, cpu: int = 1, seed: int = 42, tmpdir: Path | None = None, disable_bar: bool = False) → int

Evaluate the optimal number of partitions for the pangenome

Parameters:
  • organisms – Set of organisms from pangenome

  • tmpdir – temporary directory path

  • output – output directory path to draw ICL

  • sm_degree – Maximum degree of the nodes to be included in the smoothing process.

  • free_dispersion – set to True if the dispersion around the centroid vector of each partition must be free during partitioning.

  • chunk_size – Size of the chunks when performing partitioning using chunks of organisms.

  • krange – Range of K values to test when detecting K automatically.

  • icl_margin – margin used to select the lowest K when maximizing the ICL

  • draw_icl – draw the ICL curve for all the tested K values.

  • cpu – Number of available cores

  • seed – seed used to generate random numbers

  • disable_bar – Disable progress bar

Returns:

Ideal number of partitions computed
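
The role of icl_margin can be illustrated with a small sketch. The selection rule below is an assumption and may differ from the one PPanGGOLiN applies: it takes the margin as a fraction of the spread between the highest and lowest ICL values, then keeps the lowest K whose ICL falls within that margin of the best one. The helper name `select_k` and the ICL values are invented for the example.

```python
def select_k(icl_by_k: dict, icl_margin: float = 0.05) -> int:
    """Return the lowest K whose ICL is within a margin of the best ICL.

    Assumption: the margin is a fraction of the spread between the highest
    and lowest ICL values observed over the tested K values.
    """
    best_k = max(icl_by_k, key=icl_by_k.get)
    best_icl = icl_by_k[best_k]
    spread = best_icl - min(icl_by_k.values())
    threshold = best_icl - spread * icl_margin
    return min(k for k, icl in icl_by_k.items() if icl >= threshold and k <= best_k)


# Made-up ICL values for K = 2..6; the curve flattens after K = 3.
icl_values = {2: -1000.0, 3: -400.0, 4: -380.0, 5: -375.0, 6: -390.0}
chosen_k = select_k(icl_values, icl_margin=0.05)
```

Here the ICL peaks at K = 5, but K = 3 is already within the margin, so the smaller model is preferred. With a margin of 0 the function falls back to the K that maximizes the ICL.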

ppanggolin.nem.partition.launch(args: Namespace)

Command launcher

Parameters:

args – All arguments provided by the user

ppanggolin.nem.partition.nem_samples(pack: tuple) → Tuple[dict, None, None] | Tuple[int, float, float] | Tuple[dict, dict, float]

Run partitioning

Parameters:

pack – {index: int, tmpdir: str, beta: float, sm_degree: int, free_dispersion: bool, kval: int, seed: int, init: str, keep_tmp_files: bool}

Returns:

ppanggolin.nem.partition.nem_single(args: List[Tuple[Path, int, float, bool, int, int, str, bool, int, bool]]) → Tuple[dict, None, None] | Tuple[int, float, float] | Tuple[dict, dict, float]

Run partitioning in multiprocessing to evaluate the number of partitions

Parameters:

args – {nem_dir_path: str, nb_org: int, beta: float, free_dispersion: bool, kval: int, seed: int, init: str, keep_files: bool, itermax: int, just_log_likelihood: bool}

Returns:

Result of the partitioning run
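
The single-tuple signatures of nem_samples and nem_single follow a common pattern: a pool's `map` hands one argument to each task, so the arguments are bundled into a tuple and unpacked by a thin wrapper. A minimal sketch of that pattern, in which `run_task` and `run_task_packed` are hypothetical stand-ins and a thread pool replaces the process pool to keep the example short:

```python
from multiprocessing.pool import ThreadPool


def run_task(index: int, kval: int, seed: int = 42) -> tuple:
    """Hypothetical worker standing in for a partitioning run."""
    return index, kval, seed


def run_task_packed(pack: tuple) -> tuple:
    """Unpack one tuple of arguments and forward it to the worker."""
    return run_task(*pack)


# One tuple per task, mirroring the single-argument signature of the wrappers.
tasks = [(0, 3, 42), (1, 4, 42), (2, 5, 42)]
with ThreadPool(processes=2) as pool:
    results = pool.map(run_task_packed, tasks)
```

`map` returns the results in the order of the submitted tasks, so each result can be matched back to its K value.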

ppanggolin.nem.partition.parser_partition(parser: ArgumentParser)

Parser for the arguments specific to the partition command

Parameters:

parser – parser for the partition arguments

ppanggolin.nem.partition.partition(pangenome: Pangenome, output: Path | None = None, beta: float = 2.5, sm_degree: int = 10, free_dispersion: bool = False, chunk_size: int = 500, kval: int = -1, krange: list | None = None, icl_margin: float = 0.05, draw_icl: bool = False, cpu: int = 1, seed: int = 42, tmpdir: Path | None = None, keep_tmp_files: bool = False, force: bool = False, disable_bar: bool = False)

Partition the pangenome

Parameters:
  • pangenome – Pangenome containing the GeneFamilies to partition

  • tmpdir – temporary directory path

  • output – output directory path to draw ICL

  • beta – strength of the smoothing using the graph topology during partitioning. 0 deactivates spatial smoothing

  • sm_degree – Maximum degree of the nodes to be included in the smoothing process.

  • free_dispersion – set to True if the dispersion around the centroid vector of each partition must be free during partitioning.

  • chunk_size – Size of the chunks when performing partitioning using chunks of organisms.

  • kval – Number of partitions to use. Must be at least 2. If under 2, it will be detected automatically.

  • krange – Range of K values to test when detecting K automatically.

  • icl_margin – margin used to select the lowest K when maximizing the ICL

  • draw_icl – draw the ICL curve for all the tested K values.

  • cpu – Number of available cores

  • seed – seed used to generate random numbers

  • keep_tmp_files – True if you want to keep the temporary NEM files

  • force – Allow overwriting the Pangenome file

  • disable_bar – Disable progress bar
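
The interplay between kval and krange can be sketched as below. This is an illustration under assumptions, not the library's code: `resolve_kval` is a hypothetical helper, and `estimate` stands in for whatever routine evaluates the number of partitions over krange.

```python
from typing import Callable, Optional, Sequence


def resolve_kval(
    kval: int,
    krange: Optional[Sequence[int]],
    estimate: Callable[[Optional[Sequence[int]]], int],
) -> int:
    """Use kval as given when it is at least 2, otherwise estimate it."""
    if kval >= 2:
        return kval
    # kval under 2 (for instance the default of -1) requests automatic detection.
    return estimate(krange)


# Toy estimator for the example: it simply picks the lower bound of the range.
fixed_k = resolve_kval(4, None, estimate=lambda kr: kr[0])
auto_k = resolve_kval(-1, [3, 20], estimate=lambda kr: kr[0])
```

The estimator is only called when kval is under 2, which is why krange may stay unset when K is given explicitly.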

ppanggolin.nem.partition.partition_nem(index: int, kval: int, beta: float = 2.5, sm_degree: int = 10, free_dispersion: bool = False, seed: int = 42, init: str = 'param_file', tmpdir: Path | None = None, keep_tmp_files: bool = False) → Tuple[dict, None, None] | Tuple[int, float, float] | Tuple[dict, dict, float]

Parameters:
  • index – Index of the sample group

  • tmpdir – temporary directory path

  • kval – Number of partitions to use

  • beta – strength of the smoothing using the graph topology during partitioning. 0 deactivates spatial smoothing

  • sm_degree – Maximum degree of the nodes to be included in the smoothing process.

  • free_dispersion – set to True if the dispersion around the centroid vector of each partition must be free during partitioning.

  • seed – seed used to generate random numbers

  • init – Initialize the NEM parameters from the pangenome parameters or randomly

  • keep_tmp_files – True if you want to keep the temporary NEM files

Returns:

ppanggolin.nem.partition.run_partitioning(nem_dir_path: Path, nb_org: int, beta: float = 2.5, free_dispersion: bool = False, kval: int = 3, seed: int = 42, init: str = 'param_file', keep_files: bool = False, itermax: int = 100, just_log_likelihood: bool = False) → Tuple[dict, None, None] | Tuple[int, float, float] | Tuple[dict, dict, float]

Main function to run partitioning

Parameters:
  • nem_dir_path – Path to directory with nem files

  • nb_org – Number of organisms

  • beta – strength of the smoothing using the graph topology during partitioning. 0 deactivates spatial smoothing

  • free_dispersion – set to True if the dispersion around the centroid vector of each partition must be free during partitioning.

  • kval – Number of partitions to use. Must be at least 2. If under 2, it will be detected automatically.

  • seed – seed used to generate random numbers

  • init – Initialize the NEM parameters from the pangenome parameters or randomly

  • keep_files – True if you want to keep the NEM files

  • itermax – Maximum number of iterations used to compute the partitioning

  • just_log_likelihood – Return only the NEM parameter results

Returns:

NEM parameters and, unless just_log_likelihood is set, the families associated with each partition

ppanggolin.nem.partition.subparser(sub_parser: _SubParsersAction) → ArgumentParser

Subparser to launch PPanGGOLiN from the command line

Parameters:

sub_parser – sub_parser for the partition command

Returns:

Parser arguments for the partition command

ppanggolin.nem.partition.write_nem_input_files(tmpdir: Path, organisms: set, sm_degree: int = 10) → Tuple[float, int]

Create and format input files for partitioning with NEM

Parameters:
  • tmpdir – temporary directory path

  • organisms – Set of organisms from the pangenome

  • sm_degree – Maximum degree of the nodes to be included in the smoothing process.

Returns:

Total edge weight used to weight beta, and the number of families
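
How sm_degree and the two returned values could fit together is sketched below. Everything in it is an assumption made for illustration: the edge-list representation, the rule that an edge takes part in smoothing only when both of its endpoints have at most sm_degree neighbours, and the scaling of beta by the ratio of the family count to the total edge weight. The helper names are hypothetical.

```python
from collections import defaultdict


def smoothing_edges(edges, sm_degree=10):
    """Keep edges whose two endpoints both have at most sm_degree neighbours."""
    neighbours = defaultdict(set)
    for u, v, _weight in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    return [
        (u, v, w)
        for u, v, w in edges
        if len(neighbours[u]) <= sm_degree and len(neighbours[v]) <= sm_degree
    ]


def weighted_beta(beta, total_edge_weight, nb_families):
    """Scale beta by the number of families per unit of edge weight (assumed rule)."""
    if total_edge_weight <= 0:
        return 0.0  # no usable edge: spatial smoothing is effectively off
    return beta * nb_families / total_edge_weight


# Toy graph of five families; "c" has three neighbours and is excluded at sm_degree=2.
edges = [("a", "b", 1.0), ("b", "c", 2.0), ("c", "d", 0.5), ("c", "e", 0.5)]
kept = smoothing_edges(edges, sm_degree=2)
total_weight = sum(w for _, _, w in kept)
scaled_beta = weighted_beta(2.5, total_weight, nb_families=5)
```

Capping the degree keeps highly connected families from dominating the smoothing term, and normalizing beta makes its effect comparable across graphs of different density.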

ppanggolin.nem.rarefaction module

ppanggolin.nem.rarefaction.draw_curve(output: Path, data: list, max_sampling: int = 10)

Draw the rarefaction curve and associated data

Parameters:
  • output – output directory path to draw the rarefaction curve and associated data

  • max_sampling – Maximum number of organisms in a sample

  • data

ppanggolin.nem.rarefaction.launch(args: Namespace)

Command launcher

Parameters:

args – All arguments provided by the user

ppanggolin.nem.rarefaction.launch_raref_nem(args: Tuple[int, Path, float, int, bool, int, int, list, int]) → Tuple[Tuple[Dict[str, int], int]]

Launch raref_nem in multiprocessing

Parameters:

args – {index: int, tmpdir: str, beta: float, sm_degree: int, free_dispersion: bool, chunk_size: int, kval: int, krange: list, seed: int}

Returns:

Count of each partition and parameters for the given sample index

ppanggolin.nem.rarefaction.make_rarefaction_curve(pangenome: Pangenome, output: Path, tmpdir: Path | None = None, beta: float = 2.5, depth: int = 30, min_sampling: int = 1, max_sampling: int = 100, sm_degree: int = 10, free_dispersion: bool = False, chunk_size: int = 500, kval: int = -1, krange: list | None = None, cpu: int = 1, seed: int = 42, kestimate: bool = False, soft_core: float = 0.95, disable_bar: bool = False)

Main function to make the rarefaction curve

Parameters:
  • pangenome – Pangenome containing the GeneFamilies used to build the rarefaction curve

  • output – output directory path to draw the rarefaction curve and associated data

  • tmpdir – temporary directory path

  • beta – strength of the smoothing using the graph topology during partitioning. 0 deactivates spatial smoothing

  • depth – Number of samplings at each sampling point

  • min_sampling – Minimum number of organisms in a sample

  • max_sampling – Maximum number of organisms in a sample

  • sm_degree – Maximum degree of the nodes to be included in the smoothing process.

  • free_dispersion – set to True if the dispersion around the centroid vector of each partition must be free during partitioning.

  • chunk_size – Size of the chunks when performing partitioning using chunks of organisms.

  • kval – Number of partitions to use. Must be at least 2. If under 2, it will be detected automatically.

  • krange – Range of K values to test when detecting K automatically.

  • cpu – Number of available cores

  • seed – seed used to generate random numbers

  • kestimate – recompute the number of partitions for each sample between the values provided by krange

  • soft_core – Soft core threshold

  • disable_bar – Disable progress bar
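
The way depth, min_sampling, max_sampling and seed shape the sampling can be sketched as follows. This is a simplified illustration, not the library's sampling code: it assumes one sampling point per sample size between min_sampling and max_sampling (capped at the number of organisms), with depth random draws at each point. `draw_samples` is a hypothetical helper.

```python
import random


def draw_samples(organisms, depth=30, min_sampling=1, max_sampling=100, seed=42):
    """Draw `depth` random organism subsets for every sample size in range."""
    rng = random.Random(seed)  # a dedicated generator keeps the runs reproducible
    pool = sorted(organisms)   # fixed order so the seed alone decides the draws
    upper = min(max_sampling, len(pool))
    samples = []
    for size in range(min_sampling, upper + 1):
        for _ in range(depth):
            samples.append(set(rng.sample(pool, size)))
    return samples


organisms = {f"org{i}" for i in range(5)}
samples = draw_samples(organisms, depth=3, min_sampling=2, max_sampling=10, seed=1)
```

With five organisms the sizes run from 2 to 5, giving 4 sampling points of 3 draws each. Each sample would then be partitioned independently, which is where the cpu setting pays off.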

ppanggolin.nem.rarefaction.parser_rarefaction(parser: ArgumentParser)

Parser for the arguments specific to the rarefaction command

Parameters:

parser – parser for the rarefaction arguments

ppanggolin.nem.rarefaction.raref_nem(index: int, tmpdir: Path, beta: float = 2.5, sm_degree: int = 10, free_dispersion: bool = False, chunk_size: int = 500, kval: int = -1, krange: list | None = None, seed: int = 42) → Tuple[Dict[str, int], int]

Parameters:
  • index – Index of the sample group organisms

  • tmpdir – temporary directory path

  • beta – strength of the smoothing using the graph topology during partitioning. 0 deactivates spatial smoothing

  • sm_degree – Maximum degree of the nodes to be included in the smoothing process.

  • free_dispersion – set to True if the dispersion around the centroid vector of each partition must be free during partitioning.

  • chunk_size – Size of the chunks when performing partitioning using chunks of organisms.

  • kval – Number of partitions to use

  • krange – Range of K values to test when detecting K automatically.

  • seed – seed used to generate random numbers

Returns:

Count of each partition and parameters for the given sample index

ppanggolin.nem.rarefaction.subparser(sub_parser: _SubParsersAction) → ArgumentParser

Subparser to launch PPanGGOLiN from the command line

Parameters:

sub_parser – sub_parser for the rarefaction command

Returns:

Parser arguments for the rarefaction command

Module contents