High-level description
TheHybridSolver class in Cassiopeia implements a hybrid approach to phylogenetic tree reconstruction. It combines a top-down greedy solver with a more precise bottom-up solver to efficiently infer relationships in large datasets. The greedy solver is applied initially to cluster cells until a specified cutoff is reached (either based on the number of cells or the distance to the latest common ancestor). Then, the bottom solver is used to reconstruct subtrees for each cluster, resulting in a complete tree.
Code Structure
TheHybridSolver class inherits from the CassiopeiaSolver class and utilizes two solver instances: top_solver (a subclass of GreedySolver) and bottom_solver (any subclass of CassiopeiaSolver). The solve method orchestrates the hybrid solving process, while apply_top_solver and apply_bottom_solver handle the respective solver applications. assess_cutoff determines when to switch from the top solver to the bottom solver.
References
GreedySolver: TheHybridSolverrelies on aGreedySolverinstance for the initial top-down clustering.CassiopeiaSolver: Bothtop_solverandbottom_solverare subclasses ofCassiopeiaSolver, providing a common interface for tree reconstruction.CassiopeiaTree: TheHybridSolveroperates on aCassiopeiaTreeobject, which stores the character matrix, priors, and the resulting tree.
Symbols
HybridSolver
Description
This class implements the hybrid phylogenetic tree reconstruction algorithm. It combines a greedy top-down solver with a more precise bottom-up solver to handle large datasets efficiently.Inputs
| Name | Type | Description |
|---|---|---|
| top_solver | GreedySolver.GreedySolver | An algorithm to be applied at the top of the tree. Must be a subclass of GreedySolver. |
| bottom_solver | CassiopeiaSolver.CassiopeiaSolver | An algorithm to be applied at the bottom of the tree. Must be a subclass of CassiopeiaSolver. |
| lca_cutoff | float | Distance to the latest-common-ancestor (LCA) of a subclade to be used as a cutoff for transitioning to the bottom solver. |
| cell_cutoff | int | Number of cells in a subclade to be used as a cutoff for transitioning to the bottom solver. |
| threads | int | Number of threads to be used. This corresponds to the number of subproblems to be run concurrently with the bottom solver. |
| prior_transformation | str | Function to use when transforming priors into weights. |
| progress_bar | bool | Indicates if a progress bar should be shown when solve is called. |
Outputs
TheHybridSolver does not directly return any outputs. It modifies the input CassiopeiaTree object by populating it with the reconstructed tree.
Internal Logic
- Apply Top Solver: The
apply_top_solvermethod is used to recursively cluster cells using thetop_solveruntil the specified cutoff is reached (eitherlca_cutofforcell_cutoff). - Generate Subproblems: As the top solver clusters cells, it generates subproblems, each consisting of a subtree root and a list of samples within that subtree.
- Apply Bottom Solver: The
apply_bottom_solvermethod is applied to each subproblem, using thebottom_solverto reconstruct the subtree for the corresponding samples. - Combine Subtrees: The resulting subtrees are combined into a single tree, forming the complete reconstructed phylogeny.
- Add Duplicates and Prune: Duplicate samples are added to the tree, and any spurious leaves (not present in the original character matrix) are pruned.
- Populate CassiopeiaTree: The final reconstructed tree is used to populate the input
CassiopeiaTreeobject.
Side Effects
- Modifies the input
CassiopeiaTreeobject by populating it with the reconstructed tree. - Creates log files for each subproblem solved by the bottom solver.
apply_top_solver
Description
This method recursively applies the top solver to cluster cells until the specified cutoff is reached.Inputs
| Name | Type | Description |
|---|---|---|
| character_matrix | pd.DataFrame | Character matrix |
| samples | List[str] | Samples in the subclade of interest. |
| tree | nx.DiGraph | In progress tree for the HybridSolver. |
| node_name_generator | Generator[str, None, None] | Generator for creating unique node names while applying the top-solver. |
| weights | Optional[Dict[int, Dict[int, float]]] | Weights of character-state combinations, derived from priors if these are available. |
| missing_state_indicator | int | Indicator for missing data |
| root | Optional[int] | Node ID of the root in the subtree containing the samples. |
Outputs
| Name | Type | Description |
|---|---|---|
| root | int | The ID of the node serving as the root of the tree containing the samples. |
| subproblems | List[Tuple[int, List[str]]] | A list of subproblems in the form [subtree-root, subtree-samples]. |
| tree | nx.DiGraph | The updated tree with the top solver applied. |
Internal Logic
- Base Case: If there is only one sample, return the sample ID, an empty list of subproblems, and the unmodified tree.
- Perform Split: Use the
top_solverto partition the samples based on the character matrix and weights. - Create Root: Generate a unique node name for the root of the current subtree.
- Handle Single Clade: If the split results in a single clade, connect all samples in the clade to the root and return.
- Recursive Application: For each clade resulting from the split:
- If the cutoff is reached, add the clade as a subproblem.
- Otherwise, recursively apply
apply_top_solverto the clade, connect the resulting subtree to the current root, and add any new subproblems.
- Return Results: Return the root ID, the list of subproblems, and the updated tree.
apply_bottom_solver
Description
This method applies the bottom solver to a given subproblem, reconstructing the subtree for the corresponding samples.Inputs
| Name | Type | Description |
|---|---|---|
| cassiopeia_tree | CassiopeiaTree | CassiopeiaTree for the entire dataset. This will be subsetted with respect to the samples specified. |
| root | int | Identifier of the root in the master tree |
| samples | List[str] | A list of samples for which to infer a tree. |
| logfile | str | Base location for logging output. A specific logfile will be created from this base logfile name. |
| layer | Optional[str] | Layer storing the character matrix for solving. If None, the default character matrix is used in the CassiopeiaTree. |
Outputs
| Name | Type | Description |
|---|---|---|
| subproblem_tree | nx.DiGraph | A tree in the form of a Networkx graph for the subproblem. |
| root | int | The original root identifier. |
Internal Logic
- Base Case: If there is only one sample, create a simple tree with the root connected to the sample and return.
- Subset Character Matrix: Extract the relevant portion of the character matrix corresponding to the input samples.
- Create Subtree: Create a new
CassiopeiaTreeobject for the subproblem, using the subsetted character matrix and priors from the original tree. - Solve Subproblem: Apply the
bottom_solverto the subtree, reconstructing the phylogeny for the subproblem samples. - Connect to Root: Connect the root of the reconstructed subtree to the input root ID.
- Return Results: Return the reconstructed subtree and the original root ID.
assess_cutoff
Description
This method determines whether the specified cutoff has been reached for a given set of samples.Inputs
| Name | Type | Description |
|---|---|---|
| samples | List[str] | A list of samples in a clade. |
| character_matrix | pd.DataFrame | Character matrix |
| missing_state_indicator | int | Indicator for missing data. |
Outputs
| Name | Type | Description |
|---|---|---|
| cutoff_reached | bool | True if the cutoff is reached, False if not. |
Internal Logic
- Cell Cutoff: If
cell_cutoffis specified, check if the number of samples is less than or equal to the cutoff. If so, return True. - LCA Cutoff: If
lca_cutoffis specified:- Calculate the LCA character states for the samples.
- Calculate the Hamming distance between each sample’s character states and the LCA character states.
- If the maximum distance is less than or equal to the
lca_cutoff, return True.
- Return False: If neither cutoff is reached, return False.
__add_duplicates_to_tree_and_remove_spurious_leaves
Description
This private method adds duplicate samples to the tree and removes any spurious leaves that are not present in the original character matrix.Inputs
| Name | Type | Description |
|---|---|---|
| tree | nx.DiGraph | The tree after solving. |
| character_matrix | pd.DataFrame | Character matrix |
| node_name_generator | Generator[str, None, None] | Generator for creating unique node names. |
Outputs
| Name | Type | Description |
|---|---|---|
| tree | nx.DiGraph | The tree with duplicates added and spurious leaves pruned. |
Internal Logic
- Add Duplicates:
- Identify groups of duplicate samples based on their character states.
- For each group, create a new internal node and connect it to the original sample and all its duplicates.
- Remove Spurious Leaves:
- Identify leaves in the tree that are not present in the character matrix.
- For each spurious leaf, remove it and prune its lineage until a node with at least two children is reached.
- Return Tree: Return the updated tree.
Dependencies
networkx: Used for representing and manipulating the tree structure.pandas: Used for handling the character matrix and dissimilarity map.multiprocessing: Used for parallelizing the bottom solver application.tqdm: Used for displaying progress bars.
Error Handling
HybridSolverErroris raised if no cutoff is specified (eitherlca_cutofforcell_cutoff).
Logging
Thesolve method creates log files for each subproblem solved by the bottom solver. The base file name for the log files is specified by the logfile argument, and a unique identifier is appended to each file name.
