High-level description
This code defines functions for constructing and manipulating graphs used in phylogenetic tree reconstruction. It includes functions for building connectivity graphs based on shared mutations, similarity graphs based on a given similarity function, and optimizing graph partitions using hill-climbing procedures.Code Structure
The code consists of several functions that are used to construct and manipulate graphs for phylogenetic tree reconstruction. Theconstruct_connectivity_graph and construct_similarity_graph functions are used to build different types of graphs, while max_cut_improve_cut and spectral_improve_cut are used to optimize partitions on these graphs. The check_if_cut function is a utility function used by the optimization functions.
References
This code references thesolver_utilities module for converting sample names to indices.
Symbols
check_if_cut
Description
This function checks if two nodes are on opposite sides of a graph partition.Inputs
| Name | Type | Description |
|---|---|---|
| u | int | The index of the first node. |
| v | int | The index of the second node. |
| cut | List[int] | A list of node indices representing one side of the partition. |
Outputs
| Name | Type | Description |
|---|---|---|
| bool | True if nodes u and v are on opposite sides of the partition, False otherwise. |
Internal Logic
The function simply checks if one node is in thecut list and the other is not.
construct_connectivity_graph
Description
This function constructs a connectivity graph for a set of samples based on their observed mutations. The edge weights represent the tendency of two samples to be grouped together or separated based on their shared and differing mutations.Inputs
| Name | Type | Description |
|---|---|---|
| character_matrix | pd.DataFrame | The character matrix of observed character states for all samples. |
| mutation_frequencies | Dict[int, Dict[int, int]] | A dictionary containing the frequencies of each character/state pair. |
| missing_state_indicator | int | The character representing missing values. |
| samples | List[str] | A list of sample names to include in the graph. |
| weights | Optional[Dict[int, Dict[int, float]]] | Optional weights for character/state pairs. |
Outputs
| Name | Type | Description |
|---|---|---|
| G_names | nx.Graph | The constructed connectivity graph with sample names as nodes. |
Internal Logic
The function iterates through all pairs of samples and calculates a score based on their shared and differing mutations. The score is positive if the samples share mutations and negative if they have different mutations. The absolute value of the score is higher for mutations that are less frequent in the dataset. The scores are then used as edge weights in the graph.max_cut_improve_cut
Description
This function implements a greedy hill-climbing procedure to optimize a partition for the max-cut criterion on a graph.Inputs
| Name | Type | Description |
|---|---|---|
| G | nx.Graph | The graph to optimize the partition on. |
| cut | List[str] | A list of nodes representing one side of the initial partition. |
Outputs
| Name | Type | Description |
|---|---|---|
| new_cut | List[str] | A new partition that is a local maximum to the max-cut criterion. |
Internal Logic
The function iteratively moves nodes between the two sides of the partition, choosing the move that maximizes the weight of edges across the cut. The procedure terminates when no further improvement can be made or a maximum number of iterations is reached.construct_similarity_graph
Description
This function constructs a similarity graph for a set of samples based on a given similarity function.Inputs
| Name | Type | Description |
|---|---|---|
| character_matrix | pd.DataFrame | The character matrix of observed character states for all samples. |
| missing_state_indicator | int | The character representing missing values. |
| samples | List[str] | A list of sample names to include in the graph. |
| similarity_function | Callable | A function that calculates a similarity score between two samples. |
| threshold | int | A minimum similarity threshold to include an edge in the graph. |
| weights | Optional[Dict[int, Dict[int, float]]] | Optional weights for character/state pairs. |
Outputs
| Name | Type | Description |
|---|---|---|
| G_names | nx.Graph | The constructed similarity graph with sample names as nodes. |
Internal Logic
The function iterates through all pairs of samples and calculates their similarity using the providedsimilarity_function. If the similarity score is above the given threshold, an edge is added between the corresponding nodes in the graph with the weight equal to the similarity score.
spectral_improve_cut
Description
This function implements a greedy hill-climbing procedure to optimize a partition for a modified normalized cut criterion on a graph.Inputs
| Name | Type | Description |
|---|---|---|
| G | nx.Graph | The graph to optimize the partition on. |
| cut | List[str] | A list of nodes representing one side of the initial partition. |
Outputs
| Name | Type | Description |
|---|---|---|
| new_cut | List[str] | A new partition that is a local minimum to the objective function. |
