graph_utilities.py - Stanza Demo

High-level description

This code defines functions for constructing and manipulating graphs used in phylogenetic tree reconstruction. It includes functions for building connectivity graphs based on shared mutations, similarity graphs based on a given similarity function, and optimizing graph partitions using hill-climbing procedures.

Code Structure

The code consists of several functions that are used to construct and manipulate graphs for phylogenetic tree reconstruction. The construct_connectivity_graph and construct_similarity_graph functions are used to build different types of graphs, while max_cut_improve_cut and spectral_improve_cut are used to optimize partitions on these graphs. The check_if_cut function is a utility function used by the optimization functions.

References

This code references the solver_utilities module for converting sample names to indices.

Symbols

`check_if_cut`

Description

This function checks if two nodes are on opposite sides of a graph partition.

Inputs

Name	Type	Description
u	int	The index of the first node.
v	int	The index of the second node.
cut	List[int]	A list of node indices representing one side of the partition.

Outputs

Name	Type	Description
	bool	True if nodes u and v are on opposite sides of the partition, False otherwise.

Internal Logic

The function simply checks if one node is in the cut list and the other is not.

`construct_connectivity_graph`

Description

This function constructs a connectivity graph for a set of samples based on their observed mutations. The edge weights represent the tendency of two samples to be grouped together or separated based on their shared and differing mutations.

Inputs

Name	Type	Description
character_matrix	pd.DataFrame	The character matrix of observed character states for all samples.
mutation_frequencies	Dict[int, Dict[int, int]]	A dictionary containing the frequencies of each character/state pair.
missing_state_indicator	int	The character representing missing values.
samples	List[str]	A list of sample names to include in the graph.
weights	Optional[Dict[int, Dict[int, float]]]	Optional weights for character/state pairs.

Outputs

Name	Type	Description
G_names	nx.Graph	The constructed connectivity graph with sample names as nodes.

Internal Logic

The function iterates through all pairs of samples and calculates a score based on their shared and differing mutations. The score is positive if the samples share mutations and negative if they have different mutations. The absolute value of the score is higher for mutations that are less frequent in the dataset. The scores are then used as edge weights in the graph.

`max_cut_improve_cut`

Description

This function implements a greedy hill-climbing procedure to optimize a partition for the max-cut criterion on a graph.

Inputs

Name	Type	Description
G	nx.Graph	The graph to optimize the partition on.
cut	List[str]	A list of nodes representing one side of the initial partition.

Outputs

Name	Type	Description
new_cut	List[str]	A new partition that is a local maximum to the max-cut criterion.

Internal Logic

The function iteratively moves nodes between the two sides of the partition, choosing the move that maximizes the weight of edges across the cut. The procedure terminates when no further improvement can be made or a maximum number of iterations is reached.

`construct_similarity_graph`

Description

This function constructs a similarity graph for a set of samples based on a given similarity function.

Inputs

Name	Type	Description
character_matrix	pd.DataFrame	The character matrix of observed character states for all samples.
missing_state_indicator	int	The character representing missing values.
samples	List[str]	A list of sample names to include in the graph.
similarity_function	Callable	A function that calculates a similarity score between two samples.
threshold	int	A minimum similarity threshold to include an edge in the graph.
weights	Optional[Dict[int, Dict[int, float]]]	Optional weights for character/state pairs.

Outputs

Name	Type	Description
G_names	nx.Graph	The constructed similarity graph with sample names as nodes.

Internal Logic

The function iterates through all pairs of samples and calculates their similarity using the provided similarity_function. If the similarity score is above the given threshold, an edge is added between the corresponding nodes in the graph with the weight equal to the similarity score.

`spectral_improve_cut`

Description

This function implements a greedy hill-climbing procedure to optimize a partition for a modified normalized cut criterion on a graph.

Inputs

Name	Type	Description
G	nx.Graph	The graph to optimize the partition on.
cut	List[str]	A list of nodes representing one side of the initial partition.

Outputs

Name	Type	Description
new_cut	List[str]	A new partition that is a local minimum to the objective function.

Internal Logic

The function iteratively moves nodes between the two sides of the partition, choosing the move that minimizes the ratio of the weight of edges across the cut to the minimum of the sum of edge weights within each side of the cut. The procedure terminates when no further improvement can be made or a maximum number of iterations is reached.

Project Root

​High-level description

​Code Structure

​References

​Symbols

​check_if_cut

​Description

​Inputs

​Outputs

​Internal Logic

​construct_connectivity_graph

​Description

​Inputs

​Outputs

​Internal Logic

​max_cut_improve_cut

​Description

​Inputs

​Outputs

​Internal Logic

​construct_similarity_graph

​Description

​Inputs

​Outputs

​Internal Logic

​spectral_improve_cut

​Description

​Inputs

​Outputs

​Internal Logic

High-level description

Code Structure

References

Symbols

`check_if_cut`

Description

Inputs

Outputs

Internal Logic

`construct_connectivity_graph`

Description

Inputs

Outputs

Internal Logic

`max_cut_improve_cut`

Description

Inputs

Outputs

Internal Logic

`construct_similarity_graph`

Description

Inputs

Outputs

Internal Logic

`spectral_improve_cut`

Description

Inputs

Outputs

Internal Logic