High-level description
Theassign_missing_average function is a missing data imputation method used in the Cassiopeia-Greedy algorithm. It aims to assign samples with missing data at a particular character to the most similar partition (left or right) based on the average number of shared mutations. This function is particularly useful for handling ambiguous character states, such as those arising from uncertainties in single-cell sequencing data.
Symbols
assign_missing_average
Description
This function takes a character matrix, partition information, and a list of samples with missing data. It calculates a score for each missing sample against both the left and right partitions, based on the average number of shared mutations. The sample is then assigned to the partition with the higher score.Inputs
| Name | Type | Description |
|---|---|---|
character_matrix | pandas.DataFrame | The character matrix containing observed character states for all samples. |
missing_state_indicator | int | The character representing missing values in the character matrix. |
left_set | List[str] | A list of sample names currently assigned to the left partition. |
right_set | List[str] | A list of sample names currently assigned to the right partition. |
missing | List[str] | A list of sample names with missing data at the current character being considered. |
weights | Optional[Dict[int, Dict[int, float]]] | Optional weights for character/state mutation pairs, derived from priors. |
Outputs
| Name | Type | Description |
|---|---|---|
left_set | List[str] | The updated left partition with imputed missing samples. |
right_set | List[str] | The updated right partition with imputed missing samples. |
Internal Logic
- Convert sample names to indices: For efficient indexing, the function first converts the sample names in
left_set,right_set, andmissingto their corresponding integer indices in thecharacter_matrix. - Define
score_sidehelper function: This function calculates the similarity score between a missing sample and a given partition (left or right). It iterates through each character, comparing the states of the missing sample to the states present in the partition. The score is calculated based on the number of shared states, optionally weighted by the providedweights. - Calculate scores for each missing sample: For each sample in
missing, the function calculates its similarity score to both the left and right partitions using thescore_sidefunction. - Assign missing samples based on scores: The missing sample is assigned to the partition with the higher average similarity score. The average score is calculated by dividing the total score by the number of samples in the respective partition.
- Return updated partitions: The function returns the updated
left_setandright_setlists, now including the imputed missing samples.
References
solver_utilities.convert_sample_names_to_indices: Used to convert sample names to their corresponding indices in the character matrix.unravel_ambiguous_states: Imported fromcassiopeia.mixinsand used to handle ambiguous character states, expanding them into a list of all possible states.
Side Effects
This function modifies the inputleft_set and right_set lists by adding the imputed missing samples to the appropriate partition.
