High-level description
The Cas9LineageTracingDataSimulator class simulates Cas9-based lineage tracing data by overlaying mutations onto a CassiopeiaTree. It models Cas9 cutting as an exponential process and incorporates cassettes, state distributions, and silencing rates to mimic real-world lineage tracing experiments.
Code Structure
The Cas9LineageTracingDataSimulator class inherits from the LineageTracingDataSimulator class. Its constructor takes several parameters that control the simulation, such as the number of cassettes, the size of each cassette, the mutation rate, the state distribution, and the silencing rates. The main functionality is implemented in the overlay_data method, which modifies the character states of the input CassiopeiaTree in place.
References
cassiopeia.data.CassiopeiaTree: The class representing the tree structure on which the data is overlaid.
cassiopeia.simulator.LineageTracingDataSimulator: The parent class defining the interface for lineage tracing data simulators.
Symbols
Cas9LineageTracingDataSimulator
Description
This class simulates Cas9-based lineage tracing data by overlaying mutations onto a CassiopeiaTree. It models Cas9 cutting as an exponential process and incorporates cassettes, state distributions, and silencing rates to mimic real-world lineage tracing experiments.
Inputs
| Name | Type | Description |
|---|---|---|
| number_of_cassettes | int | Number of cassettes (arrays of target sites). Defaults to 10. |
| size_of_cassette | int | Number of editable target sites per cassette. Defaults to 3. |
| mutation_rate | Union[float, List[float]] | Exponential parameter for the Cas9 cutting rate. Can be a float (all sites mutate at the same rate), a list of floats of length size_of_cassette (each site mutates at the specified rate across all cassettes), or a list of floats of length number_of_cassettes * size_of_cassette (each site and cassette mutates at the specified rate). Defaults to 0.01. |
| state_generating_distribution | Callable[[], float] | Distribution from which to simulate state likelihoods. This is only used if state_priors is not specified. Defaults to an exponential distribution with parameter 1e-5. |
| number_of_states | int | Number of states to simulate. This is only used if state_priors is not specified. Defaults to 100. |
| state_priors | Optional[Dict[int, float]] | An optional dictionary mapping states to their prior probabilities. Can also be a list of dictionaries of length size_of_cassette or number_of_cassettes * size_of_cassette to specify per-site or per-site-per-cassette priors. If this argument is None, states will be generated using the state_generating_distribution. Defaults to None. |
| heritable_silencing_rate | float | Silencing rate for the cassettes, per node, simulating heritable missing data events. Defaults to 1e-4. |
| stochastic_silencing_rate | float | Rate at which to randomly drop out cassettes, to simulate dropout due to low sensitivity of assays. Defaults to 1e-2. |
| heritable_missing_data_state | int | Integer representing data that has gone missing due to a heritable event (i.e. Cas9 resection or heritable silencing). Defaults to -1. |
| stochastic_missing_data_state | int | Integer representing data that has gone missing due to the stochastic dropout from single-cell assay sensitivity. Defaults to -1. |
| random_seed | Optional[int] | NumPy random seed to use for deterministic simulations. The seed is set on every call to overlay_data, so repeated calls with the same seed produce the same simulation. Defaults to None. |
| collapse_sites_on_cassette | bool | Whether or not to collapse cuts that occur in the same cassette in a single iteration. This option only takes effect when size_of_cassette is greater than 1. Defaults to True. |
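
The three accepted shapes of mutation_rate can be illustrated with a small sketch. This is not the library's code: the helper name expand_mutation_rates is hypothetical, and ValueError stands in for whatever error the real constructor raises. It only shows how a scalar, a per-site list, or a per-site-per-cassette list could be expanded to one rate per character.

```python
from typing import List, Union


def expand_mutation_rates(
    mutation_rate: Union[float, List[float]],
    number_of_cassettes: int,
    size_of_cassette: int,
) -> List[float]:
    """Return one cutting rate per character (hypothetical helper)."""
    number_of_characters = number_of_cassettes * size_of_cassette

    if isinstance(mutation_rate, (int, float)):
        # A single rate shared by every site in every cassette.
        return [float(mutation_rate)] * number_of_characters

    if len(mutation_rate) == number_of_characters:
        # One rate per site and per cassette: use as given.
        return [float(rate) for rate in mutation_rate]

    if len(mutation_rate) == size_of_cassette:
        # One rate per site position, repeated across all cassettes.
        return [float(rate) for rate in mutation_rate] * number_of_cassettes

    # Assumption: the real class reports this with its own error type.
    raise ValueError(
        "mutation_rate must be a float, or a list of length "
        "size_of_cassette or number_of_cassettes * size_of_cassette."
    )
```

With 2 cassettes of 3 sites, passing `[0.1, 0.2, 0.3]` would yield six rates in which the three per-site values repeat once per cassette.
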
Outputs
None
Internal Logic
The constructor initializes the simulator with the provided parameters and performs several checks to ensure the validity of the input. It also pre-computes the mutation rate and state priors for each character in the lineage.
overlay_data
Description
This method overlays Cas9-based lineage tracing data onto the provided CassiopeiaTree. It simulates mutations, silencing events, and cassette dropouts based on the parameters specified in the constructor.
Inputs
| Name | Type | Description |
|---|---|---|
| tree | CassiopeiaTree | The CassiopeiaTree on which to overlay the data. |
Outputs
None
Internal Logic
- Initialize character states: Sets all character states to 0 (unmutated) for the root node and -1 (missing) for all other nodes.
- Simulate mutations: For each non-root node, calculate the mutation probability for each open site based on the node’s lifetime and the site’s mutation rate. Randomly introduce mutations based on these probabilities.
- Collapse sites on cassettes: If collapse_sites_on_cassette is True and size_of_cassette is greater than 1, collapse cuts that occur within the same cassette in a single iteration.
- Introduce states at cut sites: For each cut site, randomly select a state from the state distribution and assign it to the site.
- Silence cassettes: Randomly silence entire cassettes based on the heritable silencing rate.
- Apply stochastic silencing: Randomly drop out cassettes at the leaves based on the stochastic silencing rate.
- Set character states: Update the character states of all nodes in the tree based on the simulated mutations and silencing events.
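
The mutation step in the list above can be sketched as follows. Because cutting is modeled as an exponential process, a site with rate r on a branch of length t is cut with probability 1 - exp(-r * t). The function name sample_cuts, the explicit rng argument, and the reading of "open site" as a site still in state 0 are assumptions made for illustration, not the library's implementation.

```python
import numpy as np


def sample_cuts(character_array, rates, lifetime, rng):
    """Return the indices of open sites that are cut along one branch.

    Hypothetical helper: `rates` holds one cutting rate per character and
    `lifetime` is the branch length between a node and its parent.
    """
    cuts = []
    for site, state in enumerate(character_array):
        if state != 0:
            # Assumption: only unmutated sites (state 0) can still be cut.
            continue
        # Probability that an exponential waiting time falls within the branch.
        cut_probability = 1.0 - np.exp(-lifetime * rates[site])
        if rng.uniform() < cut_probability:
            cuts.append(site)
    return cuts


# Example call with a seeded generator so the draw is repeatable.
rng = np.random.default_rng(0)
example_cuts = sample_cuts([0, 5, 0, -1], [0.01] * 4, lifetime=2.0, rng=rng)
```

Sites that are already mutated or missing are skipped, so at most sites 0 and 2 can appear in `example_cuts`.
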
collapse_sites
Description
This method collapses cassettes by identifying cuts that occur within the same cassette and collapsing the sites between them.
Inputs
| Name | Type | Description |
|---|---|---|
| character_array | List[int] | The character array being modified. |
| cuts | List[int] | The sites in the character array that are being cut. |
Outputs
| Name | Type | Description |
|---|---|---|
| updated_character_array | List[int] | The updated character array with collapsed cassettes. |
| cuts_remaining | List[int] | The sites that are not part of a cassette collapse. |
Internal Logic
- Identify cassettes with multiple cuts: Group the cuts by the cassette they belong to.
- Collapse sites within cassettes: For each cassette with multiple cuts, identify the leftmost and rightmost cut sites and set all sites between them to the heritable missing data state.
- Return updated character array and remaining cuts: Return the updated character array and the list of cuts that were not part of a cassette collapse.
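
A minimal sketch of the collapse logic described above, written as a standalone function rather than a method. It assumes that cassettes are laid out contiguously, so site i belongs to cassette i // size_of_cassette, and that the collapsed range includes both end cuts. The extra size_of_cassette and missing_state parameters are illustrative stand-ins for attributes of the simulator.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def collapse_sites(
    character_array: List[int],
    cuts: List[int],
    size_of_cassette: int,
    missing_state: int = -1,
) -> Tuple[List[int], List[int]]:
    """Collapse cassettes that received more than one cut (illustrative)."""
    updated = list(character_array)  # work on a copy of the input

    # Group the cut sites by the cassette they fall in.
    cuts_by_cassette: Dict[int, List[int]] = defaultdict(list)
    for site in cuts:
        cuts_by_cassette[site // size_of_cassette].append(site)

    cuts_remaining = []
    for cassette in sorted(cuts_by_cassette):
        sites = cuts_by_cassette[cassette]
        if len(sites) > 1:
            # Assumption: the span from leftmost to rightmost cut is inclusive.
            for site in range(min(sites), max(sites) + 1):
                updated[site] = missing_state
        else:
            # A lone cut is left for the state-introduction step.
            cuts_remaining.append(sites[0])
    return updated, cuts_remaining


result = collapse_sites([0, 0, 0, 0, 0, 0], cuts=[0, 2, 4], size_of_cassette=3)
# result == ([-1, -1, -1, 0, 0, 0], [4])
```

In this example the first cassette receives two cuts and is collapsed, while the single cut at site 4 is returned for later processing.
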
introduce_states
Description
This method introduces new states at the specified cut sites in the character array.
Inputs
| Name | Type | Description |
|---|---|---|
| character_array | List[int] | The character array being modified. |
| cuts | List[int] | The loci being cut. |
Outputs
| Name | Type | Description |
|---|---|---|
| updated_character_array | List[int] | The updated character array with new states at the cut sites. |
Internal Logic
- Iterate over cut sites: For each cut site, randomly select a state from the state distribution based on the site’s prior probabilities.
- Assign new states: Assign the selected state to the corresponding site in the character array.
- Return updated character array: Return the updated character array.
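
The steps above can be sketched as a standalone function. It assumes the priors are available as one dictionary per site mapping state to probability, and it draws with NumPy's Generator.choice. The priors_per_site and rng arguments are hypothetical and stand in for state the simulator would hold internally.

```python
import numpy as np


def introduce_states(character_array, cuts, priors_per_site, rng):
    """Assign a sampled state to each cut site (illustrative sketch)."""
    updated = list(character_array)  # leave the caller's list untouched
    for site in cuts:
        priors = priors_per_site[site]
        states = list(priors.keys())
        probabilities = np.array([priors[s] for s in states], dtype=float)
        # Normalize defensively so the probabilities sum to 1.
        probabilities = probabilities / probabilities.sum()
        updated[site] = int(rng.choice(states, p=probabilities))
    return updated


rng = np.random.default_rng(0)
priors = [{7: 1.0}, {1: 0.5, 2: 0.5}, {9: 1.0}]
edited = introduce_states([0, 0, 0], cuts=[0, 2], priors_per_site=priors, rng=rng)
# Sites 0 and 2 each have a single possible state, so edited == [7, 0, 9].
```

Site 1 is not in `cuts`, so it keeps its unmutated state.
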
silence_cassettes
Description
This method randomly silences cassettes based on the specified silencing rate.
Inputs
| Name | Type | Description |
|---|---|---|
| character_array | List[int] | The character array being modified. |
| silencing_rate | float | The silencing rate. |
| missing_state | int | The state to use for encoding missing data. Defaults to -1. |
Outputs
| Name | Type | Description |
|---|---|---|
| updated_character_array | List[int] | The updated character array with silenced cassettes. |
Internal Logic
- Iterate over cassettes: For each cassette, randomly decide whether to silence it based on the silencing rate.
- Silence cassettes: If a cassette is selected for silencing, set all sites within that cassette to the missing data state.
- Return updated character array: Return the updated character array.
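
As a rough illustration of the silencing steps above, the sketch below treats silencing_rate as a per-cassette probability and draws one uniform number per cassette. The size_of_cassette and rng parameters are additions for the sake of a self-contained example and are not part of the documented signature.

```python
import numpy as np


def silence_cassettes(
    character_array, silencing_rate, size_of_cassette, rng, missing_state=-1
):
    """Silence whole cassettes at random (illustrative sketch)."""
    updated = list(character_array)
    # Walk over cassette start positions: 0, size, 2 * size, ...
    for start in range(0, len(updated), size_of_cassette):
        if rng.uniform() < silencing_rate:
            end = min(start + size_of_cassette, len(updated))
            for site in range(start, end):
                updated[site] = missing_state
    return updated


rng = np.random.default_rng(0)
untouched = silence_cassettes([0, 3, 0, 5], 0.0, size_of_cassette=2, rng=rng)
silenced = silence_cassettes([0, 3, 0, 5], 1.0, size_of_cassette=2, rng=rng)
# A rate of 0.0 never silences; a rate of 1.0 silences every cassette.
```

The same routine could serve both the heritable step (applied at each node) and the stochastic step (applied at the leaves), differing only in the rate and the missing state passed in.
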
get_cassettes
Description
This method returns a list of indices corresponding to the start positions of the cassettes in the character array.
Inputs
None
Outputs
| Name | Type | Description |
|---|---|---|
| cassettes | List[int] | An array of indices corresponding to the start positions of the cassettes. |
Internal Logic
Calculates the start positions of each cassette based on the size_of_cassette and number_of_cassettes parameters.
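
Assuming cassettes are stored back to back in the character array, the start positions are multiples of the cassette size. The function below is a hypothetical standalone equivalent that takes the two parameters explicitly.

```python
from typing import List


def get_cassette_starts(number_of_cassettes: int, size_of_cassette: int) -> List[int]:
    """Return the index of the first site of each cassette (illustrative)."""
    return [size_of_cassette * i for i in range(number_of_cassettes)]


starts = get_cassette_starts(number_of_cassettes=4, size_of_cassette=3)
# starts == [0, 3, 6, 9]
```
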
Side Effects
The overlay_data method modifies the character states of the input CassiopeiaTree in place.
Dependencies
numpy: Used for random number generation and array operations.
pandas: Used for representing the character matrix.
Error Handling
The constructor raises DataSimulatorError if any of the input parameters are invalid.