How an Analysis is implemented#
In the previous tutorial we modeled fibroblast-macrophage dynamics as follows:
[2]:
from tdm.raw.breast_mibi import read_single_cell_df
from tdm.cell_types import FIBROBLAST, MACROPHAGE, TUMOR, ENDOTHELIAL
from tdm.analysis import Analysis
from tdm.plot.two_cells.phase_portrait import plot_phase_portrait
single_cell_df = read_single_cell_df()
ana = Analysis(
single_cell_df=single_cell_df,
cell_types_to_model=[FIBROBLAST, MACROPHAGE],
allowed_neighbor_types=[FIBROBLAST, MACROPHAGE, TUMOR, ENDOTHELIAL],
polynomial_dataset_kwargs={"degree":2,},
neighborhood_mode='extrapolate',
)
plot_phase_portrait(ana)
/Users/jonathansomer/Documents/phd/projects/tumor-dynamical-modeling/src/tdm/raw/breast_mibi.py:400: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df[DIVISION_COL] = is_dividing(
/Users/jonathansomer/Documents/phd/projects/tumor-dynamical-modeling/src/tdm/raw/breast_mibi.py:405: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df[X_COL] = microns(df[X_COL])
/Users/jonathansomer/Documents/phd/projects/tumor-dynamical-modeling/src/tdm/raw/breast_mibi.py:406: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df[Y_COL] = microns(df[Y_COL])
1/5 Constructing tissues [V]
2/5 Counting cell-neighbors [V]
3/5 Filtering cell-types [V]
4/5 Transforming features [V]
5/5 Fitting a model [V]

In this tutorial we’ll understand what’s happening under the hood by getting to know the three primary OSDR classes:
Tissue
Dataset
Model
The three classes are used as part of an ordered pipeline. We transform the single-cell dataframe into one Tissue
per image, then into a Dataset
and finally into a Model
1. The Tissue#
A Tissue
represents a single real-life slide.
In a Tissue
, cells with similar x,y
positions will be neighbors. As opposed to cells from the single-cell dataframe, that might have a similar x,y
value but can arrive from different patients or tissue sections.
We can access and plot the first tissue in our analysis as follows:
[3]:
from tdm.plot.tissue import plot_tissue
tissue = ana.tissues[0]
plot_tissue(tissue)

Note: The plot_tissue
function has a few useful utilities for debugging, such as plotting circular neighborhoods at specific locations. For more details, go to the Examples section in the docs.
Each Tissue
exposes a few useful properties:#
[4]:
tissue.subject_id
[4]:
'MB-0282'
[5]:
tissue.img_id
[5]:
1
[6]:
cell_df = tissue.cell_df() # the rows of single_cell_df corresponding with this tissue
cell_df.head(3)
[6]:
x | y | division | cell_type | img_id | subject_id | |
---|---|---|---|---|---|---|
0 | 0.000121 | 0.000004 | False | Tu | 1 | MB-0282 |
1 | 0.000222 | 0.000005 | False | T | 1 | MB-0282 |
2 | 0.000354 | 0.000006 | False | Tu | 1 | MB-0282 |
[7]:
tissue.cell_types()
[7]:
['Tu', 'T', 'En', 'F', 'M', 'B']
[8]:
tissue.tissue_dimensions() # maximal x,y values
[8]:
(0.000577658536585366, 0.000660847826086957)
Check out the docs for more.
2. The Dataset#
A Dataset
object organizes the features and labels for the various cell types#
The OSDR dynamical model relies on modeling the relation between division rate and composition of a cell’s neighborhood.
For example, in the fibroblast-macrophage model the dynamics for fibroblasts can be written as:
Where \(p_F^+,p_F^-\) are statistical models that map the density of fibroblasts and macrophages to fibroblast division or death rates.
To fit the model \(p^+_F\) we construct:
a table of features (“X”) based on the densities of fibroblasts and macrophages
a vector of binary labels (“y”) for observations of cell-division.
Note that because different cell types respond to different growth factors we need to fit a separate model for each cell type.
A Dataset
object organizes the features and labels for the various cell types.
To access the features and labels for a specific cell type, use the dataset.fetch(cell_type)
function:
[9]:
neighbors_dataset = ana.nds
features, labels = neighbors_dataset.fetch(cell_type='F')
[10]:
# densities of the various cell types:
features.head(3)
[10]:
Tu | T | En | F | M | B | |
---|---|---|---|---|---|---|
0 | 20.373117 | 2.037312 | 12.223870 | 25.447740 | 0.0 | 0.0 |
1 | 26.634353 | 1.664647 | 6.658588 | 12.652530 | 0.0 | 0.0 |
2 | 15.285997 | 1.698444 | 8.492221 | 19.682886 | 0.0 | 0.0 |
[11]:
# binary cell-division events:
labels.head(3)
[11]:
division | |
---|---|
0 | False |
1 | False |
2 | False |
The Dataset
class is the base class for several classes that implement the .fetch(cell_type)
API.
An Analysis
typicaly makes use of multiple datasets one after another.
The following steps are performed by the Analysis
class in the fibroblast-macrophage analysis:
[12]:
from tdm.dataset import ExtrapolateNeighborsDataset, ConcatDataset, RestrictedNeighborsDataset, PolynomialDataset
tissues = ana.tissues
# 1. count the number of neighbors for each cell in each tissue:
ndss = [ ExtrapolateNeighborsDataset(t) for t in tissues]
# 2. concatenate features and labels from different tissues:
nds = ConcatDataset(ndss)
# 3. filter cells:
rnds = RestrictedNeighborsDataset(
nds,
allowed_neighbor_types=[FIBROBLAST, MACROPHAGE, TUMOR, ENDOTHELIAL],
keep_types=[FIBROBLAST, MACROPHAGE] # derived from the cell_types_to_model argument for Analysis
)
# 4. tranform features:
pds = PolynomialDataset(rnds, degree=2, log_transform=False, scale_features=False)
The features from the manual pipeline above are equal to those from the Analysis
object:
[13]:
features, labels = pds.fetch('F')
features.head(3)
[13]:
1 | F | M | F^2 | F M | M^2 | |
---|---|---|---|---|---|---|
0 | 1.0 | 5.484926 | 1.494975 | 30.084414 | 8.199829 | 2.234951 |
1 | 1.0 | 7.306431 | 1.261286 | 53.383937 | 9.215501 | 1.590843 |
2 | 1.0 | 14.000000 | 0.000000 | 196.000000 | 0.000000 | 0.000000 |
[14]:
features, labels = ana.pds.fetch(FIBROBLAST)
features.head(3)
[14]:
1 | F | M | F^2 | F M | M^2 | |
---|---|---|---|---|---|---|
0 | 1.0 | 5.484926 | 1.494975 | 30.084414 | 8.199829 | 2.234951 |
1 | 1.0 | 7.306431 | 1.261286 | 53.383937 | 9.215501 | 1.590843 |
2 | 1.0 | 14.000000 | 0.000000 | 196.000000 | 0.000000 | 0.000000 |
Summarizing the steps for preparing the final Dataset
used to fit a Model
:#
We first use
NeighborsDataset
orExtrapolateNeighborsDataset
to count the number of neighbors of each cell in every tissue. The list of neighbors datasets for each cell type is accessible viaana.ndss
.Providing
Analysis
with the argumentneighborhood_mode='exclude'
will useNeighborsDataset
, excluding cells whose neighborhoods exceed tissue limites.Alternatively, providing
Analysis
with the argumentneighborhood_mode='extrapolate'
will useExtrapolateNeighborsDataset
, including all cells in the tissue, while scaling the cell counts according to the fraction of the neighborhood that’s observed.
We use
ConcatDataset
to concatenate the cell counts from the separate tissues (accessible viaana.nds
)We use
RestrictedDataset
to filter out cells we don’t wish to model or cells with unallowed neighbors (accessible viaana.rnds
)We use
PolynomialDataset
to introduce interaction terms between the raw cell-counts. This is also the class that transforms or scales features. (accessible viaana.pds
)
Each dataset receives the previous as its primary input. Finally, a Model
is fit to an instance of PolynomialDataset
.
3. The Model#
The Model
fits probabilistic models of division and death rates for each cell type#
The OSDR Model
emulates the .fit
, .predict
api from packages such as sklearn.
But instead of fitting just one model, Model
fits models for division and death per cell type.
Essentially, Model
does the following:
for cell_type in dataset.cell_types():
features, labels = dataset.fetch(cell_type)
model.fit(features, labels)
To predict the probability of a division or death event we use the .predict
function:#
[15]:
model = ana.model
features, labels = ana.pds.fetch('F')
model.predict(
cell_type='F',
obs='division',
features=features.head(3)
)
[15]:
array([0.02096666, 0.01832152, 0.01025308])
To predict changes in cell numbers we use the delta_cells
function:#
Model
also has a useful method for directly computing the expected difference in the number of cells of each type (e.g \(dF/dt\)), based on the current cell counts:
[16]:
# Important! delta_cells expects raw cell counts and not features:
model.delta_cells(cell_counts={FIBROBLAST: [100], MACROPHAGE: [50]}, return_order=['F', 'M'])
[16]:
array([-51.13920971, -26.05025259])
[17]:
# we can also compute the difference between division and death rates:
model.delta_cells(cell_counts={FIBROBLAST: [100], MACROPHAGE: [50]}, return_order=['F', 'M'], mode='rates')
[17]:
array([-0.5113921 , -0.52100505])
The death rate is estimated as a constant#
Recall, we need probabilistic models for both division and death. In practice, because reliable death markers are lacking, we fit a probabilistic model only for division and we approximate the death rate as a constant.
We approximate the death rate with a constant approximately equal to the mean death rate. The mean death rate is expected to be close to the mean division rate in datasets that are derived from a large number of patients because at the population level we expect the numbers of cells to be relatively stable. That is, there shouldn’t be a rapid exponential rise in T-cells accross an entire population (even if their number is exponentially rising in some individuals).
Note:
If the data is derived from a single individual following an acute event, then the total cell population might be rapidly rising or shrinking. In this cases we can’t approximate the death rate as the mean division rate and approximation of the death rate needs to be done through markers or through prior knowledge of death rates in similar instances.
If we are comparing subsets of patients, then it is better to compute a shared death rate from both subsets together, and use that rate in the models for both groups. If we take the mean division rate as the death rate in each group separately we inadvertently “normalize-out” the differences in division rates between the two groups.
[18]:
model.death_prob(FIBROBLAST)
[18]:
0.011392097090435505
[19]:
model.predict(
cell_type=FIBROBLAST,
obs='death',
features=features.head(3)
)
[19]:
array([0.0113921, 0.0113921, 0.0113921])