import starsim as ss
import numpy as np
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt
import sciris as sc
Samples
As Starsim models are usually stochastic, it is often desirable to run a single scenario multiple times with different random seeds. The role of the Samples class is to facilitate working with large numbers of simulations and scenarios, to ease:
- Loading large result sets
- Filtering/selecting simulation runs
- Plotting individual simulations and aggregate results
- Slicing result sets to compare scenarios
Essentially, if we think of the processed results of a model run as being
- A collection of scalar outputs (e.g., cumulative infections, total deaths)
- A dataframe of time-varying outputs (e.g., new diagnoses per day, number of people on treatment each day)
then the classes Dataset and Samples manage collections of these results. In particular, the Samples class manages different random samples of the same parameters, and the Dataset class manages a collection of Samples objects.
These classes are particularly designed to facilitate working with tens of thousands of simulation runs, where other approaches such as those based on the MultiSim class may not be feasible.
Obtaining simulation output
To demonstrate usage of this class, we will first construct the kinds of output that the Samples class stores. We begin by running a basic simulation using the SIR model:
ppl = ss.People(5000)
net = ss.ndict(ss.RandomNet(n_contacts=ss.poisson(5)))
sir = ss.SIR()
sim = ss.Sim(people=ppl, networks=net, diseases=sir, rand_seed=0)
sim.run()
Initializing sim with 5000 agents
Running 2000.01.01 ( 0/51) (0.00 s) ———————————————————— 2%
Running 2010.01.01 (10/51) (0.14 s) ••••———————————————— 22%
Running 2020.01.01 (20/51) (0.16 s) ••••••••———————————— 41%
Running 2030.01.01 (30/51) (0.19 s) ••••••••••••———————— 61%
Running 2040.01.01 (40/51) (0.21 s) ••••••••••••••••———— 80%
Running 2050.01.01 (50/51) (0.22 s) •••••••••••••••••••• 100%
Dataframe output
A Sim instance is (in general) too large and complex to store efficiently on disk: the file size and loading time make it prohibitive to work with tens of thousands of simulations. Therefore, rather than storing entire Sim instances, we instead store dataframes containing just the simulation results and any other pre-processed calculated quantities. There are, broadly speaking, two types of outputs:
- Scalar outputs at each timepoint (e.g., daily new cases)
- Scalar outputs for each simulation (e.g., total number of deaths)
These outputs can each be produced from a Sim: the former has a tabular structure, and the latter has a dictionary structure (which can later be assembled into a table where the rows correspond to each simulation). The to_df method is a quick way to obtain a dataframe with the appropriate structure, retaining all results from the Sim.
In real-world use, it is often helpful to write your own function to extract a dataframe of simulation outputs, because typically some of the outputs need to be extracted from custom Analyzers; a sketch of such a function is shown after the table below.
sim.to_df()
timevec | randomnet_n_edges | sir_n_susceptible | sir_n_infected | sir_n_recovered | sir_prevalence | sir_new_infections | sir_cum_infections | n_alive | new_deaths | cum_deaths | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2000-01-01 | 12616.0 | 4937.0 | 63.0 | 0.0 | 0.012600 | 15.0 | 15.0 | 5000.0 | 0.0 | 0.0 |
1 | 2001-01-01 | 12598.0 | 4905.0 | 95.0 | 0.0 | 0.019000 | 32.0 | 47.0 | 5000.0 | 0.0 | 0.0 |
2 | 2002-01-01 | 12436.0 | 4862.0 | 138.0 | 0.0 | 0.027600 | 43.0 | 90.0 | 5000.0 | 0.0 | 0.0 |
3 | 2003-01-01 | 12570.0 | 4795.0 | 205.0 | 0.0 | 0.041000 | 67.0 | 157.0 | 5000.0 | 0.0 | 0.0 |
4 | 2004-01-01 | 12525.0 | 4699.0 | 301.0 | 0.0 | 0.060200 | 96.0 | 253.0 | 5000.0 | 0.0 | 0.0 |
5 | 2005-01-01 | 12456.0 | 4555.0 | 434.0 | 11.0 | 0.086800 | 144.0 | 397.0 | 5000.0 | 0.0 | 0.0 |
6 | 2006-01-01 | 12662.0 | 4373.0 | 587.0 | 40.0 | 0.117400 | 182.0 | 579.0 | 5000.0 | 0.0 | 0.0 |
7 | 2007-01-01 | 12486.0 | 4142.0 | 777.0 | 80.0 | 0.155400 | 231.0 | 810.0 | 4999.0 | 1.0 | 0.0 |
8 | 2008-01-01 | 12516.0 | 3843.0 | 1034.0 | 122.0 | 0.206841 | 299.0 | 1109.0 | 4999.0 | 0.0 | 1.0 |
9 | 2009-01-01 | 12412.0 | 3524.0 | 1290.0 | 184.0 | 0.258052 | 319.0 | 1428.0 | 4998.0 | 1.0 | 1.0 |
10 | 2010-01-01 | 12658.0 | 3128.0 | 1597.0 | 272.0 | 0.319528 | 396.0 | 1824.0 | 4997.0 | 1.0 | 2.0 |
11 | 2011-01-01 | 12534.0 | 2682.0 | 1921.0 | 393.0 | 0.384431 | 446.0 | 2270.0 | 4996.0 | 1.0 | 3.0 |
12 | 2012-01-01 | 12634.0 | 2276.0 | 2159.0 | 559.0 | 0.432146 | 406.0 | 2676.0 | 4994.0 | 2.0 | 4.0 |
13 | 2013-01-01 | 12498.0 | 1882.0 | 2346.0 | 766.0 | 0.469764 | 394.0 | 3070.0 | 4994.0 | 0.0 | 6.0 |
14 | 2014-01-01 | 12401.0 | 1523.0 | 2456.0 | 1014.0 | 0.491790 | 359.0 | 3429.0 | 4993.0 | 1.0 | 6.0 |
15 | 2015-01-01 | 12482.0 | 1250.0 | 2421.0 | 1319.0 | 0.484879 | 273.0 | 3702.0 | 4990.0 | 3.0 | 7.0 |
16 | 2016-01-01 | 12532.0 | 1035.0 | 2275.0 | 1679.0 | 0.455912 | 215.0 | 3917.0 | 4989.0 | 1.0 | 10.0 |
17 | 2017-01-01 | 12585.0 | 880.0 | 2030.0 | 2076.0 | 0.406895 | 155.0 | 4072.0 | 4986.0 | 3.0 | 11.0 |
18 | 2018-01-01 | 12364.0 | 746.0 | 1749.0 | 2489.0 | 0.350782 | 134.0 | 4206.0 | 4984.0 | 2.0 | 14.0 |
19 | 2019-01-01 | 12565.0 | 649.0 | 1449.0 | 2877.0 | 0.290730 | 97.0 | 4303.0 | 4975.0 | 9.0 | 16.0 |
20 | 2020-01-01 | 12442.0 | 573.0 | 1159.0 | 3241.0 | 0.232965 | 76.0 | 4379.0 | 4973.0 | 2.0 | 25.0 |
21 | 2021-01-01 | 12325.0 | 540.0 | 868.0 | 3559.0 | 0.174543 | 33.0 | 4412.0 | 4967.0 | 6.0 | 27.0 |
22 | 2022-01-01 | 12465.0 | 505.0 | 667.0 | 3793.0 | 0.134286 | 35.0 | 4447.0 | 4965.0 | 2.0 | 33.0 |
23 | 2023-01-01 | 12490.0 | 485.0 | 500.0 | 3979.0 | 0.100705 | 20.0 | 4467.0 | 4964.0 | 1.0 | 35.0 |
24 | 2024-01-01 | 12396.0 | 479.0 | 347.0 | 4138.0 | 0.069903 | 6.0 | 4473.0 | 4964.0 | 0.0 | 36.0 |
25 | 2025-01-01 | 12400.0 | 468.0 | 246.0 | 4248.0 | 0.049557 | 11.0 | 4484.0 | 4962.0 | 2.0 | 36.0 |
26 | 2026-01-01 | 12420.0 | 460.0 | 175.0 | 4326.0 | 0.035268 | 8.0 | 4492.0 | 4961.0 | 1.0 | 38.0 |
27 | 2027-01-01 | 12511.0 | 453.0 | 116.0 | 4392.0 | 0.023382 | 7.0 | 4499.0 | 4961.0 | 0.0 | 39.0 |
28 | 2028-01-01 | 12410.0 | 448.0 | 82.0 | 4431.0 | 0.016529 | 5.0 | 4504.0 | 4961.0 | 0.0 | 39.0 |
29 | 2029-01-01 | 12492.0 | 447.0 | 56.0 | 4458.0 | 0.011288 | 1.0 | 4505.0 | 4961.0 | 0.0 | 39.0 |
30 | 2030-01-01 | 12537.0 | 445.0 | 40.0 | 4474.0 | 0.008063 | 2.0 | 4507.0 | 4959.0 | 2.0 | 39.0 |
31 | 2031-01-01 | 12199.0 | 443.0 | 31.0 | 4484.0 | 0.006251 | 2.0 | 4509.0 | 4958.0 | 1.0 | 41.0 |
32 | 2032-01-01 | 12290.0 | 440.0 | 24.0 | 4494.0 | 0.004841 | 3.0 | 4512.0 | 4958.0 | 0.0 | 42.0 |
33 | 2033-01-01 | 12387.0 | 439.0 | 20.0 | 4499.0 | 0.004034 | 1.0 | 4513.0 | 4958.0 | 0.0 | 42.0 |
34 | 2034-01-01 | 12299.0 | 438.0 | 16.0 | 4504.0 | 0.003227 | 1.0 | 4514.0 | 4958.0 | 0.0 | 42.0 |
35 | 2035-01-01 | 12533.0 | 438.0 | 11.0 | 4509.0 | 0.002219 | 0.0 | 4514.0 | 4958.0 | 0.0 | 42.0 |
36 | 2036-01-01 | 12577.0 | 437.0 | 9.0 | 4512.0 | 0.001815 | 1.0 | 4515.0 | 4958.0 | 0.0 | 42.0 |
37 | 2037-01-01 | 12405.0 | 437.0 | 6.0 | 4515.0 | 0.001210 | 0.0 | 4515.0 | 4958.0 | 0.0 | 42.0 |
38 | 2038-01-01 | 12471.0 | 437.0 | 3.0 | 4518.0 | 0.000605 | 0.0 | 4515.0 | 4958.0 | 0.0 | 42.0 |
39 | 2039-01-01 | 12483.0 | 437.0 | 3.0 | 4518.0 | 0.000605 | 0.0 | 4515.0 | 4958.0 | 0.0 | 42.0 |
40 | 2040-01-01 | 12359.0 | 437.0 | 1.0 | 4520.0 | 0.000202 | 0.0 | 4515.0 | 4958.0 | 0.0 | 42.0 |
41 | 2041-01-01 | 12419.0 | 437.0 | 1.0 | 4520.0 | 0.000202 | 0.0 | 4515.0 | 4958.0 | 0.0 | 42.0 |
42 | 2042-01-01 | 12242.0 | 437.0 | 1.0 | 4520.0 | 0.000202 | 0.0 | 4515.0 | 4958.0 | 0.0 | 42.0 |
43 | 2043-01-01 | 12449.0 | 437.0 | 0.0 | 4521.0 | 0.000000 | 0.0 | 4515.0 | 4958.0 | 0.0 | 42.0 |
44 | 2044-01-01 | 12400.0 | 437.0 | 0.0 | 4521.0 | 0.000000 | 0.0 | 4515.0 | 4958.0 | 0.0 | 42.0 |
45 | 2045-01-01 | 12324.0 | 437.0 | 0.0 | 4521.0 | 0.000000 | 0.0 | 4515.0 | 4958.0 | 0.0 | 42.0 |
46 | 2046-01-01 | 12445.0 | 437.0 | 0.0 | 4521.0 | 0.000000 | 0.0 | 4515.0 | 4958.0 | 0.0 | 42.0 |
47 | 2047-01-01 | 12368.0 | 437.0 | 0.0 | 4521.0 | 0.000000 | 0.0 | 4515.0 | 4958.0 | 0.0 | 42.0 |
48 | 2048-01-01 | 12479.0 | 437.0 | 0.0 | 4521.0 | 0.000000 | 0.0 | 4515.0 | 4958.0 | 0.0 | 42.0 |
49 | 2049-01-01 | 12304.0 | 437.0 | 0.0 | 4521.0 | 0.000000 | 0.0 | 4515.0 | 4958.0 | 0.0 | 42.0 |
50 | 2050-01-01 | 12485.0 | 437.0 | 0.0 | 4521.0 | 0.000000 | 0.0 | 4515.0 | 4958.0 | 0.0 | 42.0 |
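For example, a minimal sketch of a custom extraction function, which here just augments the standard output with a derived column (the extract_df name and attack_rate column are illustrative, not part of Starsim):

def extract_df(sim):
    df = sim.to_df()
    df['attack_rate'] = df['sir_cum_infections'] / df['n_alive']  # Fraction ever infected
    return df

In practice, such a function would also pull quantities out of any custom analyzers attached to the sim.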
Scalar/summary outputs
We can also consider extracting a summary dictionary of scalar values. For example:
summary = {}
summary['seed'] = sim.pars['rand_seed']
summary['p_death'] = sim.diseases[0].pars.p_death.pars.p
summary['cum_infections'] = sum(sim.results.sir.new_infections)
summary['cum_deaths'] = sum(sim.results.new_deaths)
summary
{'seed': 0, 'p_death': 0.01, 'cum_infections': 4515.0, 'cum_deaths': 42.0}
Notice that in the example above, the summary contains both simulation inputs (the seed and the probability of death) and simulation outputs (total infections and total deaths). The summary should contain sufficient information about the simulation inputs to identify the simulation: the seed should generally be present, and the other inputs normally correspond to the variables that scenarios are being run over. In this example, we will run scenarios comparing simulations with different probabilities of death, so we need to include the death probability in the simulation summary.
Running the model
For usage at scale, the steps of creating a simulation, running it and producing these outputs are usually encapsulated in functions:
def get_sim(seed, p_death):
    ppl = ss.People(5000)
    net = ss.RandomNet(n_contacts=ss.poisson(5))
    sir = ss.SIR(p_death=p_death)
    sim = ss.Sim(people=ppl, networks=net, diseases=sir, rand_seed=seed)
    sim.init(verbose=0)
    return sim
def run_sim(seed, p_death):
    sim = get_sim(seed, p_death)
    sim.run(verbose=0)
    df = sim.to_df()

    summary = {}
    summary['seed'] = sim.pars['rand_seed']
    summary['p_death'] = sim.diseases[0].pars.p_death.pars.p
    summary['cum_infections'] = sum(sim.results.sir.new_infections)
    summary['cum_deaths'] = sum(sim.results.new_deaths)

    return df, summary
The functions above could be combined into a single function. However, in real-world usage it is often convenient to be able to construct a simulation independently of running it (e.g., for diagnostic purposes, or to allow running the sim in a range of different ways). The structure above, with separate get_sim() and run_sim() functions, is recommended as standard practice.
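For instance, the split makes it possible to build and inspect a sim without committing to a run (a minimal sketch):

sim = get_sim(seed=0, p_death=0.2)
print(sim.pars['rand_seed'])  # Inspect the initialized sim's inputs before running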
Now running a simulation for a given p_death/seed value and returning the processed outputs can be done in a single step:
# Scalar output
df, summary = run_sim(0, 0.2)
summary
{'seed': 0, 'p_death': 0.2, 'cum_infections': 4628.0, 'cum_deaths': 934.0}
We can produce all of the samples associated with a scenario by iterating over the input seed values. This is done with a basic loop here, but could be done in more sophisticated ways to leverage parallel computing (e.g., with sc.parallelize for single-host parallelization, or with celery for distributed computation).
# Run a collection of sims
n = 20
seeds = np.arange(n)
outputs = [run_sim(seed, 0.2) for seed in seeds]
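For example, the same loop could plausibly be run in parallel across CPU cores with Sciris (a sketch, using run_sim and seeds as defined above):

# Run each seed in a separate process; returns the same list of (df, summary) tuples
outputs = sc.parallelize(run_sim, iterkwargs=dict(seed=seeds), kwargs=dict(p_death=0.2))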
Saving and loading the samples
We have now produced simulation outputs (dataframes and summary statistics) for 20 simulation runs. The outputs here are a list of tuples containing the dataframe and dictionary outputs for each sample. This list can be passed to the ss.Samples class to produce a single compressed file on disk:
resultsdir = Path('results')
resultsdir.mkdir(exist_ok=True, parents=True)
ss.Samples.new(resultsdir, outputs, identifiers=["p_death"])
list(resultsdir.iterdir())
Zip file saved to "/home/cliffk/idm/starsim/docs/user_guide/results/0.2.zip"
[PosixPath('results/0.2.zip')]
Notice that a list of identifiers should be passed to the Samples constructor. This is a list of keys in the simulation summary dictionaries that identifies the scenario. These should be model inputs rather than model outputs, and they should be the same for all of the outputs passed into the Samples object. If no file name is explicitly provided, the file will automatically be assigned a name based on the identifiers.
The Samples file internally contains metadata recording the identifiers. When Samples are accessed using the Dataset class, they can be looked up via this internal metadata. Therefore, for a typical workflow, the file name largely doesn’t matter, and it usually doesn’t need to be manually specified.
The saved file can be loaded and accessed via the Samples class. Importantly, individual files can be extracted from a .zip file without decompressing the entire archive. This means that loading the summary dataframe and using it to selectively load the full outputs for individual runs can be done efficiently. For example, retrieving a single result from a Samples file takes a similar amount of time regardless of whether the file contains 10 samples or 100,000 samples.
# Load the samples
res = ss.Samples('results/0.2.zip')
res.summary
cum_infections | cum_deaths | ||
---|---|---|---|
seed | p_death | ||
0 | 0.2 | 4628.0 | 934.0 |
1 | 0.2 | 4600.0 | 909.0 |
2 | 0.2 | 4618.0 | 926.0 |
3 | 0.2 | 4625.0 | 992.0 |
4 | 0.2 | 4564.0 | 915.0 |
5 | 0.2 | 4656.0 | 971.0 |
6 | 0.2 | 4633.0 | 979.0 |
7 | 0.2 | 4672.0 | 966.0 |
8 | 0.2 | 4613.0 | 873.0 |
9 | 0.2 | 4645.0 | 960.0 |
10 | 0.2 | 4569.0 | 940.0 |
11 | 0.2 | 4639.0 | 920.0 |
12 | 0.2 | 4605.0 | 939.0 |
13 | 0.2 | 4619.0 | 926.0 |
14 | 0.2 | 4664.0 | 978.0 |
15 | 0.2 | 4599.0 | 910.0 |
16 | 0.2 | 4622.0 | 900.0 |
17 | 0.2 | 4603.0 | 942.0 |
18 | 0.2 | 4645.0 | 901.0 |
19 | 0.2 | 4623.0 | 884.0 |
When the Samples file was created, a dictionary of scalars was provided for each result. These are automatically used to populate a ‘summary’ dataframe, where each identifier (and the seed) are used as the index, and the remaining keys appear as columns, as shown above. As a shortcut, columns of the summary dataframe can be accessed by indexing the Samples object directly, without having to access the .summary attribute, e.g.:
res['cum_infections']
seed p_death
0 0.2 4628.0
1 0.2 4600.0
2 0.2 4618.0
3 0.2 4625.0
4 0.2 4564.0
5 0.2 4656.0
6 0.2 4633.0
7 0.2 4672.0
8 0.2 4613.0
9 0.2 4645.0
10 0.2 4569.0
11 0.2 4639.0
12 0.2 4605.0
13 0.2 4619.0
14 0.2 4664.0
15 0.2 4599.0
16 0.2 4622.0
17 0.2 4603.0
18 0.2 4645.0
19 0.2 4623.0
Name: cum_infections, dtype: float64
Each simulation is uniquely identified by its seed, and the time series dataframe for each simulation can be accessed by indexing the Samples object with the seed:
res[0]
Unnamed: 0 | randomnet_n_edges | sir_n_susceptible | sir_n_infected | sir_n_recovered | sir_prevalence | sir_new_infections | sir_cum_infections | n_alive | new_deaths | cum_deaths | |
---|---|---|---|---|---|---|---|---|---|---|---|
timevec | |||||||||||
2000-01-01 | 0 | 12616.0 | 4937.0 | 63.0 | 0.0 | 0.012600 | 15.0 | 15.0 | 5000.0 | 0.0 | 0.0 |
2001-01-01 | 1 | 12598.0 | 4905.0 | 95.0 | 0.0 | 0.019000 | 32.0 | 47.0 | 5000.0 | 0.0 | 0.0 |
2002-01-01 | 2 | 12436.0 | 4862.0 | 138.0 | 0.0 | 0.027600 | 43.0 | 90.0 | 5000.0 | 0.0 | 0.0 |
2003-01-01 | 3 | 12570.0 | 4795.0 | 205.0 | 0.0 | 0.041000 | 67.0 | 157.0 | 5000.0 | 0.0 | 0.0 |
2004-01-01 | 4 | 12525.0 | 4699.0 | 301.0 | 0.0 | 0.060200 | 96.0 | 253.0 | 5000.0 | 0.0 | 0.0 |
2005-01-01 | 5 | 12456.0 | 4554.0 | 435.0 | 9.0 | 0.087000 | 145.0 | 398.0 | 4998.0 | 2.0 | 0.0 |
2006-01-01 | 6 | 12660.0 | 4405.0 | 555.0 | 33.0 | 0.111044 | 149.0 | 547.0 | 4993.0 | 5.0 | 2.0 |
2007-01-01 | 7 | 12466.0 | 4195.0 | 724.0 | 67.0 | 0.145003 | 210.0 | 757.0 | 4986.0 | 7.0 | 7.0 |
2008-01-01 | 8 | 12502.0 | 3916.0 | 961.0 | 105.0 | 0.192740 | 279.0 | 1036.0 | 4982.0 | 4.0 | 14.0 |
2009-01-01 | 9 | 12389.0 | 3587.0 | 1227.0 | 155.0 | 0.246287 | 329.0 | 1365.0 | 4969.0 | 13.0 | 18.0 |
2010-01-01 | 10 | 12544.0 | 3203.0 | 1521.0 | 227.0 | 0.306098 | 384.0 | 1749.0 | 4951.0 | 18.0 | 31.0 |
2011-01-01 | 11 | 12437.0 | 2776.0 | 1832.0 | 312.0 | 0.370026 | 427.0 | 2176.0 | 4920.0 | 31.0 | 49.0 |
2012-01-01 | 12 | 12381.0 | 2358.0 | 2101.0 | 428.0 | 0.427033 | 418.0 | 2594.0 | 4887.0 | 33.0 | 80.0 |
2013-01-01 | 13 | 12178.0 | 1954.0 | 2305.0 | 588.0 | 0.471660 | 404.0 | 2998.0 | 4847.0 | 40.0 | 113.0 |
2014-01-01 | 14 | 12053.0 | 1567.0 | 2454.0 | 784.0 | 0.506293 | 387.0 | 3385.0 | 4805.0 | 42.0 | 153.0 |
2015-01-01 | 15 | 12043.0 | 1236.0 | 2487.0 | 1030.0 | 0.517586 | 331.0 | 3716.0 | 4753.0 | 52.0 | 195.0 |
2016-01-01 | 16 | 11899.0 | 975.0 | 2388.0 | 1318.0 | 0.502420 | 261.0 | 3977.0 | 4681.0 | 72.0 | 247.0 |
2017-01-01 | 17 | 11815.0 | 796.0 | 2165.0 | 1628.0 | 0.462508 | 179.0 | 4156.0 | 4589.0 | 92.0 | 319.0 |
2018-01-01 | 18 | 11372.0 | 685.0 | 1868.0 | 1951.0 | 0.407060 | 111.0 | 4267.0 | 4504.0 | 85.0 | 411.0 |
2019-01-01 | 19 | 11297.0 | 576.0 | 1587.0 | 2265.0 | 0.352353 | 109.0 | 4376.0 | 4428.0 | 76.0 | 496.0 |
2020-01-01 | 20 | 11069.0 | 507.0 | 1252.0 | 2602.0 | 0.282746 | 69.0 | 4445.0 | 4361.0 | 67.0 | 572.0 |
2021-01-01 | 21 | 10853.0 | 450.0 | 972.0 | 2869.0 | 0.222885 | 57.0 | 4502.0 | 4291.0 | 70.0 | 639.0 |
2022-01-01 | 22 | 10761.0 | 411.0 | 732.0 | 3097.0 | 0.170590 | 39.0 | 4541.0 | 4240.0 | 51.0 | 709.0 |
2023-01-01 | 23 | 10648.0 | 385.0 | 526.0 | 3276.0 | 0.124057 | 26.0 | 4567.0 | 4187.0 | 53.0 | 760.0 |
2024-01-01 | 24 | 10417.0 | 362.0 | 388.0 | 3404.0 | 0.092668 | 23.0 | 4590.0 | 4154.0 | 33.0 | 813.0 |
2025-01-01 | 25 | 10348.0 | 350.0 | 281.0 | 3495.0 | 0.067646 | 12.0 | 4602.0 | 4126.0 | 28.0 | 846.0 |
2026-01-01 | 26 | 10319.0 | 340.0 | 206.0 | 3561.0 | 0.049927 | 10.0 | 4612.0 | 4107.0 | 19.0 | 874.0 |
2027-01-01 | 27 | 10292.0 | 333.0 | 148.0 | 3616.0 | 0.036036 | 7.0 | 4619.0 | 4097.0 | 10.0 | 893.0 |
2028-01-01 | 28 | 10198.0 | 332.0 | 103.0 | 3653.0 | 0.025140 | 1.0 | 4620.0 | 4088.0 | 9.0 | 903.0 |
2029-01-01 | 29 | 10251.0 | 330.0 | 67.0 | 3687.0 | 0.016389 | 2.0 | 4622.0 | 4084.0 | 4.0 | 912.0 |
2030-01-01 | 30 | 10323.0 | 327.0 | 48.0 | 3699.0 | 0.011753 | 3.0 | 4625.0 | 4074.0 | 10.0 | 916.0 |
2031-01-01 | 31 | 10084.0 | 325.0 | 34.0 | 3712.0 | 0.008346 | 2.0 | 4627.0 | 4071.0 | 3.0 | 926.0 |
2032-01-01 | 32 | 10072.0 | 325.0 | 19.0 | 3726.0 | 0.004667 | 0.0 | 4627.0 | 4070.0 | 1.0 | 929.0 |
2033-01-01 | 33 | 10177.0 | 324.0 | 12.0 | 3732.0 | 0.002948 | 1.0 | 4628.0 | 4068.0 | 2.0 | 930.0 |
2034-01-01 | 34 | 10122.0 | 324.0 | 10.0 | 3734.0 | 0.002458 | 0.0 | 4628.0 | 4068.0 | 0.0 | 932.0 |
2035-01-01 | 35 | 10260.0 | 324.0 | 7.0 | 3737.0 | 0.001721 | 0.0 | 4628.0 | 4068.0 | 0.0 | 932.0 |
2036-01-01 | 36 | 10318.0 | 324.0 | 6.0 | 3737.0 | 0.001475 | 0.0 | 4628.0 | 4067.0 | 1.0 | 932.0 |
2037-01-01 | 37 | 10182.0 | 324.0 | 3.0 | 3739.0 | 0.000738 | 0.0 | 4628.0 | 4066.0 | 1.0 | 933.0 |
2038-01-01 | 38 | 10212.0 | 324.0 | 1.0 | 3741.0 | 0.000246 | 0.0 | 4628.0 | 4066.0 | 0.0 | 934.0 |
2039-01-01 | 39 | 10319.0 | 324.0 | 1.0 | 3741.0 | 0.000246 | 0.0 | 4628.0 | 4066.0 | 0.0 | 934.0 |
2040-01-01 | 40 | 10139.0 | 324.0 | 1.0 | 3741.0 | 0.000246 | 0.0 | 4628.0 | 4066.0 | 0.0 | 934.0 |
2041-01-01 | 41 | 10179.0 | 324.0 | 0.0 | 3742.0 | 0.000000 | 0.0 | 4628.0 | 4066.0 | 0.0 | 934.0 |
2042-01-01 | 42 | 10023.0 | 324.0 | 0.0 | 3742.0 | 0.000000 | 0.0 | 4628.0 | 4066.0 | 0.0 | 934.0 |
2043-01-01 | 43 | 10143.0 | 324.0 | 0.0 | 3742.0 | 0.000000 | 0.0 | 4628.0 | 4066.0 | 0.0 | 934.0 |
2044-01-01 | 44 | 10176.0 | 324.0 | 0.0 | 3742.0 | 0.000000 | 0.0 | 4628.0 | 4066.0 | 0.0 | 934.0 |
2045-01-01 | 45 | 10190.0 | 324.0 | 0.0 | 3742.0 | 0.000000 | 0.0 | 4628.0 | 4066.0 | 0.0 | 934.0 |
2046-01-01 | 46 | 10213.0 | 324.0 | 0.0 | 3742.0 | 0.000000 | 0.0 | 4628.0 | 4066.0 | 0.0 | 934.0 |
2047-01-01 | 47 | 10090.0 | 324.0 | 0.0 | 3742.0 | 0.000000 | 0.0 | 4628.0 | 4066.0 | 0.0 | 934.0 |
2048-01-01 | 48 | 10242.0 | 324.0 | 0.0 | 3742.0 | 0.000000 | 0.0 | 4628.0 | 4066.0 | 0.0 | 934.0 |
2049-01-01 | 49 | 10172.0 | 324.0 | 0.0 | 3742.0 | 0.000000 | 0.0 | 4628.0 | 4066.0 | 0.0 | 934.0 |
2050-01-01 | 50 | 10206.0 | 324.0 | 0.0 | 3742.0 | 0.000000 | 0.0 | 4628.0 | 4066.0 | 0.0 | 934.0 |
The dataframes in the Samples object are cached, so the dataframes don’t all need to be loaded in order to start working with the file. The first time a dataframe is accessed, it will be loaded from disk; subsequent requests will return a cached version instead. The cached dataframe is copied each time it is retrieved, to prevent accidentally modifying the original data.
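A quick sketch verifying this copy-on-retrieve behaviour (the scratch column is purely illustrative):

df = res[0]
df['scratch'] = 1.0                     # Modify the returned copy...
assert 'scratch' not in res[0].columns  # ...the cached original is unaffected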
Common analysis operations
Here are some examples of common analyses that can be performed using functionality in the Samples class.
Plotting summary quantities
Often it’s useful to be able to plot distributions of summary quantities, such as the total infections. This can be done by directly indexing the Samples object and then using the appropriate plotting command:
plt.hist(res['cum_infections'], density=True)
plt.xlabel('Total infections')
plt.ylabel('Probability density')
Text(0, 0.5, 'Probability density')
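Since indexing returns an ordinary pandas Series, the usual Series methods apply as well; for example, a sketch summarizing the distribution with quantiles:

res['cum_infections'].quantile([0.025, 0.5, 0.975])  # Median and 95% interval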
Plotting time series
Time series plots can be obtained by accessing the dataframes associated with each seed, and then plotting quantities from those. For convenience, iterating over the Samples object will automatically iterate over all of the dataframes associated with each seed. For example:
for df in res:
    plt.plot(df['sir_new_infections'], color='b', alpha=0.1)
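Individual runs can also be combined into an aggregate curve; for example, a sketch overlaying the median across samples (assuming all runs share the same time points, as they do here):

median = np.median([df['sir_new_infections'] for df in res], axis=0)
plt.plot(res[0].index, median, color='k', lw=2, label='Median')
plt.legend()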
Other ways to access content
We have seen so far that we can use:
- res.summary - retrieve the dataframe of summary outputs
- res[summary_column] - retrieve a column of the summary dataframe
- res[seed] - retrieve the time series dataframe associated with one of the simulations
- for df in res - iterate over the time series dataframes
Sometimes it is useful to have access to both the summary dictionary and the time series dataframe associated with a single sample. These can be accessed together using the get method, which takes in a seed and returns both outputs for that seed:
res.get(0) # Retrieve both summary quantities and dataframes
(#0. 'p_death': 0.2
#1. 'cum_infections': 4628.0
#2. 'cum_deaths': 934.0,
Unnamed: 0 randomnet_n_edges sir_n_susceptible sir_n_infected \
timevec
2000-01-01 0 12616.0 4937.0 63.0
2001-01-01 1 12598.0 4905.0 95.0
2002-01-01 2 12436.0 4862.0 138.0
2003-01-01 3 12570.0 4795.0 205.0
2004-01-01 4 12525.0 4699.0 301.0
2005-01-01 5 12456.0 4554.0 435.0
2006-01-01 6 12660.0 4405.0 555.0
2007-01-01 7 12466.0 4195.0 724.0
2008-01-01 8 12502.0 3916.0 961.0
2009-01-01 9 12389.0 3587.0 1227.0
2010-01-01 10 12544.0 3203.0 1521.0
2011-01-01 11 12437.0 2776.0 1832.0
2012-01-01 12 12381.0 2358.0 2101.0
2013-01-01 13 12178.0 1954.0 2305.0
2014-01-01 14 12053.0 1567.0 2454.0
2015-01-01 15 12043.0 1236.0 2487.0
2016-01-01 16 11899.0 975.0 2388.0
2017-01-01 17 11815.0 796.0 2165.0
2018-01-01 18 11372.0 685.0 1868.0
2019-01-01 19 11297.0 576.0 1587.0
2020-01-01 20 11069.0 507.0 1252.0
2021-01-01 21 10853.0 450.0 972.0
2022-01-01 22 10761.0 411.0 732.0
2023-01-01 23 10648.0 385.0 526.0
2024-01-01 24 10417.0 362.0 388.0
2025-01-01 25 10348.0 350.0 281.0
2026-01-01 26 10319.0 340.0 206.0
2027-01-01 27 10292.0 333.0 148.0
2028-01-01 28 10198.0 332.0 103.0
2029-01-01 29 10251.0 330.0 67.0
2030-01-01 30 10323.0 327.0 48.0
2031-01-01 31 10084.0 325.0 34.0
2032-01-01 32 10072.0 325.0 19.0
2033-01-01 33 10177.0 324.0 12.0
2034-01-01 34 10122.0 324.0 10.0
2035-01-01 35 10260.0 324.0 7.0
2036-01-01 36 10318.0 324.0 6.0
2037-01-01 37 10182.0 324.0 3.0
2038-01-01 38 10212.0 324.0 1.0
2039-01-01 39 10319.0 324.0 1.0
2040-01-01 40 10139.0 324.0 1.0
2041-01-01 41 10179.0 324.0 0.0
2042-01-01 42 10023.0 324.0 0.0
2043-01-01 43 10143.0 324.0 0.0
2044-01-01 44 10176.0 324.0 0.0
2045-01-01 45 10190.0 324.0 0.0
2046-01-01 46 10213.0 324.0 0.0
2047-01-01 47 10090.0 324.0 0.0
2048-01-01 48 10242.0 324.0 0.0
2049-01-01 49 10172.0 324.0 0.0
2050-01-01 50 10206.0 324.0 0.0
sir_n_recovered sir_prevalence sir_new_infections \
timevec
2000-01-01 0.0 0.012600 15.0
2001-01-01 0.0 0.019000 32.0
2002-01-01 0.0 0.027600 43.0
2003-01-01 0.0 0.041000 67.0
2004-01-01 0.0 0.060200 96.0
2005-01-01 9.0 0.087000 145.0
2006-01-01 33.0 0.111044 149.0
2007-01-01 67.0 0.145003 210.0
2008-01-01 105.0 0.192740 279.0
2009-01-01 155.0 0.246287 329.0
2010-01-01 227.0 0.306098 384.0
2011-01-01 312.0 0.370026 427.0
2012-01-01 428.0 0.427033 418.0
2013-01-01 588.0 0.471660 404.0
2014-01-01 784.0 0.506293 387.0
2015-01-01 1030.0 0.517586 331.0
2016-01-01 1318.0 0.502420 261.0
2017-01-01 1628.0 0.462508 179.0
2018-01-01 1951.0 0.407060 111.0
2019-01-01 2265.0 0.352353 109.0
2020-01-01 2602.0 0.282746 69.0
2021-01-01 2869.0 0.222885 57.0
2022-01-01 3097.0 0.170590 39.0
2023-01-01 3276.0 0.124057 26.0
2024-01-01 3404.0 0.092668 23.0
2025-01-01 3495.0 0.067646 12.0
2026-01-01 3561.0 0.049927 10.0
2027-01-01 3616.0 0.036036 7.0
2028-01-01 3653.0 0.025140 1.0
2029-01-01 3687.0 0.016389 2.0
2030-01-01 3699.0 0.011753 3.0
2031-01-01 3712.0 0.008346 2.0
2032-01-01 3726.0 0.004667 0.0
2033-01-01 3732.0 0.002948 1.0
2034-01-01 3734.0 0.002458 0.0
2035-01-01 3737.0 0.001721 0.0
2036-01-01 3737.0 0.001475 0.0
2037-01-01 3739.0 0.000738 0.0
2038-01-01 3741.0 0.000246 0.0
2039-01-01 3741.0 0.000246 0.0
2040-01-01 3741.0 0.000246 0.0
2041-01-01 3742.0 0.000000 0.0
2042-01-01 3742.0 0.000000 0.0
2043-01-01 3742.0 0.000000 0.0
2044-01-01 3742.0 0.000000 0.0
2045-01-01 3742.0 0.000000 0.0
2046-01-01 3742.0 0.000000 0.0
2047-01-01 3742.0 0.000000 0.0
2048-01-01 3742.0 0.000000 0.0
2049-01-01 3742.0 0.000000 0.0
2050-01-01 3742.0 0.000000 0.0
sir_cum_infections n_alive new_deaths cum_deaths
timevec
2000-01-01 15.0 5000.0 0.0 0.0
2001-01-01 47.0 5000.0 0.0 0.0
2002-01-01 90.0 5000.0 0.0 0.0
2003-01-01 157.0 5000.0 0.0 0.0
2004-01-01 253.0 5000.0 0.0 0.0
2005-01-01 398.0 4998.0 2.0 0.0
2006-01-01 547.0 4993.0 5.0 2.0
2007-01-01 757.0 4986.0 7.0 7.0
2008-01-01 1036.0 4982.0 4.0 14.0
2009-01-01 1365.0 4969.0 13.0 18.0
2010-01-01 1749.0 4951.0 18.0 31.0
2011-01-01 2176.0 4920.0 31.0 49.0
2012-01-01 2594.0 4887.0 33.0 80.0
2013-01-01 2998.0 4847.0 40.0 113.0
2014-01-01 3385.0 4805.0 42.0 153.0
2015-01-01 3716.0 4753.0 52.0 195.0
2016-01-01 3977.0 4681.0 72.0 247.0
2017-01-01 4156.0 4589.0 92.0 319.0
2018-01-01 4267.0 4504.0 85.0 411.0
2019-01-01 4376.0 4428.0 76.0 496.0
2020-01-01 4445.0 4361.0 67.0 572.0
2021-01-01 4502.0 4291.0 70.0 639.0
2022-01-01 4541.0 4240.0 51.0 709.0
2023-01-01 4567.0 4187.0 53.0 760.0
2024-01-01 4590.0 4154.0 33.0 813.0
2025-01-01 4602.0 4126.0 28.0 846.0
2026-01-01 4612.0 4107.0 19.0 874.0
2027-01-01 4619.0 4097.0 10.0 893.0
2028-01-01 4620.0 4088.0 9.0 903.0
2029-01-01 4622.0 4084.0 4.0 912.0
2030-01-01 4625.0 4074.0 10.0 916.0
2031-01-01 4627.0 4071.0 3.0 926.0
2032-01-01 4627.0 4070.0 1.0 929.0
2033-01-01 4628.0 4068.0 2.0 930.0
2034-01-01 4628.0 4068.0 0.0 932.0
2035-01-01 4628.0 4068.0 0.0 932.0
2036-01-01 4628.0 4067.0 1.0 932.0
2037-01-01 4628.0 4066.0 1.0 933.0
2038-01-01 4628.0 4066.0 0.0 934.0
2039-01-01 4628.0 4066.0 0.0 934.0
2040-01-01 4628.0 4066.0 0.0 934.0
2041-01-01 4628.0 4066.0 0.0 934.0
2042-01-01 4628.0 4066.0 0.0 934.0
2043-01-01 4628.0 4066.0 0.0 934.0
2044-01-01 4628.0 4066.0 0.0 934.0
2045-01-01 4628.0 4066.0 0.0 934.0
2046-01-01 4628.0 4066.0 0.0 934.0
2047-01-01 4628.0 4066.0 0.0 934.0
2048-01-01 4628.0 4066.0 0.0 934.0
2049-01-01 4628.0 4066.0 0.0 934.0
2050-01-01 4628.0 4066.0 0.0 934.0 )
In the same way that it is possible to index the Samples object directly to retrieve columns from the summary dataframe, it is also possible to directly index the Samples object to get a column of the time series dataframe. In this case, pass a tuple of items to the Samples object, where the first item is the seed and the second is a column from the time series dataframe. For example:
res[0, 'sir_n_infected'] # Equivalent to res[0]['sir_n_infected']
timevec
2000-01-01 63.0
2001-01-01 95.0
2002-01-01 138.0
2003-01-01 205.0
2004-01-01 301.0
2005-01-01 435.0
2006-01-01 555.0
2007-01-01 724.0
2008-01-01 961.0
2009-01-01 1227.0
2010-01-01 1521.0
2011-01-01 1832.0
2012-01-01 2101.0
2013-01-01 2305.0
2014-01-01 2454.0
2015-01-01 2487.0
2016-01-01 2388.0
2017-01-01 2165.0
2018-01-01 1868.0
2019-01-01 1587.0
2020-01-01 1252.0
2021-01-01 972.0
2022-01-01 732.0
2023-01-01 526.0
2024-01-01 388.0
2025-01-01 281.0
2026-01-01 206.0
2027-01-01 148.0
2028-01-01 103.0
2029-01-01 67.0
2030-01-01 48.0
2031-01-01 34.0
2032-01-01 19.0
2033-01-01 12.0
2034-01-01 10.0
2035-01-01 7.0
2036-01-01 6.0
2037-01-01 3.0
2038-01-01 1.0
2039-01-01 1.0
2040-01-01 1.0
2041-01-01 0.0
2042-01-01 0.0
2043-01-01 0.0
2044-01-01 0.0
2045-01-01 0.0
2046-01-01 0.0
2047-01-01 0.0
2048-01-01 0.0
2049-01-01 0.0
2050-01-01 0.0
Name: sir_n_infected, dtype: float64
Filtering results
The .seeds attribute contains a listing of seeds, which can be helpful for iteration:
res.seeds
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19])
The seeds are drawn from the summary dataframe, which defines which seeds are accessible via the Samples object. Therefore, you can drop rows from the summary dataframe to filter the results. For example, suppose we only wanted to analyze simulations with more than 4900 cumulative infections. We could retrieve a copy of the summary dataframe that only contains matching simulations:
res.summary.loc[res['cum_infections']>4900]
cum_infections | cum_deaths | ||
---|---|---|---|
seed | p_death |
We can then make a copy of the results and write the reduced summary dataframe back to that object:
res2 = res.copy()
res2.summary = res.summary.loc[res['cum_infections']>4900]
Unlike sc.dcp(), copying using the .copy() method only deep copies the summary dataframe; it does not duplicate the time series dataframes or the cache. For Samples objects, it is therefore generally preferable to use .copy().
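For example, a sketch contrasting the two approaches (both yield independent summary dataframes, but only sc.dcp duplicates everything else):

r_deep = sc.dcp(res)    # Full deep copy, including any cached dataframes
r_shallow = res.copy()  # Deep copies only the summary dataframe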
Now notice that there are fewer samples and the seeds have been filtered; in this case no runs exceeded the threshold, so the filtered object is in fact empty:
len(res)
20
len(res2)
0
res2.seeds
array([], dtype=int64)
plt.hist(res2['cum_infections'], density=True)
plt.xlabel('Total infections')
plt.ylabel('Probability density')
/software/conda/lib/python3.12/site-packages/numpy/lib/_histograms_impl.py:902: RuntimeWarning: invalid value encountered in divide
return n/db/n.sum(), bin_edges
Text(0, 0.5, 'Probability density')
Applying functions and transformations
Sometimes it might be necessary to calculate quantities that are derived from the time series dataframes. These could be simple scalar values, such as totals or averages that had not been computed ahead of time, or values extracted from each simulation at a particular point in time. As an alternative to writing a loop that iterates over the seeds, the .apply() method takes in a function and maps it over every dataframe. This makes it quick to construct lists or arrays of scalar values extracted from the time series. For example, suppose we wanted to extract the peak number of people infected from each simulation:
peak_infections = lambda df: df['sir_n_infected'].max()
res.apply(peak_infections)
[2487.0,
2502.0,
2524.0,
2437.0,
2391.0,
2471.0,
2526.0,
2459.0,
2397.0,
2483.0,
2436.0,
2443.0,
2449.0,
2338.0,
2506.0,
2401.0,
2509.0,
2365.0,
2438.0,
2476.0]
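The returned list can then be wrapped in an array for further computation; a minimal sketch:

peaks = np.array(res.apply(peak_infections))
print(f'Peak infections: {peaks.mean():.0f} ± {peaks.std():.0f}')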
Options when loading
There are two options available when loading that can change how the Samples class interacts with the file on disk:
- memory_buffer - copy the entire file into memory. This prevents the file from being locked on disk, and allows scripts to be re-run and results regenerated while still running the analysis notebook. This defaults to True for convenience, but loading the entire file into memory can be problematic if the file is large (e.g., >1GB), in which case setting memory_buffer=False may be preferable
- preload - populate the cache in one step. This facilitates interactive usage of the analysis notebook by making the runtime of analysis functions predictable (since all results will be retrieved from the cache), at the expense of a long initial load time
The two options can also be combined, as shown in the sketch below.
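For instance, a large file could plausibly be opened reading directly from disk while still preloading all dataframes:

res = ss.Samples('results/0.2.zip', memory_buffer=False, preload=True)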
Implementation details
If the file is loaded from a memory buffer, the ._zipfile attribute will be populated. A helper property .zipfile is used to access the buffer; if the memory buffer is not used, .zipfile returns the actual file on disk rather than the buffer:
res = ss.Samples('results/0.2.zip', memory_buffer=True) # Copy the entire file into memory
print(res._zipfile)
print(res.zipfile)
<zipfile.ZipFile file=<_io.BytesIO object at 0x752b0cadff10> mode='r'>
<zipfile.ZipFile file=<_io.BytesIO object at 0x752b0cadff10> mode='r'>
res = ss.Samples('results/0.2.zip', memory_buffer=False) # Read directly from the file on disk
print(res._zipfile)
print(res.zipfile)
None
<zipfile.ZipFile filename='results/0.2.zip' mode='r'>
The dataframes associated with the individual simulations are cached on access, so pd.read_csv() only needs to be called once per dataframe. The cache starts out empty:
res._cache
{}
When a dataframe is accessed, it is automatically stored in the cache:
res[0]
res._cache.keys()
dict_keys([0])
This means that iterating through the dataframes the first time can be slow (but in general, iterating over all dataframes is avoided in favour of either only using summary outputs, or accessing a subset of the runs):
with sc.Timer():
    for df in res:
        continue
Elapsed time: 14.9 ms
with sc.Timer():
    for df in res:
        continue
Elapsed time: 1.08 ms
The preload option populates the entire cache in advance. This makes creating the Samples object slower, but operating on the dataframes afterwards will be consistently fast. This type of usage can be useful for loading large files in the background and then working with them interactively afterwards.
with sc.Timer():
    res = ss.Samples('results/0.2.zip', preload=True)
Elapsed time: 16.4 ms
with sc.Timer():
    for df in res:
        continue
Elapsed time: 0.913 ms
with sc.Timer():
    for df in res:
        continue
Elapsed time: 0.886 ms
Together, these options provide some flexibility in terms of memory and time demands to suit analyses at various different scales.
Running scenarios
Suppose we wanted to compare a range of different p_death values and initial values (the initial number of infections). We might define these runs as:
initials = np.arange(1,4)
p_deaths = np.arange(0,1,0.25)
Recall that our run_sim() function had an argument for p_death. We can extend this to include the initial parameter too; in fact, we can generalize further by passing the parameters as keyword arguments, to avoid needing to hard-code all of them. Note that we also need to add the initial value to the summary outputs:
def get_sim(seed, **kwargs):
    ppl = ss.People(1000)
    net = ss.RandomNet(n_contacts=ss.poisson(5))
    sir = ss.SIR(pars=kwargs)
    sim = ss.Sim(people=ppl, networks=net, diseases=sir, rand_seed=seed, dur=ss.years(10))
    sim.init(verbose=0)
    return sim
def run_sim(seed, **kwargs):
    sim = get_sim(seed, **kwargs)
    sim.run(verbose=0)
    df = sim.to_df()
    sir = sim.diseases.sir

    summary = {}
    summary['seed'] = sim.pars.rand_seed
    summary['p_death'] = sir.pars.p_death.pars.p
    summary['initial'] = sir.pars.init_prev.pars.p
    summary['cum_infections'] = sum(sim.results.sir.new_infections)
    summary['cum_deaths'] = sum(sim.results.new_deaths)

    return df, summary
We can now easily run a set of scenarios with different values of initial and p_death, and save each one to a separate Samples object. Note that when we create the Samples objects now, we also want to specify that 'initial' is one of the identifiers for the scenarios:
# Clear the existing results
for file_path in resultsdir.glob('*'):
    file_path.unlink()

# Run the sweep over initial and p_death
n = 10
seeds = np.arange(n)
for init_prev in initials:
    for p_death in p_deaths:
        outputs = [run_sim(seed, init_prev=init_prev, p_death=p_death) for seed in seeds]
        ss.Samples.new(resultsdir, outputs, ["p_death", "initial"])
Zip file saved to "/home/cliffk/idm/starsim/docs/user_guide/results/0.0-1.zip"
Zip file saved to "/home/cliffk/idm/starsim/docs/user_guide/results/0.25-1.zip"
Zip file saved to "/home/cliffk/idm/starsim/docs/user_guide/results/0.5-1.zip"
Zip file saved to "/home/cliffk/idm/starsim/docs/user_guide/results/0.75-1.zip"
Zip file saved to "/home/cliffk/idm/starsim/docs/user_guide/results/0.0-2.zip"
Zip file saved to "/home/cliffk/idm/starsim/docs/user_guide/results/0.25-2.zip"
Zip file saved to "/home/cliffk/idm/starsim/docs/user_guide/results/0.5-2.zip"
Zip file saved to "/home/cliffk/idm/starsim/docs/user_guide/results/0.75-2.zip"
Zip file saved to "/home/cliffk/idm/starsim/docs/user_guide/results/0.0-3.zip"
Zip file saved to "/home/cliffk/idm/starsim/docs/user_guide/results/0.25-3.zip"
Zip file saved to "/home/cliffk/idm/starsim/docs/user_guide/results/0.5-3.zip"
Zip file saved to "/home/cliffk/idm/starsim/docs/user_guide/results/0.75-3.zip"
The results folder now contains a collection of saved Samples objects. Notice how the automatically selected file names now contain both the p_death value and the initial value, because they were both specified as identifiers. We can load one of these objects to see how the identifiers are stored and accessed inside the Samples class:
list(resultsdir.iterdir())
[PosixPath('results/0.0-3.zip'),
PosixPath('results/0.0-1.zip'),
PosixPath('results/0.25-3.zip'),
PosixPath('results/0.5-3.zip'),
PosixPath('results/0.75-3.zip'),
PosixPath('results/0.0-2.zip'),
PosixPath('results/0.75-1.zip'),
PosixPath('results/0.25-2.zip'),
PosixPath('results/0.25-1.zip'),
PosixPath('results/0.75-2.zip'),
PosixPath('results/0.5-2.zip'),
PosixPath('results/0.5-1.zip')]
res = ss.Samples('results/0.25-2.zip')
The ‘id’ of a Samples object is a dictionary of the identifiers, which makes it easy to access the input parameters associated with a set of scenario runs:
res.id
{'p_death': 0.25, 'initial': 2}
The ‘identifier’ is a tuple of these values, which is suitable for use as a dictionary key. This can be useful for accumulating and comparing variables across scenarios:
res.identifier
(0.25, 2)
Loading multiple scenarios
We saw above that we now have a directory full of .zip files corresponding to the various scenario runs. These can be accessed using the Dataset class, which facilitates accessing multiple instances of Samples. We can pass the folder containing the results to the Dataset constructor to load them all:
results = ss.Dataset(resultsdir)
results
<Dataset:
'p_death':[0.0, 0.25, 0.5, 0.75]
'initial':[1, 2, 3]
>
The .ids attribute lists all of the identifier values available across the scenarios in the results folder:
results.ids
{'p_death': [0.0, 0.25, 0.5, 0.75], 'initial': [1, 2, 3]}
The individual results can be accessed by indexing the Dataset instance using the values of the identifiers. For example:
results[0.25, 2]
<Samples 0.25-2, 10 seeds>
This indexing operation is sensitive to the order in which the identifiers are specified. The .get() method allows you to specify them as key-value pairs instead:
results.get(initial=2, p_death=0.25)
<Samples 0.25-2, 10 seeds>
Iterating over the Dataset will iterate over the Samples instances contained within it:
for res in results:
    print(res)
<Samples 0.0-1, 10 seeds>
<Samples 0.0-2, 10 seeds>
<Samples 0.0-3, 10 seeds>
<Samples 0.25-1, 10 seeds>
<Samples 0.25-2, 10 seeds>
<Samples 0.25-3, 10 seeds>
<Samples 0.5-1, 10 seeds>
<Samples 0.5-2, 10 seeds>
<Samples 0.5-3, 10 seeds>
<Samples 0.75-1, 10 seeds>
<Samples 0.75-2, 10 seeds>
<Samples 0.75-3, 10 seeds>
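For instance, a minimal sketch collecting a scenario-level statistic into a dictionary, keyed by the identifier tuples described above:

medians = {res.identifier: res['cum_deaths'].median() for res in results}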
This can be used to extract and compare values across scenarios. For example, we could consider the use case of making a plot that compares total deaths across scenarios:
labels = []
y = []

for res in results:
    labels.append(res.id)
    y.append(res['cum_deaths'].median())

plt.barh(np.arange(len(results)), y, tick_label=labels)
plt.xlabel('Median total deaths')
plt.ylabel('Scenario')
Text(0, 0.5, 'Scenario')
Filtering scenarios
Often plots need to be generated for a subset of scenarios, e.g., for sensitivity analysis or to otherwise compare specific scenarios. Dataset.filter returns a new Dataset containing a subset of the results:
for res in results.filter(initial=2):
    print(res)
<Samples 0.0-2, 10 seeds>
<Samples 0.25-2, 10 seeds>
<Samples 0.5-2, 10 seeds>
<Samples 0.75-2, 10 seeds>
for res in results.filter(p_death=0.25):
    print(res)
<Samples 0.25-1, 10 seeds>
<Samples 0.25-2, 10 seeds>
<Samples 0.25-3, 10 seeds>
This is also a quick and efficient operation, so you can easily embed filtering commands inside the analysis to select subsets of the scenarios for plotting and other output generation. For instance:
for res, color in zip(results.filter(initial=2), sc.gridcolors(4)):
    plt.plot(res[0].index, np.median([df['new_deaths'] for df in res], axis=0), color=color, label=f'p_death = {res.id["p_death"]}')

plt.legend()
plt.title('Sensitivity to p_death (initial = 2)')
plt.xlabel('Year')
plt.ylabel('New deaths')
Text(0, 0.5, 'New deaths')
for res, color in zip(results.filter(p_death=0.25), sc.gridcolors(3)):
    plt.plot(res[0].index, np.median([df['new_deaths'] for df in res], axis=0), color=color, label=f'initial = {res.id["initial"]}')

plt.legend()
plt.title('Sensitivity to initial infections (p_death = 0.25)')
plt.xlabel('Year')
plt.ylabel('New deaths')
sc.dateformatter()
/software/conda/lib/python3.12/site-packages/matplotlib/axis.py:1282: RuntimeWarning: Axes data not recognizable as dates: Matplotlib converted them to days starting in 1970, which seems wrong. Please convert to actual dates first, using e.g. sc.date().
Raw values: [0.00000000e+00 2.31481481e-11 4.62962963e-11 6.94444444e-11
9.25925926e-11 1.15740741e-10]
major_labels = self.major.formatter.format_ticks(major_locs)