Array indexing

This guide covers array indexing concepts in Starsim, including universal identifiers (UIDs), active UIDs (auids), and proper array operations.

Overview

Starsim uses an indexing system built on NumPy arrays to efficiently manage agents throughout their lifecycle, including when they die or are removed from the simulation. Understanding this system is crucial for writing correct and efficient code.

Key concepts

Universal identifiers (UIDs)

Every agent in Starsim has a unique identifier called a universal identifier or UID. UIDs are integers that:

- Are assigned sequentially starting from 0
- Never change during an agent’s lifetime
- Are not reused when agents die
- Can be used to index any agent, whether alive or dead

Active UIDs (auids)

The simulation also maintains a list of active UIDs (auids), which are the UIDs of agents who are currently alive and active in the simulation. This is a dynamic subset of all UIDs.
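The relationship between UIDs and auids can be illustrated with a minimal NumPy sketch. The `ToyPopulation` class and its `kill` and `grow` methods are hypothetical names invented for this example; they are not part of Starsim and only mimic the rules described above.

```python
import numpy as np

class ToyPopulation:
    """Hypothetical stand-in for an agent population; not a Starsim class."""

    def __init__(self, n_agents):
        self.uid = np.arange(n_agents)              # one UID per agent ever created
        self.alive = np.ones(n_agents, dtype=bool)  # everyone starts alive

    @property
    def auids(self):
        # Active UIDs: the UIDs of agents that are still alive
        return self.uid[self.alive]

    def kill(self, uids):
        # The agent is marked dead, but its UID stays in place
        self.alive[uids] = False

    def grow(self, n_new):
        # New agents continue the sequence; UIDs of dead agents are not reused
        new_uids = np.arange(len(self.uid), len(self.uid) + n_new)
        self.uid = np.concatenate([self.uid, new_uids])
        self.alive = np.concatenate([self.alive, np.ones(n_new, dtype=bool)])
        return new_uids

pop = ToyPopulation(5)
pop.kill([1, 3])
new_uids = pop.grow(2)
print(pop.uid)    # [0 1 2 3 4 5 6]
print(pop.auids)  # [0 2 4 5 6]
print(new_uids)   # [5 6]
```

In this toy model, UIDs 1 and 3 remain valid indices after the agents die, while the active subset shrinks and then grows again as new agents are added.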

Array structure

Starsim arrays have two main components:

- raw: Contains data for all agents ever created (indexed by UID)
- values: Contains data for active agents only (indexed by position in auids)
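As a rough mental model, values can be thought of as raw restricted to the active UIDs. The sketch below uses plain NumPy arrays with made-up numbers; it illustrates the idea and is not Starsim's internal implementation.

```python
import numpy as np

# Hypothetical data for 5 agents ever created (array index = UID)
raw = np.array([30.0, 45.0, 12.0, 60.0, 25.0])
auids = np.array([0, 2, 4])  # assume UIDs 1 and 3 have died

# "values" is the raw data restricted to the active agents
values = raw[auids]
print(values)  # [30. 12. 25.]

# Position 1 in values belongs to UID auids[1] == 2, not to UID 1
assert values[1] == raw[auids[1]]
```

The key point is that a position in values and a UID are different kinds of index, even though both are integers.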

Let’s see this in action:

import starsim as ss

# Create a simple simulation to demonstrate indexing
pars = dict(
    n_agents=10,
    diseases=dict(type='sir', init_prev=0.5, p_death=0.2),
    networks='random',
)

sim = ss.Sim(pars)
sim.run()

print(f"Number of agents: {len(sim.people)}")
print(f"UIDs: {sim.people.uid}")
print(f"Active UIDs (auids): {sim.people.auids}")
print(f"All UIDs: {sim.people.uid.raw}")
print(f"Alive: {sim.people.alive.raw}")
print(f"Ages (values): {sim.people.age}")
print(f"Ages (raw): {sim.people.age.raw}")

Initializing sim with 10 agents
  Running 2000.01.01 ( 0/51) (0.00 s)  ———————————————————— 2%
  Running 2010.01.01 (10/51) (0.13 s)  ••••———————————————— 22%
  Running 2020.01.01 (20/51) (0.14 s)  ••••••••———————————— 41%
  Running 2030.01.01 (30/51) (0.15 s)  ••••••••••••———————— 61%
  Running 2040.01.01 (40/51) (0.16 s)  ••••••••••••••••———— 80%
  Running 2050.01.01 (50/51) (0.17 s)  •••••••••••••••••••• 100%

Number of agents: 6
UIDs: <IndexArr "uid", len=6, [2 3 4 5 6 9]>
Active UIDs (auids): [2 3 4 5 6 9]
All UIDs: [0 1 2 3 4 5 6 7 8 9]
Alive: [False False  True  True  True  True  True False False  True]
Ages (values): <FloatArr "age", len=6, [58.149414    0.13023734 43.819874   42.690937   54.292294   27.229174  ]>
Ages (raw): [25.152369    4.9882936  58.149414    0.13023734 43.819874   42.690937
 54.292294   52.604046    7.0768547  27.229174  ]

Operations on active vs all agents

This is a crucial distinction in Starsim:

- Statistical operations (like .mean(), .sum(), .std()) operate on active agents only
- Indexing operations depend on what type of index you use:
  - int or slice: operates on active agents (values)
  - ss.uids(): operates on all agents (raw)

Let’s demonstrate this:

import numpy as np

print("After simulation:")
print(f"Total agents ever created: {len(sim.people.uid.raw)}")
print(f"Active agents: {len(sim.people.auids)}")
print(f"Active UIDs: {sim.people.auids}")

# Statistical operations work on active agents only
print(f"\nMean age (active agents): {sim.people.age.mean():.2f}")
print(f"Mean age (manual calculation): {sim.people.age.values.mean():.2f}")

# This would be different if we included all agents (including dead ones).
# np.nanmean ignores any NaN entries in the raw array.
print(f"Mean age (all agents, including dead): {np.nanmean(sim.people.age.raw):.2f}")

After simulation:
Total agents ever created: 10
Active agents: 6
Active UIDs: [2 3 4 5 6 9]

Mean age (active agents): 37.72
Mean age (manual calculation): 37.72
Mean age (all agents, including dead): 31.61
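The two means printed above can be reproduced from the raw ages and active UIDs shown in the earlier output. This is a plain NumPy check with the numbers copied from that output; it does not call Starsim.

```python
import numpy as np

# Raw ages (indexed by UID) and active UIDs, copied from the output above
raw_age = np.array([25.152369, 4.9882936, 58.149414, 0.13023734, 43.819874,
                    42.690937, 54.292294, 52.604046, 7.0768547, 27.229174])
auids = np.array([2, 3, 4, 5, 6, 9])

active_mean = raw_age[auids].mean()  # active agents only
all_mean = raw_age.mean()            # every agent ever created, including dead ones

print(f"{active_mean:.2f}")  # 37.72
print(f"{all_mean:.2f}")     # 31.61
```

The gap between the two values comes entirely from the four dead agents (UIDs 0, 1, 7, and 8), whose ages are still stored in the raw array.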

Proper indexing examples

Here are examples of correct and incorrect ways to index Starsim arrays:

Correct indexing patterns

# ✅ Using integer indices (works on active agents)
age_of_first_active = sim.people.age[0]
print(f"Age of first active agent: {age_of_first_active}")

# ✅ Using ss.uids() for specific UIDs
specific_uids = ss.uids([0, 1, 2])
ages_by_uid = sim.people.age[specific_uids]
print(f"Ages of UIDs {specific_uids}: {ages_by_uid}")

# ✅ Using boolean arrays from states
female_uids = sim.people.female.uids  # This gets UIDs of female agents
female_ages = sim.people.age[female_uids]
print(f"Ages of female agents: {female_ages}")

# ✅ Using .true() and .false() methods
alive_uids = sim.people.alive.true()
dead_uids = sim.people.alive.false()
print(f"Alive UIDs: {alive_uids}")
print(f"Dead UIDs: {dead_uids}")

Age of first active agent: 25.152368545532227
Ages of UIDs [0 1 2]: [25.152369   4.9882936 58.149414 ]
Ages of female agents: [54.292294]
Alive UIDs: [2 3 4 5 6 9]
Dead UIDs: []

Incorrect indexing patterns

These examples show what NOT to do:

import sciris as sc

# ❌ Don't index with raw lists of integers - this is ambiguous!
with sc.tryexcept() as tc:
    print('This raises an error:')
    sim.people.age[[0, 1, 2]]  # This would raise an error

# ❌ Don't mix up .values and .raw
age = sim.people.age
print('Mean age:', age.mean())
print('Mean age (values):', age.values.mean()) # <- same as above
print('Mean age (raw):', age.raw.mean()) # <- different since includes dead agents

This raises an error:
<class 'Exception'> Indexing an Arr (age) by ([0, 1, 2]) is ambiguous or not supported. Use ss.uids() instead, or index Arr.raw or Arr.values.
Mean age: 37.718655
Mean age (values): 37.718655
Mean age (raw): 31.613348
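To see why a bare list of integers is ambiguous, consider what [0, 1, 2] could mean: three UIDs, or the first three active agents. The NumPy sketch below uses the ages and active UIDs from the earlier output to show that the two readings give different answers; the variable names are chosen for this example only.

```python
import numpy as np

# Raw ages (indexed by UID) and active UIDs, copied from the earlier output
raw_age = np.array([25.152369, 4.9882936, 58.149414, 0.13023734, 43.819874,
                    42.690937, 54.292294, 52.604046, 7.0768547, 27.229174])
auids = np.array([2, 3, 4, 5, 6, 9])
values = raw_age[auids]

idx = [0, 1, 2]
by_uid = raw_age[idx]       # reading 1: UIDs 0, 1, 2 (two of them dead)
by_position = values[idx]   # reading 2: first three active agents (UIDs 2, 3, 4)

print(by_uid)
print(by_position)
```

Wrapping the indices in ss.uids(), or indexing .raw or .values explicitly as the error message suggests, makes the intended reading unambiguous.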

Best practices and common pitfalls

Do:

- Use ss.uids() when you need to index by specific UIDs
- Use statistical methods (.mean(), .sum(), etc.) directly on arrays - they automatically work on active agents
- Use the .uids property of boolean arrays to get the UIDs of agents matching a criterion
- Use the .true() and .false() methods for cleaner boolean array handling
- Remember that integer indexing works on active agents, not UIDs

Don’t:

- Don’t index with raw lists of integers - use ss.uids() instead
- Don’t use .raw arrays for statistics unless you specifically need to include dead agents
- Don’t use boolean operators (&, |) on non-boolean arrays - use comparison operators instead
- Don’t forget to check if UID arrays are empty before performing operations on them
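The last two points can be illustrated with plain NumPy arrays, which behave the same way for these operations. The ages below are made-up values for six active agents, and the variable names are hypothetical.

```python
import numpy as np

# Hypothetical ages of six active agents and their UIDs
age = np.array([58.1, 0.1, 43.8, 42.7, 54.3, 27.2])
auids = np.array([2, 3, 4, 5, 6, 9])

# Comparison operators produce boolean arrays, which combine safely with & and |
working_age = (age >= 15) & (age < 65)
print(auids[working_age])  # [2 4 5 6 9]

# Applying & to non-boolean (integer) arrays is a bitwise operation, not a logical test
bitwise = np.array([1, 2]) & np.array([2, 2])
print(bitwise)  # [0 2]

# Guard against empty selections: the mean of an empty array is NaN (with a warning)
over_90 = auids[age > 90]
mean_age_over_90 = age[age > 90].mean() if over_90.size else None
print(mean_age_over_90)  # None
```

Checking the size of a selection before computing on it avoids NaN results and runtime warnings when no agents match.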

Performance tips:

- Boolean indexing is efficient - use it to filter large populations
- UID operations are optimized - use set operations like .intersect() and .union() when appropriate
- Statistical operations on arrays are fast - they use NumPy under the hood
- Avoid loops when possible - vectorized operations are much faster
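As a rough illustration of these tips, the sketch below compares set operations, boolean masks, and an explicit loop on a synthetic population. It uses NumPy's np.intersect1d and np.union1d as stand-ins for UID set operations; this is an analogy under that assumption, not a description of how Starsim implements .intersect() or .union().

```python
import numpy as np

# Synthetic population (all values are made up for this example)
rng = np.random.default_rng(seed=1)
n = 10_000
age = rng.uniform(0, 90, size=n)
female = rng.random(n) < 0.5

# Index arrays playing the role of UID arrays
adult_uids = np.flatnonzero(age >= 18)
female_uids = np.flatnonzero(female)

# Set operations on index arrays
adult_female = np.intersect1d(adult_uids, female_uids)
adult_or_female = np.union1d(adult_uids, female_uids)

# The same intersection via a single vectorized boolean mask
mask_result = np.flatnonzero((age >= 18) & female)

# The equivalent Python loop gives the same answer, but visits agents one by one
loop_result = np.array([i for i in range(n) if age[i] >= 18 and female[i]])

print(np.array_equal(adult_female, mask_result))  # True
print(np.array_equal(loop_result, mask_result))   # True
```

All three approaches select the same agents; the vectorized versions do so without a Python-level loop, which matters as the population grows.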