Environment

This module implements environments and sampling functions that can be used to generate trajectories.

from pycfrl import environment
class pycfrl.environment.environment.SimulatedEnvironment(num_actions: int, state_model_type: Literal['lm', 'nn'] = 'nn', state_model_hidden_dims: list[int] = [32, 32], reward_model_type: Literal['lm', 'nn'] = 'nn', reward_model_hidden_dims: list[int] = [32, 32], is_action_onehot: bool = True, epochs: int = 1000, batch_size: int = 128, learning_rate: int | float = 0.001, is_loss_monitored: bool = True, is_early_stopping: bool = False, test_size: int | float = 0.2, loss_monitoring_patience: int = 10, loss_monitoring_min_delta: int | float = 0.01, early_stopping_patience: int = 10, early_stopping_min_delta: int | float = 0.01, enforce_min_max: bool = False)

Bases: Env

Implementation of an environment that simulates the transition dynamics of real environments.

A SimulatedEnvironment learns transition dynamics from data and makes transitions following the learned dynamics. SimulatedEnvironment inherits from gymnasium.Env and follows an interface similar to gymnasium.Env.

Currently, SimulatedEnvironment assumes the environment is continuing.

__init__(num_actions: int, state_model_type: Literal['lm', 'nn'] = 'nn', state_model_hidden_dims: list[int] = [32, 32], reward_model_type: Literal['lm', 'nn'] = 'nn', reward_model_hidden_dims: list[int] = [32, 32], is_action_onehot: bool = True, epochs: int = 1000, batch_size: int = 128, learning_rate: int | float = 0.001, is_loss_monitored: bool = True, is_early_stopping: bool = False, test_size: int | float = 0.2, loss_monitoring_patience: int = 10, loss_monitoring_min_delta: int | float = 0.01, early_stopping_patience: int = 10, early_stopping_min_delta: int | float = 0.01, enforce_min_max: bool = False) None
Args:
num_actions (int):

The total number of valid actions.

state_model_type (str, optional):

The type of the model used for learning the transition dynamics of the states. Can be “lm” (polynomial regression) or “nn” (neural network). Currently, only ‘nn’ is supported.

state_model_hidden_dims (list[int], optional):

The hidden dimensions of the neural network for learning the transition dynamics of the states. This argument is not used if state_model_type="lm".

reward_model_type (str, optional):

The type of the model used for learning the transition dynamics of the rewards. Can be “lm” (polynomial regression) or “nn” (neural network). Currently, only ‘nn’ is supported.

reward_model_hidden_dims (list[int], optional):

The hidden dimensions of the neural network for learning the transition dynamics of the rewards. This argument is not used if reward_model_type="lm".

is_action_onehot (bool, optional):

When set to True, the actions will be one-hot encoded internally.

epochs (int, optional):

The number of training epochs for the neural networks. Applies to both the network for states and the network for rewards, if applicable. This argument is not used if both state_model_type and reward_model_type are set to "lm".

batch_size (int, optional):

The batch size of the neural networks. Applies to both the network for states and the network for rewards, if applicable. This argument is not used if both state_model_type and reward_model_type are set to "lm".

learning_rate (int or float, optional):

The learning rate of the neural networks. Applies to both the network for states and the network for rewards, if applicable. This argument is not used if both state_model_type and reward_model_type are set to "lm".

is_loss_monitored (bool, optional):

When set to True, will split the training data into a training set and a validation set, and will monitor the validation loss during training. A warning will be raised if the percent absolute change in the validation loss is greater than loss_monitoring_min_delta for at least one of the final \(p\) epochs during neural network training, where \(p\) is specified by the argument loss_monitoring_patience. Applies to both the network for states and the network for rewards, if applicable. This argument is not used if both state_model_type and reward_model_type are "lm".

is_early_stopping (bool, optional):

When set to True, will split the training data into a training set and a validation set, and will enforce early stopping based on the validation loss during neural network training. That is, neural network training will stop early if the percent decrease in the validation loss is no greater than early_stopping_min_delta for \(q\) consecutive training epochs, where \(q\) is specified by the argument early_stopping_patience. Applies to both the network for states and the network for rewards, if applicable. This argument is not used if both state_model_type and reward_model_type are "lm".

test_size (int or float, optional):

An int or float between 0 and 1 (inclusive) that specifies the proportion of the full training data that is used as the validation set for loss monitoring and early stopping. Applies to both the network for states and the network for rewards, if applicable. This argument is not used if both state_model_type and reward_model_type are "lm", or if both is_loss_monitored and is_early_stopping are False.

loss_monitoring_patience (int, optional):

The number of consecutive epochs with barely-changing validation loss at the end of training that is needed for loss monitoring to not raise warnings. Applies to both the network for states and the network for rewards, if applicable. This argument is not used if both state_model_type and reward_model_type are "lm", or if is_loss_monitored=False.

loss_monitoring_min_delta (int or float, optional):

The maximum amount of percent absolute change in the validation loss for it to be considered barely-changing by the loss monitoring mechanism. Applies to both the network for states and the network for rewards, if applicable. This argument is not used if both state_model_type and reward_model_type are "lm", or if is_loss_monitored=False.

early_stopping_patience (int, optional):

The number of consecutive epochs with barely-decreasing validation loss during training that is needed for early stopping to be triggered. Applies to both the network for states and the network for rewards, if applicable. This argument is not used if both state_model_type and reward_model_type are "lm", or if is_early_stopping=False.

early_stopping_min_delta (int or float, optional):

The maximum percent decrease in the validation loss for it to be considered barely-decreasing by the early stopping mechanism. Applies to both the network for states and the network for rewards, if applicable. This argument is not used if both state_model_type and reward_model_type are "lm", or if is_early_stopping=False.

enforce_min_max (bool, optional):

When set to True, each component of the output states will be clipped to the maximum and minimum value of the corresponding component in the training data. Similarly, the output rewards will also be clipped to the maximum and minimum value of the rewards in the training data.
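
As a quick illustration, the following is a minimal construction sketch; the hyperparameter values are arbitrary choices rather than recommendations, and the import uses the fully qualified module path shown on this page.

from pycfrl.environment.environment import SimulatedEnvironment

# A simulated environment with 2 actions; both the state dynamics and the
# reward dynamics are learned with small neural networks.
env = SimulatedEnvironment(
    num_actions=2,
    state_model_type="nn",
    state_model_hidden_dims=[32, 32],
    reward_model_type="nn",
    reward_model_hidden_dims=[32, 32],
    epochs=500,
    batch_size=64,
    learning_rate=1e-3,
    is_loss_monitored=True,
    is_early_stopping=True,
    enforce_min_max=True,
)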

fit(zs: list | ndarray, states: list | ndarray, actions: list | ndarray, rewards: list | ndarray) None

Fit the transition dynamics of the MDP underlying the training data.

Internally, the fit() function fits two separate models, one for the states and the other for the rewards.

Args:
zs (list or np.ndarray):

The observed sensitive attributes of each individual in the training data. It should be a list or array following the Sensitive Attributes Format.

states (list or np.ndarray):

The state trajectory used for training. It should be a list or array following the Full-trajectory States Format.

actions (list or np.ndarray):

The action trajectory used for training. It should be a list or array following the Full-trajectory Actions Format.

rewards (list or np.ndarray):

The reward trajectory used for training. It should be a list or array following the Full-trajectory Rewards Format.
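
A hedged fitting sketch follows, reusing the env constructed above. The array shapes are assumptions made purely for illustration (sensitive attributes (N, 1), states (N, T + 1, xdim), actions and rewards (N, T)); real inputs must follow the trajectory formats referenced above.

import numpy as np

# Hypothetical training data; shapes are illustrative assumptions only.
rng = np.random.default_rng(0)
N, T, xdim = 100, 20, 1
zs = rng.binomial(1, 0.5, size=(N, 1))        # assumed Sensitive Attributes Format
states = rng.normal(size=(N, T + 1, xdim))    # assumed Full-trajectory States Format
actions = rng.integers(0, 2, size=(N, T))     # assumed Full-trajectory Actions Format
rewards = rng.normal(size=(N, T))             # assumed Full-trajectory Rewards Format

# `env` is the SimulatedEnvironment constructed in the sketch above.
env.fit(zs=zs, states=states, actions=actions, rewards=rewards)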

reset(z: list | ndarray, errors_states: ndarray | None = None, seed: int = 1) tuple[ndarray, None]

Reset the environment to an initial state.

Users must call reset() first before calling step().

Args:
z (list or np.ndarray):

The observed sensitive attributes of each individual in the trajectory. It should be a 2D list or array following the Sensitive Attributes Format.

errors_states (np.ndarray):

The exogenous variables for states \(U_{X_0}\) for each individual in the trajectory. It should be a 2D list or array with shape (N, xdim) where N is the total number of individuals in the trajectory and xdim is the number of components of the state vector. When set to None, the function will generate the exogenous variables following a multivariate standard normal distribution with xdim mutually independent components.

seed (int, optional):

The random seed used for the transition.

Returns:
observation (np.ndarray):

The initial states generated following the learned transition dynamics. It is a 2D array following the Single-time States Format.

info (None):

Exists to be compatible with the interface of gymnasium.Env. It is always None.

step(action: list | ndarray, errors_states: ndarray | None = None, errors_rewards: ndarray | None = None, seed: int = 1) tuple[ndarray, ndarray, Literal[False], Literal[False]]

Generate the states and rewards at some time \(t > 0\) following the learned transition dynamics.

Args:
action (list or np.ndarray):

The actions of each individual in the trajectory. It should be a 1D list or array following the Single-time Actions Format.

errors_states (list or np.ndarray):

The exogenous variables for states (\(U_{X_t}\)) for each individual in the trajectory. It should be a 2D list or array with shape (N, xdim) where N is the total number of individuals in the trajectory and xdim is the number of components of the state vector. When set to None, the function will generate the exogenous variables following a multivariate standard normal distribution with xdim mutually independent components.

errors_rewards (list or np.ndarray):

The exogenous variables for rewards (\(U_{R_{t-1}}\)) for each individual in the trajectory. It should be a 2D list or array with shape (N, 1) where N is the total number of individuals in the trajectory. When set to None, the function will generate the exogenous variables following a standard normal distribution.

seed (int, optional):

The random seed used for the transition.

Returns:
observation (np.ndarray):

The states transitioned to following the learned transition dynamics. It is a 2D array following the Single-time States Format.

reward (np.ndarray):

The rewards generated following the learned transition dynamics. It is a 1D array following the Single-time Rewards Format.

terminated (False):

Whether the environment reaches a terminal state. It is always False because SimulatedEnvironment assumes the environment is continuing.

truncated (False):

Whether some truncation condition is satisfied. It is always False because SimulatedEnvironment currently does not support specifying truncation conditions.
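
Putting reset() and step() together, the sketch below rolls out the fitted environment; the random action rule is a hypothetical stand-in for an actual policy, and env, zs, and T refer to the fitting sketch above.

import numpy as np

rng = np.random.default_rng(1)
obs, info = env.reset(z=zs, seed=1)    # initial states, Single-time States Format

for t in range(T):
    # Hypothetical action rule; in practice actions would come from an Agent.
    action_t = rng.integers(0, 2, size=obs.shape[0])
    obs, reward_t, terminated, truncated = env.step(action=action_t, seed=t + 1)
    # terminated and truncated are always False for SimulatedEnvironment.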

class pycfrl.environment.environment.SyntheticEnvironment(state_dim: int, z_coef: int | float = 1, f_x0: ~typing.Callable[[~numpy.ndarray, ~numpy.ndarray, int | float], ~numpy.ndarray] = <function f_x0_default>, f_xt: ~typing.Callable[[~numpy.ndarray, ~numpy.ndarray, ~numpy.ndarray, ~numpy.ndarray, int | float], ~numpy.ndarray] = <function f_xt_default>, f_rt: ~typing.Callable[[~numpy.ndarray, ~numpy.ndarray, ~numpy.ndarray, ~numpy.ndarray, int | float], ~numpy.ndarray] = <function f_rt_default>)

Bases: Env

Implementation of an environment that makes transitions following a pre-specified rule.

SyntheticEnvironment inherits from gymnasium.Env and follows an interface similar to gymnasium.Env. Users can also specify the transition rule in the constructor of SyntheticEnvironment.

If no transition rule is specified, SyntheticEnvironment will use a set of default transition rules (f_x0_default, f_xt_default, and f_rt_default) that assumes the sensitive attribute vector and the state vector are both univariate. More precisely, the default transition rule is

\[\begin{split}X_0 =& -0.3 + 1.0 \delta Z + U_{X_0} \\ X_t =& -0.3 + 1.0 \delta (Z - 0.5) + 0.5 X_{t-1} + 0.4 (A_{t-1} - 0.5) \\ &+ 0.3 X_{t-1} (A_{t-1} - 0.5) + 0.3 \delta X_{t-1} (Z - 0.5) + 0.4 \delta (Z - 0.5) (A_{t-1} - 0.5) + U_{X_t} \\ R_t =& -0.3 + 0.3 X_t + 0.5 \delta Z + 0.5 A_t + 0.2 \delta X_t Z + 0.7 X_t A_t - 1.0 \delta Z A_t\end{split}\]

Currently, SyntheticEnvironment assumes the environment is continuing.

__init__(state_dim: int, z_coef: int | float = 1, f_x0: ~typing.Callable[[~numpy.ndarray, ~numpy.ndarray, int | float], ~numpy.ndarray] = <function f_x0_default>, f_xt: ~typing.Callable[[~numpy.ndarray, ~numpy.ndarray, ~numpy.ndarray, ~numpy.ndarray, int | float], ~numpy.ndarray] = <function f_xt_default>, f_rt: ~typing.Callable[[~numpy.ndarray, ~numpy.ndarray, ~numpy.ndarray, ~numpy.ndarray, int | float], ~numpy.ndarray] = <function f_rt_default>) None
Args:
state_dim (int):

The number of components in the state vector.

z_coef (int or float, optional):

The strength of impact of the sensitive attribute on the states and rewards. It is the \(\delta\) in the specification of the default transition rule.

f_x0 (Callable, optional):

Transition rule for generating the state at time \(t = 0\). It should be a function whose argument list, argument names, and return type exactly match those of f_x0_default.

f_xt (Callable, optional):

Transition rule for generating the state at time \(t > 0\). It should be a function whose argument list, argument names, and return type exactly match those of f_xt_default.

f_rt (Callable, optional):

Transition rule for generating the reward at some time \(t\). It should be a function whose argument list, argument names, and return type exactly match those of f_rt_default.
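
A minimal construction sketch is shown below. The first environment uses the default transition rules; the second swaps in a hypothetical initial-state rule my_f_x0 whose formula is made up for illustration but whose argument list and return type mirror f_x0_default.

import numpy as np
from pycfrl.environment.environment import SyntheticEnvironment

# Default univariate transition rules (f_x0_default, f_xt_default, f_rt_default).
env = SyntheticEnvironment(state_dim=1, z_coef=1)

# A hypothetical custom rule for the initial states.
def my_f_x0(zs, ux0, z_coef=1):
    zs = np.asarray(zs, dtype=float)
    ux0 = np.asarray(ux0, dtype=float)
    return 0.5 * z_coef * zs + ux0    # shape (N, state_dim)

custom_env = SyntheticEnvironment(state_dim=1, z_coef=1, f_x0=my_f_x0)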

reset(z: list | ndarray, ux0: list | ndarray) tuple[ndarray, None]

Reset the environment to an initial state.

Users must call reset() first before calling step().

Args:
z (list or np.ndarray):

The observed sensitive attributes of each individual in the trajectory. It should be a 2D list or array following the Sensitive Attributes Format.

ux0 (list or np.ndarray):

The exogenous variables (\(U_{X_0}\)) for each individual in the trajectory. It should be a 2D list or array with shape (N, xdim) where N is the total number of individuals in the trajectory and xdim is the number of components of the state vector.

Returns:
observation (np.ndarray):

The initial states generated following the pre-specified transition rule. It is a 2D array following the Single-time States Format.

info (None):

Exists to be compatible with the interface of gymnasium.Env. It is always None.

step(action: list | ndarray, uxt: list | ndarray, urtm1: list | ndarray) tuple[ndarray, ndarray, Literal[False], Literal[False]]

Generate the states and rewards at some time \(t > 0\) following the pre-specified transition rule.

Args:
action (list or np.ndarray):

The actions (\(A_{t-1}\)) of each individual in the trajectory. It should be a 1D list or array following the Single-time Actions Format.

uxt (list or np.ndarray):

The exogenous variables (\(U_{X_t}\)) for each individual in the trajectory. It should be a 2D list or array with shape (N, xdim) where N is the total number of individuals in the trajectory and xdim is the number of components of the state vector.

urtm1 (list or np.ndarray):

The exogenous variables (\(U_{R_{t-1}}\)) for each individual in the trajectory. It should be a 2D list or array with shape (N, 1) where N is the total number of individuals in the trajectory.

Returns:
observation (np.ndarray):

The states transitioned to following the pre-specified transition rule (\(X_t\)). It is a 2D array following the Single-time States Format.

reward (np.ndarray):

The rewards generated following the pre-specified transition rule (\(R_{t-1}\)). It is a 1D array following the Single-time Rewards Format.

terminated (False):

Whether the environment reaches a terminal state. It is always False because SyntheticEnvironment assumes the environment is continuing.

truncated (False):

Whether some truncation condition is satisfied. It is always False because SyntheticEnvironment currently does not support specifying truncation conditions.
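
The sketch below performs one reset() and one step() on a SyntheticEnvironment with the default transition rules, supplying the exogenous variables explicitly via the default generators documented later on this page; the data are hypothetical.

import numpy as np
from pycfrl.environment.environment import (
    SyntheticEnvironment,
    f_ux_default,
    f_ur_default,
)

N, state_dim = 50, 1
rng = np.random.default_rng(2)
z = rng.binomial(1, 0.5, size=(N, 1))    # binary, univariate sensitive attribute

env = SyntheticEnvironment(state_dim=state_dim, z_coef=1)
x0, info = env.reset(z=z, ux0=f_ux_default(N, state_dim))

a0 = rng.integers(0, 2, size=N)          # Single-time Actions Format (1D)
x1, r0, terminated, truncated = env.step(
    action=a0,
    uxt=f_ux_default(N, state_dim),
    urtm1=f_ur_default(N),
)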

pycfrl.environment.environment.estimate_counterfactual_trajectories_from_data(env: ~pycfrl.environment.environment.SimulatedEnvironment, zs: list | ~numpy.ndarray, states: list | ~numpy.ndarray, actions: list | ~numpy.ndarray, policy: ~pycfrl.agents.agents.Agent, f_ua: ~typing.Callable[[int], ~numpy.ndarray] = <function f_ua_default>, seed: int = 1) dict[tuple[int | float, ...], dict[str, ndarray | SyntheticEnvironment | Agent]]

Reconstruct the counterfactual trajectories from an observed trajectory.

For each individual in the input trajectory, estimate_counterfactual_trajectories_from_data() reconstructs that individual’s counterfactual trajectories under different values of the sensitive attribute. The sensitive attribute values used here are those that appear in zs.

More precisely, the counterfactual states and rewards are first estimated following the data preprocessing method proposed by Wang et al. (2025). policy is then used to generate the counterfactual action trajectories from the estimated counterfactual states.

The counterfactual trajectories estimated using a policy can be used to compute the counterfactual fairness metric of the policy.

Args:
env (SimulatedEnvironment):

An environment that simulates the transition dynamics of the MDP underlying zs, states, and actions.

zs (list or np.ndarray):

The observed sensitive attributes of each individual in the trajectory used for estimating the counterfactual trajectories. It should be a list or array following the Sensitive Attributes Format.

states (list or np.ndarray):

The state trajectory used for estimating the counterfactual trajectories. It should be a list or array following the Full-trajectory States Format.

actions (list or np.ndarray):

The action trajectory used for estimating the counterfactual trajectories. It should be a list or array following the Full-trajectory Actions Format.

policy (Agent):

The policy used to estimate the counterfactual action trajectories.

f_ua (Callable, optional):

A rule to generate exogenous variables for each individual’s actions. It should be a function whose argument list, argument names, and return type exactly match those of f_ua_default.

seed (int, optional):

The seed used to estimate the counterfactual trajectories.

Returns:
trajectories (dict):

The estimated counterfactual trajectories. It is a dictionary whose keys are the sensitive attribute values observed in zs (each value converted to a tuple). The value under each key (denote the key z) is itself a dictionary with six keys: “Z” (an array whose elements are all z), “X” (the state trajectory of each individual under z, in the Full-trajectory States Format), “A” (the action trajectory of each individual under z, in the Full-trajectory Actions Format), “R” (the reward trajectory of each individual under z, in the Full-trajectory Rewards Format), “env_z” (a copy of env used to generate the trajectories under z, with corresponding buffer memories), and “policy_z” (a copy of policy used to generate the trajectories under z, with corresponding buffer memories).
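
A hedged usage sketch follows. fitted_env is assumed to be a SimulatedEnvironment already fitted on (zs, states, actions, rewards), and policy an already-trained pycfrl Agent; constructing them is outside the scope of this example.

from pycfrl.environment.environment import (
    estimate_counterfactual_trajectories_from_data,
)

# fitted_env, zs, states, actions, and policy are assumed to already exist.
trajectories = estimate_counterfactual_trajectories_from_data(
    env=fitted_env,
    zs=zs,
    states=states,
    actions=actions,
    policy=policy,
    seed=1,
)

# Keys are the observed sensitive attribute values, each converted to a tuple.
for z_value, traj in trajectories.items():
    print(z_value, traj["X"].shape, traj["A"].shape, traj["R"].shape)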

pycfrl.environment.environment.f_errors_rewards_default(N: int) ndarray

Generate exogenous variables for the rewards from a standard normal distribution.

Args:
N (int):

The total number of individuals for whom the exogenous variables will be generated.

Returns:
errors_rewards (np.ndarray):

The generated exogenous variables. It is an (N, 1) array where each entry is sampled from a standard normal distribution.

pycfrl.environment.environment.f_errors_states_default(N: int, state_dim: int) ndarray

Generate exogenous variables for the states.

The exogenous variables are generated from a standard multivariate normal distribution with mutually independent components.

Args:
N (int):

The total number of individuals for whom the exogenous variables will be generated.

state_dim (int):

The number of components in the state vector.

Returns:
errors_states (np.ndarray):

The generated exogenous variables. It is an (N, state_dim) array whose rows are sampled from a standard multivariate normal distribution with mutually independent components.

pycfrl.environment.environment.f_rt_default(zs: list | ndarray, xt: list | ndarray, at: list | ndarray, urtm1: list | ndarray, z_coef: int | float = 1) ndarray

Generate the rewards at some time \(t\) following the default transition rule.

Args:
zs (list or np.ndarray):

The observed sensitive attributes of each individual in the trajectory. It should be a 2D list or array following the Sensitive Attributes Format.

xt (list or np.ndarray):

The states of each individual in the trajectory at time \(t\). It should be a 2D list or array following the Single-time States Format.

at (list or np.ndarray):

The actions of each individual in the trajectory at time \(t\). It should be a 1D list or array following the Single-time Actions Format.

urtm1 (list or np.ndarray):

The exogenous variables (\(U_{R_t}\)) for each individual in the trajectory. It should be a 2D list or array with shape (N, 1) where N is the total number of individuals in the trajectory.

z_coef (int or float, optional):

The strength of impact of the sensitive attribute on the states and rewards. It is the \(\delta\) in the specification of the default transition rule.

Returns:
rt (np.ndarray):

The rewards at time \(t\) generated following the default transition rule. It is a 1D array following the Single-time Rewards Format.

pycfrl.environment.environment.f_ua_default(N: int) ndarray

Generate exogenous variables for the actions from a uniform distribution between 0 and 1.

Args:
N (int):

The total number of individuals for whom the exogenous variables will be generated.

Returns:
ua (np.ndarray):

The generated exogenous variables. It is a (N, 1) array where each entry is sampled from a uniform distribution between 0 and 1.

pycfrl.environment.environment.f_ur_default(N: int) ndarray

Generate exogenous variables for the rewards from a standard normal distribution.

Args:
N (int):

The total number of individuals for whom the exogenous variables will be generated.

Returns:
ur (np.ndarray):

The generated exogenous variables. It is a (N, 1) array where each entry is sampled from a standard normal distribution.

pycfrl.environment.environment.f_ux_default(N: int, state_dim: int) ndarray

Generate exogenous variables for the states from a standard normal distribution.

Args:
N (int):

The total number of individuals for whom the exogenous variables will be generated.

state_dim (int):

The number of components in the state vector.

Returns:
ux (np.ndarray):

The generated exogenous variables. It is a (N, state_dim) array where each entry is sampled from a standard normal distribution.
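
For users who want non-Gaussian exogenous variables, a custom generator only needs to mirror the argument list and return shape of the corresponding default. The function below is a hypothetical example that draws from a uniform distribution instead; it can be passed as f_ux to sample_trajectory or sample_counterfactual_trajectories.

import numpy as np

def my_f_ux(N, state_dim):
    # Same (N, state_dim) output shape as f_ux_default, different distribution.
    return np.random.uniform(low=-1.0, high=1.0, size=(N, state_dim))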

pycfrl.environment.environment.f_x0_default(zs: list | ndarray, ux0: list | ndarray, z_coef: int | float = 1) ndarray

Generate the states at time \(t = 0\) following the default transition rule.

Args:
zs (list or np.ndarray):

The observed sensitive attributes of each individual in the trajectory. It should be a 2D list or array following the Sensitive Attributes Format.

ux0 (list or np.ndarray):

The exogenous variables (\(U_{X_0}\)) for each individual in the trajectory. It should be a 2D list or array with shape (N, xdim) where N is the total number of individuals in the trajectory and xdim is the number of components of the state vector.

z_coef (int or float, optional):

The strength of impact of the sensitive attribute on the states and rewards. It is the \(\delta\) in the specification of the default transition rule.

Returns:
x0 (np.ndarray):

The states at \(t = 0\) generated following the default transition rule. It is a 2D array following the Single-time States Format.

pycfrl.environment.environment.f_xt_default(zs: list | ndarray, xtm1: list | ndarray, atm1: list | ndarray, uxt: list | ndarray, z_coef: int | float = 1) ndarray

Generate the states at some time \(t > 0\) following the default transition rule.

Args:
zs (list or np.ndarray):

The observed sensitive attributes of each individual in the trajectory. It should be a 2D list or array following the Sensitive Attributes Format.

xtm1 (list or np.ndarray):

The states of each individual in the trajectory at time \(t - 1\). It should be a 2D list or array following the Single-time States Format.

atm1 (list or np.ndarray):

The actions of each individual in the trajectory at time \(t - 1\). It should be a 1D list or array following the Single-time Actions Format.

uxt (list or np.ndarray):

The exogenous variables (\(U_{X_t}\)) for each individual in the trajectory. It should be a 2D list or array with shape (N, xdim) where N is the total number of individuals in the trajectory and xdim is the number of components of the state vector.

z_coef (int or float, optional):

The strength of impact of the sensitive attribute on the states and rewards. It is the \(\delta\) in the specification of the default transition rule.

Returns:
xt (np.ndarray):

The states at time \(t\) generated following the default transition rule. It is a 2D array following the Single-time States Format.

pycfrl.environment.environment.sample_counterfactual_trajectories(env: ~pycfrl.environment.environment.SyntheticEnvironment, zs: list | ~numpy.ndarray, z_eval_levels: list | ~numpy.ndarray, state_dim: int, T: int, policy: ~pycfrl.agents.agents.Agent, f_ux: ~typing.Callable[[int, int], ~numpy.ndarray] = <function f_ux_default>, f_ua: ~typing.Callable[[int], ~numpy.ndarray] = <function f_ua_default>, f_ur: ~typing.Callable[[int], ~numpy.ndarray] = <function f_ur_default>, seed: int = 1) dict[tuple[int | float, ...], dict[str, ndarray | SyntheticEnvironment | Agent]]

Sample counterfactual trajectories from some synthetic environment.

To sample counterfactual trajectories, for every individual, the function simulates transitions under each of the sensitive attribute values specified in z_eval_levels while keeping the exogenous variables the same across all trajectories of the same individual.

The counterfactual trajectories generated by a policy can be used to compute the counterfactual fairness metric of the policy.

Args:
env (SyntheticEnvironment):

The environment to sample the counterfactual trajectories from.

zs (list or np.ndarray):

The observed sensitive attributes of each individual in the trajectory that is going to be sampled. It should be a 2D list or array following the Sensitive Attributes Format.

z_eval_levels (list or np.ndarray):

The set of values of sensitive attributes under which counterfactual trajectories will be generated. For every individual, the function generates a counterfactual trajectory for each of the sensitive attribute values specified in this array. It should be a 2D array where each row contains exactly one sensitive attribute value.

state_dim (int):

The number of components in the state vector.

T (int):

The total number of transitions in the trajectory that is to be sampled.

policy (Agent):

The policy used to generate actions for the trajectory.

f_ux (Callable, optional):

A rule to generate exogenous variables for each individual’s states. It should be a function whose argument list, argument names, and return type exactly match those of f_ux_default.

f_ua (Callable, optional):

A rule to generate exogenous variables for each individual’s actions. It should be a function whose argument list, argument names, and return type exactly match those of f_ua_default.

f_ur (Callable, optional):

A rule to generate exogenous variables for each individual’s rewards. It should be a function whose argument list, argument names, and return type exactly match those of f_ur_default.

seed (int, optional):

The random seed used to generate the trajectories.

Returns:
trajectories (dict):

The sampled counterfactual trajectories. It is a dictionary whose keys are the sensitive attribute values in z_eval_levels (each value converted to a tuple). The value under each key (denote the key z) is itself a dictionary with six keys: “Z” (an array whose elements are all z), “X” (the state trajectory of each individual under z, in the Full-trajectory States Format), “A” (the action trajectory of each individual under z, in the Full-trajectory Actions Format), “R” (the reward trajectory of each individual under z, in the Full-trajectory Rewards Format), “env_z” (a copy of env used to generate the trajectories under z, with corresponding buffer memories), and “policy_z” (a copy of policy used to generate the trajectories under z, with corresponding buffer memories).
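
A sampling sketch is given below. policy is assumed to be an already-trained pycfrl Agent (construction omitted), and the sensitive attribute is assumed binary and univariate so that the default transition rules apply.

import numpy as np
from pycfrl.environment.environment import (
    SyntheticEnvironment,
    sample_counterfactual_trajectories,
)

N, T, state_dim = 100, 30, 1
rng = np.random.default_rng(3)
zs = rng.binomial(1, 0.5, size=(N, 1))
z_eval_levels = np.array([[0], [1]])    # one sensitive attribute value per row

env = SyntheticEnvironment(state_dim=state_dim, z_coef=1)
trajectories = sample_counterfactual_trajectories(
    env=env,
    zs=zs,
    z_eval_levels=z_eval_levels,
    state_dim=state_dim,
    T=T,
    policy=policy,    # an already-trained pycfrl Agent, assumed to exist
    seed=1,
)

# Keys are the rows of z_eval_levels, each converted to a tuple.
for z_value, traj in trajectories.items():
    print(z_value, traj["X"].shape, traj["A"].shape, traj["R"].shape)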

pycfrl.environment.environment.sample_simulated_env_trajectory(env: ~pycfrl.environment.environment.SimulatedEnvironment, zs: list | ~numpy.ndarray, state_dim: int, T: int, policy: ~pycfrl.agents.agents.Agent, f_ua: ~typing.Callable[[int], ~numpy.ndarray] = <function f_ua_default>, f_errors_states: ~typing.Callable[[int, int], ~numpy.ndarray] = <function f_errors_states_default>, f_errors_rewards: ~typing.Callable[[int], ~numpy.ndarray] = <function f_errors_rewards_default>, seed: int = 1) tuple[ndarray, ndarray, ndarray, ndarray]

Sample a trajectory from some simulated environment.

Args:
env (SimulatedEnvironment):

The environment to sample the trajectory from.

zs (list or np.ndarray):

The observed sensitive attributes of each individual in the trajectory that is going to be sampled. It should be a 2D list or array following the Sensitive Attributes Format.

state_dim (int):

The number of components in the state vector.

T (int):

The total number of transitions in the trajectory that is to be sampled.

policy (Agent):

The policy used to generate actions for the trajectory.

f_ua (Callable, optional):

A rule to generate exogenous variables for each individual’s actions. It should be a function whose argument list, argument names, and return type exactly match those of f_ua_default.

f_errors_states (Callable, optional):

A rule to generate exogenous variables for each individual’s states. It should be a function whose argument list, argument names, and return type exactly match those of f_errors_states_default.

f_errors_rewards (Callable, optional):

A rule to generate exogenous variables for each individual’s rewards. It should be a function whose argument list, argument names, and return type exactly match those of f_errors_rewards_default.

seed (int, optional):

The random seed used to sample the trajectory.

Returns:
Z (np.ndarray):

The observed sensitive attributes of each individual in the sampled trajectory. It is an array following the Sensitive Attributes Format.

X (np.ndarray):

The sampled state trajectory. It is an array following the Full-trajectory States Format.

A (np.ndarray):

The sampled action trajectory. It is an array following the Full-trajectory Actions Format.

R (np.ndarray):

The sampled reward trajectory. It is an array following the Full-trajectory Rewards Format.
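
A sampling sketch follows; fitted_env is a SimulatedEnvironment that has already been fitted and policy an already-trained pycfrl Agent, both assumed to exist.

import numpy as np
from pycfrl.environment.environment import sample_simulated_env_trajectory

N, T, state_dim = 100, 30, 1
zs = np.random.default_rng(4).binomial(1, 0.5, size=(N, 1))

# fitted_env and policy are assumed to already exist.
Z, X, A, R = sample_simulated_env_trajectory(
    env=fitted_env,
    zs=zs,
    state_dim=state_dim,
    T=T,
    policy=policy,
    seed=1,
)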

pycfrl.environment.environment.sample_trajectory(env: ~pycfrl.environment.environment.SyntheticEnvironment, zs: list | ~numpy.ndarray, state_dim: int, T: int, policy: ~pycfrl.agents.agents.Agent, f_ux: ~typing.Callable[[int, int], ~numpy.ndarray] = <function f_ux_default>, f_ua: ~typing.Callable[[int], ~numpy.ndarray] = <function f_ua_default>, f_ur: ~typing.Callable[[int], ~numpy.ndarray] = <function f_ur_default>, seed: int = 1) tuple[ndarray, ndarray, ndarray, ndarray]

Sample a trajectory from some synthetic environment.

Args:
env (SyntheticEnvironment):

The environment to sample the trajectory from.

zs (list or np.ndarray):

The observed sensitive attributes of each individual in the trajectory that is going to be sampled. It should be a 2D list or array following the Sensitive Attributes Format.

state_dim (int):

The number of components in the state vector.

T (int):

The total number of transitions in the trajectory that is to be sampled.

policy (Agent):

The policy used to generate actions for the trajectory.

f_ux (Callable, optional):

A rule to generate exogenous variables for each individual’s states. It should be a function whose argument list, argument names, and return type exactly match those of f_ux_default.

f_ua (Callable, optional):

A rule to generate exogenous variables for each individual’s actions. It should be a function whose argument list, argument names, and return type exactly match those of f_ua_default.

f_ur (Callable, optional):

A rule to generate exogenous variables for each individual’s rewards. It should be a function whose argument list, argument names, and return type exactly match those of f_ur_default.

seed (int, optional):

The random seed used to sample the trajectory.

Returns:
Z (np.ndarray):

The observed sensitive attributes of each individual in the sampled trajectory. It is an array following the Sensitive Attributes Format.

X (np.ndarray):

The sampled state trajectory. It is an array following the Full-trajectory States Format.

A (np.ndarray):

The sampled action trajectory. It is an array following the Full-trajectory Actions Format.

R (np.ndarray):

The sampled reward trajectory. It is an array following the Full-trajectory Rewards Format.
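
A final sampling sketch, using the default exogenous-variable generators; policy is an already-trained pycfrl Agent whose construction is omitted.

import numpy as np
from pycfrl.environment.environment import SyntheticEnvironment, sample_trajectory

N, T, state_dim = 100, 30, 1
zs = np.random.default_rng(5).binomial(1, 0.5, size=(N, 1))

env = SyntheticEnvironment(state_dim=state_dim, z_coef=1)
Z, X, A, R = sample_trajectory(
    env=env,
    zs=zs,
    state_dim=state_dim,
    T=T,
    policy=policy,    # an already-trained pycfrl Agent, assumed to exist
    seed=1,
)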