Evaluation

This module implements functions used for evaluating the value and fairness of policies.

from pycfrl import evaluation
pycfrl.evaluation.evaluation.evaluate_fairness_through_model(env: SimulatedEnvironment, zs: list | np.ndarray, states: list | np.ndarray, actions: list | np.ndarray, policy: Agent, f_ua: Callable[[int], int] = <function f_ua_default>, seed: int = 1) → np.integer | np.floating

Estimate the counterfactual fairness metric of a policy from an offline trajectory.

The function first estimates a set of counterfactual trajectories from the offline trajectory using estimate_counterfactual_trajectories_from_data() in the environment module. Then it computes a counterfactual fairness metric using the following formula given in Wang et al. (2025):

\[\max_{z', z \in eval(Z)} \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \mathbb{I} \left( A_t^{Z \leftarrow z'}\left(\bar{U}_t(h_{it})\right) \neq A_t^{Z \leftarrow z}\left(\bar{U}_t(h_{it})\right) \right).\]

Here, \(eval(Z)\) is the set of sensitive attribute levels under evaluation, \(A_t^{Z \leftarrow z'}\left(\bar{U}_t(h_{it})\right)\) is the action taken in the counterfactual trajectory under \(Z=z'\), and \(A_t^{Z \leftarrow z}\left(\bar{U}_t(h_{it})\right)\) is the action taken in the counterfactual trajectory under \(Z=z\). The metric is bounded between 0 and 1, with 0 representing perfect fairness and 1 indicating complete unfairness.
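
To make the metric concrete, the following minimal NumPy sketch computes the same quantity from counterfactual action arrays. The container actions_by_z and the helper name are illustrative only and not part of the pycfrl API:

    import itertools
    import numpy as np

    def cf_metric_sketch(actions_by_z: dict) -> float:
        # actions_by_z maps each sensitive-attribute level z to an (N, T) array of
        # counterfactual actions A_t^{Z <- z}. For every pair of levels, compute the
        # fraction of (i, t) entries where the actions disagree, then take the maximum.
        disagreement_rates = [
            np.mean(actions_by_z[z1] != actions_by_z[z2])
            for z1, z2 in itertools.combinations(actions_by_z.keys(), 2)
        ]
        return float(max(disagreement_rates)) if disagreement_rates else 0.0

A value of 0 means the policy recommends the same action for every counterfactual version of each individual at every time step.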

Args:
env (SimulatedEnvironment):

An environment that simulates the transition dynamics of the MDP underlying zs, states, actions, and rewards.

zs (list or np.ndarray):

The observed sensitive attributes of each individual in the offline trajectory. It should be a list or array following the Sensitive Attributes Format.

states (list or np.ndarray):

The state trajectory. It should be a list or array following the Full-trajectory States Format.

actions (list or np.ndarray):

The action trajectory. It should be a list or array following the Full-trajectory Actions Format.

policy (Agent):

The policy whose fairness is to be evaluated.

f_ua (Callable, optional):

A rule to generate exogenous variables for each individual’s actions. It should be a function whose argument list, argument names, and return type exactly match those of f_ua_default.

seed (int, optional):

The seed used to estimate the counterfactual trajectories.

Returns:
cf_metric (np.integer or np.floating):

The counterfactual fairness metric of the policy.
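
A hedged usage sketch is shown below; env, policy, zs, states, and actions are assumed to already exist and to follow the formats described above:

    from pycfrl.evaluation import evaluation

    cf_metric = evaluation.evaluate_fairness_through_model(
        env=env,            # a fitted SimulatedEnvironment
        zs=zs,              # Sensitive Attributes Format
        states=states,      # Full-trajectory States Format
        actions=actions,    # Full-trajectory Actions Format
        policy=policy,      # the Agent whose fairness is evaluated
        seed=1,
    )
    print(cf_metric)        # 0 = perfectly fair, 1 = completely unfair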

pycfrl.evaluation.evaluation.evaluate_fairness_through_simulation(env: SyntheticEnvironment, z_eval_levels: list | np.ndarray, state_dim: int, N: int, T: int, policy: Agent, f_ux: Callable[[int, int], int] = <function f_ux_default>, f_ua: Callable[[int], int] = <function f_ua_default>, f_ur: Callable[[int], int] = <function f_ur_default>, z_probs: list | np.ndarray | None = None, seed: int = 1) → np.integer | np.floating

Estimate the counterfactual fairness metric of a policy using simulation in a synthetic environment.

The function first simulates a set of counterfactual trajectories with a pre-specified length using sample_counterfactual_trajectories() in the environment module. Then it computes a counterfactual fairness metric using the following formula given in Wang et al. (2025):

\[\max_{z', z \in eval(Z)} \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \mathbb{I} \left( A_t^{Z \leftarrow z'}\left(\bar{U}_t(h_{it})\right) \neq A_t^{Z \leftarrow z}\left(\bar{U}_t(h_{it})\right) \right).\]

Here, \(eval(Z)\) is the set of sensitive attribute values passed in by z_eval_levels, \(A_t^{Z \leftarrow z'}\left(\bar{U}_t(h_{it})\right)\) is the action taken in the counterfactual trajectory under \(Z=z'\), and \(A_t^{Z \leftarrow z}\left(\bar{U}_t(h_{it})\right)\) is the action taken in the counterfactual trajectory under \(Z=z\). The metric is bounded between 0 and 1, with 0 representing perfect fairness and 1 indicating complete unfairness.

Args:
env (SyntheticEnvironment):

The synthetic environment in which the simulation is run.

z_eval_levels (list or np.ndarray):

The values of sensitive attributes for which counterfactual trajectories are generated in the simulation. The observed sensitive attributes of the individuals in the simulation will also be sampled from this set.

state_dim (int):

The number of components in the state vector.

N (int):

The total number of individuals in the counterfactual trajectories sampled during the simulation.

T (int):

The total number of transitions in the counterfactual trajectories sampled during the simulation.

policy (Agent):

The policy whose fairness is to be evaluated.

f_ux (Callable, optional):

A rule to generate exogenous variables for each individual’s states. It should be a function whose argument list, argument names, and return type exactly match those of f_ux_default.

f_ua (Callable, optional):

A rule to generate exogenous variables for each individual’s actions. It should be a function whose argument list, argument names, and return type exactly match those of f_ua_default.

f_ur (Callable, optional):

A rule to generate exogenous variables for each individual’s rewards. It should be a function whose argument list, argument names, and return type exactly match those of f_ur_default.

z_probs (list or np.ndarray, optional):

The probability of an individual taking each of the values in z_eval_levels as the observed sensitive attribute. When set to None, a uniform distribution will be used.

seed (int, optional):

The random seed used to run the simulation.

Returns:
cf_metric (np.integer or np.floating):

The counterfactual fairness metric of the policy.
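
For example, a sketch of a call under an assumed binary sensitive attribute and a two-dimensional state (both values are illustrative, not requirements) might look as follows:

    from pycfrl.evaluation import evaluation

    cf_metric = evaluation.evaluate_fairness_through_simulation(
        env=env,                  # a SyntheticEnvironment
        z_eval_levels=[0, 1],     # illustrative binary sensitive attribute
        state_dim=2,              # must match the environment's state dimension
        N=500,                    # simulated individuals
        T=30,                     # transitions per individual
        policy=policy,            # the Agent whose fairness is evaluated
        z_probs=None,             # None: observed z sampled uniformly from z_eval_levels
        seed=1,
    )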

pycfrl.evaluation.evaluation.evaluate_reward_through_fqe(zs: list | np.ndarray, states: list | np.ndarray, actions: list | np.ndarray, rewards: list | np.ndarray, policy: Agent, model_type: Literal['lm', 'nn'], f_ua: Callable[[int], int] = <function f_ua_default>, hidden_dims: list[int] = [32], learning_rate: int | float = 0.1, epochs: int = 500, gamma: int | float = 0.9, max_iter: int = 200, seed: int = 1, is_loss_monitored: bool = True, is_early_stopping_nn: bool = False, test_size_nn: int | float = 0.2, loss_monitoring_patience: int = 10, loss_monitoring_min_delta: int | float = 0.005, early_stopping_patience_nn: int = 10, early_stopping_min_delta_nn: int | float = 0.005, is_q_monitored: bool = True, is_early_stopping_q: bool = False, q_monitoring_patience: int = 5, q_monitoring_min_delta: int | float = 0.005, early_stopping_patience_q: int = 5, early_stopping_min_delta_q: int | float = 0.005) → np.integer | np.floating

Estimate the value of a policy using fitted Q evaluation (FQE).

The function takes in an offline trajectory and the policy of interest, which are then used by an FQE algorithm to estimate the value of the policy.

Args:
zs (list or np.ndarray):

The observed sensitive attributes of each individual in the offline trajectory. It should be a list or array following the Sensitive Attributes Format.

states (list or np.ndarray):

The state trajectory. It should be a list or array following the Full-trajectory States Format.

actions (list or np.ndarray):

The action trajectory. It should be a list or array following the Full-trajectory Actions Format.

rewards (list or np.ndarray):

The reward trajectory. It should be a list or array following the Full-trajectory Rewards Format.

policy (Agent):

The policy whose value is to be evaluated.

model_type (str):

The type of the model used for FQE. Can be "lm" (polynomial regression) or "nn" (neural network). Currently, only "nn" is supported.

f_ua (Callable, optional):

A rule to generate exogenous variables for each individual’s actions during training. It should be a function whose argument list, argument names, and return type exactly match those of f_ua_default.

hidden_dims (list[int], optional):

The hidden dimensions of the neural network. This argument is not used if model_type="lm".

learning_rate (int or float, optional):

The learning rate of the neural network. This argument is not used if model_type="lm".

epochs (int, optional):

The number of training epochs for the neural network. This argument is not used if model_type="lm".

gamma (int or float, optional):

The discount factor for the cumulative discounted reward in the objective function.

max_iter (int, optional):

The number of iterations for learning the Q function.

seed (int, optional):

The random seed used for FQE.

is_loss_monitored (bool, optional):

When set to True, will split the training data into a training set and a validation set, and will monitor the validation loss when training the neural network approximator of the Q function in each iteration. A warning will be raised if the percent absolute change in the validation loss is greater than loss_monitoring_min_delta for at least one of the final \(p\) epochs during neural network training, where \(p\) is specified by the argument loss_monitoring_patience. This argument is not used if model_type="lm".

is_early_stopping_nn (bool, optional):

When set to True, will split the training data into a training set and a validation set, and will enforce early stopping based on the validation loss when training the neural network approximator of the Q function in each iteration. That is, in each iteration, neural network training will stop early if the percent decrease in the validation loss is no greater than early_stopping_min_delta_nn for \(q\) consecutive training epochs, where \(q\) is specified by the argument early_stopping_patience_nn. This argument is not used if model_type="lm".

test_size_nn (int or float, optional):

An int or float between 0 and 1 (inclusive) that specifies the proportion of the full training data that is used as the validation set for loss monitoring and early stopping. This argument is not used if model_type="lm" or if both is_loss_monitored and is_early_stopping_nn are False.

loss_monitoring_patience (int, optional):

The number of consecutive epochs with barely-changing validation loss at the end of neural network training that is needed for loss monitoring to not raise warnings. This argument is not used if model_type="lm" or is_loss_monitored=False.

loss_monitoring_min_delta (int or float, optional):

The maximum amount of percent absolute change in the validation loss for it to be considered barely-changing by the loss monitoring mechanism. This argument is not used if model_type="lm" or is_loss_monitored=False.

early_stopping_patience_nn (int, optional):

The number of consecutive epochs with barely-decreasing validation loss during neural network training that is needed for early stopping to be triggered. This argument is not used if model_type="lm" or is_early_stopping_nn=False.

early_stopping_min_delta_nn (int or float, optional):

The maximum percent decrease in the validation loss for it to be considered barely-decreasing by the early stopping mechanism. This argument is not used if model_type="lm" or is_early_stopping_nn=False.

is_q_monitored (bool, optional):

When set to True, will monitor the Q values estimated by the neural network approximator of the Q function in each iteration at all the state-action pairs present in the training trajectory. A warning will be raised if the percent absolute change in some Q value is greater than q_monitoring_min_delta for at least one of the final \(r\) iterations of model updates, where \(r\) is specified by the argument q_monitoring_patience. This argument is not used if model_type="lm".

is_early_stopping_q (bool, optional):

When set to True, will monitor the Q values estimated by the neural network approximator of the Q function at all the state-action pairs present in the training trajectory, and will enforce early stopping based on the estimated Q values when training the approximated Q function. That is, FQE training will stop early if the percent absolute changes in all the predicted Q values are no greater than early_stopping_min_delta_q for \(s\) consecutive iterations of model updates, where \(s\) is specified by the argument early_stopping_patience_q. This argument is not used if model_type="lm".

q_monitoring_patience (int, optional):

The number of consecutive iterations with barely-changing estimated Q values at the end of the iterative updates that is needed for Q value monitoring to not raise warnings. This argument is not used if model_type="lm" or is_q_monitored=False.

q_monitoring_min_delta (int or float, optional):

The maximum amount of percent absolute change in the estimated Q values for them to be considered barely-changing by the Q value monitoring mechanism. This argument is not used if model_type="lm" or is_q_monitored=False.

early_stopping_patience_q (int, optional):

The number of consecutive iterations with barely-changing estimated Q values that is needed for early stopping to be triggered. This argument is not used if model_type="lm" or is_early_stopping_q=False.

early_stopping_min_delta_q (int or float, optional):

The maximum amount of percent absolute change in the estimated Q values for them to be considered barely-changing by the early stopping mechanism. This argument is not used if model_type="lm" or is_early_stopping_q=False.

Returns:
discounted_cumulative_reward (np.integer or np.floating):

An estimation of the discounted cumulative reward achieved by the policy throughout the trajectory.
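
A hedged usage sketch, assuming the offline trajectory arrays and the policy already exist and using the documented defaults for the monitoring and early-stopping options:

    from pycfrl.evaluation import evaluation

    value = evaluation.evaluate_reward_through_fqe(
        zs=zs,
        states=states,
        actions=actions,
        rewards=rewards,
        policy=policy,
        model_type="nn",     # currently the only supported model type
        hidden_dims=[32],
        learning_rate=0.1,
        epochs=500,
        gamma=0.9,
        max_iter=200,
        seed=1,
    )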

pycfrl.evaluation.evaluation.evaluate_reward_through_simulation(env: SyntheticEnvironment, z_eval_levels: list | np.ndarray, state_dim: int, N: int, T: int, policy: Agent, f_ux: Callable[[int, int], np.ndarray] = <function f_ux_default>, f_ua: Callable[[int], np.ndarray] = <function f_ua_default>, f_ur: Callable[[int], np.ndarray] = <function f_ur_default>, z_probs: list | np.ndarray | None = None, gamma: int | float = 0.9, seed: int = 1) → np.integer | np.floating

Estimate the value of a policy using simulation in a synthetic environment.

The function first simulates a trajectory of a pre-specified length T using the policy. Then it computes the cumulative discounted rewards achieved throughout the trajectory.

Because the discounted rewards are summed across all time steps, the estimate will generally be higher when a larger value is specified for the argument T.
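
The returned quantity is a discounted sum of the simulated rewards. The sketch below illustrates why the estimate generally grows with T; the exact averaging over individuals is an assumption for illustration, not the library's internal implementation:

    import numpy as np

    def discounted_cumulative_reward_sketch(rewards: np.ndarray, gamma: float = 0.9) -> float:
        # rewards: (N, T) array of simulated rewards. Average over individuals at
        # each time step, then sum with weights gamma**t. Every extra time step adds
        # a non-negative discounted term when rewards are non-negative, so a larger T
        # generally yields a larger total.
        T = rewards.shape[1]
        discounts = gamma ** np.arange(T)
        return float(np.sum(discounts * rewards.mean(axis=0)))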

Args:
env (SyntheticEnvironment):

The synthetic environment in which the simulation is run.

z_eval_levels (list or np.ndarray):

The values of sensitive attributes used in the simulation. The observed sensitive attributes of the individuals in the simulation will be sampled from this set.

state_dim (int):

The number of components in the state vector.

N (int):

The total number of individuals in the trajectory sampled during the simulation.

T (int):

The total number of transitions in the trajectory sampled during the simulation.

policy (Agent):

The policy whose value is to be evaluated.

f_ux (Callable, optional):

A rule to generate exogenous variables for each individual’s states. It should be a function whose argument list, argument names, and return type exactly match those of f_ux_default.

f_ua (Callable, optional):

A rule to generate exogenous variables for each individual’s actions. It should be a function whose argument list, argument names, and return type exactly match those of f_ua_default.

f_ur (Callable, optional):

A rule to generate exogenous variables for each individual’s rewards. It should be a function whose argument list, argument names, and return type exactly match those of f_ur_default.

z_probs (list or np.ndarray, optional):

The probability of an individual taking each of the values in z_eval_levels. When set to None, a uniform distribution will be used.

gamma (int or float, optional):

The discount factor used for calculating the discounted cumulative rewards.

seed (int, optional):

The random seed used to run the simulation.

Returns:
discounted_cumulative_reward (np.integer or np.floating):

An estimation of the discounted cumulative reward achieved by the policy throughout the trajectory.
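
A usage sketch mirroring the fairness simulation above, with illustrative values for the environment-dependent arguments:

    from pycfrl.evaluation import evaluation

    value = evaluation.evaluate_reward_through_simulation(
        env=env,                # a SyntheticEnvironment
        z_eval_levels=[0, 1],   # illustrative sensitive-attribute levels
        state_dim=2,            # must match the environment's state dimension
        N=500,
        T=30,
        policy=policy,
        gamma=0.9,
        seed=1,
    )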

pycfrl.evaluation.evaluation.f_ua_default(N: int) → np.ndarray

Generate exogenous variables for the actions from a uniform distribution between 0 and 1.

Args:
N (int):

The total number of individuals for whom the exogenous variables will be generated.

Returns:
ua (np.ndarray):

The generated exogenous variables. It is an (N, 1) array where each entry is sampled from a uniform distribution between 0 and 1.

pycfrl.evaluation.evaluation.f_ur_default(N: int) → np.ndarray

Generate exogenous variables for the rewards from a standard normal distribution.

Args:
N (int):

The total number of individuals for whom the exogenous variables will be generated.

Returns:
ur (np.ndarray):

The generated exogenous variables. It is an (N, 1) array where each entry is sampled from a standard normal distribution.

pycfrl.evaluation.evaluation.f_ux_default(N: int, state_dim: int) → np.ndarray

Generate exogenous variables for the states from a standard normal distribution.

Args:
N (int):

The total number of individuals for whom the exogenous variables will be generated.

state_dim (int):

The number of components in the state vector.

Returns:
ux (np.ndarray):

The generated exogenous variables. It is an (N, state_dim) array where each entry is sampled from a standard normal distribution.
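
Custom exogenous-variable rules can be passed via the f_ux, f_ua, and f_ur arguments in place of these defaults, as long as they keep the same argument lists, argument names, and return shapes. The functions below are hypothetical examples of such replacements and are not part of pycfrl:

    import numpy as np

    def f_ux_custom(N: int, state_dim: int) -> np.ndarray:
        # Same signature and (N, state_dim) return shape as f_ux_default,
        # but with heavier-tailed state noise.
        return np.random.standard_t(df=5, size=(N, state_dim))

    def f_ua_custom(N: int) -> np.ndarray:
        # Same signature and (N, 1) return shape as f_ua_default.
        return np.random.uniform(size=(N, 1))

    def f_ur_custom(N: int) -> np.ndarray:
        # Same signature and (N, 1) return shape as f_ur_default,
        # with reduced reward noise.
        return 0.5 * np.random.normal(size=(N, 1))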