Estimate the counterfactual fairness metric of a policy from an offline trajectory.
The function first estimates a set of counterfactual trajectories from the offline trajectory
using estimate_counterfactual_trajectories_from_data() in the environment module.
Then it computes the counterfactual fairness metric given in Wang et al. (2025). The metric
compares \(A_t^{Z \leftarrow z'}\left(\bar{U}_t(h_{it})\right)\), the action taken in the
counterfactual trajectory under \(Z = z'\), with
\(A_t^{Z \leftarrow z}\left(\bar{U}_t(h_{it})\right)\), the action taken in the counterfactual
trajectory under \(Z = z\), across all pairs of values \(z, z' \in eval(Z)\), where
\(eval(Z)\) is the set of sensitive attribute values passed in by z_eval_levels. The metric
is bounded between 0 and 1, with 0 representing perfect fairness and 1 indicating complete
unfairness.
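For intuition, the sketch below shows one way a disagreement-based metric of this form can be
computed once the counterfactual actions are available. The function name cf_disagreement and
the (N, T, K) array layout are illustrative assumptions, not part of this package; the exact
estimator is the one defined in Wang et al. (2025).

    import numpy as np

    def cf_disagreement(actions_by_level: np.ndarray) -> float:
        """Illustrative disagreement metric over counterfactual actions.

        actions_by_level has shape (N, T, K): entry [i, t, k] is the action taken
        at time t for individual i in the counterfactual trajectory generated
        under the k-th value of z_eval_levels.
        """
        n, t, k = actions_by_level.shape
        # Pairwise comparison of actions across sensitive-attribute levels.
        diff = actions_by_level[:, :, :, None] != actions_by_level[:, :, None, :]
        off_diagonal = ~np.eye(k, dtype=bool)  # only compare distinct levels z != z'
        # 0 = counterfactual actions always agree (perfect fairness);
        # 1 = they never agree (complete unfairness).
        return float(diff[:, :, off_diagonal].mean())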
References:
Args:
env (SimulatedEnvironment):
An environment that simulates the transition dynamics of the
MDP underlying zs, states, actions, and rewards.
zs (list or np.ndarray):
The observed sensitive attributes of each individual in the
offline trajectory. It should be a list or array following the Sensitive Attributes
Format.
states (list or np.ndarray):
The state trajectory. It should be a list or array following
the Full-trajectory States Format.
actions (list or np.ndarray):
The action trajectory. It should be a list or array following
the Full-trajectory Actions Format.
policy (Agent):
The policy whose fairness is to be evaluated.
f_ua (Callable, optional):
A rule to generate exogenous variables for each individual’s
actions. It should be a function whose argument list, argument names, and return
type exactly match those of f_ua_default.
seed (int, optional):
The seed used to estimate the counterfactual trajectories.
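A hedged usage sketch follows. The function name estimate_cf_metric_from_data is a hypothetical
placeholder (substitute the name actually exported by the package), and env, zs, states,
actions, and policy are assumed to have been prepared beforehand in the formats documented above.

    cf_metric = estimate_cf_metric_from_data(
        env=env,          # a SimulatedEnvironment fitted to the offline data
        zs=zs,            # Sensitive Attributes Format
        states=states,    # Full-trajectory States Format
        actions=actions,  # Full-trajectory Actions Format
        policy=policy,    # the Agent whose fairness is evaluated
        seed=42,          # for reproducible counterfactual estimation
    )
    print(f"Estimated counterfactual fairness metric: {cf_metric:.3f}")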
Estimate the counterfactual fairness metric of a policy using simulation in a synthetic environment.
The function first simulates a set of counterfactual trajectories with a pre-specified length
using sample_counterfactual_trajectories() in the environment module. Then it computes
the counterfactual fairness metric given in Wang et al. (2025). The metric compares
\(A_t^{Z \leftarrow z'}\left(\bar{U}_t(h_{it})\right)\), the action taken in the
counterfactual trajectory under \(Z = z'\), with
\(A_t^{Z \leftarrow z}\left(\bar{U}_t(h_{it})\right)\), the action taken in the counterfactual
trajectory under \(Z = z\), across all pairs of values \(z, z' \in eval(Z)\), where
\(eval(Z)\) is the set of sensitive attribute values passed in by z_eval_levels. The metric
is bounded between 0 and 1, with 0 representing perfect fairness and 1 indicating complete
unfairness.
References:
Args:
env (SyntheticEnvironment):
The synthetic environment in which the simulation is run.
z_eval_levels (list or np.ndarray):
The values of sensitive attributes for which
counterfactual trajectories are generated in the simulation.
The observed sensitive attributes of the individuals in the simulation will also be
sampled from this set.
state_dim (int):
The number of components in the state vector.
N (int):
The total number of individuals in the counterfactual trajectories sampled during the
simulation.
T (int):
The total number of transitions in the counterfactual trajectories sampled during the
simulation.
policy (Agent):
The policy whose fairness is to be evaluated.
f_ux (Callable, optional):
A rule to generate exogenous variables for each individual’s
states. It should be a function whose argument list, argument names, and return
type exactly match those of f_ux_default.
f_ua (Callable, optional):
A rule to generate exogenous variables for each individual’s
actions. It should be a function whose argument list, argument names, and return
type exactly match those of f_ua_default.
f_ur (Callable, optional):
A rule to generate exogenous variables for each individual’s
rewards. It should be a function whose argument list, argument names, and return
type exactly match those of f_ur_default.
z_probs (list or np.ndarray, optional):
The probability of an individual taking each of
the values in z_eval_levels as the observed sensitive attribute. When set to
None, a uniform distribution will be used.
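A hedged usage sketch. The function name estimate_cf_metric_from_simulation is a hypothetical
placeholder, env and policy are assumed to have been constructed beforehand, and the numeric
arguments are illustrative.

    import numpy as np

    cf_metric = estimate_cf_metric_from_simulation(
        env=env,                         # a SyntheticEnvironment
        z_eval_levels=np.array([0, 1]),  # illustrative binary sensitive attribute
        state_dim=3,
        N=500,                           # individuals in the simulated trajectories
        T=20,                            # transitions per individual
        policy=policy,                   # the Agent whose fairness is evaluated
        z_probs=None,                    # uniform over z_eval_levels
    )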
Estimate the value of a policy using fitted Q evaluation (FQE).
The function takes in an offline trajectory and the policy of interest, which are then used by
an FQE algorithm to estimate the value of the policy.
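To illustrate the general idea (not this package's implementation), the sketch below shows a
generic FQE loop in which an off-the-shelf regressor stands in for the neural network; it omits
terminal-state handling and the loss- and Q-monitoring options documented below.

    import numpy as np
    from sklearn.neural_network import MLPRegressor  # stand-in approximator

    def fqe_sketch(states, actions, rewards, next_states, next_actions,
                   gamma=0.99, max_iter=50):
        """next_actions are the actions the evaluated policy would take in
        next_states. Each iteration regresses Q(s, a) onto the Bellman target
        r + gamma * Q_hat(s', pi(s')) computed from the previous iterate."""
        x = np.column_stack([states, actions])
        x_next = np.column_stack([next_states, next_actions])
        q_next = np.zeros(len(rewards))           # Q_0 = 0
        for _ in range(max_iter):
            targets = rewards + gamma * q_next    # Bellman targets under the policy
            model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500)
            model.fit(x, targets)
            q_next = model.predict(x_next)        # bootstrap the next iteration
        return model                              # final estimate of Q under the policy

In a typical FQE setup, the value estimate is then obtained by evaluating the fitted Q function
at the initial states paired with the actions the policy would take there.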
Args:
zs (list or np.ndarray):
The observed sensitive attributes of each individual in the
offline trajectory. It should be a list or array following the Sensitive Attributes
Format.
states (list or np.ndarray):
The state trajectory. It should be a list or array following
the Full-trajectory States Format.
actions (list or np.ndarray):
The action trajectory. It should be a list or array following
the Full-trajectory Actions Format.
rewards (list or np.ndarray):
The reward trajectory. It should be a list or array following
the Full-trajectory Rewards Format.
policy (Agent):
The policy whose value is to be evaluated.
model_type (str):
The type of the model used for FQE. Can be "lm" (polynomial regression) or
"nn" (neural network). Currently, only "nn" is supported.
f_ua (Callable, optional):
A rule to generate exogenous variables for each individual’s
actions during training. It should be a function whose argument list, argument names,
and return type exactly match those of f_ua_default.
hidden_dims (list[int], optional):
The hidden dimensions of the neural network. This
argument is not used if model_type="lm".
learning_rate (int or float, optional):
The learning rate of the neural network. This
argument is not used if model_type="lm".
epochs (int, optional):
The number of training epochs for the neural network. This
argument is not used if model_type="lm".
gamma (int or float, optional):
The discount factor for the cumulative discounted reward
in the objective function.
max_iter (int, optional):
The number of iterations for learning the Q function.
seed (int, optional):
The random seed used for FQE.
is_loss_monitored (bool, optional):
When set to True, will split the training data into a training set and a
validation set, and will monitor the validation loss when training the neural network
approximator of the Q function in each iteration. A warning
will be raised if the percent absolute change in the validation loss is greater than loss_monitoring_min_delta for at
least one of the final \(p\) epochs during neural network training, where \(p\) is specified
by the argument loss_monitoring_patience. This argument is not used if model_type="lm".
is_early_stopping_nn (bool, optional):
When set to True, will split the training data into a training set and a
validation set, and will enforce early stopping based on the validation loss
when training the neural network approximator of the Q function in each iteration. That is, in each iteration,
neural network training will stop early
if the percent decrease in the validation loss is no greater than early_stopping_min_delta_nn for \(q\) consecutive training
epochs, where \(q\) is specified by the argument early_stopping_patience_nn. This argument is not used if
model_type="lm".
test_size_nn (int or float, optional):
An int or float between 0 and 1 (inclusive) that
specifies the proportion of the full training data that is used as the validation set for loss
monitoring and early stopping. This argument is not used if model_type="lm" or
both is_loss_monitored and is_early_stopping_nn are False.
loss_monitoring_patience (int, optional):
The number of consecutive epochs with barely-changing validation loss at the end of neural network training that is needed
for loss monitoring to not raise warnings. This argument is not used if model_type="lm"
or is_loss_monitored=False.
loss_monitoring_min_delta (int or float, optional):
The maximum amount of percent absolute change in the validation loss for it to be considered
barely-changing by the loss monitoring mechanism. This argument is
not used if model_type="lm" or is_loss_monitored=False.
early_stopping_patience_nn (int, optional):
The number of consecutive epochs with barely-decreasing validation loss during neural network training that is needed
for early stopping to be triggered. This argument is not used if model_type="lm"
or is_early_stopping_nn=False.
early_stopping_min_delta_nn (int or float, optional):
The maximum amount of percent decrease in the validation loss for it to be considered
barely-decreasing by the early stopping mechanism. This argument is
not used if model_type="lm" or is_early_stopping_nn=False.
is_q_monitored (bool, optional):
When set to True, will monitor the Q values estimated by the neural network
approximator of the Q function in each iteration at all the state-action pairs present in the training trajectory. A warning
will be raised if the percent absolute change in some Q value is greater than q_monitoring_min_delta for at
least one of the final \(r\) iterations of model updates, where \(r\) is specified
by the argument q_monitoring_patience. This argument is not used if model_type="lm".
is_early_stopping_q (bool, optional):
When set to True, will monitor the Q values estimated by the neural network
approximator of the Q function at all the state-action pairs present in the training trajectory,
and will enforce early stopping based on the estimated Q values
when training the approximated Q function. That is,
FQE training will stop early
if the percent absolute changes in all the predicted Q values are no greater than early_stopping_min_delta_q for \(s\) consecutive
iterations of model updates, where \(s\) is specified by the argument early_stopping_patience_q. This argument is not used if
model_type="lm".
q_monitoring_patience (int, optional):
The number of consecutive iterations with barely-changing estimated Q values at the end of the iterative updates that is needed
for Q value monitoring to not raise warnings. This argument is not used if model_type="lm"
or is_q_monitored=False.
q_monitoring_min_delta (int or float, optional):
The maximum amount of percent absolute change in the estimated Q values for them to be considered
barely-changing by the Q value monitoring mechanism. This argument is
not used if model_type="lm" or is_q_monitored=False.
early_stopping_patience_q (int, optional):
The number of consecutive iterations with barely-changing estimated Q values that is needed
for early stopping to be triggered. This argument is not used if model_type="lm"
or is_early_stopping_q=False.
early_stopping_min_delta_q (int or float, optional):
The maximum amount of percent absolute change in the estimated Q values for them to be considered
barely-changing by the early stopping mechanism. This argument is
not used if model_type="lm" or is_early_stopping_q=False.
Returns:
discounted_cumulative_reward (np.integer or np.floating):
An estimate of the discounted
cumulative reward achieved by the policy throughout the trajectory.
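A hedged usage sketch. The function name estimate_value_through_fqe is a hypothetical
placeholder; zs, states, actions, rewards, and policy are assumed to be prepared in the
documented formats, and the keyword values shown are illustrative rather than defaults.

    value = estimate_value_through_fqe(
        zs=zs,
        states=states,
        actions=actions,
        rewards=rewards,
        policy=policy,
        model_type="nn",           # only "nn" is currently supported
        hidden_dims=[64, 64],
        learning_rate=1e-3,
        epochs=100,
        gamma=0.95,
        max_iter=50,
        seed=0,
        is_loss_monitored=True,    # warn if the validation loss has not settled
        is_early_stopping_nn=True,
    )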
Estimate the value of a policy using simulation in a synthetic environment.
The function first simulates a trajectory of a pre-specified length T using the policy.
Then it computes the discounted cumulative reward achieved throughout the trajectory.
Since the discounted rewards are summed across all time steps, the estimate will generally
be larger when a larger value is specified for the argument T.
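For reference, the sketch below shows the quantity being computed, assuming the simulated
rewards are arranged as an (N, T) array; averaging over the N individuals is an assumption of
this sketch rather than documented behavior.

    import numpy as np

    def discounted_cumulative_reward(rewards: np.ndarray, gamma: float) -> float:
        """rewards has shape (N, T): N individuals, T transitions."""
        n, t = rewards.shape
        discounts = gamma ** np.arange(t)  # 1, gamma, gamma^2, ...
        # Sum gamma^t * r_t over time, averaged over individuals. A term is
        # added for every time step, so the value grows with T (for gamma < 1
        # it approaches a finite limit).
        return float((rewards * discounts).mean(axis=0).sum())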
Args:
env (SyntheticEnvironment):
The synthetic environment in which the simulation is run.
z_eval_levels (list or np.ndarray):
The values of sensitive attributes used in the simulation.
The observed sensitive attributes of the individuals in the simulation will be sampled
from this set.
state_dim (int):
The number of components in the state vector.
N (int):
The total number of individuals in the trajectory sampled during the simulation.
T (int):
The total number of transitions in the trajectory sampled during the simulation.
policy (Agent):
The policy whose value is to be evaluated.
f_ux (Callable, optional):
A rule to generate exogenous variables for each individual’s
states. It should be a function whose argument list, argument names, and return
type exactly match those of f_ux_default.
f_ua (Callable, optional):
A rule to generate exogenous variables for each individual’s
actions. It should be a function whose argument list, argument names, and return
type exactly match those of f_ua_default.
f_ur (Callable, optional):
A rule to generate exogenous variables for each individual’s
rewards. It should be a function whose argument list, argument names, and return
type exactly match those of f_ur_default.
z_probs (list or np.ndarray, optional):
The probability of an individual taking each of
the values in z_eval_levels. When set to None, a uniform distribution
will be used.
gamma (int or float, optional):
The discount factor used for calculating the discounted
cumulative rewards.
seed (int, optional):
The random seed used to run the simulation.
Returns:
discounted_cumulative_reward (np.integer or np.floating):
An estimate of the discounted
cumulative reward achieved by the policy throughout the trajectory.
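A hedged usage sketch. The function name estimate_value_through_simulation is a hypothetical
placeholder; env and policy are assumed to have been constructed beforehand, and the numeric
arguments are illustrative.

    import numpy as np

    value = estimate_value_through_simulation(
        env=env,                         # a SyntheticEnvironment
        z_eval_levels=np.array([0, 1]),
        state_dim=3,
        N=500,                           # simulated individuals
        T=20,                            # transitions per individual
        policy=policy,                   # the Agent whose value is evaluated
        z_probs=None,                    # uniform over z_eval_levels
        gamma=0.95,
        seed=0,
    )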