FQE
This module implements the fitted Q evaluation algorithm for offline policy evaluation.
from pycfrl import fqe
- class pycfrl.fqe.fqe.FQE(num_actions: int, policy: Agent, model_type: Literal['lm', 'nn'], hidden_dims: list[int] = [32], learning_rate: int | float = 0.1, epochs: int = 500, gamma: int | float = 0.9, is_loss_monitored: bool = True, is_early_stopping_nn: bool = False, test_size_nn: int | float = 0.2, loss_monitoring_patience: int = 10, loss_monitoring_min_delta: int | float = 0.005, early_stopping_patience_nn: int = 10, early_stopping_min_delta_nn: int | float = 0.005, is_q_monitored: bool = True, is_early_stopping_q: bool = False, q_monitoring_patience: int = 5, q_monitoring_min_delta: int | float = 0.005, early_stopping_patience_q: int = 5, early_stopping_min_delta_q: int | float = 0.005)
Bases: object
Implementation of the fitted Q evaluation (FQE) algorithm.
FQE can be used to estimate the value of a policy using offline data.
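For orientation, FQE estimates this value by iteratively regressing a Q-function approximator onto bootstrapped targets built from the policy under evaluation. A standard form of the update (written for a deterministic evaluation policy \(\pi\); this is the textbook formulation, not a transcription of pycfrl's internal code) is

\[
\hat{Q}_{k+1} = \arg\min_{Q} \sum_{i=1}^{N} \sum_{t} \Big( Q(S_{i,t}, A_{i,t}) - \big[ R_{i,t} + \gamma\, \hat{Q}_{k}\big(S_{i,t+1}, \pi(S_{i,t+1})\big) \big] \Big)^{2},
\]

where \(\gamma\) is the discount factor. The policy value is then read off from the final \(\hat{Q}\), evaluated at the observed states under \(\pi\).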
- __init__(num_actions: int, policy: Agent, model_type: Literal['lm', 'nn'], hidden_dims: list[int] = [32], learning_rate: int | float = 0.1, epochs: int = 500, gamma: int | float = 0.9, is_loss_monitored: bool = True, is_early_stopping_nn: bool = False, test_size_nn: int | float = 0.2, loss_monitoring_patience: int = 10, loss_monitoring_min_delta: int | float = 0.005, early_stopping_patience_nn: int = 10, early_stopping_min_delta_nn: int | float = 0.005, is_q_monitored: bool = True, is_early_stopping_q: bool = False, q_monitoring_patience: int = 5, q_monitoring_min_delta: int | float = 0.005, early_stopping_patience_q: int = 5, early_stopping_min_delta_q: int | float = 0.005) None
- Args:
- num_actions (int):
The total number of valid actions.
- policy (Agent):
The policy to be evaluated.
- model_type (str):
The type of the model used for learning the Q function. Can be "lm" (polynomial regression) or "nn" (neural network). Currently, only "nn" is supported.
- hidden_dims (list[int], optional):
The hidden dimensions of the neural network. This argument is not used if model_type="lm".
- learning_rate (int or float, optional):
The learning rate of the neural network. This argument is not used if model_type="lm".
- epochs (int, optional):
The number of training epochs for the neural network. This argument is not used if model_type="lm".
- gamma (int or float, optional):
The discount factor for the cumulative discounted reward in the objective function.
- is_loss_monitored (bool, optional):
When set to True, splits the training data into a training set and a validation set and monitors the validation loss when training the neural network approximator of the Q function in each iteration. A warning is raised if the percent absolute change in the validation loss is greater than loss_monitoring_min_delta for at least one of the final \(p\) epochs during neural network training, where \(p\) is specified by the argument loss_monitoring_patience. This argument is not used if model_type="lm".
- is_early_stopping_nn (bool, optional):
When set to True, splits the training data into a training set and a validation set and enforces early stopping based on the validation loss when training the neural network approximator of the Q function in each iteration. That is, in each iteration, neural network training stops early if the percent decrease in the validation loss is no greater than early_stopping_min_delta_nn for \(q\) consecutive training epochs, where \(q\) is specified by the argument early_stopping_patience_nn. This argument is not used if model_type="lm".
- test_size_nn (int or float, optional):
An int or float between 0 and 1 (inclusive) that specifies the proportion of the full training data used as the validation set for loss monitoring and early stopping. This argument is not used if model_type="lm" or if both is_loss_monitored and is_early_stopping_nn are False.
- loss_monitoring_patience (int, optional):
The number of consecutive epochs with barely-changing validation loss at the end of neural network training that is needed for loss monitoring not to raise a warning. This argument is not used if model_type="lm" or is_loss_monitored=False.
- loss_monitoring_min_delta (int or float, optional):
The maximum percent absolute change in the validation loss for it to be considered barely-changing by the loss monitoring mechanism. This argument is not used if model_type="lm" or is_loss_monitored=False.
- early_stopping_patience_nn (int, optional):
The number of consecutive epochs with barely-decreasing validation loss during neural network training that is needed for early stopping to be triggered. This argument is not used if model_type="lm" or is_early_stopping_nn=False.
- early_stopping_min_delta_nn (int or float, optional):
The maximum decrease in the validation loss for it to be considered barely-decreasing by the early stopping mechanism. This argument is not used if model_type="lm" or is_early_stopping_nn=False.
- is_q_monitored (bool, optional):
When set to True, monitors the Q values estimated by the neural network approximator of the Q function in each iteration at all state-action pairs present in the training trajectory. A warning is raised if the percent absolute change in some Q value is greater than q_monitoring_min_delta for at least one of the final \(r\) iterations of model updates, where \(r\) is specified by the argument q_monitoring_patience. This argument is not used if model_type="lm".
- is_early_stopping_q (bool, optional):
When set to True, monitors the Q values estimated by the neural network approximator of the Q function at all state-action pairs present in the training trajectory and enforces early stopping based on the estimated Q values when training the approximated Q function. That is, FQE training stops early if the percent absolute changes in all the predicted Q values are no greater than early_stopping_min_delta_q for \(s\) consecutive iterations of model updates, where \(s\) is specified by the argument early_stopping_patience_q. This argument is not used if model_type="lm".
- q_monitoring_patience (int, optional):
The number of consecutive iterations with barely-changing estimated Q values at the end of the iterative updates that is needed for Q value monitoring not to raise a warning. This argument is not used if model_type="lm" or is_q_monitored=False.
- q_monitoring_min_delta (int or float, optional):
The maximum percent absolute change in the estimated Q values for them to be considered barely-changing by the Q value monitoring mechanism. This argument is not used if model_type="lm" or is_q_monitored=False.
- early_stopping_patience_q (int, optional):
The number of consecutive iterations with barely-changing estimated Q values that is needed for early stopping to be triggered. This argument is not used if model_type="lm" or is_early_stopping_q=False.
- early_stopping_min_delta_q (int or float, optional):
The maximum percent absolute change in the estimated Q values for them to be considered barely-changing by the early stopping mechanism. This argument is not used if model_type="lm" or is_early_stopping_q=False.
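A minimal construction sketch is shown below. It assumes an already-trained pycfrl Agent named trained_policy is available (a placeholder, not something defined by this module), and the hyperparameter values are illustrative rather than recommended defaults.

```python
from pycfrl.fqe.fqe import FQE  # full path as documented above

# `trained_policy` is an assumed, already-trained pycfrl Agent (placeholder).
evaluator = FQE(
    num_actions=2,              # binary action space
    policy=trained_policy,      # the policy whose value we want to estimate
    model_type="nn",            # only "nn" is currently supported
    hidden_dims=[32, 32],       # two hidden layers for the Q network
    learning_rate=0.001,
    epochs=500,
    gamma=0.9,                  # discount factor
    is_early_stopping_nn=True,  # stop NN training when validation loss plateaus
    is_early_stopping_q=True,   # stop FQE iterations when Q values stabilize
)
```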
- evaluate(zs: list | numpy.ndarray, states: list | numpy.ndarray, actions: list | numpy.ndarray, f_ua: typing.Callable[[int], int] = <function f_ua_default>) ndarray
Estimate the value of the policy.
It uses the FQE algorithm and the input offline trajectory to evaluate the policy of interest.
- Args:
- zs (list or np.ndarray):
The observed sensitive attributes of each individual in the offline trajectory used for evaluation. It should be a 2D list or array following the Sensitive Attributes Format.
- states (list or np.ndarray):
The state trajectory used for evaluation. It should be a 3D list or array following the Full-trajectory States Format.
- actions (list or np.ndarray):
The action trajectory used for evaluation, often generated using a behavior policy. It should be a 2D list or array following the Full-trajectory Actions Format.
- f_ua (Callable, optional):
A rule to generate exogenous variables for each individual's actions during evaluation. It should be a function whose argument list, argument names, and return type exactly match those of f_ua_default.
- Returns:
- Y (np.ndarray):
A vector containing multiple estimates of the value of the policy of interest. It is an array with shape (N*T, ) where N is the number of individuals in the input offline trajectory and T is the total number of transitions in the input offline trajectory.
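A usage sketch follows. It assumes evaluator has already been fit (see fit() below) and that zs, states, and actions follow the formats listed above; averaging the returned vector into a single scalar is an illustrative choice, not an aggregation prescribed by this method.

```python
import numpy as np

# `evaluator`, `zs`, `states`, and `actions` are assumed to exist already.
Y = evaluator.evaluate(zs=zs, states=states, actions=actions)  # shape (N*T,)

# One plausible scalar summary of the per-transition value estimates:
value_estimate = float(np.mean(Y))
```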
- fit(zs: list | numpy.ndarray, states: list | numpy.ndarray, actions: list | numpy.ndarray, rewards: list | numpy.ndarray, max_iter: int = 1000, f_ua: typing.Callable[[int], int] = <function f_ua_default>) None
Fit the FQE.
- Args:
- zs (list or np.ndarray):
The observed sensitive attributes of each individual in the training data. It should be a 2D list or array following the Sensitive Attributes Format.
- states (list or np.ndarray):
The state trajectory used for training. It should be a 3D list or array following the Full-trajectory States Format.
- actions (list or np.ndarray):
The action trajectory used for training, often generated using a behavior policy. It should be a 2D list or array following the Full-trajectory Actions Format.
- rewards (list or np.ndarray):
The reward trajectory used for training. It should be a 2D list or array following the Full-trajectory Rewards Format.
- max_iter (int, optional):
The number of iterations for learning the Q function.
- f_ua (Callable, optional):
A rule to generate exogenous variables for each individual's actions during training. It should be a function whose argument list, argument names, and return type exactly match those of f_ua_default.
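The sketch below shows a call to fit() on synthetic placeholder trajectories. The array shapes (N individuals, T transitions, one sensitive attribute, three state features, and T+1 recorded states per individual) are assumptions made for illustration; consult the referenced trajectory formats for the exact layout expected by pycfrl.

```python
import numpy as np

N, T, state_dim = 100, 10, 3
rng = np.random.default_rng(0)

# Synthetic placeholder data; shapes are assumptions, not the official spec.
zs = rng.integers(0, 2, size=(N, 1))             # Sensitive Attributes Format (2D)
states = rng.normal(size=(N, T + 1, state_dim))  # Full-trajectory States Format (3D)
actions = rng.integers(0, 2, size=(N, T))        # Full-trajectory Actions Format (2D)
rewards = rng.normal(size=(N, T))                # Full-trajectory Rewards Format (2D)

evaluator.fit(zs=zs, states=states, actions=actions,
              rewards=rewards, max_iter=200)
```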
- pycfrl.fqe.fqe.f_ua_default(N: int) ndarray
Generate exogenous variables for the actions from a uniform distribution between 0 and 1.
- Args:
- N (int):
The total number of individuals for whom the exogenous variables will be generated.
- Returns:
- ua (np.ndarray):
The generated exogenous variables. It is an (N, 1) array where each entry is sampled from a uniform distribution between 0 and 1.
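A small sketch of the default generator; a custom f_ua passed to fit() or evaluate() must match its argument list, argument names, and return type.

```python
from pycfrl.fqe.fqe import f_ua_default

ua = f_ua_default(5)
print(ua.shape)  # (5, 1); each entry is drawn uniformly between 0 and 1
```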