FQE

This module implements the fitted Q evaluation algorithm for offline policy evaluation.

from pycfrl import fqe
class pycfrl.fqe.fqe.FQE(num_actions: int, policy: Agent, model_type: Literal['lm', 'nn'], hidden_dims: list[int] = [32], learning_rate: int | float = 0.1, epochs: int = 500, gamma: int | float = 0.9, is_loss_monitored: bool = True, is_early_stopping_nn: bool = False, test_size_nn: int | float = 0.2, loss_monitoring_patience: int = 10, loss_monitoring_min_delta: int | float = 0.005, early_stopping_patience_nn: int = 10, early_stopping_min_delta_nn: int | float = 0.005, is_q_monitored: bool = True, is_early_stopping_q: bool = False, q_monitoring_patience: int = 5, q_monitoring_min_delta: int | float = 0.005, early_stopping_patience_q: int = 5, early_stopping_min_delta_q: int | float = 0.005)

Bases: object

Implementation of the fitted Q evaluation (FQE) algorithm.

FQE can be used to estimate the value of a policy using offline data.
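In outline, the standard FQE recursion estimates this value by iterated regression on the offline data: starting from an initial estimate \(Q_0\), each iteration fits a new model to the bootstrapped targets

\[
Q_{k+1}(S_t, A_t) \approx R_t + \gamma \, Q_k\big(S_{t+1}, \pi(S_{t+1})\big),
\]

where \(\pi(S_{t+1})\) denotes the action selected at the next state by the policy under evaluation (an expectation over the policy's action distribution if it is stochastic) and \(\gamma\) is the discount factor. In this class, the number of such iterations is capped by the max_iter argument of fit(), and the regression model used at each iteration is chosen via model_type.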

__init__(num_actions: int, policy: Agent, model_type: Literal['lm', 'nn'], hidden_dims: list[int] = [32], learning_rate: int | float = 0.1, epochs: int = 500, gamma: int | float = 0.9, is_loss_monitored: bool = True, is_early_stopping_nn: bool = False, test_size_nn: int | float = 0.2, loss_monitoring_patience: int = 10, loss_monitoring_min_delta: int | float = 0.005, early_stopping_patience_nn: int = 10, early_stopping_min_delta_nn: int | float = 0.005, is_q_monitored: bool = True, is_early_stopping_q: bool = False, q_monitoring_patience: int = 5, q_monitoring_min_delta: int | float = 0.005, early_stopping_patience_q: int = 5, early_stopping_min_delta_q: int | float = 0.005) None
Args:
num_actions (int):

The total number of legitimate actions.

policy (Agent):

The policy to be evaluated.

model_type (str):

The type of the model used for learning the Q function. Can be "lm" (polynomial regression) or "nn" (neural network). Currently, only "nn" is supported.

hidden_dims (list[int], optional):

The hidden dimensions of the neural network. This argument is not used if model_type="lm".

learning_rate (int or float, optional):

The learning rate of the neural network. This argument is not used if model_type="lm".

epochs (int, optional):

The number of training epochs for the neural network. This argument is not used if model_type="lm".

gamma (int or float, optional):

The discount factor for the cumulative discounted reward in the objective function.

is_loss_monitored (bool, optional):

When set to True, will split the training data into a training set and a validation set, and will monitor the validation loss when training the neural network approximator of the Q function in each iteration. A warning will be raised if the percent absolute change in the validation loss is greater than loss_monitoring_min_delta for at least one of the final \(p\) epochs during neural network training, where \(p\) is specified by the argument loss_monitoring_patience. This argument is not used if model_type="lm".

is_early_stopping_nn (bool, optional):

When set to True, will split the training data into a training set and a validation set, and will enforce early stopping based on the validation loss when training the neural network approximator of the Q function in each iteration. That is, in each iteration, neural network training will stop early if the percent decrease in the validation loss is no greater than early_stopping_min_delta_nn for \(q\) consecutive training epochs, where \(q\) is specified by the argument early_stopping_patience_nn. This argument is not used if model_type="lm".

test_size_nn (int or float, optional):

An int or float between 0 and 1 (inclusive) that specifies the proportion of the full training data that is used as the validation set for loss monitoring and early stopping. This argument is not used if model_type="lm" or if both is_loss_monitored and is_early_stopping_nn are False.

loss_monitoring_patience (int, optional):

The number of consecutive epochs with barely-changing validation loss at the end of neural network training that is needed for loss monitoring to not raise warnings. This argument is not used if model_type="lm" or is_loss_monitored=False.

loss_monitoring_min_delta (int or float, optional):

The maximum amount of percent absolute change in the validation loss for it to be considered barely-changing by the loss monitoring mechanism. This argument is not used if model_type="lm" or is_loss_monitored=False.

early_stopping_patience_nn (int, optional):

The number of consecutive epochs with barely-decreasing validation loss during neural network training that is needed for early stopping to be triggered. This argument is not used if model_type="lm" or is_early_stopping_nn=False.

early_stopping_min_delta_nn (int or float, optional):

The maximum percent decrease in the validation loss for it to be considered barely-decreasing by the early stopping mechanism. This argument is not used if model_type="lm" or is_early_stopping_nn=False.

is_q_monitored (bool, optional):

When set to True, will monitor the Q values estimated by the neural network approximator of the Q function in each iteration at all the state-action pairs present in the training trajectory. A warning will be raised if the percent absolute change in some Q value is greater than q_monitoring_min_delta for at least one of the final \(r\) iterations of model updates, where \(r\) is specified by the argument q_monitoring_patience. This argument is not used if model_type="lm".

is_early_stopping_q (bool, optional):

When set to True, will monitor the Q values estimated by the neural network approximator of the Q function at all the state-action pairs present in the training trajectory, and will enforce early stopping based on the estimated Q values when training the approximated Q function. That is, FQE training will stop early if the percent absolute changes in all the predicted Q values are no greater than early_stopping_min_delta_q for \(s\) consecutive iterations of model updates, where \(s\) is specified by the argument early_stopping_patience_q. This argument is not used if model_type="lm".

q_monitoring_patience (int, optional):

The number of consecutive iterations with barely-changing estimated Q values at the end of the iterative updates that is needed for Q value monitoring to not raise warnings. This argument is not used if model_type="lm" or is_q_monitored=False.

q_monitoring_min_delta (int or float, optional):

The maximum amount of percent absolute change in the estimated Q values for them to be considered barely-changing by the Q value monitoring mechanism. This argument is not used if model_type="lm" or is_q_monitored=False.

early_stopping_patience_q (int, optional):

The number of consecutive iterations with barely-changing estimated Q values that is needed for early stopping to be triggered. This argument is not used if model_type="lm" or is_early_stopping_q=False.

early_stopping_min_delta_q (int or float, optional):

The maximum amount of percent absolute change in the estimated Q values for them to be considered barely-changing by the early stopping mechanism. This argument is not used if model_type="lm" or is_early_stopping_q=False.
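For illustration, a minimal construction might look like the following sketch. The policy object is a placeholder for an Agent obtained elsewhere in pycfrl, and the hyperparameter values are arbitrary examples rather than recommendations; depending on how the package re-exports the class, from pycfrl import fqe followed by fqe.fqe.FQE(...) is an equivalent way to reach it.

from pycfrl.fqe.fqe import FQE

# `policy` is assumed to be an already-constructed Agent whose value we want
# to estimate; obtaining it is outside the scope of this sketch.
evaluator = FQE(
    num_actions=2,          # two legitimate actions in the offline data
    policy=policy,
    model_type="nn",        # only "nn" is currently supported
    hidden_dims=[64, 64],   # hidden layer sizes of the Q network
    learning_rate=0.01,
    epochs=500,
    gamma=0.9,
)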

evaluate(zs: list | numpy.ndarray, states: list | numpy.ndarray, actions: list | numpy.ndarray, f_ua: typing.Callable[[int], int] = <function f_ua_default>) ndarray

Estimate the value of the policy.

It uses the FQE algorithm and the input offline trajectory to evaluate the policy of interest.

Args:
zs (list or np.ndarray):

The observed sensitive attributes of each individual in the offline trajectory used for evaluation. It should be a 2D list or array following the Sensitive Attributes Format.

states (list or np.ndarray):

The state trajectory used for evaluation. It should be a 3D list or array following the Full-trajectory States Format.

actions (list or np.ndarray):

The action trajectory used for evaluation, often generated using a behavior policy. It should be a 2D list or array following the Full-trajectory Actions Format.

f_ua (Callable, optional):

A rule to generate exogenous variables for each individual’s actions during evaluation. It should be a function whose argument list, argument names, and return type exactly match those of f_ua_default.

Returns:
Y (np.ndarray):

A vector containing multiple estimates of the value of the policy of interest. It is an array of shape (N*T,), where N is the number of individuals in the input offline trajectory and T is the number of transitions per individual.
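A usage sketch, assuming evaluator has already been fitted via fit() and that zs_eval, states_eval, and actions_eval are placeholder arrays following the formats described above:

import numpy as np

values = evaluator.evaluate(
    zs=zs_eval,
    states=states_eval,
    actions=actions_eval,
)  # array of shape (N*T,)

# Averaging the returned estimates is one natural way to report a single
# point estimate of the policy value; this aggregation is a choice made in
# the sketch, not something prescribed by the API.
policy_value_estimate = np.mean(values)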

fit(zs: list | numpy.ndarray, states: list | numpy.ndarray, actions: list | numpy.ndarray, rewards: list | numpy.ndarray, max_iter: int = 1000, f_ua: typing.Callable[[int], int] = <function f_ua_default>) None

Fit the FQE model on the input offline trajectory.

Args:
zs (list or np.ndarray):

The observed sensitive attributes of each individual in the training data. It should be a 2D list or array following the Sensitive Attributes Format.

states (list or np.ndarray):

The state trajectory used for training. It should be a 3D list or array following the Full-trajectory States Format.

actions (list or np.ndarray):

The action trajectory used for training, often generated using a behavior policy. It should be a 2D list or array following the Full-trajectory Actions Format.

rewards (list or np.ndarray):

The reward trajectory used for training. It should be a 2D list or array following the Full-trajectory Rewards Format.

max_iter (int, optional):

The maximum number of iterations for learning the Q function.

f_ua (Callable, optional):

A rule to generate exogenous variables for each individual’s actions during training. It should be a function whose argument list, argument names, and return type exactly match those of f_ua_default.
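A usage sketch, assuming evaluator is the FQE instance constructed earlier and that zs, states, actions, and rewards are placeholder arrays following the formats described above:

evaluator.fit(
    zs=zs,
    states=states,
    actions=actions,
    rewards=rewards,
    max_iter=500,  # cap on the number of Q-function update iterations
)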

pycfrl.fqe.fqe.f_ua_default(N: int) ndarray

Generate exogenous variables for the actions from a uniform distribution between 0 and 1.

Args:
N (int):

The total number of individuals for whom the exogenous variables will be generated.

Returns:
ua (np.ndarray):

The generated exogenous variables. It is an (N, 1) array where each entry is sampled from a uniform distribution between 0 and 1.
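Based on the description above, a custom rule with the same interface as f_ua_default might look like the following sketch. It only illustrates the required signature and the documented sampling behavior; the package's actual implementation may differ (for example, in how randomness is seeded).

import numpy as np

def my_f_ua(N: int) -> np.ndarray:
    # One exogenous variable per individual, drawn uniformly from [0, 1),
    # mirroring the documented behavior of f_ua_default.
    return np.random.uniform(low=0.0, high=1.0, size=(N, 1))

Such a function can be passed as the f_ua argument of fit() or evaluate() in place of the default.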