Detecting Changes and Avoiding Catastrophic Forgetting in Dynamic Partially Observable Environments

The ability of an agent to detect changes in an environment is key to successful adaptation. This ability involves at least two phases: learning a model of an environment, and detecting that a change is likely to have occurred when this model is no longer accurate. This task is particularly challenging in partially observable environments, such as those modeled with partially observable Markov decision processes (POMDPs). Some predictive learners are able to infer the state from observations and thus perform better under partial observability. Predictive state representations (PSRs) and neural networks are two such tools that can be trained to predict the probabilities of future observations. However, most existing methods of this kind focus primarily on static problems in which only one environment is learned. In this paper, we propose an algorithm that uses statistical tests to estimate the probability that different predictive models fit the current environment. We exploit the underlying probability distributions of predictive models to provide a fast and explainable method to assess and justify the model's beliefs about the current environment. Crucially, by doing so, the method can label incoming data as fitting different models, and thus can continuously train separate models in different environments. This new method is shown to prevent catastrophic forgetting when new environments, or tasks, are encountered. The method can also be of use when AI-informed decisions require justifications, because its beliefs are based on statistical evidence from observations. We empirically demonstrate the benefit of the novel method with simulations in a set of POMDP environments.

The learning and predicting parts of the constrained gradient algorithm were separated. As the stored state history constitutes the knowledge of the constrained gradient agent, the history is updated only when the agent is asked to learn, not when a prediction is made.
If the agent is asked not to learn for multiple consecutive time steps, then continuing the agent's training from its most recent state in history negatively affects the agent's performance if that state is not accurate for the time step at which learning is resumed. For this reason, we introduced a parameter that records whether the agent trained on the most recent time step. If the agent did not train on the last time step but is asked to train on the next, the agent removes states from history until the latest state in history matches the current predicted state. This way the agent resumes learning with a minimised effect on performance.
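The rollback step can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name, the list representation of the history, and the equality-based matching are all assumptions.

```python
# Hypothetical sketch of the history-rollback step described above.
# `history` holds the agent's stored state vectors, one per trained
# time step; `predicted_state` is the agent's current predicted state.

def resume_learning(history, predicted_state, match=lambda a, b: a == b):
    """Pop states from history until the latest stored state matches the
    current predicted state, so learning can resume from that point."""
    while history and not match(history[-1], predicted_state):
        history.pop()
    return history

# Example: the last stored state [1, 0] no longer matches the predicted
# state [1, 1], so it is discarded before learning resumes.
trimmed = resume_learning([[1, 0], [1, 1], [1, 0]], [1, 1])
```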
For the constrained gradient agent, the AMD does not cluster predictions based on the prediction of the next time step, but on the PSR state vector, which consists of predictions spanning multiple time steps into the future. This makes the clustering more accurate, especially when two states have the same probability for the next observation but differ in observations further in the future. If only the prediction of the next observation were used to cluster, two such separate states in the underlying environment would be grouped into a single cluster in AMD. When an observation with a predicted probability of 0 is observed, it is a clear indication that the current model is not valid and, due to the PSR formulation presented in Section 2.1, the PSR state vector becomes inaccurate. To illustrate the process with a practical example, consider the simple environment shown in Fig. 1. A possible PSR for this environment has the core tests {∅, 'blue'}. In this case, the stationary distribution is [1, 0.5], the PSR state vector for state S0 is [1, 0], and the PSR state vector for state S1 is [1, 1].
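The benefit of clustering on the full PSR state vector can be seen with a small numerical sketch. The vectors below are hypothetical, not taken from the paper's environments: entry 0 is the empty test, entry 1 the next-observation test, and entry 2 a test two steps ahead.

```python
import numpy as np

# Hypothetical PSR state vectors for two distinct underlying states.
s_a = np.array([1.0, 0.5, 0.2])
s_b = np.array([1.0, 0.5, 0.9])

# Clustering on the next-observation prediction alone cannot
# distinguish the two states ...
same_next_obs = np.isclose(s_a[1], s_b[1])

# ... while a distance on the full state vector separates them,
# because the longer test predicts different futures.
full_vector_gap = np.linalg.norm(s_a - s_b)
```

Here `same_next_obs` is true while `full_vector_gap` is clearly nonzero, so a clustering over the full vector places the two states in different clusters.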
If the current PSR state vector at time i is y(h_i) = [1, 0] (corresponding to state S0), and the observation 'blue' occurs at time i + 1, the new state vector cannot be calculated due to a division by zero. This condition can be seen in the operation that finds the element of y(h_{i+1}) corresponding to the empty test, y_∅(h_{i+1}) = (m_{ao}ᵀ y(h_i)) / (m_{ao}ᵀ y(h_i)), whose denominator, the predicted probability of the observed action-observation pair, is 0 in this case. This case can be prevented by setting the state vector to the default state y(∅) whenever an observation with a predicted probability of 0 occurs. Although the constrained gradient algorithm was used, AMD can be used with any PSR learner.
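A guarded PSR update of this kind can be sketched as below. The function name, the specific update parameters (a weight vector m_ao predicting the pair (a, o) and a matrix M_ao extending each core test with (a, o)), and the numbers in the example are illustrative assumptions; only the reset-to-default guard reflects the rule stated above.

```python
import numpy as np

def psr_update(state, m_ao, M_ao, default_state):
    """One PSR state update after taking action a and observing o.
    If the predicted probability of (a, o) is 0, the normal update
    would divide by zero, so the state is reset to the default
    state y(empty) instead."""
    denom = float(m_ao @ state)
    if np.isclose(denom, 0.0):
        return default_state.copy()
    return (M_ao @ state) / denom

# Hypothetical numbers loosely following the Fig. 1 example: in state
# S0 the observation 'blue' has predicted probability 0, so observing
# it triggers the reset to the default state.
default = np.array([1.0, 0.5])
m_blue = np.array([0.0, 1.0])          # predicted probability of 'blue'
M_blue = np.array([[0.0, 1.0],
                   [0.0, 1.0]])        # illustrative extension matrix
after_s0 = psr_update(np.array([1.0, 0.0]), m_blue, M_blue, default)
```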

Neural Networks
When training, the neural network is fed the 5 most recent actions and observations as input. The network is trained on the prediction loss for the next observation given by the environment. The error, which is not fed to the agent and is only used to measure performance, is calculated by comparing the agent's prediction of the next observation with the environment's true probability. Three layers were used: an input layer of size 10, a hidden layer of size 10, and an output layer of size 2, matching the number of possible observations in the environment. Softmax is used on the output layer.
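The architecture above can be sketched as a forward pass. The ReLU hidden activation and the random weight initialisation are assumptions, as they are not specified here; only the layer sizes and the softmax output come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Network described above: input of size 10 (the 5 most recent
# action-observation pairs), a hidden layer of size 10, and a softmax
# output of size 2, one unit per possible observation.
W1 = rng.normal(scale=0.1, size=(10, 10))
b1 = np.zeros(10)
W2 = rng.normal(scale=0.1, size=(10, 2))
b2 = np.zeros(2)

def predict(x):
    """Forward pass: predicted probability distribution over the two
    possible next observations."""
    h = np.maximum(0.0, x @ W1 + b1)      # hidden layer (assumed ReLU)
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max())     # numerically stable softmax
    return e / e.sum()
```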
AMD assumes the network's predictions are accurate for the environment it is supposed to have learned. The clusters are formed based on the output of the neural network, which is the agent's prediction of the probability distribution of the next observation. Therefore, the state and prediction provided to AMD are the same.
Note that the neural network has some potential issues which should be considered when working with more complex environments. If the environment is an infinite-order Markov model, no fixed-length input is long enough to produce an accurate output; LSTMs may be a better choice. We used a vanilla neural network as we only intend to show the value of AMD, rather than the performance of the network. Additionally, in POMDPs where different underlying states share the same probability distribution over the next observation (for example, in Fig. 2, both S0 and S1 have the same probability of observing the "cream" observation next), the neural network's output does not distinguish between the underlying states. Using a hidden layer to form the clusters instead may help distinguish between such states, at the cost of increased computational time.
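The limitation and the proposed remedy can be illustrated with a deliberately degenerate example (all numbers are hypothetical): when the two output-weight columns are identical, every hidden activation maps to the same softmax output, so two underlying states are indistinguishable at the output yet separable by clustering their hidden activations.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Output weights with identical columns: any hidden vector h yields
# equal logits, hence the uniform output [0.5, 0.5].
W2 = np.ones((3, 2))

h_s0 = np.array([1.0, 0.0, 0.0])   # hidden activation in state S0
h_s1 = np.array([0.0, 1.0, 2.0])   # hidden activation in state S1

out0 = softmax(h_s0 @ W2)          # same output for both states
out1 = softmax(h_s1 @ W2)
hidden_gap = np.linalg.norm(h_s0 - h_s1)   # nonzero: hidden layer separates
```

Clustering on `out0`/`out1` merges the two states, while clustering on the hidden activations keeps them apart, which is the trade-off described above.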

PSEUDO-CODE
Algorithm 1 describes the AMD algorithm for detecting the probability of a predictive model's current observations coming from the environment in which it was trained.¹