RLLib: C++ Template Library to Learn Behaviors and Represent Learnable Knowledge using On/Off Policy Reinforcement Learning

Saminda Abeyruwan saminda@cs.miami.edu

Abstract

RLLib is a lightweight C++ template library that implements incremental, standard, and gradient temporal-difference learning algorithms in Reinforcement Learning. It is optimized for robotic applications that operate under fast duty cycles (e.g., ≤ 30 ms). RLLib has been tested and evaluated on RoboCup 3D Soccer Simulation agents and physical NAO V4 humanoid robots to learn behaviors and represent learnable knowledge.
 
Note: This is a work-in-progress document. Last modified on 11/19/2013.

1  Introduction

The RoboCup initiative presents a dynamic, complex, adversarial, real-time, and stochastic multi-agent environment in which agents learn to solve different problems. Reinforcement Learning (RL) is a framework that provides a set of tools for designing sophisticated and hard-to-engineer behaviors. RL agents naturally formalize their learning goals in two layers:
  1. physical layers - control related to walking, kicking, etc.; and
  2. decision layers - emergent behaviors expressed through actions, options, or knowledge representations.
Therefore, the modern RL framework not only addresses control; it also provides the means to represent highly expressive knowledge from sensorimotor streams using General Value Functions (GVFs). RLLib is a lightweight C++ template library that implements incremental, standard, and gradient temporal-difference learning algorithms in RL to effectively solve goals defined in the physical and decision layers. It is designed and written specifically for robotic applications that operate under fast duty cycles. RLLib1 is released under the open source Apache License, Version 2.0, and it was first announced on MLOSS on September 16, 2013.

2  Features

  1. Off-policy prediction algorithms [GTD(λ) and GQ(λ)];
  2. Off-policy control algorithms [Greedy-GQ(λ), Softmax-GQ(λ), and Off-PAC (with provision to be used in an on-policy setting)];
  3. On-policy algorithms [TD(λ), SARSA(λ), Expected-SARSA(λ), and Actor-Critic (continuous and discrete actions, discounted and average-reward settings, etc.)];
  4. Supervised learning algorithms [Adaline, IDBD, SemiLinearIDBD, and Autostep];
  5. Policies [Random, RandomXPercentBias, Greedy, Epsilon-greedy, Boltzmann, Normal, and Softmax];
  6. Sparse feature extractors (e.g., tile coding) with pluggable hash functions;
  7. An efficient implementation of the dot product for sparse coder based feature representations;
  8. Benchmark environments [Mountain Car, Mountain Car 3D, Swinging Pendulum, Helicopter, and Continuous Grid World];
  10. Optimized for very fast duty cycles (e.g., with culling of eligibility traces; RLLib has been tested on a RoboCup 3D simulation agent and on the physical NAO V4 robot (cognition thread));
  10. A framework to represent highly expressive learnable knowledge representations in RL using GVFs;
  11. A framework to design complex behaviors and controllers using topological graphs;
  12. A framework to visualize benchmark problems; and
  13. A plethora of examples demonstrating on-policy and off-policy control experiments.
More information is available in [Abeyruwan et al., 2013].

3  Reinforcement Learning

The standard RL framework [Sutton and Barto, 1998] models the interaction of an AI agent with an environment in discrete time steps t = 1, 2, 3, …. At each time step the agent senses the state of the world, S_t ∈ S, and selects an action A_t ∈ A. One time step later the agent receives a scalar reward R_{t+1} ∈ ℝ and senses the state S_{t+1} ∈ S. The rewards are generated according to a reward function r: S → ℝ, i.e., R_{t+1} = r(S_{t+1}). The objective of the standard RL framework is to learn a stochastic action-selection policy π: S × A → [0, 1], which gives the probability of selecting each action in each state, π(s, a) = π(a | s) = P(A_t = a | S_t = s), such that the agent maximizes the reward summed over the time steps.
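In the discounted setting used below, this objective can be written as maximizing the expected discounted return (a standard formulation, stated here for completeness):

E_π( ∑_{t=0}^{∞} γ^t R_{t+1} ),

where γ ∈ [0, 1) is the discount factor introduced later in this section.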
Recently, a knowledge representation language that is expressive and learnable from sensorimotor data has been introduced within the context of the RL framework. This representation is directly usable in robotic soccer, as agent-environment interactions are conducted through perceptors and actuators. In this approach, knowledge is represented as a large number of approximate value functions, each with its
  1. Own policy;
  2. Pseudo-reward function;
  3. Pseudo-termination function; and
  4. Pseudo-terminal-reward function.
The interpretation of an approximate value function as a knowledge representation language grounded in information from perceptors and actuators is defined as follows: the knowledge expressed by an approximate value function is true, or accurate, if its numerical values match those of the mathematically defined value function it is approximating.
Therefore, a value function asks a question, and an approximate value function provides an answer to that question. Based on this interpretation, the standard RL framework is extended to represent learnable knowledge as follows. First, it is extended with a terminal-reward function z: S → ℝ, where z(s) is the terminal reward received when termination occurs in state s. Second, in the standard RL framework γ ∈ [0, 1) is used to discount delayed rewards; an alternative interpretation of the discount factor is a constant probability 1 − γ of terminating and arriving in a state with zero terminal reward. This factor is generalized to a termination function γ: S → [0, 1], where 1 − γ(s) is the probability of termination at state s, at which point the terminal reward z(s) is generated.
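As a short check of this interpretation (added here for clarity): if 1 − γ is a constant per-step termination probability, independent of the trajectory, and the terminal reward is zero, then the expected undiscounted sum of transient rewards up to the random termination time T recovers the familiar discounted return,

E( ∑_{k=t+1}^{T} r(S_k) ) = ∑_{k=1}^{∞} P(T ≥ t + k) E( r(S_{t+k}) ) = ∑_{k=1}^{∞} γ^{k−1} E( r(S_{t+k}) ),

which matches the discounted return above with R_{t+k} = r(S_{t+k}).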
In continuous state or action spaces, approximate value functions are learned using function approximation, with efficient on- or off-policy learning algorithms. We briefly introduce some of the important concepts related to GVFs; complete information about GVFs is available in [Sutton et al., 2011, Maei and Sutton, 2010, Maei et al., 2010, Maei, 2011].

3.1  Off-Policy Action-Value Methods for GVFs

The first method to learn about GVFs from off-policy experiences is to use action-value functions. Let G_t be the complete return from state S_t at time t; the sum of the rewards (transient plus terminal) until termination at time T is:
G_t = ∑_{k=t+1}^{T} r(S_k) + z(S_T).
The action-value function is:
Q^π(s, a) = E(G_t | S_t = s, A_t = a, A_{t+1:T−1} ∼ π, T ∼ γ),
where Q^π: S × A → ℝ. This is the expected return for a trajectory that starts from state s and action a and selects actions according to the policy π until termination occurs according to γ. We approximate the action-value function with Q̂: S × A → ℝ. Therefore, the action-value function is a precise, grounded question, while the approximate action-value function offers the numerical answer.
The GVFs are defined over four functions: π, γ, r, and z. The functions r and z act as pseudo-reward and pseudo-terminal-reward functions, respectively. The function γ is also in pseudo form; however, it is more substantive than the reward functions, because termination interrupts the normal flow of state transitions. Under a pseudo-termination, the standard termination is omitted. In robotic soccer, for example, the base problem can be defined over the time until a goal is scored by either the home or the opponent team, and we can consider that a pseudo-termination has occurred when the striker is changed. The GVF with respect to a state-action function is defined as:
Q^{π,γ,r,z}(s, a) = E(G_t | S_t = s, A_t = a, A_{t+1:T−1} ∼ π, T ∼ γ).
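To make the soccer example concrete (an illustrative instantiation of the functions above): choose r(s) = 1 for every state, z(s) = 0, and γ(s) = 0 whenever the striker changes in s (and γ(s) = 1 otherwise). Then G_t = T − t, and

Q^{π,γ,r,z}(s, a) = E(T − t | S_t = s, A_t = a, A_{t+1:T−1} ∼ π),

i.e., the GVF asks how many time steps will pass before the striker changes if the agent follows π from state s after taking action a.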
The four functions π, γ, r, and z are the question functions of a GVF, and together they define the general value function's semantics. The RL agent learns an approximate action-value function, Q̂, using the four auxiliary functions π, γ, r, and z. We assume that the state space is continuous and the action space is discrete, and we approximate the action-value function with a linear function approximator. We use a feature extractor ϕ: S_t × A_t → {0, 1}^N, N ∈ ℕ, built on tile coding [Sutton and Barto, 1998] to generate feature vectors from state variables and actions. The resulting vector is sparse with a constant number of "1" features, and hence has a constant norm. In addition, tile coding has the key advantage of enabling real-time learning and computationally efficient implementations of the algorithms that learn approximate value functions. In linear function approximation, there exists a weight vector θ ∈ ℝ^N, N ∈ ℕ, to be learned. Therefore, the approximate GVF is defined as:
Q̂(s, a, θ) = θ^T ϕ(s, a),
such that Q̂: S × A × ℝ^N → ℝ. The weights are learned using gradient-descent temporal-difference algorithms [Maei, 2011], which learn stably and efficiently with linear function approximation from off-policy experiences. Off-policy experiences are generated from a behavior policy, π_b, that is different from the policy being learned about, named the target policy, π. Therefore, one can learn multiple target policies from the same behavior policy.
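Feature 7 of Section 2 refers to an efficient dot product for sparse coder based representations. The following is a minimal, library-independent sketch of the idea (the function and variable names are illustrative and are not RLLib's API): since ϕ(s, a) is binary with a small, constant number of active features, θ^T ϕ(s, a) reduces to summing the weights at the active indices.

  #include <vector>
  #include <cstddef>

  // theta: dense weight vector of dimension N.
  // activeTiles: indices of the features that are 1 in phi(s, a); tile coding
  // produces a constant number of such indices per state-action pair.
  double sparseDotProduct(const std::vector<double>& theta,
                          const std::vector<std::size_t>& activeTiles)
  {
    double result = 0.0;
    for (std::size_t index : activeTiles) // cost is O(active features), not O(N)
      result += theta[index];             // phi is binary, so no multiplication is needed
    return result;
  }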

3.2  Off-Policy Policy Gradient Methods for GVFs

The second method to learn about GVFs uses off-policy policy-gradient methods with actor-critic architectures, based on a state-value function suitable for learning GVFs. It is defined as:
V^{π,γ,r,z}(s) = E(G_t | S_t = s, A_{t:T−1} ∼ π, T ∼ γ),
where V^{π,γ,r,z}(s) is the true state-value function, and the approximate GVF is defined as:
V̂(s, v) = v^T ϕ(s),
where the functions π, γ, r, and z are defined as in subsection 3.1. Since the target policy π is discrete and stochastic, we use a Gibbs distribution of the form:
π(a | s) = e^{u^T ϕ(s,a)} / ∑_b e^{u^T ϕ(s,b)},
where ϕ(s, a) are state-action features for state s and action a, which are in general unrelated to the state features ϕ(s) used in the state-value function approximation. u ∈ ℝ^{N_u}, N_u ∈ ℕ, is a weight vector, which is modified by the actor to learn the stochastic target policy. The log-gradient of the policy at state s and action a is:
∇_u π(a | s) / π(a | s) = ϕ(s, a) − ∑_b π(b | s) ϕ(s, b).
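This expression follows directly from the Gibbs form of the policy; the one-line derivation, added here for clarity, is

∇_u log π(a | s) = ∇_u ( u^T ϕ(s, a) − log ∑_b e^{u^T ϕ(s,b)} ) = ϕ(s, a) − ∑_b π(b | s) ϕ(s, b),

together with the identity ∇_u π(a | s) / π(a | s) = ∇_u log π(a | s).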
To summarize, the question functions and the answer functions are defined as follows; the sketch after these lists relates them to the components used in the Section 4 example. The question functions are defined by:
  1. π: S_t × A_t → [0, 1] (target policy, greedy w.r.t. the learned value function);
  2. γ: S_t → [0, 1] (termination function);
  3. r: S_{t+1} → ℝ (transient reward function); and
  4. z: S_{t+1} → ℝ (terminal reward function).
The answer functions are defined by:
  1. π_b: S_t × A_t → [0, 1] (behavior policy);
  2. I_t: S_t × A_t → [0, 1] (interest function);
  3. ϕ: S_t × A_t → ℝ^N (feature-vector function); and
  4. λ: S_t → [0, 1] (eligibility-trace decay-rate function).
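Read against these lists, the off-policy example of Section 4 instantiates both sets of functions. The following commented sketch records the correspondence as we read it from that example; the pairing of components to functions is an interpretation, not documented API semantics.

  // Question functions (what is being learned about):
  //   pi     -> target   (BoltzmannDistribution, the target policy)
  //   gamma  -> the gamma constant passed to GTDLambda and ActorLambdaOffPolicy
  //   r, z   -> supplied by the environment (ContinuousGridworld)
  // Answer functions (how it is learned):
  //   pi_b   -> behavior (RandomPolicy over the discrete actions)
  //   phi    -> projector and toStateAction (tile coding over states and actions)
  //   lambda -> the lambda constant passed to the critic and actor traces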

4  Off-Policy Example

This section provides an in-depth guide to modeling off-policy learning scenarios with RLLib. We use ContinuousGridworld [Thomas Degris, 2012] as an illustration. The following example shows the setup protocol.
  // From the RLLib
  #include "Vector.h"
  #include "Trace.h"
  #include "Projector.h"
  #include "ControlAlgorithm.h"
  #include "Representation.h"

  // From the simulation
  #include "ContinuousGridworld.h"
  #include "RL.h"

  using namespace RLLib;

  int main()
  {
    // Fix the random seed, create the problem, and build a tile-coding
    // state-action representation of the continuous state.
    Probabilistic::srand(0);
    RLProblem<double>* problem = new ContinuousGridworld<double>;
    Hashing* hashing = new MurmurHashing;
    Projector<double>* projector = new TileCoderHashing<double>(1000000, 10, true, hashing);
    StateToStateAction<double>* toStateAction = new StateActionTilings<double>(projector, problem->getDiscreteActionList());

    // Step-size parameters (scaled by the constant norm of the tile-coding
    // features), discount factor, and eligibility-trace decay rate.
    double alpha_v = 0.1 / projector->vectorNorm();
    double alpha_w = 0.0001 / projector->vectorNorm();
    double gamma = 0.99;
    double lambda = 0.4;
    double alpha_u = 0.001 / projector->vectorNorm();
    
    // GTD(lambda) critic with its eligibility trace, and the Boltzmann
    // target policy to be learned about.
    Trace<double>* critice = new ATrace<double>(projector->dimension());
    GTDLambda<double>* critic = new GTDLambda<double>(alpha_v, alpha_w, gamma, lambda, critice);
    PolicyDistribution<double>* target = new BoltzmannDistribution<double>(projector->dimension(), problem->getDiscreteActionList());

    // Off-policy actor with its eligibility trace, a random behavior policy,
    // and the Off-PAC control learner that ties the pieces together.
    Trace<double>* actore = new ATrace<double>(projector->dimension());
    Traces<double>* actoreTraces = new Traces<double>();
    actoreTraces->push_back(actore);
    ActorOffPolicy<double>* actor = new ActorLambdaOffPolicy<double>(alpha_u, gamma, lambda, target, actoreTraces);

    Policy<double>* behavior = new RandomPolicy<double>(problem->getDiscreteActionList());
    OffPolicyControlLearner<double>* control = new OffPAC<double>(behavior, critic, actor, toStateAction, projector);

    RLAgent<double>* agent = new LearnerAgent<double>(control);
    Simulator<double>* sim = new Simulator<double>(agent, problem, 5000, 2000, 5);
    sim->setTestEpisodesAfterEachRun(true);
    //sim->setVerbose(false);
    sim->run();
    sim->computeValueFunction();

    // Persist the learned controller to disk, reset it, reload the saved
    // parameters, and evaluate the reloaded controller.
    control->persist("cgw_offpac.data");

    control->reset();
    control->resurrect("cgw_offpac.data");
    sim->runEvaluate(100);

    delete problem;
    delete hashing;
    delete projector;
    delete toStateAction;
    delete critice;
    delete critic;
    delete actore;
    delete actoreTraces;
    delete actor;
    delete behavior;
    delete target;
    delete control;
    delete agent;
    delete sim;

    return 0;
  }
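The final calls above persist the learned controller to cgw_offpac.data, reset it, resurrect the saved parameters, and then evaluate the restored controller for 100 episodes. This round trip illustrates how a behavior learned in simulation can be stored and later reloaded, for example on a robot.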

5  On-Policy Example

This section provides an in-depth guide to modeling on-policy learning scenarios with RLLib. We use the swinging pendulum (SwingPendulum) [Degris et al., 2012] as an illustration. The following example shows the setup protocol.

  // From the RLLib
  #include "Vector.h"
  #include "Trace.h"
  #include "Projector.h"
  #include "ControlAlgorithm.h"
  #include "Representation.h"

  // From the simulation
  #include "SwingPendulum.h"
  #include "RL.h"

  using namespace RLLib;

  int main()
  {
    // Fix the random seed, create the problem, and build a tile-coding
    // state-action representation.
    Probabilistic::srand(0);
    RLProblem<double>* problem = new SwingPendulum<double>;
    Hashing* hashing = new MurmurHashing;
    Projector<double>* projector = new TileCoderHashing<double>(1000, 10, false, hashing);
    StateToStateAction<double>* toStateAction = new StateActionTilings<double>(projector, problem->getContinuousActionList());

    // Step-size parameters, discount factor, and trace decay rate; alpha_r is
    // passed to the average-reward actor-critic below.
    double alpha_v = 0.1 / projector->vectorNorm();
    double alpha_u = 0.001 / projector->vectorNorm();
    double alpha_r = .0001;
    double gamma = 1.0;
    double lambda = 0.5;

    // TD(lambda) critic, and a normal policy distribution that is scaled from
    // the policy range onto the problem's action range.
    Trace<double>* critice = new ATrace<double>(projector->dimension());
    TDLambda<double>* critic = new TDLambda<double>(alpha_v, gamma, lambda, critice);

    PolicyDistribution<double>* policyDistribution = new NormalDistributionScaled<double>(0, 1.0, projector->dimension(), problem->getContinuousActionList());
    Range<double> policyRange(-2.0, 2.0);
    Range<double> problemRange(-2.0, 2.0);
    PolicyDistribution<double>* acting = new ScaledPolicyDistribution<double>(problem->getContinuousActionList(), policyDistribution, &policyRange, &problemRange);

    // On-policy actor with two eligibility traces, combined with the critic
    // into an average-reward actor-critic control learner.
    Trace<double>* actore1 = new ATrace<double>(projector->dimension());
    Trace<double>* actore2 = new ATrace<double>(projector->dimension());
    Traces<double>* actoreTraces = new Traces<double>();
    actoreTraces->push_back(actore1);
    actoreTraces->push_back(actore2);
    ActorOnPolicy<double>* actor = new ActorLambda<double>(alpha_u, gamma, lambda, acting, actoreTraces);

    OnPolicyControlLearner<double>* control = new AverageRewardActorCritic<double>(critic, actor, projector, toStateAction, alpha_r);

    RLAgent<double>* agent = new LearnerAgent<double>(control);
    Simulator<double>* sim = new Simulator<double>(agent, problem, 5000, 100, 10);
    sim->setVerbose(true);
    sim->run();

    sim->runEvaluate(100);
    sim->computeValueFunction();

    delete problem;
    delete hashing;
    delete projector;
    delete toStateAction;
    delete critice;
    delete critic;
    delete actore1;
    delete actore2;
    delete actoreTraces;
    delete actor;
    delete policyDistribution;
    delete acting;
    delete control;
    delete agent;
    delete sim;

    return 0;
  }
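Two details of this example are worth noting. The actor keeps two eligibility traces (actore1 and actore2), which we read as one trace per parameter vector of the normal policy (its mean and scale), in line with the continuous-action actor-critic of [Degris et al., 2012]; and the ScaledPolicyDistribution wraps the normal policy so that actions sampled in policyRange are mapped onto the problem's problemRange. Both readings are interpretations of the example rather than documented API semantics.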


References

[Abeyruwan et al., 2013] Saminda Abeyruwan, Andreas Seekircher, and Ubbo Visser. Dynamic Role Assignment using General Value Functions. In AAMAS 2013, ALA Workshop, 2013.
[Degris et al., 2012] Thomas Degris, Patrick M. Pilarski, and Richard S. Sutton. Model-Free Reinforcement Learning with Continuous Action in Practice. In American Control Conference, Montreal, Canada, June 2012. URL http://hal.inria.fr/hal-00764281.
[Maei, 2011] Hamid Reza Maei. Gradient Temporal-Difference Learning Algorithms. PhD thesis, University of Alberta, 2011.
[Maei and Sutton, 2010] Hamid Reza Maei and Richard S. Sutton. GQ(λ): A General Gradient Algorithm for Temporal-Difference Prediction Learning with Eligibility Traces. In Proceedings of the 3rd Conference on Artificial General Intelligence (AGI-10), pages 1-6, 2010.
[Maei et al., 2010] Hamid Reza Maei, Csaba Szepesvári, Shalabh Bhatnagar, and Richard S. Sutton. Toward Off-Policy Learning Control with Function Approximation. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pages 719-726, 2010.
[Sutton and Barto, 1998] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998. ISBN 0262193981.
[Sutton et al., 2011] Richard S. Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M. Pilarski, Adam White, and Doina Precup. Horde: A Scalable Real-Time Architecture for Learning Knowledge from Unsupervised Sensorimotor Interaction. In The 10th International Conference on Autonomous Agents and Multiagent Systems, AAMAS '11, pages 761-768. International Foundation for Autonomous Agents and Multiagent Systems, 2011. ISBN 0-9826571-6-1, 978-0-9826571-6-4.
[Thomas Degris, 2012] Thomas Degris, Martha White, and Richard S. Sutton. Off-Policy Actor-Critic. In Proceedings of the Twenty-Ninth International Conference on Machine Learning (ICML), 2012.

Footnotes:

1The latest release version of RLLib is v1.5. RLLib was released to the public on May 17, 2013. The v1.0 release notification is available at http://tinyurl.com/rllib-1-0.

