RLLib: C++ Template Library to Learn Behaviors and Represent Learnable Knowledge using On/Off Policy Reinforcement Learning
Saminda Abeyruwan saminda@cs.miami.edu
Abstract
RLLib is a lightweight C++ template library that implements incremental, standard, and gradient temporal-difference learning algorithms in Reinforcement Learning. It is an optimized library for robotic applications that operate under fast duty cycles (e.g., ≤ 30 ms). RLLib has been tested and evaluated on RoboCup 3D Soccer Simulation agents and physical NAO V4 humanoid robots to learn behaviors and represent learnable knowledge.
Note: This is a work-in-progress document. Last modified on 11/19/2013.
1 Introduction
The RoboCup initiative presents a dynamic, complex, adversarial, real-time, and stochastic multi-agent environment in which agents learn to solve different problems. Reinforcement Learning (RL) is a framework that provides a set of tools to design sophisticated and hard-to-engineer behaviors. RL agents naturally formalize their learning goals in two layers:
- physical layers: control related to walking, kicking, etc.; and
- decision layers: emergent behaviors through actions, options, or knowledge representations.
Therefore, the modern RL framework not only addresses control; it also provides means to represent highly expressive knowledge using General Value Functions (GVFs) learned from sensorimotor streams. RLLib is a lightweight C++ template library that implements incremental, standard, and gradient temporal-difference learning algorithms in RL to effectively solve goals defined in
physical and decision layers. It is designed and written specifically for robotic applications that operate under fast duty cycles. RLLib^{1} is released under the open source Apache License, Version 2.0, and it was first announced on MLOSS on September 16, 2013.
2 Features
- Off-policy prediction algorithms [GTD(λ) and GQ(λ)];
- Off-policy control algorithms [Greedy-GQ(λ), Softmax-GQ(λ), and Off-PAC (with provision to be used in an on-policy setting)];
- On-policy algorithms [TD(λ), SARSA(λ), Expected-SARSA(λ), and Actor-Critic (continuous and discrete actions, discounted and average reward settings, etc.)];
- Supervised learning algorithms [Adaline, IDBD, SemiLinearIDBD, and Autostep];
- Policies [Random, RandomXPercentBias, Greedy, Epsilon-greedy, Boltzmann, Normal, and Softmax];
- Sparse feature extractors (e.g., tile coding) with pluggable hash functions;
- An efficient implementation of the dot product for sparse-coder-based feature representations;
- Benchmark environments [Mountain Car, Mountain Car 3D, Swinging Pendulum, Helicopter, and Continuous Grid World];
- Optimized for very fast duty cycles (e.g., by culling traces); RLLib is tested on the RoboCup 3D simulator agent and on the physical NAO V4 robot (cognition thread);
- A framework to represent highly expressive learnable knowledge representations in RL using GVFs;
- A framework to design complex behaviors and controllers using topological graphs;
- A framework to visualize benchmark problems; and
- A plethora of examples demonstrating on-policy and off-policy control experiments.
More information is available in [Abeyruwan et al., 2013].
3 Reinforcement Learning
The standard RL framework [Sutton and Barto, 1998] models the interactions of an AI agent and an environment in discrete time steps t = 1, 2, 3, …. At each time step the agent senses the state of the world, S_{t} ∈ S, and selects an action A_{t} ∈ A. One time step later the agent receives a scalar reward R_{t+1} ∈ R and senses the state S_{t+1} ∈ S. The rewards are generated according to a reward function r: S_{t+1} → R. The objective of the standard RL framework is to learn a stochastic action-selection policy π: S × A → [0,1] that gives the probability of selecting each action in each state, π(s, a) = π(a | s) = P(A_{t} = a | S_{t} = s), such that the agent maximizes the rewards summed over the time steps.
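This sense-act-learn interaction can be pictured as a simple loop. The following minimal C++ sketch illustrates the protocol only; the Environment, Agent, and Step types here are hypothetical placeholders for illustration and are not RLLib classes (Sections 4 and 5 show the actual library usage).
// A minimal sketch of the discrete-time agent-environment interaction loop.
// The types below are illustrative only; they are not part of RLLib.
#include <vector>
struct Step
{
  std::vector<double> state; // S_{t+1}
  double reward;             // R_{t+1}
  bool terminal;             // did the episode end?
};
struct Environment
{
  virtual std::vector<double> start() = 0;   // returns the initial state
  virtual Step step(int action) = 0;         // applies A_t, returns R_{t+1} and S_{t+1}
  virtual ~Environment() {}
};
struct Agent
{
  virtual int act(const std::vector<double>& state) = 0;  // samples A_t from pi(. | S_t)
  virtual void learn(const std::vector<double>& state, int action, double reward,
                     const std::vector<double>& nextState) = 0;
  virtual ~Agent() {}
};
void runEpisode(Environment& env, Agent& agent)
{
  std::vector<double> state = env.start();
  for (;;)
  {
    int action = agent.act(state);   // select A_t
    Step next = env.step(action);    // observe R_{t+1} and S_{t+1}
    agent.learn(state, action, next.reward, next.state);
    if (next.terminal)
      break;
    state = next.state;
  }
}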
Recently, within the context of the RL framework, a knowledge representation language has been introduced that is expressive and learnable from sensorimotor data. This representation is directly usable in robotic soccer, as agent-environment interactions are conducted through perceptors and actuators. In this approach, knowledge is represented as a large number of approximate value functions, each with its:
- own policy;
- pseudo-reward function;
- pseudo-termination function; and
- pseudo-terminal-reward function.
The interpretation of the approximate value function as a knowledge representation language grounded in information from perceptors and actuators is defined as:
The knowledge expressed as an approximate value function is true or accurate if its numerical values match those of the mathematically defined value function it is approximating.
Therefore, a value function asks a question, and an approximate value function provides an answer to that question. Based on this interpretation, the standard RL framework is extended to represent learnable knowledge as follows. First, it is extended to include a terminal-reward function, z: S → R, where z(s) is the terminal reward received when termination occurs in state s. In the RL framework, γ ∈ [0,1) is used to discount delayed rewards. Another interpretation of the discount factor is that, with constant probability 1 − γ, the process terminates upon arrival at a state with zero terminal reward. This factor is generalized to a termination function γ: S → [0,1], where 1 − γ(s) is the probability of termination at state s, at which point a terminal reward z(s) is generated.
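To see why this interpretation recovers the standard discounted return, consider, only for the purposes of this illustration, a constant termination probability 1 − γ at every state arrival and a zero terminal reward, z ≡ 0. The transient reward r(S_{k}) is then collected only if no termination occurred at S_{t+1}, …, S_{k−1}, which happens with probability γ^{k−t−1}. Hence, in expectation over the termination time T, the return G_{t} defined formally in subsection 3.1 below reduces to
E[G_{t}] = ∑_{k=t+1}^{∞} γ^{k−t−1} E[r(S_{k})],
the familiar discounted sum of rewards.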
In continuous state or action spaces, approximate value functions are learned using function approximation, with efficient on-policy or off-policy learning algorithms in RL. We briefly introduce some of the important concepts related to GVFs; complete information about GVFs is available in [Sutton et al., 2011, Maei and Sutton, 2010, Maei et al., 2010, Maei, 2011].
3.1 Off-Policy Action-Value Methods for GVFs
The first method to learn about GVFs, from off-policy experiences, is to use action-value functions. Let G_{t} be the complete return from state S_{t} at time t; then the sum of the rewards (transient plus terminal) until termination at time T is:
G_{t} = ∑_{k=t+1}^{T} r(S_{k}) + z(S_{T}).
The action-value function is:
Q^{π}(s,a) = E(G_{t} | S_{t} = s, A_{t} = a, A_{t+1:T−1} ∼ π, T ∼ γ),
where Q^{π}: S × A → R. This is the expected return for a trajectory that starts from state s and action a and thereafter selects actions according to the policy π, until termination occurs according to γ. We approximate the action-value function with Q̂: S × A → R. Therefore, the action-value function is a precise, grounded question, while the approximate action-value function offers the numerical answer.
The GVFs are defined over four functions: π, γ, r, and z. The functions r and z act as pseudo-reward and pseudo-terminal-reward functions, respectively. The function γ is also in pseudo form. However, the γ function is more substantive than the reward functions, as termination interrupts the normal flow of state transitions. In a pseudo-termination, the standard termination is omitted. In robotic soccer, for example, the base problem can be defined as the time until a goal is scored by either the home or the opponent team, and we can consider that a pseudo-termination has occurred whenever the striker is changed.
The GVF with respect to a state-action function is defined as:
Q^{π,γ,r,z}(s,a) = E(G_{t} | S_{t} = s, A_{t} = a, A_{t+1:T−1} ∼ π, T ∼ γ).
The four functions, π, γ, r, and z, are the question functions of the GVFs, which in turn define the general value function's semantics. The RL agent learns an approximate action-value function, Q̂, using these four auxiliary functions. We assume that the state space is continuous and the action space is discrete. We approximate the action-value function using a linear function approximator. We use a feature extractor ϕ: S_{t} × A_{t} → {0,1}^{N}, N ∈ N, built on tile coding [Sutton and Barto, 1998] to generate feature vectors from state variables and actions. This is a sparse vector with a constant number of "1" features and, hence, a constant norm. In addition, tile coding has the key advantages of enabling real-time learning and computationally efficient algorithms for learning approximate value functions. In linear function approximation, there exists a weight vector, θ ∈ R^{N}, N ∈ N, to be learned. Therefore, the approximate GVF is defined as:
Q̂(s, a, θ) = θ^{T} ϕ(s, a),
such that Q̂: S × A × R^{N} → R. The weights are learned using gradient-descent temporal-difference algorithms [Maei, 2011]. These algorithms learn stably and efficiently using linear function approximation from off-policy experiences. Off-policy experiences are generated from a behavior policy, π_{b}, that is different from the policy being learned about, named the target policy, π. Therefore, one could learn multiple target policies from the same behavior policy.
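Because ϕ(s, a) is a sparse binary vector, Q̂(s, a, θ) = θ^{T} ϕ(s, a) reduces to summing the weights at the indices of the active tiles, and a gradient update touches only those components. The following minimal C++ sketch illustrates this idea; it assumes the active tile indices are available as a plain index list and is an illustration, not RLLib's internal implementation.
// Illustrative sketch only; not RLLib's internal implementation.
#include <cstddef>
#include <vector>
// Q(s, a) = theta^T phi(s, a) when phi is binary and sparse:
// sum the weights at the active tile indices.
double linearQ(const std::vector<double>& theta,
               const std::vector<std::size_t>& activeTiles)
{
  double q = 0.0;
  for (std::size_t i = 0; i < activeTiles.size(); ++i)
    q += theta[activeTiles[i]];
  return q;
}
// A gradient-descent style update, e.g. theta += alpha * delta * phi(s, a),
// only needs to touch the active components.
void updateActive(std::vector<double>& theta,
                  const std::vector<std::size_t>& activeTiles,
                  double alphaTimesDelta)
{
  for (std::size_t i = 0; i < activeTiles.size(); ++i)
    theta[activeTiles[i]] += alphaTimesDelta;
}
Since tile coding activates a constant number of features per state-action pair, both operations take constant time per step, which is what makes the fast duty cycles targeted by RLLib feasible.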
3.2 Off-Policy Policy Gradient Methods for GVFs
The second method to learn about GVFs is to use off-policy policy gradient methods with actor-critic architectures that use a state-value function suitable for learning GVFs. It is defined as:
V^{π,γ,r,z}(s) = E(G_{t} | S_{t} = s, A_{t:T−1} ∼ π, T ∼ γ),
where V^{π,γ,r,z}(s) is the true state-value function, and the approximate GVF is defined as:
V̂(s, v) = v^{T} ϕ(s),
for a learned weight vector v, where the functions π, γ, r, and z are defined as in subsection 3.1. Since our target policy π is discrete and stochastic, we use a Gibbs distribution of the form:
π(a | s) = e^{u^{T} ϕ(s,a)} / ∑_{b} e^{u^{T} ϕ(s,b)},
where ϕ(s,a) are state-action features for state s and action a, which are in general unrelated to the state features ϕ(s) that are used in state-value function approximation, and u ∈ R^{N_{u}}, N_{u} ∈ N, is a weight vector, which is modified by the actor to learn the stochastic target policy. The log-gradient of the policy at state s and action a is:
∇_{u} π(a | s) / π(a | s) = ϕ(s,a) − ∑_{b} π(b | s) ϕ(s,b).
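For a discrete action set, the Gibbs policy and its log-gradient above can be computed directly from the action preferences u^{T} ϕ(s, a). The following minimal C++ sketch illustrates the computation; it uses dense feature vectors for clarity (RLLib works with sparse representations) and is an illustration rather than the library's implementation.
// Illustrative sketch only; not RLLib's implementation.
#include <cmath>
#include <cstddef>
#include <vector>
// pi(a | s) = exp(u^T phi(s, a)) / sum_b exp(u^T phi(s, b)).
// phi[b] holds the (dense, for clarity) state-action features of action b.
std::vector<double> gibbsPolicy(const std::vector<double>& u,
                                const std::vector<std::vector<double> >& phi)
{
  std::vector<double> pi(phi.size(), 0.0);
  double normalizer = 0.0;
  for (std::size_t b = 0; b < phi.size(); ++b)
  {
    double preference = 0.0; // u^T phi(s, b)
    for (std::size_t i = 0; i < u.size(); ++i)
      preference += u[i] * phi[b][i];
    pi[b] = std::exp(preference);
    normalizer += pi[b];
  }
  for (std::size_t b = 0; b < pi.size(); ++b)
    pi[b] /= normalizer; // normalize to a probability distribution
  return pi;
}
// grad_u log pi(a | s) = phi(s, a) - sum_b pi(b | s) phi(s, b).
std::vector<double> logGradient(std::size_t a,
                                const std::vector<double>& pi,
                                const std::vector<std::vector<double> >& phi)
{
  std::vector<double> grad(phi[a]); // start from phi(s, a)
  for (std::size_t b = 0; b < phi.size(); ++b)
    for (std::size_t i = 0; i < grad.size(); ++i)
      grad[i] -= pi[b] * phi[b][i];
  return grad;
}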

To summarize, the definitions of the question functions and the answer functions are given as:
The question functions are defined by:
- π: S_{t} × A_{t} → [0, 1] (target policy is greedy w.r.t. the learned value function);
- γ: S_{t} → [0, 1] (termination function);
- r: S_{t+1} → R (transient reward function); and
- z: S_{t+1} → R (terminal reward function).
The answer functions are defined by:
- π_{b}: S_{t} × A_{t} → [0, 1] (behavior policy);
- I_{t}: S_{t} × A_{t} → [0, 1] (interest function);
- ϕ: S_{t} × A_{t} → R^{N} (feature-vector function); and
- λ: S_{t} → [0, 1] (eligibility-trace decay-rate function).
4 Off-Policy Example
This section provides an in-depth guide to modeling off-policy learning scenarios using RLLib. We use the continuous gridworld problem [Thomas Degris, 2012] as an illustration. The following example shows the setup protocol.
// From the RLLib
#include "Vector.h"
#include "Trace.h"
#include "Projector.h"
#include "ControlAlgorithm.h"
#include "Representation.h"
// From the simulation
#include "ContinuousGridworld.h"
#include "RL.h"
using namespace RLLib;
int main()
{
Probabilistic::srand(0);
RLProblem<double>* problem = new ContinuousGridworld<double>;
Hashing* hashing = new MurmurHashing;
Projector<double>* projector = new TileCoderHashing<double>(1000000, 10, true, hashing);
StateToStateAction<double>* toStateAction = new StateActionTilings<double>(projector, problem->getDiscreteActionList());
// Step-size parameters, normalized by the feature-vector norm
double alpha_v = 0.1 / projector->vectorNorm();
double alpha_w = 0.0001 / projector->vectorNorm();
double gamma = 0.99;
double lambda = 0.4;
double alpha_u = 0.001 / projector->vectorNorm();
// GTD(lambda) critic with an eligibility trace
Trace<double>* critice = new ATrace<double>(projector->dimension());
GTDLambda<double>* critic = new GTDLambda<double>(alpha_v, alpha_w, gamma, lambda, critice);
// Target policy (Boltzmann/Gibbs distribution) and off-policy actor
PolicyDistribution<double>* target = new BoltzmannDistribution<double>(projector->dimension(), problem->getDiscreteActionList());
Trace<double>* actore = new ATrace<double>(projector->dimension());
Traces<double>* actoreTraces = new Traces<double>();
actoreTraces->push_back(actore);
ActorOffPolicy<double>* actor = new ActorLambdaOffPolicy<double>(alpha_u, gamma, lambda, target, actoreTraces);
// Behavior policy and the Off-PAC control learner
Policy<double>* behavior = new RandomPolicy<double>(problem->getDiscreteActionList());
OffPolicyControlLearner<double>* control = new OffPAC<double>(behavior, critic, actor, toStateAction, projector);
RLAgent<double>* agent = new LearnerAgent<double>(control);
Simulator<double>* sim = new Simulator<double>(agent, problem, 5000, 2000, 5);
sim->setTestEpisodesAfterEachRun(true);
//sim->setVerbose(false);
sim->run();
sim->computeValueFunction();
// Save the learned representation, reload it, and evaluate
control->persist("cgw_offpac.data");
control->reset();
control->resurrect("cgw_offpac.data");
sim->runEvaluate(100);
delete problem;
delete hashing;
delete projector;
delete toStateAction;
delete critice;
delete critic;
delete actore;
delete actoreTraces;
delete actor;
delete behavior;
delete target;
delete control;
delete agent;
delete sim;
return 0;
}
5 On-Policy Example
This section provides an in-depth guide to modeling on-policy learning scenarios using RLLib. We use the swinging pendulum problem [Degris et al., 2012] as an illustration. The following example shows the setup protocol.
// From the RLLib
#include "Vector.h"
#include "Trace.h"
#include "Projector.h"
#include "ControlAlgorithm.h"
#include "Representation.h"
// From the simulation
#include "SwingPendulum.h"
#include "RL.h"
using namespace RLLib;
int main()
{
Probabilistic::srand(0);
RLProblem<double>* problem = new SwingPendulum<double>;
Hashing* hashing = new MurmurHashing;
Projector<double>* projector = new TileCoderHashing<double>(1000, 10, false, hashing);
StateToStateAction<double>* toStateAction = new StateActionTilings<double>(projector, problem->getContinuousActionList());
// Step-size parameters, normalized by the feature-vector norm
double alpha_v = 0.1 / projector->vectorNorm();
double alpha_u = 0.001 / projector->vectorNorm();
double alpha_r = .0001;
double gamma = 1.0;
double lambda = 0.5;
// TD(lambda) critic with an eligibility trace
Trace<double>* critice = new ATrace<double>(projector->dimension());
TDLambda<double>* critic = new TDLambda<double>(alpha_v, gamma, lambda, critice);
// Gaussian policy over the continuous action, scaled to the problem's action range
PolicyDistribution<double>* policyDistribution = new NormalDistributionScaled<double>(0, 1.0, projector->dimension(), problem->getContinuousActionList());
Range<double> policyRange(-2.0, 2.0);
Range<double> problemRange(-2.0, 2.0);
PolicyDistribution<double>* acting = new ScaledPolicyDistribution<double>(problem->getContinuousActionList(), policyDistribution, &policyRange, &problemRange);
// On-policy actor with eligibility traces
Trace<double>* actore1 = new ATrace<double>(projector->dimension());
Trace<double>* actore2 = new ATrace<double>(projector->dimension());
Traces<double>* actoreTraces = new Traces<double>();
actoreTraces->push_back(actore1);
actoreTraces->push_back(actore2);
ActorOnPolicy<double>* actor = new ActorLambda<double>(alpha_u, gamma, lambda, acting, actoreTraces);
// Average-reward actor-critic control learner
OnPolicyControlLearner<double>* control = new AverageRewardActorCritic<double>(critic, actor, projector, toStateAction, alpha_r);
RLAgent<double>* agent = new LearnerAgent<double>(control);
Simulator<double>* sim = new Simulator<double>(agent, problem, 5000, 100, 10);
sim->setVerbose(true);
sim->run();
sim->runEvaluate(100);
sim->computeValueFunction();
delete problem;
delete hashing;
delete projector;
delete toStateAction;
delete critice;
delete critic;
delete actore1;
delete actore2;
delete actoreTraces;
delete actor;
delete policyDistribution;
delete acting;
delete control;
delete agent;
delete sim;
return 0;
}
References
Saminda Abeyruwan, Andreas Seekircher, and Ubbo Visser. Dynamic Role Assignment using General Value Functions. In AAMAS 2013, ALA Workshop, 2013.
Thomas Degris, Patrick M. Pilarski, and Richard S. Sutton. Model-Free Reinforcement Learning with Continuous Action in Practice. In American Control Conference, Montreal, Canada, June 2012. URL http://hal.inria.fr/hal-00764281.
Hamid Reza Maei. Gradient Temporal-Difference Learning Algorithms. PhD Thesis, University of Alberta, 2011.
Hamid Reza Maei and Richard S. Sutton. GQ(λ): A General Gradient Algorithm for Temporal-Difference Prediction Learning with Eligibility Traces. In Proceedings of the 3rd Conference on Artificial General Intelligence (AGI-10), pages 1-6, 2010.
Hamid Reza Maei, Csaba Szepesvári, Shalabh Bhatnagar, and Richard S. Sutton. Toward Off-Policy Learning Control with Function Approximation. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pages 719-726, 2010.
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998. ISBN 0262193981.
Richard S. Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M. Pilarski, Adam White, and Doina Precup. Horde: A Scalable Real-Time Architecture for Learning Knowledge from Unsupervised Sensorimotor Interaction. In The 10th International Conference on Autonomous Agents and Multiagent Systems, AAMAS '11, pages 761-768. International Foundation for Autonomous Agents and Multiagent Systems, 2011. ISBN 0982657161, 9780982657164.
Thomas Degris, Martha White, and Richard S. Sutton. Off-Policy Actor-Critic. In Proceedings of the Twenty-Ninth International Conference on Machine Learning (ICML), 2012.
Footnotes:
^{1} The latest release version of RLLib is v1.5. RLLib was released to the public on May 17, 2013. The v1.0 release notification is available at http://tinyurl.com/rllib10.