Wednesday, December 18, 2013

Reinforcement learning in Scala 2: Model

This post is the second installment of our introduction to reinforcement learning using temporal difference. It is dedicated to the ubiquitous Q-learning algorithm.

Overview
The previous post, Reinforcement Learning in Scala 1: States & Policies, introduced the concepts of reinforcement learning and temporal difference, and more specifically the Q-learning algorithm. The following components were implemented:
  • States QLState and their associated actions QLAction
  • States space QLSpace
  • Policy QLPolicy as a container for the tuples {reward, Q-value, probability} of the Q-learning model
The last two components needed to complete the implementation of Q-learning are the model and the training algorithm.

Model and Training
The first step is to define a model for the reinforcement learning. A model is created during training and is composed of:
  • The best policy to transition from any initial state to a goal state
  • A coverage ratio, defined as the percentage of training cycles that reach the goal (or one of the goals)

class QLModel[T](val bestPolicy: QLPolicy[T], val coverage: Double)

The QLearning class takes three arguments (a sketch of the configuration class is given below):
  • A set of configuration parameters config
  • The search/states space qlSpace
  • The initial policy associated with the states (rewards and probabilities) qlPolicy

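The configuration class QLConfig is not listed in this post. The following sketch is inferred solely from the fields referenced by the code below; the actual class in the source code may define additional parameters.

// Sketch of the configuration class, inferred from the fields used in this post
case class QLConfig(
  alpha: Double,        // learning rate
  gamma: Double,        // discount rate
  episodeLength: Int,   // maximum number of state transitions per episode
  numEpisodes: Int,     // number of training episodes (epochs)
  minCoverage: Double   // minimum ratio of episodes that must reach a goal
)
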
class QLearning[T](
   config: QLConfig, 
   qlSpace: QLSpace[T], 
   qlPolicy: QLPolicy[T]) {

    // Model generated by the Q-learning training
  val model: Option[QLModel[T]] = train
    
    // Generate a model through multi-episode training (implemented below)
  def train: Option[QLModel[T]] = ???
  private def train(r: Random): Boolean = ???

   // Predict a state as a destination of the current
   // state, given a model (implemented below)
  def predict: PartialFunction[QLState[T], QLState[T]] = ???

  // Select the next state and action index (implemented below)
  def nextState(st: (QLState[T], Int)): (QLState[T], Int) = ???
}

The model of type Option[QLModel] (line 7) is created by the method train (line 10). Its value is None if training failed.

The training method train consists of executing config.numEpisodes training cycles, or episodes, each of which is a sequence of state transitions (line 5). The random generator r is used to initialize the search space.

def train: Option[QLModel[T]] = {
  val r = new Random(System.currentTimeMillis)

  Try {
    val completions = Range(0, config.numEpisodes).filter(_ => train(r))

    val coverage = completions.size.toDouble/config.numEpisodes
    if( coverage > config.minCoverage )
       Some(new QLModel[T](qlPolicy, coverage))
    else 
       None
  }.toOption.flatten
}

The training process exits with a model only if the minimum coverage config.minCoverage (the ratio of episodes for which a goal state is reached) is met (line 8); otherwise train returns None.

The method train(r: scala.util.Random) uses tail recursion to transition from a random initial state to one of the goal states. The tail recursion is implemented by the search method (line 4), which applies the recursive temporal difference formula introduced in the previous post, Reinforcement Learning in Scala 1: States & Policies (lines 14-18).
For each state transition, the candidate state whose action yields the highest reward R under the policy qlPolicy is selected (line 10). The Q-value of the current policy is then updated through qlPolicy.setQ before the process is repeated for the next state through recursion (line 21).

def train(r: Random): Boolean = {
   
  @scala.annotation.tailrec
  def search(st: (QLState[T], Int)): (QLState[T], Int) = {
    val states = qlSpace.nextStates(st._1)
    if( states.isEmpty || st._2 >= config.episodeLength ) 
        (st._1, -1)
    
    else {
      val state = states.maxBy(s => qlPolicy.R(st._1.id, s.id))
      if( qlSpace.isGoal(state) )
          (state, st._2)
      else {
        val reward = qlPolicy.R(st._1.id, state.id)
        val q = qlPolicy.Q(st._1.id, state.id)
        // Q-learning update: Q += alpha*(r + gamma*maxQ(s') - Q)
        val deltaQ = reward + config.gamma*qlSpace.maxQ(state, qlPolicy) - q
        val nq = q + config.alpha*deltaQ
        
        qlPolicy.setQ(st._1.id, state.id, nq)
        search((state, st._2+1))
       }
     }
  } 
   
  r.setSeed(System.currentTimeMillis*Random.nextInt)

  val finalState = search((qlSpace.init(r), 0))
  if( finalState._2 != -1 ) 
    qlSpace.isGoal(finalState._1) 
  else 
    false
}

Note: There is no guarantee that one of the goal states is reached from a randomly chosen initial state. It is expected that some of the training epochs fail, which is why monitoring the coverage is critical. You may obviously choose a deterministic approach to the initialization of each training epoch, by picking any state other than the goal state(s) as the starting state.
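
As an illustration of that deterministic alternative, one could override QLSpace.init to cycle through the non-goal states. This is only a sketch, not part of the original implementation:

import scala.util.Random

// Sketch: a search space whose initialization cycles deterministically
// through the non-goal states instead of drawing a random state
class DeterministicQLSpace[T](states: Array[QLState[T]], goalIds: Array[Int])
    extends QLSpace[T](states, goalIds) {

  private[this] var epoch = -1
  private[this] val nonGoals = states.filterNot(s => goalIds.contains(s.id))

  // Ignores the random generator and picks the next non-goal state in sequence
  override def init(r: Random): QLState[T] = {
    epoch += 1
    nonGoals(epoch % nonGoals.length)
  }
}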


Prediction
Once trained, the model is used to predict the next state with the highest estimate (Q-value weighted by probability), given an existing state. The prediction is implemented as a partial function.

def predict: PartialFunction[QLState[T], QLState[T]] = {
  case state: QLState[T] if model.isDefined => 
    if( state.isGoal ) state else nextState((state, 0))._1
}

The method nextState does the heavy lifting. It retrieves the list of states that can be reached from the current state st through its actions (line 2). The most rewarding next state qState is computed using the reward matrix R of the best policy of the QLearning model (lines 6-8).

def nextState(st: (QLState[T], Int)): (QLState[T], Int) =  {
  val states = qlSpace.nextStates(st._1)
  if( states.isEmpty || st._2 >= config.episodeLength) 
    st
  else {
    val qState = states.maxBy(
     s => model.map(_.bestPolicy.R(st._1.id, s.id))
           .getOrElse(-1.0)
    )

    nextState( (qState, st._2+1))
  }
}
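
To see how the pieces fit together, here is a hypothetical end-to-end sketch. It assumes the QLConfig sketch given earlier and the method bodies listed in this post assembled into the QLearning class; the 4-state chain, its rewards and the Int property are made up for illustration.

import scala.util.Random

// A 4-state chain 0 -> 1 -> 2 -> 3, where state 3 is the only goal
val states = Array.tabulate(4)(n =>
  new QLState[Int](n, if( n < 3 ) List(QLAction[Int](n, n + 1)) else List.empty, n)
)
val space = new QLSpace[Int](states, Array(3))

// Reward transitions into the goal state more than the others
val input = for {
  i <- 0 until 4
  j <- 0 until 4 if i != j
} yield QLInput(i, j, reward = if( j == 3 ) 10.0 else 1.0)
val policy = new QLPolicy[Int](4, input.toArray)

val config = QLConfig(alpha = 0.5, gamma = 0.9, episodeLength = 20,
  numEpisodes = 100, minCoverage = 0.3)
val qLearning = new QLearning[Int](config, space, policy)

qLearning.model match {
  case Some(m) => println(s"training coverage: ${m.coverage}")
  case None    => println("training did not reach the required coverage")
}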


References
Online Value Function Determination for Reinforcement Learning - J. Laird, N. Debersky, N. Tinkerhess
https://github.com/prnicolas
Scala for Machine Learning, Chapter 11: Reinforcement Learning / Q-Learning - P. Nicolas - Packt Publishing, 2014

Tuesday, December 10, 2013

Reinforcement Learning in Scala 1: States & Policies

This post describes a very common reinforcement learning methodology, the temporal difference update, as implemented in Scala. This first section introduces and implements the concepts of states and policies.

Overview
There are many different approaches to implementing reinforcement learning. One of the most commonly used methods is searching the value function space using the temporal difference method.
All known reinforcement learning methods share the same objective of solving sequential decision tasks. In a sequential decision task, an agent interacts with a dynamic system by selecting actions that affect the transition between states in order to optimize a given reward function.

At any given step i, the agent selects an action a(i) on the current state s(i). The dynamic system responds by rewarding the agent for its optimal selection of the next state: \[s_{i+1}=V(s_{i})\]
The learning agent infers the policy that maps the set of states {s} to the set of available actions {a}, using a value function \[V(s_{i})\] The policy is defined as \[\pi :\,\{s_{i}\} \mapsto \{a_{i}\} \left \{ s_{i}|s_{i+1}=V(s_{i}) \right \}\]


Temporal Difference
The most common approach to learning a value function V is to use the temporal difference method (TD). The method uses observations of prediction differences from consecutive states, s(i) and s(i+1). If we denote by r the reward for selecting an action from state s(i) to s(i+1) and by η the learning rate, then the value V is updated as \[V(s_{i})\leftarrow V(s_{i})+\eta .(V(s_{i+1}) -V(s_{i}) + r_{i})\]
Therefore the goal of the temporal difference method is to learn the value function for the optimal policy. The 'action-value' function represents the expected value of action a on state s and is defined as \[Q(s_{i},a_{i}) = r(s_{i}) + V(s_{i})\] where r is the reward value for the state.

On-Policy vs. Off-Policy
The temporal difference method relies on an estimate of the final reward, computed for each state. There are two families of temporal difference algorithms, On-Policy and Off-Policy:
  • The On-Policy method learns the value of the policy used to make the decisions. The value function is derived from the execution of actions using the same policy, but based on history
  • The Off-Policy method learns a potentially different policy. Therefore the estimate is computed using actions that have not been executed yet
The most common formula for the temporal difference approach is the Q-learning formula. It introduces the concept of a discount rate to reduce the impact of the first few states on the optimization of the policy, and it does not need a model of its environment. The exploitation of the action-value approach consists of selecting the next state by computing the action with the maximum reward, whereas the exploration approach focuses on the total anticipated reward. The update equation for Q-learning is \[Q(s_{i},a_{i}) \leftarrow Q(s_{i},a_{i}) + \eta .(r_{i+1} +\alpha .\max_{a_{i+1}}Q(s_{i+1},a_{i+1}) - Q(s_{i},a_{i}))\] \[Q(s_{i},a_{i}): \mathrm{expected\,value\,of\,action\,a\,on\,state\,s}\,\,\eta : \mathrm{learning\,rate}\,\,\alpha : \mathrm{discount\,rate}\]
One of the most commonly used On-Policy methods is Sarsa, which does not necessarily select the action that offers the most value. Its update equation is defined as \[Q(s_{i},a_{i}) \leftarrow Q(s_{i},a_{i}) + \eta .(r_{i+1} +\alpha .Q(s_{i+1},a_{i+1}) - Q(s_{i},a_{i}))\]
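
As a side note, and independently of the library built in these posts, the Q-learning update for a single observed transition can be written as a pure Scala function, with eta denoting the learning rate and alpha the discount rate, as in the notation above:

// Q-learning update for a single observed transition
// q: current Q(s, a), maxNextQ: max over a' of Q(s', a')
def updateQ(q: Double, reward: Double, maxNextQ: Double,
    eta: Double, alpha: Double): Double =
  q + eta*(reward + alpha*maxNextQ - q)
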
States and Actions
Functional languages such as Scala are particularly suitable for this kind of iterative computation. We use Scala for the implementation of the temporal difference algorithm and allow the user to specify any variant of the learning formula, using local functions or closures.
First, we have to define a state class, QLState (line 1), that contains the list of actions of type QLAction (line 3) that can be executed from this state. The only purpose of this class is to connect a list of actions to a source state. The parameterized class argument property (line 4) is used to "attach" some extra characteristics to this state.

class QLState[T](
  val id: Int, 
  val actions: List[QLAction[T]] = List.empty, 
  property: T) {
    
  @inline
  def isGoal: Boolean = actions.isEmpty
}

As described in the introduction, an action of class QLAction has a source state from and a destination state to (the state reached by following the action). A state, except for a goal state, has one or more actions, but an action has only one destination (or resulting) state.

case class QLAction[T](from: Int, to: Int)
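
For instance, a non-goal state carrying an Int payload, with two outgoing actions, could be created as follows (a throwaway example, not part of the post's listings):

val s0 = new QLState[Int](
  id = 0,
  actions = List(QLAction[Int](0, 1), QLAction[Int](0, 2)),
  property = 0
)
assert( !s0.isGoal )   // a state with outgoing actions is not a goal state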

The states and actions can be loaded, generated and managed by a directed graph, or search space, of type QLSpace. The search space contains the list of all the states available to the agent.
One or more of these states can be selected as goals. The algorithm does not restrict the agent to a single goal state: the process ends when any one of the goal states is reached (OR logic). The algorithm does not support combined goals (AND logic).

Let's implement the basic components of the search space QLSpace. The class lists all available states (line 2) and one or more final or goal states goalIds (line 3). Although you might expect the search space to contain a single final or goal state, it is not uncommon for online training to use more than one goal state.

class QLSpace[T](
   states: Array[QLState[T]], 
   goalIds: Array[Int]) {

    // Indexed map of states 
  val statesMap: immutable.Map[Int, QLState[T]] = 
    states.map(st => (st.id, st)).toMap
    // Set of one or more goal states
  val goalStates = new immutable.HashSet[Int]() ++ goalIds
 
    // Compute the maximum Q value for a given state and policy
  def maxQ(st: QLState[T], policy: QLPolicy[T]): Double = { 
    val best = states.filter( _ != st)
       .maxBy(_st => policy.EQ(st.id, _st.id))
    policy.EQ(st.id, best.id)
  }
 
    // Retrieve the states that are destinations of the state st
  def nextStates(st: QLState[T]): List[QLState[T]] =
     st.actions.map(ac => statesMap(ac.to))
 
  def init(r: Random): QLState[T] = 
    states(r.nextInt(states.size))

  def isGoal(st: QLState[T]): Boolean = goalStates.contains(st.id)
}

A hash map statesMap maintains a dictionary of all the possible states, with the state id as the unique key (lines 6, 7). The class QLSpace has three important methods (a short construction example follows):
  • init initializes the search with a random state for each training epoch (lines 22, 23)
  • nextStates returns the list of destination states associated with the state st (lines 19, 20)
  • maxQ returns the maximum Q-value for the state st given the current policy policy (lines 12-15). The method filters the current state out of the search for the next best action, then computes the maximum reward, or Q(state, action) value, according to the given policy
The convenience method isGoal simply tests whether a state belongs to the set of goal states.
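
Here is a minimal, hypothetical construction of a search space with three chained states, where state 2 is the only goal; the String property is arbitrary:

import scala.util.Random

val states = Array.tabulate(3)(n =>
  new QLState[String](
    n,
    if( n < 2 ) List(QLAction[String](n, n + 1)) else List.empty,
    s"state-$n")
)
val space = new QLSpace[String](states, Array(2))

val start = space.init(new Random(42L))
println( space.nextStates(start).map(_.id) )  // ids of the states reachable from start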

The next step is to define a policy.

Learning Policy
A policy is defined by three components:
  • A reward, collected after transitioning from one state to another (line 2). The reward is provided by the user
  • A Q(State, Action) value, associated with a transition state and an action (line 4)
  • A probability (with a default value of 1.0) that models the obstacles, or hindrances, to migrating from one state to another (line 3)
The estimate combines the Q-value (incentive to move to the best next state) and the probability (hindrance to moving to any particular state) (line 7).

class QLData {
  var reward: Double = 1.0
  var probability: Double = 1.0
  var value: Double = 0.0
  
  @inline
  final def estimate: Double = value*probability
}

The policy of type QLPolicy is a container for the state transition attributes, rewards, Q-values and probabilities.

class QLPolicy[T](numStates: Int, input: Array[QLInput]) {
 
  val qlData = {
    val data = Array.tabulate(numStates)(
      _ => Array.fill(numStates)(new QLData)
    )
 
    input.foreach(in => {  
      data(in.from)(in.to).reward = in.reward
      data(in.from)(in.to).probability = in.prob
    })
    data
  }
  
  def setQ(from: Int, to: Int, value: Double): Unit =
     qlData(from)(to).value = value
 
    // Accessors used by QLSpace.maxQ and the training loop
  def Q(from: Int, to: Int): Double = qlData(from)(to).value
  def R(from: Int, to: Int): Double = qlData(from)(to).reward
  def EQ(from: Int, to: Int): Double = qlData(from)(to).estimate
}

The constructor for QLPolicy takes two arguments:
  • The number of states numStates (line 1)
  • A sequence of inputs of type QLInput to the policy
The constructor creates a numStates x numStates matrix of transitions of type QLData (lines 3-12) from the input.

The type QLInput wraps the input data (the index of the source state from, the index of the destination state to, and the reward and probability associated with the state transition) into a single convenient class.

case class QLInput(
   from: Int, 
   to: Int, 
   reward: Double = 1.0, 
   prob: Double = 1.0)
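
A short, hypothetical example of feeding rewards and probabilities to a policy for an imaginary 3-state problem (the R accessor is shown in the QLPolicy listing above):

val input = Array(
  QLInput(0, 1, reward = 1.0),
  QLInput(1, 2, reward = 5.0),
  QLInput(0, 2, reward = 5.0, prob = 0.8)
)
val policy = new QLPolicy[String](3, input)

policy.setQ(0, 1, 0.4)
println( policy.Q(0, 1) )   // 0.4
println( policy.R(0, 2) )   // 5.0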

The next post will dig into the generation of a model through Q-learning training.


References
Online Value Function Determination for Reinforcement Learning - J. Laird, N. Debersky, N. Tinkerhess
https://github.com/prnicolas
Scala for Machine Learning, Chapter 11: Reinforcement Learning / Q-Learning - P. Nicolas - Packt Publishing, 2014

Monday, November 18, 2013

Dependencies Injection in Scala

This post describes the concept of dependency injection along with an implementation of the Cake pattern.

Overview
Dependency injection is a design pattern that has been widely used in Java by leveraging frameworks such as Spring. The objective of the pattern is to replace hard-coded dependencies with run-time association, or injection, of new types.

Java defines modules or components through the semantics and conventions of packages. The functionality of a module is defined through one or more interfaces and implemented through the composition and inheritance of concrete classes. Polymorphism is used to "dynamically wire" those classes, assembled through composition and inheritance, into patterns that address a specific design problem (publish-subscribe, strategy, factory, ...).
However, those capabilities have proven limited for creating very dynamic and complex applications. The Scala programming language provides developers with a dependency injection mechanism based on self-type annotations that does not rely on a 3rd-party framework.

Note: For the sake of readability of the implementation of algorithms, all non-essential code such as error checking, comments, exceptions, validation of class and method arguments, scoping qualifiers or imports is omitted.

Reuse through Inheritance
The simplest and most commonly used form of reuse in any object-oriented programming language is inheritance. Let's consider an interface House, which is implemented by an abstract or concrete class "House with Furniture & Appliances", which in turn is sub-classed by a fully defined House.

It is well documented that inheritance is a poor mechanism for code reuse because data is not properly encapsulated, as a sub-class may access the internals of the base class. Moreover, any future change in the base class or interface (framework) will propagate through the sub-classes (dependencies).

Reuse through Composition
It is a well documented and researched fact that composition provides more robust encapsulation than inheritance, as the main class delegates, or routes, method invocations to the appropriate internal components. Contrary to inheritance, for which changes in the base class may have unintended consequences on the sub-classes, changes in components or inner classes can be made independently of the main class or outer component.
Clearly, in the example above, composition is more appropriate. After all, a House with Furniture and Appliances can be defined as a House that contains Furniture and Appliances:



Inversion of Control
Frameworks such as Spring have introduced the concept of Inversion of Control (IoC) containers and dependency injection, which is a form of IoC. In the case of inversion of control, a framework defines interfaces which are extended or implemented by the application or client code. Instead of having the application use the framework API, the framework relies on the application to implement a specific function.

Let's take the example of a generic service, Service, that accesses a database, written in Java.

public interface Service {
    JSonObject query(String mySQLQuery);
}
 
public interface DbAccess { }
 
public class ServiceImpl implements Service {
  private DbAccess dbAccess = null;
  public void setDbAccess(DbAccess dbAccess) { 
    this.dbAccess = dbAccess; 
  }
 
  @Override
  public JSonObject query(String mySQLQuery) { /* delegate to dbAccess */ return null; }
}

In the example above, a concrete implementation of the DbAccess interface (a MySQL connector, for instance) can be injected by being passed to the implementation of the service (line 9). The database access instance is used to execute the query and return a JSON object (line 14).
Scala provides the developer with a similar and powerful mechanism to inject dependencies into a concrete class, known as the Cake pattern.

Dependency Injection
At its core, dependency injection relies on three components:
  • A consumer, or list of dependent components
  • A provider, which injects the dependencies
  • The dependencies

Let's consider the recipe example above. A House requires not only Furniture and Appliances but also a Layout plan with step-by-step instructions.


Each piece of furniture is defined by its name, category and price. More specific categories of furniture, such as PatioFurniture and BathroomFurniture, can also be created with similar arguments.

class Furniture(val name: String, val category: String, val price: Double) 
 
class PatioFurniture(
  name: String, 
  price: Double, 
  val season: String
) extends Furniture(name, "Patio", price)
 
class BathroomFurniture(
  name: String, 
  price: Double, 
  val floor: Int = 1
) extends Furniture(name, "Bathroom", price)

A house also contains appliances and requires a layout plan to be set up. An appliance has a name, a price and a warranty flag, if needed and available. The class Layout is instantiated with a name and an option.

case class Appliance(
  name: String, 
  warranty: Boolean, 
  price: Double)

case class Layout(name: String, option: Int)

The goal is to furnish a house with a combination of appliances and pieces of furniture, following a layout plan. The implementation computes the total cost once a combination of furniture, appliances and layout is selected.

To this purpose, we create several modules, implemented as traits of type FurnitureModule (line 1), to encapsulate each category of furniture. In our case, FurnitureModule defines patio furniture, PatioFurniture (lines 6-10). The bathroom furniture, an instance of BathroomFurniture (lines 15-19), is wrapped into its dedicated module BathroomFurnitureModule (line 14).

trait FurnitureModule {
    // abstract attribute
  val furnitures: List[Furniture]
   
  class Furniture(id: String, category: String, price: Double) { val cost: Double = price }
  class PatioFurniture(
    id: String, 
    price: Double, 
    season: String
  ) extends Furniture(id, "Patio", price)
}
 
 
trait BathroomFurnitureModule extends FurnitureModule {
  class BathroomFurniture(
    id: String, 
    price: Double, 
    floor: Int = 1
  )  extends Furniture(id, "Bathroom", price)
}

Let's dig into the code ...
The first trait, FurnitureModule, defines a generic Furniture (line 5) and a specialized furniture type, PatioFurniture. Dynamic binding is managed by the module that encapsulates the hierarchy of furniture types. Alternatively, the client code can manage dynamic binding, or dependency injection, by creating a sub-module, BathroomFurnitureModule (line 14), to manage other types of furniture such as BathroomFurniture. It should be clear by now that these two different approaches
  • Customizing modules
  • Customizing classes or types within each module
can be applied either separately or together.

The same strategy is applied to the appliances and the layout.

trait ApplianceModule {
  val appliances: List[Appliance]
   
  class Appliance(name: String, warranty: Boolean, price: Double) {
     
    def this(name: String, price: Double) = this(name, true ,price)
    def cost: Double = if( warranty ) price * 1.15 else price
  }
}
 
 
trait LayoutModule {
  val layout: Layout
     
  class Layout(name: String, option: Int) {
    val movingInstructions: List[String] = List.empty
  }
}

A single class Appliance (line 4) with two constructors (lines 4 and 6) is enough to describe all types of appliances within the ApplianceModule scope. The same concept applies to the LayoutModule component (lines 12-17).
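
A quick, hypothetical check of the cost rule, using an anonymous module instance (the appliance name and price are made up):

val module = new ApplianceModule {
  // the auxiliary constructor sets the warranty flag to true
  val appliances = List(new Appliance("Oven", 450.0))
}
println( module.appliances.head.cost )   // 450.0 * 1.15 = 517.5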

The factory class RoomFurnishing (line 1) relies on a self-type annotation that requires the components LayoutModule, FurnitureModule and ApplianceModule as mixins (line 2). The factory class defines the methods required to manage any combination of these components, in particular the computation of the total cost (line 4).

class RoomFurnishing {
self: LayoutModule with FurnitureModule with ApplianceModule => 

  def cost: String = {
    val totalCost = furnitures.map(_.cost).sum + appliances.map(_.cost).sum
    s"RoomFurnishing: ${totalCost }"
  }
    
  def movingDate: String = "October, 2013"
}

Here is an example of how the client code can dynamically assemble all the components and compute the total cost.

val houseFurnishing =
 if( country )   // 'country' is a Boolean flag defined by the client code
   new RoomFurnishing 
     with FurnitureModule 
       with ApplianceModule 
        with LayoutModule {
 
    val layout = new Layout("Country Home", 2)

    val furnitures = List[Furniture](
      new Furniture("Modern Chair", "Chair", 136.0), 
      new PatioFurniture("Bench", 350.0, "Summer")
    )
      
    val appliances = List[Appliance](
      new Appliance("Microwave Oven", false, 185.0), 
      new Appliance("Dishwaher", true, 560.0)
    )
  }
    
  else 
    new RoomFurnishing 
      with BathroomFurnitureModule 
        with ApplianceModule 
          with LayoutModule {
 
      val layout = new Layout("Apartment", 4)
      val furnitures = List[BathroomFurniture](
        new BathroomFurniture("Stool", 89.5), 
        new BathroomFurniture("Cabinet", 255.6, 2)
      )
       
      val appliances = List[Appliance](
        new Appliance("Microwave Oven", false, 185.0), 
        new Appliance("Dishwaher", true, 560.0)
      )
  }
Console.println(houseFurnishing.cost)

The dynamic composition of furniture and appliances, given a layout, to furnish a room (an instance of RoomFurnishing) is made clear in the code snippet. If the goal is to furnish a country house (line 2), then the interior decorator selects a Modern Chair (line 11), a Bench (line 12), a Microwave Oven (line 16) and a Dishwasher (line 17), following a Country Home layout (line 8). The same principle applies to the selection of appliances and furniture to furnish an apartment (lines 22-36).

This technique combines class composition, inheritance, self-reference and abstract variables to provide a simple and flexible framework. The post also introduces the concept of a layout, or assembler, that hides some of the complexity and becomes the primary mixin. I may elaborate on this concept in a future post. I strongly recommend the article written by Jonas Bonér on the Cake pattern (listed in the references) for an in-depth understanding of both the motivation for and the variants of this pattern.


References
Jonas Bonér's article on the Cake pattern
Cake Solutions Blog
The Scala Programming Language - M. Odersky, L. Spoon, B. Venners - Artima 2007
github.com/prnicolas