## Tuesday, December 10, 2013

### Reinforcement Learning in Scala 1: States & Policies

Target audience: Advanced
Estimated reading time: 45'

This post describes a very common reinforcement learning methodology: Tenporal difference update as implemented in Scala. This first section introduces and implements the concept of states and policies.

Overview
There are many different approaches to implement reinforcement learning
One of the most commonly used method is searching the value function space using temporal difference method
All known reinforcement learning methods share the same objective of solving the sequential decision tasks. In a sequential decision task, an agent interacts with a dynamic system by selecting actions that affect the transition between states in order to optimize a given reward function.

At any given step i, the agent select an action a(i) on the current state s(i). The dynamic system responds by rewarding the agent for its optimal selection of the next state:$s_{i+1}=V(s_{i})$
The learning agent infers the policy that maps the set of states {s} to the set of available actions {a}, using a value function  $V(s_{i})$ The policy is defined at $\pi :\,\{s_{i}\} \mapsto \{a_{i}\} \left \{ s_{i}|s_{i+1}=V(s_{i}) \right \}$

Temporal Difference
The most common approach of learning a value function V is to use the Temporal Difference method (TD). The method uses observations of prediction differences from consecutive states, s(i) & s(i+1). If we note r the reward for selection an action from state s(i) to s(i+1) and n the learning rate, then the value V is updated as $V(s_{i})\leftarrow V(s_{i})+\eta .(V(s_{i+1}) -V(s_{i}) + r_{i})$
Therefore the goal of the temporal difference method is to learn the value function for the optimal policy. The 'action-value' function represents the expected value of action a on a state s and defined as $Q(s_{i},a_{i}) = r(s_{i}) + V(s_{i})$ where r is the reward value for the state.

On-policy vs. Off-policy
The Temporal Difference method relies on the estimate of the final reward to be computed for each state. There are two methods of the Temporal Difference algorithm:On-Policy and Off-Policy:
- On-Policy method learns the value of the policy used to make the decision. The value function is derived from the execution of actions using the same policy but based on history
- Off-Policy method learns potentially different policies. Therefore the estimate is computed using actions that have not been executed yet.

The most common formula for temporal difference approach is the Q-learning formula. It introduces the concept of discount rate to reduce the impact of the first few states on the optimization of the policy. It does not need a model of its environment. The exploitation of action-value approach consists of selecting the next state is by computing the action with the maximum reward. Conversely the exploration approach focus on the total anticipated reward.The update equation for the Q-Learning is $Q(s_{i},a_{i}) \leftarrow Q(s_{i},a_{i}) + \eta .(r_{i+1} +\alpha .max_{a_{i+1}}Q(s_{i+1},a_{i+1}) - Q(s_{i},a_{i}))$ $Q(s_{i},a_{i}): \mathrm{expected\,value\,action\,a\,on\,state\,s}\,\,\eta : \mathrm{learning\,rate}\,\,\alpha : \mathrm{discount\,rate}$ . One of the most commonly used On-Policy method is Sarsa which does not necessarily select the action that offer the most value.The update equation is defined as$Q(s_{i},a_{i}) \leftarrow Q(s_{i},a_{i}) + \eta .(r_{i+1} +\alpha .Q(s_{i+1},a_{i+1}) - Q(s_{i},a_{i}))$

States and Actions

Functional languages are particularly suitable for iterative computation. We use Scala for the implementation of the temporal difference algorithm. We allow the user to specify any variant of the learning formula, using local functions or closures.
Firstly, we have to define a state class, QLState (line 1) that contains a list of actions of type QLAction (line 3) that can be executed from this state. The only purpose of this class is to connect a list of action to a source state. The parameterized class argument property (line 4) is used to "attach" some extra characteristics to this state.

 1 2 3 4 5 6 7 8 class QLState[T]( val id: Int, val actions: List[QLAction[T]] = List.empty, property: T) { @inline def isGoal: Boolean = !actions.isEmpty } 

As described in the introduction, an action of class QLAction has a source state from and a destination state to(state which is reached following the action). A state except the goal state, has multiple actions but an action has only one destination or resulting state.

case class QLAction[T](from: Int, to: Int)


The state and action can be loaded, generated and managed by a directed graph or search space of type QLSpace. The search space contains the list of all the possible states available to the agent.
One or more of these states can be selected as goals. The algorithm does not restrict the agent to a single state. The process ends when one of the goal states is reached (OR logic). The algorithm does not support combined goals (AND logic).

Let's implement the basic components of the search space QLSpace. The class list all available states (line 2) and one or more final or goal states goalIds (line 3). Although you would expect that the search space contains a single final or goal state, it is not uncommon to have online training using more than one goal state.

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 class QLSpace[T]( states: Array[QLState[T]], goalIds: Array[Int]) { // Indexed map of states val statesMap: immutable.Map[Int, QLState[T]] = states.map(st => (st.id, st)).toMap // List set of one or more goals val goalStates = new immutable.HashSet[Int]() ++ goalIds // Compute the maximum Q value for a given state and policy def maxQ(st: QLState[T], policy: QLPolicy[T]): Double = { val best = states.filter( _ != st) .maxBy(_st => policy.EQ(st.id, _st.id)) policy.EQ(st.id, best.id) } // Retrieves the list of states destination of state, st def nextStates(st: QLState[T]): List[QLState[T]] = st.actions.map(ac => statesMap.get(ac.to).get) def init(r: Random): QLState[T] = states(r.nextInt(states.size-1)) } 

A hash map statesMap maintains a dictionary of all the possible states with the state id as unique key (lines 6, 7). The class QLSpace has three important methods:
• init initializes the search with a random state for each training epoch (lines 22, 23)
• nextStates returns the list of destination states associated to the state st (lines 19, 20)
• maxQ return the maximum Q-value for this state st given the current policy policy(lines 12-15). The method filters out itself from the search from the next best action. It then compute the maximum reward or Q(state, action) value according to the given policy policy
The next step is to defined a policy.

Learning Policy
A policy is defined by three components
• A reward collected after transitioning from one state to another state (line 2). The reward is provided by the user
• A Q(State, Action) value, value associated to a transition state and an action (line 4)
• A probability (with default values of 1.0) that defines the obstacles or hindrance to migrate from one state to another (line 3)
The estimate combine the Q-value (incentive to move to the best next step) and probability (hindrance to move to any particular state) (line 7).

 1 2 3 4 5 6 7 8 class QLData { var reward: Double = 1.0 var probability: Double = 1.0 var value: Double = 0.0) { @inline final def estimate: Double = value*probability } 

The policy of type QLPolicy is a container for the state transition attributes, rewards, Q-values and probabilities.

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 class QLPolicy[T](numStates: Int, input: Array[QLInput]) { val qlData = { val data = Array.tabulate(numStates)( _ => Array.fill(numStates)(new QLData) ) input.foreach(in => { data(in.from)(in.to).reward = in.reward data(in.from)(in.to).probability = in.prob }) data } def setQ(from: Int, to: Int, value: Double): Unit = qlData(from)(to).value = value def Q(from: Int, to: Int): Double = qlData(from)(to).value } 

The constructor for QLPolicy takes two arguments:
• Number of states numStates (line 1)
• Sequence of input of type QLInput to the policy
The constructor create a numStates x numStates matrix of transition of type QLData (lines 3 - 12), from the input.

The type QLInput wraps the input data (index of the input state from, index of the output state to, reward and probability associated to the state transition) into a single convenient class.

case class QLInput(
from: Int,
to: Int,
reward: Double = 1.0,
prob: Double = 1.0)


The next post will dig into the generation of a model through Q-learning training

References

#### 4 comments:

1. The development of artificial intelligence (AI) has propelled more programming architects, information scientists, and different experts to investigate the plausibility of a vocation in machine learning. Notwithstanding, a few newcomers will in general spotlight a lot on hypothesis and insufficient on commonsense application. machine learning projects for final year In case you will succeed, you have to begin building machine learning projects in the near future.

Projects assist you with improving your applied ML skills rapidly while allowing you to investigate an intriguing point. Furthermore, you can include projects into your portfolio, making it simpler to get a vocation, discover cool profession openings, and Final Year Project Centers in Chennai even arrange a more significant compensation.

Data analytics is the study of dissecting crude data so as to make decisions about that data. Data analytics advances and procedures are generally utilized in business ventures to empower associations to settle on progressively Python Training in Chennai educated business choices. In the present worldwide commercial center, it isn't sufficient to assemble data and do the math; you should realize how to apply that data to genuine situations such that will affect conduct. In the program you will initially gain proficiency with the specialized skills, including R and Python dialects most usually utilized in data analytics programming and usage; Python Training in Chennai at that point center around the commonsense application, in view of genuine business issues in a scope of industry segments, for example, wellbeing, promoting and account.

The Nodejs Training Angular Training covers a wide range of topics including Components, Angular Directives, Angular Services, Pipes, security fundamentals, Routing, and Angular programmability. The new Angular TRaining will lay the foundation you need to specialise in Single Page Application developer. Angular Training

2. It has also been beneficial to the manufacturing companies as it speeds up execution and generates large scale process. data science course syllabus

3. This is my first time i visit here. I found so many entertaining stuff in your blog, especially its discussion. From the tons of comments on your articles, I guess I am not the only one having all the leisure here! Keep up the good work. I have been meaning to write something like this on my website and you have given me an idea.

data science course in India

4. I just got to this amazing site not long ago. I was actually captured with the piece of resources you have got here. Big thumbs up for making such wonderful blog page!
Artificial Intelligence Course