Mountain Car

The goal is to drive the car up to the top of the hill.

Rendered screen after training is complete

Observation

Num  Observation  Min    Max
0    position     -1.2   0.6
1    velocity     -0.07  0.07
import gym

env = gym.make('MountainCar-v0')
env.observation_space.high  # array([0.6 , 0.07], dtype=float32)
env.observation_space.low   # array([-1.2 , -0.07], dtype=float32)

Actions

Num  Action
0    push left
1    no push
2    push right
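
The action space can be inspected the same way as the observation space. A quick check (the sampled value shown in the comment is just an example):

env.action_space.n         # 3
env.action_space.sample()  # e.g. 1, a random action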

Q-Learning

Bellman Equation

\[Q(s, a) = r + \gamma \max Q^\prime (s^\prime, a^\prime)\]

Q Function

\[Q(s, a) = Q(s,a) + \text{lr} \left[ R(s, a) + \gamma \max Q^\prime (s^\prime, a^\prime) - Q(s, a) \right]\]
  • \(\text{lr}\) : Learning rate
  • \(R(s, a)\) : Reward obtained from the current state and action
  • \(Q\) : Current Q value
  • \(\max Q^\prime (s^\prime, a^\prime)\) : Maximum future reward
  • \(s^\prime\) : next_state returned by step(action)
  • \(\gamma\) : Discount rate
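
As a concrete illustration, here is a single update of the formula above in NumPy; the Q values, action, and reward are made-up numbers, not taken from the actual table:

import numpy as np

lr, gamma = 0.1, 0.9
q = np.array([0.2, -0.3, 0.5])       # hypothetical Q[s] for the three actions
q_next = np.array([0.1, 0.4, -0.2])  # hypothetical Q[s'] after the step

action, reward = 2, -1.0             # MountainCar returns -1 per step
td_target = reward + gamma * np.max(q_next)   # R + gamma * max Q'(s', a')
q[action] += lr * (td_target - q[action])     # Q(s, a) moves toward the target
print(q[action])                              # 0.5 + 0.1 * (-0.64 - 0.5) = 0.386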

Build Q Table

The continuous observation values are binned (discretized) so that they can be stored in a table.

import numpy as np

env = gym.make('MountainCar-v0')

# Number of bins per dimension: position scaled by 10, velocity by 100
n_state = (env.observation_space.high - env.observation_space.low) * np.array([10, 100])
n_state = np.round(n_state, 0).astype(int) + 1

# Randomly initialize the Q table in [-1, 1) for every (position bin, velocity bin, action)
Q = np.random.uniform(-1, 1, size=(n_state[0], n_state[1], env.action_space.n))
print('Q shape:', Q.shape)
print('Q Table')
print(Q[1:2])
Q shape: (19, 15, 3)
Q Table
[[[ 0.99048129  0.83269269  0.23944522]
  [-0.4517455  -0.76882655 -0.00480888]
  [ 0.61718192  0.01420441 -0.08474976]
  [-0.38611008  0.34376222 -0.71499911]
  [-0.78333052 -0.30410788 -0.30258901]
  [ 0.8138172  -0.69035782  0.99421675]
  [-0.73070808 -0.60350616  0.57929507]
  [ 0.81467379 -0.82851229 -0.44759567]
  [-0.83048389  0.00949504 -0.28621805]
  [-0.80981087 -0.54730307  0.39901784]
  [-0.98453426  0.12534842  0.4347526 ]
  [-0.51690061 -0.69667071 -0.13774189]
  [ 0.91651489 -0.88653031 -0.93615038]
  [ 0.0208071   0.19121545 -0.32631843]
  [ 0.34336055  0.10997157  0.60867634]]]
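
For intuition, this is how a raw observation would land in the table; the observation below is hypothetical, and the same binning is implemented by the discretize helper in the training code:

obs = np.array([-0.5, 0.01])   # hypothetical (position, velocity)
idx = np.round((obs - env.observation_space.low) * np.array([10, 100])).astype(int)
print(idx)                     # [7 8] -> Q[7, 8] holds the three action values here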

Training

from tqdm import tqdm_notebook

def discretize(env, state):
    # Map a continuous observation onto integer bin indices
    # (position scaled by 10, velocity by 100), matching the Q table shape
    state = (state - env.observation_space.low) * np.array([10, 100])
    state = np.round(state, 0).astype(int)
    return state

def train(env, Q, epochs=10000, lr=0.1, gamma=0.9, epsilon=0.9):
    reduction = epsilon / epochs  # linear epsilon decay per epoch
    action_n = env.action_space.n

    rewards = list()
    
    for epoch in tqdm_notebook(range(epochs)):
        state = env.reset()
        state = discretize(env, state)
        
        done = False
        _tot_reward = 0
        _tot_rand_action = 0
        _tot_q_action = 0
        _max_pos = 0
        
        while not done:

            # Calculate next action
            if np.random.random() < 1 - epsilon:
                action = np.argmax(Q[state[0], state[1]])
                _tot_q_action += 1
            else:
                action = np.random.randint(0, action_n)
                _tot_rand_action += 1
                
            # Step!
            next_state, reward, done, info = env.step(action)
            next_state_apx = discretize(env, next_state)

            # Terminal Update
            if done and next_state[0] >= 0.5:
                Q[next_state_apx[0], next_state_apx[1], action] = reward
            else:
                delta = lr * (reward + gamma * np.max(Q[next_state_apx[0], next_state_apx[1]]) - 
                              Q[state[0], state[1], action])
                Q[state[0], state[1], action] += delta
            
            state = next_state_apx
            _tot_reward += reward
            
        # Decay Epsilon
        if epsilon > 0:
            epsilon -= reduction
            epsilon = round(epsilon, 4)
            
        # Track Rewards
        rewards.append(_tot_reward)
        
        # Log
        if epoch % 100 == 0:
            print(f'\repoch:{epoch} | tot reward:{_tot_reward} | epsilon:{epsilon} | '
                  f'rand action:{_tot_rand_action} | Q action:{_tot_q_action}')

    return rewards

rewards = train(env, Q)
epoch:0 | tot reward:-200.0 | epsilon:0.8999 | rand action:178 | Q action:22
epoch:100 | tot reward:-200.0 | epsilon:0.8899 | rand action:170 | Q action:30
epoch:200 | tot reward:-200.0 | epsilon:0.8799 | rand action:168 | Q action:32
epoch:300 | tot reward:-200.0 | epsilon:0.8699 | rand action:170 | Q action:30
epoch:400 | tot reward:-200.0 | epsilon:0.8599 | rand action:163 | Q action:37
epoch:500 | tot reward:-200.0 | epsilon:0.8499 | rand action:164 | Q action:36
epoch:600 | tot reward:-200.0 | epsilon:0.8399 | rand action:165 | Q action:35
epoch:700 | tot reward:-200.0 | epsilon:0.8299 | rand action:162 | Q action:38
epoch:800 | tot reward:-200.0 | epsilon:0.8199 | rand action:159 | Q action:41
epoch:900 | tot reward:-200.0 | epsilon:0.8099 | rand action:155 | Q action:45
epoch:1000 | tot reward:-200.0 | epsilon:0.7999 | rand action:162 | Q action:38
epoch:1100 | tot reward:-200.0 | epsilon:0.7899 | rand action:163 | Q action:37
epoch:1200 | tot reward:-200.0 | epsilon:0.7799 | rand action:150 | Q action:50
epoch:1300 | tot reward:-200.0 | epsilon:0.7699 | rand action:139 | Q action:61
epoch:1400 | tot reward:-200.0 | epsilon:0.7599 | rand action:155 | Q action:45
epoch:1500 | tot reward:-200.0 | epsilon:0.7499 | rand action:148 | Q action:52
epoch:1600 | tot reward:-200.0 | epsilon:0.7399 | rand action:148 | Q action:52
epoch:1700 | tot reward:-200.0 | epsilon:0.7299 | rand action:146 | Q action:54
epoch:1800 | tot reward:-200.0 | epsilon:0.7199 | rand action:139 | Q action:61
epoch:1900 | tot reward:-200.0 | epsilon:0.7099 | rand action:149 | Q action:51
epoch:2000 | tot reward:-200.0 | epsilon:0.6999 | rand action:141 | Q action:59
epoch:2100 | tot reward:-200.0 | epsilon:0.6899 | rand action:144 | Q action:56
epoch:2200 | tot reward:-200.0 | epsilon:0.6799 | rand action:130 | Q action:70
epoch:2300 | tot reward:-200.0 | epsilon:0.6699 | rand action:121 | Q action:79
epoch:2400 | tot reward:-200.0 | epsilon:0.6599 | rand action:134 | Q action:66
epoch:2500 | tot reward:-200.0 | epsilon:0.6499 | rand action:112 | Q action:88
epoch:2600 | tot reward:-200.0 | epsilon:0.6399 | rand action:135 | Q action:65
epoch:2700 | tot reward:-200.0 | epsilon:0.6299 | rand action:124 | Q action:76
epoch:2800 | tot reward:-200.0 | epsilon:0.6199 | rand action:123 | Q action:77
epoch:2900 | tot reward:-200.0 | epsilon:0.6099 | rand action:123 | Q action:77
epoch:3000 | tot reward:-200.0 | epsilon:0.5999 | rand action:126 | Q action:74
epoch:3100 | tot reward:-200.0 | epsilon:0.5899 | rand action:109 | Q action:91
epoch:3200 | tot reward:-200.0 | epsilon:0.5799 | rand action:124 | Q action:76
epoch:3300 | tot reward:-200.0 | epsilon:0.5699 | rand action:114 | Q action:86
epoch:3400 | tot reward:-200.0 | epsilon:0.5599 | rand action:103 | Q action:97
epoch:3500 | tot reward:-200.0 | epsilon:0.5499 | rand action:115 | Q action:85
epoch:3600 | tot reward:-200.0 | epsilon:0.5399 | rand action:99 | Q action:101
epoch:3700 | tot reward:-200.0 | epsilon:0.5299 | rand action:118 | Q action:82
epoch:3800 | tot reward:-200.0 | epsilon:0.5199 | rand action:106 | Q action:94
epoch:3900 | tot reward:-200.0 | epsilon:0.5099 | rand action:97 | Q action:103
epoch:4000 | tot reward:-200.0 | epsilon:0.4999 | rand action:108 | Q action:92
epoch:4100 | tot reward:-200.0 | epsilon:0.4899 | rand action:106 | Q action:94
epoch:4200 | tot reward:-200.0 | epsilon:0.4799 | rand action:91 | Q action:109
epoch:4300 | tot reward:-200.0 | epsilon:0.4699 | rand action:84 | Q action:116
epoch:4400 | tot reward:-198.0 | epsilon:0.4599 | rand action:76 | Q action:122
epoch:4500 | tot reward:-200.0 | epsilon:0.4499 | rand action:92 | Q action:108
epoch:4600 | tot reward:-200.0 | epsilon:0.4399 | rand action:91 | Q action:109
epoch:4700 | tot reward:-200.0 | epsilon:0.4299 | rand action:83 | Q action:117
epoch:4800 | tot reward:-200.0 | epsilon:0.4199 | rand action:75 | Q action:125
epoch:4900 | tot reward:-200.0 | epsilon:0.4099 | rand action:88 | Q action:112
epoch:5000 | tot reward:-200.0 | epsilon:0.3999 | rand action:84 | Q action:116
epoch:5100 | tot reward:-200.0 | epsilon:0.3899 | rand action:76 | Q action:124
epoch:5200 | tot reward:-200.0 | epsilon:0.3799 | rand action:71 | Q action:129
epoch:5300 | tot reward:-200.0 | epsilon:0.3699 | rand action:68 | Q action:132
epoch:5400 | tot reward:-200.0 | epsilon:0.3599 | rand action:75 | Q action:125
epoch:5500 | tot reward:-200.0 | epsilon:0.3499 | rand action:64 | Q action:136
epoch:5600 | tot reward:-200.0 | epsilon:0.3399 | rand action:72 | Q action:128
epoch:5700 | tot reward:-200.0 | epsilon:0.3299 | rand action:79 | Q action:121
epoch:5800 | tot reward:-200.0 | epsilon:0.3199 | rand action:68 | Q action:132
epoch:5900 | tot reward:-200.0 | epsilon:0.3099 | rand action:72 | Q action:128
epoch:6000 | tot reward:-200.0 | epsilon:0.2999 | rand action:57 | Q action:143
epoch:6100 | tot reward:-200.0 | epsilon:0.2899 | rand action:70 | Q action:130
epoch:6200 | tot reward:-200.0 | epsilon:0.2799 | rand action:48 | Q action:152
epoch:6300 | tot reward:-200.0 | epsilon:0.2699 | rand action:51 | Q action:149
epoch:6400 | tot reward:-200.0 | epsilon:0.2599 | rand action:54 | Q action:146
epoch:6500 | tot reward:-200.0 | epsilon:0.2499 | rand action:34 | Q action:166
epoch:6600 | tot reward:-200.0 | epsilon:0.2399 | rand action:56 | Q action:144
epoch:6700 | tot reward:-158.0 | epsilon:0.2299 | rand action:38 | Q action:120
epoch:6800 | tot reward:-200.0 | epsilon:0.2199 | rand action:39 | Q action:161
epoch:6900 | tot reward:-190.0 | epsilon:0.2099 | rand action:33 | Q action:157
epoch:7000 | tot reward:-200.0 | epsilon:0.1999 | rand action:41 | Q action:159
epoch:7100 | tot reward:-200.0 | epsilon:0.1899 | rand action:40 | Q action:160
epoch:7200 | tot reward:-161.0 | epsilon:0.1799 | rand action:27 | Q action:134
epoch:7300 | tot reward:-200.0 | epsilon:0.1699 | rand action:26 | Q action:174
epoch:7400 | tot reward:-200.0 | epsilon:0.1599 | rand action:36 | Q action:164
epoch:7500 | tot reward:-159.0 | epsilon:0.1499 | rand action:26 | Q action:133
epoch:7600 | tot reward:-159.0 | epsilon:0.1399 | rand action:21 | Q action:138
epoch:7700 | tot reward:-158.0 | epsilon:0.1299 | rand action:13 | Q action:145
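
Since train returns the list of per-episode rewards, the learning curve can be visualized; a minimal sketch with matplotlib (not part of the original output):

import matplotlib.pyplot as plt

window = 100  # smooth with a 100-epoch moving average
smoothed = np.convolve(rewards, np.ones(window) / window, mode='valid')

plt.plot(smoothed)
plt.xlabel('Epoch')
plt.ylabel(f'Average total reward ({window}-epoch window)')
plt.show()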

Playing

env = gym.make('MountainCar-v0')
state = env.reset()
state = discretize(env, state)

env.render()
input()  # wait for a key press before starting

while True:
    env.render()
    # Always act greedily with respect to the learned Q table
    action = np.argmax(Q[state[0], state[1]])
    state, reward, done, info = env.step(action)
    state = discretize(env, state)

    print(f'\rstate:{state} | reward:{reward} | done:{done} | info:{info}')

    if done:
        break

env.close()