4. Implementing DeepMind's DQN

1. [Coding] Fill in get_q_values_op in q3_nature.py to build the deep Q-network described in the [mnih2015human] paper.

The rest of the code is carried over unchanged from the linear approximation setup.

Run it on CPU with the command python q3_nature.py. It should take roughly 1 to 2 minutes. [10 points]

 
sol) The model architecture section near the end of the Nature paper describes the network as follows:

The first hidden layer convolves 32 filters of 8 X 8 with stride 4 with the input image and applies a rectifier nonlinearity.
The second hidden layer convolves 64 filters of 4 X 4 with stride 2, again followed by a rectifier nonlinearity.
This is followed by a third convolutional layer that convolves 64 filters of 3 X 3 with stride 1 followed by a rectifier.
The final hidden layer is fully-connected and consists of 512 rectifier units.

The output layer is a fully-connected linear layer with a single output for each valid action.

That is the full description given in the paper.

 

Paraphrasing this layer by layer:

First convolutional layer: 32 filters, 8x8 kernel, stride 4, ReLU.

Second convolutional layer: 64 filters, 4x4 kernel, stride 2, ReLU.

Third convolutional layer: 64 filters, 3x3 kernel, stride 1, ReLU.

The final hidden layer is a fully-connected layer with 512 outputs and ReLU.

The output layer is a fully-connected layer producing a single output for each valid action.
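As a sanity check on these layer sizes, the spatial dimension after each convolution follows the standard VALID-padding formula floor((W - K) / S) + 1. Assuming the paper's 84x84x4 preprocessed Atari input (the course's test environment may use a different size), the shapes work out like this:

```python
def conv_out(size, kernel, stride):
    # VALID padding: floor((size - kernel) / stride) + 1
    return (size - kernel) // stride + 1

h = 84                 # paper's preprocessed frame side length (assumed input size)
h = conv_out(h, 8, 4)  # conv1: 8x8 kernel, stride 4 -> 20
h = conv_out(h, 4, 2)  # conv2: 4x4 kernel, stride 2 -> 9
h = conv_out(h, 3, 1)  # conv3: 3x3 kernel, stride 1 -> 7
flat_dim = h * h * 64  # flattened features fed into the 512-unit dense layer
print(h, flat_dim)     # -> 7 3136
```

So for the paper's input, the dense layer sees 3136 features, which is where most of the network's parameters live.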

 

Implemented in code, this looks like the following.

class NatureQN(Linear):
    """
    Implementing DeepMind's Nature paper. Here are the relevant urls.
    https://storage.googleapis.com/deepmind-data/assets/papers/DeepMindNature14236Paper.pdf
    https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
    """

    def get_q_values_op(self, state, scope, reuse=False):
        """
        Returns Q values for all actions

        Args:
            state: (tf tensor) 
                shape = (batch_size, img height, img width, nchannels)
            scope: (string) scope name, that specifies if target network or not
            reuse: (bool) reuse of variables in the scope

        Returns:
            out: (tf tensor) of shape = (batch_size, num_actions)
        """
        # this information might be useful
        num_actions = self.env.action_space.n

        ##############################################################
        """
        TODO: implement the computation of Q values like in the paper
                https://storage.googleapis.com/deepmind-data/assets/papers/DeepMindNature14236Paper.pdf
                https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf

              you may find the section "model architecture" of the appendix of the 
              nature paper particularly useful.

              store your result in out of shape = (batch_size, num_actions)

        HINT: 
            - You may find the following functions useful:
                - tf.layers.conv2d
                - tf.layers.flatten
                - tf.layers.dense

            - Make sure to also specify the scope and reuse

        """

        """
        The first hidden layer convolves 32 filters of 8 X 8 with stride 4 with the input image and applies a rectifier nonlinearity.
        The second hidden layer convolves 64 filters of 4 X 4 with stride 2, again followed by a rectifier nonlinearity
        This is followed by a third convolutional layer that convolves 64 filters of 3 X 3 with stride 1 followed by a rectifier.
        The final hidden layer is fully-connected and consists of 512 rectifier units.
        The output layer is a fully-connected linear layer with a single output for each valid action.
        """
        ##############################################################
        ################ YOUR CODE HERE - 10-15 lines ################
        input = state  # note: shadows the built-in input(); kept for readability
        with tf.variable_scope(scope, reuse=reuse):
            conv1 = tf.layers.conv2d(input, 32, (8, 8), strides=4, activation=tf.nn.relu, name='conv1')
            conv2 = tf.layers.conv2d(conv1, 64, (4, 4), strides=2, activation=tf.nn.relu, name='conv2')
            conv3 = tf.layers.conv2d(conv2, 64, (3, 3), strides=1, activation=tf.nn.relu, name='conv3')
            flat = tf.layers.flatten(conv3, name='flatten')
            fc = tf.layers.dense(flat, 512, activation=tf.nn.relu, name='fully-connected')
            out = tf.layers.dense(fc, num_actions, name='out')

        ##############################################################
        ######################## END YOUR CODE #######################
        return out

(The class declaration is included for readability, but the part I implemented is roughly the last 10 lines, starting from input = state.)
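The scope and reuse arguments matter because this same function is called twice: once for the online Q-network and once for the target network, and their variables must land in disjoint name spaces. A toy pure-Python sketch of that TF1-style naming scheme (this is only an illustration, not TensorFlow itself):

```python
# Toy model of TF1 variable scoping: variables live under "<scope>/<name>" keys,
# so building the network twice under two scopes yields two separate parameter sets,
# while reuse=True lets a later call share an already-created variable.
store = {}

def get_variable(scope, name, init=0.0, reuse=False):
    key = f"{scope}/{name}"
    if key in store and not reuse:
        raise ValueError(f"variable {key} already exists; pass reuse=True")
    return store.setdefault(key, init)

get_variable("q", "conv1/kernel")                 # online network's copy
get_variable("target_q", "conv1/kernel")          # target network's separate copy
get_variable("q", "conv1/kernel", reuse=True)     # rebuilding "q" shares the variable

print(sorted(store))  # -> ['q/conv1/kernel', 'target_q/conv1/kernel']
```

This is why forgetting to pass scope/reuse through to tf.variable_scope typically either crashes on the second build or silently entangles the two networks.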

 
2. (written 5pts) Attach the plot of scores, scores.png, from the directory results/q3 nature to your writeup. Compare this model with linear approximation. How do the final performances compare? How about the training time?

 

2. Attach scores.png from results/q3 nature. Compare this model with the linear approximation model.

How do the final performances compare? How does the training time compare?

[Plot: scores.png — Linear Approximation]

First of all, the linear approximation starts to converge noticeably earlier than the DQN.

In some DQN runs, the reward even got stuck at 4.0 and never reached 4.1 before training ended.

Presumably this is because training the DQN is less stable than training the linear approximation.

 

The DQN also takes longer to train.
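One plausible reason for the slower training is sheer parameter count. A rough tally, assuming the paper's 84x84x4 input and, say, 6 actions (both numbers are assumptions; the course's test environment is much smaller, but the ratio is what matters):

```python
def conv_params(k, c_in, c_out):
    # k*k*c_in weights per filter, plus one bias per filter
    return k * k * c_in * c_out + c_out

def dense_params(n_in, n_out):
    return n_in * n_out + n_out

num_actions = 6        # assumed; depends on the environment
flat_dim = 7 * 7 * 64  # conv stack output for an 84x84x4 input

dqn = (conv_params(8, 4, 32)        # conv1
       + conv_params(4, 32, 64)     # conv2
       + conv_params(3, 64, 64)     # conv3
       + dense_params(flat_dim, 512)
       + dense_params(512, num_actions))

# linear approximation: one weight per (input pixel, action) pair, plus biases
linear = dense_params(84 * 84 * 4, num_actions)

print(dqn, linear)  # -> 1687206 169350
```

Under these assumptions the DQN has about 10x the parameters of the linear model, so more computation per update is expected even before considering optimization stability.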

 
From this, we learn that for a simple test environment like this one, linear approximation may actually work better.

In other words, the best model depends on the environment.
