Deep Learning for Computer Vision

EECS 498-007 / 598-005 Assignment #3-1

딥땔감 2021. 2. 3. 21:09

 

At last, here is Assignment #3, which I kept putting off.

 

Before anything else, start by memorizing the various gates that serve as ready-made rules for computing gradients in a computational graph.

They will be very, very useful throughout the rest of the assignments.

 

 

Assignment #3 starts with implementing a Fully-Connected Neural Network and Dropout.

 

First, we implement the forward and backward passes of the Linear layer.

  @staticmethod
  def forward(x, w, b):
    """
    Computes the forward pass for a linear (fully-connected) layer.
    The input x has shape (N, d_1, ..., d_k) and contains a minibatch of N
    examples, where each example x[i] has shape (d_1, ..., d_k). We will
    reshape each input into a vector of dimension D = d_1 * ... * d_k, and
    then transform it to an output vector of dimension M.
    Inputs:
    - x: A tensor containing input data, of shape (N, d_1, ..., d_k)
    - w: A tensor of weights, of shape (D, M)
    - b: A tensor of biases, of shape (M,)
    Returns a tuple of:
    - out: output, of shape (N, M)
    - cache: (x, w, b)
    """
    out = None
    #############################################################################
    # TODO: Implement the linear forward pass. Store the result in out. You     #
    # will need to reshape the input into rows.                                 #
    #############################################################################
    # Replace "pass" statement with your code
    xx = x.view(x.shape[0], -1)
    out = xx.matmul(w) + b
    #############################################################################
    #                              END OF YOUR CODE                             #
    #############################################################################
    cache = (x, w, b)
    return out, cache

For the forward pass, each example in x is first flattened into a 1-D vector so it can easily be multiplied by W.

That flattened input is then multiplied by W and the bias is added, which completes the forward pass.
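
As a quick sanity check of the same computation (a minimal standalone sketch, not the assignment's Linear class; the toy shapes are made up):

import torch

# N = 2 examples of shape (4, 5, 6), so D = 120; project to M = 3 outputs.
x = torch.randn(2, 4, 5, 6)
w = torch.randn(4 * 5 * 6, 3)
b = torch.randn(3)

out = x.view(x.shape[0], -1).matmul(w) + b   # same reshape, matmul, bias add as above
print(out.shape)                             # torch.Size([2, 3])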

 

@staticmethod
  def backward(dout, cache):
    """
    Computes the backward pass for a linear layer.
    Inputs:
    - dout: Upstream derivative, of shape (N, M)
    - cache: Tuple of:
      - x: Input data, of shape (N, d_1, ... d_k)
      - w: Weights, of shape (D, M)
      - b: Biases, of shape (M,)
    Returns a tuple of:
    - dx: Gradient with respect to x, of shape (N, d1, ..., d_k)
    - dw: Gradient with respect to w, of shape (D, M)
    - db: Gradient with respect to b, of shape (M,)
    """
    x, w, b = cache
    dx, dw, db = None, None, None
    #############################################################################
    # TODO: Implement the linear backward pass.                                 #
    #############################################################################
    # Replace "pass" statement with your code
    xx = x.flatten().view(x.shape[0],-1)
    dx = dout.matmul(w.T).reshape(x.shape)
    dw = xx.T.matmul(dout)
    db = dout.sum(axis=0)
    #############################################################################
    #                              END OF YOUR CODE                             #
    #############################################################################
    return dx, dw, db

In the backward pass we first hit the Wx + b node. By the add-gate rule, db is just the upstream gradient dout (summed over the batch dimension, since b is broadcast across the N examples), and the gradient flowing into Wx is dout itself. The gradients of W and x are then determined by the mul-gate rule, and the code above expresses exactly this process.
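
To convince yourself the formulas are right, you can compare them against autograd on random tensors (a sketch with made-up shapes, not part of the assignment code):

import torch

N, D, M = 4, 6, 3
x = torch.randn(N, 2, 3, requires_grad=True)   # flattens to D = 6
w = torch.randn(D, M, requires_grad=True)
b = torch.randn(M, requires_grad=True)
dout = torch.randn(N, M)

out = x.view(N, -1).matmul(w) + b
out.backward(dout)                              # autograd gradients

dx = dout.matmul(w.detach().T).reshape(x.shape) # manual gradients, as in Linear.backward
dw = x.detach().view(N, -1).T.matmul(dout)
db = dout.sum(dim=0)

print(torch.allclose(dx, x.grad),
      torch.allclose(dw, w.grad),
      torch.allclose(db, b.grad))               # True True True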

 

class ReLU(object):

  @staticmethod
  def forward(x):
    """
    Computes the forward pass for a layer of rectified linear units (ReLUs).
    Input:
    - x: Input; a tensor of any shape
    Returns a tuple of:
    - out: Output, a tensor of the same shape as x
    - cache: x
    """
    out = None
    #############################################################################
    # TODO: Implement the ReLU forward pass.                                    #
    # You should not change the input tensor with an in-place operation.        #
    #############################################################################
    # Replace "pass" statement with your code
    out = x.clamp(min = 0)
    #############################################################################
    #                              END OF YOUR CODE                             #
    #############################################################################
    cache = x
    return out, cache

  @staticmethod
  def backward(dout, cache):
    """
    Computes the backward pass for a layer of rectified linear units (ReLUs).
    Input:
    - dout: Upstream derivatives, of any shape
    - cache: Input x, of same shape as dout
    Returns:
    - dx: Gradient with respect to x
    """
    dx, x = None, cache
    #############################################################################
    # TODO: Implement the ReLU backward pass.                                   #
    # You should not change the input tensor with an in-place operation.        #
    #############################################################################
    # Replace "pass" statement with your code
    dx = dout * (x > 0)
    #############################################################################
    #                              END OF YOUR CODE                             #
    #############################################################################
    return dx

Next, the ReLU layer's forward and backward passes. The forward pass uses the clamp method to clip negative values to 0.

For the backward pass, as covered in lecture, the downstream gradient is dout wherever x is positive, and 0 wherever x is zero or negative.
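
A tiny worked example of both directions (a sketch with hand-picked values):

import torch

x = torch.tensor([[-1.5, 0.0, 2.0],
                  [ 3.0, -0.5, 1.0]])
dout = torch.ones_like(x)

out = x.clamp(min=0)    # forward: negatives clipped to 0
dx = dout * (x > 0)     # backward: gradient passes only where x > 0
print(out)
print(dx)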

 

class Linear_ReLU(object):

  @staticmethod
  def forward(x, w, b):
    """
    Convenience layer that performs a linear transform followed by a ReLU.

    Inputs:
    - x: Input to the linear layer
    - w, b: Weights for the linear layer
    Returns a tuple of:
    - out: Output from the ReLU
    - cache: Object to give to the backward pass
    """
    a, fc_cache = Linear.forward(x, w, b)
    out, relu_cache = ReLU.forward(a)
    cache = (fc_cache, relu_cache)
    return out, cache

  @staticmethod
  def backward(dout, cache):
    """
    Backward pass for the linear-relu convenience layer
    """
    fc_cache, relu_cache = cache
    da = ReLU.backward(dout, relu_cache)
    dx, dw, db = Linear.backward(da, fc_cache)
    return dx, dw, db

Next comes the 'sandwich' layer: the forward pass calls Linear.forward and then ReLU.forward in layer order, and the backward pass reverses that, calling ReLU.backward and then Linear.backward.

 

class TwoLayerNet(object):
  """
  A two-layer fully-connected neural network with ReLU nonlinearity and
  softmax loss that uses a modular layer design. We assume an input dimension
  of D, a hidden dimension of H, and perform classification over C classes.
  The architecture should be linear - relu - linear - softmax.
  Note that this class does not implement gradient descent; instead, it
  will interact with a separate Solver object that is responsible for running
  optimization.

  The learnable parameters of the model are stored in the dictionary
  self.params that maps parameter names to PyTorch tensors.
  """

  def __init__(self, input_dim=3*32*32, hidden_dim=100, num_classes=10,
         weight_scale=1e-3, reg=0.0, dtype=torch.float32, device='cpu'):
    """
    Initialize a new network.
    Inputs:
    - input_dim: An integer giving the size of the input
    - hidden_dim: An integer giving the size of the hidden layer
    - num_classes: An integer giving the number of classes to classify
    - weight_scale: Scalar giving the standard deviation for random
      initialization of the weights.
    - reg: Scalar giving L2 regularization strength.
    - dtype: A torch data type object; all computations will be performed using
      this datatype. float is faster but less accurate, so you should use
      double for numeric gradient checking.
    - device: device to use for computation. 'cpu' or 'cuda'
    """
    self.params = {}
    self.reg = reg

    ###########################################################################
    # TODO: Initialize the weights and biases of the two-layer net. Weights   #
    # should be initialized from a Gaussian centered at 0.0 with              #
    # standard deviation equal to weight_scale, and biases should be          #
    # initialized to zero. All weights and biases should be stored in the     #
    # dictionary self.params, with first layer weights                        #
    # and biases using the keys 'W1' and 'b1' and second layer                #
    # weights and biases using the keys 'W2' and 'b2'.                        #
    ###########################################################################
    # Replace "pass" statement with your code
    self.params['W1'] = torch.normal(0.0, weight_scale, (input_dim, hidden_dim), dtype = dtype, device = device)
    self.params['W2'] = torch.normal(0.0, weight_scale, (hidden_dim, num_classes), dtype = dtype, device = device)
    self.params['b1'] = torch.zeros((hidden_dim,), dtype = dtype, device = device)
    self.params['b2'] = torch.zeros((num_classes,), dtype = dtype, device = device)
    ###########################################################################
    #                            END OF YOUR CODE                             #
    ###########################################################################

Next we implement TwoLayerNet, which of course starts with the class initializer.

Each weight matrix W is initialized from a zero-mean Gaussian with standard deviation weight_scale, and the biases are initialized to zero.
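
A quick check of what this initialization produces (hypothetical sizes mirroring the defaults above, rather than calling the class itself):

import torch

input_dim, hidden_dim, weight_scale = 3 * 32 * 32, 100, 1e-3

W1 = torch.normal(0.0, weight_scale, (input_dim, hidden_dim))
b1 = torch.zeros(hidden_dim)
print(W1.shape, round(W1.std().item(), 5))   # std should land near weight_scale
print(b1.shape, b1.abs().sum().item())       # biases start at exactly 0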

def loss(self, X, y=None):
    """
    Compute loss and gradient for a minibatch of data.

    Inputs:
    - X: Tensor of input data of shape (N, d_1, ..., d_k)
    - y: int64 Tensor of labels, of shape (N,). y[i] gives the label for X[i].

    Returns:
    If y is None, then run a test-time forward pass of the model and return:
    - scores: Tensor of shape (N, C) giving classification scores, where
      scores[i, c] is the classification score for X[i] and class c.
    If y is not None, then run a training-time forward and backward pass and
    return a tuple of:
    - loss: Scalar value giving the loss
    - grads: Dictionary with the same keys as self.params, mapping parameter
      names to gradients of the loss with respect to those parameters.
    """
    scores = None
    ###########################################################################
    # TODO: Implement the forward pass for the two-layer net, computing the   #
    # class scores for X and storing them in the scores variable.             #
    ###########################################################################
    # Replace "pass" statement with your code
    h, cache_1 = Linear_ReLU.forward(X, self.params['W1'], self.params['b1'])
    scores, cache_2  = Linear.forward(h, self.params['W2'], self.params['b2'])
    ###########################################################################
    #                            END OF YOUR CODE                             #
    ###########################################################################

    # If y is None then we are in test mode so just return scores
    if y is None:
      return scores

    loss, grads = 0, {}
    ###########################################################################
    # TODO: Implement the backward pass for the two-layer net. Store the loss #
    # in the loss variable and gradients in the grads dictionary. Compute data#
    # loss using softmax, and make sure that grads[k] holds the gradients for #
    # self.params[k]. Don't forget to add L2 regularization!                  #
    #                                                                         #
    # NOTE: To ensure that your implementation matches ours and you pass the  #
    # automated tests, make sure that your L2 regularization does not include #
    # a factor of 0.5.                                                        #
    ###########################################################################
    # Replace "pass" statement with your code
    loss, dout = softmax_loss(scores, y)
    loss += self.reg * ( (self.params['W1'] ** 2).sum() + (self.params['W2'] ** 2).sum() )

    dh, grads['W2'], grads['b2'] = Linear.backward(dout, cache_2)
    grads['W2'] += self.reg * self.params['W2'] * 2

    dx, grads['W1'], grads['b1'] = Linear_ReLU.backward(dh, cache_1)
    grads['W1'] += self.reg * self.params['W1'] * 2
    ###########################################################################
    #                            END OF YOUR CODE                             #
    ###########################################################################

    return loss, grads

Next, the forward and backward passes are implemented in TwoLayerNet's loss method.

TwoLayerNet is structured as Linear - ReLU - Linear - softmax.

So the forward pass is built with Linear_ReLU.forward followed by Linear.forward, and the backward pass uses Linear.backward and then Linear_ReLU.backward, computing the gradients for W2, b2 and then for W1, b1.

The loss itself is computed with the provided softmax_loss function, and L2 regularization must be added to both the loss and the weight gradients.
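
Note the regularization convention here: per the NOTE in the docstring there is no 0.5 factor, so the loss term is reg * sum(W**2) and its gradient contribution is 2 * reg * W. A small autograd check of that derivative (a sketch, not assignment code):

import torch

reg = 0.1
W = torch.randn(5, 4, requires_grad=True)

reg_loss = reg * (W ** 2).sum()    # L2 term without the 0.5 factor
reg_loss.backward()
print(torch.allclose(W.grad, 2 * reg * W.detach()))   # True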

 

def create_solver_instance(data_dict, dtype, device):
  model = TwoLayerNet(hidden_dim=200, dtype=dtype, device=device)
  ##############################################################################
  # TODO: Use a Solver instance to train a TwoLayerNet that achieves at least  #
  # 50% accuracy on the validation set.                                        #
  ##############################################################################
  solver = None
  # Replace "pass" statement with your code
  solver = Solver(model, data_dict, optim_config={'learning_rate':1,},
                  lr_decay = 0.95, num_epochs = 20, batch_size = 200, print_every=200, device = device)
  ##############################################################################
  #                             END OF YOUR CODE                               #
  ##############################################################################
  return solver

Next we need to write create_solver_instance, which trains the TwoLayerNet. There is nothing complicated here: just follow the example code in the cell directly above it in the Jupyter notebook.

 

class FullyConnectedNet(object):
  """
  A fully-connected neural network with an arbitrary number of hidden layers,
  ReLU nonlinearities, and a softmax loss function.
  For a network with L layers, the architecture will be:

  {linear - relu - [dropout]} x (L - 1) - linear - softmax

  where dropout is optional, and the {...} block is repeated L - 1 times.

  Similar to the TwoLayerNet above, learnable parameters are stored in the
  self.params dictionary and will be learned using the Solver class.
  """

  def __init__(self, hidden_dims, input_dim=3*32*32, num_classes=10,
               dropout=0.0, reg=0.0, weight_scale=1e-2, seed=None,
               dtype=torch.float, device='cpu'):
    """
    Initialize a new FullyConnectedNet.

    Inputs:
    - hidden_dims: A list of integers giving the size of each hidden layer.
    - input_dim: An integer giving the size of the input.
    - num_classes: An integer giving the number of classes to classify.
    - dropout: Scalar between 0 and 1 giving the drop probability for networks
      with dropout. If dropout=0 then the network should not use dropout.
    - reg: Scalar giving L2 regularization strength.
    - weight_scale: Scalar giving the standard deviation for random
      initialization of the weights.
    - seed: If not None, then pass this random seed to the dropout layers. This
      will make the dropout layers deterministic so we can gradient check the
      model.
    - dtype: A torch data type object; all computations will be performed using
      this datatype. float is faster but less accurate, so you should use
      double for numeric gradient checking.
    - device: device to use for computation. 'cpu' or 'cuda'
    """
    self.use_dropout = dropout != 0
    self.reg = reg
    self.num_layers = 1 + len(hidden_dims)
    self.dtype = dtype
    self.params = {}

    ############################################################################
    # TODO: Initialize the parameters of the network, storing all values in    #
    # the self.params dictionary. Store weights and biases for the first layer #
    # in W1 and b1; for the second layer use W2 and b2, etc. Weights should be #
    # initialized from a normal distribution centered at 0 with standard       #
    # deviation equal to weight_scale. Biases should be initialized to zero.   #
    ############################################################################
    # Replace "pass" statement with your code
    hidden_dims = [input_dim] + hidden_dims
    hidden_dims.append(num_classes)
    for i in range(1, self.num_layers + 1):
      self.params['W' + str(i)] = torch.normal(0.0, weight_scale, (hidden_dims[i - 1],hidden_dims[i]), device = device).type(dtype)
      self.params['b' + str(i)] = torch.zeros((hidden_dims[i],), device=device).type(dtype)
    ############################################################################
    #                             END OF YOUR CODE                             #
    ############################################################################

    # When using dropout we need to pass a dropout_param dictionary to each
    # dropout layer so that the layer knows the dropout probability and the mode
    # (train / test). You can pass the same dropout_param to each dropout layer.
    self.dropout_param = {}
    if self.use_dropout:
      self.dropout_param = {'mode': 'train', 'p': dropout}
      if seed is not None:
        self.dropout_param['seed'] = seed

 

From here on we implement FullyConnectedNet, which stacks an arbitrary number of Linear_ReLU layers followed by a final Linear layer and a softmax. As before, we start with initialization.

The initialization works the same way as in TwoLayerNet, just written with a for loop over the layers.
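
The only real bookkeeping is lining up the per-layer shapes once hidden_dims is bracketed by the input and output dimensions. A small sketch with made-up sizes:

hidden_dims = [100, 50]
input_dim, num_classes = 3 * 32 * 32, 10

dims = [input_dim] + hidden_dims + [num_classes]   # [3072, 100, 50, 10]
for i in range(1, len(dims)):
    print(f"W{i}: ({dims[i - 1]}, {dims[i]})   b{i}: ({dims[i]},)")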

 

def loss(self, X, y=None):
    """
    Compute loss and gradient for the fully-connected net.
    Input / output: Same as TwoLayerNet above.
    """
    X = X.to(self.dtype)
    mode = 'test' if y is None else 'train'

    # Set train/test mode for batchnorm params and dropout param since they
    # behave differently during training and testing.
    if self.use_dropout:
      self.dropout_param['mode'] = mode
    scores = None
    ############################################################################
    # TODO: Implement the forward pass for the fully-connected net, computing  #
    # the class scores for X and storing them in the scores variable.          #
    #                                                                          #
    # When using dropout, you'll need to pass self.dropout_param to each       #
    # dropout forward pass.                                                    #
    ############################################################################
    # Replace "pass" statement with your code
    cache = {}
    dropout_cache = {}
    h = X
    for i in range(1, self.num_layers):
      h, cache[str(i)] = Linear_ReLU.forward(h, self.params['W' + str(i)], self.params['b' + str(i)])
      if self.use_dropout:
        h, dropout_cache[str(i)] = Dropout.forward(h, self.dropout_param)
    i = self.num_layers  # index of the final Linear layer (same value the loop leaves in i)
    h, cache[str(i)] = Linear.forward(h, self.params['W' + str(i)], self.params['b' + str(i)])
    scores = h 
    ############################################################################
    #                             END OF YOUR CODE                             #
    ############################################################################

    # If test mode return early
    if mode == 'test':
      return scores

    loss, grads = 0.0, {}
    ############################################################################
    # TODO: Implement the backward pass for the fully-connected net. Store the #
    # loss in the loss variable and gradients in the grads dictionary. Compute #
    # data loss using softmax, and make sure that grads[k] holds the gradients #
    # for self.params[k]. Don't forget to add L2 regularization!               #
    # NOTE: To ensure that your implementation matches ours and you pass the   #
    # automated tests, make sure that your L2 regularization includes a factor #
    # of 0.5 to simplify the expression for the gradient.                      #
    ############################################################################
    # Replace "pass" statement with your code
    loss, up_grad = softmax_loss(scores, y)
    for i in range(1, self.num_layers + 1):
      loss += self.reg * (self.params['W' + str(i)] **2 ).sum()
    i = self.num_layers
    up_grad, grads['W' + str(i)], grads['b' + str(i)] = Linear.backward(up_grad, cache[str(i)])
    for i in range(self.num_layers - 1, 0, -1):
      if self.use_dropout:
        up_grad = Dropout.backward(up_grad, dropout_cache[str(i)])
      up_grad, grads['W' + str(i)], grads['b' + str(i)] = Linear_ReLU.backward(up_grad, cache[str(i)])

    for i in range(1, self.num_layers + 1):
      grads['W' + str(i)] += self.reg * self.params['W' + str(i)] * 2
    ############################################################################
    #                             END OF YOUR CODE                             #
    ############################################################################

    return loss, grads

In FullyConnectedNet, both the forward pass and the backward pass again live in the loss method. As in TwoLayerNet, each layer's forward method is called in the order the layers are stacked, and the backward methods are called in reverse order.

Since this is the loss method, the loss is once more computed with the softmax_loss function.

The code above also applies a Dropout layer; it is implemented later in this assignment, so for now it is enough to note where it is used.

 

def get_three_layer_network_params():
  ############################################################################
  # TODO: Change weight_scale and learning_rate so your model achieves 100%  #
  # training accuracy within 20 epochs.                                      #
  ############################################################################
  weight_scale = 1e-1
  learning_rate = 1e-1
  ############################################################################
  #                             END OF YOUR CODE                             #
  ############################################################################
  return weight_scale, learning_rate


def get_five_layer_network_params():
  ############################################################################
  # TODO: Change weight_scale and learning_rate so your model achieves 100%  #
  # training accuracy within 20 epochs.                                      #
  ############################################################################
  weight_scale = 1e-1
  learning_rate = 2e-1
  ############################################################################
  #                             END OF YOUR CODE                             #
  ############################################################################
  return weight_scale, learning_rate

Then, for the three- and five-layer FullyConnectedNets, we just pick a weight_scale and learning_rate that reach 100% training accuracy within 20 epochs.

There is nothing to study carefully here; roughly chosen values like the ones above do the job easily.

 

def sgd_momentum(w, dw, config=None):
  """
  Performs stochastic gradient descent with momentum.
  config format:
  - learning_rate: Scalar learning rate.
  - momentum: Scalar between 0 and 1 giving the momentum value.
    Setting momentum = 0 reduces to sgd.
  - velocity: A numpy array of the same shape as w and dw used to store a
    moving average of the gradients.
  """
  if config is None: config = {}
  config.setdefault('learning_rate', 1e-2)
  config.setdefault('momentum', 0.9)
  v = config.get('velocity', torch.zeros_like(w))

  next_w = None
  #############################################################################
  # TODO: Implement the momentum update formula. Store the updated value in   #
  # the next_w variable. You should also use and update the velocity v.       #
  #############################################################################
  # Replace "pass" statement with your code
  v = config['momentum'] * v - config['learning_rate'] * dw
  next_w = w + v 
  #############################################################################
  #                              END OF YOUR CODE                             #
  #############################################################################
  config['velocity'] = v

  return next_w, config
def rmsprop(w, dw, config=None):
  """
  Uses the RMSProp update rule, which uses a moving average of squared
  gradient values to set adaptive per-parameter learning rates.
  config format:
  - learning_rate: Scalar learning rate.
  - decay_rate: Scalar between 0 and 1 giving the decay rate for the squared
    gradient cache.
  - epsilon: Small scalar used for smoothing to avoid dividing by zero.
  - cache: Moving average of second moments of gradients.
  """
  if config is None: config = {}
  config.setdefault('learning_rate', 1e-2)
  config.setdefault('decay_rate', 0.99)
  config.setdefault('epsilon', 1e-8)
  config.setdefault('cache', torch.zeros_like(w))

  next_w = None
  ###########################################################################
  # TODO: Implement the RMSprop update formula, storing the next value of w #
  # in the next_w variable. Don't forget to update cache value stored in    #
  # config['cache'].                                                        #
  ###########################################################################
  # Replace "pass" statement with your code
  # Update the running average of squared gradients and store it back in the
  # config so it accumulates across iterations.
  config['cache'] = config['decay_rate'] * config['cache'] + (1 - config['decay_rate']) * (dw ** 2)
  next_w = w - (config['learning_rate'] * dw) / (config['cache'].sqrt() + config['epsilon'])
  ###########################################################################
  #                             END OF YOUR CODE                            #
  ###########################################################################

  return next_w, config
def adam(w, dw, config=None):
  """
  Uses the Adam update rule, which incorporates moving averages of both the
  gradient and its square and a bias correction term.
  config format:
  - learning_rate: Scalar learning rate.
  - beta1: Decay rate for moving average of first moment of gradient.
  - beta2: Decay rate for moving average of second moment of gradient.
  - epsilon: Small scalar used for smoothing to avoid dividing by zero.
  - m: Moving average of gradient.
  - v: Moving average of squared gradient.
  - t: Iteration number.
  """
  if config is None: config = {}
  config.setdefault('learning_rate', 1e-3)
  config.setdefault('beta1', 0.9)
  config.setdefault('beta2', 0.999)
  config.setdefault('epsilon', 1e-8)
  config.setdefault('m', torch.zeros_like(w))
  config.setdefault('v', torch.zeros_like(w))
  config.setdefault('t', 0)

  next_w = None
  #############################################################################
  # TODO: Implement the Adam update formula, storing the next value of w in   #
  # the next_w variable. Don't forget to update the m, v, and t variables     #
  # stored in config.                                                         #
  #                                                                           #
  # NOTE: In order to match the reference output, please modify t _before_    #
  # using it in any calculations.                                             #
  #############################################################################
  # Replace "pass" statement with your code
  config['t'] += 1
  config['m'] = config['beta1'] * config['m'] + (1 - config['beta1']) * dw
  config['v'] = config['beta2'] * config['v'] + (1 - config['beta2']) * (dw ** 2)
  m_unbias = config['m'] / (1 - config['beta1'] ** config['t'])
  v_unbias = config['v'] / (1 - config['beta2'] ** config['t'])
  next_w = w - config['learning_rate'] * m_unbias / (v_unbias.sqrt() + config['epsilon'])
  #############################################################################
  #                              END OF YOUR CODE                             #
  #############################################################################

  return next_w, config

Next we implement the update rules, in order: SGD + momentum, RMSProp, and Adam.

These follow the lecture slides directly, so none of them should be difficult.
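
One detail worth checking numerically is Adam's bias correction: with m and v initialized to zero, the corrected first moment at t = 1 is exactly the raw gradient, so the first steps are not artificially shrunk. A toy sketch (scalar values, not the assignment's adam function):

import torch

beta1, beta2 = 0.9, 0.999
dw = torch.tensor(2.0)
m = torch.zeros(())
v = torch.zeros(())

t = 1
m = beta1 * m + (1 - beta1) * dw          # 0.2
v = beta2 * v + (1 - beta2) * dw ** 2     # 0.004
m_unbias = m / (1 - beta1 ** t)           # corrected back to 2.0
v_unbias = v / (1 - beta2 ** t)           # corrected back to 4.0
print(m_unbias.item(), v_unbias.item())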

 

class Dropout(object):

  @staticmethod
  def forward(x, dropout_param):
    """
    Performs the forward pass for (inverted) dropout.
    Inputs:
    - x: Input data: tensor of any shape
    - dropout_param: A dictionary with the following keys:
      - p: Dropout parameter. We *drop* each neuron output with probability p.
      - mode: 'test' or 'train'. If the mode is train, then perform dropout;
      if the mode is test, then just return the input.
      - seed: Seed for the random number generator. Passing seed makes this
      function deterministic, which is needed for gradient checking but not
      in real networks.
    Outputs:
    - out: Tensor of the same shape as x.
    - cache: tuple (dropout_param, mask). In training mode, mask is the dropout
      mask that was used to multiply the input; in test mode, mask is None.
    NOTE: Please implement **inverted** dropout, not the vanilla version of dropout.
    See http://cs231n.github.io/neural-networks-2/#reg for more details.
    NOTE 2: Keep in mind that p is the probability of **dropping** a neuron
    output; this might be contrary to some sources, where it is referred to
    as the probability of keeping a neuron output.
    """
    p, mode = dropout_param['p'], dropout_param['mode']
    if 'seed' in dropout_param:
      torch.manual_seed(dropout_param['seed'])

    mask = None
    out = None

    if mode == 'train':
      ###########################################################################
      # TODO: Implement training phase forward pass for inverted dropout.       #
      # Store the dropout mask in the mask variable.                            #
      ###########################################################################
      # Replace "pass" statement with your code
      # Inverted dropout: keep each unit with probability (1 - p) and scale the
      # survivors by 1 / (1 - p), so no rescaling is needed at test time.
      mask = (torch.rand_like(x) > p) / (1 - p)
      out = x * mask
      ###########################################################################
      #                             END OF YOUR CODE                            #
      ###########################################################################
    elif mode == 'test':
      ###########################################################################
      # TODO: Implement the test phase forward pass for inverted dropout.       #
      ###########################################################################
      # Replace "pass" statement with your code
      out = x
      ###########################################################################
      #                             END OF YOUR CODE                            #
      ###########################################################################

    cache = (dropout_param, mask)

    return out, cache

  @staticmethod
  def backward(dout, cache):
    """
    Perform the backward pass for (inverted) dropout.
    Inputs:
    - dout: Upstream derivatives, of any shape
    - cache: (dropout_param, mask) from Dropout.forward.
    """
    dropout_param, mask = cache
    mode = dropout_param['mode']

    dx = None
    if mode == 'train':
      ###########################################################################
      # TODO: Implement training phase backward pass for inverted dropout       #
      ###########################################################################
      # Replace "pass" statement with your code
      dx = dout * mask
      ###########################################################################
      #                            END OF YOUR CODE                             #
      ###########################################################################
    elif mode == 'test':
      dx = dout
    return dx

Finally, the dropout layer. As before, we implement forward and backward methods, and the forward method distinguishes train mode from test mode.

In train mode each unit (each element of x in the code) is dropped with probability p, and because this is inverted dropout the surviving units are scaled by 1 / (1 - p); in test mode x is returned unchanged.

In the backward pass, just as with ReLU, units that survived pass the upstream gradient through (scaled by the same mask), and units that were dropped get a downstream gradient of 0.
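
The point of the 1 / (1 - p) scaling is that the expected activation in train mode matches test mode, so nothing has to change at inference. A quick sketch (made-up p and input):

import torch

torch.manual_seed(0)
p = 0.4                                      # drop probability
x = torch.ones(1_000_000)

mask = (torch.rand_like(x) > p) / (1 - p)    # inverted dropout mask
out_train = x * mask
out_test = x                                 # test mode: identity

print(out_train.mean().item())   # close to 1.0
print(out_test.mean().item())    # exactly 1.0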

 

 
