
Proper way to do gradient clipping?


@ntubertchen
Hi,
Use torch.nn.utils.clip_grad_norm to keep the gradients within a specific range (clip). In RNNs the gradients tend to grow very large (this is called the 'exploding gradient problem'), and clipping helps prevent that from happening. It is worth looking at the implementation, because it teaches us that:

  1. “The norm is computed over all gradients together, as if they were concatenated into a single vector.”
  2. You can control the norm type (lp-norm, with p defaulting to 2; or the L-inf norm).
  3. All of the gradient coefficients are multiplied by the same clip_coef.
  4. clip_grad_norm is invoked after all of the gradients have been computed, i.e. between loss.backward() and optimizer.step(). During loss.backward() the gradients that are propagated backwards are not clipped; only once the backward pass completes is clip_grad_norm() applied. optimizer.step() will then use the clipped gradients.
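Point 4 above can be sketched as a minimal training step. This is only an illustration, not code from the original thread: the toy model, data, and max_norm=1.0 are placeholder choices.

```python
import torch
import torch.nn as nn

# A toy model and batch, just to make the step runnable.
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 4), torch.randn(8, 1)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()  # gradients are computed here, not yet clipped

# Clip the global L2 norm of all gradients to at most 1.0.
# (In current PyTorch the in-place variant is named clip_grad_norm_.)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

optimizer.step()  # the step uses the clipped gradients
```

Note that clip_grad_norm_ also returns the total gradient norm as it was before clipping, which is handy for logging.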

Regarding the code you ask about:

for p in model.parameters():
    p.data.add_(-lr, p.grad.data)

This iterates across all of the model.parameters() and performs an in-place multiply-add on each of the parameter tensors.
p.data.add_ is functionally equivalent to:

p.data = p.data + (-lr * p.grad.data)

In other words, this performs a similar function to optimizer.step(), using the gradients to update the model parameters, but without the extra sophistication of a torch.optim.Optimizer. If you use the above code, you should not also use an optimizer (and vice versa).
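As a quick check of that equivalence (a minimal sketch; the tensor size and lr are arbitrary), the manual update and a vanilla, momentum-free torch.optim.SGD step produce the same parameters:

```python
import torch

lr = 0.1

# Two identical parameters with identical gradients.
p1 = torch.nn.Parameter(torch.randn(3))
p2 = torch.nn.Parameter(p1.detach().clone())
grad = torch.randn(3)
p1.grad = grad.clone()
p2.grad = grad.clone()

# Manual update, as in the snippet above
# (current PyTorch spells it add_(tensor, alpha=scalar)).
p1.data.add_(p1.grad.data, alpha=-lr)  # p1 = p1 - lr * grad

# The same update via a plain SGD optimizer.
opt = torch.optim.SGD([p2], lr=lr)
opt.step()

print(torch.allclose(p1, p2))  # True
```

The equivalence only holds for plain SGD; once you add momentum, weight decay, or switch to an optimizer like Adam, the optimizer's step is no longer a bare multiply-add.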

Cheers,
Neta
