Proper way to do gradient clipping?

Maybe I’m doing something wrong here, but using gradient clipping like

# clip the total gradient norm of the parameters to `clip`
nn.utils.clip_grad_norm(model.parameters(), clip)
# then apply a plain SGD update by hand: p <- p - lr * grad
for p in model.parameters():
    p.data.add_(-lr, p.grad.data)

makes my network train much slower than with optimizer.step().
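For reference, here is a self-contained sketch of what one clipped update step looks like in isolation; the tiny linear model, the random batch, and the lr / clip values below are just placeholders for my real setup:

import torch
import torch.nn as nn

# placeholder model and data standing in for the real setup
torch.manual_seed(0)
model = nn.Linear(10, 10)
criterion = nn.CrossEntropyLoss()
lr, clip = 0.001, 5

inputs = torch.randn(32, 10)
targets = torch.randint(0, 10, (32,))

# one training step
model.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()

# clip the total gradient norm, then do the manual SGD update
nn.utils.clip_grad_norm_(model.parameters(), clip)   # clip_grad_norm in older releases
for p in model.parameters():
    p.data.add_(p.grad.data, alpha=-lr)               # p <- p - lr * grad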

Here’s what it looks like with gradient clipping, with clip=5:

Epoch: 1/10... Step: 10... Loss: 4.4288
Epoch: 1/10... Step: 20... Loss: 4.4274
Epoch: 1/10... Step: 30... Loss: 4.4259
Epoch: 1/10... Step: 40... Loss: 4.4250
Epoch: 1/10... Step: 50... Loss: 4.4237
Epoch: 1/10... Step: 60... Loss: 4.4223
Epoch: 1/10... Step: 70... Loss: 4.4209
Epoch: 1/10... Step: 80... Loss: 4.4193
Epoch: 1/10... Step: 90... Loss: 4.4188
Epoch: 1/10... Step: 100... Loss: 4.4174

And without gradient clipping, everything else equal:

Epoch: 1/10... Step: 10... Loss: 3.2837
Epoch: 1/10... Step: 20... Loss: 3.1901
Epoch: 1/10... Step: 30... Loss: 3.1512
Epoch: 1/10... Step: 40... Loss: 3.1296
Epoch: 1/10... Step: 50... Loss: 3.1170
Epoch: 1/10... Step: 60... Loss: 3.0758
Epoch: 1/10... Step: 70... Loss: 2.9787
Epoch: 1/10... Step: 80... Loss: 2.9104
Epoch: 1/10... Step: 90... Loss: 2.8271
Epoch: 1/10... Step: 100... Loss: 2.6813

There is probably something I don’t understand, but I’m just switching out those two bits of code.
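The version I am comparing against is the same step with the update delegated to the optimizer; here is the matching sketch (SGD below is only a stand-in for whatever optimizer object the script actually builds):

import torch
import torch.nn as nn

# same placeholder model and data as in the sketch above
torch.manual_seed(0)
model = nn.Linear(10, 10)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)  # stand-in optimizer

inputs = torch.randn(32, 10)
targets = torch.randint(0, 10, (32,))

# one training step without clipping
optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()
optimizer.step()  # built-in update instead of the manual parameter loop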
