[WIP] Multi head attention, dot scaled, and attention dropout #325
base: master
Conversation
# Conflicts:
#	onmt/Factory.lua
#	onmt/modules/GlobalAttention.lua
Codecov Report
@@ Coverage Diff @@
## master #325 +/- ##
==========================================
+ Coverage 69.05% 69.13% +0.07%
==========================================
Files 75 75
Lines 6477 6503 +26
==========================================
+ Hits 4473 4496 +23
- Misses 2004 2007 +3
Continue to review full report at Codecov.
Jean, did you get interesting results with the multi-head functionality?
@vince62s - I cleaned up the implementation - it is simpler now, and I get slight but consistent improvements in PPL with 2 heads on a 1M dataset, with and without dropout_attention (0.1), and with the general model. On the other hand
First comments: just realized I can't change the multi-head value between epochs...
Thanks Vincent. Regarding the PPL explosion, I could only get it to converge with Adam on my side, or by setting a lower LR. The problem with SGD and making the LR smaller globally is that the whole model is penalized. Adam naturally adapts to local variations, but on the other hand, we know that we did not get similar results between Adam and SGD. Maybe by changing it after the first epoch?
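For what it's worth, a hypothetical training invocation along those lines (the learning rate value is purely illustrative; `-global_attention dot_scaled`, `-multi_head_attention` and `-dropout_attention` are the options proposed in this PR):

```
th train.lua -data data/demo-train.t7 -save_model demo-model \
  -global_attention dot_scaled -multi_head_attention 2 -dropout_attention 0.1 \
  -optim adam -learning_rate 0.0002
```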
In the Attention Is All You Need paper, several concepts are introduced that can fit in our current attention module:

- dot scaled model (option `-global_attention dot_scaled`)
- multi-head attention (`-multi_head_attention N`) - this idea has actually been introduced in A Structured Self-attentive Sentence Embedding
- attention dropout (`-dropout_attention`)

The first and last are easy and can be quickly tested; a sketch is given after this list. For the second one, it will also require some modification in the translation - but beyond the potential improvement, it will be interesting to visualize the multiple attentions.
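To make the first and third items concrete, here is a minimal nngraph sketch, assuming a Luong-style global attention in the spirit of `onmt/modules/GlobalAttention.lua` (names such as `buildScaledDotAttention`, `dim` and `pDrop` are illustrative placeholders, not the PR's actual code):

```lua
require('nn')
require('nngraph')

-- Hypothetical sketch (not the PR's actual diff): dot-scaled global attention
-- with attention dropout. `dim` is the hidden size, `pDrop` the
-- -dropout_attention probability.
local function buildScaledDotAttention(dim, pDrop)
  local ht = nn.Identity()()       -- current target state: batch x dim
  local context = nn.Identity()()  -- encoder states: batch x sourceL x dim

  -- Dot score between the target state and every source state.
  local score = nn.MM()({context, nn.Replicate(1, 3)(ht)}) -- batch x sourceL x 1
  score = nn.Sum(3)(score)                                 -- batch x sourceL

  -- "dot scaled": divide the scores by sqrt(dim) before the softmax.
  score = nn.MulConstant(1 / math.sqrt(dim))(score)

  local attn = nn.SoftMax()(score)

  -- Attention dropout: randomly zero attention weights at training time.
  if pDrop and pDrop > 0 then
    attn = nn.Dropout(pDrop)(attn)
  end

  -- Weighted sum of the source states.
  local attn3 = nn.Replicate(1, 2)(attn)                   -- batch x 1 x sourceL
  local contextVec = nn.Sum(2)(nn.MM()({attn3, context}))  -- batch x dim

  -- Luong-style output layer: tanh(W [c; h]).
  local combined = nn.JoinTable(2)({contextVec, ht})
  local output = nn.Tanh()(nn.Linear(2 * dim, dim, false)(combined))

  return nn.gModule({ht, context}, {output})
end
```

Multi-head attention would essentially run this scoring and weighting N times on dim/N-sized projections of the states and concatenate the resulting context vectors, which is also what makes it interesting to visualize the N attention distributions separately.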