This repository has been archived by the owner on Jun 10, 2021. It is now read-only.

[WIP] Multi head attention, dot scaled, and attention dropout #325

Open - jsenellart wants to merge 12 commits into master

Conversation

@jsenellart (Contributor) commented Jun 21, 2017

In the Attention Is All You Need paper, several concepts are introduced that can fit into our current attention module:

  • the so-called "Scaled Dot-Product Attention", improving the dot model (option -global_attention dot_scaled)
  • multi-head attention (option -multi_head_attention N) - this idea was actually introduced in A Structured Self-attentive Sentence Embedding
  • dropout on the attention weights (option -dropout_attention)

The first and last are easy and can be tested quickly.

The second one will also require some modifications on the translation side - but beyond a potential quality improvement, it will be interesting to visualize the multiple attention heads.
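To make the three ideas concrete, here is a minimal Lua/Torch sketch - not the actual GlobalAttention.lua changes in this PR. scaledDotAttention, multiHeadAttention, ht, hs, nHead and dropoutP are made-up names, and the multi-head part simply slices the dimension instead of using the learned per-head projections from the paper:

  require 'torch'
  require 'nn'

  -- ht: batch x dim decoder state; hs: batch x srcLen x dim encoder states
  -- dropoutP: attention dropout probability (0 or nil disables it)
  -- (modules are created on the fly here for clarity; a real module would reuse them)
  local function scaledDotAttention(ht, hs, dropoutP)
    local dim = ht:size(2)
    -- dot scores (batch x srcLen), scaled by 1/sqrt(dim) as in "Scaled Dot-Product Attention"
    local scores = torch.bmm(hs, ht:contiguous():view(ht:size(1), dim, 1)):squeeze(3)
    scores:div(math.sqrt(dim))
    local align = nn.SoftMax():forward(scores)        -- softmax over source positions
    if dropoutP and dropoutP > 0 then
      align = nn.Dropout(dropoutP):forward(align)     -- dropout directly on the attention weights
    end
    -- context (batch x dim): attention-weighted sum of encoder states
    local context = torch.bmm(align:view(align:size(1), 1, -1), hs):squeeze(2)
    return context, align
  end

  -- Simplified multi-head variant: attend independently in nHead slices of the dimension
  -- (the paper uses learned projections per head instead of plain slicing)
  local function multiHeadAttention(ht, hs, nHead, dropoutP)
    local dim = ht:size(2)
    assert(dim % nHead == 0, 'dim must be divisible by the number of heads')
    local headDim = dim / nHead
    local contexts, aligns = {}, {}
    for h = 1, nHead do
      local htH = ht:narrow(2, (h - 1) * headDim + 1, headDim)
      local hsH = hs:narrow(3, (h - 1) * headDim + 1, headDim):contiguous()
      local c, a = scaledDotAttention(htH, hsH, dropoutP)
      table.insert(contexts, c)
      table.insert(aligns, a)   -- one distribution per head, handy for visualization
    end
    return torch.cat(contexts, 2), aligns
  end

The 1/sqrt(dim) factor is what the paper calls scaling: it keeps the dot products from growing with the dimension and saturating the softmax.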

@codecov-io commented Jun 21, 2017

Codecov Report

Merging #325 into master will increase coverage by 0.07%.
The diff coverage is 92.59%.


@@            Coverage Diff             @@
##           master     #325      +/-   ##
==========================================
+ Coverage   69.05%   69.13%   +0.07%     
==========================================
  Files          75       75              
  Lines        6477     6503      +26     
==========================================
+ Hits         4473     4496      +23     
- Misses       2004     2007       +3
Impacted Files                       Coverage Δ
onmt/modules/GlobalAttention.lua     98.38% <100%> (+1.01%) ⬆️
onmt/modules/PositionEmbedding.lua   96.87% <100%> (ø) ⬆️
onmt/Factory.lua                     52.4% <33.33%> (-1.25%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f3a9343...e661fbf.

@vince62s (Member) commented

Jean, did you get interesting results with the multi-head functionality?

@jsenellart (Contributor, Author) commented

@vince62s - I cleaned up the implementation; it is simpler now, and I get slight but consistent improvements in PPL with 2 heads on a 1M dataset, with and without dropout_attention (0.1), using the general model. On the other hand, dot_scaled does not seem to improve over dot. Could you run some tests on your side too?

@vince62s (Member) commented Nov 14, 2017

First comments:
I did this on a smaller dataset (500k).
If I train with multi_head 2 or 4 from the very beginning, PPL explodes.
If I train with multi_head 1 for the first epoch and then 2, I get a lower PPL in the end, but not necessarily a better BLEU.
If I train with multi_head 1 for the first epoch and then 4, I get exactly the same PPL (to 2 decimals) as in the previous experiment (at least for epochs 2 and 3): that does not sound right.

I just realized I can't change the multi_head value between epochs, so I am retrying with a lower LR.

@jsenellart (Contributor, Author) commented

Thanks Vincent. Regarding the PPL explosion, on my side I could only get it to converge with Adam or by setting a lower LR. The problem with SGD and lowering the LR globally is that the whole model is penalized. Adam naturally adapts to local variations, but on the other hand, we know that we did not get similar results between Adam and SGD. Maybe try switching after the first epoch?
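To illustrate the SGD-vs-Adam point with a toy example (standalone optim package, made-up numbers, nothing to do with OpenNMT's Optim module): one sharp direction forces SGD onto a small global learning rate, which then crawls along the flat direction, while Adam's per-parameter step sizes handle both at once.

  require 'torch'
  require 'optim'

  local curvature = torch.Tensor({1000, 1})   -- one sharp and one flat coordinate

  -- loss f(x) = 0.5 * sum_i curvature[i] * x[i]^2, gradient = curvature .* x
  local function feval(x)
    local loss = 0.5 * torch.cmul(curvature, torch.cmul(x, x)):sum()
    local grad = torch.cmul(curvature, x)
    return loss, grad
  end

  local xSgd, xAdam = torch.Tensor({1, 1}), torch.Tensor({1, 1})
  local sgdState  = { learningRate = 0.001 }  -- must stay below ~0.002 or the sharp coordinate diverges
  local adamState = { learningRate = 0.05 }

  for i = 1, 200 do
    optim.sgd(feval, xSgd, sgdState)
    optim.adam(feval, xAdam, adamState)
  end
  -- after 200 steps the flat coordinate is still around 0.8 for SGD but much closer to 0 for Adam
  print(xSgd[2], xAdam[2])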
