This repository has been archived by the owner on Jun 10, 2021. It is now read-only.

[WIP] Multi head attention, dot scaled, and attention dropout #325

Open - jsenellart wants to merge 12 commits into master

Conversation

@jsenellart (Contributor) commented Jun 21, 2017

In the Attention Is All You Need paper, several concepts are introduced that can fit into our current attention module:

  • the so-called "Scaled Dot-Product Attention", improving the dot model (option -global_attention dot_scaled)
  • multi-head attention (option -multi_head_attention N) - this idea was actually introduced in A Structured Self-attentive Sentence Embedding
  • dropout on the attention weights (option -dropout_attention)

The first and last are easy and can be tested quickly.

The second one will also require some modifications on the translation side - but beyond a potential quality improvement, it will be interesting to visualize the multiple attention heads.
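To make the three ideas concrete, here is a minimal Lua/Torch sketch - not the actual GlobalAttention.lua changes in this PR. scaledDotAttention, multiHeadAttention, ht, hs, nHead and dropoutP are made-up names, and the multi-head part simply slices the dimension instead of using the learned per-head projections from the paper:

  require 'torch'
  require 'nn'

  -- ht: batch x dim decoder state; hs: batch x srcLen x dim encoder states
  -- dropoutP: attention dropout probability (0 or nil disables it)
  -- (modules are created on the fly here for clarity; a real module would reuse them)
  local function scaledDotAttention(ht, hs, dropoutP)
    local dim = ht:size(2)
    -- dot scores (batch x srcLen), scaled by 1/sqrt(dim) as in "Scaled Dot-Product Attention"
    local scores = torch.bmm(hs, ht:contiguous():view(ht:size(1), dim, 1)):squeeze(3)
    scores:div(math.sqrt(dim))
    local align = nn.SoftMax():forward(scores)        -- softmax over source positions
    if dropoutP and dropoutP > 0 then
      align = nn.Dropout(dropoutP):forward(align)     -- dropout directly on the attention weights
    end
    -- context (batch x dim): attention-weighted sum of encoder states
    local context = torch.bmm(align:view(align:size(1), 1, -1), hs):squeeze(2)
    return context, align
  end

  -- Simplified multi-head variant: attend independently in nHead slices of the dimension
  -- (the paper uses learned projections per head instead of plain slicing)
  local function multiHeadAttention(ht, hs, nHead, dropoutP)
    local dim = ht:size(2)
    assert(dim % nHead == 0, 'dim must be divisible by the number of heads')
    local headDim = dim / nHead
    local contexts, aligns = {}, {}
    for h = 1, nHead do
      local htH = ht:narrow(2, (h - 1) * headDim + 1, headDim)
      local hsH = hs:narrow(3, (h - 1) * headDim + 1, headDim):contiguous()
      local c, a = scaledDotAttention(htH, hsH, dropoutP)
      table.insert(contexts, c)
      table.insert(aligns, a)   -- one distribution per head, handy for visualization
    end
    return torch.cat(contexts, 2), aligns
  end

The 1/sqrt(dim) factor is what the paper calls scaling: it keeps the dot products from growing with the dimension and saturating the softmax.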

@codecov-io commented Jun 21, 2017

Codecov Report

Merging #325 into master will increase coverage by 0.07%.
The diff coverage is 92.59%.


@@            Coverage Diff             @@
##           master     #325      +/-   ##
==========================================
+ Coverage   69.05%   69.13%   +0.07%     
==========================================
  Files          75       75              
  Lines        6477     6503      +26     
==========================================
+ Hits         4473     4496      +23     
- Misses       2004     2007       +3
Impacted Files                       Coverage Δ
onmt/modules/GlobalAttention.lua     98.38% <100%> (+1.01%) ⬆️
onmt/modules/PositionEmbedding.lua   96.87% <100%> (ø) ⬆️
onmt/Factory.lua                     52.4% <33.33%> (-1.25%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f3a9343...e661fbf.

@vince62s (Member) commented

Jean, did you get interesting results with the multi-head functionality?

@jsenellart (Contributor, Author) commented

@vince62s - I cleaned up the implementation; it is simpler now, and I get slight but consistent improvements in PPL with 2 heads on a 1M dataset, with and without dropout_attention (0.1), using the general model. On the other hand, dot_scaled does not seem to improve over dot. Could you run some tests on your side too?

@vince62s (Member) commented Nov 14, 2017

First comments:
I did this on a smaller dataset (500k).
If I train with multi_head 2 or 4 from the very beginning, PPL explodes.
If I train with multi_head 1 for the first epoch and then 2, I get a lower PPL in the end, but not necessarily a better BLEU.
If I train with multi_head 1 for the first epoch and then 4, I get exactly the same PPL (to 2 decimals) as in the previous experiment (at least for epochs 2 and 3): that does not sound right.

I just realized I can't change the multi_head value between epochs, so I am retrying with a lower LR.

@jsenellart (Contributor, Author) commented

Thanks Vincent. Regarding the PPL explosion, on my side I could only get it to converge with Adam or by setting a lower LR. The problem with SGD and lowering the LR globally is that the whole model is penalized. Adam naturally adapts to local variations, but on the other hand, we know that we did not get similar results between Adam and SGD. Maybe try switching after the first epoch?
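To illustrate the SGD-vs-Adam point with a toy example (standalone optim package, made-up numbers, nothing to do with OpenNMT's Optim module): one sharp direction forces SGD onto a small global learning rate, which then crawls along the flat direction, while Adam's per-parameter step sizes handle both at once.

  require 'torch'
  require 'optim'

  local curvature = torch.Tensor({1000, 1})   -- one sharp and one flat coordinate

  -- loss f(x) = 0.5 * sum_i curvature[i] * x[i]^2, gradient = curvature .* x
  local function feval(x)
    local loss = 0.5 * torch.cmul(curvature, torch.cmul(x, x)):sum()
    local grad = torch.cmul(curvature, x)
    return loss, grad
  end

  local xSgd, xAdam = torch.Tensor({1, 1}), torch.Tensor({1, 1})
  local sgdState  = { learningRate = 0.001 }  -- must stay below ~0.002 or the sharp coordinate diverges
  local adamState = { learningRate = 0.05 }

  for i = 1, 200 do
    optim.sgd(feval, xSgd, sgdState)
    optim.adam(feval, xAdam, adamState)
  end
  -- after 200 steps the flat coordinate is still around 0.8 for SGD but much closer to 0 for Adam
  print(xSgd[2], xAdam[2])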
