
How a Transformer Computes the Cumulative Sum Sign

Author: Dan Wilhelm, dan@danwilhelm.com

We investigate how a one-layer attention + feed-forward transformer computes the sign of the cumulative sum. I've written the investigation conversationally, providing numerous examples and insights.
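
To make the task concrete, here is a minimal sketch (assuming the puzzle's setup of a sequence of small integer tokens with a per-position label): at each position, the target is the sign of the running sum of all tokens seen so far. The example values below are illustrative only.

```python
import numpy as np

# Toy illustration of the task itself (not of the model).
# At each position, the label is the sign of the cumulative sum so far.
tokens = np.array([3, -5, 1, 1, 2, -4])
print(np.cumsum(tokens))           # [ 3 -2 -1  0  2 -2]
print(np.sign(np.cumsum(tokens)))  # [ 1 -1 -1  0  1 -1]
```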

Of particular interest, we:

  1. design a 38-weight attention-only circuit with smaller loss than the provided model;
  2. manually remove the MLP and rewire a trained circuit, retaining 100% accuracy;
  3. prove that an equally-attended attention block is equivalent to a single linear projection (of the prior-input mean!), as sketched numerically after this list; and
  4. provide an independent transformer implementation to make it easier to modify the internals.
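
The equivalence in point 3 follows from associativity of matrix products: if every position attends uniformly to the positions it can see, applying the attention pattern to the value vectors gives the same result as applying a single value/output projection to the expanding (running) mean of the inputs. The sketch below checks this numerically with toy weights; the 24-channel width matches the model discussed later, but the weights are random placeholders rather than the repo's parameters, and the averaging here includes the current position.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 24                       # sequence length, model width (toy choices)
x = rng.normal(size=(T, d))        # residual-stream inputs
W_V = rng.normal(size=(d, d))      # value projection (placeholder weights)
W_O = rng.normal(size=(d, d))      # output projection (placeholder weights)

# Uniform causal attention: position t attends equally to positions 0..t.
A = np.tril(np.ones((T, T)))
A /= A.sum(axis=1, keepdims=True)

attn_out = A @ (x @ W_V) @ W_O           # attention block with uniform weights
expanding_mean = A @ x                   # the same A applied to raw inputs = running mean
direct = expanding_mean @ (W_V @ W_O)    # one linear map applied to the expanding mean

print(np.allclose(attn_out, direct))     # True: equal attention == linear map of the mean
```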

This is my proposed solution to a monthly puzzle authored by Callum McDougall! You may find more information about the challenge and monthly problem series here:

Table of Contents

  1. Introduction
  2. All 24 embedding channels directly encode token sign and magnitude
  3. Attention softmax equally attends to each token
  4. Equally-divided attention computes the expanding mean
  5. Feed-forward network "cleans up" the signal
  6. What about the zero unembed pattern?
  7. Surgically removing the MLP, retaining 100% accuracy
  8. Designing a 38-weight attention-only cumsum circuit
  9. Appendix A. Rewriting two linear transforms as one
  10. Appendix B. Designing a 38-weight circuit with skip connections
