The Transformer architecture combines several components that work together to enable effective sequence modeling: multi-head attention, layer normalization, residual connections, and feed-forward networks. In this blog post, I focus on the component at the heart of the architecture: multi-head attention.
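
To give a sense of how these pieces fit together before diving into attention itself, here is a minimal sketch of a single Transformer encoder block in PyTorch. The class name, dimensions, and hyperparameters (d_model, num_heads, d_ff) are illustrative assumptions, not a specific published implementation.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Illustrative Transformer encoder block: multi-head attention,
    residual connections, layer normalization, and a feed-forward network."""

    def __init__(self, d_model: int = 512, num_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Multi-head self-attention, then a residual connection and layer norm
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Position-wise feed-forward network, again with residual + layer norm
        ff_out = self.ff(x)
        x = self.norm2(x + ff_out)
        return x

# Example: a batch of 2 sequences, 10 tokens each, with 512-dim embeddings
x = torch.randn(2, 10, 512)
print(TransformerBlock()(x).shape)  # torch.Size([2, 10, 512])
```

With this picture in mind, the rest of the post zooms in on the attention sub-layer.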

Multi-head Attention: The Superhero with Many Perspectives