Transformer components

Self-Attention: Calculates a "relevance score" between tokens, allowing the model to understand how much focus one word should have on another (e.g., relating "he" to "Tom").

Query, Key, Value: The attention mechanism uses learned weight matrices (W_Q, W_K, W_V) to project each input vector into three spaces. Query (Q): what the token is looking for. Key (K): what the token contains. Value (V): the information the token provides once matched.

Feed-Forward Network: This consists of two linear transformations with a non-linear activation (typically ReLU) in between.
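The components above can be sketched in NumPy. This is a minimal single-head illustration, not a production implementation: the weight matrices are assumed to be already learned, and the dimension names (d_model, d_k) are conventional labels, not identifiers from the text.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    X: (seq_len, d_model) input token vectors.
    W_q, W_k, W_v: (d_model, d_k) learned projection matrices.
    """
    Q = X @ W_q  # Query: what each token is looking for
    K = X @ W_k  # Key: what each token contains
    V = X @ W_v  # Value: what each token provides once matched
    d_k = Q.shape[-1]
    # Pairwise "relevance scores" between tokens, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # relevance-weighted mix of values

def feed_forward(X, W1, b1, W2, b2):
    # Two linear transformations with a ReLU in between.
    return np.maximum(0.0, X @ W1 + b1) @ W2 + b2
```

For a sequence of 4 tokens with d_model = d_k = 8, `self_attention` returns a (4, 8) array in which each output row is a mixture of the value vectors, weighted by how relevant every other token is to that position.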