!split
===== Interpretation of the Attention Matrix =====
The matrix
!bt
\[
A = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)
\]
!et
acts like a learned kernel or coupling matrix. Its entries satisfy
!bt
\[
A_{ij} \ge 0, \qquad \sum_j A_{ij}=1.
\]
!et
Thus
!bt
\[
y_i = \sum_j A_{ij} v_j
\]
!et
is a weighted average of the value vectors.
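These properties are easy to check numerically. A minimal NumPy sketch with toy dimensions and random data (all sizes here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sequence: n = 4 positions, queries/keys/values of dimension d_k = 8
n, d_k = 4, 8
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_k))

# Scaled dot-product logits and a numerically stable row-wise softmax
logits = Q @ K.T / np.sqrt(d_k)
A = np.exp(logits - logits.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)

# Each row of A is a probability distribution over positions j
assert np.all(A >= 0)
assert np.allclose(A.sum(axis=1), 1.0)

# Each output y_i is a convex combination of the value vectors
Y = A @ V
```

Since every row of $A$ sums to one, each row of `Y` lies in the convex hull of the rows of `V`.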

!split
===== Interpretation =====
!bblock
* query $q_i$: ``what does position $i$ look for?''
* key $k_j$: ``what does position $j$ offer?''
* value $v_j$: ``what information does position $j$ contribute?''
!eblock

!split
===== Comparison with a Standard MLP =====
A standard MLP layer has the form
!bt
\[
y = \sigma(Wx + b),
\]
!et
with fixed weights $W$.
In contrast, attention uses
!bt
\[
y_i = \sum_j A_{ij}(X)\, v_j,
\]
!et
where the effective coupling $A_{ij}$ depends on the input $X$.
Thus:
!bt
\[
\text{MLP: fixed couplings}
\qquad\text{vs}\qquad
\text{Transformer: adaptive couplings}.
\]
!et
This is one reason transformers are so expressive.

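This contrast can be made concrete: with the learned parameters held fixed, the attention couplings $A(X)$ still change from input to input, while an MLP's weight matrix does not. A small sketch (the projection matrices `Wq`, `Wk` and all sizes are hypothetical, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 3, 4

def attention_couplings(X, Wq, Wk):
    """A(X) = softmax(Q K^T / sqrt(d)) with Q = X Wq, K = X Wk."""
    Q, K = X @ Wq, X @ Wk
    logits = Q @ K.T / np.sqrt(d)
    E = np.exp(logits - logits.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

# Fixed (learned) parameters, shared across all inputs
Wq = rng.standard_normal((d, d))
Wk = rng.standard_normal((d, d))

# Two different inputs give two different coupling matrices
X1 = rng.standard_normal((n, d))
X2 = rng.standard_normal((n, d))
A1 = attention_couplings(X1, Wq, Wk)
A2 = attention_couplings(X2, Wq, Wk)

# Parameters are fixed, yet the effective couplings adapt to the input
assert not np.allclose(A1, A2)
```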
!split
===== Comparison with CNNs =====

In a convolutional neural network,
!bt
\[
y_i = \sum_{r \in \mathcal{N}(i)} w_r\, x_{i+r},
\]
!et
where $\mathcal{N}(i)$ is a small local neighborhood.
!bblock Thus CNNs assume:
* locality,
* translation invariance,
* fixed kernels/filters.
!eblock
!bblock Attention instead uses
!bt
\[
y_i = \sum_j A_{ij}(X)\, x_j,
\]
!et
which is
* global,
* adaptive,
* not restricted to local neighborhoods.
!eblock
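To make the locality and weight sharing explicit, the CNN sum above can be written out for a 1D signal with a three-tap kernel (the kernel values are chosen arbitrarily for this sketch):

```python
import numpy as np

# 1D convolution as a local, fixed-weight sum:
# y_i = sum_{r in {-1,0,1}} w_r x_{i+r}, with zero padding at the boundary
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = {-1: 0.25, 0: 0.5, 1: 0.25}   # fixed kernel, shared by every position i

y = np.array([sum(w[r] * x[i + r] for r in w if 0 <= i + r < len(x))
              for i in range(len(x))])

# Same result as NumPy's built-in convolution (symmetric kernel, 'same' mode)
assert np.allclose(y, np.convolve(x, [0.25, 0.5, 0.25], mode='same'))
```

Every output uses the same three weights and looks only one step to each side, whereas an attention row $A_{i\cdot}(X)$ may weight all positions and differs for every input.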


!split
===== Comparison with RNNs =====
In an RNN,
!bt
\[
h_t = f(h_{t-1}, x_t),
\]
!et
so information propagates sequentially.
!bblock Advantages:
* natural for time series,
* explicit recurrence.
!eblock
!bblock Limitations:
* long-range dependencies are hard,
* training can be unstable,
* computation is hard to parallelize.
!eblock
Transformers avoid recurrence and instead connect all positions simultaneously through attention.
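The sequential nature of the recurrence shows up directly in code: each hidden state must wait for the previous one, so the time loop cannot be parallelized. A minimal tanh cell with made-up sizes (not a trained model):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
Wh = rng.standard_normal((d, d)) * 0.1   # recurrent weights (toy values)
Wx = rng.standard_normal((d, d)) * 0.1   # input weights (toy values)

xs = rng.standard_normal((6, d))         # a sequence of 6 input vectors
h = np.zeros(d)
for x_t in xs:
    # h_t = f(h_{t-1}, x_t): information from x_0 reaches the final h
    # only through all intermediate steps
    h = np.tanh(Wh @ h + Wx @ x_t)
```

Attention replaces this chain with a single matrix $A(X)$ that couples all positions at once.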


!split
===== Why the Factor $1/\sqrt{d_k}$? =====
The attention logits are
!bt
\[
q_i \cdot k_j.
\]
!et
If $q_i$ and $k_j$ have $d_k$ components with comparable variance, then typically
!bt
\[
q_i \cdot k_j \sim O(\sqrt{d_k})
\]
!et
or $O(d_k)$, depending on scaling assumptions.
Without normalization, these logits can become large, causing the softmax to saturate. Therefore one rescales by
!bt
\[
\frac{1}{\sqrt{d_k}}
\]
!et
to stabilize optimization and keep gradients in a useful regime.
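A quick Monte Carlo check of this scaling argument, assuming i.i.d. unit-variance components: the standard deviation of $q_i \cdot k_j$ grows like $\sqrt{d_k}$, and dividing by $\sqrt{d_k}$ brings it back to order one:

```python
import numpy as np

rng = np.random.default_rng(2)

for d_k in (16, 256):
    # 100000 independent (q, k) pairs with i.i.d. N(0, 1) components
    q = rng.standard_normal((100000, d_k))
    k = rng.standard_normal((100000, d_k))
    dots = np.einsum('ij,ij->i', q, k)   # row-wise dot products q . k

    # Empirical std of q . k is close to sqrt(d_k) ...
    assert abs(dots.std() / np.sqrt(d_k) - 1.0) < 0.05
    # ... so after rescaling by 1/sqrt(d_k) the logits are O(1)
    scaled = dots / np.sqrt(d_k)
    assert abs(scaled.std() - 1.0) < 0.05
```

With $O(1)$ logits the softmax stays away from its saturated regime, so gradients through it remain informative.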


!split
===== Multi-Head Attention =====
In practice, transformers use multiple attention heads.
