Commit 041ad98 (parent 068510f): Update t.do.txt

1 file changed: doc/src/week10/Latexfiles/t.do.txt (+102 additions, -102 deletions)
!split
===== Interpretation of the Attention Matrix =====
The matrix
!bt
\[
A = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)
\]
!et
acts like a learned kernel or coupling matrix. Its entries satisfy
!bt
\[
A_{ij} \ge 0, \qquad \sum_j A_{ij}=1.
\]
!et
Thus
!bt
\[
y_i = \sum_j A_{ij} v_j
\]
!et
is a weighted average of the value vectors.
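As a minimal numerical sketch (assuming NumPy and small random $Q$, $K$, $V$ matrices standing in for learned projections), the row-stochastic attention matrix and the weighted average above can be checked directly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k = 4, 8                       # sequence length and key dimension
Q = rng.standard_normal((n, d_k))   # queries
K = rng.standard_normal((n, d_k))   # keys
V = rng.standard_normal((n, d_k))   # values

# scaled dot-product logits and a numerically stable row-wise softmax
logits = Q @ K.T / np.sqrt(d_k)
A = np.exp(logits - logits.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)

# entries are nonnegative and each row sums to one
assert np.all(A >= 0)
assert np.allclose(A.sum(axis=1), 1.0)

# each output y_i is a weighted average of the value vectors
Y = A @ V
```

Each row of `A` is a probability distribution over positions, so `Y = A @ V` mixes the value vectors exactly as in the formula above.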

!split
===== Interpretation =====
!bblock
* query $q_i$: ``what does position $i$ look for?''
* key $k_j$: ``what does position $j$ offer?''
* value $v_j$: ``what information does position $j$ contribute?''
!eblock

!split
===== Comparison with a Standard MLP =====
A standard MLP layer has the form
!bt
\[
y = \sigma(Wx + b),
\]
!et
with fixed weights $W$.
In contrast, attention uses
!bt
\[
y_i = \sum_j A_{ij}(X)\, v_j,
\]
!et
where the effective coupling $A_{ij}$ depends on the input $X$.
Thus: $\text{MLP: fixed couplings} \qquad\text{vs}\qquad\text{Transformer: adaptive couplings}$.
This is one reason transformers are so expressive.
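A toy illustration of this contrast (assuming NumPy, and using $Q=K=X$ with no learned projections to keep the sketch short): the MLP matrix $W$ is the same for every input, while the attention couplings $A(X)$ change when the input changes.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 4, 8
W = rng.standard_normal((d, d))   # fixed MLP couplings, independent of the input

def attention_weights(X, d_k):
    """Input-dependent couplings A(X), here with Q = K = X for simplicity."""
    logits = X @ X.T / np.sqrt(d_k)
    A = np.exp(logits - logits.max(axis=1, keepdims=True))
    return A / A.sum(axis=1, keepdims=True)

X1 = rng.standard_normal((n, d))
X2 = rng.standard_normal((n, d))

# W is identical for X1 and X2, but the attention couplings adapt to the input
assert not np.allclose(attention_weights(X1, d), attention_weights(X2, d))
```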

!split
===== Comparison with CNNs =====
In a convolutional neural network,
!bt
\[
y_i = \sum_{r \in \mathcal{N}(i)} w_r\, x_{i+r},
\]
!et
where $\mathcal{N}(i)$ is a small local neighborhood.
!bblock Thus CNNs assume:
* locality,
* translation invariance,
* fixed kernels/filters.
!eblock
!bblock Attention instead uses
!bt
\[
y_i = \sum_j A_{ij}(X)\, x_j,
\]
!et
which is
* global,
* adaptive,
* not restricted to local neighborhoods.
!eblock
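The CNN formula above can be sketched in a few lines (assuming NumPy and an illustrative fixed kernel over the neighborhood $\mathcal{N}(i)=\{-1,0,1\}$ with zero padding):

```python
import numpy as np

# y_i = sum_{r in {-1,0,1}} w_r x_{i+r}: a fixed, local,
# translation-invariant kernel applied around every position i
w = np.array([0.25, 0.5, 0.25])    # illustrative fixed kernel weights
x = np.arange(8, dtype=float)
y = np.convolve(x, w, mode="same") # symmetric kernel: convolution = correlation

# interior outputs are local weighted averages of their neighbors
assert np.isclose(y[3], 0.25 * x[2] + 0.5 * x[3] + 0.25 * x[4])
```

The same three weights are reused at every position, which is exactly the locality and translation invariance listed above; attention instead lets every position weight every other position with input-dependent coefficients.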

!split
===== Comparison with RNNs =====
In an RNN,
!bt
\[
h_t = f(h_{t-1}, x_t),
\]
!et
so information propagates sequentially.
!bblock Advantages:
* natural for time series,
* explicit recurrence.
!eblock
!bblock Limitations:
* long-range dependencies are hard,
* training can be unstable,
* computation is hard to parallelize.
!eblock
Transformers avoid recurrence and instead connect all positions simultaneously through attention.
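A minimal sketch of the recurrence (assuming NumPy, small random weight matrices, and $f=\tanh$ of a linear combination): the loop makes the sequential dependence, and hence the difficulty of parallelizing over $t$, explicit.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
W_h = 0.1 * rng.standard_normal((d, d))   # recurrent weights (illustrative scale)
W_x = 0.1 * rng.standard_normal((d, d))   # input weights

def rnn(xs):
    """h_t = f(h_{t-1}, x_t): each state needs the previous one,
    so the time steps cannot be computed in parallel."""
    h = np.zeros(d)
    for x_t in xs:                         # inherently sequential loop
        h = np.tanh(W_h @ h + W_x @ x_t)
    return h

h_final = rnn(rng.standard_normal((10, d)))
```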

!split
===== Why the Factor $1/\sqrt{d_k}$? =====
The attention logits are
!bt
\[
q_i \cdot k_j.
\]
!et
If $q_i$ and $k_j$ have $d_k$ components with comparable variance, then typically
!bt
\[
q_i \cdot k_j \sim O(\sqrt{d_k})
\]
!et
or $O(d_k)$ depending on scaling assumptions.
Without normalization, these logits can become large, causing the softmax to saturate. Therefore one rescales by
!bt
\[
\frac{1}{\sqrt{d_k}}
\]
!et
to stabilize optimization and keep gradients in a useful regime.
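This scaling is easy to verify numerically (assuming NumPy and i.i.d. standard-normal components, so that $\mathrm{Var}(q\cdot k)=d_k$): the spread of the raw logits grows like $\sqrt{d_k}$, while the rescaled logits stay $O(1)$.

```python
import numpy as np

rng = np.random.default_rng(3)
stds = {}
for d_k in (16, 256):
    q = rng.standard_normal((20000, d_k))
    k = rng.standard_normal((20000, d_k))
    dots = np.sum(q * k, axis=1)              # unscaled logits q . k
    # (std of raw logits, std after dividing by sqrt(d_k))
    stds[d_k] = (dots.std(), (dots / np.sqrt(d_k)).std())

# raw logits spread like sqrt(d_k) (about 4 and 16 here),
# while the rescaled logits have standard deviation close to 1
```

Keeping the logit spread of order one prevents the softmax from saturating as $d_k$ grows, which is the point of the factor $1/\sqrt{d_k}$.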

!split
===== Multi-Head Attention =====
In practice, transformers use multiple attention heads.
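A simplified sketch of multiple heads (assuming NumPy, random per-head projection matrices standing in for learned ones, and omitting the usual output projection): each head runs scaled dot-product attention in a smaller subspace of dimension $d/h$, and the head outputs are concatenated.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, n_heads, rng):
    """Run scaled dot-product attention independently per head, then concatenate.
    Random projections stand in for learned parameters; no output projection."""
    n, d = X.shape
    d_h = d // n_heads                        # per-head dimension
    heads = []
    for _ in range(n_heads):
        Wq, Wk, Wv = (rng.standard_normal((d, d_h)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(d_h))   # (n, n) couplings for this head
        heads.append(A @ V)                   # (n, d_h) head output
    return np.concatenate(heads, axis=1)      # (n, d) after concatenation

rng = np.random.default_rng(4)
Y = multi_head_attention(rng.standard_normal((5, 8)), n_heads=2, rng=rng)
```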