Commit 87990d8

Update t.do.txt
1 parent 041ad98 commit 87990d8

File tree

1 file changed: +71 −78 lines changed


doc/src/week10/Latexfiles/t.do.txt

Lines changed: 71 additions & 78 deletions
@@ -1,5 +1,5 @@
 !split
-===== Deep Learning in Context =====
+===== Deep Learning =====
 Classical deep learning architectures include:
 
 * multilayer perceptrons (MLPs),
@@ -17,7 +17,7 @@ Each architecture encodes a specific inductive bias:
 Transformers are also deep neural networks, but with a different structural principle: _adaptive interaction through attention._
 
 !split
-===== What Is a Transformer? =====
+===== What is a transformer? =====
 A transformer is a neural-network architecture built around the idea of _self-attention_.
 
 Core principle:
@@ -34,7 +34,7 @@ This makes them especially powerful for:
 * scientific fields and operator learning.
 
 !split
-===== Input as a Sequence of Vectors =====
+===== Input as a sequence of vectors =====
 Suppose the input is a sequence
 
 !bt
@@ -56,7 +56,7 @@ A transformer maps this sequence to another sequence of vectors:
 
 
 !split
-===== Self-Attention: The Basic Formula =====
+===== Self-attention: The basic formula =====
 
 For each position $i$, the transformer forms
 !bt
@@ -67,7 +67,8 @@ For each position $i$, the transformer forms
 where:
 * $v_j = W_V x_j$ are the _values_,
 * $\alpha_{ij}$ are attention weights.
-
+
+
 !split
 ===== How to compute weights =====
 The weights are computed from
@@ -89,7 +90,7 @@ This is the central mechanism of a transformer.
 
 
 !split
-===== Matrix Form of Attention =====
+===== Matrix form of attention =====
 
 
 Collect the input vectors into a matrix
@@ -120,7 +121,7 @@ Here:
 
 
 !split
-===== Interpretation of the Attention Matrix =====
+===== Interpretation of the attention matrix =====
 The matrix
 !bt
 \[
@@ -246,91 +247,83 @@ Without normalization, these logits can become large, causing the softmax to sat
 to stabilize optimization and keep gradients in a useful regime.
 
 
-!split ===== Multi-Head Attention}
-In practice, transformers use multiple attention heads.
+!split
+===== Multi-Head Attention =====
+
+In practice, transformers use multiple attention heads.
 
 For head \(h\):
-\[
+!bt
+\[
 Q^{(h)} = XW_Q^{(h)},\quad
 K^{(h)} = XW_K^{(h)},\quad
 V^{(h)} = XW_V^{(h)}.
-\]
-
-Each head computes
-\[
-\mathrm{head}_h
-=
-\mathrm{Attention}(Q^{(h)},K^{(h)},V^{(h)}).
-\]
-
-Then all heads are concatenated and linearly mixed:
-\[
-\mathrm{MultiHead}(X)
-=
-\mathrm{Concat}(\mathrm{head}_1,\dots,\mathrm{head}_H)W_O.
-\]
-
-Intuition:
-
-* different heads learn different interaction patterns,
-* one head may capture local structure, another long-range dependence.
+\]
+!et
+Each head computes
+!bt
+\[
+\mathrm{head}_h=\mathrm{Attention}(Q^{(h)},K^{(h)},V^{(h)}).
+\]
+!et
+Then all heads are concatenated and linearly mixed:
+!bt
+\[
+\mathrm{MultiHead}(X)=\mathrm{Concat}(\mathrm{head}_1,\dots,\mathrm{head}_H)W_O.
+\]
+!et
+!bblock Intuition:
+* different heads learn different interaction patterns,
+* one head may capture local structure, another long-range dependence.
+!eblock
 
 
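As an aside, the per-head projections and the final concatenation described in this hunk can be sketched in NumPy. This is a minimal illustration only: the head count, dimensions, and random weight matrices are made-up examples, not values from the notes.

```python
import numpy as np

def softmax(Z, axis=-1):
    # numerically stable softmax along the given axis
    Z = Z - Z.max(axis=axis, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))
    return A @ V

def multi_head(X, WQ, WK, WV, WO):
    # one projection triple (WQ[h], WK[h], WV[h]) per head h;
    # concatenate all heads, then mix with W_O
    heads = [attention(X @ WQ[h], X @ WK[h], X @ WV[h])
             for h in range(len(WQ))]
    return np.concatenate(heads, axis=-1) @ WO

rng = np.random.default_rng(0)
n, d, H, d_h = 5, 8, 2, 4          # sequence length, model dim, heads, head dim
X = rng.standard_normal((n, d))
WQ, WK, WV = (rng.standard_normal((H, d, d_h)) for _ in range(3))
WO = rng.standard_normal((H * d_h, d))
Y = multi_head(X, WQ, WK, WV, WO)
print(Y.shape)  # (5, 8): one output vector per input position
```

Each head here sees the full sequence but through its own learned projections, which is how different heads can specialize in different interaction patterns.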
-%------------------------------------------------
-!split ===== The Transformer Block}
-A standard transformer block contains:
-
-\begin{enumerate}
-* multi-head attention,
-* residual connection,
-* layer normalization,
-* position-wise feedforward network,
-* another residual connection and normalization.
-\end{enumerate}
-
-Schematically:
-\[
+!split
+===== The Transformer Block =====
+A standard transformer block contains:
+* multi-head attention,
+* residual connection,
+* layer normalization,
+* position-wise feedforward network,
+* another residual connection and normalization.
+
+Schematically:
+!bt
+\[
 X \mapsto X + \mathrm{MHA}(X)
-\]
-followed by
-\[
+\]
+!et
+followed by
+!bt
+\[
 X \mapsto X + \mathrm{MLP}(X).
-\]
-
-So a transformer is still a deep network built from familiar components:
-\[
-\text{attention} + \text{MLP} + \text{residual structure}.
-\]
+\]
+!et
+So a transformer is still a deep network built from familiar components:
+!bt
+\[
+\text{attention} + \text{MLP} + \text{residual structure}.
+\]
+!et
 
 
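The block listed in this hunk (attention, residual, normalization, feedforward, another residual and normalization) can be sketched in NumPy. This is a hedged illustration, not the notes' reference implementation: it uses a single attention head, a plain two-layer ReLU feedforward, and the post-norm ordering the list suggests; all dimensions and weights are arbitrary examples.

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    # normalize each position's vector to zero mean, unit variance
    mu = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

def softmax(Z):
    Z = Z - Z.max(axis=-1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=-1, keepdims=True)

def transformer_block(X, WQ, WK, WV, W1, W2):
    # attention (single head here), residual connection, layer norm
    Q, K, V = X @ WQ, X @ WK, X @ WV
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    X = layer_norm(X + A @ V)          # X -> LN(X + MHA(X))
    # position-wise feedforward, another residual connection and norm
    H = np.maximum(X @ W1, 0.0)        # ReLU hidden layer
    return layer_norm(X + H @ W2)      # X -> LN(X + MLP(X))

rng = np.random.default_rng(1)
n, d = 6, 8
X = rng.standard_normal((n, d))
WQ, WK, WV = (rng.standard_normal((d, d)) for _ in range(3))
W1 = rng.standard_normal((d, 4 * d))
W2 = rng.standard_normal((4 * d, d))
Y = transformer_block(X, WQ, WK, WV, W1, W2)
print(Y.shape)  # (6, 8): the block preserves the sequence shape
```

Because each sublayer is a residual update, stacking many such blocks keeps the familiar deep-network structure of attention + MLP + residuals.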
-%------------------------------------------------
-!split ===== Positional Information}
-Attention itself is permutation-equivariant:
-\[
+!split
+===== Positional Information =====
+Attention itself is permutation-equivariant:
+!bt
+\[
 (x_1,\dots,x_n)\mapsto (y_1,\dots,y_n)
-\]
-depends only on pairwise relations, not on absolute order.
-
-For sequence tasks we must add positional information.
-
-Common approaches:
-
-* learned positional embeddings,
-* sinusoidal encodings,
-* relative position encodings.
-
-
-For example, sinusoidal encodings use
-\[
-\mathrm{PE}(m,2r)=\sin\!\left(\frac{m}{10000^{2r/d}}\right),
-\qquad
-\mathrm{PE}(m,2r+1)=\cos\!\left(\frac{m}{10000^{2r/d}}\right).
-\]
+\]
+!et
+depends only on pairwise relations, not on absolute order.
 
+For sequence tasks we must add positional information. Common approaches:
+* learned positional embeddings,
+* sinusoidal encodings,
+* relative position encodings.
 
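To make the sinusoidal option concrete: the standard scheme fills position $m$, coordinate pair $(2r, 2r+1)$ with $\sin(m/10000^{2r/d})$ and $\cos(m/10000^{2r/d})$, as in the formula this hunk deletes. A minimal vectorized sketch (sequence length and dimension are arbitrary examples):

```python
import numpy as np

def sinusoidal_pe(n, d):
    # PE(m, 2r) = sin(m / 10000^(2r/d)), PE(m, 2r+1) = cos(m / 10000^(2r/d))
    m = np.arange(n)[:, None]          # positions 0..n-1, as a column
    r = np.arange(d // 2)[None, :]     # index of each (sin, cos) pair
    angle = m / 10000.0 ** (2 * r / d)
    pe = np.empty((n, d))
    pe[:, 0::2] = np.sin(angle)        # even coordinates: sine
    pe[:, 1::2] = np.cos(angle)        # odd coordinates: cosine
    return pe

pe = sinusoidal_pe(50, 16)
print(pe.shape)    # (50, 16)
print(pe[0, :4])   # position 0: sine coordinates are 0, cosine coordinates are 1
```

The encoding is typically added to the input vectors, $x_m \leftarrow x_m + \mathrm{PE}(m,\cdot)$, which breaks the permutation equivariance of plain attention.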
-%------------------------------------------------
-!split ===== Transformers as Kernel Machines}
+!split
+===== Transformers as Kernel Machines =====
 Attention has the form
 \[
 y_i = \sum_j A_{ij}(X)\, v_j.
