!split
===== Deep Learning =====
Classical deep learning architectures include:

* multilayer perceptrons (MLPs),
Each architecture encodes a specific inductive bias.
Transformers are also deep neural networks, but with a different structural principle: _adaptive interaction through attention._

!split
===== What is a transformer? =====
A transformer is a neural-network architecture built around the idea of _self-attention_.

Core principle:
This makes them especially powerful for:
* scientific fields and operator learning.

!split
===== Input as a sequence of vectors =====
Suppose the input is a sequence

!bt
\[
x_1, x_2, \dots, x_n, \qquad x_i \in \mathbb{R}^{d}.
\]
!et

A transformer maps this sequence to another sequence of vectors:

!bt
\[
y_1, y_2, \dots, y_n, \qquad y_i \in \mathbb{R}^{d}.
\]
!et

!split
===== Self-attention: The basic formula =====

For each position $i$, the transformer forms
!bt
\[
y_i = \sum_{j=1}^{n} \alpha_{ij}\, v_j,
\]
!et
where:
* $v_j = W_V x_j$ are the _values_,
* $\alpha_{ij}$ are attention weights.


!split
===== How to compute weights =====
The weights are computed from
!bt
\[
\alpha_{ij} = \frac{\exp\!\left(q_i^\top k_j/\sqrt{d_k}\right)}{\sum_{l=1}^{n}\exp\!\left(q_i^\top k_l/\sqrt{d_k}\right)},
\qquad q_i = W_Q x_i,\quad k_j = W_K x_j,
\]
!et
where $q_i$ are the _queries_ and $k_j$ the _keys_.

This is the central mechanism of a transformer.
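The weight computation can be sketched directly in NumPy. This is a minimal illustration, not an optimized implementation; the toy dimensions and the random weight matrices `WQ`, `WK`, `WV` are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, dk = 4, 8, 8           # sequence length, model dim, key dim (toy sizes)
X = rng.normal(size=(n, d))  # input vectors x_1,...,x_n as rows
WQ, WK, WV = (rng.normal(size=(d, dk)) for _ in range(3))

Q, K, V = X @ WQ, X @ WK, X @ WV   # queries q_i, keys k_j, values v_j as rows

# attention weights: softmax over j of q_i^T k_j / sqrt(d_k)
logits = Q @ K.T / np.sqrt(dk)
alpha = np.exp(logits - logits.max(axis=1, keepdims=True))
alpha /= alpha.sum(axis=1, keepdims=True)

# each output y_i is a weighted average of the values v_j
Y = alpha @ V
print(alpha.sum(axis=1))  # each row of weights sums to 1
```

Subtracting the row maximum before exponentiating leaves the weights unchanged but avoids overflow, which is the same stability concern the $\sqrt{d_k}$ scaling addresses.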
!split
===== Matrix form of attention =====


Collect the input vectors into a matrix
!bt
\[
X =
\begin{pmatrix}
x_1^\top \\ \vdots \\ x_n^\top
\end{pmatrix}
\in \mathbb{R}^{n\times d},
\]
!et
and define queries, keys, and values row-wise,
!bt
\[
Q = XW_Q,\qquad K = XW_K,\qquad V = XW_V.
\]
!et
The full operation is then
!bt
\[
\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V,
\]
!et
where the softmax is applied row-wise and $d_k$ is the key dimension.
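The matrix form is a one-liner, and one can check numerically that it reproduces the per-position sum $y_i=\sum_j \alpha_{ij} v_j$. A sketch with toy sizes and random weights (all invented for the illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, dk = 5, 6, 6
X = rng.normal(size=(n, d))
WQ, WK, WV = (rng.normal(size=(d, dk)) for _ in range(3))
Q, K, V = X @ WQ, X @ WK, X @ WV

def softmax_rows(M):
    M = M - M.max(axis=1, keepdims=True)  # subtract row max for stability
    E = np.exp(M)
    return E / E.sum(axis=1, keepdims=True)

A = softmax_rows(Q @ K.T / np.sqrt(dk))  # n x n attention matrix
Y = A @ V                                # matrix form of attention

# same result, computed position by position
Y_loop = np.stack([sum(A[i, j] * V[j] for j in range(n)) for i in range(n)])
assert np.allclose(Y, Y_loop)
```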
!split
===== Interpretation of the attention matrix =====
The matrix
!bt
\[
A = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) \in \mathbb{R}^{n\times n}
\]
!et
collects the attention weights: entry $A_{ij}$ measures how much position $i$ attends to position $j$.

Without normalization, these logits can become large, causing the softmax to saturate. Scaling by $\sqrt{d_k}$ helps
to stabilize optimization and keep gradients in a useful regime.


!split
===== Multi-Head Attention =====

In practice, transformers use multiple attention heads.

For head $h$:
!bt
\[
Q^{(h)} = XW_Q^{(h)},\quad
K^{(h)} = XW_K^{(h)},\quad
V^{(h)} = XW_V^{(h)}.
\]
!et
Each head computes
!bt
\[
\mathrm{head}_h=\mathrm{Attention}(Q^{(h)},K^{(h)},V^{(h)}).
\]
!et
Then all heads are concatenated and linearly mixed:
!bt
\[
\mathrm{MultiHead}(X)=\mathrm{Concat}(\mathrm{head}_1,\dots,\mathrm{head}_H)W_O.
\]
!et
!bblock Intuition:
* different heads learn different interaction patterns,
* one head may capture local structure, another long-range dependence.
!eblock
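A minimal multi-head sketch follows. The head count, dimensions, and weight matrices are invented for the example, and real implementations fuse the per-head projections into a single matrix multiply:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, H = 6, 16, 4
dh = d // H                      # per-head dimension
X = rng.normal(size=(n, d))
W_O = rng.normal(size=(d, d))    # output mixing matrix

def attention(Q, K, V):
    logits = Q @ K.T / np.sqrt(Q.shape[1])
    A = np.exp(logits - logits.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    return A @ V

heads = []
for h in range(H):
    # per-head projections Q^(h) = X W_Q^(h), etc.
    WQ, WK, WV = (rng.normal(size=(d, dh)) for _ in range(3))
    heads.append(attention(X @ WQ, X @ WK, X @ WV))

# concatenate the heads, then mix them linearly with W_O
out = np.concatenate(heads, axis=1) @ W_O
print(out.shape)  # (6, 16)
```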


!split
===== The Transformer Block =====
A standard transformer block contains:
* multi-head attention,
* residual connection,
* layer normalization,
* position-wise feedforward network,
* another residual connection and normalization.

Schematically:
!bt
\[
X \mapsto X + \mathrm{MHA}(X)
\]
!et
followed by
!bt
\[
X \mapsto X + \mathrm{MLP}(X).
\]
!et
So a transformer is still a deep network built from familiar components:
!bt
\[
\text{attention} + \text{MLP} + \text{residual structure}.
\]
!et
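Putting the pieces together, one block's forward pass can be sketched as follows. This uses single-head attention, a tiny ReLU MLP, and a simplified layer norm; all sizes and parameter names are invented for the sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 6, 16
X = rng.normal(size=(n, d))

def layer_norm(Z, eps=1e-5):
    # normalize each position (row) to zero mean, unit variance
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    return (Z - mu) / np.sqrt(var + eps)

def attention(Z, WQ, WK, WV):
    Q, K, V = Z @ WQ, Z @ WK, Z @ WV
    logits = Q @ K.T / np.sqrt(Q.shape[1])
    A = np.exp(logits - logits.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    return A @ V

# random parameters for one block
WQ, WK, WV = (rng.normal(size=(d, d)) for _ in range(3))
W1, W2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))

# X -> X + Attention(X), then normalize
Z = layer_norm(X + attention(X, WQ, WK, WV))
# X -> X + MLP(X), then normalize
Z = layer_norm(Z + np.maximum(Z @ W1, 0) @ W2)
print(Z.shape)  # (6, 16)
```

The residual additions mean each sub-layer only has to learn a correction to the identity map, which is what keeps very deep stacks trainable.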


!split
===== Positional Information =====
Attention itself is permutation-equivariant: the map
!bt
\[
(x_1,\dots,x_n)\mapsto (y_1,\dots,y_n)
\]
!et
depends only on pairwise relations, not on absolute order.

For sequence tasks we must add positional information. Common approaches:
* learned positional embeddings,
* sinusoidal encodings,
* relative position encodings.
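As one concrete example, the sinusoidal encoding of the original transformer paper, $\mathrm{PE}(m,2r)=\sin(m/10000^{2r/d})$ and $\mathrm{PE}(m,2r+1)=\cos(m/10000^{2r/d})$, can be sketched as (toy dimensions):

```python
import numpy as np

def sinusoidal_pe(n, d):
    """PE[m, 2r] = sin(m / 10000^(2r/d)), PE[m, 2r+1] = cos(m / 10000^(2r/d))."""
    pe = np.zeros((n, d))
    pos = np.arange(n)[:, None]                # positions m
    div = 10000.0 ** (np.arange(0, d, 2) / d)  # 10000^(2r/d)
    pe[:, 0::2] = np.sin(pos / div)
    pe[:, 1::2] = np.cos(pos / div)
    return pe

pe = sinusoidal_pe(50, 16)
print(pe.shape)     # (50, 16)
print(pe[0, 0::2])  # sin(0) = 0 in all even columns of position 0
```

The encoding is simply added to the input vectors, so positions become distinguishable without breaking the attention mechanism itself.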

!split
===== Transformers as Kernel Machines =====
Attention has the form
!bt
\[
y_i = \sum_j A_{ij}(X)\, v_j.
\]
!et