Commit bff8b3a
Update t.do.txt
1 parent 87990d8 commit bff8b3a

1 file changed: +72 additions, -126 deletions

doc/src/week10/Latexfiles/t.do.txt

@@ -145,9 +145,9 @@ is a weighted average of the value vectors.
 !split
 ===== Interpretation: =====
 !bblock
-* query $q_i$: ``what does position \(i\) look for?''
-* key $k_j$: ``what does position \(j\) offer?''
-* value $v_j$: ``what information does position \(j\) contribute?''
+* query $q_i$: ``what does position $i$ look for?''
+* key $k_j$: ``what does position $j$ offer?''
+* value $v_j$: ``what information does position $j$ contribute?''
 !eblock
 
 
@@ -231,13 +231,13 @@ The attention logits are
 q_i \cdot k_j.
 \]
 !et
-If $q_i$ and \(k_j\) have \(d_k\) components with comparable variance, then typically
+If $q_i$ and $k_j$ have $d_k$ components with comparable variance, then typically
 !bt
 \[
 q_i \cdot k_j \sim O(\sqrt{d_k})
 \]
 !et
-or \(O(d_k)\) depending on scaling assumptions.
+or $O(d_k)$ depending on scaling assumptions.
 Without normalization, these logits can become large, causing the softmax to saturate. Therefore one rescales by
 !bt
 \[
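The scaling claim in this hunk is easy to check numerically. A minimal sketch (NumPy is an editorial assumption here, not part of the commit): for vectors with $d_k$ i.i.d. unit-variance components, the dot product has standard deviation close to $\sqrt{d_k}$, and dividing by $\sqrt{d_k}$ brings the logits back to order one.

```python
import numpy as np

# Empirical check of q . k ~ O(sqrt(d_k)) for unit-variance components.
rng = np.random.default_rng(0)
d_k = 256
n = 10_000
q = rng.standard_normal((n, d_k))
k = rng.standard_normal((n, d_k))
logits = np.einsum("nd,nd->n", q, k)   # n independent dot products

print(logits.std())                    # close to sqrt(256) = 16
print((logits / np.sqrt(d_k)).std())   # close to 1 after rescaling
```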
@@ -252,7 +252,7 @@ to stabilize optimization and keep gradients in a useful regime.
 
 In practice, transformers use multiple attention heads.
 
-For head \(h\):
+For head $h$:
 !bt
 \[
 Q^{(h)} = XW_Q^{(h)},\quad
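The per-head projections in this hunk can be sketched as follows; the dimensions are made up for illustration, and NumPy is an assumption rather than the course's own code.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_model, n_heads = 10, 64, 4
d_k = d_model // n_heads            # per-head dimension

X = rng.standard_normal((n, d_model))

def attention_head(X, W_Q, W_K, W_V):
    # Q^(h) = X W_Q^(h), etc., then scaled dot-product attention.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    S = Q @ K.T / np.sqrt(d_k)                   # logits s_ij
    A = np.exp(S - S.max(axis=1, keepdims=True)) # stable softmax
    A /= A.sum(axis=1, keepdims=True)
    return A @ V

heads = []
for h in range(n_heads):
    W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) for _ in range(3))
    heads.append(attention_head(X, W_Q, W_K, W_V))

Y = np.concatenate(heads, axis=1)   # concatenated heads, shape (n, d_model)
print(Y.shape)
```

In a full transformer the concatenated heads would be passed through one more learned linear map; this sketch stops at the concatenation.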
@@ -324,138 +324,84 @@ For sequence tasks we must add positional information. Common approaches:
 
 !split
 ===== Transformers as Kernel Machines =====
-Attention has the form
-\[
-y_i = \sum_j A_{ij}(X)\, v_j.
-\]
-
-This resembles an integral kernel operator:
-\[
-(\mathcal{K}f)(x)
-=
-\int K(x,x') f(x')\, dx'.
-\]
-
-In the discrete setting, attention acts like
-\[
-y_i = \sum_j K(x_i,x_j)\, v_j,
-\]
-but with a kernel \(K\) learned adaptively from the data.
-
-This viewpoint is useful in scientific machine learning and operator learning.
-
-
-%------------------------------------------------
-!split ===== Physics Viewpoint I: Adaptive Many-Body Couplings}
-A useful physics analogy is
-\[
-y_i = \sum_j J_{ij}(X)\, x_j,
-\]
-where
-\[
-J_{ij}(X) \equiv A_{ij}(X)
-\]
-is an input-dependent coupling.
-
-This resembles:
-
-* mean-field interaction matrices,
-* adaptive spin couplings,
-* message-passing in many-body systems,
-* self-consistent effective interactions.
-
+Attention has the form
+!bt
+\[
+y_i = \sum_j A_{ij}(X)\, v_j.
+\]
+!et
+This resembles an integral kernel operator:
+!bt
+\[
+(\mathcal{K}f)(x)=\int K(x,x') f(x')\, dx'.
+\]
+!et
+In the discrete setting, attention acts like
+!bt
+\[
+y_i = \sum_j K(x_i,x_j)\, v_j,
+\]
+!et
+but with a kernel $K$ learned adaptively from the data.
+This viewpoint is useful in scientific machine learning and operator learning.
 
-So transformers can be interpreted as systems with
-\[
-\text{state-dependent interaction kernels}.
-\]
-
-
-%------------------------------------------------
-!split ===== Physics Viewpoint II: Statistical Mechanics Analogy}
-The softmax weights are
-\[
-A_{ij} =
-\frac{e^{s_{ij}}}{\sum_\ell e^{s_{i\ell}}},
-\qquad
-s_{ij} = \frac{q_i\cdot k_j}{\sqrt{d_k}}.
-\]
-
-This looks like a Gibbs or Boltzmann weight:
-\[
-p_j = \frac{e^{-\beta E_j}}{Z}.
-\]
-
-Indeed, if we identify
-\[
-E_{ij} = - s_{ij},
-\]
-then attention weights are Gibbs-like probabilities over interaction partners.
-
-This suggests a statistical-mechanics interpretation:
-
-* scores \(s_{ij}\) define effective energies,
-* softmax performs a local partition-function normalization.
-
 
 
-%------------------------------------------------
-!split ===== Mean-Field Interpretation}
-Suppose each degree of freedom \(i\) updates by coupling to all others through effective coefficients \(A_{ij}\):
-\[
-y_i = \sum_j A_{ij} v_j.
-\]
+!split
+===== Physics Viewpoint: Adaptive Many-Body Couplings =====
+A useful analogy is
+!bt
+\[
+y_i = \sum_j J_{ij}(X)\, x_j,
+\]
+!et
+where
+!bt
+\[
+J_{ij}(X) \equiv A_{ij}(X)
+\]
+!et
+is an input-dependent coupling.
+This resembles:
+* mean-field interaction matrices,
+* adaptive spin couplings,
+* message-passing in many-body systems,
+* self-consistent effective interactions.
+Transformers can be interpreted as systems with state-dependent interaction kernels.
 
-This resembles a mean-field update:
-\[
-m_i^{\text{new}} = F\!\left(\sum_j J_{ij} m_j\right),
-\]
-except that the couplings themselves depend on the current state.
 
-Thus transformers may be viewed as
-\[
-\text{nonlinear, adaptive mean-field models}.
-\]
 
+!split
+===== Attention and graph structure =====
 
-%------------------------------------------------
-!split ===== Attention and Graph Structure}
-If the input is viewed as a graph with nodes \(i\), then attention defines a complete graph with weighted edges
-\[
+If the input is viewed as a graph with nodes $i$, then attention defines a complete graph with weighted edges
+!bt
+\[
 i \longleftrightarrow j
-\]
-of strength \(A_{ij}\).
-
-This makes transformers closely related to:
-
-* graph neural networks,
-* message-passing networks,
-* nonlocal interaction models.
-
+\]
+!et
+of strength $A_{ij}$.
+This makes transformers closely related to:
+* graph neural networks,
+* message-passing networks,
+* nonlocal interaction models.
 
-Difference:
-
-* graph neural networks often use a fixed graph,
-* transformers learn the graph dynamically from the data.
+Difference:
+* graph neural networks often use a fixed graph,
+* transformers learn the graph dynamically from the data.
 
 
+!split
+===== Transformers and PDEs: Why they Matter =====
 
-%------------------------------------------------
-!split ===== Transformers and PDEs: Why They Matter}
-Many PDE problems involve long-range dependencies:
-
-* elliptic equations,
-* nonlocal operators,
-* multiscale dynamics,
-* global constraints.
+Many PDE problems involve long-range dependencies:
+* elliptic equations,
+* nonlocal operators,
+* multiscale dynamics,
+* global constraints.
 
-
-CNNs are excellent for local structure, but transformers can naturally represent
-\[
-\text{nonlocal coupling across the whole domain}.
-\]
-
-This is one reason transformers have become important in scientific machine learning.
+CNNs are excellent for local structure, but transformers can naturally represent nonlocal coupling across the whole domain.
+This is one reason transformers have become important in scientific machine learning.
 
 
 
@@ -503,7 +449,7 @@ Attention then computes
 y_i = \sum_j A_{ij} v_j,
 \]
 !et
-which allows the representation at point \(x_i\) to depend on all other sampled points.
+which allows the representation at point $x_i$ to depend on all other sampled points.
 
 !split
 ===== Useful for =====
