@@ -145,9 +145,9 @@ is a weighted average of the value vectors.
 !split
 ===== Interpretation: =====
 !bblock
- * query $q_i$: ``what does position \(i\) look for?''
- * key $k_j$: ``what does position \(j\) offer?''
- * value $v_j$: ``what information does position \(j\) contribute?''
+ * query $q_i$: ``what does position $i$ look for?''
+ * key $k_j$: ``what does position $j$ offer?''
+ * value $v_j$: ``what information does position $j$ contribute?''
 !eblock
 
 
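The query/key/value picture above can be made concrete with a small NumPy sketch; the dimensions and weight matrices below are illustrative stand-ins for learned parameters, not part of the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k = 4, 8                      # sequence length, query/key dimension
X = rng.normal(size=(n, d_k))      # token representations

# Hypothetical projections; in a trained model W_Q, W_K, W_V are learned.
W_Q, W_K, W_V = (rng.normal(size=(d_k, d_k)) for _ in range(3))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

i = 0                              # position i "asks" with its query q_i
logits = K @ Q[i] / np.sqrt(d_k)   # how much each key k_j "offers" to q_i
w = np.exp(logits - logits.max())  # stabilized softmax over positions j
w /= w.sum()
y_i = w @ V                        # weighted sum of the contributed v_j
```

Note the division by `sqrt(d_k)`: it is exactly the rescaling of the logits discussed in these notes, keeping the softmax away from saturation.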
@@ -231,13 +231,13 @@ The attention logits are
  q_i \cdot k_j.
 \]
 !et
- If $q_i$ and \( k_j\) have \( d_k\) components with comparable variance, then typically
+ If $q_i$ and $k_j$ have $d_k$ components with comparable variance, then typically
 !bt
 \[
  q_i \cdot k_j \sim O(\sqrt{d_k})
 \]
 !et
- or \( O(d_k)\) depending on scaling assumptions.
+ or $O(d_k)$ depending on scaling assumptions.
 Without normalization, these logits can become large, causing the softmax to saturate. Therefore one rescales by
 !bt
 \[
@@ -252,7 +252,7 @@ to stabilize optimization and keep gradients in a useful regime.
 
 In practice, transformers use multiple attention heads.
 
- For head \(h\) :
+ For head $h$:
 !bt
 \[
  Q^{(h)} = XW_Q^{(h)},\quad
@@ -324,138 +324,84 @@ For sequence tasks we must add positional information. Common approaches:
 
 !split
 ===== Transformers as Kernel Machines =====
- Attention has the form
- \[
- y_i = \sum_j A_{ij}(X)\, v_j.
- \]
-
- This resembles an integral kernel operator:
- \[
- (\mathcal{K}f)(x)
- =
- \int K(x,x') f(x')\, dx'.
- \]
-
- In the discrete setting, attention acts like
- \[
- y_i = \sum_j K(x_i,x_j)\, v_j,
- \]
- but with a kernel \(K\) learned adaptively from the data.
-
- This viewpoint is useful in scientific machine learning and operator learning.
-
-
- %------------------------------------------------
- !split ===== Physics Viewpoint I: Adaptive Many-Body Couplings}
- A useful physics analogy is
- \[
- y_i = \sum_j J_{ij}(X)\, x_j,
- \]
- where
- \[
- J_{ij}(X) \equiv A_{ij}(X)
- \]
- is an input-dependent coupling.
-
- This resembles:
-
- * mean-field interaction matrices,
- * adaptive spin couplings,
- * message-passing in many-body systems,
- * self-consistent effective interactions.
-
+ Attention has the form
+ !bt
+ \[
+ y_i = \sum_j A_{ij}(X)\, v_j.
+ \]
+ !et
+ This resembles an integral kernel operator:
+ !bt
+ \[
+ (\mathcal{K}f)(x)=\int K(x,x') f(x')\, dx'.
+ \]
+ !et
+ In the discrete setting, attention acts like
+ !bt
+ \[
+ y_i = \sum_j K(x_i,x_j)\, v_j,
+ \]
+ !et
+ but with a kernel $K$ learned adaptively from the data.
+ This viewpoint is useful in scientific machine learning and operator learning.
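The discrete kernel analogy can be sketched side by side: a fixed Gaussian kernel applied to sampled values versus an attention kernel built from the same input. All names and shapes below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4                       # number of samples, feature dimension
X = rng.normal(size=(n, d))       # sample features x_1..x_n
V = rng.normal(size=(n, d))       # values v_j

# Fixed kernel operator: K(x_i, x_j) is a Gaussian in feature space.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq)                   # n x n kernel matrix
y_fixed = K @ V                   # y_i = sum_j K(x_i, x_j) v_j

# Attention kernel: built from learned projections of the same input.
W_Q = rng.normal(size=(d, d)); W_K = rng.normal(size=(d, d))
S = (X @ W_Q) @ (X @ W_K).T / np.sqrt(d)      # logits q_i . k_j / sqrt(d_k)
S -= S.max(axis=1, keepdims=True)             # numerical stabilization
A = np.exp(S) / np.exp(S).sum(axis=1, keepdims=True)
y_attn = A @ V                    # y_i = sum_j A_ij(X) v_j
```

Both operators map the values `V` through an n-by-n kernel matrix; the difference is that `A` depends on the data through the (here random, in practice learned) projections.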
 
- So transformers can be interpreted as systems with
- \[
- \text{state-dependent interaction kernels}.
- \]
-
-
- %------------------------------------------------
- !split ===== Physics Viewpoint II: Statistical Mechanics Analogy}
- The softmax weights are
- \[
- A_{ij} =
- \frac{e^{s_{ij}}}{\sum_\ell e^{s_{i\ell}}},
- \qquad
- s_{ij} = \frac{q_i\cdot k_j}{\sqrt{d_k}}.
- \]
-
- This looks like a Gibbs or Boltzmann weight:
- \[
- p_j = \frac{e^{-\beta E_j}}{Z}.
- \]
-
- Indeed, if we identify
- \[
- E_{ij} = - s_{ij},
- \]
- then attention weights are Gibbs-like probabilities over interaction partners.
-
- This suggests a statistical-mechanics interpretation:
-
- * scores \(s_{ij}\) define effective energies,
- * softmax performs a local partition-function normalization.
-
 
 
- %------------------------------------------------
- !split ===== Mean-Field Interpretation}
- Suppose each degree of freedom \(i\) updates by coupling to all others through effective coefficients \(A_{ij}\):
- \[
- y_i = \sum_j A_{ij} v_j.
- \]
+ !split
+ ===== Physics Viewpoint: Adaptive Many-Body Couplings =====
+ A useful analogy is
+ !bt
+ \[
+ y_i = \sum_j J_{ij}(X)\, x_j,
+ \]
+ !et
+ where
+ !bt
+ \[
+ J_{ij}(X) \equiv A_{ij}(X)
+ \]
+ !et
+ is an input-dependent coupling.
+ This resembles:
+ * mean-field interaction matrices,
+ * adaptive spin couplings,
+ * message-passing in many-body systems,
+ * self-consistent effective interactions.
+ Transformers can be interpreted as systems with state-dependent interaction kernels.
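The state dependence of the couplings $J_{ij}(X)$ can be checked directly: unlike a fixed interaction matrix, the same projection weights produce different couplings for different inputs. A minimal sketch, with random matrices standing in for learned weights:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 3
W_Q = rng.normal(size=(d, d))      # fixed (learned) parameters
W_K = rng.normal(size=(d, d))

def couplings(X):
    """Attention weights A_ij(X), read as state-dependent couplings J_ij(X)."""
    S = (X @ W_Q) @ (X @ W_K).T / np.sqrt(d)
    S -= S.max(axis=1, keepdims=True)          # numerical stabilization
    A = np.exp(S)
    return A / A.sum(axis=1, keepdims=True)

X1 = rng.normal(size=(n, d))
X2 = rng.normal(size=(n, d))
J1, J2 = couplings(X1), couplings(X2)

# Same parameters, different states -> different effective couplings.
assert not np.allclose(J1, J2)
y = J1 @ X1                                    # y_i = sum_j J_ij(X) x_j
```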
 
- This resembles a mean-field update:
- \[
- m_i^{\text{new}} = F\!\left(\sum_j J_{ij} m_j\right),
- \]
- except that the couplings themselves depend on the current state.
-
- Thus transformers may be viewed as
- \[
- \text{nonlinear, adaptive mean-field models}.
- \]
-
+ !split
+ ===== Attention and Graph Structure =====
 
- %------------------------------------------------
- !split ===== Attention and Graph Structure}
- If the input is viewed as a graph with nodes \(i\), then attention defines a complete graph with weighted edges
- \[
+ If the input is viewed as a graph with nodes $i$, then attention defines a complete graph with weighted edges
+ !bt
+ \[
  i \longleftrightarrow j
- \]
- of strength \(A_{ij}\).
-
- This makes transformers closely related to:
-
- * graph neural networks,
- * message-passing networks,
- * nonlocal interaction models.
-
+ \]
+ !et
+ of strength $A_{ij}$.
+ This makes transformers closely related to:
+ * graph neural networks,
+ * message-passing networks,
+ * nonlocal interaction models.
 
- Difference:
-
- * graph neural networks often use a fixed graph,
- * transformers learn the graph dynamically from the data.
+ Difference:
+ * graph neural networks often use a fixed graph,
+ * transformers learn the graph dynamically from the data.
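The dynamic-graph reading can be sketched as follows: the attention matrix is a dense weighted adjacency, from which a GNN-style sparse graph can be read off by keeping the strongest edges per node. Names and sizes below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 6, 4
X = rng.normal(size=(n, d))
W_Q = rng.normal(size=(d, d)); W_K = rng.normal(size=(d, d))

# Dense weighted adjacency A_ij: a complete graph learned from the input.
S = (X @ W_Q) @ (X @ W_K).T / np.sqrt(d)
A = np.exp(S - S.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)

# Sparsify: keep the k strongest edges out of each node i.
k = 2
edges = [(i, int(j)) for i in range(n)
         for j in np.argsort(A[i])[-k:]]
```

Rerunning with a different input `X` changes `A` and hence the edge set, which is the sense in which the graph is learned dynamically rather than fixed.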
 
 
+ !split
+ ===== Transformers and PDEs: Why They Matter =====
 
- %------------------------------------------------
- !split ===== Transformers and PDEs: Why They Matter}
- Many PDE problems involve long-range dependencies:
-
- * elliptic equations,
- * nonlocal operators,
- * multiscale dynamics,
- * global constraints.
+ Many PDE problems involve long-range dependencies:
+ * elliptic equations,
+ * nonlocal operators,
+ * multiscale dynamics,
+ * global constraints.
 
-
- CNNs are excellent for local structure, but transformers can naturally represent
- \[
- \text{nonlocal coupling across the whole domain}.
- \]
-
- This is one reason transformers have become important in scientific machine learning.
+ CNNs are excellent for local structure, but transformers can naturally represent nonlocal coupling across the whole domain.
+ This is one reason transformers have become important in scientific machine learning.
 
 
 
@@ -503,7 +449,7 @@ Attention then computes
  y_i = \sum_j A_{ij} v_j,
 \]
 !et
- which allows the representation at point \( x_i\) to depend on all other sampled points.
+ which allows the representation at point $x_i$ to depend on all other sampled points.
 
 !split
 ===== Useful for =====