Chapter 15: Applications – Econometrics, Finance, and Machine Learning
In this final chapter, we see how the Bayesian methods we've learned are applied to solve complex, real-world problems. Each section provides a concrete, runnable code example for the specific techniques used in each field.
15.1 Econometrics
Econometrics heavily relies on time-series and panel data, where Bayesian methods offer principled ways to handle complexity, prevent overfitting, and model latent structures.
Bayesian Vector Autoregression (BVAR)
Vector Autoregressions (VARs) model the dynamics of multiple time series jointly. However, they have many parameters, leading to overfitting. A BVAR uses priors to "shrink" the coefficients towards simpler, more plausible values (e.g., a random walk), improving forecast performance.
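For the two-variable case simulated below, the model is

y_t = A_1 y_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}(0, \Sigma)

The classic Minnesota prior centres the diagonal of A_1 at 1 and everything else at 0, encoding the belief that each series is approximately a random walk; the code below uses a simplified variant that shrinks every coefficient towards 0 instead.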
Python
import pymc as pm
import numpy as np
import arviz as az
import matplotlib.pyplot as plt
# 1. Simulate a 2-dimensional VAR(1) process
np.random.seed(42)
T = 100
# True coefficient matrix A1
A1_true = np.array([[0.7, 0.2],
[-0.1, 0.9]])
# True covariance of errors
cov_true = np.array([[1.0, 0.3],
[0.3, 1.5]])
# Generate data
Y = np.zeros((T, 2))
for t in range(1, T):
    Y[t, :] = A1_true @ Y[t-1, :] + np.random.multivariate_normal([0, 0], cov_true)

# 2. Define the BVAR model in PyMC
with pm.Model() as bvar_model:
    # Priors on the coefficient matrix (like a simplified Minnesota prior),
    # shrinking coefficients towards zero
    A1 = pm.Normal("A1", mu=0, sigma=0.5, shape=(2, 2))
    # Prior on the error covariance matrix using the LKJ Cholesky decomposition
    chol, _, _ = pm.LKJCholeskyCov(
        "chol", n=2, eta=2.0, sd_dist=pm.Exponential.dist(1.0, shape=2), compute_corr=True
    )
    cov = pm.Deterministic("cov", chol @ chol.T)
    # Likelihood: the mean for Y[t] is A1 @ Y[t-1]
    mu = pm.math.dot(A1, Y[:-1, :].T).T
    likelihood = pm.MvNormal("likelihood", mu=mu, chol=chol, observed=Y[1:, :])
    trace_bvar = pm.sample(1000, tune=1500)

# 3. Analyze the results
az.plot_posterior(trace_bvar, var_names=["A1", "cov"], grid=(2, 4), figsize=(10, 8))
plt.show()

# Compare true vs. posterior mean of A1
print("True A1 matrix:\n", A1_true)
print("\nPosterior mean of A1 matrix:\n",
      trace_bvar.posterior["A1"].mean(dim=("chain", "draw")).values)
The posterior distributions for the A1 matrix elements will be centered near their true values. The priors prevent the estimates from overfitting to the noise in the 100 data points.
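Because the posterior for A1 and the error covariance is available as draws, a one-step-ahead forecast distribution follows almost for free. The snippet below is a minimal sketch (it assumes trace_bvar and Y from the example above are still in scope) that pushes each posterior draw forward from the last observation.
Python
# Hedged sketch: one-step-ahead forecast distribution for Y[T] given Y[T-1]
post = az.extract(trace_bvar)
A1_draws = post["A1"].transpose("sample", ...).values    # shape (n_samples, 2, 2)
cov_draws = post["cov"].transpose("sample", ...).values  # shape (n_samples, 2, 2)
rng = np.random.default_rng(0)
forecasts = np.array([
    A @ Y[-1, :] + rng.multivariate_normal([0, 0], C)
    for A, C in zip(A1_draws, cov_draws)
])
print("Forecast mean:", forecasts.mean(axis=0))
print("90% forecast interval for series 1:", np.percentile(forecasts[:, 0], [5, 95]))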
Hierarchical Model for Panel Data
Panel data follows multiple units (e.g., firms, countries) over time. A hierarchical (or multilevel) model is perfect for this, allowing us to estimate both population-level effects and unit-specific deviations by "partially pooling" information.
Python
import pandas as pd
# 1. Simulate panel data (e.g., sales for 5 stores over 20 years)
np.random.seed(123)
n_stores = 5
n_years = 20
store_idx = np.repeat(np.arange(n_stores), n_years)
# Create different intercepts for each store
store_intercepts_true = np.random.normal(50, 10, n_stores)
# Assume a common trend (slope)
slope_true = 2.5
X = np.tile(np.arange(n_years), n_stores)
y = store_intercepts_true[store_idx] + slope_true * X + np.random.normal(0, 5, size=n_stores*n_years)
# 2. Define the hierarchical model
with pm.Model() as panel_model:
    # Hyperpriors for the group-level (store) intercepts
    mu_a = pm.Normal('mu_a', mu=50, sigma=10)
    sigma_a = pm.HalfNormal('sigma_a', sigma=10)
    # Common slope for all stores (fixed effect)
    b = pm.Normal('b', mu=0, sigma=5)
    # Store-specific intercepts (random effects)
    a = pm.Normal('a', mu=mu_a, sigma=sigma_a, shape=n_stores)
    # Model error
    sigma_y = pm.HalfNormal('sigma_y', sigma=10)
    # Expected value
    mu = a[store_idx] + b * X
    # Likelihood
    likelihood = pm.Normal('likelihood', mu=mu, sigma=sigma_y, observed=y)
    trace_panel = pm.sample(1000, tune=1000)
# 3. Plot the random intercepts
az.plot_forest(trace_panel, var_names=['a'], combined=True)
plt.axvline(x=store_intercepts_true.mean(), color='red', linestyle='--', label='True Mean Intercept')
plt.title("Posterior for Store-Specific Intercepts (Random Effects)")
plt.legend()
plt.show()
The forest plot shows the posterior for each store's intercept. These estimates "borrow strength" from each other, resulting in more stable and realistic values than if we had analyzed each store completely separately.
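A simple way to see the partial pooling at work is to compare these posterior means with completely unpooled estimates obtained by fitting each store on its own. The sketch below assumes trace_panel, X, y, store_idx, and n_stores from the example above are still in scope; the hierarchical intercepts are typically pulled towards the group mean relative to the unpooled ones.
Python
# Hedged sketch: unpooled per-store OLS intercepts vs. partially pooled posterior means
pooled_means = trace_panel.posterior["a"].mean(dim=("chain", "draw")).values
for s in range(n_stores):
    mask = store_idx == s
    slope_s, intercept_s = np.polyfit(X[mask], y[mask], deg=1)  # separate straight-line fit per store
    print(f"Store {s}: unpooled intercept = {intercept_s:6.2f}, "
          f"hierarchical posterior mean = {pooled_means[s]:6.2f}")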
State-Space Models (Dynamic Linear Models)
State-space models are used to infer latent (unobserved) states that evolve over time. The Bayesian framework is a natural fit for estimating both the latent state and the model's parameters simultaneously. The Nile river flow example is a classic application.
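In its simplest "local level" form, the model below pairs a state equation with an observation equation:

\mu_t = \mu_{t-1} + w_t, \qquad w_t \sim \mathcal{N}(0, \sigma_{\text{level}}^2)
y_t = \mu_t + v_t, \qquad v_t \sim \mathcal{N}(0, \sigma_{\text{obs}}^2)

Here \mu_t is the latent river level; sigma_level and sigma_obs in the code below are the process and observation standard deviations.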
Python
# 1. Load Nile river flow data
nile_data = pd.read_csv(pm.get_data("nile.csv"))
nile_years = nile_data["year"].values
nile_flow = nile_data["volume"].values
# 2. Define the state-space model (DLM)
with pm.Model() as nile_dlm:
    # Prior on the initial state (unobserved river level)
    level_init = pm.Normal("level_init", mu=np.mean(nile_flow), sigma=200)
    # Priors for process and observation variances
    sigma_level = pm.HalfNormal("sigma_level", sigma=50)  # How much the true level can change (process error)
    sigma_obs = pm.HalfNormal("sigma_obs", sigma=100)  # How noisy our measurements are (observation error)
    # The latent random walk for the river level
    level = pm.GaussianRandomWalk(
        "level", mu=0, sigma=sigma_level,
        init_dist=pm.Normal.dist(mu=level_init, sigma=200), shape=len(nile_flow)
    )
    # Likelihood: observations are noisy measurements of the true level
    likelihood = pm.Normal("likelihood", mu=level, sigma=sigma_obs, observed=nile_flow)
    trace_nile = pm.sample(2000, tune=2000)

# 3. Plot the inferred latent state
plt.figure(figsize=(12, 5))
plt.plot(nile_years, nile_flow, 'o', color='gray', alpha=0.7, label="Observed Flow")
plt.plot(nile_years, trace_nile.posterior['level'].mean(dim=("chain", "draw")), 'C0', lw=2, label="Inferred Underlying Level")
az.plot_hdi(nile_years, trace_nile.posterior['level'], color='C0', fill_kwargs={'alpha': 0.2})
plt.title("Bayesian State-Space Model of Nile River Flow")
plt.xlabel("Year")
plt.ylabel("Volume")
plt.legend()
plt.show()
The model correctly infers the famous structural break in the Nile's flow around 1898, demonstrating its ability to track a changing, unobserved process over time.
15.2 Finance
In finance, uncertainty is paramount. Bayesian methods provide a superior framework for handling parameter uncertainty in asset pricing, portfolio construction, and risk management.
Bayesian Model Averaging for Asset Pricing
Which factors (market, size, value, momentum, etc.) truly explain stock returns? Instead of picking one model, Bayesian Model Averaging (BMA) combines predictions from multiple models, weighted by the evidence for each model. We can use LOO/WAIC as a practical proxy for model evidence.
Python
# 1. Simulate data from a 3-factor model where one factor is irrelevant
np.random.seed(42)
N = 100
# Three potential factors (uncorrelated for simplicity)
F1 = np.random.randn(N)
F2 = np.random.randn(N)
F3 = np.random.randn(N) # Irrelevant factor
# Asset excess returns
R = 0.02 + 1.2 * F1 + 0.5 * F2 + 0.0 * F3 + np.random.normal(0, 0.5, N)
# 2. Define and fit a linear regression model for each candidate factor set
factor_sets = {
    'F1': F1[:, None],
    'F1_F2': np.vstack([F1, F2]).T,
    'F1_F2_F3': np.vstack([F1, F2, F3]).T,
}
traces = {}
for name, F in factor_sets.items():
    with pm.Model():
        alpha = pm.Normal('alpha', mu=0, sigma=1)
        betas = pm.Normal('betas', mu=0, sigma=1, shape=F.shape[1])
        sigma = pm.HalfNormal('sigma', sigma=1)
        mu = alpha + pm.math.dot(F, betas)
        pm.Normal('R', mu=mu, sigma=sigma, observed=R)
        # Keep the pointwise log-likelihood so az.compare can compute LOO
        traces[name] = pm.sample(500, tune=500, progressbar=False,
                                 idata_kwargs={"log_likelihood": True})

# 3. Compare models using ArviZ's compare function (which also computes model weights)
comparison_df = az.compare(traces)
print(comparison_df)
az.plot_compare(comparison_df)
plt.show()
The comparison table should rank the F1_F2 model first (it has the highest expected log pointwise predictive density, elpd_loo). It also provides a weight column (computed by stacking of predictive distributions by default), which can be read as an approximate BMA weight for each model based on its predictive performance. This tells us how much weight each model should receive when making predictions.
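The weights can then be used to actually average the models. The sketch below mixes each model's posterior-mean fitted values by its weight, reusing traces, factor_sets, and comparison_df from above; full BMA would mix entire posterior predictive distributions rather than point fits, but the mechanics are the same.
Python
# Hedged sketch: weight each model's posterior-mean fit by its az.compare weight
weights = comparison_df["weight"]
bma_fit = np.zeros(N)
for name, w in weights.items():
    post = az.extract(traces[name])
    alpha_mean = post["alpha"].mean().item()
    betas_mean = post["betas"].mean(dim="sample").values
    bma_fit += w * (alpha_mean + factor_sets[name] @ betas_mean)
print("First five model-averaged fitted returns:", np.round(bma_fit[:5], 3))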
Bayesian Portfolio Optimization
Classic portfolio optimization uses single point estimates of expected returns and covariances, ignoring the massive uncertainty around them. A Bayesian approach yields a full posterior distribution for the optimal asset weights, giving a more realistic view.
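The "optimal max-Sharpe weights" computed in step 3 below are the standard tangency-portfolio result, stated here with the risk-free rate taken as zero:

w^{*} = \frac{\Sigma^{-1}\mu}{\mathbf{1}^{\top}\Sigma^{-1}\mu}

Evaluating this formula once per posterior draw of (\mu, \Sigma) is what converts parameter uncertainty into uncertainty about the weights.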
Python
# 1. Simulate some stock returns
np.random.seed(42)
true_means = np.array([0.05, 0.08]) # Annual returns for two assets
true_cov = np.array([[0.1**2, 0.5*0.1*0.15], [0.5*0.1*0.15, 0.15**2]])
returns = np.random.multivariate_normal(true_means, true_cov, size=20)
# 2. Fit a Bayesian model to get posteriors for means and covariance
with pm.Model() as portfolio_model:
    chol, _, _ = pm.LKJCholeskyCov(
        "chol", n=2, eta=2.0, sd_dist=pm.Exponential.dist(1.0), compute_corr=True
    )
    cov = pm.Deterministic("cov", chol @ chol.T)
    mu = pm.Normal("mu", mu=0, sigma=0.2, shape=2)
    obs = pm.MvNormal("obs", mu=mu, chol=chol, observed=returns)
    trace_portfolio = pm.sample(1000, tune=1000)

# 3. For each posterior sample, calculate the optimal max-Sharpe weights
posterior_samples = az.extract(trace_portfolio)
mu_samples = posterior_samples["mu"].transpose("sample", ...).values
cov_samples = posterior_samples["cov"].transpose("sample", ...).values
optimal_weights = []
for mu_s, cov_s in zip(mu_samples, cov_samples):
    inv_cov = np.linalg.inv(cov_s)
    w = inv_cov @ mu_s
    optimal_weights.append(w / np.sum(w))  # Normalize so the weights sum to 1
optimal_weights = np.array(optimal_weights)
# 4. Plot the distribution of optimal weights
plt.hist(optimal_weights[:, 0], bins=30, alpha=0.7, label="Posterior dist. of weight for Asset 1", density=True)
plt.hist(optimal_weights[:, 1], bins=30, alpha=0.7, label="Posterior dist. of weight for Asset 2", density=True)
plt.title("Distribution of Optimal Portfolio Weights")
plt.xlabel("Weight")
plt.legend()
plt.show()
Instead of a single "optimal" allocation, the histograms show the full range of plausible optimal weights, directly visualizing the uncertainty inherent in the portfolio construction problem.
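Because the weights are themselves posterior samples, direct probability statements about the allocation follow immediately; for example (a small sketch reusing optimal_weights from above):
Python
# Posterior probability that Asset 2 should receive a larger weight than Asset 1
p_asset2_larger = np.mean(optimal_weights[:, 1] > optimal_weights[:, 0])
print(f"P(weight_2 > weight_1 | data) = {p_asset2_larger:.2f}")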
Bayesian Risk Management (Value-at-Risk)
Value-at-Risk (VaR) is a measure of downside risk. A Bayesian approach generates a full predictive distribution for future returns, from which we can calculate a VaR that properly accounts for parameter uncertainty.
Python
# 1. Simulate some asset returns with fat tails (Student-T distribution)
np.random.seed(101)
# nu=4 implies fatter tails than a normal distribution
asset_returns = np.random.standard_t(df=4, size=200) * 0.02 # Scaled to be like daily % returns
# 2. Fit a Bayesian Student-T model to the returns
with pm.Model() as var_model:
    mu = pm.Normal('mu', mu=0, sigma=0.01)
    sigma = pm.HalfNormal('sigma', sigma=0.05)
    nu = pm.Gamma('nu', alpha=2, beta=0.1)  # Prior on the degrees of freedom
    likelihood = pm.StudentT('likelihood', nu=nu, mu=mu, sigma=sigma, observed=asset_returns)
    trace_var = pm.sample(1000, tune=1000)
    # 3. Generate samples from the posterior predictive distribution
    ppc = pm.sample_posterior_predictive(trace_var)
# 4. Calculate the 5% VaR from the predictive distribution
future_returns = ppc.posterior_predictive['likelihood'].values.flatten()
VaR_95 = np.quantile(future_returns, 0.05)
# 5. Plot the results
plt.hist(future_returns, bins=100, density=True, label="Posterior Predictive Distribution of Returns")
plt.axvline(VaR_95, color='red', linestyle='--', lw=2, label=f"95% Value-at-Risk = {VaR_95*100:.2f}%")
plt.title("Bayesian Value-at-Risk")
plt.xlabel("1-Day Return")
plt.legend()
plt.show()
The histogram shows our belief about the distribution of tomorrow's return. The red line marks the 5th percentile: we are 95% confident that the loss will not be worse than this value, with our confidence properly reflecting the uncertainty in the model's parameters.
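The same predictive samples give other tail-risk measures essentially for free. For instance, the 95% Expected Shortfall (the average loss in the worst 5% of scenarios) is just a conditional mean below the VaR threshold; a small sketch reusing future_returns and VaR_95 from above:
Python
# Hedged sketch: 95% Expected Shortfall (CVaR) from the posterior predictive samples
ES_95 = future_returns[future_returns <= VaR_95].mean()
print(f"95% VaR: {VaR_95*100:.2f}%   95% Expected Shortfall: {ES_95*100:.2f}%")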
15.3 Machine Learning
Bayesian methods provide powerful tools for machine learning, offering principled regularization, robust uncertainty quantification, and flexible non-parametric models.
Priors as Regularization: The Bayesian Lasso
In machine learning, regularization is used to prevent overfitting, especially with high-dimensional data. Placing specific priors on regression coefficients is equivalent to regularization. A Laplace prior corresponds to L1 (Lasso) regularization: the MAP estimate under this prior is exactly the Lasso solution, which performs feature selection by setting irrelevant coefficients to zero, while the full posterior concentrates them tightly around zero.
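To make the correspondence explicit: with a Gaussian likelihood and independent Laplace(0, b) priors on the coefficients, maximizing the posterior is the same as minimizing

\|y - X\beta\|_2^2 + \lambda \|\beta\|_1, \qquad \lambda = \frac{2\sigma^2}{b},

which is exactly the Lasso objective. The fully Bayesian treatment below goes beyond that single point estimate and keeps the whole posterior.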
Python
# 1. Simulate high-dimensional data where most features are irrelevant
np.random.seed(24)
N = 100 # samples
K = 20 # features
# True coefficients (only 3 are non-zero)
true_coeffs = np.array([5, -3, 2] + [0]*(K-3))
X = np.random.randn(N, K)
y_lasso = X @ true_coeffs + np.random.normal(0, 2, N)
# 2. Define the Bayesian Lasso model (using Laplace priors)
with pm.Model() as lasso_model:
    # Laplace prior encourages sparsity
    beta = pm.Laplace('beta', mu=0, b=1, shape=K)
    intercept = pm.Normal('intercept', mu=0, sigma=10)
    sigma = pm.HalfNormal('sigma', sigma=5)
    mu = intercept + pm.math.dot(X, beta)
    likelihood = pm.Normal('likelihood', mu=mu, sigma=sigma, observed=y_lasso)
    trace_lasso = pm.sample(1000, tune=1500)
# 3. Plot the posteriors of the coefficients
az.plot_forest(trace_lasso, var_names=['beta'], combined=True, r_hat=True)
plt.axvline(x=0, color='red', linestyle='--')
plt.title("Posterior Distributions of Coefficients (Bayesian Lasso)")
plt.show()
The forest plot will show that the posteriors for the first three coefficients are clearly away from zero. In contrast, the posteriors for the 17 irrelevant coefficients are sharply peaked at zero, demonstrating the powerful feature selection property of the Laplace prior.
Bayesian Neural Networks (BNN)
A BNN places priors on the weights and biases of a neural network. Instead of learning a single set of weights, it infers a full posterior distribution over them. This allows the network to quantify its predictive uncertainty, indicating when it is "not sure" about a prediction.
Python
from sklearn.datasets import make_moons
# 1. Generate non-linear classification data
X_bnn, Y_bnn = make_moons(n_samples=200, noise=0.1, random_state=42)
# 2. Define the BNN architecture in PyMC
n_hidden = 5
with pm.Model() as bnn_model:
    # Wrap the inputs/outputs in pm.Data so they can be swapped out for prediction later
    ann_input = pm.Data("ann_input", X_bnn)
    ann_output = pm.Data("ann_output", Y_bnn)
    # Input -> Hidden Layer 1
    w1 = pm.Normal('w1', mu=0, sigma=1, shape=(X_bnn.shape[1], n_hidden))
    b1 = pm.Normal('b1', mu=0, sigma=1, shape=n_hidden)
    # Hidden Layer 1 -> Output
    w2 = pm.Normal('w2', mu=0, sigma=1, shape=n_hidden)
    b2 = pm.Normal('b2', mu=0, sigma=1)
    # Define the network logic
    act_1 = pm.math.tanh(pm.math.dot(ann_input, w1) + b1)
    output = pm.math.sigmoid(pm.math.dot(act_1, w2) + b2)
    # Likelihood
    likelihood = pm.Bernoulli('likelihood', p=output, observed=ann_output)
    # 3. Fit using fast Variational Inference (ADVI)
    approx = pm.fit(n=30000, method='advi')
    trace_bnn = approx.sample(1000)

# 4. Predict on a grid of points to visualize the decision boundary
grid = np.mgrid[-2:3:100j, -1.5:2:100j].reshape(2, -1).T
with bnn_model:
    # Swap in the grid; the dummy outputs are ignored when resampling the likelihood
    pm.set_data({"ann_input": grid, "ann_output": np.zeros(grid.shape[0])})
    ppc_bnn = pm.sample_posterior_predictive(trace_bnn, var_names=['likelihood'], random_seed=42)

# Plotting
p_class1 = ppc_bnn.posterior_predictive['likelihood'].mean(dim=("chain", "draw")).values
plt.figure(figsize=(8, 6))
contour = plt.contourf(grid[:, 0].reshape(100, 100), grid[:, 1].reshape(100, 100),
                       p_class1.reshape(100, 100), cmap='coolwarm', alpha=0.8)
plt.scatter(X_bnn[Y_bnn == 0][:, 0], X_bnn[Y_bnn == 0][:, 1], label='Class 0')
plt.scatter(X_bnn[Y_bnn == 1][:, 0], X_bnn[Y_bnn == 1][:, 1], label='Class 1')
plt.title("BNN Decision Boundary and Uncertainty")
plt.colorbar(contour, label="Posterior Mean Probability of Class 1")
plt.legend()
plt.show()
The plot shows a smooth decision boundary. The color saturation indicates certainty: deep red/blue regions are confident predictions, while lighter, purplish regions near the boundary represent high uncertainty.
Topic Models: Latent Dirichlet Allocation (LDA)
LDA is a fully Bayesian hierarchical model used to discover abstract "topics" in a collection of documents. It assumes each document is a mixture of topics, and each topic is a distribution over words. While possible to build in PyMC, standard practice is to use optimized libraries like scikit-learn.
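Written as a generative process with K topics and Dirichlet hyperparameters \alpha and \eta, LDA assumes

\phi_k \sim \text{Dirichlet}(\eta), \qquad \theta_d \sim \text{Dirichlet}(\alpha), \qquad z_{dn} \sim \text{Categorical}(\theta_d), \qquad w_{dn} \sim \text{Categorical}(\phi_{z_{dn}}),

so each document d has its own topic mixture \theta_d, and each observed word w_{dn} is drawn from the word distribution of its assigned topic.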
Python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
# 1. Sample text data
corpus = [
    "The stock market fell due to interest rate hikes and inflation fears",
    "Investors are moving capital from equity to bonds and cash",
    "The central bank announced a new monetary policy to fight inflation",
    "NASA launched a new rocket to the moon for scientific exploration",
    "The Hubble telescope captured stunning images of distant galaxies",
    "Space exploration is crucial for scientific advancement and new discoveries"
]
# 2. Vectorize the text data (convert words to counts)
vectorizer = CountVectorizer(stop_words='english')
X_lda = vectorizer.fit_transform(corpus)
# 3. Fit the LDA model (a Bayesian model under the hood)
lda = LatentDirichletAllocation(n_components=2, random_state=42) # 2 topics: Finance & Space
lda.fit(X_lda)
# 4. Inspect the topics by printing the top words for each
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = f"Topic #{topic_idx}: "
        message += " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
print("Discovered Topics:")
print_top_words(lda, vectorizer.get_feature_names_out(), 5)
The output will clearly show that the algorithm has discovered one topic dominated by words like "inflation," "market," "bank," and "stock," and another dominated by "space," "exploration," "scientific," and "rocket," successfully identifying the underlying themes in the corpus.
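The fitted model also assigns each document a topic mixture via lda.transform, which is often the more useful output in practice (a short addition reusing lda and X_lda from above):
Python
# Per-document topic proportions (each row sums to 1)
doc_topics = lda.transform(X_lda)
for i, dist in enumerate(doc_topics):
    print(f"Document {i}: " + ", ".join(f"topic {k}: {p:.2f}" for k, p in enumerate(dist)))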
Gaussian Processes (GPs)
GPs are a flexible, non-parametric Bayesian approach to regression. Instead of fitting a single function, a GP infers a distribution over functions that are consistent with the data, providing excellent uncertainty quantification.
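The exponentiated quadratic (RBF) kernel used below encodes the assumption that nearby inputs should produce similar outputs:

k(x, x') = \eta^2 \exp\!\left(-\frac{(x - x')^2}{2\ell^2}\right)

where the length scale \ell controls how quickly correlation decays with distance and \eta sets the overall amplitude of the function.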
Python
# 1. Generate some non-linear data
np.random.seed(1)
X_gp = np.linspace(0, 10, 20)[:, None]
y_gp = np.sin(X_gp).ravel() + np.random.normal(0, 0.2, 20)
# 2. Define and fit the Gaussian Process model in PyMC
with pm.Model() as gp_model:
    # Priors for the covariance function hyperparameters
    length_scale = pm.Gamma("length_scale", alpha=2, beta=1)
    eta = pm.HalfCauchy("eta", beta=5)
    # Specify the covariance function (kernel)
    cov_func = eta**2 * pm.gp.cov.ExpQuad(1, ls=length_scale)
    gp = pm.gp.Marginal(cov_func=cov_func)
    # Prior for the observation noise
    sigma = pm.HalfNormal("sigma", sigma=5)
    # Likelihood
    y_ = gp.marginal_likelihood("y", X=X_gp, y=y_gp, noise=sigma)
    trace_gp = pm.sample(1000, tune=1000)

# 3. Predict on new data points
X_new = np.linspace(-2, 12, 100)[:, None]
with gp_model:
    f_pred = gp.conditional("f_pred", Xnew=X_new)
    pred_samples = pm.sample_posterior_predictive(trace_gp, var_names=["f_pred"])
# 4. Plot the GP fit and its uncertainty
plt.figure(figsize=(10, 5))
plt.plot(X_gp, y_gp, 'o', color='C0', ms=8, label="Observed Data")
plt.plot(X_new, pred_samples.posterior_predictive["f_pred"].mean(dim=("chain", "draw")), "C1", lw=2, label="Posterior Mean Function")
az.plot_hdi(X_new.flatten(), pred_samples.posterior_predictive["f_pred"], color='C1', hdi_prob=0.95, fill_kwargs={'alpha': 0.2})
plt.title("Gaussian Process Regression")
plt.xlabel("X")
plt.ylabel("Y")
plt.legend()
plt.show()
The plot beautifully illustrates the power of GPs. The model learns the underlying sine wave and, crucially, the uncertainty (the shaded band) is narrow where we have data and widens dramatically where we don't, perfectly capturing the model's confidence in its predictions.