Performance Optimizations by rgknox · Pull Request #1540 · NGEET/fates

rgknox · 2026-03-05T16:33:49Z

Description:

This branch contains a litany of performance enhancements to FATES. I may try to break these changes into chunks, but it might not be worth it. With a full FATES run at a single site with 9 patches and ~250 cohorts, approximate speed-ups are:

1. Everything in this PR: total run time is 2/3 to 1/2 of the base (1.5x+)
2. Everything in 1) AND dual-loop* energy balance in the host: less than 1/2 of the base run time (i.e. 2x+)
3. Everything in 2) AND using 4 threads per site (host-side parallelism): 3x+

*Dual-loop energy balancing is a host-side change that works with these changes. It reduces the total number of calls to photosynthesis needed to achieve temperature convergence in CanopyFluxesMod.

This PR should be coupled with, or precede, E3SM-Project/E3SM#8143

These changes can be described in several categories:

- Breaking up the photosynthesis and canopy radiation drivers into patch-specific driver subroutines, which accommodates patch-level parallelism and a refactored "double loop" land-energy balance calculation.
- Utilizing automatic arrays (stack memory) instead of dynamically allocated scratch arrays (heap) where appropriate.
- Defining patch-level data structures to help reduce memory usage and perform more efficient calculations: the list of unique PFTs on each patch, the number of unique PFTs, and the maximum number of veg-layers on each patch.
- Using arrays of cohort data (attached to the patch data structure) during high-frequency operations. These arrays are filled at the end of the dynamics timestep during summarization.
- Creating a pointer array that lets us quickly identify the FATES linked-list patch associated with the host's patch index, which is synonymous with patch%patchno. That is, we can do this: patch => site%pa_vec(ifp)
- Removing unnecessary routines.
- Moving routines to places in the call sequence where they are still necessary but are called less frequently (e.g. organ respiration rates need not be inside the land energy balance iterator since they don't affect conductance). Also, temperature effects on biophysical rates (i.e. vcmax, jmax and LMR) need only be updated once per patch-PFT during photosynthesis, and need not be updated for every unique leaf layer.
- Converting routines to be classified as "elemental". The primary role of "elemental" is to let calling routines invoke a routine with either scalar or array arguments. Its secondary role is that it tells compilers that they can safely in-line code from that routine into the calling routine. This reduces computational overhead, sometimes greatly when it can leverage vectorization and SIMD, but also to some degree when it can't.
- Pre-calculating as much math as possible in high-frequency routines. For instance, when applying temperature corrections to biophysical rates, the math is demanding (exponentials, power functions, divides, etc). So the functions were split into parts to perform as much math as possible before any math is applied that depends on vegetation temperature (which changes at the model timestep).
- Making more "use specific" variants of the most heavily used functions, which allows us to make them faster. For instance, many calls to quadratic smoothing have constants as the "a" term. In these scenarios, we don't need to perform "if" checks to make sure that a is positive and non-zero. Since quadratics are literally the most called function we have, it's useful to get rid of these logicals.
- Adding a method of retrieving the values (i.e. the masses) of C, N and P in plant organs that is faster and minimizes "cache misses".
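
As a rough illustration of the "use specific" quadratic idea in the list above, here is a minimal sketch. It is written in Python for readability rather than in FATES's Fortran, and the function names (`smooth_min_general`, `smooth_min_const_a`), the curvature value 0.98, and the exact guard logic are all assumptions made for this example, not the routines in this PR.

```python
import math

def smooth_min_general(a, b, c):
    """Smaller root of a*x**2 + b*x + c = 0, with guards for an arbitrary 'a'."""
    if a <= 0.0:
        raise ValueError("a must be positive and non-zero")
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        raise ValueError("no real roots")
    return (-b - math.sqrt(disc)) / (2.0 * a)

# Hypothetical constant curvature term. Because it is a known constant in
# (0, 1], everything that depends only on it can be computed once.
THETA = 0.98
FOUR_THETA = 4.0 * THETA
INV_TWO_THETA = 1.0 / (2.0 * THETA)

def smooth_min_const_a(r1, r2):
    """Smaller root of THETA*x**2 - (r1 + r2)*x + r1*r2 = 0.

    No checks on 'a' are needed: THETA is a positive constant, and for
    0 < THETA <= 1 the discriminant is never negative for real r1, r2.
    """
    s = r1 + r2
    disc = s * s - FOUR_THETA * r1 * r2
    # Multiply by the precomputed reciprocal instead of dividing.
    return (s - math.sqrt(disc)) * INV_TWO_THETA
```

The specialized variant drops two branches and one divide per call. Whether that matters in practice depends on how hot the call site is, which is the case the description makes for quadratics.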

Note:

This work made use of conversations with AI bots, such as Gemini and Claude (via Kiro). No suggestions were applied directly by the AIs. Blocks of code and vignettes were copied over, but only in small sections, and each was evaluated in its entirety.

Collaborators:

@cdkoven @rosiealice @bishtgautam @peterdschwartz @mpaiao @glemieux @samsrabin

Expectation of Answer Changes:

This work used "do_b4b" logical flags. When these flags are set to true, changes are expected at the round-off level for all FATES configurations. When these flags are set to false, there are non-round-off-level changes that are still appropriate. For instance, we don't need to update the growth and home temperatures for leaf temperature acclimation every 30 minutes; it's excessive. This can be done once per day, since these are at least 30-day averages. But changing this frequency will subtly change results, hence the do_b4b flag.
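
A minimal sketch of how such a flag can gate a behavior change, written in Python for illustration. Only the flag name `do_b4b` and the 30-minute versus once-per-day update frequency come from the description above. The function names, the step counter, and the assumption of 48 steps per day are hypothetical.

```python
STEPS_PER_DAY = 48  # assumes a 30-minute model timestep

def should_update_acclimation(step, do_b4b):
    """Return True if growth/home temperatures should be refreshed at this step.

    do_b4b=True keeps the original behavior (update every timestep), so results
    stay comparable with the baseline. do_b4b=False updates once per day.
    """
    if do_b4b:
        return True
    return step % STEPS_PER_DAY == 0

def count_updates(n_steps, do_b4b):
    """Count how many updates happen over n_steps timesteps."""
    return sum(should_update_acclimation(s, do_b4b) for s in range(n_steps))
```

Over one simulated day, this sketch performs 48 updates with the flag on and a single update with it off.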

*Note: Aside from the changes mentioned above, I found that FATES can potentially generate chaotic behavior. For instance, when converting Q10 equations to use the exponential function instead of the power function, the difference in the math outcome should be incredibly small (on the order of 1e-15). However, these differences generated non-trivial differences in results. I intend to investigate this further and compare the differences from the math alternatives against perturbations of initial conditions.
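
For reference, the two mathematically equivalent forms of a Q10 correction are Q10^((T - Tref)/10) and exp(ln(Q10) * (T - Tref)/10). The Python sketch below compares them numerically. The function names, the Q10 value of 2, and the 25 degree reference temperature are illustrative assumptions, not values taken from FATES.

```python
import math

def q10_power(q10, t, t_ref=25.0):
    """Q10 factor using the power function."""
    return q10 ** ((t - t_ref) / 10.0)

def q10_exponential(log_q10_over_10, t, t_ref=25.0):
    """Same factor using exp(); ln(q10)/10 is precomputed by the caller."""
    return math.exp(log_q10_over_10 * (t - t_ref))

def max_relative_difference(q10, temps):
    """Largest relative difference between the two forms over the given temperatures."""
    k = math.log(q10) / 10.0  # computed once, independent of temperature
    return max(
        abs(q10_power(q10, t) - q10_exponential(k, t)) / q10_power(q10, t)
        for t in temps
    )
```

In double precision the two forms agree to within round-off for a single evaluation. The note above is about what happens when differences of that size feed back through a long simulation.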

Checklist

If this is your first time contributing, please read the CONTRIBUTING document.

All checklist items must be checked to enable merging this pull request:

Contributor

WIP, NOT YET

The in-code documentation has been updated with descriptive comments
The documentation has been assessed to determine if updates are necessary

Integrator

FATES PASS/FAIL regression tests were run
Evaluation of test results for answer changes was performed and results provided
FATES-CLM6 Code Freeze: satellite phenology regression tests are b4b

If satellite phenology regressions are not b4b, please hold merge and notify the FATES development team.

Documentation

Technical Note update:
User's Guide update:

Test Results:

CTSM (or) E3SM (specify which) test hash-tag:

CTSM (or) E3SM (specify which) baseline hash-tag:

FATES baseline hash-tag:

Test Output:

rgknox added 22 commits February 6, 2026 12:21

adding patch-level calls to photosynthesis for multithreading

6303f42

refactored photosynthesis, conductance and maintenance respiration to…

9444ab3

… allow multithreading

finished first pass of making patch photosynthesis its own routine, c…

d522019

…reated patch level fine-root fraction scratch space as well

fixes to arguments in patch parallel photosynthesis

4e2439a

Merge branch 'main' into patch-parallel

9cd2b5e

added reminder text

84e07b9

changed memory usage in photosynthesis code

050fb6a

towards patch parallel radiation

4c6b346

first pass at patch parallel load balancing

1709666

memory and algorithmic efficiency updates to photosynthesis

7f02ad5

refactors for memory efficient photosynthesis

01ef41e

Multithreaded canopy radiation

c05fe59

refactors to photosynthesis code, memory usage, removed unused comput…

bab2e63

…ation, better argument declarations

updated load-balancing algorithm

7eef75f

Merge branch 'main' into patch-parallel-lb

f3373fe

removed dynamic patch memory for fnrt scratch space, stack is faster …

60e8a7b

…anyway

removing persistent scratch arrays for two-stream

4445453

fixed commented out code

a8d97ed

efficiency updates, related to faster math and patch array pointers

334e12c

fates photosynthesis speedups

a240a2f

More optimizations to psn

c2da59f

adding dedicated pointers and getter functions to plant organ states

a0c8082

github-project-automation Bot added this to FATES Pull Request Planning and Status Mar 5, 2026

github-project-automation Bot moved this to Finding Reviewers in FATES Pull Request Planning and Status Mar 5, 2026

rgknox added the status: Not Ready The author is signaling that this PR is a work in progress and not ready for integration. label Mar 5, 2026

rgknox mentioned this pull request Mar 5, 2026

FATES Performance Optimizations and Patch(PFT) level Shared Memory Paralellism E3SM-Project/E3SM#8143

Draft

rgknox added 4 commits March 5, 2026 16:43

b4b provisions

053d1d0

Merge branch 'patch-parallel-lbmathop' into patch-parallel-lbmathop-prt

0250f5d

working out scratch vectors for high frequency operations

411c1b5

scratch arrays for cohorts, part 1

a54f1be

rgknox and others added 6 commits March 8, 2026 22:40

Clean version of vector cohorts and optimized organ pointers

efed123

changes bprates to structure with arrays, instead of an array of stru…

5c7c2f1

…ctures

used cache-friendly call to organ masses

c33a71d

fixes for optimizations

2412370

changed rdark to rdark_tstep

cef64a1

Converted more high-frequency respiration terms to the cohort arrays

8774e6e

Performance Optimizations #1540
rgknox wants to merge 32 commits into NGEET:main from rgknox:patch-parallel-lbmathop-prt
