Improve HashSet<T> performance by enabling JIT bounds check elimination #125893

Merged
stephentoub merged 1 commit into dotnet:main from danmoseley:hashset-perf-opt
Mar 22, 2026
@danmoseley
Member

Improve HashSet performance by enabling JIT bounds check elimination

Change while (i >= 0) to while ((uint)i < (uint)entries.Length) in all hash-chain traversal loops in HashSet<T>, matching the pattern already used in Dictionary<TKey,TValue>.

Rationale

Dictionary<TKey,TValue> uses while ((uint)i < (uint)entries.Length) for its hash-chain loops (see FindValue, TryInsert, Remove). This unsigned comparison serves as both the loop exit condition and an implicit bounds check on entries[i], allowing the JIT to eliminate the redundant range check.

HashSet<T> uses while (i >= 0) for the same purpose. While functionally equivalent (chain indices are always non-negative, with -1 as sentinel), this signed comparison only tells the JIT that i is non-negative — not that it's within array bounds. The JIT must therefore emit a separate bounds check on every entries[i] access.

Note: HashSet<T>.AlternateLookup<TAlternate>.FindValue already uses the unsigned pattern (a do/while loop with a (uint)i >= (uint)entries.Length guard); this PR brings the remaining 7 loops into alignment.
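
The two loop shapes described in the rationale can be compared side by side in a minimal sketch. The Entry layout, the ChainWalkSketch class, and both method names are hypothetical and simplified for illustration; this is not the HashSet<T> source.

```csharp
using System;

public static class ChainWalkSketch
{
    // Hypothetical, simplified entry: Next is the index of the next entry
    // in the chain, or -1 to mark the end of the chain.
    public struct Entry
    {
        public int Value;
        public int Next;
    }

    // Signed form: the condition proves only that i >= 0, so the entries[i]
    // access still needs its own upper-bound check.
    public static int FindSigned(Entry[] entries, int start, int value)
    {
        int i = start;
        while (i >= 0)
        {
            ref Entry entry = ref entries[i];
            if (entry.Value == value) return i;
            i = entry.Next;
        }
        return -1;
    }

    // Unsigned form: in the default unchecked context, -1 casts to
    // uint.MaxValue, which is never less than entries.Length, so a single
    // comparison covers both the sentinel and the array range.
    public static int FindUnsigned(Entry[] entries, int start, int value)
    {
        int i = start;
        while ((uint)i < (uint)entries.Length)
        {
            ref Entry entry = ref entries[i];
            if (entry.Value == value) return i;
            i = entry.Next;
        }
        return -1;
    }
}
```

Both methods return the same result for any well-formed chain; the difference is only in what the loop condition lets the JIT prove about i.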

Changes

All changes are in HashSet.cs, one-line loop condition substitutions:

  • FindItemIndex: 2 loops (value-type and comparer branches)
  • AddIfNotPresent: 2 loops (value-type and comparer branches)
  • Remove: 1 loop
  • AlternateLookup<TAlternate>.Add: 1 loop
  • AlternateLookup<TAlternate>.Remove: 1 loop

JIT codegen

FindItemIndex<int> under FullOpts (x64):

Before (385 bytes): signed loop + separate bounds check

; loop top
cmp ecx, r13d
jae RNGCHKFAIL          ; <-- bounds check
...
; loop bottom
test ecx, ecx
jge LOOP_TOP             ; signed: i >= 0

After (379 bytes): unsigned loop, bounds check eliminated

; loop bottom
cmp r13d, ecx
ja LOOP_TOP              ; unsigned: Length > i (doubles as bounds check)
; no RNGCHKFAIL

Benchmark results

BenchmarkDotNet v0.16.0, Intel Core i9-14900K, .NET 11.0.0-dev, --affinity 1 (pinned to P-core).
Benchmark harness: --coreRun comparing baseline vs optimized CoreLib. Results confirmed stable across multiple runs; suspicious values were re-run with swapped --coreRun order to rule out positional bias.

Int32 (value type, default comparer devirtualized + inlined)

| Benchmark | Size | Ratio | Notes |
|---|---|---|---|
| ContainsTrue | 512 | 0.90 | 10% faster |
| ContainsTrueComparer | 512 | 0.50 | 2x faster (see note below) |
| Remove_Hit | 16 | 0.97 | |
| Remove_Hit | 512 | 0.96 | |
| Remove_Hit | 4096 | 0.94–0.98 | |
| Remove_Miss | all | 1.00 | neutral |
| ContainsFalse | 512 | 1.00 | neutral |
| AddGivenSize | 512 | 1.00 | neutral |
| CreateAddAndRemove | 512 | 1.00 | neutral |
| CreateAddAndClear | 512 | 1.00 | neutral |
| CtorFromCollection | 512 | 1.00 | neutral |
| IterateForEach | 512 | 1.00 | neutral |

ContainsTrueComparer 0.50: This benchmark uses a custom IEqualityComparer<int> wrapping the default comparer, so it exercises FindItemIndex's comparer branch. Confirmed across 3 separate runs (0.50, 0.48, 0.52).

Miss paths unaffected: ContainsFalse and Remove_Miss are neutral as expected — on a miss with a good hash function, the bucket chain is typically empty or has a single entry, so the loop body barely executes and the per-iteration bounds check saving has minimal impact.

Add paths neutral: AddGivenSize and CreateAddAndClear are neutral because Add benchmarks are dominated by memory allocation and resize, not the duplicate-check chain walk.

String (reference type)

| Benchmark | Size | Ratio | Notes |
|---|---|---|---|
| All benchmarks | all | 1.00 | neutral |

The bounds check is still eliminated for string (FindItemIndex: 345→335 bytes), but string hash and equality comparison costs dominate per-element work, making the saved instruction negligible.

Summary

The improvement is concentrated on value types with the default comparer, where EqualityComparer<T>.Default.Equals is devirtualized and inlined to a simple comparison. In that case the bounds check is a meaningful fraction of per-element work in the inner loop.
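
The value-type/default-comparer distinction drawn in this summary follows a common branch-splitting shape, sketched below on a plain linear search. ComparerBranchSketch and IndexOf are hypothetical names for illustration only, and the search is deliberately much simpler than HashSet<T>'s hash-chain walk.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

public static class ComparerBranchSketch
{
    // Hypothetical helper: returns the index of value in items, or -1.
    public static int IndexOf<T>(T[] items, T value, IEqualityComparer<T>? comparer = null)
    {
        if (typeof(T).IsValueType && comparer == null)
        {
            // Value type with no custom comparer: the call goes through
            // EqualityComparer<T>.Default directly, which the JIT can
            // devirtualize and inline for value-type T.
            for (int i = 0; i < items.Length; i++)
            {
                if (EqualityComparer<T>.Default.Equals(items[i], value)) return i;
            }
        }
        else
        {
            // Reference type or custom comparer: every comparison is an
            // interface call on the comparer instance.
            comparer ??= EqualityComparer<T>.Default;
            for (int i = 0; i < items.Length; i++)
            {
                if (comparer.Equals(items[i], value)) return i;
            }
        }
        return -1;
    }
}
```

When the comparison itself shrinks to a single compare instruction, fixed per-iteration costs such as a bounds check make up a larger share of the loop body, which is consistent with where the measured gains appear.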

AlternateLookup<TAlternate> paths: Not exercised by existing benchmarks, but changed for consistency. AlternateLookup<TAlternate>.FindValue already uses the unsigned pattern in the same file, so leaving Add/Remove with while (i >= 0) would create an inconsistency within the same nested type.

No regressions observed.

Alternatives considered

Only 3 of the 7 changed loops have benchmarks that show measurable improvement (FindItemIndex x2, Remove). The remaining 4 (AddIfNotPresent x2, AlternateLookup<TAlternate> Add/Remove) could be left unchanged to minimize the diff. However, that would increase inconsistency: AlternateLookup<TAlternate>.FindValue already uses the unsigned pattern, and a mix of while (i >= 0) and while ((uint)i < (uint)entries.Length) across hash-chain loops in the same file would be harder to reason about than a uniform pattern. Each change is a single mechanical loop-condition substitution with no behavioral difference.

Change while (i >= 0) to while ((uint)i < (uint)entries.Length) in all 7
hash-chain traversal loops, matching the pattern already used in
Dictionary<TKey,TValue>. This lets the JIT eliminate the separate bounds
check on entries[i], as the unsigned loop condition serves as both loop
exit and implicit range check.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 21, 2026 20:42
@github-actions github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 21, 2026
@danmoseley
Member Author

@EgorBot -linux_amd -osx_arm64

using System.Collections.Generic;
using System.Linq;
using BenchmarkDotNet.Attributes;

public class HashSetBench
{
    private sealed class WrapComparer : IEqualityComparer<int>
    {
        public bool Equals(int x, int y) => EqualityComparer<int>.Default.Equals(x, y);
        public int GetHashCode(int obj) => EqualityComparer<int>.Default.GetHashCode(obj);
    }

    private HashSet<int> _setInt;
    private HashSet<int> _setIntComparer;
    private HashSet<string> _setString;
    private int[] _foundInt;
    private int[] _missingInt;
    private string[] _foundString;

    [GlobalSetup]
    public void Setup()
    {
        _foundInt = Enumerable.Range(0, 512).ToArray();
        _missingInt = Enumerable.Range(10000, 512).ToArray();
        _setInt = new HashSet<int>(_foundInt);
        _setIntComparer = new HashSet<int>(_foundInt, new WrapComparer());
        _foundString = _foundInt.Select(i => i.ToString()).ToArray();
        _setString = new HashSet<string>(_foundString);
    }

    [Benchmark]
    public bool ContainsTrue_Int()
    {
        bool r = false;
        var set = _setInt;
        var found = _foundInt;
        for (int i = 0; i < found.Length; i++)
            r ^= set.Contains(found[i]);
        return r;
    }

    [Benchmark]
    public bool ContainsTrueComparer_Int()
    {
        bool r = false;
        var set = _setIntComparer;
        var found = _foundInt;
        for (int i = 0; i < found.Length; i++)
            r ^= set.Contains(found[i]);
        return r;
    }

    [Benchmark]
    public bool ContainsFalse_Int()
    {
        bool r = false;
        var set = _setInt;
        var keys = _missingInt;
        for (int i = 0; i < keys.Length; i++)
            r ^= set.Contains(keys[i]);
        return r;
    }

    [Benchmark]
    public bool Remove_Hit_Int()
    {
        var set = _setInt;
        var keys = _foundInt;
        bool r = false;
        for (int i = 0; i < keys.Length; i++)
        {
            r = set.Remove(keys[i]);
            set.Add(keys[i]);
        }
        return r;
    }

    [Benchmark]
    public bool ContainsTrue_String()
    {
        bool r = false;
        var set = _setString;
        var found = _foundString;
        for (int i = 0; i < found.Length; i++)
            r ^= set.Contains(found[i]);
        return r;
    }
}

@danmoseley danmoseley added area-System.Collections and removed area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Mar 21, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR updates HashSet<T>’s hash-chain traversal loop conditions to use an unsigned index-vs-length comparison, aligning with the established Dictionary<TKey,TValue> pattern so the JIT can eliminate redundant bounds checks in the hot inner loops.

Changes:

  • Replaced while (i >= 0) with while ((uint)i < (uint)entries.Length) in FindItemIndex (2 loops) and AddIfNotPresent (2 loops).
  • Replaced while (i >= 0) with while ((uint)i < (uint)entries.Length) in Remove.
  • Applied the same pattern to AlternateLookup<TAlternate>’s Add and Remove loops for consistency with existing unsigned-guarded traversal in FindValue.

@dotnet-policy-service
Contributor

Tagging subscribers to this area: @dotnet/area-system-collections
See info in area-owners.md if you want to be subscribed.

@danmoseley
Member Author

danmoseley commented Mar 21, 2026

Two other changes were evaluated and dropped:

Remove branch splitting — adding typeof(T).IsValueType && comparer == null guard to Remove (matching the pattern in FindItemIndex/AddIfNotPresent and Dictionary.Remove). This enables devirtualization of EqualityComparer<T>.Default.Equals for value types, but Remove<int> codegen grew from 365 to 376 bytes. Benchmarks showed no measurable difference (ratios 0.97-1.01 across sizes 16/512/4096 for both hit and miss).

Entry.HashCode int to uint: changing Entry.HashCode from int to uint to match Dictionary.Entry.hashCode. Benchmarks showed no benefit on Contains or Remove, and a possible ~8% regression on Add (ratio 1.08 on re-run, consistently above 1.0). This is being investigated separately from this PR.

This analysis was performed by Copilot.

@danmoseley
Member Author

Investigation: Entry.HashCode int→uint (matching Dictionary)

I investigated whether changing Entry.HashCode from int to uint (as Dictionary uses) would provide additional benefit on top of the loop condition changes in this PR.

Result: codegen-neutral. The JIT produces byte-for-byte identical instructions for AddIfNotPresent<int> (624 bytes) regardless of whether HashCode is int or uint. This makes sense — cmp eax, ecx is the same instruction for signed and unsigned equality, and the (uint) cast on GetHashCode() is a no-op at machine level.
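
The claim that equality is insensitive to signedness can be checked with a small sketch. HashCodeCastSketch and its methods are hypothetical names for illustration; the casts are wrapped in unchecked so the example behaves the same in a checked build.

```csharp
using System;

public static class HashCodeCastSketch
{
    // Equality on the signed values.
    public static bool SignedEquals(int a, int b) => a == b;

    // Equality after reinterpreting the same 32 bits as unsigned. The cast
    // changes how the bits are interpreted, not the bits themselves.
    public static bool UnsignedEquals(int a, int b) =>
        unchecked((uint)a) == unchecked((uint)b);

    // True when both forms give the same answer for this pair.
    public static bool Agree(int a, int b) => SignedEquals(a, b) == UnsignedEquals(a, b);
}
```

Because both forms compare the same 32-bit patterns, they agree for every pair of inputs, including negative hash codes.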

The ~8% Add regression I initially measured was benchmarking noise (confirmed by the identical codegen). Not worth the churn for zero codegen difference.


This investigation was performed with GitHub Copilot assistance.

@danmoseley
Member Author

@EgorBot -linux_amd

@danmoseley
Member Author

@EgorBot -linux_amd

using System.Collections.Generic;
using System.Linq;
using BenchmarkDotNet.Attributes;

public class HashSetBench
{
    private sealed class WrapComparer : IEqualityComparer<int>
    {
        public bool Equals(int x, int y) => EqualityComparer<int>.Default.Equals(x, y);
        public int GetHashCode(int obj) => EqualityComparer<int>.Default.GetHashCode(obj);
    }

    private HashSet<int> _setInt;
    private HashSet<int> _setIntComparer;
    private HashSet<string> _setString;
    private int[] _foundInt;
    private int[] _missingInt;
    private string[] _foundString;

    [GlobalSetup]
    public void Setup()
    {
        _foundInt = Enumerable.Range(0, 512).ToArray();
        _missingInt = Enumerable.Range(10000, 512).ToArray();
        _setInt = new HashSet<int>(_foundInt);
        _setIntComparer = new HashSet<int>(_foundInt, new WrapComparer());
        _foundString = _foundInt.Select(i => i.ToString()).ToArray();
        _setString = new HashSet<string>(_foundString);
    }

    [Benchmark]
    public bool ContainsTrue_Int()
    {
        bool r = false;
        var set = _setInt;
        var found = _foundInt;
        for (int i = 0; i < found.Length; i++)
            r ^= set.Contains(found[i]);
        return r;
    }

    [Benchmark]
    public bool ContainsTrueComparer_Int()
    {
        bool r = false;
        var set = _setIntComparer;
        var found = _foundInt;
        for (int i = 0; i < found.Length; i++)
            r ^= set.Contains(found[i]);
        return r;
    }

    [Benchmark]
    public bool ContainsFalse_Int()
    {
        bool r = false;
        var set = _setInt;
        var keys = _missingInt;
        for (int i = 0; i < keys.Length; i++)
            r ^= set.Contains(keys[i]);
        return r;
    }

    [Benchmark]
    public bool Remove_Hit_Int()
    {
        var set = _setInt;
        var keys = _foundInt;
        bool r = false;
        for (int i = 0; i < keys.Length; i++)
        {
            r = set.Remove(keys[i]);
            set.Add(keys[i]);
        }
        return r;
    }

    [Benchmark]
    public bool ContainsTrue_String()
    {
        bool r = false;
        var set = _setString;
        var found = _foundString;
        for (int i = 0; i < found.Length; i++)
            r ^= set.Contains(found[i]);
        return r;
    }
}

@danmoseley
Member Author

my bad, I forgot to include benchmark code so 2026-03-21 22:14:17.015 Too many benchmarks discovered: 4262.

let's try again

@EgorBo
Member

EgorBo commented Mar 22, 2026

my bad, I forgot to include benchmark code so 2026-03-21 22:14:17.015 Too many benchmarks discovered: 4262.

let's try again

Yeah, when no code snippet is provided, it assumes you want dotnet/performance benchmarks. Typically, it expects BDN's --filter to know what kind of benchmarks to run, but the bot has a hard limit (around 50 or so).

@EgorBo
Member

EgorBo commented Mar 22, 2026

@MihuBot

@danmoseley
Member Author

ContainsFalse regression on Turin — codegen analysis

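The EgorBot Turin results show a reproducible regression of roughly 11–15% on ContainsFalse_Int (main/PR ratios of 0.85 and 0.89 across two runs, which correspond to PR/main ratios of 1.17 and 1.12). ARM64 and local Intel are neutral (0.99 on the same main/PR scale). Here's why.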

Root cause: The loop condition change adds entries.Length to the critical path at loop entry.

Baseline: dec ecx; js — the js (jump-if-sign) reuses flags from dec, so it can decide whether to enter the loop with zero extra work.

PR: dec ecx; mov r13d,[entries.Length]; cmp r13d,ecx; jbe — must wait for the entries.Length memory load to complete before the comparison can execute.

For ContainsFalse, every lookup misses. With 512 items in a 521-bucket table (98% load factor), nearly every miss hits an occupied bucket, traverses one entry (hash mismatch), then exits via entry.Next == -1. The loop body is entered exactly once, so the eliminated bounds check (saving 2 instructions inside the loop) doesn't accumulate enough to offset the added entry-path latency.
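
As a rough check of the occupancy claim above, the sketch below models bucket selection as key % bucketCount, with an int's hash code taken to be its value. Both are simplifying assumptions for illustration, and MissPathModel and Simulate are hypothetical names, not part of HashSet<T>.

```csharp
using System;
using System.Linq;

public static class MissPathModel
{
    // Assumed model: hash(key) == key and bucket == key % bucketCount.
    // Returns how many buckets hold at least one stored key, and how many
    // probe keys land in an occupied bucket.
    public static (int OccupiedBuckets, int ProbesOnOccupied) Simulate(
        int bucketCount, int[] stored, int[] probes)
    {
        var occupied = new bool[bucketCount];
        foreach (int key in stored)
        {
            occupied[(uint)key % (uint)bucketCount] = true;
        }

        int occupiedBuckets = occupied.Count(isOccupied => isOccupied);
        int probesOnOccupied = probes.Count(key => occupied[(uint)key % (uint)bucketCount]);
        return (occupiedBuckets, probesOnOccupied);
    }
}
```

With the benchmark's inputs, Simulate(521, Enumerable.Range(0, 512).ToArray(), Enumerable.Range(10000, 512).ToArray()) returns (512, 503) under this model: 512 of the 521 buckets are occupied, and 503 of the 512 missing keys (about 98%) land in an occupied bucket and walk one entry before exiting.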

Why only Turin (Zen 5)? The local Intel machine (the i9-14900K, whose P-cores are Raptor Cove, a Golden Cove derivative) and Apple M2 both showed 0.99; their out-of-order execution likely hides the entries.Length load latency via speculative execution. Zen 5 appears more sensitive to this specific dependency chain.

Tradeoff assessment: The wins clearly dominate:

  • ContainsTrue_Int: +5–7% on all platforms
  • ContainsTrueComparer_Int: +71–75% on all platforms
  • ContainsFalse_Int: -11–15% on Turin only (neutral elsewhere)
  • Remove, String: neutral everywhere

Real workloads rarely consist of 100% misses, so any mix of hits and misses will net positive.


This analysis was performed with GitHub Copilot assistance.

@danmoseley
Member Author

danmoseley commented Mar 22, 2026

Benchmark summary (egorbot)

All benchmarks use 512-element HashSet<int> or HashSet<string>. Ratio = PR/main (lower is faster).

AMD EPYC 9V45 (Zen 5, Turin) -- two runs:

| Benchmark | Run 1 | Run 2 | Verdict |
|---|---|---|---|
| ContainsTrue_Int | 0.95 | 0.94 | Faster |
| ContainsTrueComparer_Int | 0.57 | 0.57 | Faster |
| ContainsFalse_Int | 1.17 | 1.12 | Slower |
| Remove_Hit_Int | 1.00 | 1.00 | Same |
| ContainsTrue_String | 1.01 | 0.99 | Same |

Apple M2 (ARM64):

| Benchmark | Ratio | Verdict |
|---|---|---|
| ContainsTrue_Int | 0.93 | Faster |
| ContainsTrueComparer_Int | 0.58 | Faster |
| ContainsFalse_Int | 1.01 | Same |
| Remove_Hit_Int | 0.99 | Same |
| ContainsTrue_String | 1.00 | Same |

ContainsTrue and ContainsTrueComparer improve on all platforms. ContainsFalse regresses on AMD Turin (Zen 5) only -- not on Intel x64 or Apple ARM64 (see codegen analysis above -- extra entries.Length load on the miss-path critical path, hidden by OOO execution on other microarchitectures). Remove and String are neutral everywhere.
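
The PR/main ratios in this summary and the percentage figures in the earlier tradeoff assessment describe the same measurements on different scales. A minimal sketch of the conversion follows, assuming the percentages are throughput changes (the reciprocal of the time ratio, minus one); RatioSketch and ThroughputChangePercent are hypothetical names.

```csharp
using System;

public static class RatioSketch
{
    // ratio = PR time / main time (lower is faster).
    // Throughput change in percent = (1 / ratio - 1) * 100.
    public static double ThroughputChangePercent(double ratio) =>
        (1.0 / ratio - 1.0) * 100.0;
}
```

Under that assumption, a ratio of 0.57 works out to roughly +75% throughput, while 1.17 and 1.12 work out to roughly -15% and -11%, matching the ranges quoted for ContainsTrueComparer_Int and ContainsFalse_Int on Turin.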

@danmoseley danmoseley requested a review from stephentoub March 22, 2026 02:17
@danmoseley
Member Author

OK, I think all the evidence is in and this is good. Ready for review.

@danmoseley
Member Author

Literally all validation legs passed? 🤯🎉

@stephentoub stephentoub merged commit b2bba6d into dotnet:main Mar 22, 2026
151 checks passed
@danmoseley danmoseley deleted the hashset-perf-opt branch March 22, 2026 06:39