feat: add partial broadcast for cached values#3218
Conversation
| end |
| defp maybe_broadcast_cached(cache, cache_key, value) do |
|   Logflare.Utils.Tasks.start_child(fn -> |
Thinking that instead of creating a task for each broadcast, we could have a dedicated GenServer for this, to avoid the overhead of ephemeral task processes.
I can add a benchmark to compare the two, but I'd rather avoid a single process handling cross-node broadcasts. AFAIK :erpc.multicast still awaits the spawn_request reply, so if one of the node connections is slow, it could potentially block all cache broadcasts from sending and grow the process's message queue.
I looked into it a bit more and it seems like Tasks are unnecessary: multicast uses spawn_request with reply: :no, which means it returns immediately.
# since node=nil doesn't exist, it probably means we are not waiting for `send/2` to finish
iex> :erlang.spawn_request(_bad_node_name = nil, :erpc, :execute_cast, [Function, :identity, [1]], reply: :no)
#==> #Reference<0.2294785123.1844445189.249014>
# and it's pretty much instant
iex> :timer.tc fn -> :erlang.spawn_request(_bad_node_name = nil, :erpc, :execute_cast, [Function, :identity, [1]], reply: :no) end
#==> {12, #Reference<0.2294785123.1844445189.249084>}

Side-note: maybe we can add global cache broadcast rate limiting similar to https://github.com/getsentry/sentry-elixir/blob/master/lib/sentry/logger_handler/rate_limiter.ex to avoid sending out too many messages, just as a precaution.
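If the Task wrapper is dropped, the broadcast helper could reduce to a direct multicast. A rough sketch, not the PR's actual code: the node selection and the `:receive_gossip` function name are illustrative assumptions.

```elixir
# Sketch only: :erpc.multicast/4 issues spawn_request with reply: :no,
# so it returns immediately and never awaits the remote nodes.
defp maybe_broadcast_cached(cache, cache_key, value) do
  # Pick a small random subset of connected nodes (count is illustrative).
  nodes = Enum.take_random(Node.list(), 3)

  # Each selected node runs receive_gossip/3 in a freshly spawned process.
  :erpc.multicast(nodes, __MODULE__, :receive_gossip, [cache, cache_key, value])
  :ok
end
```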
@Ziinc I think it's ready for the initial review!
| "LOGFLARE_CACHE_BROADCAST_RATIO" |> System.get_env("0.1") |> String.to_float() |
| cache_broadcast_max_nodes = |
|   "LOGFLARE_CACHE_BROADCAST_MAX_NODES" |> System.get_env("5") |> String.to_integer() |
I'm not sure how it should be configured. For now I went with "global" env vars that configure all context caches with the same values.
Let's default to 3 nodes, and rename the env var to LOGFLARE_CACHE_GOSSIP_MAX_NODES.
I wonder if we really want to name it gossip. In the current implementation it's not really gossip; it's just a one-hop partial broadcast. And from the discussion on Slack, we don't actually want real gossiping, since it would result in more network load due to redundant messaging (at least in my current understanding).
For the initial implementation we're looking at 1 hop, but if network messaging does not go through the roof, then multi-hop is ideal due to the benefits when the cluster size is large.
Having the hop count, hop probability, and the max message throughput as env var configs would be good so that we can experiment with this.
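The knobs discussed above could land in runtime.exs roughly like this. The env var names and defaults below are hypothetical placeholders, not the ones that were merged.

```elixir
# Hypothetical runtime.exs fragment for experimenting with propagation:
config :logflare, :context_cache_gossip,
  # how many hops a broadcast may travel (1 = one-hop partial broadcast)
  hop_count: "LOGFLARE_CACHE_GOSSIP_HOP_COUNT" |> System.get_env("1") |> String.to_integer(),
  # probability that a receiving node forwards the message another hop
  hop_probability:
    "LOGFLARE_CACHE_GOSSIP_HOP_PROBABILITY" |> System.get_env("0.5") |> String.to_float(),
  # upper bound on outgoing gossip messages per second
  max_messages_per_second:
    "LOGFLARE_CACHE_GOSSIP_MAX_MPS" |> System.get_env("100") |> String.to_integer()
```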
| Enum.map(list_caches_with_metrics(), fn {cache, _} -> cache end) |
| end |
| defp wal_tombstone_specs do |
I didn't want to create another module for it since it's not like a normal ContextCache, so I wrote the spec here. This "cache" is used to remember recently deleted records so that they can be dropped from "cache broadcasts", avoiding a race condition: record fetch miss -> lookup -> cache broadcast sent -> record deleted -> WAL broadcast sent -> WAL broadcast arrives first -> cache broadcast arrives later with the stale value.
A separate ContextCache.WalTombstoneCache would probably be better. We have a few ephemeral caches like these, as you have already identified, so it is fine.
They're differentiated by the .Cache suffix.
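The race-condition guard described above boils down to checking the tombstone cache before accepting an incoming broadcast. A sketch with hypothetical names (`handle_gossip/3` and `tombstoned?/2` are not from the PR; `extract_pkeys/1` mirrors the PR's helper):

```elixir
# Sketch: drop an incoming cache broadcast if any of its primary keys
# was tombstoned by a more recent WAL delete/update.
def handle_gossip(cache, key, value) do
  if value |> extract_pkeys() |> Enum.any?(&tombstoned?(cache, &1)) do
    # The record changed after this broadcast was sent; caching it
    # would resurrect a stale value.
    :dropped
  else
    Cachex.put(cache, key, value)
    :cached
  end
end
```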
| end |
| end |
| describe "broadcasts" do |
I can also add a more realistic test similar to the Phoenix PubSub distributed tests, where extra nodes are started and connected and "cache broadcast" side-effects are tested on them. Or I can use mocks.
A distributed test harness would be good for this and other broadcasting tests.
I had tried to use LocalCluster for this before but couldn't get a nice testing flow.
would prefer avoiding mocks where possible
I've added Logflare.ContextCache.GossipClusterTest; it's a bit clunky but seems to work for this use case.
| System.get_env("LOGFLARE_CACHE_BROADCAST_ENABLED", default_cache_broadcast_enabled) == "true" |
| cache_broadcast_ratio = |
|   "LOGFLARE_CACHE_BROADCAST_RATIO" |> System.get_env("0.1") |> String.to_float() |
Let's hardcode this value in the logic and let the user set the max broadcast nodes.
A default of 0.05 would be better (5% of the cluster).
So in a 100-node cluster, at most 5 nodes would receive an update (assuming the user sets a high max).
I've updated the defaults and renamed the env vars.
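With the defaults discussed above, target selection is just a ratio capped by a maximum. A sketch (the function name and defaults are illustrative):

```elixir
# Ratio 0.05 with max_nodes 3: in a 100-node cluster this yields
# ceil(100 * 0.05) = 5 candidates, capped to 3 recipients.
defp gossip_targets(nodes, ratio \\ 0.05, max_nodes \\ 3) do
  count = nodes |> length() |> Kernel.*(ratio) |> ceil() |> min(max_nodes)
  Enum.take_random(nodes, count)
end
```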
| # allows excluding heavy caches via: LOGFLARE_CACHE_BROADCAST_EXCLUDE="auth,saved_searches" |
| excluded_caches = |
|   System.get_env("LOGFLARE_CACHE_BROADCAST_EXCLUDE", "") |
|   |> String.split(",", trim: true) |
|   |> Enum.map(&String.trim/1) |
|   |> MapSet.new() |
I don't think having an explicit exclusion list is a good approach; that would be too much configuration for something that should be a transparent optimization.
| cache_broadcasts = |
|   Map.new(known_caches, fn {short_name, name} -> |
|     config = %{ |
|       ratio: cache_broadcast_ratio, |
|       max_nodes: cache_broadcast_max_nodes, |
|       enabled: cache_broadcast_enabled? and not MapSet.member?(excluded_caches, short_name) |
|     } |
|     {name, config} |
|   end) |
| config :logflare, :cache_broadcasts, cache_broadcasts |
Storing the ratio and the max_nodes in the app env would be enough; this seems unnecessary. All caches should be broadcast.
Now it's:

config :logflare, :context_cache_gossip, %{
  enabled: cache_gossip_enabled?,
  ratio: cache_gossip_ratio,
  max_nodes: cache_gossip_max_nodes
}
| def start_link(arg), do: Supervisor.start_link(__MODULE__, arg, name: __MODULE__) |
| context_caches_with_metrics = Logflare.ContextCache.Supervisor.list_caches_with_metrics() |
| wal_tombstones = Logflare.ContextCache.wal_tombstones_cache_name() |
A separate module name would make this easier.
I've added Logflare.ContextCache.Tombstones.Cache
| counter("logflare.context_cache.broadcast.count", |
|   event_name: "logflare.context_cache.broadcast.stop", |
|   tags: [:cache, :enabled], |
|   description: "Total cache broadcast attempts" |
| ), |
| distribution("logflare.context_cache.broadcast.stop.duration", |
|   tags: [:cache, :enabled], |
|   unit: {:native, :millisecond}, |
|   description: "Latency of dispatching the cache broadcast" |
| ), |
| counter("logflare.context_cache.receive_broadcast.count", |
|   event_name: "logflare.context_cache.receive_broadcast.stop", |
|   tags: [:cache, :action], |
|   description: "Total cache broadcasts received and their outcome (cached or dropped)" |
| ), |
| distribution("logflare.context_cache.receive_broadcast.stop.duration", |
|   tags: [:cache, :action], |
|   unit: {:native, :millisecond}, |
|   description: "Latency of processing an incoming cache broadcast" |
We should make clear in the event_name that it is for the local gossip mechanism of the ContextCache on fetching nodes, and not necessarily from the node with the TransactionBroadcaster that broadcasts WAL updates.
Clarification is needed due to upcoming changes here, where the TransactionBroadcaster will be broadcasting updated values with the WAL.
The metrics now have context_cache_gossip in their name
"logflare.context_cache_gossip.multicast.count"
"logflare.context_cache_gossip.multicast.stop.duration"
"logflare.context_cache_gossip.receive.count"
"logflare.context_cache_gossip.receive.stop.duration"
@ruslandoga just to clarify, we do want a full gossip protocol, but with safe initial defaults to minimize risk, since this is experimental. We want to make all inputs to the propagation configurable so that we can experiment with it.
I am thinking about renaming Tombstones to RecentlyBusted or something like that, since it's not just for deleted records but also for updated ones. After #3184, however, it might be that Tombstones/RecentlyBusted would be just for deleted records, but I am not sure yet.
| # Explicitly ignore high-volume/ephemeral caches |
| def maybe_gossip(Logflare.Logs.LogEvents.Cache, _key, _value), do: :ok |
| def maybe_gossip(Logflare.Auth.Cache, _key, _value), do: :ok |
I've added the auth cache here as well, but I'm not sure about it. For one thing, it has unusual return types like {:ok, value, value}, and it's scary to mis-cache those.
| defp extract_pkeys(%{id: id}), do: [id] |
| defp extract_pkeys(v) do |
|   Logger.warning("Unable to extract primary key from gossip value: #{inspect(v)}") |
Some values I've noticed that we can't extract primary keys for (and therefore can't tombstone / prevent race conditions):

- `[{"event_message", [{{:string_contains, "testing"}, {:route, {1312, 0}}}]}]` from the Rules cache at key `{:rules_tree_by_source_id, [12104]}` -- what's the rule id?
- `{:error, :token_not_found}` for the Auth cache at key `{:verify_access_token, ["vIby58vR7h", []]}` -- looks like a memoized call; should those be gossiped at all?
- `{:ok, %Logflare.OauthAccessTokens.OauthAccessToken{}, %Logflare.User{}}` for the Auth cache at key `{:verify_access_token, ["940e5d63954fa0e99f2dc2ae0fa1d032f5d07cda655459b4b2f2fc4df6d255d4", ["private"]]}` -- is the tombstone for the user or the token?

I'll prevent the Auth and Rules caches from gossiping for now since it's unclear how best to extract pkeys for tombstoning.
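Excluding those caches can stay as explicit function-head clauses, extending the pattern quoted earlier in this thread. A sketch; the Rules cache module name and the catch-all clause are assumptions:

```elixir
# Values from these caches have no reliably extractable primary key,
# so skip gossiping them entirely (see the examples above).
def maybe_gossip(Logflare.Auth.Cache, _key, _value), do: :ok
def maybe_gossip(Logflare.Rules.Cache, _key, _value), do: :ok
# All other caches gossip as usual.
def maybe_gossip(cache, key, value), do: multicast(cache, key, value)
```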
| @telemetry_handler_id "context-cache-gossip-logger" |
| def attach_logger do |
This can aid troubleshooting, or just checking in with a node. The logs I'm adding there are probably too noisy to be on full-time; these functions can be removed later.
| end |
| end |
| def multicast(Cachex.Spec.cache(name: cache), key, value) do |
| @doc false |
| def record_tombstones(context_pkeys) when is_list(context_pkeys) do |
|   # Writes a short-lived marker for a primary key indicating it was recently updated or deleted. |
|   # Incoming cache multicasts check this tombstone cache to determine if their payload could be stale. |
| import Logflare.Factory |
| alias Ecto.Adapters.SQL.Sandbox, as: EctoSandbox |
| alias Logflare.{ContextCache, Users} |

Suggested change:

| - alias Logflare.{ContextCache, Users} |
| + alias Logflare.ContextCache |
| + alias Logflare.Users |
| if not Node.alive?() do |
|   case :net_kernel.start(:"test@127.0.0.1", %{}) do |
|     {:ok, _} -> |
|       :ok |
|     {:error, reason} -> |
|       raise """ |
|       ============================================= |
|       Failed to start distributed Erlang for tests. |
|       Please make sure `epmd` is running: |
|           epmd -daemon |
|       ============================================= |
|       Underlying error: #{inspect(reason)} |
|       """ |
|   end |
I think we should tear down :net_kernel to avoid funky behaviour in other tests. It is fine for now, but would not be if we were to introduce more similar distributed testing.
Also, tearing it down will slightly improve CI test speed.
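The teardown could hang off the test setup via on_exit. A minimal sketch; where exactly it lives (setup_all vs a case template) is an open choice:

```elixir
# Stop distribution after the suite so other tests run on a
# non-distributed node again.
on_exit(fn ->
  # Counterpart to the :net_kernel.start/2 call in the setup above.
  :net_kernel.stop()
end)
```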
* add partial cache broadcast on misses
* credo
* credo again
* naming
* comment
* comments
* naming
* continue
* credo
* wording
* continue
* fewer changes
* add refresh
* begin tests
* seems to work
* add key to telemetry info
* add key to telemetry info
* doc
* fix credo warning
* extract gossip functions into own module
* fix mime error
* continue
* move epmd setup to ci
* use ; instead of &&
* seems to work
* cleanup
* cleanup again
* add big error if epmd is not running
* cleanup
* format error
* improve float parsing (allow 0 and 1)
* add warnings
* pipe
* comment on epmd
* move comment
* move unboxed runs to setup from setup_all
* cache user
* continue
* cleanup
* continue
* more logs
* wording
* comment
* naming
* eh
* notice

---------

Co-authored-by: Ziinc <Ziinc@users.noreply.github.com>
ANL-1352