Skip to content

Commit 308547c

Browse files
author
Per G. da Silva
committed
Add resolution trace logging
Signed-off-by: Per G. da Silva <pegoncal@redhat.com>
1 parent c2df9e7 commit 308547c

18 files changed

Lines changed: 1130 additions & 8 deletions

File tree

Lines changed: 311 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,311 @@
1+
# Resolution Trace Data Collection Guide
2+
3+
All instrumented log lines use the prefix `RESOLUTION_TRACE` with key=value pairs.
4+
Every line includes `phase=` and `event=BEGIN|DONE`. DONE lines include `duration_ms=`.
5+
6+
## Collecting Logs
7+
8+
```bash
9+
# From an OLM pod on an OpenShift cluster
10+
oc logs deploy/olm-operator -n openshift-operator-lifecycle-manager > olm.log
11+
12+
# Extract only trace lines
13+
grep RESOLUTION_TRACE olm.log > traces.log
14+
```
15+
16+
## Phase Reference
17+
18+
### Orchestration (step_resolver.go)
19+
20+
| Phase | Type | Key Fields |
21+
|---|---|---|
22+
| `resolve_steps` | wall clock | `namespace`, `subscription_count` |
23+
| `list_subscriptions` | I/O (API) | `namespace`, `subscription_count` |
24+
| `list_operator_groups` | I/O (API) | `namespace`, `count` |
25+
| `core_resolve` | hybrid | `namespace`, `subscription_count`, `namespace_count`, `operator_count` |
26+
| `generate_steps` | compute | `namespace`, `operator_count`, `step_count`, `bundle_lookup_count` |
27+
28+
### Core Resolution (resolver.go)
29+
30+
| Phase | Type | Key Fields |
31+
|---|---|---|
32+
| `cache_namespaced` | I/O coord | `namespace`, `namespace_count` |
33+
| `existing_variables` | compute | `namespace`, `entry_count`, `variable_count` |
34+
| `subscription_variables` | compute | `namespace`, `subscription_count`, `total_variables` |
35+
| `single_sub_variables` | compute | `namespace`, `subscription`, `package`, `channel`, `catalog`, `has_current`, `variable_count` |
36+
| `add_invariants` | compute | `namespace`, `variable_count_before`, `variable_count_after`, `invariants_added` |
37+
| `cache_error_check` | I/O wait | `namespace` |
38+
| `sat_solve` | compute | `namespace`, `variable_count`, `subscription_count`, `solved_count` |
39+
| `extract_results` | compute | `namespace`, `operator_count` |
40+
41+
### Per-Subscription Detail (resolver.go)
42+
43+
| Phase | Type | Key Fields |
44+
|---|---|---|
45+
| `cache_find_entries` | compute | `catalog`, `entry_count` |
46+
| `sort_channel` | compute | `entry_count`, `sorted_count` |
47+
| `build_candidates` | compute | `sorted_bundle_count`, `candidate_count` |
48+
| `bundle_variables_dfs` | compute | `initial_stack_size`, `visited_count`, `variables_count`, `bundles_processed`, `dependency_predicate_count`, `total_dep_bundles` |
49+
50+
### Cache (cache.go)
51+
52+
| Phase | Type | Key Fields |
53+
|---|---|---|
54+
| `cache_populate` | I/O coord | `source_count`, `cache_hit_count`, `cache_miss_count` |
55+
56+
### Catalog Snapshots (source_registry.go, source_csvs.go)
57+
58+
| Phase | Type | Key Fields |
59+
|---|---|---|
60+
| `catalog_snapshot` | I/O (gRPC) | `catalog`, `bundle_count`, `package_count` |
61+
| `list_bundles` | I/O (gRPC) | `catalog`, `bundle_count` |
62+
| `get_packages` | I/O (gRPC) | `catalog`, `package_count`, `call_count`, `duration_ms` |
63+
| `csv_snapshot` | I/O (API) | `namespace`, `csv_count`, `subscription_count`, `entry_count` |
64+
65+
### SAT Solver (solver/solve.go)
66+
67+
| Phase | Type | Key Fields |
68+
|---|---|---|
69+
| `sat_add_constraints` | compute | `variable_count`, `constraint_count`, `literal_count` |
70+
| `sat_search` | compute | `anchor_count`, `assumption_count`, `final_assumption_count`, `outcome` |
71+
| `sat_minimize` | compute | `extras_count`, `excluded_count`, `cardinality_bound`, `result_count` |
72+
73+
---
74+
75+
## Experiment 1: Resolution Latency vs Subscription Count
76+
77+
**Goal:** How does total resolution time scale with the number of Subscriptions in a namespace?
78+
79+
### Collect total resolution time per namespace
80+
81+
```bash
82+
grep "phase=resolve_steps event=DONE" traces.log \
83+
| grep -oP 'namespace=\S+ subscription_count=\d+ .*?duration_ms=\d+' \
84+
| sort
85+
```
86+
87+
### Collect core_resolve time (excludes API listing overhead)
88+
89+
```bash
90+
grep "phase=core_resolve event=DONE" traces.log \
91+
| grep -oP 'namespace=\S+ subscription_count=\d+ .*?duration_ms=\d+'
92+
```
93+
94+
### Per-subscription breakdown within a namespace
95+
96+
```bash
97+
NS="my-namespace"
98+
grep "phase=single_sub_variables event=DONE" traces.log \
99+
| grep "namespace=$NS" \
100+
| grep -oP 'subscription=\S+ package=\S+ catalog=\S+ variable_count=\d+ .*?duration_ms=\d+'
101+
```
102+
103+
### Subscription count vs SAT solve time
104+
105+
```bash
106+
grep "phase=sat_solve event=DONE" traces.log \
107+
| grep -oP 'namespace=\S+ variable_count=\d+ subscription_count=\d+ .*?duration_ms=\d+'
108+
```
109+
110+
---
111+
112+
## Experiment 2: Resolution Latency vs Catalog Size
113+
114+
**Goal:** How does resolution time change based on which catalog is used and how many bundles it contains?
115+
116+
### Catalog snapshot fetch times (gRPC round-trip cost)
117+
118+
```bash
119+
grep "phase=catalog_snapshot event=DONE" traces.log \
120+
| grep -oP 'catalog=\S+ bundle_count=\d+ package_count=\d+ .*?duration_ms=\d+'
121+
```
122+
123+
### gRPC ListBundles time per catalog
124+
125+
```bash
126+
grep "phase=list_bundles event=DONE" traces.log \
127+
| grep -oP 'catalog=\S+ bundle_count=\d+ .*?duration_ms=\d+'
128+
```
129+
130+
### gRPC GetPackage aggregate time per catalog
131+
132+
```bash
133+
grep "phase=get_packages event=DONE" traces.log \
134+
| grep -oP 'catalog=\S+ package_count=\d+ call_count=\d+ duration_ms=\d+'
135+
```
136+
137+
### Cache hit/miss rate (are snapshots being reused?)
138+
139+
```bash
140+
grep "phase=cache_populate event=DONE" traces.log \
141+
| grep -oP 'source_count=\d+ cache_hit_count=\d+ cache_miss_count=\d+ .*?duration_ms=\d+'
142+
```
143+
144+
### Per-subscription: which catalog and how many entries matched
145+
146+
```bash
147+
grep "phase=cache_find_entries event=DONE" traces.log \
148+
| grep -oP 'catalog=\S+ entry_count=\d+ .*?duration_ms=\d+'
149+
```
150+
151+
---
152+
153+
## Experiment 3: Dependency Graph Complexity
154+
155+
**Goal:** How does the DFS dependency resolution scale with bundles and dependencies?
156+
157+
### DFS work per resolution
158+
159+
```bash
160+
grep "phase=bundle_variables_dfs event=DONE" traces.log \
161+
| grep -oP 'bundles_processed=\d+ dependency_predicate_count=\d+ total_dep_bundles=\d+ visited_count=\d+ variables_count=\d+ .*?duration_ms=\d+'
162+
```
163+
164+
### Invariant growth (GVK/package conflict constraints)
165+
166+
```bash
167+
grep "phase=add_invariants event=DONE" traces.log \
168+
| grep -oP 'namespace=\S+ variable_count_before=\d+ variable_count_after=\d+ invariants_added=\d+ .*?duration_ms=\d+'
169+
```
170+
171+
---
172+
173+
## Experiment 4: SAT Solver Internals
174+
175+
**Goal:** How does the SAT problem size affect solve time?
176+
177+
### Problem size (variables, constraints, literals)
178+
179+
```bash
180+
grep "phase=sat_add_constraints event=DONE" traces.log \
181+
| grep -oP 'variable_count=\d+ constraint_count=\d+ literal_count=\d+ .*?duration_ms=\d+'
182+
```
183+
184+
### Search complexity (anchor count, backtracking)
185+
186+
```bash
187+
grep "phase=sat_search event=DONE" traces.log \
188+
| grep -oP 'outcome=\d+ anchor_count=\d+ final_assumption_count=\d+ .*?duration_ms=\d+'
189+
```
190+
191+
### Cardinality minimization cost
192+
193+
```bash
194+
grep "phase=sat_minimize event=DONE" traces.log \
195+
| grep -oP 'extras_count=\d+ excluded_count=\d+ cardinality_bound=\d+ result_count=\d+ .*?duration_ms=\d+'
196+
```
197+
198+
---
199+
200+
## Experiment 5: Installed Operators (Virtual Catalog)
201+
202+
**Goal:** How does the number of already-installed CSVs affect resolution?
203+
204+
### CSV snapshot size and cost
205+
206+
```bash
207+
grep "phase=csv_snapshot event=DONE" traces.log \
208+
| grep -oP 'namespace=\S+ csv_count=\d+ subscription_count=\d+ entry_count=\d+ .*?duration_ms=\d+'
209+
```
210+
211+
### Existing variables processing time
212+
213+
```bash
214+
grep "phase=existing_variables event=DONE" traces.log \
215+
| grep -oP 'namespace=\S+ entry_count=\d+ variable_count=\d+ .*?duration_ms=\d+'
216+
```
217+
218+
---
219+
220+
## Filtering by Namespace
221+
222+
Every query above can be scoped to a single namespace:
223+
224+
```bash
225+
NS="openshift-operators"
226+
grep "RESOLUTION_TRACE" traces.log | grep "namespace=$NS"
227+
```
228+
229+
For phases that use `catalog=` instead of `namespace=` (snapshot phases), the catalog
230+
field is formatted as `namespace/name`, so filter with:
231+
232+
```bash
233+
grep "catalog=$NS/" traces.log
234+
```
235+
236+
---
237+
238+
## Correlation: Linking Phases to the Same Resolution Cycle
239+
240+
The `resolve_steps` phase (in step_resolver.go) injects a `resolution_id=` field
241+
into the logrus logger. All phases emitted from **step_resolver.go** carry this
242+
field automatically. Phases from deeper layers (resolver.go, solver) do not carry
243+
`resolution_id` in the log message itself, but they appear between the `resolve_steps`
244+
BEGIN and DONE timestamps and share the same `namespace=` value.
245+
246+
### Get all phases for a specific resolution cycle by resolution_id
247+
248+
```bash
249+
ID="a1b2c3d4"
250+
grep "resolution_id=$ID" olm.log | grep RESOLUTION_TRACE
251+
```
252+
253+
### Correlate by timestamp window (for phases without resolution_id)
254+
255+
```bash
256+
# 1. Find the resolve_steps BEGIN/DONE timestamps for a namespace
257+
grep "phase=resolve_steps" traces.log | grep "namespace=$NS"
258+
259+
# 2. Use awk to filter traces within that time window
260+
# (all traces between a BEGIN and its matching DONE belong to one cycle)
261+
```
262+
263+
---
264+
265+
## Quick One-Liner: Full Pipeline Summary for a Namespace
266+
267+
```bash
268+
NS="openshift-operators"
269+
echo "=== Resolution Timeline ==="
270+
grep "RESOLUTION_TRACE.*namespace=$NS.*event=DONE" traces.log \
271+
| grep -oP 'phase=\S+ .*?duration_ms=\d+' \
272+
| column -t
273+
274+
echo ""
275+
echo "=== Catalog Snapshots ==="
276+
grep "phase=catalog_snapshot event=DONE" traces.log \
277+
| grep -oP 'catalog=\S+ bundle_count=\d+ package_count=\d+ .*?duration_ms=\d+' \
278+
| column -t
279+
280+
echo ""
281+
echo "=== SAT Solver Detail ==="
282+
grep "phase=sat_" traces.log \
283+
| grep "event=DONE" \
284+
| grep -oP 'phase=\S+ .*' \
285+
| column -t
286+
```
287+
288+
---
289+
290+
## Generating a CSV for Analysis
291+
292+
```bash
293+
# Example: subscription count vs resolution time
294+
echo "namespace,subscription_count,duration_ms" > resolution_by_subs.csv
295+
grep "phase=resolve_steps event=DONE" traces.log \
296+
| sed -E 's/.*namespace=(\S+).*subscription_count=([0-9]+).*duration_ms=([0-9]+).*/\1,\2,\3/' \
297+
>> resolution_by_subs.csv
298+
299+
# Example: catalog snapshot size vs fetch time
300+
echo "catalog,bundle_count,package_count,duration_ms" > catalog_snapshots.csv
301+
grep "phase=catalog_snapshot event=DONE" traces.log \
302+
| sed -E 's/.*catalog=(\S+).*bundle_count=([0-9]+).*package_count=([0-9]+).*duration_ms=([0-9]+).*/\1,\2,\3,\4/' \
303+
>> catalog_snapshots.csv
304+
305+
# Example: SAT problem size vs solve time
306+
echo "namespace,variable_count,subscription_count,solved_count,duration_ms" > sat_solve.csv
307+
grep "phase=sat_solve event=DONE" traces.log \
308+
| grep -v "error=true" \
309+
| sed -E 's/.*namespace=(\S+).*variable_count=([0-9]+).*subscription_count=([0-9]+).*solved_count=([0-9]+).*duration_ms=([0-9]+).*/\1,\2,\3,\4,\5/' \
310+
>> sat_solve.csv
311+
```

staging/operator-lifecycle-manager/pkg/controller/registry/resolver/cache/cache.go

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -150,6 +150,8 @@ func (c *Cache) Namespaced(namespaces ...string) MultiCatalogOperatorFinder {
150150
CachePopulateTimeout = time.Minute
151151
)
152152

153+
populateStart := cacheTraceBegin(c.logger, "cache_populate")
154+
153155
sources := c.sp.Sources(namespaces...)
154156

155157
result := NamespacedOperatorCache{
@@ -179,7 +181,9 @@ func (c *Cache) Namespaced(namespaces ...string) MultiCatalogOperatorFinder {
179181
}
180182
}()
181183

184+
cacheHits := len(sources) - len(misses)
182185
if len(misses) == 0 {
186+
cacheTraceDone(c.logger, "cache_populate", populateStart, cacheKV("source_count", len(sources)), cacheKV("cache_hit_count", cacheHits), cacheKV("cache_miss_count", 0))
183187
return &result
184188
}
185189

@@ -236,6 +240,7 @@ func (c *Cache) Namespaced(namespaces ...string) MultiCatalogOperatorFinder {
236240
}(ctx, &hdr, sources[miss])
237241
}
238242

243+
cacheTraceDone(c.logger, "cache_populate", populateStart, cacheKV("source_count", len(sources)), cacheKV("cache_hit_count", cacheHits), cacheKV("cache_miss_count", len(misses)))
239244
return &result
240245
}
241246

Lines changed: 23 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,23 @@
1+
package cache
2+
3+
import (
4+
"fmt"
5+
"strings"
6+
"time"
7+
8+
"github.com/sirupsen/logrus"
9+
)
10+
11+
func cacheTraceBegin(log logrus.StdLogger, phase string, kvs ...string) time.Time {
12+
log.Printf("RESOLUTION_TRACE phase=%s event=BEGIN %s", phase, strings.Join(kvs, " "))
13+
return time.Now()
14+
}
15+
16+
func cacheTraceDone(log logrus.StdLogger, phase string, start time.Time, kvs ...string) {
17+
ms := time.Since(start).Milliseconds()
18+
log.Printf("RESOLUTION_TRACE phase=%s event=DONE duration_ms=%d %s", phase, ms, strings.Join(kvs, " "))
19+
}
20+
21+
func cacheKV(key string, value interface{}) string {
22+
return fmt.Sprintf("%s=%v", key, value)
23+
}

0 commit comments

Comments
 (0)