
SREP-3620: Add feature-testing evidence collection commands#862

Open
MitaliBhalla wants to merge 1 commit into openshift:master from MitaliBhalla:SREP-3620

Conversation

MitaliBhalla (Contributor) commented Mar 11, 2026

Adds new osdctl commands that help SRE teams collect evidence during feature validation testing (IAM policies, operators, etc.), reducing manual toil and speeding up validation.

New Commands

osdctl aws cloudtrail errors - Surface AWS permission errors from CloudTrail
osdctl cluster snapshot - Capture point-in-time cluster state (nodes, operators, namespaces)
osdctl cluster diff - Compare two snapshots to identify changes
osdctl evidence collect - All-in-one evidence collection

Usage Examples

Surface CloudTrail permission errors

osdctl aws cloudtrail errors -C <cluster-id> --since 1h

Capture cluster state before/after testing

osdctl cluster snapshot -C <cluster-id> -o before.yaml
osdctl cluster snapshot -C <cluster-id> -o after.yaml
osdctl cluster diff before.yaml after.yaml

Collect all evidence at once

osdctl evidence collect -C <cluster-id> --output ./evidence/

Features
Supports both ROSA Classic and ROSA HCP clusters
Detects HCP clusters and provides relevant guidance
JSON output option for programmatic use
AWS Console links for CloudTrail events

Testing

Tested on ROSA Classic cluster mbhalla-osdctl (v4.21.4):

  • CloudTrail errors: Found 18 permission errors
  • Snapshot: Captured 7 nodes, 34 operators, 20 MachineConfigs
  • Diff: Correctly detected changes between snapshots
  • Evidence: Generated summary and evidence.yaml

Jira

https://issues.redhat.com/browse/SREP-3620

Summary by CodeRabbit

  • New Features

    • Added osdctl aws command group with cloudtrail errors to surface CloudTrail error events (filtering by type, JSON output, optional console links) and aws utilities.
    • Added osdctl cluster snapshot and cluster diff for capturing and comparing point-in-time cluster state; added osdctl evidence collect to gather cluster state, CloudTrail, events, and diagnostics into structured output and a summary.
  • Documentation

    • Expanded README and docs with usage, examples, and options for aws, cloudtrail errors, cluster snapshot/diff, and evidence collect.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 11, 2026
openshift-ci-robot commented Mar 11, 2026

@MitaliBhalla: This pull request references SREP-3620 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.22.0" version, but no target version was set.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 11, 2026
coderabbitai bot commented Mar 11, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds new CLI command groups for AWS and evidence collection, plus cluster snapshot and diff utilities. Implements CloudTrail error querying, cluster snapshot creation, snapshot diffing, and an evidence collection orchestrator. Wires commands into the root CLI and adds documentation and README content. All changes are additive.

Changes

  • AWS CLI & CloudTrail (cmd/aws/cmd.go, cmd/aws/cloudtrail/cmd.go, cmd/aws/cloudtrail/errors.go, docs/osdctl_aws.md, docs/osdctl_aws_cloudtrail.md, docs/osdctl_aws_cloudtrail_errors.md)
    Adds the aws command group and cloudtrail errors subcommand: STS identity lookup, paginated CloudTrail LookupEvents (regional + global), error detection/filtering, parsing to structured events, optional console links, JSON/human output, and docs.
  • Cluster snapshot & diff (cmd/cluster/cmd.go, cmd/cluster/snapshot.go, cmd/cluster/diff.go, docs/osdctl_cluster_snapshot.md, docs/osdctl_cluster_diff.md, docs/osdctl_cluster.md)
    Adds cluster snapshot to capture nodes/operators/namespaces/resources into YAML and cluster diff to compare two snapshots, producing structured diffs (nodes/operators/namespaces/resources) with JSON or formatted console output; CLI wiring and docs.
  • Evidence collection (cmd/evidence/cmd.go, cmd/evidence/collect.go, docs/osdctl_evidence.md, docs/osdctl_evidence_collect.md, README.md, docs/README.md)
    Adds the evidence command group and an evidence collect orchestrator that captures cluster state, optional k8s events, conditional CloudTrail pulls for AWS clusters, and an optional must-gather; writes evidence.yaml and summary.txt and documents the workflows.
  • Root CLI wiring (cmd/cmd.go)
    Registers new top-level commands: osdaws.NewCmdAws() and evidence.NewCmdEvidence().
  • Documentation index & README (README.md, docs/...)
    Extensive docs and README additions covering the new commands, usage examples, and evidence collection guidance.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI as CloudTrail CLI
    participant STS as AWS STS
    participant CT as AWS CloudTrail
    participant Parser as Event Parser
    participant Output as Formatter

    User->>CLI: Run `osdctl aws cloudtrail errors` (cluster-id, since, flags)
    CLI->>CLI: Parse flags -> compute startTime
    CLI->>STS: GetCallerIdentity
    STS-->>CLI: ARN/AccountID
    CLI->>Output: Print identity (unless --json)
    CLI->>CT: LookupEvents (cluster region, startTime)
    CT-->>CLI: Regional events
    alt cluster region != us-east-1
        CLI->>CT: LookupEvents (us-east-1, startTime)
        CT-->>CLI: Global events
        CLI->>CLI: Merge events
    end
    CLI->>Parser: Filter/parse events (error-types or defaults)
    Parser-->>CLI: ErrorEvent list (with optional ConsoleLink)
    alt --json
        CLI->>Output: Emit JSON array
    else
        CLI->>Output: Render human-readable table
    end
    Output-->>User: Display results
sequenceDiagram
    participant User
    participant CLI as Snapshot/Diff CLI
    participant K8s as Kubernetes API
    participant FS as File System
    participant Diff as Diff Engine

    User->>CLI: Run `osdctl cluster snapshot` (cluster-id, output)
    CLI->>K8s: oc get nodes/namespaces/clusteroperators/... 
    K8s-->>CLI: JSON resources
    CLI->>CLI: Build ClusterSnapshot struct
    CLI->>FS: Write YAML snapshot
    FS-->>CLI: Snapshot saved

    User->>CLI: Run `osdctl cluster diff` (snap1, snap2)
    CLI->>FS: Read snap1 & snap2 YAML
    FS-->>CLI: ClusterSnapshot before/after
    CLI->>Diff: compareSnapshots(before, after)
    Diff-->>CLI: DiffResult (nodes/operators/namespaces/resources)
    alt --json
        CLI->>FS: Write structured diff
    else
        CLI->>CLI: Render human diff (+ / - / ~)
    end
    FS-->>User: Diff output
sequenceDiagram
    participant User
    participant CLI as Evidence Collect CLI
    participant K8s as Kubernetes API
    participant AWS as AWS (STS/CloudTrail)
    participant MustGather as must-gather
    participant FS as File System

    User->>CLI: Run `osdctl evidence collect` (cluster-id, output-dir, since, flags)
    CLI->>CLI: Validate inputs, compute startTime
    CLI->>K8s: Verify cluster connectivity
    K8s-->>CLI: Connected
    CLI->>FS: Create output directory

    alt not skip-cluster-state
        CLI->>K8s: Fetch nodes, operators, machineconfigs
        K8s-->>CLI: ClusterState data
        CLI->>FS: Store ClusterState in evidence.yaml
    end

    alt include-events
        CLI->>K8s: Get events (--all-namespaces, since)
        K8s-->>CLI: Event list
    end

    alt AWS cluster AND not skip-cloudtrail
        CLI->>AWS: whoami (STS)
        AWS-->>CLI: ARN/ID
        CLI->>AWS: CloudTrail LookupEvents (startTime)
        AWS-->>CLI: CloudTrail events
        CLI->>FS: Store CloudTrailData in evidence.yaml
    end

    alt include-must-gather
        CLI->>MustGather: Run `oc adm must-gather`
        MustGather-->>FS: must-gather artifacts
        CLI->>FS: Record diagnostics path
    end

    CLI->>FS: Write evidence.yaml and summary.txt
    FS-->>User: Evidence package ready

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 28.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title 'SREP-3620: Add feature-testing evidence collection commands' directly reflects the main purpose and scope of the PR: adding new evidence collection commands for feature testing.
  • Stable And Deterministic Test Names — ✅ Passed: no Ginkgo test files (*_test.go) with unstable or non-deterministic test names were found in the PR modifications.
  • Test Structure And Quality — ✅ Passed: the PR introduces new CLI command implementations and documentation without any Ginkgo test files, so the quality criteria could not be evaluated.



Comment @coderabbitai help to get the list of available commands and usage tips.

openshift-ci bot commented Mar 11, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: MitaliBhalla
Once this PR has been reviewed and has the lgtm label, please assign matesaary for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 11, 2026
coderabbitai bot left a comment

Actionable comments posted: 13

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/aws/cloudtrail/errors.go`:
- Around line 76-87: The examples block in the cloudtrail errors command uses
the lowercase shorthand "-c" for the cluster-id flag which is incorrect; update
all occurrences in the examples (including the other block referenced) to use
the registered uppercase shorthand "-C" so pasted commands match the actual flag
definition (update the examples string in cmd/aws/cloudtrail/errors.go where the
help/examples for the errors command are defined).
- Around line 339-345: The parseDurationToUTC function currently accepts zero or
negative durations which yields a start time in the future; update
parseDurationToUTC to validate the parsed duration and return an error if
duration <= 0 (e.g., "lookback window must be a positive duration: %s"), so
callers (e.g., --since) get a clear validation error instead of an empty/future
query; keep the existing time.ParseDuration call and only add a check after
parsing that returns a wrapped fmt.Errorf when the duration is non‑positive.
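The duration check described above is small enough to sketch in isolation. This is an illustrative stand-alone version, assuming the real parseDurationToUTC in cmd/aws/cloudtrail/errors.go has a similar shape; signature and messages may differ:

```go
package main

import (
	"fmt"
	"time"
)

// parseDurationToUTC converts a lookback window such as "1h" into a UTC start
// time, rejecting zero and negative durations so that --since can never
// produce a start time in the future (which would yield an empty query).
func parseDurationToUTC(since string) (time.Time, error) {
	d, err := time.ParseDuration(since)
	if err != nil {
		return time.Time{}, fmt.Errorf("invalid duration %q: %w", since, err)
	}
	if d <= 0 {
		return time.Time{}, fmt.Errorf("lookback window must be a positive duration: %s", since)
	}
	return time.Now().UTC().Add(-d), nil
}
```

With this check, `--since 0s` or `--since -5m` fails fast with a clear message instead of silently returning no events.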

In `@cmd/cluster/diff.go`:
- Around line 104-105: The --json flag (opts.OutputJSON) is still serializing
output with yaml.Marshal; update the code paths that check opts.OutputJSON (the
diffCmd output block and the similar block around the 410-417 area) to use
json.Marshal or json.MarshalIndent instead of yaml.Marshal so the --json flag
actually emits JSON, retain existing error handling and write the marshaled
bytes to stdout/stderr as before; ensure
diffCmd.Flags().BoolVar(&opts.OutputJSON, "json", false, "Output diff in JSON
format") behavior matches the new json serialization.
- Around line 189-231: compareNodes currently only detects Status changes and
treats added/removed nodes, so role, version, condition, and label changes
recorded in NodeSnapshot are never reported; update compareNodes to compare all
relevant NodeSnapshot fields (Roles, Version, Conditions, Labels in addition to
Status) and produce NodeDiff entries for modifications that include Before/After
strings for any changed field(s) using the NodeSnapshot.Name key; do the
same-change logic for the corresponding namespace comparator (the function
handling namespace diffs referenced around lines 315-357, e.g.,
compareNamespaces) so label changes there are also detected and reported in
NamespaceDiff entries.
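The field-by-field comparison this finding asks for can be sketched with stand-in types. The NodeSnapshot and NodeDiff fields below are illustrative, not the actual structs from cmd/cluster/snapshot.go and cmd/cluster/diff.go:

```go
package main

// NodeSnapshot is an illustrative stand-in; the real struct carries more fields.
type NodeSnapshot struct {
	Name    string
	Status  string
	Roles   string
	Version string
}

// NodeDiff records one modified field on one node.
type NodeDiff struct {
	Name, Field, Before, After string
}

// compareNodes reports added and removed nodes plus any changed field,
// rather than only Status changes.
func compareNodes(before, after []NodeSnapshot) (added, removed []string, modified []NodeDiff) {
	prev := map[string]NodeSnapshot{}
	for _, n := range before {
		prev[n.Name] = n
	}
	seen := map[string]bool{}
	for _, n := range after {
		seen[n.Name] = true
		old, ok := prev[n.Name]
		if !ok {
			added = append(added, n.Name)
			continue
		}
		// Compare every tracked field, emitting one NodeDiff per change.
		for _, f := range []struct{ field, b, a string }{
			{"status", old.Status, n.Status},
			{"roles", old.Roles, n.Roles},
			{"version", old.Version, n.Version},
		} {
			if f.b != f.a {
				modified = append(modified, NodeDiff{n.Name, f.field, f.b, f.a})
			}
		}
	}
	for _, n := range before {
		if !seen[n.Name] {
			removed = append(removed, n.Name)
		}
	}
	return added, removed, modified
}
```

The same pattern (a table of field name / before / after triples) applies to the namespace comparator, where label changes would otherwise go unreported.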

In `@cmd/cluster/snapshot.go`:
- Around line 97-105: The help examples for the "osdctl cluster snapshot"
command use the lowercase short flag "-c" but the command actually registers the
uppercase short flag "-C"; update all example lines in the snapshot help block
to use "-C" (e.g., replace "osdctl cluster snapshot -c <cluster-id>" with
"osdctl cluster snapshot -C <cluster-id>") so copy/pasting from the examples
matches the registered short flag for the snapshot command.
- Around line 63-69: ResourceInfo currently only stores Status (phase) which is
too lossy; add a stable per-object digest (and/or richer fields) and populate it
when building snapshots so diffs detect spec/rollout changes. Modify the
ResourceInfo struct (add Digest string `yaml:"digest,omitempty"` and optionally
Generation/ResourceVersion or a SpecSummary field), then in the snapshot
construction code that produces ResourceInfo instances (the code referenced
around ResourceInfo usage and the other block ~382-411) compute a deterministic
hash (e.g., SHA256) over the object's meaningful content — for example
metadata.generation/resourceVersion plus the object spec (or the full
unstructured JSON/YAML of the object) — and assign that string to
ResourceInfo.Digest; ensure the same hashing logic is applied in both places
mentioned so future diffs compare Digest (and any added fields) rather than just
Status.
- Around line 121-205: The run() flow fetches the target cluster via OCM but
then shells out to oc in captureNodes(), captureNamespaces(),
captureClusterOperators(), captureResources() (and other capture paths), which
can operate against whatever kubecontext is active; update run() to resolve or
obtain the correct kubeconfig/context for o.ClusterID (reuse existing
utils/CreateConnection/GetClusterAnyStatus flow or call the utility that returns
the cluster's kubeconfig/OCM login token) and ensure every capture function
(captureNodes, captureNamespaces, captureClusterOperators, captureResources)
uses that kubeconfig/context (pass it as an argument or set KUBECONFIG/--context
for oc invocations) or else fail fast with an error if we cannot establish the
correct context; apply the same check/enforcement to the other capture call
sites noted in the review so metadata and captured resources always come from
the requested cluster.
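The digest idea above can be sketched with the standard library alone. Hashing the JSON encoding of the object's content works because encoding/json sorts map keys at every level, so equal content always hashes identically; exactly which fields to feed in (spec, generation, resourceVersion, or the full object) is a design choice, and the generic map used here is illustrative:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
)

// specDigest computes a deterministic SHA-256 digest over an object's
// meaningful content, so snapshot diffs can detect spec and rollout changes
// that comparing status.phase alone would miss.
func specDigest(obj map[string]any) (string, error) {
	// json.Marshal sorts map keys, making the encoding deterministic.
	b, err := json.Marshal(obj)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}
```

Stored as a `digest` field on ResourceInfo, two snapshots can then be compared by string equality without persisting the full object.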

In `@cmd/evidence/collect.go`:
- Around line 333-355: The collectClusterState function currently ignores errors
from sub-collectors (collectNodes, collectOperators, collectMachineConfigs);
change it to capture and return collector errors instead of swallowing them:
call each sub-collector, append any non-nil error to a slice (or use
errors.Join) and only return the partially-built state if no errors occurred; if
there are errors, return nil along with a combined/wrapped error that includes
context identifying which collector failed (mentioning collectNodes,
collectOperators, collectMachineConfigs in the message) so the caller can detect
and report failed sub-collections.
- Around line 279-290: The events collection ignores the user-specified
--since/startTime; modify the flow so collectKubernetesEvents(startTime)
receives the parsed start time (or make startTime accessible to it) and uses it
in the Kubernetes client list options (e.g., metav1.ListOptions fieldSelector or
metav1.ListOptions with SinceTime via metav1.Time) to limit returned events;
update references where events are collected (function collectKubernetesEvents
and call sites around the IncludeEvents handling and the other occurrences noted
around lines 502-540) to pass the startTime and ensure the client query filters
events by that time before assigning evidence.ClusterState.Events.
- Around line 261-275: The code assigns a placeholder CloudTrail object as
evidence even when nothing was actually queried; update the logic so you only
set evidence.CloudTrailData when o.collectCloudTrailData returns meaningful data
(e.g., non-nil and contains events/records) — otherwise leave
evidence.CloudTrailData nil and print a warning. Concretely, modify
o.collectCloudTrailData (or its callers) to return nil (not &CloudTrailData{})
when no events were found, or add a post-call check in the block that calls
o.collectCloudTrailData (the same block using CreateAWSV2Config and the other
occurrences mentioned) to verify the returned struct has real content (e.g.,
len(events) > 0 or other key fields are non-zero) before assigning to
evidence.CloudTrailData and reporting success.
- Around line 161-172: The help examples in the Examples block for the evidence
collect command use the wrong shorthand `-c`; update all example lines to use
the registered uppercase shorthand `-C` for the `cluster-id` flag (the Examples
string in cmd/evidence/collect.go for the evidence collect command), making sure
you replace every `-c <cluster-id>` occurrence (including the later Examples
block referenced at lines 178-186) with `-C <cluster-id>` so copy-pasted
commands work as expected.
- Around line 248-307: Bind all external cluster commands to the target cluster
and tighten error handling: ensure every oc/oc adm invocation used in
runMustGather(), collectKubernetesEvents(), collectClusterState(), and any other
collectors is executed with the target cluster's kubeconfig/context (use the OCM
cluster object returned earlier—e.g., obtain and pass its kubeconfig or context
into command invocations rather than relying on current kubeconfig), propagate
and return errors from the three sub-collectors inside collectClusterState()
instead of swallowing them (make collectClusterState() return non-nil error when
any sub-collector fails), pass startTime into collectKubernetesEvents(startTime)
and use it to add a --since filter to the underlying oc events call, change
collectCloudTrailData() to return an explicit error or real data (do not return
an empty slice on unimplemented; return an error so callers can distinguish “no
data” from “not collected”), and fix CLI examples/flag usage to match the
defined flag (-C) instead of -c.

In `@README.md`:
- Around line 623-624: Update the heading/text that currently reads "Compare
snapshots with YAML output" to indicate JSON output instead; the example command
`osdctl cluster diff before.yaml after.yaml --json` uses the `--json` flag, so
change the header or surrounding prose to "Compare snapshots with JSON output"
(or otherwise mention JSON) to match the `--json` option shown.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fd9c777b-4301-4150-82af-5343f24124a9

📥 Commits

Reviewing files that changed from the base of the PR and between 3676092 and 4d7cf9a.

📒 Files selected for processing (10)
  • README.md
  • cmd/aws/cloudtrail/cmd.go
  • cmd/aws/cloudtrail/errors.go
  • cmd/aws/cmd.go
  • cmd/cluster/cmd.go
  • cmd/cluster/diff.go
  • cmd/cluster/snapshot.go
  • cmd/cmd.go
  • cmd/evidence/cmd.go
  • cmd/evidence/collect.go


coderabbitai bot left a comment

Actionable comments posted: 10

♻️ Duplicate comments (7)
cmd/evidence/collect.go (4)

333-355: ⚠️ Potential issue | 🟠 Major

Don't report partial cluster state as success.

collectClusterState() drops errors from all three sub-collectors and still returns nil. A failed oc get becomes an empty section plus a success message, which makes incomplete evidence look healthy.

Suggested fix
+import "errors"
+
 func (o *collectOptions) collectClusterState() (*ClusterState, error) {
 	state := &ClusterState{}
+	var errs []error
 
 	// Collect nodes
 	nodes, err := o.collectNodes()
-	if err == nil {
+	if err != nil {
+		errs = append(errs, fmt.Errorf("collect nodes: %w", err))
+	} else {
 		state.Nodes = nodes
 	}
 
 	// Collect operators
 	operators, err := o.collectOperators()
-	if err == nil {
+	if err != nil {
+		errs = append(errs, fmt.Errorf("collect operators: %w", err))
+	} else {
 		state.Operators = operators
 	}
 
 	// Collect machine configs
 	machineConfigs, err := o.collectMachineConfigs()
-	if err == nil {
+	if err != nil {
+		errs = append(errs, fmt.Errorf("collect machineconfigs: %w", err))
+	} else {
 		state.MachineConfigs = machineConfigs
 	}
 
-	return state, nil
+	if len(errs) > 0 {
+		return nil, errors.Join(errs...)
+	}
+	return state, nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/evidence/collect.go` around lines 333 - 355, collectClusterState
currently swallows errors from collectNodes, collectOperators, and
collectMachineConfigs and always returns nil; update collectClusterState to
detect and surface failures from those sub-collectors (collectNodes,
collectOperators, collectMachineConfigs) instead of silently ignoring them: call
each collector, if any returns a non-nil error record it (either return the
first error immediately or aggregate multiple errors) and return a non-nil error
alongside any partially-collected ClusterState so callers know the cluster state
is incomplete.

543-549: ⚠️ Potential issue | 🟠 Major

Don't serialize placeholder CloudTrail data as real evidence.

This function never queries CloudTrail but still returns success with an empty struct. The caller and summary then treat “not collected” as “collected and empty.”

Suggested fix
 func (o *collectOptions) collectCloudTrailData(startTime time.Time) (*CloudTrailData, error) {
 	// Note: CloudTrail data collection is handled by the 'osdctl aws cloudtrail errors' command
-	// This function provides a placeholder for the evidence collection summary
-	// For detailed CloudTrail analysis, use: osdctl aws cloudtrail errors -c <cluster-id> --since <duration>
+	// For detailed CloudTrail analysis, use: osdctl aws cloudtrail errors -C <cluster-id> --since <duration>
 	_ = startTime
 
-	return &CloudTrailData{}, nil
+	return nil, fmt.Errorf("cloudtrail collection is not implemented for evidence bundles yet")
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/evidence/collect.go` around lines 543 - 549, The collectCloudTrailData
placeholder in collectOptions (function collectCloudTrailData) currently returns
an empty CloudTrailData struct and nil error, causing callers to treat “not
collected” as valid empty evidence; change it to indicate non-collection by
returning (nil, error) instead of &CloudTrailData{} — e.g. return nil,
fmt.Errorf("cloudtrail data collection not implemented") or a package-level
sentinel error like ErrNotCollected — and update any callers to handle a nil
*CloudTrailData (or that sentinel) so placeholder data is never serialized as
real evidence.

279-282: ⚠️ Potential issue | 🟠 Major

--since is ignored for Kubernetes events.

startTime never reaches collectKubernetesEvents(), so --include-events --since 1h still dumps the full event history. On busy clusters that can massively inflate the bundle and blur the validation window.

Also applies to: 502-540

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/evidence/collect.go` around lines 279 - 282, The code ignores the
--since/startTime when collecting Kubernetes events because startTime isn't
passed into collectKubernetesEvents; update the call site to pass startTime (or
set it on the receiver) and change collectKubernetesEvents (the method named
collectKubernetesEvents) to accept and use a startTime parameter (or read the
receiver field) to filter events by event.Timestamp >= startTime so
--include-events --since 1h only returns recent events; ensure any auxiliary
logic that parses --since (the startTime variable) is reused rather than
recomputed.

357-358: ⚠️ Potential issue | 🔴 Critical

Bind every oc invocation to the requested cluster.

All of these calls run against whatever kubeconfig context the operator already has active. -C is only used for OCM lookup and metadata, so this command can collect evidence from the wrong cluster while labeling it with the requested ClusterID.

Also applies to: 415-415, 473-473, 503-503, 558-558

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/evidence/collect.go` around lines 357 - 358, The oc calls in collectNodes
currently run against the active kubeconfig; update the exec.CommandContext
invocations in collectNodes (and the other occurrences noted) to pass the
requested cluster binding by appending the appropriate flags (either
"--kubeconfig" with the requested kubeconfig path or "--context" with the
requested cluster context) sourced from the options struct (e.g., use
o.KubeconfigPath or o.ClusterContext) so every oc invocation targets the
intended cluster.
cmd/cluster/snapshot.go (2)

63-69: ⚠️ Potential issue | 🟠 Major

Scope and enrich extra resource snapshots.

Line 383 always queries --all-namespaces, so --namespaces only narrows the namespace list, not the additional resources. On top of that, ResourceInfo still only persists status.phase, which means later diffs miss most rollout/spec updates even when the object changed.

Also applies to: 382-411

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/cluster/snapshot.go` around lines 63 - 69, The snapshot logic currently
always queries extra resources with --all-namespaces (so --namespaces doesn't
scope extras) and the ResourceInfo struct only stores status.phase, causing
diffs to miss spec/rollout changes; update the query that builds extra resource
lists (the code around the call that currently forces "--all-namespaces") to
respect the provided namespaces flag and only include --all-namespaces when no
namespaces are specified, and extend the ResourceInfo type (and any
serialization logic that populates it) to persist richer fields such as
Metadata.ResourceVersion, Labels, Annotations, and the full Spec and Status (as
raw JSON/YAML strings) instead of just status.phase so downstream diffs detect
spec/status changes (refer to ResourceInfo and the function that collects extra
resources).

121-205: ⚠️ Potential issue | 🔴 Critical

Bind every oc call to the requested cluster and fail if core captures are incomplete.

Line 132 resolves the target cluster through OCM, but the capture paths started on Lines 162, 171, 180, and 190 still shell out to whichever kubecontext oc is already using. Because those errors are only logged, the command can still exit 0 after writing a snapshot from the wrong cluster or with missing core sections.

Also applies to: 207-209, 269-274, 323-326, 382-385

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/cluster/snapshot.go` around lines 121 - 205, The snapshot routine
resolves the target cluster via utils.GetClusterAnyStatus in snapshotOptions.run
but then calls captureNodes, captureNamespaces, captureClusterOperators and
captureResources which shell out to oc without being bound to that cluster and
currently only log errors, allowing successful exit with wrong or partial data;
modify the capture* functions (captureNodes, captureNamespaces,
captureClusterOperators, captureResources) to accept an explicit cluster binding
parameter (e.g., cluster.ID or the cluster kubeconfig/API endpoint token
returned by utils or a new utils.GetClusterKubeconfig function) and ensure every
oc invocation is invoked with that binding (or set KUBECONFIG/oc
--server/--context accordingly) so it always targets the resolved cluster, and
change snapshotOptions.run to treat errors from these core captures as fatal by
returning the error (instead of only logging) so writeSnapshot is only called
when all core captures succeed; update any callers of these methods and ensure
writeSnapshot remains unchanged for saving ClusterSnapshot.
cmd/cluster/diff.go (1)

190-232: ⚠️ Potential issue | 🟠 Major

Diff the node and namespace fields you already snapshot.

cmd/cluster/snapshot.go:37-50 captures node roles, version, conditions, and labels, and cmd/cluster/snapshot.go:36-41 captures namespace labels. These comparators still only react to Status, so upgrades, role flips, condition changes, and label mutations never surface unless the status string also changes.

Also applies to: 316-359

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/cluster/diff.go` around lines 190 - 232, compareNodes only compares
Status (and only includes Roles on add/remove), so changes to Roles, Version,
Conditions, or Labels from NodeSnapshot never produce "modified" diffs; do the
same deeper field-wise comparison used in snapshot.go by comparing Roles
(order-insensitive), Version, Conditions, and Labels (use reflect.DeepEqual or
explicit map/slice comparison) and when any of those differ append a "modified"
NodeDiff for that node with Before/After strings that include the relevant
fields (e.g., Status, Roles, Version, Conditions, Labels) using the NodeSnapshot
and NodeDiff types; apply the same fix to the analogous comparator
(compareNamespaces or the function around lines 316-359) so namespace label
changes are detected and included in Before/After.
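A field-wise comparator of the kind the prompt describes might look like this; `NodeSnapshot` is trimmed to a few fields for illustration and omits `Conditions`, which the real type in `snapshot.go` also carries:

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// NodeSnapshot is a trimmed stand-in for the type in cmd/cluster/snapshot.go.
type NodeSnapshot struct {
	Status  string
	Roles   []string
	Version string
	Labels  map[string]string
}

// nodeChanged reports whether any captured field differs, not just Status.
// Roles are compared order-insensitively on sorted copies.
func nodeChanged(before, after NodeSnapshot) bool {
	b := append([]string(nil), before.Roles...)
	a := append([]string(nil), after.Roles...)
	sort.Strings(b)
	sort.Strings(a)
	return before.Status != after.Status ||
		before.Version != after.Version ||
		!reflect.DeepEqual(b, a) ||
		!reflect.DeepEqual(before.Labels, after.Labels)
}

func main() {
	before := NodeSnapshot{Status: "Ready", Roles: []string{"worker"}, Version: "4.21.3"}
	after := NodeSnapshot{Status: "Ready", Roles: []string{"worker"}, Version: "4.21.4"}
	// A version bump surfaces even though Status is unchanged.
	fmt.Println(nodeChanged(before, after))
}
```

The same shape of comparison applies to namespace labels in `compareNamespaces`.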
🧹 Nitpick comments (1)
docs/README.md (1)

1107-1125: Add language labels to the new fenced blocks.

All of the new synopsis/flags fences are unlabeled, which is why markdownlint is reporting MD040 throughout these sections. Tagging them as text and the example fences as bash will keep the README lint-clean.

Suggested doc fix
-```
+```text
 osdctl aws [flags]
-```
+```

...

-```
+```text
       --as string                        Username to impersonate for the operation. User could be a regular user or a service account in a namespace.
       --cluster string                   The name of the kubeconfig cluster to use
       --context string                   The name of the kubeconfig context to use
       ...
-```
+```

Also applies to: 1134-1152, 1175-1198, 1606-1625, 2193-2214, 3018-3036, 3064-3088

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/README.md` around lines 1107 - 1125, The unlabeled fenced code blocks
(e.g., the synopsis block starting with "osdctl aws [flags]" and the flags block
containing lines like "--as string                        Username to
impersonate...") need language labels to satisfy markdownlint MD040; update
those backtick fences to use ```text for the synopsis/flags blocks and use
```bash for any example/command blocks, ensuring you add the label on the
opening fence only (e.g., replace ``` with ```text or ```bash as appropriate)
and apply the same change to the other similar sections mentioned in the
comment.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/aws/cloudtrail/errors.go`:
- Line 98: The call to errorsCmd.MarkFlagRequired("cluster-id") is unchecked;
capture its returned error and handle it instead of ignoring it—wrap the call in
an err := ... assignment and if err != nil then return the error (or call the
command init/fatal logger) so wiring mistakes surface at startup; update the
initialization code that calls errorsCmd.MarkFlagRequired to check the error and
propagate or log it (reference: errorsCmd.MarkFlagRequired).
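The check-at-startup pattern can be sketched without pulling in cobra; `markFlagRequired` below is a stand-in for `cmd.MarkFlagRequired`, and `checkErr` mirrors what `cobra.CheckErr` does:

```go
package main

import (
	"fmt"
	"os"
)

// checkErr mirrors cobra.CheckErr: surface wiring mistakes at startup
// instead of silently discarding them with `_ =`.
func checkErr(err error) {
	if err != nil {
		fmt.Fprintln(os.Stderr, "Error:", err)
		os.Exit(1)
	}
}

// markFlagRequired is a stand-in for cmd.MarkFlagRequired: it fails when the
// flag name was never registered, which is exactly the bug class the check
// is meant to catch.
func markFlagRequired(registered map[string]bool, name string) error {
	if !registered[name] {
		return fmt.Errorf("no such flag -%v", name)
	}
	return nil
}

func main() {
	flags := map[string]bool{"cluster-id": true, "since": true}
	checkErr(markFlagRequired(flags, "cluster-id"))
	fmt.Println("required flags wired correctly")
}
```

A typo in the flag name then aborts at startup rather than silently producing a command whose "required" flag is optional.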

In `@cmd/cluster/diff.go`:
- Around line 152-157: compareOperators currently appends one change record per
changed field, so using len(result.OperatorChanges) overcounts operators;
instead compute the number of unique operators and assign that to
result.Summary.OperatorsChanged. Modify the code after compareOperators(...)
(referencing result.OperatorChanges and result.Summary.OperatorsChanged) to
build a set keyed by the operator identity used in change records (e.g.,
operator name/namespace or the unique ID present in each change entry), count
the unique keys and assign that count; apply the same dedup logic to the other
occurrence mentioned (lines ~235-309) where operator changes are summarized.

In `@cmd/cluster/snapshot.go`:
- Around line 417-433: The snapshot file is currently written with
world-readable permissions (0644) in snapshotOptions.writeSnapshot; change the
file write to use owner-only permissions (0600) so the evidence file is
restricted to the invoking user—update the os.WriteFile call that writes
o.OutputFile (and any related write path in writeSnapshot) to use 0600 instead
of 0644 so only the file owner can read/write the snapshot.
- Around line 111-116: The two calls that ignore errors are
snapshotCmd.MarkFlagRequired("cluster-id") and
snapshotCmd.MarkFlagRequired("output"); change them to use cobra.CheckErr(...)
so the returned error is handled (e.g.,
cobra.CheckErr(snapshotCmd.MarkFlagRequired("cluster-id")) and
cobra.CheckErr(snapshotCmd.MarkFlagRequired("output"))), leaving the rest of the
flag registration (snapshotCmd.Flags().StringVarP/ StringSliceVar and opts
fields) unchanged.

In `@cmd/evidence/collect.go`:
- Around line 569-575: The file writes sensitive evidence files with
world-readable permissions; update the writer to use restrictive file mode
(e.g., 0600) instead of 0644: in the saveEvidence method
(collectOptions.saveEvidence) replace the os.WriteFile call so the file is
created/written with owner-only read/write permissions, and apply the same
change to the other evidence/summary writer (the other os.WriteFile usage
referenced in the review) so both evidence.yaml and summary.txt are written with
restrictive permissions.
- Around line 152-156: The help text in cmd/evidence/collect.go promises that
cluster state includes "namespaces" but the implementation doesn't collect or
serialize namespaces; either remove "namespaces" from the descriptive
comment/help string or implement namespace collection. To fix, either update the
help/description text in collect.go to drop "namespaces" so it matches the
actual outputs, or add namespace collection by calling the Kubernetes API
(client.CoreV1().Namespaces().List) in the same cluster-state collection path
and include the result in the same serialization/writing code path used by the
existing cluster state writer (the cluster-state collection/serialization
functions in collect.go such as the cluster state writer or
serializeClusterState/WriteClusterState helpers). Ensure the change is applied
to the help string and/or the cluster-state collection code so documentation
matches behavior.
- Around line 654-660: The parseDurationToUTC function currently allows zero and
negative durations which produce future start times; after calling
time.ParseDuration in parseDurationToUTC, validate that duration is strictly
positive (duration > 0) and return a descriptive error (e.g., "lookback must be
greater than zero") for zero or negative values instead of computing a time;
keep the function name parseDurationToUTC and its existing error wrapping for
parse failures, but add this extra validation branch to prevent future start
times.

In `@docs/osdctl_aws_cloudtrail_errors.md`:
- Around line 13-28: Wrap the examples for the command "osdctl aws cloudtrail
errors" in a fenced bash code block so the lines beginning with "#" are treated
as shell comments rather than markdown headings; replace the current loose
example lines with a ```bash block containing the comment lines (e.g., "# Get
permission errors...") followed by the corresponding commands and close the
fence with ```, ensuring the usage snippet remains inside the fenced block.

In `@docs/README.md`:
- Around line 1162-1174: The example command blocks (e.g., lines containing
"osdctl aws cloudtrail errors -C <cluster-id> --since 1h" and the other example
commands) are currently indented and parsed as Markdown headings; wrap each
example set in a fenced code block using triple backticks with "bash" (```bash)
at the start and a closing ``` at the end, ensuring the leading "#" lines remain
as comment lines inside the fence; apply the same fix for the other example
groups mentioned (around the sections that contain the commands ending with
"--json", "--link", and "--error-types AccessDenied,UnauthorizedOperation").
- Around line 2183-2191: Update the example commands for osdctl cluster snapshot
to use the documented flag name -C (or --cluster-id) instead of -c so
copy-pasted commands work; locate the example lines containing "osdctl cluster
snapshot -c <cluster-id> ..." and replace -c with -C, and ensure the note about
the flag table still matches the examples (refer to the osdctl cluster snapshot
examples in the README).

---

Duplicate comments:
In `@cmd/cluster/diff.go`:
- Around line 190-232: compareNodes only compares Status (and only includes
Roles on add/remove), so changes to Roles, Version, Conditions, or Labels from
NodeSnapshot never produce "modified" diffs; do the same deeper field-wise
comparison used in snapshot.go by comparing Roles (order-insensitive), Version,
Conditions, and Labels (use reflect.DeepEqual or explicit map/slice comparison)
and when any of those differ append a "modified" NodeDiff for that node with
Before/After strings that include the relevant fields (e.g., Status, Roles,
Version, Conditions, Labels) using the NodeSnapshot and NodeDiff types; apply
the same fix to the analogous comparator (compareNamespaces or the function
around lines 316-359) so namespace label changes are detected and included in
Before/After.

In `@cmd/cluster/snapshot.go`:
- Around line 63-69: The snapshot logic currently always queries extra resources
with --all-namespaces (so --namespaces doesn't scope extras) and the
ResourceInfo struct only stores status.phase, causing diffs to miss spec/rollout
changes; update the query that builds extra resource lists (the code around the
call that currently forces "--all-namespaces") to respect the provided
namespaces flag and only include --all-namespaces when no namespaces are
specified, and extend the ResourceInfo type (and any serialization logic that
populates it) to persist richer fields such as Metadata.ResourceVersion, Labels,
Annotations, and the full Spec and Status (as raw JSON/YAML strings) instead of
just status.phase so downstream diffs detect spec/status changes (refer to
ResourceInfo and the function that collects extra resources).
- Around line 121-205: The snapshot routine resolves the target cluster via
utils.GetClusterAnyStatus in snapshotOptions.run but then calls captureNodes,
captureNamespaces, captureClusterOperators and captureResources which shell out
to oc without being bound to that cluster and currently only log errors,
allowing successful exit with wrong or partial data; modify the capture*
functions (captureNodes, captureNamespaces, captureClusterOperators,
captureResources) to accept an explicit cluster binding parameter (e.g.,
cluster.ID or the cluster kubeconfig/API endpoint token returned by utils or a
new utils.GetClusterKubeconfig function) and ensure every oc invocation is
invoked with that binding (or set KUBECONFIG/oc --server/--context accordingly)
so it always targets the resolved cluster, and change snapshotOptions.run to
treat errors from these core captures as fatal by returning the error (instead
of only logging) so writeSnapshot is only called when all core captures succeed;
update any callers of these methods and ensure writeSnapshot remains unchanged
for saving ClusterSnapshot.

In `@cmd/evidence/collect.go`:
- Around line 333-355: collectClusterState currently swallows errors from
collectNodes, collectOperators, and collectMachineConfigs and always returns
nil; update collectClusterState to detect and surface failures from those
sub-collectors (collectNodes, collectOperators, collectMachineConfigs) instead
of silently ignoring them: call each collector, if any returns a non-nil error
record it (either return the first error immediately or aggregate multiple
errors) and return a non-nil error alongside any partially-collected
ClusterState so callers know the cluster state is incomplete.
- Around line 543-549: The collectCloudTrailData placeholder in collectOptions
(function collectCloudTrailData) currently returns an empty CloudTrailData
struct and nil error, causing callers to treat “not collected” as valid empty
evidence; change it to indicate non-collection by returning (nil, error) instead
of &CloudTrailData{} — e.g. return nil, fmt.Errorf("cloudtrail data collection
not implemented") or a package-level sentinel error like ErrNotCollected — and
update any callers to handle a nil *CloudTrailData (or that sentinel) so
placeholder data is never serialized as real evidence.
- Around line 279-282: The code ignores the --since/startTime when collecting
Kubernetes events because startTime isn't passed into collectKubernetesEvents;
update the call site to pass startTime (or set it on the receiver) and change
collectKubernetesEvents (the method named collectKubernetesEvents) to accept and
use a startTime parameter (or read the receiver field) to filter events by
event.Timestamp >= startTime so --include-events --since 1h only returns recent
events; ensure any auxiliary logic that parses --since (the startTime variable)
is reused rather than recomputed.
- Around line 357-358: The oc calls in collectNodes currently run against the
active kubeconfig; update the exec.CommandContext invocations in collectNodes
(and the other occurrences noted) to pass the requested cluster binding by
appending the appropriate flags (either "--kubeconfig" with the requested
kubeconfig path or "--context" with the requested cluster context) sourced from
the options struct (e.g., use o.KubeconfigPath or o.ClusterContext) so every oc
invocation targets the intended cluster.

---

Nitpick comments:
In `@docs/README.md`:
- Around line 1107-1125: The unlabeled fenced code blocks (e.g., the synopsis
block starting with "osdctl aws [flags]" and the flags block containing lines
like "--as string                        Username to impersonate...") need
language labels to satisfy markdownlint MD040; update those backtick fences to
use ```text for the synopsis/flags blocks and use ```bash for any
example/command blocks, ensuring you add the label on the opening fence only
(e.g., replace ``` with ```text or ```bash as appropriate) and apply the same
change to the other similar sections mentioned in the comment.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2570e793-1ae3-4e89-8bbf-73076384ad8e

📥 Commits

Reviewing files that changed from the base of the PR and between 4d7cf9a and 6dcd7d8.

📒 Files selected for processing (15)
  • README.md
  • cmd/aws/cloudtrail/cmd.go
  • cmd/aws/cloudtrail/errors.go
  • cmd/aws/cmd.go
  • cmd/cluster/cmd.go
  • cmd/cluster/diff.go
  • cmd/cluster/snapshot.go
  • cmd/cmd.go
  • cmd/evidence/cmd.go
  • cmd/evidence/collect.go
  • docs/README.md
  • docs/osdctl.md
  • docs/osdctl_aws.md
  • docs/osdctl_aws_cloudtrail.md
  • docs/osdctl_aws_cloudtrail_errors.md
🚧 Files skipped from review as they are similar to previous changes (5)
  • cmd/aws/cmd.go
  • cmd/evidence/cmd.go
  • cmd/cluster/cmd.go
  • cmd/aws/cloudtrail/cmd.go
  • cmd/cmd.go

@openshift-ci-robot

openshift-ci-robot commented Mar 11, 2026

@MitaliBhalla: This pull request references SREP-3620 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Adds new osdctl commands to help SRE teams collect evidence during feature validation testing (IAM policies, operators, etc.), reducing manual toil and speeding up feature validation.

New Commands

osdctl aws cloudtrail errors - Surface AWS permission errors from CloudTrail
osdctl cluster snapshot - Capture point-in-time cluster state (nodes, operators, namespaces)
osdctl cluster diff - Compare two snapshots to identify changes
osdctl evidence collect - All-in-one evidence collection

Usage Examples

Surface CloudTrail permission errors

osdctl aws cloudtrail errors -C <cluster-id> --since 1h

Capture cluster state before/after testing

osdctl cluster snapshot -C <cluster-id> -o before.yaml
osdctl cluster snapshot -C <cluster-id> -o after.yaml
osdctl cluster diff before.yaml after.yaml

Collect all evidence at once

osdctl evidence collect -C <cluster-id> --output ./evidence/

Features
Supports both ROSA Classic and ROSA HCP clusters
Detects HCP clusters and provides relevant guidance
JSON output option for programmatic use
AWS Console links for CloudTrail events

Testing

Tested on ROSA Classic cluster mbhalla-osdctl (v4.21.4):

  • CloudTrail errors: Found 18 permission errors
  • Snapshot: Captured 7 nodes, 34 operators, 20 MachineConfigs
  • Diff: Correctly detected changes between snapshots
  • Evidence: Generated summary and evidence.yaml

Jira

https://issues.redhat.com/browse/SREP-3620

Summary by CodeRabbit

  • New Features

  • Added osdctl aws (with cloudtrail and errors) to surface CloudTrail events and permission errors; supports filtering, JSON output, and optional console links

  • Added osdctl cluster snapshot and cluster diff to capture and compare point-in-time cluster state

  • Added osdctl evidence collect to gather cluster and AWS evidence into structured output and a human-readable summary

  • Documentation

  • Expanded README and docs with usage, examples, and options for the new aws, cloudtrail, cluster snapshot/diff, and evidence collect commands

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

♻️ Duplicate comments (10)
docs/README.md (1)

1162-1174: ⚠️ Potential issue | 🟡 Minor

Wrap these example groups in fenced bash blocks again.

The indented # ... lines are rendering as headings instead of shell comments, so these four sections are back to triggering the same Markdown issue.

Suggested doc fix
 Examples:
-  # Get permission errors from the last hour
-  osdctl aws cloudtrail errors -C <cluster-id> --since 1h
-
-  # Get errors from the last 30 minutes with JSON output
-  osdctl aws cloudtrail errors -C <cluster-id> --since 30m --json
+```bash
+# Get permission errors from the last hour
+osdctl aws cloudtrail errors -C <cluster-id> --since 1h
+
+# Get errors from the last 30 minutes with JSON output
+osdctl aws cloudtrail errors -C <cluster-id> --since 30m --json
+```

Apply the same pattern to the cluster diff, cluster snapshot, and evidence collect example groups.

Also applies to: 1599-1604, 2183-2191, 3051-3062

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/README.md` around lines 1162 - 1174, The Markdown example blocks
containing shell comments (e.g., the groups showing "osdctl aws cloudtrail
errors -C <cluster-id> ..." and the similar "cluster diff", "cluster snapshot",
and "evidence collect" example groups) must be wrapped in fenced code blocks
with a bash language tag so leading `#` lines render as comments instead of
headings; update each example group by adding a starting "```bash" before the
first commented line and a closing "```" after the last command for the
cloudtrail, cluster diff, cluster snapshot and evidence collect example
sections.
cmd/cluster/snapshot.go (3)

115-116: ⚠️ Potential issue | 🟡 Minor

Handle the MarkFlagRequired return values.

These calls still discard Cobra's returned error, so errcheck will continue to flag the constructor path. Prefer cobra.CheckErr(...) or propagate the error.

Suggested fix
-	_ = snapshotCmd.MarkFlagRequired("cluster-id")
-	_ = snapshotCmd.MarkFlagRequired("output")
+	cobra.CheckErr(snapshotCmd.MarkFlagRequired("cluster-id"))
+	cobra.CheckErr(snapshotCmd.MarkFlagRequired("output"))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/cluster/snapshot.go` around lines 115 - 116, The two calls to
snapshotCmd.MarkFlagRequired currently ignore returned errors; update the
constructor to capture and handle these by checking the error (e.g., err :=
snapshotCmd.MarkFlagRequired("cluster-id"); cobra.CheckErr(err) and similarly
for "output") or propagate the error up, ensuring both
snapshotCmd.MarkFlagRequired invocations are not discarding their returned error
(reference snapshotCmd.MarkFlagRequired and use cobra.CheckErr or return the
err).

63-69: ⚠️ Potential issue | 🟠 Major

ResourceInfo is still too lossy for meaningful diffs.

Only status.phase is persisted for arbitrary resources. For common types like Deployments, Services, ConfigMaps, and many CRs, that field is empty or unrelated to the change users care about, so later diffs mostly collapse to add/remove detection and miss rollout/spec changes.

Also applies to: 404-411

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/cluster/snapshot.go` around lines 63 - 69, The ResourceInfo struct only
persists name/namespace/kind/status and loses meaningful details for diffs;
update ResourceInfo to include either the full unstructured object (e.g., Raw
map[string]interface{} or JSON/YAML blob) or at minimum include
metadata.generation, metadata.resourceVersion, metadata.labels/annotations, and
the spec/template (or a computed hash of spec/template + important status
fields) so rollout/spec changes are detectable; locate the ResourceInfo type and
any code that serializes/deserializes snapshots and change them to capture and
persist these additional fields (or the full object) and adjust comparison logic
to use the new spec/template or hash fields when generating diffs.

121-205: ⚠️ Potential issue | 🔴 Critical

Collect from the requested cluster, not whichever oc context happens to be active.

run() resolves o.ClusterID through OCM, but every capture path later shells out to oc without passing a kubeconfig/context or validating the current one. If the caller is logged into a different cluster, the snapshot metadata comes from one cluster while nodes/namespaces/operators come from another.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/cluster/snapshot.go` around lines 121 - 205, run() currently resolves
o.ClusterID via OCM but then lets captureNodes(), captureNamespaces(),
captureClusterOperators(), and captureResources() shell out to oc using whatever
context is active; fetch the target cluster kubeconfig from OCM (or via an
existing helper like utils.GetClusterKubeconfig/GetClusterAdminKubeconfig)
immediately in run(), save it to a temp file or to o.Kubeconfig, and ensure each
capture function (captureNodes, captureNamespaces, captureClusterOperators,
captureResources) uses that kubeconfig (or the temp file path) when invoking oc
(or pass it down as an argument) rather than relying on the current oc context;
update run() to set/clean the temp kubeconfig and adjust the capture function
signatures to accept the kubeconfig if needed so all captured resources come
from the same cluster resolved by run().
docs/osdctl_aws_cloudtrail_errors.md (1)

13-28: ⚠️ Potential issue | 🟡 Minor

Fence the examples block.

The lines beginning with # are still parsed as markdown headings instead of shell comments. Please wrap the examples in a fenced bash block.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/osdctl_aws_cloudtrail_errors.md` around lines 13 - 28, The examples
block is not fenced with a bash code fence so lines starting with # are being
treated as Markdown headings; wrap the entire examples snippet (the four example
commands and the usage line ``osdctl aws cloudtrail errors [flags]``) inside a
fenced code block that specifies bash (i.e., start with ```bash and end with
```), ensuring the existing `#` comment prefixes remain inside the fenced block;
update the block around the examples in the docs/osdctl_aws_cloudtrail_errors.md
file where the Examples section and the usage line appear.
cmd/cluster/diff.go (1)

216-223: ⚠️ Potential issue | 🟠 Major

Node and namespace diffs still miss fields already captured in the snapshot.

compareNodes only reports Status changes, and compareNamespaces does the same. That drops node Roles, Version, Conditions, Labels, and namespace Labels changes even though those fields are present in the snapshot model.

Also applies to: 342-348

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/cluster/diff.go` around lines 216 - 223, compareNodes and
compareNamespaces only record Status changes; update compareNodes and
compareNamespaces to also detect and append NodeDiff/NamespaceDiff entries for
Roles, Version, Conditions, and Labels (for nodes) and Labels (for namespaces)
when values differ between beforeNode/beforeNamespace and
afterNode/afterNamespace. Locate the compareNodes and compareNamespaces
functions and, alongside the existing Status check, add equality checks for
Roles (slice/set comparison), Version (string), Conditions (structured
comparison), and Labels (map comparison), creating NodeDiff/NamespaceDiff items
with ChangeType "modified" and Before/After formatted values; apply the same
additional checks in the other similar compare block later in the file where
node/namespace comparisons are duplicated.
cmd/evidence/collect.go (4)

276-289: ⚠️ Potential issue | 🟠 Major

--since flag is not applied to Kubernetes events collection.

The startTime is computed from the --since flag but never passed to collectKubernetesEvents(). The function retrieves all events regardless of the time window, which can produce a large, unfocused dataset on busy clusters.

🔧 Proposed fix

Update the function signature to accept startTime and filter events:

-func (o *collectOptions) collectKubernetesEvents() ([]EventInfo, error) {
+func (o *collectOptions) collectKubernetesEvents(startTime time.Time) ([]EventInfo, error) {

Then filter events by timestamp after parsing, or use --field-selector with the oc command if supported.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/evidence/collect.go` around lines 276 - 289, The events collection
ignores the computed startTime from the --since flag because startTime isn’t
passed into collectKubernetesEvents; update the call in the IncludeEvents block
to pass the startTime and change the collectKubernetesEvents signature to accept
a time.Time (or pointer) parameter, then filter returned events inside
collectKubernetesEvents by their timestamps (e.g., Event.LastTimestamp /
Event.FirstTimestamp) to only include events after startTime (alternatively use
a field-selector when executing oc/kubectl inside collectKubernetesEvents);
ensure ClusterState.Events is still set with the filtered slice.

540-547: ⚠️ Potential issue | 🟠 Major

CloudTrail placeholder returns empty data indistinguishable from "collected but empty".

collectCloudTrailData always returns &CloudTrailData{} without actually collecting anything. The caller assigns this to evidence.CloudTrailData and prints success, making "not implemented" indistinguishable from "no errors found."

Either:

  1. Return nil and don't set evidence.CloudTrailData when collection isn't performed
  2. Integrate with the actual CloudTrail collection logic from cmd/aws/cloudtrail/errors.go
  3. Return an error indicating the feature is not yet implemented
🔧 Minimal fix - return nil instead of empty struct
 func (o *collectOptions) collectCloudTrailData(startTime time.Time) (*CloudTrailData, error) {
 	// Note: CloudTrail data collection is handled by the 'osdctl aws cloudtrail errors' command
-	// This function provides a placeholder for the evidence collection summary
-	// For detailed CloudTrail analysis, use: osdctl aws cloudtrail errors -c <cluster-id> --since <duration>
-	_ = startTime
-
-	return &CloudTrailData{}, nil
+	// For detailed CloudTrail analysis, use: osdctl aws cloudtrail errors -C <cluster-id> --since <duration>
+	return nil, nil
 }

Then update the caller to check for nil before assigning.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/evidence/collect.go` around lines 540 - 547, The collectCloudTrailData
function currently returns an empty &CloudTrailData{} which is indistinguishable
from a successful empty result; change collectCloudTrailData to return nil (and
a nil error or a specific "not implemented" error) when CloudTrail collection is
not performed, and update the caller sites that assign evidence.CloudTrailData
to check for nil before setting it (or propagate the error) so callers can
differentiate "not implemented" from "collected but empty"; alternatively, if
you prefer to implement collection now, wire in the real logic from
cmd/aws/cloudtrail/errors.go into collectCloudTrailData instead of returning an
empty struct.

330-352: ⚠️ Potential issue | 🟠 Major

collectClusterState silently swallows all sub-collector errors.

The function ignores errors from collectNodes, collectOperators, and collectMachineConfigs, always returning nil error. This makes failed collections indistinguishable from successful but empty ones. The caller at line 248-255 will report success even when all sub-collectors failed.

🔧 Proposed fix - propagate errors
 func (o *collectOptions) collectClusterState() (*ClusterState, error) {
 	state := &ClusterState{}
+	var errs []error

 	// Collect nodes
 	nodes, err := o.collectNodes()
-	if err == nil {
+	if err != nil {
+		errs = append(errs, fmt.Errorf("nodes: %w", err))
+	} else {
 		state.Nodes = nodes
 	}

 	// Collect operators
 	operators, err := o.collectOperators()
-	if err == nil {
+	if err != nil {
+		errs = append(errs, fmt.Errorf("operators: %w", err))
+	} else {
 		state.Operators = operators
 	}

 	// Collect machine configs
 	machineConfigs, err := o.collectMachineConfigs()
-	if err == nil {
+	if err != nil {
+		errs = append(errs, fmt.Errorf("machineconfigs: %w", err))
+	} else {
 		state.MachineConfigs = machineConfigs
 	}

-	return state, nil
+	if len(errs) > 0 {
+		return state, fmt.Errorf("partial collection failure: %v", errs)
+	}
+	return state, nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/evidence/collect.go` around lines 330 - 352, collectClusterState
currently swallows errors from collectNodes, collectOperators, and
collectMachineConfigs and always returns nil error; change it to propagate
failures by checking each err and returning an error if any sub-collector fails
(e.g., return nil, err or aggregate errors into one descriptive error using
fmt.Errorf or a multi-error approach) while still filling state with successful
parts; update the function around collectNodes, collectOperators, and
collectMachineConfigs to return immediately on non-nil err (or collect and
combine errors) so callers of collectClusterState can detect failures.

354-358: ⚠️ Potential issue | 🟠 Major

All oc commands execute against the current kubeconfig context, not the target cluster.

The cluster is resolved via OCM (line 199), but none of the exec.CommandContext calls for oc get nodes, oc get clusteroperators, oc get machineconfigs, oc get events, or oc adm must-gather are bound to the target cluster. This can collect evidence from the wrong cluster while labeling it with the correct cluster ID.

Consider either:

  1. Using --context or --kubeconfig flags with the target cluster's configuration
  2. Documenting that users must ensure their current kubeconfig context matches the target cluster
  3. Using the Kubernetes client-go library with cluster credentials from OCM/backplane instead of shelling out

Also applies to: 411-415, 469-473, 499-503, 549-561

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/evidence/collect.go` around lines 354 - 358, The shelled-out oc commands
(e.g., in collectNodes()) run against the current kubeconfig context instead of
the target cluster resolved via OCM; update all exec.CommandContext usages
(collectNodes, collectClusterOperators, collectMachineConfigs, collectEvents,
runMustGather or similar) to bind to the target cluster by passing the cluster
kubeconfig/context obtained during cluster resolution: either add
"--kubeconfig", or "--context <name>" to the oc command args (or set KUBECONFIG
in cmd.Env) using the kubeconfig returned by your OCM/backplane resolver, or
replace these shells with a client-go implementation using the resolved
credentials; ensure the same resolver symbol that collects the cluster
kubeconfig is used so each oc invocation targets the intended cluster.
🧹 Nitpick comments (3)
docs/osdctl_aws.md (1)

13-15: Consider adding language specifiers to fenced code blocks.

The fenced code blocks lack language specifiers, which can affect syntax highlighting and accessibility. For CLI help output, consider using text or shell identifiers.

This is a minor documentation nit; if these docs are auto-generated by Cobra, the fix belongs in the generator (pkg/docgen) rather than in the generated files themselves.

Also applies to: 19-21, 25-36

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/osdctl_aws.md` around lines 13 - 15, Update the fenced code blocks in
the docs that show CLI examples (e.g., the block containing "osdctl aws [flags]"
and the other blocks noted at lines 19-21 and 25-36) to include a language
specifier such as "shell" or "text" (e.g., change ``` to ```shell) so syntax
highlighting and accessibility are improved; ensure all CLI/example blocks in
the file use the same appropriate specifier.
cmd/aws/cloudtrail/errors.go (1)

98-98: Handle the MarkFlagRequired error.

The error from MarkFlagRequired is silently discarded. While this rarely fails in practice, it can hide wiring mistakes if the flag name is changed.

🔧 Suggested fix
-	_ = errorsCmd.MarkFlagRequired("cluster-id")
+	if err := errorsCmd.MarkFlagRequired("cluster-id"); err != nil {
+		panic(fmt.Sprintf("failed to mark cluster-id as required: %v", err))
+	}

Or using Cobra's helper:

-	_ = errorsCmd.MarkFlagRequired("cluster-id")
+	cobra.CheckErr(errorsCmd.MarkFlagRequired("cluster-id"))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/aws/cloudtrail/errors.go` at line 98, The call to
errorsCmd.MarkFlagRequired("cluster-id") currently discards its error; change it
to handle the returned error (e.g., capture the error from MarkFlagRequired and
propagate or log it using your project's error handling, or use Cobra's helper
like cobra.CheckErr) so wiring mistakes are not silently ignored—update the call
site where errorsCmd.MarkFlagRequired("cluster-id") is invoked to check the
returned err and handle it appropriately.
cmd/evidence/collect.go (1)

182-183: Handle MarkFlagRequired errors.

Both MarkFlagRequired calls silently discard errors. This can hide wiring mistakes if flag names change.

🔧 Suggested fix
-	_ = collectCmd.MarkFlagRequired("cluster-id")
-	_ = collectCmd.MarkFlagRequired("output")
+	cobra.CheckErr(collectCmd.MarkFlagRequired("cluster-id"))
+	cobra.CheckErr(collectCmd.MarkFlagRequired("output"))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/evidence/collect.go` around lines 182 - 183, The MarkFlagRequired calls
on collectCmd currently discard errors; update the two calls to capture their
returned error and handle it instead of assigning to blank identifier — e.g.,
err := collectCmd.MarkFlagRequired("cluster-id"); if err != nil {
cobra.CheckErr(err) } (or log.Fatalf/return the error from the enclosing
init/setup function) and do the same for collectCmd.MarkFlagRequired("output");
this ensures wiring mistakes are surfaced.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/aws/cloudtrail/errors.go`:
- Around line 350-356: The whoami function dereferences output.Arn and
output.Account without nil checks; add safeguards to prevent panics by verifying
output is non-nil and that output.Arn and output.Account are non-nil before
dereferencing. If any of those are nil, return a descriptive error (e.g., "empty
GetCallerIdentity response: missing Arn/Account") and preserve the existing
return types; otherwise safely dereference and return the values. Use the
sts.GetCallerIdentity response (output) and its Arn and Account fields as the
unique identifiers to locate where to add these checks.

In `@cmd/cluster/diff.go`:
- Around line 262-298: compareOperators currently only checks Available,
Degraded, Progressing and Version on OperatorSnapshot, so changes to other
entries in the Conditions slice are missed; update compareOperators (and any
helper used to diff operators) to also diff the Conditions field: iterate the
beforeOp.Conditions and afterOp.Conditions (matching by Condition.Type or
creating maps keyed by Type), and for any condition present/removed or whose
Status/Reason/Message (or other relevant fields) changed, append an OperatorDiff
with Field set to e.g. "Conditions.<Type>.<Subfield>" (or "Conditions.<Type>"
for added/removed) and populate Before/After with the previous and current
values; ensure you handle added, removed and modified condition entries so all
condition-only changes surface in the diffs.
- Around line 110-125: In run(), before calling compareSnapshots, verify the two
loaded snapshots belong to the same cluster by comparing
beforeSnapshot.Metadata.ClusterID and afterSnapshot.Metadata.ClusterID; if they
differ, return a clear error (e.g. "snapshots belong to different clusters:
before=<id> after=<id>") from diffOptions.run so we fail fast instead of
producing a misleading diff. Ensure you reference the existing loadSnapshot
return values (beforeSnapshot, afterSnapshot) and perform the check immediately
after loading the after snapshot and before calling compareSnapshots.

In `@cmd/cluster/snapshot.go`:
- Around line 160-195: The current logic swallows errors from core collectors
(o.captureNodes, o.captureNamespaces, o.captureClusterOperators) and writes
empty sections into snapshot, which leads to false diffs; change the behavior so
the snapshot creation fails fast on these core-section errors: when
o.captureNodes, o.captureNamespaces, or o.captureClusterOperators returns a
non-nil err, return that error from the enclosing snapshot creation function
(instead of only logging) so the snapshot is not persisted, or alternatively
populate an explicit snapshot-level error field (e.g., Snapshot.Errors["nodes"]
= err.Error()) and ensure downstream diffing ignores sections with recorded
errors; locate and update the block that assigns
snapshot.Nodes/snapshot.Namespaces/snapshot.Operators and the surrounding
function to implement one of these two fixes.

In `@docs/osdctl_cluster_diff.md`:
- Around line 18-27: Wrap the examples block in a bash fenced code block so the
comment lines are treated as shell instead of headings: add a starting "```bash"
before the "Examples:" line and ensure there's a matching closing "```" after
the usage block (so the "Examples:" line, the two "# Compare ..." lines and the
usage snippet are all inside the fenced block); update the block surrounding the
existing usage snippet (osdctl cluster diff ...) to be enclosed by the new
```bash fence so markdown renders correctly.

In `@docs/osdctl_cluster_snapshot.md`:
- Around line 18-30: The examples block currently isn't fenced and comment lines
like "# Capture cluster snapshot to a file" are rendered as Markdown headings;
wrap the entire examples section (from the first example comment through the
final "osdctl cluster snapshot [flags]" line) in a fenced code block with the
language tag bash (add "```bash" before the first example and "```" after the
last line) so the commands (including the literal "osdctl cluster snapshot
[flags]") render and copy/paste correctly.

In `@docs/README.md`:
- Around line 1107-1125: The generated Markdown code fences in
pkg/docgen/docgen.go are created without language specifiers (the usage and
flags outputs currently emit plain ```), which triggers MD040; update the two
places that emit the usage block and the flags block so they open with a
language tag (e.g., change the usage block fence from ``` to ```bash or ```text,
and change the flags block fence from ``` to ```text), ensuring both the
usage-generation site and the flags-generation site in docgen.go use the new
fenced form.

---

Duplicate comments:
In `@cmd/cluster/diff.go`:
- Around line 216-223: compareNodes and compareNamespaces only record Status
changes; update compareNodes and compareNamespaces to also detect and append
NodeDiff/NamespaceDiff entries for Roles, Version, Conditions, and Labels (for
nodes) and Labels (for namespaces) when values differ between
beforeNode/beforeNamespace and afterNode/afterNamespace. Locate the compareNodes
and compareNamespaces functions and, alongside the existing Status check, add
equality checks for Roles (slice/set comparison), Version (string), Conditions
(structured comparison), and Labels (map comparison), creating
NodeDiff/NamespaceDiff items with ChangeType "modified" and Before/After
formatted values; apply the same additional checks in the other similar compare
block later in the file where node/namespace comparisons are duplicated.

In `@cmd/cluster/snapshot.go`:
- Around line 115-116: The two calls to snapshotCmd.MarkFlagRequired currently
ignore returned errors; update the constructor to capture and handle these by
checking the error (e.g., err := snapshotCmd.MarkFlagRequired("cluster-id");
cobra.CheckErr(err) and similarly for "output") or propagate the error up,
ensuring both snapshotCmd.MarkFlagRequired invocations are not discarding their
returned error (reference snapshotCmd.MarkFlagRequired and use cobra.CheckErr or
return the err).
- Around line 63-69: The ResourceInfo struct only persists
name/namespace/kind/status and loses meaningful details for diffs; update
ResourceInfo to include either the full unstructured object (e.g., Raw
map[string]interface{} or JSON/YAML blob) or at minimum include
metadata.generation, metadata.resourceVersion, metadata.labels/annotations, and
the spec/template (or a computed hash of spec/template + important status
fields) so rollout/spec changes are detectable; locate the ResourceInfo type and
any code that serializes/deserializes snapshots and change them to capture and
persist these additional fields (or the full object) and adjust comparison logic
to use the new spec/template or hash fields when generating diffs.
- Around line 121-205: run() currently resolves o.ClusterID via OCM but then
lets captureNodes(), captureNamespaces(), captureClusterOperators(), and
captureResources() shell out to oc using whatever context is active; fetch the
target cluster kubeconfig from OCM (or via an existing helper like
utils.GetClusterKubeconfig/GetClusterAdminKubeconfig) immediately in run(), save
it to a temp file or to o.Kubeconfig, and ensure each capture function
(captureNodes, captureNamespaces, captureClusterOperators, captureResources)
uses that kubeconfig (or the temp file path) when invoking oc (or pass it down
as an argument) rather than relying on the current oc context; update run() to
set/clean the temp kubeconfig and adjust the capture function signatures to
accept the kubeconfig if needed so all captured resources come from the same
cluster resolved by run().

In `@cmd/evidence/collect.go`:
- Around line 276-289: The events collection ignores the computed startTime from
the --since flag because startTime isn’t passed into collectKubernetesEvents;
update the call in the IncludeEvents block to pass the startTime and change the
collectKubernetesEvents signature to accept a time.Time (or pointer) parameter,
then filter returned events inside collectKubernetesEvents by their timestamps
(e.g., Event.LastTimestamp / Event.FirstTimestamp) to only include events after
startTime (alternatively use a field-selector when executing oc/kubectl inside
collectKubernetesEvents); ensure ClusterState.Events is still set with the
filtered slice.
- Around line 540-547: The collectCloudTrailData function currently returns an
empty &CloudTrailData{} which is indistinguishable from a successful empty
result; change collectCloudTrailData to return nil (and a nil error or a
specific "not implemented" error) when CloudTrail collection is not performed,
and update the caller sites that assign evidence.CloudTrailData to check for nil
before setting it (or propagate the error) so callers can differentiate "not
implemented" from "collected but empty"; alternatively, if you prefer to
implement collection now, wire in the real logic from
cmd/aws/cloudtrail/errors.go into collectCloudTrailData instead of returning an
empty struct.
- Around line 330-352: collectClusterState currently swallows errors from
collectNodes, collectOperators, and collectMachineConfigs and always returns nil
error; change it to propagate failures by checking each err and returning an
error if any sub-collector fails (e.g., return nil, err or aggregate errors into
one descriptive error using fmt.Errorf or a multi-error approach) while still
filling state with successful parts; update the function around collectNodes,
collectOperators, and collectMachineConfigs to return immediately on non-nil err
(or collect and combine errors) so callers of collectClusterState can detect
failures.
- Around line 354-358: The shelled-out oc commands (e.g., in collectNodes()) run
against the current kubeconfig context instead of the target cluster resolved
via OCM; update all exec.CommandContext usages (collectNodes,
collectClusterOperators, collectMachineConfigs, collectEvents, runMustGather or
similar) to bind to the target cluster by passing the cluster kubeconfig/context
obtained during cluster resolution: either add "--kubeconfig", or "--context
<name>" to the oc command args (or set KUBECONFIG in cmd.Env) using the
kubeconfig returned by your OCM/backplane resolver, or replace these shells with
a client-go implementation using the resolved credentials; ensure the same
resolver symbol that collects the cluster kubeconfig is used so each oc
invocation targets the intended cluster.

In `@docs/osdctl_aws_cloudtrail_errors.md`:
- Around line 13-28: The examples block is not fenced with a bash code fence so
lines starting with # are being treated as Markdown headings; wrap the entire
examples snippet (the four example commands and the usage line ``osdctl aws
cloudtrail errors [flags]``) inside a fenced code block that specifies bash
(i.e., start with ```bash and end with ```), ensuring the existing `#` comment
prefixes remain inside the fenced block; update the block around the examples in
the docs/osdctl_aws_cloudtrail_errors.md file where the Examples section and the
usage line appear.

In `@docs/README.md`:
- Around line 1162-1174: The Markdown example blocks containing shell comments
(e.g., the groups showing "osdctl aws cloudtrail errors -C <cluster-id> ..." and
the similar "cluster diff", "cluster snapshot", and "evidence collect" example
groups) must be wrapped in fenced code blocks with a bash language tag so
leading `#` lines render as comments instead of headings; update each example
group by adding a starting "```bash" before the first commented line and a
closing "```" after the last command for the cloudtrail, cluster diff, cluster
snapshot and evidence collect example sections.

---

Nitpick comments:
In `@cmd/aws/cloudtrail/errors.go`:
- Line 98: The call to errorsCmd.MarkFlagRequired("cluster-id") currently
discards its error; change it to handle the returned error (e.g., capture the
error from MarkFlagRequired and propagate or log it using your project's error
handling, or use Cobra's helper like cobra.CheckErr) so wiring mistakes are not
silently ignored—update the call site where
errorsCmd.MarkFlagRequired("cluster-id") is invoked to check the returned err
and handle it appropriately.

In `@cmd/evidence/collect.go`:
- Around line 182-183: The MarkFlagRequired calls on collectCmd currently
discard errors; update the two calls to capture their returned error and handle
it instead of assigning to blank identifier — e.g., err :=
collectCmd.MarkFlagRequired("cluster-id"); if err != nil { cobra.CheckErr(err) }
(or log.Fatalf/return the error from the enclosing init/setup function) and do
the same for collectCmd.MarkFlagRequired("output"); this ensures wiring mistakes
are surfaced.

In `@docs/osdctl_aws.md`:
- Around line 13-15: Update the fenced code blocks in the docs that show CLI
examples (e.g., the block containing "osdctl aws [flags]" and the other blocks
noted at lines 19-21 and 25-36) to include a language specifier such as "shell"
or "text" (e.g., change ``` to ```shell) so syntax highlighting and
accessibility are improved; ensure all CLI/example blocks in the file use the
same appropriate specifier.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 27eb9a3b-169f-4fdd-a724-6b2fbd4e26ab

📥 Commits

Reviewing files that changed from the base of the PR and between 6dcd7d8 and d7fd420.

📒 Files selected for processing (20)
  • README.md
  • cmd/aws/cloudtrail/cmd.go
  • cmd/aws/cloudtrail/errors.go
  • cmd/aws/cmd.go
  • cmd/cluster/cmd.go
  • cmd/cluster/diff.go
  • cmd/cluster/snapshot.go
  • cmd/cmd.go
  • cmd/evidence/cmd.go
  • cmd/evidence/collect.go
  • docs/README.md
  • docs/osdctl.md
  • docs/osdctl_aws.md
  • docs/osdctl_aws_cloudtrail.md
  • docs/osdctl_aws_cloudtrail_errors.md
  • docs/osdctl_cluster.md
  • docs/osdctl_cluster_diff.md
  • docs/osdctl_cluster_snapshot.md
  • docs/osdctl_evidence.md
  • docs/osdctl_evidence_collect.md
✅ Files skipped from review due to trivial changes (1)
  • docs/osdctl_evidence_collect.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • cmd/cluster/cmd.go

@MitaliBhalla force-pushed the SREP-3620 branch 6 times, most recently from 772997f to 7346956 on March 11, 2026 at 07:41

@coderabbitai (bot) left a comment


Actionable comments posted: 9

♻️ Duplicate comments (7)
cmd/cluster/snapshot.go (3)

114-115: ⚠️ Potential issue | 🟡 Minor

Handle MarkFlagRequired errors instead of discarding them.

These ignored returns will keep errcheck unhappy and can hide a broken flag registration if the flag name drifts.

Suggested fix
-	_ = snapshotCmd.MarkFlagRequired("cluster-id")
-	_ = snapshotCmd.MarkFlagRequired("output")
+	cobra.CheckErr(snapshotCmd.MarkFlagRequired("cluster-id"))
+	cobra.CheckErr(snapshotCmd.MarkFlagRequired("output"))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/cluster/snapshot.go` around lines 114 - 115, The two calls to
snapshotCmd.MarkFlagRequired("cluster-id") and
snapshotCmd.MarkFlagRequired("output") currently discard returned errors; update
the code to handle these errors instead of using blank identifier—capture the
error from each MarkFlagRequired call and handle it (for example, return the
error up the caller, or log/fatal via the command setup logger) so that failures
to register required flags are surfaced and errcheck is satisfied.

64-69: ⚠️ Potential issue | 🟠 Major

Capture more than status.phase for extra resources.

ResourceInfo still only stores Status, so cluster diff will miss most spec/rollout changes for resources like Deployments and Services and mostly degrade into add/remove detection.

Also applies to: 399-428
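A lightweight way to make spec changes diffable without persisting whole objects is a deterministic hash over the spec. A sketch using an unstructured-style map; the object shape is an assumption, not taken from snapshot.go:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// specHash fingerprints a resource's spec. encoding/json sorts map keys, so
// the encoding, and therefore the hash, is deterministic for map inputs.
// Status changes do not affect the hash, only spec changes do.
func specHash(obj map[string]interface{}) (string, error) {
	spec, ok := obj["spec"]
	if !ok {
		return "", fmt.Errorf("object has no spec")
	}
	b, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	v1 := map[string]interface{}{"spec": map[string]interface{}{"replicas": 3}}
	v2 := map[string]interface{}{"spec": map[string]interface{}{"replicas": 5}}
	h1, _ := specHash(v1)
	h2, _ := specHash(v2)
	fmt.Println(h1 != h2) // true: the replica change is detectable
}
```

Storing this hash alongside the existing name/namespace/kind/status fields keeps snapshots small while letting `cluster diff` flag rollout changes.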

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/cluster/snapshot.go` around lines 64 - 69, ResourceInfo currently only
stores Status which misses spec/rollout changes; extend ResourceInfo to capture
metadata and spec-summary used for diffs (e.g., add fields like Labels
map[string]string, Annotations map[string]string, Generation/ObservedGeneration
(int64), ResourceVersion string, and a SpecHash or SpecSummary string/json) and
update the code paths that build/populate ResourceInfo (the functions that
create snapshot ResourceInfo instances) to fill these new fields from
object.Metadata and object.Spec/Status; also apply the same changes to the other
ResourceInfo occurrence referenced in the duplicate region so cluster diff can
detect spec/rollout changes.

120-200: ⚠️ Potential issue | 🔴 Critical

Bind every oc capture to the requested cluster.

run() resolves the target cluster through OCM, but the capture paths still execute plain oc against whatever kubecontext the caller already has selected. That can produce a snapshot whose metadata says cluster B while nodes/operators/resources were collected from cluster A.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/cluster/snapshot.go` around lines 120 - 200, The run() function resolves
the target cluster but then calls captureNodes(), captureNamespaces(),
captureClusterOperators(), and captureResources() which run plain oc against the
caller's current kubecontext; bind every oc call to the resolved cluster by
retrieving the cluster-specific kubeconfig/token (or API server URL and
credentials) after GetClusterAnyStatus and passing it into the capture functions
(or setting it on snapshotOptions) so those functions invoke oc/k8s client with
the cluster-specific context; update the signatures of captureNodes,
captureNamespaces, captureClusterOperators, captureResources (or have them read
a new o.TargetKubeConfig / o.TargetContext set in run()) and ensure each
function uses that kubeconfig/context when executing oc or creating clients.
cmd/evidence/collect.go (3)

255-269: ⚠️ Potential issue | 🟠 Major

This path never actually collects CloudTrail evidence.

collectCloudTrailData() always returns nil, nil, but once AWS auth succeeds the caller prints a success message. That makes the command look complete while evidence.yaml contains no CloudTrail section at all.

Also applies to: 561-568

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/evidence/collect.go` around lines 255 - 269, The caller prints
"CloudTrail access verified" even when collectCloudTrailData returns nil (since
collectCloudTrailData currently returns nil, nil), so fix either
collectCloudTrailData to actually return collected data or update the caller in
the block that calls CreateAWSV2Config and collectCloudTrailData: after
cloudTrailData, err := o.collectCloudTrailData(startTime) treat a nil
cloudTrailData as an error (log a warning and do not set evidence.CloudTrailData
or print the success line), and only assign to evidence.CloudTrailData and print
the success message when cloudTrailData is non-nil and err == nil; reference
collectCloudTrailData, evidence.CloudTrailData, and osdCloud.CreateAWSV2Config
to locate the affected code.

327-365: ⚠️ Potential issue | 🟠 Major

Partial collector failures still get serialized as successful evidence.

If one of the core sub-collectors fails, this returns a partially populated ClusterState and the run path treats it as success. The saved bundle then can't distinguish "section was empty" from "section failed to collect".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/evidence/collect.go` around lines 327 - 365, The collectClusterState
function currently treats partial sub-collector failures as success; change its
behavior so any failure in collectNodes, collectOperators, or
collectMachineConfigs causes collectClusterState to return an error rather than
a partially populated ClusterState. Specifically, after collecting into
state.Nodes, state.Operators, and state.MachineConfigs and accumulating warnings
in warnings, if len(warnings) > 0 return nil and an error (e.g.
fmt.Errorf("partial collection failures: %v", warnings)); keep the existing
all-empty check logic or combine it into this single check and retain the
warning printing logic if desired. Ensure you reference the warnings slice and
the collectNodes/collectOperators/collectMachineConfigs callers to locate the
change.

196-199: ⚠️ Potential issue | 🔴 Critical

Bind evidence collection to the requested cluster before running oc.

You resolve the target cluster from OCM here, but the subsequent collectors and oc adm must-gather still execute against the caller's current kubecontext. That can label the bundle with the right ClusterID while collecting evidence from a different cluster.

Also applies to: 243-301

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/evidence/collect.go` around lines 196 - 199, After resolving the target
cluster with utils.GetClusterAnyStatus (cluster) and o.ClusterID, make sure to
switch the active kube/oc context to that target cluster before invoking any
collectors or running "oc adm must-gather" (and similarly in the later block
around lines 243-301). Replace or augment the current executor/collector setup
to use the target cluster's kubeconfig/REST config (via the existing helper or
client factory in your codebase) so all subsequent collector functions and
must-gather invocations run against the resolved cluster, and restore the
original context if needed after collection.
cmd/cluster/diff.go (1)

204-237: ⚠️ Potential issue | 🟠 Major

Node diffs still miss condition-only and label-only changes.

NodeSnapshot captures Conditions and Labels in cmd/cluster/snapshot.go:44-51, but this comparator only checks status/version/roles. Any change that leaves those three fields intact stays invisible in the diff.
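A keyed map walk covers added, removed, and modified labels; the same pattern applies to conditions indexed by type. Names here are an illustrative sketch, not the PR's code:

```go
package main

import "fmt"

// diffLabels reports added, removed, and modified label keys between two
// snapshots as human-readable change strings.
func diffLabels(before, after map[string]string) []string {
	var changes []string
	for k, bv := range before {
		av, ok := after[k]
		switch {
		case !ok:
			changes = append(changes, fmt.Sprintf("label %q removed (was %q)", k, bv))
		case av != bv:
			changes = append(changes, fmt.Sprintf("label %q: %q -> %q", k, bv, av))
		}
	}
	for k, av := range after {
		if _, ok := before[k]; !ok {
			changes = append(changes, fmt.Sprintf("label %q added (%q)", k, av))
		}
	}
	return changes
}

func main() {
	changes := diffLabels(
		map[string]string{"node-role": "worker", "zone": "us-east-1a"},
		map[string]string{"node-role": "infra", "zone": "us-east-1a", "cordoned": "true"},
	)
	fmt.Println(len(changes)) // 2: one modified, one added
}
```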

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/cluster/diff.go` around lines 204 - 237, The compareNodes function
currently only compares Status, Version and Roles and therefore misses changes
to NodeSnapshot.Conditions and NodeSnapshot.Labels; update compareNodes to also
detect and report differences in Conditions and Labels (use deep comparisons
such as reflect.DeepEqual or a deterministic serialization rather than
fmt.Sprintf) by adding checks that append human-readable change strings (e.g.,
"Conditions: ... -> ..." and "Labels: ... -> ...") to the existing changes
slice, and ensure the NodeDiff entries for added/modified nodes include the
Conditions and Labels in their Before/After descriptions so condition-only or
label-only updates are surfaced; reference the compareNodes function,
NodeSnapshot struct, and NodeDiff usage when making the change.
🧹 Nitpick comments (4)
cmd/aws/cloudtrail/errors.go (1)

154-158: Preserve the shared AWS config when switching regions.

Creating the us-east-1 client from raw cloudtrail.Options drops any retryer, endpoint resolver, API options, or other settings already loaded into cfg. Prefer cloudtrail.NewFromConfig(cfg, func(o *cloudtrail.Options) { o.Region = defaultRegion }) instead.

Suggested refactor
-		globalClient := cloudtrail.New(cloudtrail.Options{
-			Region:      defaultRegion,
-			Credentials: cfg.Credentials,
-			HTTPClient:  cfg.HTTPClient,
-		})
+		globalClient := cloudtrail.NewFromConfig(cfg, func(o *cloudtrail.Options) {
+			o.Region = defaultRegion
+		})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/aws/cloudtrail/errors.go` around lines 154 - 158, The code creates a new
CloudTrail client with cloudtrail.New(cloudtrail.Options{...}) which discards
shared settings on cfg; instead preserve cfg and only override the region by
using cloudtrail.NewFromConfig(cfg, func(o *cloudtrail.Options) { o.Region =
defaultRegion }), so update the creation of globalClient to call
cloudtrail.NewFromConfig with cfg and the region override (referencing cfg,
defaultRegion, cloudtrail.NewFromConfig, and globalClient) to retain retryer,
endpoint resolver, API options, and other shared config.
docs/osdctl_cluster_snapshot.md (1)

18-20: Add language tags to the fenced blocks.

Using bash for examples and text for usage/options blocks will clear the markdownlint warnings and render these sections more cleanly.

Also applies to: 24-33, 37-43, 47-57

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/osdctl_cluster_snapshot.md` around lines 18 - 20, The fenced code blocks
in docs/osdctl_cluster_snapshot.md (e.g., the usage block showing "osdctl
cluster snapshot [flags]" and the example/output sections) lack language tags;
update each fenced block to include an appropriate language tag—use "text" for
usage/options blocks (the block around "osdctl cluster snapshot [flags]") and
"bash" for example command blocks—applying this change to the ranges noted
(lines ~18, 24-33, 37-43, 47-57) so markdownlint warnings are resolved and
rendering improves.
docs/osdctl_evidence_collect.md (1)

18-20: Add language tags to the fenced blocks.

Using bash for the examples and text for usage/options keeps this page lint-clean and improves the rendered formatting.

Also applies to: 24-36, 40-49, 53-63

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/osdctl_evidence_collect.md` around lines 18 - 20, The fenced code blocks
showing examples and usage should include language tags: change the
command/example blocks like the one containing "osdctl evidence collect [flags]"
to use ```bash and change the usage/options blocks to use ```text; apply the
same change to the other fenced blocks referenced (the examples in the later
examples/usage sections) so all command snippets are ```bash and all
usage/options blocks are ```text to satisfy linting and improve rendered
formatting.
docs/osdctl_cluster_diff.md (1)

18-20: Add language tags to the fenced blocks.

bash for the examples and text for usage/options keeps the generated docs lint-clean and improves formatting in rendered Markdown.

Also applies to: 24-30, 34-37, 41-52

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/osdctl_cluster_diff.md` around lines 18 - 20, The fenced code blocks
showing the command invocation (e.g. the line starting with "osdctl cluster diff
<before.yaml> <after.yaml> [flags]") and the surrounding usage/options blocks
need language tags; change the example blocks to start with ```bash and change
usage/options blocks to start with ```text (apply the same change to the other
fenced blocks noted in the comment ranges 24-30, 34-37, and 41-52) so the
generated docs lint correctly and render with proper syntax highlighting.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/aws/cloudtrail/errors.go`:
- Around line 86-95: The command currently only honors the local --json boolean
(opts.OutputJSON) but ignores the inherited -o/--output string flag; update the
errors handling to detect the global output flag and treat "json" the same as
--json. Concretely, in the errors command path (the RunE that calls opts.run
and/or inside opts.run), read the command's inherited output flag (e.g.,
cmd.Flags().GetString("output") or
cmd.Root().PersistentFlags().GetString("output")) and set/override
opts.OutputJSON = true when that value == "json" (or otherwise support
json+pretty variations), then proceed with the existing JSON rendering code so
osdctl aws cloudtrail errors -o json behaves identically to --json.
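The output-flag reconciliation above could look like the following stdlib-only sketch; the helper name `resolveJSONOutput` and the set of accepted values are assumptions, not taken from the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// resolveJSONOutput decides whether JSON rendering was requested, treating the
// inherited -o/--output string flag the same as the local --json boolean, and
// failing loudly on unsupported values instead of silently printing text.
func resolveJSONOutput(jsonFlag bool, outputFlag string) (bool, error) {
	switch strings.ToLower(outputFlag) {
	case "", "text":
		return jsonFlag, nil
	case "json":
		return true, nil
	default:
		return false, fmt.Errorf("unsupported --output value: %q", outputFlag)
	}
}

func main() {
	ok, _ := resolveJSONOutput(false, "json")
	fmt.Println("json requested:", ok)
}
```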

In `@cmd/aws/cmd.go`:
- Around line 12-17: Update the help text in the AWS command definition: change
occurrences of "AWS related" to the hyphenated form "AWS-related" in both the
Short and Long string literals (the fields defined in the AWS command in
cmd/aws/cmd.go) so the generated docs use the corrected grammar.

In `@cmd/cluster/diff.go`:
- Around line 375-383: The NamespaceDiff creation currently only renders Status
in Before/After even when labelsChanged is true; update the logic around
labelsChanged/statusChanged in the block that builds the NamespaceDiff (the
code using NamespaceDiff, Name, ChangeType, Before, After, and the
beforeNs/afterNs variables) so that Before and After include labels when labels
differ (e.g., include "Status: <status>; Labels: <labels>" or at minimum append
the formatted labels for beforeNs.Labels and afterNs.Labels when labelsChanged is
true), ensuring label-only modifications produce meaningful Before/After text.
- Around line 571-578: warnCaptureErrors currently writes capture warnings to
stdout which corrupts JSON output; update warnCaptureErrors to write all warning
output to stderr instead by switching fmt.Printf and fmt.Println calls to write
to os.Stderr (e.g., fmt.Fprintln/os.Stderr or fmt.Fprintf(os.Stderr,...)), and
add the necessary import for the os package; ensure the function signature and
logic remain the same and only the output stream is changed.
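The NamespaceDiff label-rendering fix described above can be sketched as follows; the `NamespaceSnapshot` shape and `describeNamespace` helper are assumed names for illustration.

```go
package main

import "fmt"

// NamespaceSnapshot mirrors the assumed per-namespace shape from
// cmd/cluster/snapshot.go.
type NamespaceSnapshot struct {
	Status string
	Labels map[string]string
}

// describeNamespace renders the Before/After text for a NamespaceDiff entry,
// appending labels whenever they differ so label-only edits stay visible.
func describeNamespace(ns NamespaceSnapshot, labelsChanged bool) string {
	if !labelsChanged {
		return fmt.Sprintf("Status: %s", ns.Status)
	}
	return fmt.Sprintf("Status: %s; Labels: %v", ns.Status, ns.Labels)
}

func main() {
	before := NamespaceSnapshot{Status: "Active", Labels: map[string]string{"team": "sre"}}
	after := NamespaceSnapshot{Status: "Active", Labels: map[string]string{"team": "ocm"}}
	fmt.Println("before:", describeNamespace(before, true))
	fmt.Println("after: ", describeNamespace(after, true))
}
```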
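The stdout/stderr separation requested for warnCaptureErrors can be demonstrated with a small stdlib sketch. The formatting helper here is illustrative (returning a string keeps it testable); only the stream choice mirrors the review's fix.

```go
package main

import (
	"fmt"
	"os"
)

// formatCaptureWarnings renders the warning block; callers write it to stderr
// so stdout stays a clean, machine-readable stream.
func formatCaptureWarnings(warnings []string) string {
	if len(warnings) == 0 {
		return ""
	}
	msg := "Warning: some resources could not be captured:\n"
	for _, w := range warnings {
		msg += "  - " + w + "\n"
	}
	return msg
}

func main() {
	// stdout carries only the JSON payload; warnings go to stderr.
	fmt.Println(`{"nodes": []}`)
	fmt.Fprint(os.Stderr, formatCaptureWarnings([]string{"oc get machineconfigs: permission denied"}))
}
```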

In `@cmd/evidence/collect.go`:
- Around line 179-180: Replace the ignored return values from
collectCmd.MarkFlagRequired with cmdutil.CheckErr(...) to propagate any errors;
specifically, wrap the two calls to collectCmd.MarkFlagRequired("cluster-id")
and collectCmd.MarkFlagRequired("output") in cmdutil.CheckErr so failures are
handled consistently with the codebase (same pattern used in
cmd/cluster/loggingcheck.go).

In `@docs/osdctl_aws_cloudtrail_errors.md`:
- Around line 13-57: The fenced code blocks in the "Examples", the initial usage
block, and the "Options" / "Options inherited from parent commands" sections are
missing language tags; update the three types of blocks so they are tagged for
proper linting and rendering: mark the usage/flags blocks as "text" (the
single-line usage ```osdctl aws cloudtrail errors [flags]``` and the multi-line
Options blocks), and mark the Examples code fence as "bash" (the examples
showing osdctl commands). Ensure each triple-backtick fence is changed to
```text or ```bash accordingly so MD040 warnings are resolved.

In `@docs/osdctl_aws_cloudtrail.md`:
- Around line 12-35: The markdown code fences in the osdctl_aws_cloudtrail.md
content are missing language annotations and trigger MD040; update each
triple-backtick fence that wraps CLI output and option blocks to use the "text"
language (replace ``` with ```text) so the fenced blocks are annotated (the
three blocks around "osdctl aws cloudtrail [flags]", the help block containing
"-h, --help", and the inherited options block).

In `@docs/osdctl_aws.md`:
- Around line 13-36: Update the three fenced code blocks that show the command
usage ("osdctl aws [flags]") and the Options / Options inherited from parent
commands lists to use language tag "text" instead of plain backticks; replace
the opening fences ``` with ```text for the command block and both flag/option
blocks so markdownlint MD040 stops flagging them and the blocks render as plain
text.

In `@docs/osdctl_evidence.md`:
- Around line 13-36: Update the three fenced code blocks in
docs/osdctl_evidence.md to include a language specifier (use "text") after the
opening backticks so markdownlint MD040 is satisfied; specifically, change the
blocks containing "osdctl evidence [flags]", the short Options block starting
with "-h, --help", and the larger "Options inherited from parent commands" block
to start with ```text instead of ``` (if these blocks are generated, adjust the
template/generator that emits the command and options blocks so it always emits
```text).

---

Duplicate comments:
In `@cmd/cluster/diff.go`:
- Around line 204-237: The compareNodes function currently only compares Status,
Version and Roles and therefore misses changes to NodeSnapshot.Conditions and
NodeSnapshot.Labels; update compareNodes to also detect and report differences
in Conditions and Labels (use deep comparisons such as reflect.DeepEqual or a
deterministic serialization rather than fmt.Sprintf) by adding checks that
append human-readable change strings (e.g., "Conditions: ... -> ..." and
"Labels: ... -> ...") to the existing changes slice, and ensure the NodeDiff
entries for added/modified nodes include the Conditions and Labels in their
Before/After descriptions so condition-only or label-only updates are surfaced;
reference the compareNodes function, NodeSnapshot struct, and NodeDiff usage
when making the change.

In `@cmd/cluster/snapshot.go`:
- Around line 114-115: The two calls to
snapshotCmd.MarkFlagRequired("cluster-id") and
snapshotCmd.MarkFlagRequired("output") currently discard returned errors; update
the code to handle these errors instead of using blank identifier—capture the
error from each MarkFlagRequired call and handle it (for example, return the
error up the caller, or log/fatal via the command setup logger) so that failures
to register required flags are surfaced and errcheck is satisfied.
- Around line 64-69: ResourceInfo currently only stores Status which misses
spec/rollout changes; extend ResourceInfo to capture metadata and spec-summary
used for diffs (e.g., add fields like Labels map[string]string, Annotations
map[string]string, Generation/ObservedGeneration (int64), ResourceVersion
string, and a SpecHash or SpecSummary string/json) and update the code paths
that build/populate ResourceInfo (the functions that create snapshot
ResourceInfo instances) to fill these new fields from object.Metadata and
object.Spec/Status; also apply the same changes to the other ResourceInfo
occurrence referenced in the duplicate region so cluster diff can detect
spec/rollout changes.
- Around line 120-200: The run() function resolves the target cluster but then
calls captureNodes(), captureNamespaces(), captureClusterOperators(), and
captureResources() which run plain oc against the caller's current kubecontext;
bind every oc call to the resolved cluster by retrieving the cluster-specific
kubeconfig/token (or API server URL and credentials) after GetClusterAnyStatus
and passing it into the capture functions (or setting it on snapshotOptions) so
those functions invoke oc/k8s client with the cluster-specific context; update
the signatures of captureNodes, captureNamespaces, captureClusterOperators,
captureResources (or have them read a new o.TargetKubeConfig / o.TargetContext
set in run()) and ensure each function uses that kubeconfig/context when
executing oc or creating clients.

In `@cmd/evidence/collect.go`:
- Around line 255-269: The caller prints "CloudTrail access verified" even when
collectCloudTrailData returns nil (since collectCloudTrailData currently returns
nil, nil), so fix either collectCloudTrailData to actually return collected data
or update the caller in the block that calls CreateAWSV2Config and
collectCloudTrailData: after cloudTrailData, err :=
o.collectCloudTrailData(startTime) treat a nil cloudTrailData as an error (log a
warning and do not set evidence.CloudTrailData or print the success line), and
only assign to evidence.CloudTrailData and print the success message when
cloudTrailData is non-nil and err == nil; reference collectCloudTrailData,
evidence.CloudTrailData, and osdCloud.CreateAWSV2Config to locate the affected
code.
- Around line 327-365: The collectClusterState function currently treats partial
sub-collector failures as success; change its behavior so any failure in
collectNodes, collectOperators, or collectMachineConfigs causes
collectClusterState to return an error rather than a partially populated
ClusterState. Specifically, after collecting into state.Nodes, state.Operators,
and state.MachineConfigs and accumulating warnings in warnings, if len(warnings)
> 0 return nil and an error (e.g. fmt.Errorf("partial collection failures: %v",
warnings)); keep the existing all-empty check logic or combine it into this
single check and retain the warning printing logic if desired. Ensure you
reference the warnings slice and the
collectNodes/collectOperators/collectMachineConfigs callers to locate the
change.
- Around line 196-199: After resolving the target cluster with
utils.GetClusterAnyStatus (cluster) and o.ClusterID, make sure to switch the
active kube/oc context to that target cluster before invoking any collectors or
running "oc adm must-gather" (and similarly in the later block around lines
243-301). Replace or augment the current executor/collector setup to use the
target cluster's kubeconfig/REST config (via the existing helper or client
factory in your codebase) so all subsequent collector functions and must-gather
invocations run against the resolved cluster, and restore the original context
if needed after collection.

---

Nitpick comments:
In `@cmd/aws/cloudtrail/errors.go`:
- Around line 154-158: The code creates a new CloudTrail client with
cloudtrail.New(cloudtrail.Options{...}) which discards shared settings on cfg;
instead preserve cfg and only override the region by using
cloudtrail.NewFromConfig(cfg, func(o *cloudtrail.Options) { o.Region =
defaultRegion }), so update the creation of globalClient to call
cloudtrail.NewFromConfig with cfg and the region override (referencing cfg,
defaultRegion, cloudtrail.NewFromConfig, and globalClient) to retain retryer,
endpoint resolver, API options, and other shared config.

In `@docs/osdctl_cluster_diff.md`:
- Around line 18-20: The fenced code blocks showing the command invocation (e.g.
the line starting with "osdctl cluster diff <before.yaml> <after.yaml> [flags]")
and the surrounding usage/options blocks need language tags; change the example
blocks to start with ```bash and change usage/options blocks to start with
```text (apply the same change to the other fenced blocks noted in the comment
ranges 24-30, 34-37, and 41-52) so the generated docs lint correctly and render
with proper syntax highlighting.

In `@docs/osdctl_cluster_snapshot.md`:
- Around line 18-20: The fenced code blocks in docs/osdctl_cluster_snapshot.md
(e.g., the usage block showing "osdctl cluster snapshot [flags]" and the
example/output sections) lack language tags; update each fenced block to include
an appropriate language tag—use "text" for usage/options blocks (the block
around "osdctl cluster snapshot [flags]") and "bash" for example command
blocks—applying this change to the ranges noted (lines ~18, 24-33, 37-43, 47-57)
so markdownlint warnings are resolved and rendering improves.

In `@docs/osdctl_evidence_collect.md`:
- Around line 18-20: The fenced code blocks showing examples and usage should
include language tags: change the command/example blocks like the one containing
"osdctl evidence collect [flags]" to use ```bash and change the usage/options
blocks to use ```text; apply the same change to the other fenced blocks
referenced (the examples in the later examples/usage sections) so all command
snippets are ```bash and all usage/options blocks are ```text to satisfy linting
and improve rendered formatting.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: cea1a2b2-d822-4469-828c-551323ad9151

📥 Commits

Reviewing files that changed from the base of the PR and between d7fd420 and 7346956.

📒 Files selected for processing (20)
  • README.md
  • cmd/aws/cloudtrail/cmd.go
  • cmd/aws/cloudtrail/errors.go
  • cmd/aws/cmd.go
  • cmd/cluster/cmd.go
  • cmd/cluster/diff.go
  • cmd/cluster/snapshot.go
  • cmd/cmd.go
  • cmd/evidence/cmd.go
  • cmd/evidence/collect.go
  • docs/README.md
  • docs/osdctl.md
  • docs/osdctl_aws.md
  • docs/osdctl_aws_cloudtrail.md
  • docs/osdctl_aws_cloudtrail_errors.md
  • docs/osdctl_cluster.md
  • docs/osdctl_cluster_diff.md
  • docs/osdctl_cluster_snapshot.md
  • docs/osdctl_evidence.md
  • docs/osdctl_evidence_collect.md
🚧 Files skipped from review as they are similar to previous changes (3)
  • cmd/cmd.go
  • cmd/evidence/cmd.go
  • README.md

@MitaliBhalla force-pushed the SREP-3620 branch 3 times, most recently from 9c6b3bb to cd08339 on March 11, 2026 at 08:36
@openshift-ci-robot

openshift-ci-robot commented Mar 11, 2026

@MitaliBhalla: This pull request references SREP-3620 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Adds new osdctl commands to help SRE teams collect evidence during feature validation testing (IAM policies, operators, etc.), reducing manual toil and speeding up feature validation.

New Commands

osdctl aws cloudtrail errors - Surface AWS permission errors from CloudTrail
osdctl cluster snapshot - Capture point-in-time cluster state (nodes, operators, namespaces)
osdctl cluster diff - Compare two snapshots to identify changes
osdctl evidence collect - All-in-one evidence collection

Usage Examples

Surface CloudTrail permission errors

osdctl aws cloudtrail errors -C <cluster-id> --since 1h

Capture cluster state before/after testing

osdctl cluster snapshot -C <cluster-id> -o before.yaml
osdctl cluster snapshot -C <cluster-id> -o after.yaml
osdctl cluster diff before.yaml after.yaml

Collect all evidence at once

osdctl evidence collect -C <cluster-id> --output ./evidence/

Features
Supports both ROSA Classic and ROSA HCP clusters
Detects HCP clusters and provides relevant guidance
JSON output option for programmatic use
AWS Console links for CloudTrail events

Testing

Tested on ROSA Classic cluster mbhalla-osdctl (v4.21.4):

  • CloudTrail errors: Found 18 permission errors
  • Snapshot: Captured 7 nodes, 34 operators, 20 MachineConfigs
  • Diff: Correctly detected changes between snapshots
  • Evidence: Generated summary and evidence.yaml

Jira

https://issues.redhat.com/browse/SREP-3620

Summary by CodeRabbit

  • New Features

  • Added osdctl aws command group with cloudtrail errors to surface CloudTrail error events (filtering by type, JSON output, optional console links) and aws utilities.

  • Added osdctl cluster snapshot and cluster diff for capturing and comparing point-in-time cluster state; added osdctl evidence collect to gather cluster state, CloudTrail, events, and diagnostics into structured output and a summary.

  • Documentation

  • Expanded README and docs with usage, examples, and options for aws, cloudtrail errors, cluster snapshot/diff, and evidence collect.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (10)
cmd/cluster/diff.go (1)

226-243: ⚠️ Potential issue | 🟠 Major

Node condition and label changes are still invisible.

NodeSnapshot also captures Conditions and Labels, but this comparator only checks Status, Version, and Roles. A node can change readiness/conditions or labels without ever being reported in osdctl cluster diff, which makes the snapshot comparison incomplete.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/cluster/diff.go` around lines 226 - 243, The comparator currently only
compares Status, Version and Roles so changes to NodeSnapshot.Conditions and
NodeSnapshot.Labels are missed; update the modified-node detection (the block
that inspects beforeNode and afterNode and appends to diffs) to also compare
Conditions and Labels (use a stable/deep comparison like reflect.DeepEqual for
Conditions slice and compare Labels map keys/values or fmt.Sprintf on a sorted
representation) and include those fields in the changes list and in the
Before/After formatted strings for the NodeDiff so readiness/condition and label
changes are reported for the given name/beforeNode/afterNode.
docs/README.md (1)

1099-1185: ⚠️ Potential issue | 🟡 Minor

Add language specifiers to these new fenced blocks.

These generated fences still omit a language tag, so markdownlint will keep flagging MD040 here. Use text for the usage/flags blocks, and if docs/README.md is generated, fix the docgen template as well so regeneration stays lint-clean.

Also applies to: 1573-1605, 2150-2184, 2980-3045

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/README.md` around lines 1099 - 1185, The fenced code blocks under the
headings "osdctl aws", "osdctl aws cloudtrail", and "osdctl aws cloudtrail
errors" (and the other similar blocks noted) are missing a language specifier
which triggers MD040; update each triple-backtick fence that shows usage/flags
output to use the language tag "text" (e.g., ```text) and, if README.md is
generated, update the doc generation template that emits these blocks so future
regenerations include ```text by default; search for the templates that produce
the usage/flags blocks (look for the headings "osdctl aws", "osdctl aws
cloudtrail", "osdctl aws cloudtrail errors") and add the language specifier
there.
cmd/aws/cloudtrail/errors.go (1)

354-359: ⚠️ Potential issue | 🟡 Minor

Guard the STS response before dereferencing.

In aws-sdk-go-v2/service/sts v1.41.6 these fields are pointers, so dereferencing a short or empty GetCallerIdentity response would panic here instead of surfacing a normal error.

🛡️ Suggested guard
 func whoami(stsClient *sts.Client) (accountArn string, accountId string, err error) {
 	output, err := stsClient.GetCallerIdentity(context.TODO(), &sts.GetCallerIdentityInput{})
 	if err != nil {
 		return "", "", err
 	}
+	if output == nil || output.Arn == nil || output.Account == nil {
+		return "", "", fmt.Errorf("empty GetCallerIdentity response")
+	}
 	return *output.Arn, *output.Account, nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/aws/cloudtrail/errors.go` around lines 354 - 359, The whoami function
dereferences output.Arn and output.Account without checking the STS response;
modify whoami (which calls stsClient.GetCallerIdentity) to first verify output
is non-nil and that output.Arn and output.Account are non-nil before returning
them, and return a clear error if any of those fields are missing so we avoid a
panic on short/empty responses.
docs/osdctl_aws_cloudtrail_errors.md (1)

13-57: ⚠️ Potential issue | 🟡 Minor

Add language tags to the generated code fences.

These blocks are still plain triple-backtick fences, so markdownlint MD040 remains red. Use bash for the Examples block and text for the usage/options blocks. If these pages are generated, please fix the generator/template once so the same regression in the other new command docs disappears too.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/osdctl_aws_cloudtrail_errors.md` around lines 13 - 57, The code fences
in the docs (the usage block "osdctl aws cloudtrail errors [flags]" and the
Examples/Options blocks) lack language tags, triggering markdownlint MD040;
update the generated markdown so the Examples fences use "bash" and the
usage/options/inherited-options fences use "text" (or another appropriate
language identifier) and update the generator/template that emits these pages so
future command docs include the correct triple-backtick language tags (e.g.,
ensure the template that renders the Examples, Options, and inherited options
blocks sets the language attribute).
cmd/evidence/collect.go (3)

147-156: ⚠️ Potential issue | 🟠 Major

This command still doesn't collect any CloudTrail evidence.

collectCloudTrailData() is a stub that always returns nil, nil, so evidence.yaml never contains CloudTrail events even though the help text and runtime step say this command gathers them. Either wire the CloudTrail collector in here or rename this step/help to an access check so the UX matches the output.

Also applies to: 263-277, 569-576

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/evidence/collect.go` around lines 147 - 156, The help text and runtime
step claim CloudTrail events are collected but collectCloudTrailData() is
currently a stub returning nil, so evidence.yaml never contains CloudTrail data;
fix by wiring the actual CloudTrail collector into the collect command: replace
the stub call to collectCloudTrailData() with the real collector invocation (or
call the function that returns CloudTrail events) and ensure its returned events
are merged into the evidence map written to evidence.yaml (same place where
other collectors' outputs are aggregated), or alternatively update the Long help
text and runtime step names to explicitly state this is only an
access/permission check if you do not want to implement collection. Ensure you
reference collectCloudTrailData(), the evidence aggregation logic that writes
evidence.yaml, and any caller functions in cmd/evidence/collect.go so the
CloudTrail output is included.

232-237: ⚠️ Potential issue | 🔴 Critical

Don't collect from an ambient oc context.

The only guard here is a best-effort warning from verifyClusterContext(), but every collector still shells out to plain oc ... afterward. That means -C selects the cluster for metadata only; the actual evidence can come from whichever kubeconfig context is active, and the inherited --context / --kubeconfig flags never reach the subprocesses.

Please centralize subprocess creation so every oc / oc adm call is bound to the resolved target cluster before collection starts.

Also applies to: 377-379, 434-436, 492-494, 522-524, 585-590, 693-703

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/evidence/collect.go` around lines 232 - 237, The current
verifyClusterContext(...) only warns but individual collectors still invoke
plain subprocesses with "oc" which can use the ambient kubeconfig; centralize
subprocess creation by adding a single helper (e.g., runOcCmd or newOcExecutor)
in cmd/evidence/collect.go that always injects the resolved cluster
context/kubeconfig (cluster.ID(), resolved kubeconfig path or
--context/--kubeconfig flags) or sets KUBECONFIG in the env for each process;
replace direct subprocess invocations in the collectors referenced (the blocks
around verifyClusterContext, and the other occurrences you noted) to call that
helper so every "oc" / "oc adm" invocation is bound to the target cluster before
collection starts.
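The centralized subprocess helper suggested above might look like this stdlib-only sketch. The `newOcCmd` name and the `--kubeconfig` injection strategy are assumptions; the point is that every collector builds its `oc` invocation through one place bound to the resolved cluster.

```go
package main

import (
	"fmt"
	"os/exec"
)

// newOcCmd builds every oc invocation through a single helper so the resolved
// cluster's kubeconfig always wins over the caller's ambient kubecontext.
// kubeconfigPath would come from the cluster resolved via -C.
func newOcCmd(kubeconfigPath string, args ...string) *exec.Cmd {
	full := append([]string{"--kubeconfig", kubeconfigPath}, args...)
	return exec.Command("oc", full...)
}

func main() {
	cmd := newOcCmd("/tmp/target-cluster.kubeconfig", "get", "nodes", "-o", "json")
	fmt.Println(cmd.Args) // every collector shares the same binding
}
```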

250-260: ⚠️ Potential issue | 🟠 Major

Make partial cluster-state failures visible.

If one sub-collector fails, collectClusterState() still returns nil error and run() prints a success line with partial counts. Incomplete evidence needs to stay distinguishable from a clean collection.

Also applies to: 335-373

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/evidence/collect.go` around lines 250 - 260, collectClusterState() is
hiding sub-collector failures so run() prints a success line even when the
result is partial; change collectClusterState (and any helper sub-collector
functions it calls) to return a non-nil error when any sub-collector fails (or
return a sentinel PartialDataError) and propagate that error to the caller, then
update the call site in run() (where o.collectClusterState() is invoked and
evidence.ClusterState is set) to treat a non-nil error as a warning/partial
result (log the error and still attach the partial clusterState) instead of
printing a clean success; reference collectClusterState,
o.collectClusterState(), evidence.ClusterState and the run() caller to find and
update the code paths.
docs/osdctl_cluster_diff.md (1)

18-52: ⚠️ Potential issue | 🟡 Minor

Add language tags to the fenced blocks.

These anonymous fences will still trip MD040. Please mark the examples block as bash and the synopsis/options blocks as text so the generated docs render cleanly and lint passes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/osdctl_cluster_diff.md` around lines 18 - 52, Update the fenced code
blocks for the osdctl cluster diff documentation to include language tags so the
linter passes: change the synopsis and options blocks (the blocks showing
"osdctl cluster diff <before.yaml> <after.yaml> [flags]" and the Options
sections) to use ```text and mark the Examples block's shells (the example
invocations like "osdctl cluster diff before.yaml after.yaml" and the --json
flag example) as ```bash; ensure each existing anonymous fence around those
snippets is replaced with the appropriately tagged fence so the "osdctl cluster
diff" docs render correctly.
cmd/cluster/snapshot.go (2)

65-71: ⚠️ Potential issue | 🟠 Major

ResourceInfo is too lossy for meaningful diffs.

For many resource types called out by this command (MachineConfig, Deployment, Service, etc.), status.phase is empty or unrelated to the actual rollout/spec change being validated. In those cases the snapshot only supports add/remove detection and misses the modifications users are collecting evidence for. Please store a stable digest or richer per-object summary alongside Status so cluster diff can detect real changes.

Also applies to: 400-429

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/cluster/snapshot.go` around lines 65 - 71, ResourceInfo is too lossy for
diffs; update the ResourceInfo struct (the type named ResourceInfo and its
fields Name, Namespace, Kind, Status) to include a stable per-object digest or
richer summary field (e.g., Digest or Summary) that captures a hash of the
meaningful spec+metadata (or a small canonical summary) so cluster diff can
detect modifications beyond add/remove; ensure wherever ResourceInfo instances
are created (the snapshot collection code that builds these structs) you compute
and populate that digest/summary from the object's spec and stable metadata so
downstream diff logic can compare the new Digest/Summary in addition to Status.
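The digest idea above can be sketched with a deterministic spec hash; the extended `ResourceInfo` fields and the `specHash` helper are assumptions for illustration (encoding/json sorts map keys, so equal specs hash identically).

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// ResourceInfo extended with fields that make spec/rollout changes diffable;
// field names follow the review suggestion, exact shapes are assumptions.
type ResourceInfo struct {
	Name            string
	Namespace       string
	Kind            string
	Status          string
	Labels          map[string]string
	Generation      int64
	ResourceVersion string
	SpecHash        string // stable digest of the object's spec
}

// specHash produces a deterministic digest of an arbitrary spec document.
func specHash(spec map[string]any) string {
	raw, err := json.Marshal(spec)
	if err != nil {
		return ""
	}
	return fmt.Sprintf("%x", sha256.Sum256(raw))
}

func main() {
	before := specHash(map[string]any{"replicas": 3})
	after := specHash(map[string]any{"replicas": 5})
	fmt.Println("spec changed:", before != after)
}
```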

121-145: ⚠️ Potential issue | 🔴 Critical

Bind oc collection to the requested cluster.

run() resolves --cluster-id through OCM, but every collector still executes bare oc commands against the caller's current kubecontext. That can produce a snapshot whose metadata describes cluster B while nodes/namespaces/operators come from cluster A. Please establish the target kubeconfig/context up front and thread it through every oc invocation, or fail fast if the active context does not match.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/cluster/snapshot.go` around lines 121 - 145, The run() function resolves
--cluster-id via OCM but never binds the kubectl/oc context, causing collectors
to run against the caller's current kubeconfig; update run() (in
snapshotOptions) to resolve the cluster's API URL/cluster credentials
immediately after GetClusterAnyStatus and then set or validate the active
kubecontext for that cluster (for example by invoking a utility like
utils.SetKubeContextForCluster(cluster) or utils.LoadClusterKubeconfig(cluster)
and exporting KUBECONFIG) and propagate that kubeconfig/context into every
collector call (or fail fast if the active context does not match the target);
also update collector function signatures to accept a kubeconfig/context
parameter or make them read from the newly set environment variable so all oc
invocations target the requested cluster.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/aws/cloudtrail/errors.go`:
- Around line 87-93: The RunE handler currently only treats "json" specially and
silently ignores other values; update the anonymous RunE function (the block
using cmd.Flags().GetString("output"), opts.OutputJSON and returning opts.run())
to explicitly validate the output flag: if output == "json" set opts.OutputJSON
= true, else if output == "" proceed as before, but if output is any other
non-empty value (e.g., "yaml", "env") return a descriptive error (e.g.,
fmt.Errorf("unsupported --output value: %s", output)) so automation gets a clear
failure instead of falling back to human formatting.

In `@cmd/cluster/snapshot.go`:
- Around line 226-229: Replace all uses of exec.CommandContext(...).Output()
with exec.CommandContext(...).CombinedOutput() and include the command output in
the returned error; specifically, for each exec.CommandContext call that assigns
output, err (e.g., the "oc get nodes" invocation and the other three
exec.CommandContext usages), change to CombinedOutput() and wrap failures with a
formatted error that includes strings.TrimSpace(string(output)) (e.g., return
nil, fmt.Errorf("oc ... failed: %s", strings.TrimSpace(string(output)))). Ensure
you update each occurrence that currently uses .Output() so stderr is preserved
in diagnostics.

---

Duplicate comments:
In `@cmd/aws/cloudtrail/errors.go`:
- Around line 354-359: The whoami function dereferences output.Arn and
output.Account without checking the STS response; modify whoami (which calls
stsClient.GetCallerIdentity) to first verify output is non-nil and that
output.Arn and output.Account are non-nil before returning them, and return a
clear error if any of those fields are missing so we avoid a panic on
short/empty responses.

In `@cmd/cluster/diff.go`:
- Around line 226-243: The comparator currently only compares Status, Version
and Roles so changes to NodeSnapshot.Conditions and NodeSnapshot.Labels are
missed; update the modified-node detection (the block that inspects beforeNode
and afterNode and appends to diffs) to also compare Conditions and Labels (use a
stable/deep comparison like reflect.DeepEqual for Conditions slice and compare
Labels map keys/values or fmt.Sprintf on a sorted representation) and include
those fields in the changes list and in the Before/After formatted strings for
the NodeDiff so readiness/condition and label changes are reported for the given
name/beforeNode/afterNode.

In `@cmd/cluster/snapshot.go`:
- Around line 65-71: ResourceInfo is too lossy for diffs; update the
ResourceInfo struct (the type named ResourceInfo and its fields Name, Namespace,
Kind, Status) to include a stable per-object digest or richer summary field
(e.g., Digest or Summary) that captures a hash of the meaningful spec+metadata
(or a small canonical summary) so cluster diff can detect modifications beyond
add/remove; ensure wherever ResourceInfo instances are created (the snapshot
collection code that builds these structs) you compute and populate that
digest/summary from the object's spec and stable metadata so downstream diff
logic can compare the new Digest/Summary in addition to Status.
- Around line 121-145: The run() function resolves --cluster-id via OCM but
never binds the kubectl/oc context, causing collectors to run against the
caller's current kubeconfig; update run() (in snapshotOptions) to resolve the
cluster's API URL/cluster credentials immediately after GetClusterAnyStatus and
then set or validate the active kubecontext for that cluster (for example by
invoking a utility like utils.SetKubeContextForCluster(cluster) or
utils.LoadClusterKubeconfig(cluster) and exporting KUBECONFIG) and propagate
that kubeconfig/context into every collector call (or fail fast if the active
context does not match the target); also update collector function signatures to
accept a kubeconfig/context parameter or make them read from the newly set
environment variable so all oc invocations target the requested cluster.

In `@cmd/evidence/collect.go`:
- Around line 147-156: The help text and runtime step claim CloudTrail events
are collected but collectCloudTrailData() is currently a stub returning nil, so
evidence.yaml never contains CloudTrail data; fix by wiring the actual
CloudTrail collector into the collect command: replace the stub call to
collectCloudTrailData() with the real collector invocation (or call the function
that returns CloudTrail events) and ensure its returned events are merged into
the evidence map written to evidence.yaml (same place where other collectors'
outputs are aggregated), or alternatively update the Long help text and runtime
step names to explicitly state this is only an access/permission check if you do
not want to implement collection. Ensure you reference collectCloudTrailData(),
the evidence aggregation logic that writes evidence.yaml, and any caller
functions in cmd/evidence/collect.go so the CloudTrail output is included.
- Around line 232-237: The current verifyClusterContext(...) only warns but
individual collectors still invoke plain subprocesses with "oc" which can use
the ambient kubeconfig; centralize subprocess creation by adding a single helper
(e.g., runOcCmd or newOcExecutor) in cmd/evidence/collect.go that always injects
the resolved cluster context/kubeconfig (cluster.ID(), resolved kubeconfig path
or --context/--kubeconfig flags) or sets KUBECONFIG in the env for each process;
replace direct subprocess invocations in the collectors referenced (the blocks
around verifyClusterContext, and the other occurrences you noted) to call that
helper so every "oc" / "oc adm" invocation is bound to the target cluster before
collection starts.
- Around line 250-260: collectClusterState() is hiding sub-collector failures so
run() prints a success line even when the result is partial; change
collectClusterState (and any helper sub-collector functions it calls) to return
a non-nil error when any sub-collector fails (or return a sentinel
PartialDataError) and propagate that error to the caller, then update the call
site in run() (where o.collectClusterState() is invoked and
evidence.ClusterState is set) to treat a non-nil error as a warning/partial
result (log the error and still attach the partial clusterState) instead of
printing a clean success; reference collectClusterState,
o.collectClusterState(), evidence.ClusterState and the run() caller to find and
update the code paths.

In `@docs/osdctl_aws_cloudtrail_errors.md`:
- Around line 13-57: The code fences in the docs (the usage block "osdctl aws
cloudtrail errors [flags]" and the Examples/Options blocks) lack language tags,
triggering markdownlint MD040; update the generated markdown so the Examples
fences use "bash" and the usage/options/inherited-options fences use "text" (or
another appropriate language identifier) and update the generator/template that
emits these pages so future command docs include the correct triple-backtick
language tags (e.g., ensure the template that renders the Examples, Options, and
inherited options blocks sets the language attribute).

In `@docs/osdctl_cluster_diff.md`:
- Around line 18-52: Update the fenced code blocks for the osdctl cluster diff
documentation to include language tags so the linter passes: change the synopsis
and options blocks (the blocks showing "osdctl cluster diff <before.yaml>
<after.yaml> [flags]" and the Options sections) to use ```text and mark the
Examples block's shells (the example invocations like "osdctl cluster diff
before.yaml after.yaml" and the --json flag example) as ```bash; ensure each
existing anonymous fence around those snippets is replaced with the
appropriately tagged fence so the "osdctl cluster diff" docs render correctly.

In `@docs/README.md`:
- Around line 1099-1185: The fenced code blocks under the headings "osdctl aws",
"osdctl aws cloudtrail", and "osdctl aws cloudtrail errors" (and the other
similar blocks noted) are missing a language specifier which triggers MD040;
update each triple-backtick fence that shows usage/flags output to use the
language tag "text" (e.g., ```text) and, if README.md is generated, update the
doc generation template that emits these blocks so future regenerations include
```text by default; search for the templates that produce the usage/flags blocks
(look for the headings "osdctl aws", "osdctl aws cloudtrail", "osdctl aws
cloudtrail errors") and add the language specifier there.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c58040ae-8f4b-48e8-b222-f67bbb3851fc

📥 Commits

Reviewing files that changed from the base of the PR and between 7346956 and cd08339.

📒 Files selected for processing (20)
  • README.md
  • cmd/aws/cloudtrail/cmd.go
  • cmd/aws/cloudtrail/errors.go
  • cmd/aws/cmd.go
  • cmd/cluster/cmd.go
  • cmd/cluster/diff.go
  • cmd/cluster/snapshot.go
  • cmd/cmd.go
  • cmd/evidence/cmd.go
  • cmd/evidence/collect.go
  • docs/README.md
  • docs/osdctl.md
  • docs/osdctl_aws.md
  • docs/osdctl_aws_cloudtrail.md
  • docs/osdctl_aws_cloudtrail_errors.md
  • docs/osdctl_cluster.md
  • docs/osdctl_cluster_diff.md
  • docs/osdctl_cluster_snapshot.md
  • docs/osdctl_evidence.md
  • docs/osdctl_evidence_collect.md
✅ Files skipped from review due to trivial changes (1)
  • README.md
🚧 Files skipped from review as they are similar to previous changes (3)
  • docs/osdctl.md
  • docs/osdctl_aws_cloudtrail.md
  • docs/osdctl_cluster.md

@MitaliBhalla (Contributor Author):

/retest

1 similar comment
@MitaliBhalla (Contributor Author):

/retest

// NewCmdCloudtrail returns the cloudtrail command group under aws
func NewCmdCloudtrail() *cobra.Command {
cloudtrailCmd := &cobra.Command{
Use: "cloudtrail",
Contributor:

We already have osdctl cloudtrail commands

@MitaliBhalla (Contributor Author) commented Mar 12, 2026:
The existing commands cover different ground:

  • write-events - fetches all CloudTrail write events with filtering
  • permission-denied-events - only matches Client.UnauthorizedOperation errors

For feature testing/IAM validation, we need to catch a broader set of permission errors:

  • AccessDenied
  • UnauthorizedOperation / Client.UnauthorizedOperation
  • Forbidden
  • InvalidClientTokenId
  • AuthFailure
  • ExpiredToken
  • SignatureDoesNotMatch

Plus additional features like --error-types for custom filtering, --json output, and HCP cluster awareness.

Options:

1. Enhance permission-denied-events - add broader error patterns + new flags (keeps one command, but changes existing behavior)
2. Add a new osdctl cloudtrail errors subcommand - broader error detection under the existing cloudtrail hierarchy (preserves existing behavior)

Which approach would you prefer?

- osdctl aws cloudtrail errors: surface AWS permission errors
- osdctl cluster snapshot: capture cluster state
- osdctl cluster diff: compare snapshots
- osdctl evidence collect: all-in-one collection
- Support ROSA Classic and HCP clusters
- Add README documentation
openshift-ci bot commented Mar 12, 2026

@MitaliBhalla: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/build | cf7ec3c | link | true | /test build |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.


4 participants