
Enable Network Observability on Day 0 #1908

Open
stleerh wants to merge 13 commits into openshift:master from stleerh:enable-network-observability

Conversation

@stleerh
Contributor

@stleerh stleerh commented Dec 16, 2025

No description provided.

@jotak
Contributor

jotak commented Dec 16, 2025

/cc

@openshift-ci openshift-ci bot requested a review from jotak December 16, 2025 16:22

Being able to manage and observe the network in an OpenShift cluster is critical in maintaining the health and integrity of the network. Without it, there’s no way to verify whether your changes are working as expected or whether your network is experiencing issues.

Currently, Network Observability is an optional operator that many customers are not aware of. A majority of customers using OpenShift Networking do not have Network Observability installed. Customers are missing out on features that they should have and have already paid for.
Contributor

from telemetry, what's the percentage of clusters with NOO installed today?

Contributor Author

We don't have telemetry for the percentage of clusters installed. It's one of the top downloaded operators but in terms of usage, it's not in the top 20.

Contributor

We don't have telemetry for the percentage of clusters installed

we should be able to get that information from telemetry already (OLM reports metrics about installed operators).

Contributor Author

Does it have percentage of clusters installed?

Contributor

we have the total number of clusters reporting to telemetry and the total number of clusters with NOO installed.

### Risks and Mitigations

* Network Observability requires CPU, memory, and storage that the customer might not be aware of.
Mitigation: The default setting stores only metrics at a high sampling interval to minimize the use of resources. If this isn’t sufficient, more fine-tuning and filtering can be done in the provided default configuration (e.g. filtering on specific interfaces only).
Contributor

Single node OpenShift (SNO) clusters have strict requirements in terms of resource usage: is it ok that NOO gets installed by default in this case?

Contributor Author
@stleerh stleerh Feb 13, 2026

Clarified the stance on this


### Risks and Mitigations

* Network Observability requires CPU, memory, and storage that the customer might not be aware of.
Contributor

Can we get an estimate of the overhead running NOO with a "stripped-down" flow collector on a typical CI cluster (3 control plane nodes + 3 workers)?

Contributor Author

Estimates are in the works. The target goals are listed in the Test Plan section of this document.


Summary:

* Sampling at 400
Contributor

Does it depend on the in-cluster Prometheus stack? If yes how does it play with #1880?

Contributor Author

Yes. PM has said, "The default OpenShift experience should continue to include Prometheus, alerts, and baseline dashboards."

Contributor

@simonpasquier I'm currently trying to understand how that plays with optional monitoring, but I don't think that's something tied to this EP (I mean, the question is equally relevant to when netobserv or any other redhat operator is installed from operatorhub?)

Contributor
@jotak jotak Feb 17, 2026

My guess is that, when CMO is in lightweight mode like that, netobserv will show no data and potentially query failures;
Given that this is an explicit choice of the user, I don't think it's too problematic; it would make sense, in that case, that the user manually uninstalls netobserv.

Contributor

To rephrase my question: is it worth deploying netobserv if there's no platform monitoring stack? I understand that users create sub-optimal setups but it might be better to prevent it in the first place?

Contributor

Then when users disable the in-cluster monitoring stack (as envisioned in #1880), CNO shouldn't install NOO?

Contributor

I don't think CNO should refuse to install NOO when the user explicitly asks to install it. (Or are you talking more specifically about what the default behaviour should be?)
NOO still has valid usages without the monitoring stack, e.g. if configured to export flows to a 3rd-party system.
So if the user asks to install it, we shouldn't go against their will.

Contributor

is there some install pre-flight checks where we can generate warnings in such situations?

Contributor

Whatever option we pick (never install NOO at day-0 when monitoring is disabled or delegate the decision to the user), I'd rather see it documented somewhere including what it entails in terms of features/limitations (and this EP seems to be the right place).

Contributor Author

That should be documented in Network Observability. This EP only controls whether Network Observability is enabled.


### Topology Considerations

All topologies are supported where CNO is supported, so this excludes MicroShift.
Contributor

Can we describe briefly how it works for HyperShift?

Contributor Author

This proposal doesn't change how Network Observability works in a Hosted Control Plane (HCP) environment. Network Observability is supported on hosted clusters and the management cluster.

Contributor

Should the management cluster admin be able to override this (e.g. disable netobserv for all hosted clusters by default)?

Contributor Author

There's no single parameter to do that. You can prevent Network Observability from being installed/enabled on a per-cluster basis.

Contributor

I was more thinking of a global switch for the admin of the management cluster running all the hosted control planes.

Contributor Author

That could be something that's added in the future.

4. Wait for NOO to be ready and the OpenShift web console to be available.
5. Create the "netobserv" namespace if it doesn't exist.
6. Check if a FlowCollector instance exists. If yes, exit.
7. Create a FlowCollector instance.
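The quoted steps can be sketched as a toy reconcile loop. The `clusterState` type and step helpers below are invented for illustration only; they are not CNO code, just a model of the ordering and idempotency the steps imply:

```go
package main

import "fmt"

// clusterState is a toy model of what the controller would observe;
// it is illustrative and not an actual CNO type.
type clusterState struct {
	nooReady         bool
	consoleAvailable bool
	namespaceExists  bool
	flowCollector    bool
}

// reconcile mirrors steps 4-7: wait for readiness, ensure the "netobserv"
// namespace, and create a FlowCollector only if none exists (idempotent).
func reconcile(s *clusterState) string {
	if !s.nooReady || !s.consoleAvailable {
		return "requeue: waiting for NOO and the web console"
	}
	if !s.namespaceExists {
		s.namespaceExists = true // step 5: create the namespace
	}
	if s.flowCollector {
		return "done: FlowCollector already exists" // step 6: exit
	}
	s.flowCollector = true // step 7: create the FlowCollector instance
	return "done: FlowCollector created"
}

func main() {
	s := &clusterState{}
	fmt.Println(reconcile(s)) // not ready yet, requeue
	s.nooReady, s.consoleAvailable = true, true
	fmt.Println(reconcile(s)) // creates namespace and FlowCollector
	fmt.Println(reconcile(s)) // second pass is a no-op
}
```

Running `reconcile` repeatedly converges: a pre-existing FlowCollector is never overwritten, which matches the "If yes, exit" guard in step 6.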
Contributor

Concretely what gets deployed in terms of pods and services? Regarding pods, what level of customization is offered in terms of resource requests/limits and scheduling (infrastructure nodes)?

Contributor Author

This is subject to change, but the planned deployment is in the FlowCollector Custom Resource (CR) section of this document. There is no customization for the user. If this is needed, you would just edit the FlowCollector custom resource directly.

Contributor

@simonpasquier how do other in-payload operators typically do that? Through day-0 config?

I think our intent is that the default sampling settings give us a comfortable margin to not run into oomkill from the start. Later on, admins can change the sampling & adapt the limits accordingly.

Contributor

For compute resources, core operators define default resource requests but no limits (https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#resources-and-limits). I can mostly speak for the cluster monitoring operator but in our case, admins would customize the CMO configuration after installation to schedule components on infra nodes or adjust resource requests for instance.

Contributor Author

The best practice is to NOT set a CPU limit but to set a memory limit to avoid the OOM killer. Typically, you set the memory limit equal to the memory request.

Contributor

not for OCP core components as documented in the linked pages.

Contributor Author

Going back to your comment, you can also adjust resources after installation.

The current proposal just takes the default values from Network Observability. If those values are not right, the preference is to change them in Network Observability rather than to override them here.
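The limits discussion above can be made concrete with a small sketch. It assumes the memory-limit-equals-request practice argued in this thread; the type and figures are illustrative placeholders, not NetObserv defaults (and note that the OpenShift core-component conventions linked earlier instead set requests with no limits at all):

```go
package main

import "fmt"

// requirements is an illustrative stand-in for a pod's resource
// requirements; an empty string means the field is unset.
type requirements struct {
	cpuRequest string
	memRequest string
	cpuLimit   string
	memLimit   string
}

// agentRequirements applies the practice from the thread: memory limit
// equal to the memory request, no CPU limit (CPU throttling tends to hurt
// an agent more than bursting does). Values are placeholders.
func agentRequirements(memRequest string) requirements {
	return requirements{
		cpuRequest: "100m",
		memRequest: memRequest,
		cpuLimit:   "", // deliberately unset
		memLimit:   memRequest,
	}
}

func main() {
	r := agentRequirements("800Mi")
	fmt.Println(r.memLimit == r.memRequest, r.cpuLimit == "")
}
```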


Rather than actually installing NOO and creating the FlowCollector instance, it is less risky and simpler to just display a panel or a button that lets the user install and enable Network Observability. This resolves the awareness issue. However, it would result in far fewer installs compared to enabling it by default, and it goes against the principle that networking and network observability should always go hand in hand and be there from the start.

## Alternatives (Not Implemented)
Contributor

Was installation via the assisted-installer considered? https://github.com/openshift/assisted-service/blob/master/docs/dev/olm-operator-plugins.md

This seems like a viable option to mitigate the drawbacks around topologies and resource constraints.


That's a parallel work:
https://issues.redhat.com/browse/NETOBSERV-2486
openshift/assisted-service#8729

However, the assisted installer doesn't cover all installation cases. That's why we're considering an alternative.

@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 6, 2026
@openshift-bot

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 13, 2026
@stleerh stleerh force-pushed the enable-network-observability branch 2 times, most recently from 9a77395 to a22a08a on February 16, 2026 23:17
deploymentModel: Service
loki:
enable: false
namespace: netobserv
Contributor

I'm not very familiar with network observability but does netobserv here mean in which namespace the FlowCollector resources will be installed? If yes, why not an openshift- prefixed namespace?

Contributor

that's a good point, the CRD default is "netobserv", but it could make more sense to use "openshift-netobserv" now that's baked into the installer.

Contributor Author

This was a discussion from many years ago (NETOBSERV-373). The original description was changed so if you look at the History link, it said:

The current project name for Network Observability is "network-observability".  The downstream version should be in a project that begins with "openshift-" that is reserved for OpenShift namespaces.  The suggested new name is "openshift-network-observability".

Nevertheless, it's a good point that should be reconsidered. However, this proposal would just follow whatever Network Observability decides to do.

Contributor

I was rather thinking that, on the contrary, what we do here can deviate from upstream

Contributor

If it is going to be an openshift-owned thing, deviating from upstream and using an openshift-* style namespace would be my recommendation

- Hosted Control Plane (HCP) environment
- Add e2e tests in [OpenShift Release Tooling](https://github.com/openshift/release)

Performance testing will be done to optimize resource usage and to determine the specific FlowCollector settings, with the goal of consuming less than 5% of cluster resources (CPU and memory) and an ideal target of less than 3%.
Contributor

Can we clarify that the measured overhead will be based on the resource utilization of all cluster components? E.g. ingesting more metrics and evaluating more rules will also increase the resource usage of Prometheus.

Contributor Author
@stleerh stleerh Feb 17, 2026

Good point. It should include external components that are affected.

3. Wait for NOO to be ready and the OpenShift web console to be available.
4. Create the "netobserv" namespace if it doesn't exist.
5. Check if a FlowCollector instance exists. If yes, exit.
6. Create a FlowCollector instance.
Contributor

How would CNO signal to the user when one of the steps fails?

Contributor Author

It will be in the log file.

Contributor

What about reporting this via status conditions?

Contributor

+1 for a status condition

Contributor Author

yes, that can be done

Contributor Author

A sentence was added about this in the last section.


### Workflow Description

Network Observability is enabled by default on day 0 (planning stage). You don't have to configure anything when using `openshift-install`; the Network Observability Operator will be installed and a FlowCollector custom resource (CR) created (Listing 3).
Contributor

Is NOO supported for all (including future/development) OCP releases and architectures?

Contributor Author

It is today. The goal is to support the new OCP releases and architectures.

@openshift-bot

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Mar 6, 2026
@openshift-ci
Contributor

openshift-ci bot commented Mar 6, 2026

@openshift-bot: Closed this PR.


In response to this:

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@OlivierCazade
Contributor

/reopen

@openshift-ci openshift-ci bot reopened this Mar 10, 2026
@openshift-ci
Contributor

openshift-ci bot commented Mar 10, 2026

@OlivierCazade: Reopened this PR.


In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

OlivierCazade added a commit to OlivierCazade/api that referenced this pull request Mar 15, 2026
Add installNetworkObservability field to enable network observability
installation during cluster deployment (day-0).

- Add InstallNetworkObservability field to NetworkSpec
- Add NetworkObservabilityInstall feature gate enabled in DevPreview and TechPreview
- Add integration test suite for the new field

Related: openshift/enhancements#1908
OlivierCazade added a commit to OlivierCazade/api that referenced this pull request Mar 24, 2026
Add networkObservability nested structure to configure network observability
installation during cluster deployment (day-0).

- Add NetworkObservability field to NetworkSpec with NetworkObservabilitySpec type
- Add NetworkObservabilityInstallationPolicy enum ("", "InstallAndEnable", "DoNotInstall")
- Add NetworkObservabilityInstall feature gate enabled in DevPreview and TechPreview
- Add integration test suite for the new field

When installationPolicy is set to "" (empty string) or "InstallAndEnable",
network observability will be installed and enabled. When set to "DoNotInstall",
it will not be installed.

Related: openshift/enhancements#1908
OlivierCazade added a commit to OlivierCazade/api that referenced this pull request Mar 25, 2026
Add networkObservability nested structure to configure network observability
installation during cluster deployment (day-0).

- Add NetworkObservability field to NetworkSpec with NetworkObservabilitySpec type
- Add NetworkObservabilityInstallationPolicy enum ("", "InstallAndEnable", "DoNotInstall")
- Add NetworkObservabilityInstall feature gate enabled in DevPreview and TechPreview
- Add integration test suite for the new field

When installationPolicy is set to "" (empty string) or "InstallAndEnable",
network observability will be installed and enabled. When set to "DoNotInstall",
it will not be installed.

Related: openshift/enhancements#1908
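The API shape described in these commit messages can be mirrored in a small sketch. The Go identifiers below are assumptions for illustration, not the final openshift/api spelling; only the three enum string values and the default behaviour come from the commit messages:

```go
package main

import "fmt"

// InstallationPolicy mirrors the enum from the commit messages above:
// "", "InstallAndEnable", "DoNotInstall".
type InstallationPolicy string

const (
	PolicyUnset            InstallationPolicy = ""
	PolicyInstallAndEnable InstallationPolicy = "InstallAndEnable"
	PolicyDoNotInstall     InstallationPolicy = "DoNotInstall"
)

// shouldInstall encodes the stated default: an empty value behaves like
// InstallAndEnable, and only DoNotInstall opts out.
func shouldInstall(p InstallationPolicy) bool {
	return p != PolicyDoNotInstall
}

func main() {
	for _, p := range []InstallationPolicy{PolicyUnset, PolicyInstallAndEnable, PolicyDoNotInstall} {
		fmt.Printf("%q -> install=%v\n", p, shouldInstall(p))
	}
}
```

Making the empty string equivalent to `InstallAndEnable` is what makes the feature on-by-default for clusters that never touch the field.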
@stleerh stleerh force-pushed the enable-network-observability branch from 423c8df to 073794e on March 25, 2026 17:33
api-approvers:
- "@jotak"
- "@dave-tucker"
- {api-team}
Contributor

@everettraven here please


In CNO, it adds a new controller for observability and adds it to the manager. The controller is a single Go file where the Reconciler reads the state of the `installationPolicy` field. If set to `InstallAndEnable`, it does the following:

1. Check if Network Observability Operator (NOO) is installed. If yes, exit.
Contributor

How are you intending to evaluate this?

Contributor Author

See PR #2925.

Member

Can you elaborate please @stleerh?
I see wasNetworkObservabilityDeployed but that's checking the Status of the network object - but that doesn't seem the same as checking if NOO was installed or not.
What's the expected behaviour when a customer who previously installed NOO via the catalog, upgrades to an OCP version which now attempts to install it via CNO?

Contributor

There are two different things:

* The CNO controller checks for an existing FlowCollector CRD to detect an already installed NOO, so that it doesn't overwrite it.
* Once the CNO controller has installed NOO, we want the user to own the deployment and be free to modify anything. This is where the status is used: once the controller has successfully installed NOO, it sets the status, and any later reconciliation checks the status and ends early.

In CNO, it adds a new controller for observability and adds it to the manager. The controller is a single Go file where the Reconciler reads the state of the `installationPolicy` field. If set to `InstallAndEnable`, it does the following:

1. Check if Network Observability Operator (NOO) is installed. If yes, exit.
2. Install NOO using OLM's OperatorGroup and Subscription.
Contributor

Have you considered making this fully managed by the cluster-network-operator instead of utilizing OLM?

If this is going to be considered a core component of OpenShift moving forward, it seems like it could be reasonable to try and move it away from being a layered product and thus installed by OLM so there is a tighter coupling between OCP version and NOO version.

If you take the OLM approach, will you be pinning the version of NOO to a particular version to prevent automatic upgrades? Are we going to support customers modifying the OLM resources created by the CNO? Would we support running a newer version of NOO on an older version of OCP (since OLM supports over-the-air style updates and upgrades?

There are some nuances when leveraging OLM as the deployment mechanism and I'd like to better understand the interactions here.

Contributor Author

We want Network Observability to be there when you have the Cluster Network Operator (CNO), but it can run independently of that in the upstream version.

We have one version of NOO that is backwards-compatible to all versions of OCP. We've been releasing a new X.Y version alongside the OCP version.

Member

It would be helpful to document how this pattern has been used before, i.e in CIO
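One way to address the version-pinning question raised above is an OLM Subscription with manual install-plan approval and a starting CSV. The struct below is an illustrative stand-in, not CNO code: the field names mirror the OLM Subscription spec, while the channel, catalog source, and CSV values are placeholders, not decisions made in this enhancement:

```go
package main

import "fmt"

// Subscription is an illustrative stand-in for the OLM Subscription that
// CNO would create in step 2.
type Subscription struct {
	Package             string
	Channel             string
	Source              string
	SourceNamespace     string
	InstallPlanApproval string // "Manual" blocks automatic upgrades
	StartingCSV         string // pins the initial operator version
}

// pinnedSubscription builds a subscription pinned to a given CSV; the csv
// argument is a placeholder version string for illustration.
func pinnedSubscription(csv string) Subscription {
	return Subscription{
		Package:             "netobserv-operator",
		Channel:             "stable",
		Source:              "redhat-operators",
		SourceNamespace:     "openshift-marketplace",
		InstallPlanApproval: "Manual",
		StartingCSV:         csv,
	}
}

func main() {
	s := pinnedSubscription("network-observability-operator.v1.9.0")
	fmt.Println(s.InstallPlanApproval, s.StartingCSV)
}
```

With automatic approval and no `StartingCSV`, OLM would instead float NOO to the newest version in the channel, which is the over-the-air coupling concern raised in this thread.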

deploymentModel: Service
loki:
enable: false
namespace: netobserv
Contributor

If it is going to be an openshift-owned thing, deviating from upstream and using an openshift-* style namespace would be my recommendation


#### Hypershift / Hosted Control Planes

This proposal doesn't change how Network Observability works in a Hosted Control Plane (HCP) environment. Network Observability is supported on hosted clusters and the management cluster, therefore it will be enabled by default.
Contributor

How does the cluster-network-operator equivalent in HyperShift work? For example, I know that the HyperShift Control Plane Operator has different mechanics than the cluster-authentication-operator for authentication/authorization related things.

Will there need to be any changes to HyperShift's controller behaviors to support this new functionality?

Contributor Author

Network Observability is already supported on HCP. Enabling NOO by default does nothing to change that.

Contributor

This doesn't answer my question. Are there any changes needed on the HyperShift side to support enabling this by default?

Contributor Author

No


Rather than have CNO enable Network Observability, take the existing Network Observability Operator (NOO) and have it be installed by default in the cluster. There needs to be some logic to accept the values in openshift-install to decide whether NOO should be enabled or not.

The core components of OpenShift are operators like Cluster Network Operator (CNO) and Cluster Storage Operator (CSO). NOO is a much smaller component and should not reside at the top level.
Contributor

I'm not sure the size of a component really dictates how it should be deployed.

Contributor Author

Regardless, I don't believe it belongs at this level.

Contributor @tssurya, Mar 27, 2026:

I'm not sure the size of a component really dictates how it should be deployed.

I agree with this. Note that we have something called CloudNetworkConfigController, which is a core operator on cloud platforms and does very small things scope-wise.

My preference is to ship NOO as core.
Is my understanding correct that

There needs to be some logic to accept the values in openshift-install to decide whether NOO should be enabled or not

Is that the only reason why this alternative was not considered, or am I missing something? Avoiding installer code changes doesn't seem like a valid reason.

Is NetObserv completely tied to "OVN-Kubernetes" ? and is never going to expand beyond being an observability solution outside of the components owned by CNO?

Contributor Author:

Here are the two main points.

  1. If you have CNO (basically you are running OpenShift), then you should have Network Observability. That's the basis for this feature and where it's linked to OVN-Kubernetes. In addition, my push is that as features get added to OVN-Kubernetes, they should be immediately supported by Network Observability in the same release.

  2. Network Observability can also run with other container network interfaces (CNIs). So no, it's not exclusively tied to OVN-Kubernetes, but it does support unique features of OVN-Kubernetes.

Contributor @tssurya, Mar 31, 2026:

Here are the two main points.

  1. If you have CNO (basically you are running OpenShift), then you should have Network Observability.

again, why not make the same claim, "if you are running OpenShift, then you are running NOO", and hence you are running NetObserv? Maybe I am missing that point still.

That's the basis for this feature and where it's linked to OVN-Kubernetes. In addition, my push is that as features get added to OVN-Kubernetes, they should be immediately supported by Network Observability in the same release.

sure, that's still possible to do via NOO, right? can't NOO adapt to the OCP release cycle if it's part of OCP? again, apologies if I am not familiar with the NOO lifecycle

  1. Network Observability can also run with other container network interface (CNI). So no, it's not exclusively tied to OVN-Kubernetes, but it does support unique features of OVN-Kubernetes.

so does Network Observability work for all the 3rd-party CNIs allowed on OCP? what happens when those CNIs are used instead of OVN-Kubernetes, for example Cilium? Will NetObserv still be running by default and doing observability for Cilium? what about Layer 7 - do we plan to expand there? or is NetObserv strictly restricted to L2/L3/L4 always?

Contributor:

I'm having a hard time wrapping my head around why we want this to essentially be a core feature of OpenShift without coupling it to the OCP version by including it as part of the payload.

Maybe it helps to have a sync call with appropriate stakeholders to hash this out, but my biggest concern with the currently proposed approach is that you mention:

my push is that as features get added to OVN-Kubernetes, they should be immediately supported by Network Observability in the same release.

I could be misunderstanding here, but if the intention is that we should be tying NOO versions to specific OCP versions due to dependence of features, continuing to use OLM as the deployment mechanism starts to make things more confusing for end users. They have to start understanding that they are responsible for ensuring that NOO is up-to-date and that they aren't accidentally installing a version that is incompatible with the version of OpenShift that they are running.

IIRC, there are some safeguards in OLM to help mitigate this but it still feels like we are putting the onus on the user to be more attentive to the platform requirements rather than us being responsible for making sure we are always installing a compatible version of NOO automatically and that it will always stay compatible without any user intervention.

Contributor Author @stleerh, Apr 7, 2026:

Surya Seetharaman, Dave Tucker, Ben Bennett and I had a discussion on April 1 to hash this out. The conclusion is that we agreed to have CNO enable NOO, so the current proposal remains unchanged.

Just a few comments to the questions above:

  • The upstream Network Observability is not guaranteed to work with all CNIs, meaning we don't formally test against other CNIs.
  • Each new version of Network Observability is backwards-compatible to all supported OCP versions and adapts accordingly.

@everettraven (Contributor):

Proposed API LGTM.

@jpinsonneau left a comment:

LGTM


openshift-ci bot commented Mar 26, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jpinsonneau
Once this PR has been reviewed and has the lgtm label, please assign moadz for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


openshift-ci bot commented Mar 27, 2026

@stleerh: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@dave-tucker (Member) left a comment:

have some questions that need answering around resource ownership, upgrades, lifecycle etc...


In CNO, it adds a new controller for observability and adds it to the manager. The controller is a single Go file where the Reconciler reads the state of the `installationPolicy` field. If set to `InstallAndEnable`, it does the following:

1. Check if Network Observability Operator (NOO) is installed. If yes, exit.
Member:

Can you elaborate please @stleerh?
I see wasNetworkObservabilityDeployed, but that's checking the Status of the network object, which doesn't seem the same as checking whether NOO was installed or not.
What's the expected behaviour when a customer who previously installed NOO via the catalog, upgrades to an OCP version which now attempts to install it via CNO?

In CNO, it adds a new controller for observability and adds it to the manager. The controller is a single Go file where the Reconciler reads the state of the `installationPolicy` field. If set to `InstallAndEnable`, it does the following:

1. Check if Network Observability Operator (NOO) is installed. If yes, exit.
2. Install NOO using OLM's OperatorGroup and Subscription.
Member:

You call out OLM v0 constructs, but the implementation uses OLM v1.

Member:

Also, what if someone installed NOO via OLM v0?

Contributor:

After discussing with the OLM team yesterday, we made the following changes to the implementation:

  • to check whether NOO is installed, it now checks whether the FlowCollector CRD is already present; this way it works with OLM v0, OLM v1, and Helm installs (upstream deployment)
  • the installation now uses the OLM v1 objects
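The two bullets above can be sketched concretely. Assuming OLM v1's `ClusterExtension` API, the object CNO would create might look roughly like the following; the package name, namespace, and service account are illustrative assumptions, not the final implementation:

```yaml
# Sketch of an OLM v1 install of NOO; names below are assumptions.
apiVersion: olm.operatorframework.io/v1
kind: ClusterExtension
metadata:
  name: netobserv-operator
spec:
  namespace: openshift-netobserv-operator   # assumed install namespace
  serviceAccount:
    name: netobserv-installer               # assumed SA with RBAC to install the bundle
  source:
    sourceType: Catalog
    catalog:
      packageName: netobserv-operator       # assumed catalog package name
```

The "is NOO installed" check then reduces to looking for the FlowCollector CRD (for example, a CRD named `flowcollectors.flows.netobserv.io`), which behaves the same whether NOO arrived via OLM v0, OLM v1, or Helm.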

Contributor Author:

The goal was to avoid implementation details, so it will check if NOO is installed, regardless of whether it was done using OLM v0 or v1. Step 2 could have avoided implementation details by just saying "Install NOO".


The actual enabling of Network Observability is done in the Cluster Network Operator (CNO). The rationale is that we want the network observability feature to be part of networking. This is as opposed to being part of the general observability or as a standalone entity. Yet, there is still a separation at the lower level so that the two can be independently developed and released at different times, particularly for bug fixes.

In CNO, it adds a new controller for observability and adds it to the manager. The controller is a single Go file where the Reconciler reads the state of the `installationPolicy` field. If set to `InstallAndEnable`, it does the following:
Member:

What if a users changes from InstallAndEnable to DoNotInstall?
Does CNO uninstall the operator? Or is the user expected to clean this up manually?

Contributor:

No, once NOO has been installed, the CNO has finished its work; switching back to DoNotInstall does nothing.

Contributor Author:

This is why it's called "DoNotInstall" rather than "Uninstall".

Member:

It would be beneficial to document that for future reference

In CNO, it adds a new controller for observability and adds it to the manager. The controller is a single Go file where the Reconciler reads the state of the `installationPolicy` field. If set to `InstallAndEnable`, it does the following:

1. Check if Network Observability Operator (NOO) is installed. If yes, exit.
2. Install NOO using OLM's OperatorGroup and Subscription.
Member:

What happens if the user mutates the resources created by CNO (i.e ClusterExtension)?

Contributor:

The user owns both the ClusterExtension and the FlowCollector.

If the user deletes the ClusterExtension, NOO is removed and the CNO does not try to reinstall it.

1. Check if Network Observability Operator (NOO) is installed. If yes, exit.
2. Install NOO using OLM's OperatorGroup and Subscription.
3. Wait for NOO to be ready.
4. Create the "netobserv" namespace if it doesn't exist.
Member:

%s/netobserv/openshift-network-observability/g?

Contributor Author:

This is a discussion we've had in the past. I'll resurface this topic, and while this proposal might influence that decision, it has to be in agreement with what Network Observability does.

3. Wait for NOO to be ready.
4. Create the "netobserv" namespace if it doesn't exist.
5. Check if a FlowCollector instance exists. If yes, exit.
6. Create a FlowCollector instance.
Member:

Who owns the FlowCollector. Is it CNO? or the user?
Can a user change this after install?
What happens if, in a later release, we decide to enable a different set of features by default?

Contributor:

The FlowCollector is owned by the user.

The user can change any field or even delete it to disable NOO.

Contributor Author:

Network Observability Operator (NOO) manages FlowCollector. It is very likely the user will want to change attributes of their FlowCollector instance.

If the default features change, it will only affect future clusters that don't have Network Observability already enabled. We don't anticipate problems with different features on different clusters, since this is primarily a single-cluster component. Worst case, we can make a change to update the feature set, but that is something we want to avoid.

Member:

To summarize the discussion we had a few weeks ago, it was suggested that CNO deploys an empty FlowCollector, and that NOO is then responsible for establishing the default featureset, potentially writing that back to the CRD if it needs to. That way, CNO doesn't need to be aware of NOO defaults in future.

5. Check if a FlowCollector instance exists. If yes, exit.
6. Create a FlowCollector instance.

The Reconciler leverages the existing framework and reuses the concept of client, scheme, and manager. It provides a clear ownership by having a separate controller for it. If the Network CR changes, the Reconciler will repeat the above steps. Note it doesn’t monitor NOO or any of NOO's components for changes, and it doesn’t do any upgrades. That is still the responsibility of NOO.
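As a minimal sketch of what steps 5–6 could create, assuming NOO's `flows.netobserv.io` API and letting NOO's own defaulting fill in everything else (the field values here are assumptions, not the final defaults):

```yaml
# Sketch only; per the review discussion, this could be even emptier,
# with NOO writing the default feature set back to the resource.
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster            # FlowCollector is a singleton; "cluster" is the conventional name
spec:
  namespace: netobserv     # the namespace created in step 4
  deploymentModel: Direct
  agent:
    type: eBPF
```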
Member:

and it doesn’t do any upgrades

Networking component upgrades are managed via CNO, which, AIUI, has pinned versions for every release.
I think it would make sense for NOO to pin a version for a given OpenShift release too, and have CNO edit the ClusterExtension to trigger the upgrade of NOO for consistency with how other components are updated.

In addition, my push is that as features get added to OVN-Kubernetes, they should be immediately supported by Network Observability in the same release.

☝️ is a good reason to do so. Let's say we rollout OVN-K8s 4.1.1 and to observe a new feature you need NOO 24.1.1 (and maybe a new feature being toggled on in the FlowCollector). Managing this through CNO seems easier than asking the user to upgrade NOO from the catalog and edit the flow collector themselves.

Contributor Author:

I don't see an issue with NOO upgrading as needed for OVN upgrades. We can make that mostly transparent to the user.

In fact, I would say it would make things more complicated if you have to go through CNO, because there might be other reasons why you want to upgrade Network Observability. And if you have two ways of upgrading, either through CNO or NOO, that just makes things more confusing. Having the tie-in while still being separate and independent gives it the most flexibility.

Member:

Paraphrasing from our previous call, NOO will always upgrade itself to the latest - I don't recall if that was the default, or user opt-in. CNO doesn't care, we just keep networking up-to-date.

In CNO, it adds a new controller for observability and adds it to the manager. The controller is a single Go file where the Reconciler reads the state of the `installationPolicy` field. If set to `InstallAndEnable`, it does the following:

1. Check if Network Observability Operator (NOO) is installed. If yes, exit.
2. Install NOO using OLM's OperatorGroup and Subscription.
Member:

Can you explain how CNO and NOO lifecycles are supposed to interact?
Maybe take a look over https://github.com/openshift/cluster-network-operator/blob/master/docs/operands.md#cluster-network-operator-operands and how it relates to the rollout of NOO.

Should CNO enter a degraded state if NOO fails to deploy for some reason?

Contributor:

The new controller first installs NOO, and then creates the FlowCollector object.

When the FlowCollector object is created, NOO installs the network observability stack. If something fails during this phase, NOO enters a degraded state.

But if something fails during the NOO installation itself, it does not enter a degraded state; the ClusterExtension does display some errors, though.
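As a sketch of where those errors would surface, an OLM v1 `ClusterExtension` reports installation problems through status conditions, along these lines (the condition types come from OLM v1, but the reason and message text below is illustrative):

```yaml
# Illustrative ClusterExtension status after a failed NOO install attempt.
status:
  conditions:
    - type: Progressing
      status: "True"
      reason: Retrying
      message: 'error resolving "netobserv-operator": no bundle found'  # illustrative text
    - type: Installed
      status: "False"
      reason: Failed
```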

OlivierCazade added a commit to OlivierCazade/api that referenced this pull request Apr 1, 2026
Add networkObservability nested structure to configure network observability
installation during cluster deployment (day-0).

- Add NetworkObservability field to NetworkSpec with NetworkObservabilitySpec type
- Add NetworkObservabilityInstallationPolicy enum ("", "InstallAndEnable", "DoNotInstall")
- Add NetworkObservabilityInstall feature gate enabled in DevPreview and TechPreview
- Add integration test suite for the new field

When installationPolicy is set to "" (empty string) or "InstallAndEnable",
network observability will be installed and enabled. When set to "DoNotInstall",
it will not be installed.

Related: openshift/enhancements#1908
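Given the field and enum names in this commit message, the day-0 surface on the Network CR would presumably look something like the following sketch (the resource group and exact nesting are assumptions based on the commit description, not a published API):

```yaml
apiVersion: operator.openshift.io/v1   # assumed; the commit adds the field to NetworkSpec
kind: Network
metadata:
  name: cluster
spec:
  networkObservability:
    installationPolicy: InstallAndEnable   # "" (default) also installs; "DoNotInstall" opts out
```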
@stleerh stleerh requested a review from everettraven April 7, 2026 15:33
@dave-tucker (Member) left a comment:

leaving some comments here based on the discussions we had before easter.
overall lgtm


The actual enabling of Network Observability is done in the Cluster Network Operator (CNO). The rationale is that we want the network observability feature to be part of networking. This is as opposed to being part of the general observability or as a standalone entity. Yet, there is still a separation at the lower level so that the two can be independently developed and released at different times, particularly for bug fixes.

In CNO, it adds a new controller for observability and adds it to the manager. The controller is a single Go file where the Reconciler reads the state of the `installationPolicy` field. If set to `InstallAndEnable`, it does the following:
Member:

It would be beneficial to document that for future reference

In CNO, it adds a new controller for observability and adds it to the manager. The controller is a single Go file where the Reconciler reads the state of the `installationPolicy` field. If set to `InstallAndEnable`, it does the following:

1. Check if Network Observability Operator (NOO) is installed. If yes, exit.
2. Install NOO using OLM's OperatorGroup and Subscription.
Member:

It would be helpful to document how this pattern has been used before, i.e. in CIO

3. Wait for NOO to be ready.
4. Create the "netobserv" namespace if it doesn't exist.
5. Check if a FlowCollector instance exists. If yes, exit.
6. Create a FlowCollector instance.
Member:

To summarize the discussion we had a few weeks ago, it was suggested that CNO deploys an empty FlowCollector, and that NOO is then responsible for establishing the default featureset, potentially writing that back to the CRD if it needs to. That way, CNO doesn't need to be aware of NOO defaults in future.

5. Check if a FlowCollector instance exists. If yes, exit.
6. Create a FlowCollector instance.

The Reconciler leverages the existing framework and reuses the concept of client, scheme, and manager. It provides a clear ownership by having a separate controller for it. If the Network CR changes, the Reconciler will repeat the above steps. Note it doesn’t monitor NOO or any of NOO's components for changes, and it doesn’t do any upgrades. That is still the responsibility of NOO.
Member:

Paraphrasing from our previous call, NOO will always upgrade itself to the latest - I don't recall if that was the default, or user opt-in. CNO doesn't care, we just keep networking up-to-date.

@dave-tucker (Member):

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 7, 2026

Labels

lgtm: Indicates that a PR is ready to be merged.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.

10 participants