Enable Network Observability on Day 0 #1908
stleerh wants to merge 13 commits into openshift:master
Conversation
Being able to manage and observe the network in an OpenShift cluster is critical to maintaining the health and integrity of the network. Without it, there's no way to verify whether your changes are working as expected or whether your network is experiencing issues.
Currently, Network Observability is an optional operator that many customers are not aware of. A majority of customers using OpenShift Networking do not have Network Observability installed, so they are missing out on features that they have already paid for.
From telemetry, what's the percentage of clusters with NOO installed today?
We don't have telemetry for the percentage of clusters installed. It's one of the top downloaded operators but in terms of usage, it's not in the top 20.
> We don't have telemetry for the percentage of clusters installed

We should be able to get that information from telemetry already (OLM reports metrics about installed operators).

Does it have the percentage of clusters installed?

We have the total number of clusters reporting to telemetry and the total number of clusters with NOO installed.
### Risks and Mitigations

* Network Observability requires CPU, memory, and storage that the customer might not be aware of.
  Mitigation: The default setting stores only metrics at a high sampling interval to minimize resource usage. If this isn't sufficient, more fine-tuning and filtering can be done in the provided default configuration (e.g., filtering on specific interfaces only).
Single node OpenShift (SNO) clusters have strict requirements in terms of resource usage: is it ok that NOO gets installed by default in this case?

Clarified the stance on this.
### Risks and Mitigations

* Network Observability requires CPU, memory, and storage that the customer might not be aware of.
Can we get an estimate of the overhead of running NOO with a "stripped-down" flow collector on a typical CI cluster (3 control plane nodes + 3 workers)?

Estimates are in the works. The target goals are listed in the Test Plan section of this document.
Summary:

* Sampling at 400
Does it depend on the in-cluster Prometheus stack? If yes, how does it play with #1880?

Yes. PM has said, "The default OpenShift experience should continue to include Prometheus, alerts, and baseline dashboards."
@simonpasquier I'm currently trying to understand how that plays with optional monitoring, but I don't think that's something tied to this EP (I mean, the question is equally relevant when netobserv or any other Red Hat operator is installed from OperatorHub).

My guess is that, when CMO is in lightweight mode like that, netobserv will show no data and potentially query failures. Given that this is an explicit choice of the user, I don't think it's too problematic; it would make sense, in that case, for the user to manually uninstall netobserv.
To rephrase my question: is it worth deploying netobserv if there's no platform monitoring stack? I understand that users create sub-optimal setups, but it might be better to prevent it in the first place?
Then when users disable the in-cluster monitoring stack (as envisioned in #1880), CNO shouldn't install NOO?
I don't think CNO should skip installing NOO when the user explicitly asks to install it. (Or are you talking more specifically about what the default behaviour should be?)
NOO still has valid usages without the monitoring stack, e.g. if configured to export flows to a 3rd-party system.
So if the user asks to install it, we shouldn't go against their will.
Are there some install pre-flight checks where we can generate warnings in such situations?
Whatever option we pick (never install NOO at day-0 when monitoring is disabled, or delegate the decision to the user), I'd rather see it documented somewhere, including what it entails in terms of features/limitations (and this EP seems to be the right place).
That should be documented in Network Observability. This EP only decides whether Network Observability is enabled or not.
### Topology Considerations

All topologies are supported where CNO is supported, so this excludes MicroShift.
Can we describe briefly how it works for HyperShift?

This proposal doesn't change how Network Observability works in a Hosted Control Plane (HCP) environment. Network Observability is supported on hosted clusters and the management cluster.
Should the management cluster admin be able to override this (e.g. disable netobserv for all hosted clusters by default)?

There's no single parameter to do that. You can prevent Network Observability from being installed/enabled on a per-cluster basis.
I was more thinking of a global switch for the admin of the management cluster running all the hosted control planes.

That could be something that's added in the future.
4. Wait for NOO to be ready and the OpenShift web console to be available.
5. Create the "netobserv" namespace if it doesn't exist.
6. Check if a FlowCollector instance exists. If yes, exit.
7. Create a FlowCollector instance.
Concretely, what gets deployed in terms of pods and services? Regarding pods, what level of customization is offered in terms of resource requests/limits and scheduling (infrastructure nodes)?

This is subject to change, but the planned deployment is in the FlowCollector Custom Resource (CR) section of this document. There is no customization for the user. If this is needed, you would just edit the FlowCollector custom resource directly.
@simonpasquier how do other in-payload operators typically do that? Through day-0 config?
I think our intent is that the default sampling settings give us a comfortable margin to not run into oomkill from the start. Later on, admins can change the sampling & adapt the limits accordingly.
For compute resources, core operators define default resource requests but no limits (https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#resources-and-limits). I can mostly speak for the cluster monitoring operator but in our case, admins would customize the CMO configuration after installation to schedule components on infra nodes or adjust resource requests for instance.
The best practice is to NOT set a CPU limit but to set a memory limit to avoid the OOM killer. Typically, you set the memory limit equal to the memory request.
Not for OCP core components, as documented in the linked pages.
Going back to your comment, you can also adjust resources after installation.
The current proposal just takes the default values from Network Observability. If those values are not right, the preference is to change them in Network Observability rather than to override the values here.
Rather than actually installing NOO and creating the FlowCollector instance, it is less risky and simpler to just display a panel or a button that lets the user install and enable Network Observability. This resolves the awareness issue. However, it would result in far fewer installs compared to enabling it by default, and it goes against the principle that networking and network observability should go hand in hand and be there from the start.
## Alternatives (Not Implemented)
Was installation via the assisted-installer considered? https://github.com/openshift/assisted-service/blob/master/docs/dev/olm-operator-plugins.md
This seems like a viable option to mitigate the drawbacks around topologies and resource constraints.
That's parallel work:
https://issues.redhat.com/browse/NETOBSERV-2486
openshift/assisted-service#8729
However, the assisted installer doesn't cover all installation cases, which is why we are considering an alternative.
deploymentModel: Service
loki:
  enable: false
namespace: netobserv
I'm not very familiar with network observability, but does netobserv here mean the namespace in which the FlowCollector resources will be installed? If yes, why not an openshift- prefixed namespace?

That's a good point; the CRD default is "netobserv", but it could make more sense to use "openshift-netobserv" now that it's baked into the installer.
This was a discussion from many years ago (NETOBSERV-373). The original description was changed, so if you look at the History link, it said:

> The current project name for Network Observability is "network-observability". The downstream version should be in a project that begins with "openshift-", which is reserved for OpenShift namespaces. The suggested new name is "openshift-network-observability".

Nevertheless, it's a good point that should be reconsidered. However, this proposal would just follow whatever Network Observability decides to do.
I was rather thinking that, on the contrary, what we do here can deviate from upstream.
If it is going to be an openshift-owned thing, deviating from upstream and using an openshift-* style namespace would be my recommendation.
- Hosted Control Plane (HCP) environment
- Add e2e tests in [OpenShift Release Tooling](https://github.com/openshift/release)

Performance testing will be done to optimize the use of resources and to determine the specific FlowCollector settings, with the goal of using less than 5% of resources (CPU and memory) and an ideal target of less than 3%.
Can we clarify that the measured overhead will be based on the resource utilization of all cluster components? E.g. ingesting more metrics and evaluating more rules will also increase the resource usage of Prometheus.

Good point. It should include external components that are affected.
3. Wait for NOO to be ready and the OpenShift web console to be available.
4. Create the "netobserv" namespace if it doesn't exist.
5. Check if a FlowCollector instance exists. If yes, exit.
6. Create a FlowCollector instance.
How would CNO signal to the user when one of the steps fails?

It will be in the log file.
What about reporting this via status conditions?

Yes, that can be done.
A sentence was added about this in the last section.
### Workflow Description

Network Observability is enabled by default on day 0 (planning stage). You don't have to configure anything when using `openshift-install`; the Network Observability Operator will be installed and a FlowCollector custom resource (CR) will be created (Listing 3).
Is NOO supported for all (including future/development) OCP releases and architectures?

It is today. The goal is to support the new OCP releases and architectures.
@openshift-bot: Closed this PR.

/reopen

@OlivierCazade: Reopened this PR.
Add installNetworkObservability field to enable network observability installation during cluster deployment (day-0).
- Add InstallNetworkObservability field to NetworkSpec
- Add NetworkObservabilityInstall feature gate enabled in DevPreview and TechPreview
- Add integration test suite for the new field
Related: openshift/enhancements#1908
Add networkObservability nested structure to configure network observability
installation during cluster deployment (day-0).
- Add NetworkObservability field to NetworkSpec with NetworkObservabilitySpec type
- Add NetworkObservabilityInstallationPolicy enum ("", "InstallAndEnable", "DoNotInstall")
- Add NetworkObservabilityInstall feature gate enabled in DevPreview and TechPreview
- Add integration test suite for the new field
When installationPolicy is set to "" (empty string) or "InstallAndEnable",
network observability will be installed and enabled. When set to "DoNotInstall",
it will not be installed.
Related: openshift/enhancements#1908
api-approvers:
- "@jotak"
- "@dave-tucker"
- {api-team}
In CNO, it adds a new controller for observability and adds it to the manager. The controller is a single Go file where the Reconciler reads the state of the `installationPolicy` field. If set to `InstallAndEnable`, it does the following:

1. Check if Network Observability Operator (NOO) is installed. If yes, exit.
How are you intending to evaluate this?
Can you elaborate please @stleerh?
I see wasNetworkObservabilityDeployed, but that's checking the Status of the network object, which doesn't seem the same as checking whether NOO was installed or not.
What's the expected behaviour when a customer who previously installed NOO via the catalog upgrades to an OCP version which now attempts to install it via CNO?

There are two different things:
- The CNO controller checks for an existing FlowCollector CRD to detect an already-installed NOO and avoid overwriting it.
- Once the CNO controller has installed NOO, we want the user to be free to own the deployment and modify anything. This is where the status is used: once the controller has successfully installed NOO, it sets the status, and any later reconciliation checks the status and exits early.
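The two guards described above can be sketched as a small decision function. This is a hypothetical illustration, not the actual CNO code; the function and parameter names are assumptions.

```go
package main

import "fmt"

// shouldInstallNOO sketches the detection logic discussed above: the CNO
// controller skips installation if the FlowCollector CRD already exists
// (NOO was installed via OLM v0, OLM v1, or a Helm/upstream install), or
// if CNO's own status already records a successful install, leaving the
// user to own the deployment afterwards.
func shouldInstallNOO(flowCollectorCRDExists, statusAlreadyDeployed bool) bool {
	if statusAlreadyDeployed {
		// CNO installed NOO once already; never reconcile it again.
		return false
	}
	if flowCollectorCRDExists {
		// NOO was installed by some other means; don't overwrite it.
		return false
	}
	return true
}

func main() {
	fmt.Println(shouldInstallNOO(false, false)) // fresh cluster: true
	fmt.Println(shouldInstallNOO(true, false))  // pre-existing NOO: false
	fmt.Println(shouldInstallNOO(false, true))  // CNO already finished: false
}
```

Checking the CRD rather than any OLM object is what makes the detection independent of the install mechanism.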
In CNO, it adds a new controller for observability and adds it to the manager. The controller is a single Go file where the Reconciler reads the state of the `installationPolicy` field. If set to `InstallAndEnable`, it does the following:

1. Check if Network Observability Operator (NOO) is installed. If yes, exit.
2. Install NOO using OLM's OperatorGroup and Subscription.
Have you considered making this fully managed by the cluster-network-operator instead of utilizing OLM?
If this is going to be considered a core component of OpenShift moving forward, it seems reasonable to move it away from being a layered product installed by OLM, so there is a tighter coupling between the OCP version and the NOO version.
If you take the OLM approach, will you be pinning NOO to a particular version to prevent automatic upgrades? Are we going to support customers modifying the OLM resources created by the CNO? Would we support running a newer version of NOO on an older version of OCP (since OLM supports over-the-air style updates and upgrades)?
There are some nuances when leveraging OLM as the deployment mechanism, and I'd like to better understand the interactions here.
We want Network Observability to be there when you have the Cluster Network Operator (CNO), but it can run independently of that in the upstream version.
We have one version of NOO that is backwards-compatible with all versions of OCP. We've been releasing a new X.Y version alongside each OCP version.
It would be helpful to document how this pattern has been used before, i.e. in CIO.
#### Hypershift / Hosted Control Planes

This proposal doesn't change how Network Observability works in a Hosted Control Plane (HCP) environment. Network Observability is supported on hosted clusters and the management cluster; therefore, it will be enabled by default.
How does the cluster-network-operator equivalent in HyperShift work? For example, I know that the HyperShift Control Plane Operator has different mechanics than the cluster-authentication-operator for authentication/authorization related things.
Will there need to be any changes to HyperShift's controller behaviors to support this new functionality?

Network Observability is already supported on HCP. Enabling NOO by default does nothing to change that.

This doesn't answer my question. Are there any changes needed on the HyperShift side to support enabling this by default?
Rather than have CNO enable Network Observability, take the existing Network Observability Operator (NOO) and install it by default in the cluster. There needs to be some logic to accept the values in openshift-install to decide whether NOO should be enabled or not.

The core components of OpenShift are operators like the Cluster Network Operator (CNO) and the Cluster Storage Operator (CSO). NOO is a much smaller component and should not reside at the top level.
I'm not sure the size of a component really dictates how it should be deployed.

Regardless, I don't believe it belongs at this level.
> I'm not sure the size of a component really dictates how it should be deployed.

I agree with this. Note that we have something called CloudNetworkConfigController, which is a core operator on cloud platforms and is very small in scope.
My preference is to ship NOO as core.
Is my understanding correct that

> There needs to be some logic to accept the values in openshift-install to decide whether NOO should be enabled or not

is the only reason why this alternative was not considered? Or am I missing something? Avoiding installer code changes doesn't seem like a valid reason.
Is NetObserv completely tied to OVN-Kubernetes? And is it never going to expand beyond being an observability solution for the components owned by CNO?
Here are the two main points.

1. If you have CNO (basically, you are running OpenShift), then you should have Network Observability. That's the basis for this feature and where it's linked to OVN-Kubernetes. In addition, my push is that as features get added to OVN-Kubernetes, they should be immediately supported by Network Observability in the same release.
2. Network Observability can also run with other container network interfaces (CNIs). So no, it's not exclusively tied to OVN-Kubernetes, but it does support unique features of OVN-Kubernetes.
> Here are the two main points.
> - If you have CNO (basically you are running OpenShift), then you should have Network Observability.

Again, why not make the same claim around "if you are running OpenShift, then you are running NOO"? Maybe I am still missing that point.

> That's the basis for this feature and where it's linked to OVN-Kubernetes. In addition, my push is that as features get added to OVN-Kubernetes, they should be immediately supported by Network Observability in the same release.

Sure, but that's still possible to do via NOO, right? Can't NOO adopt the OCP release cycle if it's part of OCP? Again, apologies if I am not familiar with the NOO lifecycle.

> - Network Observability can also run with other container network interfaces (CNI). So no, it's not exclusively tied to OVN-Kubernetes, but it does support unique features of OVN-Kubernetes.

So does Network Observability work for all the 3rd-party CNIs allowed on OCP? What happens when those CNIs are used instead of OVN-Kubernetes, for example Cilium? Will NetObserv still be running by default and doing observability for Cilium? What about Layer 7: do we plan to expand there, or is netobserv strictly restricted to L2/L3/L4?
I'm having a hard time wrapping my head around why we want this to essentially be a core feature of OpenShift without coupling it to the OCP version by including it as part of the payload.
Maybe it helps to have a sync call with appropriate stakeholders to hash this out, but my biggest concern with the currently proposed approach is that you mention:

> my push is that as features get added to OVN-Kubernetes, they should be immediately supported by Network Observability in the same release.

I could be misunderstanding here, but if the intention is that we should be tying NOO versions to specific OCP versions due to feature dependencies, continuing to use OLM as the deployment mechanism makes things more confusing for end users. They have to understand that they are responsible for keeping NOO up to date and for not accidentally installing a version that is incompatible with the version of OpenShift they are running.
IIRC, there are some safeguards in OLM to help mitigate this, but it still feels like we are putting the onus on the user to be attentive to platform requirements, rather than us being responsible for always installing a compatible version of NOO automatically and keeping it compatible without any user intervention.
Surya Seetharaman, Dave Tucker, Ben Bennett and I had a discussion on April 1 to hash this out. The conclusion is that we agreed to have CNO enable NOO, so the current proposal remains unchanged.
Just a few comments to the questions above:
- The upstream Network Observability is not guaranteed to work with all CNIs, meaning we don't formally test against other CNIs.
- Each new version of Network Observability is backwards-compatible to all supported OCP versions and adapts accordingly.
Proposed API LGTM.

@stleerh: all tests passed!
dave-tucker left a comment:
Have some questions that need answering around resource ownership, upgrades, lifecycle, etc.
In CNO, it adds a new controller for observability and adds it to the manager. The controller is a single Go file where the Reconciler reads the state of the `installationPolicy` field. If set to `InstallAndEnable`, it does the following:

1. Check if Network Observability Operator (NOO) is installed. If yes, exit.
2. Install NOO using OLM's OperatorGroup and Subscription.
You call out OLM v0 constructs, but the implementation uses OLM v1.

Also, what if someone installed NOO via OLM v0?

After discussing with the OLM team yesterday, we made the following changes to the implementation:
- To check if NOO is installed, it checks whether the FlowCollector CRD is already present; this way it works with OLM v0, OLM v1, and Helm installs (upstream deployment).
- Changed the installation to use the OLM v1 objects.

The goal was to avoid implementation details, so it will check if NOO is installed regardless of whether that was done using OLM v0 or v1. Step 2 could have avoided implementation details by just saying "Install NOO".
The actual enabling of Network Observability is done in the Cluster Network Operator (CNO). The rationale is that we want the network observability feature to be part of networking, as opposed to being part of general observability or a standalone entity. Yet, there is still a separation at the lower level so that the two can be independently developed and released at different times, particularly for bug fixes.

In CNO, it adds a new controller for observability and adds it to the manager. The controller is a single Go file where the Reconciler reads the state of the `installationPolicy` field. If set to `InstallAndEnable`, it does the following:
What if a user changes from InstallAndEnable to DoNotInstall?
Does CNO uninstall the operator, or is the user expected to clean this up manually?

No, once NOO has been installed, the CNO has finished its work; switching back to DoNotInstall does nothing.

This is why it's called "DoNotInstall" rather than "Uninstall".

It would be beneficial to document that for future reference.
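The `installationPolicy` semantics discussed above can be summarized in a small sketch. This is an illustrative assumption of the intended behaviour, not the actual CNO implementation: the empty string and `InstallAndEnable` both install, `DoNotInstall` skips, and nothing is ever uninstalled.

```go
package main

import "fmt"

// policyAllowsInstall maps the proposed installationPolicy values to
// behaviour: "" (unset) and "InstallAndEnable" trigger installation,
// while "DoNotInstall" skips it. Note "DoNotInstall" is not "Uninstall":
// once NOO is installed, flipping the field back removes nothing,
// modelled here by the alreadyInstalled guard.
func policyAllowsInstall(policy string, alreadyInstalled bool) bool {
	if alreadyInstalled {
		return false // CNO's work is done; it never uninstalls
	}
	switch policy {
	case "", "InstallAndEnable":
		return true
	default: // includes "DoNotInstall" and any unknown values
		return false
	}
}

func main() {
	fmt.Println(policyAllowsInstall("", false))                // true: default installs
	fmt.Println(policyAllowsInstall("DoNotInstall", false))    // false: opt out
	fmt.Println(policyAllowsInstall("DoNotInstall", true))     // false: nothing is undone
	fmt.Println(policyAllowsInstall("InstallAndEnable", true)) // false: already installed
}
```

Treating unknown values conservatively (no install) is an assumption; the enum validation in the real API would likely reject them earlier.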
In CNO, it adds a new controller for observability and adds it to the manager. The controller is a single Go file where the Reconciler reads the state of the `installationPolicy` field. If set to `InstallAndEnable`, it does the following:

1. Check if Network Observability Operator (NOO) is installed. If yes, exit.
2. Install NOO using OLM's OperatorGroup and Subscription.
What happens if the user mutates the resources created by CNO (i.e. the ClusterExtension)?

The user owns both the ClusterExtension and the FlowCollector.
If the user deletes the ClusterExtension, NOO is removed and the CNO does not try to reinstall it.
1. Check if Network Observability Operator (NOO) is installed. If yes, exit.
2. Install NOO using OLM's OperatorGroup and Subscription.
3. Wait for NOO to be ready.
4. Create the "netobserv" namespace if it doesn't exist.
%s/netobserv/openshift-network-observability/g?

This is a discussion we've had in the past. I'll resurface this topic; while this proposal might influence that decision, it has to be in agreement with what Network Observability does.
3. Wait for NOO to be ready.
4. Create the "netobserv" namespace if it doesn't exist.
5. Check if a FlowCollector instance exists. If yes, exit.
6. Create a FlowCollector instance.
Who owns the FlowCollector: CNO or the user? Can a user change this after install? What happens if, in a later release, we decide to enable a different set of features by default?

The FlowCollector is owned by the user. The user can change any field or even delete it to disable NOO.

Network Observability Operator (NOO) manages FlowCollector. It is very likely the user will want to change attributes of their FlowCollector instance. If the default features change, it will only affect future clusters that don't have Network Observability already enabled. We don't anticipate problems with different features on different clusters, since this is primarily a single-cluster component. Worst case, we can make a change to update the feature set, but that is something we want to avoid.

To summarize the discussion we had a few weeks ago, it was suggested that CNO deploys an empty FlowCollector, and that NOO is then responsible for establishing the default feature set, potentially writing that back to the CRD if it needs to. That way, CNO doesn't need to be aware of NOO defaults in the future.
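An "empty" FlowCollector of the kind suggested here would be little more than the following. This assumes the current `flows.netobserv.io/v1beta2` API and the cluster-scoped singleton named `cluster`; NOO's defaulting would then fill in the spec.

```yaml
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec: {}
```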
> The Reconciler leverages the existing framework and reuses the concept of client, scheme, and manager. It provides clear ownership by having a separate controller for it. If the Network CR changes, the Reconciler will repeat the above steps. Note it doesn't monitor NOO or any of NOO's components for changes, and it doesn't do any upgrades. That is still the responsibility of NOO.
> and it doesn't do any upgrades

Networking component upgrades are managed via CNO, which, AIUI, has pinned versions for every release. I think it would make sense for NOO to pin a version for a given OpenShift release too, and to have CNO edit the ClusterExtension to trigger the upgrade of NOO, for consistency with how other components are updated.

> In addition, my push is that as features get added to OVN-Kubernetes, they should be immediately supported by Network Observability in the same release.

☝️ is a good reason to do so. Let's say we roll out OVN-K8s 4.1.1, and to observe a new feature you need NOO 24.1.1 (and maybe a new feature toggled on in the FlowCollector). Managing this through CNO seems easier than asking the user to upgrade NOO from the catalog and edit the FlowCollector themselves.

I don't see an issue with NOO upgrading as needed for OVN upgrades. We can make that mostly transparent to the user. In fact, I would say it would make things more complicated if you had to go through COO, because there might be other reasons why you want to upgrade Network Observability. And if you have two ways of upgrading, either through COO or NOO, that just makes things more confusing. Having the tie-in while still being separate and independent gives it the most flexibility.

Paraphrasing from our previous call, NOO will always upgrade itself to the latest (I don't recall if that was the default, or user opt-in). CNO doesn't care; we just keep networking up-to-date.
> 1. Check if Network Observability Operator (NOO) is installed. If yes, exit.
> 2. Install NOO using OLM's OperatorGroup and Subscription.
Can you explain how CNO and NOO lifecycles are supposed to interact? Maybe take a look at https://github.com/openshift/cluster-network-operator/blob/master/docs/operands.md#cluster-network-operator-operands and how it relates to the rollout of NOO. Should CNO enter a degraded state if NOO has failed to deploy for some reason?

The new controller first installs NOO, and then creates the FlowCollector object. When the FlowCollector object is created, NOO will install the network observability stack. If something fails during this phase, NOO will enter a degraded state. But if something fails during NOO installation itself, it does not enter a degraded state; the ClusterExtension does display some errors.
Add networkObservability nested structure to configure network observability
installation during cluster deployment (day-0).
- Add NetworkObservability field to NetworkSpec with NetworkObservabilitySpec type
- Add NetworkObservabilityInstallationPolicy enum ("", "InstallAndEnable", "DoNotInstall")
- Add NetworkObservabilityInstall feature gate enabled in DevPreview and TechPreview
- Add integration test suite for the new field
When installationPolicy is set to "" (empty string) or "InstallAndEnable",
network observability will be installed and enabled. When set to "DoNotInstall",
it will not be installed.
Related: openshift/enhancements#1908
> The actual enabling of Network Observability is done in the Cluster Network Operator (CNO). The rationale is that we want the network observability feature to be part of networking, as opposed to being part of general observability or being a standalone entity. Yet there is still separation at the lower level so that the two can be independently developed and released at different times, particularly for bug fixes.
> 1. Check if Network Observability Operator (NOO) is installed. If yes, exit.
> 2. Install NOO using OLM's OperatorGroup and Subscription.

It would be helpful to document how this pattern has been used before, i.e. in CIO.
/lgtm |