Added new controller to monitor infra to validate fd usage#1508
Added new controller to monitor infra to validate fd usage#1508vr4manta wants to merge 1 commit into
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository: openshift/coderabbit/.coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (7)
🚧 Files skipped from review as they are similar to previous changes (7)
WalkthroughThis PR introduces a new Kubernetes controller that validates VSphere Infrastructure failure domain configurations. It verifies that Machines reference valid failure domains, reports validation status through ClusterOperator conditions, and prevents operator degradation when validation passes. The validator is integrated into the manager startup with a pre-built scheme and cache configuration, and hooked into the operator's reconciliation path. ChangesInfrastructure Failure Domain Validation
Sequence DiagramsequenceDiagram
participant Client as Kubernetes API
participant Manager as Controller Manager
participant Reconciler as ReconcileInfrastructureValidation
participant Validator as VSphereFailureDomainValidator
participant ClusterOp as ClusterOperator
Manager->>Reconciler: Watch Infrastructure + Machine events
Client->>Manager: Infrastructure or Machine change
Manager->>Reconciler: Reconcile(Infrastructure)
Reconciler->>Client: Fetch Infrastructure (uncached)
Client-->>Reconciler: Infrastructure object
alt VSphere Platform
Reconciler->>Validator: Validate(Infrastructure)
Validator->>Client: List all Machines
Client-->>Validator: Machine objects
Validator->>Validator: Check region/zone labels against domains
Validator-->>Reconciler: FailureDomainValidationResult
alt Validation Passed
Reconciler->>ClusterOp: Update InfrastructureFailureDomainsValid=True
Reconciler->>ClusterOp: Clear OperatorDegraded condition
else Validation Failed
Reconciler->>ClusterOp: Set InfrastructureFailureDomainsValid=False
Reconciler->>ClusterOp: Set OperatorDegraded=True with orphaned message
end
else Non-VSphere Platform
Reconciler->>ClusterOp: Clear InfrastructureFailureDomainsValid condition
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Important Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional. ❌ Failed checks (1 error, 1 warning)
✅ Passed checks (13 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.12.2)Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions Comment |
|
Skipping CI for Draft Pull Request. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cmd/vsphere/main.go (1)
167-170:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftKeep the cluster-wide Machine watch out of the shared manager cache.
DefaultNamespacessays this binary should reconcile machine-api objects only in--namespace, but Line 177 overridesmachinev1.Machineto cache all namespaces. Because the Machine actuator/controller on Line 223 shares this manager, that widens Machine reconciliation cluster-wide, not just the new infrastructure validation controller. Please keep validation’s cluster-wide Machine watch on a dedicated cache/client instead of changing the manager-wide Machine cache.Also applies to: 177-181
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/vsphere/main.go` around lines 167 - 170, The manager-wide cache is being widened by overriding machinev1.Machine in opts.Cache.DefaultNamespaces; instead revert that change and keep opts.Cache.DefaultNamespaces as-is, and do not register machinev1.Machine in the shared manager cache. Create a dedicated cluster-scoped cache/client for Machine objects (using controller-runtime's cache.New or similar) and pass that dedicated client/cache into the infrastructure validation controller setup (the controller registered where the Machine actuator/controller is created around the code referenced at Line 223), while leaving the main mgr and mgr.GetClient scoped to the configured DefaultNamespaces. Ensure the validation controller uses the dedicated Machine client for cluster-wide watches and the rest of controllers continue to use the shared manager client/cache.
🧹 Nitpick comments (1)
pkg/controller/infrastructurevalidation/vsphere_validator.go (1)
80-123: ⚡ Quick winConsider checking context cancellation in the machine validation loop.
The loop iterates over all machines without checking
ctx.Err(). For clusters with many machines, this could delay cancellation.Suggested improvement
// Check each machine's failure domain reference via region/zone labels for i := range machineList.Items { + if ctx.Err() != nil { + return nil, ctx.Err() + } machine := &machineList.Items[i]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/controller/infrastructurevalidation/vsphere_validator.go` around lines 80 - 123, Inside the machine validation loop over machineList.Items, check the provided context for cancellation (ctx.Err() or ctx.Done()) at the top of each iteration and stop processing if cancelled: return early from the enclosing validation function (propagating the context error) or break the loop so work halts promptly; locate the loop that references machineList.Items, machineRegionLabelKey/machineZoneLabelKey, configuredFDs and updates result and OrphanedReference and add the context check before any heavy work or appending to result.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@install/0000_30_machine-api-operator_09_rbac.yaml`:
- Around line 291-300: Split the single RBAC rule into two rules under
apiGroups: config.openshift.io: one rule that lists resources: clusteroperators
with verbs get, list, watch (no update) to grant read access, and a second rule
that lists resource: clusteroperators/status with verbs update (and optionally
get/list/watch if the controller reads status) so that the update verb only
applies to the status subresource; locate the existing block referencing
"clusteroperators" and "clusteroperators/status" and replace it with these two
narrower rules.
In `@pkg/operator/sync.go`:
- Around line 93-103: The ClusterOperator lookup error from
optr.getClusterOperator() is currently ignored which lets validation be skipped;
modify the block in pkg/operator/sync.go so that if optr.getClusterOperator()
returns an error you append that error to the errors slice (or return early)
instead of silently continuing; keep the existing infraValidCondition check
(v1helpers.FindStatusCondition(...) and its use of
"InfrastructureFailureDomainsValid") but ensure any err from
optr.getClusterOperator() is propagated into errors so the failure-domain
validation gate is not bypassed.
---
Outside diff comments:
In `@cmd/vsphere/main.go`:
- Around line 167-170: The manager-wide cache is being widened by overriding
machinev1.Machine in opts.Cache.DefaultNamespaces; instead revert that change
and keep opts.Cache.DefaultNamespaces as-is, and do not register
machinev1.Machine in the shared manager cache. Create a dedicated cluster-scoped
cache/client for Machine objects (using controller-runtime's cache.New or
similar) and pass that dedicated client/cache into the infrastructure validation
controller setup (the controller registered where the Machine
actuator/controller is created around the code referenced at Line 223), while
leaving the main mgr and mgr.GetClient scoped to the configured
DefaultNamespaces. Ensure the validation controller uses the dedicated Machine
client for cluster-wide watches and the rest of controllers continue to use the
shared manager client/cache.
---
Nitpick comments:
In `@pkg/controller/infrastructurevalidation/vsphere_validator.go`:
- Around line 80-123: Inside the machine validation loop over machineList.Items,
check the provided context for cancellation (ctx.Err() or ctx.Done()) at the top
of each iteration and stop processing if cancelled: return early from the
enclosing validation function (propagating the context error) or break the loop
so work halts promptly; locate the loop that references machineList.Items,
machineRegionLabelKey/machineZoneLabelKey, configuredFDs and updates result and
OrphanedReference and add the context check before any heavy work or appending
to result.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d154da29-1729-4bc0-8c59-f08eac10bc7a
📒 Files selected for processing (7)
cmd/vsphere/main.goinstall/0000_30_machine-api-operator_09_rbac.yamlpkg/controller/infrastructurevalidation/infrastructurevalidation_controller.gopkg/controller/infrastructurevalidation/infrastructurevalidation_controller_test.gopkg/controller/infrastructurevalidation/vsphere_validator.gopkg/controller/infrastructurevalidation/vsphere_validator_test.gopkg/operator/sync.go
50627e1 to
c867198
Compare
Changes
Summary by CodeRabbit