Skip to content

Added new controller to monitor infra to validate fd usage#1508

Draft
vr4manta wants to merge 1 commit into
openshift:mainfrom
vr4manta:infra-validation
Draft

Added new controller to monitor infra to validate fd usage#1508
vr4manta wants to merge 1 commit into
openshift:mainfrom
vr4manta:infra-validation

Conversation

@vr4manta

@vr4manta vr4manta commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Changes

  • Created new controller to watch for infrastructure changes and degrade operator if any vSphere machines are referencing Failure Domains that no longer exist.

Summary by CodeRabbit

  • New Features
    • Added VSphere infrastructure failure domain validation that checks whether cluster Machines reference existing failure domains.
    • Validation outcomes are published on the ClusterOperator via a dedicated condition, including human-readable messages for orphaned references.
    • The operator now degrades automatically when validation reports invalid failure domain references.
  • Tests
    • Added unit test coverage for controller reconciliation and VSphere validation behavior across multiple scenarios.
  • Chores
    • Updated RBAC to permit reading ClusterOperator objects and updating ClusterOperator status.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 9, 2026
@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 85a8f989-ad3e-4c9f-9552-e680398b7057

📥 Commits

Reviewing files that changed from the base of the PR and between 50627e1 and c867198.

📒 Files selected for processing (7)
  • cmd/vsphere/main.go
  • install/0000_30_machine-api-operator_09_rbac.yaml
  • pkg/controller/infrastructurevalidation/infrastructurevalidation_controller.go
  • pkg/controller/infrastructurevalidation/infrastructurevalidation_controller_test.go
  • pkg/controller/infrastructurevalidation/vsphere_validator.go
  • pkg/controller/infrastructurevalidation/vsphere_validator_test.go
  • pkg/operator/sync.go
🚧 Files skipped from review as they are similar to previous changes (7)
  • install/0000_30_machine-api-operator_09_rbac.yaml
  • pkg/operator/sync.go
  • cmd/vsphere/main.go
  • pkg/controller/infrastructurevalidation/vsphere_validator.go
  • pkg/controller/infrastructurevalidation/infrastructurevalidation_controller.go
  • pkg/controller/infrastructurevalidation/vsphere_validator_test.go
  • pkg/controller/infrastructurevalidation/infrastructurevalidation_controller_test.go

Walkthrough

This PR introduces a new Kubernetes controller that validates VSphere Infrastructure failure domain configurations. It verifies that Machines reference valid failure domains, reports validation status through ClusterOperator conditions, and prevents operator degradation when validation passes. The validator is integrated into the manager startup with a pre-built scheme and cache configuration, and hooked into the operator's reconciliation path.

Changes

Infrastructure Failure Domain Validation

Layer / File(s) Summary
VSphere Failure Domain Validator
pkg/controller/infrastructurevalidation/vsphere_validator.go, pkg/controller/infrastructurevalidation/vsphere_validator_test.go
Validator constructs a region+zone-keyed map of configured failure domains from Infrastructure spec, lists all Machines via client, reads region/zone labels from each Machine, and collects orphaned references when the labeled combination does not match any configured domain. Returns FailureDomainValidationResult with validity flag, orphaned reference list, and domain names. Tests cover valid references, orphaned references, missing domains, unlabeled machines, partial labels, no machines, and multiple machines per domain.
Infrastructure Validation Reconciler and Condition Management
pkg/controller/infrastructurevalidation/infrastructurevalidation_controller.go, pkg/controller/infrastructurevalidation/infrastructurevalidation_controller_test.go
ReconcileInfrastructureValidation watches Infrastructure and Machine resources. On reconcile, fetches Infrastructure via uncached APIReader, skips non-VSphere platforms (clearing prior conditions), runs validation for VSphere, and updates ClusterOperator status with a custom InfrastructureFailureDomainsValid condition. Helper methods update conditions (setting validation result and toggling OperatorDegraded based on validity), clear conditions (removing only controller-specific state), and format orphaned-reference messages grouped by failure domain. Tests validate successful reconciliation, orphaned reference detection, degraded condition clearing, non-VSphere skip behavior, missing resources, and helper logic.
Manager Setup with Scheme and Cache Configuration
cmd/vsphere/main.go
Pre-creates a standalone runtime.Scheme with required API types (core, configv1, machinev1, ipamv1beta1), passes it to manager options, and extends cache watching via opts.Cache.ByObject to ensure Infrastructure and Machine resources are watched cluster-wide regardless of namespace scoping. Registers the infrastructure validation controller during setup.
RBAC Permissions and Sync Integration
install/0000_30_machine-api-operator_09_rbac.yaml, pkg/operator/sync.go
Grants ClusterRole permissions for get, list, watch, and update on config.openshift.io clusteroperators and clusteroperators/status resources. Integrates validation into operator reconciliation: syncAll now checks the InfrastructureFailureDomainsValid ClusterOperator condition for VSphere platforms and appends validation failures to the error list for downstream degraded handling.

Sequence Diagram

sequenceDiagram
  participant Client as Kubernetes API
  participant Manager as Controller Manager
  participant Reconciler as ReconcileInfrastructureValidation
  participant Validator as VSphereFailureDomainValidator
  participant ClusterOp as ClusterOperator
  
  Manager->>Reconciler: Watch Infrastructure + Machine events
  Client->>Manager: Infrastructure or Machine change
  Manager->>Reconciler: Reconcile(Infrastructure)
  Reconciler->>Client: Fetch Infrastructure (uncached)
  Client-->>Reconciler: Infrastructure object
  
  alt VSphere Platform
    Reconciler->>Validator: Validate(Infrastructure)
    Validator->>Client: List all Machines
    Client-->>Validator: Machine objects
    Validator->>Validator: Check region/zone labels against domains
    Validator-->>Reconciler: FailureDomainValidationResult
    
    alt Validation Passed
      Reconciler->>ClusterOp: Update InfrastructureFailureDomainsValid=True
      Reconciler->>ClusterOp: Clear OperatorDegraded condition
    else Validation Failed
      Reconciler->>ClusterOp: Set InfrastructureFailureDomainsValid=False
      Reconciler->>ClusterOp: Set OperatorDegraded=True with orphaned message
    end
  else Non-VSphere Platform
    Reconciler->>ClusterOp: Clear InfrastructureFailureDomainsValid condition
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)

Check name Status Explanation Resolution
No-Sensitive-Data-In-Logs ❌ Error Code logs customer machine names/namespaces and infrastructure details at WARNING/ERROR levels: vsphere_validator.go line 111 (klog.Warningf "Machine %s/%s"), infrastructurevalidation_controller.go... Redact machine names from logs; use only machine count or unique identifiers instead of customer-provided names in warning/error level logging.
Docstring Coverage ⚠️ Warning Docstring coverage is 31.03% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (13 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title "Added new controller to monitor infra to validate fd usage" accurately summarizes the main changes: a new infrastructure validation controller that validates vSphere Failure Domain usage.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed PR uses standard Go testing with 17 static, deterministic test names containing no dynamic values, UUIDs, timestamps, or generated identifiers. No Ginkgo tests found.
Test Structure And Quality ✅ Passed Tests added are standard Go unit tests (not Ginkgo), making the Ginkgo-specific check inapplicable. The Go unit tests follow best practices: single responsibility, proper setup with helper function...
Microshift Test Compatibility ✅ Passed No Ginkgo e2e tests were added in this PR. The changes only include standard Go unit tests in infrastructurevalidation_controller_test.go and vsphere_validator_test.go, which do not use Ginkgo patt...
Single Node Openshift (Sno) Test Compatibility ✅ Passed No Ginkgo e2e tests added in this PR. Only unit tests using standard Go testing package were added.
Topology-Aware Scheduling Compatibility ✅ Passed PR introduces a new infrastructure validation controller and logic, but adds no new deployment manifests, pod specifications, or scheduling constraints. All changes are controller code, RBAC rules,...
Ote Binary Stdout Contract ✅ Passed No OTE Binary Stdout Contract violations found. The only stdout write (fmt.Println for --version flag) is conditional, immediate-exit, and never executed during normal OTE operation. All logging pr...
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No Ginkgo e2e tests were added in this PR. The PR adds only standard Go unit tests (pkg/controller/infrastructurevalidation/*_test.go) using testing.T, which are not subject to this IPv6/disconnect...
No-Weak-Crypto ✅ Passed No weak cryptography patterns found. PR adds infrastructure validation controller with no MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB usage; no custom crypto implementations; no non-constant-time secr...
Container-Privileges ✅ Passed No privileged container settings detected. PR modifies only RBAC rules and Go code; no new container manifests with privileged, hostPID/Network/IPC, SYS_ADMIN, allowPrivilegeEscalation, or root acc...
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.2)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci

openshift-ci Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci

openshift-ci Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign damdo for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cmd/vsphere/main.go (1)

167-170: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Keep the cluster-wide Machine watch out of the shared manager cache.

DefaultNamespaces says this binary should reconcile machine-api objects only in --namespace, but Line 177 overrides machinev1.Machine to cache all namespaces. Because the Machine actuator/controller on Line 223 shares this manager, that widens Machine reconciliation cluster-wide, not just the new infrastructure validation controller. Please keep validation’s cluster-wide Machine watch on a dedicated cache/client instead of changing the manager-wide Machine cache.

Also applies to: 177-181

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/vsphere/main.go` around lines 167 - 170, The manager-wide cache is being
widened by overriding machinev1.Machine in opts.Cache.DefaultNamespaces; instead
revert that change and keep opts.Cache.DefaultNamespaces as-is, and do not
register machinev1.Machine in the shared manager cache. Create a dedicated
cluster-scoped cache/client for Machine objects (using controller-runtime's
cache.New or similar) and pass that dedicated client/cache into the
infrastructure validation controller setup (the controller registered where the
Machine actuator/controller is created around the code referenced at Line 223),
while leaving the main mgr and mgr.GetClient scoped to the configured
DefaultNamespaces. Ensure the validation controller uses the dedicated Machine
client for cluster-wide watches and the rest of controllers continue to use the
shared manager client/cache.
🧹 Nitpick comments (1)
pkg/controller/infrastructurevalidation/vsphere_validator.go (1)

80-123: ⚡ Quick win

Consider checking context cancellation in the machine validation loop.

The loop iterates over all machines without checking ctx.Err(). For clusters with many machines, this could delay cancellation.

Suggested improvement
 // Check each machine's failure domain reference via region/zone labels
 for i := range machineList.Items {
+	if ctx.Err() != nil {
+		return nil, ctx.Err()
+	}
 	machine := &machineList.Items[i]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controller/infrastructurevalidation/vsphere_validator.go` around lines 80
- 123, Inside the machine validation loop over machineList.Items, check the
provided context for cancellation (ctx.Err() or ctx.Done()) at the top of each
iteration and stop processing if cancelled: return early from the enclosing
validation function (propagating the context error) or break the loop so work
halts promptly; locate the loop that references machineList.Items,
machineRegionLabelKey/machineZoneLabelKey, configuredFDs and updates result and
OrphanedReference and add the context check before any heavy work or appending
to result.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@install/0000_30_machine-api-operator_09_rbac.yaml`:
- Around line 291-300: Split the single RBAC rule into two rules under
apiGroups: config.openshift.io: one rule that lists resources: clusteroperators
with verbs get, list, watch (no update) to grant read access, and a second rule
that lists resource: clusteroperators/status with verbs update (and optionally
get/list/watch if the controller reads status) so that the update verb only
applies to the status subresource; locate the existing block referencing
"clusteroperators" and "clusteroperators/status" and replace it with these two
narrower rules.

In `@pkg/operator/sync.go`:
- Around line 93-103: The ClusterOperator lookup error from
optr.getClusterOperator() is currently ignored which lets validation be skipped;
modify the block in pkg/operator/sync.go so that if optr.getClusterOperator()
returns an error you append that error to the errors slice (or return early)
instead of silently continuing; keep the existing infraValidCondition check
(v1helpers.FindStatusCondition(...) and its use of
"InfrastructureFailureDomainsValid") but ensure any err from
optr.getClusterOperator() is propagated into errors so the failure-domain
validation gate is not bypassed.

---

Outside diff comments:
In `@cmd/vsphere/main.go`:
- Around line 167-170: The manager-wide cache is being widened by overriding
machinev1.Machine in opts.Cache.DefaultNamespaces; instead revert that change
and keep opts.Cache.DefaultNamespaces as-is, and do not register
machinev1.Machine in the shared manager cache. Create a dedicated cluster-scoped
cache/client for Machine objects (using controller-runtime's cache.New or
similar) and pass that dedicated client/cache into the infrastructure validation
controller setup (the controller registered where the Machine
actuator/controller is created around the code referenced at Line 223), while
leaving the main mgr and mgr.GetClient scoped to the configured
DefaultNamespaces. Ensure the validation controller uses the dedicated Machine
client for cluster-wide watches and the rest of controllers continue to use the
shared manager client/cache.

---

Nitpick comments:
In `@pkg/controller/infrastructurevalidation/vsphere_validator.go`:
- Around line 80-123: Inside the machine validation loop over machineList.Items,
check the provided context for cancellation (ctx.Err() or ctx.Done()) at the top
of each iteration and stop processing if cancelled: return early from the
enclosing validation function (propagating the context error) or break the loop
so work halts promptly; locate the loop that references machineList.Items,
machineRegionLabelKey/machineZoneLabelKey, configuredFDs and updates result and
OrphanedReference and add the context check before any heavy work or appending
to result.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d154da29-1729-4bc0-8c59-f08eac10bc7a

📥 Commits

Reviewing files that changed from the base of the PR and between d7772c6 and 50627e1.

📒 Files selected for processing (7)
  • cmd/vsphere/main.go
  • install/0000_30_machine-api-operator_09_rbac.yaml
  • pkg/controller/infrastructurevalidation/infrastructurevalidation_controller.go
  • pkg/controller/infrastructurevalidation/infrastructurevalidation_controller_test.go
  • pkg/controller/infrastructurevalidation/vsphere_validator.go
  • pkg/controller/infrastructurevalidation/vsphere_validator_test.go
  • pkg/operator/sync.go

Comment thread install/0000_30_machine-api-operator_09_rbac.yaml
Comment thread pkg/operator/sync.go
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant