[4.22.1.0-shapeblue1] KVM HA fixes #13373 and #13377 #138

Open: harikrishna-patnala wants to merge 5 commits into 4.22.1.0-shapeblue1 from ha-checkonhostanswer-fix-4.22.1

Conversation

@harikrishna-patnala commented Jun 17, 2026

Member

Description

This PR ports upstream PRs apache#13373 and apache#13377 to the ShapeBlue custom patch branch 4.22.1.0-shapeblue1, so that the same fixes are available there.

cc @sureshanaparti

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • Build/CI
  • Test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

How did you try to break this feature and the system with this change?

@harikrishna-patnala

Member Author

@blueorangutan package

@harikrishna-patnala changed the title from "[4.22.1.0-shapeblue1] KVM HA: Fix CheckOnHostAnswer success flag when there is no heartbeat" to "[4.22.1.0-shapeblue1] KVM HA fixes #13373 and #13377" on Jun 19, 2026
sureshanaparti and others added 5 commits June 19, 2026 09:54
KVMHAProvider.fence() declared a host fenced only when the out-of-band power-off
command reported success. Against an already-off chassis the BMC rejects the
power-off (e.g. Redfish returns HTTP 409), so fence() failed and the host stayed
stuck in the Fencing HA state, which maps to Disconnected (not Down). VM-HA
therefore never restarted the VMs until the dead host was powered back on.

Fencing now succeeds based on the actual chassis power state:
 - if the host is already powered off (OOBM STATUS == Off), treat it as fenced;
 - otherwise issue a best-effort power-off and confirm via OOBM STATUS;
 - only a confirmed Off state counts as success; if the state cannot be confirmed
   (e.g. unreachable BMC) the fence fails and is retried, to avoid split-brain.
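
The decision rule in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `PowerState`, `FenceSketch`, and the two callbacks are hypothetical stand-ins for the OOBM STATUS query and the OOBM power-off command.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for the chassis power state reported by OOBM STATUS.
enum PowerState { ON, OFF, UNKNOWN }

final class FenceSketch {
    /**
     * Returns true only when the chassis is confirmed Off.
     *
     * statusQuery: assumed wrapper around OOBM STATUS; returns UNKNOWN
     *              when the BMC cannot be reached.
     * powerOff:    assumed wrapper around the OOBM power-off command; its
     *              outcome is ignored on purpose (best effort).
     */
    static boolean fence(Supplier<PowerState> statusQuery, Runnable powerOff) {
        // An already-off host counts as fenced; no power-off is issued, so
        // a BMC that would reject the command (e.g. HTTP 409) is never asked.
        if (statusQuery.get() == PowerState.OFF) {
            return true;
        }
        try {
            powerOff.run();
        } catch (RuntimeException e) {
            // Best effort: a failed command does not decide the outcome.
        }
        // Only a confirmed Off state is success. ON or UNKNOWN fails the
        // fence so the caller can retry instead of risking split-brain.
        return statusQuery.get() == PowerState.OFF;
    }
}
```

The point of the design is that the result depends on the observed power state rather than on the return code of the power-off command.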

Also map Redfish PowerOperation.OFF to ForceOff (hard power-off) instead of
GracefulShutdown, consistent with the ipmitool driver and appropriate for fencing
an unresponsive host (SOFT remains the graceful ACPI shutdown).
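
As an illustration of the mapping described above, here is a small sketch. The enum and class names are hypothetical and only the two operations named in the commit message are covered; `ForceOff` and `GracefulShutdown` are standard Redfish `ResetType` values.

```java
// Hypothetical enum holding only the two operations named in the commit message.
enum PowerOperation { OFF, SOFT }

final class RedfishResetSketch {
    /** Maps a power operation to the Redfish ResetType string sent to the BMC. */
    static String resetType(PowerOperation op) {
        switch (op) {
            case OFF:
                // Hard power-off: suitable for fencing an unresponsive host.
                return "ForceOff";
            case SOFT:
                // Graceful ACPI shutdown: relies on a responsive guest OS.
                return "GracefulShutdown";
            default:
                throw new IllegalArgumentException("Unsupported operation: " + op);
        }
    }
}
```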

Fixes apache#13376
@harikrishna-patnala harikrishna-patnala force-pushed the ha-checkonhostanswer-fix-4.22.1 branch from 5025f5d to 1613b41 Compare June 19, 2026 04:24
@kiranchavala

kiranchavala commented Jun 19, 2026

Member

@weizhouapache @NuxRo @rajujith @sureshanaparti @harikrishna-patnala @andrijapanicsb

What is the expected behaviour of VM HA when a KVM host is soft powered off?

Steps to reproduce the issue

  1. Create an HA-enabled offering
  2. Deploy a VM with the HA-enabled offering on KVM host 1
  3. Log in to KVM host 1
  4. Issue the shutdown command
  5. VM HA does not get triggered
  6. Global setting value: commands.timeout = CheckHealthCommand=5,CheckOnHostCommand=5
[root@ref-trl-11991-k-Mol8-kiran-chavala-kvm1 ~]# virsh list
 Id   Name       State
--------------------------
 1    i-2-6-VM   running

[root@ref-trl-11991-k-Mol8-kiran-chavala-kvm1 ~]# shutdown now
Connection to 10.0.32.193 closed by remote host.
Connection to 10.0.32.193 closed.
  1. The host goes into the Disconnected state
  2. No entry is created in the op_ha_work table (select * from op_ha_work)
  3. grep "status reported from itself" /var/log/cloudstack/management/management-server.log

VM HA gets triggered only when the KVM host is hard powered off.
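
For readers unfamiliar with the `commands.timeout` value quoted in the steps above, the following sketch shows how a comma-separated list of `Command=value` pairs of that shape could be split into per-command entries. `CommandTimeoutSketch` is a hypothetical helper; how CloudStack itself parses the setting, and the unit of the values, are not shown in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CommandTimeoutSketch {
    /** Splits "CmdA=5,CmdB=5" into an ordered map of command name to value. */
    static Map<String, Integer> parse(String value) {
        Map<String, Integer> timeouts = new LinkedHashMap<>();
        for (String pair : value.split(",")) {
            String[] kv = pair.trim().split("=", 2);
            if (kv.length != 2) {
                throw new IllegalArgumentException("Expected Command=value, got: " + pair);
            }
            timeouts.put(kv[0].trim(), Integer.parseInt(kv[1].trim()));
        }
        return timeouts;
    }
}
```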

@weizhouapache

Member

If the host goes into the Disconnected state, VM HA will NOT be triggered.
This is a known issue; we will address it in a customer FR.
