Update Overload standard deviation doc

Bug #2113862 details a number of suggested
corrections and additions to the Workload
Stabilization doc. This patch adds those
suggested changes.

Closes-Bug: #2113862
Assisted-By: Cursor (claude-3.5-sonnet)
Change-Id: I4131a304c064d2ea397b2447025c7edf69a56e2a
Signed-off-by: Ronelle Landy <rlandy@redhat.com>
Author: Ronelle Landy
Date:   2025-07-03 16:51:09 -04:00
Parent: 6d155c4be6
Commit: 457819072f

3 changed files with 108 additions and 48 deletions

View File

===============================
Workload Stabilization Strategy
===============================

Synopsis
--------

Metrics
*******

The *workload_stabilization* strategy requires the following metrics:

============================ ==================================================
metric                       description
============================ ==================================================
``instance_ram_usage``       RAM usage of an instance, as a float in
                             megabytes
``instance_cpu_usage``       CPU usage of an instance, as a float between
                             0 and 100 representing the total CPU usage as
                             a percentage
``host_ram_usage``           RAM usage of a compute node, as a float in
                             megabytes
``host_cpu_usage``           CPU usage of a compute node, as a float between
                             0 and 100 representing the total CPU usage as
                             a percentage
============================ ==================================================

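For intuition, the two kinds of metrics above are normalized differently before the standard deviation is computed: CPU usage is already a percentage, while RAM usage in megabytes has to be scaled by the node's capacity. The following is an illustrative sketch only, not Watcher's code; the function names are hypothetical:

```python
def normalize_cpu(cpu_percent):
    # instance_cpu_usage / host_cpu_usage are floats between 0 and 100,
    # so dividing by 100 yields a normalized value between 0 and 1.
    return cpu_percent / 100.0


def normalize_ram(ram_used_mb, ram_total_mb):
    # instance_ram_usage / host_ram_usage are floats in megabytes, so
    # they are scaled by the total capacity to land in the 0..1 range.
    return ram_used_mb / ram_total_mb


print(normalize_cpu(75.0))             # 0.75
print(normalize_ram(8192.0, 16384.0))  # 0.5
```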
Cluster data model
******************

Configuration
-------------

Strategy parameters are:

====================== ====== =================== =============================
parameter              type   default Value       description
====================== ====== =================== =============================
``metrics``            array  |metrics|           Metrics used as rates of
                                                  cluster loads.
``thresholds``         object |thresholds|        Dict where key is a metric
                                                  and value is a trigger value.
                                                  The strategy will only look
                                                  for an action plan when the
                                                  standard deviation of the
                                                  usage of one of the resources
                                                  included in the metrics,
                                                  taken as a normalized usage
                                                  between 0 and 1 among the
                                                  hosts, is higher than the
                                                  threshold. The standard
                                                  deviation of a perfectly
                                                  balanced cluster is 0, while
                                                  that of a totally unbalanced
                                                  one is 0.5, the maximum
                                                  possible value.
``weights``            object |weights|           These weights are used to
                                                  calculate the common standard
                                                  deviation when optimizing
                                                  resource usage. The name of
                                                  each weight is the meter name
                                                  with a _weight suffix.
                                                  Higher values imply the
                                                  metric will be prioritized
                                                  when calculating an optimal
                                                  resulting cluster
                                                  distribution.
``instance_metrics``   object |instance_metrics|  Mapping from each instance
                                                  metric listed in the metrics
                                                  parameter to the compute node
                                                  metric representing the same
                                                  resource usage at the host
                                                  level.
``host_choice``        string retry               Method of host choice when
                                                  analyzing destinations for
                                                  instances. There are cycle,
                                                  retry and fullsearch methods.
                                                  Cycle will iterate hosts in a
                                                  cycle. Retry will get random
                                                  hosts (count defined in the
                                                  retry_count option).
                                                  Fullsearch will return each
                                                  host from the list.
``retry_count``        number 1                   Count of random hosts
                                                  returned by the retry method.
``periods``            object |periods|           Time, in seconds, used to
                                                  aggregate statistical values
                                                  of resource usage for
                                                  instance and host metrics.
                                                  Watcher will use the last
                                                  period to calculate resource
                                                  usage.
``granularity``        number 300                 NOT RECOMMENDED TO MODIFY:
                                                  The time between two measures
                                                  in an aggregated timeseries
                                                  of a metric.
``aggregation_method`` object |aggn_method|       NOT RECOMMENDED TO MODIFY:
                                                  Function used to aggregate
                                                  multiple measures into an
                                                  aggregated value.
====================== ====== =================== =============================

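To make the ``thresholds`` behaviour concrete, the sketch below (a hypothetical helper, not Watcher's implementation) computes the population standard deviation of normalized per-host CPU usage and compares it with the default 0.2 trigger value:

```python
import statistics


def needs_optimization(host_cpu_percent, threshold=0.2):
    """Check whether the normalized CPU usage spread exceeds the threshold.

    host_cpu_percent: per-host CPU usage values between 0 and 100.
    """
    # Normalize usage to the 0..1 range, as the strategy does.
    normalized = [usage / 100.0 for usage in host_cpu_percent]
    # Population standard deviation: 0 for a perfectly balanced
    # cluster, at most 0.5 for a totally unbalanced one.
    deviation = statistics.pstdev(normalized)
    return deviation > threshold


# A well-balanced cluster does not trigger an action plan...
print(needs_optimization([48, 50, 52]))   # False
# ...while a heavily unbalanced one does.
print(needs_optimization([5, 95, 90, 4]))  # True
```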
.. |metrics| replace:: ["instance_cpu_usage", "instance_ram_usage"]
.. |thresholds| replace:: {"instance_cpu_usage": 0.2, "instance_ram_usage": 0.2}
.. |weights| replace:: {"instance_cpu_usage_weight": 1.0, "instance_ram_usage_weight": 1.0}
.. |instance_metrics| replace:: {"instance_cpu_usage": "host_cpu_usage", "instance_ram_usage": "host_ram_usage"}
.. |periods| replace:: {"instance": 720, "node": 600}
.. |aggn_method| replace:: {"instance": 'mean', "compute_node": 'mean'}

Efficacy Indicator
------------------
Global efficacy indicator:

.. watcher-func::
   :format: literal_block

   watcher.decision_engine.goal.efficacy.specs.WorkloadBalancing.get_global_efficacy_indicator

Other efficacy indicators of the goal are:

- ``instance_migrations_count``: The number of VM migrations to be performed
- ``instances_count``: The total number of audited instances in the strategy
- ``standard_deviation_after_audit``: The resulting standard deviation value
  after the audit
- ``standard_deviation_before_audit``: The original standard deviation value
  before the audit

Algorithm
---------

How to use it ?
---------------

External Links
--------------
- `Watcher Overload standard deviation algorithm spec <https://specs.openstack.org/openstack/watcher-specs/specs/newton/implemented/sd-strategy.html>`_

View File

---
other:
  - |
    The Watcher Overload Standard Deviation algorithm is now referred to in the
    documentation as the Workload Stabilization Strategy. The documentation of
    this strategy has been enhanced to clarify and better explain the usage of
    parameters.

View File

class WorkloadStabilization(base.WorkloadStabilizationBaseStrategy):
"""Workload Stabilization control using live migration
This is workload stabilization strategy based on standard deviation
algorithm. The goal is to determine if there is an overload in a cluster
and respond to it by migrating VMs to stabilize the cluster.
This workload stabilization strategy is based on the standard deviation
algorithm, as a measure of cluster resource usage balance. The goal is to
determine if there is an overload in a cluster and respond to it by
migrating VMs to stabilize the cluster.
The standard deviation is determined using normalized CPU and/or memory
usage values, which are scaled to a range between 0 and 1 based on the
usage metrics in the data sources.
A standard deviation of 0 means that your cluster's resources are
perfectly balanced, with all usage values being identical. However, a
standard deviation of 0.5 indicates completely unbalanced resource usage,
where some resources are heavily utilized and others are not at all.
This strategy has been tested in a small (32 nodes) cluster.
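The 0 and 0.5 endpoints described in the docstring can be checked directly with the standard library; this is an illustration independent of the Watcher code:

```python
import statistics

# Perfectly balanced cluster: identical normalized usage on every host.
balanced = [0.5, 0.5, 0.5, 0.5]

# Totally unbalanced cluster: half the hosts fully loaded, half idle.
unbalanced = [1.0, 0.0, 1.0, 0.0]

print(statistics.pstdev(balanced))    # 0.0
print(statistics.pstdev(unbalanced))  # 0.5
```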