Update Overload standard deviation doc

Bug #2113862 details a number of suggested
corrections and additions to the Workload
Stabilization doc. This patch adds those
suggested changes.

Closes-Bug: #2113862
Assisted-By: Cursor (claude-3.5-sonnet)
Change-Id: I4131a304c064d2ea397b2447025c7edf69a56e2a
Signed-off-by: Ronelle Landy <rlandy@redhat.com>
Author: Ronelle Landy
Date:   2025-07-03 16:51:09 -04:00
Parent: 6d155c4be6
Commit: 457819072f

3 changed files with 108 additions and 48 deletions

View File

===============================
Workload Stabilization Strategy
===============================

Synopsis
--------

Metrics
*******

The *workload_stabilization* strategy requires the following metrics:

============================ ==================================================
metric                       description
============================ ==================================================
``instance_ram_usage``       RAM usage of an instance, as a float in
                             megabytes
``instance_cpu_usage``       CPU usage of an instance, as a float between
                             0 and 100 representing the total CPU usage as
                             a percentage
``host_ram_usage``           RAM usage of a compute node, as a float in
                             megabytes
``host_cpu_usage``           CPU usage of a compute node, as a float between
                             0 and 100 representing the total CPU usage as
                             a percentage
============================ ==================================================

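For intuition, the two kinds of metrics above are normalized differently before the standard deviation is computed: CPU usage is already a percentage, while RAM usage in megabytes has to be scaled by the node's capacity. The following is an illustrative sketch only, not Watcher's code; the function names are hypothetical:

```python
def normalize_cpu(cpu_percent):
    # instance_cpu_usage / host_cpu_usage are floats between 0 and 100,
    # so dividing by 100 yields a normalized value between 0 and 1.
    return cpu_percent / 100.0


def normalize_ram(ram_used_mb, ram_total_mb):
    # instance_ram_usage / host_ram_usage are floats in megabytes, so
    # they are scaled by the total capacity to land in the 0..1 range.
    return ram_used_mb / ram_total_mb


print(normalize_cpu(75.0))             # 0.75
print(normalize_ram(8192.0, 16384.0))  # 0.5
```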
Cluster data model
******************

Configuration
-------------

Strategy parameters are:

====================== ====== =================== =============================
parameter              type   default Value       description
====================== ====== =================== =============================
``metrics``            array  |metrics|           Metrics used as rates of
                                                  cluster loads.
``thresholds``         object |thresholds|        Dict where key is a metric
                                                  and value is a trigger value.
                                                  The strategy will only look
                                                  for an action plan when the
                                                  standard deviation of the
                                                  usage of one of the resources
                                                  included in the metrics,
                                                  taken as a normalized usage
                                                  between 0 and 1 among the
                                                  hosts, is higher than the
                                                  threshold. The standard
                                                  deviation of a perfectly
                                                  balanced cluster is 0, while
                                                  that of a totally unbalanced
                                                  one is 0.5, the maximum
                                                  possible value.
``weights``            object |weights|           These weights are used to
                                                  calculate the common standard
                                                  deviation when optimizing
                                                  resource usage. The name of
                                                  each weight is the meter name
                                                  with a _weight suffix.
                                                  Higher values imply the
                                                  metric will be prioritized
                                                  when calculating an optimal
                                                  resulting cluster
                                                  distribution.
``instance_metrics``   object |instance_metrics|  Mapping from each instance
                                                  metric listed in the metrics
                                                  parameter to the compute node
                                                  metric representing the same
                                                  resource usage at the host
                                                  level.
``host_choice``        string retry               Method of host choice when
                                                  analyzing destinations for
                                                  instances. There are cycle,
                                                  retry and fullsearch methods.
                                                  Cycle will iterate hosts in a
                                                  cycle. Retry will get random
                                                  hosts (count defined in the
                                                  retry_count option).
                                                  Fullsearch will return each
                                                  host from the list.
``retry_count``        number 1                   Count of random hosts
                                                  returned by the retry method.
``periods``            object |periods|           Time, in seconds, used to
                                                  aggregate statistical values
                                                  of resource usage for
                                                  instance and host metrics.
                                                  Watcher will use the last
                                                  period to calculate resource
                                                  usage.
``granularity``        number 300                 NOT RECOMMENDED TO MODIFY:
                                                  The time between two measures
                                                  in an aggregated timeseries
                                                  of a metric.
``aggregation_method`` object |aggn_method|       NOT RECOMMENDED TO MODIFY:
                                                  Function used to aggregate
                                                  multiple measures into an
                                                  aggregated value.
====================== ====== =================== =============================

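To make the ``thresholds`` behaviour concrete, the sketch below (a hypothetical helper, not Watcher's implementation) computes the population standard deviation of normalized per-host CPU usage and compares it with the default 0.2 trigger value:

```python
import statistics


def needs_optimization(host_cpu_percent, threshold=0.2):
    """Check whether the normalized CPU usage spread exceeds the threshold.

    host_cpu_percent: per-host CPU usage values between 0 and 100.
    """
    # Normalize usage to the 0..1 range, as the strategy does.
    normalized = [usage / 100.0 for usage in host_cpu_percent]
    # Population standard deviation: 0 for a perfectly balanced
    # cluster, at most 0.5 for a totally unbalanced one.
    deviation = statistics.pstdev(normalized)
    return deviation > threshold


# A well-balanced cluster does not trigger an action plan...
print(needs_optimization([48, 50, 52]))   # False
# ...while a heavily unbalanced one does.
print(needs_optimization([5, 95, 90, 4]))  # True
```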
.. |metrics| replace:: ["instance_cpu_usage", "instance_ram_usage"]
.. |thresholds| replace:: {"instance_cpu_usage": 0.2, "instance_ram_usage": 0.2}
.. |weights| replace:: {"instance_cpu_usage_weight": 1.0, "instance_ram_usage_weight": 1.0}
.. |instance_metrics| replace:: {"instance_cpu_usage": "host_cpu_usage", "instance_ram_usage": "host_ram_usage"}
.. |periods| replace:: {"instance": 720, "node": 600}
.. |aggn_method| replace:: {"instance": 'mean', "compute_node": 'mean'}

Efficacy Indicator
------------------
Global efficacy indicator:

.. watcher-func::
   :format: literal_block

   watcher.decision_engine.goal.efficacy.specs.WorkloadBalancing.get_global_efficacy_indicator

Other efficacy indicators of the goal are:

- ``instance_migrations_count``: The number of VM migrations to be performed
- ``instances_count``: The total number of audited instances in the strategy
- ``standard_deviation_after_audit``: The resulting standard deviation value
  after the audit
- ``standard_deviation_before_audit``: The original standard deviation value
  before the audit

Algorithm
---------

How to use it ?
---------------

External Links
--------------
- `Watcher Overload standard deviation algorithm spec <https://specs.openstack.org/openstack/watcher-specs/specs/newton/implemented/sd-strategy.html>`_

View File

---
other:
  - |
    The Watcher Overload Standard Deviation algorithm is now referred to in the
    documentation as the Workload Stabilization Strategy. The documentation of
    this strategy has been enhanced to clarify and better explain the usage of
    parameters.

View File

class WorkloadStabilization(base.WorkloadStabilizationBaseStrategy):
"""Workload Stabilization control using live migration
This is workload stabilization strategy based on standard deviation
algorithm. The goal is to determine if there is an overload in a cluster
and respond to it by migrating VMs to stabilize the cluster.
This workload stabilization strategy is based on the standard deviation
algorithm, as a measure of cluster resource usage balance. The goal is to
determine if there is an overload in a cluster and respond to it by
migrating VMs to stabilize the cluster.
The standard deviation is determined using normalized CPU and/or memory
usage values, which are scaled to a range between 0 and 1 based on the
usage metrics in the data sources.
A standard deviation of 0 means that your cluster's resources are
perfectly balanced, with all usage values being identical. However, a
standard deviation of 0.5 indicates completely unbalanced resource usage,
where some resources are heavily utilized and others are not at all.
This strategy has been tested in a small (32 nodes) cluster.
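The 0 and 0.5 endpoints described in the docstring can be checked directly with the standard library; this is an illustration independent of the Watcher code:

```python
import statistics

# Perfectly balanced cluster: identical normalized usage on every host.
balanced = [0.5, 0.5, 0.5, 0.5]

# Totally unbalanced cluster: half the hosts fully loaded, half idle.
unbalanced = [1.0, 0.0, 1.0, 0.0]

print(statistics.pstdev(balanced))    # 0.0
print(statistics.pstdev(unbalanced))  # 0.5
```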