
710 Commits

Author SHA1 Message Date
Zuul
848cde3606 Merge "Rename confusing query timeout options" 2025-08-28 09:26:40 +00:00
Takashi Kajinami
7106a12251 Rename confusing query timeout options
These options define an interval, not a timeout. Rename the options
to reflect what they actually define. The existing deprecated options
in the [gnocchi_client] section are also removed, because these have been kept
for 6 years.

In addition, fix the inconsistent naming (query vs call).

Change-Id: Ib29115746a25b45bdff1c3da8df9d7167c2db662
Signed-off-by: Takashi Kajinami <kajinamit@oss.nttdata.com>
2025-08-27 23:22:45 +09:00
Douglas Viroel
03c09825f7 Extend compute model attributes
This patch extends compute model attributes by
adding new fields to the Instance element. Values are
populated by the nova collector, using the same
nova list call, but this requires a more recent compute
API microversion.
A new config option was added to allow users to
enable or disable the extended attributes, and it is
disabled by default.
Configures prometheus-based jobs to run on a newer version
of the nova API (2.96) and enables the extended attributes
collection.

Implements: bp/extend-compute-model-attributes

Assisted-By: Cursor (claude-4-sonnet)

Change-Id: Ibf31105d780dce510a59fc74241fa04e28529ade
Signed-off-by: Douglas Viroel <viroel@gmail.com>
2025-08-26 11:35:18 -03:00
Ronelle Landy
457819072f Update Overload standard deviation doc
Bug #2113862 details a number of suggested
corrections and additions to the Workload
Stabilization doc. This patch adds those
suggested changes.

Closes-Bug: #2113862
Assisted-By: Cursor (claude-3.5-sonnet)
Change-Id: I4131a304c064d2ea397b2447025c7edf69a56e2a
Signed-off-by: Ronelle Landy <rlandy@redhat.com>
2025-08-21 11:09:46 -04:00
Zuul
616c8f4cc4 Merge "Add options to disable migration in host maintenance" 2025-08-21 14:11:22 +00:00
Quang Ngo
cc26b3b334 Add options to disable migration in host maintenance
This change enhances the Host Maintenance strategy by introducing
two new input parameters: `disable_live_migration` and
`disable_cold_migration`. These parameters allow cloud
administrators to control whether live or cold migration should be
considered during host maintenance operations.

If `disable_live_migration` is set, active instances will be cold
migrated if `disable_cold_migration` is not set, otherwise
active instances will be stopped. If `disable_cold_migration` is set,
inactive instances will not be cold migrated.
If both are set, only stop actions will be performed on instances.
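
The rules above can be sketched as a small decision function. This is only one reading of the commit message, not the strategy's actual code: the function name, the `is_active` flag, and the returned action names are hypothetical, and the sketch assumes an inactive instance gets no action when cold migration is disabled.

```python
def plan_instance_action(is_active, disable_live_migration=False,
                         disable_cold_migration=False):
    """Pick an action for one instance on the host under maintenance.

    Returns "live_migrate", "cold_migrate", "stop", or None (no action).
    All names here are illustrative, not taken from the Watcher code.
    """
    if is_active:
        if not disable_live_migration:
            return "live_migrate"
        # Live migration disabled: fall back to cold migration if allowed.
        if not disable_cold_migration:
            return "cold_migrate"
        # Both migration types disabled: only a stop action remains.
        return "stop"
    # Inactive instances can only be moved by cold migration.
    if not disable_cold_migration:
        return "cold_migrate"
    return None


for active in (True, False):
    for no_live in (False, True):
        for no_cold in (False, True):
            print(active, no_live, no_cold,
                  plan_instance_action(active, no_live, no_cold))
```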

The strategy logic and action plan generation have been updated to
reflect these behaviors. A new "stop" action is introduced and
registered, and the weight planner is updated to handle the new action.

Documentation for the Host Maintenance strategy is updated to
describe the new parameters and their effects.

Test Plan:
- Unit tests for HostMaintenance strategy with new parameters
- Integration tests for action plan generation with stop action

This implements the specification:
Spec: https://review.opendev.org/c/openstack/watcher-specs/+/943873

Change-Id: I201b8e5c52e1bc1a74f3886a0e301e3c0fa5d351
Signed-off-by: Quang Ngo <quang.ngo@canonical.com>
2025-08-20 22:32:33 +10:00
Zuul
90f0c2264c Merge "use cinder migrate for swap volume" 2025-08-18 20:32:42 +00:00
Sean Mooney
3742e0a79c use cinder migrate for swap volume
This change removes Watcher's in-tree functionality
for swapping instance volumes and defines swap as an alias
of cinder volume migrate.

The Watcher-native implementation was missing error handling,
which could lead to irretrievable data loss.

The removed code also forged project user credentials to
perform admin requests as if they were made by a member of a project.
This was unsafe and posed a security risk due to how it was
implemented. This code has been removed without replacement.

While some effort has been made to allow existing
audits that were defined to work, any reduction of functionality
as a result of this security hardening is intentional.

Closes-Bug: #2112187
Change-Id: Ic3b6bfd164e272d70fe86d7b182478dd962f8ac0
Signed-off-by: Sean Mooney <work@seanmooney.info>
2025-08-18 16:35:38 +00:00
Jaromir Wysoglad
8309d9848a Add Aetos datasource
Implement the spec for multi-tenancy support for metrics. This adds
a new 'Aetos' datasource very similar to the current Prometheus
datasource. Because of that, the original PrometheusHelper class
was split into two classes and the base class is used for
PrometheusHelper and for AetosHelper. Except for the split, there
is one more change to the original PrometheusHelper class code, which
is the addition and use of the _get_fqdn_label() and
_get_instance_uuid_label() methods.

As part of the change, I refactored the current prometheus datasource
unit tests. Most of them are now used to test the PrometheusBase class
with minimal changes. Changes I've made to the original tests:

- the ones that can be used to test the base class are moved into the
  TestPrometheusBase class
- the _setup_prometheus_client, _get_instance_uuid_label and
  _get_fqdn_label functions are mocked in the base class tests.
  Their concrete implementations are tested separately in each
  datasource's tests.
- a self._create_helper() is used to instantiate the helper class with
  correct mocking.
- all config value modification in the original tests got moved out and
  instead of modifying the config values, the _get_* methods are mocked
  to return the wanted values
- to keep similar test coverage, config retrieval is tested for each
  concrete class by testing the _get_* methods.

New watcher-aetos-integration and watcher-aetos-integration-realdata
zuul jobs are added to test the new datasource. These use the same set
of tempest tests as the current watcher-prometheus-integration jobs.
The only difference is the environment setup and the Watcher config,
so that the job deploys Aetos and Watcher uses it instead of accessing
Prometheus directly.

At first this was generated by asking cursor to implement the linked spec
with some additional prompts for some smaller changes. Afterwards I manually
went through the code doing some cleanups, ensuring it complies with
PEP8 and hacking and so on. Later on I manually adjusted the code to use
the latest observabilityclient changes.
The zuul job was also mostly generated by cursor.

Implements: https://blueprints.launchpad.net/watcher/+spec/prometheus-multitenancy-support

Generated-By: Cursor with claude-4-sonnet model
Change-Id: I72c2171f72819bbde6c9cbbf565ee895e5d2bd53
Signed-off-by: Jaromir Wysoglad <jwysogla@redhat.com>
2025-08-14 02:27:24 -04:00
Zuul
9925fd2cc9 Merge "Replace dateutils usage with datetime and oslo.utils" 2025-08-07 20:46:25 +00:00
Douglas Viroel
f879b10b05 Extend decision engine to support threading mode
With the ongoing eventlet removal effort, Watcher will need
to be adapted to support both modes, eventlet and threading, for
a couple of releases before removing all eventlet code.
This patch adds methods and classes that allow decision engine
modules to create futurist thread pools instead of green thread pools,
based on an environment variable that can be enabled per service.
It moves the continuous audit handler instance to the decision engine service,
so it can be started together with the main decision engine service.
It also adds an environment variable that allows the user to disable
eventlet monkey patching and to use the oslo.service threading backend.

Change-Id: I8a8be0a7cebdc44005fd77ec960543828c7da318
Signed-off-by: Douglas Viroel <viroel@gmail.com>
2025-08-05 16:45:48 -03:00
Chandan Kumar (raukadah)
95d975f339 Replace dateutils usage with datetime and oslo.utils
This cr fixes:
* Replaced ``dateutil.tz.tzlocal()`` and ``dateutil.tz.tzutc()`` with
  ``datetime.timezone`` built-in classes in audit controllers and
  continuous audit scheduling.

* Replaced ``dateutil.parser.parse()`` with
  ``oslo_utils.timeutils.parse_isotime()`` in the zone migration
  strategy for parsing datetime strings.
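
A minimal sketch of the timezone half of this replacement, using only the standard library. The `to_utc` helper is hypothetical and exists only for illustration. The `oslo_utils.timeutils.parse_isotime()` half is not shown, since it needs oslo.utils installed.

```python
from datetime import datetime, timedelta, timezone

# UTC: datetime.timezone.utc stands in for dateutil.tz.tzutc().
now_utc = datetime.now(timezone.utc)

# Local zone: astimezone() with no argument attaches the system local
# timezone, standing in for dateutil.tz.tzlocal().
now_local = datetime.now().astimezone()


def to_utc(dt):
    """Convert a datetime to UTC; naive input is assumed to already be UTC."""
    if dt.tzinfo is None:
        return dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


# A fixed +05:30 offset, built without any third-party package.
ist = timezone(timedelta(hours=5, minutes=30))
print(to_utc(datetime(2025, 8, 5, 23, 9, 50, tzinfo=ist)).isoformat())
# prints 2025-08-05T17:39:50+00:00
```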

Closes-Bug: #2118404

Change-Id: I6d8a345fa4339a688769b147413dcdf3016bf4a0
Signed-off-by: Chandan Kumar (raukadah) <chkumar@redhat.com>
2025-08-05 23:09:50 +05:30
Douglas Viroel
081cd5fae9 Merge decision engine services into a single one
The decision engine process was built on 2
services: a service that handles rpc requests and a
scheduler that triggers watcher periodic tasks.
With the new version of oslo.service, a new threading
backend was added, based on the cotyledon service manager,
which starts a new process for each service that it
manages. These two services can't run in different
processes since they need access to a shared in-memory
representation of the cluster (cluster data models).
This patch proposes creating a Decision Engine Service
which includes everything in a single main service.

Change-Id: I335a97ca14b6e023fef055978a56aefebf22d433
Signed-off-by: Douglas Viroel <viroel@gmail.com>
2025-07-08 09:55:32 -03:00
Zuul
16131e5cac Merge "Update Workload Balance strategy documentation" 2025-06-27 13:36:50 +00:00
Ronelle Landy
bfbd136f4b Update Host Maintenance strategy documentation
Add clarifications to the documentation to reflect
the actual strategy usage, including:
 - updating parameter descriptions
 - extending the 'How to Use' section

Closes-Bug: #2111810
Change-Id: Ifd2876056cd8819c50658fb9f213246dc1546d42
2025-06-23 06:36:42 -04:00
Zuul
fe8d8c8839 Merge "Use KiB as unit for host_ram_usage when using prometheus datasource" 2025-06-20 16:19:50 +00:00
Zuul
b8e0e6b01c Merge "Aggregate by label when querying instance cpu usage in prometheus" 2025-06-19 14:46:07 +00:00
Alfredo Moralejo
6ea362da0b Use KiB as unit for host_ram_usage when using prometheus datasource
The prometheus datasource was reporting host_ram_usage in MiB as
described in the docstring for the base datasource interface
definition [1].

However, the gnocchi datasource is reporting it in KiB following
ceilometer metric `hardware.memory.used` [2] and the strategies
using that metric expect it to be in KiB so the best approach is
to change the unit in the prometheus datasource and update the
docstring to avoid misunderstandings in the future. So, this patch
is fixing the prometheus datasource to return host_ram_usage
in KiB instead of MiB.
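
To make the size of the mismatch concrete, here is a small worked conversion. The helper name and the sample values are made up for illustration and are not taken from the datasource code.

```python
KIB_PER_MIB = 1024


def mib_to_kib(value_mib):
    """Convert a memory value from MiB to KiB."""
    return value_mib * KIB_PER_MIB


# A host using 2048 MiB of RAM, reported in the unit the strategies expect.
usage_mib = 2048
usage_kib = mib_to_kib(usage_mib)
print(usage_kib)  # prints 2097152

# If a consumer expecting KiB were handed the MiB number instead,
# it would see a value 1024 times too small.
print(usage_kib // usage_mib)  # prints 1024
```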

Additionally, it is adding more unit tests for the check_threshold
method so that it covers the memory based strategy execution, validates
the calculated standard deviation and adds the cases where it is below
the threshold.

[1] 15981117ee/watcher/decision_engine/datasources/base.py (L177-L183)
[2] https://docs.openstack.org/ceilometer/train/admin/telemetry-measurements.html#snmp-based-meters

Closes-Bug: #2113776
Change-Id: Idc060d1e709c0265c64ada16062c3a206c6b04fa
2025-06-19 16:25:27 +02:00
Zuul
0f78386462 Merge "Add debug message to report calculated metric for workload_balance" 2025-06-18 12:26:24 +00:00
Alfredo Moralejo
1529e3fadd Add debug message to report calculated metric for workload_balance
The workload_balance strategy calculates host metrics based on the
instance metrics and those are the ones used to compare with the
threshold.

Currently, the strategy does not report the calculated values, which
sometimes makes troubleshooting difficult. This patch adds a debug
message to log those values.

This patch also adds a new unit test for filter_destination_hosts
based on ram instead of cpu and adds assertions for the new debug
messages. To implement the new test properly, I had to slightly modify
the ram usage fixtures used for the workload_balance tests.

Change-Id: Ief5e167afcf346ff53471f26adc70795c4b69f68
2025-06-17 19:11:48 +02:00
Alfredo Moralejo
3860de0b1e Aggregate by label when querying instance cpu usage in prometheus
Currently, when the prometheus datasource queries the ceilometer_cpu metric
for instance cpu usage, it aggregates by instance and filters by the
label containing the instance uuid. While this works fine in real
scenarios, where a single metric is provided by a single instance, in
some cases, such as the CI jobs where metrics are directly injected, it leads to
incorrect metric calculation.

We applied a similar fix for the host metrics in [1] but we did not
implement it for instance cpu.

I am also converting the query formatting to the dict format to improve
readability.
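
As a rough illustration of aggregating by the same label used for filtering, built with dict-style string formatting, consider the sketch below. It is not the query Watcher actually sends: the function name, the choice of `avg` over `rate`, and the 300 second window are assumptions. Only the `ceilometer_cpu` metric name and the `resource` default label come from the commit log.

```python
def instance_cpu_query(instance_uuid, uuid_label="resource", period=300):
    """Build a PromQL string that filters and aggregates on one label.

    Hypothetical helper: the exact expression is an assumption.
    """
    params = {
        "label": uuid_label,    # label holding the OpenStack instance uuid
        "uuid": instance_uuid,  # instance to select
        "period": period,       # rate window in seconds
    }
    # Dict-based formatting keeps each placeholder named, so the same
    # label can be reused in both the "by" clause and the selector.
    return (
        "avg by (%(label)s)"
        "(rate(ceilometer_cpu{%(label)s='%(uuid)s'}[%(period)ss]))"
    ) % params


print(instance_cpu_query("abc"))
# prints avg by (resource)(rate(ceilometer_cpu{resource='abc'}[300s]))
```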

[1] https://review.opendev.org/c/openstack/watcher/+/946049

Closes-Bug: #2113936
Change-Id: I3038dec20612162c411fc77446e86a47e0354423
2025-06-11 14:49:56 +02:00
Chandan Kumar (raukadah)
15981117ee Drop unused method get_disabled_compute_nodes_with_reason
get_disabled_compute_nodes_with_reason defined in host_maintenance
strategy is not used anywhere.

This cr drops the unused method.

Change-Id: I07c0d0b63e00d476511aa8b03c0feab8ec4db95b
Signed-off-by: Chandan Kumar (raukadah) <chkumar@redhat.com>
2025-06-09 10:51:45 +05:30
Ronelle Landy
f42cb8557b Update Workload Balance strategy documentation
Adds additional parameter and usage explanations
and a combined example.

Closes-Bug: #2111848
Change-Id: Id0de4d56fa7083388ad82c61596e7484431d465b
2025-06-06 15:51:23 -04:00
Zuul
26e36e1620 Merge "Handle missing dst_node parameter in zone_migration" 2025-05-20 17:14:29 +00:00
Zuul
3585e0cc3e Merge "Drop code from Host maintenance strategy migrating instance to disabled hosts" 2025-05-16 18:18:26 +00:00
jgilaber
c6302edeca Handle missing dst_node parameter in zone_migration
For compute nodes, nova works fine if a destination node is not
specified, so this change makes sure we're not passing None when the
user does not set one to avoid an error.
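
The idea of leaving the destination out rather than passing `None` can be sketched as follows. The function and the parameter keys are hypothetical and chosen only for illustration; they are not claimed to match the zone_migration strategy's action parameters.

```python
def migration_parameters(migration_type, source_node, destination_node=None):
    """Build action parameters, omitting the destination when it is unset."""
    params = {
        "migration_type": migration_type,
        "source_node": source_node,
    }
    # Only include the key when the user supplied a value, so the
    # scheduler is free to pick a destination on its own.
    if destination_node is not None:
        params["destination_node"] = destination_node
    return params


print(migration_parameters("live", "compute-0"))
print(migration_parameters("live", "compute-0", "compute-1"))
```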

Partial-Bug: 2108988

Change-Id: Ida1f18b97697c041819e29f935aa5e232848226a
2025-05-16 13:51:47 +02:00
Chandan Kumar (raukadah)
9dea55bd64 Drop code from Host maintenance strategy migrating instance to disabled hosts
Currently the host maintenance strategy also migrates instances from the maintenance
node to watcher_disabled compute nodes.

watcher_disabled compute nodes might be disabled for some other purpose
by a different strategy. If host maintenance uses those compute nodes for
migration, it might affect customer workloads.

The host maintenance strategy should never touch disabled hosts unless the user
specifies a disabled host as the backup node.

This cr drops the logic for using disabled compute nodes for maintenance.
Host maintenance already uses the nova scheduler for migrating
instances and will continue to do so. If there is no available node, the strategy
will fail.

Closes-Bug: #2109945

Change-Id: If9795fd06f684eb67d553405cebd8a30887c3997
Signed-off-by: Chandan Kumar (raukadah) <chkumar@redhat.com>
2025-05-14 09:24:25 +05:30
Douglas Viroel
17d1cf535a Deprecated Noisy Neighbor strategy
Noisy neighbor strategy is a proof of concept strategy that was
built based on LLC metric, which is not available in Nova since
Victoria release[1].
This patch marks this strategy as deprecated, to be removed in
future releases.

[1] https://docs.openstack.org/releasenotes/nova/victoria.html#relnotes-22-0-0-unmaintained-victoria-upgrade-notes

Change-Id: I940b88555007312c76a86706bd44a38fbcf7701e
2025-05-12 15:44:39 -03:00
Chandan Kumar (raukadah)
278cb7e98c [host_maintenance] Pass des hostname in add_action solution
Currently we pass the src_node and des_node uuids when we try to run a
migrate action.

In the watcher-applier log, migration fails with the following exception
```
Nova client exception occurred while live migrating instance <uuid>Exception: Compute host <uuid> could not be found
```
Based on 57f55190ff/watcher/applier/actions/migration.py (L122)
and
57f55190ff/watcher/common/nova_helper.py (L322),
live_migrate_instance expects the destination hostname, not the uuid.

This cr replaces the dest_node uuid with the hostname.

Closes-Bug: #2109309

Change-Id: I3911ff24ea612f69dddae5eab15fabb4891f938d
Signed-off-by: Chandan Kumar (raukadah) <chkumar@redhat.com>
2025-04-25 15:51:20 +05:30
Alfredo Moralejo
c7158b08d1 Aggregate by fqdn label instead instance in host cpu metrics
While in a regular case a specific metric for a specific host will be
provided by a single instance (exporter), so aggregating by label and by
instance should give the same result, it is more correct to aggregate by the same
label that we use to filter the metrics.

This is a follow-up to https://review.opendev.org/c/openstack/watcher/+/944795

Related-Bug: #2103451

Change-Id: Ia61f051547ddc51e0d1ccd5a56485ab49ce84c2e
2025-04-02 15:36:17 +02:00
Alfredo Moralejo
a65e7e9b59 Query by fqdn_label instead of instance for host metrics
Currently we are using the `instance` label to query prometheus for host
metrics. This label is assigned to the url of each endpoint being
scraped.

While this works fine in one-exporter-per-compute cases, as the driver
maps the fqdn_label value to the `instance` label value, it fails
when there is more than one target with the same value for the fqdn
label. This is a valid case: it should be possible to query by fqdn without
caring which exporter on the host provides the metric.

This patch changes the queries we use for hosts to be based on the
fqdn_label instead of the instance one. To implement it, we are also
simplifying the way we check that the metric exists for the host by converting
prometheus_fqdn_instance_map into a prometheus_fqdn_labels set
which stores the fqdns found in prometheus.
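
A minimal sketch of why a set of fqdn labels is enough for the existence check. The input shape (a list of targets, each with a `labels` dict), the `fqdn` label name, and the function name are assumptions for illustration; only the idea of a `prometheus_fqdn_labels` set comes from the commit message.

```python
def collect_fqdn_labels(targets, fqdn_label="fqdn"):
    """Return the set of fqdn label values found across scrape targets."""
    return {
        target["labels"][fqdn_label]
        for target in targets
        if fqdn_label in target.get("labels", {})
    }


# Two exporters on compute-0 share one fqdn; the set keeps a single entry.
targets = [
    {"labels": {"instance": "10.0.0.1:9100", "fqdn": "compute-0.example.org"}},
    {"labels": {"instance": "10.0.0.1:9200", "fqdn": "compute-0.example.org"}},
    {"labels": {"instance": "10.0.0.2:9100", "fqdn": "compute-1.example.org"}},
    {"labels": {"instance": "10.0.0.3:9100"}},  # no fqdn label: skipped
]
prometheus_fqdn_labels = collect_fqdn_labels(targets)
print("compute-0.example.org" in prometheus_fqdn_labels)  # prints True
```

With a map from fqdn to `instance`, the second exporter on compute-0 would have overwritten the first; a set only records that the host is known.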

Closes-Bug: #2103451
Change-Id: I3bcc317441b73da5c876e53edd4622370c6d575e
2025-03-19 15:25:24 +01:00
Zuul
f2ee231f14 Merge "pre-commit: Integrate bandit" 2025-03-11 09:58:29 +00:00
Takashi Kajinami
df3d67a4ed Replace deprecated abc.abstractproperty
It was deprecated in Python 3.3 [1].

[1] https://docs.python.org/3.13/whatsnew/3.3.html#abc
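
The usual replacement for `abc.abstractproperty` is to stack `property` on top of `abc.abstractmethod`. The class names below are hypothetical and only illustrate the pattern.

```python
import abc


class BaseExample(abc.ABC):
    """Hypothetical base class with an abstract read-only property."""

    @property
    @abc.abstractmethod
    def name(self):
        """Subclasses must provide a name."""


class ConcreteExample(BaseExample):
    @property
    def name(self):
        return "concrete"


print(ConcreteExample().name)  # prints concrete

try:
    BaseExample()  # abstract property not overridden
except TypeError as exc:
    print("cannot instantiate:", type(exc).__name__)
```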

Change-Id: Ibd98cb93f697a6da6a6bc5a5030640a262c7a66b
2025-03-02 15:36:48 +09:00
Zuul
383751904c Merge "Further database refactoring" 2025-02-27 11:52:59 +00:00
Takashi Kajinami
977f014cba Deprecate Monasca data source
The Monasca project was marked inactive during 2023.1. Although we have
seen multiple people showing interest in keeping the project, we haven't
seen any real progress.

Because the project is likely to be retired soon, let's deprecate the feature
dependent on Monasca so that we can remove it in a future release.

Change-Id: Ifd64f5ba59bbac238ff62302ec36a3e36954d6d0
2025-02-16 18:45:31 +09:00
James Page
753c44b0c4 Further database refactoring
More refactoring of the SQLAlchemy database layer to improve
compatibility with eventlet on newer Pythons.

Inspired by 0ce2c41404

Related-Bug: 2067815
Change-Id: Ib5e9aa288232cc1b766bbf2a8ce2113d5a8e2f7d
2025-02-14 11:42:47 +00:00
Takashi Kajinami
dd0082c343 pre-commit: Integrate bandit
Run bandit check from pre-commit so that the check is executed in pep8
job.

Also remove requirements installed automatically by pre-commit from
test-requirements.

Change-Id: I45af8c47afb262882ebbee74ae52446fed741e26
2025-02-10 22:50:34 +09:00
Zuul
4527f89d8d Merge "Add support for instance metrics to prometheus datasource" 2025-02-03 13:22:28 +00:00
Zuul
e535177bc0 Merge "Remove ceilometer datasource" 2025-01-29 13:22:46 +00:00
Alfredo Moralejo
136e5d927c Add support for instance metrics to prometheus datasource
In order to support vm_workload_consolidation, workload_balance and
workload_stabilization strategies, some instance metrics are required.
This patch is adding support for them.

Implementation is based on a prometheus store populated using sg-core
from ceilometer metrics with Pollster source.

- instance_ram_usage: rely on ceilometer_memory_usage metrics created from
  ceilometer memory.usage meter.
- instance_ram_allocated: rely on the memory value provided by the
  inventory created from nova and placement APIs.
- instance_cpu_usage: rely on ceilometer_cpu metric created from
  ceilometer cpu meter. A max value of 100 is set in the query.
- instance_root_disk_size: rely on the `disk` value provided by the
  inventory created from nova and placement APIs.

A new parameter `instance_uuid_label` has been added to the prometheus
datasource configuration to identify the label used to store the value of the
OpenStack instance uuid for each instance metric in prometheus. The default
value is `resource`.

Change-Id: I2f2b56aa002014e511a5e48398ef1da43fc4f5e2
2025-01-23 13:23:04 +01:00
m
3f26dc47f2 Add prometheus data source for watcher decision engine
This adds a new data source for the Watcher decision engine that
implements the watcher.decision_engine.datasources.DataSourceBase.

The related spec was merged at [1].

Implements: blueprint prometheus-datasource

[1] https://review.opendev.org/c/openstack/watcher-specs/+/933300

Change-Id: I6a70c4acc70a864c418cf347f5f6951cb92ec906
2025-01-10 15:20:37 +02:00
Takashi Kajinami
da23fdc621 Remove ceilometer datasource
This datasource requires the Ceilometer API which was already removed some
years ago. The implementation should have been removed when dependency
on ceilometerclient was removed by [1].

Also remove some job definitions which are not actually used.

[1] 01d74d0a87

Change-Id: I29c3865dc1207f1bbbb266e4217cf8888afebfb6
2024-12-16 23:51:27 +09:00
Sean Mooney
5fadd0de57 [pre-commit] Fix execute and shebang lines
This commit removes the execute bit from several files
and removes the shebang lines from the devstack plugin.

While the devstack plugin is written in bash, it is not an executable
script. The devstack plugin is sourced by devstack as needed,
as such it is not executed in a subshell and the #!/bin/bash
lines are not used even when present.

Change-Id: I82ca22b7a47bf267fe6cf11f3e3519510108c146
2024-11-07 20:12:59 +00:00
Sean Mooney
5f79ab87c7 [pre-commit] fix typos and configure codespell
This change enables codespell in pre-commit and
fixes the existing typos.

A followup commit will enable this in tox and ci.

Change-Id: I0a11bcd5a88247a48d3437525fc8a3cb3cdd4e58
2024-11-07 19:50:21 +00:00
Sean Mooney
9d8b990fd1 [pre-commit] Add initial pre-commit config
This change adds configuration for the pre-commit tool,
follow-up changes will address the remaining issues in a phased
approach to make the reviews simpler.

This is based on the pre-commit config used in nova
with some additional hooks.

Follow-up changes will address the FIXME comments
related to sphinx-lint and codespell, as well as update tox
to enforce these checks in ci.

Change-Id: I87681a19f7fa88366c2b0d310c8b3153aa6a137b
2024-10-22 20:12:53 +01:00
Takashi Natsume
61a7dd85ca Replace deprecated datetime.utcnow()
datetime.utcnow() is deprecated as of Python 3.12.
Replace datetime.utcnow() with oslo_utils.timeutils.utcnow().
This bumps oslo.utils to 7.0.0.

Change-Id: Icccbb0549add686a744a72b354932471cbf91c92
Signed-off-by: Takashi Natsume <takanattie@gmail.com>
2024-10-02 22:24:47 +09:00
Takashi Kajinami
566a830f64 Bump hacking
hacking 3.0.x is quite old. Bump it to the current latest version.

Change-Id: I8d87fed6afe5988678c64090af261266d1ca20e6
2024-09-22 23:54:36 +09:00
Lucian Petrut
c95ce4ec17 Add MAAS support
At the moment, Watcher can use a single bare metal provisioning
service: OpenStack Ironic.

We're now adding support for Canonical's MAAS service [1], which
is commonly used along with Juju [2] to deploy OpenStack.

In order to do so, we're building a metal client abstraction, with
concrete implementations for Ironic and MAAS. We'll pick the MAAS
client if the MAAS url is provided, otherwise defaulting to Ironic.
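
The abstraction and the selection rule described above might look roughly like this. Every class, method, and parameter name is hypothetical, and the clients are stubs that do not talk to Ironic or MAAS.

```python
import abc


class BaseMetalClient(abc.ABC):
    """Hypothetical common interface for bare metal services."""

    @abc.abstractmethod
    def list_nodes(self):
        """Return the bare metal nodes known to the service."""


class IronicClientStub(BaseMetalClient):
    def list_nodes(self):
        return []  # a real client would call the Ironic API here


class MaasClientStub(BaseMetalClient):
    def __init__(self, url):
        self.url = url

    def list_nodes(self):
        return []  # a real client would call the MAAS API here


def get_metal_client(maas_url=None):
    """Pick MAAS when a MAAS url is configured, otherwise default to Ironic."""
    if maas_url:
        return MaasClientStub(maas_url)
    return IronicClientStub()


print(type(get_metal_client()).__name__)
print(type(get_metal_client("http://maas.example.org/MAAS")).__name__)
```

Callers depend only on the base interface, so they need no changes when a different backend is selected.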

For now, we aren't updating the baremetal model collector since it
doesn't seem to be used by any of the existing Watcher strategy
implementations.

[1] https://maas.io/docs
[2] https://juju.is/docs

Implements: blueprint maas-support

Change-Id: I6861995598f6c542fa9c006131f10203f358e0a6
2023-12-11 10:21:33 +00:00
Lucian Petrut
424e9a76af vm workload consolidation: use actual host metrics
The "vm workload consolidation" strategy is summing up instance
usage in order to estimate host usage.

The problem is that some infrastructure services (e.g. OVS or Ceph
clients) may also use a significant amount of resources, which
would be ignored. This can impact Watcher's ability to detect
overloaded nodes and correctly rebalance the workload.

This commit will use the host metrics, if available. The proposed
implementation uses the maximum value between the host metric
and the sum of the instance metrics.

Note that we're holding a dict of host metric deltas in order to
account for planned migrations.
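
A minimal sketch of the estimate described above, assuming the planned-migration delta is added to the host metric before taking the maximum. The function name, its parameters, and the way the delta is applied are assumptions, not the strategy's actual code.

```python
def estimate_host_usage(instance_usages, host_metric=None, planned_delta=0):
    """Estimate host usage from instance metrics and an optional host metric.

    instance_usages: per-instance usage values for instances on the host.
    host_metric: usage reported for the host itself, or None if unavailable.
    planned_delta: hypothetical adjustment for migrations already planned
        (negative when instances are scheduled to leave the host).
    """
    summed = sum(instance_usages)
    if host_metric is None:
        # No host metric available: fall back to the instance sum.
        return summed
    # The host metric also covers infrastructure services such as OVS or
    # Ceph clients, so it can exceed the sum of the instance metrics.
    return max(host_metric + planned_delta, summed)


print(estimate_host_usage([10, 20]))                   # prints 30
print(estimate_host_usage([10, 20], host_metric=50))   # prints 50
print(estimate_host_usage([10, 20], host_metric=50, planned_delta=-30))  # prints 30
```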

Change-Id: I82f474ee613f6c9a7c0a9d24a05cba41d2f68edb
2023-10-27 21:54:42 +03:00
Zuul
40e93407c7 Merge "Handle deprecated "cpu_util" metric" 2023-10-27 09:47:38 +00:00