watcher

Author	SHA1	Message	Date
Zuul	848cde3606	Merge "Rename confusing query timeout options"	2025-08-28 09:26:40 +00:00
Takashi Kajinami	7106a12251	Rename confusing query timeout options These do not actually define timeout but interval. Rename the options to reflect what they actually define. The existing deprecated options in the [gnocchi_client] are also removed, because these have been kept for 6 years. In addition, fix inconsistent name (query vs call). Change-Id: Ib29115746a25b45bdff1c3da8df9d7167c2db662 Signed-off-by: Takashi Kajinami <kajinamit@oss.nttdata.com>	2025-08-27 23:22:45 +09:00
Douglas Viroel	03c09825f7	Extend compute model attributes This patch extends compute model attributes by adding new fields to the Instance element. Values are populated by the nova collector, using the same nova list call, but this requires a more recent compute API microversion. A new config option was added to allow users to enable or disable the extended attributes, and it is disabled by default. Configure prometheus-based jobs to run on a newer version of the nova API (2.96) and enable the extended attributes collection. Implements: bp/extend-compute-model-attributes Assisted-By: Cursor (claude-4-sonnet) Change-Id: Ibf31105d780dce510a59fc74241fa04e28529ade Signed-off-by: Douglas Viroel <viroel@gmail.com>	2025-08-26 11:35:18 -03:00
Ronelle Landy	457819072f	Update Overload standard deviation doc Bug #2113862 details a number of suggested corrections and additions to the Workload Stabilization doc. This patch adds those suggested changes. Closes-Bug: #2113862 Assisted-By: Cursor (claude-3.5-sonnet) Change-Id: I4131a304c064d2ea397b2447025c7edf69a56e2a Signed-off-by: Ronelle Landy <rlandy@redhat.com>	2025-08-21 11:09:46 -04:00
Zuul	616c8f4cc4	Merge "Add options to disable migration in host maintenance"	2025-08-21 14:11:22 +00:00
Quang Ngo	cc26b3b334	Add options to disable migration in host maintenance This change enhances the Host Maintenance strategy by introducing two new input parameters: `disable_live_migration` and `disable_cold_migration`. These parameters allow cloud administrators to control whether live or cold migration should be considered during host maintenance operations. If `disable_live_migration` is set, active instances will be cold migrated if `disable_cold_migration` is not set, otherwise active instances will be stopped. If `disable_cold_migration` is set, inactive instances will not be cold migrated. If both are set, only stop actions will be performed on instances. The strategy logic and action plan generation have been updated to reflect these behaviors. A new "stop" action is introduced and registered, and the weight planner is updated to handle new action. Documentation for the Host Maintenance strategy is updated to describe the new parameters and their effects. Test Plan: - Unit tests for HostMaintenance strategy with new parameters - Integration tests for action plan generation with stop action This implements the specification: Spec: https://review.opendev.org/c/openstack/watcher-specs/+/943873 Change-Id: I201b8e5c52e1bc1a74f3886a0e301e3c0fa5d351 Signed-off-by: Quang Ngo <quang.ngo@canonical.com>	2025-08-20 22:32:33 +10:00
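The interaction between the two parameters in the commit above can be sketched as a small decision function. This is only an illustration of the rules stated in the commit message, not the code in Watcher: the function name `plan_action` and the action strings are hypothetical, and returning `None` for an inactive instance when cold migration is disabled is an assumption. Only `disable_live_migration` and `disable_cold_migration` come from the commit.

```python
from typing import Optional


def plan_action(instance_is_active: bool,
                disable_live_migration: bool = False,
                disable_cold_migration: bool = False) -> Optional[str]:
    """Pick an action for one instance on the host under maintenance.

    Hypothetical helper: the names and return values are illustrative only.
    """
    if instance_is_active:
        if not disable_live_migration:
            return "live_migrate"
        # Live migration is disabled: fall back to cold migration if allowed.
        if not disable_cold_migration:
            return "cold_migrate"
        # Both migration types are disabled: the instance is stopped.
        return "stop"
    # Inactive instances are only ever cold migrated.
    if not disable_cold_migration:
        return "cold_migrate"
    # Assumption: nothing is planned for an inactive instance in this case.
    return None
```

For example, `plan_action(True, disable_live_migration=True)` yields a cold migration, while setting both flags yields a stop action for an active instance.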
Zuul	90f0c2264c	Merge "use cinder migrate for swap volume"	2025-08-18 20:32:42 +00:00
Sean Mooney	3742e0a79c	use cinder migrate for swap volume This change removes Watcher's in-tree functionality for swapping instance volumes and defines swap as an alias of cinder volume migrate. The Watcher native implementation was missing error handling, which could lead to irretrievable data loss. The removed code also forged project user credentials to perform admin requests as if they were made by a member of a project. This was unsafe and poses a security risk due to how it was implemented. This code has been removed without replacement. While some effort has been made to allow existing audits that were defined to work, any reduction of functionality as a result of this security hardening is intentional. Closes-Bug: #2112187 Change-Id: Ic3b6bfd164e272d70fe86d7b182478dd962f8ac0 Signed-off-by: Sean Mooney <work@seanmooney.info>	2025-08-18 16:35:38 +00:00
Jaromir Wysoglad	8309d9848a	Add Aetos datasource Implement the spec for multi-tenancy support for metrics. This adds a new 'Aetos' datasource very similar to the current Prometheus datasource. Because of that, the original PrometheusHelper class was split into two classes and the base class is used for PrometheusHelper and for AetosHelper. Except for the split, there is one more change to the original PrometheusHelper class code, which is the addition and use of the _get_fqdn_label() and _get_instance_uuid_label() methods. As part of the change, I refactored the current prometheus datasource unit tests. Most of them are now used to test the PrometheusBase class with minimal changes. Changes I've made to the original tests: - the ones that can be used to test the base class are moved into the TestPrometheusBase class - the _setup_prometheus_client, _get_instance_uuid_label and _get_fqdn_label functions are mocked in the base class tests. Their concrete implementations are tested in each datasource's tests separately. - a self._create_helper() is used to instantiate the helper class with correct mocking. - all config value modification in the original tests got moved out and instead of modifying the config values, the _get_* methods are mocked to return the wanted values - to keep similar test coverage, config retrieval is tested for each concrete class by testing the _get_* methods. New watcher-aetos-integration and watcher-aetos-integration-realdata zuul jobs are added to test the new datasource. These use the same set of tempest tests as the current watcher-prometheus-integration jobs. The only difference is the environment setup and the Watcher config, so that the job deploys Aetos and Watcher uses it instead of accessing Prometheus directly. At first this was generated by asking cursor to implement the linked spec with some additional prompts for some smaller changes. Afterwards I manually went through the code doing some cleanups, ensuring it complies with PEP8 and hacking and so on. Later on I manually adjusted the code to use the latest observabilityclient changes. The zuul job was also mostly generated by cursor. Implements: https://blueprints.launchpad.net/watcher/+spec/prometheus-multitenancy-support Generated-By: Cursor with claude-4-sonnet model Change-Id: I72c2171f72819bbde6c9cbbf565ee895e5d2bd53 Signed-off-by: Jaromir Wysoglad <jwysogla@redhat.com>	2025-08-14 02:27:24 -04:00
Zuul	9925fd2cc9	Merge "Replace dateutils usage with datetime and oslo.utils"	2025-08-07 20:46:25 +00:00
Douglas Viroel	f879b10b05	Extend decision engine to support threading mode With the events of eventlet removal, Watcher will need to be adapted to support both modes, eventlet and threading, for a couple of releases before removing all eventlet code. This patch adds methods and classes that allow decision engine modules to create futurist thread pools instead of green thread pools, based on an environment variable that can be enabled by service. It moves the continuous audit handler instance to the decision engine service, so it can be started together with the main decision engine service. Adds an environment variable that allows the user to disable eventlet monkey patching and to use the oslo.service threading backend. Change-Id: I8a8be0a7cebdc44005fd77ec960543828c7da318 Signed-off-by: Douglas Viroel <viroel@gmail.com>	2025-08-05 16:45:48 -03:00
Chandan Kumar (raukadah)	95d975f339	Replace dateutils usage with datetime and oslo.utils This cr fixes: * Replaced ``dateutil.tz.tzlocal()`` and ``dateutil.tz.tzutc()`` with ``datetime.timezone`` built-in classes in audit controllers and continuous audit scheduling. * Replaced ``dateutil.parser.parse()`` with ``oslo_utils.timeutils.parse_isotime()`` in the zone migration strategy for parsing datetime strings. Closes-Bug: #2118404 Change-Id: I6d8a345fa4339a688769b147413dcdf3016bf4a0 Signed-off-by: Chandan Kumar (raukadah) <chkumar@redhat.com>	2025-08-05 23:09:50 +05:30
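The kind of substitution described in the commit above can be illustrated with the standard library alone. This is a sketch, not the patch itself: `datetime.fromisoformat()` is used here only as a stdlib stand-in for `oslo_utils.timeutils.parse_isotime()` so the example has no external dependency, and the variable names are illustrative.

```python
from datetime import datetime, timezone

# Counterpart of dateutil.tz.tzutc(): the built-in UTC tzinfo object.
now_utc = datetime.now(timezone.utc)

# Counterpart of dateutil.tz.tzlocal(): astimezone() with no argument
# attaches the system's current local offset as a datetime.timezone.
now_local = datetime.now().astimezone()

# Stand-in for parse_isotime(): parse an ISO 8601 string with an offset.
parsed = datetime.fromisoformat("2025-08-05T23:09:50+05:30")

# Normalise to UTC before comparing timestamps from different zones.
parsed_utc = parsed.astimezone(timezone.utc)
```

Note that `astimezone()` yields a fixed-offset timezone, so it does not track later daylight-saving transitions the way a named zone would.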
Douglas Viroel	081cd5fae9	Merge decision engine services into a single one The decision engine process was built based on 2 services: a service that handles rpc requests and a scheduler to trigger watcher periodic tasks. With the new version of oslo.service, a new threading backend was added, based on cotyledon service manager, which starts a new process for each service that it manages. These two services can't run in different processes since they need access to a shared in-memory representation of the cluster (cluster data models). This patch proposes creating a Decision Engine Service which includes everything in a single main service. Change-Id: I335a97ca14b6e023fef055978a56aefebf22d433 Signed-off-by: Douglas Viroel <viroel@gmail.com>	2025-07-08 09:55:32 -03:00
Zuul	16131e5cac	Merge "Update Workload Balance strategy documentation"	2025-06-27 13:36:50 +00:00
Ronelle Landy	bfbd136f4b	Update Host Maintenance strategy documentation Add clarifications to the documentation to reflect the actual strategy usage, including: - updating parameter descriptions - extending the 'How to Use' section Closes-Bug: #2111810 Change-Id: Ifd2876056cd8819c50658fb9f213246dc1546d42	2025-06-23 06:36:42 -04:00
Zuul	fe8d8c8839	Merge "Use KiB as unit for host_ram_usage when using prometheus datasource"	2025-06-20 16:19:50 +00:00
Zuul	b8e0e6b01c	Merge "Aggregate by label when querying instance cpu usage in prometheus"	2025-06-19 14:46:07 +00:00
Alfredo Moralejo	6ea362da0b	Use KiB as unit for host_ram_usage when using prometheus datasource The prometheus datasource was reporting host_ram_usage in MiB as described in the docstring for the base datasource interface definition [1]. However, the gnocchi datasource is reporting it in KiB following ceilometer metric `hardware.memory.used` [2] and the strategies using that metric expect it to be in KiB, so the best approach is to change the unit in the prometheus datasource and update the docstring to avoid misunderstandings in future. So, this patch is fixing the prometheus datasource to return host_ram_usage in KiB instead of MiB. Additionally, it is adding more unit tests for the check_threshold method so that it covers the memory based strategy execution, validates the calculated standard deviation and adds the cases where it is below the threshold. [1] `15981117ee/watcher/decision_engine/datasources/base.py (L177-L183)` [2] https://docs.openstack.org/ceilometer/train/admin/telemetry-measurements.html#snmp-based-meters Closes-Bug: #2113776 Change-Id: Idc060d1e709c0265c64ada16062c3a206c6b04fa	2025-06-19 16:25:27 +02:00
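The size of the unit mismatch fixed by the commit above is easy to see with a small conversion sketch. The helper names below are hypothetical and are not taken from Watcher; they only illustrate why a value reported in MiB is off by a factor of 1024 when a strategy expects KiB.

```python
KIB_PER_MIB = 1024
BYTES_PER_KIB = 1024


def mib_to_kib(value_mib: float) -> float:
    """Convert a memory value from MiB to KiB (hypothetical helper)."""
    return value_mib * KIB_PER_MIB


def bytes_to_kib(value_bytes: float) -> float:
    """Convert a raw byte count, as an exporter might report it, to KiB."""
    return value_bytes / BYTES_PER_KIB


# A host using 2048 MiB of RAM is 2097152 KiB; treating the MiB figure
# as KiB would under-report usage by a factor of 1024.
usage_kib = mib_to_kib(2048)
```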
Zuul	0f78386462	Merge "Add debug message to report calculated metric for workload_balance"	2025-06-18 12:26:24 +00:00
Alfredo Moralejo	1529e3fadd	Add debug message to report calculated metric for workload_balance The workload_balance strategy calculates host metrics based on the instance metrics and those are the ones used to compare with the threshold. Currently, the strategy does not report the calculated values, which sometimes makes troubleshooting difficult. This patch is adding a debug message to log those values. This patch is also adding a new unit test for filter_destination_hosts based on ram instead of cpu and adding assertions for the new debug messages. To implement the new test properly, I had to slightly modify the ram usage fixtures used for the workload_balance tests. Change-Id: Ief5e167afcf346ff53471f26adc70795c4b69f68	2025-06-17 19:11:48 +02:00
Alfredo Moralejo	3860de0b1e	Aggregate by label when querying instance cpu usage in prometheus Currently, when the prometheus datasource queries the ceilometer_cpu metric for instance cpu usage, it aggregates by instance and filters by the label containing the instance uuid. While this works fine in real scenarios, where a single metric is provided in a single instance, in some cases, such as the CI jobs where metrics are directly injected, it leads to incorrect metric calculation. We applied a similar fix for the host metrics in [1] but we did not implement it for instance cpu. I am also converting the query formatting to the dict format to improve understandability. [1] https://review.opendev.org/c/openstack/watcher/+/946049 Closes-Bug: #2113936 Change-Id: I3038dec20612162c411fc77446e86a47e0354423	2025-06-11 14:49:56 +02:00
Chandan Kumar (raukadah)	15981117ee	Drop unused method get_disabled_compute_nodes_with_reason get_disabled_compute_nodes_with_reason defined in host_maintenance strategy is not used anywhere. This cr drops the unused method. Change-Id: I07c0d0b63e00d476511aa8b03c0feab8ec4db95b Signed-off-by: Chandan Kumar (raukadah) <chkumar@redhat.com>	2025-06-09 10:51:45 +05:30
Ronelle Landy	f42cb8557b	Update Workload Balance strategy documentation Adds additional parameter and usage explanations and combined example. Closes-Bug: #2111848 Change-Id: Id0de4d56fa7083388ad82c61596e7484431d465b	2025-06-06 15:51:23 -04:00
Zuul	26e36e1620	Merge "Handle missing dst_node parameter in zone_migration"	2025-05-20 17:14:29 +00:00
Zuul	3585e0cc3e	Merge "Drop code from Host maintenance strategy migrating instance to disabled hosts"	2025-05-16 18:18:26 +00:00
jgilaber	c6302edeca	Handle missing dst_node parameter in zone_migration For compute nodes, nova works fine if a destination node is not specified, so this change makes sure we're not passing None when the user does not set one to avoid an error. Partial-Bug: 2108988 Change-Id: Ida1f18b97697c041819e29f935aa5e232848226a	2025-05-16 13:51:47 +02:00
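The idea in the commit above, omitting the destination rather than passing `None`, can be sketched as follows. The function name and dictionary keys are hypothetical and chosen only for illustration; they are not the parameter names used by the zone_migration strategy.

```python
from typing import Any, Dict, Optional


def build_migrate_params(instance_id: str,
                         dst_node: Optional[str] = None) -> Dict[str, Any]:
    """Build action parameters, leaving the destination out when unset.

    Hypothetical helper: the keys below are illustrative only.
    """
    params: Dict[str, Any] = {
        "resource_id": instance_id,
        "migration_type": "live",
    }
    # Only include the destination when the user actually provided one,
    # so the scheduler is free to choose a host otherwise.
    if dst_node is not None:
        params["destination_node"] = dst_node
    return params
```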
Chandan Kumar (raukadah)	9dea55bd64	Drop code from Host maintenance strategy migrating instance to disabled hosts Currently the host maintenance strategy also migrates instances from the maintenance node to watcher_disabled compute nodes. watcher_disabled compute nodes might be disabled for some other purpose by a different strategy. If host maintenance uses those compute nodes for migration, it might affect customer workloads. The host maintenance strategy should never touch disabled hosts unless the user specifies a disabled host as backup node. This cr drops the logic for using disabled compute nodes for maintenance. Host maintenance is already using the nova scheduler for migrating the instance, and will use the same. If there is no available node, the strategy will fail. Closes-Bug: #2109945 Change-Id: If9795fd06f684eb67d553405cebd8a30887c3997 Signed-off-by: Chandan Kumar (raukadah) <chkumar@redhat.com>	2025-05-14 09:24:25 +05:30
Douglas Viroel	17d1cf535a	Deprecated Noisy Neighbor strategy Noisy neighbor strategy is a proof of concept strategy that was built based on LLC metric, which is not available in Nova since Victoria release[1]. This patch marks this strategy as deprecated, to be removed in future releases. [1] https://docs.openstack.org/releasenotes/nova/victoria.html#relnotes-22-0-0-unmaintained-victoria-upgrade-notes Change-Id: I940b88555007312c76a86706bd44a38fbcf7701e	2025-05-12 15:44:39 -03:00
Chandan Kumar (raukadah)	278cb7e98c	[host_maintenance] Pass des hostname in add_action solution Currently we are passing src_node and des_node uuid when we try to run migrate action. In the watcher-applier log, migration fails with following exception ``` Nova client exception occurred while live migrating instance <uuid>Exception: Compute host <uuid> could not be found ``` Based on `57f55190ff/watcher/applier/actions/migration.py (L122)` and `57f55190ff/watcher/common/nova_helper.py (L322)`, live_migrate_instance expects destination hostname not uuid. This cr replaces dest_node uuid to hostname. Closes-Bug: #2109309 Change-Id: I3911ff24ea612f69dddae5eab15fabb4891f938d Signed-off-by: Chandan Kumar (raukadah) <chkumar@redhat.com>	2025-04-25 15:51:20 +05:30
Alfredo Moralejo	c7158b08d1	Aggregate by fqdn label instead instance in host cpu metrics While in a regular case a specific metric for a specific host will be provided by a single instance (exporter) so aggregating by label and by instance should be the same, it is more correct to aggregate by the same label as the one we use to filter the metrics. This is a follow up of https://review.opendev.org/c/openstack/watcher/+/944795 Related-Bug: #2103451 Change-Id: Ia61f051547ddc51e0d1ccd5a56485ab49ce84c2e	2025-04-02 15:36:17 +02:00
Alfredo Moralejo	a65e7e9b59	Query by fqdn_label instead of instance for host metrics Currently we are using the `instance` label to query prometheus about host metrics. This label is assigned to the url of each endpoint being scraped. While this works fine in one-exporter-per-compute cases as the driver is mapping the fqdn_label value to the `instance` label value, it fails when there is more than one target with the same value for the fqdn label. This is a valid case, to be able to query by fqdn and not care about which exporter in the host is providing the metric. This patch is changing the queries we use for hosts to be based on the fqdn_label instead of the instance one. To implement it, we are also simplifying the way we check that the metric exists for the host by converting prometheus_fqdn_instance_map into a prometheus_fqdn_labels set which stores the list of fqdn found in prometheus. Closes-Bug: #2103451 Change-Id: I3bcc317441b73da5c876e53edd4622370c6d575e	2025-03-19 15:25:24 +01:00
Zuul	f2ee231f14	Merge "pre-commit: Integrate bandit"	2025-03-11 09:58:29 +00:00
Takashi Kajinami	df3d67a4ed	Replace deprecated abc.abstractproperty It was deprecated in Python 3.3 [1]. [1] https://docs.python.org/3.13/whatsnew/3.3.html#abc Change-Id: Ibd98cb93f697a6da6a6bc5a5030640a262c7a66b	2025-03-02 15:36:48 +09:00
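The usual replacement for `abc.abstractproperty`, as recommended by the Python documentation, is to stack `@property` on top of `@abc.abstractmethod`. A minimal sketch follows; the class and attribute names are illustrative and are not taken from Watcher.

```python
import abc


class BaseStrategy(abc.ABC):
    """Illustrative base class, not a Watcher class."""

    # Replaces the deprecated @abc.abstractproperty decorator.
    @property
    @abc.abstractmethod
    def name(self) -> str:
        """Subclasses must provide a name."""


class DummyStrategy(BaseStrategy):
    @property
    def name(self) -> str:
        return "dummy"
```

Instantiating `BaseStrategy` directly raises `TypeError`, while `DummyStrategy().name` returns the overridden value.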
Zuul	383751904c	Merge "Further database refactoring"	2025-02-27 11:52:59 +00:00
Takashi Kajinami	977f014cba	Deprecate Monasca data source The Monasca project was marked inactive during 2023.1. Although we have seen multiple people showing interest to keep the project, we haven't seen any real progress. Because the project is likely retired soon, let's deprecate the feature dependent on Monasca so that we can remove it in a future release. Change-Id: Ifd64f5ba59bbac238ff62302ec36a3e36954d6d0	2025-02-16 18:45:31 +09:00
James Page	753c44b0c4	Further database refactoring More refactoring of the SQLAlchemy database layer to improve compatibility with eventlet on newer Pythons. Inspired by `0ce2c41404` Related-Bug: 2067815 Change-Id: Ib5e9aa288232cc1b766bbf2a8ce2113d5a8e2f7d	2025-02-14 11:42:47 +00:00
Takashi Kajinami	dd0082c343	pre-commit: Integrate bandit Run bandit check from pre-commit so that the check is executed in pep8 job. Also remove requirements installed automatically by pre-commit from test-requirements. Change-Id: I45af8c47afb262882ebbee74ae52446fed741e26	2025-02-10 22:50:34 +09:00
Zuul	4527f89d8d	Merge "Add support for instance metrics to prometheus datasource"	2025-02-03 13:22:28 +00:00
Zuul	e535177bc0	Merge "Remove ceilometer datasource"	2025-01-29 13:22:46 +00:00
Alfredo Moralejo	136e5d927c	Add support for instance metrics to prometheus datasource In order to support the vm_workload_consolidation, workload_balance and workload_stabilization strategies, some instance metrics are required. This patch is adding support for them. Implementation is based on a prometheus store populated using sg-core from ceilometer metrics with Pollster source. - instance_ram_usage: rely on ceilometer_memory_usage metrics created from ceilometer memory.usage meter. - instance_ram_allocated: rely on the memory value provided by the inventory created from nova and placement APIs. - instance_cpu_usage: rely on ceilometer_cpu metric created from ceilometer cpu meter. A max value of 100 is set in the query. - instance_root_disk_size: rely on the `disk` value provided by the inventory created from nova and placement APIs. A new parameter `instance_uuid_label` has been added to the prometheus datasource configuration to identify the label used to store the value of the OpenStack instance uuid for each instance metric in prometheus. Default value is `resource`. Change-Id: I2f2b56aa002014e511a5e48398ef1da43fc4f5e2	2025-01-23 13:23:04 +01:00
m	3f26dc47f2	Add prometheus data source for watcher decision engine This adds a new data source for the Watcher decision engine that implements the watcher.decision_engine.datasources.DataSourceBase. related spec was merged at [1]. Implements: blueprint prometheus-datasource [1] https://review.opendev.org/c/openstack/watcher-specs/+/933300 Change-Id: I6a70c4acc70a864c418cf347f5f6951cb92ec906	2025-01-10 15:20:37 +02:00
Takashi Kajinami	da23fdc621	Remove ceilometer datasource This datasource requires Ceilometer API which was already removed some years ago. The implementation should have been removed when dependency on ceilometerclient was removed by [1]. Also remove some job definitions which are not actually used. [1] `01d74d0a87` Change-Id: I29c3865dc1207f1bbbb266e4217cf8888afebfb6	2024-12-16 23:51:27 +09:00
Sean Mooney	5fadd0de57	[pre-commit] Fix execute and shebang lines This commit removes the execute bit from several files and removes the shebang lines from the devstack plugin. While the devstack plugin is written in bash, it is not an executable script. The devstack plugin is sourced by devstack as needed, as such it is not executed in a subshell and the #!/bin/bash lines are not used even when present. Change-Id: I82ca22b7a47bf267fe6cf11f3e3519510108c146	2024-11-07 20:12:59 +00:00
Sean Mooney	5f79ab87c7	[pre-commit] fix typos and configure codespell This change enables codespell in pre-commit and fixes the existing typos. A followup commit will enable this in tox and ci. Change-Id: I0a11bcd5a88247a48d3437525fc8a3cb3cdd4e58	2024-11-07 19:50:21 +00:00
Sean Mooney	9d8b990fd1	[pre-commit] Add initial pre-commit config This change adds configuration for the pre-commit tool, follow-up changes will address the remaining issues in a phased approach to make the reviews simpler. This is based on the pre-commit config used in nova with some additional hooks. Follow-up changes will address the FIXME comments related to sphinx-lint and codespell, as well as update tox to enforce these checks in ci. Change-Id: I87681a19f7fa88366c2b0d310c8b3153aa6a137b	2024-10-22 20:12:53 +01:00
Takashi Natsume	61a7dd85ca	Replace deprecated datetime.utcnow() The datetime.utcnow() is deprecated in Python 3.12. Replace datetime.utcnow() with oslo_utils.timeutils.utcnow(). This bumps oslo.utils to 7.0.0. Change-Id: Icccbb0549add686a744a72b354932471cbf91c92 Signed-off-by: Takashi Natsume <takanattie@gmail.com>	2024-10-02 22:24:47 +09:00
Takashi Kajinami	566a830f64	Bump hacking hacking 3.0.x is quite old. Bump it to the current latest version. Change-Id: I8d87fed6afe5988678c64090af261266d1ca20e6	2024-09-22 23:54:36 +09:00
Lucian Petrut	c95ce4ec17	Add MAAS support At the moment, Watcher can use a single bare metal provisioning service: Openstack Ironic. We're now adding support for Canonical's MAAS service [1], which is commonly used along with Juju [2] to deploy Openstack. In order to do so, we're building a metal client abstraction, with concrete implementations for Ironic and MAAS. We'll pick the MAAS client if the MAAS url is provided, otherwise defaulting to Ironic. For now, we aren't updating the baremetal model collector since it doesn't seem to be used by any of the existing Watcher strategy implementations. [1] https://maas.io/docs [2] https://juju.is/docs Implements: blueprint maas-support Change-Id: I6861995598f6c542fa9c006131f10203f358e0a6	2023-12-11 10:21:33 +00:00
Lucian Petrut	424e9a76af	vm workload consolidation: use actual host metrics The "vm workload consolidation" strategy is summing up instance usage in order to estimate host usage. The problem is that some infrastructure services (e.g. OVS or Ceph clients) may also use a significant amount of resources, which would be ignored. This can impact Watcher's ability to detect overloaded nodes and correctly rebalance the workload. This commit will use the host metrics, if available. The proposed implementation uses the maximum value between the host metric and the sum of the instance metrics. Note that we're holding a dict of host metric deltas in order to account for planned migrations. Change-Id: I82f474ee613f6c9a7c0a9d24a05cba41d2f68edb	2023-10-27 21:54:42 +03:00
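The estimate described in the commit above, taking the maximum of the host metric and the sum of the instance metrics, can be sketched as below. This is an illustration under stated assumptions rather than the strategy's code: the function name is hypothetical, and applying the planned-migration delta to the host metric before taking the maximum is an assumption about how the deltas are used.

```python
from typing import Iterable, Optional


def estimate_host_usage(instance_usages: Iterable[float],
                        host_metric: Optional[float] = None,
                        planned_delta: float = 0.0) -> float:
    """Estimate host usage from instance and (optional) host metrics.

    Hypothetical helper; planned_delta models usage added or removed by
    migrations that are planned but not yet reflected in the host metric.
    """
    instance_sum = sum(instance_usages)
    if host_metric is None:
        # No host metric available: fall back to the instance sum alone.
        return instance_sum
    # Host metric also covers infrastructure services (e.g. OVS, Ceph
    # clients), so take whichever estimate is larger.
    return max(host_metric + planned_delta, instance_sum)
```

With instance usages of 10 and 20 and a host metric of 45, the estimate is 45, reflecting the extra load that the instance sum alone would miss.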
Zuul	40e93407c7	Merge "Handle deprecated "cpu_util" metric"	2023-10-27 09:47:38 +00:00


710 Commits