Fixed action status_message update restrictions to allow updates when
action is already in SKIPPED state. Previously, users could only update
the status_message when initially transitioning to SKIPPED state.
Changes include:
- Modified validation logic to allow status_message updates for SKIPPED actions
- Changed exception type from PatchError to Conflict for better semantics
- Added comprehensive test coverage for the new behavior
- Updated API documentation and samples
- Added release note documenting the fix
This enables administrators to fix typos, provide more detailed
explanations, or expand on reasons in action status messages after
the action has been skipped.
Generated-By: claude-code
Closes-Bug: #2121601
Change-Id: I64def708389a8ecd32080fba1638a4499ead349d
Signed-off-by: Sean Mooney <work@seanmooney.info>
These do not actually define timeout but interval. Rename the options
to reflect what they actually define. The existing deprecated options
in the [gnocchi_client] are also removed, because these have been kept
for 6 years.
In addition, fix inconsistent name (query vs call).
Change-Id: Ib29115746a25b45bdff1c3da8df9d7167c2db662
Signed-off-by: Takashi Kajinami <kajinamit@oss.nttdata.com>
This patch extends compute model attributes by
adding new fields to Instance element. Values are
populated by nova the collector, using the same
nova list call, but requires a more recent compute
API microversion.
A new config option was added to allow users to
enable or disable the extended attributes and it is
disable by default.
Configure prometheus-based jobs to run on newer version
of nova api (2.96) and enables the extended attributes
collection.
Implements: bp/extend-compute-model-attributes
Assisted-By: Cursor (claude-4-sonnet)
Change-Id: Ibf31105d780dce510a59fc74241fa04e28529ade
Signed-off-by: Douglas Viroel <viroel@gmail.com>
Bug #2113862 details a number of suggested
corrections and additions to the Workload
Stabilization doc. This patch adds those
suggested changes.
Closes-Bug: #2113862
Assisted-By: Cursor (claude-3.5-sonnet)
Change-Id: I4131a304c064d2ea397b2447025c7edf69a56e2a
Signed-off-by: Ronelle Landy <rlandy@redhat.com>
This change enhances the Host Maintenance strategy by introducing
two new input parameters: `disable_live_migration` and
`disable_cold_migration`. These parameters allow cloud
administrators to control whether live or cold migration should be
considered during host maintenance operations.
If `disable_live_migration` is set, active instances will be cold
migrated if `disable_cold_migration` is not set, otherwise
active instances will be stopped. If `disable_cold_migration` is set,
inactive instances will not be cold migrated.
If both are set, only stop actions will be performed on instances.
The strategy logic and action plan generation have been updated to
reflect these behaviors. A new "stop" action is introduced and
registered, and the weight planner is updated to handle new action.
Documentation for the Host Maintenance strategy is updated to
describe the new parameters and their effects.
Test Plan:
- Unit tests for HostMaintenance strategy with new parameters
- Integration tests for action plan generation with stop action
This implements the specification:
Spec: https://review.opendev.org/c/openstack/watcher-specs/+/943873
Change-Id: I201b8e5c52e1bc1a74f3886a0e301e3c0fa5d351
Signed-off-by: Quang Ngo <quang.ngo@canonical.com>
Fixes the microversion comparison in both enable and
disable nova-compute service methods in NovaHelper.
The previous implementation was incorrect and started to
fail for microversion greather than 2.99.
Closes-Bug: #2120586
Assisted-By: Cursor (claude-4-sonnet)
Change-Id: I69da7f10cd5b42f7d4613d8947bca3e382815c3f
Signed-off-by: Douglas Viroel <viroel@gmail.com>
This patch implements the changes in the API required for the
skipped action blueprint. It includes:
- New field `status_message` is visible in API get calls for Audits,
ActionPlans and Audits.
- New Patch call is added to `/actions/{action_id}` which allows to
manually move actions in PENDING state to SKIPPED for ActionPlans
which have not been started.
- A new API microversion 1.5 is added for these changes.
It also adds requried tests and documentation.
Implements: blueprint add-skip-actions
Assisted-By: Cursor (claude-4-sonnet)
Change-Id: I71fb9af76085e5941a7fd3e9e4c89d6f3a3ada47
Signed-off-by: Alfredo Moralejo <amoralej@redhat.com>
In order to test the different code paths for action execution
it is very useful to be able to make the actions fail in the different
execution stages.
This patch adds three new options `fail_pre_condition`, `fail_execute`
and `fail_post_condition`. Setting any of them to True makes the action
to fail in the specified step.
Change-Id: Ied8c0bb767d9bb6bdfb9209365857a3b4d606b40
Signed-off-by: Alfredo Moralejo <amoralej@redhat.com>
This change removes watchers in tree functionality
for swapping instance volumes and defines swap as an alias
of cinder volume migrate.
The watcher native implementation was missing error handling
which could lead to irretrievable data loss.
The removed code also forged project user credentials to
perform admin request as if it was done by a member of a project.
this was unsafe an posses a security risk due to how it was
implemented. This code has been removed without replacement.
While some effort has been made to allow existing
audits that were defined to work, any reduction of functionality
as a result of this security hardening is intentional.
Closes-Bug: #2112187
Change-Id: Ic3b6bfd164e272d70fe86d7b182478dd962f8ac0
Signed-off-by: Sean Mooney <work@seanmooney.info>
Implement the spec for multi-tenancy support for metrics. This adds
a new 'Aetos' datasource very similar to the current Prometheus
datasource. Because of that, the original PrometheusHelper class
was split into two classes and the base class is used for
PrometheusHelper and for AetosHelper. Except for the split, there
is one more change to the original PrometheusHelper class code, which
is the addition and use of the _get_fqdn_label() and
_get_instance_uuid_label() methods.
As part of the change, I refactored the current prometheus datasource
unit tests. Most of them are now used to test the PrometheusBase class
with minimal changes. Changes I've made to the original tests:
- the ones that can be be used to test the base class are moved into the
TestPrometheusBase class
- the _setup_prometheus_client, _get_instance_uuid_label and
_get_fqdn_label functions are mocked in the base class tests.
Their concrete implementations are tested in each datasource tests
separately.
- a self._create_helper() is used to instantiate the helper class with
correct mocking.
- all config value modification is the original tests got moved out and
instead of modifying the config values, the _get_* methods are mocked
to return the wanted values
- to keep similar test coverage, config retrieval is tested for each
concrete class by testing the _get_* methods.
New watcher-aetos-integration and watcher-aetos-integration-realdata
zuul jobs are added to test the new datasource. These use the same set
of tempest tests as the current watcher-prometheus-integration jobs.
The only difference is the environment setup and the Watcher config,
so that the job deploys Aetos and Watcher uses it instead of accessing
Prometheus directly.
At first this was generated by asking cursor to implement the linked spec
with some additional prompts for some smaller changes. Afterwards I manually
went through the code doing some cleanups, ensuring it complies with
PEP8 and hacking and so on. Later on I manually adjusted the code to use
the latest observabilityclient changes.
The zuul job was also mostly generated by cursor.
Implements: https://blueprints.launchpad.net/watcher/+spec/prometheus-multitenancy-support
Generated-By: Cursor with claude-4-sonnet model
Change-Id: I72c2171f72819bbde6c9cbbf565ee895e5d2bd53
Signed-off-by: Jaromir Wysoglad <jwysogla@redhat.com>
Some response parameters from GET /infra-optim/v1/data_model
endpoint are missing from api-ref documentation. This patch
updates the doc to include them.
For more details see, LP #2117726
Closes-Bug: #2117726
Change-Id: Iaa775f56bb8167d9c6b458cd07f1ec3cefaf70fe
Signed-off-by: Douglas Viroel <viroel@gmail.com>
With the events of eventlet removal, Watcher will need
to be adapted to support both modes, eventlet and threading, for
a couple of releases before removing all eventlet code.
This patch adds methods and classes that allow decision engine
modules to create futurist thread pools instead of green thread pools,
based on a environment variable that can be enabled by service.
It moves continuous audit handler instance to decison engine service,
so it can be started together with the main decision engine service.
Adds an environment variable that allows the user to disable
eventlet monkey patching and to use oslo.service threading backend.
Change-Id: I8a8be0a7cebdc44005fd77ec960543828c7da318
Signed-off-by: Douglas Viroel <viroel@gmail.com>
This cr fixes:
* Replaced ``dateutil.tz.tzlocal()`` and ``dateutil.tz.tzutc()`` with
``datetime.timezone`` built-in classes in audit controllers and
continuous audit scheduling.
* Replaced ``dateutil.parser.parse()`` with
``oslo_utils.timeutils.parse_isotime()`` in the zone migration
strategy for parsing datetime strings.
Closes-Bug: #2118404
Change-Id: I6d8a345fa4339a688769b147413dcdf3016bf4a0
Signed-off-by: Chandan Kumar (raukadah) <chkumar@redhat.com>
The last release of openstack to support python 3.9
was 2025.1 (epoxy), with this change watcher now requires
3.10, testing of 3.9 was removed in previous commits.
Change-Id: Ida53740293e93b0c20dec2e175b390fa18bed852
Signed-off-by: Sean Mooney <work@seanmooney.info>
The following exception was added in initial import of watcher
code base[1].
In each of the controller REST APIs, it was called with a flag
stating request was coming from top level resources apis.
But this exception and code was not used anywhere in the
rest api. It seems to be a dead code. So, it needs to be
cleaned up.
Note: In audit_template, under patchapi, this exception
was used for not removal goal from audit template.
Since this cr drops this exception, It replace the same
with NotAuthorized exception keeping status code same.
Links:
[1]. d14e057da1 (diff-6d510a275605e20ba8b435157062da2b749265a88a3cfd6d90abb7e8e5feac2aR235)
Closes-Bug: #2115968
Change-Id: I82a5e4a7a51726b3a89257c84a75157fbfcb82eb
Signed-off-by: Chandan Kumar (raukadah) <chkumar@redhat.com>
These apis are not implemented with in the watcher code base and
was marked as a forbidden to use.
It does not make sense to keep these api as they are not implemented.
This cr drops the code around that to make the action apis cleaner.
Closes-Bug: #2110895
Change-Id: I0f465157e6cd481b27665ca6016db68c198cebeb
Signed-off-by: Chandan Kumar (raukadah) <chkumar@redhat.com>
The prometheus datasource was reporting host_ram_usage in MiB as
described in the docstring for the base datasource interface
definition [1].
However, the gnocchi datasource is reporting it in KiB following
ceilometer metric `hardware.memory.used` [2] and the strategies
using that metric expect it to be in KiB so the best approach is
to change the unit in the prometheus datasource and update the
docstring to avoid missunderstandings in future. So, this patch
is fixing the prometheus datasource to return host_ram_usage
in KiB instead of MiB.
Additionally, it is adding more unit tests for the check_threshold
method so that it covers the memory based strategy execution, validates
the calculated standard deviation and adds the cases where it is below
the threshold.
[1] 15981117ee/watcher/decision_engine/datasources/base.py (L177-L183)
[2] https://docs.openstack.org/ceilometer/train/admin/telemetry-measurements.html#snmp-based-meters
Closes-Bug: #2113776
Change-Id: Idc060d1e709c0265c64ada16062c3a206c6b04fa
Some services integrations are now classified as experimental
and a warning message will now appear once a client is created
for them. These integrations are not fully tested in CI and
miss a documentation on how they work or should be used.
A release note was added to inform users about the status of
these integrations and related features.
Change-Id: Ib7d0ac0b3e187ae239dfa075fb53a6c0107dff29
Currently, in that case it was failing because watcher tried to create a
name based on a goal automatically and the goal is not defined.
This patch is moving the check for goal specification in the audit
creation call earlier, and if there is not goal defined, it returns an
invalid call error.
This patch is also modifying the existing error for this case to check
the expected behavior.
Closes-Bug: #2110947
Change-Id: I6f3d73b035e8081e86ce82c205498432f0e0fc33
Currently, an actionplan state is set to SUCCEEDED once the execution
has finished, but that does not imply that all the actions finished
successfully.
This patch is checking the actual state of all the actions in the plan
after the execution has finished. If any action has status FAILED, it
will set the state of the action plan as FAILED and will apply the
appropiate notification parameters. This is the expected behavior according
to Watcher documentation.
The patch is also fixing the unit test for this to set the expected
action plan state to FAILED and notification parameters.
Closes-Bug: #2106407
Change-Id: I7bfc6759b51cd97c26ec13b3918bd8d3b7ac9d4e
For compute nodes, nova works fine if a destination node is not
specified, so this change makes sure we're not passing None when the
user does not set one to avoid an error.
Partial-Bug: 2108988
Change-Id: Ida1f18b97697c041819e29f935aa5e232848226a
Currently, when trying to create an audit which misses a mandatory
parameter watcher returns error 500 instead of 400 which is the
documented error in the API [1] and the appropiate error code for
malformed requests.
This patch catch parameters validation errors according to the json
schema for each strategy and returns error 400. It also fixes the
unit test to validate the expected behavior.
[1] https://docs.openstack.org/api-ref/resource-optimization/#audits
Closes-Bug: #2110538
Change-Id: I23232b3b54421839bb01d54386d4e7b244f4e2a0
Currently host maintenance strategy also migrate instances from maintenance
node to watcher_disabled compute nodes.
watcher_disabled compute nodes might be disabled for some other purpose
by different strategy. If host maintenace use those compute nodes for
migration, It might affect customer workloads.
Host maintenance strategy should never touch disabled hosts unless the user
specify a disable host as backup node.
This cr drops the logic for using disabled compute node for maintenance.
Host maintaince is already using nova schedular for migrating the
instance, will use the same. If there is no available node, strategy
will fail.
Closes-Bug: #2109945
Change-Id: If9795fd06f684eb67d553405cebd8a30887c3997
Signed-off-by: Chandan Kumar (raukadah) <chkumar@redhat.com>
Set the default interface for keystone_client to public in the watcher
conf instead of admin.
Closes-Bug: 2109494
Change-Id: I9e0289249981ca965190df6dbdc37e09fd0951d7
pip 23.1 removed the "setup.py install" fallback for projects that do
not have pyproject.toml and now uses a pyproject.toml which is vendored
in pip [1][2]. pip 24.2 has now deprecated a similar fallback to
"setup.py develop" and plans to fully remove this in pip 25.0 [3][4][5].
pbr supports editable installs since 6.0.0
pip 25.1 has now been released and the removal is complete.
by adding our own minimal pyproject.toml to ensure we are using the
correct build system.
This change also requires that we adapt how we generate our wsgi
entry point. when pyproject.toml is used the wsgi console script is
not generated in an editbale install such as is used in devstck
To adress this we need to refactor our usage of our wsgi applciation
to use a module path instead. This change does not remove
the declaration of our wsgi_scrtip entry point but it shoudl
be considered deprecated and it will be removed in the future.
To unblock the gate the devstack plugin is modifed to to deploy
using the wsgi module instead of the console script.
Finally supprot for the mod_wsgi wsgi mode is removed.
that was deprecated in devstack a few cycle ago and
support was removed in I8823e98809ed6b66c27dbcf21a00eea68ef403e8
[1] https://pip.pypa.io/en/stable/news/#v23-1
[2] https://github.com/pypa/pip/issues/8368
[3] https://pip.pypa.io/en/stable/news/#v24-2
[4] https://github.com/pypa/pip/issues/11457
[5] https://ichard26.github.io/blog/2024/08/whats-new-in-pip-24.2/
Closes-Bug: #2109608
Depends-on: https://review.opendev.org/c/openstack/watcher/+/948502
Change-Id: Iad77939ab0403c5720c549f96edfc77d2b7d90ee
Currently we are using `instance` label to query about host metrics to
prometheus. This label is assigned to the url of each endpoint being
scrapped.
While this work fine in one-exporter-per-compute cases as the driver is
mapping the fqdn_label value to the `instance` label value, it fails
when there are more that one target with the same value for the fqdn
label. This is a valid case, to be able to query by fqdn and do not
care about what exporter in the host is providing the metric.
This patch is changing the queries we use for hosts to be based on the
fqdn_label instead of the instance one. To implement it, we are also
simplifying the way we check the metric exist for the host by converting
prometheus_fqdn_instance_map into a prometheus_fqdn_labels set
which stores the list of fqdn found in prometheus.
Closes-Bug: #2103451
Change-Id: I3bcc317441b73da5c876e53edd4622370c6d575e
Add file to the reno documentation build to show release notes for
stable/2025.1.
Use pbr instruction to increment the minor version number
automatically so that master versions are higher than the versions on
stable/2025.1.
Sem-Ver: feature
Change-Id: Ie7a1845d7b02b852e776ed8ec73598caab2fb5c6