Project

General

Profile

Actions

Bug #55322

closed

test-restful.sh: mon metadata unable to be retrieved

Added by Laura Flores about 2 years ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
restful module
Target version:
-
% Done:

100%

Source:
Tags:
Backport:
pacific,quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/nojha-2022-04-13_16:47:41-rados-wip-yuri2-testing-2022-04-13-0703-distro-basic-smithi/6790445/teuthology.log

2022-04-13T17:15:21.666 INFO:tasks.ceph.mgr.x.smithi026.stderr:2022-04-13T17:15:21.658+0000 7f245d9cd700 -1 mgr get_metadata_python Requested missing service mon.a
2022-04-13T17:15:21.693 INFO:tasks.workunit.client.a.smithi026.stdout:{
2022-04-13T17:15:21.693 INFO:tasks.workunit.client.a.smithi026.stdout:    "failed": [],
2022-04-13T17:15:21.693 INFO:tasks.workunit.client.a.smithi026.stdout:    "finished": [
2022-04-13T17:15:21.694 INFO:tasks.workunit.client.a.smithi026.stdout:        {
2022-04-13T17:15:21.694 INFO:tasks.workunit.client.a.smithi026.stdout:            "command": "osd pool create pool=supertestfriends pg_num=128",
2022-04-13T17:15:21.694 INFO:tasks.workunit.client.a.smithi026.stdout:            "outb": "",
2022-04-13T17:15:21.694 INFO:tasks.workunit.client.a.smithi026.stdout:            "outs": "pool 'supertestfriends' created" 
2022-04-13T17:15:21.695 INFO:tasks.workunit.client.a.smithi026.stdout:        }
2022-04-13T17:15:21.695 INFO:tasks.workunit.client.a.smithi026.stdout:    ],
2022-04-13T17:15:21.695 INFO:tasks.workunit.client.a.smithi026.stdout:    "has_failed": false,
2022-04-13T17:15:21.695 INFO:tasks.workunit.client.a.smithi026.stdout:    "id": "139794436317136",
2022-04-13T17:15:21.696 INFO:tasks.workunit.client.a.smithi026.stdout:    "is_finished": true,
2022-04-13T17:15:21.696 INFO:tasks.workunit.client.a.smithi026.stdout:    "is_waiting": false,
2022-04-13T17:15:21.696 INFO:tasks.workunit.client.a.smithi026.stdout:    "running": [],
2022-04-13T17:15:21.696 INFO:tasks.workunit.client.a.smithi026.stdout:    "state": "success",
2022-04-13T17:15:21.696 INFO:tasks.workunit.client.a.smithi026.stdout:    "waiting": []
2022-04-13T17:15:21.697 INFO:tasks.workunit.client.a.smithi026.stdout:}
2022-04-13T17:15:21.700 INFO:tasks.workunit.client.a.smithi026.stderr:Traceback (most recent call last):
2022-04-13T17:15:21.701 INFO:tasks.workunit.client.a.smithi026.stderr:  File "/home/ubuntu/cephtest/clone.client.a/qa/workunits/rest/test_mgr_rest_api.py", line 40, in <module>
2022-04-13T17:15:21.701 INFO:tasks.workunit.client.a.smithi026.stderr:    firstmon = request.json()[0]['name']
2022-04-13T17:15:21.701 INFO:tasks.workunit.client.a.smithi026.stderr:  File "/usr/lib/python3/dist-packages/requests/models.py", line 897, in json
2022-04-13T17:15:21.701 INFO:tasks.workunit.client.a.smithi026.stderr:    return complexjson.loads(self.text, **kwargs)
2022-04-13T17:15:21.702 INFO:tasks.workunit.client.a.smithi026.stderr:  File "/usr/lib/python3/dist-packages/simplejson/__init__.py", line 518, in loads
2022-04-13T17:15:21.702 INFO:tasks.workunit.client.a.smithi026.stderr:    return _default_decoder.decode(s)
2022-04-13T17:15:21.703 INFO:tasks.workunit.client.a.smithi026.stderr:  File "/usr/lib/python3/dist-packages/simplejson/decoder.py", line 370, in decode
2022-04-13T17:15:21.703 INFO:tasks.workunit.client.a.smithi026.stderr:    obj, end = self.raw_decode(s)
2022-04-13T17:15:21.703 INFO:tasks.workunit.client.a.smithi026.stderr:  File "/usr/lib/python3/dist-packages/simplejson/decoder.py", line 400, in raw_decode
2022-04-13T17:15:21.703 INFO:tasks.workunit.client.a.smithi026.stderr:    return self.scan_once(s, idx=_w(s, idx).end())
2022-04-13T17:15:21.703 INFO:tasks.workunit.client.a.smithi026.stderr:simplejson.errors.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

/a/nojha-2022-04-13_16:47:41-rados-wip-yuri2-testing-2022-04-13-0703-distro-basic-smithi/6790445/remote/smithi026/log/ceph-mgr.x.log.gz

2022-04-13T17:15:21.622+0000 7f245d9cd700  0 [restful INFO werkzeug] ::ffff:172.21.15.26 - - [13/Apr/2022 17:15:21] "POST /pool?wait=yes HTTP/1.1" 200 -
2022-04-13T17:15:21.638+0000 7f245d9cd700  0 [restful INFO werkzeug] ::ffff:172.21.15.26 - - [13/Apr/2022 17:15:21] "GET /pool HTTP/1.1" 200 -
2022-04-13T17:15:21.658+0000 7f245d9cd700 -1 mgr get_metadata_python Requested missing service mon.a
2022-04-13T17:15:21.658+0000 7f245d9cd700  0 [restful ERROR root] Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/pecan/core.py", line 683, in __call__
    self.invoke_controller(controller, args, kwargs, state)
  File "/usr/lib/python3/dist-packages/pecan/core.py", line 574, in invoke_controller
    result = controller(*args, **kwargs)
  File "/usr/share/ceph/mgr/restful/decorators.py", line 37, in decorated
    return f(*args, **kwargs)
  File "/usr/share/ceph/mgr/restful/api/mon.py", line 35, in get
    return context.instance.get_mons()
  File "/usr/share/ceph/mgr/restful/module.py", line 510, in get_mons
    mon['server'] = self.get_metadata("mon", mon['name'])['hostname']
TypeError: 'NoneType' object is not subscriptable


Related issues 3 (0 open3 closed)

Has duplicate Dashboard - Bug #56072: mgr/dashboard: TelemetryTest.test_get_report failedDuplicate

Actions
Copied to mgr - Backport #57188: pacific: test-restful.sh: mon metadata unable to be retrievedResolvedPrashant DActions
Copied to mgr - Backport #57189: quincy: test-restful.sh: mon metadata unable to be retrievedResolvedPrashant DActions
Actions #1

Updated by Radoslaw Zarzynski about 2 years ago

  • Project changed from RADOS to mgr
Actions #2

Updated by Radoslaw Zarzynski about 2 years ago

  • Category set to restful module
Actions #3

Updated by Laura Flores almost 2 years ago

  • Backport set to pacific

/a/yuriw-2022-05-31_21:35:41-rados-wip-yuri2-testing-2022-05-31-1300-pacific-distro-default-smithi/6856304

Actions #4

Updated by Laura Flores almost 2 years ago

  • Assignee set to Laura Flores
Actions #5

Updated by Laura Flores almost 2 years ago

  • Status changed from New to In Progress
Actions #6

Updated by Laura Flores almost 2 years ago

Laura Flores wrote:

/a/nojha-2022-04-13_16:47:41-rados-wip-yuri2-testing-2022-04-13-0703-distro-basic-smithi/6790445/remote/smithi026/log/ceph-mgr.x.log.gz

This log message can be found in failed restful tests. It repeats multiple times in the mgr log before the test finally fails:

2022-04-13T17:14:21.774+0000 7f246d1ec700  5 mgr.server handle_report rejecting report from mon.a, since we do not have its metadata now.

...

2022-04-13T17:15:16.802+0000 7f246d1ec700  5 mgr.server handle_report rejecting report from mon.a, since we do not have its metadata now.

Actions #7

Updated by Laura Flores almost 2 years ago

Original Trello card for PR https://github.com/ceph/ceph/pull/45670: https://trello.com/c/kCTebTFa/1504-wip-yuri5-testing-2022-04-05-1720
--> the `rados/rest` test passed here, which is why this was not caught initially.

The PR was merged to main on April 11; the Tracker was filed on April 13. This indicates that they are correlated.

Let's see how many times the test passes on main:
[WITH the suspect commit] http://pulpito.front.sepia.ceph.com/lflores-2022-06-09_22:29:13-rados:rest-main-distro-default-smithi/ (10/25 tests failed)

The PR has not been backported to Quincy yet; we'll see if the tests pass there:
[WITHOUT the suspect commit] http://pulpito.front.sepia.ceph.com/lflores-2022-06-10_19:01:46-rados:rest-quincy-distro-default-smithi/ (25/25 tests passed)

What about Pacific? The commit has been backported there:
[WITH the suspect commit] http://pulpito.front.sepia.ceph.com/lflores-2022-06-10_20:43:57-rados:rest-wip-yuri2-testing-2022-05-31-1300-pacific-distro-default-smithi/ (15/25 tests failed)
[WITHOUT the suspect commit] http://pulpito.front.sepia.ceph.com/lflores-2022-06-10_19:45:16-rados:rest-pacific-distro-default-smithi/ (25/25 tests passed)

Actions #8

Updated by Laura Flores almost 2 years ago

Before the change, the mgr log looks like this for mon.a (this is a Pacific example):

2022-06-10T20:15:27.677+0000 7fc4d45fa700  4 mgr finish mon returned valid metadata JSON for mon.a
2022-06-10T20:15:28.679+0000 7fc49ea5d700  1 -- [v2:172.21.15.63:6824/75350,v1:172.21.15.63:6825/75350] <== mon.0 172.21.15.63:0/75349 1 ==== mgropen(mon.a) v3 ==== 69367+0+0 (secure 0 0 0) 0x557690a48000 con 0x5576909cb000
2022-06-10T20:15:28.679+0000 7fc49ea5d700 10 mgr.server handle_open from mon.a 172.21.15.63:0/75349
2022-06-10T20:15:28.679+0000 7fc49ea5d700 20 mgr.server handle_open updating existing DaemonState for mon.a
2022-06-10T20:15:28.679+0000 7fc49ea5d700  1 -- [v2:172.21.15.63:6824/75350,v1:172.21.15.63:6825/75350] <== mon.0 172.21.15.63:0/75349 2 ==== mgrreport(mon.a +98-0 packed 1158 daemon_metrics=1 task_status=0) v9 ==== 8059+0+0 (secure 0 0 0) 0x557690af6700 con 0x5576909cb000
2022-06-10T20:15:28.679+0000 7fc49ea5d700 10 mgr.server handle_report from 0x5576909cb000 mon.a
2022-06-10T20:15:28.679+0000 7fc49ea5d700 20 mgr.server handle_report updating existing DaemonState for mon.a
2022-06-10T20:15:28.680+0000 7fc49ea5d700 10 mgr.server update_task_status got task status from mon.a
2022-06-10T20:15:33.679+0000 7fc49ea5d700  1 -- [v2:172.21.15.63:6824/75350,v1:172.21.15.63:6825/75350] <== mon.0 172.21.15.63:0/75349 3 ==== mgrreport(mon.a +0-0 packed 1158 daemon_metrics=1) v9 ==== 3433+0+0 (secure 0 0 0) 0x557690af7880 con 0x5576909cb000
2022-06-10T20:15:33.679+0000 7fc49ea5d700 10 mgr.server handle_report from 0x5576909cb000 mon.a
2022-06-10T20:15:33.679+0000 7fc49ea5d700 20 mgr.server handle_report updating existing DaemonState for mon.a
2022-06-10T20:15:38.679+0000 7fc49ea5d700  1 -- [v2:172.21.15.63:6824/75350,v1:172.21.15.63:6825/75350] <== mon.0 172.21.15.63:0/75349 4 ==== mgrreport(mon.a +0-0 packed 1158 daemon_metrics=1) v9 ==== 3433+0+0 (secure 0 0 0) 0x557690a5ca80 con 0x5576909cb000
2022-06-10T20:15:38.679+0000 7fc49ea5d700 10 mgr.server handle_report from 0x5576909cb000 mon.a
2022-06-10T20:15:38.679+0000 7fc49ea5d700 20 mgr.server handle_report updating existing DaemonState for mon.a
2022-06-10T20:15:43.679+0000 7fc49ea5d700  1 -- [v2:172.21.15.63:6824/75350,v1:172.21.15.63:6825/75350] <== mon.0 172.21.15.63:0/75349 5 ==== mgrreport(mon.a +0-0 packed 1158 daemon_metrics=1) v9 ==== 3433+0+0 (secure 0 0 0) 0x557690cec380 con 0x5576909cb000
2022-06-10T20:15:43.679+0000 7fc49ea5d700 10 mgr.server handle_report from 0x5576909cb000 mon.a
2022-06-10T20:15:43.679+0000 7fc49ea5d700 20 mgr.server handle_report updating existing DaemonState for mon.a
2022-06-10T20:15:48.679+0000 7fc49ea5d700  1 -- [v2:172.21.15.63:6824/75350,v1:172.21.15.63:6825/75350] <== mon.0 172.21.15.63:0/75349 6 ==== mgrreport(mon.a +0-0 packed 1158 daemon_metrics=1) v9 ==== 3433+0+0 (secure 0 0 0) 0x557690d31c00 con 0x5576909cb000
2022-06-10T20:15:48.679+0000 7fc49ea5d700 10 mgr.server handle_report from 0x5576909cb000 mon.a
2022-06-10T20:15:48.679+0000 7fc49ea5d700 20 mgr.server handle_report updating existing DaemonState for mon.a
2022-06-10T20:15:53.680+0000 7fc49ea5d700  1 -- [v2:172.21.15.63:6824/75350,v1:172.21.15.63:6825/75350] <== mon.0 172.21.15.63:0/75349 7 ==== mgrreport(mon.a +0-0 packed 1158 daemon_metrics=1) v9 ==== 3433+0+0 (secure 0 0 0) 0x557690d6b50

After the change, the mgr log looks like this for mon.a (Pacific). Notice that mon.a gets registered in pending_service_map, which then leads to the mgr.server pruning the service map and rejecting the report:

2022-06-10T20:56:25.182+0000 7f367e283700  4 mgr finish mon returned valid metadata JSON for mon.a
2022-06-10T20:56:26.184+0000 7f36479f8700  1 -- [v2:172.21.15.52:6824/40143,v1:172.21.15.52:6825/40143] <== mon.0 172.21.15.52:0/40142 1 ==== mgropen(mon.a daemon) v3 ==== 70182+0+0 (secure 0 0 0) 0x55e2c75c8480 con 0x55e2c735c800
2022-06-10T20:56:26.184+0000 7f36479f8700 10 mgr.server handle_open from mon.a 172.21.15.52:0/40142
2022-06-10T20:56:26.184+0000 7f36479f8700 20 mgr.server handle_open updating existing DaemonState for mon.a
2022-06-10T20:56:26.184+0000 7f36479f8700 10 mgr.server handle_open registering mon.a in pending_service_map
2022-06-10T20:56:26.184+0000 7f36479f8700  1 -- [v2:172.21.15.52:6824/40143,v1:172.21.15.52:6825/40143] <== mon.0 172.21.15.52:0/40142 2 ==== mgrreport(mon.a +98-0 packed 1158 status=0 daemon_metrics=1 task_status=0) v9 ==== 8066+0+0 (secure 0 0 0) 0x55e2c7650a80 con 0x55e2c735c800
2022-06-10T20:56:26.184+0000 7f36479f8700 10 mgr.server handle_report from 0x55e2c735c800 mon.a
2022-06-10T20:56:26.184+0000 7f36479f8700 20 mgr.server handle_report updating existing DaemonState for mon.a
2022-06-10T20:56:26.185+0000 7f36479f8700 10 mgr.server update_task_status got task status from mon.a
2022-06-10T20:56:27.186+0000 7f367fa86700  4 mgr cull_services Removing data for mon.a
2022-06-10T20:56:29.182+0000 7f36589fa700 10 mgr.server _prune_pending_service_map daemon mon.a in service map but not in daemon state index -- force pruning
2022-06-10T20:56:31.184+0000 7f36479f8700  1 -- [v2:172.21.15.52:6824/40143,v1:172.21.15.52:6825/40143] <== mon.0 172.21.15.52:0/40142 3 ==== mgrreport(mon.a +0-0 packed 1158 daemon_metrics=1) v9 ==== 3436+0+0 (secure 0 0 0) 0x55e2c7651180 con 0x55e2c735c800
2022-06-10T20:56:31.184+0000 7f36479f8700 10 mgr.server handle_report from 0x55e2c735c800 mon.a
2022-06-10T20:56:31.184+0000 7f36479f8700  5 mgr.server handle_report rejecting report from mon.a, since we do not have its metadata now.
2022-06-10T20:56:31.185+0000 7f36479f8700  0 ms_deliver_dispatch: unhandled message 0x55e2c7651180 mgrreport(mon.a +0-0 packed 1158 daemon_metrics=1) v9 from mon.0 172.21.15.52:0/40142
2022-06-10T20:56:31.185+0000 7f367e283700  4 mgr finish mon returned valid metadata JSON for mon.a
2022-06-10T20:56:31.185+0000 7f36479f8700  1 -- [v2:172.21.15.52:6824/40143,v1:172.21.15.52:6825/40143] <== mon.0 172.21.15.52:0/40142 1 ==== mgropen(mon.a daemon) v3 ==== 70182+0+0 (secure 0 0 0) 0x55e2c7746000 con 0x55e2c735c800
2022-06-10T20:56:31.185+0000 7f36479f8700 10 mgr.server handle_open from mon.a 172.21.15.52:0/40142
2022-06-10T20:56:31.185+0000 7f36479f8700 20 mgr.server handle_open updating existing DaemonState for mon.a
2022-06-10T20:56:31.185+0000 7f36479f8700 10 mgr.server handle_open registering mon.a in pending_service_map
2022-06-10T20:56:31.185+0000 7f36479f8700  1 -- [v2:172.21.15.52:6824/40143,v1:172.21.15.52:6825/40143] <== mon.0 172.21.15.52:0/40142 2 ==== mgrreport(mon.a +98-0 packed 1158 status=0 task_status=0) v9 ==== 8051+0+0 (secure 0 0 0) 0x55e2c7650e00 con 0x55e2c735c800
2022-06-10T20:56:31.185+0000 7f36479f8700 10 mgr.server handle_report from 0x55e2c735c800 mon.a
2022-06-10T20:56:31.185+0000 7f36479f8700 20 mgr.server handle_report updating existing DaemonState for mon.a
2022-06-10T20:56:31.186+0000 7f36479f8700 10 mgr.server update_task_status got task status from mon.a
2022-06-10T20:56:33.467+0000 7f367fa86700  4 mgr cull_services Removing data for mon.a
2022-06-10T20:56:35.183+0000 7f36589fa700 10 mgr.server _prune_pending_service_map daemon mon.a in service map but not in daemon state index -- force pruning
2022-06-10T20:56:36.186+0000 7f36479f8700  1 -- [v2:172.21.15.52:6824/40143,v1:172.21.15.52:6825/40143] <== mon.0 172.21.15.52:0/40142 3 ==== mgrreport(mon.a +0-0 packed 1158 daemon_metrics=1) v9 ==== 3436+0+0 (secure 0 0 0) 0x55e2c77e2700 con 0x55e2c735c800
2022-06-10T20:56:36.186+0000 7f36479f8700 10 mgr.server handle_report from 0x55e2c735c800 mon.a
2022-06-10T20:56:36.186+0000 7f36479f8700  5 mgr.server handle_report rejecting report from mon.a, since we do not have its metadata now.
2022-06-10T20:56:36.186+0000 7f36479f8700  0 ms_deliver_dispatch: unhandled message 0x55e2c77e2700 mgrreport(mon.a +0-0 packed 1158 daemon_metrics=1) v9 from mon.0 172.21.15.52:0/40142
2022-06-10T20:56:36.186+0000 7f367ea84700  4 mgr finish mon returned valid metadata JSON for mon.a
2022-06-10T20:56:36.187+0000 7f36479f8700  1 -- [v2:172.21.15.52:6824/40143,v1:172.21.15.52:6825/40143] <== mon.0 172.21.15.52:0/40142 1 ==== mgropen(mon.a daemon) v3 ==== 70182+0+0 (secure 0 0 0) 0x55e2c7746240 con 0x55e2c7708400
2022-06-10T20:56:36.187+0000 7f36479f8700 10 mgr.server handle_open from mon.a 172.21.15.52:0/40142
2022-06-10T20:56:36.187+0000 7f36479f8700 20 mgr.server handle_open updating existing DaemonState for mon.a
2022-06-10T20:56:36.187+0000 7f36479f8700 10 mgr.server handle_open registering mon.a in pending_service_map

Actions #9

Updated by Sridhar Seshasayee almost 2 years ago

/a/yuriw-2022-06-15_18:29:33-rados-wip-yuri4-testing-2022-06-15-1000-pacific-distro-default-smithi/6881166

Actions #10

Updated by Laura Flores almost 2 years ago

  • Has duplicate Bug #56072: mgr/dashboard: TelemetryTest.test_get_report failed added
Actions #11

Updated by Laura Flores almost 2 years ago

  • Priority changed from Normal to Urgent

Bumping the priority of this since it has a negative effect on both the restful module and telemetry.

Actions #12

Updated by Laura Flores almost 2 years ago

I'm noticing that `update_pending_metadata` gets called in the mon logs, which is part of the new PR:

2022-06-14T18:53:19.663+0000 7faa0f422700 20 mon.a@0(leader) e1 update_pending_metadata
2022-06-14T18:53:19.663+0000 7faa0f422700  1 mgrc update_daemon_metadata mon.a metadata {addrs=[v2:172.21.15.121:3300/0,v1:172.21.15.121:6789/0],arch=x86_64,ceph_release=quincy,ceph_version=ceph version 17.0.0-13018-gb83a2da4 (b83a2da43dc8dd92b29e89ded3652b92b9486fa7) quincy (dev),ceph_version_short=17.0.0-13018-gb83a2da4,compression_algorithms=none, snappy, zlib, zstd, lz4,cpu=Intel(R) Xeon(R) CPU E5-1620 v4 @ 3.50GHz,device_ids=nvme0n1=INTEL_SSDPEDMD400G4_PHFT620400ZF400BGN,device_paths=nvme0n1=/dev/disk/by-path/pci-0000:02:00.0-nvme-1,devices=nvme0n1,distro=rhel,distro_description=Red Hat Enterprise Linux 8.5 (Ootpa),distro_version=8.5,hostname=smithi121,kernel_description=#1 SMP Fri Apr 15 22:12:19 EDT 2022,kernel_version=4.18.0-372.9.1.el8.x86_64,mem_swap_kb=0,mem_total_kb=32605224,os=Linux}

But there are no subsequent calls to "handle_update", which means that the `MMgrUpdate` message type is not getting picked up in `DaemonServer::ms_dispatch2`. I grepped for the differences between existing message types and `MMgrUpdate`, and the missing piece might be that we need to add this new message type to the ceph-dencoder (tools/ceph-dencoder/common_types.h):

[lflores@fedora src]$ git grep "MMgrReport\.h" 
mgr/DaemonState.h:#include "messages/MMgrReport.h" 
mgr/MDSPerfMetricCollector.cc:#include "messages/MMgrReport.h" 
mgr/MgrClient.cc:#include "messages/MMgrReport.h" 
mgr/MgrClient.h:#include "messages/MMgrReport.h" 
mgr/OSDPerfMetricCollector.cc:#include "messages/MMgrReport.h" 
msg/Message.cc:#include "messages/MMgrReport.h" 
tools/ceph-dencoder/common_types.h:#include "messages/MMgrReport.h" 
[lflores@fedora src]$ git grep "MMgrOpen\.h" 
crimson/mgr/client.cc:#include "messages/MMgrOpen.h" 
mgr/DaemonServer.cc:#include "messages/MMgrOpen.h" 
mgr/MgrClient.cc:#include "messages/MMgrOpen.h" 
msg/Message.cc:#include "messages/MMgrOpen.h" 
tools/ceph-dencoder/common_types.h:#include "messages/MMgrOpen.h" 
[lflores@fedora src]$ git grep "MMgrClose\.h" 
mgr/DaemonServer.cc:#include "messages/MMgrClose.h" 
mgr/MgrClient.cc:#include "messages/MMgrClose.h" 
msg/Message.cc:#include "messages/MMgrClose.h" 
tools/ceph-dencoder/common_types.h:#include "messages/MMgrClose.h" 

MMgrUpdate is missing in tools/ceph-dencoder/common_types.h:

[lflores@fedora src]$ git grep "MMgrUpdate\.h" 
mgr/DaemonServer.cc:#include "messages/MMgrUpdate.h" 
mgr/MgrClient.cc:#include "messages/MMgrUpdate.h" 
msg/Message.cc:#include "messages/MMgrUpdate.h" 

The fix would look something like:
tools/ceph-dencoder/common_types.h

#include "messages/MTimeCheck2.h" 
MESSAGE(MTimeCheck2)

#include "messages/MWatchNotify.h" 
MESSAGE(MWatchNotify)

+ #include "messages/MMgrUpdate.h" 
+ MESSAGE(MMgrUpdate)

This theory would make sense since `DaemonServer::ms_dispatch2` seems unable to decode the `MMgrUpdate` message. But I will test out this theory first.

Actions #14

Updated by Laura Flores almost 2 years ago

/a/yuriw-2022-06-21_16:28:27-rados-wip-yuri4-testing-2022-06-21-0704-pacific-distro-default-smithi/6889489

Actions #16

Updated by Laura Flores almost 2 years ago

  • Status changed from In Progress to Fix Under Review
  • Assignee changed from Laura Flores to Prashant D
  • Pull request ID set to 46838
Actions #17

Updated by Laura Flores almost 2 years ago

Here is how the test is currently failing on main:
http://pulpito.front.sepia.ceph.com/lflores-2022-06-09_22:29:13-rados:rest-main-distro-default-smithi/
10/25 tests failed

To trigger the tests, I ran:

./teuthology/virtualenv/bin/teuthology-suite -v -m smithi -c main -s rados:rest --filter-all "ubuntu_latest" -N 25 -p 75

Here, I am testing the fix:
http://pulpito.front.sepia.ceph.com/lflores-2022-06-24_18:36:01-rados:rest-wip-version-mgrupdate-distro-default-smithi/
25/25 tests passed

To trigger the tests, I ran:

./teuthology/virtualenv/bin/teuthology-suite -v -m smithi -c wip-version-mgrupdate -s rados:rest --filter-all "ubuntu_latest" -N 25 -p 75

Actions #19

Updated by Laura Flores almost 2 years ago

Laura Flores wrote:

Here, I am testing the fix:
http://pulpito.front.sepia.ceph.com/lflores-2022-06-24_18:36:01-rados:rest-wip-version-mgrupdate-distro-default-smithi/
25/25 tests passed

Looking at one of the passed jobs here, we can see that `update_pending_metadata` is getting called (indicated by `update_daemon_metadata`):
/a/lflores-2022-06-24_18:36:01-rados:rest-wip-version-mgrupdate-distro-default-smithi/6898012/remote/smithi204/log/ceph-mon.a.log.gz

2022-06-24T18:52:00.268+0000 7f90a8d21700  1 mgrc update_daemon_metadata mon.a metadata {addrs=[v2:172.21.15.204:3300/0,v1:172.21.15.204:6789/0],arch=x86_64,ceph_release=quincy,ceph_version=ceph version 17.0.0-13211-g6540937a (6540937a6d1865642dd799b4996d85d2ee9a99ad) quincy (dev),ceph_version_short=17.0.0-13211-g6540937a,compression_algorithms=none, snappy, zlib, zstd, lz4,cpu=Intel(R) Xeon(R) CPU E5-1620 v4 @ 3.50GHz,device_ids=nvme0n1=INTEL_SSDPEDMD400G4_CVFT623300GD400BGN,device_paths=nvme0n1=/dev/disk/by-path/pci-0000:02:00.0-nvme-1,devices=nvme0n1,distro=ubuntu,distro_description=Ubuntu 20.04.4 LTS,distro_version=20.04,hostname=smithi204,kernel_description=#137-Ubuntu SMP Wed Jun 15 13:33:07 UTC 2022,kernel_version=5.4.0-121-generic,mem_swap_kb=2097148,mem_total_kb=32766928,os=Linux}

In the mgr log, we can see that the monitor is no longer getting registered in the pending_service_map. Behavior has returned to normal:
zcat /a/lflores-2022-06-24_18:36:01-rados:rest-wip-version-mgrupdate-distro-default-smithi/6898012/remote/smithi204/log/ceph-mgr.x.log.gz | grep "mon\.a"

2022-06-24T18:52:04.356+0000 7fd108cf3700  4 mgr finish mon returned valid metadata JSON for mon.a
2022-06-24T18:52:05.360+0000 7fd0f7e3a700  1 -- [v2:172.21.15.204:6824/14143,v1:172.21.15.204:6825/14143] <== mon.0 172.21.15.204:0/14142 1 ==== mgropen(mon.a) v3 ==== 73646+0+0 (secure 0 0 0) 0x5624cd622240 con 0x5624cd8dd800
2022-06-24T18:52:05.360+0000 7fd0f7e3a700 10 mgr.server handle_open from mon.a 172.21.15.204:0/14142
2022-06-24T18:52:05.360+0000 7fd0f7e3a700 20 mgr.server handle_open updating existing DaemonState for mon.a
2022-06-24T18:52:05.360+0000 7fd0f7e3a700  1 -- [v2:172.21.15.204:6824/14143,v1:172.21.15.204:6825/14143] <== mon.0 172.21.15.204:0/14142 2 ==== mgrreport(mon.a +97-0 packed 1150 status=0 daemon_metrics=1 task_status=0) v9 ==== 8020+0+0 (secure 0 0 0) 0x5624cd731880 con 0x5624cd8dd800
2022-06-24T18:52:05.360+0000 7fd0f7e3a700 10 mgr.server handle_report from 0x5624cd8dd800 mon.a
2022-06-24T18:52:05.360+0000 7fd0f7e3a700 20 mgr.server handle_report updating existing DaemonState for mon.a
2022-06-24T18:52:05.360+0000 7fd0f7e3a700 -1 mgr.server handle_report got status from non-daemon mon.a
2022-06-24T18:52:05.360+0000 7fd0f7e3a700 10 mgr.server update_task_status got task status from mon.a

The only thing I'm still not seeing in the logs are calls to `handle_update()`, which would indicate that the MMgrUpdate message is getting handled. This might be expected behavior though... we will need to confirm.

Actions #21

Updated by Sridhar Seshasayee almost 2 years ago

/a/yuriw-2022-06-29_18:22:37-rados-wip-yuri2-testing-2022-06-29-0820-distro-default-smithi/6906105

Actions #22

Updated by Prashant D almost 2 years ago

Triggered another pacific upgrade teuthology run on wip-version-mgrupdate changes : http://pulpito.front.sepia.ceph.com/pdhange-2022-07-06_09:26:20-upgrade:pacific-x:stress-split-wip-pdhange-testing-distro-default-smithi/

Analysis of pass/fail jobs from http://pulpito.front.sepia.ceph.com/pdhange-2022-07-06_09:26:20-upgrade:pacific-x:stress-split-wip-pdhange-testing-distro-default-smithi/ run :

[1] 12 Passed/61 Failed/2 Dead/15 Running out of 90 jobs

[2] job 6917133 passed
[2a] the handle_update message is getting logged in mgr log :
6917133/remote/smithi083/log/4cbbd33e-fd13-11ec-842d-001a4aab830c/ceph-mgr.y.log.gz
2022-07-06T10:16:46.713+0000 7fa544373700 10 mgr.server handle_update from 0x559de01d3c00 mon.a
2022-07-06T10:16:46.713+0000 7fa544373700 20 mgr.server handle_update updating existing DaemonState for mon.a
2022-07-06T10:18:12.940+0000 7fe8cdbdb700 10 mgr.server handle_update from 0x561d8c3cb000 mon.c
2022-07-06T10:18:12.940+0000 7fe8cdbdb700 20 mgr.server handle_update updating existing DaemonState for mon.c
2022-07-06T10:19:33.375+0000 7fe8cdbdb700 10 mgr.server handle_update from 0x561d8bc91c00 mon.b
2022-07-06T10:19:33.375+0000 7fe8cdbdb700 20 mgr.server handle_update updating existing DaemonState for mon.b

[2b] test_prometheus_metrics.sh succeeded
2022-07-06T10:14:49.150 INFO:tasks.workunit.client.0.smithi083.stdout:metrics_metadata_string : ceph_daemon="mon.a",hostname="smithi083",public_addr="172.21.15.83",rank="0",ceph_version="ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable)"
2022-07-06T10:14:49.150 INFO:tasks.workunit.client.0.smithi083.stdout:mon_metadata[a] : ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable)
2022-07-06T10:14:49.151 INFO:tasks.workunit.client.0.smithi083.stdout:metrics_metadata_string : ceph_daemon="mon.c",hostname="smithi083",public_addr="172.21.15.83",rank="1",ceph_version="ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable)"
2022-07-06T10:14:49.151 INFO:tasks.workunit.client.0.smithi083.stdout:mon_metadata[c] : ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable)
2022-07-06T10:14:49.151 INFO:tasks.workunit.client.0.smithi083.stdout:metrics_metadata_string : ceph_daemon="mon.b",hostname="smithi116",public_addr="172.21.15.116",rank="2",ceph_version="ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable)"
2022-07-06T10:14:49.151 INFO:tasks.workunit.client.0.smithi083.stdout:mon_metadata[b] : ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable)
2022-07-06T10:14:49.156 INFO:tasks.workunit.client.0.smithi083.stderr:+ rm -f /tmp/mon_metadata.json
2022-07-06T10:14:49.157 INFO:tasks.workunit.client.0.smithi083.stderr:+ echo /home/ubuntu/cephtest/clone.client.0/qa/workunits/mgr/test_prometheus_metrics.sh OK
2022-07-06T10:14:49.158 INFO:tasks.workunit.client.0.smithi083.stdout:/home/ubuntu/cephtest/clone.client.0/qa/workunits/mgr/test_prometheus_metrics.sh OK

[3] job 6917134 failed
[3a] the handle_update message is getting logged in mgr log :
6917134/remote/smithi031/log/113cf0fe-fd13-11ec-842d-001a4aab830c/ceph-mgr.y.log.gz
2022-07-06T10:20:56.346+0000 7fe44d8cc700 10 mgr.server handle_update from 0x55a712ca2400 mon.a
2022-07-06T10:20:56.346+0000 7fe44d8cc700 20 mgr.server handle_update updating existing DaemonState for mon.a
2022-07-06T10:22:37.247+0000 7f288d0fc700 10 mgr.server handle_update from 0x55adc91c5c00 mon.c
2022-07-06T10:22:37.247+0000 7f288d0fc700 20 mgr.server handle_update updating existing DaemonState for mon.c
2022-07-06T10:24:12.039+0000 7f288d0fc700 10 mgr.server handle_update from 0x55add16fb400 mon.b
2022-07-06T10:24:12.039+0000 7f288d0fc700 20 mgr.server handle_update updating existing DaemonState for mon.b

[3b] no metrics_metadata_string messages logged meaning test_prometheus_metrics.sh was not initiated or failed before verifying ceph version
$ grep -r metrics_metadata_string 6917134

[3c] Failure reason
2022-07-06T11:46:51.104 INFO:tasks.workunit:Running workunits matching mgr/test_prometheus_metrics.sh on client.0...
2022-07-06T11:46:51.107 INFO:tasks.workunit:Running workunit mgr/test_prometheus_metrics.sh...
2022-07-06T11:46:51.108 DEBUG:teuthology.orchestra.run.smithi031:workunit test mgr/test_prometheus_metrics.sh> mkdir p - /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=a18d347b32e5b66717aa231d72994c914dbd4cc2 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/mgr/test_prometheus_metrics.sh
2022-07-06T11:46:51.121 INFO:tasks.workunit.client.0.smithi031.stderr:+ dirname /home/ubuntu/cephtest/clone.client.0/qa/workunits/mgr/test_prometheus_metrics.sh
2022-07-06T11:46:51.130 DEBUG:teuthology.orchestra.run:got remote process result: 2
2022-07-06T11:46:51.131 INFO:tasks.workunit.client.0.smithi031.stderr:+ mydir=/home/ubuntu/cephtest/clone.client.0/qa/workunits/mgr
2022-07-06T11:46:51.132 INFO:tasks.workunit.client.0.smithi031.stderr:+ TMP_FILE=/tmp/mon_metadata.json
2022-07-06T11:46:51.132 INFO:tasks.workunit.client.0.smithi031.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/mgr/test_prometheus_metrics.sh: 7: Syntax error: "(" unexpected
2022-07-06T11:46:51.133 INFO:tasks.workunit:Stopping ['mgr/test_prometheus_metrics.sh'] on client.0...
2022-07-06T11:46:51.134 DEBUG:teuthology.orchestra.run.smithi031:> sudo rm rf - /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0

Actions #23

Updated by Prashant D almost 2 years ago

16 jobs out of 61 failed jobs are because of syntax error in test_prometheus_metrics.sh :

6917134/teuthology.log:295521:2022-07-06T11:46:51.132 INFO:tasks.workunit.client.0.smithi031.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/mgr/test_prometheus_metrics.sh: 7: Syntax error: "(" unexpected
6917136/teuthology.log:8732:2022-07-06T10:21:33.295 INFO:tasks.workunit.client.0.smithi008.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/mgr/test_prometheus_metrics.sh: 7: Syntax error: "(" unexpected
6917140/teuthology.log:305810:2022-07-06T12:03:29.788 INFO:tasks.workunit.client.0.smithi109.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/mgr/test_prometheus_metrics.sh: 7: Syntax error: "(" unexpected
6917147/teuthology.log:312100:2022-07-06T12:10:20.061 INFO:tasks.workunit.client.0.smithi143.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/mgr/test_prometheus_metrics.sh: 7: Syntax error: "(" unexpected
6917151/teuthology.log:9307:2022-07-06T10:39:01.701 INFO:tasks.workunit.client.0.smithi098.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/mgr/test_prometheus_metrics.sh: 7: Syntax error: "(" unexpected
6917153/teuthology.log:351283:2022-07-06T12:22:55.441 INFO:tasks.workunit.client.0.smithi120.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/mgr/test_prometheus_metrics.sh: 7: Syntax error: "(" unexpected
6917159/teuthology.log:338117:2022-07-06T12:42:46.660 INFO:tasks.workunit.client.0.smithi086.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/mgr/test_prometheus_metrics.sh: 7: Syntax error: "(" unexpected
6917166/teuthology.log:9318:2022-07-06T10:58:10.928 INFO:tasks.workunit.client.0.smithi142.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/mgr/test_prometheus_metrics.sh: 7: Syntax error: "(" unexpected
6917173/teuthology.log:288108:2022-07-06T12:42:18.830 INFO:tasks.workunit.client.0.smithi018.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/mgr/test_prometheus_metrics.sh: 7: Syntax error: "(" unexpected
6917179/teuthology.log:295230:2022-07-06T12:38:41.242 INFO:tasks.workunit.client.0.smithi077.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/mgr/test_prometheus_metrics.sh: 7: Syntax error: "(" unexpected
6917181/teuthology.log:9036:2022-07-06T11:07:01.341 INFO:tasks.workunit.client.0.smithi043.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/mgr/test_prometheus_metrics.sh: 7: Syntax error: "(" unexpected
6917185/teuthology.log:295869:2022-07-06T12:46:57.035 INFO:tasks.workunit.client.0.smithi153.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/mgr/test_prometheus_metrics.sh: 7: Syntax error: "(" unexpected
6917192/teuthology.log:307289:2022-07-06T12:52:20.105 INFO:tasks.workunit.client.0.smithi157.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/mgr/test_prometheus_metrics.sh: 7: Syntax error: "(" unexpected
6917196/teuthology.log:8889:2022-07-06T11:36:44.892 INFO:tasks.workunit.client.0.smithi133.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/mgr/test_prometheus_metrics.sh: 7: Syntax error: "(" unexpected
6917198/teuthology.log:324731:2022-07-06T13:17:18.304 INFO:tasks.workunit.client.0.smithi017.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/mgr/test_prometheus_metrics.sh: 7: Syntax error: "(" unexpected
6917211/teuthology.log:9006:2022-07-06T13:11:00.390 INFO:tasks.workunit.client.0.smithi007.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/mgr/test_prometheus_metrics.sh: 7: Syntax error: "(" unexpected

Actions #24

Updated by Prashant D almost 2 years ago

The pacific upgrade tests were successful and failures are not related to PR#46838 (or prometheus upgrade test)
http://pulpito.front.sepia.ceph.com/pdhange-2022-07-15_00:43:02-upgrade:pacific-x:stress-split-wip-pdhange-testing-distro-default-smithi/

Actions #25

Updated by Laura Flores almost 2 years ago

Looking in the logs, it seems that we are successfully handling the new MMgrUpdate message when it is sent:

/a/pdhange-2022-07-15_00:43:02-upgrade:pacific-x:stress-split-wip-pdhange-testing-distro-default-smithi/6931122/remote/smithi087/log/c495cf32-03d9-11ed-842e-001a4aab830c/ceph-mgr.y.log.gz

2022-07-15T01:10:53.454+0000 7ff901df3700  1 -- [v2:172.21.15.87:6800/4120238218,v1:172.21.15.87:6801/4120238218] <== mon.0 172.21.15.87:0/2319737114 3 ==== mgrupdate(mon.a) v2 ==== 981+0+0 (secure 0 0 0) 0x555b23bb4f00 con 0x555b24383400
2022-07-15T01:10:53.454+0000 7ff901df3700 10 mgr.server handle_update from 0x555b24383400 mon.a
2022-07-15T01:10:53.454+0000 7ff901df3700 20 mgr.server handle_update updating existing DaemonState for mon.a

Actions #26

Updated by Aishwarya Mathuria almost 2 years ago

/a/yuriw-2022-07-13_19:41:18-rados-wip-yuri7-testing-2022-07-11-1631-distro-default-smithi/6929390

Actions #27

Updated by Laura Flores over 1 year ago

@Prashant your PR was just merged into main! I'll leave it to you to update this Tracker since I know that this and your original patch still needs to be backported to Quincy and Pacific.

Actions #28

Updated by Prashant D over 1 year ago

  • Status changed from Fix Under Review to Pending Backport
Actions #29

Updated by Backport Bot over 1 year ago

  • Copied to Backport #57188: pacific: test-restful.sh: mon metadata unable to be retrieved added
Actions #30

Updated by Backport Bot over 1 year ago

  • Tags set to backport_processed
Actions #31

Updated by Prashant D over 1 year ago

  • Backport changed from pacific to pacific,quincy
Actions #32

Updated by Prashant D over 1 year ago

  • Tags deleted (backport_processed)
Actions #33

Updated by Backport Bot over 1 year ago

  • Copied to Backport #57189: quincy: test-restful.sh: mon metadata unable to be retrieved added
Actions #34

Updated by Backport Bot over 1 year ago

  • Tags set to backport_processed
Actions #35

Updated by Konstantin Shalygin over 1 year ago

  • Status changed from Pending Backport to Resolved
  • % Done changed from 0 to 100
  • Tags deleted (backport_processed)
Actions

Also available in: Atom PDF