Project

General

Profile

Actions

Bug #40376

closed

MON is not processing requests

Added by Rafal Wadolowski almost 5 years ago. Updated almost 5 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Monitor doesn't consume operations in queue. On peon they are stucked.

Function responsible for that is located at: https://github.com/ceph/ceph/blob/94246359f2f802da12637bdf41559a1ee747b687/src/mon/OSDMonitor.cc#L3706

Another thing is that peons are resending messages to leader, so the events list are growing.
This is blocking monitors for normal operations, so finally it stopped cluster.
There aren't any errors/warning on mon side

2019-06-14 21:31:29.941941 7f2d7e9f2700  1 -- 10.11.3.201:6789/0 <== mon.2 10.11.3.203:6789/0 539390011 ==== paxos(lease_ack lc 53882586 fc 53881925 pn 0 opn 0) v4 ==== 166+0+0 (1427837865 0 0) 0x559cd6a46000 co
n 0x559cd31db800
2019-06-14 21:31:29.941951 7f2d7e9f2700 20 mon.rgw1@0(leader) e4 _ms_dispatch existing session 0x559cd26a9f80 for mon.2 10.11.3.203:6789/0
2019-06-14 21:31:29.941953 7f2d7e9f2700 20 mon.rgw1@0(leader) e4  caps allow *
2019-06-14 21:31:29.941955 7f2d7e9f2700 20 is_capable service=mon command= read on cap allow *
2019-06-14 21:31:29.941956 7f2d7e9f2700 20  allow so far , doing grant allow *
2019-06-14 21:31:29.941957 7f2d7e9f2700 20  allow all
2019-06-14 21:31:29.941957 7f2d7e9f2700 20 is_capable service=mon command= exec on cap allow *
2019-06-14 21:31:29.941958 7f2d7e9f2700 20  allow so far , doing grant allow *
2019-06-14 21:31:29.941958 7f2d7e9f2700 20  allow all
2019-06-14 21:31:31.639841 7f2d7e9f2700  1 -- 10.11.3.201:6789/0 <== mon.2 10.11.3.203:6789/0 539390012 ==== forward(osd_beacon(pgs [5.d0c5,5.cd,5.d13d,5.c610,5.be5b,5.c594,5.f2ea,5.96f9,5.9fa9,5.64f8,5.426b,5.8
f09,5.a393,5.8e9c,5.8e4c,5.8975,5.a9aa,5.4655,5.ff9c,5.5fce,5.d90f,5.3a4d,5.35bb,5.76ea,5.27e1,5.22ab,5.51a,7.74d,5.2c61,5.4388,5.f299,5.ff79,5.4782,5.c3dd,5.21aa,5.e055,5.f1b6,5.5853,7.7ca,5.e25a,5.eab8,5.20a7,
5.6122,5.b0d7,5.5828,5.9f42,5.3b01,5.9364,5.52de,5.fd31,5.3b26,5.d82a,5.ba8b,5.e914,5.a18c,5.8f4d,5.bb99,5.58e5,5.35a2,5.f817,5.e4c6,5.d1d2,5.39a7,5.9216,5.4ce7,5.dd8b,5.36d6,5.e68a,5.f24c,5.c09d,5.49a8,5.d22d,5
.d3c4,5.5764,5.896d,5.4ecb,5.7510,5.24bc,5.4246,5.9e8b,5.f0f0,5.b940,5.e8db,5.2e71,5.4001,5.e028,5.39c9,5.4a19,5.b26b,5.80a8,5.680,5.1814,5.bc94,5.1705,5.b011,5.cfe5,5.21c4,5.7dd9,5.373c,5.6fad,5.915,5.968b,5.49
93,5.c2d3,5.381c,5.bca7,5.c9d9,5.becf,5.40d2,5.6945,5.6dea,5.f0fa,5.6dd1,5.4200,5.6fe7,5.707c,5.11ea,5.798b,5.e7f0,5.8170,5.837d,5.5e23] lec 857167 v874020) v1 caps allow profile osd tid 851 con_features 2305244
844532236283) v3 ==== 2277+0+0 (2559699423 0 0) 0x559cd6161e00 con 0x559cd31db800
2019-06-14 21:31:31.639891 7f2d7e9f2700 20 mon.rgw1@0(leader) e4 _ms_dispatch existing session 0x559cd26a9f80 for mon.2 10.11.3.203:6789/0
2019-06-14 21:31:31.639897 7f2d7e9f2700 20 mon.rgw1@0(leader) e4  caps allow *
2019-06-14 21:31:31.639900 7f2d7e9f2700 20 is_capable service=mon command= read on cap allow *
2019-06-14 21:31:31.639902 7f2d7e9f2700 20  allow so far , doing grant allow *
2019-06-14 21:31:31.639903 7f2d7e9f2700 20  allow all
2019-06-14 21:31:31.639904 7f2d7e9f2700 10 mon.rgw1@0(leader) e4 received forwarded message from osd.3152 10.11.2.64:6800/4185493 via mon.2 10.11.3.203:6789/0
2019-06-14 21:31:31.639908 7f2d7e9f2700 20 is_capable service=mon command= exec on cap allow *
2019-06-14 21:31:31.639910 7f2d7e9f2700 20  allow so far , doing grant allow *
2019-06-14 21:31:31.639912 7f2d7e9f2700 20  allow all
2019-06-14 21:31:31.639926 7f2d7e9f2700 10 mon.rgw1@0(leader) e4  caps are allow profile osd
2019-06-14 21:31:31.639927 7f2d7e9f2700 10 mon.rgw1@0(leader) e4  entity name 'osd.3152' type 4
2019-06-14 21:31:31.639929 7f2d7e9f2700 10 mon.rgw1@0(leader) e4  mesg 0x559cd7d2ef40 from 10.11.3.203:6789/0
2019-06-14 21:31:31.639934 7f2d7e9f2700 20 mon.rgw1@0(leader) e4 _ms_dispatch existing session 0x559cd21a8300 for osd.3152 -
2019-06-14 21:31:31.639935 7f2d7e9f2700 20 mon.rgw1@0(leader) e4  caps allow profile osd
2019-06-14 21:31:31.639948 7f2d7e9f2700 10 mon.rgw1@0(leader).osd e874020 preprocess_query osd_beacon(pgs [5.d0c5,5.cd,5.d13d,5.c610,5.be5b,5.c594,5.f2ea,5.96f9,5.9fa9,5.64f8,5.426b,5.8f09,5.a393,5.8e9c,5.8e4c,5
.8975,5.a9aa,5.4655,5.ff9c,5.5fce,5.d90f,5.3a4d,5.35bb,5.76ea,5.27e1,5.22ab,5.51a,7.74d,5.2c61,5.4388,5.f299,5.ff79,5.4782,5.c3dd,5.21aa,5.e055,5.f1b6,5.5853,7.7ca,5.e25a,5.eab8,5.20a7,5.6122,5.b0d7,5.5828,5.9f4
2,5.3b01,5.9364,5.52de,5.fd31,5.3b26,5.d82a,5.ba8b,5.e914,5.a18c,5.8f4d,5.bb99,5.58e5,5.35a2,5.f817,5.e4c6,5.d1d2,5.39a7,5.9216,5.4ce7,5.dd8b,5.36d6,5.e68a,5.f24c,5.c09d,5.49a8,5.d22d,5.d3c4,5.5764,5.896d,5.4ecb
,5.7510,5.24bc,5.4246,5.9e8b,5.f0f0,5.b940,5.e8db,5.2e71,5.4001,5.e028,5.39c9,5.4a19,5.b26b,5.80a8,5.680,5.1814,5.bc94,5.1705,5.b011,5.cfe5,5.21c4,5.7dd9,5.373c,5.6fad,5.915,5.968b,5.4993,5.c2d3,5.381c,5.bca7,5.
c9d9,5.becf,5.40d2,5.6945,5.6dea,5.f0fa,5.6dd1,5.4200,5.6fe7,5.707c,5.11ea,5.798b,5.e7f0,5.8170,5.837d,5.5e23] lec 857167 v874020) v1 from osd.3152 10.11.2.64:6800/4185493
2019-06-14 21:31:31.639962 7f2d7e9f2700 20 is_capable service=osd command= exec on cap allow profile osd
2019-06-14 21:31:31.639964 7f2d7e9f2700 20  allow so far , doing grant allow profile osd
2019-06-14 21:31:31.639977 7f2d7e9f2700 20  match
2019-06-14 21:31:31.639978 7f2d7e9f2700  7 mon.rgw1@0(leader).osd e874020 prepare_update osd_beacon(pgs [5.d0c5,5.cd,5.d13d,5.c610,5.be5b,5.c594,5.f2ea,5.96f9,5.9fa9,5.64f8,5.426b,5.8f09,5.a393,5.8e9c,5.8e4c,5.8
975,5.a9aa,5.4655,5.ff9c,5.5fce,5.d90f,5.3a4d,5.35bb,5.76ea,5.27e1,5.22ab,5.51a,7.74d,5.2c61,5.4388,5.f299,5.ff79,5.4782,5.c3dd,5.21aa,5.e055,5.f1b6,5.5853,7.7ca,5.e25a,5.eab8,5.20a7,5.6122,5.b0d7,5.5828,5.9f42,
5.3b01,5.9364,5.52de,5.fd31,5.3b26,5.d82a,5.ba8b,5.e914,5.a18c,5.8f4d,5.bb99,5.58e5,5.35a2,5.f817,5.e4c6,5.d1d2,5.39a7,5.9216,5.4ce7,5.dd8b,5.36d6,5.e68a,5.f24c,5.c09d,5.49a8,5.d22d,5.d3c4,5.5764,5.896d,5.4ecb,5
.7510,5.24bc,5.4246,5.9e8b,5.f0f0,5.b940,5.e8db,5.2e71,5.4001,5.e028,5.39c9,5.4a19,5.b26b,5.80a8,5.680,5.1814,5.bc94,5.1705,5.b011,5.cfe5,5.21c4,5.7dd9,5.373c,5.6fad,5.915,5.968b,5.4993,5.c2d3,5.381c,5.bca7,5.c9
d9,5.becf,5.40d2,5.6945,5.6dea,5.f0fa,5.6dd1,5.4200,5.6fe7,5.707c,5.11ea,5.798b,5.e7f0,5.8170,5.837d,5.5e23] lec 857167 v874020) v1 from osd.3152 10.11.2.64:6800/4185493
2019-06-14 21:31:31.639990 7f2d7e9f2700 10 mon.rgw1@0(leader).osd e874020 prepare_beacon osd_beacon(pgs [5.d0c5,5.cd,5.d13d,5.c610,5.be5b,5.c594,5.f2ea,5.96f9,5.9fa9,5.64f8,5.426b,5.8f09,5.a393,5.8e9c,5.8e4c,5.8
975,5.a9aa,5.4655,5.ff9c,5.5fce,5.d90f,5.3a4d,5.35bb,5.76ea,5.27e1,5.22ab,5.51a,7.74d,5.2c61,5.4388,5.f299,5.ff79,5.4782,5.c3dd,5.21aa,5.e055,5.f1b6,5.5853,7.7ca,5.e25a,5.eab8,5.20a7,5.6122,5.b0d7,5.5828,5.9f42,
5.3b01,5.9364,5.52de,5.fd31,5.3b26,5.d82a,5.ba8b,5.e914,5.a18c,5.8f4d,5.bb99,5.58e5,5.35a2,5.f817,5.e4c6,5.d1d2,5.39a7,5.9216,5.4ce7,5.dd8b,5.36d6,5.e68a,5.f24c,5.c09d,5.49a8,5.d22d,5.d3c4,5.5764,5.896d,5.4ecb,5
.7510,5.24bc,5.4246,5.9e8b,5.f0f0,5.b940,5.e8db,5.2e71,5.4001,5.e028,5.39c9,5.4a19,5.b26b,5.80a8,5.680,5.1814,5.bc94,5.1705,5.b011,5.cfe5,5.21c4,5.7dd9,5.373c,5.6fad,5.915,5.968b,5.4993,5.c2d3,5.381c,5.bca7,5.c9
d9,5.becf,5.40d2,5.6945,5.6dea,5.f0fa,5.6dd1,5.4200,5.6fe7,5.707c,5.11ea,5.798b,5.e7f0,5.8170,5.837d,5.5e23] lec 857167 v874020) v1 from osd.3152
2019-06-14 21:31:32.316483 7f2d8ba0c700  1 -- 10.11.3.201:6789/0 >> - conn(0x559cd7d15000 :6789 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=22 -
2019-06-14 21:31:32.316718 7f2d8ba0c700 10 mon.rgw1@0(leader) e4 ms_verify_authorizer 10.11.3.203:0/3880419125 client protocol 0
2019-06-14 21:31:32.316947 7f2d7e9f2700  1 -- 10.11.3.201:6789/0 <== client.? 10.11.3.203:0/3880419125 1 ==== auth(proto 0 37 bytes epoch 0) v1 ==== 67+0+0 (1222061535 0 0) 0x559cd6399400 con 0x559cd7d15000
2019-06-14 21:31:32.316971 7f2d7e9f2700 10 mon.rgw1@0(leader) e4 _ms_dispatch new session 0x559cd21a8300 MonSession(client.? 10.11.3.203:0/3880419125 is open ) features 0x1ffddff8eea4fffb
2019-06-14 21:31:32.316978 7f2d7e9f2700 20 mon.rgw1@0(leader) e4  caps
2019-06-14 21:31:32.316990 7f2d7e9f2700 10 mon.rgw1@0(leader).auth v22555 preprocess_query auth(proto 0 37 bytes epoch 0) v1 from client.? 10.11.3.203:0/3880419125
2019-06-14 21:31:32.316994 7f2d7e9f2700 10 mon.rgw1@0(leader).auth v22555 prep_auth() blob_size=37
2019-06-14 21:31:32.317005 7f2d7e9f2700 10 mon.rgw1@0(leader).auth v22555 AuthMonitor::assign_global_id m=auth(proto 0 37 bytes epoch 0) v1 mon=0/3 last_allocated=212284299 max_global_id=212294096
2019-06-14 21:31:32.317008 7f2d7e9f2700 10 mon.rgw1@0(leader).auth v22555 next_global_id should be 212284302
2019-06-14 21:31:32.317019 7f2d7e9f2700  1 -- 10.11.3.201:6789/0 --> 10.11.3.203:0/3880419125 -- mon_map magic: 0 v1 -- 0x559cd7d2ef40 con 0
2019-06-14 21:31:32.317054 7f2d7e9f2700  2 mon.rgw1@0(leader) e4 send_reply 0x559cd7337340 0x559cd2b68f00 auth_reply(proto 2 0 (0) Success) v1
2019-06-14 21:31:32.317065 7f2d7e9f2700  1 -- 10.11.3.201:6789/0 --> 10.11.3.203:0/3880419125 -- auth_reply(proto 2 0 (0) Success) v1 -- 0x559cd2b68f00 con 0
2019-06-14 21:31:32.317703 7f2d7e9f2700  1 -- 10.11.3.201:6789/0 <== client.? 10.11.3.203:0/3880419125 2 ==== auth(proto 2 32 bytes epoch 0) v1 ==== 62+0+0 (709281538 0 0) 0x559cd2b68f00 con 0x559cd7d15000
2019-06-14 21:31:32.317714 7f2d7e9f2700 20 mon.rgw1@0(leader) e4 _ms_dispatch existing session 0x559cd21a8300 for client.? 10.11.3.203:0/3880419125
2019-06-14 21:31:32.317718 7f2d7e9f2700 20 mon.rgw1@0(leader) e4  caps
2019-06-14 21:31:32.317724 7f2d7e9f2700 10 mon.rgw1@0(leader).auth v22555 preprocess_query auth(proto 2 32 bytes epoch 0) v1 from client.? 10.11.3.203:0/3880419125
2019-06-14 21:31:32.317727 7f2d7e9f2700 10 mon.rgw1@0(leader).auth v22555 prep_auth() blob_size=32
2019-06-14 21:31:32.317863 7f2d7e9f2700  2 mon.rgw1@0(leader) e4 send_reply 0x559cd7337340 0x559cd6399400 auth_reply(proto 2 0 (0) Success) v1
2019-06-14 21:31:32.317874 7f2d7e9f2700  1 -- 10.11.3.201:6789/0 --> 10.11.3.203:0/3880419125 -- auth_reply(proto 2 0 (0) Success) v1 -- 0x559cd6399400 con 0
2019-06-14 21:31:32.318254 7f2d7e9f2700  1 -- 10.11.3.201:6789/0 <== client.? 10.11.3.203:0/3880419125 3 ==== auth(proto 2 181 bytes epoch 0) v1 ==== 211+0+0 (436331968 0 0) 0x559cd6399400 con 0x559cd7d15000
2019-06-14 21:31:32.318265 7f2d7e9f2700 20 mon.rgw1@0(leader) e4 _ms_dispatch existing session 0x559cd21a8300 for client.? 10.11.3.203:0/3880419125
2019-06-14 21:31:32.318268 7f2d7e9f2700 20 mon.rgw1@0(leader) e4  caps allow rw
2019-06-14 21:31:32.318274 7f2d7e9f2700 10 mon.rgw1@0(leader).auth v22555 preprocess_query auth(proto 2 181 bytes epoch 0) v1 from client.? 10.11.3.203:0/3880419125
2019-06-14 21:31:32.318277 7f2d7e9f2700 10 mon.rgw1@0(leader).auth v22555 prep_auth() blob_size=181
2019-06-14 21:31:32.318425 7f2d7e9f2700  2 mon.rgw1@0(leader) e4 send_reply 0x559cd7337340 0x559cd2b68f00 auth_reply(proto 2 0 (0) Success) v1
2019-06-14 21:31:32.318434 7f2d7e9f2700  1 -- 10.11.3.201:6789/0 --> 10.11.3.203:0/3880419125 -- auth_reply(proto 2 0 (0) Success) v1 -- 0x559cd2b68f00 con 0
2019-06-14 21:31:32.318943 7f2d7e9f2700  1 -- 10.11.3.201:6789/0 <== client.? 10.11.3.203:0/3880419125 4 ==== mon_subscribe({monmap=0+}) v2 ==== 23+0+0 (1620593354 0 0) 0x559cd7d2ef40 con 0x559cd7d15000
2019-06-14 21:31:32.318953 7f2d7e9f2700 20 mon.rgw1@0(leader) e4 _ms_dispatch existing session 0x559cd21a8300 for client.? 10.11.3.203:0/3880419125
2019-06-14 21:31:32.318956 7f2d7e9f2700 20 mon.rgw1@0(leader) e4  caps allow rw
2019-06-14 21:31:32.318958 7f2d7e9f2700 20 is_capable service=mon command= read on cap allow rw
2019-06-14 21:31:32.318960 7f2d7e9f2700 20  allow so far , doing grant allow rw
2019-06-14 21:31:32.318961 7f2d7e9f2700 20  match

Files

mon.1_(peon)_ops (6.04 KB) mon.1_(peon)_ops Rafal Wadolowski, 06/14/2019 07:39 PM
mon.2_(peon)_ops (23.2 KB) mon.2_(peon)_ops Rafal Wadolowski, 06/14/2019 07:39 PM
Actions #1

Updated by Rafal Wadolowski almost 5 years ago

It looks like the reply is needed like in this bug https://github.com/ceph/ceph/pull/20467 . Am I right?

Actions #2

Updated by Greg Farnum almost 5 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (Monitor)
  • Status changed from New to Closed
  • Component(RADOS) Monitor added

Rafal Wadolowski wrote:

It looks like the reply is needed like in this bug https://github.com/ceph/ceph/pull/20467 . Am I right?

Yes. There were a number of leaked ops which were fixed throughout the luminous point releases, including the OSD beacons. If you upgrade from 12.2.4 to the latest luminous it should be fixed.

Actions

Also available in: Atom PDF