Project

General

Profile

Bug #36540

msg: messages are queued but not sent

Added by Patrick Donnelly 11 months ago. Updated 6 months ago.

Status:
New
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
Start date:
Due date:
% Done:

0%

Source:
Q/A
Tags:
Backport:
mimic,luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

In CephFS testing, we've observed transient failures caused by what
appears to messages being dropped [1,2]. These appear to have been
caused by the recent refactor PR [3,4] but I have no evidence other
than the problems appearing during testing with [4] after [4] was
merged.

I'm running tests [5] to see if I can get more debugging (debug ms =
20) but I wanted to canvas for ideas/advice before I get much deeper.
Has anyone else seen transient failures with messages getting dropped?

[1] http://tracker.ceph.com/issues/36389
[2] http://tracker.ceph.com/issues/36349
[3] https://github.com/ceph/ceph/pull/23415
[4] https://github.com/ceph/ceph/pull/24305
[5] http://pulpito.ceph.com/?branch=wip-pdonnell-testing-20181011.152759

See also thread on ceph-devel "msgr bug in master caused by recent protocol refactor (?)".

Example with "debug ms = 20":

2018-10-16 19:57:47.356 1b1c9700 5 mds.beacon.e Sending beacon
up:active seq 2214
2018-10-16 19:57:47.356 1b1c9700 1 -- 172.21.15.179:6816/2028340967
--> 172.21.15.179:6789/0 -- mdsbeacon(4307/e up:active seq 2214 v283)
v7 -- 0x209f2f40 con 0
2018-10-16 19:57:47.356 1b1c9700 20 -- 172.21.15.179:6816/2028340967

172.21.15.179:6789/0 conn(0x1f544e20 legacy :-1 s=OPENED pgs=143

cs=1 l=1).prepare_send_message m mdsbeacon(4307/e up:active seq 2214
v283) v7
2018-10-16 19:57:47.357 1b1c9700 20 -- 172.21.15.179:6816/2028340967

172.21.15.179:6789/0 conn(0x1f544e20 legacy :-1 s=OPENED pgs=143

cs=1 l=1).prepare_send_message encoding features 4611087854031142911
0x209f2f40 mdsbeacon(4307/e up:active seq 2214 v283) v7
2018-10-16 19:57:47.357 1b1c9700 15 -- 172.21.15.179:6816/2028340967

172.21.15.179:6789/0 conn(0x1f544e20 legacy :-1 s=OPENED pgs=143

cs=1 l=1).send_message inline write is denied, reschedule m=0x209f2f40

From: /ceph/teuthology-archive/pdonnell-2018-10-16_16:46:31-multimds-wip-pdonnell-testing-20181011.152759-distro-basic-smithi/3148285/remote/smithi179/log/ceph-mds.e.log.gz

It looks like the messenger just never sent the message. FWIW, the mds
and mon in this particular case are on the same host. I looked around
at other beacon sends (grep "beacon.e") and the actual send by the
msgr happens promptly afterwards. For some reason, that's not the case
case for seq=2214 ... but I'm not that familiar with msgr debugging.

and another where the connection recovered:

Another job with the same test configuration (with valgrind) but this
one is different in that the connection eventually sent the beacons
without requiring a reconnect:

2018-10-16 18:32:55.287 1b1c9700 5 mds.beacon.c Sending beacon
up:active seq 634
2018-10-16 18:32:55.287 1b1c9700 1 -- 172.21.15.156:6819/417220422
--> 172.21.15.156:6789/0 -- mdsbeacon(4315/c up:active seq 634 v331)
v7 -- 0x21aabd20 con 0
2018-10-16 18:32:55.287 1b1c9700 20 -- 172.21.15.156:6819/417220422 >>
172.21.15.156:6789/0 conn(0x1487fcd0 legacy :-1 s=OPENED pgs=78 cs=1
l=1).prepare_send_message m mdsbeacon(4315/c up:active seq 634 v331)
v7
2018-10-16 18:32:55.287 1b1c9700 20 -- 172.21.15.156:6819/417220422 >>
172.21.15.156:6789/0 conn(0x1487fcd0 legacy :-1 s=OPENED pgs=78 cs=1
l=1).prepare_send_message encoding features 4611087854031142911
0x21aabd20 mdsbeacon(4315/c up:active seq 634 v331) v7
2018-10-16 18:32:55.287 1b1c9700 15 -- 172.21.15.156:6819/417220422 >>
172.21.15.156:6789/0 conn(0x1487fcd0 legacy :-1 s=OPENED pgs=78 cs=1
l=1).send_message inline write is denied, reschedule m=0x21aabd20
...
2018-10-16 18:33:07.993 159be700 4 -- 172.21.15.156:6819/417220422 >>
172.21.15.156:6789/0 conn(0x1487fcd0 legacy :-1
s=STATE_CONNECTION_ESTABLISHED l=1).handle_write
2018-10-16 18:33:07.993 159be700 10 -- 172.21.15.156:6819/417220422 >>
172.21.15.156:6789/0 conn(0x1487fcd0 legacy :-1 s=OPENED pgs=78 cs=1
l=1).write_event
2018-10-16 18:33:07.993 159be700 10 -- 172.21.15.156:6819/417220422 >>
172.21.15.156:6789/0 conn(0x1487fcd0 legacy :-1 s=OPENED pgs=78 cs=1
l=1).append_keepalive_or_ack
2018-10-16 18:33:07.993 159be700 20 -- 172.21.15.156:6819/417220422 >>
172.21.15.156:6789/0 conn(0x1487fcd0 legacy :-1 s=OPENED pgs=78 cs=1
l=1).write_message no session security
2018-10-16 18:33:07.993 159be700 20 -- 172.21.15.156:6819/417220422 >>
172.21.15.156:6789/0 conn(0x1487fcd0 legacy :-1 s=OPENED pgs=78 cs=1
l=1).write_message sending message type=100 src mds.2 front=416 data=0
off 0
2018-10-16 18:33:07.993 159be700 20 -- 172.21.15.156:6819/417220422 >>
172.21.15.156:6789/0 conn(0x1487fcd0 legacy :-1 s=OPENED pgs=78 cs=1
l=1).write_message sending 721 0x21aabd20
2018-10-16 18:33:07.994 159be700 10 -- 172.21.15.156:6819/417220422 >>
172.21.15.156:6789/0 conn(0x1487fcd0 legacy :-1
s=STATE_CONNECTION_ESTABLISHED l=1)._try_send sent bytes 500 remaining
bytes 0

From: /ceph/teuthology-archive/pdonnell-2018-10-16_16:46:31-multimds-wip-pdonnell-testing-20181011.152759-distro-basic-smithi/3148290/remote/smithi156/log/ceph-mds.c.log.gz

So 12 seconds inexplicably spanned the beacon queue and the actual
send on the socket.

Notice it also was also slow to "receive" three mdsmaps in the same time period:

2018-10-16 18:33:07.477 159be700 5 -- 172.21.15.156:6819/417220422 >>
172.21.15.156:6789/0 conn(0x1487fcd0 legacy :-1
s=READ_FOOTER_AND_DISPATCH pgs=78 cs=1 l=1). rx mon.1 seq 998
0x12348cd0 mdsmap(e 331) v1
...
2018-10-16 18:33:07.481 159be700 5 -- 172.21.15.156:6819/417220422 >>
172.21.15.156:6789/0 conn(0x1487fcd0 legacy :-1
s=READ_FOOTER_AND_DISPATCH pgs=78 cs=1 l=1). rx mon.1 seq 999
0x12350210 mdsmap(e 332) v1
...
2018-10-16 18:33:07.486 159be700 5 -- 172.21.15.156:6819/417220422 >>
172.21.15.156:6789/0 conn(0x1487fcd0 legacy :-1
s=READ_FOOTER_AND_DISPATCH pgs=78 cs=1 l=1). rx mon.1 seq 1000
0x123652c0 mdsmap(e 333) v1

All three rapidly received in sequence.

The mons sent the above mdsmaps in a timely fashion, e.g. e331:

2018-10-16 18:32:43.343 1d101700 20 -- 172.21.15.156:6789/0 done
calling dispatch on 0x22151430
2018-10-16 18:32:43.344 1f105700 4 -- 172.21.15.156:6789/0 >>
172.21.15.156:6819/417220422 conn(0x217e96e0 legacy :6789
s=STATE_CONNECTION_ESTABLISHED l=1).handle_write
2018-10-16 18:32:43.344 1f105700 10 -- 172.21.15.156:6789/0 >>
172.21.15.156:6819/417220422 conn(0x217e96e0 legacy :6789 s=OPENED
pgs=1 cs=1 l=1).write_event
2018-10-16 18:32:43.344 1f105700 20 -- 172.21.15.156:6789/0 >>
172.21.15.156:6819/417220422 conn(0x217e96e0 legacy :6789 s=OPENED
pgs=1 cs=1 l=1).prepare_send_message m mdsmap(e 330) v1
2018-10-16 18:32:43.344 1f105700 20 -- 172.21.15.156:6789/0 >>
172.21.15.156:6819/417220422 conn(0x217e96e0 legacy :6789 s=OPENED
pgs=1 cs=1 l=1).prepare_send_message encoding features
4611087854031142911 0x224309d0 mdsmap(e 330) v1
2018-10-16 18:32:43.345 1f105700 20 -- 172.21.15.156:6789/0 >>
172.21.15.156:6819/417220422 conn(0x217e96e0 legacy :6789 s=OPENED
pgs=1 cs=1 l=1).write_message signed m=0x224309d0): sig = 0
2018-10-16 18:32:43.345 1f105700 20 -- 172.21.15.156:6789/0 >>
172.21.15.156:6819/417220422 conn(0x217e96e0 legacy :6789 s=OPENED
pgs=1 cs=1 l=1).write_message sending message type=21 src mon.1
front=1905 data=0 off 0
2018-10-16 18:32:43.345 1f105700 20 -- 172.21.15.156:6789/0 >>
172.21.15.156:6819/417220422 conn(0x217e96e0 legacy :6789 s=OPENED
pgs=1 cs=1 l=1).write_message sending 995 0x224309d0
2018-10-16 18:32:43.345 1f105700 10 -- 172.21.15.156:6789/0 >>
172.21.15.156:6819/417220422 conn(0x217e96e0 legacy :6789
s=STATE_CONNECTION_ESTABLISHED l=1)._try_send sent bytes 1980
remaining bytes 0
2018-10-16 18:32:43.345 1f105700 10 -- 172.21.15.156:6789/0 >>
172.21.15.156:6819/417220422 conn(0x217e96e0 legacy :6789 s=OPENED
pgs=1 cs=1 l=1).write_message sending 0x224309d0 done.

From: /ceph/teuthology-archive/pdonnell-2018-10-16_16:46:31-multimds-wip-pdonnell-testing-20181011.152759-distro-basic-smithi/3148290/remote/smithi156/log/ceph-mon.a.log.gz


Related issues

Related to fs - Bug #36389: untar encounters unexpected EPERM on kclient/multimds cluster with thrashing New 10/10/2018
Related to fs - Bug #36349: mds: src/mds/MDCache.cc: 1637: FAILED ceph_assert(follows >= realm->get_newest_seq()) Can't reproduce 10/08/2018
Related to Messengers - Bug #36666: msg: rejoin message queued but not sent Duplicate

History

#1 Updated by Patrick Donnelly 11 months ago

  • Related to Bug #36389: untar encounters unexpected EPERM on kclient/multimds cluster with thrashing added

#2 Updated by Patrick Donnelly 11 months ago

  • Related to Bug #36349: mds: src/mds/MDCache.cc: 1637: FAILED ceph_assert(follows >= realm->get_newest_seq()) added

#3 Updated by Patrick Donnelly 11 months ago

  • Subject changed from msg: messages are not queued but not sent to msg: messages are queued but not sent

#4 Updated by Patrick Donnelly 11 months ago

  • Related to Bug #36666: msg: rejoin message queued but not sent added

#5 Updated by Sage Weil 8 months ago

  • Project changed from Ceph to RADOS

#6 Updated by Greg Farnum 6 months ago

  • Project changed from RADOS to Messengers

#7 Updated by Patrick Donnelly 6 months ago

I haven't seen this recently. Usually I grep for "no longer laggy" in MDS logs within the multimds suite runs. Right now that's cluttered with false positives due to valgrind. I have opened #38723 to get rid of those.

#8 Updated by Patrick Donnelly 6 months ago

No, the problem still exists:

2019-03-13 21:50:45.156 1b269700  5 mds.beacon.i Sending beacon up:active seq 795
2019-03-13 21:50:45.189 1b269700  1 -- [v2:172.21.15.169:6836/237079033,v1:172.21.15.169:6837/237079033] --> [v2:172.21.15.8:3301/0,v1:172.21.15.8:6790/0] -- mdsbeacon(4456/i up:active seq 795 v409) v7 -- 0x1f2eac40 con 0x2105ce10
2019-03-13 21:50:45.189 1b269700 20 mds.beacon.i sender thread waiting interval 4s
...
2019-03-13 21:50:57.721 1b269700  5 mds.beacon.i Sending beacon up:active seq 798
2019-03-13 21:50:57.721 1b269700  1 -- [v2:172.21.15.169:6836/237079033,v1:172.21.15.169:6837/237079033] --> [v2:172.21.15.8:3301/0,v1:172.21.15.8:6790/0] -- mdsbeacon(4456/i up:active seq 798 v409) v7 -- 0x2824fcc0 con 0x2105ce10
2019-03-13 21:50:57.722 1b269700 20 mds.beacon.i sender thread waiting interval 4s
...
2019-03-13 21:51:06.111 18263700  5 mds.6.18  laggy, deferring lock(a=nudge inest 0x50000002025.head) v1
2019-03-13 21:51:06.140 18263700  1 -- [v2:172.21.15.169:6836/237079033,v1:172.21.15.169:6837/237079033] <== mon.2 v2:172.21.15.8:3301/0 952 ==== mdsmap(e 410) v1 ==== 2276+0+0 (crc 0 0 0) 0x121258d0 con 0x2105ce10
2019-03-13 21:51:06.229 18263700  5 mds.i handle_mds_map old map epoch 410 <= 414, discarding
2019-03-13 21:51:06.229 18263700  1 -- [v2:172.21.15.169:6836/237079033,v1:172.21.15.169:6837/237079033] <== mon.2 v2:172.21.15.8:3301/0 953 ==== mdsmap(e 411) v1 ==== 2272+0+0 (crc 0 0 0) 0x149d8780 con 0x2105ce10
2019-03-13 21:51:06.229 18263700  5 mds.i handle_mds_map old map epoch 411 <= 414, discarding
2019-03-13 21:51:06.229 18263700  1 -- [v2:172.21.15.169:6836/237079033,v1:172.21.15.169:6837/237079033] <== mon.2 v2:172.21.15.8:3301/0 954 ==== mdsmap(e 412) v1 ==== 2268+0+0 (crc 0 0 0) 0x1b5b45f0 con 0x2105ce10
2019-03-13 21:51:06.230 18263700  5 mds.i handle_mds_map old map epoch 412 <= 414, discarding
2019-03-13 21:51:06.230 18263700  1 -- [v2:172.21.15.169:6836/237079033,v1:172.21.15.169:6837/237079033] <== mon.2 v2:172.21.15.8:3301/0 955 ==== mdsmap(e 413) v1 ==== 2264+0+0 (crc 0 0 0) 0x1b627980 con 0x2105ce10
2019-03-13 21:51:06.231 18263700  5 mds.i handle_mds_map old map epoch 413 <= 414, discarding
2019-03-13 21:51:06.231 18263700  1 -- [v2:172.21.15.169:6836/237079033,v1:172.21.15.169:6837/237079033] <== mon.2 v2:172.21.15.8:3301/0 956 ==== mdsmap(e 414) v1 ==== 2264+0+0 (crc 0 0 0) 0x1b653d30 con 0x2105ce10
2019-03-13 21:51:06.231 18263700  5 mds.i handle_mds_map old map epoch 414 <= 414, discarding
...
2019-03-13 21:51:06.311 1625f700  1 -- [v2:172.21.15.169:6836/237079033,v1:172.21.15.169:6837/237079033] <== mon.2 v2:172.21.15.8:3301/0 957 ==== mdsbeacon(4456/i up:active seq 795 v414) v7 ==== 126+0+0 (crc 0 0 0) 0x14681c10 con 0x2105ce10
2019-03-13 21:51:06.312 1625f700  5 mds.beacon.i received beacon reply up:active seq 795 rtt 21.1565
2019-03-13 21:51:06.320 1625f700  1 -- [v2:172.21.15.169:6836/237079033,v1:172.21.15.169:6837/237079033] <== mon.2 v2:172.21.15.8:3301/0 958 ==== mdsbeacon(4456/i up:active seq 796 v414) v7 ==== 126+0+0 (crc 0 0 0) 0x1f4576e0 con 0x2105ce10
2019-03-13 21:51:06.320 1625f700  5 mds.beacon.i received beacon reply up:active seq 796 rtt 17.0176
2019-03-13 21:51:06.321 1625f700  1 -- [v2:172.21.15.169:6836/237079033,v1:172.21.15.169:6837/237079033] <== mon.2 v2:172.21.15.8:3301/0 959 ==== mdsbeacon(4456/i up:active seq 797 v414) v7 ==== 126+0+0 (crc 0 0 0) 0x20898ff0 con 0x2105ce10
2019-03-13 21:51:06.321 1625f700  5 mds.beacon.i received beacon reply up:active seq 797 rtt 12.7697
2019-03-13 21:51:06.322 1625f700  0 mds.beacon.i  MDS is no longer laggy

From: /ceph/teuthology-archive/pdonnell-2019-03-13_19:07:54-multimds-master-distro-basic-smithi/3718401/remote/smithi169/log/ceph-mds.i.log.gz

mon gets all of these beacons in rapid succession:

2019-03-13 21:51:06.227 1d9a4700  1 -- [v2:172.21.15.8:3301/0,v1:172.21.15.8:6790/0] <== mds.6 v2:172.21.15.169:6836/237079033 672 ==== mdsbeacon(4456/i up:active seq 795 v409) v7 ==== 416+0+0 (crc 0 0 0) 0x1a0de5a0 con 0x2292ebc0
...
2019-03-13 21:51:06.243 1d9a4700  1 -- [v2:172.21.15.8:3301/0,v1:172.21.15.8:6790/0] <== mds.6 v2:172.21.15.169:6836/237079033 677 ==== mdsbeacon(4456/i up:active seq 800 v414) v7 ==== 416+0+0 (crc 0 0 0) 0x21f87780 con 0x2292ebc0

From: /ceph/teuthology-archive/pdonnell-2019-03-13_19:07:54-multimds-master-distro-basic-smithi/3718401/remote/smithi008/log/ceph-mon.c.log.gz

Also available in: Atom PDF