Bug #19605

closed

OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop)

Added by John Spray about 7 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Correctness/Safety
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):


Files

osd.1.log.bz2 (723 KB) osd.1.log.bz2 Kefu Chai, 08/03/2017 09:41 AM
osd.0.log-0.bz2 (672 KB) osd.0.log-0.bz2 first part of osd.0.log Kefu Chai, 08/17/2017 11:43 AM
osd.0.log-1.bz2 (586 KB) osd.0.log-1.bz2 second part of osd.0.log Kefu Chai, 08/17/2017 11:43 AM

Related issues 2 (0 open, 2 closed)

Related to RADOS - Bug #25174: osd: assert failure with FAILED assert(repop_queue.front() == repop) In function 'void PrimaryLogPG::eval_repop(PrimaryLogPG::RepGather*)' (status: Can't reproduce, assignee: Neha Ojha)

Copied to RADOS - Backport #21242: luminous: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop) (status: Resolved, assignee: Nathan Cutler)
Actions #1

Updated by Zengran Zhang almost 7 years ago

2017-05-19 22:58:05.142834 7f14de2a2700 0 osd.0 pg_epoch: 78440 pg[9.10cs0( v 78440'6350 (78438'4241,78440'6350] local-les=78440 n=8 ec=77949 les/c/f 78440/78440/0 78439/78439/78372) [0,2147483647,6,7,10,1] r=0 lpr=78439 luod=78440'6348 crt=78440'6319 lcod 78440'6319 mlcod 78440'6319 active+undersized+degraded] removing repgather(0x55e708089000 78440'6348 rep_tid=26665 committed?=1 applied?=1 r=0)
2017-05-19 22:58:05.142849 7f14de2a2700 0 osd.0 pg_epoch: 78440 pg[9.10cs0( v 78440'6350 (78438'4241,78440'6350] local-les=78440 n=8 ec=77949 les/c/f 78440/78440/0 78439/78439/78372) [0,2147483647,6,7,10,1] r=0 lpr=78439 luod=78440'6348 crt=78440'6319 lcod 78440'6319 mlcod 78440'6319 active+undersized+degraded] q front is repgather(0x55e6f92c9980 78440'6343 rep_tid=26656 committed?=0 applied?=0 r=0)
2017-05-19 22:58:05.149253 7f14de2a2700 -1 /build/ceph-12.0.2/src/osd/PrimaryLogPG.cc: In function 'void PrimaryLogPG::eval_repop(PrimaryLogPG::RepGather*)' thread 7f14de2a2700 time 2017-05-19 22:58:05.142856
/build/ceph-12.0.2/src/osd/PrimaryLogPG.cc: 8615: FAILED assert(repop_queue.front() == repop)

ceph version 12.0.2 (5a1b6b3269da99a18984c138c23935e5eb96f73e)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55e6edbb2072]
2: (PrimaryLogPG::eval_repop(PrimaryLogPG::RepGather*)+0x566) [0x55e6ed74a0a6]
3: (PrimaryLogPG::repop_all_applied(PrimaryLogPG::RepGather*)+0x74) [0x55e6ed74a9f4]
4: (Context::complete(int)+0x9) [0x55e6ed61e769]
5: (ECBackend::handle_sub_write_reply(pg_shard_t, ECSubWriteReply const&)+0x1a6) [0x55e6ed8d69f6]
6: (ECBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x2be) [0x55e6ed8df91e]
7: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x47b) [0x55e6ed74b76b]
8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1bf) [0x55e6ed5e0ccf]
9: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x55e6ed5e1147]
10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x1215) [0x55e6ed60cc95]
11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x7f9) [0x55e6edbb6e29]
12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55e6edbba1e0]
13: (()+0x76ba) [0x7f14f4f926ba]
14: (clone()+0x6d) [0x7f14f400982d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Actions #2

Updated by Zengran Zhang almost 7 years ago

Zengran Zhang wrote:

2017-05-19 22:48:23.854608 7f14f1c1e700 0 -- 10.10.133.1:6823/2019 >> 10.10.133.1:6819/19544 conn(0x55e703369800 :6823 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING
2017-05-19 22:48:25.698578 7f14f1c1e700 0 -- 10.10.133.1:6823/2019 >> 10.10.133.1:6801/2970 conn(0x55e704830000 :-1 s=STATE_OPEN pgs=12 cs=1 l=0).fault initiating reconnect
2017-05-19 22:48:52.637936 7f14f1c1e700 0 -- 10.10.133.1:6823/2019 >> 10.10.133.1:6801/22328 conn(0x55e703ae4800 :6823 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING
2017-05-19 22:58:05.142005 7f14f1c1e700 0 -- 10.10.133.1:6823/2019 >> 10.10.133.1:6813/19500 conn(0x55e70327c000 :6823 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept we reset (peer sent cseq 1), sending RESETSESSION
2017-05-19 22:58:05.142720 7f14f1c1e700 0 -- 10.10.133.1:6823/2019 >> 10.10.133.1:6813/19500 conn(0x55e70327c000 :6823 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=20 cs=1 l=0).process missed message? skipped from seq 0 to 1934764039
2017-05-19 22:58:05.142834 7f14de2a2700 0 osd.0 pg_epoch: 78440 pg[9.10cs0( v 78440'6350 (78438'4241,78440'6350] local-les=78440 n=8 ec=77949 les/c/f 78440/78440/0 78439/78439/78372) [0,2147483647,6,7,10,1] r=0 lpr=78439 luod=78440'6348 crt=78440'6319 lcod 78440'6319 mlcod 78440'6319 active+undersized+degraded] removing repgather(0x55e708089000 78440'6348 rep_tid=26665 committed?=1 applied?=1 r=0)
2017-05-19 22:58:05.142849 7f14de2a2700 0 osd.0 pg_epoch: 78440 pg[9.10cs0( v 78440'6350 (78438'4241,78440'6350] local-les=78440 n=8 ec=77949 les/c/f 78440/78440/0 78439/78439/78372) [0,2147483647,6,7,10,1] r=0 lpr=78439 luod=78440'6348 crt=78440'6319 lcod 78440'6319 mlcod 78440'6319 active+undersized+degraded] q front is repgather(0x55e6f92c9980 78440'6343 rep_tid=26656 committed?=0 applied?=0 r=0)
2017-05-19 22:58:05.149253 7f14de2a2700 -1 /build/ceph-12.0.2/src/osd/PrimaryLogPG.cc: In function 'void PrimaryLogPG::eval_repop(PrimaryLogPG::RepGather*)' thread 7f14de2a2700 time 2017-05-19 22:58:05.142856
/build/ceph-12.0.2/src/osd/PrimaryLogPG.cc: 8615: FAILED assert(repop_queue.front() == repop)

ceph version 12.0.2 (5a1b6b3269da99a18984c138c23935e5eb96f73e)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55e6edbb2072]
2: (PrimaryLogPG::eval_repop(PrimaryLogPG::RepGather*)+0x566) [0x55e6ed74a0a6]
3: (PrimaryLogPG::repop_all_applied(PrimaryLogPG::RepGather*)+0x74) [0x55e6ed74a9f4]
4: (Context::complete(int)+0x9) [0x55e6ed61e769]
5: (ECBackend::handle_sub_write_reply(pg_shard_t, ECSubWriteReply const&)+0x1a6) [0x55e6ed8d69f6]
6: (ECBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x2be) [0x55e6ed8df91e]
7: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x47b) [0x55e6ed74b76b]
8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1bf) [0x55e6ed5e0ccf]
9: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x55e6ed5e1147]
10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x1215) [0x55e6ed60cc95]
11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x7f9) [0x55e6edbb6e29]
12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55e6edbba1e0]
13: (()+0x76ba) [0x7f14f4f926ba]
14: (clone()+0x6d) [0x7f14f400982d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Actions #3

Updated by Greg Farnum almost 7 years ago

  • Project changed from Ceph to RADOS
  • Category changed from OSD to Correctness/Safety
  • Component(RADOS) OSD added
Actions #4

Updated by Sage Weil almost 7 years ago

  • Status changed from New to Can't reproduce

If you can reproduce this on master or luminous rc, please reopen!

Actions #5

Updated by Kefu Chai almost 7 years ago

  • Status changed from Can't reproduce to New

I can reproduce this:

    0> 2017-07-19 20:24:38.806917 7fc464310700 -1 /var/ceph/ceph/src/osd/PrimaryLogPG.cc: In function 'void PrimaryLogPG::eval_repop(PrimaryLogPG::RepGather*)' thread 7fc464310700 time 2017-07-19 20:24:38.782527
/var/ceph/ceph/src/osd/PrimaryLogPG.cc: 9074: FAILED assert(repop_queue.front() == repop)

 ceph version Development (no_version) luminous (rc)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x137) [0x562111033f2c]
 2: (PrimaryLogPG::eval_repop(PrimaryLogPG::RepGather*)+0xf02) [0x5621109c15b8]
 3: (PrimaryLogPG::repop_all_committed(PrimaryLogPG::RepGather*)+0x282) [0x5621109c0170]
 4: (C_OSD_RepopCommit::finish(int)+0x3e) [0x562110a1b256]
 5: (Context::complete(int)+0x27) [0x56211075bc65]
 6: (ReplicatedBackend::do_repop_reply(boost::intrusive_ptr<OpRequest>)+0x88f) [0x562110c53d5f]
 7: (ReplicatedBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x35a) [0x562110c50a40]
 8: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xceb) [0x562110977953]
 9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x453) [0x562110747945]
 10: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x71) [0x562110bc7d63]
   -10> 2017-07-19 20:24:38.782426 7fc464310700  1 -- 127.0.0.1:6804/10982 --> 127.0.0.1:0/4260068453 -- osd_op_reply(8207 10000001676.00000000 [write 0~6232] v17'968 uv968 ondisk = 0) v8 -- 0x56211e85ce00 con 0
    -9> 2017-07-19 20:24:38.782469 7fc464310700  0 osd.1 pg_epoch: 17 pg[1.6( v 17'1041 (0'0,17'1041] local-lis/les=13/14 n=716 ec=11/11 lis/c 13/13 les/c/f 14/14/0 13/13/11) [1,0,2] r=0 lpr=13 luod=17'968 lua=17'1013 crt=17'1041 lcod 17'967 mlcod 17'967 active+clean+snaptrim snaptrimq=[2~2]]  removing repgather(0x56211e336d00 17'968 rep_tid=4698 committed?=1 applied?=1 r=0)
    -8> 2017-07-19 20:24:38.782504 7fc464310700  0 osd.1 pg_epoch: 17 pg[1.6( v 17'1041 (0'0,17'1041] local-lis/les=13/14 n=716 ec=11/11 lis/c 13/13 les/c/f 14/14/0 13/13/11) [1,0,2] r=0 lpr=13 luod=17'968 lua=17'1013 crt=17'1041 lcod 17'967 mlcod 17'967 active+clean+snaptrim snaptrimq=[2~2]]    q front is repgather(0x56211d80c780 17'967 rep_tid=4693 committed?=0 applied?=0 r=0)
    -7> 2017-07-19 20:24:38.783441 7fc47ba89700  1 -- 127.0.0.1:6805/10982 <== osd.0 127.0.0.1:6801/10776 23710471 ==== osd_repop_reply(client.4120.0:8203 1.0 e17/13) v2 ==== 111+0+0 (3538393328 0 0) 0x56211fa7db80 con 0x56211d337000
    -6> 2017-07-19 20:24:38.783585 7fc464b11700  1 -- 127.0.0.1:6804/10982 --> 127.0.0.1:0/4260068453 -- osd_op_reply(8203 10000001672.00000000 [write 0~237] v17'946 uv946 ondisk = 0) v8 -- 0x56211db7a680 con 0
    -5> 2017-07-19 20:24:38.783640 7fc464b11700  0 osd.1 pg_epoch: 17 pg[1.0( v 17'1038 (0'0,17'1038] local-lis/les=13/14 n=732 ec=11/11 lis/c 13/13 les/c/f 14/14/0 13/13/11) [1,0,2] r=0 lpr=13 luod=17'946 lua=17'992 crt=17'1038 lcod 17'945 mlcod 17'945 active+clean]  removing repgather(0x562121e14100 17'946 rep_tid=4695 committed?=1 applied?=1 r=0)
    -4> 2017-07-19 20:24:38.783667 7fc464b11700  0 osd.1 pg_epoch: 17 pg[1.0( v 17'1038 (0'0,17'1038] local-lis/les=13/14 n=732 ec=11/11 lis/c 13/13 les/c/f 14/14/0 13/13/11) [1,0,2] r=0 lpr=13 luod=17'946 lua=17'992 crt=17'1038 lcod 17'945 mlcod 17'945 active+clean]    q front is repgather(0x56211da82000 17'945 rep_tid=4691 committed?=0 applied?=0 r=0)
    -3> 2017-07-19 20:24:38.783962 7fc47ba89700  1 -- 127.0.0.1:6805/10982 <== osd.0 127.0.0.1:6801/10776 23710472 ==== osd_repop_reply(client.4120.0:8249 1.7 e17/13) v2 ==== 111+0+0 (4222689668 0 0) 0x56211fa7c500 con 0x56211d337000
    -2> 2017-07-19 20:24:38.784965 7fc47ba89700  1 -- 127.0.0.1:6805/10982 <== osd.0 127.0.0.1:6801/10776 23710473 ==== osd_repop_reply(client.4120.0:8126 1.3 e17/13) v2 ==== 111+0+0 (4031194495 0 0) 0x56211fa7ed00 con 0x56211d337000
    -1> 2017-07-19 20:24:38.785314 7fc46330e700  1 -- 127.0.0.1:6804/10982 --> 127.0.0.1:0/4260068453 -- osd_op_reply(8126 1000000161c.00000000 [write 0~6982] v17'961 uv961 ondisk = 0) v8 -- 0x56211f599880 con 0
Actions #6

Updated by Sage Weil over 6 years ago

  • Status changed from New to 12
  • Assignee set to Kefu Chai

How do you reproduce it?

Actions #7

Updated by Kefu Chai over 6 years ago

I reproduced it by running

fs/snaps/{begin.yaml clusters/fixed-2-ucephfs.yaml mount/fuse.yaml objectstore/filestore-xfs.yaml tasks/snaptests.yaml}

locally.

In short:

  1. mount CephFS using ceph-fuse
  2. run untar_snap_rm.sh

But so far, I have only managed to reproduce it twice.

Actions #8

Updated by Kefu Chai over 6 years ago

I have a core dump now; looking into it.

Actions #9

Updated by Kefu Chai over 6 years ago

I think the root cause is in the messenger layer. In my case, osd.1 is the primary OSD, and it expects its peer OSDs to reply to repops in order; but rep_gather.4205, sitting in the middle of repop_queue, finished before rep_gather.4198 at the front. That is why the OSD panicked.
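
For reference, the invariant behind the assert is that replica replies complete repops in submission order, so a fully acked repop must be sitting at the front of repop_queue. A minimal sketch of that expectation (simplified and illustrative only, not the actual PrimaryLogPG code):

#include <cassert>
#include <deque>

struct RepGather {
  unsigned rep_tid;
  bool all_committed = false;
  bool all_applied = false;
};

std::deque<RepGather*> repop_queue;   // ordered by submission (rep_tid)

// Called when the last replica reply for `repop` arrives. Because replies
// are expected to come back in submission order, a fully completed repop
// must be at the front of the queue.
void eval_repop_sketch(RepGather* repop) {
  if (!(repop->all_committed && repop->all_applied))
    return;                                // still waiting for replies
  assert(!repop_queue.empty());
  assert(repop_queue.front() == repop);    // the assert that fires here:
                                           // a later tid completed first
  repop_queue.pop_front();
  delete repop;
}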

Judging from the log, osd.2 at least tried to reply to osd.1 in order; but before the osd_repop_reply for rep_gather.4198 was put on the wire, osd.2 ran into a connection issue:

2017-07-27 13:11:04.996421 7facad711700  1 -- 127.0.0.1:6810/54391 >> 127.0.0.1:6806/54182 conn(0x9e5d89000 :6810 s=STATE_OPEN pgs=3 cs=1 l=0).read_until read failed
2017-07-27 13:11:04.996427 7facad711700  1 -- 127.0.0.1:6810/54391 >> 127.0.0.1:6806/54182 conn(0x9e5d89000 :6810 s=STATE_OPEN pgs=3 cs=1 l=0).process read tag failed
.
2017-07-27 13:11:05.787109 7faca1796700  1 -- 127.0.0.1:6810/54391 --> 127.0.0.1:6806/54182 -- osd_repop_reply(client.4140.0:7497 1.7 e21/17 ondisk, result = 0) v2 -- 0x9e75fa000 con 0
/// the reply for rep_gather.4198. But I am not quite sure whether this message was sent at all, because at that moment the async conn was not writeable.
2017-07-27 13:11:05.787475 7facad711700  1 -- 127.0.0.1:6810/54391 learned_addr learned my addr 127.0.0.1:6810/54391
2017-07-27 13:11:05.788296 7facad711700  0 -- 127.0.0.1:6810/54391 >> 127.0.0.1:6806/54182 conn(0x9e5d89000 :6810 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=3 cs=1 l=0).handle_connect_reply connect got RESETSESSION
.
2017-07-27 13:11:05.792184 7faca1796700  1 -- 127.0.0.1:6810/54391 --> 127.0.0.1:6806/54182 -- osd_repop_reply(client.4140.0:7501 1.7 e21/17 ondisk, result = 0) v2 -- 0x9e75faa00 con 0
/// the reply of rep_gather.4205

on mon.a

2017-07-27 13:10:22.430516 7f3d0884d700 10 mon.a@0(leader).osd e21 preprocess_failure new: osd.2 127.0.0.1:6809/54391, from osd.0 127.0.0.1:6801/53974
..
2017-07-27 13:10:25.920452 7f3d0884d700 10 mon.a@0(leader).osd e21 preprocess_failure new: osd.2 127.0.0.1:6809/54391, from osd.1 127.0.0.1:6805/54182
..
2017-07-27 13:10:31.394073 7f3d04044700 10 mon.a@0(leader).osd e22  adding osd.2 to down_pending_out map
..
2017-07-27 13:10:25.971551 7f3d0b052700  2 mon.a@0(leader).osd e21  osd.2 DOWN
..
2017-07-27 13:10:32.630712 7f3d0b052700 10 mon.a@0(leader).osd e22 e22: 3 total, 2 up, 3 in

osd.0 and osd.1 reported that osd.2 was too laggy, and it was then marked down.

on the osd.1 side (the primary):

  -371> 2017-07-27 13:11:04.996232 7fd3ebcd4700 10 osd.1 21  advance to epoch 22 (<= last 22 <= newest_map 23)
  -368> 2017-07-27 13:11:04.996277 7fd3ebcd4700  1 -- 127.0.0.1:6806/54182 mark_down 127.0.0.1:6810/54391 -- 0xc596d0c800
... osd.1 consumed the latest osdmap and killed the conn to osd.2, which had been marked down by the mon
   -67> 2017-07-27 13:11:05.787397 7fd3f7c4f700  1 -- 127.0.0.1:6806/54182 >> - conn(0xc598263800 :6806 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=52 -
   -66> 2017-07-27 13:11:05.787954 7fd3f7c4f700 10 osd.1 23  new session 0xc596083180 con=0xc598263800 addr=127.0.0.1:6810/54391
   -64> 2017-07-27 13:11:05.788213 7fd3f7c4f700  0 -- 127.0.0.1:6806/54182 >> 127.0.0.1:6810/54391 conn(0xc598263800 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept we reset (peer sent cseq 1), sending RESETSESSION
   -63> 2017-07-27 13:11:05.788965 7fd3f7c4f700 10 osd.1 23  session 0xc596083180 osd.2 has caps osdcap[grant(*)] 'allow *'
   -62> 2017-07-27 13:11:05.789346 7fd3f7c4f700  0 -- 127.0.0.1:6806/54182 >> 127.0.0.1:6810/54391 conn(0xc598263800 :6806 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=3 cs=1 l=0).process missed message?  skipped from seq 0 to 985371443
. and osd.1 did notice that something was wrong.
   -31> 2017-07-27 13:11:05.792363 7fd3f7c4f700  1 -- 127.0.0.1:6806/54182 <== osd.2 127.0.0.1:6810/54391 985371445 ==== osd_repop_reply(client.4140.0:7501 1.7 e21/17) v2 ==== 111+0+0 (247522301 0 0) 0xc596083e00 con 0xc598263800
   -30> 2017-07-27 13:11:05.792471 7fd3dfcbc700 10 osd.1 23 dequeue_op 0xc5983a2f40 prio 196 cost 0 latency 0.000161 osd_repop_reply(client.4140.0:7501 1.7 e21/17) v2 pg pg[1.7( v 21'829 (0'0,21'829] local-lis/les=17/18 n=700 ec=12/12 lis/c 17/17 les/c/f 18/19/0 17/17/12) [1,2,0] r=0 lpr=17 luod=21'755 lua=21'811 crt=21'829 lcod 21'754 mlcod 21'754 active+clean+snaptrim snaptrimq=[2~2]]
   -29> 2017-07-27 13:11:05.792521 7fd3dfcbc700 10 osd.1 pg_epoch: 21 pg[1.7( v 21'829 (0'0,21'829] local-lis/les=17/18 n=700 ec=12/12 lis/c 17/17 les/c/f 18/19/0 17/17/12) [1,2,0] r=0 lpr=17 luod=21'755 lua=21'811 crt=21'829 lcod 21'754 mlcod 21'754 active+clean+snaptrim snaptrimq=[2~2]] _handle_message: 0xc5983a2f40
   -28> 2017-07-27 13:11:05.792547 7fd3dfcbc700  7 osd.1 pg_epoch: 21 pg[1.7( v 21'829 (0'0,21'829] local-lis/les=17/18 n=700 ec=12/12 lis/c 17/17 les/c/f 18/19/0 17/17/12) [1,2,0] r=0 lpr=17 luod=21'755 lua=21'811 crt=21'829 lcod 21'754 mlcod 21'754 active+clean+snaptrim snaptrimq=[2~2]] do_repop_reply: tid 4205 op  ack_type 4 from 2
. this is the first osd_repop_reply from osd.2 involving pg 1.7 after the connection between osd.1 and osd.2 was re-established, but it is for tid 4205, not 4198

Actions #10

Updated by Greg Farnum over 6 years ago

So is this a timing issue where the lossy connection is dead and a message gets thrown out, but then the second reply gets submitted after the connection is re-established?

Actions #11

Updated by Kefu Chai over 6 years ago

Greg, yeah, that's what it seems to be. But the osd-osd connection is not lossy, so the root cause of this issue is still unknown. I will continue the investigation anyway.

Actions #12

Updated by Greg Farnum over 6 years ago

The line "fault on lossy channel, failing" suggests that the connection you're looking at is lossy. So either it's tagged incorrectly or there's a bug in the asyncmessenger where it's invoking a lossy branch in non-lossy connections.

Actions #13

Updated by Kefu Chai over 6 years ago

Greg, I think the "fault on lossy channel, failing" lines are from heartbeat connections, and they are misleading. I am removing them from the comment above.

Actions #14

Updated by Kefu Chai over 6 years ago

osd.1: the osd that sent the out-of-order reply.4205 without sending reply.4198 first.
osd.2: the primary osd that crashed on seeing it.

  1. osd.2 and osd.0 reported osd.1 laggy, and osd.1 was then marked down by the mon
  2. osd.2 shut down its connections to osd.1
  3. on osd.1, the conn to osd.2 faulted with nothing to send, and went standby
  4. a rep op committed on osd.1, which wanted to send the reply to osd.2, so it tried to reopen the conn to osd.2
  5. osd.2 reset the session ("handle_connect_msg accept we reset (peer sent cseq 1), sending RESETSESSION"), because it was the one that had reset the connection
  6. osd.1 got the RESETSESSION and discarded its out_queue, which still held the not-yet-sent reply for rep_gather.4198 (see the sketch below)
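
To make step 6 concrete, here is a rough sketch of how a session reset drops queued-but-unsent messages. This is a simplified illustration with invented identifiers, not the actual AsyncMessenger code:

#include <deque>
#include <memory>

struct Message { unsigned tid; };

struct ConnSketch {
  std::deque<std::unique_ptr<Message>> out_queue;  // not yet on the wire
  unsigned out_seq = 0;

  void send(std::unique_ptr<Message> m) {
    // reply.4198 sits here while the connection is faulted/standby
    out_queue.push_back(std::move(m));
  }

  // The peer answered our reconnect with RESETSESSION: it considers the
  // old session gone, so everything queued for that session is discarded.
  void handle_reset_session() {
    out_queue.clear();    // reply.4198 is silently dropped here
    out_seq = 0;
  }
};

// After the reset, the next committed repop (reply.4205) is queued on the
// fresh session and delivered, arriving at the primary out of order.
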
Actions #15

Updated by Kefu Chai over 6 years ago

If osd.1 is down, osd.2 should have started peering, and repop_queue should have been flushed by on_change() in start_peering_interval(). But that did not happen.
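
For context, the cleanup that should have run looks roughly like the following. This is only a sketch matching the "canceling repop tid ..." lines in the later logs; the real code is the repop flush done from PrimaryLogPG::on_change():

#include <deque>

struct RepGather { unsigned rep_tid; };

// On an interval change, every outstanding repop is canceled and the
// queue is emptied, so no stale reply can ever be matched against it.
void flush_repops_sketch(std::deque<RepGather*>& repop_queue,
                         bool requeue /* still primary? */) {
  while (!repop_queue.empty()) {
    RepGather* repop = repop_queue.front();
    repop_queue.pop_front();
    // logged as: " canceling repop tid <rep_tid>"
    if (requeue) {
      // requeue the client op so it is retried in the new interval
    }
    delete repop;
  }
}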

I checked the actingset and actingbackfill of the crashed OSD's PG using gdb; they were not changed after consuming the osdmap that marked osd.1 down. They were still [2,1,0].

Actions #16

Updated by Greg Farnum over 6 years ago

Kefu Chai wrote:

I checked the actingset and actingbackfill of the crashed OSD's PG using gdb; they were not changed after consuming the osdmap that marked osd.1 down. They were still [2,1,0].

Had the PG actually processed the map change yet, or only the OSD?

I think there's something missing here, though — besides the repops being flushed, all the messages should have gotten tossed out as well since they were from a prior interval.

I'm trying to reproduce this as well and will put logs somewhere shared if I manage to do so. Kefu, if you could upload your logs so we have a common reference point, that would be good too.

Actions #17

Updated by Kefu Chai over 6 years ago

front:
  pg = 1.4
  oid = 10000001b80.00000000
  rep_tid = 5386

repop:
  pg = 1.4
  oid = 10000001b86.00000000
  rep_tid = 5389

pg.actingbackfill
  [0,1,2]

pg.actingset
  [0,1,2]

osd.1 is the one who crashed. debug-osd = 10, debug-ms = 1/10

2017-08-02 11:39:34.797500 7fe40dd07700 10 osd.1 pg_epoch: 19 pg[1.4( v 19'1014 (0'0,19'1014] local-lis/les=15/16 n=723 ec=12/12 lis/c 15/15 les/c/f 16/16/0 15/15/12) [1,0,2] r=0 lpr=15 luod=19'983 crt=19'1014 lcod 19'982 mlcod 19'982 active+clean] handle_advance_map [1,2]/[1,2] -- 1/1
..
2017-08-02 11:39:34.797544 7fe40dd07700 10 osd.1 pg_epoch: 20 pg[1.4( v 19'1014 (0'0,19'1014] local-lis/les=15/16 n=723 ec=12/12 lis/c 15/15 les/c/f 16/16/0 15/15/12) [1,0,2] r=0 lpr=15 luod=19'983 crt=19'1014 lcod 19'982 mlcod 19'982 active+clean] state<Started>: should_restart_peering, transitioning to Reset
..
2017-08-02 11:39:34.797961 7fe40dd07700 10 osd.1 pg_epoch: 20 pg[1.4( v 19'1014 (0'0,19'1014] local-lis/les=15/16 n=723 ec=12/12 lis/c 15/15 les/c/f 16/16/0 15/15/12) [1,2] r=0 lpr=20 pi=[15,20)/1 luod=0'0 crt=19'1014 lcod 19'982 mlcod 0'0 active] start_peering_interval: check_new_interval output: check_new_interval interval(15-19 up [1,0,2](1) acting [1,0,2](1)): not rw, up_thru 15 up_from 9 last_epoch_clean 16
..
2017-08-02 11:39:34.797993 7fe40dd07700  1 osd.1 pg_epoch: 20 pg[1.4( v 19'1014 (0'0,19'1014] local-lis/les=15/16 n=723 ec=12/12 lis/c 15/15 les/c/f 16/16/0 20/20/12) [1,2] r=0 lpr=20 pi=[15,20)/1 luod=0'0 crt=19'1014 lcod 19'982 mlcod 0'0 active] start_peering_interval up [1,0,2] -> [1,2], acting [1,0,2] -> [1,2], acting_primary 1 -> 1, up_primary 1 -> 1, role 0 -> 0, features acting 2305244844532236283 upacting 2305244844532236283
..
2017-08-02 11:39:34.798030 7fe40dd07700 10 osd.1 pg_epoch: 20 pg[1.4( v 19'1014 (0'0,19'1014] local-lis/les=15/16 n=723 ec=12/12 lis/c 15/15 les/c/f 16/16/0 20/20/12) [1,2] r=0 lpr=20 pi=[15,20)/1 crt=19'1014 lcod 19'982 mlcod 0'0 unknown] on_change
..
2017-08-02 11:39:34.798067 7fe40dd07700 10 osd.1 pg_epoch: 20 pg[1.4( v 19'1014 (0'0,19'1014] local-lis/les=15/16 n=723 ec=12/12 lis/c 15/15 les/c/f 16/16/0 20/20/12) [1,2] r=0 lpr=20 pi=[15,20)/1 crt=19'1014 lcod 19'982 mlcod 0'0 unknown]  canceling repop tid 5143
...
2017-08-02 11:39:34.799023 7fe40dd07700 10 osd.1 pg_epoch: 20 pg[1.4( v 19'1014 (0'0,19'1014] local-lis/les=15/16 n=723 ec=12/12 lis/c 15/15 les/c/f 16/16/0 20/20/12) [1,2] r=0 lpr=20 pi=[15,20)/1 crt=19'1014 lcod 19'982 mlcod 0'0 unknown]  canceling repop tid 5231
2017-08-02 11:40:25.020782 7fe40e508700 10 osd.1 pg_epoch: 23 pg[1.4( v 19'1014 (0'0,19'1014] local-lis/les=20/22 n=723 ec=12/12 lis/c 15/15 les/c/f 16/16/0 20/20/12) [1,2] r=0 lpr=23 pi=[15,20)/1 crt=19'1014 lcod 19'982 mlcod 0'0 unknown] state<Reset>: should restart peering, calling start_peering_interval again
2017-08-02 11:40:25.020820 7fe40e508700 10 osd.1 pg_epoch: 23 pg[1.4( v 19'1014 (0'0,19'1014] local-lis/les=20/22 n=723 ec=12/12 lis/c 15/15 les/c/f 16/16/0 20/20/12) [1,0,2] r=0 lpr=23 pi=[15,23)/1 crt=19'1014 lcod 19'982 mlcod 0'0 unknown] start_peering_interval: check_new_interval output: check_new_interval interval(20-22 up [1,2](1) acting [1,2](1)): not rw, up_thru 20 up_from 9 last_epoch_clean 16
...
2017-08-02 11:42:01.325272 7fe40b502700  1 -- 127.0.0.1:6806/34784 --> 127.0.0.1:6810/34999 -- osd_repop(client.4120.0:9400 1.4 e26/23 1:3bff216e:::10000001b80.00000000:head v 26'1048) v2 -- 0xebe94f5000 con 0

osd.0

2017-08-02 11:43:34.351508 7f49e65e5700 10 osd.0 26 handle_osd_ping osd.2 127.0.0.1:6812/34999 says i am down in 27
..
2017-08-02 11:43:34.352649 7f49dcbf9700  3 osd.0 26 handle_osd_map epochs [27,27], i have 26, src has [1,27]
..
2017-08-02 11:43:34.352658 7f49dcbf9700 10 osd.0 26 handle_osd_map  got inc map for epoch 27
..
2017-08-02 11:45:34.138111 7f49cf616700 10 osd.0 26 dequeue_op 0x5558cc1c0 prio 127 cost 27342 latency 59.190270 osd_repop(client.4120.0:9780 1.3 e26/23) v2 pg pg[1.3( v 26'1143 (0'0,26'1143] local-lis/les=23/25 n=839 ec=12/12 lis/c 23/23 les/c/f 25/25/0 23/23/12) [1,2,0] r=2 lpr=23 luod=0'0 lua=26'1080 crt=26'1143 lcod 26'1086 active]
2017-08-02 11:45:34.138166 7f49d9e2b700 10 osd.0 pg_epoch: 26 pg[1.4( v 26'1104 (0'0,26'1104] local-lis/les=23/25 n=809 ec=12/12 lis/c 23/23 les/c/f 25/26/0 23/23/12) [1,0,2] r=1 lpr=23 luod=0'0 lua=26'1040 crt=26'1104 lcod 26'1046 active] repop_commit on op osd_repop(client.4120.0:9400 1.4 e26/23 1:3bff216e:::10000001b80.00000000:head v 26'1048) v2, sending commit to osd.1
Actions #18

Updated by Kefu Chai over 6 years ago

mon.a.log

2017-08-02 11:29:26.520230 7f4325f31700  7 mon.a@0(leader).osd e18 update_from_paxos  applying incremental 19
2017-08-02 11:29:26.520475 7f4325f31700  1 mon.a@0(leader).osd e19 e19: 3 total, 3 up, 3 in
...
2017-08-02 11:39:29.376025 7f432a73a700  1 mon.a@0(leader).osd e19  we have enough reporters to mark osd.0 down
...
2017-08-02 11:39:29.426773 7f432cf3f700  2 mon.a@0(leader).osd e19  osd.0 DOWN
...
2017-08-02 11:39:32.389899 7f4325f31700  7 mon.a@0(leader).osd e19 update_from_paxos  applying incremental 20
2017-08-02 11:39:32.390162 7f4325f31700  1 mon.a@0(leader).osd e20 e20: 3 total, 2 up, 3 in
...
2017-08-02 11:43:03.824228 7f432cf3f700 10 mon.a@0(leader).osd e26  osd.0 has 2 reporters, 24.176361 grace (20.000000 + 4.17636 + 0), max_failed_since 2017-08-02 11:42:34.509763
..
2017-08-02 11:43:03.824334 7f432cf3f700 10 mon.a@0(leader).osd e26 encode_pending e 27
..
2017-08-02 11:43:03.825037 7f432cf3f700  2 mon.a@0(leader).osd e26  osd.0 DOWN

osdmap e20 is the first osdmap marking osd.0 down. e27 is the second.
Actions #19

Updated by Greg Farnum over 6 years ago

osd.1 in the posted log has pg 1.4 in epoch 26 from the time it first dequeues those operations right up until it crashes, so it would have no reason to throw out any replies or reset any state.
osd.1 has epoch 27 and processes the mark_down from it, but I think osd.0 (log not provided) is the one at fault here. My guess is it managed to throw out the older messages except this one.

Actions #20

Updated by Kefu Chai over 6 years ago

Greg, osd.0 failed to send the reply for tid 5386 over the wire because it was disconnected, but it managed to send the reply for tid 5389 because it had reconnected by then.

I will upload osd.0.log once I have access to my testbed (probably in about a week).

Actions #21

Updated by Greg Farnum over 6 years ago

Yes, but osd.0 doing that is very incorrect. We've had some problems in this area before with marking stuff down not persisting correctly so it's not a huge surprise.
I think (but don't remember for sure) that it should be throwing out all in-progress ops on the PG and dismissing any replies before they get to the messenger, and then cleaning it up during peering. Not certain though.
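
A rough sketch of the kind of "drop stale replies" check being suggested here; this is illustrative only, with invented names, not the actual PG message-filtering code:

#include <cstdint>

struct ReplyMsgSketch {
  uint32_t map_epoch;           // epoch the sender was at when replying
};

struct PgStateSketch {
  uint32_t last_peering_reset;  // epoch of the last interval change
};

// A reply generated before the PG's last interval change belongs to a
// prior interval; it should be ignored instead of being matched against
// the current repop_queue.
bool should_discard_reply(const PgStateSketch& pg, const ReplyMsgSketch& m) {
  return m.map_epoch < pg.last_peering_reset;
}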

Actions #23

Updated by Kefu Chai over 6 years ago

Excerpt of osd.0.log:

2017-08-02 11:40:00.786146 7f49d161a700  1 osd.0 pg_epoch: 20 pg[1.4( v 19'1009 (0'0,19'1009] local-lis/les=15/16 n=718 ec=12/12 lis/c 15/15 les/c/f 16/16/0 20/20/12) [1,2] r=-1 lpr=20 pi=[15,20)/1 luod=0'0 lua=19'986 crt=19'1009 lcod 19'987 active] start_peering_interval up [1,0,2] -> [1,2], acting [1,0,2] -> [1,2], acting_primary 1 -> 1, up_primary 1 -> 1, role 1 -> -1, features acting 2305244844532236283 upacting 2305244844532236283

2017-08-02 11:40:00.786235 7f49d161a700 10 osd.0 pg_epoch: 20 pg[1.4( v 19'1009 (0'0,19'1009] local-lis/les=15/16 n=718 ec=12/12 lis/c 15/15 les/c/f 16/16/0 20/20/12) [1,2] r=-1 lpr=20 pi=[15,20)/1 crt=19'1009 lcod 19'987 unknown] on_change_cleanup
2017-08-02 11:40:00.786245 7f49d161a700 10 osd.0 pg_epoch: 20 pg[1.4( v 19'1009 (0'0,19'1009] local-lis/les=15/16 n=718 ec=12/12 lis/c 15/15 les/c/f 16/16/0 20/20/12) [1,2] r=-1 lpr=20 pi=[15,20)/1 crt=19'1009 lcod 19'987 unknown] on_change

2017-08-02 11:40:00.786443 7f49d161a700  5 osd.0 pg_epoch: 20 pg[1.4( v 19'1009 (0'0,19'1009] local-lis/les=15/16 n=718 ec=12/12 lis/c 15/15 les/c/f 16/16/0 20/20/12) [1,2] r=-1 lpr=20 pi=[15,20)/1 crt=19'1009 lcod 19'987 unknown NOTIFY] enter Started/Stray

2017-08-02 11:40:38.411565 7f49d1e1b700  1 -- 127.0.0.1:6814/1034575 --> 127.0.0.1:6806/34784 -- MRecoveryReserve(1.4 GRANT e26) v2 -- 0x555969680 con 0
2017-08-02 11:40:38.411626 7f49d1e1b700  5 osd.0 pg_epoch: 26 pg[1.4( v 19'1014 lc 19'1009 (0'0,19'1014] local-lis/les=23/25 n=723 ec=12/12 lis/c 23/15 les/c/f 25/16/0 23/23/12) [1,0,2] r=1 lpr=23 pi=[15,23)/1 luod=0'0 crt=19'1014 lcod 19'987 active m=5] enter Started/ReplicaActive/RepRecovering

2017-08-02 11:42:01.325351 7f49e65e5700  1 -- 127.0.0.1:6814/1034575 <== osd.1 127.0.0.1:6806/34784 215 ==== osd_repop(client.4120.0:9400 1.4 e26/23) v2 ==== 998+0+7857 (4043856715 0 197144736) 0x554b95800 con 0x555f9c000
<<<<<< receive osd_repop from osd.1

2017-08-02 11:42:08.864427 7f49cee15700 10 osd.0 pg_epoch: 26 pg[1.4( v 26'1047 (0'0,26'1047] local-lis/les=23/25 n=754 ec=12/12 lis/c 23/15 les/c/f 25/16/0 23/23/12) [1,0,2] r=1 lpr=23 pi=[15,23)/1 luod=0'0 lua=26'1028 crt=26'1047 lcod 26'1030 active] do_repop 1:3bff216e:::10000001b80.00000000:head v 26'1048 (transaction) 160

2017-08-02 11:43:34.351508 7f49e65e5700 10 osd.0 26 handle_osd_ping osd.2 127.0.0.1:6812/34999 says i am down in 27
2017-08-02 11:43:34.352649 7f49dcbf9700  3 osd.0 26 handle_osd_map epochs [27,27], i have 26, src has [1,27]
<<<<<<<< the osdmap marking me down

2017-08-02 11:44:25.450144 7f49cee15700 10 osd.0 26 dequeue_op 0x553d1cfc0 prio 127 cost 3649 latency 71.072040 osd_repop(client.4120.0:9535 1.4 e26/23) v2 pg pg[1.4( v 26'1066 (0'0,26'1066] local-lis/les=23/25 n=771 ec=12/12 lis/c 23/15 les/c/f 25/16/0 23/23/12) [1,0,2] r=1 lpr=23 pi=[15,23)/1 luod=0'0 lua=26'1032 crt=26'1066 lcod 26'1033 active]

2017-08-02 11:44:26.795883 7f49dbbf7700  1 -- 127.0.0.1:6814/1034575 <== osd.1 127.0.0.1:6806/34784 443 ==== MRecoveryReserve(1.4 RELEASE e26) v2 ==== 26+0+0 (3354862884 0 0) 0x554690400 con 0x555f9c000
2017-08-02 11:44:29.019165 7f49d161a700 10 osd.0 pg_epoch: 26 pg[1.4( v 26'1069 (0'0,26'1069] local-lis/les=23/25 n=774 ec=12/12 lis/c 23/15 les/c/f 25/16/0 23/23/12) [1,0,2] r=1 lpr=23 pi=[15,23)/1 luod=0'0 lua=26'1033 crt=26'1069 lcod 26'1034 active] handle_peering_event: epoch_sent: 26 epoch_requested: 26 RecoveryDone
2017-08-02 11:44:29.019281 7f49d161a700  5 osd.0 pg_epoch: 26 pg[1.4( v 26'1069 (0'0,26'1069] local-lis/les=23/25 n=774 ec=12/12 lis/c 23/15 les/c/f 25/16/0 23/23/12) [1,0,2] r=1 lpr=23 pi=[15,23)/1 luod=0'0 lua=26'1033 crt=26'1069 lcod 26'1034 active] enter Started/ReplicaActive/RepNotRecovering
<<<<<< osd.1 released the recovery reservation, osd.0 returned to active (RepNotRecovering)

2017-08-02 11:44:29.019376 7f49cee15700 10 osd.0 26 dequeue_op 0x5576c48c0 prio 127 cost 5226 latency 74.630144 osd_repop(client.4120.0:9544 1.4 e26/23) v2 pg pg[1.4( v 26'1069 (0'0,26'1069] local-lis/les=23/25 n=774 ec=12/12 lis/c 23/15 les/c/f 25/16/0 23/23/12) [1,0,2] r=1 lpr=23 pi=[15,23)/1 luod=0'0 lua=26'1033 crt=26'1069 lcod 26'1034 active]

2017-08-02 11:45:31.474277 7f49e65e5700 10 osd.0 26 handle_osd_ping osd.2 127.0.0.1:6812/34999 says i am down in 28

2017-08-02 11:45:34.138166 7f49d9e2b700 10 osd.0 pg_epoch: 26 pg[1.4( v 26'1104 (0'0,26'1104] local-lis/les=23/25 n=809 ec=12/12 lis/c 23/23 les/c/f 25/26/0 23/23/12) [1,0,2] r=1 lpr=23 luod=0'0 lua=26'1040 crt=26'1104 lcod 26'1046 active] repop_commit on op osd_repop(client.4120.0:9400 1.4 e26/23 1:3bff216e:::10000001b80.00000000:head v 26'1048) v2, sending commit to osd.1
<<<<<<<< osd_repop committed.

2017-08-02 11:45:42.908750 7f49d9e2b700 10 osd.0 26 _committed_osd_maps 27..27
2017-08-02 11:45:42.908809 7f49d9e2b700 10 osd.0 26  advance to epoch 27 (<= last 27 <= newest_map 28)
<<<<<<<< the osdmap was received at 11:43:34.352649, but it was not committed until about 2 minutes later
<<<<<<<< none of the PGs has been posted a NullEvt yet!!!

2017-08-02 11:46:03.201137 7f49e163a700  0 log_channel(cluster) log [WRN] : slow request 241.875358 seconds old, received at 2017-08-02 11:42:01.325291: osd_repop(client.4120.0:9400 1.4 e26/23 1:3bff216e:::10000001b80.00000000:head v 26'1048) currently commit_sent

2017-08-02 11:47:01.531887 7f49d161a700 10 osd.0 pg_epoch: 26 pg[1.4( v 26'1104 (0'0,26'1104] local-lis/les=23/25 n=809 ec=12/12 lis/c 23/23 les/c/f 25/26/0 23/23/12) [1,0,2] r=1 lpr=23 luod=0'0 lua=26'1046 crt=26'1104 lcod 26'1051 active] handle_advance_map [1,2]/[1,2] -- 1/1
2017-08-02 11:47:01.531936 7f49d161a700 10 osd.0 pg_epoch: 27 pg[1.4( v 26'1104 (0'0,26'1104] local-lis/les=23/25 n=809 ec=12/12 lis/c 23/23 les/c/f 25/26/0 23/23/12) [1,0,2] r=1 lpr=23 luod=0'0 lua=26'1046 crt=26'1104 lcod 26'1051 active] state<Started>: Started advmap

2017-08-02 11:47:01.532235 7f49d161a700  1 osd.0 pg_epoch: 27 pg[1.4( v 26'1104 (0'0,26'1104] local-lis/les=23/25 n=809 ec=12/12 lis/c 23/23 les/c/f 25/26/0 27/27/12) [1,2] r=-1 lpr=27 pi=[23,27)/1 luod=0'0 lua=26'1046 crt=26'1104 lcod 26'1051 active] start_peering_interval up [1,0,2] -> [1,2], acting [1,0,2] -> [1,2], acting_primary 1 -> 1, up_primary 1 -> 1, role 1 -> -1, features acting 2305244844532236283 upacting 2305244844532236283

so osd.0 had not committed osdmap.27 before sending the reply of tid 5386 to osd.1.

Actions #24

Updated by Kefu Chai over 6 years ago

  1. osd.1 sent a failure report about osd.0
  2. osd.1 sent repop 5386 to osd.0
  3. mon.a marked osd.0 down in osdmap.27
  4. osd.1 consumed osdmap.27
  5. osd.1 shut down the connection to osd.0; all connections to it were marked down
  6. osd.1 posted a NullEvt to pg 1.4
  7. osd.0 sent the reply for 5386 to osd.1, but no connection existed
  8. osd.0 reconnected to osd.1
  9. osd.1 sent a RESETSESSION to osd.0
  10. osd.0 got the RESETSESSION and discarded its out_queue, which held the reply for 5386
  11. osd.0 sent the reply for 5389 to osd.1
  12. osd.1 panicked on seeing the reply for 5389 out of order

The primary osd should cancel all repops once its peer gets marked down and the pg starts peering, but pg 1.4 did not start peering after consuming osdmap e27 before osd.1 crashed.

Actions #25

Updated by Kefu Chai over 6 years ago

(gdb) bt
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f65f36b5b95 in __GI___pthread_mutex_lock (mutex=0x55578bca65f8) at ../nptl/pthread_mutex_lock.c:80
#2  0x00005557805c2e12 in Mutex::Lock (this=0x55578bca65d0, no_lockdep=false) at /var/ceph/ceph/src/common/Mutex.cc:107
#3  0x000055577fdf2f93 in PG::lock (this=0x55578bca6000, no_lockdep=false) at /var/ceph/ceph/src/osd/PG.cc:363
#4  0x000055577fdf2f3b in PG::lock_suspend_timeout (this=0x55578bca6000, handle=...) at /var/ceph/ceph/src/osd/PG.cc:357
#5  0x000055577fcde401 in OSD::process_peering_events (this=0x55578badc000, pgs=std::__cxx11::list = {...}, handle=...) at /var/ceph/ceph/src/osd/OSD.cc:9587
#6  0x000055577fd0c0c4 in OSD::PeeringWQ::_process (this=0x55578badd5a8, pgs=std::__cxx11::list = {...}, handle=...) at /var/ceph/ceph/src/osd/OSD.h:1854
#7  0x000055577fdccf2b in ThreadPool::BatchWorkQueue<PG>::_void_process (this=0x55578badd5a8, p=0x55578daf8f40, handle=...) at /var/ceph/ceph/src/common/WorkQueue.h:120
#8  0x00005557805fd11a in ThreadPool::worker (this=0x55578badc970, wt=0x55578bafc4e0) at /var/ceph/ceph/src/common/WorkQueue.cc:120
#9  0x0000555780601915 in ThreadPool::WorkThread::entry (this=0x55578bafc4e0) at /var/ceph/ceph/src/common/WorkQueue.h:448
#10 0x00005557807be393 in Thread::entry_wrapper (this=0x55578bafc4e0) at /var/ceph/ceph/src/common/Thread.cc:79
#11 0x00005557807be2c8 in Thread::_entry_func (arg=0x55578bafc4e0) at /var/ceph/ceph/src/common/Thread.cc:59
#12 0x00007f65f36b3494 in start_thread (arg=0x7f65dc3a0700) at pthread_create.c:333
#13 0x00007f65f273fabf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97

(gdb) f 3
#3  0x000055577fdf2f93 in PG::lock (this=0x55578bca6000, no_lockdep=false) at /var/ceph/ceph/src/osd/PG.cc:363
363       _lock.Lock(no_lockdep);
(gdb) p info.pgid
$3 = {pgid = {m_pool = 1, m_seed = 2, m_preferred = -1, static calc_name_buf_size = 36 '$'}, shard = {id = -1 '\377', static NO_SHARD = {id = -1 '\377',
      static NO_SHARD = <same as static member of an already seen type>}}, static calc_name_buf_size = 40 '('}

(gdb) thread 1
(gdb) bt
#0  raise (sig=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x0000555780594e8e in reraise_fatal (signum=6) at /var/ceph/ceph/src/global/signal_handler.cc:74
#2  0x0000555780595206 in handle_fatal_signal (signum=6) at /var/ceph/ceph/src/global/signal_handler.cc:138
#3  <signal handler called>
#4  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#5  0x00007f65f268b42a in __GI_abort () at abort.c:89
#6  0x00005557805fb51f in ceph::__ceph_assert_fail (assertion=0x555780e2c66f "repop_queue.front() == repop", file=0x555780e28c10 "/var/ceph/ceph/src/osd/PrimaryLogPG.cc",
    line=9266, func=0x555780e36c80 <PrimaryLogPG::eval_repop(PrimaryLogPG::RepGather*)::__PRETTY_FUNCTION__> "void PrimaryLogPG::eval_repop(PrimaryLogPG::RepGather*)")
    at /var/ceph/ceph/src/common/assert.cc:66
#7  0x000055577ff67a54 in PrimaryLogPG::eval_repop (this=0x55578bca6000, repop=0x55578c67d980) at /var/ceph/ceph/src/osd/PrimaryLogPG.cc:9266
#8  0x000055577ff6660c in PrimaryLogPG::repop_all_committed (this=0x55578bca6000, repop=0x55578c67d980) at /var/ceph/ceph/src/osd/PrimaryLogPG.cc:9161
#9  0x000055577ffc4b0e in C_OSD_RepopCommit::finish (this=0x55578e1ffbe0) at /var/ceph/ceph/src/osd/PrimaryLogPG.cc:9142
....

So the peering thread failed to acquire the pg lock while the pg was busy handling the requests/replies.

mon.a marked osd.2 down in osdmap.35:

2017-08-24 10:42:39.308999 7fd0b44cb700  1 mon.a@0(leader).osd e34  we have enough reporters to mark osd.2 down
2017-08-24 10:42:39.309772 7fd0b44cb700  2 mon.a@0(leader).osd e34  osd.2 DOWN
2017-08-24 10:42:40.768353 7fd0ad4bd700  1 mon.a@0(leader).osd e35 e35: 3 total, 2 up, 3 in

osd.0, primary of pg 1.2:

  -905> 2017-08-24 10:42:45.546230 7f65e43b0700 10 osd.0 34 _committed_osd_maps 35..35
  -894> 2017-08-24 10:42:45.546668 7f65e43b0700  7 osd.0 35 consume_map version 35
  -869> 2017-08-24 10:42:45.547561 7f65da39c700  1 -- 127.0.0.1:6814/1019336 --> 127.0.0.1:6806/2023140 -- osd_map(35..35 src has 1..35) v3 -- 0x55578c5ef180 con 0

  -159> 2017-08-24 10:42:54.065821 7f65e43b0700 10 osd.0 pg_epoch: 34 pg[1.2( v 34'1374 (0'0,34'1374] local-lis/les=31/33 n=1068 ec=9/9 lis/c 31/19 les/c/f 33/21/0 31/31/19) [0,1,2] r=0 lpr=31 pi=[19,31)/1 luod=34'1301 lua=34'1356 crt=34'1374 lcod 34'1300 mlcod 24'1008 active+recovery_wait+degraded] null
//////// waiting for the pg lock
  -123> 2017-08-24 10:42:54.066293 7f65da39c700  7 osd.0 pg_epoch: 34 pg[1.2( v 34'1374 (0'0,34'1374] local-lis/les=31/33 n=1068 ec=9/9 lis/c 31/19 les/c/f 33/21/0 31/31/19) [0,1,2] r=0 lpr=31 pi=[19,31)/1 luod=34'1301 lua=34'1356 crt=34'1374 lcod 34'1300 mlcod 24'1008 active+recovery_wait+degraded] do_repop_reply: tid 1826 op  ack_type 4 from 2
  -120> 2017-08-24 10:42:54.066334 7f65da39c700 10 osd.0 pg_epoch: 34 pg[1.2( v 34'1374 (0'0,34'1374] local-lis/les=31/33 n=1068 ec=9/9 lis/c 31/19 les/c/f 33/21/0 31/31/19) [0,1,2] r=0 lpr=31 pi=[19,31)/1 luod=34'1301 lua=34'1356 crt=34'1374 lcod 34'1300 mlcod 24'1008 active+recovery_wait+degraded] repop_all_applied: repop tid 1826 all applied
  -118> 2017-08-24 10:42:54.066352 7f65da39c700 10 osd.0 pg_epoch: 34 pg[1.2( v 34'1374 (0'0,34'1374] local-lis/les=31/33 n=1068 ec=9/9 lis/c 31/19 les/c/f 33/21/0 31/31/19) [0,1,2] r=0 lpr=31 pi=[19,31)/1 luod=34'1301 lua=34'1356 crt=34'1374 lcod 34'1300 mlcod 24'1008 active+recovery_wait+degraded] eval_repop repgather(0x55578c67d980 34'1303 rep_tid=1826 committed?=0 applied?=1 r=0)

  -106> 2017-08-24 10:42:54.066483 7f65da39c700  1 -- 127.0.0.1:6801/19336 --> 127.0.0.1:0/2210655423 -- osd_op_reply(11131 100000022ba.00000000 [write 0~9486] v34'1303 uv1303 ondisk = 0) v8 -- 0x55578ce70300 con 0

  -102> 2017-08-24 10:42:54.066517 7f65da39c700 10 osd.0 pg_epoch: 34 pg[1.2( v 34'1374 (0'0,34'1374] local-lis/les=31/33 n=1068 ec=9/9 lis/c 31/19 les/c/f 33/21/0 31/31/19) [0,1,2] r=0 lpr=31 pi=[19,31)/1 luod=34'1303 lua=34'1356 crt=34'1374 lcod 34'1302 mlcod 24'1008 active+recovery_wait+degraded]  applied: repgather(0x55578c67d980 34'1303 rep_tid=1826 committed?=1 applied?=1 r=0)
   -92> 2017-08-24 10:42:54.066639 7f65da39c700  0 osd.0 pg_epoch: 34 pg[1.2( v 34'1374 (0'0,34'1374] local-lis/les=31/33 n=1068 ec=9/9 lis/c 31/19 les/c/f 33/21/0 31/31/19) [0,1,2] r=0 lpr=31 pi=[19,31)/1 luod=34'1303 lua=34'1356 crt=34'1374 lcod 34'1302 mlcod 24'1008 active+recovery_wait+degraded]    q front is repgather(0x55578c67c180 34'1302 rep_tid=1825 committed?=0 applied?=0 r=0)

Actions #26

Updated by Kefu Chai over 6 years ago

Although all ops in repop_queue are canceled upon pg reset (on_change), and the pg discards messages from down OSDs according to its own (possibly slightly outdated) osdmap, we reset the connections to the down OSDs as soon as the osdmap is committed (see OSD::_committed_osd_maps()), while the pg is not updated until the NullEvt is handled in the OSD's peering_wq. The NullEvt is posted by consume_map(), and consume_map() is called by OSD::_committed_osd_maps().

So it is not guaranteed that PG::on_change() will be called in time.
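
A schematic of that ordering problem; this is a simplified pseudo-flow with invented stub names, not the actual OSD.cc code paths:

#include <mutex>

std::mutex pg_lock;                        // stands in for the PG lock

void mark_down_dead_peers() { /* reset connections to down OSDs */ }
void post_null_evt_to_peering_wq() { /* queue a NullEvt for each PG */ }
void pg_on_change() { /* cancel repops, flush repop_queue */ }

// Runs as soon as the new osdmap is committed.
void committed_osd_maps_sketch() {
  mark_down_dead_peers();                  // connections reset immediately
  post_null_evt_to_peering_wq();           // the PG reset is only *queued*
}

// Runs later on a peering worker; in the core dump above it was blocked
// here because the PG was busy handling client ops and repop replies.
void peering_wq_worker_sketch() {
  std::lock_guard<std::mutex> l(pg_lock);  // PG::lock()
  pg_on_change();                          // finally flushes repop_queue
}

// Window: between mark_down_dead_peers() and pg_on_change(), the peer can
// reconnect, get RESETSESSION, drop its queued reply, and deliver a later
// reply first, which the still-unreset PG then treats as out of order.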

in the log above,

  1. osd.0 reset the connection to osd.2 because osd.2 was marked down
  2. osd.2 sent the reply for tid 1825 to osd.0, but no connection existed at that moment
  3. and before pg 1.2 was reset (on_change),
  4. osd.2 reconnected to osd.0, but osd.0 instructed it to discard its out_queue by sending a RESETSESSION, so the reply for tid 1825 was removed from the out_queue
  5. osd.2 sent the reply for tid 1826 to osd.0
  6. osd.0 found tid 1826 out of order, so it panicked
Actions #27

Updated by Kefu Chai over 6 years ago

  • Status changed from 12 to Fix Under Review
Actions #28

Updated by Kefu Chai over 6 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to luminous
Actions #29

Updated by Nathan Cutler over 6 years ago

  • Copied to Backport #21242: luminous: OSD crash: PrimaryLogPG.cc: 8396: FAILED assert(repop_queue.front() == repop) added
Actions #30

Updated by Nathan Cutler over 6 years ago

  • Status changed from Pending Backport to Resolved
Actions #31

Updated by Shylesh Kumar over 5 years ago

  • Related to Bug #25174: osd: assert failure with FAILED assert(repop_queue.front() == repop) In function 'void PrimaryLogPG::eval_repop(PrimaryLogPG::RepGather*)' added