Bug #19427
closedcommon/LogClient.cc: 310: FAILED assert(num_unsent <= log_queue.size())
0%
Description
-35> 2017-03-29 22:36:36.972567 7f2538e96700 20 osd.2 756 should_share_map osd.1 172.21.15.117:6814/13151 756 -34> 2017-03-29 22:36:36.972564 7f2546eb2700 1 -- 172.21.15.117:6808/1991 <== mon.2 172.21.15.117:6790/0 3 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 580+0+0 (1929996202 0 0) 0x565217318800 con 0x56521713b3c0 -33> 2017-03-29 22:36:36.972571 7f2538e96700 10 osd.2 pg_epoch: 756 pg[0.28( empty local-les=727 n=0 ec=47 les/c/f 727/727/0 725/725/163) [2,1] r=0 lpr=725 crt=0'0 mlcod 0'0 active+clean+scrubbing] handle_message: 0x565216b7db00 -32> 2017-03-29 22:36:36.972575 7f2538e96700 15 osd.2 pg_epoch: 756 pg[0.28( empty local-les=727 n=0 ec=47 les/c/f 727/727/0 725/725/163) [2,1] r=0 lpr=725 crt=0'0 mlcod 0'0 active+clean+scrubbing] do_sub_op osd_sub_op(unknown.0.0:0 0.28 MIN [scrub-map] v 0'0 snapset=0=[]:[]) v12 -31> 2017-03-29 22:36:36.972579 7f2538e96700 7 osd.2 pg_epoch: 756 pg[0.28( empty local-les=727 n=0 ec=47 les/c/f 727/727/0 725/725/163) [2,1] r=0 lpr=725 crt=0'0 mlcod 0'0 active+clean+scrubbing] sub_op_scrub_map -30> 2017-03-29 22:36:36.972582 7f2538e96700 10 osd.2 pg_epoch: 756 pg[0.28( empty local-les=727 n=0 ec=47 les/c/f 727/727/0 725/725/163) [2,1] r=0 lpr=725 crt=0'0 mlcod 0'0 active+clean+scrubbing] got 1 scrub map -29> 2017-03-29 22:36:36.972586 7f2538e96700 10 osd.2 pg_epoch: 756 pg[0.28( empty local-les=727 n=0 ec=47 les/c/f 727/727/0 725/725/163) [2,1] r=0 lpr=725 crt=0'0 mlcod 0'0 active+clean+scrubbing] map version is 0'0 -28> 2017-03-29 22:36:36.972589 7f2538e96700 10 osd.2 pg_epoch: 756 pg[0.28( empty local-les=727 n=0 ec=47 les/c/f 727/727/0 725/725/163) [2,1] r=0 lpr=725 crt=0'0 mlcod 0'0 active+clean+scrubbing] requeue_scrub: queueing -27> 2017-03-29 22:36:36.972591 7f2538e96700 20 osd.2 op_wq(0) _enqueue 0.28 PGQueueable(PGScrub prio 5 cost 52428800 e756) -26> 2017-03-29 22:36:36.972596 7f2538e96700 10 osd.2 756 dequeue_op 0x565216b7db00 finish -25> 2017-03-29 22:36:36.972600 7f253b69b700 20 osd.2 op_wq(0) _process 0.28 item PGQueueable(PGScrub prio 5 cost 52428800 e756) queued -24> 2017-03-29 22:36:36.972602 7f2538e96700 20 osd.2 op_wq(0) _process empty q, waiting -23> 2017-03-29 22:36:36.972608 7f253b69b700 20 osd.2 op_wq(0) _process 0.28 item PGQueueable(PGScrub prio 5 cost 52428800 e756) pg 0x565215be2000 -22> 2017-03-29 22:36:36.972608 7f2546eb2700 1 -- 172.21.15.117:6808/1991 mark_down 0x565217138000 -- 0x56521562b400 -21> 2017-03-29 22:36:36.972610 7f253b69b700 20 osd.2 pg_epoch: 756 pg[0.28( empty local-les=727 n=0 ec=47 les/c/f 727/727/0 725/725/163) [2,1] r=0 lpr=725 crt=0'0 mlcod 0'0 active+clean+scrubbing] scrub state WAIT_REPLICAS -20> 2017-03-29 22:36:36.972614 7f253b69b700 20 osd.2 pg_epoch: 756 pg[0.28( empty local-les=727 n=0 ec=47 les/c/f 727/727/0 725/725/163) [2,1] r=0 lpr=725 crt=0'0 mlcod 0'0 active+clean+scrubbing] scrub state COMPARE_MAPS -19> 2017-03-29 22:36:36.972617 7f253b69b700 10 osd.2 pg_epoch: 756 pg[0.28( empty local-les=727 n=0 ec=47 les/c/f 727/727/0 725/725/163) [2,1] r=0 lpr=725 crt=0'0 mlcod 0'0 active+clean+scrubbing] scrub_compare_maps has maps, analyzing -18> 2017-03-29 22:36:36.972619 7f253b69b700 10 osd.2 pg_epoch: 756 pg[0.28( empty local-les=727 n=0 ec=47 les/c/f 727/727/0 725/725/163) [2,1] r=0 lpr=725 crt=0'0 mlcod 0'0 active+clean+scrubbing] scrub_compare_maps comparing replica scrub maps -17> 2017-03-29 22:36:36.972622 7f253b69b700 2 osd.2 pg_epoch: 756 pg[0.28( empty local-les=727 n=0 ec=47 les/c/f 727/727/0 725/725/163) [2,1] r=0 lpr=725 crt=0'0 mlcod 0'0 active+clean+scrubbing] scrub_compare_maps osd.2 has 0 items -16> 2017-03-29 22:36:36.972625 7f253b69b700 2 osd.2 pg_epoch: 756 pg[0.28( empty local-les=727 n=0 ec=47 les/c/f 727/727/0 725/725/163) [2,1] r=0 lpr=725 crt=0'0 mlcod 0'0 active+clean+scrubbing] scrub_compare_maps replica 1 has 0 items -15> 2017-03-29 22:36:36.972629 7f253b69b700 2 osd.2 pg_epoch: 756 pg[0.28( empty local-les=727 n=0 ec=47 les/c/f 727/727/0 725/725/163) [2,1] r=0 lpr=725 crt=0'0 mlcod 0'0 active+clean+scrubbing] -14> 2017-03-29 22:36:36.972630 7f2546eb2700 1 monclient: found mon.c -13> 2017-03-29 22:36:36.972632 7f2546eb2700 10 monclient: _send_mon_message to mon.c at 172.21.15.117:6790/0 -12> 2017-03-29 22:36:36.972632 7f253b69b700 10 osd.2 pg_epoch: 756 pg[0.28( empty local-les=727 n=0 ec=47 les/c/f 727/727/0 725/725/163) [2,1] r=0 lpr=725 crt=0'0 mlcod 0'0 active+clean+scrubbing] scrub_snapshot_metadata -11> 2017-03-29 22:36:36.972633 7f2546eb2700 1 -- 172.21.15.117:6808/1991 --> 172.21.15.117:6790/0 -- mon_subscribe({mgrmap=0+,monmap=3+,osd_pg_creates=0+,osdmap=757}) v2 -- ?+0 0x565216867400 con 0x56521713b3c0 -10> 2017-03-29 22:36:36.972636 7f253b69b700 10 osd.2 pg_epoch: 756 pg[0.28( empty local-les=727 n=0 ec=47 les/c/f 727/727/0 725/725/163) [2,1] r=0 lpr=725 crt=0'0 mlcod 0'0 active+clean+scrubbing] scrub_snapshot_metadata (scrub) finish -9> 2017-03-29 22:36:36.972639 7f2546eb2700 10 monclient: _send_mon_message to mon.c at 172.21.15.117:6790/0 -8> 2017-03-29 22:36:36.972642 7f2546eb2700 1 -- 172.21.15.117:6808/1991 --> 172.21.15.117:6790/0 -- pg_stats(39 pgs tid 177 v 0) v1 -- ?+0 0x565215f23e40 con 0x56521713b3c0 -7> 2017-03-29 22:36:36.972640 7f253b69b700 20 osd.2 pg_epoch: 756 pg[0.28( empty local-les=727 n=0 ec=47 les/c/f 727/727/0 725/725/163) [2,1] r=0 lpr=725 crt=0'0 mlcod 0'0 active+clean+scrubbing] scrub state FINISH -6> 2017-03-29 22:36:36.972645 7f253b69b700 10 osd.2 pg_epoch: 756 pg[0.28( empty local-les=727 n=0 ec=47 les/c/f 727/727/0 725/725/163) [2,1] r=0 lpr=725 crt=0'0 mlcod 0'0 active+clean+scrubbing] scrub got 0/0 objects, 0/0 clones, 0/0 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 bytes, 0/0 hit_set_archive bytes. -5> 2017-03-29 22:36:36.972654 7f2546eb2700 10 monclient: _send_mon_message to mon.c at 172.21.15.117:6790/0 -4> 2017-03-29 22:36:36.972652 7f253b69b700 10 osd.2 pg_epoch: 756 pg[0.28( empty local-les=727 n=0 ec=47 les/c/f 727/727/0 725/725/163) [2,1] r=0 lpr=725 crt=0'0 mlcod 0'0 active+clean+scrubbing] scrub_process_inconsistent: checking authoritative -3> 2017-03-29 22:36:36.972656 7f2546eb2700 1 -- 172.21.15.117:6808/1991 --> 172.21.15.117:6790/0 -- pg_stats(39 pgs tid 178 v 0) v1 -- ?+0 0x5652197f19c0 con 0x56521713b3c0 -2> 2017-03-29 22:36:36.972659 7f253b69b700 0 log_channel(cluster) log [INF] : 0.28 scrub ok -1> 2017-03-29 22:36:36.972660 7f2546eb2700 10 log_client log_queue is 3 last_log 600 sent 596 num 3 unsent 4 sending 4 0> 2017-03-29 22:36:36.976280 7f2546eb2700 -1 /build/ceph-12.0.0-2001-gcc51f56/src/common/LogClient.cc: In function 'Message* LogClient::_get_mon_log_message()' thread 7f2546eb2700 time 2017-03-29 22:36:36.972662 /build/ceph-12.0.0-2001-gcc51f56/src/common/LogClient.cc: 310: FAILED assert(num_unsent <= log_queue.size()) ceph version 12.0.0-2001-gcc51f56 (cc51f56bb2c153bf35078eb95311e1d35c118c9f) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x56520b4d28b2] 2: (LogClient::_get_mon_log_message()+0x13e5) [0x56520b482d25] 3: (LogClient::get_mon_log_message()+0x1f) [0x56520b482ecf] 4: (MonClient::send_log()+0x15) [0x56520b4e64a5] 5: (MonClient::handle_auth(MAuthReply*)+0x348) [0x56520b4ef488] 6: (MonClient::ms_dispatch(Message*)+0x39b) [0x56520b4efa9b] 7: (DispatchQueue::entry()+0xf4a) [0x56520b69e94a] 8: (DispatchQueue::DispatchThread::entry()+0xd) [0x56520b55a33d] 9: (()+0x770a) [0x7f25590bd70a] 10: (clone()+0x6d) [0x7f255813482d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
this was on wip-mgr-commands, but no changes were made the osd or logclient, so I'm guessing this is related to the monclient multiconnection refactor?
/a/sage-2017-03-29_21:10:58-rados-wip-mgr-commands---basic-smithi/962237
Updated by Sage Weil about 7 years ago
/a/sage-2017-04-07_04:46:46-rados:thrash-wip-bluestore-osr-drain-hang---basic-smithi/995356
Updated by Kefu Chai about 7 years ago
it's a race of LogChannel::do_log()
and the "log_client->reset_session(); send_log()" MonClient::handle_auth()
. this is not introduced by the MonClient multi-con change. and it is also in jewel.
-2> 2017-03-29 22:36:36.972659 7f253b69b700 0 log_channel(cluster) log [INF] : 0.28 scrub ok -1> 2017-03-29 22:36:36.972660 7f2546eb2700 10 log_client log_queue is 3 last_log 600 sent 596 num 3 unsent 4 sending 4 0> 2017-03-29 22:36:36.976280 7f2546eb2700 -1 /build/ceph-12.0.0-2001-gcc51f56/src/common/LogClient.cc: In function 'Message* LogClient::_get_mon_log_message()' thread 7f2546eb2700 time 2017-03-29 22:36:36.972662 /build/ceph-12.0.0-2001-gcc51f56/src/common/LogClient.cc: 310: FAILED assert(num_unsent <= log_queue.size())
- reset_session() updates the "last_log_sent"
- osd->clog->info() calls do_log, which calls
- LogClient::get_next_seq() // last_log is atomic, and without log_lock
- LogClient::queue() // with log_lock
- MonClient::send_log() calls
LogClient::get_mon_log_message()
- LogClient::get_mon_log_message() calls
_get_mon_log_message()
with log_lock
- LogClient::get_mon_log_message() calls
but osd->clog->info()
is not atomic in the sense that get_mon_log_message
could be called after LogClient::last_log
is updated by LogClient::get_next_seq()
and before LogClient::queue()
, that's why num_unsent was 4 and log_queue.size() is 3, which should have been 4 right after the LogClient::queue()
call.
Updated by Kefu Chai about 7 years ago
- Status changed from New to Fix Under Review
Updated by Sage Weil about 7 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Nathan Cutler about 7 years ago
- Copied to Backport #19617: jewel: common/LogClient.cc: 310: FAILED assert(num_unsent <= log_queue.size()) added
Updated by Nathan Cutler about 7 years ago
- Copied to Backport #19618: kraken: common/LogClient.cc: 310: FAILED assert(num_unsent <= log_queue.size()) added
Updated by Nathan Cutler almost 7 years ago
- Status changed from Pending Backport to Resolved
Updated by Sage Weil almost 7 years ago
- Status changed from Resolved to In Progress
- Priority changed from Urgent to Immediate
this popped up again!
2017-04-28T19:11:13.357 INFO:tasks.ceph.osd.4.smithi117.stderr:/build/ceph-12.0.1-1689-gcd86908/src/common/LogClient.cc: In function 'Message* LogClient::_get_mon_log_message()' thread 7fab5b2dd700 time 2017-04-28 19:11:13.360971 2017-04-28T19:11:13.357 INFO:tasks.ceph.osd.4.smithi117.stderr:/build/ceph-12.0.1-1689-gcd86908/src/common/LogClient.cc: 308: FAILED assert(num_unsent <= log_queue.size()) 2017-04-28T19:11:13.365 INFO:tasks.ceph.osd.4.smithi117.stderr: ceph version 12.0.1-1689-gcd86908 (cd86908220578fbe091355ce3d8d18166e4cb352) 2017-04-28T19:11:13.365 INFO:tasks.ceph.osd.4.smithi117.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55b2cda00552] 2017-04-28T19:11:13.365 INFO:tasks.ceph.osd.4.smithi117.stderr: 2: (LogClient::_get_mon_log_message()+0x13e5) [0x55b2cd9d1745] 2017-04-28T19:11:13.365 INFO:tasks.ceph.osd.4.smithi117.stderr: 3: (LogClient::get_mon_log_message(bool)+0x25) [0x55b2cd9d18f5] 2017-04-28T19:11:13.365 INFO:tasks.ceph.osd.4.smithi117.stderr: 4: (MonClient::send_log(bool)+0x19) [0x55b2cda14149] 2017-04-28T19:11:13.365 INFO:tasks.ceph.osd.4.smithi117.stderr: 5: (MonClient::tick()+0x3c1) [0x55b2cda1db01] 2017-04-28T19:11:13.365 INFO:tasks.ceph.osd.4.smithi117.stderr: 6: (Context::complete(int)+0x9) [0x55b2cd491429] 2017-04-28T19:11:13.365 INFO:tasks.ceph.osd.4.smithi117.stderr: 7: (SafeTimer::timer_thread()+0x452) [0x55b2cd9fc7d2] 2017-04-28T19:11:13.366 INFO:tasks.ceph.osd.4.smithi117.stderr: 8: (SafeTimerThread::entry()+0xd) [0x55b2cd9fdbdd] 2017-04-28T19:11:13.366 INFO:tasks.ceph.osd.4.smithi117.stderr: 9: (()+0x770a) [0x7fab70e0a70a] 2017-04-28T19:11:13.366 INFO:tasks.ceph.osd.4.smithi117.stderr: 10: (clone()+0x6d) [0x7fab6fe8182d] 2017-04-28T19:11:13.366 INFO:tasks.ceph.osd.4.smithi117.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
/a/sage-2017-04-28_16:57:50-rados-wip-sage-testing2---basic-smithi/1077110
Updated by Kefu Chai almost 7 years ago
-4704> 2017-04-28 19:11:13.360957 7fab532cd700 0 log_channel(cluster) log [INF] : osd_enable_op_tracker = 'true' -4703> 2017-04-28 19:11:13.360962 7fab5b2dd700 10 log_client log_queue is 7 last_log 94 sent 86 num 7 unsent 8 sending 8
racing again: LogChannel::do_log() and LogClient::get_mon_log_message(false)
Updated by Kefu Chai almost 7 years ago
- Status changed from In Progress to Fix Under Review
Updated by Kefu Chai almost 7 years ago
- Status changed from Fix Under Review to Resolved
Updated by Kefu Chai almost 7 years ago
- Status changed from Resolved to Pending Backport
Updated by Sage Weil almost 7 years ago
- Status changed from Pending Backport to 12
This is still happening:
2017-05-16T23:55:22.468 INFO:tasks.ceph.osd.5.smithi198.stderr: 0> 2017-05-16 23:55:22.412915 7f5ea2d8e700 -1 /build/ceph-12.0.1-2423-ge3b049c/src/common/LogClient.cc: In function 'Message* LogClient::_get_mon_log_message()' thread 7f5ea2d8e700 time 2017-05-16 23:55:22.410335 2017-05-16T23:55:22.468 INFO:tasks.ceph.osd.5.smithi198.stderr:/build/ceph-12.0.1-2423-ge3b049c/src/common/LogClient.cc: 310: FAILED assert(num_unsent <= log_queue.size()) 2017-05-16T23:55:22.468 INFO:tasks.ceph.osd.5.smithi198.stderr: 2017-05-16T23:55:22.468 INFO:tasks.ceph.osd.5.smithi198.stderr: ceph version 12.0.1-2423-ge3b049c (e3b049cc24496285b07728c70033b1bd7a1e592a) 2017-05-16T23:55:22.468 INFO:tasks.ceph.osd.5.smithi198.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x5653aafdac72] 2017-05-16T23:55:22.468 INFO:tasks.ceph.osd.5.smithi198.stderr: 2: (LogClient::_get_mon_log_message()+0x13e5) [0x5653aafabb45] 2017-05-16T23:55:22.468 INFO:tasks.ceph.osd.5.smithi198.stderr: 3: (LogClient::get_mon_log_message(bool)+0x40) [0x5653aafabd10] 2017-05-16T23:55:22.468 INFO:tasks.ceph.osd.5.smithi198.stderr: 4: (MonClient::send_log(bool)+0x19) [0x5653aafee869] 2017-05-16T23:55:22.468 INFO:tasks.ceph.osd.5.smithi198.stderr: 5: (MonClient::tick()+0x3c1) [0x5653aaff81a1] 2017-05-16T23:55:22.469 INFO:tasks.ceph.osd.5.smithi198.stderr: 6: (Context::complete(int)+0x9) [0x5653aaa5e669] 2017-05-16T23:55:22.469 INFO:tasks.ceph.osd.5.smithi198.stderr: 7: (SafeTimer::timer_thread()+0x452) [0x5653aafd6ea2] 2017-05-16T23:55:22.469 INFO:tasks.ceph.osd.5.smithi198.stderr: 8: (SafeTimerThread::entry()+0xd) [0x5653aafd82ad] 2017-05-16T23:55:22.469 INFO:tasks.ceph.osd.5.smithi198.stderr: 9: (()+0x76ba) [0x7f5eb80bd6ba] 2017-05-16T23:55:22.469 INFO:tasks.ceph.osd.5.smithi198.stderr: 10: (clone()+0x6d) [0x7f5eb713482d]
/a/sage-2017-05-16_22:22:40-rados-wip-sage-testing2---basic-smithi/1186056
verified the commit contains the previous fix
gnit:build (wip-sage-testing2) 10:28 AM $ git branch --contains 4da64aa g master wip-bluestore-aio-warning wip-gui wip-list-objects-more wip-luminous-legacy-check wip-mempool-perf wip-mon-kvtype wip-mon-summary wip-osdmap-require wip-sage-testing * wip-sage-testing2 gnit:build (wip-sage-testing2) 10:28 AM $ git rev-parse wip-sage-testing2 e3b049cc24496285b07728c70033b1bd7a1e592a
Updated by Sage Weil almost 7 years ago
-1> 2017-06-05 17:19:59.922592 7fa504dfd700 10 log_client log_queue is 4 last_log 4215 sent 4210 num 4 unsent 5 sending 5 0> 2017-06-05 17:19:59.928726 7fa504dfd700 -1 /build/ceph-12.0.2-2248-g8c17cc5/src/common/LogClient.cc: In function 'Message* LogClient::_get_mon_log_message()' thread 7fa504dfd700 time 2017-06-05 17:19:59.922597 /build/ceph-12.0.2-2248-g8c17cc5/src/common/LogClient.cc: 293: FAILED assert(num_unsent <= log_queue.size()) ceph version 12.0.2-2248-g8c17cc5 (8c17cc5e289cdd593b537d3b627451e9483d742b) luminous (dev) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x563a70c78f72] 2: (LogClient::_get_mon_log_message()+0x1502) [0x563a70c48e72] 3: (LogClient::get_mon_log_message(bool)+0x43) [0x563a70c49023] 4: (MonClient::send_log(bool)+0x19) [0x563a70c8c649] 5: (MonClient::tick()+0x3c1) [0x563a70c96a11] 6: (Context::complete(int)+0x9) [0x563a706f12a9] 7: (SafeTimer::timer_thread()+0x452) [0x563a70c75092] 8: (SafeTimerThread::entry()+0xd) [0x563a70c7649d] 9: (()+0x76ba) [0x7fa51d8d96ba] 10: (clone()+0x6d) [0x7fa51c95082d]
osd.5
/a/sage-2017-06-05_14:47:27-rados-wip-sage-testing-distro-basic-smithi/1260414
Updated by Kefu Chai almost 7 years ago
- Status changed from 12 to Need More Info
the osd.5 log is huge, and i cannot use zless to find the log lines i am interested. would be great if we can have another smaller log with this failure.
Updated by Sage Weil almost 7 years ago
- Priority changed from Immediate to Urgent
Updated by Patrick Donnelly almost 7 years ago
- Has duplicate Bug #20680: osd: FAILED assert(num_unsent <= log_queue.size()) added
Updated by Patrick Donnelly almost 7 years ago
Another case: http://tracker.ceph.com/issues/20680
Updated by Kefu Chai almost 7 years ago
- Status changed from Need More Info to 12
will take a look again tmr.
Updated by Sage Weil over 6 years ago
- Status changed from 12 to Duplicate
Updated by Sage Weil over 6 years ago
- Is duplicate of Bug #18209: src/common/LogClient.cc: 310: FAILED assert(num_unsent <= log_queue.size()) added