Bug #47767
octopus: setting noscrub crashed osd process
Status:
Closed
% Done:
0%
Source:
Community (dev)
Tags:
Backport:
octopus, nautilus
Regression:
Yes
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
We just had a crash of one osd (out of ~1200) moments after we set noscrub and nodeep-scrub on the cluster.
Here's the setting of noscrub and the reporting of osd.762 failing:
<pre>
2020-10-06 15:54:45.474 7f60f532e700 0 mon.cephbeesly-mon-2a00f134e5@0(leader) e40 handle_command mon_command({"prefix": "osd set", "key": "noscrub"} v 0) v1
2020-10-06 15:54:45.474 7f60f532e700 0 log_channel(audit) log [INF] : from='client.504140151 ' entity='client.admin' cmd=[{"prefix": "osd set", "key": "noscrub"}]: dispatch
2020-10-06 15:54:45.956 7f60f7b33700 1 mon.cephbeesly-mon-2a00f134e5@0(leader).osd e3446634 do_prune osdmap full prune enabled
2020-10-06 15:54:46.249 7f60f3b2b700 1 mon.cephbeesly-mon-2a00f134e5@0(leader).osd e3446635 e3446635: 1222 total, 1219 up, 1216 in
2020-10-06 15:54:46.254 7f60f3b2b700 0 log_channel(audit) log [INF] : from='client.504140151 ' entity='client.admin' cmd='[{"prefix": "osd set", "key": "noscrub"}]': finished
2020-10-06 15:54:46.254 7f60f3b2b700 0 log_channel(cluster) log [DBG] : osdmap e3446635: 1222 total, 1219 up, 1216 in
2020-10-06 15:54:47.855 7f60f7b33700 0 log_channel(cluster) log [WRN] : Health check update: Degraded data redundancy: 558438/1087980300 objects degraded (0.051%), 29 pgs degraded, 29 pgs undersized (PG_DEGRADED)
2020-10-06 15:54:47.856 7f60f7b33700 1 mon.cephbeesly-mon-2a00f134e5@0(leader).osd e3446635 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 125829120 full_alloc: 520093696 kv_alloc: 369098752
2020-10-06 15:54:50.096 7f60f532e700 1 mon.cephbeesly-mon-2a00f134e5@0(leader).osd e3446635 prepare_failure osd.762 [v2:128.142.162.24:6800/201371,v1:128.142.162.24:6801/201371] from osd.929 is reporting failure:1
2020-10-06 15:54:50.096 7f60f532e700 0 log_channel(cluster) log [DBG] : osd.762 reported immediately failed by osd.929
2020-10-06 15:54:50.096 7f60f532e700 1 mon.cephbeesly-mon-2a00f134e5@0(leader).osd e3446635 we're forcing failure of osd.762
2020-10-06 15:54:50.096 7f60f532e700 0 log_channel(cluster) log [INF] : osd.762 failed (root=default,room=0513-R-0050,rack=RA17,host=p06253939h49320) (connection refused reported by osd.929)
</pre>
The osdmap e3446634 contains the noscrub, nodeep-scrub flags. Here's osd.762 crashing:
<pre>
   -25> 2020-10-06 15:54:44.719 7f57ba0c4700 2 osd.762 pg_epoch: 3446633 pg[4.4b1( v 3446633'286971848 (3446612'286968763,3446633'286971848] local-lis/les=3399272/3399273 n=36055 ec=355/355 lis/c 3399272/3399272 les/c/f 3399273/3399299/0 3399272/3399272/3399220) [762,247,793] r=0 lpr=3399272 crt=3446633'286971848 lcod 3446633'286971847 mlcod 3446633'286971847 active+clean+scrubbing+deep trimq=[1984d~1] ps=119]
   -24> 2020-10-06 15:54:45.093 7f57ba0c4700 10 monclient: _renew_subs
   -23> 2020-10-06 15:54:45.093 7f57ba0c4700 10 monclient: _send_mon_message to mon.p05517715y01595 at v2:188.184.36.166:3300/0
   -22> 2020-10-06 15:54:45.093 7f57c58db700 3 osd.762 3446633 handle_osd_map epochs [3446634,3446634], i have 3446633, src has [3442954,3446634]
   -21> 2020-10-06 15:54:45.280 7f57c30d6700 3 osd.762 3446634 handle_osd_map epochs [3446634,3446634], i have 3446634, src has [3442954,3446634]
   -20> 2020-10-06 15:54:45.280 7f57c30d6700 3 osd.762 3446634 handle_osd_map epochs [3446634,3446634], i have 3446634, src has [3442954,3446634]
   -19> 2020-10-06 15:54:45.322 7f57c30d6700 3 osd.762 3446634 handle_osd_map epochs [3446634,3446634], i have 3446634, src has [3442954,3446634]
   -18> 2020-10-06 15:54:45.322 7f57c30d6700 3 osd.762 3446634 handle_osd_map epochs [3446634,3446634], i have 3446634, src has [3442954,3446634]
   -17> 2020-10-06 15:54:45.324 7f57ba0c4700 0 log_channel(cluster) log [DBG] : 4.4b1 scrub starts
   -16> 2020-10-06 15:54:45.328 7f57c30d6700 3 osd.762 3446634 handle_osd_map epochs [3446634,3446634], i have 3446634, src has [3442954,3446634]
   -15> 2020-10-06 15:54:45.372 7f57c30d6700 3 osd.762 3446634 handle_osd_map epochs [3446634,3446634], i have 3446634, src has [3442954,3446634]
   -14> 2020-10-06 15:54:45.389 7f57ba0c4700 2 osd.762 pg_epoch: 3446634 pg[4.4b1( v 3446633'286971848 (3446612'286968763,3446633'286971848] local-lis/les=3399272/3399273 n=36055 ec=355/355 lis/c 3399272/3399272 les/c/f 3399273/3399299/0 3399272/3399272/3399220) [762,247,793] r=0 lpr=3399272 crt=3446633'286971848 lcod 3446633'286971847 mlcod 3446633'286971847 active+clean+scrubbing trimq=[1984d~1] ps=119] scrub_compare_maps replica 247 has 3 items
   -13> 2020-10-06 15:54:45.389 7f57ba0c4700 2 osd.762 pg_epoch: 3446634 pg[4.4b1( v 3446633'286971848 (3446612'286968763,3446633'286971848] local-lis/les=3399272/3399273 n=36055 ec=355/355 lis/c 3399272/3399272 les/c/f 3399273/3399299/0 3399272/3399272/3399220) [762,247,793] r=0 lpr=3399272 crt=3446633'286971848 lcod 3446633'286971847 mlcod 3446633'286971847 active+clean+scrubbing trimq=[1984d~1] ps=119] scrub_compare_maps replica 793 has 3 items
   -12> 2020-10-06 15:54:45.389 7f57ba0c4700 2 osd.762 pg_epoch: 3446634 pg[4.4b1( v 3446633'286971848 (3446612'286968763,3446633'286971848] local-lis/les=3399272/3399273 n=36055 ec=355/355 lis/c 3399272/3399272 les/c/f 3399273/3399299/0 3399272/3399272/3399220) [762,247,793] r=0 lpr=3399272 crt=3446633'286971848 lcod 3446633'286971847 mlcod 3446633'286971847 active+clean+scrubbing trimq=[1984d~1] ps=119] scrub_compare_maps osd.762 has 3 items
   -11> 2020-10-06 15:54:45.389 7f57ba0c4700 2 osd.762 pg_epoch: 3446634 pg[4.4b1( v 3446633'286971848 (3446612'286968763,3446633'286971848] local-lis/les=3399272/3399273 n=36055 ec=355/355 lis/c 3399272/3399272 les/c/f 3399273/3399299/0 3399272/3399272/3399220) [762,247,793] r=0 lpr=3399272 crt=3446633'286971848 lcod 3446633'286971847 mlcod 3446633'286971847 active+clean+scrubbing trimq=[1984d~1] ps=119]
4.4b1 shard 247 4:8d20000d:::rbd_data.afa6497c52e435.000000000001adf5:head : missing
4.4b1 shard 247 4:8d20001b:::rbd_data.e28da6777550e9.00000000004025c5:head : missing
4.4b1 shard 247 4:8d20007c:::rbd_data.9e6d808a383234.000000000005eb0f:head : missing
4.4b1 shard 762 4:8d22820b:::rbd_data.0e08f963b0c4e9.0000000000041e4a:head : missing
4.4b1 shard 793 4:8d22820b:::rbd_data.0e08f963b0c4e9.0000000000041e4a:head : missing
4.4b1 shard 762 4:8d228210:::rbd_data.6db2f7c4d97bf74.00000000000155ee:head : missing
4.4b1 shard 793 4:8d228210:::rbd_data.6db2f7c4d97bf74.00000000000155ee:head : missing
4.4b1 shard 762 4:8d228217:::rbd_data.67f02bc2a30fea8.0000000000000e59:head : missing
4.4b1 shard 793 4:8d228217:::rbd_data.67f02bc2a30fea8.0000000000000e59:head : missing
   -10> 2020-10-06 15:54:45.389 7f57ba0c4700 -1 log_channel(cluster) log [ERR] : 4.4b1 shard 247 4:8d20000d:::rbd_data.afa6497c52e435.000000000001adf5:head : missing
    -9> 2020-10-06 15:54:45.389 7f57ba0c4700 -1 log_channel(cluster) log [ERR] : 4.4b1 shard 247 4:8d20001b:::rbd_data.e28da6777550e9.00000000004025c5:head : missing
    -8> 2020-10-06 15:54:45.389 7f57ba0c4700 -1 log_channel(cluster) log [ERR] : 4.4b1 shard 247 4:8d20007c:::rbd_data.9e6d808a383234.000000000005eb0f:head : missing
    -7> 2020-10-06 15:54:45.389 7f57ba0c4700 -1 log_channel(cluster) log [ERR] : 4.4b1 shard 762 4:8d22820b:::rbd_data.0e08f963b0c4e9.0000000000041e4a:head : missing
    -6> 2020-10-06 15:54:45.389 7f57ba0c4700 -1 log_channel(cluster) log [ERR] : 4.4b1 shard 793 4:8d22820b:::rbd_data.0e08f963b0c4e9.0000000000041e4a:head : missing
    -5> 2020-10-06 15:54:45.389 7f57ba0c4700 -1 log_channel(cluster) log [ERR] : 4.4b1 shard 762 4:8d228210:::rbd_data.6db2f7c4d97bf74.00000000000155ee:head : missing
    -4> 2020-10-06 15:54:45.389 7f57ba0c4700 -1 log_channel(cluster) log [ERR] : 4.4b1 shard 793 4:8d228210:::rbd_data.6db2f7c4d97bf74.00000000000155ee:head : missing
    -3> 2020-10-06 15:54:45.389 7f57ba0c4700 -1 log_channel(cluster) log [ERR] : 4.4b1 shard 762 4:8d228217:::rbd_data.67f02bc2a30fea8.0000000000000e59:head : missing
    -2> 2020-10-06 15:54:45.389 7f57ba0c4700 -1 log_channel(cluster) log [ERR] : 4.4b1 shard 793 4:8d228217:::rbd_data.67f02bc2a30fea8.0000000000000e59:head : missing
    -1> 2020-10-06 15:54:45.406 7f57ba0c4700 -1 /builddir/build/BUILD/ceph-14.2.11/src/osd/PG.cc: In function 'void PG::do_replica_scrub_map(OpRequestRef)' thread 7f57ba0c4700 time 2020-10-06 15:54:45.394594
/builddir/build/BUILD/ceph-14.2.11/src/osd/PG.cc: 4632: FAILED ceph_assert(scrubber.waiting_on_whom.count(m->from))

 ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x55b3a44150e1]
 2: (()+0x4d72a9) [0x55b3a44152a9]
 3: (PG::do_replica_scrub_map(boost::intrusive_ptr<OpRequest>)+0x634) [0x55b3a45ad404]
 4: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x692) [0x55b3a46b4d42]
 5: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x362) [0x55b3a44f4da2]
 6: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x55b3a47837c2]
 7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90f) [0x55b3a450fd3f]
 8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55b3a4ac3c26]
 9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55b3a4ac6740]
 10: (()+0x7ea5) [0x7f57dc0faea5]
 11: (clone()+0x6d) [0x7f57dafbe8dd]

     0> 2020-10-06 15:54:45.412 7f57ba0c4700 -1 *** Caught signal (Aborted) **
 in thread 7f57ba0c4700 thread_name:tp_osd_tp
</pre>
I have the coredump; maybe this is useful:
<pre>
(gdb) up
#8  0x000055b3a45ad404 in PG::do_replica_scrub_map(boost::intrusive_ptr<OpRequest>) () at /usr/src/debug/ceph-14.2.11/src/osd/PG.cc:4635
4635        dout(10) << __func__ << " replica was preempted, setting flag" << dendl;
(gdb) list
4630      dout(10) << __func__ << " waiting_on_whom was " << scrubber.waiting_on_whom
4631               << dendl;
4632      ceph_assert(scrubber.waiting_on_whom.count(m->from));
4633      scrubber.waiting_on_whom.erase(m->from);
4634      if (m->preempted) {
4635        dout(10) << __func__ << " replica was preempted, setting flag" << dendl;
4636        scrub_preempted = true;
4637      }
4638      if (scrubber.waiting_on_whom.empty()) {
4639        requeue_scrub(ops_blocked_by_scrub());
(gdb) p m->from
$1 = {static NO_OSD = 2147483647, osd = 247, shard = {id = -1 '\377', static NO_SHARD = {id = -1 '\377', static NO_SHARD = <same as static member of an already seen type>}}}
</pre>
I suppose this is related to #46275, which just landed in this release, so I marked this as a regression.