Bug #47767


octopus: setting noscrub crashed osd process

Added by Dan van der Ster over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Urgent
Assignee: David Zafman
Category: -
Target version:
% Done: 0%
Source: Community (dev)
Tags:
Backport: octopus, nautilus
Regression: Yes
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We just had a crash of one osd (out of ~1200) moments after we set noscrub and nodeep-scrub on the cluster.

Here's the setting of noscrub and the reporting of osd.762 failing:

<pre>
2020-10-06 15:54:45.474 7f60f532e700  0 mon.cephbeesly-mon-2a00f134e5@0(leader) e40 handle_command mon_command({"prefix": "osd set", "key": "noscrub"} v 0) v1
2020-10-06 15:54:45.474 7f60f532e700  0 log_channel(audit) log [INF] : from='client.504140151 ' entity='client.admin' cmd=[{"prefix": "osd set", "key": "noscrub"}]: dispatch
2020-10-06 15:54:45.956 7f60f7b33700  1 mon.cephbeesly-mon-2a00f134e5@0(leader).osd e3446634 do_prune osdmap full prune enabled
2020-10-06 15:54:46.249 7f60f3b2b700  1 mon.cephbeesly-mon-2a00f134e5@0(leader).osd e3446635 e3446635: 1222 total, 1219 up, 1216 in
2020-10-06 15:54:46.254 7f60f3b2b700  0 log_channel(audit) log [INF] : from='client.504140151 ' entity='client.admin' cmd='[{"prefix": "osd set", "key": "noscrub"}]': finished
2020-10-06 15:54:46.254 7f60f3b2b700  0 log_channel(cluster) log [DBG] : osdmap e3446635: 1222 total, 1219 up, 1216 in
2020-10-06 15:54:47.855 7f60f7b33700  0 log_channel(cluster) log [WRN] : Health check update: Degraded data redundancy: 558438/1087980300 objects degraded (0.051%), 29 pgs degraded, 29 pgs undersized (PG_DEGRADED)
2020-10-06 15:54:47.856 7f60f7b33700  1 mon.cephbeesly-mon-2a00f134e5@0(leader).osd e3446635 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 125829120 full_alloc: 520093696 kv_alloc: 369098752
2020-10-06 15:54:50.096 7f60f532e700  1 mon.cephbeesly-mon-2a00f134e5@0(leader).osd e3446635 prepare_failure osd.762 [v2:128.142.162.24:6800/201371,v1:128.142.162.24:6801/201371] from osd.929 is reporting failure:1
2020-10-06 15:54:50.096 7f60f532e700  0 log_channel(cluster) log [DBG] : osd.762 reported immediately failed by osd.929
2020-10-06 15:54:50.096 7f60f532e700  1 mon.cephbeesly-mon-2a00f134e5@0(leader).osd e3446635  we're forcing failure of osd.762
2020-10-06 15:54:50.096 7f60f532e700  0 log_channel(cluster) log [INF] : osd.762 failed (root=default,room=0513-R-0050,rack=RA17,host=p06253939h49320) (connection refused reported by osd.929)
</pre>

The osdmap e3446634 contains the noscrub, nodeep-scrub flags.

Here's osd.762 crashing:

<pre>
   -25> 2020-10-06 15:54:44.719 7f57ba0c4700  2 osd.762 pg_epoch: 3446633 pg[4.4b1( v 3446633'286971848 (3446612'286968763,3446633'286971848] local-lis/les=3399272/3399273 n=36055 ec=355/355 lis/c 3399272/3399272 les/c/f 3399273/3399299/0 3399272/3399272/3399220) [762,247,793] r=0 lpr=3399272 crt=3446633'286971848 lcod 3446633'286971847 mlcod 3446633'286971847 active+clean+scrubbing+deep trimq=[1984d~1] ps=119]
   -24> 2020-10-06 15:54:45.093 7f57ba0c4700 10 monclient: _renew_subs
   -23> 2020-10-06 15:54:45.093 7f57ba0c4700 10 monclient: _send_mon_message to mon.p05517715y01595 at v2:188.184.36.166:3300/0
   -22> 2020-10-06 15:54:45.093 7f57c58db700  3 osd.762 3446633 handle_osd_map epochs [3446634,3446634], i have 3446633, src has [3442954,3446634]
   -21> 2020-10-06 15:54:45.280 7f57c30d6700  3 osd.762 3446634 handle_osd_map epochs [3446634,3446634], i have 3446634, src has [3442954,3446634]
   -20> 2020-10-06 15:54:45.280 7f57c30d6700  3 osd.762 3446634 handle_osd_map epochs [3446634,3446634], i have 3446634, src has [3442954,3446634]
   -19> 2020-10-06 15:54:45.322 7f57c30d6700  3 osd.762 3446634 handle_osd_map epochs [3446634,3446634], i have 3446634, src has [3442954,3446634]
   -18> 2020-10-06 15:54:45.322 7f57c30d6700  3 osd.762 3446634 handle_osd_map epochs [3446634,3446634], i have 3446634, src has [3442954,3446634]
   -17> 2020-10-06 15:54:45.324 7f57ba0c4700  0 log_channel(cluster) log [DBG] : 4.4b1 scrub starts
   -16> 2020-10-06 15:54:45.328 7f57c30d6700  3 osd.762 3446634 handle_osd_map epochs [3446634,3446634], i have 3446634, src has [3442954,3446634]
   -15> 2020-10-06 15:54:45.372 7f57c30d6700  3 osd.762 3446634 handle_osd_map epochs [3446634,3446634], i have 3446634, src has [3442954,3446634]
   -14> 2020-10-06 15:54:45.389 7f57ba0c4700  2 osd.762 pg_epoch: 3446634 pg[4.4b1( v 3446633'286971848 (3446612'286968763,3446633'286971848] local-lis/les=3399272/3399273 n=36055 ec=355/355 lis/c 3399272/3399272 les/c/f 3399273/3399299/0 3399272/3399272/3399220) [762,247,793] r=0 lpr=3399272 crt=3446633'286971848 lcod 3446633'286971847 mlcod 3446633'286971847 active+clean+scrubbing trimq=[1984d~1] ps=119] scrub_compare_maps replica 247 has 3 items
   -13> 2020-10-06 15:54:45.389 7f57ba0c4700  2 osd.762 pg_epoch: 3446634 pg[4.4b1( v 3446633'286971848 (3446612'286968763,3446633'286971848] local-lis/les=3399272/3399273 n=36055 ec=355/355 lis/c 3399272/3399272 les/c/f 3399273/3399299/0 3399272/3399272/3399220) [762,247,793] r=0 lpr=3399272 crt=3446633'286971848 lcod 3446633'286971847 mlcod 3446633'286971847 active+clean+scrubbing trimq=[1984d~1] ps=119] scrub_compare_maps replica 793 has 3 items
   -12> 2020-10-06 15:54:45.389 7f57ba0c4700  2 osd.762 pg_epoch: 3446634 pg[4.4b1( v 3446633'286971848 (3446612'286968763,3446633'286971848] local-lis/les=3399272/3399273 n=36055 ec=355/355 lis/c 3399272/3399272 les/c/f 3399273/3399299/0 3399272/3399272/3399220) [762,247,793] r=0 lpr=3399272 crt=3446633'286971848 lcod 3446633'286971847 mlcod 3446633'286971847 active+clean+scrubbing trimq=[1984d~1] ps=119] scrub_compare_maps   osd.762 has 3 items
   -11> 2020-10-06 15:54:45.389 7f57ba0c4700  2 osd.762 pg_epoch: 3446634 pg[4.4b1( v 3446633'286971848 (3446612'286968763,3446633'286971848] local-lis/les=3399272/3399273 n=36055 ec=355/355 lis/c 3399272/3399272 les/c/f 3399273/3399299/0 3399272/3399272/3399220) [762,247,793] r=0 lpr=3399272 crt=3446633'286971848 lcod 3446633'286971847 mlcod 3446633'286971847 active+clean+scrubbing trimq=[1984d~1] ps=119] 4.4b1 shard 247 4:8d20000d:::rbd_data.afa6497c52e435.000000000001adf5:head : missing
4.4b1 shard 247 4:8d20001b:::rbd_data.e28da6777550e9.00000000004025c5:head : missing
4.4b1 shard 247 4:8d20007c:::rbd_data.9e6d808a383234.000000000005eb0f:head : missing
4.4b1 shard 762 4:8d22820b:::rbd_data.0e08f963b0c4e9.0000000000041e4a:head : missing
4.4b1 shard 793 4:8d22820b:::rbd_data.0e08f963b0c4e9.0000000000041e4a:head : missing
4.4b1 shard 762 4:8d228210:::rbd_data.6db2f7c4d97bf74.00000000000155ee:head : missing
4.4b1 shard 793 4:8d228210:::rbd_data.6db2f7c4d97bf74.00000000000155ee:head : missing
4.4b1 shard 762 4:8d228217:::rbd_data.67f02bc2a30fea8.0000000000000e59:head : missing
4.4b1 shard 793 4:8d228217:::rbd_data.67f02bc2a30fea8.0000000000000e59:head : missing

   -10> 2020-10-06 15:54:45.389 7f57ba0c4700 -1 log_channel(cluster) log [ERR] : 4.4b1 shard 247 4:8d20000d:::rbd_data.afa6497c52e435.000000000001adf5:head : missing
    -9> 2020-10-06 15:54:45.389 7f57ba0c4700 -1 log_channel(cluster) log [ERR] : 4.4b1 shard 247 4:8d20001b:::rbd_data.e28da6777550e9.00000000004025c5:head : missing
    -8> 2020-10-06 15:54:45.389 7f57ba0c4700 -1 log_channel(cluster) log [ERR] : 4.4b1 shard 247 4:8d20007c:::rbd_data.9e6d808a383234.000000000005eb0f:head : missing
    -7> 2020-10-06 15:54:45.389 7f57ba0c4700 -1 log_channel(cluster) log [ERR] : 4.4b1 shard 762 4:8d22820b:::rbd_data.0e08f963b0c4e9.0000000000041e4a:head : missing
    -6> 2020-10-06 15:54:45.389 7f57ba0c4700 -1 log_channel(cluster) log [ERR] : 4.4b1 shard 793 4:8d22820b:::rbd_data.0e08f963b0c4e9.0000000000041e4a:head : missing
    -5> 2020-10-06 15:54:45.389 7f57ba0c4700 -1 log_channel(cluster) log [ERR] : 4.4b1 shard 762 4:8d228210:::rbd_data.6db2f7c4d97bf74.00000000000155ee:head : missing
    -4> 2020-10-06 15:54:45.389 7f57ba0c4700 -1 log_channel(cluster) log [ERR] : 4.4b1 shard 793 4:8d228210:::rbd_data.6db2f7c4d97bf74.00000000000155ee:head : missing
    -3> 2020-10-06 15:54:45.389 7f57ba0c4700 -1 log_channel(cluster) log [ERR] : 4.4b1 shard 762 4:8d228217:::rbd_data.67f02bc2a30fea8.0000000000000e59:head : missing
    -2> 2020-10-06 15:54:45.389 7f57ba0c4700 -1 log_channel(cluster) log [ERR] : 4.4b1 shard 793 4:8d228217:::rbd_data.67f02bc2a30fea8.0000000000000e59:head : missing
    -1> 2020-10-06 15:54:45.406 7f57ba0c4700 -1 /builddir/build/BUILD/ceph-14.2.11/src/osd/PG.cc: In function 'void PG::do_replica_scrub_map(OpRequestRef)' thread 7f57ba0c4700 time 2020-10-06 15:54:45.394594
/builddir/build/BUILD/ceph-14.2.11/src/osd/PG.cc: 4632: FAILED ceph_assert(scrubber.waiting_on_whom.count(m->from))

 ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x55b3a44150e1]
 2: (()+0x4d72a9) [0x55b3a44152a9]
 3: (PG::do_replica_scrub_map(boost::intrusive_ptr<OpRequest>)+0x634) [0x55b3a45ad404]
 4: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x692) [0x55b3a46b4d42]
 5: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x362) [0x55b3a44f4da2]
 6: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x55b3a47837c2]
 7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90f) [0x55b3a450fd3f]
 8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55b3a4ac3c26]
 9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55b3a4ac6740]
 10: (()+0x7ea5) [0x7f57dc0faea5]
 11: (clone()+0x6d) [0x7f57dafbe8dd]

     0> 2020-10-06 15:54:45.412 7f57ba0c4700 -1 *** Caught signal (Aborted) **
 in thread 7f57ba0c4700 thread_name:tp_osd_tp
</pre>

I have the coredump; maybe this is useful:
<pre>
(gdb) up
#8  0x000055b3a45ad404 in PG::do_replica_scrub_map(boost::intrusive_ptr<OpRequest>) ()
    at /usr/src/debug/ceph-14.2.11/src/osd/PG.cc:4635
4635        dout(10) << __func__ << " replica was preempted, setting flag" << dendl;
(gdb) list
4630      dout(10) << __func__ << " waiting_on_whom was " << scrubber.waiting_on_whom
4631           << dendl;
4632      ceph_assert(scrubber.waiting_on_whom.count(m->from));
4633      scrubber.waiting_on_whom.erase(m->from);
4634      if (m->preempted) {
4635        dout(10) << __func__ << " replica was preempted, setting flag" << dendl;
4636        scrub_preempted = true;
4637      }
4638      if (scrubber.waiting_on_whom.empty()) {
4639        requeue_scrub(ops_blocked_by_scrub());
(gdb) p m->from
$1 = {static NO_OSD = 2147483647, osd = 247, shard = {id = -1 '\377', static NO_SHARD = {id = -1 '\377', 
      static NO_SHARD = <same as static member of an already seen type>}}}
</pre>
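
To illustrate the failed invariant at PG.cc:4632: the primary asserts that it is still waiting on the replica that sent the scrub map. Below is a minimal standalone sketch (toy types and names, not Ceph's actual classes) of the suspected race: an abort triggered by the noscrub flag clears the set of pending replicas while a replica's scrub map is still in flight, and the late message trips exactly this kind of assertion.

<pre>
// Toy model of the suspected race, for illustration only; "Scrubber" here is
// a stand-in and not Ceph's real scrubber state.
#include <cassert>
#include <set>

struct Scrubber {
  std::set<int> waiting_on_whom;  // replica osd ids we still expect scrub maps from
};

// Primary-side handling of a replica's scrub map, mirroring the asserted invariant.
void do_replica_scrub_map(Scrubber& s, int from) {
  assert(s.waiting_on_whom.count(from));  // fires if the scrub was already aborted
  s.waiting_on_whom.erase(from);
}

// Abort path: setting noscrub/nodeep-scrub cancels the ongoing scrub and drops
// the primary's scrub bookkeeping, but a replica's reply may already be on the wire.
void abort_scrub(Scrubber& s) {
  s.waiting_on_whom.clear();
}

int main() {
  Scrubber s;
  s.waiting_on_whom = {247, 793};  // waiting on two replicas, as in pg 4.4b1

  abort_scrub(s);                  // operator sets noscrub -> scrub state cleared
  do_replica_scrub_map(s, 247);    // late reply from osd.247 -> assertion failure
}
</pre>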

I suppose this is related to #46275, which just landed in this release, so I marked this as a regression.


Related issues 4 (0 open, 4 closed)

Related to RADOS - Backport #46706: nautilus: Cancellation of on-going scrubs (Resolved, David Zafman)
Related to RADOS - Bug #46275: Cancellation of on-going scrubs (Resolved, David Zafman)
Copied to RADOS - Backport #48444: nautilus: octopus: setting noscrub crashed osd process (Resolved, David Zafman)
Copied to RADOS - Backport #48579: octopus: octopus: setting noscrub crashed osd process (Resolved, David Zafman)
#1

Updated by Dan van der Ster over 3 years ago

Sorry about the formatting -- I think it's still legible.

I found other osds with missing object errors also triggered by setting noscrub; those osds didn't crash like osd.762.

The cluster log is at ceph-post-file: c576a83c-c21e-4ebc-a0c1-9d141f150cb6 and osd.762's log is at ceph-post-file: 447d520b-f2c7-453c-a6a8-9166400c01a6

<pre>
2020-10-06 15:54:46.033257 osd.643 (osd.643) 857 : cluster [DBG] 75.1c59 scrub starts
2020-10-06 15:54:46.073770 osd.643 (osd.643) 858 : cluster [ERR] 75.1c59 shard 643 75:9a393bbb:::rbd_data.8e7b9f61990f23.0000000000026a20:head : missing
2020-10-06 15:54:46.073773 osd.643 (osd.643) 859 : cluster [ERR] 75.1c59 shard 946 75:9a393bbb:::rbd_data.8e7b9f61990f23.0000000000026a20:head : missing
2020-10-06 15:54:46.073775 osd.643 (osd.643) 860 : cluster [ERR] 75.1c59 shard 643 75:9a393bc6:::rbd_data.f77f2a36472214.00000000000fd6ac:head : missing
2020-10-06 15:54:45.483859 osd.1461 (osd.1461) 234 : cluster [DBG] 4.1c3 scrub starts
2020-10-06 15:54:46.399689 osd.1461 (osd.1461) 235 : cluster [ERR] 4.1c3 shard 95 4:c3865046:::rbd_data.f36ef31042dcf7.0000000000004647:head : missing
2020-10-06 15:54:46.399692 osd.1461 (osd.1461) 236 : cluster [ERR] 4.1c3 shard 1461 4:c3865046:::rbd_data.f36ef31042dcf7.0000000000004647:head : missing
2020-10-06 15:54:46.399694 osd.1461 (osd.1461) 237 : cluster [ERR] 4.1c3 shard 95 4:c386504e:::rbd_data.8ec8947aaecd46.000000000001c1a4:head : missing
</pre>

And I forgot one key detail: the script [1] that triggered the crash sets noscrub and increases osd_max_scrubs within a narrow window; maybe this triggers a race.

[1] https://github.com/cernceph/ceph-scripts/blob/master/tools/scrubbing/autorepair.sh

#2

Updated by Neha Ojha over 3 years ago

  • Assignee set to David Zafman
  • Priority changed from Normal to Urgent
  • Backport set to nautilus,octopus

Most likely a regression caused by https://github.com/ceph/ceph/pull/36292.

#3

Updated by Dan van der Ster over 3 years ago

It happened again moments after setting nodeep-scrub:


<pre>
2020-10-22 17:14:18.690485 mon.p05517715d82373 (mon.3) 7628 : audit [INF] from='client.521838223 128.142.161.204:0/190789214' entity='client.admin' cmd=[{"prefix": "osd set", "key": "nodeep-scrub"}]: dispatch

2020-10-22 17:14:19.934556 osd.715 (osd.715) 1288 : cluster [ERR] 75.1ad9 shard 715 75:9b5fe47a:::rbd_data.f77f2a36472214.00000000000604e2:head : missing
2020-10-22 17:14:19.934559 osd.715 (osd.715) 1289 : cluster [ERR] 75.1ad9 shard 715 75:9b5fe499:::rbd_data.4eaef654899d752.000000000021f815:head : missing
2020-10-22 17:14:20.240093 osd.2 (osd.2) 13 : cluster [ERR] 4.1da2 shard 2 4:45b877d8:::rbd_data.6d117395685a0c6.00000000000313ce:head : missing
2020-10-22 17:14:20.240096 osd.2 (osd.2) 14 : cluster [ERR] 4.1da2 shard 727 4:45b877d8:::rbd_data.6d117395685a0c6.00000000000313ce:head : missing
2020-10-22 17:14:20.421938 osd.447 (osd.447) 17 : cluster [ERR] 4.e81 shard 349 4:8170df52:::rbd_data.36c746215383cb.0000000000151bf3:head : missing
2020-10-22 17:14:20.421941 osd.447 (osd.447) 18 : cluster [ERR] 4.e81 shard 447 4:8170df52:::rbd_data.36c746215383cb.0000000000151bf3:head : missing
2020-10-22 17:14:20.421942 osd.447 (osd.447) 19 : cluster [ERR] 4.e81 shard 349 4:8170df52:::rbd_data.afa6497c52e435.000000000006743d:head : missing
2020-10-22 17:14:20.421943 osd.447 (osd.447) 20 : cluster [ERR] 4.e81 shard 447 4:8170df52:::rbd_data.afa6497c52e435.000000000006743d:head : missing
2020-10-22 17:14:20.421944 osd.447 (osd.447) 21 : cluster [ERR] 4.e81 shard 349 4:8170df69:::rbd_data.99e74530298e95.000000000013dfb2:head : missing
2020-10-22 17:14:20.421947 osd.447 (osd.447) 22 : cluster [ERR] 4.e81 shard 447 4:8170df69:::rbd_data.99e74530298e95.000000000013dfb2:head : missing
</pre>
#4

Updated by David Zafman over 3 years ago

  • Related to Backport #46706: nautilus: Cancellation of on-going scrubs added
#5

Updated by David Zafman over 3 years ago

The function PG::abort_scrub() probably races with scrub messages that are already in flight. It might help if we called scrub_unreserve_replicas() before scrub_clear_state(), but I'm not sure that is right.
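
To make the idea concrete, here is a rough standalone sketch with toy types (not Ceph's actual PG class, and not the eventual fix): unreserve the replicas before clearing the local scrub state, and treat a stale scrub map from an already-unreserved replica as something to drop rather than assert on.

<pre>
// Standalone sketch of the ordering suggested above, using toy types only;
// this is not Ceph's PG class and not the actual patch.
#include <set>

struct ToyPG {
  std::set<int> waiting_on_whom;   // replicas we still expect scrub maps from

  void scrub_unreserve_replicas() {
    // In Ceph this releases the scrub reservations held on the replicas; in the
    // toy model we simply stop expecting their scrub maps.
    waiting_on_whom.clear();
  }

  void scrub_clear_state() {
    // Drop the primary's local scrub bookkeeping.
    waiting_on_whom.clear();
  }

  // Suggested ordering: tell the replicas to stand down first, then clear state.
  void abort_scrub() {
    scrub_unreserve_replicas();
    scrub_clear_state();
  }

  // A tolerant handler would also need to ignore, rather than assert on, a scrub
  // map from a replica that was already unreserved by an abort.
  void do_replica_scrub_map(int from) {
    if (!waiting_on_whom.count(from))
      return;  // stale reply after an abort: drop it
    waiting_on_whom.erase(from);
  }
};

int main() {
  ToyPG pg;
  pg.waiting_on_whom = {247, 793};
  pg.abort_scrub();               // noscrub set while the scrub is in progress
  pg.do_replica_scrub_map(247);   // late reply is ignored instead of crashing
}
</pre>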

#6

Updated by David Zafman over 3 years ago

  • Related to Bug #46275: Cancellation of on-going scrubs added
#7

Updated by David Zafman over 3 years ago

  • Status changed from New to In Progress
  • Pull request ID set to 38271
#8

Updated by David Zafman over 3 years ago

  • Subject changed from setting noscrub crashed osd process to octopus: setting noscrub crashed osd process
  • Target version set to v15.2.7
  • Backport changed from nautilus,octopus to nautilus
#9

Updated by David Zafman over 3 years ago

  • Pull request ID changed from 38271 to 38359
#10

Updated by David Zafman over 3 years ago

  • Status changed from In Progress to Pending Backport
#11

Updated by David Zafman over 3 years ago

  • Copied to Backport #48444: nautilus: octopus: setting noscrub crashed osd process added
#12

Updated by David Zafman over 3 years ago

  • Status changed from Pending Backport to Resolved
#13

Updated by Nathan Cutler over 3 years ago

  • Status changed from Resolved to Pending Backport
  • Backport changed from nautilus to nautilus, octopus
#14

Updated by Nathan Cutler over 3 years ago

  • Copied to Backport #48579: octopus: octopus: setting noscrub crashed osd process added
#15

Updated by David Zafman over 3 years ago

  • Status changed from Pending Backport to Resolved
  • Backport changed from nautilus, octopus to nautilus
#16

Updated by David Zafman over 3 years ago

  • Backport changed from nautilus to octopus, nautilus