Bug #50466
_delete_some additional unexpected onode list
Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%
Source:
Community (user)
Tags:
Backport:
nautilus, octopus, pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
After updating to 14.2.19 and then moving some PGs around, we see a few warnings related to the new efficient PG removal code, e.g.
/var/log/ceph/ceph-osd.792.log:2021-04-14 20:34:34.353 7fb2439d4700 0 osd.792 pg_epoch: 40906 pg[10.14b2s0( v 40734'290069 (33782'287000,40734'290069] lb MIN (bitwise) local-lis/les=33990/33991 n=36272 ec=4951/4937 lis/c 33990/33716 les/c/f 33991/33747/0 40813/40813/37166) [933,626,260,804,503,491]p933(0) r=-1 lpr=40813 DELETING pi=[33716,40813)/4 crt=40734'290069 unknown NOTIFY mbc={}] _delete_some additional unexpected onode list (new onodes has appeared since PG removal started[0#10:4d280000::::head#]
/var/log/ceph/ceph-osd.851.log:2021-04-14 18:40:13.312 7fd87bded700 0 osd.851 pg_epoch: 40671 pg[10.133fs5( v 40662'288967 (33782'285900,40662'288967] lb MIN (bitwise) local-lis/les=33786/33787 n=13 ec=4947/4937 lis/c 40498/33714 les/c/f 40499/33747/0 40670/40670/33432) [859,199,913,329,439,79]p859(0) r=-1 lpr=40670 DELETING pi=[33714,40670)/4 crt=40662'288967 unknown NOTIFY mbc={}] _delete_some additional unexpected onode list (new onodes has appeared since PG removal started[5#10:fcc80000::::head#]
/var/log/ceph/ceph-osd.851.log:2021-04-14 20:58:14.393 7fd87adeb700 0 osd.851 pg_epoch: 40906 pg[10.2e8s3( v 40610'288991 (33782'285900,40610'288991] lb MIN (bitwise) local-lis/les=33786/33787 n=161220 ec=4937/4937 lis/c 39826/33716 les/c/f 39827/33747/0 40617/40617/39225) [717,933,727,792,607,129]p717(0) r=-1 lpr=40617 DELETING pi=[33716,40617)/3 crt=40610'288991 unknown NOTIFY mbc={}] _delete_some additional unexpected onode list (new onodes has appeared since PG removal started[3#10:17400000::::head#]
/var/log/ceph/ceph-osd.883.log:2021-04-14 18:55:16.822 7f78c485d700 0 osd.883 pg_epoch: 40857 pg[7.d4( v 40804'9911289 (35835'9908201,40804'9911289] lb MIN (bitwise) local-lis/les=40782/40783 n=195 ec=2063/1989 lis/c 40782/40782 les/c/f 40783/40844/0 40781/40845/40845) [877,870,894] r=-1 lpr=40845 DELETING pi=[40782,40845)/1 crt=40804'9911289 lcod 40804'9911288 unknown NOTIFY mbc={}] _delete_some additional unexpected onode list (new onodes has appeared since PG removal started[#7:2b000000::::head#]
This can also lead to flapping if the OSD is particularly slow, e.g.
2021-04-21 15:23:37.003 7f51c273c700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f51a3e81700' had timed out after 15
2021-04-21 15:23:41.595 7f51a3e81700 0 bluestore(/var/lib/ceph/osd/ceph-941) log_latency_fn slow operation observed for _collection_list, latency = 67.7234s, lat = 67s cid =10.14aes4_head start GHMAX end GHMAX max 30
2021-04-21 15:23:41.595 7f51a3e81700 0 osd.941 pg_epoch: 43004 pg[10.14aes4( v 42754'296580 (40999'293500,42754'296580] lb MIN (bitwise) local-lis/les=41331/41332 n=159058 ec=4951/4937 lis/c 41331/41331 les/c/f 41332/42758/0 41330/42759/33461) [171,903,106,27,395,773]p171(0) r=-1 lpr=42759 DELETING pi=[41331,42759)/1 crt=42754'296580 unknown NOTIFY mbc={}] _delete_some additional unexpected onode list (new onodes has appeared since PG removal started[4#10:75280000::::head#]
2021-04-21 15:23:50.061 7f51a3e81700 0 bluestore(/var/lib/ceph/osd/ceph-941) log_latency slow operation observed for submit_transact, latency = 8.46584s
2021-04-21 15:23:50.062 7f51a3e81700 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f51a3e81700' had timed out after 15
2021-04-21 15:23:50.115 7f51b6ca1700 0 bluestore(/var/lib/ceph/osd/ceph-941) log_latency_fn slow operation observed for _txc_committed_kv, latency = 8.51916s, txc = 0x5573928a7340
2021-04-21 15:23:50.473 7f51b2498700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.941 down, but it is still running
A workaround to avoid the flapping is to increase osd_heartbeat_grace so that a slow _collection_list does not cause the OSD to miss heartbeats long enough to be marked down.
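As a sketch of that workaround (the value 45 below is just an illustrative choice; the stock default is 20 seconds, so pick something comfortably above your observed listing latency):

```shell
# Raise the heartbeat grace period cluster-wide for all OSDs.
# This only papers over the slow _collection_list during PG removal;
# it does not fix the underlying latency.
ceph config set osd osd_heartbeat_grace 45

# Confirm the new value took effect.
ceph config get osd osd_heartbeat_grace
```

Once the PG removals have finished, the option can be reverted with `ceph config rm osd osd_heartbeat_grace`.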
bluefs_buffered_io is enabled on the above cluster. I suspect it does not help much in this case because the collection listing is no longer fully cached by the time a PG removal ends.
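To check whether a given OSD is actually running with bluefs_buffered_io enabled (osd.941 here is just the daemon from the log excerpt above; substitute your own OSD id):

```shell
# Configured value as stored in the monitor's config database.
ceph config get osd bluefs_buffered_io

# Runtime value of the live daemon, via its admin socket on the OSD host.
ceph daemon osd.941 config get bluefs_buffered_io
```

If the two disagree, the daemon was likely started before the config change and needs a restart to pick it up.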