Bug #57699

slow osd boot with valgrind (reached maximum tries (50) after waiting for 300 seconds)

Added by Nitzan Mordechai 4 months ago. Updated 26 days ago.

Status:
Fix Under Review
Priority:
Normal
Category:
Peering
Target version:
% Done:

0%

Source:
Tags:
Backport:
quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/yuriw-2022-09-23_20:38:59-rados-wip-yuri6-testing-2022-09-23-1008-quincy-distro-default-smithi/7042504

teuthology.exceptions.MaxWhileTries: reached maximum tries (90) after waiting for 540 seconds

osd.0 didn't boot; it looks like it is stuck in await_reserved_maps.
We never finished consuming the map:

2022-09-23T21:40:52.586+0000 208a1700  7 osd.0 8 consume_map version 8
2022-09-23T21:40:52.774+0000 1588d700 10 osd.0 8 tick_without_osd_lock
2022-09-23T21:40:54.631+0000 348c9700 17 osd.0 scrub-queue::update_load_average heartbeat: daily_loadavg 0.951342
2022-09-23T21:40:54.631+0000 348c9700  5 osd.0 8 heartbeat osd_stat(store_statfs(0x18ff89c000/0x0/0x1900000000, data 0x3382/0x12000, compress 0x0/0x0/0x0, omap 0x0, meta 0x750000), peers [] op hist [])
2022-09-23T21:43:40.464+0000 228a5700  1 -- [v2:172.21.15.203:6802/104834,v1:172.21.15.203:6803/104834] --> [v2:172.21.15.203:3302/0,v1:172.21.15.203:6791/0] -- auth(proto 2 165 bytes epoch 0) v1 -- 0x9e83450 con 0x1bf4a5e0
2022-09-23T21:43:41.465+0000 228a5700  1 -- [v2:172.21.15.203:6802/104834,v1:172.21.15.203:6803/104834] --> [v2:172.21.15.203:3302/0,v1:172.21.15.203:6791/0] -- auth(proto 2 165 bytes epoch 0) v1 -- 0x171ba970 con 0x1bf4a5e0
2022-09-23T21:43:42.466+0000 228a5700  1 -- [v2:172.21.15.203:6802/104834,v1:172.21.15.203:6803/104834] --> [v2:172.21.15.203:3302/0,v1:172.21.15.203:6791/0] -- auth(proto 2 165 bytes epoch 0) v1 -- 0x1ada10e0 con 0x1bf4a5e0
2022-09-23T21:43:43.467+0000 228a5700  1 -- [v2:172.21.15.203:6802/104834,v1:172.21.15.203:6803/104834] --> [v2:172.21.15.203:3302/0,v1:172.21.15.203:6791/0] -- auth(proto 2 165 bytes epoch 0) v1 -- 0x1b167b60 con 0x1bf4a5e0
2022-09-23T21:43:44.468+0000 228a5700  1 -- [v2:172.21.15.203:6802/104834,v1:172.21.15.203:6803/104834] --> [v2:172.21.15.203:3302/0,v1:172.21.15.203:6791/0] -- auth(proto 2 165 bytes epoch 0) v1 -- 0x173b2ce0 con 0x1bf4a5e0

History

#1 Updated by Nitzan Mordechai 4 months ago

  • Category set to Peering
  • Target version set to 17.2.4
  • ceph-qa-suite rados added

#2 Updated by Radoslaw Zarzynski 4 months ago

  • Status changed from New to In Progress
  • Backport set to quincy

Marking WIP per our morning talk.

#3 Updated by Nitzan Mordechai 4 months ago

  • Pull request ID set to 48295

I was not able to reproduce it with more debug messages. I created a PR that adds the debug messages and will wait for it to recur (please re-open if you hit it).

#4 Updated by Radoslaw Zarzynski 4 months ago

  • Status changed from In Progress to Fix Under Review

#5 Updated by Nitzan Mordechai 3 months ago

The issue is a deadlock under a specific condition: an mClockScheduler config update arrives while we are already in OSD::consume_map.
OSD::consume_map holds osd_lock, and when it calls identify_splits_and_merges it tries to take shard_lock, which is already held by update_scheduler_config. But update_scheduler_config also needs osd_lock to apply the OSD configuration change (when it calls handle_conf_change), so we end up in a deadlock and the OSD is unable to boot.

After talking with @Sridhar Seshasayee, we will remove the shard_lock acquisition from update_scheduler_config, since it is not needed to update the mClock configuration.

#6 Updated by Sridhar Seshasayee 3 months ago

@Nitzan Mordechai this is probably similar to
https://tracker.ceph.com/issues/52948 and https://tracker.ceph.com/issues/56574, which exhibit the same symptoms.

I removed the cond_wait method of updating the mon store with the OSD's max capacity as part of https://tracker.ceph.com/issues/57040 and used Context-based completion instead.
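The shape of that change (blocking cond_wait replaced by a registered completion callback) can be sketched roughly as follows; Ceph's Context is a C++ callback interface, so the Python classes and names below are purely illustrative:

```python
import threading

# Before (sketch): the caller blocks on a condition variable until the
# mon store acknowledges the update. If it blocks while holding locks
# that the ack path needs, it can wedge the OSD.
class BlockingStoreUpdate:
    def __init__(self):
        self._cv = threading.Condition()
        self._done = False

    def wait_for_ack(self):
        with self._cv:
            while not self._done:
                self._cv.wait()

    def ack(self):
        with self._cv:
            self._done = True
            self._cv.notify_all()

# After (sketch): register a completion callback (the "Context" idiom).
# Nothing blocks; the ack simply fires the callback when it arrives.
class ContextStoreUpdate:
    def __init__(self):
        self._on_complete = None

    def send(self, on_complete):
        self._on_complete = on_complete

    def ack(self, result):
        self._on_complete(result)

results = []
ctx = ContextStoreUpdate()
ctx.send(lambda r: results.append(r))  # returns immediately, no waiting
ctx.ack("max capacity stored")         # completion runs the callback
```

The callback version never parks a thread mid-operation, which removes the class of hangs where the waiter holds a lock that the completion path needs.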

The common factor in these failures is either valgrind runs or thrash-related tests.

#7 Updated by Nitzan Mordechai 3 months ago

Sridhar, yes, those trackers look the same. Valgrind makes the OSD start slower; maybe that's why we are seeing this more often with it.

#8 Updated by Sergii Kuzko 27 days ago

Hi,
can you update the bug status,
or move it to the current version group, 17.2.5?

#9 Updated by Sergii Kuzko 27 days ago

Sergii Kuzko wrote:

Hi,
can you update the bug status,
or move it to the current version group, 17.2.6?

#10 Updated by Nitzan Mordechai 27 days ago

  • Target version changed from 17.2.4 to v17.2.6

#11 Updated by Radoslaw Zarzynski 26 days ago

Bumping up (the fix awaits QA).
