Bug #57699

slow osd boot with valgrind (reached maximum tries (50) after waiting for 300 seconds)

Added by Nitzan Mordechai 4 months ago. Updated 26 days ago.

Status:
Fix Under Review
Priority:
Normal
Category:
Peering
Target version:
% Done:

0%

Source:
Tags:
Backport:
quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/yuriw-2022-09-23_20:38:59-rados-wip-yuri6-testing-2022-09-23-1008-quincy-distro-default-smithi/7042504

teuthology.exceptions.MaxWhileTries: reached maximum tries (90) after waiting for 540 seconds

osd.0 didn't boot; it looks like it is stuck in await_reserved_maps.
We never finished consuming the map:

2022-09-23T21:40:52.586+0000 208a1700  7 osd.0 8 consume_map version 8
2022-09-23T21:40:52.774+0000 1588d700 10 osd.0 8 tick_without_osd_lock
2022-09-23T21:40:54.631+0000 348c9700 17 osd.0 scrub-queue::update_load_average heartbeat: daily_loadavg 0.951342
2022-09-23T21:40:54.631+0000 348c9700  5 osd.0 8 heartbeat osd_stat(store_statfs(0x18ff89c000/0x0/0x1900000000, data 0x3382/0x12000, compress 0x0/0x0/0x0, omap 0x0, meta 0x750000), peers [] op hist [])
2022-09-23T21:43:40.464+0000 228a5700  1 -- [v2:172.21.15.203:6802/104834,v1:172.21.15.203:6803/104834] --> [v2:172.21.15.203:3302/0,v1:172.21.15.203:6791/0] -- auth(proto 2 165 bytes epoch 0) v1 -- 0x9e83450 con 0x1bf4a5e0
2022-09-23T21:43:41.465+0000 228a5700  1 -- [v2:172.21.15.203:6802/104834,v1:172.21.15.203:6803/104834] --> [v2:172.21.15.203:3302/0,v1:172.21.15.203:6791/0] -- auth(proto 2 165 bytes epoch 0) v1 -- 0x171ba970 con 0x1bf4a5e0
2022-09-23T21:43:42.466+0000 228a5700  1 -- [v2:172.21.15.203:6802/104834,v1:172.21.15.203:6803/104834] --> [v2:172.21.15.203:3302/0,v1:172.21.15.203:6791/0] -- auth(proto 2 165 bytes epoch 0) v1 -- 0x1ada10e0 con 0x1bf4a5e0
2022-09-23T21:43:43.467+0000 228a5700  1 -- [v2:172.21.15.203:6802/104834,v1:172.21.15.203:6803/104834] --> [v2:172.21.15.203:3302/0,v1:172.21.15.203:6791/0] -- auth(proto 2 165 bytes epoch 0) v1 -- 0x1b167b60 con 0x1bf4a5e0
2022-09-23T21:43:44.468+0000 228a5700  1 -- [v2:172.21.15.203:6802/104834,v1:172.21.15.203:6803/104834] --> [v2:172.21.15.203:3302/0,v1:172.21.15.203:6791/0] -- auth(proto 2 165 bytes epoch 0) v1 -- 0x173b2ce0 con 0x1bf4a5e0

History

#1 Updated by Nitzan Mordechai 4 months ago

  • Category set to Peering
  • Target version set to 17.2.4
  • ceph-qa-suite rados added

#2 Updated by Radoslaw Zarzynski 4 months ago

  • Status changed from New to In Progress
  • Backport set to quincy

Marking WIP per our morning talk.

#3 Updated by Nitzan Mordechai 4 months ago

  • Pull request ID set to 48295

I was not able to reproduce it with more debug messages. I created a PR that adds the debug messages and will wait for it to recur (please re-open if you hit it).

#4 Updated by Radoslaw Zarzynski 4 months ago

  • Status changed from In Progress to Fix Under Review

#5 Updated by Nitzan Mordechai 3 months ago

The issue is a deadlock under a specific condition: an mClockScheduler config update arrives while we are already in OSD::consume_map.
OSD::consume_map holds osd_lock, and when it calls identify_splits_and_merges it tries to take shard_lock, which is already held by update_scheduler_config. But update_scheduler_config also needs osd_lock to apply the OSD configuration change (when it calls handle_conf_change), so we end up in a deadlock and the OSD is unable to boot.

After talking with @Sridhar Seshasayee, we will remove the shard_lock acquisition from update_scheduler_config, since it is not needed to update the mClock configuration.

#6 Updated by Sridhar Seshasayee 3 months ago

@Nitzan Mordechai this is probably similar to
https://tracker.ceph.com/issues/52948 and https://tracker.ceph.com/issues/56574, which exhibit the same symptoms.

I removed the cond_wait method of updating the mon store with the OSD's max capacity as part of https://tracker.ceph.com/issues/57040 and used Context-based completion instead.
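The shape of that change (blocking cond_wait replaced by a registered completion callback) can be sketched roughly as follows; Ceph's Context is a C++ callback interface, so the Python classes and names below are purely illustrative:

```python
import threading

# Before (sketch): the caller blocks on a condition variable until the
# mon store acknowledges the update. If it blocks while holding locks
# that the ack path needs, it can wedge the OSD.
class BlockingStoreUpdate:
    def __init__(self):
        self._cv = threading.Condition()
        self._done = False

    def wait_for_ack(self):
        with self._cv:
            while not self._done:
                self._cv.wait()

    def ack(self):
        with self._cv:
            self._done = True
            self._cv.notify_all()

# After (sketch): register a completion callback (the "Context" idiom).
# Nothing blocks; the ack simply fires the callback when it arrives.
class ContextStoreUpdate:
    def __init__(self):
        self._on_complete = None

    def send(self, on_complete):
        self._on_complete = on_complete

    def ack(self, result):
        self._on_complete(result)

results = []
ctx = ContextStoreUpdate()
ctx.send(lambda r: results.append(r))  # returns immediately, no waiting
ctx.ack("max capacity stored")         # completion runs the callback
```

The callback version never parks a thread mid-operation, which removes the class of hangs where the waiter holds a lock that the completion path needs.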

The common factor in these failures is either valgrind runs or thrash-related tests.

#7 Updated by Nitzan Mordechai 3 months ago

Sridhar, yes, those trackers look the same. Valgrind makes the OSD start slower; maybe that's why we are seeing this more often with it.

#8 Updated by Sergii Kuzko 27 days ago

Hi,
can you update the bug status,
or move it to the current version group, 17.2.5?

#9 Updated by Sergii Kuzko 27 days ago

Sergii Kuzko wrote:

Hi,
can you update the bug status,
or move it to the current version group, 17.2.6?

#10 Updated by Nitzan Mordechai 27 days ago

  • Target version changed from 17.2.4 to v17.2.6

#11 Updated by Radoslaw Zarzynski 26 days ago

Bumping up (the fix awaits QA).
