Bug #57699
slow osd boot with valgrind (reached maximum tries (50) after waiting for 300 seconds)
Status: Closed
% Done: 100%
Description
/a/yuriw-2022-09-23_20:38:59-rados-wip-yuri6-testing-2022-09-23-1008-quincy-distro-default-smithi/7042504
teuthology.exceptions.MaxWhileTries: reached maximum tries (90) after waiting for 540 seconds
osd.0 didn't boot; it looks like it got stuck on await_reserved_maps.
We never completed consume_map:
2022-09-23T21:40:52.586+0000 208a1700 7 osd.0 8 consume_map version 8
2022-09-23T21:40:52.774+0000 1588d700 10 osd.0 8 tick_without_osd_lock
2022-09-23T21:40:54.631+0000 348c9700 17 osd.0 scrub-queue::update_load_average heartbeat: daily_loadavg 0.951342
2022-09-23T21:40:54.631+0000 348c9700 5 osd.0 8 heartbeat osd_stat(store_statfs(0x18ff89c000/0x0/0x1900000000, data 0x3382/0x12000, compress 0x0/0x0/0x0, omap 0x0, meta 0x750000), peers [] op hist [])
2022-09-23T21:43:40.464+0000 228a5700 1 -- [v2:172.21.15.203:6802/104834,v1:172.21.15.203:6803/104834] --> [v2:172.21.15.203:3302/0,v1:172.21.15.203:6791/0] -- auth(proto 2 165 bytes epoch 0) v1 -- 0x9e83450 con 0x1bf4a5e0
2022-09-23T21:43:41.465+0000 228a5700 1 -- [v2:172.21.15.203:6802/104834,v1:172.21.15.203:6803/104834] --> [v2:172.21.15.203:3302/0,v1:172.21.15.203:6791/0] -- auth(proto 2 165 bytes epoch 0) v1 -- 0x171ba970 con 0x1bf4a5e0
2022-09-23T21:43:42.466+0000 228a5700 1 -- [v2:172.21.15.203:6802/104834,v1:172.21.15.203:6803/104834] --> [v2:172.21.15.203:3302/0,v1:172.21.15.203:6791/0] -- auth(proto 2 165 bytes epoch 0) v1 -- 0x1ada10e0 con 0x1bf4a5e0
2022-09-23T21:43:43.467+0000 228a5700 1 -- [v2:172.21.15.203:6802/104834,v1:172.21.15.203:6803/104834] --> [v2:172.21.15.203:3302/0,v1:172.21.15.203:6791/0] -- auth(proto 2 165 bytes epoch 0) v1 -- 0x1b167b60 con 0x1bf4a5e0
2022-09-23T21:43:44.468+0000 228a5700 1 -- [v2:172.21.15.203:6802/104834,v1:172.21.15.203:6803/104834] --> [v2:172.21.15.203:3302/0,v1:172.21.15.203:6791/0] -- auth(proto 2 165 bytes epoch 0) v1 -- 0x173b2ce0 con 0x1bf4a5e0
Updated by Nitzan Mordechai over 1 year ago
- Category set to Peering
- Target version set to v17.2.4
- ceph-qa-suite rados added
Updated by Radoslaw Zarzynski over 1 year ago
- Status changed from New to In Progress
- Backport set to quincy
Marking WIP per our morning talk.
Updated by Nitzan Mordechai over 1 year ago
- Pull request ID set to 48295
I was not able to reproduce it with more debug messages. I created a PR that adds the debug messages and will wait for it to reoccur (please re-open if you hit it).
Updated by Radoslaw Zarzynski over 1 year ago
- Status changed from In Progress to Fix Under Review
Updated by Nitzan Mordechai over 1 year ago
The issue is a deadlock under a specific condition: a config change tries to update the mClockScheduler while we are already in OSD::consume_map.
OSD::consume_map holds osd_lock, and when it calls identify_splits_and_merges it tries to take shard_lock, which is already held by update_scheduler_config. But update_scheduler_config also needs osd_lock to apply the OSD configuration update (when it calls handle_conf_change), so we end up deadlocked and the OSD cannot boot.
After talking with @Sridhar Seshasayee we will remove the shard_lock locking from update_scheduler_config, since it is not needed to update the mClock configuration.
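For illustration only, here is a minimal sketch of the lock inversion and the proposed fix, written with plain std::mutex rather than the actual OSD/OSDShard code; the function names simply mirror the call paths described above.

#include <mutex>

std::mutex osd_lock;    // stands in for OSD::osd_lock
std::mutex shard_lock;  // stands in for a per-shard OSDShard lock

// Boot path: OSD::consume_map() -> identify_splits_and_merges()
void consume_map_path() {
  std::scoped_lock l{osd_lock};    // consume_map already holds osd_lock
  std::scoped_lock s{shard_lock};  // identify_splits_and_merges then needs shard_lock
}

// Config path before the fix: update_scheduler_config() -> handle_conf_change()
void update_scheduler_config_before() {
  std::scoped_lock s{shard_lock};  // shard_lock taken first...
  std::scoped_lock l{osd_lock};    // ...then blocks on osd_lock: ABBA deadlock with the boot path
}

// Config path after the fix: shard_lock is no longer taken here, so the
// inverted acquisition order (and the deadlock) goes away.
void update_scheduler_config_after() {
  std::scoped_lock l{osd_lock};    // only osd_lock is needed for handle_conf_change
  // ... apply the mClock scheduler settings ...
}

int main() {
  // Run sequentially here; if consume_map_path() and
  // update_scheduler_config_before() ran concurrently, each could grab its
  // first lock and then wait forever for the other's.
  consume_map_path();
  update_scheduler_config_before();
  update_scheduler_config_after();
}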
Updated by Sridhar Seshasayee over 1 year ago
@Nitzan Mordechai this is probably similar to,
https://tracker.ceph.com/issues/52948 and https://tracker.ceph.com/issues/56574 that exhibit the same symptoms.
I removed the cond_wait method of updating the mon store with the OSD's max capacity as part of https://tracker.ceph.com/issues/57040 and used Context-based completion instead.
The common thread in these failures is either valgrind leaks or thrash-related tests.
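For context, a rough sketch of the idea behind that change (not the actual Ceph code; the Context stand-in and the store_max_capacity_async() helper below are assumptions for illustration): instead of blocking the calling thread on a condition variable until the mon store update completes, the follow-up work is registered as a completion callback.

#include <functional>
#include <iostream>
#include <utility>

// Stand-in for Ceph's Context completion interface.
struct Context {
  virtual ~Context() = default;
  virtual void finish(int r) = 0;
  void complete(int r) { finish(r); delete this; }
};

struct LambdaContext : Context {
  std::function<void(int)> fn;
  explicit LambdaContext(std::function<void(int)> f) : fn(std::move(f)) {}
  void finish(int r) override { fn(r); }
};

// Hypothetical async "record the OSD's max capacity in the mon store" call.
// Real code would send a mon command and complete the Context from the reply
// handler; here it completes inline just to show the flow.
void store_max_capacity_async(double iops, Context *on_finish) {
  std::cout << "osd_mclock_max_capacity_iops = " << iops << "\n";
  on_finish->complete(0);
}

int main() {
  // The caller no longer cond_wait()s while holding OSD locks; it simply
  // schedules what should happen once the mon store update finishes.
  store_max_capacity_async(315.0, new LambdaContext([](int r) {
    std::cout << "mon store update finished, r=" << r << "\n";
  }));
}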
Updated by Nitzan Mordechai over 1 year ago
Sridhar, yes, those trackers look the same. Valgrind makes the OSD start more slowly; maybe that's why we see this more often with it.
Updated by Sergii Kuzko over 1 year ago
Hi,
Can you update the bug status,
or retarget this to the current version, 17.2.5?
Updated by Sergii Kuzko over 1 year ago
Sergii Kuzko wrote:
Hi,
Can you update the bug status,
or retarget this to the current version, 17.2.6?
Updated by Nitzan Mordechai over 1 year ago
- Target version changed from v17.2.4 to v17.2.6
Updated by Radoslaw Zarzynski 11 months ago
- Status changed from Fix Under Review to Pending Backport
- Backport changed from quincy to quincy,reef
Updated by Backport Bot 11 months ago
- Copied to Backport #61445: reef: slow osd boot with valgrind (reached maximum tries (50) after waiting for 300 seconds) added
Updated by Backport Bot 11 months ago
- Copied to Backport #61446: quincy: slow osd boot with valgrind (reached maximum tries (50) after waiting for 300 seconds) added
Updated by Konstantin Shalygin 3 months ago
- Status changed from Pending Backport to Resolved
- % Done changed from 0 to 100
- Source set to Community (dev)