Bug #57699


slow osd boot with valgrind (reached maximum tries (50) after waiting for 300 seconds)

Added by Nitzan Mordechai over 1 year ago. Updated 3 months ago.

Status:
Resolved
Priority:
Normal
Category:
Peering
Target version:
-
% Done:
100%

Source:
Community (dev)
Tags:
backport_processed
Backport:
quincy,reef
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/yuriw-2022-09-23_20:38:59-rados-wip-yuri6-testing-2022-09-23-1008-quincy-distro-default-smithi/7042504

teuthology.exceptions.MaxWhileTries: reached maximum tries (90) after waiting for 540 seconds

osd.0 didn't boot; it looks like it is stuck on await_reserved_maps and we never completed consume_map:

2022-09-23T21:40:52.586+0000 208a1700  7 osd.0 8 consume_map version 8
2022-09-23T21:40:52.774+0000 1588d700 10 osd.0 8 tick_without_osd_lock
2022-09-23T21:40:54.631+0000 348c9700 17 osd.0 scrub-queue::update_load_average heartbeat: daily_loadavg 0.951342
2022-09-23T21:40:54.631+0000 348c9700  5 osd.0 8 heartbeat osd_stat(store_statfs(0x18ff89c000/0x0/0x1900000000, data 0x3382/0x12000, compress 0x0/0x0/0x0, omap 0x0, meta 0x750000), peers [] op hist [])
2022-09-23T21:43:40.464+0000 228a5700  1 -- [v2:172.21.15.203:6802/104834,v1:172.21.15.203:6803/104834] --> [v2:172.21.15.203:3302/0,v1:172.21.15.203:6791/0] -- auth(proto 2 165 bytes epoch 0) v1 -- 0x9e83450 con 0x1bf4a5e0
2022-09-23T21:43:41.465+0000 228a5700  1 -- [v2:172.21.15.203:6802/104834,v1:172.21.15.203:6803/104834] --> [v2:172.21.15.203:3302/0,v1:172.21.15.203:6791/0] -- auth(proto 2 165 bytes epoch 0) v1 -- 0x171ba970 con 0x1bf4a5e0
2022-09-23T21:43:42.466+0000 228a5700  1 -- [v2:172.21.15.203:6802/104834,v1:172.21.15.203:6803/104834] --> [v2:172.21.15.203:3302/0,v1:172.21.15.203:6791/0] -- auth(proto 2 165 bytes epoch 0) v1 -- 0x1ada10e0 con 0x1bf4a5e0
2022-09-23T21:43:43.467+0000 228a5700  1 -- [v2:172.21.15.203:6802/104834,v1:172.21.15.203:6803/104834] --> [v2:172.21.15.203:3302/0,v1:172.21.15.203:6791/0] -- auth(proto 2 165 bytes epoch 0) v1 -- 0x1b167b60 con 0x1bf4a5e0
2022-09-23T21:43:44.468+0000 228a5700  1 -- [v2:172.21.15.203:6802/104834,v1:172.21.15.203:6803/104834] --> [v2:172.21.15.203:3302/0,v1:172.21.15.203:6791/0] -- auth(proto 2 165 bytes epoch 0) v1 -- 0x173b2ce0 con 0x1bf4a5e0


Related issues 2 (0 open, 2 closed)

Copied to RADOS - Backport #61445: reef: slow osd boot with valgrind (reached maximum tries (50) after waiting for 300 seconds) - Resolved - Nitzan Mordechai
Copied to RADOS - Backport #61446: quincy: slow osd boot with valgrind (reached maximum tries (50) after waiting for 300 seconds) - Resolved - Nitzan Mordechai
Actions #1

Updated by Nitzan Mordechai over 1 year ago

  • Category set to Peering
  • Target version set to v17.2.4
  • ceph-qa-suite rados added
Actions #2

Updated by Radoslaw Zarzynski over 1 year ago

  • Status changed from New to In Progress
  • Backport set to quincy

Marking WIP per our morning talk.

Actions #3

Updated by Nitzan Mordechai over 1 year ago

  • Pull request ID set to 48295

I was not able to reproduce it even with more debug messages. I created a PR that adds the debug messages and will wait for the issue to reoccur (please re-open if you hit it).

Actions #4

Updated by Radoslaw Zarzynski over 1 year ago

  • Status changed from In Progress to Fix Under Review
Actions #5

Updated by Nitzan Mordechai over 1 year ago

The issue is a deadlock under a specific condition: we try to update the mClockScheduler configuration while we are already in OSD::consume_map.
OSD::consume_map takes osd_lock, and when it calls identify_splits_and_merges it tries to take shard_lock, which is already held by update_scheduler_config. But update_scheduler_config also needs osd_lock to apply the OSD configuration change (when it calls handle_conf_change), so we end up deadlocked and the OSD cannot boot.

After talking with @Sridhar Seshasayee, we will remove the shard_lock acquisition from update_scheduler_config, since we don't need it to update the mClock configuration.
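
For illustration, a minimal sketch of the lock ordering involved (hand-written stand-ins, not the actual OSD code; only the lock names mirror the description above):

#include <mutex>

std::mutex osd_lock;    // taken by OSD::consume_map
std::mutex shard_lock;  // taken by the scheduler config-update path

// Thread A: map consumption.
void consume_map_path() {
  std::lock_guard<std::mutex> l(osd_lock);    // holds osd_lock ...
  // identify_splits_and_merges() then walks the shards:
  std::lock_guard<std::mutex> s(shard_lock);  // ... and waits for shard_lock
}

// Thread B: mClock config update (before the fix).
void update_scheduler_config_path() {
  std::lock_guard<std::mutex> s(shard_lock);  // holds shard_lock ...
  // handle_conf_change() needs the OSD-wide lock:
  std::lock_guard<std::mutex> l(osd_lock);    // ... and waits for osd_lock
}

// A holds osd_lock and waits on shard_lock; B holds shard_lock and waits on
// osd_lock: a classic ABBA deadlock. Dropping the shard_lock acquisition from
// the config-update path (the fix above) removes B's side of the cycle.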

Actions #6

Updated by Sridhar Seshasayee over 1 year ago

@Nitzan Mordechai this is probably similar to
https://tracker.ceph.com/issues/52948 and https://tracker.ceph.com/issues/56574, which exhibit the same symptoms.

I removed the cond_wait method of updating the mon store with the OSD's max capacity as part of https://tracker.ceph.com/issues/57040 and switched to Context-based completion.
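
Roughly, the difference looks like this (hand-rolled types and calls for illustration only, not Ceph's actual Context or MonClient interfaces):

#include <functional>
#include <iostream>

// Stand-in for a Context-like completion object: runs once, then frees itself.
struct Completion {
  std::function<void(int)> fn;
  void complete(int r) { fn(r); delete this; }
};

// Old style (simplified): the caller blocks on a condition variable until the
// mon store command is acknowledged, which can stall OSD startup when the
// reply is slow (e.g. under valgrind).
//
// New style: register a completion and return immediately; the reply path
// calls complete() later from a finisher thread.
void update_mon_store_async(double osd_max_capacity, Completion* on_finish) {
  // mon_command("config set ...", on_finish);  // hypothetical async send;
  // here we simulate an immediate ack:
  (void)osd_max_capacity;
  on_finish->complete(0);
}

int main() {
  update_mon_store_async(3000.0, new Completion{[](int r) {
    std::cout << "mon store updated, r=" << r << "\n";
  }});
}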

The common thread in these failures is either valgrind runs or thrash-related tests.

Actions #7

Updated by Nitzan Mordechai over 1 year ago

Sridhar, yes, those trackers look the same. Valgrind makes the OSD start more slowly; maybe that's why we see this more often with it.

Actions #8

Updated by Sergii Kuzko over 1 year ago

Hi,
Can you update the bug status, or move this to the current version, 17.2.5?

Actions #9

Updated by Sergii Kuzko over 1 year ago

Sergii Kuzko wrote:

Hi,
Can you update the bug status, or move this to the current version, 17.2.6?

Actions #10

Updated by Nitzan Mordechai over 1 year ago

  • Target version changed from v17.2.4 to v17.2.6
Actions #11

Updated by Radoslaw Zarzynski over 1 year ago

Bumping up (fix awaits QA).

Actions #12

Updated by Ilya Dryomov 11 months ago

  • Target version deleted (v17.2.6)
Actions #13

Updated by Radoslaw Zarzynski 11 months ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport changed from quincy to quincy,reef
Actions #14

Updated by Backport Bot 11 months ago

  • Copied to Backport #61445: reef: slow osd boot with valgrind (reached maximum tries (50) after waiting for 300 seconds) added
Actions #15

Updated by Backport Bot 11 months ago

  • Copied to Backport #61446: quincy: slow osd boot with valgrind (reached maximum tries (50) after waiting for 300 seconds) added
Actions #16

Updated by Backport Bot 11 months ago

  • Tags set to backport_processed
Actions #17

Updated by Konstantin Shalygin 3 months ago

  • Status changed from Pending Backport to Resolved
  • % Done changed from 0 to 100
  • Source set to Community (dev)