Bug #57506
crimson: vstart cluster pgs stuck in +wait
Status: Resolved
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
Main (c49b81c7d619cea23e9707d1f5bcc7de3049c4fd) + sjust/wip-io-hang (https://github.com/ceph/ceph/pull/48057)
MDS=0 MGR=1 OSD=3 MON=1 ../src/vstart.sh --without-dashboard -X --crimson --redirect-output --debug -n --no-restart --crimson-smp 3
./bin/ceph osd pool create rbd 32 32 replicated replicated_rule 2 2 2
./bin/rados bench 1000 write --admin-socket asok/bench.asok -p rbd -b 4096 --debug-ms=1 2>out/bench.stderr
kill -9 688982  # kill osd.2
./bin/ceph osd down 2
./bin/ceph osd out 2
./bin/ceph -s

*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2022-09-12T23:02:44.425+0000 7fa6c5a4c640 -1 WARNING: all dangerous and experimental features are enabled.
2022-09-12T23:02:44.428+0000 7fa6c5a4c640 -1 WARNING: all dangerous and experimental features are enabled.
  cluster:
    id:     07d892de-d0fb-4b86-b74d-cbc11d240a7a
    health: HEALTH_WARN
            Degraded data redundancy: 14449/44218 objects degraded (32.677%), 17 pgs degraded
            1 pool(s) do not have an application enabled

  services:
    mon: 1 daemons, quorum a (age 7m)
    mgr: x(active, since 7m)
    osd: 3 osds: 2 up (since 62s), 2 in (since 21s); 17 remapped pgs

  data:
    pools:   2 pools, 33 pgs
    objects: 22.11k objects, 86 MiB
    usage:   2.2 GiB used, 200 GiB / 202 GiB avail
    pgs:     14449/44218 objects degraded (32.677%)
             15 active+clean
             14 active+recovery_wait+undersized+degraded+remapped+wait
             2  active+recovering+undersized+degraded+remapped+wait
             1  active+undersized+wait
             1  active+recovery_wait+undersized+degraded+remapped

  io:
    client:   122 KiB/s wr, 0 op/s rd, 30 op/s wr
    recovery: 85 KiB/s, 21 objects/s
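To pick out which PGs carry a trailing "+wait" component (as opposed to states like recovery_wait), output in the style of `ceph pg dump pgs_brief` can be filtered with awk. A minimal sketch; the file path and the sample PG ids/states below are illustrative, not taken from this cluster:

```shell
#!/bin/sh
# Sample pgs_brief-style data (illustrative; in practice pipe
# `./bin/ceph pg dump pgs_brief` into the awk filter below).
cat <<'EOF' > /tmp/pgs_brief.txt
PG_STAT STATE
2.1a    active+clean
2.1b    active+recovery_wait+undersized+degraded+remapped+wait
2.1c    active+recovering+undersized+degraded+remapped+wait
2.1d    active+undersized+wait
2.1e    active+recovery_wait+undersized+degraded+remapped
EOF

# Match only states whose last component is "wait"; this excludes
# recovery_wait, which ends in "remapped" or another component here.
awk '$2 ~ /[+]wait$/ { print $1, $2 }' /tmp/pgs_brief.txt
# 2.1b active+recovery_wait+undersized+degraded+remapped+wait
# 2.1c active+recovering+undersized+degraded+remapped+wait
# 2.1d active+undersized+wait
```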
Interestingly, IO continues anyway.
History
#1 Updated by Samuel Just over 1 year ago
- Description updated (diff)
#2 Updated by Samuel Just over 1 year ago
- Description updated (diff)
#3 Updated by Samuel Just over 1 year ago
- Description updated (diff)
#4 Updated by Samuel Just over 1 year ago
Steady state several minutes later, after the rados bench instance completed (successfully!):
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2022-09-12T23:15:28.722+0000 7f7a0e6e8640 -1 WARNING: all dangerous and experimental features are enabled.
2022-09-12T23:15:28.726+0000 7f7a0e6e8640 -1 WARNING: all dangerous and experimental features are enabled.
  cluster:
    id:     07d892de-d0fb-4b86-b74d-cbc11d240a7a
    health: HEALTH_WARN
            Degraded data redundancy: 1 pg undersized
            1 pool(s) do not have an application enabled

  services:
    mon: 1 daemons, quorum a (age 20m)
    mgr: x(active, since 20m)
    osd: 3 osds: 2 up (since 13m), 2 in (since 13m)

  data:
    pools:   2 pools, 33 pgs
    objects: 42.37k objects, 166 MiB
    usage:   2.4 GiB used, 200 GiB / 202 GiB avail
    pgs:     20 active+clean
             12 active+clean+wait
             1  active+undersized+wait
#5 Updated by Samuel Just over 1 year ago
https://github.com/ceph/ceph/pull/48057 covers the stuck part, but the fact that IO continues regardless means that we haven't actually implemented blocking IO while the prior read lease expires. https://tracker.ceph.com/issues/57508 tracks that part.
#6 Updated by Samuel Just over 1 year ago
- Status changed from New to Resolved