Bug #57506

crimson: vstart cluster pgs stuck in +wait

Added by Samuel Just over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Main (c49b81c7d619cea23e9707d1f5bcc7de3049c4fd) + sjust/wip-io-hang (https://github.com/ceph/ceph/pull/48057)

  MDS=0 MGR=1 OSD=3 MON=1 ../src/vstart.sh --without-dashboard -X --crimson --redirect-output --debug -n --no-restart --crimson-smp 3
  ./bin/ceph osd pool create rbd 32 32 replicated replicated_rule 2 2 2
  ./bin/rados bench 1000 write --admin-socket asok/bench.asok -p rbd -b 4096 --debug-ms=1 2>out/bench.stderr
  kill -9 688982 # kill osd.2
  ./bin/ceph osd down 2
  ./bin/ceph osd out 2
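
The PID in the kill -9 above is specific to that particular run; a more general way to locate and kill the crimson-osd process for osd.2 (my note, not part of the original report, assuming vstart's usual "crimson-osd ... -i 2" command line) is something like:

  # find the crimson-osd process for osd.2 by its command line, then kill it
  pgrep -af 'crimson-osd.*-i 2'
  kill -9 "$(pgrep -f 'crimson-osd.*-i 2')"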

./bin/ceph -s

*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2022-09-12T23:02:44.425+0000 7fa6c5a4c640 -1 WARNING: all dangerous and experimental features are enabled.
2022-09-12T23:02:44.428+0000 7fa6c5a4c640 -1 WARNING: all dangerous and experimental features are enabled.
  cluster:
    id:     07d892de-d0fb-4b86-b74d-cbc11d240a7a
    health: HEALTH_WARN
            Degraded data redundancy: 14449/44218 objects degraded (32.677%), 17 pgs degraded
            1 pool(s) do not have an application enabled

  services:
    mon: 1 daemons, quorum a (age 7m)
    mgr: x(active, since 7m)
    osd: 3 osds: 2 up (since 62s), 2 in (since 21s); 17 remapped pgs

  data:
    pools:   2 pools, 33 pgs
    objects: 22.11k objects, 86 MiB
    usage:   2.2 GiB used, 200 GiB / 202 GiB avail
    pgs:     14449/44218 objects degraded (32.677%)
             15 active+clean
             14 active+recovery_wait+undersized+degraded+remapped+wait
             2  active+recovering+undersized+degraded+remapped+wait
             1  active+undersized+wait
             1  active+recovery_wait+undersized+degraded+remapped

  io:
    client:   122 KiB/s wr, 0 op/s rd, 30 op/s wr
    recovery: 85 KiB/s, 21 objects/s

Interestingly, IO continues anyway.
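
Not part of the original report, but to see exactly which PGs are carrying the +wait state and to inspect one of them, the standard pg commands should work against the vstart cluster (a hedged sketch; <pgid> is a placeholder):

  # list PGs whose state includes "wait"
  ./bin/ceph pg ls | grep wait
  # dump the full state (recovery_state, blocked_by, ...) of one stuck PG
  ./bin/ceph pg <pgid> query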

History

#1 Updated by Samuel Just over 1 year ago

  • Description updated (diff)

#2 Updated by Samuel Just over 1 year ago

  • Description updated (diff)

#3 Updated by Samuel Just over 1 year ago

  • Description updated (diff)

#4 Updated by Samuel Just over 1 year ago

Steady state several minutes later, after the rados bench instance completed (successfully!):

*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2022-09-12T23:15:28.722+0000 7f7a0e6e8640 -1 WARNING: all dangerous and experimental features are enabled.
2022-09-12T23:15:28.726+0000 7f7a0e6e8640 -1 WARNING: all dangerous and experimental features are enabled.
  cluster:
    id:     07d892de-d0fb-4b86-b74d-cbc11d240a7a
    health: HEALTH_WARN
            Degraded data redundancy: 1 pg undersized
            1 pool(s) do not have an application enabled

  services:
    mon: 1 daemons, quorum a (age 20m)
    mgr: x(active, since 20m)
    osd: 3 osds: 2 up (since 13m), 2 in (since 13m)

  data:
    pools:   2 pools, 33 pgs
    objects: 42.37k objects, 166 MiB
    usage:   2.4 GiB used, 200 GiB / 202 GiB avail
    pgs:     20 active+clean
             12 active+clean+wait
             1  active+undersized+wait

#5 Updated by Samuel Just over 1 year ago

https://github.com/ceph/ceph/pull/48057 covers the stuck part, but the fact that IO continues regardless means that we haven't actually implemented blocking IO until the prior read lease expires. https://tracker.ceph.com/issues/57508 tracks that part.
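
As context (my addition, not from the original comment): in the classic OSD the read lease duration is derived from the heartbeat grace and the pool read lease ratio, so the window IO would need to block for can be estimated from the vstart cluster's config:

  # lease duration is roughly osd_heartbeat_grace * osd_pool_default_read_lease_ratio
  ./bin/ceph config get osd osd_heartbeat_grace
  ./bin/ceph config get osd osd_pool_default_read_lease_ratio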

#6 Updated by Samuel Just over 1 year ago

  • Status changed from New to Resolved
