Bug #51076

"wait_for_recovery: failed before timeout expired" during thrashosd test with EC backend.

Added by Sridhar Seshasayee almost 3 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: quincy,pacific,octopus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/sseshasa-2021-06-01_08:27:04-rados-wip-sseshasa-testing-objs-test-2-distro-basic-smithi/6145021

Unfortunately there are no osd logs since the job was marked as dead. Some relevant details are
pasted below:

Traceback

2021-06-01T10:03:25.359 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_ceph-c_6ab8dbe666350a7377b35b715b28fe83cc8492c2/qa/tasks/ceph_manager.py", line 187, in wrapper
    return func(self)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_6ab8dbe666350a7377b35b715b28fe83cc8492c2/qa/tasks/ceph_manager.py", line 1403, in _do_thrash
    timeout=self.config.get('timeout')
  File "/home/teuthworker/src/git.ceph.com_ceph-c_6ab8dbe666350a7377b35b715b28fe83cc8492c2/qa/tasks/ceph_manager.py", line 2803, in wait_for_recovery
    'wait_for_recovery: failed before timeout expired'
AssertionError: wait_for_recovery: failed before timeout expired

2021-06-01T10:03:25.359 ERROR:tasks.thrashosds.thrasher:exception:
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_ceph-c_6ab8dbe666350a7377b35b715b28fe83cc8492c2/qa/tasks/ceph_manager.py", line 1278, in do_thrash
    self._do_thrash()
  File "/home/teuthworker/src/git.ceph.com_ceph-c_6ab8dbe666350a7377b35b715b28fe83cc8492c2/qa/tasks/ceph_manager.py", line 187, in wrapper
    return func(self)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_6ab8dbe666350a7377b35b715b28fe83cc8492c2/qa/tasks/ceph_manager.py", line 1403, in _do_thrash
    timeout=self.config.get('timeout')
  File "/home/teuthworker/src/git.ceph.com_ceph-c_6ab8dbe666350a7377b35b715b28fe83cc8492c2/qa/tasks/ceph_manager.py", line 2803, in wait_for_recovery
    'wait_for_recovery: failed before timeout expired'
AssertionError: wait_for_recovery: failed before timeout expired

Slow Ops:

Slow ops from the last ceph status output issued before the above traceback:

"SLOW_OPS":{

    "severity":"HEALTH_WARN",
    "summary":{
        "message":"2 slow ops, oldest one blocked for 1743 sec, osd.10 has slow ops",
        "count":1
    },
    "muted":false

}

PG Map

The state of the pg map just before the above traceback was reported:

"pgmap":{

    "pgs_by_state":[
        {
            "state_name":"active+clean",
            "count":62
        },
        {
            "state_name":"recovery_wait+undersized+degraded+remapped+peered",
            "count":13
        },
        {
            "state_name":"active+recovery_wait+undersized+degraded+remapped",
            "count":9
        },
        {
            "state_name":"active+recovering+undersized+degraded+remapped",
            "count":1
        }
    ],
    "num_pgs":85,
    "num_pools":3,
    "num_objects":318,
    "data_bytes":751566848,
    "bytes_used":4771844096,
    "bytes_avail":961595797504,
    "bytes_total":966367641600,
    "inactive_pgs_ratio":0.15294118225574493,
    "degraded_objects":2235,
    "degraded_total":1908,
    "degraded_ratio":1.171383647798742

}

Related issues

Related to RADOS - Bug #50842: pacific: recovery does not complete because of rw_manager lock not being released Resolved
Related to RADOS - Bug #56149: thrash-erasure-code: AssertionError: wait_for_recovery timeout due to "active+recovering+undersized" pg New
Related to RADOS - Bug #58893: test_map_discontinuity: AssertionError: wait_for_clean: failed before timeout expired New
Duplicated by RADOS - Bug #53875: AssertionError: wait_for_recovery: failed before timeout expired due to down PGs Duplicate
Duplicated by RADOS - Bug #50842: pacific: recovery does not complete because of rw_manager lock not being released Resolved
Copied to RADOS - Backport #55743: octopus: "wait_for_recovery: failed before timeout expired" during thrashosd test with EC backend. Resolved
Copied to RADOS - Backport #55744: quincy: "wait_for_recovery: failed before timeout expired" during thrashosd test with EC backend. Resolved
Copied to RADOS - Backport #55745: pacific: "wait_for_recovery: failed before timeout expired" during thrashosd test with EC backend. Resolved

History

#1 Updated by Sridhar Seshasayee almost 3 years ago

Ran the same test repeatedly (5 times) on master, setting osd_op_queue to 'wpq' and 'mclock_scheduler' on separate
branches.

Result with osd_op_queue=mclock_scheduler:
https://pulpito.ceph.com/sseshasa-2021-06-08_20:11:54-rados:thrash-erasure-code-big-wip-thrashosdsmclk-testing-distro-basic-smithi/

All 5 attempts at the test passed.

Result with osd_op_queue=wpq:
https://pulpito.ceph.com/sseshasa-2021-06-08_20:09:28-rados:thrash-erasure-code-big-wip-thrashosdswpq-testing-distro-basic-smithi/

2 attempts didn't complete:
/a/sseshasa-2021-06-08_20:09:28-rados:thrash-erasure-code-big-wip-thrashosdswpq-testing-distro-basic-smithi/6161181
/a/sseshasa-2021-06-08_20:09:28-rados:thrash-erasure-code-big-wip-thrashosdswpq-testing-distro-basic-smithi/6161182

The reason for the failure is similar to what was observed originally (with osd_op_queue=debug_random) when this tracker was logged: a timeout either waiting for recovery or waiting for pgs to become clean. Therefore, this issue cannot be attributed solely to the 'mclock_scheduler' or to the mclock-related changes in PR https://github.com/ceph/ceph/pull/41308.

Traceback from 6161181

2021-06-08T21:15:49.387 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_ceph-c_dd0d67e8cdaf292af720548ea89e8b4da879c5e5/qa/tasks/ceph_manager.py", line 187, in wrapper
    return func(self)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_dd0d67e8cdaf292af720548ea89e8b4da879c5e5/qa/tasks/ceph_manager.py", line 1403, in _do_thrash
    timeout=self.config.get('timeout')
  File "/home/teuthworker/src/git.ceph.com_ceph-c_dd0d67e8cdaf292af720548ea89e8b4da879c5e5/qa/tasks/ceph_manager.py", line 2803, in wait_for_recovery
    'wait_for_recovery: failed before timeout expired'
AssertionError: wait_for_recovery: failed before timeout expired

2021-06-08T21:15:49.388 ERROR:tasks.thrashosds.thrasher:exception:
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_ceph-c_dd0d67e8cdaf292af720548ea89e8b4da879c5e5/qa/tasks/ceph_manager.py", line 1278, in do_thrash
    self._do_thrash()
  File "/home/teuthworker/src/git.ceph.com_ceph-c_dd0d67e8cdaf292af720548ea89e8b4da879c5e5/qa/tasks/ceph_manager.py", line 187, in wrapper
    return func(self)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_dd0d67e8cdaf292af720548ea89e8b4da879c5e5/qa/tasks/ceph_manager.py", line 1403, in _do_thrash
    timeout=self.config.get('timeout')
  File "/home/teuthworker/src/git.ceph.com_ceph-c_dd0d67e8cdaf292af720548ea89e8b4da879c5e5/qa/tasks/ceph_manager.py", line 2803, in wait_for_recovery
    'wait_for_recovery: failed before timeout expired'
AssertionError: wait_for_recovery: failed before timeout expired

Traceback from 6161182

2021-06-08T21:16:05.821 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_ceph-c_dd0d67e8cdaf292af720548ea89e8b4da879c5e5/qa/tasks/ceph_manager.py", line 187, in wrapper
    return func(self)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_dd0d67e8cdaf292af720548ea89e8b4da879c5e5/qa/tasks/ceph_manager.py", line 1400, in _do_thrash
    self.test_map_discontinuity()
  File "/home/teuthworker/src/git.ceph.com_ceph-c_dd0d67e8cdaf292af720548ea89e8b4da879c5e5/qa/tasks/ceph_manager.py", line 1186, in test_map_discontinuity
    timeout=self.config.get('timeout')
  File "/home/teuthworker/src/git.ceph.com_ceph-c_dd0d67e8cdaf292af720548ea89e8b4da879c5e5/qa/tasks/ceph_manager.py", line 2718, in wait_for_clean
    'wait_for_clean: failed before timeout expired'
AssertionError: wait_for_clean: failed before timeout expired

2021-06-08T21:16:05.821 ERROR:tasks.thrashosds.thrasher:exception:
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_ceph-c_dd0d67e8cdaf292af720548ea89e8b4da879c5e5/qa/tasks/ceph_manager.py", line 1278, in do_thrash
    self._do_thrash()
  File "/home/teuthworker/src/git.ceph.com_ceph-c_dd0d67e8cdaf292af720548ea89e8b4da879c5e5/qa/tasks/ceph_manager.py", line 187, in wrapper
    return func(self)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_dd0d67e8cdaf292af720548ea89e8b4da879c5e5/qa/tasks/ceph_manager.py", line 1400, in _do_thrash
    self.test_map_discontinuity()
  File "/home/teuthworker/src/git.ceph.com_ceph-c_dd0d67e8cdaf292af720548ea89e8b4da879c5e5/qa/tasks/ceph_manager.py", line 1186, in test_map_discontinuity
    timeout=self.config.get('timeout')
  File "/home/teuthworker/src/git.ceph.com_ceph-c_dd0d67e8cdaf292af720548ea89e8b4da879c5e5/qa/tasks/ceph_manager.py", line 2718, in wait_for_clean
    'wait_for_clean: failed before timeout expired'
AssertionError: wait_for_clean: failed before timeout expired

The logs from both failures were collected. I will try to upload them to a common location. Analysis is pending.

#2 Updated by Sridhar Seshasayee almost 3 years ago

From the logs of 6161181, snapshot recovery is unable to proceed because an rwlock on the head version
(3:cb63772d:::smithi11022783-30:head) could not be obtained. This issue appears similar to
https://tracker.ceph.com/issues/50842 observed on pacific.

2021-06-08T20:45:12.055+0000 7f2244ebd700 20 osd.8 op_wq(3) _enqueue OpSchedulerItem(3.13s0 PGRecovery(pgid=3.13s0 epoch_queued=882 reserved_pushes=1) prio 5 cost 20971520 e882 reserved_pushes 1)
2021-06-08T20:45:12.055+0000 7f2244ebd700 10 osd.8 pg_epoch: 882 pg[3.13s0( v 871'1226 (0'0,871'1226] local-lis/les=873/874 n=11 ec=48/26 lis/c=873/787 les/c/f=874/788/0 sis=873) [8,10,9,4,11,7]/[8,NONE,9,4,11,7]p8(0) async=[10(1)] r=0 lpr=873 pi=[787,873)/1 crt=871'1226 lcod 871'1224 mlcod 0'0 active+recovering+undersized+remapped mbc={0={(1+0)=1},1={(0+1)=1},2={(1+0)=1},3={(1+0)=1},4={(1+0)=1},5={(1+0)=1}} trimq=[2cb~1,2de~1,2f0~1,2f5~1] ps=156] release_backoffs [3:cb63772d:::smithi11022783-30:2f9,3:cb63772d:::smithi11022783-30:2f9)
2021-06-08T20:45:12.055+0000 7f2244ebd700 10 osd.8 pg_epoch: 882 pg[3.13s0( v 871'1226 (0'0,871'1226] local-lis/les=873/874 n=11 ec=48/26 lis/c=873/787 les/c/f=874/788/0 sis=873) [8,10,9,4,11,7]/[8,NONE,9,4,11,7]p8(0) async=[10(1)] r=0 lpr=873 pi=[787,873)/1 crt=871'1226 lcod 871'1224 mlcod 0'0 active+recovering+undersized+remapped mbc={0={(1+0)=1},1={(0+1)=1},2={(1+0)=1},3={(1+0)=1},4={(1+0)=1},5={(1+0)=1}} trimq=[2cb~1,2de~1,2f0~1,2f5~1] ps=156] finish_degraded_object 3:cb63772d:::smithi11022783-30:2f9
2021-06-08T20:45:12.055+0000 7f2244ebd700 10 osd.8 pg_epoch: 882 pg[3.13s0( v 871'1226 (0'0,871'1226] local-lis/les=873/874 n=11 ec=48/26 lis/c=873/787 les/c/f=874/788/0 sis=873) [8,10,9,4,11,7]/[8,NONE,9,4,11,7]p8(0) async=[10(1)] r=0 lpr=873 pi=[787,873)/1 crt=871'1226 lcod 871'1224 mlcod 0'0 active+recovering+undersized+remapped mbc={0={(1+0)=1},1={(0+1)=1},2={(1+0)=1},3={(1+0)=1},4={(1+0)=1},5={(1+0)=1}} trimq=[2cb~1,2de~1,2f0~1,2f5~1] ps=156] continue_recovery_op: WRITING return RecoveryOp(hoid=3:cb63772d:::smithi11022783-30:2f9 v=871'1225 missing_on=10(1) missing_on_shards=1 recovery_info=ObjectRecoveryInfo(3:cb63772d:::smithi11022783-30:2f9@871'1225, size: 2211840, copy_subset: [], clone_subset: {}, snapset: 2f9=[]:{291=[291],2d9=[2d9,2d2,2d1,2cc,2cb,2c9,2c6],2f9=[2f9,2f5,2f3,2f1,2f0,2ec,2eb,2ea,2e0,2de,2db]}, object_exist: 1) recovery_progress=ObjectRecoveryProgress(!first, data_recovered_to:2211840, data_complete:true, omap_recovered_to:, omap_complete:true, error:false) obc refcount=2 state=COMPLETE waiting_on_pushes= extent_requested=2097152,1048576)

Shortly after, the rwlock could not be obtained, as seen below:

2021-06-08T20:45:12.055+0000 7f2240eb5700 10 osd.8 882 do_recovery starting 1 pg[3.13s0( v 871'1226 (0'0,871'1226] local-lis/les=873/874 n=11 ec=48/26 lis/c=873/787 les/c/f=874/788/0 sis=873) [8,10,9,4,11,7]/[8,NONE,9,4,11,7]p8(0) async=[10(1)] r=0 lpr=873 pi=[787,873)/1 crt=871'1226 lcod 871'1224 mlcod 0'0 active+recovering+undersized+remapped mbc={0={(1+0)=1},1={(0+1)=1},2={(1+0)=1},3={(1+0)=1},4={(1+0)=1},5={(1+0)=1}} trimq=[2cb~1,2de~1,2f0~1,2f5~1] ps=156]
2021-06-08T20:45:12.055+0000 7f2240eb5700 10 osd.8 pg_epoch: 882 pg[3.13s0( v 871'1226 (0'0,871'1226] local-lis/les=873/874 n=11 ec=48/26 lis/c=873/787 les/c/f=874/788/0 sis=873) [8,10,9,4,11,7]/[8,NONE,9,4,11,7]p8(0) async=[10(1)] r=0 lpr=873 pi=[787,873)/1 crt=871'1226 lcod 871'1224 mlcod 0'0 active+recovering+undersized+remapped mbc={0={(1+0)=1},1={(0+1)=1},2={(1+0)=1},3={(1+0)=1},4={(1+0)=1},5={(1+0)=1}} trimq=[2cb~1,2de~1,2f0~1,2f5~1] ps=156] recover_replicas(1)
2021-06-08T20:45:12.055+0000 7f2240eb5700 10 osd.8 pg_epoch: 882 pg[3.13s0( v 871'1226 (0'0,871'1226] local-lis/les=873/874 n=11 ec=48/26 lis/c=873/787 les/c/f=874/788/0 sis=873) [8,10,9,4,11,7]/[8,NONE,9,4,11,7]p8(0) async=[10(1)] r=0 lpr=873 pi=[787,873)/1 crt=871'1226 lcod 871'1224 mlcod 0'0 active+recovering+undersized+remapped mbc={0={(1+0)=1},1={(0+1)=1},2={(1+0)=1},3={(1+0)=1},4={(1+0)=1},5={(1+0)=1}} trimq=[2cb~1,2de~1,2f0~1,2f5~1] ps=156]  peer osd.10(1) missing 1 objects.
2021-06-08T20:45:12.055+0000 7f2240eb5700 20 osd.8 pg_epoch: 882 pg[3.13s0( v 871'1226 (0'0,871'1226] local-lis/les=873/874 n=11 ec=48/26 lis/c=873/787 les/c/f=874/788/0 sis=873) [8,10,9,4,11,7]/[8,NONE,9,4,11,7]p8(0) async=[10(1)] r=0 lpr=873 pi=[787,873)/1 crt=871'1226 lcod 871'1224 mlcod 0'0 active+recovering+undersized+remapped mbc={0={(1+0)=1},1={(0+1)=1},2={(1+0)=1},3={(1+0)=1},4={(1+0)=1},5={(1+0)=1}} trimq=[2cb~1,2de~1,2f0~1,2f5~1] ps=156]  peer osd.10(1) missing {3:cb63772d:::smithi11022783-30:head=871'1226 flags = none clean_offsets: [], clean_omap: 0, new_object: 1}

...

2021-06-08T20:45:12.055+0000 7f2240eb5700 20 osd.8 pg_epoch: 882 pg[3.13s0( v 871'1226 (0'0,871'1226] local-lis/les=873/874 n=11 ec=48/26 lis/c=873/787 les/c/f=874/788/0 sis=873) [8,10,9,4,11,7]/[8,NONE,9,4,11,7]p8(0) async=[10(1)] r=0 lpr=873 pi=[787,873)/1 crt=871'1226 lcod 871'1224 mlcod 0'0 active+recovering+undersized+remapped mbc={0={(1+0)=1},1={(0+1)=1},2={(1+0)=1},3={(1+0)=1},4={(1+0)=1},5={(1+0)=1}} trimq=[2cb~1,2de~1,2f0~1,2f5~1] ps=156] recovery delayed on 3:cb63772d:::smithi11022783-30:head; could not get rw_manager lock

The state of the above pg remained the same. After some more time, slow ops were reported for the op pending on this pg.

2021-06-08T20:45:33.370+0000 7f2240eb5700 20 osd.8 pg_epoch: 900 pg[3.13s0( v 890'1232 (0'0,890'1232] local-lis/les=873/874 n=12 ec=48/26 lis/c=873/787 les/c/f=874/788/0 sis=873) [8,10,9,4,11,7]/[8,NONE,9,4,11,7]p8(0) async=[10(1)] r=0 lpr=873 pi=[787,873)/1 crt=890'1232 lcod 890'1231 mlcod 871'1225 active+recovering+undersized+remapped mbc={0={(1+0)=1},1={(0+1)=1},2={(1+0)=1},3={(1+0)=1},4={(1+0)=1},5={(1+0)=1}} trimq=[27b~1,2c9~1,2cb~1,2de~1,2ec~1,2f0~1,2f5~1,2f9~1] ps=84] recovery delayed on 3:cb63772d:::smithi11022783-30:head; could not get rw_manager lock

...

2021-06-08T20:45:35.282+0000 7f2259ef6700 20 slow request osd_op(client.4485.0:9126 3.13s0 3:cb63772d:::smithi11022783-30:head [copy-from ver 0 in=55b] snapc 304=[304,303,302,2ff,2fe,2fc,2f9,2f5,2f3,2f1,2f0,2ec,2eb,2ea,2e0,2db,2d9,2d2,2d1,2cc,2cb,2c9,2c6,291,27b] ondisk+write+known_if_redirected e875) initiated 2021-06-08T20:45:04.388826+0000 currently started
2021-06-08T20:45:35.282+0000 7f2259ef6700  0 log_channel(cluster) log [WRN] : slow request osd_op(client.4485.0:9126 3.13s0 3:cb63772d:::smithi11022783-30:head [copy-from ver 0 in=55b] snapc 304=[304,303,302,2ff,2fe,2fc,2f9,2f5,2f3,2f1,2f0,2ec,2eb,2ea,2e0,2db,2d9,2d2,2d1,2cc,2cb,2c9,2c6,291,27b] ondisk+write+known_if_redirected e875) initiated 2021-06-08T20:45:04.388826+0000 currently started
2021-06-08T20:45:35.282+0000 7f2259ef6700 20 slow request osd_op(client.4485.0:9127 3.13s0 3:cb63772d:::smithi11022783-30:head [stat] snapc 0=[] ondisk+read+rwordered+known_if_redirected e875) initiated 2021-06-08T20:45:04.388851+0000 currently delayed
2021-06-08T20:45:35.282+0000 7f2259ef6700  0 log_channel(cluster) log [WRN] : slow request osd_op(client.4485.0:9127 3.13s0 3:cb63772d:::smithi11022783-30:head [stat] snapc 0=[] ondisk+read+rwordered+known_if_redirected e875) initiated 2021-06-08T20:45:04.388851+0000 currently delayed
2021-06-08T20:45:35.282+0000 7f2259ef6700 -1 osd.8 900 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.4485.0:9126 3.13s0 3:cb63772d:::smithi11022783-30:head [copy-from ver 0 in=55b] snapc 304=[304,303,302,2ff,2fe,2fc,2f9,2f5,2f3,2f1,2f0,2ec,2eb,2ea,2e0,2db,2d9,2d2,2d1,2cc,2cb,2c9,2c6,291,27b] ondisk+write+known_if_redirected e875

#3 Updated by Sridhar Seshasayee almost 3 years ago

For the dead jobs, relevant logs have been uploaded to senta02 under /home/sseshasa/recovery_timeout.
Please let me know if anyone is unable to access the logs or if additional logs are needed.

Command Used to Run the Test:

$ teuthology-suite -s rados:thrash-erasure-code-big -c wip-thrashosdswpq-testing --filter '{ceph cluster/{12-osds openstack} mon_election/connectivity msgr-failures/osd-delay objectstore/bluestore-comp-lz4 rados recovery-overrides/{more-async-recovery} supported-random-distro$/{ubuntu_latest} thrashers/mapgap thrashosds-health workloads/ec-rados-plugin=jerasure-k=4-m=2}' -m smithi -p 70  --seed 8394 -N 5

#4 Updated by Neha Ojha almost 3 years ago

  • Related to Bug #50842: pacific: recovery does not complete because of rw_manager lock not being released added

#5 Updated by Neha Ojha almost 3 years ago

  • Priority changed from Normal to Urgent
  • Backport set to pacific

Looks very similar

{"pgs_by_state":[{"state_name":"active+clean","count":181},
                 {"state_name":"active+recovery_wait+undersized+degraded+remapped","count":2},
                 {"state_name":"recovery_wait+undersized+degraded+remapped+peered","count":1},
                 {"state_name":"active+recovering+undersized+degraded+remapped","count":1}],

{ceph cluster/{12-osds openstack} mon_election/connectivity msgr-failures/few objectstore/bluestore-comp-lz4 rados recovery-overrides/{more-async-partial-recovery} supported-random-distro$/{centos_8} thrashers/pggrow thrashosds-health workloads/ec-rados-plugin=jerasure-k=4-m=2}

/a/yuriw-2021-06-07_19:22:06-rados-wip-yuri4-testing-2021-06-07-0955-pacific-distro-basic-smithi/6157866 - no logs

#6 Updated by Neha Ojha almost 3 years ago

rados/thrash-erasure-code-big/{ceph cluster/{12-osds openstack} mon_election/connectivity msgr-failures/osd-dispatch-delay objectstore/bluestore-stupid rados recovery-overrides/{more-async-recovery} supported-random-distro$/{rhel_8} thrashers/pggrow thrashosds-health workloads/ec-rados-plugin=jerasure-k=4-m=2}

/a/yuriw-2021-06-07_19:22:06-rados-wip-yuri4-testing-2021-06-07-0955-pacific-distro-basic-smithi/6158078 - no logs

#7 Updated by Patrick Donnelly almost 3 years ago

I think this one is related?

/ceph/teuthology-archive/pdonnell-2021-06-16_21:26:55-fs-wip-pdonnell-testing-20210616.191804-distro-basic-smithi/6175804

It does have logs.

#8 Updated by Neha Ojha over 2 years ago

  • Assignee set to Neha Ojha

#10 Updated by Deepika Upadhyay over 2 years ago


2021-09-29T21:35:48.143 INFO:tasks.ceph.osd.3.smithi123.stderr:2021-09-29T21:35:47.889+0000 7f96eafa2700 -1 osd.3 619 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.4566.0:5507 3.24s0 3:24ed5cab:::smithi12332721-31:head [rollback 1c1] snapc 1c1=[1c1,1bc,1bb,1ac] ondisk+write+known_if_redirected e587

2021-10-07T20:05:55.179 INFO:teuthology.orchestra.run.smithi124.stdout:            "description": "osd_op(client.4501.0:7721 3.5s0 3:a11e178e:::smithi10932432-30:head [rollback 258] snapc 25a=[25a,259,258,250,24d,22e,20d,1ec] ondisk+write+known_if_redirected e684)",
2021-10-07T20:05:55.179 INFO:teuthology.orchestra.run.smithi124.stdout:            "initiated_at": "2021-10-07T20:05:33.457443+0000",
2021-10-07T20:05:55.180 INFO:teuthology.orchestra.run.smithi124.stdout:            "age": 21.717791700999999,
2021-10-07T20:05:55.180 INFO:teuthology.orchestra.run.smithi124.stdout:            "duration": 21.717808847000001,
2021-10-07T20:05:55.180 INFO:teuthology.orchestra.run.smithi124.stdout:            "type_data": {
2021-10-07T20:05:55.180 INFO:teuthology.orchestra.run.smithi124.stdout:                "flag_point": "delayed",
2021-10-07T20:05:55.181 INFO:teuthology.orchestra.run.smithi124.stdout:                "client_info": {

2021-10-07T20:57:31.451 INFO:tasks.ceph.ceph_manager.ceph:PG 3.12 is not active+clean
2021-10-07T20:57:31.451 INFO:tasks.ceph.ceph_manager.ceph:{'pgid': '3.12', 'version': "1487'1049", 'reported_seq': 4264, 'reported_epoch': 2029, 'state': 'active+recovery_wait+undersized+degraded+remapped', 'last_fresh': '2021-10-07T20:57:27.335113+0000', 'last_change': '2021-10-07T20:05:10.287992+0000
(similar for other PGs that are not active+clean)

/ceph/teuthology-archive/yuriw-2021-09-29_18:13:53-rados-wip-yuri3-testing-2021-09-29-0859-octopus-distro-basic-smithi/6413909/teuthology.log

Looks similar to this issue, but osd logs could not be collected, so I cannot verify whether it is related to rw_manager or not.

#11 Updated by Laura Flores about 2 years ago

/a/yuriw-2022-01-11_19:17:55-rados-wip-yuri5-testing-2022-01-11-0843-distro-default-smithi/6608450

#12 Updated by Laura Flores about 2 years ago

  • Duplicated by Bug #53875: AssertionError: wait_for_recovery: failed before timeout expired due to down PGs added

#13 Updated by Laura Flores about 2 years ago

/a/yuriw-2022-01-15_05:47:18-rados-wip-yuri8-testing-2022-01-14-1551-distro-default-smithi/6619577
/a/yuriw-2022-01-17_17:14:22-rados-wip-yuri8-testing-2022-01-14-1551-distro-default-smithi/6622567

#14 Updated by Neha Ojha about 2 years ago

  • Assignee changed from Neha Ojha to Laura Flores

#15 Updated by Laura Flores about 2 years ago

/a/yuriw-2022-01-27_14:57:16-rados-wip-yuri-testing-2022-01-26-1810-pacific-distro-default-smithi/6643449

#16 Updated by Laura Flores about 2 years ago

http://pulpito.front.sepia.ceph.com/lflores-2022-01-31_19:11:11-rados:thrash-erasure-code-big-master-distro-default-smithi/

/a/lflores-2022-01-31_19:11:11-rados:thrash-erasure-code-big-master-distro-default-smithi/6651484

2022-02-01T03:54:57.657+0000 7fb0ac5b7700 20 osd.1 pg_epoch: 314 pg[3.19s0( v 314'188 (0'0,314'188] local-lis/les=303/304 n=4 ec=31/25 lis/c=303/298 les/c/f=304/299/0 sis=303) [1,2,4,10,8,7]/[1,NONE,4,10,8,7]p1(0) async=[2(1)] r=0 lpr=303 pi=[298,303)/1 luod=300'186 crt=300'186 lcod 299'184 mlcod 0'0 active+recovery_wait+undersized+degraded+remapped mbc={0={(1+0)=6},1={(0+0)=2,(0+1)=4},2={(1+0)=5,(1+1)=1},3={(1+0)=6},4={(1+0)=6},5={(1+0)=6}} trimq=[e8~1,f7~2] ps=30] recovery delayed on 3:98183898:::smithi04040197-26:105; could not get rw_manager lock
2022-02-01T03:54:57.658+0000 7fb0ac5b7700 20 osd.1 pg_epoch: 314 pg[3.19s0( v 314'188 (0'0,314'188] local-lis/les=303/304 n=4 ec=31/25 lis/c=303/298 les/c/f=304/299/0 sis=303) [1,2,4,10,8,7]/[1,NONE,4,10,8,7]p1(0) async=[2(1)] r=0 lpr=303 pi=[298,303)/1 luod=300'186 crt=300'186 lcod 299'184 mlcod 0'0 active+recovery_wait+undersized+degraded+remapped mbc={0={(1+0)=6},1={(0+0)=2,(0+1)=4},2={(1+0)=5,(1+1)=1},3={(1+0)=6},4={(1+0)=6},5={(1+0)=6}} trimq=[e8~1,f7~2] ps=30] recovery delayed on 3:98183898:::smithi04040197-26:105; could not get rw_manager lock

2022-02-01T03:54:14.404+0000 7f5bc11fe700 20 osd.0 pg_epoch: 271 pg[3.2bs0( v 271'316 (0'0,271'316] local-lis/les=265/266 n=11 ec=152/25 lis/c=265/162 les/c/f=266/163/0 sis=265) [0,7,1,10,5,8]/[0,7,1,10,NONE,8]p0(0) async=[5(4)] r=0 lpr=265 pi=[162,265)/1 luod=267'314 crt=267'314 lcod 267'313 mlcod 30'16 active+recovery_wait+undersized+degraded+remapped mbc={0={(1+0)=11},1={(1+0)=11},2={(1+0)=11},3={(1+0)=11},4={(0+0)=5,(0+1)=6},5={(1+0)=6,(1+1)=5}} trimq=[ab~1,b2~1,b6~1,bc~1,bf~1,c6~1,ca~1,cf~2,d3~1,d5~1,d7~1,da~1] ps=[7c~1,88~1,8c~3,92~1,95~1,a4~1,a8~1,ac~3,b0~2,ba~2,c2~1,c7~1]] recovery delayed on 3:d7b31dab:::smithi04040197-27:e1; could not get rw_manager lock
2022-02-01T03:54:14.405+0000 7f5bc5206700 20 osd.0 pg_epoch: 271 pg[3.2bs0( v 271'316 (0'0,271'316] local-lis/les=265/266 n=11 ec=152/25 lis/c=265/162 les/c/f=266/163/0 sis=265) [0,7,1,10,5,8]/[0,7,1,10,NONE,8]p0(0) async=[5(4)] r=0 lpr=265 pi=[162,265)/1 luod=267'314 crt=267'314 lcod 267'313 mlcod 30'16 active+recovery_wait+undersized+degraded+remapped mbc={0={(1+0)=11},1={(1+0)=11},2={(1+0)=11},3={(1+0)=11},4={(0+0)=5,(0+1)=6},5={(1+0)=6,(1+1)=5}} trimq=[ab~1,b2~1,b6~1,bc~1,bf~1,c6~1,ca~1,cf~2,d3~1,d5~1,d7~1,da~1] ps=[7c~1,88~1,8c~3,92~1,95~1,a4~1,a8~1,ac~3,b0~2,ba~2,c2~1,c7~1]] recovery delayed on 3:d7b31dab:::smithi04040197-27:e1; could not get rw_manager lock
2022-02-01T03:54:21.835+0000 7f5bc5206700 20 osd.0 pg_epoch: 278 pg[3.2bs0( v 271'316 (0'0,271'316] local-lis/les=265/266 n=11 ec=152/25 lis/c=265/162 les/c/f=266/163/0 sis=265) [0,7,1,10,5,8]/[0,7,1,10,NONE,8]p0(0) async=[5(4)] r=0 lpr=265 pi=[162,265)/1 crt=271'316 lcod 267'314 mlcod 30'16 active+recovering+undersized+degraded+remapped mbc={0={(1+0)=3},1={(1+0)=3},2={(1+0)=3},3={(1+0)=3},4={(0+0)=3},5={(1+0)=3}} trimq=[ab~1,b2~1,b6~1,bc~1,bf~1,c6~1,ca~1,cf~2,d3~1,d5~1,d7~1,da~1,e0~2] ps=[7c~1,88~1,8c~3,92~1,95~1,a4~1,a8~1,ac~3,b0~2,ba~2,c2~1,c7~1]] recovery delayed on 3:d694dacf:::smithi04040197-25:head; could not get rw_manager lock
2022-02-01T03:54:21.921+0000 7f5bc5206700 20 osd.0 pg_epoch: 278 pg[3.2bs0( v 278'318 (0'0,278'318] local-lis/les=265/266 n=11 ec=152/25 lis/c=265/162 les/c/f=266/163/0 sis=265) [0,7,1,10,5,8]/[0,7,1,10,NONE,8]p0(0) async=[5(4)] r=0 lpr=265 pi=[162,265)/1 crt=278'318 lcod 278'317 mlcod 267'313 active+recovering+undersized+degraded+remapped mbc={0={(1+0)=2},1={(1+0)=2},2={(1+0)=2},3={(1+0)=2},4={(0+0)=2},5={(1+0)=2}} trimq=[ab~1,b2~1,b6~1,bc~1,bf~1,c6~1,ca~1,cf~2,d3~1,d5~1,d7~1,da~1,e0~2] ps=[7c~1,88~1,8c~3,92~1,95~1,a4~1,a8~1,ac~3,b0~2,ba~2,c2~1,c7~1]] recovery delayed on 3:d694dacf:::smithi04040197-25:head; could not get rw_manager lock
2022-02-01T03:54:22.007+0000 7f5bc5206700 20 osd.0 pg_epoch: 278 pg[3.2bs0( v 278'318 (0'0,278'318] local-lis/les=265/266 n=11 ec=152/25 lis/c=265/162 les/c/f=266/163/0 sis=265) [0,7,1,10,5,8]/[0,7,1,10,NONE,8]p0(0) async=[5(4)] r=0 lpr=265 pi=[162,265)/1 crt=278'318 lcod 278'317 mlcod 267'313 active+recovering+undersized+degraded+remapped mbc={0={(1+0)=1},1={(1+0)=1},2={(1+0)=1},3={(1+0)=1},4={(0+0)=1},5={(1+0)=1}} trimq=[ab~1,b2~1,b6~1,bc~1,bf~1,c6~1,ca~1,cf~2,d3~1,d5~1,d7~1,da~1,e0~2] ps=[7c~1,88~1,8c~3,92~1,95~1,a4~1,a8~1,ac~3,b0~2,ba~2,c2~1,c7~1]] recovery delayed on 3:d694dacf:::smithi04040197-25:head; could not get rw_manager lock
2022-02-01T03:54:22.435+0000 7f5bc11fe700 20 osd.0 pg_epoch: 279 pg[3.2bs0( v 278'318 (0'0,278'318] local-lis/les=265/266 n=11 ec=152/25 lis/c=265/162 les/c/f=266/163/0 sis=265) [0,7,1,10,5,8]/[0,7,1,10,NONE,8]p0(0) async=[5(4)] r=0 lpr=265 pi=[162,265)/1 crt=278'318 lcod 278'317 mlcod 267'313 active+recovering+undersized+degraded+remapped mbc={0={(1+0)=1},1={(1+0)=1},2={(1+0)=1},3={(1+0)=1},4={(0+0)=1},5={(1+0)=1}} trimq=[ab~1,b2~1,b6~1,bc~1,bf~1,c6~1,ca~1,cf~2,d3~1,d5~1,d7~1,da~1,e0~2,e6~1] ps=[7c~1,88~1,8c~3,92~1,95~1,a4~1,a8~1,ac~3,b0~2,ba~2,c2~1,c7~1]] recovery delayed on 3:d694dacf:::smithi04040197-25:head; could not get rw_manager lock
2022-02-01T03:54:23.419+0000 7f5bc11fe700 20 osd.0 pg_epoch: 280 pg[3.2bs0( v 278'318 (0'0,278'318] local-lis/les=265/266 n=11 ec=152/25 lis/c=265/162 les/c/f=266/163/0 sis=265) [0,7,1,10,5,8]/[0,7,1,10,NONE,8]p0(0) async=[5(4)] r=0 lpr=265 pi=[162,265)/1 crt=278'318 lcod 278'317 mlcod 267'313 active+recovering+undersized+degraded+remapped mbc={0={(1+0)=1},1={(1+0)=1},2={(1+0)=1},3={(1+0)=1},4={(0+0)=1},5={(1+0)=1}} trimq=[ab~1,b2~1,b6~1,bc~1,bf~1,c6~1,ca~1,cf~2,d3~1,d5~1,d7~1,da~1,e0~2,e6~1] ps=[7c~1,88~1,8c~3,92~1,95~1,a4~1,a8~1,ac~3,b0~2,ba~2,c2~1,c7~1]] recovery delayed on 3:d694dacf:::smithi04040197-25:head; could not get rw_manager lock
2022-02-01T03:54:24.360+0000 7f5bc11fe700 20 osd.0 pg_epoch: 281 pg[3.2bs0( v 280'322 (0'0,280'322] local-lis/les=265/266 n=12 ec=152/25 lis/c=265/162 les/c/f=266/163/0 sis=265) [0,7,1,10,5,8]/[0,7,1,10,NONE,8]p0(0) async=[5(4)] r=0 lpr=265 pi=[162,265)/1 crt=280'322 lcod 280'321 mlcod 267'313 active+recovering+undersized+degraded+remapped mbc={0={(1+0)=1},1={(1+0)=1},2={(1+0)=1},3={(1+0)=1},4={(0+0)=1},5={(1+0)=1}} trimq=[ab~1,b2~1,b6~1,bc~1,bf~1,c6~1,ca~1,cf~2,d3~1,d5~1,d7~1,da~1,e0~2,e6~1] ps=[88~1,8d~1,ac~1,b0~1,ba~2,c2~1,c7~1]] recovery delayed on 3:d694dacf:::smithi04040197-25:head; could not get rw_manager lock
2022-02-01T03:54:25.378+0000 7f5bc11fe700 20 osd.0 pg_epoch: 282 pg[3.2bs0( v 280'322 (0'0,280'322] local-lis/les=265/266 n=12 ec=152/25 lis/c=265/162 les/c/f=266/163/0 sis=265) [0,7,1,10,5,8]/[0,7,1,10,NONE,8]p0(0) async=[5(4)] r=0 lpr=265 pi=[162,265)/1 crt=280'322 lcod 280'321 mlcod 267'313 active+recovering+undersized+degraded+remapped mbc={0={(1+0)=1},1={(1+0)=1},2={(1+0)=1},3={(1+0)=1},4={(0+0)=1},5={(1+0)=1}} trimq=[ab~1,b2~1,b6~1,bc~1,bf~1,c6~1,ca~1,cf~2,d3~1,d5~1,d7~1,da~1,e0~2,e6~1] ps=[88~1,8d~1,ac~1,b0~1,ba~2,c2~1,c7~1]] recovery delayed on 3:d694dacf:::smithi04040197-25:head; could not get rw_manager lock
2022-02-01T03:54:26.391+0000 7f5bc5206700 20 osd.0 pg_epoch: 283 pg[3.2bs0( v 280'322 (0'0,280'322] local-lis/les=265/266 n=12 ec=152/25 lis/c=265/162 les/c/f=266/163/0 sis=265) [0,7,1,10,5,8]/[0,7,1,10,NONE,8]p0(0) async=[5(4)] r=0 lpr=265 pi=[162,265)/1 crt=280'322 lcod 280'321 mlcod 267'313 active+recovering+undersized+degraded+remapped mbc={0={(1+0)=1},1={(1+0)=1},2={(1+0)=1},3={(1+0)=1},4={(0+0)=1},5={(1+0)=1}} trimq=[ab~1,b2~1,b6~1,bc~1,bf~1,c6~1,ca~1,cf~2,d3~1,d5~1,d7~1,da~1,e0~2,e6~2] ps=[88~1,8d~1,ac~1,b0~1,ba~2,c2~1,c7~1]] recovery delayed on 3:d694dacf:::smithi04040197-25:head; could not get rw_manager lock
2022-02-01T03:54:27.385+0000 7f5bc11fe700 20 osd.0 pg_epoch: 284 pg[3.2bs0( v 280'322 (0'0,280'322] local-lis/les=265/266 n=12 ec=152/25 lis/c=265/162 les/c/f=266/163/0 sis=265) [0,7,1,10,5,8]/[0,7,1,10,NONE,8]p0(0) async=[5(4)] r=0 lpr=265 pi=[162,265)/1 crt=280'322 lcod 280'321 mlcod 267'313 active+recovering+undersized+degraded+remapped mbc={0={(1+0)=1},1={(1+0)=1},2={(1+0)=1},3={(1+0)=1},4={(0+0)=1},5={(1+0)=1}} trimq=[ab~1,b2~1,b6~1,bc~1,bf~1,c6~1,ca~1,cf~2,d3~1,d5~1,d7~1,da~1,e0~2,e6~2] ps=[88~1,8d~1,ac~1,b0~1,ba~2,c2~1,c7~1]] recovery delayed on 3:d694dacf:::smithi04040197-25:head; could not get rw_manager lock
2022-02-01T03:54:28.379+0000 7f5bc5206700 20 osd.0 pg_epoch: 285 pg[3.2bs0( v 284'326 (0'0,284'326] local-lis/les=265/266 n=13 ec=152/25 lis/c=265/162 les/c/f=266/163/0 sis=265) [0,7,1,10,5,8]/[0,7,1,10,NONE,8]p0(0) async=[5(4)] r=0 lpr=265 pi=[162,265)/1 crt=284'326 lcod 284'325 mlcod 267'313 active+recovering+undersized+degraded+remapped mbc={0={(1+0)=1},1={(1+0)=1},2={(1+0)=1},3={(1+0)=1},4={(0+0)=1},5={(1+0)=1}} trimq=[ab~1,b2~1,b6~1,bc~1,bf~1,c6~1,ca~1,cf~2,d3~1,d5~1,d7~1,da~1,e0~2,e6~2,eb~1] ps=[88~1,8d~1,ac~1,b0~1,ba~2,c2~1,c7~1]] recovery delayed on 3:d694dacf:::smithi04040197-25:head; could not get rw_manager lock
2022-02-01T03:54:29.378+0000 7f5bc11fe700 20 osd.0 pg_epoch: 286 pg[3.2bs0( v 284'326 (0'0,284'326] local-lis/les=265/266 n=13 ec=152/25 lis/c=265/162 les/c/f=266/163/0 sis=265) [0,7,1,10,5,8]/[0,7,1,10,NONE,8]p0(0) async=[5(4)] r=0 lpr=265 pi=[162,265)/1 crt=284'326 lcod 284'325 mlcod 267'313 active+recovering+undersized+degraded+remapped mbc={0={(1+0)=1},1={(1+0)=1},2={(1+0)=1},3={(1+0)=1},4={(0+0)=1},5={(1+0)=1}} trimq=[ab~1,b2~1,b6~1,bc~1,bf~1,c6~1,ca~1,cf~2,d3~1,d5~1,d7~1,da~1,e0~2,e6~2,eb~1] ps=[88~1,8d~1,ac~1,b0~1,ba~2,c2~1,c7~1]] recovery delayed on 3:d694dacf:::smithi04040197-25:head; could not get rw_manager lock
2022-02-01T03:54:30.380+0000 7f5bc5206700 20 osd.0 pg_epoch: 287 pg[3.2bs0( v 284'326 (0'0,284'326] local-lis/les=265/266 n=13 ec=152/25 lis/c=265/162 les/c/f=266/163/0 sis=265) [0,7,1,10,5,8]/[0,7,1,10,NONE,8]p0(0) async=[5(4)] r=0 lpr=265 pi=[162,265)/1 crt=284'326 lcod 284'325 mlcod 267'313 active+recovering+undersized+degraded+remapped mbc={0={(1+0)=1},1={(1+0)=1},2={(1+0)=1},3={(1+0)=1},4={(0+0)=1},5={(1+0)=1}} trimq=[ab~1,b2~1,b6~1,bc~1,bf~1,c6~1,ca~1,cf~2,d3~1,d5~1,d7~1,da~1,e0~2,e6~2,eb~1] ps=[88~1,8d~1,ac~1,b0~1,ba~2,c2~1,c7~1]] recovery delayed on 3:d694dacf:::smithi04040197-25:head; could not get rw_manager lock
2022-02-01T03:54:31.462+0000 7f5bc11fe700 20 osd.0 pg_epoch: 288 pg[3.2bs0( v 284'326 (0'0,284'326] local-lis/les=265/266 n=13 ec=152/25 lis/c=265/162 les/c/f=266/163/0 sis=265) [0,7,1,10,5,8]/[0,7,1,10,NONE,8]p0(0) async=[5(4)] r=0 lpr=265 pi=[162,265)/1 crt=284'326 lcod 284'325 mlcod 267'313 active+recovering+undersized+degraded+remapped mbc={0={(1+0)=1},1={(1+0)=1},2={(1+0)=1},3={(1+0)=1},4={(0+0)=1},5={(1+0)=1}} trimq=[ab~1,b2~1,b6~1,bc~1,bf~1,c6~1,ca~1,cf~2,d3~1,d5~1,d7~1,da~1,e0~2,e6~2,eb~1] ps=[88~1,8d~1,ac~1,b0~1,ba~2,c2~1,c7~1]] recovery delayed on 3:d694dacf:::smithi04040197-25:head; could not get rw_manager lock
2022-02-01T03:54:32.451+0000 7f5bc11fe700 20 osd.0 pg_epoch: 289 pg[3.2bs0( v 284'326 (0'0,284'326] local-lis/les=265/266 n=13 ec=152/25 lis/c=265/162 les/c/f=266/163/0 sis=265) [0,7,1,10,5,8]/[0,7,1,10,NONE,8]p0(0) async=[5(4)] r=0 lpr=265 pi=[162,265)/1 crt=284'326 lcod 284'325 mlcod 267'313 active+recovering+undersized+degraded+remapped mbc={0={(1+0)=1},1={(1+0)=1},2={(1+0)=1},3={(1+0)=1},4={(0+0)=1},5={(1+0)=1}} trimq=[ab~1,b2~1,b6~1,bc~1,bf~1,c6~1,ca~1,cf~2,d3~1,d5~1,d7~1,da~1,e0~2,e6~2,eb~1] ps=[88~1,8d~1,ac~1,b0~1,ba~2,c2~1,c7~1]] recovery delayed on 3:d694dacf:::smithi04040197-25:head; could not get rw_manager lock
2022-02-01T03:54:33.465+0000 7f5bc11fe700 20 osd.0 pg_epoch: 290 pg[3.2bs0( v 289'327 (0'0,289'327] local-lis/les=265/266 n=13 ec=152/25 lis/c=265/162 les/c/f=266/163/0 sis=265) [0,7,1,10,5,8]/[0,7,1,10,NONE,8]p0(0) async=[5(4)] r=0 lpr=265 pi=[162,265)/1 crt=289'327 lcod 284'326 mlcod 267'313 active+recovering+undersized+degraded+remapped mbc={0={(1+0)=1},1={(1+0)=1},2={(1+0)=1},3={(1+0)=1},4={(0+0)=1},5={(1+0)=1}} trimq=[ab~1,b2~1,b6~1,bc~1,bf~1,c6~1,ca~1,cf~2,d3~1,d5~1,d7~1,da~1,e0~2,e6~2,eb~1] ps=[88~1,8d~1,ac~1,b0~1,ba~2,c2~1,c7~1]] recovery delayed on 3:d694dacf:::smithi04040197-25:head; could not get rw_manager lock
2022-02-01T03:54:40.520+0000 7f5bbf1fa700 20 osd.0 pg_epoch: 297 pg[3.17s0( v 297'312 (0'0,297'312] local-lis/les=292/293 n=6 ec=31/25 lis/c=292/242 les/c/f=293/243/0 sis=292) [0,5,9,4,8,10]/[0,5,9,NONE,8,10]p0(0) async=[4(3)] r=0 lpr=292 pi=[242,292)/1 luod=262'310 crt=262'310 lcod 258'308 mlcod 0'0 active+recovery_wait+undersized+degraded+remapped mbc={0={(1+0)=5},1={(1+0)=5},2={(1+0)=5},3={(0+0)=2,(0+1)=3},4={(1+0)=5},5={(1+0)=5}} trimq=[d3~1,d5~1,d7~1,da~1,e0~2,e6~2,eb~1,f3~2] ps=[88~1,8d~1,ab~2,b0~1,b2~1,b6~1,ba~3,bf~1,c2~1,c6~2,ca~1,cf~2]] recovery delayed on 3:eb5b89e8:::smithi04040197-32:f8; could not get rw_manager lock
2022-02-01T03:54:40.520+0000 7f5bc3202700 20 osd.0 pg_epoch: 297 pg[3.17s0( v 297'312 (0'0,297'312] local-lis/les=292/293 n=6 ec=31/25 lis/c=292/242 les/c/f=293/243/0 sis=292) [0,5,9,4,8,10]/[0,5,9,NONE,8,10]p0(0) async=[4(3)] r=0 lpr=292 pi=[242,292)/1 luod=262'310 crt=262'310 lcod 258'308 mlcod 0'0 active+recovery_wait+undersized+degraded+remapped mbc={0={(1+0)=5},1={(1+0)=5},2={(1+0)=5},3={(0+0)=2,(0+1)=3},4={(1+0)=5},5={(1+0)=5}} trimq=[d3~1,d5~1,d7~1,da~1,e0~2,e6~2,eb~1,f3~2] ps=[88~1,8d~1,ab~2,b0~1,b2~1,b6~1,ba~3,bf~1,c2~1,c6~2,ca~1,cf~2]] recovery delayed on 3:eb5b89e8:::smithi04040197-32:f8; could not get rw_manager lock
2022-02-01T03:54:43.166+0000 7f5bbf1fa700 20 osd.0 pg_epoch: 299 pg[3.17s0( v 299'314 (0'0,299'314] local-lis/les=292/293 n=6 ec=31/25 lis/c=292/242 les/c/f=293/243/0 sis=292) [0,5,9,4,8,10]/[0,5,9,NONE,8,10]p0(0) async=[4(3)] r=0 lpr=292 pi=[242,292)/1 luod=297'312 crt=297'312 lcod 262'310 mlcod 35'26 active+recovering+undersized+degraded+remapped mbc={0={(1+0)=1},1={(1+0)=1},2={(1+0)=1},3={(0+0)=1},4={(1+0)=1},5={(1+0)=1}} trimq=[d3~1,d5~1,d7~1,da~1,dc~1,e0~2,e6~2,eb~1,f3~2] ps=[88~1,8d~1,ab~2,b0~1,b2~1,b6~1,ba~3,bf~1,c2~1,c6~2,ca~1,cf~2]] recovery delayed on 3:eb5b89e8:::smithi04040197-32:head; could not get rw_manager lock

#17 Updated by Laura Flores about 2 years ago

/a/yuriw-2022-02-11_22:59:19-rados-wip-yuri4-testing-2022-02-11-0858-distro-default-smithi/6677733

The last pg map before the traceback looks a little different here. It looks like it timed out waiting for snaptrim.

{
  "pgs_by_state": [
    {
      "state_name": "active+clean",
      "count": 268
    },
    {
      "state_name": "active+clean+snaptrim_wait",
      "count": 222
    },
    {
      "state_name": "activating",
      "count": 7
    },
    {
      "state_name": "active",
      "count": 5
    },
    {
      "state_name": "active+clean+snaptrim",
      "count": 2
    },
    {
      "state_name": "recovering+undersized+degraded+remapped+peered",
      "count": 1
    }
  ],
  "num_pgs": 505,
  "num_pools": 3,
  "num_objects": 210,
  "data_bytes": 482439712,
  "bytes_used": 7917903872,
  "bytes_avail": 759667417088,
  "bytes_total": 767585320960,
  "inactive_pgs_ratio": 0.015841584652662277,
  "degraded_objects": 40,
  "degraded_total": 1252,
  "degraded_ratio": 0.03194888178913738,
  "misplaced_objects": 4,
  "misplaced_total": 1252,
  "misplaced_ratio": 0.003194888178913738
}

#18 Updated by Laura Flores about 2 years ago

/a/yuriw-2022-02-16_00:25:26-rados-wip-yuri-testing-2022-02-15-1431-distro-default-smithi/6687342

Same issue with rw_manager lock here:

2022-02-16T09:46:59.725+0000 7f0bebb7e700 20 osd.11 pg_epoch: 560 pg[3.2fs0( v 560'504 (0'0,560'504] local-lis/les=557/558 n=5 ec=269/26 lis/c=557/552 les/c/f=558/553/0 sis=557) [11,4,9,10,0,5]/[11,NONE,9,10,0,5]p11(0) async=[4(1)] r=0 lpr=557 pi=[552,557)/1 luod=555'502 crt=555'502 lcod 553'500 mlcod 0'0 active+recovery_wait+undersized+degraded+remapped mbc={0={(1+0)=5},1={(0+0)=2,(0+1)=3},2={(1+0)=5},3={(1+0)=5},4={(1+0)=5},5={(1+0)=5}} trimq=[17e~1] ps=117] recovery delayed on 3:f47c2e5a:::smithi05921110-39:1fa; could not get rw_manager lock
2022-02-16T09:46:59.729+0000 7f0befb86700 20 osd.11 pg_epoch: 560 pg[3.2fs0( v 560'504 (0'0,560'504] local-lis/les=557/558 n=5 ec=269/26 lis/c=557/552 les/c/f=558/553/0 sis=557) [11,4,9,10,0,5]/[11,NONE,9,10,0,5]p11(0) async=[4(1)] r=0 lpr=557 pi=[552,557)/1 luod=555'502 crt=555'502 lcod 553'500 mlcod 0'0 active+recovery_wait+undersized+degraded+remapped mbc={0={(1+0)=5},1={(0+0)=2,(0+1)=3},2={(1+0)=5},3={(1+0)=5},4={(1+0)=5},5={(1+0)=5}} trimq=[17e~1] ps=117] recovery delayed on 3:f47c2e5a:::smithi05921110-39:1fa; could not get rw_manager lock

#19 Updated by Laura Flores about 2 years ago

/a/yuriw-2022-02-17_22:49:55-rados-wip-yuri3-testing-2022-02-17-1256-distro-default-smithi/6692376

2022-02-18T09:47:34.143+0000 7fc63612b700 20 osd.7 pg_epoch: 745 pg[3.78s0( v 745'681 (0'0,745'681] local-lis/les=729/730 n=9 ec=545/26 lis/c=729/660 les/c/f=730/662/0 sis=729) [7,4,0,11,9,3]/[7,4,0,11,NONE,3]p7(0) async=[9(4)] r=0 lpr=729 pi=[660,729)/2 luod=679'679 crt=679'679 lcod 679'679 mlcod 0'0 active+recovery_wait+undersized+degraded+remapped+forced_backfill mbc={0={(1+0)=9},1={(1+0)=9},2={(1+0)=9},3={(1+0)=9},4={(0+0)=2,(0+1)=7},5={(1+0)=9}} trimq=[16d~1,174~1,194~1,1a2~1,1b7~1,1b9~1,1e9~1,1f8~1,20a~1,20e~1,210~1,215~1] ps=52] recovery delayed on 3:1f0b484e:::smithi01822269-34:220; could not get rw_manager lock
2022-02-18T09:47:34.243+0000 7fc63612b700 20 osd.7 pg_epoch: 745 pg[3.78s0( v 745'681 (0'0,745'681] local-lis/les=729/730 n=9 ec=545/26 lis/c=729/660 les/c/f=730/662/0 sis=729) [7,4,0,11,9,3]/[7,4,0,11,NONE,3]p7(0) async=[9(4)] r=0 lpr=729 pi=[660,729)/2 luod=679'679 crt=679'679 lcod 679'679 mlcod 0'0 active+recovery_wait+undersized+degraded+remapped+forced_backfill mbc={0={(1+0)=9},1={(1+0)=9},2={(1+0)=9},3={(1+0)=9},4={(0+0)=2,(0+1)=7},5={(1+0)=9}} trimq=[16d~1,174~1,194~1,1a2~1,1b7~1,1b9~1,1e9~1,1f8~1,20a~1,20e~1,210~1,215~1] ps=52] recovery delayed on 3:1f0b484e:::smithi01822269-34:220; could not get rw_manager lock
2022-02-18T09:43:35.363+0000 7fd96955b700 20 osd.0 pg_epoch: 529 pg[3.6s0( v 529'462 (0'0,529'462] local-lis/les=526/527 n=20 ec=26/26 lis/c=526/461 les/c/f=527/462/0 sis=526) [0,1,5,8,11,4]/[0,1,NONE,8,11,4]p0(0) async=[5(2)] r=0 lpr=526 pi=[461,526)/1 luod=520'459 crt=520'459 lcod 519'457 mlcod 0'0 active+recovering+undersized+remapped rops=6 mbc={0={(1+0)=8},1={(1+0)=8},2={(0+0)=2,(0+1)=6},3={(1+0)=8},4={(1+0)=8},5={(1+0)=8}} trimq=[13d~1,192~1,1a9~1] ps=[12c~1,132~1,15d~1,16a~1,17d~1]] recovery delayed on 3:60e7a2b9:::smithi01822269-15:1ad; could not get rw_manager lock
2022-02-18T09:43:35.363+0000 7fd96955b700 20 osd.0 pg_epoch: 529 pg[3.6s0( v 529'462 (0'0,529'462] local-lis/les=526/527 n=20 ec=26/26 lis/c=526/461 les/c/f=527/462/0 sis=526) [0,1,5,8,11,4]/[0,1,NONE,8,11,4]p0(0) async=[5(2)] r=0 lpr=526 pi=[461,526)/1 luod=520'459 crt=520'459 lcod 519'457 mlcod 0'0 active+recovering+undersized+remapped rops=6 mbc={0={(1+0)=8},1={(1+0)=8},2={(0+0)=2,(0+1)=6},3={(1+0)=8},4={(1+0)=8},5={(1+0)=8}} trimq=[13d~1,192~1,1a9~1] ps=[12c~1,132~1,15d~1,16a~1,17d~1]] recovery delayed on 3:60e7a2b9:::smithi01822269-15:head; could not get rw_manager lock
2022-02-18T09:43:35.367+0000 7fd96955b700 20 osd.0 pg_epoch: 529 pg[3.6s0( v 529'462 (0'0,529'462] local-lis/les=526/527 n=20 ec=26/26 lis/c=526/461 les/c/f=527/462/0 sis=526) [0,1,5,8,11,4]/[0,1,NONE,8,11,4]p0(0) async=[5(2)] r=0 lpr=526 pi=[461,526)/1 luod=520'459 crt=520'459 lcod 519'457 mlcod 0'0 active+recovering+undersized+degraded+remapped rops=5 mbc={0={(1+0)=7},1={(1+0)=7},2={(0+0)=2,(0+1)=5},3={(1+0)=7},4={(1+0)=7},5={(1+0)=7}} trimq=[13d~1,192~1,1a9~1] ps=[12c~1,132~1,15d~1,16a~1,17d~1]] recovery delayed on 3:60e7a2b9:::smithi01822269-15:1ad; could not get rw_manager lock
2022-02-18T09:43:35.367+0000 7fd96955b700 20 osd.0 pg_epoch: 529 pg[3.6s0( v 529'462 (0'0,529'462] local-lis/les=526/527 n=20 ec=26/26 lis/c=526/461 les/c/f=527/462/0 sis=526) [0,1,5,8,11,4]/[0,1,NONE,8,11,4]p0(0) async=[5(2)] r=0 lpr=526 pi=[461,526)/1 luod=520'459 crt=520'459 lcod 519'457 mlcod 0'0 active+recovering+undersized+degraded+remapped rops=5 mbc={0={(1+0)=7},1={(1+0)=7},2={(0+0)=2,(0+1)=5},3={(1+0)=7},4={(1+0)=7},5={(1+0)=7}} trimq=[13d~1,192~1,1a9~1] ps=[12c~1,132~1,15d~1,16a~1,17d~1]] recovery delayed on 3:60e7a2b9:::smithi01822269-15:head; could not get rw_manager lock
2022-02-18T09:43:35.471+0000 7fd96d563700 20 osd.0 pg_epoch: 529 pg[3.6s0( v 529'462 (0'0,529'462] local-lis/les=526/527 n=20 ec=26/26 lis/c=526/461 les/c/f=527/462/0 sis=526) [0,1,5,8,11,4]/[0,1,NONE,8,11,4]p0(0) async=[5(2)] r=0 lpr=526 pi=[461,526)/1 luod=529'461 crt=520'459 lcod 520'459 mlcod 493'418 active+recovering+undersized+degraded+remapped rops=6 mbc={0={(1+0)=7},1={(1+0)=7},2={(0+0)=2,(0+1)=5},3={(1+0)=7},4={(1+0)=7},5={(1+0)=7}} trimq=[13d~1,192~1,1a9~1] ps=[12c~1,132~1,15d~1,16a~1,17d~1]] recovery delayed on 3:60e7a2b9:::smithi01822269-15:head; could not get rw_manager lock
2022-02-18T09:51:07.411+0000 7fd96c561700 20 osd.0 pg_epoch: 942 pg[3.68s0( v 942'372 (0'0,942'372] local-lis/les=908/909 n=11 ec=426/26 lis/c=908/726 les/c/f=909/727/0 sis=908) [0,10,7,2,8,4]/[0,NONE,7,2,8,4]p0(0) async=[10(1)] r=0 lpr=908 pi=[726,908)/1 luod=905'370 crt=905'370 lcod 905'370 mlcod 0'0 active+recovery_wait+undersized+degraded+remapped mbc={0={(1+0)=12},1={(0+0)=2,(0+1)=10},2={(1+0)=12},3={(1+0)=12},4={(1+0)=12},5={(1+0)=12}} trimq=21 ps=30] recovery delayed on 3:1618fca3:::smithi01822269-21:2ce; could not get rw_manager lock
2022-02-18T09:51:07.507+0000 7fd96c561700 20 osd.0 pg_epoch: 942 pg[3.68s0( v 942'372 (0'0,942'372] local-lis/les=908/909 n=11 ec=426/26 lis/c=908/726 les/c/f=909/727/0 sis=908) [0,10,7,2,8,4]/[0,NONE,7,2,8,4]p0(0) async=[10(1)] r=0 lpr=908 pi=[726,908)/1 luod=905'370 crt=905'370 lcod 905'370 mlcod 0'0 active+recovery_wait+undersized+degraded+remapped mbc={0={(1+0)=12},1={(0+0)=2,(0+1)=10},2={(1+0)=12},3={(1+0)=12},4={(1+0)=12},5={(1+0)=12}} trimq=21 ps=30] recovery delayed on 3:1618fca3:::smithi01822269-21:2ce; could not get rw_manager lock
2022-02-18T09:51:46.890+0000 7fd96d563700 20 osd.0 pg_epoch: 979 pg[3.86s0( v 979'563 (0'0,979'563] local-lis/les=908/909 n=7 ec=571/26 lis/c=908/771 les/c/f=909/772/0 sis=908) [0,5,9,6,7,10]/[0,5,9,6,7,NONE]p0(0) async=[10(5)] r=0 lpr=908 pi=[771,908)/1 luod=905'561 crt=905'561 lcod 905'561 mlcod 0'0 active+recovery_wait+undersized+degraded+remapped mbc={0={(1+0)=7},1={(1+0)=7},2={(1+0)=7},3={(1+0)=7},4={(1+0)=7},5={(0+0)=2,(0+1)=5}} trimq=32 ps=30] recovery delayed on 3:61ac2e26:::smithi01822269-14:2f7; could not get rw_manager lock
2022-02-18T09:51:46.890+0000 7fd96955b700 20 osd.0 pg_epoch: 979 pg[3.86s0( v 979'563 (0'0,979'563] local-lis/les=908/909 n=7 ec=571/26 lis/c=908/771 les/c/f=909/772/0 sis=908) [0,5,9,6,7,10]/[0,5,9,6,7,NONE]p0(0) async=[10(5)] r=0 lpr=908 pi=[771,908)/1 luod=905'561 crt=905'561 lcod 905'561 mlcod 0'0 active+recovery_wait+undersized+degraded+remapped mbc={0={(1+0)=7},1={(1+0)=7},2={(1+0)=7},3={(1+0)=7},4={(1+0)=7},5={(0+0)=2,(0+1)=5}} trimq=32 ps=30] recovery delayed on 3:61ac2e26:::smithi01822269-14:2f7; could not get rw_manager lock
2022-02-18T09:51:53.690+0000 7fd96d563700 20 osd.0 pg_epoch: 985 pg[3.6s0( v 985'691 (0'0,985'691] local-lis/les=908/909 n=14 ec=26/26 lis/c=908/774 les/c/f=909/775/0 sis=908) [0,10,9,8,11,4]/[0,NONE,9,8,11,4]p0(0) async=[10(1)] r=0 lpr=908 pi=[774,908)/1 luod=977'689 crt=977'689 lcod 976'688 mlcod 38'10 active+recovery_wait+undersized+degraded+remapped mbc={0={(1+0)=14},1={(0+0)=5,(0+1)=9},2={(1+0)=14},3={(1+0)=14},4={(1+0)=14},5={(1+0)=14}} trimq=35 ps=30] recovery delayed on 3:60e7a2b9:::smithi01822269-15:2fd; could not get rw_manager lock
2022-02-18T09:51:53.690+0000 7fd96955b700 20 osd.0 pg_epoch: 985 pg[3.6s0( v 985'691 (0'0,985'691] local-lis/les=908/909 n=14 ec=26/26 lis/c=908/774 les/c/f=909/775/0 sis=908) [0,10,9,8,11,4]/[0,NONE,9,8,11,4]p0(0) async=[10(1)] r=0 lpr=908 pi=[774,908)/1 luod=977'689 crt=977'689 lcod 976'688 mlcod 38'10 active+recovery_wait+undersized+degraded+remapped mbc={0={(1+0)=14},1={(0+0)=5,(0+1)=9},2={(1+0)=14},3={(1+0)=14},4={(1+0)=14},5={(1+0)=14}} trimq=35 ps=30] recovery delayed on 3:60e7a2b9:::smithi01822269-15:2fd; could not get rw_manager lock

#20 Updated by Laura Flores about 2 years ago

/a/yuriw-2022-02-15_16:22:25-rados-wip-yuri6-testing-2022-02-14-1456-distro-default-smithi/6685233

2022-02-15T21:49:41.379+0000 7f9e78566700 20 osd.11 pg_epoch: 113 pg[3.1fs1( v 113'25 (0'0,113'25] local-lis/les=112/113 n=3 ec=47/26 lis/c=112/59 les/c/f=113/60/0 sis=112) [0,11,10,5,2,3]/[NONE,11,10,5,2,3]p11(1) async=[0(0)] r=1 lpr=112 pi=[59,112)/1 luod=91'23 crt=91'23 lcod 91'23 mlcod 0'0 active+recovery_wait+undersized+degraded+remapped mbc={0={(0+0)=2},1={(1+0)=2},2={(1+0)=2},3={(1+0)=2},4={(1+0)=2},5={(1+0)=2}} trimq=[2f~1] ps=[2~2,6~2,9~6,11~1,14~2,17~1,1a~1,1d~1,21~1,23~2,29~1,2c~1]] recovery delayed on 3:febdcb30:::smithi02521141-39:33; could not get rw_manager lock
2022-02-15T21:49:41.379+0000 7f9e78566700 20 osd.11 pg_epoch: 113 pg[3.1fs1( v 113'25 (0'0,113'25] local-lis/les=112/113 n=3 ec=47/26 lis/c=112/59 les/c/f=113/60/0 sis=112) [0,11,10,5,2,3]/[NONE,11,10,5,2,3]p11(1) async=[0(0)] r=1 lpr=112 pi=[59,112)/1 luod=91'23 crt=91'23 lcod 91'23 mlcod 0'0 active+recovery_wait+undersized+degraded+remapped mbc={0={(0+0)=2},1={(1+0)=2},2={(1+0)=2},3={(1+0)=2},4={(1+0)=2},5={(1+0)=2}} trimq=[2f~1] ps=[2~2,6~2,9~6,11~1,14~2,17~1,1a~1,1d~1,21~1,23~2,29~1,2c~1]] recovery delayed on 3:febdcb30:::smithi02521141-39:33; could not get rw_manager lock
2022-02-15T21:49:44.727+0000 7f9e77564700 20 osd.11 pg_epoch: 116 pg[3.1s1( v 116'43 (0'0,116'43] local-lis/les=112/113 n=6 ec=26/26 lis/c=112/47 les/c/f=113/49/0 sis=112) [0,11,4,5,9,7]/[NONE,11,4,5,9,7]p11(1) async=[0(0)] r=1 lpr=112 pi=[47,112)/1 luod=104'41 crt=104'41 lcod 103'40 mlcod 0'0 active+recovery_wait+undersized+degraded+remapped mbc={0={(0+0)=8},1={(1+0)=8},2={(1+0)=8},3={(1+0)=8},4={(1+0)=8},5={(1+0)=8}} trimq=[2~2,6~2,9~6,11~1,14~2,17~1,1a~1,1d~1,21~4,29~1,2c~1,2f~1,31~1]] recovery delayed on 3:83dc320a:::smithi02521141-31:36; could not get rw_manager lock
2022-02-15T21:49:44.831+0000 7f9e7b56c700 20 osd.11 pg_epoch: 116 pg[3.1s1( v 116'43 (0'0,116'43] local-lis/les=112/113 n=6 ec=26/26 lis/c=112/47 les/c/f=113/49/0 sis=112) [0,11,4,5,9,7]/[NONE,11,4,5,9,7]p11(1) async=[0(0)] r=1 lpr=112 pi=[47,112)/1 luod=104'41 crt=104'41 lcod 103'40 mlcod 0'0 active+recovery_wait+undersized+degraded+remapped mbc={0={(0+0)=8},1={(1+0)=8},2={(1+0)=8},3={(1+0)=8},4={(1+0)=8},5={(1+0)=8}} trimq=[2~2,6~2,9~6,11~1,14~2,17~1,1a~1,1d~1,21~4,29~1,2c~1,2f~1,31~1]] recovery delayed on 3:83dc320a:::smithi02521141-31:36; could not get rw_manager lock
2022-02-15T21:49:48.475+0000 7f9e78566700 20 osd.11 pg_epoch: 119 pg[3.17s1( v 119'57 (0'0,119'57] local-lis/les=112/113 n=12 ec=35/26 lis/c=112/48 les/c/f=113/49/0 sis=112) [0,11,5,3,7,2]/[NONE,11,5,3,7,2]p11(1) async=[0(0)] r=1 lpr=112 pi=[48,112)/1 crt=119'57 lcod 119'56 mlcod 103'52 active+recovering+forced_recovery+undersized+degraded+remapped rops=1 mbc={0={(0+0)=2},1={(1+0)=2},2={(1+0)=2},3={(1+0)=2},4={(1+0)=2},5={(1+0)=2}} trimq=[2~2,6~2,9~6,11~1,14~2,17~1,1a~1,1d~1,21~4,28~2,2c~1,2f~1,31~1]] recovery delayed on 3:e8eaab51:::smithi02521141-26:head; could not get rw_manager lock
2022-02-15T21:49:48.479+0000 7f9e7455e700 20 osd.11 pg_epoch: 120 pg[3.17s1( v 119'57 (0'0,119'57] local-lis/les=112/113 n=12 ec=35/26 lis/c=112/48 les/c/f=113/49/0 sis=112) [0,11,5,3,7,2]/[NONE,11,5,3,7,2]p11(1) async=[0(0)] r=1 lpr=112 pi=[48,112)/1 crt=119'57 lcod 119'56 mlcod 103'52 active+recovering+forced_recovery+undersized+degraded+remapped rops=1 mbc={0={(0+0)=2},1={(1+0)=2},2={(1+0)=2},3={(1+0)=2},4={(1+0)=2},5={(1+0)=2}} trimq=[2~2,6~2,9~6,11~1,14~2,17~1,1a~1,1d~1,21~4,28~2,2c~1,2f~1,31~1]] recovery delayed on 3:e8eaab51:::smithi02521141-26:head; could not get rw_manager lock
2022-02-15T21:58:45.215+0000 7f9e7bd6d700 20 osd.11 pg_epoch: 623 pg[3.30s0( v 623'119 (0'0,623'119] local-lis/les=345/346 n=8 ec=128/26 lis/c=345/268 les/c/f=346/269/0 sis=345) [11,9,0,7,8,1]/[11,NONE,0,7,8,1]p11(0) async=[9(1)] r=0 lpr=345 pi=[268,345)/1 luod=593'117 crt=593'117 lcod 593'116 mlcod 244'49 active+recovery_wait+undersized+degraded+remapped mbc={0={(1+0)=7},1={(0+0)=4,(0+1)=3},2={(1+0)=7},3={(1+0)=7},4={(1+0)=7},5={(1+0)=7}} trimq=101 ps=41] recovery delayed on 3:0ca55696:::smithi02521141-14:18e; could not get rw_manager lock
2022-02-15T21:58:45.215+0000 7f9e77d65700 20 osd.11 pg_epoch: 623 pg[3.30s0( v 623'119 (0'0,623'119] local-lis/les=345/346 n=8 ec=128/26 lis/c=345/268 les/c/f=346/269/0 sis=345) [11,9,0,7,8,1]/[11,NONE,0,7,8,1]p11(0) async=[9(1)] r=0 lpr=345 pi=[268,345)/1 luod=593'117 crt=593'117 lcod 593'116 mlcod 244'49 active+recovery_wait+undersized+degraded+remapped mbc={0={(1+0)=7},1={(0+0)=4,(0+1)=3},2={(1+0)=7},3={(1+0)=7},4={(1+0)=7},5={(1+0)=7}} trimq=101 ps=41] recovery delayed on 3:0ca55696:::smithi02521141-14:18e; could not get rw_manager lock

/a/yuriw-2022-02-16_19:58:00-rados-wip-yuri6-testing-2022-02-14-1456-distro-default-smithi/6689620

#21 Updated by Laura Flores about 2 years ago

I'm guessing that the problem involves pgs that are stuck in the `active+recovering+undersized+remapped` state (or `active+recovering+undersized+degraded+remapped` -- some variation containing "recovering"). With these pgs, I am finding a pattern of head objects that are in the middle of recovery but for which the rw_manager lock cannot be obtained, like the pattern Sridhar described. Thus, recovery is unable to make any more progress.

In these cases, other pgs tend to get stuck in "active+recovery_wait+undersized+remapped" (or another slight variation) and never enter the recovery process. This may be connected to the pg that is stuck recovering with an unobtainable rw_manager lock; see the toy sketch below.
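To make the hypothesis concrete, below is a toy, standalone sketch of one way this could play out, assuming "recovery_wait" means the pg is waiting for a bounded recovery slot. This is not the Ceph reservation code, and ToyRecoveryReserver and its members are hypothetical names used only for illustration: a pg that holds a slot but whose recovery never finishes keeps the slot forever, so the waiting pgs never start recovering.

#include <deque>
#include <iostream>
#include <string>

// Toy model of a bounded pool of recovery slots (NOT Ceph code).
struct ToyRecoveryReserver {
  unsigned max_active;               // how many pgs may recover at once
  unsigned active = 0;
  std::deque<std::string> waiting;   // pgs parked in "recovery_wait"

  explicit ToyRecoveryReserver(unsigned m) : max_active(m) {}

  void request(const std::string& pg) {
    if (active < max_active) {
      ++active;
      std::cout << pg << ": recovering\n";
    } else {
      waiting.push_back(pg);
      std::cout << pg << ": recovery_wait\n";
    }
  }

  // Called when a pg finishes recovery; hands its slot to the next waiter.
  void release() {
    if (!waiting.empty()) {
      std::cout << waiting.front() << ": recovering\n";
      waiting.pop_front();
    } else if (active > 0) {
      --active;
    }
  }
};

int main() {
  ToyRecoveryReserver reserver(1);
  reserver.request("3.13s0");  // takes the only slot, then stalls on the rw_manager lock
  reserver.request("3.20s0");  // parked in recovery_wait
  reserver.request("3.2bs0");  // parked in recovery_wait
  // Because 3.13s0 never completes, release() is never called and the other
  // pgs never leave recovery_wait -- matching the stuck pg maps above.
}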

#22 Updated by Radoslaw Zarzynski about 2 years ago

  • Duplicated by Bug #50842: pacific: recovery does not complete because of rw_manager lock not being released added

#23 Updated by Laura Flores about 2 years ago

More for my own reference, but it's clear that the rw_manager problem occurs here in the PrimaryLogPG code when prepping object replica pushes. We are unable to get the recovery read from the object context:

src/osd/PrimaryLogPG.cc

int PrimaryLogPG::prep_object_replica_pushes(
  const hobject_t& soid, eversion_t v,
  PGBackend::RecoveryHandle *h,
  bool *work_started)
{
  ceph_assert(is_primary());
  dout(10) << __func__ << ": on " << soid << dendl;

  if (soid.snap && soid.snap < CEPH_NOSNAP) {
    // do we have the head and/or snapdir?
    hobject_t head = soid.get_head();
    if (recovery_state.get_pg_log().get_missing().is_missing(head)) {
      if (recovering.count(head)) {
        dout(10) << " missing but already recovering head " << head << dendl;
        return 0;
      } else {
        int r = recover_missing(
            head, recovery_state.get_pg_log().get_missing().get_items().find(head)->second.need,
            get_recovery_op_priority(), h);
        if (r != PULL_NONE)
          return 1;
        return 0;
      }
    }
  }

  // NOTE: we know we will get a valid oloc off of disk here.
  ObjectContextRef obc = get_object_context(soid, false);
  if (!obc) {
    primary_error(soid, v);
    return 0;
  }

  if (!obc->get_recovery_read()) {
    dout(20) << "recovery delayed on " << soid
             << "; could not get rw_manager lock" << dendl;
    *work_started = true;
    return 0;
  } else {
    dout(20) << "recovery got recovery read lock on " << soid
             << dendl;
  }

  start_recovery_op(soid);
  ceph_assert(!recovering.count(soid));
  recovering.insert(make_pair(soid, obc));

  int r = pgbackend->recover_object(
    soid,
    v,
    ObjectContextRef(),
    obc, // has snapset context
    h);
  if (r < 0) {
    dout(0) << __func__ << " Error " << r << " on oid " << soid << dendl;
    on_failed_pull({ pg_whoami }, soid, v);
    return 0;
  }
  return 1;
}

#24 Updated by Aishwarya Mathuria about 2 years ago

/a/yuriw-2022-03-14_18:47:44-rados-wip-yuri3-testing-2022-03-14-0946-distro-default-smithi/6736449

#25 Updated by Sridhar Seshasayee about 2 years ago

/a/yuriw-2022-03-16_20:38:07-rados-wip-yuri3-testing-2022-03-16-1030-distro-default-smithi/6739311

#26 Updated by Aishwarya Mathuria almost 2 years ago

/a/yuriw-2022-03-29_21:35:01-rados-wip-yuri3-testing-2022-03-29-1133-distro-default-smithi/6767720

#27 Updated by Radoslaw Zarzynski almost 2 years ago

  • Status changed from New to In Progress

#28 Updated by Laura Flores almost 2 years ago

/a/yuriw-2022-04-01_01:23:52-rados-wip-yuri2-testing-2022-03-31-1523-pacific-distro-default-smithi/6771162

#30 Updated by Laura Flores almost 2 years ago

Looking further into this issue, it looks like the bug occurs whenever "prep_object_replica_pushes()" is called while the rwstate is in write mode (i.e. "rwstate(write n=1 w=0)"). When the rwstate is in write mode, the rw_manager lock cannot be retrieved (see "get_read_lock()" in object_state.h). There are other scenarios in which "prep_object_replica_pushes()" is called and the rwstate is "none" (i.e. "rwstate(none n=0 w=0)"); in those cases the rw_manager lock can be retrieved, and the pg exits recovery. A minimal sketch of this gating follows the examples below.

Unsuccessful example (rwstate is "write"):

2022-03-30T00:09:05.261+0000 7f652d5ec700 10 osd.9 pg_epoch: 212 pg[3.20s0( v 212'188 (0'0,212'188] local-lis/les=206/207 n=3 ec=60/26 lis/c=206/154 les/c/f=207/155/0 sis=206) [9,2,10,4,5,11]/[9,2,10,NONE,5,11]p9(0) async=[4(3)] r=0 lpr=206 pi=[154,206)/1 luod=212'187 crt=210'186 lcod 210'186 mlcod 210'185 active+recovering+undersized+degraded+remapped mbc={0={(1+0)=1},1={(1+0)=1},2={(1+0)=1},3={(0+0)=1},4={(1+0)=1},5={(1+0)=1}} trimq=[9e~2,a5~1,aa~1] ps=[a0~2,a3~1]] prep_object_replica_pushes: on 3:05fdc56b:::smithi03929708-26:head
2022-03-30T00:09:05.261+0000 7f652d5ec700 10 osd.9 pg_epoch: 212 pg[3.20s0( v 212'188 (0'0,212'188] local-lis/les=206/207 n=3 ec=60/26 lis/c=206/154 les/c/f=207/155/0 sis=206) [9,2,10,4,5,11]/[9,2,10,NONE,5,11]p9(0) async=[4(3)] r=0 lpr=206 pi=[154,206)/1 luod=212'187 crt=210'186 lcod 210'186 mlcod 210'185 active+recovering+undersized+degraded+remapped mbc={0={(1+0)=1},1={(1+0)=1},2={(1+0)=1},3={(0+0)=1},4={(1+0)=1},5={(1+0)=1}} trimq=[9e~2,a5~1,aa~1] ps=[a0~2,a3~1]] get_object_context: 0x555d9071edc0 3:05fdc56b:::smithi03929708-26:head rwstate(write n=1 w=0) oi: 3:05fdc56b:::smithi03929708-26:head(212'188 client.4551.0:2296 dirty|data_digest s 5521408 uv 256 dd a9c676 alloc_hint [0 0 0]) exists: 1 ssc: 0x555d8fa9c580 snapset: ac=[]:{9f=[9f,9e],ac=[ac]}
2022-03-30T00:09:05.261+0000 7f652d5ec700 20 osd.9 pg_epoch: 212 pg[3.20s0( v 212'188 (0'0,212'188] local-lis/les=206/207 n=3 ec=60/26 lis/c=206/154 les/c/f=207/155/0 sis=206) [9,2,10,4,5,11]/[9,2,10,NONE,5,11]p9(0) async=[4(3)] r=0 lpr=206 pi=[154,206)/1 luod=212'187 crt=210'186 lcod 210'186 mlcod 210'185 active+recovering+undersized+degraded+remapped mbc={0={(1+0)=1},1={(1+0)=1},2={(1+0)=1},3={(0+0)=1},4={(1+0)=1},5={(1+0)=1}} trimq=[9e~2,a5~1,aa~1] ps=[a0~2,a3~1]] recovery delayed on 3:05fdc56b:::smithi03929708-26:head; could not get rw_manager lock

Successful example (rwstate is "none"):

2022-03-30T00:08:06.663+0000 7f652d5ec700 10 osd.9 pg_epoch: 153 pg[3.20s0( v 152'113 (0'0,152'113] local-lis/les=150/151 n=10 ec=60/26 lis/c=150/65 les/c/f=151/66/0 sis=150) [9,2,10,7,5,11]/[9,2,10,NONE,5,11]p9(0) async=[7(3)] r=0 lpr=150 pi=[65,150)/1 crt=152'113 lcod 147'111 mlcod 37'24 active+recovering+undersized+degraded+remapped mbc={0={(1+0)=9},1={(1+0)=9},2={(1+0)=9},3={(0+0)=2,(0+1)=7},4={(1+0)=9},5={(1+0)=9}} trimq=[3b~1,3f~1,61~1,74~1] ps=[44~1,50~1,59~1,65~1,68~1,72~1]] prep_object_replica_pushes: on 3:05fdc56b:::smithi03929708-26:head
2022-03-30T00:08:06.663+0000 7f652d5ec700 10 osd.9 pg_epoch: 153 pg[3.20s0( v 152'113 (0'0,152'113] local-lis/les=150/151 n=10 ec=60/26 lis/c=150/65 les/c/f=151/66/0 sis=150) [9,2,10,7,5,11]/[9,2,10,NONE,5,11]p9(0) async=[7(3)] r=0 lpr=150 pi=[65,150)/1 crt=152'113 lcod 147'111 mlcod 37'24 active+recovering+undersized+degraded+remapped mbc={0={(1+0)=9},1={(1+0)=9},2={(1+0)=9},3={(0+0)=2,(0+1)=7},4={(1+0)=9},5={(1+0)=9}} trimq=[3b~1,3f~1,61~1,74~1] ps=[44~1,50~1,59~1,65~1,68~1,72~1]] get_object_context: obc NOT found in cache: 3:05fdc56b:::smithi03929708-26:head
2022-03-30T00:08:06.663+0000 7f652d5ec700 10 osd.9 pg_epoch: 153 pg[3.20s0( v 152'113 (0'0,152'113] local-lis/les=150/151 n=10 ec=60/26 lis/c=150/65 les/c/f=151/66/0 sis=150) [9,2,10,7,5,11]/[9,2,10,NONE,5,11]p9(0) async=[7(3)] r=0 lpr=150 pi=[65,150)/1 crt=152'113 lcod 147'111 mlcod 37'24 active+recovering+undersized+degraded+remapped mbc={0={(1+0)=9},1={(1+0)=9},2={(1+0)=9},3={(0+0)=2,(0+1)=7},4={(1+0)=9},5={(1+0)=9}} trimq=[3b~1,3f~1,61~1,74~1] ps=[44~1,50~1,59~1,65~1,68~1,72~1]] populate_obc_watchers 3:05fdc56b:::smithi03929708-26:head
2022-03-30T00:08:06.663+0000 7f652d5ec700 20 osd.9 pg_epoch: 153 pg[3.20s0( v 152'113 (0'0,152'113] local-lis/les=150/151 n=10 ec=60/26 lis/c=150/65 les/c/f=151/66/0 sis=150) [9,2,10,7,5,11]/[9,2,10,NONE,5,11]p9(0) async=[7(3)] r=0 lpr=150 pi=[65,150)/1 crt=152'113 lcod 147'111 mlcod 37'24 active+recovering+undersized+degraded+remapped mbc={0={(1+0)=9},1={(1+0)=9},2={(1+0)=9},3={(0+0)=2,(0+1)=7},4={(1+0)=9},5={(1+0)=9}} trimq=[3b~1,3f~1,61~1,74~1] ps=[44~1,50~1,59~1,65~1,68~1,72~1]] PrimaryLogPG::check_blocklisted_obc_watchers for obc 3:05fdc56b:::smithi03929708-26:head
2022-03-30T00:08:06.663+0000 7f652d5ec700 10 osd.9 pg_epoch: 153 pg[3.20s0( v 152'113 (0'0,152'113] local-lis/les=150/151 n=10 ec=60/26 lis/c=150/65 les/c/f=151/66/0 sis=150) [9,2,10,7,5,11]/[9,2,10,NONE,5,11]p9(0) async=[7(3)] r=0 lpr=150 pi=[65,150)/1 crt=152'113 lcod 147'111 mlcod 37'24 active+recovering+undersized+degraded+remapped mbc={0={(1+0)=9},1={(1+0)=9},2={(1+0)=9},3={(0+0)=2,(0+1)=7},4={(1+0)=9},5={(1+0)=9}} trimq=[3b~1,3f~1,61~1,74~1] ps=[44~1,50~1,59~1,65~1,68~1,72~1]] get_object_context: 0x555d802b7600 3:05fdc56b:::smithi03929708-26:head rwstate(none n=0 w=0) oi: 3:05fdc56b:::smithi03929708-26:head(123'83 client.4551.0:1229 dirty s 0 uv 83 alloc_hint [0 0 0]) exists: 1 ssc: 0x555d8fa0be40 snapset: 59=[]:{3b=[3b]}
2022-03-30T00:08:06.663+0000 7f652d5ec700 20 osd.9 pg_epoch: 153 pg[3.20s0( v 152'113 (0'0,152'113] local-lis/les=150/151 n=10 ec=60/26 lis/c=150/65 les/c/f=151/66/0 sis=150) [9,2,10,7,5,11]/[9,2,10,NONE,5,11]p9(0) async=[7(3)] r=0 lpr=150 pi=[65,150)/1 crt=152'113 lcod 147'111 mlcod 37'24 active+recovering+undersized+degraded+remapped mbc={0={(1+0)=9},1={(1+0)=9},2={(1+0)=9},3={(0+0)=2,(0+1)=7},4={(1+0)=9},5={(1+0)=9}} trimq=[3b~1,3f~1,61~1,74~1] ps=[44~1,50~1,59~1,65~1,68~1,72~1]] recovery got recovery read lock on 3:05fdc56b:::smithi03929708-26:head
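
For reference, here is a minimal, self-contained sketch of the locking behavior seen in the logs above. This is not the actual Ceph object_state.h code; the class SimpleRWState and its methods are purely illustrative. It only mirrors the observed behavior: the recovery read lock can be taken while the rwstate is "none", but not while a client write is in flight.

#include <cassert>
#include <iostream>

// Illustrative stand-in for the per-object read/write state discussed above
// (NOT the actual object_state.h code). A recovery read lock succeeds only
// when no client write is outstanding.
class SimpleRWState {
public:
  enum class State { NONE, READ, WRITE };

  // A client write moves the object into WRITE state.
  void start_write() { state_ = State::WRITE; ++count_; }

  // When the last write drains, the object returns to NONE.
  void finish_write() {
    assert(state_ == State::WRITE && count_ > 0);
    if (--count_ == 0) state_ = State::NONE;
  }

  // Recovery attempts a read lock: it fails while a write is in flight
  // ("could not get rw_manager lock"), and succeeds otherwise.
  bool get_recovery_read() {
    if (state_ == State::WRITE) return false;
    state_ = State::READ;
    ++count_;
    return true;
  }

private:
  State state_ = State::NONE;
  int count_ = 0;
};

int main() {
  SimpleRWState rw;
  rw.start_write();                                  // rwstate(write n=1 w=0)
  std::cout << rw.get_recovery_read() << std::endl;  // 0 -> recovery delayed
  rw.finish_write();                                 // rwstate(none n=0 w=0)
  std::cout << rw.get_recovery_read() << std::endl;  // 1 -> recovery proceeds
}

In the failing runs the read lock attempt keeps failing, so the pg stays in recovering and wait_for_recovery eventually times out.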

#31 Updated by Laura Flores almost 2 years ago

/a/yuriw-2022-04-22_21:06:04-rados-wip-yuri3-testing-2022-04-22-0534-quincy-distro-default-smithi/6802072

#32 Updated by Laura Flores almost 2 years ago

  • Backport changed from pacific to quincy,pacific

#33 Updated by Neha Ojha almost 2 years ago

This was discussed in the RADOS meeting this week. Laura is checking whether the bug also exists in Octopus, to help identify differences between Pacific and Octopus.

#34 Updated by Laura Flores almost 2 years ago

Ran 100 thrash-erasure-code-big tests in octopus; the `wait_for_recovery` assertion occurred in 18 of the 100 runs, out of 19 failures total (the remaining failure had a different root cause). There are no osd logs to compare against, but the pg_map shows that there is always 1 pg stuck in `recovery`, just like the master/pacific examples.

Description: rados:thrash-erasure-code-big/{ceph cluster/{12-osds openstack} msgr-failures/osd-dispatch-delay objectstore/bluestore-stupid rados recovery-overrides/{more-active-recovery} supported-random-distro$/{ubuntu_18.04} thrashers/pggrow thrashosds-health workloads/ec-rados-plugin=jerasure-k=4-m=2}
http://pulpito.front.sepia.ceph.com/lflores-2022-04-27_17:53:15-rados:thrash-erasure-code-big-wip-yuri2-testing-2022-04-26-1132-octopus-distro-default-smithi/

Several more observations:
  • This bug seems to occur exclusively in tests with `thrashers/pggrow` and `thrashers/mapgap`.
  • The majority of the tests include `msgr-failures/osd-***-delay`, and a handful include `msgr-failures/few` (all coupled with pggrow or mapgap).

#36 Updated by Laura Flores almost 2 years ago

/a/yuriw-2022-05-13_14:13:55-rados-wip-yuri3-testing-2022-05-12-1609-octopus-distro-default-smithi/6832699

#37 Updated by Laura Flores almost 2 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 46346

#38 Updated by Laura Flores almost 2 years ago

  • Backport changed from quincy,pacific to quincy,pacific,octopus

#39 Updated by Neha Ojha almost 2 years ago

  • Status changed from Fix Under Review to Pending Backport

#40 Updated by Backport Bot almost 2 years ago

  • Copied to Backport #55743: octopus: "wait_for_recovery: failed before timeout expired" during thrashosd test with EC backend. added

#41 Updated by Backport Bot almost 2 years ago

  • Copied to Backport #55744: quincy: "wait_for_recovery: failed before timeout expired" during thrashosd test with EC backend. added

#42 Updated by Backport Bot almost 2 years ago

  • Copied to Backport #55745: pacific: "wait_for_recovery: failed before timeout expired" during thrashosd test with EC backend. added

#43 Updated by Laura Flores almost 2 years ago

  • Status changed from Pending Backport to Resolved

#44 Updated by Laura Flores almost 2 years ago

  • Status changed from Resolved to Pending Backport

#45 Updated by Laura Flores almost 2 years ago

  • Status changed from Pending Backport to Resolved

#46 Updated by Laura Flores almost 2 years ago

  • Related to Bug #56149: thrash-erasure-code: AssertionError: wait_for_recovery timeout due to "active+recovering+undersized" pg added

#47 Updated by Laura Flores about 1 year ago

  • Related to Bug #58893: test_map_discontinuity: AssertionError: wait_for_clean: failed before timeout expired added
