Bug #53806
unnecessarily long laggy PG state
Description
The first `pg_lease_ack_t` received after the PG becomes laggy does not trigger `recheck_readable`, but every subsequent ack does. The logic is inverted, which causes an unnecessarily long laggy PG state.
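To make the inverted condition concrete, here is a minimal, self-contained sketch. The `PrimarySketch` type and its members are hypothetical stand-ins, not the actual Ceph code; only the `now < old_ru` guard mirrors the inversion discussed here and in note #10 below.

```cpp
// Illustrative sketch only -- not the actual Ceph source. Models a primary
// whose lease ("readable_until") is extended by incoming lease acks.
#include <cstdio>

struct PrimarySketch {
  double readable_until = 0.0;     // monotonic timestamp up to which the PG is readable

  // hypothetical stand-in for the recheck that clears the laggy state
  void recheck_readable(double now) {
    if (now < readable_until)
      std::printf("recheck_readable: no longer laggy\n");
  }

  // hypothetical stand-in for the lease-ack path with the inverted guard
  void proc_lease_ack(double now, double ack_readable_until_ub) {
    const double old_ru = readable_until;        // value before applying this ack
    if (ack_readable_until_ub > readable_until)
      readable_until = ack_readable_until_ub;    // the ack extends the lease

    if (now < old_ru) {       // BUG: true only if we were still readable before
      recheck_readable(now);  // the ack, so the first ack after going laggy
    }                         // (now >= old_ru) never triggers the recheck.
    // Intended: recheck when the old lease had already expired, i.e. now >= old_ru.
  }
};

int main() {
  PrimarySketch p;
  p.readable_until = 100.0;        // lease expired long ago -> PG is laggy
  p.proc_lease_ack(713.0, 721.0);  // 1st ack after blackhole removed: no recheck
  p.proc_lease_ack(714.0, 729.0);  // 2nd ack: old_ru=721 > now -> recheck fires
}
```

This toy reproduces the pattern visible in the log excerpt below: the first ack after the PG went laggy is silently absorbed, and only the second ack clears the laggy state.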
Reproduction:
MON=1 OSD=2 MDS=0 MGR=1 ../src/vstart.sh --new --debug
./bin/ceph osd pool create test 1 --size 2
./bin/ceph config set osd debug_osd 20/20
./bin/ceph config set osd debug_ms 1/1
./bin/ceph config set osd.0 ms_blackhole_osd true # osd.0 is the primary of test pool
# wait for 20s
echo 12345 | ./bin/rados -p test put 1 -
# press Ctrl+C after several seconds
# Make sure pg 2.0 is laggy now.
./bin/ceph config rm osd.0 ms_blackhole_osd
Wait for pg 2.0 to exit the laggy state, which takes unnecessarily long.
Inspect logs of OSD.0: `grep -E '==== pg_lease_ack|recheck_readable' out/osd.0.log`
2022-01-08T01:54:37.488+0800 7f53c7fff700 1 -- [v2:202.38.247.227:6834/230204,v1:202.38.247.227:6835/230204] <== osd.1 v2:202.38.247.227:6842/317837 193 ==== pg_lease_ack(2.0 pg_lease_ack(ruub 417.773223877s) e17/17) v1 ==== 42+0+0 (crc 0 0 0) 0x7f52f40350a0 con 0x7f53c800f190
2022-01-08T01:54:37.488+0800 7f5314ff9700 20 osd.0 pg_epoch: 17 pg[2.0( empty local-lis/les=15/16 n=0 ec=15/15 lis/c=15/15 les/c/f=16/16/0 sis=15) [0,1] r=0 lpr=15 crt=0'0 mlcod 0'0 active+clean] recheck_readable wasn't wait or laggy
2022-01-08T01:59:41.507+0800 7f53c7fff700 1 -- [v2:202.38.247.227:6834/230204,v1:202.38.247.227:6835/230204] <== osd.1 v2:202.38.247.227:6842/317837 232 ==== pg_lease_ack(2.0 pg_lease_ack(ruub 721.793029785s) e17/17) v1 ==== 42+0+0 (crc 0 0 0) 0x7f52f40350a0 con 0x7f53c800f190
2022-01-08T01:59:49.507+0800 7f53c7fff700 1 -- [v2:202.38.247.227:6834/230204,v1:202.38.247.227:6835/230204] <== osd.1 v2:202.38.247.227:6842/317837 234 ==== pg_lease_ack(2.0 pg_lease_ack(ruub 729.794006348s) e17/17) v1 ==== 42+0+0 (crc 0 0 0) 0x7f52f40350a0 con 0x7f53c800f190
2022-01-08T01:59:49.507+0800 7f5314ff9700 10 osd.0 pg_epoch: 17 pg[2.0( empty local-lis/les=15/16 n=0 ec=15/15 lis/c=15/15 les/c/f=16/16/0 sis=15) [0,1] r=0 lpr=15 crt=0'0 mlcod 0'0 active+clean+laggy] recheck_readable no longer laggy (mnow 713.795532227s < readable_until 729.794006348s)
2022-01-08T01:59:57.506+0800 7f53c7fff700 1 -- [v2:202.38.247.227:6834/230204,v1:202.38.247.227:6835/230204] <== osd.1 v2:202.38.247.227:6842/317837 236 ==== pg_lease_ack(2.0 pg_lease_ack(ruub 737.794433594s) e17/17) v1 ==== 42+0+0 (crc 0 0 0) 0x7f52f40350a0 con 0x7f53c800f190
2022-01-08T01:59:57.510+0800 7f5314ff9700 20 osd.0 pg_epoch: 17 pg[2.0( empty local-lis/les=15/16 n=0 ec=15/15 lis/c=15/15 les/c/f=16/16/0 sis=15) [0,1] r=0 lpr=15 crt=0'0 mlcod 0'0 active+clean] recheck_readable wasn't wait or laggy
2022-01-08T02:00:05.510+0800 7f53c7fff700 1 -- [v2:202.38.247.227:6834/230204,v1:202.38.247.227:6835/230204] <== osd.1 v2:202.38.247.227:6842/317837 238 ==== pg_lease_ack(2.0 pg_lease_ack(ruub 745.794982910s) e17/17) v1 ==== 42+0+0 (crc 0 0 0) 0x7f52f40350a0 con 0x7f53c800f190
2022-01-08T02:00:05.510+0800 7f5314ff9700 20 osd.0 pg_epoch: 17 pg[2.0( empty local-lis/les=15/16 n=0 ec=15/15 lis/c=15/15 les/c/f=16/16/0 sis=15) [0,1] r=0 lpr=15 crt=0'0 mlcod 0'0 active+clean] recheck_readable wasn't wait or laggy
2022-01-08T02:00:13.510+0800 7f53c7fff700 1 -- [v2:202.38.247.227:6834/230204,v1:202.38.247.227:6835/230204] <== osd.1 v2:202.38.247.227:6842/317837 240 ==== pg_lease_ack(2.0 pg_lease_ack(ruub 753.796752930s) e17/17) v1 ==== 42+0+0 (crc 0 0 0) 0x7f52f40350a0 con 0x7f53c800f190
2022-01-08T02:00:13.510+0800 7f5314ff9700 20 osd.0 pg_epoch: 17 pg[2.0( empty local-lis/les=15/16 n=0 ec=15/15 lis/c=15/15 les/c/f=16/16/0 sis=15) [0,1] r=0 lpr=15 crt=0'0 mlcod 0'0 active+clean] recheck_readable wasn't wait or laggy
Note that the first `pg_lease_ack` after the blackhole is removed does not trigger `recheck_readable`, but every subsequent one does.
Related issues
- Related to Bug #57546: rados/thrash-erasure-code: wait_for_recovery timeout due to "active+clean+remapped+laggy" pgs
- Copied to Backport #56734: pacific: unnecessarily long laggy PG state
- Copied to Backport #56735: octopus: unnecessarily long laggy PG state
- Copied to Backport #56736: quincy: unnecessarily long laggy PG state
History
#1 Updated by Josh Durgin almost 2 years ago
- Status changed from New to Fix Under Review
- Priority changed from Normal to High
- Backport set to octopus, pacific, quincy
#2 Updated by Radoslaw Zarzynski over 1 year ago
- Pull request ID set to 44499
#3 Updated by Kefu Chai over 1 year ago
- Status changed from Fix Under Review to Pending Backport
#4 Updated by Backport Bot over 1 year ago
- Copied to Backport #56734: pacific: unnecessarily long laggy PG state added
#5 Updated by Backport Bot over 1 year ago
- Copied to Backport #56735: octopus: unnecessarily long laggy PG state added
#6 Updated by Backport Bot over 1 year ago
- Copied to Backport #56736: quincy: unnecessarily long laggy PG state added
#7 Updated by Backport Bot over 1 year ago
- Tags set to backport_processed
#8 Updated by Laura Flores about 1 year ago
- Related to Bug #57546: rados/thrash-erasure-code: wait_for_recovery timeout due to "active+clean+remapped+laggy" pgs added
#9 Updated by Radoslaw Zarzynski about 1 year ago
- Status changed from Pending Backport to New
Reopening b/c the original fix had to be reverted: https://github.com/ceph/ceph/pull/44499#issuecomment-1247315820.
#10 Updated by Joshua Baergen about 1 year ago
FWIW, we've seen this happen very frequently during Nautilus->{Octopus,Pacific} upgrades. I had just tracked down the "now < old_ru" logic inversion and then happily found that someone had beaten me to it!
#11 Updated by Radoslaw Zarzynski about 1 year ago
OK, Aishwarya has found in testing that the `break`-related commit (https://github.com/ceph/ceph/pull/44499/commits/9337fa6704180de90e1c18e314849566ff628818) is the problematic one; the logic inversion fix (https://github.com/ceph/ceph/pull/44499/commits/caeca396e8b149cfa09ed99eda4f7a7186b005b4) alone was fine.
I think we can re-send the latter as a new PR.
#12 Updated by 玮文 胡 about 1 year ago
Radoslaw Zarzynski wrote:
OK, Aishwarya has found in testing that the `break`-related commit (https://github.com/ceph/ceph/pull/44499/commits/9337fa6704180de90e1c18e314849566ff628818) is the problematic one; the logic inversion fix (https://github.com/ceph/ceph/pull/44499/commits/caeca396e8b149cfa09ed99eda4f7a7186b005b4) alone was fine.
OK, thanks for the testing, and sorry for introducing the bug. Could you elaborate more on why that commit is problematic? I think as long as `acting` does not have duplicate entries, the logic is exactly the same as before.
#13 Updated by Radoslaw Zarzynski 12 months ago
I think as long as `acting` does not have duplicate entries, the logic is exactly the same as before.
Yeah. I'm afraid the assumption about being free of duplicates is invalid. See https://tracker.ceph.com/issues/57546#note-9.
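Not claiming this is the shape of the actual commit, but here is a small self-contained sketch (hypothetical helper names, artificial numbers) of how "stop at the first match" and "update every match" diverge once `acting` can hold the same OSD id in more than one slot:

```cpp
// Illustrative sketch only -- not the actual Ceph code. Each acting slot has a
// per-slot readable_until upper bound; readable_until is (roughly) the minimum
// over those bounds, so the two loop styles can yield different results when
// the same OSD occupies more than one slot.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

// update only the first matching slot, then stop (early break)
static void apply_ack_first_match(const std::vector<int>& acting,
                                  std::vector<double>& ub,
                                  int from, double ack_ub) {
  for (std::size_t i = 0; i < acting.size(); ++i) {
    if (acting[i] == from && ack_ub > ub[i]) {
      ub[i] = ack_ub;
      break;                        // only the first matching slot is updated
    }
  }
}

// update every matching slot (no early break)
static void apply_ack_all_matches(const std::vector<int>& acting,
                                  std::vector<double>& ub,
                                  int from, double ack_ub) {
  for (std::size_t i = 0; i < acting.size(); ++i) {
    if (acting[i] == from && ack_ub > ub[i]) {
      ub[i] = ack_ub;               // every matching slot is updated
    }
  }
}

int main() {
  const std::vector<int> acting = {0, 1, 1};     // osd.1 occupies two slots
  std::vector<double> a = {250, 100, 100}, b = a;

  apply_ack_first_match(acting, a, /*from=*/1, /*ack_ub=*/200);
  apply_ack_all_matches(acting, b, /*from=*/1, /*ack_ub=*/200);

  std::printf("first-match min=%.0f, all-matches min=%.0f\n",
              *std::min_element(a.begin(), a.end()),
              *std::min_element(b.begin(), b.end()));
  // prints: first-match min=100, all-matches min=200
}
```

With no duplicates in `acting` the two loops are equivalent, which is why the refactor looked harmless; the erasure-coded/remapped cases in #57546 are where the assumption breaks down.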