Actions
Bug #53806
closedunessesarily long laggy PG state
Status:
Resolved
Priority:
High
Assignee:
-
Category:
Performance/Resource Usage
Target version:
-
% Done:
100%
Source:
Community (dev)
Tags:
backport_processed
Backport:
octopus, pacific, quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
the first `pg_lease_ack_t` after becoming laggy would not trigger `recheck_readable`. However, every other ack would trigger it. The logic is inverted. Causing unnecessarily long laggy PG state.
Reproduction:
MON=1 OSD=2 MDS=0 MGR=1 ../src/vstart.sh --new --debug
./bin/ceph osd pool create test 1 --size 2
./bin/ceph config set osd debug_osd 20/20
./bin/ceph config set osd debug_ms 1/1
./bin/ceph config set osd.0 ms_blackhole_osd true # osd.0 is the primary of test pool
# wait for 20s
echo 12345 | ./bin/rados -p test put 1 -
# press Ctrl+C after several seconds
# Make sure pg 2.0 is laggy now.
./bin/ceph config rm osd.0 ms_blackhole_osd
Wait for pg 2.0 exit laggy state, which takes unnecessarily long.
Inspect logs of OSD.0: `grep -E '==== pg_lease_ack|recheck_readable' out/osd.0.log`
2022-01-08T01:54:37.488+0800 7f53c7fff700 1 -- [v2:202.38.247.227:6834/230204,v1:202.38.247.227:6835/230204] <== osd.1 v2:202.38.247.227:6842/317837 193 ==== pg_lease_ack(2.0 pg_lease_ack(ruub 417.773223877s) e17/17) v1 ==== 42+0+0 (crc 0 0 0) 0x7f52f40350a0 con 0x7f53c800f190 2022-01-08T01:54:37.488+0800 7f5314ff9700 20 osd.0 pg_epoch: 17 pg[2.0( empty local-lis/les=15/16 n=0 ec=15/15 lis/c=15/15 les/c/f=16/16/0 sis=15) [0,1] r=0 lpr=15 crt=0'0 mlcod 0'0 active+clean] recheck_readable wasn't wait or laggy 2022-01-08T01:59:41.507+0800 7f53c7fff700 1 -- [v2:202.38.247.227:6834/230204,v1:202.38.247.227:6835/230204] <== osd.1 v2:202.38.247.227:6842/317837 232 ==== pg_lease_ack(2.0 pg_lease_ack(ruub 721.793029785s) e17/17) v1 ==== 42+0+0 (crc 0 0 0) 0x7f52f40350a0 con 0x7f53c800f190 2022-01-08T01:59:49.507+0800 7f53c7fff700 1 -- [v2:202.38.247.227:6834/230204,v1:202.38.247.227:6835/230204] <== osd.1 v2:202.38.247.227:6842/317837 234 ==== pg_lease_ack(2.0 pg_lease_ack(ruub 729.794006348s) e17/17) v1 ==== 42+0+0 (crc 0 0 0) 0x7f52f40350a0 con 0x7f53c800f190 2022-01-08T01:59:49.507+0800 7f5314ff9700 10 osd.0 pg_epoch: 17 pg[2.0( empty local-lis/les=15/16 n=0 ec=15/15 lis/c=15/15 les/c/f=16/16/0 sis=15) [0,1] r=0 lpr=15 crt=0'0 mlcod 0'0 active+clean+laggy] recheck_readable no longer laggy (mnow 713.795532227s < readable_until 729.794006348s) 2022-01-08T01:59:57.506+0800 7f53c7fff700 1 -- [v2:202.38.247.227:6834/230204,v1:202.38.247.227:6835/230204] <== osd.1 v2:202.38.247.227:6842/317837 236 ==== pg_lease_ack(2.0 pg_lease_ack(ruub 737.794433594s) e17/17) v1 ==== 42+0+0 (crc 0 0 0) 0x7f52f40350a0 con 0x7f53c800f190 2022-01-08T01:59:57.510+0800 7f5314ff9700 20 osd.0 pg_epoch: 17 pg[2.0( empty local-lis/les=15/16 n=0 ec=15/15 lis/c=15/15 les/c/f=16/16/0 sis=15) [0,1] r=0 lpr=15 crt=0'0 mlcod 0'0 active+clean] recheck_readable wasn't wait or laggy 2022-01-08T02:00:05.510+0800 7f53c7fff700 1 -- [v2:202.38.247.227:6834/230204,v1:202.38.247.227:6835/230204] <== osd.1 v2:202.38.247.227:6842/317837 238 ==== pg_lease_ack(2.0 pg_lease_ack(ruub 745.794982910s) e17/17) v1 ==== 42+0+0 (crc 0 0 0) 0x7f52f40350a0 con 0x7f53c800f190 2022-01-08T02:00:05.510+0800 7f5314ff9700 20 osd.0 pg_epoch: 17 pg[2.0( empty local-lis/les=15/16 n=0 ec=15/15 lis/c=15/15 les/c/f=16/16/0 sis=15) [0,1] r=0 lpr=15 crt=0'0 mlcod 0'0 active+clean] recheck_readable wasn't wait or laggy 2022-01-08T02:00:13.510+0800 7f53c7fff700 1 -- [v2:202.38.247.227:6834/230204,v1:202.38.247.227:6835/230204] <== osd.1 v2:202.38.247.227:6842/317837 240 ==== pg_lease_ack(2.0 pg_lease_ack(ruub 753.796752930s) e17/17) v1 ==== 42+0+0 (crc 0 0 0) 0x7f52f40350a0 con 0x7f53c800f190 2022-01-08T02:00:13.510+0800 7f5314ff9700 20 osd.0 pg_epoch: 17 pg[2.0( empty local-lis/les=15/16 n=0 ec=15/15 lis/c=15/15 les/c/f=16/16/0 sis=15) [0,1] r=0 lpr=15 crt=0'0 mlcod 0'0 active+clean] recheck_readable wasn't wait or laggy
See that the first `pg_lease_ack` after canceling backhole does not trigger `recheck_readable`, but every other one does.
Actions