Bug #53806 (closed)

unnecessarily long laggy PG state

Added by 玮文 胡 over 2 years ago. Updated 3 months ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
Performance/Resource Usage
Target version:
-
% Done:

100%

Source:
Community (dev)
Tags:
backport_processed
Backport:
octopus, pacific, quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The first `pg_lease_ack_t` received after a PG becomes laggy does not trigger `recheck_readable`, but every subsequent ack does. The logic is inverted, causing an unnecessarily long laggy PG state.
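
For illustration, a minimal C++ sketch of the suspected shape of the inversion, assuming it is the "now < old_ru" check mentioned in comment #10 below; `on_lease_ack` and the stubbed `recheck_readable` here are simplified stand-ins, not the actual Ceph source:

#include <chrono>
#include <iostream>

using mono_time = std::chrono::steady_clock::time_point;

// Hypothetical stand-in for the real recheck_readable() seen in the logs.
void recheck_readable() { std::cout << "recheck_readable\n"; }

// 'old_ru' is readable_until as it stood before this ack was applied,
// 'now' is the primary's current monotonic time.
void on_lease_ack(mono_time now, mono_time old_ru)
{
  // Inverted condition: recheck only while the old lease is still valid,
  // i.e. exactly when there is no laggy state to clear.  The first ack that
  // arrives after the lease expired (now >= old_ru) is skipped, so the PG
  // stays laggy until a later ack lands while readable_until is already
  // fresh again.
  if (now < old_ru) {
    recheck_readable();
  }
  // Intended: if (now >= old_ru) recheck_readable();
}

int main()
{
  auto now = std::chrono::steady_clock::now();
  on_lease_ack(now, now - std::chrono::seconds(5));  // lease expired: recheck skipped (the bug)
  on_lease_ack(now, now + std::chrono::seconds(5));  // lease still valid: recheck runs
}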

Reproduction:

MON=1 OSD=2 MDS=0 MGR=1 ../src/vstart.sh --new --debug
./bin/ceph osd pool create test 1 --size 2
./bin/ceph config set osd debug_osd 20/20
./bin/ceph config set osd debug_ms 1/1
./bin/ceph config set osd.0 ms_blackhole_osd true  # osd.0 is the primary of test pool
# wait for 20s
echo 12345 | ./bin/rados -p test put 1 -
# press Ctrl+C after several seconds
# Make sure pg 2.0 is laggy now.
./bin/ceph config rm osd.0 ms_blackhole_osd

Wait for pg 2.0 to exit the laggy state, which takes unnecessarily long.

Inspect logs of OSD.0: `grep -E '==== pg_lease_ack|recheck_readable' out/osd.0.log`

2022-01-08T01:54:37.488+0800 7f53c7fff700  1 -- [v2:202.38.247.227:6834/230204,v1:202.38.247.227:6835/230204] <== osd.1 v2:202.38.247.227:6842/317837 193 ==== pg_lease_ack(2.0 pg_lease_ack(ruub 417.773223877s) e17/17) v1 ==== 42+0+0 (crc 0 0 0) 0x7f52f40350a0 con 0x7f53c800f190
2022-01-08T01:54:37.488+0800 7f5314ff9700 20 osd.0 pg_epoch: 17 pg[2.0( empty local-lis/les=15/16 n=0 ec=15/15 lis/c=15/15 les/c/f=16/16/0 sis=15) [0,1] r=0 lpr=15 crt=0'0 mlcod 0'0 active+clean] recheck_readable wasn't wait or laggy
2022-01-08T01:59:41.507+0800 7f53c7fff700  1 -- [v2:202.38.247.227:6834/230204,v1:202.38.247.227:6835/230204] <== osd.1 v2:202.38.247.227:6842/317837 232 ==== pg_lease_ack(2.0 pg_lease_ack(ruub 721.793029785s) e17/17) v1 ==== 42+0+0 (crc 0 0 0) 0x7f52f40350a0 con 0x7f53c800f190
2022-01-08T01:59:49.507+0800 7f53c7fff700  1 -- [v2:202.38.247.227:6834/230204,v1:202.38.247.227:6835/230204] <== osd.1 v2:202.38.247.227:6842/317837 234 ==== pg_lease_ack(2.0 pg_lease_ack(ruub 729.794006348s) e17/17) v1 ==== 42+0+0 (crc 0 0 0) 0x7f52f40350a0 con 0x7f53c800f190
2022-01-08T01:59:49.507+0800 7f5314ff9700 10 osd.0 pg_epoch: 17 pg[2.0( empty local-lis/les=15/16 n=0 ec=15/15 lis/c=15/15 les/c/f=16/16/0 sis=15) [0,1] r=0 lpr=15 crt=0'0 mlcod 0'0 active+clean+laggy] recheck_readable no longer laggy (mnow 713.795532227s < readable_until 729.794006348s)
2022-01-08T01:59:57.506+0800 7f53c7fff700  1 -- [v2:202.38.247.227:6834/230204,v1:202.38.247.227:6835/230204] <== osd.1 v2:202.38.247.227:6842/317837 236 ==== pg_lease_ack(2.0 pg_lease_ack(ruub 737.794433594s) e17/17) v1 ==== 42+0+0 (crc 0 0 0) 0x7f52f40350a0 con 0x7f53c800f190
2022-01-08T01:59:57.510+0800 7f5314ff9700 20 osd.0 pg_epoch: 17 pg[2.0( empty local-lis/les=15/16 n=0 ec=15/15 lis/c=15/15 les/c/f=16/16/0 sis=15) [0,1] r=0 lpr=15 crt=0'0 mlcod 0'0 active+clean] recheck_readable wasn't wait or laggy
2022-01-08T02:00:05.510+0800 7f53c7fff700  1 -- [v2:202.38.247.227:6834/230204,v1:202.38.247.227:6835/230204] <== osd.1 v2:202.38.247.227:6842/317837 238 ==== pg_lease_ack(2.0 pg_lease_ack(ruub 745.794982910s) e17/17) v1 ==== 42+0+0 (crc 0 0 0) 0x7f52f40350a0 con 0x7f53c800f190
2022-01-08T02:00:05.510+0800 7f5314ff9700 20 osd.0 pg_epoch: 17 pg[2.0( empty local-lis/les=15/16 n=0 ec=15/15 lis/c=15/15 les/c/f=16/16/0 sis=15) [0,1] r=0 lpr=15 crt=0'0 mlcod 0'0 active+clean] recheck_readable wasn't wait or laggy
2022-01-08T02:00:13.510+0800 7f53c7fff700  1 -- [v2:202.38.247.227:6834/230204,v1:202.38.247.227:6835/230204] <== osd.1 v2:202.38.247.227:6842/317837 240 ==== pg_lease_ack(2.0 pg_lease_ack(ruub 753.796752930s) e17/17) v1 ==== 42+0+0 (crc 0 0 0) 0x7f52f40350a0 con 0x7f53c800f190
2022-01-08T02:00:13.510+0800 7f5314ff9700 20 osd.0 pg_epoch: 17 pg[2.0( empty local-lis/les=15/16 n=0 ec=15/15 lis/c=15/15 les/c/f=16/16/0 sis=15) [0,1] r=0 lpr=15 crt=0'0 mlcod 0'0 active+clean] recheck_readable wasn't wait or laggy

Note that the first `pg_lease_ack` after canceling the blackhole does not trigger `recheck_readable`, but every subsequent one does.


Related issues 4 (1 open, 3 closed)

Related to RADOS - Bug #57546: rados/thrash-erasure-code: wait_for_recovery timeout due to "active+clean+remapped+laggy" pgs (Fix Under Review, Aishwarya Mathuria)

Copied to RADOS - Backport #56734: pacific: unnecessarily long laggy PG state (Rejected, Aishwarya Mathuria)
Copied to RADOS - Backport #56735: octopus: unnecessarily long laggy PG state (Resolved)
Copied to RADOS - Backport #56736: quincy: unnecessarily long laggy PG state (Resolved, Aishwarya Mathuria)
Actions #1

Updated by Josh Durgin about 2 years ago

  • Status changed from New to Fix Under Review
  • Priority changed from Normal to High
  • Backport set to octopus, pacific, quincy
Actions #2

Updated by Radoslaw Zarzynski about 2 years ago

  • Pull request ID set to 44499
Actions #3

Updated by Kefu Chai over 1 year ago

  • Status changed from Fix Under Review to Pending Backport
Actions #4

Updated by Backport Bot over 1 year ago

  • Copied to Backport #56734: pacific: unnecessarily long laggy PG state added
Actions #5

Updated by Backport Bot over 1 year ago

  • Copied to Backport #56735: octopus: unnecessarily long laggy PG state added
Actions #6

Updated by Backport Bot over 1 year ago

  • Copied to Backport #56736: quincy: unnecessarily long laggy PG state added
Actions #7

Updated by Backport Bot over 1 year ago

  • Tags set to backport_processed
Actions #8

Updated by Laura Flores over 1 year ago

  • Related to Bug #57546: rados/thrash-erasure-code: wait_for_recovery timeout due to "active+clean+remapped+laggy" pgs added
Actions #9

Updated by Radoslaw Zarzynski over 1 year ago

  • Status changed from Pending Backport to New

Reopening b/c the original fix had to be reverted: https://github.com/ceph/ceph/pull/44499#issuecomment-1247315820.

Actions #10

Updated by Joshua Baergen over 1 year ago

FWIW, we've seen this happen very frequently during Nautilus->{Octopus,Pacific} upgrades. I had just tracked down the "now < old_ru" logic inversion and then happily found that someone had beaten me to it!

Actions #11

Updated by Radoslaw Zarzynski over 1 year ago

OK, Aishwarya has found in testing that the break-related commit (https://github.com/ceph/ceph/pull/44499/commits/9337fa6704180de90e1c18e314849566ff628818) is the problematic one; the logic inversion fix (https://github.com/ceph/ceph/pull/44499/commits/caeca396e8b149cfa09ed99eda4f7a7186b005b4) alone was fine.

I think we can re-send the latter as a new PR.

Actions #12

Updated by 玮文 胡 over 1 year ago

Radoslaw Zarzynski wrote:

OK, Aishwarya has found in testing that the break-related commit (https://github.com/ceph/ceph/pull/44499/commits/9337fa6704180de90e1c18e314849566ff628818) is the problematic one; the logic inversion fix (https://github.com/ceph/ceph/pull/44499/commits/caeca396e8b149cfa09ed99eda4f7a7186b005b4) alone was fine.

OK, thanks for the testing, and sorry for introducing the bug. Could you elaborate more on why that commit is problematic? I think as long as `acting` does not have duplicate entries, the logic is exactly the same as before.

Actions #13

Updated by Radoslaw Zarzynski over 1 year ago

I think as long as `acting` does not have duplicate entries, the logic is exactly the same as before.

Yeah. I'm afraid the assumption about being free of duplicates is invalid. See https://tracker.ceph.com/issues/57546#note-9.
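
To make the duplicate-entry concern concrete, a hedged sketch of why an early `break` in the per-peer ack loop changes behavior when `acting` lists the same OSD more than once; the container names below mirror the discussion but are hypothetical simplifications, not the reverted commit's code:

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

// Hypothetical per-slot lease bookkeeping: one readable_until upper bound per
// entry of 'acting'; readable_until is effectively the minimum over them.
std::vector<int> acting = {0, 1, 0};                          // osd.0 appears twice
std::vector<double> acting_readable_until_ub = {10, 10, 10};

double min_ub()
{
  return *std::min_element(acting_readable_until_ub.begin(),
                           acting_readable_until_ub.end());
}

// With the break, only the first slot owned by 'from' is refreshed; a
// duplicate slot keeps its stale upper bound, so the minimum (and hence
// readable_until) never moves forward and the PG can stay laggy.
void apply_ack_with_break(int from, double new_ub)
{
  for (std::size_t i = 0; i < acting.size(); ++i) {
    if (acting[i] == from) {
      acting_readable_until_ub[i] = new_ub;
      break;
    }
  }
}

// Without the break, every slot mapped to 'from' is refreshed, which is what
// the pre-existing loop effectively did.
void apply_ack_without_break(int from, double new_ub)
{
  for (std::size_t i = 0; i < acting.size(); ++i) {
    if (acting[i] == from) {
      acting_readable_until_ub[i] = new_ub;
    }
  }
}

int main()
{
  apply_ack_with_break(1, 20);
  apply_ack_with_break(0, 20);   // duplicate slot for osd.0 keeps its old ub
  std::cout << "min ub with break:    " << min_ub() << "\n";  // prints 10
  apply_ack_without_break(0, 20);
  std::cout << "min ub without break: " << min_ub() << "\n";  // prints 20
}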

Actions #14

Updated by Konstantin Shalygin 3 months ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100