Bug #50351

osd: FAILED ceph_assert(recovering.count(*i)) after non-primary osd restart when in backfill_unfound state

Added by Mykola Golub 2 months ago. Updated about 1 month ago.

Status:
Pending Backport
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
pacific,octopus,nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

On nautilus we have been observing an issue where, when an EC pg is in the active+backfill_unfound+degraded state (which happens when unfound objects are detected during backfill) and a non-primary osd for this pg is restarted, the state changes to 'active+clean' after the restart and the information about unfound objects is lost.

It is easy to reproduce on a nautilus vstart cluster. When I tried to reproduce it on master, the state did not change after the restart; instead, I observed the primary osd crash after the non-primary osd restart:

2021-04-14T07:24:47.038+0100 7f3460012640 -1 ../src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::_clear_recovery_state()' thread 7f3460012640 time 2021-04-14T07:24:46.938051+0100
../src/osd/PrimaryLogPG.cc: 13082: FAILED ceph_assert(recovering.count(*i))

 ceph version 17.0.0-3141-gc35ab161031 (c35ab1610310fc33731b2ce2c7a73796df5535a2) quincy (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x19d) [0x55f0a3656885]
 2: /home/mgolub/ceph/ceph.ci/build/bin/ceph-osd(+0x34f9b07) [0x55f0a3656b07]
 3: (PrimaryLogPG::_clear_recovery_state()+0x201) [0x55f0a2d69f97]
 4: (PG::clear_recovery_state()+0x1bd) [0x55f0a2c71d9d]
 5: (PG::cancel_recovery()+0xe9) [0x55f0a2c71f1d]
 6: (PG::on_new_interval()+0x169) [0x55f0a2c78fab]
 7: (PeeringState::on_new_interval()+0x9b0) [0x55f0a2fe1eba]
 8: (PeeringState::start_peering_interval(std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> > const&, int, std::vector<int, std::allocator<int> > const&, int, ceph::os::Transaction&)+0xc2c) [0x55f0a2fe08d4]
 9: (PeeringState::Reset::react(PeeringState::AdvMap const&)+0x4bc) [0x55f0a3003ba4]

Steps to reproduce:

1) Deploy a cluster with at least 4 osds (to have room to backfill a pg from one osd to another), with a small max_pg_log (to make backfill easier to trigger) and the filestore backend (to make it easier to remove object shards and generate unfound objects). Create a 3+2 EC pool and fill it with many objects (to be able to trigger backfill):

adonis:~/ceph/ceph.ci/build% OSD=4 ../src/vstart.sh -n -o osd_max_pg_log_entries=10 -f # small pg_log is for easier to trigger backfill, filestore is for easier to remove object shards
adonis:~/ceph/ceph.ci/build% ceph osd pool create test 1 1 erasure; ceph osd pool set test pg_autoscale_mode off
pool 'test' created
set pool 5 pg_autoscale_mode to off
adonis:~/ceph/ceph.ci/build% for i in `seq 200`; do rados -p test put test$i ./bin/test_crypto ; done

Choose one object and remove it from all shards but the primary:
adonis:~/ceph/ceph.ci/build% ceph osd map test test99                                                                      
osdmap e60 pool 'test' (5) object 'test99' -> pg 5.d9607dc6 (5.0) -> up ([3,1,2], p3) acting ([3,1,2], p3)
adonis:~/ceph/ceph.ci/build% ls -1 dev/osd*/current/5.0s*_head/test99*
dev/osd1/current/5.0s1_head/test99__head_D9607DC6__5_ffffffffffffffff_1
dev/osd2/current/5.0s2_head/test99__head_D9607DC6__5_ffffffffffffffff_2
dev/osd3/current/5.0s0_head/test99__head_D9607DC6__5_ffffffffffffffff_0
adonis:~/ceph/ceph.ci/build% rm dev/osd*/current/5.0s[12]_head/test99*
adonis:~/ceph/ceph.ci/build% ls -1 dev/osd*/current/5.0s*_head/test99*
dev/osd3/current/5.0s0_head/test99__head_D9607DC6__5_ffffffffffffffff_0

Set one of the non-primary osds out to trigger the backfill, and observe that the state changes to active+backfill_unfound+degraded after the backfill completes:
adonis:~/ceph/ceph.ci/build% ceph osd out 1                                                                                
adonis:~/ceph/ceph.ci/build% ceph pg dump |grep '^5\.0'
dumped all
5.0          200                   0         0        188        0  489212800            0           0  323       323  active+remapped+backfilling  2021-04-14T07:21:05.430368+0100  54'1123   63:1976  [3,0,2]           3  [3,1,2]               3         0'0  2021-04-14T07:00:39.765421+0100              0'0  2021-04-14T07:00:39.765421+0100              0
adonis:~/ceph/ceph.ci/build% ceph pg dump |grep '^5\.0'
dumped all
5.0          200                   0         2        115        1  489212800            0           0  323       323  active+backfill_unfound+degraded+remapped  2021-04-14T07:21:09.595125+0100  54'1123   63:2349  [3,0,2]           3  [3,1,2]               3         0'0  2021-04-14T07:00:39.765421+0100              0'0  2021-04-14T07:00:39.765421+0100              0
adonis:~/ceph/ceph.ci/build% ceph pg dump_stuck unclean
PG_STAT  STATE                                      UP       UP_PRIMARY  ACTING   ACTING_PRIMARY
5.0      active+backfill_unfound+degraded+remapped  [3,0,2]           3  [3,1,2]               3
ok

Restart a non-primary osd:
adonis:~/ceph/ceph.ci/build% ps auxww |grep 'ceph-osd -i 2'
mgolub   40025  6.2  0.3 889212 241700 ?       Ssl  07:18   0:21 /home/mgolub/ceph/ceph.ci/build/bin/ceph-osd -i 2 -c /home/mgolub/ceph/ceph.ci/build/ceph.conf
mgolub   40420  0.0  0.0   6224   788 pts/6    S+   07:24   0:00 grep --color=auto ceph-osd -i 2
adonis:~/ceph/ceph.ci/build% kill 40025
adonis:~/ceph/ceph.ci/build% /home/mgolub/ceph/ceph.ci/build/bin/ceph-osd -i 2 -c /home/mgolub/ceph/ceph.ci/build/ceph.conf

After this step, on master the primary osd crashes with the backtrace shown above. On nautilus there is no crash, but the state changes to active+clean, losing the information about unfound objects.


Related issues

Copied to RADOS - Backport #50792: nautilus: osd: FAILED ceph_assert(recovering.count(*i)) after non-primary osd restart when in backfill_unfound state In Progress
Copied to RADOS - Backport #50793: octopus: osd: FAILED ceph_assert(recovering.count(*i)) after non-primary osd restart when in backfill_unfound state Resolved
Copied to RADOS - Backport #50794: pacific: osd: FAILED ceph_assert(recovering.count(*i)) after non-primary osd restart when in backfill_unfound state Resolved

History

#1 Updated by Mykola Golub about 2 months ago

`PrimaryLogPG::on_failed_pull` [1] looks suspicious to me. We remove the oid from `backfills_in_flight` here only if the backfill source is the primary. This seems to make sense for a replicated backend, but probably not for the EC backend. And looking at the history, the behavior appears to have changed in [2], when in `ECBackend::_failed_push`, instead of unconditionally calling `backfill_add_missing` (which removed the in-flight backfill), we started calling `on_failed_pull` (which removes the in-flight backfill only for the primary).

I tried changing `on_failed_pull` to unconditionally call `backfills_in_flight.erase(oid)`. This fixed the assertion failure, but it brought back the nautilus behavior, i.e. the pg changing its state from "backfill_unfound" to "clean", which does not make me very happy.

[1] https://github.com/ceph/ceph/blob/813933f81e3d682a0b1ae6dd906e38e78c4859a4/src/osd/PrimaryLogPG.cc#L12453
[2] https://github.com/ceph/ceph/commit/8a8947d2a32d6390cb17099398e7f2212660c9a1

#2 Updated by Mykola Golub about 1 month ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 41270

In the mailing list thread [1] I provided some details on why I think the current behaviour of `PrimaryLogPG::on_failed_pull`, where it does not remove the oid from `backfills_in_flight`, is actually fine, as it leads to reasonable osd behavior. So I think we just need to remove the assert.

[1] https://lists.ceph.io/hyperkitty/list/dev@ceph.io/thread/OJEZB4YEGWV3EUJPFGLG2O3VHERR5ATI/

#3 Updated by Mykola Golub about 1 month ago

  • Backport set to pacific,octopus,nautilus

#4 Updated by Kefu Chai about 1 month ago

  • Status changed from Fix Under Review to Pending Backport

#5 Updated by Backport Bot about 1 month ago

  • Copied to Backport #50792: nautilus: osd: FAILED ceph_assert(recovering.count(*i)) after non-primary osd restart when in backfill_unfound state added

#6 Updated by Backport Bot about 1 month ago

  • Copied to Backport #50793: octopus: osd: FAILED ceph_assert(recovering.count(*i)) after non-primary osd restart when in backfill_unfound state added

#7 Updated by Backport Bot about 1 month ago

  • Copied to Backport #50794: pacific: osd: FAILED ceph_assert(recovering.count(*i)) after non-primary osd restart when in backfill_unfound state added
