osd: FAILED ceph_assert(recovering.count(*i)) after non-primary osd restart when in backfill_unfound state
On nautilus we have been observing an issue when an EC pg is in active+backfill_unfound+degraded state (which happens when unfound objects are detected during backfill), and a non-primary osd for this pg is restarted, the state is changed to 'active+clean' after the restart and information about unfound objects is lost.
It is easy to reproduce on nautilus vstart cluster. But then I tried to reproduce it on the master, the state did not change here after the restart, but I was observing the primary osd crashes after the non-primary restart:
2021-04-14T07:24:47.038+0100 7f3460012640 -1 ../src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::_clear_recovery_state()' thread 7f3460012640 time 2021-04-14T07:24:46.938051+0100 ../src/osd/PrimaryLogPG.cc: 13082: FAILED ceph_assert(recovering.count(*i)) ceph version 17.0.0-3141-gc35ab161031 (c35ab1610310fc33731b2ce2c7a73796df5535a2) quincy (dev) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x19d) [0x55f0a3656885] 2: /home/mgolub/ceph/ceph.ci/build/bin/ceph-osd(+0x34f9b07) [0x55f0a3656b07] 3: (PrimaryLogPG::_clear_recovery_state()+0x201) [0x55f0a2d69f97] 4: (PG::clear_recovery_state()+0x1bd) [0x55f0a2c71d9d] 5: (PG::cancel_recovery()+0xe9) [0x55f0a2c71f1d] 6: (PG::on_new_interval()+0x169) [0x55f0a2c78fab] 7: (PeeringState::on_new_interval()+0x9b0) [0x55f0a2fe1eba] 8: (PeeringState::start_peering_interval(std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> > const&, int, std::vector<int, std::allocator<int> > const&, int, ceph::os::Transaction&)+0xc2c) [0x55f0a2fe08d4] 9: (PeeringState::Reset::react(PeeringState::AdvMap const&)+0x4bc) [0x55f0a3003ba4]
Steps to reproduce:
1) Deploy a cluster with at least 4 osds (to have a room to backfill a pg from one osd to another), with small max_pg_log (for easier to trigger backfill) and filestore backend (for easier to remove object shards and generate unfound objects), create 3+2 EC pool and fill with many objects (to be able to trigger backfill):
adonis:~/ceph/ceph.ci/build% OSD=4 ../src/vstart.sh -n -o osd_max_pg_log_entries=10 -f # small pg_log is for easier to trigger backfill, filestore is for easier to remove object shards adonis:~/ceph/ceph.ci/build% ceph osd pool create test 1 1 erasure; ceph osd pool set test pg_autoscale_mode off pool 'test' created set pool 5 pg_autoscale_mode to off adonis:~/ceph/ceph.ci/build% for i in `seq 200`; do rados -p test put test$i ./bin/test_crypto ; done
Chose one object and remove it from all shards but primary:
adonis:~/ceph/ceph.ci/build% ceph osd map test test99 osdmap e60 pool 'test' (5) object 'test99' -> pg 5.d9607dc6 (5.0) -> up ([3,1,2], p3) acting ([3,1,2], p3) adonis:~/ceph/ceph.ci/build% ls -1 dev/osd*/current/5.0s*_head/test99* dev/osd1/current/5.0s1_head/test99__head_D9607DC6__5_ffffffffffffffff_1 dev/osd2/current/5.0s2_head/test99__head_D9607DC6__5_ffffffffffffffff_2 dev/osd3/current/5.0s0_head/test99__head_D9607DC6__5_ffffffffffffffff_0 adonis:~/ceph/ceph.ci/build% rm dev/osd*/current/5.0s_head/test99* adonis:~/ceph/ceph.ci/build% ls -1 dev/osd*/current/5.0s*_head/test99* dev/osd3/current/5.0s0_head/test99__head_D9607DC6__5_ffffffffffffffff_0
Set one of non-primary osds out to trigger the backfill and observe the state is changed to active+backfill_unfound+degraded state after it is complete:
adonis:~/ceph/ceph.ci/build% ceph osd out 1 adonis:~/ceph/ceph.ci/build% ceph pg dump |grep '^5\.0' dumped all 5.0 200 0 0 188 0 489212800 0 0 323 323 active+remapped+backfilling 2021-04-14T07:21:05.430368+0100 54'1123 63:1976 [3,0,2] 3 [3,1,2] 3 0'0 2021-04-14T07:00:39.765421+0100 0'0 2021-04-14T07:00:39.765421+0100 0 adonis:~/ceph/ceph.ci/build% ceph pg dump |grep '^5\.0' dumped all 5.0 200 0 2 115 1 489212800 0 0 323 323 active+backfill_unfound+degraded+remapped 2021-04-14T07:21:09.595125+0100 54'1123 63:2349 [3,0,2] 3 [3,1,2] 3 0'0 2021-04-14T07:00:39.765421+0100 0'0 2021-04-14T07:00:39.765421+0100 0 adonis:~/ceph/ceph.ci/build% ceph pg dump_stuck unclean PG_STAT STATE UP UP_PRIMARY ACTING ACTING_PRIMARY 5.0 active+backfill_unfound+degraded+remapped [3,0,2] 3 [3,1,2] 3 ok
Restart a non-primary osd:
adonis:~/ceph/ceph.ci/build% ps auxww |grep 'ceph-osd -i 2' mgolub 40025 6.2 0.3 889212 241700 ? Ssl 07:18 0:21 /home/mgolub/ceph/ceph.ci/build/bin/ceph-osd -i 2 -c /home/mgolub/ceph/ceph.ci/build/ceph.conf mgolub 40420 0.0 0.0 6224 788 pts/6 S+ 07:24 0:00 grep --color=auto ceph-osd -i 2 adonis:~/ceph/ceph.ci/build% kill 40025 adonis:~/ceph/ceph.ci/build% /home/mgolub/ceph/ceph.ci/build/bin/ceph-osd -i 2 -c /home/mgolub/ceph/ceph.ci/build/ceph.conf
After this step, on the master the primary osd crashes with the backtrace shown above. On the nautilus there is no crash but the state is changed to active+clean, loosing the information about unfound objects.
#1 Updated by Mykola Golub about 2 months ago
`PrimaryLogPG::on_failed_pull`  looks suspicious to me. We remove the oid from `backfills_in_flight` here only if the backfill source is primary. This seems to make sense for a replicated backend, but probably not for the EC backend. And looking at the history, the behavior seemed to be changed in , when in `ECBackend::_failed_push` instead of unconditionally calling `backfill_add_missing` (which removed the in-flight backfill) we started to call `on_failed_pull` (which removes the in-flight backfill only for the primary).
I tried changing `on_failed_pull` to unconditionally call `backfills_in_flight.erase(oid)`. Sure if fixed the assertion failure but it returned back the nautilus behavior, i.e. the pg state changing its state from "backfill_unfound" to "clean". Which does not make me very happy.
#2 Updated by Mykola Golub about 1 month ago
- Status changed from New to Fix Under Review
- Pull request ID set to 41270
In the mailing list thread  I provided some details why I think the current behaviour of `PrimaryLogPG::on_failed_pull`, when it does not remove the oid from `backfills_in_flight`, seems ok, as it leads to the reasonable osd behavior. So I think we just need to remove the assert.