Bug #50351: osd: FAILED ceph_assert(recovering.count(*i)) after non-primary osd restart when in backfill_unfound state - RADOS - Ceph

Actions

Copy link

Bug #50351

closed

osd: FAILED ceph_assert(recovering.count(*i)) after non-primary osd restart when in backfill_unfound state

Added by Mykola Golub about 3 years ago. Updated over 2 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

Category:

Target version:

% Done:

Source:

Tags:

Backport:

pacific,octopus,nautilus

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Component(RADOS):

Pull request ID:

41270

Crash signature (v1):

Crash signature (v2):

Description

On nautilus we have been observing an issue when an EC pg is in active+backfill_unfound+degraded state (which happens when unfound objects are detected during backfill), and a non-primary osd for this pg is restarted, the state is changed to 'active+clean' after the restart and information about unfound objects is lost.

It is easy to reproduce on nautilus vstart cluster. But then I tried to reproduce it on the master, the state did not change here after the restart, but I was observing the primary osd crashes after the non-primary restart:

2021-04-14T07:24:47.038+0100 7f3460012640 -1 ../src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::_clear_recovery_state()' thread 7f3460012640 time 2021-04-14T07:24:46.938051+0100
../src/osd/PrimaryLogPG.cc: 13082: FAILED ceph_assert(recovering.count(*i))

 ceph version 17.0.0-3141-gc35ab161031 (c35ab1610310fc33731b2ce2c7a73796df5535a2) quincy (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x19d) [0x55f0a3656885]
 2: /home/mgolub/ceph/ceph.ci/build/bin/ceph-osd(+0x34f9b07) [0x55f0a3656b07]
 3: (PrimaryLogPG::_clear_recovery_state()+0x201) [0x55f0a2d69f97]
 4: (PG::clear_recovery_state()+0x1bd) [0x55f0a2c71d9d]
 5: (PG::cancel_recovery()+0xe9) [0x55f0a2c71f1d]
 6: (PG::on_new_interval()+0x169) [0x55f0a2c78fab]
 7: (PeeringState::on_new_interval()+0x9b0) [0x55f0a2fe1eba]
 8: (PeeringState::start_peering_interval(std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> > const&, int, std::vector<int, std::allocator<int> > const&, int, ceph::os::Transaction&)+0xc2c) [0x55f0a2fe08d4]
 9: (PeeringState::Reset::react(PeeringState::AdvMap const&)+0x4bc) [0x55f0a3003ba4]

Steps to reproduce:

1) Deploy a cluster with at least 4 osds (to have a room to backfill a pg from one osd to another), with small max_pg_log (for easier to trigger backfill) and filestore backend (for easier to remove object shards and generate unfound objects), create 3+2 EC pool and fill with many objects (to be able to trigger backfill):

adonis:~/ceph/ceph.ci/build% OSD=4 ../src/vstart.sh -n -o osd_max_pg_log_entries=10 -f # small pg_log is for easier to trigger backfill, filestore is for easier to remove object shards
adonis:~/ceph/ceph.ci/build% ceph osd pool create test 1 1 erasure; ceph osd pool set test pg_autoscale_mode off
pool 'test' created
set pool 5 pg_autoscale_mode to off
adonis:~/ceph/ceph.ci/build% for i in `seq 200`; do rados -p test put test$i ./bin/test_crypto ; done

Chose one object and remove it from all shards but primary:

adonis:~/ceph/ceph.ci/build% ceph osd map test test99                                                                      
osdmap e60 pool 'test' (5) object 'test99' -> pg 5.d9607dc6 (5.0) -> up ([3,1,2], p3) acting ([3,1,2], p3)
adonis:~/ceph/ceph.ci/build% ls -1 dev/osd*/current/5.0s*_head/test99*
dev/osd1/current/5.0s1_head/test99__head_D9607DC6__5_ffffffffffffffff_1
dev/osd2/current/5.0s2_head/test99__head_D9607DC6__5_ffffffffffffffff_2
dev/osd3/current/5.0s0_head/test99__head_D9607DC6__5_ffffffffffffffff_0
adonis:~/ceph/ceph.ci/build% rm dev/osd*/current/5.0s[12]_head/test99*
adonis:~/ceph/ceph.ci/build% ls -1 dev/osd*/current/5.0s*_head/test99*
dev/osd3/current/5.0s0_head/test99__head_D9607DC6__5_ffffffffffffffff_0

Set one of non-primary osds out to trigger the backfill and observe the state is changed to active+backfill_unfound+degraded state after it is complete:

adonis:~/ceph/ceph.ci/build% ceph osd out 1                                                                                
adonis:~/ceph/ceph.ci/build% ceph pg dump |grep '^5\.0'
dumped all
5.0          200                   0         0        188        0  489212800            0           0  323       323  active+remapped+backfilling  2021-04-14T07:21:05.430368+0100  54'1123   63:1976  [3,0,2]           3  [3,1,2]               3         0'0  2021-04-14T07:00:39.765421+0100              0'0  2021-04-14T07:00:39.765421+0100              0
adonis:~/ceph/ceph.ci/build% ceph pg dump |grep '^5\.0'
dumped all
5.0          200                   0         2        115        1  489212800            0           0  323       323  active+backfill_unfound+degraded+remapped  2021-04-14T07:21:09.595125+0100  54'1123   63:2349  [3,0,2]           3  [3,1,2]               3         0'0  2021-04-14T07:00:39.765421+0100              0'0  2021-04-14T07:00:39.765421+0100              0
adonis:~/ceph/ceph.ci/build% ceph pg dump_stuck unclean
PG_STAT  STATE                                      UP       UP_PRIMARY  ACTING   ACTING_PRIMARY
5.0      active+backfill_unfound+degraded+remapped  [3,0,2]           3  [3,1,2]               3
ok

Restart a non-primary osd:

adonis:~/ceph/ceph.ci/build% ps auxww |grep 'ceph-osd -i 2'
mgolub   40025  6.2  0.3 889212 241700 ?       Ssl  07:18   0:21 /home/mgolub/ceph/ceph.ci/build/bin/ceph-osd -i 2 -c /home/mgolub/ceph/ceph.ci/build/ceph.conf
mgolub   40420  0.0  0.0   6224   788 pts/6    S+   07:24   0:00 grep --color=auto ceph-osd -i 2
adonis:~/ceph/ceph.ci/build% kill 40025
adonis:~/ceph/ceph.ci/build% /home/mgolub/ceph/ceph.ci/build/bin/ceph-osd -i 2 -c /home/mgolub/ceph/ceph.ci/build/ceph.conf

After this step, on the master the primary osd crashes with the backtrace shown above. On the nautilus there is no crash but the state is changed to active+clean, loosing the information about unfound objects.

Related issues 3 (0 open — 3 closed)

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Ceph » RADOS

Custom queries

Bug #50351

osd: FAILED ceph_assert(recovering.count(*i)) after non-primary osd restart when in backfill_unfound state

Updated by Mykola Golub about 3 years ago

Updated by Mykola Golub almost 3 years ago

Updated by Mykola Golub almost 3 years ago

Updated by Kefu Chai almost 3 years ago

Updated by Backport Bot almost 3 years ago

Updated by Backport Bot almost 3 years ago

Updated by Backport Bot almost 3 years ago

Updated by Loïc Dachary over 2 years ago