Bug #21827

closed

OSD crashed while repairing inconsistent PG

Added by Ana Aviles over 6 years ago. Updated over 6 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: OSD
Target version:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

There was 1 inconsistent PG. ceph pg repair ended up with the primary OSD crashing every time it tried to repair the PG.
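For reference, the repair in question corresponds to commands along these lines (a sketch only; the PG id is taken from this report and output formats vary by version):

  ceph health detail                                      # lists the inconsistent PG(s)
  rados list-inconsistent-obj 2.2fc --format=json-pretty  # inspect the recorded scrub errors
  ceph pg repair 2.2fc                                    # trigger the repair that crashed the primary here

The primary OSD then hit the assert below each time: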

2017-10-17 17:48:55.771384 7f234930d700 -1 log_channel(cluster) log [ERR] : 2.2fc repair 1 missing, 0 inconsistent objects
2017-10-17 17:48:55.771417 7f234930d700 -1 log_channel(cluster) log [ERR] : 2.2fc repair 3 errors, 1 fixed
2017-10-17 17:48:56.047896 7f234930d700 -1 /build/ceph-12.2.1/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7f234930d700 time 2017-10-17 17:48:55.924115
/build/ceph-12.2.1/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p != recovery_info.ss.clone_snaps.end())

ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x56236c8ff3f2]
2: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ObjectStore::Transaction*)+0xd63) [0x56236c476213]
3: (ReplicatedBackend::handle_pull_response(pg_shard_t, PushOp const&, PullOp*, std::__cxx11::list<ReplicatedBackend::pull_complete_info, std::allocator<ReplicatedBackend::pull_complete_info> >*, ObjectStore::Transaction*)+0x693) [0x56236c60d4d3]
4: (ReplicatedBackend::_do_pull_response(boost::intrusive_ptr<OpRequest>)+0x2b5) [0x56236c60dd75]
5: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x20c) [0x56236c61196c]
6: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) [0x56236c521aa0]
7: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x55d) [0x56236c48662d]
8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3a9) [0x56236c3091a9]
9: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x56236c5a2ae7]
10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x130e) [0x56236c3307de]
11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x884) [0x56236c9041e4]
12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x56236c907220]
13: (()+0x76ba) [0x7f2366be96ba]
14: (clone()+0x6d) [0x7f2365c603dd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

The OSD got a push op containing a snapshot it doesn't think should exist. I also see that there's a comment "// hmm, should we warn?" on that assert.

We uploaded the log of the OSD with debug osd = 20. Reference: 6e4dba6f-2c15-4920-b591-fe380bbca200

Actions #1

Updated by Greg Farnum over 6 years ago

Okay, so the log we've got here is from a different assert, as mentioned in the email thread https://www.mail-archive.com/ceph-users@lists.ceph.com/msg41143.html

In function 'int PrimaryLogPG::recover_missing(const hobject_t&, eversion_t, int, PGBackend::RecoveryHandle*)' thread 7ff8a24db700 time 2017-10-18 12:09:30.866304
/build/ceph-12.2.1/src/osd/PrimaryLogPG.cc: 10369: FAILED assert(head_obc)

They further say in the email

When we request the arguments of the 3rd frame, it tells us:
#3  0x0000555555b833e9 in PrimaryLogPG::recover_missing (this=this@entry=0x555562aca000, soid=..., v=..., priority=<optimized out>, h=h@entry=0x555567a3a080) at /build/ceph-12.2.1/src/osd/PrimaryLogPG.cc:10369
10369    /build/ceph-12.2.1/src/osd/PrimaryLogPG.cc: No such file or directory.
(gdb) info args
this = 0x555562aca000
soid = @0x7fffd8b1a6c0: {oid = {name = "rbd_data.30732d3238f3535.", '0' <repeats 12 times>, "f7e3"}, snap = {val = 244802}, hash = 3610026748, max = false, nibblewise_key_cache = 3475686013, hash_reverse_bits = 1061500139, 
  static POOL_META = -1, static POOL_TEMP_START = -2, pool = 2, nspace = "", key = ""}
v = <optimized out>
priority = <optimized out>
h = 0x555567a3a080
(gdb) Quit

So it points to: rbd_data.30732d3238f3535.000000000000f7e3 (the '0' <repeats 12 times> in the gdb output expands to twelve zeros)

This is weird, as we cannot find any reference to "rbd_data.30732d3238f3535.000000000000f7e3": this rbd prefix does not exist on our system (when running rbd ls and requesting info on the rbds, 30732d3238f3535 is not found), nor can we find any object on the filesystem that looks to belong to this rbd image. It could of course be an old rbd image that has since been deleted, but we do not understand why the OSD is still trying to access or repair it. It seems this repair is stuck in some queue or something.
This is happening in the same PG as the problem we found yesterday.
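For reference, checks along these lines can confirm whether anything still references that block-name prefix (a sketch; the pool name and OSD paths are placeholders, and a filestore on-disk layout is assumed):

  # does any RBD image still carry this block_name_prefix?
  for img in $(rbd -p <pool> ls); do
      rbd -p <pool> info "$img" | grep -q 30732d3238f3535 && echo "$img"
  done
  # is any object with that prefix still present in RADOS?
  rados -p <pool> ls | grep 30732d3238f3535
  # is any file for it left on the OSDs backing PG 2.2fc?
  find /var/lib/ceph/osd/ceph-78/current/2.2fc_head -name '*30732d3238f3535*'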
Actions #2

Updated by Greg Farnum over 6 years ago

Hmm, yep, it's getting into the needs_recovery_map somehow. The earliest reference to it in the log is

2017-10-18 12:09:28.176070 7ff8a54e1700 20 osd.78 pg_epoch: 383591 pg[2.2fc( v 383476'2437792 lc 0'0 (382825'2435026,383476'2437792] local-lis/les=383557/383559 n=15215 ec=1/1 lis/c 383557/382568 les/c/f 383559/382573/154843 383591/383591/383591) [78,93] r=0 lpr=383591 pi=[382568,383591)/4 crt=211748'1040300 mlcod 0'0 inconsistent+peering m=387]  after missing 2:3f4534eb:::rbd_data.30732d3238f3535.000000000000f7e3:3bc42 need 327428'1994485 have 0'0

and then
2017-10-18 12:09:28.255551 7ff8a54e1700  0 0x557fdeff4290 2.2fc unexpected need for 2:3f4534eb:::rbd_data.30732d3238f3535.000000000000f7e3:3bc42 have 327428'1994485 flags = none tried to add 327428'1994485 flags = none

Those are while processing the log for osd.93 in this case. So we're accepting a peer's missing set and then crashing when we don't have the object_info for an object they reference. There is a claim the object exists on osd.69.
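For reference, the peer missing/unfound state described above can be inspected from the primary along these lines (a sketch):

  ceph pg 2.2fc query         # look at "recovery_state", "peer_info" and "might_have_unfound"
  ceph pg 2.2fc list_missing  # objects the primary currently considers missing or unfound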

I'm not sure why we're requiring the ObjectContext to exist on ourselves, although it is for a snapshot so the logic seems to be a bit different.

Can you guys add debug logs for those OSDs that cover the time period when osd.78 boots up and crashes?
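For reference, raising the debug level and uploading the logs is roughly (a sketch; injectargs only affects a running daemon, so for an OSD that crashes at boot set the option in ceph.conf instead and restart):

  ceph tell osd.93 injectargs '--debug_osd 20 --debug_ms 1'
  ceph tell osd.69 injectargs '--debug_osd 20 --debug_ms 1'
  # or in ceph.conf on the OSD host, then restart the daemon:
  #   [osd]
  #   debug osd = 20
  ceph-post-file /var/log/ceph/ceph-osd.93.log   # prints a reference id like the ones quoted below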

Actions #3

Updated by Ana Aviles over 6 years ago

Sure, thanks! We have uploaded the log of osd.93 for that period (log post reference bd68ea72-3e9a-4bb3-896e-8db0f86e49ed) with debug_osd = 20. We also uploaded one for osd.69 (log post reference c8f0f6fd-287d-49bb-88e0-8002798637e6); sadly, however, the debug level is not 20 for that one.

Actions #4

Updated by Ana Aviles over 6 years ago

We have also generated another set of logs for osd.93 and osd.69; the references are f9a6fcec-8581-4ca3-a875-075e3745b071 and 627b644d-31f4-4f87-93ef-ce2a87cb13d8 respectively.

We cannot reproduce the first assert that osd.78 hit. The inconsistent PG initially belonged to [78, 93]; osd.78 was crashing over and over on that first assert, so we decided to mark it out (but leave it up). While data was shifting to the new PG set [93, 69], osd.93 crashed on the assert (assert(head_obc)) you mentioned. This is now the state we can reproduce.
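For reference, marking the OSD out while leaving it up, and checking where the PG remaps to, looks roughly like this (a sketch):

  ceph osd out 78    # data migrates away from osd.78; the daemon itself stays up
  ceph pg map 2.2fc  # shows the new up/acting set, e.g. [93,69]
  ceph -s            # watch backfill/recovery progress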

Actions #5

Updated by Ana Aviles over 6 years ago

We have some updates. With the help of Wido we got past this mysterious object. There were two problems:

osd.93 had a problem with object rbd_data.332d5a836bcc485.000000000000fcf6:466a7, which had missing attrs. We had previously removed it because it was identified as an unexpected clone; we saved the content but not the xattrs. Later we decided to put it back, but without the metadata. That was solved by copying the attributes from other objects.
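For reference, with the OSD stopped, per-object attributes can be listed and copied between on-disk objects with ceph-objectstore-tool along these lines (a sketch; the object JSON descriptors, paths and choice of attribute keys are placeholders, and this is not necessarily the exact procedure used here):

  systemctl stop ceph-osd@93
  # find the JSON descriptors of the broken clone and of a healthy donor object
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-93 --pgid 2.2fc --op list | grep 332d5a836bcc485
  # dump an attribute (e.g. "_", the object_info) from the donor and set it on the broken clone
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-93 '<donor-object-json>' get-attr _ > attr.bin
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-93 '<broken-object-json>' set-attr _ attr.bin
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-93 '<broken-object-json>' list-attrs
  systemctl start ceph-osd@93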

Once that was solved, we again hit the mysterious missing object rbd_data.30732d3238f3535.000000000000f7e3.

In the first place, it was complaining about not being able to find the snapdir of that object. By copying a random RBD object, preserving xattrs, to the location where that object should live, we got the OSD up and running. Still, how osd.69 claimed to have that object is a mystery to us. We then had that object as unfound, so we marked it as lost, and surprisingly enough osd.69 (the replacement for osd.78) then crashed in the same place osd.78 had initially crashed. We uploaded the logs of osd.69 (reference 021aea27-1ae3-4dae-890f-92206d3c3a90) and osd.93 (reference aa5cf92f-e5c6-4abd-8bf6-0eed958851de); I don't know if they are of use. I thought we had debug_osd = 20 on osd.69, but I believe not :( We can't reproduce now, because we removed the objects it was crashing on, and it works fine now. The objects we removed were:
rbd_data.332d5a836bcc485.000000000000fcf6:4663e
rbd_data.332d5a836bcc485.000000000000fcf6:466a7
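For reference, marking the unfound object as lost, as described above, corresponds to something like the following (a sketch; revert vs. delete decides what happens to the object):

  ceph pg 2.2fc mark_unfound_lost revert  # or: mark_unfound_lost delete
  ceph -w                                 # watch recovery proceed (or, as here, the OSD crash)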

Actions #6

Updated by Greg Farnum over 6 years ago

  • Status changed from New to Can't reproduce

Okay. Sounds like this was induced by a combination of whatever had previously broken your cluster and the manual repair surgery that went on.
