Bug #36337

open

OSDs crash with failed assertion in PGLog::merge_log as logs do not overlap

Added by Jonas Jelten over 5 years ago. Updated over 3 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hello!

Unfortunately, our single-node "cluster" with 11 OSDs is broken because some OSDs crash when they start peering.
I'm on Ubuntu 18.04 with Ceph Mimic (13.2.2).

The problem seems to have started when RAM filled up and OSD processes crashed because of memory allocation failures.

No weird commands (e.g. force_create_pg) were used on this cluster, and it was initially set up with 13.2.1.
The affected pool is a replicated pool with size=3 and min_size=2 (neither of which has been changed).

Crash log of osd.9 (only the crashed thread) is attached. It crashes with:

0> 2018-10-07 15:02:31.019 7f27b99a8700 -1 *** Caught signal (Aborted) **
in thread 7f27b99a8700 thread_name:tp_osd_tp
ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
1: (()+0x91b860) [0x55d747c9e860]
2: (()+0x12890) [0x7f27da05a890]
3: (gsignal()+0xc7) [0x7f27d8f1ee97]
4: (abort()+0x141) [0x7f27d8f20801]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x250) [0x7f27db96c970]
6: (()+0x26c9e7) [0x7f27db96c9e7]
7: (PGLog::merge_log(pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0x1ad0) [0x55d747865f50]
8: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t)+0x64) [0x55d7477c3194]
9: (PG::proc_master_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_missing_set<false>&, pg_shard_t)+0x8b) [0x55d747804abb]
10: (PG::RecoveryState::GetLog::react(PG::RecoveryState::GotLog const&)+0xab) [0x55d747804e4b]
11: (boost::statechart::simple_state<PG::RecoveryState::GetLog, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mp
12: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0xb3) [0x55d7478
13: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_b
14: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x143) [0x55d747811a53]
15: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0xcf) [0x55d74775250f]
16: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x50) [0x55d7479be820]
17: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x598) [0x55d747762148]
18: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x46e) [0x7f27db97175e]
19: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f27db9737e0]
20: (()+0x76db) [0x7f27da04f6db]
21: (clone()+0x3f) [0x7f27d900188f]
0> 2018-10-07 15:02:30.959 7f27b99a8700 -1 /home/jj/cephsrc/ceph-13.2.2/src/osd/PGLog.cc: In function 'void PGLog::merge_log(pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool
/home/jj/cephsrc/ceph-13.2.2/src/osd/PGLog.cc: 345: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)

The last hints are:

-485> 2018-10-07 15:02:30.955 7f27b99a8700 10 osd.9 pg_epoch: 13898 pg[4.2( v 3035'149287 (3035'146200,3035'149287] local-lis/les=3015/3016 n=6960 ec=140/140 lis/c 3161/3015 les/c/f 3162/3016/0 13898/13898/1389
-483> 2018-10-07 15:02:30.955 7f27b99a8700 10 merge_log log((3035'145746,3035'145746], crt=3035'145746) from osd.4 into log((3035'146200,3035'149287], crt=3035'149287)

So it seems that contacting osd.4 for pg 4.2 leads to the crash.
The logs no longer overlap, but Ceph still attempts to merge them:
osd.4 has (3035'145746,3035'145746], while osd.9 has (3035'146200,3035'149287], and these ranges do not overlap.

I guess it would be okay to get rid of the PG on osd.4, as its log there is older.
If that is true, Ceph should do this automatically.
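
For illustration, here is a minimal, self-contained C++ sketch of the precondition that fails at PGLog.cc:345, using the ranges from the log above. Eversion and logs_overlap are simplified stand-ins invented for this example (roughly analogous to Ceph's eversion_t and the check inside PGLog::merge_log), not Ceph's actual code:

#include <cstdint>
#include <iostream>

// Simplified stand-in for Ceph's eversion_t ("epoch'version", e.g. 3035'149287).
struct Eversion {
    uint64_t epoch;
    uint64_t version;
    bool operator>=(const Eversion& o) const {
        return epoch != o.epoch ? epoch > o.epoch : version >= o.version;
    }
};

// The precondition asserted at PGLog.cc:345: the two logs must overlap,
// i.e. each log's head must reach at least the other log's tail.
bool logs_overlap(Eversion log_tail, Eversion log_head,
                  Eversion olog_tail, Eversion olog_head) {
    return log_head >= olog_tail && olog_head >= log_tail;
}

int main() {
    // osd.9 (local log):    (3035'146200, 3035'149287]
    Eversion log_tail{3035, 146200}, log_head{3035, 149287};
    // osd.4 (incoming log): (3035'145746, 3035'145746]  -- empty and older
    Eversion olog_tail{3035, 145746}, olog_head{3035, 145746};

    // olog.head (3035'145746) < log.tail (3035'146200), so this prints "false",
    // the same condition that trips the FAILED assert in merge_log.
    std::cout << std::boolalpha
              << logs_overlap(log_tail, log_head, olog_tail, olog_head)
              << std::endl;
    return 0;
}

Run as written, this prints false: osd.4's log head is older than osd.9's log tail, which is exactly the case the assert refuses to merge.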

pools:

pool 1 'xxx_data' erasure size 10 min_size 9 crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 128 last_change 175 flags hashpspool,ec_overwrites stripe_width 32768 application cephfs
pool 3 'ssn_root' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 147 flags hashpspool stripe_width 0 application cephfs
pool 4 'ssn_metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 147 flags hashpspool stripe_width 0 application cephfs
pool 5 'xxx_backup' erasure size 10 min_size 8 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 last_change 176 flags hashpspool,ec_overwrites stripe_width 28672 application cephfs

pg dump is attached.

pg 4.2 query says this:

{
  "state": "recovery_wait+undersized+degraded+peered",
  "snap_trimq": "[]",
  "snap_trimq_len": 0,
  "epoch": 13902,
  "up": [
    4
  ],
  "acting": [
    4
  ],
  "acting_recovery_backfill": [
    "4"
  ],
  [...]
  "recovery_state": [
    {
      "name": "Started/Primary/Active",
      "enter_time": "2018-10-07 15:02:38.218711",
      "might_have_unfound": [
        {
          "osd": "7",
          "status": "osd is down"
        },
        {
          "osd": "9",
          "status": "osd is down"
        }
      ],
      "recovery_progress": {
        "backfill_targets": [],
        "waiting_on_backfill": [],
        "last_backfill_started": "MIN",
        "backfill_info": {
          "begin": "MIN",
          "end": "MIN",
          "objects": []
        },
        "peer_backfill_info": [],
        "backfills_in_flight": [],
        "recovering": [],
        "pg_backend": {
          "pull_from_peer": [],
          "pushing": []
        }
      },
    },
    {
      "name": "Started",
      "enter_time": "2018-10-07 15:02:37.251392"
    }
  ],

My current way of resolving this is a ceph-objectstore-tool export-delete of the PG on the OSD that has the broken, older log. In this case, I delete pg 4.2 from osd.4.

After repeating this until the cluster was completely healthy again (all PGs active+clean), I launched my MDS again, but this led to another corruption! Now other PGs have the same log-overlap issue, and I have to delete the broken PGs again. My guess is that the cluster actually wasn't active+clean, and writes by the MDS triggered log writes, which led to syncs and crashes again.

Any pointers on what can be done to prevent this from happening? And, of course, on how to either properly repair the corruption or at least fix it more efficiently?


Files

osd9-crash-threadlog20.log (797 KB) - log of osd.9 that has the crash - Jonas Jelten, 10/07/2018 01:16 PM
pgdump (85.8 KB) - ceph pg dump - Jonas Jelten, 10/07/2018 01:19 PM
ceph-osd.4.log (314 KB) - log of osd.4, which peered with osd.9 - Jonas Jelten, 10/07/2018 01:34 PM