Bug #12194
osd crash FAILED assert(!parent->get_log().get_missing().is_missing(soid))
Status: Closed
Description
After updating to Hammer 0.94.2 via 0.93 -> 0.94, one OSD crashes with the following assert:
-14> 2015-06-30 23:26:17.970533 7f02d0dd9700 1 -- 10.30.1.128:6805/4022 <== osd.63 10.30.1.125:6815/22138 11 ==== pg_info(1 pgs e28954:12.74) v4 ==== 759+0+0 (2625070229 0 0) 0xfe8d280 con 0xf2ac8e0
-13> 2015-06-30 23:26:17.970549 7f02d0dd9700 5 -- op tracker -- seq: 359, time: 2015-06-30 23:26:17.970359, event: header_read, op: pg_info(1 pgs e28954:12.74)
-12> 2015-06-30 23:26:17.970556 7f02d0dd9700 5 -- op tracker -- seq: 359, time: 2015-06-30 23:26:17.970360, event: throttled, op: pg_info(1 pgs e28954:12.74)
-11> 2015-06-30 23:26:17.970562 7f02d0dd9700 5 -- op tracker -- seq: 359, time: 2015-06-30 23:26:17.970380, event: all_read, op: pg_info(1 pgs e28954:12.74)
-10> 2015-06-30 23:26:17.970622 7f02d0dd9700 5 -- op tracker -- seq: 359, time: 2015-06-30 23:26:17.970548, event: dispatched, op: pg_info(1 pgs e28954:12.74)
-9> 2015-06-30 23:26:17.970636 7f02d0dd9700 5 -- op tracker -- seq: 359, time: 2015-06-30 23:26:17.970636, event: waiting_for_osdmap, op: pg_info(1 pgs e28954:12.74)
-8> 2015-06-30 23:26:17.970650 7f02d0dd9700 5 -- op tracker -- seq: 359, time: 2015-06-30 23:26:17.970650, event: started, op: pg_info(1 pgs e28954:12.74)
-7> 2015-06-30 23:26:17.970673 7f02d0dd9700 5 -- op tracker -- seq: 359, time: 2015-06-30 23:26:17.970673, event: done, op: pg_info(1 pgs e28954:12.74)
-6> 2015-06-30 23:26:17.970696 7f02d0dd9700 1 -- 10.30.1.128:6805/4022 <== osd.58 10.30.1.37:6802/2027655 9 ==== MRecoveryReserve REQUEST pgid: 6.6e, query_epoch: 28954 v2 ==== 26+0+0 (2409593423 0 0) 0x432cb00 con 0xf2af220
-5> 2015-06-30 23:26:17.970719 7f02d0dd9700 5 -- op tracker -- seq: 360, time: 2015-06-30 23:26:17.970049, event: header_read, op: MRecoveryReserve REQUEST pgid: 6.6e, query_epoch: 28954
-4> 2015-06-30 23:26:17.970726 7f02d0dd9700 5 -- op tracker -- seq: 360, time: 2015-06-30 23:26:17.970051, event: throttled, op: MRecoveryReserve REQUEST pgid: 6.6e, query_epoch: 28954
-3> 2015-06-30 23:26:17.970732 7f02d0dd9700 5 -- op tracker -- seq: 360, time: 2015-06-30 23:26:17.970057, event: all_read, op: MRecoveryReserve REQUEST pgid: 6.6e, query_epoch: 28954
-2> 2015-06-30 23:26:17.970737 7f02d0dd9700 5 -- op tracker -- seq: 360, time: 2015-06-30 23:26:17.970716, event: dispatched, op: MRecoveryReserve REQUEST pgid: 6.6e, query_epoch: 28954
-1> 2015-06-30 23:26:17.970744 7f02d0dd9700 5 -- op tracker -- seq: 360, time: 2015-06-30 23:26:17.970743, event: waiting_for_osdmap, op: MRecoveryReserve REQUEST pgid: 6.6e, query_epoch: 28954
0> 2015-06-30 23:26:17.972003 7f02c31c3700 -1 osd/ReplicatedBackend.cc: In function 'void ReplicatedBackend::sub_op_modify_impl(OpRequestRef) [with T = MOSDRepOp, int MSGTYPE = 112]' thread 7f02c31c3700 time 2015-06-30 23:26:17.968440
osd/ReplicatedBackend.cc: 1138: FAILED assert(!parent->get_log().get_missing().is_missing(soid))
ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
 1: (void ReplicatedBackend::sub_op_modify_impl<MOSDRepOp, 112>(std::tr1::shared_ptr<OpRequest>)+0x1488) [0x9d7898]
 2: (ReplicatedBackend::sub_op_modify(std::tr1::shared_ptr<OpRequest>)+0x56) [0x9bdcb6]
 3: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x37f) [0x9c0d9f]
 4: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x184) [0x873b04]
 5: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x178) [0x683318]
 6: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x59e) [0x683cae]
 7: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x795) [0xafe0f5]
 8: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xb019f0]
 9: /lib64/libpthread.so.0() [0x37632079d1]
 10: (clone()+0x6d) [0x3762ee88fd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 5 filestore
   1/ 3 keyvaluestore
   1/ 1 journal
   0/ 5 ms
   1/ 5 mon
   5/20 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 10000
  max_new 1000
  log_file /var/log/ceph/ceph-osd.85.log
Each time I launch the OSD it crashes on a different PG, but with the same stack. I attached the full log.
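For context, the failing check is in ReplicatedBackend::sub_op_modify_impl: when a replica OSD receives a replicated write (MOSDRepOp) from the primary, it asserts that the target object (soid) is not in its own PG missing set, since a missing object is supposed to be brought back by recovery before any new writes land on it. The crash therefore means the replica's missing set disagrees with what the primary believes the replica has. A minimal standalone sketch of that invariant, with simplified types and hypothetical object names (not the actual Ceph source):

#include <cassert>
#include <set>
#include <string>

// Hypothetical, simplified stand-in for the replica's pg_missing_t.
struct MissingSet {
  std::set<std::string> objects;
  bool is_missing(const std::string& soid) const {
    return objects.count(soid) != 0;
  }
};

// Shape of the check that fires at osd/ReplicatedBackend.cc:1138: a replica
// applying a replicated write asserts the object is not still "missing".
void apply_rep_op(const MissingSet& missing, const std::string& soid) {
  assert(!missing.is_missing(soid));  // the assert reported in this ticket
  // ... apply the transaction to the object store ...
}

int main() {
  MissingSet missing;
  missing.objects.insert("rbd_data.1234");    // an object still being recovered (hypothetical name)
  apply_rep_op(missing, "rbd_data.5678");     // fine: object is not in the missing set
  // apply_rep_op(missing, "rbd_data.1234");  // would trip the assert, as in the log above
  return 0;
}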
Updated by Ben Hines over 8 years ago
I continue to encounter this one periodically; it has now happened on a second OSD. btrfs, CentOS 7.1.
Updated by Adolfo Brandes over 8 years ago
I can confirm what looks like this exact same bug. While still on 0.93, the filesystem on the OSDs had to be converted from btrfs to xfs "the hard way" (rsyncing the data out of btrfs with -aX and back into recreated xfs OSDs), but everything worked OK. After upgrading to 0.94.2 (Ubuntu 14.04 on kernel 3.19, by the way), I see this exact behavior when restarting OSDs. To fix it, I need to zap and recreate each disk, one node at a time, but the problem "propagates" back to already converted OSDs even if I don't restart a 0.94.2 OSD on old data and simply zap it out of the gate.
The latter is probably due to the 0.93 OSDs elsewhere propagating bogus data.
Updated by Sage Weil over 8 years ago
Does this appear to be related to the 0.93 -> hammer upgrade? I don't think we've seen this with stable users...
Updated by Ben Hines over 8 years ago
I'm now on 0.94.3 (and planning to avoid future prereleases ;) -- I have not seen this crash since the last one a month ago.
Fortunately, most of the data in our cluster is deleted and recreated every few weeks, so old data is likely to get purged and repaired.
Updated by Luis Periquito about 8 years ago
I'm seeing the exact same issue in a production cluster that was created on Hammer (0.94.2) and has since been upgraded to 0.94.3 and 0.94.5.
This cluster has always used an XFS filesystem for the OSDs.
A rack was powered off (power failure). When the nodes came back up, some of the OSDs would crash. One of them crashed repeatedly, so I marked that one out; the others I just restarted and they kept working. I had hosts with multiple OSDs failing, hosts with only one OSD failing, and hosts without any failures.
The messages are all very similar and look like this:
2016-03-11 14:39:08.393277 7f4ee9681700 -1 osd/ReplicatedBackend.cc: In function 'void ReplicatedBackend::sub_op_modify_impl(OpRequestRef) [with T = MOSDRepOp; int MSGTYPE = 112; OpRequestRef = std::tr1::shared_ptr<OpRequest>]' thread 7f4ee9681700 time 2016-03-11 14:39:08.391222
osd/ReplicatedBackend.cc: 1138: FAILED assert(!parent->get_log().get_missing().is_missing(soid))
ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xbc60eb]
 2: (void ReplicatedBackend::sub_op_modify_impl<MOSDRepOp, 112>(std::tr1::shared_ptr<OpRequest>)+0xe19) [0xa2e0f9]
 3: (ReplicatedBackend::sub_op_modify(std::tr1::shared_ptr<OpRequest>)+0x49) [0xa19c09]
 4: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x38e) [0xa1a03e]
 5: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x167) [0x843de7]
 6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3d5) [0x6a1b05]
 7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x331) [0x6a2051]
 8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x875) [0xbb5ab5]
 9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbb7bd0]
 10: (()+0x8182) [0x7f4f09729182]
 11: (clone()+0x6d) [0x7f4f07c9447d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
As for the node I had set out: after the whole cluster recovered and was HEALTH_OK, I just set it back in, and it backfilled and is now working as normal.
The crushmap has a failure domain of rack.
Updated by Sage Weil about 7 years ago
- Status changed from Need More Info to Won't Fix
Please reopen this ticket if you are still seeing this (on a later hammer version or on jewel+).