Bug #12194

osd crash FAILED assert(!parent->get_log().get_missing().is_missing(soid))

Added by Ben Hines over 8 years ago. Updated almost 7 years ago.

Status: Won't Fix
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After updating to Hammer 0.94.2 via 0.93 -> 0.94, one OSD crashes with the following assert.

-14> 2015-06-30 23:26:17.970533 7f02d0dd9700  1 -- 10.30.1.128:6805/4022 <== osd.63 10.30.1.125:6815/22138 11 ==== pg_info(1 pgs e28954:12.74) v4 ==== 759+0+0 (2625070229 0 0) 0xfe8d280 con 0xf2ac8e0
   -13> 2015-06-30 23:26:17.970549 7f02d0dd9700  5 -- op tracker -- seq: 359, time: 2015-06-30 23:26:17.970359, event: header_read, op: pg_info(1 pgs e28954:12.74)
   -12> 2015-06-30 23:26:17.970556 7f02d0dd9700  5 -- op tracker -- seq: 359, time: 2015-06-30 23:26:17.970360, event: throttled, op: pg_info(1 pgs e28954:12.74)
   -11> 2015-06-30 23:26:17.970562 7f02d0dd9700  5 -- op tracker -- seq: 359, time: 2015-06-30 23:26:17.970380, event: all_read, op: pg_info(1 pgs e28954:12.74)
   -10> 2015-06-30 23:26:17.970622 7f02d0dd9700  5 -- op tracker -- seq: 359, time: 2015-06-30 23:26:17.970548, event: dispatched, op: pg_info(1 pgs e28954:12.74)
    -9> 2015-06-30 23:26:17.970636 7f02d0dd9700  5 -- op tracker -- seq: 359, time: 2015-06-30 23:26:17.970636, event: waiting_for_osdmap, op: pg_info(1 pgs e28954:12.74)
    -8> 2015-06-30 23:26:17.970650 7f02d0dd9700  5 -- op tracker -- seq: 359, time: 2015-06-30 23:26:17.970650, event: started, op: pg_info(1 pgs e28954:12.74)
    -7> 2015-06-30 23:26:17.970673 7f02d0dd9700  5 -- op tracker -- seq: 359, time: 2015-06-30 23:26:17.970673, event: done, op: pg_info(1 pgs e28954:12.74)
    -6> 2015-06-30 23:26:17.970696 7f02d0dd9700  1 -- 10.30.1.128:6805/4022 <== osd.58 10.30.1.37:6802/2027655 9 ==== MRecoveryReserve REQUEST  pgid: 6.6e, query_epoch: 28954 v2 ==== 26+0+0 (2409593423 0 0) 0x432cb00 con 0xf2af220
    -5> 2015-06-30 23:26:17.970719 7f02d0dd9700  5 -- op tracker -- seq: 360, time: 2015-06-30 23:26:17.970049, event: header_read, op: MRecoveryReserve REQUEST  pgid: 6.6e, query_epoch: 28954
    -4> 2015-06-30 23:26:17.970726 7f02d0dd9700  5 -- op tracker -- seq: 360, time: 2015-06-30 23:26:17.970051, event: throttled, op: MRecoveryReserve REQUEST  pgid: 6.6e, query_epoch: 28954
    -3> 2015-06-30 23:26:17.970732 7f02d0dd9700  5 -- op tracker -- seq: 360, time: 2015-06-30 23:26:17.970057, event: all_read, op: MRecoveryReserve REQUEST  pgid: 6.6e, query_epoch: 28954
    -2> 2015-06-30 23:26:17.970737 7f02d0dd9700  5 -- op tracker -- seq: 360, time: 2015-06-30 23:26:17.970716, event: dispatched, op: MRecoveryReserve REQUEST  pgid: 6.6e, query_epoch: 28954
    -1> 2015-06-30 23:26:17.970744 7f02d0dd9700  5 -- op tracker -- seq: 360, time: 2015-06-30 23:26:17.970743, event: waiting_for_osdmap, op: MRecoveryReserve REQUEST  pgid: 6.6e, query_epoch: 28954
     0> 2015-06-30 23:26:17.972003 7f02c31c3700 -1 osd/ReplicatedBackend.cc: In function 'void ReplicatedBackend::sub_op_modify_impl(OpRequestRef) [with T = MOSDRepOp, int MSGTYPE = 112]' thread 7f02c31c3700 time 2015-06-30 23:26:17.968440
osd/ReplicatedBackend.cc: 1138: FAILED assert(!parent->get_log().get_missing().is_missing(soid))

 ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
 1: (void ReplicatedBackend::sub_op_modify_impl<MOSDRepOp, 112>(std::tr1::shared_ptr<OpRequest>)+0x1488) [0x9d7898]
 2: (ReplicatedBackend::sub_op_modify(std::tr1::shared_ptr<OpRequest>)+0x56) [0x9bdcb6]
 3: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x37f) [0x9c0d9f]
 4: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x184) [0x873b04]
 5: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x178) [0x683318]
 6: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x59e) [0x683cae]
 7: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x795) [0xafe0f5]
 8: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xb019f0]
 9: /lib64/libpthread.so.0() [0x37632079d1]
 10: (clone()+0x6d) [0x3762ee88fd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 5 filestore
   1/ 3 keyvaluestore
   1/ 1 journal
   0/ 5 ms
   1/ 5 mon
   5/20 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.85.log

Each time I launch the OSD it crashes on a different PG, but with the same stack. I have attached the full log.
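
For context on what the failed check is guarding: the assert sits on the replica-side write path, where an incoming replicated write (MOSDRepOp) is expected to target an object that the replica does not itself list as missing, since missing objects are supposed to be recovered before further writes are applied to them. The sketch below is only a simplified, standalone model of that invariant; the names MissingSet, ReplicaPG and apply_repop are illustrative stand-ins and are not the real classes or functions in osd/ReplicatedBackend.cc.

// Minimal, self-contained model of the invariant behind
//   FAILED assert(!parent->get_log().get_missing().is_missing(soid))
// Illustration only; class and function names here are hypothetical
// and do not match the actual Ceph sources.
#include <cassert>
#include <iostream>
#include <set>
#include <string>

// Stand-in for the PG log's missing set: objects this replica knows it
// does not have an up-to-date copy of and still needs to recover.
struct MissingSet {
    std::set<std::string> objects;

    bool is_missing(const std::string& soid) const {
        return objects.count(soid) != 0;
    }
};

// Stand-in for the replica-side handler of a replicated write (repop).
struct ReplicaPG {
    MissingSet missing;

    void apply_repop(const std::string& soid) {
        // Invariant: an ordinary replicated write should only arrive for
        // objects the replica already has; missing objects must be pushed
        // (recovered) first. If the object is still in the missing set,
        // the PG log and the incoming write disagree, so abort rather
        // than write on top of a stale or absent object.
        assert(!missing.is_missing(soid));
        std::cout << "applied write to " << soid << "\n";
    }
};

int main() {
    ReplicaPG pg;
    pg.missing.objects.insert("obj_A");   // pretend obj_A still needs recovery

    pg.apply_repop("obj_B");              // fine: obj_B is not missing
    pg.apply_repop("obj_A");              // trips the assert, like the crash above
    return 0;
}

In this reading, the crash means the OSD received an ordinary replicated write for an object its own PG log still records as missing, i.e. the replica's missing set is inconsistent with what the primary believes about it.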

ceph-osd.85.zip (236 KB) Ben Hines, 07/01/2015 06:44 AM

History

#1 Updated by Kefu Chai over 8 years ago

  • Description updated (diff)

#2 Updated by Ben Hines over 8 years ago

I continue to encounter this periodically; it has now happened on a second OSD. btrfs, CentOS 7.1.

#3 Updated by Adolfo Brandes over 8 years ago

I can confirm what looks like this exact bug. While still on 0.93, the filesystems on the OSDs had to be converted from btrfs to xfs "the hard way" (rsyncing -aX the data out of btrfs and back into recreated xfs OSDs), but everything worked OK. After upgrading to 0.94.2 (Ubuntu 14.04 on kernel 3.19, by the way), I see this exact behavior when restarting OSDs. To fix it, I need to zap and recreate each disk, one node at a time, but the problem "propagates" back to already-converted OSDs even if I don't restart a 0.94.2 OSD on old data and simply zap it out of the gate.

The latter is probably due to the 0.93 OSDs elsewhere propagating bogus data.

#4 Updated by Sage Weil over 8 years ago

Does this appear to be related to the 0.93 -> hammer upgrade? I don't think we've seen this with stable users...

#5 Updated by Sage Weil over 8 years ago

  • Status changed from New to Need More Info

#6 Updated by Ben Hines over 8 years ago

I'm now on 0.94.3 (and planning to avoid future prereleases ;) -- I have not seen this crash since the last one a month ago.

Fortunately, most of the data in our cluster is deleted and recreated every few weeks, so old data is likely to get purged and repaired.

#7 Updated by Luis Periquito about 8 years ago

I'm seeing the exact same issue in a production cluster that was created as Hammer (0.94.2) and has since been upgraded to 0.94.3 and 0.94.5.
This cluster has always used an XFS filesystem for the OSDs.

We had a rack powered off (power failure). When the nodes came back up, some of the OSDs would crash. One of them crashed repeatedly, and I set that one as out. The others I just restarted, and they then kept working. Some hosts had multiple OSDs failing, some hosts had only one OSD failing, and some hosts had no failures at all.

The messages are all very similar and look like this:

2016-03-11 14:39:08.393277 7f4ee9681700 -1 osd/ReplicatedBackend.cc: In function 'void ReplicatedBackend::sub_op_modify_impl(OpRequestRef) [with T = MOSDRepOp; int MSGTYPE = 112; OpRequestRef = std::tr1::shared_ptr<OpRequest>]' thread 7f4ee9681700 time 2016-03-11 14:39:08.391222
osd/ReplicatedBackend.cc: 1138: FAILED assert(!parent->get_log().get_missing().is_missing(soid))

 ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xbc60eb]
 2: (void ReplicatedBackend::sub_op_modify_impl<MOSDRepOp, 112>(std::tr1::shared_ptr<OpRequest>)+0xe19) [0xa2e0f9]
 3: (ReplicatedBackend::sub_op_modify(std::tr1::shared_ptr<OpRequest>)+0x49) [0xa19c09]
 4: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x38e) [0xa1a03e]
 5: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x167) [0x843de7]
 6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3d5) [0x6a1b05]
 7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x331) [0x6a2051]
 8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x875) [0xbb5ab5]
 9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbb7bd0]
 10: (()+0x8182) [0x7f4f09729182]
 11: (clone()+0x6d) [0x7f4f07c9447d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

As for the node I had set as out: after the whole cluster recovered and was HEALTH_OK, I just set it back in; it backfilled and is now working as normal.
The crushmap has a failure domain of rack.

#8 Updated by Sage Weil almost 7 years ago

  • Status changed from Need More Info to Won't Fix

Please reopen this ticket if you are still seeing this (on a later hammer version or on jewel+).
