Bug #12665

osd/ReplicatedPG.cc: 2706: FAILED assert(p != snapset.clones.end())

Added by Bram Pieters over 8 years ago. Updated over 8 years ago.

Status:
Resolved
Priority:
High
Assignee:
David Zafman

Source:
Community (user)
Regression:
No
Severity:
1 - critical

Description

After upgrading our Ceph cluster from 0.80.4 to 0.94.1 we have intermittent crashes on multiple OSDs.
Marking those OSDs out results in rebalancing of the cluster, which triggers other OSDs to crash.
It looks like some specific data is causing the crashes, but we have no clue which data it is.

Meanwhile we've cleaned up as much data as possible (roughly via the commands sketched below) by:
- removing old RBDs
- removing all snapshots of all RBDs
- copying RBDs that had snapshots to new RBDs via rbd copy
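
A minimal sketch of that cleanup, with placeholder pool and image names (not our real ones):

    # remove all snapshots of an old image, then delete it
    rbd snap purge <pool>/<old-image>
    rbd rm <pool>/<old-image>

    # copy an image that had snapshots to a fresh image, then replace the original
    rbd cp <pool>/<image> <pool>/<image>-copy
    rbd rm <pool>/<image>
    rbd rename <pool>/<image>-copy <pool>/<image>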

We currently auto-restart the OSDs every 5 minutes, but we're afraid data corruption will occur within RBDs because of intermittent I/O lockups at the clients caused by the continuous CRUSH map recalculations.

I've included log files from two of the OSDs, captured while they crashed.

Ceph Version:
  1. ceph -v
    ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
Our OSD tree:
  1. ceph osd tree
    ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
    -1 85.00000 root default
    -3 85.00000 rack unknownrack
    -9 4.00000 host netxen-25307
    71 1.00000 osd.71 up 1.00000 1.00000
    72 1.00000 osd.72 up 1.00000 1.00000
    73 1.00000 osd.73 up 1.00000 1.00000
    74 1.00000 osd.74 up 1.00000 1.00000
    -10 4.00000 host netxen-25308
    81 1.00000 osd.81 up 1.00000 1.00000
    82 1.00000 osd.82 up 1.00000 1.00000
    83 1.00000 osd.83 up 1.00000 1.00000
    84 1.00000 osd.84 up 1.00000 1.00000
    -11 4.00000 host netxen-25309
    91 1.00000 osd.91 up 1.00000 1.00000
    92 1.00000 osd.92 up 1.00000 1.00000
    93 1.00000 osd.93 up 1.00000 1.00000
    94 1.00000 osd.94 up 1.00000 1.00000
    -12 4.00000 host netxen-25310
    101 1.00000 osd.101 up 1.00000 1.00000
    102 1.00000 osd.102 up 1.00000 1.00000
    103 1.00000 osd.103 up 1.00000 1.00000
    104 1.00000 osd.104 up 1.00000 1.00000
    -7 4.00000 host netxen-25311
    111 1.00000 osd.111 up 1.00000 1.00000
    112 1.00000 osd.112 up 1.00000 1.00000
    113 1.00000 osd.113 up 1.00000 1.00000
    114 1.00000 osd.114 up 1.00000 1.00000
    -8 4.00000 host netxen-25312
    121 1.00000 osd.121 down 0 1.00000
    122 1.00000 osd.122 down 0 1.00000
    123 1.00000 osd.123 down 0 1.00000
    124 1.00000 osd.124 down 0 1.00000
    -13 0 host netxen-25313
    131 0 osd.131 up 0 1.00000
    132 0 osd.132 up 1.00000 1.00000
    133 0 osd.133 up 0 1.00000
    134 0 osd.134 up 1.00000 1.00000
    -14 4.00000 host netxen-25314
    141 1.00000 osd.141 up 0 1.00000
    142 1.00000 osd.142 down 1.00000 1.00000
    143 1.00000 osd.143 up 0 1.00000
    144 1.00000 osd.144 up 0 1.00000
    -15 4.00000 host netxen-25315
    151 1.00000 osd.151 up 1.00000 1.00000
    152 1.00000 osd.152 up 1.00000 1.00000
    153 1.00000 osd.153 up 1.00000 1.00000
    154 1.00000 osd.154 up 1.00000 1.00000
    -16 4.00000 host netxen-25316
    161 1.00000 osd.161 up 1.00000 1.00000
    162 1.00000 osd.162 up 1.00000 1.00000
    163 1.00000 osd.163 up 1.00000 1.00000
    164 1.00000 osd.164 up 1.00000 1.00000
    -17 4.00000 host netxen-25317
    171 1.00000 osd.171 up 1.00000 1.00000
    172 1.00000 osd.172 up 1.00000 1.00000
    173 1.00000 osd.173 up 1.00000 1.00000
    174 1.00000 osd.174 up 1.00000 1.00000
    -18 4.00000 host netxen-25318
    181 1.00000 osd.181 up 1.00000 1.00000
    182 1.00000 osd.182 up 1.00000 1.00000
    183 1.00000 osd.183 up 1.00000 1.00000
    184 1.00000 osd.184 up 1.00000 1.00000
    -19 4.00000 host netxen-25319
    191 1.00000 osd.191 up 1.00000 1.00000
    192 1.00000 osd.192 up 1.00000 1.00000
    193 1.00000 osd.193 up 1.00000 1.00000
    194 1.00000 osd.194 up 1.00000 1.00000
    -20 4.00000 host netxen-25320
    201 1.00000 osd.201 up 1.00000 1.00000
    202 1.00000 osd.202 up 1.00000 1.00000
    203 1.00000 osd.203 up 1.00000 1.00000
    204 1.00000 osd.204 up 1.00000 1.00000
    -21 3.00000 host netxen-25321
    211 1.00000 osd.211 up 1.00000 1.00000
    212 1.00000 osd.212 up 0 1.00000
    213 0 osd.213 up 1.00000 1.00000
    214 1.00000 osd.214 up 0 1.00000
    -22 4.00000 host netxen-25322
    221 1.00000 osd.221 up 1.00000 1.00000
    222 1.00000 osd.222 up 1.00000 1.00000
    223 1.00000 osd.223 up 1.00000 1.00000
    224 1.00000 osd.224 up 1.00000 1.00000
    -2 4.00000 host netxen-25323
    231 1.00000 osd.231 up 1.00000 1.00000
    232 1.00000 osd.232 up 1.00000 1.00000
    233 1.00000 osd.233 up 1.00000 1.00000
    234 1.00000 osd.234 up 1.00000 1.00000
    -4 4.00000 host netxen-25324
    241 1.00000 osd.241 up 1.00000 1.00000
    242 1.00000 osd.242 up 1.00000 1.00000
    243 1.00000 osd.243 up 1.00000 1.00000
    244 1.00000 osd.244 up 1.00000 1.00000
    -5 4.00000 host netxen-25325
    251 1.00000 osd.251 up 1.00000 1.00000
    252 1.00000 osd.252 up 1.00000 1.00000
    253 1.00000 osd.253 up 1.00000 1.00000
    254 1.00000 osd.254 up 1.00000 1.00000
    -6 1.00000 host netxen-25326
    261 0 osd.261 up 0 1.00000
    262 0 osd.262 up 0 1.00000
    263 1.00000 osd.263 up 1.00000 1.00000
    264 0 osd.264 up 0 1.00000
    -23 4.00000 host netxen-25327
    271 1.00000 osd.271 up 1.00000 1.00000
    272 1.00000 osd.272 up 1.00000 1.00000
    273 1.00000 osd.273 up 1.00000 1.00000
    274 1.00000 osd.274 up 1.00000 1.00000
    -24 4.00000 host netxen-25328
    281 1.00000 osd.281 up 1.00000 1.00000
    282 1.00000 osd.282 up 1.00000 1.00000
    283 1.00000 osd.283 up 1.00000 1.00000
    284 1.00000 osd.284 up 1.00000 1.00000
    -25 1.00000 host netxen25329
    291 0 osd.291 up 0 1.00000
    292 0 osd.292 up 0 1.00000
    293 0 osd.293 down 1.00000 1.00000
    294 1.00000 osd.294 up 0 1.00000
    -26 4.00000 host netxen25330
    301 1.00000 osd.301 up 1.00000 1.00000
    302 1.00000 osd.302 up 1.00000 1.00000
    303 1.00000 osd.303 up 1.00000 1.00000
    304 1.00000 osd.304 up 1.00000 1.00000
Ceph Health
  1. ceph -s
    cluster 2e1396f7-deaa-45b5-9db9-62e046089435
    health HEALTH_WARN
    3 pgs backfilling
    786 pgs degraded
    1 pgs recovering
    992 pgs stuck unclean
    786 pgs undersized
    1 requests are blocked > 32 sec
    recovery 821761/20467274 objects degraded (4.015%)
    recovery 943602/20467274 objects misplaced (4.610%)
    recovery 1/6821655 unfound (0.000%)
    3/80 in osds are down
    noout,noscrub,nodeep-scrub flag(s) set
    monmap e13: 3 mons at {0=192.168.252.76:6789/0,2=192.168.252.36:6789/0,4=192.168.252.37:6789/0}
    election epoch 6506, quorum 0,1,2 2,4,0
    mdsmap e2050: 1/1/1 up {0=2=up:active}, 1 up:standby
    osdmap e140054: 96 osds: 89 up, 80 in; 722 remapped pgs
    flags noout,noscrub,nodeep-scrub
    pgmap v68664097: 5760 pgs, 11 pools, 11143 GB data, 6661 kobjects
    32536 GB used, 41576 GB / 74112 GB avail
    821761/20467274 objects degraded (4.015%)
    943602/20467274 objects misplaced (4.610%)
    1/6821655 unfound (0.000%)
    4316 active+clean
    722 active+undersized+degraded
    655 active+remapped
    63 active+undersized+degraded+remapped
    3 active+remapped+backfilling
    1 active+recovering+undersized+degraded
    client io 12887 kB/s rd, 10347 kB/s wr, 2123 op/s

Files

ceph-osd.144.log.gz (241 KB) ceph-osd.144.log.gz Bram Pieters, 08/10/2015 11:12 PM
ceph-osd.294.log.gz (28.1 KB) ceph-osd.294.log.gz Bram Pieters, 08/10/2015 11:15 PM
Actions #2

Updated by Bram Pieters over 8 years ago

An update:

We were able to finish the rebalancing by setting:
osd pg max concurrent snap trims = 0

We also need the noscrub and nodeep-scrub flags to be set to keep OSDs from crashing.
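
For reference, one way to apply these settings is roughly the following; the runtime injection is just an illustration, the option can equally go in ceph.conf:

    # cluster-wide flags to stop scrubbing
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # stop snap trimming on all OSDs at runtime
    ceph tell osd.* injectargs '--osd_pg_max_concurrent_snap_trims 0'

    # or persistently in ceph.conf under [osd]:
    #   osd pg max concurrent snap trims = 0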

We would really like some feedback or suggestions.

Actions #3

Updated by Bram Pieters over 8 years ago

The problem persists, so here is a small excerpt of the log from a crashing OSD...

2015-08-08 23:31:17.446134 7f4599edb780 0 filestore(/ceph/osd294) backend xfs (magic 0x58465342)
2015-08-08 23:31:17.449008 7f4599edb780 0 genericfilestorebackend(/ceph/osd294) detect_features: FIEMAP ioctl is supported and appears to work
2015-08-08 23:31:17.449017 7f4599edb780 0 genericfilestorebackend(/ceph/osd294) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-08-08 23:31:17.449162 7f4599edb780 0 genericfilestorebackend(/ceph/osd294) detect_features: syncfs(2) syscall not supported
2015-08-08 23:31:17.449169 7f4599edb780 0 genericfilestorebackend(/ceph/osd294) detect_features: no syncfs(2), must use sync(2).
2015-08-08 23:31:17.449172 7f4599edb780 0 genericfilestorebackend(/ceph/osd294) detect_features: WARNING: multiple ceph-osd daemons on the same host will be slow
2015-08-08 23:31:17.449238 7f4599edb780 0 xfsfilestorebackend(/ceph/osd294) detect_feature: extsize is supported and kernel 3.18.14 >= 3.5
2015-08-08 23:31:17.597437 7f4599edb780 0 filestore(/ceph/osd294) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2015-08-08 23:31:17.599591 7f4599edb780 1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway
2015-08-08 23:31:17.599593 7f4599edb780 1 journal _open /ceph/journals/osd294.journal fd 20: 10737418240 bytes, block size 4096 bytes, directio = 1, aio = 0
2015-08-08 23:31:18.966346 7f4599edb780 1 journal _open /ceph/journals/osd294.journal fd 19: 10737418240 bytes, block size 4096 bytes, directio = 1, aio = 0
2015-08-08 23:31:18.969114 7f4599edb780 0 <cls> cls/hello/cls_hello.cc:271: loading cls_hello
2015-08-08 23:31:18.971901 7f4599edb780 0 osd.294 131995 load_pgs
2015-08-08 23:31:23.370619 7f4599edb780 0 osd.294 131995 load_pgs opened 236 pgs
2015-08-08 23:31:23.371524 7f4599edb780 -1 osd.294 131995 log_to_monitors {default=true}
2015-08-08 23:31:23.385258 7f457fc2f700 0 osd.294 131995 ignoring osdmap until we have initialized
2015-08-08 23:31:23.388683 7f457fc2f700 0 osd.294 131995 ignoring osdmap until we have initialized
2015-08-08 23:31:23.430444 7f4599edb780 0 osd.294 131995 done with init, starting boot process
2015-08-08 23:31:23.628194 7f457741e700 -1 osd.294 131995 lsb_release_parse - pclose failed: (0) Success
2015-08-08 23:31:26.538498 7f458e06b700 0 -- 192.168.252.126:6907/549 >> 192.168.252.175:0/4144533872 pipe(0x2843a000 sd=33 :6907 s=0 pgs=0 cs=0 l=0 c=0x289fe000).accept peer addr is really 192.168.252.175:0/4144533872 (socket is 192.168.252.175:40657/0)
2015-08-08 23:31:36.539982 7f454bf06700 0 -- 192.168.252.126:6907/549 >> 192.168.252.20:0/2174871064 pipe(0x1bb93500 sd=420 :6907 s=0 pgs=0 cs=0 l=0 c=0x281ab9a0).accept peer addr is really 192.168.252.20:0/2174871064 (socket is 192.168.252.20:58718/0)
2015-08-08 23:32:08.345959 7f4570410700 -1 osd/ReplicatedPG.cc: In function 'ReplicatedPG::RepGather* ReplicatedPG::trim_object(const hobject_t&)' thread 7f4570410700 time 2015-08-08 23:32:08.344485
osd/ReplicatedPG.cc: 2706: FAILED assert(p != snapset.clones.end())

ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
1: (ReplicatedPG::trim_object(hobject_t const&)+0x19aa) [0x86f42a]
2: (ReplicatedPG::TrimmingObjects::react(ReplicatedPG::SnapTrim const&)+0x972) [0x876c92]
3: (boost::statechart::simple_state<ReplicatedPG::TrimmingObjects, ReplicatedPG::SnapTrimmer, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xa8) [0x8cd2d8]
4: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x16b) [0x8bd02b]
5: (ReplicatedPG::snap_trimmer()+0x4f0) [0x8348a0]
6: (OSD::SnapTrimWQ::_process(PG*)+0x1d) [0x68330d]
7: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4fa) [0xb0acca]
8: (ThreadPool::WorkThread::entry()+0x10) [0xb0c720]
9: (()+0x68ca) [0x7f45994098ca]
10: (clone()+0x6d) [0x7f4597adfb6d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
10000> 2015-08-08 23:31:59.192474 7f4575c1b700 5 - op tracker -- seq: 4641, time: 2015-08-08 23:31:59.192474, event: done, op: pg_backfill(progress 9.82 e 132009/132009 lb cdc9ac82/default.1812767.6_ebab74e952455e327d9a45574339efbd.jpg/head//9)
9999> 2015-08-08 23:31:59.193402 7f45852ff700 1 - 172.31.0.126:6906/549 --> 172.31.0.37:6902/12529 -- pg_trim(9.82 to 130716'91608 e132009) v1 -- ?+0 0x28663c00 con 0x274e1dc0
9998> 2015-08-08 23:31:59.195139 7f4585b00700 1 - 172.31.0.126:6906/549 --> 172.31.0.37:6902/12529 -- MOSDPGPushReply(9.82 132009 [PushReplyOp(eb6aac82/default.2431383.1_f16b-1e97b6f11ffb117fdc2ef7081566d4cce2a4/head//9)]) v2 -- ?+0 0x2c8d6c00 con 0x274e1dc0
9997> 2015-08-08 23:31:59.198878 7f456569b700 1 - 172.31.0.126:6906/549 <== osd.224 172.31.0.171:6906/7829 639 ==== MOSDPGPush(9.1bb 132009 [PushOp(8d841dbb/default.1812767.3_sys-master/root/h1f/h15/8806611615774/bodyMedia-8799173253184/head//9, version: 90902'92220, data_included: [0~14252], data_size: 14252, omap_header_size: 0, omap_entries_size: 0, attrset_size: 9, recovery_info: ObjectRecoveryInfo(8d841dbb/default.1812767.3_sys-master/root/h1f/h15/8806611615774/bodyMedia-8799173253184/head//9@90902'92220, copy_subset: [0~14252], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:14252, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))]) v2 ==== 16171+0+0 (1445467986 0 0) 0x2c8cbe00 con 0x274b7580
9996> 2015-08-08 23:31:59.198905 7f456569b700 5 - op tracker -- seq: 4642, time: 2015-08-08 23:31:59.198793, event: header_read, op: MOSDPGPush(9.1bb 132009 [PushOp(8d841dbb/default.1812767.3_sys-master/root/h1f/h15/8806611615774/bodyMedia-8799173253184/head//9, version: 90902'92220, data_included: [0~14252], data_size: 14252, omap_header_size: 0, omap_entries_size: 0, attrset_size: 9, recovery_info: ObjectRecoveryInfo(8d841dbb/default.1812767.3_sys-master/root/h1f/h15/8806611615774/bodyMedia-8799173253184/head//9@90902'92220, copy_subset: [0~14252], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:14252, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))])
9995> 2015-08-08 23:31:59.198927 7f456569b700 5 - op tracker -- seq: 4642, time: 2015-08-08 23:31:59.198795, event: throttled, op: MOSDPGPush(9.1bb 132009 [PushOp(8d841dbb/default.1812767.3_sys-master/root/h1f/h15/8806611615774/bodyMedia-8799173253184/head//9, version: 90902'92220, data_included: [0~14252], data_size: 14252, omap_header_size: 0, omap_entries_size: 0, attrset_size: 9, recovery_info: ObjectRecoveryInfo(8d841dbb/default.1812767.3_sys-master/root/h1f/h15/8806611615774/bodyMedia-8799173253184/head//9@90902'92220, copy_subset: [0~14252], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:14252, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))])
9994> 2015-08-08 23:31:59.198938 7f456569b700 5 - op tracker -- seq: 4642, time: 2015-08-08 23:31:59.198872, event: all_read, op: MOSDPGPush(9.1bb 132009 [PushOp(8d841dbb/default.1812767.3_sys-master/root/h1f/h15/8806611615774/bodyMedia-8799173253184/head//9, version: 90902'92220, data_included: [0~14252], data_size: 14252, omap_header_size: 0, omap_entries_size: 0, attrset_size: 9, recovery_info: ObjectRecoveryInfo(8d841dbb/default.1812767.3_sys-master/root/h1f/h15/8806611615774/bodyMedia-8799173253184/head//9@90902'92220, copy_subset: [0~14252], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:14252, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))])
9993> 2015-08-08 23:31:59.198947 7f456569b700 5 - op tracker -- seq: 4642, time: 0.000000, event: dispatched, op: MOSDPGPush(9.1bb 132009 [PushOp(8d841dbb/default.1812767.3_sys-master/root/h1f/h15/8806611615774/bodyMedia-8799173253184/head//9, version: 90902'92220, data_included: [0~14252], data_size: 14252, omap_header_size: 0, omap_entries_size: 0, attrset_size: 9, recovery_info: ObjectRecoveryInfo(8d841dbb/default.1812767.3_sys-master/root/h1f/h15/8806611615774/bodyMedia-8799173253184/head//9@90902'92220, copy_subset: [0~14252], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:14252, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))])
9992> 2015-08-08 23:31:59.199015 7f456569b700 1 - 172.31.0.126:6906/549 <== osd.224 172.31.0.171:6906/7829 640 ==== pg_backfill(progress 9.1bb e 132009/132009 lb 22a31dbb/default.2431891.2_312d-6f41d4c71246385099d7e0fba31898ec22d5/head//9) v3 ==== 812+0+0 (2046507159 0 0) 0x4171680 con 0x274b7580
9991> 2015-08-08 23:31:59.199026 7f456569b700 5 - op tracker -- seq: 4643, time: 2015-08-08 23:31:59.198980, event: header_read, op: pg_backfill(progress 9.1bb e 132009/132009 lb 22a31dbb/default.2431891.2_312d-6f41d4c71246385099d7e0fba31898ec22d5/head//9)

Actions #4

Updated by Sage Weil over 8 years ago

  • Status changed from New to Need More Info

Hi Brad,

Are you using cache tiering? This may be a result of an old bug, #8629, that was present in the older Firefly version you were running.

Actions #5

Updated by Bram Pieters over 8 years ago

Hi Sage,

No, until now we've never used cache tiering.
This cluster is pretty "old" and has gone through upgrades since v0.56.

The OSDs only started crashing about two weeks ago, while we were still running v0.80.4.
We were able to stop the crashing by setting noscrub and nodeep-scrub.

We hoped to fix this by upgrading to v0.94.1. That did not help, so we started to remove all snapshots from all RBDs.
This triggered numerous additional crashes on different OSDs (not all of them), and we had a lot of trouble getting back to a stable situation, as you can see from the ceph osd tree.

By setting
osd pg max concurrent snap trims = 0
together with noscrub and nodeep-scrub, the cluster was able to finish replicating and we then re-added all of the down OSDs.

But these three parameters/flags need to remain set to prevent "random" OSDs from going down.

Based on other issues we've seen in the bug tracker, we think this was fixed somewhere in the 0.8x series in a version we never ran, and that this "fix" no longer exists in 0.94.1.
So this could be an issue with upgrading directly from v0.80.4 to 0.94.1.

We have three other clusters that ran, and still run, the same versions. We tested the upgrade on all of those clusters first, without issues, before upgrading this cluster.
The difference is that the other clusters have never had snapshots on their RBDs.

Kr,
Bram

PS: Brad should be Bram ;)

Actions #6

Updated by Bram Pieters over 8 years ago

Hi Sage,

Any ideas on this problem?
Can we provide more logs, or should we run some extra tests?
The problem is very easy to trigger, so generating additional debug output is no problem.

Kr,
Bram

Actions #7

Updated by Sage Weil over 8 years ago

  • Source changed from other to Community (user)

Hi Bram,

Can you generate a full log (debug osd = 20, debug ms = 1) for a scrub that leads to a crash (if it does crash)? This will let us see what the corruption is and expand the repair code to deal with it gracefully.

If scrub doesn't crash, do the same for a snap trim that leads to a crash (with full logs).
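
For example, the debug levels could be raised roughly like this, either at runtime or in ceph.conf before restarting the daemon (osd.294 below is just an example id taken from the logs above):

    ceph tell osd.294 injectargs '--debug_osd 20 --debug_ms 1'

    # or in ceph.conf under [osd]:
    #   debug osd = 20
    #   debug ms = 1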

Thanks!

Actions #8

Updated by Paul Emmerich over 8 years ago

We are encountering the same issue after trying to delete a snapshot.
Here is a log file with debug osd = 20/20 (set with injectargs right after starting the OSD, hopefully early enough): https://dl.dropboxusercontent.com/u/24773939/ceph-crash.log.gz (1.7 MB)

A workaround is setting osd disk threads = 0 to get I/O running again, then exporting/backing up and deleting the broken objects/image.
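
Roughly, that workaround could look like this (pool and image names are placeholders for the affected image; this is just a sketch of the steps described above):

    # shrink the OSD disk thread pool to zero at runtime
    ceph tell osd.* injectargs '--osd_disk_threads 0'

    # back up the broken image, then remove it
    rbd export <pool>/<broken-image> /backup/<broken-image>.raw
    rbd rm <pool>/<broken-image>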

P.S.: uploading files here is broken; I get a "request entity too large" error when trying to upload 1.7 MB.

Actions #9

Updated by Sage Weil over 8 years ago

  • Status changed from Need More Info to 12
  • Assignee set to Sage Weil
Actions #10

Updated by Sage Weil over 8 years ago

  • Status changed from 12 to Need More Info

Paul Emmerich wrote:

We are encountering the same issue after trying to delete a snapshot.
Here is a log file with debug osd = 20/20 (set with inject_args right after starting it up, hopefully early enough): https://dl.dropboxusercontent.com/u/24773939/ceph-crash.log.gz (1.7mb)

A work-around is setting osd disk threads = 0 to get IO back up running, then export/backup and delete the broken objects/image.

P.S.: uploading files here is broken, I get a "request entity too large error" when trying to upload 1.7mb

Paul, can you reproduce that crash with debug osd = 20 and attach (a link to) the log?

Thanks!

Actions #11

Updated by Sage Weil over 8 years ago

  • Priority changed from Urgent to High
Actions #12

Updated by Paul Emmerich over 8 years ago

Here is a complete log with debug osd = 20 from startup to crash.

https://www.dropbox.com/s/62v6rprsfo2ghdh/crash.109.log.gz?dl=1 (10.2 MiB, ~100 MB uncompressed)

Actions #13

Updated by Sage Weil over 8 years ago

  • Assignee changed from Sage Weil to David Zafman
Actions #14

Updated by David Zafman over 8 years ago

  • Status changed from Need More Info to Fix Under Review

Crashes will be fixed by https://github.com/ceph/ceph/pull/5783

In particular commit https://github.com/dzafman/ceph/commit/3ec5b62d5f68c11514b56c8ac2f4165c07bf521e

I will resolve this on the assumption that the snapset corruption occurred because of a bug in an older release.

Actions #15

Updated by David Zafman over 8 years ago

  • Status changed from Fix Under Review to Resolved

eb0ca424815e94c78a2d09dbf787d102172f4ddf
