Bug #12665

osd/ReplicatedPG.cc: 2706: FAILED assert(p != snapset.clones.end())

Added by Bram Pieters over 8 years ago. Updated over 8 years ago.

Status:
Resolved
Priority:
High
Assignee:
David Zafman

Source:
Community (user)
Regression:
No
Severity:
1 - critical

Description

After upgrading our Ceph cluster from 0.80.4 to 0.94.1 we have intermittent crashes on multiple OSDs.
Marking those OSDs out results in rebalancing of the cluster, which triggers other OSDs to crash.
It looks like some specific data is causing the crashes, but we have no clue which data it is.

Meanwhile we've cleaned up as much data as possible (roughly via the commands sketched below) by:
- removing old RBDs
- removing all snapshots of all RBDs
- copying RBDs that had snapshots to new RBDs via rbd copy
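
A minimal sketch of that cleanup, with placeholder pool and image names (not our real ones):

    # remove all snapshots of an old image, then delete it
    rbd snap purge <pool>/<old-image>
    rbd rm <pool>/<old-image>

    # copy an image that had snapshots to a fresh image, then replace the original
    rbd cp <pool>/<image> <pool>/<image>-copy
    rbd rm <pool>/<image>
    rbd rename <pool>/<image>-copy <pool>/<image>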

We currently auto-restart the OSDs every 5 minutes, but we're afraid data corruption will occur within RBDs because of intermittent I/O lockups at the clients caused by the continuous CRUSH map recalculations.

I've included log files from two of the OSDs, captured while they crashed.

Ceph Version:
  1. ceph -v
    ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
Our OSD tree:
  1. ceph osd tree
    ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
    -1 85.00000 root default
    -3 85.00000 rack unknownrack
    -9 4.00000 host netxen-25307
    71 1.00000 osd.71 up 1.00000 1.00000
    72 1.00000 osd.72 up 1.00000 1.00000
    73 1.00000 osd.73 up 1.00000 1.00000
    74 1.00000 osd.74 up 1.00000 1.00000
    -10 4.00000 host netxen-25308
    81 1.00000 osd.81 up 1.00000 1.00000
    82 1.00000 osd.82 up 1.00000 1.00000
    83 1.00000 osd.83 up 1.00000 1.00000
    84 1.00000 osd.84 up 1.00000 1.00000
    -11 4.00000 host netxen-25309
    91 1.00000 osd.91 up 1.00000 1.00000
    92 1.00000 osd.92 up 1.00000 1.00000
    93 1.00000 osd.93 up 1.00000 1.00000
    94 1.00000 osd.94 up 1.00000 1.00000
    -12 4.00000 host netxen-25310
    101 1.00000 osd.101 up 1.00000 1.00000
    102 1.00000 osd.102 up 1.00000 1.00000
    103 1.00000 osd.103 up 1.00000 1.00000
    104 1.00000 osd.104 up 1.00000 1.00000
    -7 4.00000 host netxen-25311
    111 1.00000 osd.111 up 1.00000 1.00000
    112 1.00000 osd.112 up 1.00000 1.00000
    113 1.00000 osd.113 up 1.00000 1.00000
    114 1.00000 osd.114 up 1.00000 1.00000
    -8 4.00000 host netxen-25312
    121 1.00000 osd.121 down 0 1.00000
    122 1.00000 osd.122 down 0 1.00000
    123 1.00000 osd.123 down 0 1.00000
    124 1.00000 osd.124 down 0 1.00000
    -13 0 host netxen-25313
    131 0 osd.131 up 0 1.00000
    132 0 osd.132 up 1.00000 1.00000
    133 0 osd.133 up 0 1.00000
    134 0 osd.134 up 1.00000 1.00000
    -14 4.00000 host netxen-25314
    141 1.00000 osd.141 up 0 1.00000
    142 1.00000 osd.142 down 1.00000 1.00000
    143 1.00000 osd.143 up 0 1.00000
    144 1.00000 osd.144 up 0 1.00000
    -15 4.00000 host netxen-25315
    151 1.00000 osd.151 up 1.00000 1.00000
    152 1.00000 osd.152 up 1.00000 1.00000
    153 1.00000 osd.153 up 1.00000 1.00000
    154 1.00000 osd.154 up 1.00000 1.00000
    -16 4.00000 host netxen-25316
    161 1.00000 osd.161 up 1.00000 1.00000
    162 1.00000 osd.162 up 1.00000 1.00000
    163 1.00000 osd.163 up 1.00000 1.00000
    164 1.00000 osd.164 up 1.00000 1.00000
    -17 4.00000 host netxen-25317
    171 1.00000 osd.171 up 1.00000 1.00000
    172 1.00000 osd.172 up 1.00000 1.00000
    173 1.00000 osd.173 up 1.00000 1.00000
    174 1.00000 osd.174 up 1.00000 1.00000
    -18 4.00000 host netxen-25318
    181 1.00000 osd.181 up 1.00000 1.00000
    182 1.00000 osd.182 up 1.00000 1.00000
    183 1.00000 osd.183 up 1.00000 1.00000
    184 1.00000 osd.184 up 1.00000 1.00000
    -19 4.00000 host netxen-25319
    191 1.00000 osd.191 up 1.00000 1.00000
    192 1.00000 osd.192 up 1.00000 1.00000
    193 1.00000 osd.193 up 1.00000 1.00000
    194 1.00000 osd.194 up 1.00000 1.00000
    -20 4.00000 host netxen-25320
    201 1.00000 osd.201 up 1.00000 1.00000
    202 1.00000 osd.202 up 1.00000 1.00000
    203 1.00000 osd.203 up 1.00000 1.00000
    204 1.00000 osd.204 up 1.00000 1.00000
    -21 3.00000 host netxen-25321
    211 1.00000 osd.211 up 1.00000 1.00000
    212 1.00000 osd.212 up 0 1.00000
    213 0 osd.213 up 1.00000 1.00000
    214 1.00000 osd.214 up 0 1.00000
    -22 4.00000 host netxen-25322
    221 1.00000 osd.221 up 1.00000 1.00000
    222 1.00000 osd.222 up 1.00000 1.00000
    223 1.00000 osd.223 up 1.00000 1.00000
    224 1.00000 osd.224 up 1.00000 1.00000
    -2 4.00000 host netxen-25323
    231 1.00000 osd.231 up 1.00000 1.00000
    232 1.00000 osd.232 up 1.00000 1.00000
    233 1.00000 osd.233 up 1.00000 1.00000
    234 1.00000 osd.234 up 1.00000 1.00000
    -4 4.00000 host netxen-25324
    241 1.00000 osd.241 up 1.00000 1.00000
    242 1.00000 osd.242 up 1.00000 1.00000
    243 1.00000 osd.243 up 1.00000 1.00000
    244 1.00000 osd.244 up 1.00000 1.00000
    -5 4.00000 host netxen-25325
    251 1.00000 osd.251 up 1.00000 1.00000
    252 1.00000 osd.252 up 1.00000 1.00000
    253 1.00000 osd.253 up 1.00000 1.00000
    254 1.00000 osd.254 up 1.00000 1.00000
    -6 1.00000 host netxen-25326
    261 0 osd.261 up 0 1.00000
    262 0 osd.262 up 0 1.00000
    263 1.00000 osd.263 up 1.00000 1.00000
    264 0 osd.264 up 0 1.00000
    -23 4.00000 host netxen-25327
    271 1.00000 osd.271 up 1.00000 1.00000
    272 1.00000 osd.272 up 1.00000 1.00000
    273 1.00000 osd.273 up 1.00000 1.00000
    274 1.00000 osd.274 up 1.00000 1.00000
    -24 4.00000 host netxen-25328
    281 1.00000 osd.281 up 1.00000 1.00000
    282 1.00000 osd.282 up 1.00000 1.00000
    283 1.00000 osd.283 up 1.00000 1.00000
    284 1.00000 osd.284 up 1.00000 1.00000
    -25 1.00000 host netxen25329
    291 0 osd.291 up 0 1.00000
    292 0 osd.292 up 0 1.00000
    293 0 osd.293 down 1.00000 1.00000
    294 1.00000 osd.294 up 0 1.00000
    -26 4.00000 host netxen25330
    301 1.00000 osd.301 up 1.00000 1.00000
    302 1.00000 osd.302 up 1.00000 1.00000
    303 1.00000 osd.303 up 1.00000 1.00000
    304 1.00000 osd.304 up 1.00000 1.00000
Ceph Health
  1. ceph -s
    cluster 2e1396f7-deaa-45b5-9db9-62e046089435
    health HEALTH_WARN
    3 pgs backfilling
    786 pgs degraded
    1 pgs recovering
    992 pgs stuck unclean
    786 pgs undersized
    1 requests are blocked > 32 sec
    recovery 821761/20467274 objects degraded (4.015%)
    recovery 943602/20467274 objects misplaced (4.610%)
    recovery 1/6821655 unfound (0.000%)
    3/80 in osds are down
    noout,noscrub,nodeep-scrub flag(s) set
    monmap e13: 3 mons at {0=192.168.252.76:6789/0,2=192.168.252.36:6789/0,4=192.168.252.37:6789/0}
    election epoch 6506, quorum 0,1,2 2,4,0
    mdsmap e2050: 1/1/1 up {0=2=up:active}, 1 up:standby
    osdmap e140054: 96 osds: 89 up, 80 in; 722 remapped pgs
    flags noout,noscrub,nodeep-scrub
    pgmap v68664097: 5760 pgs, 11 pools, 11143 GB data, 6661 kobjects
    32536 GB used, 41576 GB / 74112 GB avail
    821761/20467274 objects degraded (4.015%)
    943602/20467274 objects misplaced (4.610%)
    1/6821655 unfound (0.000%)
    4316 active+clean
    722 active+undersized+degraded
    655 active+remapped
    63 active+undersized+degraded+remapped
    3 active+remapped+backfilling
    1 active+recovering+undersized+degraded
    client io 12887 kB/s rd, 10347 kB/s wr, 2123 op/s

Files

ceph-osd.144.log.gz (241 KB) ceph-osd.144.log.gz Bram Pieters, 08/10/2015 11:12 PM
ceph-osd.294.log.gz (28.1 KB) ceph-osd.294.log.gz Bram Pieters, 08/10/2015 11:15 PM
Actions #2

Updated by Bram Pieters over 8 years ago

An update:

We were able to finish the rebalancing by setting:
osd pg max concurrent snap trims = 0

We also need the noscrub and nodeep-scrub flags to be set to keep OSDs from crashing.
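
For reference, one way to apply these settings is roughly the following; the runtime injection is just an illustration, the option can equally go in ceph.conf:

    # cluster-wide flags to stop scrubbing
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # stop snap trimming on all OSDs at runtime
    ceph tell osd.* injectargs '--osd_pg_max_concurrent_snap_trims 0'

    # or persistently in ceph.conf under [osd]:
    #   osd pg max concurrent snap trims = 0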

We would really like some feedback or suggestions.

Actions #3

Updated by Bram Pieters over 8 years ago

The problem persists, so here is a small excerpt of the log from a crashing OSD...

2015-08-08 23:31:17.446134 7f4599edb780 0 filestore(/ceph/osd294) backend xfs (magic 0x58465342)
2015-08-08 23:31:17.449008 7f4599edb780 0 genericfilestorebackend(/ceph/osd294) detect_features: FIEMAP ioctl is supported and appears to work
2015-08-08 23:31:17.449017 7f4599edb780 0 genericfilestorebackend(/ceph/osd294) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-08-08 23:31:17.449162 7f4599edb780 0 genericfilestorebackend(/ceph/osd294) detect_features: syncfs(2) syscall not supported
2015-08-08 23:31:17.449169 7f4599edb780 0 genericfilestorebackend(/ceph/osd294) detect_features: no syncfs(2), must use sync(2).
2015-08-08 23:31:17.449172 7f4599edb780 0 genericfilestorebackend(/ceph/osd294) detect_features: WARNING: multiple ceph-osd daemons on the same host will be slow
2015-08-08 23:31:17.449238 7f4599edb780 0 xfsfilestorebackend(/ceph/osd294) detect_feature: extsize is supported and kernel 3.18.14 >= 3.5
2015-08-08 23:31:17.597437 7f4599edb780 0 filestore(/ceph/osd294) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2015-08-08 23:31:17.599591 7f4599edb780 1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway
2015-08-08 23:31:17.599593 7f4599edb780 1 journal _open /ceph/journals/osd294.journal fd 20: 10737418240 bytes, block size 4096 bytes, directio = 1, aio = 0
2015-08-08 23:31:18.966346 7f4599edb780 1 journal _open /ceph/journals/osd294.journal fd 19: 10737418240 bytes, block size 4096 bytes, directio = 1, aio = 0
2015-08-08 23:31:18.969114 7f4599edb780 0 <cls> cls/hello/cls_hello.cc:271: loading cls_hello
2015-08-08 23:31:18.971901 7f4599edb780 0 osd.294 131995 load_pgs
2015-08-08 23:31:23.370619 7f4599edb780 0 osd.294 131995 load_pgs opened 236 pgs
2015-08-08 23:31:23.371524 7f4599edb780 -1 osd.294 131995 log_to_monitors {default=true}
2015-08-08 23:31:23.385258 7f457fc2f700 0 osd.294 131995 ignoring osdmap until we have initialized
2015-08-08 23:31:23.388683 7f457fc2f700 0 osd.294 131995 ignoring osdmap until we have initialized
2015-08-08 23:31:23.430444 7f4599edb780 0 osd.294 131995 done with init, starting boot process
2015-08-08 23:31:23.628194 7f457741e700 -1 osd.294 131995 lsb_release_parse - pclose failed: (0) Success
2015-08-08 23:31:26.538498 7f458e06b700 0 -- 192.168.252.126:6907/549 >> 192.168.252.175:0/4144533872 pipe(0x2843a000 sd=33 :6907 s=0 pgs=0 cs=0 l=0 c=0x289fe000).accept peer addr is really 192.168.252.175:0/4144533872 (socket is 192.168.252.175:40657/0)
2015-08-08 23:31:36.539982 7f454bf06700 0 -- 192.168.252.126:6907/549 >> 192.168.252.20:0/2174871064 pipe(0x1bb93500 sd=420 :6907 s=0 pgs=0 cs=0 l=0 c=0x281ab9a0).accept peer addr is really 192.168.252.20:0/2174871064 (socket is 192.168.252.20:58718/0)
2015-08-08 23:32:08.345959 7f4570410700 -1 osd/ReplicatedPG.cc: In function 'ReplicatedPG::RepGather* ReplicatedPG::trim_object(const hobject_t&)' thread 7f4570410700 time 2015-08-08 23:32:08.344485
osd/ReplicatedPG.cc: 2706: FAILED assert(p != snapset.clones.end())

ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
1: (ReplicatedPG::trim_object(hobject_t const&)+0x19aa) [0x86f42a]
2: (ReplicatedPG::TrimmingObjects::react(ReplicatedPG::SnapTrim const&)+0x972) [0x876c92]
3: (boost::statechart::simple_state<ReplicatedPG::TrimmingObjects, ReplicatedPG::SnapTrimmer, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xa8) [0x8cd2d8]
4: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x16b) [0x8bd02b]
5: (ReplicatedPG::snap_trimmer()+0x4f0) [0x8348a0]
6: (OSD::SnapTrimWQ::_process(PG*)+0x1d) [0x68330d]
7: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4fa) [0xb0acca]
8: (ThreadPool::WorkThread::entry()+0x10) [0xb0c720]
9: (()+0x68ca) [0x7f45994098ca]
10: (clone()+0x6d) [0x7f4597adfb6d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
10000> 2015-08-08 23:31:59.192474 7f4575c1b700 5 - op tracker -- seq: 4641, time: 2015-08-08 23:31:59.192474, event: done, op: pg_backfill(progress 9.82 e 132009/132009 lb cdc9ac82/default.1812767.6_ebab74e952455e327d9a45574339efbd.jpg/head//9)
9999> 2015-08-08 23:31:59.193402 7f45852ff700 1 - 172.31.0.126:6906/549 --> 172.31.0.37:6902/12529 -- pg_trim(9.82 to 130716'91608 e132009) v1 -- ?+0 0x28663c00 con 0x274e1dc0
9998> 2015-08-08 23:31:59.195139 7f4585b00700 1 - 172.31.0.126:6906/549 --> 172.31.0.37:6902/12529 -- MOSDPGPushReply(9.82 132009 [PushReplyOp(eb6aac82/default.2431383.1_f16b-1e97b6f11ffb117fdc2ef7081566d4cce2a4/head//9)]) v2 -- ?+0 0x2c8d6c00 con 0x274e1dc0
9997> 2015-08-08 23:31:59.198878 7f456569b700 1 - 172.31.0.126:6906/549 <== osd.224 172.31.0.171:6906/7829 639 ==== MOSDPGPush(9.1bb 132009 [PushOp(8d841dbb/default.1812767.3_sys-master/root/h1f/h15/8806611615774/bodyMedia-8799173253184/head//9, version: 90902'92220, data_included: [0~14252], data_size: 14252, omap_header_size: 0, omap_entries_size: 0, attrset_size: 9, recovery_info: ObjectRecoveryInfo(8d841dbb/default.1812767.3_sys-master/root/h1f/h15/8806611615774/bodyMedia-8799173253184/head//9@90902'92220, copy_subset: [0~14252], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:14252, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))]) v2 ==== 16171+0+0 (1445467986 0 0) 0x2c8cbe00 con 0x274b7580
9996> 2015-08-08 23:31:59.198905 7f456569b700 5 - op tracker -- seq: 4642, time: 2015-08-08 23:31:59.198793, event: header_read, op: MOSDPGPush(9.1bb 132009 [PushOp(8d841dbb/default.1812767.3_sys-master/root/h1f/h15/8806611615774/bodyMedia-8799173253184/head//9, version: 90902'92220, data_included: [0~14252], data_size: 14252, omap_header_size: 0, omap_entries_size: 0, attrset_size: 9, recovery_info: ObjectRecoveryInfo(8d841dbb/default.1812767.3_sys-master/root/h1f/h15/8806611615774/bodyMedia-8799173253184/head//9@90902'92220, copy_subset: [0~14252], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:14252, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))])
9995> 2015-08-08 23:31:59.198927 7f456569b700 5 - op tracker -- seq: 4642, time: 2015-08-08 23:31:59.198795, event: throttled, op: MOSDPGPush(9.1bb 132009 [PushOp(8d841dbb/default.1812767.3_sys-master/root/h1f/h15/8806611615774/bodyMedia-8799173253184/head//9, version: 90902'92220, data_included: [0~14252], data_size: 14252, omap_header_size: 0, omap_entries_size: 0, attrset_size: 9, recovery_info: ObjectRecoveryInfo(8d841dbb/default.1812767.3_sys-master/root/h1f/h15/8806611615774/bodyMedia-8799173253184/head//9@90902'92220, copy_subset: [0~14252], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:14252, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))])
9994> 2015-08-08 23:31:59.198938 7f456569b700 5 - op tracker -- seq: 4642, time: 2015-08-08 23:31:59.198872, event: all_read, op: MOSDPGPush(9.1bb 132009 [PushOp(8d841dbb/default.1812767.3_sys-master/root/h1f/h15/8806611615774/bodyMedia-8799173253184/head//9, version: 90902'92220, data_included: [0~14252], data_size: 14252, omap_header_size: 0, omap_entries_size: 0, attrset_size: 9, recovery_info: ObjectRecoveryInfo(8d841dbb/default.1812767.3_sys-master/root/h1f/h15/8806611615774/bodyMedia-8799173253184/head//9@90902'92220, copy_subset: [0~14252], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:14252, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))])
9993> 2015-08-08 23:31:59.198947 7f456569b700 5 - op tracker -- seq: 4642, time: 0.000000, event: dispatched, op: MOSDPGPush(9.1bb 132009 [PushOp(8d841dbb/default.1812767.3_sys-master/root/h1f/h15/8806611615774/bodyMedia-8799173253184/head//9, version: 90902'92220, data_included: [0~14252], data_size: 14252, omap_header_size: 0, omap_entries_size: 0, attrset_size: 9, recovery_info: ObjectRecoveryInfo(8d841dbb/default.1812767.3_sys-master/root/h1f/h15/8806611615774/bodyMedia-8799173253184/head//9@90902'92220, copy_subset: [0~14252], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:14252, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))])
9992> 2015-08-08 23:31:59.199015 7f456569b700 1 - 172.31.0.126:6906/549 <== osd.224 172.31.0.171:6906/7829 640 ==== pg_backfill(progress 9.1bb e 132009/132009 lb 22a31dbb/default.2431891.2_312d-6f41d4c71246385099d7e0fba31898ec22d5/head//9) v3 ==== 812+0+0 (2046507159 0 0) 0x4171680 con 0x274b7580
9991> 2015-08-08 23:31:59.199026 7f456569b700 5 - op tracker -- seq: 4643, time: 2015-08-08 23:31:59.198980, event: header_read, op: pg_backfill(progress 9.1bb e 132009/132009 lb 22a31dbb/default.2431891.2_312d-6f41d4c71246385099d7e0fba31898ec22d5/head//9)

Actions #4

Updated by Sage Weil over 8 years ago

  • Status changed from New to Need More Info

Hi Brad,

Are you using cache tiering? This may be a result of an old bug, #8629, that was present in the older Firefly version you were running.

Actions #5

Updated by Bram Pieters over 8 years ago

Hi Sage,

No, until now we've never used cache tiering.
This cluster is pretty "old" and has gone through upgrades since v0.56.

The OSDs only started crashing about two weeks ago, while we were still running v0.80.4.
We were able to stop the crashing by setting noscrub and nodeep-scrub.

We hoped to fix this by upgrading to v0.94.1. That did not help, so we started to remove all snapshots from all RBDs.
This triggered numerous additional crashes on different OSDs (not all of them), and we had a lot of trouble getting back to a stable situation, as you can see from the ceph osd tree.

By setting
osd pg max concurrent snap trims = 0
together with noscrub and nodeep-scrub, the cluster was able to finish replicating and we then re-added all of the down OSDs.

But these three parameters/flags need to remain set to prevent "random" OSDs from going down.

Based on other issues we've seen in the bug tracker, we think this was fixed somewhere in the 0.8x series in a version we never ran, and that this "fix" no longer exists in 0.94.1.
So this could be an issue with upgrading directly from v0.80.4 to 0.94.1.

We have three other clusters that ran, and still run, the same versions. We tested the upgrade on all of those clusters first, without issues, before upgrading this cluster.
The difference is that the other clusters have never had snapshots on their RBDs.

Kr,
Bram

PS: Brad should be Bram ;)

Actions #6

Updated by Bram Pieters over 8 years ago

Hi Sage,

Any ideas on this problem?
Can we provide more logs, or should we run some extra tests?
The problem is very easy to trigger, so generating additional debug output is no problem.

Kr,
Bram

Actions #7

Updated by Sage Weil over 8 years ago

  • Source changed from other to Community (user)

Hi Bram,

Can you generate a full log (debug osd = 20, debug ms = 1) for a scrub that leads to a crash (if it does crash)? This will let us see what the corruption is and expand the repair code to deal with it gracefully.

If scrub doesn't crash, do the same for a snap trim that leads to a crash (with full logs).
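
For example, the debug levels could be raised roughly like this, either at runtime or in ceph.conf before restarting the daemon (osd.294 below is just an example id taken from the logs above):

    ceph tell osd.294 injectargs '--debug_osd 20 --debug_ms 1'

    # or in ceph.conf under [osd]:
    #   debug osd = 20
    #   debug ms = 1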

Thanks!

Actions #8

Updated by Paul Emmerich over 8 years ago

We are encountering the same issue after trying to delete a snapshot.
Here is a log file with debug osd = 20/20 (set with injectargs right after starting the OSD, hopefully early enough): https://dl.dropboxusercontent.com/u/24773939/ceph-crash.log.gz (1.7 MB)

A workaround is setting osd disk threads = 0 to get I/O running again, then exporting/backing up and deleting the broken objects/image.
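
Roughly, that workaround could look like this (pool and image names are placeholders for the affected image; this is just a sketch of the steps described above):

    # shrink the OSD disk thread pool to zero at runtime
    ceph tell osd.* injectargs '--osd_disk_threads 0'

    # back up the broken image, then remove it
    rbd export <pool>/<broken-image> /backup/<broken-image>.raw
    rbd rm <pool>/<broken-image>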

P.S.: uploading files here is broken; I get a "request entity too large" error when trying to upload 1.7 MB.

Actions #9

Updated by Sage Weil over 8 years ago

  • Status changed from Need More Info to 12
  • Assignee set to Sage Weil
Actions #10

Updated by Sage Weil over 8 years ago

  • Status changed from 12 to Need More Info

Paul Emmerich wrote:

We are encountering the same issue after trying to delete a snapshot.
Here is a log file with debug osd = 20/20 (set with inject_args right after starting it up, hopefully early enough): https://dl.dropboxusercontent.com/u/24773939/ceph-crash.log.gz (1.7mb)

A work-around is setting osd disk threads = 0 to get IO back up running, then export/backup and delete the broken objects/image.

P.S.: uploading files here is broken, I get a "request entity too large error" when trying to upload 1.7mb

Paul, can you reproduce that crash with debug osd = 20 and attach (a link to) the log?

Thanks!

Actions #11

Updated by Sage Weil over 8 years ago

  • Priority changed from Urgent to High
Actions #12

Updated by Paul Emmerich over 8 years ago

Here is a complete log with debug osd = 20 from startup to crash.

https://www.dropbox.com/s/62v6rprsfo2ghdh/crash.109.log.gz?dl=1 (10.2 MiB, ~100 MB uncompressed)

Actions #13

Updated by Sage Weil over 8 years ago

  • Assignee changed from Sage Weil to David Zafman
Actions #14

Updated by David Zafman over 8 years ago

  • Status changed from Need More Info to Fix Under Review

Crashes will be fixed by https://github.com/ceph/ceph/pull/5783

In particular commit https://github.com/dzafman/ceph/commit/3ec5b62d5f68c11514b56c8ac2f4165c07bf521e

I will resolve this on the assumption that the snapset corruption occurred because of a bug in an older release.

Actions #15

Updated by David Zafman over 8 years ago

  • Status changed from Fix Under Review to Resolved

eb0ca424815e94c78a2d09dbf787d102172f4ddf
