Bug #12665: osd/ReplicatedPG.cc: 2706: FAILED assert(p != snapset.clones.end())
Status: Closed
Description
After upgrading our Ceph cluster from 0.80.4 to 0.94.1 we have intermittent crashes on multiple OSDs.
Marking those OSDs out results in rebalancing of the cluster, which triggers other OSDs to crash.
It looks like some specific data is causing the crashes, but we have no clue which data it is.
Meanwhile we have cleaned up as much data as possible by (see the command sketch below):
- removing old rbd images
- removing all snapshots of all rbd images
- copying rbd images that had snapshots to new images via rbd copy
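Roughly, the cleanup amounted to standard rbd CLI commands like the following (the pool and image names here are only examples, not the real ones):

rbd rm rbd/old-image                        # remove an rbd image that is no longer needed
rbd snap purge rbd/kept-image               # delete all snapshots of an image
rbd cp rbd/kept-image rbd/kept-image-copy   # flat copy of the data, without the snapshot history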
For now we automatically restart the crashed OSDs every 5 minutes, but we are afraid data corruption will occur within the rbd images because of the intermittent I/O lockups at the clients caused by the continuous CRUSH map recalculations.
I've included 2 log files from 2 OSDs, captured while they crashed.
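(For illustration only: on a 0.94 sysvinit deployment such a periodic restart could be a cron entry roughly like the one below; the file path and OSD id are made up, and the actual restart mechanism isn't shown in this report.)

# /etc/cron.d/restart-crashed-osds -- one line per affected OSD
*/5 * * * * root service ceph start osd.121 >/dev/null 2>&1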
Ceph version:
- ceph -v
ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
- ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 85.00000 root default
-3 85.00000 rack unknownrack
-9 4.00000 host netxen-25307
71 1.00000 osd.71 up 1.00000 1.00000
72 1.00000 osd.72 up 1.00000 1.00000
73 1.00000 osd.73 up 1.00000 1.00000
74 1.00000 osd.74 up 1.00000 1.00000
-10 4.00000 host netxen-25308
81 1.00000 osd.81 up 1.00000 1.00000
82 1.00000 osd.82 up 1.00000 1.00000
83 1.00000 osd.83 up 1.00000 1.00000
84 1.00000 osd.84 up 1.00000 1.00000
-11 4.00000 host netxen-25309
91 1.00000 osd.91 up 1.00000 1.00000
92 1.00000 osd.92 up 1.00000 1.00000
93 1.00000 osd.93 up 1.00000 1.00000
94 1.00000 osd.94 up 1.00000 1.00000
-12 4.00000 host netxen-25310
101 1.00000 osd.101 up 1.00000 1.00000
102 1.00000 osd.102 up 1.00000 1.00000
103 1.00000 osd.103 up 1.00000 1.00000
104 1.00000 osd.104 up 1.00000 1.00000
-7 4.00000 host netxen-25311
111 1.00000 osd.111 up 1.00000 1.00000
112 1.00000 osd.112 up 1.00000 1.00000
113 1.00000 osd.113 up 1.00000 1.00000
114 1.00000 osd.114 up 1.00000 1.00000
-8 4.00000 host netxen-25312
121 1.00000 osd.121 down 0 1.00000
122 1.00000 osd.122 down 0 1.00000
123 1.00000 osd.123 down 0 1.00000
124 1.00000 osd.124 down 0 1.00000
-13 0 host netxen-25313
131 0 osd.131 up 0 1.00000
132 0 osd.132 up 1.00000 1.00000
133 0 osd.133 up 0 1.00000
134 0 osd.134 up 1.00000 1.00000
-14 4.00000 host netxen-25314
141 1.00000 osd.141 up 0 1.00000
142 1.00000 osd.142 down 1.00000 1.00000
143 1.00000 osd.143 up 0 1.00000
144 1.00000 osd.144 up 0 1.00000
-15 4.00000 host netxen-25315
151 1.00000 osd.151 up 1.00000 1.00000
152 1.00000 osd.152 up 1.00000 1.00000
153 1.00000 osd.153 up 1.00000 1.00000
154 1.00000 osd.154 up 1.00000 1.00000
-16 4.00000 host netxen-25316
161 1.00000 osd.161 up 1.00000 1.00000
162 1.00000 osd.162 up 1.00000 1.00000
163 1.00000 osd.163 up 1.00000 1.00000
164 1.00000 osd.164 up 1.00000 1.00000
-17 4.00000 host netxen-25317
171 1.00000 osd.171 up 1.00000 1.00000
172 1.00000 osd.172 up 1.00000 1.00000
173 1.00000 osd.173 up 1.00000 1.00000
174 1.00000 osd.174 up 1.00000 1.00000
-18 4.00000 host netxen-25318
181 1.00000 osd.181 up 1.00000 1.00000
182 1.00000 osd.182 up 1.00000 1.00000
183 1.00000 osd.183 up 1.00000 1.00000
184 1.00000 osd.184 up 1.00000 1.00000
-19 4.00000 host netxen-25319
191 1.00000 osd.191 up 1.00000 1.00000
192 1.00000 osd.192 up 1.00000 1.00000
193 1.00000 osd.193 up 1.00000 1.00000
194 1.00000 osd.194 up 1.00000 1.00000
-20 4.00000 host netxen-25320
201 1.00000 osd.201 up 1.00000 1.00000
202 1.00000 osd.202 up 1.00000 1.00000
203 1.00000 osd.203 up 1.00000 1.00000
204 1.00000 osd.204 up 1.00000 1.00000
-21 3.00000 host netxen-25321
211 1.00000 osd.211 up 1.00000 1.00000
212 1.00000 osd.212 up 0 1.00000
213 0 osd.213 up 1.00000 1.00000
214 1.00000 osd.214 up 0 1.00000
-22 4.00000 host netxen-25322
221 1.00000 osd.221 up 1.00000 1.00000
222 1.00000 osd.222 up 1.00000 1.00000
223 1.00000 osd.223 up 1.00000 1.00000
224 1.00000 osd.224 up 1.00000 1.00000
-2 4.00000 host netxen-25323
231 1.00000 osd.231 up 1.00000 1.00000
232 1.00000 osd.232 up 1.00000 1.00000
233 1.00000 osd.233 up 1.00000 1.00000
234 1.00000 osd.234 up 1.00000 1.00000
-4 4.00000 host netxen-25324
241 1.00000 osd.241 up 1.00000 1.00000
242 1.00000 osd.242 up 1.00000 1.00000
243 1.00000 osd.243 up 1.00000 1.00000
244 1.00000 osd.244 up 1.00000 1.00000
-5 4.00000 host netxen-25325
251 1.00000 osd.251 up 1.00000 1.00000
252 1.00000 osd.252 up 1.00000 1.00000
253 1.00000 osd.253 up 1.00000 1.00000
254 1.00000 osd.254 up 1.00000 1.00000
-6 1.00000 host netxen-25326
261 0 osd.261 up 0 1.00000
262 0 osd.262 up 0 1.00000
263 1.00000 osd.263 up 1.00000 1.00000
264 0 osd.264 up 0 1.00000
-23 4.00000 host netxen-25327
271 1.00000 osd.271 up 1.00000 1.00000
272 1.00000 osd.272 up 1.00000 1.00000
273 1.00000 osd.273 up 1.00000 1.00000
274 1.00000 osd.274 up 1.00000 1.00000
-24 4.00000 host netxen-25328
281 1.00000 osd.281 up 1.00000 1.00000
282 1.00000 osd.282 up 1.00000 1.00000
283 1.00000 osd.283 up 1.00000 1.00000
284 1.00000 osd.284 up 1.00000 1.00000
-25 1.00000 host netxen25329
291 0 osd.291 up 0 1.00000
292 0 osd.292 up 0 1.00000
293 0 osd.293 down 1.00000 1.00000
294 1.00000 osd.294 up 0 1.00000
-26 4.00000 host netxen25330
301 1.00000 osd.301 up 1.00000 1.00000
302 1.00000 osd.302 up 1.00000 1.00000
303 1.00000 osd.303 up 1.00000 1.00000
304 1.00000 osd.304 up 1.00000 1.00000
- ceph -s
cluster 2e1396f7-deaa-45b5-9db9-62e046089435
health HEALTH_WARN
3 pgs backfilling
786 pgs degraded
1 pgs recovering
992 pgs stuck unclean
786 pgs undersized
1 requests are blocked > 32 sec
recovery 821761/20467274 objects degraded (4.015%)
recovery 943602/20467274 objects misplaced (4.610%)
recovery 1/6821655 unfound (0.000%)
3/80 in osds are down
noout,noscrub,nodeep-scrub flag(s) set
monmap e13: 3 mons at {0=192.168.252.76:6789/0,2=192.168.252.36:6789/0,4=192.168.252.37:6789/0}
election epoch 6506, quorum 0,1,2 2,4,0
mdsmap e2050: 1/1/1 up {0=2=up:active}, 1 up:standby
osdmap e140054: 96 osds: 89 up, 80 in; 722 remapped pgs
flags noout,noscrub,nodeep-scrub
pgmap v68664097: 5760 pgs, 11 pools, 11143 GB data, 6661 kobjects
32536 GB used, 41576 GB / 74112 GB avail
821761/20467274 objects degraded (4.015%)
943602/20467274 objects misplaced (4.610%)
1/6821655 unfound (0.000%)
4316 active+clean
722 active+undersized+degraded
655 active+remapped
63 active+undersized+degraded+remapped
3 active+remapped+backfilling
1 active+recovering+undersized+degraded
client io 12887 kB/s rd, 10347 kB/s wr, 2123 op/s
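For reference, the noout/noscrub/nodeep-scrub flags shown in the status are set with the standard cluster flag commands:

ceph osd set noout          # don't mark crashed OSDs out, to avoid triggering even more rebalancing
ceph osd set noscrub        # suspend scrubbing while the cluster is unstable
ceph osd set nodeep-scrub   # suspend deep scrubbing as well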