Bug #12665


osd/ReplicatedPG.cc: 2706: FAILED assert(p != snapset.clones.end())

Added by Bram Pieters over 8 years ago. Updated over 8 years ago.

Status:
Resolved
Priority:
High
Assignee:
David Zafman
Category:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After upgrading our Ceph cluster from 0.80.4 to 0.94.1 we see intermittent crashes on multiple OSDs.
Marking those OSDs out results in rebalancing of the cluster, which triggers other OSDs to crash.
Some specific data appears to be causing the crashes, but we have no clue which data it is.

Meanwhile we've cleaned up as much data as possible by:
- removing old RBDs
- removing all snapshots of all RBDs
- copying RBDs that had snapshots to new RBDs via rbd copy
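The cleanup sequence above can be sketched as a small shell helper (a hedged example: the pool and image names are placeholders, and `RBD` defaults to a dry run that only prints the `rbd` commands; set `RBD=rbd` to run them against a real cluster):

```shell
#!/bin/sh
# Sketch of the snapshot cleanup described above. All names are hypothetical.
# RBD defaults to a dry run that prints the commands instead of executing them.
RBD="${RBD:-echo rbd}"

flatten_image() {
    pool="$1"; img="$2"
    # Remove every snapshot of the image in one go.
    $RBD snap purge "$pool/$img"
    # Copy to a fresh, snapshot-free image, drop the original,
    # and take over its name.
    $RBD cp "$pool/$img" "$pool/${img}-copy"
    $RBD rm "$pool/$img"
    $RBD rename "$pool/${img}-copy" "$pool/$img"
}

flatten_image rbd vm-disk-1   # placeholder pool/image
```

With the dry-run default this prints the four `rbd` commands in order, which makes them easy to review before pointing the script at the cluster.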

For now we automatically restart the OSDs every 5 minutes, but we're afraid data corruption will occur within RBDs because of intermittent I/O lockups at the clients caused by continuous recalculations of the CRUSH map.
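A cron-driven watchdog for that kind of periodic restart might look like the following (a sketch, not our actual script: the OSD ids, the `pgrep` pattern, and the sysvinit-style `service ceph start osd.N` invocation are assumptions about a 0.94-era deployment, and `SERVICE` defaults to a dry run):

```shell
#!/bin/sh
# Hypothetical watchdog run from cron every 5 minutes, e.g.:
#   */5 * * * * root /usr/local/sbin/restart-dead-osds.sh
# SERVICE defaults to a dry run that only prints the restart commands;
# set SERVICE=service to actually restart daemons.
SERVICE="${SERVICE:-echo service}"
OSD_IDS="${OSD_IDS:-141 142 143 144}"   # example: OSD ids hosted on this node

restart_dead_osds() {
    for id in $OSD_IDS; do
        # Restart only daemons that are not currently running.
        if ! pgrep -f "ceph-osd .* -i $id" >/dev/null 2>&1; then
            $SERVICE ceph start "osd.$id"
        fi
    done
}

restart_dead_osds
```

With the `noout` flag set (as in the status output below), these restarts at least avoid triggering an additional rebalance every time an OSD crashes.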

I've attached log files from two OSDs, captured while they crashed.

Ceph version:
    $ ceph -v
    ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
Our OSD tree:
    $ ceph osd tree
    ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
    -1 85.00000 root default
    -3 85.00000 rack unknownrack
    -9 4.00000 host netxen-25307
    71 1.00000 osd.71 up 1.00000 1.00000
    72 1.00000 osd.72 up 1.00000 1.00000
    73 1.00000 osd.73 up 1.00000 1.00000
    74 1.00000 osd.74 up 1.00000 1.00000
    -10 4.00000 host netxen-25308
    81 1.00000 osd.81 up 1.00000 1.00000
    82 1.00000 osd.82 up 1.00000 1.00000
    83 1.00000 osd.83 up 1.00000 1.00000
    84 1.00000 osd.84 up 1.00000 1.00000
    -11 4.00000 host netxen-25309
    91 1.00000 osd.91 up 1.00000 1.00000
    92 1.00000 osd.92 up 1.00000 1.00000
    93 1.00000 osd.93 up 1.00000 1.00000
    94 1.00000 osd.94 up 1.00000 1.00000
    -12 4.00000 host netxen-25310
    101 1.00000 osd.101 up 1.00000 1.00000
    102 1.00000 osd.102 up 1.00000 1.00000
    103 1.00000 osd.103 up 1.00000 1.00000
    104 1.00000 osd.104 up 1.00000 1.00000
    -7 4.00000 host netxen-25311
    111 1.00000 osd.111 up 1.00000 1.00000
    112 1.00000 osd.112 up 1.00000 1.00000
    113 1.00000 osd.113 up 1.00000 1.00000
    114 1.00000 osd.114 up 1.00000 1.00000
    -8 4.00000 host netxen-25312
    121 1.00000 osd.121 down 0 1.00000
    122 1.00000 osd.122 down 0 1.00000
    123 1.00000 osd.123 down 0 1.00000
    124 1.00000 osd.124 down 0 1.00000
    -13 0 host netxen-25313
    131 0 osd.131 up 0 1.00000
    132 0 osd.132 up 1.00000 1.00000
    133 0 osd.133 up 0 1.00000
    134 0 osd.134 up 1.00000 1.00000
    -14 4.00000 host netxen-25314
    141 1.00000 osd.141 up 0 1.00000
    142 1.00000 osd.142 down 1.00000 1.00000
    143 1.00000 osd.143 up 0 1.00000
    144 1.00000 osd.144 up 0 1.00000
    -15 4.00000 host netxen-25315
    151 1.00000 osd.151 up 1.00000 1.00000
    152 1.00000 osd.152 up 1.00000 1.00000
    153 1.00000 osd.153 up 1.00000 1.00000
    154 1.00000 osd.154 up 1.00000 1.00000
    -16 4.00000 host netxen-25316
    161 1.00000 osd.161 up 1.00000 1.00000
    162 1.00000 osd.162 up 1.00000 1.00000
    163 1.00000 osd.163 up 1.00000 1.00000
    164 1.00000 osd.164 up 1.00000 1.00000
    -17 4.00000 host netxen-25317
    171 1.00000 osd.171 up 1.00000 1.00000
    172 1.00000 osd.172 up 1.00000 1.00000
    173 1.00000 osd.173 up 1.00000 1.00000
    174 1.00000 osd.174 up 1.00000 1.00000
    -18 4.00000 host netxen-25318
    181 1.00000 osd.181 up 1.00000 1.00000
    182 1.00000 osd.182 up 1.00000 1.00000
    183 1.00000 osd.183 up 1.00000 1.00000
    184 1.00000 osd.184 up 1.00000 1.00000
    -19 4.00000 host netxen-25319
    191 1.00000 osd.191 up 1.00000 1.00000
    192 1.00000 osd.192 up 1.00000 1.00000
    193 1.00000 osd.193 up 1.00000 1.00000
    194 1.00000 osd.194 up 1.00000 1.00000
    -20 4.00000 host netxen-25320
    201 1.00000 osd.201 up 1.00000 1.00000
    202 1.00000 osd.202 up 1.00000 1.00000
    203 1.00000 osd.203 up 1.00000 1.00000
    204 1.00000 osd.204 up 1.00000 1.00000
    -21 3.00000 host netxen-25321
    211 1.00000 osd.211 up 1.00000 1.00000
    212 1.00000 osd.212 up 0 1.00000
    213 0 osd.213 up 1.00000 1.00000
    214 1.00000 osd.214 up 0 1.00000
    -22 4.00000 host netxen-25322
    221 1.00000 osd.221 up 1.00000 1.00000
    222 1.00000 osd.222 up 1.00000 1.00000
    223 1.00000 osd.223 up 1.00000 1.00000
    224 1.00000 osd.224 up 1.00000 1.00000
    -2 4.00000 host netxen-25323
    231 1.00000 osd.231 up 1.00000 1.00000
    232 1.00000 osd.232 up 1.00000 1.00000
    233 1.00000 osd.233 up 1.00000 1.00000
    234 1.00000 osd.234 up 1.00000 1.00000
    -4 4.00000 host netxen-25324
    241 1.00000 osd.241 up 1.00000 1.00000
    242 1.00000 osd.242 up 1.00000 1.00000
    243 1.00000 osd.243 up 1.00000 1.00000
    244 1.00000 osd.244 up 1.00000 1.00000
    -5 4.00000 host netxen-25325
    251 1.00000 osd.251 up 1.00000 1.00000
    252 1.00000 osd.252 up 1.00000 1.00000
    253 1.00000 osd.253 up 1.00000 1.00000
    254 1.00000 osd.254 up 1.00000 1.00000
    -6 1.00000 host netxen-25326
    261 0 osd.261 up 0 1.00000
    262 0 osd.262 up 0 1.00000
    263 1.00000 osd.263 up 1.00000 1.00000
    264 0 osd.264 up 0 1.00000
    -23 4.00000 host netxen-25327
    271 1.00000 osd.271 up 1.00000 1.00000
    272 1.00000 osd.272 up 1.00000 1.00000
    273 1.00000 osd.273 up 1.00000 1.00000
    274 1.00000 osd.274 up 1.00000 1.00000
    -24 4.00000 host netxen-25328
    281 1.00000 osd.281 up 1.00000 1.00000
    282 1.00000 osd.282 up 1.00000 1.00000
    283 1.00000 osd.283 up 1.00000 1.00000
    284 1.00000 osd.284 up 1.00000 1.00000
    -25 1.00000 host netxen25329
    291 0 osd.291 up 0 1.00000
    292 0 osd.292 up 0 1.00000
    293 0 osd.293 down 1.00000 1.00000
    294 1.00000 osd.294 up 0 1.00000
    -26 4.00000 host netxen25330
    301 1.00000 osd.301 up 1.00000 1.00000
    302 1.00000 osd.302 up 1.00000 1.00000
    303 1.00000 osd.303 up 1.00000 1.00000
    304 1.00000 osd.304 up 1.00000 1.00000
Ceph health:
    $ ceph -s
    cluster 2e1396f7-deaa-45b5-9db9-62e046089435
    health HEALTH_WARN
    3 pgs backfilling
    786 pgs degraded
    1 pgs recovering
    992 pgs stuck unclean
    786 pgs undersized
    1 requests are blocked > 32 sec
    recovery 821761/20467274 objects degraded (4.015%)
    recovery 943602/20467274 objects misplaced (4.610%)
    recovery 1/6821655 unfound (0.000%)
    3/80 in osds are down
    noout,noscrub,nodeep-scrub flag(s) set
    monmap e13: 3 mons at {0=192.168.252.76:6789/0,2=192.168.252.36:6789/0,4=192.168.252.37:6789/0}
    election epoch 6506, quorum 0,1,2 2,4,0
    mdsmap e2050: 1/1/1 up {0=2=up:active}, 1 up:standby
    osdmap e140054: 96 osds: 89 up, 80 in; 722 remapped pgs
    flags noout,noscrub,nodeep-scrub
    pgmap v68664097: 5760 pgs, 11 pools, 11143 GB data, 6661 kobjects
    32536 GB used, 41576 GB / 74112 GB avail
    821761/20467274 objects degraded (4.015%)
    943602/20467274 objects misplaced (4.610%)
    1/6821655 unfound (0.000%)
    4316 active+clean
    722 active+undersized+degraded
    655 active+remapped
    63 active+undersized+degraded+remapped
    3 active+remapped+backfilling
    1 active+recovering+undersized+degraded
    client io 12887 kB/s rd, 10347 kB/s wr, 2123 op/s

Files

ceph-osd.144.log.gz (241 KB) ceph-osd.144.log.gz Bram Pieters, 08/10/2015 11:12 PM
ceph-osd.294.log.gz (28.1 KB) ceph-osd.294.log.gz Bram Pieters, 08/10/2015 11:15 PM