Bug #612 - OSD: Crash during auto scrub (closed)

Added by Wido den Hollander over 13 years ago. Updated over 13 years ago.

Status: Resolved
Priority: Normal
Category: OSD
% Done: 0%

Description

After I saw #611 my cluster started to crash. One after the other, the OSDs started to go down, all with an assertion failure related to the PG scrub.

osd3

2010-11-24 13:40:30.219184 7f1ceb598710 osd3 7496 pg[3.73( v 3632'261 (3632'257,3632'261] n=5 ec=2 les=7105 7496/7496/7496) [4,3] r=1 lcod 0'0 inactive] on_change
2010-11-24 13:40:30.219200 7f1ceb598710 osd3 7496 dec_scrubs_pending 0 -> -1 (max 1, active 0)
2010-11-24 13:40:30.267352 7f1cea596710 -- [2001:16f8:10:2::c3c3:2e3a]:6802/3084 <== osd0 [2001:16f8:10:2::c3c3:8f6b]:6802/22472 6513 ==== osd_ping(e0 as_of 7493) v1 ==== 61+0+0 (1292880165 0 0) 0x986cc40
2010-11-24 13:40:30.267380 7f1cea596710 osd3 7496 heartbeat_dispatch 0x986cc40
2010-11-24 13:40:30.267393 7f1cea596710 osd3 7496 handle_osd_ping from osd0 got stat stat(2010-11-24 13:40:30.617188 oprate=0 qlen=0 recent_qlen=0 rdlat=0 / 0 fshedin=0)
2010-11-24 13:40:30.368203 7f1cea596710 -- [2001:16f8:10:2::c3c3:2e3a]:6802/3084 <== osd7 [2001:16f8:10:2::c3c3:2f6c]:6802/18600 6409 ==== osd_ping(e0 as_of 7491) v1 ==== 61+0+0 (3351412111 0 0) 0x2470a80
2010-11-24 13:40:30.368244 7f1cea596710 osd3 7496 heartbeat_dispatch 0x2470a80
2010-11-24 13:40:30.368254 7f1cea596710 osd3 7496 handle_osd_ping from osd7 got stat stat(2010-11-24 13:40:30.380154 oprate=0 qlen=0 recent_qlen=0 rdlat=0 / 0 fshedin=0)
osd/OSD.cc: In function 'void OSD::dec_scrubs_pending()':
osd/OSD.cc:2378: FAILED assert(scrubs_pending >= 0)
 ceph version 0.24~rc (commit:d6e8e8d15d22b51ec86bc5687336c3d50d9b3a5d)
 1: /usr/bin/cosd() [0x4c1f75]
 2: (ReplicatedPG::on_change()+0x5f) [0x47b49f]
 3: (OSD::advance_map(ObjectStore::Transaction&)+0x159b) [0x4e775b]
 4: (OSD::handle_osd_map(MOSDMap*)+0x1d66) [0x4eb046]
 5: (OSD::_dispatch(Message*)+0x2c8) [0x4f5848]
 6: (OSD::ms_dispatch(Message*)+0x39) [0x4f62d9]
 7: (SimpleMessenger::dispatch_entry()+0x759) [0x463b29]
 8: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x45a99c]
 9: (Thread::_entry_func(void*)+0xa) [0x47015a]
 10: (()+0x69ca) [0x7f1cf4d589ca]
 11: (clone()+0x6d) [0x7f1cf3abe70d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
*** Caught signal (ABRT) ***
 ceph version 0.24~rc (commit:d6e8e8d15d22b51ec86bc5687336c3d50d9b3a5d)
 1: (sigabrt_handler(int)+0x7d) [0x5dec4d]
 2: (()+0x33af0) [0x7f1cf3a0baf0]
 3: (gsignal()+0x35) [0x7f1cf3a0ba75]
 4: (abort()+0x180) [0x7f1cf3a0f5c0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f1cf42c18e5]
 6: (()+0xcad16) [0x7f1cf42bfd16]
 7: (()+0xcad43) [0x7f1cf42bfd43]
 8: (()+0xcae3e) [0x7f1cf42bfe3e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x448) [0x5cbc88]
 10: /usr/bin/cosd() [0x4c1f75]
 11: (ReplicatedPG::on_change()+0x5f) [0x47b49f]
 12: (OSD::advance_map(ObjectStore::Transaction&)+0x159b) [0x4e775b]
 13: (OSD::handle_osd_map(MOSDMap*)+0x1d66) [0x4eb046]
 14: (OSD::_dispatch(Message*)+0x2c8) [0x4f5848]
 15: (OSD::ms_dispatch(Message*)+0x39) [0x4f62d9]
 16: (SimpleMessenger::dispatch_entry()+0x759) [0x463b29]
 17: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x45a99c]
 18: (Thread::_entry_func(void*)+0xa) [0x47015a]
 19: (()+0x69ca) [0x7f1cf4d589ca]
 20: (clone()+0x6d) [0x7f1cf3abe70d]
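
This first abort looks like the OSD-wide scrubs_pending counter being decremented during ReplicatedPG::on_change() even though the PG had no scrub reservation pending: the "dec_scrubs_pending 0 -> -1" line above shows the counter going negative right before the assert(scrubs_pending >= 0) in osd/OSD.cc fires. A minimal standalone sketch of that counter pattern (illustrative names, not the actual Ceph code):

// Sketch only: a shared scrub-reservation counter that asserts it never
// goes negative, hit with one decrement too many.
#include <cassert>
#include <mutex>

struct ScrubCountersSketch {
  std::mutex lock;
  int scrubs_pending = 0;    // reservations requested but not yet active
  const int max_scrubs = 1;  // matches the "(max 1, active 0)" in the log

  void inc_scrubs_pending() {
    std::lock_guard<std::mutex> l(lock);
    ++scrubs_pending;
  }

  // Mirrors the failing check reported at osd/OSD.cc:2378.
  void dec_scrubs_pending() {
    std::lock_guard<std::mutex> l(lock);
    --scrubs_pending;
    assert(scrubs_pending >= 0);  // aborts once the counter goes 0 -> -1
  }
};

int main() {
  ScrubCountersSketch osd;
  osd.inc_scrubs_pending();
  osd.dec_scrubs_pending();  // fine: 1 -> 0
  osd.dec_scrubs_pending();  // decrement without a matching reservation: abort
  return 0;
}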

osd0 & osd1

2010-11-24 13:41:57.831888 7ffde11bb710 osd0 7527 pg[1.24( v 7116'14 (1534'4,7116'14] n=5 ec=2 les=7504 7496/7496/6967) [0,2] r=0 mlcod 0'0 active+clean+degraded+scrubbing+inconsistent] scrub  /0 1000004c77c.00000000/head(1330'3 mds0.6:4072 wrlock_by=unknown0.0:0)
2010-11-24 13:41:57.831902 7ffde11bb710 osd0 7527 pg[1.24( v 7116'14 (1534'4,7116'14] n=5 ec=2 les=7504 7496/7496/6967) [0,2] r=0 mlcod 0'0 active+clean+degraded+scrubbing+inconsistent] scrub got 5/5 objects, 0/0 clones, 593242/593242 bytes, 581/581 kb.
2010-11-24 13:41:57.831916 7ffde11bb710 osd0 7527 pg[1.24( v 7116'14 (1534'4,7116'14] n=5 ec=2 les=7504 7496/7496/6967) [0,2] r=0 mlcod 0'0 active+clean+degraded+scrubbing+inconsistent] _scrub (scrub) finish
2010-11-24 13:41:57.831934 7ffde11bb710 osd0 7527 pg[1.24( v 7116'14 (1534'4,7116'14] n=5 ec=2 les=7504 7496/7496/6967) [0,2] r=0 mlcod 0'0 active+clean+degraded+scrubbing+inconsistent] write_info info 309
2010-11-24 13:41:57.831950 7ffde11bb710 osd0 7527 pg[1.24( v 7116'14 (1534'4,7116'14] n=5 ec=2 les=7504 7496/7496/6967) [0,2] r=0 mlcod 0'0 active+clean+degraded+scrubbing+inconsistent] write_info bigbl 54
2010-11-24 13:41:57.831974 7ffde11bb710 osd0 7527 pg[1.24( v 7116'14 (1534'4,7116'14] n=5 ec=2 les=7504 7496/7496/6967) [0,2] r=0 mlcod 0'0 active+clean+degraded+scrubbing+inconsistent] share_pg_info
2010-11-24 13:41:57.831995 7ffde11bb710 -- [2001:16f8:10:2::c3c3:8f6b]:6801/22472 --> osd2 [2001:16f8:10:2::c3c3:4a8c]:6801/9520 -- pg_info(1 pgs e7527) v1 -- ?+0 0x5cff380
2010-11-24 13:41:57.832016 7ffde11bb710 osd0 7527 pg[1.24( v 7116'14 (1534'4,7116'14] n=5 ec=2 les=7504 7496/7496/6967) [0,2] r=0 mlcod 0'0 active+clean+degraded+inconsistent] update_stats 6967'3637
2010-11-24 13:41:57.832030 7ffde11bb710 osd0 7527 dec_scrubs_active 1 -> 0 (max 1, pending 0)
2010-11-24 13:41:57.832037 7ffde11bb710 osd0 7527 pg[1.24( v 7116'14 (1534'4,7116'14] n=5 ec=2 les=7504 7496/7496/6967) [0,2] r=0 mlcod 0'0 active+clean+degraded+inconsistent] scrub requesting unreserve from osd2
2010-11-24 13:41:57.832052 7ffde11bb710 -- [2001:16f8:10:2::c3c3:8f6b]:6801/22472 --> osd2 [2001:16f8:10:2::c3c3:4a8c]:6801/9520 -- osd_sub_op(unknown0.0:0 1.24 /0 [scrub-unreserve] v 0'0 snapset=0=[]:[] snapc=0=[]) v1 -- ?+0 0xb8fc580
2010-11-24 13:41:57.832081 7ffde11bb710 osd0 7527 pg[1.24( v 7116'14 (1534'4,7116'14] n=5 ec=2 les=7504 7496/7496/6967) [0,2] r=0 mlcod 0'0 active+clean+degraded+inconsistent] scrub done
2010-11-24 13:41:57.832119 7ffde11bb710 osd0 7527 pg[0.25( v 1486'438 (1486'436,1486'438] n=435 ec=2 les=7504 7496/7496/6967) [0,2] r=0 lcod 0'0 mlcod 0'0 active+clean+degraded+inconsistent] scrub start
2010-11-24 13:41:57.832132 7ffde11bb710 osd0 7527 pg[0.25( v 1486'438 (1486'436,1486'438] n=435 ec=2 les=7504 7496/7496/6967) [0,2] r=0 lcod 0'0 mlcod 0'0 active+clean+degraded+scrubbing+inconsistent] update_stats 6967'4384
2010-11-24 13:41:57.835255 7ffde43c3710 -- [2001:16f8:10:2::c3c3:8f6b]:6801/22472 <== osd7 [2001:16f8:10:2::c3c3:2f6c]:6801/18600 9813 ==== pg_info(1 pgs e7527) v1 ==== 316+0+0 (198917258 0 0) 0x24951c0
2010-11-24 13:41:57.835277 7ffde43c3710 osd0 7527 _dispatch 0x24951c0 pg_info(1 pgs e7527) v1
2010-11-24 13:41:57.835284 7ffde43c3710 osd0 7527 handle_pg_info pg_info(1 pgs e7527) v1 from osd7
2010-11-24 13:41:57.835291 7ffde43c3710 osd0 7527 require_same_or_newer_map 7527 (i am 7527) 0x24951c0
2010-11-24 13:41:57.835299 7ffde43c3710 osd0 7527 pg[1.21f( v 7116'437 (1330'433,7116'437] n=4 ec=2 les=7057 7050/7050/7050) [7,0,4] r=1 luod=0'0 lcod 6736'436 active]: _process_pg_info info: 1.21f( v 7116'437 (1330'433,7116'437] n=4 ec=2 les=7057 7050/7050/7050), (log omitted), (missing omitted)
2010-11-24 13:41:57.842995 7ffde3bc2710 -- [2001:16f8:10:2::c3c3:8f6b]:6802/22472 <== osd5 [2001:16f8:10:2::c3c3:3b6a]:6802/15476 133 ==== osd_ping(e0 as_of 7525) v1 ==== 61+0+0 (3586543513 0 0) 0x6c3f000
2010-11-24 13:41:57.843018 7ffde3bc2710 osd0 7527 heartbeat_dispatch 0x6c3f000
2010-11-24 13:41:57.843024 7ffde3bc2710 osd0 7527 handle_osd_ping from osd5 got stat stat(2010-11-24 13:41:57.476449 oprate=0 qlen=0 recent_qlen=0 rdlat=0 / 0 fshedin=0)
2010-11-24 13:41:57.843041 7ffde3bc2710 osd0 7527 take_peer_stat peer osd5 stat(2010-11-24 13:41:57.476449 oprate=0 qlen=0 recent_qlen=0 rdlat=0 / 0 fshedin=0)
osd/PG.cc: In function 'void PG::scrub()':
osd/PG.cc:2971: FAILED assert(osd->scrubs_pending >= 0)
 ceph version 0.24~rc (commit:d6e8e8d15d22b51ec86bc5687336c3d50d9b3a5d)
 1: (PG::scrub()+0x24a7) [0x542667]
 2: (ThreadPool::worker()+0x28f) [0x5cc57f]
 3: (ThreadPool::WorkThread::entry()+0xd) [0x4fe19d]
 4: (Thread::_entry_func(void*)+0xa) [0x47015a]
 5: (()+0x69ca) [0x7ffdee3849ca]
 6: (clone()+0x6d) [0x7ffded0ea70d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
*** Caught signal (ABRT) ***
 ceph version 0.24~rc (commit:d6e8e8d15d22b51ec86bc5687336c3d50d9b3a5d)
 1: (sigabrt_handler(int)+0x7d) [0x5dec4d]
 2: (()+0x33af0) [0x7ffded037af0]
 3: (gsignal()+0x35) [0x7ffded037a75]
 4: (abort()+0x180) [0x7ffded03b5c0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7ffded8ed8e5]
 6: (()+0xcad16) [0x7ffded8ebd16]
 7: (()+0xcad43) [0x7ffded8ebd43]
 8: (()+0xcae3e) [0x7ffded8ebe3e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x448) [0x5cbc88]
 10: (PG::scrub()+0x24a7) [0x542667]
 11: (ThreadPool::worker()+0x28f) [0x5cc57f]
 12: (ThreadPool::WorkThread::entry()+0xd) [0x4fe19d]
 13: (Thread::_entry_func(void*)+0xa) [0x47015a]
 14: (()+0x69ca) [0x7ffdee3849ca]
 15: (clone()+0x6d) [0x7ffded0ea70d]
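
The second crash appears to be the same underflowed counter observed from the scrub work queue: PG::scrub() checks osd->scrubs_pending >= 0 (osd/PG.cc:2971) and presumably sees the value that already went negative. Purely as a sketch of the kind of guard that would avoid the underflow (illustrative names, not an actual Ceph patch), tracking per PG whether a reservation is really outstanding makes the decrement safe to call more than once:

// Sketch only: decrement the shared counter only when this PG actually
// holds an outstanding scrub reservation.
#include <cassert>
#include <mutex>

struct OsdCountersSketch {
  std::mutex lock;
  int scrubs_pending = 0;

  void inc_scrubs_pending() {
    std::lock_guard<std::mutex> l(lock);
    ++scrubs_pending;
  }

  void dec_scrubs_pending() {
    std::lock_guard<std::mutex> l(lock);
    --scrubs_pending;
    assert(scrubs_pending >= 0);
  }
};

struct PgSketch {
  OsdCountersSketch* osd = nullptr;
  bool scrub_reserved = false;  // hypothetical per-PG bookkeeping

  void reserve_scrub() {
    if (!scrub_reserved) {
      scrub_reserved = true;
      osd->inc_scrubs_pending();
    }
  }

  // Called from state changes such as on_change(); only decrements if this
  // PG really had a pending reservation, so repeated calls are harmless.
  void clear_scrub_reserved() {
    if (scrub_reserved) {
      scrub_reserved = false;
      osd->dec_scrubs_pending();
    }
  }
};

int main() {
  OsdCountersSketch osd;
  PgSketch pg;
  pg.osd = &osd;
  pg.reserve_scrub();
  pg.clear_scrub_reserved();
  pg.clear_scrub_reserved();  // no-op now, instead of driving the counter to -1
  return 0;
}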

What I noticed in 'ceph -w' is:

2010-11-24 13:39:22.427341   log 2010-11-24 13:39:21.143950 osd1 [2001:16f8:10:2::c3c3:a24f]:6800/3750 222 : [INF] 3.11d scrub ok
2010-11-24 13:39:22.918378   log 2010-11-24 13:39:22.020493 osd11 [2001:16f8:10:2::c3c3:ab76]:6800/11274 368 : [ERR] 0.38 scrub osd3 /0 size 22890 != 4194304
2010-11-24 13:39:24.626067   log 2010-11-24 13:39:23.654725 osd11 [2001:16f8:10:2::c3c3:ab76]:6800/11274 369 : [INF] 0.7e scrub ok
2010-11-24 13:39:25.187218    pg v97833: 3216 pgs: 1 active, 2483 active+clean, 565 peering, 33 crashed+peering, 12 down+peering, 89 crashed+down+peering, 28 active+clean+inconsistent, 5 inconsistent+peering; 429 GB data, 1390 GB used, 3299 GB / 4713 GB avail; 1/1410396 degraded (0.000%)
....
....

2010-11-24 13:39:29.149193   log 2010-11-24 13:39:27.966334 osd1 [2001:16f8:10:2::c3c3:a24f]:6800/3750 224 : [INF] 3.170 scrub ok
2010-11-24 13:39:31.216851   log 2010-11-24 13:39:30.163439 osd1 [2001:16f8:10:2::c3c3:a24f]:6800/3750 225 : [INF] 3.175 scrub ok
2010-11-24 13:39:31.216851   log 2010-11-24 13:39:31.011049 mon0 [2001:16f8:10:2::c3c3:af78]:6789/0 11 : [INF] osd5 [2001:16f8:10:2::c3c3:3b6a]:6800/15476 failed (by osd0 [2001:16f8:10:2::c3c3:8f6b]:6800/22472)
2010-11-24 13:39:31.216851   log 2010-11-24 13:39:31.022855 mon0 [2001:16f8:10:2::c3c3:af78]:6789/0 12 : [INF] osd5 [2001:16f8:10:2::c3c3:3b6a]:6800/15476 failed (by osd1 [2001:16f8:10:2::c3c3:a24f]:6800/3750)
2010-11-24 13:39:31.556649    pg v97837: 3216 pgs: 1 active, 2482 active+clean, 565 peering, 33 crashed+peering, 12 down+peering, 89 crashed+down+peering, 29 active+clean+inconsistent, 5 inconsistent+peering; 429 GB data, 1390 GB used, 3299 GB / 4713 GB avail; 1/1410396 degraded (0.000%)
2010-11-24 13:39:31.872471   osd e7474: 12 osds: 9 up, 10 in

But osd5 never actually went down; it was just busy, since it had already been down after #611.

I've run cdebugpack on 4 OSDs and uploaded the data to logger.ceph.widodh.nl:/srv/ceph/issues/osd_crash_pg_auto_scrub


Related issues: 1 (0 open, 1 closed)

Related to Ceph - Bug #579: OSD::sched_scrub: FAILED assert(pg_map.count(pgid)) - Resolved - Sage Weil - 11/15/2010
