Bug #13827

OSD takes suicide timeout during deep_scrub on pg after upgrade to Hammer 94.5

Added by Michael Hackett over 8 years ago. Updated over 8 years ago.

Status: Resolved
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%

Description

Dreamhost has upgraded to Ceph Hammer 0.94.5 and has been hitting occasional OSD crashes due to suicide timeouts during deep scrubs on PGs. They have since disabled deep scrubbing and have not encountered any further OSD crashes.
OSD logs from two OSDs were gathered with debug=20 during the crashes.
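
For reference, deep scrubbing can be paused cluster-wide with the nodeep-scrub flag; I am assuming that is roughly how it was disabled here, since the ticket does not say:

ceph osd set nodeep-scrub      # pause new deep scrubs cluster-wide
ceph osd unset nodeep-scrub    # re-enable once the issue is resolved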

Reviewing the log file for OSD.933, I can see that we definitely hit the suicide timeout waiting for thread 0x7fc1c21c4700, which is running the deep scrub on PG 5.143e:

2015-11-18 18:21:04.385672 7fc1c21c4700 10 osd.933 pg_epoch: 821902 pg[5.143e( v 821849'118406 (811813'115381,821849'118406] local-les=821901 n=14374 ec=22869 les/c 821901/821902 821900/821900/817050) [1351,470,933] r=2 lpr=821900 pi=664979-821899/79 luod=0'0 crt=821849'118406 lcod 0'0 active] be_deep_scrub cc40143e/.dir.default.120331675.42828/head//5 seed 4294967295

A little over 60 seconds after the last deep scrub message in the log, we see the assert because we are still waiting on that thread:

2015-11-18 18:22:05.031867 7fc1d4331700 20 osd.933 821902 share_map_peer 0x60f2840 already has epoch 821902
2015-11-18 18:22:05.040871 7fc1cc1d8700 1 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7fc1c21c4700' had timed out after 60
2015-11-18 18:22:05.040875 7fc1cc1d8700 1 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7fc1c21c4700' had suicide timed out after 60
2015-11-18 18:22:05.042788 7fc1cc1d8700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fc1cc1d8700 time 2015-11-18 18:22:05.040885
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f) [0xb1a76f]
2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2e1) [0xa55e71]
3: (ceph::HeartbeatMap::is_healthy()+0xb7) [0xa56647]
4: (OSD::handle_osd_ping(MOSDPing*)+0x7a7) [0x671587]
5: (OSD::heartbeat_dispatch(Message*)+0x48b) [0x67262b]
6: (Messenger::ms_deliver_dispatch(Message*)+0x77) [0xbdcdd7]
7: (DispatchQueue::entry()+0x44a) [0xbd9f7a]
8: (DispatchQueue::DispatchThread::entry()+0xd) [0xaf9e7d]
9: (()+0x7e9a) [0x7fc1e1830e9a]
10: (clone()+0x6d) [0x7fc1e05244bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
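
For context on why this trips at exactly 60 seconds: each disk_tp worker refreshes a per-thread heartbeat handle before it starts a work item, and the heartbeat dispatch path (is_healthy(), called from handle_osd_ping() in the backtrace above) checks how long it has been since that refresh. Past the grace period it logs the "had timed out" line; past the suicide grace it asserts rather than let the OSD limp along with a stuck thread. Because the grace and the suicide grace are both 60 seconds here (which is exactly what the related backport #14376 is about), a single deep scrub chunk that takes longer than 60 seconds kills the OSD. A minimal standalone sketch of that check, my own simplification rather than the verbatim common/HeartbeatMap.cc code:

#include <cassert>
#include <cstdio>
#include <ctime>

// Simplified model of a per-thread heartbeat handle: the worker refreshes
// last_touch before each work item; a watchdog thread calls check().
struct HeartbeatHandle {
  const char *name;
  time_t last_touch;     // last time the worker checked in
  time_t grace;          // warning threshold  -> "had timed out"
  time_t suicide_grace;  // hard threshold     -> "hit suicide timeout"
};

// Returns false if the thread is overdue; aborts once the suicide grace is exceeded.
bool check(const HeartbeatHandle &h, time_t now) {
  bool healthy = true;
  if (h.grace && now - h.last_touch > h.grace) {
    printf("heartbeat_map is_healthy '%s' had timed out after %ld\n",
           h.name, (long)h.grace);
    healthy = false;
  }
  if (h.suicide_grace && now - h.last_touch > h.suicide_grace) {
    printf("heartbeat_map is_healthy '%s' had suicide timed out after %ld\n",
           h.name, (long)h.suicide_grace);
    assert(0 == "hit suicide timeout");  // this is where the OSD aborts
  }
  return healthy;
}

int main() {
  // Both graces set to 60 s, matching the log above: the moment a deep scrub
  // keeps the worker busy past 60 s, the warning and the abort fire together.
  HeartbeatHandle h{"OSD::disk_tp thread", time(nullptr) - 61, 60, 60};
  check(h, time(nullptr));
  return 0;
}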

Reviewing OSD.1087, I see a similar sequence of events.

The last entry in the logs for the deep scrub of PG 5.2dd8, running on thread 0x7f3d97661700:

2015-11-18 18:27:06.856363 7f3d97661700 10 osd.1087 pg_epoch: 821921 pg[5.2dd8( v 821875'127927 (821294'119424,821875'127927] local-les=821909 n=14388 ec=22869 les/c 821909/821909 821908/821908/821908) [1087,1381,963] r=0 lpr=821908 crt=821875'127927 lcod 0'0 mlcod 0'0 active+clean+scrubbing+deep] be_deep_scrub 54492dd8/26013210.32048__shadow__udcIeNjrzpHhFaCn4P4zmy_EAgWZI-V_1/head//5 seed 4294967295

I am curious about this message on PG 5.2dd8, where we see 0//0//-1 right after the last deep scrub message on the PG:

2015-11-18 18:27:06.947841 7f3d70e3a700 10 osd.1087 821921 handle_replica_op osd_sub_op(unknown.0.0:0 5.2dd8 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[]) v11 epoch 821921

2015-11-18 18:27:06.947868 7f3d70e3a700 15 osd.1087 821921 enqueue_op 0x2e42b100 prio 127 cost 32984 latency 0.000394 osd_sub_op(unknown.0.0:0 5.2dd8 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[]) v11

As with OSD.933, I see that just about 60 seconds later the OSD asserts with a suicide timeout while waiting on thread 0x7f3d97661700:

2015-11-18 18:28:07.020357 7f3da2e78700 1 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7f3d97661700' had timed out after 60
2015-11-18 18:28:07.020388 7f3da2e78700 1 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7f3d97661700' had suicide timed out after 60
2015-11-18 18:28:07.022342 7f3da2e78700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f3da2e78700 time 2015-11-18 18:28:07.020408
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f) [0xb1a76f]
2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2e1) [0xa55e71]
3: (ceph::HeartbeatMap::is_healthy()+0xb7) [0xa56647]
4: (OSD::handle_osd_ping(MOSDPing*)+0x7a7) [0x671587]
5: (OSD::heartbeat_dispatch(Message*)+0x48b) [0x67262b]
6: (Messenger::ms_deliver_dispatch(Message*)+0x77) [0xbdcdd7]
7: (DispatchQueue::entry()+0x44a) [0xbd9f7a]
8: (DispatchQueue::DispatchThread::entry()+0xd) [0xaf9e7d]
9: (()+0x7e9a) [0x7f3db6450e9a]
10: (clone()+0x6d) [0x7f3db51444bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---


Files

pgquery_crush.tar.xz (12.2 KB), Justin Bautista, 11/25/2015 09:25 PM

Related issues: 1 (0 open, 1 closed)

Related to Ceph - Backport #14376: scrub suicide timeout is the same as the regular timeout -- should probably match recovery at least (Resolved, Samuel Just)
