Bug #13827

OSD takes suicide timeout during deep_scrub on pg after upgrade to Hammer 94.5

Added by Michael Hackett over 8 years ago. Updated about 8 years ago.

Status:
Resolved
Priority:
High
Assignee:
Samuel Just
Category:
-
Target version:
-
% Done:

0%


Description

Dreamhost has upgraded to Ceph Hammer 0.94.5 and has been incurring occasional OSD crashes due to suicide timeouts during deep scrubs on PGs. They have since disabled deep scrubbing and have not encountered any further OSD crashes.
OSD logs from two OSDs were gathered with debug = 20 during the crashes.
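For reference, the usual way to disable deep scrubbing cluster-wide is the nodeep-scrub flag; the ticket does not say exactly how it was disabled here, so the commands below are an assumption, not a record of what was run:

ceph osd set nodeep-scrub      # stop scheduling new deep scrubs cluster-wide
ceph osd unset nodeep-scrub    # re-enable deep scrubbing later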

Reviewing the log file for OSD.933, I can see that we definitely hit the suicide timeout waiting on thread 0x7fc1c21c4700, which is running the deep scrub on PG 5.143e.

2015-11-18 18:21:04.385672 7fc1c21c4700 10 osd.933 pg_epoch: 821902 pg[5.143e( v 821849'118406 (811813'115381,821849'118406] local-les=821901 n=14374 ec=22869 les/c 821901/821902 821900/821900/817050) [1351,470,933] r=2 lpr=821900 pi=664979-821899/79 luod=0'0 crt=821849'118406 lcod 0'0 active] be_deep_scrub cc40143e/.dir.default.120331675.42828/head//5 seed 4294967295

A little over 60 seconds after the last deep-scrub message in the logs, we see the assert because we are still waiting on that thread.

2015-11-18 18:22:05.031867 7fc1d4331700 20 osd.933 821902 share_map_peer 0x60f2840 already has epoch 821902
2015-11-18 18:22:05.040871 7fc1cc1d8700 1 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7fc1c21c4700' had timed out after 60
2015-11-18 18:22:05.040875 7fc1cc1d8700 1 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7fc1c21c4700' had suicide timed out after 60
2015-11-18 18:22:05.042788 7fc1cc1d8700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fc1cc1d8700 time 2015-11-18 18:22:05.040885
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f) [0xb1a76f]
2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2e1) [0xa55e71]
3: (ceph::HeartbeatMap::is_healthy()+0xb7) [0xa56647]
4: (OSD::handle_osd_ping(MOSDPing*)+0x7a7) [0x671587]
5: (OSD::heartbeat_dispatch(Message*)+0x48b) [0x67262b]
6: (Messenger::ms_deliver_dispatch(Message*)+0x77) [0xbdcdd7]
7: (DispatchQueue::entry()+0x44a) [0xbd9f7a]
8: (DispatchQueue::DispatchThread::entry()+0xd) [0xaf9e7d]
9: (()+0x7e9a) [0x7fc1e1830e9a]
10: (clone()+0x6d) [0x7fc1e05244bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
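For what it's worth, the timeouts actually in effect on a running OSD can be confirmed over the admin socket on the host carrying osd.933. The option names grepped for below (osd_scrub_thread_timeout / osd_scrub_thread_suicide_timeout, which I assume govern the scrub work queue in OSD::disk_tp in this Hammer build) are an assumption:

ceph daemon osd.933 config show | grep -E 'scrub_thread|suicide'
ceph daemon osd.933 config get osd_scrub_thread_suicide_timeout   # single-option query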

Reviewing OSD.1087, I see a similar sequence of events.

The last entry in the logs for the deep scrub of PG 5.2dd8 on thread 0x7f3d97661700:

2015-11-18 18:27:06.856363 7f3d97661700 10 osd.1087 pg_epoch: 821921 pg[5.2dd8( v 821875'127927 (821294'119424,821875'127927] local-les=821909 n=14388 ec=22869 les/c 821909/821909 821908/821908/821908) [1087,1381,963] r=0 lpr=821908 crt=821875'127927 lcod 0'0 mlcod 0'0 active+clean+scrubbing+deep] be_deep_scrub 54492dd8/26013210.32048__shadow__udcIeNjrzpHhFaCn4P4zmy_EAgWZI-V_1/head//5 seed 4294967295

I am curious about this message seen on PG 5.2dd8, where we see 0//0//-1 right after the last deep-scrub message on the PG.

2015-11-18 18:27:06.947841 7f3d70e3a700 10 osd.1087 821921 handle_replica_op osd_sub_op(unknown.0.0:0 5.2dd8 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[]) v11 epoch 821921

2015-11-18 18:27:06.947868 7f3d70e3a700 15 osd.1087 821921 enqueue_op 0x2e42b100 prio 127 cost 32984 latency 0.000394 osd_sub_op(unknown.0.0:0 5.2dd8 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[]) v11

As with OSD.933, roughly 60 seconds later the OSD asserts with a suicide timeout while waiting on thread 0x7f3d97661700.

2015-11-18 18:28:07.020357 7f3da2e78700 1 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7f3d97661700' had timed out after 60
2015-11-18 18:28:07.020388 7f3da2e78700 1 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7f3d97661700' had suicide timed out after 60
2015-11-18 18:28:07.022342 7f3da2e78700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f3da2e78700 time 2015-11-18 18:28:07.020408
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f) [0xb1a76f]
2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2e1) [0xa55e71]
3: (ceph::HeartbeatMap::is_healthy()+0xb7) [0xa56647]
4: (OSD::handle_osd_ping(MOSDPing*)+0x7a7) [0x671587]
5: (OSD::heartbeat_dispatch(Message*)+0x48b) [0x67262b]
6: (Messenger::ms_deliver_dispatch(Message*)+0x77) [0xbdcdd7]
7: (DispatchQueue::entry()+0x44a) [0xbd9f7a]
8: (DispatchQueue::DispatchThread::entry()+0xd) [0xaf9e7d]
9: (()+0x7e9a) [0x7f3db6450e9a]
10: (clone()+0x6d) [0x7f3db51444bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---


Files

pgquery_crush.tar.xz (12.2 KB), Justin Bautista, 11/25/2015 09:25 PM

Related issues: 1 (0 open, 1 closed)

Related to Ceph - Backport #14376: scrub suicide timeout is the same as the regular timeout -- should probably match recovery at least (Resolved, Samuel Just)
Actions #1

Updated by Nathan Cutler over 8 years ago

  • Tracker changed from Tasks to Bug
  • Project changed from Stable releases to Ceph
Actions #4

Updated by Samuel Just over 8 years ago

Can you attach the osd logs?

Actions #5

Updated by Samuel Just over 8 years ago

  • Assignee set to Samuel Just
Actions #7

Updated by Michael Hackett over 8 years ago

Deep scrub was disabled for several days, during which no further crashes were seen. On 12/14 deep scrub was re-enabled, and within the hour several OSDs reported crashes. Logs from OSDs 1003 and 111 have been gathered with the requested debug levels enabled:

debug osd = 25
debug ms = 1
debug filestore = 20
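For reference, these levels can also be injected into a running OSD without a restart; a sketch assuming the admin keyring is available, with osd.1003 as the example target:

ceph tell osd.1003 injectargs '--debug-osd 25 --debug-ms 1 --debug-filestore 20'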

Looking at these crashes, they are not due to suicide timeouts:

OSD.1003

2015-12-14 22:01:06.977918 7fb67d1fe700 10 write_log with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, dirty_divergent_priors: 0, writeout_from: 837823'120653, trimmed:
2015-12-14 22:01:06.978031 7fb67d1fe700 5 filestore(/srv/ceph/osd/1003) queue_transactions existing osr(5.398a 0xb916280)/0xb916280
2015-12-14 22:01:06.977892 7fb6779f3700 -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, uint32_t, bool)' thread 7fb6779f3700 time 2015-12-14 22:01:06.953063
os/FileStore.cc: 2850: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)

ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f) [0xb1a76f]
2: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int, bool)+0x9fa) [0x8d669a]
3: (ReplicatedBackend::be_deep_scrub(hobject_t const&, unsigned int, ScrubMap::object&, ThreadPool::TPHandle&)+0x2c6) [0x957a66]
4: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, unsigned int, ThreadPool::TPHandle&)+0x3f4) [0x897ed4]
5: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, unsigned int, ThreadPool::TPHandle&)+0x1f7) [0x7ac7a7]
6: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x45a) [0x7ace5a]
7: (OSD::RepScrubWQ::_process(MOSDRepScrub*, ThreadPool::TPHandle&)+0xfa) [0x6ba35a]
8: (ThreadPool::worker(ThreadPool::WorkThread*)+0x48e) [0xb0964e]
9: (ThreadPool::WorkThread::entry()+0x10) [0xb0c4b0]
10: (()+0x7e9a) [0x7fb697003e9a]
11: (clone()+0x6d) [0x7fb695cf74bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

2015-12-14 22:01:06.978039 7fb67d1fe700 5 filestore(/srv/ceph/osd/1003) queue_transactions (writeahead) 21346787 0x27a895f8
2015-12-14 22:01:06.978068 7fb67d1fe700 10 osd.1003 837823 dequeue_op 0x2a78fa00 finish

OSD.111

0> 2015-12-15 01:08:45.190418 7fb08650e700 -1 os/FileStore.cc: In function 'int FileStore::lfn_find(const ghobject_t&, const Index&, IndexedPath*)' thread 7fb08650e700 time 2015-12-15 01:08:45.175487
os/FileStore.cc: 179: FAILED assert(!m_filestore_fail_eio || r != -5)
ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f) [0xb1a76f]
2: (FileStore::lfn_find(ghobject_t const&, Index const&, std::tr1::shared_ptr<CollectionIndex::Path>*)+0xad) [0x8b009d]
3: (FileStore::lfn_stat(coll_t, ghobject_t const&, stat*)+0xde) [0x8b339e]
4: (FileStore::stat(coll_t, ghobject_t const&, stat*, bool)+0x57) [0x8bc3f7]
5: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, unsigned int, ThreadPool::TPHandle&)+0x13e) [0x897c1e]
6: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, unsigned int, ThreadPool::TPHandle&)+0x1f7) [0x7ac7a7]
7: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x45a) [0x7ace5a]
8: (OSD::RepScrubWQ::_process(MOSDRepScrub*, ThreadPool::TPHandle&)+0xfa) [0x6ba35a]
9: (ThreadPool::worker(ThreadPool::WorkThread*)+0x48e) [0xb0964e]
10: (ThreadPool::WorkThread::entry()+0x10) [0xb0c4b0]
11: (()+0x7e9a) [0x7fb0a530ce9a]
12: (clone()+0x6d) [0x7fb0a3ffe2ed]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
25/25 osd
0/ 5 optracker
0/ 5 objclass
20/20 filestore
1/ 3 keyvaluestore
1/ 3 journal
1/ 1 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000

The OSD logs are located at:
https://api.access.redhat.com/rs/cases/01541555/attachments/a85b8151-cb57-4f41-9705-4b2a79eddc6c
https://api.access.redhat.com/rs/cases/01541555/attachments/cd6f4cee-21eb-41e3-8035-2eb902ea97a0

Actions #8

Updated by Michael Hackett over 8 years ago

Firefly versions that were running prior to the Hammer 0.94.5 upgrade:

"common": "0.80.9-1precise"
"osd": "0.80.9-2-g62645d3-1precise"
"mon": "0.80.9-1precise"
"rgw": "0.80.10-1-g364151a-1precise"

No OS or general system upgrades have been performed.

Actions #9

Updated by Samuel Just over 8 years ago

osd.1003 and osd.111 above crashed on EIO; the disks are probably bad -- unrelated to the original problem unless the original problem was also bad disks.
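A quick way to confirm the EIO theory on those hosts would be to check the kernel log and SMART data for each OSD's backing device; /dev/sdX is a placeholder and smartmontools is assumed to be installed:

dmesg | grep -iE 'i/o error|medium error'                        # kernel-reported read failures
smartctl -a /dev/sdX | grep -iE 'reallocated|pending|uncorrect'  # SMART error counters for the backing disk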

Actions #10

Updated by Samuel Just over 8 years ago

osd.933 timed out while deep scrubbing cc40143e/.dir.default.120331675.42828/head//5

and

osd.1087 timed out while deep scrubbing 54492dd8/26013210.32048__shadow__udcIeNjrzpHhFaCn4P4zmy_EAgWZI-V_1/head//5

Does this cluster still have some enormously large index objects (very very large buckets)? Probably worth trying to dump those two objects to find out how large they are. It could also just be that the disks are flaky and hang while scanning the data for those objects.
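A sketch of how those two objects can be inspected with the standard rados tool (the pool name .rgw.data.1 is taken from the reply below):

rados -p .rgw.data.1 stat .dir.default.120331675.42828
rados -p .rgw.data.1 listxattr .dir.default.120331675.42828
rados -p .rgw.data.1 getomapheader .dir.default.120331675.42828
rados -p .rgw.data.1 listomapkeys .dir.default.120331675.42828 | wc -l   # count of bucket index entries
rados -p .rgw.data.1 stat 26013210.32048__shadow__udcIeNjrzpHhFaCn4P4zmy_EAgWZI-V_1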

Actions #11

Updated by Robin Johnson over 8 years ago

Samuel Just wrote:

osd.933 timed out while deep scrubbing cc40143e/.dir.default.120331675.42828/head//5

rados stat:
.rgw.data.1/.dir.default.120331675.42828 mtime 2015-03-12 04:46:15.000000, size 0
Belongs to a bucket that was deleted a long time ago. The bucket data is gone, but the bucket.instance does still exist (should it have been cleaned up?)
Has no xattr or omap data;
Does have an omap header:
header (69 bytes) :
0000 : 05 02 3f 00 00 00 01 00 00 00 01 02 02 18 00 00 : ..?.............
0010 : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 : ................
0020 : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 : ................
0030 : 00 05 00 00 00 00 00 00 00 00 00 00 00 00 00 00 : ................
0040 : 00 00 00 00 00 : .....

osd.1087 timed out while deep scrubbing 54492dd8/26013210.32048__shadow__udcIeNjrzpHhFaCn4P4zmy_EAgWZI-V_1/head//5

.rgw.data.1/26013210.32048__shadow__udcIeNjrzpHhFaCn4P4zmy_EAgWZI-V_1 mtime 2014-01-08 07:21:12.000000, size 3571712
No xattr, omap header, omap keys.

Does this cluster still have some enormously large index objects (very very large buckets)?

Yes, we do have some very large bucket indexes that predate Hammer's bucket index sharding (72 users with one or more buckets over 1M objects, and 7 users with buckets over 10M objects).

Probably worth trying to dump those two objects to find out how large they are.

They are tiny.

It could also just be that the disks are flaky and hang while scanning the data for those objects.

I can fetch the content of the non-empty file fine.
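For gauging how large those pre-sharding bucket indexes actually are, a rough sketch (BUCKET_NAME and BUCKET_ID are placeholders; the pool follows the rados stat output above):

radosgw-admin bucket stats --bucket=BUCKET_NAME            # usage section reports num_objects
rados -p .rgw.data.1 listomapkeys .dir.BUCKET_ID | wc -l   # entries in a single unsharded index object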

Actions #12

Updated by Samuel Just over 8 years ago

Can they simply be removed?
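If removal is the route taken for the orphaned index object, a hypothetical sketch using the names from the rados stat output above (only after confirming nothing still references that bucket instance):

rados -p .rgw.data.1 rm .dir.default.120331675.42828   # leftover index object of the long-deleted bucket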

Actions #13

Updated by Samuel Just over 8 years ago

Alternatively, those versions of Firefly didn't have a deep scrub suicide timeout set, so the current 60s default might just be too low. You could simply set the scrub suicide timeout to a larger value (600s?). I'm going to change the default to 300 anyway.
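A sketch of bumping it, assuming the relevant option in this build is osd_scrub_thread_suicide_timeout (paired with osd_scrub_thread_timeout for the disk_tp scrub queue):

# apply to all running OSDs at runtime:
ceph tell osd.* injectargs '--osd-scrub-thread-suicide-timeout 600'

# persist across restarts (ceph.conf on the OSD hosts):
[osd]
    osd scrub thread suicide timeout = 600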

Actions #14

Updated by Samuel Just over 8 years ago

  • Related to Backport #14376: scrub suicide timeout is the same as the regular timeout -- should probably match recovery at least added
Actions #15

Updated by Robin Johnson over 8 years ago

Samuel Just wrote:

Alternatively, those versions of Firefly didn't have a deep scrub suicide timeout set, so the current 60s default might just be too low. You could simply set the scrub suicide timeout to a larger value (600s?). I'm going to change the default to 300 anyway.

We've raised our scrub suicide timeout to 600s as suggested, and it appears to be much happier now; we'll run overnight and report back.

You were right re the EIO on osd.1003/osd.111; they were borderline and just starting to fail. They will be handled by our staff soon.

Actions #16

Updated by Samuel Just about 8 years ago

  • Status changed from New to Resolved

Marking resolved -- appears to be just flaky disks and too conservative defaults for this configuration (there is another bug for changing the defaults).
