Bug #8301

closed

Suicide Timeout on Cache Tier OSDs

Added by Mark Nelson almost 10 years ago. Updated almost 10 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This appears to happen in both firefly and master. During 4MB random write tests on an RBD volume with a Ceph cache tier configuration, the cache tier OSDs all commit suicide due to heartbeat timeouts. Before the suicide, all OSDs in the cluster (in some cases at multiple different points in time) claim to be internally unhealthy. All OSDs are on the same system. This issue only appears to happen during long-running tests with many objects; smaller tests with a smaller cache pool and a smaller volume did not reproduce it.
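
For reference, the cache tier layering in this kind of setup is created with commands along the following lines (pool names, PG count, and target size are placeholders for illustration, not the exact configuration used in these tests):

ceph osd pool create cachepool 1024
ceph osd tier add basepool cachepool
ceph osd tier cache-mode cachepool writeback
ceph osd tier set-overlay basepool cachepool
ceph osd pool set cachepool hit_set_type bloom
ceph osd pool set cachepool target_max_bytes 500000000000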

On the base pool OSDs:

osd.3.log:2014-05-07 05:44:12.973121 7ffad7f4d700 10 osd.3 152 internal heartbeat not healthy, dropping ping request
osd.3.log:2014-05-07 05:44:44.120810 7ffad6f4b700 10 osd.3 162 internal heartbeat not healthy, dropping ping request

While on the cache tier OSDs:

osd.35.log: -1536> 2014-05-07 05:42:40.533492 7f18a48a9700 10 osd.35 151 internal heartbeat not healthy, dropping ping request
osd.35.log: -1535> 2014-05-07 05:42:41.780083 7f18a38a7700 10 osd.35 151 internal heartbeat not healthy, dropping ping request

Eventually the cache tier OSDs show:

     0> 2014-05-07 05:45:27.531799 7fe2c9636700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fe2c9636700 time 2014-05-07 05:45:14.061454
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

 ceph version 0.80-rc1-153-g24c5ea8 (24c5ea8df040da0889be7ab1a9985ae03ee68d9e)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2ab) [0x99a08b]
 2: (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, long, long)+0x72) [0x99a392]
 3: (ThreadPool::worker(ThreadPool::WorkThread*)+0x5dd) [0xa42afd]
 4: (ThreadPool::WorkThread::entry()+0x10) [0xa43f00]
 5: (()+0x7f6e) [0x7fe2de543f6e]
 6: (clone()+0x6d) [0x7fe2dc8e49cd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 0 lockdep
   0/ 0 context
   0/ 0 crush
   0/ 0 mds
   0/ 0 mds_balancer
   0/ 0 mds_locker
   0/ 0 mds_log
   0/ 0 mds_log_expire
   0/ 0 mds_migrator
   0/ 0 buffer
   0/ 0 timer
   0/ 0 filer
   0/ 1 striper
   0/ 0 objecter
   0/ 0 rados
   0/ 0 rbd
   0/ 0 journaler
   0/ 0 objectcacher
   0/ 0 client
  20/20 osd
   0/ 0 optracker
   0/ 0 objclass
  20/20 filestore
   1/ 3 keyvaluestore
   0/ 0 journal
   1/ 1 ms
   0/ 0 mon
   0/ 0 monc
   0/ 0 paxos
   0/ 0 tp
   0/ 0 auth
   1/ 5 crypto
   0/ 0 finisher
   0/ 0 heartbeatmap
   0/ 0 perfcounter
   0/ 0 rgw
   1/ 5 javaclient
   0/ 0 asok
   0/ 0 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /tmp/cbt/ceph/log/osd.30.log
--- end dump of recent events ---
2014-05-07 05:45:27.531801 7fe2db79d700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fe2db79d700 time 2014-05-07 05:44:57.707730
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

 ceph version 0.80-rc1-153-g24c5ea8 (24c5ea8df040da0889be7ab1a9985ae03ee68d9e)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2ab) [0x99a08b]
 2: (ceph::HeartbeatMap::is_healthy()+0xb6) [0x99a8f6]
 3: (ceph::HeartbeatMap::check_touch_file()+0x17) [0x99af07]
 4: (CephContextServiceThread::entry()+0x154) [0xa65ec4]
 5: (()+0x7f6e) [0x7fe2de543f6e]
 6: (clone()+0x6d) [0x7fe2dc8e49cd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

2014-05-07 05:45:27.679140 7fe2ca638700  1 -- 192.168.10.1:6950/7353 --> 192.168.10.2:6789/0 -- mon_subscribe({monmap=2+,osd_pg_creates=0}) v2 -- ?+0 0x4354000 con 0x1511f080
2014-05-07 05:45:27.679208 7fe2ca638700  1 -- 192.168.10.1:6950/7353 --> 192.168.10.2:6789/0 -- log(1 entries) v1 -- ?+0 0x313de480 con 0x1511f080
2014-05-07 05:45:28.205448 7fe2bd3f9700  1 -- 192.168.10.1:6950/7353 >> :/0 pipe(0xcab3480 sd=36 :6950 s=0 pgs=0 cs=0 l=0 c=0x1279d340).accept sd=36 192.168.10.2:54864/0
2014-05-07 05:45:28.406463 7fe2d7795700 20 filestore(/tmp/cbt/mnt/osd-device-30-data) sync_entry woke after 5.000071
2014-05-07 05:45:28.406494 7fe2d7795700 20 filestore(/tmp/cbt/mnt/osd-device-30-data) sync_entry waiting for max_interval 5.000000
2014-05-07 05:45:28.760636 7fe2c3e2b700 20 osd.30 152 update_osd_stat osd_stat(69656 MB used, 92432 MB avail, 158 GB total, peers []/[] op hist [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,33])
2014-05-07 05:45:28.760658 7fe2c3e2b700  5 osd.30 152 heartbeat: osd_stat(69656 MB used, 92432 MB avail, 158 GB total, peers []/[] op hist [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,33])
--- begin dump of recent events ---
    -7> 2014-05-07 05:45:27.531801 7fe2db79d700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fe2db79d700 time 2014-05-07 05:44:57.707730
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

 ceph version 0.80-rc1-153-g24c5ea8 (24c5ea8df040da0889be7ab1a9985ae03ee68d9e)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2ab) [0x99a08b]
 2: (ceph::HeartbeatMap::is_healthy()+0xb6) [0x99a8f6]
 3: (ceph::HeartbeatMap::check_touch_file()+0x17) [0x99af07]
 4: (CephContextServiceThread::entry()+0x154) [0xa65ec4]
 5: (()+0x7f6e) [0x7fe2de543f6e]
 6: (clone()+0x6d) [0x7fe2dc8e49cd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

    -6> 2014-05-07 05:45:27.679140 7fe2ca638700  1 -- 192.168.10.1:6950/7353 --> 192.168.10.2:6789/0 -- mon_subscribe({monmap=2+,osd_pg_creates=0}) v2 -- ?+0 0x4354000 con 0x1511f080
    -5> 2014-05-07 05:45:27.679208 7fe2ca638700  1 -- 192.168.10.1:6950/7353 --> 192.168.10.2:6789/0 -- log(1 entries) v1 -- ?+0 0x313de480 con 0x1511f080
    -4> 2014-05-07 05:45:28.205448 7fe2bd3f9700  1 -- 192.168.10.1:6950/7353 >> :/0 pipe(0xcab3480 sd=36 :6950 s=0 pgs=0 cs=0 l=0 c=0x1279d340).accept sd=36 192.168.10.2:54864/0
    -3> 2014-05-07 05:45:28.406463 7fe2d7795700 20 filestore(/tmp/cbt/mnt/osd-device-30-data) sync_entry woke after 5.000071
    -2> 2014-05-07 05:45:28.406494 7fe2d7795700 20 filestore(/tmp/cbt/mnt/osd-device-30-data) sync_entry waiting for max_interval 5.000000
    -1> 2014-05-07 05:45:28.760636 7fe2c3e2b700 20 osd.30 152 update_osd_stat osd_stat(69656 MB used, 92432 MB avail, 158 GB total, peers []/[] op hist [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,33])
     0> 2014-05-07 05:45:28.760658 7fe2c3e2b700  5 osd.30 152 heartbeat: osd_stat(69656 MB used, 92432 MB avail, 158 GB total, peers []/[] op hist [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,33])
--- logging levels ---
   0/ 5 none
   0/ 0 lockdep
   0/ 0 context
   0/ 0 crush
   0/ 0 mds
   0/ 0 mds_balancer
   0/ 0 mds_locker
   0/ 0 mds_log
   0/ 0 mds_log_expire
   0/ 0 mds_migrator
   0/ 0 buffer
   0/ 0 timer
   0/ 0 filer
   0/ 1 striper
   0/ 0 objecter
   0/ 0 rados
   0/ 0 rbd
   0/ 0 journaler
   0/ 0 objectcacher
   0/ 0 client
  20/20 osd
   0/ 0 optracker
   0/ 0 objclass
  20/20 filestore
   1/ 3 keyvaluestore
   0/ 0 journal
   1/ 1 ms
   0/ 0 mon
   0/ 0 monc
   0/ 0 paxos
   0/ 0 tp
   0/ 0 auth
   1/ 5 crypto
   0/ 0 finisher
   0/ 0 heartbeatmap
   0/ 0 perfcounter
   0/ 0 rgw
   1/ 5 javaclient
   0/ 0 asok
   0/ 0 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /tmp/cbt/ceph/log/osd.30.log
--- end dump of recent events ---
Actions #1

Updated by Ian Colle almost 10 years ago

  • Priority changed from Normal to Urgent
Actions #2

Updated by Greg Farnum almost 10 years ago

Sounds like you're just putting too many ops into the cluster too quickly. Is there anything making you think differently?

Actions #3

Updated by Mark Nelson almost 10 years ago

Greg:

That very well may be the case. What I am seeing is that if each test in the set is restricted to a shorter run time (say 2 minutes) we don't hit this, but with longer-running tests (30 minutes each) we hit it very quickly once the 4MB random write tests start. Perhaps things have just gotten so far behind under the intense workload that everything backs up.

Having said that, I am not sufficiently well versed in how the heartbeat code works to know whether it makes sense that heavy client IO should be able to cause internal heartbeat health failures. Looking at the code in HeartbeatMap.cc, it appears that this happens when we fail a timeout test based on some grace period, but I haven't read into it enough to understand the details.
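
From what I can tell, the check in common/HeartbeatMap.cc boils down to comparing how long it has been since a worker thread last touched its heartbeat handle against two thresholds: a grace period that only marks the daemon internally unhealthy, and a much larger suicide timeout that triggers the assert in the backtraces above. A minimal sketch of the idea (illustrative field and function names, not the actual Ceph code):

#include <cassert>
#include <ctime>

// Illustrative only: each worker thread periodically "touches" its handle;
// the heartbeat map then checks every handle against two thresholds.
struct heartbeat_handle {
  time_t last_touch;     // updated whenever the thread makes progress
  int grace;             // e.g. the op thread timeout, in seconds
  int suicide_grace;     // e.g. the op thread suicide timeout, in seconds
};

bool check_handle(const heartbeat_handle &h, time_t now) {
  bool healthy = true;
  if (now > h.last_touch + h.grace) {
    // -> "had timed out after N" / "internal heartbeat not healthy"
    healthy = false;
  }
  if (h.suicide_grace > 0 && now > h.last_touch + h.suicide_grace) {
    // -> FAILED assert(0 == "hit suicide timeout")
    assert(0 == "hit suicide timeout");
  }
  return healthy;
}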

Actions #4

Updated by Greg Farnum almost 10 years ago

Well, the OSDs are supposed to throttle incoming client requests in order to prevent this, but they're generally better at doing that with bandwidth than ops, and cache pools are adding a bunch of work that our current defaults weren't configured to account for.
The internal heartbeats are just saying "this op took too long to finish", but that could happen if, say, the FileStore code is saying "nope, I've got too many uncommitted ops so you're going to block a while" and the upper layers keep letting in new ops.
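
For reference, the knobs involved on the OSD side look roughly like the following ceph.conf excerpt (the option names are real; the values are just my recollection of the firefly-era defaults and are shown purely for illustration):

[osd]
    osd op thread timeout = 15             # grace: produces "had timed out after 15"
    osd op thread suicide timeout = 150    # hitting this aborts the OSD
    filestore op thread timeout = 60
    filestore op thread suicide timeout = 180
    filestore queue max ops = 50           # backpressure on the FileStore queue
    filestore queue max bytes = 104857600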

Actions #5

Updated by Samuel Just almost 10 years ago

You need to reproduce this with debug osd = 20, debug filestore = 20, debug ms = 1.
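
(For reference, those levels can either go in the [osd] section of ceph.conf before the run, or be injected into a running OSD with something like `ceph tell osd.0 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'`; the osd.0 target is just an example.)

[osd]
    debug osd = 20
    debug filestore = 20
    debug ms = 1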

Actions #6

Updated by Mark Nelson almost 10 years ago

Sam:

This was tested with debug osd = 20, debug filestore = 20, debug ms = 1.

Unfortunately some of the other debug levels were below normal and we didn't get info about what thread was stalling, so tests are rerunning now.

Actions #7

Updated by Samuel Just almost 10 years ago

This backtrace is consistent with the hangs I saw in some logs generated in a subsequent run.

(gdb) bt
#0 ReplicatedPG::calc_trim_to (this=0x31da000) at osd/ReplicatedPG.cc:1041
#1 0x0000000000804ea6 in ReplicatedPG::execute_ctx (this=this@entry=0x31da000, ctx=ctx@entry=0x416e7200) at osd/ReplicatedPG.cc:1838
#2 0x00000000008093af in ReplicatedPG::do_op (this=0x31da000, op=std::tr1::shared_ptr (count 6, weak 0) 0x2afa43c0) at osd/ReplicatedPG.cc:1549
#3 0x00000000007a8855 in ReplicatedPG::do_request (this=0x31da000, op=std::tr1::shared_ptr (count 6, weak 0) 0x2afa43c0, handle=...) at osd/ReplicatedPG.cc:1123
#4 0x0000000000606431 in OSD::dequeue_op (this=0x2af8000, pg=..., op=std::tr1::shared_ptr (count 6, weak 0) 0x2afa43c0, handle=...) at osd/OSD.cc:7688
#5 0x0000000000620a74 in OSD::OpWQ::_process (this=0x2af8e28, pg=..., handle=...) at osd/OSD.cc:7658
#6 0x000000000066363c in ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process (this=0x2af8e28, handle=...) at ./common/WorkQueue.h:190
#7 0x0000000000a4300f in ThreadPool::worker (this=0x2af8470, wt=0x2aaf300) at common/WorkQueue.cc:125
#8 0x0000000000a43f00 in ThreadPool::WorkThread::entry (this=<optimized out>) at common/WorkQueue.h:317
#9 0x00007f09dc27bf6e in start_thread (arg=0x7f09c4368700) at pthread_create.c:311
#10 0x00007f09da61c9cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

(gdb) print pg_log.log.head
$5 = {version = 4897, epoch = 134, __pad = 0}
(gdb) print pg_log.log.tail
$6 = {version = 68, epoch = 134, __pad = 0}
(gdb) print info.last_update
$7 = {version = 4897, epoch = 134, __pad = 0}
(gdb) print info.log_tail
$8 = {version = 68, epoch = 134, __pad = 0}
(gdb) print i
$9 = 1344
(gdb) print new_trim_to
$10 = {version = 0, epoch = 0, __pad = 0}
(gdb) print min_last_complete_ondisk
$11 = {version = 4894, epoch = 134, __pad = 0}
(gdb) print num_to_trim
$12 = 1829
(gdb)

Actions #8

Updated by Samuel Just almost 10 years ago

i is only 1344; I don't yet see how this situation leads to a hang. The value of new_trim_to is odd.

I've pushed a branch with some debugging in calc_trim_to(): wip-8301.

Actions #9

Updated by Samuel Just almost 10 years ago

The above was on 24c5ea8df040da0889be7ab1a9985ae03ee68d9e.

Actions #10

Updated by Mark Nelson almost 10 years ago

This was reproduced last night using wip-8301 without the cache tier active. Restarting the failed OSDs resulted in cluster recovery and the tests subsequently continuing, i.e. the problem doesn't persist across restarts. Current belief is that there may be a bug in the trim code and/or mysterious heap corruption.

Side Note: Encountered #8232 during the most recent test as well.

Actions #11

Updated by Mark Nelson almost 10 years ago

Over the weekend I was able to reproduce this under valgrind. Sadly, valgrind didn't report any errors for the OSDs that went down (though it did produce core files). There was a bit of definitely-leaked memory and some possibly-leaked memory. Using GDB to examine the core files didn't show any threads in trim at all. Looking through the logs, one of the OSDs that eventually committed suicide was first marked down by another one:

2014-05-11 07:10:02.494011 167ba700 10 osd.33 2352 handle_osd_ping osd.32 192.168.10.1:6962/9462 says i am down in 2355

which itself reported:

2014-05-11 07:09:59.484555 2e039700  1 -- 192.168.10.1:6962/9462 >> :/0 pipe(0x2a420c20 sd=39 :6962 s=0 pgs=0 cs=0 l=0 c=0x2a89a910).accept sd=39 192.168.10.1:33845/0
2014-05-11 07:09:59.485993 28401700  1 -- 192.168.10.1:6963/9462 >> :/0 pipe(0x24ddebf0 sd=51 :6963 s=0 pgs=0 cs=0 l=0 c=0x2bb23160).accept sd=51 192.168.10.1:54779/0
2014-05-11 07:09:59.494535 2df38700  1 -- 192.168.10.1:6960/9462 >> :/0 pipe(0x2b9a24b0 sd=75 :6960 s=0 pgs=0 cs=0 l=0 c=0x3b5cb970).accept sd=75 192.168.10.2:41837/0
2014-05-11 07:09:59.765187 157b8700 10 osd.32 2352  advance to epoch 2353 (<= newest 2355)
2014-05-11 07:09:59.771667 157b8700  1 -- 192.168.10.1:6961/9462 mark_down 192.168.10.1:6896/3021488 -- pipe dne
2014-05-11 07:09:59.772114 157b8700  7 osd.32 2353 advance_map epoch 2353  0 pgs
2014-05-11 07:09:59.777988 157b8700 10 osd.32 2353  advance to epoch 2354 (<= newest 2355)
2014-05-11 07:09:59.785828 157b8700  1 -- 192.168.10.1:6961/9462 mark_down 192.168.10.1:6921/4021964 -- pipe dne
2014-05-11 07:09:59.786294 157b8700  1 -- 192.168.10.1:6961/9462 mark_down 192.168.10.1:6966/10427 -- 0x2ba4ecd0
2014-05-11 07:09:59.789207 147b6700  1 osd.32 2353 ms_handle_reset con 0x3b5cb970 session 0
2014-05-11 07:09:59.801405 157b8700  1 -- 192.168.10.1:0/9462 mark_down 0x29ed1f70 -- 0x29ed1cd0
2014-05-11 07:09:59.810771 157b8700  1 -- 192.168.10.1:0/9462 mark_down 0x29ed2580 -- 0x29ed22e0
2014-05-11 07:09:59.823592 157b8700  7 osd.32 2354 advance_map epoch 2354  0 pgs
2014-05-11 07:09:59.830068 157b8700 10 osd.32 2354  advance to epoch 2355 (<= newest 2355)
2014-05-11 07:09:59.836726 157b8700  1 -- 192.168.10.1:6961/9462 mark_down 192.168.10.1:6971/11581 -- 0x2bdf2d10
2014-05-11 07:10:00.254733 157b8700  1 -- 192.168.10.1:0/9462 mark_down 0x2b968ca0 -- 0x2b968a00
2014-05-11 07:10:00.256135 157b8700  1 -- 192.168.10.1:0/9462 mark_down 0x2b96aa20 -- 0x2b96a780
2014-05-11 07:10:00.263173 157b8700  7 osd.32 2355 advance_map epoch 2355  0 pgs
2014-05-11 07:10:00.271104 157b8700 10 osd.32 2355 write_superblock sb(f8d57262-1ed0-4f73-b83a-0aa19ba3c139 osd.32 97539fe3-f882-4555-8c1a-25b93396003d e2355 [1,2355] lci=[130,2355])
2014-05-11 07:10:00.601168 157b8700  5 filestore(/tmp/cbt/mnt/osd-device-32-data) queue_transactions existing osr(default 0x7b90238)/0x7b90238
2014-05-11 07:10:00.603484 157b8700  5 filestore(/tmp/cbt/mnt/osd-device-32-data) queue_transactions (writeahead) 2070 0x7a6f750
2014-05-11 07:10:00.711442 157b8700  7 osd.32 2355 consume_map version 2355
2014-05-11 07:10:00.712531 157b8700  7 osd.32 2355 activate_map version 2355

I am going to upgrade to the latest master with the fix for #8232 and retest.

Actions #12

Updated by Mark Nelson almost 10 years ago

Sadly this happened again today when running RBD tests with 18 OSDs using SSD journals. I'm not seeing much new in the core dump.

     0> 2014-05-15 20:06:50.125016 7fe58aed3700 -1 msg/Pipe.cc: In function 'int Pipe::connect()' thread 7fe58aed3700 time 2014-05-15 20:06:33.525889
msg/Pipe.cc: 1070: FAILED assert(m)

 ceph version 0.80-rc1-154-gb2a5ec7 (b2a5ec71bbbff13b59d9e2e8dbd81c2c1d455d2c)
 1: (Pipe::connect()+0x37c1) [0xb0e601]
 2: (Pipe::writer()+0x4f1) [0xb0fb51]
 3: (Pipe::Writer::entry()+0xd) [0xb1aa7d]
 4: (()+0x7f6e) [0x7fe5b5f0af6e]
 5: (clone()+0x6d) [0x7fe5b42ab9cd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /tmp/cbt/ceph/log/osd.2.log
--- end dump of recent events ---
2014-05-15 20:06:53.799163 7fe5b3164700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7fe5a0ffd700' had timed out after 15
2014-05-15 20:06:53.799185 7fe5b3164700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7fe59dff7700' had timed out after 15
2014-05-15 20:06:53.799189 7fe5b3164700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7fe59eff9700' had timed out after 15
2014-05-15 20:06:53.799191 7fe5b3164700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7fe59f7fa700' had timed out after 15
2014-05-15 20:06:53.799195 7fe5b3164700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7fe59fffb700' had timed out after 15
2014-05-15 20:06:53.799197 7fe5b3164700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7fe5a07fc700' had timed out after 15
2014-05-15 20:06:53.799200 7fe5b3164700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7fe59d7f6700' had timed out after 15
2014-05-15 20:06:55.376780 7fe58aed3700 -1 *** Caught signal (Aborted) **
 in thread 7fe58aed3700
 ceph version 0.80-rc1-154-gb2a5ec7 (b2a5ec71bbbff13b59d9e2e8dbd81c2c1d455d2c)
 1: ceph-osd() [0x96f8bf]
 2: (()+0xfbb0) [0x7fe5b5f12bb0]
 3: (gsignal()+0x37) [0x7fe5b41e7f77]
 4: (abort()+0x148) [0x7fe5b41eb5e8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fe5b4af36e5]
 6: (()+0x5e856) [0x7fe5b4af1856]
 7: (()+0x5e883) [0x7fe5b4af1883]
 8: (()+0x5eaae) [0x7fe5b4af1aae]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1f2) [0xa523b2]
 10: (Pipe::connect()+0x37c1) [0xb0e601]
 11: (Pipe::writer()+0x4f1) [0xb0fb51]
 12: (Pipe::Writer::entry()+0xd) [0xb1aa7d]
 13: (()+0x7f6e) [0x7fe5b5f0af6e]
 14: (clone()+0x6d) [0x7fe5b42ab9cd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
   -11> 2014-05-15 20:06:51.275191 7fe59b7f2700  5 osd.2 64 heartbeat: osd_stat(43395 MB used, 888 GB avail, 931 GB total, peers []/[] op hist [])
   -10> 2014-05-15 20:06:52.216410 7fe5a1fff700 10 monclient: tick
    -9> 2014-05-15 20:06:52.216426 7fe5a1fff700 10 monclient: _check_auth_rotating renewing rotating keys (they expired before 2014-05-15 20:06:22.216425)
    -8> 2014-05-15 20:06:52.216447 7fe5a1fff700 10 monclient: renew subs? (now: 2014-05-15 20:06:52.216447; renew after: 2014-05-15 20:08:56.473454) -- no
    -7> 2014-05-15 20:06:53.799163 7fe5b3164700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7fe5a0ffd700' had timed out after 15
    -6> 2014-05-15 20:06:53.799185 7fe5b3164700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7fe59dff7700' had timed out after 15
    -5> 2014-05-15 20:06:53.799189 7fe5b3164700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7fe59eff9700' had timed out after 15
    -4> 2014-05-15 20:06:53.799191 7fe5b3164700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7fe59f7fa700' had timed out after 15
    -3> 2014-05-15 20:06:53.799195 7fe5b3164700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7fe59fffb700' had timed out after 15
    -2> 2014-05-15 20:06:53.799197 7fe5b3164700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7fe5a07fc700' had timed out after 15
    -1> 2014-05-15 20:06:53.799200 7fe5b3164700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7fe59d7f6700' had timed out after 15
     0> 2014-05-15 20:06:55.376780 7fe58aed3700 -1 *** Caught signal (Aborted) **
 in thread 7fe58aed3700

 ceph version 0.80-rc1-154-gb2a5ec7 (b2a5ec71bbbff13b59d9e2e8dbd81c2c1d455d2c)
 1: ceph-osd() [0x96f8bf]
 2: (()+0xfbb0) [0x7fe5b5f12bb0]
 3: (gsignal()+0x37) [0x7fe5b41e7f77]
 4: (abort()+0x148) [0x7fe5b41eb5e8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fe5b4af36e5]
 6: (()+0x5e856) [0x7fe5b4af1856]
 7: (()+0x5e883) [0x7fe5b4af1883]
 8: (()+0x5eaae) [0x7fe5b4af1aae]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1f2) [0xa523b2]
 10: (Pipe::connect()+0x37c1) [0xb0e601]
 11: (Pipe::writer()+0x4f1) [0xb0fb51]
 12: (Pipe::Writer::entry()+0xd) [0xb1aa7d]
 13: (()+0x7f6e) [0x7fe5b5f0af6e]
 14: (clone()+0x6d) [0x7fe5b42ab9cd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /tmp/cbt/ceph/log/osd.2.log
--- end dump of recent events ---
Actions #13

Updated by Sage Weil almost 10 years ago

  • Assignee set to Mark Nelson
Actions #14

Updated by Mark Nelson almost 10 years ago

After creating and digging through many more gcore, valgrind, and collectl logs, it appears that this issue is related to triggering bad behavior in the kernel VM layer. Basically, we are causing major page faults even when there is plenty of memory available as buffer cache, which can delay things enough to cause heartbeat timeouts. A bug has been filed with Red Hat to try to get the attention of some of the kernel folks here:

https://bugzilla.redhat.com/show_bug.cgi?id=1099702
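
(The major faults themselves are easy to watch from userspace while the test runs; e.g. with sysstat installed, something like the following reports per-process majflt/s every second. The process selection shown is just an example.)

pidstat -r -p $(pgrep -d, -f ceph-osd) 1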

A workaround that so far appears to mitigate the problem is to set swappiness to 10:

echo 10 | sudo tee /proc/sys/vm/swappiness
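
(If that holds up, the setting can be made persistent across reboots in the usual way, e.g.:)

echo "vm.swappiness = 10" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p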
Actions #15

Updated by Samuel Just almost 10 years ago

  • Status changed from New to Resolved