Bug #8301 (Closed): Suicide Timeout on Cache Tier OSDs
% Done: 0%
Source: other
Severity: 3 - minor
Description
This appears to happen on both firefly and master. During 4MB random-write tests on an RBD volume in a Ceph cache tier configuration, the cache tier pool OSDs all commit suicide due to heartbeat timeouts. Before the suicide, all OSDs in the cluster (in some cases at multiple different points in time) report themselves as internally unhealthy. All OSDs are on the same system. The issue only appears during long-running tests with many objects; smaller tests with a smaller cache pool and a smaller volume did not reproduce it.
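For reference, a setup along these lines should exercise the same path. This is a sketch, not the exact cbt configuration used for these runs: the pool names, sizes, and fio parameters below are illustrative.

```
# Create a base pool and a smaller cache pool, then attach the cache
# pool as a writeback tier in front of the base pool.
ceph osd pool create base 1024
ceph osd pool create cache 512
ceph osd tier add base cache
ceph osd tier cache-mode cache writeback
ceph osd tier set-overlay base cache

# Large RBD volume plus a long 4MB random-write run (fio's rbd engine).
rbd create --size 262144 base/test-volume
fio --ioengine=rbd --pool=base --rbdname=test-volume \
    --rw=randwrite --bs=4M --iodepth=16 \
    --time_based --runtime=7200 --name=cachetier-randwrite
```

The key reproduction conditions per the description are the long runtime and large object count; short runs against a small cache pool did not trigger the timeouts.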
On the base pool OSDs:
osd.3.log:2014-05-07 05:44:12.973121 7ffad7f4d700 10 osd.3 152 internal heartbeat not healthy, dropping ping request
osd.3.log:2014-05-07 05:44:44.120810 7ffad6f4b700 10 osd.3 162 internal heartbeat not healthy, dropping ping request
While on the cache tier OSDs:
osd.35.log: -1536> 2014-05-07 05:42:40.533492 7f18a48a9700 10 osd.35 151 internal heartbeat not healthy, dropping ping request
osd.35.log: -1535> 2014-05-07 05:42:41.780083 7f18a38a7700 10 osd.35 151 internal heartbeat not healthy, dropping ping request
Eventually the cache tier OSDs show:
     0> 2014-05-07 05:45:27.531799 7fe2c9636700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fe2c9636700 time 2014-05-07 05:45:14.061454
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
 ceph version 0.80-rc1-153-g24c5ea8 (24c5ea8df040da0889be7ab1a9985ae03ee68d9e)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2ab) [0x99a08b]
 2: (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, long, long)+0x72) [0x99a392]
 3: (ThreadPool::worker(ThreadPool::WorkThread*)+0x5dd) [0xa42afd]
 4: (ThreadPool::WorkThread::entry()+0x10) [0xa43f00]
 5: (()+0x7f6e) [0x7fe2de543f6e]
 6: (clone()+0x6d) [0x7fe2dc8e49cd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
   0/ 5 none
   0/ 0 lockdep
   0/ 0 context
   0/ 0 crush
   0/ 0 mds
   0/ 0 mds_balancer
   0/ 0 mds_locker
   0/ 0 mds_log
   0/ 0 mds_log_expire
   0/ 0 mds_migrator
   0/ 0 buffer
   0/ 0 timer
   0/ 0 filer
   0/ 1 striper
   0/ 0 objecter
   0/ 0 rados
   0/ 0 rbd
   0/ 0 journaler
   0/ 0 objectcacher
   0/ 0 client
  20/20 osd
   0/ 0 optracker
   0/ 0 objclass
  20/20 filestore
   1/ 3 keyvaluestore
   0/ 0 journal
   1/ 1 ms
   0/ 0 mon
   0/ 0 monc
   0/ 0 paxos
   0/ 0 tp
   0/ 0 auth
   1/ 5 crypto
   0/ 0 finisher
   0/ 0 heartbeatmap
   0/ 0 perfcounter
   0/ 0 rgw
   1/ 5 javaclient
   0/ 0 asok
   0/ 0 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 10000
  max_new 1000
  log_file /tmp/cbt/ceph/log/osd.30.log
--- end dump of recent events ---

2014-05-07 05:45:27.531801 7fe2db79d700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fe2db79d700 time 2014-05-07 05:44:57.707730
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
 ceph version 0.80-rc1-153-g24c5ea8 (24c5ea8df040da0889be7ab1a9985ae03ee68d9e)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2ab) [0x99a08b]
 2: (ceph::HeartbeatMap::is_healthy()+0xb6) [0x99a8f6]
 3: (ceph::HeartbeatMap::check_touch_file()+0x17) [0x99af07]
 4: (CephContextServiceThread::entry()+0x154) [0xa65ec4]
 5: (()+0x7f6e) [0x7fe2de543f6e]
 6: (clone()+0x6d) [0x7fe2dc8e49cd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

2014-05-07 05:45:27.679140 7fe2ca638700  1 -- 192.168.10.1:6950/7353 --> 192.168.10.2:6789/0 -- mon_subscribe({monmap=2+,osd_pg_creates=0}) v2 -- ?+0 0x4354000 con 0x1511f080
2014-05-07 05:45:27.679208 7fe2ca638700  1 -- 192.168.10.1:6950/7353 --> 192.168.10.2:6789/0 -- log(1 entries) v1 -- ?+0 0x313de480 con 0x1511f080
2014-05-07 05:45:28.205448 7fe2bd3f9700  1 -- 192.168.10.1:6950/7353 >> :/0 pipe(0xcab3480 sd=36 :6950 s=0 pgs=0 cs=0 l=0 c=0x1279d340).accept sd=36 192.168.10.2:54864/0
2014-05-07 05:45:28.406463 7fe2d7795700 20 filestore(/tmp/cbt/mnt/osd-device-30-data) sync_entry woke after 5.000071
2014-05-07 05:45:28.406494 7fe2d7795700 20 filestore(/tmp/cbt/mnt/osd-device-30-data) sync_entry waiting for max_interval 5.000000
2014-05-07 05:45:28.760636 7fe2c3e2b700 20 osd.30 152 update_osd_stat osd_stat(69656 MB used, 92432 MB avail, 158 GB total, peers []/[] op hist [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,33])
2014-05-07 05:45:28.760658 7fe2c3e2b700  5 osd.30 152 heartbeat: osd_stat(69656 MB used, 92432 MB avail, 158 GB total, peers []/[] op hist [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,33])

--- begin dump of recent events ---
    -7> 2014-05-07 05:45:27.531801 7fe2db79d700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fe2db79d700 time 2014-05-07 05:44:57.707730
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
 ceph version 0.80-rc1-153-g24c5ea8 (24c5ea8df040da0889be7ab1a9985ae03ee68d9e)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2ab) [0x99a08b]
 2: (ceph::HeartbeatMap::is_healthy()+0xb6) [0x99a8f6]
 3: (ceph::HeartbeatMap::check_touch_file()+0x17) [0x99af07]
 4: (CephContextServiceThread::entry()+0x154) [0xa65ec4]
 5: (()+0x7f6e) [0x7fe2de543f6e]
 6: (clone()+0x6d) [0x7fe2dc8e49cd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
    -6> 2014-05-07 05:45:27.679140 7fe2ca638700  1 -- 192.168.10.1:6950/7353 --> 192.168.10.2:6789/0 -- mon_subscribe({monmap=2+,osd_pg_creates=0}) v2 -- ?+0 0x4354000 con 0x1511f080
    -5> 2014-05-07 05:45:27.679208 7fe2ca638700  1 -- 192.168.10.1:6950/7353 --> 192.168.10.2:6789/0 -- log(1 entries) v1 -- ?+0 0x313de480 con 0x1511f080
    -4> 2014-05-07 05:45:28.205448 7fe2bd3f9700  1 -- 192.168.10.1:6950/7353 >> :/0 pipe(0xcab3480 sd=36 :6950 s=0 pgs=0 cs=0 l=0 c=0x1279d340).accept sd=36 192.168.10.2:54864/0
    -3> 2014-05-07 05:45:28.406463 7fe2d7795700 20 filestore(/tmp/cbt/mnt/osd-device-30-data) sync_entry woke after 5.000071
    -2> 2014-05-07 05:45:28.406494 7fe2d7795700 20 filestore(/tmp/cbt/mnt/osd-device-30-data) sync_entry waiting for max_interval 5.000000
    -1> 2014-05-07 05:45:28.760636 7fe2c3e2b700 20 osd.30 152 update_osd_stat osd_stat(69656 MB used, 92432 MB avail, 158 GB total, peers []/[] op hist [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,33])
     0> 2014-05-07 05:45:28.760658 7fe2c3e2b700  5 osd.30 152 heartbeat: osd_stat(69656 MB used, 92432 MB avail, 158 GB total, peers []/[] op hist [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,33])
--- end dump of recent events ---
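The first backtrace dies inside ThreadPool::worker, i.e. an OSD worker thread failed to check in with the internal heartbeat watchdog within the suicide grace period. As a diagnostic aid only (raising the grace masks the stall rather than fixing it), the watchdog thresholds are configurable; a sketch of the relevant ceph.conf options, assuming the option names and defaults from this era of Ceph (op threads defaulted to roughly 150s, filestore op threads to roughly 180s):

```
[osd]
    ; Lengthen the suicide grace to keep OSDs alive long enough to
    ; observe what the stuck threads are doing. Illustrative values.
    osd op thread suicide timeout = 300
    filestore op thread suicide timeout = 300
```

With longer timeouts, `ceph daemon osd.<id> dump_ops_in_flight` during the unhealthy window may show which operations are backing up on the cache tier.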