Bug #11677 (closed)

Almost all OSDs in the cluster crashing at the same time, repeatedly

Added by Daniel Schneller almost 9 years ago. Updated over 8 years ago.

Status: Resolved
Priority: High
Category: OSD
% Done: 0%
Source: other
Backport: hammer
Regression: No
Severity: 1 - critical

Description

Tonight our Hammer cluster suffered a series of OSD crashes on all cluster nodes.
We are running a custom Hammer build (0.94.1-98-g7df3eb5), which we made after a major problem a week ago that we suspected was related to bugs we had found in the tracker whose fixes were not yet included in 0.94.1.

Around 22:00 users started reporting that a web application was down, and at the same time we found lots of OSD crashes and restarts.
The stack trace in the log looks like this on all of them:

   -10> 2015-05-18 22:02:21.926872 7f5ac7722700  1 -- 10.102.4.14:6833/14731 <== osd.6 10.102.5.11:0/32301 219298 ==== osd_ping(ping e34350 stamp 2015-05-18 22:02:21.920146) v2 ==== 47+0+0 (1114598428 0 0) 0x5062e200 con 0x4d8ff4a0
    -9> 2015-05-18 22:02:21.926906 7f5ac7722700  1 -- 10.102.4.14:6833/14731 --> 10.102.5.11:0/32301 -- osd_ping(ping_reply e34348 stamp 2015-05-18 22:02:21.920146) v2 -- ?+0 0x4f018400 con 0x4d8ff4a0
    -8> 2015-05-18 22:02:21.939509 7f5ab59ec700  1 -- 10.102.4.14:6832/14731 <== client.139501649 10.102.4.15:0/1037325 48 ==== osd_op(client.139501649.0:755810 rbd_data.3f6f4af6ff14e92.0000000000000044 [stat,set-alloc-hint object_size 8388608 write_size 8388608,write 905216~4096] 19.e650fba6 ack+ondisk+write+known_if_redirected e34349) v5 ==== 276+0+4096 (1982592369 0 3054609927) 0x21935900 con 0x20faf9c0
    -7> 2015-05-18 22:02:21.939545 7f5ab59ec700  5 -- op tracker -- seq: 11397738, time: 2015-05-18 22:02:21.939401, event: header_read, op: osd_op(client.139501649.0:755810 rbd_data.3f6f4af6ff14e92.0000000000000044 [stat,set-alloc-hint object_size 8388608 write_size 8388608,write 905216~4096] 19.e650fba6 ack+ondisk+write+known_if_redirected e34349)
    -6> 2015-05-18 22:02:21.939558 7f5ab59ec700  5 -- op tracker -- seq: 11397738, time: 2015-05-18 22:02:21.939405, event: throttled, op: osd_op(client.139501649.0:755810 rbd_data.3f6f4af6ff14e92.0000000000000044 [stat,set-alloc-hint object_size 8388608 write_size 8388608,write 905216~4096] 19.e650fba6 ack+ondisk+write+known_if_redirected e34349)
    -5> 2015-05-18 22:02:21.939566 7f5ab59ec700  5 -- op tracker -- seq: 11397738, time: 2015-05-18 22:02:21.939497, event: all_read, op: osd_op(client.139501649.0:755810 rbd_data.3f6f4af6ff14e92.0000000000000044 [stat,set-alloc-hint object_size 8388608 write_size 8388608,write 905216~4096] 19.e650fba6 ack+ondisk+write+known_if_redirected e34349)
    -4> 2015-05-18 22:02:21.939575 7f5ab59ec700  5 -- op tracker -- seq: 11397738, time: 0.000000, event: dispatched, op: osd_op(client.139501649.0:755810 rbd_data.3f6f4af6ff14e92.0000000000000044 [stat,set-alloc-hint object_size 8388608 write_size 8388608,write 905216~4096] 19.e650fba6 ack+ondisk+write+known_if_redirected e34349)
    -3> 2015-05-18 22:02:21.939594 7f5ab59ec700 10 monclient: renew_subs
    -2> 2015-05-18 22:02:21.939603 7f5ab59ec700 10 monclient: _send_mon_message to mon.node01 at 10.102.4.11:6789/0
    -1> 2015-05-18 22:02:21.939608 7f5ab59ec700  1 -- 10.102.4.14:6832/14731 --> 10.102.4.11:6789/0 -- mon_subscribe({monmap=4+,osd_pg_creates=0,osdmap=34349}) v2 -- ?+0 0x505fc000 con 0x3d5209a0
     0> 2015-05-18 22:02:21.961378 7f5ad8cb5700 -1 *** Caught signal (Aborted) **
 in thread 7f5ad8cb5700

 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682)
 1: /usr/bin/ceph-osd() [0xacb3ba]
 2: (()+0x10340) [0x7f5ae7000340]
 3: (gsignal()+0x39) [0x7f5ae549fbb9]
 4: (abort()+0x148) [0x7f5ae54a2fc8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f5ae5dab6b5]
 6: (()+0x5e836) [0x7f5ae5da9836]
 7: (()+0x5e863) [0x7f5ae5da9863]
 8: (()+0x5eaa2) [0x7f5ae5da9aa2]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0xbc2d78]
 10: /usr/bin/ceph-osd() [0x8b9b05]
 11: (ReplicatedPG::remove_repop(ReplicatedPG::RepGather*)+0xec) [0x84516c]
 12: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0x912) [0x857082]
 13: (ReplicatedPG::repop_all_applied(ReplicatedPG::RepGather*)+0x16d) [0x857bbd]
 14: (Context::complete(int)+0x9) [0x6caf09]
 15: (ReplicatedBackend::op_applied(ReplicatedBackend::InProgressOp*)+0x1ec) [0xa081dc]
 16: (Context::complete(int)+0x9) [0x6caf09]
 17: (ReplicatedPG::BlessedContext::finish(int)+0x94) [0x8af634]
 18: (Context::complete(int)+0x9) [0x6caf09]
 19: (void finish_contexts<Context>(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0x94) [0x70b764]
 20: (C_ContextsBase<Context, Context>::complete(int)+0x9) [0x6cb759]
 21: (Finisher::finisher_thread_entry()+0x158) [0xaef528]
 22: (()+0x8182) [0x7f5ae6ff8182]
 23: (clone()+0x6d) [0x7f5ae5563fbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
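
As the NOTE suggests, the raw addresses in the trace need to be resolved against the binary. A minimal sketch of how that could be done (assuming a ceph-osd binary with debug information for this exact build is available; the example address is frame 11 from the trace above):

    # full disassembly with interleaved source, as the NOTE suggests
    objdump -rdS /usr/bin/ceph-osd > ceph-osd.asm

    # or resolve a single frame address (-C demangles, -f prints the enclosing function);
    # file/line output requires debug symbols matching this exact build
    addr2line -C -f -e /usr/bin/ceph-osd 0x84516c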

This is what a grep in the log directory of one of the nodes shows:

root@node03:/var/log/ceph# grep "0 ceph version 0.94.1-98-g7df3eb5" /var/log/ceph/ceph-osd.*.log
/var/log/ceph/ceph-osd.36.log:2015-05-18 22:02:30.686385 7f3a5b9c8900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 27837
/var/log/ceph/ceph-osd.37.log:2015-05-18 22:01:08.092723 7fd2caa60900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 19423
/var/log/ceph/ceph-osd.37.log:2015-05-18 22:01:53.821736 7ffe5f7da900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 25097
/var/log/ceph/ceph-osd.37.log:2015-05-18 22:05:26.804857 7ffb7ffda900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 6192
/var/log/ceph/ceph-osd.37.log:2015-05-18 22:07:11.497568 7f359c827900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 8955
/var/log/ceph/ceph-osd.37.log:2015-05-18 22:07:58.161750 7f0ab170a900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 14842
/var/log/ceph/ceph-osd.38.log:2015-05-18 22:01:09.698353 7fc6a0ee6900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 19772
/var/log/ceph/ceph-osd.38.log:2015-05-18 22:01:45.043304 7f4715b1a900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 24644
/var/log/ceph/ceph-osd.38.log:2015-05-18 22:04:29.600179 7ff2b6910900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 2736
/var/log/ceph/ceph-osd.39.log:2015-05-18 22:01:08.464218 7f5803409900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 19545
/var/log/ceph/ceph-osd.40.log:2015-05-18 22:07:15.776315 7fab24688900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 9711
/var/log/ceph/ceph-osd.41.log:2015-05-18 22:07:11.769112 7f7de74fc900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 9104
/var/log/ceph/ceph-osd.41.log:2015-05-18 22:07:57.753562 7f27fe298900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 14667
/var/log/ceph/ceph-osd.41.log:2015-05-18 22:09:13.061530 7ffe5f95f900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 20405
/var/log/ceph/ceph-osd.42.log:2015-05-18 22:01:08.664508 7f6e2d431900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 19585
/var/log/ceph/ceph-osd.42.log:2015-05-18 22:01:53.577065 7f982f80c900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 25021
/var/log/ceph/ceph-osd.42.log:2015-05-18 22:03:01.646920 7fa1c168f900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 29226
/var/log/ceph/ceph-osd.42.log:2015-05-18 22:03:56.978982 7f30808fc900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 952
/var/log/ceph/ceph-osd.42.log:2015-05-18 22:07:12.434963 7ff4da199900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 9483
/var/log/ceph/ceph-osd.42.log:2015-05-18 22:07:58.154927 7fa1236ef900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 14840
/var/log/ceph/ceph-osd.42.log:2015-05-18 22:08:41.457952 7f1865fdb900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 18398
/var/log/ceph/ceph-osd.42.log:2015-05-18 22:10:46.099379 7f3389365900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 24311
/var/log/ceph/ceph-osd.43.log:2015-05-18 22:01:05.419036 7fe9ae8f0900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 19109
/var/log/ceph/ceph-osd.43.log:2015-05-18 22:07:56.379190 7f2c4f271900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 14560
/var/log/ceph/ceph-osd.44.log:2015-05-18 22:01:09.442229 7f52bd0fa900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 19714
/var/log/ceph/ceph-osd.44.log:2015-05-18 22:03:01.513365 7f719a595900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 29208
/var/log/ceph/ceph-osd.44.log:2015-05-18 22:03:35.168178 7f9fba7b5900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 31863
/var/log/ceph/ceph-osd.44.log:2015-05-18 22:07:12.407905 7f574d7cb900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 9468
/var/log/ceph/ceph-osd.44.log:2015-05-18 22:08:12.754007 7fcf6c263900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 15417
/var/log/ceph/ceph-osd.44.log:2015-05-18 22:11:11.201391 7f84e87a8900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 25184
/var/log/ceph/ceph-osd.45.log:2015-05-18 22:01:06.757867 7f06c3202900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 19234
/var/log/ceph/ceph-osd.45.log:2015-05-18 22:04:29.559758 7f0d38b8c900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 2718
/var/log/ceph/ceph-osd.45.log:2015-05-18 22:07:12.450421 7f6e2b66a900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 9500
/var/log/ceph/ceph-osd.46.log:2015-05-18 22:01:07.322495 7fa3b5d51900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 19322
/var/log/ceph/ceph-osd.46.log:2015-05-18 22:03:35.197774 7fb316d5f900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 31886
/var/log/ceph/ceph-osd.46.log:2015-05-18 22:07:12.424735 7fb512ae3900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 9476
/var/log/ceph/ceph-osd.47.log:2015-05-18 22:03:06.317413 7effcdbc7900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 29419
/var/log/ceph/ceph-osd.47.log:2015-05-18 22:04:22.347075 7f5e6d0e9900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 2200
/var/log/ceph/ceph-osd.47.log:2015-05-18 22:07:11.851807 7f5b14e5e900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 9126

The picture is the same on the other three machines. Not all OSDs crashed the same number of times, and a few did not restart at all.
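
To get a rough per-OSD count, something along these lines should work (this assumes every crash wrote the "Caught signal (Aborted)" line shown in the trace above, and every restart logged a fresh version banner):

    # crashes per OSD log, sorted by count
    grep -c "Caught signal (Aborted)" /var/log/ceph/ceph-osd.*.log | sort -t: -k2 -rn

    # restarts per OSD log (startup banners), sorted by count
    grep -c "ceph version 0.94.1-98-g7df3eb5" /var/log/ceph/ceph-osd.*.log | sort -t: -k2 -rn
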
We found a pastebin-like entry at http://budgetinsuance.com/LwQug8QA with a very similar trace for 0.94.1, but we could not figure out whether it is referenced in an existing bug ticket. Hence we are creating this one.

After about 15 minutes the cluster seems to have calmed down again; however, we are very nervous about this because of the outage we had last week.

Possibly related: about 30-40 minutes before the crashes we created, and shortly afterwards deleted, a snapshot on a data pool for testing. Nothing else was done in the meantime apart from regular application operations (VM volumes and RGW access).
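
To illustrate the kind of operation involved, here is a sketch with placeholder names only; it assumes a pool-level snapshot, while an image-level snapshot would use "rbd snap create" / "rbd snap rm" instead:

    # hypothetical reconstruction of the snapshot test -- pool and snapshot names are placeholders
    ceph osd pool mksnap <data-pool> test-snap
    ceph osd pool rmsnap <data-pool> test-snap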

If we can provide any other information to help diagnose this, please let us know.


Related issues (1 total: 0 open, 1 closed)

Copied to Ceph - Backport #11908: Almost all OSDs in the cluster crashing at the same time, repeatedly (Resolved, Abhishek Lekshmanan, 05/18/2015)