Project

General

Profile

Bug #11677

Almost all OSDs in the cluster crashing at the same time, repeatedly

Added by Daniel Schneller over 5 years ago. Updated over 5 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
OSD
Target version:
% Done:

0%

Source:
other
Tags:
Backport:
hammer
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

Tonight our Hammer cluster suffered a series of OSD crashes on all cluster nodes.
We are running Hammer (0.94.1-98-g7df3eb5, built because we had a major problem a week ago which we suspected to be related to bugs we found in the tracker, that were not yet in 0.94.1).

Around 22.00 users started reporting a down web application, and at the same time we found lots of OSD crashes and restarts.
The stack trace in the log looks like this on all of them:

   -10> 2015-05-18 22:02:21.926872 7f5ac7722700  1 -- 10.102.4.14:6833/14731 <== osd.6 10.102.5.11:0/32301 219298 ==== osd_ping(ping e34350 stamp 2015-05-18 22:02:21.920146) v2 ==== 47+0+0 (1114598428 0 0) 0x5062e200 con 0x4d8ff4a0
    -9> 2015-05-18 22:02:21.926906 7f5ac7722700  1 -- 10.102.4.14:6833/14731 --> 10.102.5.11:0/32301 -- osd_ping(ping_reply e34348 stamp 2015-05-18 22:02:21.920146) v2 -- ?+0 0x4f018400 con 0x4d8ff4a0
    -8> 2015-05-18 22:02:21.939509 7f5ab59ec700  1 -- 10.102.4.14:6832/14731 <== client.139501649 10.102.4.15:0/1037325 48 ==== osd_op(client.139501649.0:755810 rbd_data.3f6f4af6ff14e92.0000000000000044 [stat,set-alloc-hint object_si
ze 8388608 write_size 8388608,write 905216~4096] 19.e650fba6 ack+ondisk+write+known_if_redirected e34349) v5 ==== 276+0+4096 (1982592369 0 3054609927) 0x21935900 con 0x20faf9c0
    -7> 2015-05-18 22:02:21.939545 7f5ab59ec700  5 -- op tracker -- seq: 11397738, time: 2015-05-18 22:02:21.939401, event: header_read, op: osd_op(client.139501649.0:755810 rbd_data.3f6f4af6ff14e92.0000000000000044 [stat,set-alloc-h
int object_size 8388608 write_size 8388608,write 905216~4096] 19.e650fba6 ack+ondisk+write+known_if_redirected e34349)
    -6> 2015-05-18 22:02:21.939558 7f5ab59ec700  5 -- op tracker -- seq: 11397738, time: 2015-05-18 22:02:21.939405, event: throttled, op: osd_op(client.139501649.0:755810 rbd_data.3f6f4af6ff14e92.0000000000000044 [stat,set-alloc-hin
t object_size 8388608 write_size 8388608,write 905216~4096] 19.e650fba6 ack+ondisk+write+known_if_redirected e34349)
    -5> 2015-05-18 22:02:21.939566 7f5ab59ec700  5 -- op tracker -- seq: 11397738, time: 2015-05-18 22:02:21.939497, event: all_read, op: osd_op(client.139501649.0:755810 rbd_data.3f6f4af6ff14e92.0000000000000044 [stat,set-alloc-hint object_size 8388608 write_size 8388608,write 905216~4096] 19.e650fba6 ack+ondisk+write+known_if_redirected e34349)
    -4> 2015-05-18 22:02:21.939575 7f5ab59ec700  5 -- op tracker -- seq: 11397738, time: 0.000000, event: dispatched, op: osd_op(client.139501649.0:755810 rbd_data.3f6f4af6ff14e92.0000000000000044 [stat,set-alloc-hint object_size 8388608 write_size 8388608,write 905216~4096] 19.e650fba6 ack+ondisk+write+known_if_redirected e34349)
    -3> 2015-05-18 22:02:21.939594 7f5ab59ec700 10 monclient: renew_subs
    -2> 2015-05-18 22:02:21.939603 7f5ab59ec700 10 monclient: _send_mon_message to mon.node01 at 10.102.4.11:6789/0
    -1> 2015-05-18 22:02:21.939608 7f5ab59ec700  1 -- 10.102.4.14:6832/14731 --> 10.102.4.11:6789/0 -- mon_subscribe({monmap=4+,osd_pg_creates=0,osdmap=34349}) v2 -- ?+0 0x505fc000 con 0x3d5209a0
     0> 2015-05-18 22:02:21.961378 7f5ad8cb5700 -1 *** Caught signal (Aborted) **
 in thread 7f5ad8cb5700

 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682)
 1: /usr/bin/ceph-osd() [0xacb3ba]
 2: (()+0x10340) [0x7f5ae7000340]
 3: (gsignal()+0x39) [0x7f5ae549fbb9]
 4: (abort()+0x148) [0x7f5ae54a2fc8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f5ae5dab6b5]
 6: (()+0x5e836) [0x7f5ae5da9836]
 7: (()+0x5e863) [0x7f5ae5da9863]
 8: (()+0x5eaa2) [0x7f5ae5da9aa2]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0xbc2d78]
 10: /usr/bin/ceph-osd() [0x8b9b05]
 11: (ReplicatedPG::remove_repop(ReplicatedPG::RepGather*)+0xec) [0x84516c]
 12: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0x912) [0x857082]
 13: (ReplicatedPG::repop_all_applied(ReplicatedPG::RepGather*)+0x16d) [0x857bbd]
 14: (Context::complete(int)+0x9) [0x6caf09]
 15: (ReplicatedBackend::op_applied(ReplicatedBackend::InProgressOp*)+0x1ec) [0xa081dc]
 16: (Context::complete(int)+0x9) [0x6caf09]
 17: (ReplicatedPG::BlessedContext::finish(int)+0x94) [0x8af634]
 18: (Context::complete(int)+0x9) [0x6caf09]
 19: (void finish_contexts<Context>(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0x94) [0x70b764]
 20: (C_ContextsBase<Context, Context>::complete(int)+0x9) [0x6cb759]
 21: (Finisher::finisher_thread_entry()+0x158) [0xaef528]
 22: (()+0x8182) [0x7f5ae6ff8182]
 23: (clone()+0x6d) [0x7f5ae5563fbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

This is what a grep in the log directory of one of the nodes shows:

[B|root@node03]  /var/log/ceph ?  grep "0 ceph version 0.94.1-98-g7df3eb5" /var/log/ceph/ceph-osd.*.log
/var/log/ceph/ceph-osd.36.log:2015-05-18 22:02:30.686385 7f3a5b9c8900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 27837
/var/log/ceph/ceph-osd.37.log:2015-05-18 22:01:08.092723 7fd2caa60900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 19423
/var/log/ceph/ceph-osd.37.log:2015-05-18 22:01:53.821736 7ffe5f7da900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 25097
/var/log/ceph/ceph-osd.37.log:2015-05-18 22:05:26.804857 7ffb7ffda900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 6192
/var/log/ceph/ceph-osd.37.log:2015-05-18 22:07:11.497568 7f359c827900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 8955
/var/log/ceph/ceph-osd.37.log:2015-05-18 22:07:58.161750 7f0ab170a900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 14842
/var/log/ceph/ceph-osd.38.log:2015-05-18 22:01:09.698353 7fc6a0ee6900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 19772
/var/log/ceph/ceph-osd.38.log:2015-05-18 22:01:45.043304 7f4715b1a900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 24644
/var/log/ceph/ceph-osd.38.log:2015-05-18 22:04:29.600179 7ff2b6910900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 2736
/var/log/ceph/ceph-osd.39.log:2015-05-18 22:01:08.464218 7f5803409900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 19545
/var/log/ceph/ceph-osd.40.log:2015-05-18 22:07:15.776315 7fab24688900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 9711
/var/log/ceph/ceph-osd.41.log:2015-05-18 22:07:11.769112 7f7de74fc900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 9104
/var/log/ceph/ceph-osd.41.log:2015-05-18 22:07:57.753562 7f27fe298900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 14667
/var/log/ceph/ceph-osd.41.log:2015-05-18 22:09:13.061530 7ffe5f95f900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 20405
/var/log/ceph/ceph-osd.42.log:2015-05-18 22:01:08.664508 7f6e2d431900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 19585
/var/log/ceph/ceph-osd.42.log:2015-05-18 22:01:53.577065 7f982f80c900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 25021
/var/log/ceph/ceph-osd.42.log:2015-05-18 22:03:01.646920 7fa1c168f900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 29226
/var/log/ceph/ceph-osd.42.log:2015-05-18 22:03:56.978982 7f30808fc900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 952
/var/log/ceph/ceph-osd.42.log:2015-05-18 22:07:12.434963 7ff4da199900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 9483
/var/log/ceph/ceph-osd.42.log:2015-05-18 22:07:58.154927 7fa1236ef900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 14840
/var/log/ceph/ceph-osd.42.log:2015-05-18 22:08:41.457952 7f1865fdb900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 18398
/var/log/ceph/ceph-osd.42.log:2015-05-18 22:10:46.099379 7f3389365900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 24311
/var/log/ceph/ceph-osd.43.log:2015-05-18 22:01:05.419036 7fe9ae8f0900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 19109
/var/log/ceph/ceph-osd.43.log:2015-05-18 22:07:56.379190 7f2c4f271900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 14560
/var/log/ceph/ceph-osd.44.log:2015-05-18 22:01:09.442229 7f52bd0fa900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 19714
/var/log/ceph/ceph-osd.44.log:2015-05-18 22:03:01.513365 7f719a595900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 29208
/var/log/ceph/ceph-osd.44.log:2015-05-18 22:03:35.168178 7f9fba7b5900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 31863
/var/log/ceph/ceph-osd.44.log:2015-05-18 22:07:12.407905 7f574d7cb900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 9468
/var/log/ceph/ceph-osd.44.log:2015-05-18 22:08:12.754007 7fcf6c263900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 15417
/var/log/ceph/ceph-osd.44.log:2015-05-18 22:11:11.201391 7f84e87a8900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 25184
/var/log/ceph/ceph-osd.45.log:2015-05-18 22:01:06.757867 7f06c3202900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 19234
/var/log/ceph/ceph-osd.45.log:2015-05-18 22:04:29.559758 7f0d38b8c900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 2718
/var/log/ceph/ceph-osd.45.log:2015-05-18 22:07:12.450421 7f6e2b66a900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 9500
/var/log/ceph/ceph-osd.46.log:2015-05-18 22:01:07.322495 7fa3b5d51900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 19322
/var/log/ceph/ceph-osd.46.log:2015-05-18 22:03:35.197774 7fb316d5f900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 31886
/var/log/ceph/ceph-osd.46.log:2015-05-18 22:07:12.424735 7fb512ae3900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 9476
/var/log/ceph/ceph-osd.47.log:2015-05-18 22:03:06.317413 7effcdbc7900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 29419
/var/log/ceph/ceph-osd.47.log:2015-05-18 22:04:22.347075 7f5e6d0e9900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 2200
/var/log/ceph/ceph-osd.47.log:2015-05-18 22:07:11.851807 7f5b14e5e900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 9126

The same picture on the other 3 machines. Not all OSDs crashed the same number of times, and a very few did not restart at all.
We found a pastebin-like entry here http://budgetinsuance.com/LwQug8QA with a very similar trace for 0.94.1, but we could not figure out if it is referenced in some existing bug ticket. Hence we are creating this one.

After abound 15 minutes the cluster seems to have calmed down again, however we are very nervous about this, because of the outage we had last week.

Maybe related: About 30-40 minutes before the crashes we created and shortly afterwards deleted a snapshot on a data pool for testing. Nothing else was done in the meantime apart from regular application operations (VM volumes and RGW access).

If we can provide any other information to diagnose, please let us know.


Related issues

Copied to Ceph - Backport #11908: Almost all OSDs in the cluster crashing at the same time, repeatedly Resolved 05/18/2015

Associated revisions

Revision 5c2b7957 (diff)
Added by Samuel Just over 5 years ago

ReplicatedPG::finish_ctx: take excl lock if operation is rw

Fixes: #11677
Signed-off-by: Samuel Just <>

Revision c7b6a637 (diff)
Added by Samuel Just over 5 years ago

ReplicatedPG::finish_ctx: take excl lock if operation is rw

Fixes: #11677
Signed-off-by: Samuel Just <>
(cherry picked from commit 5c2b795724423ed484ab451de855ddcfc085342b)

History

#1 Updated by Daniel Schneller over 5 years ago

About every hour the crashes repeat. The calm that we saw after the initial round of crashes did not last long.
Here is a partial overview, showing the timestamps:

node03: /var/log/ceph/ceph-osd.37.log 2015-05-18 22:54:34.014586 7f462f998900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 29874
node02: /var/log/ceph/ceph-osd.23.log 2015-05-18 22:55:42.184608 7f0eef476900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 12244
node04: /var/log/ceph/ceph-osd.26.log 2015-05-18 22:56:32.768119 7f6166ebb900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 19871
node01: /var/log/ceph/ceph-osd.2.log 2015-05-18 22:57:05.486453 7ff108a78900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 29669
node04: /var/log/ceph/ceph-osd.32.log 2015-05-18 22:57:05.692554 7f3907114900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 21923
node02: /var/log/ceph/ceph-osd.14.log 2015-05-18 22:57:42.204904 7f0a603ac900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 17457
node02: /var/log/ceph/ceph-osd.12.log 2015-05-18 22:57:42.362774 7f6e172d6900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 17508
node03: /var/log/ceph/ceph-osd.47.log 2015-05-18 22:57:42.382949 7f6a04ee9900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 9335
node01: /var/log/ceph/ceph-osd.4.log 2015-05-18 22:57:53.055510 7fbc9e94c900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 31403
node01: /var/log/ceph/ceph-osd.1.log 2015-05-18 22:57:53.562636 7f71f3992900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 31668
node01: /var/log/ceph/ceph-osd.11.log 2015-05-18 22:57:53.642671 7f3be18ca900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 31708
node04: /var/log/ceph/ceph-osd.35.log 2015-05-18 22:57:54.095995 7f6cb0062900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 23378
node04: /var/log/ceph/ceph-osd.31.log 2015-05-18 22:57:54.692564 7fe5b6218900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 23646
node04: /var/log/ceph/ceph-osd.29.log 2015-05-18 22:57:54.848554 7fc4f47cf900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 23791
node04: /var/log/ceph/ceph-osd.27.log 2015-05-18 22:57:54.886953 7f6402b11900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 23847
node04: /var/log/ceph/ceph-osd.30.log 2015-05-18 22:57:54.919093 7f595481d900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 23865
node04: /var/log/ceph/ceph-osd.28.log 2015-05-18 22:57:55.032077 7f39cb3a8900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 23896
node04: /var/log/ceph/ceph-osd.33.log 2015-05-18 22:57:55.042579 7fac10e2b900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 23899
node04: /var/log/ceph/ceph-osd.34.log 2015-05-18 22:58:50.306221 7f5a326f0900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 29372
node02: /var/log/ceph/ceph-osd.16.log 2015-05-18 22:58:52.968074 7f5f51ca1900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 22405
node01: /var/log/ceph/ceph-osd.7.log 2015-05-18 22:58:58.922759 7f790ed3c900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 4945
node01: /var/log/ceph/ceph-osd.9.log 2015-05-18 22:58:58.967777 7f158222e900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 4980
node01: /var/log/ceph/ceph-osd.5.log 2015-05-18 22:58:59.036874 7f7afe890900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 5012
node02: /var/log/ceph/ceph-osd.19.log 2015-05-18 22:59:28.084251 7fc472e27900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 24342
node04: /var/log/ceph/ceph-osd.24.log 2015-05-18 22:59:28.798162 7ff1dd9a4900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 30925
node03: /var/log/ceph/ceph-osd.43.log 2015-05-19 00:03:37.476412 7faafde0f900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 806
node02: /var/log/ceph/ceph-osd.21.log 2015-05-19 00:03:42.274620 7f07bf5d9900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 11439
node01: /var/log/ceph/ceph-osd.6.log 2015-05-19 00:10:25.679138 7f0a83f19900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 32451
node03: /var/log/ceph/ceph-osd.45.log 2015-05-19 01:03:35.909488 7f71f7355900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 19553
node02: /var/log/ceph/ceph-osd.17.log 2015-05-19 01:03:36.433078 7f1113bbd900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 31597
node01: /var/log/ceph/ceph-osd.8.log 2015-05-19 01:04:00.859449 7f16fc64b900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 16781
node01: /var/log/ceph/ceph-osd.0.log 2015-05-19 01:04:01.252603 7f40df72c900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 16869
node03: /var/log/ceph/ceph-osd.39.log 2015-05-19 01:04:04.867572 7ffa5ae8b900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 20643
node03: /var/log/ceph/ceph-osd.38.log 2015-05-19 01:04:07.000713 7f1cc5180900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 20785
node03: /var/log/ceph/ceph-osd.36.log 2015-05-19 01:04:07.131433 7fabf3cbf900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 20827
node01: /var/log/ceph/ceph-osd.3.log 2015-05-19 01:04:08.447527 7f89c9f71900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 17079
node03: /var/log/ceph/ceph-osd.44.log 2015-05-19 01:04:09.018316 7f5801f8a900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 21061
node03: /var/log/ceph/ceph-osd.40.log 2015-05-19 01:04:09.102133 7f90abfcc900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 21079
node04: /var/log/ceph/ceph-osd.25.log 2015-05-19 01:04:36.392399 7fee754a5900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 5996
node03: /var/log/ceph/ceph-osd.42.log 2015-05-19 01:04:36.492702 7faecb106900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 24298
node02: /var/log/ceph/ceph-osd.20.log 2015-05-19 01:04:37.061725 7fd2d7927900  0 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682), process ceph-osd, pid 2038

#2 Updated by Daniel Schneller over 5 years ago

Up'ed debug_osd to 20 for OSDs on one node in anticipation of another crash.
3 on node01 crashed again about 2.5h later again.

Excerpt from ceph-osd.5.log:

2015-05-19 03:38:35.432717 7f7ace55d700 10 osd.5 34697 dequeue_op 0x38112700 finish
2015-05-19 03:38:35.432865 7f7acbd58700 10 osd.5 34697 dequeue_op 0x535fc400 prio 196 cost 0 latency 0.054271 osd_repop_reply(client.141679501.0:19791 81.1f9 ondisk, result = 0) v1 pg pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 3467
0/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 luod=34688'23684 lua=34688'23684 crt=34688'23683 lcod 34688'23683 mlcod 34688'23683 active+clean]
2015-05-19 03:38:35.433129 7f7acbd58700 10 osd.5 pg_epoch: 34697 pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 luod=34688'23684 lua=34688'23684 crt=34688'23683 lcod 34688'23683 ml
cod 34688'23683 active+clean] handle_message: 0x535fc400
2015-05-19 03:38:35.433518 7f7acbd58700  7 osd.5 pg_epoch: 34697 pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 luod=34688'23684 lua=34688'23684 crt=34688'23683 lcod 34688'23683 ml
cod 34688'23683 active+clean] sub_op_modify_reply: tid 656357 op  ack_type 4 from 17
2015-05-19 03:38:35.433862 7f7acbd58700 10 osd.5 34697 dequeue_op 0x535fc400 finish
2015-05-19 03:38:35.433887 7f7aeca5f700 10 osd.5 pg_epoch: 34697 pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 luod=34688'23684 lua=34688'23684 crt=34688'23683 lcod 34688'23683 ml
cod 34688'23683 active+clean] op_commit: 656357
2015-05-19 03:38:35.434284 7f7ace55d700 10 osd.5 34697 dequeue_op 0x5336e900 prio 196 cost 0 latency 0.020147 osd_repop_reply(client.141679501.0:19791 81.1f9 ondisk, result = 0) v1 pg pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 3467
0/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 luod=34688'23684 lua=34688'23684 crt=34688'23683 lcod 34688'23683 mlcod 34688'23683 active+clean]
2015-05-19 03:38:35.434431 7f7ace55d700 10 osd.5 pg_epoch: 34697 pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 luod=34688'23684 lua=34688'23684 crt=34688'23683 lcod 34688'23683 ml
cod 34688'23683 active+clean] handle_message: 0x5336e900
2015-05-19 03:38:35.434481 7f7ace55d700  7 osd.5 pg_epoch: 34697 pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 luod=34688'23684 lua=34688'23684 crt=34688'23683 lcod 34688'23683 ml
cod 34688'23683 active+clean] sub_op_modify_reply: tid 656357 op  ack_type 4 from 41
2015-05-19 03:38:35.434927 7f7ace55d700 10 osd.5 pg_epoch: 34697 pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 luod=34688'23684 lua=34688'23684 crt=34688'23683 lcod 34688'23683 ml
cod 34688'23683 active+clean] repop_all_committed: repop tid 656357 all committed 
2015-05-19 03:38:35.435083 7f7ace55d700 10 osd.5 pg_epoch: 34697 pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 lua=34688'23684 crt=34688'23683 lcod 34688'23684 mlcod 34688'23683 a
ctive+clean] eval_repop repgather(0x52f63900 34697'23687 rep_tid=656357 committed?=1 applied?=0 lock=2 op=osd_op(client.141679501.0:19791 default.139790885.16459__shadow_.B5eeIJm5n8dpsjn-4q5gXmHr4mIcVS1_5 [call refcount.put] 81.ce36d9f9 ondisk+write+known_if_redirected e34697) 
v5) wants=d
2015-05-19 03:38:35.435397 7f7ace55d700 15 osd.5 pg_epoch: 34697 pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 lua=34688'23684 crt=34688'23683 lcod 34688'23684 mlcod 34688'23683 active+clean] log_op_stats osd_op(client.141679501.0:19791 default.139790885.16459__shadow_.B5eeIJm5n8dpsjn-4q5gXmHr4mIcVS1_5 [call refcount.put] 81.ce36d9f9 ondisk+write+known_if_redirected e34697) v5 inb 0 outb 0 rlat 0.000000 lat 0.078165
2015-05-19 03:38:35.435602 7f7ace55d700 15 osd.5 pg_epoch: 34697 pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 lua=34688'23684 crt=34688'23683 lcod 34688'23684 mlcod 34688'23683 active+clean] publish_stats_to_osd 34697:49018
2015-05-19 03:38:35.435681 7f7ace55d700 10 osd.5 pg_epoch: 34697 pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 lua=34688'23684 crt=34688'23683 lcod 34688'23684 mlcod 34688'23683 active+clean]  sending commit on repgather(0x52f63900 34697'23687 rep_tid=656357 committed?=1 applied?=0 lock=2 op=osd_op(client.141679501.0:19791 default.139790885.16459__shadow_.B5eeIJm5n8dpsjn-4q5gXmHr4mIcVS1_5 [call refcount.put] 81.ce36d9f9 ondisk+write+known_if_redirected e34697) v5) 0x528f7c80
2015-05-19 03:38:35.436156 7f7ace55d700 10 osd.5 34697 dequeue_op 0x5336e900 finish
2015-05-19 03:38:35.445244 7f7aed260700 10 osd.5 pg_epoch: 34697 pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 lua=34688'23684 crt=34688'23683 lcod 34688'23684 mlcod 34688'23683 active+clean] op_applied: 656357
2015-05-19 03:38:35.445291 7f7aed260700 10 osd.5 pg_epoch: 34697 pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 lua=34688'23684 crt=34688'23683 lcod 34688'23684 mlcod 34688'23683 active+clean] op_applied version 34697'23687
2015-05-19 03:38:35.445305 7f7aed260700 10 osd.5 pg_epoch: 34697 pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 crt=34688'23683 lcod 34688'23684 mlcod 34688'23683 active+clean] repop_all_applied: repop tid 656357 all applied 
2015-05-19 03:38:35.445317 7f7aed260700 10 osd.5 pg_epoch: 34697 pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 crt=34688'23683 lcod 34688'23684 mlcod 34688'23683 active+clean] eval_repop repgather(0x52f63900 34697'23687 rep_tid=656357 committed?=1 applied?=1 lock=2 op=osd_op(client.141679501.0:19791 default.139790885.16459__shadow_.B5eeIJm5n8dpsjn-4q5gXmHr4mIcVS1_5 [call refcount.put] 81.ce36d9f9 ondisk+write+known_if_redirected e34697) v5) wants=d
2015-05-19 03:38:35.445336 7f7aed260700 15 osd.5 pg_epoch: 34697 pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 crt=34688'23683 lcod 34688'23684 mlcod 34688'23683 active+clean] publish_stats_to_osd 34697: no change since
2015-05-19 03:38:35.445346 7f7aed260700 15 osd.5 pg_epoch: 34697 pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 crt=34688'23683 lcod 34688'23684 mlcod 34688'23683 active+clean] do_osd_op_effects client.141679501 con 0x13fcd180
2015-05-19 03:38:35.445356 7f7aed260700 10 osd.5 pg_epoch: 34697 pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 crt=34688'23683 lcod 34688'23684 mlcod 34688'23684 active+clean]  removing repgather(0x52f63900 34697'23687 rep_tid=656357 committed?=1 applied?=1 lock=2 op=osd_op(client.141679501.0:19791 default.139790885.16459__shadow_.B5eeIJm5n8dpsjn-4q5gXmHr4mIcVS1_5 [call refcount.put] 81.ce36d9f9 ondisk+write+known_if_redirected e34697) v5)
2015-05-19 03:38:35.445370 7f7aed260700 20 osd.5 pg_epoch: 34697 pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 crt=34688'23683 lcod 34688'23684 mlcod 34688'23684 active+clean]    q front is repgather(0x52f63900 34697'23687 rep_tid=656357 committed?=1 applied?=1 lock=2 op=osd_op(client.141679501.0:19791 default.139790885.16459__shadow_.B5eeIJm5n8dpsjn-4q5gXmHr4mIcVS1_5 [call refcount.put] 81.ce36d9f9 ondisk+write+known_if_redirected e34697) v5)
2015-05-19 03:38:35.445381 7f7aed260700 20 osd.5 pg_epoch: 34697 pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 crt=34688'23683 lcod 34688'23684 mlcod 34688'23684 active+clean] remove_repop repgather(0x52f63900 34697'23687 rep_tid=656357 committed?=1 applied?=1 lock=2 op=osd_op(client.141679501.0:19791 default.139790885.16459__shadow_.B5eeIJm5n8dpsjn-4q5gXmHr4mIcVS1_5 [call refcount.put] 81.ce36d9f9 ondisk+write+known_if_redirected e34697) v5)
2015-05-19 03:38:35.445393 7f7aed260700 20 osd.5 pg_epoch: 34697 pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 crt=34688'23683 lcod 34688'23684 mlcod 34688'23684 active+clean]  obc obc(ce36d9f9/default.139790885.16459__shadow_.B5eeIJm5n8dpsjn-4q5gXmHr4mIcVS1_5/head//81(dne) rwstate(excl n=1 w=0))
2015-05-19 03:38:35.445401 7f7aed260700 20 osd.5 pg_epoch: 34697 pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 crt=34688'23683 lcod 34688'23684 mlcod 34688'23684 active+clean]  clone_obc obc(ce36d9f9/default.139790885.16459__shadow_.B5eeIJm5n8dpsjn-4q5gXmHr4mIcVS1_5/c8//81 rwstate(write n=1 w=0))
2015-05-19 03:38:35.445409 7f7aed260700 20 osd.5 pg_epoch: 34697 pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 crt=34688'23683 lcod 34688'23684 mlcod 34688'23684 active+clean]  snapset_obc obc(ce36d9f9/default.139790885.16459__shadow_.B5eeIJm5n8dpsjn-4q5gXmHr4mIcVS1_5/snapdir//81 rwstate(write n=1 w=0))
2015-05-19 03:38:35.449476 7f7aed260700 -1 osd/osd_types.h: In function 'void ObjectContext::RWState::put_excl(std::list<std::tr1::shared_ptr<OpRequest> >*)' thread 7f7aed260700 time 2015-05-19 03:38:35.445434
osd/osd_types.h: 3167: FAILED assert(state == RWEXCL)

 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xbc2b8b]
 2: /usr/bin/ceph-osd() [0x8b9b05]
 3: (ReplicatedPG::remove_repop(ReplicatedPG::RepGather*)+0xec) [0x84516c]
 4: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0x912) [0x857082]
 5: (ReplicatedPG::repop_all_applied(ReplicatedPG::RepGather*)+0x16d) [0x857bbd]
 6: (Context::complete(int)+0x9) [0x6caf09]
 7: (ReplicatedBackend::op_applied(ReplicatedBackend::InProgressOp*)+0x1ec) [0xa081dc]
 8: (Context::complete(int)+0x9) [0x6caf09]
 9: (ReplicatedPG::BlessedContext::finish(int)+0x94) [0x8af634]
 10: (Context::complete(int)+0x9) [0x6caf09]
 11: (void finish_contexts<Context>(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0x94) [0x70b764]
 12: (C_ContextsBase<Context, Context>::complete(int)+0x9) [0x6cb759]
 13: (Finisher::finisher_thread_entry()+0x158) [0xaef528]
 14: (()+0x8182) [0x7f7afd540182]
 15: (clone()+0x6d) [0x7f7afbaabfbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---

Full logs of all 3 here: https://www.dropbox.com/sh/u3cwjr2gblahdwg/AADBm694EkgYGC25ApGs7uFDa?dl=0

There were further crashes, up to 9 OSDs were down at the same time. The logs of the others were on default, though. I needed, those can also be provided.

#3 Updated by Daniel Schneller over 5 years ago

Several requests have been hanging since what appears to be the time of the crash:

2015-05-19 07:55:43.478450 osd.3 [WRN] 26 slow requests, 1 included below; oldest blocked for > 15403.099777 secs
2015-05-19 07:55:43.478459 osd.3 [WRN] slow request 15360.699902 seconds old, received at 2015-05-19 03:39:42.778492: osd_op(client.139494234.0:1480247 rbd_header.3f700ee75034f6d [watch ping cookie 139648810647552 gen 24] 21.815a72d4 ondisk+write+known_if_redirected e34723) currently no flag points reached
2015-05-19 07:55:44.479241 osd.3 [WRN] 26 slow requests, 1 included below; oldest blocked for > 15404.100567 secs
2015-05-19 07:55:44.479249 osd.3 [WRN] slow request 15360.424581 seconds old, received at 2015-05-19 03:39:44.054603: osd_repop(client.141647897.0:19923 102.18 990e66d8/gc.0/head//102 v 34725'75572) currently no flag points reached
2015-05-19 07:55:48.481854 osd.3 [WRN] 26 slow requests, 2 included below; oldest blocked for > 15408.103152 secs
2015-05-19 07:55:48.481873 osd.3 [WRN] slow request 15360.703059 seconds old, received at 2015-05-19 03:39:47.778711: osd_op(client.139494234.0:1480249 rbd_header.3f700ee75034f6d [watch ping cookie 139648810647552 gen 24] 21.815a72d4 ondisk+write+known_if_redirected e34725) currently no flag points reached
2015-05-19 07:55:48.481880 osd.3 [WRN] slow request 15360.627638 seconds old, received at 2015-05-19 03:39:47.854131: osd_repop(client.139501823.0:705400 19.2ab 17040eab/rbd_data.3f6f5a23563f060.0000000000000083/head//19 v 34725'266136) currently no flag points reached
[B|daniel.schneller@node02]  ~ ?  sudo ceph health detail
HEALTH_WARN 158 requests are blocked > 32 sec; 9 osds have slow requests; too many PGs per OSD (940 > max 300); noscrub,nodeep-scrub flag(s) set
158 ops are blocked > 16777.2 sec
9 ops are blocked > 16777.2 sec on osd.0
46 ops are blocked > 16777.2 sec on osd.1
4 ops are blocked > 16777.2 sec on osd.2
26 ops are blocked > 16777.2 sec on osd.3
16 ops are blocked > 16777.2 sec on osd.4
37 ops are blocked > 16777.2 sec on osd.6
17 ops are blocked > 16777.2 sec on osd.7
2 ops are blocked > 16777.2 sec on osd.8
1 ops are blocked > 16777.2 sec on osd.9
9 osds have slow requests
too many PGs per OSD (940 > max 300)
noscrub,nodeep-scrub flag(s) set

#4 Updated by Daniel Schneller over 5 years ago

Once the OSDs holding the blocked operations had crashed again, those were apparently processed.

#5 Updated by Daniel Schneller over 5 years ago

Downgraded to 0.94.1-trusty "official" packages to rule out we had compiled a somehow unstable intermediate state between releases. Did not help, though. Crashes repeat in waves around every 30-60 minutes.

#6 Updated by Daniel Schneller over 5 years ago

Seems we did not delete one of the test-snapshots last night after all.
So it kind of makes a little more sense -- given our limited understanding of C++ -- that it might have something to do with that.

If I read the Crash/Debug output above correctly, we are encountering the asserting failure coming from https://github.com/ceph/ceph/blob/master/src/osd/ReplicatedPG.h#L833 where it attempts to call ctx->snapset_obc->put_excl, but apparently does not own some lock.

The line explicitly saying snapset_obc is what makes us believe the snapshot is to blame, which would fit, because it seems to be the only difference to what the system was like before the crashes started to happen. We are unsure why they happen so irregularly, but maybe that makes more sense to one of you. Snapshots were taken on all the RGW related pools:

  • .rgw.root
  • .rgw
  • .rgw.control
  • .rgw.gc
  • .rgw.buckets
  • .rgw.buckets.index
  • .log
  • .intent-log
  • .usage
  • .users
  • .users.email
  • .users.swift
  • .users.uid

There are other snapshots on them, but those were taken with 0.80.8 (firefly). When we upgraded to Hammer we disabled the cronjob that takes them. Last night the test was intended to check manually if the script could be re-enabled. HTH.

#7 Updated by Florian Haas over 5 years ago

#8 Updated by Daniel Schneller over 5 years ago

My bad. Clicked the wrong line before copying the link. That's exactly what I was referring to. Thanks Florian.

#9 Updated by Samuel Just over 5 years ago

  • Assignee set to Samuel Just

#10 Updated by Samuel Just over 5 years ago

I have a unittest which reproduces it now. Logs would still be helpful to confirm.

#11 Updated by Daniel Schneller over 5 years ago

Just checked if I might have caught a crash in the short time I had the logging turned up earlier, but I wasn't lucky enough. Will see what I can do, but considering I haven slept much in the last 36 hours, I might not be up long enough...

Can you share any insight on what might be triggering it? If so, it might be easier to enable the logging either when I know it should happen soon. Or can I willfully make it happen once to capture the verbose log?

#12 Updated by Samuel Just over 5 years ago

Nevermind about the logging then. What caused it is an interaction between a radosgw operation and pool snapshotting. Use of pool snapshotting with radosgw is very rare (I've never seen it done on purpose), so it hasn't come up. Removing any pool snapshots on the problem pool might help.

#13 Updated by Samuel Just over 5 years ago

  • Status changed from New to 7

Testing a fix.

#14 Updated by Samuel Just over 5 years ago

Actually, removing just the snapshots taken after you resumed taking snapshots might suffice.

#15 Updated by Jason Dillaman over 5 years ago

I can repeat this every time during a librbd fsx run on the wip-11579 branch:

ceph_test_librbd_fsx -d -W -R -p 100 -P ~/archive -r 1 -w 1 -t 1 -h 1 -l 250000000 -S 8208 -N 5000 pool_client.0 image_client.0

#16 Updated by Samuel Just over 5 years ago

Jason, can you confirm whether https://github.com/athanatos/ceph/tree/wip-11677 fixes it? (Probably merge it into whatever you are testing with)

#17 Updated by Daniel Schneller over 5 years ago

Follow up: after we had deleted the newly created snapshot from the RGW pools, the system has not crashed again. We are now considering other options for daily "backups".

#18 Updated by Samuel Just over 5 years ago

It's not really a very good backup option not least because the snapshots are not synchronized across pools. Also, because each snapshotted object would be lost in the same circumstances as the head object.

#19 Updated by Samuel Just over 5 years ago

(removed, wrong bug)

#20 Updated by Daniel Schneller over 5 years ago

We're aware that this is not atomic. We could accept that, because at the top me we do the snapshots we can be reasonably sure there's no activity. As for losing head and snapshots at the same time due to e. g. hardware failures, that is in fact not what this backup is intended to cover. It is more about being able to restore individual accidentally deleted objects for a while. We are now pursuing alternative options for that purpose. But thanks for pointing these issues out again, they are correct, of course.

#21 Updated by Daniel Schneller over 5 years ago

Damn autocorrect. "at the top me we" => "at the time we".

#22 Updated by Sage Weil over 5 years ago

  • Status changed from 7 to Resolved
  • Target version set to v9.0.3
  • Backport set to hammer

#23 Updated by Loic Dachary over 5 years ago

  • Status changed from Resolved to Pending Backport

#25 Updated by Loic Dachary over 5 years ago

  • Status changed from Pending Backport to Resolved

Also available in: Atom PDF