Bug #4556

OSDs crash with OSD::handle_op during recovery

Added by Wido den Hollander about 11 years ago. Updated almost 9 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
OSD
Target version:
-
% Done:
0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

While tracking down #3816 I stumbled upon this one multiple times.

I tried upgrading to 0.56.4 to be sure, but that didn't change anything.

 ceph version 0.56.4 (63b0f854d1cef490624de5d6cf9039735c7de5ca)
 1: /usr/bin/ceph-osd() [0x788fba]
 2: (()+0xfcb0) [0x7f083e63ecb0]
 3: (gsignal()+0x35) [0x7f083cffd425]
 4: (abort()+0x17b) [0x7f083d000b8b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f083d94f69d]
 6: (()+0xb5846) [0x7f083d94d846]
 7: (()+0xb5873) [0x7f083d94d873]
 8: (()+0xb596e) [0x7f083d94d96e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x8343af]
 10: (OSD::handle_op(std::tr1::shared_ptr<OpRequest>)+0x12d8) [0x624668]
 11: (OSD::dispatch_op(std::tr1::shared_ptr<OpRequest>)+0xe9) [0x62cba9]
 12: (OSD::do_waiters()+0x1a5) [0x62d105]
 13: (OSD::ms_dispatch(Message*)+0x1c2) [0x636a82]
 14: (DispatchQueue::entry()+0x349) [0x8c7399]
 15: (DispatchQueue::DispatchThread::entry()+0xd) [0x81fbad]
 16: (()+0x7e9a) [0x7f083e636e9a]
 17: (clone()+0x6d) [0x7f083d0bacbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

This cluster has already sustained a lot of issues and some OSDs have been down and out for quite some time now.

I added the logs of two OSDs:
- osd.2
- osd.38

It goes wrong during the peering process. All 40 OSDs are active and trying to recover, but one by one they keep going down until I eventually end up with 11 OSDs and a cluster in a very bad state.

osdmap e21977: 40 osds: 11 up, 11 in

In the end only 11 OSDs survive. I added the output of "ceph osd tree"; as you can see, it's not always the same ones that survive.

The attached logs were produced with debug osd = 20.
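
For reference, logging at that level is usually enabled with a ceph.conf entry like the one below (or injected at runtime); the exact method used to collect these logs isn't stated in the report.

[osd]
    debug osd = 20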

From what I can make out of the logs, it goes wrong when OSDs transition to the Primary state for a PG; that transition seems to fail and they crash.

ceph-osd.2.log.gz (901 KB) Wido den Hollander, 03/26/2013 05:41 AM

ceph-osd.38.log.gz (412 KB) Wido den Hollander, 03/26/2013 05:41 AM

tree.1.txt View (1.12 KB) Wido den Hollander, 03/26/2013 05:41 AM

tree.2.txt View (1.12 KB) Wido den Hollander, 03/26/2013 05:41 AM


Related issues

Related to Ceph - Bug #3816: osd/OSD.cc: 3318: FAILED assert(osd_lock.is_locked()) Resolved 01/16/2013

Associated revisions

Revision f2dda43c (diff)
Added by Sage Weil about 11 years ago

osd: EINVAL when rmw_flags is 0

A broken client (e.g., v0.56) can send a request that ends up with an
rmw_flags of 0. Treat this as invalid and return EINVAL.

Fixes: #4556
Signed-off-by: Sage Weil <>

Revision 6b6e0cef (diff)
Added by Sage Weil about 11 years ago

osd: EINVAL when rmw_flags is 0

A broken client (e.g., v0.56) can send a request that ends up with an
rmw_flags of 0. Treat this as invalid and return EINVAL.

Fixes: #4556
Signed-off-by: Sage Weil <>
(cherry picked from commit f2dda43c9ed4fda9cfa87362514985ee79e0ae15)
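
For reference, a minimal standalone sketch of the guard these commits describe: a request whose rmw_flags ends up as 0 is rejected with EINVAL instead of tripping an assert further down the call chain. The MOSDOp, rmw_flags, and handle_op names mirror the backtraces on this ticket, but the code below is a simplified model, not the actual Ceph change.

// Simplified model of the fix described in the commit message above.
// Not the real Ceph code; types and logic are reduced to the essentials.
#include <cerrno>
#include <iostream>

// Stand-in for the op message; in Ceph this is MOSDOp, and rmw_flags is
// derived from the client's op codes.
struct MOSDOp {
    int rmw_flags = 0;

    bool may_read() const  { return rmw_flags & 1; }
    bool may_write() const { return rmw_flags & 2; }
};

// Post-fix behaviour: a zero rmw_flags (as a broken v0.56 client can send)
// is treated as an invalid request and answered with EINVAL, rather than
// hitting assert(rmw_flags) deeper in the dispatch path.
int handle_op(const MOSDOp& op) {
    if (op.rmw_flags == 0)
        return -EINVAL;
    // ... normal dispatch based on may_read()/may_write() would follow ...
    return 0;
}

int main() {
    MOSDOp broken;   // rmw_flags left at 0, like the malformed client request
    std::cout << "handle_op -> " << handle_op(broken) << std::endl;  // -EINVAL (-22 on Linux)
    return 0;
}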

History

#1 Updated by Sage Weil about 11 years ago

  • Priority changed from Normal to Urgent

#2 Updated by Sage Weil about 11 years ago

Wido-

This is the same assert we saw on #3816. Is it possible to reproduce this with some logging so we can see the request that is triggering it? It might be an old client, or ... not sure. Would like to know before we just remove the assert.

Thanks!

#3 Updated by Sage Weil about 11 years ago

  • Status changed from New to Need More Info

#4 Updated by Wido den Hollander about 11 years ago

I just saw osd.0 (and a couple of others) crash, and I have a core file.

This is what the backtrace tells me:

Core was generated by `/usr/bin/ceph-osd -i 0 --pid-file /var/run/ceph/osd.0.pid -c /etc/ceph/ceph.con'.
Program terminated with signal 6, Aborted.
#0  0x00007fc2e019db7b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0  0x00007fc2e019db7b in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x000000000078910e in reraise_fatal (signum=6) at global/signal_handler.cc:58
#2  handle_fatal_signal (signum=6) at global/signal_handler.cc:104
#3  <signal handler called>
#4  0x00007fc2deb5c425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00007fc2deb5fb8b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x00007fc2df4ae69d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007fc2df4ac846 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007fc2df4ac873 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007fc2df4ac96e in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00000000008343af in ceph::__ceph_assert_fail (assertion=0x8f2659 "rmw_flags", file=<optimized out>, line=57, func=0x909390 "bool MOSDOp::check_rmw(int)")
    at common/assert.cc:77
#11 0x0000000000624668 in check_rmw (this=<optimized out>, flag=<optimized out>) at ./messages/MOSDOp.h:57
#12 check_rmw (flag=4, this=<optimized out>) at osd/OSD.cc:6337
#13 need_write_cap (this=0x8e1b480) at ./messages/MOSDOp.h:107
#14 may_write (this=0x8e1b480) at ./messages/MOSDOp.h:100
#15 OSD::handle_op (this=0x3316000, op=...) at osd/OSD.cc:5907
#16 0x000000000062cba9 in OSD::dispatch_op (this=0x3316000, op=...) at osd/OSD.cc:3440
#17 0x000000000063630e in OSD::_dispatch (this=0x3316000, m=<optimized out>) at osd/OSD.cc:3523
#18 0x0000000000636a7a in OSD::ms_dispatch (this=0x3316000, m=0x8e1b480) at osd/OSD.cc:3281
#19 0x00000000008c7399 in ms_deliver_dispatch (m=0x8e1b480, this=0x331c000) at msg/Messenger.h:553
#20 DispatchQueue::entry (this=0x331c0e8) at msg/DispatchQueue.cc:107
#21 0x000000000081fbad in DispatchQueue::DispatchThread::entry (this=<optimized out>) at msg/DispatchQueue.h:85
#22 0x00007fc2e0195e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#23 0x00007fc2dec19cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#24 0x0000000000000000 in ?? ()
(gdb)
I uploaded some logs to the cephdrop account; the files are:
  • /home/cephdrop/4556-ceph-osd.0.log.gz
  • /home/cephdrop/4556-ceph-osd.1.log.gz
  • /home/cephdrop/4556-ceph-osd.6.log.gz

All these OSDs crashed with the backtrace posted above.

Does this seem to be an issue with older clients?
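
Reading the frames above, the abort comes from the rmw_flags assertion in MOSDOp::check_rmw(), reached from OSD::handle_op() via may_write()/need_write_cap(). A rough standalone model of that pre-fix path follows; the names come from the backtrace, but the bodies are assumptions, not the real code.

// Rough model of the pre-fix call chain shown in the gdb backtrace:
// OSD::handle_op -> may_write -> need_write_cap -> check_rmw -> assert.
#include <cassert>

struct MOSDOp {
    int rmw_flags = 0;   // ends up 0 for the malformed request

    bool check_rmw(int flag) const {
        assert(rmw_flags);          // the failed assert(rmw_flags) seen above
        return rmw_flags & flag;
    }
    // flag=4 as in frame #12; in the real code this goes through need_write_cap()
    bool may_write() const { return check_rmw(4); }
};

int main() {
    MOSDOp op;           // a request with rmw_flags == 0
    op.may_write();      // aborts here, which is what takes the OSD down
    return 0;
}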

#5 Updated by Sage Weil about 11 years ago

  • Status changed from Need More Info to Pending Backport

#6 Updated by Sage Weil about 11 years ago

  • Assignee set to Sage Weil

#7 Updated by Sage Weil almost 11 years ago

  • Status changed from Pending Backport to Resolved

#9 Updated by Xinxin Shu almost 9 years ago

  • Regression set to No
