Bug #14821

closed

OSD segfault in ms_get_authorizer -- hammer ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)

Added by Samuel Just about 8 years ago. Updated about 7 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/bin/ceph-osd -i 145 --pid-file /var/run/ceph/osd.145.pid -c /etc/ceph/ceph'.
Program terminated with signal 11, Segmentation fault.
#0 0x000000000061d19e in OSD::ms_get_authorizer(int, AuthAuthorizer**, bool) ()
Missing separate debuginfos, use: debuginfo-install ceph-osd-0.94.1-13.el7cp.x86_64
(gdb) bt
#0 0x000000000061d19e in OSD::ms_get_authorizer(int, AuthAuthorizer**, bool) ()
#1 0x0000000000acb5db in SimpleMessenger::get_authorizer(int, bool) ()
#2 0x0000000000bd0cf9 in Pipe::connect() ()
#3 0x0000000000bd4301 in Pipe::writer() ()
#4 0x0000000000bdfbdd in Pipe::Writer::entry() ()
#5 0x00007fa429eb3df5 in start_thread () from /lib64/libpthread.so.0
#6 0x00007fa4289961ad in clone () from /lib64/libc.so.6
(gdb)


Files

gdb.out.gz (208 KB) gdb.out.gz Samuel Just, 02/19/2016 04:56 PM
gdb2.out.gz (306 KB) gdb2.out.gz Samuel Just, 02/19/2016 05:02 PM
Actions #1

Updated by Samuel Just about 8 years ago

From another osd, two threads actually ran into trouble in that method:

Thread 1209 (Thread 0x7f66d74a4700 (LWP 602689)):
#0 0x00007f66fedc129b in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) () from /lib64/libtcmalloc.so.4
#1 0x00007f66fedc15b0 in tcmalloc::ThreadCache::Scavenge() () from /lib64/libtcmalloc.so.4
#2 0x00007f66fedd0e67 in tc_delete () from /lib64/libtcmalloc.so.4
#3 0x0000000000b8f0e9 in pretty_version_to_str() ()
#4 0x00000000009f81eb in ceph::BackTrace::print(std::ostream&) ()
#5 0x00000000009f642f in handle_fatal_signal(int) ()
#6 <signal handler called>
#7 0x000000000061d19e in OSD::ms_get_authorizer(int, AuthAuthorizer**, bool) ()
#8 0x0000000000acb5db in SimpleMessenger::get_authorizer(int, bool) ()
#9 0x0000000000bd0cf9 in Pipe::connect() ()
#10 0x0000000000bd4301 in Pipe::writer() ()
#11 0x0000000000bdfbdd in Pipe::Writer::entry() ()
#12 0x00007f66fe622df5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f66fd1051ad in clone () from /lib64/libc.so.6

in addition to


Thread 1 (Thread 0x7f666c5e1700 (LWP 1007837)):
#0 0x000000000061d19e in OSD::ms_get_authorizer(int, AuthAuthorizer**, bool) ()
#1 0x0000000000acb5db in SimpleMessenger::get_authorizer(int, bool) ()
#2 0x0000000000bd0cf9 in Pipe::connect() ()
#3 0x0000000000bd4301 in Pipe::writer() ()
#4 0x0000000000bdfbdd in Pipe::Writer::entry() ()
#5 0x00007f66fe622df5 in start_thread () from /lib64/libpthread.so.0
#6 0x00007f66fd1051ad in clone () from /lib64/libc.so.6

Actions #2

Updated by Samuel Just about 8 years ago

osd.145 (in the summary) also shows the same pattern:

Thread 1459 (Thread 0x7fa3eb833700 (LWP 604327)):
#0 0x00007fa42acfc819 in _dl_fixup () from /lib64/ld-linux-x86-64.so.2
#1 0x00007fa42ad032e0 in _dl_runtime_resolve () from /lib64/ld-linux-x86-64.so.2
#2 0x00000000009f63f2 in handle_fatal_signal(int) ()
#3 <signal handler called>
#4 0x000000000061d19e in OSD::ms_get_authorizer(int, AuthAuthorizer**, bool) ()
#5 0x0000000000acb5db in SimpleMessenger::get_authorizer(int, bool) ()
#6 0x0000000000bd0cf9 in Pipe::connect() ()
#7 0x0000000000bd4301 in Pipe::writer() ()
#8 0x0000000000bdfbdd in Pipe::Writer::entry() ()
#9 0x00007fa429eb3df5 in start_thread () from /lib64/libpthread.so.0
#10 0x00007fa4289961ad in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fa3f69dd700 (LWP 1002130)):
#0 0x000000000061d19e in OSD::ms_get_authorizer(int, AuthAuthorizer**, bool) ()
#1 0x0000000000acb5db in SimpleMessenger::get_authorizer(int, bool) ()
#2 0x0000000000bd0cf9 in Pipe::connect() ()
#3 0x0000000000bd4301 in Pipe::writer() ()
#4 0x0000000000bdfbdd in Pipe::Writer::entry() ()
#5 0x00007fa429eb3df5 in start_thread () from /lib64/libpthread.so.0
#6 0x00007fa4289961ad in clone () from /lib64/libc.so.6

Actions #3

Updated by Samuel Just about 8 years ago

Is there anything that actually protects against a race between the monc->auth access in OSD::ms_get_authorizer and the code in MonClient::authenticate where we switch it out?

Actions #4

Updated by Samuel Just about 8 years ago

On one mon node:

[global]
rbd cache writethrough until flush = false
osd crush chooseleaf type = 0
osd crush update on start = false
osd pg bits = 10
osd pgp bits = 10
auth client required = none
auth cluster required = none
auth service required = none
keyring = /tmp/cbt/ceph/keyring
log to syslog = false
log file = /tmp/cbt/ceph/log/$name.log
rbd cache = true
filestore merge threshold = 40
filestore split multiple = 8
osd op threads = 8
mon pg warn max object skew = 100000
mon pg warn min per osd = 0
mon pg warn max per osd = 32768
[mon.a]
mon addr = 192.168.32.1:6789
host = gqac022.sbu.lab.eng.bos.redhat.com
mon data = /tmp/cbt/ceph/mon.$id

[osd.0]
host = gqas012.sbu.lab.eng.bos.redhat.com
osd data = /tmp/cbt/mnt/osd-device-0-data
osd journal = /dev/disk/by-partlabel/osd-device-0-journal
[osd.1]
host = gqas012.sbu.lab.eng.bos.redhat.com
osd data = /tmp/cbt/mnt/osd-device-1-data
osd journal = /dev/disk/by-partlabel/osd-device-1-journal
[osd.2]
host = gqas012.sbu.lab.eng.bos.redhat.com
osd data = /tmp/cbt/mnt/osd-device-2-data
osd journal = /dev/disk/by-partlabel/osd-device-2-journal
[osd.3]
host = gqas012.sbu.lab.eng.bos.redhat.com
osd data = /tmp/cbt/mnt/osd-device-3-data
osd journal = /dev/disk/by-partlabel/osd-device-3-journal
[osd.4]
host = gqas012.sbu.lab.eng.bos.redhat.com
osd data = /tmp/cbt/mnt/osd-device-4-data
osd journal = /dev/disk/by-partlabel/osd-device-4-journal
[osd.5]
host = gqas012.sbu.lab.eng.bos.redhat.com
osd data = /tmp/cbt/mnt/osd-device-5-data
osd journal = /dev/disk/by-partlabel/osd-device-5-journal
[osd.6]
host = gqas012.sbu.lab.eng.bos.redhat.com
osd data = /tmp/cbt/mnt/osd-device-6-data
osd journal = /dev/disk/by-partlabel/osd-device-6-journal
[osd.7]
host = gqas012.sbu.lab.eng.bos.redhat.com
osd data = /tmp/cbt/mnt/osd-device-7-data
osd journal = /dev/disk/by-partlabel/osd-device-7-journal
[osd.8]
host = gqas012.sbu.lab.eng.bos.redhat.com
osd data = /tmp/cbt/mnt/osd-device-8-data
osd journal = /dev/disk/by-partlabel/osd-device-8-journal
[osd.9]
host = gqas012.sbu.lab.eng.bos.redhat.com
osd data = /tmp/cbt/mnt/osd-device-9-data
osd journal = /dev/disk/by-partlabel/osd-device-9-journal
[osd.10]
host = gqas012.sbu.lab.eng.bos.redhat.com
osd data = /tmp/cbt/mnt/osd-device-10-data
osd journal = /dev/disk/by-partlabel/osd-device-10-journal
[osd.11]
host = gqas012.sbu.lab.eng.bos.redhat.com
osd data = /tmp/cbt/mnt/osd-device-11-data
osd journal = /dev/disk/by-partlabel/osd-device-11-journal

[osd.12]
host = gqas003.sbu.lab.eng.bos.redhat.com
osd data = /tmp/cbt/mnt/osd-device-0-data
osd journal = /dev/disk/by-partlabel/osd-device-0-journal
[osd.13]
host = gqas003.sbu.lab.eng.bos.redhat.com
osd data = /tmp/cbt/mnt/osd-device-1-data
osd journal = /dev/disk/by-partlabel/osd-device-1-journal
[osd.14]
host = gqas003.sbu.lab.eng.bos.redhat.com
osd data = /tmp/cbt/mnt/osd-device-2-data
osd journal = /dev/disk/by-partlabel/osd-device-2-journal
[osd.15]
host = gqas003.sbu.lab.eng.bos.redhat.com
osd data = /tmp/cbt/mnt/osd-device-3-data
osd journal = /dev/disk/by-partlabel/osd-device-3-journal
[osd.16]
host = gqas003.sbu.lab.eng.bos.redhat.com
osd data = /tmp/cbt/mnt/osd-device-4-data
osd journal = /dev/disk/by-partlabel/osd-device-4-journal
[osd.17]
host = gqas003.sbu.lab.eng.bos.redhat.com
osd data = /tmp/cbt/mnt/osd-device-5-data
osd journal = /dev/disk/by-partlabel/osd-device-5-journal
[osd.18]
host = gqas003.sbu.lab.eng.bos.redhat.com
osd data = /tmp/cbt/mnt/osd-device-6-data
osd journal = /dev/disk/by-partlabel/osd-device-6-journal
[osd.19]
host = gqas003.sbu.lab.eng.bos.redhat.com
osd data = /tmp/cbt/mnt/osd-device-7-data
osd journal = /dev/disk/by-partlabel/osd-device-7-journal
[osd.20]
host = gqas003.sbu.lab.eng.bos.redhat.com
...

On the other:

[global]
fsid = b65cb329-7e2b-4507-a78a-347095aa9269
mon_initial_members = gqac022-priv, gqac026-priv, gqac014-priv
mon_host = 192.168.32.1,192.168.32.43,192.168.32.44
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true

cephx appears to be enabled on one but not on the other. Can this cause the monc to reset the auth pointer and trigger the race?

Actions #5

Updated by Samuel Just about 8 years ago

<bennyturns> sjusthm, its feasible the OS install would be done by then
<sjusthm> bennyturns: I think you reimaged the mon (gqac022)
<bennyturns> sjusthm, I did
<sjusthm> a new mon booted with a different cephx config
<bennyturns> ahh
<sjusthm> that triggered a race condition as the monclient switched out the authorizer object
<sjusthm> and the osd crashed
<bennyturns> sounds like a good fit!
<sjusthm> right before it would have rejected the mon (the authorizer would have failed, of course)
<sjusthm> bennyturns: if you can disprove this with clocks, please do so
<sjusthm> but I think this isn't a big deal
<sjusthm> (not that we won't fix it...)
Actions #6

Updated by Samuel Just about 8 years ago

  • File gdb.out.gz added

Attached backtraces from one of the nodes; there is no mention of MonClient::handle_auth. Not sure whether that torpedoes my theory.

Actions #7

Updated by Samuel Just about 8 years ago

  • File deleted (gdb.out.gz)
Actions #8

Updated by Samuel Just about 8 years ago

Replaced it with a slightly more complete version.

Actions #9

Updated by Samuel Just about 8 years ago

Backtraces from another node as well, still no sign of the call into handle_auth.

Actions #10

Updated by Samuel Just about 8 years ago

The sequence of events, as far as I can tell, is:
1) Cluster happy.
2) gqac022 reimaged to serve as a mon in a smaller cluster with auth disabled.
3) Some osds try to connect to that mon and crash due to the cephx mismatch, perhaps due to a race between OSD::ms_get_authorizer and MonClient::handle_auth (though there is no evidence of MonClient::handle_auth in the backtraces).

Actions #11

Updated by Samuel Just about 8 years ago

  • Has duplicate Bug #13826: segfault from PrebufferedStreambuf::overflow added
Actions #12

Updated by Samuel Just about 8 years ago

  • Has duplicate deleted (Bug #13826: segfault from PrebufferedStreambuf::overflow)
Actions #13

Updated by Samuel Just over 7 years ago

  • Priority changed from Urgent to Normal
Actions #14

Updated by Sage Weil about 7 years ago

  • Status changed from New to Can't reproduce