Bug #967: osd: PG::do_peer crash when restarting other OSD in PG - Ceph - Ceph

closed

Added by Wido den Hollander about 13 years ago. Updated about 13 years ago.

Status: Resolved
Priority: Normal
Assignee: Sage Weil
Category: OSD
Target version: v0.27
Spent time: 1:00 h

Description

This morning I pulled out the machine hosting osd20, 21, 22 and 23. After bringing this machine back, three PGs stayed in peering:

root@amd:~# ceph pg dump -o -|grep peering
1.7a8    5    0    0    0    5    1025    495    495    peering    26'5    236'110    [20,16,2]    [20,16,2]    0'0    2011-03-31 10:49:18.484866
3.44b    23    0    0    0    94208    96468992    2548    2548    peering    25'132    234'203    [20,15,0]    [20,15,0]    0'0    2011-03-31 10:55:54.624271
2.35d    16    0    0    0    65536    67108864    1659    1659    peering    24'16    243'98    [20,18,14]    [20,18,14]    0'0    2011-03-31 10:50:51.514923
root@amd:~#
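For reference, the three lines above can be checked for a common OSD with a short script. This is only a sketch: the column positions (state as the 10th whitespace-separated field, acting set as the 14th) are assumptions read off the sample output above, not a documented format, and the helper names `parse_osd_set`, `peering_pgs` and `shared_osds` are made up for illustration.

```python
import re

# The three peering lines exactly as they appear in the dump above.
SAMPLE = """\
1.7a8    5    0    0    0    5    1025    495    495    peering    26'5    236'110    [20,16,2]    [20,16,2]    0'0    2011-03-31 10:49:18.484866
3.44b    23    0    0    0    94208    96468992    2548    2548    peering    25'132    234'203    [20,15,0]    [20,15,0]    0'0    2011-03-31 10:55:54.624271
2.35d    16    0    0    0    65536    67108864    1659    1659    peering    24'16    243'98    [20,18,14]    [20,18,14]    0'0    2011-03-31 10:50:51.514923
"""

def parse_osd_set(field):
    """Turn a string such as '[20,16,2]' into [20, 16, 2]."""
    inner = field.strip("[]")
    return [int(x) for x in inner.split(",")] if inner else []

def peering_pgs(dump_text):
    """Return (pgid, state, acting set) for every PG whose state includes 'peering'.

    Assumption: field 0 is the PG id, field 9 the state, field 13 the acting set.
    """
    result = []
    for line in dump_text.splitlines():
        fields = line.split()
        # Skip prompts, headers and anything that does not start with a PG id.
        if len(fields) < 14 or not re.fullmatch(r"\d+\.[0-9a-f]+", fields[0]):
            continue
        state = fields[9]
        if "peering" not in state.split("+"):
            continue
        result.append((fields[0], state, parse_osd_set(fields[13])))
    return result

def shared_osds(pgs):
    """OSD ids that appear in the acting set of every listed PG."""
    sets = [set(acting) for _, _, acting in pgs]
    return set.intersection(*sets) if sets else set()

pgs = peering_pgs(SAMPLE)
for pgid, state, acting in pgs:
    print(pgid, state, acting)
print("OSDs common to all peering PGs:", shared_osds(pgs))
```

On the sample above, the only OSD present in all three acting sets is osd20, which is also listed first in each set.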

To gather more information I restarted osd20 with 'debug osd = 20', to trigger the peering process again.

When this occurred, osd15 crashed with the following backtrace:

(gdb) bt
#0  0x00007fea4cbcc7bb in raise () from /lib/libpthread.so.0
#1  0x000000000061ec33 in reraise_fatal (signum=5929) at common/signal.cc:63
#2  0x000000000061f95b in handle_fatal_signal (signum=6) at common/signal.cc:110
#3  <signal handler called>
#4  0x00007fea4b79ca75 in raise () from /lib/libc.so.6
#5  0x00007fea4b7a05c0 in abort () from /lib/libc.so.6
#6  0x00007fea4c0528e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#7  0x00007fea4c050d16 in ?? () from /usr/lib/libstdc++.so.6
#8  0x00007fea4c050d43 in std::terminate() () from /usr/lib/libstdc++.so.6
#9  0x00007fea4c050e3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#10 0x0000000000606fba in ceph::__ceph_assert_fail (assertion=<value optimized out>, file=<value optimized out>, line=<value optimized out>, 
    func=0x647fe0 "void PG::do_peer(ObjectStore::Transaction&, std::list<Context*, std::allocator<Context*> >&, std::map<int, std::map<pg_t, PG::Query, std::less<pg_t>, std::allocator<std::pair<const pg_t, PG::Query> > "...) at common/assert.cc:86
#11 0x0000000000572ca3 in PG::do_peer (this=0x2de2000, t=<value optimized out>, tfin=<value optimized out>, query_map=<value optimized out>, 
    activator_map=<value optimized out>) at osd/PG.cc:1656
#12 0x0000000000519ff2 in OSD::handle_pg_notify (this=0x258d000, m=0x2e86a80) at osd/OSD.cc:4168
#13 0x000000000051ad4d in OSD::_dispatch (this=0x258d000, m=0x2e86a80) at osd/OSD.cc:2569
#14 0x000000000051b7df in OSD::ms_dispatch (this=0x258d000, m=0x2e86a80) at osd/OSD.cc:2388
#15 0x0000000000473453 in Messenger::ms_deliver_dispatch (this=0x258aa00) at msg/Messenger.h:98
#16 SimpleMessenger::dispatch_entry (this=0x258aa00) at msg/SimpleMessenger.cc:352
#17 0x000000000046a26c in SimpleMessenger::DispatchThread::entry (this=0x258ae88) at msg/SimpleMessenger.h:533
#18 0x00007fea4cbc39ca in start_thread () from /lib/libpthread.so.0
#19 0x00007fea4b84f70d in clone () from /lib/libc.so.6
#20 0x0000000000000000 in ?? ()
(gdb)

I tried restarting osd20 multiple times to make osd15 crash again, but the crash did not recur.

Remote access to the machines is possible from logger.ceph.widodh.nl

osd20: root@atom5.ceph.widodh.nl
osd15: root@atom3.ceph.widodh.nl

Logging is done via remote syslog to noisy.ceph.widodh.nl:/var/log/remote/ceph/osd.log

Those three PGs are in crashed+peering right now; I haven't been able to make anything of the logs.


Updated by Sage Weil about 13 years ago

Updated by Sage Weil about 13 years ago

Updated by Wido den Hollander about 13 years ago

Updated by Sage Weil about 13 years ago

Updated by Sage Weil about 13 years ago