Bug #967

closed

osd: PG::do_peer crash when restarting other OSD in PG

Added by Wido den Hollander about 13 years ago. Updated about 13 years ago.

Status: Resolved
Priority: Normal
Category: OSD
% Done: 0%


Description

This morning I pulled out the machine hosting osd20, osd21, osd22 and osd23. After bringing this machine back, three PGs stayed in peering:

root@amd:~# ceph pg dump -o -|grep peering
1.7a8    5    0    0    0    5    1025    495    495    peering    26'5    236'110    [20,16,2]    [20,16,2]    0'0    2011-03-31 10:49:18.484866
3.44b    23    0    0    0    94208    96468992    2548    2548    peering    25'132    234'203    [20,15,0]    [20,15,0]    0'0    2011-03-31 10:55:54.624271
2.35d    16    0    0    0    65536    67108864    1659    1659    peering    24'16    243'98    [20,18,14]    [20,18,14]    0'0    2011-03-31 10:50:51.514923
root@amd:~#
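A plain grep matches "peering" anywhere on the line. As a rough sketch, the filter below matches only on the state column and prints the PG id together with the two bracketed OSD sets. The helper name `peering_pgs` is hypothetical, and the column positions (state in field 10, OSD sets in fields 13 and 14) are an assumption read off the output above, so they may differ in other versions.

```shell
#!/bin/sh
# Hypothetical helper, not part of Ceph: reads `ceph pg dump` rows on stdin.
# Assumption: field 10 is the PG state and fields 13 and 14 are the two
# bracketed OSD sets, as in the output shown in this report.
peering_pgs() {
  awk '$10 ~ /peering/ { print $1, $13, $14 }'
}

# Usage, with the same dump command as above:
#   ceph pg dump -o - | peering_pgs
```

Matching on a single field avoids false hits if "peering" ever shows up elsewhere in a row, and still catches combined states such as crashed+peering.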

To gather more information and trigger the peering process again, I restarted osd20 with 'debug osd = 20'.

When this occurred, osd15 crashed with the following backtrace:

(gdb) bt
#0  0x00007fea4cbcc7bb in raise () from /lib/libpthread.so.0
#1  0x000000000061ec33 in reraise_fatal (signum=5929) at common/signal.cc:63
#2  0x000000000061f95b in handle_fatal_signal (signum=6) at common/signal.cc:110
#3  <signal handler called>
#4  0x00007fea4b79ca75 in raise () from /lib/libc.so.6
#5  0x00007fea4b7a05c0 in abort () from /lib/libc.so.6
#6  0x00007fea4c0528e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#7  0x00007fea4c050d16 in ?? () from /usr/lib/libstdc++.so.6
#8  0x00007fea4c050d43 in std::terminate() () from /usr/lib/libstdc++.so.6
#9  0x00007fea4c050e3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#10 0x0000000000606fba in ceph::__ceph_assert_fail (assertion=<value optimized out>, file=<value optimized out>, line=<value optimized out>, 
    func=0x647fe0 "void PG::do_peer(ObjectStore::Transaction&, std::list<Context*, std::allocator<Context*> >&, std::map<int, std::map<pg_t, PG::Query, std::less<pg_t>, std::allocator<std::pair<const pg_t, PG::Query> > "...) at common/assert.cc:86
#11 0x0000000000572ca3 in PG::do_peer (this=0x2de2000, t=<value optimized out>, tfin=<value optimized out>, query_map=<value optimized out>, 
    activator_map=<value optimized out>) at osd/PG.cc:1656
#12 0x0000000000519ff2 in OSD::handle_pg_notify (this=0x258d000, m=0x2e86a80) at osd/OSD.cc:4168
#13 0x000000000051ad4d in OSD::_dispatch (this=0x258d000, m=0x2e86a80) at osd/OSD.cc:2569
#14 0x000000000051b7df in OSD::ms_dispatch (this=0x258d000, m=0x2e86a80) at osd/OSD.cc:2388
#15 0x0000000000473453 in Messenger::ms_deliver_dispatch (this=0x258aa00) at msg/Messenger.h:98
#16 SimpleMessenger::dispatch_entry (this=0x258aa00) at msg/SimpleMessenger.cc:352
#17 0x000000000046a26c in SimpleMessenger::DispatchThread::entry (this=0x258ae88) at msg/SimpleMessenger.h:533
#18 0x00007fea4cbc39ca in start_thread () from /lib/libpthread.so.0
#19 0x00007fea4b84f70d in clone () from /lib/libc.so.6
#20 0x0000000000000000 in ?? ()
(gdb)

I tried restarting osd20 multiple times to make osd15 crash again, but the crash did not recur.

Remote access to the machines is possible from logger.ceph.widodh.nl

osd20:
osd15:

Logging is done via remote syslog to noisy.ceph.widodh.nl:/var/log/remote/ceph/osd.log

Those three PGs are in crashed+peering right now; I haven't been able to make anything out from the logs.
