Bug #967
Status: closed
osd: PG::do_peer crash when restarting other OSD in PG
% Done: 0%
Description
This morning I pulled out the machine hosting osd20, 21, 22 and 23. After bringing this machine back, three PGs stayed stuck in peering:
root@amd:~# ceph pg dump -o - | grep peering
1.7a8 5 0 0 0 5 1025 495 495 peering 26'5 236'110 [20,16,2] [20,16,2] 0'0 2011-03-31 10:49:18.484866
3.44b 23 0 0 0 94208 96468992 2548 2548 peering 25'132 234'203 [20,15,0] [20,15,0] 0'0 2011-03-31 10:55:54.624271
2.35d 16 0 0 0 65536 67108864 1659 1659 peering 24'16 243'98 [20,18,14] [20,18,14] 0'0 2011-03-31 10:50:51.514923
root@amd:~#
To gather more information I restarted osd20 with 'debug osd = 20', to trigger the peering process again.
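For reference, the debug level was raised in ceph.conf before restarting the daemon; a minimal fragment of what that looks like (the `[osd.20]` section name is an assumption based on this cluster's daemon naming):

```ini
; hypothetical ceph.conf fragment on the osd20 host;
; 'debug osd = 20' is the option mentioned above
[osd.20]
    debug osd = 20
```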
When this occurred, osd15 crashed with the following backtrace:
(gdb) bt
#0  0x00007fea4cbcc7bb in raise () from /lib/libpthread.so.0
#1  0x000000000061ec33 in reraise_fatal (signum=5929) at common/signal.cc:63
#2  0x000000000061f95b in handle_fatal_signal (signum=6) at common/signal.cc:110
#3  <signal handler called>
#4  0x00007fea4b79ca75 in raise () from /lib/libc.so.6
#5  0x00007fea4b7a05c0 in abort () from /lib/libc.so.6
#6  0x00007fea4c0528e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#7  0x00007fea4c050d16 in ?? () from /usr/lib/libstdc++.so.6
#8  0x00007fea4c050d43 in std::terminate() () from /usr/lib/libstdc++.so.6
#9  0x00007fea4c050e3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#10 0x0000000000606fba in ceph::__ceph_assert_fail (assertion=<value optimized out>, file=<value optimized out>, line=<value optimized out>, func=0x647fe0 "void PG::do_peer(ObjectStore::Transaction&, std::list<Context*, std::allocator<Context*> >&, std::map<int, std::map<pg_t, PG::Query, std::less<pg_t>, std::allocator<std::pair<const pg_t, PG::Query> > "...) at common/assert.cc:86
#11 0x0000000000572ca3 in PG::do_peer (this=0x2de2000, t=<value optimized out>, tfin=<value optimized out>, query_map=<value optimized out>, activator_map=<value optimized out>) at osd/PG.cc:1656
#12 0x0000000000519ff2 in OSD::handle_pg_notify (this=0x258d000, m=0x2e86a80) at osd/OSD.cc:4168
#13 0x000000000051ad4d in OSD::_dispatch (this=0x258d000, m=0x2e86a80) at osd/OSD.cc:2569
#14 0x000000000051b7df in OSD::ms_dispatch (this=0x258d000, m=0x2e86a80) at osd/OSD.cc:2388
#15 0x0000000000473453 in Messenger::ms_deliver_dispatch (this=0x258aa00) at msg/Messenger.h:98
#16 SimpleMessenger::dispatch_entry (this=0x258aa00) at msg/SimpleMessenger.cc:352
#17 0x000000000046a26c in SimpleMessenger::DispatchThread::entry (this=0x258ae88) at msg/SimpleMessenger.h:533
#18 0x00007fea4cbc39ca in start_thread () from /lib/libpthread.so.0
#19 0x00007fea4b84f70d in clone () from /lib/libc.so.6
#20 0x0000000000000000 in ?? ()
(gdb)
I tried restarting osd20 multiple times to trigger the crash on osd15 again, but it didn't recur.
Remote access to the machines is possible from logger.ceph.widodh.nl
osd20: root@atom5.ceph.widodh.nl
osd15: root@atom3.ceph.widodh.nl
Logging is done via remote syslog to noisy.ceph.widodh.nl:/var/log/remote/ceph/osd.log
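Since all daemons log to that one file, a grep along these lines can pull out the relevant frames; the log path is the one from this report, while the match patterns are assumptions based on the backtrace above:

```shell
# Hypothetical search of the aggregated syslog for osd15's failed
# assertion; 'do_peer' and 'assert' are guesses at what the assert
# output contains, based on the backtrace in this report.
grep -E 'osd15.*(do_peer|assert)' /var/log/remote/ceph/osd.log | tail -n 50
```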
Those three PGs are in crashed+peering right now; I haven't been able to make anything of the logs.