osd: PG::do_peer crash when restarting other OSD in PG
This morning I pulled out the machine hosting osd20, 21, 22 and 23. After bringing this machine back, three PGs stayed in peering:
root@amd:~# ceph pg dump -o -|grep peering
1.7a8 5 0 0 0 5 1025 495 495 peering 26'5 236'110 [20,16,2] [20,16,2] 0'0 2011-03-31 10:49:18.484866
3.44b 23 0 0 0 94208 96468992 2548 2548 peering 25'132 234'203 [20,15,0] [20,15,0] 0'0 2011-03-31 10:55:54.624271
2.35d 16 0 0 0 65536 67108864 1659 1659 peering 24'16 243'98 [20,18,14] [20,18,14] 0'0 2011-03-31 10:50:51.514923
root@amd:~#
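For reference, a minimal Python sketch of pulling the PG id, state, and up/acting sets out of rows like the ones above. The column positions (state at index 9, up and acting sets at indexes 12 and 13) are inferred from these three sample rows only, not from any documented `ceph pg dump` format, and the helper names `parse_pg_line` and `pgs_in_state` are hypothetical.

```python
import re

# Assumption: a PG id looks like "<pool>.<hex>", e.g. "3.44b".
PGID_RE = re.compile(r"^\d+\.[0-9a-f]+$")


def parse_osd_set(token):
    """Turn a token like "[20,15,0]" into [20, 15, 0]; "[]" becomes []."""
    return [int(x) for x in token.strip("[]").split(",") if x]


def parse_pg_line(line):
    """Extract the fields of interest from one whitespace-separated PG row.

    Assumed positions (taken from the sample rows, not from Ceph docs):
    0 = pgid, 9 = state, 12 = up set, 13 = acting set.
    """
    fields = line.split()
    return {
        "pgid": fields[0],
        "state": fields[9],
        "up": parse_osd_set(fields[12]),
        "acting": parse_osd_set(fields[13]),
    }


def pgs_in_state(lines, wanted):
    """Return parsed rows whose '+'-joined state contains `wanted`."""
    rows = []
    for line in lines:
        fields = line.split()
        # Skip prompts, headers, and anything that is not a PG row.
        if len(fields) < 14 or not PGID_RE.match(fields[0]):
            continue
        row = parse_pg_line(line)
        if wanted in row["state"].split("+"):
            rows.append(row)
    return rows


sample = [
    "root@amd:~# ceph pg dump -o -|grep peering",
    "1.7a8 5 0 0 0 5 1025 495 495 peering 26'5 236'110 [20,16,2] [20,16,2] 0'0 2011-03-31 10:49:18.484866",
    "3.44b 23 0 0 0 94208 96468992 2548 2548 peering 25'132 234'203 [20,15,0] [20,15,0] 0'0 2011-03-31 10:55:54.624271",
    "2.35d 16 0 0 0 65536 67108864 1659 1659 peering 24'16 243'98 [20,18,14] [20,18,14] 0'0 2011-03-31 10:50:51.514923",
]

for row in pgs_in_state(sample, "peering"):
    print(row["pgid"], row["state"], "acting:", row["acting"])
```

Run on the three rows above, this makes it easy to see that osd20 is the first entry of every acting set.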
To gather more information I restarted osd20 with 'debug osd = 20', to trigger the peering process again.
When this occurred, osd15 crashed with the following backtrace:
(gdb) bt
#0  0x00007fea4cbcc7bb in raise () from /lib/libpthread.so.0
#1  0x000000000061ec33 in reraise_fatal (signum=5929) at common/signal.cc:63
#2  0x000000000061f95b in handle_fatal_signal (signum=6) at common/signal.cc:110
#3  <signal handler called>
#4  0x00007fea4b79ca75 in raise () from /lib/libc.so.6
#5  0x00007fea4b7a05c0 in abort () from /lib/libc.so.6
#6  0x00007fea4c0528e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#7  0x00007fea4c050d16 in ?? () from /usr/lib/libstdc++.so.6
#8  0x00007fea4c050d43 in std::terminate() () from /usr/lib/libstdc++.so.6
#9  0x00007fea4c050e3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#10 0x0000000000606fba in ceph::__ceph_assert_fail (assertion=<value optimized out>, file=<value optimized out>, line=<value optimized out>, func=0x647fe0 "void PG::do_peer(ObjectStore::Transaction&, std::list<Context*, std::allocator<Context*> >&, std::map<int, std::map<pg_t, PG::Query, std::less<pg_t>, std::allocator<std::pair<const pg_t, PG::Query> > "...) at common/assert.cc:86
#11 0x0000000000572ca3 in PG::do_peer (this=0x2de2000, t=<value optimized out>, tfin=<value optimized out>, query_map=<value optimized out>, activator_map=<value optimized out>) at osd/PG.cc:1656
#12 0x0000000000519ff2 in OSD::handle_pg_notify (this=0x258d000, m=0x2e86a80) at osd/OSD.cc:4168
#13 0x000000000051ad4d in OSD::_dispatch (this=0x258d000, m=0x2e86a80) at osd/OSD.cc:2569
#14 0x000000000051b7df in OSD::ms_dispatch (this=0x258d000, m=0x2e86a80) at osd/OSD.cc:2388
#15 0x0000000000473453 in Messenger::ms_deliver_dispatch (this=0x258aa00) at msg/Messenger.h:98
#16 SimpleMessenger::dispatch_entry (this=0x258aa00) at msg/SimpleMessenger.cc:352
#17 0x000000000046a26c in SimpleMessenger::DispatchThread::entry (this=0x258ae88) at msg/SimpleMessenger.h:533
#18 0x00007fea4cbc39ca in start_thread () from /lib/libpthread.so.0
#19 0x00007fea4b84f70d in clone () from /lib/libc.so.6
#20 0x0000000000000000 in ?? ()
(gdb)
I tried restarting osd20 multiple times to make osd15 crash again, but it did not.
Remote access to the machines is possible from logger.ceph.widodh.nl
Logging is done via remote syslog to noisy.ceph.widodh.nl:/var/log/remote/ceph/osd.log
Those three PGs are in crashed+peering right now. I haven't been able to make anything of the logs.
#2 Updated by Sage Weil over 8 years ago
- Status changed from New to In Progress
- Assignee set to Sage Weil
First I saw 1.7a8 not peering because osd.16 wasn't sending a response. I restarted osd.16 with logging enabled... 1.7a8 peered but then osd.16 crashed on activate because last_update was 0'0 (shouldn't happen). No logs, so not sure why...
I pushed a wido_osd_fix branch that works around that assert. Need to restart with that fix and see where we end up!
#3 Updated by Wido den Hollander over 8 years ago
Ok, it took a while to get there, but after about 45 mins it got to:
pg v73145: 10608 pgs: 10607 active+clean, 1 crashed+peering; 1637 GB data, 5051 GB used, 69403 GB / 74520 GB avail
The broken PG:
3.44b 23 0 0 0 94208 96468992 2548 2548 crashed+peering 25'132 257'312 [20,15,0] [20,15,0] 0'0 2011-03-31 10:55:54.624271
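When watching recovery progress like this, it can help to pull the per-state counts out of the summary line automatically. The following is a minimal Python sketch that assumes the summary always has the shape `pg v<N>: <total> pgs: <count> <state>, ...; ...` as in the line quoted above; the function name `parse_pg_summary` is hypothetical and the format is inferred from this one line only.

```python
import re

# Assumption: the summary starts with "pg v<version>: <total> pgs: " and the
# per-state counts run up to the first ';'.
SUMMARY_RE = re.compile(r"pg v(\d+): (\d+) pgs: ([^;]+);")


def parse_pg_summary(line):
    """Return the map version, total PG count, and a {state: count} dict."""
    m = SUMMARY_RE.match(line.strip())
    if m is None:
        raise ValueError("unrecognised pg summary line: %r" % line)
    states = {}
    for part in m.group(3).split(","):
        count, state = part.split()
        states[state] = int(count)
    return {
        "version": int(m.group(1)),
        "total": int(m.group(2)),
        "states": states,
    }


summary = parse_pg_summary(
    "pg v73145: 10608 pgs: 10607 active+clean, 1 crashed+peering; "
    "1637 GB data, 5051 GB used, 69403 GB / 74520 GB avail"
)

# Everything that is not active+clean still needs attention.
not_clean = {s: n for s, n in summary["states"].items() if s != "active+clean"}
print(not_clean)
```

For the summary above, the only entry left in `not_clean` is the single crashed+peering PG.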
#4 Updated by Sage Weil over 8 years ago
- Status changed from In Progress to Resolved
When I went to look at this the pg was waiting for log+missing on osd0. There was no logging, so I restarted osd0 with logging and then it repeered successfully.
I see the bug that triggered this. Should be fixed by b0f817ac19d432abc69659493c8d043360dbea97