
Bug #967

osd: PG::do_peer crash when restarting other OSD in PG

Added by Wido den Hollander about 8 years ago. Updated about 8 years ago.

Status: Resolved
Priority: Normal
Category: OSD
Start date: 04/01/2011
% Done: 0%
Regression: No
Severity: 3 - minor

Description

This morning I pulled out the machine hosting osd20, 21, 22 and 23. After bringing this machine back, three PGs stayed in peering:

root@amd:~# ceph pg dump -o -|grep peering
1.7a8    5    0    0    0    5    1025    495    495    peering    26'5    236'110    [20,16,2]    [20,16,2]    0'0    2011-03-31 10:49:18.484866
3.44b    23    0    0    0    94208    96468992    2548    2548    peering    25'132    234'203    [20,15,0]    [20,15,0]    0'0    2011-03-31 10:55:54.624271
2.35d    16    0    0    0    65536    67108864    1659    1659    peering    24'16    243'98    [20,18,14]    [20,18,14]    0'0    2011-03-31 10:50:51.514923
root@amd:~#

To gather more information, I restarted osd20 with 'debug osd = 20' to trigger the peering process again.
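For reference, the per-daemon debug level is set in ceph.conf before restarting the daemon; a minimal sketch, assuming the daemon's section is named [osd.20] (the section name and any other settings in the real cluster's config are not shown in this report):

[osd.20]
        ; raise OSD subsystem logging to the most verbose level
        debug osd = 20

After adding this, the daemon has to be restarted for the setting to take effect.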

When this occurred, osd15 crashed with the following backtrace:

(gdb) bt
#0  0x00007fea4cbcc7bb in raise () from /lib/libpthread.so.0
#1  0x000000000061ec33 in reraise_fatal (signum=5929) at common/signal.cc:63
#2  0x000000000061f95b in handle_fatal_signal (signum=6) at common/signal.cc:110
#3  <signal handler called>
#4  0x00007fea4b79ca75 in raise () from /lib/libc.so.6
#5  0x00007fea4b7a05c0 in abort () from /lib/libc.so.6
#6  0x00007fea4c0528e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#7  0x00007fea4c050d16 in ?? () from /usr/lib/libstdc++.so.6
#8  0x00007fea4c050d43 in std::terminate() () from /usr/lib/libstdc++.so.6
#9  0x00007fea4c050e3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#10 0x0000000000606fba in ceph::__ceph_assert_fail (assertion=<value optimized out>, file=<value optimized out>, line=<value optimized out>, 
    func=0x647fe0 "void PG::do_peer(ObjectStore::Transaction&, std::list<Context*, std::allocator<Context*> >&, std::map<int, std::map<pg_t, PG::Query, std::less<pg_t>, std::allocator<std::pair<const pg_t, PG::Query> > "...) at common/assert.cc:86
#11 0x0000000000572ca3 in PG::do_peer (this=0x2de2000, t=<value optimized out>, tfin=<value optimized out>, query_map=<value optimized out>, 
    activator_map=<value optimized out>) at osd/PG.cc:1656
#12 0x0000000000519ff2 in OSD::handle_pg_notify (this=0x258d000, m=0x2e86a80) at osd/OSD.cc:4168
#13 0x000000000051ad4d in OSD::_dispatch (this=0x258d000, m=0x2e86a80) at osd/OSD.cc:2569
#14 0x000000000051b7df in OSD::ms_dispatch (this=0x258d000, m=0x2e86a80) at osd/OSD.cc:2388
#15 0x0000000000473453 in Messenger::ms_deliver_dispatch (this=0x258aa00) at msg/Messenger.h:98
#16 SimpleMessenger::dispatch_entry (this=0x258aa00) at msg/SimpleMessenger.cc:352
#17 0x000000000046a26c in SimpleMessenger::DispatchThread::entry (this=0x258ae88) at msg/SimpleMessenger.h:533
#18 0x00007fea4cbc39ca in start_thread () from /lib/libpthread.so.0
#19 0x00007fea4b84f70d in clone () from /lib/libc.so.6
#20 0x0000000000000000 in ?? ()
(gdb)

I tried restarting osd20 multiple times to trigger the osd15 crash again, but it didn't recur.

Remote access to the machines is possible from logger.ceph.widodh.nl

osd20:
osd15:

Logging is done via remote syslog to noisy.ceph.widodh.nl:/var/log/remote/ceph/osd.log

Those three PGs are in crashed+peering right now; I haven't been able to make anything of the logs.

History

#1 Updated by Sage Weil about 8 years ago

  • Target version set to v0.27
  • Position set to 341

#2 Updated by Sage Weil about 8 years ago

  • Status changed from New to In Progress
  • Assignee set to Sage Weil

First I saw 1.7a8 not peering because osd.16 wasn't sending a response. I restarted osd.16 with logging enabled... 1.7a8 peered but then osd.16 crashed on activate because last_update was 0'0 (shouldn't happen). No logs, so not sure why...

I pushed a wido_osd_fix branch that works around that assert. Need to restart with that fix and see where we end up!

#3 Updated by Wido den Hollander about 8 years ago

Ok, it took a while to get there, but after about 45 minutes it got to:

pg v73145: 10608 pgs: 10607 active+clean, 1 crashed+peering; 1637 GB data, 5051 GB used, 69403 GB / 74520 GB avail

The broken PG:

3.44b    23    0    0    0    94208    96468992    2548    2548    crashed+peering    25'132    257'312    [20,15,0]    [20,15,0]    0'0    2011-03-31 10:55:54.624271

#4 Updated by Sage Weil about 8 years ago

  • Status changed from In Progress to Resolved

When I went to look at this, the pg was waiting for log+missing on osd0. There was no logging, so I restarted osd0 with logging and then it re-peered successfully.

I see the bug that triggered this. Should be fixed by b0f817ac19d432abc69659493c8d043360dbea97

#5 Updated by Sage Weil about 8 years ago

  • Story points set to 2
  • Position deleted (347)
  • Position set to 347
