Bug #967: osd: PG::do_peer crash when restarting other OSD in PG

closed

Added by Wido den Hollander about 13 years ago. Updated about 13 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

Sage Weil

Category:

OSD

Target version:

v0.27

Spent time:

1:00 h

Description

This morning I pulled out the machine hosting osd20, osd21, osd22 and osd23. After bringing this machine back, three PGs stayed in peering:

root@amd:~# ceph pg dump -o -|grep peering
1.7a8    5    0    0    0    5    1025    495    495    peering    26'5    236'110    [20,16,2]    [20,16,2]    0'0    2011-03-31 10:49:18.484866
3.44b    23    0    0    0    94208    96468992    2548    2548    peering    25'132    234'203    [20,15,0]    [20,15,0]    0'0    2011-03-31 10:55:54.624271
2.35d    16    0    0    0    65536    67108864    1659    1659    peering    24'16    243'98    [20,18,14]    [20,18,14]    0'0    2011-03-31 10:50:51.514923
root@amd:~#
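
A quick way to see which OSDs the stuck PGs have in common is to parse the rows above. The following is a minimal Python sketch, not part of Ceph: the helper names (`parse_pg_line`, `stuck_pgs`, `osd_counts`) are hypothetical, and the column positions (state in the 10th field, up and acting sets in the 13th and 14th) are inferred only from the three rows quoted in this report, so they may not hold for other Ceph versions.

```python
from collections import Counter

# The three rows quoted in this report, with whitespace normalised.
SAMPLE = """
1.7a8 5 0 0 0 5 1025 495 495 peering 26'5 236'110 [20,16,2] [20,16,2] 0'0 2011-03-31 10:49:18.484866
3.44b 23 0 0 0 94208 96468992 2548 2548 peering 25'132 234'203 [20,15,0] [20,15,0] 0'0 2011-03-31 10:55:54.624271
2.35d 16 0 0 0 65536 67108864 1659 1659 peering 24'16 243'98 [20,18,14] [20,18,14] 0'0 2011-03-31 10:50:51.514923
"""


def parse_osd_set(token):
    """Turn a token such as '[20,16,2]' into [20, 16, 2]."""
    return [int(x) for x in token.strip("[]").split(",") if x]


def parse_pg_line(line):
    """Parse one row of `ceph pg dump` output, or return None.

    Assumption: field positions are taken from the sample rows only.
    """
    tokens = line.split()
    if len(tokens) < 14:
        return None  # blank line, shell prompt, or some other non-row line
    return {
        "pgid": tokens[0],
        "state": tokens[9],
        "up": parse_osd_set(tokens[12]),
        "acting": parse_osd_set(tokens[13]),
    }


def stuck_pgs(lines, needle="peering"):
    """Return the parsed rows whose '+'-separated state contains `needle`."""
    pgs = []
    for line in lines:
        pg = parse_pg_line(line)
        if pg is not None and needle in pg["state"].split("+"):
            pgs.append(pg)
    return pgs


def osd_counts(pgs):
    """Count how often each OSD id appears in the acting sets."""
    counts = Counter()
    for pg in pgs:
        counts.update(pg["acting"])
    return counts


if __name__ == "__main__":
    stuck = stuck_pgs(SAMPLE.splitlines())
    for osd, n in osd_counts(stuck).most_common():
        print(f"osd{osd}: in {n} stuck PG(s)")
```

On the three rows above, osd20 is the only OSD that appears in every acting set; the other OSDs each appear once.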

To gather more information, I restarted osd20 with 'debug osd = 20' to trigger the peering process again.

When this occurred, osd15 crashed with the following backtrace:

(gdb) bt
#0  0x00007fea4cbcc7bb in raise () from /lib/libpthread.so.0
#1  0x000000000061ec33 in reraise_fatal (signum=5929) at common/signal.cc:63
#2  0x000000000061f95b in handle_fatal_signal (signum=6) at common/signal.cc:110
#3  <signal handler called>
#4  0x00007fea4b79ca75 in raise () from /lib/libc.so.6
#5  0x00007fea4b7a05c0 in abort () from /lib/libc.so.6
#6  0x00007fea4c0528e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#7  0x00007fea4c050d16 in ?? () from /usr/lib/libstdc++.so.6
#8  0x00007fea4c050d43 in std::terminate() () from /usr/lib/libstdc++.so.6
#9  0x00007fea4c050e3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#10 0x0000000000606fba in ceph::__ceph_assert_fail (assertion=<value optimized out>, file=<value optimized out>, line=<value optimized out>, 
    func=0x647fe0 "void PG::do_peer(ObjectStore::Transaction&, std::list<Context*, std::allocator<Context*> >&, std::map<int, std::map<pg_t, PG::Query, std::less<pg_t>, std::allocator<std::pair<const pg_t, PG::Query> > "...) at common/assert.cc:86
#11 0x0000000000572ca3 in PG::do_peer (this=0x2de2000, t=<value optimized out>, tfin=<value optimized out>, query_map=<value optimized out>, 
    activator_map=<value optimized out>) at osd/PG.cc:1656
#12 0x0000000000519ff2 in OSD::handle_pg_notify (this=0x258d000, m=0x2e86a80) at osd/OSD.cc:4168
#13 0x000000000051ad4d in OSD::_dispatch (this=0x258d000, m=0x2e86a80) at osd/OSD.cc:2569
#14 0x000000000051b7df in OSD::ms_dispatch (this=0x258d000, m=0x2e86a80) at osd/OSD.cc:2388
#15 0x0000000000473453 in Messenger::ms_deliver_dispatch (this=0x258aa00) at msg/Messenger.h:98
#16 SimpleMessenger::dispatch_entry (this=0x258aa00) at msg/SimpleMessenger.cc:352
#17 0x000000000046a26c in SimpleMessenger::DispatchThread::entry (this=0x258ae88) at msg/SimpleMessenger.h:533
#18 0x00007fea4cbc39ca in start_thread () from /lib/libpthread.so.0
#19 0x00007fea4b84f70d in clone () from /lib/libc.so.6
#20 0x0000000000000000 in ?? ()
(gdb)

I tried restarting osd20 multiple times to make osd15 crash again, but it didn't.

Remote access to the machines is possible from logger.ceph.widodh.nl

osd20: root@atom5.ceph.widodh.nl
osd15: root@atom3.ceph.widodh.nl

Logging is done via remote syslog to noisy.ceph.widodh.nl:/var/log/remote/ceph/osd.log

Those three PGs are in crashed+peering right now; I haven't been able to make anything out from the logs.

Updated by Sage Weil about 13 years ago

Target version set to v0.27

Updated by Sage Weil about 13 years ago

Status changed from New to In Progress
Assignee set to Sage Weil

First I saw 1.7a8 not peering because osd.16 wasn't sending a response. I restarted osd.16 with logging enabled... 1.7a8 peered but then osd.16 crashed on activate because last_update was 0'0 (shouldn't happen). No logs, so not sure why...

I pushed a wido_osd_fix branch that works around that assert. Need to restart with that fix and see where we end up!

Updated by Wido den Hollander about 13 years ago

Ok, it took a while to get there, but after about 45 mins it got to:

pg v73145: 10608 pgs: 10607 active+clean, 1 crashed+peering; 1637 GB data, 5051 GB used, 69403 GB / 74520 GB avail

The broken PG:

3.44b    23    0    0    0    94208    96468992    2548    2548    crashed+peering    25'132    257'312    [20,15,0]    [20,15,0]    0'0    2011-03-31 10:55:54.624271
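
To track how many PGs are still not active+clean while waiting for recovery, the summary line above can be parsed. This is only a sketch in Python: `parse_pg_summary` is a hypothetical helper, and the regular expression is fitted to the single status line quoted in this comment, so the layout is an assumption rather than a documented format.

```python
import re

# Matches the leading "pg v<version>: <total> pgs: <count state, ...>;" part.
SUMMARY_RE = re.compile(r"pg v(\d+): (\d+) pgs: ([^;]+);")


def parse_pg_summary(line):
    """Return (map version, total PG count, {state: count}) from a status line."""
    m = SUMMARY_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognised pg summary line: {line!r}")
    version, total = int(m.group(1)), int(m.group(2))
    states = {}
    for part in m.group(3).split(","):
        count, state = part.split()
        states[state] = int(count)
    return version, total, states


if __name__ == "__main__":
    line = ("pg v73145: 10608 pgs: 10607 active+clean, 1 crashed+peering; "
            "1637 GB data, 5051 GB used, 69403 GB / 74520 GB avail")
    version, total, states = parse_pg_summary(line)
    not_clean = total - states.get("active+clean", 0)
    print(f"v{version}: {not_clean} of {total} PGs not active+clean")
```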

Updated by Sage Weil about 13 years ago

Status changed from In Progress to Resolved

When I went to look at this, the PG was waiting for log+missing on osd0. There was no logging, so I restarted osd0 with logging enabled, and then it repeered successfully.

I see the bug that triggered this. Should be fixed by b0f817ac19d432abc69659493c8d043360dbea97

Updated by Sage Weil about 13 years ago

Story points set to 2
