Bug #1992

closed

OSD::get_or_create_pg

Added by Wido den Hollander about 12 years ago. Updated about 12 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%


Description

I've just upgraded my 0.39 cluster to 0.40 and that didn't go that well.

The whole cluster started bouncing and eventually crashed (50% of the OSDs) with:

2012-01-27 16:11:07.278037 7f53f6e06700 -- [2a00:f10:11b:cef0:225:90ff:fe33:49a4]:0/15043 <== osd.7 [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6811/20485 170 ==== osd_ping(heartbeat e0 as_of 12473) v1 ==== 61+0+0 (3540073830 0 0) 0x52924c40 con 0x51a53640
2012-01-27 16:11:07.336807 7f53f6e06700 -- [2a00:f10:11b:cef0:225:90ff:fe33:49a4]:0/15043 <== osd.5 [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6805/20353 170 ==== osd_ping(heartbeat e0 as_of 12473) v1 ==== 61+0+0 (1234559449 0 0) 0x52316a80 con 0x51d2c640
2012-01-27 16:11:07.343286 7f53f6e06700 -- [2a00:f10:11b:cef0:225:90ff:fe33:49a4]:0/15043 <== osd.2 [2a00:f10:11b:cef0:225:90ff:fe33:49fe]:6808/22634 172 ==== osd_ping(heartbeat e0 as_of 12473) v1 ==== 61+0+0 (1234559449 0 0) 0x529a6e00 con 0x512f9140
2012-01-27 16:11:07.455950 7f53f6e06700 -- [2a00:f10:11b:cef0:225:90ff:fe33:49a4]:0/15043 <== osd.3 [2a00:f10:11b:cef0:225:90ff:fe33:49fe]:6811/22821 173 ==== osd_ping(heartbeat e0 as_of 12473) v1 ==== 61+0+0 (1234559449 0 0) 0x4b611c40 con 0x51a5d640
2012-01-27 16:11:07.474723 7f53f6e06700 -- [2a00:f10:11b:cef0:225:90ff:fe33:49a4]:0/15043 <== osd.6 [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6808/20419 176 ==== osd_ping(heartbeat e0 as_of 12473) v1 ==== 61+0+0 (1234559449 0 0) 0x14ec0e00 con 0x512f98c0
2012-01-27 16:11:07.500584 7f53f7607700 -- [2a00:f10:11b:cef0:225:90ff:fe33:49a4]:6804/15042 <== osd.24 [2a00:f10:11b:cef0:225:90ff:fe33:49ca]:6801/28963 32 ==== pg_log(2.ee epoch 12531 query_epoch 12531) v2 ==== 779+0+0 (2196340623 0 0) 0x9c95b00 con 0x55b0a500
osd/OSD.cc: In function 'PG* OSD::get_or_create_pg(const PG::Info&, epoch_t, int, int&, bool, ObjectStore::Transaction**, C_Contexts**)', in thread '7f53f7607700'
osd/OSD.cc: 1242: FAILED assert(!info.dne())
 ceph version 0.40 (commit:7eea40ea37fb3a68a2042a2218c9b8c9c40a843e)
 1: (OSD::get_or_create_pg(PG::Info const&, unsigned int, int, int&, bool, ObjectStore::Transaction**, C_Contexts**)+0xbb1) [0x54b2d1]
 2: (OSD::handle_pg_log(MOSDPGLog*)+0x1d0) [0x54bae0]
 3: (OSD::_dispatch(Message*)+0x5c8) [0x553c98]
 4: (OSD::ms_dispatch(Message*)+0x11e) [0x5549de]
 5: (SimpleMessenger::dispatch_entry()+0x84b) [0x5bc0db]
 6: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4b237c]
 7: (()+0x7efc) [0x7f5403ae6efc]
 8: (clone()+0x6d) [0x7f540211789d]
 ceph version 0.40 (commit:7eea40ea37fb3a68a2042a2218c9b8c9c40a843e)
 1: (OSD::get_or_create_pg(PG::Info const&, unsigned int, int, int&, bool, ObjectStore::Transaction**, C_Contexts**)+0xbb1) [0x54b2d1]
 2: (OSD::handle_pg_log(MOSDPGLog*)+0x1d0) [0x54bae0]
 3: (OSD::_dispatch(Message*)+0x5c8) [0x553c98]
 4: (OSD::ms_dispatch(Message*)+0x11e) [0x5549de]
 5: (SimpleMessenger::dispatch_entry()+0x84b) [0x5bc0db]
 6: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4b237c]
 7: (()+0x7efc) [0x7f5403ae6efc]
 8: (clone()+0x6d) [0x7f540211789d]
*** Caught signal (Aborted) **
 in thread 7f53f7607700
 ceph version 0.40 (commit:7eea40ea37fb3a68a2042a2218c9b8c9c40a843e)
 1: /usr/bin/ceph-osd() [0x5fd926]
 2: (()+0x10060) [0x7f5403aef060]
 3: (gsignal()+0x35) [0x7f540206c3a5]
 4: (abort()+0x17b) [0x7f540206fb0b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f540292ad7d]
 6: (()+0xb9f26) [0x7f5402928f26]
 7: (()+0xb9f53) [0x7f5402928f53]
 8: (()+0xba04e) [0x7f540292904e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x193) [0x5cfd33]
 10: (OSD::get_or_create_pg(PG::Info const&, unsigned int, int, int&, bool, ObjectStore::Transaction**, C_Contexts**)+0xbb1) [0x54b2d1]
 11: (OSD::handle_pg_log(MOSDPGLog*)+0x1d0) [0x54bae0]
 12: (OSD::_dispatch(Message*)+0x5c8) [0x553c98]
 13: (OSD::ms_dispatch(Message*)+0x11e) [0x5549de]
 14: (SimpleMessenger::dispatch_entry()+0x84b) [0x5bc0db]
 15: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4b237c]
 16: (()+0x7efc) [0x7f5403ae6efc]
 17: (clone()+0x6d) [0x7f540211789d]

Eventually all of the OSDs went down.

Anything to test?

Actions #1

Updated by Greg Farnum about 12 years ago

The assert here fires because the PG doesn't exist yet and the OSD is not the primary for that PG. It's getting into get_or_create_pg because it received a pg_log... anybody have ideas?

Actions #2

Updated by Greg Farnum about 12 years ago

Er, actually, the OSD is receiving an MOSDPGLog whose info is DNE (presumably uninitialized). That appears to be non-kosher. Wido, are logs available?
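To make the failure mode concrete, here is a minimal, hypothetical C++ sketch of the two options being discussed: the current behavior (get_or_create_pg asserting !info.dne() at osd/OSD.cc:1242) versus defensively dropping a pg_log message whose embedded info was never initialized. The pg_info struct, the epoch_created field, and the handler signature below are assumptions modelled on the stack trace, not the real Ceph definitions.

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical, simplified model of the PG info carried in an MOSDPGLog.
// In this sketch, an epoch_created of 0 means the PG "does not exist"
// (dne), i.e. the info block was never initialized.
struct pg_info {
    uint32_t epoch_created = 0;
    bool dne() const { return epoch_created == 0; }
};

// Defensive alternative to the failing path: instead of asserting
// !info.dne() inside get_or_create_pg and crashing the OSD, the pg_log
// handler could reject the malformed message and keep running.
bool handle_pg_log(const pg_info& info) {
    if (info.dne()) {
        std::cerr << "ignoring pg_log with uninitialized (DNE) info\n";
        return false;  // drop the message instead of aborting
    }
    // ... normal path: look up or create the PG for this info ...
    return true;
}
```

Whether dropping the message is actually safe depends on why the sender emitted uninitialized info in the first place, which is the open question in this ticket.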

Actions #3

Updated by Wido den Hollander about 12 years ago

I see what might have gone wrong. I built the latest master a couple of days ago, ran the OSDs with that code for about two hours, and then went to 0.40, which I should have done earlier.

The other problem is that most of my OSDs were stuck in D state, and forcing a hard reboot (after a sysrq emergency sync) left almost all of my btrfs filesystems broken (open_ctree failed...)

I hope I can get these filesystems fixed so I can see whether 0.41 will work.

Actions #4

Updated by Sage Weil about 12 years ago

What version of btrfs are you running? Have you tried the latest code?

There is a mount -o recovery option for btrfs, but it's not foolproof... the fs you get back may behave strangely. :/

Actions #5

Updated by Wido den Hollander about 12 years ago

I was running the stock 3.0 kernel from Ubuntu 11.10.

I tried with the latest ceph-client code (saw your post about the btrfs code) and the latest btrfs-progs. A discussion on the #btrfs channel revealed that I might get the data out of the filesystems with the restore tool, but mounting a filesystem in this state seems to be pretty hard.. I think I'll have to start all over again :)

Actions #6

Updated by Sage Weil about 12 years ago

  • Status changed from New to Can't reproduce

Hmm, we haven't been able to trigger this with our thrashing.
