Bug #1992

OSD::get_or_create_pg

Added by Wido den Hollander over 7 years ago. Updated over 7 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
-
Start date:
01/27/2012
Due date:
% Done:

0%

Spent time:
Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

I've just upgraded my 0.39 cluster to 0.40 and that didn't go so well.

The whole cluster started bouncing and eventually crashed (50% of the OSDs) with:

2012-01-27 16:11:07.278037 7f53f6e06700 -- [2a00:f10:11b:cef0:225:90ff:fe33:49a4]:0/15043 <== osd.7 [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6811/20485 170 ==== osd_ping(heartbeat e0 as_of 12473) v1 ==== 61+0+0 (3540073830 0 0) 0x52924c40 con 0x51a53640
2012-01-27 16:11:07.336807 7f53f6e06700 -- [2a00:f10:11b:cef0:225:90ff:fe33:49a4]:0/15043 <== osd.5 [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6805/20353 170 ==== osd_ping(heartbeat e0 as_of 12473) v1 ==== 61+0+0 (1234559449 0 0) 0x52316a80 con 0x51d2c640
2012-01-27 16:11:07.343286 7f53f6e06700 -- [2a00:f10:11b:cef0:225:90ff:fe33:49a4]:0/15043 <== osd.2 [2a00:f10:11b:cef0:225:90ff:fe33:49fe]:6808/22634 172 ==== osd_ping(heartbeat e0 as_of 12473) v1 ==== 61+0+0 (1234559449 0 0) 0x529a6e00 con 0x512f9140
2012-01-27 16:11:07.455950 7f53f6e06700 -- [2a00:f10:11b:cef0:225:90ff:fe33:49a4]:0/15043 <== osd.3 [2a00:f10:11b:cef0:225:90ff:fe33:49fe]:6811/22821 173 ==== osd_ping(heartbeat e0 as_of 12473) v1 ==== 61+0+0 (1234559449 0 0) 0x4b611c40 con 0x51a5d640
2012-01-27 16:11:07.474723 7f53f6e06700 -- [2a00:f10:11b:cef0:225:90ff:fe33:49a4]:0/15043 <== osd.6 [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6808/20419 176 ==== osd_ping(heartbeat e0 as_of 12473) v1 ==== 61+0+0 (1234559449 0 0) 0x14ec0e00 con 0x512f98c0
2012-01-27 16:11:07.500584 7f53f7607700 -- [2a00:f10:11b:cef0:225:90ff:fe33:49a4]:6804/15042 <== osd.24 [2a00:f10:11b:cef0:225:90ff:fe33:49ca]:6801/28963 32 ==== pg_log(2.ee epoch 12531 query_epoch 12531) v2 ==== 779+0+0 (2196340623 0 0) 0x9c95b00 con 0x55b0a500
osd/OSD.cc: In function 'PG* OSD::get_or_create_pg(const PG::Info&, epoch_t, int, int&, bool, ObjectStore::Transaction**, C_Contexts**)', in thread '7f53f7607700'
osd/OSD.cc: 1242: FAILED assert(!info.dne())
 ceph version 0.40 (commit:7eea40ea37fb3a68a2042a2218c9b8c9c40a843e)
 1: (OSD::get_or_create_pg(PG::Info const&, unsigned int, int, int&, bool, ObjectStore::Transaction**, C_Contexts**)+0xbb1) [0x54b2d1]
 2: (OSD::handle_pg_log(MOSDPGLog*)+0x1d0) [0x54bae0]
 3: (OSD::_dispatch(Message*)+0x5c8) [0x553c98]
 4: (OSD::ms_dispatch(Message*)+0x11e) [0x5549de]
 5: (SimpleMessenger::dispatch_entry()+0x84b) [0x5bc0db]
 6: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4b237c]
 7: (()+0x7efc) [0x7f5403ae6efc]
 8: (clone()+0x6d) [0x7f540211789d]
 ceph version 0.40 (commit:7eea40ea37fb3a68a2042a2218c9b8c9c40a843e)
 1: (OSD::get_or_create_pg(PG::Info const&, unsigned int, int, int&, bool, ObjectStore::Transaction**, C_Contexts**)+0xbb1) [0x54b2d1]
 2: (OSD::handle_pg_log(MOSDPGLog*)+0x1d0) [0x54bae0]
 3: (OSD::_dispatch(Message*)+0x5c8) [0x553c98]
 4: (OSD::ms_dispatch(Message*)+0x11e) [0x5549de]
 5: (SimpleMessenger::dispatch_entry()+0x84b) [0x5bc0db]
 6: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4b237c]
 7: (()+0x7efc) [0x7f5403ae6efc]
 8: (clone()+0x6d) [0x7f540211789d]
*** Caught signal (Aborted) **
 in thread 7f53f7607700
 ceph version 0.40 (commit:7eea40ea37fb3a68a2042a2218c9b8c9c40a843e)
 1: /usr/bin/ceph-osd() [0x5fd926]
 2: (()+0x10060) [0x7f5403aef060]
 3: (gsignal()+0x35) [0x7f540206c3a5]
 4: (abort()+0x17b) [0x7f540206fb0b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f540292ad7d]
 6: (()+0xb9f26) [0x7f5402928f26]
 7: (()+0xb9f53) [0x7f5402928f53]
 8: (()+0xba04e) [0x7f540292904e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x193) [0x5cfd33]
 10: (OSD::get_or_create_pg(PG::Info const&, unsigned int, int, int&, bool, ObjectStore::Transaction**, C_Contexts**)+0xbb1) [0x54b2d1]
 11: (OSD::handle_pg_log(MOSDPGLog*)+0x1d0) [0x54bae0]
 12: (OSD::_dispatch(Message*)+0x5c8) [0x553c98]
 13: (OSD::ms_dispatch(Message*)+0x11e) [0x5549de]
 14: (SimpleMessenger::dispatch_entry()+0x84b) [0x5bc0db]
 15: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4b237c]
 16: (()+0x7efc) [0x7f5403ae6efc]
 17: (clone()+0x6d) [0x7f540211789d]

Eventually all OSDs went down.

Anything to test?

History

#1 Updated by Greg Farnum over 7 years ago

The assert here is because the PG doesn't exist yet but the OSD is not the primary for that PG. It's getting into get_or_create_pg because it's getting a PGLog...anybody have ideas?

#2 Updated by Greg Farnum over 7 years ago

Er, actually, the OSD is getting an MOSDPGLog with info DNE (presumably uninitialized). That appears to be non-kosher. Wido, are logs available?
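
To make that concrete, here is a tiny stand-alone toy, not the real osd/OSD.cc code: the Info struct, its dne() body, and the epoch_created field with its "zero means never created" convention are illustrative assumptions based only on the assert message above. A default-constructed, "does not exist" info hitting the create path trips exactly this kind of assert.

// Toy illustration only -- not the actual Ceph implementation.
#include <cassert>

// Illustrative stand-in for PG::Info: dne() ("does not exist") reports a
// PG that was never created; here that is modelled as a zero creation epoch.
struct Info {
  unsigned epoch_created = 0;               // 0 => never created (assumption)
  bool dne() const { return epoch_created == 0; }
};

// Stand-in for the create path: a peer should only ask us to instantiate
// a PG it actually has, so an info claiming the PG doesn't exist is fatal.
void get_or_create_pg(const Info &info) {
  assert(!info.dne());                      // the check that fires above
  // ... look up or create the PG ...
}

int main() {
  Info uninitialized;                       // the suspect payload: info DNE
  get_or_create_pg(uninitialized);          // aborts, matching the backtrace
}

So if the sending OSD really shipped an uninitialized info in its pg_log message, the receiving OSD has nothing sane to do with it and aborts right there.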

#3 Updated by Wido den Hollander over 7 years ago

I see what might have gone wrong. I built the latest master a couple of days ago, ran the OSDs with that code for about 2 hours and then went to 0.40, which I should have done earlier.

The other problem is that most of my OSDs were stuck in status D, and forcing a hard reboot (after a sysrq emergency sync) broke almost all of my btrfs filesystems (open_ctree failed...)

I hope I can get these filesystems fixed to see whether 0.41 will work.

#4 Updated by Sage Weil over 7 years ago

What version of btrfs are you running? Have you tried the latest code?

There is a mount -o recover option for btrfs, but it's not foolproof.. the fs you get back may behave strangely. :/

#5 Updated by Wido den Hollander over 7 years ago

I was running the stock 3.0 kernel from Ubuntu 11.10.

I tried with the latest ceph-client code (saw your post about the btrfs code) and the latest btrfs-progs. A discussion on the #btrfs channel revealed that I might get the data out of the filesystems with the restore tool, but mounting a filesystem in this state seems to be pretty hard.. I think I'll have to start all over again :)

#6 Updated by Sage Weil over 7 years ago

  • Status changed from New to Can't reproduce

Hmm, we haven't been able to trigger this with our thrashing.
