Bug #590


osd/PG.cc:1645: FAILED assert(info.last_complete >= log.tail || log.backlog)

Added by ar Fred over 13 years ago. Updated over 13 years ago.

Status: Resolved
Priority: Normal
Assignee: Sage Weil
Category: OSD
Target version: v0.24.1
% Done: 0%

Description

After upgrading to ceph 0.23, the cluster (3 osd, 3 mon, 3 non-clustered mds) worked for about 2 hours and then one computer had some problems (osd1, kernel bug in btrfs code).

I then upgraded the whole cluster (ubuntu lucid -> maverick i.e., from kernel 2.6.32 to 2.6.35) to get a newer btrfs, did a run of btrfsck on all osd (no problem detected) and when restarting the ceph cluster, got that in the logs of osd2

osd/PG.cc: In function 'void PG::activate(ObjectStore::Transaction&, std::list<Context*, std::allocator<Context*> >&, std::map<int, MOSDPGInfo*, std::less<int>, std::allocator<std::pair<const int, MOSDPGInfo*> > >*)':
osd/PG.cc:1645: FAILED assert(info.last_complete >= log.tail || log.backlog)
ceph version 0.23 (commit:)
1: (PG::activate(ObjectStore::Transaction&, std::list<Context*, std::allocator<Context*> >&, std::map<int, MOSDPGInfo*, std::less<int>, std::allocator<std::pair<int const, MOSDPGInfo*> > >*)+0x1a61) [0x52fdc1]
2: (OSD::_process_pg_info(unsigned int, int, PG::Info&, PG::Log&, PG::Missing&, std::map<int, MOSDPGInfo*, std::less<int>, std::allocator<std::pair<int const, MOSDPGInfo*> > >*, int&)+0xa8e) [0x4eaf5e]
3: (OSD::handle_pg_log(MOSDPGLog*)+0xb1) [0x4eb5a1]
4: (OSD::_dispatch(Message*)+0x40d) [0x4edb1d]
5: (OSD::do_waiters()+0xbc) [0x4edcdc]
6: (OSD::ms_dispatch(Message*)+0x48) [0x4ee428]
7: (SimpleMessenger::dispatch_entry()+0x749) [0x45b389]
8: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x454f5c]
9: (Thread::_entry_func(void*)+0xa) [0x46950a]
10: (()+0x7971) [0x7fa7a5cba971]
11: (clone()+0x6d) [0x7fa7a4a3494d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
  *** Caught signal (ABRT) ***
    ceph version 0.23 (commit:)
    1: (sigabrt_handler(int)+0x7d) [0x5d68cd]
    2: (()+0x33c20) [0x7fa7a4981c20]
    3: (gsignal()+0x35) [0x7fa7a4981ba5]
    4: (abort()+0x180) [0x7fa7a49856b0]
    5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fa7a52256bd]
    6: (()+0xb9906) [0x7fa7a5223906]
    7: (()+0xb9933) [0x7fa7a5223933]
    8: (()+0xb9a3e) [0x7fa7a5223a3e]
    9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x448) [0x5c38e8]
    10: (PG::activate(ObjectStore::Transaction&, std::list<Context*, std::allocator<Context*> >&, std::map<int, MOSDPGInfo*, std::less<int>, std::allocator<std::pair<int const, MOSDPGInfo*> > >*)+0x1a61) [0x52fdc1]
    11: (OSD::_process_pg_info(unsigned int, int, PG::Info&, PG::Log&, PG::Missing&, std::map<int, MOSDPGInfo*, std::less<int>, std::allocator<std::pair<int const, MOSDPGInfo*> > >*, int&)+0xa8e) [0x4eaf5e]
    12: (OSD::handle_pg_log(MOSDPGLog*)+0xb1) [0x4eb5a1]
    13: (OSD::_dispatch(Message*)+0x40d) [0x4edb1d]
    14: (OSD::do_waiters()+0xbc) [0x4edcdc]
    15: (OSD::ms_dispatch(Message*)+0x48) [0x4ee428]
    16: (SimpleMessenger::dispatch_entry()+0x749) [0x45b389]
    17: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x454f5c]
    18: (Thread::_entry_func(void*)+0xa) [0x46950a]
    19: (()+0x7971) [0x7fa7a5cba971]
    20: (clone()+0x6d) [0x7fa7a4a3494d]

what gdb says:
(gdb) p info.last_complete
$1 = {version = 0, epoch = 0, __pad = 0}
(gdb) p log.tail
$2 = {version = 852335, epoch = 643, __pad = 0}
(gdb) p log.backlog
$3 = false

I'll keep the core file for a while in case you need more information.
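For readers following along: eversion_t values print as epoch'version and compare by epoch first. A minimal standalone sketch of the check that fired, using the values gdb printed above (not Ceph source; eversion_t here is a simplified stand-in):

    // Simplified stand-in for Ceph's eversion_t: an (epoch, version)
    // pair ordered lexicographically, printed as epoch'version.
    #include <cassert>
    #include <tuple>

    struct eversion_t {
      unsigned epoch = 0, version = 0;
      bool operator>=(const eversion_t &o) const {
        return std::tie(epoch, version) >= std::tie(o.epoch, o.version);
      }
    };

    int main() {
      eversion_t last_complete{0, 0};    // info.last_complete ($1): 0'0
      eversion_t log_tail{643, 852335};  // log.tail ($2): 643'852335
      bool backlog = false;              // log.backlog ($3)
      // The osd/PG.cc:1645 invariant: the PG's log must reach back to
      // last_complete, or a full backlog must be present.
      assert(last_complete >= log_tail || backlog);  // fires: 0'0 < 643'852335
    }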


Files

osd.2.log (140 KB) log of the crashing osd - ar Fred, 11/18/2010 08:58 AM
osd.2.log.bz2 (3.85 MB) ar Fred, 12/01/2010 02:25 AM
osd.1.log.bz2 (1.74 MB) ar Fred, 12/06/2010 12:52 AM
osd.0.log.bz2 (1020 KB) ar Fred, 12/20/2010 02:34 AM
osd.1.log.bz2 (1.57 MB) ar Fred, 12/20/2010 02:34 AM
osd.2.log.bz2 (1.19 MB) ar Fred, 12/20/2010 02:34 AM
osd.0.log.bz2 (1.04 MB) ar Fred, 12/21/2010 02:11 AM
osd.1.log.bz2 (1.64 MB) ar Fred, 12/21/2010 02:11 AM
osd.2.log.bz2 (1.13 MB) ar Fred, 12/21/2010 02:11 AM
osd.0.log.bz2 (1.1 MB) ar Fred, 12/22/2010 03:52 AM
osd.1.log.bz2 (1.67 MB) ar Fred, 12/22/2010 03:52 AM
osd.2.log.bz2 (1.01 MB) ar Fred, 12/22/2010 03:52 AM
osd.0.log.bz2 (799 KB) ar Fred, 12/23/2010 08:26 AM
osd.1.log.bz2 (1.83 MB) ar Fred, 12/23/2010 08:26 AM
osd.2.log.bz2 (934 KB) ar Fred, 12/23/2010 08:26 AM
osd.0.log.1.gz (1.49 MB) ar Fred, 12/23/2010 11:43 PM
osd.1.log.1.gz (1.75 MB) ar Fred, 12/23/2010 11:43 PM
osd.2.log.1.gz (1.53 MB) ar Fred, 12/23/2010 11:43 PM
Actions #1

Updated by ar Fred over 13 years ago

Actions #2

Updated by Sage Weil over 13 years ago

Need to look at this more closely. Fred, pretty sure no data is lost here, but the recovery code needs some fixing.

Relevant bits are the primary activate() code:


    m = new MOSDPGLog(osd->osdmap->get_epoch(), info);
    if (pi.last_update < log.tail) {
      // summary/backlog
      assert(log.backlog);
      m->log = log;
    } else {
      // incremental log
      assert(pi.last_update < info.last_update);
      m->log.copy_after(log, pi.last_update);
    }

Versus read_log, which generates missing based on last_complete:
  // build missing
  if (info.last_complete < info.last_update) {
    dout(10) << "read_log checking for missing items over interval (" << info.last_complete
         << "," << info.last_update << "]" << dendl;

Basically, the invariant is that the log on any machine needs to extend from last_complete to last_update. So everything from last_complete needs to be sent to the replica, I'm pretty sure. This needs to be thought about more carefully, though..
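A sketch of what that would mean for the branch quoted above (illustrative only, not a proposed patch: the incremental log is cut at the replica's last_complete instead of its last_update; pi is the peer's Info as in the snippet):

    // Illustrative variant of the activate() branch above, keyed on the
    // replica's last_complete so its log ends up covering
    // (last_complete, last_update] as the invariant requires.
    m = new MOSDPGLog(osd->osdmap->get_epoch(), info);
    if (pi.last_complete < log.tail) {
      // our log doesn't reach back far enough for this replica; only a
      // full summary/backlog can cover the gap
      assert(log.backlog);
      m->log = log;
    } else {
      // incremental log, starting from everything the replica may miss
      m->log.copy_after(log, pi.last_complete);
    }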

Actions #3

Updated by Colin McCabe over 13 years ago

Another potential issue that I can see here is that the code in OSD::_process_pg_info doesn't check whether it got a Log before calling

pg->merge_log(*t, info, log, from);
pg->activate(*t, fin->contexts, info_map);

For example, if the message coming in is just an Info message, it would have been invoked like this:

PG::Log empty_log;
...
_process_pg_info(m->get_epoch(), from, *p, empty_log, NULL, &info_map, created);

So potentially, we're trying to activate before getting the information we really need to do so. I'll have to think about what to change here.

Actions #4

Updated by Sage Weil over 13 years ago

Colin McCabe wrote:

    Another potential issue that I can see here is that the code in OSD::_process_pg_info doesn't check whether it got a Log before calling

    pg->merge_log(*t, info, log, from);
    pg->activate(*t, fin->contexts, info_map);

    For example, if the message coming in is just an Info message, it would have been invoked like this:

    PG::Log empty_log;
    ...
    _process_pg_info(m->get_epoch(), from, *p, empty_log, NULL, &info_map, created);

    So potentially, we're trying to activate before getting the information we really need to do so. I'll have to think about what to change here.

The replicas only ever get messages from the primary, and the primary sends a log to activate. Never anything else before that. After it's active, it may send an info, but that's it.

Actions #5

Updated by Colin McCabe over 13 years ago

    The replicas only ever get messages from the primary, and the primary sends a log to activate. Never anything else before that. After it's active, it may send an info, but that's it.

Oh, ok. That makes it clearer.

In order to avoid getting confused in the future, I'll put in an assert there.

    Basically, the invariant is that the log on any machine needs to extend from last_complete to last_update. So everything from last_complete needs to be sent to the replica, I'm pretty sure. This needs to be thought about more carefully, though..

I still need to think through some issues related to last_complete and lost objects...
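For reference, the check Colin describes is presumably the one that later shows up in the crash reports at osd/OSD.cc:3982, i.e. something along these lines around the call sites quoted in comment #3:

    // Sketch: after merging the incoming log, the local log must reach
    // back to last_complete (or carry a backlog) before we can activate.
    pg->merge_log(*t, info, log, from);
    assert(pg->log.tail <= pg->info.last_complete || pg->log.backlog);
    pg->activate(*t, fin->contexts, info_map);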

Actions #6

Updated by Sage Weil over 13 years ago

Argh yeah I was all wrong here. The recovery code looks ok.. I think the problem is that before this the log was trimmed too much or something. Let's add an assert to PG::trim() and audit the callers.

Fred, is the cluster still in this state? Can you restart that osd with full logging on (debug osd = 20, debug ms = 1) and see if that gives us any other clues?
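The PG::trim() guard mentioned above could look something like this (a sketch under the stated invariant, not the actual patch; the signature is illustrative):

    // Sketch: trimming must never advance the log tail past
    // last_complete, or the (last_complete, last_update] coverage
    // invariant breaks for this PG.
    void PG::trim(ObjectStore::Transaction &t, eversion_t trim_to)
    {
      // callers must never ask us to trim beyond what we've completed
      assert(trim_to <= info.last_complete);
      // ... existing trimming logic ...
    }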

Actions #7

Updated by Colin McCabe over 13 years ago

  • Assignee set to Colin McCabe

I added the PG::trim assert. It seems to cause problems immediately with test_unfound.sh

The plot thickens...

Actions #8

Updated by Colin McCabe over 13 years ago

Fred, can you see if this reproduces on the latest unstable? Thanks.

-C

Actions #9

Updated by ar Fred over 13 years ago

I just tried the latest unstable: fe9fad7bea

osd log attached...

osd/OSD.cc: In function 'void OSD::_process_pg_info(epoch_t, int, PG::Info&, PG::Log&, PG::Missing*, std::map<int, MOSDPGInfo*, std::less<int>, std::allocator<std::pair<const int, MOSDPGInfo*> > >*, int&)':
osd/OSD.cc:3982: FAILED assert(pg->log.tail <= pg->info.last_complete)
ceph version 0.25~rc (commit:)
1: (OSD::_process_pg_info(unsigned int, int, PG::Info&, PG::Log&, PG::Missing*, std::map<int, MOSDPGInfo*, std::less<int>, std::allocator<std::pair<int const, MOSDPGInfo*> > >*, int&)+0xf93) [0x4ec343]
2: (OSD::handle_pg_log(MOSDPGLog*)+0xb1) [0x4ec9b1]
3: (OSD::_dispatch(Message*)+0x41d) [0x4f0a8d]
4: (OSD::do_waiters()+0xbc) [0x4f0c4c]
5: (OSD::ms_dispatch(Message*)+0x48) [0x4f13b8]
6: (SimpleMessenger::dispatch_entry()+0x759) [0x45b679]
7: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x45564c]
8: (Thread::_entry_func(void*)+0xa) [0x46adaa]
9: (()+0x7971) [0x7fb71e87c971]
10: (clone()+0x6d) [0x7fb71d84894d]

what gdb has to say:
#0 0x00007fb71d795ba5 in raise () from /lib/libc.so.6
#1 0x00007fb71d7996b0 in abort () from /lib/libc.so.6
#2 0x00007fb71e0396bd in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#3 0x00007fb71e037906 in ?? () from /usr/lib/libstdc++.so.6
#4 0x00007fb71e037933 in std::terminate() () from /usr/lib/libstdc++.so.6
#5 0x00007fb71e037a3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#6 0x00000000005cafc8 in ceph::__ceph_assert_fail (assertion=0x5fd9e0 "pg->log.tail <= pg->info.last_complete", file=<value optimized out>, line=3982, func=<value optimized out>)
at common/assert.cc:30
#7 0x00000000004ec343 in OSD::_process_pg_info (this=0x1090000, epoch=<value optimized out>, from=<value optimized out>, info=..., log=<value optimized out>,
missing=<value optimized out>, info_map=0x0, created=@0x7fb714a1d5fc) at osd/OSD.cc:3982
#8 0x00000000004ec9b1 in OSD::handle_pg_log (this=0x1090000, m=0x2f8f800) at osd/OSD.cc:4044
#9 0x00000000004f0a8d in OSD::_dispatch (this=0x1090000, m=0x2f8f800) at osd/OSD.cc:2227
#10 0x00000000004f0c4c in OSD::do_waiters (this=0x1090000) at osd/OSD.cc:2136
#11 0x00000000004f13b8 in OSD::ms_dispatch (this=0x1090000, m=0x290c000) at osd/OSD.cc:2052
#12 0x000000000045b679 in ms_deliver_dispatch (this=0x1087500) at msg/Messenger.h:97
#13 SimpleMessenger::dispatch_entry (this=0x1087500) at msg/SimpleMessenger.cc:352
#14 0x000000000045564c in SimpleMessenger::DispatchThread::entry (this=0x1087988) at msg/SimpleMessenger.h:531
#15 0x000000000046adaa in Thread::_entry_func (arg=0x3015) at ./common/Thread.h:39
#16 0x00007fb71e87c971 in start_thread () from /lib/libpthread.so.0
#17 0x00007fb71d84894d in clone () from /lib/libc.so.6
#18 0x0000000000000000 in ?? ()

(gdb) p pg->log.tail
$1 = {version = 383853, epoch = 643, __pad = 0}
(gdb) p pg->info.last_complete
$2 = {version = 0, epoch = 0, __pad = 0}

Actions #10

Updated by ar Fred over 13 years ago

Actions #11

Updated by Colin McCabe over 13 years ago

Hi Fred,

I think the assertion you're seeing here was fixed very recently by 78a14622438addcd5c337c4924cce1f67d053ee9

Can you try again with the latest from the rc branch?

Actions #12

Updated by Colin McCabe over 13 years ago

  • Status changed from New to 7
Actions #13

Updated by ar Fred over 13 years ago

this is with ceph rc 39b42b21e9805b3ec838f8682420166fede719f2

I tried to solve the ENOSPC problem by removing PGs by hand: I removed some that were marked active+clean by "ceph pg dump" when the 2 other osd were up.

I could then restart the osd, it connected to the other 2 osd, started peering, recovery, ...

It stops with this:

#0 0x00007fe2fb51fba5 in raise () from /lib/libc.so.6
#1 0x00007fe2fb5236b0 in abort () from /lib/libc.so.6
#2 0x00007fe2fbdc36bd in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#3 0x00007fe2fbdc1906 in ?? () from /usr/lib/libstdc++.so.6
#4 0x00007fe2fbdc1933 in std::terminate() () from /usr/lib/libstdc++.so.6
#5 0x00007fe2fbdc1a3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#6 0x00000000005cace8 in ceph::__ceph_assert_fail (assertion=0x5fd700 "pg->log.tail <= pg->info.last_complete || pg->log.backlog", file=<value optimized out>, line=3982,
func=<value optimized out>) at common/assert.cc:30
#7 0x00000000004ec222 in OSD::_process_pg_info (this=0x22fd000, epoch=<value optimized out>, from=<value optimized out>, info=..., log=<value optimized out>,
missing=<value optimized out>, info_map=0x0, created=@0x7fe2f28a862c) at osd/OSD.cc:3982
#8 0x00000000004ec851 in OSD::handle_pg_log (this=0x22fd000, m=0x43e7400) at osd/OSD.cc:4044
#9 0x00000000004f092d in OSD::_dispatch (this=0x22fd000, m=0x43e7400) at osd/OSD.cc:2227
#10 0x00000000004f1249 in OSD::ms_dispatch (this=0x22fd000, m=0x43e7400) at osd/OSD.cc:2050
#11 0x000000000045b679 in ms_deliver_dispatch (this=0x22f4500) at msg/Messenger.h:97
#12 SimpleMessenger::dispatch_entry (this=0x22f4500) at msg/SimpleMessenger.cc:352
#13 0x000000000045564c in SimpleMessenger::DispatchThread::entry (this=0x22f4988) at msg/SimpleMessenger.h:531
#14 0x000000000046adaa in Thread::_entry_func (arg=0xf62) at ./common/Thread.h:39
#15 0x00007fe2fc606971 in start_thread () from /lib/libpthread.so.0
#16 0x00007fe2fb5d294d in clone () from /lib/libc.so.6
#17 0x0000000000000000 in ?? ()

(gdb) p pg->log.tail
$1 = {version = 2954848, epoch = 643, __pad = 0}
(gdb) p pg->info.last_complete
$2 = {version = 2954755, epoch = 567, __pad = 0}
(gdb) p pg->log.backlog
$3 = false

tail of the osd log (please ask if you need the whole file...):
660'2955050 (660'2955049) m rb.0.2.0000000000c2/head by client7110.0:278948 2010-11-17 16:13:50.257979 indexed
660'2955051 (660'2955050) m rb.0.2.0000000000c2/head by client7110.0:278949 2010-11-17 16:13:50.258068 indexed
660'2955052 (660'2955051) m rb.0.2.0000000000c2/head by client7110.0:278952 2010-11-17 16:13:50.258370 indexed
660'2955053 (660'2955052) m rb.0.2.0000000000c2/head by client7110.0:278953 2010-11-17 16:13:50.258420 indexed
660'2955054 (660'2955053) m rb.0.2.0000000000c2/head by client7110.0:278956 2010-11-17 16:13:50.258713 indexed
660'2955055 (660'2955054) m rb.0.2.0000000000c2/head by client7110.0:278957 2010-11-17 16:13:50.258762 indexed
660'2955056 (660'2954988) m rb.0.2.000000000120/head by client7110.0:279079 2010-11-17 16:13:52.186298 indexed
660'2955057 (660'2955055) m rb.0.2.0000000000c2/head by client7110.0:280031 2010-11-17 16:14:09.682097 indexed

2010-12-03 16:29:00.261788 7fe2f28a9710 osd1 1059 pg[3.70( v 660'2955057 lc 567'2954755 (643'2954848,660'2955057] n=15 ec=2 les=1059 1051/1051/1047) [2,1] r=1 lcod 0'0 stray m=3] merge_log log(0'0,0'0] from osd2 into log(643'2954848,660'2955057]
2010-12-03 16:29:00.261800 7fe2f28a9710 osd1 1059 pg[3.70( v 660'2955057 lc 567'2954755 (643'2954848,660'2955057] n=15 ec=2 les=1059 1051/1051/1047) [2,1] r=1 lcod 0'0 stray m=3] merge_log result log(643'2954848,660'2955057] missing(3) changed=0
osd/OSD.cc: In function 'void OSD::_process_pg_info(epoch_t, int, PG::Info&, PG::Log&, PG::Missing*, std::map<int, MOSDPGInfo*, std::less<int>, std::allocator<std::pair<const int, MOSDPGInfo*> > >*, int&)':
osd/OSD.cc:3982: FAILED assert(pg->log.tail <= pg->info.last_complete || pg->log.backlog)

Actions #14

Updated by Colin McCabe over 13 years ago

Hi Fred,

When you say you removed the PGs "by hand"... does that mean you used "rm -rf" on the object store while the cosd processes were stopped, and then restarted all of them? That is what comes to mind when I read your description.

Can you attach the logs for the OSDs?

Actions #15

Updated by Colin McCabe over 13 years ago

  • Status changed from 7 to In Progress
Actions #16

Updated by ar Fred over 13 years ago

At the time, when I said "by hand", it was even worse:

As the journal was full, doing a rm -Rf while the osd was stopped would not help, because when restarting the osd, it would replay the journal, and recreate most of what I deleted, and assert again (ENOSPC). And because of the assertion, the journal was kept as is (I guess)... no way out.

So what I did was multiple rm -Rf while the cosd process was replaying the journal. That way, the journal replay could finish without hitting ENOSPC.

I did that because the PGs I deleted were active+clean on the 2 other OSDs...

Then I hit the assert in comment #13, please find the relevant log attached

thanks

Actions #17

Updated by Colin McCabe over 13 years ago

  • Status changed from In Progress to 7

Hi Fred,

We don't support deleting things manually from the object store while cosd is running.

Are you running with btrfs? If so, you should be able to enlarge your object store pretty easily by adding another disk (btrfs has built-in RAID).

We are looking into ways to avoid ENOSPC (see issue #95), and handle it more gracefully, but for now it remains a known issue.

Actions #18

Updated by Sage Weil over 13 years ago

If you're not running btrfs, you can also copy a pg directory to another disk and symlink it. Just be sure to preserve xattrs.

Actions #19

Updated by ar Fred over 13 years ago

I spent some time yesterday trying the latest rc (914f6ddebd899667b1937dfe9d5f1a94537dc500), and I still hit the same assertion (comment 13).

What I wanted to emphasize is that this problem seems to be about pg 3.70 according to the line logged just before the assertion, and this is not a pg I removed "by hand"!

I tried deleting that pg "by hand" from osd1 while ceph was stopped; it gets recreated when ceph restarts (I even deleted meta/pg[log|info]3.70, and they also get recreated with the same content).

It seems that the 3.70 pg comes from another osd during the recovery... so I'm wondering if this is related to the problem reported in comment 1 (that happened on osd2)?

thanks!

Actions #20

Updated by Sage Weil over 13 years ago

  • Target version set to v0.24.1
Actions #21

Updated by Sage Weil over 13 years ago

  • Assignee deleted (Colin McCabe)

ar Fred wrote:

    2010-12-03 16:29:00.261788 7fe2f28a9710 osd1 1059 pg[3.70( v 660'2955057 lc 567'2954755 (643'2954848,660'2955057] n=15 ec=2 les=1059 1051/1051/1047) [2,1] r=1 lcod 0'0 stray m=3] merge_log log(0'0,0'0] from osd2 into log(643'2954848,660'2955057]
    2010-12-03 16:29:00.261800 7fe2f28a9710 osd1 1059 pg[3.70( v 660'2955057 lc 567'2954755 (643'2954848,660'2955057] n=15 ec=2 les=1059 1051/1051/1047) [2,1] r=1 lcod 0'0 stray m=3] merge_log result log(643'2954848,660'2955057] missing(3) changed=0
    osd/OSD.cc: In function 'void OSD::_process_pg_info(epoch_t, int, PG::Info&, PG::Log&, PG::Missing*, std::map<int, MOSDPGInfo*, std::less<int>, std::allocator<std::pair<const int, MOSDPGInfo*> > >*, int&)':
    osd/OSD.cc:3982: FAILED assert(pg->log.tail <= pg->info.last_complete || pg->log.backlog)

Looking at this more closely, it looks like something else is going on: the primary is sending an empty log to a stray. Can you send the log from your latest crash? Both for the crashing osd AND the one sending the message (the pg 3.70 primary.. osd2 in the above log)?

I'm adding an assert to catch this craziness slightly sooner..
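That assert isn't quoted in the ticket, but a sender-side guard of roughly this shape would catch the empty log before it goes out (illustrative only; log(0'0,0'0] above means tail == head, i.e. an empty log):

    // Illustrative sender-side guard (not the actual commit): the
    // primary should never ship an empty log to a stray that isn't
    // genuinely up to date, as the merge_log lines above show happening.
    assert(m->log.head != m->log.tail || pi.last_update == info.last_update);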

Actions #22

Updated by ar Fred over 13 years ago

Hi, I produced these logs with the latest rc (b04b6f48237fcbd22716279726f827237a3eb4af).

I have not seen the primary asserting, so I'm attaching the log for each OSD.

Some gdb output after the crash of osd1:
(gdb) bt
#0 0x00007f7aff8b3ba5 in raise () from /lib/libc.so.6
#1 0x00007f7aff8b76b0 in abort () from /lib/libc.so.6
#2 0x00007f7b001576bd in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#3 0x00007f7b00155906 in ?? () from /usr/lib/libstdc++.so.6
#4 0x00007f7b00155933 in std::terminate() () from /usr/lib/libstdc++.so.6
#5 0x00007f7b00155a3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#6 0x00000000005d5378 in ceph::__ceph_assert_fail (assertion=0x607ba8 "pg->log.tail <= pg->info.last_complete || pg->log.backlog", file=<value optimized out>, line=3982, func=<value optimized out>)
at common/assert.cc:30
#7 0x00000000004e756b in OSD::_process_pg_info (this=0x1719000, epoch=<value optimized out>, from=<value optimized out>, info=..., log=<value optimized out>, missing=<value optimized out>,
info_map=0x0, created=@0x7f7af6d5362c) at osd/OSD.cc:3982
#8 0x00000000004f30a1 in OSD::handle_pg_log (this=0x1719000, m=0x34fac00) at osd/OSD.cc:4045
#9 0x00000000004f80cd in OSD::_dispatch (this=0x1719000, m=0x34fac00) at osd/OSD.cc:2227
#10 0x00000000004f89d9 in OSD::ms_dispatch (this=0x1719000, m=0x34fac00) at osd/OSD.cc:2050
#11 0x0000000000462259 in ms_deliver_dispatch (this=0x170f500) at msg/Messenger.h:97
#12 SimpleMessenger::dispatch_entry (this=0x170f500) at msg/SimpleMessenger.cc:352
#13 0x000000000045c10c in SimpleMessenger::DispatchThread::entry (this=0x170f988) at msg/SimpleMessenger.h:531
#14 0x000000000047192a in Thread::_entry_func (arg=0x6556) at ./common/Thread.h:39
#15 0x00007f7b00cca971 in start_thread () from /lib/libpthread.so.0
#16 0x00007f7aff96692d in clone () from /lib/libc.so.6
#17 0x0000000000000000 in ?? ()

(gdb) p pg->log.tail
$1 = {version = 2954848, epoch = 643, __pad = 0}
(gdb) p pg->info.last_complete
$2 = {version = 2954755, epoch = 567, __pad = 0}
(gdb) p pg->log.backlog
$3 = false

I do not really understand the new assert you just added, but maybe a quick recall of what happened to my cluster may help you:
- after upgrading ceph, the kernel on osd1 crashed (as reported in comment 1, I then upgraded all boxes to a newer kernel, fearing a btrfs-related problem)
- then osd2 crashed (cosd, not the kernel, also reported in comment 1)
- then both osd0 and osd1 tried to compensate for the loss of osd2 .. and I hit the ENOSPC problem
- and here we are, trying to restart all 3 OSDs

Maybe this stray/replica/... problem is related to osd1 crashing (getting marked out of ceph), then osd1 getting back in and then osd2 getting out... all that happening without ceph being able to reach a 100% active+clean state in-between.

Actions #23

Updated by Sage Weil over 13 years ago

Okay, I see the problem. Or at least the symptom. Please try the latest 'rc' branch, 69940e2.

I'm not certain how the replica got in this state.. it's likely due to a bug in the pg log trimming. In your case, though, the primary should be able to compensate with the above patch.

Let me know! Thanks-

Actions #24

Updated by ar Fred over 13 years ago

I just tried 180a417603552 which is v0.24.

Same crash, and no occurrence of
dout(10) << "activate peer osd" << peer << " has last_complete < log tail and no backlog, compensating" << dendl;
in my logs, although I have debug osd = 20, debug ms = 1.

Here are the new logs...

Thanks, Fred

Actions #25

Updated by Sage Weil over 13 years ago

Oh, the case above is catching it. Can you please try

diff --git a/src/osd/PG.cc b/src/osd/PG.cc
index 7175e13..4ea8aba 100644
--- a/src/osd/PG.cc
+++ b/src/osd/PG.cc
@@ -1841,7 +1841,8 @@ void PG::activate(ObjectStore::Transaction& t, list<Context*>& tfin,

       MOSDPGLog *m = 0;

-      if (pi.last_update == info.last_update) {
+      if (pi.last_update == info.last_update &&
+         pi.last_complete >= pi.log_tail) {
         // empty log
        if (activator_map) {
          dout(10) << "activate peer osd" << peer << " is up to date, queueing in pending_activators" << dendl;

I don't have time to test this carefully (just left for vacation), but from your logs I think this'll do the trick (or at least get you past the current roadblock). Let me know!

Actions #26

Updated by ar Fred over 13 years ago

Thanks for the patch, which I immediately tested:

osd2 crashes first, then osd1.

osd2:
#0 0x00007fb4b3db4ba5 in raise () from /lib/libc.so.6
#1 0x00007fb4b3db86b0 in abort () from /lib/libc.so.6
#2 0x00007fb4b46586bd in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#3 0x00007fb4b4656906 in ?? () from /usr/lib/libstdc++.so.6
#4 0x00007fb4b4656933 in std::terminate() () from /usr/lib/libstdc++.so.6
#5 0x00007fb4b4656a3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#6 0x00000000005d5508 in ceph::__ceph_assert_fail (assertion=0x60d970 "info.last_update > pi.last_update", file=<value optimized out>, line=1875, func=<value optimized out>) at common/assert.cc:30
#7 0x000000000054a1f3 in PG::activate (this=0x16af000, t=<value optimized out>, tfin=<value optimized out>, activator_map=<value optimized out>) at osd/PG.cc:1875
#8 0x000000000054ad8e in PG::do_peer (this=0x16af000, t=<value optimized out>, tfin=<value optimized out>, query_map=<value optimized out>, activator_map=<value optimized out>) at osd/PG.cc:1690
#9 0x00000000004e6cff in OSD::_process_pg_info (this=0x1231000, epoch=<value optimized out>, from=<value optimized out>, info=<value optimized out>, log=<value optimized out>,
missing=<value optimized out>, info_map=0x0, created=@0x7fb4ab25462c) at osd/OSD.cc:3971
#10 0x00000000004f30a1 in OSD::handle_pg_log (this=0x1231000, m=0x2c43800) at osd/OSD.cc:4045
#11 0x00000000004f80cd in OSD::_dispatch (this=0x1231000, m=0x2c43800) at osd/OSD.cc:2227
#12 0x00000000004f89d9 in OSD::ms_dispatch (this=0x1231000, m=0x2c43800) at osd/OSD.cc:2050
#13 0x0000000000462259 in ms_deliver_dispatch (this=0x1227500) at msg/Messenger.h:97
#14 SimpleMessenger::dispatch_entry (this=0x1227500) at msg/SimpleMessenger.cc:352
#15 0x000000000045c10c in SimpleMessenger::DispatchThread::entry (this=0x1227988) at msg/SimpleMessenger.h:531
#16 0x000000000047192a in Thread::_entry_func (arg=0x1254) at ./common/Thread.h:39
#17 0x00007fb4b51cb971 in start_thread () from /lib/libpthread.so.0
#18 0x00007fb4b3e6792d in clone () from /lib/libc.so.6
#19 0x0000000000000000 in ?? ()
(gdb) f 7
#7 0x000000000054a1f3 in PG::activate (this=0x16af000, t=<value optimized out>, tfin=<value optimized out>, activator_map=<value optimized out>) at osd/PG.cc:1875
1875 osd/PG.cc: No such file or directory.
in osd/PG.cc
(gdb) p info.last_update
$1 = {version = 2619770, epoch = 719, __pad = 0}
(gdb) p pi.last_update
$2 = {version = 2619770, epoch = 719, __pad = 0}

osd1: same crash as in latest comments above.

logs attached.

Thanks, and enjoy your time off

Actions #27

Updated by Sage Weil over 13 years ago

Looks like it got past the previous problem pg but bombed later... I forgot to test for backlog. This should do it?

diff --git a/src/osd/PG.cc b/src/osd/PG.cc
index 7175e13..4ff4823 100644
--- a/src/osd/PG.cc
+++ b/src/osd/PG.cc
@@ -1841,7 +1841,8 @@ void PG::activate(ObjectStore::Transaction& t, list<Context*>& tfin,

       MOSDPGLog *m = 0;

-      if (pi.last_update == info.last_update) {
+      if (pi.last_update == info.last_update &&
+         (pi.last_complete >= pi.log_tail && !pi.log_backlog)) {
         // empty log
        if (activator_map) {
          dout(10) << "activate peer osd" << peer << " is up to date, queueing in pending_activators" << dendl;

Actions #28

Updated by ar Fred over 13 years ago

Hi and thanks again for the patch. I guess it allowed things to go a bit further... at least I don't see the same crash. But osd0 and osd2 crashed. It may be something easy, like changing a > into >= or the other way round...

osd0

2010-12-23 11:17:30.690242 7f10bd2fb700 osd0 1270 pg[3.bc( v 660'383856 (643'383853,660'383856] n=7 ec=2 les=1263 1269/1269/1262) [0,1] r=0 lcod 0'0 mlcod 660'383856 peering] peer up [0,1], acting [0,1]
2010-12-23 11:17:30.690256 7f10bd2fb700 osd0 1270 pg[3.bc( v 660'383856 (643'383853,660'383856] n=7 ec=2 les=1263 1269/1269/1262) [0,1] r=0 lcod 0'0 mlcod 660'383856 peering] peer prior_set is 1
2010-12-23 11:17:30.690269 7f10bd2fb700 osd0 1270 pg[3.bc( v 660'383856 (643'383853,660'383856] n=7 ec=2 les=1263 1269/1269/1262) [0,1] r=0 lcod 0'0 mlcod 660'383856 peering] peer osd1 3.bc( v 660'38385
2010-12-23 11:17:30.690285 7f10bd2fb700 osd0 1270 pg[3.bc( v 660'383856 (643'383853,660'383856] n=7 ec=2 les=1263 1269/1269/1262) [0,1] r=0 lcod 0'0 mlcod 660'383856 peering] num_missing = 0, num_unfound
2010-12-23 11:17:30.690298 7f10bd2fb700 osd0 1270 pg[3.bc( v 660'383856 (643'383853,660'383856] n=7 ec=2 les=1263 1269/1269/1262) [0,1] r=0 lcod 0'0 mlcod 660'383856 peering] up_thru 1269 >= same_since 1
2010-12-23 11:17:30.690311 7f10bd2fb700 osd0 1270 pg[3.bc( v 660'383856 (643'383853,660'383856] n=7 ec=2 les=1270 1269/1269/1262) [0,1] r=0 luod=0'0 lcod 0'0 mlcod 660'383856 active] trim_past_intervals:
2010-12-23 11:17:30.690326 7f10bd2fb700 osd0 1270 pg[3.bc( v 660'383856 (643'383853,660'383856] n=7 ec=2 les=1270 1269/1269/1262) [0,1] r=0 luod=0'0 lcod 0'0 mlcod 660'383856 active] trim_past_intervals:
2010-12-23 11:17:30.690349 7f10bd2fb700 osd0 1270 pg[3.bc( v 660'383856 (643'383853,660'383856] n=7 ec=2 les=1270 1269/1269/1262) [0,1] r=0 lcod 0'0 mlcod 0'0 active] clear_prior
2010-12-23 11:17:30.690362 7f10bd2fb700 osd0 1270 pg[3.bc( v 660'383856 (643'383853,660'383856] n=7 ec=2 les=1270 1269/1269/1262) [0,1] r=0 lcod 0'0 mlcod 0'0 active] cancel_generate_backlog
2010-12-23 11:17:30.690378 7f10bd2fb700 osd0 1270 pg[3.bc( v 660'383856 (643'383853,660'383856] n=7 ec=2 les=1270 1269/1269/1262) [0,1] r=0 lcod 0'0 mlcod 0'0 active] write_info info 309
2010-12-23 11:17:30.690393 7f10bd2fb700 osd0 1270 pg[3.bc( v 660'383856 (643'383853,660'383856] n=7 ec=2 les=1270 1269/1269/1262) [0,1] r=0 lcod 0'0 mlcod 0'0 active] write_info bigbl 38
2010-12-23 11:17:30.690407 7f10bd2fb700 osd0 1270 pg[3.bc( v 660'383856 (643'383853,660'383856] n=7 ec=2 les=1270 1269/1269/1262) [0,1] r=0 lcod 0'0 mlcod 0'0 active] write_log
2010-12-23 11:17:30.690427 7f10bd2fb700 osd0 1270 pg[3.bc( v 660'383856 (643'383853,660'383856] n=7 ec=2 les=1270 1269/1269/1262) [0,1] r=0 lcod 0'0 mlcod 0'0 active] write_log to 0~294
2010-12-23 11:17:30.690441 7f10bd2fb700 osd0 1270 pg[3.bc( v 660'383856 (643'383853,660'383856] n=7 ec=2 les=1270 1269/1269/1262) [0,1] r=0 lcod 0'0 mlcod 0'0 active] clean_up_local
2010-12-23 11:17:30.690454 7f10bd2fb700 osd0 1270 pg[3.bc( v 660'383856 (643'383853,660'383856] n=7 ec=2 les=1270 1269/1269/1262) [0,1] r=0 lcod 0'0 mlcod 0'0 active] activate - snap_trimq []
2010-12-23 11:17:30.690467 7f10bd2fb700 osd0 1270 pg[3.bc( v 660'383856 (643'383853,660'383856] n=7 ec=2 les=1270 1269/1269/1262) [0,1] r=0 lcod 0'0 mlcod 0'0 active] activate - complete
osd/PG.cc: In function 'void PG::activate(ObjectStore::Transaction&, std::list<Context*, std::allocator<Context*> >&, std::map<int, MOSDPGInfo*, std::less<int>, std::allocator<std::pair<const int, MOSDPGInfo*> > >*)':
osd/PG.cc:1875: FAILED assert(info.last_update > pi.last_update)
ceph version 0.24 (commit:)
1: (PG::activate(ObjectStore::Transaction&, std::list<Context*, std::allocator<Context*> >&, std::map<int, MOSDPGInfo*, std::less<int>, std::allocator<std::pair<int const, MOSDPGInfo*> > >*)+0x1ab3) [0x
2: (PG::do_peer(ObjectStore::Transaction&, std::list<Context*, std::allocator<Context*> >&, std::map<int, std::map<pg_t, PG::Query, std::less<pg_t>, std::allocator<std::pair<pg_t const, PG::Query> > >,
3: (OSD::activate_map(ObjectStore::Transaction&, std::list<Context*, std::allocator<Context*> >&)+0x332) [0x4e3132]
4: (OSD::handle_osd_map(MOSDMap*)+0x2a67) [0x4eddb7]
5: (OSD::_dispatch(Message*)+0x2b0) [0x4f7f60]
6: (OSD::ms_dispatch(Message*)+0x39) [0x4f89d9]
7: (SimpleMessenger::dispatch_entry()+0x759) [0x462259]
8: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x45c10c]
9: (Thread::_entry_func(void*)+0xa) [0x47192a]
10: (()+0x7971) [0x7f10c696f971]
11: (clone()+0x6d) [0x7f10c560b92d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

what gdb says:
(gdb) p info.last_update
$1 = {version = 383856, epoch = 660, __pad = 0}
(gdb) p pi.last_update
$2 = {version = 383856, epoch = 660, __pad = 0}

osd2

pg[3.42( v 667'9637876 (643'9637385,667'9637876] n=20 ec=2 les=1266 1269/1269/1265) [2,1] r=0 lcod 0'0 mlcod 657'9637422 peering] osd1 log = log(667'9637876,667'9637876]

2010-12-23 11:17:55.276579 7fe8e84df700 osd2 1276 pg[3.42( v 667'9637876 (643'9637385,667'9637876] n=20 ec=2 les=1266 1269/1269/1265) [2,1] r=0 lcod 0'0 mlcod 657'9637422 peering] proc_replica_log for os
2010-12-23 11:17:55.276594 7fe8e84df700 osd2 1276 pg[3.42( v 667'9637876 (643'9637385,667'9637876] n=20 ec=2 les=1266 1269/1269/1265) [2,1] r=0 lcod 0'0 mlcod 657'9637422 peering] peer osd1 now 3.42( v
2010-12-23 11:17:55.276609 7fe8e84df700 osd2 1276 pg[3.42( v 667'9637876 (643'9637385,667'9637876] n=20 ec=2 les=1266 1269/1269/1265) [2,1] r=0 lcod 0'0 mlcod 657'9637422 peering] search_for_missing miss
2010-12-23 11:17:55.276623 7fe8e84df700 osd2 1276 pg[3.42( v 667'9637876 (643'9637385,667'9637876] n=20 ec=2 les=1266 1269/1269/1265) [2,1] r=0 lcod 0'0 mlcod 657'9637422 peering] peer up [2,1], acting [
2010-12-23 11:17:55.276636 7fe8e84df700 osd2 1276 pg[3.42( v 667'9637876 (643'9637385,667'9637876] n=20 ec=2 les=1266 1269/1269/1265) [2,1] r=0 lcod 0'0 mlcod 657'9637422 peering] peer prior_set is 1
2010-12-23 11:17:55.276650 7fe8e84df700 osd2 1276 pg[3.42( v 667'9637876 (643'9637385,667'9637876] n=20 ec=2 les=1266 1269/1269/1265) [2,1] r=0 lcod 0'0 mlcod 657'9637422 peering] peer osd1 3.42( v 667'
2010-12-23 11:17:55.276666 7fe8e84df700 osd2 1276 pg[3.42( v 667'9637876 (643'9637385,667'9637876] n=20 ec=2 les=1266 1269/1269/1265) [2,1] r=0 lcod 0'0 mlcod 657'9637422 peering] num_missing = 0, num_un
2010-12-23 11:17:55.276679 7fe8e84df700 osd2 1276 pg[3.42( v 667'9637876 (643'9637385,667'9637876] n=20 ec=2 les=1266 1269/1269/1265) [2,1] r=0 lcod 0'0 mlcod 657'9637422 peering] up_thru 1271 >= same_si
2010-12-23 11:17:55.276692 7fe8e84df700 osd2 1276 pg[3.42( v 667'9637876 (643'9637385,667'9637876] n=20 ec=2 les=1276 1269/1269/1265) [2,1] r=0 luod=0'0 lcod 0'0 mlcod 657'9637422 active] trim_past_inter
2010-12-23 11:17:55.276707 7fe8e84df700 osd2 1276 pg[3.42( v 667'9637876 (643'9637385,667'9637876] n=20 ec=2 les=1276 1269/1269/1265) [2,1] r=0 luod=0'0 lcod 0'0 mlcod 657'9637422 active] trim_past_inter
2010-12-23 11:17:55.276722 7fe8e84df700 osd2 1276 pg[3.42( v 667'9637876 (643'9637385,667'9637876] n=20 ec=2 les=1276 1269/1269/1265) [2,1] r=0 luod=0'0 lcod 0'0 mlcod 657'9637422 active] trim_past_inter
2010-12-23 11:17:55.276736 7fe8e84df700 osd2 1276 pg[3.42( v 667'9637876 (643'9637385,667'9637876] n=20 ec=2 les=1276 1269/1269/1265) [2,1] r=0 luod=0'0 lcod 0'0 mlcod 657'9637422 active] trim_past_inter
2010-12-23 11:17:55.276770 7fe8e84df700 osd2 1276 pg[3.42( v 667'9637876 (643'9637385,667'9637876] n=20 ec=2 les=1276 1269/1269/1265) [2,1] r=0 lcod 0'0 mlcod 0'0 active] clear_prior
2010-12-23 11:17:55.276784 7fe8e84df700 osd2 1276 pg[3.42( v 667'9637876 (643'9637385,667'9637876] n=20 ec=2 les=1276 1269/1269/1265) [2,1] r=0 lcod 0'0 mlcod 0'0 active] cancel_generate_backlog
2010-12-23 11:17:55.276806 7fe8e84df700 osd2 1276 pg[3.42( v 667'9637876 (643'9637385,667'9637876] n=20 ec=2 les=1276 1269/1269/1265) [2,1] r=0 lcod 0'0 mlcod 0'0 active] write_info info 309
2010-12-23 11:17:55.276822 7fe8e84df700 osd2 1276 pg[3.42( v 667'9637876 (643'9637385,667'9637876] n=20 ec=2 les=1276 1269/1269/1265) [2,1] r=0 lcod 0'0 mlcod 0'0 active] write_info bigbl 38
2010-12-23 11:17:55.276836 7fe8e84df700 osd2 1276 pg[3.42( v 667'9637876 (643'9637385,667'9637876] n=20 ec=2 les=1276 1269/1269/1265) [2,1] r=0 lcod 0'0 mlcod 0'0 active] write_log
2010-12-23 11:17:55.277529 7fe8e84df700 osd2 1276 pg[3.42( v 667'9637876 (643'9637385,667'9637876] n=20 ec=2 les=1276 1269/1269/1265) [2,1] r=0 lcod 0'0 mlcod 0'0 active] write_log to 0~48118
2010-12-23 11:17:55.277587 7fe8e84df700 osd2 1276 pg[3.42( v 667'9637876 (643'9637385,667'9637876] n=20 ec=2 les=1276 1269/1269/1265) [2,1] r=0 lcod 0'0 mlcod 0'0 active] clean_up_local
2010-12-23 11:17:55.277782 7fe8e84df700 osd2 1276 pg[3.42( v 667'9637876 (643'9637385,667'9637876] n=20 ec=2 les=1276 1269/1269/1265) [2,1] r=0 lcod 0'0 mlcod 0'0 active] activate - snap_trimq []
2010-12-23 11:17:55.277798 7fe8e84df700 osd2 1276 pg[3.42( v 667'9637876 (643'9637385,667'9637876] n=20 ec=2 les=1276 1269/1269/1265) [2,1] r=0 lcod 0'0 mlcod 0'0 active] activate - complete
osd/PG.cc: In function 'void PG::activate(ObjectStore::Transaction&, std::list<Context*, std::allocator<Context*> >&, std::map<int, MOSDPGInfo*, std::less<int>, std::allocator<std::pair<const int, MOSDPGInfo*> > >*)':
osd/PG.cc:1875: FAILED assert(info.last_update > pi.last_update)
ceph version 0.24 (commit:)
1: (PG::activate(ObjectStore::Transaction&, std::list<Context*, std::allocator<Context*> >&, std::map<int, MOSDPGInfo*, std::less<int>, std::allocator<std::pair<int const, MOSDPGInfo*> > >*)+0x1ab3) [0x
2: (PG::do_peer(ObjectStore::Transaction&, std::list<Context*, std::allocator<Context*> >&, std::map<int, std::map<pg_t, PG::Query, std::less<pg_t>, std::allocator<std::pair<pg_t const, PG::Query> > >,
3: (OSD::_process_pg_info(unsigned int, int, PG::Info&, PG::Log&, PG::Missing*, std::map<int, MOSDPGInfo*, std::less<int>, std::allocator<std::pair<int const, MOSDPGInfo*> > >*, int&)+0x98f) [0x4e6cff]
4: (OSD::handle_pg_log(MOSDPGLog*)+0xb1) [0x4f30a1]
5: (OSD::_dispatch(Message*)+0x41d) [0x4f80cd]
6: (OSD::ms_dispatch(Message*)+0x39) [0x4f89d9]
7: (SimpleMessenger::dispatch_entry()+0x759) [0x462259]
8: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x45c10c]
9: (Thread::_entry_func(void*)+0xa) [0x47192a]
10: (()+0x7971) [0x7fe8f2354971]
11: (clone()+0x6d) [0x7fe8f0ff092d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

gdb:
(gdb) p info.last_update
$1 = {version = 9637876, epoch = 667, __pad = 0}
(gdb) p pi.last_update
$2 = {version = 9637876, epoch = 667, __pad = 0}

Logs also attached...

Actions #29

Updated by Sage Weil over 13 years ago

Sigh... this one should have the two conditions match, and be a bit more space-cadet-programmer proof. Also prints peer info that'll make debugging less tedious.

diff --git a/src/osd/PG.cc b/src/osd/PG.cc
index 7175e13..638307f 100644
--- a/src/osd/PG.cc
+++ b/src/osd/PG.cc
@@ -1840,8 +1840,12 @@ void PG::activate(ObjectStore::Transaction& t, list<Context*>& tfin,
       PG::Info& pi = peer_info[peer];

       MOSDPGLog *m = 0;
-      
-      if (pi.last_update == info.last_update) {
+
+      dout(20) << "activate peer osd" << peer << " " << pi << dendl;
+
+      bool need_old_log_entries = pi.log_tail > pi.last_complete && !pi.log_backlog;
+
+      if (pi.last_update == info.last_update && !need_old_log_entries) {
         // empty log
        if (activator_map) {
          dout(10) << "activate peer osd" << peer << " is up to date, queueing in pending_activators" << dendl;
@@ -1855,7 +1859,7 @@ void PG::activate(ObjectStore::Transaction& t, list<Context*>& tfin,
       } 
       else {
        m = new MOSDPGLog(osd->osdmap->get_epoch(), info);
-       if (pi.log_tail > pi.last_complete && !pi.log_backlog) {
+       if (need_old_log_entries) {
          // the replica's tail is after it's last_complete and it has no backlog.
          // ick, this shouldn't normally happen.  but we can compensate!
          dout(10) << "activate peer osd" << peer << " has last_complete < log tail and no backlog, compensating" << dendl;

Actions #30

Updated by ar Fred over 13 years ago

Thanks for the patch, back to osd1 vs pg 3.70...

osd1 crashed:
2010-12-23 22:17:29.289927 7fef0ec40700 osd1 1292 pg[3.70( v 660'2955057 lc 567'2954755 (643'2954848,660'2955057] n=15 ec=2 les=1292 1290/1290/1285) [2,1] r=1 lcod 0'0 stray m=3] merge_log log(643'2954848,660'2955057] from osd2 into log(643'2954848,660'2955057]
2010-12-23 22:17:29.289939 7fef0ec40700 osd1 1292 pg[3.70( v 660'2955057 lc 567'2954755 (643'2954848,660'2955057] n=15 ec=2 les=1292 1290/1290/1285) [2,1] r=1 lcod 0'0 stray m=3] merge_log result log(643'2954848,660'2955057] missing(3) changed=0
osd/OSD.cc: In function 'void OSD::_process_pg_info(epoch_t, int, PG::Info&, PG::Log&, PG::Missing*, std::map<int, MOSDPGInfo*, std::less<int>, std::allocator<std::pair<const int, MOSDPGInfo*> > >*, int&)':
osd/OSD.cc:3982: FAILED assert(pg->log.tail <= pg->info.last_complete || pg->log.backlog)
ceph version 0.24 (commit:)
1: (OSD::_process_pg_info(unsigned int, int, PG::Info&, PG::Log&, PG::Missing*, std::map<int, MOSDPGInfo*, std::less<int>, std::allocator<std::pair<int const, MOSDPGInfo*> > >*, int&)+0x11fb) [0x4e756b]
2: (OSD::handle_pg_log(MOSDPGLog*)+0xb1) [0x4f30a1]
3: (OSD::_dispatch(Message*)+0x41d) [0x4f80cd]
4: (OSD::ms_dispatch(Message*)+0x39) [0x4f89d9]
5: (SimpleMessenger::dispatch_entry()+0x759) [0x462259]
6: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x45c10c]
7: (Thread::_entry_func(void*)+0xa) [0x47192a]
8: (()+0x7971) [0x7fef18bb6971]
9: (clone()+0x6d) [0x7fef1785292d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

(gdb) p pg->log.tail
$1 = {version = 2954848, epoch = 643, __pad = 0}
(gdb) p pg->info.last_complete
$2 = {version = 2954755, epoch = 567, __pad = 0}
(gdb) p pg->log.backlog
$3 = false

Thanks again for looking at this, and congratulations on the merge of rbd into qemu and libvirt, these are features I use(d) a lot (until my cluster crashed, but will of course use++ again asap)

Actions #31

Updated by Sage Weil over 13 years ago

Ok, may not look like it, but this is actually progress. Add this:


diff --git a/src/osd/PG.cc b/src/osd/PG.cc
index 7175e13..35f639e 100644
--- a/src/osd/PG.cc
+++ b/src/osd/PG.cc
@@ -1566,6 +1566,20 @@ void PG::do_peer(ObjectStore::Transaction& t, list<Context*>& tfin,
     return;
   }

+  // do i need a backlog due to a replica with mis-trimmed log?
+  for (unsigned i=0; i<acting.size(); i++) {
+    int o = acting[i];
+    Info& pi = peer_info[o];
+    if (pi.last_complete < pi.log_tail && !pi.log_backlog &&
+       pi.last_complete < log.tail && !log.backlog) {
+      dout(10) << "must generate backlog for replica peer osd" << o
+              << " who has a last_complete " << pi.last_complete
+              << " < their log.tail " << pi.log_tail << " and no backlog" << dendl;
+      osd->queue_generate_backlog(this);
+      return;
+    }
+  }
+
   // do i need a backlog for an up peer excluded from acting?
   bool need_backlog = false;
   for (unsigned i=0; i<up.size(); i++) {

Actions #32

Updated by ar Fred over 13 years ago

Very nice, thanks. I can almost say it worked!

... almost because the recovery went well for a long time until osd1 hit ENOSPC again, which is of course a different issue.

So it seems this is all fixed, but I will confirm once I get rid of this ENOSPC issue and get a 100% active+clean state.

Thanks a lot, Fred

Actions #33

Updated by Sage Weil over 13 years ago

  • Assignee set to Sage Weil
Actions #34

Updated by Sage Weil over 13 years ago

  • Status changed from 7 to Resolved
Actions #35

Updated by ar Fred over 13 years ago

Okay, everything is fine now... got rid of ENOSPC and had a scrub error that would not repair. I upgraded to 0.24.1 and it is now repaired.

thanks a lot!
