Bug #428

osd: recovery stalls on mismatched snapset and object

Added by Sage Weil over 13 years ago. Updated over 13 years ago.

Status: Resolved
Priority: High
Category: OSD
% Done: 80%
Regression: No
Severity: 3 - minor

Description

On Wido's cluster, I see recovery stalling on a number of objects where the head's snapset says the size is 4MB, but the object on disk on the other replica is 0 bytes. The pull code gets confused when it doesn't get back what it wants, and stalls out.

First, how did that happen?
Second, how can we properly detect this situation?
Third, what should we do?

Sigh...
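
For the detection part, a rough sketch of the check involved (hypothetical types and names, not the actual OSD code): compare the size the head's snapset records against what the replica actually has, and flag the object as inconsistent instead of letting the pull wait forever.

// Hypothetical sketch -- not the actual Ceph OSD code. It only illustrates
// the consistency check in question: compare the size recorded in the head's
// snapset against the size of the object actually present on the replica,
// and flag the object instead of blindly pulling it.
#include <cstdint>
#include <iostream>
#include <string>

struct SnapSetInfo {          // stand-in for the on-head snapset metadata
  uint64_t head_size;         // size the snapset claims the head should have
};

struct OnDiskObject {         // stand-in for what the replica reports back
  std::string name;
  uint64_t size;              // size of the object as stored on disk
};

// Returns true if the replica's copy matches what the snapset promises.
// A mismatch is what stalls the pull: the puller asked for 4MB and got 0.
bool replica_matches_snapset(const SnapSetInfo &ss, const OnDiskObject &obj) {
  if (obj.size != ss.head_size) {
    std::cerr << "object " << obj.name
              << ": snapset says " << ss.head_size
              << " bytes, replica has " << obj.size
              << " bytes -- marking inconsistent instead of stalling\n";
    return false;
  }
  return true;
}

int main() {
  SnapSetInfo ss{4 * 1024 * 1024};             // snapset claims 4MB
  OnDiskObject obj{"rb.0.1.000000000123", 0};  // replica actually has 0 bytes
  if (!replica_matches_snapset(ss, obj)) {
    // here the real OSD would have to decide: retry from another replica,
    // mark the object lost/unfound, or return an error to the client
  }
  return 0;
}

Whatever the detection ends up looking like, the third question is still open: once the mismatch is found, do we mark the object lost, retry another replica, or return an error (see #453)?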


Related issues

Related to Ceph - Feature #453: osd: return error (instead of blocking) on lost objects (Resolved, 10/19/2010)
Related to Ceph - Feature #526: osd: unfound objects rework (Resolved, 10/29/2010)

History

#1 Updated by Sage Weil over 13 years ago

I've pushed a fix for at least part of the stalling problem, eb0a3fa67906181fab872d14ed5e0bcaba03da6f. This doesn't explain the 0-byte objects I saw before, but I'm not seeing any at the moment. Let's see how far this gets us.

#2 Updated by Wido den Hollander over 13 years ago

Just tried the fix, runs fine, gets to almost 0%, but then one OSD (osd7) crashed.

I've uploaded the cores, binary and logs to logger:/srv/ceph/issues/osd_crash_sub_op_push/

#4 Updated by Wido den Hollander over 13 years ago

The latest commit seems to keep the OSD alive, but right now the recovery is stalling again.

It's been hanging on 0.076% for about 15 minutes right now and I'm pretty sure it will stay there.

In the process I had two OSD crashes on osd8; not sure if they are related.

I uploaded the core, binary and logfiles to logger.ceph.widodh.nl:/srv/ceph/issues/osd_crash_recover_object_replicas

Unfortunately I don't have detailed logs from the time of the crash, because when I increased the log level the OSD wouldn't crash again.

The backtrace:

Core was generated by `/usr/bin/cosd -i 8 -c /etc/ceph/ceph.conf'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000000004799c6 in std::vector<snapid_t, std::allocator<snapid_t> >::size (this=0x230b400, snapset=..., soid=..., 
    missing=..., data_subset=..., clone_subsets=...) at /usr/include/c++/4.4/bits/stl_vector.h:533
533    /usr/include/c++/4.4/bits/stl_vector.h: No such file or directory.
    in /usr/include/c++/4.4/bits/stl_vector.h
(gdb) bt
#0  0x00000000004799c6 in std::vector<snapid_t, std::allocator<snapid_t> >::size (this=0x230b400, snapset=..., soid=..., 
    missing=..., data_subset=..., clone_subsets=...) at /usr/include/c++/4.4/bits/stl_vector.h:533
#1  ReplicatedPG::calc_clone_subsets (this=0x230b400, snapset=..., soid=..., missing=..., data_subset=..., 
    clone_subsets=...) at osd/ReplicatedPG.cc:2617
#2  0x000000000047fa2e in ~_Rb_tree (this=0x230b400, obc=<value optimized out>, soid=..., peer=6)
    at /usr/include/c++/4.4/bits/stl_tree.h:614
#3  ~map (this=0x230b400, obc=<value optimized out>, soid=..., peer=6) at /usr/include/c++/4.4/bits/stl_map.h:87
#4  ~interval_set (this=0x230b400, obc=<value optimized out>, soid=..., peer=6) at ./include/interval_set.h:34
#5  ReplicatedPG::push_to_replica (this=0x230b400, obc=<value optimized out>, soid=..., peer=6) at osd/ReplicatedPG.cc:2843
#6  0x00000000004828f3 in ReplicatedPG::recover_object_replicas (this=0x230b400, soid=...) at osd/ReplicatedPG.cc:3665
#7  0x0000000000482d1b in ~object_t (this=0x230b400, max=<value optimized out>) at ./include/object.h:32
#8  ~sobject_t (this=0x230b400, max=<value optimized out>) at ./include/object.h:129
#9  ReplicatedPG::recover_replicas (this=0x230b400, max=<value optimized out>) at osd/ReplicatedPG.cc:3692
#10 0x000000000048779a in ReplicatedPG::start_recovery_ops (this=0x230b400, max=1) at osd/ReplicatedPG.cc:3498
#11 0x00000000004d415c in OSD::do_recovery (this=<value optimized out>, pg=<value optimized out>) at osd/OSD.cc:4339
#12 0x00000000005c34af in Mutex::Unlock (this=0x15755f8) at common/Mutex.h:104
#13 ThreadPool::worker (this=0x15755f8) at common/WorkQueue.cc:43
#14 0x00000000004f957d in RWLock::~RWLock() ()
#15 0x00007f5ef7c3f9ca in start_thread () from /lib/libpthread.so.0
#16 0x00007f5ef6bf76fd in clone () from /lib/libc.so.6
#17 0x0000000000000000 in ?? ()
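
Looking at frame #1, the crash is inside ReplicatedPG::calc_clone_subsets while it reads a vector of snapid_t. A rough, purely hypothetical illustration of that kind of unchecked access (not the actual Ceph code), plus the obvious guard:

// Hypothetical sketch of the failure mode suggested by the backtrace -- not
// the actual ReplicatedPG::calc_clone_subsets code. If the snapset for an
// object is missing or its clone list is empty and the code indexes into
// the clone vector without checking, it reads through a bad pointer/index,
// much like the std::vector<snapid_t>::size() frame above.
#include <cstdint>
#include <cstdio>
#include <vector>

typedef uint64_t snapid_t;

struct SnapSet {
  std::vector<snapid_t> clones;   // clone ids, oldest to newest
};

// Unsafe: assumes the snapset pointer is valid and clones is non-empty.
snapid_t newest_clone_unchecked(const SnapSet *ss) {
  return ss->clones[ss->clones.size() - 1];   // crashes if ss is bad or empty
}

// Safe variant: validate before touching the vector.
bool newest_clone_checked(const SnapSet *ss, snapid_t *out) {
  if (!ss || ss->clones.empty())
    return false;                 // caller can skip recovery for this object
  *out = ss->clones.back();
  return true;
}

int main() {
  SnapSet empty;                  // snapset that doesn't match the object state
  snapid_t newest;
  if (!newest_clone_checked(&empty, &newest))
    std::printf("no usable clone info -- skip recovery for this object\n");
  // newest_clone_unchecked(&empty) would index past the end, and
  // newest_clone_unchecked(nullptr) would dereference a bad pointer.
  return 0;
}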

For now, the 11 OSDs are staying up fine.

I'm going to add the empty OSD (osd4, the one which lost its xattrs during the rsync) and see if that might trigger the cluster to recover fully.

#5 Updated by Wido den Hollander over 13 years ago

Adding the 12th OSD made the cluster start recovering again, but then it stalled at 1.622%.

It's been hanging there for a few hours now, so I don't think it will go further.

#6 Updated by Sage Weil over 13 years ago

  • % Done changed from 0 to 80

Okay, the cluster is now all active and clean. The rbd snapshot(s) are corrupted... I had to copy random data into place in some cases to fill in for missing objects. For the head objects, I'm sorry but I can't remember whether I had to make anything up for some of those, so there may be some data that is wrong in those rbd images (in addition to the objects that are flat-out missing).

In any case, the cluster at least thinks it's clean. Some of the cosd binaries are ones I copied into place with code that isn't live yet, as there are still some parts of the code for forgetting lost objects that need to be finished.

The next step is either to do some scrubs, to make sure things are really correct, or to just wipe things out. Doing the scrubs will probably be a useful exercise to see what kinds of problems it finds and/or to identify problems with the scrub process itself.

The mds probably just needs to be started at this point; I didn't see any problems in the metadata cluster.

You might consider moving to 3x replication :)

#7 Updated by Wido den Hollander over 13 years ago

Not knowing that you had patched the binaries, I overwrote them this morning when I installed my daily build of the unstable branch.

Could you upload the patches somewhere so I can see if everything stays stable? Today I saw a lot of OSDs crashing (osd2, osd4, osd6 and osd7); the core dumps are still on the machines. That might be because I overwrote your changes.

And yes, I'll switch to 3x replication :)

#8 Updated by Sage Weil over 13 years ago

  • Status changed from New to Resolved

There's a separate issue (#453) open for the remaining problem. Closing this one out.
