Bug #1690: osd re-created from scratch will crash on start-up - Ceph - Ceph

closed

Added by Alexandre Oliva over 12 years ago. Updated over 12 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: Samuel Just
Target version: v0.40

Description

Some time ago, re-creating an osd after its filesystem failed was as simple as running “cosd -i # --mkfs --mkjournal” and then starting it. This no longer works: ceph-osd --mkfs --mkjournal completes successfully, but after the osd is started, its log shows:

2011-11-05 09:49:31.203140 7f7ed41c0740 filestore(/etc/ceph/osd2) mount: enabling PARALLEL journal mode: btrfs, SNAP_CREATE_V2 detected and 'filestore btrfs snap' mode is enabled
2011-11-05 09:49:31.203557 7f7ed41c0740 journal _open /etc/ceph/osd2/journal2 fd 14: 1610612736 bytes, block size 4096 bytes, directio = 1
2011-11-05 09:49:31.203683 7f7ed41c0740 journal read_entry 4096 : seq 1 212 bytes
2011-11-05 09:49:31.203736 7f7ed41c0740 journal _open /etc/ceph/osd2/journal2 fd 14: 1610612736 bytes, block size 4096 bytes, directio = 1
*** Caught signal (Aborted) **
 in thread 0x7f7ec17fa700
*** Caught signal (Segmentation fault) **
 in thread 0x7f7ec17fa700

To bring the osd back up, I rsynced a recent snap of another osd with --exclude=/*_head, adjusted the osd number in the superblock, and duplicated the snapshot into current. It then recovered successfully, but this osd was supposed to hold copies of all PGs, just like the other one; I'm not sure how to go about recovering an osd when that is not the case.

Updated by Samuel Just over 12 years ago

Assignee set to Samuel Just

I seem to be having some trouble reproducing this. What version are you running? Could you repeat the procedure with osd and filestore debugging at 25?
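
For reference, one way to raise those debug levels is through ceph.conf. This is a minimal sketch, assuming the settings are placed in the [osd] section so that they apply to every osd started from that file; the exact placement in any given cluster's config is an assumption, not something stated in this report.

```
[osd]
    ; verbose osd and filestore logging, as requested above
    debug osd = 25
    debug filestore = 25
```

The osd needs to be restarted after the change so that the new levels take effect, and the settings are best removed once the logs are collected, since output at this level is very large.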

Updated by Sage Weil over 12 years ago

Target version set to v0.39

Updated by Alexandre Oliva over 12 years ago

I was using v0.37. In order to debug this, I first built the top of the stable tree (b8979f4d292f6a739daac81ce8e59aa084e11e22), and then I could no longer trigger the problem. I can't tell whether that is because some post-0.37 patch fixed it or because the condition that hit me twice simply did not occur again, but I guess we can leave it alone for now, and I'll reopen with more information if I hit it again.
