osd re-created from scratch will crash on start-up
Some time ago, an osd whose filesystem had failed could be re-created simply by running “cosd -i # --mkfs --mkjournal” and then starting it. This no longer works: ceph-osd --mkfs --mkjournal completes successfully, but after the osd is started, its log shows:
2011-11-05 09:49:31.203140 7f7ed41c0740 filestore(/etc/ceph/osd2) mount: enabling PARALLEL journal mode: btrfs, SNAP_CREATE_V2 detected and 'filestore btrfs snap' mode is enabled
2011-11-05 09:49:31.203557 7f7ed41c0740 journal _open /etc/ceph/osd2/journal2 fd 14: 1610612736 bytes, block size 4096 bytes, directio = 1
2011-11-05 09:49:31.203683 7f7ed41c0740 journal read_entry 4096 : seq 1 212 bytes
2011-11-05 09:49:31.203736 7f7ed41c0740 journal _open /etc/ceph/osd2/journal2 fd 14: 1610612736 bytes, block size 4096 bytes, directio = 1
*** Caught signal (Aborted) ** in thread 0x7f7ec17fa700
*** Caught signal (Segmentation fault) ** in thread 0x7f7ec17fa700
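For anyone trying to reproduce this, the sequence that leads to the log above can be sketched roughly as follows. The osd id 2 is only inferred from the /etc/ceph/osd2 path in the log, and the DRY_RUN switch and run helper are illustrative additions, not part of ceph; by default the script only prints the commands.

```shell
#!/bin/sh
# Reproduction sketch. OSD_ID=2 is inferred from the log path, not stated.
OSD_ID=${OSD_ID:-2}
DRY_RUN=${DRY_RUN:-1}   # default: only print the commands

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Re-create the object store and journal; this step completes successfully.
run ceph-osd -i "$OSD_ID" --mkfs --mkjournal
# Start the daemon; on v0.37 this is where the abort shows up in the log.
run ceph-osd -i "$OSD_ID"
```

Setting DRY_RUN=0 would actually execute both commands, so that should only be done on an osd whose data is already lost.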
To bring the osd back up, I rsynced a recent snap of another osd with --exclude=/*_head, adjusted the osd number in the superblock, and duplicated the snapshot into current. The osd then recovered successfully, but this osd was supposed to hold copies of all PGs, just like the other one; I'm not sure how to go about recovering an osd when that is not the case.
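The workaround above might look roughly like the following dry-run plan, which only prints what would be done. Everything beyond the --exclude=/*_head pattern is an assumption: the paths and the snap_N name are hypothetical placeholders, rsync -a is a guess at the copy mode, and the last step assumes btrfs, that the destination snap_N is already a subvolume, and that no current exists yet. How the superblock was edited is not stated, so that step is left as a note.

```shell
#!/bin/sh
# Dry-run plan of the workaround; every path below is a hypothetical placeholder.
SRC_SNAP=${SRC_SNAP:-/srv/osd-good/snap_N}   # recent snapshot of a healthy osd
DST=${DST:-/srv/osd-new}                     # data directory of the re-created osd

plan() {
    # 1. Copy the snapshot, leaving out the top-level *_head directories.
    echo "rsync -a --exclude=/*_head $SRC_SNAP/ $DST/snap_N/"
    # 2. Manual step: the report does not say how the superblock was edited.
    echo "# adjust the osd number in the superblock under $DST/snap_N"
    # 3. Duplicate the copied snapshot into current (btrfs snapshot assumed).
    echo "btrfs subvolume snapshot $DST/snap_N $DST/current"
}

plan
```

Printing the plan first makes it easy to check the source and destination before anything is copied over a live data directory.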
#3 Updated by Alexandre Oliva over 8 years ago
I was using v0.37. To debug this, I first built the top of the stable tree (b8979f4d292f6a739daac81ce8e59aa084e11e22), and then I could no longer trigger the problem. I can't tell whether that is because some post-0.37 patch fixed it or because the condition that hit me twice simply did not recur, but I guess we can leave it alone for now; I'll reopen with more information if I hit it again.