ceph-osd stops with "Caught signal (Aborted)" or "osd/PG.cc: 2683: FAILED assert(values.size() == 1)"
While my production Ceph cluster was recovering from a power outage, a few of my OSDs started flapping and eventually went down. In the past, I've simply removed the affected OSDs completely, re-added them fresh, and let the cluster recover. However, the cluster is currently reporting a few objects as "unfound" (3/939435 unfound (0.000%)), and I'm leery of completely removing OSDs in this state because I don't want to incur any data loss.
Digging through the archives and bug reports, I've found a similar case [1] with a request for reproduction with increased logging levels. I believe I've managed to gather the requested level of detail and will attach it to this report.
#2 Updated by Jamin Collins over 8 years ago
- File ceph-locate-unfound added
Near as I can tell, all the unfound objects reside on osd.6:
Is there any way to move these objects to a working OSD or get osd.6 back to a point where ceph-osd can start on it?