Bug #26993
Possible rbd snapshot race with 13.2.1 causes missing SharedBlob error and OSD corruption
Description
I have two rbd pools: vm-pool and an associated EC data pool, vm-ssd-ec-pool (both live on the same SSD OSDs). A cron job takes hourly snapshots of the rbd images.
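The cron script itself is not attached to this report. As a rough sketch only, a job of this kind could look something like the following. The image names, the `hourly-` snapshot naming scheme, and both helper functions are assumptions made for illustration; only the pool name `vm-pool` and the `rbd snap create` command come from the actual setup.

```python
import subprocess
from datetime import datetime, timezone


def snap_create_command(pool, image, when):
    """Build the argv for `rbd snap create pool/image@snap` (runs nothing)."""
    # Hypothetical naming scheme: one snapshot name per hour.
    snap_name = when.strftime("hourly-%Y%m%d-%H00")
    return ["rbd", "snap", "create", f"{pool}/{image}@{snap_name}"]


def run_hourly(pool, images, when=None, runner=subprocess.run):
    """Snapshot each image in turn; `runner` is injectable for dry runs."""
    when = when or datetime.now(timezone.utc)
    for image in images:
        # check=True makes a failed `rbd` call raise instead of passing silently.
        runner(snap_create_command(pool, image, when), check=True)


# Example call from an hourly cron entry (image names are hypothetical):
# run_hourly("vm-pool", ["vm-disk-1", "vm-disk-2"])
```

Note that a job like this issues snapshots on several images back to back, which is the pattern suspected of triggering the problem.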
This is only speculation, but there appears to be a race that occasionally corrupts one or more OSDs at the time a snapshot is taken (or perhaps it is triggered by a series of snapshots on different rbd images). It does not happen on every snapshot creation; it happens perhaps 1 time in 100, or less often. When it does happen, the affected OSD appears to crash, and it then aborts every time it is started.
I did not get a core from the original crash, but here are the fsck log, plus the standard OSD log and a core, both from a manual run.
fsck log (debug 20): ceph-post-file: fc6eab3e-83a3-46eb-90c1-ac02dee553b7
osd log (normal debug, sorry): ceph-post-file: 45612096-2fa8-49c3-a778-c9d8f1015e59
core: ceph-post-file: 356631aa-fdda-418b-9f60-82c7156dd04d
All OSDs are bluestore.