Bug #26993

Possible rbd snapshot race with 13.2.1 causes missing SharedBlob error and OSD corruption

Added by Troy Ablan over 5 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I have two rbd pools, vm-pool and its associated EC data pool vm-ssd-ec-pool (both live on the same SSD OSDs). Hourly snapshots are taken via a cron job.
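
For illustration, the cron job that creates the hourly snapshots is roughly of the following form (the image names, snapshot naming scheme, and exact schedule below are placeholders, not the real entries):

# /etc/cron.d/rbd-hourly-snap -- illustrative sketch only
# Images were created with their data objects on the EC pool, e.g.:
#   rbd create vm-pool/vm01 --size 100G --data-pool vm-ssd-ec-pool
# Once an hour, snapshot every image in vm-pool with a timestamped name.
0 * * * * root for img in $(rbd ls vm-pool); do rbd snap create "vm-pool/${img}@auto-$(date +\%Y\%m\%d\%H)"; done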

This is only speculation, but there appears to be a race that occasionally corrupts one or more OSDs at the time a snapshot is taken (or perhaps it's triggered by a series of snapshots on different rbd images?). It definitely doesn't happen on every snapshot creation; maybe one in 100, or less often. But when it does happen, the OSD in question crashes and then aborts every time it is restarted.

I did not get a core from the original crash, but here are the fsck log, the standard OSD log from a manual run, and a core from a manual run:

fsck log (debug 20): ceph-post-file: fc6eab3e-83a3-46eb-90c1-ac02dee553b7
osd log (normal debug, sorry): ceph-post-file: 45612096-2fa8-49c3-a778-c9d8f1015e59
core: ceph-post-file: 356631aa-fdda-418b-9f60-82c7156dd04d
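
For reference, a bluestore fsck log like the one above can be captured against the stopped OSD roughly as follows (the OSD id and log path are placeholders, and exact flag spellings may differ by release):

# The OSD must be stopped before fsck can open its store.
systemctl stop ceph-osd@3
# Run fsck with verbose (debug 20) logging written to a file.
ceph-bluestore-tool fsck \
    --path /var/lib/ceph/osd/ceph-3 \
    --log-file /tmp/ceph-osd.3-fsck.log --log-level 20
# Upload the resulting log so its id can be referenced here.
ceph-post-file /tmp/ceph-osd.3-fsck.log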

All OSDs are BlueStore.
