Bug #11586

closed

RBD Data Corruption with Logging on Giant

Added by David Burley almost 9 years ago. Updated almost 9 years ago.

Status:
Duplicate
Priority:
Urgent
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have a ceph cluster running:
  • Ceph Giant (0.87) - Ceph repo RPMs
  • CentOS Linux release 7.0.1406 (Core)
  • Kernel 3.10.0-123.20.1.el7.x86_64

Our environment uses kernel RBDs exclusively with 3x replication; some OSDs are in an SSD group and others in a spinner group, but it's a pretty simple config.

In troubleshooting another issue, we increased logging for the OSD daemon to "10" on all of our OSDs via injectargs. Historically our cluster has had an inconsistent pg show up every few days. After increasing the log level, we hit > 20 inconsistencies on day 1, > 30 on day 2, and the count continued to increase daily. On day 5 we set the logging back to defaults; the following day we still saw an elevated number of inconsistencies (> 100), followed by a sharp drop to 8, and then 4 on the following two days. All but 2 of these inconsistencies were repaired as per the Ceph manual (ceph pg repair $PG). We assume the lag between disabling the setting and the drop in inconsistencies found is due to how deep scrubs are scheduled.
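For reference, the runtime logging change described above can be made with injectargs; a sketch of the commands involved (the PG id is a placeholder, and "1/5" is the stock default for debug_osd):

```shell
# Raise OSD debug logging to 10 on every OSD at runtime (no restart needed):
ceph tell osd.* injectargs '--debug-osd 10'

# Later, return to the default level:
ceph tell osd.* injectargs '--debug-osd 1/5'

# Repair an inconsistent PG reported by scrub (PG id is a placeholder):
ceph pg repair 2.3f
```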

In evaluating the 2 outstanding inconsistent PGs, we ran md5sum over the files in the pg directories on each related OSD. We then diff'd the checksums and, in each of the two cases, found one file that differed on one OSD. We then hex-dumped the files that differed and examined their content (which should be chunks of RBD data). In one case, real file data was overwritten by what appeared to be OSD log data belonging to the OSD the file was found on. In the other case, it was again OSD log data belonging to that same OSD, but appended to the end of the RBD data.
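The replica comparison above can be sketched as a small shell helper. The function name and /tmp output paths are illustrative; on a real cluster the PG directories would be collected from each OSD host (e.g. under /var/lib/ceph/osd/ceph-$ID/current/<pgid>_head for FileStore OSDs of that era):

```shell
# Sketch: compare the object files of one PG across two OSD replica
# directories and report any file whose checksum differs.
compare_pg_replicas() {
    local dir_a=$1 dir_b=$2
    # Hash every file, with paths relative to the PG directory, sorted by path
    (cd "$dir_a" && find . -type f -exec md5sum {} + | sort -k2) > /tmp/replica_a.md5
    (cd "$dir_b" && find . -type f -exec md5sum {} + | sort -k2) > /tmp/replica_b.md5
    # Any lines that differ point at the corrupted object file(s)
    diff /tmp/replica_a.md5 /tmp/replica_b.md5
}
```

A differing object can then be inspected with a hex dump (e.g. `hexdump -C` or `xxd`) to see what overwrote or was appended to the RBD data.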

The log data should only exist on the OSD's own filesystem, which is on local disk, not on anything backed by an RBD. So it appears that logging is, or can be, corrupting RBD data in Giant. We're unsure whether this impacts clusters running the default logging configuration.


Files

ceph-osd.0.log (62.5 KB) — 500 lines with debug filestore=20 — David Burley, 05/13/2015 05:23 PM

Related issues: 1 (0 open, 1 closed)

Related to Ceph - Bug #12465: Log::reopen_log_file() must take the flusher lock to avoid closing an fd ::_flush() is still using (Resolved, 07/24/2015)

