Bug #19419

open

XFS filesystem on RBD image was corrupt after remount

Added by Bryan Apperson about 7 years ago. Updated about 7 years ago.

Status: Need More Info
Priority: Low
Assignee:
Category: rbd
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite: rbd
Crash signature (v1):
Crash signature (v2):

Description

An RBD image was being used in production to back a Samba and NFS export. The 100TB image was formatted with XFS and put into production about 2 years ago (2015). The filesystem ran without issue until March 16th 2017. In February there was a network switch failure between the RBD client and the RADOS cluster, so that was the first thing I looked at. I was not able to reproduce the corruption:

Executed these commands:

ceph osd crush rule create-simple hostlevel default host
ceph osd pool create test_rbd 4096 4096 replicated hostlevel
rbd -p test_rbd create testrbd1 --size 100000
mkfs.xfs /dev/rbd/test_rbd/testrbd1

XFS Info:
log stripe unit (4194304 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/rbd/test_rbd/testrbd1 isize=256    agcount=17, agsize=1599488 blks
         =                           sectsz=512   attr=2, projid32bit=1
         =                           crc=0
data     =                           bsize=4096   blocks=25600000, imaxpct=25
         =                           sunit=1024   swidth=1024 blks
naming   =version 2                  bsize=4096   ascii-ci=0 ftype=0
log      =internal log               bsize=4096   blocks=12504, version=2
         =                           sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                       extsz=4096   blocks=0, rtextents=0
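
Before running the tests below, the freshly formatted image has to be mapped and mounted; a minimal sketch of that step, assuming the kernel RBD client and the udev-created device path used above:

# map the test image through the kernel client; udev creates
# /dev/rbd/<pool>/<image> for it on this system
rbd -p test_rbd map testrbd1

# mount the new filesystem where the tests expect it
mkdir -p /mnt
mount /dev/rbd/test_rbd/testrbd1 /mnt
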
Test:
Mount an XFS-formatted RBD on /mnt and copy files into it, then disrupt the
network connection with a soft down/up of the interface during IO.
In one window:

for i in {1..100}; do cp large_file file${i}; done

In another window:

ifdown public
sleep 30
ifup public

Result:
I was unable to reproduce the corruption with a soft down/up of the public bond
while doing IO to XFS.

Test:
Mount an XFS-formatted RBD on /mnt and copy files into it, then disrupt the
network connection by pulling the cables from both fiber interfaces while IO is ongoing.
In one window:

for i in {1..100}; do cp large_file file${i}; done

Meanwhile:

pull cables
sleep 30
plug cables

Result:
I was unable to reproduce the corruption by pulling the physical cables while
doing IO to XFS.
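
For reference, the soft down/up variant above can be driven from a single script instead of two windows. This is only a sketch based on the fragments above; it assumes the image is already mounted at /mnt, that large_file exists there, and that "public" is the name of the client-facing bond interface:

#!/bin/bash
set -x
cd /mnt

# generate IO in the background
( for i in {1..100}; do cp large_file "file${i}"; done ) &
io_pid=$!

# disrupt the network partway through the copies, then restore it
sleep 10
ifdown public
sleep 30
ifup public

# wait for the copies to finish and force writeback
wait "$io_pid"
sync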

The February switch failure produced the following log entries on the RBD client:

Feb 21 12:53:56 node92 mount-rbd-dropbox-dropbox: rbd: add failed: (5) Input/output error
Feb 21 12:53:56 node92 mount-rbd-dropbox-dropbox: mount: special device /dev/rbd/dropbox/dropbox does not exist
Feb 21 12:53:56 node92 systemd: Started RADOS block device mapping for dropbox in pool dropbox".
Feb 21 12:53:59 node92 kernel: libceph: mon1 10.32.22.108:6789 socket closed (con state CONNECTING)
Feb 21 12:54:09 node92 kernel: libceph: mon1 10.32.22.108:6789 socket closed (con state CONNECTING)
Feb 21 12:54:19 node92 kernel: libceph: mon0 10.32.22.87:6789 socket closed (con state CONNECTING)
Feb 21 12:54:24 node92 smbd[2617]: [2017/02/21 12:54:24.220457, 0] ../source3/smbd/service.c:792(make_connection_snum)
Feb 21 12:54:24 node92 smbd[2617]: canonicalize_connect_path failed for service dropbox, path /share/store/DROPBOX
Feb 21 12:54:24 node92 smbd[2617]: [2017/02/21 12:54:24.224132, 0] ../source3/smbd/service.c:792(make_connection_snum)
Feb 21 12:54:24 node92 smbd[2617]: canonicalize_connect_path failed for service dropbox, path /share/store/DROPBOX
Feb 21 12:54:24 node92 smbd[2617]: [2017/02/21 12:54:24.227230, 0] ../source3/smbd/service.c:792(make_connection_snum)
Feb 21 12:54:24 node92 smbd[2617]: canonicalize_connect_path failed for service dropbox, path /share/store/DROPBOX
Feb 21 12:54:29 node92 kernel: libceph: mon1 10.10.10.108:6789 socket closed (con state CONNECTING)
Feb 21 12:54:39 node92 kernel: libceph: mon2 10.10.10.89:6789 socket closed (con state CONNECTING)
Feb 21 12:54:49 node92 kernel: libceph: mon0 10.10.10.87:6789 socket closed (con state CONNECTING)
Feb 21 12:54:56 node92 mount-rbd-store-store: rbd: add failed: (5) Input/output error
Feb 21 12:54:56 node92 mount-rbd-store-store: mount: special device /dev/rbd/store/store does not exist
Feb 21 12:54:56 node92 systemd: Started RADOS block device mapping for store in pool store".

The RBD was remounted after the switch was replaced, and seemed to be working fine. Then on March 16th the image became unresponsive. After attempting a reboot of the server and a remount of the image:

Mar 17 00:25:34 node92 kernel: XFS (rbd1): Metadata corruption detected at xfs_agf_read_verify+0x70/0x120 [xfs], block 0x1
Mar 17 00:25:34 node92 kernel: XFS (rbd1): Unmount and run xfs_repair
Mar 17 00:25:34 node92 kernel: XFS (rbd1): First 64 bytes of corrupted metadata buffer:
Mar 17 00:25:34 node92 kernel: ffff8808a3a7f000: 5f 77 72 74 6e 0a 72 62 64 30 20 20 20 20 20 20 _wrtn.rbd0
Mar 17 00:25:34 node92 kernel: ffff8808a3a7f010: 20 20 20 20 20 20 20 20 30 2e 30 30 20 20 20 20 0.00
Mar 17 00:25:34 node92 kernel: ffff8808a3a7f020: 20 20 20 20 20 30 2e 30 30 20 20 20 20 20 20 20 0.00
Mar 17 00:25:34 node92 kernel: ffff8808a3a7f030: 20 20 30 2e 30 30 20 20 20 20 20 20 20 20 20 20 0.00
Mar 17 00:25:34 node92 kernel: XFS (rbd1): metadata I/O error: block 0x1 ("xfs_trans_read_buf_map") error 117 numblks 1
Mar 17 00:25:34 node92 kernel: XFS (rbd1): xfs_do_force_shutdown(0x1) called from line 383 of file fs/xfs/xfs_trans_buf.c. Return address = 0xffffffffa03bb2c1
Mar 17 00:25:34 node92 kernel: XFS (rbd1): I/O Error Detected. Shutting down filesystem
Mar 17 00:25:34 node92 kernel: XFS (rbd1): Please umount the filesystem and rectify the problem(s)
Mar 17 00:25:36 node92 kernel: XFS (rbd1): xfs_log_force: error 5 returned.

This happened even though the image was now mapped as /dev/rbd1. The filesystem could only be repaired using xfs_repair -L. All the while, ceph health reported HEALTH_OK, with no inconsistent PGs or errors.
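
For completeness, the repair path was roughly the following; this is a sketch, assuming the image stays mapped as /dev/rbd1 and using /mnt as the mount point for illustration. Note that xfs_repair -L zeroes the XFS log, discarding any uncommitted metadata updates, which is why it is a last resort:

umount /mnt                  # the filesystem must be offline for xfs_repair
xfs_repair -n /dev/rbd1      # report-only pass to gauge the extent of the damage
xfs_repair -L /dev/rbd1      # destructive: zero the log, then repair
mount /dev/rbd1 /mnt
ls /mnt/lost+found | wc -l   # count the orphaned files recovered by the repair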

ceph --version: ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
uname -a: Linux node0.ceph 3.10.0-123.el7.x86_64 #1 SMP Mon Jun 30 12:09:22 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

I have attached /var/log/messages from the RBD client for that time period and the XFS repair log. 7942 files were moved to lost+found with missing metadata, so something got lost in translation between what the filesystem thought was committed to backing storage and what actually was. The image corruption was detected online, before the image was remounted.

Firefly is EOL; however, perhaps somebody in the community has encountered a similar issue:
http://tracker.ceph.com/issues/15946


Files

messages-20170319.zip (102 KB) - /var/log/messages from around the time of corruption - Bryan Apperson, 03/29/2017 05:31 PM
xfs_repair_L.txt.zip (308 KB) - XFS repair logs - Bryan Apperson, 03/29/2017 05:31 PM
Actions #1

Updated by Ilya Dryomov about 7 years ago

  • Assignee set to Ilya Dryomov

Hi Bryan,

Then on March 16th the image became unresponsive. After attempting a reboot of the server and a remount of the image

Did you have to hard reboot? I think that's what caused the corruption -- some of the dirty pages weren't written back to the cluster.

Can you tell us more about this unresponsiveness? The log you attached is from after you rebooted, I'm guessing. Could you please provide the entire log for March 16th?
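
For reference, a clean teardown of an RBD-backed mount ensures dirty pages are written back to the cluster before the node goes down; a minimal sketch, assuming the /dev/rbd1 mapping and a mount point of /mnt:

sync                 # flush the dirty page cache
umount /mnt          # forces writeback and quiesces the filesystem
rbd unmap /dev/rbd1  # remove the kernel mapping once nothing holds the device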

Actions #2

Updated by Ilya Dryomov about 7 years ago

BTW, not only is firefly EOL, but that RHEL 7.0 kernel client is also very old. I'd recommend upgrading to the 7.3 kernel -- it is compatible with firefly servers.

Actions #3

Updated by Bryan Apperson about 7 years ago

Hi Ilya,

The log starts before the reboot - the corruption appeared before the reboot in the logs. The server had been up for almost 2 years when the corruption appeared. The server was rebooted after the first instance of corruption. We are working on getting that site off firefly and CentOS 7 presently. Is there a possibility that the network outage on Feb 19 could have caused this corruption?

I appreciate your fast reply.

Regards,
Bryan Apperson

Actions #4

Updated by Bryan Apperson about 7 years ago

Bryan Apperson wrote:

Hi Ilya,

The log starts before the reboot - the corruption appeared before the reboot in the logs. The server had been up for almost 2 years when the corruption appeared. The server was rebooted after the first instance of corruption. We are working on getting that site off firefly and CentOS 7 presently. Is there a possibility that the network outage on Feb 21 could have caused this corruption?

I appreciate your fast reply.

Regards,
Bryan Apperson

Actions #5

Updated by Bryan Apperson about 7 years ago

The RBD client system was gracefully rebooted at Mar 17 10:22:36.

Actions #6

Updated by Ilya Dryomov about 7 years ago

  • Status changed from New to Need More Info
  • Priority changed from High to Low

There is no way to tell at this point. It's a possibility, although if that were the case it would have manifested sooner, I think. The log indicates that a couple of XFS metadata blocks failed the CRC check, repeatedly. If you haven't seen any issues with the actual data, that's probably the extent of it.

Can you think of anything else from around March 16th?
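
One way to confirm that the damage stayed confined to metadata is a read-only check followed by a full read of the data; a sketch, again assuming the /dev/rbd1 mapping and a mount point of /mnt:

umount /mnt
xfs_repair -n /dev/rbd1    # read-only pass; a clean run means the metadata is consistent
mount /dev/rbd1 /mnt
# reading every file back surfaces any remaining IO errors in data blocks
find /mnt -xdev -type f -exec cat {} + > /dev/null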

Actions #7

Updated by Bryan Apperson about 7 years ago

Hi Ilya,

There were Samba client tests occurring on the platform that were timing out on oplocks. The RBD was being exported as a Samba share, but I find it unlikely that the corruption could have occurred through SMB.

Regards,
Bryan Apperson
