Bug #11586

RBD Data Corruption with Logging on Giant

Added by David Burley over 6 years ago. Updated about 6 years ago.

Status:
Duplicate
Priority:
Urgent
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have a ceph cluster running:
  • Ceph Giant (0.87) - Ceph repo RPMs
  • CentOS Linux release 7.0.1406 (Core)
  • Kernel 3.10.0-123.20.1.el7.x86_64

Our environment uses kernel RBDs exclusively, with 3x replication; some OSDs are in an SSD group, others in a spinner group -- but it's a pretty simple config.

In troubleshooting another issue, we increased logging for the OSD daemon on all of our OSDs to "10" via injectargs. Historically our cluster has had an inconsistent pg show up every few days. After increasing the log level, we hit > 20 inconsistencies on day 1, > 30 on day 2, and the count continued to increase daily. On day 5 we set the logging back to defaults; the following day we still had an increased incidence of inconsistencies found (> 100), followed by a sharp drop-off to 8, and then 4 on the following two days. All but 2 of these inconsistencies were repaired per the Ceph manual (ceph pg repair $PG). We assume the lag between disabling the setting and the drop in inconsistencies found is due to how deep scrubs are scheduled.
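For reference, the runtime log-level change and per-PG repair described above can be sketched as below. This is a hedged illustration, not taken verbatim from the incident: `CEPH` defaults to a dry-run `echo` so the commands can be inspected without a live cluster, and the PG id is a hypothetical placeholder.

```shell
# Sketch of the commands involved (Giant-era ceph CLI).
# CEPH defaults to "echo ceph" for a dry run without a live cluster.
CEPH="${CEPH:-echo ceph}"
PG="${PG:-2.30}"   # hypothetical placeholder PG id

# Raise OSD debug logging to level 10 on all OSDs at runtime:
$CEPH tell 'osd.*' injectargs '--debug-osd 10'

# Later, restore the default level:
$CEPH tell 'osd.*' injectargs '--debug-osd 0/5'

# Repair a PG that deep scrub flagged as inconsistent:
$CEPH pg repair "$PG"
```

Run with `CEPH=ceph` (and a real PG id) to execute against a cluster instead of echoing.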

In evaluating the 2 outstanding inconsistent PGs, we ran md5sum over the files in the pg directories on each related OSD. We then diffed the output and, in each of the two cases, found one file that differed on one OSD. We hex-dumped the differing files and examined their content (which would be chunks of RBD data). In one case, real file data had been overwritten by what appeared to be OSD log data belonging to the OSD the file was found on. In the other case, it was again OSD log data belonging to the same OSD that wrote the data, but appended to the end of the RBD data.

The log data should only exist on the filesystem of the OSD host, which is on local disk, not on anything backed by an RBD. So it appears that logging can corrupt RBDs in Giant. We're unsure whether this impacts clusters running the default logging configuration.

ceph-osd.0.log - 500 lines with debug filestore=20 (62.5 KB) David Burley, 05/13/2015 05:23 PM


Related issues

Related to Ceph - Bug #12465: Log::reopen_log_file() must take the flusher lock to avoid closing an fd ::_flush() is still using Resolved 07/24/2015

History

#1 Updated by Greg Farnum over 6 years ago

  • Tracker changed from Bug to Support

This sounds exactly like an issue we've seen before with, I think, a certain combination of XFS version and Ceph's use of some less common filesystem features. It's definitely not Ceph doing this but the local FS. Have you enabled anything in the filesystem config options?

And I'd recommend searching through the tracker and mailing list archives for similar stories.

#2 Updated by David Burley over 6 years ago

For reference, while I hunt in the ML/tracker:

# xfs_info /var/lib/ceph/osd/ceph-175
meta-data=/dev/sdl2              isize=2048   agcount=4, agsize=182531264 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=0
data     =                       bsize=4096   blocks=730125056, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
log      =internal               bsize=4096   blocks=356506, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# mount | grep 175
/dev/sdl2 on /var/lib/ceph/osd/ceph-175 type xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=256k,noquota)

#3 Updated by Florian Haas over 6 years ago

Greg Farnum wrote:

This sounds exactly like an issue we've seen before with, I think, a certain combination of XFS version and Ceph's use of some less common filesystem features. It's definitely not Ceph doing this but the local FS. Have you enabled anything in the filesystem config options?

And I'd recommend searching through the tracker and mailing list archives for similar stories.

But how exactly would that apply when not only were the OSD and the logs on different mount points, as they typically are, but if they also used entirely different filesystems (XFS and ext4)?

#4 Updated by David Burley over 6 years ago

Greg,

So to clarify: you believe that an XFS bug caused a Ceph OSD debug log message, destined for an ext4 filesystem on one disk, to end up in a file on an XFS filesystem on another disk on the same system -- and that this is a known issue someone else has hit (still hunting for said issue)?

Thank you,

David

#5 Updated by Greg Farnum over 6 years ago

That seems somewhat less likely; you'd better post all the details of your setup.

There might have been an issue with some versions of leveldb grabbing bad pages as well, but I'm a lot less certain about that one...

#6 Updated by David Burley over 6 years ago

Here's the universal ceph.conf file:

[global]
fsid = d87997d0-fa31-4b14-9568-581df94d7a5d
mon initial members = ceph-mon-1,ceph-mon-2,ceph-mon-3
mon host = 172.29.15.11,172.29.15.12,172.29.15.13
public network = 172.29.15.0/24
cluster network = 172.29.16.0/24
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd journal size = 10000
osd pool default pg num = 8192 
osd pool default pgp num = 8192
osd pool default min size = 1
osd crush chooseleaf type = 1
filestore flusher = false
osd max backfills = 1
osd recovery max active = 1
mon osd down out interval = 600

[mon.ceph-mon-1]
mon address = 172.29.15.11:6789

[mon.ceph-mon-2]
mon address = 172.29.15.12:6789

[mon.ceph-mon-3]
mon address = 172.29.15.13:6789

[osd]
osd mount options xfs = noatime,inode64,logbufs=8,logbsize=256k

[client]
rbd cache = true
rbd cache writethrough until flush = true

And here's the output of the mount command on one of the spinner-backed OSDs, the only bit this doesn't capture is that journaling for these drives is off to a raw 10GB partition on a 400GB Intel DC P3700:

# mount
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
devtmpfs on /dev type devtmpfs (rw,nosuid,size=32860564k,nr_inodes=8215141,mode=755)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,noexec,relatime)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,mode=755)
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
efivarfs on /sys/firmware/efi/efivars type efivarfs (rw,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
configfs on /sys/kernel/config type configfs (rw,relatime)
/dev/mapper/RootVG-RootVol00 on / type ext4 (rw,relatime,data=ordered)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=50,pgrp=1,timeout=300,minproto=5,maxproto=5,direct)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime)
mqueue on /dev/mqueue type mqueue (rw,relatime)
debugfs on /sys/kernel/debug type debugfs (rw,relatime)
sunrpc on /proc/fs/nfsd type nfsd (rw,relatime)
/dev/md0 on /boot type ext3 (rw,relatime,data=ordered)
/dev/sdj1 on /boot/efi type vfat (rw,relatime,fmask=0077,dmask=0077,codepage=437,iocharset=ascii,shortname=winnt,errors=remount-ro)
/dev/mapper/RootVG-TmpVol00 on /tmp type ext4 (rw,nosuid,nodev,noexec,noatime,data=ordered)
/dev/mapper/RootVG-VarVol00 on /var type ext4 (rw,nodev,noatime,data=ordered)
/dev/mapper/RootVG-HomeVol00 on /home type ext4 (rw,nosuid,nodev,noatime,data=ordered)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw,relatime)
/dev/mapper/RootVG-LogVol00 on /var/log type ext4 (rw,nodev,noatime,data=ordered)
/dev/mapper/RootVG-TmpVol00 on /var/tmp type ext4 (rw,nosuid,nodev,noexec,noatime,data=ordered)
/dev/mapper/RootVG-AuditVol00 on /var/log/audit type ext4 (rw,nodev,noatime,data=ordered)
/dev/sdg2 on /var/lib/ceph/osd/ceph-120 type xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=256k,noquota)
/dev/sdl2 on /var/lib/ceph/osd/ceph-122 type xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=256k,noquota)
/dev/sdh2 on /var/lib/ceph/osd/ceph-121 type xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=256k,noquota)
/dev/sdc2 on /var/lib/ceph/osd/ceph-128 type xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=256k,noquota)
/dev/sdf2 on /var/lib/ceph/osd/ceph-124 type xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=256k,noquota)
/dev/sda2 on /var/lib/ceph/osd/ceph-127 type xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=256k,noquota)
/dev/sdn2 on /var/lib/ceph/osd/ceph-131 type xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=256k,noquota)
/dev/sdm2 on /var/lib/ceph/osd/ceph-129 type xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=256k,noquota)
/dev/sdb2 on /var/lib/ceph/osd/ceph-126 type xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=256k,noquota)
/dev/sdd2 on /var/lib/ceph/osd/ceph-125 type xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=256k,noquota)
/dev/sdi2 on /var/lib/ceph/osd/ceph-123 type xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=256k,noquota)
/dev/sde2 on /var/lib/ceph/osd/ceph-130 type xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=256k,noquota)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,relatime)

If you need anything else please let me know.

#7 Updated by David Burley over 6 years ago

Shouldn't this issue also be reclassified as a bug, since it's not known to be anything else at this point?

#8 Updated by Sage Weil over 6 years ago

  • Tracker changed from Support to Bug
  • Priority changed from Normal to High
  • Regression set to No

This sounds to me like XFS is doing something wrong, but I'm not certain what.

Can you attach the first few hundred lines of the OSD log after a startup with 'debug filestore = 20'? Specifically, I'd like to see the filestore detect_fs output so we can tell which fs features it's trying to use.

Thanks!

#9 Updated by Sage Weil over 6 years ago

Also, note that you're using an older release (Giant, 0.87). I suggest upgrading to either the latest Giant bugfix release (0.87.2) or Hammer.

My guess is that upgrading to a newer kernel will help, but hopefully we can figure out what the issue is first!

#10 Updated by David Burley over 6 years ago

Log is attached.

I would upgrade but would ideally like to get to root cause first so others won't have this issue.

#11 Updated by David Burley over 6 years ago

Any other data I can collect on this to help?

#12 Updated by Sage Weil over 6 years ago

David Burley wrote:

Any other data I can collect on this to help?

Sadly it's not the same bug we hit before that caused garbage data to appear in the object, and you're not using the extsize ioctl. But some other XFS bug is definitely at fault here. :(

#13 Updated by Sage Weil over 6 years ago

  • Status changed from New to Rejected

#14 Updated by Sage Weil over 6 years ago

Actually, XFS corruption makes no sense if /var/log/ceph is on a different fs than the individual OSDs (which it presumably is).

Given we haven't seen this anywhere else, I would strongly recommend getting off giant.

#15 Updated by Florian Haas over 6 years ago

Sage Weil wrote:

Actually, XFS corruption makes no sense if /var/log/ceph is on a different fs than the individual OSDs (which it presumably is).

Well, that's kinda what I said in comment 3... :)

#16 Updated by David Burley over 6 years ago

We upgraded to a more recent kernel and to 0.94.2 on the ceph-osd/mon nodes, as you suggested, a month or more ago. We have not retested the failure scenario since the upgrade.

#17 Updated by Samuel Just about 6 years ago

Actually, when this happened with the extsize ioctl, the logs were also on a different mountpoint, on ext4. It turned out XFS wasn't cleaning new pages in certain cases where they should have been zeroed.

#18 Updated by Sage Weil about 6 years ago

  • Status changed from Rejected to 12
  • Priority changed from High to Urgent

observed on hammer, too.

#19 Updated by Sage Weil about 6 years ago

Sage Weil wrote:

observed on hammer, too.

Actually, David, can you confirm that this happened on a hammer OSD (and not a giant OSD during the upgrade)?

#20 Updated by David Burley about 6 years ago

We didn't do the upgrade until after this incident, so we don't have data on whether this occurs in hammer or not.

#21 Updated by Samuel Just about 6 years ago

When is the most recent time you have seen this corruption?

#22 Updated by David Burley about 6 years ago

May 8, I believe. We restored the individual files on the RBDs that we were able to identify as impacted. The filesystems had been online and usable until last Thursday, however, when they all had to be remounted.

#23 Updated by Samuel Just about 6 years ago

Why did they have to be remounted?

#24 Updated by Samuel Just about 6 years ago

I assume that since may 8 scrubbing had been running and had not been turning up very many inconsistent objects?

#25 Updated by David Burley about 6 years ago

Several mon processes were OOM-killed across the cluster, causing the mons to lose quorum. The NFS cluster then did some failovers which didn't complete properly due to the lack of quorum, etc. Looking at monitoring after the fact, it looks like the mon processes may be leaking memory on the non-primary mons; however, we haven't collected enough data yet to file a ticket on it.

After May 8, deep scrubs turned up fewer and fewer inconsistent objects, and we fixed them as they were found.

#26 Updated by Samuel Just about 6 years ago

How does nfs fit in? Do you have nodes which have local fses on krbd which they export over nfs?

#27 Updated by David Burley about 6 years ago

We have a mix of use cases for the RBDs, and all of the use cases were impacted if the volume was created prior to the increase in OSD log levels. We do export a number of these RBDs via knfsd.

#28 Updated by Samuel Just about 6 years ago

How does the failover part work?

#29 Updated by David Burley about 6 years ago

It's implemented as described here (except that this doc is old and the commands needed updating to work on newer software revisions):
http://www.sebastien-han.fr/blog/2012/07/06/nfs-over-rbd/

#30 Updated by Samuel Just about 6 years ago

How do the new failover nfs mounts blacklist the old ones?

#31 Updated by Samuel Just about 6 years ago

The rbd part, that is.

#32 Updated by David Burley about 6 years ago

Pacemaker is managing the resource and ensuring the rbd is only present on one node at a time. I am not sure how this is relevant, though; this issue impacted more than just the filesystems mounted via NFS. We also had filesystems used for local storage that were impacted. Anything created prior to ~May 8 on an RBD was impacted.

#33 Updated by Samuel Just about 6 years ago

Impacted by this immediate outage, or by the logging related inconsistent objects?

#34 Updated by David Burley about 6 years ago

Both. We had some ext4 or XFS filesystems mounted directly on database hosts that were also corrupted, for instance. Those RBDs are not shared with any other host and have no failover mechanism. The majority of the impacted data is served up via NFS, but the corruption is not exclusive to it. Filesystems created prior to ~May 8 are impacted (whether shared via NFS or local); those created after have been fine.

#35 Updated by Samuel Just about 6 years ago

Have you just been using ceph pg repair to repair the inconsistent objects as they come up?

#36 Updated by David Burley about 6 years ago

We did initially, since that's what the docs said to do:
http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#pgs-inconsistent

We then realized that wasn't the right approach and scripted a tool that checksummed the inconsistent object on all of the OSDs and replaced the odd man out with the majority -- which was quite safe in this instance, since the injected log data was local and wouldn't have been propagated to another node unless ceph pg repair was called blindly, as it was initially.
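The majority-vote selection that tool performed can be sketched with plain shell. This is a self-contained demo, not the actual script: three local temp files stand in for the same object on three OSDs, and the file names and contents are invented for illustration.

```shell
# Demo of majority-vote repair selection: checksum each replica of an object
# and flag the one whose checksum disagrees with the majority.
tmp=$(mktemp -d)
printf 'rbd data chunk' > "$tmp/osd0.obj"
printf 'rbd data chunk' > "$tmp/osd1.obj"
printf 'rbd data chunk plus injected log spam' > "$tmp/osd2.obj"  # corrupt replica

# The checksum held by the majority of replicas is treated as authoritative:
majority=$(md5sum "$tmp"/osd*.obj | awk '{print $1}' | sort | uniq -c \
           | sort -rn | awk 'NR==1 {print $2}')

# The odd man out is the replica whose checksum differs from the majority:
bad=$(md5sum "$tmp"/osd*.obj | awk -v m="$majority" '$1 != m {print $2}')
echo "replace from majority copy: $bad"
rm -rf "$tmp"
```

With 3x replication a 2-of-3 majority is unambiguous, which is what makes this safer than a blind `ceph pg repair` (which can copy the primary's version even when the primary holds the corrupt replica).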

#37 Updated by Samuel Just about 6 years ago

Sounds reasonable. Do you have any remaining corrupted/non-corrupted rbd block pairs you can upload?

#38 Updated by David Burley about 6 years ago

None in an unfixed state. I can probably hunt down a copy of an inconsistent pg object with the corruption though, given the fixer tool saved a copy when it made a replacement. Would that be helpful?

#39 Updated by David Burley about 6 years ago

I just looked for one, and the script stored them in /tmp, which has been cleared since. If you really need an example, I can see if I have any lying around elsewhere. But the data was literally OSD log data from the OSD that would have been writing to the pg object in question (since they log their OSD number in those messages).

#40 Updated by Samuel Just about 6 years ago

Yep, I want to know what was on the clean object at the offsets where the garbage shows up.

#41 Updated by Samuel Just about 6 years ago

"In the other case, it was again OSD log data belonging to that same OSD that wrote the data, but appended to the end of the RBD data."

Do you mean that the rbd object file on that osd wound up larger than 4MB?

#42 Updated by Samuel Just about 6 years ago

#12465 might be a plausible explanation. Can you reproduce on a disposable cluster? For #12465 to be the culprit:
1) The OSD needs to be under a constant random small write load to lots of different objects.
2) There needs to be a ton being written to the log. I understand that another bug was causing a tight loop to spam the log at the time. That bug is probably necessary to reproduce the corruption.
3) Log rotation needs to be running.

You probably don't need very many osds or mons to reproduce. The particular hardware, kernel, and Ceph version might be important for the thread timings, though #12465 is present in firefly, hammer, and current master as well. The 100% CPU spinning from the log spam is probably necessary.
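The three conditions above might be exercised with something like the sketch below. It is an outline under stated assumptions, not a turnkey reproducer: the pool name "repro" is a placeholder, and `CEPH`/`RADOS`/`LOGROTATE` default to a dry-run `echo` so the shape can be inspected without a cluster.

```shell
# Dry-run sketch of a #12465 reproducer setup (placeholder pool "repro").
CEPH="${CEPH:-echo ceph}"
RADOS="${RADOS:-echo rados}"
LOGROTATE="${LOGROTATE:-echo logrotate}"

# 1) Constant random small-write load to lots of objects (4 KB writes):
$RADOS bench -p repro 600 write -b 4096 -t 16 --no-cleanup

# 2) Heavy log volume: crank filestore debugging on every OSD:
$CEPH tell 'osd.*' injectargs '--debug-filestore 20'

# 3) Force log rotation (repeat this while the write load is running):
$LOGROTATE -f /etc/logrotate.d/ceph
```

On a real disposable cluster you would run step 1 in the background and loop step 3 for the duration of the load, then deep-scrub and look for inconsistent objects.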

#43 Updated by David Burley about 6 years ago

Samuel Just wrote:

Do you mean that the rbd object file on that osd wound up larger than 4MB?

Yes, it was 4.7MB. You can see the object in Red Hat support ticket 01482199.

#44 Updated by David Burley about 6 years ago

Samuel Just wrote:

#12465 might be a plausible explanation. Can you reproduce on a disposable cluster? For #12465 to be the culprit:
1) The OSD needs to be under a constant random small write load to lots of different objects.
2) There needs to be a ton being written to the log. I understand that another bug was causing a tight loop to spam the log at the time. That bug is probably necessary to reproduce the corruption.
3) Log rotation needs to be running.

You probably don't need very many osds or mons to reproduce. The particular hardware, kernel, and Ceph version might be important for the thread timings, though #12465 is present in firefly, hammer, and current master as well. The 100% CPU spinning from the log spam is probably necessary.

Samuel,

So do you believe that turning off the debug-{ms,filestore,osd} logging will also work around the CPU-saturation part of this, at least until the underlying defect is fixed?

Thank you,

David

#45 Updated by Sage Weil about 6 years ago

  • Status changed from 12 to Duplicate

root cause is #12465
