Bug #11586

RBD Data Corruption with Logging on Giant

Added by David Burley over 6 years ago. Updated about 6 years ago.

Status:
Duplicate
Priority:
Urgent
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have a ceph cluster running:
  • Ceph Giant (0.87) - Ceph repo RPMs
  • CentOS Linux release 7.0.1406 (Core)
  • Kernel 3.10.0-123.20.1.el7.x86_64

Our environment uses kernel RBDs exclusively, with 3x replication; some OSDs are in an SSD group, others in a spinner group -- but it's a pretty simple config.

In troubleshooting another issue, we increased logging for the OSD daemon on all of our OSDs to "10" via injectargs. Historically our cluster has had an inconsistent pg show up every few days. After increasing the log level, we hit > 20 inconsistencies on day 1, > 30 on day 2, and the count continued to increase daily. On day 5 we set the logging back to defaults; the following day we still had an increased incidence of inconsistencies found (> 100), followed by a sharp drop-off to 8, and then 4 on the following two days. All but 2 of these inconsistencies were repaired per the Ceph manual (ceph pg repair $PG). We assume the lag between disabling the setting and the drop in inconsistencies found is due to how deep scrubs are scheduled.
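For reference, the runtime log-level change and per-PG repair described above can be sketched as below. This is a hedged illustration, not taken verbatim from the incident: `CEPH` defaults to a dry-run `echo` so the commands can be inspected without a live cluster, and the PG id is a hypothetical placeholder.

```shell
# Sketch of the commands involved (Giant-era ceph CLI).
# CEPH defaults to "echo ceph" for a dry run without a live cluster.
CEPH="${CEPH:-echo ceph}"
PG="${PG:-2.30}"   # hypothetical placeholder PG id

# Raise OSD debug logging to level 10 on all OSDs at runtime:
$CEPH tell 'osd.*' injectargs '--debug-osd 10'

# Later, restore the default level:
$CEPH tell 'osd.*' injectargs '--debug-osd 0/5'

# Repair a PG that deep scrub flagged as inconsistent:
$CEPH pg repair "$PG"
```

Run with `CEPH=ceph` (and a real PG id) to execute against a cluster instead of echoing.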

In evaluating the 2 outstanding inconsistent PGs, we ran md5sum over the files in the pg directories on each related OSD. We then diffed the output and, in each of the two cases, found one file that differed on one OSD. We hex-dumped the differing files and examined their content (which would be chunks of RBD data). In one case, real file data had been overwritten by what appeared to be OSD log data belonging to the OSD the file was found on. In the other case, it was again OSD log data belonging to the same OSD that wrote the data, but appended to the end of the RBD data.

The log data should only exist on the filesystem of the OSD host, which is on local disk, not on anything backed by an RBD. So it appears that logging can corrupt RBDs in Giant. We're unsure whether this impacts clusters running the default logging configuration.

ceph-osd.0.log - 500 lines with debug filestore=20 (62.5 KB) David Burley, 05/13/2015 05:23 PM


Related issues

Related to Ceph - Bug #12465: Log::reopen_log_file() must take the flusher lock to avoid closing an fd ::_flush() is still using Resolved 07/24/2015

History

#1 Updated by Greg Farnum over 6 years ago

  • Tracker changed from Bug to Support

This sounds exactly like an issue we've seen before with, I think, a certain combination of XFS version and Ceph's use of some less common filesystem features. It's definitely not Ceph doing this but the local FS. Have you enabled anything in the filesystem config options?

And I'd recommend searching through the tracker and mailing list archives for similar stories.

#2 Updated by David Burley over 6 years ago

For reference, while I hunt in the ML/tracker:

# xfs_info /var/lib/ceph/osd/ceph-175
meta-data=/dev/sdl2              isize=2048   agcount=4, agsize=182531264 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=0
data     =                       bsize=4096   blocks=730125056, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
log      =internal               bsize=4096   blocks=356506, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# mount | grep 175
/dev/sdl2 on /var/lib/ceph/osd/ceph-175 type xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=256k,noquota)

#3 Updated by Florian Haas over 6 years ago

Greg Farnum wrote:

This sounds exactly like an issue we've seen before with, I think, a certain combination of XFS version and Ceph's use of some less common filesystem features. It's definitely not Ceph doing this but the local FS. Have you enabled anything in the filesystem config options?

And I'd recommend searching through the tracker and mailing list archives for similar stories.

But how exactly would that apply when not only were the OSD and the logs on different mount points, as they typically are, but if they also used entirely different filesystems (XFS and ext4)?

#4 Updated by David Burley over 6 years ago

Greg,

So to clarify: you believe that an XFS bug caused a Ceph OSD debug log message, destined for an ext4 filesystem on one disk, to end up in a file on an XFS filesystem on another disk on the same system -- and that this is a known issue someone else has hit (still hunting for said issue)?

Thank you,

David

#5 Updated by Greg Farnum over 6 years ago

That seems somewhat less likely; you'd better post all the details of your setup.

There might have been an issue with some versions of leveldb grabbing bad pages as well, but I'm a lot less certain about that one...

#6 Updated by David Burley over 6 years ago

Here's the universal ceph.conf file:

[global]
fsid = d87997d0-fa31-4b14-9568-581df94d7a5d
mon initial members = ceph-mon-1,ceph-mon-2,ceph-mon-3
mon host = 172.29.15.11,172.29.15.12,172.29.15.13
public network = 172.29.15.0/24
cluster network = 172.29.16.0/24
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd journal size = 10000
osd pool default pg num = 8192 
osd pool default pgp num = 8192
osd pool default min size = 1
osd crush chooseleaf type = 1
filestore flusher = false
osd max backfills = 1
osd recovery max active = 1
mon osd down out interval = 600

[mon.ceph-mon-1]
mon address = 172.29.15.11:6789

[mon.ceph-mon-2]
mon address = 172.29.15.12:6789

[mon.ceph-mon-3]
mon address = 172.29.15.13:6789

[osd]
osd mount options xfs = noatime,inode64,logbufs=8,logbsize=256k

[client]
rbd cache = true
rbd cache writethrough until flush = true

And here's the output of the mount command on one of the spinner-backed OSDs, the only bit this doesn't capture is that journaling for these drives is off to a raw 10GB partition on a 400GB Intel DC P3700:

# mount
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
devtmpfs on /dev type devtmpfs (rw,nosuid,size=32860564k,nr_inodes=8215141,mode=755)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,noexec,relatime)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,mode=755)
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
efivarfs on /sys/firmware/efi/efivars type efivarfs (rw,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
configfs on /sys/kernel/config type configfs (rw,relatime)
/dev/mapper/RootVG-RootVol00 on / type ext4 (rw,relatime,data=ordered)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=50,pgrp=1,timeout=300,minproto=5,maxproto=5,direct)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime)
mqueue on /dev/mqueue type mqueue (rw,relatime)
debugfs on /sys/kernel/debug type debugfs (rw,relatime)
sunrpc on /proc/fs/nfsd type nfsd (rw,relatime)
/dev/md0 on /boot type ext3 (rw,relatime,data=ordered)
/dev/sdj1 on /boot/efi type vfat (rw,relatime,fmask=0077,dmask=0077,codepage=437,iocharset=ascii,shortname=winnt,errors=remount-ro)
/dev/mapper/RootVG-TmpVol00 on /tmp type ext4 (rw,nosuid,nodev,noexec,noatime,data=ordered)
/dev/mapper/RootVG-VarVol00 on /var type ext4 (rw,nodev,noatime,data=ordered)
/dev/mapper/RootVG-HomeVol00 on /home type ext4 (rw,nosuid,nodev,noatime,data=ordered)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw,relatime)
/dev/mapper/RootVG-LogVol00 on /var/log type ext4 (rw,nodev,noatime,data=ordered)
/dev/mapper/RootVG-TmpVol00 on /var/tmp type ext4 (rw,nosuid,nodev,noexec,noatime,data=ordered)
/dev/mapper/RootVG-AuditVol00 on /var/log/audit type ext4 (rw,nodev,noatime,data=ordered)
/dev/sdg2 on /var/lib/ceph/osd/ceph-120 type xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=256k,noquota)
/dev/sdl2 on /var/lib/ceph/osd/ceph-122 type xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=256k,noquota)
/dev/sdh2 on /var/lib/ceph/osd/ceph-121 type xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=256k,noquota)
/dev/sdc2 on /var/lib/ceph/osd/ceph-128 type xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=256k,noquota)
/dev/sdf2 on /var/lib/ceph/osd/ceph-124 type xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=256k,noquota)
/dev/sda2 on /var/lib/ceph/osd/ceph-127 type xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=256k,noquota)
/dev/sdn2 on /var/lib/ceph/osd/ceph-131 type xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=256k,noquota)
/dev/sdm2 on /var/lib/ceph/osd/ceph-129 type xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=256k,noquota)
/dev/sdb2 on /var/lib/ceph/osd/ceph-126 type xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=256k,noquota)
/dev/sdd2 on /var/lib/ceph/osd/ceph-125 type xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=256k,noquota)
/dev/sdi2 on /var/lib/ceph/osd/ceph-123 type xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=256k,noquota)
/dev/sde2 on /var/lib/ceph/osd/ceph-130 type xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=256k,noquota)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,relatime)

If you need anything else please let me know.

#7 Updated by David Burley over 6 years ago

Shouldn't this issue also be reclassified as a bug, since it's not known to be anything else at this point?

#8 Updated by Sage Weil over 6 years ago

  • Tracker changed from Support to Bug
  • Priority changed from Normal to High
  • Regression set to No

This sounds to me like XFS is doing something wrong, but I'm not certain what.

Can you attach the first few hundred lines of the OSD log after a startup with 'debug filestore = 20'? Specifically, I'd like to see the filestore detect_fs output so we can tell which fs features it's trying to use.

Thanks!

#9 Updated by Sage Weil over 6 years ago

Also, note that you're using an older release (Giant, 0.87). I suggest upgrading to either the latest Giant bugfix release (0.87.2) or Hammer.

My guess is that upgrading to a newer kernel will help, but hopefully we can figure out what the issue is first!

#10 Updated by David Burley over 6 years ago

Log is attached.

I would upgrade but would ideally like to get to root cause first so others won't have this issue.

#11 Updated by David Burley over 6 years ago

Any other data I can collect on this to help?

#12 Updated by Sage Weil over 6 years ago

David Burley wrote:

Any other data I can collect on this to help?

Sadly it's not the same bug we hit before that caused garbage data to appear in the object, and you're not using the extsize ioctl. But some other XFS bug is definitely at fault here. :(

#13 Updated by Sage Weil over 6 years ago

  • Status changed from New to Rejected

#14 Updated by Sage Weil over 6 years ago

Actually, XFS corruption makes no sense if /var/log/ceph is on a different fs than the individual OSDs (which it presumably is).

Given we haven't seen this anywhere else, I would strongly recommend getting off giant.

#15 Updated by Florian Haas over 6 years ago

Sage Weil wrote:

Actually, XFS corruption makes no sense if /var/log/ceph is on a different fs than the individual OSDs (which it presumably is).

Well, that's kinda what I said in comment 3... :)

#16 Updated by David Burley over 6 years ago

We upgraded to a more recent kernel and to 0.94.2 on the ceph-osd/mon nodes, as you suggested, a month or more ago. We have not retested the failure scenario since the upgrade.

#17 Updated by Samuel Just about 6 years ago

Actually, when this happened with the extsize ioctl, the logs were also on a different mountpoint, on ext4. It turned out XFS wasn't cleaning new pages in certain cases where they should have been zeroed.

#18 Updated by Sage Weil about 6 years ago

  • Status changed from Rejected to 12
  • Priority changed from High to Urgent

observed on hammer, too.

#19 Updated by Sage Weil about 6 years ago

Sage Weil wrote:

observed on hammer, too.

Actually, David, can you confirm that this happened on a hammer OSD (and not a giant OSD during the upgrade)?

#20 Updated by David Burley about 6 years ago

We didn't do the upgrade until after this incident, so we don't have data on whether this occurs in hammer or not.

#21 Updated by Samuel Just about 6 years ago

When is the most recent time you have seen this corruption?

#22 Updated by David Burley about 6 years ago

May 8, I believe. We restored the individual files on the RBDs that we were able to identify as impacted. The filesystems had been online and usable until last Thursday, however, when they all had to be remounted.

#23 Updated by Samuel Just about 6 years ago

Why did they have to be remounted?

#24 Updated by Samuel Just about 6 years ago

I assume that since may 8 scrubbing had been running and had not been turning up very many inconsistent objects?

#25 Updated by David Burley about 6 years ago

Several mon processes were OOM-killed across the cluster, causing the mons to lose quorum. The NFS cluster then did some failovers which didn't complete properly due to the lack of quorum, etc. Looking at monitoring after the fact, it looks like the mon processes may be leaking memory on the non-primary mons; however, we haven't collected enough data yet to file a ticket on it.

After May 8, deep scrubs turned up fewer and fewer inconsistent objects, and we fixed them as they were found.

#26 Updated by Samuel Just about 6 years ago

How does nfs fit in? Do you have nodes which have local fses on krbd which they export over nfs?

#27 Updated by David Burley about 6 years ago

We have a mix of use cases for the RBDs, and all of the use cases were impacted if the volume was created prior to the increase in OSD log levels. We do export a number of these RBDs via knfsd.

#28 Updated by Samuel Just about 6 years ago

How does the failover part work?

#29 Updated by David Burley about 6 years ago

It's implemented as described here (except that this doc is old and the commands needed updating to work on newer software revisions):
http://www.sebastien-han.fr/blog/2012/07/06/nfs-over-rbd/

#30 Updated by Samuel Just about 6 years ago

How do the new failover nfs mounts blacklist the old ones?

#31 Updated by Samuel Just about 6 years ago

The rbd part, that is.

#32 Updated by David Burley about 6 years ago

Pacemaker is managing the resource and ensuring the rbd is only present on one node at a time. I am not sure how this is relevant, though; this issue impacted more than just the filesystems mounted via NFS. We also had filesystems used for local storage that were impacted. Anything created prior to ~May 8 on an RBD was impacted.

#33 Updated by Samuel Just about 6 years ago

Impacted by this immediate outage, or by the logging related inconsistent objects?

#34 Updated by David Burley about 6 years ago

Both. We had some ext4 or XFS filesystems mounted directly on database hosts that were also corrupted, for instance. Those RBDs are not shared with any other host and have no failover mechanism. The majority of the impacted data is served up via NFS, but the corruption is not exclusive to it. Filesystems created prior to ~May 8 are impacted (whether shared via NFS or local); those created after have been fine.

#35 Updated by Samuel Just about 6 years ago

Have you just been using ceph pg repair to repair the inconsistent objects as they come up?

#36 Updated by David Burley about 6 years ago

We did initially, since that's what the docs said to do:
http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#pgs-inconsistent

We then realized that wasn't the right approach and scripted a tool that checksummed the inconsistent object on all of the OSDs and replaced the odd man out with the majority -- which was quite safe in this instance, since the injected log data was local and wouldn't have been propagated to another node unless ceph pg repair was called blindly, as it was initially.
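The majority-vote selection that tool performed can be sketched with plain shell. This is a self-contained demo, not the actual script: three local temp files stand in for the same object on three OSDs, and the file names and contents are invented for illustration.

```shell
# Demo of majority-vote repair selection: checksum each replica of an object
# and flag the one whose checksum disagrees with the majority.
tmp=$(mktemp -d)
printf 'rbd data chunk' > "$tmp/osd0.obj"
printf 'rbd data chunk' > "$tmp/osd1.obj"
printf 'rbd data chunk plus injected log spam' > "$tmp/osd2.obj"  # corrupt replica

# The checksum held by the majority of replicas is treated as authoritative:
majority=$(md5sum "$tmp"/osd*.obj | awk '{print $1}' | sort | uniq -c \
           | sort -rn | awk 'NR==1 {print $2}')

# The odd man out is the replica whose checksum differs from the majority:
bad=$(md5sum "$tmp"/osd*.obj | awk -v m="$majority" '$1 != m {print $2}')
echo "replace from majority copy: $bad"
rm -rf "$tmp"
```

With 3x replication a 2-of-3 majority is unambiguous, which is what makes this safer than a blind `ceph pg repair` (which can copy the primary's version even when the primary holds the corrupt replica).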

#37 Updated by Samuel Just about 6 years ago

Sounds reasonable. Do you have any remaining corrupted/non-corrupted rbd block pairs you can upload?

#38 Updated by David Burley about 6 years ago

None in an unfixed state. I can probably hunt down a copy of an inconsistent pg object with the corruption though, given the fixer tool saved a copy when it made a replacement. Would that be helpful?

#39 Updated by David Burley about 6 years ago

I just looked for one, and the script stored them in /tmp, which has been cleared since. If you really need an example, I can see if I have any lying around elsewhere. But the data was literally OSD log data from the OSD that would have been writing to the pg object in question (since they log their OSD number in those messages).

#40 Updated by Samuel Just about 6 years ago

Yep, I want to know what was on the clean object at the offsets where the garbage shows up.

#41 Updated by Samuel Just about 6 years ago

"In the other case, it was again OSD log data belonging to that same OSD that wrote the data, but appended to the end of the RBD data."

Do you mean that the rbd object file on that osd wound up larger than 4MB?

#42 Updated by Samuel Just about 6 years ago

#12465 might be a plausible explanation. Can you reproduce on a disposable cluster? For #12465 to be the culprit:
1) The OSD needs to be under a constant random small write load to lots of different objects.
2) There needs to be a ton being written to the log. I understand that another bug was causing a tight loop to spam the log at the time. That bug is probably necessary to reproduce the corruption.
3) Log rotation needs to be running.

You probably don't need very many osds or mons to reproduce. The particular hardware, kernel, and Ceph version might be important for the thread timings, though #12465 is present in firefly, hammer, and current master as well. The 100% CPU spinning from the log spam is probably necessary.
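The three conditions above might be exercised with something like the sketch below. It is an outline under stated assumptions, not a turnkey reproducer: the pool name "repro" is a placeholder, and `CEPH`/`RADOS`/`LOGROTATE` default to a dry-run `echo` so the shape can be inspected without a cluster.

```shell
# Dry-run sketch of a #12465 reproducer setup (placeholder pool "repro").
CEPH="${CEPH:-echo ceph}"
RADOS="${RADOS:-echo rados}"
LOGROTATE="${LOGROTATE:-echo logrotate}"

# 1) Constant random small-write load to lots of objects (4 KB writes):
$RADOS bench -p repro 600 write -b 4096 -t 16 --no-cleanup

# 2) Heavy log volume: crank filestore debugging on every OSD:
$CEPH tell 'osd.*' injectargs '--debug-filestore 20'

# 3) Force log rotation (repeat this while the write load is running):
$LOGROTATE -f /etc/logrotate.d/ceph
```

On a real disposable cluster you would run step 1 in the background and loop step 3 for the duration of the load, then deep-scrub and look for inconsistent objects.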

#43 Updated by David Burley about 6 years ago

Samuel Just wrote:

Do you mean that the rbd object file on that osd wound up larger than 4MB?

Yes, it was 4.7MB. You can see the object in Red Hat support ticket 01482199.

#44 Updated by David Burley about 6 years ago

Samuel Just wrote:

#12465 might be a plausible explanation. Can you reproduce on a disposable cluster? For #12465 to be the culprit:
1) The OSD needs to be under a constant random small write load to lots of different objects.
2) There needs to be a ton being written to the log. I understand that another bug was causing a tight loop to spam the log at the time. That bug is probably necessary to reproduce the corruption.
3) Log rotation needs to be running.

You probably don't need very many osds or mons to reproduce. The particular hardware, kernel, and Ceph version might be important for the thread timings, though #12465 is present in firefly, hammer, and current master as well. The 100% CPU spinning from the log spam is probably necessary.

Samuel,

So do you believe that turning off the debug-{ms,filestore,osd} logging will also work around the CPU-saturation part of this, at least until the underlying defect is fixed?

Thank you,

David

#45 Updated by Sage Weil about 6 years ago

  • Status changed from 12 to Duplicate

root cause is #12465
