Feature #8343

closed

please enable data integrity checking (by default) / silent data corruption

Added by Dmitry Smirnov almost 10 years ago. Updated over 9 years ago.

Status: Closed
Priority: High
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

I'm scrubbing CephFS with fsprobe which detected data corruption around the time when one OSD died due to Btrfs file system corruption (Linux 3.14.2 is still not very stable with Btrfs snapshots and it occasionally crashes).

2014-05-13 22:08:31 diff: cycle 7 bufcnt 15204 bufpos 00000038 56 offset 36400038 910164024 expected 0xAA got 0x00
2014-05-13 22:08:31 diff: cycle 7 bufcnt 15204 bufpos 00000039 57 offset 36400039 910164025 expected 0xAA got 0x00
2014-05-13 22:08:31 diff: cycle 7 bufcnt 15204 bufpos 0000003A 58 offset 3640003A 910164026 expected 0xAA got 0x00
2014-05-13 22:08:31 diff: cycle 7 bufcnt 15204 bufpos 0000003B 59 offset 3640003B 910164027 expected 0xAA got 0x00
2014-05-13 22:08:31 diff: cycle 7 bufcnt 15204 bufpos 0000003C 60 offset 3640003C 910164028 expected 0xAA got 0x00
2014-05-13 22:08:31 diff: cycle 7 bufcnt 15204 bufpos 0000003D 61 offset 3640003D 910164029 expected 0xAA got 0x00
2014-05-13 22:08:31 diff: cycle 7 bufcnt 15204 bufpos 0000003E 62 offset 3640003E 910164030 expected 0xAA got 0x00
2014-05-13 22:08:31 total 1048576 differing bytes found

At least 16 MiB of data was corrupted. In all instances zeroes were read instead of the previously written data.

Unfortunately that was the worst type of corruption -- a silent one.
I examined ceph.log and found nothing unusual that could suggest that kind of damage.

This happened on a replicated pool (v0.80.1) with 4 replicas and min_size==2, so the corruption could have been avoided.
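
(For context, the replication settings referred to here are the pool's size and min_size values; roughly, with a placeholder pool name:)

ceph osd pool get <poolname> size        # replica count (4 in this case)
ceph osd pool get <poolname> min_size    # replicas required before I/O is served (2 in this case)
ceph osd pool set <poolname> size 4
ceph osd pool set <poolname> min_size 2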

IMHO data integrity is of paramount importance for Ceph.
Please implement (or enable) data integrity checking, preferably by default.


Related issues (1 open, 0 closed)

Related to RADOS - Feature #8609: Improve ceph pg repair (New)

Actions #1

Updated by Dmitry Smirnov almost 10 years ago

P.S. Naturally, the corruption was not limited to the test data file -- at least one MySQL database was lost on an RBD device (it is being restored from backup now), and who knows what else was damaged... :(
Previously I lost some OSDs under similar circumstances, but it did not occur to me that this might lead to loss of data.

Loss (and replacement) of an OSD should be expected and treated as a routine operation.
Under no circumstances should data corruption be tolerated or go undetected...

Actions #2

Updated by Greg Farnum almost 10 years ago

You're going to need to provide more details; as described this doesn't quite make sense.
Did you witness filesystem corruption and then return the OSD to service after doing some kind of maintenance?
btrfs has its own integrity verification (which we rely on to work), so it should have returned an error. Or did the entire file get lost?
Has scrub turned up any issues? Have you tried scrubbing the OSD that got corrupted?
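
(For reference, a deep scrub can be requested per OSD or per PG with commands of this form -- the IDs below are only examples:)

ceph osd deep-scrub 6       # deep-scrub everything hosted by osd.6
ceph pg deep-scrub 2.19     # deep-scrub a single placement group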

It sounds to me like btrfs simply lost old files and isn't saying so when we access them. Ceph does not now, and probably never will, handle that transparently.

Actions #3

Updated by Dmitry Smirnov almost 10 years ago

Sure, but instead of going deep into the probably irrelevant details of how the OSD was lost, I would appreciate more focus on the real problem, which is the lack of data checksumming in Ceph.

I was notified about the data corruption over email (and logs) shortly after a kernel error on the OSD host:

May 13 22:07:28 deblab kernel: [445905.205712] ------------[ cut here ]------------
May 13 22:07:28 deblab kernel: [445905.205757] WARNING: CPU: 2 PID: 22899 at /build/linux-Tp0YRv/linux-3.14.2/fs/btrfs/extent-tree.c:5749 __btrfs_free_extent+0xa23/0xaa0 [btrfs]()
May 13 22:07:28 deblab kernel: [445905.205760] Modules linked in: nls_utf8 nls_cp437 vfat fat ceph(OF) rbd(OF) libceph(O) cdc_acm cbc drbd lru_cache ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables libcrc32c cpufreq_powersave cpufreq_stats cpufreq_conservative cpufreq_userspace snd_hrtimer pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) binfmt_misc ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi rpcsec_gss_krb5 nfsv4 dns_resolver nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc fuse parport_pc ppdev lp parport loop dm_crypt joydev x86_pkg_temp_thermal intel_powerclamp iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm sb_edac snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec edac_core snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi i2c_i801 pcspkr i2c_core lpc_ich mfd_core snd_seq snd_seq_device snd_timer snd tpm_infineon tpm_tis tpm fglrx(PO) soundcore ioatdma mei_me mei processor thermal_sys dca evdev ext4 crc16 mbcache jbd2 btrfs xor raid6_pq dm_mod raid1 md_mod usb_storage sg sr_mod cdrom sd_mod crc_t10dif hid_generic usbhid hid mxm_wmi crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd xhci_hcd ehci_pci ehci_hcd usbcore usb_common e1000e ptp pps_core ahci libahci libata scsi_mod wmi button [last unloaded: ceph]
May 13 22:07:28 deblab kernel: [445905.205865] CPU: 2 PID: 22899 Comm: kworker/2:1 Tainted: PF          O 3.14-1-amd64 #1 Debian 3.14.2-1
May 13 22:07:28 deblab kernel: [445905.205867] Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./X79-UP4, BIOS F4 09/05/2013
May 13 22:07:28 deblab kernel: [445905.205885] Workqueue: events do_async_commit [btrfs]
May 13 22:07:28 deblab kernel: [445905.205888]  0000000000000009 ffffffff814b9912 0000000000000000 ffffffff8105f0a2
May 13 22:07:28 deblab kernel: [445905.205893]  ffff88063f087730 0000008a2a31a000 0000000000000000 ffff880c0c3a8000
May 13 22:07:28 deblab kernel: [445905.205897]  00000000fffffffe ffffffffa0290d93 000000000022f9f0 000000000013c000
May 13 22:07:28 deblab kernel: [445905.205902] Call Trace:
May 13 22:07:28 deblab kernel: [445905.205910]  [<ffffffff814b9912>] ? dump_stack+0x41/0x51
May 13 22:07:28 deblab kernel: [445905.205916]  [<ffffffff8105f0a2>] ? warn_slowpath_common+0x72/0x90
May 13 22:07:28 deblab kernel: [445905.205933]  [<ffffffffa0290d93>] ? __btrfs_free_extent+0xa23/0xaa0 [btrfs]
May 13 22:07:28 deblab kernel: [445905.205952]  [<ffffffffa02ed2f5>] ? find_ref_head.constprop.10+0x55/0x80 [btrfs]
May 13 22:07:28 deblab kernel: [445905.205969]  [<ffffffffa0294e51>] ? __btrfs_run_delayed_refs+0x381/0x10f0 [btrfs]
May 13 22:07:28 deblab kernel: [445905.205987]  [<ffffffffa029980a>] ? btrfs_run_delayed_refs.part.60+0x6a/0x280 [btrfs]
May 13 22:07:28 deblab kernel: [445905.206006]  [<ffffffffa02a9164>] ? btrfs_commit_transaction+0x44/0x990 [btrfs]
May 13 22:07:28 deblab kernel: [445905.206023]  [<ffffffffa02aa755>] ? do_async_commit+0x25/0x30 [btrfs]
May 13 22:07:28 deblab kernel: [445905.206028]  [<ffffffff81078b9d>] ? process_one_work+0x16d/0x420
May 13 22:07:28 deblab kernel: [445905.206033]  [<ffffffff81079756>] ? worker_thread+0x116/0x3b0
May 13 22:07:28 deblab kernel: [445905.206037]  [<ffffffff81079640>] ? rescuer_thread+0x330/0x330
May 13 22:07:28 deblab kernel: [445905.206041]  [<ffffffff8107f911>] ? kthread+0xc1/0xe0
May 13 22:07:28 deblab kernel: [445905.206045]  [<ffffffff8107f850>] ? kthread_create_on_node+0x180/0x180
May 13 22:07:28 deblab kernel: [445905.206049]  [<ffffffff814c718c>] ? ret_from_fork+0x7c/0xb0
May 13 22:07:28 deblab kernel: [445905.206053]  [<ffffffff8107f850>] ? kthread_create_on_node+0x180/0x180
May 13 22:07:28 deblab kernel: [445905.206055] ---[ end trace b424f0cc69d04c9e ]---
May 13 22:07:28 deblab kernel: [445905.206060] BTRFS info (device dm-1): leaf 1089699840 total ptrs 118 free space 6638

When the kernel crashed, all I/O to the affected HDD stopped.
The OSD crash happened minutes later:

     0> 2014-05-13 22:11:38.830682 7fecb9bcb700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fecb9bcb700 time 2014-05-13 22:11:38.817996
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2a9) [0x7fecbe408329]
 2: (ceph::HeartbeatMap::is_healthy()+0xb6) [0x7fecbe408b96]
 3: (ceph::HeartbeatMap::check_touch_file()+0x17) [0x7fecbe4091a7]
 4: (CephContextServiceThread::entry()+0x154) [0x7fecbe4d43f4]
 5: (()+0x8062) [0x7fecbd58f062]
 6: (clone()+0x6d) [0x7fecbbaafbfd]

I had to reboot the host; the HDD was then tested for SMART errors and surface defects and found to be healthy.
I did not take any chances returning it to the cluster as-is (`btrfsck --repair` may or may not fix errors, but I don't gamble with data).
I re-formatted the HDD as ext4 and returned it to the cluster as a new OSD.

The kernel crash was the reason for the loss of the OSD.
There were no (other) btrfs errors logged, at least nothing related to CRC errors.
I did not run a Btrfs scrub because the OSD file system was corrupted, so I discarded the contents of the HDD right away.

On Ceph no files were lost (as far as I'm aware), but the contents of some files changed.
For instance, some hours later a MySQL backup failed due to corruption in a ~10 GiB "ibdata1" file located on an RBD-mapped device.

This is consistent with the scrubbing, which writes a known pattern to files and later reads it back and compares.
Scrubbing runs in parallel on a tmpfs-based "/tmp" and the local file system as well as on the CephFS mount point.
Scrubbing detected errors on CephFS only.
The damaged MySQL database lives on a different host (where no errors were logged whatsoever), so the corruption propagated silently.
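
(The principle behind that scrubbing, roughly -- this is not fsprobe's actual invocation, just a sketch of the write-then-verify idea using the same 0xAA fill byte and a hypothetical /mnt/cephfs mount point:)

# write 1 GiB of 0xAA bytes to a probe file on the CephFS mount
head -c 1073741824 /dev/zero | tr '\000' '\252' > /mnt/cephfs/probe.dat
# later: regenerate the same pattern and compare byte-for-byte;
# any mismatch (e.g. zeroes read back) indicates silent corruption
head -c 1073741824 /dev/zero | tr '\000' '\252' | cmp - /mnt/cephfs/probe.dat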

I hope this explains the situation a little.

Actions #4

Updated by Greg Farnum almost 10 years ago

  • Priority changed from High to Normal

Have you ever run a scrub repair on the cluster?
Was your cluster going through any backfilling?

The corrupt data sounds a lot like missing objects, but if you truly never let the OSD back in, then I don't know of any mechanism by which that could happen (unless you have other OSDs that have failed and been used as authoritative sources).
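
(For reference, "scrub repair" here refers to the standard commands below; the PG id is only an example, taken from the scrub log later in this ticket:)

ceph health detail     # lists any PGs flagged inconsistent after a deep scrub
ceph pg repair 2.19    # ask the primary OSD of that PG to repair it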

Actions #5

Updated by Dmitry Smirnov almost 10 years ago

  • Priority changed from Normal to High

Greg Farnum wrote:

Have you ever run a scrub repair on the cluster?

Yes. Since the release of 0.80 I have instructed all OSDs to perform a deep scrub several times.
Some weeks ago, right after the upgrade from 0.79 to 0.80, a few errors were found and fixed.
Then, for at least two weeks prior to this incident, deep scrub did not find any problems.
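
(Roughly what "instructed all OSDs to perform a deep scrub" amounts to -- a sketch, assuming a bash shell on an admin node:)

# kick off a deep scrub on every OSD in the cluster
for id in $(ceph osd ls); do
    ceph osd deep-scrub "$id"
done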

Was your cluster going through any backfilling?

No. From the fsprobe log it can be seen that the errors were detected just before the OSD died.
The cluster was in HEALTH_OK status just before the first errors appeared.

The corrupt data sounds a lot like missing objects, but if you truly never let the OSD back in, then I don't know of any mechanism by which that could happen (unless you have other OSDs that have failed and been used as authoritative sources).

This is a mystery to me too. I've been running deep-scrub/repair for some time already, and the following was detected/fixed:

2014-05-14 16:11:55.629061 osd.6 [ERR] 2.b deep-scrub stat mismatch, got 3265/3266 objects, 0/0 clones, 3265/3266 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 13463732848/13467927152 bytes. 
2014-05-14 16:11:55.629112 osd.6 [ERR] 2.b deep-scrub 1 errors 
2014-05-14 17:14:04.583625 osd.6 [ERR] 2.19 shard 0: soid 9fcae619/rb.0.ca6c0.238e1f29.000000001fe7/head//2 digest 3807571614 != known digest 3123480504, size 4194304 != known size 1376256 
2014-05-14 17:14:04.583660 osd.6 [ERR] 2.19 shard 1: soid 9fcae619/rb.0.ca6c0.238e1f29.000000001fe7/head//2 digest 3807571614 != known digest 3123480504, size 4194304 != known size 1376256 
2014-05-14 17:14:04.583709 osd.6 [ERR] 2.19 shard 7: soid 9fcae619/rb.0.ca6c0.238e1f29.000000001fe7/head//2 digest 3807571614 != known digest 3123480504, size 4194304 != known size 1376256 
2014-05-14 17:15:37.422313 osd.6 [ERR] 2.b repair stat mismatch, got 3265/3266 objects, 0/0 clones, 3265/3266 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 13463732848/13467927152 bytes. 
2014-05-14 17:15:37.422644 osd.6 [ERR] 2.b repair 1 errors, 1 fixed 
2014-05-14 17:29:50.408218 osd.6 [ERR] 2.19 deep-scrub stat mismatch, got 3266/3267 objects, 0/0 clones, 3266/3267 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 13416291840/13420486144 bytes. 
2014-05-14 17:29:50.408493 osd.6 [ERR] 2.19 deep-scrub 0 missing, 1 inconsistent objects 
2014-05-14 17:29:50.408664 osd.6 [ERR] 2.19 deep-scrub 4 errors 
2014-05-14 17:33:42.096760 osd.6 [ERR] 2.32 deep-scrub stat mismatch, got 3187/3188 objects, 0/0 clones, 3187/3188 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 13147161600/13151355904 bytes. 
2014-05-14 17:33:42.096786 osd.6 [ERR] 2.32 deep-scrub 1 errors 
2014-05-14 17:36:25.492731 osd.6 [ERR] 2.19 shard 0: soid 9fcae619/rb.0.ca6c0.238e1f29.000000001fe7/head//2 digest 3807571614 != known digest 3123480504, size 4194304 != known size 1376256 
2014-05-14 17:36:25.492755 osd.6 [ERR] 2.19 shard 1: soid 9fcae619/rb.0.ca6c0.238e1f29.000000001fe7/head//2 digest 3807571614 != known digest 3123480504, size 4194304 != known size 1376256 
2014-05-14 17:36:25.492791 osd.6 [ERR] 2.19 shard 7: soid 9fcae619/rb.0.ca6c0.238e1f29.000000001fe7/head//2 digest 3807571614 != known digest 3123480504, size 4194304 != known size 1376256 
2014-05-14 17:47:12.978103 osd.6 [ERR] 2.19 repair stat mismatch, got 3266/3267 objects, 0/0 clones, 3266/3267 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 13416291840/13420486144 bytes. 
2014-05-14 17:47:12.978466 osd.6 [ERR] 2.19 repair 0 missing, 1 inconsistent objects 
2014-05-14 17:47:12.978619 osd.6 [ERR] 2.19 repair 4 errors, 4 fixed 
2014-05-14 20:22:18.674784 osd.6 [ERR] 2.32 repair stat mismatch, got 3187/3188 objects, 0/0 clones, 3187/3188 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 13149869056/13154063360 bytes. 
2014-05-14 20:22:18.675118 osd.6 [ERR] 2.32 repair 1 errors, 1 fixed 

I'm quite surprised to see these errors detected on osd.6, which was recreated from scratch and re-filled.
Just in case, I disabled snapshotting on all Btrfs OSDs:

filestore btrfs snap        = false
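
(Presumably in the [osd] section of ceph.conf on each OSD host, followed by an OSD restart -- e.g.:)

[osd]
    # stop the filestore from using btrfs snapshots for its commit points
    filestore btrfs snap = false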

Hopefully it will help to avoid kernel bug(s).

I think the lesson from this is that a malfunctioning OSD can send incorrect data to clients.
I hope additional (integrity) checks can be implemented to avoid this.

Actions #6

Updated by Greg Farnum almost 10 years ago

Sorry for dropping this...it sounds like you used incorrect replicas as the authoritative sources for the repairs you ran. I think a smarter scrub repair might be up for the next cycle, but in the meantime you shouldn't use scrub repair unless you know that the primaries have the correct data.
Better integrity checking will also come as time passes, but we're really hoping to get some kernel interfaces to work with first. ;)

Actions #7

Updated by Dmitry Smirnov almost 10 years ago

Greg Farnum wrote:

it sounds like you used incorrect replicas as the authoritative sources for the repairs you ran.

I'm not sure how that is possible. You make it sound as if it were a misconfiguration on my side, which is doubtful (how could I deliberately use incorrect replicas? I'd like to know so I can avoid doing that accidentally, especially if that is what happened here). Besides, the errors appeared before the repair and disappeared after the (presumably successful) repair...

I think a smarter scrub repair might be up for the next cycle, but in the meantime you shouldn't use scrub repair unless you know that the primaries have the correct data.

It is shocking to realise that repairs are not to be trusted -- this is one of the essential Ceph features. Is there a repair mode that is known to work reliably? Are there any instructions for manual repair? How do I identify when (and when not) to use repair?

Better integrity checking will also come as time passes, but we're really hoping to get some kernel interfaces to work with first. ;)

Thank you. It is good to know that there are integrity-related improvements on the TODO list. I can think of only a few issues that might be of greater importance than data integrity and recovery. Please prioritise this issue together with the recovery improvements you mentioned. Is it already logged as a bug that I could follow?

Actions #8

Updated by Greg Farnum almost 10 years ago

Dmitry Smirnov wrote:

Greg Farnum wrote:

it sounds like you used incorrect replicas as the authoritative sources for the repairs you ran.

I'm not sure how that is possible. You make it sound as if it were a misconfiguration on my side, which is doubtful (how could I deliberately use incorrect replicas? I'd like to know so I can avoid doing that accidentally, especially if that is what happened here). Besides, the errors appeared before the repair and disappeared after the (presumably successful) repair...

I think a smarter scrub repair might be up for the next cycle, but in the meantime you shouldn't use scrub repair unless you know that the primaries have the correct data.

It is shocking to realise that repairs are not to be trusted -- this is one of the essential Ceph features. Is there a repair mode that is known to work reliably? Are there any instructions for manual repair? How do I identify when (and when not) to use repair?

I'm sorry if this is a surprise to you, but I believe the behavior of repair is fully documented at ceph.com/docs. Current repair behavior with replicated PGs is to copy the primary's data to the other nodes; this makes the Ceph cluster self-consistent, but might cause problems for consumers if the primary had the wrong data.
I agree that's undesirable, but without the checksums or some similar method, RADOS really can't tell what the right data is. It relies on the underlying local filesystems not to lie to it, and not to corrupt their own data; this is what happened here. (Yes, local filesystems losing old files just because the server lost power is annoying. We've seen it happen both as a result of misconfigured systems, and as a result of bugs in some kernel releases.)
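
(For anyone hitting this later: before running "ceph pg repair" one can, in principle, check by hand which replica is good. A rough sketch, assuming FileStore OSDs under the default /var/lib/ceph paths and using the object and PG names from the scrub log above:)

# on each host holding a replica of PG 2.19, locate the object file and checksum it
find /var/lib/ceph/osd/ceph-*/current/2.19_head/ \
     -name '*rb.0.ca6c0.238e1f29.000000001fe7*' -exec md5sum {} \;
# compare the checksums (and sizes) across all replicas; only run
# "ceph pg repair 2.19" if the primary's copy turns out to be the correct one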

Better integrity checking will also come as time passes, but we're really hoping to get some kernel interfaces to work with first. ;)

Thank you. It is good to know that there are integrity-related improvements on the TODO list. I can think of only a few issues that might be of greater importance than data integrity and recovery. Please prioritise this issue together with the recovery improvements you mentioned. Is it already logged as a bug that I could follow?

Not at this time; I think there might have been some blueprints from past CDS events, and the EC pools do better since their constrained interface does allow keeping checksums of every piece of data.

Actions #9

Updated by Loïc Dachary over 9 years ago

  • Status changed from New to Closed
  • Assignee set to Greg Farnum