Bug #11898
closed
Writing on forced umount causes silent data loss
Description
Writing to a CephFS bind mount (for example, inside an LXC container) whose parent mount has been unmounted or is broken emits the following kernel message on the client machine:
ceph: writepage_start ffff883exxxxxxxx on forced umount
The dmesg output quickly fills up with such messages. The written file appears to have the correct size but contains only null bytes instead of the expected data. The writing process neither blocks nor reports an error. After writing a small file on this client machine, I see that other clients reading the file may block.
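To confirm the symptom on a single file, a check along these lines might help. This is only a minimal sketch: the function name `is_null_filled` and the chunk size are illustrative assumptions, not part of any Ceph tooling.

```python
import os

def is_null_filled(path, chunk_size=1 << 20):
    """Return True if the file is non-empty and every byte is 0x00.

    Hypothetical helper for spotting files hit by the symptom described
    above (plausible size, content replaced by nulls).
    """
    if os.path.getsize(path) == 0:
        return False  # an empty file is not the symptom described here
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return True  # reached EOF without seeing a non-zero byte
            if chunk.count(0) != len(chunk):
                return False
```

Running it from a client with a healthy mount is probably more trustworthy than running it on the affected client, since the affected client may serve reads from its own cache.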
Ceph: 0.87.1
Kernel: kernel.org 3.18.0 with Ubuntu deb package configuration
I've attached the logs for the 2 MDS daemons. The machine was booted at 2015-06-04 16:43. Data loss started when a container with a bind mount rebooted around 2015-06-04 18:28, and ended with a restart of all containers and a remount of the parent mount around 2015-06-05 11:21.
There's nothing out of the ordinary in mon and osd logs.
To be honest, I'm not entirely sure of the chain of events because I haven't been able to reproduce this. I'd be happy to provide more info.
Files
Updated by Greg Farnum almost 9 years ago
Is this in any way different from what happens against any other filesystem? This sounds more like something is bad in how forced umount and bind mounts interact in general (although it definitely could be us as well).
Updated by Kevin Lamontagne almost 9 years ago
I don't unmount directly; lxc-stop does it.
I hope to have some time to create a repeatable test case soon. A quick test with a local ext4 filesystem, bind mounted and then force unmounted, shows that the directory structure becomes unavailable and files can't be opened. The "original" mount is untouched.
Meanwhile I found a related open Debian bug: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=613904#15
It wouldn't be too bad if the mount simply became unavailable, but I've seen a broken mount keep creating new files and overwriting existing ones, filling them with nulls, for a long period of time.
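To estimate how many files were affected during such a period, one could walk the tree from a healthy client and list non-empty files that contain only nulls and were modified inside the suspect window. This is a rough sketch under stated assumptions: the function name `find_null_filled` and the mount point in the usage note are hypothetical, and it assumes the modification times of affected files are themselves reliable, which has not been verified here.

```python
import os
from datetime import datetime

def _all_zero(path, chunk_size=1 << 20):
    """Return True if every byte read from the file is 0x00."""
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            if chunk.count(0) != len(chunk):
                return False
    return True

def find_null_filled(root, start, end):
    """Return sorted paths under root that are non-empty, all nulls,
    and have an mtime between start and end (naive local datetimes)."""
    lo, hi = start.timestamp(), end.timestamp()
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
                if st.st_size == 0 or not (lo <= st.st_mtime <= hi):
                    continue
                if _all_zero(path):
                    hits.append(path)
            except OSError:
                # Unreadable or vanished files are skipped, not reported.
                continue
    return sorted(hits)
```

With the window from the description, a call might look like `find_null_filled("/mnt/cephfs", datetime(2015, 6, 4, 18, 28), datetime(2015, 6, 5, 11, 21))`, where `/mnt/cephfs` stands in for the real mount point. Files that legitimately consist of nulls would show up as false positives.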