Bug #37617
CephFS did not recover after re-plugging network cable
Status: Closed
Description
Today my hosting provider had a planned switch maintenance, during which the switch that one of my 3 Ceph nodes was connected to was restarted.
Unfortunately, my CephFS fuse mount did not recover from this interruption; running e.g. `ls` or `df` on the mount point hung forever, even after all other services on the machine had recovered.
I'm using Ceph 13.2.2.
The logs look like this:
Dec 11 03:14:45 node-2 kernel: e1000e: eth0 NIC Link is Down
...
Dec 11 03:15:02 node-2 ceph-osd[20482]: 2018-12-11 03:15:02.491 7f02a74f2700 -1 osd.1 598 heartbeat_check: no reply from 10.0.0.1:6804 osd.0 since back 2018-12-11 03:14:>
Dec 11 03:15:02 node-2 ceph-osd[20482]: 2018-12-11 03:15:02.491 7f02a74f2700 -1 osd.1 598 heartbeat_check: no reply from 10.0.0.3:6807 osd.2 since back 2018-12-11 03:14:>
...
Dec 11 03:24:34 node-2 kernel: e1000e: eth0 NIC Link is Up 10 Mbps Full Duplex, Flow Control: Rx/Tx
A script of mine that continuously tries to write to the mount kept producing this output, from the start of the downtime until well past the network recovery:
Dec 11 03:28:03 node-2 myfs-mounted-pre-start[6973]: ceph-fuse /var/myfs/files fuse.ceph-fuse rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
Dec 11 03:28:03 node-2 myfs-mounted-pre-start[6973]: touch: cannot touch '/var/myfs/files/.myfs-writable': Cannot send after transport endpoint shutdown
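For context, the check that produced the output above works roughly like this (a hypothetical re-creation, since I haven't posted the actual script; the mount point path is the one from the log lines):

```shell
#!/bin/sh
# Hypothetical sketch of the periodic writability check. It logs the current
# mount-table entry for the mount point and then tries to create a marker
# file; on the wedged mount, the touch is the step that failed with
# "Cannot send after transport endpoint shutdown" (ENOTCONN).
check_writable() {
    mp="$1"
    # Show the mount-table entry, as in the first log line above.
    grep -F " $mp " /proc/mounts 2>/dev/null
    # Try to create a marker file (stderr folded into stdout for the journal).
    if touch "$mp/.myfs-writable" 2>&1; then
        echo "writable: $mp"
    else
        echo "NOT writable: $mp"
    fi
}

check_writable /var/myfs/files
```

The point is that the mount stayed in the "NOT writable" state indefinitely, even after the link came back up.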
I also found these messages in the logs:
Dec 11 03:25:49 node-2 ceph-fuse[20150]: 2018-12-11 03:25:49.472 7f252e811700 -1 client.227355134 I was blacklisted at osd epoch 602
Dec 11 04:19:46 node-2 ceph-fuse[20150]: 2018-12-11 04:19:46.069 7f252e811700 -1 client.227355134 mds.0 rejected us (unknown error)
I also have full `journalctl` logs from that period, but I'm leaving it at this for now in case the problem is already clear from the above, or it's a known issue.
I would expect Ceph to recover automatically from this short 11-minute network interruption.
Unfortunately, it did not, and I had to kill the ceph-fuse process and `fusermount -u` the mount point, then restart ceph-fuse.
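The manual recovery can be sketched as a small script (a hypothetical reconstruction; the mount point is taken from the logs above, and since the commands are disruptive it only prints them unless invoked with `--force`):

```shell
#!/bin/sh
# Hypothetical sketch of the manual recovery that worked: kill the stuck
# ceph-fuse client, lazy-unmount the dead mount point, then remount.
# MOUNTPOINT comes from the logs; everything else is an assumption.
MOUNTPOINT="/var/myfs/files"
MODE="$1"

run() {
    if [ "$MODE" = "--force" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

run pkill -9 -x ceph-fuse           # kill the wedged FUSE client
run fusermount -u -z "$MOUNTPOINT"  # lazy-unmount the dead mount point
run ceph-fuse "$MOUNTPOINT"         # remount (reads /etc/ceph/ceph.conf)
```

Running it without arguments just prints the plan; `--force` executes it.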
I am also not sure whether this is a CephFS issue specifically, or an issue in one of the lower Ceph layers.