Bug #37617 (closed)

CephFS did not recover after re-plugging network cable

Added by Niklas Hambuechen over 5 years ago. Updated about 4 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Community (user)
Regression:
No
Severity:
3 - minor

Description

Today my hosting provider performed planned switch maintenance, during which the switch that one of my 3 Ceph nodes is connected to was restarted.

Unfortunately, my CephFS fuse mount did not recover from this interruption; commands like `ls` or `df` on the mount point hung forever, even after all other services on the machine had recovered.

I'm using Ceph 13.2.2.

The logs look like this:

Dec 11 03:14:45 node-2 kernel: e1000e: eth0 NIC Link is Down
...
Dec 11 03:15:02 node-2 ceph-osd[20482]: 2018-12-11 03:15:02.491 7f02a74f2700 -1 osd.1 598 heartbeat_check: no reply from 10.0.0.1:6804 osd.0 since back 2018-12-11 03:14:>
Dec 11 03:15:02 node-2 ceph-osd[20482]: 2018-12-11 03:15:02.491 7f02a74f2700 -1 osd.1 598 heartbeat_check: no reply from 10.0.0.3:6807 osd.2 since back 2018-12-11 03:14:>
...
Dec 11 03:24:34 node-2 kernel: e1000e: eth0 NIC Link is Up 10 Mbps Full Duplex, Flow Control: Rx/Tx

A script of mine that continuously tries to write to the mount kept producing the following output, from the start of the downtime until well past the network recovery:

Dec 11 03:28:03 node-2 myfs-mounted-pre-start[6973]: ceph-fuse /var/myfs/files fuse.ceph-fuse rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
Dec 11 03:28:03 node-2 myfs-mounted-pre-start[6973]: touch: cannot touch '/var/myfs/files/.myfs-writable': Cannot send after transport endpoint shutdown
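
For reference, the check is essentially just a `touch` of a marker file on the mount. A minimal sketch of such a check (illustrative, not my exact script; the paths are the ones from the log above):

#!/bin/sh
# Write-liveness check for the CephFS fuse mount (illustrative sketch).
MOUNTPOINT=/var/myfs/files

# Log how the mount is currently registered.
grep " $MOUNTPOINT " /proc/mounts

# Try to update a marker file; during the outage this is the call that
# failed with "Cannot send after transport endpoint shutdown".
touch "$MOUNTPOINT/.myfs-writable"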

I also found these messages in the logs:

Dec 11 03:25:49 node-2 ceph-fuse[20150]: 2018-12-11 03:25:49.472 7f252e811700 -1 client.227355134 I was blacklisted at osd epoch 602
Dec 11 04:19:46 node-2 ceph-fuse[20150]: 2018-12-11 04:19:46.069 7f252e811700 -1 client.227355134 mds.0 rejected us (unknown error)
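
If I understand the blacklisting correctly, the corresponding entry can be inspected cluster-side; on my Mimic cluster that would be:

ceph osd blacklist ls

which lists blacklisted client addresses with their expiry times.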

I also have full `journalctl` logs from that period, but I'm leaving it at this for now in case the problem is already clear or this is a known issue.

I would expect Ceph to recover automatically from this short 11-minute network interruption.

Unfortunately, it did not, and I had to kill the ceph-fuse process and `fusermount -u` the mount point, then restart ceph-fuse.
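
Concretely, the manual recovery looked roughly like this (a sketch; the PID and mount point are the ones from the logs above):

# Kill the stuck client, detach the dead mount point, then remount.
kill 20150                      # the hung ceph-fuse process
fusermount -u /var/myfs/files   # add -z for a lazy unmount if it is busy
ceph-fuse /var/myfs/files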

I am also not sure whether this is a CephFS issue specifically or an issue in one of the lower Ceph layers.


Files

diff.patch (733 Bytes), added by geng jichao, 01/09/2020 06:44 AM