Bug #37617 (closed)

CephFS did not recover after re-plugging network cable

Added by Niklas Hambuechen over 5 years ago. Updated about 4 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Community (user)
Regression:
No
Severity:
3 - minor

Description

Today my hosting provider performed planned switch maintenance, during which the switch that one of my 3 Ceph nodes is connected to was restarted.

Unfortunately, my CephFS fuse mount did not recover from this interruption; commands like `ls` or `df` on the mount point hung forever, even after all other services on the machine had recovered.

I'm using Ceph 13.2.2.

The logs look like this:

Dec 11 03:14:45 node-2 kernel: e1000e: eth0 NIC Link is Down
...
Dec 11 03:15:02 node-2 ceph-osd[20482]: 2018-12-11 03:15:02.491 7f02a74f2700 -1 osd.1 598 heartbeat_check: no reply from 10.0.0.1:6804 osd.0 since back 2018-12-11 03:14:>
Dec 11 03:15:02 node-2 ceph-osd[20482]: 2018-12-11 03:15:02.491 7f02a74f2700 -1 osd.1 598 heartbeat_check: no reply from 10.0.0.3:6807 osd.2 since back 2018-12-11 03:14:>
...
Dec 11 03:24:34 node-2 kernel: e1000e: eth0 NIC Link is Up 10 Mbps Full Duplex, Flow Control: Rx/Tx

A script of mine that continuously tries to write to the mount kept producing the following output, from the start of the downtime until well past the network recovery:

Dec 11 03:28:03 node-2 myfs-mounted-pre-start[6973]: ceph-fuse /var/myfs/files fuse.ceph-fuse rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
Dec 11 03:28:03 node-2 myfs-mounted-pre-start[6973]: touch: cannot touch '/var/myfs/files/.myfs-writable': Cannot send after transport endpoint shutdown
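
For reference, the check is essentially just a `touch` of a marker file on the mount. A minimal sketch of such a check (illustrative, not my exact script; the paths are the ones from the log above):

#!/bin/sh
# Write-liveness check for the CephFS fuse mount (illustrative sketch).
MOUNTPOINT=/var/myfs/files

# Log how the mount is currently registered.
grep " $MOUNTPOINT " /proc/mounts

# Try to update a marker file; during the outage this is the call that
# failed with "Cannot send after transport endpoint shutdown".
touch "$MOUNTPOINT/.myfs-writable"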

I also found these messages in the logs:

Dec 11 03:25:49 node-2 ceph-fuse[20150]: 2018-12-11 03:25:49.472 7f252e811700 -1 client.227355134 I was blacklisted at osd epoch 602
Dec 11 04:19:46 node-2 ceph-fuse[20150]: 2018-12-11 04:19:46.069 7f252e811700 -1 client.227355134 mds.0 rejected us (unknown error)
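
If I understand the blacklisting correctly, the corresponding entry can be inspected cluster-side; on my Mimic cluster that would be:

ceph osd blacklist ls

which lists blacklisted client addresses with their expiry times.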

I also have full `journalctl` logs from that period, but I'm leaving it at this for now in case the problem is already clear or this is a known issue.

I would expect Ceph to recover automatically from this short 11-minute network interruption.

Unfortunately, it did not, and I had to kill the ceph-fuse process and `fusermount -u` the mount point, then restart ceph-fuse.
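
Concretely, the manual recovery looked roughly like this (a sketch; the PID and mount point are the ones from the logs above):

# Kill the stuck client, detach the dead mount point, then remount.
kill 20150                      # the hung ceph-fuse process
fusermount -u /var/myfs/files   # add -z for a lazy unmount if it is busy
ceph-fuse /var/myfs/files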

I am also not sure whether this is a CephFS issue specifically or an issue in one of the lower Ceph layers.


Files

diff.patch (733 Bytes), added by geng jichao, 01/09/2020 06:44 AM