Bug #51191

Cannot Mount CephFS No Timeout, mount error 5 = Input/output error

Added by Brian Rogers almost 3 years ago. Updated almost 3 years ago.

Status: Need More Info
Priority: Normal
Assignee: -
Category: Administration/Usability
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS, ceph-fuse, libcephfs
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

On most hosts, mounting CephFS via the kernel client or ceph-fuse does not succeed. On one host, a Raspberry Pi 4, it did mount. However, immediately after mounting, a simple cp filea.txt /cephfs/filea.txt fails. Creating a directory did work, but the mkdir /cephfs/test command itself hangs.
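
The report says the kernel client was also tried but only shows the ceph-fuse invocation further down; a kernel mount along the following lines is presumably what was attempted (a sketch only: the secretfile path and option values are assumptions for illustration, not taken from the report).

# Hypothetical kernel-client mount matching the ceph-fuse attempt below;
# the secret file location is an assumption.
sudo mount -t ceph 10.102.28.116,10.101.151.140,10.102.247.49:/ /cephfs \
    -o name=test,secretfile=/etc/ceph/client.test.secret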

Ceph mon stat

e30: 3 mons at {b=[v2:10.102.28.116:3300/0,v1:10.102.28.116:6789/0],n=[v2:10.101.151.140:3300/0,v1:10.101.151.140:6789/0],p=[v2:10.102.247.49:3300/0,v1:10.102.247.49:6789/0]}, election epoch 722, leader 0 b, quorum 0,1,2 b,n,p
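
For reference, the monitor view above comes from the standard Ceph CLI; the related commands below give the same monitor map plus the overall cluster state (how the output was actually gathered, and which keyring was used, is not shown in the report, so a keyring with mon read caps is assumed).

# Standard Ceph CLI status commands (assumes a keyring with mon read caps).
ceph mon stat
ceph mon dump
ceph -s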

Dmesg output

[45567.875725] Key type ceph registered
[45567.878545] libceph: loaded (mon/osd proto 15/24)
[45567.885258] FS-Cache: Netfs 'ceph' registered for caching
[45567.885263] ceph: loaded (mds proto 32)
[45567.903786] libceph: mon0 (1)10.102.28.116:6789 session established
[45567.904552] libceph: mon0 (1)10.102.28.116:6789 socket closed (con state OPEN)
[45567.904566] libceph: mon0 (1)10.102.28.116:6789 session lost, hunting for new mon
[45567.908889] libceph: mon1 (1)10.102.28.116:6789 session established
[45567.909875] libceph: client7984614 fsid 5d44c1b2-3fa7-42a1-827b-c4fb1b4a8e76
[46597.917058] libceph: mon2 (1)10.102.247.49:6789 session established
[46597.918225] libceph: mon2 (1)10.102.247.49:6789 socket closed (con state OPEN)
[46597.918236] libceph: mon2 (1)10.102.247.49:6789 session lost, hunting for new mon
[46597.921514] libceph: mon1 (1)10.102.28.116:6789 session established

Attempt to mount CephFS via ceph-fuse

sudo ceph-fuse -d -f -s -m 10.102.28.116,10.101.151.140,10.102.247.49 /cephfs --keyring /etc/ceph/ceph.client.test.keyring --name client.test

ceph-fuse[9754]: starting ceph client
2021-05-26T07:46:11.944-0700 7f34ab943100  0 ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable), process ceph-fuse, pid 9754
2021-05-26T07:46:11.944-0700 7f34ab943100 -1 init, newargv = 0x55b11a825eb0 newargc=17
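
The ceph-fuse output above stops right after startup with no error, so it is hard to tell where the mount stalls. A minimal next step, assuming the standard Ceph debug config overrides, would be to rerun the same command with client and messenger logging turned up; everything except the two --debug options is unchanged from the attempt above.

# Same mount attempt with verbose client/messenger logging
# (--debug-client/--debug-ms are standard Ceph config overrides; the values are a suggestion).
sudo ceph-fuse -d -f -s -m 10.102.28.116,10.101.151.140,10.102.247.49 /cephfs \
    --keyring /etc/ceph/ceph.client.test.keyring --name client.test \
    --debug-client=20 --debug-ms=1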

Actions #1

Updated by Brian Rogers almost 3 years ago

More details on the issue can be found here: https://github.com/rook/rook/issues/7994

Actions #2

Updated by Brian Rogers almost 3 years ago

An updated attempt to mount using ceph-fuse

...@...:~$ sudo ceph-fuse -d -f -s -m 10.102.28.116,10.101.151.140,10.102.247.49 /cephfs --keyring /etc/ceph/ceph.client.test.keyring --name client.test
ceph-fuse[20685]: starting ceph client
2021-06-12T20:17:24.288-0700 7f7a4cd22100  0 ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable), process ceph-fuse, pid 20685
2021-06-12T20:17:24.288-0700 7f7a4cd22100 -1 init, newargv = 0x5565340b1f80 newargc=17
FUSE library version: 2.9.7
ceph-fuse[20685]: starting fuse
unique: 2, opcode: INIT (26), nodeid: 0, insize: 56, pid: 0
INIT: 7.31
flags=0x03fffffb
max_readahead=0x00020000
   INIT: 7.19
   flags=0x0000043b
   max_readahead=0x00020000
   max_write=0x00020000
   max_background=0
   congestion_threshold=0
   unique: 2, success, outsize: 40
unique: 4, opcode: ACCESS (34), nodeid: 1, insize: 48, pid: 11086
   unique: 4, success, outsize: 16
unique: 6, opcode: LOOKUP (1), nodeid: 1, insize: 47, pid: 11086
   unique: 6, error: -2 (No such file or directory), outsize: 16
unique: 8, opcode: LOOKUP (1), nodeid: 1, insize: 52, pid: 11086
   unique: 8, error: -2 (No such file or directory), outsize: 16
unique: 10, opcode: GETATTR (3), nodeid: 1, insize: 56, pid: 20717
   unique: 10, success, outsize: 120
unique: 12, opcode: GETATTR (3), nodeid: 1, insize: 56, pid: 20717
   unique: 12, success, outsize: 120
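
Since the FUSE trace alone does not show the EIO described above, one suggestion (not something shown in the report) is to capture the filesystem and MDS state alongside it, using the standard CLI.

# Standard Ceph CLI health/MDS commands (assumes a keyring with the needed caps).
ceph fs status
ceph mds stat
ceph health detail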

Actions #3

Updated by Patrick Donnelly almost 3 years ago

  • Status changed from New to Need More Info

Looks like this is probably a networking issue of some kind. Are you using host or pod networking in rook? Also, any firewalls?

Actions #4

Updated by Brian Rogers almost 3 years ago

Patrick Donnelly wrote:

Looks like this is probably a networking issue of some kind. Are you using host or pod networking in rook? Also, any firewalls?

Hey Patrick, I appreciate the response. It is doubtful that firewalls are the issue, as the cluster is hosting many other services that work fine, including NFS shares and UDP syslog services. The environment is routed using BGP and Calico/BIRD.

To my understanding, I have left the Rook network configuration at its default (https://github.com/rook/rook/blob/master/Documentation/ceph-cluster-crd.md#network-configuration-settings). I did not configure the environment for host-based networking.
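
One way to double-check the effective Rook network settings described here (a sketch only; the rook-ceph namespace and CephCluster name are the Rook documentation defaults and may differ in this environment):

# Print the network section of the CephCluster CR; an empty result means the
# defaults (pod networking) are in effect. Namespace and CR name are assumptions.
kubectl -n rook-ceph get cephcluster rook-ceph -o jsonpath='{.spec.network}'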

I can also telnet to the IP address/port and get a response back from Ceph. The IP address targets are Pod IPs rather than host IPs.

ceph v027☻à
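
The reachability check described above, spelled out as a sketch against both the msgr2 (3300) and legacy msgr1 (6789) mon ports; using nc instead of telnet here is just for scripting convenience and is an assumption about the available tooling.

# Probe every mon on both messenger ports (-z: no data sent, -v: report the result).
for ip in 10.102.28.116 10.101.151.140 10.102.247.49; do
    for port in 3300 6789; do
        nc -zv "$ip" "$port"
    done
done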

Actions #6

Updated by Brian Rogers almost 3 years ago

Patrick Donnelly wrote:

You could be hitting: https://github.com/rook/rook/issues/8085

I appreciate you pointing me this way. That issue states that, after a full reboot of the nodes, the locking issue would go away. It also states that other volume mounts would fail.

@billimek Confirm that reboot of nodes did the trick and that's horrible.

All my other volume mounts are working, and I have done two full reboots of the nodes. The other difference I noticed is that in that issue the mount attempt fails outright with a connection error; mine does not. I do connect, but then the session is lost and the client hunts for a new mon.

Their dmesg

[Fri Jun 11 10:41:37 2021] libceph: connect (1)10.43.21.150:6789 error -101
[Fri Jun 11 10:41:37 2021] libceph: mon0 (1)10.43.21.150:6789 connect error
[Fri Jun 11 10:41:38 2021] libceph: connect (1)10.43.21.150:6789 error -101
[Fri Jun 11 10:41:38 2021] libceph: mon0 (1)10.43.21.150:6789 connect error 

My dmesg

[45567.903786] libceph: mon0 (1)10.102.28.116:6789 session established
[45567.904552] libceph: mon0 (1)10.102.28.116:6789 socket closed (con state OPEN)
[45567.904566] libceph: mon0 (1)10.102.28.116:6789 session lost, hunting for new mon
[45567.908889] libceph: mon1 (1)10.102.28.116:6789 session established

The other thing to note is that the command I shared is not being run via Kubernetes; rather, I tested it on two different machines, one inside the K8s network and one outside.

I don't disagree that there could be something in the Rook configuration, but all other services are working as expected except CephFS, which I just created after upgrading to Ceph 16.2.4.

Actions #7

Updated by Loïc Dachary almost 3 years ago

  • Target version deleted (v16.2.5)
Actions #8

Updated by Brian Rogers almost 3 years ago

I have upgraded the Ceph cluster to v16.2.5 and upgraded Rook to v1.6.7. The issue remains.

[586901.665789] FS-Cache: Loaded
[586901.699203] Key type ceph registered
[586901.699625] libceph: loaded (mon/osd proto 15/24)
[586901.748555] FS-Cache: Netfs 'ceph' registered for caching
[586901.748574] ceph: loaded (mds proto 32)
[586901.768060] libceph: mon2 (1)10.102.247.49:6789 session established
[586901.769235] libceph: mon2 (1)10.102.247.49:6789 socket closed (con state OPEN)
[586901.769279] libceph: mon2 (1)10.102.247.49:6789 session lost, hunting for new mon
[586901.777243] libceph: mon0 (1)10.101.151.140:6789 session established
[586901.780274] libceph: client9007280 fsid 5d44c1b2-3fa7-42a1-827b-c4fb1b4a8e76
[588147.716154] libceph: mon0 (1)10.102.28.116:6789 session established
[588147.716562] libceph: mon0 (1)10.102.28.116:6789 socket closed (con state OPEN)
[588147.716618] libceph: mon0 (1)10.102.28.116:6789 session lost, hunting for new mon
[588147.721931] libceph: mon2 (1)10.102.247.49:6789 session established
[588147.724631] libceph: client10404339 fsid 5d44c1b2-3fa7-42a1-827b-c4fb1b4a8e76
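
After an upgrade like this, one quick sanity check (not shown in the report) is to confirm that every daemon is actually running the new release, using the standard CLI.

# Lists the running version of each daemon type; all entries should show 16.2.5.
ceph versions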