Bug #51191
Cannot Mount CephFS, No Timeout, mount error 5 = Input/output error
Status: Open
Description
On most hosts, mounting the CephFS via the kernel client or ceph-fuse does not succeed. On one host, a Raspberry Pi 4, the mount did succeed; however, immediately after mounting, a simple cp filea.txt /cephfs/filea.txt fails, and creating a directory with mkdir /cephfs/test hangs.
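For reference, the kernel-client mount attempt was along these lines (a sketch only; the secretfile path and the filesystem name myfs are placeholders, while the mon IPs and client.test come from this report):

# Kernel-client mount sketch; mds_namespace selects the CephFS filesystem (name assumed).
sudo mount -t ceph 10.102.28.116,10.101.151.140,10.102.247.49:/ /cephfs \
  -o name=test,secretfile=/etc/ceph/ceph.client.test.secret,mds_namespace=myfs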
Ceph mon stat
e30: 3 mons at {b=[v2:10.102.28.116:3300/0,v1:10.102.28.116:6789/0],n=[v2:10.101.151.140:3300/0,v1:10.101.151.140:6789/0],p=[v2:10.102.247.49:3300/0,v1:10.102.247.49:6789/0]}, election epoch 722, leader 0 b, quorum 0,1,2 b,n,p
Dmesg output
[45567.875725] Key type ceph registered
[45567.878545] libceph: loaded (mon/osd proto 15/24)
[45567.885258] FS-Cache: Netfs 'ceph' registered for caching
[45567.885263] ceph: loaded (mds proto 32)
[45567.903786] libceph: mon0 (1)10.102.28.116:6789 session established
[45567.904552] libceph: mon0 (1)10.102.28.116:6789 socket closed (con state OPEN)
[45567.904566] libceph: mon0 (1)10.102.28.116:6789 session lost, hunting for new mon
[45567.908889] libceph: mon1 (1)10.102.28.116:6789 session established
[45567.909875] libceph: client7984614 fsid 5d44c1b2-3fa7-42a1-827b-c4fb1b4a8e76
[46597.917058] libceph: mon2 (1)10.102.247.49:6789 session established
[46597.918225] libceph: mon2 (1)10.102.247.49:6789 socket closed (con state OPEN)
[46597.918236] libceph: mon2 (1)10.102.247.49:6789 session lost, hunting for new mon
[46597.921514] libceph: mon1 (1)10.102.28.116:6789 session established
Attempt to mount CephFS via ceph-fuse
sudo ceph-fuse -d -f -s -m 10.102.28.116,10.101.151.140,10.102.247.49 /cephfs --keyring /etc/ceph/ceph.client.test.keyring --name client.test
ceph-fuse[9754]: starting ceph client
2021-05-26T07:46:11.944-0700 7f34ab943100 0 ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable), process ceph-fuse, pid 9754
2021-05-26T07:46:11.944-0700 7f34ab943100 -1 init, newargv = 0x55b11a825eb0 newargc=17
Updated by Brian Rogers almost 3 years ago
More details on the issue can be found here: https://github.com/rook/rook/issues/7994
Updated by Brian Rogers almost 3 years ago
An updated attempt to mount using ceph-fuse
...@...:~$ sudo ceph-fuse -d -f -s -m 10.102.28.116,10.101.151.140,10.102.247.49 /cephfs --keyring /etc/ceph/ceph.client.test.keyring --name client.test
ceph-fuse[20685]: starting ceph client
2021-06-12T20:17:24.288-0700 7f7a4cd22100 0 ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable), process ceph-fuse, pid 20685
2021-06-12T20:17:24.288-0700 7f7a4cd22100 -1 init, newargv = 0x5565340b1f80 newargc=17
FUSE library version: 2.9.7
ceph-fuse[20685]: starting fuse
unique: 2, opcode: INIT (26), nodeid: 0, insize: 56, pid: 0
INIT: 7.31 flags=0x03fffffb max_readahead=0x00020000
INIT: 7.19 flags=0x0000043b max_readahead=0x00020000 max_write=0x00020000 max_background=0 congestion_threshold=0
unique: 2, success, outsize: 40
unique: 4, opcode: ACCESS (34), nodeid: 1, insize: 48, pid: 11086
unique: 4, success, outsize: 16
unique: 6, opcode: LOOKUP (1), nodeid: 1, insize: 47, pid: 11086
unique: 6, error: -2 (No such file or directory), outsize: 16
unique: 8, opcode: LOOKUP (1), nodeid: 1, insize: 52, pid: 11086
unique: 8, error: -2 (No such file or directory), outsize: 16
unique: 10, opcode: GETATTR (3), nodeid: 1, insize: 56, pid: 20717
unique: 10, success, outsize: 120
unique: 12, opcode: GETATTR (3), nodeid: 1, insize: 56, pid: 20717
unique: 12, success, outsize: 120
Updated by Patrick Donnelly almost 3 years ago
- Status changed from New to Need More Info
Looks like this is probably a networking issue of some kind. Are you using host or pod networking in rook? Also, any firewalls?
Updated by Brian Rogers almost 3 years ago
Patrick Donnelly wrote:
Looks like this is probably a networking issue of some kind. Are you using host or pod networking in rook? Also, any firewalls?
Hey Patrick, I appreciate the response. It is doubtful that this is a firewall issue, as the cluster hosts many other services that work fine, including NFS shares and UDP syslog services. The environment is routed using BGP and Calico/BIRD.
As far as I understand, I have left the Rook network configuration at its default (https://github.com/rook/rook/blob/master/Documentation/ceph-cluster-crd.md#network-configuration-settings); I did not configure the environment for host-based networking.
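For what it's worth, switching a Rook cluster to host networking would look roughly like this (a sketch only; the rook-ceph namespace and CephCluster name are Rook defaults and assumptions here):

# Hypothetical: merge-patch the CephCluster CRD to use host networking.
kubectl -n rook-ceph patch cephcluster rook-ceph --type merge \
  -p '{"spec":{"network":{"provider":"host"}}}'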
I can also telnet to the mon IP address/port and get a response back from Ceph. Note that the target IP addresses are Pod IPs rather than host IPs.
ceph v027☻à
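The check above was roughly the following (a sketch; nc stands in for telnet here, and the IP is one of the mon Pod IPs from the mon stat output):

# A listening mon answers on the v1 port with a "ceph v027..."-style banner.
nc -v 10.102.28.116 6789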
Updated by Patrick Donnelly almost 3 years ago
You could be hitting: https://github.com/rook/rook/issues/8085
Updated by Brian Rogers almost 3 years ago
Patrick Donnelly wrote:
You could be hitting: https://github.com/rook/rook/issues/8085
I appreciate you pointing me this way. That issue states that, after a full reboot of the nodes, the locking issue would go away. It also states that other volume mounts would fail.
@billimek Confirm that reboot of nodes did the trick and that's horrible.
All my other volume mounts are working, and I have done two full reboots of the nodes. The other difference I noticed is that in that issue the mount response was an outright connection failure; mine is not. I do connect, but then the session is lost and the client hunts for a new mon.
Their dmesg
[Fri Jun 11 10:41:37 2021] libceph: connect (1)10.43.21.150:6789 error -101
[Fri Jun 11 10:41:37 2021] libceph: mon0 (1)10.43.21.150:6789 connect error
[Fri Jun 11 10:41:38 2021] libceph: connect (1)10.43.21.150:6789 error -101
[Fri Jun 11 10:41:38 2021] libceph: mon0 (1)10.43.21.150:6789 connect error
My dmesg
[45567.903786] libceph: mon0 (1)10.102.28.116:6789 session established
[45567.904552] libceph: mon0 (1)10.102.28.116:6789 socket closed (con state OPEN)
[45567.904566] libceph: mon0 (1)10.102.28.116:6789 session lost, hunting for new mon
[45567.908889] libceph: mon1 (1)10.102.28.116:6789 session established
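A packet capture on the mon ports should show which side closes the socket right after "session established" (a sketch; run on the client while repeating the mount attempt):

# Watch both messenger ports: v1 (6789) and v2 (3300).
sudo tcpdump -ni any 'tcp port 6789 or tcp port 3300'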
The other thing to note is that the command I shared is not being run via Kubernetes; I tested it on two different machines, one inside the K8s network and one outside.
I don't disagree that there could be something in the Rook configuration, but all other services are working as expected; the only exception is CephFS, which I just created after upgrading to Ceph 16.2.4.
Updated by Brian Rogers almost 3 years ago
I have upgraded the Ceph cluster to v16.2.5 and Rook to v1.6.7. The issue still remains.
[586901.665789] FS-Cache: Loaded
[586901.699203] Key type ceph registered
[586901.699625] libceph: loaded (mon/osd proto 15/24)
[586901.748555] FS-Cache: Netfs 'ceph' registered for caching
[586901.748574] ceph: loaded (mds proto 32)
[586901.768060] libceph: mon2 (1)10.102.247.49:6789 session established
[586901.769235] libceph: mon2 (1)10.102.247.49:6789 socket closed (con state OPEN)
[586901.769279] libceph: mon2 (1)10.102.247.49:6789 session lost, hunting for new mon
[586901.777243] libceph: mon0 (1)10.101.151.140:6789 session established
[586901.780274] libceph: client9007280 fsid 5d44c1b2-3fa7-42a1-827b-c4fb1b4a8e76
[588147.716154] libceph: mon0 (1)10.102.28.116:6789 session established
[588147.716562] libceph: mon0 (1)10.102.28.116:6789 socket closed (con state OPEN)
[588147.716618] libceph: mon0 (1)10.102.28.116:6789 session lost, hunting for new mon
[588147.721931] libceph: mon2 (1)10.102.247.49:6789 session established
[588147.724631] libceph: client10404339 fsid 5d44c1b2-3fa7-42a1-827b-c4fb1b4a8e76
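To get more detail on why these sessions drop, libceph dynamic debug can be enabled on the client before retrying the mount (a sketch; assumes a kernel built with CONFIG_DYNAMIC_DEBUG and debugfs mounted):

# Verbose libceph logging lands in dmesg; disable again with 'module libceph -p'.
echo 'module libceph +p' | sudo tee /sys/kernel/debug/dynamic_debug/control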