Project

General

Profile

Bug #55824

ceph-fuse[88614]: ceph mount failed with (65536) Unknown error 65536

Added by Xiubo Li 6 months ago. Updated 5 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Administration/Usability
Target version:
% Done:

0%

Source:
Tags:
Backport:
quincy, pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Client
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

https://pulpito.ceph.com/vshankar-2022-05-31_02:47:51-fs-wip-vshankar-fscrypt-20220530-091336-testing-default-smithi/6853737/

When mounting a ceph-fuse it failed:

2022-05-31T03:13:43.772 DEBUG:teuthology.orchestra.run:got remote process result: 32
2022-05-31T03:13:43.773 INFO:teuthology.orchestra.run:Running command with timeout 300
2022-05-31T03:13:43.773 DEBUG:teuthology.orchestra.run.smithi045:> ls /sys/fs/fuse/connections
2022-05-31T03:13:43.806 INFO:tasks.cephfs.fuse_mount.ceph-fuse.admin.smithi045.stderr:2022-05-31T03:13:43.806+0000 7f7ece3c53c0 -1 init, newargv = 0x55c8f14132f0 newargc=15
2022-05-31T03:13:43.806 INFO:tasks.cephfs.fuse_mount.ceph-fuse.admin.smithi045.stderr:ceph-fuse[88614]: starting ceph client
2022-05-31T03:13:43.811 INFO:tasks.cephfs.fuse_mount.ceph-fuse.admin.smithi045.stderr:ceph-fuse[88614]: probably no MDS server is up?
2022-05-31T03:13:43.812 INFO:tasks.cephfs.fuse_mount.ceph-fuse.admin.smithi045.stderr:ceph-fuse[88614]: ceph mount failed with (65536) Unknown error 65536
2022-05-31T03:13:43.975 INFO:tasks.cephfs.fuse_mount.ceph-fuse.admin.smithi045.stderr:daemon-helper: command failed with exit status 1
2022-05-31T03:13:44.931 DEBUG:teuthology.orchestra.run.smithi045:> sudo modprobe fuse
2022-05-31T03:13:45.023 INFO:teuthology.orchestra.run:Running command with timeout 30
2022-05-31T03:13:45.023 DEBUG:teuthology.orchestra.run.smithi045:> sudo mount -t fusectl /sys/fs/fuse/connections /sys/fs/fuse/connections
2022-05-31T03:13:45.106 INFO:teuthology.orchestra.run.smithi045.stderr:mount: /sys/fs/fuse/connections: /sys/fs/fuse/connections already mounted or mount point busy.
2022-05-31T03:13:45.110 DEBUG:teuthology.orchestra.run:got remote process result: 32
2022-05-31T03:13:45.111 INFO:teuthology.orchestra.run:Running command with timeout 300
2022-05-31T03:13:45.111 DEBUG:teuthology.orchestra.run.smithi045:> ls /sys/fs/fuse/connections
2022-05-31T03:13:45.165 DEBUG:teuthology.orchestra.run:got remote process result: 1
2022-05-31T03:13:45.165 INFO:tasks.cephfs.fuse_mount:mount command failed.
2022-05-31T03:13:45.165 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_1b30281f276b97a8186594b7f92fe1f728418ada/teuthology/contextutil.py", line 31, in nested
    vars.append(enter())
  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_9e1d1677a609de7d04588e347b7aaa17ce9b38a2/qa/tasks/ceph.py", line 447, in cephfs_setup
    fs = Filesystem(ctx, fs_config=temp, name=name, create=True)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_9e1d1677a609de7d04588e347b7aaa17ce9b38a2/qa/tasks/cephfs/filesystem.py", line 512, in __init__
    self.create()
  File "/home/teuthworker/src/git.ceph.com_ceph-c_9e1d1677a609de7d04588e347b7aaa17ce9b38a2/qa/tasks/cephfs/filesystem.py", line 680, in create
    self.run_client_payload(f"setfattr -n ceph.dir.layout.pool -v {ec_data_pool_name} . && getfattr -n ceph.dir.layout .")
  File "/home/teuthworker/src/git.ceph.com_ceph-c_9e1d1677a609de7d04588e347b7aaa17ce9b38a2/qa/tasks/cephfs/filesystem.py", line 718, in run_client_payload
    m.mount_wait()
  File "/home/teuthworker/src/git.ceph.com_ceph-c_9e1d1677a609de7d04588e347b7aaa17ce9b38a2/qa/tasks/cephfs/mount.py", line 453, in mount_wait
    self.mount(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_9e1d1677a609de7d04588e347b7aaa17ce9b38a2/qa/tasks/cephfs/fuse_mount.py", line 52, in mount
    return self._mount(mntopts, check_status)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_9e1d1677a609de7d04588e347b7aaa17ce9b38a2/qa/tasks/cephfs/fuse_mount.py", line 68, in _mount
    retval = self._run_mount_cmd(mntopts, check_status)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_9e1d1677a609de7d04588e347b7aaa17ce9b38a2/qa/tasks/cephfs/fuse_mount.py", line 96, in _run_mount_cmd
    check_status, pre_mount_conns, mountcmd_stdout, mountcmd_stderr)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_9e1d1677a609de7d04588e347b7aaa17ce9b38a2/qa/tasks/cephfs/fuse_mount.py", line 171, in _wait_and_record_our_fuse_conn
    self.fuse_daemon.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_1b30281f276b97a8186594b7f92fe1f728418ada/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_1b30281f276b97a8186594b7f92fe1f728418ada/teuthology/orchestra/run.py", line 183, in _raise_for_status
    node=self.hostname, label=self.label
teuthology.exceptions.CommandFailedError: Command failed on smithi045 with status 1: "(cd /home/ubuntu/cephtest && exec sudo nsenter --net=/var/run/netns/ceph-ns--home-ubuntu-cephtest-mnt.admin sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-fuse -f --admin-socket '/var/run/ceph/$cluster-$name.$pid.asok' /home/ubuntu/cephtest/mnt.admin --id admin --client_fs=cephfs)" 
2022-05-31T03:13:45.168 INFO:teuthology.misc:Shutting down mds daemons...
2022-05-31T03:13:45.168 DEBUG:tasks.ceph.mds.a:waiting for process to exit
2022-05-31T03:13:45.168 INFO:teuthology.orchestra.run:waiting for 300

From the mds.0 logs:

2022-05-31T03:13:43.482+0000 7f21b5089700 10 mds.0.cache cache not ready for trimming
2022-05-31T03:13:43.482+0000 7f21b5089700 20 mds.0.cache upkeep thread waiting interval 1.000000000s
2022-05-31T03:13:43.493+0000 7f21b708d700  5 mds.beacon.d Sending beacon up:active seq 5
2022-05-31T03:13:43.493+0000 7f21b708d700  1 -- [v2:172.21.15.136:6835/1045808825,v1:172.21.15.136:6837/1045808825] --> [v2:172.21.15.136:3300/0,v1:172.21.15.136:6789/0] -- mdsbeacon(4270/d up:active seq=5 v5) v8 -- 0x5639ce331340 con 0x5639ce159400
2022-05-31T03:13:43.493+0000 7f21b708d700 20 mds.beacon.d sender thread waiting interval 4s
2022-05-31T03:13:43.957+0000 7f21b9091700  1 -- [v2:172.21.15.136:6835/1045808825,v1:172.21.15.136:6837/1045808825] <== mon.1 v2:172.21.15.136:3300/0 49 ==== mdsmap(e 6) v2 ==== 1094+0+0 (secure 0 0 0) 0x5639ce1bb380 con 0x5639ce159400
2022-05-31T03:13:43.957+0000 7f21b9091700  1 mds.d Updating MDS map to version 6 from mon.1
2022-05-31T03:13:43.957+0000 7f21b9091700 10 mds.d      my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
2022-05-31T03:13:43.957+0000 7f21b9091700 10 mds.d  mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
2022-05-31T03:13:43.957+0000 7f21b9091700 10 mds.d my gid is 4270
2022-05-31T03:13:43.957+0000 7f21b9091700 10 mds.d map says I am mds.0.4 state up:active
2022-05-31T03:13:43.957+0000 7f21b9091700 10 mds.d msgr says I am [v2:172.21.15.136:6835/1045808825,v1:172.21.15.136:6837/1045808825]
2022-05-31T03:13:43.957+0000 7f21b9091700 10 mds.d handle_mds_map: handling map as rank 0
2022-05-31T03:13:43.959+0000 7f21bc097700  1 -- [v2:172.21.15.136:6835/1045808825,v1:172.21.15.136:6837/1045808825] <== mon.1 v2:172.21.15.136:3300/0 50 ==== mdsbeacon(4270/d up:active seq=5 v6) v8 ==== 130+0+0 (secure 0 0 0) 0x5639ce331080 con 0x5639ce159400
2022-05-31T03:13:43.959+0000 7f21bc097700  5 mds.beacon.d received beacon reply up:active seq 5 rtt 0.46599
2022-05-31T03:13:44.051+0000 7f21b688c700  1 -- [v2:172.21.15.136:6835/1045808825,v1:172.21.15.136:6837/1045808825] --> [v2:172.21.15.136:6832/87765,v1:172.21.15.136:6833/87765] -- mgrreport(unknown.d +0-0 packed 1534 task_status=0) v9 -- 0x5639cd42d180 con 0x5639ce159800

The client logs:

2022-05-31T03:13:43.811+0000 7f7eaa7fc700 20 client.4441 tick
2022-05-31T03:13:43.811+0000 7f7eaa7fc700 20 client.4441 collect_and_send_metrics
2022-05-31T03:13:43.811+0000 7f7eaa7fc700 20 client.4441 collect_and_send_global_metrics
2022-05-31T03:13:43.811+0000 7f7eaa7fc700  5 client.4441 collect_and_send_global_metrics: no session with rank=0 -- not sending metric
2022-05-31T03:13:43.811+0000 7f7eaa7fc700 20 client.4441 trim_cache size 0 max 16384
2022-05-31T03:13:43.811+0000 7f7eaa7fc700 20 client.4441 upkeep thread waiting interval 1.000000000s
2022-05-31T03:13:43.812+0000 7f7eabfff700  1 client.4441 handle_mds_map epoch 5
2022-05-31T03:13:43.812+0000 7f7ece3c53c0 10 client.4441 mds cluster unavailable: epoch=5
2022-05-31T03:13:43.812+0000 7f7ece3c53c0  1 client.4441 shutdown
2022-05-31T03:13:43.812+0000 7f7ece3c53c0 10 client.4441 shutdown stopping cache invalidator finisher
2022-05-31T03:13:43.812+0000 7f7ece3c53c0 10 client.4441 shutdown stopping dentry invalidator finisher
2022-05-31T03:13:43.812+0000 7f7ece3c53c0 10 client.4441 shutdown stopping interrupt finisher
2022-05-31T03:13:43.812+0000 7f7ece3c53c0 10 client.4441 shutdown stopping remount finisher

Still not sure why the ceph-fuse didn't detect any active MDS.


Related issues

Copied to CephFS - Backport #56105: pacific: ceph-fuse[88614]: ceph mount failed with (65536) Unknown error 65536 Resolved
Copied to CephFS - Backport #56106: quincy: ceph-fuse[88614]: ceph mount failed with (65536) Unknown error 65536 Resolved

History

#1 Updated by Xiubo Li 6 months ago

  • Description updated (diff)

#2 Updated by Xiubo Li 6 months ago

Xiubo Li wrote:

https://pulpito.ceph.com/vshankar-2022-05-31_02:47:51-fs-wip-vshankar-fscrypt-20220530-091336-testing-default-smithi/6853737/

When mounting a ceph-fuse it failed:

[...]

From the mds.0 logs:

[...]

The client logs:

[...]

Checked the mds.0 logs, for the epoch 5, the mds.0 was still in the up:creating state:

.21.15.136:6789/0] -- mdsbeacon(4270/d up:active seq=4 v4) v8 -- 0x5639ce331080 con 0x5639ce159400
2022-05-31T03:13:39.494+0000 7f21b3085700 10 mds.0.snapclient sync
2022-05-31T03:13:39.494+0000 7f21b3085700 10 mds.0.snapclient refresh want 1
2022-05-31T03:13:39.494+0000 7f21b9091700  1 -- [v2:172.21.15.136:6835/1045808825,v1:172.21.15.136:6837/1045808825] <== mon.1 v2:172.21.15.136:3300/0 46 ==== mdsmap(e 5) v2 ==== 1094+0+0 (secure 0 0 0) 0x5639ce1bb380 con 0x5639ce159400
2022-05-31T03:13:39.494+0000 7f21b9091700  1 mds.d Updating MDS map to version 5 from mon.1
2022-05-31T03:13:39.494+0000 7f21b9091700 10 mds.d      my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
2022-05-31T03:13:39.494+0000 7f21b9091700 10 mds.d  mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
2022-05-31T03:13:39.494+0000 7f21b9091700 10 mds.d my gid is 4270
2022-05-31T03:13:39.494+0000 7f21b9091700 10 mds.d map says I am mds.0.4 state up:creating
2022-05-31T03:13:39.494+0000 7f21b9091700 10 mds.d msgr says I am [v2:172.21.15.136:6835/1045808825,v1:172.21.15.136:6837/1045808825]
2022-05-31T03:13:39.494+0000 7f21b9091700 10 mds.d handle_mds_map: handling map as rank 0
2022-05-31T03:13:39.494+0000 7f21b9091700 10 notify_mdsmap: mds.metrics
2022-05-31T03:13:39.494+0000 7f21b9091700 10 notify_mdsmap: mds.metrics: rank0 is unavailable
2022-05-31T03:13:39.494+0000 7f21b9091700 10 reset_seq: mds.metrics: last_updated_seq=0
2022-05-31T03:13:39.494+0000 7f21b9091700 20 set_next_seq: mds.metrics: current sequence number 0, setting next sequence number 0
2022-05-31T03:13:39.495+0000 7f21b9091700  1 -- [v2:172.21.15.136:6835/1045808825,v1:172.21.15.136:6837/1045808825] <== mon.1 v2:172.21.15.136:3300/0 47 ==== osd_map(26..26 src has 1..26) v4 ==== 685+0+0 (secure 0 0 0) 0x5639cd41f6c0 con 0x5639ce159400
2022-05-31T03:13:39.495+0000 7f21b9091700  7 mds.0.server operator(): full = 0 epoch = 26
2022-05-31T03:13:39.495+0000 7f21b9091700 10 mds.0.server apply_blocklist: killed 0

So no MDS is in up.

Still not sure why the ceph-fuse didn't detect any active MDS.

#3 Updated by Xiubo Li 6 months ago

Maybe we should wait for a while when mounting the ceph-fuse ?

#4 Updated by Venky Shankar 6 months ago

  • Category set to Administration/Usability
  • Assignee set to Xiubo Li
  • Target version set to v18.0.0
  • Backport set to quincy, pacific

#5 Updated by Xiubo Li 6 months ago

The ec is enabled and the test will create and set the ec pool to the layout:

    ceph:
      cephfs:
        ec_profile:
        - m=2
        - k=2
        - crush-failure-domain=osd
        max_mds: 5
        session_timeout: 300
        standby_replay: true

The logs:

2022-06-06T22:30:37.348 INFO:teuthology.orchestra.run:Running command with timeout 300
2022-06-06T22:30:37.348 DEBUG:teuthology.orchestra.run.smithi102:> stat --file-system '--printf=%T
2022-06-06T22:30:37.349 DEBUG:teuthology.orchestra.run.smithi102:> ' -- /home/ubuntu/cephtest/mnt.admin
2022-06-06T22:30:37.405 INFO:teuthology.orchestra.run.smithi102.stdout:fuseblk
2022-06-06T22:30:37.405 INFO:tasks.cephfs.fuse_mount:ceph-fuse is mounted on /home/ubuntu/cephtest/mnt.admin
2022-06-06T22:30:37.406 INFO:teuthology.orchestra.run:Running command with timeout 300
2022-06-06T22:30:37.406 DEBUG:teuthology.orchestra.run.smithi102:> sudo chmod 1777 /home/ubuntu/cephtest/mnt.admin
2022-06-06T22:30:37.493 INFO:teuthology.orchestra.run:Running command with timeout 300
2022-06-06T22:30:37.494 DEBUG:teuthology.orchestra.run.smithi102:> (cd /home/ubuntu/cephtest/mnt.admin && exec bash -c 'setfattr -n ceph.dir.layout.pool -v cephfs_data_ec . && getfattr -n ceph.dir.layout .')
2022-06-06T22:30:37.654 INFO:teuthology.orchestra.run.smithi102.stdout:# file: .
2022-06-06T22:30:37.654 INFO:teuthology.orchestra.run.smithi102.stdout:ceph.dir.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=cephfs_data_ec" 
2022-06-06T22:30:37.655 INFO:teuthology.orchestra.run.smithi102.stdout:
2022-06-06T22:30:37.656 INFO:tasks.cephfs.fuse_mount:Running fusermount -u on ubuntu@smithi102.front.sepia.ceph.com...
2022-06-06T22:30:37.657 INFO:teuthology.orchestra.run:Running command with timeout 300
2022-06-06T22:30:37.657 DEBUG:teuthology.orchestra.run.smithi102:> sudo fusermount -u /home/ubuntu/cephtest/mnt.admin
2022-06-06T22:30:37.737 INFO:tasks.cephfs.fuse_mount.ceph-fuse.admin.smithi102.stderr:ceph-fuse[124989]: fuse finished with error 0 and tester_r 0
2022-06-06T22:30:37.743 INFO:teuthology.orchestra.run:waiting for 300
2022-06-06T22:30:37.771 INFO:tasks.cephfs.mount:Cleaning up mount ubuntu@smithi102.front.sepia.ceph.com

From the code from qa/tasks/cephfs/filesystem.py, in Line#680 it will mount a fuse client and then set the ec pool to the layout:

 626     def create(self):
 ...
 653         if self.metadata_overlay:
 654             self.mon_manager.raw_cluster_cmd('fs', 'new',
 655                                              self.name, self.metadata_pool_name, data_pool_name,
 656                                              '--allow-dangerous-metadata-overlay')
 657         else:
 658             self.mon_manager.raw_cluster_cmd('fs', 'new',
 659                                              self.name,
 660                                              self.metadata_pool_name,
 661                                              data_pool_name)
 662 
 663             if self.ec_profile and 'disabled' not in self.ec_profile:
 664                 ec_data_pool_name = data_pool_name + "_ec" 
 665                 log.debug("EC profile is %s", self.ec_profile)
 666                 cmd = ['osd', 'erasure-code-profile', 'set', ec_data_pool_name]
 667                 cmd.extend(self.ec_profile)
 668                 self.mon_manager.raw_cluster_cmd(*cmd)
 669                 self.mon_manager.raw_cluster_cmd(
 670                     'osd', 'pool', 'create', ec_data_pool_name,
 671                     'erasure', ec_data_pool_name,
 672                     '--pg_num_min', str(self.pg_num_min),
 673                     '--target_size_ratio', str(self.target_size_ratio_ec))
 674                 self.mon_manager.raw_cluster_cmd(
 675                     'osd', 'pool', 'set',
 676                     ec_data_pool_name, 'allow_ec_overwrites', 'true')
 677                 self.add_data_pool(ec_data_pool_name, create=False)
 678                 self.check_pool_application(ec_data_pool_name)
 679 
 680                 self.run_client_payload(f"setfattr -n ceph.dir.layout.pool -v {ec_data_pool_name} . && getfattr -n ceph.dir.layout .")
 681 
 ...

The code of run_client_payload():

 713     def run_client_payload(self, cmd):
 714         # avoid circular dep by importing here:
 715         from tasks.cephfs.fuse_mount import FuseMount
 716         d = misc.get_testdir(self._ctx)                                                                                                           
 717         m = FuseMount(self._ctx, d, "admin", self.client_remote, cephfs_name=self.name)
 718         m.mount_wait()
 719         m.run_shell_payload(cmd)
 720         m.umount_wait(require_clean=True)
 721        

But it ran too fast that the MDSes in the ceph cluster were not ready yet. So we need to wait for a while to get at least the rank 0 to be up:active.

#6 Updated by Jos Collin 6 months ago

  • Pull request ID set to 46560

#7 Updated by Xiubo Li 6 months ago

  • Status changed from New to Fix Under Review

#8 Updated by Venky Shankar 6 months ago

  • Status changed from Fix Under Review to Pending Backport

#9 Updated by Backport Bot 6 months ago

  • Copied to Backport #56105: pacific: ceph-fuse[88614]: ceph mount failed with (65536) Unknown error 65536 added

#10 Updated by Backport Bot 6 months ago

  • Copied to Backport #56106: quincy: ceph-fuse[88614]: ceph mount failed with (65536) Unknown error 65536 added

#11 Updated by Xiubo Li 5 months ago

  • Status changed from Pending Backport to Resolved

Also available in: Atom PDF