Bug #54411


mds_upgrade_sequence: "overall HEALTH_WARN 4 failed cephadm daemon(s); 1 filesystem is degraded; insufficient standby MDS daemons available; 33 daemons have recently crashed" during suites/fsstress.sh

Added by Laura Flores about 2 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
quincy,pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Component(FS):
MDS
Labels (FS):
crash
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/yuriw-2022-02-21_15:48:20-rados-wip-yuri7-testing-2022-02-17-0852-pacific-distro-default-smithi/6698603

2022-02-21T22:00:01.210 INFO:journalctl@ceph.mon.smithi119.smithi119.stdout:Feb 21 22:00:00 smithi119 conmon[66118]: cluster 2022-02-21T22:00:00.000143+0000 mon.smithi107 (mon.0) 2988 : cluster [WRN] overall HEALTH_WARN 4 failed cephadm daemon(s); 1 filesystem is degraded; insufficient standby MDS daemons available; 33 daemons have recently crashed

...

2022-02-21T22:07:07.894 INFO:tasks.workunit:Stopping ['suites/fsstress.sh'] on client.1...
2022-02-21T22:07:07.895 DEBUG:teuthology.orchestra.run.smithi119:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.1 /home/ubuntu/cephtest/clone.client.1
2022-02-21T22:07:07.936 INFO:journalctl@ceph.mon.smithi107.smithi107.stdout:Feb 21 22:07:07 smithi107 conmon[101214]: cluster 2022-02-21T22:07:05.802071+0000 mgr.smithi119.mindyy (mgr.24491) 6642
2022-02-21T22:07:07.936 INFO:journalctl@ceph.mon.smithi107.smithi107.stdout:Feb 21 22:07:07 smithi107 conmon[101214]:  : cluster [DBG] pgmap v5672: 65 pgs: 65 active+clean; 2.5 GiB data, 7.5 GiB used, 529 GiB / 536 GiB avail
2022-02-21T22:07:07.940 INFO:journalctl@ceph.mon.smithi119.smithi119.stdout:Feb 21 22:07:07 smithi119 conmon[66118]: cluster 2022-02-21T22:07:05.802071+0000 mgr.smithi119.mindyy (mgr.24491) 6642 : cluster [DBG] pgmap v5672: 65 pgs: 65 active+clean; 2.5 GiB data, 7.5 GiB used, 529 GiB / 536 GiB avail
2022-02-21T22:07:08.150 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eea2521245e542c7c1a063d296779f572aa0255a/teuthology/run_tasks.py", line 91, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eea2521245e542c7c1a063d296779f572aa0255a/teuthology/run_tasks.py", line 70, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eea2521245e542c7c1a063d296779f572aa0255a/teuthology/task/parallel.py", line 56, in task
    p.spawn(_run_spawned, ctx, confg, taskname)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eea2521245e542c7c1a063d296779f572aa0255a/teuthology/parallel.py", line 84, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eea2521245e542c7c1a063d296779f572aa0255a/teuthology/parallel.py", line 98, in __next__
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eea2521245e542c7c1a063d296779f572aa0255a/teuthology/parallel.py", line 30, in resurrect_traceback
    raise exc.exc_info[1]
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eea2521245e542c7c1a063d296779f572aa0255a/teuthology/parallel.py", line 23, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eea2521245e542c7c1a063d296779f572aa0255a/teuthology/task/parallel.py", line 64, in _run_spawned
    mgr = run_tasks.run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eea2521245e542c7c1a063d296779f572aa0255a/teuthology/run_tasks.py", line 70, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eea2521245e542c7c1a063d296779f572aa0255a/teuthology/task/sequential.py", line 47, in task
    mgr = run_tasks.run_one_task(taskname, ctx=ctx, config=confg)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eea2521245e542c7c1a063d296779f572aa0255a/teuthology/run_tasks.py", line 70, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_9f91d3caa3f16637a5668f2b678fb3a44b6977ba/qa/tasks/workunit.py", line 147, in task
    cleanup=cleanup)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_9f91d3caa3f16637a5668f2b678fb3a44b6977ba/qa/tasks/workunit.py", line 297, in _spawn_on_all_clients
    timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eea2521245e542c7c1a063d296779f572aa0255a/teuthology/parallel.py", line 84, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eea2521245e542c7c1a063d296779f572aa0255a/teuthology/parallel.py", line 98, in __next__
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eea2521245e542c7c1a063d296779f572aa0255a/teuthology/parallel.py", line 30, in resurrect_traceback
    raise exc.exc_info[1]
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eea2521245e542c7c1a063d296779f572aa0255a/teuthology/parallel.py", line 23, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_9f91d3caa3f16637a5668f2b678fb3a44b6977ba/qa/tasks/workunit.py", line 426, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eea2521245e542c7c1a063d296779f572aa0255a/teuthology/orchestra/remote.py", line 509, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eea2521245e542c7c1a063d296779f572aa0255a/teuthology/orchestra/run.py", line 455, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eea2521245e542c7c1a063d296779f572aa0255a/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_eea2521245e542c7c1a063d296779f572aa0255a/teuthology/orchestra/run.py", line 183, in _raise_for_status
    node=self.hostname, label=self.label
teuthology.exceptions.CommandFailedError: Command failed (workunit test suites/fsstress.sh) on smithi119 with status 124: 'mkdir -p -- /home/ubuntu/cephtest/mnt.1/client.1/tmp && cd -- /home/ubuntu/cephtest/mnt.1/client.1/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=9f91d3caa3f16637a5668f2b678fb3a44b6977ba TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="1" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.1 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.1 CEPH_MNT=/home/ubuntu/cephtest/mnt.1 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.1/qa/workunits/suites/fsstress.sh'
2022-02-21T22:07:08.279 ERROR:teuthology.run_tasks: Sentry event: https://sentry.ceph.com/organizations/ceph/?query=de497daf715d4cd5840a23b715fe2354


Related issues 3 (0 open, 3 closed)

Related to CephFS - Bug #54459: fs:upgrade fails with "hit max job timeout" (Rejected, Venky Shankar)

Copied to CephFS - Backport #55447: quincy: mds_upgrade_sequence: "overall HEALTH_WARN 4 failed cephadm daemon(s); 1 filesystem is degraded; insufficient standby MDS daemons available; 33 daemons have recently crashed" during suites/fsstress.sh (Resolved, Xiubo Li)

Copied to CephFS - Backport #55449: pacific: mds_upgrade_sequence: "overall HEALTH_WARN 4 failed cephadm daemon(s); 1 filesystem is degraded; insufficient standby MDS daemons available; 33 daemons have recently crashed" during suites/fsstress.sh (Resolved, Xiubo Li)
Actions #1

Updated by Venky Shankar about 2 years ago

  • Assignee set to Xiubo Li
Actions #2

Updated by Venky Shankar about 2 years ago

  • Status changed from New to Triaged
Actions #3

Updated by Xiubo Li about 2 years ago

  • Status changed from Triaged to In Progress
Actions #4

Updated by Xiubo Li about 2 years ago

The MDS crashed in:

2022-02-21T19:08:19.892+0000 7f058d981700  1 -- [v2:172.21.15.119:6824/1669177137,v1:172.21.15.119:6825/1669177137] --> [v2:172.21.15.119:3300/0,v1:172.21.15.119:6789/0] -- mdsbeacon(24267/cephfs.smithi119.cylrka up:rejoin fs=cephfs seq=36 v16) v8 -- 0x55c745eb2580 con 0x55c745deb000
2022-02-21T19:08:20.069+0000 7f058d981700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.4/rpm/el8/BUILD/ceph-16.2.4/src/include/cephfs/metrics/Types.h: In function 'std::ostream& operator<<(std::ostream&, const ClientMetricType&)' thread 7f058d981700 time 2022-02-21T19:08:20.069632+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.4/rpm/el8/BUILD/ceph-16.2.4/src/include/cephfs/metrics/Types.h: 56: ceph_abort_msg("abort() called")

 ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x7f05965e2cdc]
 2: (operator<<(std::ostream&, ClientMetricType const&)+0x10e) [0x7f059686742e]
 3: (MClientMetrics::print(std::ostream&) const+0x1a1) [0x7f0596867601]
 4: (DispatchQueue::pre_dispatch(boost::intrusive_ptr<Message> const&)+0x710) [0x7f059681ac30]
 5: (DispatchQueue::entry()+0xdeb) [0x7f059681c69b]
 6: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f05968ccb71]
 7: /lib64/libpthread.so.0(+0x814a) [0x7f059538414a]
 8: clone()

2022-02-21T19:08:20.070+0000 7f058d981700 -1 *** Caught signal (Aborted) **
 in thread 7f058d981700 thread_name:ms_dispatch

 ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12b20) [0x7f059538eb20]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x7f05965e2dad]
 5: (operator<<(std::ostream&, ClientMetricType const&)+0x10e) [0x7f059686742e]
 6: (MClientMetrics::print(std::ostream&) const+0x1a1) [0x7f0596867601]
 7: (DispatchQueue::pre_dispatch(boost::intrusive_ptr<Message> const&)+0x710) [0x7f059681ac30]
 8: (DispatchQueue::entry()+0xdeb) [0x7f059681c69b]
 9: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f05968ccb71]
 10: /lib64/libpthread.so.0(+0x814a) [0x7f059538414a]
 11: clone()
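The backtrace shows the abort originates in `operator<<(std::ostream&, ClientMetricType const&)` at `src/include/cephfs/metrics/Types.h:56`: the 16.2.4 MDS receives an `MClientMetrics` message carrying a metric type it does not know about, and the stream operator's fallthrough calls `ceph_abort_msg("abort() called")`, killing the `ms_dispatch` thread. The following is a simplified, hypothetical sketch (the enum values and names here are illustrative, not the actual Ceph source) of this failure mode, with a tolerant printer in place of the abort:

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical stand-in for ClientMetricType: this "old MDS" only knows
// values up to WRITE_LATENCY; a newer client may send values beyond it.
enum class ClientMetricType : int {
  CAP_INFO = 0,
  READ_LATENCY = 1,
  WRITE_LATENCY = 2,
  // Newer clients define further types the old MDS has never seen.
};

// Tolerant printer: an unknown enum value is rendered as a placeholder
// instead of aborting the dispatch thread (as in the crash above, where
// the default branch effectively called ceph_abort_msg("abort() called")).
std::ostream& operator<<(std::ostream& os, ClientMetricType t) {
  switch (t) {
    case ClientMetricType::CAP_INFO:      return os << "CAP_INFO";
    case ClientMetricType::READ_LATENCY:  return os << "READ_LATENCY";
    case ClientMetricType::WRITE_LATENCY: return os << "WRITE_LATENCY";
    default:
      return os << "(unknown metric type " << static_cast<int>(t) << ")";
  }
}

// Helper so the printer is easy to exercise in tests.
std::string to_string(ClientMetricType t) {
  std::ostringstream os;
  os << t;
  return os.str();
}
```

Note this sketch only illustrates why printing an out-of-range enum crashed the MDS; the actual fix (see the pull request referenced on this ticket) is on the sending side, not a rewrite of this operator.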

Actions #5

Updated by Xiubo Li about 2 years ago

This issue is similar to https://tracker.ceph.com/issues/53293, which was for the kclient; this one is for libcephfs clients.

Actions #6

Updated by Xiubo Li about 2 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 45372
  • ceph-qa-suite fs added
  • Component(FS) MDS added
  • Labels (FS) crash added
Actions #7

Updated by Xiubo Li about 2 years ago

  • Related to Bug #54459: fs:upgrade fails with "hit max job timeout" added
Actions #8

Updated by Xiubo Li about 2 years ago

  • Pull request ID changed from 45372 to 45370
Actions #9

Updated by Xiubo Li about 2 years ago

Thanks Kotresh. The mgr connects to the cephfs cluster via libcephfs to do the internal required filesystem operations:

https://github.com/ceph/ceph/blob/0b44878cb24290d231e0ab807c8203cec30cf563/src/pybind/mgr/mgr_util.py#L152

So the libcephfs clients will also send metrics to the old MDSes.
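One plausible client-side mitigation, consistent with the fix direction discussed here, is to gate which metric types a client sends on what the MDS advertises as supported, so newer metric types never reach an older MDS. A hypothetical sketch of that filtering step (the names `MetricSpec` and `filter_metrics` are illustrative, not the actual libcephfs API):

```cpp
#include <cassert>
#include <bitset>
#include <vector>

// Hypothetical metric negotiation: the MDS advertises a bitset of the
// metric type IDs it understands; the client sends only those.
constexpr size_t kMaxMetricTypes = 64;
using MetricSpec = std::bitset<kMaxMetricTypes>;

// Drop any metric type the (possibly older) MDS has not advertised,
// instead of sending it and triggering an abort on the receiving side.
std::vector<int> filter_metrics(const std::vector<int>& wanted,
                                const MetricSpec& mds_supported) {
  std::vector<int> out;
  for (int t : wanted) {
    if (t >= 0 && static_cast<size_t>(t) < kMaxMetricTypes &&
        mds_supported.test(static_cast<size_t>(t))) {
      out.push_back(t);
    }
  }
  return out;
}
```

Under this scheme, a mgr using libcephfs against a mid-upgrade cluster would simply stop reporting the newer metric types until the MDS is upgraded, rather than crashing it.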

Actions #10

Updated by Jeff Layton about 2 years ago

Why on earth are we trying to fix this in the client? This is an MDS bug plain and simple, and a security-sensitive one to boot. This needs to be fixed in the MDS - full stop.

Actions #11

Updated by Venky Shankar about 2 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to quincy
Actions #12

Updated by Backport Bot about 2 years ago

  • Copied to Backport #55447: quincy: mds_upgrade_sequence: "overall HEALTH_WARN 4 failed cephadm daemon(s); 1 filesystem is degraded; insufficient standby MDS daemons available; 33 daemons have recently crashed" during suites/fsstress.sh added
Actions #13

Updated by Venky Shankar almost 2 years ago

  • Backport changed from quincy to quincy,pacific
Actions #14

Updated by Backport Bot almost 2 years ago

  • Copied to Backport #55449: pacific: mds_upgrade_sequence: "overall HEALTH_WARN 4 failed cephadm daemon(s); 1 filesystem is degraded; insufficient standby MDS daemons available; 33 daemons have recently crashed" during suites/fsstress.sh added
Actions #15

Updated by Xiubo Li almost 2 years ago

Actions #16

Updated by Laura Flores almost 2 years ago

/a/yuriw-2022-05-31_21:35:41-rados-wip-yuri2-testing-2022-05-31-1300-pacific-distro-default-smithi/6856451

Actions #17

Updated by Xiubo Li almost 2 years ago

  • Status changed from Pending Backport to Resolved