Bug #44030

mds: "mds daemon damaged" after restarting MDS - Filesystem DOWN

Added by Luca Cervigni 14 days ago. Updated 4 days ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature:

Description

CRASH LOG:

{
"crash_id": "2020-02-07_06:44:44.534106Z_bfebeb65-8d38-49ed-b811-731f0152325f",
"timestamp": "2020-02-07 06:44:44.534106Z",
"process_name": "ceph-mds",
"entity_name": "mds.ceph-mon-01",
"ceph_version": "14.2.7",
"utsname_hostname": "ceph-mon-01",
"utsname_sysname": "Linux",
"utsname_release": "4.15.0-76-generic",
"utsname_version": "#86-Ubuntu SMP Fri Jan 17 17:24:28 UTC 2020",
"utsname_machine": "x86_64",
"os_name": "Ubuntu",
"os_id": "ubuntu",
"os_version_id": "18.04",
"os_version": "18.04.4 LTS (Bionic Beaver)",
"assert_condition": "r == 0",
"assert_func": "virtual void C_MDS_mknod_finish::finish(int)",
"assert_file": "/build/ceph-14.2.7/src/mds/Server.cc",
"assert_line": 5651,
"assert_thread_name": "fn_anonymous",
"assert_msg": "/build/ceph-14.2.7/src/mds/Server.cc: In function 'virtual void C_MDS_mknod_finish::finish(int)' thread 7f3c67462700 time 2020-02-07 06:44:44.532145\n/build/ceph-14.2.7/src/mds/Server.cc: 5651: FAILED ceph_assert(r == 0)\n",
"backtrace": [
"(()+0x12890) [0x7f3c76608890]",
"(gsignal()+0xc7) [0x7f3c75700e97]",
"(abort()+0x141) [0x7f3c75702801]",
"(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x7f3c76cf42d3]",
"(ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x7f3c76cf445d]",
"(C_MDS_mknod_finish::finish(int)+0x27e) [0x5565679faace]",
"(MDSContext::complete(int)+0x73) [0x556567be6873]",
"(MDSIOContextBase::complete(int)+0x15a) [0x556567be6afa]",
"(MDSLogContextBase::complete(int)+0x40) [0x556567be6d80]",
"(Finisher::finisher_thread_entry()+0x16e) [0x7f3c76d3f08e]",
"(()+0x76db) [0x7f3c765fd6db]",
"(clone()+0x3f) [0x7f3c757e388f]"
]
}
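For triage, the fields of a crash report like this one can be condensed into a one-line summary. A minimal Python sketch, assuming the report has been saved as JSON text; the `summarize_crash` helper and the abbreviated `SAMPLE` are illustrative and not part of Ceph:

```python
import json
import posixpath

# Abbreviated copy of the crash report, keeping only the fields used here.
SAMPLE = """
{
  "crash_id": "2020-02-07_06:44:44.534106Z_bfebeb65-8d38-49ed-b811-731f0152325f",
  "entity_name": "mds.ceph-mon-01",
  "ceph_version": "14.2.7",
  "assert_func": "virtual void C_MDS_mknod_finish::finish(int)",
  "assert_file": "/build/ceph-14.2.7/src/mds/Server.cc",
  "assert_line": 5651
}
"""


def summarize_crash(report_text):
    """Return a one-line summary of an assert-type crash report."""
    report = json.loads(report_text)
    # Keep only the file name; the build prefix is not useful for triage.
    source = posixpath.basename(report["assert_file"])
    return "{} ({}): assert failed in {} at {}:{}".format(
        report["entity_name"],
        report["ceph_version"],
        report["assert_func"],
        source,
        report["assert_line"],
    )


if __name__ == "__main__":
    print(summarize_crash(SAMPLE))
```

Run against both crash reports in this ticket, this makes it easy to see that the two asserts fire in different completion callbacks of the same file.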

How it happened
Running Nautilus 14.2.7. The data in the FS is important and cannot be lost.

Today I increased the PGs of the volume pool from 8k to 16k. The active MDS started reporting slow ops (the filesystem is not in the volume pool). After a few hours the FS was very slow, so I reduced the backfill to 1. Since the situation was not improving, I restarted the MDS (there were no standby MDSs; it was a single MDS).

After that came the crash. The MDS does not come back up, with this error:

2020-02-07 07:03:32.477 7fbf69647700 -1 NetHandler create_socket couldn't create socket (97) Address family not supported by protocol
2020-02-07 07:03:32.541 7fbf65e6a700 1 mds.ceph-mon-01 Updating MDS map to version 48461 from mon.2
2020-02-07 07:03:37.613 7fbf65e6a700 1 mds.ceph-mon-01 Updating MDS map to version 48462 from mon.2
2020-02-07 07:03:37.613 7fbf65e6a700 1 mds.ceph-mon-01 Map has assigned me to become a standby
2020-02-07 07:14:11.789 7fbf66e42700 -1 received signal: Terminated from /sbin/init (PID: 1) UID: 0
2020-02-07 07:14:11.789 7fbf66e42700 -1 mds.ceph-mon-01 *** got signal Terminated ***
2020-02-07 07:14:11.789 7fbf66e42700 1 mds.ceph-mon-01 suicide! Wanted state up:standby
2020-02-07 07:14:12.565 7fbf65e6a700 0 ms_deliver_dispatch: unhandled message 0x563fcb438d00 mdsmap(e 48465) v1 from mon.2 v1:10.3.78.32:6789/0
2020-02-07 07:25:16.782 7f26c39de2c0 0 set uid:gid to 64045:64045 (ceph:ceph)
2020-02-07 07:25:16.782 7f26c39de2c0 0 ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable), process ceph-mds, pid 3724
2020-02-07 07:25:16.782 7f26c39de2c0 0 pidfile_write: ignore empty --pid-file
2020-02-07 07:25:16.786 7f26b5326700 -1 NetHandler create_socket couldn't create socket (97) Address family not supported by protocol
2020-02-07 07:25:16.790 7f26b1b49700 1 mds.ceph-mon-01 Updating MDS map to version 48472 from mon.0
2020-02-07 07:25:17.691 7f26b1b49700 1 mds.ceph-mon-01 Updating MDS map to version 48473 from mon.0
2020-02-07 07:25:17.691 7f26b1b49700 1 mds.ceph-mon-01 Map has assigned me to become a standby
2020-02-07 07:29:50.306 7f26b2b21700 -1 received signal: Terminated from /sbin/init (PID: 1) UID: 0
2020-02-07 07:29:50.306 7f26b2b21700 -1 mds.ceph-mon-01 *** got signal Terminated ***
2020-02-07 07:29:50.306 7f26b2b21700 1 mds.ceph-mon-01 suicide! Wanted state up:standby
2020-02-07 07:29:50.526 7f26b5b27700 1 mds.beacon.ceph-mon-01 discarding unexpected beacon reply down:dne seq 70 dne
2020-02-07 07:29:52.802 7f26b1b49700 0 ms_deliver_dispatch: unhandled message 0x55ef110ab200 mdsmap(e 48474) v1 from mon.0 v1:10.3.78.22:6789/0
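The pattern in these logs (map update, standby assignment, termination by init) can be pulled out mechanically. A minimal Python sketch, assuming the log line layout shown above (timestamp, thread id, level, message); `parse_line`, `summarize` and `SAMPLE_LOG` are illustrative names, and the sample is a subset of the lines above:

```python
import re
from datetime import datetime

# timestamp, thread id, log level, message
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) (\S+) +(-?\d+) (.*)$"
)
MAP_RE = re.compile(r"Updating MDS map to version (\d+) from (\S+)")

SAMPLE_LOG = """\
2020-02-07 07:03:32.541 7fbf65e6a700 1 mds.ceph-mon-01 Updating MDS map to version 48461 from mon.2
2020-02-07 07:03:37.613 7fbf65e6a700 1 mds.ceph-mon-01 Updating MDS map to version 48462 from mon.2
2020-02-07 07:03:37.613 7fbf65e6a700 1 mds.ceph-mon-01 Map has assigned me to become a standby
2020-02-07 07:14:11.789 7fbf66e42700 -1 received signal: Terminated from /sbin/init (PID: 1) UID: 0
2020-02-07 07:14:11.789 7fbf66e42700 1 mds.ceph-mon-01 suicide! Wanted state up:standby
"""


def parse_line(line):
    """Split one log line into (datetime, message), or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    stamp, _thread, _level, message = match.groups()
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f"), message


def summarize(log_text):
    """Return (MDS map versions seen, seconds spent in standby before each SIGTERM)."""
    versions = []
    standby_seconds = []
    standby_at = None
    for line in log_text.splitlines():
        parsed = parse_line(line)
        if parsed is None:
            continue
        when, message = parsed
        map_update = MAP_RE.search(message)
        if map_update:
            versions.append(int(map_update.group(1)))
        elif "become a standby" in message:
            standby_at = when
        elif "received signal: Terminated" in message and standby_at is not None:
            standby_seconds.append((when - standby_at).total_seconds())
            standby_at = None
    return versions, standby_seconds


if __name__ == "__main__":
    print(summarize(SAMPLE_LOG))
```

On the sample, the daemon sits in standby for a little over ten minutes before it is stopped, which matches the manual restarts described here rather than a crash of the daemon itself.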

Rebooting did not help

I asked in #ceph on OFTC and they suggested bringing up another "fresh" MDS. I did that, but the MDSs do not start and go to standby. Logs:

2020-02-07 07:12:46.696 7fe4b388b2c0 0 set uid:gid to 64045:64045 (ceph:ceph)
2020-02-07 07:12:46.696 7fe4b388b2c0 0 ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable), process ceph-mds, pid 74742
2020-02-07 07:12:46.696 7fe4b388b2c0 0 pidfile_write: ignore empty --pid-file
2020-02-07 07:12:46.704 7fe4a19f6700 1 mds.ceph-mon-02 Updating MDS map to version 48462 from mon.0
2020-02-07 07:12:47.456 7fe4a19f6700 1 mds.ceph-mon-02 Updating MDS map to version 48463 from mon.0
2020-02-07 07:12:47.456 7fe4a19f6700 1 mds.ceph-mon-02 Map has assigned me to become a standby
2020-02-07 07:14:16.615 7fe4a29ce700 -1 received signal: Terminated from /sbin/init (PID: 1) UID: 0
2020-02-07 07:14:16.615 7fe4a29ce700 -1 mds.ceph-mon-02 *** got signal Terminated ***
2020-02-07 07:14:16.615 7fe4a29ce700 1 mds.ceph-mon-02 suicide! Wanted state up:standby
2020-02-07 07:14:16.947 7fe4a51d3700 1 mds.beacon.ceph-mon-02 discarding unexpected beacon reply down:dne seq 24 dne
2020-02-07 07:14:18.715 7fe4a19f6700 0 ms_deliver_dispatch: unhandled message 0x5602fbc6df80 mdsmap(e 48466) v1 from mon.0 v2:10.3.78.22:3300/0
2020-02-07 07:25:02.093 7f3c2f92a2c0 0 set uid:gid to 64045:64045 (ceph:ceph)
2020-02-07 07:25:02.093 7f3c2f92a2c0 0 ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable), process ceph-mds, pid 75471
2020-02-07 07:25:02.093 7f3c2f92a2c0 0 pidfile_write: ignore empty --pid-file
2020-02-07 07:25:02.097 7f3c1da95700 1 mds.ceph-mon-02 Updating MDS map to version 48471 from mon.2
2020-02-07 07:25:06.413 7f3c1da95700 1 mds.ceph-mon-02 Updating MDS map to version 48472 from mon.2
2020-02-07 07:25:06.413 7f3c1da95700 1 mds.ceph-mon-02 Map has assigned me to become a standby
2020-02-07 07:29:56.869 7f3c1ea6d700 -1 received signal: Terminated from /sbin/init (PID: 1) UID: 0
2020-02-07 07:29:56.869 7f3c1ea6d700 -1 mds.ceph-mon-02 *** got signal Terminated ***
2020-02-07 07:29:56.869 7f3c1ea6d700 1 mds.ceph-mon-02 suicide! Wanted state up:standby
2020-02-07 07:29:58.113 7f3c1da95700 0 ms_deliver_dispatch: unhandled message 0x563c5df33f80 mdsmap(e 48475) v1 from mon.2 v2:10.3.78.32:3300/0

Here is the output of ceph status:

  cluster:
    id:     a8dde71d-ca7b-4cf5-bd38-8989c6a27011
    health: HEALTH_ERR
            1 filesystem is degraded
            1 filesystem is offline
            1 mds daemon damaged
            2 daemons have recently crashed
  services:
    mon: 3 daemons, quorum ceph-mon-01,ceph-mon-02,ceph-mon-03 (age 41m)
    mgr: ceph-mon-02(active, since 41m), standbys: ceph-mon-03, ceph-mon-01
    mds: pawsey-sync-fs:0/1, 1 damaged
    osd: 925 osds: 715 up (since 2h), 715 in (since 23h)
    rgw: 3 daemons active (radosgw-01, radosgw-02, radosgw-03)
  data:
    pools:   24 pools, 26569 pgs
    objects: 52.64M objects, 199 TiB
    usage:   685 TiB used, 6.7 PiB / 7.3 PiB avail
    pgs:     26513 active+clean
             54 active+clean+scrubbing+deep
             2 active+clean+scrubbing
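Two numbers in this status output are worth cross-checking: how many OSDs are not up or in, and whether the per-state PG counts add up to the declared total. A small Python sketch, assuming the text layout of the status output above; `check_status` and `SAMPLE_STATUS` are illustrative names, and the sample contains only the relevant lines:

```python
import re

SAMPLE_STATUS = """\
osd: 925 osds: 715 up (since 2h), 715 in (since 23h)
pools: 24 pools, 26569 pgs
pgs: 26513 active+clean
54 active+clean+scrubbing+deep
2 active+clean+scrubbing
"""

OSD_RE = re.compile(r"osd:\s+(\d+) osds:\s+(\d+) up.*?(\d+) in")
TOTAL_PGS_RE = re.compile(r"pools:\s+\d+ pools,\s+(\d+) pgs")
# A count followed by a single '+'-joined state token, e.g. "54 active+clean".
PG_STATE_RE = re.compile(
    r"^\s*(?:pgs:\s+)?(\d+)\s+([a-z_]+(?:\+[a-z_]+)*)\s*$", re.MULTILINE
)


def check_status(text):
    """Extract OSD and PG counts from status text and derive the differences."""
    total, up, in_ = (int(value) for value in OSD_RE.search(text).groups())
    declared_pgs = int(TOTAL_PGS_RE.search(text).group(1))
    states = {state: int(count) for count, state in PG_STATE_RE.findall(text)}
    return {
        "osds_not_up": total - up,
        "osds_not_in": total - in_,
        "declared_pgs": declared_pgs,
        "summed_pgs": sum(states.values()),
        "states": states,
    }


if __name__ == "__main__":
    print(check_status(SAMPLE_STATUS))
```

For the values above this reports 210 OSDs that are neither up nor in, while the PG states sum to the declared 26569 and all of them are active+clean.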

Ceph osd ls detail: https://pastebin.com/raw/bxi4HSa5

The metadata pool is on NVMe.

Can anyone give me some help?

Commands such as journal repairs do not work, as they expect the MDS to be up.

Now none of my MDSs can be brought up.

Thanks

Cheers

History

#1 Updated by Luca Cervigni 14 days ago

SECOND CRASH LOG: {
"crash_id": "2020-02-07_03:38:59.667251Z_18b5e608-2954-4c6f-b205-d6d6d52d65c3",
"timestamp": "2020-02-07 03:38:59.667251Z",
"process_name": "ceph-mds",
"entity_name": "mds.ceph-mon-01",
"ceph_version": "14.2.7",
"utsname_hostname": "ceph-mon-01",
"utsname_sysname": "Linux",
"utsname_release": "4.15.0-76-generic",
"utsname_version": "#86-Ubuntu SMP Fri Jan 17 17:24:28 UTC 2020",
"utsname_machine": "x86_64",
"os_name": "Ubuntu",
"os_id": "ubuntu",
"os_version_id": "18.04",
"os_version": "18.04.4 LTS (Bionic Beaver)",
"assert_condition": "r == 0",
"assert_func": "virtual void C_MDS_unlink_local_finish::finish(int)",
"assert_file": "/build/ceph-14.2.7/src/mds/Server.cc",
"assert_line": 6767,
"assert_thread_name": "fn_anonymous",
"assert_msg": "/build/ceph-14.2.7/src/mds/Server.cc: In function 'virtual void C_MDS_unlink_local_finish::finish(int)' thread 7f5d35fe4700 time 2020-02-07 03:38:59.663792\n/build/ceph-14.2.7/src/mds/Server.cc: 6767: FAILED ceph_assert(r == 0)\n",
"backtrace": [
"(()+0x12890) [0x7f5d4518a890]",
"(gsignal()+0xc7) [0x7f5d44282e97]",
"(abort()+0x141) [0x7f5d44284801]",
"(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x7f5d458762d3]",
"(ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x7f5d4587645d]",
"(C_MDS_SlaveRmdirPrep::finish(int)+0) [0x558ef557efe0]",
"(MDSContext::complete(int)+0x73) [0x558ef576a873]",
"(MDSIOContextBase::complete(int)+0x15a) [0x558ef576aafa]",
"(MDSLogContextBase::complete(int)+0x40) [0x558ef576ad80]",
"(Finisher::finisher_thread_entry()+0x16e) [0x7f5d458c108e]",
"(()+0x76db) [0x7f5d4517f6db]",
"(clone()+0x3f) [0x7f5d4436588f]"
]
}

#2 Updated by Luca Cervigni 14 days ago

ceph.log
ceph-post-file: 2c9c6886-840f-4270-b4c1-323343c9efa4

ceph-mon.log
ceph-post-file: 2c9c6886-840f-4270-b4c1-323343c9efa4

ceph-mds.log
ceph-post-file: a6e304d2-78c6-4215-9c37-869b3f698619

crash logs
ceph-post-file: bc5f5df1-4fd6-4d9a-a5fb-17e8d07f4e0e

#3 Updated by Luca Cervigni 11 days ago

dvanders is helping me investigate further, since our filesystem has now been down for more than 48 hours. He suggests that, judging from the logs, the MDS is refusing to start because the rank is damaged. He also suggests that a developer should first understand and explain the implications of the original ceph_assert(r == 0) before advising how to clean up the damaged rank.
Would it be possible for someone to have a look at this and let me know if more information is needed?

#4 Updated by Luca Cervigni 11 days ago

Today I tried:
- Stopped all MDSs again.
- Forced the MDS up with ceph mds repaired.
- Starting the MDS leads to another crash, and the MDS does not start.

ceph-post-file: 3775769f-dc90-4ce8-a0b8-6f2bfea14d6c

then I tried to follow the disaster recovery page:
https://docs.ceph.com/docs/nautilus/cephfs/disaster-recovery-experts/#disaster-recovery-experts

I ran:
cephfs-journal-tool --rank=pawsey-sync-fs:0 event recover_dentries summary
cephfs-journal-tool --rank=pawsey-sync-fs:0 journal reset
cephfs-table-tool all reset session
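This sequence can be written down as a dry-run plan so that it can be reviewed before anything is executed. A sketch in Python, assuming one extra step that is not in the list above: a `journal export` backup, which the linked disaster-recovery page describes as the first step. `recovery_plan`, `render` and the `backup.bin` file name are illustrative; the script only prints the commands and runs nothing:

```python
import shlex


def recovery_plan(fs_name, rank=0, backup_path="backup.bin"):
    """Return the recovery commands as argv lists, in the order they would run."""
    target = "--rank={}:{}".format(fs_name, rank)
    return [
        # Assumed extra step: keep a copy of the journal before modifying it.
        ["cephfs-journal-tool", target, "journal", "export", backup_path],
        # The three commands listed above.
        ["cephfs-journal-tool", target, "event", "recover_dentries", "summary"],
        ["cephfs-journal-tool", target, "journal", "reset"],
        ["cephfs-table-tool", "all", "reset", "session"],
    ]


def render(plan):
    """Turn argv lists into shell-safe command lines for review."""
    return [" ".join(shlex.quote(arg) for arg in argv) for argv in plan]


if __name__ == "__main__":
    for command_line in render(recovery_plan("pawsey-sync-fs")):
        print(command_line)
```

Keeping the plan as data makes it easy to paste the exact commands into a ticket like this one before running them.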

Now the MDS starts and I can see files, with several errors, in the logs. These are private though, so I posted them here:

ceph-post-file: 6f9f6a57-52ea-4657-878e-0e60e5a069c2

2020-02-10 05:36:36.457 7f8afab48700 0 mds.0.cache.dir(0x609) _fetched badness: got (but i already had) [inode 0x1000008370d [2,head] "/correct file path" auth v4466 s=2997 n(v0 rc2020-01-16 01:39:10.112907 b2997 1=1+0) (ifi

So the files seem to be there somewhere, but when I mount the filesystem I cannot see them.
There should be a directory in /, but I can see only files. The directory seems to be a number:

root@xxxxxxx:/mnt/cephfs# ls -la
total 5
drwxr-xr-x 5 root root 3 Feb 4 14:45 .
drwxr-xr-x 4 root root 4096 Jan 15 15:41 ..
-rw------- 1 root root 0 Feb 4 14:45 .asd.swp
-rw------- 1 root root 0 Feb 4 14:45 .asd.swpx
-rw------- 1 root root 0 Feb 4 14:44 .pippo.swp
-rw------- 1 root root 0 Feb 4 14:44 .pippo.swpx
-rw-r--r-- 1 root root 0 Feb 4 14:44 4913 <----- maybe this is the missing DIR
-rw-r--r-- 1 root root 7 Feb 4 14:45 asd
-rw-r--r-- 1 root root 0 Feb 4 14:44 pippo
-rw-r--r-- 1 root root 0 Feb 4 14:44 pippo~

How do I recover my files?

#5 Updated by Patrick Donnelly 4 days ago

  • Project changed from Ceph to fs
  • Subject changed from "mds daemon damaged" after restarting MDS - Filesystem DOWN to mds: "mds daemon damaged" after restarting MDS - Filesystem DOWN
  • Target version deleted (v14.2.7)
