Bug #46831: nautilus: mds: SIGSEGV in MDCache::finish_uncommitted_slave - CephFS - Ceph

Bug #46831

closed

nautilus: mds: SIGSEGV in MDCache::finish_uncommitted_slave

Added by Patrick Donnelly over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Immediate
Assignee: Zheng Yan
Target version: Ceph - v14.2.11
Source: Q/A
Regression: Yes
Severity: 3 - minor
Component(FS): MDS
Labels (FS): crash
Pull request ID: 36462

Description

2020-08-04T09:18:26.606 INFO:tasks.ceph.mds.c.smithi163.stderr:*** Caught signal (Segmentation fault) **
2020-08-04T09:18:26.607 INFO:tasks.ceph.mds.c.smithi163.stderr: in thread 7fc450483700 thread_name:md_log_replay
2020-08-04T09:18:26.607 INFO:tasks.ceph.mds.c.smithi163.stderr: ceph version 14.2.10-256-gf23ff76200 (f23ff7620014d0d1324261eb383e8e25c588bdae) nautilus (stable)
2020-08-04T09:18:26.607 INFO:tasks.ceph.mds.c.smithi163.stderr: 1: (()+0x128a0) [0x7fc4604408a0]
2020-08-04T09:18:26.607 INFO:tasks.ceph.mds.c.smithi163.stderr: 2: (MDCache::finish_uncommitted_slave(metareqid_t, bool)+0x21e) [0x55d4c5f8004e]
2020-08-04T09:18:26.608 INFO:tasks.ceph.mds.c.smithi163.stderr: 3: (ESlaveUpdate::replay(MDSRank*)+0xf9) [0x55d4c617de89]
2020-08-04T09:18:26.608 INFO:tasks.ceph.mds.c.smithi163.stderr: 4: (MDLog::_replay_thread()+0x8b2) [0x55d4c611b6f2]
2020-08-04T09:18:26.608 INFO:tasks.ceph.mds.c.smithi163.stderr: 5: (MDLog::ReplayThread::entry()+0xd) [0x55d4c5e7d80d]
2020-08-04T09:18:26.608 INFO:tasks.ceph.mds.c.smithi163.stderr: 6: (()+0x76db) [0x7fc4604356db]
2020-08-04T09:18:26.608 INFO:tasks.ceph.mds.c.smithi163.stderr: 7: (clone()+0x3f) [0x7fc45f61ba3f]
2020-08-04T09:18:26.609 INFO:tasks.ceph.mds.c.smithi163.stderr:2020-08-04 09:18:26.600 7fc450483700 -1 *** Caught signal (Segmentation fault) **

From: /ceph/teuthology-archive/teuthology-2020-08-04_01:12:01-fs-nautilus-distro-basic-smithi/5285373/teuthology.log
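For triage, the numbered frames can be pulled out of teuthology stderr lines like the ones above. The following is only a minimal sketch: `FRAME_RE`, `parse_frames`, and `SAMPLE` are hypothetical names made up for illustration, not part of Ceph or teuthology, and the pattern assumes the frame layout seen in this log (`N: (symbol+0xoffset) [0xaddress]`).

```python
import re

# Assumed frame layout, taken from the log lines in this ticket:
#   "...stderr: 2: (MDCache::finish_uncommitted_slave(metareqid_t, bool)+0x21e) [0x55d4c5f8004e]"
FRAME_RE = re.compile(
    r"stderr: (\d+): \((.*)\+(0x[0-9a-f]+)\) \[(0x[0-9a-f]+)\]$"
)


def parse_frames(lines):
    """Return one dict per backtrace frame found in the given log lines."""
    frames = []
    for line in lines:
        match = FRAME_RE.search(line.rstrip())
        if not match:
            continue  # not a frame line (e.g. the "Caught signal" banner)
        index, symbol, offset, address = match.groups()
        frames.append({
            "index": int(index),
            # "()" is what the log prints when no symbol was resolved
            "symbol": None if symbol == "()" else symbol,
            "offset": int(offset, 16),
            "address": int(address, 16),
        })
    return frames


# Three lines copied from the Description above.
SAMPLE = """\
2020-08-04T09:18:26.606 INFO:tasks.ceph.mds.c.smithi163.stderr:*** Caught signal (Segmentation fault) **
2020-08-04T09:18:26.607 INFO:tasks.ceph.mds.c.smithi163.stderr: 1: (()+0x128a0) [0x7fc4604408a0]
2020-08-04T09:18:26.607 INFO:tasks.ceph.mds.c.smithi163.stderr: 2: (MDCache::finish_uncommitted_slave(metareqid_t, bool)+0x21e) [0x55d4c5f8004e]
"""

frames = parse_frames(SAMPLE.splitlines())
for frame in frames:
    print(frame["index"], frame["symbol"], hex(frame["offset"]))
```

Keeping the symbol and offset as separate fields makes it easier to compare this backtrace against the one in a suspected duplicate report.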

Related issues 2 (0 open — 2 closed)

Updated by Patrick Donnelly over 3 years ago

Status changed from In Progress to New
Assignee changed from Patrick Donnelly to Zheng Yan

Looks like this occurs shortly after the upgrade from Luminous:

2020-08-04T09:18:21.934 INFO:teuthology.run_tasks:Running task ceph.restart...
2020-08-04T09:18:21.947 INFO:tasks.ceph.mds.a:Restarting daemon
2020-08-04T09:18:21.947 INFO:teuthology.orchestra.run.smithi163:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-mds -f --cluster ceph -i a
2020-08-04T09:18:21.977 INFO:tasks.ceph.mds.a:Started
2020-08-04T09:18:21.978 INFO:tasks.ceph.mds.b:Restarting daemon
2020-08-04T09:18:21.978 INFO:teuthology.orchestra.run.smithi163:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-mds -f --cluster ceph -i b
2020-08-04T09:18:21.980 INFO:tasks.ceph.mds.b:Started
2020-08-04T09:18:21.981 INFO:tasks.ceph.mds.c:Restarting daemon
2020-08-04T09:18:21.981 INFO:teuthology.orchestra.run.smithi163:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-mds -f --cluster ceph -i c
2020-08-04T09:18:21.983 INFO:tasks.ceph.mds.c:Started
...
2020-08-04T09:18:26.606 INFO:tasks.ceph.mds.c.smithi163.stderr:*** Caught signal (Segmentation fault) **
2020-08-04T09:18:26.607 INFO:tasks.ceph.mds.c.smithi163.stderr: in thread 7fc450483700 thread_name:md_log_replay
2020-08-04T09:18:26.607 INFO:tasks.ceph.mds.c.smithi163.stderr: ceph version 14.2.10-256-gf23ff76200 (f23ff7620014d0d1324261eb383e8e25c588bdae) nautilus (stable)
2020-08-04T09:18:26.607 INFO:tasks.ceph.mds.c.smithi163.stderr: 1: (()+0x128a0) [0x7fc4604408a0]
2020-08-04T09:18:26.607 INFO:tasks.ceph.mds.c.smithi163.stderr: 2: (MDCache::finish_uncommitted_slave(metareqid_t, bool)+0x21e) [0x55d4c5f8004e]
2020-08-04T09:18:26.608 INFO:tasks.ceph.mds.c.smithi163.stderr: 3: (ESlaveUpdate::replay(MDSRank*)+0xf9) [0x55d4c617de89]
2020-08-04T09:18:26.608 INFO:tasks.ceph.mds.c.smithi163.stderr: 4: (MDLog::_replay_thread()+0x8b2) [0x55d4c611b6f2]
2020-08-04T09:18:26.608 INFO:tasks.ceph.mds.c.smithi163.stderr: 5: (MDLog::ReplayThread::entry()+0xd) [0x55d4c5e7d80d]
2020-08-04T09:18:26.608 INFO:tasks.ceph.mds.c.smithi163.stderr: 6: (()+0x76db) [0x7fc4604356db]
2020-08-04T09:18:26.608 INFO:tasks.ceph.mds.c.smithi163.stderr: 7: (clone()+0x3f) [0x7fc45f61ba3f]
2020-08-04T09:18:26.609 INFO:tasks.ceph.mds.c.smithi163.stderr:2020-08-04 09:18:26.600 7fc450483700 -1 *** Caught signal (Segmentation fault) **
2020-08-04T09:18:26.609 INFO:tasks.ceph.mds.c.smithi163.stderr: in thread 7fc450483700 thread_name:md_log_replay

And, indeed, the MDS is replaying ESlaveUpdate. Zheng, can you take a closer look at this?

This ticket is targeting 16.0.0 for now, on the assumption that it is also a bug in master.

Updated by Zheng Yan over 3 years ago

Status changed from New to Fix Under Review

https://github.com/ceph/ceph/pull/36462

Updated by Ramana Raja over 3 years ago

This issue was already reported at https://tracker.ceph.com/issues/46675

Updated by Patrick Donnelly over 3 years ago

Target version changed from v16.0.0 to v14.2.11
Backport deleted (octopus, nautilus)

Bug is only in nautilus.

Updated by Patrick Donnelly over 3 years ago

Pull request ID set to 36462

Updated by Patrick Donnelly over 3 years ago

Related to Backport #45709: nautilus: mds: wrong link count under certain circumstance added

Updated by Patrick Donnelly over 3 years ago

Has duplicate Bug #46675: nautilus: fs/upgrade test: Crash: 'wait_until_healthy' reached maximum tries (150) after waiting for 900 seconds added
