Bug #51077
MDSMonitor: crash when attempting to mount cephfs
Status: Closed
Description
I'm using ceph v16.2.4 deployed with cephadm/docker.
When I try mounting the cephfs from a client, all three monitor containers crash.
The filesystem was created with:
- ceph fs new bareos_backups bareos_backups_metadata bareos_backups_data --force (--force is needed because bareos_backups_data is an EC pool; see the note after these commands)
- ceph fs authorize bareos_backups client.bareos_backups / rw
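For context: CephFS can use an erasure-coded pool for data only if overwrites are enabled on that pool, which is why ceph fs new needs --force here. A plausible pool-setup sketch for the pools above (pool creation is not shown in the report; PG counts are placeholders):
- ceph osd pool create bareos_backups_metadata 32
- ceph osd pool create bareos_backups_data 32 erasure
- ceph osd pool set bareos_backups_data allow_ec_overwrites true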
The configuration of the Ubuntu 20.04.2 client:
Ceph versions:
# dpkg -l | grep ceph | awk '{ print $2," ",$3, " ",$4}'
ceph-common             15.2.11-0ubuntu0.20.04.2   amd64
libcephfs2              15.2.11-0ubuntu0.20.04.2   amd64
python3-ceph-argparse   15.2.11-0ubuntu0.20.04.2   amd64
python3-ceph-common     15.2.11-0ubuntu0.20.04.2   all
python3-cephfs          15.2.11-0ubuntu0.20.04.2   amd64
Fstab entry (I used only one of the monitor IPs for the test, but all of the monitors crash nevertheless; see the note after the entry):
100.90.1.13:/ /mnt/ceph ceph name=bareos_backups,secretfile=/etc/ceph.secret,noatime,_netdev 0 0
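For reference, and relevant to the root cause discussed later in this thread: the kernel client can be pinned to one filesystem by name with the mds_namespace mount option (newer kernels also accept fs=<name>). A hypothetical variant of the entry above:
100.90.1.13:/ /mnt/ceph ceph name=bareos_backups,mds_namespace=bareos_backups,secretfile=/etc/ceph.secret,noatime,_netdev 0 0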
When I attempt to run "mount /mnt/ceph" I see the following messages in the client dmesg:
[13436.808890] libceph: mon0 (1)100.90.1.13:6789 session established
[13436.810034] libceph: mon0 (1)100.90.1.13:6789 socket closed (con state OPEN)
[13436.810055] libceph: mon0 (1)100.90.1.13:6789 session lost, hunting for new mon
[13436.816322] libceph: mon2 (1)100.90.1.14:6789 session established
[13437.367487] libceph: mon2 (1)100.90.1.14:6789 socket closed (con state OPEN)
[13437.367520] libceph: mon2 (1)100.90.1.14:6789 session lost, hunting for new mon
[13437.389389] libceph: mon0 (1)100.90.1.12:6789 session established
[13438.129616] libceph: mon0 (1)100.90.1.12:6789 socket closed (con state OPEN)
[13438.129667] libceph: mon0 (1)100.90.1.12:6789 session lost, hunting for new mon
[13444.450124] libceph: mon2 (1)100.90.1.14:6789 socket closed (con state CONNECTING)
[13445.410163] libceph: mon2 (1)100.90.1.14:6789 socket closed (con state CONNECTING)
[13446.402105] libceph: mon2 (1)100.90.1.14:6789 socket closed (con state CONNECTING)
[13448.418115] libceph: mon2 (1)100.90.1.14:6789 socket closed (con state CONNECTING)
[13452.647592] libceph: mon2 (1)100.90.1.14:6789 session established
[13452.841658] libceph: mon2 (1)100.90.1.14:6789 socket closed (con state OPEN)
[13452.841694] libceph: mon2 (1)100.90.1.14:6789 session lost, hunting for new mon
[13452.848163] libceph: mon0 (1)100.90.1.12:6789 session established
[13453.139576] libceph: mon0 (1)100.90.1.12:6789 socket closed (con state OPEN)
[13453.139614] libceph: mon0 (1)100.90.1.12:6789 session lost, hunting for new mon
[13453.145211] libceph: mon1 (1)100.90.1.13:6789 session established
[13453.585151] libceph: mon1 (1)100.90.1.13:6789 socket closed (con state OPEN)
[13453.585185] libceph: mon1 (1)100.90.1.13:6789 session lost, hunting for new mon
[13453.586192] libceph: mon0 (1)100.90.1.12:6789 socket closed (con state CONNECTING)
[13454.402183] libceph: mon0 (1)100.90.1.12:6789 socket closed (con state CONNECTING)
[13455.426124] libceph: mon0 (1)100.90.1.12:6789 socket closed (con state CONNECTING)
[13457.410047] libceph: mon0 (1)100.90.1.12:6789 socket closed (con state CONNECTING)
[13461.601997] libceph: mon0 (1)100.90.1.12:6789 socket closed (con state CONNECTING)
[13465.447114] libceph: mon1 (1)100.90.1.13:6789 session established
[13465.624148] libceph: mon1 (1)100.90.1.13:6789 socket closed (con state OPEN)
[13465.624172] libceph: mon1 (1)100.90.1.13:6789 session lost, hunting for new mon
[13479.809892] libceph: mon2 (1)100.90.1.14:6789 session established
[13480.009943] libceph: mon2 (1)100.90.1.14:6789 socket closed (con state OPEN)
[13480.009989] libceph: mon2 (1)100.90.1.14:6789 session lost, hunting for new mon
[13486.020207] libceph: mon1 (1)100.90.1.13:6789 socket closed (con state OPEN)
[13496.928447] ceph: No mds server is up or the cluster is laggy
At the same time, all monitor containers crash with the following message:
debug 0> 2021-06-03T12:14:28.190+0000 7fb14dc17700 -1 *** Caught signal (Aborted) **
in thread 7fb14dc17700 thread_name:ms_dispatch

ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable)
1: /lib64/libpthread.so.0(+0x12b20) [0x7fb15928ab20]
2: gsignal()
3: abort()
4: /lib64/libstdc++.so.6(+0x9009b) [0x7fb1588a809b]
5: /lib64/libstdc++.so.6(+0x9653c) [0x7fb1588ae53c]
6: /lib64/libstdc++.so.6(+0x96597) [0x7fb1588ae597]
7: /lib64/libstdc++.so.6(+0x967f8) [0x7fb1588ae7f8]
8: /lib64/libstdc++.so.6(+0x92045) [0x7fb1588aa045]
9: /usr/bin/ceph-mon(+0x4d8da6) [0x559d953eada6]
10: (MDSMonitor::check_sub(Subscription*)+0x819) [0x559d953e1329]
11: (Monitor::handle_subscribe(boost::intrusive_ptr<MonOpRequest>)+0xcd8) [0x559d951d3258]
12: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x78d) [0x559d951f92ed]
13: (Monitor::_ms_dispatch(Message*)+0x670) [0x559d951fa910]
14: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x5c) [0x559d95228fdc]
15: (DispatchQueue::entry()+0x126a) [0x7fb15b9cab1a]
16: (DispatchQueue::DispatchThread::entry()+0x11) [0x7fb15ba7ab71]
17: /lib64/libpthread.so.0(+0x814a) [0x7fb15928014a]
18: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 rbd_pwl
0/ 5 journaler
0/ 5 objectcacher
0/ 5 immutable_obj_cache
0/ 5 client
1/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 0 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 1 reserver
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 rgw_sync
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 fuse
1/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
1/ 5 prioritycache
0/ 5 test
0/ 5 cephfs_mirror
0/ 5 cephsqlite
-2/-2 (syslog threshold)
99/99 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
140399414929152 / rstore_compact
140399431714560 / ms_dispatch
140399448499968 / rocksdb:dump_st
140399473678080 / msgr-worker-0
140399490463488 / ms_dispatch
140399532427008 / safe_timer
140399582783232 / rocksdb:high0
140399591175936 / rocksdb:low0
max_recent 10000
max_new 10000
log_file /var/lib/ceph/crash/2021-06-03T12:14:28.191940Z_c1fbea06-3d75-4053-9c28-24de6ab45fd5/log
--- end dump of recent events ---
I'd be glad to provide more info or perform other tests if it's needed.
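For anyone triaging a similar report: in Pacific the crash dumps written under /var/lib/ceph/crash are also registered with the mgr crash module, so they can be listed and inspected from any cluster host. The crash ID below is taken from the log_file path in the dump above:
- ceph crash ls
- ceph crash info 2021-06-03T12:14:28.191940Z_c1fbea06-3d75-4053-9c28-24de6ab45fd5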
Updated by Neha Ojha almost 3 years ago
- Project changed from RADOS to CephFS
- Category changed from Correctness/Safety to Correctness/Safety
Updated by Patrick Donnelly almost 3 years ago
- Subject changed from mon: crash when attempting to mount cephfs to MDSMonitor: crash when attempting to mount cephfs
- Description updated (diff)
- Component(FS) MDSMonitor added
- Labels (FS) crash added
Updated by Patrick Donnelly almost 3 years ago
- Status changed from New to Triaged
- Assignee set to Rishabh Dave
- Priority changed from Normal to Urgent
- Target version set to v17.0.0
Updated by Stanislav Datskevych almost 3 years ago
An update:
I seem to have found the reason for the issue.
I already had one CephFS, which was working fine.
Then I created the CephFS "bareos_backups" (the one mentioned in this issue), so it is the second filesystem in the cluster.
Today I tried using ceph-fuse, hoping it wouldn't trigger the crash, with a command like this:
- ceph-fuse -n client.bareos_server -client_fs bareos_backups /mnt/ceph
As expected, it crashed the monitors.
Then I thought maybe specifying the FS name explicitly would help, so I tried this command:
- ceph-fuse -n client.bareos_server --client_fs bareos_backups /mnt/ceph
And it mounted the FS without crashing the monitors.
Hope it helps.
Updated by Patrick Donnelly almost 3 years ago
Stanislav Datskevych wrote:
An update:
I seem to have found the reason for the issue.
I already had one CephFS, which was working fine.
Then I created the CephFS "bareos_backups" (the one mentioned in this issue), so it is the second filesystem in the cluster.
Today I tried using ceph-fuse, hoping it wouldn't trigger the crash, with a command like this:
- ceph-fuse -n client.bareos_server -client_fs bareos_backups /mnt/ceph
As expected, it crashed the monitors.
Then I thought maybe specifying the FS name explicitly would help, so I tried this command:
- ceph-fuse -n client.bareos_server --client_fs bareos_backups /mnt/ceph
And it mounted the FS without crashing the monitors.

Sorry, I'm not seeing the difference between the two commands except that "--client_fs" correctly has two hyphens in the second one (typo?). Can you explain?
Updated by Stanislav Datskevych almost 3 years ago
I'm sorry, I must have copy-pasted the same command twice.
The first command of course was:
ceph-fuse -n client.bareos_server /mnt/ceph
i.e. without specifying the FS.
When I created the first cephfs and mounted it on the servers, it worked and continues to work fine (probably because it was the only FS, so it was implicitly selected at mount time without triggering the crash).
Then I created the second FS (bareos_backups), and that started triggering the crash (probably because there are now two filesystems and the correct one can't be implicitly selected, or something like that).
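For reference: when a cluster has more than one filesystem, clients that don't name one get the cluster's default filesystem, which the operator can set explicitly. A sketch (the first filesystem's name is not given in this thread, so <first_fs_name> is a placeholder):
- ceph fs set-default <first_fs_name>
This only controls which FS is implicitly selected; the monitor crashing on an ambiguous subscription is still the actual bug here.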
Updated by Patrick Donnelly almost 3 years ago
- Status changed from Triaged to In Progress
- Assignee changed from Rishabh Dave to Patrick Donnelly
Updated by Patrick Donnelly almost 3 years ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 41899
Updated by Patrick Donnelly almost 3 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Patrick Donnelly almost 3 years ago
Thanks for the detailed notes! They were very helpful in tracking the bug down.
Updated by Backport Bot almost 3 years ago
- Copied to Backport #51286: pacific: MDSMonitor: crash when attempting to mount cephfs added
Updated by Loïc Dachary almost 3 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".