Bug #51077 (closed)

MDSMonitor: crash when attempting to mount cephfs

Added by Stanislav Datskevych almost 3 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: Urgent
Category: Correctness/Safety
Target version: v17.0.0
% Done: 0%
Source: Community (user)
Tags:
Backport: pacific
Regression: No
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDSMonitor
Labels (FS): crash
Pull request ID: 41899
Crash signature (v1):
Crash signature (v2):

Description

I'm using ceph v16.2.4 deployed with cephadm/docker.
When I try mounting the cephfs from a client, all 3 monitor containers crash.

The cephfs (and the client for it) were created using the following commands:
  1. ceph fs new bareos_backups bareos_backups_metadata bareos_backups_data --force (--force is needed because the pool bareos_backups_data is erasure-coded; see the note after this list)
  2. ceph fs authorize bareos_backups client.bareos_backups / rw
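
(A side note on --force, as referenced above: CephFS refuses an erasure-coded pool as the default data pool unless forced, and as far as I recall an EC data pool also needs overwrites enabled beforehand; something along the lines of the command below, shown only as background and not verified as part of this report.)

ceph osd pool set bareos_backups_data allow_ec_overwrites true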

The configuration of the Ubuntu 20.04.2 client:
Ceph versions:

# dpkg -l | grep ceph | awk '{ print $2," ",$3, " ",$4}'
ceph-common   15.2.11-0ubuntu0.20.04.2   amd64
libcephfs2   15.2.11-0ubuntu0.20.04.2   amd64
python3-ceph-argparse   15.2.11-0ubuntu0.20.04.2   amd64
python3-ceph-common   15.2.11-0ubuntu0.20.04.2   all
python3-cephfs   15.2.11-0ubuntu0.20.04.2   amd64

Fstab (I only used one of the monitor IPs for the test, but all of the monitors crash nevertheless):

100.90.1.13:/ /mnt/ceph ceph name=bareos_backups,secretfile=/etc/ceph.secret,noatime,_netdev 0 0

When I attempt to run "mount /mnt/ceph" I see the following messages in the client dmesg:

[13436.808890] libceph: mon0 (1)100.90.1.13:6789 session established
[13436.810034] libceph: mon0 (1)100.90.1.13:6789 socket closed (con state OPEN)
[13436.810055] libceph: mon0 (1)100.90.1.13:6789 session lost, hunting for new mon
[13436.816322] libceph: mon2 (1)100.90.1.14:6789 session established
[13437.367487] libceph: mon2 (1)100.90.1.14:6789 socket closed (con state OPEN)
[13437.367520] libceph: mon2 (1)100.90.1.14:6789 session lost, hunting for new mon
[13437.389389] libceph: mon0 (1)100.90.1.12:6789 session established
[13438.129616] libceph: mon0 (1)100.90.1.12:6789 socket closed (con state OPEN)
[13438.129667] libceph: mon0 (1)100.90.1.12:6789 session lost, hunting for new mon
[13444.450124] libceph: mon2 (1)100.90.1.14:6789 socket closed (con state CONNECTING)
[13445.410163] libceph: mon2 (1)100.90.1.14:6789 socket closed (con state CONNECTING)
[13446.402105] libceph: mon2 (1)100.90.1.14:6789 socket closed (con state CONNECTING)
[13448.418115] libceph: mon2 (1)100.90.1.14:6789 socket closed (con state CONNECTING)
[13452.647592] libceph: mon2 (1)100.90.1.14:6789 session established
[13452.841658] libceph: mon2 (1)100.90.1.14:6789 socket closed (con state OPEN)
[13452.841694] libceph: mon2 (1)100.90.1.14:6789 session lost, hunting for new mon
[13452.848163] libceph: mon0 (1)100.90.1.12:6789 session established
[13453.139576] libceph: mon0 (1)100.90.1.12:6789 socket closed (con state OPEN)
[13453.139614] libceph: mon0 (1)100.90.1.12:6789 session lost, hunting for new mon
[13453.145211] libceph: mon1 (1)100.90.1.13:6789 session established
[13453.585151] libceph: mon1 (1)100.90.1.13:6789 socket closed (con state OPEN)
[13453.585185] libceph: mon1 (1)100.90.1.13:6789 session lost, hunting for new mon
[13453.586192] libceph: mon0 (1)100.90.1.12:6789 socket closed (con state CONNECTING)
[13454.402183] libceph: mon0 (1)100.90.1.12:6789 socket closed (con state CONNECTING)
[13455.426124] libceph: mon0 (1)100.90.1.12:6789 socket closed (con state CONNECTING)
[13457.410047] libceph: mon0 (1)100.90.1.12:6789 socket closed (con state CONNECTING)
[13461.601997] libceph: mon0 (1)100.90.1.12:6789 socket closed (con state CONNECTING)
[13465.447114] libceph: mon1 (1)100.90.1.13:6789 session established
[13465.624148] libceph: mon1 (1)100.90.1.13:6789 socket closed (con state OPEN)
[13465.624172] libceph: mon1 (1)100.90.1.13:6789 session lost, hunting for new mon
[13479.809892] libceph: mon2 (1)100.90.1.14:6789 session established
[13480.009943] libceph: mon2 (1)100.90.1.14:6789 socket closed (con state OPEN)
[13480.009989] libceph: mon2 (1)100.90.1.14:6789 session lost, hunting for new mon
[13486.020207] libceph: mon1 (1)100.90.1.13:6789 socket closed (con state OPEN)
[13496.928447] ceph: No mds server is up or the cluster is laggy

At the same time, all monitor containers crash with the following message:

debug      0> 2021-06-03T12:14:28.190+0000 7fb14dc17700 -1 *** Caught signal (Aborted) **
 in thread 7fb14dc17700 thread_name:ms_dispatch

 ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12b20) [0x7fb15928ab20]
 2: gsignal()
 3: abort()
 4: /lib64/libstdc++.so.6(+0x9009b) [0x7fb1588a809b]
 5: /lib64/libstdc++.so.6(+0x9653c) [0x7fb1588ae53c]
 6: /lib64/libstdc++.so.6(+0x96597) [0x7fb1588ae597]
 7: /lib64/libstdc++.so.6(+0x967f8) [0x7fb1588ae7f8]
 8: /lib64/libstdc++.so.6(+0x92045) [0x7fb1588aa045]
 9: /usr/bin/ceph-mon(+0x4d8da6) [0x559d953eada6]
 10: (MDSMonitor::check_sub(Subscription*)+0x819) [0x559d953e1329]
 11: (Monitor::handle_subscribe(boost::intrusive_ptr<MonOpRequest>)+0xcd8) [0x559d951d3258]
 12: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x78d) [0x559d951f92ed]
 13: (Monitor::_ms_dispatch(Message*)+0x670) [0x559d951fa910]
 14: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x5c) [0x559d95228fdc]
 15: (DispatchQueue::entry()+0x126a) [0x7fb15b9cab1a]
 16: (DispatchQueue::DispatchThread::entry()+0x11) [0x7fb15ba7ab71]
 17: /lib64/libpthread.so.0(+0x814a) [0x7fb15928014a]
 18: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 rbd_pwl
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 immutable_obj_cache
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 0 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 rgw_sync
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
   1/ 5 prioritycache
   0/ 5 test
   0/ 5 cephfs_mirror
   0/ 5 cephsqlite
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
  140399414929152 / rstore_compact
  140399431714560 / ms_dispatch
  140399448499968 / rocksdb:dump_st
  140399473678080 / msgr-worker-0
  140399490463488 / ms_dispatch
  140399532427008 / safe_timer
  140399582783232 / rocksdb:high0
  140399591175936 / rocksdb:low0
  max_recent     10000
  max_new        10000
  log_file /var/lib/ceph/crash/2021-06-03T12:14:28.191940Z_c1fbea06-3d75-4053-9c28-24de6ab45fd5/log
--- end dump of recent events ---
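
To illustrate what the backtrace shape suggests (frames 4-8 look like libstdc++ exception/terminate machinery sitting between MDSMonitor::check_sub and abort()), here is a minimal standalone C++ sketch, not Ceph code, of how an uncaught exception from an out-of-range map lookup in a dispatch-style handler ends in SIGABRT. All names in it are hypothetical stand-ins.

#include <cstdint>
#include <iostream>
#include <map>
#include <string>

struct Filesystem { std::string name; };

// Hypothetical stand-in for the monitor's filesystem-id -> filesystem table.
std::map<int64_t, Filesystem> filesystems = { {1, {"first_fs"}} };

// Hypothetical stand-in for a subscription handler: look up the filesystem a
// client subscribed to. std::map::at() throws std::out_of_range for an
// unknown id, and nothing here catches it.
void check_sub(int64_t fscid) {
    const Filesystem& fs = filesystems.at(fscid);  // throws if fscid is unknown
    std::cout << "sending map for " << fs.name << "\n";
}

int main() {
    check_sub(1);   // fine: the id exists
    check_sub(42);  // unknown id: the exception propagates out of main(),
                    // std::terminate() runs, and the process aborts (SIGABRT)
}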

I'd be glad to provide more info or perform other tests if it's needed.


Related issues: 1 (0 open, 1 closed)

Copied to: CephFS - Backport #51286: pacific: MDSMonitor: crash when attempting to mount cephfs (Resolved, Patrick Donnelly)
#1

Updated by Neha Ojha almost 3 years ago

  • Project changed from RADOS to CephFS
  • Category changed from Correctness/Safety to Correctness/Safety
#2

Updated by Patrick Donnelly almost 3 years ago

  • Subject changed from mon: crash when attempting to mount cephfs to MDSMonitor: crash when attempting to mount cephfs
  • Description updated (diff)
  • Component(FS) MDSMonitor added
  • Labels (FS) crash added
#3

Updated by Patrick Donnelly almost 3 years ago

  • Status changed from New to Triaged
  • Assignee set to Rishabh Dave
  • Priority changed from Normal to Urgent
  • Target version set to v17.0.0
#4

Updated by Stanislav Datskevych almost 3 years ago

An update:

I seem to have found the reason for the issue:

I already had one CephFS, which was working fine.
Then I created the CephFS "bareos_backups" (the one mentioned in this issue), so it is the second filesystem in the cluster.

Today I tried using ceph-fuse, hoping it wouldn't trigger the crash, with a command like this:
  1. ceph-fuse -n client.bareos_server -client_fs bareos_backups /mnt/ceph
    As expected, it crashed the monitors.
Then I thought that maybe specifying the FS name explicitly would help, so I tried this command:
  1. ceph-fuse -n client.bareos_server --client_fs bareos_backups /mnt/ceph

And it mounted the FS without crashing the monitors.

Hope it helps

#5

Updated by Patrick Donnelly almost 3 years ago

Stanislav Datskevych wrote:

An update:

I seem to have found the reason for the issue:

I already had one CephFS, which was working fine.
Then I created the CephFS "bareos_backups" (the one mentioned in this issue), so it is the second filesystem in the cluster.

Today I tried using ceph-fuse, hoping it wouldn't trigger the crash, with a command like this:
  1. ceph-fuse -n client.bareos_server -client_fs bareos_backups /mnt/ceph
    As expected, it crashed the monitors.
Then I thought that maybe specifying the FS name explicitly would help, so I tried this command:
  1. ceph-fuse -n client.bareos_server --client_fs bareos_backups /mnt/ceph

Sorry, I'm not seeing the difference between the two commands, except that "--client_fs" correctly has two hyphens in the second command (typo?). Can you explain?

#6

Updated by Stanislav Datskevych almost 3 years ago

I'm sorry, I must have copy-pasted the same command twice.
The first command of course was:
ceph-fuse -n client.bareos_server /mnt/ceph
i.e. without specifying the FS.

When I created the first CephFS and mounted it on servers, it worked and continues to work fine (probably because it was the only FS, so it was implicitly selected for mounting without triggering the crash).
Then I created the second FS (bareos_backups), and that started triggering the crash (probably because there are now two filesystems and the correct one can't be implicitly selected, or something like that).
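
For anyone hitting the same thing with the kernel client from the fstab above: the filesystem can presumably be named explicitly there as well; I believe the mount option is mds_namespace= (stating that as an assumption rather than something tested for this report), e.g.:

100.90.1.13:/ /mnt/ceph ceph name=bareos_backups,mds_namespace=bareos_backups,secretfile=/etc/ceph.secret,noatime,_netdev 0 0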

#7

Updated by Patrick Donnelly almost 3 years ago

  • Status changed from Triaged to In Progress
  • Assignee changed from Rishabh Dave to Patrick Donnelly
#8

Updated by Patrick Donnelly almost 3 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 41899
#9

Updated by Patrick Donnelly almost 3 years ago

  • Backport set to pacific
#10

Updated by Patrick Donnelly almost 3 years ago

  • Status changed from Fix Under Review to Pending Backport
#11

Updated by Patrick Donnelly almost 3 years ago

Thanks for the detailed notes! They were very helpful in tracking the bug down.

#12

Updated by Backport Bot almost 3 years ago

  • Copied to Backport #51286: pacific: MDSMonitor: crash when attempting to mount cephfs added
#13

Updated by Loïc Dachary almost 3 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
