Bug #51077
MDSMonitor: crash when attempting to mount cephfs
Status: Closed
Description
I'm using ceph v16.2.4 deployed with cephadm/docker.
When I try mounting the cephfs from a client, all three monitor containers crash.
The filesystem was created with:
- ceph fs new bareos_backups bareos_backups_metadata bareos_backups_data --force (--force is needed because bareos_backups_data is an EC pool; see the note after these commands)
- ceph fs authorize bareos_backups client.bareos_backups / rw
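For context: CephFS can use an erasure-coded pool for data only if overwrites are enabled on that pool, which is why ceph fs new needs --force here. A plausible pool-setup sketch for the pools above (pool creation is not shown in the report; PG counts are placeholders):
- ceph osd pool create bareos_backups_metadata 32
- ceph osd pool create bareos_backups_data 32 erasure
- ceph osd pool set bareos_backups_data allow_ec_overwrites true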
The configuration of the Ubuntu 20.04.2 client:
Ceph versions:
# dpkg -l | grep ceph | awk '{ print $2," ",$3, " ",$4}'
ceph-common             15.2.11-0ubuntu0.20.04.2   amd64
libcephfs2              15.2.11-0ubuntu0.20.04.2   amd64
python3-ceph-argparse   15.2.11-0ubuntu0.20.04.2   amd64
python3-ceph-common     15.2.11-0ubuntu0.20.04.2   all
python3-cephfs          15.2.11-0ubuntu0.20.04.2   amd64
Fstab entry (I used only one of the monitor IPs for the test, but all of the monitors crash nevertheless; see the note after the entry):
100.90.1.13:/ /mnt/ceph ceph name=bareos_backups,secretfile=/etc/ceph.secret,noatime,_netdev 0 0
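For reference, and relevant to the root cause discussed later in this thread: the kernel client can be pinned to one filesystem by name with the mds_namespace mount option (newer kernels also accept fs=<name>). A hypothetical variant of the entry above:
100.90.1.13:/ /mnt/ceph ceph name=bareos_backups,mds_namespace=bareos_backups,secretfile=/etc/ceph.secret,noatime,_netdev 0 0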
When I attempt to run "mount /mnt/ceph" I see the following messages in the client dmesg:
[13436.808890] libceph: mon0 (1)100.90.1.13:6789 session established
[13436.810034] libceph: mon0 (1)100.90.1.13:6789 socket closed (con state OPEN)
[13436.810055] libceph: mon0 (1)100.90.1.13:6789 session lost, hunting for new mon
[13436.816322] libceph: mon2 (1)100.90.1.14:6789 session established
[13437.367487] libceph: mon2 (1)100.90.1.14:6789 socket closed (con state OPEN)
[13437.367520] libceph: mon2 (1)100.90.1.14:6789 session lost, hunting for new mon
[13437.389389] libceph: mon0 (1)100.90.1.12:6789 session established
[13438.129616] libceph: mon0 (1)100.90.1.12:6789 socket closed (con state OPEN)
[13438.129667] libceph: mon0 (1)100.90.1.12:6789 session lost, hunting for new mon
[13444.450124] libceph: mon2 (1)100.90.1.14:6789 socket closed (con state CONNECTING)
[13445.410163] libceph: mon2 (1)100.90.1.14:6789 socket closed (con state CONNECTING)
[13446.402105] libceph: mon2 (1)100.90.1.14:6789 socket closed (con state CONNECTING)
[13448.418115] libceph: mon2 (1)100.90.1.14:6789 socket closed (con state CONNECTING)
[13452.647592] libceph: mon2 (1)100.90.1.14:6789 session established
[13452.841658] libceph: mon2 (1)100.90.1.14:6789 socket closed (con state OPEN)
[13452.841694] libceph: mon2 (1)100.90.1.14:6789 session lost, hunting for new mon
[13452.848163] libceph: mon0 (1)100.90.1.12:6789 session established
[13453.139576] libceph: mon0 (1)100.90.1.12:6789 socket closed (con state OPEN)
[13453.139614] libceph: mon0 (1)100.90.1.12:6789 session lost, hunting for new mon
[13453.145211] libceph: mon1 (1)100.90.1.13:6789 session established
[13453.585151] libceph: mon1 (1)100.90.1.13:6789 socket closed (con state OPEN)
[13453.585185] libceph: mon1 (1)100.90.1.13:6789 session lost, hunting for new mon
[13453.586192] libceph: mon0 (1)100.90.1.12:6789 socket closed (con state CONNECTING)
[13454.402183] libceph: mon0 (1)100.90.1.12:6789 socket closed (con state CONNECTING)
[13455.426124] libceph: mon0 (1)100.90.1.12:6789 socket closed (con state CONNECTING)
[13457.410047] libceph: mon0 (1)100.90.1.12:6789 socket closed (con state CONNECTING)
[13461.601997] libceph: mon0 (1)100.90.1.12:6789 socket closed (con state CONNECTING)
[13465.447114] libceph: mon1 (1)100.90.1.13:6789 session established
[13465.624148] libceph: mon1 (1)100.90.1.13:6789 socket closed (con state OPEN)
[13465.624172] libceph: mon1 (1)100.90.1.13:6789 session lost, hunting for new mon
[13479.809892] libceph: mon2 (1)100.90.1.14:6789 session established
[13480.009943] libceph: mon2 (1)100.90.1.14:6789 socket closed (con state OPEN)
[13480.009989] libceph: mon2 (1)100.90.1.14:6789 session lost, hunting for new mon
[13486.020207] libceph: mon1 (1)100.90.1.13:6789 socket closed (con state OPEN)
[13496.928447] ceph: No mds server is up or the cluster is laggy
At the same time, all monitor containers crash with the following message:
debug 0> 2021-06-03T12:14:28.190+0000 7fb14dc17700 -1 *** Caught signal (Aborted) **
in thread 7fb14dc17700 thread_name:ms_dispatch

ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable)
1: /lib64/libpthread.so.0(+0x12b20) [0x7fb15928ab20]
2: gsignal()
3: abort()
4: /lib64/libstdc++.so.6(+0x9009b) [0x7fb1588a809b]
5: /lib64/libstdc++.so.6(+0x9653c) [0x7fb1588ae53c]
6: /lib64/libstdc++.so.6(+0x96597) [0x7fb1588ae597]
7: /lib64/libstdc++.so.6(+0x967f8) [0x7fb1588ae7f8]
8: /lib64/libstdc++.so.6(+0x92045) [0x7fb1588aa045]
9: /usr/bin/ceph-mon(+0x4d8da6) [0x559d953eada6]
10: (MDSMonitor::check_sub(Subscription*)+0x819) [0x559d953e1329]
11: (Monitor::handle_subscribe(boost::intrusive_ptr<MonOpRequest>)+0xcd8) [0x559d951d3258]
12: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x78d) [0x559d951f92ed]
13: (Monitor::_ms_dispatch(Message*)+0x670) [0x559d951fa910]
14: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x5c) [0x559d95228fdc]
15: (DispatchQueue::entry()+0x126a) [0x7fb15b9cab1a]
16: (DispatchQueue::DispatchThread::entry()+0x11) [0x7fb15ba7ab71]
17: /lib64/libpthread.so.0(+0x814a) [0x7fb15928014a]
18: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 rbd_pwl
0/ 5 journaler
0/ 5 objectcacher
0/ 5 immutable_obj_cache
0/ 5 client
1/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 0 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 1 reserver
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 rgw_sync
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 fuse
1/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
1/ 5 prioritycache
0/ 5 test
0/ 5 cephfs_mirror
0/ 5 cephsqlite
-2/-2 (syslog threshold)
99/99 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
140399414929152 / rstore_compact
140399431714560 / ms_dispatch
140399448499968 / rocksdb:dump_st
140399473678080 / msgr-worker-0
140399490463488 / ms_dispatch
140399532427008 / safe_timer
140399582783232 / rocksdb:high0
140399591175936 / rocksdb:low0
max_recent 10000
max_new 10000
log_file /var/lib/ceph/crash/2021-06-03T12:14:28.191940Z_c1fbea06-3d75-4053-9c28-24de6ab45fd5/log
--- end dump of recent events ---
I'd be glad to provide more info or perform other tests if it's needed.
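For anyone triaging a similar report: in Pacific the crash dumps written under /var/lib/ceph/crash are also registered with the mgr crash module, so they can be listed and inspected from any cluster host. The crash ID below is taken from the log_file path in the dump above:
- ceph crash ls
- ceph crash info 2021-06-03T12:14:28.191940Z_c1fbea06-3d75-4053-9c28-24de6ab45fd5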
Updated by Neha Ojha almost 3 years ago
- Project changed from RADOS to CephFS
- Category changed from Correctness/Safety to Correctness/Safety
Updated by Patrick Donnelly almost 3 years ago
- Subject changed from mon: crash when attempting to mount cephfs to MDSMonitor: crash when attempting to mount cephfs
- Description updated (diff)
- Component(FS) MDSMonitor added
- Labels (FS) crash added
Updated by Patrick Donnelly almost 3 years ago
- Status changed from New to Triaged
- Assignee set to Rishabh Dave
- Priority changed from Normal to Urgent
- Target version set to v17.0.0
Updated by Stanislav Datskevych almost 3 years ago
An update:
I seem to have found the reason for the issue.
I already had one CephFS, which was working fine.
Then I created the CephFS "bareos_backups" (the one mentioned in this issue), so it is the second filesystem in the cluster.
Today I tried using ceph-fuse, hoping it wouldn't trigger the crash, with a command like this:
- ceph-fuse -n client.bareos_server -client_fs bareos_backups /mnt/ceph
As expected, it crashed the monitors.
Then I thought maybe specifying the FS name explicitly would help, so I tried this command:
- ceph-fuse -n client.bareos_server --client_fs bareos_backups /mnt/ceph
And it mounted the FS without crashing the monitors.
Hope it helps.
Updated by Patrick Donnelly almost 3 years ago
Stanislav Datskevych wrote:
An update:
I seem to have found the reason for the issue.
I already had one CephFS, which was working fine.
Then I created the CephFS "bareos_backups" (the one mentioned in this issue), so it is the second filesystem in the cluster.
Today I tried using ceph-fuse, hoping it wouldn't trigger the crash, with a command like this:
- ceph-fuse -n client.bareos_server -client_fs bareos_backups /mnt/ceph
As expected, it crashed the monitors.
Then I thought maybe specifying the FS name explicitly would help, so I tried this command:
- ceph-fuse -n client.bareos_server --client_fs bareos_backups /mnt/ceph
And it mounted the FS without crashing the monitors.

Sorry, I'm not seeing the difference between the two commands except that "--client_fs" correctly has two hyphens in the second one (typo?). Can you explain?
Updated by Stanislav Datskevych almost 3 years ago
I'm sorry, I must have copy-pasted the same command twice.
The first command of course was:
ceph-fuse -n client.bareos_server /mnt/ceph
i.e. without specifying the FS.
When I created the first cephfs and mounted it on the servers, it worked and continues to work fine (probably because it was the only FS, so it was implicitly selected at mount time without triggering the crash).
Then I created the second FS (bareos_backups), and that started triggering the crash (probably because there are now two filesystems and the correct one can't be implicitly selected, or something like that).
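For reference: when a cluster has more than one filesystem, clients that don't name one get the cluster's default filesystem, which the operator can set explicitly. A sketch (the first filesystem's name is not given in this thread, so <first_fs_name> is a placeholder):
- ceph fs set-default <first_fs_name>
This only controls which FS is implicitly selected; the monitor crashing on an ambiguous subscription is still the actual bug here.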
Updated by Patrick Donnelly almost 3 years ago
- Status changed from Triaged to In Progress
- Assignee changed from Rishabh Dave to Patrick Donnelly
Updated by Patrick Donnelly almost 3 years ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 41899
Updated by Patrick Donnelly almost 3 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Patrick Donnelly almost 3 years ago
Thanks for the detailed notes! They were very helpful in tracking the bug down.
Updated by Backport Bot almost 3 years ago
- Copied to Backport #51286: pacific: MDSMonitor: crash when attempting to mount cephfs added
Updated by Loïc Dachary almost 3 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".