Bug #56529

ceph-fs crashes on getfattr

Added by Frank Schilder over 1 year ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
% Done:

100%

Source:
Community (user)
Tags:
Backport:
quincy,pacific
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Client, MDS
Labels (FS):
crash
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

From https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/GCZ3F3ONVA2YIR7DJNQJFG53Y4DWQABN/

We made a very weird observation on our ceph test cluster today. A simple getfattr with a misspelled attribute name sends the MDS cluster into a crash+restart loop. Something as simple as

getfattr -n ceph.dir.layout.po /mnt/cephfs

kills a ceph-fs completely. The problem can be resolved if one executes a "umount -f /mnt/cephfs" on the host where the getfattr was executed. The MDS daemons need a restart. One might also need to clear the OSD blacklist.

We observe this with a kernel client on 5.18.6-1.el7.elrepo.x86_64 (CentOS 7) against mimic, and I'm absolutely sure I have not seen this problem with mimic on earlier 5.9.x kernel versions.

Continuation:

Also removing a data pool fails:

# setfattr -x ceph.dir.layout /mnt/cephfs
setfattr: /mnt/cephfs: Invalid argument

In addition, we observed that after

- creating a file and directory on the default data pool,
- setting the data pool on "/" to a secondary data pool fs-data

an "rm -rf /mnt/cephfs/*" followed by an "ls /mnt/cephfs" would hang and blocked ops warnings showed up in the log. Handling these vattribs seems seriously broken in newer kernel versions.

I attached relevant sections of the mds log and /var/log/messages. They should contain some restarts.

mds-log.zip (41.4 KB) Frank Schilder, 07/12/2022 10:58 AM

messages.zip (271 KB) Frank Schilder, 07/12/2022 10:58 AM


Related issues

Related to CephFS - Bug #56522: Do not abort MDS on unknown messages Resolved
Copied to CephFS - Backport #57239: pacific: ceph-fs crashes on getfattr Resolved
Copied to CephFS - Backport #57240: quincy: ceph-fs crashes on getfattr Resolved

History

#1 Updated by Xiubo Li over 1 year ago

  • Assignee set to Xiubo Li

Will work on it.

#2 Updated by Xiubo Li over 1 year ago

Tried Pacific and Quincy ceph with the latest upstream kernel; I couldn't reproduce this. I am sure I also tested getfattr like this months earlier and didn't see the issue either.

Mimic is end of life: https://docs.ceph.com/en/quincy/releases/index.html.

Could you try a newer ceph version?

#3 Updated by Xiubo Li over 1 year ago

  • Status changed from New to Need More Info

#4 Updated by Frank Schilder over 1 year ago

Quoting Gregory Farnum in the conversation on the ceph-user list:

That obviously shouldn't happen. Please file a tracker ticket.

There's been a fair bit of churn in how we handle the "vxattrs" so my
guess is an incompatibility got introduced between newer clients and
the old server implementation, but obviously we want it to work and we
especially shouldn't be crashing the MDS. Skimming through it I'm
actually not seeing what a client could do in that path to crash the
server so I'm a bit confused...
Oh. I think I see it now, but I'd like to confirm. Yeah, please make
that tracker ticket and attach the backtrace you get.
Thanks,
-Greg

This is not a "mimic is EOL - case solved" kind of situation, but one about broken yet required backward compatibility. Unless it is a new policy that a yum update can and should rig ceph storage clusters, I would really appreciate it if the recent inflationary use of the phrase "XYZ is EOL" returned to acceptable levels and the focus on "rock solid ceph" increased again to pre RH-IBM merger times.

Best regards,
Frank

#5 Updated by Xiubo Li over 1 year ago

Frank Schilder wrote:

Quoting Gregory Farnum in the conversation on the ceph-user list:

That obviously shouldn't happen. Please file a tracker ticket.

There's been a fair bit of churn in how we handle the "vxattrs" so my
guess is an incompatibility got introduced between newer clients and
the old server implementation, but obviously we want it to work and we
especially shouldn't be crashing the MDS. Skimming through it I'm
actually not seeing what a client could do in that path to crash the
server so I'm a bit confused...
Oh. I think I see it now, but I'd like to confirm. Yeah, please make
that tracker ticket and attach the backtrace you get.
Thanks,
-Greg

This is not a "mimic is EOL - case solved" kind of situation, but one about broken yet required backward compatibility. Unless it is a new policy that a yum update can and should rig ceph storage clusters, I would really appreciate it if the recent inflationary use of the phrase "XYZ is EOL" returned to acceptable levels and the focus on "rock solid ceph" increased again to pre RH-IBM merger times.

Best regards,
Frank

Missed that thread. Okay, I will look into this tomorrow and check which commit introduced it.

I don't see any test case that exercises this kind of backward compatibility.

#6 Updated by Xiubo Li over 1 year ago

  • Status changed from Need More Info to In Progress

#7 Updated by Xiubo Li over 1 year ago

-1166> 2022-07-11 16:47:30.761 7f690d68b700  1 mds.0.server  unknown client op 262
 -1166> 2022-07-11 16:47:30.762 7f690d68b700 -1 *** Caught signal (Aborted) **
 in thread 7f690d68b700 thread_name:ms_dispatch

 ceph version 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable)
 1: (()+0xf630) [0x7f6914910630]
 2: (gsignal()+0x37) [0x7f691391b387]
 3: (abort()+0x148) [0x7f691391ca78]
 4: (Server::perf_gather_op_latency(MClientRequest const*, utime_t)+0x5f) [0x56358214a30f]
 5: (Server::reply_client_request(boost::intrusive_ptr<MDRequestImpl>&, MClientReply*)+0x221) [0x56358215d331]
 6: (Server::handle_client_request(MClientRequest*)+0x49e) [0x56358219b13e]
 7: (Server::dispatch(Message*)+0x2db) [0x56358219ee3b]
 8: (MDSRank::handle_deferrable_message(Message*)+0x434) [0x56358210f994]
 9: (MDSRank::_dispatch(Message*, bool)+0x88b) [0x563582126b8b]
 10: (MDSRankDispatcher::ms_dispatch(Message*)+0xa0) [0x5635821271f0]
 11: (MDSDaemon::ms_dispatch(Message*)+0xc3) [0x5635821077f3]
 12: (DispatchQueue::entry()+0xb7a) [0x7f6916d963ba]
 13: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f6916e3462d]
 14: (()+0x7ea5) [0x7f6914908ea5]
 15: (clone()+0x6d) [0x7f69139e38dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

The 262 op is:

CEPH_MDS_OP_GETVXATTR  = 0x00106,

which is not supported in Mimic.

The clients should check the MDS version (feature bits) before sending such requests. This should be easy to fix.
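
As a rough illustration of that client-side check, here is a minimal standalone C++ sketch. It is not the actual kernel or libcephfs code; the feature-bit value and the helper names are made-up placeholders, only the op number 0x00106 comes from this tracker. The idea is that the client sends CEPH_MDS_OP_GETVXATTR only when the MDS advertised support for it at session open, and otherwise reports "not supported" locally instead of sending an op the old MDS cannot handle:

  // Standalone sketch, NOT the real Ceph client code.
  #include <cstdint>
  #include <cstdio>

  constexpr uint32_t CEPH_MDS_OP_GETVXATTR = 0x00106;    // op that crashed the Mimic MDS (262 decimal)
  constexpr uint64_t FEATURE_OP_GETVXATTR  = 1ULL << 20; // hypothetical feature-bit position

  // 'mds_features' would come from the feature mask the MDS advertises at session open.
  bool mds_supports_getvxattr(uint64_t mds_features) {
    return (mds_features & FEATURE_OP_GETVXATTR) != 0;
  }

  int getvxattr(uint64_t mds_features, const char *name) {
    if (!mds_supports_getvxattr(mds_features)) {
      // Old MDS (e.g. Mimic/Octopus): do not send the unknown op at all.
      fprintf(stderr, "MDS lacks getvxattr support, not querying %s\n", name);
      return -95; // -EOPNOTSUPP
    }
    printf("sending CEPH_MDS_OP_GETVXATTR (0x%x) for %s\n", CEPH_MDS_OP_GETVXATTR, name);
    return 0;
  }

  int main() {
    getvxattr(0, "ceph.dir.layout.po");                      // old MDS: rejected locally
    getvxattr(FEATURE_OP_GETVXATTR, "ceph.dir.layout.pool"); // new MDS: op is sent
    return 0;
  }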

#8 Updated by Xiubo Li over 1 year ago

It was introduced by:

commit 6ddf5f165f13ab623d04aee2a473d35818255199
Author: Milind Changire <milindchangire@gmail.com>
Date:   Mon Feb 14 05:01:01 2022 +0000

    ceph: add getvxattr op

    Problem:
    Some directory vxattrs (e.g. ceph.dir.pin.random) are governed by
    information that isn't necessarily shared with the client. Add support
    for the new GETVXATTR operation, which allows the client to query the
    MDS directly for vxattrs.
    When the client is queried for a vxattr that doesn't have a special
    handler, have it issue a GETVXATTR to the MDS directly.

    Solution:
    Adds new getvxattr op to fetch ceph.dir.pin*, ceph.dir.layout* and
    ceph.file.layout* vxattrs.
    If the entire layout for a dir or a file is being set, then it is
    expected that the layout be set in standard JSON format. Individual
    field value retrieval is not wrapped in JSON. The JSON format also
    applies while setting the vxattr if the entire layout is being set in
    one go.
    As a temporary measure, setting a vxattr can also be done in the old
    format. The old format will be deprecated in the future.

    URL: https://tracker.ceph.com/issues/51062
    Signed-off-by: Milind Changire <mchangir@redhat.com>
    Reviewed-by: Jeff Layton <jlayton@kernel.org>
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
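
For completeness, this is roughly how the vxattr interface described above is exercised from userspace through libcephfs. A hedged sketch, assuming a reachable cluster, a ceph.conf/keyring in the default locations, an "admin" client id, and that a layout is set on the queried directory; on new enough clients and MDS this kind of call is what can end up as a GETVXATTR request on the wire:

  // Hedged libcephfs example: read the ceph.dir.layout.pool vxattr of the fs root.
  // Build with something like: g++ getvxattr_demo.cpp -lcephfs
  #include <cephfs/libcephfs.h>
  #include <cstdio>

  int main() {
    struct ceph_mount_info *cmount = nullptr;
    if (ceph_create(&cmount, "admin") != 0) {   // "admin" is an assumed client id
      fprintf(stderr, "ceph_create failed\n");
      return 1;
    }
    ceph_conf_read_file(cmount, nullptr);       // read ceph.conf from the default paths
    if (ceph_mount(cmount, "/") != 0) {         // mount the root of the file system
      fprintf(stderr, "ceph_mount failed\n");
      ceph_release(cmount);
      return 1;
    }

    char value[128] = {0};
    int r = ceph_getxattr(cmount, "/", "ceph.dir.layout.pool", value, sizeof(value) - 1);
    if (r >= 0)
      printf("ceph.dir.layout.pool = %.*s\n", r, value);
    else
      fprintf(stderr, "getxattr failed: %d\n", r); // e.g. -ENODATA if no layout is set

    ceph_unmount(cmount);
    ceph_release(cmount);
    return 0;
  }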

#9 Updated by Xiubo Li over 1 year ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 47063

Added a CEPHFS_FEATURE_OP_GETVXATTR feature bit on the MDS side and fixed libcephfs accordingly in PR#47063. Will send a patch to fix the kclient.
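
The complementary server-side hardening (tracked in the related Bug #56522, "Do not abort MDS on unknown messages") is for the dispatcher to reject an unrecognised op instead of aborting. A minimal standalone sketch of that idea, not the actual MDS code; the getattr op value is assumed for illustration:

  // Standalone sketch, not the real MDS dispatcher: an unknown client op is
  // logged and answered with an error instead of taking the daemon down.
  #include <cstdint>
  #include <cstdio>

  constexpr uint32_t OP_GETATTR   = 0x00101; // a known op (value assumed for illustration)
  constexpr uint32_t OP_GETVXATTR = 0x00106; // the op newer clients send (262 decimal)

  int dispatch_client_op(uint32_t op) {
    switch (op) {
    case OP_GETATTR:
      // ... handle the known request normally ...
      return 0;
    default:
      // Unknown op: reject it gracefully. The Mimic/Octopus MDS instead fell
      // through to perf_gather_op_latency() and called abort().
      fprintf(stderr, "unknown client op %u, replying -EOPNOTSUPP\n", op);
      return -95; // -EOPNOTSUPP
    }
  }

  int main() {
    printf("getattr   -> %d\n", dispatch_client_op(OP_GETATTR));
    printf("getvxattr -> %d\n", dispatch_client_op(OP_GETVXATTR));
    return 0;
  }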

#10 Updated by Xiubo Li over 1 year ago

  • Backport set to quincy,pacific

#11 Updated by Frank Schilder over 1 year ago

Dear Xiubo Li, thanks for tracking this down so fast! Would be great if you could indicate here when an updated kclient is available for Centos7/8.

I would like to mention that there are many old ceph clusters running. For example, we have a user at our institution running a kraken cluster. There are two promises about compatibility in ceph that I know of:

1. One can always upgrade ceph in steps of 1 or 2 major releases from any earlier release (stated explicitly). OS (kernel) updates may need to be included at important points.
2. One can install a ceph version with an OS release (kernel series) and install security patches for the lifetime of the OS release (implied) without upgrading ceph. For example, being able to do a yum update on a kraken/luminous/mimic/nautilus cluster on CentOS 7 without losing the cluster.

Removing the second assertion would open a huge can of worms that could severely impact the faith of all users in the reliability of ceph. The second point is particularly important, because the ceph version rotation (release cadence) is much faster than the average storage admin would like it to be. It is quite typical to run a long time on a fixed version and only do a sprint of upgrades when it really makes sense, for example, when there is this one killer feature that one really wants or an OS distribution reaches end of life.

I would expect a severe loss of faith of the user community if the ability to do such a sprint upgrade goes away because of a missing link in backward compatibility.

Therefore, as you mention yourself, including sufficiently long windows of backward compatibility for clients (adjusted to the lifetime of an OS distribution) might be a very valuable addition to the test suites. For example, clients of a 10-year LTS OS distribution will need to support compatibility with 5-6 major ceph versions.

Best regards,
Frank

#12 Updated by Patrick Donnelly over 1 year ago

  • Related to Bug #56522: Do not abort MDS on unknown messages added

#13 Updated by Venky Shankar over 1 year ago

Just for completeness -- commit 2f4060b8c41004d10d9a64676ccd847f6e1304dd is the (mds side) fix for this.

#14 Updated by Venky Shankar over 1 year ago

FWIW - we need to get this going: https://tracker.ceph.com/issues/53573.

The question is - how far back in release cycles should we test newer clients against?

#15 Updated by Xiubo Li over 1 year ago

Frank Schilder wrote:

Dear Xiubo Li, thanks for tracking this down so fast! Would be great if you could indicate here when an updated kclient is available for Centos7/8.

Hi Frank,

We are still discussing the best approach to fix this and similar issues. Once that is done, the fix patches should go into the kernel mainline in 1~2 weeks, and then take around 2 weeks to go into RHEL downstream. This is what we can do.

When it will be available for CentOS after that, I have no idea.

#16 Updated by Patrick Donnelly over 1 year ago

  • Category set to Correctness/Safety
  • Target version set to v18.0.0
  • Source set to Community (user)
  • Component(FS) Client, MDS added

#17 Updated by Frank Schilder over 1 year ago

Xiubo Li wrote:

We are still discussing to find a best approach to fix this or similar issues ...

Since my comment #56529#note-11 above is in this tracker item, I post here a thought that might be relevant for #53573. The main complication with maintaining backward compatibility is that it sits at an overlap between application developers and OS distro maintainers. Just because a distro decides to have a 100-year LTS version cannot imply that all affected application devs now need to maintain 100 years of backward compatibility. In an ideal world, maintaining an application and maintaining backward compatibility (life cycle management) would be orthogonal.

A concept that could handle this is to separate a kernel client into 2 (or more) components, (1) the kclient-latest that provides the most recent API implementation with a reasonable compatibility window (say, 2 major ceph versions) and (2) appropriate small long-term compatibility kernel modules that hook into the API when needed (and handle only the diffs/translations in the API calls to the user and/or back-end servers). If such a separation is technically possible, maybe even only for newer versions with a transition period, then the app devs take care of kclient-latest and OS maintainers are in charge of keeping the compat-suite working for their choice of an LTS window. The compat-suite should cover every major ceph version.

How this could look to a user is that kclient-latest always checks if it supports the ceph server version it is asked to connect to and, if it doesn't know it, tries to load a compat-module. On the command line, a mount would either succeed or fail with something like "error: could not load kernel module ceph-compat-mimic", for example. This would prevent the kind of accidents we have seen here and is also quite explicit about what needs to be done to make it work.

I don't know if it is possible to design ceph clients in that way, but it would solve the issue that OS maintainers can choose a support window (almost) independently of the app devs (distro life cycle management and app development only overlap on the smallest amount of code required).

Best regards,
Frank

#18 Updated by Frank Schilder over 1 year ago

Hi all,

this story continues, this time with a valid vxattr name. I just observed exactly the same problem now with an octopus MDS with the following commands on a client mounting the ceph fs:

[root@rit-tceph ~]# setfattr -n ceph.dir.pin.distributed -v 1 /mnt/adm/cephfs/data/blobs
[root@rit-tceph ~]# getfattr -n ceph.dir.pin.distributed /mnt/adm/cephfs/data/blobs
^C

The first command completes (but does it succeed??). The second command just hangs and the MDS goes into a restart loop.
Relevant MDS log snippet:

   -10> 2022-08-09T14:11:44.422+0200 7fabc3ec2700  1 mds.0.723 recovery_done -- successful recovery!
    -9> 2022-08-09T14:11:44.422+0200 7fabc3ec2700  1 mds.0.723 active_start
    -8> 2022-08-09T14:11:44.423+0200 7fabc3ec2700  4 mds.0.server handle_client_request client_request(client.443938:77 ??? #0x10000000000 ceph.dir.pin.distributed 2022-08-09T14:11:33.532063+0200 RETRY=1 caller_uid=0, caller_gid=0{0,}) v4
    -7> 2022-08-09T14:11:44.423+0200 7fabc3ec2700  5 mds.0.server waiting for root
    -6> 2022-08-09T14:11:44.423+0200 7fabc3ec2700  1 mds.0.723 cluster recovered.
    -5> 2022-08-09T14:11:44.423+0200 7fabc3ec2700  4 mds.0.723 set_osd_epoch_barrier: epoch=1629
    -4> 2022-08-09T14:11:44.424+0200 7fabc5ec6700  5 mds.beacon.tceph-02 received beacon reply up:active seq 5 rtt 1.05601
    -3> 2022-08-09T14:11:44.426+0200 7fabbeeb8700  4 mds.0.server handle_client_request client_request(client.443938:77 ??? #0x10000000000 ceph.dir.pin.distributed 2022-08-09T14:11:33.532063+0200 RETRY=1 caller_uid=0, caller_gid=0{0,}) v4
    -2> 2022-08-09T14:11:44.426+0200 7fabbeeb8700  1 mds.0.server  unknown client op 262
    -1> 2022-08-09T14:11:44.428+0200 7fabbeeb8700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.16/rpm/el8/BUILD/ceph-15.2.16/src/mds/Server.cc: In function 'void Server::perf_gather_op_latency(ceph::cref_t<MClientRequest>&, utime_t)' thread 7fabbeeb8700 time 2022-08-09T14:11:44.427479+0200
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.16/rpm/el8/BUILD/ceph-15.2.16/src/mds/Server.cc: 2012: ceph_abort_msg("abort() called")

 ceph version 15.2.16 (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)

Is the patch addressing this case as well (previously it was a non-existing vxattr name, this time the vxattr should actually exist)? I thought that the commit mentioned in https://tracker.ceph.com/issues/56529#note-8 added this operation to the MDS? Was the backport to octopus forgotten? Will the "final release octopus.2.17" still get this commit if necessary?

Best regards,
Frank

#19 Updated by Xiubo Li over 1 year ago

Frank Schilder wrote:

Hi all,

this story continues, this time with a valid vxattr name. I just observed exactly the same problem now with an octopus MDS with the following commands on a client mounting the ceph fs:

[...]

The first command completes (but does it succeed??). The second command just hangs and the MDS goes into a restart loop.
Relevant MDS log snippet:

[...]

Is the patch addressing this case as well (previously it was a non-existing vxattr name, this time the vxattr should actually exist)? I thought that the commit mentioned in https://tracker.ceph.com/issues/56529#note-8 added this operation to the MDS? Was the backport to octopus forgotten? Will the "final release octopus.2.17" still get this commit if necessary?

That commit hasn't been backported to Octopus. If new clients connect to Octopus/Nautilus they will hit the same crash issue.

#20 Updated by Frank Schilder over 1 year ago

Thanks for the quick answer. Then, I guess, the patch to the ceph-fs clients will handle this once it is approved? I saw there was quite a discussion going on on the pull request and that there was some activity yesterday. Is this fix approved and can we expect it sometime soon? We are holding upgrades back due to this problem; we use the vxattr interface a lot.

Best regards,
Frank

#21 Updated by Xiubo Li over 1 year ago

Frank Schilder wrote:

Thanks for the quick answer. Then, I guess, the patch to the ceph-fs clients will handle this once it is approved? I saw there was quite a discussion going on on the pull request and that there was some activity yesterday. Is this fix approved and can we expect it sometime soon? We are holding upgrades back due to this problem; we use the vxattr interface a lot.

Yeah, my local tests of it have passed; still waiting for Venky to test more if needed.

The kernel client fix is here [1], and it's ready to be merged. Still waiting for the next merge window to open, this week or next.

[1] https://patchwork.kernel.org/project/ceph-devel/patch/20220808070823.707829-1-xiubli@redhat.com/

#22 Updated by Venky Shankar over 1 year ago

Frank Schilder wrote:

Thanks for the quick answer. Then, I guess, the patch to the ceph-fs clients will handle this once it is approved? I saw there was quite a discussion going on on the pull request and that there was some activity yesterday. Is this fix approved and can we expect it sometime soon? We are holding upgrades back due to this problem, we use the vxattrib interface a lot.

The fix is nearly there. We are adding tests along with it so that such bugs do not creep in. We plan to run tests this week (and then the fix needs backporting).

#23 Updated by Venky Shankar over 1 year ago

  • Status changed from Fix Under Review to Pending Backport

#24 Updated by Backport Bot over 1 year ago

#25 Updated by Backport Bot over 1 year ago

#26 Updated by Backport Bot over 1 year ago

  • Tags set to backport_processed

#27 Updated by Frank Schilder over 1 year ago

Hi, just a confirmation. The problem is solved in ML 5.19.10-1.el7 and probably all other stable kernel lines, including the check for the ceph client-key permissions:

# getfattr -n ceph.dir.layout.po /mnt/cephfs
/mnt/cephfs: ceph.dir.layout.po: No such attribute

# getfattr -n ceph.dir.layout.po /mnt/adm/cephfs
/mnt/adm/cephfs: ceph.dir.layout.po: No such attribute

# getfattr -n ceph.dir.layout.pool /mnt/cephfs
/mnt/cephfs: ceph.dir.layout.pool: No such attribute

# getfattr -n ceph.dir.layout.pool /mnt/adm/cephfs
getfattr: Removing leading '/' from absolute path names
# file: mnt/adm/cephfs
ceph.dir.layout.pool="fs-data" 

#28 Updated by Konstantin Shalygin over 1 year ago

  • Status changed from Pending Backport to Resolved
  • % Done changed from 0 to 100
  • Tags deleted (backport_processed)
