Project

General

Profile

Actions

Bug #56522

closed

Do not abort MDS on unknown messages

Added by Greg Farnum almost 2 years ago. Updated about 1 year ago.

Status:
Resolved
Priority:
Urgent
Category:
Correctness/Safety
Target version:
-
% Done:

0%

Source:
Tags:
backport_processed
Backport:
quincy, pacific
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
crash
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Right now, in Server::dispatch(), we abort the MDS if we get a message type we don't understand.

This is horrible: it means that any malicious client can crash the server by just sending a message of a new type to the server! That's a trivial denial of service.
Besides malicious clients, it also means that when there's a protocol issue such as a new client erroneously sending new messages to the server, it crashes the whole system instead of just the new client.

Instead, we'll need to drop the message in a way that makes any kind of sense — perhaps we respond to unknown messages by blacklisting the client and closing the session?


Related issues 3 (0 open3 closed)

Related to CephFS - Bug #56529: ceph-fs crashes on getfattrResolvedXiubo Li

Actions
Copied to CephFS - Backport #57665: pacific: Do not abort MDS on unknown messagesResolvedDhairya ParmarActions
Copied to CephFS - Backport #57666: quincy: Do not abort MDS on unknown messagesResolvedDhairya ParmarActions
Actions #1

Updated by Xiubo Li almost 2 years ago

Greg Farnum wrote:

Right now, in Server::dispatch(), we abort the MDS if we get a message type we don't understand.

This is horrible: it means that any malicious client can crash the server by just sending a message of a new type to the server! That's a trivial denial of service.
Besides malicious clients, it also means that when there's a protocol issue such as a new client erroneously sending new messages to the server, it crashes the whole system instead of just the new client.

Instead, we'll need to drop the message in a way that makes any kind of sense — perhaps we respond to unknown messages by blacklisting the client and closing the session?

IMO we'd better just drop it, because it's possible that in the future when we added a new type of message and what if the newer clients are connecting to the old ceph ? Just like the client metric bugs we hit before.

Actions #2

Updated by Venky Shankar almost 2 years ago

Greg Farnum wrote:

Right now, in Server::dispatch(), we abort the MDS if we get a message type we don't understand.

This is horrible: it means that any malicious client can crash the server by just sending a message of a new type to the server! That's a trivial denial of service.
Besides malicious clients, it also means that when there's a protocol issue such as a new client erroneously sending new messages to the server, it crashes the whole system instead of just the new client.

That's horrid. We fixed bugs of this class (commit: 2f4060b8c41004d10d9a64676ccd847f6e1304dd), but its really bad that there are leftovers of similar sorts.

Instead, we'll need to drop the message in a way that makes any kind of sense — perhaps we respond to unknown messages by blacklisting the client and closing the session?

We obviously do not want to abort the mds. If we drop the message, how do clients react? Blocklisting and closing the connection sounds harsh given that a newer client can talk to old MDSs.

Actions #3

Updated by Xiubo Li almost 2 years ago

Venky Shankar wrote:

Greg Farnum wrote:

Right now, in Server::dispatch(), we abort the MDS if we get a message type we don't understand.

This is horrible: it means that any malicious client can crash the server by just sending a message of a new type to the server! That's a trivial denial of service.
Besides malicious clients, it also means that when there's a protocol issue such as a new client erroneously sending new messages to the server, it crashes the whole system instead of just the new client.

That horrid. We fixed bugs of this class (commit: 2f4060b8c41004d10d9a64676ccd847f6e1304dd), but its really bad that there are leftovers of similar sorts.

Instead, we'll need to drop the message in a way that makes any kind of sense — perhaps we respond to unknown messages by blacklisting the client and closing the session?

We obviously do not want to abort the mds. If we drop the message, how do clients react? Blocklisting and closing the connection sounds harsh given that a newer client can talk to old MDSs.

Yeah, the clients should wait forever. So perhaps should respond to client with a reply to tell clients this is a request unsupported.

Actions #4

Updated by Venky Shankar almost 2 years ago

  • Status changed from New to Triaged
  • Assignee set to Dhairya Parmar
  • Backport set to quincy, pacific
Actions #5

Updated by Patrick Donnelly almost 2 years ago

I think the MDS should close the session and blocklist the client. If a newer client is using features an older cluster does not understand, that should be fixed by the admin (by turning the feature off). Or, in the ideal case, this should be caught by the CephFS feature bits exchanged by the MDS/client so that the client isn't sending messages it should know the MDS does not understand.

Actions #6

Updated by Patrick Donnelly almost 2 years ago

  • Related to Bug #56529: ceph-fs crashes on getfattr added
Actions #7

Updated by Greg Farnum almost 2 years ago

Venky Shankar wrote:

We obviously do not want to abort the mds. If we drop the message, how do clients react? Blocklisting and closing the connection sounds harsh given that a newer client can talk to old MDSs.

As Patrick says, if a new client is sending an old MDS ops it doesn't understand, that's an error in the client. (One probably caused by us, but still a client error.) There's a good chance in such situations that it will do other things we don't understand, and blocklisting is about the cleanest break we can make to signal a major problem and prevent ongoing issues.

Silently dropping messages causes a LOT of issues — besides the risk of out-of-order problems, the client won't be able to move various queues forward as it waits for replies on the one dropped message...

Actions #8

Updated by Dhairya Parmar almost 2 years ago

  • Pull request ID set to 47080

Greg Farnum wrote:

Venky Shankar wrote:

We obviously do not want to abort the mds. If we drop the message, how do clients react? Blocklisting and closing the connection sounds harsh given that a newer client can talk to old MDSs.

As Patrick says, if a new client is sending an old MDS ops it doesn't understand, that's an error in the client. (One probably caused by us, but still a client error.) There's a good chance in such situations that it will do other things we don't understand, and blocklisting is about the cleanest break we can make to signal a major problem and prevent ongoing issues.

Silently dropping messages causes a LOT of issues — besides the risk of out-of-order problems, the client won't be able to move various queues forward as it waits for replies on the one dropped message...

After reading all the messages, I think It's better to expel the client from further communication. Only printing message could lead to deadly issues and it might be hard to trace down bugs that could arise due to this silent message. I feel patrick is right, admin can turning off the feature that is not yet supported to prevent getting blocklisted. Keeping these things in mind, I have pushed a PR that would close the session and blocklist the client. Would appreciate any comments/suggestions on the PR.

Actions #9

Updated by Stefan Kooman almost 2 years ago

@Dhairya Parmar

If the connection would be silently closed, it would be highly appreciated that the MDS logs this with the reason why. This way an admin has a way to troubleshoot / verify.

Actions #10

Updated by Dhairya Parmar almost 2 years ago

Stefan Kooman wrote:

@Dhairya Parmar

If the connection would be silently closed, it would be highly appreciated that the MDS logs this with the reason why. This way an admin has a way to troubleshoot / verify.

@Stefan Kleijkers Kooman

Yes, It will be logged in MDS logs. You can also check the PR if in case you have any changes to propose.

Actions #11

Updated by Dhairya Parmar almost 2 years ago

  • Status changed from Triaged to Fix Under Review
Actions #12

Updated by Milind Changire almost 2 years ago

I had started the GETVXATTR RPC implementation with the introduction of a feature bit for this very purpose. I was told that it wasn't a good idea to introduce any more feature bits and that I should work my way around the issue. I tried to test for the mds versions instead and that idea was killed as well in the favor of providing the "crash mitigation fix" for older mds which was much trivial and unintrusive to implement.

Actions #13

Updated by Xiubo Li almost 2 years ago

Milind Changire wrote:

I had started the GETVXATTR RPC implementation with the introduction of a feature bit for this very purpose. I was told that it wasn't a good idea to introduce any more feature bits and that I should work my way around the issue. I tried to test for the mds versions instead and that idea was killed as well in the favor of providing the "crash mitigation fix" for older mds which was much trivial and unintrusive to implement.

For this issue for existing old MDSs, it's very hard to figure out whether will they crash when receive an unknown OP. It seems we couldn't figure out whether that fix in MDS was backported or not.

Except the feature bit, is there any other better approach to tell clients do not send unknown OPs to old MDS ?

Actions #14

Updated by Milind Changire almost 2 years ago

Xiubo Li wrote:

Milind Changire wrote:

I had started the GETVXATTR RPC implementation with the introduction of a feature bit for this very purpose. I was told that it wasn't a good idea to introduce any more feature bits and that I should work my way around the issue. I tried to test for the mds versions instead and that idea was killed as well in the favor of providing the "crash mitigation fix" for older mds which was much trivial and unintrusive to implement.

For this issue for existing old MDSs, it's very hard to figure out whether will they crash when receive an unknown OP. It seems we couldn't figure out whether that fix in MDS was backported or not.

Except the feature bit, is there any other better approach to tell clients do not send unknown OPs to old MDS ?

The specific crash mitigation fix by Venky was backported to quincy, pacific and octopus.

For the client should not send unknown ops to mds, there's still the problem of somebody crafting a new client with an unknown op and causing the MDS to crash. So we can't rely on the client to block unknown ops. Unknown ops must be handled at the MDS side and reported back as EOPNOTSUPP (Operation not supported).

Actions #15

Updated by Xiubo Li almost 2 years ago

Milind Changire wrote:

Xiubo Li wrote:

Milind Changire wrote:

I had started the GETVXATTR RPC implementation with the introduction of a feature bit for this very purpose. I was told that it wasn't a good idea to introduce any more feature bits and that I should work my way around the issue. I tried to test for the mds versions instead and that idea was killed as well in the favor of providing the "crash mitigation fix" for older mds which was much trivial and unintrusive to implement.

For this issue for existing old MDSs, it's very hard to figure out whether will they crash when receive an unknown OP. It seems we couldn't figure out whether that fix in MDS was backported or not.

Except the feature bit, is there any other better approach to tell clients do not send unknown OPs to old MDS ?

The specific crash mitigation fix by Venky was backported to quincy, pacific and octopus.

For the client should not send unknown ops to mds, there's still the problem of somebody crafting a new client with an unknown op and causing the MDS to crash. So we can't rely on the client to block unknown ops. Unknown ops must be handled at the MDS side and reported back as EOPNOTSUPP (Operation not supported).

Yeah, I couldn't agree more and this is what we should do.

While for the issues like in https://tracker.ceph.com/issues/56529, we couldn't depend on the MDS side to report back as EOPNOTSUPP. Because for much older cephs and in the earlier quincy, pacific and octopus releases which haven't backported that fix yet but are already running by users/customers. They still will crash the MDSs.

Actions #16

Updated by Venky Shankar almost 2 years ago

Milind Changire wrote:

Xiubo Li wrote:

Milind Changire wrote:

I had started the GETVXATTR RPC implementation with the introduction of a feature bit for this very purpose. I was told that it wasn't a good idea to introduce any more feature bits and that I should work my way around the issue. I tried to test for the mds versions instead and that idea was killed as well in the favor of providing the "crash mitigation fix" for older mds which was much trivial and unintrusive to implement.

For this issue for existing old MDSs, it's very hard to figure out whether will they crash when receive an unknown OP. It seems we couldn't figure out whether that fix in MDS was backported or not.

Except the feature bit, is there any other better approach to tell clients do not send unknown OPs to old MDS ?

The specific crash mitigation fix by Venky was backported to quincy, pacific and octopus.

For the client should not send unknown ops to mds, there's still the problem of somebody crafting a new client with an unknown op and causing the MDS to crash. So we can't rely on the client to block unknown ops. Unknown ops must be handled at the MDS side and reported back as EOPNOTSUPP (Operation not supported).

We obviously need the fix on the mds side to not crash when it does not understand a file operation and these fixes are backported to last 2 releases. So, does this mean for releases older than that, we rely on client side changes to avoid sending operations that the mds does not implement?

Actions #17

Updated by Venky Shankar over 1 year ago

  • Status changed from Fix Under Review to Pending Backport
Actions #18

Updated by Backport Bot over 1 year ago

  • Copied to Backport #57665: pacific: Do not abort MDS on unknown messages added
Actions #19

Updated by Backport Bot over 1 year ago

  • Copied to Backport #57666: quincy: Do not abort MDS on unknown messages added
Actions #20

Updated by Backport Bot over 1 year ago

  • Tags set to backport_processed
Actions #21

Updated by Dhairya Parmar about 1 year ago

  • Status changed from Pending Backport to Resolved
  • Assignee changed from Dhairya Parmar to Casey Bodley

PR merged into main, both backports PRs merged into their respective branches.

Actions #22

Updated by Dhairya Parmar about 1 year ago

  • Assignee changed from Casey Bodley to Dhairya Parmar
  • Labels (FS) crash added
Actions

Also available in: Atom PDF