Bug #56522: Do not abort MDS on unknown messages - CephFS - Ceph

Actions

Copy link

Bug #56522

closed

Do not abort MDS on unknown messages

Added by Greg Farnum almost 2 years ago. Updated about 1 year ago.

Status:

Resolved

Priority:

Urgent

Assignee:

Dhairya Parmar

Category:

Correctness/Safety

Target version:

% Done:

Source:

Tags:

backport_processed

Backport:

quincy, pacific

Regression:

Severity:

1 - critical

Reviewed:

Affected Versions:

ceph-qa-suite:

Component(FS):

MDS

Labels (FS):

crash

Pull request ID:

47080

Crash signature (v1):

Crash signature (v2):

Description

Right now, in Server::dispatch(), we abort the MDS if we get a message type we don't understand.

This is horrible: it means that any malicious client can crash the server by just sending a message of a new type to the server! That's a trivial denial of service.
Besides malicious clients, it also means that when there's a protocol issue such as a new client erroneously sending new messages to the server, it crashes the whole system instead of just the new client.

Instead, we'll need to drop the message in a way that makes any kind of sense — perhaps we respond to unknown messages by blacklisting the client and closing the session?

Related issues 3 (0 open — 3 closed)

Actions

Copy link

Updated by Xiubo Li almost 2 years ago

Greg Farnum wrote:

Right now, in Server::dispatch(), we abort the MDS if we get a message type we don't understand.

This is horrible: it means that any malicious client can crash the server by just sending a message of a new type to the server! That's a trivial denial of service.
Besides malicious clients, it also means that when there's a protocol issue such as a new client erroneously sending new messages to the server, it crashes the whole system instead of just the new client.

Instead, we'll need to drop the message in a way that makes any kind of sense — perhaps we respond to unknown messages by blacklisting the client and closing the session?

IMO we'd better just drop it, because it's possible that in the future when we added a new type of message and what if the newer clients are connecting to the old ceph ? Just like the client metric bugs we hit before.

Actions

Copy link

Updated by Venky Shankar almost 2 years ago

Greg Farnum wrote:

Right now, in Server::dispatch(), we abort the MDS if we get a message type we don't understand.

This is horrible: it means that any malicious client can crash the server by just sending a message of a new type to the server! That's a trivial denial of service.
Besides malicious clients, it also means that when there's a protocol issue such as a new client erroneously sending new messages to the server, it crashes the whole system instead of just the new client.

That's horrid. We fixed bugs of this class (commit: 2f4060b8c41004d10d9a64676ccd847f6e1304dd), but its really bad that there are leftovers of similar sorts.

Instead, we'll need to drop the message in a way that makes any kind of sense — perhaps we respond to unknown messages by blacklisting the client and closing the session?

We obviously do not want to abort the mds. If we drop the message, how do clients react? Blocklisting and closing the connection sounds harsh given that a newer client can talk to old MDSs.

Actions

Copy link

Updated by Xiubo Li almost 2 years ago

Venky Shankar wrote:

Greg Farnum wrote:

Right now, in Server::dispatch(), we abort the MDS if we get a message type we don't understand.

This is horrible: it means that any malicious client can crash the server by just sending a message of a new type to the server! That's a trivial denial of service.
Besides malicious clients, it also means that when there's a protocol issue such as a new client erroneously sending new messages to the server, it crashes the whole system instead of just the new client.

That horrid. We fixed bugs of this class (commit: 2f4060b8c41004d10d9a64676ccd847f6e1304dd), but its really bad that there are leftovers of similar sorts.

Instead, we'll need to drop the message in a way that makes any kind of sense — perhaps we respond to unknown messages by blacklisting the client and closing the session?

We obviously do not want to abort the mds. If we drop the message, how do clients react? Blocklisting and closing the connection sounds harsh given that a newer client can talk to old MDSs.

Yeah, the clients should wait forever. So perhaps should respond to client with a reply to tell clients this is a request unsupported.

Actions

Copy link

Updated by Venky Shankar almost 2 years ago

Status changed from New to Triaged
Assignee set to Dhairya Parmar
Backport set to quincy, pacific

Actions

Copy link

Updated by Patrick Donnelly almost 2 years ago

I think the MDS should close the session and blocklist the client. If a newer client is using features an older cluster does not understand, that should be fixed by the admin (by turning the feature off). Or, in the ideal case, this should be caught by the CephFS feature bits exchanged by the MDS/client so that the client isn't sending messages it should know the MDS does not understand.

Actions

Copy link

Updated by Patrick Donnelly almost 2 years ago

Related to Bug #56529: ceph-fs crashes on getfattr added

Actions

Copy link

Updated by Greg Farnum almost 2 years ago

Venky Shankar wrote:

We obviously do not want to abort the mds. If we drop the message, how do clients react? Blocklisting and closing the connection sounds harsh given that a newer client can talk to old MDSs.

As Patrick says, if a new client is sending an old MDS ops it doesn't understand, that's an error in the client. (One probably caused by us, but still a client error.) There's a good chance in such situations that it will do other things we don't understand, and blocklisting is about the cleanest break we can make to signal a major problem and prevent ongoing issues.

Silently dropping messages causes a LOT of issues — besides the risk of out-of-order problems, the client won't be able to move various queues forward as it waits for replies on the one dropped message...

Actions

Copy link

Updated by Dhairya Parmar almost 2 years ago

Pull request ID set to 47080

Greg Farnum wrote:

Venky Shankar wrote:

We obviously do not want to abort the mds. If we drop the message, how do clients react? Blocklisting and closing the connection sounds harsh given that a newer client can talk to old MDSs.

As Patrick says, if a new client is sending an old MDS ops it doesn't understand, that's an error in the client. (One probably caused by us, but still a client error.) There's a good chance in such situations that it will do other things we don't understand, and blocklisting is about the cleanest break we can make to signal a major problem and prevent ongoing issues.

Silently dropping messages causes a LOT of issues — besides the risk of out-of-order problems, the client won't be able to move various queues forward as it waits for replies on the one dropped message...

After reading all the messages, I think It's better to expel the client from further communication. Only printing message could lead to deadly issues and it might be hard to trace down bugs that could arise due to this silent message. I feel patrick is right, admin can turning off the feature that is not yet supported to prevent getting blocklisted. Keeping these things in mind, I have pushed a PR that would close the session and blocklist the client. Would appreciate any comments/suggestions on the PR.

Actions

Copy link

Updated by Stefan Kooman almost 2 years ago

@Dhairya Parmar

If the connection would be silently closed, it would be highly appreciated that the MDS logs this with the reason why. This way an admin has a way to troubleshoot / verify.

Actions

Copy link

#10

Updated by Dhairya Parmar almost 2 years ago

Stefan Kooman wrote:

@Dhairya Parmar

If the connection would be silently closed, it would be highly appreciated that the MDS logs this with the reason why. This way an admin has a way to troubleshoot / verify.

@Stefan Kleijkers Kooman

Yes, It will be logged in MDS logs. You can also check the PR if in case you have any changes to propose.

Actions

Copy link

#11

Updated by Dhairya Parmar almost 2 years ago

Status changed from Triaged to Fix Under Review

Actions

Copy link

#12

Updated by Milind Changire almost 2 years ago

I had started the GETVXATTR RPC implementation with the introduction of a feature bit for this very purpose. I was told that it wasn't a good idea to introduce any more feature bits and that I should work my way around the issue. I tried to test for the mds versions instead and that idea was killed as well in the favor of providing the "crash mitigation fix" for older mds which was much trivial and unintrusive to implement.

Actions

Copy link

#13

Updated by Xiubo Li almost 2 years ago

Milind Changire wrote:

I had started the GETVXATTR RPC implementation with the introduction of a feature bit for this very purpose. I was told that it wasn't a good idea to introduce any more feature bits and that I should work my way around the issue. I tried to test for the mds versions instead and that idea was killed as well in the favor of providing the "crash mitigation fix" for older mds which was much trivial and unintrusive to implement.

For this issue for existing old MDSs, it's very hard to figure out whether will they crash when receive an unknown OP. It seems we couldn't figure out whether that fix in MDS was backported or not.

Except the feature bit, is there any other better approach to tell clients do not send unknown OPs to old MDS ?

Actions

Copy link

#14

Updated by Milind Changire almost 2 years ago

Xiubo Li wrote:

Milind Changire wrote:

I had started the GETVXATTR RPC implementation with the introduction of a feature bit for this very purpose. I was told that it wasn't a good idea to introduce any more feature bits and that I should work my way around the issue. I tried to test for the mds versions instead and that idea was killed as well in the favor of providing the "crash mitigation fix" for older mds which was much trivial and unintrusive to implement.

For this issue for existing old MDSs, it's very hard to figure out whether will they crash when receive an unknown OP. It seems we couldn't figure out whether that fix in MDS was backported or not.

Except the feature bit, is there any other better approach to tell clients do not send unknown OPs to old MDS ?

The specific crash mitigation fix by Venky was backported to quincy, pacific and octopus.

For the client should not send unknown ops to mds, there's still the problem of somebody crafting a new client with an unknown op and causing the MDS to crash. So we can't rely on the client to block unknown ops. Unknown ops must be handled at the MDS side and reported back as EOPNOTSUPP (Operation not supported).

Actions

Copy link

#15

Updated by Xiubo Li almost 2 years ago

Milind Changire wrote:

Xiubo Li wrote:

Milind Changire wrote:

I had started the GETVXATTR RPC implementation with the introduction of a feature bit for this very purpose. I was told that it wasn't a good idea to introduce any more feature bits and that I should work my way around the issue. I tried to test for the mds versions instead and that idea was killed as well in the favor of providing the "crash mitigation fix" for older mds which was much trivial and unintrusive to implement.

For this issue for existing old MDSs, it's very hard to figure out whether will they crash when receive an unknown OP. It seems we couldn't figure out whether that fix in MDS was backported or not.

Except the feature bit, is there any other better approach to tell clients do not send unknown OPs to old MDS ?

The specific crash mitigation fix by Venky was backported to quincy, pacific and octopus.

For the client should not send unknown ops to mds, there's still the problem of somebody crafting a new client with an unknown op and causing the MDS to crash. So we can't rely on the client to block unknown ops. Unknown ops must be handled at the MDS side and reported back as EOPNOTSUPP (Operation not supported).

Yeah, I couldn't agree more and this is what we should do.

While for the issues like in https://tracker.ceph.com/issues/56529, we couldn't depend on the MDS side to report back as EOPNOTSUPP. Because for much older cephs and in the earlier quincy, pacific and octopus releases which haven't backported that fix yet but are already running by users/customers. They still will crash the MDSs.

Actions

Copy link

#16

Updated by Venky Shankar almost 2 years ago

Milind Changire wrote:

Xiubo Li wrote:

Milind Changire wrote:

I had started the GETVXATTR RPC implementation with the introduction of a feature bit for this very purpose. I was told that it wasn't a good idea to introduce any more feature bits and that I should work my way around the issue. I tried to test for the mds versions instead and that idea was killed as well in the favor of providing the "crash mitigation fix" for older mds which was much trivial and unintrusive to implement.

For this issue for existing old MDSs, it's very hard to figure out whether will they crash when receive an unknown OP. It seems we couldn't figure out whether that fix in MDS was backported or not.

Except the feature bit, is there any other better approach to tell clients do not send unknown OPs to old MDS ?

The specific crash mitigation fix by Venky was backported to quincy, pacific and octopus.

For the client should not send unknown ops to mds, there's still the problem of somebody crafting a new client with an unknown op and causing the MDS to crash. So we can't rely on the client to block unknown ops. Unknown ops must be handled at the MDS side and reported back as EOPNOTSUPP (Operation not supported).

We obviously need the fix on the mds side to not crash when it does not understand a file operation and these fixes are backported to last 2 releases. So, does this mean for releases older than that, we rely on client side changes to avoid sending operations that the mds does not implement?

Actions

Copy link

#17