Bug #49500

qa: "Assertion `cb_done' failed."

Added by Patrick Donnelly about 2 months ago. Updated 15 days ago.

Status: Pending Backport
Priority: High
Category: -
Target version:
% Done: 0%
Source: Q/A
Tags:
Backport: pacific,octopus,nautilus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): Client, qa-suite
Labels (FS): qa, qa-failure
Pull request ID:
Crash signature (v1):
Crash signature (v2):


Related issues

Copied from CephFS - Bug #49309: nautilus: qa: "Assertion `cb_done' failed." Pending Backport
Copied to CephFS - Backport #50188: octopus: qa: "Assertion `cb_done' failed." Need More Info
Copied to CephFS - Backport #50189: nautilus: qa: "Assertion `cb_done' failed." Rejected
Copied to CephFS - Backport #50190: pacific: qa: "Assertion `cb_done' failed." In Progress

History

#1 Updated by Patrick Donnelly about 2 months ago

  • Copied from Bug #49309: nautilus: qa: "Assertion `cb_done' failed." added

#2 Updated by Jeff Layton about 1 month ago

With the most recent change to make that variable atomic, I doubt we're hitting cache-coherency problems. It seems more likely that the callback just didn't happen. Could it be that the cluster in this case is OK with 1000 inodes on the client and doesn't trigger the ino_release_cb?
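To illustrate the distinction drawn above (a flag that is set but not observed versus a callback that never runs), here is a minimal sketch in Python. It is not code from the Ceph tree: `wait_for_callback`, `recalling_mds`, and `idle_mds` are hypothetical names, and `threading.Event` stands in for the test's `cb_done` flag. When nothing invokes the callback, the wait simply times out, which is what a failed `cb_done` assertion would look like.

```python
import threading


def wait_for_callback(trigger, timeout=1.0):
    """Run trigger(cb) and report whether cb was invoked within `timeout` seconds."""
    cb_done = threading.Event()  # stand-in for the test's cb_done flag

    def ino_release_cb():
        # Hypothetical callback body: only records that it was called.
        cb_done.set()

    trigger(ino_release_cb)
    # Event.wait returns True if the flag was set, False on timeout.
    return cb_done.wait(timeout)


def recalling_mds(cb):
    """Hypothetical server that does issue a recall, from another thread."""
    threading.Thread(target=cb).start()


def idle_mds(cb):
    """Hypothetical server that never recalls, so cb is never called."""
    pass
```

With `recalling_mds` the wait succeeds; with `idle_mds` it times out no matter how the flag is synchronized, which matches the "callback just didn't happen" explanation.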

#3 Updated by Jeff Layton about 1 month ago

Yeah, looking at the MDS logs from the above run, I don't see any occurrences of the word "recall" in there, and at least some of the dout(7) messages in Server::recall_client_state should have fired. I think that this test just didn't trigger any recalls. Have the MDS's default limits changed?

#4 Updated by Patrick Donnelly about 1 month ago

Jeff Layton wrote:

Yeah, looking at the MDS logs from the above run, I don't see any occurrences of the word "recall" in there, and at least some of the dout(7) messages in Server::recall_client_state should have fired. I think that this test just didn't trigger any recalls. Have the MDS's default limits changed?

Ah, yes that is probably it. I think it's caused by 63392e1b65fbead6ef8c7acd6a70e6ef5b322390 and the new mds_min_caps_working_set option.

#5 Updated by Jeff Layton about 1 month ago

I'm not sure that setting is enough to explain this. AFAICT, that setting is only consulted in notify_health(), so I think that should just affect health warnings.

This test was written when the logic to trigger cap recall was pretty simple. Once you hit ~1k caps outstanding, the MDS would ask the client to shrink its caps. This has evidently changed recently, but the test was not updated to take that into account.

Basically we want this test to find new inodes up until the point where we know that the MDS will start recalling them. What's the right way to do that now?
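As a rough illustration of why the test can stop seeing recalls, here is a toy model in Python. It is an assumption-laden sketch, not the MDS's real recall logic: it pretends that a recall fires as soon as the client's cap count exceeds a single per-client limit, and `caps_until_recall` is a hypothetical name.

```python
def caps_until_recall(max_caps_per_client, inodes_to_open):
    """Return the cap count at which the toy MDS first recalls, or None.

    Toy assumption: one cap per opened inode, and a recall is triggered
    the moment the client holds more caps than max_caps_per_client.
    """
    caps = 0
    for _ in range(inodes_to_open):
        caps += 1
        if caps > max_caps_per_client:
            return caps
    return None  # the test ran out of inodes before any recall
```

Under this simplification, a test that opens about 1000 inodes never reaches a limit on the order of 1M, so no recall (and no callback) would be expected.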

#6 Updated by Jeff Layton about 1 month ago

Maybe we could lower mds_max_caps_per_client for this test? It defaults to 1M now, but we could take that down to 500 or so for this test (and then reset it when we're done)?
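The "lower it, then reset it when we're done" idea can be sketched as a scoped override. This is a minimal Python illustration, not the actual test change: `override_option` is a hypothetical helper, the plain dict stands in for the cluster configuration, and the value 1_000_000 is only a placeholder for the "1M" default mentioned above.

```python
from contextlib import contextmanager


@contextmanager
def override_option(config, name, value):
    """Temporarily set config[name] = value, restoring the old state on exit."""
    missing = object()  # sentinel: distinguishes "unset" from any real value
    old = config.get(name, missing)
    config[name] = value
    try:
        yield config
    finally:
        # Restore even if the test body raised.
        if old is missing:
            config.pop(name, None)
        else:
            config[name] = old
```

Usage would look like `with override_option(cfg, "mds_max_caps_per_client", 500): run_test()`, where `cfg` and `run_test` are placeholders. The `finally` clause keeps a failing test from leaving the lowered limit in place for later tests.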

#7 Updated by Patrick Donnelly 28 days ago

Jeff Layton wrote:

Maybe we could lower mds_max_caps_per_client for this test? It defaults to 1M now, but we could take that down to 500 or so for this test (and then reset it when we're done)?

Looking at this test more closely... why it ever worked is unclear to me. The MDS does not normally drive recall for a client reaching 1k caps. What is supposed to trigger the call to release_cb?

We can reduce `mds_max_caps_per_client` but I'd like to understand what's supposed to be tested. Just that the callback works?

#8 Updated by Jeff Layton 28 days ago

Patrick Donnelly wrote:

Jeff Layton wrote:

Maybe we could lower mds_max_caps_per_client for this test? It defaults to 1M now, but we could take that down to 500 or so for this test (and then reset it when we're done)?

Looking at this test more closely... why it ever worked is unclear to me. The MDS does not normally drive recall for a client reaching 1k caps. What is supposed to trigger the call to release_cb?

It used to do that, IIRC, but it was based on some rather fluid limits.

We can reduce `mds_max_caps_per_client` but I'd like to understand what's supposed to be tested. Just that the callback works?

Yes, just that the callback is called when inodes are being recalled (à la CEPH_SESSION_RECALL_STATE).

#9 Updated by Patrick Donnelly 27 days ago

  • Status changed from New to Fix Under Review
  • Assignee changed from Jeff Layton to Patrick Donnelly
  • Pull request ID set to 40418

#10 Updated by Patrick Donnelly 15 days ago

  • Status changed from Fix Under Review to Pending Backport
  • Component(FS) Client added

#11 Updated by Backport Bot 15 days ago

  • Copied to Backport #50188: octopus: qa: "Assertion `cb_done' failed." added

#12 Updated by Backport Bot 15 days ago

  • Copied to Backport #50189: nautilus: qa: "Assertion `cb_done' failed." added

#13 Updated by Backport Bot 15 days ago

  • Copied to Backport #50190: pacific: qa: "Assertion `cb_done' failed." added
