Bug #49500: qa: "Assertion `cb_done' failed." - CephFS - Ceph

Bug #49500

closed

qa: "Assertion `cb_done' failed."

Added by Patrick Donnelly about 3 years ago. Updated over 2 years ago.

Status: Resolved
Priority: High
Assignee: Patrick Donnelly
Target version: Ceph - v17.0.0
Source: Q/A
Backport: pacific,octopus,nautilus
Severity: 3 - minor
Component(FS): Client, qa-suite
Labels (FS): qa, qa-failure
Pull request ID: 40418

Description

Clone of #49309. The (good) fix we thought might help did not.

https://pulpito.ceph.com/pdonnell-2021-02-25_21:22:22-fs-wip-pdonnell-testing-20210225.184709-distro-basic-smithi/5913697/

Related issues 4 (0 open — 4 closed)

Updated by Patrick Donnelly about 3 years ago

Copied from Bug #49309: nautilus: qa: "Assertion `cb_done' failed." added

Updated by Jeff Layton about 3 years ago

With the most recent change to make that variable atomic, I doubt we're hitting cache-coherency problems. It seems more likely that the callback just didn't happen. Could it be that the cluster in this case is OK with 1000 inodes on the client and doesn't trigger the ino_release_cb?

Updated by Jeff Layton about 3 years ago

Yeah, looking at the MDS logs from the above run, I don't see any occurrences of the word "recall" in there, and at least some of the dout(7) messages in Server::recall_client_state should have fired. I think that this test just didn't trigger any recalls. Have the MDS's default limits changed?

Updated by Patrick Donnelly about 3 years ago

Jeff Layton wrote:

> Yeah, looking at the MDS logs from the above run, I don't see any occurrences of the word "recall" in there, and at least some of the dout(7) messages in Server::recall_client_state should have fired. I think that this test just didn't trigger any recalls. Have the MDS's default limits changed?

Ah, yes that is probably it. I think it's caused by 63392e1b65fbead6ef8c7acd6a70e6ef5b322390 and the new mds_min_caps_working_set option.

Updated by Jeff Layton about 3 years ago

I'm not sure that setting is enough to explain this. AFAICT, that setting is only consulted in notify_health(), so I think that should just affect health warnings.

This test was written when the logic to trigger cap recall was pretty simple. Once you hit ~1k caps outstanding, the MDS would ask the client to shrink its caps. This has evidently changed recently, but the test was not updated to take that into account.

Basically we want this test to find new inodes up until the point where we know that the MDS will start recalling them. What's the right way to do that now?
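
To make the failure mode above concrete, here is a toy model of the test's structure. It is not the actual libcephfs test: `run_recall_test`, `recall_threshold`, and `max_inodes` are hypothetical names, and the "MDS" is reduced to a single threshold check. It only illustrates that if the recall threshold sits far above the number of inodes the test touches, the callback never fires and the `cb_done` assertion fails.

```python
def run_recall_test(recall_threshold, max_inodes=10000):
    """Toy model: look up new inodes until the release callback fires.

    recall_threshold: caps a client may hold before the modelled MDS
                      recalls state (hypothetical stand-in for MDS limits).
    max_inodes:       how many inodes the test is willing to look up
                      (hypothetical; not taken from the real test).
    Returns (cb_done, caps_held).
    """
    cb_done = False

    def ino_release_cb():
        # Stand-in for the client's inode release callback.
        nonlocal cb_done
        cb_done = True

    caps_held = 0
    for _ in range(max_inodes):
        caps_held += 1  # each newly found inode pins one cap
        if caps_held > recall_threshold:
            ino_release_cb()  # modelled recall reaches the client
        if cb_done:
            break
    return cb_done, caps_held


# Threshold near 1k: the callback fires before the loop runs out.
print(run_recall_test(recall_threshold=1000))
# Threshold of 1M: the loop ends first, so asserting cb_done would fail.
print(run_recall_test(recall_threshold=1_000_000))
```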

Updated by Jeff Layton about 3 years ago

Maybe we could lower mds_max_caps_per_client for this test? It defaults to 1M now, but we could take that down to 500 or so for this test (and then reset it when we're done)?
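
A minimal sketch of the "lower it for this test, then reset it" pattern, assuming a plain dict as a stand-in for wherever the option actually lives. `temporary_option` is a hypothetical helper, not an existing qa-suite function; a real test would go through the cluster's own configuration interface.

```python
from contextlib import contextmanager

_MISSING = object()


@contextmanager
def temporary_option(config, name, value):
    """Override config[name] inside the with-block, then restore it.

    `config` is a plain dict here (an assumption for illustration).
    The previous value is restored even if the block raises.
    """
    previous = config.get(name, _MISSING)
    config[name] = value
    try:
        yield config
    finally:
        if previous is _MISSING:
            config.pop(name, None)  # option was unset before; unset it again
        else:
            config[name] = previous


# Usage: drop the cap limit to 500 only while the test body runs.
settings = {"mds_max_caps_per_client": 1_000_000}
with temporary_option(settings, "mds_max_caps_per_client", 500):
    pass  # the recall test would run here
```

Restoring in a `finally` clause keeps a failed assertion inside the block from leaking the lowered limit into later tests.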

Updated by Patrick Donnelly about 3 years ago

Jeff Layton wrote:

> Maybe we could lower mds_max_caps_per_client for this test? It defaults to 1M now, but we could take that down to 500 or so for this test (and then reset it when we're done)?

Looking at this test more closely... why it ever worked is unclear to me. The MDS does not normally drive recall for a client reaching 1k caps. What is supposed to trigger the call to release_cb?

We can reduce `mds_max_caps_per_client` but I'd like to understand what's supposed to be tested. Just that the callback works?

Updated by Jeff Layton about 3 years ago

Patrick Donnelly wrote:

> Jeff Layton wrote:
>
> > Maybe we could lower mds_max_caps_per_client for this test? It defaults to 1M now, but we could take that down to 500 or so for this test (and then reset it when we're done)?
>
> Looking at this test more closely... why it ever worked is unclear to me. The MDS does not normally drive recall for a client reaching 1k caps. What is supposed to trigger the call to release_cb?

It used to do that, IIRC, but it was based on some rather fluid limits.

> We can reduce `mds_max_caps_per_client` but I'd like to understand what's supposed to be tested. Just that the callback works?

Yes, just that the callback is called when inodes are being recalled (à la CEPH_SESSION_RECALL_STATE).

Updated by Patrick Donnelly about 3 years ago

Status changed from New to Fix Under Review
Assignee changed from Jeff Layton to Patrick Donnelly
Pull request ID set to 40418

Updated by Patrick Donnelly about 3 years ago

Status changed from Fix Under Review to Pending Backport
Component(FS) Client added

Updated by Backport Bot about 3 years ago

Copied to Backport #50188: octopus: qa: "Assertion `cb_done' failed." added

Updated by Backport Bot about 3 years ago

Copied to Backport #50189: nautilus: qa: "Assertion `cb_done' failed." added

Updated by Backport Bot about 3 years ago

Copied to Backport #50190: pacific: qa: "Assertion `cb_done' failed." added

Updated by Loïc Dachary over 2 years ago

Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
