Project

General

Profile

Bug #10449

MDS crashing in "C_SM_Save::finish" due to OSD response "-90 ((90) Message too long"

Added by Markus Blank-Burian over 4 years ago. Updated about 3 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
Start date:
01/05/2015
Due date:
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:

Description

The relevant part of the log is the following:

2014-12-22T23:07:47+01:00 bagheera ceph-mds:    -10> 2014-12-22 23:07:46.661649 7faf11689700  1 -- 192.168.1.254:6800/13238 --> 192.168.1.77:0/4088967648 -- client_reply(???:3964944 = 0 (0) Success safe) v1 -- ?+0 0x7faee2299a00 con 0x7faeff041f00
2014-12-22T23:07:47+01:00 bagheera ceph-mds:     -9> 2014-12-22 23:07:46.662300 7faf11689700  1 -- 192.168.1.254:6800/13238 <== client.1232254 192.168.1.84:0/2255178081 129759 ==== client_request(client.1232254:1969711 setfilelockrule 1, type 1, owner 5109, pid 184466121
50328341504, start 0, length 0, wait 1 #100001a2519) ==== 114+0+0 (3915353074 0 0) 0x7faef784d000 con 0x7faeff043400
2014-12-22T23:07:47+01:00 bagheera ceph-mds:     -8> 2014-12-22 23:07:46.662333 7faf11689700  4 mds.0.server handle_client_request client_request(client.1232254:1969711 setfilelockrule 1, type 1, owner 5109, pid 18446612150328341504, start 0, length 0, wait 1 #100001a251
9)
2014-12-22T23:07:47+01:00 bagheera ceph-mds:     -7> 2014-12-22 23:07:46.662376 7faf11689700  0 mds.0.server handle_client_file_setlock: start: 0, length: 0, client: 1232254, owner: 5109, pid: 18446612150328341504, type: 1
2014-12-22T23:07:47+01:00 bagheera ceph-mds:     -6> 2014-12-22 23:07:46.662395 7faf11689700  1 -- 192.168.1.254:6800/13238 --> 192.168.1.84:0/2255178081 -- client_reply(???:1969711 = 0 (0) Success safe) v1 -- ?+0 0x7faee229ac00 con 0x7faeff043400
2014-12-22T23:07:47+01:00 bagheera ceph-mds:     -5> 2014-12-22 23:07:46.662453 7faf11689700  1 -- 192.168.1.254:6800/13238 <== client.1233454 192.168.1.76:0/458678446 588400 ==== client_request(client.1233454:2243876 setfilelockrule 1, type 1, owner 28921, pid 184466121
60231705344, start 0, length 0, wait 1 #10000187902) ==== 114+0+0 (330691582 0 0) 0x7faed5c49280 con 0x7faeff042980
2014-12-22T23:07:47+01:00 bagheera ceph-mds:     -4> 2014-12-22 23:07:46.662468 7faf11689700  4 mds.0.server handle_client_request client_request(client.1233454:2243876 setfilelockrule 1, type 1, owner 28921, pid 18446612160231705344, start 0, length 0, wait 1 #100001879
02)
2014-12-22T23:07:47+01:00 bagheera ceph-mds:     -3> 2014-12-22 23:07:46.662489 7faf11689700  0 mds.0.server handle_client_file_setlock: start: 0, length: 0, client: 1233454, owner: 28921, pid: 18446612160231705344, type: 1
2014-12-22T23:07:47+01:00 bagheera ceph-mds:     -2> 2014-12-22 23:07:46.662502 7faf11689700  1 -- 192.168.1.254:6800/13238 --> 192.168.1.76:0/458678446 -- client_reply(???:2243876 = 0 (0) Success safe) v1 -- ?+0 0x7faee229be00 con 0x7faeff042980
2014-12-22T23:07:47+01:00 bagheera ceph-mds:     -1> 2014-12-22 23:07:46.665039 7faf11689700  1 -- 192.168.1.254:6800/13238 <== osd.40 192.168.1.63:6800/3603 365 ==== osd_op_reply(57603 mds0_sessionmap [writefull 0~99908250] v0'0 uv0 ondisk = -90 ((90) Message too long))
 v6 ==== 182+0+0 (366616621 0 0) 0x7faf0c564900 con 0x7faedfab9380
2014-12-22T23:07:47+01:00 bagheera ceph-mds:      0> 2014-12-22 23:07:46.725879 7faf11689700 -1 mds/SessionMap.cc: In function 'virtual void C_SM_Save::finish(int)' thread 7faf11689700 time 2014-12-22 23:07:46.724191
mds/SessionMap.cc: 106: FAILED assert(r == 0)

 ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)
 1: /usr/bin/ceph-mds() [0x7cacb5]
 2: (Context::complete(int)+0x9) [0x5723d5]
 3: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xecf) [0x7f3655]
 4: (MDS::handle_core_message(Message*)+0xacc) [0x593d24]
 5: (MDS::_dispatch(Message*)+0x45) [0x593f19]
 6: (MDS::ms_dispatch(Message*)+0xa9) [0x5958ab]
 7: (DispatchQueue::entry()+0x52e) [0x9d1984]
 8: (DispatchQueue::DispatchThread::entry()+0xd) [0x8e74ad]
 9: (()+0x8ed3) [0x7faf16ddfed3]
 10: (clone()+0x6d) [0x7faf1575de2d]

On the osd log I see the following:

2014-12-18T20:59:54+01:00 kaa-63 ceph-osd: 2014-12-18 20:59:54.054029 7f84549f6700 -1 osd.40 32266 handle_op msg data len 94682218 > osd_max_write_size 94371840 on osd_op(mds.0.30:1575838 mds0_sessionmap [writefull 0~94682218] 1.3270c60b ondisk+write e32266) v4

My setup includes 1 mds, 2 mon, 68 osd and i am using the kernel cephfs client. osd.40 uses xfs as underlying filesystem. I could reproduce the bug with both v0.80.4 and v0.87 using the following python script (started 16 times on one client, took roughly 1 day):

#!/usr/bin/env python3
import fcntl, os, argparse

parser = argparse.ArgumentParser()
parser.add_argument("filename", nargs = 1)
args = parser.parse_args()

with open(args.filename[0], "w+") as f:
        try:
                fcntl.lockf(f, fcntl.LOCK_SH)
                while True:
                        fcntl.lockf(f, fcntl.LOCK_EX)
                        f.seek(0, os.SEEK_SET)
                        f.write("asdf")
                        f.seek(0, os.SEEK_END)
                        f.write("test")
                        fcntl.lockf(f, fcntl.LOCK_SH)
        finally:
                fcntl.lockf(f, fcntl.LOCK_UN)

This represents roughly the kind of active workload on the clients (except they to bigger writes).

I hope this bug is fixable without further testing, since due to another bug, each mds crash causes some of our client-machines to crash with a "NULL pointer dereference" in send_mds_reconnect (I am opening another ticket for this).

ceph-mds-bagheera.log-20141214.gz (795 KB) Markus Blank-Burian, 01/08/2015 02:07 AM

ceph-mds-bagheera.log-20141229.gz (730 KB) Markus Blank-Burian, 01/08/2015 02:07 AM

ceph-mon-bagheera.log-20141214.gz (210 KB) Markus Blank-Burian, 01/08/2015 02:07 AM

ceph-mon-bagheera.log-20141229.gz (793 KB) Markus Blank-Burian, 01/08/2015 02:07 AM

0001-ceph-fix-kick_requests.patch View (1.08 KB) Zheng Yan, 01/08/2015 04:39 AM

0002-ceph-protect-kick_requests-with-mdsc-mutex.patch View (1.19 KB) Zheng Yan, 01/08/2015 04:39 AM

0003-ceph-make-sure-request-isn-t-in-any-waiting-list-whe.patch View (885 Bytes) Zheng Yan, 01/08/2015 04:39 AM


Related issues

Related to fs - Feature #10649: Store MDS tables without size limit Resolved 01/27/2015

History

#1 Updated by John Spray over 4 years ago

  • Status changed from New to Verified

Funnily enough, I was just noticing this in the code while working on #9883 -- all the MDS table persistence code uses single-operation reads and writes for objects of unbounded size.

We need something to break up the reads and writes into chunks, while retaining atomicity. We might need something like a double buffering scheme where we write out new sessionmaps into a fresh object in multiple ops, and then finally mark that object as complete and remove the previous object.

It's interesting that the fcntl workload is triggering this, apparently the completed_requests field of session_info_t must be getting big under this load, which seems like a separate issue to the failure to write the resulting large sessionmap.

I think we might get quite a nice automated test out of this.

#2 Updated by Greg Farnum over 4 years ago

  • Priority changed from Normal to High

If this is two separate bugs I'd like to figure out if we can prevent the session_info_t from growing so large pretty quickly — I can't imagine what it takes to get a 94MB session table out of one client's request info, and there's a good chance that's easier to fix than the single-write updates.

#3 Updated by Zheng Yan over 4 years ago

I think there was a stuck request, which prevented mds from trimming completed_requests

#4 Updated by Markus Blank-Burian over 4 years ago

Regarding the stuck requests .. I have definitely seen one or two stuck mds requests. But I had then thought they were caused by the crash and not causing it. I am currently using kernel 3.14.26, so is there eventually already a patch available i could try?

#5 Updated by Zheng Yan over 4 years ago

there are some file lock related fixes after 3.14. did you see something like below in ceph.log

slow request 121.322570 seconds old, received at 2014-11-07 23:36:39.832336: client_request(client.7071:115 getattr pAsLsXsFs #1000002bdbe 2014-11-07 23:36:39.000000) currently failed to rdlock, waiting

#6 Updated by Markus Blank-Burian over 4 years ago

Out of 8 crashes, there was only one incident with slow requests in the logs. I added the mds/mon logs for further inspection.

#8 Updated by Markus Blank-Burian over 4 years ago

ok, i have now a test running with all three kernel patches plus the one mentioned in #10450.

#9 Updated by Markus Blank-Burian over 4 years ago

Up until now, I could not reproduce the bug with all the patches applied. But I stumpled across bug #5429 which is marked as resolved with a rbd fix, but I could trigger it also restarting all our OSDs simultaneously using only cephfs and no rbd.

#10 Updated by Markus Blank-Burian over 4 years ago

I had our cluster running under full load for some days and the bug still did not appear.

#11 Updated by Greg Farnum over 4 years ago

  • Status changed from Verified to Testing
  • Assignee set to Zheng Yan

Is this in the kernel tree, and is there something we should do on the server side as well?

#12 Updated by Zheng Yan over 4 years ago

it's a kernel bug. hang request prevents mds from trimming completed_requests in sessionmap. there is nothing to do with mds. (maybe we should add some code to MDS to show warning when this bug happens)

#13 Updated by John Spray over 4 years ago

Created #10649 to track the need to overcome sessiontable size limit, to avoid overloading this ticket.

I agree that we should have an MDS health warning for stuck requests too

#14 Updated by Greg Farnum over 4 years ago

  • Status changed from Testing to Resolved

#10657 is a feature ticket for the warning.

The libceph bug reference to #5429 is being tracked in #8087 and doesn't seem to have anything to do with the FS bits. So, closing this out! :)

#15 Updated by Greg Farnum about 3 years ago

  • Component(FS) MDS added

Also available in: Atom PDF