Bug #10449
MDS crashing in "C_SM_Save::finish" due to OSD response "-90 ((90) Message too long)"
Description
The relevant part of the log is the following:
2014-12-22T23:07:47+01:00 bagheera ceph-mds: -10> 2014-12-22 23:07:46.661649 7faf11689700 1 -- 192.168.1.254:6800/13238 --> 192.168.1.77:0/4088967648 -- client_reply(???:3964944 = 0 (0) Success safe) v1 -- ?+0 0x7faee2299a00 con 0x7faeff041f00
2014-12-22T23:07:47+01:00 bagheera ceph-mds: -9> 2014-12-22 23:07:46.662300 7faf11689700 1 -- 192.168.1.254:6800/13238 <== client.1232254 192.168.1.84:0/2255178081 129759 ==== client_request(client.1232254:1969711 setfilelockrule 1, type 1, owner 5109, pid 18446612150328341504, start 0, length 0, wait 1 #100001a2519) ==== 114+0+0 (3915353074 0 0) 0x7faef784d000 con 0x7faeff043400
2014-12-22T23:07:47+01:00 bagheera ceph-mds: -8> 2014-12-22 23:07:46.662333 7faf11689700 4 mds.0.server handle_client_request client_request(client.1232254:1969711 setfilelockrule 1, type 1, owner 5109, pid 18446612150328341504, start 0, length 0, wait 1 #100001a2519)
2014-12-22T23:07:47+01:00 bagheera ceph-mds: -7> 2014-12-22 23:07:46.662376 7faf11689700 0 mds.0.server handle_client_file_setlock: start: 0, length: 0, client: 1232254, owner: 5109, pid: 18446612150328341504, type: 1
2014-12-22T23:07:47+01:00 bagheera ceph-mds: -6> 2014-12-22 23:07:46.662395 7faf11689700 1 -- 192.168.1.254:6800/13238 --> 192.168.1.84:0/2255178081 -- client_reply(???:1969711 = 0 (0) Success safe) v1 -- ?+0 0x7faee229ac00 con 0x7faeff043400
2014-12-22T23:07:47+01:00 bagheera ceph-mds: -5> 2014-12-22 23:07:46.662453 7faf11689700 1 -- 192.168.1.254:6800/13238 <== client.1233454 192.168.1.76:0/458678446 588400 ==== client_request(client.1233454:2243876 setfilelockrule 1, type 1, owner 28921, pid 18446612160231705344, start 0, length 0, wait 1 #10000187902) ==== 114+0+0 (330691582 0 0) 0x7faed5c49280 con 0x7faeff042980
2014-12-22T23:07:47+01:00 bagheera ceph-mds: -4> 2014-12-22 23:07:46.662468 7faf11689700 4 mds.0.server handle_client_request client_request(client.1233454:2243876 setfilelockrule 1, type 1, owner 28921, pid 18446612160231705344, start 0, length 0, wait 1 #10000187902)
2014-12-22T23:07:47+01:00 bagheera ceph-mds: -3> 2014-12-22 23:07:46.662489 7faf11689700 0 mds.0.server handle_client_file_setlock: start: 0, length: 0, client: 1233454, owner: 28921, pid: 18446612160231705344, type: 1
2014-12-22T23:07:47+01:00 bagheera ceph-mds: -2> 2014-12-22 23:07:46.662502 7faf11689700 1 -- 192.168.1.254:6800/13238 --> 192.168.1.76:0/458678446 -- client_reply(???:2243876 = 0 (0) Success safe) v1 -- ?+0 0x7faee229be00 con 0x7faeff042980
2014-12-22T23:07:47+01:00 bagheera ceph-mds: -1> 2014-12-22 23:07:46.665039 7faf11689700 1 -- 192.168.1.254:6800/13238 <== osd.40 192.168.1.63:6800/3603 365 ==== osd_op_reply(57603 mds0_sessionmap [writefull 0~99908250] v0'0 uv0 ondisk = -90 ((90) Message too long)) v6 ==== 182+0+0 (366616621 0 0) 0x7faf0c564900 con 0x7faedfab9380
2014-12-22T23:07:47+01:00 bagheera ceph-mds: 0> 2014-12-22 23:07:46.725879 7faf11689700 -1 mds/SessionMap.cc: In function 'virtual void C_SM_Save::finish(int)' thread 7faf11689700 time 2014-12-22 23:07:46.724191
mds/SessionMap.cc: 106: FAILED assert(r == 0)
ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)
1: /usr/bin/ceph-mds() [0x7cacb5]
2: (Context::complete(int)+0x9) [0x5723d5]
3: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xecf) [0x7f3655]
4: (MDS::handle_core_message(Message*)+0xacc) [0x593d24]
5: (MDS::_dispatch(Message*)+0x45) [0x593f19]
6: (MDS::ms_dispatch(Message*)+0xa9) [0x5958ab]
7: (DispatchQueue::entry()+0x52e) [0x9d1984]
8: (DispatchQueue::DispatchThread::entry()+0xd) [0x8e74ad]
9: (()+0x8ed3) [0x7faf16ddfed3]
10: (clone()+0x6d) [0x7faf1575de2d]
On the osd log I see the following:
2014-12-18T20:59:54+01:00 kaa-63 ceph-osd: 2014-12-18 20:59:54.054029 7f84549f6700 -1 osd.40 32266 handle_op msg data len 94682218 > osd_max_write_size 94371840 on osd_op(mds.0.30:1575838 mds0_sessionmap [writefull 0~94682218] 1.3270c60b ondisk+write e32266) v4
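For reference, the 94371840-byte limit the OSD prints corresponds to the default osd_max_write_size of 90 MB, and both oversized session-table writes in the logs (94682218 bytes on the OSD, 99908250 bytes in the MDS backtrace) are only slightly past it. A quick arithmetic check, plain Python and purely illustrative:

limit = 90 * 1024 * 1024            # 94371840 bytes, as printed by the OSD
rejected = [94682218, 99908250]     # writefull sizes from the OSD and MDS logs
for size in rejected:
    print(size, "exceeds the limit by", size - limit, "bytes")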
My setup includes 1 MDS, 2 MONs and 68 OSDs, and I am using the kernel cephfs client. osd.40 uses XFS as its underlying filesystem. I could reproduce the bug with both v0.80.4 and v0.87 using the following Python script (started 16 times on one client; reproducing took roughly 1 day):
#!/usr/bin/env python3
import fcntl, os, argparse

parser = argparse.ArgumentParser()
parser.add_argument("filename", nargs=1)
args = parser.parse_args()

with open(args.filename[0], "w+") as f:
    try:
        fcntl.lockf(f, fcntl.LOCK_SH)
        while True:
            fcntl.lockf(f, fcntl.LOCK_EX)
            f.seek(0, os.SEEK_SET)
            f.write("asdf")
            f.seek(0, os.SEEK_END)
            f.write("test")
            fcntl.lockf(f, fcntl.LOCK_SH)
    finally:
        fcntl.lockf(f, fcntl.LOCK_UN)
This roughly represents the kind of active workload on the clients (except that they do bigger writes).
I hope this bug can be fixed without further testing on my side, since, due to another bug, each MDS crash causes some of our client machines to crash with a "NULL pointer dereference" in send_mds_reconnect (I am opening another ticket for this).
Related issues
History
#1 Updated by John Spray over 8 years ago
- Status changed from New to 12
Funnily enough, I was just noticing this in the code while working on #9883 -- all the MDS table persistence code uses single-operation reads and writes for objects of unbounded size.
We need something to break up the reads and writes into chunks, while retaining atomicity. We might need something like a double buffering scheme where we write out new sessionmaps into a fresh object in multiple ops, and then finally mark that object as complete and remove the previous object.
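To make that concrete, here is a minimal sketch of the chunked, double-buffered ordering, assuming a toy in-memory dict in place of RADOS and made-up object and marker names; it is not the actual MDS code:

CHUNK = 4 * 1024 * 1024  # bounded write size, well under osd_max_write_size

def save_sessionmap(store, data, obj="mds0_sessionmap"):
    staging = obj + ".new"
    store[staging] = b""
    for off in range(0, len(data), CHUNK):
        # several bounded writes instead of one huge writefull
        store[staging] += data[off:off + CHUNK]
    store[staging + ".complete"] = True   # staging copy is now complete and usable
    store.pop(obj, None)                  # remove the previous object only after that
    store[obj] = store.pop(staging)
    store.pop(staging + ".complete", None)

# toy usage: a 100 MB table goes down in 25 bounded writes
store = {}
save_sessionmap(store, b"\0" * (100 * 1024 * 1024))

A reader would prefer the primary object and fall back to a staging object whose completion marker is set, which is what keeps the multi-op update atomic from the reader's point of view.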
It's interesting that the fcntl workload is triggering this; apparently the completed_requests field of session_info_t is getting big under this load, which seems like a separate issue from the failure to write the resulting large sessionmap.
I think we might get quite a nice automated test out of this.
#2 Updated by Greg Farnum over 8 years ago
- Priority changed from Normal to High
If this is two separate bugs I'd like to figure out if we can prevent the session_info_t from growing so large pretty quickly — I can't imagine what it takes to get a 94MB session table out of one client's request info, and there's a good chance that's easier to fix than the single-write updates.
#3 Updated by Zheng Yan over 8 years ago
I think there was a stuck request, which prevented mds from trimming completed_requests
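As a rough model of why that matters (made-up numbers and a simplified trimming rule, not the real MDS logic): completed request tids can only be trimmed up to the client's oldest in-flight request, so a single request that never completes pins the list and the sessionmap grows with every fcntl round trip:

completed_requests = []   # tids of finished requests the MDS remembers per session
oldest_pending_tid = 5    # the stuck request: this tid never completes

def on_request_finished(tid):
    completed_requests.append(tid)
    # trimming is bounded by the oldest in-flight request; with tid 5 stuck,
    # the comparison below is never true, so nothing is ever removed
    while completed_requests and completed_requests[0] < oldest_pending_tid:
        completed_requests.pop(0)

for tid in range(6, 2_000_000):   # the fcntl loop keeps issuing new requests
    on_request_finished(tid)
print(len(completed_requests))    # nearly 2 million entries -> an oversized sessionmap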
#4 Updated by Markus Blank-Burian over 8 years ago
Regarding the stuck requests... I have definitely seen one or two stuck MDS requests, but at the time I thought they were caused by the crash rather than causing it. I am currently using kernel 3.14.26, so is there perhaps already a patch available I could try?
#5 Updated by Zheng Yan over 8 years ago
There are some file lock related fixes after 3.14. Did you see something like the following in ceph.log?
slow request 121.322570 seconds old, received at 2014-11-07 23:36:39.832336: client_request(client.7071:115 getattr pAsLsXsFs #1000002bdbe 2014-11-07 23:36:39.000000) currently failed to rdlock, waiting
#6 Updated by Markus Blank-Burian over 8 years ago
- File ceph-mds-bagheera.log-20141214.gz added
- File ceph-mds-bagheera.log-20141229.gz added
- File ceph-mon-bagheera.log-20141214.gz added
- File ceph-mon-bagheera.log-20141229.gz added
Out of 8 crashes, there was only one incident with slow requests in the logs. I added the mds/mon logs for further inspection.
#7 Updated by Zheng Yan over 8 years ago
- File 0001-ceph-fix-kick_requests.patch View added
- File 0002-ceph-protect-kick_requests-with-mdsc-mutex.patch View added
- File 0003-ceph-make-sure-request-isn-t-in-any-waiting-list-whe.patch View added
Looks like a kernel bug; please try the attached kernel patches.
#8 Updated by Markus Blank-Burian over 8 years ago
OK, I now have a test running with all three kernel patches plus the one mentioned in #10450.
#9 Updated by Markus Blank-Burian over 8 years ago
Up until now, I could not reproduce the bug with all the patches applied. But I stumbled across bug #5429, which is marked as resolved with an rbd fix, yet I could also trigger it by restarting all our OSDs simultaneously, using only cephfs and no rbd.
#10 Updated by Markus Blank-Burian over 8 years ago
I had our cluster running under full load for some days and the bug still did not appear.
#11 Updated by Greg Farnum over 8 years ago
- Status changed from 12 to 7
- Assignee set to Zheng Yan
Is this in the kernel tree, and is there something we should do on the server side as well?
#12 Updated by Zheng Yan over 8 years ago
It's a kernel bug: a hung request prevents the MDS from trimming completed_requests in the sessionmap. There is nothing to do on the MDS side. (Maybe we should add some code to the MDS to show a warning when this happens.)
#13 Updated by John Spray over 8 years ago
Created #10649 to track the need to overcome the session table size limit, to avoid overloading this ticket.
I agree that we should have an MDS health warning for stuck requests too
#14 Updated by Greg Farnum over 8 years ago
- Status changed from 7 to Resolved
#15 Updated by Greg Farnum about 7 years ago
- Component(FS) MDS added