Bug #14807
MDS crashes repeatedly after upgrade to Infernalis from Hammer
Status: Closed
Description
I have a small cluster (1xMON 1xMDS 2xOSD). It has been very stable for the last year.
However, after the upgrade to Infernalis I am experiencing trouble. When the system is quiescent I can mount the CephFS volume and it works fine. However, as I add more clients, some of them appear to hit a file that causes the MDS to crash:
-16> 2016-02-18 15:38:57.842565 7f7e1abe4700 4 mds.0.server handle_client_request client_request(client.10893215:39751 getattr pAsLsXsFs #1000003af9f RETRY=19)
-15> 2016-02-18 15:38:57.842603 7f7e1abe4700 5 -- op tracker -- seq: 1, time: 2016-02-18 15:38:57.663535, event: throttled, op: client_request(client.10893215:39751 getattr pAsLsXsFs #1000003af9f RETRY=19)
-14> 2016-02-18 15:38:57.842639 7f7e1abe4700 5 -- op tracker -- seq: 1, time: 2016-02-18 15:38:57.663570, event: all_read, op: client_request(client.10893215:39751 getattr pAsLsXsFs #1000003af9f RETRY=19)
-13> 2016-02-18 15:38:57.842679 7f7e1abe4700 5 -- op tracker -- seq: 1, time: 2016-02-18 15:38:57.840156, event: dispatched, op: client_request(client.10893215:39751 getattr pAsLsXsFs #1000003af9f RETRY=19)
-12> 2016-02-18 15:38:57.842751 7f7e1abe4700 5 -- op tracker -- seq: 1, time: 2016-02-18 15:38:57.842751, event: acquired locks, op: client_request(client.10893215:39751 getattr pAsLsXsFs #1000003af9f RETRY=19)
-11> 2016-02-18 15:38:57.842810 7f7e1abe4700 5 -- op tracker -- seq: 1, time: 2016-02-18 15:38:57.842810, event: replying, op: client_request(client.10893215:39751 getattr pAsLsXsFs #1000003af9f RETRY=19)
-10> 2016-02-18 15:38:57.842852 7f7e1abe4700 1 -- 10.245.22.92:6802/7018 --> 10.245.22.86:0/3269742904 -- client_reply(???:39751 = 0 (0) Success) v1 -- ?+0 0x7f7e4ca5f8c0 con 0x7f7e29689080
-9> 2016-02-18 15:38:57.842951 7f7e1abe4700 5 -- op tracker -- seq: 1, time: 2016-02-18 15:38:57.842951, event: finishing request, op: client_request(client.10893215:39751 getattr pAsLsXsFs #1000003af9f RETRY=19)
-8> 2016-02-18 15:38:57.843025 7f7e1abe4700 5 -- op tracker -- seq: 1, time: 2016-02-18 15:38:57.843025, event: cleaned up request, op: client_request(client.10893215:39751 getattr pAsLsXsFs #1000003af9f RETRY=19)
-7> 2016-02-18 15:38:57.843053 7f7e1abe4700 5 -- op tracker -- seq: 1, time: 2016-02-18 15:38:57.843053, event: done, op: client_request(client.10893215:39751 getattr pAsLsXsFs #1000003af9f RETRY=19)
-6> 2016-02-18 15:38:57.843074 7f7e1abe4700 4 mds.0.server handle_client_request client_request(client.10893209:42138 create #100007be287/iam-3_univ_sqr_5_xs.jpg.staged RETRY=22)
-5> 2016-02-18 15:38:57.843090 7f7e1abe4700 5 -- op tracker -- seq: 2, time: 2016-02-18 15:38:57.665733, event: throttled, op: client_request(client.10893209:42138 create #100007be287/iam-3_univ_sqr_5_xs.jpg.staged RETRY=22)
-4> 2016-02-18 15:38:57.843104 7f7e1abe4700 5 -- op tracker -- seq: 2, time: 2016-02-18 15:38:57.665742, event: all_read, op: client_request(client.10893209:42138 create #100007be287/iam-3_univ_sqr_5_xs.jpg.staged RETRY=22)
-3> 2016-02-18 15:38:57.843117 7f7e1abe4700 5 -- op tracker -- seq: 2, time: 2016-02-18 15:38:57.840181, event: dispatched, op: client_request(client.10893209:42138 create #100007be287/iam-3_univ_sqr_5_xs.jpg.staged RETRY=22)
-2> 2016-02-18 15:38:57.843240 7f7e1abe4700 5 -- op tracker -- seq: 2, time: 2016-02-18 15:38:57.843240, event: acquired locks, op: client_request(client.10893209:42138 create #100007be287/iam-3_univ_sqr_5_xs.jpg.staged RETRY=22)
-1> 2016-02-18 15:38:57.843360 7f7e188dd700 1 -- 10.245.22.92:6802/7018 <== osd.2 10.245.22.111:6800/12250 54567 ==== osd_op_reply(135460 600.00000000 [omap-get-header 0~0,omap-get-vals 0~16,getxattr (62)] v0'0 uv3192151 ondisk = 0) v6 ==== 263+0+292 (2860936119 0 2592524424) 0x7f7e2970eb00 con 0x7f7e296886e0
0> 2016-02-18 15:38:57.846653 7f7e1abe4700 -1 mds/MDCache.cc: In function 'void MDCache::add_inode(CInode*)' thread 7f7e1abe4700 time 2016-02-18 15:38:57.843282
mds/MDCache.cc: 269: FAILED assert(inode_map.count(in->vino()) == 0)
ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7f7e24da2d2b]
2: (()+0x29bae6) [0x7f7e24ab3ae6]
3: (Server::prepare_new_inode(std::shared_ptr<MDRequestImpl>&, CDir*, inodeno_t, unsigned int, ceph_file_layout*)+0xf18) [0x7f7e24a56768]
4: (Server::handle_client_openc(std::shared_ptr<MDRequestImpl>&)+0xd5a) [0x7f7e24a5a37a]
5: (Server::dispatch_client_request(std::shared_ptr<MDRequestImpl>&)+0xabc) [0x7f7e24a78a5c]
6: (Server::handle_client_request(MClientRequest*)+0x47f) [0x7f7e24a78f6f]
7: (Server::dispatch(Message*)+0x3ab) [0x7f7e24a7d16b]
8: (MDSRank::handle_deferrable_message(Message*)+0x7fc) [0x7f7e24a07b2c]
9: (MDSRank::_dispatch(Message*, bool)+0x1da) [0x7f7e24a121ba]
10: (MDSRank::retry_dispatch(Message*)+0x12) [0x7f7e24a132f2]
11: (MDSInternalContextBase::complete(int)+0x1d3) [0x7f7e24c24fd3]
12: (MDSRank::_advance_queues()+0x372) [0x7f7e24a119f2]
13: (MDSRank::ProgressThread::entry()+0x4a) [0x7f7e24a11e6a]
14: (()+0x8182) [0x7f7e24172182]
15: (clone()+0x6d) [0x7f7e22ae647d]
Updated by Greg Farnum about 8 years ago
Do you have the MDS log from when this first started happening, and can you please upload it? (ceph-post-file will let you upload arbitrarily-large files, and keep them private to ceph developers.)
Did you have any issues prior to updating to Infernalis?
Updated by Christopher Nelson about 8 years ago
I was not aware of any issues prior to the upgrade. I am posting the files now, and I'll let you know the tag when it finishes.
Updated by Christopher Nelson about 8 years ago
- File ceph-mds.usmeps024.log.4.gz added
- File ceph-mds.usmeps024.log.5.gz added
- File ceph-mds.usmeps024.log.6.gz added
- File ceph-mds.usmeps024.log.7.gz added
Apparently my institution blocks outbound scp, so I had to post the files here. Sorry for the delay.
Updated by Christopher Nelson about 8 years ago
It turns out the main files are too large. Is there any other way I can upload them?
Updated by Zheng Yan about 8 years ago
Christopher Nelson wrote:
It turns out the main files are too large. Is there any other way I can upload them?
You can upload it to Google Drive, then share it.
Updated by Greg Farnum about 8 years ago
- Status changed from New to Need More Info
Updated by Greg Farnum about 8 years ago
- Priority changed from High to Low
Haven't seen this elsewhere, and no logs.
Updated by Patrick Donnelly over 4 years ago
- Status changed from Need More Info to Can't reproduce