Bug #14807 (closed)

MDS crashes repeatedly after upgrade to Infernalis from Hammer

Added by Christopher Nelson about 8 years ago. Updated over 4 years ago.

Status: Can't reproduce
Priority: Low
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite: fs
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I have a small cluster (1xMON 1xMDS 2xOSD). It has been very stable for the last year.

After the upgrade to Infernalis, however, I am experiencing trouble. When the system is quiescent I can mount the CephFS volume and it works fine, but as I add more clients some of them appear to hit a file that causes the MDS to crash:

   -16> 2016-02-18 15:38:57.842565 7f7e1abe4700  4 mds.0.server handle_client_request client_request(client.10893215:39751 getattr pAsLsXsFs #1000003af9f RETRY=19)
   -15> 2016-02-18 15:38:57.842603 7f7e1abe4700  5 -- op tracker -- seq: 1, time: 2016-02-18 15:38:57.663535, event: throttled, op: client_request(client.10893215:39751 getattr pAsLsXsFs #1000003af9f RETRY=19)
   -14> 2016-02-18 15:38:57.842639 7f7e1abe4700  5 -- op tracker -- seq: 1, time: 2016-02-18 15:38:57.663570, event: all_read, op: client_request(client.10893215:39751 getattr pAsLsXsFs #1000003af9f RETRY=19)
   -13> 2016-02-18 15:38:57.842679 7f7e1abe4700  5 -- op tracker -- seq: 1, time: 2016-02-18 15:38:57.840156, event: dispatched, op: client_request(client.10893215:39751 getattr pAsLsXsFs #1000003af9f RETRY=19)
   -12> 2016-02-18 15:38:57.842751 7f7e1abe4700  5 -- op tracker -- seq: 1, time: 2016-02-18 15:38:57.842751, event: acquired locks, op: client_request(client.10893215:39751 getattr pAsLsXsFs #1000003af9f RETRY=19)
   -11> 2016-02-18 15:38:57.842810 7f7e1abe4700  5 -- op tracker -- seq: 1, time: 2016-02-18 15:38:57.842810, event: replying, op: client_request(client.10893215:39751 getattr pAsLsXsFs #1000003af9f RETRY=19)
   -10> 2016-02-18 15:38:57.842852 7f7e1abe4700  1 -- 10.245.22.92:6802/7018 --> 10.245.22.86:0/3269742904 -- client_reply(???:39751 = 0 (0) Success) v1 -- ?+0 0x7f7e4ca5f8c0 con 0x7f7e29689080
    -9> 2016-02-18 15:38:57.842951 7f7e1abe4700  5 -- op tracker -- seq: 1, time: 2016-02-18 15:38:57.842951, event: finishing request, op: client_request(client.10893215:39751 getattr pAsLsXsFs #1000003af9f RETRY=19)
    -8> 2016-02-18 15:38:57.843025 7f7e1abe4700  5 -- op tracker -- seq: 1, time: 2016-02-18 15:38:57.843025, event: cleaned up request, op: client_request(client.10893215:39751 getattr pAsLsXsFs #1000003af9f RETRY=19)
    -7> 2016-02-18 15:38:57.843053 7f7e1abe4700  5 -- op tracker -- seq: 1, time: 2016-02-18 15:38:57.843053, event: done, op: client_request(client.10893215:39751 getattr pAsLsXsFs #1000003af9f RETRY=19)
    -6> 2016-02-18 15:38:57.843074 7f7e1abe4700  4 mds.0.server handle_client_request client_request(client.10893209:42138 create #100007be287/iam-3_univ_sqr_5_xs.jpg.staged RETRY=22)
    -5> 2016-02-18 15:38:57.843090 7f7e1abe4700  5 -- op tracker -- seq: 2, time: 2016-02-18 15:38:57.665733, event: throttled, op: client_request(client.10893209:42138 create #100007be287/iam-3_univ_sqr_5_xs.jpg.staged RETRY=22)
    -4> 2016-02-18 15:38:57.843104 7f7e1abe4700  5 -- op tracker -- seq: 2, time: 2016-02-18 15:38:57.665742, event: all_read, op: client_request(client.10893209:42138 create #100007be287/iam-3_univ_sqr_5_xs.jpg.staged RETRY=22)
    -3> 2016-02-18 15:38:57.843117 7f7e1abe4700  5 -- op tracker -- seq: 2, time: 2016-02-18 15:38:57.840181, event: dispatched, op: client_request(client.10893209:42138 create #100007be287/iam-3_univ_sqr_5_xs.jpg.staged RETRY=22)
    -2> 2016-02-18 15:38:57.843240 7f7e1abe4700  5 -- op tracker -- seq: 2, time: 2016-02-18 15:38:57.843240, event: acquired locks, op: client_request(client.10893209:42138 create #100007be287/iam-3_univ_sqr_5_xs.jpg.staged RETRY=22)
    -1> 2016-02-18 15:38:57.843360 7f7e188dd700  1 -- 10.245.22.92:6802/7018 <== osd.2 10.245.22.111:6800/12250 54567 ==== osd_op_reply(135460 600.00000000 [omap-get-header 0~0,omap-get-vals 0~16,getxattr (62)] v0'0 uv3192151 ondisk = 0) v6 ==== 263+0+292 (2860936119 0 2592524424) 0x7f7e2970eb00 con 0x7f7e296886e0
     0> 2016-02-18 15:38:57.846653 7f7e1abe4700 -1 mds/MDCache.cc: In function 'void MDCache::add_inode(CInode*)' thread 7f7e1abe4700 time 2016-02-18 15:38:57.843282
mds/MDCache.cc: 269: FAILED assert(inode_map.count(in->vino()) == 0)

 ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7f7e24da2d2b]
 2: (()+0x29bae6) [0x7f7e24ab3ae6]
 3: (Server::prepare_new_inode(std::shared_ptr<MDRequestImpl>&, CDir*, inodeno_t, unsigned int, ceph_file_layout*)+0xf18) [0x7f7e24a56768]
 4: (Server::handle_client_openc(std::shared_ptr<MDRequestImpl>&)+0xd5a) [0x7f7e24a5a37a]
 5: (Server::dispatch_client_request(std::shared_ptr<MDRequestImpl>&)+0xabc) [0x7f7e24a78a5c]
 6: (Server::handle_client_request(MClientRequest*)+0x47f) [0x7f7e24a78f6f]
 7: (Server::dispatch(Message*)+0x3ab) [0x7f7e24a7d16b]
 8: (MDSRank::handle_deferrable_message(Message*)+0x7fc) [0x7f7e24a07b2c]
 9: (MDSRank::_dispatch(Message*, bool)+0x1da) [0x7f7e24a121ba]
 10: (MDSRank::retry_dispatch(Message*)+0x12) [0x7f7e24a132f2]
 11: (MDSInternalContextBase::complete(int)+0x1d3) [0x7f7e24c24fd3]
 12: (MDSRank::_advance_queues()+0x372) [0x7f7e24a119f2]
 13: (MDSRank::ProgressThread::entry()+0x4a) [0x7f7e24a11e6a]
 14: (()+0x8182) [0x7f7e24172182]
 15: (clone()+0x6d) [0x7f7e22ae647d]
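
For context, the assertion that fires is the duplicate-inode guard in MDCache::add_inode: the MDS keeps its cached inodes in a map keyed by (inode number, snapid) and refuses to insert an entry whose key is already present. Given the backtrace (handle_client_openc -> prepare_new_inode -> add_inode), this suggests the inode number allocated for the new file collides with one already in the cache. A minimal standalone sketch of that invariant follows; the types and names are simplified stand-ins, not the actual Ceph classes:

// Simplified, illustrative sketch of the invariant behind
// "FAILED assert(inode_map.count(in->vino()) == 0)"; these are
// stand-in types, not the real Ceph CInode/MDCache classes.
#include <cassert>
#include <cstdint>
#include <iostream>
#include <map>

struct vinodeno_t {                       // inode number + snapshot id
  uint64_t ino;
  uint64_t snapid;
  bool operator<(const vinodeno_t &o) const {
    return ino != o.ino ? ino < o.ino : snapid < o.snapid;
  }
};

struct FakeInode {                        // stand-in for CInode
  vinodeno_t vino;
};

struct FakeCache {                        // stand-in for MDCache
  std::map<vinodeno_t, FakeInode *> inode_map;

  void add_inode(FakeInode *in) {
    // The cache insists that the inode being added is not already present.
    // Tripping this during handle_client_openc means the inode number just
    // allocated for the new file is already held by a cached inode.
    assert(inode_map.count(in->vino) == 0);
    inode_map[in->vino] = in;
  }
};

int main() {
  FakeCache cache;
  FakeInode existing{{0x1000003af9f, 0}};   // same ino as in the log above
  cache.add_inode(&existing);

  FakeInode duplicate{{0x1000003af9f, 0}};
  std::cout << "would collide: "
            << (cache.inode_map.count(duplicate.vino) != 0) << "\n";
  // cache.add_inode(&duplicate);  // this call would trip the assert
  return 0;
}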


Files

ceph-mds.usmeps024.log.4.gz (14.5 KB) ceph-mds.usmeps024.log.4.gz Christopher Nelson, 02/23/2016 02:35 PM
ceph-mds.usmeps024.log.5.gz (302 Bytes) ceph-mds.usmeps024.log.5.gz Christopher Nelson, 02/23/2016 02:35 PM
ceph-mds.usmeps024.log.6.gz (238 Bytes) ceph-mds.usmeps024.log.6.gz Christopher Nelson, 02/23/2016 02:35 PM
ceph-mds.usmeps024.log.7.gz (287 Bytes) ceph-mds.usmeps024.log.7.gz Christopher Nelson, 02/23/2016 02:35 PM
Actions #1

Updated by Greg Farnum about 8 years ago

Do you have the MDS log from when this first started happening, and can you please upload it? (ceph-post-file will let you upload arbitrarily-large files, and keep them private to ceph developers.)

Did you have any issues prior to updating to Infernalis?
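
For reference, the basic invocation is just ceph-post-file plus the file to upload (the path below is illustrative); it prints a tag that can be pasted back into this ticket:

ceph-post-file /var/log/ceph/ceph-mds.usmeps024.log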

Actions #2

Updated by Christopher Nelson about 8 years ago

I was not aware of any issues prior to the upgrade. I am posting the files now, and I'll let you know the tag when it finishes.

Actions #4

Updated by Christopher Nelson about 8 years ago

It turns out the main files are too large. Is there any other way I can upload them?

Actions #5

Updated by Zheng Yan about 8 years ago

Christopher Nelson wrote:

It turns out the main files are too large. Is there any other way I can upload them?

You can upload it to Google Drive, then share it.

Actions #6

Updated by Greg Farnum about 8 years ago

  • Status changed from New to Need More Info
Actions #7

Updated by Greg Farnum about 8 years ago

  • Priority changed from High to Low

We haven't seen this elsewhere, and there are no logs to work from.

Actions #8

Updated by Patrick Donnelly over 4 years ago

  • Status changed from Need More Info to Can't reproduce