Bug #38490

open

mds: multimds stuck

Added by Patrick Donnelly about 5 years ago. Updated about 5 years ago.

Status: New
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Sorry for the vague $subject; I'm not sure what's wrong yet.

2019-02-26 18:57:12.716 7f6806098700  1 -- [v2:172.21.15.145:6836/1016281496,v1:172.21.15.145:6837/1016281496] --> [v2:172.21.15.36:6800/36393,v1:172.21.15.36:6801/36393] -- mgrreport(unknown.b +0-0 packed 1366) v7 -- 0x55810979c300 con 0x558109071000
2019-02-26 18:57:15.308 7f680b0a2700  1 -- [v2:172.21.15.145:6836/1016281496,v1:172.21.15.145:6837/1016281496] >> [v2:172.21.15.145:6838/3486520481,v1:172.21.15.145:6839/3486520481] conn(0x558108391000 msgr2=0x55810912a580 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 33
2019-02-26 18:57:15.308 7f680b0a2700  1 -- [v2:172.21.15.145:6836/1016281496,v1:172.21.15.145:6837/1016281496] >> [v2:172.21.15.145:6838/3486520481,v1:172.21.15.145:6839/3486520481] conn(0x558108391000 msgr2=0x55810912a580 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed
2019-02-26 18:57:15.308 7f680b0a2700  1 --2- [v2:172.21.15.145:6836/1016281496,v1:172.21.15.145:6837/1016281496] >> [v2:172.21.15.145:6838/3486520481,v1:172.21.15.145:6839/3486520481] conn(0x558108391000 0x55810912a580 crc :-1 s=READY pgs=11 cs=0 l=0 rx=0 tx=0).handle_read_frame_preamble_main read frame length and tag failed r=-1 ((1) Operation not permitted)
2019-02-26 18:57:15.308 7f680b0a2700  1 --2- [v2:172.21.15.145:6836/1016281496,v1:172.21.15.145:6837/1016281496] >> [v2:172.21.15.145:6838/3486520481,v1:172.21.15.145:6839/3486520481] conn(0x558108391000 0x55810912a580 unknown :-1 s=READY pgs=11 cs=0 l=0 rx=0 tx=0)._fault with nothing to send, going to standby
2019-02-26 18:57:15.358 7f680a0a0700  1 -- [v2:172.21.15.145:6836/1016281496,v1:172.21.15.145:6837/1016281496] >> [v2:172.21.15.145:6834/2671704713,v1:172.21.15.145:6835/2671704713] conn(0x558109170400 msgr2=0x558108f38580 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 38
2019-02-26 18:57:15.358 7f680a0a0700  1 -- [v2:172.21.15.145:6836/1016281496,v1:172.21.15.145:6837/1016281496] >> [v2:172.21.15.145:6834/2671704713,v1:172.21.15.145:6835/2671704713] conn(0x558109170400 msgr2=0x558108f38580 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed
2019-02-26 18:57:15.358 7f680a0a0700  1 --2- [v2:172.21.15.145:6836/1016281496,v1:172.21.15.145:6837/1016281496] >> [v2:172.21.15.145:6834/2671704713,v1:172.21.15.145:6835/2671704713] conn(0x558109170400 0x558108f38580 crc :-1 s=READY pgs=13 cs=0 l=0 rx=0 tx=0).handle_read_frame_preamble_main read frame length and tag failed r=-1 ((1) Operation not permitted)
2019-02-26 18:57:15.358 7f680a0a0700  1 --2- [v2:172.21.15.145:6836/1016281496,v1:172.21.15.145:6837/1016281496] >> [v2:172.21.15.145:6834/2671704713,v1:172.21.15.145:6835/2671704713] conn(0x558109170400 0x558108f38580 unknown :-1 s=READY pgs=13 cs=0 l=0 rx=0 tx=0)._fault with nothing to send, going to standby

From: /ceph/teuthology-archive/pdonnell-2019-02-26_07:49:50-multimds-wip-pdonnell-testing-20190226.051327-distro-basic-smithi/3641251/remote/smithi145/log/ceph-mds.b.log.gz

I suspect this may be some messenger2 issue.
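
As a rough aid for triaging logs like the excerpt above, the following Python sketch lists the peers whose connections faulted into standby. It is not from the original report: the regular expression and the function name `standby_peers` are assumptions based only on the lines quoted here, not on a documented Ceph log format.

```python
import re

# Assumed layout, inferred only from the quoted excerpt:
# timestamp, thread id, log level, "--" or "--2-", [local addrs] >> [peer addrs], rest.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<thread>[0-9a-f]+)\s+\d+\s+--(?:2-)?\s+"
    r"\[(?P<local>[^\]]+)\]\s+>>\s+\[(?P<peer>[^\]]+)\]\s+"
    r"(?P<rest>.*)$"
)


def standby_peers(lines):
    """Return (timestamp, peer_addrs) for each '_fault ... going to standby' line."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            # Not a connection line (for example an outgoing "-->" message).
            continue
        # The event text follows the ")." that closes the conn(...) summary.
        _, sep, event = m.group("rest").partition(").")
        if sep and event.startswith("_fault") and "going to standby" in event:
            hits.append((m.group("ts"), m.group("peer")))
    return hits
```

Run over the excerpt above, this should report only the two `_fault` lines, one for the peer on ports 6838/6839 and one for the peer on ports 6834/6835.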

Actions #1

Updated by Patrick Donnelly about 5 years ago

Similar: /ceph/teuthology-archive/pdonnell-2019-02-26_07:49:50-multimds-wip-pdonnell-testing-20190226.051327-distro-basic-smithi/3641205/remote/smithi173/log/ceph-mds.i.log.gz

Actions #2

Updated by Zheng Yan about 5 years ago

Patrick Donnelly wrote:

Similar: /ceph/teuthology-archive/pdonnell-2019-02-26_07:49:50-multimds-wip-pdonnell-testing-20190226.051327-distro-basic-smithi/3641205/remote/smithi173/log/ceph-mds.i.log.gz

This was caused by the removal of the "standby for rank" option. The test case needs to be updated.

Actions #3

Updated by Zheng Yan about 5 years ago

For /ceph/teuthology-archive/pdonnell-2019-02-26_07:49:50-multimds-wip-pdonnell-testing-20190226.051327-distro-basic-smithi/3641251/remote/smithi145

It's likely that "tar cf doc.tar $KERNEL" hung, but there is no 'slow request' in the MDS logs. It looks more like a kclient bug.
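
A check like the one described here (looking for 'slow request' in an MDS log) can be sketched in Python as follows. This is only an illustration: the helper name `count_matches` is hypothetical, and the search string is taken from the comment above rather than from a verified log message format.

```python
import gzip


def count_matches(path, needle="slow request"):
    """Count the lines in a plain or gzip-compressed log that contain `needle`."""
    # Teuthology archives store logs as .gz, so pick the opener by file suffix.
    opener = gzip.open if str(path).endswith(".gz") else open
    count = 0
    with opener(path, "rt", errors="replace") as fh:
        for line in fh:
            if needle in line:
                count += 1
    return count
```

A result of 0 for a log such as ceph-mds.b.log.gz would be consistent with the observation that the MDS itself did not report slow requests.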

Actions #4

Updated by Patrick Donnelly about 5 years ago

Zheng Yan wrote:

Patrick Donnelly wrote:

Similar: /ceph/teuthology-archive/pdonnell-2019-02-26_07:49:50-multimds-wip-pdonnell-testing-20190226.051327-distro-basic-smithi/3641205/remote/smithi173/log/ceph-mds.i.log.gz

This was caused by the removal of the "standby for rank" option. The test case needs to be updated.

Oops, I missed that in my grep with "standby_for". Thanks Zheng!

Actions #5

Updated by Patrick Donnelly about 5 years ago

  • Target version changed from v14.0.0 to v15.0.0

Actions #6

Updated by Patrick Donnelly about 5 years ago

  • Target version deleted (v15.0.0)