Bug #21433 (closed)

mds: failed to decode message of type 43 v7: buffer::end_of_buffer

Added by Christian Salzmann-Jäckel over 6 years ago. Updated over 6 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

We run cephfs (10.2.9 on Debian jessie; 108 OSDs on 9 nodes) as a scratch filesystem for a Slurm cluster, using an IPoIB interconnect and the Debian backports kernel (4.9.30).

Our cephfs kernel clients started blocking on file system access. Logs show 'mds0: Behind on trimming' and slow requests to one OSD (osd.049).
Replacing the disk of osd.049 had no effect. Cluster health is ok (besides d

'ceph daemon mds.cephmon1 dump_ops_in_flight' shows ops from client sessions that are no longer present according to 'ceph daemon mds.cephmon1 session ls'.
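For reference, the corresponding admin socket queries look roughly like this (mds.cephmon1 is our MDS daemon name; objecter_requests should additionally list the MDS's outstanding requests towards the OSDs):

ceph daemon mds.cephmon1 dump_ops_in_flight   # ops stuck in the MDS (see attachment)
ceph daemon mds.cephmon1 session ls           # currently known client sessions (see attachment)
ceph daemon mds.cephmon1 objecter_requests    # outstanding MDS requests towards the OSDs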

We observe constant traffic of ~200 Mbps between the MDS node and this OSD (osd.049).
Stopping the MDS process ends the traffic.
Stopping OSD instance osd.049 shifts the traffic to another OSD (osd.095).
Ceph logs show slow requests even after stopping all clients.
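For reference, the slow requests can be inspected on the OSD side with something like the following (the admin socket commands are standard, the daemon name is ours):

ceph health detail                       # lists the slow request warnings per OSD
ceph daemon osd.49 dump_ops_in_flight    # ops currently blocked on osd.049
ceph daemon osd.49 dump_historic_ops     # recently completed slow ops with timings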

The debug log on osd.049 shows an endless stream of lines for a single PG (4.22e) of the cephfs_metadata pool, which resides on OSDs [49, 95, 9].

2017-09-19 12:20:08.535383 7fd6b98c3700 20 osd.49 pg_epoch: 240725 pg[4.22e( v 240141'1432046 (239363'1429042,240141'1432046] local-les=240073 n=4848 ec=451 les/c/f 240073/240073/0 239916/240072/240072) [49,95,9] r=0 lpr=240072 crt=240129'1432044 lcod 240130'1432045 mlcod 240130'1432045 active+clean] Found key .chunk_4761369_head
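The attached pg_query_4.22e and a debug log like the excerpt above can be obtained along these lines (the exact debug level is an assumption on our part):

ceph pg map 4.22e                              # confirms the acting set [49,95,9]
ceph pg 4.22e query > pg_query_4.22e           # attached
ceph tell osd.49 injectargs '--debug-osd 20'   # temporarily raise OSD debug logging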

ciao
Christian


Files

pg_query_4.22e (15.8 KB) - pg_query - Christian Salzmann-Jäckel, 09/19/2017 10:42 AM
session_ls.19.09.2017-11_24_10.out (29.2 KB) - session_ls - Christian Salzmann-Jäckel, 09/19/2017 10:42 AM
mds-dump_ops_in_flight.19.09.2017-11_24_03.out (3.37 KB) - dump_ops_in_flight - Christian Salzmann-Jäckel, 09/19/2017 10:42 AM
#1

Updated by Zheng Yan over 6 years ago

Sorry for the delay. Have you recovered the FS? If not, please set debug_ms=1 on both the MDS and osd.049 and send the logs to us.
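For example, something along these lines should do it (daemon names taken from the description above; revert to the previous levels afterwards):

ceph tell mds.cephmon1 injectargs '--debug_ms 1'
ceph tell osd.49 injectargs '--debug_ms 1'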

#2

Updated by Patrick Donnelly over 6 years ago

  • Status changed from New to Need More Info
#3

Updated by Greg Farnum over 6 years ago

This is presumably the same root cause as http://tracker.ceph.com/issues/16010

#4

Updated by Christian Salzmann-Jäckel over 6 years ago

After Greg pointed us in the right direction, we recovered the FS by upgrading the cluster to Luminous; we now benefit from multiple active MDSes and directory fragmentation.
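For completeness, the relevant knobs on Luminous are roughly the following (assuming the filesystem is named 'cephfs'; depending on the exact release, directory fragmentation may already be on by default and an allow_multimds flag may need to be enabled before raising max_mds):

ceph fs set cephfs allow_dirfrags true   # enable directory fragmentation
ceph fs set cephfs max_mds 2             # allow a second active MDS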

ciao
Christian

#5

Updated by Zheng Yan over 6 years ago

  • Status changed from Need More Info to Closed

great
