Bug #17236: MDS goes damaged on blacklist (failed to read JournalPointer: -108 ((108) Cannot send after transport endpoint shutdown) - CephFS - Ceph

Added by John Spray over 7 years ago. Updated about 7 years ago.

Status: Resolved
Priority: Urgent
Category: Correctness/Safety
Source: other
Backport: jewel
Severity: 3 - minor
Component(FS): MDS

Description

http://qa-proxy.ceph.com/teuthology/teuthology-2016-09-05_17:25:02-kcephfs-master-testing-basic-mira/401388/

OSD log:

2016-09-05 19:15:04.996890 7f3a92cea700 10 osd.1 pg_epoch: 9 pg[2.7( v 8'7692 (7'4629,8'7692] local-les=7 n=376 ec=6 les/c/f 7/7/0 6/6/6) [1,2] r=0 lpr=6 luod=8'7691 lua=8'7689 crt=8'7688 lcod 8'7690 mlcod 8'7688 active+clean] do_op 172.21.5.140:6808/12206 is blacklisted

remote/mira037/log/ceph-osd.1.log.gz:2016-09-05 19:15:18.303488 7f3a92cea700  1 -- 172.21.5.140:6804/11520 >> 172.21.8.106:6808/19233 conn(0x55d18aee9000 sd=66 :6804 s=STATE_OPEN pgs=44 cs=1 l=1). == tx == 0x55d18c1e3e40 osd_op_reply(4 400.00000000 [read 0~0] v0'0 uv0 ack = -108 ((108) Cannot send after transport endpoint shutdown)) v7

There is at least one case here where r != 0 is taken to mean damage, but on EBLACKLISTED we should just respawn. Almost everywhere else MDSIOContext handles this, but JournalPointer doesn't use it because it works outside of the MDS lock.
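
The distinction described above can be sketched roughly as follows. This is an illustration in Python, not the MDS code: the function name, the Action enum, and the hard-coded value 108 (taken from the OSD reply in the log above, which this report refers to as EBLACKLISTED) are assumptions made for the example.

```python
from enum import Enum

# Assumption: 108 is the errno seen in the OSD reply above
# ("Cannot send after transport endpoint shutdown"), which the report
# treats as the blacklist signal.
EBLACKLISTED = 108


class Action(Enum):
    CONTINUE = "continue"  # read succeeded, carry on
    RESPAWN = "respawn"    # this instance was blacklisted, restart it
    DAMAGED = "damaged"    # any other failure is treated as damage


def classify_journal_pointer_read(r: int) -> Action:
    """Map the return code of a (hypothetical) JournalPointer read to an action."""
    if r == 0:
        return Action.CONTINUE
    if r == -EBLACKLISTED:
        # Being blacklisted says nothing about the on-disk state,
        # so it should not be reported as damage.
        return Action.RESPAWN
    return Action.DAMAGED


if __name__ == "__main__":
    for code in (0, -108, -5):
        print(code, classify_journal_pointer_read(code).value)
```

The point of the sketch is only the ordering of the checks: the blacklist case is tested before the generic r != 0 branch, so it never falls through to "damaged".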

Related issues 1 (0 open — 1 closed)

Updated by John Spray over 7 years ago

Priority changed from Normal to Urgent

http://pulpito.ceph.com/jspray-2016-09-13_08:34:56-fs-wip-no-recordlock-test-testing-basic-smithi/413348/

Promoting to urgent because I've seen this fail the fs suite more than once.

I looked further into logs this time and found that the original blacklist was happening because the MDS was re-using a PID.

It's respawning again and again with the same PID, because teuthology runs daemons with "-f" -- usually we get a new PID when daemonizing. I have no idea why we've only just started seeing this issue.
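
One way to picture why a reused PID keeps hitting the blacklist: the OSD log in the description rejects an address of the form host:port/nonce (172.21.5.140:6808/12206). The sketch below assumes the trailing number is derived from the daemon's PID (the comment above implies this but does not state it) and that the blacklist matches on the full address. The helper names are hypothetical and this is not Ceph code.

```python
from typing import Set, Tuple

Addr = Tuple[str, int, int]  # (host, port, nonce)


def parse_addr(addr: str) -> Addr:
    """Split 'host:port/nonce' into its three parts."""
    host_port, nonce = addr.rsplit("/", 1)
    host, port = host_port.rsplit(":", 1)
    return host, int(port), int(nonce)


def instance_addr(host: str, port: int, pid: int) -> str:
    """Assumption: the nonce of a daemon instance is simply its PID."""
    return f"{host}:{port}/{pid}"


def is_blacklisted(addr: str, blacklist: Set[Addr]) -> bool:
    """Match on the full (host, port, nonce) triple."""
    return parse_addr(addr) in blacklist


if __name__ == "__main__":
    # Address taken from the OSD log line in the description.
    blacklist = {parse_addr("172.21.5.140:6808/12206")}

    # Respawn in the foreground ("-f"): same PID, so the same address again.
    same_pid = instance_addr("172.21.5.140", 6808, 12206)
    # Respawn after daemonizing: a new PID gives a different address.
    new_pid = instance_addr("172.21.5.140", 6808, 12207)

    print(same_pid, is_blacklisted(same_pid, blacklist))
    print(new_pid, is_blacklisted(new_pid, blacklist))
```

Under these assumptions, an instance that comes back with the same PID on the same port presents an address that is already blacklisted, while a new PID would not match.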

Updated by John Spray over 7 years ago

Status changed from New to Fix Under Review

https://github.com/ceph/ceph/pull/11138

Updated by John Spray over 7 years ago

As for the mystery of why this started happening, I think the MDS failures are triggered by #17308

Updated by John Spray over 7 years ago

Status changed from Fix Under Review to Pending Backport

Updated by John Spray over 7 years ago

Backport set to jewel

Updated by Loïc Dachary over 7 years ago

Copied to Backport #17478: jewel: MDS goes damaged on blacklist (failed to read JournalPointer: -108 ((108) Cannot send after transport endpoint shutdown) added

Updated by Nathan Cutler about 7 years ago

Status changed from Pending Backport to Resolved
