Bug #15230
closed
"unhandled write error..force readonly" in infernalis-x
Added by Yuri Weinstein about 8 years ago.
Updated about 8 years ago.
ceph-qa-suite: upgrade/infernalis-x
Description
Logs in /home/yuriw/logs/test_cephfs on teuthology box
1495991-2016-03-21T15:56:04.056 DEBUG:teuthology.misc:Ceph health: HEALTH_WARN 6 pgs stale
1496074-2016-03-21T15:56:04.656 INFO:tasks.ceph.osd.3.vpm070.stderr:2016-03-21 22:56:04.654342 7f12a76fc800 -1 osd.3 14 log_to_monitors {default=true}
1496217-2016-03-21T15:56:08.396 INFO:tasks.ceph.mds.a.vpm107.stderr:2016-03-21 22:56:08.392069 7f9339d48700 -1 log_channel(cluster) log [ERR] : failed to commit dir 10000002c9e object, errno -2
1496403:2016-03-21T15:56:08.397 INFO:tasks.ceph.mds.a.vpm107.stderr:2016-03-21 22:56:08.392110 7f9339d48700 -1 mds.0.2 unhandled write error (2) No such file or directory, force readonly...
1496585-2016-03-21T15:56:08.891 INFO:tasks.workunit.client.2.vpm026.stderr:rm: cannot remove ‘/home/ubuntu/cephtest/mnt.2/client.2/tmp/blogbench-1.0/src/blogtest_in/blog-14’: Read-only file system
- Description updated (diff)
The object gets migrated from osd.2 to osd.0. The last op on the object in osd.2 results in
2016-03-21 22:55:02.015822 7fbb4d042700 3 osd.2 pg_epoch: 14 pg[2.1( v 14'3015 (0'0,14'3015] local-les=6 n=2551 ec=5 les/c/f 6/6/0 5/5/5) [2,3] r=0 lpr=5 luod=14'3009 crt=14'3008 lcod 14'3008 mlcod 14'3008 active+clean] do_op dup unknown.0.2:0 was 13'1096
And indeed, it looks like every op is sent in with that tid of "unknown.0.2:0". This is troubling! The MDS should be changing IDs after it restarts and isn't, but that might have been a known issue in infernalis — certainly it's one we saw and have fixed, and might not have gotten backported. :/
The part where the op ID isn't changing from 0 is more concerning though. I don't think the MDS can touch that? Is this some kind of message decoding error? It looks like everything is still on 9.2.1* at that point...
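For concreteness: an osd_reqid_t prints as name.inc:tid, so "unknown.0.2:0" is a default (unknown) entity name, incarnation 2, and a tid stuck at 0. Here is a minimal toy sketch (simplified stand-ins, not the real Ceph types) of why ops that all carry the same reqid trip the OSD's dup-op check:

#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <tuple>

// Simplified stand-in for Ceph's osd_reqid_t: an entity name, an
// incarnation counter, and a per-client transaction id.
struct ToyReqid {
  std::string name;   // e.g. "mds.0", or "unknown.0" for a default name
  int32_t inc = 0;    // client incarnation
  uint64_t tid = 0;   // transaction id, expected to increment per op
  bool operator<(const ToyReqid& o) const {
    return std::tie(name, inc, tid) < std::tie(o.name, o.inc, o.tid);
  }
};

// Mirrors how osd_reqid_t prints: name.inc:tid, so a default name with
// inc=2 and tid=0 renders as "unknown.0.2:0".
std::ostream& operator<<(std::ostream& out, const ToyReqid& r) {
  return out << r.name << "." << r.inc << ":" << r.tid;
}

int main() {
  // Toy model of the PG's completed-op lookup: reqid -> last version.
  std::map<ToyReqid, std::string> completed;

  // The first op with the broken reqid is processed and recorded...
  ToyReqid first{"unknown.0", 2, 0};
  completed[first] = "13'1096";

  // ...and every later op arrives with the *same* reqid, so the OSD
  // sees it as a replay of the first one.
  ToyReqid next{"unknown.0", 2, 0};  // tid never incremented
  if (completed.count(next))
    std::cout << "do_op dup " << next << " was " << completed[next] << "\n";
}

Every op after the first looks like a replay of the recorded one, which is consistent with the "do_op dup ... was 13'1096" line above.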
- Category changed from OSD to Objecter
Okay, earlier it is using mds.0.1:N, where N increments properly. But it switches over with
2016-03-21 22:51:26.056465 7fbb41910700 1 -- 172.21.2.70:6800/11710 <== mds.0 172.21.2.107:6800/13596 1 ==== osd_op(unknown.0.2:0 mds_snaptable [read 0~0] 1.d90270ad ack+read+known_if_redirected+full_force e13) v6 ==== 199+0+0 (3863754343 0 0) 0x7fbb6f3bc3c0 con 0x7fbb6e2f35a0
That's a startup read all right, looking to see what's in the various mdstables. I'm having trouble correlating this with stuff in the teuthology.log (different timezones, possibly with an additional offset?). But again, I don't see how the MDS could possibly have screwed up its osd tids like that. And this is running against infernalis on both the MDS and OSD.
It looks like the MDS is still infernalis, but the OSD is jewel. I suspect this is a bug in the new MOSDOp encoding, specifically in either
osd_reqid_t get_reqid() const {
  if (reqid != osd_reqid_t())
    return reqid;
  else
    return osd_reqid_t(get_orig_source(),
                       client_inc,
                       header.tid);
}
(that's infernalis)
vs the jewel version,
osd_reqid_t get_reqid() const {
  assert(!partial_decode_needed);
  if (reqid.name != entity_name_t() || reqid.tid != 0) {
    return reqid;
  } else {
    if (!final_decode_needed)
      assert(reqid.inc == (int32_t)client_inc);  // decode() should have done this
    return osd_reqid_t(get_orig_source(),
                       reqid.inc,
                       header.tid);
  }
}
The simplest theory is that the reqid is getting filled in by the infernalis objecter, but I can't see where that ever happens in Objecter (as used by the MDS, as opposed to librados). Hrm.
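For what it's worth, the divergence between the two checks is easy to model. Here's a toy comparison (hypothetical simplified types, not the real MOSDOp fields), fed a wire reqid shaped like the one in the log (default name, inc=2, tid=0):

#include <cstdint>
#include <iostream>
#include <string>

// Hypothetical, simplified stand-ins for the fields get_reqid() consults;
// the real types are entity_name_t and osd_reqid_t.
struct Reqid {
  std::string name = "unknown.0";  // default-constructed entity name
  int32_t inc = 0;
  uint64_t tid = 0;
  bool is_default() const {
    return name == "unknown.0" && inc == 0 && tid == 0;
  }
};

std::ostream& operator<<(std::ostream& out, const Reqid& r) {
  return out << r.name << "." << r.inc << ":" << r.tid;
}

// Infernalis-style check: any nonzero field means "trust the wire reqid".
Reqid get_reqid_infernalis(const Reqid& wire, const std::string& src,
                           int32_t client_inc, uint64_t header_tid) {
  if (!wire.is_default())
    return wire;  // inc=2 alone is enough to take this branch
  return Reqid{src, client_inc, header_tid};
}

// Jewel-style check: only a real name or nonzero tid counts; otherwise
// rebuild the reqid from the source and the header tid.
Reqid get_reqid_jewel(const Reqid& wire, const std::string& src,
                      uint64_t header_tid) {
  if (wire.name != "unknown.0" || wire.tid != 0)
    return wire;
  return Reqid{src, wire.inc, header_tid};
}

int main() {
  // A wire reqid like the one in the log: default name, inc=2, tid=0.
  Reqid wire{"unknown.0", 2, 0};
  std::cout << "infernalis: "
            << get_reqid_infernalis(wire, "mds.0", 2, 12345) << "\n"
            << "jewel:      "
            << get_reqid_jewel(wire, "mds.0", 12345) << "\n";
}

The infernalis check treats the nonzero inc as proof that the wire reqid is trustworthy and passes the half-filled value straight through, which matches the unknown.0.2:0 the OSD logged; the jewel check would have rebuilt it from the source and header tid.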
Ah, other way around. The MDS was just upgraded to
2016-03-23 21:34:32.642024 7fa1cb38d180 0 ceph version 10.0.5-2735-g1aa2fe6 (1aa2fe6ca27d8bc95ce1599d607d272626fe86cc), process ceph-mds, pid 17732
and the OSD was still
2016-03-23 21:30:50.374675 7f509e38c940 0 ceph version 9.2.1-14-geff3ff4 (eff3ff4cbe9bfef7c6429b183f7dc0a16359c395), process ceph-osd, pid 15347
when the first request comes in:
2016-03-23 21:34:36.933524 7f50755b3700 10 osd.2 13 new session 0x7f50a33e6a80 con=0x7f50a1887860 addr=172.21.2.29:6800/17732
2016-03-23 21:34:36.933616 7f50755b3700 10 osd.2 13 session 0x7f50a33e6a80 mds.a has caps osdcap[grant(*)] 'allow *'
2016-03-23 21:34:36.934277 7f50755b3700 1 -- 172.21.2.154:6800/15347 <== mds.0 172.21.2.29:6800/17732 1 ==== osd_op(unknown.0.2:0 mds_snaptable [read 0~0] 1.d90270ad ack+read+known_if_redirected+full_force e13) v6 ==== 199+0+0 (2170848432 0 0) 0x7f50a2606c00 con 0x7f50a1887860
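If that's the real shape of the bug, it would live in the jewel sender's backward-compatible encode path: whatever the pre-jewel layout serializes for the reqid has to be fully populated before it hits the wire. Here's a schematic of feature-gated encoding (hypothetical feature bit, field names, and toy encoder; not the actual MOSDOp::encode_payload) that reproduces the half-filled value:

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical feature bit; the real gate would be a CEPH_FEATURE_* flag
// checked against the connection's negotiated feature set.
constexpr uint64_t FEATURE_NEW_MOSDOP_ENCODING = 1ull << 0;

// Toy wire writer: collects named fields in encode order.
struct ToyEncoder {
  std::vector<std::string> fields;
  void put(const std::string& k, const std::string& v) {
    fields.push_back(k + "=" + v);
  }
};

struct ToyOp {
  // What the sender actually knows about itself:
  std::string source = "mds.0";
  int32_t client_inc = 2;
  uint64_t header_tid = 12345;
  // The reqid member, which may not have been fully filled in yet:
  std::string reqid_name = "unknown.0";  // never set
  int32_t reqid_inc = 2;                 // set
  uint64_t reqid_tid = 0;                // never set

  // Schematic encode: the peer's features pick the layout. If the
  // pre-jewel layout serializes the half-filled reqid member instead of
  // (source, client_inc, header_tid), an old OSD decodes exactly the
  // unknown.0.2:0 seen in the log.
  void encode(uint64_t peer_features, ToyEncoder& enc) const {
    if (peer_features & FEATURE_NEW_MOSDOP_ENCODING) {
      enc.put("reqid", source + "." + std::to_string(client_inc) + ":" +
                           std::to_string(header_tid));
    } else {
      enc.put("reqid", reqid_name + "." + std::to_string(reqid_inc) + ":" +
                           std::to_string(reqid_tid));
    }
  }
};

int main() {
  ToyOp op;
  ToyEncoder enc;
  op.encode(0 /* infernalis peer: feature bit absent */, enc);
  for (const auto& f : enc.fields)
    std::cout << f << "\n";  // prints reqid=unknown.0.2:0
}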
- Status changed from New to Fix Under Review
- Status changed from Fix Under Review to Resolved
- Status changed from Resolved to New
- Status changed from New to Resolved