Project

General

Profile

Bug #15230

"unhandled write error..force readonly" in infernalis-x

Added by Yuri Weinstein about 8 years ago. Updated about 8 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
Objecter
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
upgrade/infernalis-x
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Logs in /home/yuriw/logs/test_cephfs on teuthology box

1495991-2016-03-21T15:56:04.056 DEBUG:teuthology.misc:Ceph health: HEALTH_WARN 6 pgs stale
1496074-2016-03-21T15:56:04.656 INFO:tasks.ceph.osd.3.vpm070.stderr:2016-03-21 22:56:04.654342 7f12a76fc800 -1 osd.3 14 log_to_monitors {default=true}
1496217-2016-03-21T15:56:08.396 INFO:tasks.ceph.mds.a.vpm107.stderr:2016-03-21 22:56:08.392069 7f9339d48700 -1 log_channel(cluster) log [ERR] : failed to commit dir 10000002c9e object, errno -2
1496403:2016-03-21T15:56:08.397 INFO:tasks.ceph.mds.a.vpm107.stderr:2016-03-21 22:56:08.392110 7f9339d48700 -1 mds.0.2 unhandled write error (2) No such file or directory, force readonly...
1496585-2016-03-21T15:56:08.891 INFO:tasks.workunit.client.2.vpm026.stderr:rm: cannot remove ‘/home/ubuntu/cephtest/mnt.2/client.2/tmp/blogbench-1.0/src/blogtest_in/blog-14’: Read-only file system

Associated revisions

Revision bdca28dd (diff)
Added by Sage Weil about 8 years ago

messages/MOSDOp: clear reqid inc for v6 encoding

For the new v7 encoding, we put the client incarnation in the
reqid so that we can have it at the beginning of the encoding.
However, we do not want to do this for the v6 encoding, as that
version of get_reqid() will return reqid if it is not ==
osd_reqid_t().

Fixes: #15230
Signed-off-by: Sage Weil <>

History

#1 Updated by Yuri Weinstein about 8 years ago

  • Description updated (diff)

#2 Updated by Greg Farnum about 8 years ago

  • Category set to OSD

The object gets migrated from osd.2 to osd.0. The last op on the object in osd.2 results in

2016-03-21 22:55:02.015822 7fbb4d042700  3 osd.2 pg_epoch: 14 pg[2.1( v 14'3015 (0'0,14'3015] local-les=6 n=2551 ec=5 les/c/f 6/6/0 5/5/5) [2,3] r=0 lpr=5 luod=14'3009 crt=14'3008 lcod 14'3008 mlcod 14'3008 active+clean] do_op dup unknown.0.2:0 was 13'1096

And indeed, it looks like every op is sent in with that tid of "unknown.0.2:0". This is troubling! The MDS should be changing IDs after it restarts and isn't, but that might have been a known issue in infernalis — certainly it's one we saw and have fixed, and might not have gotten backported. :/
The part where the op ID isn't changing from 0 is more concerning though. I don't think the MDS can touch that? Is this some kind of message decoding error? It looks like everything is still on 9.2.1* at that point...

#3 Updated by Greg Farnum about 8 years ago

  • Category changed from OSD to Objecter

Okay, it is earlier using mds.0.1:N, where N properly increments. But it switches over with

2016-03-21 22:51:26.056465 7fbb41910700  1 -- 172.21.2.70:6800/11710 <== mds.0 172.21.2.107:6800/13596 1 ==== osd_op(unknown.0.2:0 mds_snaptable [read 0~0] 1.d90270ad ack+read+known_if_redirected+full_force e13) v6 ==== 199+0+0 (3863754343 0 0) 0x7fbb6f3bc3c0 con 0x7fbb6e2f35a0

That's a startup read all right, looking to see what's in the various mdstables. I'm having trouble correlating this with stuff in the teuthology.log (different timezones, possibly with an additional offset?). But again, I don't see how the MDS could possibly have screwed up its osd tids like that. And this is running against infernalis on both the MDS and OSD.

#5 Updated by Sage Weil about 8 years ago

it looks like the mds is still infernalis, but the osd is jewel. i suspect this is a bug in the new MOSDOp encoding, especially either

  osd_reqid_t get_reqid() const {
    if (reqid != osd_reqid_t())
      return reqid;
    else
      return osd_reqid_t(get_orig_source(),
                         client_inc,
             header.tid);
  }

(that's infernalis)
vs the jewel version,
  osd_reqid_t get_reqid() const {
    assert(!partial_decode_needed);
    if (reqid.name != entity_name_t() || reqid.tid != 0) {
      return reqid;
    } else {
      if (!final_decode_needed)
    assert(reqid.inc == (int32_t)client_inc);  // decode() should have done this
      return osd_reqid_t(get_orig_source(),
                         reqid.inc,
             header.tid);
    }
  }

The simpelst theory is that the reqid is getting filled in by the infernalis objecter, but I can't see where that ever happens in Objecter (as used by MDS, vs librados). Hrm.

#6 Updated by Sage Weil about 8 years ago

Ah, other way around. The MDS was just upgraded to

2016-03-23 21:34:32.642024 7fa1cb38d180 0 ceph version 10.0.5-2735-g1aa2fe6 (1aa2fe6ca27d8bc95ce1599d607d272626fe86cc), process ceph-mds, pid 17732

and the OSD was still

2016-03-23 21:30:50.374675 7f509e38c940 0 ceph version 9.2.1-14-geff3ff4 (eff3ff4cbe9bfef7c6429b183f7dc0a16359c395), process ceph-osd, pid 15347

when the first request comes in:

2016-03-23 21:34:36.933524 7f50755b3700 10 osd.2 13 new session 0x7f50a33e6a80 con=0x7f50a1887860 addr=172.21.2.29:6800/17732
2016-03-23 21:34:36.933616 7f50755b3700 10 osd.2 13 session 0x7f50a33e6a80 mds.a has caps osdcap[grant(*)] 'allow *'
2016-03-23 21:34:36.934277 7f50755b3700 1 -- 172.21.2.154:6800/15347 <== mds.0 172.21.2.29:6800/17732 1 ==== osd_op(unknown.0.2:0 mds_snaptable [read 0~0] 1.d90270ad ack+read+known_if_redirected+full_force e13) v6 ==== 199+0+0 (2170848432 0 0) 0x7f50a2606c00 con 0x7f50a1887860

#7 Updated by Sage Weil about 8 years ago

  • Status changed from New to Fix Under Review

#8 Updated by Sage Weil about 8 years ago

  • Status changed from Fix Under Review to Resolved

#9 Updated by Yuri Weinstein about 8 years ago

  • Status changed from Resolved to New

Still see in run: http://pulpito.ceph.com/yuriw-2016-03-27_09:52:17-upgrade:infernalis-x-master-distro-basic-vps/
Jobs: ['89886', '89890']
Logs: http://qa-proxy.ceph.com/teuthology/yuriw-2016-03-27_09:52:17-upgrade:infernalis-x-master-distro-basic-vps/89886/teuthology.log

2016-03-27T13:46:21.185 INFO:tasks.ceph.mds.a.vpm011.stderr:2016-03-27 20:46:21.179324 7fcb6b87b700 -1 mds.0.2 unhandled write error (2) No such file or directory, force readonly...

#10 Updated by Sage Weil about 8 years ago

  • Status changed from New to Resolved

Also available in: Atom PDF