Bug #15230
closed
"unhandled write error..force readonly" in infernalis-x
Added by Yuri Weinstein about 8 years ago.
Updated about 8 years ago.
ceph-qa-suite: upgrade/infernalis-x
Description
Logs in /home/yuriw/logs/test_cephfs on teuthology box
1495991-2016-03-21T15:56:04.056 DEBUG:teuthology.misc:Ceph health: HEALTH_WARN 6 pgs stale
1496074-2016-03-21T15:56:04.656 INFO:tasks.ceph.osd.3.vpm070.stderr:2016-03-21 22:56:04.654342 7f12a76fc800 -1 osd.3 14 log_to_monitors {default=true}
1496217-2016-03-21T15:56:08.396 INFO:tasks.ceph.mds.a.vpm107.stderr:2016-03-21 22:56:08.392069 7f9339d48700 -1 log_channel(cluster) log [ERR] : failed to commit dir 10000002c9e object, errno -2
1496403:2016-03-21T15:56:08.397 INFO:tasks.ceph.mds.a.vpm107.stderr:2016-03-21 22:56:08.392110 7f9339d48700 -1 mds.0.2 unhandled write error (2) No such file or directory, force readonly...
1496585-2016-03-21T15:56:08.891 INFO:tasks.workunit.client.2.vpm026.stderr:rm: cannot remove ‘/home/ubuntu/cephtest/mnt.2/client.2/tmp/blogbench-1.0/src/blogtest_in/blog-14’: Read-only file system
- Description updated (diff)
The object gets migrated from osd.2 to osd.0. The last op on the object in osd.2 results in
2016-03-21 22:55:02.015822 7fbb4d042700 3 osd.2 pg_epoch: 14 pg[2.1( v 14'3015 (0'0,14'3015] local-les=6 n=2551 ec=5 les/c/f 6/6/0 5/5/5) [2,3] r=0 lpr=5 luod=14'3009 crt=14'3008 lcod 14'3008 mlcod 14'3008 active+clean] do_op dup unknown.0.2:0 was 13'1096
And indeed, it looks like every op is sent in with that tid of "unknown.0.2:0". This is troubling! The MDS should be changing IDs after it restarts and isn't, but that might have been a known issue in infernalis — certainly it's one we saw and have fixed, and might not have gotten backported. :/
The part where the op ID isn't changing from 0 is more concerning though. I don't think the MDS can touch that? Is this some kind of message decoding error? It looks like everything is still on 9.2.1* at that point...
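For concreteness: an osd_reqid_t prints as name.inc:tid, so "unknown.0.2:0" is a default (unknown) entity name, incarnation 2, and a tid stuck at 0. Here is a minimal toy sketch (simplified stand-ins, not the real Ceph types) of why ops that all carry the same reqid trip the OSD's dup-op check:

#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <tuple>

// Simplified stand-in for Ceph's osd_reqid_t: an entity name, an
// incarnation counter, and a per-client transaction id.
struct ToyReqid {
  std::string name;   // e.g. "mds.0", or "unknown.0" for a default name
  int32_t inc = 0;    // client incarnation
  uint64_t tid = 0;   // transaction id, expected to increment per op
  bool operator<(const ToyReqid& o) const {
    return std::tie(name, inc, tid) < std::tie(o.name, o.inc, o.tid);
  }
};

// Mirrors how osd_reqid_t prints: name.inc:tid, so a default name with
// inc=2 and tid=0 renders as "unknown.0.2:0".
std::ostream& operator<<(std::ostream& out, const ToyReqid& r) {
  return out << r.name << "." << r.inc << ":" << r.tid;
}

int main() {
  // Toy model of the PG's completed-op lookup: reqid -> last version.
  std::map<ToyReqid, std::string> completed;

  // The first op with the broken reqid is processed and recorded...
  ToyReqid first{"unknown.0", 2, 0};
  completed[first] = "13'1096";

  // ...and every later op arrives with the *same* reqid, so the OSD
  // sees it as a replay of the first one.
  ToyReqid next{"unknown.0", 2, 0};  // tid never incremented
  if (completed.count(next))
    std::cout << "do_op dup " << next << " was " << completed[next] << "\n";
}

Every op after the first looks like a replay of the recorded one, which is consistent with the "do_op dup ... was 13'1096" line above.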
- Category changed from OSD to Objecter
Okay, earlier it is using mds.0.1:N, where N increments properly. But it switches over with
2016-03-21 22:51:26.056465 7fbb41910700 1 -- 172.21.2.70:6800/11710 <== mds.0 172.21.2.107:6800/13596 1 ==== osd_op(unknown.0.2:0 mds_snaptable [read 0~0] 1.d90270ad ack+read+known_if_redirected+full_force e13) v6 ==== 199+0+0 (3863754343 0 0) 0x7fbb6f3bc3c0 con 0x7fbb6e2f35a0
That's a startup read all right, looking to see what's in the various mdstables. I'm having trouble correlating this with stuff in the teuthology.log (different timezones, possibly with an additional offset?). But again, I don't see how the MDS could possibly have screwed up its osd tids like that. And this is running against infernalis on both the MDS and OSD.
It looks like the MDS is still infernalis, but the OSD is jewel. I suspect this is a bug in the new MOSDOp encoding, specifically in either
osd_reqid_t get_reqid() const {
  if (reqid != osd_reqid_t())
    return reqid;
  else
    return osd_reqid_t(get_orig_source(),
                       client_inc,
                       header.tid);
}
(that's infernalis)
vs the jewel version,
osd_reqid_t get_reqid() const {
  assert(!partial_decode_needed);
  if (reqid.name != entity_name_t() || reqid.tid != 0) {
    return reqid;
  } else {
    if (!final_decode_needed)
      assert(reqid.inc == (int32_t)client_inc);  // decode() should have done this
    return osd_reqid_t(get_orig_source(),
                       reqid.inc,
                       header.tid);
  }
}
The simplest theory is that the reqid is getting filled in by the infernalis objecter, but I can't see where that ever happens in Objecter (as used by the MDS, as opposed to librados). Hrm.
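For what it's worth, the divergence between the two checks is easy to model. Here's a toy comparison (hypothetical simplified types, not the real MOSDOp fields), fed a wire reqid shaped like the one in the log (default name, inc=2, tid=0):

#include <cstdint>
#include <iostream>
#include <string>

// Hypothetical, simplified stand-ins for the fields get_reqid() consults;
// the real types are entity_name_t and osd_reqid_t.
struct Reqid {
  std::string name = "unknown.0";  // default-constructed entity name
  int32_t inc = 0;
  uint64_t tid = 0;
  bool is_default() const {
    return name == "unknown.0" && inc == 0 && tid == 0;
  }
};

std::ostream& operator<<(std::ostream& out, const Reqid& r) {
  return out << r.name << "." << r.inc << ":" << r.tid;
}

// Infernalis-style check: any nonzero field means "trust the wire reqid".
Reqid get_reqid_infernalis(const Reqid& wire, const std::string& src,
                           int32_t client_inc, uint64_t header_tid) {
  if (!wire.is_default())
    return wire;  // inc=2 alone is enough to take this branch
  return Reqid{src, client_inc, header_tid};
}

// Jewel-style check: only a real name or nonzero tid counts; otherwise
// rebuild the reqid from the source and the header tid.
Reqid get_reqid_jewel(const Reqid& wire, const std::string& src,
                      uint64_t header_tid) {
  if (wire.name != "unknown.0" || wire.tid != 0)
    return wire;
  return Reqid{src, wire.inc, header_tid};
}

int main() {
  // A wire reqid like the one in the log: default name, inc=2, tid=0.
  Reqid wire{"unknown.0", 2, 0};
  std::cout << "infernalis: "
            << get_reqid_infernalis(wire, "mds.0", 2, 12345) << "\n"
            << "jewel:      "
            << get_reqid_jewel(wire, "mds.0", 12345) << "\n";
}

The infernalis check treats the nonzero inc as proof that the wire reqid is trustworthy and passes the half-filled value straight through, which matches the unknown.0.2:0 the OSD logged; the jewel check would have rebuilt it from the source and header tid.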
Ah, other way around. The MDS was just upgraded to
2016-03-23 21:34:32.642024 7fa1cb38d180 0 ceph version 10.0.5-2735-g1aa2fe6 (1aa2fe6ca27d8bc95ce1599d607d272626fe86cc), process ceph-mds, pid 17732
and the OSD was still
2016-03-23 21:30:50.374675 7f509e38c940 0 ceph version 9.2.1-14-geff3ff4 (eff3ff4cbe9bfef7c6429b183f7dc0a16359c395), process ceph-osd, pid 15347
when the first request comes in:
2016-03-23 21:34:36.933524 7f50755b3700 10 osd.2 13 new session 0x7f50a33e6a80 con=0x7f50a1887860 addr=172.21.2.29:6800/17732
2016-03-23 21:34:36.933616 7f50755b3700 10 osd.2 13 session 0x7f50a33e6a80 mds.a has caps osdcap[grant(*)] 'allow *'
2016-03-23 21:34:36.934277 7f50755b3700 1 -- 172.21.2.154:6800/15347 <== mds.0 172.21.2.29:6800/17732 1 ==== osd_op(unknown.0.2:0 mds_snaptable [read 0~0] 1.d90270ad ack+read+known_if_redirected+full_force e13) v6 ==== 199+0+0 (2170848432 0 0) 0x7f50a2606c00 con 0x7f50a1887860
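If that's the real shape of the bug, it would live in the jewel sender's backward-compatible encode path: whatever the pre-jewel layout serializes for the reqid has to be fully populated before it hits the wire. Here's a schematic of feature-gated encoding (hypothetical feature bit, field names, and toy encoder; not the actual MOSDOp::encode_payload) that reproduces the half-filled value:

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical feature bit; the real gate would be a CEPH_FEATURE_* flag
// checked against the connection's negotiated feature set.
constexpr uint64_t FEATURE_NEW_MOSDOP_ENCODING = 1ull << 0;

// Toy wire writer: collects named fields in encode order.
struct ToyEncoder {
  std::vector<std::string> fields;
  void put(const std::string& k, const std::string& v) {
    fields.push_back(k + "=" + v);
  }
};

struct ToyOp {
  // What the sender actually knows about itself:
  std::string source = "mds.0";
  int32_t client_inc = 2;
  uint64_t header_tid = 12345;
  // The reqid member, which may not have been fully filled in yet:
  std::string reqid_name = "unknown.0";  // never set
  int32_t reqid_inc = 2;                 // set
  uint64_t reqid_tid = 0;                // never set

  // Schematic encode: the peer's features pick the layout. If the
  // pre-jewel layout serializes the half-filled reqid member instead of
  // (source, client_inc, header_tid), an old OSD decodes exactly the
  // unknown.0.2:0 seen in the log.
  void encode(uint64_t peer_features, ToyEncoder& enc) const {
    if (peer_features & FEATURE_NEW_MOSDOP_ENCODING) {
      enc.put("reqid", source + "." + std::to_string(client_inc) + ":" +
                           std::to_string(header_tid));
    } else {
      enc.put("reqid", reqid_name + "." + std::to_string(reqid_inc) + ":" +
                           std::to_string(reqid_tid));
    }
  }
};

int main() {
  ToyOp op;
  ToyEncoder enc;
  op.encode(0 /* infernalis peer: feature bit absent */, enc);
  for (const auto& f : enc.fields)
    std::cout << f << "\n";  // prints reqid=unknown.0.2:0
}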
- Status changed from New to Fix Under Review
- Status changed from Fix Under Review to Resolved
- Status changed from Resolved to New
- Status changed from New to Resolved