Bug #805
mds startup: _replay journaler got error -22, aborting
Status: Closed
Description
As per #803, the assert is now fixed, but on starting up the cluster I now get:
2011-02-14 21:02:14.026899 7f91c9865700 mds0.14 ms_handle_connect on 10.135.211.78:6801/22875
2011-02-14 21:02:59.867133 7f91c9865700 mds0.cache creating system inode with ino:100
2011-02-14 21:02:59.867415 7f91c9865700 mds0.cache creating system inode with ino:1
2011-02-14 21:03:05.579551 7f91c7059700 mds0.journaler try_read_entry got 0 len entry at offset 4051698581
2011-02-14 21:03:05.579615 7f91c7059700 mds0.log _replay journaler got error -22, aborting
2011-02-14 21:03:05.582298 7f91c7059700 mds0.14 boot_start encountered an error, failing
2011-02-14 21:03:05.582323 7f91c7059700 mds0.14 suicide. wanted up:replay, now down:dne
2011-02-14 21:03:05.583656 7f91cc303720 stopped.
See #803 for more logs and other details
Updated by Greg Farnum about 13 years ago
- Assignee set to Greg Farnum
You don't have any other logs from before, do you? At this point there's a problem with the way the MDS journal ended up on-disk. If you could generate one more log with
debug mds = 20
debug journaler = 20
that'd be helpful in diagnosing what could have happened.
Sage also thinks there are going to be fundamental problems with large directory reads until fragmentation gets enabled*, but I'm not sure they would actually manifest like this.
[*] Since directories and their contents are stored in a single object on-disk, when doing an ls it reads the directory into memory. Once your directory gets larger than the MDS cache size, you're going to have problems with cache swapping.
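To make the footnote's point concrete, here is a toy model (not Ceph code; all names are made up for illustration): because the whole directory lives in one object, an `ls` pulls every dentry into the cache at once, and any directory larger than the cache capacity forces evictions on every listing.

```python
# Toy model of the cache-swapping problem described above (not Ceph code).
# A directory stored as a single on-disk object must be loaded wholesale,
# so listing a directory larger than the MDS cache capacity evicts earlier
# entries and the cache thrashes.
from collections import OrderedDict

class TinyLRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.evictions = 0

    def insert(self, dentry):
        if dentry in self.entries:
            self.entries.move_to_end(dentry)  # refresh LRU position
            return
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
            self.evictions += 1
        self.entries[dentry] = True

def list_directory(cache, num_dentries):
    # An `ls` pulls every dentry of the directory into the cache at once.
    for i in range(num_dentries):
        cache.insert(f"dentry{i}")
    return cache.evictions

cache = TinyLRUCache(capacity=1000)
# A directory with 1500 entries overflows a 1000-entry cache: 500 evictions.
print(list_directory(cache, 1500))  # → 500
```

Directory fragmentation addresses this by splitting one logical directory across multiple objects, so only the needed fragments are loaded.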
Updated by John Leach about 13 years ago
- File mds.srv-an56n.log mds.srv-an56n.log added
Updated by John Leach about 13 years ago
Dumped and bzipped the journal here: http://johnleach.co.uk/downloads/ceph/805-journal.dump.tbz2
Updated by Greg Farnum about 13 years ago
Okay, somehow expire < trim. That shouldn't happen.
Updated by Greg Farnum about 13 years ago
- Status changed from New to Can't reproduce
Well, I can't figure out how the header could have gotten corrupted like that, but I've put in a few more checks for it and will consult on it when Sage gets back. :/
Re-open if you see this again!
Updated by Greg Farnum about 13 years ago
- Status changed from Can't reproduce to In Progress
We got this on our playground today. I have found one promising lead. Object 200.0001b88d_head is the last object in the journal, and logs are available for the OSD storing it:

grep "200.0001b88d_head" /var/log/ceph/osd.4.log.1 | less
(partial results only)
2011-03-08 03:32:41.361755 7f1b6b9bd910 filestore(/mnt/osd) write /mnt/osd/current/1.60_head/200.0001b88d_head 1404340~1010
2011-03-08 03:32:41.369422 7f1b6b9bd910 filestore(/mnt/osd) write /mnt/osd/current/1.60_head/200.0001b88d_head 1404340~1010 = 1010
2011-03-08 03:32:41.369441 7f1b6b9bd910 filestore(/mnt/osd) setattr /mnt/osd/current/1.60_head/200.0001b88d_head '_' len 153
2011-03-08 03:32:41.369485 7f1b6b9bd910 filestore(/mnt/osd) setattr /mnt/osd/current/1.60_head/200.0001b88d_head '_' len 153 = 153
2011-03-08 03:32:41.369497 7f1b6b9bd910 filestore(/mnt/osd) setattr /mnt/osd/current/1.60_head/200.0001b88d_head 'snapset' len 26
2011-03-08 03:32:41.369536 7f1b6b9bd910 filestore(/mnt/osd) setattr /mnt/osd/current/1.60_head/200.0001b88d_head 'snapset' len 26 = 26
2011-03-08 03:35:21.535741 7f1b660b1910 filestore(/mnt/osd) getattr /mnt/osd/current/1.60_head/200.0001b88d_head '_'
2011-03-08 03:35:21.535890 7f1b660b1910 filestore(/mnt/osd) getattr /mnt/osd/current/1.60_head/200.0001b88d_head '_' = 153
2011-03-08 03:35:21.535905 7f1b660b1910 filestore(/mnt/osd) getattr /mnt/osd/current/1.60_head/200.0001b88d_head 'snapset'
2011-03-08 03:35:21.535924 7f1b660b1910 filestore(/mnt/osd) getattr /mnt/osd/current/1.60_head/200.0001b88d_head 'snapset' = 26
2011-03-08 03:35:27.778222 7f1b6b1bc910 filestore(/mnt/osd) write /mnt/osd/current/1.60_head/200.0001b88d_head 0~1595
2011-03-08 03:35:27.778614 7f1b6b1bc910 filestore(/mnt/osd) write /mnt/osd/current/1.60_head/200.0001b88d_head 0~1595 = 1595
Looks like the journal entry gets written out to a certain point, then writing starts all over again on the same object -- did the MDS crash, and did this object get lost track of during replay? Then, while overwriting this object, it doesn't overwrite all the data:
2011-03-08 03:37:47.204260 7f1b6b1bc910 filestore(/mnt/osd) write /mnt/osd/current/1.60_head/200.0001b88d_head 690703~1021
2011-03-08 03:37:47.204314 7f1b6b1bc910 filestore(/mnt/osd) write /mnt/osd/current/1.60_head/200.0001b88d_head 690703~1021 = 1021
2011-03-08 03:37:47.204328 7f1b6b1bc910 filestore(/mnt/osd) setattr /mnt/osd/current/1.60_head/200.0001b88d_head '_' len 153
2011-03-08 03:37:47.204369 7f1b6b1bc910 filestore(/mnt/osd) setattr /mnt/osd/current/1.60_head/200.0001b88d_head '_' len 153 = 153
2011-03-08 03:37:47.204380 7f1b6b1bc910 filestore(/mnt/osd) setattr /mnt/osd/current/1.60_head/200.0001b88d_head 'snapset' len 26
2011-03-08 03:37:47.204421 7f1b6b1bc910 filestore(/mnt/osd) setattr /mnt/osd/current/1.60_head/200.0001b88d_head 'snapset' len 26 = 26
2011-03-08 03:38:06.910883 7f1b658b0910 filestore(/mnt/osd) getattr /mnt/osd/current/1.60_head/200.0001b88d_head '_'
2011-03-08 03:38:06.910996 7f1b658b0910 filestore(/mnt/osd) getattr /mnt/osd/current/1.60_head/200.0001b88d_head '_' = 153
2011-03-08 03:38:06.911164 7f1b658b0910 filestore(/mnt/osd) stat /mnt/osd/current/1.60_head/200.0001b88d_head = 0
2011-03-08 03:39:22.894022 7f1b660b1910 filestore(/mnt/osd) getattr /mnt/osd/current/1.60_head/200.0001b88d_head '_'
2011-03-08 03:39:22.894167 7f1b660b1910 filestore(/mnt/osd) getattr /mnt/osd/current/1.60_head/200.0001b88d_head '_' = 153
2011-03-08 03:39:22.894459 7f1b660b1910 filestore(/mnt/osd) read /mnt/osd/current/1.60_head/200.0001b88d_head 0~1405350
2011-03-08 03:39:22.896465 7f1b660b1910 filestore(/mnt/osd) FileStore::read /mnt/osd/current/1.60_head/200.0001b88d_head 0~1405350/1405350
This ending offset (from the rewrite) is where the journal replay fails!
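A tiny sketch (not Ceph code; assumed length-prefixed framing for illustration) of why replay aborts at that offset: journal entries carry a length header, so once replay runs past the last valid entry into zero-filled or stale space, it reads a 0-length entry and gives up, matching the "try_read_entry got 0 len entry" message.

```python
# Toy model of length-prefixed journal framing (assumption for
# illustration; not the actual Journaler wire format).
import struct

def write_entry(buf, payload):
    # Each entry is a 4-byte little-endian length followed by the payload.
    buf += struct.pack("<I", len(payload)) + payload
    return buf

def try_read_entry(buf, offset):
    if offset + 4 > len(buf):
        return None, offset
    (length,) = struct.unpack_from("<I", buf, offset)
    if length == 0:
        # Corresponds to "got 0 len entry at offset ..." -> replay aborts
        raise ValueError(f"0 len entry at offset {offset}")
    return buf[offset + 4 : offset + 4 + length], offset + 4 + length

journal = bytearray()
journal = write_entry(journal, b"event-1")
journal = write_entry(journal, b"event-2")
journal += bytes(16)  # zero-filled region past the last valid write

offset, entries = 0, []
try:
    while True:
        entry, offset = try_read_entry(journal, offset)
        if entry is None:
            break
        entries.append(entry)
except ValueError as e:
    print(len(entries), e)  # two valid entries, then abort at the zeroes
```

The key correspondence: replay trusts the object size reported by the probe, so if the rewrite ended short of that size, the reader eventually lands in data that was never rewritten.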
More investigation is coming.
Updated by Greg Farnum about 13 years ago
So: the MDS that originally wrote that journal object got blacklisted shortly after writing it, a standby MDS then took over, and that's the one that starts writing the object over from the beginning.
The first time the new MDS ever touches 200.0001b88d is:
2011-03-08 03:35:21.535573 7f1b681b6910 -- 10.14.0.111:6800/4014 <== mds0 10.14.0.101:6802/13896 1502 ==== osd_op(mds0.293:9328 200.0001b88d [write 0~1595] 1.3060) v1 ==== 120+0+1595 (1654109329 0 228038280) 0x291cb40 con 0x7fc5500
i.e., a write. It should get read on probe, but for some reason isn't...
Updated by Greg Farnum about 13 years ago
Okay, here's the problem. The previous object is 200.0001b88c, on osd0. It is first touched by mds0.293 at:
2011-03-08 03:33:28.831406 7f31e5ade910 -- 10.14.0.104:6800/9682 <== mds0 10.14.0.101:6802/13896 1 ==== osd_op(mds0.293:7 200.0001b88c [stat 0~0] 1.cd9) v1 ==== 120+0+0 (2444897950 0 0) 0x5201000 con 0x448b640
and the response is sent out at:
2011-03-08 03:33:28.837780 7f31e31d8910 -- 10.14.0.104:6800/9682 --> 10.14.0.101:6802/13896 -- osd_op_reply(7 200.0001b88c [stat 0~0] = 0) v1 -- ?+16 0x6cb5700 con 0x448b640
But the last write from mds0.292 is completed later, on a sub_op_reply:
2011-03-08 03:33:30.087922 7f31e39d9910 osd0 26692 dequeue_op osd_sub_op_reply(mds0.292:7918 1.d9 200.0001b88c/head [] ondisk = 0) v1 pg pg[1.d9( v 26691'140462 (26691'140179,26691'140462] n=7429 ec=2 les=26576 26570/26570/26561) [0,3] r=0 mlcod 26691'140460 active+clean], 6 more pending
Sam suggests that forcing a read-modify-write cycle would make sure the new MDS got the correct information on the stat. We could do this by, e.g., writing an xattr with the MDS name and version (mds0.293) on every object before we stat it. This isn't a very elegant solution and it wouldn't play too nicely with standby-replay, although I could hack something up. We'll need to do something, though!
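The ordering argument behind Sam's suggestion can be shown with a toy model (hypothetical semantics, not the Ceph implementation): a plain read-class stat can be serviced while a write is still in flight, but a write-class op is ordered behind in-flight writes, so a stat that takes the write path observes the completed data.

```python
# Toy model of op ordering on an OSD object (assumed semantics for
# illustration; not actual Ceph code). Reads see only committed bytes;
# write-class ops drain the pending write queue first.
class ModelOSDObject:
    def __init__(self):
        self.committed = 0   # committed object size in bytes
        self.pending = []    # end offsets of in-flight writes

    def submit_write(self, end_offset):
        self.pending.append(end_offset)

    def flush(self):
        for end in self.pending:
            self.committed = max(self.committed, end)
        self.pending.clear()

    def stat(self, as_write_class=False):
        if as_write_class:
            # Write-class ops are ordered behind in-flight writes,
            # so they force those writes to complete first.
            self.flush()
        return self.committed

obj = ModelOSDObject()
obj.submit_write(1595)                 # mds0.292's last write, in flight
print(obj.stat())                      # plain stat: sees stale size 0
print(obj.stat(as_write_class=True))   # write-class stat: sees 1595
```

This is the same effect as writing an xattr before the stat: any write-class operation serializes behind the outstanding write, and everything after it sees the result.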
Even with this, though, I'm a little concerned about the consequences of separating the MDS switchover (the MDSMap) from the blacklisting mechanism (the OSDMap). I think it's possible for object 200.c to be on OSD3, for object 200.d to be on OSD4, and for OSD3 to get the new OSDMap earlier than OSD4 so that object 200.c gets rejected but object 200.d gets written down. Am I wrong about this possibility? If not, what's a good mechanism for dealing with it?
Updated by Greg Farnum about 13 years ago
After discussion with Sage, it turns out that we don't need to worry about later blocks getting accepted but earlier blocks getting rejected: the MDS journaler operates strictly linearly, so later writes won't be acked to the client. Hurray, no data loss!
This does mean that the clients will need to replay ops that, strictly speaking, made it to disk, but any workaround to avoid that would be immensely complicated, and not nearly worth it given how unlikely that race is.
So, to resolve: We will mark the probe's stat operations as may_write, forcing any in-progress writes to complete before the stat does.
To deal with possibly stale data ahead of the write position, we will truncate one object ahead whenever we open a new object for logging, and force the MDS to wait to finish the current object until that truncation has completed. (Or, if that performs badly or is too complicated, use some stronger variant that maintains the invariant: we must know we have not accidentally written up to old trash data.)
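The truncate-ahead invariant above can be sketched as follows (assumed semantics for illustration; this is not the actual Journaler code, and all names are made up):

```python
# Toy model of the truncate-ahead plan (not Ceph code): before finishing
# writes to object N, object N+1 must have been truncated, so a later
# probe can never mistake stale data from an old MDS incarnation for
# valid journal contents.
class ToyJournal:
    def __init__(self):
        self.objects = {}        # object index -> bytes written
        self.truncated = set()   # objects known to have been truncated

    def open_object(self, n):
        # Opening object n for logging truncates object n+1 ahead of time.
        self.objects.pop(n + 1, None)
        self.truncated.add(n + 1)

    def write(self, n, data):
        # Invariant: no stale data may remain ahead of the write head.
        assert n + 1 in self.truncated or n + 1 not in self.objects, \
            "stale data ahead of the write position"
        self.objects[n] = self.objects.get(n, b"") + data

# A previous MDS incarnation left stale data in object 6:
j = ToyJournal()
j.objects[6] = b"stale-from-old-mds"
j.open_object(5)        # new MDS opens object 5; object 6 is truncated
j.write(5, b"entry")
print(6 in j.objects)   # → False: the stale data is gone
```

The cost is one extra truncate per journal object plus a wait before the object boundary, which is why the comment leaves room for a stronger variant if that performs badly.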
Updated by Greg Farnum about 13 years ago
- Status changed from In Progress to Resolved
The misread size got fixed by commit: d3464af7d046cfff887127710ad9656b4877c452.
The issue with lost data is an unlikely race, so I've put it into a separate bug #873, so we can resolve this particular one as fixed. Hurray!
Updated by John Spray over 7 years ago
- Project changed from Ceph to CephFS
- Category deleted (1)
Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.