Bug #805
mds startup: _replay journaler got error -22, aborting
Status: Closed
Description
As per #803, the assert is now fixed, but on starting up the cluster I now get:
2011-02-14 21:02:14.026899 7f91c9865700 mds0.14 ms_handle_connect on 10.135.211.78:6801/22875
2011-02-14 21:02:59.867133 7f91c9865700 mds0.cache creating system inode with ino:100
2011-02-14 21:02:59.867415 7f91c9865700 mds0.cache creating system inode with ino:1
2011-02-14 21:03:05.579551 7f91c7059700 mds0.journaler try_read_entry got 0 len entry at offset 4051698581
2011-02-14 21:03:05.579615 7f91c7059700 mds0.log _replay journaler got error -22, aborting
2011-02-14 21:03:05.582298 7f91c7059700 mds0.14 boot_start encountered an error, failing
2011-02-14 21:03:05.582323 7f91c7059700 mds0.14 suicide. wanted up:replay, now down:dne
2011-02-14 21:03:05.583656 7f91cc303720 stopped.
See #803 for more logs and other details
Updated by Greg Farnum about 13 years ago
- Assignee set to Greg Farnum
You don't have any other logs from before, do you? At this point there's a problem with the way the MDS journal ended up on-disk. If you could generate one more log with
debug mds = 20
debug journaler = 20
that'd be helpful in diagnosing what could have happened.
Sage also thinks there are going to be fundamental problems with large directory reads until fragmentation gets enabled*, but I'm not sure they would actually manifest like this.
[*] Since directories and their contents are stored in a single object on-disk, when doing an ls it reads the directory into memory. Once your directory gets larger than the MDS cache size, you're going to have problems with cache swapping.
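To make the footnote's point concrete, here is a toy model (not Ceph code; all names are made up for illustration): because the whole directory lives in one object, an `ls` pulls every dentry into the cache at once, and any directory larger than the cache capacity forces evictions on every listing.

```python
# Toy model of the cache-swapping problem described above (not Ceph code).
# A directory stored as a single on-disk object must be loaded wholesale,
# so listing a directory larger than the MDS cache capacity evicts earlier
# entries and the cache thrashes.
from collections import OrderedDict

class TinyLRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.evictions = 0

    def insert(self, dentry):
        if dentry in self.entries:
            self.entries.move_to_end(dentry)  # refresh LRU position
            return
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
            self.evictions += 1
        self.entries[dentry] = True

def list_directory(cache, num_dentries):
    # An `ls` pulls every dentry of the directory into the cache at once.
    for i in range(num_dentries):
        cache.insert(f"dentry{i}")
    return cache.evictions

cache = TinyLRUCache(capacity=1000)
# A directory with 1500 entries overflows a 1000-entry cache: 500 evictions.
print(list_directory(cache, 1500))  # → 500
```

Directory fragmentation addresses this by splitting one logical directory across multiple objects, so only the needed fragments are loaded.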
Updated by John Leach about 13 years ago
- File mds.srv-an56n.log mds.srv-an56n.log added
Updated by John Leach about 13 years ago
Dumped and bzipped the journal here: http://johnleach.co.uk/downloads/ceph/805-journal.dump.tbz2
Updated by Greg Farnum about 13 years ago
Okay, somehow expire < trim. That shouldn't happen.
Updated by Greg Farnum about 13 years ago
- Status changed from New to Can't reproduce
Well, I can't figure out how the header could have gotten corrupted like that, but I've put in a few more checks for it and will consult on it when Sage gets back. :/
Re-open if you see this again!
Updated by Greg Farnum about 13 years ago
- Status changed from Can't reproduce to In Progress
We got this on our playground today. I have found one promising lead. Object 200.0001b88d_head is the last object in the journal, and logs are available for the OSD storing it:

grep "200.0001b88d_head" /var/log/ceph/osd.4.log.1 | less
(partial results only)
2011-03-08 03:32:41.361755 7f1b6b9bd910 filestore(/mnt/osd) write /mnt/osd/current/1.60_head/200.0001b88d_head 1404340~1010
2011-03-08 03:32:41.369422 7f1b6b9bd910 filestore(/mnt/osd) write /mnt/osd/current/1.60_head/200.0001b88d_head 1404340~1010 = 1010
2011-03-08 03:32:41.369441 7f1b6b9bd910 filestore(/mnt/osd) setattr /mnt/osd/current/1.60_head/200.0001b88d_head '_' len 153
2011-03-08 03:32:41.369485 7f1b6b9bd910 filestore(/mnt/osd) setattr /mnt/osd/current/1.60_head/200.0001b88d_head '_' len 153 = 153
2011-03-08 03:32:41.369497 7f1b6b9bd910 filestore(/mnt/osd) setattr /mnt/osd/current/1.60_head/200.0001b88d_head 'snapset' len 26
2011-03-08 03:32:41.369536 7f1b6b9bd910 filestore(/mnt/osd) setattr /mnt/osd/current/1.60_head/200.0001b88d_head 'snapset' len 26 = 26
2011-03-08 03:35:21.535741 7f1b660b1910 filestore(/mnt/osd) getattr /mnt/osd/current/1.60_head/200.0001b88d_head '_'
2011-03-08 03:35:21.535890 7f1b660b1910 filestore(/mnt/osd) getattr /mnt/osd/current/1.60_head/200.0001b88d_head '_' = 153
2011-03-08 03:35:21.535905 7f1b660b1910 filestore(/mnt/osd) getattr /mnt/osd/current/1.60_head/200.0001b88d_head 'snapset'
2011-03-08 03:35:21.535924 7f1b660b1910 filestore(/mnt/osd) getattr /mnt/osd/current/1.60_head/200.0001b88d_head 'snapset' = 26
2011-03-08 03:35:27.778222 7f1b6b1bc910 filestore(/mnt/osd) write /mnt/osd/current/1.60_head/200.0001b88d_head 0~1595
2011-03-08 03:35:27.778614 7f1b6b1bc910 filestore(/mnt/osd) write /mnt/osd/current/1.60_head/200.0001b88d_head 0~1595 = 1595
Looks like the journal entry gets written out to a certain point, then writing starts all over again on the same object -- did the MDS crash, and did this object get lost track of during replay? Then, while overwriting this object, it doesn't overwrite all the data:
2011-03-08 03:37:47.204260 7f1b6b1bc910 filestore(/mnt/osd) write /mnt/osd/current/1.60_head/200.0001b88d_head 690703~1021
2011-03-08 03:37:47.204314 7f1b6b1bc910 filestore(/mnt/osd) write /mnt/osd/current/1.60_head/200.0001b88d_head 690703~1021 = 1021
2011-03-08 03:37:47.204328 7f1b6b1bc910 filestore(/mnt/osd) setattr /mnt/osd/current/1.60_head/200.0001b88d_head '_' len 153
2011-03-08 03:37:47.204369 7f1b6b1bc910 filestore(/mnt/osd) setattr /mnt/osd/current/1.60_head/200.0001b88d_head '_' len 153 = 153
2011-03-08 03:37:47.204380 7f1b6b1bc910 filestore(/mnt/osd) setattr /mnt/osd/current/1.60_head/200.0001b88d_head 'snapset' len 26
2011-03-08 03:37:47.204421 7f1b6b1bc910 filestore(/mnt/osd) setattr /mnt/osd/current/1.60_head/200.0001b88d_head 'snapset' len 26 = 26
2011-03-08 03:38:06.910883 7f1b658b0910 filestore(/mnt/osd) getattr /mnt/osd/current/1.60_head/200.0001b88d_head '_'
2011-03-08 03:38:06.910996 7f1b658b0910 filestore(/mnt/osd) getattr /mnt/osd/current/1.60_head/200.0001b88d_head '_' = 153
2011-03-08 03:38:06.911164 7f1b658b0910 filestore(/mnt/osd) stat /mnt/osd/current/1.60_head/200.0001b88d_head = 0
2011-03-08 03:39:22.894022 7f1b660b1910 filestore(/mnt/osd) getattr /mnt/osd/current/1.60_head/200.0001b88d_head '_'
2011-03-08 03:39:22.894167 7f1b660b1910 filestore(/mnt/osd) getattr /mnt/osd/current/1.60_head/200.0001b88d_head '_' = 153
2011-03-08 03:39:22.894459 7f1b660b1910 filestore(/mnt/osd) read /mnt/osd/current/1.60_head/200.0001b88d_head 0~1405350
2011-03-08 03:39:22.896465 7f1b660b1910 filestore(/mnt/osd) FileStore::read /mnt/osd/current/1.60_head/200.0001b88d_head 0~1405350/1405350
This ending offset (from the rewrite) is where the journal replay fails!
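A tiny sketch (not Ceph code; assumed length-prefixed framing for illustration) of why replay aborts at that offset: journal entries carry a length header, so once replay runs past the last valid entry into zero-filled or stale space, it reads a 0-length entry and gives up, matching the "try_read_entry got 0 len entry" message.

```python
# Toy model of length-prefixed journal framing (assumption for
# illustration; not the actual Journaler wire format).
import struct

def write_entry(buf, payload):
    # Each entry is a 4-byte little-endian length followed by the payload.
    buf += struct.pack("<I", len(payload)) + payload
    return buf

def try_read_entry(buf, offset):
    if offset + 4 > len(buf):
        return None, offset
    (length,) = struct.unpack_from("<I", buf, offset)
    if length == 0:
        # Corresponds to "got 0 len entry at offset ..." -> replay aborts
        raise ValueError(f"0 len entry at offset {offset}")
    return buf[offset + 4 : offset + 4 + length], offset + 4 + length

journal = bytearray()
journal = write_entry(journal, b"event-1")
journal = write_entry(journal, b"event-2")
journal += bytes(16)  # zero-filled region past the last valid write

offset, entries = 0, []
try:
    while True:
        entry, offset = try_read_entry(journal, offset)
        if entry is None:
            break
        entries.append(entry)
except ValueError as e:
    print(len(entries), e)  # two valid entries, then abort at the zeroes
```

The key correspondence: replay trusts the object size reported by the probe, so if the rewrite ended short of that size, the reader eventually lands in data that was never rewritten.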
More investigation is coming.
Updated by Greg Farnum about 13 years ago
So: the MDS that originally wrote that journal object got blacklisted shortly after writing it, a standby MDS then took over, and that's the one that starts writing the object over from the beginning.
The first time the new MDS ever touches 200.0001b88d is:
2011-03-08 03:35:21.535573 7f1b681b6910 -- 10.14.0.111:6800/4014 <== mds0 10.14.0.101:6802/13896 1502 ==== osd_op(mds0.293:9328 200.0001b88d [write 0~1595] 1.3060) v1 ==== 120+0+1595 (1654109329 0 228038280) 0x291cb40 con 0x7fc5500
i.e., a write. It should get read on probe, but for some reason isn't...
Updated by Greg Farnum about 13 years ago
Okay, here's the problem. The previous object is 200.0001b88c, on osd0. It is first touched by mds0.293 at:
2011-03-08 03:33:28.831406 7f31e5ade910 -- 10.14.0.104:6800/9682 <== mds0 10.14.0.101:6802/13896 1 ==== osd_op(mds0.293:7 200.0001b88c [stat 0~0] 1.cd9) v1 ==== 120+0+0 (2444897950 0 0) 0x5201000 con 0x448b640
and the response is sent out at:
2011-03-08 03:33:28.837780 7f31e31d8910 -- 10.14.0.104:6800/9682 --> 10.14.0.101:6802/13896 -- osd_op_reply(7 200.0001b88c [stat 0~0] = 0) v1 -- ?+16 0x6cb5700 con 0x448b640
But the last write from mds0.292 is completed later, on a sub_op_reply:
2011-03-08 03:33:30.087922 7f31e39d9910 osd0 26692 dequeue_op osd_sub_op_reply(mds0.292:7918 1.d9 200.0001b88c/head [] ondisk = 0) v1 pg pg[1.d9( v 26691'140462 (26691'140179,26691'140462] n=7429 ec=2 les=26576 26570/26570/26561) [0,3] r=0 mlcod 26691'140460 active+clean], 6 more pending
Sam suggests that forcing a read-modify-write cycle would make sure the new MDS got the correct information on the stat. We could do this by, e.g., writing an xattr with the MDS name and version (mds0.293) on every object before we stat it. This isn't a very elegant solution and it wouldn't play too nicely with standby-replay, although I could hack something up. We'll need to do something, though!
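The ordering argument behind Sam's suggestion can be shown with a toy model (hypothetical semantics, not the Ceph implementation): a plain read-class stat can be serviced while a write is still in flight, but a write-class op is ordered behind in-flight writes, so a stat that takes the write path observes the completed data.

```python
# Toy model of op ordering on an OSD object (assumed semantics for
# illustration; not actual Ceph code). Reads see only committed bytes;
# write-class ops drain the pending write queue first.
class ModelOSDObject:
    def __init__(self):
        self.committed = 0   # committed object size in bytes
        self.pending = []    # end offsets of in-flight writes

    def submit_write(self, end_offset):
        self.pending.append(end_offset)

    def flush(self):
        for end in self.pending:
            self.committed = max(self.committed, end)
        self.pending.clear()

    def stat(self, as_write_class=False):
        if as_write_class:
            # Write-class ops are ordered behind in-flight writes,
            # so they force those writes to complete first.
            self.flush()
        return self.committed

obj = ModelOSDObject()
obj.submit_write(1595)                 # mds0.292's last write, in flight
print(obj.stat())                      # plain stat: sees stale size 0
print(obj.stat(as_write_class=True))   # write-class stat: sees 1595
```

This is the same effect as writing an xattr before the stat: any write-class operation serializes behind the outstanding write, and everything after it sees the result.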
Even with this, though, I'm a little concerned about the consequences of separating the MDS switchover (the MDSMap) from the blacklisting mechanism (the OSDMap). I think it's possible for object 200.c to be on OSD3, for object 200.d to be on OSD4, and for OSD3 to get the new OSDMap earlier than OSD4 so that object 200.c gets rejected but object 200.d gets written down. Am I wrong about this possibility? If not, what's a good mechanism for dealing with it?
Updated by Greg Farnum about 13 years ago
After discussion with Sage, it turns out that we don't need to worry about later blocks getting accepted but earlier blocks getting rejected: the MDS journaler operates strictly linearly, so later writes won't be acked to the client. Hurray, no data loss!
This does mean that the clients will need to replay ops that, strictly speaking, made it to disk, but any workaround to avoid that would be immensely complicated, and not nearly worth it given how unlikely that race is.
So, to resolve: We will mark the probe's stat operations as may_write, forcing any in-progress writes to complete before the stat does.
To deal with possibly stale data ahead of the write position, we will truncate one object ahead whenever we open a new object for logging, and force the MDS to wait to finish the current object until that truncation has completed. (Or, if that performs badly or is too complicated, use some stronger variant that maintains the invariant: we must know we have not accidentally written up to old trash data.)
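The truncate-ahead invariant above can be sketched as follows (assumed semantics for illustration; this is not the actual Journaler code, and all names are made up):

```python
# Toy model of the truncate-ahead plan (not Ceph code): before finishing
# writes to object N, object N+1 must have been truncated, so a later
# probe can never mistake stale data from an old MDS incarnation for
# valid journal contents.
class ToyJournal:
    def __init__(self):
        self.objects = {}        # object index -> bytes written
        self.truncated = set()   # objects known to have been truncated

    def open_object(self, n):
        # Opening object n for logging truncates object n+1 ahead of time.
        self.objects.pop(n + 1, None)
        self.truncated.add(n + 1)

    def write(self, n, data):
        # Invariant: no stale data may remain ahead of the write head.
        assert n + 1 in self.truncated or n + 1 not in self.objects, \
            "stale data ahead of the write position"
        self.objects[n] = self.objects.get(n, b"") + data

# A previous MDS incarnation left stale data in object 6:
j = ToyJournal()
j.objects[6] = b"stale-from-old-mds"
j.open_object(5)        # new MDS opens object 5; object 6 is truncated
j.write(5, b"entry")
print(6 in j.objects)   # → False: the stale data is gone
```

The cost is one extra truncate per journal object plus a wait before the object boundary, which is why the comment leaves room for a stronger variant if that performs badly.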
Updated by Greg Farnum about 13 years ago
- Status changed from In Progress to Resolved
The misread size got fixed by commit: d3464af7d046cfff887127710ad9656b4877c452.
The issue with lost data is an unlikely race, so I've put it into a separate bug #873, so we can resolve this particular one as fixed. Hurray!
Updated by John Spray over 7 years ago
- Project changed from Ceph to CephFS
- Category deleted (1)
Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.