Bug #5250

closed

ceph-mds 0.61.2 aborts on start

Added by Jérôme Poulin almost 11 years ago. Updated almost 8 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After rebooting the whole cluster using the "shut the breaker off" method, I had some BTRFS corruption, which was fixed using btrfs scrub, and OSD corruption, which was fixed using pg repair. Afterward, the cluster was HEALTH_WARN with all PGs active+clean; then I tried starting the MDS and it aborts.

2013-06-04 18:07:46.983018 7fadd60617c0 0 ceph version 0.61.2 (fea782543a844bb277ae94d3391788b76c5bee60), process ceph-mds, pid 11826
starting mds.1 at :/0
2013-06-04 18:07:47.387643 7fadd0aa6700 1 mds.-1.0 handle_mds_map standby
2013-06-04 18:07:47.704154 7fadd0aa6700 1 mds.0.45 handle_mds_map i am now mds.0.45
2013-06-04 18:07:47.704159 7fadd0aa6700 1 mds.0.45 handle_mds_map state change up:standby --> up:replay
2013-06-04 18:07:47.704171 7fadd0aa6700 1 mds.0.45 replay_start
2013-06-04 18:07:47.704190 7fadd0aa6700 1 mds.0.45 recovery set is
2013-06-04 18:07:47.704193 7fadd0aa6700 1 mds.0.45 need osdmap epoch 1710, have 1709
2013-06-04 18:07:47.704196 7fadd0aa6700 1 mds.0.45 waiting for osdmap 1710 (which blacklists prior instance)
2013-06-04 18:07:47.711616 7fadd0aa6700 0 mds.0.cache creating system inode with ino:100
2013-06-04 18:07:47.712032 7fadd0aa6700 0 mds.0.cache creating system inode with ino:1
mds/journal.cc: In function 'void EMetaBlob::replay(MDS*, LogSegment*, MDSlaveUpdate*)' thread 7fadcd594700 time 2013-06-04 18:07:49.260303
mds/journal.cc: 1170: FAILED assert(in->first == p->dnfirst || (in->is_multiversion() && in->first > p->dnfirst))
ceph version 0.61.2 (fea782543a844bb277ae94d3391788b76c5bee60)
1: (EMetaBlob::replay(MDS*, LogSegment*, MDSlaveUpdate*)+0x3af5) [0x523235]
2: (EUpdate::replay(MDS*)+0x3a) [0x52bd9a]
3: (MDLog::_replay_thread()+0x5cf) [0x6f42cf]
4: (MDLog::ReplayThread::entry()+0xd) [0x505cbd]
5: (()+0x7f8e) [0x7fadd5c3ef8e]
6: (clone()+0x6d) [0x7fadd455be1d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2013-06-04 18:07:49.261299 7fadcd594700 -1 mds/journal.cc: In function 'void EMetaBlob::replay(MDS*, LogSegment*, MDSlaveUpdate*)' thread 7fadcd594700 time 2013-06-04 18:07:49.260303
mds/journal.cc: 1170: FAILED assert(in->first == p->dnfirst || (in->is_multiversion() && in->first > p->dnfirst))

ceph version 0.61.2 (fea782543a844bb277ae94d3391788b76c5bee60)
1: (EMetaBlob::replay(MDS*, LogSegment*, MDSlaveUpdate*)+0x3af5) [0x523235]
2: (EUpdate::replay(MDS*)+0x3a) [0x52bd9a]
3: (MDLog::_replay_thread()+0x5cf) [0x6f42cf]
4: (MDLog::ReplayThread::entry()+0xd) [0x505cbd]
5: (()+0x7f8e) [0x7fadd5c3ef8e]
6: (clone()+0x6d) [0x7fadd455be1d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
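
For readers less familiar with the repair steps mentioned in the description, the commands involved would have looked roughly like this; the OSD path and PG ID below are illustrative, not taken from this cluster:

# scrub the BTRFS filesystem backing an OSD (path is an example)
btrfs scrub start /var/lib/ceph/osd/ceph-0
btrfs scrub status /var/lib/ceph/osd/ceph-0
# ask an inconsistent placement group to repair itself (PG ID is an example)
ceph pg repair 2.1f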

Files

mds.log.xz (7.16 MB), Jérôme Poulin, 06/04/2013 12:34 PM
Actions #1

Updated by Jérôme Poulin almost 11 years ago

Full log at pastebin.com : http://pastebin.com/9YPMjw0t

Actions #2

Updated by Greg Farnum almost 11 years ago

  • Project changed from Ceph to CephFS
  • Category changed from 1 to 47
  • Source changed from other to Community (user)

Can you provide the output of "ceph -s" as well, please? And start up an MDS daemon after setting "debug mds = 20" and "debug journaler = 20" in its config file or on the command line; that will dump a lot more useful information to the log than what's currently there.
It looks like you're running multiple active MDS servers?
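
(For reference, a sketch of how those debug settings might be applied; the config keys are as quoted above, and the command-line spellings mirror the --debug_mds flag used later in this thread, so treat the exact option names as assumptions:)

# in ceph.conf, under the MDS section:
[mds]
    debug mds = 20
    debug journaler = 20

# or as one-off overrides when starting the daemon:
ceph-mds -i 1 -d --debug_mds 20 --debug_journaler 20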

Actions #3

Updated by Jérôme Poulin almost 11 years ago

I'm running a single MDS on the same server as a MON and an OSD. We're not using the FS very much, just testing; this cluster is mostly used for RBD devices.

root@Ceph1:~# ceph -s
health HEALTH_WARN mds cluster is degraded
monmap e1: 3 mons at {1=10.10.4.231:6789/0,2=10.10.4.232:6789/0,3=10.10.4.233:6789/0}, election epoch 486, quorum 0,1,2 1,2,3
osdmap e1720: 6 osds: 6 up, 6 in
pgmap v608429: 896 pgs: 896 active+clean; 84817 MB data, 291 GB used, 1501 GB / 1800 GB avail; 43952B/s wr, 5op/s
mdsmap e1646: 1/1/1 up {0=1=up:replay}

I attached an XZ compressed version of the mds.log because it is 229 MB.

Actions #4

Updated by Jérôme Poulin almost 11 years ago

Is it useful for me to keep the FS in this state much longer? Right now the FS is unusable. Is it possible to clear the log?

Actions #5

Updated by Sage Weil almost 11 years ago

Probably the workaround is to comment out that assert.
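
(The assert in question is the one quoted in the backtrace above, mds/journal.cc:1170 in 0.61.x. A sketch of the workaround, purely for illustration; note that it only bypasses the consistency check and does not repair the underlying journal corruption:)

// mds/journal.cc, inside EMetaBlob::replay() -- line number varies by release
// assert(in->first == p->dnfirst ||
//        (in->is_multiversion() && in->first > p->dnfirst));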

Actions #6

Updated by Sage Weil almost 11 years ago

Were you using the MDS snapshots?

Actions #7

Updated by Jérôme Poulin almost 11 years ago

I'll try commenting out the assert, and yes, we tried the snapshots feature of the MDS hours before the shutdown.

Actions #8

Updated by Jérôme Poulin almost 11 years ago

I noticed that resetting the MDS journal using ceph-mds -i 1 --reset-journal 0 -d hangs:

root@Ceph1:/usr/src/ceph# ceph-mds -i 1 --reset-journal 0 -d --debug_mds 20 --debug_ms 1
2013-06-11 13:31:36.335546 7f6d12dfb7c0  0 ceph version 0.61.3 (92b1e398576d55df8e5888dd1a9545ed3fd99532), process ceph-mds, pid 27919
2013-06-11 13:31:36.339916 7f6d12dfb7c0  1 accepter.accepter.bind my_inst.addr is 0.0.0.0:6803/27919 need_addr=1
2013-06-11 13:31:36.340050 7f6d12dfb7c0  0 resetting journal
2013-06-11 13:31:36.340169 7f6d12dfb7c0  1 accepter.accepter.start
2013-06-11 13:31:36.340205 7f6d12dfb7c0  1 -- 0.0.0.0:6803/27919 messenger.start
2013-06-11 13:31:36.340839 7f6d12dfb7c0  1 -- 0.0.0.0:6803/27919 --> 10.10.4.232:6789/0 -- auth(proto 0 26 bytes epoch 0) v1 -- ?+0 0x33106c0 con 0x3320dc0
2013-06-11 13:31:36.341612 7f6d12df9700  1 -- 10.10.4.231:6803/27919 learned my addr 10.10.4.231:6803/27919
2013-06-11 13:31:36.343008 7f6d0e842700  1 -- 10.10.4.231:6803/27919 <== mon.1 10.10.4.232:6789/0 1 ==== mon_map v1 ==== 473+0+0 (3706936977 0 0) 0x33663c0 con 0x3320dc0
2013-06-11 13:31:36.343164 7f6d0e842700  1 -- 10.10.4.231:6803/27919 <== mon.1 10.10.4.232:6789/0 2 ==== auth_reply(proto 2 0 Success) v1 ==== 33+0+0 (645818730 0 0) 0x331b400 con 0x3320dc0
2013-06-11 13:31:36.343474 7f6d0e842700  1 -- 10.10.4.231:6803/27919 --> 10.10.4.232:6789/0 -- auth(proto 2 32 bytes epoch 0) v1 -- ?+0 0x3310900 con 0x3320dc0
2013-06-11 13:31:36.344528 7f6d0e842700  1 -- 10.10.4.231:6803/27919 <== mon.1 10.10.4.232:6789/0 3 ==== auth_reply(proto 2 0 Success) v1 ==== 206+0+0 (1191764472 0 0) 0x331b800 con 0x3320dc0
2013-06-11 13:31:36.344720 7f6d0e842700  1 -- 10.10.4.231:6803/27919 --> 10.10.4.232:6789/0 -- auth(proto 2 165 bytes epoch 0) v1 -- ?+0 0x3310d80 con 0x3320dc0
2013-06-11 13:31:36.345975 7f6d0e842700  1 -- 10.10.4.231:6803/27919 <== mon.1 10.10.4.232:6789/0 4 ==== auth_reply(proto 2 0 Success) v1 ==== 580+0+0 (181571411 0 0) 0x331b600 con 0x3320dc0
2013-06-11 13:31:36.346157 7f6d0e842700  1 -- 10.10.4.231:6803/27919 --> 10.10.4.232:6789/0 -- mon_subscribe({monmap=0+}) v2 -- ?+0 0x3340000 con 0x3320dc0
2013-06-11 13:31:36.346225 7f6d0e842700  1 -- 10.10.4.231:6803/27919 --> 10.10.4.232:6789/0 -- auth(proto 2 2 bytes epoch 0) v1 -- ?+0 0x3310b40 con 0x3320dc0
2013-06-11 13:31:36.346455 7f6d12dfb7c0  1 -- 10.10.4.231:6803/27919 --> 10.10.4.232:6789/0 -- mon_subscribe({monmap=2+,osdmap=0}) v2 -- ?+0 0x3340380 con 0x3320dc0
2013-06-11 13:31:36.346856 7f6d0e842700  1 -- 10.10.4.231:6803/27919 <== mon.1 10.10.4.232:6789/0 5 ==== mon_map v1 ==== 473+0+0 (3706936977 0 0) 0x33665a0 con 0x3320dc0
2013-06-11 13:31:36.346954 7f6d0e842700  1 -- 10.10.4.231:6803/27919 <== mon.1 10.10.4.232:6789/0 6 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (3723835308 0 0) 0x3340000 con 0x3320dc0
2013-06-11 13:31:36.347147 7f6d0e842700  1 -- 10.10.4.231:6803/27919 <== mon.1 10.10.4.232:6789/0 7 ==== auth_reply(proto 2 0 Success) v1 ==== 194+0+0 (1486132023 0 0) 0x331ba00 con 0x3320dc0
2013-06-11 13:31:36.347563 7f6d0e842700  1 -- 10.10.4.231:6803/27919 <== mon.1 10.10.4.232:6789/0 8 ==== osd_map(2316..2316 src has 1579..2316) v3 ==== 4840+0+0 (2105977932 0 0) 0x3311200 con 0x3320dc0
Actions #9

Updated by Jérôme Poulin almost 11 years ago

Removing the assert worked around the problem:

root@Ceph1:/usr/src/ceph/src# ./ceph-mds -i 1 -d
2013-06-11 14:09:58.210478 7fc606ad1780  0 starting mds.1 at :/0
ceph version 0.61.3 (92b1e398576d55df8e5888dd1a9545ed3fd99532), process ceph-mds, pid 15527
2013-06-11 14:09:58.648021 7fc5fd896700  1 mds.-1.0 handle_mds_map standby
2013-06-11 14:09:59.106287 7fc5fd896700  1 mds.0.60 handle_mds_map i am now mds.0.60
2013-06-11 14:09:59.106291 7fc5fd896700  1 mds.0.60 handle_mds_map state change up:standby --> up:replay
2013-06-11 14:09:59.106297 7fc5fd896700  1 mds.0.60 replay_start
2013-06-11 14:09:59.106309 7fc5fd896700  1 mds.0.60  recovery set is 
2013-06-11 14:09:59.106311 7fc5fd896700  1 mds.0.60  need osdmap epoch 2317, have 2316
2013-06-11 14:09:59.106314 7fc5fd896700  1 mds.0.60  waiting for osdmap 2317 (which blacklists prior instance)
2013-06-11 14:09:59.216224 7fc5fd896700  0 mds.0.cache creating system inode with ino:100
2013-06-11 14:09:59.216605 7fc5fd896700  0 mds.0.cache creating system inode with ino:1
2013-06-11 14:10:01.005481 7fc5f63f1700  1 mds.0.60 replay_done
2013-06-11 14:10:01.005507 7fc5f63f1700  1 mds.0.60 making mds journal writeable
2013-06-11 14:10:01.180302 7fc5fd896700  1 mds.0.60 handle_mds_map i am now mds.0.60
2013-06-11 14:10:01.180316 7fc5fd896700  1 mds.0.60 handle_mds_map state change up:replay --> up:reconnect
2013-06-11 14:10:01.180335 7fc5fd896700  1 mds.0.60 reconnect_start
2013-06-11 14:10:01.180337 7fc5fd896700  1 mds.0.60 reopen_log
2013-06-11 14:10:01.180346 7fc5fd896700  1 mds.0.server reconnect_clients -- 5 sessions
2013-06-11 14:10:48.306602 7fc5f77fe700  1 mds.0.server reconnect gave up on client.7450 10.10.4.13:0/2982
2013-06-11 14:10:48.306626 7fc5f77fe700  1 mds.0.server reconnect gave up on client.7815 10.10.4.131:0/11089
2013-06-11 14:10:48.306633 7fc5f77fe700  1 mds.0.server reconnect gave up on client.10134 10.10.4.233:0/7686
2013-06-11 14:10:48.306638 7fc5f77fe700  1 mds.0.server reconnect gave up on client.10216 10.10.4.234:0/15879
2013-06-11 14:10:48.306643 7fc5f77fe700  1 mds.0.server reconnect gave up on client.10334 10.10.4.235:0/15894
2013-06-11 14:10:48.306658 7fc5f77fe700  1 mds.0.60 reconnect_done
2013-06-11 14:10:48.459450 7fc5fd896700  1 mds.0.60 handle_mds_map i am now mds.0.60
2013-06-11 14:10:48.459460 7fc5fd896700  1 mds.0.60 handle_mds_map state change up:reconnect --> up:rejoin
2013-06-11 14:10:48.459466 7fc5fd896700  1 mds.0.60 rejoin_joint_start
2013-06-11 14:10:48.467054 7fc5fd896700  1 mds.0.60 rejoin_done
2013-06-11 14:10:48.600952 7fc5fd896700  1 mds.0.60 handle_mds_map i am now mds.0.60
2013-06-11 14:10:48.600954 7fc5fd896700  1 mds.0.60 handle_mds_map state change up:rejoin --> up:active
2013-06-11 14:10:48.600962 7fc5fd896700  1 mds.0.60 recovery_done -- successful recovery!
2013-06-11 14:10:48.601807 7fc5fd896700  1 mds.0.60 active_start
2013-06-11 14:10:48.602340 7fc5fd896700  1 mds.0.60 cluster recovered.

Actions #10

Updated by Jérôme Poulin almost 11 years ago

I'm still using the cluster with the modified ceph-mds program, and it still works. I caused another power outage (this is a test cluster; we're testing its robustness) and found that ceph-mds still won't start, so it seems the corruption remains. I had to restart the modified ceph-mds before CephFS started working again.

Actions #11

Updated by Chris Clayton almost 11 years ago

We have hit a very similar problem with v0.61.2. We are unable to start any MDS daemons following testing that involved killing off all the MDS daemons.

On attempted restart of the daemon, /var/log/ceph/ceph-mds.<ID>.log shows:

2013-06-20 09:43:45.700482 7f933027f780 0 ceph version 0.61.2 (fea782543a844bb277ae94d3391788b76c5bee60), process ceph-mds, pid 17029
2013-06-20 09:43:45.712140 7f932ce3b700 1 mds.-1.0 handle_mds_map standby
2013-06-20 09:44:01.359406 7f932ce3b700 1 mds.0.22 handle_mds_map i am now mds.0.22
2013-06-20 09:44:01.359407 7f932ce3b700 1 mds.0.22 handle_mds_map state change up:standby --> up:replay
2013-06-20 09:44:01.359410 7f932ce3b700 1 mds.0.22 replay_start
2013-06-20 09:44:01.359413 7f932ce3b700 1 mds.0.22 recovery set is
2013-06-20 09:44:01.359414 7f932ce3b700 1 mds.0.22 need osdmap epoch 871, have 870
2013-06-20 09:44:01.359415 7f932ce3b700 1 mds.0.22 waiting for osdmap 871 (which blacklists prior instance)
2013-06-20 09:44:01.441461 7f932ce3b700 0 mds.0.cache creating system inode with ino:100
2013-06-20 09:44:01.441624 7f932ce3b700 0 mds.0.cache creating system inode with ino:1
2013-06-20 09:44:01.448373 7f9328e28700 1 mds/journal.cc: In function 'void EMetaBlob::replay(MDS*, LogSegment*, MDSlaveUpdate*)' thread 7f9328e28700 time 2013-06-20 09:44:01.447845
mds/journal.cc: 1170: FAILED assert(in->first == p->dnfirst || (in->is_multiversion() && in->first > p->dnfirst))

In the thread above, Jérôme stated that:

I'll try commenting out the assert, and yes, we tried the snapshots feature of the MDS hours before the shutdown.

We too had just tried the snapshot feature for the first time the day before. This may just be a coincidence; not sure, but I wanted to mention it.

Are there any suggestions on how we can clear the underlying corruption, rather than commenting out the assert (which, from Jérôme's comment a few days back, does not seem to work)?

Actions #12

Updated by Greg Farnum almost 11 years ago

Unfortunately this is an area where CephFS needs some hardening and some recovery tools — part of why we don't recommend it outside of testing environments right now.
Things will be significantly more stable if you don't use any snapshot features, though.

Actions #13

Updated by Chris Clayton almost 11 years ago

This is all in the lab at present.

We have been doing some additional testing, and have now confirmed that this issue is caused by using the snapshot feature.

In brief, we:

1) Rebuilt the cluster from scratch
2) Did some IO to the file system
3) Killed all the MDS daemons, first the active one, then the standby after failover had occurred. There were some 10-second pauses as MDS daemon failover occurred, and then IO eventually hung permanently when the last MDS daemon was killed.
4) Restarted an MDS daemon. IO continued.
5) Restarted second MDS daemon. All good.

Then we ran the following snapshot commands:
$ ls -la
total 6
drwxrwxrwx 1 ch83287 root       5 Jun 19 07:38 .
drwx------ 3 ch83287 ch83287 4096 Jun 18 06:54 ..
-rw-rw-rw- 1 ch83287 ch83287    3 Dec  7  2012 chrisfile
-rw-rw-r-- 1 ch83287 ch83287    5 Jun 18 06:54 chris.txt
-rw-rw-r-- 1 ch83287 ch83287    0 Jun 19 07:38 newfile
$ mkdir .snap/one
$ ls .snap
one
$ rm newfile
$ ls -F .
chrisfile chris.txt
$ ls -F .snap/one
chrisfile chris.txt newfile

Now when we killed the active MDS daemon, the standby on another server also died (sic!) and all IO to the filesystem stopped. We tried to restart the MDS daemons (on both servers) but neither would restart, and the ceph-mds log file showed the same errors that we mention above.

So we conclude that it is definitely the above snapshot activity that is causing this issue, assuming you had not already decided that.

Actions #14

Updated by Greg Farnum almost 11 years ago

  • Status changed from New to 12

Wow, that is a much simpler test case than I would expect to be required. I can reproduce with a single file and this bug is definitely a regression from when we were testing snapshots more regularly. :/
Thank you for the debugging and process to reproduce!

Actions #15

Updated by Jérôme Poulin over 10 years ago

I just updated to 0.67.3 and I want to confirm that I still have to recompile to get around the abort.

0> 2013-10-03 22:44:46.447115 7f957926b700 1 mds/journal.cc: In function 'void EMetaBlob::replay(MDS*, LogSegment*, MDSlaveUpdate*)' thread 7f957926b700 time 2013-10-03 22:44:46.443841
mds/journal.cc: 1096: FAILED assert(in->first == p->dnfirst || (in->is_multiversion() && in->first > p->dnfirst))
ceph version 0.67.3 (408cd61584c72c0d97b774b3d8f95c6b1b06341a)
1: (EMetaBlob::replay(MDS*, LogSegment*, MDSlaveUpdate*)+0x398d) [0x65ad7d]
2: (EUpdate::replay(MDS*)+0x3a) [0x6638fa]
3: (MDLog::_replay_thread()+0x5cf) [0x82ef2f]
4: (MDLog::ReplayThread::entry()+0xd) [0x63918d]
5: (()+0x7f8e) [0x7f958192ff8e]
6: (clone()+0x6d) [0x7f9580030e1d]
Actions #16

Updated by Sage Weil about 10 years ago

  • Status changed from 12 to Can't reproduce
Actions #17

Updated by Jérôme Poulin about 10 years ago

I just wanted to inform you that since 0.7, I no longer have to patch and recompile ceph-mds to get it to start after every update.

Actions #18

Updated by Greg Farnum almost 8 years ago

  • Component(FS) MDS added