Bug #2173

MDS crash on startup with end of buffer

Added by soft crack about 12 years ago. Updated about 12 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:
0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

My system is Ubuntu 11.10 64-bit. The MDS just crashes on startup.

I noticed the message 'No such file or directory', but I don't know which file it refers to.

See the attachment for the MDS log messages.

mds.1.log (15.9 KB) soft crack, 03/14/2012 04:14 AM

osd.2.log.gz - osd debug messages (2.86 MB) soft crack, 03/15/2012 06:01 PM

pgdump (27.7 KB) soft crack, 03/15/2012 06:01 PM

osdmap (2.28 KB) soft crack, 03/16/2012 04:43 PM

History

#1 Updated by soft crack about 12 years ago

I also tried: 'ceph-mds -i 1 -d --reset-journal 0'.

It just freezes.

#2 Updated by Greg Farnum about 12 years ago

  • Assignee set to Greg Farnum
  • Priority changed from Urgent to Normal

Huh. Is this a new filesystem? Have you had any problems with the RADOS cluster (the OSDs)?

What's happening now is that the MDS is trying to load its sessionmap off disk and gets ENOENT back from the OSD and fails to handle it. That's a new bug (#2179). The only way I can see it happening though is if earlier an MDS did start up but crashed before successfully writing the SessionMap to disk, which would require logs from that run (rather than anything we can do now) to diagnose.
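
If you want to confirm directly that the object is gone, something like the following should do it. This is just a rough sketch using the python-rados bindings; the object name 'mds0_sessionmap' is my assumption for an MDS with rank 0, so adjust it if your setup differs.

# Rough sketch: check whether the MDS sessionmap object is present in the
# 'metadata' pool. The object name 'mds0_sessionmap' is an assumption for
# rank 0; adjust if your setup differs.
import rados

cluster = rados.Rados()
cluster.conf_read_file('/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('metadata')
    try:
        size, mtime = ioctx.stat('mds0_sessionmap')
        print('sessionmap object found, %d bytes' % size)
    except rados.ObjectNotFound:
        # This is the ENOENT the MDS trips over during replay.
        print('sessionmap object is missing (ENOENT)')
    finally:
        ioctx.close()
finally:
    cluster.shutdown()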

If there's no data in the cluster, it's probably easiest to just reformat.

#3 Updated by soft crack about 12 years ago

Thanks for your responses.

I created this Ceph file system with 1 mon, 1 osd, and 1 mds. It worked perfectly, and I wrote about 78 GB of data to the system.

Then I tried to expand to a second OSD server. The new OSD server started up, but only about 10-15 MB of data was written to its disk (according to 'df -h'), and at that point the whole Ceph file system froze.

Then I tried to reformat the new osd several times.

The new OSD server now works with a replication level of 2, but the cluster is not clean. The status message is:

pg v37696: 209 pgs: 185 active+clean, 23 active+remapped, 1 active+degraded; 78000 MB data, 149 GB used, 4019 GB / 4184 GB avail

Could you tell me what this means? Have I lost my data? Is it possible to get it back?

#4 Updated by Greg Farnum about 12 years ago

  • Category changed from 1 to OSD
  • Priority changed from Normal to High
  • Source changed from Development to Community (user)

Well that's exciting; this means it's an OSD bug.
The meaning of that output is that of your 209 PGs, 185 are happy; 23 of them are happy but temporarily have a different primary node set; and 1 of them does not have its objects fully replicated across both OSDs.
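
If it helps, here is a small sketch (plain string parsing, nothing Ceph-specific) that turns one of those pg summary lines into per-state counts:

# Break a "... 209 pgs: 185 active+clean, ..." summary line into per-state
# counts so the clean/remapped/degraded split is easy to see.
import re

def parse_pg_summary(line):
    states = {}
    m = re.search(r'pgs:\s*([^;]+)', line)
    if not m:
        return states
    for part in m.group(1).split(','):
        count, state = part.strip().split(' ', 1)
        states[state] = int(count)
    return states

line = ('pg v37696: 209 pgs: 185 active+clean, 23 active+remapped, '
        '1 active+degraded; 78000 MB data, 149 GB used, 4019 GB / 4184 GB avail')
print(parse_pg_summary(line))
# {'active+clean': 185, 'active+remapped': 23, 'active+degraded': 1}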

Do you have OSD logs available which cover when you added the new OSD? Please attach any that you do and maybe we can track down what happened.
In terms of current status, please give me the full "ceph -s" output and the output of "ceph pg dump". It's easy enough to hack together something that will give you an empty SessionMap and allow things to start back up again, but if you're missing other objects we're going to have trouble.

(I'll try and be available later on IRC tonight if you can get me some of this stuff.)

#5 Updated by soft crack about 12 years ago

ceph -s:

2012-03-16 09:00:10.707189    pg v38781: 210 pgs: 170 active+clean, 18 active+remapped, 22 active+degraded; 75503 MB data, 215 GB used, 3953 GB / 4184 GB avail
2012-03-16 09:00:10.707818   mds e49911: 1/1/1 up {0=1=up:replay(laggy or crashed)}, 2 up:oneshot-replay(laggy or crashed)
2012-03-16 09:00:10.707846   osd e3220: 3 osds: 3 up, 3 in
2012-03-16 09:00:10.707905   log 2012-03-16 08:55:04.635865 mon.0 192.168.10.201:6789/0 103 : [INF] osd.2 192.168.10.201:6803/26781 boot
2012-03-16 09:00:10.707969   mon e3: 1 mons at {1=192.168.10.201:6789/0}

ceph osd dump -o - :

2012-03-16 09:00:10.707189    pg v38781: 210 pgs: 170 active+clean, 18 active+remapped, 22 active+degraded; 75503 MB data, 215 GB used, 3953 GB / 4184 GB avail
2012-03-16 09:00:10.707818   mds e49911: 1/1/1 up {0=1=up:replay(laggy or crashed)}, 2 up:oneshot-replay(laggy or crashed)
2012-03-16 09:00:10.707846   osd e3220: 3 osds: 3 up, 3 in
2012-03-16 09:00:10.707905   log 2012-03-16 08:55:04.635865 mon.0 192.168.10.201:6789/0 103 : [INF] osd.2 192.168.10.201:6803/26781 boot
2012-03-16 09:00:10.707969   mon e3: 1 mons at {1=192.168.10.201:6789/0}

ceph pg dump: see attachment.

ceph log file: see attachment.

#6 Updated by soft crack about 12 years ago

Sorry for the mistake; I pasted the wrong output above.

ceph osd dump -o -:

2012-03-16 09:10:04.887611 mon <- [osd,dump]
2012-03-16 09:10:04.888161 mon.0 -> 'dumped osdmap epoch 3220' (0)
epoch 3220
fsid a743a194-fa91-48fb-8778-e294483273d9
created 2012-03-01 18:06:10.677024
modifed 2012-03-16 08:55:07.597505
flags

pool 0 'data' rep size 3 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 lpg_num 2 lpgp_num 2 last_change 3172 owner 0 crash_replay_interval 45
pool 1 'metadata' rep size 3 crush_ruleset 1 object_hash rjenkins pg_num 64 pgp_num 64 lpg_num 2 lpgp_num 2 last_change 3162 owner 0
pool 2 'rbd' rep size 3 crush_ruleset 2 object_hash rjenkins pg_num 64 pgp_num 64 lpg_num 2 lpgp_num 2 last_change 3160 owner 0

max_osd 3
osd.0 up in weight 1 up_from 3037 up_thru 3219 down_at 3035 last_clean_interval [2865,3034) 192.168.10.205:6800/6301 192.168.10.205:6801/6301 192.168.10.205:6802/6301 exists,up
osd.1 up in weight 1 up_from 3055 up_thru 3217 down_at 3054 last_clean_interval [3013,3053) lost_at 358 192.168.10.201:6800/20518 192.168.10.201:6801/20518 192.168.10.201:6802/20518 exists,up
osd.2 up in weight 1 up_from 3217 up_thru 3218 down_at 3213 last_clean_interval [3211,3212) 192.168.10.201:6803/26781 192.168.10.201:6804/26781 192.168.10.201:6805/26781 exists,up

pg_temp 0.7 [0,2,1]
pg_temp 0.14 [2,1,0]
pg_temp 0.1d [0,2,1]
pg_temp 0.2f [0,2,1]
pg_temp 0.36 [0,2,1]
pg_temp 0.37 [2,1,0]
pg_temp 1.6 [0,2,1]
pg_temp 1.13 [2,1,0]
pg_temp 1.1c [0,2,1]
pg_temp 1.2e [0,2,1]
pg_temp 1.35 [0,2,1]
pg_temp 1.36 [2,1,0]
pg_temp 2.5 [0,2,1]
pg_temp 2.12 [2,1,0]
pg_temp 2.1b [0,2,1]
pg_temp 2.2d [0,2,1]
pg_temp 2.34 [0,2,1]
pg_temp 2.35 [2,1,0]

#7 Updated by soft crack about 12 years ago

I managed to insert an empty sessionmap, and the server continued starting up.
Then I hit an assertion failure:

ceph-mds: mds/MDLog.cc:170: void MDLog::submit_entry(LogEvent*, Context*): Assertion `!segments.empty()' failed.
*** Caught signal (Aborted) **
 in thread 7fcd7e2f3700
 ceph version 0.43 (commit:9fa8781c0147d66fcef7c2dd0e09cd3c69747d37)
 1: ceph-mds() [0x7a2206]
 2: (()+0x10060) [0x7fcd825c7060]
 3: (gsignal()+0x35) [0x7fcd80d483a5]
 4: (abort()+0x17b) [0x7fcd80d4bb0b]
 5: (__assert_fail()+0xdd) [0x7fcd80d40d4d]
 6: (MDLog::submit_entry(LogEvent*, Context*)+0x50e) [0x674ace]
 7: (Server::handle_client_reconnect(MClientReconnect*)+0xa23) [0x524173]
 8: (Server::dispatch(Message*)+0x215) [0x531265]
 9: (MDS::handle_deferrable_message(Message*)+0xb5f) [0x4b766f]
 10: (MDS::_dispatch(Message*)+0xa3a) [0x4c6cea]
 11: (MDS::ms_dispatch(Message*)+0x1da) [0x4c7b2a]
 12: (SimpleMessenger::dispatch_entry()+0x783) [0x722993]
 13: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4a545c]
 14: (()+0x7efc) [0x7fcd825beefc]
 15: (clone()+0x6d) [0x7fcd80df389d]

The log file says:

2012-03-16 23:09:27.989430 7fcd7e2f3700 mds.0.log replay start, from 104858377 to 232118700
2012-03-16 23:09:27.989578 7fcd7bbe8700 mds.0.log _replay_thread start
2012-03-16 23:09:27.990040 7fcd7e2f3700 mds.0.journaler(ro) _finish_read got error -2
2012-03-16 23:09:27.991896 7fcd7bbe8700 mds.0.log _replay journaler got error -2, aborting
2012-03-16 23:09:27.991933 7fcd7bbe8700 mds.0.log _replay_thread kicking waiters
2012-03-16 23:09:27.991942 7fcd7bbe8700 mds.0.97 replay_done
2012-03-16 23:09:27.991946 7fcd7bbe8700 mds.0.97 making mds journal writeable
2012-03-16 23:09:27.992192 7fcd7bbe8700 mds.0.log _replay_thread finish

Can I reset the MDS journal?

#8 Updated by Sage Weil about 12 years ago

Can you post an mds log with debug mds = 20 leading up to that last crash?

Resetting the journal is not something that is well-tested or particularly well supported yet. I think we're more likely to succeed by working through the asserts you see.

#9 Updated by Greg Farnum about 12 years ago

  • Assignee changed from Greg Farnum to Josh Durgin

Unfortunately we can see that this assert too is caused by ENOENT on an object that really ought to be there, which makes me assume vast portions of the metadata are just gone. This needs to be approached from the OSD side (I know Josh started on this yesterday — reassigning to him).
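
For reference, the "got error -2" in the journaler output above is just a negative errno: 2 is ENOENT, the same "No such file or directory" from the original report.

# -2 from the journaler is -ENOENT: the journal object it asked the OSDs
# for simply is not there.
import errno, os
print(errno.ENOENT)               # 2
print(os.strerror(errno.ENOENT))  # No such file or directory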

#10 Updated by Josh Durgin about 12 years ago

Could you attach the output of 'ceph osd dump 3212' and the binary version of that osdmap (ceph osd getmap 3212 -o /tmp/osdmap)? There are a few weird things happening here:

1) several pgs have only two up/acting osds when replication level is 3 and all osds are up
2) there were a bunch of log bound mismatches during read_log (is this expected, Sage?):

2012-03-16 08:30:59.321083 7fe147afd7a0 osd.2 3208 pg[0.6( v 340'2644 (34'1644,340'2644] n=0 ec=1 les/c 3208/3208 3207/3207/3207) [] r=0 lpr=0 (info mismatch, log(34'1644,0'0]) (log bound mismatch, actual=[34'1645,340'2643]) lcod 0'0 mlcod 0'0 inactive] read_log 123876 340'2644 (340'2643) delete   5477f286/1000000003d.00000000/head by mds.0.6:55070 2012-03-09 11:14:23.805301

3) pgs are staying remapped despite all osds being up (in each case remapped from 2 osds to 3 osds)

#11 Updated by Greg Farnum about 12 years ago

Did all the pools get set to 3x replication, or are the confused PGs all part of the metadata pool?

#12 Updated by Josh Durgin about 12 years ago

Greg: look at the osd dump above: all pools are rep size 3.

#13 Updated by soft crack about 12 years ago

Attached: the osdmap file from 'ceph osd getmap 3212 -o /tmp/osdmap'.

#14 Updated by Josh Durgin about 12 years ago

Talked more on IRC; soft crack is trying to reset his journal, since it looks like at least all of his metadata objects other than the session map are there.

I don't think we can figure out how the objects got lost at this point, but the pgs mapping to only 2 osds does look like a crush bug:

./osdmaptool --print osdmap
./osdmaptool: osdmap file 'osdmap'
epoch 3212
fsid a743a194-fa91-48fb-8778-e294483273d9
created 2012-03-01 02:06:10.677024
modifed 2012-03-15 17:31:02.260488
flags 

pool 0 'data' rep size 3 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 lpg_num 2 lpgp_num 2 last_change 3172 owner 0 crash_replay_interval 45
pool 1 'metadata' rep size 3 crush_ruleset 1 object_hash rjenkins pg_num 64 pgp_num 64 lpg_num 2 lpgp_num 2 last_change 3162 owner 0
pool 2 'rbd' rep size 3 crush_ruleset 2 object_hash rjenkins pg_num 64 pgp_num 64 lpg_num 2 lpgp_num 2 last_change 3160 owner 0

max_osd 3
osd.0 up   in  weight 1 up_from 3037 up_thru 3211 down_at 3035 last_clean_interval [2865,3034) 192.168.10.205:6800/6301 192.168.10.205:6801/6301 192.168.10.205:6802/6301 exists,up
osd.1 up   in  weight 1 up_from 3055 up_thru 3211 down_at 3054 last_clean_interval [3013,3053) lost_at 358 192.168.10.201:6800/20518 192.168.10.201:6801/20518 192.168.10.201:6802/20518 exists,up
osd.2 up   in  weight 1 up_from 3211 up_thru 3211 down_at 3209 last_clean_interval [3207,3208) 192.168.10.201:6803/26378 192.168.10.201:6806/26378 192.168.10.201:6807/26378 exists,up

pg_temp 0.7 [0,2,1]
pg_temp 1.6 [0,2,1]
pg_temp 1.1c [0,2,1]

$ ./osdmaptool --test-map-pg 0.6 osdmap
./osdmaptool: osdmap file 'osdmap'
 parsed '0.6' -> 0.6
0.6 raw [2,1] up [2,1] acting [2,1]

$ ./osdmaptool --test-map-pg 2.4 osdmap
./osdmaptool: osdmap file 'osdmap'
 parsed '2.4' -> 2.4
2.4 raw [2,1] up [2,1] acting [2,1]

$ ./osdmaptool --test-map-pg 2.6 osdmap
./osdmaptool: osdmap file 'osdmap'
 parsed '2.6' -> 2.6
2.6 raw [0,2,1] up [0,2,1] acting [0,2,1]
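
To check every pg the same way instead of spot-checking, one can script osdmaptool. A rough sketch, assuming ./osdmaptool and the osdmap file are in the current directory and 64 pgs per pool as in the dump above:

# Run osdmaptool --test-map-pg over every pg in pools 0-2 and flag any pg
# whose acting set has fewer than 3 osds (rep size is 3 for all pools).
import re
import subprocess

for pool in (0, 1, 2):
    for pg in range(64):
        pgid = '%d.%x' % (pool, pg)
        out = subprocess.check_output(
            ['./osdmaptool', '--test-map-pg', pgid, 'osdmap'],
            stderr=subprocess.STDOUT).decode()
        m = re.search(r'acting \[([0-9,]*)\]', out)
        if not m:
            continue
        acting = [x for x in m.group(1).split(',') if x]
        if len(acting) < 3:
            print('%s acting %s' % (pgid, acting))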

I haven't looked at the pgs stuck remapped problem yet.

#15 Updated by soft crack about 12 years ago

I managed to start the MDS server after resetting the journal, so I can get my data back.
Thanks very much to all of you!

Now 'ceph -s' says:

2012-03-19 13:12:46.173044    pg v39212: 210 pgs: 164 active+clean, 18 active+remapped, 28 active+degraded; 70401 MB data, 201 GB used, 3967 GB / 4184 GB avail; -277/52941 degraded (-0.523%)

#16 Updated by soft crack about 12 years ago

Shall we close this bug? The MDS server was recovered by providing an empty session map, and we cannot reproduce the original crash.

I will create a new bug for the degraded PGs.

#17 Updated by Josh Durgin about 12 years ago

  • Category deleted (OSD)
  • Status changed from New to Resolved
