Bug #623
MDS: MDSTable::load_2
Status: Closed
% Done: 0%
Description
On a small test machine I have a Ceph RC cluster running (which was running an old unstable version before). After the upgrade I saw an MDS crash.
I saw:
2010-12-02 13:57:08.952173 7f9d92cbe710 mds0.8 MDS::ms_get_authorizer type=osd
2010-12-02 13:57:08.952242 7f9d94fc5710 mds0.8 ms_handle_connect on [2a00:f10:113:1:230:48ff:fe8d:a21f]:6804/2045
2010-12-02 13:57:08.952494 7f9d94fc5710 mds0.8 ms_handle_connect on [2a00:f10:113:1:230:48ff:fe8d:a21f]:6807/2128
2010-12-02 13:57:08.953187 7f9d94fc5710 -- [2a00:f10:113:1:230:48ff:fe8d:a21f]:6800/2831 <== osd2 [2a00:f10:113:1:230:48ff:fe8d:a21f]:6807/2128 1 ==== osd_op_reply(5 200.00000000 [read 0~0] = -23 (Too many open files in system)) v1 ==== 98+0+0 (2432835435 0 0) 0x153b1c0
2010-12-02 13:57:09.325890 7f9d94fc5710 -- [2a00:f10:113:1:230:48ff:fe8d:a21f]:6800/2831 <== osd0 [2a00:f10:113:1:230:48ff:fe8d:a21f]:6801/1975 1 ==== osd_op_reply(1 mds0_inotable [read 0~0] = -23 (Too many open files in system)) v1 ==== 99+0+0 (1732481521 0 0) 0x153bc40
2010-12-02 13:57:09.325971 7f9d94fc5710 mds0.inotable: load_2 found no table
mds/MDSTable.cc: In function 'void MDSTable::load_2(int, ceph::bufferlist&, Context*)':
mds/MDSTable.cc:148: FAILED assert(0)
 ceph version 0.24~rc (commit:78a14622438addcd5c337c4924cce1f67d053ee9)
 1: (MDSTable::load_2(int, ceph::buffer::list&, Context*)+0x5be) [0x61582e]
 2: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0x674) [0x665bc4]
 3: (MDS::_dispatch(Message*)+0x20b4) [0x4ab924]
 4: (MDS::ms_dispatch(Message*)+0x6d) [0x4abefd]
 5: (SimpleMessenger::dispatch_entry()+0x759) [0x4812c9]
 6: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4790bc]
 7: (Thread::_entry_func(void*)+0xa) [0x48d96a]
 8: (()+0x69ca) [0x7f9d977299ca]
 9: (clone()+0x6d) [0x7f9d966e170d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
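For context, a simplified sketch of the control flow that produces this assert (in the spirit of, but not copied from, the Ceph source; the function name and parameters here are illustrative): the completion for the table read expects valid table data, and any unexpected error from the OSD falls through to the assert seen at mds/MDSTable.cc:148.

#include <cassert>
#include <cerrno>
#include <cstdio>

// Hypothetical simplification of the failing path; the real function is
// MDSTable::load_2(int r, ceph::bufferlist& bl, Context* onfinish).
void load_2_sketch(int r, size_t table_bytes) {
  if (r >= 0 && table_bytes > 0) {
    // decode the table from the buffer and continue replay
    return;
  }
  // Here the read of mds0_inotable returned r == -23 (-ENFILE), so there
  // is no table data to decode...
  fprintf(stderr, "load_2 found no table\n");
  // ...and an unexpected error during replay is treated as fatal:
  assert(0);  // the "FAILED assert(0)" at mds/MDSTable.cc:148
}

int main() {
  load_2_sketch(-ENFILE, 0);  // reproduces the abort path from the log
  return 0;
}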
The -23 (Too many open files in system) caught my attention, but raising the open files limit to 64,000 didn't help.
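As an aside (an illustration, not from the ticket): on Linux, errno 23 is ENFILE, the system-wide file-table limit tuned via fs.file-max, while `ulimit -n` only raises the per-process limit (EMFILE, errno 24). So even a genuine ENFILE would not have been fixed by the ulimit change, and as the later comments show, the errno here was spurious anyway. A quick check:

#include <cerrno>
#include <cstdio>
#include <cstring>

int main() {
  // ENFILE (23): system-wide file table full; tuned via /proc/sys/fs/file-max
  printf("ENFILE = %d: %s\n", ENFILE, strerror(ENFILE));
  // EMFILE (24): per-process limit; this is what `ulimit -n` raises
  printf("EMFILE = %d: %s\n", EMFILE, strerror(EMFILE));
  return 0;
}

On glibc this prints "Too many open files in system" for ENFILE, matching the message in the log above.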
root@noisy:~# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 20
file size               (blocks, -f) unlimited
pending signals                 (-i) 16382
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 64000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
root@noisy:~#
The cluster isn't busy at all, and there isn't much data or many objects on it:
root@noisy:~# ceph -s
2010-12-02 14:19:04.984055    pg v1008: 792 pgs: 792 active+clean; 5672 MB data, 10782 MB used, 283 GB / 300 GB avail
2010-12-02 14:19:04.986335   mds e29: 1/1/1 up {0=up:replay(laggy or crashed)}
2010-12-02 14:19:04.986376   osd e48: 3 osds: 3 up, 3 in
2010-12-02 14:19:04.986444   log 2010-12-02 14:17:39.411756 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21f]:6804/2045 50 : [INF] 3.1p1 scrub ok
2010-12-02 14:19:04.986555   class rbd (v1.3 [x86-64])
2010-12-02 14:19:04.986578   mon e1: 1 mons at {noisy=[2a00:f10:113:1:230:48ff:fe8d:a21f]:6789/0}
root@noisy:~#
Is this due to the number of open files?
Updated by Sage Weil over 13 years ago
- Assignee set to Sage Weil
- Priority changed from Normal to Immediate
- Target version set to v0.24
Updated by Sage Weil over 13 years ago
- Assignee changed from Sage Weil to Colin McCabe
Actually, -23 is ENFILE, which I think is coming from the LOST code... but that should never trigger unless the admin has explicitly marked an osd as lost, and I'm pretty sure Wido hasn't. Maybe the bool lost isn't getting properly initialized somewhere? Or is it decoding improperly?
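To illustrate the suspected bug class (a generic sketch, not the actual Objecter/OSD code; the struct and member names are made up): a bool member that is never initialized holds indeterminate garbage, so it can read as true and steer a reply down the "lost" error path even though nothing was marked lost.

#include <cerrno>
#include <cstdio>

// Generic sketch: `lost` is never initialized, so a heap allocation can
// leave nonzero garbage in it, making the flag appear set.
struct ReplySketch {
  bool lost;       // BUG: no initializer, and no constructor sets it
  int result = 0;

  int finalize() const {
    // A garbage-true `lost` converts a successful read into -ENFILE (-23).
    return lost ? -ENFILE : result;
  }
};

int main() {
  ReplySketch* r = new ReplySketch;       // `lost` is indeterminate here
  printf("result = %d\n", r->finalize()); // may print -23 on some runs
  delete r;
  return 0;
}

The usual fix for this class of bug is simply initializing the flag at its declaration (`bool lost = false;`) or setting it in every decode path.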
fyi:
(11:44:10 AM) wido: sagewk: Oh, no hurry at all, but I don't want to send you on a ghost chase. Btw, from the logger machine it's simply ssh root@noisy
Updated by Colin McCabe over 13 years ago
root@noisy:/var/log/ceph# grep mark_all_unfound_as_lost *
[ no results ]
So we're not marking things as lost in PG::mark_all_unfound_as_lost, at least.
Updated by Colin McCabe over 13 years ago
- Status changed from New to 7
I think that commit:da5ab7c9a49f8996b41783175683d4b8b13ece4d should fix this issue.
wido, can you re-run with the latest rc? Hopefully the bad state created by this bug will be transitory and it will work for you after this commit.
Updated by Wido den Hollander over 13 years ago
Yes, tried with the latest rc, works!
The MDS starts and recovers, and mounting and using the FS also goes fine.
Updated by John Spray over 7 years ago
- Project changed from Ceph to CephFS
- Category deleted (1)
- Target version deleted (v0.24)
Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.