Bug #623


MDS: MDSTable::load_2

Added by Wido den Hollander over 13 years ago. Updated over 7 years ago.

Status:
Resolved
Priority:
Immediate
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

On a small test machine I have a Ceph RC cluster running (which was running an old unstable version before); after my upgrade I saw an MDS crash.

I saw:

2010-12-02 13:57:08.952173 7f9d92cbe710 mds0.8 MDS::ms_get_authorizer type=osd
2010-12-02 13:57:08.952242 7f9d94fc5710 mds0.8 ms_handle_connect on [2a00:f10:113:1:230:48ff:fe8d:a21f]:6804/2045
2010-12-02 13:57:08.952494 7f9d94fc5710 mds0.8 ms_handle_connect on [2a00:f10:113:1:230:48ff:fe8d:a21f]:6807/2128
2010-12-02 13:57:08.953187 7f9d94fc5710 -- [2a00:f10:113:1:230:48ff:fe8d:a21f]:6800/2831 <== osd2 [2a00:f10:113:1:230:48ff:fe8d:a21f]:6807/2128 1 ==== osd_op_reply(5 200.00000000 [read 0~0] = -23 (Too many open files in system)) v1 ==== 98+0+0 (2432835435 0 0) 0x153b1c0
2010-12-02 13:57:09.325890 7f9d94fc5710 -- [2a00:f10:113:1:230:48ff:fe8d:a21f]:6800/2831 <== osd0 [2a00:f10:113:1:230:48ff:fe8d:a21f]:6801/1975 1 ==== osd_op_reply(1 mds0_inotable [read 0~0] = -23 (Too many open files in system)) v1 ==== 99+0+0 (1732481521 0 0) 0x153bc40
2010-12-02 13:57:09.325971 7f9d94fc5710 mds0.inotable: load_2 found no table
mds/MDSTable.cc: In function 'void MDSTable::load_2(int, ceph::bufferlist&, Context*)':
mds/MDSTable.cc:148: FAILED assert(0)
 ceph version 0.24~rc (commit:78a14622438addcd5c337c4924cce1f67d053ee9)
 1: (MDSTable::load_2(int, ceph::buffer::list&, Context*)+0x5be) [0x61582e]
 2: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0x674) [0x665bc4]
 3: (MDS::_dispatch(Message*)+0x20b4) [0x4ab924]
 4: (MDS::ms_dispatch(Message*)+0x6d) [0x4abefd]
 5: (SimpleMessenger::dispatch_entry()+0x759) [0x4812c9]
 6: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4790bc]
 7: (Thread::_entry_func(void*)+0xa) [0x48d96a]
 8: (()+0x69ca) [0x7f9d977299ca]
 9: (clone()+0x6d) [0x7f9d966e170d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

The -23 (Too many open files in system) caught my attention, but raising the open files limit to 64,000 didn't help.

root@noisy:~# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 20
file size               (blocks, -f) unlimited
pending signals                 (-i) 16382
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 64000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
root@noisy:~# 

The cluster isn't busy at all, and not much data / objects on it:

root@noisy:~# ceph -s
2010-12-02 14:19:04.984055    pg v1008: 792 pgs: 792 active+clean; 5672 MB data, 10782 MB used, 283 GB / 300 GB avail
2010-12-02 14:19:04.986335   mds e29: 1/1/1 up {0=up:replay(laggy or crashed)}
2010-12-02 14:19:04.986376   osd e48: 3 osds: 3 up, 3 in
2010-12-02 14:19:04.986444   log 2010-12-02 14:17:39.411756 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21f]:6804/2045 50 : [INF] 3.1p1 scrub ok
2010-12-02 14:19:04.986555   class rbd (v1.3 [x86-64])
2010-12-02 14:19:04.986578   mon e1: 1 mons at {noisy=[2a00:f10:113:1:230:48ff:fe8d:a21f]:6789/0}
root@noisy:~# 

Is this due to the number of open files?

Actions #1

Updated by Sage Weil over 13 years ago

  • Assignee set to Sage Weil
  • Priority changed from Normal to Immediate
  • Target version set to v0.24
Actions #2

Updated by Sage Weil over 13 years ago

  • Assignee changed from Sage Weil to Colin McCabe

actually -23 is ENFILE, which I think is coming from the LOST code... but that should never trigger unless the admin has explicitly marked an osd as lost, and I'm pretty sure Wido hasn't. Maybe the bool `lost` isn't getting properly initialized somewhere? Or is it decoding improperly?

fyi:
(11:44:10 AM) wido: sagewk: Oh, no hurry at all, but I don't want to send you on a ghost chase. Btw, from the logger machine it's simply ssh root@noisy

Actions #3

Updated by Colin McCabe over 13 years ago

root@noisy:/var/log/ceph# grep mark_all_unfound_as_lost *
[ no results ]

So we're not marking things as lost in PG::mark_all_unfound_as_lost, at least

Actions #4

Updated by Colin McCabe over 13 years ago

  • Status changed from New to 7

I think that commit:da5ab7c9a49f8996b41783175683d4b8b13ece4d should fix this issue.

wido, can you re-run with the latest rc? Hopefully the bad state created by this bug will be transitory and it will work for you after this commit.

Actions #5

Updated by Wido den Hollander over 13 years ago

Yes, tried with the latest rc, works!

The MDS starts and recovers; also mounting and using the FS goes fine.

Actions #6

Updated by Sage Weil over 13 years ago

  • Status changed from 7 to Resolved
Actions #7

Updated by John Spray over 7 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)
  • Target version deleted (v0.24)

Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.
