Bug #623


MDS: MDSTable::load_2

Added by Wido den Hollander over 13 years ago. Updated over 7 years ago.

Status:
Resolved
Priority:
Immediate
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

On a small test machine I have a Ceph RC cluster running (which was running an old unstable version before); after my upgrade I saw an MDS crash.

I saw:

2010-12-02 13:57:08.952173 7f9d92cbe710 mds0.8 MDS::ms_get_authorizer type=osd
2010-12-02 13:57:08.952242 7f9d94fc5710 mds0.8 ms_handle_connect on [2a00:f10:113:1:230:48ff:fe8d:a21f]:6804/2045
2010-12-02 13:57:08.952494 7f9d94fc5710 mds0.8 ms_handle_connect on [2a00:f10:113:1:230:48ff:fe8d:a21f]:6807/2128
2010-12-02 13:57:08.953187 7f9d94fc5710 -- [2a00:f10:113:1:230:48ff:fe8d:a21f]:6800/2831 <== osd2 [2a00:f10:113:1:230:48ff:fe8d:a21f]:6807/2128 1 ==== osd_op_reply(5 200.00000000 [read 0~0] = -23 (Too many open files in system)) v1 ==== 98+0+0 (2432835435 0 0) 0x153b1c0
2010-12-02 13:57:09.325890 7f9d94fc5710 -- [2a00:f10:113:1:230:48ff:fe8d:a21f]:6800/2831 <== osd0 [2a00:f10:113:1:230:48ff:fe8d:a21f]:6801/1975 1 ==== osd_op_reply(1 mds0_inotable [read 0~0] = -23 (Too many open files in system)) v1 ==== 99+0+0 (1732481521 0 0) 0x153bc40
2010-12-02 13:57:09.325971 7f9d94fc5710 mds0.inotable: load_2 found no table
mds/MDSTable.cc: In function 'void MDSTable::load_2(int, ceph::bufferlist&, Context*)':
mds/MDSTable.cc:148: FAILED assert(0)
 ceph version 0.24~rc (commit:78a14622438addcd5c337c4924cce1f67d053ee9)
 1: (MDSTable::load_2(int, ceph::buffer::list&, Context*)+0x5be) [0x61582e]
 2: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0x674) [0x665bc4]
 3: (MDS::_dispatch(Message*)+0x20b4) [0x4ab924]
 4: (MDS::ms_dispatch(Message*)+0x6d) [0x4abefd]
 5: (SimpleMessenger::dispatch_entry()+0x759) [0x4812c9]
 6: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4790bc]
 7: (Thread::_entry_func(void*)+0xa) [0x48d96a]
 8: (()+0x69ca) [0x7f9d977299ca]
 9: (clone()+0x6d) [0x7f9d966e170d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

The -23 (Too many open files in system) caught my attention, but raising the open files limit to 64,000 didn't help.

root@noisy:~# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 20
file size               (blocks, -f) unlimited
pending signals                 (-i) 16382
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 64000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
root@noisy:~# 

The cluster isn't busy at all, and not much data / objects on it:

root@noisy:~# ceph -s
2010-12-02 14:19:04.984055    pg v1008: 792 pgs: 792 active+clean; 5672 MB data, 10782 MB used, 283 GB / 300 GB avail
2010-12-02 14:19:04.986335   mds e29: 1/1/1 up {0=up:replay(laggy or crashed)}
2010-12-02 14:19:04.986376   osd e48: 3 osds: 3 up, 3 in
2010-12-02 14:19:04.986444   log 2010-12-02 14:17:39.411756 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21f]:6804/2045 50 : [INF] 3.1p1 scrub ok
2010-12-02 14:19:04.986555   class rbd (v1.3 [x86-64])
2010-12-02 14:19:04.986578   mon e1: 1 mons at {noisy=[2a00:f10:113:1:230:48ff:fe8d:a21f]:6789/0}
root@noisy:~# 

Is this due to the number of open files?

Actions #1

Updated by Sage Weil over 13 years ago

  • Assignee set to Sage Weil
  • Priority changed from Normal to Immediate
  • Target version set to v0.24
Actions #2

Updated by Sage Weil over 13 years ago

  • Assignee changed from Sage Weil to Colin McCabe

actually -23 is ENFILE, which I think is coming from the LOST code... but that should never trigger unless the admin has explicitly marked an osd as lost, and I'm pretty sure Wido hasn't. Maybe the bool `lost` isn't getting properly initialized somewhere? Or is it decoding improperly?

fyi:
(11:44:10 AM) wido: sagewk: Oh, no hurry at all, but I don't want to send you on a ghost chase. Btw, from the logger machine it's simply ssh root@noisy

Actions #3

Updated by Colin McCabe over 13 years ago

root@noisy:/var/log/ceph# grep mark_all_unfound_as_lost *
[ no results ]

So we're not marking things as lost in PG::mark_all_unfound_as_lost, at least

Actions #4

Updated by Colin McCabe over 13 years ago

  • Status changed from New to 7

I think that commit:da5ab7c9a49f8996b41783175683d4b8b13ece4d should fix this issue.

wido, can you re-run with the latest rc? Hopefully the bad state created by this bug will be transitory and it will work for you after this commit.

Actions #5

Updated by Wido den Hollander over 13 years ago

Yes, tried with the latest rc, works!

The MDS starts and recovers; also mounting and using the FS goes fine.

Actions #6

Updated by Sage Weil over 13 years ago

  • Status changed from 7 to Resolved
Actions #7

Updated by John Spray over 7 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)
  • Target version deleted (v0.24)

Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.
