Bug #3491
closed
test_librbd_fsx: too many open files
% Done:
0%
Source:
Development
Description
this is probably something runaway in the msgr, but:
ubuntu@teuthology:/a/teuthology-2012-11-13_19:00:04-regression-next-testing-basic/15008$ cat config.yaml
kernel: &id001
  kdb: true
  sha1: 22cddde104d715600a4c218bf9224923208afe90
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject socket failures: 5000
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: 7926ef53935313501d4a7fe0e587f3e3b00b313c
  s3tests:
    branch: next
  workunit:
    sha1: 7926ef53935313501d4a7fe0e587f3e3b00b313c
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
targets:
  ubuntu@plana42.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCzQfmtpfECJ+NZaaiSH/R8X+dGXHH+aDTCKGLLiHhW9fttxzfzcJJaBx1b664D3ynZAC7NiaegfLDTCMW7FFVDUltMQcWjsM4BqfFipIquDP4KOclCc6EwG5aYG/MLCJwL6sovt1uKg00bSkVQsUSHBgZbMJKCjCbBb0XPxfuS4dppA3diEZBOMt1YHr+NdV7sace/Gc7YBlGsNOinnqkKfVWIpfYCiTQ18cvaisSEHsQR6zhKqrX4afQk13cTjdvZeQp9AXxRIf1g9fq2zHVWMdJdVNR8D0BSBtfAzMqIqZ8qcJqmzQN0Zq9Wk9Y021vMFORZy2SFI6c7yBWDJLdT
  ubuntu@plana44.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDYE0eu9E8TQwtUy89Wldp54VbNBEoO9XQf77eXXzzmNwYUFRrNX0mZV/I8GqyRJuMrPG8V4aZBthBHTtnEmQ6RAS7fVdthi/hEgwnM9cAqY3KX9mR5xJnHBc/fa5KLrnSr3Wrztf42PpQNEN5Tk55K6wWUlZOTHU3vE0j3kF+YQ5FeBhQbghztHPKFR8bOmZJp9TpbXgbvEM2RWr9bYtro1KuQOgrairyVVNWdAuwZuxSQT4soyHoSkY9JmeXKsNRAOamxH9w57mDC3PXui7r6Fp8OCWSK+GmlLTtPaZtulSCcucaZtpVae7F4s9JNxaRl5RxuUtwMRfgAHGlL2BZv
  ubuntu@plana55.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCdrzGTR0Fbl6sedYlwlX+FlmF6fuE3l/RTu2kzOkmG47rPEn5CI37Injb7Epc50RXCbUIfzmDqtEY6uZT3YssYrE4jvhQlynPndbn1KmiTbgxTyuumGXv7O4OOntezighA1W49phUNZys1DhdEEO8VSQAIdHrBgBLhY9DDgC4LAhrP4BSbDTN0rUXtYYHBj4aa3sJV0o3sKjpsyjjlieEQnto6JkjK6EGZCSuY+AyMZyLJjFTgMwJ9i4aC5eZoWZAWSDfDsxo8PtFR+kjUmz5uiheyn5lAzKBxmd4ZNojf7wOhSGia0ghbtUeQkdoRZXZhP2ourNn3uAguf1xt43kX
task:
- ceph:
    conf:
      client:
        rbd cache: true
        rbd cache max dirty: 0
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock: null
- ceph: null
- rbd_fsx:
    clients:
    - client.0
    ops: 5000
tail of log:
2012-11-14 09:10:46.378617 7f4ccbc81700 -1 -- 10.214.132.23:0/1030198 >> 10.214.132.34:6803/21803 pipe(0x7f4f100f7920 sd=-1 :0 pgs=0 cs=0 l=1).connect couldn't created socket Too many open files
2012-11-14 09:10:46.378640 7f4ccb77c700 -1 -- 10.214.132.23:0/1030198 >> 10.214.132.34:6806/21804 pipe(0x7f4f103151a0 sd=-1 :0 pgs=0 cs=0 l=1).connect couldn't created socket Too many open files
2012-11-14 09:10:46.378647 7f4ccb87d700 -1 -- 10.214.132.23:0/1030198 >> 10.214.132.36:6800/10787 pipe(0x7f4f10064f90 sd=-1 :0 pgs=0 cs=0 l=1).connect couldn't created socket Too many open files
2012-11-14 09:10:47.088595 7f4ccb67b700 -1 -- 10.214.132.23:0/1030198 >> 10.214.132.34:6789/0 pipe(0x7f4f0c0aa300 sd=-1 :0 pgs=0 cs=0 l=1).connect couldn't created socket Too many open files
2012-11-14 09:10:47.088642 7f4ccb67b700 0 -- 10.214.132.23:0/1030198 >> 10.214.132.34:6789/0 pipe(0x7f4f0c0aa300 sd=-1 :0 pgs=0 cs=0 l=1).fault
2012-11-14 09:10:47.088660 7f4ccb67b700 -1 -- 10.214.132.23:0/1030198 >> 10.214.132.34:6789/0 pipe(0x7f4f0c0aa300 sd=-1 :0 pgs=0 cs=0 l=1).connect couldn't created socket Too many open files
2012-11-14 09:10:47.288840 7f4ccb67b700 -1 -- 10.214.132.23:0/1030198 >> 10.214.132.34:6789/0 pipe(0x7f4f0c0aa300 sd=-1 :0 pgs=0 cs=0 l=1).connect couldn't created socket Too many open files
2012-11-14 09:10:47.689041 7f4ccb67b700 -1 -- 10.214.132.23:0/1030198 >> 10.214.132.34:6789/0 pipe(0x7f4f0c0aa300 sd=-1 :0 pgs=0 cs=0 l=1).connect couldn't created socket Too many open files
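For triage, a rough way to see which peers the failing connects are aimed at is to tally the "Too many open files" lines per peer address. This is only a sketch: the regex is inferred from the log lines above, and `count_emfile_by_peer` is a hypothetical helper, not part of Ceph or teuthology.

```python
import re
from collections import Counter

# Pattern inferred from the log tail above (an assumption, not a Ceph-defined
# format): the peer address is the token that follows ">>".
EMFILE_RE = re.compile(
    r">> (?P<peer>\S+) pipe\(.*?\)\.connect couldn't created socket "
    r"Too many open files"
)


def count_emfile_by_peer(lines):
    """Count 'Too many open files' connect failures per peer address."""
    counts = Counter()
    for line in lines:
        match = EMFILE_RE.search(line)
        if match:
            counts[match.group("peer")] += 1
    return counts


# A few lines copied from the log tail, including one .fault line that
# should not be counted.
sample = [
    "2012-11-14 09:10:46.378617 7f4ccbc81700 -1 -- 10.214.132.23:0/1030198 >> 10.214.132.34:6803/21803 pipe(0x7f4f100f7920 sd=-1 :0 pgs=0 cs=0 l=1).connect couldn't created socket Too many open files",
    "2012-11-14 09:10:47.088595 7f4ccb67b700 -1 -- 10.214.132.23:0/1030198 >> 10.214.132.34:6789/0 pipe(0x7f4f0c0aa300 sd=-1 :0 pgs=0 cs=0 l=1).connect couldn't created socket Too many open files",
    "2012-11-14 09:10:47.088642 7f4ccb67b700 0 -- 10.214.132.23:0/1030198 >> 10.214.132.34:6789/0 pipe(0x7f4f0c0aa300 sd=-1 :0 pgs=0 cs=0 l=1).fault",
    "2012-11-14 09:10:47.088660 7f4ccb67b700 -1 -- 10.214.132.23:0/1030198 >> 10.214.132.34:6789/0 pipe(0x7f4f0c0aa300 sd=-1 :0 pgs=0 cs=0 l=1).connect couldn't created socket Too many open files",
]
counts = count_emfile_by_peer(sample)
for peer, n in counts.most_common():
    print(peer, n)
```

If the failures are spread across the mon and every OSD, as they appear to be here, that points at the client process running out of descriptors rather than at one bad peer.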
Updated by Sage Weil over 11 years ago
a zillion msgr threads blocked behind the throttle:
Thread 916 (Thread 0x7f4e1a5e5700 (LWP 22937)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x00007f4f2a00a085 in Wait (mutex=..., this=0x7f4efdd8b680) at ./common/Cond.h:55
#2 Throttle::_wait (this=0x1eccf10, c=33) at common/Throttle.cc:87
#3 0x00007f4f2a00ad64 in Throttle::get (this=0x1eccf10, c=33, m=0) at common/Throttle.cc:142
#4 0x00007f4f2a0c4be6 in Pipe::read_message (this=0x7f4f0c027700, pm=0x7f4e1a5e4db0) at msg/Pipe.cc:1487
#5 0x00007f4f2a0d6190 in Pipe::reader (this=0x7f4f0c027700) at msg/Pipe.cc:1199
#6 0x00007f4f2a0d8e3d in Pipe::Reader::entry (this=<optimized out>) at msg/Pipe.h:47
#7 0x00007f4f2959de9a in start_thread (arg=0x7f4e1a5e5700) at pthread_create.c:308
#8 0x00007f4f298a54bd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#9 0x0000000000000000 in ?? ()
this is probably a deadlock where we aren't releasing anything to the throttle, and the msgr is faulting/retrying because of timeouts and such.
frustrating that mark_down won't zap the guys blocked on the throttler, but shouldn't be a problem if the throttler isn't blocked indefinitely... ?
process is still running
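To make the suspected failure mode concrete, here is a toy model of it. This is not Ceph's `Throttle` from common/Throttle.cc; the class, `run_readers`, the budget of 100, and the timeouts are all made up for illustration (only the request size of 33 is borrowed from `c=33` in the backtrace). Readers call `get()` before reading a message; if nothing ever calls `put()`, every reader parks on the condition variable, and each parked reader would still be holding its socket.

```python
import threading


class Throttle:
    """Toy budget throttle (illustrative only): get() blocks until there is room."""

    def __init__(self, maximum):
        self.maximum = maximum
        self.current = 0
        self.cond = threading.Condition()

    def get(self, count, timeout=None):
        """Take `count` units; return False if `timeout` expires first."""
        with self.cond:
            ok = self.cond.wait_for(
                lambda: self.current + count <= self.maximum, timeout
            )
            if ok:
                self.current += count
            return ok

    def put(self, count):
        """Give `count` units back and wake every waiter."""
        with self.cond:
            self.current -= count
            self.cond.notify_all()


def run_readers(throttle, n_readers, count, timeout):
    """Start n reader threads that each try to take `count` units."""
    results = []
    lock = threading.Lock()

    def reader():
        ok = throttle.get(count, timeout)
        with lock:
            results.append(ok)

    threads = [threading.Thread(target=reader) for _ in range(n_readers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


throttle = Throttle(100)
throttle.get(100)                            # budget fully taken, never released
stuck = run_readers(throttle, 5, 33, 0.2)    # every reader times out
throttle.put(100)                            # the release that was missing
recovered = run_readers(throttle, 3, 33, 0.2)
print(stuck, recovered)
```

In the toy the readers give up after a timeout; in the backtrace above they sit in `pthread_cond_wait` with no timeout, which would match threads and sockets piling up while the msgr keeps faulting and reconnecting.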
Updated by Sage Weil over 11 years ago
- Status changed from New to Resolved
commit:12c2b7fa20be6878bc0763404d2a5c648e5fadbc