Bug #3690
closed
osd crashed in FileStore::_do_transaction
Description
ceph version: 0.55.1-360-g6356739 (635673928a6b4dae6d4712cacad81cbac6412dc3)
I had a cluster [burnupi15, burnupi19, burnupi20] running argonaut, upgraded it to bobtail, and then started running tests against it from two different clients. burnupi13 [bobtail version of ceph-fuse] was running fsstress.sh and burnupi14 [argonaut version of ceph-fuse] was running bonnie.sh when two of the OSDs [osd.1 on burnupi15 and osd.4 on burnupi19] hit the core pasted below:
ceph version 0.55.1-360-g6356739 (635673928a6b4dae6d4712cacad81cbac6412dc3)
1: /usr/bin/ceph-osd() [0x7839ca]
2: (()+0xfcb0) [0x7fa716788cb0]
3: (gsignal()+0x35) [0x7fa714a5c425]
4: (abort()+0x17b) [0x7fa714a5fb8b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fa7153ae69d]
6: (()+0xb5846) [0x7fa7153ac846]
7: (()+0xb5873) [0x7fa7153ac873]
8: (()+0xb596e) [0x7fa7153ac96e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x82ea7f]
10: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int)+0x912) [0x71a8d2]
11: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x4c) [0x7209cc]
12: (FileStore::_do_op(FileStore::OpSequencer*)+0x1b1) [0x6f0d21]
13: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4bc) [0x823ecc]
14: (ThreadPool::WorkThread::entry()+0x10) [0x825cd0]
15: (()+0x7e9a) [0x7fa716780e9a]
16: (clone()+0x6d) [0x7fa714b19cbd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events ---
-4> 2012-12-27 17:15:41.120097 7fa6e0ed8700 1 -- 10.214.134.22:6802/37265 >> :/0 pipe(0x492dd80 sd=24 :6802 pgs=0 cs=0 l=0).accept sd=24
-3> 2012-12-27 17:15:41.132294 7fa6e0ed8700 1 -- 10.214.134.22:6802/37265 >> :/0 pipe(0x492d240 sd=24 :6802 pgs=0 cs=0 l=0).accept sd=24
-2> 2012-12-27 17:15:41.141558 7fa6e0ed8700 1 -- 10.214.134.22:6802/37265 >> :/0 pipe(0x492d480 sd=24 :6802 pgs=0 cs=0 l=0).accept sd=24
-1> 2012-12-27 17:15:41.148697 7fa70cffd700 -1 *** Caught signal (Aborted) **
in thread 7fa70cffd700
ceph version 0.55.1-360-g6356739 (635673928a6b4dae6d4712cacad81cbac6412dc3)
1: /usr/bin/ceph-osd() [0x7839ca]
2: (()+0xfcb0) [0x7fa716788cb0]
3: (gsignal()+0x35) [0x7fa714a5c425]
4: (abort()+0x17b) [0x7fa714a5fb8b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fa7153ae69d]
6: (()+0xb5846) [0x7fa7153ac846]
7: (()+0xb5873) [0x7fa7153ac873]
8: (()+0xb596e) [0x7fa7153ac96e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x82ea7f]
10: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int)+0x912) [0x71a8d2]
11: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x4c) [0x7209cc]
12: (FileStore::_do_op(FileStore::OpSequencer*)+0x1b1) [0x6f0d21]
13: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4bc) [0x823ecc]
14: (ThreadPool::WorkThread::entry()+0x10) [0x825cd0]
15: (()+0x7e9a) [0x7fa716780e9a]
16: (clone()+0x6d) [0x7fa714b19cbd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
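For triage, the numbered frames in the backtrace above can be split into symbol, offset, and address before they are fed to objdump or a symbolizer. The following is a minimal sketch, assuming the frame layout seen in this report ("N: (symbol+0xoff) [0xaddr]"); parse_frame and caller_of_assert are hypothetical helper names, not part of Ceph.

```python
import re

# One frame per line, e.g.
#   "10: (FileStore::_do_transaction(...)+0x912) [0x71a8d2]"
#   "1: /usr/bin/ceph-osd() [0x7839ca]"
FRAME_RE = re.compile(r"^\s*(\d+):\s+(.*?)\s+\[(0x[0-9a-fA-F]+)\]\s*$")
# "(symbol+0xoffset)" part of a frame; the greedy match keeps any
# parentheses inside the symbol's argument list.
OFFSET_RE = re.compile(r"^\((.*)\+(0x[0-9a-fA-F]+)\)$")


def parse_frame(line):
    """Return a dict describing one backtrace frame, or None if no match."""
    m = FRAME_RE.match(line)
    if m is None:
        return None
    index, body, addr = m.groups()
    symbol, offset = body, None
    om = OFFSET_RE.match(body)
    if om:
        symbol = om.group(1)
        offset = int(om.group(2), 16)
        if symbol == "()":
            symbol = None  # unresolved symbol, only an offset is known
    return {
        "index": int(index),
        "symbol": symbol,
        "offset": offset,
        "address": int(addr, 16),
    }


def caller_of_assert(lines):
    """Return the frame directly below __ceph_assert_fail, if present."""
    frames = [f for f in map(parse_frame, lines) if f]
    for prev, cur in zip(frames, frames[1:]):
        if prev["symbol"] and "__ceph_assert_fail" in prev["symbol"]:
            return cur
    return None


if __name__ == "__main__":
    sample = [
        "9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x82ea7f]",
        "10: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int)+0x912) [0x71a8d2]",
    ]
    frame = caller_of_assert(sample)
    print(frame["symbol"], hex(frame["address"]))
```

The address printed for the asserting caller is the value to look up in the `objdump -rdS` output mentioned in the NOTE line.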
ceph.conf:
ubuntu@burnupi15:~$ sudo cat /etc/ceph/ceph.conf
[global]
auth client required = none
auth cluster required = none
auth service required = none
[osd]
osd journal size = 1000
filestore xattr use omap = true
[osd.1]
host = burnupi15
[osd.2]
host = burnupi15
[osd.3]
host = burnupi19
[osd.4]
host = burnupi19
[osd.5]
host = burnupi20
[osd.6]
host = burnupi20
[mon.a]
host = burnupi15
mon addr = 10.214.134.22:6789
[mon.b]
host = burnupi19
mon addr = 10.214.134.14:6789
[mon.c]
host = burnupi20
mon addr = 10.214.134.12:6789
[mds.a]
host = burnupi20
[client.radosgw.gateway]
host = burnupi15
keyring = /etc/ceph/keyring.radosgw.gateway
rgw socket path = /tmp/radosgw.sock
log file = /var/log/ceph/radosgw.log
Updated by Tamilarasi muthamizhan over 11 years ago
leaving the cluster as it is for someone to take a look at it.
Updated by Tamilarasi muthamizhan over 11 years ago
2012-12-27 17:15:38.413374 7fa70cffd700 -1 os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int)' thread 7fa70cffd700 time 2012-12-27 17:15:38.412267
os/FileStore.cc: 2681: FAILED assert(0 == "unexpected error")
ceph version 0.55.1-360-g6356739 (635673928a6b4dae6d4712cacad81cbac6412dc3)
1: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int)+0x912) [0x71a8d2]
2: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x4c) [0x7209cc]
3: (FileStore::_do_op(FileStore::OpSequencer*)+0x1b1) [0x6f0d21]
4: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4bc) [0x823ecc]
5: (ThreadPool::WorkThread::entry()+0x10) [0x825cd0]
6: (()+0x7e9a) [0x7fa716780e9a]
7: (clone()+0x6d) [0x7fa714b19cbd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by Sage Weil over 11 years ago
2012-12-27 17:22:01.018700 7fb23d75a700 0 accepter.accepter no incoming connection? sd = -1 errno 24 Too many open files
2012-12-27 17:22:02.961930 7fb242f65700 0 filestore(/var/lib/ceph/osd/ceph-4) write couldn't open meta/1f98b06a/pglog_0.181/0//-1 flags 65: (24) Too many open files
2012-12-27 17:22:03.077591 7fb242f65700 0 filestore(/var/lib/ceph/osd/ceph-4) error (24) Too many open files not handled on operation 10 (46891.0.0, or op 0, counting from 0)
2012-12-27 17:22:03.077647 7fb242f65700 0 filestore(/var/lib/ceph/osd/ceph-4) unexpected error code
2012-12-27 17:22:03.077651 7fb242f65700 0 filestore(/var/lib/ceph/osd/ceph-4) transaction dump:
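The log lines above all carry errno 24 (EMFILE), which is what ties the assert in FileStore::_do_transaction to file descriptor exhaustion. A minimal sketch for spotting this in an OSD log, assuming the two spellings seen in this report ("(24) ..." and "errno 24 ..."); errnos_in and is_fd_exhaustion are hypothetical helper names.

```python
import errno
import re

# Error codes appear either as "(24) Too many open files"
# or as "errno 24 Too many open files" in the lines quoted above.
ERRNO_RE = re.compile(r"\((\d+)\)|errno (\d+)")


def errnos_in(line):
    """Return every error number mentioned in one log line."""
    return [int(m.group(1) or m.group(2)) for m in ERRNO_RE.finditer(line)]


def is_fd_exhaustion(line):
    """True if the line reports EMFILE (too many open files)."""
    return errno.EMFILE in errnos_in(line)


if __name__ == "__main__":
    lines = [
        "accepter.accepter no incoming connection? sd = -1 errno 24 Too many open files",
        "filestore(/var/lib/ceph/osd/ceph-4) unexpected error code",
    ]
    for line in lines:
        print(is_fd_exhaustion(line), line)
```

The pattern requires a closing parenthesis directly after the digits, so fragments such as "(46891.0.0, ..." are not mistaken for an error code.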
Updated by Sage Weil over 11 years ago
- Status changed from New to 7
made the default fd limit much higher in 672c56b18de3b02606e47013edfc2e8b679d8797
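The contents of that commit are not shown in this report, so the following is only an illustration of the underlying mechanism: a process can inspect its RLIMIT_NOFILE soft and hard limits and raise the soft limit up to the hard limit. fd_limits and raise_soft_fd_limit are hypothetical helper names, and this sketch is not how ceph-osd itself sets its limit.

```python
import resource


def fd_limits():
    """Return (soft, hard) limits on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_fd_limit(target):
    """Raise the soft fd limit toward target, capped at the hard limit.

    Never lowers the limit. Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to raise
    if hard == resource.RLIM_INFINITY:
        new_soft = target
    else:
        new_soft = min(target, hard)
    if new_soft > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
        soft = new_soft
    return soft


if __name__ == "__main__":
    print("before:", fd_limits())
    print("soft limit now:", raise_soft_fd_limit(4096))
```

An unprivileged process can only raise the soft limit as far as the hard limit; raising the hard limit requires privileges or a change in whatever launches the daemon.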
Updated by Sage Weil over 11 years ago
- Status changed from 7 to Resolved
The problem was old ceph-osd daemons on other hosts trying to connect; they were running code that didn't include 4d20b60970413ca1b55e73eaae087e686b72f6a5.
Updated by Tamilarasi muthamizhan about 11 years ago
- Status changed from Resolved to In Progress
recent log: ubuntu@teuthology:/a/teuthology-2013-02-10_01:00:02-regression-master-testing-gcov/4059
0> 2013-02-10 02:19:24.170836 7f9da597b700 -1 *** Caught signal (Aborted) **
in thread 7f9da597b700
ceph version 0.56-707-gabc80ff (abc80ffc5b1aab3915c049701ab85c57fe93d550)
1: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x8acc3a]
2: (()+0xfcb0) [0x7f9daf313cb0]
3: (gsignal()+0x35) [0x7f9dad3da445]
4: (abort()+0x17b) [0x7f9dad3ddbab]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f9dadd2869d]
6: (()+0xb5846) [0x7f9dadd26846]
7: (()+0xb5873) [0x7f9dadd26873]
8: (()+0xb596e) [0x7f9dadd2696e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x297) [0x995e47]
10: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int)+0x7f6c) [0x833f5c]
11: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0xa1) [0x8368b1]
12: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x34b) [0x836c2b]
13: (FileStore::OpWQ::_process(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x15) [0x83d415]
14: (ThreadPool::WorkQueue<FileStore::OpSequencer>::_void_process(void*, ThreadPool::TPHandle&)+0x12) [0x838a52]
15: (ThreadPool::worker(ThreadPool::WorkThread*)+0x6ce) [0x9878de]
16: (ThreadPool::WorkThread::entry()+0x18) [0x98b038]
17: (Thread::_entry_func(void*)+0x12) [0x97a432]
18: (()+0x7e9a) [0x7f9daf30be9a]
19: (clone()+0x6d) [0x7f9dad4964bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
ubuntu@teuthology:/a/teuthology-2013-02-10_01:00:02-regression-master-testing-gcov/4059$ cat config.yaml
kernel: &id001
  kdb: true
  sha1: 012d5bda1c0f229494c67098d00edfa24c531ea5
nuke-on-error: true
overrides:
  ceph:
    conf:
      mds:
        debug mds: 1/20
      osd:
        osd op thread timeout: 60
    coverage: true
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: abc80ffc5b1aab3915c049701ab85c57fe93d550
  s3tests:
    branch: master
  workunit:
    sha1: abc80ffc5b1aab3915c049701ab85c57fe93d550
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
targets:
  ubuntu@plana39.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDo+Kh24vRxeTQ6/n5PIIGuxrPHPRO/xMQlwoLHi7mR01cIXJMG5wet7mp2om3/5SZSDcLBHduDKrdWL142Sg5fC0zZPUggbxS7nz/UCjYBzMsOtHEUAU5Gs0KFopOCHXNEveK95ezsroMAD5+jS/IEpiooYCkrR3H+NSvUU0Ae352PlXqV0vamkYzyQyEMmhFE50ALhUXbKMve3d2mxJee5sqVZSBmQTbze9RKUA96t9iiwiheflXbN1i9WHlbBOIue5pZ5fM3/vqPWgaShfFpa0pT56QKJfjyFcDeCLOislo23E5qKAJOi5vn5BoYVtG3niNQpt/YbYGfDEHVeqt9
  ubuntu@plana54.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC60+3L8IN2WBpWHY94YAuCOMVdKs3xUqYQpO1ie127fBomk7fiEhR0RovhmDHWIzNr3qvvNkIp9Y+MHUcpZ7C4MFFMYsy2+zq026Ag3XLEOyZWDSyPfMapd5+nmuvxJqEvAx4wAWBhYVEB3aPFmDmz4mayZ9aSYoA1lhsClxfYpAHZ0zRWX3kY1KxXlk6UrZy0igYGvKIvmubkYcmFzOPsI3aWpgWU1rEXGWsFHOlwaor0KJPnpEsZYTrlPyLZqJcKbI/EcHgti0ak22vsDT7LVMKoyPXeUFL5ZGUEpuqQ+IMiECCMKa8X8vPG2MN9V6DK3gQezF+lo5CRCAu7DYdn
  ubuntu@plana83.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDPiffDjAR8Vr7B1C/WQ4ZiY0rnm4DSvknWpdrM1NMvHhxdrnceyfi6K8y0d+NdQKAkRXzCMbYFM/LU1gUEoIgT6k15wzz/KBUyWPbZ4BCNQNLng3z2aQ80FSA+CWYir0UVCOuj+glDN/MAqWmOwfJ7zHkej1/+hMH5mBTy9Nc8/VS7/yDIQgHLdHmeeMTlGVvBs+uc0b4rBd5kSzhzdwegxO+J86aR/HVsdL7/kEOVHeZQ0yWCT5mRa1+nTveZWYXBz+8ZIFzIx4e7QIhGw/XjDYqqhLD+ZjaUXuKNWG3GqLrQntt42LRvC6flIF11PoL1g0wsxW4DV5Vwdg+zkX71
tasks:
- internal.lock_machines:
  - 3
  - plana
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds: null
- kclient: null
- workunit:
    clients:
      all:
      - suites/ffsb.sh
ubuntu@teuthology:/a/teuthology-2013-02-10_01:00:02-regression-master-testing-gcov/4059$ cat summary.yaml
ceph-sha1: abc80ffc5b1aab3915c049701ab85c57fe93d550
client.0-kernel-sha1: 012d5bda1c0f229494c67098d00edfa24c531ea5
description: collection:kernel-thrash clusters:fixed-3.yaml fs:btrfs.yaml thrashers:default.yaml workloads:kclient_workunit_suites_ffsb.yaml
duration: 1005.7241339683533
failure_reason: 'Command failed with status 1: ''/tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/daemon-helper term /tmp/cephtest/binary/usr/local/bin/ceph-osd -f -i 5 -c /tmp/cephtest/ceph.conf'''
flavor: gcov
mon.a-kernel-sha1: 012d5bda1c0f229494c67098d00edfa24c531ea5
mon.b-kernel-sha1: 012d5bda1c0f229494c67098d00edfa24c531ea5
owner: scheduled_teuthology@teuthology
success: false
Updated by Sage Weil about 11 years ago
- Status changed from In Progress to Resolved
resolved a bit later that day, 0942e005448efb60ab31fe98f156d1f1b0e377cd and others.
Updated by Tamilarasi muthamizhan about 11 years ago
- Status changed from Resolved to In Progress
recent log: ubuntu@teuthology:/a/teuthology-2013-02-25_01:00:05-regression-master-testing-gcov/11496
2013-02-25 02:30:27.412397 7ff4e3a1b700 -1 os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int)' thread 7ff4e3a1b700 time 2013-02-25 02:30:27.410577
os/FileStore.cc: 2680: FAILED assert(0 == "unexpected error")
ceph version 0.57-502-g9217c4a (9217c4ac6856efd9dc3435244d95eee32edfd443)
1: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int)+0x956) [0x72bd26]
2: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x71) [0x731fc1]
3: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x266) [0x732246]
4: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0x8339d6]
5: (ThreadPool::WorkThread::entry()+0x10) [0x835800]
6: (()+0x7e9a) [0x7ff4ed99fe9a]
7: (clone()+0x6d) [0x7ff4ebd38cbd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events ---
Updated by Sage Weil about 11 years ago
- Status changed from In Progress to Resolved