Bug #3690

closed

osd crashed in FileStore::_do_transaction

Added by Tamilarasi muthamizhan over 11 years ago. Updated about 11 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Q/A

Description

ceph version: 0.55.1-360-g6356739 (635673928a6b4dae6d4712cacad81cbac6412dc3)

I had a cluster (burnupi15, burnupi19, burnupi20) running argonaut, upgraded it to bobtail, and then started running tests against it from two different clients. On burnupi13 (bobtail version of ceph-fuse) I was running fsstress.sh, and on burnupi14 (argonaut version of ceph-fuse) I was running bonnie.sh. Two of the OSDs (osd.1 on burnupi15 and osd.4 on burnupi19) hit the core pasted below:

ceph version 0.55.1-360-g6356739 (635673928a6b4dae6d4712cacad81cbac6412dc3)
1: /usr/bin/ceph-osd() [0x7839ca]
2: (()+0xfcb0) [0x7fa716788cb0]
3: (gsignal()+0x35) [0x7fa714a5c425]
4: (abort()+0x17b) [0x7fa714a5fb8b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fa7153ae69d]
6: (()+0xb5846) [0x7fa7153ac846]
7: (()+0xb5873) [0x7fa7153ac873]
8: (()+0xb596e) [0x7fa7153ac96e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x82ea7f]
10: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int)+0x912) [0x71a8d2]
11: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x4c) [0x7209cc]
12: (FileStore::_do_op(FileStore::OpSequencer*)+0x1b1) [0x6f0d21]
13: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4bc) [0x823ecc]
14: (ThreadPool::WorkThread::entry()+0x10) [0x825cd0]
15: (()+0x7e9a) [0x7fa716780e9a]
16: (clone()+0x6d) [0x7fa714b19cbd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
-4> 2012-12-27 17:15:41.120097 7fa6e0ed8700 1 -- 10.214.134.22:6802/37265 >> :/0 pipe(0x492dd80 sd=24 :6802 pgs=0 cs=0 l=0).accept sd=24
-3> 2012-12-27 17:15:41.132294 7fa6e0ed8700 1 -- 10.214.134.22:6802/37265 >> :/0 pipe(0x492d240 sd=24 :6802 pgs=0 cs=0 l=0).accept sd=24
-2> 2012-12-27 17:15:41.141558 7fa6e0ed8700 1 -- 10.214.134.22:6802/37265 >> :/0 pipe(0x492d480 sd=24 :6802 pgs=0 cs=0 l=0).accept sd=24
-1> 2012-12-27 17:15:41.148697 7fa70cffd700 -1 *** Caught signal (Aborted) **
in thread 7fa70cffd700

ceph version 0.55.1-360-g6356739 (635673928a6b4dae6d4712cacad81cbac6412dc3)
1: /usr/bin/ceph-osd() [0x7839ca]
2: (()+0xfcb0) [0x7fa716788cb0]
3: (gsignal()+0x35) [0x7fa714a5c425]
4: (abort()+0x17b) [0x7fa714a5fb8b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fa7153ae69d]
6: (()+0xb5846) [0x7fa7153ac846]
7: (()+0xb5873) [0x7fa7153ac873]
8: (()+0xb596e) [0x7fa7153ac96e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x82ea7f]
10: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int)+0x912) [0x71a8d2]
11: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x4c) [0x7209cc]
12: (FileStore::_do_op(FileStore::OpSequencer*)+0x1b1) [0x6f0d21]
13: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4bc) [0x823ecc]
14: (ThreadPool::WorkThread::entry()+0x10) [0x825cd0]
15: (()+0x7e9a) [0x7fa716780e9a]
16: (clone()+0x6d) [0x7fa714b19cbd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

ceph.conf:
ubuntu@burnupi15:~$ sudo cat /etc/ceph/ceph.conf
[global]
auth client required = none
auth cluster required = none
auth service required = none

[osd]
osd journal size = 1000
filestore xattr use omap = true

[osd.1]
host = burnupi15

[osd.2]
host = burnupi15

[osd.3]
host = burnupi19

[osd.4]
host = burnupi19

[osd.5]
host = burnupi20

[osd.6]
host = burnupi20

[mon.a]
host = burnupi15
mon addr = 10.214.134.22:6789

[mon.b]
host = burnupi19
mon addr = 10.214.134.14:6789

[mon.c]
host = burnupi20
mon addr = 10.214.134.12:6789

[mds.a]
host = burnupi20

[client.radosgw.gateway]
host = burnupi15
keyring = /etc/ceph/keyring.radosgw.gateway
rgw socket path = /tmp/radosgw.sock
log file = /var/log/ceph/radosgw.log

Actions #1

Updated by Tamilarasi muthamizhan over 11 years ago

leaving the cluster as it is for someone to take a look at it.

Actions #2

Updated by Sage Weil over 11 years ago

  • Priority changed from Normal to Urgent
Actions #3

Updated by Tamilarasi muthamizhan over 11 years ago

2012-12-27 17:15:38.413374 7fa70cffd700 -1 os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int)' thread 7fa70cffd700 time 2012-12-27 17:15:38.412267
os/FileStore.cc: 2681: FAILED assert(0 == "unexpected error")

 ceph version 0.55.1-360-g6356739 (635673928a6b4dae6d4712cacad81cbac6412dc3)
 1: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int)+0x912) [0x71a8d2]
 2: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x4c) [0x7209cc]
 3: (FileStore::_do_op(FileStore::OpSequencer*)+0x1b1) [0x6f0d21]
 4: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4bc) [0x823ecc]
 5: (ThreadPool::WorkThread::entry()+0x10) [0x825cd0]
 6: (()+0x7e9a) [0x7fa716780e9a]
 7: (clone()+0x6d) [0x7fa714b19cbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Actions #4

Updated by Sage Weil over 11 years ago

2012-12-27 17:22:01.018700 7fb23d75a700  0 accepter.accepter no incoming connection?  sd = -1 errno 24 Too many open files
2012-12-27 17:22:02.961930 7fb242f65700  0 filestore(/var/lib/ceph/osd/ceph-4) write couldn't open meta/1f98b06a/pglog_0.181/0//-1 flags 65: (24) Too many open files
2012-12-27 17:22:03.077591 7fb242f65700  0 filestore(/var/lib/ceph/osd/ceph-4)  error (24) Too many open files not handled on operation 10 (46891.0.0, or op 0, counting from 0)
2012-12-27 17:22:03.077647 7fb242f65700  0 filestore(/var/lib/ceph/osd/ceph-4) unexpected error code
2012-12-27 17:22:03.077651 7fb242f65700  0 filestore(/var/lib/ceph/osd/ceph-4)  transaction dump:
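
For readers decoding the numeric values in the log lines above, here is a minimal Python sketch. It assumes Linux constant values, and the helper names `describe_errno` and `decode_open_flags` are hypothetical, not part of Ceph. It maps errno 24 to its symbolic name and splits the `flags 65` value from the failed open into its components.

```python
import errno
import os

def describe_errno(code):
    """Return (symbolic name, message) for a numeric errno value."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)

def decode_open_flags(flags):
    """Split a numeric open(2) flags value into symbolic names.

    Assumes the Linux convention that the low two bits hold the access mode.
    """
    access = {os.O_RDONLY: "O_RDONLY", os.O_WRONLY: "O_WRONLY", os.O_RDWR: "O_RDWR"}
    names = [access.get(flags & 3, "?")]
    # Check a few common creation/status bits; this list is not exhaustive.
    for name in ("O_CREAT", "O_EXCL", "O_TRUNC", "O_APPEND"):
        if flags & getattr(os, name):
            names.append(name)
    return names

print(describe_errno(24))     # errno reported by the accepter and the filestore
print(decode_open_flags(65))  # flags reported by the failed pglog open
```

On Linux, errno 24 is EMFILE (the per-process descriptor limit was reached) and flags 65 is O_WRONLY|O_CREAT, so the OSD could not even open a pglog object for writing.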
Actions #5

Updated by Sage Weil over 11 years ago

  • Assignee set to Sage Weil
Actions #6

Updated by Sage Weil over 11 years ago

  • Status changed from New to 7

made the default fd limit much higher in 672c56b18de3b02606e47013edfc2e8b679d8797
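
As an illustration of what a higher fd limit means at the process level, the sketch below inspects and raises the soft `RLIMIT_NOFILE` limit using Python's standard `resource` module. It is not the change made in the commit above. The function name `raise_nofile_soft_limit` and the target value are assumptions chosen for the example.

```python
import resource

def raise_nofile_soft_limit(target):
    """Raise the soft open-files limit toward `target`, capped at the hard limit.

    Returns the (soft, hard) pair in effect afterwards. The limit is never lowered.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        # An unprivileged process cannot raise its soft limit above the hard limit.
        target = min(target, hard)
    if soft != resource.RLIM_INFINITY and soft < target:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)

if __name__ == "__main__":
    before = resource.getrlimit(resource.RLIMIT_NOFILE)
    after = raise_nofile_soft_limit(16384)  # hypothetical target value
    print("RLIMIT_NOFILE before:", before, "after:", after)
```

Raising the limit only treats the symptom. If something keeps opening connections, as the accept lines in the log suggest, a higher limit just delays the EMFILE failure.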

Actions #7

Updated by Sage Weil over 11 years ago

  • Priority changed from Urgent to High
Actions #8

Updated by Sage Weil over 11 years ago

  • Status changed from 7 to Resolved

the problem was old ceph-osd daemons on other hosts trying to connect; they were running code that didn't include 4d20b60970413ca1b55e73eaae087e686b72f6a5

Actions #9

Updated by Tamilarasi muthamizhan about 11 years ago

  • Status changed from Resolved to In Progress

recent log: ubuntu@teuthology:/a/teuthology-2013-02-10_01:00:02-regression-master-testing-gcov/4059

     0> 2013-02-10 02:19:24.170836 7f9da597b700 -1 *** Caught signal (Aborted) **
 in thread 7f9da597b700

 ceph version 0.56-707-gabc80ff (abc80ffc5b1aab3915c049701ab85c57fe93d550)
 1: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x8acc3a]
 2: (()+0xfcb0) [0x7f9daf313cb0]
 3: (gsignal()+0x35) [0x7f9dad3da445]
 4: (abort()+0x17b) [0x7f9dad3ddbab]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f9dadd2869d]
 6: (()+0xb5846) [0x7f9dadd26846]
 7: (()+0xb5873) [0x7f9dadd26873]
 8: (()+0xb596e) [0x7f9dadd2696e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x297) [0x995e47]
 10: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int)+0x7f6c) [0x833f5c]
 11: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0xa1) [0x8368b1]
 12: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x34b) [0x836c2b]
 13: (FileStore::OpWQ::_process(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x15) [0x83d415]
 14: (ThreadPool::WorkQueue<FileStore::OpSequencer>::_void_process(void*, ThreadPool::TPHandle&)+0x12) [0x838a52]
 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0x6ce) [0x9878de]
 16: (ThreadPool::WorkThread::entry()+0x18) [0x98b038]
 17: (Thread::_entry_func(void*)+0x12) [0x97a432]
 18: (()+0x7e9a) [0x7f9daf30be9a]
 19: (clone()+0x6d) [0x7f9dad4964bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

ubuntu@teuthology:/a/teuthology-2013-02-10_01:00:02-regression-master-testing-gcov/4059$ cat config.yaml 
kernel: &id001
  kdb: true
  sha1: 012d5bda1c0f229494c67098d00edfa24c531ea5
nuke-on-error: true
overrides:
  ceph:
    conf:
      mds:
        debug mds: 1/20
      osd:
        osd op thread timeout: 60
    coverage: true
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: abc80ffc5b1aab3915c049701ab85c57fe93d550
  s3tests:
    branch: master
  workunit:
    sha1: abc80ffc5b1aab3915c049701ab85c57fe93d550
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
targets:
  ubuntu@plana39.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDo+Kh24vRxeTQ6/n5PIIGuxrPHPRO/xMQlwoLHi7mR01cIXJMG5wet7mp2om3/5SZSDcLBHduDKrdWL142Sg5fC0zZPUggbxS7nz/UCjYBzMsOtHEUAU5Gs0KFopOCHXNEveK95ezsroMAD5+jS/IEpiooYCkrR3H+NSvUU0Ae352PlXqV0vamkYzyQyEMmhFE50ALhUXbKMve3d2mxJee5sqVZSBmQTbze9RKUA96t9iiwiheflXbN1i9WHlbBOIue5pZ5fM3/vqPWgaShfFpa0pT56QKJfjyFcDeCLOislo23E5qKAJOi5vn5BoYVtG3niNQpt/YbYGfDEHVeqt9
  ubuntu@plana54.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC60+3L8IN2WBpWHY94YAuCOMVdKs3xUqYQpO1ie127fBomk7fiEhR0RovhmDHWIzNr3qvvNkIp9Y+MHUcpZ7C4MFFMYsy2+zq026Ag3XLEOyZWDSyPfMapd5+nmuvxJqEvAx4wAWBhYVEB3aPFmDmz4mayZ9aSYoA1lhsClxfYpAHZ0zRWX3kY1KxXlk6UrZy0igYGvKIvmubkYcmFzOPsI3aWpgWU1rEXGWsFHOlwaor0KJPnpEsZYTrlPyLZqJcKbI/EcHgti0ak22vsDT7LVMKoyPXeUFL5ZGUEpuqQ+IMiECCMKa8X8vPG2MN9V6DK3gQezF+lo5CRCAu7DYdn
  ubuntu@plana83.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDPiffDjAR8Vr7B1C/WQ4ZiY0rnm4DSvknWpdrM1NMvHhxdrnceyfi6K8y0d+NdQKAkRXzCMbYFM/LU1gUEoIgT6k15wzz/KBUyWPbZ4BCNQNLng3z2aQ80FSA+CWYir0UVCOuj+glDN/MAqWmOwfJ7zHkej1/+hMH5mBTy9Nc8/VS7/yDIQgHLdHmeeMTlGVvBs+uc0b4rBd5kSzhzdwegxO+J86aR/HVsdL7/kEOVHeZQ0yWCT5mRa1+nTveZWYXBz+8ZIFzIx4e7QIhGw/XjDYqqhLD+ZjaUXuKNWG3GqLrQntt42LRvC6flIF11PoL1g0wsxW4DV5Vwdg+zkX71
tasks:
- internal.lock_machines:
  - 3
  - plana
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds: null
- kclient: null
- workunit:
    clients:
      all:
      - suites/ffsb.sh

ubuntu@teuthology:/a/teuthology-2013-02-10_01:00:02-regression-master-testing-gcov/4059$ cat summary.yaml 
ceph-sha1: abc80ffc5b1aab3915c049701ab85c57fe93d550
client.0-kernel-sha1: 012d5bda1c0f229494c67098d00edfa24c531ea5
description: collection:kernel-thrash clusters:fixed-3.yaml fs:btrfs.yaml thrashers:default.yaml
  workloads:kclient_workunit_suites_ffsb.yaml
duration: 1005.7241339683533
failure_reason: 'Command failed with status 1: ''/tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage
  /tmp/cephtest/archive/coverage /tmp/cephtest/daemon-helper term /tmp/cephtest/binary/usr/local/bin/ceph-osd
  -f -i 5 -c /tmp/cephtest/ceph.conf'''
flavor: gcov
mon.a-kernel-sha1: 012d5bda1c0f229494c67098d00edfa24c531ea5
mon.b-kernel-sha1: 012d5bda1c0f229494c67098d00edfa24c531ea5
owner: scheduled_teuthology@teuthology
success: false

Actions #10

Updated by Sage Weil about 11 years ago

  • Status changed from In Progress to Resolved

resolved a bit later that day, 0942e005448efb60ab31fe98f156d1f1b0e377cd and others.

Actions #11

Updated by Tamilarasi muthamizhan about 11 years ago

  • Status changed from Resolved to In Progress

recent log: ubuntu@teuthology:/a/teuthology-2013-02-25_01:00:05-regression-master-testing-gcov/11496

2013-02-25 02:30:27.412397 7ff4e3a1b700 -1 os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int)' thread 7ff4e3a1b700 time 2013-02-25 02:30:27.410577
os/FileStore.cc: 2680: FAILED assert(0 == "unexpected error")

ceph version 0.57-502-g9217c4a (9217c4ac6856efd9dc3435244d95eee32edfd443)
1: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int)+0x956) [0x72bd26]
2: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x71) [0x731fc1]
3: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x266) [0x732246]
4: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0x8339d6]
5: (ThreadPool::WorkThread::entry()+0x10) [0x835800]
6: (()+0x7e9a) [0x7ff4ed99fe9a]
7: (clone()+0x6d) [0x7ff4ebd38cbd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---

Actions #12

Updated by Sage Weil about 11 years ago

  • Status changed from In Progress to Resolved