Project

General

Profile

Bug #16257

Ceph random bug is killing osds (os/filestore/FileStore.cc: 2912: FAILED assert(0 == "unexpected error")

Added by Joao Castro over 6 years ago. Updated about 6 years ago.

Status:
Closed
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hello,
I am doing a large rsync into cephfs.
From time to time I get one OSD killed.. It happened to different osd's.

log:

ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x558d790a2d60]
2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0xb81) [0x558d78d3f5f1]
3: (FileStore::_do_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, unsigned long, ThreadPool::TPHandle*)+0x3b) [0x558d78d4591b]
4: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x39d) [0x558d78d45ced]
5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb1) [0x558d79094f31]
6: (ThreadPool::WorkThread::entry()+0x10) [0x558d79096030]
7: (()+0x76fa) [0x7fce441666fa]
8: (clone()+0x6d) [0x7fce421e1b5d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
-2> 2016-06-13 05:35:22.841041 7fce306d7700 -1 os/filestore/FileStore.cc: In function 'void FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7fce306d7700 time 2016-06-13 05:35:22.754304
os/filestore/FileStore.cc: 2912: FAILED assert(0 == "unexpected error")

Any idea?
Thanks!


Related issues

Related to Ceph - Bug #17972: hammer: enospc on setxattr on btrfs Won't Fix 11/21/2016
Related to Ceph - Bug #18037: leveldb corruption leads to "Operation not permitted not handled" and assert Resolved 11/24/2016

History

#1 Updated by Loïc Dachary over 6 years ago

  • Target version deleted (v10.2.2)

#2 Updated by Yuri Weinstein over 6 years ago

Also see in upgrade:hammer-x-jewel-distro-basic-vps/
Run: http://pulpito.ceph.com/teuthology-2016-06-25_18:15:02-upgrade:hammer-x-jewel-distro-basic-vps/
Job: 277367
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2016-06-25_18:15:02-upgrade:hammer-x-jewel-distro-basic-vps/277367/teuthology.log

2016-06-25T20:45:12.517 INFO:tasks.ceph.osd.3.vpm033.stderr:os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7f170e299700 time 2016-06-26 03:45:12.752214
2016-06-25T20:45:12.517 INFO:tasks.ceph.osd.3.vpm033.stderr:os/FileStore.cc: 2761: FAILED assert(0 == "unexpected error")
2016-06-25T20:45:12.524 INFO:tasks.ceph.osd.3.vpm033.stderr: ceph version 0.94.7-26-g2e156d7 (2e156d7ad4b9f4ffd6028df3a460b50b30c8b0d3)
2016-06-25T20:45:12.525 INFO:tasks.ceph.osd.3.vpm033.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xbb214b]
2016-06-25T20:45:12.525 INFO:tasks.ceph.osd.3.vpm033.stderr: 2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0xa6c) [0x90c9dc]
2016-06-25T20:45:12.525 INFO:tasks.ceph.osd.3.vpm033.stderr: 3: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x64) [0x913004]
2016-06-25T20:45:12.526 INFO:tasks.ceph.osd.3.vpm033.stderr: 4: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x180) [0x9131a0]
2016-06-25T20:45:12.526 INFO:tasks.ceph.osd.3.vpm033.stderr: 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa56) [0xba2d36]
2016-06-25T20:45:12.526 INFO:tasks.ceph.osd.3.vpm033.stderr: 6: (ThreadPool::WorkThread::entry()+0x10) [0xba3de0]
2016-06-25T20:45:12.526 INFO:tasks.ceph.osd.3.vpm033.stderr: 7: (()+0x8184) [0x7f171ae08184]
2016-06-25T20:45:12.527 INFO:tasks.ceph.osd.3.vpm033.stderr: 8: (clone()+0x6d) [0x7f171937337d]
2016-06-25T20:45:12.527 INFO:tasks.ceph.osd.3.vpm033.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.2016-06-25T20:45:12.517 INFO:tasks.ceph.osd.3.vpm033.stderr:os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7f170e299700 time 2016-06-26 03:45:12.752214
2016-06-25T20:45:12.517 INFO:tasks.ceph.osd.3.vpm033.stderr:os/FileStore.cc: 2761: FAILED assert(0 == "unexpected error")
2016-06-25T20:45:12.524 INFO:tasks.ceph.osd.3.vpm033.stderr: ceph version 0.94.7-26-g2e156d7 (2e156d7ad4b9f4ffd6028df3a460b50b30c8b0d3)
2016-06-25T20:45:12.525 INFO:tasks.ceph.osd.3.vpm033.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xbb214b]
2016-06-25T20:45:12.525 INFO:tasks.ceph.osd.3.vpm033.stderr: 2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0xa6c) [0x90c9dc]
2016-06-25T20:45:12.525 INFO:tasks.ceph.osd.3.vpm033.stderr: 3: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x64) [0x913004]
2016-06-25T20:45:12.526 INFO:tasks.ceph.osd.3.vpm033.stderr: 4: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x180) [0x9131a0]
2016-06-25T20:45:12.526 INFO:tasks.ceph.osd.3.vpm033.stderr: 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa56) [0xba2d36]
2016-06-25T20:45:12.526 INFO:tasks.ceph.osd.3.vpm033.stderr: 6: (ThreadPool::WorkThread::entry()+0x10) [0xba3de0]
2016-06-25T20:45:12.526 INFO:tasks.ceph.osd.3.vpm033.stderr: 7: (()+0x8184) [0x7f171ae08184]
2016-06-25T20:45:12.527 INFO:tasks.ceph.osd.3.vpm033.stderr: 8: (clone()+0x6d) [0x7f171937337d]
2016-06-25T20:45:12.527 INFO:tasks.ceph.osd.3.vpm033.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#3 Updated by Simon - over 6 years ago

Hi,

I can confirm this bug as it is happening for us too.
Unfortunately it appears that the bug takes multiple OSDs down at the same time, effectivly rendering the cluster useless.

Regards
Simon

#4 Updated by min fang over 6 years ago

yes, I also see this assert from my ceph cluster:
#0 0x00003fff99acf27c in _GI_raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:54
#1 0x00003fff99ad18f4 in __GI_abort () at abort.c:89
#2 0x00000000377f5db8 in ceph::
_ceph_assert_fail (assertion=0x379d5cd8 "0 == \"unexpected error\"", file=<optimized out>, line=<optimized out>, func=<optimized out>) at common/assert.cc:78
#3 0x00000000373f7268 in FileStore::_do_transaction (this=0x10005510000, t=..., op_seq=<optimized out>, trans_num=<optimized out>, handle=0x3fff8fc0d190) at os/filestore/FileStore.cc:2912
#4 0x00000000373fb47c in FileStore::_do_transactions (this=0x10005510000, tls=std::vector of length 2, capacity 2 = {...}, op_seq=690065, handle=0x3fff8fc0d190) at os/filestore/FileStore.cc:2117
#5 0x00000000373fb860 in FileStore::_do_op (this=0x10005510000, osr=0x1001530c4a0, handle=...) at os/filestore/FileStore.cc:1887
#6 0x0000000037415ff4 in FileStore::OpWQ::_process (handle=..., osr=<optimized out>, this=<optimized out>) at os/filestore/FileStore.h:375
#7 ThreadPool::WorkQueue<FileStore::OpSequencer>::_void_process (this=<optimized out>, p=<optimized out>, handle=...) at ./common/WorkQueue.h:282
#8 0x00000000377e78d8 in ThreadPool::worker (this=0x10005510ca8, wt=0x10005f04770) at common/WorkQueue.cc:128
#9 0x00000000377e8cc0 in ThreadPool::WorkThread::entry (this=<optimized out>) at common/WorkQueue.h:445
#10 0x00000000377d6fa0 in Thread::entry_wrapper (this=0x10005f04770) at common/Thread.cc:87
#11 0x00000000377d6ff8 in Thread::_entry_func (arg=<optimized out>) at common/Thread.cc:67
#12 0x00003fff9a5e84a0 in start_thread (arg=0x3fff8fc0dfe0) at pthread_create.c:335

#5 Updated by Samuel Just over 6 years ago

  • Status changed from New to Need More Info

Please include enough log so that we can see the actual error. One of the things that can cause that assert is an EIO.

#6 Updated by Yuri Weinstein over 6 years ago

Run: http://pulpito.ceph.com/yuriw-2016-08-05_14:30:29-rados-wip-yuri-testing_2016_08_05-distro-basic-smithi/
Job: 350802
Logs:http://qa-proxy.ceph.com/teuthology/yuriw-2016-08-05_14:30:29-rados-wip-yuri-testing_2016_08_05-distro-basic-smithi/350802/teuthology.log

2016-08-06T04:02:51.546 INFO:tasks.ceph.osd.5.smithi051.stderr:/srv/autobuild-ceph/gitbuilder.git/build/out~/ceph-11.0.0-1299-g2081100/src/os/filestore/FileStore.cc: In function 'void FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7f6518f4d700 time 2016-08-06 11:02:50.936465
2016-08-06T04:02:51.546 INFO:tasks.ceph.osd.5.smithi051.stderr:/srv/autobuild-ceph/gitbuilder.git/build/out~/ceph-11.0.0-1299-g2081100/src/os/filestore/FileStore.cc: 3002: FAILED assert(0 == "unexpected error")
2016-08-06T04:02:51.547 INFO:tasks.ceph.osd.5.smithi051.stderr: ceph version v11.0.0-1299-g2081100 (2081100b6dd557d4944f2e3deff762ceaa314cf8)
2016-08-06T04:02:51.547 INFO:tasks.ceph.osd.5.smithi051.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xd5e8ab]
2016-08-06T04:02:51.547 INFO:tasks.ceph.osd.5.smithi051.stderr: 2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0xee0) [0xa54530]
2016-08-06T04:02:51.547 INFO:tasks.ceph.osd.5.smithi051.stderr: 3: (FileStore::_do_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, unsigned long, ThreadPool::TPHandle*)+0x3b) [0xa5a04b]
2016-08-06T04:02:51.548 INFO:tasks.ceph.osd.5.smithi051.stderr: 4: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x2b5) [0xa5a335]
2016-08-06T04:02:51.548 INFO:tasks.ceph.osd.5.smithi051.stderr: 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb65) [0xd65375]
2016-08-06T04:02:51.548 INFO:tasks.ceph.osd.5.smithi051.stderr: 6: (ThreadPool::WorkThread::entry()+0x10) [0xd66340]
2016-08-06T04:02:51.548 INFO:tasks.ceph.osd.5.smithi051.stderr: 7: (()+0x8184) [0x7f65256c1184]
2016-08-06T04:02:51.549 INFO:tasks.ceph.osd.5.smithi051.stderr: 8: (clone()+0x6d) [0x7f652424937d]
2016-08-06T04:02:51.551 INFO:tasks.ceph.osd.5.smithi051.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#7 Updated by Yuri Weinstein over 6 years ago

http://qa-proxy.ceph.com/teuthology/teuthology-2016-08-27_22:00:09-rados-jewel-distro-basic-smithi/387794/teuthology.log

Assertion: os/filestore/FileStore.cc: 2912: FAILED assert(0 == "unexpected error")
ceph version 10.2.2-503-g3168a84 (3168a8482211a2aaf9b4d715c0f7a920e896d9bb)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7fe1505da7e5]
 2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0xef6) [0x7fe1502b1d26]
 3: (FileStore::_do_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, unsigned long, ThreadPool::TPHandle*)+0x3b) [0x7fe1502b7a6b]
 4: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x2cd) [0x7fe1502b7d6d]
 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa7e) [0x7fe1505cb82e]
 6: (ThreadPool::WorkThread::entry()+0x10) [0x7fe1505cc710]
 7: (()+0x7dc5) [0x7fe14e2f5dc5]
 8: (clone()+0x6d) [0x7fe14c98128d]

#8 Updated by Samuel Just about 6 years ago

  • Status changed from Need More Info to Can't reproduce

None of these includes enough log to see the actual failure, so I'm marking it can't reproduce.

#9 Updated by Nathan Cutler about 6 years ago

  • Status changed from Can't reproduce to New
  • Assignee set to Samuel Just

Hi Sam, this is happening in hammer 0.94.10 integration testing. See http://tracker.ceph.com/issues/17151#note-11

The test in question (in fs suite) is http://pulpito.front.sepia.ceph.com/smithfarm-2016-11-18_09:59:00-fs-hammer-backports---basic-smithi/558529/

An OSD log containing the crash can be found here: /a/smithfarm-2016-11-18_09:59:00-fs-hammer-backports---basic-smithi/558529/remote/smithi083/log/ceph-osd.3.log.gz

#10 Updated by Samuel Just about 6 years ago

  • Status changed from New to Closed

Do not reopen this bug, it's unlikely that any two of the above reports are actually the same bug. Please open a new bug with the actual error (look back a bit in the log for the actual error code and offending transaction).

#11 Updated by Nathan Cutler about 6 years ago

  • Related to Bug #17972: hammer: enospc on setxattr on btrfs added

#12 Updated by Nathan Cutler about 6 years ago

  • Related to Bug #18037: leveldb corruption leads to "Operation not permitted not handled" and assert added

Also available in: Atom PDF