Bug #14242


"FileStore.cc: In function 'unsigned int" in rados-jewel-distro-basic-smithi

Added by Yuri Weinstein over 8 years ago. Updated about 8 years ago.

Status: Can't reproduce
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: rados
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Run: http://pulpito.ceph.com/teuthology-2016-01-02_19:00:08-rados-jewel-distro-basic-smithi/
Jobs: ['11871', '11893', '11894', '12018']
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2016-01-02_19:00:08-rados-jewel-distro-basic-smithi/11871/teuthology.log

2016-01-04T04:46:37.344 INFO:tasks.ceph.osd.4.smithi035.stderr:os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7f1ebadef700 time 2016-01-04 07:46:37.331636
2016-01-04T04:46:37.344 INFO:tasks.ceph.osd.4.smithi035.stderr:os/FileStore.cc: 2890: FAILED assert(0 == "unexpected error")
2016-01-04T04:46:37.357 INFO:tasks.ceph.osd.4.smithi035.stderr: ceph version 10.0.1-609-g749c424 (749c42422ec31e6ab4858fcb62d45c2c1a8da3f6)
2016-01-04T04:46:37.357 INFO:tasks.ceph.osd.4.smithi035.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7f1ec9164eeb]
2016-01-04T04:46:37.357 INFO:tasks.ceph.osd.4.smithi035.stderr: 2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0xa5e) [0x7f1ec8ea7dae]
2016-01-04T04:46:37.357 INFO:tasks.ceph.osd.4.smithi035.stderr: 3: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x64) [0x7f1ec8eaf144]
2016-01-04T04:46:37.358 INFO:tasks.ceph.osd.4.smithi035.stderr: 4: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x1a9) [0x7f1ec8eaf309]
2016-01-04T04:46:37.358 INFO:tasks.ceph.osd.4.smithi035.stderr: 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0x7f1ec915689e]
2016-01-04T04:46:37.358 INFO:tasks.ceph.osd.4.smithi035.stderr: 6: (ThreadPool::WorkThread::entry()+0x10) [0x7f1ec9157780]
2016-01-04T04:46:37.358 INFO:tasks.ceph.osd.4.smithi035.stderr: 7: (()+0x8182) [0x7f1ec7735182]
2016-01-04T04:46:37.359 INFO:tasks.ceph.osd.4.smithi035.stderr: 8: (clone()+0x6d) [0x7f1ec5a7c47d]
2016-01-04T04:46:37.359 INFO:tasks.ceph.osd.4.smithi035.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
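
For context on the trace above: FileStore treats any errno it has no recovery path for as fatal, which is what the assert at FileStore.cc:2890 expresses. A minimal standalone sketch of that pattern (not the actual Jewel source; do_sub_op is a hypothetical stand-in for a write/clone/setattr against the backing filesystem):

    #include <cassert>
    #include <cerrno>
    #include <cstdio>

    // Hypothetical stand-in for one transaction sub-op; here it simulates
    // the backing filesystem returning ENOSPC.
    static int do_sub_op() { return -ENOSPC; }

    int main() {
      int r = do_sub_op();
      if (r < 0) {
        bool tolerated = (r == -ENOENT);  // a few errnos are tolerated case-by-case
        if (!tolerated) {
          if (r == -ENOSPC)
            fprintf(stderr, "ENOSPC from backing filesystem\n");
          // No recovery path: abort rather than risk applying a partial
          // transaction -- the "FAILED assert" seen in the logs above.
          assert(0 == "unexpected error");
        }
      }
      return 0;
    }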

Related issues (2: 0 open, 2 closed)

Related to Ceph - Bug #14103: "FileStore.cc: 2757: FAILED assert(0 == "unexpected error")" in upgrade:hammer-x-jewel-distro-basic-openstack (Closed, 12/17/2015)
Related to Ceph - Bug #14289: "No space left on device" + "FAILED assert(0 == "unexpected error")" in rados-infernalis-distro-basic-openstack (Resolved, 01/07/2016)
Actions #1

Updated by Yuri Weinstein over 8 years ago

See #14103

Actions #2

Updated by Yuri Weinstein over 8 years ago

  • Related to Bug #14103: "FileStore.cc: 2757: FAILED assert(0 == "unexpected error")" in upgrade:hammer-x-jewel-distro-basic-openstack added
Actions #3

Updated by Samuel Just over 8 years ago

  • Priority changed from Normal to Urgent

ENOSPC: the disk is out of space.
1) The test needs to not write that much.
2) Why didn't the cluster stop writes/recovery before this point? (2) is an actual bug.
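
For reference on point 2: the monitors are supposed to mark OSDs nearfull/full and block client writes well before the backing filesystem itself returns ENOSPC. The relevant Jewel-era knobs and their documented defaults, as they would appear in ceph.conf:

    [mon]
    # warn when an OSD crosses this usage ratio
    mon osd nearfull ratio = 0.85
    # mark the cluster full and block client writes at this ratio
    mon osd full ratio = 0.95

    [osd]
    # final OSD-side guard: reject ops before the filesystem actually fills
    osd failsafe full ratio = 0.97

Hitting filesystem-level ENOSPC despite these guards usually means the ratios were raised, the disk is small enough that the ratio granularity is too coarse, or the filesystem under-reported its usage.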

Actions #4

Updated by Yuri Weinstein over 8 years ago

A similar failure appeared in
Run: http://pulpito.ceph.com/teuthology-2016-01-04_19:00:01-rados-jewel-distro-basic-smithi/
Job: 13535
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2016-01-04_19:00:01-rados-jewel-distro-basic-smithi/13535/teuthology.log

2016-01-05T17:56:00.004 INFO:tasks.ceph.osd.2.smithi017.stderr:os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7f8070fb4700 time 2016-01-05 20:56:00.013636
2016-01-05T17:56:00.005 INFO:tasks.ceph.osd.2.smithi017.stderr:os/FileStore.cc: 2890: FAILED assert(0 == "unexpected error")
2016-01-05T17:56:00.020 INFO:tasks.ceph.osd.2.smithi017.stderr: ceph version 10.0.1-612-g7bcb744 (7bcb744d6b76aea3aebf065edfc231d6b5c42d2f)
2016-01-05T17:56:00.020 INFO:tasks.ceph.osd.2.smithi017.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f807f50e6d5]
2016-01-05T17:56:00.021 INFO:tasks.ceph.osd.2.smithi017.stderr: 2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0xa95) [0x7f807f23c455]
2016-01-05T17:56:00.021 INFO:tasks.ceph.osd.2.smithi017.stderr: 3: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x64) [0x7f807f242184]
2016-01-05T17:56:00.021 INFO:tasks.ceph.osd.2.smithi017.stderr: 4: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x194) [0x7f807f242334]
2016-01-05T17:56:00.021 INFO:tasks.ceph.osd.2.smithi017.stderr: 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa7e) [0x7f807f4ffbce]
2016-01-05T17:56:00.022 INFO:tasks.ceph.osd.2.smithi017.stderr: 6: (ThreadPool::WorkThread::entry()+0x10) [0x7f807f500ab0]
2016-01-05T17:56:00.022 INFO:tasks.ceph.osd.2.smithi017.stderr: 7: (()+0x7df5) [0x7f807d553df5]
2016-01-05T17:56:00.022 INFO:tasks.ceph.osd.2.smithi017.stderr: 8: (clone()+0x6d) [0x7f807bdfc1ad]
2016-01-05T17:56:00.022 INFO:tasks.ceph.osd.2.smithi017.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

And in job '13370':
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2016-01-04_19:00:01-rados-jewel-distro-basic-smithi/13370/teuthology.log

Actions #5

Updated by Yuri Weinstein over 8 years ago

  • Related to Bug #14289: "No space left on device" + "FAILED assert(0 == "unexpected error")" in rados-infernalis-distro-basic-openstack added
Actions #6

Updated by Tristan Cacqueray about 8 years ago

We are also reproducing this bug consistently. It does seem related to an OSD running out of space, but the failing OSD is also reporting drive errors in the kernel logs: "Buffer I/O error on device", then "Add. Sense: Unrecovered read error" and "end_request: critical medium error".

Not sure if it's related, but this Ceph cluster is also running a pool with only one replica.

Anyway, once the OSD fails with this exact stack trace, the whole cluster seems to be stuck until the OSD is manually removed. Can the severity be raised from minor?
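
For anyone triaging a similar hang, a few standard checks that distinguish the two suspects mentioned here (failing drive vs. full disk):

    # kernel-level drive errors on the OSD host
    dmesg | grep -iE 'medium error|I/O error'

    # cluster view: down/full OSDs and stuck PGs
    ceph health detail
    ceph osd tree

    # per-OSD utilization
    ceph osd df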

Actions #7

Updated by Sage Weil about 8 years ago

  • Status changed from New to Need More Info

Yuri - the second error is ENOSPC; ignore it.

Tristan - it sounds like something else is going on. No single OSD failure should make your cluster hang, regardless of how it failed. My guess is that things are already degraded, the pool is down to 1 replica, and min_size is 2...
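
A quick way to check that theory (the pool name "rbd" here is only an example): if min_size exceeds the number of surviving replicas, I/O to the affected PGs blocks by design.

    ceph osd pool get rbd size      # configured replica count
    ceph osd pool get rbd min_size  # replicas required before I/O is served
    # allow I/O with a single surviving replica (trades safety for availability):
    ceph osd pool set rbd min_size 1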

The first failure (/a/teuthology-2016-01-04_19:00:01-rados-jewel-distro-basic-smithi/13535) is concerning, but unfortunately there are no logs.

Actions #8

Updated by Sage Weil about 8 years ago

Tristan - oh, you have 1 replica. Of course it will hang if an OSD fails.

Actions #9

Updated by Sage Weil about 8 years ago

  • Status changed from Need More Info to Can't reproduce

Best guess is btrfs; the disk shouldn't have filled up this quickly.
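
One way to check the btrfs theory on a node that still has the OSD mounted (the data path below is the default layout; adjust as needed):

    # filesystem type and usage of the OSD data directory
    df -Th /var/lib/ceph/osd/ceph-4

    # btrfs keeps its own space accounting, which can diverge from df
    btrfs filesystem df /var/lib/ceph/osd/ceph-4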
