Bug #14242


"FileStore.cc: In function 'unsigned int" in rados-jewel-distro-basic-smithi

Added by Yuri Weinstein over 8 years ago. Updated about 8 years ago.

Status: Can't reproduce
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: rados
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Run: http://pulpito.ceph.com/teuthology-2016-01-02_19:00:08-rados-jewel-distro-basic-smithi/
Jobs: ['11871', '11893', '11894', '12018']
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2016-01-02_19:00:08-rados-jewel-distro-basic-smithi/11871/teuthology.log

2016-01-04T04:46:37.344 INFO:tasks.ceph.osd.4.smithi035.stderr:os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7f1ebadef700 time 2016-01-04 07:46:37.331636
2016-01-04T04:46:37.344 INFO:tasks.ceph.osd.4.smithi035.stderr:os/FileStore.cc: 2890: FAILED assert(0 == "unexpected error")
2016-01-04T04:46:37.357 INFO:tasks.ceph.osd.4.smithi035.stderr: ceph version 10.0.1-609-g749c424 (749c42422ec31e6ab4858fcb62d45c2c1a8da3f6)
2016-01-04T04:46:37.357 INFO:tasks.ceph.osd.4.smithi035.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7f1ec9164eeb]
2016-01-04T04:46:37.357 INFO:tasks.ceph.osd.4.smithi035.stderr: 2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0xa5e) [0x7f1ec8ea7dae]
2016-01-04T04:46:37.357 INFO:tasks.ceph.osd.4.smithi035.stderr: 3: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x64) [0x7f1ec8eaf144]
2016-01-04T04:46:37.358 INFO:tasks.ceph.osd.4.smithi035.stderr: 4: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x1a9) [0x7f1ec8eaf309]
2016-01-04T04:46:37.358 INFO:tasks.ceph.osd.4.smithi035.stderr: 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0x7f1ec915689e]
2016-01-04T04:46:37.358 INFO:tasks.ceph.osd.4.smithi035.stderr: 6: (ThreadPool::WorkThread::entry()+0x10) [0x7f1ec9157780]
2016-01-04T04:46:37.358 INFO:tasks.ceph.osd.4.smithi035.stderr: 7: (()+0x8182) [0x7f1ec7735182]
2016-01-04T04:46:37.359 INFO:tasks.ceph.osd.4.smithi035.stderr: 8: (clone()+0x6d) [0x7f1ec5a7c47d]
2016-01-04T04:46:37.359 INFO:tasks.ceph.osd.4.smithi035.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
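
For context on the trace above: FileStore treats any errno it has no recovery path for as fatal, which is what the assert at FileStore.cc:2890 expresses. A minimal standalone sketch of that pattern (not the actual Jewel source; do_sub_op is a hypothetical stand-in for a write/clone/setattr against the backing filesystem):

    #include <cassert>
    #include <cerrno>
    #include <cstdio>

    // Hypothetical stand-in for one transaction sub-op; here it simulates
    // the backing filesystem returning ENOSPC.
    static int do_sub_op() { return -ENOSPC; }

    int main() {
      int r = do_sub_op();
      if (r < 0) {
        bool tolerated = (r == -ENOENT);  // a few errnos are tolerated case-by-case
        if (!tolerated) {
          if (r == -ENOSPC)
            fprintf(stderr, "ENOSPC from backing filesystem\n");
          // No recovery path: abort rather than risk applying a partial
          // transaction -- the "FAILED assert" seen in the logs above.
          assert(0 == "unexpected error");
        }
      }
      return 0;
    }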

Related issues (2: 0 open, 2 closed)

Related to Ceph - Bug #14103: "FileStore.cc: 2757: FAILED assert(0 == "unexpected error")" in upgrade:hammer-x-jewel-distro-basic-openstack (Closed, 12/17/2015)
Related to Ceph - Bug #14289: "No space left on device" + "FAILED assert(0 == "unexpected error")" in rados-infernalis-distro-basic-openstack (Resolved, 01/07/2016)
Actions #1

Updated by Yuri Weinstein over 8 years ago

See #14103

Actions #2

Updated by Yuri Weinstein over 8 years ago

  • Related to Bug #14103: "FileStore.cc: 2757: FAILED assert(0 == "unexpected error")" in upgrade:hammer-x-jewel-distro-basic-openstack added
Actions #3

Updated by Samuel Just over 8 years ago

  • Priority changed from Normal to Urgent

ENOSPC: the disk is out of space.
1) The test needs to not write that much.
2) Why didn't the cluster stop writes/recovery before this point? (2) is an actual bug.
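
For reference on point 2: the monitors are supposed to mark OSDs nearfull/full and block client writes well before the backing filesystem itself returns ENOSPC. The relevant Jewel-era knobs and their documented defaults, as they would appear in ceph.conf:

    [mon]
    # warn when an OSD crosses this usage ratio
    mon osd nearfull ratio = 0.85
    # mark the cluster full and block client writes at this ratio
    mon osd full ratio = 0.95

    [osd]
    # final OSD-side guard: reject ops before the filesystem actually fills
    osd failsafe full ratio = 0.97

Hitting filesystem-level ENOSPC despite these guards usually means the ratios were raised, the disk is small enough that the ratio granularity is too coarse, or the filesystem under-reported its usage.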

Actions #4

Updated by Yuri Weinstein over 8 years ago

A similar failure appeared in
Run: http://pulpito.ceph.com/teuthology-2016-01-04_19:00:01-rados-jewel-distro-basic-smithi/
Job: 13535
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2016-01-04_19:00:01-rados-jewel-distro-basic-smithi/13535/teuthology.log

2016-01-05T17:56:00.004 INFO:tasks.ceph.osd.2.smithi017.stderr:os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7f8070fb4700 time 2016-01-05 20:56:00.013636
2016-01-05T17:56:00.005 INFO:tasks.ceph.osd.2.smithi017.stderr:os/FileStore.cc: 2890: FAILED assert(0 == "unexpected error")
2016-01-05T17:56:00.020 INFO:tasks.ceph.osd.2.smithi017.stderr: ceph version 10.0.1-612-g7bcb744 (7bcb744d6b76aea3aebf065edfc231d6b5c42d2f)
2016-01-05T17:56:00.020 INFO:tasks.ceph.osd.2.smithi017.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f807f50e6d5]
2016-01-05T17:56:00.021 INFO:tasks.ceph.osd.2.smithi017.stderr: 2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0xa95) [0x7f807f23c455]
2016-01-05T17:56:00.021 INFO:tasks.ceph.osd.2.smithi017.stderr: 3: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x64) [0x7f807f242184]
2016-01-05T17:56:00.021 INFO:tasks.ceph.osd.2.smithi017.stderr: 4: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x194) [0x7f807f242334]
2016-01-05T17:56:00.021 INFO:tasks.ceph.osd.2.smithi017.stderr: 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa7e) [0x7f807f4ffbce]
2016-01-05T17:56:00.022 INFO:tasks.ceph.osd.2.smithi017.stderr: 6: (ThreadPool::WorkThread::entry()+0x10) [0x7f807f500ab0]
2016-01-05T17:56:00.022 INFO:tasks.ceph.osd.2.smithi017.stderr: 7: (()+0x7df5) [0x7f807d553df5]
2016-01-05T17:56:00.022 INFO:tasks.ceph.osd.2.smithi017.stderr: 8: (clone()+0x6d) [0x7f807bdfc1ad]
2016-01-05T17:56:00.022 INFO:tasks.ceph.osd.2.smithi017.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

And in job '13370':
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2016-01-04_19:00:01-rados-jewel-distro-basic-smithi/13370/teuthology.log

Actions #5

Updated by Yuri Weinstein over 8 years ago

  • Related to Bug #14289: "No space left on device" + "FAILED assert(0 == "unexpected error")" in rados-infernalis-distro-basic-openstack added
Actions #6

Updated by Tristan Cacqueray about 8 years ago

We are also reproducing this bug consistently. It does seem related to an OSD running out of space, but the failing OSD is also reporting drive errors in the kernel logs: "Buffer I/O error on device", then "Add. Sense: Unrecovered read error" and "end_request: critical medium error".

Not sure if it's related, but this Ceph cluster is also running a pool with only one replica.

Anyway, once the OSD fails with this exact stack trace, the whole cluster seems to be stuck until the OSD is manually removed. Can the severity be raised from minor?
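
For anyone triaging a similar hang, a few standard checks that distinguish the two suspects mentioned here (failing drive vs. full disk):

    # kernel-level drive errors on the OSD host
    dmesg | grep -iE 'medium error|I/O error'

    # cluster view: down/full OSDs and stuck PGs
    ceph health detail
    ceph osd tree

    # per-OSD utilization
    ceph osd df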

Actions #7

Updated by Sage Weil about 8 years ago

  • Status changed from New to Need More Info

Yuri - the second error is ENOSPC; ignore it.

Tristan - it sounds like something else is going on. No single OSD failure should make your cluster hang, regardless of how it failed. My guess is that things are already degraded, the pool is down to 1 replica, and min_size is 2...
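
A quick way to check that theory (the pool name "rbd" here is only an example): if min_size exceeds the number of surviving replicas, I/O to the affected PGs blocks by design.

    ceph osd pool get rbd size      # configured replica count
    ceph osd pool get rbd min_size  # replicas required before I/O is served
    # allow I/O with a single surviving replica (trades safety for availability):
    ceph osd pool set rbd min_size 1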

The first failure (/a/teuthology-2016-01-04_19:00:01-rados-jewel-distro-basic-smithi/13535) is concerning, but unfortunately there are no logs.

Actions #8

Updated by Sage Weil about 8 years ago

Tristan - oh, you have 1 replica. Of course it will hang if an OSD fails.

Actions #9

Updated by Sage Weil about 8 years ago

  • Status changed from Need More Info to Can't reproduce

Best guess is btrfs; the disk shouldn't have filled up this quickly.
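
One way to check the btrfs theory on a node that still has the OSD mounted (the data path below is the default layout; adjust as needed):

    # filesystem type and usage of the OSD data directory
    df -Th /var/lib/ceph/osd/ceph-4

    # btrfs keeps its own space accounting, which can diverge from df
    btrfs filesystem df /var/lib/ceph/osd/ceph-4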
