Bug #15636

closed

BlueStore.cc: FAILED assert(0 == "unexpected error") OSDs fail and would not start

Added by Anonymous almost 8 years ago. Updated over 7 years ago.

Status: Can't reproduce
Priority: High
Assignee:
Category: -
Target version: -
% Done: 0%
Source: other
Tags: bluestore
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

OS: CentOS 7.2 (up-to-date)
Kernel: 4.5.2-1.el7.elrepo.x86_64
Release: Jewel 10.2.0
Setup: one node, baremetal, 1 mon, 1 mds, 2 osd on bluestore
Client: Fedora 23, 0.94.6
Log timestamp: 2016-04-28 01:13

Summary: created an RBD volume, but 'rbd ls' showed no images. After running 'ceph osd pool rm' on the pool, both OSDs crashed and would not start again.

Events:
0. Benchmarking an RBD volume in a VM; cluster healthy, performing well
1. [Client] Running 'rbd create bench/image --size 10000' - success
2. [Client] Running 'rbd -p bench ls' - no volumes in the pool
3. [Client] Running 'rbd -p bench create image --size 10000' - "librbd: rbd image image already exists"
4. [Node] Running 'ceph osd pool rm bench bench --yes...' - reports success
5. Immediately both OSDs fail with:

2016-04-28 01:13:10.563344 7f63168fc700  0 bluestore(/var/lib/ceph/osd/ceph-0)  error (39) Directory not empty not handled on operation 21 (op 1, counting from 0)
2016-04-28 01:13:10.563373 7f63168fc700  0 bluestore(/var/lib/ceph/osd/ceph-0)  transaction dump:
{
    "ops": [
        {
            "op_num": 0,
            "op_name": "remove",
            "collection": "8.3_head",
            "oid": "#8:c0000000::::head#" 
        },
        {
            "op_num": 1,
            "op_name": "rmcoll",
            "collection": "8.3_head" 
        }
    ]
}

2016-04-28 01:13:10.655151 7f63168fc700 -1 os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7f63168fc700 time 2016-04-28 01:13:10.563432
os/bluestore/BlueStore.cc: 4425: FAILED assert(0 == "unexpected error")

 ceph version 10.2.0 (3a9fba20ec743699b69bd0181dd6c54dc01c64b9)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x562d48014b95]
 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x77a) [0x562d47c9c7aa]
 3: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x3a5) [0x562d47c9efd5]
 4: (ObjectStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, Context*, Context*, Context*, Context*, std::shared_ptr<TrackedOp>)+0x2a6) [0x562d47c45106]
 5: (OSD::RemoveWQ::_process(std::pair<boost::intrusive_ptr<PG>, std::shared_ptr<DeletingState> >, ThreadPool::TPHandle&)+0x6e4) [0x562d479709d4]
 6: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::shared_ptr<DeletingState> >, std::pair<boost::intrusive_ptr<PG>, std::shared_ptr<DeletingState> > >::_void_process(void*, ThreadPool::TPHandle&)+0x11a) [0x562d479cd83a]
 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa7e) [0x562d48005bde]
 8: (ThreadPool::WorkThread::entry()+0x10) [0x562d48006ac0]
 9: (()+0x7dc5) [0x7f6336eaedc5]
 10: (clone()+0x6d) [0x7f633553a28d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

6. All subsequent attempts to restart either OSD fail with:
2016-04-28 01:13:27.152853 7fb52fea3800 -1 osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fb52fea3800 time 2016-04-28 01:13:27.151148
osd/OSD.h: 885: FAILED assert(ret)

 ceph version 10.2.0 (3a9fba20ec743699b69bd0181dd6c54dc01c64b9)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x5637b7659b95]
 2: (OSDService::get_map(unsigned int)+0x3d) [0x5637b703946d]
 3: (OSD::init()+0x1fe2) [0x5637b6fec6f2]
 4: (main()+0x2c55) [0x5637b6f50c35]
 5: (__libc_start_main()+0xf5) [0x7fb52cdb8b15]
 6: (()+0x353059) [0x5637b6f9b059]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---

7. Neither OSD can start (yet both are reported up/in by 'ceph -s'). Restarting the services or rebooting does not help.

Note: the 'bench' pool had only 8 PGs, but benchmarks ran fine.
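
For reference, the events above condense into the following command sequence. This is a sketch, not a verified reproducer: the pool-creation step is an assumption (the report starts with the pool already in use), while the pool name, image size and 8-PG count are taken from the report.

ceph osd pool create bench 8                                  # pool with 8 PGs, as noted above (assumed step)
rbd create bench/image --size 10000                           # step 1: reports success
rbd -p bench ls                                               # step 2: shows no images
rbd -p bench create image --size 10000                        # step 3: "librbd: rbd image image already exists"
ceph osd pool rm bench bench --yes-i-really-really-mean-it    # step 4: reports success, then both OSDs crash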


Files

ceph-osd.0.log (266 KB) ceph-osd.0.log Anonymous, 04/28/2016 12:26 AM
ceph-mon.ceph4.log (139 KB) ceph-mon.ceph4.log Anonymous, 04/28/2016 12:26 AM
ceph.conf (888 Bytes) ceph.conf Anonymous, 04/28/2016 12:26 AM
#1

Updated by Anonymous almost 8 years ago

Update:
After finding the specific osdmap epoch that both OSDs were calling get_map() with (epoch 70, not the latest), I retrieved that osdmap from the monitor and injected it into both OSDs like this:

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --op set-osdmap --epoch 70 --file /tmp/osdmap70 --force

Both OSDs then started successfully. A deep scrub showed that 6% of the PGs were inconsistent, but all were repaired successfully.
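
(For reference, the recovery sequence was roughly the sketch below. The 'ceph osd getmap' step is an assumption about how the map was exported from the monitor; the epoch, paths and the set-osdmap invocation mirror the command above, and the scrub/repair commands are the standard way to handle the inconsistent PGs seen afterwards.)

# export epoch 70 of the osdmap from the monitors (assumed step)
ceph osd getmap 70 -o /tmp/osdmap70

# inject it into each OSD while the daemon is stopped; repeat with --data-path /var/lib/ceph/osd/ceph-1/
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --op set-osdmap --epoch 70 --file /tmp/osdmap70 --force

# once the OSDs are back up, scrub and repair any inconsistent PGs (substitute the actual PG ids)
ceph pg deep-scrub <pgid>
ceph pg repair <pgid>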

However, 'ceph -s' reported the MDS as damaged:

health HEALTH_ERR
            mds rank 0 is damaged
            mds cluster is degraded

While trying to bring the MDS up again, I may have corrupted the metadata myself, so I had to run:
cephfs-journal-tool journal reset

After that the cluster is healthy again, but, of course, there are no files left in CephFS.
So this bug caused data corruption, but it did not render the OSDs completely dead. I will try to redeploy Ceph and see whether I can trigger this error again.
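
(For anyone who hits the same 'mds rank 0 is damaged' state: the usual path forward is roughly the sketch below, where <id> stands for the MDS daemon name. This is an assumption based on the standard CephFS recovery steps, not a verbatim record of what was run here beyond the journal reset mentioned above, and it discards recent metadata.)

systemctl stop ceph-mds@<id>          # stop the MDS before touching the journal
cephfs-journal-tool journal reset     # discard the damaged journal (loses recent metadata changes)
ceph mds repaired 0                   # clear the 'damaged' flag on rank 0 so it can be started again
systemctl start ceph-mds@<id>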

#2

Updated by Sage Weil almost 8 years ago

  • Assignee set to Sage Weil
  • Priority changed from Normal to Urgent
#3

Updated by Anonymous almost 8 years ago

Please see issue #15722, a reproducible bug I found that causes data corruption when using CephFS.
It might be related, because I did a lot of writing to CephFS before hitting this issue.

#4

Updated by shasha lu almost 8 years ago

Maybe this is similar to http://tracker.ceph.com/issues/15520

#5

Updated by Anonymous almost 8 years ago

I have read that issue before, but it is not similar. In that case they had an issue with the OS, not Ceph.
Here the OS was fine; the OSDs crashed after interacting with the cluster remotely, exclusively through the Ceph protocol.

#6

Updated by Samuel Just over 7 years ago

  • Priority changed from Urgent to High
  • Tags changed from BlueStore to bluestore
#7

Updated by Samuel Just over 7 years ago

  • Status changed from New to Can't reproduce

BlueStore has seen a lot of changes since then; this is unlikely to still be present.
