Bug #15636
BlueStore.cc: FAILED assert(0 == "unexpected error") - OSDs fail and would not start
Status: Closed
Description
OS: CentOS 7.2 (up-to-date)
Kernel: 4.5.2-1.el7.elrepo.x86_64
Release: Jewel 10.2.0
Setup: one node, baremetal, 1 mon, 1 mds, 2 osd on bluestore
Client: Fedora 23, 0.94.6
Log timestamp: 2016-04-28 01:13
Short: created an RBD volume, but 'rbd ls' showed no images. After running 'ceph osd pool rm' on the pool, both OSDs failed and would not start again.
Events:
0. Benchmarking rbd volume in VM, cluster healthy, works fast
1. [Client] Running 'rbd create bench/image --size 10000' - success
2. [Client] Running 'rbd -p bench ls' - no volumes in the pool
3. [Client] Running 'rbd -p bench create image --size 10000' - "librbd: rbd image image already exists"
4. [Node] Running 'ceph osd pool rm bench bench --yes...' - reports success
5. Immediately both OSDs fail with:
2016-04-28 01:13:10.563344 7f63168fc700  0 bluestore(/var/lib/ceph/osd/ceph-0) error (39) Directory not empty not handled on operation 21 (op 1, counting from 0)
2016-04-28 01:13:10.563373 7f63168fc700  0 bluestore(/var/lib/ceph/osd/ceph-0) transaction dump:
{
    "ops": [
        {
            "op_num": 0,
            "op_name": "remove",
            "collection": "8.3_head",
            "oid": "#8:c0000000::::head#"
        },
        {
            "op_num": 1,
            "op_name": "rmcoll",
            "collection": "8.3_head"
        }
    ]
}
2016-04-28 01:13:10.655151 7f63168fc700 -1 os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7f63168fc700 time 2016-04-28 01:13:10.563432
os/bluestore/BlueStore.cc: 4425: FAILED assert(0 == "unexpected error")
 ceph version 10.2.0 (3a9fba20ec743699b69bd0181dd6c54dc01c64b9)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x562d48014b95]
 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x77a) [0x562d47c9c7aa]
 3: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x3a5) [0x562d47c9efd5]
 4: (ObjectStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, Context*, Context*, Context*, Context*, std::shared_ptr<TrackedOp>)+0x2a6) [0x562d47c45106]
 5: (OSD::RemoveWQ::_process(std::pair<boost::intrusive_ptr<PG>, std::shared_ptr<DeletingState> >, ThreadPool::TPHandle&)+0x6e4) [0x562d479709d4]
 6: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::shared_ptr<DeletingState> >, std::pair<boost::intrusive_ptr<PG>, std::shared_ptr<DeletingState> > >::_void_process(void*, ThreadPool::TPHandle&)+0x11a) [0x562d479cd83a]
 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa7e) [0x562d48005bde]
 8: (ThreadPool::WorkThread::entry()+0x10) [0x562d48006ac0]
 9: (()+0x7dc5) [0x7f6336eaedc5]
 10: (clone()+0x6d) [0x7f633553a28d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
6. All subsequent restarts of either OSD fail with:
2016-04-28 01:13:27.152853 7fb52fea3800 -1 osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fb52fea3800 time 2016-04-28 01:13:27.151148
osd/OSD.h: 885: FAILED assert(ret)
 ceph version 10.2.0 (3a9fba20ec743699b69bd0181dd6c54dc01c64b9)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x5637b7659b95]
 2: (OSDService::get_map(unsigned int)+0x3d) [0x5637b703946d]
 3: (OSD::init()+0x1fe2) [0x5637b6fec6f2]
 4: (main()+0x2c55) [0x5637b6f50c35]
 5: (__libc_start_main()+0xf5) [0x7fb52cdb8b15]
 6: (()+0x353059) [0x5637b6f9b059]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
 --- begin dump of recent events ---
7. Neither OSD can start (though both are still reported in/up by 'ceph -s'). Restarting services / rebooting does not help.
Note: Pool 'bench' had only 8 PGs, but benchmarks ran fine.
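The sequence of events above can be condensed into a reproduction sketch. The commands are those given in the report against the Jewel 10.2.0 cluster described; the report abbreviates the pool-removal confirmation flag, which I expand here to the standard `--yes-i-really-really-mean-it` (destructive: this deletes the pool).

```shell
# 1. Create an RBD image in the pool (reported as succeeding).
rbd create bench/image --size 10000

# 2. List images -- in the report this unexpectedly came back empty.
rbd -p bench ls

# 3. Re-creating the image fails, showing it does in fact exist:
#    "librbd: rbd image image already exists"
rbd -p bench create image --size 10000

# 4. Removing the pool reports success, then both BlueStore OSDs assert.
ceph osd pool rm bench bench --yes-i-really-really-mean-it
```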
Files
Updated by Anonymous almost 8 years ago
Update:
After finding out the specific epoch of the osdmap that both OSDs called get_map() with (epoch 70, not the latest), I retrieved the osdmap from the monitor and injected it into both OSDs like this:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --op set-osdmap --epoch 70 --file /tmp/osdmap70 --force
Both OSDs then successfully started. Deep scrub showed 6% of PGs were inconsistent, but all were repaired successfully.
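For anyone repeating this recovery, a minimal sketch of the full sequence, assuming the failing epoch has already been identified from the crash (70 here) and that each OSD daemon is stopped before the injection. `ceph osd getmap` fetches the map from the monitors; `ceph-objectstore-tool --op set-osdmap` is the injection command from the report, run per OSD:

```shell
EPOCH=70   # the epoch the failing get_map() call asked for

# Fetch that osdmap epoch from the monitors.
ceph osd getmap "$EPOCH" -o /tmp/osdmap"$EPOCH"

# Inject it into each crashed OSD (the daemon must not be running).
for id in 0 1; do
    systemctl stop ceph-osd@"$id"
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-"$id"/ \
        --op set-osdmap --epoch "$EPOCH" --file /tmp/osdmap"$EPOCH" --force
    systemctl start ceph-osd@"$id"
done
```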
However, 'ceph -s' reported the MDS as damaged:
health HEALTH_ERR
       mds rank 0 is damaged
       mds cluster is degraded
While trying to bring the MDS up again, I might have corrupted the metadata myself, so I had to run:
cephfs-journal-tool journal reset
After that the cluster is healthy again, but of course there are no files in CephFS.
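A journal reset on its own discards any metadata updates still in the journal, which is likely why the files were lost. The safer sequence from the CephFS disaster-recovery tooling of that era is to back up the journal and salvage what it can before resetting; `ceph mds repaired 0` then clears the damaged flag on rank 0 so the MDS can start:

```shell
# Back up the MDS journal before touching it.
cephfs-journal-tool journal export /root/mds-journal-backup.bin

# Replay recoverable dentries from the journal into the metadata pool.
cephfs-journal-tool event recover_dentries summary

# Only then discard the journal.
cephfs-journal-tool journal reset

# Mark the damaged rank as repaired so the MDS can be started again.
ceph mds repaired 0
```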
So, this bug caused data corruption but did not render the OSDs completely dead. I will try to redeploy Ceph and see if I can trigger this error again.
Updated by Sage Weil almost 8 years ago
- Assignee set to Sage Weil
- Priority changed from Normal to Urgent
Updated by Anonymous almost 8 years ago
Please see issue #15722 - a reproducible bug I have found that causes data corruption when using CephFS.
It might be related, because I had done a lot of writing to CephFS before hitting this issue.
Updated by shasha lu almost 8 years ago
Maybe this is similar to http://tracker.ceph.com/issues/15520
Updated by Anonymous almost 8 years ago
I have read that issue before, but it is not similar. In that case they had a problem with the OS, not with Ceph.
Here the OS was fine. The OSDs crashed after interacting with the cluster remotely, exclusively through the Ceph protocol.
Updated by Samuel Just over 7 years ago
- Priority changed from Urgent to High
- Tags changed from BlueStore to bluestore
Updated by Samuel Just over 7 years ago
- Status changed from New to Can't reproduce
BlueStore has seen a lot of changes since; this is unlikely to still be present.