Bug #15429
closed
ceph-qa-suite upgrade suite: OSD crash due to starting up post-jewel osd after firefly without stopping at hammer
Added by Yuri Weinstein about 8 years ago.
Updated over 7 years ago.
ceph-qa-suite: upgrade/hammer-x
Description
Run: http://pulpito.ovh.sepia.ceph.com:8081/teuthology-2016-04-07_02:10:01-upgrade:hammer-x-jewel-distro-basic-openstack/
Job: 30519
Logs: http://teuthology.ovh.sepia.ceph.com/teuthology/teuthology-2016-04-07_02:10:01-upgrade:hammer-x-jewel-distro-basic-openstack/30519/teuthology.log
2016-04-07T06:36:08.498 INFO:tasks.ceph.osd.4.target065113.stderr:os/FileStore.cc: 2761: FAILED assert(0 == "unexpected error")
2016-04-07T06:36:08.499 INFO:tasks.ceph.osd.4.target065113.stderr: ceph version 0.94.6-254-ge219e85 (e219e85be00088eecde7b1f29d7699493a79bc4d)
2016-04-07T06:36:08.499 INFO:tasks.ceph.osd.4.target065113.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xbb1c6b]
2016-04-07T06:36:08.499 INFO:tasks.ceph.osd.4.target065113.stderr: 2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0xa6c) [0x90c7ec]
2016-04-07T06:36:08.499 INFO:tasks.ceph.osd.4.target065113.stderr: 3: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x64) [0x912e14]
2016-04-07T06:36:08.499 INFO:tasks.ceph.osd.4.target065113.stderr: 4: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x180) [0x912fb0]
2016-04-07T06:36:08.500 INFO:tasks.ceph.osd.4.target065113.stderr: 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa56) [0xba2856]
2016-04-07T06:36:08.500 INFO:tasks.ceph.osd.4.target065113.stderr: 6: (ThreadPool::WorkThread::entry()+0x10) [0xba3900]
2016-04-07T06:36:08.500 INFO:tasks.ceph.osd.4.target065113.stderr: 7: (()+0x8182) [0x7fab83891182]
2016-04-07T06:36:08.500 INFO:tasks.ceph.osd.4.target065113.stderr: 8: (clone()+0x6d) [0x7fab81dfc47d]
2016-04-07T06:36:08.500 INFO:tasks.ceph.osd.4.target065113.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
- Related to Bug #14810: "FileStore.cc: 2855: FAILED assert(0 == "unexpected error")" in powercycle-infernalis-testing-basic-smithi added
Sage, do we need to reduce ops = 2000 for the rados load?
- Status changed from New to Can't reproduce
I don't think there's enough info here. The run didn't manage to collect the OSD logs, so we can't see what the unexpected op was. I don't think the disk filled up (ceph_test_rados doesn't write that much data).
- Status changed from Can't reproduce to New
Reopened
Run: http://pulpito.ceph.com/yuriw-2016-04-09_15:40:40-upgrade:hammer-x-jewel-distro-basic-smithi/
Job: 118321
Logs: http://qa-proxy.ceph.com/teuthology/yuriw-2016-04-09_15:40:40-upgrade:hammer-x-jewel-distro-basic-smithi/118321/teuthology.log
2016-04-09T18:06:02.363 INFO:teuthology.orchestra.run.smithi052:Running: 'sudo adjust-ulimits ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 --journal-path /var/lib/ceph/osd/ceph-4/journal --log-file=/var/log/ceph/objectstore_tool.\\$pid.log --op list-pgs'
2016-04-09T18:06:02.749 INFO:teuthology.orchestra.run.smithi052.stderr:os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7f54dc0f08c0 time 2016-04-10 01:06:02.747386
2016-04-09T18:06:02.749 INFO:teuthology.orchestra.run.smithi052.stderr:os/FileStore.cc: 2761: FAILED assert(0 == "unexpected error")
2016-04-09T18:06:02.749 INFO:teuthology.orchestra.run.smithi052.stderr: ceph version 0.94.6-254-ge219e85 (e219e85be00088eecde7b1f29d7699493a79bc4d)
2016-04-09T18:06:02.750 INFO:teuthology.orchestra.run.smithi052.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xbc7f2b]
2016-04-09T18:06:02.750 INFO:teuthology.orchestra.run.smithi052.stderr: 2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0xa6c) [0x9ac19c]
2016-04-09T18:06:02.750 INFO:teuthology.orchestra.run.smithi052.stderr: 3: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x64) [0x9b27c4]
2016-04-09T18:06:02.750 INFO:teuthology.orchestra.run.smithi052.stderr: 4: (JournalingObjectStore::journal_replay(unsigned long)+0x5cb) [0x9c973b]
2016-04-09T18:06:02.751 INFO:teuthology.orchestra.run.smithi052.stderr: 5: (FileStore::mount()+0x3bb6) [0x99be66]
2016-04-09T18:06:02.751 INFO:teuthology.orchestra.run.smithi052.stderr: 6: (main()+0x1e26) [0x650e86]
2016-04-09T18:06:02.751 INFO:teuthology.orchestra.run.smithi052.stderr: 7: (__libc_start_main()+0xf5) [0x7f54d6c83ec5]
2016-04-09T18:06:02.751 INFO:teuthology.orchestra.run.smithi052.stderr: 8: ceph-objectstore-tool() [0x66e5b7]
2016-04-09T18:06:02.752 INFO:teuthology.orchestra.run.smithi052.stderr: NOTE: a copy of the executable, or `objdump -rdS <exec
Hit this again here:
http://pulpito.ceph.com/smithfarm-2016-07-07_09:31:30-upgrade:hammer-x-wip-16598---basic-vps/
Right before the crash, there is this:
2016-07-07T10:36:52.434 INFO:tasks.ceph.osd.0.vpm137.stdout:starting osd.0 at :/0 osd_data /var/lib/ceph/osd/ceph-0 /var/lib/ceph/osd/ceph-0/journal
2016-07-07T10:36:52.504 INFO:tasks.ceph.osd.0.vpm137.stderr:2016-07-07 17:36:53.102149 7f1263d1f800 -1 filestore(/var/lib/ceph/osd/ceph-0) WARNING: max attr value size (1024) is smaller than osd_max_object_name_len (2048). Your backend filesystem appears to not support attrs large enough to handle the configured max rados name size. You may get unexpected ENAMETOOLONG errors on rados operations or buggy behavior
2016-07-07T10:36:52.504 INFO:tasks.ceph.osd.0.vpm137.stderr:2016-07-07 17:36:53.133047 7f1263d1f800 -1 filestore(/var/lib/ceph/osd/ceph-0) FileStore::mount: stale version stamp detected: 3. Proceeding, do_update is set, performing disk format upgrade.
2016-07-07T10:36:52.559 INFO:tasks.ceph.osd.1.vpm137.stdout:starting osd.1 at :/0 osd_data /var/lib/ceph/osd/ceph-1 /var/lib/ceph/osd/ceph-1/journal
Somehow that most recent one bypassed hammer and upgraded straight to master.
That test actually doesn't stop at hammer at all. I'm confused.
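To check other runs for the same pattern, the two signatures quoted in this ticket (the FileStore assert and the "stale version stamp detected" warning) can be searched for in a teuthology.log. The following is only a minimal sketch: the function name `scan_log` and the event tuple layout are hypothetical, and the prefix pattern assumes the `<timestamp> INFO:<source>:<message>` layout seen in the excerpts above, so lines at other log levels are skipped.

```python
import re

# Teuthology line layout as seen in the excerpts in this ticket (assumption:
# only INFO lines are of interest): "<timestamp> INFO:<source>:<message>"
PREFIX_RE = re.compile(r'^(?P<ts>\S+) INFO:(?P<source>[^:]+):(?P<msg>.*)$')
# The two signatures quoted in this ticket.
ASSERT_RE = re.compile(r'FAILED assert\((?P<expr>.*)\)\s*$')
STALE_RE = re.compile(r'stale version stamp detected: (?P<stamp>\d+)')


def scan_log(lines):
    """Return (timestamp, source, kind, detail) tuples for matching lines.

    Hypothetical helper, not part of teuthology or ceph-qa-suite.
    """
    events = []
    for line in lines:
        m = PREFIX_RE.match(line.rstrip('\n'))
        if not m:
            continue  # not in the assumed "<ts> INFO:<source>:<msg>" layout
        msg = m.group('msg')
        stale = STALE_RE.search(msg)
        if stale:
            events.append((m.group('ts'), m.group('source'),
                           'stale_stamp', int(stale.group('stamp'))))
            continue
        failed = ASSERT_RE.search(msg)
        if failed:
            events.append((m.group('ts'), m.group('source'),
                           'assert', failed.group('expr')))
    return events


if __name__ == '__main__':
    import sys
    # Usage: python scan_log.py teuthology.log
    with open(sys.argv[1], errors='replace') as fh:
        for event in scan_log(fh):
            print(event)
```

A 'stale_stamp' event followed by an 'assert' event from the same OSD would match what was seen in the 2016-07-07 run; the sketch only reports the lines and does not establish that one caused the other.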
- Subject changed from "FileStore.cc: 2761: FAILED assert(0 == "unexpected error")" in upgrade:hammer-x-jewel-distro-basic-openstack to ceph-qa-suite upgrade suite: OSD crash due to starting up post-jewel osd after firefly without stopping at hammer
- Status changed from New to Can't reproduce