Bug #17645

closed

"terminate called after throwing an instance of 'std::out_of_range'" in upgrade:jewel-x-master-distro-basic-vps

Added by Yuri Weinstein over 7 years ago. Updated over 7 years ago.

Status:
Can't reproduce
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
upgrade/jewel-x
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Run: http://pulpito.ceph.com/teuthology-2016-10-20_04:20:02-upgrade:jewel-x-master-distro-basic-vps/
Job: 486914
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2016-10-20_04:20:02-upgrade:jewel-x-master-distro-basic-vps/486914/teuthology.log

2016-10-20T07:05:01.450 INFO:tasks.ceph.osd.3.vpm001.stdout:starting osd.3 at :/0 osd_data /var/lib/ceph/osd/ceph-3 /var/lib/ceph/osd/ceph-3/journal
2016-10-20T07:05:01.539 INFO:tasks.ceph.osd.3.vpm001.stderr:2016-10-20 07:05:01.515985 7f469e2f4800 -1 filestore(/var/lib/ceph/osd/ceph-3) WARNING: max attr value size (1024) is smaller than osd_max_object_name_len (2048).  Your backend filesystem appears to not support attrs large enough to handle the configured max rados name size.  You may get unexpected ENAMETOOLONG errors on rados operations or buggy behavior
2016-10-20T07:05:01.932 INFO:tasks.ceph.osd.3.vpm001.stderr:2016-10-20 07:05:01.912986 7f469e2f4800 -1 journal FileJournal::_open: disabling aio for non-block journal.  Use journal_force_aio to force use of aio anyway
2016-10-20T07:05:01.966 INFO:tasks.ceph.osd.3.vpm001.stderr:2016-10-20 07:05:01.937216 7f469e2f4800 -1 filestore(/var/lib/ceph/osd/ceph-3) could not find #1:42569fed:::200.00000011:head# in index: (2) No such file or directory
2016-10-20T07:05:01.967 INFO:tasks.ceph.osd.3.vpm001.stderr:2016-10-20 07:05:01.937256 7f469e2f4800 -1 filestore(/var/lib/ceph/osd/ceph-3) could not find #1:45d72e17:::1000000000a.00000000:head# in index: (2) No such file or directory
2016-10-20T07:05:01.967 INFO:tasks.ceph.osd.3.vpm001.stderr:2016-10-20 07:05:01.937315 7f469e2f4800 -1 filestore(/var/lib/ceph/osd/ceph-3) could not find #1:48763df2:::10000000951.00000000:head# in index: (2) No such file or directory
2016-10-20T07:05:01.967 INFO:tasks.ceph.osd.3.vpm001.stderr:2016-10-20 07:05:01.937343 7f469e2f4800 -1 filestore(/var/lib/ceph/osd/ceph-3) could not find #1:4c2aab48:::100000006f4.00000000:head# in index: (2) No such file or directory
2016-10-20T07:05:01.967 INFO:tasks.ceph.osd.3.vpm001.stderr:2016-10-20 07:05:01.937370 7f469e2f4800 -1 filestore(/var/lib/ceph/osd/ceph-3) could not find #1:4def72d4:::10000000003.00000000:head# in index: (2) No such file or directory
2016-10-20T07:05:01.967 INFO:tasks.ceph.osd.3.vpm001.stderr:2016-10-20 07:05:01.937397 7f469e2f4800 -1 filestore(/var/lib/ceph/osd/ceph-3) could not find #1:545c15c7:::200.00000017:head# in index: (2) No such file or directory
2016-10-20T07:05:01.967 INFO:tasks.ceph.osd.3.vpm001.stderr:2016-10-20 07:05:01.937423 7f469e2f4800 -1 filestore(/var/lib/ceph/osd/ceph-3) could not find #1:590b566d:::200.00000018:head# in index: (2) No such file or directory
2016-10-20T07:05:01.967 INFO:tasks.ceph.osd.3.vpm001.stderr:2016-10-20 07:05:01.937449 7f469e2f4800 -1 filestore(/var/lib/ceph/osd/ceph-3) could not find #1:5aaf593d:::1000000197a.00000000:head# in index: (2) No such file or directory
2016-10-20T07:05:01.967 INFO:tasks.ceph.osd.3.vpm001.stderr:2016-10-20 07:05:01.937476 7f469e2f4800 -1 filestore(/var/lib/ceph/osd/ceph-3) could not find #1:5db0affa:::1000000192f.00000000:head# in index: (2) No such file or directory
2016-10-20T07:05:01.967 INFO:tasks.ceph.osd.3.vpm001.stderr:2016-10-20 07:05:01.937529 7f469e2f4800 -1 filestore(/var/lib/ceph/osd/ceph-3) could not find #1:40000000::::head# in index: (2) No such file or directory
2016-10-20T07:05:02.013 INFO:tasks.ceph.osd.3.vpm001.stderr:2016-10-20 07:05:01.991504 7f469e2f4800 -1 osd.3 39 PGs are upgrading
2016-10-20T07:05:02.816 INFO:tasks.ceph.osd.3.vpm001.stderr:2016-10-20 07:05:02.796392 7f469e2f4800 -1 osd.3 39 log_to_monitors {default=true}

2016-10-20T07:10:46.544 INFO:tasks.ceph.osd.3.vpm001.stderr:terminate called after throwing an instance of 'std::out_of_range'
2016-10-20T07:10:46.544 INFO:tasks.ceph.osd.3.vpm001.stderr:  what():  map::at
2016-10-20T07:10:46.544 INFO:tasks.ceph.osd.3.vpm001.stderr:*** Caught signal (Aborted) **
2016-10-20T07:10:46.544 INFO:tasks.ceph.osd.3.vpm001.stderr: in thread 7f467f700700 thread_name:tp_osd_tp
2016-10-20T07:10:46.545 INFO:tasks.ceph.osd.3.vpm001.stderr: ceph version v11.0.2-468-gcb82731 (cb827316bfae698193869f1dbefb29add1a37a41)
2016-10-20T07:10:46.545 INFO:tasks.ceph.osd.3.vpm001.stderr: 1: (()+0x8ac50a) [0x7f469ebc450a]
2016-10-20T07:10:46.545 INFO:tasks.ceph.osd.3.vpm001.stderr: 2: (()+0xf100) [0x7f469adb7100]
2016-10-20T07:10:46.545 INFO:tasks.ceph.osd.3.vpm001.stderr: 3: (gsignal()+0x37) [0x7f4699bd55f7]
2016-10-20T07:10:46.545 INFO:tasks.ceph.osd.3.vpm001.stderr: 4: (abort()+0x148) [0x7f4699bd6ce8]
2016-10-20T07:10:46.545 INFO:tasks.ceph.osd.3.vpm001.stderr: 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f469a4da9d5]
2016-10-20T07:10:46.545 INFO:tasks.ceph.osd.3.vpm001.stderr: 6: (()+0x5e946) [0x7f469a4d8946]
2016-10-20T07:10:46.545 INFO:tasks.ceph.osd.3.vpm001.stderr: 7: (()+0x5e973) [0x7f469a4d8973]
2016-10-20T07:10:46.545 INFO:tasks.ceph.osd.3.vpm001.stderr: 8: (()+0x5eb93) [0x7f469a4d8b93]
2016-10-20T07:10:46.545 INFO:tasks.ceph.osd.3.vpm001.stderr: 9: (std::__throw_out_of_range(char const*)+0x77) [0x7f469a52da17]
2016-10-20T07:10:46.546 INFO:tasks.ceph.osd.3.vpm001.stderr: 10: (ReplicatedPG::recover_got(hobject_t, eversion_t)+0x32b) [0x7f469e83ae1b]
2016-10-20T07:10:46.546 INFO:tasks.ceph.osd.3.vpm001.stderr: 11: (ReplicatedPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, ObjectStore::Transaction*)+0x5dd) [0x7f469e83d4dd]
2016-10-20T07:10:46.546 INFO:tasks.ceph.osd.3.vpm001.stderr: 12: (ReplicatedBackend::handle_pull_response(pg_shard_t, PushOp&, PullOp*, std::list<hobject_t, std::allocator<hobject_t> >*, ObjectStore::Transaction*)+0x974) [0x7f469e9871c4]
2016-10-20T07:10:46.546 INFO:tasks.ceph.osd.3.vpm001.stderr: 13: (ReplicatedBackend::_do_pull_response(std::shared_ptr<OpRequest>)+0x1b1) [0x7f469e9877d1]
2016-10-20T07:10:46.546 INFO:tasks.ceph.osd.3.vpm001.stderr: 14: (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0x303) [0x7f469e98c713]
2016-10-20T07:10:46.546 INFO:tasks.ceph.osd.3.vpm001.stderr: 15: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x10d) [0x7f469e84f52d]
2016-10-20T07:10:46.546 INFO:tasks.ceph.osd.3.vpm001.stderr: 16: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x41d) [0x7f469e6fb34d]
2016-10-20T07:10:46.546 INFO:tasks.ceph.osd.3.vpm001.stderr: 17: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest> const&)+0x6d) [0x7f469e6fb59d]
2016-10-20T07:10:46.546 INFO:tasks.ceph.osd.3.vpm001.stderr: 18: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x86c) [0x7f469e71ce6c]
2016-10-20T07:10:46.546 INFO:tasks.ceph.osd.3.vpm001.stderr: 19: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x947) [0x7f469ed5f5e7]
2016-10-20T07:10:46.547 INFO:tasks.ceph.osd.3.vpm001.stderr: 20: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f469ed61740]
2016-10-20T07:10:46.547 INFO:tasks.ceph.osd.3.vpm001.stderr: 21: (()+0x7dc5) [0x7f469adafdc5]
2016-10-20T07:10:46.547 INFO:tasks.ceph.osd.3.vpm001.stderr: 22: (clone()+0x6d) [0x7f4699c96ced]
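For context, a minimal standalone C++ sketch (not the actual Ceph source; the map contents and key name are hypothetical) of how an unchecked std::map::at() lookup produces exactly this failure mode: an uncaught std::out_of_range with what() == "map::at", which in a worker thread such as tp_osd_tp reaches std::terminate ("terminate called after throwing an instance of 'std::out_of_range'").

#include <iostream>
#include <map>
#include <stdexcept>
#include <string>

int main() {
    // Stand-in for a per-PG bookkeeping map keyed by object id,
    // analogous to the lookup done during recovery in recover_got().
    std::map<std::string, int> missing{{"obj_a", 1}};

    try {
        // Looking up an object that is not tracked in the map...
        int version = missing.at("obj_b");
        std::cout << version << "\n";
    } catch (const std::out_of_range& e) {
        // ...throws std::out_of_range; on libstdc++ what() is "map::at".
        // If nothing catches it, the thread calls std::terminate and the
        // process aborts, matching the backtrace above.
        std::cout << "caught: " << e.what() << "\n";
    }
    return 0;
}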
Actions #1

Updated by Sage Weil over 7 years ago

  • Priority changed from Normal to Urgent
Actions #2

Updated by Samuel Just over 7 years ago

  • Assignee set to Loïc Dachary

Loic: can you take a look?

Actions #3

Updated by Loïc Dachary over 7 years ago

  • Status changed from New to In Progress
Actions #4

Updated by Loïc Dachary over 7 years ago

filter="upgrade:jewel-x/parallel/{4-kraken.yaml kraken.yaml 0-cluster/{openstack.yaml start.yaml} 1-jewel-install/jewel.yaml 2-workload/{blogbench.yaml ec-rados-default.yaml rados_api.yaml rados_loadgenbig.yaml test_rbd_api.yaml test_rbd_python.yaml} 3-upgrade-sequence/upgrade-mon-osd-mds.yaml 5-final-workload/{blogbench.yaml rados-snaps-few-objects.yaml rados_loadgenmix.yaml rados_mon_thrash.yaml rbd_cls.yaml rbd_import_export.yaml rgw_swift.yaml} distros/centos_7.2.yaml}" 
teuthology-suite -k distro --verbose --newest 100 --suite upgrade/jewel-x --filter="$filter" --suite-branch master --ceph master --machine-type smithi --priority 101
Actions #5

Updated by Loïc Dachary over 7 years ago

  • Description updated (diff)
Actions #6

Updated by Loïc Dachary over 7 years ago

The burst of "could not find ... in index" errors in the log right before the OSD crashes suggests the file system was damaged. There is no indication of an I/O error in the logs, though, and the OSD logs are no longer available.

Actions #7

Updated by Loïc Dachary over 7 years ago

Running the job 50 times to verify how frequently this happens.

filter="upgrade:jewel-x/parallel/{4-kraken.yaml kraken.yaml 0-cluster/{openstack.yaml start.yaml} 1-jewel-install/jewel.yaml 2-workload/{blogbench.yaml ec-rados-default.yaml rados_api.yaml rados_loadgenbig.yaml test_rbd_api.yaml test_rbd_python.yaml} 3-upgrade-sequence/upgrade-mon-osd-mds.yaml 5-final-workload/{blogbench.yaml rados-snaps-few-objects.yaml rados_loadgenmix.yaml rados_mon_thrash.yaml rbd_cls.yaml rbd_import_export.yaml rgw_swift.yaml} distros/centos_7.2.yaml}" 
teuthology-suite -N 50 -k distro --verbose --newest 100 --suite upgrade/jewel-x --filter="$filter" --suite-branch master --ceph master --machine-type smithi --priority 101

Two jobs failed for environmental reasons (Command failed on smithi028 with status 1: "sudo yum -y install '' ceph", and ceph version 10.2.4-2.g2c7d2b9 was not installed, found 10.2.4-0.el7.), and three jobs died because they could not get machines to work with within 12 hours and timed out (one of them did start, but so late that it could not complete and was interrupted).

Marking as Need More Info because the 45 successful runs of the same job suggest it was indeed a damaged file system. If this bug does not show up again in the next month, it can be closed. If someone has reason to believe this is a real bug, please reopen with suggestions on how to investigate further and I'll work on it.

Actions #8

Updated by Loïc Dachary over 7 years ago

  • Status changed from In Progress to Need More Info
Actions #9

Updated by Samuel Just over 7 years ago

  • Status changed from Need More Info to Can't reproduce