Bug #23661 (closed): RGWAsyncGetSystemObj failed assertion on shutdown/realm reload

Added by Casey Bodley about 6 years ago. Updated over 5 years ago.

Status: Resolved
Priority: High
Assignee: Casey Bodley
Target version: -
% Done: 0%
Source: -
Tags: multisite
Backport: -
Regression: No
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

 -2698> 2018-04-11 16:51:59.592 6f0b8700  1 rgw realm reloader: Frontends paused
...
   -73> 2018-04-11 16:52:00.492 34fea700 20 clearing stack on run() exit: stack=0x7b392d60 nref=2
   -72> 2018-04-11 16:52:00.492 34fea700 20 run(stacks) returned r=-125
...
    -6> 2018-04-11 16:52:00.513 16540700  1 -- 172.21.15.164:0/2127361047 <== osd.0 172.21.15.164:6805/30344 523 ==== osd_op_reply(1339 datalog.sync-status.shard.234e7cf5-a39c-4ebf-8a3b-7002cda4fa64.74 [read 0~40] v0'0 uv2586 ondisk = 0) v9 ==== 210+0+40 (2369085131 0 1514302540) 0x3452ba60 con 0x3291f060
    -5> 2018-04-11 16:52:00.514 24574700 20 rados->read r=0 bl.length=40
    -4> 2018-04-11 16:52:00.514 24574700 10 cache put: name=test-zone2.rgw.log++datalog.sync-status.shard.234e7cf5-a39c-4ebf-8a3b-7002cda4fa64.74 info.flags=0x1
    -3> 2018-04-11 16:52:00.514 24574700 10 adding test-zone2.rgw.log++datalog.sync-status.shard.234e7cf5-a39c-4ebf-8a3b-7002cda4fa64.74 to cache LRU end
...
2018-04-11 16:52:00.485 2ed89700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.0.2-994-g7c21f2e/rpm/el7/BUILD/ceph-13.0.2-994-g7c21f2e/src/common/buffer.cc: In function 'char* ceph::buffer::ptr::c_str()' thread 2ed89700 time 2018-04-11 16:52:00.400455
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.0.2-994-g7c21f2e/rpm/el7/BUILD/ceph-13.0.2-994-g7c21f2e/src/common/buffer.cc: 988: FAILED assert(_raw)

 ceph version 13.0.2-994-g7c21f2e (7c21f2edad61886351873068f6803446618fc2e4) mimic (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xff) [0x612f80f]
 2: (()+0x2809f7) [0x612f9f7]
 3: (()+0xc2b0a) [0x4ef8b0a]
 4: (ceph::buffer::list::iterator_impl<false>::copy_all(ceph::buffer::list&)+0x2b) [0x4f00f5b]
 5: (RGWCache<RGWRados>::get_system_obj(RGWObjectCtx&, RGWRados::SystemObject::Read::GetObjState&, RGWObjVersionTracker*, rgw_raw_obj&, ceph::buffer::list&, long, long, std::map<std::string, ceph::buffer::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::list> > >*, rgw_cache_entry_info*, boost::optional<obj_version>)+0x3b6) [0x4f49d6]
 6: (RGWAsyncGetSystemObj::_send_request()+0x6d) [0x4242ed]
 7: (RGWAsyncRadosProcessor::handle_request(RGWAsyncRadosRequest*)+0x22) [0x425532]
 8: (RGWAsyncRadosProcessor::RGWWQ::_process(RGWAsyncRadosRequest*, ThreadPool::TPHandle&)+0xd) [0x4255fd]
 9: (ThreadPool::worker(ThreadPool::WorkThread*)+0x903) [0x6134c93]
 10: (ThreadPool::WorkThread::entry()+0x10) [0x6136140]
 11: (()+0x7e25) [0x5c9ae25]
 12: (clone()+0x6d) [0x1178434d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

http://qa-proxy.ceph.com/teuthology/cbodley-2018-04-11_16:15:50-rgw-wip-cbodley-testing-distro-basic-smithi/2386406/teuthology.log
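
The backtrace shows the assert firing in ceph::buffer::ptr::c_str() while RGWAsyncGetSystemObj copies a read result, just as the realm reloader cancels coroutine stacks (run(stacks) returned r=-125, i.e. ECANCELED). One plausible reading is an async completion racing against teardown of the coroutine stack that owns the destination bufferlist. Below is a minimal, self-contained C++ sketch of that lifetime hazard and one common guard (a weak reference, so late completions are dropped); the names are hypothetical stand-ins, not RGW's actual types:

 #include <iostream>
 #include <memory>
 #include <string>

 // Stands in for the coroutine stack that owns the result buffer.
 struct CallerStack {
   std::string result;
 };

 // Stands in for the async completion path: it only writes into the
 // caller's buffer if the owning stack is still alive.
 void complete(std::weak_ptr<CallerStack> caller, const std::string& data) {
   if (auto stack = caller.lock()) {
     stack->result = data;
     std::cout << "delivered: " << stack->result << "\n";
   } else {
     std::cout << "caller already destroyed; completion dropped\n";
   }
 }

 int main() {
   auto stack = std::make_shared<CallerStack>();
   std::weak_ptr<CallerStack> handle = stack;

   complete(handle, "shard 74 sync status");  // normal path: delivered

   stack.reset();  // realm reload tears down the owning stack first
   complete(handle, "late shard 74 read");    // late completion: dropped
 }

This only illustrates the suspected failure mode; the actual resolution is tracked in #35543 below.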


Related issues 1 (0 open, 1 closed)

Related to rgw - Bug #35543: multisite: segfault on shutdown/realm reload (Resolved, 09/04/2018)

#1 Updated by Orit Wasserman about 6 years ago

  • Backport set to luminous, jewel

#2 Updated by Matt Benjamin about 6 years ago

  • Status changed from New to Triaged
  • Assignee set to Casey Bodley

#3 Updated by Yehuda Sadeh almost 6 years ago

It looks like this teuthology run preceded the cloud sync merge. Unless the run was testing the cloud sync work, I think we can close this for now (until we see it happen again), because the cloud sync work touched and fixed issues in coroutine stack shutdown that would have looked like this specific one. Casey, can we close this one?

#5 Updated by Casey Bodley over 5 years ago

  • Related to Bug #35543: multisite: segfault on shutdown/realm reload added

#6 Updated by Casey Bodley over 5 years ago

  • Status changed from Triaged to Resolved
  • Backport deleted (luminous, jewel)

Resolved in http://tracker.ceph.com/issues/35543; backports are tracked there.
