Bug #17569

multisite: assertion in RGWRados::wakeup_data_sync_shards

Added by Casey Bodley 6 months ago. Updated 6 months ago.

Status: Pending Backport
Priority: Normal
Assignee:
Target version: -
Start date: 10/13/2016
Due date:
% Done: 0%
Source: other
Tags:
Backport: jewel
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc: No

Description

If RGWDataSyncControlCR returns an error to RGWRemoteDataLog::run_sync(), RGWRemoteDataLog::data_sync_cr is not reset to null even though the CR itself has been freed. When async notifications come in later to signal the data_sync_cr, they dereference the dangling pointer. This use-after-free shows up in the second log as a failed assert in Mutex::Lock(), called from RGWDataSyncControlCR::wakeup().
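
The pattern described above can be sketched in isolation. This is a minimal illustration, not the actual Ceph code or the patch that fixed it: `ControlCR`, `RemoteDataLog`, the `body` callback, and the use of `std::mutex` are all hypothetical stand-ins for RGWDataSyncControlCR, RGWRemoteDataLog, the coroutine run, and Ceph's own locking and reference counting. It shows one way to avoid the dangling pointer, assuming the pointer is cleared under a lock on every exit path, including the error path, before the object is freed.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <set>
#include <string>

// Hypothetical stand-in for the control coroutine (not the real class).
struct ControlCR {
  std::mutex lock;
  int wakeups = 0;
  void wakeup(int shard_id, const std::set<std::string>& keys) {
    std::lock_guard<std::mutex> l(lock);  // locks a member of *this
    (void)shard_id;
    (void)keys;
    ++wakeups;
  }
};

// Hypothetical stand-in for the owner of the data_sync_cr pointer.
class RemoteDataLog {
  std::mutex lock;                    // guards data_sync_cr
  ControlCR* data_sync_cr = nullptr;

public:
  // 'body' simulates running the coroutine and returns its result code.
  int run_sync(const std::function<int()>& body) {
    ControlCR* cr = new ControlCR;
    {
      std::lock_guard<std::mutex> l(lock);
      data_sync_cr = cr;
    }
    int r = body();
    {
      // Clear the pointer whether r is an error or not, so that a later
      // wakeup() can never observe a freed object.
      std::lock_guard<std::mutex> l(lock);
      data_sync_cr = nullptr;
    }
    delete cr;
    return r;
  }

  // Returns true if a live coroutine was signalled, false otherwise.
  bool wakeup(int shard_id, const std::set<std::string>& keys) {
    std::lock_guard<std::mutex> l(lock);
    if (!data_sync_cr) {
      return false;  // sync is not running; nothing to signal
    }
    data_sync_cr->wakeup(shard_id, keys);
    return true;
  }
};
```

In this sketch, a notification that arrives while `body` is running reaches the coroutine, and one that arrives after `run_sync()` has returned an error such as -5 is dropped instead of locking a mutex inside freed memory.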

Log of RGWDataSyncControlCR failure:

2016-10-11 14:55:09.676991 7f6196698700  1 -- 10.17.151.111:0/1018465644 <== osd.0 10.17.151.111:6800/18857 88 ==== osd_op_reply(90 datalog.sync-status.88c0bc4d-e840-448c-ad0b-6f4576e4dded [read 0~0] v0'0 uv104 ondisk = 0) v7 ==== 176+0+0 (3008961858 0 0) 0x55add5466680 con 0x55add5877000
2016-10-11 14:55:09.677454 7f6180e6d700 20 rados->read r=0 bl.length=0
2016-10-11 14:55:09.677542 7f617c664700 20 cr:s=0x55add5c6f750:op=0x55add5ce2000:20RGWSimpleRadosReadCRI18rgw_data_sync_infoE: operate()
2016-10-11 14:55:09.677581 7f617c664700 20 cr:s=0x55add5c6f750:op=0x55add5ce2000:20RGWSimpleRadosReadCRI18rgw_data_sync_infoE: operate() returned r=-5
2016-10-11 14:55:09.677590 7f617c664700 20 cr:s=0x55add5c6f750:op=0x55add550d600:30RGWReadDataSyncStatusCoroutine: operate()
2016-10-11 14:55:09.677591 7f617c664700  4 data sync: failed to read sync status info with (5) Input/output error
2016-10-11 14:55:09.677595 7f617c664700 20 cr:s=0x55add5c6f750:op=0x55add550d600:30RGWReadDataSyncStatusCoroutine: operate() returned r=-5
2016-10-11 14:55:09.677598 7f617c664700 20 cr:s=0x55add5c6f750:op=0x55add5a0e000:13RGWDataSyncCR: operate()
2016-10-11 14:55:09.677599 7f617c664700  0 data sync: ERROR: failed to fetch sync status, retcode=-5
2016-10-11 14:55:09.677600 7f617c664700 20 cr:s=0x55add5c6f750:op=0x55add5a0e000:13RGWDataSyncCR: operate() returned r=-5
2016-10-11 14:55:09.677601 7f617c664700 20 cr:s=0x55add5c6f750:op=0x55add5448700:20RGWDataSyncControlCR: operate()
2016-10-11 14:55:09.677604 7f617c664700  5 data sync: Sync:88c0bc4d:data:Data:all:finish
2016-10-11 14:55:09.677618 7f617c664700  0 meta sync: ERROR: RGWBackoffControlCR called coroutine returned -5
2016-10-11 14:55:09.677619 7f617c664700 20 cr:s=0x55add5c6f750:op=0x55add5448700:20RGWDataSyncControlCR: operate() returned r=-5
2016-10-11 14:55:09.677620 7f617c664700 20 stack->operate() returned ret=-5
2016-10-11 14:55:09.677621 7f617c664700 20 run: stack=0x55add5c6f750 is done
2016-10-11 14:55:09.677627 7f617c664700 20 run(stacks) returned r=-5
2016-10-11 14:55:09.677629 7f617c664700  0 data sync: ERROR: failed to run sync

Log of the assertion:

2016-10-11 14:57:59.836653 7f615060c700 20 execute(): read data: [{"key":3,"val":["byxliz-11:e0ca40ce-cc60-4a95-bb0b-a9a5b29f689d.4116.12"]}]
2016-10-11 14:57:59.836880 7f615060c700 20 execute(): updated shard=3
2016-10-11 14:57:59.836886 7f615060c700 20 execute(): modified key=byxliz-11:e0ca40ce-cc60-4a95-bb0b-a9a5b29f689d.4116.12
2016-10-11 14:57:59.836889 7f615060c700 20 wakeup_data_sync_shards: source_zone=88c0bc4d-e840-448c-ad0b-6f4576e4dded, shard_ids={3=byxliz-11:e0ca40ce-cc60-4a95-bb0b-a9a5b29f689d.4116.12}
2016-10-11 14:57:59.844090 7f615060c700 -1 /home/cbodley/ceph/src/common/Mutex.cc: In function 'void Mutex::Lock(bool)' thread 7f615060c700 time 2016-10-11 14:57:59.836896
/home/cbodley/ceph/src/common/Mutex.cc: 113: FAILED assert(r == 0)

 ceph version Development (no_version)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x95) [0x55adca9f1610]
 2: (Mutex::Lock(bool)+0x14f) [0x55adca9b72a5]
 3: (RGWDataSyncControlCR::wakeup(int, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&)+0x34) [0x55adca90d5e6]
 4: (RGWRemoteDataLog::wakeup(int, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&)+0x70) [0x55adca8eebb2]
 5: (RGWDataSyncStatusManager::wakeup(int, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&)+0x2c) [0x55adca7685ca]
 6: (RGWDataSyncProcessorThread::wakeup_sync_shards(std::map<int, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::less<int>, std::allocator<std::pair<int const, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > > >&)+0x8a) [0x55adca76a2d4]
 7: (RGWRados::wakeup_data_sync_shards(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::map<int, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::less<int>, std::allocator<std::pair<int const, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > > >&)+0x38e) [0x55adca70f5b6]
 8: (RGWOp_DATALog_Notify::execute()+0x78e) [0x55adca7e056a]
 9: (rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, bool)+0x5ef) [0x55adca6f4c5b]
 10: (process_request(RGWRados*, RGWREST*, RGWRequest*, RGWStreamIO*, OpsLogSocket*)+0xbbd) [0x55adca6f58ba]
 11: (()+0xa12998) [0x55adca5b6998]
 12: (()+0xa326f7) [0x55adca5d66f7]
 13: (()+0xa34ddf) [0x55adca5d8ddf]
 14: (()+0xa3538a) [0x55adca5d938a]
 15: (()+0xa35483) [0x55adca5d9483]
 16: (()+0x761a) [0x7f61a096f61a]
 17: (clone()+0x6d) [0x7f619f16159d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


Related issues

Related to Bug #17568: multisite: race between ReadSyncStatus and InitSyncStatus leads to EIO errors Resolved 10/13/2016
Copied to Backport #17786: jewel: multisite: assertion in RGWRados::wakeup_data_sync_shards Resolved

History

#1 Updated by Casey Bodley 6 months ago

  • Related to Bug #17568: multisite: race between ReadSyncStatus and InitSyncStatus leads to EIO errors added

#2 Updated by Casey Bodley 6 months ago

  • Status changed from New to Need Review
  • Backport set to jewel

#3 Updated by Yehuda Sadeh 6 months ago

  • Status changed from Need Review to Pending Backport

#4 Updated by Loic Dachary 6 months ago

  • Copied to Backport #17786: jewel: multisite: assertion in RGWRados::wakeup_data_sync_shards added
