Bug #18583

closed

osd: calc_clone_subsets misuses try_read_lock vs missing

Added by Sage Weil over 7 years ago. Updated over 6 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: kraken, jewel
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

try_read_lock (try_lock_for_read in the backtrace below) calls get_object_context, which asserts that either attrs were passed in or the object is not missing, the logic being that if we need to read the attrs we had better have the object.

The caller is calc_clone_subsets, and it does check missing, but it checks the peer's missing set, not the local one. This leads to

    -3> 2017-01-17 23:42:45.563303 7ff4aadc2700 15 osd.5 pg_epoch: 262 pg[1.28( v 261'272 lc 237'261 (0'0,261'272] local-les=261 n=16 ec=61 les/c/f 261/179/0 257/259/259) [5,4] r=0 lpr=259 pi=104-258/5 rops=1 crt=261'272 mlcod 24'208 active+recovering+degraded m=2 snaptrimq=[7e~c,8b~27,b3~1,b5~5,bd~2,c0~1,c3~4,ca~2]] push_to_replica snapset is c5=[c5,c2,c1,bf,bc,bb,ba,b4,b2,a4,8a]:[b6,c2,c5]+head  
    -2> 2017-01-17 23:42:45.563317 7ff4aadc2700 10 osd.5 pg_epoch: 262 pg[1.28( v 261'272 lc 237'261 (0'0,261'272] local-les=261 n=16 ec=61 les/c/f 261/179/0 257/259/259) [5,4] r=0 lpr=259 pi=104-258/5 rops=1 crt=261'272 mlcod 24'208 active+recovering+degraded m=2 snaptrimq=[7e~c,8b~27,b3~1,b5~5,bd~2,c0~1,c3~4,ca~2]] calc_clone_subsets 1:16a8e63b:::smithi01820475-495 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:b6 clone_overlap {b6=[],c2=[0~618854,1027101~716800,2169695~49159],c5=[0~786924,1380367~548354]}
    -1> 2017-01-17 23:42:45.563338 7ff4aadc2700 10 osd.5 pg_epoch: 262 pg[1.28( v 261'272 lc 237'261 (0'0,261'272] local-les=261 n=16 ec=61 les/c/f 261/179/0 257/259/259) [5,4] r=0 lpr=259 pi=104-258/5 rops=1 crt=261'272 mlcod 24'208 active+recovering+degraded m=2 snaptrimq=[7e~c,8b~27,b3~1,b5~5,bd~2,c0~1,c3~4,ca~2]] calc_clone_subsets 1:16a8e63b:::smithi01820475-495 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:b6 does not have next 1:16a8e63b:::smithi01820475-495 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:c2 overlap []
     0> 2017-01-17 23:42:45.565302 7ff4aadc2700 -1 /build/ceph-11.1.0-6853-g3c1c7d3/src/osd/PrimaryLogPG.cc: In function 'ObjectContextRef PrimaryLogPG::get_object_context(const hobject_t&, bool, std::map<std::__cxx11::basic_string<char>, ceph::buffer::list>*)' thread 7ff4aadc2700 time 2017-01-17 23:42:45.563362
/build/ceph-11.1.0-6853-g3c1c7d3/src/osd/PrimaryLogPG.cc: 9024: FAILED assert(attrs || !pg_log.get_missing().is_missing(soid) || (pg_log.get_log().objects.count(soid) && pg_log.get_log().objects.find(soid)->second->op == pg_log_entry_t::LOST_REVERT))

 ceph version 11.1.0-6853-g3c1c7d3 (3c1c7d3d5dabbf52b6132d2ec2edc646f99d297b)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7ff4ce5d7522]
 2: (PrimaryLogPG::get_object_context(hobject_t const&, bool, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::list> > >*)+0x42c) [0x563bb813757c]
 3: (PrimaryLogPG::try_lock_for_read(hobject_t const&, ObcLockManager&)+0x3b) [0x563bb81b82ab]
 4: (ReplicatedBackend::calc_clone_subsets(SnapSet&, hobject_t const&, pg_missing_set<false> const&, hobject_t const&, interval_set<unsigned long>&, std::map<hobject_t, interval_set<unsigned long>, hobject_t::BitwiseComparator, std::allocator<std::pair<hobject_t const, interval_set<unsigned long> > > >&, ObcLockManager&)+0xac1) [0x563bb8282d81]
 5: (ReplicatedBackend::prep_push_to_replica(std::shared_ptr<ObjectContext>, hobject_t const&, pg_shard_t, PushOp*, bool)+0xb01) [0x563bb8288ab1]
 6: (ReplicatedBackend::start_pushes(hobject_t const&, std::shared_ptr<ObjectContext>, ReplicatedBackend::RPGHandle*)+0x183) [0x563bb8289bc3]
 7: (C_ReplicatedBackend_OnPullComplete::finish(ThreadPool::TPHandle&)+0xa6) [0x563bb829c4e6]
 8: (PrimaryLogPG::BlessedGenContext<ThreadPool::TPHandle&>::finish(ThreadPool::TPHandle&)+0x78) [0x563bb818def8]
 9: (ThreadPool::WorkQueueVal<GenContext<ThreadPool::TPHandle&>*, GenContext<ThreadPool::TPHandle&>*>::_void_process(void*, ThreadPool::TPHandle&)+0x1f8) [0x563bb803cda8]
 10: (ThreadPool::worker(ThreadPool::WorkThread*)+0xf2d) [0x7ff4ce5e002d]
 11: (ThreadPool::WorkThread::entry()+0x10) [0x7ff4ce5e11a0]
 12: (()+0x76ba) [0x7ff4cde0e6ba]
 13: (clone()+0x6d) [0x7ff4cce9782d]

/a/sage-2017-01-17_21:25:27-rados-wip-sage-testing---basic-smithi/726098
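
To make the mismatch in the description concrete, here is a minimal standalone model of it. This is not Ceph code: `MissingSet`, `obc_precondition_holds`, `caller_proceeds_to_lock`, and `would_trip_assert` are hypothetical names, the missing sets are reduced to sets of object names, and the LOST_REVERT branch of the assert is left out. It assumes the lock path passes no attrs, which is what the failed assert suggests.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical stand-in for a missing set: just the names of missing objects.
using MissingSet = std::set<std::string>;

// Models the asserted precondition (LOST_REVERT case omitted):
// attrs were supplied, or the object is not missing locally.
bool obc_precondition_holds(const MissingSet& local_missing,
                            const std::string& soid,
                            bool attrs_supplied) {
  return attrs_supplied || local_missing.count(soid) == 0;
}

// Models the caller's guard as described in the report: it consults only
// the missing set it was handed, which is the peer's.
bool caller_proceeds_to_lock(const MissingSet& peer_missing,
                             const std::string& soid) {
  return peer_missing.count(soid) == 0;
}

// True when the caller goes ahead but the precondition fails.
bool would_trip_assert(const MissingSet& local_missing,
                       const MissingSet& peer_missing,
                       const std::string& soid) {
  return caller_proceeds_to_lock(peer_missing, soid) &&
         !obc_precondition_holds(local_missing, soid, /*attrs_supplied=*/false);
}

// Small self-check covering the interesting combinations.
void check_examples() {
  const std::string clone = "clone_a";  // hypothetical object name
  // Missing locally, present on the peer: the guard passes, the assert fires.
  assert(would_trip_assert(MissingSet{clone}, MissingSet{}, clone));
  // Missing on the peer: the caller skips the clone, so no lock is attempted.
  assert(!would_trip_assert(MissingSet{clone}, MissingSet{clone}, clone));
  // Present on both sides: the lock is taken and the precondition holds.
  assert(!would_trip_assert(MissingSet{}, MissingSet{}, clone));
}
```

Only the first combination is a problem: the object is missing on the primary but not on the peer, so a guard that looks at the peer's set alone lets the call through.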

Related issues 3 (0 open, 3 closed)

Related to Ceph - Bug #17831: osd: ENOENT on clone (Resolved, Samuel Just, 11/08/2016)
Copied to Ceph - Backport #18723: kraken: osd: calc_clone_subsets misuses try_read_lock vs missing (Resolved, Nathan Cutler)
Copied to Ceph - Backport #18724: jewel: osd: calc_clone_subsets misuses try_read_lock vs missing (Rejected, Alexey Sheplyakov)

Actions #1

Updated by Sage Weil over 7 years ago

  • Backport changed from kraken,jewel to kraken
Actions #2

Updated by Samuel Just over 7 years ago

  • Assignee changed from Sage Weil to Samuel Just
  • Backport changed from kraken to kraken,jewel

FWIW, this isn't present on kraken. I introduced it in 68defc2b0561414711d4dd0a76bc5d0f46f8a3f8. The simple fix is to not take the lock if the object is missing on the primary.
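
A rough sketch of what "don't take the lock if missing on the primary" could look like, in a standalone model rather than the actual patch. `MissingSet` and `lockable_clones` are hypothetical names, and the missing sets are reduced to sets of object names. The idea is just that a candidate clone is filtered against both the peer's and the primary's missing sets before any read lock is attempted.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a missing set: just the names of missing objects.
using MissingSet = std::set<std::string>;

// Returns the candidate clones for which attempting the read lock is safe
// in this model: present on the peer and present on the primary.
std::vector<std::string> lockable_clones(
    const std::vector<std::string>& candidates,
    const MissingSet& peer_missing,
    const MissingSet& local_missing) {
  std::vector<std::string> out;
  for (const auto& c : candidates) {
    if (peer_missing.count(c)) continue;   // peer does not have this clone
    if (local_missing.count(c)) continue;  // missing on primary: skip the lock
    out.push_back(c);
  }
  return out;
}

// Small self-check: only the clone present on both sides survives.
void check_examples() {
  const std::vector<std::string> result =
      lockable_clones({"a", "b", "c"}, MissingSet{"a"}, MissingSet{"b"});
  assert(result.size() == 1);
  assert(result[0] == "c");
}
```

In this model a clone that is missing on the primary is simply not considered as an overlap source, so the object-context lookup behind the lock is never reached for it.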

Actions #3

Updated by Samuel Just over 7 years ago

  • Backport changed from kraken,jewel to kraken(along with 17831)
Actions #4

Updated by Samuel Just over 7 years ago

  • Related to Bug #17831: osd: ENOENT on clone added
Actions #5

Updated by Samuel Just about 7 years ago

  • Status changed from 12 to Fix Under Review
Actions #6

Updated by Josh Durgin about 7 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #7

Updated by Josh Durgin about 7 years ago

  • Backport changed from kraken(along with 17831) to kraken(along with 17831), jewel(along with 17831)
Actions #8

Updated by Nathan Cutler about 7 years ago

  • Copied to Backport #18723: kraken: osd: calc_clone_subsets misuses try_read_lock vs missing added
Actions #9

Updated by Nathan Cutler about 7 years ago

  • Backport changed from kraken(along with 17831), jewel(along with 17831) to kraken, jewel
Actions #10

Updated by Nathan Cutler about 7 years ago

Josh says both the kraken and jewel backports should be done together with #17831

Actions #11

Updated by Nathan Cutler about 7 years ago

  • Copied to Backport #18724: jewel: osd: calc_clone_subsets misuses try_read_lock vs missing added
Actions #12

Updated by Samuel Just about 7 years ago

This needs to be backported with http://tracker.ceph.com/issues/18809 (not in master yet, wait on that)

Actions #13

Updated by Samuel Just about 7 years ago

  • Priority changed from Immediate to High
Actions #14

Updated by Nathan Cutler over 6 years ago

  • Status changed from Pending Backport to Resolved