Project

General

Profile

Bug #42177

osd/PrimaryLogPG.cc: 13068: FAILED ceph_assert(obc)

Added by Sage Weil almost 2 years ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
nautilus, mimic, luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

First the object is deleted,

2019-10-03T15:06:49.550+0000 7fc32b7fd700 20 osd.3 pg_epoch: 1890 pg[10.d( v 1890'22609 (1871'19600,1890'22609] local-lis/les=1886/1887 n=1065 ec=1746/1746 lis/c=1886/1874 les/c/f=1887/1875/0 sis=1886) [0,3]/[3,11] backfill=[0] r=0 lpr=1886 pi=[1874,1886)/1 luod=1890'22605 crt=1890'22609 lcod 1890'22604 mlcod 1890'22604 active+remapped+backfilling rops=1 mbc={255={}}] do_op: op osd_op(client.88309.0:359596 10.d 10:ba3eee7a:::benchmark_data_smithi205_13542_object171119:head [delete] snapc 0=[] ondisk+write+known_i
f_redirected e1890) v8
...
2019-10-03T15:06:49.550+0000 7fc32b7fd700 15 bluestore(/var/lib/ceph/osd/ceph-3) _remove 10.d_head #10:ba3eee7a:::benchmark_data_smithi205_13542_object171119:head# onode 0x5596869d9180 txc 0x559673554d00

and the transcontext io_done, but blocked on an earlier op in the sequencer,
2019-10-03T15:06:49.550+0000 7fc32b7fd700 20 bluestore(/var/lib/ceph/osd/ceph-3) _txc_finish_io 0x559673554d00
2019-10-03T15:06:49.550+0000 7fc32b7fd700 10 bluestore(/var/lib/ceph/osd/ceph-3) _txc_state_proc txc 0x559673554d00 io_done

then backfill does a listing,
2019-10-03T15:06:49.562+0000 7fc32b7fd700 20 bluestore(/var/lib/ceph/osd/ceph-3) _collection_list oid #10:ba3eee7a:::benchmark_data_smithi205_13542_object171119:head# end GHMAX

the scan_range() code finds the obc in the cache, with an empty version (exists=false, but we don't notice that)
2019-10-03T15:06:49.574+0000 7fc32b7fd700 20 osd.3 pg_epoch: 1890 pg[10.d( v 1890'22610 (1871'19600,1890'22610] local-lis/les=1886/1887 n=1064 ec=1746/1746 lis/c=1886/1874 les/c/f=1887/1875/0 sis=1886) [0,3]/[3,11] backfill=[0] r=0 lpr=1886 pi=[1874,1886)/1 luod=1890'22608 crt=1890'22610 lcod 1890'22607 mlcod 1890'22607 active+remapped+backfilling mbc={255={}}]   10:ba3ebe71:::benchmark_data_smithi205_13542_object173326:head 1816'10905
2019-10-03T15:06:49.574+0000 7fc32b7fd700 20 osd.3 pg_epoch: 1890 pg[10.d( v 1890'22610 (1871'19600,1890'22610] local-lis/les=1886/1887 n=1064 ec=1746/1746 lis/c=1886/1874 les/c/f=1887/1875/0 sis=1886) [0,3]/[3,11] backfill=[0] r=0 lpr=1886 pi=[1874,1886)/1 luod=1890'22608 crt=1890'22610 lcod 1890'22607 mlcod 1890'22607 active+remapped+backfilling mbc={255={}}]   10:ba3eee7a:::benchmark_data_smithi205_13542_object171119:head 0'0
2019-10-03T15:06:49.574+0000 7fc32b7fd700 15 bluestore(/var/lib/ceph/osd/ceph-3) getattr 10.d_head #10:ba407bb4:::benchmark_data_smithi205_13542_object186954:head# _
2019-10-03T15:06:49.574+0000 7fc32b7fd700 10 bluestore(/var/lib/ceph/osd/ceph-3) getattr 10.d_head #10:ba407bb4:::benchmark_data_smithi205_13542_object186954:head# _ = 0
2019-10-03T15:06:49.574+0000 7fc32b7fd700 20 osd.3 pg_epoch: 1890 pg[10.d( v 1890'22610 (1871'19600,1890'22610] local-lis/les=1886/1887 n=1064 ec=1746/1746 lis/c=1886/1874 les/c/f=1887/1875/0 sis=1886) [0,3]/[3,11] backfill=[0] r=0 lpr=1886 pi=[1874,1886)/1 luod=1890'22608 crt=1890'22610 lcod 1890'22607 mlcod 1890'22607 active+remapped+backfilling mbc={255={}}]   10:ba407bb4:::benchmark_data_smithi205_13542_object186954:head 1827'11735

then later, when backfill finally gets to it,
 -2292> 2019-10-03T15:06:52.638+0000 7fc32b7fd700 20 osd.3 pg_epoch: 1891 pg[10.d( v 1891'22961 (1872'19900,1891'22961] local-lis/les=1886/1887 n=713 ec=1746/1746 lis/c=1886/1874 les/c/f=1887/1875/0 sis=1886) [0,3]/[3,11] backfill=[0] r=0 lpr=1886 pi=[1874,1886)/1 luod=1891'22960 crt=1891'22961 lcod 1891'22959 mlco
d 1891'22959 active+remapped+backfilling mbc={255={}}]  peer shard 0 backfill BackfillInfo(MAX-MAX 0 objects)
 -2291> 2019-10-03T15:06:52.638+0000 7fc32b7fd700 10 osd.3 pg_epoch: 1891 pg[10.d( v 1891'22961 (1872'19900,1891'22961] local-lis/les=1886/1887 n=713 ec=1746/1746 lis/c=1886/1874 les/c/f=1887/1875/0 sis=1886) [0,3]/[3,11] backfill=[0] r=0 lpr=1886 pi=[1874,1886)/1 luod=1891'22960 crt=1891'22961 lcod 1891'22959 mlcod 1891'22959 active+remapped+backfilling mbc={255={}}] get_object_context: obc NOT found in cache: 10:ba3eee7a:::benchmark_data_smithi205_13542_object171119:head
 -2290> 2019-10-03T15:06:52.638+0000 7fc32b7fd700 15 bluestore(/var/lib/ceph/osd/ceph-3) getattr 10.d_head #10:ba3eee7a:::benchmark_data_smithi205_13542_object171119:head# _
 -2289> 2019-10-03T15:06:52.638+0000 7fc32b7fd700 10 bluestore(/var/lib/ceph/osd/ceph-3) getattr 10.d_head #10:ba3eee7a:::benchmark_data_smithi205_13542_object171119:head# _ = -2
 -2288> 2019-10-03T15:06:52.638+0000 7fc32b7fd700 10 osd.3 pg_epoch: 1891 pg[10.d( v 1891'22961 (1872'19900,1891'22961] local-lis/les=1886/1887 n=713 ec=1746/1746 lis/c=1886/1874 les/c/f=1887/1875/0 sis=1886) [0,3]/[3,11] backfill=[0] r=0 lpr=1886 pi=[1874,1886)/1 luod=1891'22960 crt=1891'22961 lcod 1891'22959 mlcod 1891'22959 active+remapped+backfilling mbc={255={}}] get_object_context: no obc for soid 10:ba3eee7a:::benchmark_data_smithi205_13542_object171119:head and !can_create
...
  -772> 2019-10-03T15:06:52.662+0000 7fc32b7fd700 -1 /build/ceph-15.0.0-5726-g55667ef/src/osd/PrimaryLogPG.cc: In function 'uint64_t PrimaryLogPG::recover_backfill(uint64_t, ThreadPool::TPHandle&, bool*)' thread 7fc32b7fd700 time 2019-10-03T15:06:52.644799+0000
/build/ceph-15.0.0-5726-g55667ef/src/osd/PrimaryLogPG.cc: 13068: FAILED ceph_assert(obc)

 ceph version 15.0.0-5726-g55667ef (55667ef848f8044c1c432effde953e61dc59eef4) octopus (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x559664694f3a]
 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x559664695115]
 3: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x51e7) [0x55966488cb97]
 4: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xed2) [0x55966488dbf2]
 5: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x249) [0x559664706269]
 6: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0x559664972619]
 7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90c) [0x559664724b7c]
 8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0x559664d6354c]
 9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x559664d666c0]
 10: (()+0x76db) [0x7fc34cf406db]
 11: (clone()+0x3f) [0x7fc34bce088f]

i just saw this twice in a row,

/a/sage-2019-10-03_14:03:55-rados:upgrade-wip-sage2-testing-2019-10-02-0937-distro-basic-smithi/4353979
/a/sage-2019-10-03_00:05:14-rados:upgrade-wip-sage2-testing-2019-10-02-0937-distro-basic-smithi/4353003

which is surprising because i haven't seen it before. :/


Related issues

Duplicated by RADOS - Bug #42176: FAILED ceph_assert(obc) in PrimaryLogPG::recover_backfill() Duplicate 10/03/2019
Copied to RADOS - Backport #42197: nautilus: osd/PrimaryLogPG.cc: 13068: FAILED ceph_assert(obc) Resolved
Copied to RADOS - Backport #42198: mimic: osd/PrimaryLogPG.cc: 13068: FAILED ceph_assert(obc) Resolved
Copied to RADOS - Backport #42199: luminous: osd/PrimaryLogPG.cc: 13068: FAILED ceph_assert(obc) Resolved

History

#1 Updated by Sage Weil almost 2 years ago

  • Status changed from 12 to Fix Under Review
  • Pull request ID set to 30715

#2 Updated by Neha Ojha almost 2 years ago

  • Duplicated by Bug #42176: FAILED ceph_assert(obc) in PrimaryLogPG::recover_backfill() added

#3 Updated by Neha Ojha almost 2 years ago

  • Project changed from Ceph to RADOS

#4 Updated by Sage Weil almost 2 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to nautilus, mimic, luminous

#5 Updated by Nathan Cutler almost 2 years ago

  • Copied to Backport #42197: nautilus: osd/PrimaryLogPG.cc: 13068: FAILED ceph_assert(obc) added

#6 Updated by Nathan Cutler almost 2 years ago

  • Copied to Backport #42198: mimic: osd/PrimaryLogPG.cc: 13068: FAILED ceph_assert(obc) added

#7 Updated by Nathan Cutler almost 2 years ago

  • Copied to Backport #42199: luminous: osd/PrimaryLogPG.cc: 13068: FAILED ceph_assert(obc) added

#8 Updated by Nathan Cutler over 1 year ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

Also available in: Atom PDF