Bug #40576

src/osd/PrimaryLogPG.cc: 10513: FAILED assert(head_obc)

Added by Kenneth Van Alstyne almost 5 years ago. Updated over 4 years ago.

Status:
Closed
Priority:
Normal
Assignee:
David Zafman
Category:
-
Target version:
-
% Done:
0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

It appears that a snapshot containing an inconsistent object was removed at some point, and ceph-osd now crashes on startup while trying to recover an object that no longer exists. I'm not sure how to get out of this state: I've tried placing objects back where ceph-osd expects them and deleting the object with ceph-objectstore-tool, among other things. A sketch of that kind of inspection follows, then the logs:
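A minimal sketch of inspecting what the store holds for the affected PG, assuming a systemd-managed OSD and the data path shown in the logs below; the journal path is a placeholder:

   # Stop the OSD before using ceph-objectstore-tool on its store
   systemctl stop ceph-osd@9
   # List the objects the store holds for the affected PG
   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-9 \
       --journal-path <JOURNAL_DEVICE> --pgid 0.86 --op list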

   -65> 2019-06-28 02:05:35.605735 7f26841ff700 15 filestore(/var/lib/ceph/osd/ceph-9) getattr(4422): 0.86_head/#0:619c1df5:::rbd_data.c01ee86b8b4567.00000000000078d6:ad30# '_'
   -64> 2019-06-28 02:05:35.605772 7f26841ff700 10 filestore(/var/lib/ceph/osd/ceph-9) getattr(4456): 0.86_head/#0:619c1df5:::rbd_data.c01ee86b8b4567.00000000000078d6:ad30# '_' = 272
   -63> 2019-06-28 02:05:35.605780 7f26841ff700 20 osd.9 pg_epoch: 205161 pg[0.86( v 205161'18470166 (204988'18460094,205161'18470166] lb 0:618b7f49:::rbd_data.bf0de66b8b4567.000000000000c40b:head (bitwise) local-lis/les=204992/204993 n=4563 ec=59/59 lis/c 204992/201096 les/c/f 204993/204442/0 205158/205159/205159) [9,12,16]/[12,16] r=-1 lpr=205159 pi=[201096,205159)/1 luod=0'0 crt=205161'18470166 lcod 205161'18470165 active+remapped]   0:619c1df5:::rbd_data.c01ee86b8b4567.00000000000078d6:ad30 177772'14642335
   -62> 2019-06-28 02:05:35.605793 7f26841ff700 15 filestore(/var/lib/ceph/osd/ceph-9) getattr(4422): 0.86_head/#0:619c1df5:::rbd_data.c01ee86b8b4567.00000000000078d6:b58f# '_'
   -61> 2019-06-28 02:05:35.605831 7f26841ff700 10 filestore(/var/lib/ceph/osd/ceph-9) getattr(4456): 0.86_head/#0:619c1df5:::rbd_data.c01ee86b8b4567.00000000000078d6:b58f# '_' = 272
   -60> 2019-06-28 02:05:35.605842 7f26841ff700 20 osd.9 pg_epoch: 205161 pg[0.86( v 205161'18470166 (204988'18460094,205161'18470166] lb 0:618b7f49:::rbd_data.bf0de66b8b4567.000000000000c40b:head (bitwise) local-lis/les=204992/204993 n=4563 ec=59/59 lis/c 204992/201096 les/c/f 204993/204442/0 205158/205159/205159) [9,12,16]/[12,16] r=-1 lpr=205159 pi=[201096,205159)/1 luod=0'0 crt=205161'18470166 lcod 205161'18470165 active+remapped]   0:619c1df5:::rbd_data.c01ee86b8b4567.00000000000078d6:b58f 194432'17055067
   -59> 2019-06-28 02:05:35.605856 7f26841ff700 15 filestore(/var/lib/ceph/osd/ceph-9) getattr(4422): 0.86_head/#0:619c1df5:::rbd_data.c01ee86b8b4567.00000000000078d6:head# '_'
   -58> 2019-06-28 02:05:35.605896 7f26841ff700 10 filestore(/var/lib/ceph/osd/ceph-9) getattr(4456): 0.86_head/#0:619c1df5:::rbd_data.c01ee86b8b4567.00000000000078d6:head# '_' = 295
   -57> 2019-06-28 02:05:35.605904 7f26841ff700 20 osd.9 pg_epoch: 205161 pg[0.86( v 205161'18470166 (204988'18460094,205161'18470166] lb 0:618b7f49:::rbd_data.bf0de66b8b4567.000000000000c40b:head (bitwise) local-lis/les=204992/204993 n=4563 ec=59/59 lis/c 204992/201096 les/c/f 204993/204442/0 205158/205159/205159) [9,12,16]/[12,16] r=-1 lpr=205159 pi=[201096,205159)/1 luod=0'0 crt=205161'18470166 lcod 205161'18470165 active+remapped]   0:619c1df5:::rbd_data.c01ee86b8b4567.00000000000078d6:head 202012'18086755
   -56> 2019-06-28 02:05:35.605918 7f26841ff700 15 filestore(/var/lib/ceph/osd/ceph-9) getattr(4422): 0.86_head/#0:619c200f:::rbd_data.51b3966b8b4567.000000000000d0b4:head# '_'
   -55> 2019-06-28 02:05:35.605956 7f26841ff700 10 filestore(/var/lib/ceph/osd/ceph-9) getattr(4456): 0.86_head/#0:619c200f:::rbd_data.51b3966b8b4567.000000000000d0b4:head# '_' = 295
   -54> 2019-06-28 02:05:35.605964 7f26841ff700 20 osd.9 pg_epoch: 205161 pg[0.86( v 205161'18470166 (204988'18460094,205161'18470166] lb 0:618b7f49:::rbd_data.bf0de66b8b4567.000000000000c40b:head (bitwise) local-lis/les=204992/204993 n=4563 ec=59/59 lis/c 204992/201096 les/c/f 204993/204442/0 205158/205159/205159) [9,12,16]/[12,16] r=-1 lpr=205159 pi=[201096,205159)/1 luod=0'0 crt=205161'18470166 lcod 205161'18470165 active+remapped]   0:619c200f:::rbd_data.51b3966b8b4567.000000000000d0b4:head 198559'17594081
   -53> 2019-06-28 02:05:35.605977 7f26841ff700 15 filestore(/var/lib/ceph/osd/ceph-9) getattr(4422): 0.86_head/#0:619c39fb:::rbd_data.6e98246b8b4567.0000000000001101:head# '_'
   -52> 2019-06-28 02:05:35.606014 7f26841ff700 10 filestore(/var/lib/ceph/osd/ceph-9) getattr(4456): 0.86_head/#0:619c39fb:::rbd_data.6e98246b8b4567.0000000000001101:head# '_' = 295
   -51> 2019-06-28 02:05:35.606022 7f26841ff700 20 osd.9 pg_epoch: 205161 pg[0.86( v 205161'18470166 (204988'18460094,205161'18470166] lb 0:618b7f49:::rbd_data.bf0de66b8b4567.000000000000c40b:head (bitwise) local-lis/les=204992/204993 n=4563 ec=59/59 lis/c 204992/201096 les/c/f 204993/204442/0 205158/205159/205159) [9,12,16]/[12,16] r=-1 lpr=205159 pi=[201096,205159)/1 luod=0'0 crt=205161'18470166 lcod 205161'18470165 active+remapped]   0:619c39fb:::rbd_data.6e98246b8b4567.0000000000001101:head 64556'6904478
   -50> 2019-06-28 02:05:35.606301 7f26841ff700  1 -- 10.0.0.2:6817/25972 --> 10.0.0.1:6830/2017138 -- pg_scan(digest 0.86 0:618b7f49:::rbd_data.bf0de66b8b4567.000000000000c40b:head-0:619c4ebb:::rbd_data.51a1f46b8b4567.000000000000834d:head e 205161/205161) v2 -- 0x7f267861f100 con 0
   -49> 2019-06-28 02:05:35.606447 7f26841ff700 10 osd.9 205161 dequeue_op 0x7f267f020840 finish
   -48> 2019-06-28 02:05:35.606464 7f26841ff700 20 osd.9 op_wq(4) _process empty q, waiting
   -13> 2019-06-28 02:05:35.633334 7f26841ff700 20 osd.9 op_wq(4) _process 0.86 item PGQueueable(0x7f267f022600 prio 1 cost 1000 e205161) queued
   -12> 2019-06-28 02:05:35.633348 7f26841ff700 20 osd.9 op_wq(4) _process 0.86 item PGQueueable(0x7f267f022600 prio 1 cost 1000 e205161) pg 0x7f26b40ba800
   -11> 2019-06-28 02:05:35.633352 7f26841ff700 10 osd.9 205161 dequeue_op 0x7f267f022600 prio 1 cost 1000 latency 0.000165 MOSDPGPush(0.86 205161/205159 [PushOp(0:618b9825:::rbd_data.554d416b8b4567.000000000000151f:b691, version: 198559'17593798, data_included: [], data_size: 0, omap_header_size: 0, omap_entries_size: 0, attrset_size: 1, recovery_info: ObjectRecoveryInfo(0:618b9825:::rbd_data.554d416b8b4567.000000000000151f:b691@198559'17593798, size: 4194304, copy_subset: [], clone_subset: {}, snapset: 0=[]:[]), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:0, data_complete:true, omap_recovered_to:, omap_complete:true, error:false), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false, error:false))]) v3 pg pg[0.86( v 205161'18470166 (204988'18460094,205161'18470166] lb 0:618b7f49:::rbd_data.bf0de66b8b4567.000000000000c40b:head (bitwise) local-lis/les=204992/204993 n=4563 ec=59/59 lis/c 204992/201096 les/c/f 204993/204442/0 205158/205159/205159) [9,12,16]/[12,16] r=-1 lpr=205159 pi=[201096,205159)/1 luod=0'0 crt=205161'18470166 lcod 205161'18470165 active+remapped]
   -10> 2019-06-28 02:05:35.633384 7f26841ff700 20 osd.9 205161 share_map osd.12 10.0.0.1:6830/2017138 205161
    -9> 2019-06-28 02:05:35.633390 7f26841ff700 20 osd.9 205161 should_share_map osd.12 10.0.0.1:6830/2017138 205161
    -8> 2019-06-28 02:05:35.633399 7f26841ff700 10 osd.9 pg_epoch: 205161 pg[0.86( v 205161'18470166 (204988'18460094,205161'18470166] lb 0:618b7f49:::rbd_data.bf0de66b8b4567.000000000000c40b:head (bitwise) local-lis/les=204992/204993 n=4563 ec=59/59 lis/c 204992/201096 les/c/f 204993/204442/0 205158/205159/205159) [9,12,16]/[12,16] r=-1 lpr=205159 pi=[201096,205159)/1 luod=0'0 crt=205161'18470166 lcod 205161'18470165 active+remapped] _handle_message: 0x7f267f022600
    -7> 2019-06-28 02:05:35.633422 7f26841ff700 10 osd.9 pg_epoch: 205161 pg[0.86( v 205161'18470166 (204988'18460094,205161'18470166] lb 0:618b7f49:::rbd_data.bf0de66b8b4567.000000000000c40b:head (bitwise) local-lis/les=204992/204993 n=4563 ec=59/59 lis/c 204992/201096 les/c/f 204993/204442/0 205158/205159/205159) [9,12,16]/[12,16] r=-1 lpr=205159 pi=[201096,205159)/1 luod=0'0 crt=205161'18470166 lcod 205161'18470165 active+remapped] handle_push ObjectRecoveryInfo(0:618b9825:::rbd_data.554d416b8b4567.000000000000151f:b691@198559'17593798, size: 4194304, copy_subset: [], clone_subset: {}, snapset: 0=[]:[])ObjectRecoveryProgress(!first, data_recovered_to:0, data_complete:true, omap_recovered_to:, omap_complete:true, error:false)
    -6> 2019-06-28 02:05:35.633455 7f26841ff700 10 osd.9 pg_epoch: 205161 pg[0.86( v 205161'18470166 (204988'18460094,205161'18470166] lb 0:618b7f49:::rbd_data.bf0de66b8b4567.000000000000c40b:head (bitwise) local-lis/les=204992/204993 n=4563 ec=59/59 lis/c 204992/201096 les/c/f 204993/204442/0 205158/205159/205159) [9,12,16]/[12,16] r=-1 lpr=205159 pi=[201096,205159)/1 luod=0'0 crt=205161'18470166 lcod 205161'18470165 active+remapped] on_local_recover: 0:618b9825:::rbd_data.554d416b8b4567.000000000000151f:b691
    -5> 2019-06-28 02:05:35.633470 7f26841ff700 20 snap_mapper.remove_oid 0:618b9825:::rbd_data.554d416b8b4567.000000000000151f:b691
    -4> 2019-06-28 02:05:35.633472 7f26841ff700 20 snap_mapper._remove_oid 0:618b9825:::rbd_data.554d416b8b4567.000000000000151f:b691
    -3> 2019-06-28 02:05:35.633483 7f26841ff700 15 filestore(/var/lib/ceph/osd/ceph-9) omap_get_values(5132): meta/#-1:c0371625:::snapmapper:0#
    -2> 2019-06-28 02:05:35.633713 7f26841ff700 15 filestore(/var/lib/ceph/osd/ceph-9) omap_get_values(5158): meta/#-1:c0371625:::snapmapper:0# = 0()
    -1> 2019-06-28 02:05:35.633726 7f26841ff700 20 osd.9 pg_epoch: 205161 pg[0.86( v 205161'18470166 (204988'18460094,205161'18470166] lb 0:618b7f49:::rbd_data.bf0de66b8b4567.000000000000c40b:head (bitwise) local-lis/les=204992/204993 n=4563 ec=59/59 lis/c 204992/201096 les/c/f 204993/204442/0 205158/205159/205159) [9,12,16]/[12,16] r=-1 lpr=205159 pi=[201096,205159)/1 luod=0'0 crt=205161'18470166 lcod 205161'18470165 active+remapped]  snapset 0=[]:[] legacy_snaps []
     0> 2019-06-28 02:05:35.636224 7f26841ff700 -1 /tmp/b-2I_Suq1c6XVP/build/node/root/packages/ceph/workdir-FunpCBst5QV4/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7f26841ff700 time 2019-06-28 02:05:35.633744
2019-06-28 02:05:35.671180 7f26841ff700 -1 *** Caught signal (Aborted) **
 in thread 7f26841ff700 thread_name:tp_osd_tp
     0> 2019-06-28 02:05:35.671180 7f26841ff700 -1 *** Caught signal (Aborted) **
 in thread 7f26841ff700 thread_name:tp_osd_tp

History

#1 Updated by Kenneth Van Alstyne almost 5 years ago

The problematic object appears to be rbd_data.554d416b8b4567.000000000000151f:b691. I can't get this OSD to give up on this object. The RBD image doesn't even seem to exist anymore.
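One hedged way to confirm whether any image still owns that data prefix (assuming the pool with id 0 is named "rbd"; rbd info prints each image's block_name_prefix):

   # Hypothetical lookup: print any image whose block_name_prefix matches
   for img in $(rbd ls rbd); do
       rbd info "rbd/$img" | grep -q 'rbd_data.554d416b8b4567' && echo "$img"
   done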

#2 Updated by David Zafman almost 5 years ago

  • Assignee set to David Zafman

#3 Updated by David Zafman almost 5 years ago

  • Subject changed from ceph-osd crash in PrimaryLogPG::on_local_recover to src/osd/PrimaryLogPG.cc: 10513: FAILED assert(head_obc)
    0> 2019-06-27 13:17:42.718574 7f57f7bfe700 -1 /tmp/b-2I_Suq1c6XVP/build/node/root/packages/ceph/workdir-FunpCBst5QV4/src/osd/PrimaryLogPG.cc: In function 'int PrimaryLogPG::recover_missing(const hobject_t&, eversion_t, int, PGBackend::RecoveryHandle*)' thread 7f57f7bfe700 time 2019-06-27 13:17:42.716742
/tmp/b-2I_Suq1c6XVP/build/node/root/packages/ceph/workdir-FunpCBst5QV4/src/osd/PrimaryLogPG.cc: 10513: FAILED assert(head_obc)

 ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xf5) [0x55598a4066b5]
 2: (PrimaryLogPG::recover_missing(hobject_t const&, eversion_t, int, PGBackend::RecoveryHandle*)+0xb4e) [0x555989fa762e]
 3: (PrimaryLogPG::recover_primary(unsigned long, ThreadPool::TPHandle&)+0x162e) [0x555989fe466e]
 4: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0x2af) [0x555989feb24f]
 5: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x39d) [0x555989e3ee2d]
 6: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xed5) [0x555989e47bb5]
 7: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b2) [0x55598a40b122]
 8: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55598a40e690]
 9: (()+0x7507) [0x7f5830c22507]
 10: (clone()+0x3f) [0x7f582fe3014f]

#4 Updated by David Zafman almost 5 years ago

  • Description updated (diff)

#5 Updated by David Zafman almost 5 years ago

Because of the missing object info, the --op list mechanism gets an error, so it can't generate the JSON for this bad object. I think I've hand-crafted the JSON so it can be removed. (The snapid value 46737 is 0xb691 in hex, matching the clone suffix :b691 in the logs above.)

["0.86",{"oid":"rbd_data.554d416b8b4567.000000000000151f","key":"","snapid":46737,"hash":2753155462,"max":0,"pool":0,"namespace":"","max":0}]

ceph-objectstore-tool --data-path <PATH_TO_STORE> '["0.86",{"oid":"rbd_data.554d416b8b4567.000000000000151f","key":"","snapid":46737,"hash":2753155462,"max":0,"pool":0,"namespace":"","max":0}]' remove

This doesn't just remove the object; it also logs the removal as a transaction, which will hopefully clear it from the missing set and allow the OSD to boot.
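A minimal follow-up sketch, assuming the same placeholder store path: confirm the clone is gone, then start the OSD again.

   # --op list with a trailing object name filters the listing to that object
   ceph-objectstore-tool --data-path <PATH_TO_STORE> --pgid 0.86 --op list \
       rbd_data.554d416b8b4567.000000000000151f
   # Assumes a systemd-managed OSD
   systemctl start ceph-osd@9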

#6 Updated by Kenneth Van Alstyne almost 5 years ago

The removal appeared to be successful; however, the OSD still crashes in a similar way:

root@storage-2:/var/lib/ceph/osd/ceph-9/current/0.86_head/DIR_6/DIR_8/DIR_D# ceph-objectstore-tool --dry-run --data-path /var/lib/ceph/osd/ceph-9/ --journal-path /dev/mapper/SAMSUNG_MZ7LM240HCGR-00003_S35PNX0H903049p5 --pgid 0.86 '["0.86",{"oid":"rbd_data.554d416b8b4567.000000000000151f","key":"","snapid":46737,"hash":2753155462,"max":0,"pool":0,"namespace":"","max":0}]' remove
remove #0:618b9825:::rbd_data.554d416b8b4567.000000000000151f:b691#
dry-run: Nothing changed

root@storage-2:/var/lib/ceph/osd/ceph-9/current/0.86_head/DIR_6/DIR_8/DIR_D# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-9/ --journal-path /dev/mapper/SAMSUNG_MZ7LM240HCGR-00003_S35PNX0H903049p5 --pgid 0.86 '["0.86",{"oid":"rbd_data.554d416b8b4567.000000000000151f","key":"","snapid":46737,"hash":2753155462,"max":0,"pool":0,"namespace":"","max":0}]' remove
remove #0:618b9825:::rbd_data.554d416b8b4567.000000000000151f:b691#

Logs are below:

   -99> 2019-06-28 11:54:22.132857 7f5427600700 10 osd.9 pg_epoch: 205344 pg[0.95( v 205340'9025694 (205230'9024193,205340'9025694] local-lis/les=205337/205338 n=8569 ec=59/59 lis/c 205337/205337 les/c/f 205338/205340/0 205337/205337/205337) [9,29,18] r=0 lpr=205337 crt=205340'9025694 lcod 205340'9025693 mlcod 205340'9025693 active+clean] sparse-read 3932160@65536
   -97> 2019-06-28 11:54:22.132874 7f5427600700 10 osd.9 pg_epoch: 205344 pg[0.95( v 205340'9025694 (205230'9024193,205340'9025694] local-lis/les=205337/205338 n=8569 ec=59/59 lis/c 205337/205337 les/c/f 205338/205340/0 205337/205337/205337) [9,29,18] r=0 lpr=205337 crt=205340'9025694 lcod 205340'9025693 mlcod 205340'9025693 active+clean]  sparse_read got 65536 bytes from object 0:a9800e7c:::rbd_data.c01ee86b8b4567.00000000000070c6:head
   -95> 2019-06-28 11:54:22.132889 7f5427600700 10 osd.9 pg_epoch: 205344 pg[0.95( v 205340'9025694 (205230'9024193,205340'9025694] local-lis/les=205337/205338 n=8569 ec=59/59 lis/c 205337/205337 les/c/f 205338/205340/0 205337/205337/205337) [9,29,18] r=0 lpr=205337 crt=205340'9025694 lcod 205340'9025693 mlcod 205340'9025693 active+clean]  dropping ondisk_read_lock
   -93> 2019-06-28 11:54:22.132900 7f5427600700 15 osd.9 pg_epoch: 205344 pg[0.95( v 205340'9025694 (205230'9024193,205340'9025694] local-lis/les=205337/205338 n=8569 ec=59/59 lis/c 205337/205337 les/c/f 205338/205340/0 205337/205337/205337) [9,29,18] r=0 lpr=205337 crt=205340'9025694 lcod 205340'9025693 mlcod 205340'9025693 active+clean] do_osd_op_effects client.14838159 con 0x7f545d87f800
   -91> 2019-06-28 11:54:22.132915 7f5427600700 15 osd.9 pg_epoch: 205344 pg[0.95( v 205340'9025694 (205230'9024193,205340'9025694] local-lis/les=205337/205338 n=8569 ec=59/59 lis/c 205337/205337 les/c/f 205338/205340/0 205337/205337/205337) [9,29,18] r=0 lpr=205337 crt=205340'9025694 lcod 205340'9025693 mlcod 205340'9025693 active+clean] log_op_stats osd_op(client.14838159.0:646715112 0.95 0:a9800e7c:::rbd_data.c01ee86b8b4567.00000000000070c6:head [sparse-read 3932160~65536] snapc 0=[] ondisk+read+known_if_redirected e205344) v8 inb 0 outb 65560 lat 0.000446
   -89> 2019-06-28 11:54:22.132933 7f5427600700 15 osd.9 pg_epoch: 205344 pg[0.95( v 205340'9025694 (205230'9024193,205340'9025694] local-lis/les=205337/205338 n=8569 ec=59/59 lis/c 205337/205337 les/c/f 205338/205340/0 205337/205337/205337) [9,29,18] r=0 lpr=205337 crt=205340'9025694 lcod 205340'9025693 mlcod 205340'9025693 active+clean] publish_stats_to_osd 205344:31116002
   -87> 2019-06-28 11:54:22.132942 7f5427600700  1 -- 10.0.0.2:6802/12614 --> 10.0.8.155:0/1832924813 -- osd_op_reply(646715112 rbd_data.c01ee86b8b4567.00000000000070c6 [sparse-read 3932160~65536] v0'0 uv8753492 ondisk = 0) v8 -- 0x7f5425446200 con 0
   -85> 2019-06-28 11:54:22.132968 7f5427600700 10 osd.9 205344 dequeue_op 0x7f5418fe61c0 finish
   -83> 2019-06-28 11:54:22.132976 7f5427600700 20 osd.9 op_wq(4) _process empty q, waiting
   -62> 2019-06-28 11:54:22.133492 7f5427600700 20 osd.9 op_wq(4) _process 0.86 item PGQueueable(0x7f54219f27c0 prio 1 cost 1000 e205344) queued
   -60> 2019-06-28 11:54:22.133501 7f5427600700 20 osd.9 op_wq(4) _process 0.86 item PGQueueable(0x7f54219f27c0 prio 1 cost 1000 e205344) pg 0x7f544c6cf000
   -56> 2019-06-28 11:54:22.133504 7f5427600700 10 osd.9 205344 dequeue_op 0x7f54219f27c0 prio 1 cost 1000 latency 0.000135 MOSDPGPush(0.86 205344/205338 [PushOp(0:618b9825:::rbd_data.554d416b8b4567.000000000000151f:b691, version: 198559'17593798, data_included: [], data_size: 0, omap_header_size: 0, omap_entries_size: 0, attrset_size: 1, recovery_info: ObjectRecoveryInfo(0:618b9825:::rbd_data.554d416b8b4567.000000000000151f:b691@198559'17593798, size: 4194304, copy_subset: [], clone_subset: {}, snapset: 0=[]:[]), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:0, data_complete:true, omap_recovered_to:, omap_complete:true, error:false), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false, error:false))]) v3 pg pg[0.86( v 205344'18495754 (205325'18493907,205344'18495754] lb 0:618961fb:::rbd_data.c025446b8b4567.0000000000001102:a51c (bitwise) local-lis/les=205338/205339 n=4498 ec=59/59 lis/c 205338/201096 les/c/f 205339/204442/0 205337/205338/205338) [9,12,16]/[12,16] r=-1 lpr=205338 pi=[201096,205338)/1 luod=0'0 crt=205344'18495754 active+remapped]
   -55> 2019-06-28 11:54:22.133530 7f5427600700 20 osd.9 205344 share_map osd.12 10.0.0.1:6830/2017138 205344
   -54> 2019-06-28 11:54:22.133535 7f5427600700 20 osd.9 205344 should_share_map osd.12 10.0.0.1:6830/2017138 205344
   -53> 2019-06-28 11:54:22.133543 7f5427600700 10 osd.9 pg_epoch: 205344 pg[0.86( v 205344'18495754 (205325'18493907,205344'18495754] lb 0:618961fb:::rbd_data.c025446b8b4567.0000000000001102:a51c (bitwise) local-lis/les=205338/205339 n=4498 ec=59/59 lis/c 205338/201096 les/c/f 205339/204442/0 205337/205338/205338) [9,12,16]/[12,16] r=-1 lpr=205338 pi=[201096,205338)/1 luod=0'0 crt=205344'18495754 active+remapped] _handle_message: 0x7f54219f27c0
   -52> 2019-06-28 11:54:22.133568 7f5427600700 10 osd.9 pg_epoch: 205344 pg[0.86( v 205344'18495754 (205325'18493907,205344'18495754] lb 0:618961fb:::rbd_data.c025446b8b4567.0000000000001102:a51c (bitwise) local-lis/les=205338/205339 n=4498 ec=59/59 lis/c 205338/201096 les/c/f 205339/204442/0 205337/205338/205338) [9,12,16]/[12,16] r=-1 lpr=205338 pi=[201096,205338)/1 luod=0'0 crt=205344'18495754 active+remapped] handle_push ObjectRecoveryInfo(0:618b9825:::rbd_data.554d416b8b4567.000000000000151f:b691@198559'17593798, size: 4194304, copy_subset: [], clone_subset: {}, snapset: 0=[]:[])ObjectRecoveryProgress(!first, data_recovered_to:0, data_complete:true, omap_recovered_to:, omap_complete:true, error:false)
   -51> 2019-06-28 11:54:22.133603 7f5427600700 10 osd.9 pg_epoch: 205344 pg[0.86( v 205344'18495754 (205325'18493907,205344'18495754] lb 0:618961fb:::rbd_data.c025446b8b4567.0000000000001102:a51c (bitwise) local-lis/les=205338/205339 n=4498 ec=59/59 lis/c 205338/201096 les/c/f 205339/204442/0 205337/205338/205338) [9,12,16]/[12,16] r=-1 lpr=205338 pi=[201096,205338)/1 luod=0'0 crt=205344'18495754 active+remapped] on_local_recover: 0:618b9825:::rbd_data.554d416b8b4567.000000000000151f:b691
   -50> 2019-06-28 11:54:22.133617 7f5427600700 20 snap_mapper.remove_oid 0:618b9825:::rbd_data.554d416b8b4567.000000000000151f:b691
   -49> 2019-06-28 11:54:22.133620 7f5427600700 20 snap_mapper._remove_oid 0:618b9825:::rbd_data.554d416b8b4567.000000000000151f:b691
   -48> 2019-06-28 11:54:22.133629 7f5427600700 15 filestore(/var/lib/ceph/osd/ceph-9) omap_get_values(5132): meta/#-1:c0371625:::snapmapper:0#
   -47> 2019-06-28 11:54:22.133822 7f5427600700 15 filestore(/var/lib/ceph/osd/ceph-9) omap_get_values(5158): meta/#-1:c0371625:::snapmapper:0# = 0()
   -46> 2019-06-28 11:54:22.133834 7f5427600700 20 osd.9 pg_epoch: 205344 pg[0.86( v 205344'18495754 (205325'18493907,205344'18495754] lb 0:618961fb:::rbd_data.c025446b8b4567.0000000000001102:a51c (bitwise) local-lis/les=205338/205339 n=4498 ec=59/59 lis/c 205338/201096 les/c/f 205339/204442/0 205337/205338/205338) [9,12,16]/[12,16] r=-1 lpr=205338 pi=[201096,205338)/1 luod=0'0 crt=205344'18495754 active+remapped]  snapset 0=[]:[] legacy_snaps []
    -4> 2019-06-28 11:54:22.136284 7f5427600700 -1 /tmp/b-2I_Suq1c6XVP/build/node/root/packages/ceph/workdir-FunpCBst5QV4/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7f5427600700 time 2019-06-28 11:54:22.133850
2019-06-28 11:54:22.170961 7f5427600700 -1 *** Caught signal (Aborted) **
 in thread 7f5427600700 thread_name:tp_osd_tp
     0> 2019-06-28 11:54:22.170961 7f5427600700 -1 *** Caught signal (Aborted) **
 in thread 7f5427600700 thread_name:tp_osd_tp

The whole assert+stack is as follows:

     0> 2019-06-28 11:54:22.170961 7f5427600700 -1 *** Caught signal (Aborted) **
 in thread 7f5427600700 thread_name:tp_osd_tp

 ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)
 1: (()+0x8e9918) [0x560661766918]
 2: (()+0x11aa0) [0x7f5461db6aa0]
 3: (gsignal()+0xca) [0x7f5460efc60a]
 4: (abort()+0x141) [0x7f5460efd6e1]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27f) [0x5606617ac83f]
 6: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ObjectStore::Transaction*)+0x11e5) [0x560661334875]
 7: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ObjectStore::Transaction*)+0x2d5) [0x5606614bd955]
 8: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x12e) [0x5606614bdb6e]
 9: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x2c5) [0x5606614cba05]
 10: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x97) [0x5606613d8667]
 11: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x7cd) [0x56066134301d]
 12: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x387) [0x5606611c16b7]
 13: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x5a) [0x56066145b73a]
 14: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x1ace) [0x5606611ee7ae]
 15: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b2) [0x5606617b1122]
 16: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5606617b4690]
 17: (()+0x7507) [0x7f5461dac507]
 18: (clone()+0x3f) [0x7f5460fba14f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
  20/20 osd
   0/ 5 optracker
   0/ 5 objclass
  20/20 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.9.log
--- end dump of recent events ---

#7 Updated by David Zafman over 4 years ago

Where is the line like this in the new crash?

/tmp/b-2I_Suq1c6XVP/build/node/root/packages/ceph/workdir-FunpCBst5QV4/src/osd/PrimaryLogPG.cc: ########: FAILED assert(?????)

#8 Updated by David Zafman over 4 years ago

I think you just didn't remove it from osd.12; osd.12, as the primary, is trying to push the stray clone to osd.9.
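A minimal sketch of repeating the removal on the remaining replicas (osd.12 and osd.16, per the acting set in the logs; the systemd unit names, data paths, and journal placeholder are assumptions):

   systemctl stop ceph-osd@12
   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
       --journal-path <JOURNAL_DEVICE> --pgid 0.86 \
       '["0.86",{"oid":"rbd_data.554d416b8b4567.000000000000151f","key":"","snapid":46737,"hash":2753155462,"max":0,"pool":0,"namespace":"","max":0}]' remove
   systemctl start ceph-osd@12
   # Repeat the same steps for osd.16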

#9 Updated by Kenneth Van Alstyne over 4 years ago

I removed the object from osd.12 and osd.16 and the cluster was able to return to HEALTH_OK. I appreciate the help in resolving this issue.

It looks like I originally got into this state by removing a snapshot that contained an inconsistent object; the object was inconsistent possibly because it was an unexpected clone. I'm not really sure how to prevent this from happening again.
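As a hedged precaution, scrub results can be reviewed (and a repair attempted) before deleting snapshots, so an unexpected clone is dealt with first; the pool name and PG id here are examples from this ticket:

   ceph health detail                                      # flags inconsistent PGs
   rados list-inconsistent-pg rbd                          # pool name assumed
   rados list-inconsistent-obj 0.86 --format=json-pretty   # per-object scrub detail
   ceph pg repair 0.86                                     # attempt repair first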

#10 Updated by Greg Farnum over 4 years ago

  • Project changed from Ceph to RADOS

#11 Updated by David Zafman over 4 years ago

  • Status changed from New to Closed
