Bug #13853 (closed)

osd/ReplicatedPG.cc: 10773: FAILED assert(r >= 0)

Added by Paul Emmerich over 8 years ago. Updated over 8 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)
Regression: No
Severity: 2 - major

Description

Hi,

We had some issues with snapshots that were originally triggered by a PG split in an older version, and we ended up with some weird data in our cluster that causes all kinds of issues. The latest bug triggered by this is a crash with

osd/ReplicatedPG.cc: 10773: FAILED assert(r >= 0)

The log looks like this (filtered to show only messages from the crashing thread):


  -594> 2015-11-20 19:17:34.575513 7f52d731d700  5 -- op tracker -- seq: 405, time: 2015-11-20 19:17:34.575513, event: waiting for subops from 33,124, op: osd_op(client.55202536.0:2979787 rbd_data.34a5d132ae8944a.00000000000002f7 [set-alloc-hint object_size 4194304 write_size 4194304,write 1241088~8192] 39.7fce6ccd RETRY=2 ack+ondisk+retry+write+known_if_redirected e132746)
  -593> 2015-11-20 19:17:34.575530 7f52d731d700  1 -- 79.133.40.239:6848/189226 --> 79.133.40.231:6858/1003915 -- osd_repop(client.55202536.0:2979787 39.4cd 39/7fce6ccd/rbd_data.34a5d132ae8944a.00000000000002f7/head v 132747'610749) v1 -- ?+8925 0x7f5305a4e400 con 0x7f5315eae580
  -592> 2015-11-20 19:17:34.575541 7f52d731d700  1 -- 79.133.40.239:6848/189226 --> 79.133.40.234:6804/1513374 -- osd_repop(client.55202536.0:2979787 39.4cd 39/7fce6ccd/rbd_data.34a5d132ae8944a.00000000000002f7/head v 132747'610749) v1 -- ?+8925 0x7f5315d2ea00 con 0x7f5315cb6000
  -591> 2015-11-20 19:17:34.575547 7f52d731d700 10 osd.73 pg_epoch: 132747 pg[39.4cd( v 132743'610748 lc 132681'608770 (132395'604549,132743'610748] local-les=132747 n=428 ec=123816 les/c/f 132747/132717/0 132746/132746/132746) [73,124,33] r=0 lpr=132746 pi=132638-132745/25 crt=132725'610746 lcod 0'0 mlcod 0'0 active+degraded NIBBLEWISE m=4] append_log log((132395'604549,132743'610748], crt=132725'610746) [132747'610749 (132743'610748) modify   39/7fce6ccd/rbd_data.34a5d132ae8944a.00000000000002f7/head by client.55202536.0:2979787 2015-11-20 19:17:39.096098]
  -590> 2015-11-20 19:17:34.575558 7f52d731d700 10 osd.73 pg_epoch: 132747 pg[39.4cd( v 132747'610749 lc 132681'608770 (132395'604549,132747'610749] local-les=132747 n=428 ec=123816 les/c/f 132747/132717/0 132746/132746/132746) [73,124,33] r=0 lpr=132746 pi=132638-132745/25 luod=132743'610748 lua=132743'610748 crt=132725'610746 lcod 0'0 mlcod 0'0 active+degraded NIBBLEWISE m=4] add_log_entry 132747'610749 (132743'610748) modify   39/7fce6ccd/rbd_data.34a5d132ae8944a.00000000000002f7/head by client.55202536.0:2979787 2015-11-20 19:17:39.096098
  -589> 2015-11-20 19:17:34.575564 7f52d731d700 10 osd.73 pg_epoch: 132747 pg[39.4cd( v 132747'610749 lc 132681'608770 (132395'604549,132747'610749] local-les=132747 n=428 ec=123816 les/c/f 132747/132717/0 132746/132746/132746) [73,124,33] r=0 lpr=132746 pi=132638-132745/25 luod=132743'610748 lua=132743'610748 crt=132725'610746 lcod 0'0 mlcod 0'0 active+degraded NIBBLEWISE m=4] append_log: trimming to 0'0 entries 
  -588> 2015-11-20 19:17:34.575585 7f52d731d700  5 write_log with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, dirty_divergent_priors: false, divergent_priors: 0, writeout_from: 132747'610749, trimmed: 
  -587> 2015-11-20 19:17:34.575617 7f52d731d700  5 -- op tracker -- seq: 405, time: 2015-11-20 19:17:34.575617, event: commit_queued_for_journal_write, op: osd_op(client.55202536.0:2979787 rbd_data.34a5d132ae8944a.00000000000002f7 [set-alloc-hint object_size 4194304 write_size 4194304,write 1241088~8192] 39.7fce6ccd RETRY=2 ack+ondisk+retry+write+known_if_redirected e132746)
  -586> 2015-11-20 19:17:34.575629 7f52d731d700 10 osd.73 pg_epoch: 132747 pg[39.4cd( v 132747'610749 lc 132681'608770 (132395'604549,132747'610749] local-les=132747 n=428 ec=123816 les/c/f 132747/132717/0 132746/132746/132746) [73,124,33] r=0 lpr=132746 pi=132638-132745/25 luod=132743'610748 lua=132743'610748 crt=132725'610746 lcod 0'0 mlcod 0'0 active+degraded NIBBLEWISE m=4] eval_repop repgather(0x7f531591e4c0 132747'610749 rep_tid=1 committed?=0 applied?=0 lock=0 op=osd_op(client.55202536.0:2979787 rbd_data.34a5d132ae8944a.00000000000002f7 [set-alloc-hint object_size 4194304 write_size 4194304,write 1241088~8192] 39.7fce6ccd RETRY=2 ack+ondisk+retry+write+known_if_redirected e132746) v5) wants=ad
  -584> 2015-11-20 19:17:34.575644 7f52d731d700 10 osd.73 132747 dequeue_op 0x7f5304cb7c00 finish
  -236> 2015-11-20 19:17:34.581524 7f52d731d700 10 osd.73 132747 dequeue_op 0x7f53162a0500 prio 127 cost 0 latency 0.000047 pg_scan(get_digest 39.252 39/2dc03252/rbd_data.34a2dc92ae8944a.000000000000188c/head-MIN e 132747/132747) v2 pg pg[39.252( v 132725'99352 (130515'95523,132725'99352] lb 39/2dc03252/rbd_data.34a2dc92ae8944a.000000000000188c/head (NIBBLEWISE) local-les=132627 n=27 ec=123816 les/c/f 132747/132712/0 132746/132746/132711) [73,46,130]/[46,130,11] r=-1 lpr=132746 pi=131207-132745/46 luod=0'0 crt=132725'99352 lcod 0'0 active+remapped NIBBLEWISE]
  -235> 2015-11-20 19:17:34.581539 7f52d731d700  5 -- op tracker -- seq: 643, time: 2015-11-20 19:17:34.581539, event: reached_pg, op: pg_scan(get_digest 39.252 39/2dc03252/rbd_data.34a2dc92ae8944a.000000000000188c/head-MIN e 132747/132747)
  -229> 2015-11-20 19:17:34.581543 7f52d731d700 10 osd.73 pg_epoch: 132747 pg[39.252( v 132725'99352 (130515'95523,132725'99352] lb 39/2dc03252/rbd_data.34a2dc92ae8944a.000000000000188c/head (NIBBLEWISE) local-les=132627 n=27 ec=123816 les/c/f 132747/132712/0 132746/132746/132711) [73,46,130]/[46,130,11] r=-1 lpr=132746 pi=131207-132745/46 luod=0'0 crt=132725'99352 lcod 0'0 active+remapped NIBBLEWISE] handle_message: 0x7f53162a0500
  -222> 2015-11-20 19:17:34.581552 7f52d731d700 10 osd.73 pg_epoch: 132747 pg[39.252( v 132725'99352 (130515'95523,132725'99352] lb 39/2dc03252/rbd_data.34a2dc92ae8944a.000000000000188c/head (NIBBLEWISE) local-les=132627 n=27 ec=123816 les/c/f 132747/132712/0 132746/132746/132711) [73,46,130]/[46,130,11] r=-1 lpr=132746 pi=131207-132745/46 luod=0'0 crt=132725'99352 lcod 0'0 active+remapped NIBBLEWISE] do_scan pg_scan(get_digest 39.252 39/2dc03252/rbd_data.34a2dc92ae8944a.000000000000188c/head-MIN e 132747/132747) v2
  -220> 2015-11-20 19:17:34.581558 7f52d731d700  5 -- op tracker -- seq: 643, time: 2015-11-20 19:17:34.581558, event: started, op: pg_scan(get_digest 39.252 39/2dc03252/rbd_data.34a2dc92ae8944a.000000000000188c/head-MIN e 132747/132747)
  -219> 2015-11-20 19:17:34.581560 7f52d731d700 10 osd.73 pg_epoch: 132747 pg[39.252( v 132725'99352 (130515'95523,132725'99352] lb 39/2dc03252/rbd_data.34a2dc92ae8944a.000000000000188c/head (NIBBLEWISE) local-les=132627 n=27 ec=123816 les/c/f 132747/132712/0 132746/132746/132711) [73,46,130]/[46,130,11] r=-1 lpr=132746 pi=131207-132745/46 luod=0'0 crt=132725'99352 lcod 0'0 active+remapped NIBBLEWISE] scan_range from 39/2dc03252/rbd_data.34a2dc92ae8944a.000000000000188c/head
  -212> 2015-11-20 19:17:34.581704 7f52d731d700 10 osd.73 pg_epoch: 132747 pg[39.252( v 132725'99352 (130515'95523,132725'99352] lb 39/2dc03252/rbd_data.34a2dc92ae8944a.000000000000188c/head (NIBBLEWISE) local-les=132627 n=27 ec=123816 les/c/f 132747/132712/0 132746/132746/132711) [73,46,130]/[46,130,11] r=-1 lpr=132746 pi=131207-132745/46 luod=0'0 crt=132725'99352 lcod 0'0 active+remapped NIBBLEWISE]  got 3 items, next MAX
    -4> 2015-11-20 19:17:34.583759 7f52d731d700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::scan_range(int, int, PG::BackfillInterval*, ThreadPool::TPHandle&)' thread 7f52d731d700 time 2015-11-20 19:17:34.581747
2015-11-20 19:17:34.638043 7f52d731d700 -1 *** Caught signal (Aborted) **
 in thread 7f52d731d700
     0> 2015-11-20 19:17:34.638043 7f52d731d700 -1 *** Caught signal (Aborted) **
 in thread 7f52d731d700

 ceph version 9.2.0-1-g57fc59f (57fc59fcf280cf8c92478caa19e8dde4aaa5741f)
 1: (()+0x7d0e9a) [0x7f5300518e9a]
 2: (()+0x10340) [0x7f52fec2c340]
 3: (gsignal()+0x39) [0x7f52fcea7cc9]
 4: (abort()+0x148) [0x7f52fceab0d8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f52fd7b2535]
 6: (()+0x5e6d6) [0x7f52fd7b06d6]
 7: (()+0x5e703) [0x7f52fd7b0703]
 8: (()+0x5e922) [0x7f52fd7b0922]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0x7f5300605a58]
 10: (ReplicatedPG::scan_range(int, int, PG::BackfillInterval*, ThreadPool::TPHandle&)+0xad2) [0x7f530021f232]
 11: (ReplicatedPG::do_scan(std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x7b4) [0x7f530021fa14]
 12: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x3bf) [0x7f530020079f]
 13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3bd) [0x7f530007356d]
 14: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x5d) [0x7f530007378d]
 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x8b4) [0x7f5300098744]
 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x85f) [0x7f53005f628f]
 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f53005f8190]
 18: (()+0x8182) [0x7f52fec24182]
 19: (clone()+0x6d) [0x7f52fcf6b47d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

The full log is attached.

This looks like #10150, which triggered the same assert and was fixed by "ignoring" the ENOENT return code.
Could this be fixed or worked around by replacing the assert with a similar check for whatever error code is being returned by objects_get_attr()?
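
For illustration, here is a rough sketch of the kind of change I mean inside ReplicatedPG::scan_range(); the surrounding code, the variable names (hoid, bl), and the decision to skip the object are my assumptions, not the actual upstream source:

    // Hypothetical sketch of the proposed workaround in ReplicatedPG::scan_range().
    // The backtrace shows the assert fires on the result of an attribute read;
    // instead of asserting r >= 0 unconditionally, a missing attribute could be
    // tolerated (similar to the #10150 fix) while still failing on other errors.
    bufferlist bl;
    int r = pgbackend->objects_get_attr(hoid, OI_ATTR, &bl);  // hoid: object being scanned
    if (r == -ENOENT) {
      // Attribute (or object) is gone, e.g. because of the weird snapshot data;
      // log and skip this object instead of taking the whole OSD down.
      dout(10) << __func__ << " " << hoid << " objects_get_attr got "
               << cpp_strerror(r) << ", skipping" << dendl;
      continue;
    }
    assert(r >= 0);  // any other error is still unexpected, so keep asserting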

(I've already mentioned this crash in issue #13837, but it seems unrelated to the bug reported there, so I'm opening this new issue.)

Paul


Files

ceph-osd.73-scan_range-crash.log.gz (413 KB) - crash log - Paul Emmerich, 11/20/2015 07:59 PM
#1

Updated by Paul Emmerich over 8 years ago

I deleted and re-installed the OSD and the issue is gone. You can close this as not reproducible.

#2

Updated by Samuel Just over 8 years ago

  • Status changed from New to Can't reproduce