Project

General

Profile

Bug #11661

"FAILED assert(0 == "Running incompatible OSD")"

Added by Yuri Weinstein over 7 years ago. Updated almost 7 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
hammer
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

hammer v0.94.2 release
Run: http://pulpito-rdu.front.sepia.ceph.com/teuthology-2015-05-15_12:42:53-rados-hammer-distro-basic-typica/
Jobs: ['17971', '18033', '18107', '18123']
Logs for one: http://typica002.front.sepia.ceph.com/teuthology-2015-05-15_12:42:53-rados-hammer-distro-basic-typica/17971/teuthology.log

2015-05-15T13:50:10.220 INFO:tasks.rados.rados.0.typica051.stdout:1566:  finishing write tid 5 to typica05131967-27
2015-05-15T13:50:10.221 INFO:tasks.rados.rados.0.typica051.stdout:1566:  finishing write tid 6 to typica05131967-27
2015-05-15T13:50:10.221 INFO:tasks.rados.rados.0.typica051.stdout:update_object_version oid 27 v 237 (ObjNum 1056 snap 0 seq_num 1056) dirty exists
2015-05-15T13:50:10.221 INFO:tasks.rados.rados.0.typica051.stdout:1566:  left oid 27 (ObjNum 1056 snap 0 seq_num 1056)
2015-05-15T13:50:10.222 INFO:tasks.rados.rados.0.typica051.stdout:1566: done (15 left)
2015-05-15T13:50:10.222 INFO:tasks.rados.rados.0.typica051.stdout:1573: delete oid 387 current snap is 0
2015-05-15T13:50:10.238 INFO:tasks.rados.rados.0.typica051.stdout:1571:  expect (ObjNum 910 snap 0 seq_num 910)
2015-05-15T13:50:10.531 INFO:tasks.ceph.osd.3.typica019.stderr:osd/PG.cc: In function 'void PG::scrub(ThreadPool::TPHandle&)' thread 7f2ecc488700 time 2015-05-15 13:50:10.526061
2015-05-15T13:50:10.531 INFO:tasks.ceph.osd.3.typica019.stderr:osd/PG.cc: 3799: FAILED assert(0 == "Running incompatible OSD")
2015-05-15T13:50:10.532 INFO:tasks.ceph.osd.3.typica019.stderr: ceph version 0.94.1-111-gb69fb89 (b69fb89122d6f989152a29124cc7ed54b5e4d43b)
2015-05-15T13:50:10.532 INFO:tasks.ceph.osd.3.typica019.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xbc2bcb]
2015-05-15T13:50:10.532 INFO:tasks.ceph.osd.3.typica019.stderr: 2: (PG::scrub(ThreadPool::TPHandle&)+0x5fb) [0x7f8a5b]
2015-05-15T13:50:10.533 INFO:tasks.ceph.osd.3.typica019.stderr: 3: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x19) [0x6cca59]
2015-05-15T13:50:10.533 INFO:tasks.ceph.osd.3.typica019.stderr: 4: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xbb3d5e]
2015-05-15T13:50:10.534 INFO:tasks.ceph.osd.3.typica019.stderr: 5: (ThreadPool::WorkThread::entry()+0x10) [0xbb4e00]
2015-05-15T13:50:10.534 INFO:tasks.ceph.osd.3.typica019.stderr: 6: (()+0x8182) [0x7f2eecdc8182]
2015-05-15T13:50:10.534 INFO:tasks.ceph.osd.3.typica019.stderr: 7: (clone()+0x6d) [0x7f2eeb312efd]
2015-05-15T13:50:10.535 INFO:tasks.ceph.osd.3.typica019.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is 

Related issues

Related to Ceph - Bug #11848: PG: don't check feature if connection isn't ready Duplicate 06/02/2015
Copied to Ceph - Backport #14292: osd/PG.cc: 3837: FAILED assert(0 == "Running incompatible OSD") Resolved

Associated revisions

Revision 7a10a7e1 (diff)
Added by David Zafman over 7 years ago

osd: CEPH_FEATURE_CHUNKY_SCRUB feature now required

Feature present since at least the Dumpling release.
A later commit will add it to the osd_required mask

Fixes: #11661

Signed-off-by: David Zafman <>

Revision 30662318 (diff)
Added by David Zafman almost 7 years ago

osd: CEPH_FEATURE_CHUNKY_SCRUB feature now required

Feature present since at least the Dumpling release.
A later commit will add it to the osd_required mask

Fixes: #11661

Signed-off-by: David Zafman <>
(cherry picked from commit 7a10a7e196efd0f59fa7adb87d0a120655b227d8)

History

#1 Updated by Yuri Weinstein over 7 years ago

  • Description updated (diff)

#2 Updated by Samuel Just over 7 years ago

  • Assignee set to David Zafman

#3 Updated by David Zafman over 7 years ago

  • Status changed from New to In Progress

#4 Updated by David Zafman over 7 years ago

osd.3 and osd.5 are both running 0.94.1-111-gb69fb89 (a version of hammer).

2015-05-15 13:49:39.677006 7f41d9d5a700 10 osd.5 180 dequeue_op 0x5c08000 prio 127 cost 0 latency 0.000405 osd_sub_op(unknown.0.0:0 0.2e 0//0//-1 [scrub-reserve] v 0'0 snapset=0=[]:[] snapc=0=[]) v11 pg pg[0.2e( empty local-les=171 n=0 ec=1 les/c 171/176 165/165/153) [3,5,2] r=1 lpr=165 pi=144-164/3 crt=0'0 active]
2015-05-15 13:49:39.677072 7f41d9d5a700 10 osd.5 pg_epoch: 180 pg[0.2e( empty local-les=171 n=0 ec=1 les/c 171/176 165/165/153) [3,5,2] r=1 lpr=165 pi=144-164/3 crt=0'0 active] handle_message: 0x5c08000
2015-05-15 13:49:39.677095 7f41d9d5a700 15 osd.5 pg_epoch: 180 pg[0.2e( empty local-les=171 n=0 ec=1 les/c 171/176 165/165/153) [3,5,2] r=1 lpr=165 pi=144-164/3 crt=0'0 active] do_sub_op osd_sub_op(unknown.0.0:0 0.2e 0//0//-1 [scrub-reserve] v 0'0 snapset=0=[]:[] snapc=0=[]) v11
2015-05-15 13:49:39.677134 7f41d9d5a700 7 osd.5 pg_epoch: 180 pg[0.2e( empty local-les=171 n=0 ec=1 les/c 171/176 165/165/153) [3,5,2] r=1 lpr=165 pi=144-164/3 crt=0'0 active] sub_op_scrub_reserve

2015-05-15 13:49:39.676088 7f2ee2e40700 20 osd.3 pg_epoch: 180 pg[0.2e( empty local-les=171 n=0 ec=1 les/c 171/176 165/165/153) [3,5,2] r=0 lpr=165 crt=0'0 mlcod 0'0 active+clean] sched_scrub: reserved locally, reserving replicas
2015-05-15 13:49:39.676120 7f2ee2e40700 10 osd.3 pg_epoch: 180 pg[0.2e( empty local-les=171 n=0 ec=1 les/c 171/176 165/165/153) [3,5,2] r=0 lpr=165 crt=0'0 mlcod 0'0 active+clean] scrub requesting reserve from osd.2
2015-05-15 13:49:39.676212 7f2ee2e40700 10 osd.3 pg_epoch: 180 pg[0.2e( empty local-les=171 n=0 ec=1 les/c 171/176 165/165/153) [3,5,2] r=0 lpr=165 crt=0'0 mlcod 0'0 active+clean] scrub requesting reserve from osd.5
2015-05-15 13:49:39.676291 7f2ee2e40700 20 osd.3 pg_epoch: 180 pg[0.2e( empty local-les=171 n=0 ec=1 les/c 171/176 165/165/153) [3,5,2] r=0 lpr=165 crt=0'0 mlcod 0'0 active+clean] sched_scrub: reserved 3, waiting for replicas
2015-05-15 13:49:39.678028 7f2ecec8d700 10 osd.3 180 dequeue_op 0x8fce400 prio 127 cost 1 latency 0.000518 osd_sub_op_reply(unknown.0.0:0 0.2e 0//0//0 [scrub-reserve] ack, result = 0) v2 pg pg[0.2e( empty local-les=171 n=0 ec=1 les/c 171/176 165/165/153) [3,5,2] r=0 lpr=165 crt=0'0 mlcod 0'0 active+clean]
2015-05-15 13:49:39.678100 7f2ecec8d700 10 osd.3 pg_epoch: 180 pg[0.2e( empty local-les=171 n=0 ec=1 les/c 171/176 165/165/153) [3,5,2] r=0 lpr=165 crt=0'0 mlcod 0'0 active+clean] handle_message: 0x8fce400
2015-05-15 13:49:39.678132 7f2ecec8d700 7 osd.3 pg_epoch: 180 pg[0.2e( empty local-les=171 n=0 ec=1 les/c 171/176 165/165/153) [3,5,2] r=0 lpr=165 crt=0'0 mlcod 0'0 active+clean] sub_op_scrub_reserve_reply
2015-05-15 13:49:39.678198 7f2ecec8d700 10 osd.3 pg_epoch: 180 pg[0.2e( empty local-les=171 n=0 ec=1 les/c 171/176 165/165/153) [3,5,2] r=0 lpr=165 crt=0'0 mlcod 0'0 active+clean] osd.5 scrub reserve = success
2015-05-15 13:49:39.678230 7f2ecec8d700 20 osd.3 pg_epoch: 180 pg[0.2e( empty local-les=171 n=0 ec=1 les/c 171/176 165/165/153) [3,5,2] r=0 lpr=165 crt=0'0 mlcod 0'0 active+clean] sched_scrub: reserved 3,5, waiting for replicas
2015-05-15 13:49:39.680833 7f2ed1492700 10 osd.3 180 dequeue_op 0x65ee600 prio 127 cost 1 latency 0.003395 osd_sub_op_reply(unknown.0.0:0 0.2e 0//0//0 [scrub-reserve] ack, result = 0) v2 pg pg[0.2e( empty local-les=171 n=0 ec=1 les/c 171/176 165/165/153) [3,5,2] r=0 lpr=165 crt=0'0 mlcod 0'0 active+clean]
2015-05-15 13:49:39.680906 7f2ed1492700 10 osd.3 pg_epoch: 180 pg[0.2e( empty local-les=171 n=0 ec=1 les/c 171/176 165/165/153) [3,5,2] r=0 lpr=165 crt=0'0 mlcod 0'0 active+clean] handle_message: 0x65ee600
2015-05-15 13:49:39.680934 7f2ed1492700 7 osd.3 pg_epoch: 180 pg[0.2e( empty local-les=171 n=0 ec=1 les/c 171/176 165/165/153) [3,5,2] r=0 lpr=165 crt=0'0 mlcod 0'0 active+clean] sub_op_scrub_reserve_reply
2015-05-15 13:49:39.680978 7f2ed1492700 10 osd.3 pg_epoch: 180 pg[0.2e( empty local-les=171 n=0 ec=1 les/c 171/176 165/165/153) [3,5,2] r=0 lpr=165 crt=0'0 mlcod 0'0 active+clean] osd.2 scrub reserve = success
2015-05-15 13:49:39.681007 7f2ed1492700 20 osd.3 pg_epoch: 180 pg[0.2e( empty local-les=171 n=0 ec=1 les/c 171/176 165/165/153) [3,5,2] r=0 lpr=165 crt=0'0 mlcod 0'0 active+clean] sched_scrub: success, reserved self and replicas

2015-05-15 13:50:03.670169 7f41eb708700 -1 osd.5 180 heartbeat_check: no reply from osd.3 since back 2015-05-15 13:49:36.121526 front 2015-05-15 13:49:36.121526 (cutoff 2015-05-15 13:49:43.670166)
2015-05-15 13:50:05.453118 7f41dad5c700 10 osd.5 pg_epoch: 181 pg[0.2e( empty local-les=171 n=0 ec=1 les/c 171/176 165/165/153) [5,2] r=0 lpr=181 pi=144-180/4 crt=0'0 mlcod 0'0 active] start_peering_interval: check_new_interval output: generate_past_intervals interval(165-180 up [3,5,2](3) acting [3,5,2](3)): not rw, up_thru 172 up_from 6 last_epoch_clean 176
generate_past_intervals interval(165-180 up [3,5,2](3) acting [3,5,2](3) maybe_went_rw) : primary up 6-172 includes interval

2015-05-15 13:50:10.526044 7f2ecc488700 20 osd.3 pg_epoch: 180 pg[0.2e( empty local-les=171 n=0 ec=1 les/c 171/176 165/165/153) [3,5,2] r=0 lpr=165 crt=0'0 mlcod 0'0 active+clean+scrubbing] OSD 5 does not support chunky scrubs, falling back to classic

The scenario here is that we are getting an incorrect set of features back from the Connection. We can remove the code checking for the CEPH_FEATURE_CHUNKY_SCRUB because any connecting osd must have it set. However, I want to understand the use of get_con_osd_cluster() to get information about a specific remote peer. I infer from the results of has_feature() and that osd.5 wasn't getting ping responses from osd.3 several things. We see that osd.5 was advancing to epoch 181 with osd.3 out. So should get_con_osd_cluster() have checked is_down(self) or is_out(self) or does the caller need to before trusting information from the returned connection?

#5 Updated by Yuri Weinstein over 7 years ago

David, is this issue a blocker for hammer v0.94.2 release?

#6 Updated by David Zafman over 7 years ago

This is very rare and has been around for over a year.

#7 Updated by Samuel Just over 7 years ago

ubuntu@teuthology:/a/samuelj-2015-05-28_08:23:27-rados-wip-sam-testing-distro-basic-multi/913465/remote

#8 Updated by Samuel Just over 7 years ago

ubuntu@teuthology:/a/samuelj-2015-05-28_08:23:27-rados-wip-sam-testing-distro-basic-multi/913520/remote

#9 Updated by David Zafman over 7 years ago

  • Status changed from In Progress to 7

#10 Updated by David Zafman over 7 years ago

  • Status changed from 7 to Fix Under Review

#11 Updated by David Zafman over 7 years ago

  • Status changed from Fix Under Review to Resolved

7a10a7e196efd0f59fa7adb87d0a120655b227d8

#12 Updated by David Barksdale over 7 years ago

I'm running from the hammer branch and one of my OSDs is hitting this frequently. Will this be back-ported to the hammer branch?

#13 Updated by David Zafman almost 7 years ago

  • Duplicated by Backport #14292: osd/PG.cc: 3837: FAILED assert(0 == "Running incompatible OSD") added

#14 Updated by David Zafman almost 7 years ago

  • Status changed from Resolved to Pending Backport
  • Backport set to Hammer

Backport to hammer 7a10a7e1 and add CEPH_FEATURE_CHUNKY_SCRUB to osd_required in src/ceph_osd.cc just to be careful.

#15 Updated by Loïc Dachary almost 7 years ago

  • Backport changed from Hammer to hammer

#16 Updated by Loïc Dachary almost 7 years ago

  • Duplicated by deleted (Backport #14292: osd/PG.cc: 3837: FAILED assert(0 == "Running incompatible OSD"))

#17 Updated by Loïc Dachary almost 7 years ago

  • Copied to Backport #14292: osd/PG.cc: 3837: FAILED assert(0 == "Running incompatible OSD") added

#18 Updated by Loïc Dachary almost 7 years ago

  • Subject changed from "FAILED assert(0 == "Running incompatible OSD")" in rados-hammer-distro-basic-typica run to "FAILED assert(0 == "Running incompatible OSD")"

#19 Updated by Nathan Cutler almost 7 years ago

  • Status changed from Pending Backport to Resolved

Also available in: Atom PDF