Bug #16416


ec-lost-unfound hang due to min_size default change

Added by Samuel Just almost 8 years ago. Updated almost 8 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

sjust@teuthology:/a/samuelj-2016-06-21_10:29:56-rados-wip-sam-testing-distro-basic-smithi/268677

Actions #1

Updated by Samuel Just almost 8 years ago

  • Assignee set to Haomai Wang

Also, sjust@teuthology:/a/samuelj-2016-06-21_10:29:56-rados-wip-sam-testing-distro-basic-smithi/268688

Both are hangs, so we don't have logs. You'll have to reproduce, but at least it seems to happen semi-reliably.
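A sketch of one way to reschedule just the failing facets for reproduction, assuming the usual teuthology-suite flags; the branch and machine type here are placeholders, not values from this ticket:

    # Reschedule only the ec-lost-unfound jobs from the singleton suite.
    teuthology-suite -s rados/singleton \
        -c wip-branch -m smithi \
        --filter ec-lost-unfound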

Actions #2

Updated by Samuel Just almost 8 years ago

sjust@teuthology:/a/samuelj-2016-06-21_10:29:56-rados-wip-sam-testing-distro-basic-smithi/268946

Actions #3

Updated by Samuel Just almost 8 years ago

sjust@teuthology:/a/samuelj-2016-06-21_10:29:56-rados-wip-sam-testing-distro-basic-smithi/268958

Actions #4

Updated by Samuel Just almost 8 years ago

sjust@teuthology:/a/yuriw-2016-06-20_09:29:55-rados-master_2016_6_20-distro-basic-smithi/267205

Actions #5

Updated by Samuel Just almost 8 years ago

sjust@teuthology:/a/yuriw-2016-06-20_09:29:55-rados-master_2016_6_20-distro-basic-smithi/267215

Actions #7

Updated by Haomai Wang almost 8 years ago

http://pulpito.ceph.com/haomai-2016-06-22_23:47:00-rados-wip-haomai-testing-distro-basic-smithi/272023/:
this job runs the ec-lost-unfound task and hangs too. From the job file, it first installs infernalis and then upgrades. Could the cause be that the async msgr is also used on infernalis, since the async msgr is buggy in that version?
2016-06-23 05:28:46.760003 osd.3 172.21.15.22:6800/31612 34 : cluster [WRN] failed to encode map e23 with expected crc
2016-06-23 05:28:47.728869 osd.3 172.21.15.22:6800/31612 35 : cluster [WRN] failed to encode map e24 with expected crc
2016-06-23 05:28:47.730454 osd.3 172.21.15.22:6800/31612 36 : cluster [WRN] failed to encode map e24 with expected crc
2016-06-23 05:28:47.732226 osd.3 172.21.15.22:6800/31612 37 : cluster [WRN] failed to encode map e24 with expected crc
2016-06-23 05:28:56.631793 osd.3 172.21.15.22:6800/31612 38 : cluster [WRN] failed to encode map e25 with expected crc
2016-06-23 05:28:57.801069 osd.3 172.21.15.22:6800/31612 39 : cluster [WRN] failed to encode map e27 with expected crc
2016-06-23 05:28:59.162718 osd.3 172.21.15.22:6800/31612 40 : cluster [WRN] failed to encode map e28 with expected crc
2016-06-23 05:28:59.164540 osd.3 172.21.15.22:6800/31612 41 : cluster [WRN] failed to encode map e28 with expected crc
2016-06-23 05:28:59.166803 osd.3 172.21.15.22:6800/31612 42 : cluster [WRN] failed to encode map e28 with expected crc
2016-06-23 05:28:59.212481 osd.3 172.21.15.22:6800/31612 43 : cluster [WRN] failed to encode map e28 with expected crc

Then from osd.3.log:
2016-06-23 05:29:15.579763 7f9874f8f700 2 osd.3 28 got incremental 29 but failed to encode full with correct crc; requesting
2016-06-23 05:29:15.579779 7f9874f8f700 0 log_channel(cluster) log [WRN] : failed to encode map e29 with expected crc
2016-06-23 05:29:15.579784 7f9874f8f700 20 osd.3 28 my encoded map was:
0000 : 08 07 46 13 00 00 03 01 86 0b 00 00 6e d8 6c 7c : ..F.........n.l|

And the cluster status at the time:

health HEALTH_ERR
18 pgs are stuck inactive for more than 300 seconds
12 pgs degraded
15 pgs incomplete
3 pgs stale
15 pgs stuck inactive
3 pgs stuck stale
12 pgs stuck unclean
12 pgs undersized
13 requests are blocked > 32 sec
1/3 in osds are down
monmap e1: 3 mons at {a=172.21.15.49:6789/0,b=172.21.15.49:6790/0,c=172.21.15.49:6791/0}
election epoch 12, quorum 0,1,2 a,b,c
osdmap e34: 4 osds: 2 up, 3 in; 27 remapped pgs
flags sortbitwise
pgmap v246: 32 pgs, 2 pools, 26160 bytes data, 10 objects
400 MB used, 278 GB / 279 GB avail
15 incomplete
12 active+undersized+degraded
3 stale+active+clean
2 active+clean

What is strange to me is that osd.3 and osd.0 disappeared without any shutdown/coredump log.
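As a side note on the "failed to encode map ... with expected crc" warnings above, a sketch for inspecting the mismatching epoch with the standard ceph CLI tools; epoch 29 is taken from the osd.3 log, and the output path is a placeholder:

    # Fetch the full map the monitors hold for the epoch the OSD failed
    # to re-encode, then dump it for comparison with the OSD's copy.
    ceph osd getmap 29 -o /tmp/osdmap.29
    osdmaptool --print /tmp/osdmap.29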

Actions #8

Updated by Haomai Wang almost 8 years ago

  • ceph-qa-suite ceph-deploy added

http://pulpito.ceph.com/samuelj-2016-06-21_10:29:56-rados-wip-sam-testing-distro-basic-smithi/268677/:
rados/singleton/{rados.yaml all/ec-lost-unfound-upgrade.yaml fs/xfs.yaml msgr/random.yaml msgr-failures/many.yaml}

http://pulpito.ceph.com/samuelj-2016-06-21_10:29:56-rados-wip-sam-testing-distro-basic-smithi/268688/:
rados/singleton/{rados.yaml all/ec-lost-unfound.yaml fs/xfs.yaml msgr/simple.yaml msgr-failures/few.yaml}

http://pulpito.ceph.com/samuelj-2016-06-21_10:29:56-rados-wip-sam-testing-distro-basic-smithi/268946/:
rados/singleton/{rados.yaml all/ec-lost-unfound-upgrade.yaml fs/xfs.yaml msgr/simple.yaml msgr-failures/few.yaml}

http://pulpito.ceph.com/samuelj-2016-06-21_10:29:56-rados-wip-sam-testing-distro-basic-smithi/268958/:
rados/singleton/{rados.yaml all/ec-lost-unfound.yaml fs/xfs.yaml msgr/async.yaml msgr-failures/many.yaml}

http://pulpito.ceph.com/yuriw-2016-06-20_09:29:55-rados-master_2016_6_20-distro-basic-smithi/267205/:
rados/singleton/{rados.yaml all/ec-lost-unfound-upgrade.yaml fs/xfs.yaml msgr/simple.yaml msgr-failures/many.yaml}

http://pulpito.ceph.com/yuriw-2016-06-20_09:29:55-rados-master_2016_6_20-distro-basic-smithi/267215/:
rados/singleton/{rados.yaml all/ec-lost-unfound.yaml fs/xfs.yaml msgr/async.yaml msgr-failures/few.yaml}

http://pulpito.ceph.com/haomai-2016-06-22_23:47:00-rados-wip-haomai-testing-distro-basic-smithi/272023/:
rados/singleton/{rados.yaml all/ec-lost-unfound-upgrade.yaml fs/xfs.yaml msgr/async.yaml msgr-failures/few.yaml}

http://pulpito.ceph.com/haomai-2016-06-22_23:47:00-rados-wip-haomai-testing-distro-basic-smithi/272029/:
rados/singleton/{rados.yaml all/ec-lost-unfound.yaml fs/xfs.yaml msgr/random.yaml msgr-failures/many.yaml}

The interesting thing is that all of these are caused by ec-lost-unfound, and they cover the simple, random, and async ms types. Another thing: I think the upgrade jobs shouldn't apply async, since the async msgr is problematic in the older version (see the sketch below).
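A minimal sketch of pinning the messenger for upgrade jobs, assuming the same overrides mechanism the msgr/*.yaml facets use; the fragment's file name and placement in the suite are hypothetical:

    # Hypothetical extra fragment forcing the simple messenger so the
    # infernalis half of an upgrade job never runs the async msgr.
    cat > msgr-pin.yaml <<'EOF'
    overrides:
      ceph:
        conf:
          global:
            ms type: simple
    EOF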

Actions #9

Updated by Haomai Wang almost 8 years ago

  • ceph-qa-suite rados added
  • ceph-qa-suite deleted (ceph-deploy)
Actions #10

Updated by Samuel Just almost 8 years ago

c841ac0e84b63e50ce5fc31441800ad5a39bc5a0 changed the default min_size for EC pools to k+1. We need to adjust the test accordingly (see the sketch below).
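A minimal sketch of the kind of adjustment, assuming a k=2,m=2 EC pool named "ecpool"; both the profile values and the pool name are placeholders, not taken from the test:

    # With the new default, min_size = k+1 = 3, so the pool cannot go
    # active after losing m shards; the lost-unfound test can lower it
    # back to k so recovery of the unfound objects can proceed.
    ceph osd pool set ecpool min_size 2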

Actions #11

Updated by Samuel Just almost 8 years ago

  • Status changed from New to 12
  • Assignee changed from Haomai Wang to Samuel Just
Actions #12

Updated by Samuel Just almost 8 years ago

  • Subject changed from ec-lost-unfound hang after restart, possibly due to async messenger fault injection to ec-lost-unfound hang due to min_size default change
  • Status changed from 12 to 7
Actions #13

Updated by Samuel Just almost 8 years ago

  • Status changed from 7 to Resolved