Bug #16416


ec-lost-unfound hang due to min_size default change

Added by Samuel Just almost 8 years ago. Updated almost 8 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

sjust@teuthology:/a/samuelj-2016-06-21_10:29:56-rados-wip-sam-testing-distro-basic-smithi/268677

Actions #1

Updated by Samuel Just almost 8 years ago

  • Assignee set to Haomai Wang

Also, sjust@teuthology:/a/samuelj-2016-06-21_10:29:56-rados-wip-sam-testing-distro-basic-smithi/268688

Both are hangs, so we don't have logs. You'll have to reproduce, but at least it seems to happen semi-reliably.
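A sketch of one way to reschedule just the failing facets for reproduction, assuming the usual teuthology-suite flags; the branch and machine type here are placeholders, not values from this ticket:

    # Reschedule only the ec-lost-unfound jobs from the singleton suite.
    teuthology-suite -s rados/singleton \
        -c wip-branch -m smithi \
        --filter ec-lost-unfound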

Actions #2

Updated by Samuel Just almost 8 years ago

sjust@teuthology:/a/samuelj-2016-06-21_10:29:56-rados-wip-sam-testing-distro-basic-smithi/268946

Actions #3

Updated by Samuel Just almost 8 years ago

sjust@teuthology:/a/samuelj-2016-06-21_10:29:56-rados-wip-sam-testing-distro-basic-smithi/268958

Actions #4

Updated by Samuel Just almost 8 years ago

sjust@teuthology:/a/yuriw-2016-06-20_09:29:55-rados-master_2016_6_20-distro-basic-smithi/267205

Actions #5

Updated by Samuel Just almost 8 years ago

sjust@teuthology:/a/yuriw-2016-06-20_09:29:55-rados-master_2016_6_20-distro-basic-smithi/267215

Actions #7

Updated by Haomai Wang almost 8 years ago

http://pulpito.ceph.com/haomai-2016-06-22_23:47:00-rados-wip-haomai-testing-distro-basic-smithi/272023/:
this job runs the ec-lost-unfound task and hangs too. From the job file, it first installs infernalis and then upgrades. Could the cause be that the async msgr is also used on infernalis, since the async msgr is buggy in that version?
2016-06-23 05:28:46.760003 osd.3 172.21.15.22:6800/31612 34 : cluster [WRN] failed to encode map e23 with expected crc
2016-06-23 05:28:47.728869 osd.3 172.21.15.22:6800/31612 35 : cluster [WRN] failed to encode map e24 with expected crc
2016-06-23 05:28:47.730454 osd.3 172.21.15.22:6800/31612 36 : cluster [WRN] failed to encode map e24 with expected crc
2016-06-23 05:28:47.732226 osd.3 172.21.15.22:6800/31612 37 : cluster [WRN] failed to encode map e24 with expected crc
2016-06-23 05:28:56.631793 osd.3 172.21.15.22:6800/31612 38 : cluster [WRN] failed to encode map e25 with expected crc
2016-06-23 05:28:57.801069 osd.3 172.21.15.22:6800/31612 39 : cluster [WRN] failed to encode map e27 with expected crc
2016-06-23 05:28:59.162718 osd.3 172.21.15.22:6800/31612 40 : cluster [WRN] failed to encode map e28 with expected crc
2016-06-23 05:28:59.164540 osd.3 172.21.15.22:6800/31612 41 : cluster [WRN] failed to encode map e28 with expected crc
2016-06-23 05:28:59.166803 osd.3 172.21.15.22:6800/31612 42 : cluster [WRN] failed to encode map e28 with expected crc
2016-06-23 05:28:59.212481 osd.3 172.21.15.22:6800/31612 43 : cluster [WRN] failed to encode map e28 with expected crc

Then from osd.3.log:
2016-06-23 05:29:15.579763 7f9874f8f700 2 osd.3 28 got incremental 29 but failed to encode full with correct crc; requesting
2016-06-23 05:29:15.579779 7f9874f8f700 0 log_channel(cluster) log [WRN] : failed to encode map e29 with expected crc
2016-06-23 05:29:15.579784 7f9874f8f700 20 osd.3 28 my encoded map was:
0000 : 08 07 46 13 00 00 03 01 86 0b 00 00 6e d8 6c 7c : ..F.........n.l|

And the cluster status at the time:

health HEALTH_ERR
18 pgs are stuck inactive for more than 300 seconds
12 pgs degraded
15 pgs incomplete
3 pgs stale
15 pgs stuck inactive
3 pgs stuck stale
12 pgs stuck unclean
12 pgs undersized
13 requests are blocked > 32 sec
1/3 in osds are down
monmap e1: 3 mons at {a=172.21.15.49:6789/0,b=172.21.15.49:6790/0,c=172.21.15.49:6791/0}
election epoch 12, quorum 0,1,2 a,b,c
osdmap e34: 4 osds: 2 up, 3 in; 27 remapped pgs
flags sortbitwise
pgmap v246: 32 pgs, 2 pools, 26160 bytes data, 10 objects
400 MB used, 278 GB / 279 GB avail
15 incomplete
12 active+undersized+degraded
3 stale+active+clean
2 active+clean

What is strange to me is that osd.3 and osd.0 disappeared without any shutdown/coredump log.
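As a side note on the "failed to encode map ... with expected crc" warnings above, a sketch for inspecting the mismatching epoch with the standard ceph CLI tools; epoch 29 is taken from the osd.3 log, and the output path is a placeholder:

    # Fetch the full map the monitors hold for the epoch the OSD failed
    # to re-encode, then dump it for comparison with the OSD's copy.
    ceph osd getmap 29 -o /tmp/osdmap.29
    osdmaptool --print /tmp/osdmap.29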

Actions #8

Updated by Haomai Wang almost 8 years ago

  • ceph-qa-suite ceph-deploy added

http://pulpito.ceph.com/samuelj-2016-06-21_10:29:56-rados-wip-sam-testing-distro-basic-smithi/268677/:
rados/singleton/{rados.yaml all/ec-lost-unfound-upgrade.yaml fs/xfs.yaml msgr/random.yaml msgr-failures/many.yaml}

http://pulpito.ceph.com/samuelj-2016-06-21_10:29:56-rados-wip-sam-testing-distro-basic-smithi/268688/:
rados/singleton/{rados.yaml all/ec-lost-unfound.yaml fs/xfs.yaml msgr/simple.yaml msgr-failures/few.yaml}

http://pulpito.ceph.com/samuelj-2016-06-21_10:29:56-rados-wip-sam-testing-distro-basic-smithi/268946/:
rados/singleton/{rados.yaml all/ec-lost-unfound-upgrade.yaml fs/xfs.yaml msgr/simple.yaml msgr-failures/few.yaml}

http://pulpito.ceph.com/samuelj-2016-06-21_10:29:56-rados-wip-sam-testing-distro-basic-smithi/268958/:
rados/singleton/{rados.yaml all/ec-lost-unfound.yaml fs/xfs.yaml msgr/async.yaml msgr-failures/many.yaml}

http://pulpito.ceph.com/yuriw-2016-06-20_09:29:55-rados-master_2016_6_20-distro-basic-smithi/267205/:
rados/singleton/{rados.yaml all/ec-lost-unfound-upgrade.yaml fs/xfs.yaml msgr/simple.yaml msgr-failures/many.yaml}

http://pulpito.ceph.com/yuriw-2016-06-20_09:29:55-rados-master_2016_6_20-distro-basic-smithi/267215/:
rados/singleton/{rados.yaml all/ec-lost-unfound.yaml fs/xfs.yaml msgr/async.yaml msgr-failures/few.yaml}

http://pulpito.ceph.com/haomai-2016-06-22_23:47:00-rados-wip-haomai-testing-distro-basic-smithi/272023/:
rados/singleton/{rados.yaml all/ec-lost-unfound-upgrade.yaml fs/xfs.yaml msgr/async.yaml msgr-failures/few.yaml}

http://pulpito.ceph.com/haomai-2016-06-22_23:47:00-rados-wip-haomai-testing-distro-basic-smithi/272029/:
rados/singleton/{rados.yaml all/ec-lost-unfound.yaml fs/xfs.yaml msgr/random.yaml msgr-failures/many.yaml}

The interesting thing is that all of these are caused by ec-lost-unfound, and they cover the simple, random, and async ms types. Another thing: I think the upgrade jobs shouldn't apply async, since the async msgr is problematic in the older version (see the sketch below).
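A minimal sketch of pinning the messenger for upgrade jobs, assuming the same overrides mechanism the msgr/*.yaml facets use; the fragment's file name and placement in the suite are hypothetical:

    # Hypothetical extra fragment forcing the simple messenger so the
    # infernalis half of an upgrade job never runs the async msgr.
    cat > msgr-pin.yaml <<'EOF'
    overrides:
      ceph:
        conf:
          global:
            ms type: simple
    EOF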

Actions #9

Updated by Haomai Wang almost 8 years ago

  • ceph-qa-suite rados added
  • ceph-qa-suite deleted (ceph-deploy)
Actions #10

Updated by Samuel Just almost 8 years ago

c841ac0e84b63e50ce5fc31441800ad5a39bc5a0 changed the default min_size for EC pools to k+1. We need to adjust the test accordingly (see the sketch below).
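A minimal sketch of the kind of adjustment, assuming a k=2,m=2 EC pool named "ecpool"; both the profile values and the pool name are placeholders, not taken from the test:

    # With the new default, min_size = k+1 = 3, so the pool cannot go
    # active after losing m shards; the lost-unfound test can lower it
    # back to k so recovery of the unfound objects can proceed.
    ceph osd pool set ecpool min_size 2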

Actions #11

Updated by Samuel Just almost 8 years ago

  • Status changed from New to 12
  • Assignee changed from Haomai Wang to Samuel Just
Actions #12

Updated by Samuel Just almost 8 years ago

  • Subject changed from ec-lost-unfound hang after restart, possibly due to async messenger fault injection to ec-lost-unfound hang due to min_size default change
  • Status changed from 12 to 7
Actions #13

Updated by Samuel Just almost 8 years ago

  • Status changed from 7 to Resolved