Ceph RADOS - Bug #21262: cephfs ec data pool, many osds marked down
https://tracker.ceph.com/issues/21262

----
2017-09-06T14:26:30Z - Yong Wang <314676762@qq.com>
https://tracker.ceph.com/issues/21262?journal_id=98391
Related errors:

ceph-osd.22.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/osd/ECUtil.cc: 59: FAILED assert(i->second.length() == total_data_size)
ceph-osd.16.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/os/bluestore/BlueStore.cc: 9282: FAILED assert(0 == "unexpected error")

ceph-osd.48.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/osd/PG.h: 467: FAILED assert(i->second.need == j->second.need)

ceph-osd.45.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/os/bluestore/BlueStore.cc: 11537: FAILED assert(p.second->shared_blob_set.empty())

ceph-osd.44.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/osd/OSD.cc: 4171: FAILED assert(p.same_interval_since)

ceph-osd.43.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/common/HeartbeatMap.cc: 84: FAILED assert(0 == "hit suicide timeout")

ceph-osd.74.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/osd/PGLog.h: 510: FAILED assert(head.version == 0 || e.version.version > head.version)

ceph-osd.58.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/osd/PGLog.h: 1332: FAILED assert(last_e.version.version < e.version.version)
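Every signature shares the same Jenkins build prefix, so only the source file, line, and failed expression differ. A minimal sketch for tallying the distinct assert signatures across one node's OSD logs, assuming the default /var/log/ceph log directory:

    # Count distinct FAILED assert signatures across this node's OSD logs.
    # Assumes the default log directory; adjust the glob for your layout.
    grep -h "FAILED assert" /var/log/ceph/ceph-osd.*.log \
      | sed 's|.*/src/|src/|' \
      | sort | uniq -c | sort -rn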
----
2017-09-06T15:44:48Z - Josh Durgin
https://tracker.ceph.com/issues/21262?journal_id=98406

Project changed from Ceph to RADOS
Category deleted (129)

You're hitting a variety of issues there: some suggest on-disk corruption, the "unexpected error" indicates a likely bad disk, and throttling limits are being hit. What is the history of this cluster? Were there any power outages or node reboots? Upgrades from a dev release?
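A hedged sketch for checking the bad-disk theory on an affected node; /dev/sdX is a placeholder for the device backing a crashing OSD:

    # Overall SMART health verdict, plus the counters that usually betray a failing disk
    smartctl -H /dev/sdX
    smartctl -A /dev/sdX | grep -Ei 'realloc|pending|uncorrect'
    # Kernel-level I/O errors, which BlueStore can surface as "unexpected error"
    dmesg | grep -Ei 'i/o error|medium error|blk_update_request'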
----
2017-09-06T16:35:12Z - Sage Weil <sage@newdream.net>
https://tracker.ceph.com/issues/21262?journal_id=98415

Status changed from New to Need More Info

----
2017-09-07T01:31:52Z - Yong Wang <314676762@qq.com>
https://tracker.ceph.com/issues/21262?journal_id=98457
Yes, the logs are not about only one issue. In total the issues are as below:
1. Slow requests, OSDs marked down, and OSD op-thread suicide caused the asserts. (The ceph osd perf output seems OK. Is there another tool that can check for a slow disk? See the first sketch after this list.) These disks are all SAS.
2. One node cannot execute the sync command; it blocks for a long time. From /proc/<pid>/task, the kernel VFS is blocked in wait_on_page_bit. sgdisk on a new disk blocks on the same call; mkfs.ext2/3/4 is OK, but mkfs.xfs errors out. After rebooting the node this goes away (dmesg and syslog contain no helpful information). My guess is that the OSDs working on raw partitions (BlueStore OSD data) caused a kernel dirty-bio error. These disks are all NVMe. (See the second sketch after this list for inspecting the blocked tasks.)
3. This environment is a fresh install; before it, I uninstalled the RPMs of any previous Ceph version. (12.2.0 RPMs from download.ceph.com.)
4. When I restart ceph-osd.target there are no client ops, but the throttle put fails and the throttle get blocks for a long time, which is very strange. (See the third sketch below for dumping the throttle counters.)
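For point 1, a sketch of two ways to look for a slow disk beyond ceph osd perf; osd.22 is a placeholder for any OSD that logged slow requests, and the commands run on the node hosting it:

    # Per-device latency and utilization: watch the await and %util columns
    iostat -x 1 5
    # The slowest recent ops on one OSD, with per-stage timestamps showing
    # where the time is spent (needs access to the OSD's admin socket)
    ceph daemon osd.22 dump_historic_ops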
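For point 2, a sketch for inspecting tasks stuck in uninterruptible (D) sleep, which is where the blocked sync and sgdisk processes should show up; <pid> is a placeholder and root is required:

    # List D-state tasks together with the kernel symbol they are sleeping in
    ps -eo pid,state,wchan:32,cmd | awk '$2 == "D"'
    # Kernel stack of one blocked task
    cat /proc/<pid>/stack
    # Ask the kernel to dump all blocked tasks into the kernel log
    echo w > /proc/sysrq-trigger
    dmesg | tail -n 100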
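For point 4, a sketch for dumping one OSD's throttle counters to see which limit the blocked gets are waiting on; osd.0 is a placeholder and jq is assumed to be installed:

    # All throttle-* perf counters; a growing "wait" count under a throttle
    # (e.g. throttle-osd_client_bytes) means ops are blocking on that limit
    ceph daemon osd.0 perf dump | jq 'with_entries(select(.key | startswith("throttle-")))'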
----
2017-12-22T13:11:31Z - Jos Collin
https://tracker.ceph.com/issues/21262?journal_id=104065

Assignee deleted (Jos Collin)

This looks like a Support Case rather than a Tracker Bug.