Bug #21262
cephfs ec data pool, many osds marked down
Description
cephfs ec data pool, many osds marked down
Slow requests appear, gets are blocked, op handling is blocked, and so on.
Updated by Yong Wang over 6 years ago
Related errors:
ceph-osd.22.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/osd/ECUtil.cc: 59: FAILED assert(i->second.length() == total_data_size)
ceph-osd.16.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/os/bluestore/BlueStore.cc: 9282: FAILED assert(0 == "unexpected error")
ceph-osd.48.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/osd/PG.h: 467: FAILED assert(i->second.need == j->second.need)
ceph-osd.45.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/os/bluestore/BlueStore.cc: 11537: FAILED assert(p.second->shared_blob_set.empty())
ceph-osd.44.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/osd/OSD.cc: 4171: FAILED assert(p.same_interval_since)
ceph-osd.43.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/os/bluestore/BlueStore.cc: 9282: FAILED assert(0 == "unexpected error")
ceph-osd.28.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/common/HeartbeatMap.cc: 84: FAILED assert(0 == "hit suicide timeout")
ceph-osd.74.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/osd/PGLog.h: 510: FAILED assert(head.version == 0 || e.version.version > head.version)
ceph-osd.58.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/osd/PGLog.h: 1332: FAILED assert(last_e.version.version < e.version.version)
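A list like the one above is easier to triage when grouped by source location, since several OSDs can hit the same assert. The following is a minimal sketch, not part of the original report: it assumes grep-style input lines of the form `ceph-osd.<id>.log:<build path>/src/<file>: <line>: FAILED assert(<expr>)`, and the names `ASSERT_RE` and `summarize_asserts` are hypothetical.

```python
import re
from collections import defaultdict

# Assumed input format (as produced by grepping "FAILED assert" across OSD logs):
#   ceph-osd.<id>.log:<build path>/src/<file>: <line>: FAILED assert(<expr>)
ASSERT_RE = re.compile(
    r"^ceph-osd\.(?P<osd>\d+)\.log:.*?/(?P<src>src/\S+): "
    r"(?P<line>\d+): FAILED assert\((?P<expr>.*)\)\s*$"
)


def summarize_asserts(lines):
    """Group failed asserts by (source file, line); values are lists of OSD ids."""
    groups = defaultdict(list)
    for text in lines:
        m = ASSERT_RE.match(text)
        if m:  # silently skip lines that do not look like assert failures
            groups[(m["src"], int(m["line"]))].append(int(m["osd"]))
    return dict(groups)
```

Applied to the nine lines above, this would show, for example, that OSDs 16 and 43 both failed at the same BlueStore.cc location, while the remaining asserts are each hit by a single OSD.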
Updated by Josh Durgin over 6 years ago
- Project changed from Ceph to RADOS
- Category deleted (129)
You're hitting a variety of issues there - some suggesting on-disk corruption, the unexpected error indicating a likely bad disk, and the throttling limits being hit. What is the history of this cluster - were there any power outages or node reboots? Upgrades from a dev release?
Updated by Yong Wang over 6 years ago
Yes, the logs are not about just one issue. In total, the issues are as follows:
1. Slow requests, OSDs marked down, and an OSD op suicide timeout that caused an assert. (The ceph osd perf output seems OK. Is there another tool that can check for slow disks?)
Those disks are all SAS.
2. One node cannot execute the sync command; it blocks for a long time. According to /proc/pid/task, the kernel VFS is blocked in wait_on_page_bit. sgdisk on a new disk blocks on the same call. mkfs.ext(2,3,4) is OK, but mkfs.xfs fails.
After rebooting the node, this works again. (I found nothing helpful in dmesg or syslog.) I guess the OSDs working on raw partitions (BlueStore OSD data) caused a kernel dirty bio error. Those disks are all NVMe.
3. This environment is a new installation; before it, I uninstalled all previous Ceph version RPMs. (download.ceph.com/ 12.2.0 prefix rpms)
4. When I restart ceph-osd.target, there are no client ops, but throttle put failed.
Throttle get is blocked for a long time, which is very strange.
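The check described in point 2 (looking under /proc/pid/task for threads stuck in wait_on_page_bit) can be sketched roughly as below. This is an illustration under assumptions, not the procedure actually used in the report: it reads the per-thread `wchan` file, which on some kernels or without sufficient privileges only shows `0`, and the function name `blocked_tasks` and its `proc_root` parameter are hypothetical.

```python
import os


def blocked_tasks(pid, symbol="wait_on_page_bit", proc_root="/proc"):
    """Return the thread ids of `pid` whose wchan mentions `symbol`."""
    task_dir = os.path.join(proc_root, str(pid), "task")
    hits = []
    for tid in sorted(os.listdir(task_dir), key=int):
        try:
            with open(os.path.join(task_dir, tid, "wchan")) as f:
                wchan = f.read().strip()
        except OSError:
            continue  # thread exited meanwhile, or the file is not readable
        if symbol in wchan:
            hits.append(int(tid))
    return hits
```

For example, `blocked_tasks(1234)` would list the threads of the (hypothetical) process 1234 that are currently waiting in that kernel function; an empty list means none were found at the moment of sampling.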
Updated by Jos Collin over 6 years ago
- Assignee deleted (Jos Collin)
This looks like a Support Case rather than a Tracker Bug.