Bug #21262

cephfs ec data pool, many osds marked down

Added by Yong Wang about 2 years ago. Updated over 1 year ago.

Status:
Need More Info
Priority:
Normal
Assignee:
-
Category:
-
Target version:
Start date:
09/06/2017
Due date:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:

Description

CephFS with an EC data pool: many OSDs are marked down.
Slow requests appear, throttle gets block for a long time, op handling is blocked, etc.

1.png View (10.9 KB) Yong Wang, 09/06/2017 02:15 PM

2.png View (67.8 KB) Yong Wang, 09/06/2017 02:15 PM

3.png View (83.3 KB) Yong Wang, 09/06/2017 02:15 PM

gsk2.gstack (44.7 KB) Yong Wang, 09/06/2017 02:16 PM

History

#1 Updated by Yong Wang about 2 years ago

Related errors:
ceph-osd.22.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/osd/ECUtil.cc: 59: FAILED assert(i->second.length() == total_data_size)

ceph-osd.16.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/os/bluestore/BlueStore.cc: 9282: FAILED assert(0 == "unexpected error")

ceph-osd.48.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/osd/PG.h: 467: FAILED assert(i->second.need == j->second.need)

ceph-osd.45.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/os/bluestore/BlueStore.cc: 11537: FAILED assert(p.second->shared_blob_set.empty())

ceph-osd.44.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/osd/OSD.cc: 4171: FAILED assert(p.same_interval_since)

ceph-osd.43.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/os/bluestore/BlueStore.cc: 9282: FAILED assert(0 == "unexpected error")

ceph-osd.28.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/common/HeartbeatMap.cc: 84: FAILED assert(0 == "hit suicide timeout")

ceph-osd.74.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/osd/PGLog.h: 510: FAILED assert(head.version == 0 || e.version.version > head.version)

ceph-osd.58.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/osd/PGLog.h: 1332: FAILED assert(last_e.version.version < e.version.version)
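A per-OSD list like the one above can be pulled out of the logs with a short script. A minimal sketch (the `/var/log/ceph` default log path and the `failed_asserts` helper name are assumptions for illustration, not part of this report):

```python
# Sketch: collect unique "FAILED assert" lines from the OSD logs.
import glob


def failed_asserts(pattern="/var/log/ceph/ceph-osd.*.log"):
    """Return sorted, unique (log file, line) pairs containing a failed assert."""
    hits = set()
    for path in glob.glob(pattern):
        with open(path, errors="replace") as f:
            for line in f:
                if "FAILED assert" in line:
                    hits.add((path, line.strip()))
    return sorted(hits)
```

Deduplicating with a set keeps repeated crash loops from flooding the output, since an OSD that restarts and hits the same assert logs the identical line each time.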

#2 Updated by Josh Durgin about 2 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (129)

You're hitting a variety of issues there - some suggesting on-disk corruption, the unexpected error indicating a likely bad disk, and the throttling limits being hit. What is the history of this cluster - were there any power outages or node reboots? Upgrades from a dev release?

#3 Updated by Sage Weil about 2 years ago

  • Status changed from New to Need More Info

#4 Updated by Yong Wang about 2 years ago

Yes, the logs cover more than one issue. In total the issues are as follows:

1. Slow requests; OSDs marked down; OSD op threads hit the suicide timeout, causing the assert. (`ceph osd perf` output seems OK. Is there another tool that can check for a slow disk?)
These disks are all SAS.

2. One node cannot execute the sync command; it blocks for a long time. From /proc/pid/task, the kernel VFS is blocked in wait_on_page_bit. Running sgdisk on a new disk blocks in the same call. mkfs.ext(2,3,4) works, but mkfs.xfs errors out.
After rebooting the node this clears up (dmesg and syslog show nothing helpful). My guess is that the OSDs writing to raw partitions (BlueStore OSD data) caused a kernel dirty-bio error. These disks are all NVMe.

3. This environment is a fresh install; before installing I uninstalled all previous Ceph RPMs. (12.2.0 RPMs from download.ceph.com)

4. When I restart ceph-osd.target there are no client ops, but throttle put fails and
throttle get blocks for a long time, which is very strange.
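On the slow-disk question in item 1: besides eyeballing `ceph osd perf`, its JSON output can be filtered for outliers. A minimal sketch assuming the Luminous-era `ceph osd perf --format=json` structure (an `osd_perf_infos` list with per-OSD `commit_latency_ms`/`apply_latency_ms`); the `slow_osds` helper name, the 100 ms threshold, and the sample numbers are illustrative, not from this report:

```python
# Sketch: flag OSDs whose commit or apply latency exceeds a threshold,
# given the JSON printed by `ceph osd perf --format=json` (Luminous layout).
import json


def slow_osds(perf_json, threshold_ms=100):
    """Return IDs of OSDs whose commit or apply latency exceeds threshold_ms."""
    data = json.loads(perf_json)
    slow = []
    for info in data.get("osd_perf_infos", []):
        stats = info.get("perf_stats", {})
        if (stats.get("commit_latency_ms", 0) > threshold_ms
                or stats.get("apply_latency_ms", 0) > threshold_ms):
            slow.append(info["id"])
    return slow


# Example with made-up numbers: osd.22 is slow, osd.16 is fine.
sample = json.dumps({"osd_perf_infos": [
    {"id": 22, "perf_stats": {"commit_latency_ms": 250, "apply_latency_ms": 10}},
    {"id": 16, "perf_stats": {"commit_latency_ms": 5, "apply_latency_ms": 5}},
]})
print(slow_osds(sample))  # → [22]
```

At the block-device level, iostat's await and %util columns (from the sysstat package) are another cross-check for a slow SAS spindle.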

#5 Updated by Jos Collin over 1 year ago

  • Assignee deleted (Jos Collin)

This looks like a Support Case rather than a Tracker Bug.
