Bug #21262

open

cephfs ec data pool, many osds marked down

Added by Yong Wang over 6 years ago. Updated over 6 years ago.

Status:
Need More Info
Priority:
Normal
Assignee:
-
Category:
-
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

cephfs ec data pool, many osds marked down
Slow requests; client I/O flow is blocked, op processing is blocked, and so on.
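
As a sketch, the blocked requests can be inspected with the standard admin commands (osd.22 below is only an example id; run them against whichever OSDs report slow requests):

  ceph health detail                      # lists OSDs with slow/blocked requests
  ceph daemon osd.22 dump_ops_in_flight   # ops currently being processed by this OSD
  ceph daemon osd.22 dump_blocked_ops     # ops blocked past the complaint threshold
  ceph daemon osd.22 dump_historic_ops    # recently completed slow ops with durations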


Files

1.png (10.9 KB) Yong Wang, 09/06/2017 02:15 PM
2.png (67.8 KB) Yong Wang, 09/06/2017 02:15 PM
3.png (83.3 KB) Yong Wang, 09/06/2017 02:15 PM
gsk2.gstack (44.7 KB) Yong Wang, 09/06/2017 02:16 PM
Actions #1

Updated by Yong Wang over 6 years ago

Related errors:
ceph-osd.22.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/osd/ECUtil.cc: 59: FAILED assert(i->second.length() == total_data_size)

ceph-osd.16.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/os/bluestore/BlueStore.cc: 9282: FAILED assert(0 == "unexpected error")

ceph-osd.48.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/osd/PG.h: 467: FAILED assert(i->second.need == j->second.need)

ceph-osd.45.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/os/bluestore/BlueStore.cc: 11537: FAILED assert(p.second->shared_blob_set.empty())

ceph-osd.44.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/osd/OSD.cc: 4171: FAILED assert(p.same_interval_since)

ceph-osd.43.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/os/bluestore/BlueStore.cc: 9282: FAILED assert(0 == "unexpected error")

ceph-osd.28.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/common/HeartbeatMap.cc: 84: FAILED assert(0 == "hit suicide timeout")

ceph-osd.74.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/osd/PGLog.h: 510: FAILED assert(head.version == 0 || e.version.version > head.version)

ceph-osd.58.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/osd/PGLog.h: 1332: FAILED assert(last_e.version.version < e.version.version)
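
For reference, a summary like the one above can be pulled from the OSD logs with a simple grep (assuming the default /var/log/ceph log directory):

  cd /var/log/ceph && grep "FAILED assert" ceph-osd.*.log | sort -u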

Actions #2

Updated by Josh Durgin over 6 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (129)

You're hitting a variety of issues there - some suggesting on-disk corruption, the unexpected error indicating a likely bad disk, and the throttling limits being hit. What is the history of this cluster - were there any power outages or node reboots? Upgrades from a dev release?

Actions #3

Updated by Sage Weil over 6 years ago

  • Status changed from New to Need More Info
Actions #4

Updated by Yong Wang over 6 years ago

Yes, the logs are not about just one issue. The full list of issues is as follows:

1. Slow requests, OSDs marked down, and op thread suicide timeouts causing asserts. (The ceph osd perf output looks OK. Is there another tool that can check for a slow disk? See the command sketch after this list.)
These disks are all SAS.

2. One node cannot run the sync command; it blocks for a long time. According to /proc/<pid>/task, the kernel VFS is blocked in wait_on_page_bit. Running sgdisk on a new disk blocks on the same call; mkfs.ext2/3/4 works, but mkfs.xfs fails.
After rebooting the node everything is fine again (dmesg and syslog contain nothing helpful). My guess is that the OSDs writing to raw partitions (BlueStore OSD data) caused a kernel dirty-bio error. These disks are all NVMe.

3. This environment is a fresh install; before installing I removed all RPMs from any previous Ceph version. (The packages are the 12.2.0 RPMs from download.ceph.com.)

4. When I restart ceph-osd.target there are no client ops, but throttle put fails and throttle get blocks for a long time, which is very strange.
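
As a sketch of what could be looked at beyond ceph osd perf (osd.22 and /dev/sdb below are example names, not taken from this cluster): per-OSD latency counters from the admin socket, plus raw device latency and health from the OS:

  ceph daemon osd.22 perf dump   # detailed per-OSD counters, including commit/apply latency
  iostat -x 1                    # per-device await/%util, exposes a consistently slow disk
  smartctl -a /dev/sdb           # SMART health and error counters for the suspect disk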

Actions #5

Updated by Jos Collin over 6 years ago

  • Assignee deleted (Jos Collin)

This looks like a Support Case rather than a Tracker Bug.
