Bug #21262

cephfs ec data pool, many osds marked down

Added by Yong Wang about 2 years ago. Updated over 1 year ago.

Status:
Need More Info
Priority:
Normal
Assignee:
-
Category:
-
Target version:
Start date:
09/06/2017
Due date:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:

Description

CephFS with an EC data pool: many OSDs are marked down.
Slow requests appear, throttle gets block for a long time, op handling is blocked, etc.

1.png View (10.9 KB) Yong Wang, 09/06/2017 02:15 PM

2.png View (67.8 KB) Yong Wang, 09/06/2017 02:15 PM

3.png View (83.3 KB) Yong Wang, 09/06/2017 02:15 PM

gsk2.gstack (44.7 KB) Yong Wang, 09/06/2017 02:16 PM

History

#1 Updated by Yong Wang about 2 years ago

Related errors:
ceph-osd.22.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/osd/ECUtil.cc: 59: FAILED assert(i->second.length() == total_data_size)

ceph-osd.16.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/os/bluestore/BlueStore.cc: 9282: FAILED assert(0 == "unexpected error")

ceph-osd.48.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/osd/PG.h: 467: FAILED assert(i->second.need == j->second.need)

ceph-osd.45.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/os/bluestore/BlueStore.cc: 11537: FAILED assert(p.second->shared_blob_set.empty())

ceph-osd.44.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/osd/OSD.cc: 4171: FAILED assert(p.same_interval_since)

ceph-osd.43.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/os/bluestore/BlueStore.cc: 9282: FAILED assert(0 == "unexpected error")

ceph-osd.28.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/common/HeartbeatMap.cc: 84: FAILED assert(0 == "hit suicide timeout")

ceph-osd.74.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/osd/PGLog.h: 510: FAILED assert(head.version == 0 || e.version.version > head.version)

ceph-osd.58.log:/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/osd/PGLog.h: 1332: FAILED assert(last_e.version.version < e.version.version)
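A per-OSD list like the one above can be pulled out of the logs with a short script. A minimal sketch (the `/var/log/ceph` default log path and the `failed_asserts` helper name are assumptions for illustration, not part of this report):

```python
# Sketch: collect unique "FAILED assert" lines from the OSD logs.
import glob


def failed_asserts(pattern="/var/log/ceph/ceph-osd.*.log"):
    """Return sorted, unique (log file, line) pairs containing a failed assert."""
    hits = set()
    for path in glob.glob(pattern):
        with open(path, errors="replace") as f:
            for line in f:
                if "FAILED assert" in line:
                    hits.add((path, line.strip()))
    return sorted(hits)
```

Deduplicating with a set keeps repeated crash loops from flooding the output, since an OSD that restarts and hits the same assert logs the identical line each time.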

#2 Updated by Josh Durgin about 2 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (129)

You're hitting a variety of issues there - some suggesting on-disk corruption, the unexpected error indicating a likely bad disk, and the throttling limits being hit. What is the history of this cluster - were there any power outages or node reboots? Upgrades from a dev release?

#3 Updated by Sage Weil about 2 years ago

  • Status changed from New to Need More Info

#4 Updated by Yong Wang about 2 years ago

Yes, the logs cover more than one issue. In total the issues are as follows:

1. Slow requests; OSDs marked down; OSD op threads hit the suicide timeout, causing the assert. (`ceph osd perf` output seems OK. Is there another tool that can check for a slow disk?)
These disks are all SAS.

2. One node cannot execute the sync command; it blocks for a long time. From /proc/pid/task, the kernel VFS is blocked in wait_on_page_bit. Running sgdisk on a new disk blocks in the same call. mkfs.ext(2,3,4) works, but mkfs.xfs errors out.
After rebooting the node this clears up (dmesg and syslog show nothing helpful). My guess is that the OSDs writing to raw partitions (BlueStore OSD data) caused a kernel dirty-bio error. These disks are all NVMe.

3. This environment is a fresh install; before installing I uninstalled all previous Ceph RPMs. (12.2.0 RPMs from download.ceph.com)

4. When I restart ceph-osd.target there are no client ops, but throttle put fails and
throttle get blocks for a long time, which is very strange.
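On the slow-disk question in item 1: besides eyeballing `ceph osd perf`, its JSON output can be filtered for outliers. A minimal sketch assuming the Luminous-era `ceph osd perf --format=json` structure (an `osd_perf_infos` list with per-OSD `commit_latency_ms`/`apply_latency_ms`); the `slow_osds` helper name, the 100 ms threshold, and the sample numbers are illustrative, not from this report:

```python
# Sketch: flag OSDs whose commit or apply latency exceeds a threshold,
# given the JSON printed by `ceph osd perf --format=json` (Luminous layout).
import json


def slow_osds(perf_json, threshold_ms=100):
    """Return IDs of OSDs whose commit or apply latency exceeds threshold_ms."""
    data = json.loads(perf_json)
    slow = []
    for info in data.get("osd_perf_infos", []):
        stats = info.get("perf_stats", {})
        if (stats.get("commit_latency_ms", 0) > threshold_ms
                or stats.get("apply_latency_ms", 0) > threshold_ms):
            slow.append(info["id"])
    return slow


# Example with made-up numbers: osd.22 is slow, osd.16 is fine.
sample = json.dumps({"osd_perf_infos": [
    {"id": 22, "perf_stats": {"commit_latency_ms": 250, "apply_latency_ms": 10}},
    {"id": 16, "perf_stats": {"commit_latency_ms": 5, "apply_latency_ms": 5}},
]})
print(slow_osds(sample))  # → [22]
```

At the block-device level, iostat's await and %util columns (from the sysstat package) are another cross-check for a slow SAS spindle.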

#5 Updated by Jos Collin over 1 year ago

  • Assignee deleted (Jos Collin)

This looks like a Support Case rather than a Tracker Bug.
