Bug #14711

OSD crashes with Suicide timeout due to timeout on FileStore::op_tp thread

Added by Michael Hackett about 8 years ago. Updated about 8 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
rbd
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We are doing migrations from Ubuntu 14.04 to RHEL 7.1.
- The first cluster was migrated successfully and is running RHEL-7.1 with RHCS-1.3.1.
- In the second cluster, all 5 mons and approximately 20 of 30 OSD nodes have been migrated successfully to RHEL-7.1; they run upstream Ceph Firefly 0.80.10 and will later be migrated to downstream RHCS-1.3.1.
- While migrating the 21st node from Ubuntu to RHEL-7.1, partprobe was run to bring the OSDs back into the cluster on RHEL-7.1.
- But osd.173 crashed with the assert:

FAILED assert(0 == "hit suicide timeout")
ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70)

2016-02-09 00:43:21.743032 7f68be401700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f68b2c48700' had timed out after 60
2016-02-09 00:43:21.743050 7f68be401700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f68b2c48700' had suicide timed out after 180
2016-02-09 00:43:21.747042 7f68be401700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f68be401700 time 2016-02-09 00:43:21.746189
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70)
1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2d9) [0x9cbec9]
2: (ceph::HeartbeatMap::is_healthy()+0xce) [0x9cc7ae]
3: (ceph::HeartbeatMap::check_touch_file()+0x2c) [0x9cce3c]
4: (CephContextServiceThread::entry()+0x15b) [0xa9f79b]
5: (()+0x7df5) [0x7f68c13dcdf5]
6: (clone()+0x6d) [0x7f68c00c71ad]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
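
For reference, the 60 and 180 second thresholds in the heartbeat messages above are the FileStore op thread timeout and suicide timeout. As a diagnostic measure only (it does not address whatever is stalling the op thread), these can be raised temporarily in ceph.conf on the OSD host; the values below are just an example, not a recommendation:

[osd]
filestore op thread timeout = 180
filestore op thread suicide timeout = 600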

- We have tried to bring the OSD up with the same steps listed here: https://access.redhat.com/solutions/1598223, which are working fine for the other OSD nodes:
- setting noout, norecover and nobackfill (commands sketched below)
- starting the OSD and waiting for it to come up and in
- and then resetting the flags
- But OSD 173 died again with the same assert: FAILED assert(0 == "hit suicide timeout")
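
For clarity, the flag handling in those steps corresponds to standard ceph CLI commands along these lines (a sketch; the service command may differ depending on how the OSD is managed on RHEL 7.1):

ceph osd set noout
ceph osd set norecover
ceph osd set nobackfill
service ceph start osd.173     # wait for osd.173 to be reported up and in
ceph osd unset nobackfill
ceph osd unset norecover
ceph osd unset noout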

Reviewed the logs and found many messages of the form "'FileStore::op_tp thread 0x7f163ca3f700' had timed out after 60".

- This error suggests that the FileStore thread is stuck for too long.
- We have verified that the osd.173 disk is healthy and does not show any errors at the hardware layer.
- We attempted to unmount the OSD partition and then activate the OSD again (sketched below).
- We also did this with debug_osd=20, debug_ms=1 and debug_filestore=20.
- But the OSD died again with the same assert: FAILED assert(0 == "hit suicide timeout")
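
The unmount/re-activate attempt would look roughly like the following (a sketch; the data partition name is hypothetical, and the debug settings were placed in the [osd] section of ceph.conf so they take effect at daemon startup):

umount /var/lib/ceph/osd/ceph-173
ceph-disk activate /dev/sdX1    # hypothetical data partition backing osd.173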

- We have many "error opening file /var/lib/ceph/osd/ceph-173/current/<pg id>/..." messages for two PGs, 40.2dba and 40.1ac.

2016-02-04 09:47:31.793641 7f36e10f27c0 10 filestore(/var/lib/ceph/osd/ceph-173) error opening file /var/lib/ceph/osd/ceph-173/current/40.1ac_head/DIR_C/DIR_A/DIR_1/DIR_0/rbd\udata.4f8dc427e84c87.000000000004533c__head_971201AC__28 with flags=2: (2) No such file or directory

2016-02-04 09:47:31.853424 7f36e10f27c0 10 filestore(/var/lib/ceph/osd/ceph-173) error opening file /var/lib/ceph/osd/ceph-173/current/40.2dba_head/DIR_A/DIR_B/DIR_D/DIR_2/rbd\udata.4f8dc427e84c87.000000000002558f__head_10422DBA__28 with flags=2: (2) No such file or directory
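
To confirm which OSDs are currently in the up and acting sets for these two PGs, the PG ids from the log above can be checked directly, e.g.:

ceph pg map 40.1ac
ceph pg map 40.2dba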

- 2016-02-04 09:47:45.979766 7f36e10f27c0 10 osd.173 pg_epoch: 34187 pg[45.36f( empty local-les=34073 n=0 ec=24137 les/c 34073/34073 34187/34187/34187) [173,265,285] r=0 lpr=0 pi=34072-34186/3 crt=0'0 mlcod 0'0 inactive] handle_loaded

- We have lots of "empty local-les=34073", "mlcod" and "inactive" messages for multiple PGs for which osd.173 is the primary OSD.

- 2016-02-04 09:47:46.011082 7f36e10f27c0 10 osd.173 pg_epoch: 34187 pg[45.47e( empty local-les=34073 n=0 ec=24137 les/c 34073/34073 34187/34187/34014) [145,71,173] r=2 lpr=0 pi=24137-34186/132 crt=0'0 inactive] handle_loaded

- And we have lots of "empty local-les=34073" and "inactive" messages, but no "mlcod", for multiple PGs for which osd.173 is not the primary (it is the secondary or tertiary).

- Verified that PG head directories for PGs 40.2dba and 40.1ac did reside on OSD 173 but had 0 file size. This OSD was not in the acting set for either PG. We attempted to remove the PGs via the objectstore tool (the PGs were exported to a backup location prior to removal), but the OSD still fails with the same assert even after removing the two PGs.
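
The export and removal were done along the lines of the following (a sketch; the backup path is illustrative, the OSD must be stopped first, and on this Firefly build the binary may be named ceph_objectstore_tool rather than ceph-objectstore-tool; 40.2dba was handled the same way):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-173 --journal-path /var/lib/ceph/osd/ceph-173/journal --pgid 40.1ac --op export --file /backup/osd173-40.1ac.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-173 --journal-path /var/lib/ceph/osd/ceph-173/journal --pgid 40.1ac --op remove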

- Logs from prior to the PG delete are located here: https://api.access.redhat.com/rs/cases/01578492/attachments/19be09fd-eb2a-4412-b726-2833b36e8a86
- Logs from OSD 173 AFTER the PG delete, with debug enabled, are located here: https://api.access.redhat.com/rs/cases/01578492/attachments/55003078-c090-401c-9328-af25022926ec

- This is currently holding up the migration to RHEL 7.1 on all Ceph nodes.

History

#1 Updated by Samuel Just about 8 years ago

This usually means the fs hung during a call. Please restart the osd with

debug osd = 20
debug filestore = 20
debug ms = 1

and post the log. That should tell us what happened.
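
Since the daemon dies shortly after startup, one way to apply those settings (a sketch, assuming the usual ceph.conf location on the OSD host) is to add them to the OSD's own section and restart, then collect /var/log/ceph/ceph-osd.173.log:

[osd.173]
debug osd = 20
debug filestore = 20
debug ms = 1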

#2 Updated by Samuel Just about 8 years ago

  • Status changed from New to Can't reproduce
