Bug #17781

All OSDs restart randomly on "hit suicide timeout" when scrubbing is activated

Added by Yoann Moulin almost 4 years ago. Updated over 3 years ago.

Status: Duplicate
Priority: Urgent
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source:
Tags: jewel
Backport:
Regression: No
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

Hello,

On my Ceph cluster running Jewel 10.2.2, all OSDs die randomly by hitting the suicide timeout as soon as scrubbing is enabled.

This behavior appeared a few minutes after I started pushing 30TB of data to an S3 bucket on an EC 8+2 pool. Previously, I had pushed 4TB to that bucket without any issue.

Here is the ceph-post-file ID for the logs: c86638df-a297-4f58-a337-0e570d4b8702

List of files:

cephprod_20161015_nodebug.log
cephprod_20161025_debug.log
cephprod-osd.0_20161025_debug.log
cephprod-osd.107_20161015_nodebug.log
cephprod-osd.131_20161015_nodebug.log
cephprod-osd.136_20161015_nodebug.log
cephprod-osd.24_20161015_nodebug.log
cephprod-osd.27_20161015_nodebug.log
cephprod-osd.37_20161015_nodebug.log
cephprod-osd.46_20161015_nodebug.log
cephprod-osd.64_20161015_nodebug.log
cephprod-osd.86_20161015_nodebug.log
cephprod-osd.90_20161025_debug.log
cephprod-osd.93_20161025_debug.log
cephprod-osd.95_20161015_nodebug.log
report.log

tag 20161015_nodebug: log files from when the behavior started, without debug enabled
tag 20161025_debug: log files with debug enabled, taken when I reactivated scrubbing
report.log: some information about the cluster

My previous mail to the ceph-users list about this: https://www.mail-archive.com/ceph-users@lists.ceph.com/msg33179.html

I can reproduce the behavior with more logs if needed. I just need to run "ceph osd set noscrub" and within 1 minute the ceph status switches to HEALTH_ERR.
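For readers following the reproduction step, here is a minimal sketch of toggling scrubbing and checking cluster health. The report quotes "ceph osd set noscrub"; since setting that flag disables scrubbing, the reproduction presumably relies on clearing it again with "unset". The function names and the CEPH variable are hypothetical helpers, not part of the original report.

```shell
#!/bin/sh
# CEPH is an illustrative override hook (assumption); it defaults to the real CLI.
CEPH="${CEPH:-ceph}"

# Clear the cluster-wide noscrub flag so OSDs may schedule scrubs again.
enable_scrub() { "$CEPH" osd unset noscrub; }

# Set the noscrub flag; no new scrubs are scheduled while it is set.
disable_scrub() { "$CEPH" osd set noscrub; }

# Print the cluster health summary (HEALTH_OK, HEALTH_WARN or HEALTH_ERR).
cluster_health() { "$CEPH" health; }
```

A reproduction run could then be `enable_scrub; sleep 60; cluster_health; disable_scrub`, with the 60 seconds taken from the "within 1 minute" observation above.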

Thanks for your help.

Yoann


Related issues

Duplicates Ceph - Bug #17859: filestore: can get stuck in an unbounded loop during scrub (Resolved, 11/10/2016)

History

#1 Updated by Sage Weil almost 4 years ago

  • Status changed from New to Need More Info

Can you repeat this test, but with debug osd = 20 and debug filestore = 20?

Just the log from an OSD that crashes should be sufficient. Thanks!
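One way to apply the requested levels without restarting the daemon is to inject them at runtime. The sketch below assumes the `ceph tell ... injectargs` mechanism is acceptable here; the persistent alternative is to put `debug osd = 20` and `debug filestore = 20` in the `[osd]` section of ceph.conf. The helper name `set_osd_debug` and the CEPH variable are hypothetical.

```shell
#!/bin/sh
# CEPH is an illustrative override hook (assumption); it defaults to the real CLI.
CEPH="${CEPH:-ceph}"

# Set the OSD and filestore debug levels on one running OSD.
#   $1: numeric OSD id (e.g. 86)
#   $2: debug level, defaulting to the 20 requested in this comment
set_osd_debug() {
    osd_id="$1"
    level="${2:-20}"
    "$CEPH" tell "osd.${osd_id}" injectargs "--debug-osd ${level} --debug-filestore ${level}"
}
```

For example, `set_osd_debug 86` would raise both levels to 20 on osd.86. Level 20 is very verbose, so it is usually worth lowering the levels again once the crash has been captured.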

#2 Updated by Yoann Moulin almost 4 years ago

Hello,

You can find more logs here : 8dfcc649-acfa-4f88-a4ee-583e6f1c577d

$ rgrep -c "hit suicide timeout" . | grep -v :0 | sort -r -n -k2 -t":"
./014/cephprod-osd.86.log:12
./024/cephprod-osd.131.log:8
./008/cephprod-osd.46.log:8
./008/cephprod-osd.26.log:8
./004/cephprod-osd.27.log:8
./002/cephprod-osd.9.log:8
./024/cephprod-osd.127.log:6
./020/cephprod-osd.93.log:6
./018/cephprod-osd.107.log:6
./016/cephprod-osd.77.log:6
./011/cephprod-osd.60.log:6
./010/cephprod-osd.32.log:6
./004/cephprod-osd.38.log:6
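The pipeline above can also be written as a small reusable function. This is only a sketch of the same idea, not the command that was actually run: it filters on the count field itself rather than on the substring ":0", and prints the count first so that the sort does not depend on the path containing no colons. The function name `count_suicide_timeouts` is hypothetical, and a grep that supports `-r` is assumed.

```shell
#!/bin/sh
# List log files under a directory that contain "hit suicide timeout",
# one "count<TAB>path" line per file, highest count first.
#   $1: directory to search (defaults to the current directory)
count_suicide_timeouts() {
    dir="${1:-.}"
    # grep -rc prints "path:count" for every file, including zero counts.
    grep -rc -- "hit suicide timeout" "$dir" \
        | awk -F: '$NF > 0 {
              # The count is the last colon-separated field; the rest is the path.
              path = substr($0, 1, length($0) - length($NF) - 1)
              print $NF "\t" path
          }' \
        | sort -k1,1nr -k2,2
}
```

Running `count_suicide_timeouts .` from the directory holding the extracted logs should list the same files as above, in count-first form.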

The cluster has no I/O right now (or none that is significant).

Yoann

#3 Updated by Sage Weil almost 4 years ago

It doesn't look like any of these logs has the log level turned up...

#4 Updated by Yoann Moulin almost 4 years ago

New logs: ceph-post-file: 4fae5f48-75d6-41f7-9f23-c8433b176ec2

Files in which an OSD hit the suicide timeout:

022/cephprod-osd.130.log:2
022/cephprod-osd.136.log:2
024/cephprod-osd.125.log:2
024/cephprod-osd.127.log:2
002/cephprod-osd.24.log:4
002/cephprod-osd.9.log:4
006/cephprod-osd.15.log:4
008/cephprod-osd.30.log:4
010/cephprod-osd.11.log:4
010/cephprod-osd.2.log:4
011/cephprod-osd.60.log:4
011/cephprod-osd.80.log:4
018/cephprod-osd.107.log:4
020/cephprod-osd.82.log:4
020/cephprod-osd.93.log:4
022/cephprod-osd.132.log:4
024/cephprod-osd.131.log:4
024/cephprod-osd.134.log:4
008/cephprod-osd.26.log:6
014/cephprod-osd.86.log:8

Yoann

#5 Updated by Yoann Moulin over 3 years ago

Here are some new files: baa0059a-3c49-4166-a14e-d134905fc8b9

As you asked on IRC: "look at osd.63's data directory in current/39.cs8_head"

The result of the command is here: 2102e660-0292-4a97-a551-6814f3b45f4a

Yoann

#6 Updated by Sage Weil over 3 years ago

  • Status changed from Need More Info to Duplicate

#7 Updated by Sage Weil over 3 years ago

  • Duplicates Bug #17859: filestore: can get stuck in an unbounded loop during scrub added
