Bug #17226

OSD assert failed after a large amount of writes

Added by zhou wei over 7 years ago. Updated almost 7 years ago.

Status: Rejected
Priority: Normal
Assignee: -
Category: OSD
Target version:
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I have a cluster of 120 OSDs. After the test, 3 OSDs failed with an assert.
The log is attached.

ceph-osd.90.log.gz (219 KB) zhou wei, 09/07/2016 02:15 AM

History

#1 Updated by zhou wei over 7 years ago

My config:

[global]
auth_service_required = cephx
auth_client_required = cephx
auth_cluster_required = cephx
filestore_xattr_use_omap = true
filestore_op_threads = 8
mon_host = *
mon_initial_members = ***
fsid = *
[osd]
osd_op_threads = 4
osd_disk_threads = 2
osd_mount_options_xfs = rw,noatime,nobarrier,inode64
osd_recovery_op_priority = 4
osd_recovery_max_active = 10
osd_max_backfills = 4
osd_journal_size = 10240

#2 Updated by huang jun over 7 years ago

It seems the filestore sync_entry thread was blocked for more than 600s.
sync_entry calls syncfs, so can you paste the dmesg log?
Something may be causing the sync to block for that long.
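
For readers unfamiliar with the pattern described above, here is a minimal Python sketch of a commit watchdog: an operation (standing in for syncfs) runs in a worker thread, and the caller checks whether it finished before a deadline. This is not Ceph's code. The name run_with_watchdog and the short timeouts are hypothetical, chosen only to illustrate how a sync that blocks past a limit such as 600s would be detected.

```python
import threading
import time


def run_with_watchdog(operation, timeout_s):
    """Run `operation` in a worker thread and wait up to `timeout_s` seconds.

    Returns (finished, elapsed_s). `finished` is False when the deadline
    passed first, which is the condition a real daemon might treat as fatal.
    """
    done = threading.Event()

    def worker():
        try:
            operation()
        finally:
            done.set()  # signal completion even if the operation raised

    start = time.monotonic()
    threading.Thread(target=worker, daemon=True).start()
    finished = done.wait(timeout_s)
    return finished, time.monotonic() - start


if __name__ == "__main__":
    # Stand-ins for a fast and a stalled sync; real code would call syncfs.
    print(run_with_watchdog(lambda: time.sleep(0.01), 1.0))
    print(run_with_watchdog(lambda: time.sleep(0.5), 0.05))
```

The watchdog only observes the stall; it cannot tell why the underlying call is blocked, which is why the kernel log is the next thing to check.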

#3 Updated by zhou wei over 7 years ago

huang jun wrote:

It seems the filestore sync_entry thread was blocked for more than 600s.
sync_entry calls syncfs, so can you paste the dmesg log?
Something may be causing the sync to block for that long.

Yes, there is something like this in the dmesg log:

[392837.216467] INFO: task ceph-osd:11765 blocked for more than 120 seconds.
[392837.216468] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[392837.216469] ceph-osd D 0000000000000000 0 11765 1 0x00000000
[392837.216471] ffff880149c8bd48 0000000000000082 ffff880149c8bcc8 ffff8808542de400
[392837.216473] ffff880149c8bcd8 ffffffff810629d4 0000000000011c00 0000000000011c00
[392837.216475] ffff880149c8bfd8 ffff880149c8a010 0000000000011c00 ffff880149c8bfd8
[392837.216477] Call Trace:
[392837.216480] [<ffffffff810629d4>] ? wake_up_worker+0x24/0x30
[392837.216482] [<ffffffff819bdfb9>] schedule+0x29/0x70
[392837.216484] [<ffffffff819bc375>] schedule_timeout+0x1a5/0x1f0
[392837.216486] [<ffffffff8106538a>] ? __queue_delayed_work+0x8a/0x150
[392837.216489] [<ffffffff819bd7f6>] wait_for_completion+0xc6/0x100
[392837.216491] [<ffffffff8107ea10>] ? try_to_wake_up+0x2a0/0x2a0
[392837.216493] [<ffffffff819bf826>] ? _raw_spin_unlock_bh+0x16/0x20
[392837.216496] [<ffffffff811a2468>] writeback_inodes_sb_nr+0x88/0xb0
[392837.216498] [<ffffffff811a27af>] writeback_inodes_sb+0x5f/0x80
[392837.216500] [<ffffffff811a9362>] __sync_filesystem+0x52/0x60
[392837.216501] [<ffffffff811a93aa>] sync_filesystem+0x3a/0x70
[392837.216503] [<ffffffff811a9435>] SyS_syncfs+0x55/0x90
[392837.216505] [<ffffffff81ac8222>] system_call_fastpath+0x16/0x1b
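
When a node logs many of these reports, it can help to pull them out of dmesg programmatically. The following Python sketch is one way to do that, assuming the log format shown above. The function name parse_hung_tasks and the regular expressions are my own, not part of any Ceph or kernel tool. Frames marked with "?" are skipped, since the kernel prints that marker for addresses it found on the stack but could not confirm as part of the call chain.

```python
import re

# Matches e.g. "[392837.216467] INFO: task ceph-osd:11765 blocked for more than 120 seconds."
HUNG_TASK_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: task (?P<comm>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)
# Matches e.g. "[<ffffffff811a9435>] SyS_syncfs+0x55/0x90", optionally with "? "
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<unreliable>\? )?(?P<sym>[\w.]+)\+0x"
)

SAMPLE = """\
[392837.216467] INFO: task ceph-osd:11765 blocked for more than 120 seconds.
[392837.216477] Call Trace:
[392837.216480] [<ffffffff810629d4>] ? wake_up_worker+0x24/0x30
[392837.216482] [<ffffffff819bdfb9>] schedule+0x29/0x70
[392837.216500] [<ffffffff811a9362>] __sync_filesystem+0x52/0x60
[392837.216503] [<ffffffff811a9435>] SyS_syncfs+0x55/0x90
"""


def parse_hung_tasks(dmesg_text):
    """Return one dict per hung-task report, with its reliable stack frames."""
    reports = []
    current = None
    for line in dmesg_text.splitlines():
        header = HUNG_TASK_RE.search(line)
        if header:
            current = {
                "timestamp": float(header.group("ts")),
                "task": header.group("comm"),
                "pid": int(header.group("pid")),
                "blocked_secs": int(header.group("secs")),
                "frames": [],
            }
            reports.append(current)
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None and not frame.group("unreliable"):
            current["frames"].append(frame.group("sym"))
    return reports


if __name__ == "__main__":
    for report in parse_hung_tasks(SAMPLE):
        print(report["task"], report["pid"], report["frames"])
```

In the trace above, the reliable frames run from SyS_syncfs down to wait_for_completion, so the task is waiting inside the kernel for writeback to finish.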

#4 Updated by huang jun over 7 years ago

Maybe you should run a basic benchmark on the OSD disk,
using ceph tell osd.$id bench or some other method.

#5 Updated by zhou wei over 7 years ago

# ceph tell osd.90 bench
{
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "bytes_per_sec": 231655556.000000
}

This OSD performs no differently from the others.
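
To put the bench numbers in more familiar units, here is a small Python sketch that summarizes the JSON above. The helper name summarize_bench is hypothetical, and the duration is derived on the assumption that bytes_per_sec equals bytes_written divided by elapsed time.

```python
import json

# Output of "ceph tell osd.90 bench" as pasted in this comment.
BENCH_OUTPUT = """{
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "bytes_per_sec": 231655556.000000
}"""


def summarize_bench(raw):
    """Convert raw bench JSON into throughput, duration, and write count."""
    result = json.loads(raw)
    mib = 1024 * 1024
    return {
        "throughput_mib_s": result["bytes_per_sec"] / mib,
        # Assumption: bytes_per_sec = bytes_written / elapsed seconds.
        "duration_s": result["bytes_written"] / result["bytes_per_sec"],
        "writes": result["bytes_written"] // result["blocksize"],
    }


if __name__ == "__main__":
    for key, value in summarize_bench(BENCH_OUTPUT).items():
        print(key, round(value, 2))
```

That works out to 256 writes of 4 MiB at roughly 221 MiB/s, finishing in under 5 seconds. A run that short may not reproduce a stall that only appears under sustained heavy writes.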

#6 Updated by Sage Weil almost 7 years ago

  • Status changed from New to Rejected

[392837.216467] INFO: task ceph-osd:11765 blocked for more than 120 seconds.

suggests it's a kernel issue.
