Bug #8100

Rados Bench seq read errors on tiered configuration

Added by Mark Nelson about 10 years ago. Updated about 10 years ago.

Status: Resolved
Priority: Urgent
Assignee: David Zafman
Category: -
Target version: -
% Done: 0%
Source: other
Severity: 3 - minor

Description

Attempting the following rados bench seq read test, on a tiered pool setup that was previously populated via rados bench write, results in the following error:

nhm@burnupiY:~$ /usr/bin/rados -c /tmp/cbt/ceph/ceph.conf -p rados-bench-`hostname -s`-3 -b 4096 bench 300 seq --concurrent-ios 32 --no-cleanup
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
read got 4096
error during benchmark: -5
error 5: (5) Input/output error
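
For context, a cache-tier setup of the kind being exercised here looks roughly like the sketch below. The pool names follow the "<pool>-cache" pattern visible in the health warning, but the pg counts and the target_max_bytes value are hypothetical placeholders, not the actual CBT-generated configuration:

# Hypothetical sketch of the tiered setup under test; pg counts and
# the 4 GB cache target are placeholders.
ceph osd pool create rados-bench-burnupiY-3 1024 1024 replicated
ceph osd pool create rados-bench-burnupiY-3-cache 1024 1024 replicated
ceph osd tier add rados-bench-burnupiY-3 rados-bench-burnupiY-3-cache
ceph osd tier cache-mode rados-bench-burnupiY-3-cache writeback
ceph osd tier set-overlay rados-bench-burnupiY-3 rados-bench-burnupiY-3-cache
ceph osd pool set rados-bench-burnupiY-3-cache target_max_bytes 4294967296
# Populate, then run the failing seq read:
rados -p rados-bench-burnupiY-3 -b 4096 bench 300 write --no-cleanup
rados -p rados-bench-burnupiY-3 bench 300 seq --concurrent-ios 32 --no-cleanup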

ceph -s shows all OSDs are up and in, but some cache pools are at/near their target max:

nhm@burnupiY:~$ ceph -s
    cluster 72746997-0aea-479e-bc48-6153b319cf35
     health HEALTH_WARN 'rados-bench-burnupiY-0-cache' at/near target max; 'rados-bench-burnupiY-1-cache' at/near target max; 'rados-bench-burnupiY-2-cache' at/near target max; 'rados-bench-burnupiY-3-cache' at/near target max
     monmap e1: 1 mons at {a=192.168.10.2:6789/0}, election epoch 2, quorum 0 a
     osdmap e220: 36 osds: 36 up, 36 in
      pgmap v1977: 21248 pgs, 11 pools, 278 GB data, 703 kobjects
            694 GB used, 27907 GB / 28602 GB avail
               21248 active+clean
  client io 6747 kB/s wr, 131 op/s
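
Since the warning is about cache pools at or near their target max, one thing worth ruling out (a suggestion, not something done in the original report) is a mismatch between the configured cache limits and actual usage, using standard commands:

# Standard Ceph commands; the pool name is one of the four from the
# warning above.
ceph osd pool get rados-bench-burnupiY-3-cache target_max_bytes
ceph osd pool get rados-bench-burnupiY-3-cache target_max_objects
ceph df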

Not much to go on yet, but seems to be repeatable. Will do more testing.

#1

Updated by Mark Nelson about 10 years ago

This appears to be happening on non-tiered pools as well, regardless of whether erasure coding or replication is used.
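
A minimal sketch of reproducing on plain (non-tiered) pools of both kinds; the pool names and pg counts here are placeholders, and the erasure pool uses the default erasure-code profile:

# Placeholder pool names and pg counts.
ceph osd pool create bench-rep 1024 1024 replicated
ceph osd pool create bench-ec 1024 1024 erasure
rados -p bench-rep -b 4096 bench 60 write --no-cleanup
rados -p bench-rep bench 60 seq --concurrent-ios 32 --no-cleanup
rados -p bench-ec -b 4096 bench 60 write --no-cleanup
rados -p bench-ec bench 60 seq --concurrent-ios 32 --no-cleanup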

#2

Updated by Greg Farnum about 10 years ago

Did you check for typos? :p Right pool name? That "-3" looks easy to get wrong.

#3

Updated by Mark Nelson about 10 years ago

It's all automated, though I did try manually testing reads from the command line as well. FWIW, with debugging enabled we see a number of reads succeed with ondisk = 0 before we hit the Input/output error.
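
For reference, client-side debugging of that sort can be enabled with Ceph's standard per-subsystem debug options; the subsystems and levels below are illustrative choices, not the exact settings used:

# Illustrative debug levels for the client-side subsystems:
rados -c /tmp/cbt/ceph/ceph.conf -p rados-bench-burnupiY-3 \
    --debug-objecter 20 --debug-ms 1 \
    bench 300 seq --concurrent-ios 32 --no-cleanup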

#4

Updated by Mark Nelson about 10 years ago

Through some bisecting and a well-informed guess by Yehuda, it appears that this is being caused by d99f1d9f.
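
The bisection described can be driven with git in the usual way; this is a generic sketch, where the known-good ref and the test script are placeholders:

# <known-good-ref> and test.sh are placeholders. test.sh should
# rebuild/install, repopulate a pool, run the seq bench, and exit
# non-zero when the EIO reproduces.
git bisect start
git bisect bad HEAD
git bisect good <known-good-ref>
git bisect run ./test.sh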

#5

Updated by David Zafman about 10 years ago

  • Status changed from New to 7
  • Assignee set to David Zafman
#6

Updated by David Zafman about 10 years ago

  • Status changed from 7 to Resolved

a3d452acdf2fcf9ad10002c5f24c2548d12952bd
