Bug #22609

thrash-eio + bluestore fails with "reached maximum tries (3650) after waiting for 21900 seconds"

Added by Kefu Chai over 6 years ago. Updated about 6 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Regression: No
Severity: 3 - minor

Description

http://pulpito.ceph.com/yuriw-2018-01-06_17:10:23-rados-wip-yuri-master_1.5.18-testing-basic-smithi/2036088

2018-01-06T19:31:37.799 INFO:teuthology.orchestra.run.smithi191:Running: "/bin/sh -c 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage rados --no-log-to-stderr --name client.0 -p unique_pool_0 bench 720 rand'" 
2018-01-06T19:31:37.804 DEBUG:teuthology.run_tasks:Unwinding manager radosbench
2018-01-06T19:31:37.827 INFO:tasks.radosbench:joining radosbench (timing out after 21900s)
...
2018-01-06T19:31:37.853 INFO:tasks.radosbench.radosbench.0.smithi191.stdout:  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
...
2018-01-06T19:41:19.991 INFO:tasks.radosbench.radosbench.0.smithi191.stdout:  582      16    414059    414043   711.241         2  0.00481413   0.0194955
2018-01-07T01:42:56.505 INFO:tasks.radosbench.radosbench.0.smithi191.stdout:22276      16    414059    414043   18.5847         0           -   0.0194955
2018-01-07T01:43:00.507 INFO:tasks.radosbench.radosbench.0.smithi191.stdout:22280      16    414059    414043   18.5814         0           -   0.0194955
2018-01-07T01:43:02.238 ERROR:teuthology.run_tasks:Manager failed: radosbench

The cluster stopped processing requests after 19:41.
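
For context on the failure message in the title: it comes from a bounded polling loop, and 3650 tries at a fixed 6-second interval is exactly 21900 seconds. A minimal sketch of the pattern in plain Python (not teuthology's actual implementation; check() stands in for whatever cluster condition is being polled):

import time

def wait_until(check, tries=3650, sleep=6):
    # Poll check() up to `tries` times, sleeping `sleep` seconds between
    # attempts; the worst-case wait is tries * sleep = 3650 * 6 = 21900 s.
    for attempt in range(1, tries + 1):
        if check():
            return attempt
        time.sleep(sleep)
    raise RuntimeError('reached maximum tries (%d) after waiting for %d seconds'
                       % (tries, tries * sleep))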

The last few log messages from osd.5 were:

2018-01-06 19:34:02.420 7f30fc281700 20 bluestore(/var/lib/ceph/osd/ceph-5) _do_read  blob Blob(0x558c57e53030 blob([0x58050000~10000] csum crc32c/0x10000) use_tracker(0x10000 0x10000) SharedBlob(0x558c57e530a0 sbid 0x0)) need 0x0~10000 cache has 0x[]
2018-01-06 19:34:02.420 7f30fc281700 20 bluestore(/var/lib/ceph/osd/ceph-5) _do_read  blob Blob(0x558c57e53110 blob([0x58060000~10000] csum crc32c/0x10000) use_tracker(0x10000 0x10000) SharedBlob(0x558c57e53180 sbid 0x0)) need 0x0~10000 cache has 0x[]
2018-01-06 19:34:02.420 7f30fc281700 20 bluestore(/var/lib/ceph/osd/ceph-5) _do_read  blob Blob(0x558c57e531f0 blob([0x58070000~10000] csum crc32c/0x10000) use_tracker(0x10000 0x10000) SharedBlob(0x558c57e532

From osd.3:

2018-01-06 19:30:42.645 7fc99add3700 20 bluestore.blob(0x55b618d791e0) get_ref init 0x10000, 10000
2018-01-06 19:30:42.645 7fc99add3700 20 bluestore(/var/lib/ceph/osd/ceph-3) _do_alloc_write  lex 0xe0000~10000: 0x0~10000 Blob(0x55b618d791e0 blob([0x5fe88000~10000] csum crc32c/0x10000) use_tracker(0x10000 0x10000) SharedBlob(0x55b618d79250 sbid 0x0))
2018-01-06 19:30:42.645 7fc

Also:

$ ls -lh remote/*/log/ceph-osd.*.log.gz
-rw-rw-r-- 1 teuthworker teuthworker 266M Jan  7 04:32 remote/smithi062/log/ceph-osd.0.log.gz
-rw-rw-r-- 1 teuthworker teuthworker 626M Jan  7 04:30 remote/smithi062/log/ceph-osd.1.log.gz
-rw-rw-r-- 1 teuthworker teuthworker 335M Jan  7 04:31 remote/smithi062/log/ceph-osd.2.log.gz
-rw-rw-r-- 1 teuthworker teuthworker  43G Jan  7 04:29 remote/smithi191/log/ceph-osd.3.log.gz
-rw-rw-r-- 1 teuthworker teuthworker 198M Jan  7 04:29 remote/smithi191/log/ceph-osd.4.log.gz
-rw-rw-r-- 1 teuthworker teuthworker 253M Jan  7 04:29 remote/smithi191/log/ceph-osd.5.log.gz

So I guess smithi191 was out of disk space by then.
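
If this shows up again, a quick way to test the disk-space theory on the affected node would be something like the following (a sketch; the log path is an assumption based on the default ceph log location):

import shutil

# Report free space on the filesystem holding the OSD logs.
total, used, free = shutil.disk_usage('/var/log/ceph')
print('free: %.1f GiB of %.1f GiB' % (free / 2**30, total / 2**30))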

#1 - Updated by Josh Durgin over 6 years ago

  • Project changed from RADOS to bluestore

Is this bluestore not handling out-of-space conditions well?

#2 - Updated by Sage Weil about 6 years ago

  • Status changed from New to Can't reproduce

Haven't seen this failure in quite a while... I think it may be resolved! Reopen if it reappears.
