Project

General

Profile

Bug #20295

bluestore: Timeout in tp_osd_tp threads when running RBD bench in EC pool w/ overwrites

Added by Ricardo Dias 6 months ago. Updated 4 months ago.

Status:
Resolved
Priority:
Immediate
Assignee:
Category:
EC Pools
Target version:
-
Start date:
06/14/2017
Due date:
% Done:

0%

Source:
Tags:
EC pool, bluestore, RBD
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc:
No
Component(RADOS):
OSD

Description

When running "rbd bench-write" using an RBD image stored in an EC pool, the some OSD threads start to timeout and eventually suicide, and some OSDs are marked as down.
The OSDs are then marked as up again after some time with the benchmark already stopped.

This behavior does not happen when using an RBD image in a replicated pool.

ceph version 12.0.3

Steps to reproduce:
// Assuming cluster deployed with blustore OSDs

ceph osd pool create ecpool 12 12 erasure
ceph osd pool set ecpool allow_ec_overwrites true
rbd --data-pool ecpool create --size 1024G test1
rbd bench-write test1 --io-pattern=rand

Output of "ceph -s" before benchmark.

  cluster:
    id:     4508f518-5b6d-32f4-85c3-acdda7675409
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum node1,node2,node3
    mgr: e8: node3(active), standbys: node2, node1
    osd: 6 osds: 6 up, 6 in

  data:
    pools:   2 pools, 76 pgs
    objects: 1933 objects, 3869 MB
    usage:   12923 MB used, 23334 MB / 36257 MB avail
    pgs:     76 active+clean

  io:
    client:   211 B/s rd, 0 op/s rd, 0 op/s wr
    recovery: 3180 kB/s, 0 keys/s, 1 objects/s

Output of "rbd bench-write"

headnode:~ # rbd bench-write test1 --io-pattern=rand
2017-06-14 13:55:20.765851 7f467a432c80 -1 WARNING: the following dangerous and experimental features are enabled: bluestore
2017-06-14 13:55:20.765972 7f467a432c80 -1 WARNING: the following dangerous and experimental features are enabled: bluestore
rbd: bench-write is deprecated, use rbd bench --io-type write ...
2017-06-14 13:55:20.770011 7f467a432c80 -1 WARNING: the following dangerous and experimental features are enabled: bluestore
bench  type write io_size 4096 io_threads 16 bytes 1073741824 pattern random
  SEC       OPS   OPS/SEC   BYTES/SEC
2017-06-14 13:56:25.772052 7f4667fff700  1 heartbeat_map is_healthy 'tp_librbd thread tp_librbd' had timed out after 60
^C

Output of "ceph osd tree" after benchmark.

ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 0.03479 root default                                     
-2 0.01160     host node1                                   
 0 0.00580         osd.0       up  1.00000          1.00000 
 3 0.00580         osd.3     down  1.00000          1.00000 
-3 0.01160     host node3                                   
 2 0.00580         osd.2     down  1.00000          1.00000 
 4 0.00580         osd.4       up  1.00000          1.00000 
-4 0.01160     host node2                                   
 1 0.00580         osd.1     down  1.00000          1.00000 
 5 0.00580         osd.5       up  1.00000          1.00000 

I attached the logs of osd.2 and osd.3 that were marked as down. I run the OSDs with --debug-osd=20/20

osd.2.log.tar.bz2 (678 KB) Ricardo Dias, 06/14/2017 02:23 PM

osd.3.log.tar.bz2 (739 KB) Ricardo Dias, 06/14/2017 02:23 PM


Related issues

Related to RADOS - Bug #20545: erasure coding = crashes Duplicate 07/07/2017
Copied to rbd - Bug #20333: RBD bench in EC pool w/ overwrites overwhelms OSDs Rejected 06/14/2017

History

#1 Updated by Greg Farnum 6 months ago

  • Copied to Bug #20333: RBD bench in EC pool w/ overwrites overwhelms OSDs added

#2 Updated by Greg Farnum 6 months ago

  • Project changed from Ceph to RADOS
  • Category changed from OSD to EC Pools
  • Component(RADOS) OSD added

Sounds like we need some way of more reliably accounting for the extra cost of EC overwrites in our throttle limits.

#3 Updated by Joao Luis 4 months ago

We've been reproducing this reliably on one of our test clusters.

This is a cluster composed of mostly hdds, 32G RAM, 8 HT cores at 1.4GHz. We've been running on only 30 OSDs, evenly spread across 6 nodes.

We have been creating an EC pool with the default profile (k=2 m=1), enabling overwrites, and then creating an rbd image with 1024G. Upon running `rbd bench` with io-size aligned to the data pool's stripe_width,

rbd bench --io-type write --io-pattern rand --io-size 8K rbd-test

we end up with a bunch of osds being marked down. It really doesn't take more than a few seconds past running `rbd bench`.

In the cluster log we're seeing a bunch of slow requests

2017-08-01 13:12:12.479914 osd.10 osd.10 10.161.16.83:6804/24865 665 : cluster [WRN] 220 slow requests, 5 included below; oldest blocked for > 133.605318 secs
2017-08-01 13:12:12.479924 osd.10 osd.10 10.161.16.83:6804/24865 666 : cluster [WRN] slow request 132.866750 seconds old, received at 2017-08-01 13:09:59.613068: MOSDECSubOpWriteReply(16.177s0 569/559 ECSubWriteReply(tid=194, last_complete=0'0, committed=0, applied=1)) currently queued_for_pg
2017-08-01 13:12:12.479930 osd.10 osd.10 10.161.16.83:6804/24865 667 : cluster [WRN] slow request 132.704687 seconds old, received at 2017-08-01 13:09:59.775132: MOSDECSubOpWrite(16.31s2 569/545 ECSubWrite(tid=124, reqid=client.120139.0:1714, at_version=569'13, trim_to=0'0, roll_forward_to=569'9)) currently queued_for_pg
2017-08-01 13:12:12.479937 osd.10 osd.10 10.161.16.83:6804/24865 668 : cluster [WRN] slow request 132.141208 seconds old, received at 2017-08-01 13:10:00.338610: MOSDECSubOpWriteReply(16.150s0 569/545 ECSubWriteReply(tid=178, last_complete=0'0, committed=0, applied=1)) currently queued_for_pg
2017-08-01 13:12:12.479954 osd.10 osd.10 10.161.16.83:6804/24865 669 : cluster [WRN] slow request 133.569025 seconds old, received at 2017-08-01 13:09:58.910793: osd_op(client.120139.0:566 16.as0 16:500ab812:::rbd_data.17.1d4d32ae8944a.000000000000cd7e:head [set-alloc-hint object_size 4194304 write_size 4194304,write 172032~8192] snapc 0=[] ondisk+write+known_if_redirected e569) currently sub_op_applied
2017-08-01 13:12:12.479961 osd.10 osd.10 10.161.16.83:6804/24865 670 : cluster [WRN] slow request 133.332212 seconds old, received at 2017-08-01 13:09:59.147607: osd_op(client.120139.0:1040 16.150s0 16:0abd0750:::rbd_data.17.1d4d32ae8944a.0000000000031563:head [set-alloc-hint object_size 4194304 write_size 4194304,write 1597440~8192] snapc 0=[] ondisk+write+known_if_redirected e569) currently sub_op_applied

And any of the down osd logs show them timing out on `OSD::osd_op_tp`.

Once `rbd bench` is stopped and we reduce the pressure on the cluster, most OSDs will go back up. We've noticed that a few will not though. They are not dead, they just seem their heartbeats are timing out still, and they eventually come back after a restart.

Now, the funny thing is that we are unable to trigger this on our SSD-only cluster. This looks a lot like too much pressure on the HDDs, and the OSDs are simply unable to cope with that.

Any thoughts?

#4 Updated by Joao Luis 4 months ago

For clarity's sake: the previous comment lacked the version. This is a recent master build (fa70335); from yesterday, I think.

#5 Updated by Sage Weil 4 months ago

  • Priority changed from Normal to Urgent

Any idea how your test case varies from what's in the rbd suite?

#6 Updated by Joao Luis 4 months ago

As far as I can tell, the differences seem to simply be the `--io-total`, and in most cases the `--io-size` or number of threads. I have no idea if the size of the image being used would also be a problem (we are using 1024G vs considerably smaller sizes in the suites). Then again, I am not familiar enough with rbd suites to figure out if there's more differences, or if the tests running rbd bench are exercised frequently enough that would make this a problem, regardless of parameters.

Maybe Jason could chime in?

#7 Updated by Joao Luis 4 months ago

Nathan has a teuthology unit to, hopefully, flush this out: https://github.com/ceph/ceph/pull/16728

He also has a suite running, and we'll see if it reproduces in the lab with the previously mentioned unit or if it will need some tweaking: http://pulpito.ceph.com/smithfarm-2017-08-01_13:28:09-rbd:singleton-master-distro-basic-mira/

#8 Updated by Nathan Cutler 4 months ago

This job reproduces the issue: http://pulpito.ceph.com/smithfarm-2017-08-01_13:28:09-rbd:singleton-master-distro-basic-mira/1469010/

Grep for "to interpret this" to see the crash.

#9 Updated by Nathan Cutler 4 months ago

New (hopefully more "mergeable") reproducer: https://github.com/ceph/ceph/pull/16731

#10 Updated by Sage Weil 4 months ago

  • Related to Bug #20545: erasure coding = crashes added

#11 Updated by Sage Weil 4 months ago

  • Priority changed from Urgent to Immediate

#12 Updated by Sage Weil 4 months ago

  • Assignee set to Sage Weil

#13 Updated by Sage Weil 4 months ago

  • Subject changed from Timeout in tp_osd_tp threads when running RBD bench in EC pool w/ overwrites to bluestore: Timeout in tp_osd_tp threads when running RBD bench in EC pool w/ overwrites
  • Status changed from New to Verified

txn Z queues deferred io,

2017-08-02 19:09:31.049597 7f8826158700 20 bluestore(/var/lib/ceph/osd/ceph-1) _deferred_queue txc 0x7f8844c63c80 osr 0x7f8843315480
2017-08-02 19:09:31.049605 7f8826158700 20 bluestore.DeferredBatch(0x7f88432ab320) prepare_write seq 260 0x1c503000~1000 crc 22052147

txn A starts, which includes a deferred io

2017-08-02 19:09:31.130235 7f881d146700 20 bluestore(/var/lib/ceph/osd/ceph-1) _txc_create osr 0x7f8843315480 = 0x7f884387d080 seq 158
...
2017-08-02 19:09:31.132597 7f881d146700 20 bluestore(/var/lib/ceph/osd/ceph-1) _do_alloc_write  lex 0x1f0000~6000: 0x70000~6000 Blob(0x7f8843492ed0 blob([!~70000,0x1c700000~10000] csum+has_unused crc32c/0x1000 unused=0x3fff) use_tracker(0x8*0x10000 0x[0,0,0,0,0,0,0,6000]) SharedBlob(0x7f884511f880 sbid 0x0))
2017-08-02 19:09:31.132601 7f881d146700 20 bluestore.BufferSpace(0x7f884511f898 in 0x7f88405fd5e0) _discard 0x70000~6000
2017-08-02 19:09:31.132603 7f881d146700 20 bluestore(/var/lib/ceph/osd/ceph-1) _do_alloc_write deferring small 0x6000 write via deferred

txn B starts, which includes deferred io

2017-08-02 19:09:31.139484 7f881d146700 20 bluestore(/var/lib/ceph/osd/ceph-1) _txc_create osr 0x7f8840f0b480 = 0x7f8844c61340 seq 267
...
2017-08-02 19:09:31.139795 7f881d146700 20 bluestore(/var/lib/ceph/osd/ceph-1) _do_alloc_write  lex 0x70000~2000: 0x70000~2000 Blob(0x7f884441ad10 blob([!~70000,0x1c780000~10000] csum+has_unused crc32c/0x1000 unused=0x3fff) use_tracker(0x8*0x10000 0x[0,0,0,0,0,0,0,2000]) SharedBlob(0x7f88444185b0 sbid 0x0))
2017-08-02 19:09:31.139801 7f881d146700 20 bluestore.BufferSpace(0x7f88444185c8 in 0x7f88405fd5e0) _discard 0x70000~2000
2017-08-02 19:09:31.139803 7f881d146700 20 bluestore(/var/lib/ceph/osd/ceph-1) _do_alloc_write deferring small 0x2000 write via deferred

txn b would hit deferred throttle, submits queued
2017-08-02 19:09:31.139966 7f881d146700 20 bluestore(/var/lib/ceph/osd/ceph-1) queue_transactions COST get txc 0x7f8844c61340 cost 76178659 current 0
2017-08-02 19:09:31.139969 7f881d146700 20 bluestore(/var/lib/ceph/osd/ceph-1) deferred_try_submit 1 osrs, 1 txcs
2017-08-02 19:09:31.139971 7f881d146700 10 bluestore(/var/lib/ceph/osd/ceph-1) _deferred_submit_unlock osr 0x7f8843315480 1 ios pending 
2017-08-02 19:09:31.139973 7f881d146700 20 bluestore(/var/lib/ceph/osd/ceph-1) _deferred_submit_unlock   seq 260 0x1c503000~1000
2017-08-02 19:09:31.139974 7f881d146700 20 bluestore(/var/lib/ceph/osd/ceph-1) _deferred_submit_unlock write 0x1c503000~1000 crc 22052147
2017-08-02 19:09:31.139976 7f881d146700 20 bdev(0x7f88405cba00 /var/lib/ceph/osd/ceph-1/block) aio_write 0x1c503000~1000 (direct)
2017-08-02 19:09:31.139977 7f881d146700 20 bdev(0x7f88405cba00 /var/lib/ceph/osd/ceph-1/block) _aio_log_start 0x1c503000~1000
2017-08-02 19:09:31.139979 7f881d146700  5 bdev(0x7f88405cba00 /var/lib/ceph/osd/ceph-1/block) aio_write 0x1c503000~1000 aio 0x7f8840ef4dd0
2017-08-02 19:09:31.139980 7f881d146700 20 bdev(0x7f88405cba00 /var/lib/ceph/osd/ceph-1/block) aio_submit ioc 0x7f88432ab378 pending 1 running 0

...which submits Z's deferred io, but not txn A's.

txn A commits, queues its deferred io

2017-08-02 19:09:31.140002 7f8825957700 20 bluestore(/var/lib/ceph/osd/ceph-1) _kv_sync_thread committed 1 cleaned 0 in 0.001223 (0.000449 flush + 0.000773 kv commit)
2017-08-02 19:09:31.140011 7f8825957700 20 bluestore(/var/lib/ceph/osd/ceph-1) _kv_sync_thread sleep
2017-08-02 19:09:31.140018 7f8826158700 20 bluestore(/var/lib/ceph/osd/ceph-1) _kv_finalize_thread wake
2017-08-02 19:09:31.140023 7f8826158700 20 bluestore(/var/lib/ceph/osd/ceph-1) _kv_finalize_thread kv_committed <0x7f884387d080>
2017-08-02 19:09:31.140027 7f8826158700 20 bluestore(/var/lib/ceph/osd/ceph-1) _kv_finalize_thread deferred_stable <>
2017-08-02 19:09:31.140028 7f8826158700 10 bluestore(/var/lib/ceph/osd/ceph-1) _txc_state_proc txc 0x7f884387d080 kv_submitted
2017-08-02 19:09:31.140029 7f8826158700 20 bluestore(/var/lib/ceph/osd/ceph-1) _txc_committed_kv txc 0x7f884387d080
2017-08-02 19:09:31.140036 7f8826158700 20 bluestore(/var/lib/ceph/osd/ceph-1) _deferred_queue txc 0x7f884387d080 osr 0x7f8843315480
2017-08-02 19:09:31.140043 7f8826158700 20 bluestore.DeferredBatch(0x7f88434a5320) prepare_write seq 261 0x1c700000~6000 crc 37feac39
2017-08-02 19:09:31.140049 7f8826158700 20 bluestore(/var/lib/ceph/osd/ceph-1) deferred_try_submit 1 osrs, 1 txcs

but it does not submit it because the previous one is already in flight, and we only have one in flight at a time.

Z's deferred io finishes,

2017-08-02 19:09:31.140123 7f882c164700 10 bluestore(/var/lib/ceph/osd/ceph-1) _deferred_aio_finish osr 0x7f8843315480

but the deferred_throttle still has txn A's cost in it, so txn B does not unblock.

#14 Updated by Sage Weil 4 months ago

  • Status changed from Verified to Testing

#15 Updated by Sage Weil 4 months ago

  • Status changed from Testing to Resolved

Also available in: Atom PDF