Bug #36709

OSD stuck while flushing rocksdb WAL

Added by Aleksei Zakharov over 5 years ago. Updated over 5 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%
Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi all,
We use:
ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)
Clients work on:
ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)

We see that sometimes random OSDs with RocksDB and FileStore get stuck, with these lines in the log:

2018-11-06 12:03:12.583520 7fd28b81b700  0 log_channel(cluster) log [DBG] : 14.95c scrub starts
2018-11-06 12:03:15.855132 7fd28f823700  0 log_channel(cluster) log [DBG] : 14.95c scrub ok
2018-11-06 12:03:17.584285 7fd290024700  0 log_channel(cluster) log [DBG] : 14.292b scrub starts
2018-11-06 12:03:21.264719 7fd290024700  0 log_channel(cluster) log [DBG] : 14.292b scrub ok
2018-11-06 12:03:23.585003 7fd28b01a700  0 log_channel(cluster) log [DBG] : 14.2e05 scrub starts
2018-11-06 12:03:28.201684 7fd28f022700  0 log_channel(cluster) log [DBG] : 14.2e05 scrub ok
2018-11-06 12:03:54.286060 7fd2a484d700  4 rocksdb: [/build/ceph-12.2.8/src/rocksdb/db/db_impl_write.cc:725] [default] New memtable created with log file: #79900. Immutable memtables: 0.

2018-11-06 12:03:54.286115 7fd287813700  4 rocksdb: [/build/ceph-12.2.8/src/rocksdb/db/db_impl_compaction_flush.cc:49] [JOB 950] Syncing log #79898
2018-11-06 12:03:54.286518 7fd287813700  4 rocksdb: (Original Log Time 2018/11/06-12:03:54.286103) [/build/ceph-12.2.8/src/rocksdb/db/db_impl_compaction_flush.cc:1158] Calling FlushMemTableToOutputFile with column family [default], flush slots available 1, compaction slots allowed 1, compaction slots scheduled 1
2018-11-06 12:03:54.286527 7fd287813700  4 rocksdb: [/build/ceph-12.2.8/src/rocksdb/db/flush_job.cc:264] [default] [JOB 950] Flushing memtable with next log file: 79900

2018-11-06 12:03:54.286541 7fd287813700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1541495034286533, "job": 950, "event": "flush_started", "num_memtables": 1, "num_entries": 370287, "num_deletes": 146919, "memory_usage": 65025896}
2018-11-06 12:03:54.286547 7fd287813700  4 rocksdb: [/build/ceph-12.2.8/src/rocksdb/db/flush_job.cc:293] [default] [JOB 950] Level-0 flush table #79901: started
2018-11-06 12:03:54.452365 7fd287813700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1541495034452352, "cf_name": "default", "job": 950, "event": "table_file_creation", "file_number": 79901, "file_size": 14836047, "table_properties": {"data_size": 13956943, "index_size": 281340, "filter_size": 596833, "raw_key_size": 15773747, "raw_average_key_size": 69, "raw_value_size": 10003671, "raw_average_value_size": 44, "num_data_blocks": 3525, "num_entries": 225940, "filter_policy_name": "rocksdb.BuiltinBloomFilter", "kDeletedKeys": "146919", "kMergeOperands": "0"}}
2018-11-06 12:03:54.452385 7fd287813700  4 rocksdb: [/build/ceph-12.2.8/src/rocksdb/db/flush_job.cc:319] [default] [JOB 950] Level-0 flush table #79901: 14836047 bytes OK
2018-11-06 12:03:54.452722 7fd287813700  4 rocksdb: (Original Log Time 2018/11/06-12:03:54.452394) [/build/ceph-12.2.8/src/rocksdb/db/memtable_list.cc:360] [default] Level-0 commit table #79901 started
2018-11-06 12:03:54.452728 7fd287813700  4 rocksdb: (Original Log Time 2018/11/06-12:03:54.452668) [/build/ceph-12.2.8/src/rocksdb/db/memtable_list.cc:383] [default] Level-0 commit table #79901: memtable #1 done
2018-11-06 12:03:54.452732 7fd287813700  4 rocksdb: (Original Log Time 2018/11/06-12:03:54.452678) EVENT_LOG_v1 {"time_micros": 1541495034452674, "job": 950, "event": "flush_finished", "lsm_state": [2, 2, 5, 0, 0, 0, 0], "immutable_memtables": 0}
2018-11-06 12:03:54.452735 7fd287813700  4 rocksdb: (Original Log Time 2018/11/06-12:03:54.452706) [/build/ceph-12.2.8/src/rocksdb/db/db_impl_compaction_flush.cc:132] [default] Level summary: base level 1 max bytes base 268435456 files[2 2 5 0 0 0 0] max score 0.93

2018-11-06 12:03:54.452742 7fd287813700  4 rocksdb: [/build/ceph-12.2.8/src/rocksdb/db/db_impl_files.cc:388] [JOB 950] Try to delete WAL files size 56883285, prev total WAL file size 56916690, number of live WAL files 2.

There are no "blocked operations" messages in ceph status, but some write requests are stuck for a really long time (seconds if not minutes).
The only way we have found to locate such OSDs while the problem is occurring is to mark some OSDs from the pool down. This triggers peering, and the affected OSDs can then be identified because they get stuck in the peering state.
Last time we saw that two PGs on one OSD had been stuck in snaptrim:

~# ceph pg dump | grep snaptrim
dumped all
14.99f      115                  0        0         0       0  1399443474 1585     1585 active+clean+snaptrim 2018-11-06 10:45:46.177039     1318693'8143685    1318693:11024026 [225,527,628]        225 [225,527,628]            225     1318210'8139261 2018-11-05 23:11:26.827327               0'0 2016-10-14 07:59:40.557778            20 
14.129b     125                  0        0         0       0  1768898560 1562     1562 active+clean+snaptrim 2018-11-06 10:45:12.233255    1318693'18374010    1318693:16891803 [527,225,650]        527 [527,225,650]            527    1318190'18342505 2018-11-05 06:48:51.767389               0'0 2016-10-14 07:59:40.567406            26

After we marked it down (ceph osd down osd.225), everything went back to normal.
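
For reference, the sequence we end up running looks roughly like this (osd.225 is just the OSD implicated in this example):

~# ceph pg dump | grep snaptrim      # PGs sitting in snaptrim for a long time point at the problem OSD
~# ceph osd down osd.225             # marking the problem OSD down forces re-peering and restores service
~# ceph -s                           # verify the PGs return to active+clean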

We would appreciate any help.
If any additional info is needed, please ask.

osd_pid.20503.backtrace - problem osd backtrace (49.7 KB) Aleksei Zakharov, 12/12/2018 01:02 PM

History

#1 Updated by Greg Farnum over 5 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSD)

#2 Updated by Sage Weil over 5 years ago

  • Status changed from New to Need More Info

I'm not sure rocksdb is what's stuck. Can you dump 'ceph daemon osd.NNN ops' to see what state the operations are in? Or even attach gdb to the process and run 'thr app all bt' to dump all threads?
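
Something along these lines should work (osd.NNN and the PID are placeholders for the stuck OSD; 'thr app all bt' is just the abbreviated form of 'thread apply all bt'):

~# ceph daemon osd.NNN ops > /tmp/osd.NNN.ops.json
~# gdb -p <pid of the stuck ceph-osd>
(gdb) thread apply all bt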

#3 Updated by Aleksei Zakharov over 5 years ago

I've finally reproduced this behavior (I hope).

Our staging cluster:
3 nodes with 22 OSDs on SSDs.
Kernel 4.15.0-32-generic #35~16.04.1-Ubuntu

The cluster supports old clients:

require_min_compat_client hammer
min_compat_client hammer
require_osd_release luminous

and crush tunables are set to firefly.
~# ceph osd crush show-tunables 
{
    "choose_local_tries": 0,
    "choose_local_fallback_tries": 0,
    "choose_total_tries": 50,
    "chooseleaf_descend_once": 1,
    "chooseleaf_vary_r": 1,
    "chooseleaf_stable": 0,
    "straw_calc_version": 1,
    "allowed_bucket_algs": 22,
    "profile": "firefly",
    "optimal_tunables": 0,
    "legacy_tunables": 0,
    "minimum_required_version": "hammer",
    "require_feature_tunables": 1,
    "require_feature_tunables2": 1,
    "has_v2_rules": 0,
    "require_feature_tunables3": 1,
    "has_v3_rules": 0,
    "has_v4_buckets": 1,
    "require_feature_tunables5": 0,
    "has_v5_rules": 0
}

ceph version (clients, OSDs, and monitors):

ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)

I run fio from 22 virtual machines with this command line:

fio --name=test --ioengine=libaio --iodepth=1 --direct=1 --filename=/root/fiofile --filesize=5G --runtime=120 --blocksize=4k --rw=randwrite

When it works normally, it shows 4K IOPS from every virtual machine, which is the QEMU limit we've set.
About half the time it gets stuck. "sd 2:0:0:0: [sda] tag#1 abort" messages can be found in the VM dmesg. iostat -xtd 1 shows:
12/12/2018 12:47:23 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     6.06    0.00    0.00    0.00   0.00 101.01

When this happens, I mark all OSDs on one host down. For some reason, OSDs get stuck only on another host.
After I mark all OSDs on the first host down, I wait some time and check ceph health detail:

~# ceph health detail 
HEALTH_WARN noout,nodeep-scrub flag(s) set; Reduced data availability: 35 pgs inactive, 35 pgs peering; 40482 slow requests are blocked > 32 sec. Implicated osds 32,34,35,38,39,40,41,42,43,44,46,48,52,74,76,77,78,79,80,81,89,92
OSDMAP_FLAGS noout,nodeep-scrub flag(s) set
PG_AVAILABILITY Reduced data availability: 35 pgs inactive, 35 pgs peering
    pg 50.15 is stuck peering for 252.432610, current state peering, last acting [89,33,52]
    pg 50.3c is stuck peering for 256.026446, current state peering, last acting [81,32,52]
    pg 50.4d is stuck peering for 273.949284, current state peering, last acting [74,52,30]
    pg 50.92 is stuck peering for 243.807450, current state peering, last acting [78,36,52]
    pg 50.bb is stuck peering for 254.226990, current state peering, last acting [32,96,52]
    pg 50.c1 is stuck peering for 224.870375, current state peering, last acting [41,52,93]
    pg 50.d3 is stuck peering for 250.713368, current state peering, last acting [79,34,52]
    pg 50.f0 is stuck peering for 226.680818, current state peering, last acting [41,52,80]
    pg 50.106 is stuck peering for 253.985154, current state peering, last acting [92,52,33]
    pg 50.130 is stuck peering for 203.517863, current state peering, last acting [48,52,75]
    pg 50.137 is stuck peering for 206.796044, current state peering, last acting [46,52,92]
    pg 50.149 is stuck peering for 226.680859, current state peering, last acting [41,52,74]
    pg 50.161 is stuck peering for 224.242458, current state peering, last acting [42,52,90]
    pg 50.169 is stuck peering for 235.408025, current state peering, last acting [38,79,52]
    pg 50.183 is stuck peering for 234.849956, current state peering, last acting [38,52,76]
    pg 50.1ee is stuck peering for 261.064377, current state peering, last acting [80,31,52]
    pg 50.20a is stuck peering for 250.713153, current state peering, last acting [78,34,52]
    pg 50.225 is stuck peering for 248.652645, current state peering, last acting [34,91,52]
    pg 50.246 is stuck peering for 217.827022, current state peering, last acting [44,95,52]
    pg 50.28b is stuck peering for 219.954617, current state peering, last acting [43,91,52]
    pg 50.294 is stuck peering for 232.881190, current state peering, last acting [39,52,78]
    pg 50.2ae is stuck peering for 230.378480, current state peering, last acting [40,75,52]
    pg 50.2cd is stuck peering for 243.839058, current state peering, last acting [35,52,74]
    pg 50.2d1 is stuck peering for 248.652355, current state peering, last acting [34,75,52]
    pg 50.2f3 is stuck peering for 226.680902, current state peering, last acting [41,83,52]
    pg 50.302 is stuck peering for 243.838994, current state peering, last acting [35,52,94]
    pg 50.309 is stuck peering for 226.681108, current state peering, last acting [41,84,52]
    pg 50.311 is stuck peering for 256.257842, current state peering, last acting [76,52,32]
    pg 50.313 is stuck peering for 203.517920, current state peering, last acting [48,52,80]
    pg 50.366 is stuck peering for 224.242326, current state peering, last acting [42,86,52]
    pg 50.377 is stuck peering for 248.652813, current state peering, last acting [34,52,94]
    pg 50.39d is stuck peering for 261.048597, current state peering, last acting [77,31,52]
    pg 50.3b4 is stuck peering for 254.089555, current state peering, last acting [80,33,52]
    pg 50.3cc is stuck peering for 203.518124, current state peering, last acting [48,95,52]
    pg 50.3e4 is stuck peering for 243.839263, current state peering, last acting [35,83,52]
REQUEST_SLOW 40482 slow requests are blocked > 32 sec. Implicated osds 32,34,35,38,39,40,41,42,43,44,46,48,52,74,76,77,78,79,80,81,89,92
    4265 ops are blocked > 524.288 sec
    32214 ops are blocked > 262.144 sec
    3922 ops are blocked > 131.072 sec
    54 ops are blocked > 65.536 sec
    27 ops are blocked > 32.768 sec
    osds 32,34,35,38,39,40,41,42,43,44,46,48,52 have blocked requests > 262.144 sec
    osds 74,76,77,78,79,80,81,89,92 have blocked requests > 524.288 sec

It's easy to see that the PGs which are blocked in peering have osd.52 in their acting set.

If I mark this OSD (osd.52 in this example) down or restart the service, the problem disappears and the pool operates normally.
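
To be explicit, either of these clears it (the systemd unit name is my assumption for a default package install):

~# ceph osd down osd.52
# or
~# systemctl restart ceph-osd@52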

If I wait, the OSD can eventually start working normally again, but sometimes it dies with:
1. A lot of "no reply" messages in the logs:

...
  -153> 2018-12-12 14:43:39.233249 7f382fc37700 -1 osd.52 204128 heartbeat_check: no reply from 10.10.20.161:6844 osd.31 since back 2018-12-12 14:41:30.170055 front 2018-12-12 14:41:30.170055 (cutoff 2018-12-12 14:43:19.233213)
  -152> 2018-12-12 14:43:39.233258 7f382fc37700 -1 osd.52 204128 heartbeat_check: no reply from 10.10.20.161:6803 osd.33 since back 2018-12-12 14:41:41.685147 front 2018-12-12 14:41:41.685147 (cutoff 2018-12-12 14:43:19.233213)
  -151> 2018-12-12 14:43:39.233284 7f382fc37700 -1 osd.52 204128 heartbeat_check: no reply from 10.10.20.161:6805 osd.34 since back 2018-12-12 14:41:41.685147 front 2018-12-12 14:41:41.685147 (cutoff 2018-12-12 14:43:19.233213)
  -150> 2018-12-12 14:43:39.233291 7f382fc37700 -1 osd.52 204128 heartbeat_check: no reply from 10.10.20.161:6807 osd.35 since back 2018-12-12 14:41:46.388834 front 2018-12-12 14:41:46.388834 (cutoff 2018-12-12 14:43:19.233213)
  -149> 2018-12-12 14:43:39.233301 7f382fc37700 -1 osd.52 204128 heartbeat_check: no reply from 10.10.20.161:6809 osd.36 since back 2018-12-12 14:41:50.396253 front 2018-12-12 14:41:50.396253 (cutoff 2018-12-12 14:43:19.233213)
  -148> 2018-12-12 14:43:39.233309 7f382fc37700 -1 osd.52 204128 heartbeat_check: no reply from 10.10.20.161:6815 osd.37 since back 2018-12-12 14:41:50.396253 front 2018-12-12 14:41:50.396253 (cutoff 2018-12-12 14:43:19.233213)
  -147> 2018-12-12 14:43:39.233318 7f382fc37700 -1 osd.52 204128 heartbeat_check: no reply from 10.10.20.161:6811 osd.38 since back 2018-12-12 14:41:53.900087 front 2018-12-12 14:41:53.900087 (cutoff 2018-12-12 14:43:19.233213)
...

2. And a call trace after that:
     0> 2018-12-11 19:32:09.119042 7f5b31b04700 -1 *** Caught signal (Aborted) **
 in thread 7f5b31b04700 thread_name:tp_osd_tp

 ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)
 1: (()+0xa87324) [0x55b4cca24324]
 2: (()+0x11390) [0x7f5b544a4390]
 3: (pthread_cond_wait()+0xc0) [0x7f5b544a0360]
 4: (Throttle::_wait(long)+0x34d) [0x55b4cca5c97d]
 5: (Throttle::get(long, long)+0x2e5) [0x55b4cca5d7a5]
 6: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0xb55) [0x55b4cc91e6c5]
 7: (PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x66) [0x55b4cc63fb26]
 8: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, boost::optional<pg_hit_set_history_t>&, Context*, Context*, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0xcb8) [0x55b4cc776278]
 9: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0x9fa) [0x55b4cc5dbc2a]
 10: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x139a) [0x55b4cc6244ca]
 11: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x2ff4) [0x55b4cc627e24]
 12: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xea9) [0x55b4cc5e1f69]
 13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3a9) [0x55b4cc4555f9]
 14: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x55b4cc703327]
 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x1047) [0x55b4cc483a17]
 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x884) [0x55b4cca6c204]
 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55b4cca6f240]
 18: (()+0x76ba) [0x7f5b5449a6ba]
 19: (clone()+0x6d) [0x7f5b5351141d]

There are no error messages in dmesg on the host, and there is no heavy load. It looks like the host is doing nothing at that moment.

I've attached the gdb backtrace.
The operations dump is too big (a 1.6M tar), so it is available here: https://166924.selcdn.ru/links/osd.52.operations.dump.tar.gz

In our production cluster, where we see the problem often, the client side uses:
ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)
And the OSDs run:
ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)

~# ceph osd crush show-tunables 
{
    "choose_local_tries": 0,
    "choose_local_fallback_tries": 0,
    "choose_total_tries": 50,
    "chooseleaf_descend_once": 1,
    "chooseleaf_vary_r": 1,
    "chooseleaf_stable": 0,
    "straw_calc_version": 2,
    "allowed_bucket_algs": 54,
    "profile": "hammer",
    "optimal_tunables": 0,
    "legacy_tunables": 0,
    "minimum_required_version": "hammer",
    "require_feature_tunables": 1,
    "require_feature_tunables2": 1,
    "has_v2_rules": 0,
    "require_feature_tunables3": 1,
    "has_v3_rules": 0,
    "has_v4_buckets": 1,
    "require_feature_tunables5": 0,
    "has_v5_rules": 0
}

We can't see any blocked ops when the problem appears in production. Also, latency metrics are OK, but some virtual machines get stuck on IO operations in the same way as described above. We've never seen that call trace in production; it probably depends on the load.

If any additional info is needed, please ask.

#4 Updated by Josh Durgin over 5 years ago

The backtrace that's attached shows the kv_sync_thread waiting for I/O to complete from the block device:

Thread 17 (Thread 0x7f3826c25700 (LWP 20542)):
#0  0x00007f383759d360 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f3836ea391c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2  0x0000560710a8903d in IOContext::aio_wait() ()
#3  0x0000560710a8ec4d in BlueFS::wait_for_aio(BlueFS::FileWriter*) ()
#4  0x0000560710a91bc3 in BlueFS::_flush_bdev_safely(BlueFS::FileWriter*) ()
#5  0x0000560710aa5b9d in BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&) ()
#6  0x0000560710ab6e53 in BlueRocksWritableFile::Sync() ()
#7  0x0000560710efb0bc in rocksdb::WritableFileWriter::SyncInternal(bool) ()
#8  0x0000560710efdc92 in rocksdb::WritableFileWriter::Sync(bool) ()
#9  0x0000560710f54a28 in rocksdb::DBImpl::WriteToWAL(rocksdb::autovector<rocksdb::WriteThread::Writer*, 8ul> const&, rocksdb::log::Writer*, bool, bool, unsigned long) ()
#10 0x0000560710f55f57 in rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool) ()
#11 0x0000560710f56ada in rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*) ()
#12 0x0000560710a1d2cf in RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>) ()
#13 0x00005607109ba0ff in BlueStore::_kv_sync_thread() ()
#14 0x00005607109f9b2d in BlueStore::KVSyncThread::entry() ()
#15 0x00007f38375976ba in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#16 0x00007f383660e41d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Other threads are waiting on throttles, presumably because the queues are full. This looks like the OSD disk is the bottleneck. What do iostat and CPU usage look like on the OSD when this occurs? Is this pure HDD, or is there an SSD WAL or DB?
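
One quick way to check the device layout, in addition to watching the disks while it's happening (the OSD id is a placeholder; I'd expect the rotational/bluestore fields in the metadata output to show whether the data, WAL, and DB sit on flash):

~# ceph osd metadata NNN | grep -i -E 'rotational|bdev|devices'
~# iostat -xtd 1
~# pidstat 1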

#5 Updated by Aleksei Zakharov over 5 years ago

This is the iostat -xtd 1 output most of the time while the problem occurs:

12/13/2018 11:43:20 AM
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdg               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdh               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdi               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdj               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdk               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdl               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdn               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdo               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdp               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdm               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdq               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdr               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sds               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdt               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdu               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdv               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdw               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdx               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

It is similar on all 3 hosts.
CPU usage is about 0-1%:
~# pidstat 1
Linux 4.15.0-32-generic (ms2002)        12/13/2018      _x86_64_        (56 CPU)

11:56:21 AM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
11:56:22 AM     0      5247    0.98    0.00    0.00    0.98    13  ceph-osd
11:56:22 AM     0      5592    0.00    0.98    0.00    0.98     0  ceph-osd
11:56:22 AM     0      5847    0.98    0.00    0.00    0.98    29  ceph-osd
11:56:22 AM     0      6408    0.00    0.98    0.00    0.98     0  ceph-osd
11:56:22 AM     0      6854    0.98    0.00    0.00    0.98    14  ceph-osd
11:56:22 AM     0      7228    0.00    0.98    0.00    0.98    15  ceph-osd
11:56:22 AM     0      7534    0.98    0.00    0.00    0.98    34  ceph-osd
11:56:22 AM     0      7709    0.00    0.98    0.00    0.98    53  ceph-osd
11:56:22 AM     0      7854    0.00    0.98    0.00    0.98    14  ceph-osd
11:56:22 AM     0     23414    0.00    0.98    0.00    0.98     6  ceph-osd
11:56:22 AM     0     31130    0.00    0.98    0.00    0.98    21  pidstat

11:56:22 AM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
11:56:23 AM     0      5592    1.00    0.00    0.00    1.00     0  ceph-osd
11:56:23 AM     0      6012    1.00    0.00    0.00    1.00    29  ceph-osd
11:56:23 AM     0      6854    0.00    1.00    0.00    1.00    14  ceph-osd
11:56:23 AM     0      7113    0.00    1.00    0.00    1.00    27  ceph-osd
11:56:23 AM     0      7418    0.00    1.00    0.00    1.00     0  ceph-osd
11:56:23 AM     0     16101    1.00    0.00    0.00    1.00    14  ceph-osd
11:56:23 AM     0     17231    1.00    0.00    0.00    1.00     0  ceph-osd
11:56:23 AM     0     19993    1.00    0.00    0.00    1.00    27  ceph-osd
11:56:23 AM     0     31130    1.00    1.00    0.00    2.00    21  pidstat

11:56:23 AM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
11:56:24 AM     0      5129    1.00    0.00    0.00    1.00    35  ceph-osd
11:56:24 AM     0      5847    0.00    1.00    0.00    1.00    29  ceph-osd
11:56:24 AM     0      6408    0.00    1.00    0.00    1.00     0  ceph-osd
11:56:24 AM     0      6993    1.00    0.00    0.00    1.00    18  ceph-osd
11:56:24 AM     0      7113    1.00    0.00    0.00    1.00    27  ceph-osd
11:56:24 AM     0     31130    1.00    1.00    0.00    2.00    21  pidstat

We use SSD disks; the OSDs are configured like this:

[osd.52]
  host = hostname
  devs = /dev/disk/by-id/scsi-330000d17022f8d03-part1
  bluestore block path = /dev/disk/by-id/scsi-330000d17022f8d03-part2

So, you mean that this could be a kernel bug in the IO layer?

#6 Updated by Aleksei Zakharov over 5 years ago

Thanks for your answers, that was helpful info.
It looks like an issue with the aacraid module v1.2.1.50877: IO requests got stuck under load, but iostat was not able to show it. After upgrading to 1.2.1.56008 I can't reproduce this behavior anymore.
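
For anyone hitting the same thing, the loaded driver version can be checked with something like:

~# modinfo aacraid | grep -i '^version'
~# dmesg | grep -i aacraid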

#7 Updated by Igor Fedotov over 5 years ago

  • Status changed from Need More Info to Closed
